ZFS RAIDZ vs. traditional RAID

How does ZFS RAIDZ compare to its corresponding traditional RAID when it comes to data recovery? For a discussion of performance, disk space usage, maintenance, and the like, look elsewhere; this article covers only the data recovery side of things.

Conceptual differences

  • Traditional RAID is separated from the filesystem. In traditional systems, one can mix and match RAID levels and filesystems.
  • Traditional RAID can be implemented in hardware. However, there is no hardware controller implementing RAIDZ.
  • RAIDZ is integrated with ZFS and cannot be used with any other filesystem.
  • ZFS uses an additional level of checksums to detect silent data corruption, where a data block is damaged but the hard drive does not flag it as bad. ZFS checksums are not limited to RAIDZ; ZFS uses them at any level of redundancy, including single-drive pools.

Equivalent RAID levels

As far as disk space goes, RAIDZn uses n drives for redundancy. Therefore

  • RAIDZ (sometimes explicitly specified as RAIDZ1) is approximately the same as RAID5 (single parity),
  • RAIDZ2 is approximately the same as RAID6 (dual parity),
  • RAIDZ3 is approximately the same as (mostly hypothetical) RAID7 (triple parity).

Disk space overhead is not exactly the same. RAIDZ is much more complicated than traditional RAID, and its disk space usage calculation is also complicated. Various factors affect RAIDZ overhead, including average file size.
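In terms of raw capacity, the equivalence can be sketched with a naive estimate (illustrative Python; it deliberately ignores the RAIDZ-specific padding and allocation overhead discussed below):

```python
# Naive usable-capacity estimate for RAIDZn and its traditional counterpart:
# n drives' worth of space goes to redundancy. Real RAIDZ loses somewhat more
# to padding and allocation overhead, so treat this as an upper bound.

def usable_capacity(num_disks, disk_size, parity):
    assert num_disks > parity, "need more disks than parity drives"
    return (num_disks - parity) * disk_size

# Six 4 TB drives:
print(usable_capacity(6, 4, 1))   # RAIDZ1 / RAID5 -> 20 TB
print(usable_capacity(6, 4, 2))   # RAIDZ2 / RAID6 -> 16 TB
```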

On-disk layout

RAIDZ layout is not like any other RAID.

  RAID5                               RAIDZ
  Disk 1  Disk 2  Disk 3  Disk 4     Disk 1  Disk 2  Disk 3  Disk 4
    1       2       3       P          P       1       P       2
    5       6       P       4          P       3       4       X
    9       P       7       8          P       5       6       7
    P      10      11      12          8       P       9      10

In RAID5, blocks are placed in a regular pattern. You only need to know the block number (address) to figure out where and on which disk the block is stored, and where its corresponding parity block is. Also, with N disks, exactly one parity block is stored for every N-1 data blocks.
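This regular mapping fits in a few lines (an illustrative sketch, assuming the left-symmetric RAID5 layout shown in the table above):

```python
# Illustrative sketch: map an array block number to its location on disk for a
# left-symmetric RAID5 (the layout in the table above). No filesystem
# knowledge is needed -- only the block number and the number of disks.

def raid5_locate(block, num_disks):
    """Return (data_disk, stripe_row, parity_disk), all zero-based."""
    data_per_stripe = num_disks - 1
    row = block // data_per_stripe
    parity_disk = (num_disks - 1 - row) % num_disks   # parity rotates backwards
    # data fills the stripe starting just after the parity disk, wrapping around
    data_disk = (parity_disk + 1 + block % data_per_stripe) % num_disks
    return data_disk, row, parity_disk

# Block 0 ("1" in the table) lands on disk 1; its parity is on disk 4:
print(raid5_locate(0, 4))   # -> (0, 0, 3)
```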

In RAIDZ, each recordsize block of data is compressed first. Then, the compressed data is distributed across the disks, along with parity. So, one needs to consult the filesystem metadata for each file to find out where the file records are, and where the corresponding parities are. If the data compresses down to a single sector, one sector of data will be stored along with one sector of parity. Therefore, there is no fixed proportion of parity to data. Moreover, sometimes padding is inserted to better align blocks on disks (denoted by X in the above example), which may increase overhead.
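As a rough illustration, RAIDZ space accounting can be sketched like this (a simplification for illustration only, not the actual ZFS allocator, which has more special cases):

```python
# Simplified sketch of RAIDZ space accounting. Each record gets `parity`
# parity sectors per stripe row, and the total allocation is padded to a
# multiple of (parity + 1) sectors -- that padding is the X in the table
# above. This only shows the idea; it is not actual ZFS code.

import math

def raidz_alloc_sectors(data_sectors, num_disks, parity=1):
    rows = math.ceil(data_sectors / (num_disks - parity))
    total = data_sectors + rows * parity
    pad = -total % (parity + 1)       # round up to a multiple of parity + 1
    return total + pad

# One sector of data still costs a full sector of parity:
print(raidz_alloc_sectors(1, 4))   # -> 2 (no fixed parity-to-data ratio)
```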

Recovery from drive failures

Simplifying, there are two types of drive failures.

  • Fail-stop, when the drive either fails in its entirety, or certain sectors report errors when read.
  • Silent data corruption, when the drive returns incorrect data without any warning and without any method to discern that the data is in fact incorrect.

The distinction is important, because parity RAID can reconstruct one bad data block for each available parity block, but only if you know which block is bad.

With fail-stop failures, ZFS RAIDZn behaves identically to its corresponding traditional RAID. When a drive fails completely, you know that all blocks stored on that drive are bad and need to be reconstructed.

With silently corrupt data, RAIDZn can reconstruct damaged data, thanks to the extra checksum provided by ZFS. Having no extra help, traditional RAID does not recover from silent data corruption, because it does not know which block to reconstruct.
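The difference is easy to show with single-parity XOR reconstruction (an illustrative sketch, not ZFS code): one parity block lets you rebuild exactly one missing block, provided you know which one it is.

```python
# Illustrative single-parity (XOR) reconstruction. With one parity block we
# can rebuild exactly one data block -- but only if we know WHICH block is
# bad. Fail-stop failures tell us that; silent corruption does not, which is
# why plain parity RAID cannot recover from it without ZFS-style checksums.

from functools import reduce

def xor_blocks(*blocks):
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

d1, d2, d3 = b"\x01\x02", b"\x10\x20", b"\x0f\x0f"
parity = xor_blocks(d1, d2, d3)

# Disk 2 fails outright (fail-stop): XOR the survivors with the parity.
rebuilt = xor_blocks(d1, d3, parity)
assert rebuilt == d2   # the lost block is recovered
```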

Recovery from loss of metadata

Recovery of RAIDZn is very different from traditional RAID if the RAID metadata, such as block size and disk order, is lost.

Traditional RAID is regular. Once you know block size, RAID level, and disk order, you can convert any array data block address to its corresponding disk and address on disk, and also determine where the corresponding parity block is. Because the block and parity patterns are regular, filesystem-agnostic statistical analysis tools (like Klennet RAID Viewer) are very effective with traditional RAID. In most cases, one can figure out the layout without any knowledge of the filesystem in use.

RAIDZn is not regular. It does not have a fixed block size, and there is no set pattern of data and parity blocks. The physical layout depends on what data is written to the disks and in what order. So, there is no way to determine the layout without including the filesystem in the analysis. This, while doable, greatly increases computational requirements. It also prevents analysis by filesystem-agnostic RAID analysis tools.

Write hole

Write hole is a failure mode of the traditional parity RAID. In traditional RAID5 or RAID6, parity blocks must always match their corresponding data blocks. However, parity and data blocks are written to different disks. If power fails mid-write, it is possible that some disks complete their writes and some others don't. Therefore, some disks will contain old (pre-update) data and some will contain new (post-update) data. In this case, parity no longer matches the data. Even worse, it is not possible to tell which one is correct (without utilizing some kind of external checksum). This is called write hole and it cannot be completely fixed in traditional RAID without introducing significant additional complexity and possibly some speed penalty. Hardware RAID controllers mitigate the write hole problem by using battery backup; software RAID relies on UPS. Neither of these workarounds is 100% effective. While battery power protects against power outage, OS or firmware crash is no less damaging.
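A toy simulation makes the failure mode concrete (assumed two-data-disk stripe for illustration, not any real implementation):

```python
# Toy write-hole demonstration: power fails after the data write but before
# the parity write, leaving the stripe internally inconsistent. Nothing in
# the stripe itself says whether the data or the parity is the stale copy.

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

old_d1, d2 = b"\xaa", b"\x0f"
old_parity = xor(old_d1, d2)       # consistent before the update

new_d1 = b"\x55"                   # update in flight for disk 1
# Crash: disk 1 completed its write, the parity disk did not.
on_disk_d1, on_disk_parity = new_d1, old_parity

assert xor(on_disk_d1, d2) != on_disk_parity   # parity no longer matches
```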

ZFS works around the write hole by embracing the complexity. It is not that RAIDZn by itself has no write hole problem, because it does. However, once transactions, copy-on-write, and checksums are added on top of RAIDZ, the write hole goes away.

The overall tradeoff is the risk of the write hole silently damaging a limited area of the array (which may or may not matter) versus the risk of losing the entire pool to a catastrophic failure if something goes wrong with ZFS. ZFS fans will say that you never lose a ZFS pool to a simple power failure, but empirical evidence to the contrary is abundant.

Rebuild speed

After a drive goes bad and is replaced, the data from the bad drive needs to be regenerated onto the new drive. This process is typically called rebuild, but ZFS calls it resilvering. There are two significant metrics:

  1. Rebuild speed, measured in megabytes per second.
  2. Rebuild time, amount of time required to rebuild all the missing data.

and two significant considerations:

  1. Rebuild speed for traditional RAID is much faster. However, traditional RAID has to rebuild both used and free blocks.
  2. Rebuild speed for ZFS RAIDZ is slower. However, RAIDZ only needs to rebuild blocks that actually hold data. RAIDZ does not rebuild empty blocks, and thus completes rebuilds faster when the pool has significant free space.

In a traditional RAID, where all blocks are regular, you take block 0 from each of the old drives, compute the correct data for block 0 on the missing drive, and write the data onto a new drive. This process is then repeated for all blocks, even for the blocks that hold no data. This is because the traditional RAID controller does not know which blocks on the RAID are in use and which are not. If the array is otherwise idle, serving no user requests during rebuild, the process is done sequentially from start to end, which is the fastest way to access rotational hard drives.

ZFS uses variable-sized blocks. Therefore, for each recordsize worth of data, which can be anywhere from 4 KB to 1 MB, ZFS needs to consult the block pointer tree to see how the data is laid out on the disks. Because block pointer trees are often fragmented, and files are often fragmented, there is quite a lot of head movement involved. Rotational hard drives perform much slower with a lot of head movement, so the megabytes-per-second speed of the rebuild is lower than that of a traditional RAID. However, ZFS only rebuilds the part of the array that is in use, and does not rebuild free space. Therefore, on lightly used pools it may actually complete faster than a traditional RAID. This advantage disappears as the pool fills up.
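A back-of-the-envelope comparison shows how the two effects trade off (the speeds and usage below are assumptions for illustration, not benchmarks):

```python
# Illustrative rebuild-time estimate. Traditional RAID rebuilds the whole
# drive sequentially at a high rate; RAIDZ rebuilds only used space, but
# seek-bound, at a lower rate. All numbers are assumed, not measured.

def rebuild_hours(data_gb, mb_per_s):
    return data_gb * 1024 / mb_per_s / 3600

drive_gb = 4000
used_fraction = 0.25   # lightly used pool

traditional = rebuild_hours(drive_gb, 150)            # full drive, sequential
raidz = rebuild_hours(drive_gb * used_fraction, 60)   # used blocks only

print(f"traditional: {traditional:.1f} h, RAIDZ: {raidz:.1f} h")
# On this lightly used pool RAIDZ finishes first; at high usage it would not.
```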


Conclusion

ZFS and RAIDZ are better than traditional RAID in almost all respects, except when it comes to a catastrophic failure in which your ZFS pool refuses to mount. If this happens, recovery of a ZFS pool is more complicated and requires more time than recovery of a traditional RAID. This is because ZFS and RAIDZ are much more complex.

This reflects the Catch-22 of complexity:

  • A more complex system can be made more robust against a larger set of anticipated failures, compared to a simpler system.
  • As complexity increases, unanticipated failures become more difficult to recover from, again compared to a simpler system.

Bonus - entropy histograms

As a bonus, let's take a look at entropy histograms, used to determine RAID block size in traditional RAID.

Entropy histograms for RAID5 (top) and RAIDZ (bottom);
X-axis shows disk addresses (LBAs), Y-axis shows differential entropy

You see, the RAID5 histogram (the top one) is beautifully simple, showing RAID block boundaries with a good signal-to-noise ratio. Each peak on the top histogram corresponds to a change from one block to another, with the distance between peaks indicating the block size. The RAIDZ histogram (the bottom one) is much more difficult to interpret. The peaks labelled with red dots may look like candidates for block boundaries, but they are not, because they are not equidistant. The bottom line, again, is that RAIDZ analysis is much more complicated than the analysis for traditional RAID, and traditional tools are either useless or require much more skill to interpret properly.
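For reference, a differential entropy curve like the one above can be computed in a few lines (a sketch of the general technique, not the actual algorithm of any particular tool):

```python
# Sketch of a differential entropy scan: compute byte-level Shannon entropy
# per sector, then the change between neighbouring sectors. On a regular
# RAID, peaks recur at a fixed spacing equal to the RAID block size; on
# RAIDZ, the peaks are irregular.

import math
from collections import Counter

def shannon_entropy(data):
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def differential_entropy(image, sector=512):
    sectors = [image[i:i + sector] for i in range(0, len(image), sector)]
    ent = [shannon_entropy(s) for s in sectors]
    return [abs(b - a) for a, b in zip(ent, ent[1:])]

# A boundary between high-entropy data and zeros produces a sharp peak:
image = bytes(range(256)) * 4 + b"\x00" * 1024
print(differential_entropy(image))   # -> [0.0, 8.0, 0.0]
```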

Created Thursday, July 4, 2019

Updated 01 September 2019
