ZFS RAIDZ vs. traditional RAID

How does ZFS RAIDZ compare to its corresponding traditional RAID when it comes to data recovery? For a discussion of performance, disk space usage, and maintenance, you should look elsewhere. I only cover the data recovery side of things.

Conceptual differences

  • Traditional RAID is separated from the filesystem. In traditional systems, one can mix and match RAID levels and filesystems.
  • RAIDZ is integrated with ZFS and cannot be used with any other filesystem.
  • ZFS uses an additional level of checksums to detect silent data corruption, where a data block is damaged but the hard drive does not flag it as bad. These checksums are not limited to RAIDZ. ZFS uses checksums with any level of redundancy, including single-drive pools.

Equivalent RAID levels

As far as disk space goes, RAIDZn uses n drives for redundancy. Therefore

  • RAIDZ (sometimes explicitly specified as RAIDZ1) is approximately the same as RAID5 (single parity),
  • RAIDZ2 is approximately the same as RAID6 (dual parity),
  • RAIDZ3 is approximately the same as (mostly hypothetical) RAID7 (triple parity).

Disk space overhead is not exactly the same. RAIDZ is much more complicated than traditional RAID, and its disk space usage calculation is also complicated. Various factors affect RAIDZ overhead, including average file size.
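To illustrate why block size matters, here is a minimal sketch of per-block space accounting on RAIDZ. It assumes the allocation rule as I understand it from OpenZFS: one set of parity sectors per row of data, with the total rounded up to a multiple of (parity + 1) sectors. The function name and parameters are made up for this example, and compression and metadata are ignored.

```python
def raidz_allocated_sectors(data_sectors: int, disks: int, parity: int) -> int:
    """Sectors consumed on a RAIDZ vdev by one block of `data_sectors` sectors.

    Assumed rule (a simplification, see the text above): parity is added per
    row of data, then the total is padded to a multiple of (parity + 1).
    """
    data_disks = disks - parity
    rows = -(-data_sectors // data_disks)      # ceiling division
    total = data_sectors + parity * rows       # data plus per-row parity
    unit = parity + 1
    return -(-total // unit) * unit            # pad up to a multiple of unit


# Six-disk RAIDZ1: a traditional six-disk RAID5 would always cost 1.20x.
for size in (1, 2, 8, 32, 256):
    used = raidz_allocated_sectors(size, disks=6, parity=1)
    print(f"{size:4d} data sectors -> {used:4d} allocated ({used / size:.2f}x)")
```

Under these assumptions, small blocks cost proportionally far more than large ones, which is why a pool full of small files sees higher overhead than the nominal one-drive-of-parity figure.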

Recovery from drive failures

Simplifying, there are two types of drive failures.

  • Fail-stop, when the drive either fails in its entirety, or certain sectors are reporting errors when read.
  • Silent data corruption, when the drive returns incorrect data without any warning and without any method to discern that the data is in fact incorrect.

The distinction is important, because parity RAID can reconstruct one bad data block for each available parity block, but only if you know which block is bad.
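The point can be shown with plain XOR parity. This is a minimal sketch using tiny four-byte "blocks"; the helper name `xor_blocks` is mine, and real arrays work on much larger blocks.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Fail-stop: block 1 is known to be lost, so XOR of the survivors restores it.
recovered = xor_blocks([data[0], data[2], parity])

# Silent corruption: block 1 comes back wrong, with no error reported.
stripe = [data[0], b"BBXB", data[2], parity]
mismatch = xor_blocks(stripe) != bytes(4)
# The mismatch shows that something in the stripe is wrong,
# but not which of the four blocks is the bad one.
```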

With fail-stop failures, recovery of ZFS RAIDZn works the same way as for its corresponding traditional RAID. When a drive fails completely, you know that all blocks stored on that drive are now bad and you need to reconstruct them.

With silently corrupt data, RAIDZn can reconstruct damaged data, thanks to the extra checksum provided by ZFS. Having no extra help, traditional RAID does not recover from silent data corruption, because it does not know which block to reconstruct.
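A simplified sketch of how an external checksum turns "something is wrong" into "this block is wrong": assume each block in turn is the bad one, rebuild it from parity, and keep the candidate that matches the checksum. This is not ZFS code. SHA-256 stands in for whatever checksum the pool uses, and the function names are hypothetical.

```python
import hashlib
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def repair_with_checksum(data_blocks, parity, expected_digest):
    """Return repaired data blocks, or None if single-parity repair fails."""
    def digest(blocks):
        return hashlib.sha256(b"".join(blocks)).digest()

    if digest(data_blocks) == expected_digest:
        return list(data_blocks)               # data is fine as it is
    for suspect in range(len(data_blocks)):
        # Assume `suspect` is the bad block and rebuild it from the rest.
        others = [b for i, b in enumerate(data_blocks) if i != suspect]
        candidate = list(data_blocks)
        candidate[suspect] = xor_blocks(others + [parity])
        if digest(candidate) == expected_digest:
            return candidate
    return None                                # more damage than parity covers


good = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(good)
checksum = hashlib.sha256(b"".join(good)).digest()

damaged = [good[0], b"BBXB", good[2]]          # silent corruption on one disk
repaired = repair_with_checksum(damaged, parity, checksum)
```

Without `expected_digest`, every candidate in the loop looks equally plausible, which is the position traditional RAID is in.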

Recovery from loss of metadata

Recovery of RAIDZn is very different from traditional RAID if the RAID metadata, such as block size and disk order, is lost.

Traditional RAID is regular. Once you know the block size, RAID level, and disk order, you can convert any array data block address to its corresponding disk and address on disk, and also determine where the corresponding parity block is. Because the block and parity patterns are regular, filesystem-agnostic statistical analysis tools (like Klennet RAID Viewer) are very effective with traditional RAID. In most cases, one can figure out the layout without any knowledge of the filesystem in use.
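As an example of that regularity, here is a sketch of the address conversion for one common RAID5 variant, the left-symmetric layout. Other variants rotate parity differently, so treat the layout choice, the function name, and the parameters as assumptions made for illustration.

```python
def raid5_locate(array_block: int, disks: int, block_size: int):
    """Map an array data block number to (data_disk, parity_disk, byte_offset).

    Assumes a left-symmetric RAID5 layout: parity starts on the last disk and
    moves one disk to the left on every row; data starts right after parity.
    """
    row, index_in_row = divmod(array_block, disks - 1)
    parity_disk = (disks - 1) - (row % disks)
    data_disk = (parity_disk + 1 + index_in_row) % disks
    return data_disk, parity_disk, row * block_size


# Four disks, 64 KB blocks: print where the first few array blocks live.
for block in range(8):
    print(block, raid5_locate(block, disks=4, block_size=64 * 1024))
```

The whole mapping is a few lines of arithmetic with no reference to the stored data, which is what makes statistical layout detection practical.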

RAIDZn is not regular. It does not have a fixed block size, and there is no set pattern of data and parity blocks. The physical layout depends on what data is written to the disks and in what order. So, there is no way to determine the layout without including the filesystem in the analysis. This, while doable, greatly increases computational requirements. It also rules out filesystem-agnostic RAID analysis tools.

Write hole

The write hole is a failure mode of traditional parity RAID. In traditional RAID5 or RAID6, parity blocks must always match their corresponding data blocks. However, parity and data blocks are written to different disks. If power fails mid-write, it is possible that some disks complete their writes and others don't. Therefore, some disks will contain old (pre-update) data and some will contain new (post-update) data. In this case, parity no longer matches the data. Even worse, it is not possible to tell which one is correct (without some kind of external checksum). This is called the write hole, and it cannot be completely fixed in traditional RAID without introducing significant additional complexity and possibly some speed penalty. Hardware RAID controllers mitigate the write hole problem by using battery backup; software RAID relies on a UPS. Neither of these workarounds is 100% effective. While battery power protects against a power outage, an OS or firmware crash is no less damaging.
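The failure sequence can be simulated with XOR parity. A minimal sketch with four-byte blocks and a hypothetical `xor_blocks` helper; the "power loss" is simply the parity update that never runs.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


# A consistent stripe: three data disks plus one parity disk.
disks = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(disks)

# Disk 0 receives its new data, then power fails before parity is rewritten.
disks[0] = b"aaaa"
# parity = xor_blocks(disks)    # this write never reached the parity disk

# Nothing looks wrong until disk 1 fails later and has to be rebuilt.
rebuilt_disk1 = xor_blocks([disks[0], disks[2], parity])
stale_parity_damage = rebuilt_disk1 != b"BBBB"
```

Note that the damage lands on disk 1, which was not even being written when power failed.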

ZFS works around the write hole by embracing the complexity. RAIDZn by itself is not immune to the write hole problem. However, once you add transactions, copy-on-write, and checksums on top of RAIDZ, the write hole goes away.
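The idea can be shown with a toy model. This sketch is nowhere near the real ZFS on-disk machinery; the class, its fields, and the single root pointer are all invented for illustration. It only shows why "write elsewhere, then switch a checksummed pointer" leaves either the old or the new state readable after a crash.

```python
import hashlib


class ToyCowStore:
    """Toy copy-on-write store: one data block reachable from one root pointer."""

    def __init__(self, initial: bytes):
        self.blocks = {0: initial}
        self.root = (0, hashlib.sha256(initial).digest())   # (address, checksum)
        self.next_free = 1

    def write(self, data: bytes, crash_before_commit: bool = False):
        # Step 1: never overwrite live data; the new version goes to free space.
        address = self.next_free
        self.next_free += 1
        self.blocks[address] = data
        if crash_before_commit:
            return                                           # simulated power loss
        # Step 2: commit by switching the root pointer in a single step.
        self.root = (address, hashlib.sha256(data).digest())

    def read(self) -> bytes:
        address, checksum = self.root
        data = self.blocks[address]
        if hashlib.sha256(data).digest() != checksum:
            raise IOError("checksum mismatch")
        return data


store = ToyCowStore(b"old contents")
store.write(b"new contents", crash_before_commit=True)
after_crash = store.read()          # the old version is still intact
store.write(b"newer contents")
after_commit = store.read()
```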

The overall tradeoff is the risk of a write hole silently damaging a limited area of the array (which may be more or less important) versus the risk of losing the entire system to a catastrophic failure if something goes wrong with a ZFS pool. ZFS fans will say that you never lose a ZFS pool to a simple power failure, but empirical evidence to the contrary is abundant.

Summary

ZFS and RAIDZ are better than traditional RAID in almost all respects, except in a catastrophic failure where your ZFS pool refuses to mount. If this happens, recovery of a ZFS pool is more complicated and requires more time than recovery of a traditional RAID. This is because ZFS and RAIDZ are much more complex.

This reflects the Catch-22 of complexity:

  • A more complex system can be made more robust against a larger set of anticipated failures, compared to a simpler system.
  • As complexity increases, unanticipated failures become more difficult to recover from, again compared to a simpler system.

Bonus - entropy histograms

As a bonus, let's take a look at entropy histograms, used to determine RAID block size in traditional RAID.

Entropy histograms for RAID5 (top) and RAIDZ (bottom);
X-axis shows disk addresses (LBAs), Y-axis shows differential entropy

You see, the RAID5 histogram (the top one) is beautifully simple, showing RAID block boundaries with a good signal-to-noise ratio. Each peak on the top histogram corresponds to a change from one block to the next, and the distance between peaks indicates the block size. The RAIDZ histogram (the bottom one) is much more difficult to interpret. Peaks labelled with red dots may look like candidates for block boundaries, but they are not, because the peaks are not equidistant. The bottom line, again, is that RAIDZ analysis is much more complicated than what we do for traditional RAID, and traditional tools are either useless or require much more skill to interpret the results properly.
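For readers who want to experiment, here is one plausible way to get a curve of this kind: compute the byte entropy of each sector and look at how much it changes from one sector to the next. This is my assumption for illustration only, not necessarily what any particular tool computes, and the synthetic "disk" below is far cleaner than real data.

```python
import math
from collections import Counter


def sector_entropy(sector: bytes) -> float:
    """Shannon entropy of the byte values in one sector, in bits per byte."""
    total = len(sector)
    return -sum(c / total * math.log2(c / total) for c in Counter(sector).values())


def entropy_steps(disk: bytes, sector_size: int = 512):
    """Absolute entropy change between each sector and the one before it."""
    sectors = [disk[i:i + sector_size] for i in range(0, len(disk), sector_size)]
    entropies = [sector_entropy(s) for s in sectors]
    return [abs(b - a) for a, b in zip(entropies, entropies[1:])]


# Synthetic disk: four-sector blocks alternating between zeros and varied data.
low = bytes(512)
high = bytes(range(256)) * 2
disk = (low * 4 + high * 4) * 3

steps = entropy_steps(disk)
# Sector numbers where entropy jumps; regular spacing suggests the block size.
peaks = [i + 1 for i, step in enumerate(steps) if step > 4]
gaps = {b - a for a, b in zip(peaks, peaks[1:])}
```

On a regular layout the gaps collapse to a single value, here four sectors. On RAIDZ there is no single value for them to collapse to.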

Created Thursday, July 4, 2019

Updated 01 September 2019
