ZFS, hard drives, partitions, and fault tolerance. Also, fault domains.

The good, the bad, the ugly, the fault domains, and also multi-actuator hard drives.

Fault domains

A fault domain is a set of things that fail together. A fault domain can only be defined with respect to some specific failure or failure mode.

All partitions on the same hard drive are in the same fault domain with respect to a single drive failure. Should the hard drive fail, all its partitions become inaccessible simultaneously (and maybe forever).
All the hard drives attached to the same controller or power supply are in the same fault domain with respect to a controller or power supply failure. Should the controller or the power supply fail, all of these drives become inaccessible until you replace the controller or the power comes back.

Suppose you want something X to be a backup of something Y against some failure Z. For the backup to work, X and Y must be in different fault domains with respect to the failure Z. For example, for a backup to protect you against a hard drive failure, the backup copy must be on a different hard drive than the original. If you want a backup against a fire, the original and the backup must be in different buildings because the fault domain grows to encompass the entire building.

The good

Generally, allocating the entire hard drive to the ZFS pool is considered good practice. There are several widely accepted methods to do that:

  • Allocate the entire drive to a pool without any partitioning.
  • Make a single partition occupying the entire drive, and allocate this partition to a pool. This method allows the use of software encryption like LUKS or GELI.
  • Make a small partition at the front of the drive, holding a copy of a boot partition, followed by a large partition occupying most of the drive. Allocate the large partition to a pool. For example, in a four-disk system, there will be a four-way mirror of a boot partition at the start of each drive, followed by a four-member RAIDZ1 ZFS pool occupying most of the drive.
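As a sketch, the first and third layouts might look like this. Device names, partition sizes, and the pool name are invented for illustration; the encryption variant from the second method would additionally need LUKS or GELI setup, omitted here.

```shell
# Layout 1: hand whole drives to the pool, no partitioning.
# /dev/sda ... /dev/sdd are placeholder device names.
zpool create tank raidz1 /dev/sda /dev/sdb /dev/sdc /dev/sdd

# Layout 3: a small boot partition at the front of each drive,
# followed by a large partition for ZFS (sgdisk, GPT; sizes are
# arbitrary examples).
for d in sda sdb sdc sdd; do
    sgdisk -n 1:0:+1G -t 1:ef00 "/dev/$d"   # copy of the boot partition
    sgdisk -n 2:0:0   -t 2:bf00 "/dev/$d"   # rest of the drive for ZFS
done
zpool create tank raidz1 /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2
```

Either way, the pool members map one-to-one onto physical drives, which is what keeps each drive its own fault domain.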

In all of these cases, each hard drive is its own fault domain. There is an argument to be made that if all the drives (or large groups of drives) are attached to the same controller, then a controller failure will bring down all of them simultaneously. However, replacing the controller usually cures a controller failure, whereas drive failures are often forever.

The bad

It is impossible to improve fault tolerance by adding more replicas in the same fault domain. Let's say, for example, that you have three hard drives, each with one unit of capacity. You have several valid options for making a fault-tolerant set:

  • RAIDZ1, with two units of available capacity and fault tolerance of a single hard drive.
  • RAIDZ2 [1], with one unit of available capacity and fault tolerance of two hard drives.
  • Three-way mirror, same as RAIDZ2 [2]: one unit of available capacity and two-drive fault tolerance.
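In zpool terms, with hypothetical device names, the three options would be created like this:

```shell
# Three drives of one unit each; /dev/sda ... /dev/sdc are placeholders.

# RAIDZ1: two units usable, survives one drive failure.
zpool create tank raidz1 /dev/sda /dev/sdb /dev/sdc

# RAIDZ2: one unit usable, survives two drive failures.
zpool create tank raidz2 /dev/sda /dev/sdb /dev/sdc

# Three-way mirror: one unit usable, survives two drive failures.
zpool create tank mirror /dev/sda /dev/sdb /dev/sdc
```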

So far, still good.

Now, can we somehow squeeze more out of the same hardware? Let's split each drive into two partitions of equal size, for six partitions in total, and make a RAIDZ2 out of those six partitions. One might expect this to produce a RAIDZ2, tolerant of two failures, at the cost of only one hard drive's capacity.

Except it does not work like that. What it actually produces is a RAIDZ2 capable of tolerating the failure of two partitions, not two drives. Each drive carries two of those partitions, which fail simultaneously, so a single drive failure already exhausts the redundancy. It is impossible to improve fault tolerance by pretending to split the fault domain.
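A quick back-of-the-envelope check of that arithmetic (a shell sketch; sizes are counted in half-units so the math stays integer):

```shell
# Three drives split into two partitions each, RAIDZ2 over the six
# partitions. Each partition is one half-unit of capacity.
drives=3
parts_per_drive=2
members=$(( drives * parts_per_drive ))    # 6 RAIDZ2 members
data_halves=$(( members - 2 ))             # RAIDZ2 keeps n-2 members of data
echo "usable capacity: $(( data_halves / parts_per_drive )) units"
# RAIDZ2 tolerates 2 failed members; one dead drive kills 2 members at once.
echo "whole-drive failures tolerated: $(( 2 / parts_per_drive ))"
```

Two units of usable capacity, but only one drive's worth of fault tolerance: exactly what a plain RAIDZ1 over the three whole drives already gives, with extra complexity on top.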

The ugly

There was one other idea that I read somewhere and unfortunately forgot the source. I don't know if the author ever implemented the idea, but nothing is inherently wrong with it except that it is utterly impractical. The original question was: is it somehow possible to eke more life out of a hard drive that is slowly developing bad blocks? Actually, yes. Split the single drive into several partitions, then make a RAIDZ, or even a RAIDZ2, over these partitions. If a new bad sector appears, ZFS will recover the data using the parity from the other partitions. The sector will then be marked as bad so as not to reuse it [3]. In marked contrast to the previous bad idea, this does work, because the fault domain is limited to a single sector. This ugly method does not protect against a drive failure and was never intended to. It does protect against a single sector going bad, at the cost of a horrific loss of performance.
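A sketch of what the ugly method would look like, assuming a placeholder device /dev/sdX (never do this to a drive holding data you care about):

```shell
# Split one failing drive into five equal partitions and build a
# RAIDZ2 on top of them. parted accepts percentage boundaries.
parted -s /dev/sdX mklabel gpt
for i in 0 1 2 3 4; do
    parted -s /dev/sdX mkpart "zfs$i" "$(( i * 20 ))%" "$(( (i + 1) * 20 ))%"
done
# Survives any two bad regions among the five partitions, at the cost
# of two-fifths of the capacity and terrible seek-bound performance.
zpool create lastlegs raidz2 /dev/sdX1 /dev/sdX2 /dev/sdX3 /dev/sdX4 /dev/sdX5
```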

Late addition: the question later resurfaced, applied to a microSD card.

Performance considerations

Spinning disks have radically different performance characteristics for sequential operations vs. head seeks. The reason is purely mechanical and applies in the same manner to both reads and writes [4]. Sequential operations are significantly faster because of how mechanical heads move. Furthermore, the time required to position the heads is, to some extent, proportional to the distance between the LBAs of the two sectors. In other words, it takes significantly longer to read the first and the last sectors of the disk than to read the first and the second sectors, because the head has to travel through its entire range, and that takes time.

Filesystems, including ZFS, are designed to take this into account. The filesystem tries to place file data and the corresponding metadata close together, keeping head travel time to a minimum. RAIDZ distributes data across the disks, storing all parts of the same data block at the same address on every disk. Thus, the head movements are synchronized across disks, minimizing total head travel time in the same way as for a single disk.

Synchronizing head movement works well as long as the RAIDZ members are different physical disks. As soon as you replace different disks with partitions on the same disk, everything falls apart. A single disk cannot execute three read requests in parallel. Instead, the disk will execute each request in turn, with a significant delay between them as the heads reposition to each partition. At best, you get a 3x slowdown with three partitions.

So, it is not good to split hard drives into partitions and make a RAID (of any kind) of these partitions.

Multi-actuator hard drives

Multi-actuator hard drives did not quite make the cut, and I will discuss them separately.


Note 1. I'm not quite sure one can practically create a RAIDZ2 on three disks. However, any such limitation is purely artificial; there is nothing in the ZFS structure to make a RAIDZ2 on three disks impossible.

Note 2. Available capacity estimates are radically simplified in this example.

Note 3. On second thought, I doubt ZFS does that. I can't recall ever seeing a ZFS metadata structure holding any kind of bad block list.

Note 4. Unless we are talking about SMR (Shingled Magnetic Recording) drives, which have different characteristics for reads and writes in any mode of operation. However, SMR drives and their associated complexity are out of scope this time.

Filed under: ZFS.

Created Saturday, June 19, 2021