ZFS, hard drives, partitions, and fault tolerance. Also, fault domains.

The good, the bad, the ugly, the fault domains, and also multi-actuator hard drives.

Fault domains

A fault domain is a set of things that fail together. A fault domain can only be defined with respect to some specific failure, or some specific failure mode.

  • All partitions on the same hard drive are in the same fault domain with respect to a single drive failure. Should the hard drive fail, all the partitions on it become inaccessible simultaneously (and maybe forever).
  • All the hard drives attached to the same controller or the same power supply are in the same fault domain with respect to a controller or power supply failure. Should the controller or the power supply fail, all the drives become inaccessible until you replace the controller or the power comes back.

Suppose you want something X to be a backup of something Y against some failure Z. For the backup to work, X and Y must be in different fault domains with respect to the failure Z. For example, for a backup to protect you against a hard drive failure, the backup copy must be on a different hard drive than the original. If you want a backup against a fire, the original and the backup must be in different buildings, because the fault domain grows to encompass the entire building.

The good

Generally, allocating the entire hard drive to a ZFS pool is considered good practice. There are several widely accepted methods to do that:

  • Allocate the entire drive to a pool without any partitioning.
  • Make a single partition occupying the entire drive, and allocate this partition to a pool. This method allows the use of software encryption like LUKS or GELI.
  • Make a small partition at the front of the drive, holding a copy of the boot partition, followed by a large partition occupying most of the drive. Allocate the large partition to a pool. This way, for example, in a four-disk system, there will be a four-way mirror of the boot partition at the start of the drives, followed by a four-member RAIDZ1 ZFS pool occupying most of each drive.
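
As a rough illustration, the first two methods could look something like this on a FreeBSD-flavored system (a hedged sketch, not a recommendation: the pool name tank, the device names da0..da3, and the GPT label zfs0 are placeholders):

    # Method 1: whole drives, no partitioning
    zpool create tank raidz1 da0 da1 da2 da3

    # Method 2: one GPT partition per drive, then build the pool on the partitions
    # (repeat the gpart steps for da1..da3 with their own labels)
    gpart create -s gpt da0
    gpart add -t freebsd-zfs -l zfs0 da0
    zpool create tank raidz1 gpt/zfs0 gpt/zfs1 gpt/zfs2 gpt/zfs3

On Linux the same idea applies, with parted or sgdisk for partitioning and LUKS instead of GELI for the encrypted variant.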

In all of these cases, each hard drive is a fault domain of its own. There is an argument to be made that if all the drives (or large groups of drives) are attached to the same controller, then a controller failure will bring down all of them simultaneously. However, a controller failure is usually fixed by replacing the controller, whereas drive failures are often forever.

The bad

It is not possible to improve fault tolerance by adding more replicas within the same fault domain. Let's say, for example, that you have three hard drives, each with one unit of capacity. You have several valid options for making a fault-tolerant set:

  • RAIDZ1, with two units of available capacity and fault tolerance of a single hard drive.
  • RAIDZ2 [1], with one unit of available capacity and fault tolerance of two hard drives.
  • Three-way mirror, same as RAIDZ2 [2]: one unit of available capacity and two-drive fault tolerance.
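
For concreteness, the three options could be created roughly like this (a hedged sketch: the pool name and device names are placeholders, and note 1 below questions whether the three-disk RAIDZ2 is accepted in practice):

    zpool create tank raidz1 da0 da1 da2   # two units usable, survives one drive
    zpool create tank raidz2 da0 da1 da2   # one unit usable, survives two drives
    zpool create tank mirror da0 da1 da2   # three-way mirror, same tolerance as the RAIDZ2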

So far, still good.

Now, can we somehow squeeze more out of the same hardware? Let's split each drive into two partitions of equal size, for six partitions in total, and make a RAIDZ2 out of those six partitions. One might expect to get a RAIDZ2, tolerant of two failures, at the cost of only one hard drive's worth of capacity.

Except it does not work like that. What you actually get is a RAIDZ2 capable of tolerating the failure of two partitions, not two drives. And two of those partitions live on the same drive and will fail at the same time. It is not possible to improve fault tolerance by pretending to split the fault domain.
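
For illustration, the tempting but broken layout would be built roughly like this (a hedged sketch; the pool name and partition names are placeholders):

    # Two equal partitions per drive, RAIDZ2 across all six
    zpool create tank raidz2 da0p1 da0p2 da1p1 da1p2 da2p1 da2p2
    # Losing da0 removes da0p1 and da0p2 at once, spending both tolerated
    # failures on a single drive; the next drive failure destroys the pool.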

The ugly

There was one other idea, which I read somewhere and unfortunately forgot the source. I don't know if the author ever implemented it, but nothing is inherently wrong with it, except that it is utterly impractical. The original question was: is it somehow possible to eke more life out of a hard drive that is slowly developing bad blocks? Actually, yes. Split the single drive into several partitions, then make a RAIDZ, or even RAIDZ2, over these partitions. If a new bad sector appears, it will be corrected by the ZFS error correction logic, using the parity data from the other partitions. The sector will then be marked as bad so that it is not reused [3]. In marked contrast to the previous bad idea, this works because the fault domain is limited to a single sector. This ugly method does not protect against a drive failure, and it was never intended to. It does protect against a single sector going bad, at the cost of a horrific loss of performance.
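
For what it is worth, the ugly trick would look roughly like this (a hedged sketch; the pool name and partition names are placeholders, and as note 3 admits, it is doubtful that ZFS actually remembers the bad sector afterwards):

    # One slowly failing drive, split into five partitions, RAIDZ2 across them
    zpool create lastlegs raidz2 da0p1 da0p2 da0p3 da0p4 da0p5
    # A bad sector in one partition is repaired from parity held on the others;
    # a whole-drive failure still takes the pool with it.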

Late addition: the question later resurfaced in application to a MicroSD card.

Performance considerations

Spinning disks have radically different performance characteristics for sequential operations and head seeks. The reason for that is purely mechanical, and it applies in the same manner to both reads and writes [4]. Sequential operations are significantly faster because of how the mechanical heads move. Furthermore, the time required to position the heads is, to some extent, proportional to the distance between the LBAs of the two sectors. In other words, it takes significantly longer to read the first and the last sectors of the disk than it takes to read the first and the second sectors, because the head has to travel through its entire range, which takes time.

Filesystems, including ZFS, are designed to take this into account. The filesystem tries to place file data and the corresponding metadata close together, keeping head travel time to a minimum. RAIDZ distributes data across the disks, storing all parts of the same data block at the same address on every disk. Thus, the head movements are synchronized across disks, minimizing total head travel time in the same way as for a single disk.

Synchronizing head movement works well as long as the RAIDZ members are different physical disks. As soon as you replace different disks with partitions on the same disk, everything falls apart. The single disk can no longer execute three read requests in parallel. Instead, the disk executes each request in turn, with a significant delay between them as the heads reposition from one partition to the next. At best, you get a 3x slowdown with three partitions.
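
You can observe the effect without ZFS at all. A hedged illustration, using FreeBSD-style placeholder device names: start three concurrent sequential reads from three partitions of the same disk and compare the throughput to a single read; it collapses as the heads ping-pong between the partitions.

    dd if=/dev/da0p1 of=/dev/null bs=1m count=1024 &
    dd if=/dev/da0p2 of=/dev/null bs=1m count=1024 &
    dd if=/dev/da0p3 of=/dev/null bs=1m count=1024 &
    wait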

So, it is not good to split hard drives into partitions and make a RAID (of any kind) of these partitions.

Multi-actuator hard drives

Multi-actuator hard drives did not quite make the cut, and I will discuss them separately.

Footnotes

Note 1. I'm not quite sure one can practically create a RAIDZ2 on three disks. However, any such limitation would be purely artificial; there is nothing in the ZFS structure that makes a RAIDZ2 on three disks impossible.

Note 2. Available capacity estimates are radically simplified in this example.

Note 3. On second thought, I doubt ZFS does that. I can't recall ever seeing a ZFS metadata structure holding any kind of bad block list.

Note 4. Unless we are talking about SMR (Shingled Magnetic Recording) drives, which have different characteristics for reads and writes in any mode of operation. However, SMR drives and their associated complexity are out of scope this time.

Created 19 June, 2021
