ZFS and power failures revisited

ZFS was designed to be resilient to power failures. It has all the modern stuff a filesystem needs - copy-on-write, intent logging, transactions with rollback, whatever. However, there are plenty of cases where a power failure breaks a ZFS pool to the point where the pool no longer mounts (see my other post here for common symptoms).

Why does this happen?

The brief answer is that some real-life piece of software or hardware did not meet the theoretical requirements. In theory, ZFS survives a power failure if the system meets all the requirements. In practice, the requirements were not met.

Requirements

For the resilient design to work, many things must work just right. Here is a simplified list for a simple storage stack consisting of a filesystem, a disk controller, and disks.

  1. The design of the filesystem itself must be correct, leaving no holes through which the power failure at an inopportune time can cause a catastrophic failure.
  2. The implementation of the above design must also be correct. The design and implementation are, in fact, two different things.
  3. Then, there is a significant set of non-trivial requirements for the hardware. Aside from generally proper function (for example, the data should go where we told the disk to write it and not end up somewhere else), the requirements are chiefly about caching and write ordering. Specifically, neither the controller nor disks should cache something we said not to cache, and in some cases, they must honor the requested order of writes (that is, the disk must write X before it writes Y).
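To make the write-ordering requirement a bit more concrete, here is a minimal application-level sketch in Python. It is not how ZFS does it internally; the file names and the `write_durably` and `commit_record` helpers are made up for illustration. The idea is simply "X must be on disk before Y is written": the data is written and flushed first, and only then is the commit marker written. Note that `os.fsync` merely asks the OS to push the data down to the device. Whether the controller and the disk actually persist it before reporting success is exactly the hardware requirement from item 3.

```python
import os
import tempfile


def write_durably(path, payload):
    """Write payload to path, then ask the OS to push it to the device."""
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()              # move data from Python's buffer to the OS
        os.fsync(f.fileno())   # ask the OS to flush it to the device


def commit_record(directory, data):
    """Hypothetical two-step commit: data first (X), marker second (Y)."""
    data_path = os.path.join(directory, "data.bin")
    marker_path = os.path.join(directory, "commit.marker")
    write_durably(data_path, data)            # X must be durable...
    write_durably(marker_path, b"committed")  # ...before Y is written
    return data_path, marker_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(commit_record(d, b"some payload"))
```

If the hardware keeps its promises, a crash at any point leaves either no marker (the record is ignored) or a marker with complete data behind it. If the device reorders the writes or acknowledges a flush it has not performed, a marker can end up on disk without its data.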

What could possibly go wrong?

As far as the design and implementation of ZFS go, I won't discuss that in any detail. Suffice it to say that while I personally think it is excessively complicated, I don't know of any flaw in the design. The implementation is a much bigger can of worms, but then again, some very clever people have been working on it for a long time, and it looks pretty good.

Now the hardware is a different kettle of fish. A modern hard drive is an incredibly complicated bit of machinery and programming, with all the issues the complexity implies.

A little bit of a detour

Let's first look at RAID controllers. All the ZFS best practice manuals say in unison that you should avoid RAID controllers at all costs. Theoretically, RAID controllers are supposed to conform to the same requirements hard drives do, including handling of their caches, so there should be no harm in using them. In practice, however, RAID controller firmware is even more complex than that of a hard drive, and it is not exactly bug-free. Take a look at the release notes for the firmware update for the ServeRAID M1210e (original, archived):

  1. Looking at various references to the C source code files and the corresponding line numbers, you can get a glimpse of just how big the firmware is.
  2. One bit is especially disturbing:
    "Fixed an issue where cache is not flushed immediately upon firmware receiving the cache-flush out-of-band command (SCGCQ00882125)"
    There goes your cache flushing guarantee.

You can easily see the rationale behind the best practice recommendation - do not pile more complexity onto the already complex system.
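Why a broken cache flush is so dangerous can be shown with a toy model. The sketch below is purely illustrative: `ToyDisk` and everything in it is hypothetical and does not describe how any real controller or drive firmware works. It assumes a device with a volatile write cache that either honors a flush or silently ignores it, and that later destages cached blocks in an order of its own choosing.

```python
class ToyDisk:
    """Toy device: a volatile write cache in front of persistent media."""

    def __init__(self, honors_flush):
        self.honors_flush = honors_flush
        self.cache = {}   # volatile, lost on power failure
        self.media = {}   # persistent, survives power failure

    def write(self, block, value):
        self.cache[block] = value

    def flush(self):
        # An honest device empties its cache to media before returning.
        # A buggy one reports success and does nothing.
        if self.honors_flush:
            self.media.update(self.cache)
            self.cache.clear()

    def destage_one(self):
        # Background write-out of the most recently cached block,
        # that is, not in the order the writes were issued.
        if self.cache:
            block, value = self.cache.popitem()
            self.media[block] = value

    def power_failure(self):
        self.cache.clear()


def media_after_crash(honors_flush):
    """Commit 'data', then a 'pointer' to it, then lose power."""
    disk = ToyDisk(honors_flush)
    disk.write("data", "new contents")
    disk.flush()                   # data is supposed to be on media now
    disk.write("pointer", "data")
    disk.flush()
    disk.destage_one()
    disk.power_failure()
    return disk.media


if __name__ == "__main__":
    print("honest flush: ", media_after_crash(True))
    print("ignored flush:", media_after_crash(False))
```

In this model, the honest device ends up with both the data and the pointer on the media. The device that ignores flushes ends up with a pointer to data that never made it to the media, which is the kind of damage no amount of careful ordering on the filesystem side can prevent.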

Back to hard drives

There are two aspects to hard drives - firmware and hardware. While complex, hard drive firmware is significantly more reliable than RAID controller firmware. There were worrisome incidents over the years, but not that many. The hardware part of a drive is a marvel of mechanical engineering, with mechanical parts spinning and flying around with unbelievably tight tolerances.

All of this requires clean and stable power to work correctly. The requirements flow downstream - the filesystem requires something from a hard drive, and the hard drive, in turn, requires something from its power supply. And therein finally lies the problem - the power supply may prove unable to meet the requirements during a brownout or some other kind of transient on the power line. So the power guarantee goes, and, in turn, whatever the hard drive guaranteed to the filesystem also goes. Most of the time, people get lucky, and nothing serious is affected. However, every once in a while, something vital gets damaged.

Conclusion

The conclusion is pretty simple - even if you use ZFS, which is resistant to power failures, do use a UPS. And, while we are at it, do the battery maintenance and replacement as prescribed.

Filed under: ZFS.

Created Wednesday, June 22, 2022