Recovery of destroyed or corrupt ZFS pools

Here, I'm talking about severely damaged ZFS pools, where the pool cannot be mounted at all. I will talk about undeleting individual files from functioning pools later.

ZFS disk labels

ZFS uses disk labels to record which disk belongs to which pool and what the parameters of that pool are. Most importantly, the labels hold the disk order and RAID levels of the vdevs that make up the pool. Each disk stores four labels, two at the start and two at the end. The rationale is that any massive overwrite will likely go from start to end, and if the user stops it in time, the labels at the end of the disk survive. The labels also hold pointers to the filesystem metadata and are thus updated every time the filesystem is modified. If all the labels on a given disk are damaged, ZFS can no longer identify the disk as a pool member, and the disk drops out of the pool. If the number of missing disks exceeds pool redundancy, the pool no longer mounts. Once the labels are damaged, the pool is unusable even if the rest of the data (or most of it) is intact.
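To make the label layout concrete, here is a minimal Python sketch that computes where the four labels sit and which of them survive an overwrite starting at the beginning of the disk. It assumes the OpenZFS on-disk layout: 256 KiB per label, with the usable size rounded down to a multiple of 256 KiB before the two end labels are placed. Offsets are relative to the start of the device or partition ZFS was given. The function names are mine, for illustration only.

```python
from typing import List

LABEL_SIZE = 256 * 1024  # each vdev label is 256 KiB
NUM_LABELS = 4           # L0 and L1 at the front, L2 and L3 at the back


def label_offsets(device_size: int) -> List[int]:
    """Byte offsets of labels L0..L3 on a device of `device_size` bytes."""
    # The usable size is aligned down to a whole number of labels.
    size = device_size - device_size % LABEL_SIZE
    if size < NUM_LABELS * LABEL_SIZE:
        raise ValueError("device too small to hold four labels")
    front = [0, LABEL_SIZE]
    back = [size - 2 * LABEL_SIZE, size - LABEL_SIZE]
    return front + back


def surviving_labels(device_size: int, overwritten_bytes: int) -> List[int]:
    """Indices of labels untouched by overwriting the first `overwritten_bytes` bytes."""
    return [i for i, off in enumerate(label_offsets(device_size))
            if off >= overwritten_bytes]
```

For example, on a 1 GiB device, overwriting the first 100 MiB destroys L0 and L1 but leaves L2 and L3 intact, which is exactly the scenario the start/end placement is designed for.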

Common failure modes

There are several ways to damage multiple disk labels at once. None of them are exclusive to ZFS. All filesystems are more or less vulnerable to the same set of problems.

  • The most drastic and common, it seems, is to destroy the pool and create another pool over the same disk set. This happens when people confuse disks, pool names, and whatnot. Since labels are always in the same place on disks, the new labels overwrite the old ones precisely. The old filesystem metadata is then impossible to reach by normal means.
  • Sometimes, faulty RAM causes wrong data to be written to the filesystem. Since labels are updated often, errors tend to propagate fairly quickly. The pool may be damaged beyond repair if enough data is written before the system crashes.
  • Additionally, pools may be damaged by a power failure. In theory, ZFS is supposed to protect against this type of failure, but there is an important restriction: ZFS relies on the hard drive reliably flushing its cache, and that sometimes does not happen. To improve performance, some drives and some USB configurations report a cache flush as complete while the data is still sitting in volatile cache. This indeed improves performance, but at the cost of reliability.
  • Last but not least, various random glitches in hardware or software take their fair share of failure cases.
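The cache-flush problem is easier to see in a toy model. The sketch below is not ZFS code and not a model of any real drive firmware; the block addresses, the single fixed uberblock location, and the "most recently cached block first" write-back order are all assumptions chosen to keep it short. Roughly, ZFS writes new metadata, issues a flush, and only then writes the uberblock that points to it. If the drive acknowledges the flush without actually persisting the data, a power loss can leave the new uberblock on disk while the metadata it points to never made it.

```python
from typing import Dict, List, Tuple

UB_LBA = 0  # hypothetical fixed uberblock location in this toy model


class ToyDisk:
    """Toy drive with a volatile write cache."""

    def __init__(self, honest_flush: bool):
        self.honest_flush = honest_flush
        self.persistent: Dict[int, object] = {}    # survives power loss
        self.cache: List[Tuple[int, object]] = []  # lost on power loss

    def write(self, lba: int, data: object) -> None:
        self.cache.append((lba, data))

    def flush(self) -> None:
        if self.honest_flush:
            for lba, data in self.cache:
                self.persistent[lba] = data
            self.cache.clear()
        # A lying drive reports success but leaves everything in cache.

    def destage(self, count: int) -> None:
        # Assumption: the drive writes back the most recently cached block first.
        for _ in range(min(count, len(self.cache))):
            lba, data = self.cache.pop()
            self.persistent[lba] = data

    def power_loss(self) -> None:
        self.cache.clear()


def commit_txg(disk: ToyDisk, txg: int) -> None:
    meta_lba = 100 + txg  # copy-on-write: new metadata goes to a new location
    disk.write(meta_lba, f"metadata of txg {txg}")
    disk.flush()  # barrier: metadata must be durable before the uberblock
    disk.write(UB_LBA, ("uberblock", txg, meta_lba))  # points at new metadata
    disk.flush()


def is_consistent(disk: ToyDisk) -> bool:
    ub = disk.persistent.get(UB_LBA)
    return ub is None or ub[2] in disk.persistent


def crash_after_commit(honest_flush: bool) -> bool:
    disk = ToyDisk(honest_flush)
    commit_txg(disk, 1)
    disk.destage(1)    # the drive manages to write back one cached block...
    disk.power_loss()  # ...and then the power goes out
    return is_consistent(disk)
```

With an honest drive, crash_after_commit(True) returns True; with a lying one, crash_after_commit(False) returns False because the uberblock points to metadata that was lost. Real ZFS keeps older uberblocks in the label's uberblock ring, so actual outcomes are often less clear-cut than in this model.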

Other key metadata

Once the correct disk labels are identified, there are still many things ZFS needs to read to eventually mount the pool.

  1. The disk label, specifically the uberblock, holds a pointer to the MOS (Meta Object Set). The uberblock is the approximate equivalent of a boot sector or a superblock in other filesystems. It is technically part of a disk label.
  2. MOS contains an object directory and references to each dataset in the filesystem.
  3. Each dataset then contains a full set of references describing either files and directories (for datasets you access as files, either directly or over the network), or a ZVOL, a single large volume you access as block storage (typically over iSCSI).
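The first step of that chain, picking the active uberblock, can be sketched in Python. Each label ends with a 128 KiB ring of uberblock slots; the slot size is 2^ashift bytes, clamped to between 1 KiB and 8 KiB, and the uberblock with the highest transaction group (txg) number wins, with the timestamp breaking ties. This is a simplified sketch: it checks only the magic number 0x00bab10c and skips the embedded checksum that real ZFS verifies before trusting a slot. Function and constant names are illustrative.

```python
import struct
from typing import Optional, Tuple

UB_MAGIC = 0x00BAB10C     # uberblock magic number
LABEL_SIZE = 256 * 1024   # one vdev label
RING_OFFSET = 128 * 1024  # the uberblock ring is the second half of a label
RING_SIZE = 128 * 1024


def slot_size(ashift: int) -> int:
    """Uberblock slot size: 2**ashift, clamped to 1 KiB..8 KiB."""
    return 1 << min(max(ashift, 10), 13)


def best_uberblock(label: bytes, ashift: int = 9) -> Optional[Tuple[int, int, int]]:
    """Return (txg, timestamp, slot) of the newest uberblock in one label, or None."""
    size = slot_size(ashift)
    best = None
    for slot in range(RING_SIZE // size):
        off = RING_OFFSET + slot * size
        # Uberblocks are written in the byte order of the host that wrote them,
        # so try little-endian first, then big-endian.
        for order in ("<", ">"):
            # Layout: magic, version, txg, guid_sum, timestamp (all 64-bit).
            magic, _version, txg, _guid_sum, timestamp = struct.unpack_from(
                order + "5Q", label, off)
            if magic == UB_MAGIC:
                # Highest txg wins; the timestamp breaks ties.
                if best is None or (txg, timestamp) > best[:2]:
                    best = (txg, timestamp, slot)
                break
    return best
```

In practice you would run this over every label of every disk and compare the results; the winning uberblock's root block pointer (which follows the timestamp and is not parsed here) leads to the MOS.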

Only after all these pieces of data are read correctly does ZFS get full access to the content of the filesystem. Most of these pieces are stored in two copies, and some in three. The extra copies are kept even if the pool uses RAIDZ or mirrors, which already provide redundancy of their own. This way, a single bad sector or a single spot overwrite does not bring down the entire filesystem.
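Deciding which copy to trust comes down to checksums: a block pointer can hold up to three addresses (DVAs), one per copy, plus the checksum of the block it points to. As an illustration, here is Fletcher-4, the checksum ZFS uses for metadata by default, and a helper that returns the first copy matching the expected checksum. The sketch assumes a little-endian pool and that the expected checksum is already known from the parent block pointer; read_first_good_copy is a hypothetical helper, not part of any ZFS tool.

```python
import struct
from typing import Optional, Sequence, Tuple

MASK64 = (1 << 64) - 1


def fletcher4(data: bytes) -> Tuple[int, int, int, int]:
    """Fletcher-4: four 64-bit running sums over little-endian 32-bit words."""
    if len(data) % 4:
        raise ValueError("length must be a multiple of 4")
    a = b = c = d = 0
    for (word,) in struct.iter_unpack("<I", data):
        a = (a + word) & MASK64
        b = (b + a) & MASK64
        c = (c + b) & MASK64
        d = (d + c) & MASK64
    return a, b, c, d


def read_first_good_copy(copies: Sequence[Optional[bytes]],
                         expected: Tuple[int, int, int, int]) -> Optional[bytes]:
    """Return the first copy whose checksum matches; None entries are read errors."""
    for data in copies:
        if data is not None and fletcher4(data) == expected:
            return data
    return None
```

A bad sector on one copy simply makes that copy fail the check, and the next one is tried, which is why a single spot of damage rarely matters.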

However, damage to this set of metadata is less critical than damage to the disk labels. If all the disk labels (or enough of them anyway) are damaged, there is no hope of finding intact copies, because labels live in only a few fixed places. A complex process is then required to rebuild the disk labels from the data on the disks. Once that is done, however, the rest of the metadata can typically be found by a reasonably simple search.

Symptoms

Apart from just not being able to locate the pool to import, there are several error messages associated with catastrophic damage to disk labels or other metadata.

  • The pool metadata is corrupted. The pool cannot be imported due to damaged devices or data.
  • cannot import 'tank': one or more devices is currently unavailable
  • One or more devices are missing from the system. The pool cannot be imported. Attach the missing devices and try again.
  • Additional devices are known to be part of this pool, though their exact configuration cannot be determined.
  • One or more devices could not be used because the label is missing or invalid.

Recovery

Klennet ZFS Recovery reads the filesystem without requiring disk labels to be readable or correct. This matters even when the labels look fine: if a new pool was created over the deleted one, the labels are readable and perfectly correct, but they point to the metadata of the new, empty pool. The solution is to find all the remnants of the original pool's metadata by scanning all the available drives, then assemble whatever metadata can be salvaged and read it. Because ZFS is copy-on-write, multiple copies of old metadata are scattered all over the disks, so recovery works quite well even from partially overwritten pools.
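The scanning step itself is conceptually simple, even if deciding what the found blocks mean is not. The sketch below walks a raw disk image at a fixed alignment and reports offsets where the uberblock magic number appears, in either byte order. It only illustrates the idea of a signature scan; it is not how Klennet ZFS Recovery works internally, and most ZFS metadata blocks carry no magic number at all, so a real tool has to recognize them by other means. The 1 KiB step matches the smallest uberblock slot size; the function names are hypothetical.

```python
import io
import struct
from typing import BinaryIO, Iterator, List

UB_MAGIC = 0x00BAB10C
SIGNATURES = (struct.pack("<Q", UB_MAGIC), struct.pack(">Q", UB_MAGIC))


def scan_for_uberblocks(stream: BinaryIO, step: int = 1024,
                        chunk_size: int = 1 << 20) -> Iterator[int]:
    """Yield `step`-aligned byte offsets where an uberblock magic starts."""
    if chunk_size % step:
        raise ValueError("chunk_size must be a multiple of step")
    base = 0
    while True:
        # Buffered reads return a full chunk except at end of stream,
        # so `base` stays aligned to `step`.
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        for pos in range(0, len(chunk) - 7, step):
            if chunk[pos:pos + 8] in SIGNATURES:
                yield base + pos
        base += len(chunk)


def scan_image_bytes(data: bytes, step: int = 1024) -> List[int]:
    """Convenience wrapper for scanning an in-memory image."""
    return list(scan_for_uberblocks(io.BytesIO(data), step=step))
```

For a real drive you would pass an open binary file, for example the result of open("/dev/sdX", "rb") on Linux, and run the scan over every member disk.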


Created Sunday, December 30, 2018

Updated 27 Oct 2019