Recovery of destroyed or corrupt ZFS pools

Here, I'm talking about completely damaged ZFS pools, cases where the pool cannot be mounted at all. Undeleting individual files from otherwise functioning pools will be covered later.

ZFS disk labels

ZFS uses disk labels to record which disk belongs to which pool and what the parameters of that pool are. Importantly, the labels hold the disk order and the RAID levels of the vdevs composing the pool. Four labels are stored on each disk: two at the start and two at the end. The rationale is that any kind of massive overwrite is likely to go from start to end, and if the overwrite is aborted in time, the labels at the end of the disk survive. The labels also hold pointers to the filesystem metadata and are thus updated every time the filesystem is modified. If all the labels on a given disk are damaged, ZFS can no longer identify the disk as a member of the pool, and the disk drops out. If the number of missing disks exceeds the pool redundancy, the pool no longer mounts. Once the labels are damaged, it does not matter that the rest of the data (or most of it, anyway) is intact.
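For reference, the standard label placement can be sketched as follows. This is a simplified illustration, not any tool's actual code: it assumes the usual ZFS layout of four 256 KiB labels (L0 and L1 at the start of the device, L2 and L3 at the end), with the end-of-device positions computed against the device size aligned down to a 256 KiB boundary.

```python
LABEL_SIZE = 256 * 1024  # each ZFS disk label is 256 KiB


def label_offsets(device_size: int) -> list[int]:
    """Byte offsets of the four labels L0..L3 on a device.

    L0 and L1 sit at the very start; L2 and L3 at the end,
    with the end position aligned down to a 256 KiB boundary,
    mirroring ZFS's own placement logic.
    """
    aligned_end = (device_size // LABEL_SIZE) * LABEL_SIZE
    return [
        0,                             # L0
        LABEL_SIZE,                    # L1
        aligned_end - 2 * LABEL_SIZE,  # L2
        aligned_end - LABEL_SIZE,      # L3
    ]


# Example: a 1 GiB disk
print(label_offsets(1 << 30))
```

The start/end split is what gives a partially aborted overwrite a chance of leaving L2 and L3 intact.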

Common failure modes

There are several ways to damage multiple disk labels at once. None of them are really exclusive to ZFS; all filesystems are more or less vulnerable to the same set of problems.

  • The most drastic, and seemingly the most common, failure is to destroy the pool and create another pool over the same disk set. This happens when people confuse disks, pool names, and whatnot. Since the labels are always at the same places on the disks, the new labels overwrite the old ones precisely, and the old filesystem metadata then becomes impossible to reach by normal means.
  • Sometimes, faulty RAM causes wrong data to be written to the filesystem. Since the labels are updated often, errors tend to propagate fairly quickly. If enough bad data is written before the system crashes, the pool may be damaged beyond repair.
  • Pools may also crash after a power failure. In theory, ZFS is supposed to protect against this type of failure, but there is an important caveat: ZFS relies on the hard drive reliably flushing its cache, and that sometimes does not happen. Some drives and some USB configurations lie about caches being written in order to improve performance. This does improve performance, but at the cost of reliability.
  • Last but not least, various random glitches, either in hardware or software, take their fair share of failure cases.

Other key metadata

Once the correct disk labels are identified, there are still many things ZFS needs to read before it can mount the pool.

  1. The disk label, or more specifically the uberblock (technically a part of the label), holds a pointer to the MOS (Meta Object Set). The uberblock is the approximate equivalent of a boot sector or a superblock in other filesystems.
  2. The MOS contains the object directory and a set of references to each dataset in the pool.
  3. Each dataset then contains a full set of references describing either files and directories (for datasets you access as files, either directly or over the network), or a ZVOL, a single large object which you access as block storage (typically over iSCSI).
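The chain above can be sketched as a set of records showing what points to what. The field and type names here are hypothetical simplifications for illustration only; the real on-disk structures (uberblock_t, objset_phys_t, and so on) carry many more fields.

```python
from dataclasses import dataclass, field


@dataclass
class BlockPointer:
    offset: int    # where on disk the target lives
    checksum: int  # verifies the target when it is read


@dataclass
class Dataset:
    name: str
    root: BlockPointer  # root of the file/directory tree, or the ZVOL data


@dataclass
class MOS:
    object_directory: BlockPointer
    datasets: dict[str, Dataset] = field(default_factory=dict)


@dataclass
class Uberblock:
    txg: int            # transaction group number; the newest valid one wins
    mos: BlockPointer   # the pointer mentioned in step 1
```

Reading the pool means following this chain top to bottom: pick the newest valid uberblock, follow it to the MOS, then from the MOS to each dataset.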

Only after all these pieces of data are correctly read does ZFS get full access to the content of the filesystem. Most of the pieces are stored in two copies, and some in three. The multiple copies are stored even if the pool uses RAIDZ or mirrors, which provide redundancy of their own. This way, a single bad sector, or a single spot overwrite, does not bring down the entire filesystem.

However, this set of metadata is less critical than the disk labels. If all the disk labels (or enough of them, anyway) are damaged, there is little hope of finding copies, because there are only a few places to look. A complex process is then required to rebuild the disk labels from the data on the disks. Once that is done, however, the rest of the metadata can typically be found by a reasonably simple search.
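As a toy illustration of that kind of search, the sketch below scans a raw disk image for the ZFS uberblock magic number (0x00bab10c), checking both byte orders, since pools may have been written on either a little-endian or a big-endian host. A real recovery tool does far more, validating checksums and transaction group numbers and reconstructing vdev geometry, but the basic "scan for recognizable metadata" idea is the same.

```python
import struct

UBERBLOCK_MAGIC = 0x00BAB10C
MAGIC_LE = struct.pack("<Q", UBERBLOCK_MAGIC)  # pool written on a little-endian host
MAGIC_BE = struct.pack(">Q", UBERBLOCK_MAGIC)  # pool written on a big-endian host
STEP = 1024  # uberblocks are at least 1 KiB apart in the label's uberblock array


def find_uberblock_candidates(image: bytes) -> list[int]:
    """Return offsets within a raw image where an uberblock may start."""
    hits = []
    for off in range(0, len(image) - 8 + 1, STEP):
        if image[off:off + 8] in (MAGIC_LE, MAGIC_BE):
            hits.append(off)
    return hits
```

Each hit is only a candidate; a recovery tool would then check that the rest of the structure at that offset makes sense before trusting it.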

Symptoms

Apart from the pool simply not being found during import, there are several error messages associated with catastrophic damage to the disk labels or other metadata.

  • The pool metadata is corrupted. The pool cannot be imported due to damaged devices or data.
  • cannot import 'tank': one or more devices is currently unavailable
  • One or more devices are missing from the system. The pool cannot be imported. Attach the missing devices and try again.
  • Additional devices are known to be part of this pool, though their exact configuration cannot be determined.
  • One or more devices could not be used because the label is missing or invalid.

Recovery

Klennet ZFS Recovery reads the filesystem without requiring the disk labels to be readable or correct. This matters because when a new pool is created over a deleted one, the labels are perfectly readable and correct, but they point to the metadata of the new, empty pool. The solution is to find all remnants of the original pool's metadata by scanning all available drives, then assemble and read whatever metadata can be salvaged. Because ZFS is copy-on-write, multiple copies of old metadata are scattered all around the disks, so recovery results are quite good, even from partially overwritten pools.

Created Sunday, December 30, 2018

Updated 27 Oct 2019
