There is no fsck for ZFS. Why?

fsck, the filesystem checker, is a UNIX/Linux tool that scans a filesystem and fixes errors. The usual answer goes,

ZFS does not need fsck because the data on disk is always correct.

No. Saying that while discussing the recovery of a catastrophically crashed filesystem is not a wise thing to do.

The actual reason is the complexity of the filesystem and its repair.

Designing a simple filesystem repair tool

How do we even start designing a filesystem repair tool, like fsck in Linux or chkdsk in Windows?

First, we need to write down a list of so-called invariants. Filesystem invariants are statements that must be true on a consistent filesystem. You can think of them as the set of rules that make up a filesystem. Note that I did not say anything about the data being valid. The invariants do not describe user data. An empty volume, holding no data at all, is still consistent.

Let's take FAT16, probably the simplest filesystem in use. The invariants go like this:

  • The boot sector must end with the 55AA signature.
  • No FAT entry and no directory entry may reference a cluster number higher than the number of clusters on the volume.
  • Every cluster, starting with cluster 2, is either free or assigned to exactly one file (on FAT16, the root directory lives in its own fixed area outside the data clusters, and FAT entries 0 and 1 are reserved, so there are no data clusters 0 and 1).
  • The number of clusters assigned to each file (as recorded in the file allocation table, FAT) must match the file size (as recorded in its directory entry). A zero-sized file occupies no clusters at all.

and so on. FAT16 being really simple, we can probably get away with fewer than 100 invariants (a minimal check of the first two is sketched below).
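
To make the idea concrete, here is a minimal sketch, in Python, of what a check of the first two invariants might look like. The field offsets follow the standard FAT16 boot sector layout; the function name and the idea of collecting problems into a plain list are made up for illustration, and a real checker would have to be far more defensive about a corrupt boot sector.

    # A toy check of the first two FAT16 invariants listed above.
    import struct

    def check_fat16(image_path):
        problems = []
        with open(image_path, "rb") as f:
            boot = f.read(512)

            # Invariant 1: the boot sector ends with the 55AA signature.
            if boot[510:512] != b"\x55\xaa":
                problems.append("boot sector signature is not 55AA")

            # Parse the fields needed to locate the FAT and count the clusters.
            bytes_per_sector, = struct.unpack_from("<H", boot, 11)
            sectors_per_cluster = boot[13]
            reserved_sectors, = struct.unpack_from("<H", boot, 14)
            fat_count = boot[16]
            root_entries, = struct.unpack_from("<H", boot, 17)
            total_sectors, = struct.unpack_from("<H", boot, 19)
            if total_sectors == 0:
                total_sectors, = struct.unpack_from("<I", boot, 32)
            sectors_per_fat, = struct.unpack_from("<H", boot, 22)

            root_dir_sectors = (root_entries * 32 + bytes_per_sector - 1) // bytes_per_sector
            data_sectors = (total_sectors - reserved_sectors
                            - fat_count * sectors_per_fat - root_dir_sectors)
            last_cluster = data_sectors // sectors_per_cluster + 1  # data clusters are 2..last_cluster

            # Invariant 2: no FAT entry references a cluster past the last one.
            f.seek(reserved_sectors * bytes_per_sector)
            fat = f.read(sectors_per_fat * bytes_per_sector)
            for n in range(2, last_cluster + 1):
                entry, = struct.unpack_from("<H", fat, n * 2)
                if 2 <= entry <= 0xFFEF and entry > last_cluster:
                    problems.append(f"FAT entry {n} points to nonexistent cluster {entry}")
        return problems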

Now, we need to design a corrective action for every possible violation of the invariants above. Some judgment calls have to be made about what exactly these corrective actions should be. Let's look at some examples:

  • If a cluster is marked as allocated, but no file references it, we can wrap it into a new file under some automatically generated name. We do that in data recovery all the time, anyway. No big deal. Too bad you then need a specialized tool to sort the resulting .chk files by their content.
  • If a cluster belongs to N files simultaneously, make N copies of the cluster and assign each copy to its specific file (a sketch of this repair follows the list). What if there are not enough free clusters on the volume to make N copies?
  • What are we to do if the file size does not match the number of clusters assigned to that file? Let's say there are more clusters than there should be. Should we increase the file size to accommodate extra clusters? Or should we mark extra clusters as unused? What if the damage is the other way around, so that there are not enough clusters for the given file size? Do we decrease the file size to match the actual number of clusters? Or should we assign extra clusters to a file? Which clusters should we use, then? What if they are in use by some other file?
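
To illustrate the second corrective action, here is a simplified sketch of resolving a cross-linked cluster by duplication. The dict-based FAT, the per-file chain lists, and the helper name are toy stand-ins for the real on-disk structures; a real tool would also have to update the free-space accounting it just changed.

    # Toy model: `fat` maps cluster -> next cluster (0xFFFF ends a chain),
    # `data` maps cluster -> contents, `free` lists unallocated clusters,
    # and each file is a dict with its name, starting cluster, and chain.
    def uncross(cluster, owners, fat, data, free):
        """Give every owner of `cluster` except the first its own copy."""
        for f in owners[1:]:                     # the first owner keeps the original
            if not free:
                raise RuntimeError("not enough free clusters to duplicate the data")
            copy = free.pop()
            data[copy] = data[cluster]           # duplicate the contents
            fat[copy] = fat[cluster]             # the copy continues the same chain

            pos = f["chain"].index(cluster)
            if pos == 0:
                f["start"] = copy                # repoint the directory entry
            else:
                fat[f["chain"][pos - 1]] = copy  # relink the previous cluster
            f["chain"][pos] = copy

    # Example: cluster 7 is claimed by both A.TXT and B.TXT.
    fat  = {5: 7, 7: 0xFFFF, 9: 7}
    data = {5: b"aaaa", 7: b"shared", 9: b"bbbb"}
    free = [20, 21]
    a = {"name": "A.TXT", "start": 5, "chain": [5, 7]}
    b = {"name": "B.TXT", "start": 9, "chain": [9, 7]}
    uncross(7, [a, b], fat, data, free)
    # B.TXT now runs 9 -> 21, and cluster 21 holds its own copy of the data.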

Now, what if several of the invariants fail at the same time?

The complexity increases pretty quickly, and that's for the simplest filesystem available, with probably fewer than a hundred invariants. Also, if you examine the decisions above closely, you will see that they are in no way precise but rather represent some kind of best guess. There will be cases when we guess incorrectly, and the data in the files will be corrupt even though the filesystem itself will be fixed (or, more precisely, forced into some arbitrary consistent state).

Complexity, more complexity, and some more complexity

As we move to more complex filesystems, the number of invariants increases. We can probably get away with a couple of hundred or so for NTFS. This is about as far as it goes: NTFS is the most complex filesystem that still has a useful repair tool (not that you should use it, for it sometimes does more harm than good).

Now, ReFS (on Windows) and ZFS (on Linux) are significantly more complex than NTFS. The number of invariants to track increases, probably well into the thousands. Each of these needs to be identified and written down. Then, as we come to interactions between multiple failures, the complexity just explodes. The fault trees grow so large and tangled that even exploring them becomes impractical. I would argue that a filesystem check and repair tool is more complex than the filesystem driver itself. It also takes comparable time and effort to implement, debug, and test properly, which is very expensive (after all, it takes years to develop a filesystem).

You should also remember that filesystem checking does not necessarily give you valid data back. A successful run of fsck or chkdsk more or less guarantees that the driver does not choke when accessing the filesystem, but that's about it. As complexity increases, so do the chances that some files that were readable before the fix will become inaccessible. Also, the option to declare a file inaccessible and get rid of it in some way becomes increasingly attractive because of growing concerns that any attempted fix will break something elsewhere in the filesystem.

Given all that, there is probably no point in writing fsck for ZFS. It would require enormous time and effort to write, and even the best implementation would come nowhere near backups in terms of recovery quality.

So, you should rely on backups, as usual. fsck for ZFS is not forthcoming.

Filed under: ZFS.

Created Thursday, August 8, 2019