There is no fsck for ZFS. Why?

fsck, the file system checker, is a UNIX/Linux tool that scans a filesystem and fixes errors on it. They say,

ZFS does not need fsck because the data on disk is always correct.

No. Saying that while discussing recovery of a catastrophically crashed filesystem is not a wise thing to do.

The actual reason is the complexity of the filesystem and its repair.

Designing a simple filesystem repair tool

How do we even start designing a filesystem repair tool, like fsck in Linux or chkdsk in Windows?

First, we need to write down a list of so-called invariants. Filesystem invariants are statements which must be true on a consistent filesystem. You can think of them as the set of rules making up the filesystem. Note immediately that I did not say anything about data being valid. The invariants do not describe user data; an empty volume, holding no data at all, is still consistent.

Let’s take FAT16, probably the simplest filesystem in use. The invariants go like this:

  • The boot sector must end with the 55AA signature.
  • No FAT entry and no directory entry may reference a cluster number higher than the number of clusters on the volume.
  • Starting with cluster 2, every cluster is either empty or assigned to exactly one file (clusters 0 and 1 do not really exist, and on FAT16 the root directory lives in a fixed area outside the cluster heap).
  • For each file, the number of clusters assigned to it (as recorded in the allocation table, the FAT) must match the file size (as recorded in its directory entry). Zero-sized files occupy no clusters at all.

and so on. FAT16 being really simple, we can probably get away with fewer than 100 invariants for it.
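
To make this concrete, here is a minimal sketch of what checking the first two invariants might look like. It is a checker only, not a repair tool; it assumes a raw FAT16 volume image (the file name fat16.img is made up for the example) and the standard FAT16 boot sector layout, and it makes no attempt to survive a boot sector with nonsensical geometry.

    import struct

    def check_fat16(path):
        with open(path, "rb") as f:
            boot = f.read(512)

            # Invariant: the boot sector must end with the 55AA signature.
            if boot[510:512] != b"\x55\xaa":
                print("violation: boot sector signature missing")

            # Standard FAT16 boot sector fields (little-endian).
            bytes_per_sector, = struct.unpack_from("<H", boot, 11)
            sectors_per_cluster = boot[13]
            reserved_sectors, = struct.unpack_from("<H", boot, 14)
            num_fats = boot[16]
            root_entries, = struct.unpack_from("<H", boot, 17)
            total_sectors, = struct.unpack_from("<H", boot, 19)
            sectors_per_fat, = struct.unpack_from("<H", boot, 22)
            if total_sectors == 0:
                total_sectors, = struct.unpack_from("<I", boot, 32)

            root_dir_sectors = (root_entries * 32 + bytes_per_sector - 1) // bytes_per_sector
            data_sectors = (total_sectors - reserved_sectors
                            - num_fats * sectors_per_fat - root_dir_sectors)
            # Data clusters are numbered 2 through last_cluster.
            last_cluster = data_sectors // sectors_per_cluster + 1

            # Invariant: no FAT entry may reference a cluster past the end of the volume.
            f.seek(reserved_sectors * bytes_per_sector)
            fat = f.read(sectors_per_fat * bytes_per_sector)
            for cluster in range(2, last_cluster + 1):
                entry, = struct.unpack_from("<H", fat, cluster * 2)
                # 0x0000 = free, 0xFFF7 = bad cluster, 0xFFF8 and above = end of chain
                if 2 <= entry < 0xFFF0 and entry > last_cluster:
                    print(f"violation: cluster {cluster} points to {entry}, past the volume end")

    check_fat16("fat16.img")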

Now, we need to design a corrective action for every possible violation of the invariants above. Decisions have to be made about what, exactly, those corrective actions are. Let's look at some examples:

  • If a cluster is marked as allocated, but no file actually uses it, we can create a file for it under some automatically generated name. We do that in data recovery all the time anyway, no big deal. Too bad you then need a specialized tool to sort the resulting .chk files by their content.
  • If a cluster is found to belong to N files at the same time, make N copies of the cluster and assign each copy to its own file. But what if there are not enough free clusters on the volume to make N copies?
  • If the file size does not match the number of clusters assigned to the file, what are we to do? Let's say there are more clusters than there should be. Should we increase the file size to accommodate the extra clusters? Or should we mark the extra clusters as unused? What if the damage is the other way round, so that there are not enough clusters for the given file size? Do we decrease the file size to match the actual number of clusters? Or maybe we should assign extra clusters to the file? Which clusters should we use, then? What if they are in use by some other file? (A sketch of this particular guesswork follows after the list.)
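
Here is the promised sketch of the size-versus-chain case. The structures and names below (FatFile, CLUSTER_SIZE, fix_size_mismatch) are invented for the illustration, not taken from any real tool; whichever branch we pick, the filesystem ends up consistent, but we may well be throwing away correct data.

    from dataclasses import dataclass

    CLUSTER_SIZE = 4096  # an assumed cluster size, just for the example

    @dataclass
    class FatFile:
        name: str
        size: int    # file size as recorded in the directory entry
        chain: list  # cluster numbers the FAT assigns to this file

    def clusters_needed(size):
        # A zero-byte file owns no clusters; anything else rounds up.
        return (size + CLUSTER_SIZE - 1) // CLUSTER_SIZE

    def fix_size_mismatch(f):
        expected = clusters_needed(f.size)
        if len(f.chain) > expected:
            # Guess 1: trust the directory entry and cut the chain short.
            # If the extra clusters held real data, that data is now gone.
            dropped = len(f.chain) - expected
            f.chain = f.chain[:expected]
            return f"{f.name}: freed {dropped} extra cluster(s)"
        if len(f.chain) < expected:
            # Guess 2: trust the chain and shrink the recorded size.
            # If the directory entry was right, the file's tail is now lost.
            f.size = len(f.chain) * CLUSTER_SIZE
            return f"{f.name}: size reduced to {f.size} bytes"
        return f"{f.name}: consistent"

    # A file claiming 10000 bytes but owning four clusters: one too many.
    print(fix_size_mismatch(FatFile("REPORT.DOC", 10000, [5, 6, 7, 8])))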

Now, what if several of the invariants fail at the same time?

You see, the complexity increases pretty quickly, and that’s for the simplest filesystem available, with probably fewer than a hundred invariants. Also, if you examine the decisions above closely, you will see that they are in no way precise; they represent some kind of best guess. There will be cases when we guess incorrectly, and the data in the files will be corrupt even though the filesystem itself will be fixed (or, more precisely, forced into some consistent state).

Complexity, more complexity, and some more complexity

As we move to more complex filesystems, the number of invariants increases. On NTFS, we can probably get away with a couple of hundred or so. This is about as far as it goes, NTFS being the most complex filesystem that still has a useful repair tool (not that you should use it, for it sometimes does more harm than good).

Now, ReFS (on Windows) and ZFS (on Linux) are significantly more complex than NTFS. The number of invariants to track increases, probably well into the thousands. Each of these needs to be identified and written down. Then, as we come to interactions between multiple failures, the complexity just explodes. The fault trees grow so convoluted that even exploring them becomes impractical. I would argue that a filesystem check and repair tool is more complex than the filesystem driver itself. It also takes comparable time and effort to implement, debug, and test properly, which turns out to be very expensive (after all, it takes years to develop a filesystem).

You should also keep in mind that filesystem checking does not necessarily give you valid data back. A successful run of fsck or chkdsk more or less guarantees that the driver does not choke when accessing the filesystem, but that’s about it. As complexity increases, so do the chances that some files which were readable before the fix will become inaccessible after it. Also, the option to just declare a file inaccessible and get rid of it in some way becomes more and more attractive, because of the growing risk that any attempted fix will break something elsewhere in the filesystem.

Given all that, there is probably no point in writing fsck for ZFS. It would require enormous time and effort to write, and even the best implementation would come nowhere near backups in terms of recovery quality.

So, you should rely on backups, as usual. fsck for ZFS is not forthcoming.

Created Thursday, August 8, 2019
