Simplified theory of copy-on-write filesystems

WARNING: for illustration only; not a technical reference.

This simplified model does not map well onto real implementations, which differ substantially in their details. I am publishing it only because I think it is easy to understand. The general idea applies to all CoW filesystems, including ZFS, BTRFS, ReFS, F2FS, and others.

The principal difference between a CoW filesystem and a traditional one is that a CoW filesystem never overwrites data in place. If you change a file, a traditional filesystem modifies the blocks already belonging to that file. A CoW filesystem will instead

  1. allocate new space for the changed blocks,
  2. write the changed blocks,
  3. change the references accordingly, and, finally,
  4. mark the original blocks as free.
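
The four steps above can be sketched in a few lines of Python. This is only a toy illustration, not how any real filesystem is implemented: the `CowStore` class, its `write` and `read` methods, and the block names are all hypothetical, and a "block" here is just a Python bytes object.

```python
class CowStore:
    """Toy block store that never overwrites a live block in place."""

    def __init__(self, num_blocks):
        self.blocks = [None] * num_blocks   # physical storage
        self.free = set(range(num_blocks))  # indices available for allocation
        self.refs = {}                      # logical name -> physical index

    def write(self, name, data):
        new_index = self.free.pop()         # 1. allocate new space
        self.blocks[new_index] = data       # 2. write the changed block
        old_index = self.refs.get(name)
        self.refs[name] = new_index         # 3. change the reference
        if old_index is not None:
            self.free.add(old_index)        # 4. mark the original block as free
        return new_index

    def read(self, name):
        return self.blocks[self.refs[name]]


store = CowStore(num_blocks=4)
first = store.write("file.txt", b"old contents")
second = store.write("file.txt", b"new contents")
```

Note that after the second write, the block at index `first` is marked free but still physically holds the old contents until something reuses it.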

Let's say you have a disk with four slots for data on it and a single dataset (a collection of files and directories) that starts at Version 0. Each slot holds one complete version of the dataset. I know this could have been more realistic, but bear with me. So the initial disk state is as follows:

Slot  Content    Marked
1     Version 0  Active
2     Blank      Empty
3     Blank      Empty
4     Blank      Empty

Now you change something in the dataset. Instead of modifying the existing dataset, the CoW filesystem creates a modified copy of it and writes that copy into one of the free slots. So we get

Slot  Content    Marked
1     Version 0  Empty
2     Version 1  Active
3     Blank      Empty
4     Blank      Empty

And then you make another change, thus producing Version 2, and we arrive at the following:

Slot  Content    Marked
1     Version 0  Empty
2     Version 1  Empty
3     Version 2  Active
4     Blank      Empty

Slots marked Empty keep their old content until they are eventually reused. Now suppose you want a snapshot of Version 2. The system will respond like this:

Slot  Content    Marked
1     Version 0  Empty
2     Version 1  Empty
3     Version 2  Active, Snapshot of Version 2
4     Blank      Empty

Then the next change, producing Version 3, will result in

Slot  Content    Marked
1     Version 0  Empty
2     Version 1  Empty
3     Version 2  Snapshot of Version 2
4     Version 3  Active

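The next change, producing Version 4, will overwrite one of the slots marked Empty, chosen arbitrarily. Slot 3 is protected by the snapshot and Slot 4 is active, so in this example Version 4 lands in Slot 2: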

Slot  Content    Marked
1     Version 0  Empty
2     Version 4  Active
3     Version 2  Snapshot of Version 2
4     Version 3  Empty
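
The whole walk-through can be replayed with a small simulation. This is a sketch of the slot model only, under the same unrealistic assumptions as the tables; the `SlotDisk` class and its methods are made up for illustration. Unlike the tables, it picks a free slot at random from the very first write, so the slot numbers will differ from run to run, but the snapshotted version always survives.

```python
import random


class SlotDisk:
    """Toy model of the four-slot disk from the tables above."""

    def __init__(self, num_slots=4):
        self.content = [None] * num_slots               # version stored in each slot
        self.marks = [set() for _ in range(num_slots)]  # "Active" and/or "Snapshot"
        self.version = -1

    def free_slots(self):
        # A slot is free when it is neither active nor held by a snapshot.
        return [i for i, marks in enumerate(self.marks) if not marks]

    def active_slot(self):
        return next(i for i, marks in enumerate(self.marks) if "Active" in marks)

    def change(self, rng=random):
        free = self.free_slots()
        if not free:
            raise RuntimeError("no free slot left")
        target = rng.choice(free)           # any free slot, regardless of age
        self.version += 1
        self.content[target] = self.version
        for marks in self.marks:
            marks.discard("Active")         # the previous version is no longer active
        self.marks[target].add("Active")
        return target

    def snapshot(self):
        index = self.active_slot()
        self.marks[index].add("Snapshot")   # pin the current version
        return index


rng = random.Random(0)
disk = SlotDisk()
for _ in range(3):
    disk.change(rng)                        # Versions 0, 1, 2
snap_slot = disk.snapshot()                 # snapshot of Version 2
disk.change(rng)                            # Version 3
disk.change(rng)                            # Version 4
```

After these calls, `disk.content[snap_slot]` is still 2, while the slots holding older, unprotected versions may or may not have been overwritten.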

Difference between theory and practice

This was a very simplified theory.

In practice, there are no slots, and multiple versions share common data (only changes are written), not to mention a myriad of other technicalities, but the general idea holds. The system will overwrite any place declared free and will not overwrite any place assigned to an active dataset or a snapshot.

Also, there is no circular rule to overwrite the oldest data first. The system does not track the age of free blocks. All free blocks go into a single uniform pool, and the filesystem draws blocks from that pool without regard for what it overwrites.

Klennet ZFS Recovery and snapshots

Klennet ZFS Recovery is designed to look through the Content column, mostly ignoring the Marked column. During the scan, it identifies "Version X" and sorts through all intact versions.

Technically, identifying what is the snapshot of what on a damaged filesystem is a tricky task. It is often impossible due to damage or relevant metadata sections being overwritten over time.

So, ZFS Recovery is specifically designed not to care. If there is a snapshot name, good, it will read the name and try to associate it with data; but it is not required for recovery. As a side effect, you may have to look through many unnamed datasets to sort out what's what.
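
In terms of the slot model, the approach can be illustrated as follows. This is not Klennet ZFS Recovery's actual algorithm, only a sketch of the idea on the toy tables above; the `recover_versions` function, the `intact` flag, and the snapshot name `before-upgrade` are assumptions made up for this example.

```python
def recover_versions(slots):
    """Collect every intact version found on disk, ignoring the 'marked' field."""
    found = []
    for index, slot in enumerate(slots):
        if slot["content"] is None or not slot["intact"]:
            continue                        # blank or damaged: nothing to recover
        # The snapshot name is optional; unnamed versions are still returned.
        found.append((slot["content"], index, slot.get("name")))
    return sorted(found, key=lambda item: item[0])


# Final state from the walk-through, with Slot 4 assumed to be damaged.
slots = [
    {"content": 0, "marked": "Empty", "intact": True},
    {"content": 4, "marked": "Active", "intact": True},
    {"content": 2, "marked": "Snapshot", "intact": True, "name": "before-upgrade"},
    {"content": 3, "marked": "Empty", "intact": False},
]
recovered = recover_versions(slots)
```

Here Version 0 is recovered even though its slot was marked Empty, and it comes back without a name, which is why you may have to inspect several unnamed datasets by hand.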

Created Wednesday, April 19, 2023