Recovery of large ZFS pools

Recovery of large ZFS pools requires proper planning because large amounts of data must be processed and transferred. Generally, recovery of any pool larger than 100 TB in raw capacity, or consisting of more than ten hard drives, should be approached with extra care.

Having observed quite a number of recoveries, some involving petabyte-sized pools, I have compiled a list of guidelines learned through that experience.

Hardware considerations

If you had a custom-built machine running your pool before the failure, and it still runs satisfactorily, it is probably good enough to run the recovery. However, make sure the hardware is functioning properly. If a hardware problem, such as bad RAM or a disk controller transient, ruined your pool, nothing good will come of attempting the recovery on that very same hardware. If the machine itself is not at fault, add an extra SSD, or even a USB stick, with Windows on it, and run the recovery from that.

Hardware requirements for the recovery are not exact and vary significantly depending on the specifics of the filesystem. Among the factors are the number of files, average file size, record size, compression method, and filesystem age. However, the following general guidelines apply:

CPU

Generally, one core per four hard drives provides close to optimal performance. However, there is little point in going past 16 cores since the disk controller is probably saturated by that time.
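The rule of thumb above amounts to a quick calculation; a minimal sketch (the function name and the hard cap at 16 are just this article's heuristic, not anything the software enforces):

```python
import math

def recommended_cores(num_drives: int) -> int:
    """One CPU core per four hard drives, capped at 16 cores,
    past which the disk controller is likely the bottleneck."""
    return min(16, max(1, math.ceil(num_drives / 4)))

print(recommended_cores(12))  # 3 cores for a 12-drive pool
print(recommended_cores(80))  # capped at 16
```

So a typical 24-drive pool lands at six cores, well inside commodity CPU territory.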

RAM

The more, the merrier, up to 128 GB or so; beyond that, additional RAM brings little benefit. 32 GB is the minimum, and it must be paired with an SSD-backed swap file. I recommend 64 GB, and 128 GB should be quite enough even for a petabyte-sized pool.
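The tiers above can be condensed into a simple lookup. The 100 TB cut-off below is my own illustrative threshold; the text itself only fixes 32 GB as the floor (with SSD swap), 64 GB as a safe default, and 128 GB as enough up to a petabyte:

```python
def recommended_ram_gb(pool_raw_tb: float) -> int:
    """Map raw pool size to a RAM recommendation following the tiers
    above. 32 GB remains the absolute floor, and only with an
    SSD-backed swap file; this function returns the comfortable size."""
    if pool_raw_tb <= 100:
        return 64   # safe default for most pools
    return 128      # petabyte-scale; more brings little benefit

print(recommended_ram_gb(40))    # 64
print(recommended_ram_gb(800))   # 128
```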

Disk controller(s)

Whatever you used to access your disks originally is probably good enough if it still works. Few, if any, desktop-level motherboards have more than ten SATA ports, so with more than ten disks you will be looking for an extra controller. Any PCI-E SAS/SATA HBA is fine. Use a dumb HBA, or a RAID controller flashed with IT-mode firmware (which essentially converts the RAID card into a dumb adapter).

You can theoretically use a RAID controller that exposes individual disks, either as single-disk JBODs or as single-disk RAID0 volumes, but I strongly recommend you don't. RAID controllers write their own metadata onto the disks, overwriting the original content, and no matter how small the change, you don't want it.

If you are using an add-on controller, make sure it is in the correct PCI-E slot and that the slot is properly configured in the BIOS. Some motherboards have selectable PCI-E slot configurations, something like x16/x8/x1 vs. x16/x4/x4, and it is possible to end up with an x4 slot working at x1 speed, or an x8 slot working at x4 speed. Make sure you have it configured right.

Network

It is quite likely that you will use the network to copy the data out. Use whatever hardware setup you have; there are no specific requirements, although, obviously, a slower network means a slower copy. Also, make sure you read this page for an important quirk of Windows network access control.
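To get a feel for how much the network matters, here is a back-of-envelope estimate of the copy-out time. The 80% efficiency factor is an assumption standing in for protocol overhead, not a measured figure:

```python
def copy_out_days(data_tb: float, link_mbit_s: float,
                  efficiency: float = 0.8) -> float:
    """Rough days needed to copy data_tb terabytes over a link rated
    link_mbit_s megabits per second, at the assumed sustained efficiency."""
    bytes_total = data_tb * 1e12
    bytes_per_sec = link_mbit_s * 1e6 / 8 * efficiency
    return bytes_total / bytes_per_sec / 86400

print(f"{copy_out_days(100, 1_000):.1f}")   # ~11.6 days over 1 GbE
print(f"{copy_out_days(100, 10_000):.1f}")  # ~1.2 days over 10 GbE
```

In other words, copying 100 TB over gigabit Ethernet takes the better part of two weeks; a 10 GbE link cuts that to about a day.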

Software considerations

Avoid running recoveries inside virtual machines. This applies to all recoveries, not just large ZFS pools. You want to avoid extra complexity between hardware and software doing the recovery. Handling of bad sectors, transient conditions, and other hardware issues becomes less predictable as additional levels of software are stacked upon each other.

Klennet ZFS Recovery requires Windows, so you will be running Windows. Install all Windows updates and then disable automatic updates. You do not want the system rebooting for updates halfway through the recovery.

Plan on running no applications other than ZFS Recovery on the machine, or at most very few. While keeping Task Manager open is quite all right, rendering video on the same system is certainly not. ZFS Recovery works best when it has the entire machine to itself.

As a special case of running no other applications, do not run two recoveries simultaneously. Even if the system looks sufficiently powerful, two concurrent recoveries compete for RAM for their caches, for disk access, and for who knows what else.

Expected speed

If everything is set up correctly, with ten or more disks, expect speeds between 1500 and 2500 MB/sec during the initial stage of the scan. If you do not get speeds like this, something is, in all likelihood, wrong with your setup.
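Those throughput figures translate into run times for one pass over the raw capacity roughly as follows. This is simple arithmetic, not a prediction for any specific pool:

```python
def scan_hours(pool_raw_tb: float, mb_per_sec: float) -> float:
    """Hours for one sequential pass over the raw capacity at the
    given aggregate throughput."""
    return pool_raw_tb * 1e12 / (mb_per_sec * 1e6) / 3600

print(f"{scan_hours(100, 2500):.1f}")  # ~11.1 hours at 2500 MB/sec
print(f"{scan_hours(100, 1500):.1f}")  # ~18.5 hours at 1500 MB/sec
```

So a healthy setup gets the initial scan of a 100 TB pool done in well under a day; if your projection runs into weeks, revisit the hardware setup first.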

If you want to do a small-scale test before running a full-scale recovery, don't. Small-scale tests are very difficult to construct to reflect your large case properly.

The above recommendations are by no means complete. Different cases present varying challenges, and no document can describe all the quirks one can encounter in the real world. If you have any questions about your specific case, send a support request, and I will take a look.

Filed under: ZFS.

Created Thursday, December 26, 2019