Recovery of large ZFS pools

Recovery of large ZFS pools requires proper planning, because large amounts of data must be processed and transferred. As a rule of thumb, any pool larger than 100 TB in raw capacity, or consisting of more than ten hard drives, counts as large and its recovery should be planned carefully.

After observing quite a number of recoveries, some involving petabyte-sized pools, I compiled a list of guidelines I have learned through this experience.

Hardware considerations

If you had a custom-built machine running your pool prior to failure, and it still runs satisfactorily, it is probably good enough to run the recovery. However, make sure the hardware is functioning properly. If your pool was ruined by a hardware problem, such as bad RAM or some kind of disk controller transient, nothing good will come from attempting recovery with the very same hardware. If there is no fault with the machine, put in an additional SSD, or even a USB stick, with Windows on it, and run the recovery from it.

Hardware requirements for the recovery are not exact, and vary significantly depending on the specifics of the filesystem. The factors include the number of files, average file size, record size, compression method, and filesystem age. However, the following general guidelines apply:

CPU

Generally, one core per four hard drives provides close to optimal performance. However, there is not much point in going past 16 cores, since by that point the disk controller is probably saturated.

RAM

The more the merrier, up to 128 GB or so; there is really no point in going beyond that. 32 GB is the minimum, and it must be accompanied by an SSD-backed swap file. I recommend 64 GB, and 128 GB should be quite enough for pools up to a petabyte in size.
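As a rough illustration, the CPU and RAM rules of thumb above can be written as a small Python helper. This is only a sketch of the guidelines as stated here; the function names and the wording of the returned labels are hypothetical and not part of any tool.

```python
import math


def suggested_cores(num_drives: int) -> int:
    """One core per four drives, rounded up, capped at 16 cores."""
    if num_drives < 1:
        raise ValueError("num_drives must be at least 1")
    return min(16, math.ceil(num_drives / 4))


def ram_assessment(ram_gb: int) -> str:
    """Classify installed RAM against the 32 / 64 / 128 GB guideline."""
    if ram_gb < 32:
        return "below minimum"
    if ram_gb < 64:
        return "minimum, needs an SSD-backed swap file"
    if ram_gb < 128:
        return "recommended"
    return "ample, more adds little"


if __name__ == "__main__":
    for drives in (10, 24, 90):
        print(drives, "drives ->", suggested_cores(drives), "cores")
    for ram in (16, 32, 64, 256):
        print(ram, "GB ->", ram_assessment(ram))
```

Rounding up in suggested_cores is an assumption made for the sketch; the guideline itself only gives the ratio and the 16-core ceiling.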

Disk controller(s)

Whatever was used to access your disks originally is probably good, if it still works. Few, if any, desktop-level motherboards have more than ten SATA ports, so with more than ten disks you will be looking for an extra controller. Any PCI-E SAS/SATA HBA is fine. Try to use dumb HBAs, or RAID controllers flashed with IT-mode firmware (which essentially converts a RAID card into a dumb adapter).

You can theoretically use RAID controllers which expose individual disks either as a single-disk JBOD or a single-disk RAID0, but I strongly recommend you don’t. This is because RAID controllers will write their metadata onto disks, overwriting the original content. No matter how small the change, you don’t want it.

If you are using an add-on controller, make sure it is in the correct PCI-E slot and the slot is properly configured in BIOS. Some motherboards have selectable configurations of PCI-E slots, something like x16/x8/x1 vs. x16/x4/x4, and it is possible to have an x4 slot working at x1 speed, or an x8 slot working at x4 speed. Make sure you have it configured right.
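To see why a slot running at reduced width matters, here is a minimal Python sketch that compares the approximate bandwidth of a PCI-E link with a target read speed. The per-lane figures are the commonly quoted approximate usable rates per PCI-E generation, the 2500 MB/sec default target is taken from the upper end of the scan speeds mentioned in this article, and the function names are hypothetical. Real throughput also depends on the controller and the disks.

```python
# Approximate usable bandwidth per lane, in MB/sec, by PCI-E generation.
PER_LANE_MB_S = {1: 250, 2: 500, 3: 985, 4: 1969}


def link_bandwidth_mb_s(generation: int, lanes: int) -> int:
    """Approximate one-direction bandwidth of a PCI-E link."""
    return PER_LANE_MB_S[generation] * lanes


def slot_is_bottleneck(generation: int, lanes: int,
                       target_mb_s: int = 2500) -> bool:
    """True if the link cannot carry the target read speed."""
    return link_bandwidth_mb_s(generation, lanes) < target_mb_s


if __name__ == "__main__":
    # The same PCI-E 3.0 HBA in a slot running at x8, x4, and x1.
    for lanes in (8, 4, 1):
        bw = link_bandwidth_mb_s(3, lanes)
        print(f"PCI-E 3.0 x{lanes}: ~{bw} MB/sec, "
              f"bottleneck: {slot_is_bottleneck(3, lanes)}")
```

Under these assumptions, a PCI-E 3.0 card that ends up at x1 tops out below 1000 MB/sec, well under what ten or more disks can deliver.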

Network

It is quite likely that you will use the network to copy the data out. Use whatever hardware setup you have at hand; there are really no requirements. Obviously, a slower network means a slower copy. Also, make sure you read this page for an important quirk of Windows network access control.
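To put "slower network means slower copy" into numbers, the following Python sketch estimates how long a copy takes at a given link speed. It assumes decimal units (1 TB = 1,000,000 MB, 1 Gbit/sec = 125 MB/sec) and an optional efficiency factor for protocol overhead; both the function name and the efficiency parameter are hypothetical, and real copies of many small files will usually be slower than this estimate.

```python
def copy_hours(data_tb: float, link_gbit_s: float,
               efficiency: float = 1.0) -> float:
    """Estimated hours to copy data_tb terabytes over a network link.

    efficiency is the fraction of the line rate actually achieved
    (1.0 means the theoretical maximum).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    data_mb = data_tb * 1_000_000
    rate_mb_s = link_gbit_s * 125 * efficiency
    return data_mb / rate_mb_s / 3600


if __name__ == "__main__":
    for link in (1, 10):
        hours = copy_hours(100, link)
        print(f"100 TB over {link} Gbit/sec: about {hours:.0f} hours")
```

Even at the theoretical line rate, 100 TB takes more than nine days over a 1 Gbit/sec link and just under one day over 10 Gbit/sec.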

Software considerations

Avoid running recoveries inside virtual machines. This applies to all recoveries, not just large ZFS pools. You don’t want an extra level of complexity between the hardware and the software doing the recovery. Handling of bad sectors, transient conditions, and other hardware issues of all sorts becomes less predictable as additional levels of software are stacked upon each other.

Klennet ZFS Recovery requires Windows to run, so you will be running Windows. Install all Windows updates and then disable automatic update. You do not want the system rebooting for updates halfway through the recovery.

Plan on running no other applications, or only a limited number of them, alongside ZFS Recovery on the machine. While running Task Manager is quite all right, rendering video on the same system is certainly not. ZFS Recovery works best when it has the entire machine to itself.

As a special case of running no other applications during recovery, do not run two recoveries at the same time. Even though the system may look sufficiently powerful, two concurrent recoveries compete with each other for RAM for their caches, for disk access, and for who knows what else.

Miscellanea

If everything is set up correctly, with ten or more disks expect speeds between 1500 and 2500 MB/sec during the initial stage of the scan. If you do not get speeds like this, something is in all likelihood wrong with your setup.
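The expected speed range can be used for a quick sanity check and a rough time estimate. The Python sketch below assumes, purely for illustration, that the initial stage reads the full raw capacity of the pool once at a constant speed; that assumption, the decimal units (1 TB = 1,000,000 MB), and the function names are not taken from the software itself.

```python
EXPECTED_MIN_MB_S = 1500
EXPECTED_MAX_MB_S = 2500


def speed_looks_healthy(observed_mb_s: float) -> bool:
    """True if the observed scan speed reaches the expected minimum."""
    return observed_mb_s >= EXPECTED_MIN_MB_S


def initial_scan_hours(raw_tb: float, speed_mb_s: float) -> float:
    """Hours to read raw_tb terabytes once at a constant speed."""
    return raw_tb * 1_000_000 / speed_mb_s / 3600


if __name__ == "__main__":
    raw_tb = 100
    fastest = initial_scan_hours(raw_tb, EXPECTED_MAX_MB_S)
    slowest = initial_scan_hours(raw_tb, EXPECTED_MIN_MB_S)
    print(f"{raw_tb} TB: roughly {fastest:.0f} to {slowest:.0f} hours")
    print("600 MB/sec healthy?", speed_looks_healthy(600))
```

Under these assumptions a 100 TB pool needs roughly half a day to a full day for a single pass, which is why a misconfigured setup running at a fraction of the expected speed is worth fixing before starting.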

If you want to do a small-scale test before running a full-scale recovery, don’t. Small-scale tests are very difficult to construct to properly reflect your large case.

The above recommendations are by no means complete. Different cases present varying challenges, and no document can describe all the quirks one can encounter in the real world. If you have any questions about your specific case, send a support request and I will take a look.

Created Thursday, December 26, 2019
