Monitoring disk activity during recovery

During the disk scan, analysis, and file copy, Klennet ZFS Recovery provides an overview of disk activity. It is useful to keep an eye on it, or at least look at it every now and then. The disk activity view helps you

  • see if there are any problems reading the disk, and
  • identify the offending disk, within reason (see Tricks and Quirks below).

Disk activity view

The disk activity view is a table showing certain parameters for each of the disks involved in the recovery. The data is updated once per second. The columns are:

  • Disk name, rather obviously.
  • Response time. Time it took the disk to process the request, in milliseconds. Indicated value is the largest one seen during the last second (since the last update).
  • % Time. Indicates disk utilization, averaged over the last one-second interval. 100% indicates the disk was busy all the time, and 0% indicates the disk was idle.
  • Speed and IOPS. Average read speed, in bytes per second, and IOPS (number of requests per second) values. These are mostly provided for entertainment.
  • Delays. This is the important metric. Every time the disk takes longer than 500 ms to process a request, the delay counter is increased. This does not always mean there is a bad block, just that it took unusually long to process the request. A high delay count (100 or more) on a single disk almost always indicates a faulty disk. There are a few exceptions, though - see Tricks and Quirks below.
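
To make the columns concrete, here is a minimal sketch of how such per-disk figures could be derived from the requests completed during one update interval. It is not the actual Klennet implementation. The class name, the field names, and the shape of the input are all hypothetical, the threshold is passed in rather than fixed, and the utilization figure assumes requests to a disk do not overlap in time.

```python
from dataclasses import dataclass


@dataclass
class DiskStats:
    """Hypothetical per-disk row: figures for the last interval plus a running Delays count."""
    name: str
    delay_threshold_ms: float      # requests slower than this count as a delay
    delays: int = 0                # cumulative, never reset
    response_ms: float = 0.0       # largest response time seen in the last interval
    busy_percent: float = 0.0      # utilization over the last interval
    bytes_per_sec: float = 0.0
    iops: float = 0.0

    def update(self, requests, interval_s=1.0):
        """requests: list of (duration_ms, bytes_read) completed in the interval."""
        durations = [duration for duration, _ in requests]
        self.response_ms = max(durations, default=0.0)
        # Assumes requests are serialized, so busy time is the sum of durations.
        busy_ms = sum(durations)
        self.busy_percent = min(100.0, 100.0 * busy_ms / (interval_s * 1000.0))
        self.bytes_per_sec = sum(size for _, size in requests) / interval_s
        self.iops = len(requests) / interval_s
        self.delays += sum(1 for d in durations if d > self.delay_threshold_ms)


disk = DiskStats("Disk4", delay_threshold_ms=500.0)
disk.update([(12.0, 65536), (900.0, 65536)])   # one fast and one slow request
print(disk.delays, disk.response_ms)           # 1 900.0
```

Note that the response time is a per-interval maximum and drops back as soon as the disk behaves, while Delays only ever grows. That is why Delays is the figure to watch over a long run.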

When Delays increases, the corresponding drive is highlighted in red. The highlight then decays over several seconds. You can see it on the screenshot below (Disk4 is listed in dark red). The response time was already back to normal at the time I took the screenshot, and the highlighting is already at half brightness.

Klennet ZFS Recovery disk activity view.

Tricks and quirks

1. In some cases, there is mutual interference between the disks attached to the same controller. When a disk locks up while trying to work around a bad sector, it may cause the entire bus to lock up. Then, all the disks on the same bus will delay their requests until the faulty disk completes its retry attempts. This will be logged in disk activity view as several disks having a problem simultaneously. This especially applies to USB-to-SATA converters, which I strongly recommend you avoid like the plague. If you see this happening, examine S.M.A.R.T. data on all the disks and identify which one is faulty.

2. If you use power saving of some kind, some or even all of the disks may be asleep and spun down when you start either analysis or copying. The first read request causes the disks to spin up, and this certainly takes longer than the 500-millisecond threshold. So each spin-up will increase Delays by one. This is why you should not be worried if you see a Delays value of one or two, especially immediately after you start scanning or copying.

3. The sum of % Time values for all disks does not always match the % Disk Time value displayed in the overall performance overview at the bottom of the window. This is because the sampling for the two is done at different points in time and at different frequencies. Also, if you are reading sparse VHD or VHDX disk images, the values may not match because of how sparse areas are counted.

Identifying a faulty disk

If one of the disks indicates Delays much higher than all others, and all disks are in use, then it is the most likely candidate.
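
As a rough illustration of "much higher than all others", the check can be sketched as below. The function name and both cut-offs are my assumptions for the example, not values the tool uses: `minimum=100` borrows the "100 or more" rule of thumb from the Delays description, and `ratio=10` is arbitrary.

```python
def most_likely_faulty(delays, ratio=10, minimum=100):
    """Return the name of the disk whose Delays count stands out, or None.

    delays: dict mapping disk name -> Delays count, for disks that are all in use.
    ratio and minimum are illustrative cut-offs, not values used by the tool.
    """
    if len(delays) < 2:
        return None
    ranked = sorted(delays.items(), key=lambda item: item[1], reverse=True)
    (top_name, top), (_, runner_up) = ranked[0], ranked[1]
    # The top disk must be high in absolute terms and far ahead of the next one.
    if top >= minimum and top >= ratio * max(runner_up, 1):
        return top_name
    return None


print(most_likely_faulty({"Disk1": 1, "Disk2": 0, "Disk4": 340}))  # Disk4
print(most_likely_faulty({"Disk1": 150, "Disk2": 140}))            # None
```

When the function returns None even though the counts are high, you are in the several-disks case, and the counters alone will not settle it.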

If there are several disks with a high Delays value, or if you see multiple disks locking up at the same time, suggesting cross-disk interference, it is time to run a S.M.A.R.T. check. Pull the attributes from every disk and examine the raw values.

  • On Linux, use smartctl.
  • On Windows, I recommend Hard Disk Sentinel for its ability to see through many of the RAID controllers.

Look at the Raw columns for the attributes, not Value columns. Check the following attributes:

  • Reallocated Sectors Count
  • Current Pending Sectors Count
  • Offline Uncorrectable Sectors Count

Any disk which has a non-zero raw value for any of these attributes should be considered faulty and not used.
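
On Linux, this check can be scripted. The sketch below assumes ATA disks and the usual ten-column attribute table printed by `smartctl -A`. It matches the three attributes by their customary IDs (5, 197, 198) because attribute names vary between vendors. The function names are hypothetical, and the sample table is made up for illustration, not taken from a real disk. NVMe and SAS disks report health differently and are not covered.

```python
import subprocess

# Customary ATA attribute IDs for the three attributes listed above.
WATCHED = {
    5: "Reallocated Sectors Count",
    197: "Current Pending Sectors Count",
    198: "Offline Uncorrectable Sectors Count",
}


def nonzero_raw(smartctl_output):
    """Return {attribute: raw value} for watched attributes with a non-zero raw value."""
    found = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID; the raw value is the 10th column.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id = int(fields[0])
        if attr_id in WATCHED and fields[9].isdigit() and int(fields[9]) != 0:
            found[WATCHED[attr_id]] = int(fields[9])
    return found


def check_disk(device):
    """Run smartctl on a device such as /dev/sda (usually requires root)."""
    # smartctl uses its exit status as a bit mask, so a non-zero status is not treated as an error.
    result = subprocess.run(["smartctl", "-A", device], capture_output=True, text=True)
    return nonzero_raw(result.stdout)


# Made-up sample in the usual layout, for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""

print(nonzero_raw(SAMPLE))  # {'Current Pending Sectors Count': 8}
```

An empty result only means these three counters are zero. It does not prove the disk is healthy.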

It is also quite possible for a disk to have good S.M.A.R.T. attributes and still be faulty. There is no replacing human judgement. If in doubt, file a support request.

Possible corrective actions

One way or another, you need to eliminate disks which do not work from the process, and replace them with something which works. There are three common options:

  1. Replace faulty disk with a disk image file,
  2. replace it with its clone on another disk, or
  3. redundancy permitting, exclude the disk from the analysis altogether (I don't recommend it, though).

I generally recommend using a disk image file (a copy of the entire disk content in a file) instead of a clone (a copy of the entire damaged disk on a new good disk).

  • If you go with a clone, and the clone drive is larger than the original drive, make sure the clone is zeroed beforehand. Otherwise, residual data still on the clone may contaminate the analysis.
  • If you choose a disk image file, make sure each disk image file is stored on its own physical drive. Also, do not place multiple disk image files onto a single large RAID array. The recovery process is highly parallelized, and having two parallel requests compete for a physical drive kills performance.
  • In any case, you will need software to create either a clone or a disk image file.
    • On Linux, use ddrescue.
    • On Windows, I recommend my own Klennet Disk Imager, but there are plenty more. For a one-time job, you don't need a license key, just use the demo.
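
The "one image file per physical drive" rule can be partially checked with a script. The sketch below groups image files by the device ID the operating system reports for each path. The function name and the example paths are hypothetical. Keep in mind that this ID identifies a volume, not a physical drive, so two partitions on one drive, or a volume that sits on a RAID array, will not be caught. Treat it as a first filter only.

```python
import os
from collections import defaultdict


def images_sharing_a_volume(image_paths, device_of=None):
    """Return groups of image files that live on the same volume.

    device_of maps a path to a device identifier; by default os.stat().st_dev.
    """
    if device_of is None:
        device_of = lambda path: os.stat(path).st_dev
    by_device = defaultdict(list)
    for path in image_paths:
        by_device[device_of(path)].append(path)
    # Only volumes holding more than one image are a problem.
    return [paths for paths in by_device.values() if len(paths) > 1]


# Illustration with a stand-in lookup table instead of real files.
fake_devices = {"/mnt/a/disk1.img": 1, "/mnt/a/disk2.img": 1, "/mnt/b/disk3.img": 2}
print(images_sharing_a_volume(fake_devices, device_of=fake_devices.get))
# [['/mnt/a/disk1.img', '/mnt/a/disk2.img']]
```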

Another option is to evaluate how much redundancy you had before the pool failed, how much you still have, and if you can tolerate losing one more drive. For example, if you had 10 drives in a RAIDZ2 setup, and you have all the drives, and determined that only one of them is faulty, you may exclude the offending drive from the analysis, probably with no ill effects. However, it is difficult to be sure there will be no ill effects. What if there were some resilver/rebuild attempts? Is one of the drives slightly out of sync? The complexity of ZFS makes the decision difficult, so I would not recommend it. Better stick to disk image files.
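
The arithmetic behind the RAIDZ2 example is simple enough to write down. RAIDZ1, RAIDZ2, and RAIDZ3 tolerate the loss of one, two, and three drives per vdev, respectively, regardless of how many drives the vdev has. The function name is hypothetical, and the sketch deliberately ignores the out-of-sync and resilver questions raised above, which is exactly why the number it returns is an upper bound and not a guarantee.

```python
# Parity drives per vdev for each RAIDZ level.
PARITY = {"raidz1": 1, "raidz2": 2, "raidz3": 3}


def remaining_redundancy(layout, missing_or_excluded):
    """Drives a single vdev can still lose after some are missing or excluded.

    A negative result means the vdev no longer has enough drives to reconstruct data.
    """
    return PARITY[layout] - missing_or_excluded


# The example from the text: RAIDZ2, all drives present, one excluded as faulty.
print(remaining_redundancy("raidz2", 1))  # 1
```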