Reviews, comparison testing, and challenges in data recovery

There is a problem with data recovery software reviews and testing. Reviews are fine when they talk features, user interface, and pricing, but not quite so fine when it comes to recovery capabilities. The problem with testing recovery capabilities is that nobody knows exactly how to do it. The seemingly straightforward idea is to put multiple damaged drives through recovery and see how different software fares against the set of tasks. However, nobody seems to know what makes a good test set. There are several approaches to testing, covering the entire spectrum from completely artificial cases to real-life ones.

Artificially created test sets

Artificial test cases are crafted by either generating the disk image from scratch or inserting some well-defined data into the disk image of a blank filesystem.
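
For illustration, here is a minimal sketch of the second option, in Python: known payloads are written at chosen cluster-aligned offsets into a blank image, so the "right answer" for any later recovery attempt is known exactly. The image size, cluster size, offsets, and payloads are all arbitrary assumptions, not a real benchmark.

    # Toy example: build a test image with known payloads at known cluster offsets.
    # Sizes, offsets, and payloads are arbitrary placeholders, not a real benchmark.
    import hashlib

    IMAGE_SIZE = 64 * 1024 * 1024      # 64 MB blank "disk"
    CLUSTER_SIZE = 4096                # assumed cluster size

    payloads = {
        1000: b"\xff\xd8\xff\xe0" + b"JPEG-like test data" * 100 + b"\xff\xd9",
        5000: b"%PDF-1.4\n" + b"PDF-like test data\n" * 200 + b"%%EOF",
    }

    ground_truth = {}
    with open("test.img", "wb") as img:
        img.truncate(IMAGE_SIZE)                 # sparse blank image
        for cluster, data in payloads.items():
            offset = cluster * CLUSTER_SIZE
            img.seek(offset)
            img.write(data)
            ground_truth[offset] = hashlib.sha256(data).hexdigest()

    # ground_truth maps byte offsets to content hashes, so any carving
    # result can be checked against exactly what was planted.
    print(ground_truth)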

The DFRWS challenges from 2006 and 2007 are good examples of artificially generated test sets used to investigate one specific aspect of data recovery. Except they got cluster sizes wrong: even back in 2006, nobody used 512 bytes per cluster. Is this important? Yes, because carving speed depends on the cluster size, and so, to some extent, does output quality. The two DFRWS challenges, by the way, are yet to be completely solved (as of summer 2018).
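To see why cluster size matters for speed, consider that a carver normally only needs to test file signatures at cluster boundaries, so 512-byte clusters mean eight times as many candidate offsets as 4 KB clusters. Here is a rough sketch of such a signature scan, run against the hypothetical test.img from the previous example:

    # Rough sketch: count candidate carving offsets and look for JPEG headers
    # at cluster boundaries. "test.img" and the 512/4096 values are assumptions.
    JPEG_SOI = b"\xff\xd8\xff"

    def scan(path, cluster_size):
        # Test the signature only at cluster boundaries, as a carver would.
        hits = []
        offset = 0
        with open(path, "rb") as f:
            while True:
                f.seek(offset)
                header = f.read(len(JPEG_SOI))
                if len(header) < len(JPEG_SOI):
                    break
                if header == JPEG_SOI:
                    hits.append(offset)
                offset += cluster_size
        return hits, offset // cluster_size

    for cs in (512, 4096):
        hits, candidates = scan("test.img", cs)
        print(f"cluster={cs}: {candidates} candidate offsets, {len(hits)} JPEG headers")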

The big problem with crafting test cases is that nobody knows how close they are to real-world cases. Moreover, it is rarely clear what properties are to be reproduced and to what precision. The only sure way to list all the factors affecting the recovery is to study the algorithms involved, but these are generally not available for study.

On the other side of the coin, data recovery software can theoretically be tuned to whatever benchmarks are available. Such tuning will almost invariably cause performance to degrade on real-life recoveries. I'm not aware of this ever being done in practice, though, most likely because there are no widely accepted and established benchmarks in data recovery.

The use of artificial test sets is the most scientific approach, in which all known factors are controlled, and as such it is often used in academic research.

Real filesystems created for test purposes

Another approach is to put some test data onto a real blank filesystem (on a USB flash drive, for example), do some known damage to it (delete some files or format the drive), and then run recovery on it to see how much you can recover. Whichever software recovers the most is then considered the best.
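If the comparison is to be more than eyeballing file counts, a simple scoring script helps. A minimal sketch, assuming the original test files and the recovery output sit in hypothetical "originals" and "recovered" directories:

    # Minimal sketch of scoring a recovery run: compare recovered files against
    # the known originals by content hash. Directory names are hypothetical.
    import hashlib
    from pathlib import Path

    def hashes(directory):
        # Return the set of SHA-256 digests of all files under a directory.
        return {
            hashlib.sha256(p.read_bytes()).hexdigest()
            for p in Path(directory).rglob("*") if p.is_file()
        }

    original = hashes("originals")     # files copied to the drive before damage
    recovered = hashes("recovered")    # output of the recovery software

    score = len(original & recovered) / len(original)
    print(f"Recovered {score:.0%} of the original files byte-for-byte")

Note that this only credits byte-exact recoveries: a file that comes back under a different name still counts, while a partially recovered file does not.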

This middle-ground approach is arguably the best overall, but again, some factors can't be easily replicated:

  1. Allocation patterns on a newly formatted blank filesystem differ from those of an old, used filesystem. A blank filesystem has little fragmentation, if any, and what fragmentation there is follows a very predictable pattern. A used filesystem has bits of partly overwritten files scattered around, and its fragmentation patterns are nothing like those of a blank filesystem. This phenomenon is called filesystem aging, and it is difficult to simulate accurately (a toy simulation sketch follows this list).
  2. Different filesystem implementations have different aging behavior and different allocation strategies. The standard Windows FAT filesystem driver behaves differently from the drivers used in photo or dash cameras. There is a difference even in the simplest use cases: copying a video file to a memory card may not be the same as a dash camera recording that same video onto the card. Sure, the files are identical, but their placement on the media may differ enough to matter.
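To make the aging point a bit more concrete, here is a toy sketch (not a model of any real driver) that runs random create/delete cycles against a naive lowest-free-cluster allocator and then counts how many surviving files ended up fragmented. All the numbers are made up.

    # Toy filesystem-aging sketch: random create/delete cycles on a naive
    # lowest-free-cluster allocator. Numbers are arbitrary; real drivers differ.
    import random

    TOTAL_CLUSTERS = 5_000
    free = set(range(TOTAL_CLUSTERS))
    files = {}                         # file id -> ordered list of clusters
    random.seed(1)

    def create(fid, size):
        # Allocate the lowest-numbered free clusters, like a very naive driver.
        clusters = sorted(free)[:size]
        free.difference_update(clusters)
        files[fid] = clusters

    def delete(fid):
        free.update(files.pop(fid))

    for i in range(2_000):
        if files and random.random() < 0.45:
            delete(random.choice(list(files)))
        elif len(free) > 100:
            create(i, random.randint(10, 100))

    # A surviving file is fragmented if its clusters are not contiguous.
    fragmented = sum(
        1 for clusters in files.values()
        if any(b - a != 1 for a, b in zip(clusters, clusters[1:]))
    )
    print(f"{fragmented} of {len(files)} surviving files are fragmented")

Even this crude model produces noticeable fragmentation after enough churn, which a freshly formatted test drive never shows.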

This approach is what reviewers use. In simple recoveries, like unformatting a drive, they actually do create realistic scenarios, but performance in these simple scenarios is pretty much the same across first-line data recovery software. As scenarios grow more complex, like RAID recoveries, the quality of test cases and reviews inevitably declines. I don't think anybody cares, though.

Benchmarking on real-life cases

The last and most realistic approach is to take real-life damaged media, make a disk image from it, and run the image through whatever recovery software is being tested.

This option is only available to data recovery technicians. Real-life cases vary immensely: some need mechanical repair and no software intervention at all, some are outright unrecoverable, and only a limited set of cases is suitable for testing software. Real-life cases are hard to come by for a layman. Also, technicians, with their specialized knowledge, are in the best position to tweak software parameters and settings for the best results in each case. So their reviews (which are few and far between, although you can find some on specialized forums) are usually the most comprehensive.

Benchmarking on real-life cases is what data recovery technicians use. They have an abundance of cases to test on, and in the process they determine which software works best and whether it fits their established processes and procedures. However, as far as I'm aware, technicians invariably use all sorts of different software depending on their understanding of each case.

Filed under: Benchmarks.

Created Tuesday, June 5, 2018