On the complexity of file carving

This is a fairly simple recap of computational requirements for different file carving methods.

There are several approaches to carving with no filesystem metadata available:

  1. The simplest is header-footer (or header-length) carving: identify file headers and footers where possible, and treat everything in between as file content. If footers cannot be defined, or are computationally expensive to search for, a fixed amount of data after the header is taken instead.
  2. Bifragment gap carving, where files are assumed to consist of two fragments with a gap between them. All combinations of gap position and length are tested to find the one that produces the best, or the only, valid file.
  3. Full match carving, where every block on the disk is tested against every candidate file to determine which file, if any, it belongs to.
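As an illustration of the first approach, here is a minimal Python sketch of header-footer carving for a single file type. The JPEG signatures and the fixed-length fallback limit are illustrative assumptions, not taken from any particular tool:

```python
def carve_header_footer(image: bytes, header: bytes, footer: bytes,
                        max_len: int = 10 * 1024 * 1024) -> list[bytes]:
    """Single linear pass over the image: every header..footer span
    is reported as one carved file."""
    files = []
    pos = 0
    while True:
        start = image.find(header, pos)
        if start == -1:
            break  # no more headers on the drive
        end = image.find(footer, start + len(header))
        if end == -1 or end - start > max_len:
            # No footer (or it is implausibly far): fall back to
            # carving a fixed amount of data after the header.
            files.append(image[start:start + max_len])
            pos = start + len(header)
        else:
            files.append(image[start:end + len(footer)])
            pos = end + len(footer)
    return files


# Toy example: one JPEG-like file embedded in zero filler.
disk = b"\x00" * 16 + b"\xff\xd8\xff" + b"JPEGDATA" + b"\xff\xd9" + b"\x00" * 16
carved = carve_header_footer(disk, b"\xff\xd8\xff", b"\xff\xd9")
```

Each byte of the image is visited a bounded number of times, which is the linear behavior discussed below.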

Now let's see what the computational requirements are.

  1. The time required for header-footer carving is linearly proportional to the size of the source drive. This is why all common data recovery tools implement header-footer carving. The software has to scan the entire drive anyway and can do carving at negligible additional cost.
  2. The time required for bifragment gap carving is the number of attempts multiplied by the time per attempt. The number of attempts is proportional to the number of files (N), the single file size (S), and the maximum allowed gap length (G). Validation time (the time spent per attempt) is most often proportional to the file size (S). Now it is easy to note several things:
    1. N*S, the number of files multiplied by the file size, equals the media capacity C.
    2. Maximum allowed gap length G increases with capacity but is not directly proportional to it. Actually, it is more like G ~ log(C)
    This finally works out to a total time proportional to N*S*G*S = C*log(C)*S; taking the average file size S to grow roughly as sqrt(C), that is C^1.5*log(C).
  3. Full match carving is the most comprehensive approach, producing the best possible results theoretically, but at the same time, the most expensive as far as computation time goes. The number of attempts required is proportional to the number of files (N) and media capacity (C). Time per attempt is, similar to the previous case, proportional to single file size (S).
    The final tally is N*C*S = C*C, that is, C^2.
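To make the bifragment attempt count concrete, here is a Python sketch of the enumeration. The fixed block size, the known file length, and the validate callback are simplifying assumptions; a real validator would be a format parser:

```python
def bifragment_carve(image: bytes, start: int, header_len: int,
                     file_len: int, max_gap: int, block: int, validate):
    """Try every (split position, gap length) pair: ~S/block splits
    times ~G/block gaps, each validated at a cost of ~S. Returns the
    first candidate the validator accepts, or None."""
    for split in range(header_len, file_len, block):      # first-fragment length
        for gap in range(block, max_gap + 1, block):      # gap between fragments
            first = image[start:start + split]
            second = image[start + split + gap:
                           start + split + gap + (file_len - split)]
            candidate = first + second
            if validate(candidate):
                return candidate
    return None


# Toy example: a 16-byte file split at offset 8 with a 4-byte gap.
original = b"ABCDEFGHIJKLMNOP"
image = original[:8] + b"????" + original[8:]
recovered = bifragment_carve(image, 0, 1, len(original), 8, 1,
                             lambda c: c == original)
```

The two nested loops times the per-candidate validation cost is exactly the N*S*G*S term from the tally above (here for a single file).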

C^2 is pretty tough in practice. You need expensive hardware even for smallish media, and even bifragment carving can be difficult on severely fragmented media. Progress in CPUs is offset by ever-increasing storage capacity, so it is not getting any better with time. Massively parallel processing (say, in the cloud or on a cluster) does help, though.

There are many other options to make processing faster. The most obvious is to prioritize nearby clusters over more distant ones, based on the knowledge that the average file fragment is longer than one cluster. There are also more intricate ways to boost speed. Currently, it is practical to handle small media like memory cards, but not typical hard drives.
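One way to implement that prioritization, sketched in Python (the scoring is a hypothetical heuristic, not taken from any particular tool): candidates for the next cluster of a file are tried in order of distance from the cluster just read, so the contiguous continuation is tested first.

```python
def candidate_order(current: int, total_clusters: int) -> list[int]:
    """Order candidate next-clusters by distance from the cluster that
    immediately follows the current one, so the contiguous continuation
    is tried first and more distant clusters later. Ties break toward
    the lower cluster number; the current cluster itself is excluded."""
    return sorted((c for c in range(total_clusters) if c != current),
                  key=lambda c: (abs(c - (current + 1)), c))
```

With current = 4 on a 10-cluster medium this yields [5, 6, 3, 7, ...], so a validator spends its early attempts where a fragment is statistically most likely to continue.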

Created Sunday, August 27, 2017

Updated 21 May 2018

Updated 03 Nov 2022