On the complexity of file carving

This is a fairly simple recap of the computational requirements for different file carving methods.

There are several approaches to carving with no filesystem metadata available:

  1. The simplest is header-footer (or header-length) carving: identify file headers and, when possible, footers; everything in between is taken to be file content. If footers cannot be defined, or are computationally expensive to search for, a fixed amount of data after the header can be used instead.
  2. Bifragment gap carving, where each file is assumed to consist of two fragments with a gap between them. All possible combinations of gap position and length are tested to find which one, if any, produces the best (or the only) valid file.
  3. Full match carving, where every block on the disk is tested against the candidate file to see which block is the best match.
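The first approach can be sketched in a few lines. This is a minimal illustration, not a production carver; the JPEG SOI/EOI markers and the 16 MB fallback cap are assumptions for the example.

```python
# Minimal header-footer carving sketch. Assumptions: JPEG's well-known
# SOI (FF D8 FF) and EOI (FF D9) markers; a fixed-size fallback when no
# footer is found. A real carver handles many formats and edge cases.

JPEG_HEADER = b"\xff\xd8\xff"      # start-of-image marker
JPEG_FOOTER = b"\xff\xd9"          # end-of-image marker
MAX_FILE_SIZE = 16 * 1024 * 1024   # cap used when no footer is found

def carve_header_footer(image: bytes) -> list[bytes]:
    """Return candidate files found between header and footer markers."""
    carved = []
    pos = 0
    while True:
        start = image.find(JPEG_HEADER, pos)
        if start == -1:
            break
        end = image.find(JPEG_FOOTER, start + len(JPEG_HEADER),
                         start + MAX_FILE_SIZE)
        if end == -1:
            # No footer within the cap: take a fixed amount of data.
            carved.append(image[start:start + MAX_FILE_SIZE])
            pos = start + len(JPEG_HEADER)
        else:
            carved.append(image[start:end + len(JPEG_FOOTER)])
            pos = end + len(JPEG_FOOTER)
    return carved
```

Note that a single linear pass over the media suffices, which is what makes this method so cheap.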

Now let's see what the computational requirements are.

  1. Time required for header-footer carving is linearly proportional to the size of the source drive. This is why header-footer carving is implemented in all common data recovery software: the software has to scan the entire drive anyway, so carving can be added at negligible extra cost.
  2. Time required for bifragment gap carving is essentially the number of attempts multiplied by the time per attempt. The number of attempts is proportional to the number of files (N), the single file size (S), and the maximum allowed gap length (G). Validation time (time spent per attempt) is constant at best, but most often proportional to the file size (S). Now, it is easy to note several things:
    1. N*S, the number of files multiplied by the file size, is simply the media capacity C.
    2. The maximum allowed gap length G grows with capacity, but is not directly proportional to it; in practice it is more like G ~ log(C).
    This works out to a total time proportional to C*log(C)*S or, assuming the typical file size grows as S ~ C^0.5, to C^1.5*log(C).
  3. Full match carving is the most complete approach, producing the theoretically best possible results, but also the most expensive in computation time. The number of attempts required is proportional to the number of files (N) and the media capacity (C). The time per attempt is, as in the previous case, proportional to the single file size (S).
    The final tally is N*C*S = (N*S)*C = C^2.
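The brute-force enumeration behind bifragment gap carving can be sketched as follows. The `validate` callback is a placeholder assumption; a real carver would run a format-specific decoder there, which is where the per-attempt cost proportional to S comes from.

```python
# Bifragment gap carving sketch: given the block indices of a header and
# a footer, try every (gap start, gap length) combination and keep the
# first candidate that the validator accepts. The nested loops are the
# source of the S*G factor in the attempt count discussed above.

def bifragment_carve(blocks, header_idx, footer_idx, validate):
    """blocks: list of equal-sized byte blocks read from the media."""
    span = blocks[header_idx:footer_idx + 1]
    n = len(span)
    # The gap lies strictly between the header block and the footer block.
    for gap_start in range(1, n - 1):
        for gap_len in range(1, n - gap_start):
            candidate = b"".join(span[:gap_start] + span[gap_start + gap_len:])
            if validate(candidate):
                return candidate
    return None
```

Each candidate is rebuilt and re-validated from scratch, which is why the method degrades quickly as the allowed gap grows.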

C^2 is pretty tough in practice: you need expensive hardware even for smallish media, and even bifragment carving can be difficult on severely fragmented media. Progress in CPUs is offset by ever-increasing storage capacity, so it is not as if things are getting better with time. Massively parallel processing (say, clouds or clusters) also helps.

There are many other options to make processing faster. The most obvious is to prioritize nearby clusters over more distant ones, based on the knowledge that the average file fragment is larger than one cluster. There are more intricate ways to boost speed, and it is currently practical to work with common media like memory cards.
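The prioritization idea above can be sketched as a candidate ordering: since a fragment usually spans many contiguous clusters, the most likely continuation of a file is the adjacent cluster, and candidates further away are tried later. The function name and integer cluster indices are assumptions for the illustration.

```python
# Sketch of distance-based candidate ordering. Because the average file
# fragment is larger than one cluster, the contiguous next cluster is by
# far the likeliest continuation, so candidates are yielded in order of
# increasing distance from it.

def candidates_by_distance(current: int, total_clusters: int):
    """Yield cluster indices ordered by distance from `current + 1`."""
    expected = current + 1  # contiguous continuation, the likeliest case
    for dist in range(total_clusters):
        nearby = (expected + dist, expected - dist) if dist else (expected,)
        for cand in nearby:
            if 0 <= cand < total_clusters and cand != current:
                yield cand
```

A carver that validates candidates in this order finds unfragmented and lightly fragmented files quickly, falling back to the expensive distant candidates only when needed.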

Created Sunday, August 27, 2017

Updated 21 May 2018