File carving methods in data recovery

File carving is a process of searching for files in a data stream based on knowledge of file formats rather than any other metadata. Carving is useful when filesystem metadata is damaged or otherwise unusable. The FAT filesystem, typically used on small media, is the most common example. Once the file is deleted, or the media is formatted, the location data is lost because pointers to file contents are zeroed out.

Contiguous and fragmented files

Files on the media can be stored either in one chunk (contiguous, or non-fragmented files) or in several non-adjacent chunks (non-contiguous, or fragmented files).

Fragmentation occurs when there is not enough contiguous free space to write a file; the filesystem then has to split a file into parts which fit the available chunks of free space. Fragmentation also occurs when several files are written to the volume simultaneously, and in some other cases.

Fragmentation increases as the filesystem ages, as files are created and deleted. Larger files fragment more; this applies to higher-resolution still images and especially videos. It is not unusual to have half or more of the video files on a memory card fragmented.

Carving for contiguous files

Any kind of carving requires the ability to identify the file header. First step in carving is to scan the media and produce the list of headers, which represent file start points. Once start points are located, there are several options to choose from

  1. Header and fixed size carving. Take some fixed size starting at the header. This works for files where trailing data after the end of file is not important, which is actually true for most of the file formats. For example, given that JPEGs hardly exceed 100 MB (as of Q4 2017), one can take 256 MB worth of data starting at the header and be sure that all contiguous JPEG files will be captured (although with much of excess data at the end of each file). The size limit can obviously be adjusted for the specific task, file type, and even for specific camera settings. This method does not work with file formats which to not tolerate extra data after the end of file, like original Canon CRW.
  2. Header and size carving. This is a variation of the fixed size method, where we use size derived from the header of the file instead of a fixed size. Obviously, this only works for file formats where file size is stored in header. Files produced by this method do not have extra data at the end of the recovered file.
  3. Header and footer carving. If file has some kind of footer we can identify at the end of file, the capture size can be limited to the nearest footer after the header, producing files with no excess data at end. This method, however, requires further adaptation to file formats which can be nested, most notable example being JPEG with another JPEG thumbnail in it. With contiguous files, this situation is fortunately easy to detect, as detected points go in groups of four, two headers followed by two footers.

All these methods are fast, and processing speeds are mostly limited by the source media read speed.

Carving for fragmented files

Fragmented files are much more difficult to carve. We still have headers, and with luck file sizes determined from these headers, but files are split into two or more fragments which are not necessarily in the correct order. To tackle this task, one needs two things – a validation function and a similarity (proximity) function.

Validation and proximity functions

A validation function determines if the given file is correct or not. Good validation functions are exceedingly difficult to implement. Each of the different file formats requires its own validation function; furthermore, variations of the format require appropriate adjustments. If the file format is poorly documented or not documented at all, constructing a good validation function may be impossible. File formats which have their own checksums, like ZIP, are the best to construct a validation function for. Loosely defined formats, like plain text or XML, are the worst.

A proximity function determines how likely it is that two blocks are in fact two adjacent blocks belonging to the same file, with result expressed either as a probability or as some score measured in arbitrary units.

Bifragment gap carving

Bifragment gap carving is a method used to recover files which are in two sequential fragments with some extra data (gap) in between them. The applicability of the method is obviously limited, but it has one significant advantage of not requiring a proximity function, it can be done with validation function alone. Let’s see the diagram. We have a file header and optionally a footer and the file size computed from the header. Distance between header and footer is D, and file size from the header is S.

Bifragment gap carving parameters

If there is no footer, the distance D is assumed to be some constant enough to accommodate the largest possible gap and the largest possible file (assume D = Lmax + Smax). Now, if S is known, the gap size L can be computed simply as L = D-S. Then, we need to do approximately N ~ (D-L)/B validation tests, trying each possible position of a gap, and see which one produces the valid file. B here is a block size, typically 512 bytes.

If a validation function is not quite good, there may be several positive validations, in which case several probable files are produced for user to sort out which one is actually needed. First and last positions for the gap are not tested, because these outermost positions are known to be occupied by header and footer which are obviously parts of the file.

Then, if S is not known, the gap size cannot be determined, and the test has to be repeated for each possible gap size up to Lmax = D-Smin, where Smin is some minimum acceptable file size. Smin is typically guessed by looking at known-good files produced be the same camera at the same settings. This makes for approximately N ~ ((D-Lmax)/B)*(Lmax/B) tests.

Being not able to identify the footer does not change much, except that even more tests must be performed after assuming some maximum size Dmax.

In real world ten validations can be performed per second per CPU core on a typical 20 MP image. File size, let’s assume it is 10 MB, which is a bit on the small side but still reasonable. Typical sector size is 512 bytes. In some cases, the file system block size can be sniffed from what’s left of the filesystem and used. This imporves speed as we will soon see. Let’s assume 50 validations per second.

Time required to process a single header is, given a block size of 512 bytes (that is, with no filesystem cluster information)

Parameter block size B = 512 bytes
D 10 MB + 32 KB 10 MB + 32 KB Dmax = 16 MB
S 10 MB unknown unknown
L L = D - S Lmax = 256 KB Lmax = 256 KB
N (D-L)/B (Lmax/B)*(D-Lmax)/B (Lmax/B)*(Dmax-Lmax)/B
~20,000 ~10,000,000 ~16,000,000
Time ~ 6 min ~40 hrs ~70 hrs

and with a block size of 32 KB, as it would be if filesystem cluster size is detected,

Parameter block size B = 32 KB
D 10 MB + 32 KB 10 MB + 32 KB Dmax = 16 MB
S 10 MB unknown unknown
L L = D - S Lmax = 256 KB Lmax = 256 KB
N (D-L)/B (Lmax/B)*(D-Lmax)/B (Lmax/B)*(Dmax-Lmax)/B
~320 ~2500 ~4000
Time ~ 6 sec ~50 sec ~80 sec

So, you see the time required increases drastically if not all parameters are known. If block size can be determined by some or other method, the scan time is inversely proportional to number of sectors per cluster squared, which is fairly impressive.

The only drawback of bifragment gap carving is its limited applicability.

I have some real-life data on the applicability of bifragment gap carving here if you are interested.

Complex fragmented file carving

There is no known single best algorithm to recover files from multiple fragments. There are many various approaches and methods, but none of them simple and none of them fast. The simplest method is to apply proximity function to all possible combinations of blocks, and then merge blocks starting with most proximate. This produces chains of blocks which are most likely belong to their respective files, one chain corresponding to one file. One variant of implementing this is a Parallel Unique Path (PUP) carving algorithm. The problem is, this requires amount of computation proportional to a square of number of blocks on the media, and this is real slow.

Complex carving requires a proximity function. The proximity function obviously depends on file format. If this is the image file, like JPEG, a variety of methods can be applied, starting with the simple color difference between adjacent points (first derivative), or second derivative, or something more fancy like Haar-like features, and then all the way up to neural networks. One is not limited to using just one proximity function; any mix or combination can be used.

There is a fine balance between precision of the proximity function and its performance. Obviously, more complicated functions are slower. What's less obvious, more complicated functions tend to have some bias creep in, so they work well with one group of images, and not quite well with another group. For example, edge detection works well against a natural landscape photo while having problems with black-and-white text scan. So, it has to be augmented with some kind of scene type detection for acceptable result.

To reduce amount of computation required, various tricks are applied. Blocks assigned to known-good files are excluded from the analysis. All-zero blocks, or blocks that do not resemble parts of JPEG images can also be excluded. In real life, a set of heuristics is used for block identification, to determine which blocks are skipped as candidates.

What Klennet Carver uses

Klennet Carver uses a combination of header-fixed size and complex carving using a mix of various proximity functions chosen to match to the specific situation. This provides good balance between performance, as in not having to wait years for the scan to complete, and recovery effectiveness, that is, being able to actually recover fragmented files. Also, limited amount of bifragment carving is used during image recovery.