Parallel Unique Path (PUP) Carving, Illustrated

What is PUP carving?

Parallel Unique Path (PUP) carving is one of the advanced file carving methods. It is used to reconstruct fragmented files without use of filesystem metadata by finding and assembling file fragments. Some other methods are described there. PUP carving requires two things:

  1. An ability to identify file start (file header),
  2. a proximity function (also called weight function) which tells us how good is the match between two clusters. It tells us how likely any given cluster is the next cluster for the given incomplete cluster chain.

It is also nice to be able to identify end of file, either by being able to detect file footer or by knowing file size. This is optional, though.

Simple example

Let’s see how this works. Normally, the proximity function provides a number, either a percentage or using some other scale. For illustration purposes, I will show proximity function output as line thickness – double line means strong signal, single solid line means average signal, and dashed line is for weak signal. We have six clusters in two files – red and green. File headers are on the left and footers are on the right. Initial step is to identify and mark headers.

Step 0 start

Then, compute proximity function values from headers to all unassigned clusters. This requires eight computations.

Step 0 proximity functions

Once this is done, we then select the strongest available signal and associate the appropriate cluster (red middle) with its incomplete file (red). That is, we choose the file for which the next cluster is best defined, and advance process one step by attaching that cluster to its corresponding file. It is important that we only do it one step at a time. The result is like follows

Step 1 start

The process is then repeated, but clusters already assigned to their respective files are not considered candidates for analysis. This accounts for Unique in Parallel Unique Path algorithm, and this is the important point as we will discover shortly. For now, let’s watch the process to the end.

Step 2

Step 3

Final result

Finally, we have two files properly reconstructed. You may notice that further into analysis, less and less proximity functions are needed. The computation effort required per step decreases with each step.

More complex example

In real life, we will be dealing with tens or hundreds of files, millions of clusters, and it is overly optimistic to assume the proximity function output is correct in all cases. That just does not happen. Let’s see what happens if there is an error in one of the iterations. I will stick to the same example for the sake of simplicity. In this example, red file produces strong wrong signal and wrong cluster is added.

Proximity calculation gone wrong...

... and its results

Now if we comply to the unique path requirement, then nothing is recoverable. The recovery of the red file went wrong, and it now blocks the green file from claiming its middle-green cluster. No matter how the proximities play out further, green file is not getting reconstructed. So, incorrect recovery of one file leads to incorrect recovery of some other file. This thing is called error propagation or cascading errors.

It may seem quite an obvious solution to stop requiring paths to be unique. If two files are allowed to take the same cluster, the problem seems to go away, allowing at least one file to be recovered, as in this example

Non-unique paths first step...

...first step result and a new proximity map...

...and final result.

It seems like not requiring paths to be unique helps... or does it? First of all, if clusters can be reused, the number of calculations increases. You can see by following the correct recovery (the first example) that the number of proximity function calculations at each step decreases from initial eight to just one in the last step. Losing that, however, is not really that bad in practice. Much more important is that unique path requirement also works another way round, in some cases improving the result. See, there are two sides to unique part requirement. At some point, the proximity function erroneously assigns higher score to the incorrect cluster A instead of the correct cluster B. As often happens in practice, A comes out as the best result and B as the second best. Then, one of the two things occur:

  1. If the error occurs when cluster A is not assigned to a file yet, it is assigned to a wrong file. This later causes error propagation.
  2. If the error occurs when cluster A is already assigned to a file, the algorithm is forced to take cluster B instead, thus recovering from a proximity function error, as illustrated below.

Recovering from proximity function errors

Practically, enforcing paths to be unique does more good than harm. As the best available signal is used in each iteration, strongest signals are genrally acted on first, and weaker signals are used last, when the choice of cluster to pick is significantly limited. Strong signals are generally more trustworthy and by the time we need to use less trustworthy weak signal, the choice and the number of ways for weak signal to screw up is significantly limited.

So, this is how the PUP algorithm works and why paths better be unique.

References and what else you can read on this topic

  1. Rainer Poisel, Simon Tjoa, and Paul Tavolato. “Advanced File Carving Approaches for Multimedia Files.” // Journal of Wireless Mobile Networks, Ubiquitous Computing, and Dependable Applications, volume: 2, number: 4, pp. 42-58.
  2. A. Pal and N. D. Memon, “The evolution of file carving,” // IEEE Signal Processing Magazine, vol. 26, no. 2, pp. 59–71, March 2009.

Created Sunday, February 4, 2018