Electronic Discovery and Computer Forensics

Using Computer Hash Totals In Electronic Discovery

October 2007

A hash value is a "digital fingerprint" of the larger document from which it was produced. A hash function takes a variable-length data set or "string" as input and produces a fixed-length string as output. It does so by running an algorithm against the data and reducing the result to a short series of letters and numbers. The resulting fixed-length string, or hash total, represents the longer message or document from which it was computed.

Although it is not the only algorithm, the most common is called MD5. The hash total from the MD5 algorithm is a string of 32 hexadecimal digits (the numerals 0-9 and the letters a-f). There are so many possible combinations (16^32, or roughly 3.4 x 10^38) that there is no practical chance of an accidental duplicate, or "collision," when all 32 digits are used.
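As a minimal sketch of the fixed-length property, the following uses Python's standard hashlib module. The helper name md5_hex is hypothetical and chosen only for illustration.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 hash of `data` as a 32-character hexadecimal string."""
    return hashlib.md5(data).hexdigest()


# A three-byte input and a one-million-byte input both yield 32 hex digits.
short_input = b"abc"
long_input = b"x" * 1_000_000

print(md5_hex(short_input))       # 900150983cd24fb0d6963f7d28e17f72
print(len(md5_hex(short_input)))  # 32
print(len(md5_hex(long_input)))   # 32
```

The length of the output does not depend on the size of the input, which is what makes the hash total usable as a compact fingerprint.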

Hash totals are used in a variety of data processing applications, including to:

1. Check message integrity (i.e., has all data been received without change or deletion)
2. Identify duplicate data (such as eliminating duplicate emails or documents)
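3. Support encryption-related functions such as digital signatures (the hash itself is one-way and does not encrypt the data)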

The first capability above is used to authenticate electronic evidence. A change as small as a single comma, or one letter switched between upper and lower case, causes the algorithm to generate a very different hash or digital fingerprint.
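The integrity check described above might be sketched as follows. The function names md5_hex and is_unchanged and the sample text are assumptions made for illustration; in practice the recorded hash would be taken when the evidence is collected.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 hash of `data` as a 32-character hexadecimal string."""
    return hashlib.md5(data).hexdigest()


def is_unchanged(data: bytes, recorded_hash: str) -> bool:
    """Compare the current hash of `data` with a hash recorded earlier."""
    return md5_hex(data) == recorded_hash.lower()


original = b"Payment is due on receipt, net 30."
recorded = md5_hex(original)  # hash recorded at collection time

# The altered copy differs from the original by a single comma.
altered = b"Payment is due on receipt net 30."

print(is_unchanged(original, recorded))  # True
print(is_unchanged(altered, recorded))   # False
print(md5_hex(original))
print(md5_hex(altered))
```

Printing the two hash totals side by side shows that they bear no visible resemblance, even though the underlying documents differ by one character.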

As electronic evidence becomes increasingly important, courts and parties struggle with how to control and identify the evidence. Too often, courts and attorneys resort to paper or paper-like solutions, such as (i) printing out the computer-based information and sticking a Bates number on the paper, or (ii) the electronic equivalent: creating a TIFF or PDF image to which a production number is added. (See what forms of evidence should I use). In most of these cases, the parties would be better served by leaving the data in a more computer-friendly form that facilitates searches and retains embedded data.

Although not yet widely adopted, hash totals could also be used to identify a document, much as production numbers are used on paper documents. Since a duplicate hash total is nearly impossible, each hash total would serve as a unique number. Admittedly, a 32-character alphanumeric string is burdensome to recognize, but a truncated hash total (e.g., the first X of the 32 available digits) would have the same practical effect.
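A truncated hash total of this kind could be produced as in the sketch below. The function name short_id and the default of six digits are assumptions for illustration, not an established convention.

```python
import hashlib


def short_id(data: bytes, digits: int = 6) -> str:
    """Return the first `digits` hexadecimal digits of the MD5 hash of `data`."""
    if not 1 <= digits <= 32:
        raise ValueError("digits must be between 1 and 32")
    return hashlib.md5(data).hexdigest()[:digits]


document = b"abc"
print(short_id(document))             # 900150
print(short_id(document, digits=10))  # 900150983c
```

Because the short identifier is simply a prefix of the full hash total, the full value can always be recomputed from the file when more digits are needed.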

The probability of a duplicate shortened number is a function of (i) how many digits are used, and (ii) the number of electronic files being produced. Intuitively, the shorter the truncated hash total, the greater the probability of a duplicate. Similarly, the larger the number of documents being identified, the greater the probability of a duplicate.

The following tool is useful when making judgments on this topic. The tool provides a bar and a dial for entering the two variables described in the preceding paragraph, and shows the percentage chance that a single document's truncated hash total is duplicated.

After using the tool, one will conclude that using six of the 32 available hash digits provides a low probability of a duplicate or collision. This holds even for larger electronic productions. If the electronic production is small, fewer than six digits will suffice. In those cases where a hash collision occurs, the full hash total can still be used to identify the document at issue.
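The calculation behind such a tool can be approximated with a short sketch. It assumes that each hash digit is hexadecimal (16 possible values) and that truncated hash totals are uniformly distributed, so that with k digits and n documents the chance that one particular document shares its truncated hash with at least one of the other n - 1 documents is 1 - (1 - 1/16^k)^(n-1). The function name is hypothetical.

```python
import math


def single_document_duplicate_probability(digits: int, documents: int) -> float:
    """Probability that one given document's truncated hash matches any other.

    Assumes hexadecimal digits and uniformly distributed hash values.
    """
    if documents < 2:
        return 0.0
    possible_values = 16 ** digits
    # log1p/expm1 keep precision when 1/possible_values is very small.
    log_no_match = (documents - 1) * math.log1p(-1.0 / possible_values)
    return -math.expm1(log_no_match)


for digits in (4, 6, 8):
    p = single_document_duplicate_probability(digits, documents=10_000)
    print(f"{digits} digits, 10,000 documents: {p:.4%}")
```

Under these assumptions, six digits and 10,000 documents give a probability of roughly 0.06% that a particular document's short number is shared by another document.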

The above calculator addresses the probability that any SINGLE hash total will be duplicated by another document in the collection. A different probability applies to whether ANY document in the collection has such a duplicate. This second probability is considerably larger, particularly as the document collection grows. The following calculator shows the probability of having any duplicate in the entire document set.
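This second probability is the classic "birthday problem." A minimal sketch, again assuming hexadecimal digits and uniformly distributed truncated hash totals, multiplies together the chances that each successive document avoids all earlier ones. The function name is hypothetical.

```python
import math


def any_duplicate_probability(digits: int, documents: int) -> float:
    """Probability that at least two documents share a truncated hash.

    Assumes hexadecimal digits and uniformly distributed hash values.
    """
    possible_values = 16 ** digits
    if documents > possible_values:
        return 1.0  # more documents than possible values forces a duplicate
    # Sum the logs of (1 - i / possible_values) for i = 0 .. documents - 1,
    # which is the log of the probability that no two documents collide.
    log_no_duplicate = sum(
        math.log1p(-i / possible_values) for i in range(documents)
    )
    return -math.expm1(log_no_duplicate)


for digits in (6, 8, 10):
    p = any_duplicate_probability(digits, documents=10_000)
    print(f"{digits} digits, 10,000 documents: {p:.4%}")
```

Under these assumptions, a 10,000-document production identified by six digits has roughly a 95% chance of containing at least one duplicate short number somewhere in the set, which is why the full hash total remains available as a fallback identifier.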

The hashing process is quite fast, so it need not be more time-consuming than the current standard of using manual (aka Bates) numbers.

The effort of working with an occasional duplicate production number is minor compared to the wasteful efforts now being employed when either:

1. Converting native electronic files to paper with manual production numbers or
2. Creating static PDFs or TIFFs with production numbers in the static files.

If hash totals are used in lieu of production numbers, one would need to create an index of documents to identify the source and order of the document. The index retains the information necessary to:

1. Resolve collisions in production numbers and
2. Identify the source and order of the document.
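One way such an index might be structured is sketched below. The record layout, the names IndexEntry and build_index, and the rule of falling back to the full hash total when two different documents share a short number are all assumptions for illustration; the fallback follows the suggestion that the full hash total can be used when a collision occurs.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class IndexEntry:
    production_id: str  # truncated hash, or full hash after a collision
    full_hash: str      # complete 32-digit MD5 hash total
    source: str         # where the document came from (e.g., custodian, path)
    order: int          # position of the document within the production


def build_index(documents, digits: int = 6):
    """Build an index from an iterable of (source, data) pairs in order."""
    index = []
    first_seen = {}  # short number -> full hash of the first document using it
    for order, (source, data) in enumerate(documents, start=1):
        full_hash = hashlib.md5(data).hexdigest()
        short = full_hash[:digits]
        owner = first_seen.setdefault(short, full_hash)
        # Exact duplicates share a short number; a different document that
        # collides on the short number is identified by its full hash instead.
        production_id = short if owner == full_hash else full_hash
        index.append(IndexEntry(production_id, full_hash, source, order))
    return index


sample = [("custodian A", b"first memo"), ("custodian B", b"second memo")]
for entry in build_index(sample):
    print(entry.order, entry.production_id, entry.source)
```

Keeping the full hash total in every entry means a short number can be lengthened or verified at any time without going back to the underlying file.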


Fulcrum Inquiry assists lawyers with electronic discovery and computer forensics.