TECHNOLOGY FOR EDD
By Sara Emami
THE EDD CONSULTANT
The ultimate goal of electronic discovery is
to obtain information that is truly relevant to the case.
There is a demand by legal entities and corporations
to process, review and ultimately produce this
information in the most cost-effective manner.
WHY PRE-PROCESSING MATTERS TO THE CLIENT
Pre-processing enables clients to greatly reduce electronic data volumes
at the earliest stages of e-Discovery. That reduction alone is
significant and drives cost-efficiency.
Pre-processing is intended to de-NIST, de-dupe and
apply data filters to efficiently cull (remove) large
sets of data that may be deemed unnecessary or
irrelevant to each case.
A SUCCESSFUL PRE-PROCESSING
PROGRAM SHOULD DO THE FOLLOWING:
Data cataloging (organization of files for quick and easy retrieval)
File extension filtering
Document level and date/time filtering
File type identification
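As a rough illustration, the extension and date/time filters above can be sketched in a few lines of Python. The excluded extensions and the date window below are hypothetical examples chosen for the sketch, not a recommended culling standard:

```python
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical culling settings, for illustration only.
EXCLUDED_EXTENSIONS = {".exe", ".dll", ".sys", ".tmp"}
DATE_FROM = datetime(2010, 1, 1, tzinfo=timezone.utc)
DATE_TO = datetime(2012, 12, 31, tzinfo=timezone.utc)

def keep_for_review(path: Path) -> bool:
    """Return True if the file survives the extension and date/time filters."""
    if path.suffix.lower() in EXCLUDED_EXTENSIONS:
        return False  # extension filter: cull known system/program file types
    modified = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
    return DATE_FROM <= modified <= DATE_TO  # date/time filter
```

In a real tool these filters run across an entire collection, and the settings come from the case team rather than being hard-coded.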
What is “NIST”?
Why is “NIST” and De“NIST”ING significant?
How does this apply to electronic discovery and how
does this influence technology?
What is NIST? Despite how it sounds, it is not a nest or a
laying ground for eggs.
The National Institute of Standards and Technology
(“NIST”) publishes the National Software Reference
Library (“NSRL”).
The NSRL is basically the Library of Congress of
software files. Its comprehensive listing includes the
files known to be distributed with commercial
software packages such as Microsoft Office.
The proper method for excluding these known software
files from a review set is known as de-NISTing.
WHY IS de“NIST”ing SIGNIFICANT?
In the typical electronic discovery case, de-NISTing
alone will reduce the volume of information to be
examined by 20%.
From the perspective of large corporations, de-NISTing
comes in handy: reduced volume means less money
spent on the discovery process, and cost-efficiency
comes into play.
The NIST list contains over 28 million file signatures.
It is used regularly by the FBI and other law
enforcement entities to identify files with no
evidentiary value.
The list is free.
Many e-Discovery companies take advantage of this
free list and incorporate it into their software.
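A minimal sketch of how such software might apply the list, assuming the known-file hashes have already been loaded into a set. The file reading and hash comparison below are illustrative only, not any vendor's actual implementation, and a tiny in-memory set stands in for the NSRL data:

```python
import hashlib
from pathlib import Path

def md5_of(path: Path) -> str:
    """Compute the MD5 digest of a file, reading in chunks."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest().upper()

def denist(paths, known_hashes):
    """Keep only files whose MD5 is NOT on the known-software list."""
    return [p for p in paths if md5_of(p) not in known_hashes]
```

Everything whose fingerprint matches a known software file is dropped before review; everything else moves forward.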
CONCERNS WITH de“NIST”ing
While the NIST list is updated four times per year, it
may not include important files.
Historically, significant system files were not being
removed during the de-NISTing process on
workstations running Windows 7 and the latest release
of Microsoft Office. This was a known problem
in 2011, so at times de-NISTing is not foolproof.
At the time, the NIST list did not yet include Windows 7 files,
despite the fact that more than three hundred million
workstations ran Windows 7. Additionally, it did NOT
yet include Microsoft Office 2010 files either.
Supplementing the NIST list by removing system files such
as EXE and DLL files is a clearly documentable method to
reduce the number of files in the review set.
This method does not depend on hash values and,
assuming these file types are not responsive (which is
usually the case), can be an effective way of eliminating
files to review.
What is “Hash”?
When we think of “hash”, we are not referring to
McDonald’s breakfast potatoes.
From a technology perspective, think of a hash as an
individual file’s digital fingerprint. The NSRL
listing, for instance, includes the names of the files, their
typical file sizes and the “hash” value for each file.
HASH – THINK “DIGITAL FINGERPRINT”
When we think of hash, we should think:
A cryptographic algorithm that forms a mathematical foundation
of e-discovery
Hashing generates a unique alphanumeric value to identify a
particular computer file, group of files, or even an entire hard drive
Hash also allows for the identification of particular files, and the
easy filtering of duplicate documents, a process called
“de-duplication” that is essential to all e-discovery document review
Let us say for instance, the hash values of a Word
document I am working on now are:
MD5: 588BCBD1845342C10D9BBD1C23294459 SHA-1:
If I only change one comma in this multipage document, all else
remaining the same, the hash values are now:
MD5: 5F0266C4C326B9A1EF9E39CB78C352DC SHA-1:
Although the two files have only this trivial difference, there are no
similarities in these hash values, proving that hashing will detect even
the slightest file alteration.
Hashing can also be used to determine when fields or
segments within files are identical, even though the
entire file might be quite different.
For instance, you can hash only the body of an email,
the actual message, to determine whether it is
identical with another email, even when the
“reference” or the “to” and “from” fields are
different. This allows for an important filtering
process called “near de-duplication.”
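A simplified sketch of body-only hashing follows. The email structure here is a plain dictionary invented for illustration; real tools parse actual message formats such as MSG or EML before hashing the extracted body:

```python
import hashlib

def body_hash(email: dict) -> str:
    """Hash only the message body, ignoring To/From/Subject fields."""
    return hashlib.md5(email["body"].encode("utf-8")).hexdigest()

# Same message sent to two different recipients.
msg1 = {"from": "alice@example.com", "to": "bob@example.com",
        "body": "Please review the attached contract."}
msg2 = {"from": "alice@example.com", "to": "carol@example.com",
        "body": "Please review the attached contract."}

print(body_hash(msg1) == body_hash(msg2))  # True: same body, different headers
```

Hashing the whole messages would yield different values; hashing only the body reveals that the content is the same.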
THE MD5 HASH AND WHY IT MATTERS
An MD5 message hash helps e-discovery
professionals both verify the integrity of transferred
files and check the digital signature of those files.
When hash functions are applied, legal teams can
quickly locate documents in different formats within
a sizeable data collection.
Additionally, through the use of pre-culling hashing
tools, they can rapidly identify duplicate documents
by comparing hash values.
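A minimal sketch of hash-based duplicate identification during pre-culling. The document names and contents are invented, and contents are given as raw bytes for simplicity:

```python
import hashlib

def dedupe(docs):
    """Keep the first copy of each document; drop exact duplicates by MD5."""
    seen, unique = set(), []
    for name, content in docs:
        digest = hashlib.md5(content).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(name)
    return unique

collection = [
    ("report.docx", b"Q3 financial summary"),
    ("report_copy.docx", b"Q3 financial summary"),  # exact duplicate
    ("notes.txt", b"Meeting notes"),
]
print(dedupe(collection))  # ['report.docx', 'notes.txt']
```

Because comparison happens on fixed-length digests rather than full file contents, this scales to millions of documents.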
NEAR DE-DUPE? WHAT IS IT?
When near de-duplication occurs, it will reveal whether
two documents are similar. Think of having created a
Word document entitled “mydog.doc”. Let us imagine that
we have revised “mydog.doc” and saved it as
“mydog_revisedversion1.doc”. Even though the Word file was
saved with the intent of being the same document in revision
mode, what makes this a near de-dupe is the fact that the two
draft versions are similar, but they are not identical.
Here is another example of near de-duplication: one file exists as
“File AV1.0”. This file is then opened, spell checked, and then
saved as “File AV1.1”. These files are very similar and are classed
as “near de-duplicates”.
Building on de-duplication, near-de-duplication programs
and software allow for an even higher level of data
reduction, as they identify files that are similar but
are not bit-level duplicates.
These near-de-duplication technologies help identify and
group/tag electronic files with “near duplicate”
similarities even when there are differences in the
content or metadata, or both.
Examples of near de-duplication include document
versions, emails sent to multiple custodians, different
parts of email chains, or similar proposals sent to several
recipients.
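Exact hashes cannot catch these cases, since any edit changes the hash completely, so near-de-duplication tools compare document similarity instead. A toy sketch using Python's standard difflib; the 0.9 threshold and sample texts are arbitrary choices for illustration, and commercial tools use far more sophisticated similarity measures:

```python
from difflib import SequenceMatcher

def near_duplicates(text_a: str, text_b: str, threshold: float = 0.9) -> bool:
    """Flag two documents as near-duplicates when their similarity ratio
    meets the threshold, even though their hashes would differ."""
    return SequenceMatcher(None, text_a, text_b).ratio() >= threshold

v1 = "Our dog Rex loves long walks in the park every morning."
v2 = "Our dog Rex loves long walks in the park every evening."

print(near_duplicates(v1, v2))  # True: similar drafts, but not identical
```

Documents flagged this way can then be grouped and reviewed together rather than scattered across the review set.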
IT IS ALL ABOUT ORGANIZATION
Imagine you are digging through years of files in your local
storage shed in an attempt to find one significant document
related to a past lawsuit. You dig through 20 boxes of paper
and cannot find that one document. Wouldn’t it be nice to have
an efficient way of reducing a full day of searching to just
under an hour? Finding and grouping documents does exactly that.
In recent years, the finding and grouping of documents in
e-discovery has also been enhanced by new pre-culling tools
that go beyond keyword query methodology into concept and fuzzy
searching.
Historically, document sets were compiled with keyword
searches and then narrowed by using fewer search terms.
ALL ABOUT ORG. (cont.)
Now, with the advent of concept clustering (i.e.,
foldering), advanced document analysis can help
organize information more effectively by subject.
This clustering capability greatly facilitates the
review process by showing attorneys which subjects
warrant the greatest attention or relevance to a case.
DATA MAPPING
Data mapping software is one of the most powerful
pre-culling tools available.
It provides the framework for visual analysis, showing
users the different “points” across their continent of data.
It extracts and indexes metadata and text from native
files, creates clusters based on any combination of
attributes, and allows users to search and analyze
document collections prior to full EDD processing.
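As a toy sketch of clustering on extracted metadata attributes, the documents and attribute names below are invented for illustration; a real data mapping tool would extract these fields from native files:

```python
from collections import defaultdict

# Hypothetical extracted metadata for a small document collection.
docs = [
    {"name": "budget.xlsx", "custodian": "alice", "type": "spreadsheet"},
    {"name": "memo.docx",   "custodian": "alice", "type": "document"},
    {"name": "q3.xlsx",     "custodian": "bob",   "type": "spreadsheet"},
]

# Cluster on any combination of attributes; here, custodian + file type.
clusters = defaultdict(list)
for d in docs:
    clusters[(d["custodian"], d["type"])].append(d["name"])

print(dict(clusters))
```

Each cluster becomes a “point” the legal team can inspect, search, or cull before committing to full processing.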
DATA MAPPING (cont.)
Data mapping applications should be able to remove
duplicates in advance and can help legal
professionals reduce document volumes by as much as
eighty percent. This is why mapping is significant.
One huge benefit of data mapping is that it gives
litigators direct control over the document
collection. They can manipulate data themselves, in
real time, without the need for vendor assistance.
EXAMPLE OF WHAT A DATA MAPPING TOOL LOOKS LIKE
CONCLUSION ON PRE-PROCESSING
The pre-processing stage enables clients
to greatly reduce electronic data sizes at the earliest
stages of the e-Discovery lifecycle.
It targets the relevance of the data.
Cost efficiency equals happy clients!