TECHNOLOGY FOR EDD
CONSULTANTS, NOT
DUMMIES
TOPIC: PRE-PROCESSING
By Sara Emami

Sara's technology presentation pre processing

Pre-processing in E-Discovery. Educational PowerPoint for attorneys and paralegals.

Published in: Technology

  1. TECHNOLOGY FOR EDD CONSULTANTS, NOT DUMMIES. TOPIC: PRE-PROCESSING. By Sara Emami
  2. THE EDD CONSULTANT AND E-DISCOVERY: The ultimate goal of electronic discovery is to obtain information that is truly relevant to the case. Legal entities and corporations demand that this information be processed, reviewed and ultimately produced in the most cost-effective manner.
  3. WHY DOES PRE-PROCESSING MATTER TO THE DISCOVERY PROCESS? It enables clients to greatly reduce electronic data sizes at the earliest stages of e-discovery, which by itself is a significant source of cost-efficiency. Pre-processing is intended to DeNIST, de-duplicate and apply data filters in order to efficiently cull (remove) large sets of data that may be deemed unnecessary or irrelevant to the case.
  4. A SUCCESSFUL PRE-PROCESSING PROGRAM SHOULD DO THE FOLLOWING: data cataloging (think of organizing files for quick and efficient access); file extension filtering; document-level and date/time filtering; file type identification; MD5 hashing; NIST filtering; de-duplication.
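Two of the filters listed above, file extension filtering and date/time filtering, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `FileRecord` type, the `cull` function, the sample file names and the date range are all hypothetical, and a real pre-processing tool would read this metadata from the collected files themselves.

```python
import os
from dataclasses import dataclass
from datetime import datetime


@dataclass
class FileRecord:
    # Hypothetical catalog entry: a path plus its last-modified time.
    path: str
    modified: datetime


def cull(records, allowed_exts, start, end):
    """Keep records with an allowed extension and a date inside [start, end]."""
    kept = []
    for rec in records:
        ext = os.path.splitext(rec.path)[1].lower()
        if ext in allowed_exts and start <= rec.modified <= end:
            kept.append(rec)
    return kept


records = [
    FileRecord("memo.docx", datetime(2011, 3, 1)),
    FileRecord("setup.exe", datetime(2011, 3, 2)),       # wrong file type
    FileRecord("old_notes.docx", datetime(2005, 6, 9)),  # outside date range
]
kept = cull(
    records,
    allowed_exts={".docx", ".pdf", ".msg"},
    start=datetime(2010, 1, 1),
    end=datetime(2012, 12, 31),
)
print([rec.path for rec in kept])  # ['memo.docx']
```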
  5. DeNISTing: What is “NIST”? Why are “NIST” and DeNISTing significant? How does this apply to electronic discovery, and how does it influence technology?
  6. What is a NIST? It is not a laying ground for eggs!!! The National Institute of Standards and Technology (“NIST”) publishes the National Software Reference Library (“NSRL”). The NSRL is basically the Library of Congress of software files. Its comprehensive listing includes all of the files known to be distributed with software packages such as Microsoft Office. The proper method for excluding these files from a collection is known as DeNISTing.
  7. WHY IS DeNISTing IMPORTANT? In the typical electronic discovery case, DeNISTing alone will reduce the volume of information to be examined by 20%. For large corporations this matters because reduced volume means less money spent on the discovery process.
  8. INTERESTING FACTS ABOUT “NIST”: The NIST list contains over 28 million file signatures. It is used regularly by the FBI and other law enforcement entities to identify files with no evidentiary value. The list is free. Many e-discovery companies take advantage of this free list and incorporate it into their software.
  9. CONCERNS WITH DeNISTing: While the NIST list is updated four times per year, it may not include important files. In the past, significant system files were not removed during the DeNISTing process on workstations running Windows 7 and the latest release of Microsoft Office. This was a problem in 2011, so DeNISTing is not always foolproof.
  10. CONCERNS (continued): At that time, the NIST list did not yet include Windows 7 files, despite the fact that more than three hundred million workstations ran Windows 7. It also did not yet include Microsoft Office 2010 files. Supplementing the NIST list by removing system files such as EXE and DLL files is a clearly documentable method to reduce the number of files in the review set. This method doesn’t depend on hash values and, assuming that these file types are not responsive (which is usually the case), can be an effective way to eliminate files from review.
  11. What is “Hash”? When we think of “hash”, we are not referring to McDonald’s breakfast potatoes. From a technology perspective, we must think of a hash as an individual file’s digital fingerprint. The NIST listing includes the names of the files, their typical file sizes and the “hash” value for each file.
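Putting the last few slides together, DeNISTing can be thought of as a lookup: compute each file's hash, drop the file if the hash is on the known-software list, and optionally fall back to a file-type filter for system files the list misses. The sketch below is illustrative only. The tiny in-memory `known_hashes` set stands in for the real NIST list (which has its own distribution format not shown here), and the `denist` function, the file names and the file contents are hypothetical.

```python
import hashlib
import os

# Assumed supplement to the hash list: file types treated as system files.
SYSTEM_EXTS = {".exe", ".dll"}


def md5_hex(data: bytes) -> str:
    """Return the MD5 fingerprint of some bytes as uppercase hex."""
    return hashlib.md5(data).hexdigest().upper()


def denist(files, known_hashes):
    """Return the names of files that survive DeNISTing.

    files: dict mapping a file name to its raw bytes.
    known_hashes: set of MD5 values for known software files.
    """
    survivors = []
    for name, data in files.items():
        if md5_hex(data) in known_hashes:
            continue  # fingerprint matches a known software file
        if os.path.splitext(name)[1].lower() in SYSTEM_EXTS:
            continue  # not on the list, but removed by file type
        survivors.append(name)
    return survivors


files = {
    "stock_template.dot": b"stand-in for a file shipped with an office suite",
    "helper.dll": b"stand-in for a system library missing from the list",
    "contract.doc": b"stand-in for a user-created document",
}
known_hashes = {md5_hex(files["stock_template.dot"])}
print(denist(files, known_hashes))  # ['contract.doc']
```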
  12. HASH: THINK ALGORITHMS!! When we think of hash we should think of a cryptographic algorithm that forms the mathematical foundation of e-discovery. Hashing generates an effectively unique alphanumeric value to identify a particular computer file, group of files, or even an entire hard drive. Hashing also allows for the identification of particular files and the easy filtering of duplicate documents, a process called “de-duplication” that is essential to all e-discovery document processing.
  13. EXAMPLE: Let us say, for instance, that the hash values of a Word document I am working on now are:
MD5: 588BCBD1845342C10D9BBD1C23294459
SHA-1: C24AE3125BFDBCE01A27FDDA21B3A7E83FAFF69E
If I change only one comma in this multipage document, all else remaining the same, the hash values are now:
MD5: 5F0266C4C326B9A1EF9E39CB78C352DC
SHA-1: 4C37FC6257556E954E90755DEE5DB8CDA8D76710
Although the two files have only this trivial difference, there are no similarities between these hash values, which shows that hashing will detect even the slightest file alteration.
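The same effect can be reproduced with Python's standard `hashlib` module. The two byte strings below are made-up stand-ins for the Word document (they are not the file from the slide, so the values printed will not match the ones above); the point is only that removing a single comma changes both fingerprints.

```python
import hashlib

# Hypothetical stand-ins for the document before and after the edit.
original = b"Dear counsel, please find the attached agreement."
altered = b"Dear counsel please find the attached agreement."  # comma removed


def fingerprints(data: bytes) -> dict:
    """Compute the MD5 and SHA-1 values of the given bytes as uppercase hex."""
    return {
        "MD5": hashlib.md5(data).hexdigest().upper(),
        "SHA-1": hashlib.sha1(data).hexdigest().upper(),
    }


before = fingerprints(original)
after = fingerprints(altered)
for algorithm in ("MD5", "SHA-1"):
    print(f"{algorithm}: {before[algorithm]} -> {after[algorithm]}")
    print("  changed:", before[algorithm] != after[algorithm])
```

Running the function twice on the same bytes always gives the same value, which is what makes a hash usable as a fingerprint.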
  14. HASHING (cont.): Hashing can also be used to determine when fields or segments within files are identical, even though the files as a whole might be quite different (this may require software). For instance, you can hash only the body of an email, the actual message, to determine whether it is identical to another email, even when the “reference” or the “to” and “from” fields are different. This allows for an important filtering process called “near de-duplication.”
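A minimal sketch of hashing only the email body, using Python's standard `email` and `hashlib` modules. The two sample messages and the `body_hash` helper are hypothetical, and collapsing whitespace before hashing is an assumption of this sketch, not something the slide prescribes.

```python
import hashlib
from email import message_from_string


def body_hash(raw_email: str) -> str:
    """Hash only the message body, ignoring all header fields."""
    message = message_from_string(raw_email)
    body = message.get_payload()  # a plain string for simple, non-multipart mail
    normalized = " ".join(body.split())  # assumption: ignore whitespace differences
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


email_a = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Draft\n"
    "\n"
    "Please review the attached draft.\n"
)
email_b = (
    "From: alice@example.com\n"
    "To: carol@example.com\n"
    "Subject: FW: Draft\n"
    "\n"
    "Please review the attached draft.\n"
)

whole_a = hashlib.md5(email_a.encode("utf-8")).hexdigest()
whole_b = hashlib.md5(email_b.encode("utf-8")).hexdigest()
print("whole messages match:", whole_a == whole_b)                   # False
print("bodies match:", body_hash(email_a) == body_hash(email_b))     # True
```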
  15. THE MD5 AND WHY IT IS SIGNIFICANT! An MD5 hash helps e-discovery professionals both verify the integrity of transferred files and check the digital fingerprint of those files. When hash functions are applied, legal teams can quickly locate documents in different formats within a sizeable data collection. Additionally, through the use of pre-culling hashing tools, they can rapidly identify duplicate documents by comparing hash values.
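Both uses of MD5 mentioned here, checking a transferred file against an expected value and finding duplicates by comparing hash values, can be sketched as follows. The function names and sample files are hypothetical, and keeping the first file seen for each hash is an assumption of the sketch (real tools may choose which copy to keep by custodian, date or other rules).

```python
import hashlib
from collections import defaultdict


def matches_expected(data: bytes, expected_md5: str) -> bool:
    """Integrity check: does the received data have the expected MD5 value?"""
    return hashlib.md5(data).hexdigest() == expected_md5.lower()


def group_by_hash(files):
    """Map each MD5 value to the list of file names that share it."""
    groups = defaultdict(list)
    for name, data in files.items():
        groups[hashlib.md5(data).hexdigest()].append(name)
    return groups


def deduplicate(files):
    """Keep the first file seen for each distinct hash value."""
    return [names[0] for names in group_by_hash(files).values()]


files = {
    "a/report.doc": b"Q3 results",
    "b/report_copy.doc": b"Q3 results",  # exact duplicate under another name
    "c/notes.txt": b"meeting notes",
}
unique = deduplicate(files)
print(unique)  # ['a/report.doc', 'c/notes.txt']

sent_md5 = hashlib.md5(files["c/notes.txt"]).hexdigest()
print(matches_expected(b"meeting notes", sent_md5))  # True
```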
  16. NEAR DE-DUPE? WHAT IS THAT? Near de-duplication reveals whether two documents are similar. Think of having created a Word document entitled “mydog.doc”. Let us imagine that we have revised “mydog.doc” and saved it as “mydog_revisedversion1.doc”. Even though the Word file was saved with the intent of being the same document in revision mode, what makes this a near duplicate is the fact that the two drafts are essentially the same document but are not identical. Here is another example of near de-duplication: one file exists, “File AV1.0”. This file is then opened, spell checked, and saved as “File AV1.1”. These files are very similar and are classed as “near duplicates”.
  17. NEAR DE-DUPLICATION (continued): Building on our knowledge of de-duplication, near-de-duplication programs and software allow for an even higher level of data de-duplication because they identify files that are similar but are not bit-level duplicates. These near-de-duplication technologies help identify and group or tag electronic files that are “near duplicates” of one another even though they differ in content, metadata, or both. Examples of near duplicates include document versions, emails sent to multiple custodians, different parts of email chains, or similar proposals sent to several clients.
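Because a hash changes completely after even a one-character edit, near-duplicate detection has to compare content rather than hash values. The sketch below uses the standard library's `difflib.SequenceMatcher` on word sequences purely as an illustration; it is not a claim about how any commercial near-de-duplication product works, and the 0.8 threshold and sample texts are assumptions.

```python
from difflib import SequenceMatcher


def similarity(text_a: str, text_b: str) -> float:
    """Return a 0..1 similarity score based on matching runs of words."""
    return SequenceMatcher(None, text_a.split(), text_b.split()).ratio()


def is_near_duplicate(text_a: str, text_b: str, threshold: float = 0.8) -> bool:
    """Similar enough to group together, but not an exact duplicate."""
    return text_a != text_b and similarity(text_a, text_b) >= threshold


# Hypothetical "File AV1.0" and its spell-checked successor "File AV1.1".
v1_0 = "The quick brown fox jumps over teh lazy dog near the river bank"
v1_1 = "The quick brown fox jumps over the lazy dog near the river bank"
other = "Quarterly revenue figures are attached for your review"

print(is_near_duplicate(v1_0, v1_1))   # True: only one word differs
print(is_near_duplicate(v1_0, other))  # False: unrelated content
print(is_near_duplicate(v1_0, v1_0))   # False: an exact duplicate, not a near one
```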
  18. IT IS ALL ABOUT ORGANIZATION!!! Imagine you are digging through years of files in your local storage shed in an attempt to find one significant document related to a past lawsuit. You dig through 20 boxes of paper and can’t find that one document! Wouldn’t it be nice to have an efficient way of cutting a full day of searching down to just under an hour? Finding and grouping documents does this! In recent years, the finding and grouping of documents in e-discovery has also been enhanced by new pre-culling tools that go beyond traditional query methodology with concept and fuzzy searching. Historically, document sets were compiled with keyword searches and then narrowed by using fewer search terms.
  19. ALL ABOUT ORG. (cont.): Now, with the advent of concept clustering (i.e., foldering), advanced document analysis can help organize information more effectively by subject. This clustering capability greatly facilitates the review process by showing attorneys which subjects warrant the greatest attention or are most relevant to a particular case.
  20. DATA MAPPING: Data mapping software is one of the most powerful pre-culling tools. It provides the framework for visual analysis, showing users the different “points” across their continent of data. It can extract and index metadata and text from native files, create clusters based on any combination of attributes, and allow users to search and analyze document collections prior to full EDD processing.
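The idea of clustering on “any combination of attributes” can be illustrated with a simple count over already-extracted metadata. Everything in this sketch is hypothetical: the attribute names (`custodian`, `type`, `year`), the sample records and the `cluster_counts` helper are invented for illustration, and real data mapping software would extract this metadata from native files and present it visually.

```python
from collections import Counter

# Hypothetical metadata already extracted from a document collection.
documents = [
    {"custodian": "Smith", "type": "email", "year": 2010},
    {"custodian": "Smith", "type": "email", "year": 2011},
    {"custodian": "Smith", "type": "spreadsheet", "year": 2011},
    {"custodian": "Jones", "type": "email", "year": 2011},
]


def cluster_counts(docs, *attributes):
    """Count documents for each combination of the chosen attributes."""
    return Counter(tuple(doc[attr] for attr in attributes) for doc in docs)


by_custodian_and_type = cluster_counts(documents, "custodian", "type")
for cluster, count in by_custodian_and_type.most_common():
    print(cluster, count)

by_year = cluster_counts(documents, "year")
print(by_year[(2011,)])  # 3
```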
  21. DATA MAPPING (cont.): Data mapping applications should be able to remove duplicates in advance and can help legal professionals reduce document volume by as much as eighty percent. This is why mapping is significant. One huge benefit of data mapping is that it can give litigators direct control over the document collection. They can manipulate data themselves, in real time, without the need for vendor assistance or external processing.
  22. EXAMPLE OF WHAT A PRE-PROCESSING TOOL LOOKS LIKE
  23. CONCLUSION ON PRE-PROCESSING: The goal of the pre-processing stage is to enable clients to greatly reduce electronic data sizes at the earliest stages in the e-discovery lifecycle. It targets the relevance of the data, organizes it, and delivers cost efficiency, which equals happy clients!!!
