Sara's technology presentation pre processing
Pre-processing in E-Discovery. Educational PowerPoint for attorneys and paralegals.

Published in: Technology


  • 2. THE EDD CONSULTANT AND E-DISCOVERY The ultimate goal of electronic discovery is this: to obtain information that is truly relevant to the case. Legal entities and corporations demand that this information be processed, reviewed and ultimately produced in the most cost-effective manner.
  • 3. WHY DOES PRE-PROCESSING MATTER TO THE DISCOVERY PROCESS? It enables clients to greatly reduce electronic data sizes at the earliest stages of e-Discovery! This alone is very significant and enables cost-efficiency! Pre-processing is intended to de-NIST, de-dupe and apply data filters to efficiently cull (remove) large sets of data that may be deemed unnecessary or irrelevant to each case.
  • 4. A SUCCESSFUL PRE-PROCESSING PROGRAM SHOULD DO THE FOLLOWING: data cataloging (think of organizing files for quick and efficient access); file extension filtering; document-level and date/time filtering; file type identification; MD5 hash NIST filtering; and de-duplication.
  • 5. DeNISTing What is “NIST”? Why are “NIST” and de-NISTing significant? How does this apply to electronic discovery, and how does it influence technology?
  • 6. What is NIST? It is not a laying ground for eggs!!! The National Institute of Standards and Technology (“NIST”) publishes the National Software Reference Library (“NSRL”). The NSRL is basically the Library of Congress of software files: its comprehensive listing includes the files known to be distributed with software packages such as Microsoft Office. The proper method for excluding these known files is called DeNISTing.
  • 7. WHY IS de-NISTing IMPORTANT? In the typical electronic discovery case, DeNISTing alone will reduce the volume of information to be examined by 20%. From the perspective of large corporations, de-NISTing comes in handy: reduced volume means less money spent on the discovery process, and cost-efficiency comes into play.
  • 8. INTERESTING FACTS ABOUT “NIST” The NIST list contains over 28 million file signatures. It is used regularly by the FBI and other law enforcement entities to identify files with no evidentiary value. The list is free, and many e-Discovery companies take advantage of it by incorporating it into their software.
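The core of de-NISTing can be sketched in a few lines: hash each collected file and drop any file whose digest appears in the NIST/NSRL reference set. The function names and the tiny in-memory hash sets below are illustrative only, not any vendor's actual implementation; a real tool would load the full NSRL reference data set.

```python
import hashlib

def md5_of_file(path):
    """Compute the MD5 digest of a file, reading in chunks to handle large files."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest().upper()

def denist(file_hashes, nist_hashes):
    """Keep only files whose digest is NOT in the NIST/NSRL known-file set."""
    return {path: h for path, h in file_hashes.items() if h not in nist_hashes}
```

In use, every collected file's digest is computed with `md5_of_file` and the resulting mapping is passed through `denist`; only files absent from the reference set move on to review.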
  • 9. CONCERNS WITH de-NISTing While the NIST list is updated four times per year, it may not include important files. In the past, significant system files were not being removed during the de-NISTing process on workstations running Windows 7 and the latest release of Microsoft Office. This was a known problem in 2011, so at times de-NISTing is not foolproof.
  • 10. CONCERNS (continued) For a time, the NIST list did not include Windows 7 files, despite the fact that more than three hundred million workstations ran Windows 7; nor did it yet include Microsoft Office 2010 files. Supplementing the NIST list by removing system files such as EXE and DLL files is a clearly documentable method to reduce the number of files in the review set. This method does not depend on hash values and, assuming that these file types are not responsive (which is usually the case), can be an effective way to eliminate files from review.
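The extension-based supplement described above amounts to a simple filter, sketched below. The extension set is purely illustrative; a real matter would use a vetted, documented list agreed with counsel.

```python
import os

# Illustrative system-file extensions; an actual culling list would be
# documented and approved for the specific matter.
SYSTEM_EXTENSIONS = {".exe", ".dll", ".sys"}

def cull_system_files(paths):
    """Return only paths whose extension is not a known system-file type."""
    return [p for p in paths
            if os.path.splitext(p)[1].lower() not in SYSTEM_EXTENSIONS]
```

Because the rule is a plain extension match, the culling decision is easy to document and reproduce, which is exactly the advantage the slide describes.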
  • 11. What is “Hash”? When we think of “hash”, we are not referring to McDonald’s breakfast potatoes. From a technology perspective, think of a hash as an individual file’s digital fingerprint. The NSRL listing includes the names of the files, their typical file sizes and the “hash” value for each file.
  • 12. HASH – THINK ALGORITHMS!! When we think of hash, we should think: a cryptographic algorithm that forms a mathematical foundation of e-discovery. Hashing generates a unique alphanumeric value to identify a particular computer file, group of files, or even an entire hard drive. Hashing also allows for the identification of particular files and the easy filtration of duplicate documents, a process called “deduplication” that is essential to all e-discovery document processing.
  • 13. EXAMPLE Let us say for instance, the hash values of a Word document I am working on now are: 
MD5: 588BCBD1845342C10D9BBD1C23294459
If I only change one comma in this multipage document, all else remaining the same, the hash values are now: 
MD5: 5F0266C4C326B9A1EF9E39CB78C352DC
SHA-1: 4C37FC6257556E954E90755DEE5DB8CDA8D76710 
Although the two files have only this trivial difference, there are no similarities in these hash values, proving that hashing will detect even the slightest file alteration.
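The behavior shown above is easy to reproduce with Python's standard hashlib module: changing a single character yields a completely different digest. The sample strings below are ours, not the slide's Word document.

```python
import hashlib

original = b"The quick brown fox jumps over the lazy dog"
altered = b"The quick brown fox jumps over the lazy dog."  # one character added

# The two 32-character MD5 digests bear no resemblance to each other,
# even though the inputs differ by a single byte.
md5_original = hashlib.md5(original).hexdigest()
md5_altered = hashlib.md5(altered).hexdigest()
```

The same holds for SHA-1 (`hashlib.sha1`); this avalanche effect is what makes hashing reliable for detecting even trivial file alterations.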
  • 14. HASHING (cont.) Hashing can also be used to determine when fields or segments within files are identical, even though the entire file might be quite different (may require software). For instance, you can hash only the body of an email, the actual message, to determine whether it is identical with another email, even when the “reference” or the “to” and “from” fields are different. This allows for an important filtering process called “near de-duplication.”
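A minimal sketch of the field-level hashing just described, assuming emails are represented as simple dictionaries (a real tool would parse actual message formats): hashing only the body lets identical messages sent to different recipients compare equal.

```python
import hashlib

def body_hash(email):
    """Hash only the message body, ignoring to/from and other header fields."""
    return hashlib.md5(email["body"].encode("utf-8")).hexdigest()

# Same message sent to two different recipients: the header fields differ,
# but the body hashes match, flagging a near-duplicate pair.
msg1 = {"from": "alice@example.com", "to": "bob@example.com",
        "body": "Meeting moved to 3pm."}
msg2 = {"from": "alice@example.com", "to": "carol@example.com",
        "body": "Meeting moved to 3pm."}
```

Comparing `body_hash(msg1)` with `body_hash(msg2)` groups the two messages even though a whole-file hash would treat them as distinct.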
  • 15. THE MD5 AND WHY IT IS SIGNIFICANT! An MD5 message hash helps e-discovery professionals both verify the integrity of transferred files and check the digital signature of those files. When hash functions are applied, legal teams can quickly locate documents in different formats within a sizeable data collection. Additionally, through the use of pre-culling hashing tools, they can rapidly identify duplicate documents by comparing hash values.
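Duplicate identification by hash comparison, as described above, can be sketched as follows; the document names and contents are hypothetical.

```python
import hashlib

def deduplicate(docs):
    """Split documents into unique representatives and exact duplicates
    by comparing the MD5 digests of their contents."""
    seen = {}        # digest -> first document name seen with that content
    duplicates = []  # names whose content matched an earlier document
    for name, content in docs.items():
        digest = hashlib.md5(content).hexdigest()
        if digest in seen:
            duplicates.append(name)
        else:
            seen[digest] = name
    return list(seen.values()), duplicates
```

Only the unique representatives go forward to review; the duplicates list documents exactly what was culled and why.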
  • 16. NEAR DE-DUPE? WHAT IS THAT? Near de-duplication reveals when two documents are similar but not identical. Think of having created a Word document entitled “mydog.doc”. Now imagine we revise “mydog.doc” and save it as “mydog_revisedversion1.doc”. Even though the file was saved as a new revision of the same document, the two drafts are substantially the same yet not identical, and that is what makes this a near-duplicate. Here is another example of near de-duplication: a file exists called “File AV1.0”. This file is then opened, spell checked, and saved as “File AV1.1”. These files are very similar and are classed as near duplicates.
  • 17. NEAR DE-DUPLICATION (continued) Building on de-duplication, near-de-duplication programs and software allow for an even higher level of data reduction by identifying files that are similar but are not bit-level duplicates. These near-de-duplication technologies help identify and group/tag electronic files with “near duplicate” similarities even where there are differences in content, metadata, or both. Examples of near de-duplication include document versions, emails sent to multiple custodians, different parts of email chains, or similar proposals sent to several clients.
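Real near-de-duplication engines use proprietary similarity techniques, but as a rough illustration, Python's standard difflib can score textual similarity against a chosen cutoff. The 0.9 threshold here is our assumption for the sketch, not an industry standard.

```python
from difflib import SequenceMatcher

NEAR_DUPLICATE_THRESHOLD = 0.9  # illustrative cutoff, chosen for this sketch

def is_near_duplicate(text_a, text_b, threshold=NEAR_DUPLICATE_THRESHOLD):
    """Flag two texts as near duplicates when their similarity ratio
    (0.0 = no overlap, 1.0 = identical) meets the cutoff."""
    return SequenceMatcher(None, text_a, text_b).ratio() >= threshold
```

A document and its lightly spell-checked revision score close to 1.0 and are grouped together, while unrelated documents fall well below the cutoff and stay apart.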
  • 18. IT IS ALL ABOUT ORGANIZATION!!! Imagine you are digging through years of files in your local storage shed, attempting to find one significant document related to a past lawsuit. You dig through 20 boxes of paper and can’t find that one document! Wouldn’t it be nice to have an efficient way of reducing a full day of searching to just under an hour? The finding and grouping of documents does exactly this! In recent years, finding and grouping documents in e-discovery has also been enhanced by new pre-culling tools that go beyond simple query methodology to concept searching and fuzzy searching. Historically, document sets were compiled with keyword searches and then narrowed by using fewer search terms.
  • 19. ALL ABOUT ORG. (cont.) Now, with the advent of concept clustering (i.e., foldering), advanced document analysis can help organize information more effectively by subject. This clustering capability greatly facilitates the review process by showing attorneys which subjects warrant the greatest attention or relevance to a particular case.
  • 20. DATA MAPPING Data mapping software is one of the most powerful pre-culling tools. It provides the framework for visual analysis, showing users the different “points” across their continent of data. It extracts and indexes metadata and text from native files, creates clusters based on any combination of attributes, and allows users to search and analyze document collections prior to full EDD processing.
  • 21. DATA MAPPING (cont.) Data mapping applications should be able to remove duplicates in advance and can help legal professionals reduce document volume by as much as eighty percent. This is why mapping is significant. One huge benefit of data mapping is that it gives litigators direct control over the document collection. They can manipulate data themselves, in real time, without the need for vendor assistance or external processing.
  • 23. CONCLUSION ON PREPROCESSING The goal of pre-processing is to enable clients to greatly reduce electronic data sizes at the earliest stages of the e-Discovery lifecycle. It targets the relevance of the data, organizes it, and delivers cost efficiency – which equals happy clients!!!