Speaker note — Key Issue: Why are search and other semantically based technologies the most important ones in e-discovery?
    1. An Inevitable Reality: Machine-based eDiscovery Review<br />Jason R. Baron, Esq.<br />Director of Litigation, National Archives and Records Administration<br />jason.baron@nara.gov<br />James D. Shook, Esq.<br />Director, eDiscovery and Compliance Group, EMC Corporation<br />jim.shook@emc.com<br />
    2. The Problem<br />Technologies and Techniques<br />The "Unfolding Law" & Current Research<br />Question & Answer<br />
    3. Information Today – The Big Picture<br />1.8 ZB – Lots of It<br />95% – Mostly Unstructured<br />85% – Mostly Unmanaged<br />85% – Created by Organizations<br />▲ Becoming More Regulated<br />
    4. A Legal Crossroads<br />“[T]he legal profession is at a crossroads: the choice is between continuing to conduct discovery as it has ‘always been practiced’ in a paper world – before the advent of computers [and] the Internet . . . or, alternatively, embracing new ways of thinking in today’s digital world.”<br />The Sedona Conference, The Sedona Conference Commentary on Achieving Quality in the E-Discovery Process (2009)<br />
    5. Search v. Review<br />
    6. A Better Mousetrap<br /><ul><li>Scenario:</li></ul> - 10M docs (25% attachments)<br /> - Review rate: 50 docs per hour<br /> - 100 people, 10 hrs per day, 7 days a week, 52 weeks a year …<br /><ul><li>Result:</li></ul> - 28 weeks of review<br /> - $20 million in cost<br />
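The scenario on slide 6 can be checked with back-of-the-envelope arithmetic. A minimal sketch; the figures come from the slide, except the $100/hour loaded reviewer rate, which is an assumption chosen to reproduce the slide's ~$20 million total:

```python
# Manual-review scenario from slide 6: how long and how expensive?
docs = 10_000_000          # documents to review
rate_per_reviewer = 50     # documents reviewed per reviewer-hour
reviewers = 100
hours_per_day = 10
days_per_week = 7
cost_per_hour = 100        # ASSUMED loaded cost per reviewer-hour (USD)

total_hours = docs / rate_per_reviewer                      # 200,000 reviewer-hours
weeks = total_hours / (reviewers * hours_per_day * days_per_week)
cost = total_hours * cost_per_hour

print(f"{weeks:.1f} weeks")        # ≈ 28.6 weeks
print(f"${cost / 1e6:.0f}M")       # $20M
```

At 200,000 reviewer-hours, even a 100-person team working 70-hour weeks needs more than half a year, which is the slide's point about manual review not scaling.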
    7. FINDING RESPONSIVE DOCUMENTS IN A LARGE DATA SET: FOUR LOGICAL CATEGORIES<br />DOCUMENT SET<br />Relevant and Retrieved<br />Not Relevant and Retrieved (FALSE POSITIVES)<br />Relevant and Not Retrieved (FALSE NEGATIVES)<br />Not Relevant and Not Retrieved<br />
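The four logical categories on slide 7 are the standard 2×2 contingency table from information retrieval: every document is either relevant or not, and either retrieved or not. A small illustrative sketch (the function name and toy judgments are our own, not from the slides):

```python
from collections import Counter

def categorize(relevant: bool, retrieved: bool) -> str:
    """Place one document into the slide's four logical categories."""
    if relevant and retrieved:
        return "relevant and retrieved"           # true positive
    if not relevant and retrieved:
        return "not relevant and retrieved"       # false positive
    if relevant and not retrieved:
        return "relevant and not retrieved"       # false negative
    return "not relevant and not retrieved"       # true negative

# Toy (relevant?, retrieved?) judgments -- illustrative only.
judgments = [(True, True), (True, False), (False, True),
             (False, False), (True, True)]
counts = Counter(categorize(rel, ret) for rel, ret in judgments)
print(counts)
```

The false positives drive up review cost; the false negatives are the relevant documents a search silently misses.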
    8. The Problem<br />Technologies and Techniques<br />The "Unfolding Law" and Current Research<br />Question & Answer<br />
    9. Techniques<br />Advanced Search<br />Greater Interaction with Opposing Counsel<br />Iterative, tiered and phased approach<br />Project Management, Sampling, Quality Control<br />Jason R. Baron, Law in the Age of Exabytes: Some Further Thoughts on ‘Information Inflation’ and Current Issues in e-Discovery Search, XVII Rich. J.L. & Tech. 9 (2011), http://jolt.richmond.edu/v17i3/article9.pdf<br />
    10. Technology Tools<br />Greater Use Made of Boolean Strings<br />Fuzzy Search Models<br />Probabilistic models (Bayesian)<br />Statistical methods (clustering)<br />Machine learning approaches to semantic representation<br />Categorization tools: taxonomies and ontologies<br />Social network analysis<br />Hybrid approaches<br />Reference: Appendix to The Sedona Conference® Best Practices Commentary on the Use of Search and Information Retrieval Methods in E-Discovery (2007), available at http://www.thesedonaconference.org (link to publications)<br />
    11. Emerging New Predictive Strategies<br />Improved review and case assessment: cluster documents through use of software, with minimal human intervention at the front end to code a “seeded” data set<br />Slide adapted from Gartner Conference, June 23, 2010, Washington, D.C.<br />
    12. The Problem<br />Technologies and Techniques<br />The “Unfolding Law” and Current Research<br />Question & Answer<br />
    13. Unfolding Law<br />Fed. R. Civ. P. 1 (aim is to secure the just, speedy, economical determination of every action)<br />U.S. v. O’Keefe<br />Victor Stanley I<br />Privilege Concerns<br />
    14. Judge Facciola writing for the U.S. District Court for the District of Columbia<br />“Whether search terms or ‘keywords’ will yield the information sought is a complicated question involving the interplay, at least, of the sciences of computer technology, statistics and linguistics. See George L. Paul & Jason R. Baron, Information Inflation: Can the Legal System Adapt?, 13 RICH. J.L. & TECH. 10 (2007) * * * Given this complexity, for lawyers and judges to dare opine that a certain search term or terms would be more likely to produce information than the terms that were used is truly to go where angels fear to tread.”<br /> -- U.S. v. O’Keefe, 537 F. Supp. 2d 14, 24 (D.D.C. 2008)<br />
    15. Judge Grimm writing for the U.S. District Court for the District of Maryland<br />“[W]hile it is universally acknowledged that keyword searches are useful tools for search and retrieval of ESI, all keyword searches are not created equal; and there is a growing body of literature that highlights the risks associated with conducting an unreliable or inadequate keyword search or relying on such searches for privilege review.”<br />Victor Stanley, Inc. v. Creative Pipe, Inc., 250 F.R.D. 251 (D. Md. 2008); see id., text accompanying nn. 9 & 10 (citing the Sedona Search Commentary and the TREC Legal Track research project)<br />
    16. What is TREC?<br />Conference series co-sponsored by the National Institute of Standards and Technology (NIST) and the Advanced Research and Development Activity (ARDA) of the Department of Defense<br />Designed to promote research into the science of information retrieval<br />First TREC conference was in 1992<br />15th conference held November 15-17, 2006 in Gaithersburg, Maryland (NIST headquarters)<br />
    17. TREC Legal Track<br />The TREC Legal Track was designed to evaluate the effectiveness of search technologies in a real-world legal context<br />First study of its kind using nonproprietary data since the Blair/Maron research in 1985<br />Hypothetical complaints and 100+ “requests to produce” drafted by members of The Sedona Conference®<br />“Boolean negotiations” conducted as a baseline for search efforts<br />Documents to be searched were drawn from a publicly available 7 million document tobacco litigation Master Settlement Agreement database<br />New Interactive Task added in 2008 and continued in 2009, using Topic Authorities and a post-adjudication round<br />In 2009, a second data set (Enron) was added as a separate task<br />Participating teams of information scientists from around the world contributed computer runs, joined from 2008 through 2011 by legal service providers<br />Results from the 2010 round are currently being processed and will be posted on the TREC website soon<br />
    18. Recall & Precision – Team A<br /><ul><li>Recall = 30.5%</li><li>Precision = 7.7%</li><li>F1 = 12.3%</li></ul>
    19. Recall & Precision – Team B<br /><ul><li>Recall = 19.8%</li><li>Precision = 16.9%</li><li>F1 = 18.3%</li></ul>
    20. Recall & Precision – Team C<br /><ul><li>Recall = 76.2%</li><li>Precision = 84.4%</li><li>F1 = 80.1%</li></ul>
    21. Nobody Finds Everything<br />Source: TREC 2006 Legal Track<br />
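The three measures in the team results above are related by the standard information-retrieval definitions: recall is the fraction of all relevant documents that were retrieved, precision is the fraction of retrieved documents that are relevant, and F1 is their harmonic mean. A quick sketch checking the reported figures:

```python
def f1(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision."""
    return 2 * recall * precision / (recall + precision)

# Team C: recall 76.2%, precision 84.4% -> F1 of roughly 80.1%
print(round(f1(0.762, 0.844), 3))  # 0.801

# Team A: recall 30.5%, precision 7.7% -> F1 of roughly 12.3%
print(round(f1(0.305, 0.077), 3))  # 0.123
```

Because F1 is a harmonic mean, it is dragged down by whichever of recall or precision is worse, which is why Team A's high recall relative to Team B does not save its F1 score.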
    25. “Boolean” Searches May Miss a Large Percentage of Relevant Documents<br />78% of relevant documents were found only by some other technique<br />Source: TREC 2007 Legal Track<br />
    26. Interactive Task – Results from 2008 & 2009<br />Topic 102 (2008)<br />Topic 103 (2008)<br />Topic 104 (2008)<br />Topic 201 (2009)<br />Topic 202 (2009)<br />Topic 203 (2009)<br />Topic 204 (2009)<br />Topic 205 (2009)<br />Topic 206 (2009)<br />Topic 207 (2009)<br />Source: 2008/2009 TREC Legal Track<br />
    27. An Inevitable Reality: Machine-based eDiscovery Review<br />The Problem<br />Technologies and Techniques<br />The “Unfolding Law” and Current Research<br />Question & Answer<br />Jason R. Baron, Esq.<br />Director of Litigation, National Archives and Records Administration<br />jason.baron@nara.gov<br />James D. Shook, Esq.<br />Director, eDiscovery and Compliance Group, EMC Corporation<br />jim.shook@emc.com<br />www.kazeon.com/blog<br />
    28. Next Steps<br />Best practices white papers, analyst papers and more…<br />eDiscovery<br />kazeon.com<br />emc.com/ediscovery<br />Information Governance<br />emc.com/informationgovernance<br />emc.com/SourceOneCity<br />Upcoming events<br />Masters Conference<br />mastersconference.com<br />Best Practices eDiscovery webcasts (EMC + Masters Conf)<br />kazeon.com/newsroom2/webinars.php<br />
