Predictive Coding Legaltech
  1. Predictive Coding 2.0: Making E-Discovery More Efficient and Cost Effective
     John Tredennick, Jeremy Pickens, Jim Eidelman
  2. How Many Do I Have to Check?
     1. You have a bag with 1 million M&Ms.
     2. It contains mostly brown M&Ms.
     3. You cannot see into the bag.
     4. You have a scoop that pulls out 100 M&Ms at a time.
     5. Your hope is that there are no red M&Ms in the bag.
     6. You pull out a scoop and they are all brown.
     How many scoops do you need to review to be confident there are no red M&Ms?
  3. Let’s Take a Poll
     How many scoops? 1? 2? 3? 5? 10? 20? 100? 500? 1,000?
  4. How Confident Do You Need to Be?
     Does 95% work? How about 99%?
     How many errors can you tolerate?
     - Five out of a hundred?
     - One out of a hundred? (One percent of 1 million = 10,000 M&Ms.)
     At a 95% confidence level and a 5% margin of error: 384 M&Ms.
     At a 99% confidence level and a 1% margin of error: 459 M&Ms.
     At a 100% confidence level and a 0% margin of error: 1,000,000 M&Ms.
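The two sample sizes on this slide actually come from two different statistical recipes, which a short sketch can reproduce (the deck does not say which calculator it used, so this is a reconstruction). The 384 figure matches the standard Cochran sample-size formula at worst-case variance (p = 0.5) with a finite-population correction for N = 1,000,000. The 459 figure matches the zero-acceptance formula n = ceil(ln(1 - confidence) / ln(1 - tolerable rate)), which answers the slide's actual question: how many all-brown M&Ms must you draw before believing, at 99% confidence, that fewer than 1% are red?

```python
import math

def cochran(z, margin, population):
    """Cochran sample size at worst-case variance (p = 0.5),
    with finite-population correction."""
    n0 = z ** 2 * 0.25 / margin ** 2
    return math.ceil(n0 / (1 + (n0 - 1) / population))

def zero_acceptance(confidence, tolerable_rate):
    """Smallest n such that an all-clean sample gives `confidence`
    that the true defect rate is below `tolerable_rate`."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - tolerable_rate))

print(cochran(1.959964, 0.05, 1_000_000))   # 384 M&Ms at 95% / 5%
print(zero_acceptance(0.99, 0.01))          # 459 M&Ms at 99% / 1%
```

Note that 459 is not a "1% margin of error" in the Cochran sense; under the Cochran formula a 99%/1% sample of a million-item population would be far larger.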
  5. Predictive Coding
  6. Does It Work?
  7. What Have the Courts Said?
  8. What Have the Courts Said?
     “Until there is a judicial opinion approving (or even critiquing) the use of predictive coding, counsel will just have to rely on this article as a sign of judicial approval. In my opinion, computer-assisted coding should be used in those cases where it will help ‘secure the just, speedy, and inexpensive’ (Fed. R. Civ. P. 1) determination of cases in our e-discovery world.”
     (Magistrate Judge Andrew Peck)
  9. Predictive Coding 1.0
     1. Assemble your corpus.
     2. Assemble a seed set of documents.
     3. Review the seed set.
     4. Apply machine learning and automatically tag the remainder of the corpus.
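The four steps above can be sketched with a toy seed-set classifier. Everything here is a minimal illustration, not the deck's actual technology: a stdlib-only multinomial Naive Bayes with add-one smoothing stands in for whatever learner a real platform uses, and the `responsive`/`junk` labels mirror the coding calls discussed later in the deck.

```python
import math
from collections import Counter

def train(seed_set):
    """Steps 2-3: learn term statistics from a reviewed seed set.
    seed_set: list of (text, label) pairs, label in {'responsive', 'junk'}."""
    term_counts = {"responsive": Counter(), "junk": Counter()}
    doc_counts = Counter()
    for text, label in seed_set:
        doc_counts[label] += 1
        term_counts[label].update(text.lower().split())
    vocab = set(term_counts["responsive"]) | set(term_counts["junk"])
    return term_counts, doc_counts, vocab

def classify(text, term_counts, doc_counts, vocab):
    """Step 4: tag an unreviewed document with the more likely label
    (Naive Bayes with add-one smoothing, scored in log space)."""
    total_docs = sum(doc_counts.values())
    best_label, best_logp = None, float("-inf")
    for label, counts in term_counts.items():
        logp = math.log(doc_counts[label] / total_docs)
        denom = sum(counts.values()) + len(vocab)
        for word in text.lower().split():
            logp += math.log((counts[word] + 1) / denom)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label
```

With a seed set of a few coded pricing emails and picnic emails, `classify` tags a new "components pricing" document responsive and a new "company picnic" document junk, which is all Predictive Coding 1.0 promises when the corpus is complete.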
  10. Predictive Coding 1.0
      - Tremendous gains in review effectiveness
      - Substantial cost savings
      - It works, often quite well … when the corpus is complete.
  11. 67.5 uploads per case
      533 matters, nearly 36,000 uploads across those matters.
  12. 166.3 days loading a case
      This is driven by the pace of collection, not by loading limits.
  13. In which upload, and on which day, do your responsive documents show up?
      67 uploads, 166 days. Terms that do not appear early begin appearing later.
  14. Machine-Assisted Decision Making
      Upload timeline of a 6 TB case. When should machine-assisted decision making (e.g., early case assessment) begin? Is it here? Or here?
  15. Example: Responsive Early, Junk Later
      To:, From:, Subject: Company Picnic
      “Bob, would you coordinate with Alice and make sure we have enough hamburger buns for the company picnic? Please try and find them at a reasonable price.”
      Coding calls: Responsive / Junk
  16. Example: Junk Early, Responsive Later
      To:, From:, Subject: Get Together
      “Let’s get together at 7pm at the Sports Bar to discuss pricing of our components. The Broncos are playing and I really want to watch Tebow.”
      Coding calls: Junk / Responsive
  17. Problems With Predictive Coding 1.0
      The corpus is almost never complete
      - Continuous collection and rolling uploads
      - When does “early case assessment” begin?
      Changing issues
      - Responsiveness is “bursty”
      Shifting concept relationships
      - Due both to the growing corpus and to changing issues
      - Exploration is extremely limited
  18. Our Approach
      Predictive Coding 2.0 must handle dynamic change and flux. We have developed a flexible analytics framework based on bipartite graphs. It is aware of changes in the corpus and in coding, enabling smart review and adaptive related-concept suggestion as information pours in.
  19. Our Approach
      Avoid the lock-in that arises from poor decisions made early in the matter, when corpus (collection) and coding information is incomplete.
      Goal: Continuous Case Assessment
  20. What Is Underneath?
      A full bipartite graph of the documents and of the features (e.g., words, phrases, dates) that comprise those documents.
  21. [Diagram: bipartite graph of Terms and Documents]
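The Terms–Documents structure in the diagram can be sketched as two adjacency maps. This is a minimal stdlib-only model under assumptions of my own (the class name `BipartiteIndex` and its methods are hypothetical, not Catalyst's API): documents map to their terms and terms map back to their documents, so both sides of the graph grow incrementally as new uploads arrive, which is the property the following slides rely on.

```python
from collections import defaultdict

class BipartiteIndex:
    """Toy bipartite graph of documents and the terms they contain."""

    def __init__(self):
        self.doc_terms = defaultdict(set)   # document id -> its terms
        self.term_docs = defaultdict(set)   # term -> documents containing it

    def add_document(self, doc_id, text):
        """Fold a newly arrived document into both sides of the graph."""
        for term in text.lower().split():
            self.doc_terms[doc_id].add(term)
            self.term_docs[term].add(doc_id)

    def cooccurring_terms(self, term):
        """Terms that share at least one document with `term`."""
        related = set()
        for doc_id in self.term_docs.get(term, ()):
            related |= self.doc_terms[doc_id]
        return related - {term}
```

For example, after `add_document("d1", "pricing of components")`, `cooccurring_terms("pricing")` returns `{"of", "components"}`; loading a later document simply extends both maps without recomputing anything.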
  22. Feedback: Immediate and Continuous
      Continuous feedback aids better decision making and predictive coding. It adapts both to newly arriving coding information and to newly arriving documents and terms.
  23. [Diagram: bipartite graph of Terms and Documents]
  24. Predictive Coding 2.0
      Feedback, and improvement, is iterative, continuous, and amplified. The more you review, the less you have to review.
      [Chart: % of documents examined manually]
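The "the more you review, the less you have to review" loop can be sketched as a prioritized-review cycle. Everything below is illustrative rather than the deck's actual algorithm: the `reviewer` callable stands in for the human coder, and a simple term-weight ranking stands in for the real learner. Each round, the top-ranked batch is coded, the coding decisions feed back into the weights, and the remaining documents are re-ranked.

```python
from collections import Counter

def rank(docs, weights):
    """Order unreviewed docs by overlap with terms seen in responsive docs."""
    return sorted(docs,
                  key=lambda d: sum(weights[w] for w in d.lower().split()),
                  reverse=True)

def review_loop(corpus, reviewer, batch_size=2):
    """Iteratively review the top-ranked batch, fold the coding calls back
    into the term weights, and re-rank what remains."""
    weights = Counter()          # term -> evidence of responsiveness
    unreviewed = list(corpus)
    responsive = []
    while unreviewed:
        ranked = rank(unreviewed, weights)
        current, unreviewed = ranked[:batch_size], ranked[batch_size:]
        for doc in current:
            if reviewer(doc):    # human coding decision
                responsive.append(doc)
                weights.update(doc.lower().split())
    return responsive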
  25. Better Decisions as Understanding Improves
      Term relationships change over time. With continuous improvement, decisions can be revised and refined as the matter proceeds.
  26. [Diagram: Terms and Documents. Time uncovers new relationships.]
  27. Looking at Concepts Over Time
      Start with the key term “fuel”.
      At 20% of the review, the related terms are: lube, piping, battery, mounted, redundant, batteries, compartments, mixture, airflow, ansi, ventilation, chargers, stainless, rotor, bleed, accessory, plenum, detector.
      And at 65%: fuels, fob, purityethane, petrochemicals, fin, paraxylene, cif, phy, fwd, swopt, brentpartials, brg, locswap, benzene, diff, spd, liquids, opt.
  28. Related Terms Through Coding Filters
  29. [Diagram: Terms and Documents, split by Responsive / Non-Responsive coding]
  30. Putting Related Concepts to Work
      The whole corpus: a TREC collection with many topics identified.
      Topic 203: “…whether the Company had met, or could, would, or might meet its financial forecasts, models, projections, or plans…”
      Topic 205: “…analyses, evaluations, projections, plans, and reports on the volume(s) or geographic location(s) of energy loads.”
  31. “Model” in the Whole Collection
      Keyword: “model”. Scope: the whole collection.
      Term / Score: modeling 1000, equation 864, stochastic 706, variables 677, parameters 518, probability 365, simulation 337, assumption 325, returns 251, curves 211
  32. “Model” in Topic 203
      Keyword: “model”. Scope: Topic 203 (meeting financial forecasts).
      Term / Score: flows 1000, assumptions 913, gains 872, shares 864, liquidity 486, fluctuations 374, analysts 285, cents 254, whitewing 237, handles 166
  33. “Model” in Topic 205
      Keyword: “model”. Scope: Topic 205 (analyzing energy volumes).
      Term / Score: bids 1000, congestion 611, loads 455, constraints 354, clearing 292, zonal 194, signals 192, procure 190, dispatch 152, csc 120
  34. “Model” in Comparison
      Whole Corpus: modeling, equation, stochastic, variables, parameters, probability, simulation, assumption, returns, curves
      Topic 203: flows, assumptions, gains, shares, liquidity, fluctuations, analysis, cents, whitewing, handles
      Topic 205: bids, congestion, loads, constraints, clearing, zonal, signal, procure, dispatch, csc
      Now, imagine this with batches and coding changes over time!
      Note: Our system can accept any combination of coding and metadata filters to dynamically assess your data.
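The scoped term lists on slides 31–34 can be imitated with a simple co-occurrence score, normalized so the top term scores 1000 as the slides do. This is a hypothetical stand-in for the deck's actual scoring function, but it demonstrates the key effect: restricting the input documents to one topic changes which neighbors the same keyword has.

```python
from collections import Counter

def related_terms(docs, keyword, top=5):
    """Score terms by how often they co-occur with `keyword` in the
    same document, normalized so the top term scores 1000."""
    co = Counter()
    for doc in docs:
        words = set(doc.lower().split())
        if keyword in words:
            co.update(words - {keyword})
    if not co:
        return []
    top_count = co.most_common(1)[0][1]
    return [(t, round(1000 * c / top_count)) for t, c in co.most_common(top)]
```

Scoping is then just filtering the input, e.g. `related_terms([d for d in corpus if topic_of[d] == "203"], "model")` versus `related_terms(corpus, "model")`, and adding a coding filter (responsive only) is the same one-line change.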
  35. Summary
      Incomplete collections and changing coding calls wreak havoc on machine coding.
  36. Predictive Coding 2.0
      Problem: The corpus is almost never complete. Answer: Review algorithms that are iterative and continuous.
      Problem: Changing issues. Answer: Review algorithms that are adaptive and continuous.
      Problem: Shifting concept relationships. Answer: Concept relationships that are calculated dynamically, on-the-fly, and coding-aware.
      Continuous Case Assessment
  37. Analytics Consulting
      - Analytics consulting and predictive ranking for nearly 4 years
      - How it started, before “predictive coding” became popular: “Can’t you predict what documents are probably relevant based on your review so far?” (Judge, S.D.N.Y.)
      - Predictive Ranking: iterative search techniques plus algorithms
      - Then off-the-shelf Predictive Coding 1.0 technologies
      - Catalyst’s research is exciting! We apply the research to real-world scenarios.
      Applying Bipartite Analytics…
  38. Smart Review with the Bipartite Analytics Technology
      Advantages:
      - Accurate
      - Dynamic
      - Flexible
      - “Just in time” suggestions
  39. Smart Review Scenarios
      1. “What happened” (examples: FCPA investigation, conspiracy ECA)
      2. Typical large-scale litigation with lots of ESI (e.g., a class action lawsuit)
      3. Highly complex litigation with multiple issues (e.g., patent and unfair competition claims)
  41. Scenario 1: What Happened?
      Goal: Rapidly determine the facts and resolve the matter if possible.
      Applying the technology: a small number of knowledgeable attorneys drill into documents using the fusion of advanced search features and flexible predictive coding.
      - Faster location of valuable “veins” of information through search filters
      - Rapid learning, applied through flexible, “just in time” Predictive Coding 2.0
      - “Choose your own adventure”
  42. Scenario 2: Large-Scale Litigation
      Goal: Minimize cost through learning across a large document set, increase quality with focused review, and maximize protection of privilege and trade secrets.
      Applying the technology:
      - Prioritized review based on rapid, continuous learning
      - Large-scale defensible culling
      - More accurate ranking of “potentially privileged” documents
  43. Scenario 3: Highly Complex Litigation
      Goal: Review and produce with multiple and changing issues.
      Applying the technology:
      - Rapid learning across multiple topics
      - Leverage the ability to adjust as topics change
      - Review quality improves because of focus
      - Explore otherwise hidden subjects with Concept Explorer
      - Leverage learning across narrow, focused lines of inquiry (e.g., emails between two people in a narrow time window)
      - Protect privileged documents
  44. Predictive Coding 2.0: Making E-Discovery More Efficient and Cost Effective
      John Tredennick, Jeremy Pickens, Jim Eidelman