Predictive Coding Legaltech

  • 1. Predictive Coding 2.0: Making E-Discovery More Efficient and Cost Effective. John Tredennick, Jeremy Pickens, Jim Eidelman
  • 2. How Many Do I Have to Check?
    1. You have a bag with 1 million M&Ms.
    2. It contains mostly brown M&Ms.
    3. You cannot see into the bag.
    4. You have a scoop that will pull out 100 M&Ms at a time.
    5. Your hope is that there are no red M&Ms in the bag.
    6. You pull out a scoop and they are all brown.
    How many scoops do you need to review to be confident there are no red M&Ms?
  • 3. Let’s Take a Poll: How many scoops? 1? 2? 3? 5? 10? 20? 100? 500? 1,000?
  • 4. How Confident Do You Need to Be? Does 95% work? How about 99%? How many errors can you tolerate?
    §  Five out of a hundred?
    §  One out of a hundred?
    §  One percent of 1 million = 10,000
    At a 95% confidence level and 5% margin of error: 384 M&Ms
    At a 99% confidence level and 1% margin of error: 459 M&Ms
    At a 100% confidence level and 0% margin of error: 1,000,000 M&Ms
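The sample sizes above follow the standard formula for estimating a proportion, n = z²·p(1−p)/e², evaluated at the worst case p = 0.5. A minimal sketch, where the function name and the z-score table are illustrative, not from the deck:

```python
# Hypothetical helper: required sample size for estimating a proportion at a
# given confidence level and margin of error, using the normal approximation
# n = z^2 * p(1-p) / e^2 with worst-case prevalence p = 0.5.
import math

# Two-sided z-scores for common confidence levels (assumed lookup table).
Z = {0.95: 1.96, 0.99: 2.576}

def sample_size(confidence, margin, population=None):
    """Sample size for a proportion; optional finite-population correction."""
    z = Z[confidence]
    n = (z ** 2) * 0.25 / (margin ** 2)      # infinite-population estimate
    if population is not None:               # finite-population correction
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)

print(sample_size(0.95, 0.05))                        # 385 (the slide rounds to 384)
print(sample_size(0.95, 0.05, population=1_000_000))  # 385
```

With the finite-population correction for the 1,000,000 M&M bag the answer barely moves, which is why the required sample depends on confidence and margin far more than on population size.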
  • 5. Predictive Coding
  • 6. Does it Work?
  • 7. What Have the Courts Said?
  • 8. What Have the Courts Said? “Until there is a judicial opinion approving (or even critiquing) the use of predictive coding, counsel will just have to rely on this article as a sign of judicial approval. In my opinion, computer-assisted coding should be used in those cases where it will help ‘secure the just, speedy, and inexpensive’ (Fed. R. Civ. P. 1) determination of cases in our e-discovery world.” Magistrate Judge Andrew Peck
  • 9. Predictive Coding 1.0 1.  Assemble your corpus. 2.  Assemble a seed set of documents. 3.  Review the seed set. 4.  Apply machine learning and automatically tag the remainder of the corpus.
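The four steps above can be sketched end to end. Everything here, the toy documents, the bag-of-words features, and the nearest-centroid scorer, is a hypothetical stand-in for a real classifier:

```python
# Sketch of the Predictive Coding 1.0 workflow: review a seed set by hand,
# then apply the learned model to auto-tag the rest of the corpus.
from collections import Counter

def bow(text):
    """Bag-of-words feature vector for one document."""
    return Counter(text.lower().split())

def centroid(vectors):
    """Sum of feature vectors, used as a simple class prototype."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

# Steps 1-3: assemble the corpus, pull a seed set, review the seed set.
seed = [("price fixing agreement for components", "responsive"),
        ("company picnic hamburger buns", "junk")]
corpus = ["components pricing discussion", "picnic buns and hot dogs"]

# Step 4: tag the remainder of the corpus with the learned model.
resp = centroid(bow(t) for t, tag in seed if tag == "responsive")
junk = centroid(bow(t) for t, tag in seed if tag == "junk")
tags = ["responsive" if cosine(bow(d), resp) > cosine(bow(d), junk) else "junk"
        for d in corpus]
print(tags)  # ['responsive', 'junk']
```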
  • 10. Predictive Coding 1.0
    §  Tremendous gains in review effectiveness
    §  Substantial cost savings
    §  It works, often quite well… when the corpus is complete.
  • 11. 67.5 uploads per case: 533 matters, nearly 36,000 uploads across the matters.
  • 12. 166.3 days loading a case. This is collection driven, not loading limits.
  • 13. In which upload, and on which day, do your responsive documents show up? 67 uploads, 166 days. Terms that do not appear early begin appearing later.
  • 14. Machine-Assisted Decision Making. Upload timeline of a 6 TB case. When should machine-assisted decision making (e.g., early case assessment) begin? Is it here? Or here?
  • 15. Example: Responsive Early, Junk Later. To: / From: / Subject: Company Picnic. “Bob, would you coordinate with Alice and make sure we have enough hamburger buns for the company picnic? Please try and find them at a reasonable price.” [Labels: Responsive, Junk]
  • 16. Example: Junk Early, Responsive Later. To: / From: / Subject: Get Together. “Let’s get together at 7pm at the Sports Bar to discuss pricing of our components. The Broncos are playing and I really want to watch Tebow.” [Labels: Junk, Responsive]
  • 17. Problems With Predictive Coding 1.0
    The corpus is almost never complete
    §  Continuous collection and rolling uploads
    §  When does “Early Case Assessment” begin?
    Changing issues
    §  Responsiveness is “bursty”
    Shifting concept relationships
    §  Due both to the increasing corpus and to changing issues
    §  Exploration is extremely limited
  • 18. Our Approach: Predictive Coding 2.0 necessitates the ability to deal with dynamic change and flux. We have developed a flexible analytics framework based on bipartite graphs. It is aware of changes in the corpus and in coding, so as to enable smart review and adaptive related-concept suggestion as information pours in.
  • 19. Our Approach: Avoid the lock-in that arises from poor decision making early in the matter, when corpus (collection) and coding information is incomplete. Goal: Continuous Case Assessment.
  • 20. What Is Underneath? A full bipartite graph of the documents and features (e.g. words, phrases, dates) that comprise those documents
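A minimal sketch of such a graph, assuming nothing about the real implementation beyond the document/feature split described on the slide; class and method names are illustrative:

```python
# Bipartite document/feature index: one node set for documents, one for
# features (terms), with an edge wherever a term occurs in a document.
from collections import defaultdict

class BipartiteIndex:
    def __init__(self):
        self.doc_to_terms = defaultdict(set)   # document -> its features
        self.term_to_docs = defaultdict(set)   # feature  -> docs containing it

    def add_document(self, doc_id, text):
        """Rolling uploads just add edges; nothing is recomputed up front."""
        for term in text.lower().split():
            self.doc_to_terms[doc_id].add(term)
            self.term_to_docs[term].add(doc_id)

    def related_terms(self, term):
        """Terms reachable in two hops: co-occurring in a shared document."""
        related = set()
        for doc in self.term_to_docs[term]:
            related |= self.doc_to_terms[doc]
        related.discard(term)
        return related

idx = BipartiteIndex()
idx.add_document("d1", "fuel piping ventilation")
idx.add_document("d2", "fuel batteries")
idx.add_document("d3", "picnic buns")
print(sorted(idx.related_terms("fuel")))  # ['batteries', 'piping', 'ventilation']
```

Because new documents and new terms only add nodes and edges, this structure naturally accommodates rolling collections.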
  • 21. [Diagram: bipartite graph linking Terms to Documents]
  • 22. Feedback: Immediate and Continuous. Continuous feedback aids better decision making and predictive coding. Adapts to both: new arrival of coding information, and new arrival of documents and terms.
  • 23. [Diagram: bipartite graph linking Terms to Documents]
  • 24. Predictive Coding 2.0: Feedback, and improvement, is iterative, continuous, amplified. The more you review, the less you have to review. [Chart: % of docs examined manually]
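The iterative loop can be sketched as: rank what is unreviewed, code the top batch, and let that coding reshape the next ranking. The `overlap_score` function below is a deliberately naive stand-in (term overlap with responsive documents); any relevance model would slot in:

```python
# Schematic active-review loop: each batch of human coding feeds straight
# back into the ranking of what remains. All names here are hypothetical.
def continuous_review(corpus, oracle, score, batch_size=2):
    """Simulate continuous review; `oracle` plays the human reviewer."""
    coded = {}                                   # doc -> human judgment
    unreviewed = list(corpus)
    while unreviewed:
        # Rank what is left using everything coded so far.
        unreviewed.sort(key=lambda d: score(d, coded), reverse=True)
        batch, unreviewed = unreviewed[:batch_size], unreviewed[batch_size:]
        for doc in batch:
            coded[doc] = oracle(doc)             # new coding arrives...
        # ...and the next iteration's ranking immediately reflects it.
    return coded

def overlap_score(doc, coded):
    """Toy relevance model: word overlap with responsive-coded documents."""
    resp_terms = set()
    for d, tag in coded.items():
        if tag == "responsive":
            resp_terms |= set(d.split())
    return len(resp_terms & set(doc.split()))

docs = ["price fixing memo", "price list", "picnic invite"]
judgments = continuous_review(
    docs, lambda d: "responsive" if "price" in d else "junk", overlap_score)
print(judgments)
```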
  • 25. Better Decisions As Understanding Improves. Term relationships change over time. Using continuous improvement, decisions can be revised and refined as the matter proceeds.
  • 26. [Diagram: Terms-to-Documents graph. Time uncovers new relationships.]
  • 27. Looking at Concepts Over Time. Start with the key term “fuel”. [Chart: two related-term lists, at 20% and at 65% of review; terms include lube, fuels, piping, fob, battery, purity, ethane, mounted, petrochemicals, redundant, fin, batteries, paraxylene, compartments, cif, mixture, phy, airflow, fwd, ansi, swopt, ventilation, brent, partials, chargers, brg, stainless, loc, swap, rotor, benzene, bleed, diff, accessory, spd, plenum, liquids, detector, opt.]
  • 28. Related Terms Through Coding Filters
  • 29. [Diagram: Terms-to-Documents graph filtered by coding: Responsive vs. Non-Responsive]
  • 30. Putting Related Concepts to Work. The whole corpus: a TREC collection with many topics identified. Topic 203: “…whether the Company had met, or could, would, or might meet its financial forecasts, models, projections, or plans…” Topic 205: “…analyses, evaluations, projections, plans, and reports on the volume(s) or geographic location(s) of energy loads.”
  • 31. “Model” In the Whole Collection. Look at the keyword “model”. Scope is the whole collection.
    Term (score): modeling (1000), equation (864), stochastic (706), variables (677), parameters (518), probability (365), simulation (337), assumption (325), returns (251), curves (211)
  • 32. “Model” In Topic 203. Look at the keyword “model”. Scope: Topic 203 (meeting financial forecasts).
    Term (score): flows (1000), assumptions (913), gains (872), shares (864), liquidity (486), fluctuations (374), analysts (285), cents (254), whitewing (237), handles (166)
  • 33. “Model” In Topic 205. Look at the keyword “model”. Scope: Topic 205 (analyzing energy volumes).
    Term (score): bids (1000), congestion (611), loads (455), constraints (354), clearing (292), zonal (194), signals (192), procure (190), dispatch (152), csc (120)
  • 34. “Model” In Comparison. Now, imagine this with batches and coding changes over time!
    Whole Corpus: modeling, equation, stochastic, variables, parameters, probability, simulation, assumption, returns, curves
    Topic 203: flows, assumptions, gains, shares, liquidity, fluctuations, analysis, cents, whitewing, handles
    Topic 205: bids, congestion, loads, constraints, clearing, zonal, signal, procure, dispatch, csc
    Note: Our system can accept any combination of coding and metadata filters to dynamically assess your data.
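The scope-filtered related-term lists above can be approximated with plain co-occurrence counting, normalized so the top term scores 1000 as in the tables. The scoring function here is an assumption for illustration; the deck does not describe the product’s actual formula:

```python
# Toy scope-filtered related terms: score each term by how often it co-occurs
# with a keyword, restricted to whatever document filter is active (whole
# corpus, one topic, a coding call, ...). Normalized so the top term = 1000.
from collections import Counter

def related_terms(docs, keyword, doc_filter=lambda d: True, top=5):
    counts = Counter()
    for doc in docs:
        terms = set(doc["text"].lower().split())
        if keyword in terms and doc_filter(doc):
            counts.update(terms - {keyword})
    if not counts:
        return []
    peak = counts.most_common(1)[0][1]
    return [(t, round(1000 * c / peak)) for t, c in counts.most_common(top)]

docs = [
    {"topic": 203, "text": "model assumptions flows"},
    {"topic": 203, "text": "model flows gains"},
    {"topic": 205, "text": "model bids congestion"},
]
# Same keyword, different scope, different related terms:
print(related_terms(docs, "model"))                                    # whole corpus
print(related_terms(docs, "model", doc_filter=lambda d: d["topic"] == 203))
```

Swapping the `doc_filter` for a coding filter (e.g., documents marked responsive) gives the coding-aware variant described on the following slides.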
  • 35. Summary: Incomplete collections and changing coding calls wreak havoc for machine coding.
  • 36. Predictive Coding 2.0
    Problem: The corpus is almost never complete. Answer: Review algorithms that are iterative and continuous.
    Problem: Changing issues. Answer: Review algorithms that are adaptive and continuous.
    Problem: Shifting concept relationships. Answer: Concept relationships that are calculated dynamically, on-the-fly, and coding-aware.
    Continuous Case Assessment
  • 37. Analytics Consulting
    §  Analytics consulting and predictive ranking for nearly 4 years
    §  How it started, before “Predictive Coding” became popular: “Can’t you predict what documents are probably relevant based on your review so far?” (a judge, S.D.N.Y.)
    §  Predictive Ranking: iterative search techniques + algorithms
    §  Then off-the-shelf Predictive Coding 1.0 technologies
    §  Catalyst’s research is exciting! We apply the research to real-world scenarios. Applying bipartite analytics…
  • 38. Smart Review with the Bipartite Analytics Technology. Advantages:
    §  Accurate
    §  Dynamic
    §  Flexible
    §  “Just in Time” suggestions
  • 39. Smart Review Scenarios
    1. “What happened?” Examples: FCPA investigation, conspiracy ECA.
    2. Typical large-scale litigation with lots of ESI, e.g., a class action lawsuit.
    3. Highly complex litigation with multiple issues, e.g., patent and unfair competition claims.
  • 40. Scenario 1: What happened? Goal: Rapidly determine facts and resolve the matter if possible. Applying the technology: a small number of knowledgeable attorneys drill into documents using the fusion of advanced search features and flexible predictive coding.
  • 41. Scenario 1 – What happened? Goal: Rapidly determine facts and resolve matter if possible Applying the Technology Small number of knowledgeable attorneys drill into documents using the fusion of advanced search features and flexible predictive coding. §  Faster location of valuable “veins” of information due to search filters §  Rapid learning and application of that learning through flexible, “just in time” predictive coding 2.0. §  “Choose your own adventure”
  • 42. Scenario 2: Large-Scale Litigation. Goal: Minimize cost through learning across a large document set, increase quality with focused review, and maximize protection of privilege and trade secrets.
    Applying the technology:
    §  Prioritized review based on rapid, continuous learning
    §  Large-scale defensible culling
    §  More accurate ranking of “potentially privileged” documents
  • 43. Scenario 3: Highly Complex Litigation. Goal: Review and produce with multiple and changing issues.
    Applying the technology:
    §  Rapid learning across multiple topics
    §  Leverage the ability to adjust for changes in topics
    §  Review quality improves because of focus
    §  Explore otherwise hidden subjects with Concept Explorer
    §  Leverage learning across narrow, focused lines of inquiry (e.g., emails between two people in a narrow time window)
    §  Protect privileged documents
  • 44. Predictive Coding 2.0: Making E-Discovery More Efficient and Cost Effective. John Tredennick, Jeremy Pickens, Jim Eidelman