This document presents Predictive Coding 2.0 as an improved method for e-discovery. Predictive Coding 1.0 struggled with document collections that are incomplete and continuously updated, and with coding calls that change over time. Predictive Coding 2.0 uses a flexible analytics framework based on bipartite graphs that dynamically assesses documents and adapts to new information as the collection and the coding change, enabling continuous case assessment. The authors give examples of how Predictive Coding 2.0 could improve e-discovery in complex litigation matters.
Predictive Coding Legaltech
1. Predictive Coding 2.0
Making E-Discovery
More Efficient and Cost Effective
John Tredennick
Jeremy Pickens
Jim Eidelman
2. How Many Do I Have to Check?
1. You have a bag with 1 million M&Ms
2. It contains mostly brown M&Ms
3. You cannot see into the bag
4. You have a scoop that will pull out 100 M&Ms at a time
5. Your hope is that there are no red M&Ms in the bag
6. You pull out a scoop and they are all brown
How many scoops do you need to review to be confident there are no red M&Ms?
3. Let’s Take a Poll
How many scoops? 1? 2? 3? 5? 10? 20? 100? 500? 1,000?
4. How Confident Do You Need to Be?
Does 95% work? How about 99%?
How many errors can you tolerate?
§ Five out of a hundred?
§ One out of a hundred?
§ One percent = 10,000 M&Ms
At a 95% confidence level and 5% margin of error: 384 M&Ms
At a 99% confidence level and 1% margin of error: 459 M&Ms
At a 100% confidence level and 0% margin of error: 1,000,000 M&Ms
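For reference, the sampling math behind figures like these is the standard normal-approximation formula for estimating a proportion. Below is a minimal Python sketch; it reproduces roughly the 384-document figure for the 95%/5% case, while the deck's other figures depend on additional assumptions (such as the expected prevalence of "red" documents) that are not shown on the slide.

```python
import math

def sample_size(confidence, margin, p=0.5, population=None):
    """Approximate sample size for estimating a proportion.

    Uses the usual normal-approximation formula n = z^2 * p(1-p) / e^2,
    optionally with a finite-population correction. p=0.5 is the most
    conservative (largest-sample) assumption.
    """
    z = {0.90: 1.645, 0.95: 1.960, 0.99: 2.576}[confidence]  # two-sided z-scores
    n = (z ** 2) * p * (1 - p) / margin ** 2
    if population is not None:
        # finite-population correction for a fixed bag of documents
        n = n / (1 + (n - 1) / population)
    return n

# ~384 documents out of 1,000,000 at 95% confidence / 5% margin of error
print(math.ceil(sample_size(0.95, 0.05, population=1_000_000)))
```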
8. What Have the Courts Said?
“Until there is a judicial opinion approving (or even critiquing) the use of predictive coding, counsel will just have to rely on this article as a sign of judicial approval. In my opinion, computer-assisted coding should be used in those cases where it will help ‘secure the just, speedy, and inexpensive’ (Fed. R. Civ. P. 1) determination of cases in our e-discovery world.”
Magistrate Judge Andrew Peck
9. Predictive Coding 1.0
1. Assemble your corpus.
2. Assemble a seed set of documents.
3. Review the seed set.
4. Apply machine learning and automatically tag the remainder of the corpus.
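As a concrete illustration of this one-shot workflow, here is a minimal sketch using TF-IDF features and logistic regression. The learner, the toy documents, and the seed labels are all placeholders for illustration, not the technology behind any particular review platform.

```python
# Predictive Coding 1.0 sketch: code a seed set once, then tag the rest.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

corpus = [
    "pricing of our components",                  # seed: responsive
    "hamburger buns for the company picnic",      # seed: junk
    "let's discuss component pricing at 7pm",     # uncoded
    "please order napkins for the picnic",        # uncoded
]
seed_idx, seed_labels = [0, 1], [1, 0]            # 1 = responsive, 0 = junk

vec = TfidfVectorizer().fit(corpus)
X = vec.transform(corpus)

# 3. Review the seed set; 4. learn from it and tag the remainder.
model = LogisticRegression().fit(X[seed_idx], seed_labels)
for i in range(len(corpus)):
    if i not in seed_idx:
        p = model.predict_proba(X[i])[0, 1]
        print(f"doc {i}: P(responsive) = {p:.2f}")
```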
10. Predictive Coding 1.0
§ Tremendous gains in review effectiveness
§ Substantial cost savings
§ It works. Often quite well… when the corpus is complete.
13. In which upload and on which day do your responsive documents show up?
67 uploads over 166 days.
Terms that do not appear early begin appearing later.
14. Machine-Assisted Decision Making
Upload timeline of a 6 TB case. When should machine-assisted decision making (e.g., early case assessment) begin? Is it here? Or here?
15. Example: Responsive Early, Junk Later
To: bob@company.com, alice@company.com
From: charles@company.com
Subject: Company Picnic
Bob, would you coordinate with Alice and make sure we have enough hamburger buns for the company picnic? Please try and find them at a reasonable price.
Responsive Junk
16. Example: Junk Early, Responsive Later
To: bob@company.com, alice@privatemail.com
From: charles@company.com
Subject: Get Together
Let’s get together at 7pm at the Sports Bar to discuss pricing of our components. The Broncos are playing and I really want to watch Tebow.
Junk
Responsive
17. Problems With Predictive Coding 1.0
The corpus is almost never complete
§ Continuous collection and rolling uploads
§ When does “Early Case Assessment” begin?
Changing Issues
§ Responsiveness is “bursty”
Shifting Concept Relationships
§ Due both to the growing corpus and to changing issues
§ Exploration is extremely limited
18. Our Approach
Predictive Coding 2.0 necessitates the ability to deal with dynamic change and flux.
We have developed a flexible analytics framework based on bipartite graphs.
It is aware of changes in the corpus and in coding so as to enable smart review and adaptive related-concept suggestion as information pours in.
19. Our Approach
Avoid the lock-in that arises from poor decisions made early in the matter, when corpus (collection) and coding information is incomplete.
Goal:
Continuous Case Assessment
20. What Is Underneath?
A full bipartite graph of the documents and the features (e.g., words, phrases, dates) that comprise those documents.
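To make the idea concrete, here is a minimal sketch of such a document/feature bipartite graph, using networkx as a stand-in data structure. The node names, weights, and the simple two-hop "related features" query are illustrative assumptions, not the actual Catalyst implementation.

```python
# Document/feature bipartite graph sketch: documents on one side,
# features (words, phrases, dates) on the other, edges weighted by usage.
import networkx as nx

G = nx.Graph()

def add_document(doc_id, features):
    """Add a document node and connect it to its feature nodes.

    Called again whenever a new upload arrives, so the graph grows with
    the collection instead of being built once from a frozen corpus.
    """
    G.add_node(doc_id, kind="document")
    for feat, weight in features.items():
        G.add_node(feat, kind="feature")
        G.add_edge(doc_id, feat, weight=weight)

add_document("doc-001", {"fuel": 3, "piping": 1, "2001-10-12": 1})
add_document("doc-002", {"fuel": 1, "benzene": 2})

# Terms related to "fuel": features that co-occur with it in some document,
# found by walking feature -> document -> feature.
related = {f2 for d in G.neighbors("fuel") for f2 in G.neighbors(d) if f2 != "fuel"}
print(related)  # {'piping', '2001-10-12', 'benzene'}
```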
22. Feedback: Immediate and Continuous
Continuous feedback aids better decision making and predictive coding.
Adapts to both:
New arrival of coding information
New arrival of documents and terms
24. Predictive Coding 2.0
Feedback – and improvement – is iterative, continuous, amplified.
The more you review, the less you have to review.
[Chart: review effort vs. % of Docs Examined Manually]
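A minimal sketch of what such a continuous feedback loop can look like: after every batch of coding decisions the model is refit and the remaining documents re-ranked, so new uploads and new coding calls are folded in immediately. The function, the toy documents, and the labels are illustrative assumptions, not Catalyst's API.

```python
# Continuous review loop sketch: refit after each batch of coding and
# re-rank the uncoded documents, so the reviewers always see the docs
# the current model thinks are most likely responsive.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def rerank(docs, coded):
    """Refit on everything coded so far and rank the uncoded docs."""
    vec = TfidfVectorizer().fit(docs)             # refit: new uploads bring new terms
    X = vec.transform(docs)
    ids = sorted(coded)
    model = LogisticRegression().fit(X[ids], [coded[i] for i in ids])
    uncoded = [i for i in range(len(docs)) if i not in coded]
    scores = model.predict_proba(X[uncoded])[:, 1]
    return sorted(zip(uncoded, scores), key=lambda t: -t[1])

docs = ["component pricing", "company picnic", "pricing meeting", "picnic buns"]
coded = {0: 1, 1: 0}                              # doc id -> 1 responsive / 0 junk
for doc_id, score in rerank(docs, coded):
    print(doc_id, round(score, 2))                # review the top-ranked doc next

docs.append("new upload about pricing strategy")  # rolling upload arrives
coded[2] = 1                                      # new coding call arrives
for doc_id, score in rerank(docs, coded):
    print(doc_id, round(score, 2))                # ranking adapts to both changes
```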
25. Better Decisions As Understanding Improves
Term relationships change over time.
Using continuous improvement, decisions can be revised and refined as the matter proceeds.
26. Time uncovers new relationships.
[Diagram: bipartite graph of Terms and Documents]
27. Looking at Concepts Over Time
Start with the key term “fuel”. At 20% of the review these are the related terms, and at 65% the list has changed.
[Related-term lists from the slide: lube, fuels, piping, fob, battery, purity, ethane, mounted, petrochemicals, redundant, fin, batteries, paraxylene, compartments, cif, mixture, phy, airflow, fwd, ansi, swopt, ventilation, brent, partials, chargers, brg, stainless, loc, swap, rotor, benzene, bleed, diff, accessory, spd, plenum, liquids, detector, opt]
30. Putting Related Concepts to Work
A TREC collection with many topics identified, set against the whole corpus.
Topic 203: …whether the Company had met, or could, would, or might meet its financial forecasts, models, projections, or plans…
Topic 205: …analyses, evaluations, projections, plans, and reports on the volume(s) or geographic location(s) of energy loads.
31. Model In the Whole Collection
Look at the keyword “model”. Scope is the whole collection.
Term | Score
modeling | 1000
equation | 864
stochastic | 706
variables | 677
parameters | 518
probability | 365
simulation | 337
assumption | 325
returns | 251
curves | 211
32. Model In Topic 203
Look at the keyword “model”. Scope: Topic 203 (meeting financial forecasts).
Term | Score
flows | 1000
assumptions | 913
gains | 872
shares | 864
liquidity | 486
fluctuations | 374
analysts | 285
cents | 254
whitewing | 237
handles | 166
33. Model In Topic 205
Look at the keyword “model”. Scope: Topic 205 (analyzing energy volumes).
Term | Score
bids | 1000
congestion | 611
loads | 455
constraints | 354
clearing | 292
zonal | 194
signals | 192
procure | 190
dispatch | 152
csc | 120
34. Model In Comparison
Now, imagine this with batches and coding changes over time!
Whole Corpus | Topic 203 | Topic 205
modeling | flows | bids
equation | assumptions | congestion
stochastic | gains | loads
variables | shares | constraints
parameters | liquidity | clearing
probability | fluctuations | zonal
simulation | analysts | signals
assumption | cents | procure
returns | whitewing | dispatch
curves | handles | csc
Note: Our system can accept any combination of coding and metadata filters to dynamically assess your data.
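The scores in these tables come from Catalyst's own analytics; the sketch below only illustrates the general idea of scope-dependent related-term scores, using a naive co-occurrence count normalised so the top term scores 1000. The function name, scoring rule, and toy documents are assumptions for illustration, not the algorithm behind the deck's numbers.

```python
# Scope-dependent related-term scores: count how often each term co-occurs
# with the keyword inside the chosen document scope, then scale the top
# term to 1000, mirroring the tables above.
from collections import Counter

def related_terms(docs, keyword, scope=None, top=5):
    scope = range(len(docs)) if scope is None else scope
    counts = Counter()
    for i in scope:
        terms = set(docs[i].lower().split())
        if keyword in terms:
            counts.update(terms - {keyword})
    if not counts:
        return []
    top_count = counts.most_common(1)[0][1]
    return [(t, round(1000 * c / top_count)) for t, c in counts.most_common(top)]

docs = [
    "model assumptions about gains and shares",
    "congestion model for zonal loads",
    "stochastic model with many variables",
    "picnic logistics",
]
print(related_terms(docs, "model"))                 # whole-collection scope
print(related_terms(docs, "model", scope=[0, 1]))   # scope restricted to one topic
```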
36. Predictive Coding 2.0
Problem: The corpus is almost never complete
Answer: Review Algorithms that are iterative and continuous
Problem: Changing Issues
Answer: Review Algorithms that are adaptive and continuous
Problem: Shifting Concept Relationships
Answer: Concept Relationships that are calculated dynamically, on-the-fly, and coding-aware.
Continuous Case Assessment
37. Analytics Consulting
§ Analytics consulting and predictive ranking for nearly 4 years
§ How it started -- before “Predictive Coding” became popular:
“Can’t you predict what documents are probably relevant based on your review so far?”
– Judge, SDNY
§ Predictive Ranking: iterative search techniques + algorithms
§ Then off-the-shelf Predictive Coding 1.0 technologies
§ Catalyst’s research is exciting! We apply the research to real-world scenarios. Applying Bipartite Analytics…
38. Smart Review with Bipartite Analytics
Technology Advantages:
§ Accurate
§ Dynamic
§ Flexible
§ “Just in Time” suggestions
39. Smart Review Scenarios
1. “What happened” – examples: FCPA investigation, conspiracy ECA
2. Typical large-scale litigation with lots of ESI – e.g., class action lawsuit
3. Highly complex litigation with multiple issues – e.g., patent and unfair competition claims
40. Scenario 1 – What happened?
Goal: Rapidly determine facts and resolve the matter if possible
Applying the Technology
A small number of knowledgeable attorneys drill into documents using the fusion of advanced search features and flexible predictive coding.
41.
42.
43.
44. Scenario 1 – What happened?
Goal: Rapidly determine facts and resolve the matter if possible
Applying the Technology
A small number of knowledgeable attorneys drill into documents using the fusion of advanced search features and flexible predictive coding.
§ Faster location of valuable “veins” of information due to search filters
§ Rapid learning, and application of that learning, through flexible, “just in time” Predictive Coding 2.0
§ “Choose your own adventure”
45. Scenario 2 – Large Scale Litigation
Goal: Minimize cost by learning across a large document set, increase quality with focused review, and maximize protection of privilege and trade secrets
Applying the Technology:
§ Prioritized review based on rapid, continuous learning
§ Large-scale defensible culling
§ More accurate ranking of “potentially privileged” documents
46. Scenario 3 – Highly Complex Litigation
Goal: Review and produce with multiple and changing issues
Applying the Technology
§ Rapid learning across multiple topics
§ Leverage the ability to adjust for changes in topics
§ Review quality improves because of focus
§ Explore otherwise hidden subjects with Concept Explorer
§ Leverage learning across narrow, focused lines of inquiry (e.g., emails between two people in a narrow time window)
§ Protect privileged documents
47. Predictive Coding 2.0
Making E-Discovery
More Efficient and Cost Effective
John Tredennick
Jeremy Pickens
Jim Eidelman