2. Private and Confidential – Copyright 2019
Prevalence (Richness)
• Percentage of the document population that is relevant.
• Important when choosing methodology.
Recall
• Percentage of relevant docs found.
• Defensibility.
Precision
• Percentage of retrieved docs that are relevant.
• Related to cost (review effort).
• 1/P = average docs reviewed per relevant doc found.
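A quick sketch of these definitions, including the 1/P rule above (the counts below are hypothetical, for illustration only):

```python
def metrics(true_positives, false_positives, false_negatives):
    """Compute precision, recall, and average docs reviewed per relevant doc found."""
    retrieved = true_positives + false_positives   # docs the system returned
    relevant = true_positives + false_negatives    # all relevant docs in the population
    precision = true_positives / retrieved
    recall = true_positives / relevant
    docs_per_relevant = 1 / precision              # review effort per relevant doc found
    return precision, recall, docs_per_relevant

# Hypothetical: retrieve 200 docs, 50 of them relevant, 25 relevant docs missed
p, r, effort = metrics(true_positives=50, false_positives=150, false_negatives=25)
print(p)       # 0.25
print(effort)  # 4.0 docs reviewed per relevant doc found
```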
Meaningful Comparison of Systems
Fair Approaches:
• Equal defensibility (recall), compare cost.
• Equal cost, compare defensibility (recall).
At least one system should achieve reasonable recall.
Bad Metric: Accuracy
• Percentage of predictions that are right.
• Makes bad systems look good when prevalence is low.
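A toy illustration of the problem (numbers are made up): at 1% prevalence, a system that retrieves nothing at all still scores 99% accuracy.

```python
# Hypothetical collection: 10,000 docs, 100 of them relevant (1% prevalence)
total, relevant = 10_000, 100

# A useless "system" that predicts every doc is non-relevant:
correct = total - relevant   # it is "right" on every non-relevant doc
accuracy = correct / total
print(accuracy)  # 0.99 -- looks great, yet recall is 0
```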
Bad Metric: F1 Score
• F1 = 2*P*R / (P + R)
• The harmonic mean of P and R: falls between them, closer to the smaller one.
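A two-line sketch of the formula above, showing how F1 is pulled toward the smaller of P and R:

```python
def f1(p, r):
    # Harmonic mean of precision and recall
    return 2 * p * r / (p + r)

print(f1(0.9, 0.1))  # 0.18 -- much closer to 0.1 than to 0.9
print(f1(0.5, 0.5))  # 0.5  -- equal inputs give F1 equal to both
```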
Are Research Results Relevant?
• Many studies aren’t focused on e-discovery.
• Appropriate metrics used?
• Reasonable recall achieved?
• Realistic data set?
Keyword Search vs. TAR
Search rules for the challenge (due to software limitations):
• No phrase search, proximity search, wildcards, or stemming
• Keywords are not case-sensitive
• Boolean operators must be upper case
• Weights (positive and negative) are OK. Default weight is 1.
• Example: microsoft^2.5 OR (windows AND NOT house)^1.2 OR software
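As an illustration, here is one plausible way the example query above could be scored (the challenge engine's actual scoring semantics may differ): each top-level OR clause that matches contributes its weight to the doc's score.

```python
# Hypothetical scoring of: microsoft^2.5 OR (windows AND NOT house)^1.2 OR software
def clause_score(doc_words):
    score = 0.0
    if "microsoft" in doc_words:
        score += 2.5
    if "windows" in doc_words and "house" not in doc_words:
        score += 1.2
    if "software" in doc_words:
        score += 1.0  # default weight is 1
    return score

print(clause_score({"windows", "software"}))  # 2.2
print(clause_score({"windows", "house"}))     # 0.0 (NOT house suppresses the clause)
```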
Topics:
• Law: existing law, excluding politics or proposed new law
• Medical: business-oriented (not scientific) articles about the medical industry
• Biology: mainstream science articles (not medical treatment)
Submission: http://clustify.com/query
Analysis of final results will be posted at: http://blog.clustify.com
Keyword Search on Steroids
scientists^1000 OR gene^990 OR genes^973 OR protein^856 OR proteins^804 OR biotechnology^774 OR
cells^712 OR biology^708 OR dna^669 OR function^660 OR researchers^603 OR cell^600 OR human^543
OR expression^516 OR molecular^496 OR experiments^482 OR genetic^464 OR drugs^460 OR biotech^448
OR population^441 OR mammalian^427 OR development^411 OR sequence^404 OR investigators^374 OR
novel^368 OR disease^361 OR wild^361 OR pharmaceutical^360 OR reagents^349 OR adult^349 OR
scientific^345 OR island^344 OR antibody^335 OR rapid^328 OR synthesis^324 OR mouse^323 OR
…
OR reactions^-252 OR war^-252 OR populations^-264 OR computer^-283 OR effects^-289 OR optical^-291
OR electronic^-293 OR treatment^-296 OR risk^-310 OR society^-323 OR table^-344 OR learning^-346 OR
tests^-412
Keyword Search Strategies
Similar to TAR 1.0 (SPL):
• Review a random sample of docs.
• Examine docs to find query keywords.
• Repeat until query improvement is minimal.
• Hard when prevalence is low.
Similar to TAR 2.0 (CAL):
• Create a query.
• Review top docs from query.
• Adjust query to add keywords from relevant docs and to suppress non-relevant docs.
• Repeat until no more relevant docs can be found.
• Good when prevalence is low, but is it robust?
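The CAL-style loop above can be sketched over a toy in-memory collection. Here `is_relevant` stands in for human review, and the documents and weight increments are illustrative, not a real workflow:

```python
docs = [
    "gene expression study",
    "gene patent law ruling",
    "protein synthesis experiment",
    "tax law update",
]
is_relevant = lambda d: "gene" in d or "protein" in d  # oracle stand-in for the reviewer

weights = {"gene": 1.0}  # initial query
reviewed = set()
for _ in range(3):  # a few refinement rounds
    # Score unreviewed docs by summed keyword weights, highest first
    ranked = sorted((d for d in docs if d not in reviewed),
                    key=lambda d: -sum(weights.get(w, 0) for w in d.split()))
    if not ranked:
        break
    top = ranked[0]
    reviewed.add(top)
    # Boost words from relevant docs, suppress words from non-relevant docs
    delta = 0.5 if is_relevant(top) else -0.5
    for w in top.split():
        weights[w] = weights.get(w, 0) + delta

print(sorted(weights, key=weights.get, reverse=True))
```

After three rounds, words from relevant docs (like "gene") carry high positive weight, while words from the non-relevant doc (like "tax") are pushed negative.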
Beyond Keywords
• Use meta-data.
• Feature engineering.
• Adjacent word pairs instead of single words.
• Non-linear relevance boundary.
• Transformations to handle synonyms, etc. (LSA, word2vec, etc.)
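The adjacent-word-pairs idea above is straightforward to sketch: bigram features let "cell phone" and "stem cell" be distinguished even though both contain "cell".

```python
def bigrams(text):
    # Adjacent word pairs as features, instead of single words
    words = text.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]

print(bigrams("stem cell research"))   # ['stem cell', 'cell research']
print(bigrams("new cell phone plan"))  # ['new cell', 'cell phone', 'phone plan']
```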
Tips
• Think of TAR as a more systematic way to do keywords, plus more.
• Beware of keyword-search culling before applying TAR – many relevant docs are probably lost.
• Use the right performance metrics.
• Choose the right TAR workflow for the situation.
References
Misleading Metrics and Irrelevant Research (Accuracy and F1)
https://blog.cluster-text.com/2018/12/12/misleading-metrics-and-irrelevant-research-accuracy-and-f1/
The Single Seed Hypothesis
https://blog.cluster-text.com/2015/04/25/the-single-seed-hypothesis/
TAR 3.0 Performance
https://blog.cluster-text.com/2016/01/28/tar-3-0-performance/
Notes
Cost = review effort, assuming we review all docs that will potentially be produced.
Precision does not account for review of training docs or control set (when doing TAR).
We say “cost” instead of precision here because we should take training docs and the control set into account.
Do NOT compare systems with different cost and different defensibility – no conclusion can be reached unless the same system wins on both.
How much cost should you be willing to trade for more defensibility? Depends on circumstances, so no good answer.
If none of the methods achieves recall that is adequate for e-discovery, results aren’t relevant.
At R=75%, 1-NN has P=6.6% and 40-NN has P=70.4%. 1-NN requires review of 15.2 docs per relevant doc found, whereas 40-NN requires only 1.4. 1-NN requires over 10x as much review.
Mixes precision and recall, which measure very different things.
This query is from using TAR 3.0 for training to find biology documents with SVM.
Hundreds of words with positive weights – will miss very little (good for high recall).
Precisely tuned positive and negative weights.
Takes word correlation into account (some algorithms don’t).
Can make use of broad words like “scientists” by adding words like “physics” with negative weight.
Relies more on sorting of docs than on trying to pick the right subset.
Doesn’t do something like “cell AND NOT (phone OR fuel OR solar)”, which could lose some relevant docs. Instead, negative weights push non-relevant docs down the sorted listing without losing anything.
Based on actual data, not a guess.
SVM
Slope of the boundary line determines word weights.
Explain margin.
Shorter bars are better.
Review effort includes training, control set, and review of docs predicted to be relevant (to achieve 75% recall).
Tasks are ordered by descending prevalence (6.9% down to 0.3%).
Meta-data: sender/recipient can be critical when looking for privileged docs.
Feature engineering: Sender with only first name is probably spam.