2. Private and Confidential – Copyright 2019
Prevalence (Richness)
• Percentage of the document population that is relevant.
• Important when choosing methodology.
Recall
• Percentage of relevant docs found.
• Defensibility.
Precision
• Percentage of retrieved docs that are relevant.
• Related to cost (review effort).
• 1/P = average docs reviewed per relevant doc found.
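A quick sketch of these definitions, including the 1/P rule above (the counts below are hypothetical, for illustration only):

```python
def metrics(true_positives, false_positives, false_negatives):
    """Compute precision, recall, and average docs reviewed per relevant doc found."""
    retrieved = true_positives + false_positives   # docs the system returned
    relevant = true_positives + false_negatives    # all relevant docs in the population
    precision = true_positives / retrieved
    recall = true_positives / relevant
    docs_per_relevant = 1 / precision              # review effort per relevant doc found
    return precision, recall, docs_per_relevant

# Hypothetical: retrieve 200 docs, 50 of them relevant, 25 relevant docs missed
p, r, effort = metrics(true_positives=50, false_positives=150, false_negatives=25)
print(p)       # 0.25
print(effort)  # 4.0 docs reviewed per relevant doc found
```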
Meaningful Comparison of Systems
Fair Approaches:
• Equal defensibility (recall), compare cost.
• Equal cost, compare defensibility (recall).
At least one system should achieve reasonable recall.
Bad Metric: Accuracy
• Percentage of predictions that are right.
• Makes bad systems look good when prevalence is low.
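A toy illustration of the problem (numbers are made up): at 1% prevalence, a system that retrieves nothing at all still scores 99% accuracy.

```python
# Hypothetical collection: 10,000 docs, 100 of them relevant (1% prevalence)
total, relevant = 10_000, 100

# A useless "system" that predicts every doc is non-relevant:
correct = total - relevant   # it is "right" on every non-relevant doc
accuracy = correct / total
print(accuracy)  # 0.99 -- looks great, yet recall is 0
```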
Bad Metric: F1 Score
• F1 = 2*P*R / (P + R)
• The harmonic mean of P and R: falls between them, closer to the smaller one.
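A two-line sketch of the formula above, showing how F1 is pulled toward the smaller of P and R:

```python
def f1(p, r):
    # Harmonic mean of precision and recall
    return 2 * p * r / (p + r)

print(f1(0.9, 0.1))  # 0.18 -- much closer to 0.1 than to 0.9
print(f1(0.5, 0.5))  # 0.5  -- equal inputs give F1 equal to both
```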
Are Research Results Relevant?
• Many studies aren’t focused on e-discovery.
• Appropriate metrics used?
• Reasonable recall achieved?
• Realistic data set?
Keyword Search vs. TAR
Search rules for the challenge (due to software limitations):
• No phrase search, proximity search, wildcards, or stemming
• Keywords are not case-sensitive
• Boolean operators must be upper case
• Weights (positive and negative) are OK. Default weight is 1.
• Example: microsoft^2.5 OR (windows AND NOT house)^1.2 OR software
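As an illustration, here is one plausible way the example query above could be scored (the challenge engine's actual scoring semantics may differ): each top-level OR clause that matches contributes its weight to the doc's score.

```python
# Hypothetical scoring of: microsoft^2.5 OR (windows AND NOT house)^1.2 OR software
def clause_score(doc_words):
    score = 0.0
    if "microsoft" in doc_words:
        score += 2.5
    if "windows" in doc_words and "house" not in doc_words:
        score += 1.2
    if "software" in doc_words:
        score += 1.0  # default weight is 1
    return score

print(clause_score({"windows", "software"}))  # 2.2
print(clause_score({"windows", "house"}))     # 0.0 (NOT house suppresses the clause)
```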
Topics:
• Law: existing law, excluding politics or proposed new law
• Medical: business-oriented (not scientific) articles about the medical industry
• Biology: mainstream science articles (not medical treatment)
Submission: http://clustify.com/query
Analysis of final results will be posted at: http://blog.clustify.com
Keyword Search on Steroids
scientists^1000 OR gene^990 OR genes^973 OR protein^856 OR proteins^804 OR biotechnology^774 OR
cells^712 OR biology^708 OR dna^669 OR function^660 OR researchers^603 OR cell^600 OR human^543
OR expression^516 OR molecular^496 OR experiments^482 OR genetic^464 OR drugs^460 OR biotech^448
OR population^441 OR mammalian^427 OR development^411 OR sequence^404 OR investigators^374 OR
novel^368 OR disease^361 OR wild^361 OR pharmaceutical^360 OR reagents^349 OR adult^349 OR
scientific^345 OR island^344 OR antibody^335 OR rapid^328 OR synthesis^324 OR mouse^323 OR
…
OR reactions^-252 OR war^-252 OR populations^-264 OR computer^-283 OR effects^-289 OR optical^-291
OR electronic^-293 OR treatment^-296 OR risk^-310 OR society^-323 OR table^-344 OR learning^-346 OR
tests^-412
Keyword Search Strategies
Similar to TAR 1.0 (SPL):
• Review a random sample of docs.
• Examine docs to find query keywords.
• Repeat until query improvement is minimal.
• Hard when prevalence is low.
Similar to TAR 2.0 (CAL):
• Create a query.
• Review top docs from query.
• Adjust query to add keywords from relevant docs and to suppress non-relevant docs.
• Repeat until no more relevant docs can be found.
• Good when prevalence is low, but is it robust?
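The CAL-style loop above can be sketched over a toy in-memory collection. Here `is_relevant` stands in for human review, and the documents and weight increments are illustrative, not a real workflow:

```python
docs = [
    "gene expression study",
    "gene patent law ruling",
    "protein synthesis experiment",
    "tax law update",
]
is_relevant = lambda d: "gene" in d or "protein" in d  # oracle stand-in for the reviewer

weights = {"gene": 1.0}  # initial query
reviewed = set()
for _ in range(3):  # a few refinement rounds
    # Score unreviewed docs by summed keyword weights, highest first
    ranked = sorted((d for d in docs if d not in reviewed),
                    key=lambda d: -sum(weights.get(w, 0) for w in d.split()))
    if not ranked:
        break
    top = ranked[0]
    reviewed.add(top)
    # Boost words from relevant docs, suppress words from non-relevant docs
    delta = 0.5 if is_relevant(top) else -0.5
    for w in top.split():
        weights[w] = weights.get(w, 0) + delta

print(sorted(weights, key=weights.get, reverse=True))
```

After three rounds, words from relevant docs (like "gene") carry high positive weight, while words from the non-relevant doc (like "tax") are pushed negative.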
Beyond Keywords
• Use meta-data.
• Feature engineering.
• Adjacent word pairs instead of single words.
• Non-linear relevance boundary.
• Transformations to handle synonyms, etc. (LSA, word2vec, etc.)
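The adjacent-word-pairs idea above is straightforward to sketch: bigram features let "cell phone" and "stem cell" be distinguished even though both contain "cell".

```python
def bigrams(text):
    # Adjacent word pairs as features, instead of single words
    words = text.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]

print(bigrams("stem cell research"))   # ['stem cell', 'cell research']
print(bigrams("new cell phone plan"))  # ['new cell', 'cell phone', 'phone plan']
```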
Tips
• Think of TAR as a more systematic way to do keywords, plus more.
• Beware of keyword-search culling before applying TAR – many relevant docs are probably lost.
• Use the right performance metrics.
• Choose the right TAR workflow for the situation.
References
Misleading Metrics and Irrelevant Research (Accuracy and F1)
https://blog.cluster-text.com/2018/12/12/misleading-metrics-and-irrelevant-research-accuracy-and-f1/
The Single Seed Hypothesis
https://blog.cluster-text.com/2015/04/25/the-single-seed-hypothesis/
TAR 3.0 Performance
https://blog.cluster-text.com/2016/01/28/tar-3-0-performance/
Notes
Cost = review effort, assuming we review all docs that will potentially be produced.
Precision does not account for review of training docs or control set (when doing TAR).
We say “cost” instead of precision here because we should take training docs and the control set into account.
Do NOT compare systems with different cost and different defensibility – no conclusion can be reached unless the same system wins on both.
How much cost should you be willing to trade for more defensibility? Depends on circumstances, so no good answer.
If none of the methods achieves recall that is adequate for e-discovery, results aren’t relevant.
At R=75%, 1-NN has P=6.6% and 40-NN has P=70.4%. 1-NN requires review of 15.2 docs per relevant doc found, whereas 40-NN requires only 1.4. 1-NN requires over 10x as much review.
Mixes precision and recall, which measure very different things.
This query is from using TAR 3.0 for training to find biology documents with SVM.
Hundreds of words with positive weights – will miss very little (good for high recall).
Precisely tuned positive and negative weights.
Takes word correlation into account (some algorithms don’t).
Can make use of broad words like “scientists” by adding words like “physics” with negative weight.
Relies more on sorting of docs than on trying to pick the right subset.
Doesn’t do something like “cell AND NOT (phone OR fuel OR solar)”, which could lose some relevant docs. Instead, negative weights push non-relevant docs down the sorted listing without losing anything.
Based on actual data, not a guess.
SVM
Slope of the boundary line determines word weights.
Explain margin.
Shorter bars are better.
Review effort includes training, control set, and review of docs predicted to be relevant (to achieve 75% recall).
Tasks are ordered by descending prevalence (6.9% down to 0.3%).
Meta-data: sender/recipient can be critical when looking for privileged docs.
Feature engineering: Sender with only first name is probably spam.