TAR versus Keyword Challenge

Published in: Law

  1. TAR Versus Keyword Challenge
  2. Private and Confidential – Copyright 2019
     Prevalence (Richness)
     • Percentage of population that is relevant.
     • Important when choosing methodology.
  3. Recall
     • Percentage of relevant docs found.
     • Defensibility.
  4. Precision
     • Percentage of retrieved docs that are relevant.
     • Related to cost (review effort).
     • 1/P = average docs reviewed per relevant doc found.
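
The three definitions above (prevalence, recall, precision) can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name and the document counts are hypothetical and do not come from the slides.

```python
def review_metrics(population, relevant_total, retrieved, relevant_retrieved):
    """Return (prevalence, recall, precision, docs reviewed per relevant doc)."""
    prevalence = relevant_total / population       # share of population that is relevant
    recall = relevant_retrieved / relevant_total   # share of relevant docs that were found
    precision = relevant_retrieved / retrieved     # share of retrieved docs that are relevant
    docs_per_relevant = 1 / precision              # the 1/P review-effort figure
    return prevalence, recall, precision, docs_per_relevant


# Hypothetical matter: 100,000 docs, 5,000 relevant; 8,000 retrieved, 4,000 of them relevant.
prevalence, recall, precision, docs_per_relevant = review_metrics(
    100_000, 5_000, 8_000, 4_000
)
print(f"prevalence={prevalence:.2%} recall={recall:.0%} precision={precision:.0%}")
print(f"average docs reviewed per relevant doc found: {docs_per_relevant:.1f}")
```

With these made-up numbers, precision of 50% means a reviewer reads two documents, on average, for each relevant one found.
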
  5. Precision-Recall Curve
  6. Meaningful Comparison of Systems
     Fair approaches:
     • Equal defensibility (recall), compare cost.
     • Equal cost, compare defensibility (recall).
     At least one system should achieve reasonable recall.
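
One way to picture the "equal recall, compare cost" approach is to count how many documents each system makes you review, working down its ranking, before a target recall is reached. This is a minimal sketch under that assumption; the function name and both toy rankings are hypothetical.

```python
import math


def docs_to_reach_recall(ranked_labels, target_recall):
    """Docs reviewed from the top of a ranking to hit target recall (None if never)."""
    total_relevant = sum(ranked_labels)
    needed = math.ceil(target_recall * total_relevant)
    if needed == 0:
        return 0
    found = 0
    for reviewed, is_relevant in enumerate(ranked_labels, start=1):
        found += is_relevant
        if found >= needed:
            return reviewed
    return None


# Two hypothetical systems ranking the same 8 docs (1 = relevant, 0 = not relevant).
system_a = [1, 1, 0, 1, 0, 0, 1, 0]
system_b = [0, 1, 0, 1, 0, 1, 0, 1]

cost_a = docs_to_reach_recall(system_a, 0.75)
cost_b = docs_to_reach_recall(system_b, 0.75)
print(f"docs reviewed to reach 75% recall: A={cost_a}, B={cost_b}")
```

Both systems are held to the same recall, so the difference in documents reviewed is a like-for-like cost comparison.
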
  7. Bad Metric: Accuracy
     • Percentage of predictions that are right.
     • Makes bad systems look good.
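
A quick way to see how accuracy can make a bad system look good is a low-prevalence population and a system that predicts "not relevant" for everything. The counts below are hypothetical and chosen only for illustration.

```python
def accuracy(tp, fp, tn, fn):
    """Percentage of predictions that are right."""
    return (tp + tn) / (tp + fp + tn + fn)


def recall(tp, fn):
    """Percentage of relevant docs found."""
    return tp / (tp + fn)


# Hypothetical population: 10,000 docs at 1% prevalence (100 relevant).
# A "do nothing" system labels every doc as not relevant.
tp, fp, tn, fn = 0, 0, 9_900, 100

do_nothing_accuracy = accuracy(tp, fp, tn, fn)
do_nothing_recall = recall(tp, fn)
print(f"accuracy={do_nothing_accuracy:.0%} recall={do_nothing_recall:.0%}")
```

The system scores 99% accuracy while finding none of the relevant documents.
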
  8. Bad Metric: F1 Score
     • F1 = 2*P*R / (P + R)
     • Between P and R, closer to the smaller one.
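
The formula above is easy to try out. The sketch below uses made-up precision and recall values; the observation that two very different systems can share one F1 score is one plausible reading of why the slide calls F1 a bad metric, not a claim taken from the slide itself.

```python
def f1_score(precision, recall):
    """F1 = 2*P*R / (P + R), defined as 0 when both inputs are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical systems: one precise with poor recall, one the reverse.
high_precision = f1_score(0.9, 0.1)
high_recall = f1_score(0.1, 0.9)
print(f"F1(P=0.9, R=0.1) = {high_precision:.2f}")
print(f"F1(P=0.1, R=0.9) = {high_recall:.2f}")
```

Both calls give about 0.18, much closer to the smaller input (0.1) than to the larger (0.9), and the score alone cannot tell a 10% recall system from a 90% recall one.
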
  9. Are Research Results Relevant?
     • Many studies aren’t focused on e-discovery.
     • Appropriate metrics used?
     • Reasonable recall achieved?
     • Realistic data set?
  10. Keyword Search vs. TAR
      Search rules for the challenge (due to software limitations):
      • No phrase search, proximity search, wildcards, or stemming.
      • Keywords are not case-sensitive.
      • Boolean operators must be upper case.
      • Weights (positive and negative) are OK. Default weight is 1.
      • Example: microsoft^2.5 OR (windows AND NOT house)^1.2 OR software
      Topics:
      • Law: existing law, excluding politics or proposed new law.
      • Medical: business-oriented (not scientific) articles about the medical industry.
      • Biology: mainstream science articles (not medical treatment).
      Submission: http://clustify.com/query
      Analysis of final results will be posted at: http://blog.clustify.com
  11. Keyword Search on Steroids
      scientists^1000 OR gene^990 OR genes^973 OR protein^856 OR proteins^804 OR biotechnology^774 OR cells^712 OR biology^708 OR dna^669 OR function^660 OR researchers^603 OR cell^600 OR human^543 OR expression^516 OR molecular^496 OR experiments^482 OR genetic^464 OR drugs^460 OR biotech^448 OR population^441 OR mammalian^427 OR development^411 OR sequence^404 OR investigators^374 OR novel^368 OR disease^361 OR wild^361 OR pharmaceutical^360 OR reagents^349 OR adult^349 OR scientific^345 OR island^344 OR antibody^335 OR rapid^328 OR synthesis^324 OR mouse^323 OR … OR reactions^-252 OR war^-252 OR populations^-264 OR computer^-283 OR effects^-289 OR optical^-291 OR electronic^-293 OR treatment^-296 OR risk^-310 OR society^-323 OR table^-344 OR learning^-346 OR tests^-412
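
A weighted OR query like the one above can be read as a linear scoring rule. The sketch below assumes that a document's score is the sum of the weights of the query terms it contains; the slides do not say how the challenge software combines weights, so that rule, the function names, and the sample documents are all assumptions. It handles only flat OR queries (no AND, NOT, or parentheses) and uses four terms taken from the slide.

```python
import re


def parse_flat_or_query(query):
    """Parse 'term^weight OR term' into {term: weight}; default weight is 1."""
    weights = {}
    for clause in query.split(" OR "):  # operators must be upper case
        term, _, weight = clause.strip().partition("^")
        weights[term.lower()] = float(weight) if weight else 1.0
    return weights


def score(doc_text, weights):
    """Assumed rule: sum the weights of query terms present in the doc."""
    words = set(re.findall(r"[a-z0-9]+", doc_text.lower()))  # case-insensitive, no stemming
    return sum(w for term, w in weights.items() if term in words)


weights = parse_flat_or_query("scientists^1000 OR gene^990 OR computer^-283 OR tests^-412")
biology_score = score("Scientists mapped the gene.", weights)
mixed_score = score("Computer tests of the gene", weights)
print(biology_score, mixed_score)  # 1990.0 295.0
```

Under this assumed rule, negative weights pull down documents that mention off-topic terms even when a strongly weighted term is present.
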
  12. Finding Word Weights
  13. Training / Control Set Animation
  14. Keyword Search Strategies
      Similar to TAR 1.0 (SPL):
      • Review a random sample of docs.
      • Examine docs to find query keywords.
      • Repeat until query improvement is minimal.
      • Hard when prevalence is low.
      Similar to TAR 2.0 (CAL):
      • Create a query.
      • Review top docs from query.
      • Adjust query to add keywords from relevant docs and to suppress non-relevant docs.
      • Repeat until no more relevant docs can be found.
      • Good when prevalence is low, but is it robust?
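
The TAR 2.0 (CAL) style loop above can be sketched as a toy program. Everything here is an illustrative assumption rather than the presenter's method: the update rule (add 1 to every word of a relevant doc, subtract 1 for a non-relevant doc), the stopping rule (stop after a batch with no relevant docs), the function names, and the eight-document corpus.

```python
def tokenize(text):
    return set(text.lower().split())


def cal_style_review(docs, labels, seed_terms, batch_size=2):
    """Review top-scoring docs in batches, adjusting word weights after each review."""
    weights = {term: 1.0 for term in seed_terms}  # "create a query"
    unreviewed = set(range(len(docs)))
    found, reviewed = [], 0
    while unreviewed:
        # "Review top docs from query": rank by summed word weights, ties by index.
        ranked = sorted(
            unreviewed,
            key=lambda i: (-sum(weights.get(w, 0.0) for w in tokenize(docs[i])), i),
        )
        hits = 0
        for i in ranked[:batch_size]:
            unreviewed.discard(i)
            reviewed += 1
            if labels[i]:
                found.append(i)
                hits += 1
            # "Adjust query": boost words from relevant docs, suppress the rest.
            delta = 1.0 if labels[i] else -1.0
            for w in tokenize(docs[i]):
                weights[w] = weights.get(w, 0.0) + delta
        if hits == 0:  # stop when a batch turns up nothing relevant
            break
    return found, reviewed


docs = [
    "gene expression in mouse cells", "protein synthesis in human cells",
    "computer learning tests", "war and society",
    "mouse antibody protein study", "electronic table risk",
    "optical computer effects", "society risk treatment",
]
labels = [True, True, False, False, True, False, False, False]

found, reviewed = cal_style_review(docs, labels, seed_terms=["gene"])
print(f"found relevant docs {found} after reviewing {reviewed} of {len(docs)}")
```

Starting from the single seed word "gene", the loop reaches the third relevant document only because words learned from the first two ("mouse", "protein") lift it in the ranking.
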
  15. TAR 2.0 Robust? – Weak Seed
  16. TAR 2.0 Robust? – Wrong Seed
  17. TAR 2.0 Robust? – Disjoint Relevance
  18. Toy Example Illustrating Workflows
  19. TAR 1.0
  20. TAR 2.0
  21. TAR 3.0
  22. Review Effort (All Candidates Reviewed)
  23. Review Effort (No Candidates Reviewed)
  24. Beyond Keywords
      • Use metadata.
      • Feature engineering.
      • Adjacent word pairs instead of single words.
      • Non-linear relevance boundary.
      • Transformations to handle synonyms, etc. (LSA, word2vec, etc.)
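
The "adjacent word pairs" idea above can be illustrated with a small feature extractor. This is a minimal sketch: the function name and the two sample phrases are hypothetical, with the phrases chosen to echo the windows/house example from the search rules slide.

```python
def word_pair_features(text):
    """Return adjacent word pairs (bigrams) from whitespace-separated text."""
    words = text.lower().split()
    return [f"{a} {b}" for a, b in zip(words, words[1:])]


software_pairs = word_pair_features("Microsoft Windows software")
house_pairs = word_pair_features("house windows")
print(software_pairs)  # ['microsoft windows', 'windows software']
print(house_pairs)     # ['house windows']
```

As single words, both phrases share the feature "windows"; as pairs, they share nothing, which lets a model tell the two senses apart.
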
  25. Tips
      • Think of TAR as a more systematic way to do keywords, plus more.
      • Beware of keyword search culling before applying TAR: many relevant docs are probably lost.
      • Use the right performance metrics.
      • Choose the right TAR workflow for the situation.
  26. References
      • Misleading Metrics and Irrelevant Research (Accuracy and F1): https://blog.cluster-text.com/2018/12/12/misleading-metrics-and-irrelevant-research-accuracy-and-f1/
      • The Single Seed Hypothesis: https://blog.cluster-text.com/2015/04/25/the-single-seed-hypothesis/
      • TAR 3.0 Performance: https://blog.cluster-text.com/2016/01/28/tar-3-0-performance/
  27. Thank you for joining us!
