
Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collections Accurately & Affordably


Presentation at the First Biennial Conference on Design of Experimental Search & Information Retrieval Systems (DESIRES 2018), August 30, 2018. Paper: https://www.ischool.utexas.edu/~ml/papers/kutlu-desires18.pdf


  1. Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collections Accurately & Affordably
     Mucahid Kutlu, Tyler McDonnell, Aashish Sheshadri, Tamer Elsayed, & Matthew Lease
     UT Austin & Qatar University
     Slides: slideshare.net/mattlease | ml@utexas.edu | @mattlease
  2. What’s an Information School?
     “The place where people & technology meet” ~ Wobbrock et al., 2009
     “iSchools” now exist at 96 universities around the world: www.ischools.org
  3. Roadmap
     • Problem Statement
     • Related Work
     • Datasets
     • Mix & Match: Methods & Results
     Proceedings of the First Biennial Conference on Design of Experimental Search & Information Retrieval Systems (DESIRES), Bertinoro, Italy, August 28-31, 2018.
  4. Problem Statement
     • Traditional relevance assessors & processes (e.g., TREC) remain the most reliable and trusted
     • Non-traditional relevance judging (e.g., crowdsourcing) offers ease, affordability, & speed/scalability, but more variability in quality
     • How can we make the best use of both?
       – The crowd may judge some documents/topics better than others; can we divide the work appropriately?
       – Use the crowd in cases where we expect their judgments would match those of traditional judges (i.e., be “correct”)
  5. Related Work @mattlease
  6. Systematic Review is e-Discovery in Doctor’s Clothing
     SIGIR 2016 Workshop on Medical IR (MedIR)
     Joint work with: Gordon V. Cormack (U. Waterloo), An Thanh Nguyen (U. Texas), Thomas A. Trikalinos (Brown U.), & Byron C. Wallace (U. Texas)
  7. Hybrid Man-Machine Relevance Judging
     • Systematic review (medicine) and e-Discovery (law / civil procedure) have traditionally relied on trusted doctors/lawyers for judging
     • Automatic relevance classification is more efficient but less accurate
     • Recent active learning work has investigated hybrid man-machine judging combinations
       – e.g., TAR & TREC Legal Track, recent CLEF track
  8. Hybrid Crowd-Machine Labeling
     • Dynamic labeling models select which example to label next, how many crowd labels to collect, & which examples to label automatically
       – Work by Weld (UW) & Mausam (IIT), e.g., TurKontrol
       – Work by Kamar and Horvitz (MSR), e.g., CrowdSynth
       – Our work (2 slides ahead…)
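A rough sketch of the kind of decision such dynamic labeling models make, not the exact formulation used by TurKontrol, CrowdSynth, or our own work: given the crowd votes seen so far for one example, compare the expected cost of accepting the consensus, buying one more crowd label, or routing the example to an expert. All costs, accuracy values, and function names below are illustrative assumptions.

```python
def posterior_relevant(votes_rel, votes_nonrel, p_acc=0.7, prior=0.5):
    """Posterior P(relevant | crowd votes), assuming each crowd worker is
    independently correct with probability p_acc (an illustrative model)."""
    like_rel = (p_acc ** votes_rel) * ((1 - p_acc) ** votes_nonrel)
    like_non = (p_acc ** votes_nonrel) * ((1 - p_acc) ** votes_rel)
    return prior * like_rel / (prior * like_rel + (1 - prior) * like_non)

def expected_risk_after_one_more(votes_rel, votes_nonrel, p_acc=0.7, error_cost=2.0):
    """Expected misclassification cost if we buy one more crowd vote,
    averaged over the two possible outcomes of that vote."""
    p = posterior_relevant(votes_rel, votes_nonrel, p_acc)
    p_vote_rel = p * p_acc + (1 - p) * (1 - p_acc)  # chance next vote says "relevant"
    p_if_rel = posterior_relevant(votes_rel + 1, votes_nonrel, p_acc)
    p_if_non = posterior_relevant(votes_rel, votes_nonrel + 1, p_acc)
    return error_cost * (p_vote_rel * min(p_if_rel, 1 - p_if_rel)
                         + (1 - p_vote_rel) * min(p_if_non, 1 - p_if_non))

def next_action(votes_rel, votes_nonrel, crowd_cost=0.05, expert_cost=1.0, error_cost=2.0):
    """Pick the cheapest option in expected cost: accept the consensus label,
    collect one more crowd label, or route to an expert (assumed near-perfect).
    All unit costs are hypothetical placeholders."""
    p = posterior_relevant(votes_rel, votes_nonrel)
    options = {
        "accept consensus": error_cost * min(p, 1 - p),
        "one more crowd label": crowd_cost + expected_risk_after_one_more(votes_rel, votes_nonrel),
        "route to expert": expert_cost,
    }
    return min(options, key=options.get), options

print(next_action(votes_rel=3, votes_nonrel=2))
```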
  9. Decision Theoretic Active Learning
  10. Combining Crowd and Expert Labels using Decision Theoretic Active Learning
      Nguyen, Wallace, & Lease, AAAI HCOMP’15 (Systematic Review)
  11. A Collaborative Approach to IR Evaluation
      Aashish Sheshadri. Master's Thesis, UT CS, May 2014
      • Built a model to predict assessor disagreement
      • Built a crowd simulator based on real data
        – Simulated relevance judgments (of varying quality)
      • Considered cost models for expert vs. crowd judges
      • Evaluated cost vs. quality of hybrid NIST-crowd collaborative judging models
  12. This Work: Simplified, w/ Newer, Real Data
      • Built a model to predict assessor disagreement
      • Built a crowd simulator based on real data
        – Simulated relevance judgments (of varying quality)
      • Considered cost models for expert vs. crowd judges
      • Evaluated cost vs. quality of hybrid NIST-crowd collaborative judging models
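A minimal sketch of the cost-model idea on the last two slides, assuming a flat per-judgment expert cost and a flat per-label crowd cost; the actual cost structure and figures studied in the thesis and in this work may differ, and all unit costs here are placeholders.

```python
def hybrid_judging_cost(n_pairs, expert_fraction, crowd_labels_per_pair=5,
                        expert_cost_per_judgment=1.00, crowd_cost_per_label=0.05):
    """Total judging cost when NIST assessors judge an expert_fraction of the
    (topic, document) pairs and crowd consensus covers the rest.
    All unit costs and the labels-per-pair value are hypothetical."""
    n_expert = round(n_pairs * expert_fraction)
    n_crowd = n_pairs - n_expert
    return (n_expert * expert_cost_per_judgment
            + n_crowd * crowd_labels_per_pair * crowd_cost_per_label)

# e.g., 10,000 pairs with NIST judging only 20% of them:
print(hybrid_judging_cost(10_000, expert_fraction=0.20))  # 2000*1.00 + 8000*0.25 = 4000.0
```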
  13. Datasets @mattlease
  14. TREC’09 Million Query Track (ClueWeb’09)
      • 3K MTurk judgments collected for the TREC 2010 Relevance Feedback Track
        – (Buckley, Smucker, & Lease, TREC’10 Notebook)
        – (Grady & Lease, NAACL’10 MTurk Workshop)
        – Judgments re-used in the TREC Crowdsourcing Tracks
      • 1st crowd judgments collected in my lab
        – Relatively low quality: 65% with majority voting (MV) / 70% with Dawid-Skene (DS)
  15. Why Is That Relevant? Collecting Annotator Rationales for Relevance Judgments
      with T. McDonnell, M. Kutlu, & T. Elsayed
      HCOMP 2016, Best Paper Award
  16. Crowd vs. Expert: What Can Relevance Judgment Rationales Teach Us About Assessor Disagreement?
      with M. Kutlu, T. McDonnell, Y. Barkallah, & T. Elsayed (ACM SIGIR 2018)
      • Scales up the approach from the prior HCOMP paper
      • Mines rationales to understand disagreement
      • Not discussed there: worker behavioral data
  17. Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to Ensure Quality Relevance Annotations
      with T. Goyal, T. McDonnell, M. Kutlu, & T. Elsayed (AAAI HCOMP 2018)
      • Mine crowd worker analytics (behavioral data) to predict label quality based on behavior
  18. Data (available online)
      1. TREC’09 Million Query Track (ClueWeb’09)
         – Crowd judgments from the TREC’09 RF Track (with Mark Smucker), re-used in the TREC Crowdsourcing Tracks
         – 1st crowd judgments my lab ever collected; noisy
      2. TREC’14 Web Track (ClueWeb’12)
         – 25K MTurk judgments just collected with “rationales” (Kutlu et al., SIGIR’18; Goyal et al., HCOMP’18)
         – Better quality through design: 80% MV / 81% DS
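To make the MV/DS figures above concrete, here is a minimal sketch of majority-vote aggregation and its agreement with expert (e.g., NIST) gold labels; Dawid-Skene instead estimates per-worker confusion matrices with EM, as implemented in the SQUARE benchmark cited on the next slide. The toy labels and names below are illustrative, not the released datasets.

```python
from collections import Counter

def majority_vote(crowd_labels):
    """crowd_labels: dict mapping (topic, doc) -> list of 0/1 crowd labels.
    Returns one consensus label per pair (ties broken toward non-relevant)."""
    return {pair: 1 if Counter(labels)[1] > Counter(labels)[0] else 0
            for pair, labels in crowd_labels.items()}

def agreement_with_gold(consensus, gold):
    """Fraction of commonly-judged pairs where the aggregated crowd label
    matches the expert label."""
    shared = [p for p in consensus if p in gold]
    return sum(consensus[p] == gold[p] for p in shared) / len(shared)

# Toy example (hypothetical labels)
crowd = {("t1", "d1"): [1, 1, 0], ("t1", "d2"): [0, 0, 1], ("t2", "d1"): [1, 0]}
nist  = {("t1", "d1"): 1, ("t1", "d2"): 1, ("t2", "d1"): 0}
mv = majority_vote(crowd)
print(mv, agreement_with_gold(mv, nist))  # 2 of 3 pairs agree -> ~0.67
```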
  19. Mix & Match: Method
      • Aggregate crowd labels for consensus
        – e.g., SQUARE benchmark of aggregation methods and datasets (Sheshadri & Lease, HCOMP’13)
        – http://ir.ischool.utexas.edu/square
      • Prioritize (topic, document) pairs for judging
        – StatAP (most important first)
        – Disagreement oracle (avoid crowd disagreement)
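A minimal sketch of how such a split might be wired together, assuming we already have a judging-priority ordering such as StatAP's and a per-pair disagreement score from a predictor or oracle. The function name, budget parameter, and route-by-budget rule are assumptions for illustration, not the paper's exact procedure.

```python
def mix_and_match(pairs, disagreement_score, crowd_consensus, nist_budget):
    """Assign each (topic, document) pair either a crowd consensus label
    or a deferred NIST judgment.

    pairs: (topic, doc) pairs already ordered by judging priority
           (e.g., StatAP's "most important first").
    disagreement_score: pair -> predicted crowd-vs-expert disagreement in [0, 1]
                        (an oracle would use the observed disagreement).
    crowd_consensus: pair -> aggregated crowd label (e.g., majority vote).
    nist_budget: maximum number of pairs NIST assessors will judge.
    """
    # Send the pairs the crowd is most likely to get "wrong" to NIST, up to the budget.
    to_nist = set(sorted(pairs, key=lambda p: disagreement_score[p],
                         reverse=True)[:nist_budget])
    qrels = {}
    for pair in pairs:
        if pair in to_nist:
            qrels[pair] = ("NIST", None)  # placeholder until an assessor judges it
        else:
            qrels[pair] = ("crowd", crowd_consensus[pair])
    return qrels
```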
  20. Mix & Match: Results
      • Analyzed various correlations with judging disagreement
        – Disagreement model: Sheshadri (2014), Master's Thesis
        – More disagreement for ambiguous topic definitions & topics requiring greater expertise (Kutlu et al., SIGIR’18)
      • Best results when ordering by the disagreement oracle
        – Achieve Kendall’s τ = 0.9 when NIST performs only a subset of judgments: 55% (MQ’09) and 15-20% (WT’14)
      • StatAP order beats random in WT’14, but not MQ’09
        – With better judgments, it seems simple & effective
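The Kendall's τ figure measures how similarly IR systems rank under the hybrid judgments versus the full NIST judgments; τ ≥ 0.9 is the conventional threshold for treating two rankings as effectively equivalent. A minimal sketch of that comparison using SciPy follows; the per-system scores are made up purely for illustration.

```python
from scipy.stats import kendalltau

# Effectiveness (e.g., MAP) per system under full-NIST vs. hybrid qrels.
# The numbers below are hypothetical, for illustration only.
systems = ["sysA", "sysB", "sysC", "sysD", "sysE"]
score_full_nist = {"sysA": 0.31, "sysB": 0.28, "sysC": 0.24, "sysD": 0.22, "sysE": 0.18}
score_hybrid    = {"sysA": 0.30, "sysB": 0.29, "sysC": 0.23, "sysD": 0.21, "sysE": 0.19}

tau, p_value = kendalltau([score_full_nist[s] for s in systems],
                          [score_hybrid[s] for s in systems])
print(f"Kendall's tau = {tau:.2f}")
```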
  21. Mix & Match: Results Detail
  22. Thank You!
      Matthew Lease - ml@utexas.edu - @mattlease
      slideshare.net/mattlease
      Lab: ir.ischool.utexas.edu
