Presentation at the First Biennial Conference on Design of Experimental Search & Information Retrieval Systems (DESIRES 2018). August 30, 2018. Paper: https://www.ischool.utexas.edu/~ml/papers/kutlu-desires18.pdf
Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collections Accurately & Affordably
1. Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collections Accurately & Affordably
Mucahid Kutlu, Tyler McDonnell, Aashish Sheshadri,
Tamer Elsayed, & Matthew Lease
UT Austin & Qatar University
Slides: slideshare.net/mattlease ml@utexas.edu @mattlease
2. “The place where people & technology meet”
~ Wobbrock et al., 2009
“iSchools” now exist at 96 universities around the world
www.ischools.org
What’s an Information School?
3. • Problem Statement
• Related Work
• Datasets
• Mix & Match: Methods & Results
Roadmap
Proceedings of the First Biennial Conference on Design of Experimental
Search & Information Retrieval Systems (DESIRES), Bertinoro, Italy, August 28-31, 2018.
4. Problem Statement
• Traditional relevance assessors & processes (e.g., TREC) remain the most reliable and trusted
• Non-traditional relevance judging (e.g., crowdsourcing) offers ease, affordability, & speed/scalability, but with more variability in quality
• How can we make the best use of both?
– Crowd may better judge some documents/topics than
others; can we divide the work appropriately?
– Use the crowd in cases where we expect their judgments would match those of traditional judges (i.e., be “correct”)
6. Systematic Review is e-Discovery in Doctor’s Clothing
Joint work with
SIGIR 2016 Workshop on Medical IR (MedIR)
Gordon V. Cormack (U. Waterloo), Thomas A. Trikalinos (Brown U.), An Thanh Nguyen (U. Texas), & Byron C. Wallace (U. Texas)
10. Hybrid Man-Machine Relevance Judging
• Systematic review (medicine) and e-Discovery
(law / civil procedure) have traditionally relied
on trusted doctors/lawyers for judging
• Automatic relevance classification is more
efficient but less accurate
• Recent active learning work has investigated
hybrid man-machine judging combinations
– e.g., TAR & TREC Legal Track, recent CLEF track
11. Hybrid Crowd-Machine Labeling
• Dynamic labeling models select which example to label next, how many crowd labels to collect, & which examples to label automatically (see the sketch after this slide)
– Work by Weld (UW) & Mausam (IIT), e.g., TurKontrol
– Work by Kamar and Horvitz (MSR), e.g., CrowdSynth
– Our work (2 slides ahead…)
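A minimal sketch of what such a dynamic labeling loop can look like, assuming binary relevance labels and a simple Beta-Bernoulli posterior over each example's label; this is illustrative only, not the TurKontrol or CrowdSynth implementation, and all names here are hypothetical:

```python
import random

def posterior_relevant(pos_votes, neg_votes, prior_a=1.0, prior_b=1.0):
    """P(label = relevant) under a Beta-Bernoulli model of crowd votes."""
    return (pos_votes + prior_a) / (pos_votes + neg_votes + prior_a + prior_b)

def dynamic_label(get_crowd_vote, confidence=0.8, max_votes=5):
    """Collect crowd votes one at a time until the posterior is confident
    in either direction, or the per-example vote budget runs out."""
    pos = neg = 0
    for _ in range(max_votes):
        if get_crowd_vote():               # True = worker says "relevant"
            pos += 1
        else:
            neg += 1
        p = posterior_relevant(pos, neg)
        if p >= confidence or p <= 1 - confidence:
            break                          # confident enough: stop paying for votes
    return posterior_relevant(pos, neg) >= 0.5, pos + neg

# Toy usage: simulate a 70%-accurate worker pool on a truly relevant document.
label, votes_used = dynamic_label(lambda: random.random() < 0.7)
print(label, votes_used)
```

Examples whose posterior stays uncertain after the budget is spent are the natural candidates to escalate to an expert or to an automatic classifier.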
13. Combining Crowd and Expert Labels using
Decision Theoretic Active Learning
Nguyen, Wallace, & Lease, AAAI HCOMP’15
Systematic Review
• Built a model to predict assessor disagreement (see the sketch after this slide)
• Built a crowd simulator based on real data
– Simulated relevance judgments (of varying quality)
• Considered cost models for expert vs. crowd judges
• Evaluated cost vs. quality of hybrid NIST-Crowd
collaborative judging models.
A Collaborative Approach to IR Evaluation. Aashish
Sheshadri. Master's Thesis, UT CS, May 2014
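As a rough illustration of the “predict assessor disagreement” step, a hedged sketch using logistic regression; the thesis's actual model and feature set are not reproduced here, and the features and numbers below are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-(topic, document) features: topic ambiguity score,
# document length, crowd vote entropy. Labels: 1 = crowd & NIST disagreed.
X = np.array([[0.9, 1200, 0.95],
              [0.2,  300, 0.10],
              [0.7,  800, 0.60],
              [0.1,  450, 0.05]])
y = np.array([1, 0, 1, 0])

model = LogisticRegression().fit(X, y)

# Predicted disagreement risk: route low-risk pairs to the crowd,
# high-risk pairs to trusted (NIST-style) assessors.
risk = model.predict_proba(X)[:, 1]
print(risk)
```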
15. This Work: Simplified, w/ Newer, Real Data
18. Why Is That Relevant? Collecting Annotator
Rationales for Relevance Judgments
with T. McDonnell, M. Kutlu, & T. Elsayed
HCOMP 2016, Best Paper Award
19. • Scale up approach from prior HCOMP paper
• Mine rationales to understand disagreement
• Not discussed in that paper: worker behavioral data
Crowd vs. Expert: What Can Relevance Judgment
Rationales Teach Us About Assessor Disagreement?
with M. Kutlu, T. McDonnell, Y. Barkallah, & T. Elsayed
ACM SIGIR 2018
20. • Mine crowd worker analytics (behavioral traces) to predict label quality
Your Behavior Signals Your Reliability: Modeling
Crowd Behavioral Traces to Ensure
Quality Relevance Annotations
with T. Goyal, T. McDonnell, M. Kutlu, & T. Elsayed
AAAI HCOMP 2018
21. Data (available online)
1. TREC’09 Million Query Track (ClueWeb’09)
– Crowd judgments from the TREC’09 Relevance Feedback Track (with Mark Smucker), re-used in the TREC Crowdsourcing Tracks
– 1st crowd judgments my lab ever collected; noisy
2. TREC’14 Web Track (ClueWeb’12)
– 25K MTurk judgments just collected with “rationales”
(Kutlu et al., SIGIR’18; Goyal et al., HCOMP’18)
– Better quality through design: 80% accuracy via majority voting (MV) / 81% via Dawid-Skene (DS); see the sketch after this slide
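For context on the MV/DS numbers above, a minimal sketch of how the majority-vote (MV) accuracy can be computed against gold judgments; the votes and gold labels below are made up, and Dawid-Skene (DS) would replace the raw vote count with an EM-estimated per-worker confusion model (see the SQUARE benchmark on the next slide):

```python
from collections import Counter

def majority_vote(labels_per_item):
    """labels_per_item: {item_id: [worker labels]} -> {item_id: consensus label}."""
    return {item: Counter(votes).most_common(1)[0][0]
            for item, votes in labels_per_item.items()}

def accuracy(predicted, gold):
    return sum(predicted[i] == gold[i] for i in gold) / len(gold)

# Toy example with made-up crowd votes and gold (expert) judgments.
votes = {"d1": [1, 1, 0], "d2": [0, 0, 0], "d3": [1, 0, 0]}
gold  = {"d1": 1, "d2": 0, "d3": 1}
print(accuracy(majority_vote(votes), gold))  # 2/3 on this toy data
```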
22. Mix & Match: Method
• Aggregate crowd labels for consensus
– e.g., SQUARE benchmark of aggregation methods and
datasets (Sheshadri & Lease, HCOMP’13)
– http://ir.ischool.utexas.edu/square
• Prioritize (topic,document) pairs for judging
– StatAP (most important first)
– Disagreement oracle (avoid crowd disagreement); see sketch below
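A minimal sketch of the routing idea under stated assumptions: `priority` stands in for a StatAP-style importance ordering, and `oracle_says_crowd_ok`, `crowd_consensus`, and `nist_judge` are hypothetical stand-ins for the paper's actual components:

```python
def mix_and_match(pairs, priority, oracle_says_crowd_ok,
                  crowd_consensus, nist_judge, nist_budget):
    """Judge (topic, document) pairs in priority order, spending at most
    nist_budget expert judgments; take the aggregated crowd label wherever
    the oracle predicts the crowd will match NIST."""
    judgments, spent = {}, 0
    for pair in sorted(pairs, key=priority, reverse=True):
        if oracle_says_crowd_ok(pair):
            judgments[pair] = crowd_consensus(pair)   # aggregated crowd label
        elif spent < nist_budget:
            judgments[pair] = nist_judge(pair)        # trusted expert judgment
            spent += 1
        else:
            judgments[pair] = crowd_consensus(pair)   # expert budget exhausted
    return judgments
```

The key design choice is that expert effort is reserved for exactly the pairs where crowd consensus is expected to diverge from trusted judgments.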
23. Mix & Match: Results
• Analyzed various correlations with judging disagreement
– Disagreement model from Sheshadri’s Master’s thesis (Sheshadri, 2014)
– More disagreement for ambiguous topic definitions & topics
requiring greater expertise (Kutlu et al., SIGIR’18)
• Best results when ordering by disagreement oracle
– Achieves Kendall’s τ = 0.9 when NIST performs only a subset of the judgments: 55% (MQ’09) and 15-20% (WT’14)
• StatAP order beats random in WT’14, but not MQ’09
– With better-quality judgments, this simple method appears effective
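To make the τ = 0.9 criterion concrete, a small sketch that ranks systems by MAP under the full NIST qrels and under the hybrid qrels, then compares the two rankings with Kendall’s τ; the MAP scores below are made up:

```python
from scipy.stats import kendalltau

# Hypothetical MAP scores under full NIST qrels vs. hybrid qrels.
full_map   = {"sysA": 0.31, "sysB": 0.27, "sysC": 0.22, "sysD": 0.18}
hybrid_map = {"sysA": 0.30, "sysB": 0.28, "sysC": 0.21, "sysD": 0.19}

systems = sorted(full_map)
tau, _ = kendalltau([full_map[s] for s in systems],
                    [hybrid_map[s] for s in systems])
print(f"Kendall's tau = {tau:.2f}")  # tau >= 0.9 is the conventional bar
```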