Presentation at the First Biennial Conference on Design of Experimental Search & Information Retrieval Systems (DESIRES 2018). August 30, 2018. Paper: https://www.ischool.utexas.edu/~ml/papers/kutlu-desires18.pdf
Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collections Accurately & Affordably
1. Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collections Accurately & Affordably
Mucahid Kutlu, Tyler McDonnell, Aashish Sheshadri,
Tamer Elsayed, & Matthew Lease
UT Austin & Qatar University
Slides: slideshare.net/mattlease ml@utexas.edu @mattlease
2. “The place where people & technology meet”
~ Wobbrock et al., 2009
“iSchools” now exist at 96 universities around the world
www.ischools.org
What’s an Information School?
3. • Problem Statement
• Related Work
• Datasets
• Mix & Match: Methods & Results
Roadmap
Proceedings of the First Biennial Conference on Design of Experimental
Search & Information Retrieval Systems (DESIRES), Bertinoro, Italy, August 28-31, 2018.
4. Problem Statement
• Traditional relevance assessors & processes (e.g., TREC) remain the most reliable and trusted
• Non-traditional relevance judging (e.g., crowdsourcing) offers ease, affordability, & speed/scalability, but with more variability in quality
• How can we make the best use of both?
– Crowd may better judge some documents/topics than
others; can we divide the work appropriately?
– Use the crowd in cases where we expect their judgments would match those of traditional judges (i.e., be “correct”)
6. Systematic Review is e-Discovery in Doctor’s Clothing
Joint work with
SIGIR 2016 Workshop on Medical IR (MedIR)
Gordon V. Cormack (U. Waterloo), Thomas A. Trikalinos (Brown U.), An Thanh Nguyen (U. Texas), & Byron C. Wallace (U. Texas)
10. Hybrid Man-Machine Relevance Judging
• Systematic review (medicine) and e-Discovery
(law / civil procedure) have traditionally relied
on trusted doctors/lawyers for judging
• Automatic relevance classification is more
efficient but less accurate
• Recent active learning work has investigated
hybrid man-machine judging combinations
– e.g., TAR & TREC Legal Track, recent CLEF track
11. Hybrid Crowd-Machine Labeling
• Dynamic labeling models select which example to label next, how many crowd labels to collect, & which examples to label automatically (see the sketch after this slide)
– Work by Weld (UW) & Mausam (IIT), e.g., TurKontrol
– Work by Kamar and Horvitz (MSR), e.g., CrowdSynth
– Our work (2 slides ahead…)
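A minimal sketch of what such a dynamic labeling loop can look like, assuming binary relevance labels and a simple Beta-Bernoulli posterior over each example's label; this is illustrative only, not the TurKontrol or CrowdSynth implementation, and all names here are hypothetical:

```python
import random

def posterior_relevant(pos_votes, neg_votes, prior_a=1.0, prior_b=1.0):
    """P(label = relevant) under a Beta-Bernoulli model of crowd votes."""
    return (pos_votes + prior_a) / (pos_votes + neg_votes + prior_a + prior_b)

def dynamic_label(get_crowd_vote, confidence=0.8, max_votes=5):
    """Collect crowd votes one at a time until the posterior is confident
    in either direction, or the per-example vote budget runs out."""
    pos = neg = 0
    for _ in range(max_votes):
        if get_crowd_vote():               # True = worker says "relevant"
            pos += 1
        else:
            neg += 1
        p = posterior_relevant(pos, neg)
        if p >= confidence or p <= 1 - confidence:
            break                          # confident enough: stop paying for votes
    return posterior_relevant(pos, neg) >= 0.5, pos + neg

# Toy usage: simulate a 70%-accurate worker pool on a truly relevant document.
label, votes_used = dynamic_label(lambda: random.random() < 0.7)
print(label, votes_used)
```

Examples whose posterior stays uncertain after the budget is spent are the natural candidates to escalate to an expert or to an automatic classifier.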
13. Combining Crowd and Expert Labels using
Decision Theoretic Active Learning
Nguyen, Wallace, & Lease, AAAI HCOMP’15
Systematic Review
• Built a model to predict assessor disagreement (see the sketch after this slide)
• Built a crowd simulator based on real data
– Simulated relevance judgments (of varying quality)
• Considered cost models for expert vs. crowd judges
• Evaluated cost vs. quality of hybrid NIST-Crowd
collaborative judging models.
A Collaborative Approach to IR Evaluation. Aashish
Sheshadri. Master's Thesis, UT CS, May 2014
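As a rough illustration of the “predict assessor disagreement” step, a hedged sketch using logistic regression; the thesis's actual model and feature set are not reproduced here, and the features and numbers below are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-(topic, document) features: topic ambiguity score,
# document length, crowd vote entropy. Labels: 1 = crowd & NIST disagreed.
X = np.array([[0.9, 1200, 0.95],
              [0.2,  300, 0.10],
              [0.7,  800, 0.60],
              [0.1,  450, 0.05]])
y = np.array([1, 0, 1, 0])

model = LogisticRegression().fit(X, y)

# Predicted disagreement risk: route low-risk pairs to the crowd,
# high-risk pairs to trusted (NIST-style) assessors.
risk = model.predict_proba(X)[:, 1]
print(risk)
```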
15. This Work: Simplified, w/ Newer, Real Data
18. Why Is That Relevant? Collecting Annotator
Rationales for Relevance Judgments
with T. McDonnell, M. Kutlu, & T. Elsayed
HCOMP 2016, Best Paper Award
19. • Scale up approach from prior HCOMP paper
• Mine rationales to understand disagreement
• Not discussed in that paper: worker behavioral data
Crowd vs. Expert: What Can Relevance Judgment
Rationales Teach Us About Assessor Disagreement?
with M. Kutlu, T. McDonnell, Y. Barkallah, & T. Elsayed
ACM SIGIR 2018
20. • Mine crowd worker analytics (behavioral traces) to predict label quality
Your Behavior Signals Your Reliability: Modeling
Crowd Behavioral Traces to Ensure
Quality Relevance Annotations
with T. Goyal, T. McDonnell, M. Kutlu, & T. Elsayed
AAAI HCOMP 2018
21. Data (available online)
1. TREC’09 Million Query Track (ClueWeb’09)
– Crowd judgments from the TREC’09 Relevance Feedback Track (with Mark Smucker), re-used in the TREC Crowdsourcing Tracks
– 1st crowd judgments my lab ever collected; noisy
2. TREC’14 Web Track (ClueWeb’12)
– 25K MTurk judgments just collected with “rationales”
(Kutlu et al., SIGIR’18; Goyal et al., HCOMP’18)
– Better quality through design: 80% accuracy via majority voting (MV) / 81% via Dawid-Skene (DS); see the sketch after this slide
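For context on the MV/DS numbers above, a minimal sketch of how the majority-vote (MV) accuracy can be computed against gold judgments; the votes and gold labels below are made up, and Dawid-Skene (DS) would replace the raw vote count with an EM-estimated per-worker confusion model (see the SQUARE benchmark on the next slide):

```python
from collections import Counter

def majority_vote(labels_per_item):
    """labels_per_item: {item_id: [worker labels]} -> {item_id: consensus label}."""
    return {item: Counter(votes).most_common(1)[0][0]
            for item, votes in labels_per_item.items()}

def accuracy(predicted, gold):
    return sum(predicted[i] == gold[i] for i in gold) / len(gold)

# Toy example with made-up crowd votes and gold (expert) judgments.
votes = {"d1": [1, 1, 0], "d2": [0, 0, 0], "d3": [1, 0, 0]}
gold  = {"d1": 1, "d2": 0, "d3": 1}
print(accuracy(majority_vote(votes), gold))  # 2/3 on this toy data
```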
22. Mix & Match: Method
• Aggregate crowd labels for consensus
– e.g., SQUARE benchmark of aggregation methods and
datasets (Sheshadri & Lease, HCOMP’13)
– http://ir.ischool.utexas.edu/square
• Prioritize (topic,document) pairs for judging
– StatAP (most important first)
– Disagreement oracle (avoid crowd disagreement); see sketch below
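A minimal sketch of the routing idea under stated assumptions: `priority` stands in for a StatAP-style importance ordering, and `oracle_says_crowd_ok`, `crowd_consensus`, and `nist_judge` are hypothetical stand-ins for the paper's actual components:

```python
def mix_and_match(pairs, priority, oracle_says_crowd_ok,
                  crowd_consensus, nist_judge, nist_budget):
    """Judge (topic, document) pairs in priority order, spending at most
    nist_budget expert judgments; take the aggregated crowd label wherever
    the oracle predicts the crowd will match NIST."""
    judgments, spent = {}, 0
    for pair in sorted(pairs, key=priority, reverse=True):
        if oracle_says_crowd_ok(pair):
            judgments[pair] = crowd_consensus(pair)   # aggregated crowd label
        elif spent < nist_budget:
            judgments[pair] = nist_judge(pair)        # trusted expert judgment
            spent += 1
        else:
            judgments[pair] = crowd_consensus(pair)   # expert budget exhausted
    return judgments
```

The key design choice is that expert effort is reserved for exactly the pairs where crowd consensus is expected to diverge from trusted judgments.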
23. Mix & Match: Results
• Analyzed various correlations with judging disagreement
– Disagreement model from Sheshadri’s Master’s thesis (Sheshadri, 2014)
– More disagreement for ambiguous topic definitions & topics
requiring greater expertise (Kutlu et al., SIGIR’18)
• Best results when ordering by disagreement oracle
– Achieves Kendall’s τ = 0.9 when NIST performs only a subset of the judgments: 55% (MQ’09) and 15-20% (WT’14)
• StatAP order beats random in WT’14, but not MQ’09
– With better-quality judgments, this simple method appears effective
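To make the τ = 0.9 criterion concrete, a small sketch that ranks systems by MAP under the full NIST qrels and under the hybrid qrels, then compares the two rankings with Kendall’s τ; the MAP scores below are made up:

```python
from scipy.stats import kendalltau

# Hypothetical MAP scores under full NIST qrels vs. hybrid qrels.
full_map   = {"sysA": 0.31, "sysB": 0.27, "sysC": 0.22, "sysD": 0.18}
hybrid_map = {"sysA": 0.30, "sysB": 0.28, "sysC": 0.21, "sysD": 0.19}

systems = sorted(full_map)
tau, _ = kendalltau([full_map[s] for s in systems],
                    [hybrid_map[s] for s in systems])
print(f"Kendall's tau = {tau:.2f}")  # tau >= 0.9 is the conventional bar
```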