Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to Ensure Quality Relevance Annotations

Tanya Goyal, Tyler McDonnell, Mucahid Kutlu, Tamer Elsayed, &
Matthew Lease
UT Austin & Qatar University
Slides: slideshare.net/mattlease ml@utexas.edu @mattlease

What’s an Information School?
“The place where people & technology meet”
~ Wobbrock et al., 2009
“iSchools” now exist at 96 universities around the world
www.ischools.org

Roadmap
• Behavioral data: what & why?
• Prediction Tasks & Models
• The Labeling Task: Search Relevance Judging
• Data & Evaluation (three scenarios)
• Discussion & Future Work

How to Assess Crowd Work?
• Typical two approaches
– 1. Compare labels vs. expert’s (e.g., “gold”)
– 2. Compare labels vs. peers (e.g., MV, EM)
• 3. Compare labels to model predictions
– e.g., Ryu & Lease, ASIS&T’11
• 4. Collect & assess worker behavioral data

Worker Behavioral Data (Analytics)
• Could reduce need for experts & redundant work
• Could combine with other QC methods
• Could address “cold-start” problem
– Predict quality from worker’s first label via behavior

Prior Work
• Instrumenting the crowd: using implicit behavioral measures
to predict task performance (Rzeszotarski & Kittur, UIST’11)
– Correlate crowd behavior with crowd vs. expert labels
– Each worker assigned “pass/fail”; DT predicts via behavior
• Quality management in crowdsourcing using gold judges
behavior (Kazai & Zitouni, WSDM’16)
– Correlate crowd behavior with expert behavior
• No shared data or source code
• MmmTurkey: A Crowdsourcing Framework for Deploying
Tasks and Recording Worker Behavior on Mechanical Turk
– Records clicks, scrolls, mouse movements, key presses, copy or
paste actions, and change in window focus (with time stamps)
– Dang, Hutson, & Lease, HCOMP’16
– http://github.com/CuriousG102/turkey
– https://github.com/budang/turkey-lite***

Prediction via Behavioral Models
• Two prediction tasks (w/o aggregation)
1. label correctness (classification)
2. worker accuracy (regression)
• Three purely behavior-based models
1. Random Forest with Aggregate Features (RF-AF)
2. Random Forest with Sequential Features (RF-SF)
3. K-means with Sequence Clusters (kmeans-SC)
• See paper for details
– Also a 4th hybrid model using work history as well
• See paper for details

RF with Aggregate Features (RF-AF)
• Rzeszotarski & Kittur (2011) use Action features (only),
e.g., task time, on-focus time, and raw event counts
• Kazai & Zitouni (2016) add Temporal features: time elapsed
between successive events within a HIT
• We use both
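
As a rough illustration of combining the two feature families, the sketch below computes action features (task time, raw event counts) and temporal features (gaps between successive events) for one HIT. The `(timestamp, event_type)` trace format, the event names in `EVENT_TYPES`, and the feature names are assumptions made for this example; they are not the MmmTurkey log schema or the paper's exact feature set.

```python
from collections import Counter
from statistics import mean

# Hypothetical event vocabulary; the real logging schema may differ.
EVENT_TYPES = ("click", "scroll", "mousemove", "keypress", "paste", "focus_change")


def aggregate_features(trace):
    """Map one HIT trace, a time-ordered list of (timestamp_sec, event_type),
    to a flat dict of aggregate features."""
    times = [t for t, _ in trace]
    counts = Counter(event for _, event in trace)
    # Temporal features: time elapsed between successive events.
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    features = {
        "task_time": times[-1] - times[0] if times else 0.0,
        "mean_gap": mean(gaps) if gaps else 0.0,
        "max_gap": max(gaps) if gaps else 0.0,
    }
    # Action features: raw count of each event type.
    for event in EVENT_TYPES:
        features[f"n_{event}"] = counts.get(event, 0)
    return features


trace = [(0.0, "focus_change"), (1.5, "scroll"), (4.0, "scroll"), (9.0, "click")]
feats = aggregate_features(trace)
print(feats)
```

A dict like this, one per HIT, could then be fed to any tabular learner such as a random forest.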

RF with Sequential Features (RF-SF)
• For given task, workers likely to perform
actions in similar order
• Aggregate features don’t capture the order of
events occurring within a HIT
– e.g., a click followed by a scroll, etc.
• Feature templates: we extract all sequences of
length 2k + m, i.e., 2 fixed event sequences of
length k separated by m arbitrary (wildcard) events.
– for k = 2, m = 1, {Click, Click, <event>, Click, Scroll}
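
One possible reading of this template, sketched under assumptions: slide a window of length 2k + m over the event sequence, keep the first and last k events, and mask the middle m with a wildcard. The function name `sequence_features`, the `<event>` wildcard token, and the use of raw counts as feature values are illustrative choices, not confirmed details of the paper.

```python
from collections import Counter

WILDCARD = "<event>"  # placeholder for the m unconstrained middle events


def sequence_features(events, k=2, m=1):
    """Count gapped patterns: k fixed events, m wildcards, k fixed events."""
    width = 2 * k + m
    counts = Counter()
    for start in range(len(events) - width + 1):
        window = events[start:start + width]
        pattern = tuple(window[:k]) + (WILDCARD,) * m + tuple(window[k + m:])
        counts[pattern] += 1
    return counts


events = ["Click", "Click", "Scroll", "Click", "Scroll", "Scroll"]
feats = sequence_features(events, k=2, m=1)
for pattern, count in feats.items():
    print(pattern, count)
```

With k = 2 and m = 1, the first window yields the pattern from the slide, (Click, Click, <event>, Click, Scroll).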

The Labeling Task:
Judging Relevance of Search Results

Why Is That Relevant? Collecting Annotator
Rationales for Relevance Judgments
with T. McDonnell, M. Kutlu, & T. Elsayed
HCOMP 2016


Crowd vs. Expert: What Can Relevance Judgment
Rationales Teach Us About Assessor Disagreement?
with M. Kutlu, T. McDonnell, Y. Barkallah, & T. Elsayed
to appear at ACM SIGIR 2018 (in 4 days)
• Scale up approach from prior HCOMP paper
• Mine rationales to understand disagreement
• But not discussed… worker behavioral data

Behavioral Data in this Study
• 3,984 unique HITs (i.e., behavioral traces)
• 2,294 unique document-topic pairs
• 106 unique workers
• 1-5 labels per document (variable)

Evaluation: Prediction via Behavior
• Cross-validation; different workers in each train/test split
– “Cold start”: must predict for unseen workers
• Two prediction tasks (per HIT, given behavioral data)
1. Classification: is worker’s label correct or not?
2. Regression: what is worker accuracy?
• We define the true, time-varying worker accuracy as the % of the
worker’s last 5 labels that were correct
• Baselines
– Simple baseline: constant prediction – always predict label
correct, accuracy = mean worker accuracy (65.8%)
– Decision Tree (akin to Rzeszotarski & Kittur, UIST’11)
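
The evaluation setup above can be sketched as follows, under stated assumptions. `worker_disjoint_folds` assigns every worker to exactly one fold so that test workers are never seen in training (the "cold start" condition), and `rolling_accuracy` computes the regression target over a worker's last 5 labels. Both function names are hypothetical. Whether the window includes the current label, and how a worker's first few labels (fewer than 5 available) are handled, is not stated on the slide; this sketch includes the current label and averages over however many labels exist so far.

```python
from collections import defaultdict, deque


def worker_disjoint_folds(worker_ids, n_folds=5):
    """Yield (train_idx, test_idx) pairs in which no worker appears on both sides."""
    workers = sorted(set(worker_ids))
    fold_of = {w: i % n_folds for i, w in enumerate(workers)}
    for fold in range(n_folds):
        test = [i for i, w in enumerate(worker_ids) if fold_of[w] == fold]
        train = [i for i, w in enumerate(worker_ids) if fold_of[w] != fold]
        yield train, test


def rolling_accuracy(records, window=5):
    """records: chronological list of (worker_id, is_correct).
    Returns, per HIT, the fraction correct among that worker's most recent
    `window` labels (assumption: the current label is included)."""
    recent = defaultdict(lambda: deque(maxlen=window))
    targets = []
    for worker, is_correct in records:
        recent[worker].append(1 if is_correct else 0)
        targets.append(sum(recent[worker]) / len(recent[worker]))
    return targets


records = [("w1", True), ("w2", False), ("w1", False), ("w1", True), ("w2", True)]
worker_ids = [w for w, _ in records]
targets = rolling_accuracy(records)
folds = list(worker_disjoint_folds(worker_ids, n_folds=2))
print(targets)
print(folds)
```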

Prediction Results (Behavior only)
Method               Classification (Accuracy, %)   Regression (MSE)
Baseline: Constant   65.6                           5.6
Decision Tree - AF   60.4                           6.7
Random Forest - AF   67.9                           4.7
Random Forest - SF   68.8                           4.8
• Constant baseline beats decision tree
• Aggregate vs. Sequential Features comparable
– Sequential slightly higher classification accuracy
• Notes
– Prediction based only on behavioral traces
– No aggregation; can use with single labeling
– More results in the paper

II. Aggregation via Behavioral Weighting
• Weighted voting based on
1. Predicted label confidence
2. Predicted worker accuracy
• Baselines
– Majority Vote (unweighted): ~64% accuracy
– EM (peer-agreement weighting): ~67%
• Behavior-only weighted aggregation (RF-SF)
– Weighted by predicted worker accuracy: ~69.5%
– Weighted by predicted label confidence: ~72%
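
A minimal sketch of weighted voting, assuming each collected label comes with a weight produced by the behavior model (either the predicted probability that the label is correct or the predicted worker accuracy). The function name, label strings, and tie-breaking rule are illustrative assumptions; unweighted majority vote is the special case where every weight is 1.

```python
from collections import defaultdict


def weighted_vote(labels, weights):
    """Return the label with the largest total weight.
    Ties go to the label that was seen first."""
    totals = defaultdict(float)
    for label, weight in zip(labels, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


labels = ["relevant", "not_relevant", "not_relevant"]
confidence = [0.9, 0.3, 0.4]  # hypothetical behavior-based predictions

by_behavior = weighted_vote(labels, confidence)
by_majority = weighted_vote(labels, [1.0] * len(labels))
print(by_behavior, by_majority)
```

In this toy case one confident label outweighs two low-confidence ones, so the weighted result differs from majority vote.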

III. Dynamic Labeling via Behavior
• Can we intelligently decide when to collect more
labels given only observed behavior?
• Markov Decision Process (MDP)
– State is current estimated label quality
• Individual label quality estimated by RF-SF; aggregate label
quality following Dai et al. (2013)
– Decide at each step whether to get another label
• Weigh likely quality improvement vs. cost
• Given target quality parameter, stop if think it’s reached
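
The stopping idea can be illustrated with a much simpler rule than the MDP described above. The sketch below is a greedy threshold rule, not the authors' model and not the method of Dai et al. (2013): it assumes binary labels, a uniform prior, independent workers, and a per-label probability of correctness strictly between 0 and 1 supplied by the behavior model. It does not weigh cost against expected improvement; it only caps the number of labels. All names and default values are hypothetical.

```python
def posterior_positive(votes):
    """votes: list of (label, p_correct) with label in {0, 1}.
    Returns P(true label = 1 | votes) under a uniform prior."""
    like_pos = like_neg = 1.0
    for label, p in votes:
        like_pos *= p if label == 1 else 1.0 - p
        like_neg *= p if label == 0 else 1.0 - p
    return like_pos / (like_pos + like_neg)


def collect_until_confident(label_stream, target=0.7, max_labels=5):
    """Request labels one at a time; stop once the estimated quality of the
    aggregate label reaches `target` or the label budget is spent."""
    votes = []
    for vote in label_stream:
        votes.append(vote)
        p_pos = posterior_positive(votes)
        if max(p_pos, 1.0 - p_pos) >= target or len(votes) >= max_labels:
            break
    p_pos = posterior_positive(votes)
    decision = 1 if p_pos >= 0.5 else 0
    return decision, max(p_pos, 1.0 - p_pos), len(votes)


stream = [(1, 0.6), (0, 0.55), (1, 0.8), (1, 0.9)]
decision, quality, n_used = collect_until_confident(iter(stream), target=0.7)
print(decision, round(quality, 3), n_used)
```

Here the first two labels disagree and leave the estimate near 0.55, so a third label is requested; it pushes the estimate past the 0.7 target and the fourth label is never collected.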

Dynamic Labeling: Results
[Figure: dynamic labeling results at target quality 0.7; annotation: "Selecting the example to label next"]

Discussion
• With strong task design, less need for QC
– i.e., worker filtering & aggregation
• Biggest challenge was small data scale
• Ethical issues of behavioral data collection
– Workplace “surveillance”; oDesk work diary

Contributions & Future Work
• Three models for quality prediction via behavior
– Classification/Regression, Aggregation, & Dyn. Labeling
– First to use behavioral data for aggregation & dynamic labeling
• Shared behavioral data for ~4K HITs
– http://ir.ischool.utexas.edu/webcrowd25k/
• Future Work
– Analyzing behavioral data at greater scale
– Hybrid aggregation (behavior + non-behavior)
– Transfer learning (i.e., application across tasks)

Thank You!
Matthew Lease - ml@utexas.edu - @mattlease
slideshare.net/mattlease
Lab: ir.ischool.utexas.edu
