
Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to Ensure Quality Relevance Annotations

Presentation at the 6th AAAI Conference on Human Computation and Crowdsourcing (HCOMP), July 7, 2018. Work by Tanya Goyal, Tyler McDonnell, Mucahid Kutlu, Tamer Elsayed, and Matthew Lease. Pages 41-49 in conference proceedings. Online version of paper includes corrections to official version in proceedings: https://www.ischool.utexas.edu/~ml/papers/goyal-hcomp18



  1. Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to Ensure Quality Relevance Annotations. Tanya Goyal, Tyler McDonnell, Mucahid Kutlu, Tamer Elsayed, & Matthew Lease. UT Austin & Qatar U. Slides: slideshare.net/mattlease ml@utexas.edu @mattlease
  2. “The place where people & technology meet” ~ Wobbrock et al., 2009. “iSchools” now exist at 96 universities around the world: www.ischools.org. What’s an Information School? 2
  3. Roadmap: • Behavioral data: what & why? • Prediction Tasks & Models • The Labeling Task: Search Relevance Judging • Data & Evaluation (three scenarios) • Discussion & Future Work
  4. How to Assess Crowd Work? • Two typical approaches – 1. Compare labels vs. experts’ (e.g., “gold”) – 2. Compare labels vs. peers (e.g., MV, EM) • 3. Compare labels to model predictions – e.g., Ryu & Lease, ASIS&T’11 • 4. Collect & assess worker behavioral data 4
  5. Worker Behavioral Data (Analytics) • Could reduce need for experts & redundant work • Could combine with other QC methods • Could address “cold-start” problem – Predict quality from worker’s first label via behavior 5
  6. Prior Work • Instrumenting the crowd: using implicit behavioral measures to predict task performance (Rzeszotarski & Kittur, UIST’11) – Correlate crowd behavior with crowd vs. expert labels – Each worker assigned “pass/fail”; DT predicts via behavior • Quality management in crowdsourcing using gold judges behavior (Kazai & Zitouni, WSDM’16) – Correlate crowd behavior with expert behavior • No shared data or source code • MmmTurkey: A Crowdsourcing Framework for Deploying Tasks and Recording Worker Behavior on Mechanical Turk – Records clicks, scrolls, mouse movements, key presses, copy or paste actions, and change in window focus (with time stamps) – Dang, Hutson, & Lease, HCOMP’16 – http://github.com/CuriousG102/turkey – https://github.com/budang/turkey-lite 6
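To make the kind of trace this produces concrete, here is a minimal sketch of what a single logged worker event might look like; the field names and Python representation are illustrative assumptions, not MmmTurkey's actual log schema.

```python
# Illustrative sketch only: a hypothetical record for one logged worker event;
# MmmTurkey's real output format may differ.
from dataclasses import dataclass

@dataclass
class WorkerEvent:
    hit_id: str          # which HIT (task instance) the event belongs to
    worker_id: str       # anonymized worker identifier
    timestamp_ms: int    # when the event fired
    event_type: str      # e.g. "click", "scroll", "mousemove", "keypress",
                         # "copy", "paste", or "focus_change"
    detail: str = ""     # optional payload, e.g. the element clicked

# A behavioral trace for one HIT is then just the time-ordered list of such events.
```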
  7. Prediction via Behavioral Models • Two prediction tasks (w/o aggregation) 1. label correctness (classification) 2. worker accuracy (regression) • Three purely behavior-based models 1. Random Forest with Aggregate Features (RF-AF) 2. Random Forest with Sequential Features (RF-SF) 3. K-means with Sequence Clusters (kmeans-SC) • Also a 4th hybrid model using work history as well – See paper for details 7
  8. RF with Aggregate Features (RF-AF) • Rzeszotarski & Kittur (2011) use Action features (only), e.g., task time, on-focus time, and raw event counts • Kazai and Zitouni (2016) include Temporal features between successive events within a HIT • We use both 8
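As a rough illustration of the RF-AF idea, the sketch below computes a few aggregate features from an event trace (of the kind shown earlier) and fits a scikit-learn Random Forest on label correctness; the exact feature set and helper names are assumptions, not the paper's implementation.

```python
# A rough sketch of RF-AF-style features (not the authors' code): aggregate action
# and temporal features per HIT, then fit a Random Forest on label correctness.
from collections import Counter
import numpy as np
from sklearn.ensemble import RandomForestClassifier

EVENT_TYPES = ["click", "scroll", "mousemove", "keypress", "copy", "paste", "focus_change"]

def aggregate_features(events):
    """events: time-ordered list of (timestamp_seconds, event_type) for one HIT."""
    times = [t for t, _ in events]
    counts = Counter(e for _, e in events)
    gaps = np.diff(times) if len(times) > 1 else np.array([0.0])
    return [
        times[-1] - times[0],                  # total task time (action feature)
        *[counts[e] for e in EVENT_TYPES],     # raw event counts (action features)
        float(gaps.mean()), float(gaps.max())  # temporal gaps between successive events
    ]

def fit_rf_af(traces, label_correct):
    """traces: list of event lists (one per HIT); label_correct: 0/1 per HIT."""
    X = np.array([aggregate_features(t) for t in traces])
    y = np.array(label_correct)
    return RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
```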
  9. RF with Sequential Features (RF-SF) • For a given task, workers likely to perform actions in similar order • Aggregate features don’t capture the order of events occurring within a HIT – e.g., a click followed by a scroll, etc. • Feature templates: we extract all sequences of length 2k + m, i.e., 2 fixed event sequences of length k separated by m random events – for k = 2, m = 1, {Click, Click, <event>, Click, Scroll} 9
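A minimal sketch of how such feature templates might be extracted from an ordered event sequence, based on my reading of the slide rather than the authors' implementation:

```python
# Count feature templates of length 2k + m from one HIT's ordered event sequence:
# the first k and last k events are fixed, the middle m positions are wildcards.
from collections import Counter

def sequence_templates(event_seq, k=2, m=1):
    """event_seq: ordered list of event types for one HIT, e.g. ['click','scroll',...]."""
    window = 2 * k + m
    feats = Counter()
    for i in range(len(event_seq) - window + 1):
        w = event_seq[i:i + window]
        template = tuple(w[:k]) + ("<event>",) * m + tuple(w[-k:])
        feats[template] += 1
    return feats

# e.g. sequence_templates(['click', 'click', 'scroll', 'click', 'scroll'], k=2, m=1)
# yields {('click', 'click', '<event>', 'click', 'scroll'): 1}, matching the slide's example.
```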
  10. The Labeling Task: Judging Relevance of Search Results @mattlease
  11. Why Is That Relevant? Collecting Annotator Rationales for Relevance Judgments with T. McDonnell, M. Kutlu, & T. Elsayed, HCOMP 2016 11
  12. Why Is That Relevant? Collecting Annotator Rationales for Relevance Judgments with T. McDonnell, M. Kutlu, & T. Elsayed, HCOMP 2016 12
  13. Why Is That Relevant? Collecting Annotator Rationales for Relevance Judgments with T. McDonnell, M. Kutlu, & T. Elsayed, HCOMP 2016 13
  14. Crowd vs. Expert: What Can Relevance Judgment Rationales Teach Us About Assessor Disagreement? with M. Kutlu, T. McDonnell, Y. Barkallah, & T. Elsayed, to appear at ACM SIGIR 2018 (in 4 days) • Scale up approach from prior HCOMP paper • Mine rationales to understand disagreement • But not discussed… worker behavioral data 14
  15. Behavioral Data in this Study • 3,984 unique HITs (i.e., behavioral traces) • 2,294 unique document-topic pairs • 106 unique workers • 1-5 labels per document (variable) 15
  16. Evaluation: Prediction via Behavior 16 • Cross-validation; different workers in each train/test split – “Cold start”: must predict for unseen workers • Two prediction tasks (per HIT, given behavioral data) 1. Classification: is worker’s label correct or not? 2. Regression: what is worker accuracy? • We define the true, time-varying worker accuracy as the % of the worker’s last 5 labels that were correct • Baselines – Simple baseline: constant prediction – always predict the label is correct (classification); predict the mean worker accuracy, 65.8% (regression) – Decision Tree (akin to Rzeszotarski & Kittur, UIST’11)
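The worker-disjoint split and the rolling accuracy target can be sketched as follows. This is my reading of the slide (using scikit-learn's GroupKFold), not the authors' evaluation code, and whether the 5-label window includes the current label is an assumption.

```python
# Sketch of the evaluation setup: worker-disjoint splits so test workers are
# unseen ("cold start"), and a per-HIT target accuracy over the worker's last 5 labels.
import numpy as np
from sklearn.model_selection import GroupKFold

def rolling_worker_accuracy(correct_flags, window=5):
    """correct_flags: one worker's label correctness (0/1) in chronological order.
    Returns, per HIT, the fraction correct over that worker's last `window` labels."""
    acc = []
    for i in range(len(correct_flags)):
        recent = correct_flags[max(0, i - window + 1): i + 1]
        acc.append(sum(recent) / len(recent))
    return acc

def worker_disjoint_splits(X, y, worker_ids, n_splits=5):
    """Yield train/test index pairs with no worker shared across train and test."""
    gkf = GroupKFold(n_splits=n_splits)
    yield from gkf.split(X, y, groups=worker_ids)
```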
  17. Prediction Results (Behavior only) 17
      Method               | Classification (Accuracy) | Regression (MSE)
      Baseline: Constant   | 65.6                      | 5.6
      Decision Tree – AF   | 60.4                      | 6.7
      Random Forest – AF   | 67.9                      | 4.7
      Random Forest – SF   | 68.8                      | 4.8
      • Constant baseline beats decision tree • Aggregate vs. Sequential Features comparable – Sequential slightly higher classification accuracy • Notes – Prediction based only on behavioral traces – No aggregation; can use with single labeling – More results in the paper
  18. II. Aggregation via Behavioral Weighting • Weighted voting based on 1. Predicted label confidence 2. Predicted worker accuracy • Baselines – Majority Vote (unweighted): ~64% accuracy – EM (peer-agreement weighting): ~67% • Behavior-only weighted aggregation (RF-SF) – Weighting by predicted worker accuracy: ~69.5% – Weighting by predicted label confidence: ~72% 18
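A minimal sketch of the weighted-vote step, under the assumption that each label's weight comes either from the RF-SF label-confidence prediction or from the predicted worker accuracy; this is illustrative, not the paper's exact aggregation scheme.

```python
# Weighted voting over a document-topic pair's crowd labels: each label contributes
# its behavior-derived weight, and the label with the highest total weight wins.
from collections import defaultdict

def weighted_vote(labels_with_weights):
    """labels_with_weights: list of (label, weight) pairs for one document-topic pair."""
    score = defaultdict(float)
    for label, weight in labels_with_weights:
        score[label] += weight
    return max(score, key=score.get)

# e.g., with predicted-confidence weights:
# weighted_vote([("relevant", 0.9), ("non-relevant", 0.55), ("relevant", 0.6)])
# -> "relevant"
```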
  19. III. Dynamic Labeling via Behavior • Can we intelligently decide when to collect more labels given only observed behavior? • Markov Decision Process (MDP) – State is current estimated label quality • Individual label quality estimated by RF-SF; aggregate label quality following Dai et al. (2013) – Decide at each step whether to get another label • Weigh likely quality improvement vs. cost • Given target quality parameter, stop if we think it’s reached 19
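Stripped of the MDP machinery, the control loop can be sketched as a greedy stopping rule; the callback names below (get_label, estimate_quality) are hypothetical placeholders for the crowd request and the RF-SF / Dai et al.-style quality estimate, not functions from the paper.

```python
# A greedy sketch of dynamic labeling: keep buying labels for an item until the
# estimated aggregate label quality reaches the target or the label budget runs out.
def dynamic_label(item, get_label, estimate_quality, target_quality=0.7, max_labels=5):
    """get_label(item) -> (label, behavioral_trace) requests one more crowd label;
    estimate_quality(labels, traces) -> estimated probability the aggregated label
    is correct. Both are assumed callbacks supplied by the caller."""
    labels, traces = [], []
    while len(labels) < max_labels:
        label, trace = get_label(item)
        labels.append(label)
        traces.append(trace)
        if estimate_quality(labels, traces) >= target_quality:
            break  # estimated quality reached the target; stop collecting labels
    return labels
```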
  20. Dynamic Labeling: Results (figure: selecting the example to label next; target quality 0.7) 20
  21. Discussion • With strong task design, less need for QC – i.e., worker filtering & aggregation • Biggest challenge was small data scale • Ethical issues of behavioral data collection – Workplace “surveillance”; oDesk work diary 21
  22. Contributions & Future Work • Three models for quality prediction via behavior – Classification/Regression, Aggregation, & Dyn. Labeling – 1st use of behavioral data for aggregation & dynamic labeling • Shared behavioral data for ~4K HITs – http://ir.ischool.utexas.edu/webcrowd25k/ • Future Work – Analyzing behavioral data at greater scale – Hybrid aggregation (behavior + non-behavior) – Transfer learning (i.e., application across tasks) 22
  23. Matthew Lease - ml@utexas.edu - @mattlease Thank You! slideshare.net/mattlease Lab: ir.ischool.utexas.edu
