Toward Better Crowdsourcing Science (& Predicting Annotator Performance)

Invited talk at the ICML ’15 Workshop on Crowdsourcing and Machine Learning (CrowdML'15), July 10, 2015. Joint work with Hyun Joon Jung.
  1. 1. Toward Better Crowdsourcing Science (& Predicting Annotator Performance) Matt Lease School of Information University of Texas at Austin ir.ischool.utexas.edu @mattlease ml@utexas.edu Slides: www.slideshare.net/mattlease
  2. 2. “The place where people & technology meet” ~ Wobbrock et al., 2009 www.ischools.org
  3. 3. The Future of Crowd Work, CSCW’13 by Kittur, Nickerson, Bernstein, Gerber, Shaw, Zimmerman, Lease, and Horton 3 Matt Lease <ml@utexas.edu>
  4. 4. • Task Design, Language, & Occam’s Razor • What About the Humans? • Predicting Annotator Performance 4 Matt Lease <ml@utexas.edu> Roadmap Hyun Joon Jung
  5. 5. • Task Design, Language, & Occam’s Razor • What About the Humans? • Predicting Annotator Performance 5 Matt Lease <ml@utexas.edu> Roadmap
  6. 6. A Popular Tale of Crowdsourcing Woe • Heroic ML researcher asks the crowd to perform a simple task • Crowd (invariably) screws it up… • “Aha!” cries the ML researcher, “Fortunately, I know exactly how to solve this problem!” Matt Lease <ml@utexas.edu> 6
  7. 7. Matt Lease <ml@utexas.edu> 7
  8. 8. But why can’t the workers just get it right to begin with? Matt Lease <ml@utexas.edu> 8 Is everyone just lazy, stupid, or deceitful?!? Much of our literature seems to suggest this: • Cheaters • Fraudsters • “Lazy Turkers” • Scammers • Spammers
  9. 9. Another story (a parable) “We had a great software interface, but we went out of business because our customers were too stupid to figure out how to use it.” Moral • Even if a user were stupid or lazy, we still lose • By accepting our own responsibility, we create another opportunity to fix the problem… – Cynical view: idiot-proofing Matt Lease <ml@utexas.edu> 9
  10. 10. What is our responsibility? • Ill-defined/incomplete/ambiguous/subjective task? • Confusing, difficult, or unusable interface? • Incomplete or unclear instructions? • Insufficient or unhelpful examples given? • Gold standard with low or unknown inter-assessor agreement (i.e. measurement error in assessing response quality)? • Task design matters! (garbage in = garbage out) – Report it for review, completeness, & reproducibility Matt Lease <ml@utexas.edu> 10
  11. 11. A Few Simple Suggestions (1 of 2) 1. Make task self-contained: everything the worker needs to know should be visible in-task 2. Short, simple, & clear instructions with examples 3. Avoid domain-specific & advanced terminology; write for typical people (e.g., your mom) 4. Engage worker / avoid boring stuff. If possible, select interesting content for people to work on 5. Always ask for open-ended feedback Matt Lease <ml@utexas.edu> 11 Omar Alonso. Guidelines for Designing Crowdsourcing-based Relevance Experiments. 2009.
  12. 12. Suggested Sequencing (2 of 2) 1. Simulate first draft of task with your in-house personnel. Assess, revise, & iterate (ARI) 2. Run task using relatively few workers & examples (ARI) 1. Do workers understand the instructions? 2. How long does it take? Is pay effective & ethical? 3. Replicate results on another dataset (generalization). (ARI) 4. [Optional] qualification test. (ARI) 5. Increase items. Look for boundary items & noisy gold (ARI) 6. Increase # of workers (ARI) Matt Lease <ml@utexas.edu> 12 Omar Alonso. Guidelines for Designing Crowdsourcing-based Relevance Experiments. 2009.
  13. 13. Toward Better Crowdsourcing Science Goal: Strengthen individual studies and minimize unwarranted spread of bias in our scientific literature • Occam’s Razor: avoid making assumptions beyond what the data actually tells us (avoid prejudice!) • Enumerate hypotheses for possible causes of low data quality, assess supporting evidence for each hypothesis, and for any claims made, cite supporting evidence • Recognize uncertainty of analyses and convey this via hedge statements such as, “the data suggests that…” • Avoid derogatory language use without very strong supporting evidence. The crowd enables our work!! – Acknowledge your workers! Matt Lease <ml@utexas.edu> 13
  14. 14. • Task Design, Language, & Occam’s Razor • What About the Humans? • Predicting Annotator Performance 14 Matt Lease <ml@utexas.edu> Roadmap
  15. 15. Who are the workers? • A. Baio, November 2008. The Faces of Mechanical Turk. • P. Ipeirotis. March 2010. The New Demographics of Mechanical Turk • J. Ross, et al. Who are the Crowdworkers? CHI 2010. 15 Matt Lease <ml@utexas.edu>
  16. 16. CACM August, 2013 16 Paul Hyman. Communications of the ACM, Vol. 56 No. 8, Pages 19-21, August 2013. Matt Lease <ml@utexas.edu>
  17. 17. • “Contribute to society and human well-being” • “Avoid harm to others” “As an ACM member I will – Uphold and promote the principles of this Code – Treat violations of this code as inconsistent with membership in the ACM” 17 Matt Lease <ml@utexas.edu> “Which approaches are less expensive and is this sensible? With the advent of outsourcing and off-shoring these matters become more complex and take on new dimensions …there are often related ethical issues concerning exploitation… “…legal, social, professional and ethical [topics] should feature in all computing degrees.” 2008 ACM/IEEE Curriculum Update
  18. 18. Power Asymmetry on MTurk • Mistakes are made in HIT rejection & worker blocking – e.g., student error, bug, poor task design, noisy gold, etc. • Workers have limited recourse for appeal • Our errors impact real people’s lives • What is the loss function to optimize? • Should anyone hold researchers accountable? IRB? • How do we balance the risk of human harm vs. the potential benefit if our research succeeds? 18 Matt Lease <ml@utexas.edu>
  19. 19. ACM: “Contribute to society and human well-being; avoid harm to others” • How do we know who is doing the work, or if a decision to work (for a given price) is freely made? • Does it matter if work is performed by – Political refugees? Children? Prisoners? Disabled? • What (if any) moral obligation do crowdsourcing researchers have to consider broader impacts of our research (either good or bad) on the lives of those we depend on to power our systems? Matt Lease <ml@utexas.edu> 19
  20. 20. Who Are We Building a Better Future For? • Irani and Silberman (2013) – “…AMT helps employers see themselves as builders of innovative technologies, rather than employers unconcerned with working conditions.” • Silberman, Irani, and Ross (2010) – “How should we… conceptualize the role of the people we ask to power our computing?” 20
  21. 21. Could Effective Human Computation Sometimes Be a Bad Idea? • The Googler who Looked at the Worst of the Internet • Policing the Web’s Lurid Precincts • Facebook content moderation • The dirty job of keeping Facebook clean • Even linguistic annotators report stress & nightmares from reading news articles! 21 Matt Lease <ml@utexas.edu>
  22. 22. Join the conversation! Crowdwork-ethics, by Six Silberman http://crowdwork-ethics.wtf.tw an informal, occasional blog for researchers interested in ethical issues in crowd work 22 Matt Lease <ml@utexas.edu>
  23. 23. • Task Design, Language, & Occam’s Razor • What About the Humans? • Predicting Annotator Performance 23 Matt Lease <ml@utexas.edu> Roadmap Hyun Joon Jung
  24. 24. Quality Control in Crowdsourcing. [Diagram: requester, online marketplace, and crowd workers, with existing quality control methods: Task Design, Workflow Design, Label Aggregation, Worker Management.] This talk: who is more accurate? (worker performance estimation and prediction) 7/10/2015 24
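Of these methods, label aggregation is the one most crowdsourcing papers lean on. For reference, a minimal majority-vote aggregation sketch in Python; the data layout and function name are illustrative, not from the talk:

```python
from collections import Counter, defaultdict

def majority_vote(labels):
    """Aggregate crowd labels per item by simple majority vote.

    labels: iterable of (item_id, worker_id, label) tuples.
    Returns: dict mapping item_id -> most common label for that item.
    """
    votes = defaultdict(list)
    for item_id, _worker_id, label in labels:
        votes[item_id].append(label)
    return {item: Counter(ls).most_common(1)[0][0] for item, ls in votes.items()}

# Example: three workers give 0/1 relevance judgments on two documents.
crowd_labels = [
    ("doc1", "alice", 1), ("doc1", "bob", 0), ("doc1", "carol", 1),
    ("doc2", "alice", 0), ("doc2", "bob", 0), ("doc2", "carol", 1),
]
print(majority_vote(crowd_labels))  # {'doc1': 1, 'doc2': 0}
```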
  25. 25. Motivation Matt Lease <ml@utexas.edu> 25
  26. 26. Equally Accurate Workers? Correctness of the i-th task instance over time t (1 = correct, 0 = wrong): Alice: 1 0 1 0 1 0 1 0 1 0; Bob: 0 0 0 0 1 0 1 1 1 1. Accuracy(Alice) = Accuracy(Bob) = 0.5. But should we expect equal work quality in the future? What if examples are not i.i.d.? Bob seems to be improving over time. 7/10/2015 26
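A small illustrative sketch (not from the talk) of why a single overall accuracy can mislead: comparing each worker's overall accuracy with accuracy over only their most recent labels exposes Bob's upward trend.

```python
def accuracy(history, last_n=None):
    """Fraction correct over a 0/1 correctness history (optionally only the last_n labels)."""
    window = history if last_n is None else history[-last_n:]
    return sum(window) / len(window)

alice = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
bob   = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

for name, hist in [("Alice", alice), ("Bob", bob)]:
    print(name, "overall:", accuracy(hist), "last 5:", accuracy(hist, last_n=5))
# Both are 0.5 overall, but Bob's accuracy over his last 5 labels (0.8) suggests he is improving.
```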
  27. 27. 1: Time-series model. Latent autoregressive model: the latent variable x_t evolves as x_t = c + φ·x_{t−1} + ε_t (ε_t a noise model), and the real observation is y_t = f(x_t). φ captures temporal correlation (how frequently y has changed over time); the sign of the offset c steers the direction between correct vs. not. Fit with an EM variant (LAMORE, Park et al. 2014). Jung et al. Predicting Next Label Quality: A Time-Series Model of Crowdwork. AAAI HCOMP 2014. 27
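A minimal forward-simulation sketch of a latent autoregressive correctness model of this kind, assuming an AR(1) latent state thresholded to a 0/1 observation. The parameter values are illustrative, not fitted values from the paper, and the EM fitting step (LAMORE) is not shown.

```python
import random

def simulate_lar(T, c=0.1, phi=0.8, noise_sd=0.5, seed=0):
    """Simulate a latent AR(1) state x_t = c + phi * x_{t-1} + eps_t and
    emit y_t = 1 (correct) when x_t > 0, else 0 (wrong)."""
    rng = random.Random(seed)
    x, xs, ys = 0.0, [], []
    for _ in range(T):
        x = c + phi * x + rng.gauss(0.0, noise_sd)
        xs.append(x)
        ys.append(1 if x > 0 else 0)
    return xs, ys

xs, ys = simulate_lar(T=10)
print("latent state:", [round(v, 2) for v in xs])
print("observed correctness:", ys)
```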
  28. 28. 2: Modeling More Features. Rather than predict an assessor's next label quality based on a single feature (e.g., temporal effect), integrate multi-dimensional features of a crowd assessor. [Figure: example feature table for Alice over successive labels, with columns for label correctness, accuracy, time, temporal effect, topic familiarity, and # of labels, and the next label marked “?”.] Jung & Lease. A Discriminative Approach to Predicting Assessor Accuracy. ECIR 2015. 7/10/2015 28
  29. 29. Features. How do we flexibly capture a wider range of assessor behaviors by incorporating multi-dimensional features? Feature groups (drawing on prior work [1][2][3]): various accuracy measures, task features, and temporal features. [1] Carterette, B., Soboroff, I.: The effect of assessor error on IR system evaluation. SIGIR '10. [2] Ipeirotis, P.G., Gabrilovich, E.: Quizz: targeted crowdsourcing with a billion (potential) users. WWW'14. [3] Jung, H., et al.: Predicting Next Label Quality: A Time-Series Model of Crowdwork. HCOMP'14. Jung & Lease. A Discriminative Approach to Predicting Assessor Accuracy. ECIR 2015. 7/10/2015 29
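A sketch of what building such multi-dimensional assessor features might look like. The feature names follow the slides (accuracy, task time, # of labels, topic familiarity, a temporal signal), but the exact definitions and data layout here are assumptions for illustration only.

```python
def assessor_features(history):
    """Build a simple multi-dimensional feature vector for one assessor.

    history: list of dicts, one per completed label, e.g.
      {"correct": 1, "task_time": 8.5, "topic": "politics"}
    """
    n = len(history)
    correct = [h["correct"] for h in history]
    topics = [h["topic"] for h in history]
    return {
        "accuracy": sum(correct) / n,
        "recent_accuracy": sum(correct[-5:]) / min(n, 5),   # crude temporal signal
        "mean_task_time": sum(h["task_time"] for h in history) / n,
        "num_labels": n,
        "topic_ever_seen": int(topics[-1] in topics[:-1]),  # topic familiarity proxy
    }

history = [
    {"correct": 1, "task_time": 10.3, "topic": "sports"},
    {"correct": 0, "task_time": 8.5,  "topic": "politics"},
    {"correct": 1, "task_time": 7.5,  "topic": "politics"},
]
print(assessor_features(history))
```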
  30. 30. Model. Generalizable feature-based Assessor Model (GAM): input X (features for the crowd assessor model) is passed through a learning framework to produce output Y (the likelihood of getting a correct label at time t). Jung & Lease. A Discriminative Approach to Predicting Assessor Accuracy. ECIR 2015. 7/10/2015 30
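A minimal sketch of a discriminative assessor model in this spirit, using scikit-learn logistic regression as a stand-in learner. The slide does not specify the learning framework, so the learner choice and the synthetic feature values below are assumptions, not the ECIR 2015 setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# X: one row per (assessor, time step) with features like those sketched above
# (accuracy so far, recent accuracy, mean task time, # of labels);
# y: whether that assessor's next label was correct. Values are synthetic.
X = np.array([
    [0.50, 0.40, 9.5, 10],
    [0.50, 0.80, 8.0, 10],
    [0.70, 0.60, 7.0, 22],
    [0.30, 0.20, 12.0, 15],
    [0.65, 0.80, 7.5, 30],
    [0.40, 0.40, 11.0, 12],
])
y = np.array([0, 1, 1, 0, 1, 0])

model = LogisticRegression().fit(X, y)
p_correct = model.predict_proba(X[:1])[0, 1]
print(f"Predicted probability next label is correct: {p_correct:.2f}")
```

If such a prediction is reliable, it could feed back into worker management or label aggregation, for example by weighting or discounting a worker's contribution, in the spirit of the soft label updating and discounting slides later in the deck.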
  31. 31. Which Features Matter? [Table: prediction performance (MAE) of assessors' next judgments and corresponding coverage across varying decision rejection options (δ = 0 to 0.25 by 0.05); while the other methods show a significant decrease in coverage, GAM shows better coverage as well as prediction performance under all the given reject options.] [Fig. 4: summary of relative feature importance across 54 individual prediction models: AA (49), BA_opt (43), BA_PES (39), C (28), NumLabels (27), CurrentLabelQuality (23), AccChangeDirection (22), SA (20), Phi (19), BA_uni (16), TaskTime (10), TopicChange (7), TopicEverSeen (5).] A GAM with only the top 5 features shows good performance (7-10% less than the full-featured GAM). Jung & Lease. A Discriminative Approach to Predicting Assessor Accuracy. ECIR 2015. 7/10/2015 31
  32. 32. 3: Reducing Supervision Matt Lease <ml@utexas.edu> 32 Jung & Lease. Modeling Temporal Crowd Work Quality with Limited Supervision. HCOMP 2015.
  33. 33. Soft Label Updating & Discounting Matt Lease <ml@utexas.edu> 33
  34. 34. Soft Label Updating Matt Lease <ml@utexas.edu> 34
  35. 35. The Future of Crowd Work, CSCW’13 by Kittur, Nickerson, Bernstein, Gerber, Shaw, Zimmerman, Lease, and Horton 35 Matt Lease <ml@utexas.edu>
  36. 36. Thank You! ir.ischool.utexas.edu Slides: www.slideshare.net/mattlease
