1. Toward Better Crowdsourcing Science
(& Predicting Annotator Performance)
Matt Lease
School of Information
University of Texas at Austin
ir.ischool.utexas.edu
@mattlease
ml@utexas.edu
Slides: www.slideshare.net/mattlease
2. “The place where people & technology meet”
~ Wobbrock et al., 2009
www.ischools.org
3. The Future of Crowd Work, CSCW’13
by Kittur, Nickerson, Bernstein, Gerber,
Shaw, Zimmerman, Lease, and Horton
Matt Lease <ml@utexas.edu>
4. Roadmap
• Task Design, Language, & Occam’s Razor
• What About the Humans?
• Predicting Annotator Performance
Hyun Joon Jung
5. Roadmap
• Task Design, Language, & Occam’s Razor
• What About the Humans?
• Predicting Annotator Performance
6. A Popular Tale of Crowdsourcing Woe
• Heroic ML researcher asks the
crowd to perform a simple task
• Crowd (invariably) screws it up…
• “Aha!” cries the ML researcher, “Fortunately,
I know exactly how to solve this problem!”
8. But why can’t the workers just get it right to begin with?
Is everyone just lazy, stupid, or deceitful?!?
Much of our literature seems to suggest this:
• Cheaters
• Fraudsters
• “Lazy Turkers”
• Scammers
• Spammers
9. Another story (a parable)
“We had a great software interface, but we went
out of business because our customers were too
stupid to figure out how to use it.”
Moral
• Even if a user were stupid or lazy, we still lose
• By accepting our own responsibility, we create
another opportunity to fix the problem…
– Cynical view: idiot-proofing
10. What is our responsibility?
• Ill-defined/incomplete/ambiguous/subjective task?
• Confusing, difficult, or unusable interface?
• Incomplete or unclear instructions?
• Insufficient or unhelpful examples given?
• Gold standard with low or unknown inter-assessor
agreement (i.e. measurement error in assessing
response quality)?
• Task design matters! (garbage in = garbage out)
– Report it for review, completeness, & reproducibility
11. A Few Simple Suggestions (1 of 2)
1. Make task self-contained: everything the worker
needs to know should be visible in-task
2. Short, simple, & clear instructions with examples
3. Avoid domain-specific & advanced terminology;
write for typical people (e.g., your mom)
4. Engage worker / avoid boring stuff. If possible,
select interesting content for people to work on
5. Always ask for open-ended feedback
Omar Alonso. Guidelines for Designing Crowdsourcing-based Relevance Experiments. 2009.
12. Suggested Sequencing (2 of 2)
1. Simulate first draft of task with your in-house personnel.
Assess, revise, & iterate (ARI)
2. Run task using relatively few workers & examples (ARI)
1. Do workers understand the instructions?
2. How long does it take? Is pay effective & ethical?
3. Replicate results on another dataset (generalization). (ARI)
4. [Optional] qualification test. (ARI)
5. Increase items. Look for boundary items & noisy gold (ARI)
6. Increase # of workers (ARI)
Omar Alonso. Guidelines for Designing Crowdsourcing-based Relevance Experiments. 2009.
13. Toward Better Crowdsourcing Science
Goal: Strengthen individual studies and minimize
unwarranted spread of bias in our scientific literature
• Occam’s Razor: avoid making assumptions beyond
what the data actually tells us (avoid prejudice!)
• Enumerate hypotheses for possible causes of low data
quality, assess supporting evidence for each hypothesis,
and for any claims made, cite supporting evidence
• Recognize uncertainty of analyses and convey this via
hedge statements such as, “the data suggests that…”
• Avoid derogatory language use without very strong
supporting evidence. The crowd enables our work!!
– Acknowledge your workers!
14. Roadmap
• Task Design, Language, & Occam’s Razor
• What About the Humans?
• Predicting Annotator Performance
15. Who are the workers?
• A. Baio. The Faces of Mechanical Turk. November 2008.
• P. Ipeirotis. The New Demographics of Mechanical Turk. March 2010.
• J. Ross, et al. Who are the Crowdworkers? CHI 2010.
16. CACM August, 2013
Paul Hyman. Communications of the ACM, Vol. 56 No. 8, Pages 19-21, August 2013.
17. • “Contribute to society and human well-being”
• “Avoid harm to others”
“As an ACM member I will
– Uphold and promote the principles of this Code
– Treat violations of this code as inconsistent with membership in the ACM”
“Which approaches are less expensive and is this sensible? With the advent of
outsourcing and off-shoring these matters become more complex and take on new
dimensions …there are often related ethical issues concerning exploitation…
“…legal, social, professional and ethical [topics] should feature in all computing degrees.”
2008 ACM/IEEE Curriculum Update
18. Power Asymmetry on MTurk
• Mistakes are made in HIT rejection & worker blocking
– e.g., student error, bug, poor task design, noisy gold, etc.
• Workers have limited recourse for appeal
• Our errors impact real people’s lives
• What is the loss function to optimize?
• Should anyone hold researchers accountable? IRB?
• How do we balance the risk of human harm vs. the potential benefit if our research succeeds?
19. ACM: “Contribute to society and human
well-being; avoid harm to others”
• How do we know who is doing the work, or if a
decision to work (for a given price) is freely made?
• Does it matter if work is performed by
– Political refugees? Children? Prisoners? Disabled?
• What (if any) moral obligation do crowdsourcing
researchers have to consider broader impacts of
our research (either good or bad) on the lives of
those we depend on to power our systems?
20. Who Are We Building a Better Future For?
• Irani and Silberman (2013)
– “…AMT helps employers see themselves as builders
of innovative technologies, rather than employers
unconcerned with working conditions.”
• Silberman, Irani, and Ross (2010)
– “How should we… conceptualize the role of the
people we ask to power our computing?”
21. Could Effective Human Computation
Sometimes Be a Bad Idea?
• The Googler who Looked at the Worst of the Internet
• Policing the Web’s Lurid Precincts
• Facebook content moderation
• The dirty job of keeping Facebook clean
• Even linguistic annotators report stress &
nightmares from reading news articles!
22. Join the conversation!
Crowdwork-ethics, by Six Silberman
http://crowdwork-ethics.wtf.tw
an informal, occasional blog for researchers
interested in ethical issues in crowd work
23. Roadmap
• Task Design, Language, & Occam’s Razor
• What About the Humans?
• Predicting Annotator Performance
Hyun Joon Jung
24. Quality Control in Crowdsourcing
7/10/2015
Existing quality control methods: task design, workflow design, label aggregation, worker management
[Diagram: requester ↔ online marketplace ↔ crowd workers]
This talk: Who is more accurate? (worker performance estimation and prediction)
26. Equally Accurate Workers?
Correctness of the i-th task instance over time t (1 = correct, 0 = wrong):
Alice: 1 0 1 0 1 0 1 0 1 0
Bob: 0 0 0 0 1 0 1 1 1 1
Accuracy(Alice) = Accuracy(Bob) = 0.5
But should we expect equal work quality in the future?
What if examples are not i.i.d.? Bob seems to be improving over time.
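To make the intuition concrete, here is a small illustrative sketch (not from the talk): plain accuracy treats Alice and Bob identically, while a recency-weighted accuracy — one simple way to drop the i.i.d. assumption — favors Bob's improving trend. The exponential decay weighting is an assumption chosen for illustration, not the paper's model.

```python
# Two workers with equal overall accuracy can still differ in trend.
# Recency-weighted accuracy (illustrative choice) surfaces Bob's improvement.

def accuracy(history):
    return sum(history) / len(history)

def recency_weighted_accuracy(history, decay=0.8):
    # Later labels get more weight: weight_i = decay**(n - 1 - i)
    n = len(history)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    return sum(w * y for w, y in zip(weights, history)) / sum(weights)

alice = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # alternating: no trend
bob   = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]  # improving over time

print(accuracy(alice), accuracy(bob))   # both 0.5
print(recency_weighted_accuracy(alice), recency_weighted_accuracy(bob))
```

With decay 0.8, Bob's recency-weighted score is well above Alice's even though their plain accuracies are identical.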
27. 1: Time-Series Model
Latent autoregressive noise model:
• x_t: latent variable, with real observation y_t = f(x_t)
• φ: temporal correlation — how frequently y has changed over time
• c: offset — its sign navigates the direction between correct vs. not
Example: latent values x_t = −0.3, 0.4, −0.1, 0.8 underlie a binary sequence of observed label correctness y_t
Estimated with an EM variant (LAMORE; Park et al., 2014)
Jung et al. Predicting Next Label Quality: A Time-Series Model of Crowdwork. AAAI HCOMP 2014.
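A minimal simulation sketch of the latent autoregressive idea, assuming an AR(1) form x_t = c + φ·x_{t−1} + noise, with y_t = 1 when x_t is positive. The specific noise distribution and thresholding f(·) here are assumptions for illustration; the HCOMP 2014 model's exact formulation may differ.

```python
import random

def simulate_lar(T=10, c=0.1, phi=0.6, sigma=0.3, seed=0):
    """Sketch of a latent autoregressive worker model: a latent skill
    x_t evolves over time (offset c, temporal correlation phi), and the
    observed label correctness y_t is 1 when x_t is positive."""
    rng = random.Random(seed)
    x, xs, ys = 0.0, [], []
    for _ in range(T):
        x = c + phi * x + rng.gauss(0.0, sigma)  # AR(1) latent update
        xs.append(x)
        ys.append(1 if x > 0 else 0)             # observed correctness
    return xs, ys

xs, ys = simulate_lar()
print(ys)
```

A positive offset c pushes the latent skill upward over time, so simulated workers tend to drift toward correct labels, mimicking an improving worker like Bob.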
28. 2: Modeling More Features
Prior approach: predict an assessor’s next label quality based on a single feature, e.g., Alice’s temporal effect alone (0.6, 0.5, 0.4, 0.3 → labels 0, 1, 0, ?).
New approach: integrate multi-dimensional features of a crowd assessor.
Alice’s feature history:
accuracy | time | temporal effect | topic familiarity | # of labels | label quality
0.7 | 10.3 | 0.6 | 0.8 | 20 | 0
0.6 | 8.5 | 0.5 | 0.2 | 21 | 1
0.65 | 7.5 | 0.4 | 0.4 | 22 | 0
0.63 | 11.5 | 0.3 | 0.5 | 23 | ?
Jung & Lease. A Discriminative Approach to Predicting Assessor Accuracy. ECIR 2015.
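A hypothetical sketch of extracting such multi-dimensional feature rows from one assessor's label stream. The exact feature definitions here (running accuracy, recent-window temporal effect, topic-familiarity ratio) are illustrative assumptions, not the ECIR 2015 definitions.

```python
# Hypothetical feature extraction for one assessor's label stream.
# Feature names mirror the slide's table; definitions are illustrative.

def assessor_features(history, window=5):
    """history: list of (correct, task_time, topic) tuples, oldest first.
    Returns one feature row per time step."""
    rows, topics_seen = [], {}
    for t, (correct, task_time, topic) in enumerate(history):
        labels = [h[0] for h in history[: t + 1]]
        acc = sum(labels) / len(labels)          # running accuracy
        recent = labels[-window:]
        temporal = sum(recent) / len(recent)     # recent-window accuracy
        topics_seen[topic] = topics_seen.get(topic, 0) + 1
        familiarity = topics_seen[topic] / (t + 1)  # topic familiarity ratio
        rows.append({"accuracy": acc, "time": task_time,
                     "temporal_effect": temporal,
                     "topic_familiarity": familiarity,
                     "num_labels": t + 1})
    return rows

rows = assessor_features([(1, 10.3, "a"), (0, 8.5, "a"), (1, 7.5, "b")])
```

Each row plays the role of one line in the slide's table: the model sees the feature vector at time t and predicts the label-quality column.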
29. Features
How do we flexibly capture a wider range of assessor behaviors by incorporating multi-dimensional features?
• Various accuracy measures [1]
• Task features [2]
• Temporal features [3]
[1] Carterette, B., Soboroff, I. The effect of assessor error on IR system evaluation. SIGIR 2010.
[2] Ipeirotis, P.G., Gabrilovich, E. Quizz: targeted crowdsourcing with a billion (potential) users. WWW 2014.
[3] Jung, H., et al. Predicting Next Label Quality: A Time-Series Model of Crowdwork. HCOMP 2014.
Jung & Lease. A Discriminative Approach to Predicting Assessor Accuracy. ECIR 2015.
30. Model
Generalizable feature-based Assessor Model (GAM)
• Input: X (features for crowd assessor model)
• Learning framework
• Output: Y (likelihood of getting a correct label at time t)
Jung & Lease. A Discriminative Approach to Predicting Assessor Accuracy. ECIR 2015.
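To show the input/output shape, here is a toy discriminative sketch: logistic regression mapping assessor feature vectors X to the probability Y that the next label is correct. This stands in for GAM's actual learning framework, which the ECIR 2015 paper details; the toy data and pure-Python trainer are assumptions for illustration.

```python
import math

def train_logreg(X, y, lr=0.1, epochs=500):
    """Fit logistic regression by stochastic gradient descent on log-loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - yi                                   # log-loss gradient
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def predict(w, b, xi):
    """Probability that the next label is correct."""
    z = sum(wj * xj for wj, xj in zip(w, xi)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Toy data: [running_accuracy, temporal_effect] -> was the next label correct?
X = [[0.9, 0.8], [0.8, 0.9], [0.3, 0.2], [0.2, 0.1]]
y = [1, 1, 0, 0]
w, b = train_logreg(X, y)
```

Any discriminative learner with probabilistic output could fill the same slot; the key point is the mapping from per-assessor feature rows to a calibrated correctness probability.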
31. Which Features Matter?
Prediction performance (MAE) of assessors’ next judgments and corresponding coverage across varying decision rejection options (δ = [0–0.25] by 0.05). While the other methods show a significant decrease in coverage, under all the given reject options GAM shows better coverage as well as prediction performance.
Fig. 4: Relative feature importance across 54 individual prediction models:
AA (49), BA_opt (43), BA_PES (39), C (28), NumLabels (27), CurrentLabelQuality (23), AccChangeDirection (22), SA (20), Phi (19), BA_uni (16), TaskTime (10), TopicChange (7), TopicEverSeen (5)
A GAM with only the top 5 features shows good performance (7–10% less than the full-featured GAM).
Jung & Lease. A Discriminative Approach to Predicting Assessor Accuracy. ECIR 2015.
32. 3: Reducing Supervision
Jung & Lease. Modeling Temporal Crowd Work Quality with Limited Supervision. HCOMP 2015.