Tutorial on Using Amazon Mechanical Turk (MTurk) for HCI Research
1. Quality
Crowdsourcing for Human-Computer Interaction Research
Ed H. Chi
Research Scientist
Google
(work done while at [Xerox] PARC)
Aniket Kittur, Ed H. Chi, Bongwon Suh.
Crowdsourcing User Studies With Mechanical Turk. In CHI2008.
3. Historical Footnote
• De Prony, 1794, hired hairdressers (unemployed after the French Revolution; they knew only addition and subtraction) to create logarithmic and trigonometric tables.
• He managed the process by splitting the work into very detailed workflows.
– Grier, When Computers Were Human, 2005
4. Using Mechanical Turk for user studies
                    Traditional user studies    Mechanical Turk
Task complexity     Complex, Long               Simple, Short
Task subjectivity   Subjective, Opinions        Objective, Verifiable
User information    Targeted demographics,      Unknown demographics,
                    High interactivity          Limited interactivity

Can Mechanical Turk be usefully used for user studies?
5. Task
• Assess quality of Wikipedia articles
• Started with ratings from expert Wikipedians
– 14 articles (e.g., Germany, Noam Chomsky)
– 7-point scale
• Can we get matching ratings with Mechanical Turk?
6. Experiment 1
• Rate articles on 7-point scales:
– Well written
– Factually accurate
– Overall quality
• Free-text input:
– What improvements does the article need?
• Paid $0.05 each
7. Experiment 1: Good news
• 58 users made 210 ratings (15 per article)
– $10.50 total
• Fast results
– 44% within a day, 100% within two days
– Many completed within minutes
8. Experiment 1: Bad news
• Correlation between turkers and Wikipedians
only marginally significant (r=.50, p=.07)
• Worse, 59% potentially invalid responses
                    Experiment 1
Invalid comments    49%
1-min responses     31%
• Nearly 75% of these done by only 8 users
9. Not a good start
• Summary of Experiment 1:
– Only marginal correlation with experts.
– Heavy gaming of the system by a minority
• Possible responses:
– Make sure these gamers are not rewarded
– Ban them from doing your HITs in the future
– Create a reputation system [Dolores Labs]
• Can we change how we collect user input?
10. Design changes
• Use verifiable questions to signal monitoring
– How many sections does the article have?
– How many images does the article have?
– How many references does the article have?
11. Design changes
• Use verifiable questions to signal monitoring
• Make malicious answers as high cost as good-faith
answers
– Provide 4-6 keywords that would give someone a
good summary of the contents of the article
12. Design changes
• Use verifiable questions to signal monitoring
• Make malicious answers as high cost as good-faith
answers
• Make verifiable answers useful for completing the task
– Used tasks similar to how Wikipedians evaluate quality
(organization, presentation, references)
13. Design changes
• Use verifiable questions to signal monitoring
• Make malicious answers as high cost as good-faith
answers
• Make verifiable answers useful for completing the task
• Put verifiable tasks before subjective responses
– First do objective tasks and summarization
– Only then evaluate subjective quality
– Ecological validity?
14. Experiment 2: Results
• 124 users provided 277 ratings (~20 per article)
• Significant positive correlation with Wikipedians
– r=.66, p=.01
• Smaller proportion malicious responses
• Increased time on task
                    Experiment 1    Experiment 2
Invalid comments    49%             3%
1-min responses     31%             7%
Median time         1:30            4:06
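The correlations reported above (r = .50 in Experiment 1, r = .66 in Experiment 2) compare Turker ratings against expert Wikipedian ratings per article. A minimal sketch of how such a check can be computed; the numbers below are toy values, not the study's data:

```python
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Toy data: per-article expert ratings vs. mean Turker ratings (7-point scale).
expert = [6.5, 4.0, 5.5, 3.0, 6.0]
turker = [6.0, 4.5, 5.0, 3.5, 5.5]
print(round(pearson_r(expert, turker), 3))
```

In practice each article's Turker ratings would first be averaged across the ~15-20 redundant raters before correlating with the expert score.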
15. Generalizing to other MTurk studies
• Combine objective and subjective questions
– Rapid prototyping: ask verifiable questions about the content/design of a prototype before subjective evaluation
– User surveys: ask common-knowledge questions before asking for opinions
• Filtering for Quality
– Put in a field for free-form responses and filter out data without answers
– Filter out results that came in too quickly
– Sort by WorkerID and look for cut-and-paste answers
– Look for suspicious outliers in the data
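The filtering heuristics above can be applied as a post-hoc pass over raw HIT results. A minimal sketch; the record fields (`worker_id`, `comment`, `seconds`) and the 60-second threshold are illustrative assumptions, not MTurk's actual result schema:

```python
from collections import Counter

MIN_SECONDS = 60  # assumed threshold for "came in too quickly"

def filter_responses(responses):
    """Drop empty free-text answers, too-fast completions, and
    identical cut-and-paste comments from the same worker."""
    # Count identical (worker, comment) pairs to catch copy-paste answers.
    dupes = Counter((r["worker_id"], r["comment"]) for r in responses)
    kept = []
    for r in responses:
        if not r["comment"].strip():
            continue  # free-form field left empty
        if r["seconds"] < MIN_SECONDS:
            continue  # result came in too quickly
        if dupes[(r["worker_id"], r["comment"])] > 1:
            continue  # same worker pasted identical text
        kept.append(r)
    return kept

raw = [
    {"worker_id": "A1", "comment": "Needs more references.", "seconds": 240},
    {"worker_id": "A2", "comment": "", "seconds": 180},
    {"worker_id": "A3", "comment": "good", "seconds": 15},
    {"worker_id": "A4", "comment": "Nice article!", "seconds": 90},
    {"worker_id": "A4", "comment": "Nice article!", "seconds": 85},
]
print(len(filter_responses(raw)))  # only the first response survives
```

Outlier detection on the remaining numeric ratings would be a further pass over `kept`.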
16. Quick Summary of Tips
• Mechanical Turk offers the practitioner a way to access a
large user pool and quickly collect data at low cost
• Good results require careful task design
1. Use verifiable questions to signal monitoring
2. Make malicious answers as high cost as good-faith answers
3. Make verifiable answers useful for completing the task
4. Put verifiable tasks before subjective responses
17. Managing Quality
• Quality through redundancy: Combining votes
– Majority vote [works best when workers have similar quality]
– Worker-quality-adjusted vote
– Managing dependencies
• Quality through gold data
– Advantageous when the dataset is imbalanced or workers are bad
• Estimating worker quality (redundancy + gold)
– Calculate the confusion matrix and see if you actually get some information from the worker
• Toolkit: http://code.google.com/p/get-another-label/
Source: Ipeirotis, WWW2011
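The redundancy ideas above can be sketched in a few lines. This is an illustrative toy, not the get-another-label toolkit: a plain majority vote, a per-worker accuracy score estimated from gold-labeled items, and a quality-adjusted vote that lets accurate workers outweigh a numerically larger group of bad ones.

```python
from collections import Counter, defaultdict

def majority_vote(labels):
    """Plain majority vote; works best when workers have similar quality."""
    return Counter(labels).most_common(1)[0][0]

def worker_accuracy(answers, gold):
    """Fraction of gold-labeled items each worker answered correctly."""
    hits, total = defaultdict(int), defaultdict(int)
    for worker, item, label in answers:
        if item in gold:
            total[worker] += 1
            hits[worker] += (label == gold[item])
    return {w: hits[w] / total[w] for w in total}

def weighted_vote(labels_with_weights):
    """Worker-quality-adjusted vote: sum each worker's accuracy per label."""
    score = defaultdict(float)
    for label, weight in labels_with_weights:
        score[label] += weight
    return max(score, key=score.get)

# (worker, item, label) triples; item "i1" has a known gold answer "spam".
answers = [("w1", "i1", "spam"), ("w2", "i1", "ham"), ("w3", "i1", "ham"),
           ("w1", "i2", "ham"), ("w2", "i2", "spam"), ("w3", "i2", "spam")]
gold = {"i1": "spam"}
acc = worker_accuracy(answers, gold)  # w1 is accurate; w2, w3 are not
labels_i2 = [(l, acc[w]) for w, i, l in answers if i == "i2"]
print(majority_vote([l for w, i, l in answers if i == "i2"]))  # "spam"
print(weighted_vote(labels_i2))                                # "ham"
```

Here the two aggregation rules disagree on item "i2": the majority says "spam", but once the two zero-accuracy workers are down-weighted, the quality-adjusted vote follows the one reliable worker. A full confusion matrix per worker (as in the toolkit) generalizes this beyond simple accuracy.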
18. Coding and Machine Learning
• Integration with Machine Learning
– Build automatic classification models using crowdsourced data
[Diagram: data from existing crowdsourced answers trains an Automatic Model (through machine learning); a New Case fed to the model produces an Automatic Answer.]
Source: Ipeirotis, WWW2011
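A minimal sketch of the pipeline in the diagram, under toy assumptions: redundant crowd labels are aggregated into one training label per item, and a trivial 1-nearest-neighbor model stands in for the learned classifier. The items, labels, and numeric features below are all hypothetical.

```python
from collections import Counter

def aggregate(crowd_labels):
    """Collapse redundant crowd answers into one training label per item."""
    by_item = {}
    for item, label in crowd_labels:
        by_item.setdefault(item, []).append(label)
    return {item: Counter(ls).most_common(1)[0][0]
            for item, ls in by_item.items()}

def nearest_neighbor(train, features, case):
    """1-NN on toy numeric features: answer with the closest labeled item."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    best = min(train, key=lambda item: dist(features[item], case))
    return train[best]

# Crowd answers as (item, label); features are hypothetical descriptors.
crowd = [("d1", "high"), ("d1", "high"), ("d1", "low"),
         ("d2", "low"), ("d2", "low")]
features = {"d1": (0.9, 0.8), "d2": (0.1, 0.2)}
train = aggregate(crowd)  # one clean label per item
print(nearest_neighbor(train, features, (0.85, 0.7)))  # new case, near d1
```

Once trained, the model answers new cases without further crowd cost, which is the point of the integration: the crowd labels the training data, the model handles the volume.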
19. Limitations of Mechanical Turk
• No control of users' environment
– Potential for different browsers, physical distractions
– General problem with online experimentation
• Not designed for user studies
– Difficult to do between-subjects design
– May need some programming
• Users
– Somewhat hard to control demographics, expertise
20. Crowdsourcing for HCI Research
• Does my interface/visualization work?
– WikiDashboard: transparency vis for Wikipedia [Suh et al.]
– Replicating Perceptual Experiments [Heer et al., CHI2010]
• Coding of large amount of user data
– What is a Question in Twitter? [Sharoda Paul, Lichan Hong, Ed Chi]
• Incentive mechanisms
– Intrinsic vs. Extrinsic rewards: Games vs. Pay
– [Horton & Chilton, 2010] for MTurk and [Ariely, 2009] in general
21. Crowdsourcing for HCI Research
• Does my interface/visualization work?
– WikiDashboard: transparency vis for Wikipedia [Suh et al.]
– Replicating Perceptual Experiments [Heer et al., CHI2010]
• Coding of large amount of user data
– What is a Question in Twitter? [Sharoda Paul, Lichan Hong, Ed Chi]
• Incentive mechanisms
– Intrinsic vs. Extrinsic rewards: Games vs. Pay
– [Horton & Chilton, 2010] for MTurk: satisficing
– [Ariely, 2009] in general: higher pay != better work
23. Crowd Programming for Complex Tasks
• Decompose tasks into smaller tasks
– Digital Taylorism
– Frederick Winslow Taylor (1856-1915)
– 1911, 'Principles of Scientific Management'
• Crowd Programming Explorations
– MapReduce Models
• Kittur, A.; Smus, B.; and Kraut, R. CHI2011EA on CrowdForge.
• Kulkarni, Can, Hartmann, CHI2011 workshop WIP
– Little, G.; Chilton, L.; Goldman, M.; and Miller, R. C. In
KDD 2010 Workshop on Human Computation.
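The partition/map/reduce pattern that CrowdForge and Turkomatic explore can be sketched as plain functions, with a simulated worker standing in for paid HITs. Everything here is illustrative, not either system's actual API:

```python
def partition(task):
    """Partition step: a worker splits a complex task into subtasks,
    e.g., an article outline of section headings."""
    # Simulated worker output for an article-writing task.
    return ["History", "Geography"]

def map_step(sections, worker):
    """Map step: one HIT per section, each simple enough for one worker."""
    return {s: worker(s) for s in sections}

def reduce_step(parts):
    """Reduce step: consolidate per-section results into a single artifact."""
    return "\n\n".join(f"== {s} ==\n{text}" for s, text in parts.items())

def fake_worker(section):
    """Stand-in worker; in a real flow each call would be a separate HIT."""
    return f"One paragraph about {section.lower()}."

sections = partition("Write an encyclopedia article about France")
article = reduce_step(map_step(sections, fake_worker))
print(article)
```

The steps can also be iterative, as the papers note: a partition's output can itself be partitioned, and a reduce step's output reordered by a second reduce.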
24. Crowd Programming for Complex Tasks
• Crowd Programming Explorations
– Kittur, A.; Smus, B.; and Kraut, R. CHI2011EA on CrowdForge.
– Kulkarni, Can, Hartmann, CHI2011 workshop WIP
[Embedded page from the work-in-progress paper (CHI 2011, May 7–12, Vancouver, BC, Canada). Readable fragments: a complex task such as writing an encyclopedia article is split by a partition step into map tasks simple enough for a single worker, and a reduce step consolidates the results; these steps can be iterative. For the SAT case study, sixteen questions from a high school Scholastic Aptitude Test were uploaded to the web and posted to Turkomatic as "Please solve the 16-question SAT located at http://bit.ly/SATexam". In both cases workers were paid between $0.10 and $0.40 per HIT; "subdivide" and "merge" HITs received answers within 4 hours, and solutions to the initial task were complete within 72 hours. Decompositions produced by Turkers while running Turkomatic are displayed in Figure 1 (essay-writing) and Figure 4 (SAT).]
25. Future Directions in Crowdsourcing
• Real-time Crowdsourcing
– Bigham, et al. VizWiz, UIST 2010
[Figure 2: Six questions asked by participants, the photographs they took, and the answers received with latencies; e.g., "What color is this pillow?" answered in 89s, and "What temperature is my oven set to?" answered in 69s with "it looks like 425 degrees but the image is difficult to see".]
26. Future Directions in Crowdsourcing
• Real-time Crowdsourcing
– Bigham, et al. VizWiz, UIST 2010
• Embedding of Crowdwork inside Tools
– Bernstein, et al. Soylent, UIST 2010
27. Future Directions in Crowdsourcing
• Real-time Crowdsourcing
– Bigham, et al. VizWiz, UIST 2010
• Embedding of Crowdwork inside Tools
– Bernstein, et al. Soylent, UIST 2010
• Shepherding Crowdwork
– Dow et al. CHI2011 WIP
[Embedded excerpt from the Dow et al. paper on the design space for crowd feedback. Timeliness: when should feedback be delivered? In micro-task work, workers stay a while, then move on, which implies two options: synchronously deliver feedback while workers are engaged in a set of tasks, or asynchronously deliver feedback after workers have completed them. Synchronous feedback may have more impact on task performance while workers are still in the task domain, but it places a heavy burden on the feedback provider, implying a need for scheduling algorithms for near real-time feedback.]
28. Tutorials
• Thanks to Matt Lease http://ir.ischool.utexas.edu/crowd/
• AAAI 2011 (with HCOMP 2011): Human Computation: Core Research Questions and State of the Art (E. Law and Luis von Ahn)
• WSDM 2011: Crowdsourcing 101: Putting the WSDM of Crowds to
Work for You (Omar Alonso and Matthew Lease)
– http://ir.ischool.utexas.edu/wsdm2011_tutorial.pdf
• LREC 2010 Tutorial: Statistical Models of the Annotation Process (Bob
Carpenter and Massimo Poesio)
– http://lingpipe-blog.com/2010/05/17/
• ECIR 2010: Crowdsourcing for Relevance Evaluation. (Omar Alonso)
– http://wwwcsif.cs.ucdavis.edu/~alonsoom/crowdsourcing.html
• CVPR 2010: Mechanical Turk for Computer Vision. (Alex Sorokin and Fei-Fei Li)
– http://sites.google.com/site/turkforvision/
• CIKM 2008: Crowdsourcing for Relevance Evaluation (D. Rose)
– http://videolectures.net/cikm08_rose_cfre/
• WWW2011: Managing Crowdsourced Human Computation (Panos
Ipeirotis)
– http://www.slideshare.net/ipeirotis/managing-crowdsourced-human-computation
29. Thanks!
• chi@acm.org
• http://edchi.net
• @edchi
• Aniket Kittur, Ed H. Chi, Bongwon Suh. Crowdsourcing User Studies
With Mechanical Turk. In Proceedings of the ACM Conference on Human Factors
in Computing Systems (CHI2008), pp. 453-456. ACM Press, 2008. Florence, Italy.
• Aniket Kittur, Bongwon Suh, Ed H. Chi. Can You Ever Trust a Wiki?
Impacting Perceived Trustworthiness in Wikipedia. In Proc. of Computer-
Supported Cooperative Work (CSCW2008), pp. 477-480. ACM Press, 2008.
San Diego, CA. [Best Note Award]