Tutorial on Using Amazon Mechanical Turk (MTurk) for HCI Research

Tutorial given at HCIC 2011 workshop


  1. 1. Quality Crowdsourcing for Human-Computer Interaction Research Ed H. Chi Research Scientist Google (work done while at [Xerox] PARC) Aniket Kittur, Ed H. Chi, Bongwon Suh. Crowdsourcing User Studies With Mechanical Turk. In CHI2008. 1
  2. 2. Example Task from Amazon MTurk 2
  3. 3. Historical Footnote •  De Prony, 1794, hired hairdressers •  (unemployed after the French Revolution; knew only addition and subtraction) •  to create logarithmic and trigonometric tables. •  He managed the process by splitting the work into very detailed workflows. –  Grier, When Computers Were Human, 2005 3
  4. 4. Using Mechanical Turk for user studies •  Traditional user studies vs. Mechanical Turk: –  Task complexity: complex, long vs. simple, short –  Task subjectivity: subjective, opinions vs. objective, verifiable –  User information: targeted demographics vs. unknown demographics –  Interactivity: high vs. limited •  Can Mechanical Turk be usefully used for user studies? 4
  5. 5. Task •  Assess quality of Wikipedia articles •  Started with ratings from expert Wikipedians –  14 articles (e.g., "Germany", "Noam Chomsky") –  7-point scale •  Can we get matching ratings with Mechanical Turk? 5
  6. 6. Experiment 1 •  Rate articles on 7-point scales: –  Well written –  Factually accurate –  Overall quality •  Free-text input: –  What improvements does the article need? •  Paid $0.05 each 6
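For readers who want to set up a similar task programmatically, the sketch below posts a rating HIT with a small reward and multiple assignments per article. It assumes the present-day boto3 MTurk client (not what was used in 2011) and a separately authored question XML file; the title, keywords, file name, and limits are illustrative, not the study's originals.

```python
import boto3

mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    # Sandbox endpoint: test the task design before paying real workers.
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

with open("rate_article_question.xml") as f:  # question XML with the 7-point
    question_xml = f.read()                   # scales and the free-text box

hit = mturk.create_hit(
    Title="Rate the quality of a Wikipedia article",
    Description="Read a short article and rate how well written and accurate it is.",
    Keywords="wikipedia, rating, survey",
    Reward="0.05",                       # USD per assignment, as in the study
    MaxAssignments=15,                   # ~15 ratings per article
    AssignmentDurationInSeconds=30 * 60,
    LifetimeInSeconds=2 * 24 * 60 * 60,
    Question=question_xml,
)
print(hit["HIT"]["HITId"])
```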
  7. 7. Experiment 1: Good news •  58 users made 210 ratings (15 per article) –  $10.50 total •  Fast results –  44% within a day, 100% within two days –  Many completed within minutes 7
  8. 8. Experiment 1: Bad news •  Correlation between Turkers and Wikipedians only marginally significant (r=.50, p=.07) •  Worse, 59% potentially invalid responses –  Invalid comments: 49%; 1-min responses: 31% •  Nearly 75% of these done by only 8 users 8
  9. 9. Not a good start •  Summary of Experiment 1: –  Only marginal correlation with experts –  Heavy gaming of the system by a minority •  Possible responses: –  Make sure these gamers are not rewarded –  Ban them from doing your HITs in the future –  Create a reputation system [Dolores Labs] •  Can we change how we collect user input? 9
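A minimal sketch of the "don't reward gamers" response, assuming the boto3 MTurk client; `is_gamed` is a hypothetical callable standing in for whatever validity check you apply (e.g., the filters described on later slides).

```python
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

def review_assignments(hit_id, is_gamed):
    """Approve good-faith work, reject gamed work, and block repeat offenders.
    `is_gamed` is a hypothetical check you supply (blank comments,
    implausibly fast completion times, etc.)."""
    resp = mturk.list_assignments_for_hit(HITId=hit_id, AssignmentStatuses=["Submitted"])
    for a in resp["Assignments"]:
        if is_gamed(a):
            mturk.reject_assignment(
                AssignmentId=a["AssignmentId"],
                RequesterFeedback="Answers did not match the article content.",
            )
            # Optionally keep the worker out of your future HITs.
            mturk.create_worker_block(
                WorkerId=a["WorkerId"],
                Reason="Repeated invalid responses.",
            )
        else:
            mturk.approve_assignment(AssignmentId=a["AssignmentId"])
```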
  10. 10. Design changes •  Use verifiable questions to signal monitoring –  How many sections does the article have? –  How many images does the article have? –  How many references does the article have? 10
  11. 11. Design changes •  Use verifiable questions to signal monitoring •  Make malicious answers as high cost as good-faith answers –  Provide 4-6 keywords that would give someone a good summary of the contents of the article 11
  12. 12. Design changes •  Use verifiable questions to signal monitoring •  Make malicious answers as high cost as good-faith answers •  Make verifiable answers useful for completing task –  Used tasks similar to how Wikipedians evaluate quality (organization, presentation, references) 12
  13. 13. Design changes •  Use verifiable questions to signal monitoring •  Make malicious answers as high cost as good-faith answers •  Make verifiable answers useful for completing task •  Put verifiable tasks before subjective responses –  First do objective tasks and summarization –  Only then evaluate subjective quality –  Ecological validity? 13
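Putting the four design changes together, the redesigned HIT asks its questions in roughly this order. The field names and exact wording below are illustrative, not copied from the original study materials.

```python
# Question order for the redesigned HIT: verifiable items first, then the
# costly-to-fake summary, and only then the subjective ratings.
QUESTIONS = [
    {"id": "num_sections",   "type": "number",  "text": "How many sections does the article have?"},
    {"id": "num_images",     "type": "number",  "text": "How many images does the article have?"},
    {"id": "num_references", "type": "number",  "text": "How many references does the article have?"},
    {"id": "keywords",       "type": "text",    "text": "Provide 4-6 keywords that would give someone a good summary of the article."},
    # Subjective items come only after the objective work is done.
    {"id": "well_written",   "type": "scale_7", "text": "How well written is the article?"},
    {"id": "accurate",       "type": "scale_7", "text": "How factually accurate is the article?"},
    {"id": "quality",        "type": "scale_7", "text": "What is the overall quality of the article?"},
]
```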
  14. 14. Experiment 2: Results •  124 users provided 277 ratings (~20 per article) •  Significant positive correlation with Wikipedians –  r=.66, p=.01 •  Smaller proportion of malicious responses •  Increased time on task –  Invalid comments: 49% (Exp. 1) vs. 3% (Exp. 2) –  1-min responses: 31% vs. 7% –  Median time: 1:30 vs. 4:06 14
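One way to compute the agreement figures reported here is to correlate each article's mean Turker rating with its expert rating. A minimal sketch with made-up data (the real study used 14 articles on a 7-point scale):

```python
from scipy.stats import pearsonr

# Hypothetical data: expert (Wikipedian) rating per article and the Turker
# ratings collected for that article, both on the 7-point scale.
expert = {"Germany": 6, "Noam Chomsky": 5, "Article C": 3}            # ...14 articles in the study
turker = {"Germany": [5, 6, 7, 6], "Noam Chomsky": [4, 5, 5], "Article C": [3, 4, 2]}

articles = sorted(expert)
expert_scores = [expert[a] for a in articles]
turker_means = [sum(turker[a]) / len(turker[a]) for a in articles]

r, p = pearsonr(expert_scores, turker_means)   # e.g., r=.66, p=.01 in Experiment 2
print(f"r = {r:.2f}, p = {p:.3f}")
```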
  15. 15. Generalizing to other MTurk studies •  Combine objective and subjective questions –  Rapid prototyping: ask verifiable questions about the content/design of a prototype before subjective evaluation –  User surveys: ask common-knowledge questions before asking for opinions •  Filtering for quality –  Put in a field for free-form responses and filter out data without answers –  Filter out results that came in too quickly –  Sort by WorkerID and look for cut-and-paste answers –  Look for suspicious outliers in the data 15
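These filters translate directly into a few lines of data cleaning. A sketch using pandas, assuming a downloaded batch-results CSV; the `Answer.*` column names depend on how the HIT form fields were named and are illustrative here.

```python
import pandas as pd

df = pd.read_csv("batch_results.csv")   # MTurk batch-results export

# 1. Drop rows whose free-form response is empty.
df = df[df["Answer.comments"].fillna("").str.strip() != ""]

# 2. Drop results that came in implausibly quickly (threshold is a judgment call).
df = df[df["WorkTimeInSeconds"] >= 60]

# 3. Sort by worker and flag cut-and-paste answers (same comment reused by one worker).
dupes = df.groupby("WorkerId")["Answer.comments"].apply(lambda s: s.duplicated().sum())
df = df[~df["WorkerId"].isin(dupes[dupes > 2].index)]

# 4. Look for suspicious outliers, e.g. ratings far from the per-article mean.
z = df.groupby("HITId")["Answer.rating"].transform(lambda x: (x - x.mean()) / (x.std(ddof=0) or 1))
df = df[z.abs() <= 3]
```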
  16. 16. Quick Summary of Tips •  Mechanical Turk offers the practitioner a way to access a large user pool and quickly collect data at low cost •  Good results require careful task design 1.  Use verifiable questions to signal monitoring 2.  Make malicious answers as high cost as good-faith answers 3.  Make verifiable answers useful for completing task 4.  Put verifiable tasks before subjective responses 16
  17. 17. Managing Quality •  Quality through redundancy: combining votes –  Majority vote [works best when workers have similar quality] –  Worker-quality-adjusted vote –  Managing dependencies •  Quality through gold data –  Advantageous when the dataset is imbalanced or workers are unreliable •  Estimating worker quality (redundancy + gold) –  Calculate the confusion matrix and see if you actually get some information from the worker •  Toolkit: http://code.google.com/p/get-another-label/ Source: Ipeirotis, WWW2011 17
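A minimal sketch of two of these ideas, majority vote and a worker-quality-adjusted vote with accuracies estimated from gold items. The linked get-another-label toolkit implements a much fuller version (per-worker confusion matrices rather than a single accuracy number); the data structures below are assumptions for illustration.

```python
from collections import Counter, defaultdict

# labels: {item_id: [(worker_id, label), ...]}; gold: {item_id: correct_label}

def majority_vote(votes):
    """Plain majority vote; works best when workers have similar quality."""
    return Counter(label for _, label in votes).most_common(1)[0][0]

def worker_accuracy(labels, gold):
    """Estimate each worker's accuracy from the gold (known-answer) items."""
    right, total = defaultdict(int), defaultdict(int)
    for item, votes in labels.items():
        if item in gold:
            for worker, label in votes:
                total[worker] += 1
                right[worker] += int(label == gold[item])
    return {w: right[w] / total[w] for w in total}

def weighted_vote(votes, accuracy, default=0.5):
    """Worker-quality-adjusted vote: weight each label by the worker's accuracy."""
    score = defaultdict(float)
    for worker, label in votes:
        score[label] += accuracy.get(worker, default)
    return max(score, key=score.get)
```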
  18. 18. Coding and Machine Learning •  Integration with machine learning –  Build automatic classification models using crowdsourced data –  Data from existing crowdsourced answers trains an automatic model (through machine learning); the model then answers new cases automatically Source: Ipeirotis, WWW2011 18
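A hedged sketch of the pattern the slide describes: consolidate crowd answers into one label per item, train a classifier on them, and let the model answer new cases automatically. It assumes text items and scikit-learn; nothing here is from the original Ipeirotis material.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# texts: the items the crowd labeled; labels: one consolidated label per item
# (e.g., from the majority or weighted vote sketched above).
texts = ["great product, would buy again", "this is spam, click here", "arrived broken"]
labels = ["ham", "spam", "ham"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

# New, unseen cases are now answered automatically instead of going to the crowd.
print(model.predict(["limited time offer, click now"]))
```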
  19. 19. Limitations of Mechanical Turk •  No control of users' environment –  Potential for different browsers, physical distractions –  General problem with online experimentation •  Not designed for user studies –  Difficult to do between-subjects designs –  May need some programming •  Users –  Somewhat hard to control demographics, expertise 19
  20. 20. Crowdsourcing for HCI Research •  Does my interface/visualization work? –  WikiDashboard: transparency vis for Wikipedia [Suh et al.] –  Replicating Perceptual Experiments [Heer et al., CHI2010] •  Coding of large amounts of user data –  What is a Question in Twitter? [Sharoda Paul, Lichan Hong, Ed Chi] •  Incentive mechanisms –  Intrinsic vs. extrinsic rewards: games vs. pay –  [Horton & Chilton, 2010 for MTurk] and [Ariely, 2009] in general 20
  21. 21. Crowdsourcing for HCI Research •  Does my interface/visualization work? –  WikiDashboard: transparency vis for Wikipedia [Suh et al.] –  Replicating Perceptual Experiments [Heer et al., CHI2010] •  Coding of large amounts of user data –  What is a Question in Twitter? [Sharoda Paul, Lichan Hong, Ed Chi] •  Incentive mechanisms –  Intrinsic vs. extrinsic rewards: games vs. pay –  [Horton & Chilton, 2010 for MTurk] and satisficing –  [Ariely, 2009] in general: Higher pay != Better work 21
  23. 23. Crowd Programming for Complex Tasks •  Decompose tasks into smaller tasks –  Digital Taylorism –  Frederick Winslow Taylor (1856-1915) –  1911 "Principles of Scientific Management" •  Crowd Programming Explorations –  MapReduce models •  Kittur, A.; Smus, B.; and Kraut, R. CHI 2011 EA on CrowdForge •  Kulkarni, Can, Hartmann, CHI 2011 workshop WIP –  Little, G.; Chilton, L.; Goldman, M.; and Miller, R. C. In KDD 2010 Workshop on Human Computation 23
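The partition/map/reduce flow that the CrowdForge and Turkomatic work explores can be sketched as ordinary code in which each step is itself a crowd task. `post_hit_and_wait` below is a hypothetical helper that posts a HIT with the given prompt and blocks until a worker answers; everything else is illustrative.

```python
def post_hit_and_wait(prompt):
    """Hypothetical helper: post a HIT containing `prompt`, wait for a worker,
    and return their free-text answer (e.g., via the MTurk API plus polling)."""
    raise NotImplementedError

def partition(topic):
    """Partition step: ask one worker for an outline (section headings)."""
    outline = post_hit_and_wait(f"Write an outline (section headings only) for an article about {topic}.")
    return [line.strip() for line in outline.splitlines() if line.strip()]

def map_fact(topic, heading):
    """Map step: collect one fact per heading, possibly from several workers."""
    return post_hit_and_wait(f"Collect one fact about '{heading}' for an article on {topic}.")

def reduce_section(heading, facts):
    """Reduce step: have a worker turn the collected facts into a paragraph."""
    return post_hit_and_wait(f"Turn these facts about '{heading}' into one coherent paragraph:\n" + "\n".join(facts))

def write_article(topic, facts_per_section=3):
    paragraphs = []
    for heading in partition(topic):
        facts = [map_fact(topic, heading) for _ in range(facts_per_section)]
        paragraphs.append(reduce_section(heading, facts))
    return "\n\n".join(paragraphs)
```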
  24. 24. Crowd Programming for Complex Tasks •  Crowd Programming Explorations –  Kittur, A.; Smus, B.; and Kraut, R. CHI 2011 EA on CrowdForge –  Kulkarni, Can, Hartmann, CHI 2011 workshop WIP (Turkomatic) [Slide reproduces a page of the Turkomatic work-in-progress paper (CHI 2011, Vancouver): Turkers decomposed an essay-writing task and a 16-question SAT task ("Please solve the 16-question SAT located at http://bit.ly/SATexam"); workers were paid $0.10 to $0.40 per HIT, "merge" HITs received answers within 4 hours, and solutions to the initial task were complete within 72 hours.] 24
  25. 25. Future Directions in Crowdsourcing •  Real-time Crowdsourcing –  Bigham, et al. VizWiz, UIST 2010 [Slide shows a figure from the VizWiz paper: six questions asked by participants (e.g., "What color is this pillow?", "What temperature is my oven set to?"), the photographs they took, and the crowd answers received, with latencies ranging from seconds to several minutes.] 25
  26. 26. Future Directions in Crowdsourcing •  Real-time Crowdsourcing –  Bigham, et al. VizWiz, UIST 2010 •  Embedding of Crowdwork inside Tools –  Bernstein, et al. Soylent, UIST 2010 26
  27. 27. Future Directions in Crowdsourcing •  Real-time Crowdsourcing –  Bigham, et al. VizWiz, UIST 2010 •  Embedding of Crowdwork inside Tools –  Bernstein, et al. Soylent, UIST 2010 •  Shepherding Crowdwork –  Dow et al. CHI2011 WIP [Slide excerpts the Dow et al. paper on the design space for crowd feedback, e.g. timeliness: feedback can be delivered synchronously while workers are still engaged in a set of tasks, which may improve performance but burdens the feedback provider and requires near real-time scheduling, or asynchronously after the work is complete.] 27
  28. 28. Tutorials •  Thanks to Matt Lease http://ir.ischool.utexas.edu/crowd/ •  AAAI 2011 (with HCOMP 2011): Human Computation: Core Research Questions and State of the Art (E. Law and Luis von Ahn) •  WSDM 2011: Crowdsourcing 101: Putting the WSDM of Crowds to Work for You (Omar Alonso and Matthew Lease) –  http://ir.ischool.utexas.edu/wsdm2011_tutorial.pdf •  LREC 2010 Tutorial: Statistical Models of the Annotation Process (Bob Carpenter and Massimo Poesio) –  http://lingpipe-blog.com/2010/05/17/ •  ECIR 2010: Crowdsourcing for Relevance Evaluation (Omar Alonso) –  http://wwwcsif.cs.ucdavis.edu/~alonsoom/crowdsourcing.html •  CVPR 2010: Mechanical Turk for Computer Vision (Alex Sorokin and Fei-Fei Li) –  http://sites.google.com/site/turkforvision/ •  CIKM 2008: Crowdsourcing for Relevance Evaluation (D. Rose) –  http://videolectures.net/cikm08_rose_cfre/ •  WWW2011: Managing Crowdsourced Human Computation (Panos Ipeirotis) –  http://www.slideshare.net/ipeirotis/managing-crowdsourced-human-computation 28
  29. 29. Thanks! •  chi@acm.org •  http://edchi.net •  @edchi •  Aniket Kittur, Ed H. Chi, Bongwon Suh. Crowdsourcing User Studies With Mechanical Turk. In Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI 2008), pp. 453-456. ACM Press, 2008. Florence, Italy. •  Aniket Kittur, Bongwon Suh, Ed H. Chi. Can You Ever Trust a Wiki? Impacting Perceived Trustworthiness in Wikipedia. In Proc. of Computer-Supported Cooperative Work (CSCW 2008), pp. 477-480. ACM Press, 2008. San Diego, CA. [Best Note Award] 29
