Crowdsourcing for HCI Research with Amazon Mechanical Turk
Crowdsourcing Meetup at Stanford
May 3, 2011

1. Crowdsourcing for Human Computer Interaction Research
   Ed H. Chi, Research Scientist, Google
   (work done while at [Xerox] PARC with Aniket Kittur)
2. User studies
   •  Getting input from users is important in HCI
      –  surveys
      –  rapid prototyping
      –  usability tests
      –  cognitive walkthroughs
      –  performance measures
      –  quantitative ratings
3. User studies
   •  Getting input from users is expensive
      –  Time costs
      –  Monetary costs
   •  Often have to trade off costs with sample size
4. Online solutions
   •  Online user surveys
   •  Remote usability testing
   •  Online experiments
   •  But still have difficulties
      –  Rely on practitioner for recruiting participants
      –  Limited pool of participants
5. Crowdsourcing
   •  Make tasks available for anyone online to complete
   •  Quickly access a large user pool, collect data, and compensate users
   •  Example: NASA Clickworkers
      –  100k+ volunteers identified Mars craters from space photographs
      –  Aggregate results virtually indistinguishable from expert geologists'
      [figure: experts vs. crowds]
      http://clickworkers.arc.nasa.gov
6. Amazon's Mechanical Turk
   •  Market for human intelligence tasks
   •  Typically short, objective tasks
      –  Tag an image
      –  Find a webpage
      –  Evaluate relevance of search results
   •  Users complete tasks for a few pennies each
7. Example task
8. Using Mechanical Turk for user studies

                        Traditional user studies       Mechanical Turk
   Task complexity      Complex, long                  Simple, short
   Task subjectivity    Subjective, opinions           Objective, verifiable
   User information     Targeted demographics,         Unknown demographics,
                        high interactivity             limited interactivity

   Can Mechanical Turk be usefully used for user studies?
9. Task
   •  Assess quality of Wikipedia articles
   •  Started with ratings from expert Wikipedians
      –  14 articles (e.g., "Germany", "Noam Chomsky")
      –  7-point scale
   •  Can we get matching ratings with Mechanical Turk?
10. Experiment 1
    •  Rate articles on 7-point scales:
       –  Well written
       –  Factually accurate
       –  Overall quality
    •  Free-text input:
       –  What improvements does the article need?
    •  Paid $0.05 each
11. Experiment 1: Good news
    •  58 users made 210 ratings (15 per article)
       –  $10.50 total
    •  Fast results
       –  44% within a day, 100% within two days
       –  Many completed within minutes
12. Experiment 1: Bad news
    •  Correlation between turkers and Wikipedians only marginally significant (r=.50, p=.07)
    •  Worse, 59% potentially invalid responses

                             Experiment 1
       Invalid comments      49%
       <1 min responses      31%

    •  Nearly 75% of these done by only 8 users
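As a rough illustration of the kind of screening and comparison behind these numbers, here is a minimal Python sketch (not the authors' code): it flags sub-minute or near-empty-comment responses and correlates the remaining per-article means with expert ratings. The file names and column names (article, rating, seconds, comment) are assumptions for illustration.

```python
# Hypothetical screening + correlation sketch (file and column names are assumed).
import pandas as pd
from scipy.stats import pearsonr

def screen_and_correlate(turker_csv: str, expert_csv: str):
    turk = pd.read_csv(turker_csv)      # one row per rating: article, rating, seconds, comment
    experts = pd.read_csv(expert_csv)   # one row per article: article, rating

    # Flag likely low-effort responses: finished in under a minute, or no meaningful comment.
    comments = turk["comment"].fillna("").str.strip()
    turk["suspect"] = (turk["seconds"] < 60) | (comments.str.len() < 5)

    # Average the remaining ratings per article and correlate with the expert ratings.
    means = turk.loc[~turk["suspect"]].groupby("article")["rating"].mean()
    merged = experts.set_index("article").join(means, rsuffix="_turk").dropna()
    r, p = pearsonr(merged["rating"], merged["rating_turk"])
    return r, p, turk["suspect"].mean()
```

With Experiment 1's design this kind of screening can only be applied after the fact; the design changes on the following slides aim to prevent bad responses up front.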
13. Not a good start
    •  Summary of Experiment 1:
       –  Only marginal correlation with experts
       –  Heavy gaming of the system by a minority
    •  Possible response:
       –  Can make sure these gamers are not rewarded
       –  Ban them from doing your HITs in the future
       –  Create a reputation system [Delores Lab]
    •  Can we change how we collect user input?
14. Design changes
    •  Use verifiable questions to signal monitoring
       –  How many sections does the article have?
       –  How many images does the article have?
       –  How many references does the article have?
15. Design changes
    •  Use verifiable questions to signal monitoring
    •  Make malicious answers as high cost as good-faith answers
       –  Provide 4-6 keywords that would give someone a good summary of the contents of the article
16. Design changes
    •  Use verifiable questions to signal monitoring
    •  Make malicious answers as high cost as good-faith answers
    •  Make verifiable answers useful for completing the task
       –  Used tasks similar to how Wikipedians described evaluating quality (organization, presentation, references)
17. Design changes
    •  Use verifiable questions to signal monitoring
    •  Make malicious answers as high cost as good-faith answers
    •  Make verifiable answers useful for completing the task
    •  Put verifiable tasks before subjective responses
       –  First do objective tasks and summarization
       –  Only then evaluate subjective quality
       –  Ecological validity?
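A minimal sketch of how the verifiable questions could be checked automatically; the field names, the truth-extraction step, and the ±1 tolerance are illustrative assumptions rather than the procedure reported on the slides.

```python
# Hypothetical check: a response passes only if its verifiable answers roughly
# match counts derived from the article itself.
def verifiable_answers_ok(response: dict, truth: dict, tolerance: int = 1) -> bool:
    """Both dicts map 'sections', 'images', 'references' to integer counts."""
    for field in ("sections", "images", "references"):
        try:
            answer = int(response[field])
        except (KeyError, TypeError, ValueError):
            return False                       # missing or non-numeric answer
        if abs(answer - truth[field]) > tolerance:
            return False                       # too far from the actual count
    return True

# Usage sketch: drop failing responses before analyzing the subjective ratings.
# valid = [r for r in responses if verifiable_answers_ok(r, truth_by_article[r["article"]])]
```

The tolerance keeps the check forgiving of honest miscounts while still catching random or copy-paste answers.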
18. Experiment 2: Results
    •  124 users provided 277 ratings (~20 per article)
    •  Significant positive correlation with Wikipedians (r=.66, p=.01)
    •  Smaller proportion of malicious responses
    •  Increased time on task

                             Experiment 1    Experiment 2
       Invalid comments      49%             3%
       <1 min responses      31%             7%
       Median time           1:30            4:06
19. Generalizing to other user studies
    •  Combine objective and subjective questions
       –  Rapid prototyping: ask verifiable questions about content/design of prototype before subjective evaluation
       –  User surveys: ask common-knowledge questions before asking for opinions
20. Limitations of Mechanical Turk
    •  No control of users' environment
       –  Potential for different browsers, physical distractions
       –  General problem with online experimentation
    •  Not designed for user studies
       –  Difficult to do between-subjects designs
       –  Involves some programming
    •  Users
       –  Uncertainty about user demographics, expertise
21. Quick Summary
    •  Mechanical Turk offers the practitioner a way to access a large user pool and quickly collect data at low cost
    •  Good results require careful task design:
       1.  Use verifiable questions to signal monitoring
       2.  Make malicious answers as high cost as good-faith answers
       3.  Make verifiable answers useful for completing the task
       4.  Put verifiable tasks before subjective responses
22. Crowdsourcing for HCI Research
    •  Does my interface/visualization work?
       –  WikiDashboard: transparency visualization for Wikipedia
       –  J. Heer's work at Stanford looking at perceptual effects
    •  Coding of large amounts of user data
       –  "What is a question?" in Twitter (Sharoda Paul at PARC)
    •  Decompose tasks into smaller tasks
       –  Digital Taylorism
       –  Frederick Winslow Taylor (1856-1915), 1911 book Principles of Scientific Management
    •  Incentive mechanisms
       –  Intrinsic vs. extrinsic rewards
       –  Games vs. pay
23. •  @edchi
    •  chi@acm.org
    •  http://edchi.net
24. What would make you trust Wikipedia more?
25. What is Wikipedia?
    "Wikipedia is the best thing ever. Anyone in the world can write anything they want about
    any subject, so you know you're getting the best possible information."
    – Steve Carell, The Office
26. What would make you trust Wikipedia more?
    "Nothing"
27. What would make you trust Wikipedia more?
    "Wikipedia, just by its nature, is impossible to trust completely. I don't think this can
    necessarily be changed."
28. WikiDashboard
    •  Transparency of social dynamics can reduce conflict and coordination issues
    •  Attribution encourages contribution
       –  WikiDashboard: social dashboard for wikis
       –  Prototype system: http://wikidashboard.parc.com
    •  Visualization for every wiki page showing edit history timeline and top individual editors
    •  Can drill down into activity history for specific editors and view edits to see changes side-by-side
    Citation: Suh et al., CHI 2008 Proceedings
29. Hillary Clinton [WikiDashboard screenshot]
30. Top Editor: Wasted Time R [WikiDashboard screenshot]
31. Surfacing information
    •  Numerous studies mining Wikipedia revision history to surface trust-relevant information
       –  Adler & Alfaro, 2007; Dondio et al., 2006; Kittur et al., 2007; Viegas et al., 2004;
          Zeng et al., 2006; Suh, Chi, Kittur, & Pendleton, CHI 2008
    •  But how much impact can this have on user perceptions in a system which is inherently mutable?
32. Hypotheses
    1.  Visualization will impact perceptions of trust
    2.  Compared to baseline, visualization will impact trust both positively and negatively
    3.  Visualization should have most impact when there is high uncertainty about the article
        •  Low quality
        •  High controversy
33. Design
    •  3 x 2 x 2 design
       –  Visualization: high stability, low stability, baseline (none)
       –  Quality x controversy:

                        Controversial                   Uncontroversial
       High quality     Abortion, George Bush           Volcano, Shark
       Low quality      Pro-life feminism,              Disk defragmenter,
                        Scientology and celebrities     Beeswax
34. Example: High trust visualization
35. Example: Low trust visualization
36-38. Summary info
    •  % from anonymous users
    •  Last change by anonymous or established user
    •  Stability of words
39. Graph
    •  Instability
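All of these summary indicators, and the instability graph, can be derived from an article's revision history. Below is a minimal sketch under an assumed data shape (a list of revision dicts, oldest to newest, each with "anonymous" and "text" fields); it is not the WikiDashboard implementation, and the word-stability measure is a deliberately crude stand-in.

```python
# Hypothetical revision format: {"anonymous": bool, "editor": str, "text": str}
def summary_info(revisions: list[dict], window: int = 10) -> dict:
    latest = revisions[-1]
    anon_share = sum(r["anonymous"] for r in revisions) / len(revisions)

    # Crude word-stability proxy: share of words in the latest revision that were
    # already present `window` revisions ago (lower means less stable).
    earlier = revisions[max(0, len(revisions) - 1 - window)]
    old_words = set(earlier["text"].split())
    new_words = latest["text"].split()
    stability = sum(w in old_words for w in new_words) / max(len(new_words), 1)

    return {
        "pct_anonymous": round(100 * anon_share, 1),
        "last_editor_anonymous": latest["anonymous"],
        "word_stability": round(stability, 2),
    }
```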
40. Method
    •  Users recruited via Amazon's Mechanical Turk
       –  253 participants
       –  673 ratings
       –  7 cents per rating
       –  Kittur, Chi, & Suh, CHI 2008: Crowdsourcing user studies
    •  To ensure salience and valid answers, participants answered:
       –  In what time period was this article the least stable?
       –  How stable has this article been for the last month?
       –  Who was the last editor?
       –  How trustworthy do you consider the above editor?
41. Results
    [bar chart: trustworthiness rating (1-7) for low/high quality and uncontroversial/controversial
     articles under high stability, baseline, and low stability visualizations]
    Main effects of quality and controversy:
    •  High-quality articles > low-quality articles (F(1, 425) = 25.37, p < .001)
    •  Uncontroversial articles > controversial articles (F(1, 425) = 4.69, p = .031)
42. Results
    [same chart as slide 41]
    Interaction effect of quality and controversy:
    •  High-quality articles were rated equally trustworthy whether controversial or not, while
    •  Low-quality articles were rated lower when controversial than when uncontroversial.
43-45. Results
    [same chart as slide 41]
    1.  Significant effect of visualization
        –  High > low, p < .001
    2.  Viz has both positive and negative effects
        –  High > baseline, p < .001
        –  Low < baseline, p < .01
    3.  No interaction effect of visualization with either quality or controversy
        –  Robust across conditions
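The F and p values on these slides come from a factorial analysis of the trust ratings across visualization, quality, and controversy. A minimal sketch of that kind of analysis is below, using a hypothetical ratings.csv and column names (trust, visualization, quality, controversy); it is not the authors' analysis script.

```python
# Hypothetical factorial ANOVA: trust ~ visualization x quality x controversy.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.read_csv("ratings.csv")  # assumed columns: trust, visualization, quality, controversy
model = ols("trust ~ C(visualization) * C(quality) * C(controversy)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # F and p for each main effect and interaction
```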