Crowdsourcing for HCI Research with Amazon Mechanical Turk

Crowdsourcing for Human Computer
Interaction Research


Ed H. Chi

Research Scientist
Google

(work done while at [Xerox] PARC with Aniket Kittur)
User studies

•  Getting input from users is important in HCI
   –    surveys
   –    rapid prototyping
   –    usability tests
   –    cognitive walkthroughs
   –    performance measures
   –    quantitative ratings
User studies

•  Getting input from users is expensive
   –  Time costs
   –  Monetary costs
•  Often have to trade off costs with sample size
Online solutions

•    Online user surveys
•    Remote usability testing
•    Online experiments
•    But still have difficulties
     –  Rely on practitioner for recruiting participants
     –  Limited pool of participants
Crowdsourcing

•  Make tasks available for anyone online to complete
•  Quickly access a large user pool, collect data, and
   compensate users
•  Example: NASA Clickworkers
    –  100k+ volunteers identified Mars craters from
       space photographs
    –  Aggregate results virtually indistinguishable from
       expert geologists

                [Image: Mars crater markings by experts vs. crowds]

                http://clickworkers.arc.nasa.gov
Amazon's Mechanical Turk

•  Market for human intelligence tasks
•  Typically short, objective tasks
   –  Tag an image
   –  Find a webpage
   –  Evaluate relevance of search results
•  Users complete for a few pennies each
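As a concrete illustration of how such a task can be posted today, here is a minimal sketch using the boto3 MTurk client; the sandbox endpoint, reward, and question markup are illustrative assumptions, not the setup used in the experiments below.

    # Minimal sketch: posting a short rating HIT via the boto3 MTurk client.
    # Endpoint, reward, and question markup are assumptions for illustration.
    import boto3

    mturk = boto3.client(
        "mturk",
        region_name="us-east-1",
        endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
    )

    question_xml = """
    <HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
      <HTMLContent><![CDATA[
        <form name="mturk_form" method="post"
              action="https://workersandbox.mturk.com/mturk/externalSubmit">
          <input type="hidden" name="assignmentId" value="">
          <p>Rate the overall quality of this article (1-7):</p>
          <input type="number" name="overall_quality" min="1" max="7">
          <input type="submit" value="Submit">
        </form>
      ]]></HTMLContent>
      <FrameHeight>450</FrameHeight>
    </HTMLQuestion>"""

    hit = mturk.create_hit(
        Title="Rate the quality of a Wikipedia article",
        Description="Answer a few short questions about one article.",
        Reward="0.05",                      # a few cents per task
        MaxAssignments=15,                  # ratings wanted per article
        AssignmentDurationInSeconds=600,
        LifetimeInSeconds=86400,
        Question=question_xml,
    )
    print(hit["HIT"]["HITId"])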
Example task
Using Mechanical Turk for user studies

                         Traditional user studies      Mechanical Turk
   Task complexity       Complex, Long                  Simple, Short
   Task subjectivity     Subjective, Opinions           Objective, Verifiable
   User information      Targeted demographics,         Unknown demographics,
                         High interactivity             Limited interactivity


    Can Mechanical Turk be usefully used for user studies?
Task

•  Assess quality of Wikipedia articles
•  Started with ratings from expert Wikipedians
    –  14 articles (e.g., "Germany", "Noam Chomsky")
    –  7-point scale
•  Can we get matching ratings with Mechanical Turk?
Experiment 1

•  Rate articles on 7-point scales:
   –  Well written
   –  Factually accurate
   –  Overall quality
•  Free-text input:
   –  What improvements does the article need?
•  Paid $0.05 each
Experiment 1: Good news

•  58 users made 210 ratings (15 per article)
   –  $10.50 total
•  Fast results
   –  44% within a day, 100% within two days
   –  Many completed within minutes
Experiment 1: Bad news

•  Correlation between Turkers and Wikipedians
   only marginally significant (r=.50, p=.07; correlation sketch below)
•  Worse, 59% potentially invalid responses

                        Experiment 1
   Invalid comments          49%
   <1 min responses          31%

•  Nearly 75% of these done by only 8 users
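For reference, the correlation above is an ordinary Pearson correlation over per-article means; a small sketch with scipy, using placeholder numbers rather than the study's data:

    # Sketch: correlating mean Turker ratings with expert Wikipedian ratings,
    # one pair of values per article. The numbers are placeholders.
    import numpy as np
    from scipy.stats import pearsonr

    expert_ratings = np.array([6.5, 3.0, 5.5, 2.0, 4.5])   # expert rating per article
    turker_means   = np.array([5.8, 3.5, 5.0, 3.2, 4.1])   # mean of ~15 Turker ratings

    r, p = pearsonr(expert_ratings, turker_means)
    print(f"r = {r:.2f}, p = {p:.3f}")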
Not a good start
•  Summary of Experiment 1:
   –  Only marginal correlation with experts.
   –  Heavy gaming of the system by a minority
•  Possible responses:
   –  Can make sure these gamers are not rewarded
   –  Ban them from doing your HITs in the future (sketch below)
   –  Create a reputation system [Dolores Labs]
•  Can we change how we collect user input?
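A minimal sketch of the first two responses using the MTurk requester API; the assignment and worker IDs are hypothetical:

    # Sketch: withholding payment for a gamed response and blocking the worker.
    # The AssignmentId and WorkerId values are hypothetical.
    import boto3

    mturk = boto3.client("mturk", region_name="us-east-1")

    mturk.reject_assignment(
        AssignmentId="3EXAMPLEASSIGNMENTID",
        RequesterFeedback="Free-text answer did not address the article.",
    )
    mturk.create_worker_block(
        WorkerId="A1EXAMPLEWORKER",
        Reason="Repeated low-effort submissions.",
    )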
Design changes

•  Use verifiable questions to signal monitoring
   –  How many sections does the article have?
   –  How many images does the article have?
   –  How many references does the article have?
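Because these questions have ground-truth answers, responses can be screened automatically; a rough sketch (the HTML selectors are simplified assumptions about Wikipedia's markup):

    # Sketch: checking a worker's "how many sections/images/references?" answers
    # against counts extracted from the article page. Selectors are simplified
    # assumptions about Wikipedia's markup.
    import requests
    from bs4 import BeautifulSoup

    def article_counts(url):
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        return {
            "sections": len(soup.select("h2")),
            "images": len(soup.select("img")),
            "references": len(soup.select("ol.references li")),
        }

    def plausible(answer, truth, tolerance=2):
        # Accept answers within a small tolerance of the true count.
        return abs(answer - truth) <= tolerance

    truth = article_counts("https://en.wikipedia.org/wiki/Germany")
    worker_answer = 12    # hypothetical response to "How many images?"
    print(plausible(worker_answer, truth["images"]))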
Design changes

•  Use verifiable questions to signal monitoring
•  Make malicious answers as high cost as
   good-faith answers
   –  Provide 4-6 keywords that would give someone a
     good summary of the contents of the article
Design changes

•  Use verifiable questions to signal monitoring
•  Make malicious answers as high cost as
   good-faith answers
•  Make verifiable answers useful for completing
   task
   –  Used tasks similar to how Wikipedians described
      evaluating quality (organization, presentation,
      references)
Design changes

•  Use verifiable questions to signal monitoring
•  Make malicious answers as high cost as
   good-faith answers
•  Make verifiable answers useful for completing
   task
•  Put verifiable tasks before subjective
   responses
   –  First do objective tasks and summarization
   –  Only then evaluate subjective quality
   –  Ecological validity?
Experiment 2: Results

   •  124 users provided 277 ratings (~20 per article)
   •  Significant positive correlation with Wikipedians
      (r=.66, p=.01)

   •  Smaller proportion of malicious responses
   •  Increased time on task

                         Experiment 1        Experiment 2
   Invalid comments          49%                  3%
   <1 min responses          31%                  7%
   Median time               1:30                 4:06
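A sketch of how the screening measures in this table might be computed from raw responses; the field names and the "fewer than four words" rule for invalid comments are assumptions, not the study's actual criteria:

    # Sketch: computing the screening measures above from raw responses.
    # Field names and the "fewer than 4 words" rule are assumptions.
    from statistics import median

    responses = [
        {"seconds_on_task": 246, "comment": "needs more references and citations"},
        {"seconds_on_task": 41,  "comment": "good"},
        {"seconds_on_task": 380, "comment": "reorganize the history section; add images"},
    ]

    under_a_minute   = [r for r in responses if r["seconds_on_task"] < 60]
    invalid_comments = [r for r in responses if len(r["comment"].split()) < 4]

    print(f"<1 min responses: {len(under_a_minute) / len(responses):.0%}")
    print(f"invalid comments: {len(invalid_comments) / len(responses):.0%}")
    print(f"median time: {median(r['seconds_on_task'] for r in responses)} s")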
Generalizing to other user studies

•  Combine objective and subjective questions
   –  Rapid prototyping: ask verifiable questions about
      content/design of prototype before subjective
      evaluation
   –  User surveys: ask common-knowledge questions
      before asking for opinions
Limitations of mechanical turk

•  No control of users' environment
   –  Potential for different browsers, physical
      distractions
   –  General problem with online experimentation
•  Not designed for user studies
   –  Difficult to do between-subjects designs
   –  Involves some programming (assignment sketch below)
•  Users
   –  Uncertainty about user demographics, expertise
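One common workaround for the between-subjects problem is to assign conditions deterministically from the worker ID, so a returning worker always lands in the same condition; a sketch (this is an assumed convention, not a built-in MTurk feature):

    # Sketch: deterministic between-subjects assignment by hashing the WorkerId.
    # This is an assumed convention for illustration, not an MTurk feature.
    import hashlib

    CONDITIONS = ["high_stability", "low_stability", "baseline"]

    def assign_condition(worker_id):
        digest = hashlib.sha256(worker_id.encode()).hexdigest()
        return CONDITIONS[int(digest, 16) % len(CONDITIONS)]

    print(assign_condition("A1EXAMPLEWORKER"))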
Quick Summary

•  Mechanical Turk offers the practitioner a way to
   access a large user pool and quickly collect data at
   low cost
•  Good results require careful task design
  1.  Use verifiable questions to signal monitoring
  2.  Make malicious answers as high cost as good-faith
      answers
  3.  Make verifiable answers useful for completing task
  4.  Put verifiable tasks before subjective responses
Crowdsourcing for HCI Research


•  Does my interface/visualization work?
   –  WikiDashboard: transparency visualization for Wikipedia
   –  J. Heer's work at Stanford looking at perceptual effects
•  Coding of large amounts of user data
   –  What is a question in Twitter? (Sharoda Paul at PARC)
•  Decompose tasks into smaller tasks
   –  Digital Taylorism
   –  Frederick Winslow Taylor's (1856-1915) 1911 book
      'The Principles of Scientific Management'
•  Incentive mechanisms
   –  Intrinsic vs. Extrinsic rewards
   –  Games vs. Pay
•  @edchi
•  chi@acm.org
•  http://edchi.net
What would make you trust Wikipedia more?




What is Wikipedia?




    Wikipedia is the best thing ever. Anyone in the world can write
anything they want about any subject, so you know you're getting the
                      best possible information.
                      – Steve Carell, The Office
What would make you trust Wikipedia more?




              Nothing



What would make you trust Wikipedia more?




       Wikipedia, just by its nature, is
      impossible to trust completely. I don't
      think this can necessarily be
      changed.




WikiDashboard
•  Transparency of social dynamics can reduce conflict and coordination
   issues
•  Attribution encourages contribution
   –  WikiDashboard: Social dashboard for wikis
   –  Prototype system: http://wikidashboard.parc.com


•  Visualization for every wiki page
   showing edit history timeline and
   top individual editors

•  Can drill down into activity history
   for specific editors and view edits
   to see changes side-by-side

Citation: Suh et al.
CHI 2008 Proceedings


Hillary Clinton
[WikiDashboard screenshot]
Top Editor - Wasted Time R
[WikiDashboard screenshot]
Surfacing information

•  Numerous studies mining Wikipedia revision
   history to surface trust-relevant information
   –  Adler & Alfaro, 2007; Dondio et al., 2006; Kittur et al., 2007;
      Viegas et al., 2004; Zeng et al., 2006




                                          Suh, Chi, Kittur, & Pendleton, CHI 2008


•  But how much impact can this have on user
   perceptions in a system which is inherently
   mutable?
Hypotheses

1.  Visualization will impact perceptions of trust
2.  Compared to baseline, visualization will
    impact trust both positively and negatively
3.  Visualization should have the most impact when
    there is high uncertainty about the article
   •    Low quality
   •    High controversy
Design

        •  3 x 2 x 2 design
           –  Visualization: high stability / low stability / baseline (none)

                           Controversial                    Uncontroversial

        High quality       Abortion, George Bush            Volcano, Shark

        Low quality        Pro-life feminism,               Disk defragmenter, Beeswax
                           Scientology and celebrities
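A small sketch of enumerating the factorial cells (article assignments per cell follow the table above):

    # Sketch: enumerating the 3 x 2 x 2 factorial design used in the study.
    from itertools import product

    visualizations = ["high stability", "low stability", "baseline"]
    controversy    = ["controversial", "uncontroversial"]
    quality        = ["high quality", "low quality"]

    cells = list(product(visualizations, controversy, quality))
    print(len(cells))   # 12 cells; two articles per quality-controversy pairing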
Example: High trust visualization




Example: Low trust visualization




Summary info

          •  % from anonymous
             users
          •  Last change by
             anonymous or
             established user
          •  Stability of words




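A sketch of how these summary measures could be derived from an article's revision history; the revision record fields and the stability formula are assumptions for illustration, whereas the actual system mined Wikipedia's full revision history:

    # Sketch: deriving the three summary measures from a list of revisions.
    # The record fields and the stability formula are illustrative assumptions.
    revisions = [  # oldest to newest
        {"editor": "203.0.113.7",  "anonymous": True,  "words_changed": 120},
        {"editor": "JaneEditor",   "anonymous": False, "words_changed": 15},
        {"editor": "198.51.100.2", "anonymous": True,  "words_changed": 60},
    ]
    article_length_words = 2400   # current article length (assumed)

    pct_anonymous = sum(r["anonymous"] for r in revisions) / len(revisions)
    last = revisions[-1]
    stability = 1 - sum(r["words_changed"] for r in revisions) / article_length_words

    print(f"{pct_anonymous:.0%} of edits from anonymous users")
    print(f"last change by an {'anonymous' if last['anonymous'] else 'established'} user")
    print(f"word stability: {stability:.0%}")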
Graph

•  Instability




Method

•  Users recruited via Amazon's Mechanical Turk
   –    253 participants
   –    673 ratings
   –    7 cents per rating
   –    Kittur, Chi, & Suh, CHI 2008: Crowdsourcing user studies
•  To ensure salience and valid answers, participants
   answered:
   –    In what time period was this article the least stable?
   –    How stable has this article been for the last month?
   –    Who was the last editor?
   –    How trustworthy do you consider the above editor?




Results

   [Chart: mean trustworthiness rating (1-7) by article quality and controversy,
   for the high-stability, baseline, and low-stability visualizations]

main effects of quality and controversy:
•  high-quality articles > low-quality articles (F(1, 425) = 25.37, p < .001)
•  uncontroversial articles > controversial articles (F(1, 425) = 4.69, p = .031)
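The reported F-tests correspond to a standard factorial ANOVA over the individual trustworthiness ratings; a sketch with statsmodels, run on a placeholder data frame rather than the study's data:

    # Sketch: quality x controversy ANOVA on trustworthiness ratings, matching
    # the form of the F-tests above. df holds placeholder data, one row per rating.
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    df = pd.DataFrame({
        "rating":      [6, 6, 5, 6, 3, 2, 4, 3],
        "quality":     ["high", "high", "high", "high", "low", "low", "low", "low"],
        "controversy": ["no", "no", "yes", "yes", "no", "no", "yes", "yes"],
    })

    model = ols("rating ~ C(quality) * C(controversy)", data=df).fit()
    print(sm.stats.anova_lm(model, typ=2))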
Results

   [Chart: same trustworthiness ratings as the previous slide]

interaction effects of quality and controversy:
•  high-quality articles were rated equally trustworthy whether controversial
   or not, while
•  low-quality articles were rated lower when they were controversial than
   when they were uncontroversial.
Results

1.  Significant effect of visualization
    –  High > low, p < .001
2.  Viz has both positive and negative effects
    –  High > baseline, p < .001
    –  Low < baseline, p < .01
3.  No interaction effect of visualization with either quality or controversy
    –  Robust across conditions

   [Chart: same trustworthiness ratings as the previous slides, broken out by
   visualization condition]