Crowdsourcing for Online Data Collection



Slides on how to use crowdsourcing and Amazon's Mechanical Turk for collecting online data, particularly for psychologists. Presented at the Online Data Collection Workshop at ICSTE in Lisbon, Portugal, on Jan 9, 2012.


    1. 1. Conducting Behavioral Research with Crowdsourcing (especially Amazon’s Mechanical Turk) Winter Mason, Stevens Institute of Technology; Siddharth Suri, Yahoo! Research
    2. 2. Outline Peer Production vs. Human Computation vs. Crowdsourcing Peer Production & Citizen Science Crowdsourcing Mechanical Turk Basics Internal HITs  Preference Elicitation  Surveys External HITs  Random Assignment  Synchronous Experiments Conclusion
    3. 3. Definitions Peer Production  Creation through distributed contributions Human Computation  Computation with “humans in the loop” (Law & von Ahn, ‘11) Crowdsourcing  Jobs outsourced to a group through an open call (Howe ‘06)
    4. 4. Examples of Modern Peer Production  Open source software: Linux, Apache, Firefox  Mash-ups  Prediction markets: Iowa Electronic Markets, Hollywood Stock Exchange  Collaborative knowledge: Wikipedia, Intellipedia, Yahoo! Answers, Amazon, Yelp, Epinions  Social tagging communities: Flickr  Crowdsourcing: ESP Game, Fold-it!, galaxyZoo, threadless, Tagasauris, Innocentive, TopCoder, oDesk, Mechanical Turk
    5. 5. ESP Game Two player online game Players do not know who they are playing with Players cannot communicate Object of the game:  Type the same word given an image
    6. 6. Games With a Purpose The outcome of the ESP game is labeled images. Google Images bought the ESP game, and has used it to improve image search. The contributions of the crowd are completely free for Google.
    7. 7. Fold.It! is an online game in which players fold proteins into different configurations Certain configurations earn more points than others The configurations correspond to physical structures:  some amino acids must be near the center, and others outside  some pairs of amino acids must be close together and others far apart Players of the game recently unlocked the structure of an AIDS-related enzyme that the scientific community had been unable to unlock for a decade
    8. 8. galaxyZoo  “Citizen Science”  The number of images of galaxies taken by Hubble is immense.  Computers can identify whether something is a galaxy, but not what type of galaxy it is (reliably).  By employing the crowd, galaxyZoo has classified over 50M galaxies.  Astronomers used to assume that if a galaxy appears red in color, it is also probably an elliptical galaxy. Galaxy Zoo has shown that up to a third of red galaxies are actually spirals.
    9. 9. Tagasauris  Magnum Photos has a very large collection of mis- or unclassified photos  To get a handle on it, they asked crowd-workers to tag their photos  Through this process, in combination with a knowledge base, they discovered lost photos from the movie, “American Graffiti”  The actors were tagged individually in the photos (like the one on the right), and the system linked them together and discovered they were all related to the film.
    10. 10. Innocentive  A “Seeker” creates a “challenge”, typically requiring serious skill and technical ability  Multiple “Solvers” submit detailed solutions to the challenge. If the solution is selected, they win the (typically sizable) reward.  For instance, by creating a durable & inexpensive solar flashlight that could double as a lamp, a retired engineer won $20,000 and brought lighting to many rural Africans.
    11. 11. topCoder  Programming jobs are offered as contests  Coders submit their work, and the winner earns the reward  Aside from the direct payoff, there are anecdotal reports of people being hired for permanent positions as a result of their contributions on TopCoder
    12. 12. oDesk  Skilled crowdsourcing:  for any job that requires some skills, but can be done entirely on a computer.  Jobs are paid either as a flat, one-time reward, or on an hourly basis for longer contracts.  Workers have extensive profiles & reputations, and wages are negotiated between Employer and Worker.  Jobs cover a very large spectrum, and pay varies with skill
    13. 13. Amazon’s Mechanical Turk  The original crowdsourcing platform  “The human inside the machine”; built to programmatically incorporate human input  Jobs are meant to be doable by any human, and every worker is meant to be completely interchangeable.
    14. 14. Generally-Shared Features of ExistingSystems Contributions highly modular  Minimal contribution is small  Single edit, single line of code, single tag  Low interdependence between separate contributions  Same document or function Distribution of contributions highly skewed  Small number of heavy contributors  Wikipedia, AMT, Digg  Large number of “free riders”  Very common feature of public goods
    15. 15. What is Mechanical Turk? Crowdsourcing  Jobs outsourced to a group through an open call (Howe ‘06) Online Labor Market  Place for requesters to post jobs and workers to do them for pay Participant recruitment and reimbursement  How can we use MTurk for behavioral research?  What kinds of behavioral research can we use MTurk for?
    16. 16. Why Mechanical Turk? Subject pool size  Central place for > 100,000 workers (Pontin '07)  Always-available subject pool Subject pool diversity  Open to anyone globally with a computer and internet connection Low cost  Reservation wage: $1.38/hour (Chilton et al '10)  Effective wage: $4.80/hour (Ipeirotis '10) Faster theory/experiment cycle  Hypothesis formulation  Testing & evaluation of hypothesis  New hypothesis tests
    17. 17. Validity of Worker Behavior (Quality-controlled) worker output can be as good as experts, sometimes better  Labeling text with emotion (Snow, et al, 2008)  Audio transcriptions (Marge, et al, 2010)  Similarity judgments for music (Urbano, et al, 2010)  Search relevance judgments (Alonso & Mizzaro, 2009) Experiments with workers replicate studies conducted in laboratory or other online settings  Standard psychometric tests (Buhrmester, et al, 2011)  Response in judgment and decision-making tests (Paolacci, et al, 2010)  Responses in public good games (Suri & Watts, 2011)
    18. 18. Worker Demographics Self-reported demographic information from 2,896 workers over 3 years (MW '09, MW '11, SW '10) 55% Female, 45% Male  Similar to other internet panels (e.g. Goldstein) Age:  Mean: 30 yrs  Median: 32 yrs Mean income: $30,000/yr Similar to Ipeirotis '10, Ross et al '10
    19. 19. Internal Consistency of Demographics 207 out of 2,896 workers did 2 of our studies  Only 1 inconsistency on gender, age, income (0.4%) 31 workers did ≥ 3 of our studies  3 changed gender  1 changed age (by 6 years)  7 changed income bracket Strong internal consistency
    20. 20. Why Do Work on Mechanical Turk? “MTurk money is always necessary to make ends meet.”  5% U.S., 13% India “MTurk money is irrelevant.”  12% U.S., 10% India “MTurk is a fruitful way to spend free time and get some cash.”  69% U.S., 59% India (Ross et al '10, Ipeirotis '10)
    21. 21. Requesters Companies crowdsourcing part of their business  Search companies: relevance  Online stores: similar products from different stores (identifying competition)  Online directories: accuracy, freshness of listings Researchers Intermediaries  CrowdFlower (formerly Dolores Labs)
    22. 22. Common Tasks Image labeling Audio transcription Object / Website / Image classification Product evaluation
    23. 23. Uncommon tasks Workflow optimization Copy editing Product description Technical writing
    24. 24. Soylent Word processing with an embedded crowd (Bernstein et al, UIST 2010) Crowd proofreads each paragraph “Find-Fix-Verify” prevents “lazy worker” from ruining output
    25. 25. Find–Fix–Verify Find  Identify one area that can be shortened without changing the meaning of the paragraph Fix  Edit the highlighted section to shorten its length without changing the meaning of the paragraph Verify  Choose one rewrite that fixes style errors and one that changes the meaning
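The three stages above can be sketched as independent voting steps. This is a minimal illustration, not Soylent's actual API; the function names and thresholds are our own:

```javascript
// Find-Fix-Verify sketch: "find" votes locate a problem span, "fix"
// workers propose rewrites, "verify" workers vote out bad rewrites.

// Find: keep only spans flagged by at least `quorum` independent workers,
// which filters out idiosyncratic (or lazy) single-worker flags.
function findSpans(votes, quorum) {
  const counts = {};
  for (const span of votes) counts[span] = (counts[span] || 0) + 1;
  return Object.keys(counts).filter((s) => counts[s] >= quorum);
}

// Verify: accept a proposed fix only if a majority of verifiers approve it.
function verifyFix(approvals) {
  const yes = approvals.filter(Boolean).length;
  return yes > approvals.length / 2;
}
```

For example, `findSpans(["s1", "s2", "s1", "s1"], 2)` keeps only `"s1"`, and `verifyFix([true, true, false])` accepts the fix while `verifyFix([true, false, false])` rejects it.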
    26. 26. Iterative processes By building on each other's work, the crowd can achieve remarkable outcomes Some tasks benefit from iterative processes, others from parallel (Little, et al, 2010)
    27. 27. TurkoMatic Crowd creates workflows: 1. Ask workers to decompose the task into steps 2. Ask if a step can be completed in 10 minutes  If so, solve it  If not, decompose the sub-task 3. Combine outputs of sub-tasks into the final output (Kulkarni et al, CHI 2011)
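The decompose-solve-combine loop above is naturally recursive. A sketch under our own naming (in the real system each of these stubs would be a HIT posted to workers, not a local function):

```javascript
// Recursive sketch of the TurkoMatic workflow. `canSolveIn10Min`,
// `decompose`, `solve`, and `combine` stand in for HITs posted to
// real workers; here they are illustrative stubs.
function turkomatic(task, canSolveIn10Min, decompose, solve, combine) {
  if (canSolveIn10Min(task)) return solve(task);       // base case: one HIT
  const subtasks = decompose(task);                    // workers split the task
  const outputs = subtasks.map((t) =>
    turkomatic(t, canSolveIn10Min, decompose, solve, combine));
  return combine(outputs);                             // workers merge results
}
```

As a toy usage, summing an array by repeatedly halving it: `turkomatic([1, 2, 3, 4], t => t.length <= 2, t => [t.slice(0, t.length / 2), t.slice(t.length / 2)], t => t.reduce((a, b) => a + b, 0), outs => outs.reduce((a, b) => a + b, 0))` returns 10.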
    28. 28. Turker Community Asymmetry in reputation mechanism Reputation of Workers is given by approval rating  Requesters can reject work  Requesters can refuse workers with low approval rates Reputation of Requesters is not built into MTurk  Turkopticon: workers rate requesters on communicativity, generosity, fairness, and promptness  Turker Nation: online forum for workers  Requesters should introduce themselves here Reputation matters, so abusive studies will fail quickly
    29. 29. Anatomy of a HIT HITs with the same title, description, pay rate, etc. are the same HIT type HITs are broken up into Assignments A worker cannot do more than 1 assignment of a HIT
    30. 30. Anatomy of a HIT HITs with the same title, description, pay rate, etc. are the same HIT type HITs are broken up into Assignments A worker cannot do more than 1 assignment of a HIT Requesters can set qualifications that determine who can work on the HIT  e.g., only US workers, or workers with approval rating > 90%
    32. 32. [Diagram: a HIT group with two translation HITs, each with several assignments. HIT 1: “Which is the better translation for Táy?” (Black / Night); HIT 2: “Which is the better translation for Nedj?” (Clean / White). Workers Alice, Bob, and Charlie each complete one assignment.]
    33. 33. [Diagram continued: the same HIT group, with workers Alice, Bob, and David submitting answers (“White”) for the remaining assignments.]
    34. 34. Requester workflow: Build HIT → Test HIT → Post HIT → Reject or Approve HIT. Worker workflow: Search for HITs → Accept HIT → Do work → Submit HIT.
    35. 35. Lifecycle of a HIT Requester builds a HIT  Internal HITs are hosted by Amazon  External HITs are hosted by the requester  HITs can be tested in the {requester, worker} sandbox Requester posts the HIT  Can post as many HITs as the account balance can cover Workers do the HIT and submit work Requester approves/rejects work  Payment is rendered  Amazon charges requesters 10% A HIT completes when it expires or all assignments are completed
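Since the account must cover all posted HITs plus Amazon's fee, the required balance is easy to estimate. The function name is ours, and the 10% figure is the fee quoted on this slide (current fee schedules may differ):

```javascript
// Estimated cost of a HIT group: HITs × assignments × reward,
// plus the 10% requester fee quoted on the slide.
function estimatedCost(numHits, assignmentsPerHit, rewardPerAssignment) {
  const payout = numHits * assignmentsPerHit * rewardPerAssignment;
  return payout * 1.10;
}
// e.g. 100 HITs × 10 assignments × $0.01 → $10.00 payout, ≈ $11.00 total
```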
    36. 36. How Much to Pay? Pay rate can affect quantity of work Pay rate does not have a big impact on quality (MW '09) [Charts: number of tasks completed vs. pay per task; accuracy vs. pay per task]
    37. 37. Completion Time Three 6-question multiple-choice surveys Launched at the same time of day, same day of week $0.01, $0.03, $0.05 Past a threshold, pay rate does not increase speed Start with a low pay rate and work up
    38. 38. Internal HITs
    39. 39. Internal HITs on AMT Template tool Variables Preference Elicitation Honesty study
    40. 40. AMT Templates• Hosted by Amazon• Set parameters for HIT • Title • Description • Keywords • Reward • Assignments per HIT • Qualifications • Time per assignment • HIT expiration • Auto-approve time• Design an HTML form
    41. 41. Variables in Templates Example: Preference Elicitation. Template question: “Which would you prefer to watch? <img ${movie1}> <img ${movie2}>” Input data: HIT 1: img1.jpg vs img2.jpg; HIT 2: img1.jpg vs img3.jpg; HIT 3: img1.jpg vs img4.jpg; HIT 4: img2.jpg vs img3.jpg; HIT 5: img2.jpg vs img4.jpg; HIT 6: img3.jpg vs img4.jpg
    42. 42. Variables in Templates Example: Preference Elicitation. [Screenshots of HIT 1 and HIT 6 as rendered: “Which would you prefer to watch?” with the two movie images substituted in.]
    43. 43. How to build an Internal HIT
    44. 44. Cross Cultural Studies: 2 Methods Self-reported:  Ask workers demographic questions, do experiment Qualifications:  Restrict HITs to worker's country of origin using MTurk qualifications Honesty experiment:  Ask workers to roll a die (or go to a website that simulates one), pay $0.25 times the self-reported roll.
    45. 45. One die, $0.25 + $0.25 / pip Average reported roll significantly higher than expected  M = 3.91, p < 0.0005 Players under-reported ones and twos and over-reported fives Replicates F & H
    46. 46. Dishonesty by Gender Men are more likely to over-report sixes Women are more likely to over-report fives
    47. 47. Dishonesty by Country Indians are more likely to over-report sixes Americans are more likely to over-report fives Might be conflated with gender
    48. 48. Dishonesty by Gender & Country
    49. 49. External HITs
    50. 50. External HITs on AMT Flexible survey Random Assignment Synchronous Experiments Security
    51. 51. Random Assignment One HIT, multiple Assignments  Only post once, or delete repeat submissions Preview page neutral for all conditions Once HIT accepted:  If new, record WorkerID, Assignment ID assign to condition  If old, get condition, “push” worker to last seen state of study Wage conditions = pay through bonus Intent to treat:  Keep track of attrition by condition  Example: Noisy sites decrease reading comprehension  BUT find no difference between conditions  Why? Most people in noisy condition dropped out, only people left were deaf!
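The assign-new/restore-old logic above can be sketched as a sticky lookup. This is a sketch only; in a real external HIT the map would live in a server-side database keyed by WorkerId, not in memory:

```javascript
// Sticky random assignment: a returning worker always gets the
// condition they were first assigned to, so reloads and repeat
// visits cannot move them between conditions.
const assignments = {};                        // WorkerId -> condition index
function getCondition(workerId, numConditions) {
  if (!(workerId in assignments)) {
    assignments[workerId] = Math.floor(Math.random() * numConditions);
  }
  return assignments[workerId];
}
```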
    52. 52. Javascript on an Internal HIT
<html>
<div id="page"></div>
<script type="text/javascript">
var condition = Math.floor(Math.random() * 2);
var pagetext;
switch (condition) {
  case 0: pagetext = "Condition 1"; break;
  case 1: pagetext = "Condition 2"; break;
}
// write the condition text into the page (the script runs after the div exists)
document.getElementById("page").innerHTML = pagetext;
</script>
</html>
    53. 53. Privacy survey External HIT  Random order of answers  Random order of questions  Pop-out questions based on answers Changed wording on question from Annenberg study: Do you want the websites you visit to show you ads that are {tailored, relevant} to your interests?
    54. 54. Results  Replicated original study  Found effect of differences in wording [Chart: Annenberg vs. MTurk responses to the “relevant” wording: Yes / No / Maybe]
    55. 55. Results  Replicated original study  Found effect of differences in wording BUT  Not a representative sample  Results not replicated in subsequent phone survey [Chart: Annenberg vs. MTurk responses to the “relevant” wording: Yes / No / Maybe]
    56. 56. Financial Incentives & the Performance of Crowds Manipulated:  Task value: amount earned per image set submitted ($0.01, $0.05, $0.10); no additional pay for image sets correctly sorted  Difficulty: number of images per set (2, 3, 4) Measured:  Quantity: number of image sets submitted  Quality: proportion of image sets correctly sorted; rank correlation of image sets with the correct order
    57. 57. Results Pay rate can affect quantity of work Pay rate does not have a big impact on quality (MW '09) [Charts: number of tasks completed vs. pay per task; accuracy vs. pay per task]
    58. 58. Quality Assurance Majority vote – Snow, O'Connor, Jurafsky, & Ng (2008) Machine learning with responses – Sheng, Provost, & Ipeirotis (2008) Iterative vs. parallel tasks – Little, Chilton, Goldman, & Miller (2010) Mutual information – Ipeirotis, Provost, & Wang (2010) Verifiable answers – Kittur, Chi, & Suh (2008) Time to completion Honeypot tasks Monitor discussion on forums. MW '11: Players followed guidelines about what not to talk about.
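The simplest of these checks, majority vote over redundant labels (cf. Snow et al. 2008), can be sketched in a few lines; the function name is ours and ties are broken arbitrarily:

```javascript
// Majority vote over redundant labels for one item: the label given
// by the most workers wins; ties are broken arbitrarily here.
function majorityLabel(labels) {
  const counts = {};
  for (const l of labels) counts[l] = (counts[l] || 0) + 1;
  return Object.keys(counts).reduce((a, b) => (counts[b] > counts[a] ? b : a));
}
```

For example, `majorityLabel(["cat", "dog", "cat"])` returns `"cat"`.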
    59. 59. How to build an External HIT
    60. 60. Synchronous Experiments Example research questions  Market behavior under a new mechanism  Network dynamics (e.g., contagion)  Multi-player games Typical tasks on MTurk don't depend on each other  can be split up, done in parallel How does one get many workers to do an experiment at the same time?  Panel  Waiting Room
    61. 61. Social Dilemmas in Networks A social dilemma occurs when the interest of the individual is at odds with the interest of the collective. In social networking sites one's contributions are only seen by friends.  E.g. photos in Flickr, status updates in Facebook  More contributions, more engaged group, better for everyone  Why contribute when one can free ride?
    62. 62. [Network topologies used: Cycle, Cliques, Paired Cliques, Small World, Random Regular]
    63. 63. Effect of Seed Nodes  10-seeds: 13 trials; 0-seeds: 17 trials  Only human contributions are included in averages  People are conditional cooperators (Fischbacher et al. '01)
    64. 64. Building the Panel Do experiments requiring 4-8 fresh players  Waiting time is not too high  Fewer consequences if there is a bug Ask if they would like to be notified of future studies  85% opt-in rate for SW '10  78% opt-in rate for MW '11
    65. 65. NotifyWorkers MTurk API call that sends an e-mail to workers Notify them a day early Experiments work well 11am-5pm EST If n subjects are needed, notify 3n  We have run experiments with 45 players simultaneously
    66. 66. Waiting Room Workers need to start a synchronous experiment at the same time, but show up at slightly different times Have workers wait at a page until enough arrive  Show how many workers they are waiting for  After enough arrive, tell the rest the experiment is full  Funnel extra players into another instance of the experiment
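The waiting-room admission logic above can be sketched as a small state machine (illustrative only; a real implementation would live on the experiment server and handle the overflow by spawning another instance):

```javascript
// Waiting-room sketch: admit arrivals until the experiment is full,
// start when capacity is reached, and turn away later arrivals.
function makeWaitingRoom(capacity) {
  const waiting = [];
  return {
    arrive(workerId) {
      if (waiting.length < capacity) {
        waiting.push(workerId);
        return waiting.length === capacity ? "start" : "wait";
      }
      return "full";   // funnel these workers into another instance
    },
  };
}
```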
    67. 67. Attrition In lab experiments subjects rarely walk out On the web:  Browsers/computers crash  Internet connections go down  Bosses walk in Need a timeout and a default action  Discard experiments with < 90% human actions  SW '10 discarded 21 of 94 experiments with 20-24 people  Discard experiments where one player acted < 50% of the time  MW '11 discarded 43 of 232 experiments with 16 people
    68. 68. Security of External HITs Code security  Code is exposed to entire internet, susceptible to attacks  SQL injection attacks: malicious user inputs database code to damage or get access to database  Scrub input for dB commands  Cross-site scripting attacks (XSS): malicious user injects code into HTTP request or HTML form  Scrub input and _GET and _POST variables
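As one concrete form of the "scrub input" advice, worker-supplied text should be escaped before being echoed into a page. A minimal escaper is sketched below; production code should prefer a vetted library for XSS and parameterized queries for SQL rather than hand-rolled scrubbing:

```javascript
// Minimal HTML escaper for untrusted worker input echoed into a page.
// Replaces the five characters that can change HTML structure.
function escapeHtml(s) {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}
```

For example, `escapeHtml('<b>"hi"</b>')` yields `&lt;b&gt;&quot;hi&quot;&lt;/b&gt;`, which a browser renders as literal text instead of markup.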
    69. 69. Checking results
    70. 70. Security of External HITs (continued) Protocol security  HITs vs Assignments  If you want fresh players in different runs (HITs) of a synchronous experiment, you need to check WorkerIds  We once built a synchronous experiment with many HITs, one assignment each  One worker accepted most of the HITs, did the quiz, and got paid
    71. 71. Use Cases Internal HITs:  Pilot surveys  Preference elicitation  Training data for machine learning algorithms  “Polling” for wisdom of crowds / general knowledge External HITs:  Testing market mechanisms  Behavioral game theory experiments  User-generated content  Effects of incentives ANY online study can be done on Turk Can be used as a recruitment tool
    72. 72. Thank you! Conducting Behavioral Research on Amazon's Mechanical Turk (2011), Behavior Research Methods
    73. 73. Main API Functions CreateHIT (Requirements, Pay rate, Description) – returns HIT Id and HIT Type Id SubmitAssignment (AssignmentId) – notifies Amazon that this assignment has been completed ApproveAssignment (AssignmentId) – requester accepts the assignment and money is transferred; see also RejectAssignment GrantBonus (WorkerId, Amount, Message) – gives the worker the specified bonus and sends the message; should have a failsafe NotifyWorkers (list of WorkerIds, Message) – e-mails the message to the workers
    74. 74. Command-line Tools Configuration files  mturk.properties – for interacting with the MTurk API  [task name].input – variable names & values by row  [task name].properties – HIT parameters  [task name].question – XML file Shell scripts  – post HIT to Mechanical Turk (creates .success file)  – download results (using .success file)  – approve or reject assignments  – approve & delete all unreviewed HITs Output files  [task name].success – created HIT IDs & Assignment IDs  [task name].results – tab-delimited output from workers
    75. 75. mturk.properties
access_key=ABCDEF0123455676789
secret_key=Fa234asOIU/as92345kasSDfq3rDSF
#service_url= TurkRequester
service_url= ster
# You should not need to adjust these values.
retriable_errors=Server.ServiceUnavailable,503
retry_attempts=6
retry_delay_millis=500
    76. 76. [task name].properties
title: Categorize Web Sites
description: Look at URLs, rate, and classify them. These websites have not been screened for adult content!
keywords: URL, categorize, web sites
reward: 0.01
assignments: 10
annotation:
# this Assignment Duration value is 30 * 60 = 0.5 hours
assignmentduration:1800
# this HIT Lifetime value is 60*60*24*3 = 3 days
hitlifetime:259200
# this Auto Approval period is 60*60*24*15 = 15 days
autoapprovaldelay:1296000
    77. 77. [task name].question
<?xml version="1.0"?>
<ExternalQuestion xmlns=" 14/ExternalQuestion.xsd">
  <ExternalURL></ExternalURL>
  <FrameHeight>600</FrameHeight>
</ExternalQuestion>
    78. 78. [task name].results – tab-delimited output with one row per completed assignment: HIT Id, Assignment Id, Worker Id, accept and submit timestamps, time taken, and the worker's answers. [Sample rows: four assignments for the same HIT, each from a different worker, accepted and submitted on Sat Oct 02 2010.]