Crowdsourcing for Search Evaluation and Social-Algorithmic Search


Tutorial with Omar Alonso given at ACM SIGIR 2012 in Portland, OR (August 12, 2012)

Published in: Technology

  1. 1. Crowdsourcing for Search Evaluation and Social-Algorithmic Search. Matthew Lease (University of Texas at Austin) and Omar Alonso (Microsoft). August 12, 2012
  2. 2. Topics • Crowd-powered data collection & applications – Evaluation: relevance judging, interactive studies, log data – Training: e.g., active learning (e.g., learning to rank) – Search: answering, verification, collaborations, physical • Crowdsourcing & human computation • Crowdsourcing platforms • Incentive Engineering & Demographics • Designing for Crowds & Quality assurance • Future Challenges • Broader Issues and the Dark Side
  3. 3. What is Crowdsourcing? • Let’s start with an example and work back toward a more general definition • Example: Amazon’s Mechanical Turk (MTurk) • Goal – See a concrete example of real crowdsourcing – Ground later discussion of abstract concepts – Provide a specific example with which we will contrast other forms of crowdsourcing
  4. 4. Human Intelligence Tasks (HITs)August 12, 2012 4
  6. 6. Jane saw the man with the binocularsAugust 12, 2012 6
  7. 7. Traditional Data Collection• Setup data collection software / harness• Recruit participants / annotators / assessors• Pay a flat fee for experiment or hourly wage• Characteristics – Slow – Expensive – Difficult and/or Tedious – Sample Bias…August 12, 2012 7
  8. 8. “Hello World” Demo • Let’s create and run a simple MTurk HIT • This is a teaser highlighting concepts – Don’t worry about details; we’ll revisit them • Goal – See a concrete example of real crowdsourcing – Ground our later discussion of abstract concepts – Provide a specific example with which we will contrast other forms of crowdsourcing
  9. 9. DEMOAugust 12, 2012 9
  10. 10. Flip a coin
      • Please flip a coin and report the results
      • Two questions: 1. Coin type? 2. Heads or tails?
      • Results

        Coin type      Count      Side           Count
        Dollar            56      head              57
        Euro              11      tail              43
        Other             30
        (blank)            3
        Grand Total      100      Grand Total      100
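One quick sanity check on the coin-flip results is to ask whether 57 heads and 43 tails are consistent with a fair coin. The sketch below is not part of the original demo; it uses an exact two-sided binomial test, and the function name `two_sided_binomial_p` is made up for illustration.

```python
from math import comb

def two_sided_binomial_p(successes: int, trials: int, p: float = 0.5) -> float:
    """Exact two-sided binomial test: sum the probabilities of every
    outcome that is no more likely than the observed one."""
    pmf = [comb(trials, k) * p**k * (1 - p) ** (trials - k) for k in range(trials + 1)]
    observed = pmf[successes]
    # Small tolerance so floating-point ties count as "equally likely".
    return sum(prob for prob in pmf if prob <= observed * (1 + 1e-9))

heads, tails = 57, 43  # counts reported on the slide
p_value = two_sided_binomial_p(heads, heads + tails)
print(f"heads={heads}, tails={tails}, p-value={p_value:.3f}")
```

A large p-value here would mean the reported split gives no evidence that workers invented their answers, though it cannot prove they actually flipped a coin.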
  11. 11. NOW WHAT CAN I DO WITH IT?August 12, 2012 11
  12. 12. PHASE 1: COLLECTING & LABELING DATAAugust 12, 2012 12
  13. 13. Data is King!• Massive free Web data changed how we train learning systems – Banko and Brill (2001). Human Language Tech. – Halevy et al. (2009). IEEE Intelligent Systems. • Crowds provide new access to cheap & labeled Big Data. But quality also matters! August 12, 2012 13
  14. 14. NLP: Snow et al. (EMNLP 2008)• MTurk annotation for 5 Tasks – Affect recognition – Word similarity – Recognizing textual entailment – Event temporal ordering – Word sense disambiguation• 22K labels for US $26• High agreement between consensus labels and gold-standard labelsAugust 12, 2012 14
  15. 15. Computer Vision: Sorokin & Forsyth (CVPR 2008) • 4K labels for US $60
  16. 16. IR: Alonso et al. (SIGIR Forum 2008)• MTurk for Information Retrieval (IR) – Judge relevance of search engine results• Many follow-on studies (design, quality, cost)August 12, 2012 16
  17. 17. User Studies: Kittur, Chi, & Suh (CHI 2008)• “…make creating believable invalid responses as effortful as completing the task in good faith.” August 12, 2012 17
  18. 18. Social & Behavioral Sciences • A Guide to Behavioral Experiments on Mechanical Turk – W. Mason and S. Suri (2010). SSRN online. • Crowdsourcing for Human Subjects Research – L. Schmidt (CrowdConf 2010) • Crowdsourcing Content Analysis for Behavioral Research: Insights from Mechanical Turk – Conley & Tosti-Kharas (2010). Academy of Management • Amazon’s Mechanical Turk: A New Source of Inexpensive, Yet High-Quality, Data? – M. Buhrmester et al. (2011). Perspectives… 6(1):3-5. – see also: Amazon Mechanical Turk Guide for Social Scientists
  20. 20. Remote Usability Testing• Liu, Bias, Lease, and Kuipers, ASIS&T, 2012• Compares remote usability testing using MTurk and CrowdFlower (not uTest) vs. traditional on-site testing• Advantages – More (Diverse) Participants – High Speed – Low Cost• Disadvantages – Lower Quality Feedback – Less Interaction – Greater need for quality control – Less Focused User GroupsAugust 12, 2012 20
  22. 22. NLP Example – Dialect IdentificationAugust 12, 2012 22
  23. 23. NLP Example – Machine Translation• Manual evaluation on translation quality is slow and expensive• High agreement between non-experts and experts• $0.10 to translate a sentence C. Callison-Burch. “Fast, Cheap, and Creative: Evaluating Translation Quality Using Amazon’s Mechanical Turk”, EMNLP 2009.August 12, 2012 23
  24. 24. Computer Vision – Painting Similarity Kovashka & Lease, CrowdConf’10August 12, 2012 24
  25. 25. IR Example – Relevance and adsAugust 12, 2012 25
  26. 26. IR Example – Product SearchAugust 12, 2012 26
  27. 27. IR Example – Snippet Evaluation• Study on summary lengths• Determine preferred result length• Asked workers to categorize web queries• Asked workers to evaluate snippet quality• Payment between $0.01 and $0.05 per HIT M. Kaisser, M. Hearst, and L. Lowe. “Improving Search Results Quality by Customizing Summary Lengths”, ACL/HLT, 2008.August 12, 2012 27
  28. 28. IR Example – Relevance Assessment• Replace TREC-like relevance assessors with MTurk?• Selected topic “space program” (011)• Modified original 4-page instructions from TREC• Workers more accurate than original assessors!• 40% provided justification for each answer O. Alonso and S. Mizzaro. “Can we get rid of TREC assessors? Using Mechanical Turk for relevance assessment”, SIGIR Workshop on the Future of IR Evaluation, 2009. August 12, 2012 28
  29. 29. IR Example – Timeline Annotation• Workers annotate timeline on politics, sports, culture• Given a timex (1970s, 1982, etc.) suggest something• Given an event (Vietnam, World cup, etc.) suggest a timex K. Berberich, S. Bedathur, O. Alonso, G. Weikum “A Language Modeling Approach for Temporal Information Needs”. ECIR 2010 August 12, 2012 29
  31. 31. Why Eytan Adar hates MTurk Research (CHI 2011 CHC Workshop)• Overly-narrow focus on MTurk – Identify general vs. platform-specific problems – Academic vs. Industrial problems• Inattention to prior work in other disciplines• Turks aren’t Martians – Just human behavior (more later…) August 12, 2012 31
  32. 32. ESP Game (Games With a Purpose): L. von Ahn and L. Dabbish (2004)
  33. 33. reCaptcha: L. von Ahn et al. (2008). In Science.
  34. 34. Human Sensing and Monitoring• Sullivan et al. (2009). Bio. Conservation (142):10• Keynote by Steve Kelling at ASIS&T 2011 August 12, 2012 34
  35. 35. • Learning to map from web pages to queries • Human computation game to elicit data • Home-grown system (no AMT) • Try it! pagehunt.msrlivelabs.com See also: • H. Ma et al. “Improving Search Engines Using Human Computation Games”, CIKM 2009. • Law et al. SearchWar. HCOMP 2009. • Bennett et al. Picture This. HCOMP 2009.
  36. 36. Tracking Sentiment in Online Media: Brew et al., PAIS 2010 • Volunteer crowd • Judge in exchange for access to rich content • Balance system needs with user interest • Daily updates to non-stationary distribution
  38. 38. Human Computation• What was old is new• Crowdsourcing: A New Branch of Computer Science – D.A. Grier, March 29, 2011• Tabulating the heavens: computing the Nautical Almanac in 18th-century England - M. Croarken’03 Princeton University Press, 2005 August 12, 2012 38
  39. 39. The Mechanical Turk. Constructed and unveiled in 1770 by Wolfgang von Kempelen (1734–1804). J. Pontin. Artificial Intelligence, With Help From the Humans. New York Times (March 25, 2007)
  40. 40. The Human Processing Unit (HPU)• Davis et al. (2010) HPUAugust 12, 2012 40
  41. 41. Human Computation• Having people do stuff instead of computers• Investigates use of people to execute certain computations for which capabilities of current automated methods are more limited• Explores the metaphor of computation for characterizing attributes, capabilities, and limitations of human performance in executing desired tasks• Computation is required, crowd is not• von Ahn’s Thesis (2005), Law & von Ahn (2011)August 12, 2012 41
  43. 43. Crowd-Assisted Search: “Amazon Remembers” August 12, 2012 43
  44. 44. Crowd-Assisted Search (2) • Yan et al., MobiSys’10 • CrowdTerrier (McCreadie et al., SIGIR’12)
  45. 45. Translation by monolingual speakers• C. Hu, CHI 2009August 12, 2012 45
  46. 46. Soylent: A Word Processor with a Crowd Inside • Bernstein et al., UIST 2010 August 12, 2012 46
  47. 47. fold.it: S. Cooper et al. (2010). Alice G. Walton. Online Gamers Help Solve Mystery of Critical AIDS Virus Enzyme. The Atlantic, October 8, 2011.
  48. 48. PlateMate (Noronha et al., UIST’10)
  49. 49. Image Analysis and more: EateryAugust 12, 2012 49
  50. 50. VizWiz: Bingham et al. (UIST 2010)
  52. 52. Crowd Sensing: WazeAugust 12, 2012 52
  53. 53. THE SOCIAL SIDE OF SEARCHAugust 12, 2012 53
  54. 54. People are more than HPUs • Why is Facebook popular? People are social. • Information needs are contextually grounded in our social experiences and social networks • The value of social search may be more than the relevance of the search results • Our social networks also embody additional knowledge about us, our needs, and the world. The social dimension complements computation
  55. 55. Community Q&A
  57. 57. Complex Information Needs (Source: Yahoo! Answers, “News & Events”, Nov. 6 2008)
      – Who is Rahm Emanuel, Obama’s Chief of Staff?
      – How have dramatic shifts in terrorists resulted in an equally dramatic shift in terrorist organizations?
      – How do I find what events were in the news on my son’s birthday?
      – Do you think the current drop in the Stock Market is related to Obama’s election to President?
      – Why are prisoners on death row given final medicals?
      – Should George Bush attack Iran’s nuclear facility before he leaves office?
      – Why are people against gay marriage?
      – Does anyone know anything interesting that happened nation wide in 2008?
      – Should the fact that a prisoner has cancer have any bearing on an appeal for bail?
  58. 58. Community Q&A • Ask the village vs. searching the archive • Posting and waiting can be slow – Find similar questions already answered • Best answer (winner-take-all) vs. voting • Challenges – Questions shorter than documents – Questions not queries, colloquial, errors – Latency & quality (e.g. question routing) • Cf. work by Bruce Croft & students
  59. 59. Horowitz & Kamvar, WWW’10• Routing: Trust vs. Authority• Social networks vs. search engines – See also: Morris & Teevan, HCIC’12 August 12, 2012 59
  60. 60. Social Network integration• Facebook Questions (with Bing)• Google+ (acquired Aardvark)• Twitter (cf. Paul, Hong, and Chi, ICWSM’11)August 12, 2012 60
  61. 61. Search Buddies: Hecht et al. ICWSM 2012; Morris MSR Talk
  62. 62. {where to go on vacation}
      • Tons of results • Read title + snippet + URL • Explore a few pages in detail
      • MTurk: 50 answers, $1.80 • Quora: 2 answers • Y! Answers: 2 answers • FB: 1 answer
  63. 63. {where to go on vacation} Countries CitiesAugust 12, 2012 63
  64. 64. {where to go on vacation}
      • Let’s execute the same query on different days

        Execution #1        Execution #2        Execution #3
        Las Vegas    3      Kerala       6      Las Vegas          4
        Hawaii       2      Goa          4      Himachal pradesh   3
        Kerala       2      Ooty         3      Mauritius          2
        Key West     2      Switzerland  3      Ooty               2
        Orlando      2      Agra         2      kodaikanal         2
        New Zealand  2

      • Table shows places with frequency >= 2
      • Every execution uses same template & 50 workers
      • Completion time more or less the same
      • Results may differ
      • Related work: Zhang et al., CHI 2012
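The frequency table above can be produced from raw free-text answers with a simple tally. This is a minimal sketch, not the tooling used for the tutorial; the function name `frequent_answers` and the sample answers are hypothetical, and the only normalization assumed is trimming whitespace and lowercasing.

```python
from collections import Counter

def frequent_answers(answers, min_count=2):
    """Tally free-text answers and keep those given at least min_count times."""
    # Light normalization so "Las Vegas" and "las vegas " count as one place.
    normalized = [a.strip().lower() for a in answers if a.strip()]
    counts = Counter(normalized)
    return [(place, n) for place, n in counts.most_common() if n >= min_count]

# Hypothetical worker answers for one execution of the task.
sample = ["Las Vegas", "las vegas ", "Hawaii", "Kerala", "Las Vegas", "Hawaii", "Paris", ""]
print(frequent_answers(sample))
```

Real answers usually need more cleanup than this (spelling variants, "Goa, India" vs. "Goa"), which is one reason results can differ between executions.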
  65. 65. SO WHAT IS CROWDSOURCING?August 12, 2012 65
  67. 67. From Outsourcing to Crowdsourcing• Take a job traditionally performed by a known agent (often an employee)• Outsource it to an undefined, generally large group of people via an open call• New application of principles from open source movement• Evolving & broadly defined ... August 12, 2012 67
  68. 68. Crowdsourcing models• Micro-tasks & citizen science• Co-Creation• Open Innovation, Contests• Prediction Markets• Crowd Funding and Charity• “Gamification” (not serious gaming)• Transparent• cQ&A, Social Search, and Polling• Physical Interface/TaskAugust 12, 2012 68
  69. 69. What is Crowdsourcing?• A set of mechanisms and methods for scaling & directing crowd activities to achieve some goal(s)• Enabled by internet-connectivity• Many related topics/areas: – Human computation (next slide…) – Collective intelligence – Crowd/Social computing – Wisdom of Crowds – People services, Human Clouds, Peer-production, …August 12, 2012 69
  70. 70. What is not crowdsourcing?• Post-hoc use of pre-existing crowd data – Data mining – Visual analytics• Use of one or few people – Mixed-initiative design – Active learning• Conducting a survey or poll… (*)August 12, 2012 70
  71. 71. Crowdsourcing Key Questions• What are the goals? – Purposeful directing of human activity• How can you incentivize participation? – Incentive engineering – Who are the target participants?• Which model(s) are most appropriate? – How to adapt them to your context and goals?August 12, 2012 71
  72. 72. What do you want to accomplish?• Create• Execute task/computation• Fund• Innovate and/or discover• Learn• Monitor• PredictAugust 12, 2012 72
  73. 73. INCENTIVE ENGINEERINGAugust 12, 2012 73
  74. 74. Who are the workers? • A. Baio, November 2008. The Faces of Mechanical Turk. • P. Ipeirotis. March 2010. The New Demographics of Mechanical Turk. • J. Ross, et al. Who are the Crowdworkers?... CHI 2010.
  75. 75. MTurk Demographics• 2008-2009 studies found less global and diverse than previously thought – US – Female – Educated – Bored – Money is secondaryAugust 12, 2012 75
  76. 76. 2010 shows increasing diversity: 47% US, 34% India, 19% other (P. Ipeirotis. March 2010)
  77. 77. Why should your crowd participate? • Earn Money (real or virtual) • Have fun (or pass the time) • Socialize with others • Obtain recognition or prestige (leaderboards, badges) • Do Good (altruism) • Learn something new • Obtain something else • Create self-serving resource. Multiple incentives can often operate in parallel (*caveat)
  78. 78. Example: Wikipedia• Earn Money (real or virtual)• Have fun (or pass the time)• Socialize with others• Obtain recognition or prestige• Do Good (altruism)• Learn something new• Obtain something else• Create self-serving resourceAugust 12, 2012 78
  79. 79. Example: DuoLingo• Earn Money (real or virtual)• Have fun (or pass the time)• Socialize with others• Obtain recognition or prestige• Do Good (altruism)• Learn something new• Obtain something else• Create self-serving resourceAugust 12, 2012 79
  80. 80. Example:• Earn Money (real or virtual)• Have fun (or pass the time)• Socialize with others• Obtain recognition or prestige• Do Good (altruism)• Learn something new• Obtain something else• Create self-serving resourceAugust 12, 2012 80
  81. 81. Example: ESP• Earn Money (real or virtual)• Have fun (or pass the time)• Socialize with others• Obtain recognition or prestige• Do Good (altruism)• Learn something new• Obtain something else• Create self-serving resourceAugust 12, 2012 81
  82. 82. Example:• Earn Money (real or virtual)• Have fun (or pass the time)• Socialize with others• Obtain recognition or prestige• Do Good (altruism)• Learn something new• Obtain something else• Create self-serving resourceAugust 12, 2012 82
  83. 83. Example: FreeRice• Earn Money (real or virtual)• Have fun (or pass the time)• Socialize with others• Obtain recognition or prestige• Do Good (altruism)• Learn something new• Obtain something else• Create self-serving resourceAugust 12, 2012 83
  84. 84. Example: cQ&A• Earn Money (real or virtual)• Have fun (or pass the time)• Socialize with others• Obtain recognition or prestige• Do Good (altruism)• Learn something new• Obtain something else• Create self-serving resourceAugust 12, 2012 84
  85. 85. Example: reCaptcha • Earn Money (real or virtual) • Have fun (or pass the time) • Socialize with others • Obtain recognition or prestige • Do Good (altruism) • Learn something new • Obtain something else • Create self-serving resource. Is there an existing human activity you can harness for another purpose?
  86. 86. Example: Mechanical Turk• Earn Money (real or virtual)• Have fun (or pass the time)• Socialize with others• Obtain recognition or prestige• Do Good (altruism)• Learn something new• Obtain something else• Create self-serving resourceAugust 12, 2012 86
  87. 87. How Much to Pay? • Price commensurate with task effort – Ex: $0.02 for yes/no answer + $0.02 bonus for optional feedback • Ethics & market factors: W. Mason and S. Suri, 2010. – e.g. non-profit SamaSource contracts workers in refugee camps – Predict right price given market & task: Wang et al. CSDM’11 • Uptake & time-to-completion vs. Cost & Quality – Too little $$: no interest or slow uptake; too much $$: attracts spammers – Real problem is lack of reliable QA substrate • Accuracy & quantity – More pay = more work, not better (W. Mason and D. Watts, 2009) • Heuristics: start small, watch uptake and bargaining feedback • Worker retention (“anchoring”) See also: L.B. Chilton et al. KDD-HCOMP 2010.
  88. 88. Dan Pink – YouTube video“The Surprising Truth about what Motivates us”August 12, 2012 88
  89. 89. PLATFORMSAugust 12, 2012 89
  90. 90. Mechanical What?August 12, 2012 90
  91. 91. Does anyone really use it? Yes! (P. Ipeirotis’10) From 1/09 – 4/10, 7M HITs from 10K requesters worth $500,000 USD (significant under-estimate)
  92. 92. MTurk: The Requester• Sign up with your Amazon account• Amazon payments• Purchase prepaid HITs• There is no minimum or up-front fee• MTurk collects a 10% commission• The minimum commission charge is $0.005 per HITAugust 12, 2012 92
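The fee rules on this slide make it easy to estimate a batch budget before launching. The sketch below uses only the figures quoted above (10% commission, $0.005 minimum, as of 2012; current fees may differ). It assumes the commission applies to each paid judgment, and the name `batch_cost` is made up for illustration.

```python
from decimal import Decimal

COMMISSION_RATE = Decimal("0.10")   # figure from the slide (2012)
MIN_COMMISSION = Decimal("0.005")   # minimum commission charge from the slide

def batch_cost(reward_per_hit, num_hits, workers_per_hit=1):
    """Estimated total cost: (reward + commission) for every paid judgment.

    Assumption: the commission is charged once per judgment, i.e. once
    per worker per HIT.
    """
    reward = Decimal(str(reward_per_hit))
    commission = max(reward * COMMISSION_RATE, MIN_COMMISSION)
    judgments = num_hits * workers_per_hit
    return (reward + commission) * judgments

# 100 HITs at $0.02, 5 workers each: the minimum commission applies.
print(batch_cost("0.02", 100, workers_per_hit=5))
```

At very low rewards the minimum charge dominates: a $0.02 reward carries a $0.005 fee, which is 25% rather than 10%.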
  93. 93. MTurk Dashboard• Three tabs – Design – Publish – Manage• Design – HIT Template• Publish – Make work available• Manage – Monitor progressAugust 12, 2012 93
  95. 95. MTurk: Dashboard - IIAugust 12, 2012 95
  96. 96. MTurk API• Amazon Web Services API• Rich set of services• Command line tools• More flexibility than dashboardAugust 12, 2012 96
  97. 97. MTurk Dashboard vs. API• Dashboard – Easy to prototype – Setup and launch an experiment in a few minutes• API – Ability to integrate AMT as part of a system – Ideal if you want to run experiments regularly – Schedule tasksAugust 12, 2012 97
  98. 98. • Multiple Channels• Gold-based tests• Only pay for “trusted” judgments August 12, 2012 98
  99. 99. CloudFactory • Information below from Mark Sears (Oct. 18, 2011) • Cloud Labor API – Tools to design virtual assembly lines – Workflows with multiple tasks chained together • Focus on self-serve tools for people to easily design crowd-powered assembly lines that can be easily integrated into software applications • Interfaces: command-line, RESTful API, and Web • Each “task station” can have either a human or robot worker assigned – Web software services (AlchemyAPI, SendGrid, Google APIs, Twilio, etc.) or local software can be combined with human computation • Many built-in “best practices” – “Tournament Stations” where multiple results are compared by other cloud workers until confidence of best answer is reached – “Improver Stations” have workers improve and correct work by other workers – Badges are earned by cloud workers passing tests created by requesters – Training and tools to create skill tests will be flexible – Algorithms to detect and kick out spammers/cheaters/lazy/bad workers
  100. 100. More Crowd Labor Platforms• Clickworker• CloudCrowd• CrowdSource• DoMyStuff• Humanoid (by Matt Swason et al.)• Microtask• MobileWorks (by Anand Kulkarni )• myGengo• SmartSheet• vWorker• Industry heavy-weights – Elance – Liveops – oDesk – uTest• and more… August 12, 2012 100
  101. 101. Platform alternatives • Why MTurk – Amazon brand, lots of research papers – Speed, price, diversity, payments • Why not – Crowdsourcing != MTurk – Spam, no analytics, must build tools for worker & task quality • Microsoft Universal Human Relevance System (UHRS) • How to build your own crowdsourcing platform – Back-end – Template language for creating experiments – Scheduler – Payments?
  102. 102. Why Micro-Tasks? • Easy, cheap and fast • Ready-to-use infrastructure, e.g. – MTurk payments, workforce, interface widgets – CrowdFlower quality control mechanisms, etc. – Many others … • Allows early, iterative, frequent trials – Iteratively prototype and test new ideas – Try new tasks, test when you want & as you go • Many successful examples of use reported
  103. 103. Micro-Task Issues• Process – Task design, instructions, setup, iteration• Choose crowdsourcing platform (or roll your own)• Human factors – Payment / incentives, interface and interaction design, communication, reputation, recruitment, retention• Quality Control / Data Quality – Trust, reliability, spam detection, consensus labelingAugust 12, 2012 103
  104. 104. WORKFLOW DESIGNAugust 12, 2012 104
  105. 105. PlateMate - ArchitectureAugust 12, 2012 105
  106. 106. Turkomatic: Kulkarni et al., CSCW 2012
  107. 107. CrowdForge: Workers perform a task or further decompose them Kittur et al., CHI 2011August 12, 2012 107
  108. 108. Kittur et al., CrowdWeaver, CSCW 2012August 12, 2012 108
  109. 109. DESIGNING FOR CROWDSAugust 12, 2012 109
  111. 111. Typical Workflow• Define and design what to test• Sample data• Design the experiment• Run experiment• Collect data and analyze results• Quality controlAugust 12, 2012 111
  112. 112. Development Framework• Incremental approach• Measure, evaluate, and adjust as you go• Suitable for repeatable tasksAugust 12, 2012 112
  113. 113. Survey Design• One of the most important parts• Part art, part science• Instructions are key• Prepare to iterateAugust 12, 2012 113
  114. 114. Questionnaire Design• Ask the right questions• Workers may not be IR experts so don’t assume the same understanding in terms of terminology• Show examples• Hire a technical writer – Engineer writes the specification – Writer communicatesAugust 12, 2012 114
  115. 115. UX Design• Time to apply all those usability concepts• Generic tips – Experiment should be self-contained. – Keep it short and simple. Brief and concise. – Be very clear with the relevance task. – Engage with the worker. Avoid boring stuff. – Always ask for feedback (open-ended question) in an input box.August 12, 2012 115
  116. 116. UX Design - II• Presentation• Document design• Highlight important concepts• Colors and fonts• Need to grab attention• LocalizationAugust 12, 2012 116
  117. 117. Examples - I• Asking too much, task not clear, “do NOT/reject”• Worker has to do a lot of stuffAugust 12, 2012 117
  118. 118. Example - II• Lot of work for a few cents• Go here, go there, copy, enter, count …August 12, 2012 118
  119. 119. A Better Example• All information is available – What to do – Search result – Question to answerAugust 12, 2012 119
  121. 121. Form and Metadata • Form with a closed question (binary relevance) and open-ended question (user feedback) • Clear title, useful keywords • Workers need to find your task
  122. 122. Relevance Judging – Example IAugust 12, 2012 122
  123. 123. Relevance Judging – Example IIAugust 12, 2012 123
  124. 124. Implementation• Similar to a UX• Build a mock up and test it with your team – Yes, you need to judge some tasks• Incorporate feedback and run a test on MTurk with a very small data set – Time the experiment – Do people understand the task?• Analyze results – Look for spammers – Check completion times• Iterate and modify accordinglyAugust 12, 2012 124
  125. 125. Implementation – II• Introduce quality control – Qualification test – Gold answers (honey pots)• Adjust passing grade and worker approval rate• Run experiment with new settings & same data• Scale on data• Scale on workersAugust 12, 2012 125
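Gold answers (honey pots) and a passing grade can be combined into a simple filter. This is a minimal sketch under stated assumptions, not a platform feature: the function names, the `(worker, item, label)` tuple layout, and the default thresholds are all hypothetical.

```python
from collections import defaultdict

def gold_scores(judgments, gold):
    """Return {worker: (correct, attempted)} over gold (honey pot) items only.

    judgments: iterable of (worker_id, item_id, label)
    gold: {item_id: trusted_label}
    """
    correct = defaultdict(int)
    attempted = defaultdict(int)
    for worker, item, label in judgments:
        if item in gold:
            attempted[worker] += 1
            correct[worker] += int(label == gold[item])
    return {w: (correct[w], attempted[w]) for w in attempted}

def passing_workers(judgments, gold, passing_grade=0.75, min_gold=2):
    """Workers who saw at least min_gold gold items and met the passing grade."""
    scores = gold_scores(judgments, gold)
    return {w for w, (c, n) in scores.items() if n >= min_gold and c / n >= passing_grade}

gold = {"g1": "R", "g2": "NR"}
judgments = [
    ("w1", "g1", "R"), ("w1", "g2", "NR"), ("w1", "d7", "R"),
    ("w2", "g1", "NR"), ("w2", "g2", "NR"), ("w2", "d7", "NR"),
    ("w3", "g1", "R"),  # only one gold item seen: not enough evidence yet
]
print(passing_workers(judgments, gold))
```

Adjusting `passing_grade` corresponds to the "adjust passing grade" step on the slide; rerunning on the same data shows how many workers each setting would keep.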
  126. 126. Experiment in Production• Lots of tasks on MTurk at any moment• Need to grab attention• Importance of experiment metadata• When to schedule – Split a large task into batches and have 1 single batch in the system – Always review feedback from batch n before uploading n+1August 12, 2012 126
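Splitting a large task into batches needs very little code. The helper below is a hypothetical sketch of the slicing step only; uploading a batch and reviewing its feedback are left to whatever tooling you use.

```python
def batches(items, batch_size):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

documents = [f"doc{i}" for i in range(7)]  # placeholder item IDs
for n, batch in enumerate(batches(documents, 3), start=1):
    # In practice: upload this batch, wait for completion, then review
    # worker feedback before moving on to batch n+1.
    print(n, batch)
```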
  127. 127. Other design principles• Text alignment• Legibility• Reading level: complexity of words and sentences• Attractiveness (worker’s attention & enjoyment)• Multi-cultural / multi-lingual• Who is the audience (e.g. target worker community) – Special needs communities (e.g. simple color blindness)• Parsimony• Cognitive load: mental rigor needed to perform task• Exposure effectAugust 12, 2012 127
  128. 128. The human side• As a worker – I hate when instructions are not clear – I’m not a spammer – I just don’t get what you want – Boring task – A good pay is ideal but not the only condition for engagement• As a requester – Attrition – Balancing act: a task that would produce the right results and is appealing to workers – I want your honest answer for the task – I want qualified workers; system should do some of that for me• Managing crowds and tasks is a daily activity – more difficult than managing computers August 12, 2012 128
  129. 129. Things that work• Qualification tests• Honey-pots• Good content and good presentation• Economy of attention• Things to improve – Manage workers in different levels of expertise including spammers and potential cases. – Mix different pools of workers based on different profile and expertise levels.August 12, 2012 129
  130. 130. Things that need work• UX and guidelines – Help the worker – Cost of interaction• Scheduling and refresh rate• Exposure effect• Sometimes we just don’t agree• How crowdsourcable is your taskAugust 12, 2012 130
  131. 131. RELEVANCE JUDGING & CROWDSOURCINGAugust 12, 2012 131
  133. 133. Motivating Example: Relevance Judging• Relevance of search results is difficult to judge – Highly subjective – Expensive to measure• Professional editors commonly used• Potential benefits of crowdsourcing – Scalability (time and cost) – Diversity of judgmentsAugust 12, 2012 133
  135. 135. Started with a joke …August 12, 2012 135
  136. 136. Results for {idiot} at WSDM 2011. February 2011: 5/7 (R), 2/7 (NR)
      – Most of the time those TV reality stars have absolutely no talent. They do whatever they can to make a quick dollar. Most of the time the reality tv stars don not have a mind of their own. [R]
      – Most are just celebrity wannabees. Many have little or no talent, they just want fame. [R]
      – I can see this one going both ways. A particular sort of reality star comes to mind, though, one who was voted off Survivor because he chose not to use his immunity necklace. Sometimes the label fits, but sometimes it might be unfair. [R]
      – Just because someone else thinks they are an "idiot", doesn’t mean that is what the word means. I don’t like to think that any one person’s photo would be used to describe a certain term. [NR]
      – While some reality-television stars are genuinely stupid (or cultivate an image of stupidity), that does not mean they can or should be classified as "idiots." Some simply act that way to increase their TV exposure and potential earnings. Other reality-television stars are really intelligent people, and may be considered as idiots by people who don’t like them or agree with them. It is too subjective an issue to be a good result for a search engine. [NR]
      – Have you seen the knuckledraggers on reality television? They should be required to change their names to idiot after appearing on the show. You could put numbers after the word idiot so we can tell them apart. [R]
      – Although I have not followed too many of these shows, those that I have encountered have for a great part a very common property. That property is that most of the participants involved exhibit a shallow self-serving personality that borders on social pathological behavior. To perform or act in such an abysmal way could only be an act of an idiot. [R]
  137. 137. Two Simple Examples of MTurk 1. Ask workers to classify a query 2. Ask workers to judge document relevance. Steps • Define high-level task • Design & implement interface & backend • Launch, monitor progress, and assess work • Iterate design
  138. 138. Query Classification Task• Ask the user to classify a query• Show a form that contains a few categories• Upload a few queries (~20)• Use 3 workersAugust 12, 2012 138
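With 3 workers per query, the simplest way to turn judgments into one category per query is a majority vote. This sketch is an assumption about how the labels might be aggregated, not something the slides prescribe; the tuple layout and the category names are hypothetical.

```python
from collections import Counter, defaultdict

def majority_labels(judgments):
    """Map each query to its most frequent label, or None when the top two tie.

    judgments: iterable of (query, worker_id, label)
    """
    votes = defaultdict(list)
    for query, _worker, label in judgments:
        votes[query].append(label)
    result = {}
    for query, labels in votes.items():
        ranked = Counter(labels).most_common(2)
        tie = len(ranked) == 2 and ranked[0][1] == ranked[1][1]
        result[query] = None if tie else ranked[0][0]
    return result

judgments = [
    ("salad recipes", "w1", "Food"), ("salad recipes", "w2", "Food"), ("salad recipes", "w3", "Health"),
    ("space program", "w1", "Science"), ("space program", "w2", "News"), ("space program", "w3", "Politics"),
]
print(majority_labels(judgments))
```

Queries that come back as `None` had no majority; these are candidates for extra judgments or a look at whether the categories overlap.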
  139. 139. DEMOAugust 12, 2012 139
  141. 141. Relevance Judging Task• Use a few documents from a standard collection used for evaluating search engines• Ask user to make binary judgments• Modification: graded judging• Use 5 workersAugust 12, 2012 141
  142. 142. DEMOAugust 12, 2012 142
  143. 143. Content quality • People like to work on things that they like • TREC ad-hoc vs. INEX – TREC experiments took twice as long to complete – INEX (Wikipedia), TREC (LA Times, FBIS) • Topics – INEX: Olympic games, movies, salad recipes, etc. – TREC: cosmic events, Schengen agreement, etc. • Content and judgments according to modern times – Airport security docs are pre-9/11 – Antarctic exploration (global warming)
  144. 144. Content quality - II• Document length• Randomize content• Avoid worker fatigue – Judging 100 documents on the same subject can be tiring, leading to decreasing qualityAugust 12, 2012 144
  145. 145. Presentation• People scan documents for relevance cues• Document design• Highlighting no more than 10%August 12, 2012 145
  146. 146. Presentation - IIAugust 12, 2012 146
  147. 147. Relevance justification• Why settle for a label?• Let workers justify answers – cf. Zaidan et al. (2007) “annotator rationales”• INEX – 22% of assignments with comments• Must be optional• Let’s see how people justifyAugust 12, 2012 147
  148. 148. “Relevant” answers [Salad Recipes] Doesnt mention the word salad, but the recipe is one that could be considered a salad, or a salad topping, or a sandwich spread. Egg salad recipe Egg salad recipe is discussed. History of salad cream is discussed. Includes salad recipe It has information about salad recipes. Potato Salad Potato salad recipes are listed. Recipe for a salad dressing. Salad Recipes are discussed. Salad cream is discussed. Salad info and recipe The article contains a salad recipe. The article discusses methods of making potato salad. The recipe is for a dressing for a salad, so the information is somewhat narrow for the topic but is still potentially relevant for a researcher. This article describes a specific salad. Although it does not list a specific recipe, it does contain information relevant to the search topic. gives a recipe for tuna salad relevant for tuna salad recipes relevant to salad recipes this is on-topic for salad recipesAugust 12, 2012 148
  149. 149. “Not relevant” answers[Salad Recipes]About gaming not salad recipes.Article is about Norway.Article is about Region Codes.Article is about forests.Article is about geography.Document is about forest and trees.Has nothing to do with salad or recipes.Not a salad recipeNot about recipesNot about salad recipesThere is no recipe, just a comment on how salads fit into meal formats.There is nothing mentioned about salads.While dressings should be mentioned with salads, this is an article on one specific type of dressing, no recipe for salads.article about a swiss tv showcompletely off-topic for salad recipesnot a salad recipenot about salad recipestotally off baseAugust 12, 2012 149
  151. 151. Feedback length• Workers will justify answers• Has to be optional for good feedback• In E51, mandatory comments – Length dropped – “Relevant” or “Not Relevant”
  152. 152. Was the task difficult?• Ask workers to rate difficulty of a search topic• 50 topics; 5 workers, $0.01 per taskAugust 12, 2012 152
  153. 153. QUALITY ASSURANCEAugust 12, 2012 153
  154. 154. When to assess quality of work• Beforehand (prior to main task activity) – How: “qualification tests” or similar mechanism – Purpose: screening, selection, recruiting, training• During – How: assess labels as worker produces them • Like random checks on a manufacturing line – Purpose: calibrate, reward/penalize, weight• After – How: compute accuracy metrics post-hoc – Purpose: filter, calibrate, weight, retain (HR) – E.g. Jung & Lease (2011), Tang & Lease (2011), ...August 12, 2012 154
  155. 155. How do we measure work quality?• Compare worker’s label vs. – Known (correct, trusted) label – Other workers’ labels • P. Ipeirotis. Worker Evaluation in Crowdsourcing: Gold Data or Multiple Workers? Sept. 2010. – Model predictions of the above • Model the labels (Ryu & Lease, ASIS&T11) • Model the workers (Chen et al., AAAI’10)• Verify worker’s label – Yourself – Tiered approach (e.g. Find-Fix-Verify) • Quinn and B. Bederson’09, Bernstein et al.’10August 12, 2012 155
  156. 156. Typical Assumptions• Objective truth exists – no minority voice / rare insights – Can relax this to model “truth distribution”• Automatic answer comparison/evaluation – What about free text responses? Hope from NLP… • Automatic essay scoring • Translation (BLEU: Papineni, ACL’2002) • Summarization (Rouge: C.Y. Lin, WAS’2004) – Have people do it (yourself or find-verify crowd, etc.)August 12, 2012 156
  157. 157. Distinguishing Bias vs. Noise• Ipeirotis (HComp 2010)• People often have consistent, idiosyncratic skews in their labels (bias) – E.g. I like action movies, so they get higher ratings• Once detected, systematic bias can be calibrated for and corrected (yeah!)• Noise, however, seems random & inconsistent – this is the real issue we want to focus onAugust 12, 2012 157
  158. 158. Comparing to known answers• AKA: gold, honey pot, verifiable answer, trap• Assumes you have known answers• Cost vs. Benefit – Producing known answers (experts?) – % of work spent re-producing them• Finer points – Controls against collusion – What if workers recognize the honey pots?August 12, 2012 158
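Comparing against known answers reduces to scoring each worker only on the items that have a trusted label. The following is a minimal sketch, not the tutorial's own code; the tuple layout `(worker_id, item_id, label)` and the names `gold_accuracy`, `judgments`, and `gold` are assumptions made for illustration.

```python
from collections import defaultdict

def gold_accuracy(judgments, gold):
    """Per-worker accuracy on items that have a known (gold) answer.

    judgments: iterable of (worker_id, item_id, label) tuples (assumed layout).
    gold: dict mapping item_id -> trusted label.
    """
    correct = defaultdict(int)
    seen = defaultdict(int)
    for worker, item, label in judgments:
        if item in gold:                      # only honey-pot items are scored
            seen[worker] += 1
            correct[worker] += int(label == gold[item])
    return {w: correct[w] / seen[w] for w in seen}

judgments = [
    ("w1", "d1", "rel"), ("w1", "d2", "not"), ("w1", "d3", "rel"),
    ("w2", "d1", "not"), ("w2", "d2", "not"), ("w2", "d3", "rel"),
]
gold = {"d1": "rel", "d2": "not"}             # d3 has no gold label
accuracy = gold_accuracy(judgments, gold)
print(accuracy)
```

How many gold items to mix in is the cost/benefit question from the slide: every gold item scored is a judgment that produces no new label.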
  159. 159. Comparing to other workers• AKA: consensus, plurality, redundant labeling• Well-known metrics for measuring agreement• Cost vs. Benefit: % of work that is redundant• Finer points – Is consensus “truth” or systematic bias of group? – What if no one really knows what they’re doing? • Low-agreement across workers indicates problem is with the task (or a specific example), not the workers – Risk of collusion• Sheng et al. (KDD 2008)August 12, 2012 159
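When no gold labels exist, a worker can be scored against the plurality label of the other workers on the same item. This is a rough sketch under assumptions: the same hypothetical `(worker_id, item_id, label)` layout, and a leave-one-out rule (chosen here so that a worker's own vote does not count toward the consensus being compared against). Items where the other workers are tied are skipped.

```python
from collections import Counter, defaultdict

def agreement_with_others(judgments):
    """Fraction of each worker's labels matching the plurality of the OTHER workers."""
    by_item = defaultdict(dict)
    for worker, item, label in judgments:
        by_item[item][worker] = label
    agree, total = Counter(), Counter()
    for labels in by_item.values():
        for worker, label in labels.items():
            others = Counter(l for w, l in labels.items() if w != worker)
            ranked = others.most_common(2)
            if not ranked or (len(ranked) == 2 and ranked[0][1] == ranked[1][1]):
                continue                      # no peers, or peers are tied
            total[worker] += 1
            agree[worker] += int(label == ranked[0][0])
    return {w: agree[w] / total[w] for w in total}

judgments = [
    ("w1", "d1", "rel"), ("w2", "d1", "rel"), ("w3", "d1", "rel"), ("w4", "d1", "not"),
    ("w1", "d2", "not"), ("w2", "d2", "not"), ("w3", "d2", "not"), ("w4", "d2", "rel"),
]
scores = agreement_with_others(judgments)
print(scores)
```

As the slide warns, a low score here only says a worker disagrees with the group; it cannot tell a bad worker from a group that shares a systematic bias.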
  160. 160. Comparing to predicted label• Ryu & Lease, ASIS&T11 (CrowdConf’11 poster)• Catch-22 extremes – If model is really bad, why bother comparing? – If model is really good, why collect human labels?• Exploit model confidence – Trust predictions proportional to confidence – What if model very confident and wrong?• Active learning – Time sensitive: Accuracy / confidence changesAugust 12, 2012 160
  161. 161. Compare to predicted worker labels• Chen et al., AAAI’10• Avoid inefficiency of redundant labeling – See also: Dekel & Shamir (COLT’2009)• Train a classifier for each worker• For each example labeled by a worker – Compare to predicted labels for all other workers• Issues • Sparsity: workers have to stick around to train model… • Time-sensitivity: New workers & incremental updates?August 12, 2012 161
  162. 162. Methods for measuring agreement• What to look for – Agreement, reliability, validity• Inter-agreement level – Agreement between judges – Agreement between judges and the gold set• Some statistics – Percentage agreement – Cohen’s kappa (2 raters) – Fleiss’ kappa (any number of raters) – Krippendorff’s alpha• With majority vote, what if 2 say relevant, 3 say not? – Use expert to break ties (Kochhar et al, HCOMP’10; GQR) – Collect more judgments as needed to reduce uncertaintyAugust 12, 2012 162
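To make the difference between percentage agreement and Cohen’s kappa concrete, here is a small sketch for two raters who labeled the same items in the same order. The function names and the toy labels are illustrative assumptions. Kappa is undefined when chance agreement equals 1 (both raters always use one identical label), which this sketch does not guard against.

```python
from collections import Counter

def percent_agreement(a, b):
    """Share of items on which the two raters gave the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohen_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement from each rater's own label distribution.
    p_expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)

rater_a = ["R"] * 4 + ["N"] * 4 + ["R", "N"]
rater_b = ["R"] * 4 + ["N"] * 4 + ["N", "R"]
print(percent_agreement(rater_a, rater_b), cohen_kappa(rater_a, rater_b))
```

In this toy data the raters agree on 8 of 10 items, but half of that agreement is expected by chance, so kappa comes out noticeably lower than the raw percentage.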
  163. 163. Inter-rater reliability• Lots of research• Statistics books cover most of the material• Three categories based on the goals – Consensus estimates – Consistency estimates – Measurement estimatesAugust 12, 2012 163
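  164. 164. Sample code – R packages psy and irr
    library(psy)   # provides ckappa()
    library(irr)   # provides kappam.fleiss()
    # Tab-separated input: one row per item, one column per rater
    my_data <- read.delim(file="test.txt", header=TRUE, sep="\t")
    kappam.fleiss(my_data, exact=FALSE)   # Fleiss' kappa, any number of raters
    my_data2 <- read.delim(file="test2.txt", header=TRUE, sep="\t")
    ckappa(my_data2)                      # Cohen's kappa, two raters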
  165. 165. k coefficient• Different interpretations of k• For practical purposes you need to be >= moderate• Results may vary
    k             Interpretation
    < 0           Poor agreement
    0.01 – 0.20   Slight agreement
    0.21 – 0.40   Fair agreement
    0.41 – 0.60   Moderate agreement
    0.61 – 0.80   Substantial agreement
    0.81 – 1.00   Almost perfect agreement
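For more than two raters, Fleiss’ kappa can be computed directly from a table of per-item category counts, then mapped onto the interpretation bands listed on the slide. This is a sketch under assumptions: every item has the same number of raters, the function names are illustrative, and a kappa of exactly 0 is placed in the “Slight” band because the slide’s table leaves that point unassigned.

```python
def fleiss_kappa(counts):
    """counts[i][j] = number of raters who put item i in category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Share of all assignments that fell into each category.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters)
             for j in range(n_cats)]
    # Per-item agreement: fraction of rater pairs that agree.
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in counts]
    p_bar = sum(p_item) / n_items
    p_exp = sum(p * p for p in p_cat)
    return (p_bar - p_exp) / (1 - p_exp)

def interpret(kappa):
    """Map kappa onto the bands from the slide's table."""
    if kappa < 0:
        return "Poor"
    bands = [(0.20, "Slight"), (0.40, "Fair"), (0.60, "Moderate"), (0.80, "Substantial")]
    for upper, name in bands:
        if kappa <= upper:
            return name
    return "Almost perfect"

unanimous = [[3, 0], [0, 3], [3, 0], [0, 3]]   # 3 raters, 2 categories
split = [[2, 1], [1, 2], [2, 1], [1, 2]]
print(interpret(fleiss_kappa(unanimous)), interpret(fleiss_kappa(split)))
```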
  166. 166. Detection Theory• Sensitivity measures – High sensitivity: good ability to discriminate – Low sensitivity: poor ability
    Stimulus class   “Yes”          “No”
    S2               Hits           Misses
    S1               False alarms   Correct rejections
    Hit rate H = P(“yes”|S2)• False alarm rate F = P(“yes”|S1)
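The hit rate and false-alarm rate follow directly from the four cells of the table. The sketch below also computes d′ = z(H) − z(F), a standard sensitivity measure from signal detection theory; the slide does not name a specific measure, so treat d′ as one common choice rather than the tutorial’s prescription. The counts are made up, and d′ is undefined when either rate is exactly 0 or 1.

```python
from statistics import NormalDist

def rates(hits, misses, false_alarms, correct_rejections):
    """Hit rate and false-alarm rate from the four cells of the table."""
    hit_rate = hits / (hits + misses)
    fa_rate = false_alarms / (false_alarms + correct_rejections)
    return hit_rate, fa_rate

def d_prime(hit_rate, fa_rate):
    """Sensitivity d' = z(H) - z(F), with z the inverse standard normal CDF."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

H, F = rates(hits=40, misses=10, false_alarms=10, correct_rejections=40)
sensitivity = d_prime(H, F)
print(H, F, round(sensitivity, 2))
```

A worker who answers “yes” to everything gets a high hit rate and an equally high false-alarm rate, so d′ stays near zero; that is what separates sensitivity from a plain tendency to say “relevant”.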
  168. 168. Finding Consensus• When multiple workers disagree on the correct label, how do we resolve this? – Simple majority vote (or average and round) – Weighted majority vote (e.g. naive bayes)• Many papers from machine learning…• If wide disagreement, likely there is a bigger problem which consensus doesn’t addressAugust 12, 2012 168
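The two voting schemes can be contrasted in a few lines. In this sketch the weighted vote uses log-odds weights, log(p / (1 − p)), which is what a naive Bayes model yields for binary labels when each worker is assumed to err independently with a known accuracy p. The accuracies would in practice have to be estimated (for example from gold items); here they are made-up values strictly between 0 and 1.

```python
import math
from collections import Counter, defaultdict

def majority_vote(votes):
    """votes: list of (worker_id, label). Returns the most frequent label."""
    return Counter(label for _, label in votes).most_common(1)[0][0]

def weighted_vote(votes, accuracy):
    """Each vote counts with weight log(p / (1 - p)) for the worker's accuracy p."""
    scores = defaultdict(float)
    for worker, label in votes:
        p = accuracy[worker]
        scores[label] += math.log(p / (1 - p))
    return max(scores, key=scores.get)

votes = [("w1", "rel"), ("w2", "not"), ("w3", "not")]
accuracy = {"w1": 0.95, "w2": 0.60, "w3": 0.60}   # hypothetical estimates
print(majority_vote(votes), weighted_vote(votes, accuracy))
```

Here one highly reliable worker outweighs two marginal ones, so the two schemes disagree; with equal accuracies the weighted vote reduces to the simple majority.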
  169. 169. Quality Control on MTurk• Rejecting work & Blocking workers (more later…) – Requestors don’t want bad PR or complaint emails – Common practice: always pay, block as needed• Approval rate: easy to use, but value? – P. Ipeirotis. Be a Top Mechanical Turk Worker: You Need $5 and 5 Minutes. Oct. 2010 – Many requestors don’t ever reject…• Qualification test – Pre-screen workers’ capabilities & effectiveness – Example and pros/cons in next slides…• Geographic restrictions• Mechanical Turk Masters (June 23, 2011) – Recent addition, degree of benefit TBD… August 12, 2012 169
  171. 171. Quality Control in General• Extremely important part of the experiment• Approach as “overall” quality; not just for workers• Bi-directional channel – You may think the worker is doing a bad job. – The same worker may think you are a lousy requester. August 12, 2012 171
  172. 172. Tools and Packages for MTurk• QA infrastructure layers atop MTurk promote useful separation-of-concerns from task – TurKit • QuikTurKit provides nearly realtime services – Turkit-online (??) – Get Another Label (& qmturk) – Turk Surveyor – cv-web-annotation-toolkit (image labeling) – Soylent – Boto (Python library) • Turkpipe: submit batches of jobs using the command line.• More needed…
  173. 173. A qualification test snippet
    <Question>
      <QuestionIdentifier>question1</QuestionIdentifier>
      <QuestionContent>
        <Text>Carbon monoxide poisoning is</Text>
      </QuestionContent>
      <AnswerSpecification>
        <SelectionAnswer>
          <StyleSuggestion>radiobutton</StyleSuggestion>
          <Selections>
            <Selection>
              <SelectionIdentifier>1</SelectionIdentifier>
              <Text>A chemical technique</Text>
            </Selection>
            <Selection>
              <SelectionIdentifier>2</SelectionIdentifier>
              <Text>A green energy treatment</Text>
            </Selection>
            <Selection>
              <SelectionIdentifier>3</SelectionIdentifier>
              <Text>A phenomenon associated with sports</Text>
            </Selection>
            <Selection>
              <SelectionIdentifier>4</SelectionIdentifier>
              <Text>None of the above</Text>
            </Selection>
          </Selections>
        </SelectionAnswer>
      </AnswerSpecification>
    </Question>
  174. 174. Qualification tests: pros and cons• Advantages – Great tool for controlling quality – Adjust passing grade• Disadvantages – Extra cost to design and implement the test – May turn off workers, hurt completion time – Refresh the test on a regular basis – Hard to verify subjective tasks like judging relevance• Try creating task-related questions to get worker familiar with task before starting task in earnestAugust 12, 2012 174
  175. 175. More on quality control & assurance• HR issues: recruiting, selection, & retention – e.g., post/tweet, design a better qualification test, bonuses, …• Collect more redundant judgments… – at some point defeats cost savings of crowdsourcing – 5 workers is often sufficientAugust 12, 2012 175
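One way to see why a handful of redundant judgments is often enough is to compute how majority-vote accuracy grows with the number of workers. The sketch below assumes a simplified model that the slide does not state: binary labels, an odd number of workers, and workers who err independently with the same accuracy p. Real crowds violate these assumptions, so the numbers are only indicative.

```python
from math import comb

def majority_accuracy(n_workers, p):
    """P(majority is correct) for n independent workers with accuracy p (n odd)."""
    need = n_workers // 2 + 1                 # votes needed for a majority
    return sum(comb(n_workers, k) * p**k * (1 - p)**(n_workers - k)
               for k in range(need, n_workers + 1))

for n in (1, 3, 5, 7, 9):
    print(n, round(majority_accuracy(n, 0.7), 3))
```

Under this model each extra pair of workers buys a smaller gain than the previous pair, which is the diminishing return behind the “defeats cost savings” bullet.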
  176. 176. Robots and Captchas• Some reports of robots on MTurk – E.g. McCreadie et al. (2011) – violation of terms of service – Artificial artificial artificial intelligence• Captchas seem ideal, but… – There is abuse of robots using turkers to solve captchas so they can access web resources – Turker wisdom is therefore to avoid such HITs• What to do? – Use standard captchas, notify workers – Block robots other ways (e.g. external HITs) – Catch robots through standard QC, response times – Use HIT-specific captchas (Kazai et al., 2011)August 12, 2012 176
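Catching robots through response times can start with a very simple screen: flag workers whose median time per HIT is implausibly short. The sketch below is an illustration only; the 3-second threshold, the data layout, and the function name are assumptions, and a flagged worker should be reviewed rather than blocked automatically.

```python
from statistics import median

def flag_suspicious(times_by_worker, min_median_seconds=3.0):
    """Return worker ids whose median completion time is below the threshold."""
    return sorted(w for w, ts in times_by_worker.items()
                  if median(ts) < min_median_seconds)

times_by_worker = {
    "w1": [12.0, 9.5, 15.2],
    "w2": [1.1, 0.9, 1.3, 1.0],    # consistently too fast to have read the page
    "w3": [2.0, 30.0, 25.0],       # one fast HIT does not trip a median-based rule
}
suspects = flag_suspicious(times_by_worker)
print(suspects)
```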
  177. 177. Other quality heuristics• Justification/feedback as quasi-captcha – Successfully proven in past experiments – Should be optional – Automatically verifying feedback was written by a person may be difficult (classic spam detection task)• Broken URL/incorrect object – Leave an outlier in the data set – Workers will tell you – If somebody answers “excellent” on a graded relevance test for a broken URL => probably spammerAugust 12, 2012 177
  178. 178. Dealing with bad workers• Pay for “bad” work instead of rejecting it? – Pro: preserve reputation, admit if poor design at fault – Con: promote fraud, undermine approval rating system• Use bonus as incentive – Pay the minimum $0.01 and $0.01 for bonus – Better than rejecting a $0.02 task• If spammer “caught”, block from future tasks – May be easier to always pay, then block as needed August 12, 2012 178
  179. 179. Worker feedback• Real feedback received via email after rejection• Worker XXX I did. If you read these articles most of them have nothing to do with space programs. I’m not an idiot.• Worker XXX As far as I remember there wasn’t an explanation about what to do when there is no name in the text. I believe I did write a few comments on that, too. So I think you’re being unfair rejecting my HITs.
  180. 180. Real email exchange with worker after rejection• WORKER: this is not fair , you made me work for 10 cents and i lost my 30 minutes of time ,power and lot more and gave me 2 rejections at least you may keep it pending. please show some respect to turkers• REQUESTER: I’m sorry about the rejection. However, in the directions given in the hit, we have the following instructions: IN ORDER TO GET PAID, you must judge all 5 webpages below *AND* complete a minimum of three HITs. Unfortunately, because you only completed two hits, we had to reject those hits. We do this because we need a certain amount of data on which to make decisions about judgment quality. I’m sorry if this caused any distress. Feel free to contact me if you have any additional questions or concerns.• WORKER: I understood the problems. At that time my kid was crying and i went to look after. that’s why i responded like that. I was very much worried about a hit being rejected. The real fact is that i haven’t seen that instructions of 5 web page and started doing as i do the dolores labs hit, then someone called me and i went to attend that call. sorry for that and thanks for your kind concern.
  181. 181. Exchange with worker• Worker XXX Thank you. I will post positive feedback for you at Turker Nation.• Me: was this a sarcastic comment?• I took a chance by accepting some of your HITs to see if you were a trustworthy author. My experience with you has been favorable so I will put in a good word for you on that website. This will help you get higher quality applicants in the future, which will provide higher quality work, which might be worth more to you, which hopefully means higher HIT amounts in the future.
  182. 182. Build Your Reputation as a Requestor• Word of mouth effect – Workers trust the requester (pay on time, clear explanation if there is a rejection) – Experiments tend to go faster – Announce forthcoming tasks (e.g. tweet)• Disclose your real identity?August 12, 2012 182
  183. 183. Other practical tips• Sign up as worker and do some HITs• “Eat your own dog food”• Monitor discussion forums• Address feedback (e.g., poor guidelines, payments, passing grade, etc.)• Everything counts! – Overall design only as strong as weakest linkAugust 12, 2012 183
  184. 184. Conclusions• But one may say “this is all good but looks like a ton of work”• The original goal: data is king• Data quality and experimental designs are preconditions to make sure we get the right stuff• Data will later be used for rankers, ML models, evaluations, etc.• Don’t cut corners
  185. 185. THE ROAD AHEADAugust 12, 2012 185
  186. 186. What about sensitive data?• Not all data can be publicly disclosed – User data (e.g. AOL query log, Netflix ratings) – Intellectual property – Legal confidentiality• Need to restrict who is in your crowd – Separate channel (workforce) from technology – Hot question for adoption at enterprise levelAugust 12, 2012 186
  187. 187. Wisdom of Crowds (WoC)• Requires – Diversity – Independence – Decentralization – Aggregation• Input: large, diverse sample (to increase likelihood of overall pool quality)• Output: consensus or selection (aggregation)
  188. 188. WoC vs. Ensemble Learning• Combine multiple models to improve performance over any constituent model – Can use many weak learners to make a strong one – Compensate for poor models with extra computation• Works better with diverse, independent learners• cf. NIPS 2010-2011 Workshops – Computational Social Science & the Wisdom of Crowds• More investigation needed of traditional feature-based machine learning & ensemble methods for consensus labeling with crowdsourcing
  189. 189. Active Learning• Minimize number of labels to achieve goal accuracy rate of classifier – Select examples to label to maximize learning• Vijayanarasimhan and Grauman (CVPR 2011) – Simple margin criteria: select maximally uncertain examples to label next – Finding which examples are uncertain can be computationally intensive (workers have to wait) – Use locality-sensitive hashing to find uncertain examples in sub-linear timeAugust 12, 2012 189
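The simple margin criterion amounts to picking the examples whose classifier score lies closest to the decision boundary. The sketch below does this with a plain linear scan over hypothetical scores; it does not implement the locality-sensitive hashing that Vijayanarasimhan and Grauman use to make the search sub-linear, and the example ids and scores are invented.

```python
def select_most_uncertain(margins, k):
    """margins: dict example_id -> signed classifier score (distance to boundary).

    Returns the k examples with the smallest absolute margin.
    """
    return sorted(margins, key=lambda ex: abs(margins[ex]))[:k]

margins = {"img1": 2.3, "img2": -0.1, "img3": 0.4, "img4": -1.7, "img5": 0.05}
batch = select_most_uncertain(margins, k=2)
print(batch)
```

The scan costs time proportional to the size of the unlabeled pool on every iteration, which is exactly the delay that leaves workers waiting.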
  190. 190. Active Learning (2)• V&G report each learning iteration ~ 75 min – 15 minutes for model training & selection – 60 minutes waiting for crowd labels• Leaving workers idle may lose them, slowing uptake and completion times• Keep workers occupied – Mason and Suri (2010): paid waiting room – Laws et al. (EMNLP 2011): parallelize labeling and example selection via producer-consumer model • Workers consume examples, produce labels • Model consumes label, produces examplesAugust 12, 2012 190
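The producer-consumer arrangement can be sketched with two queues: workers take examples and return labels, while the model takes labels and emits the next examples, so neither side blocks on a full batch. This is a toy illustration and not the system of Laws et al.: the “worker” is a thread applying a made-up labeling rule, and example selection simply takes the next item from the pool where a real system would retrain and choose an uncertain example.

```python
import queue
import threading

examples = queue.Queue()   # model -> workers
labels = queue.Queue()     # workers -> model
STOP = object()            # sentinel telling the worker to quit

def crowd_worker():
    """Stand-in for a human: consumes examples, produces labels."""
    while True:
        ex = examples.get()
        if ex is STOP:
            break
        labels.put((ex, ex % 2 == 0))     # toy rule, not a real judgment

def model(pool, in_flight=2):
    """Consumes labels, produces examples; keeps `in_flight` examples queued."""
    pool = list(pool)
    collected = {}
    for _ in range(in_flight):            # seed the queue so workers start busy
        examples.put(pool.pop(0))
    outstanding = in_flight
    while outstanding:
        ex, label = labels.get()
        outstanding -= 1
        collected[ex] = label
        if pool:                          # a real model would retrain and select here
            examples.put(pool.pop(0))
            outstanding += 1
    examples.put(STOP)
    return collected

t = threading.Thread(target=crowd_worker)
t.start()
result = model(range(6))
t.join()
print(result)
```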
  191. 191. Query execution• So you want to combine CPU + HPU in a DB?• Crowd can answer difficult queries• Query processing with human computation• Long term goal – When to switch from CPU to HPU and vice versaAugust 12, 2012 191
  192. 192. MapReduce with human computation• Commonalities – Large task divided into smaller sub-problems – Work distributed among worker nodes (workers) – Collect all answers and combine them – Varying performance of heterogeneous CPUs/HPUs• Variations – Human response latency / size of “cluster” – Some tasks are not suitableAugust 12, 2012 192
  193. 193. A Few Questions• How should we balance automation vs. human computation? Which does what?• Who’s the right person for the job?• How do we handle complex tasks? Can we decompose them into smaller tasks? How?August 12, 2012 193
  194. 194. Research problems – operational• Methodology – Budget, people, document, queries, presentation, incentives, etc. – Scheduling – Quality• What’s the best “mix” of HC for a task?• What are the tasks suitable for HC?• Can I crowdsource my task? – Eickhoff and de Vries, WSDM 2011 CSDM WorkshopAugust 12, 2012 194
  195. 195. More problems• Human factors vs. outcomes• Editors vs. workers• Pricing tasks• Predicting worker quality from observable properties (e.g. task completion time)• HIT / Requestor ranking or recommendation• Expert search : who are the right workers given task nature and constraints• Ensemble methods for Crowd Wisdom consensusAugust 12, 2012 195
  196. 196. Problems: crowds, clouds and algorithms• Infrastructure – Current platforms are very rudimentary – No tools for data analysis• Dealing with uncertainty (propagate rather than mask) – Temporal and labeling uncertainty – Learning algorithms – Search evaluation – Active learning (which example is likely to be labeled correctly)• Combining CPU + HPU – Human Remote Call? – Procedural vs. declarative? – Integration points with enterprise systems August 12, 2012 196
  197. 197. Algorithms• Bandit problems; explore-exploit• Optimizing amount of work by workers – Humans have limited throughput – Harder to scale than machines• Selecting the right crowds• Stopping ruleAugust 12, 2012 197
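The explore-exploit trade-off in choosing whom to give work to can be illustrated with an epsilon-greedy bandit. Everything here is an assumption made for the sketch: each worker is an arm, a “reward” is a simulated correct answer on a checkable item, and the true accuracies are invented. The slide names the problem class only; it does not prescribe this algorithm.

```python
import random

def epsilon_greedy(true_accuracy, rounds=5000, epsilon=0.1, seed=0):
    """Assign `rounds` tasks; return how many each worker received."""
    rng = random.Random(seed)
    workers = list(true_accuracy)
    pulls = {w: 0 for w in workers}
    wins = {w: 0 for w in workers}
    for _ in range(rounds):
        untried = [w for w in workers if pulls[w] == 0]
        if untried:
            w = untried[0]                                   # try everyone once
        elif rng.random() < epsilon:
            w = rng.choice(workers)                          # explore
        else:
            w = max(workers, key=lambda x: wins[x] / pulls[x])   # exploit
        pulls[w] += 1
        wins[w] += int(rng.random() < true_accuracy[w])      # simulated check
    return pulls

pulls = epsilon_greedy({"w_good": 0.9, "w_mid": 0.6, "w_poor": 0.5})
print(pulls)
```

Unlike machines, the best worker has limited throughput, so a practical policy would also cap how much work any one person receives.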
  199. 199. What about ethics?• Silberman, Irani, and Ross (2010) – “How should we… conceptualize the role of these people who we ask to power our computing?” – Power dynamics between parties • What are the consequences for a worker when your actions harm their reputation? – “Abstraction hides detail”• Fort, Adda, and Cohen (2011) – “…opportunities for our community to deliberately value ethics above cost savings.” August 12, 2012 199
  200. 200. Example: SamaSourceAugust 12, 2012 200
  201. 201. Davis et al. (2010) The HPU. HPUAugust 12, 2012 201
  202. 202. HPU: “Abstraction hides detail”• Not just turning a mechanical crankAugust 12, 2012 202
  203. 203. Micro-tasks & Task Decomposition• Small, simple tasks can be completed faster by reducing extraneous context and detail – e.g. “Can you name who is in this photo?”• Current workflow research investigates how to decompose complex tasks into simpler onesAugust 12, 2012 203
  204. 204. Context & Informed Consent• What is the larger task I’m contributing to?• Who will benefit from it and how?August 12, 2012 204
  205. 205. What about the regulation?• Wolfson & Lease (ASIS&T 2011)• As usual, technology is ahead of the law – employment law – patent inventorship – data security and the Federal Trade Commission – copyright ownership – securities regulation of crowdfunding• Take-away: don’t panic, but be mindful – Understand risks of “just in-time compliance” August 12, 2012 205
  206. 206. Digital Dirty Jobs• NY Times: Policing the Web’s Lurid Precincts• Gawker: Facebook content moderation• CultureDigitally: The dirty job of keeping Facebook cleanAugust 12, 2012 206
  207. 207. Jeff Howe Vision vs. Reality?• Vision of empowering worker freedom: – work whenever you want for whomever you want• When $$$ is at stake, populations at risk may be compelled to perform work by others – Digital sweat shops? Digital slaves? – We really don’t know (and need to learn more…) – Traction? Human Trafficking at MSR Summit’12August 12, 2012 207
  209. 209. Putting the shoe on the other foot: SpamAugust 12, 2012 209
  210. 210. What about trust?• Some reports of robot “workers” on MTurk – E.g. McCreadie et al. (2011) – Violates terms of service• Why not just use a captcha?August 12, 2012 210
  211. 211. Captcha FraudAugust 12, 2012 211
  212. 212. Requester Fraud on MTurk“Do not do any HITs that involve: filling inCAPTCHAs; secret shopping; test our web page;test zip code; free trial; click my link; surveys orquizzes (unless the requester is listed with asmiley in the Hall of Fame/Shame); anythingthat involves sending a text message; orbasically anything that asks for any personalinformation at all—even your zip code. If youfeel in your gut it’s not on the level, IT’S NOT.Why? Because they are scams...”August 12, 2012 212
  213. 213. Defeating CAPTCHAs with crowdsAugust 12, 2012 213
  214. 214. Gaming the System: SEO, etc.
  215. 215. WWW’12August 12, 2012 215
  216. 216. Robert Sim, MSR Summit’12August 12, 2012 216
  217. 217. Conclusions• Crowdsourcing works and is here to stay• Fast turnaround, easy to experiment, cheap• Still have to design the experiments carefully!• Usability considerations• Worker quality• User feedback extremely usefulAugust 12, 2012 217
  218. 218. Conclusions - II• Lots of opportunities to improve current platforms• Integration with current systems• While MTurk first to-market in micro-task vertical, many other vendors are emerging with different affordances or value-added features• Many open research problems …August 12, 2012 218
  219. 219. Conclusions – III• Important to know your limitations and be ready to collaborate• Lots of different skills and expertise required – Social/behavioral science – Human factors – Algorithms – Economics – Distributed systems – StatisticsAugust 12, 2012 219
  220. 220. REFERENCES & RESOURCESAugust 12, 2012 220
  221. 221. Surveys• Ipeirotis, Panagiotis G., R. Chandrasekar, and P. Bennett. (2009). “A report on the human computation workshop (HComp).” ACM SIGKDD Explorations Newsletter 11(2).• Alex Quinn and Ben Bederson. Human Computation: A Survey and Taxonomy of a Growing Field. In Proceedings of CHI 2011.• Law and von Ahn (2011). Human Computation August 12, 2012 221
  222. 222. 2013 Events Planned• Research events – 1st year of HComp as AAAI conference – 2nd annual Collective Intelligence?• Industrial events – 4th CrowdConf (San Francisco, Fall) – 1st Crowdsourcing Week (Singapore, April)
  223. 223. TREC Crowdsourcing Track• Year 1 (2011) – horizontals – Task 1 (hci): collect crowd relevance judgments – Task 2 (stats): aggregate judgments – Organizers: Kazai & Lease – Sponsors: Amazon, CrowdFlower• Year 2 (2012) – content types – Task 1 (text): judge relevance – Task 2 (images): judge relevance – Organizers: Ipeirotis, Kazai, Lease, & Smucker – Sponsors: Amazon, CrowdFlower, MobileWorksAugust 12, 2012 223
  224. 224. 2012 Workshops & Conferences• AAAI: Human Computation (HComp) (July 22-23)• AAAI Spring Symposium: Wisdom of the Crowd (March 26-28)• ACL: 3rd Workshop of the People’s Web meets NLP (July 12-13)• AMCIS: Crowdsourcing Innovation, Knowledge, and Creativity in Virtual Communities (August 9-12)• CHI: CrowdCamp (May 5-6)• CIKM: Multimodal Crowd Sensing (CrowdSens) (Oct. or Nov.)• Collective Intelligence (April 18-20)• CrowdConf 2012 -- 3rd Annual Conference on the Future of Distributed Work (October 23)• CrowdNet - 2nd Workshop on Cloud Labor and Human Computation (Jan 26-27)• EC: Social Computing and User Generated Content Workshop (June 7)• ICDIM: Emerging Problem-specific Crowdsourcing Technologies (August 23)• ICEC: Harnessing Collective Intelligence with Games (September)• ICML: Machine Learning in Human Computation & Crowdsourcing (June 30)• ICWE: 1st International Workshop on Crowdsourced Web Engineering (CroWE) (July 27)• KDD: Workshop on Crowdsourcing and Data Mining (August 12)• Multimedia: Crowdsourcing for Multimedia (Nov 2)• SocialCom: Social Media for Human Computation (September 6)• TREC-Crowd: 2nd TREC Crowdsourcing Track (Nov. 14-16)• WWW: CrowdSearch: Crowdsourcing Web search (April 17)
  225. 225. Journal Special Issues 2012 – Springer’s Information Retrieval (articles now online): Crowdsourcing for Information Retrieval – IEEE Internet Computing (articles now online): Crowdsourcing (Sept./Oct. 2012) – Hindawi’s Advances in Multimedia Journal: Multimedia Semantics Analysis via Crowdsourcing GeocontextAugust 12, 2012 225
  226. 226. 2011 Workshops & Conferences• AAAI-HCOMP: 3rd Human Computation Workshop (Aug. 8)• ACIS: Crowdsourcing, Value Co-Creation, & Digital Economy Innovation (Nov. 30 – Dec. 2)• Crowdsourcing Technologies for Language and Cognition Studies (July 27)• CHI-CHC: Crowdsourcing and Human Computation (May 8)• CIKM: BooksOnline (Oct. 24, “crowdsourcing … online books”)• CrowdConf 2011 -- 2nd Conf. on the Future of Distributed Work (Nov. 1-2)• Crowdsourcing: Improving … Scientific Data Through Social Networking (June 13)• EC: Workshop on Social Computing and User Generated Content (June 5)• ICWE: 2nd International Workshop on Enterprise Crowdsourcing (June 20)• Interspeech: Crowdsourcing for speech processing (August)• NIPS: Second Workshop on Computational Social Science and the Wisdom of Crowds (Dec. TBD)• SIGIR-CIR: Workshop on Crowdsourcing for Information Retrieval (July 28)• TREC-Crowd: 1st TREC Crowdsourcing Track (Nov. 16-18)• UbiComp: 2nd Workshop on Ubiquitous Crowdsourcing (Sep. 18)• WSDM-CSDM: Crowdsourcing for Search and Data Mining (Feb. 9) August 12, 2012 226
  227. 227. 2011 Tutorials and Keynotes• By Omar Alonso and/or Matthew Lease – CLEF: Crowdsourcing for Information Retrieval Experimentation and Evaluation (Sep. 20, Omar only) – CrowdConf: Crowdsourcing for Research and Engineering – IJCNLP: Crowd Computing: Opportunities and Challenges (Nov. 10, Matt only) – WSDM: Crowdsourcing 101: Putting the WSDM of Crowds to Work for You (Feb. 9) – SIGIR: Crowdsourcing for Information Retrieval: Principles, Methods, and Applications (July 24)• AAAI: Human Computation: Core Research Questions and State of the Art – Edith Law and Luis von Ahn, August 7• ASIS&T: How to Identify Ducks In Flight: A Crowdsourcing Approach to Biodiversity Research and Conservation – Steve Kelling, October 10, ebird• EC: Conducting Behavioral Research Using Amazon’s Mechanical Turk – Winter Mason and Siddharth Suri, June 5• HCIC: Quality Crowdsourcing for Human Computer Interaction Research – Ed Chi, June 14-18 (about HCIC) – Also see his: Crowdsourcing for HCI Research with Amazon Mechanical Turk• Multimedia: Frontiers in Multimedia Search – Alan Hanjalic and Martha Larson, Nov 28• VLDB: Crowdsourcing Applications and Platforms – Anhai Doan, Michael Franklin, Donald Kossmann, and Tim Kraska• WWW: Managing Crowdsourced Human Computation – Panos Ipeirotis and Praveen Paritosh
  228. 228. Thank You!• Crowdsourcing news & information• For further questions, contact us at: ml@ischool.utexas.edu• Cartoons by Mateo Burtch
  229. 229. Additional Literature Reviews• Man-Ching Yuen, Irwin King, and Kwong-Sak Leung. A Survey of Crowdsourcing Systems. SocialCom 2011.• A. Doan, R. Ramakrishnan, A. Halevy. Crowdsourcing Systems on the World-Wide Web. Communications of the ACM, 2011.August 12, 2012 229
  230. 230. More Books• July 2010, kindle-only: “This book introduces you to the top crowdsourcing sites and outlines step by step with photos the exact process to get started as a requester on Amazon Mechanical Turk.”
  231. 231. Resources• A Few Blogs – Behind Enemy Lines (P.G. Ipeirotis, NYU) – Deneme: a Mechanical Turk experiments blog (Greg Little, MIT) – CrowdFlower Blog – Jeff Howe• A Few Sites – The Crowdsortium – CrowdsourceBase (for workers) – Daily Crowdsource• MTurk Forums and Resources – Turker Nation (and its blog) – Turkopticon: report/avoid shady requestors – Amazon Forum for MTurk
  232. 232. Bibliography• J. Barr and L. Cabrera. “AI gets a Brain”, ACM Queue, May 2006.• Bernstein, M. et al. Soylent: A Word Processor with a Crowd Inside. UIST 2010. Best Student Paper award.• Bederson, B.B., Hu, C., & Resnik, P. Translation by Iterative Collaboration between Monolingual Users, Proceedings of Graphics Interface (GI 2010), 39-46.• N. Bradburn, S. Sudman, and B. Wansink. Asking Questions: The Definitive Guide to Questionnaire Design, Jossey-Bass, 2004.• C. Callison-Burch. “Fast, Cheap, and Creative: Evaluating Translation Quality Using Amazon’s Mechanical Turk”, EMNLP 2009.• P. Dai, Mausam, and D. Weld. “Decision-Theoretic Control of Crowd-Sourced Workflows”, AAAI, 2010.• J. Davis et al. “The HPU”, IEEE Computer Vision and Pattern Recognition Workshop on Advancing Computer Vision with Human in the Loop (ACVHL), June 2010.• M. Gashler, C. Giraud-Carrier, T. Martinez. Decision Tree Ensemble: Small Heterogeneous Is Better Than Large Homogeneous, ICMLA 2008.• D. A. Grier. When Computers Were Human. Princeton University Press, 2005. ISBN 0691091579• S. Hacker and L. von Ahn. “Matchin: Eliciting User Preferences with an Online Game”, CHI 2009.• J. Heer, M. Bostock. “Crowdsourcing Graphical Perception: Using Mechanical Turk to Assess Visualization Design”, CHI 2010.• P. Heymann and H. Garcia-Molina. “Human Processing”, Technical Report, Stanford Info Lab, 2010.• J. Howe. “Crowdsourcing: Why the Power of the Crowd Is Driving the Future of Business”. Crown Business, New York, 2008.• P. Hsueh, P. Melville, V. Sindhwani. “Data Quality from Crowdsourcing: A Study of Annotation Selection Criteria”. NAACL HLT Workshop on Active Learning and NLP, 2009.• B. Huberman, D. Romero, and F. Wu. “Crowdsourcing, attention and productivity”. Journal of Information Science, 2009.• P.G. Ipeirotis. The New Demographics of Mechanical Turk. March 9, 2010. PDF and Spreadsheet.• P.G. Ipeirotis, R. Chandrasekar and P. Bennett. Report on the human computation workshop. SIGKDD Explorations v11 no 2 pp. 80-83, 2010.• P.G. Ipeirotis. Analyzing the Amazon Mechanical Turk Marketplace. CeDER-10-04 (Sept. 11, 2010)
  233. 233. Bibliography (2)• A. Kittur, E. Chi, and B. Suh. “Crowdsourcing user studies with Mechanical Turk”, SIGCHI 2008.• Aniket Kittur, Boris Smus, Robert E. Kraut. CrowdForge: Crowdsourcing Complex Work. CHI 2011• Adriana Kovashka and Matthew Lease. “Human and Machine Detection of … Similarity in Art”. CrowdConf 2010.• K. Krippendorff. “Content Analysis”, Sage Publications, 2003• G. Little, L. Chilton, M. Goldman, and R. Miller. “TurKit: Tools for Iterative Tasks on Mechanical Turk”, HCOMP 2009.• T. Malone, R. Laubacher, and C. Dellarocas. Harnessing Crowds: Mapping the Genome of Collective Intelligence. 2009.• W. Mason and D. Watts. “Financial Incentives and the ’Performance of Crowds’”, HCOMP Workshop at KDD 2009.• J. Nielsen. “Usability Engineering”, Morgan Kaufmann, 1994.• A. Quinn and B. Bederson. “A Taxonomy of Distributed Human Computation”, Technical Report HCIL-2009-23, 2009• J. Ross, L. Irani, M. Six Silberman, A. Zaldivar, and B. Tomlinson. “Who are the Crowdworkers?: Shifting Demographics in Amazon Mechanical Turk”. CHI 2010.• F. Scheuren. “What is a Survey”, 2004.• R. Snow, B. O’Connor, D. Jurafsky, and A. Y. Ng. “Cheap and Fast But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks”. EMNLP-2008.• V. Sheng, F. Provost, P. Ipeirotis. “Get Another Label? Improving Data Quality … Using Multiple, Noisy Labelers” KDD 2008.• S. Weber. “The Success of Open Source”, Harvard University Press, 2004.• L. von Ahn. Games with a purpose. Computer, 39 (6), 92–94, 2006.• L. von Ahn and L. Dabbish. “Designing Games with a purpose”. CACM, Vol. 51, No. 8, 2008.
  234. 234. Bibliography (3) Shuo Chen et al. What if the Irresponsible Teachers Are Dominating? A Method of Training on Samples and Clustering on Teachers. AAAI 2010. Paul Heymann, Hector Garcia-Molina: Turkalytics: analytics for human computation. WWW 2011. Florian Laws, Christian Scheible and Hinrich Schütze. Active Learning with Amazon Mechanical Turk. EMNLP 2011. C.Y. Lin. Rouge: A package for automatic evaluation of summaries. Proceedings of the workshop on text summarization branches out (WAS), 2004. C. Marshall and F. Shipman “The Ownership and Reuse of Visual Media”, JCDL, 2011. Hohyon Ryu and Matthew Lease. Crowdworker Filtering with Support Vector Machine. ASIS&T 2011. Wei Tang and Matthew Lease. Semi-Supervised Consensus Labeling for Crowdsourcing. ACM SIGIR Workshop on Crowdsourcing for Information Retrieval (CIR), 2011. S. Vijayanarasimhan and K. Grauman. Large-Scale Live Active Learning: Training Object Detectors with Crawled Data and Crowds. CVPR 2011. Stephen Wolfson and Matthew Lease. Look Before You Leap: Legal Pitfalls of Crowdsourcing. ASIS&T 2011.August 12, 2012 234
  235. 235. Recent Work• Della Penna, N, and M D Reid. (2012). “Crowd & Prejudice: An Impossibility Theorem for Crowd Labelling without a Gold Standard.” in Proceedings of Collective Intelligence. arXiv preprint arXiv:1204.3511.• Demartini, Gianluca, D.E. Difallah, and P. Cudre-Mauroux. (2012). “ZenCrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking.” 21st Annual Conference on the World Wide Web (WWW).• Donmez, Pinar, Jaime Carbonell, and Jeff Schneider. (2010). “A probabilistic framework to learn from multiple annotators with time-varying accuracy.” in SIAM International Conference on Data Mining (SDM), 826-837.• Donmez, Pinar, Jaime Carbonell, and Jeff Schneider. (2009). “Efficiently learning the accuracy of labeling sources for selective sampling.” in Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD), 259-268.• Fort, K., Adda, G., and Cohen, K. (2011). Amazon Mechanical Turk: Gold mine or coal mine? Computational Linguistics, 37(2):413–420.• Ghosh, A, Satyen Kale, and Preston McAfee. (2012). “Who Moderates the Moderators? Crowdsourcing Abuse Detection in User-Generated Content.” in Proceedings of the 12th ACM conference on Electronic commerce.• Ho, C J, and J W Vaughan. (2012). “Online Task Assignment in Crowdsourcing Markets.” in Twenty-Sixth AAAI Conference on Artificial Intelligence.• Jung, Hyun Joon, and Matthew Lease. (2012). “Inferring Missing Relevance Judgments from Crowd Workers via Probabilistic Matrix Factorization.” in Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval.• Kamar, E, S Hacker, and E Horvitz. (2012). “Combining Human and Machine Intelligence in Large-scale Crowdsourcing.” in Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems (AAMAS).• Karger, D R, S Oh, and D Shah. (2011). “Budget-optimal task allocation for reliable crowdsourcing systems.” arXiv preprint arXiv:1110.3564.• Kazai, Gabriella, Jaap Kamps, and Natasa Milic-Frayling. (2012). “An Analysis of Human Factors and Label Accuracy in Crowdsourcing Relevance Judgments.” Springer's Information Retrieval Journal: Special Issue on Crowdsourcing.
  236. 236. Recent Work (2)• Lin, C.H., Mausam, and Weld, D.S. (2012). “Crowdsourcing Control: Moving Beyond Multiple Choice.” in Proceedings of the 4th Human Computation Workshop (HCOMP) at AAAI.• Liu, C, and Y M Wang. (2012). “TrueLabel + Confusions: A Spectrum of Probabilistic Models in Analyzing Multiple Ratings.” in Proceedings of the 29th International Conference on Machine Learning (ICML).• Liu, Di, Randolph Bias, Matthew Lease, and Rebecca Kuipers. (2012). “Crowdsourcing for Usability Testing.” in Proceedings of the 75th Annual Meeting of the American Society for Information Science and Technology (ASIS&T).• Ramesh, A, A Parameswaran, Hector Garcia-Molina, and Neoklis Polyzotis. (2012). Identifying Reliable Workers Swiftly.• Raykar, Vikas, Yu, S., Zhao, L.H., Valadez, G.H., Florin, C., Bogoni, L., and Moy, L. (2010). “Learning From Crowds.” Journal of Machine Learning Research 11:1297-1322.• Raykar, Vikas, Yu, S., Zhao, L.H., Jerebko, A., Florin, C., Valadez, G.H., Bogoni, L., and Moy, L. (2009). “Supervised learning from multiple experts: whom to trust when everyone lies a bit.” in Proceedings of the 26th Annual International Conference on Machine Learning (ICML), 889-896.• Raykar, Vikas C, and Shipeng Yu. (2012). “Eliminating Spammers and Ranking Annotators for Crowdsourced Labeling Tasks.” Journal of Machine Learning Research 13:491-518.• Wauthier, Fabian L., and Michael I. Jordan. (2012). “Bayesian Bias Mitigation for Crowdsourcing.” in Advances in Neural Information Processing Systems (NIPS).• Weld, D.S., Mausam, and Dai, P. (2011). “Execution control for crowdsourcing.” in Proceedings of the 24th ACM symposium adjunct on User interface software and technology (UIST).• Weld, D.S., Mausam, and Dai, P. (2011). “Human Intelligence Needs Artificial Intelligence.” in Proceedings of the 3rd Human Computation Workshop (HCOMP) at AAAI.• Welinder, Peter, Steve Branson, Serge Belongie, and Pietro Perona. (2010). “The Multidimensional Wisdom of Crowds.” in Advances in Neural Information Processing Systems (NIPS), 2424-2432.• Welinder, Peter, and Pietro Perona. (2010). “Online crowdsourcing: rating annotators and obtaining cost-effective labels.” in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 25-32.• Whitehill, J, P Ruvolo, T Wu, J Bergsma, and J Movellan. (2009). “Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise.” in Advances in Neural Information Processing Systems (NIPS).• Yan, Y, and R Rosales. (2011). “Active learning from crowds.” in Proceedings of the 28th Annual International Conference on Machine Learning (ICML).
  237. 237. Crowdsourcing in IR: 2008-2010 2008 • O. Alonso, D. Rose, and B. Stewart. “Crowdsourcing for relevance evaluation”, SIGIR Forum, Vol. 42, No. 2. 2009 • O. Alonso and S. Mizzaro. “Can we get rid of TREC Assessors? Using Mechanical Turk for … Assessment”. SIGIR Workshop on the Future of IR Evaluation. • P.N. Bennett, D.M. Chickering, A. Mityagin. Learning Consensus Opinion: Mining Data from a Labeling Game. WWW. • G. Kazai, N. Milic-Frayling, and J. Costello. “Towards Methods for the Collective Gathering and Quality Control of Relevance Assessments”, SIGIR. • G. Kazai and N. Milic-Frayling. “… Quality of Relevance Assessments Collected through Crowdsourcing”. SIGIR Workshop on the Future of IR Evaluation. • Law et al. “SearchWar”. HCOMP. • H. Ma, R. Chandrasekar, C. Quirk, and A. Gupta. “Improving Search Engines Using Human Computation Games”, CIKM 2009. 2010 • SIGIR Workshop on Crowdsourcing for Search Evaluation. • O. Alonso, R. Schenkel, and M. Theobald. “Crowdsourcing Assessments for XML Ranked Retrieval”, ECIR. • K. Berberich, S. Bedathur, O. Alonso, G. Weikum. “A Language Modeling Approach for Temporal Information Needs”, ECIR. • C. Grady and M. Lease. “Crowdsourcing Document Relevance Assessment with Mechanical Turk”. NAACL HLT Workshop on … Amazon's Mechanical Turk. • Grace Hui Yang, Anton Mityagin, Krysta M. Svore, and Sergey Markov. “Collecting High Quality Overlapping Labels at Low Cost”. SIGIR. • G. Kazai. “An Exploration of the Influence that Task Parameters Have on the Performance of Crowds”. CrowdConf. • G. Kazai. “… Crowdsourcing in Building an Evaluation Platform for Searching Collections of Digitized Books”. Workshop on Very Large Digital Libraries (VLDL). • Stefanie Nowak and Stefan Rüger. How Reliable are Annotations via Crowdsourcing? MIR. • Jean-François Paiement, Dr. James G. Shanahan, and Remi Zajac. “Crowdsourcing Local Search Relevance”. CrowdConf. • Maria Stone and Omar Alonso. “A Comparison of On-Demand Workforce with Trained Judges for Web Search Relevance Evaluation”. CrowdConf. • T. Yan, V. Kumar, and D. Ganesan. CrowdSearch: exploiting crowds for accurate real-time image search on mobile phones. MobiSys, pp. 77–90, 2010.