Crowdsourcing & Human Computation: Labeling Data & Building Hybrid Systems


Tutorial given at the SIAM Data Mining Conference, May 3, 2013. Based on earlier tutorials given jointly with Omar Alonso from Microsoft Bing.


  1. Matt Lease (@mattlease, ml@ischool.utexas.edu), School of Information, University of Texas at Austin. Crowdsourcing & Human Computation: Labeling Data & Building Hybrid Systems. Slides:
  2. Roadmap
     • A Quick Example
     • Crowd-powered data collection & applications
     • Crowdsourcing, Incentives, & Demographics
     • Mechanical Turk & Other Platforms
     • Designing for Crowds & Statistical QA
     • Open Problems
     • Broader Considerations & a Darker Side
  3. What is Crowdsourcing?
     • Let's start with a simple example!
     • Goal
       – See a concrete example of real crowdsourcing
       – Ground later discussion of abstract concepts
       – Provide a specific example with which we will contrast other forms of crowdsourcing
  4. Human Intelligence Tasks (HITs)
  6. Jane saw the man with the binoculars
  7. Traditional Data Collection
     • Set up data collection software / harness
     • Recruit participants / annotators / assessors
     • Pay a flat fee for the experiment or an hourly wage
     • Characteristics
       – Slow
       – Expensive
       – Difficult and/or Tedious
       – Sample Bias…
  8. "Hello World" Demo
     • Let's create and run a simple MTurk HIT
     • This is a teaser highlighting concepts
       – Don't worry about details; we'll revisit them
     • Goal
       – See a concrete example of real crowdsourcing
       – Ground our later discussion of abstract concepts
       – Provide a specific example with which we will contrast other forms of crowdsourcing
  9. DEMO
  11. NLP: Snow et al. (EMNLP 2008)
     • MTurk annotation for 5 tasks
       – Affect recognition
       – Word similarity
       – Recognizing textual entailment
       – Event temporal ordering
       – Word sense disambiguation
     • 22K labels for US $26
     • High agreement between consensus labels and gold-standard labels
  12. Computer Vision: Sorokin & Forsyth (CVPR 2008)
     • 4K labels for US $60
  13. IR: Alonso et al. (SIGIR Forum 2008)
     • MTurk for Information Retrieval (IR)
       – Judge relevance of search engine results
     • Many follow-on studies (design, quality, cost)
  14. User Studies: Kittur, Chi, & Suh (CHI 2008)
     • "…make creating believable invalid responses as effortful as completing the task in good faith."
  15. Remote Usability Testing
     • Liu, Bias, Lease, and Kuipers, ASIS&T 2012
     • Remote usability testing via MTurk & CrowdFlower vs. traditional on-site testing
     • Advantages
       – More (Diverse) Participants
       – High Speed
       – Low Cost
     • Disadvantages
       – Lower Quality Feedback
       – Less Interaction
       – Greater need for quality control
       – Less Focused User Groups
  17. Human Subjects Research: Surveys, Demographics, etc.
     • A Guide to Behavioral Experiments on Mechanical Turk
       – W. Mason and S. Suri (2010). SSRN online.
     • Crowdsourcing for Human Subjects Research
       – L. Schmidt (CrowdConf 2010)
     • Crowdsourcing Content Analysis for Behavioral Research: Insights from Mechanical Turk
       – Conley & Tosti-Kharas (2010). Academy of Management
     • Amazon's Mechanical Turk: A New Source of Inexpensive, Yet High-Quality, Data?
       – M. Buhrmester et al. (2011). Perspectives… 6(1):3-5.
       – See also: Amazon Mechanical Turk Guide for Social Scientists
  18. Luis von Ahn, CMU
     • PhD Thesis, December 2005
     • Law & von Ahn, Book, June 2011
  19. ESP Game (Games With a Purpose)
     • L. von Ahn and L. Dabbish (2004)
  20. reCaptcha
     • L. von Ahn et al. (2008). In Science.
  21. DuoLingo (Launched Nov. 2011)
  23. Crowd Sensing
     • Steve Kelling et al. A Human/Computer Learning Network to Improve Biodiversity Conservation and Research. AI Magazine 34.1 (2012): 10.
  24. Tracking Sentiment in Online Media (Brew et al., PAIS 2010)
     • Volunteer crowd
     • Judge in exchange for access to rich content
     • Balance system needs with user interest
     • Daily updates to non-stationary distribution
  26. What is a Computer?
  27. Human Computation
     • When Computers Were Human. Princeton University Press, 2005
     • What was old is new
     • Crowdsourcing: A New Branch of Computer Science
       – D.A. Grier, March 29, 2011
     • Tabulating the heavens: computing the Nautical Almanac in 18th-century England
       – M. Croarken '03
  28. The Mechanical Turk
     • J. Pontin. Artificial Intelligence, With Help From the Humans. New York Times (March 25, 2007)
     • Constructed and unveiled in 1770 by Wolfgang von Kempelen (1734–1804)
  29. The Human Processing Unit (HPU)
     • Davis et al. (2010)
  30. Human Computation
     • Having people do stuff instead of computers
     • Investigates use of people to execute certain computations for which the capabilities of current automated methods are more limited
     • Explores the metaphor of computation for characterizing attributes, capabilities, and limitations of human task performance
  32. Crowd-Assisted Search: "Amazon Remembers"
  33. Translation by monolingual speakers
     • C. Hu, CHI 2009
  34. Soylent: A Word Processor with a Crowd Inside
     • Bernstein et al., UIST 2010
  35. fold.it
     • S. Cooper et al. (2010)
     • Alice G. Walton. Online Gamers Help Solve Mystery of Critical AIDS Virus Enzyme. The Atlantic, October 8, 2011.
  36. PlateMate (Noronha et al., UIST '11)
  37. Image Analysis and more: Eatery
  38. VizWiz
     • Bigham et al. (UIST 2010)
  40. Crowd Sensing: Waze
  43. From Outsourcing to Crowdsourcing
     • Take a job traditionally performed by a known agent (often an employee)
     • Outsource it to an undefined, generally large group of people via an open call
     • New application of principles from the open source movement
     • Evolving & broadly defined...
  44. Crowdsourcing models
     • Micro-tasks & citizen science
     • Co-Creation
     • Open Innovation, Contests
     • Prediction Markets
     • Crowd Funding and Charity
     • "Gamification" (not serious gaming)
     • Transparent
     • cQ&A, Social Search, and Polling
     • Physical Interface/Task
  45. What is Crowdsourcing?
     • Mechanisms and methodology for directing crowd action to achieve some goal(s)
       – E.g., novel ways of collecting data from crowds
     • Powered by internet connectivity
     • Related topics:
       – Human computation
       – Collective intelligence
       – Crowd/Social computing
       – Wisdom of Crowds
       – People services, Human Clouds, Peer-production, …
  46. What is not crowdsourcing?
     • Analyzing existing datasets (no matter the source)
       – Data mining
       – Visual analytics
     • Use of few people
       – Mixed-initiative design
       – Active learning
     • Conducting a survey or poll… (*)
       – Novelty?
  47. Crowdsourcing Key Questions
     • What are the goals?
       – Purposeful directing of human activity
     • How can you incentivize participation?
       – Incentive engineering
       – Who are the target participants?
     • Which model(s) are most appropriate?
       – How to adapt them to your context and goals?
  48. Wisdom of Crowds (WoC)
     • Requires
       – Diversity
       – Independence
       – Decentralization
       – Aggregation
     • Input: large, diverse sample (to increase likelihood of overall pool quality)
     • Output: consensus or selection (aggregation)
  49. What do you want to accomplish?
     • Create
     • Execute task/computation
     • Fund
     • Innovate and/or discover
     • Learn
     • Monitor
     • Predict
  51. Why should your crowd participate?
     • Earn Money (real or virtual)
     • Have fun (or pass the time)
     • Socialize with others
     • Obtain recognition or prestige (leaderboards, badges)
     • Do Good (altruism)
     • Learn something new
     • Obtain something else
     • Create self-serving resource
     Multiple incentives can often operate in parallel (*caveat)
  52. Example: Wikipedia
     • Earn Money (real or virtual)
     • Have fun (or pass the time)
     • Socialize with others
     • Obtain recognition or prestige
     • Do Good (altruism)
     • Learn something new
     • Obtain something else
     • Create self-serving resource
  53. Example: DuoLingo (incentive checklist repeated)
  54. Example: (incentive checklist repeated)
  55. Example: ESP (incentive checklist repeated)
  56. Example: (incentive checklist repeated)
  57. Example: FreeRice (incentive checklist repeated)
  58. Example: cQ&A (incentive checklist repeated)
  59. Example: reCaptcha (incentive checklist repeated)
     • Is there an existing human activity you can harness for another purpose?
  60. Example: Mechanical Turk (incentive checklist repeated)
  61. Dan Pink, YouTube video: "The Surprising Truth About What Motivates Us"
  62. Who are the workers?
     • A. Baio, November 2008. The Faces of Mechanical Turk.
     • P. Ipeirotis. March 2010. The New Demographics of Mechanical Turk.
     • J. Ross, et al. Who are the Crowdworkers?... CHI 2010.
  63. MTurk Demographics
     • 2008–2009 studies found workers less global and diverse than previously thought
       – US
       – Female
       – Educated
       – Bored
       – Money is secondary
  64. 2010 shows increasing diversity
     • 47% US, 34% India, 19% other (P. Ipeirotis, March 2010)
  65. How Much to Pay?
     • Price commensurate with task effort
       – Ex: $0.02 for a yes/no answer + $0.02 bonus for optional feedback
     • Ethics & market factors: W. Mason and S. Suri, 2010.
       – E.g., non-profit SamaSource involves workers in refugee camps
       – Predict the right price given market & task: Wang et al. CSDM '11
     • Uptake & time-to-completion vs. cost & quality
       – Too little $$: no interest, or slow; too much $$: attract spammers
       – Real problem is the lack of a reliable QA substrate
     • Accuracy & quantity
       – More pay = more work, not better work (W. Mason and D. Watts, 2009)
     • Heuristics: start small, watch uptake and bargaining feedback
     • Worker retention ("anchoring")
     See also: L.B. Chilton et al. KDD-HCOMP 2010.
  67. Does anyone really use it? Yes! (P. Ipeirotis '10)
     • From 1/09 – 4/10, 7M HITs from 10K requesters, worth $500,000 USD (significant under-estimate)
  68. MTurk: The Requester
     • Sign up with your Amazon account
     • Amazon payments
     • Purchase prepaid HITs
     • There is no minimum or up-front fee
     • MTurk collects a 10% commission
     • The minimum commission charge is $0.005 per HIT
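The fee structure above implies a simple cost calculation. A minimal sketch using the rates quoted on the slide (the 10% commission and $0.005 minimum are the 2013 figures; Amazon's current fee schedule differs):

```python
def requester_cost(n_hits, reward_per_hit, commission=0.10, min_fee=0.005):
    """Total requester cost: worker rewards plus MTurk's commission.

    Rates default to those quoted on the slide (2013); current fees differ.
    """
    # Commission is a percentage of the reward, floored at the per-HIT minimum.
    fee_per_hit = max(reward_per_hit * commission, min_fee)
    return n_hits * (reward_per_hit + fee_per_hit)

# 1,000 HITs at $0.02 each: $20 in rewards + $5 in (minimum) fees
print(round(requester_cost(1000, 0.02), 2))  # prints 25.0
```

Note that at a $0.02 reward, the $0.005 minimum (25%) dominates the nominal 10% commission, which is one reason per-HIT pricing matters for micro-tasks.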
  69. MTurk Dashboard
     • Three tabs
       – Design
       – Publish
       – Manage
     • Design: HIT template
     • Publish: make work available
     • Manage: monitor progress
  71. MTurk Dashboard – II
  72. MTurk API
     • Amazon Web Services API
     • Rich set of services
     • Command line tools
     • More flexibility than the dashboard
  73. MTurk Dashboard vs. API
     • Dashboard
       – Easy to prototype
       – Set up and launch an experiment in a few minutes
     • API
       – Ability to integrate AMT as part of a system
       – Ideal if you want to run experiments regularly
       – Schedule tasks
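As a sketch of the API route: a common integration pattern is to host the task UI on your own server and point a HIT at it via an ExternalQuestion XML payload passed to the CreateHIT operation. The URL and frame height below are placeholder values, and the schema URL is the one used by the classic MTurk API; check the current API reference before relying on it:

```python
from xml.sax.saxutils import escape

# Schema URL used by the classic MTurk API's ExternalQuestion payload.
SCHEMA = ("http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/"
          "2006-07-14/ExternalQuestion.xsd")

def external_question(task_url, frame_height=400):
    """Build the ExternalQuestion XML passed to CreateHIT for tasks whose
    UI is hosted outside MTurk (URL and height here are placeholders)."""
    return (
        f'<ExternalQuestion xmlns="{SCHEMA}">'
        f"<ExternalURL>{escape(task_url)}</ExternalURL>"
        f"<FrameHeight>{frame_height}</FrameHeight>"
        f"</ExternalQuestion>"
    )

print(external_question("https://example.com/relevance-hit?doc=42"))
```

The same string can then be submitted through whichever SDK or command-line tool you use for the rest of the CreateHIT parameters (reward, title, assignment count, etc.).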
  74. • Multiple channels
     • Gold-based tests
     • Only pay for "trusted" judgments
  75. More Crowd Labor Platforms
     • Clickworker
     • CloudCrowd
     • CloudFactory
     • CrowdSource
     • DoMyStuff
     • Microtask
     • MobileWorks (by Anand Kulkarni)
     • myGengo
     • SmartSheet
     • vWorker
     • Industry heavy-weights
       – Elance
       – Liveops
       – oDesk
       – uTest
     • and more…
  76. Many Factors Matter!
     • Process
       – Task design, instructions, setup, iteration
     • Choose a crowdsourcing platform (or roll your own)
     • Human factors
       – Payment / incentives, interface and interaction design, communication, reputation, recruitment, retention
     • Quality Control / Data Quality
       – Trust, reliability, spam detection, consensus labeling
     • Don't write a paper saying "we collected data from MTurk & then…" – details of method matter!
  78. PlateMate – Architecture
  79. Turkomatic
     • Kulkarni et al., CSCW 2012
  80. CrowdForge: workers perform a task or further decompose it
     • Kittur et al., CHI 2011
  81. CrowdWeaver
     • Kittur et al., CSCW 2012
  83. Typical Workflow
     • Define and design what to test
     • Sample data
     • Design the experiment
     • Run the experiment
     • Collect data and analyze results
     • Quality control
  84. Development Framework
     • Incremental approach (from Omar Alonso)
     • Measure, evaluate, and adjust as you go
     • Suitable for repeatable tasks
  85. Survey Design
     • One of the most important parts
     • Part art, part science
     • Instructions are key
     • Prepare to iterate
  86. Questionnaire Design
     • Ask the right questions
     • Workers may not be IR experts, so don't assume the same understanding of terminology
     • Show examples
     • Hire a technical writer
       – Engineer writes the specification
       – Writer communicates
  87. UX Design
     • Time to apply all those usability concepts
     • Generic tips
       – Experiment should be self-contained
       – Keep it short and simple; brief and concise
       – Be very clear with the relevance task
       – Engage with the worker; avoid boring stuff
       – Always ask for feedback (open-ended question) in an input box
  88. UX Design – II
     • Presentation
     • Document design
     • Highlight important concepts
     • Colors and fonts
     • Need to grab attention
     • Localization
  89. Implementation
     • Similar to a UX
     • Build a mock-up and test it with your team
       – Yes, you need to judge some tasks
     • Incorporate feedback and run a test on MTurk with a very small data set
       – Time the experiment
       – Do people understand the task?
     • Analyze results
       – Look for spammers
       – Check completion times
     • Iterate and modify accordingly
  90. Implementation – II
     • Introduce quality control
       – Qualification test
       – Gold answers (honey pots)
     • Adjust the passing grade and worker approval rate
     • Run the experiment with new settings & the same data
     • Scale on data
     • Scale on workers
  91. Other design principles
     • Text alignment
     • Legibility
     • Reading level: complexity of words and sentences
     • Attractiveness (worker's attention & enjoyment)
     • Multi-cultural / multi-lingual
     • Who is the audience (e.g., target worker community)?
       – Special-needs communities (e.g., simple color blindness)
     • Parsimony
     • Cognitive load: mental rigor needed to perform the task
     • Exposure effect
  92. The human side
     • As a worker
       – I hate when instructions are not clear
       – I'm not a spammer; I just don't get what you want
       – Boring task
       – Good pay is ideal but not the only condition for engagement
     • As a requester
       – Attrition
       – Balancing act: a task that would produce the right results and is appealing to workers
       – I want your honest answer for the task
       – I want qualified workers; the system should do some of that for me
     • Managing crowds and tasks is a daily activity
       – More difficult than managing computers
  94. When to assess quality of work
     • Beforehand (prior to main task activity)
       – How: "qualification tests" or a similar mechanism
       – Purpose: screening, selection, recruiting, training
     • During
       – How: assess labels as the worker produces them (like random checks on a manufacturing line)
       – Purpose: calibrate, reward/penalize, weight
     • After
       – How: compute accuracy metrics post-hoc
       – Purpose: filter, calibrate, weight, retain (HR)
       – E.g., Jung & Lease (2011), Tang & Lease (2011), ...
  95. How do we measure work quality?
     • Compare the worker's label vs.
       – Known (correct, trusted) labels
       – Other workers' labels
         • P. Ipeirotis. Worker Evaluation in Crowdsourcing: Gold Data or Multiple Workers? Sept. 2010.
       – Model predictions of the above
         • Model the labels (Ryu & Lease, ASIS&T '11)
         • Model the workers (Chen et al., AAAI '10)
     • Verify the worker's label
       – Yourself
       – Tiered approach (e.g., Find-Fix-Verify)
         • Quinn and Bederson '09, Bernstein et al. '10
  96. Typical Assumptions
     • Objective truth exists
       – No minority voice / rare insights
       – Can relax this to model a "truth distribution"
     • Automatic answer comparison/evaluation
       – What about free-text responses? Hope from NLP…
         • Automatic essay scoring
         • Translation (BLEU: Papineni, ACL 2002)
         • Summarization (ROUGE: C.Y. Lin, WAS 2004)
       – Have people do it (yourself, or a find-verify crowd, etc.)
  97. Distinguishing Bias vs. Noise
     • Ipeirotis (HComp 2010)
     • People often have consistent, idiosyncratic skews in their labels (bias)
       – E.g., I like action movies, so they get higher ratings
     • Once detected, systematic bias can be calibrated for and corrected (yeah!)
     • Noise, however, seems random & inconsistent
       – This is the real issue we want to focus on
  98. Comparing to known answers
     • AKA: gold, honey pot, verifiable answer, trap
     • Assumes you have known answers
     • Cost vs. benefit
       – Producing known answers (experts?)
       – % of work spent re-producing them
     • Finer points
       – Controls against collusion
       – What if workers recognize the honey pots?
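One way to operationalize the honey-pot idea is to score each worker only on the trap items with known answers, then act on workers whose trap accuracy falls below a threshold. A minimal sketch (the worker IDs, items, labels, and the 0.8 threshold are all illustrative):

```python
def honeypot_accuracy(worker_labels, gold):
    """Per-worker accuracy on the subset of items with known (gold) answers."""
    scores = {}
    for worker, labels in worker_labels.items():
        trapped = [(item, label) for item, label in labels.items() if item in gold]
        if trapped:  # only score workers who saw at least one trap
            scores[worker] = sum(label == gold[item]
                                 for item, label in trapped) / len(trapped)
    return scores

gold = {"q1": "relevant", "q2": "non-relevant"}  # trap questions
worker_labels = {
    "w1": {"q1": "relevant", "q2": "non-relevant", "q3": "relevant"},
    "w2": {"q1": "non-relevant", "q2": "non-relevant", "q3": "relevant"},
}
scores = honeypot_accuracy(worker_labels, gold)
trusted = {w for w, acc in scores.items() if acc >= 0.8}
print(scores, trusted)  # w1 passes (1.0); w2 fails (0.5)
```

Note item q3 has no gold answer and is ignored by the check, which reflects the cost trade-off on the slide: only the fraction of work spent on traps buys you this signal.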
  99. Comparing to other workers
     • AKA: consensus, plurality, redundant labeling
     • Well-known metrics for measuring agreement
     • Cost vs. benefit: % of work that is redundant
     • Finer points
       – Is consensus "truth", or the systematic bias of the group?
       – What if no one really knows what they're doing?
         • Low agreement across workers indicates the problem is with the task (or a specific example), not the workers
       – Risk of collusion
     • Sheng et al. (KDD 2008)
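A minimal redundant-labeling sketch: take the plurality label per item and flag ties for extra handling (an expert tie-break or more judgments, as slide 102 suggests). The relevance labels are illustrative:

```python
from collections import Counter

def consensus(labels_by_item):
    """Plurality vote per item; ties return None (collect more judgments
    or break the tie with an expert)."""
    result = {}
    for item, labels in labels_by_item.items():
        ranked = Counter(labels).most_common()
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            result[item] = None  # tie: no plurality winner
        else:
            result[item] = ranked[0][0]
    return result

votes = {"d1": ["rel", "rel", "non"], "d2": ["rel", "non"]}
print(consensus(votes))  # {'d1': 'rel', 'd2': None}
```

This is the simplest aggregation scheme; weighted variants (e.g., weighting votes by each worker's honey-pot accuracy) follow the same shape.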
  100. Comparing to predicted labels
     • Ryu & Lease, ASIS&T '11
     • Catch-22 extremes
       – If the model is really bad, why bother comparing?
       – If the model is really good, why collect human labels?
     • Exploit model confidence
       – Trust predictions in proportion to confidence
       – What if the model is very confident and wrong?
     • Active learning
       – Time-sensitive: accuracy / confidence changes
  101. Comparing to predicted worker labels
     • Chen et al., AAAI '10
     • Avoid the inefficiency of redundant labeling
       – See also: Dekel & Shamir (COLT 2009)
     • Train a classifier for each worker
     • For each example labeled by a worker
       – Compare to the predicted labels for all other workers
     • Issues
       – Sparsity: workers have to stick around to train the model…
       – Time-sensitivity: new workers & incremental updates?
  102. Methods for measuring agreement
     • What to look for
       – Agreement, reliability, validity
     • Inter-agreement level
       – Agreement between judges
       – Agreement between judges and the gold set
     • Some statistics
       – Percentage agreement
       – Cohen's kappa (2 raters)
       – Fleiss' kappa (any number of raters)
       – Krippendorff's alpha
     • With majority vote, what if 2 say relevant and 3 say not?
       – Use an expert to break ties (Kochhar et al., HCOMP '10; GQR)
       – Collect more judgments as needed to reduce uncertainty
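Of the statistics listed, percentage agreement and Cohen's kappa for two raters are easy to compute directly from their standard definitions. A sketch, with toy relevance labels for illustration:

```python
from collections import Counter

def percent_agreement(a, b):
    """Fraction of items on which two raters give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for the chance
    agreement expected from each rater's label frequencies."""
    n = len(a)
    p_o = percent_agreement(a, b)
    freq_a, freq_b = Counter(a), Counter(b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

a = ["rel", "rel", "non", "non", "rel", "non"]
b = ["rel", "rel", "non", "rel", "rel", "non"]
print(percent_agreement(a, b), round(cohens_kappa(a, b), 3))  # 0.833..., 0.667
```

Percentage agreement alone overstates reliability on skewed label distributions, which is exactly why the chance-corrected statistics on the slide (kappa, alpha) are preferred in practice.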
  103. Other practical tips
     • Sign up as a worker and do some HITs
     • "Eat your own dog food"
     • Monitor discussion forums
     • Address feedback (e.g., poor guidelines, payments, passing grade, etc.)
     • Everything counts!
       – The overall design is only as strong as its weakest link
  104. OPEN PROBLEMS
  105. Why Eytan Adar hates MTurk Research (CHI 2011 CHC Workshop)
     • Overly narrow focus on MTurk
       – Identify general vs. platform-specific problems
       – Academic vs. industrial problems
     • Inattention to prior work in other disciplines
     • Turkers aren't Martians
       – Just human behavior…
  106. What about sensitive data?
     • Not all data can be publicly disclosed
       – User data (e.g., the AOL query log, Netflix ratings)
       – Intellectual property
       – Legal confidentiality
     • Need to restrict who is in your crowd
       – Separate the channel (workforce) from the technology
       – Hot question for adoption at the enterprise level
  107. A Few Open Questions
     • How should we balance automation vs. human computation? Which does what?
     • Who's the right person for the job?
     • How do we handle complex tasks? Can we decompose them into smaller tasks? How?
  108. What about ethics?
     • Silberman, Irani, and Ross (2010)
       – "How should we… conceptualize the role of these people who we ask to power our computing?"
       – Power dynamics between parties
     • What are the consequences for a worker when your actions harm their reputation?
       – "Abstraction hides detail"
     • Fort, Adda, and Cohen (2011)
       – "…opportunities for our community to deliberately value ethics above cost savings."
  109. Example: SamaSource
  110. Davis et al. (2010). The HPU.
  111. HPU: "Abstraction hides detail"
     • Not just turning a mechanical crank
  112. Micro-tasks & Task Decomposition
     • Small, simple tasks can be completed faster by reducing extraneous context and detail
       – E.g., "Can you name who is in this photo?"
     • Current workflow research investigates how to decompose complex tasks into simpler ones
  113. Context & Informed Consent
     • What is the larger task I'm contributing to?
     • Who will benefit from it, and how?
  114. Worker Privacy
     • Each worker is assigned an alphanumeric ID
  115. Requesters see only Worker IDs
  116. Issues of Identity Fraud
     • Compromised & exploited worker accounts
     • Sybil attacks: use of multiple worker identities
     • Script bots masquerading as human workers
     • Robert Sim, MSR Faculty Summit '12
  117. Safeguarding Personal Data
     • "What are the characteristics of MTurk workers?... the MTurk system is set up to strictly protect workers' anonymity…."
  118. Amazon profile page URLs use the same IDs used on MTurk!
     • Paper: MTurk is Not Anonymous
  119. What about regulation?
     • Wolfson & Lease (ASIS&T 2011)
     • As usual, technology is ahead of the law
       – Employment law
       – Patent inventorship
       – Data security and the Federal Trade Commission
       – Copyright ownership
       – Securities regulation of crowdfunding
     • Take-away: don't panic, but be mindful
       – Understand the risks of "just-in-time compliance"
  120. Digital Dirty Jobs
     • NY Times: Policing the Web's Lurid Precincts
     • Gawker: Facebook content moderation
     • CultureDigitally: The dirty job of keeping Facebook clean
     • Even LDC annotators reading typical news articles report stress & nightmares!
  121. Jeff Howe: Vision vs. Reality?
     • Vision of empowering worker freedom:
       – Work whenever you want, for whomever you want
     • When $$$ is at stake, populations at risk may be compelled to perform work by others
       – Digital sweat shops? Digital slaves?
       – We really don't know (and need to learn more…)
       – Traction? Human Trafficking at MSR Summit '12
  123. Putting the shoe on the other foot: Spam
  124. What about trust?
     • Some reports of robot "workers" on MTurk
       – E.g., McCreadie et al. (2011)
       – Violates terms of service
     • Why not just use a captcha?
  125. Captcha Fraud
  126. Requester Fraud on MTurk
     • "Do not do any HITs that involve: filling in CAPTCHAs; secret shopping; test our web page; test zip code; free trial; click my link; surveys or quizzes (unless the requester is listed with a smiley in the Hall of Fame/Shame); anything that involves sending a text message; or basically anything that asks for any personal information at all—even your zip code. If you feel in your gut it's not on the level, IT'S NOT. Why? Because they are scams..."
  127. Defeating CAPTCHAs with crowds
  128. Gaming the System: SEO, etc.
  129. WWW '12
  130. Robert Sim, MSR Summit '12
  131. Conclusion
     • Crowdsourcing is quickly transforming practice in industry and academia via greater efficiency
     • Crowd computing enables a new design space for applications, augmenting state-of-the-art AI with human computation to offer new capabilities and user experiences
     • With people at the center of this new computing paradigm, important research questions bridge technological & social considerations
  132. The Future of Crowd Work
     • Paper @ ACM CSCW 2013
     • Kittur, Nickerson, Bernstein, Gerber, Shaw, Zimmerman, Lease, and Horton
  133. Brief Digression: Information Schools
     • At 30 universities in N. America, Europe, Asia
     • Study human-centered aspects of information technologies: design, implementation, policy, …
     • www.ischools.org
     • Wobbrock et al., 2009
  135. Surveys
     • Aniket Kittur, Jeffrey Nickerson, Michael S. Bernstein, Elizabeth Gerber, Aaron Shaw, John Zimmerman, Matthew Lease, and John J. Horton. The Future of Crowd Work. In ACM Computer Supported Cooperative Work (CSCW), February 2013.
     • Alex Quinn and Ben Bederson. Human Computation: A Survey and Taxonomy of a Growing Field. In Proceedings of CHI 2011.
     • Law and von Ahn (2011). Human Computation.
  136. 2013 Crowdsourcing
     • 1st year of HComp as an AAAI conference
     • TREC 2013 Crowdsourcing Track
     • Springer's Information Retrieval (articles online): Crowdsourcing for Information Retrieval
     • 4th CrowdConf (San Francisco, Fall)
     • 1st Crowdsourcing Week (Singapore, April)
  137. TREC Crowdsourcing Track
     • Year 1 (2011) – horizontals
       – Task 1 (HCI): collect crowd relevance judgments
       – Task 2 (stats): aggregate judgments
       – Organizers: Kazai & Lease
       – Sponsors: Amazon, CrowdFlower
     • Year 2 (2012) – content types
       – Task 1 (text): judge relevance
       – Task 2 (images): judge relevance
       – Organizers: Ipeirotis, Kazai, Lease, & Smucker
       – Sponsors: Amazon, CrowdFlower, MobileWorks
  138. 2012 Workshops & Conferences
     • AAAI: Human Computation (HComp) (July 22-23)
     • AAAI Spring Symposium: Wisdom of the Crowd (March 26-28)
     • ACL: 3rd Workshop of the People's Web meets NLP (July 12-13)
     • AMCIS: Crowdsourcing Innovation, Knowledge, and Creativity in Virtual Communities (August 9-12)
     • CHI: CrowdCamp (May 5-6)
     • CIKM: Multimodal Crowd Sensing (CrowdSens) (Oct. or Nov.)
     • Collective Intelligence (April 18-20)
     • CrowdConf 2012 – 3rd Annual Conference on the Future of Distributed Work (October 23)
     • CrowdNet – 2nd Workshop on Cloud Labor and Human Computation (Jan 26-27)
     • EC: Social Computing and User Generated Content Workshop (June 7)
     • ICDIM: Emerging Problem-specific Crowdsourcing Technologies (August 23)
     • ICEC: Harnessing Collective Intelligence with Games (September)
     • ICML: Machine Learning in Human Computation & Crowdsourcing (June 30)
     • ICWE: 1st International Workshop on Crowdsourced Web Engineering (CroWE) (July 27)
     • KDD: Workshop on Crowdsourcing and Data Mining (August 12)
     • Multimedia: Crowdsourcing for Multimedia (Nov 2)
     • SocialCom: Social Media for Human Computation (September 6)
     • TREC-Crowd: 2nd TREC Crowdsourcing Track (Nov. 14-16)
     • WWW: CrowdSearch: Crowdsourcing Web search (April 17)
  139. 2011 Workshops & Conferences
     • AAAI-HCOMP: 3rd Human Computation Workshop (Aug. 8)
     • ACIS: Crowdsourcing, Value Co-Creation, & Digital Economy Innovation (Nov. 30 – Dec. 2)
     • Crowdsourcing Technologies for Language and Cognition Studies (July 27)
     • CHI-CHC: Crowdsourcing and Human Computation (May 8)
     • CIKM: BooksOnline (Oct. 24, "crowdsourcing … online books")
     • CrowdConf 2011 – 2nd Conf. on the Future of Distributed Work (Nov. 1-2)
     • Crowdsourcing: Improving … Scientific Data Through Social Networking (June 13)
     • EC: Workshop on Social Computing and User Generated Content (June 5)
     • ICWE: 2nd International Workshop on Enterprise Crowdsourcing (June 20)
     • Interspeech: Crowdsourcing for speech processing (August)
     • NIPS: Second Workshop on Computational Social Science and the Wisdom of Crowds (Dec. TBD)
     • SIGIR-CIR: Workshop on Crowdsourcing for Information Retrieval (July 28)
     • TREC-Crowd: 1st TREC Crowdsourcing Track (Nov. 16-18)
     • UbiComp: 2nd Workshop on Ubiquitous Crowdsourcing (Sep. 18)
     • WSDM-CSDM: Crowdsourcing for Search and Data Mining (Feb. 9)
  140. 2011 Tutorials and Keynotes
     • By Omar Alonso and/or Matthew Lease
       – CLEF: Crowdsourcing for Information Retrieval Experimentation and Evaluation (Sep. 20, Omar only)
       – CrowdConf: Crowdsourcing for Research and Engineering
       – IJCNLP: Crowd Computing: Opportunities and Challenges (Nov. 10, Matt only)
       – WSDM: Crowdsourcing 101: Putting the WSDM of Crowds to Work for You (Feb. 9)
       – SIGIR: Crowdsourcing for Information Retrieval: Principles, Methods, and Applications (July 24)
     • AAAI: Human Computation: Core Research Questions and State of the Art
       – Edith Law and Luis von Ahn, August 7
     • ASIS&T: How to Identify Ducks in Flight: A Crowdsourcing Approach to Biodiversity Research and Conservation
       – Steve Kelling, October 10, eBird
     • EC: Conducting Behavioral Research Using Amazon's Mechanical Turk
       – Winter Mason and Siddharth Suri, June 5
     • HCIC: Quality Crowdsourcing for Human Computer Interaction Research
       – Ed Chi, June 14-18 (about HCIC)
       – Also see his: Crowdsourcing for HCI Research with Amazon Mechanical Turk
     • Multimedia: Frontiers in Multimedia Search
       – Alan Hanjalic and Martha Larson, Nov 28
     • VLDB: Crowdsourcing Applications and Platforms
       – AnHai Doan, Michael Franklin, Donald Kossmann, and Tim Kraska
     • WWW: Managing Crowdsourced Human Computation
       – Panos Ipeirotis and Praveen Paritosh
  141. Thank You!
     Matt Lease - - @mattlease
     • Students
       – Catherine Grady (iSchool)
       – Hyunjoon Jung (iSchool)
       – Jorn Klinger (Linguistics)
       – Adriana Kovashka (CS)
       – Abhimanu Kumar (CS)
       – Hohyon Ryu (iSchool)
       – Wei Tang (CS)
       – Stephen Wolfson (iSchool)
  142. More Books
     • July 2010, Kindle-only: "This book introduces you to the top crowdsourcing sites and outlines step by step with photos the exact process to get started as a requester on Amazon Mechanical Turk."
  143. Resources
     • A Few Blogs
       – Behind Enemy Lines (P.G. Ipeirotis, NYU)
       – Deneme: a Mechanical Turk experiments blog (Greg Little, MIT)
       – CrowdFlower Blog
       – Jeff Howe
     • A Few Sites
       – The Crowdsortium
       – CrowdsourceBase (for workers)
       – Daily Crowdsource
     • MTurk Forums and Resources
       – Turker Nation (and its blog)
       – Turkopticon: report/avoid shady requesters
       – Amazon Forum for MTurk
  144. Bibliography
     • J. Barr and L. Cabrera. "AI gets a Brain". ACM Queue, May 2006.
     • M. Bernstein et al. Soylent: A Word Processor with a Crowd Inside. UIST 2010. Best Student Paper award.
     • B.B. Bederson, C. Hu, & P. Resnik. Translation by Interactive Collaboration between Monolingual Users. Proceedings of Graphics Interface (GI 2010), 39-46.
     • N. Bradburn, S. Sudman, and B. Wansink. Asking Questions: The Definitive Guide to Questionnaire Design. Jossey-Bass, 2004.
     • C. Callison-Burch. "Fast, Cheap, and Creative: Evaluating Translation Quality Using Amazon's Mechanical Turk". EMNLP 2009.
     • P. Dai, Mausam, and D. Weld. "Decision-Theoretic Control of Crowd-Sourced Workflows". AAAI 2010.
     • J. Davis et al. "The HPU". IEEE CVPR Workshop on Advancing Computer Vision with Humans in the Loop (ACVHL), June 2010.
     • M. Gashler, C. Giraud-Carrier, T. Martinez. Decision Tree Ensemble: Small Heterogeneous Is Better Than Large Homogeneous. ICMLA 2008.
     • D.A. Grier. When Computers Were Human. Princeton University Press, 2005. ISBN 0691091579.
     • S. Hacker and L. von Ahn. "Matchin: Eliciting User Preferences with an Online Game". CHI 2009.
     • J. Heer, M. Bostock. "Crowdsourcing Graphical Perception: Using Mechanical Turk to Assess Visualization Design". CHI 2010.
     • P. Heymann and H. Garcia-Molina. "Human Processing". Technical Report, Stanford InfoLab, 2010.
     • J. Howe. Crowdsourcing: Why the Power of the Crowd Is Driving the Future of Business. Crown Business, New York, 2008.
     • P. Hsueh, P. Melville, V. Sindhwani. "Data Quality from Crowdsourcing: A Study of Annotation Selection Criteria". NAACL HLT Workshop on Active Learning and NLP, 2009.
     • B. Huberman, D. Romero, and F. Wu. "Crowdsourcing, attention and productivity". Journal of Information Science, 2009.
     • P.G. Ipeirotis. The New Demographics of Mechanical Turk. March 9, 2010. PDF and spreadsheet.
     • P.G. Ipeirotis, R. Chandrasekar, and P. Bennett. Report on the human computation workshop. SIGKDD Explorations 11(2):80-83, 2010.
     • P.G. Ipeirotis. Analyzing the Amazon Mechanical Turk Marketplace. CeDER-10-04 (Sept. 11, 2010).
  145. 145. Bibliography (2)
• A. Kittur, E. Chi, and B. Suh. "Crowdsourcing user studies with Mechanical Turk", SIGCHI 2008.
• A. Kittur, B. Smus, and R.E. Kraut. CrowdForge: Crowdsourcing Complex Work. CHI 2011.
• A. Kovashka and M. Lease. "Human and Machine Detection of … Similarity in Art". CrowdConf 2010.
• K. Krippendorff. Content Analysis. Sage Publications, 2003.
• G. Little, L. Chilton, M. Goldman, and R. Miller. "TurKit: Tools for Iterative Tasks on Mechanical Turk", HCOMP 2009.
• T. Malone, R. Laubacher, and C. Dellarocas. Harnessing Crowds: Mapping the Genome of Collective Intelligence. 2009.
• W. Mason and D. Watts. "Financial Incentives and the 'Performance of Crowds'", HCOMP Workshop at KDD 2009.
• J. Nielsen. Usability Engineering. Morgan Kaufmann, 1994.
• A. Quinn and B. Bederson. "A Taxonomy of Distributed Human Computation", Technical Report HCIL-2009-23, 2009.
• J. Ross, L. Irani, M. Six Silberman, A. Zaldivar, and B. Tomlinson. "Who are the Crowdworkers?: Shifting Demographics in Amazon Mechanical Turk". CHI 2010.
• F. Scheuren. "What is a Survey?", 2004.
• R. Snow, B. O'Connor, D. Jurafsky, and A.Y. Ng. "Cheap and Fast But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks". EMNLP 2008.
• V. Sheng, F. Provost, and P. Ipeirotis. "Get Another Label? Improving Data Quality … Using Multiple, Noisy Labelers". KDD 2008.
• S. Weber. The Success of Open Source. Harvard University Press, 2004.
• L. von Ahn. Games with a purpose. Computer, 39(6):92-94, 2006.
• L. von Ahn and L. Dabbish. "Designing Games with a Purpose". CACM, Vol. 51, No. 8, 2008.
  146. 146. Bibliography (3)
• S. Chen et al. What if the Irresponsible Teachers Are Dominating? A Method of Training on Samples and Clustering on Teachers. AAAI 2010.
• P. Heymann and H. Garcia-Molina. Turkalytics: analytics for human computation. WWW 2011.
• F. Laws, C. Scheible, and H. Schütze. Active Learning with Amazon Mechanical Turk. EMNLP 2011.
• C.Y. Lin. ROUGE: A package for automatic evaluation of summaries. Proceedings of the Workshop on Text Summarization Branches Out (WAS), 2004.
• C. Marshall and F. Shipman. "The Ownership and Reuse of Visual Media", JCDL 2011.
• H. Ryu and M. Lease. Crowdworker Filtering with Support Vector Machine. ASIS&T 2011.
• W. Tang and M. Lease. Semi-Supervised Consensus Labeling for Crowdsourcing. ACM SIGIR Workshop on Crowdsourcing for Information Retrieval (CIR), 2011.
• S. Vijayanarasimhan and K. Grauman. Large-Scale Live Active Learning: Training Object Detectors with Crawled Data and Crowds. CVPR 2011.
• S. Wolfson and M. Lease. Look Before You Leap: Legal Pitfalls of Crowdsourcing. ASIS&T 2011.
  147. 147. Recent Work
• Della Penna, N., and M.D. Reid. (2012). "Crowd & Prejudice: An Impossibility Theorem for Crowd Labelling without a Gold Standard." In Proceedings of Collective Intelligence. arXiv preprint arXiv:1204.3511.
• Demartini, G., D.E. Difallah, and P. Cudré-Mauroux. (2012). "ZenCrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking." 21st Annual Conference on the World Wide Web (WWW).
• Donmez, P., J. Carbonell, and J. Schneider. (2010). "A probabilistic framework to learn from multiple annotators with time-varying accuracy." In SIAM International Conference on Data Mining (SDM), 826-837.
• Donmez, P., J. Carbonell, and J. Schneider. (2009). "Efficiently learning the accuracy of labeling sources for selective sampling." In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 259-268.
• Fort, K., G. Adda, and K. Cohen. (2011). "Amazon Mechanical Turk: Gold mine or coal mine?" Computational Linguistics, 37(2):413-420.
• Ghosh, A., S. Kale, and P. McAfee. (2012). "Who Moderates the Moderators? Crowdsourcing Abuse Detection in User-Generated Content." In Proceedings of the 12th ACM Conference on Electronic Commerce.
• Ho, C.J., and J.W. Vaughan. (2012). "Online Task Assignment in Crowdsourcing Markets." In Twenty-Sixth AAAI Conference on Artificial Intelligence.
• Jung, H.J., and M. Lease. (2012). "Inferring Missing Relevance Judgments from Crowd Workers via Probabilistic Matrix Factorization." In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval.
• Kamar, E., S. Hacker, and E. Horvitz. (2012). "Combining Human and Machine Intelligence in Large-scale Crowdsourcing." In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems (AAMAS).
• Karger, D.R., S. Oh, and D. Shah. (2011). "Budget-optimal task allocation for reliable crowdsourcing systems." arXiv preprint arXiv:1110.3564.
• Kazai, G., J. Kamps, and N. Milic-Frayling. (2012). "An Analysis of Human Factors and Label Accuracy in Crowdsourcing Relevance Judgments." Springer's Information Retrieval Journal: Special Issue on Crowdsourcing.
  148. 148. Recent Work (2)
• Lin, C.H., Mausam, and D.S. Weld. (2012). "Crowdsourcing Control: Moving Beyond Multiple Choice." In Proceedings of the 4th Human Computation Workshop (HCOMP) at AAAI.
• Liu, C., and Y.M. Wang. (2012). "TrueLabel + Confusions: A Spectrum of Probabilistic Models in Analyzing Multiple Ratings." In Proceedings of the 29th International Conference on Machine Learning (ICML).
• Liu, D., R. Bias, M. Lease, and R. Kuipers. (2012). "Crowdsourcing for Usability Testing." In Proceedings of the 75th Annual Meeting of the American Society for Information Science and Technology (ASIS&T).
• Ramesh, A., A. Parameswaran, H. Garcia-Molina, and N. Polyzotis. (2012). Identifying Reliable Workers Swiftly.
• Raykar, V., S. Yu, L.H. Zhao, G.H. Valadez, C. Florin, L. Bogoni, and L. Moy. (2010). "Learning From Crowds." Journal of Machine Learning Research, 11:1297-1322.
• Raykar, V., S. Yu, L.H. Zhao, A. Jerebko, C. Florin, G.H. Valadez, L. Bogoni, and L. Moy. (2009). "Supervised learning from multiple experts: whom to trust when everyone lies a bit." In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), 889-896.
• Raykar, V.C., and S. Yu. (2012). "Eliminating Spammers and Ranking Annotators for Crowdsourced Labeling Tasks." Journal of Machine Learning Research, 13:491-518.
• Wauthier, F.L., and M.I. Jordan. (2012). "Bayesian Bias Mitigation for Crowdsourcing." In Advances in Neural Information Processing Systems (NIPS).
• Weld, D.S., Mausam, and P. Dai. (2011). "Execution control for crowdsourcing." In Proceedings of the 24th ACM Symposium Adjunct on User Interface Software and Technology (UIST).
• Weld, D.S., Mausam, and P. Dai. (2011). "Human Intelligence Needs Artificial Intelligence." In Proceedings of the 3rd Human Computation Workshop (HCOMP) at AAAI.
• Welinder, P., S. Branson, S. Belongie, and P. Perona. (2010). "The Multidimensional Wisdom of Crowds." In Advances in Neural Information Processing Systems (NIPS), 2424-2432.
• Welinder, P., and P. Perona. (2010). "Online crowdsourcing: rating annotators and obtaining cost-effective labels." In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 25-32.
• Whitehill, J., P. Ruvolo, T. Wu, J. Bergsma, and J. Movellan. (2009). "Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise." In Advances in Neural Information Processing Systems (NIPS).
• Yan, Y., and R. Rosales. (2011). "Active learning from crowds." In Proceedings of the 28th Annual International Conference on Machine Learning (ICML).