7. Snow et al. (EMNLP 2008)
• MTurk annotation for 5 Tasks
– Affect recognition
– Word similarity
– Recognizing textual entailment
– Event temporal ordering
– Word sense disambiguation
• 22K labels for US $26
• High agreement between
consensus labels and
8. Alonso et al. (SIGIR Forum 2008)
• MTurk for Information Retrieval (IR)
– Judge relevance of search engine results
• Many follow-on studies (design, quality, cost)
13. Social & Behavioral Sciences
• A Guide to Behavioral Experiments
on Mechanical Turk
– W. Mason and S. Suri (2010). SSRN online.
• Crowdsourcing for Human Subjects Research
– L. Schmidt (CrowdConf 2010)
• Crowdsourcing Content Analysis for Behavioral Research:
Insights from Mechanical Turk
– Conley & Tosti-Kharas (2010). Academy of Management
• Amazon's Mechanical Turk: A New Source of
Inexpensive, Yet High-Quality, Data?
– M. Buhrmester et al. (2011). Perspectives… 6(1):3-5.
– see also: Amazon Mechanical Turk Guide for Social Scientists
15. Remote Usability Testing
• D. Liu, R. Kuipers, M. Lease, and R. Bias (in preparation)
• On-site vs. crowdsourced usability testing
– More Participants
– More Diverse Participants
– High Speed
– Low Cost
– Lower Quality Feedback
– Less Interaction
– Greater need for quality control
– Less Focused User Groups
23. The Mechanical Turk
The original, constructed and
unveiled in 1770 by Wolfgang
von Kempelen (1734–1804)
J. Pontin. Artificial Intelligence, With Help From
the Humans. New York Times (March 25, 2007)
28. • What was old is new
• Crowdsourcing: A New
Branch of Computer Science
– D.A. Grier, March 29, 2011
• Tabulating the heavens:
computing the Nautical
Almanac in 18th-century
– M. Croarken (2003)
• D.A. Grier. When Computers Were Human. Princeton University Press, 2005
30. Human Computation
• Luis von Ahn (2005)
• Investigates use of people to execute certain
computations for which capabilities of current
automated methods are more limited
• Explores the metaphor of computation for
characterizing attributes, capabilities, and
limitations of human performance in
executing desired tasks
• Having people do stuff (instead of computers)
33. Ethics Checking: The Next Frontier?
• Mark Johnson’s address at ACL 2003
– Transcript in Conduit 12(2) 2003
• Think how useful a little “ethics checker and
corrector” program integrated into a word
processor could be!
34. Soylent: A Word Processor with a Crowd Inside
• Bernstein et al., UIST 2010
• Jeff Howe. Wired, June 2006.
• Take a job traditionally
performed by a known agent
(often an employee)
• Outsource it to an undefined,
generally large group of
people via an open call
• New application of principles
from open source movement
42. What is Crowdsourcing?
• A collection of mechanisms and associated
methodologies for scaling and directing crowd
activities to achieve some goal(s)
• Enabled by internet-connectivity
• Many related areas
– Human computation
– Collective intelligence
– Social computing
– People services
43. Crowdsourcing Key Questions
• What are the goals?
– Purposeful directing of human activity
• How can you incentivize participation?
– Incentive engineering
– Who is your intended crowd?
• Which model(s) are most appropriate?
– How to adapt them to your context and goals?
45. What about sensitive data?
• Not all data can be publicly disclosed
– User data (e.g. AOL query log, Netflix ratings)
– Intellectual property
– Legal confidentiality
• Need to restrict who is in your crowd
– Separate channel (workforce) from technology
– Hot question for adoption at enterprise level
46. What about the law?
• Wolfson & Lease (ASIS&T 2011)
• As usual, technology is ahead of the law
– employment law
– patent inventorship
– data security and the Federal Trade Commission
– copyright ownership
– securities regulation of crowdfunding
• Take-away: don’t panic, but be mindful
– Understand risks of “just in-time compliance”
47. Nature of Micro-tasks
• Small, simple tasks which can be completed
without extraneous detail or context
– e.g. “Can you name who is in this photo?”
• Variety of current research investigating how
to decompose complex tasks into simpler ones
51. Requester Fraud on MTurk
“Do not do any HITs that involve: filling in
CAPTCHAs; secret shopping; test our web page;
test zip code; free trial; click my link; surveys or
quizzes (unless the requester is listed with a
smiley in the Hall of Fame/Shame); anything
that involves sending a text message; or
basically anything that asks for any personal
information at all—even your zip code. If you
feel in your gut it’s not on the level, IT’S NOT.
Why? Because they are scams...”
52. Who are
• A. Baio, November 2008. The Faces of Mechanical Turk.
• P. Ipeirotis. March 2010. The New Demographics of Mechanical Turk.
• J. Ross, et al. Who are the Crowdworkers?... CHI 2010.
53. What about ethics?
• Silberman, Irani, and Ross (2010)
– “How should we… conceptualize the role of these
people who we ask to power our computing?”
– Power dynamics between parties
• What are the consequences for a worker
when your actions harm their reputation?
– “Abstraction hides detail”
• Fort, Adda, and Cohen (2011)
– “…opportunities for our community to deliberately
value ethics above cost savings.”
58. What about quality?
• Many papers on statistical methods for MTurk
– Worker calibration, noise vs. bias, weighted voting
– Checking consensus may discourage worker honesty
• Garbage in = garbage out
– Only as strong as your weakest link (end-to-end)
– Is your shiny statistical hammer what’s really needed?
– Not all problems are technological
• Methods for consistent annotation still apply
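To make "weighted voting" concrete, here is a minimal sketch in Python. It is not taken from any of the papers on this slide. The worker ids, labels, and trust weights are hypothetical; in practice a weight might come from a worker's accuracy on gold questions.

```python
from collections import defaultdict

def weighted_vote(labels, weights=None):
    """Return the label with the largest total weight.

    labels  -- dict mapping worker id -> label that worker gave
    weights -- dict mapping worker id -> non-negative weight (hypothetical,
               e.g. accuracy on gold questions); missing workers get 1.0
    """
    weights = weights or {}
    totals = defaultdict(float)
    for worker, label in labels.items():
        totals[label] += weights.get(worker, 1.0)
    # Sorting first makes tie-breaking deterministic (alphabetical order).
    return max(sorted(totals), key=lambda label: totals[label])

# Hypothetical relevance judgments from three workers.
votes = {"w1": "relevant", "w2": "not relevant", "w3": "not relevant"}
trust = {"w1": 0.95, "w2": 0.40, "w3": 0.45}

print(weighted_vote(votes))         # plain majority vote
print(weighted_vote(votes, trust))  # vote weighted by assumed worker accuracy
```

With these made-up numbers the plain majority and the weighted vote disagree, which is the case where the choice of aggregation method matters. If the instructions or interface are the weak link, no weighting scheme will repair the labels.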
59. What about quality? (2)
• Human factors matter
– Part of your experimental design (study & report)
– Instructions, design, interface, interaction
– Names, relationship, reputation (Klinger & Lease 2011)
– Fair pay, hourly vs. per-task, recognition, advancement
– For contrast, consider Kochhar (2010)
• How good is gold really?
– Klebanov and Beigman (NAACL 2010)
– Model uncertainty in ground truth
• Training and evaluation with uncertain labels
• Temporal and label uncertainty in active learning
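One simple way to keep uncertainty in the ground truth, sketched below in Python, is to store the distribution of worker labels per item rather than a single "gold" label. This is an illustrative assumption, not the method of the papers cited on this slide. The function names and example votes are hypothetical.

```python
from collections import Counter
from math import log2

def label_distribution(votes):
    """Turn a list of worker labels for one item into a soft label."""
    counts = Counter(votes)
    total = len(votes)
    return {label: count / total for label, count in counts.items()}

def entropy(dist):
    """Shannon entropy in bits: 0 when workers agree, higher when they split."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

# Hypothetical votes for two items.
easy_item = ["pos", "pos", "pos", "pos"]
hard_item = ["pos", "neg", "pos", "neg"]

for votes in (easy_item, hard_item):
    dist = label_distribution(votes)
    print(dist, entropy(dist))
```

Items with high entropy can be sent for more labels, down-weighted in training, or reported separately in evaluation.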
60. Need for Benchmarks
• How well do different crowdsourcing methods
perform on comparable data?
– Shared datasets used to bring people together
• Common tasks to consider
– Relevance Judging (search engines)
– Generation (resources, SEO, reputation)
– Verification (correct, copyright, appropriate)
• NIST TREC Crowdsourcing Track exploring this
– Track will run for 2nd Year in 2012
62. Why Eytan Adar hates MTurk Research
(CHI 2011 CHC Workshop)
• Overly-narrow focus on MTurk
– Identify general vs. platform-specific problems
– Academic vs. Industrial problems
• Inattention to prior work in other disciplines
– Some problems well-studied in other areas
– Human behavior hasn’t changed much
• Turks aren’t Martians
– How many prior user studies need to be
reproduced on MTurk before we believe it?
66. MapReduce for human computation?
• Large task divided into smaller sub-problems
• Job distributed among multiple workers
• Collect all answers and combine them
• Varying performance of heterogeneous
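The divide / distribute / combine pattern above can be sketched in a few lines of Python. This is a toy illustration only: `simulated_worker` is a hypothetical stand-in for a human answering a task, and the labeling rule and example documents are invented for the demo.

```python
from collections import Counter

def split_task(documents, chunk_size):
    """Divide one large job into small batches (the 'map' input)."""
    return [documents[i:i + chunk_size]
            for i in range(0, len(documents), chunk_size)]

def simulated_worker(worker_id, batch):
    """Stand-in for a human worker labeling each document in a batch.
    Illustrative rule: a document is 'long' if it has more than 3 words."""
    return {doc: ("long" if len(doc.split()) > 3 else "short") for doc in batch}

def combine(answers):
    """Collect all answers and take a majority vote per document ('reduce')."""
    per_doc = {}
    for answer in answers:
        for doc, label in answer.items():
            per_doc.setdefault(doc, Counter())[label] += 1
    return {doc: counts.most_common(1)[0][0] for doc, counts in per_doc.items()}

docs = [
    "crowd work",
    "a short guide to mechanical turk",
    "search results",
    "relevance judging for search engines",
]
batches = split_task(docs, chunk_size=2)
# Each batch goes to three (simulated) workers for redundancy.
answers = [simulated_worker(w, batch) for batch in batches for w in range(3)]
result = combine(answers)
print(result)
```

Real workers differ in speed and accuracy, so a practical system would also need timeouts, reassignment of unfinished batches, and a combine step that tolerates disagreement.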
67. Wisdom of Crowds Computing
Input: large, diverse sample (increases
likelihood of overall pool quality)
Output: consensus, selection, distribution
What about ensemble techniques and theory
we have for integrating multiple noisy models?
Computational Social Science & the Wisdom of Crowds
NIPS 2010-2011 workshops
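A classic piece of ensemble theory that applies here is the Condorcet jury calculation: if workers err independently and each is right with probability p > 0.5, the majority is right more often than any single worker. The Python sketch below computes that probability. The independence assumption and the example accuracy of 0.7 are assumptions for illustration, and real crowds often violate independence.

```python
from math import comb

def majority_accuracy(p, n):
    """Probability that a majority of n independent workers is correct,
    when each worker is correct with probability p. n must be odd."""
    if n % 2 == 0:
        raise ValueError("use an odd n to avoid ties")
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# Assumed per-worker accuracy of 0.7 on a binary task.
for n in (1, 3, 5, 11):
    print(n, round(majority_accuracy(0.7, n), 3))
```

The same formula shows the limits of redundancy: at p = 0.5 adding workers does not help, and below 0.5 it makes the consensus worse.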
68. • How to effectively compute in the crowd?
– What are all the knobs and dials?
– How to set them to navigate quality vs. cost vs. time?
• How do we design, implement, test, & maintain
crowd computing systems?
• What new capabilities can such systems provide?
• How does cheaper / faster / easier / noisier data
change the way we build intelligent systems?
– Relative costs of all other activities increase
69. Unreasonable Effectiveness of Data
• Massive free Web data
changed how we train
– Banko and Brill (2001).
Human Language Tech.
– Halevy et al. (2009). IEEE
• How might access to cheap & plentiful
labeled data change the balance again?
70. • Who is the right person for the job?
– Requesting or inferring skills / experience
– Interactive task selection or automatic routing
– How to represent, measure, model, estimate, and
utilize individual worker expertise/accuracy?
• How should crowd computing evolve from here?
– What should next generation infrastructure provide?
– What are the right programming constructs?
• With automation and human computation, who
does what? Mixed-initiative thinking.
71. Crowdsourcing in 2012
• Conferences and Workshops
– AAAI Symposium: Wisdom of the Crowd (March 26-28)
– Collective Intelligence (papers: Nov 18, date: April 18-20)
– Year 2 of TREC Crowdsourcing Track
– HComp and CrowdConf (details TBD)
• Journal Special Issues
– Springer’s Information Retrieval:
Crowdsourcing for Information Retrieval
– Hindawi’s Advances in Multimedia Journal:
Multimedia Semantics Analysis via Crowdsourcing Geocontext
– IEEE Internet Computing: Crowdsourcing (Sept./Oct. 2012)
– IEEE Transactions on Multimedia:
Crowdsourcing in Multimedia (proposal in review)
• Places for News
– Follow the Crowd
– The Crowdsortium
72. 2011 Tutorials and Keynotes
• By Omar Alonso and/or Matthew Lease
– CLEF: Crowdsourcing for Information Retrieval Experimentation and Evaluation (Sep. 20, Omar only)
– CrowdConf (Nov. 1, this is it!)
– IJCNLP: Crowd Computing: Opportunities and Challenges (Nov. 10, Matt only)
– WSDM: Crowdsourcing 101: Putting the WSDM of Crowds to Work for You (Feb. 9)
– SIGIR: Crowdsourcing for Information Retrieval: Principles, Methods, and Applications (July 24)
• AAAI: Human Computation: Core Research Questions and State of the Art
– Edith Law and Luis von Ahn, August 7
• ASIS&T: How to Identify Ducks In Flight: A Crowdsourcing Approach to Biodiversity Research and
– Steve Kelling, October 10, eBird
• EC: Conducting Behavioral Research Using Amazon's Mechanical Turk
– Winter Mason and Siddharth Suri, June 5
• HCIC: Quality Crowdsourcing for Human Computer Interaction Research
– Ed Chi, June 14-18 (about HCIC)
– Also see his: Crowdsourcing for HCI Research with Amazon Mechanical Turk
• Multimedia: Frontiers in Multimedia Search
– Alan Hanjalic and Martha Larson, Nov 28
• VLDB: Crowdsourcing Applications and Platforms
– AnHai Doan, Michael Franklin, Donald Kossmann, and Tim Kraska
• WWW: Managing Crowdsourced Human Computation
– Panos Ipeirotis and Praveen Paritosh
73. 2011 Workshops & Conferences
• AAAI-HCOMP: 3rd Human Computation Workshop (Aug. 8)
• ACIS: Crowdsourcing, Value Co-Creation, & Digital Economy Innovation (Nov. 30 – Dec. 2)
• Crowdsourcing Technologies for Language and Cognition Studies (July 27)
• CHI-CHC: Crowdsourcing and Human Computation (May 8)
• CIKM: BooksOnline (Oct. 24, “crowdsourcing … online books”)
• CrowdConf 2011 -- 2nd Conf. on the Future of Distributed Work (Nov. 1-2)
• Crowdsourcing: Improving … Scientific Data Through Social Networking (June 13)
• EC: Workshop on Social Computing and User Generated Content (June 5)
• ICWE: 2nd International Workshop on Enterprise Crowdsourcing (June 20)
• Interspeech: Crowdsourcing for speech processing (August)
• NIPS: Second Workshop on Computational Social Science and the Wisdom of Crowds (Dec. TBD)
• SIGIR-CIR: Workshop on Crowdsourcing for Information Retrieval (July 28)
• TREC-Crowd: Year 1 of TREC Crowdsourcing Track (Nov. 16-18)
• UbiComp: 2nd Workshop on Ubiquitous Crowdsourcing (Sep. 18)
• WSDM-CSDM: Crowdsourcing for Search and Data Mining (Feb. 9)
74. Recent Overview Papers
• Alex Quinn and Ben Bederson. Human Computation: A
Survey and Taxonomy of a Growing Field. CHI 2011.
• Man-Ching Yuen, Irwin King, and Kwong-Sak Leung. A
Survey of Crowdsourcing Systems. SocialCom 2011.
• Rajarshi Das and Maja Vukovic. Emerging theories and
models of human computation systems: a brief survey.
• A. Doan, R. Ramakrishnan, A. Halevy. Crowdsourcing
Systems on the World-Wide Web. Communications of
the ACM, 2011.
• Omar Alonso, Gabriella Kazai, and Stefano
Mizzaro. (2012). Crowdsourcing for Search
Engine Evaluation: Why and How.
• Law and von Ahn (2011).
76. More Books
July 2010, kindle-only: “This book introduces you to the
top crowdsourcing sites and outlines step by step with
photos the exact process to get started as a requester on
Amazon Mechanical Turk.”
77. Thank You!
Matt Lease, firstname.lastname@example.org, @mattlease
– Catherine Grady (iSchool)
– Hyunjoon Jung (ECE)
– Jorn Klinger (Linguistics)
– Adriana Kovashka (CS)
– Abhimanu Kumar (CS)
– Di Liu (iSchool)
– Hohyon Ryu (iSchool)
– William Tang (CS)
– Stephen Wolfson (iSchool)
• Omar Alonso, Microsoft Bing
– John P. Commons
78. Bibliography
J. Barr and L. Cabrera. “AI gets a Brain”, ACM Queue, May 2006.
Bernstein, M. et al. Soylent: A Word Processor with a Crowd Inside. UIST 2010. Best Student Paper award.
Bederson, B.B., Hu, C., & Resnik, P. Translation by Interactive Collaboration between Monolingual Users, Proceedings of Graphics
Interface (GI 2010), 39-46.
N. Bradburn, S. Sudman, and B. Wansink. Asking Questions: The Definitive Guide to Questionnaire Design, Jossey-Bass, 2004.
C. Callison-Burch. “Fast, Cheap, and Creative: Evaluating Translation Quality Using Amazon’s Mechanical Turk”, EMNLP 2009.
P. Dai, Mausam, and D. Weld. “Decision-Theoretic Control of Crowd-Sourced Workflows”, AAAI, 2010.
J. Davis et al. “The HPU”, IEEE Computer Vision and Pattern Recognition Workshop on Advancing Computer Vision with Human
in the Loop (ACVHL), June 2010.
M. Gashler, C. Giraud-Carrier, T. Martinez. Decision Tree Ensemble: Small Heterogeneous Is Better Than Large Homogeneous, ICMLA 2008.
D. A. Grier. When Computers Were Human. Princeton University Press, 2005. ISBN 0691091579
S. Hacker and L. von Ahn. “Matchin: Eliciting User Preferences with an Online Game”, CHI 2009.
J. Heer, M. Bostock. “Crowdsourcing Graphical Perception: Using Mechanical Turk to Assess Visualization Design”, CHI 2010.
P. Heymann and H. Garcia-Molina. “Human Processing”, Technical Report, Stanford Info Lab, 2010.
J. Howe. “Crowdsourcing: Why the Power of the Crowd Is Driving the Future of Business”. Crown Business, New York, 2008.
P. Hsueh, P. Melville, V. Sindhwani. “Data Quality from Crowdsourcing: A Study of Annotation Selection Criteria”. NAACL HLT
Workshop on Active Learning and NLP, 2009.
B. Huberman, D. Romero, and F. Wu. “Crowdsourcing, attention and productivity”. Journal of Information Science, 2009.
P.G. Ipeirotis. The New Demographics of Mechanical Turk. March 9, 2010.
P.G. Ipeirotis, R. Chandrasekar and P. Bennett. Report on the human computation workshop. SIGKDD Explorations v11 no 2 pp. 80-83, 2010.
P.G. Ipeirotis. Analyzing the Amazon Mechanical Turk Marketplace. CeDER-10-04 (Sept. 11, 2010)
79. Bibliography (2)
A. Kittur, E. Chi, and B. Suh. “Crowdsourcing user studies with Mechanical Turk”, SIGCHI 2008.
Aniket Kittur, Boris Smus, Robert E. Kraut. CrowdForge: Crowdsourcing Complex Work. CHI 2011
Adriana Kovashka and Matthew Lease. “Human and Machine Detection of … Similarity in Art”. CrowdConf 2010.
K. Krippendorff. "Content Analysis", Sage Publications, 2003
G. Little, L. Chilton, M. Goldman, and R. Miller. “TurKit: Tools for Iterative Tasks on Mechanical Turk”, HCOMP 2009.
T. Malone, R. Laubacher, and C. Dellarocas. Harnessing Crowds: Mapping the Genome of Collective Intelligence.
W. Mason and D. Watts. “Financial Incentives and the ’Performance of Crowds’”, HCOMP Workshop at KDD 2009.
J. Nielsen. “Usability Engineering”, Morgan Kaufmann, 1994.
A. Quinn and B. Bederson. “A Taxonomy of Distributed Human Computation”, Technical Report HCIL-2009-23, 2009
J. Ross, L. Irani, M. Six Silberman, A. Zaldivar, and B. Tomlinson. “Who are the Crowdworkers?: Shifting
Demographics in Amazon Mechanical Turk”. CHI 2010.
F. Scheuren. “What is a Survey” (http://www.whatisasurvey.info) 2004.
R. Snow, B. O’Connor, D. Jurafsky, and A. Y. Ng. “Cheap and Fast But is it Good? Evaluating Non-Expert Annotations
for Natural Language Tasks”. EMNLP-2008.
V. Sheng, F. Provost, P. Ipeirotis. “Get Another Label? Improving Data Quality … Using Multiple, Noisy Labelers”
S. Weber. “The Success of Open Source”, Harvard University Press, 2004.
L. von Ahn. Games with a purpose. Computer, 39 (6), 92–94, 2006.
L. von Ahn and L. Dabbish. “Designing Games with a purpose”. CACM, Vol. 51, No. 8, 2008.
80. Bibliography (3)
Shuo Chen et al. What if the Irresponsible Teachers Are Dominating? A Method of Training on Samples and
Clustering on Teachers. AAAI 2010.
Paul Heymann, Hector Garcia-Molina: Turkalytics: analytics for human computation. WWW 2011.
Florian Laws, Christian Scheible and Hinrich Schütze. Active Learning with Amazon Mechanical Turk.
C.Y. Lin. Rouge: A package for automatic evaluation of summaries. Proceedings of the workshop on text
summarization branches out (WAS), 2004.
C. Marshall and F. Shipman. “The Ownership and Reuse of Visual Media”, JCDL, 2011.
Hohyon Ryu and Matthew Lease. Crowdworker Filtering with Support Vector Machine. ASIS&T 2011.
Wei Tang and Matthew Lease. Semi-Supervised Consensus Labeling for Crowdsourcing. ACM SIGIR
Workshop on Crowdsourcing for Information Retrieval (CIR), 2011.
S. Vijayanarasimhan and K. Grauman. Large-Scale Live Active Learning: Training Object Detectors with
Crawled Data and Crowds. CVPR 2011.
Stephen Wolfson and Matthew Lease. Look Before You Leap: Legal Pitfalls of Crowdsourcing. ASIS&T 2011.
81. Crowdsourcing in IR: 2008-2010
O. Alonso, D. Rose, and B. Stewart. “Crowdsourcing for relevance evaluation”, SIGIR Forum, Vol. 42, No. 2.
O. Alonso and S. Mizzaro. “Can we get rid of TREC Assessors? Using Mechanical Turk for … Assessment”. SIGIR Workshop on the Future of IR Evaluation.
P.N. Bennett, D.M. Chickering, A. Mityagin. Learning Consensus Opinion: Mining Data from a Labeling Game. WWW.
G. Kazai, N. Milic-Frayling, and J. Costello. “Towards Methods for the Collective Gathering and Quality Control of Relevance Assessments”, SIGIR.
G. Kazai and N. Milic-Frayling. “… Quality of Relevance Assessments Collected through Crowdsourcing”. SIGIR Workshop on the Future of IR Evaluation.
Law et al. “SearchWar”. HCOMP.
H. Ma, R. Chandrasekar, C. Quirk, and A. Gupta. “Improving Search Engines Using Human Computation Games”, CIKM 2009.
SIGIR Workshop on Crowdsourcing for Search Evaluation.
O. Alonso, R. Schenkel, and M. Theobald. “Crowdsourcing Assessments for XML Ranked Retrieval”, ECIR.
K. Berberich, S. Bedathur, O. Alonso, G. Weikum “A Language Modeling Approach for Temporal Information Needs”, ECIR.
C. Grady and M. Lease. “Crowdsourcing Document Relevance Assessment with Mechanical Turk”. NAACL HLT Workshop on … Amazon's Mechanical Turk.
Grace Hui Yang, Anton Mityagin, Krysta M. Svore, and Sergey Markov. “Collecting High Quality Overlapping Labels at Low Cost”. SIGIR.
G. Kazai. “An Exploration of the Influence that Task Parameters Have on the Performance of Crowds”. CrowdConf.
G. Kazai. “… Crowdsourcing in Building an Evaluation Platform for Searching Collections of Digitized Books”. Workshop on Very Large Digital Libraries (VLDL).
Stephanie Nowak and Stefan Rüger. How Reliable are Annotations via Crowdsourcing? MIR.
Jean-François Paiement, Dr. James G. Shanahan, and Remi Zajac. “Crowdsourcing Local Search Relevance”. CrowdConf.
Maria Stone and Omar Alonso. “A Comparison of On-Demand Workforce with Trained Judges for Web Search Relevance Evaluation”. CrowdConf.
T. Yan, V. Kumar, and D. Ganesan. CrowdSearch: exploiting crowds for accurate real-time image search on mobile phones. MobiSys pp. 77--90, 2010.
82. Crowdsourcing in IR: 2011
WSDM Workshop on Crowdsourcing for Search and Data Mining.
SIGIR Workshop on Crowdsourcing for Information Retrieval.
O. Alonso and R. Baeza-Yates. “Design and Implementation of Relevance Assessments using Crowdsourcing”, ECIR 2011.
Roi Blanco, Harry Halpin, Daniel Herzig, Peter Mika, Jeffrey Pound, Henry Thompson, Thanh D. Tran. “Repeatable and
Reliable Search System Evaluation using Crowd-Sourcing”. SIGIR 2011.
Yen-Ta Huang, An-Jung Cheng, Liang-Chi Hsieh, Winston H. Hsu, Kuo-Wei Chang. “Region-Based Landmark Discovery by
Crowdsourcing Geo-Referenced Photos.” SIGIR 2011.
Hyun Joon Jung, Matthew Lease. “Improving Consensus Accuracy via Z-score and Weighted Voting”. HCOMP 2011.
G. Kasneci, J. Van Gael, D. Stern, and T. Graepel, CoBayes: Bayesian Knowledge Corroboration with Assessors of
Unknown Areas of Expertise, WSDM 2011.
Gabriella Kazai. “In Search of Quality in Crowdsourcing for Search Engine Evaluation”, ECIR 2011.
Gabriella Kazai, Jaap Kamps, Marijn Koolen, Natasa Milic-Frayling. “Crowdsourcing for Book Search Evaluation: Impact of Quality
on Comparative System Ranking.” SIGIR 2011.
Abhimanu Kumar, Matthew Lease. “Learning to Rank From a Noisy Crowd”. SIGIR 2011.
Edith Law, Paul N. Bennett, and Eric Horvitz. “The Effects of Choice in Routing Relevance Judgments”. SIGIR 2011.