Crowdsourcing & Human Computation: Labeling Data & Building Hybrid Systems

Matt Lease
School of Information, University of Texas at Austin
@mattlease, ml@ischool.utexas.edu
Slides: www.slideshare.net/mattlease

Roadmap
• A Quick Example
• Crowd-powered data collection & applications
• Crowdsourcing, Incentives, & Demographics
• Mechanical Turk & Other Platforms
• Designing for Crowds & Statistical QA
• Open Problems
• Broader Considerations & a Darker Side
2

What is Crowdsourcing?
• Let’s start with a simple example!
• Goal
– See a concrete example of real crowdsourcing
– Ground later discussion of abstract concepts
– Provide a specific example with which we will
contrast other forms of crowdsourcing
3

Human Intelligence Tasks (HITs)
4

6
Jane saw the man with the binoculars

Traditional Data Collection
• Setup data collection software / harness
• Recruit participants / annotators / assessors
• Pay a flat fee for experiment or hourly wage
• Characteristics
– Slow
– Expensive
– Difficult and/or Tedious
– Sample Bias…
7

“Hello World” Demo
• Let’s create and run a simple MTurk HIT
• This is a teaser highlighting concepts
– Don’t worry about details; we’ll revisit them
• Goal
– See a concrete example of real crowdsourcing
– Ground our later discussion of abstract concepts
– Provide a specific example with which we will
contrast other forms of crowdsourcing
8

NLP: Snow et al. (EMNLP 2008)
• MTurk annotation for 5 Tasks
– Affect recognition
– Word similarity
– Recognizing textual entailment
– Event temporal ordering
– Word sense disambiguation
• 22K labels for US $26
• High agreement between
consensus labels and
gold-standard labels
11

Computer Vision:
Sorokin & Forsyth (CVPR 2008)
• 4K labels for US $60
12

IR: Alonso et al. (SIGIR Forum 2008)
• MTurk for Information Retrieval (IR)
– Judge relevance of search engine results
• Many follow-on studies (design, quality, cost)
13

User Studies: Kittur, Chi, & Suh (CHI 2008)
• “…make creating believable invalid responses as
effortful as completing the task in good faith.”
14

Remote Usability Testing
• Liu, Bias, Lease, and Kuipers, ASIS&T, 2012
• Remote usability testing via MTurk & CrowdFlower
vs. traditional on-site testing
• Advantages
– More (Diverse) Participants
– High Speed
– Low Cost
• Disadvantages
– Lower Quality Feedback
– Less Interaction
– Greater need for quality control
– Less Focused User Groups
15

Human Subjects Research:
Surveys, Demographics, etc.
• A Guide to Behavioral Experiments
on Mechanical Turk
– W. Mason and S. Suri (2010). SSRN online.
• Crowdsourcing for Human Subjects Research
– L. Schmidt (CrowdConf 2010)
• Crowdsourcing Content Analysis for Behavioral Research:
Insights from Mechanical Turk
– Conley & Tosti-Kharas (2010). Academy of Management
• Amazon's Mechanical Turk: A New Source of
Inexpensive, Yet High-Quality, Data?
– M. Buhrmester et al. (2011). Perspectives… 6(1):3-5.
– see also: Amazon Mechanical Turk Guide for Social Scientists
17

• PhD Thesis, December 2005
• Law & von Ahn, Book, June 2011
18
LUIS VON AHN, CMU

ESP Game (Games With a Purpose)
L. von Ahn and L. Dabbish (2004)
19

reCaptcha
L. von Ahn et al. (2008). In Science.
20

DuoLingo (Launched Nov. 2011)
21

MORE DATA COLLECTION EXAMPLES
22

Crowd Sensing
• Steve Kelling, et al. A Human/Computer Learning
Network to Improve Biodiversity Conservation
and Research. AI Magazine 34.1 (2012): 10.
23

Tracking Sentiment in Online Media
Brew et al., PAIS 2010
• Volunteer-crowd
• Judge in exchange for
access to rich content
• Balance system needs
with user interest
• Daily updates to non-
stationary distribution
24

PHASE 2: FROM DATA COLLECTION
TO HUMAN COMPUTATION
25

Princeton University Press, 2005
• What was old is new
• Crowdsourcing: A New Branch
of Computer Science
– D.A. Grier, March 29, 2011
• Tabulating the heavens:
computing the Nautical
Almanac in 18th-century
England - M. Croarken’03
27
Human Computation

J. Pontin. Artificial Intelligence, With Help From
the Humans. New York Times (March 25, 2007)
The Mechanical Turk
28
Constructed and unveiled in 1770 by Wolfgang von Kempelen (1734–1804)

The Human Processing Unit (HPU)
• Davis et al. (2010)
HPU
29

Human Computation
• Having people do stuff instead of computers
• Investigates use of people to execute certain
computations for which capabilities of current
automated methods are more limited
• Explores the metaphor of computation for
characterizing attributes, capabilities, and
limitations of human task performance
30

APPLYING HUMAN COMPUTATION:
CROWD-POWERED APPLICATIONS
31

32
Crowd-Assisted Search: “Amazon Remembers”

Translation by monolingual speakers
• C. Hu, CHI 2009
33

Soylent: A Word Processor with a Crowd Inside
• Bernstein et al., UIST 2010
34

fold.it
S. Cooper et al. (2010)
Alice G. Walton. Online Gamers Help Solve Mystery of
Critical AIDS Virus Enzyme. The Atlantic, October 8, 2011.
35

PlateMate (Noronha et al., UIST’11)
36

Image Analysis and more: Eatery
37

VizWiz
Bingham et al. (UIST 2010)
38

From Outsourcing to Crowdsourcing
• Take a job traditionally
performed by a known agent
(often an employee)
• Outsource it to an undefined,
generally large group of
people via an open call
• New application of principles
from open source movement
• Evolving & broadly defined ...
43

Crowdsourcing models
• Micro-tasks & citizen science
• Co-Creation
• Open Innovation, Contests
• Prediction Markets
• Crowd Funding and Charity
• “Gamification” (not serious gaming)
• Transparent
• cQ&A, Social Search, and Polling
• Physical Interface/Task
44

What is Crowdsourcing?
• Mechanisms and methodology for directing
crowd action to achieve some goal(s)
– E.g., novel ways of collecting data from crowds
• Powered by internet-connectivity
• Related topics:
– Human computation
– Collective intelligence
– Crowd/Social computing
– Wisdom of Crowds
– People services, Human Clouds, Peer-production, …
45

What is not crowdsourcing?
• Analyzing existing datasets (no matter source)
– Data mining
– Visual analytics
• Use of few people
– Mixed-initiative design
– Active learning
• Conducting a survey or poll… (*)
– Novelty?
46

Crowdsourcing Key Questions
• What are the goals?
– Purposeful directing of human activity
• How can you incentivize participation?
– Incentive engineering
– Who are the target participants?
• Which model(s) are most appropriate?
– How to adapt them to your context and goals?
47

Wisdom of Crowds (WoC)
Requires
• Diversity
• Independence
• Decentralization
• Aggregation
Input: large, diverse sample
(to increase likelihood of overall pool quality)
Output: consensus or selection (aggregation)
48

What do you want to accomplish?
• Create
• Execute task/computation
• Fund
• Innovate and/or discover
• Learn
• Monitor
• Predict
49

Why should your crowd participate?
• Earn Money (real or virtual)
• Have fun (or pass the time)
• Socialize with others
• Obtain recognition or prestige (leaderboards, badges)
• Do Good (altruism)
• Learn something new
• Obtain something else
• Create self-serving resource
Multiple incentives can often operate in parallel (*caveat)
51

Example: Wikipedia
• Obtain recognition or prestige
52

Example: DuoLingo
53

Example:
54

Example: ESP
55

Example: fold.it
56

Example: FreeRice
57

Example: cQ&A
58

Example: reCaptcha
59
Is there an existing human
activity you can harness
for another purpose?

Example: Mechanical Turk
60

Dan Pink – YouTube video
“The Surprising Truth about what Motivates us”
61

Who are
the workers?
• A. Baio, November 2008. The Faces of Mechanical Turk.
• P. Ipeirotis. March 2010.
The New Demographics of Mechanical Turk
• J. Ross, et al. Who are the Crowdworkers?... CHI 2010.
62

MTurk Demographics
• 2008-2009 studies found
less global and diverse
than previously thought
– US
– Female
– Educated
– Bored
– Money is secondary
63

2010 shows increasing diversity
47% US, 34% India, 19% other (P. Ipeirotis, March 2010)
64

How Much to Pay?
• Price commensurate with task effort
– Ex: $0.02 for yes/no answer + $0.02 bonus for optional feedback
• Ethics & market-factors: W. Mason and S. Suri, 2010.
– e.g. non-profit SamaSource involves workers in refugee camps
– Predict right price given market & task: Wang et al. CSDM’11
• Uptake & time-to-completion vs. Cost & Quality
– Too little $$, no interest or slow – too much $$, attract spammers
– Real problem is lack of reliable QA substrate
• Accuracy & quantity
– More pay = more work, not better (W. Mason and D. Watts, 2009)
• Heuristics: start small, watch uptake and bargaining feedback
• Worker retention (“anchoring”)
65
See also: L.B. Chilton et al. KDD-HCOMP 2010.

Does anyone really use it? Yes!
http://www.mturk-tracker.com (P. Ipeirotis’10)
From 1/09 – 4/10, 7M HITs from 10K requesters
worth $500,000 USD (significant under-estimate)
67

MTurk: The Requester
• Sign up with your Amazon account
• Amazon payments
• Purchase prepaid HITs
• There is no minimum or up-front fee
• MTurk collects a 10% commission
• The minimum commission charge is $0.005 per HIT
68
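A rough budget check before publishing a batch can be sketched as follows. It uses only the figures stated on the slides (10% commission, $0.005 minimum commission per HIT, and the $0.02 example reward). The function name and the assumption that the commission applies to every paid assignment are illustrative, not Amazon's published rules.

```python
from decimal import Decimal

COMMISSION_RATE = Decimal("0.10")   # 10% commission, as stated on the slide
MIN_COMMISSION = Decimal("0.005")   # minimum commission per HIT, as stated on the slide


def batch_cost(reward, num_items, labels_per_item=1):
    """Return (worker_pay, commission, total) in dollars for one batch.

    Assumption: the commission is charged once per paid assignment,
    i.e. once per label collected.
    """
    reward = Decimal(str(reward))
    assignments = num_items * labels_per_item
    # The fee is the larger of the percentage commission and the minimum charge.
    fee_per_assignment = max(reward * COMMISSION_RATE, MIN_COMMISSION)
    worker_pay = reward * assignments
    commission = fee_per_assignment * assignments
    return worker_pay, commission, worker_pay + commission


if __name__ == "__main__":
    pay, fee, total = batch_cost(0.02, num_items=1000, labels_per_item=5)
    print(f"workers: ${pay}, commission: ${fee}, total: ${total}")
```

At a $0.02 reward the minimum charge dominates, so under these assumptions the effective commission is 25% rather than 10%.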

MTurk Dashboard
• Three tabs
– Design
– Publish
– Manage
• Design
– HIT Template
• Publish
– Make work available
• Manage
– Monitor progress
69

MTurk API
• Amazon Web Services API
• Rich set of services
• Command line tools
• More flexibility than dashboard
72
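As a minimal sketch of what "integrating via the API" looks like today, the code below only builds the request for one HIT and does not contact Amazon. The parameter names follow the `create_hit` call of the boto3 MTurk client, which postdates these slides and is an assumption here. The title, description, task URL, and timing values are hypothetical.

```python
from xml.sax.saxutils import escape

# Schema identifier for MTurk's ExternalQuestion format (assumed still current).
EXTERNAL_QUESTION_XSD = (
    "http://mturk-requester.s3.amazonaws.com/"
    "AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd"
)


def external_question(task_url, frame_height=600):
    """Wrap a task page hosted on your own server as an ExternalQuestion."""
    return (
        f'<ExternalQuestion xmlns="{EXTERNAL_QUESTION_XSD}">'
        f"<ExternalURL>{escape(task_url)}</ExternalURL>"
        f"<FrameHeight>{frame_height}</FrameHeight>"
        "</ExternalQuestion>"
    )


def hit_request(task_url, reward_usd, labels_per_item):
    """Build keyword arguments for one HIT (all text values are placeholders)."""
    return {
        "Title": "Judge the relevance of a search result",
        "Description": "Read a query and a web page, then answer yes or no.",
        "Keywords": "relevance, search, judgment",
        "Reward": f"{reward_usd:.2f}",            # the API takes the reward as a string
        "MaxAssignments": labels_per_item,        # number of distinct workers per item
        "LifetimeInSeconds": 24 * 60 * 60,        # how long the HIT stays listed
        "AssignmentDurationInSeconds": 10 * 60,   # time allowed per worker
        "Question": external_question(task_url),
    }


if __name__ == "__main__":
    request = hit_request("https://example.org/task?id=1&v=2", 0.02, 5)
    print(request["Question"])
    # With boto3 installed and AWS credentials configured, the request could
    # be published with: boto3.client("mturk").create_hit(**request)
```

Keeping request construction separate from the network call makes it easy to review what will be posted before spending money.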

MTurk Dashboard vs. API
• Dashboard
– Easy to prototype
– Setup and launch an experiment in a few minutes
• API
– Ability to integrate AMT as part of a system
– Ideal if you want to run experiments regularly
– Schedule tasks
73

74
• Multiple Channels
• Gold-based tests
• Only pay for
“trusted” judgments

More Crowd Labor Platforms
• Clickworker
• CloudCrowd
• CloudFactory
• CrowdSource
• DoMyStuff
• Microtask
• MobileWorks (by Anand Kulkarni)
• myGengo
• SmartSheet
• vWorker
• Industry heavy-weights
– Elance
– Liveops
– oDesk
– uTest
• and more…
75

Many Factors Matter!
• Process
– Task design, instructions, setup, iteration
• Choose crowdsourcing platform (or roll your own)
• Human factors
– Payment / incentives, interface and interaction design,
communication, reputation, recruitment, retention
• Quality Control / Data Quality
– Trust, reliability, spam detection, consensus labeling
• Don’t write a paper saying “we collected data from
MTurk & then…” – details of method matter!
76

Kulkarni et al.,
CSCW 2012
Turkomatic
79

CrowdForge: Workers perform a task
or further decompose it
80
Kittur et al., CHI 2011

Kittur et al., CrowdWeaver, CSCW 2012
81

Typical Workflow
• Define and design what to test
• Sample data
• Design the experiment
• Run experiment
• Collect data and analyze results
• Quality control
83

Development Framework
• Incremental approach (from Omar Alonso)
• Measure, evaluate, and adjust as you go
• Suitable for repeatable tasks
84

Survey Design
• One of the most important parts
• Part art, part science
• Instructions are key
• Prepare to iterate
85

Questionnaire Design
• Ask the right questions
• Workers may not be IR experts so don’t
assume the same understanding in terms of
terminology
• Show examples
• Hire a technical writer
– Engineer writes the specification
– Writer communicates
86

UX Design
• Time to apply all those usability concepts
• Generic tips
– Experiment should be self-contained.
– Keep it short and simple. Brief and concise.
– Be very clear with the relevance task.
– Engage with the worker. Avoid boring stuff.
– Always ask for feedback (open-ended question) in
an input box.
87

UX Design - II
• Presentation
• Document design
• Highlight important concepts
• Colors and fonts
• Need to grab attention
• Localization
88

Implementation
• Similar to a UX
• Build a mock up and test it with your team
– Yes, you need to judge some tasks
• Incorporate feedback and run a test on MTurk
with a very small data set
– Time the experiment
– Do people understand the task?
• Analyze results
– Look for spammers
– Check completion times
• Iterate and modify accordingly
89

Implementation – II
• Introduce quality control
– Qualification test
– Gold answers (honey pots)
• Adjust passing grade and worker approval rate
• Run experiment with new settings & same data
• Scale on data
• Scale on workers
90

Other design principles
• Text alignment
• Legibility
• Reading level: complexity of words and sentences
• Attractiveness (worker’s attention & enjoyment)
• Multi-cultural / multi-lingual
• Who is the audience (e.g. target worker community)
– Special needs communities (e.g. simple color blindness)
• Parsimony
• Cognitive load: mental rigor needed to perform task
• Exposure effect
91

The human side
• As a worker
– I hate when instructions are not clear
– I’m not a spammer – I just don’t get what you want
– Boring task
– Good pay is ideal but not the only condition for engagement
• As a requester
– Attrition
– Balancing act: a task that would produce the right results and
is appealing to workers
– I want your honest answer for the task
– I want qualified workers; system should do some of that for me
• Managing crowds and tasks is a daily activity
– more difficult than managing computers
92

When to assess quality of work
• Beforehand (prior to main task activity)
– How: “qualification tests” or similar mechanism
– Purpose: screening, selection, recruiting, training
• During
– How: assess labels as worker produces them
• Like random checks on a manufacturing line
– Purpose: calibrate, reward/penalize, weight
• After
– How: compute accuracy metrics post-hoc
– Purpose: filter, calibrate, weight, retain (HR)
– E.g. Jung & Lease (2011), Tang & Lease (2011), ...
94

How do we measure work quality?
• Compare worker’s label vs.
– Known (correct, trusted) label
– Other workers’ labels
• P. Ipeirotis. Worker Evaluation in Crowdsourcing: Gold Data or
Multiple Workers? Sept. 2010.
– Model predictions of the above
• Model the labels (Ryu & Lease, ASIS&T’11)
• Model the workers (Chen et al., AAAI’10)
• Verify worker’s label
– Yourself
– Tiered approach (e.g. Find-Fix-Verify)
• A. Quinn and B. Bederson’09, Bernstein et al.’10
95

Typical Assumptions
• Objective truth exists
– no minority voice / rare insights
– Can relax this to model “truth distribution”
• Automatic answer comparison/evaluation
– What about free text responses? Hope from NLP…
• Automatic essay scoring
• Translation (BLEU: Papineni, ACL’2002)
• Summarization (Rouge: C.Y. Lin, WAS’2004)
– Have people do it (yourself or find-verify crowd, etc.)
96

Distinguishing Bias vs. Noise
• Ipeirotis (HComp 2010)
• People often have consistent, idiosyncratic
skews in their labels (bias)
– E.g. I like action movies, so they get higher ratings
• Once detected, systematic bias can be
calibrated for and corrected (yeah!)
• Noise, however, seems random & inconsistent
– this is the real issue we want to focus on
97
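One simple way to calibrate a consistent skew can be sketched as follows. This is an illustrative additive-offset model, not the method from Ipeirotis (HComp 2010): each worker's bias is estimated as their average deviation from the per-item mean rating, then subtracted. The data layout and function names are assumptions.

```python
from collections import defaultdict
from statistics import mean


def worker_offsets(ratings):
    """ratings: list of (worker, item, score). Returns {worker: mean offset}."""
    by_item = defaultdict(list)
    for _, item, score in ratings:
        by_item[item].append(score)
    item_mean = {item: mean(scores) for item, scores in by_item.items()}

    # A worker's offset is how far, on average, they sit from the item mean.
    diffs = defaultdict(list)
    for worker, item, score in ratings:
        diffs[worker].append(score - item_mean[item])
    return {worker: mean(d) for worker, d in diffs.items()}


def debias(ratings):
    """Subtract each worker's estimated offset from their scores."""
    offsets = worker_offsets(ratings)
    return [(w, item, score - offsets[w]) for w, item, score in ratings]


if __name__ == "__main__":
    ratings = [("a", "m1", 4), ("b", "m1", 2), ("a", "m2", 5), ("b", "m2", 3)]
    print(worker_offsets(ratings))
    print(debias(ratings))
```

A correction like this only removes the systematic part. Random noise remains and has to be handled by redundancy or by filtering workers.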

Comparing to known answers
• AKA: gold, honey pot, verifiable answer, trap
• Assumes you have known answers
• Cost vs. Benefit
– Producing known answers (experts?)
– % of work spent re-producing them
• Finer points
– Controls against collusion
– What if workers recognize the honey pots?
98
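Scoring workers against honey pots can be sketched as below. The passing grade, the minimum number of gold items, and the data layout are illustrative assumptions; the slides do not prescribe specific thresholds.

```python
def gold_accuracy(labels, gold):
    """labels: {worker: {item: label}}; gold: {item: correct label}.

    Returns {worker: (accuracy, gold_items_answered)}. Workers who saw no
    gold items are omitted because nothing can be said about them yet.
    """
    scores = {}
    for worker, answers in labels.items():
        checked = [item for item in answers if item in gold]
        if not checked:
            continue
        correct = sum(answers[item] == gold[item] for item in checked)
        scores[worker] = (correct / len(checked), len(checked))
    return scores


def trusted_workers(labels, gold, passing_grade=0.7, min_gold=3):
    """Workers with enough gold answers and accuracy at or above the grade."""
    scores = gold_accuracy(labels, gold)
    return {
        worker
        for worker, (accuracy, seen) in scores.items()
        if seen >= min_gold and accuracy >= passing_grade
    }


if __name__ == "__main__":
    gold = {"g1": "rel", "g2": "not", "g3": "rel"}
    labels = {
        "w1": {"g1": "rel", "g2": "not", "g3": "rel", "d1": "rel"},
        "w2": {"g1": "not", "g2": "not", "g3": "not", "d1": "rel"},
        "w3": {"d1": "not"},
    }
    print(gold_accuracy(labels, gold))
    print(trusted_workers(labels, gold))
```

Requiring a minimum number of gold answers avoids rejecting or trusting a worker on the basis of one or two items.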

Comparing to other workers
• AKA: consensus, plurality, redundant labeling
• Well-known metrics for measuring agreement
• Cost vs. Benefit: % of work that is redundant
• Finer points
– Is consensus “truth” or systematic bias of group?
– What if no one really knows what they’re doing?
• Low-agreement across workers indicates problem is with the
task (or a specific example), not the workers
– Risk of collusion
• Sheng et al. (KDD 2008)
99
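Plurality voting over redundant labels can be sketched as below. Ties are returned as `None` rather than broken arbitrarily, so they can be sent to an expert or given more judgments. The data layout is an assumption, and each item is assumed to have at least one label.

```python
from collections import Counter


def consensus(labels_by_item):
    """labels_by_item: {item: [label, ...]} with at least one label per item.

    Returns {item: (winning label or None on a tie, share of the top label)}.
    """
    result = {}
    for item, labels in labels_by_item.items():
        counts = Counter(labels).most_common()
        top_label, top_count = counts[0]
        # A tie exists when the runner-up has as many votes as the leader.
        tied = len(counts) > 1 and counts[1][1] == top_count
        result[item] = (None if tied else top_label, top_count / len(labels))
    return result


if __name__ == "__main__":
    votes = {
        "d1": ["rel", "rel", "not"],
        "d2": ["rel", "not"],
        "d3": ["not", "not", "not"],
    }
    for item, (label, share) in consensus(votes).items():
        print(item, label, round(share, 2))
```

The share of the top label doubles as a crude difficulty signal: items with low shares are candidates for clearer instructions or extra labels.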

Comparing to predicted label
• Ryu & Lease, ASIS&T’11
• Catch-22 extremes
– If model is really bad, why bother comparing?
– If model is really good, why collect human labels?
• Exploit model confidence
– Trust predictions proportional to confidence
– What if model very confident and wrong?
• Active learning
– Time sensitive: Accuracy / confidence changes
100
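Exploiting model confidence can be sketched as a routing rule: accept predictions above a threshold, send the rest to the crowd, and spot-check a sample of the accepted ones to catch the "very confident and wrong" case. The threshold, the sampling rate, and the function names are illustrative assumptions.

```python
def route_by_confidence(predictions, threshold=0.9):
    """predictions: {item: (label, confidence)}.

    Returns (auto, to_crowd): labels accepted from the model, and the items
    that still need human judgments.
    """
    auto, to_crowd = {}, []
    for item, (label, confidence) in predictions.items():
        if confidence >= threshold:
            auto[item] = label
        else:
            to_crowd.append(item)
    return auto, to_crowd


def audit_sample(auto, every=10):
    """Pick every Nth auto-accepted item (in sorted order) for a human check."""
    return sorted(auto)[::every]


if __name__ == "__main__":
    predictions = {"d1": ("rel", 0.97), "d2": ("not", 0.55), "d3": ("rel", 0.91)}
    auto, to_crowd = route_by_confidence(predictions)
    print("accepted:", auto)
    print("to crowd:", to_crowd)
    print("audit:", audit_sample(auto, every=2))
```

Because accuracy and confidence change as the model is retrained, the threshold would need to be revisited over time.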

Compare to predicted worker labels
• Chen et al., AAAI’10
• Avoid inefficiency of redundant labeling
– See also: Dekel & Shamir (COLT’2009)
• Train a classifier for each worker
• For each example labeled by a worker
– Compare to predicted labels for all other workers
• Issues
• Sparsity: workers have to stick around to train model…
• Time-sensitivity: New workers & incremental updates?
101

Methods for measuring agreement
• What to look for
– Agreement, reliability, validity
• Inter-agreement level
– Agreement between judges
– Agreement between judges and the gold set
• Some statistics
– Percentage agreement
– Cohen’s kappa (2 raters)
– Fleiss’ kappa (any number of raters)
– Krippendorff’s alpha
• With majority vote, what if 2 say relevant, 3 say not?
– Use expert to break ties (Kochhar et al, HCOMP’10; GQR)
– Collect more judgments as needed to reduce uncertainty
102
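For two raters, percentage agreement and Cohen's kappa can be computed directly. Kappa corrects observed agreement p_o for the agreement p_e expected by chance from each rater's label frequencies: kappa = (p_o - p_e) / (1 - p_e). The sketch below assumes both raters labeled the same items in the same order; the function names are illustrative.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which two raters gave the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(a)
    p_o = percent_agreement(a, b)
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement: probability both raters pick the same category
    # if each labels independently according to their own frequencies.
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


if __name__ == "__main__":
    rater_1 = ["r", "r", "r", "r", "r", "n", "n", "n", "n", "n"]
    rater_2 = ["r", "r", "r", "r", "n", "n", "n", "n", "n", "r"]
    print(percent_agreement(rater_1, rater_2))
    print(cohens_kappa(rater_1, rater_2))
```

In this example the raters agree on 80% of items, but half of that would be expected by chance, so kappa is noticeably lower than raw agreement.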

Other practical tips
• Sign up as worker and do some HITs
• “Eat your own dog food”
• Monitor discussion forums
• Address feedback (e.g., poor guidelines,
payments, passing grade, etc.)
• Everything counts!
– Overall design only as strong as weakest link
103

Why Eytan Adar hates MTurk Research
(CHI 2011 CHC Workshop)
• Overly-narrow focus on MTurk
– Identify general vs. platform-specific problems
– Academic vs. Industrial problems
• Inattention to prior work in other disciplines
• Turks aren’t Martians
– Just human behavior…
105

What about sensitive data?
• Not all data can be publicly disclosed
– User data (e.g. AOL query log, Netflix ratings)
– Intellectual property
– Legal confidentiality
• Need to restrict who is in your crowd
– Separate channel (workforce) from technology
– Hot question for adoption at enterprise level
106

A Few Open Questions
• How should we balance automation vs.
human computation? Which does what?
• Who’s the right person for the job?
• How do we handle complex tasks? Can we
decompose them into smaller tasks? How?
107

What about ethics?
• Silberman, Irani, and Ross (2010)
– “How should we… conceptualize the role of these
people who we ask to power our computing?”
– Power dynamics between parties
• What are the consequences for a worker
when your actions harm their reputation?
– “Abstraction hides detail”
• Fort, Adda, and Cohen (2011)
– “…opportunities for our community to deliberately
value ethics above cost savings.”
108

Davis et al. (2010) The HPU.
HPU
110

HPU: “Abstraction hides detail”
• Not just turning a mechanical crank
111

Micro-tasks & Task Decomposition
• Small, simple tasks can be completed faster by
reducing extraneous context and detail
– e.g. “Can you name who is in this photo?”
• Current workflow research investigates how to
decompose complex tasks into simpler ones
112
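Decomposing a complex task can be sketched as a partition, map, and reduce pipeline, loosely in the spirit of the workflow systems cited earlier in the deck. Here plain functions stand in for crowd workers. In a real system each step would post micro-tasks and wait for results, and all names below are hypothetical.

```python
def run_workflow(task, partition, map_step, reduce_step):
    """Split a task, solve each piece independently, then combine the pieces."""
    subtasks = partition(task)
    partial_results = [map_step(subtask) for subtask in subtasks]
    return reduce_step(partial_results)


# Stand-ins for crowd workers; each would be a micro-task in practice.
def outline_article(topic):
    """Partition: one worker proposes section headings."""
    return [f"{topic}: history", f"{topic}: platforms"]


def write_paragraph(heading):
    """Map: a different worker writes each section, seeing only its heading."""
    return f"[paragraph about {heading}]"


def merge_paragraphs(paragraphs):
    """Reduce: a final worker stitches the sections together."""
    return "\n\n".join(paragraphs)


if __name__ == "__main__":
    article = run_workflow(
        "crowdsourcing", outline_article, write_paragraph, merge_paragraphs
    )
    print(article)
```

Each map worker sees only a heading, which illustrates the trade-off raised on the following slides: less context makes tasks faster but hides the larger purpose from the worker.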

Context & Informed Consent
• What is the larger task I’m contributing to?
• Who will benefit from it and how?
113

Worker Privacy
Each worker is assigned an alphanumeric ID
114

Requesters see only Worker IDs
115

Issues of Identity Fraud
• Compromised & exploited worker accounts
• Sybil attacks: use of multiple worker identities
• Script bots masquerading as human workers
116
Robert Sim, MSR Faculty Summit’12

Safeguarding Personal Data
• “What are the characteristics of MTurk workers?... the MTurk
system is set up to strictly protect workers’ anonymity….”
117

Amazon profile page URLs use the same IDs used on MTurk!
Paper: MTurk is Not Anonymous
118

What about regulation?
• Wolfson & Lease (ASIS&T 2011)
• As usual, technology is ahead of the law
– employment law
– patent inventorship
– data security and the Federal Trade Commission
– copyright ownership
– securities regulation of crowdfunding
• Take-away: don’t panic, but be mindful
– Understand risks of “just in-time compliance”
119

Digital Dirty Jobs
• NY Times: Policing the Web’s Lurid Precincts
• Gawker: Facebook content moderation
• CultureDigitally: The dirty job of keeping
Facebook clean
• Even LDC annotators reading typical
news articles report stress & nightmares!
120

Jeff Howe Vision vs. Reality?
• Vision of empowering worker freedom:
– work whenever you want for whomever you want
• When $$$ is at stake, populations at risk may
be compelled to perform work by others
– Digital sweat shops? Digital slaves?
– We really don’t know (and need to learn more…)
– Traction? Human Trafficking at MSR Summit’12
121

A DARK SIDE OF CROWDSOURCING
122

Putting the shoe on the other foot:
Spam
123

What about trust?
• Some reports of robot “workers” on MTurk
– E.g. McCreadie et al. (2011)
– Violates terms of service
• Why not just use a captcha?
124

Requester Fraud on MTurk
“Do not do any HITs that involve: filling in
CAPTCHAs; secret shopping; test our web page;
test zip code; free trial; click my link; surveys or
quizzes (unless the requester is listed with a
smiley in the Hall of Fame/Shame); anything
that involves sending a text message; or
basically anything that asks for any personal
information at all—even your zip code. If you
feel in your gut it’s not on the level, IT’S NOT.
Why? Because they are scams...”
126

Defeating CAPTCHAs with crowds
127

Robert Sim, MSR Summit’12
130

Conclusion
• Crowdsourcing is quickly transforming practice
in industry and academia via greater efficiency
• Crowd computing enables a new design space
for applications, augmenting state-of-the-art AI
with human computation to offer
new capabilities and user experiences
• With people at the center of this new computing
paradigm, important research questions
bridge technological & social considerations
131

The Future of Crowd Work
Paper @ ACM CSCW 2013
Kittur, Nickerson, Bernstein, Gerber,
Shaw, Zimmerman, Lease, and Horton
132

Brief Digression: Information Schools
• At 30 universities in N. America, Europe, Asia
• Study human-centered aspects of information
technologies: design, implementation, policy, …
133
www.ischools.org
Wobbrock et
al., 2009

• Aniket Kittur, Jeff Nickerson, Michael S. Bernstein, Elizabeth
Gerber, Aaron Shaw, John Zimmerman, Matthew Lease, and
John J. Horton. The Future of Crowd Work. In ACM Computer
Supported Cooperative Work (CSCW), February 2013.
• Alex Quinn and Ben Bederson. Human Computation: A Survey
and Taxonomy of a Growing Field. In Proceedings of CHI 2011.
• Law and von Ahn (2011). Human Computation
135
Surveys

2013 Crowdsourcing
• 1st year of HComp as AAAI conference
• TREC 2013 Crowdsourcing Track
• Springer’s Information Retrieval (articles online):
Crowdsourcing for Information Retrieval
• 4th CrowdConf (San Francisco, Fall)
• 1st Crowdsourcing Week (Singapore, April)
136

TREC Crowdsourcing Track
• Year 1 (2011) – horizontals
– Task 1 (hci): collect crowd relevance judgments
– Task 2 (stats): aggregate judgments
– Organizers: Kazai & Lease
– Sponsors: Amazon, CrowdFlower
• Year 2 (2012) – content types
– Task 1 (text): judge relevance
– Task 2 (images): judge relevance
– Organizers: Ipeirotis, Kazai, Lease, & Smucker
– Sponsors: Amazon, CrowdFlower, MobileWorks
137

2012 Workshops & Conferences
• AAAI: Human Computation (HComp) (July 22-23)
• AAAI Spring Symposium: Wisdom of the Crowd (March 26-28)
• ACL: 3rd Workshop of the People's Web meets NLP (July 12-13)
• AMCIS: Crowdsourcing Innovation, Knowledge, and Creativity in Virtual Communities (August 9-12)
• CHI: CrowdCamp (May 5-6)
• CIKM: Multimodal Crowd Sensing (CrowdSens) (Oct. or Nov.)
• Collective Intelligence (April 18-20)
• CrowdConf 2012 -- 3rd Annual Conference on the Future of Distributed Work (October 23)
• CrowdNet - 2nd Workshop on Cloud Labor and Human Computation (Jan 26-27)
• EC: Social Computing and User Generated Content Workshop (June 7)
• ICDIM: Emerging Problem-specific Crowdsourcing Technologies (August 23)
• ICEC: Harnessing Collective Intelligence with Games (September)
• ICML: Machine Learning in Human Computation & Crowdsourcing (June 30)
• ICWE: 1st International Workshop on Crowdsourced Web Engineering (CroWE) (July 27)
• KDD: Workshop on Crowdsourcing and Data Mining (August 12)
• Multimedia: Crowdsourcing for Multimedia (Nov 2)
• SocialCom: Social Media for Human Computation (September 6)
• TREC-Crowd: 2nd TREC Crowdsourcing Track (Nov. 14-16)
• WWW: CrowdSearch: Crowdsourcing Web search (April 17)
138

2011 Workshops & Conferences
• AAAI-HCOMP: 3rd Human Computation Workshop (Aug. 8)
• ACIS: Crowdsourcing, Value Co-Creation, & Digital Economy Innovation (Nov. 30 – Dec. 2)
• Crowdsourcing Technologies for Language and Cognition Studies (July 27)
• CHI-CHC: Crowdsourcing and Human Computation (May 8)
• CIKM: BooksOnline (Oct. 24, “crowdsourcing … online books”)
• CrowdConf 2011 -- 2nd Conf. on the Future of Distributed Work (Nov. 1-2)
• Crowdsourcing: Improving … Scientific Data Through Social Networking (June 13)
• EC: Workshop on Social Computing and User Generated Content (June 5)
• ICWE: 2nd International Workshop on Enterprise Crowdsourcing (June 20)
• Interspeech: Crowdsourcing for speech processing (August)
• NIPS: Second Workshop on Computational Social Science and the Wisdom of Crowds (Dec. TBD)
• SIGIR-CIR: Workshop on Crowdsourcing for Information Retrieval (July 28)
• TREC-Crowd: 1st TREC Crowdsourcing Track (Nov. 16-18)
• UbiComp: 2nd Workshop on Ubiquitous Crowdsourcing (Sep. 18)
• WSDM-CSDM: Crowdsourcing for Search and Data Mining (Feb. 9)
139

2011 Tutorials and Keynotes
• By Omar Alonso and/or Matthew Lease
– CLEF: Crowdsourcing for Information Retrieval Experimentation and Evaluation (Sep. 20, Omar only)
– CrowdConf: Crowdsourcing for Research and Engineering
– IJCNLP: Crowd Computing: Opportunities and Challenges (Nov. 10, Matt only)
– WSDM: Crowdsourcing 101: Putting the WSDM of Crowds to Work for You (Feb. 9)
– SIGIR: Crowdsourcing for Information Retrieval: Principles, Methods, and Applications (July 24)
• AAAI: Human Computation: Core Research Questions and State of the Art
– Edith Law and Luis von Ahn, August 7
• ASIS&T: How to Identify Ducks In Flight: A Crowdsourcing Approach to Biodiversity Research and
Conservation
– Steve Kelling, October 10, ebird
• EC: Conducting Behavioral Research Using Amazon's Mechanical Turk
– Winter Mason and Siddharth Suri, June 5
• HCIC: Quality Crowdsourcing for Human Computer Interaction Research
– Ed Chi, June 14-18 (about HCIC)
– Also see his: Crowdsourcing for HCI Research with Amazon Mechanical Turk
• Multimedia: Frontiers in Multimedia Search
– Alan Hanjalic and Martha Larson, Nov 28
• VLDB: Crowdsourcing Applications and Platforms
– Anhai Doan, Michael Franklin, Donald Kossmann, and Tim Kraska
• WWW: Managing Crowdsourced Human Computation
– Panos Ipeirotis and Praveen Paritosh
140

Students
– Catherine Grady (iSchool)
– Hyunjoon Jung (iSchool)
– Jorn Klinger (Linguistics)
– Adriana Kovashka (CS)
– Abhimanu Kumar (CS)
– Hohyon Ryu (iSchool)
– Wei Tang (CS)
– Stephen Wolfson (iSchool)
Matt Lease - ml@ischool.utexas.edu - @mattlease
Thank You!
141
ir.ischool.utexas.edu/crowd

More Books
July 2010, Kindle-only: “This book introduces you to the
top crowdsourcing sites and outlines step by step with
photos the exact process to get started as a requester on
Amazon Mechanical Turk.”
142

Resources
A Few Blogs
• Behind Enemy Lines (P.G. Ipeirotis, NYU)
• Deneme: a Mechanical Turk experiments blog (Greg Little, MIT)
• CrowdFlower Blog
• http://experimentalturk.wordpress.com
• Jeff Howe
A Few Sites
• The Crowdsortium
• Crowdsourcing.org
• CrowdsourceBase (for workers)
• Daily Crowdsource
MTurk Forums and Resources
• Turker Nation: http://turkers.proboards.com
• http://www.turkalert.com (and its blog)
• Turkopticon: report/avoid shady requesters
• Amazon Forum for MTurk
143

Bibliography
• J. Barr and L. Cabrera. “AI gets a Brain”, ACM Queue, May 2006.
• M. Bernstein et al. Soylent: A Word Processor with a Crowd Inside. UIST 2010. Best Student Paper award.
• B.B. Bederson, C. Hu, and P. Resnik. Translation by Iterative Collaboration between Monolingual Users. Proceedings of Graphics
Interface (GI 2010), 39-46.
• N. Bradburn, S. Sudman, and B. Wansink. Asking Questions: The Definitive Guide to Questionnaire Design, Jossey-Bass, 2004.
• C. Callison-Burch. “Fast, Cheap, and Creative: Evaluating Translation Quality Using Amazon’s Mechanical Turk”, EMNLP 2009.
• P. Dai, Mausam, and D. Weld. “Decision-Theoretic Control of Crowd-Sourced Workflows”, AAAI, 2010.
• J. Davis et al. “The HPU”, IEEE Computer Vision and Pattern Recognition Workshop on Advancing Computer Vision with Human
in the Loop (ACVHL), June 2010.
• M. Gashler, C. Giraud-Carrier, T. Martinez. Decision Tree Ensemble: Small Heterogeneous Is Better Than Large Homogeneous, ICMLA 2008.
• D. A. Grier. When Computers Were Human. Princeton University Press, 2005. ISBN 0691091579
• S. Hacker and L. von Ahn. “Matchin: Eliciting User Preferences with an Online Game”, CHI 2009.
• J. Heer, M. Bostock. “Crowdsourcing Graphical Perception: Using Mechanical Turk to Assess Visualization Design”, CHI 2010.
• P. Heymann and H. Garcia-Molina. “Human Processing”, Technical Report, Stanford Info Lab, 2010.
• J. Howe. “Crowdsourcing: Why the Power of the Crowd Is Driving the Future of Business”. Crown Business, New York, 2008.
• P. Hsueh, P. Melville, V. Sindhwani. “Data Quality from Crowdsourcing: A Study of Annotation Selection Criteria”. NAACL HLT
Workshop on Active Learning and NLP, 2009.
• B. Huberman, D. Romero, and F. Wu. “Crowdsourcing, attention and productivity”. Journal of Information Science, 2009.
• P.G. Ipeirotis. The New Demographics of Mechanical Turk. March 9, 2010. PDF and Spreadsheet.
• P.G. Ipeirotis, R. Chandrasekar and P. Bennett. Report on the human computation workshop. SIGKDD Explorations v11 no 2 pp. 80-83, 2010.
• P.G. Ipeirotis. Analyzing the Amazon Mechanical Turk Marketplace. CeDER-10-04 (Sept. 11, 2010)
144

Bibliography (2)
• A. Kittur, E. Chi, and B. Suh. “Crowdsourcing user studies with Mechanical Turk”, CHI 2008.
• Aniket Kittur, Boris Smus, Robert E. Kraut. CrowdForge: Crowdsourcing Complex Work. CHI 2011
• Adriana Kovashka and Matthew Lease. “Human and Machine Detection of … Similarity in Art”. CrowdConf 2010.
• K. Krippendorff. “Content Analysis”, Sage Publications, 2003
• G. Little, L. Chilton, M. Goldman, and R. Miller. “TurKit: Tools for Iterative Tasks on Mechanical Turk”, HCOMP 2009.
• T. Malone, R. Laubacher, and C. Dellarocas. Harnessing Crowds: Mapping the Genome of Collective Intelligence.
2009.
• W. Mason and D. Watts. “Financial Incentives and the ’Performance of Crowds’”, HCOMP Workshop at KDD 2009.
• J. Nielsen. “Usability Engineering”, Morgan Kaufmann, 1994.
• A. Quinn and B. Bederson. “A Taxonomy of Distributed Human Computation”, Technical Report HCIL-2009-23, 2009
• J. Ross, L. Irani, M. Six Silberman, A. Zaldivar, and B. Tomlinson. “Who are the Crowdworkers?: Shifting
Demographics in Amazon Mechanical Turk”. CHI 2010.
• F. Scheuren. “What is a Survey” (http://www.whatisasurvey.info) 2004.
• R. Snow, B. O’Connor, D. Jurafsky, and A. Y. Ng. “Cheap and Fast But is it Good? Evaluating Non-Expert Annotations
for Natural Language Tasks”. EMNLP-2008.
• V. Sheng, F. Provost, P. Ipeirotis. “Get Another Label? Improving Data Quality … Using Multiple, Noisy Labelers”
KDD 2008.
• S. Weber. “The Success of Open Source”, Harvard University Press, 2004.
• L. von Ahn. Games with a purpose. Computer, 39 (6), 92–94, 2006.
• L. von Ahn and L. Dabbish. “Designing Games with a purpose”. CACM, Vol. 51, No. 8, 2008.
145

Bibliography (3)
• Shuo Chen et al. What if the Irresponsible Teachers Are Dominating? A Method of Training on Samples and
Clustering on Teachers. AAAI 2010.
• Paul Heymann, Hector Garcia-Molina: Turkalytics: analytics for human computation. WWW 2011.
• Florian Laws, Christian Scheible and Hinrich Schütze. Active Learning with Amazon Mechanical Turk.
EMNLP 2011.
• C.Y. Lin. Rouge: A package for automatic evaluation of summaries. Proceedings of the workshop on text
summarization branches out (WAS), 2004.
• C. Marshall and F. Shipman “The Ownership and Reuse of Visual Media”, JCDL, 2011.
• Hohyon Ryu and Matthew Lease. Crowdworker Filtering with Support Vector Machine. ASIS&T 2011.
• Wei Tang and Matthew Lease. Semi-Supervised Consensus Labeling for Crowdsourcing. ACM SIGIR
Workshop on Crowdsourcing for Information Retrieval (CIR), 2011.
• S. Vijayanarasimhan and K. Grauman. Large-Scale Live Active Learning: Training Object Detectors with
Crawled Data and Crowds. CVPR 2011.
• Stephen Wolfson and Matthew Lease. Look Before You Leap: Legal Pitfalls of Crowdsourcing. ASIS&T 2011.
146

Recent Work
• Della Penna, N, and M D Reid. (2012). “Crowd & Prejudice: An Impossibility Theorem for Crowd Labelling without a Gold
Standard.” in Proceedings of Collective Intelligence. Arxiv preprint arXiv:1204.3511.
• Demartini, Gianluca, D.E. Difallah, and P. Cudre-Mauroux. (2012). “ZenCrowd: leveraging probabilistic reasoning and
crowdsourcing techniques for large-scale entity linking.” 21st Annual Conference on the World Wide Web (WWW).
• Donmez, Pinar, Jaime Carbonell, and Jeff Schneider. (2010). “A probabilistic framework to learn from multiple
annotators with time-varying accuracy.” in SIAM International Conference on Data Mining (SDM), 826-837.
• Donmez, Pinar, Jaime Carbonell, and Jeff Schneider. (2009). “Efficiently learning the accuracy of labeling sources for
selective sampling.” in Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and
data mining (KDD), 259-268.
• Fort, K., Adda, G., and Cohen, K. (2011). Amazon Mechanical Turk: Gold mine or coal mine? Computational
Linguistics, 37(2):413–420.
• Ghosh, A, Satyen Kale, and Preston McAfee. (2012). “Who Moderates the Moderators? Crowdsourcing Abuse Detection
in User-Generated Content.” in Proceedings of the 12th ACM conference on Electronic commerce.
• Ho, C J, and J W Vaughan. (2012). “Online Task Assignment in Crowdsourcing Markets.” in Twenty-Sixth AAAI Conference
on Artificial Intelligence.
• Jung, Hyun Joon, and Matthew Lease. (2012). “Inferring Missing Relevance Judgments from Crowd Workers via
Probabilistic Matrix Factorization.” in Proceedings of the 35th international ACM SIGIR conference on Research and
development in information retrieval.
• Kamar, E, S Hacker, and E Horvitz. (2012). “Combining Human and Machine Intelligence in Large-scale Crowdsourcing.” in
Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems (AAMAS).
• Karger, D R, S Oh, and D Shah. (2011). “Budget-optimal task allocation for reliable crowdsourcing systems.” Arxiv preprint
arXiv:1110.3564.
• Kazai, Gabriella, Jaap Kamps, and Natasa Milic-Frayling. (2012). “An Analysis of Human Factors and Label Accuracy in
Crowdsourcing Relevance Judgments.” Springer's Information Retrieval Journal: Special Issue on Crowdsourcing.
147

Recent Work (2)
• Lin, C.H. and Mausam and Weld, D.S. (2012). “Crowdsourcing Control: Moving Beyond Multiple Choice.” in
Proceedings of the 4th Human Computation Workshop (HCOMP) at AAAI.
• Liu, C, and Y M Wang. (2012). “TrueLabel + Confusions: A Spectrum of Probabilistic Models in Analyzing Multiple
Ratings.” in Proceedings of the 29th International Conference on Machine Learning (ICML).
• Liu, Di, Randolph Bias, Matthew Lease, and Rebecca Kuipers. (2012). “Crowdsourcing for Usability Testing.” in
Proceedings of the 75th Annual Meeting of the American Society for Information Science and Technology (ASIS&T).
• Ramesh, A, A Parameswaran, Hector Garcia-Molina, and Neoklis Polyzotis. (2012). Identifying Reliable Workers Swiftly.
• Raykar, Vikas, Yu, S., Zhao, L.H., Valadez, G.H., Florin, C., Bogoni, L., and Moy, L. (2010). “Learning From Crowds.” Journal
of Machine Learning Research 11:1297-1322.
• Raykar, Vikas, Yu, S., Zhao, L.H., Jerebko, A., Florin, C., Valadez, G.H., Bogoni, L., and Moy, L. (2009). “Supervised
learning from multiple experts: whom to trust when everyone lies a bit.” in Proceedings of the 26th Annual
International Conference on Machine Learning (ICML), 889-896.
• Raykar, Vikas C, and Shipeng Yu. (2012). “Eliminating Spammers and Ranking Annotators for Crowdsourced Labeling
Tasks.” Journal of Machine Learning Research 13:491-518.
• Wauthier, Fabian L., and Michael I. Jordan. (2012). “Bayesian Bias Mitigation for Crowdsourcing.” in Advances in neural
information processing systems (NIPS).
• Weld, D.S., Mausam, and Dai, P. (2011). “Execution control for crowdsourcing.” in Proceedings of the 24th ACM
symposium adjunct on User interface software and technology (UIST).
• Weld, D.S., Mausam, and Dai, P. (2011). “Human Intelligence Needs Artificial Intelligence.” in Proceedings of the 3rd
Human Computation Workshop (HCOMP) at AAAI.
• Welinder, Peter, Steve Branson, Serge Belongie, and Pietro Perona. (2010). “The Multidimensional Wisdom of
Crowds.” in Advances in Neural Information Processing Systems (NIPS), 2424-2432.
• Welinder, Peter, and Pietro Perona. (2010). “Online crowdsourcing: rating annotators and obtaining cost-effective
labels.” in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 25-32.
• Whitehill, J, P Ruvolo, T Wu, J Bergsma, and J Movellan. (2009). “Whose Vote Should Count More: Optimal Integration
of Labels from Labelers of Unknown Expertise.” in Advances in Neural Information Processing Systems (NIPS).
• Yan, Y, and R Rosales. (2011). “Active learning from crowds.” in Proceedings of the 28th Annual International
Conference on Machine Learning (ICML).
148
