Crowdsourcing Best Practices

University of Sheffield, NLP
Marta Sabou, Kalina Bontcheva
Leon Derczynski, Arno Scharl

The Science of Corpus Annotation
• Best practice is well understood for creating linguistic
annotation of consistently high quality by employing, training, and
managing groups of linguistic and/or domain experts
• Necessary in order to ensure reusability and repeatability of results
• The acquired corpora are of very high quality
• Costs are unfortunately also very high: estimated at between $0.36
and $1.0 per annotation (Zaidan and Callison-Burch, 2011; Poesio et
al., 2012)

Goals
What is crowdsourcing?
What is a typical workflow for crowdsourcing NLP tasks?
What are general solutions used by the state of the art?
How do different crowdsourcing genres compare?

Undefined and generally large group
Compared to in-house projects:
• cheaper (by 33%)
• reaches a large number of users
• reaches diverse user groups,
e.g., speakers of rare languages

Genre 1: Mechanised Labour
• Participants (workers) paid a small amount of money to
complete easy tasks (HIT = Human Intelligence Task)

Genre 2: Games with a purpose (GWAPs)

Genre 3: Altruistic Crowdsourcing

Workflow for Crowdsourcing (Corpora)
1. Project Definition
2. Data and UI Preparation
3. Running the Project
4. Evaluation & Corpus Delivery

Definition of semantic relations between concept pairs, e.g.:
Coal is a subcategory of Fossil Fuel

Trade-offs: Cost; Timescale; Worker skills
Small, simple tasks, fast completion => MLab
Complex, large tasks, slower completion => GWAP

• Data distribution: how “micro” is each microtask?
• Long paragraphs hard to digest, worker fatigue
• For most NLP tasks: one sentence corresponds to one task
• Single sentences not always appropriate: e.g. for co-ref
• Task Type
• Selection task: WSD, sentiment analysis, entity
disambiguation, relation typing.
• Sequence marking task: co-reference resolution.
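The "one sentence corresponds to one task" guideline can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the dictionary layout, and the naive regex sentence splitter are all assumptions, and a real project would use a proper tokeniser. The grouping size of 5 mirrors the "5 units" per task mentioned under the reward scheme.

```python
import re
from typing import Dict, List


def split_into_units(doc_id: str, text: str) -> List[Dict[str, object]]:
    """Split a document into sentence-level units.

    Each unit keeps the document id and its position so that the units can
    be merged back into a complete document after annotation.
    """
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [
        {"doc_id": doc_id, "position": i, "text": sentence}
        for i, sentence in enumerate(sentences)
    ]


def group_into_tasks(units: List[Dict[str, object]],
                     units_per_task: int = 5) -> List[List[Dict[str, object]]]:
    """Group units into microtasks of at most `units_per_task` units each."""
    return [units[i:i + units_per_task] for i in range(0, len(units), units_per_task)]


if __name__ == "__main__":
    sample = "Coal is a fossil fuel. It is mined in many countries. Is it renewable? No."
    units = split_into_units("doc-1", sample)
    for task in group_into_tasks(units, units_per_task=3):
        print([u["text"] for u in task])
```

For sequence marking tasks such as co-reference, a single sentence is often too little context, so the unit would have to be a paragraph or a window of sentences instead.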

• Categories per selection type task:
• Experts (Hovy, 2010): max 10, ideally 7
• In crowdsourcing: fewer categories, typically 3-4
• To reduce cognitive load, focus on one category at a time
(e.g., one NE type)
• Number of workers per task:
• Depends on the subjective nature/complexity of the task
• Minimum 3, optimally 5
• Dynamic worker assignment for inconclusive tasks
• Lawson et al. (2010): the number of required labels varies for different aspects of
the same NLP problem. Good results with only 4 annotators for Person NEs,
but 6 are required for Location and 7 for Organization
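Dynamic worker assignment for inconclusive tasks could look roughly like the sketch below. The stopping rule is an assumption made for illustration: collect at least `min_workers` labels, stop once the leading label is ahead by `min_margin` votes, and give up at `max_workers`. The function name and the default values are hypothetical; the right cap depends on the task, as the Lawson et al. (2010) figures suggest.

```python
from collections import Counter
from typing import List


def needs_more_labels(labels: List[str], min_workers: int = 3,
                      max_workers: int = 7, min_margin: int = 2) -> bool:
    """Return True if another worker should be assigned to this unit."""
    if len(labels) < min_workers:
        return True  # always collect the minimum number of judgments
    if len(labels) >= max_workers:
        return False  # budget cap reached; leave the unit for expert review
    counts = Counter(labels).most_common(2)
    top = counts[0][1]
    runner_up = counts[1][1] if len(counts) > 1 else 0
    # Inconclusive while the leading label is not far enough ahead.
    return (top - runner_up) < min_margin


if __name__ == "__main__":
    print(needs_more_labels(["PER", "PER", "PER"]))   # unanimous after 3 labels
    print(needs_more_labels(["PER", "ORG", "PER"]))   # lead of only 1 vote
```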

Reward scheme
• What to reward? - money, game points
• When to reward? - when work entered or after its evaluation
• How much to reward?
• Typically $0.01 to $0.05 per task (5 units)
• No clear, repeatable results for quality:reward relation
• High rewards get it done faster, but not better
• Pilot task gives timings, so pay at least minimum wage
• What to do with “bad” work? - detect at run-time and
exclude
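The "pilot task gives timings, so pay at least minimum wage" point is simple arithmetic: reward per task = hourly wage × seconds per task / 3600. The sketch below rounds up to a whole cent. The function name and the wage figure used in the example are placeholders, not values from the slides.

```python
import math


def min_reward_per_task(seconds_per_task: float, hourly_wage: float) -> float:
    """Smallest per-task payment, rounded up to a whole cent, that meets
    the given hourly wage at the pace measured in the pilot."""
    raw = hourly_wage * seconds_per_task / 3600.0
    # Small epsilon guards against floating-point noise pushing an exact
    # cent value up to the next cent.
    return math.ceil(raw * 100 - 1e-9) / 100


if __name__ == "__main__":
    # Assumed pilot result: 30 seconds per task; assumed target wage: 9.00/hour.
    print(min_reward_per_task(30, 9.00))
```

Using the median pilot time rather than the mean is one reasonable choice here, since a few very slow workers can otherwise inflate the estimate.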

Categories: 10
Players/task: 7
Payment: points awarded based on previously contributed judgments

Categories: 10
Players/task: 10
Payment: $0.05 per 5 units
Players filtered through gold data

Workflow for Crowdsourcing Corpora
1. Project Definition
2. Data and UI Preparation
3. Running the Project
4. Evaluation & Corpus Delivery

• Pre-process the corpus linguistically, as needed, e.g.
• Tokenise text if user needs to select words
• Identify proper names/noun phrases if we want to classify these
• Bring additional context, if needed, e.g. text of user profile from
Twitter; link to Wikipedia page
• For GWAPs:
• Collect interesting input data if possible, i.e., texts that are fun to
read and work on
• Clean input data to remove errors (these will lower player
satisfaction)
• MLab can be used for cleaning the data set

• Build and test the user interfaces
• Easy to medium difficulty in AMT/CF; templates provided for
some task types
• Medium to hard for GWAPs
• Job management interfaces
• Provided in MLab platforms
• Must be built from scratch for GWAPs
• Comparative interface set-up times:
• CF: 2 days; Climate Quiz: 2 months
• OntoPronto (Thaler et al., 2012): 5 months

Example: Job Management Interface

HINT: Add explicitly verifiable questions to the UI:
- help filter out spammers
- force workers to read the task input

Pilot the design, measure performance, try again
• Simple, clear design important
• Binary decision tasks get good results
Run bigger pilot studies with volunteers to test
everything and collect gold units for quality control later

Contributor recruitment:
• MLab - easy, given the platforms’ large worker pools and economic
incentives
• GWAPs - challenging, requires much PR.
• Social network based games allow players to invite friends, leveraging the viral
aspect of social networks
• Multi-channel advertisement: local and national press, science websites,
blogs, bookmarking websites, gaming forums, and social networking
sites
Contributor screening (only in MLab):
• MLab - by country, by skill (e.g., spoken language), by reliability
• MLab - screening through competency tests; answers to gold units

IN-TASK QUALITY CONTROL
Train contributors - through instructions:
• be clear and concise;
• avoid technical jargon;
• provide both positive and negative examples.
Train contributors - through gold data:
• CF - known data units (gold units) hidden in tasks
• When completing a gold unit, a worker is shown the expected answer thus
being trained “on the job”
• Workers who fail a certain percentage of gold units are automatically
excluded from the job
Great opportunity to train workers and amend expert data
Better gold data means better output quality, for the same cost
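Excluding workers who fail a certain percentage of gold units can be sketched as below. This is an illustration under stated assumptions, not the platform's actual mechanism: the function names, the 70% accuracy threshold, and the minimum of 2 gold units seen before a worker can be excluded are all hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Judgment = Tuple[str, str, str]  # (worker_id, unit_id, label)


def gold_stats(judgments: Iterable[Judgment],
               gold: Dict[str, str]) -> Dict[str, Tuple[int, int]]:
    """Return {worker_id: (correct_gold_answers, gold_units_seen)}."""
    stats: Dict[str, Tuple[int, int]] = {}
    for worker, unit, label in judgments:
        if unit not in gold:
            continue  # ordinary unit, no expected answer to check against
        correct, seen = stats.get(worker, (0, 0))
        stats[worker] = (correct + int(label == gold[unit]), seen + 1)
    return stats


def workers_to_exclude(judgments: Iterable[Judgment], gold: Dict[str, str],
                       min_accuracy: float = 0.7,
                       min_gold_seen: int = 2) -> Set[str]:
    """Workers whose accuracy on gold units falls below `min_accuracy`,
    counted only once they have answered `min_gold_seen` gold units."""
    return {
        worker
        for worker, (correct, seen) in gold_stats(judgments, gold).items()
        if seen >= min_gold_seen and correct / seen < min_accuracy
    }


if __name__ == "__main__":
    gold = {"g1": "positive", "g2": "negative"}
    judgments = [
        ("w1", "g1", "positive"), ("w1", "g2", "negative"), ("w1", "u7", "positive"),
        ("w2", "g1", "negative"), ("w2", "g2", "positive"),
    ]
    print(workers_to_exclude(judgments, gold))
```

Reviewing the gold units that many otherwise reliable workers get "wrong" is one way to find and amend faulty expert data.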

Example: CF Instructions

• For large tasks - Multi-batch methodology
• Submit tasks in multiple batches
• Ensure contributor diversity by starting batches at different times
• Needs less gold data
• Deal with worker disputes!

• Evaluate individual contributor inputs to produce final decision
• Majority vote
• Discard inputs from low-trusted contributors (e.g. Hsueh et al. (2009))
• Aggregation:
• Merge individual units from the microtasks (e.g. sentences) into
complete documents, including all crowdsourced markup
• Majority voting; average; collection
• Aggregation strategies:
• Climate Quiz: relation chosen between pairs if it has been voted
by 4 more players than the next most popular relation
• CF - Majority voting; confidence value computed taking into
account worker accuracy

• Evaluate corpus quality
• Compute inter-worker agreement;
• Compute agreement between workers and trusted annotators
• Compare to a gold standard baseline (P/R/F/Acc)
• To facilitate reuse:
• deliver corpus in a widely used format (XCES, CoNLL, GATE XML)
• Share with research community
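The corpus quality measures above can be computed without any external library. The sketch below uses Cohen's kappa for pairwise agreement, which is one common choice and an assumption here (the slides do not name a specific agreement coefficient), plus precision, recall and F1 for one category against a gold standard. Function names and the example labels are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def cohen_kappa(a: List[str], b: List[str]) -> float:
    """Cohen's kappa between two annotators who labelled the same units."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement from each annotator's own label distribution.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


def precision_recall_f1(predicted: List[str], gold: List[str],
                        positive: str) -> Tuple[float, float, float]:
    """Precision, recall and F1 for one category against a gold standard."""
    pairs = list(zip(predicted, gold))
    tp = sum(p == positive and g == positive for p, g in pairs)
    fp = sum(p == positive and g != positive for p, g in pairs)
    fn = sum(p != positive and g == positive for p, g in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = ["pos", "pos", "neg", "neg"]
    crowd = ["pos", "neg", "neg", "neg"]
    print(cohen_kappa(gold, crowd))
    print(precision_recall_f1(crowd, gold, positive="pos"))
```

With more than two annotators per unit, a multi-rater coefficient such as Fleiss' kappa or Krippendorff's alpha would be the usual replacement for the pairwise measure.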

Evaluation of relation selection task:
Comparison with Gold Standard
Same data, different aggregation

Legal and Ethical Issues
1. Acknowledging the Crowd's contribution
S. Cooper, [other authors], and Foldit players: Predicting protein structures
with a multiplayer online game. Nature, 466(7307):756-760, 2010.
2. Ensuring privacy and wellbeing
1. Mechanised labour criticised for low wages, lack of worker rights
2. Majority of workers rely on microtasks as main income source
3. Prevent prolonged use & user exploitation (e.g. daily caps)
3. Licensing and consent
1. Some clearly state the use of Creative Commons licenses
2. General failure to provide informed consent information

Thank you!
Questions?
