1. Cheap and Fast - But is it Good?
Evaluating Nonexpert Annotations
for Natural Language Tasks
Rion Snow, Brendan O’Connor, Daniel Jurafsky, Andrew Y. Ng
2. The primacy of data
(Banko and Brill, 2001):
Scaling to Very Very Large Corpora
for Natural Language Disambiguation
3. Datasets drive research
Penn Treebank => statistical parsing
PropBank => semantic role labeling
WordNet, SemCor => word sense disambiguation
Switchboard => speech recognition
Pascal RTE => textual entailment
UN Parallel Text => statistical machine translation
4. The advent of human
computation
• Open Mind Common Sense (Singh et al., 2002)
• Games with a Purpose (von Ahn and Dabbish, 2004)
• Online Word Games (Vickrey et al., 2008)
6. Using AMT for dataset
creation
• Su et al. (2007): name resolution, attribute extraction
• Nakov (2008): paraphrasing noun compounds
• Kaisser and Lowe (2008): sentence-level QA annotation
• Kaisser et al. (2008): customizing QA summary length
• Zaenen (2008): evaluating RTE agreement
7. Using AMT is cheap
Paper Labels Cents/Label
Su et al. (2007) 10,500 1.5
Nakov (2008) 19,018 unreported
Kaisser and Lowe (2008) 24,321 2.0
Kaisser et al. (2008) 45,300 3.7
Zaenen (2008) 4,000 2.0
9. But is it good?
• Objective: compare the quality of nonexpert annotations on NLP tasks against gold-standard, expert-annotated data
• Method: pick 5 standard datasets and relabel each data point with 10 new annotations
• Compare Turker agreement to each dataset’s reported expert interannotator agreement
10. Tasks
• Affect recognition (Strapparava and Mihalcea, 2007): fear(“Tropical storm forms in Atlantic”) > fear(“Goal delight for Sheva”)
• Word similarity (Miller and Charles, 1991): sim(boy, lad) > sim(rooster, noon)
• Textual entailment (Dagan et al., 2006): if “Microsoft was established in Italy in 1985”, then “Microsoft was established in 1985”?
• WSD (Pradhan et al., 2007): “a bass on the line” vs. “a funky bass line”
• Temporal annotation (Pustejovsky et al., 2003): ran happens before fell in “The horse ran past the barn fell.”
13. Interannotator Agreement
Emotion     1-Expert ITA
Anger       0.459
Disgust     0.583
Fear        0.711
Joy         0.596
Sadness     0.645
Surprise    0.464
Valence     0.844
All         0.603

• 6 total experts.
• One expert’s ITA is calculated as the average of Pearson correlations from each annotator to the average of the other 5 annotators (sketched below).
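A minimal sketch of the leave-one-out expert ITA described above, assuming expert_scores is a (num_experts x num_items) array of per-headline scores for one emotion (the array name and shape are illustrative, not from the slides):

```python
import numpy as np
from scipy.stats import pearsonr

def expert_ita(expert_scores):
    """Leave-one-out ITA: correlate each expert with the average of the
    other experts, then average the resulting Pearson correlations."""
    expert_scores = np.asarray(expert_scores, dtype=float)
    correlations = []
    for i in range(expert_scores.shape[0]):
        others_avg = np.delete(expert_scores, i, axis=0).mean(axis=0)
        r, _ = pearsonr(expert_scores[i], others_avg)
        correlations.append(r)
    return float(np.mean(correlations))
```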
14. Nonexpert ITA
We average over k annotations to create a single “proto-labeler”. We plot the ITA of this proto-labeler for up to 10 annotations and compare to the average single-expert ITA.
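A minimal sketch of the proto-labeler curve, assuming nonexpert_scores is a (num_annotators x num_items) array and expert_avg is the averaged expert score per item; for simplicity it averages the first k annotators rather than sampling k-sized subsets (the names and that simplification are assumptions, not from the slides):

```python
import numpy as np
from scipy.stats import pearsonr

def proto_labeler_curve(nonexpert_scores, expert_avg, max_k=10):
    """For each k, average k nonexpert annotations into a single
    "proto-labeler" and correlate it with the averaged expert scores."""
    nonexpert_scores = np.asarray(nonexpert_scores, dtype=float)
    curve = []
    for k in range(1, max_k + 1):
        proto = nonexpert_scores[:k].mean(axis=0)
        r, _ = pearsonr(proto, expert_avg)
        curve.append((k, r))
    return curve
```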
15. Interannotator Agreement
Emotion     1-Expert ITA    10-Nonexpert ITA
Anger       0.459           0.675
Disgust     0.583           0.746
Fear        0.711           0.689
Joy         0.596           0.632
Sadness     0.645           0.776
Surprise    0.464           0.496
Valence     0.844           0.669
All         0.603           0.694

[Plots: proto-labeler correlation vs. number of nonexpert annotators (2-10), one panel per emotion: anger, disgust, fear, joy, sadness, surprise]
Number of nonexpert annotators required to match expert ITA, on average: 4
17. Error Analysis: WSD
Only 1 “mistake” out of 177 labels:
“The Egyptian president said he would visit Libya today...”
SemEval Task 17 marks this as the “executive officer of a firm” sense, while Turkers voted for the “head of a country” sense.
18. Error Analysis: RTE
~10 disagreements out of 100:
• Bob Carpenter: “Over half of the residual
disagreements between the Turker annotations and
the gold standard were of this highly suspect
nature and some were just wrong.”
• Bob Carpenter’s full analysis is available at “Fool’s Gold Standard”, http://lingpipe-blog.com/
Close Examples
T: “A car bomb that exploded outside a U.S. military base near Beiji, killed 11 Iraqis.”
H: “A car bomb exploded outside a U.S. base in the northern town of Beiji, killing 11 Iraqis.”
Labeled “TRUE” in PASCAL RTE-1; Turkers vote 6-4 “FALSE”.

T: “Google files for its long awaited IPO.”
H: “Google goes public.”
Labeled “TRUE” in PASCAL RTE-1; Turkers vote 6-4 “FALSE”.
19. Weighting Annotators
• There are a small number of very prolific, very
noisy annotators. If we plot each annotator:
[Scatter plot: per-annotator accuracy (0.4-1.0) vs. number of annotations (0-800), RTE task]
• We should be able to do better than majority voting.
20. Weighting Annotators
• To infer the true value x_i, we weight each response y_i^w from annotator w using a small gold standard training set.
• We estimate annotator responses from 5% of the gold standard test set, and evaluate with 20-fold CV.
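A minimal sketch of one way to do gold-calibrated weighting for a binary task like RTE: per-annotator vote likelihoods are estimated on the small gold subset and combined as posterior log-odds (the Laplace smoothing, data structures, and names here are illustrative assumptions, not the paper’s exact estimator):

```python
import math
from collections import defaultdict

def estimate_vote_probs(gold_labels, annotations):
    """Estimate P(vote | true label) per annotator from gold items.
    gold_labels: {item: 0/1}; annotations: {item: {annotator: 0/1}}."""
    counts = defaultdict(lambda: [[1, 1], [1, 1]])  # [true][vote], add-one smoothed
    for item, true in gold_labels.items():
        for w, vote in annotations.get(item, {}).items():
            counts[w][true][vote] += 1
    return {w: [[c[t][v] / sum(c[t]) for v in (0, 1)] for t in (0, 1)]
            for w, c in counts.items()}

def weighted_vote(votes, probs, prior=0.5):
    """Combine one item's votes {annotator: 0/1} via log-odds for label 1 vs. 0."""
    log_odds = math.log(prior / (1.0 - prior))
    for w, v in votes.items():
        p = probs.get(w, [[0.5, 0.5], [0.5, 0.5]])  # unseen annotator: uninformative
        log_odds += math.log(p[1][v] / p[0][v])
    return 1 if log_odds > 0 else 0
```

Unweighted majority voting is the special case where every annotator gets the same weight.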
22. Cost Summary
Task                   Labels   Cost (USD)   Time (hours)   Labels/USD   Labels/Hour
Affect Recognition       7000   $2.00          5.93         3500         1180.4
Word Similarity           300   $0.20          0.17         1500         1724.1
Textual Entailment       8000   $8.00         89.3          1000           89.59
Temporal Annotation      4620   $13.86        39.9           333.3        115.85
WSD                      1770   $1.76          8.59         1005.7        206.1
All                     21690   $25.82       143.9           840.0        150.7
23. In Summary
• All collected data and annotator
instructions are available at:
http://ai.stanford.edu/~rion/annotations
• Summary blog post and comments on
the Dolores Labs blog:
http://blog.doloreslabs.com
25. Training systems on
nonexpert annotations
• A simple affect recognition classifier trained on the averaged nonexpert votes outperforms one trained on a single expert annotation.
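A minimal sketch of that comparison, using a bag-of-words logistic regression as a stand-in classifier; headlines, the two score arrays, and the 50-point binarization threshold are illustrative assumptions, not the slide’s exact setup:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def affect_classifier_score(headlines, scores, threshold=50):
    """Binarize 0-100 emotion scores at an assumed threshold and
    cross-validate a bag-of-words logistic regression."""
    X = CountVectorizer().fit_transform(headlines)
    y = (np.asarray(scores) >= threshold).astype(int)
    return cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()

# Same headlines, two label sources (hypothetical arrays):
# affect_classifier_score(headlines, nonexpert_scores.mean(axis=0))  # averaged nonexpert votes
# affect_classifier_score(headlines, single_expert_scores)           # one expert's labels
```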
26. Where are Turkers?
United States 77.1%
India 5.3%
Philippines 2.8%
Canada 2.8%
UK 1.9%
Germany 0.8%
Italy 0.5%
Netherlands 0.5%
Portugal 0.5%
Australia 0.4%
Remaining 7.3% divided among 78 countries / territories
Analysis by Dolores Labs
27. Who are Turkers?
[Charts: gender, age, education, annual income]
“Mechanical Turk: The Demographics”, Panos Ipeirotis, NYU
behind-the-enemy-lines.blogspot.com
28. Why are Turkers?
A. To Kill Time
B. Fruitful way to spend free time
C. Income purposes
D. Pocket change/extra cash
E. For entertainment
F. Challenge, self-competition
G. Unemployed, no regular job, part-time job
H. To sharpen / keep the mind sharp
I. Learn English
“Why People Participate on Mechanical Turk, Now Tabulated”, Panos Ipeirotis, NYU
behind-the-enemy-lines.blogspot.com
29. How much does AMT pay?
“How Much Turking Pays?”, Panos Ipeirotis, NYU
behind-the-enemy-lines.blogspot.com
35. Affect Recognition
We label 100 headlines for each of 7 emotions.
We pay 4 cents for 20 headlines (140 total labels).
Total cost: $2.00
Time to complete: 5.94 hrs
36. Example Task: Word Similarity
30 word pairs (Rubenstein and Goodenough, 1965)
We pay 10 Turkers 2 cents apiece to score all 30 word pairs.
Total cost: $0.20
Time to complete: 10.4 minutes
38. • Comparison against multiple annotators
• (graphs)
• avg. number of nonexperts needed to match one expert: 4
39. Datasets lead the way
WSJ + syntactic annotation = Penn Treebank => statistical parsing
Brown corpus + sense labeling = SemCor => WSD
Penn Treebank + role labels = PropBank => SRL
political speeches + translations = United Nations parallel corpora => statistical machine translation
more: RTE, Timebank, ACE/MUC, etc...
40. Datasets drive research
Penn Treebank => statistical parsing
PropBank => semantic role labeling
WordNet, SemCor => word sense disambiguation
Switchboard => speech recognition
Enron E-mail Corpus => social network analysis
UN Parallel Text => statistical MT
Pascal RTE => textual entailment