Rls For Emnlp 2008


  1. Cheap and Fast - But is it Good? Evaluating Nonexpert Annotations for Natural Language Tasks. Rion Snow, Brendan O’Connor, Daniel Jurafsky, Andrew Y. Ng
  2. The primacy of data: Banko and Brill (2001), Scaling to Very Very Large Corpora for Natural Language Disambiguation
  3. Datasets drive research: statistical parsing (Penn Treebank), semantic role labeling (PropBank), word sense disambiguation (WordNet, SemCor), speech recognition (Switchboard), textual entailment (Pascal RTE), statistical machine translation (UN Parallel Text)
  4. The advent of human computation • Open Mind Common Sense (Singh et al., 2002) • Games with a Purpose (von Ahn and Dabbish, 2004) • Online Word Games (Vickrey et al., 2008)
  5. Amazon Mechanical Turk (mturk.com). But what if your task isn’t “fun”?
  6. Using AMT for dataset creation • Su et al. (2007): name resolution, attribute extraction • Nakov (2008): paraphrasing noun compounds • Kaisser and Lowe (2008): sentence-level QA annotation • Kaisser et al. (2008): customizing QA summary length • Zaenen (2008): evaluating RTE agreement
  7. Using AMT is cheap. Labels collected and cost per label:
     • Su et al. (2007): 10,500 labels, 1.5 cents/label
     • Nakov (2008): 19,018 labels, cost unreported
     • Kaisser and Lowe (2008): 24,321 labels, 2.0 cents/label
     • Kaisser et al. (2008): 45,300 labels, 3.7 cents/label
     • Zaenen (2008): 4,000 labels, 2.0 cents/label
  8. And it’s fast... blog.doloreslabs.com
  9. But is it good? • Objective: compare nonexpert annotation quality on NLP tasks against gold-standard, expert-annotated data • Method: pick 5 standard datasets and relabel each data point with 10 new annotations • Compare Turker agreement with each dataset’s reported expert interannotator agreement
  10. Tasks
      • Affect recognition: fear(“Tropical storm forms in Atlantic”) > fear(“Goal delight for Sheva”) (Strapparava and Mihalcea, 2007)
      • Word similarity: sim(boy, lad) > sim(rooster, noon) (Miller and Charles, 1991)
      • Textual entailment: if “Microsoft was established in Italy in 1985”, then “Microsoft was established in 1985”? (Dagan et al., 2006)
      • WSD: “a bass on the line” vs. “a funky bass line” (Pradhan et al., 2007)
      • Temporal annotation: ran happens before fell in “The horse ran past the barn fell.” (Pustejovsky et al., 2003)
  11. Tasks: expert labelers, unique examples, expert interannotator agreement (ITA), and answer type:
      • Affect Recognition: 6 labelers, 700 examples, ITA 0.603, numeric
      • Word Similarity: 1 labeler, 30 examples, ITA 0.958, numeric
      • Textual Entailment: 1 labeler, 800 examples, ITA 0.91, binary
      • Temporal Annotation: 1 labeler, 462 examples, ITA unknown, binary
      • WSD: 1 labeler, 177 examples, ITA unknown, ternary
  12. Affect Recognition
  13. Interannotator Agreement. 6 total experts; one expert’s ITA is calculated as the average of Pearson correlations from each annotator to the average of the other 5 annotators. Single-expert ITA (1-E ITA) per emotion:
      • Anger: 0.459
      • Disgust: 0.583
      • Fear: 0.711
      • Joy: 0.596
      • Sadness: 0.645
      • Surprise: 0.464
      • Valence: 0.844
      • All: 0.603
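The leave-one-out ITA described on this slide is easy to reproduce. Below is a minimal sketch, assuming the six experts’ numeric scores are stored in a headlines-by-annotators array; the array layout and variable names are illustrative, not from the slides.

```python
import numpy as np
from scipy.stats import pearsonr

def expert_ita(labels):
    """Average leave-one-out Pearson correlation.

    labels: array of shape (n_items, n_annotators); each column holds
    one expert's numeric scores for every headline.
    """
    n_items, n_annotators = labels.shape
    correlations = []
    for a in range(n_annotators):
        # Average of the other annotators' scores for each item.
        others = np.delete(labels, a, axis=1).mean(axis=1)
        r, _ = pearsonr(labels[:, a], others)
        correlations.append(r)
    return float(np.mean(correlations))

# Toy usage: 6 experts scoring 100 headlines on a 0-100 scale.
rng = np.random.default_rng(0)
toy = rng.integers(0, 101, size=(100, 6)).astype(float)
print(expert_ita(toy))
```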
  14. Nonexpert ITA. We average over k annotations to create a single “proto-labeler”. We plot the ITA of this proto-labeler for up to 10 annotations and compare it to the average single-expert ITA.
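For the numeric tasks, the proto-labeler curve can be sketched as below. This is a minimal reading of the slide: it assumes the nonexpert labels sit in an items-by-annotators array and that the expert reference is the per-item mean of the expert scores; the random subsampling of k annotators is our interpretation of “average over k annotations”, and all names are illustrative.

```python
import numpy as np
from scipy.stats import pearsonr

def proto_labeler_ita(nonexpert, expert_avg, k, n_draws=100, seed=0):
    """ITA of a k-annotator proto-labeler against the expert average.

    nonexpert: array (n_items, n_annotators) of nonexpert numeric labels.
    expert_avg: array (n_items,) with the averaged expert score per item.
    For each draw, sample k nonexpert columns, average them into one
    proto-labeler, and correlate with the expert average; return the mean r.
    """
    rng = np.random.default_rng(seed)
    n_items, n_annotators = nonexpert.shape
    rs = []
    for _ in range(n_draws):
        cols = rng.choice(n_annotators, size=k, replace=False)
        proto = nonexpert[:, cols].mean(axis=1)
        r, _ = pearsonr(proto, expert_avg)
        rs.append(r)
    return float(np.mean(rs))

# Curve for k = 1..10, as plotted on the following slides:
# [proto_labeler_ita(nonexpert, expert_avg, k) for k in range(1, 11)]
```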
  15. Interannotator Agreement: affect recognition. [Per-emotion plots of proto-labeler correlation vs. number of nonexpert annotators, 1 to 10.] Single-expert ITA (1-E ITA) vs. 10-nonexpert ITA (10-N ITA):
      • Anger: 0.459 vs. 0.675
      • Disgust: 0.583 vs. 0.746
      • Fear: 0.711 vs. 0.689
      • Joy: 0.596 vs. 0.632
      • Sadness: 0.645 vs. 0.776
      • Surprise: 0.464 vs. 0.496
      • Valence: 0.844 vs. 0.669
      • All: 0.603 vs. 0.694
      Number of nonexpert annotators required to match expert ITA, on average: 4
  16. Interannotator Agreement: all tasks. [Plots of proto-labeler agreement vs. number of nonexpert annotators: correlation for word similarity; accuracy for RTE, temporal before/after, and WSD.] Single-expert ITA (1-E ITA) vs. 10-nonexpert ITA (10-N ITA):
      • Affect Recognition: 0.603 vs. 0.694
      • Word Similarity: 0.958 vs. 0.952
      • Textual Entailment: 0.91 vs. 0.897
      • Temporal Annotation: unknown vs. 0.940
      • WSD: unknown vs. 0.994
  17. Error Analysis: WSD. Only 1 “mistake” out of 177 labels: “The Egyptian president said he would visit Libya today...” SemEval Task 17 marks this as the “executive officer of a firm” sense, while Turkers voted for the “head of a country” sense.
  18. Error Analysis: RTE. ~10 disagreements out of 100.
      • Bob Carpenter: “Over half of the residual disagreements between the Turker annotations and the gold standard were of this highly suspect nature and some were just wrong.”
      • Bob Carpenter’s full analysis is available as “Fool’s Gold Standard”, http://lingpipe-blog.com/
      Close examples:
      • T: “A car bomb that exploded outside a U.S. military base near Beiji, killed 11 Iraqis.” H: “A car bomb exploded outside a U.S. base in the northern town of Beiji, killing 11 Iraqis.” Labeled “TRUE” in PASCAL RTE-1; Turkers vote 6-4 “FALSE”.
      • T: “Google files for its long awaited IPO.” H: “Google goes public.” Labeled “TRUE” in PASCAL RTE-1; Turkers vote 6-4 “FALSE”.
  19. Weighting Annotators
      • There are a small number of very prolific, very noisy annotators. [Scatter plot of per-annotator accuracy vs. number of annotations, RTE task.]
      • We should be able to do better than majority voting.
  20. Weighting Annotators
      • To infer the true value x_i, we weight each response y_i from annotator w using a small gold standard training set.
      • We estimate annotator response models from 5% of the gold standard test set and evaluate with 20-fold cross-validation.
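The weighting formula on this slide did not survive transcription. Below is a minimal sketch of one plausible gold-calibrated weighting in the spirit the slide describes: each annotator’s per-label likelihoods are estimated from the small gold split, and labels are chosen by summed log-likelihoods rather than by a raw majority. The function names, add-one smoothing, and uniform prior are illustrative assumptions, not details taken from the slides.

```python
import math
from collections import Counter, defaultdict

def calibrate(gold, responses):
    """Estimate P(y | x) for each worker from a small gold-labeled subset.

    gold: dict mapping item -> true binary label (0 or 1).
    responses: dict mapping item -> list of (worker, label) pairs.
    Add-one smoothing is our assumption; the slide does not specify it.
    """
    counts = defaultdict(lambda: {0: Counter(), 1: Counter()})
    for item, x in gold.items():
        for worker, y in responses.get(item, []):
            counts[worker][x][y] += 1
    conf = {}
    for worker, by_x in counts.items():
        conf[worker] = {
            x: {y: (by_x[x][y] + 1.0) / (sum(by_x[x].values()) + 2.0) for y in (0, 1)}
            for x in (0, 1)
        }
    return conf

def majority_vote(votes):
    """Naive voting baseline: the most common label among the workers wins."""
    return Counter(y for _, y in votes).most_common(1)[0][0]

def calibrated_vote(votes, conf, prior=0.5):
    """Gold-calibrated voting: argmax over x of log P(x) + sum_w log P(y_w | x)."""
    best_label, best_score = None, float("-inf")
    for x in (0, 1):
        score = math.log(prior if x == 1 else 1.0 - prior)
        for worker, y in votes:
            # Workers never seen in the gold subset contribute an uninformative 0.5.
            p = conf[worker][x][y] if worker in conf else 0.5
            score += math.log(p)
        if score > best_score:
            best_label, best_score = x, score
    return best_label
```

On the setup described above, `calibrate` would be fit on the 5% gold split and both voting rules compared on the remaining items under 20-fold cross-validation.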
  21. Weighting Annotators: results. [Plots of accuracy vs. number of annotators for RTE and temporal before/after, comparing gold-calibrated voting against naive voting.]
      • RTE: 4.0% avg. accuracy increase; Temporal: 3.4% avg. accuracy increase
      • Several follow-up posts at http://lingpipe-blog.com
  22. Cost Summary: total labels, cost in USD, time in hours, labels/USD, labels/hour:
      • Affect Recognition: 7000 labels, $2.00, 5.93 h, 3500 labels/USD, 1180.4 labels/hour
      • Word Similarity: 300 labels, $0.20, 0.17 h, 1500 labels/USD, 1724.1 labels/hour
      • Textual Entailment: 8000 labels, $8.00, 89.3 h, 1000 labels/USD, 89.59 labels/hour
      • Temporal Annotation: 4620 labels, $13.86, 39.9 h, 333.3 labels/USD, 115.85 labels/hour
      • WSD: 1770 labels, $1.76, 8.59 h, 1005.7 labels/USD, 206.1 labels/hour
      • All: 21690 labels, $25.82, 143.9 h, 840.0 labels/USD, 150.7 labels/hour
  23. In Summary • All collected data and annotator instructions are available at http://ai.stanford.edu/~rion/annotations • Summary blog post and comments on the Dolores Labs blog: http://blog.doloreslabs.com (nlp.stanford.edu, doloreslabs.com, ai.stanford.edu)
  24. Supplementary Slides
  25. Training systems on nonexpert annotations • A simple affect recognition classifier trained on the averaged nonexpert votes outperforms one trained on a single expert annotation
  26. Where are Turkers? United States 77.1%, India 5.3%, Philippines 2.8%, Canada 2.8%, UK 1.9%, Germany 0.8%, Italy 0.5%, Netherlands 0.5%, Portugal 0.5%, Australia 0.4%; the remaining 7.3% is divided among 78 countries/territories. (Analysis by Dolores Labs.)
  27. Who are Turkers? [Charts of gender, age, education, and annual income.] “Mechanical Turk: The Demographics”, Panos Ipeirotis, NYU, behind-the-enemy-lines.blogspot.com
  28. Why are Turkers? A. To kill time B. Fruitful way to spend free time C. Income purposes D. Pocket change/extra cash E. For entertainment F. Challenge, self-competition G. Unemployed, no regular job, part-time job H. To sharpen / keep the mind sharp I. Learn English. “Why People Participate on Mechanical Turk, Now Tabulated”, Panos Ipeirotis, NYU, behind-the-enemy-lines.blogspot.com
  29. How much does AMT pay? “How Much Turking Pays?”, Panos Ipeirotis, NYU, behind-the-enemy-lines.blogspot.com
  30. Annotation Guidelines: Affective Text
  31. Annotation Guidelines: Word Similarity
  32. Annotation Guidelines: Textual Entailment
  33. Annotation Guidelines: Temporal Ordering
  34. Annotation Guidelines: Word Sense Disambiguation
  35. Example Task: Affect Recognition. We label 100 headlines for each of 7 emotions. We pay 4 cents per set of 20 headlines (140 labels per set). Total cost: $2.00. Time to complete: 5.94 hrs.
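As a rough sanity check (not from the slides themselves), the affect-task totals in the cost summary on slide 22 can be derived from this slide together with the 10 annotations per item stated on slide 9; the grouping of work into 20-headline, 4-cent assignments is read off this slide and is an assumption about how the HITs were structured.

```python
headlines = 100            # headlines to label (this slide)
emotions = 7               # emotions scored per headline (this slide)
labelers_per_item = 10     # nonexpert annotations per item (slide 9)
headlines_per_set = 20     # headlines per 4-cent assignment (this slide)
cents_per_set = 4

total_labels = headlines * emotions * labelers_per_item              # 7000, as in the cost summary
assignments = (headlines // headlines_per_set) * labelers_per_item   # 50 paid assignments
total_cost_usd = assignments * cents_per_set / 100.0                 # $2.00, as in the cost summary
print(total_labels, assignments, total_cost_usd)
```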
  36. Example Task: Word Similarity. 30 word pairs (Rubenstein and Goodenough, xxxx). We pay 10 Turkers 2 cents apiece to score all 30 word pairs. Total cost: $0.20. Time to complete: 10.4 minutes.
  37. Word Similarity ITA. [Plot of proto-labeler correlation vs. number of annotations, 1 to 10.]
  38. Comparison against multiple annotators. (Graphs.) Average number of nonexperts required to match one expert: 4.
  39. Datasets lead the way
      • WSJ + syntactic annotation = Penn Treebank => enables statistical parsing
      • Brown corpus + sense labeling = SemCor => WSD
      • Treebank + role labels = PropBank => SRL
      • political speeches + translations = United Nations parallel corpora => statistical machine translation
      • more: RTE, TimeBank, ACE/MUC, etc.
  40. Datasets drive research: statistical parsing (Penn Treebank), semantic role labeling (PropBank), word sense disambiguation (WordNet, SemCor), speech recognition (Switchboard), social network analysis (Enron E-mail Corpus), statistical MT (UN Parallel Text), textual entailment (Pascal RTE)
