Rls For Emnlp 2008
Presentation Transcript

  • Cheap and Fast - But is it Good? Evaluating Nonexpert Annotations for Natural Language Tasks. Rion Snow, Brendan O’Connor, Daniel Jurafsky, Andrew Y. Ng
  • The primacy of data (Banko and Brill, 2001): Scaling to Very Very Large Corpora for Natural Language Disambiguation
  • Datasets drive research: statistical parsing (Penn Treebank), semantic role labeling (PropBank), word sense disambiguation (WordNet, SemCor), speech recognition (Switchboard), textual entailment (Pascal RTE), statistical machine translation (UN Parallel Text)
  • The advent of human computation • Open Mind Common Sense (Singh et al., 2002) • Games with a Purpose (von Ahn and Dabbish, 2004) • Online Word Games (Vickrey et al., 2008)
  • Amazon Mechanical Turk (mturk.com). But what if your task isn’t “fun”?
  • Using AMT for dataset creation • Su et al. (2007): name resolution, attribute extraction • Nakov (2008): paraphrasing noun compounds • Kaisser and Lowe (2008): sentence-level QA annotation • Kaisser et al. (2008): customizing QA summary length • Zaenen (2008): evaluating RTE agreement
  • Using AMT is cheap:

    Paper                    Labels   Cents/Label
    Su et al. (2007)         10,500   1.5
    Nakov (2008)             19,018   unreported
    Kaisser and Lowe (2008)  24,321   2.0
    Kaisser et al. (2008)    45,300   3.7
    Zaenen (2008)             4,000   2.0
  • And it’s fast... blog.doloreslabs.com
  • But is it good? • Objective: compare nonexpert annotation quality on NLP tasks against gold-standard, expert-annotated data • Method: pick 5 standard datasets and relabel each data point with 10 new annotations • Compare Turker agreement with each dataset's reported expert interannotator agreement
  • Tasks
    • Affect recognition: fear(“Tropical storm forms in Atlantic”) > fear(“Goal delight for Sheva”) (Strapparava and Mihalcea, 2007)
    • Word similarity: sim(boy, lad) > sim(rooster, noon) (Miller and Charles, 1991)
    • Textual entailment: if “Microsoft was established in Italy in 1985”, then “Microsoft was established in 1985”? (Dagan et al., 2006)
    • WSD: “a bass on the line” vs. “a funky bass line” (Pradhan et al., 2007)
    • Temporal annotation: ran happens before fell in “The horse ran past the barn fell.” (Pustejovsky et al., 2003)
  • Tasks

    Task                 Expert Labelers  Unique Examples  Interannotator Agreement  Answer Type
    Affect Recognition   6                700              0.603                     numeric
    Word Similarity      1                30               0.958                     numeric
    Textual Entailment   1                800              0.91                      binary
    Temporal Annotation  1                462              Unknown                   binary
    WSD                  1                177              Unknown                   ternary
  • Affect Recognition
  • Interannotator Agreement

    Emotion   1-Expert ITA
    Anger     0.459
    Disgust   0.583
    Fear      0.711
    Joy       0.596
    Sadness   0.645
    Surprise  0.464
    Valence   0.844
    All       0.603

    • 6 total experts.
    • One expert’s ITA is calculated as the average of Pearson correlations from each annotator to the average of the other 5 annotators (see the sketch below).
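A minimal sketch of this leave-one-out computation, assuming the expert labels for one emotion sit in a NumPy array of shape (num_experts, num_headlines); the function name and array layout are illustrative, not from the released data.

```python
import numpy as np
from scipy.stats import pearsonr

def expert_ita(scores):
    """Average leave-one-out ITA: each annotator's ITA is the Pearson
    correlation between their labels and the mean of the remaining
    annotators' labels; return the average over annotators.

    scores: array of shape (num_annotators, num_items), e.g. (6, 100).
    """
    correlations = []
    for i in range(scores.shape[0]):
        others_mean = np.delete(scores, i, axis=0).mean(axis=0)
        r, _ = pearsonr(scores[i], others_mean)
        correlations.append(r)
    return float(np.mean(correlations))
```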
  • Nonexpert ITA We average over k annotations to create a single “proto-labeler”. We plot the ITA of this proto-labeler for up to 10 annotations and compare to the average single expert ITA.
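A sketch of the proto-labeler comparison under the same assumed layout: `nonexpert_scores` holds the 10 Turker annotations per item and `expert_mean` the averaged expert labels. Taking the first k annotators keeps the sketch short; the averaging described on the slide may instead be taken over sampled k-sized subsets.

```python
import numpy as np
from scipy.stats import pearsonr

def protolabeler_ita(nonexpert_scores, expert_mean, max_k=10):
    """ITA of a proto-labeler built by averaging k nonexpert annotations,
    for k = 1..max_k, measured as Pearson correlation to the expert mean.

    nonexpert_scores: shape (10, num_items); expert_mean: shape (num_items,).
    """
    return {k: pearsonr(nonexpert_scores[:k].mean(axis=0), expert_mean)[0]
            for k in range(1, max_k + 1)}
```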
  • Interannotator Agreement

    Emotion   1-Expert ITA  10-Nonexpert ITA
    Anger     0.459         0.675
    Disgust   0.583         0.746
    Fear      0.711         0.689
    Joy       0.596         0.632
    Sadness   0.645         0.776
    Surprise  0.464         0.496
    Valence   0.844         0.669
    All       0.603         0.694

    [Per-emotion plots of proto-labeler correlation vs. number of nonexpert annotators omitted.]

    Number of nonexpert annotators required to match expert ITA, on average: 4
  • Interannotator Agreement

    Task                 1-Expert ITA  10-Nonexpert ITA
    Affect Recognition   0.603         0.694
    Word Similarity      0.958         0.952
    Textual Entailment   0.91          0.897
    Temporal Annotation  Unknown       0.940
    WSD                  Unknown       0.994

    [Plots of correlation/accuracy vs. number of annotators for word similarity, RTE, temporal (before/after), and WSD omitted.]
  • Error Analysis: WSD only 1 “mistake” out of 177 labels: “The Egyptian president said he would visit Libya today...” Semeval Task 17 marks this as “executive officer of a firm” sense, while Turkers voted for “head of a country” sense.
  • Error Analysis: RTE. ~10 disagreements out of 100:
    • Bob Carpenter: “Over half of the residual disagreements between the Turker annotations and the gold standard were of this highly suspect nature and some were just wrong.”
    • Bob Carpenter’s full analysis is available as “Fool’s Gold Standard” at http://lingpipe-blog.com/

    Close examples:

    T: A car bomb that exploded outside a U.S. military base near Beiji, killed 11 Iraqis.
    H: A car bomb exploded outside a U.S. base in the northern town of Beiji, killing 11 Iraqis.
    Labeled “TRUE” in PASCAL RTE-1, Turkers vote 6-4 “FALSE”.

    T: “Google files for its long awaited IPO.”
    H: “Google goes public.”
    Labeled “TRUE” in PASCAL RTE-1, Turkers vote 6-4 “FALSE”.
  • Weighting Annotators
    • There are a small number of very prolific, very noisy annotators. If we plot each annotator’s accuracy against their number of annotations (Task: RTE): [scatter plot omitted]
    • We should be able to do better than majority voting.
  • Weighting Annotators
    • To infer the true value x_i, we weight each response y_i from annotator w using a small gold-standard training set (equation omitted; see the sketch below).
    • We estimate annotator responses from 5% of the gold-standard test set and evaluate with 20-fold cross-validation.
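The sketch below shows one way to implement this gold-calibrated weighting for a binary task such as RTE: each annotator's confusion probabilities are estimated from the small gold set, and votes are combined by summed log-likelihood rather than by simple majority. The function names, the Laplace smoothing, and the uniform prior are illustrative assumptions; the paper's exact estimator may differ in detail.

```python
import numpy as np
from collections import defaultdict

def calibrate_annotators(gold_labels, responses, smoothing=1.0):
    """Estimate P(reported label | true label) per annotator from gold data.

    gold_labels: dict item -> true label (0 or 1)
    responses:   dict item -> {annotator: reported label (0 or 1)}
    Returns dict annotator -> 2x2 array of conditional probabilities.
    """
    counts = defaultdict(lambda: np.full((2, 2), smoothing))  # [true, reported]
    for item, true in gold_labels.items():
        for annotator, reported in responses.get(item, {}).items():
            counts[annotator][true, reported] += 1
    return {a: c / c.sum(axis=1, keepdims=True) for a, c in counts.items()}

def weighted_vote(item_responses, confusion, prior=(0.5, 0.5)):
    """Combine one item's annotations by summed log-likelihoods.

    Annotators not seen in the gold set fall back to an uninformative
    confusion matrix, so they contribute nothing to the decision.
    """
    log_post = np.log(np.asarray(prior, dtype=float))
    for annotator, reported in item_responses.items():
        probs = confusion.get(annotator, np.full((2, 2), 0.5))
        log_post += np.log(probs[:, reported])  # P(reported | true = 0 or 1)
    return int(np.argmax(log_post))
```

For comparison, the naive majority vote on the same `item_responses` is just `int(round(np.mean(list(item_responses.values()))))`, which is the baseline the next slide measures against.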
  • Weighting Annotators
    [Plots of accuracy vs. number of annotators, gold-calibrated voting vs. naive voting, for RTE and temporal (before/after), omitted.]
    RTE: 4.0% avg. accuracy increase. Temporal: 3.4% avg. accuracy increase.
    • Several follow-up posts at http://lingpipe-blog.com
  • Cost Summary

    Task                 Total Labels  Cost in USD  Time in hours  Labels/USD  Labels/Hour
    Affect Recognition   7000          $2.00        5.93           3500        1180.4
    Word Similarity      300           $0.20        0.17           1500        1724.1
    Textual Entailment   8000          $8.00        89.3           1000        89.59
    Temporal Annotation  4620          $13.86       39.9           333.3       115.85
    WSD                  1770          $1.76        8.59           1005.7      206.1
    All                  21690         $25.82       143.9          840.0       150.7
  • In Summary • All collected data and annotator instructions are available at: http://ai.stanford.edu/~rion/annotations • Summary blog post and comments on the Dolores Labs blog: http://blog.doloreslabs.com
  • Supplementary Slides
  • Training systems on nonexpert annotations • A simple affect recognition classifier trained on the averaged nonexpert votes outperforms one trained on a single expert annotation
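A hedged sketch of that comparison, assuming headline texts and per-headline emotion scores; the bag-of-words ridge regressor here is an illustrative stand-in, not necessarily the classifier used in the paper.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge

def train_affect_model(headlines, scores):
    """Fit a bag-of-words regressor that predicts an emotion score."""
    vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=2)
    features = vectorizer.fit_transform(headlines)
    model = Ridge().fit(features, scores)
    return vectorizer, model

# Crowd-trained model: targets are the averaged nonexpert votes
# (nonexpert_votes has shape (10, num_headlines)); expert-trained model:
# targets are a single expert's labels (expert_scores, shape (num_headlines,)).
#   crowd_model  = train_affect_model(headlines, nonexpert_votes.mean(axis=0))
#   expert_model = train_affect_model(headlines, expert_scores)
```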
  • Where are Turkers? United States 77.1% India 5.3% Philippines 2.8% Canada 2.8% UK 1.9% Germany 0.8% Italy 0.5% Netherlands 0.5% Portugal 0.5% Australia 0.4% Remaining 7.3% divided among 78 countries / territories Analysis by Dolores Labs
  • Who are Turkers? [Charts of gender, age, education, and annual income omitted.] “Mechanical Turk: The Demographics”, Panos Ipeirotis, NYU, behind-the-enemy-lines.blogspot.com
  • Why are Turkers? A. To Kill Time B. Fruitful way to spend free time C. Income purposes D. Pocket change/extra cash E. For entertainment F. Challenge, self-competition G. Unemployed, no regular job, part-time job H. To sharpen/ To keep mind sharp I. Learn English “Why People Participate on Mechanical Turk, Now Tabulated”, Panos Ipeirotis, NYU behind-the-enemy-lines.blogspot.com
  • How much does AMT pay? “How Much Turking Pays?”, Panos Ipeirotis, NYU behind-the-enemy-lines.blogspot.com
  • Annotation Guidelines: Affective Text
  • Annotation Guidelines: Word Similarity
  • Annotation Guidelines: Textual Entailment
  • Annotation Guidelines: Temporal Ordering
  • Annotation Guidelines: Word Sense Disambiguation
  • Affect Recognition: we label 100 headlines for each of 7 emotions. We pay 4 cents for 20 headlines (140 total labels per HIT). Total cost: $2.00. Time to complete: 5.94 hrs.
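A quick arithmetic check of these numbers (the per-HIT layout of 20 headlines x 7 emotions at 4 cents, with 10 annotators per headline, is read off this slide and the cost summary above):

```python
headlines, emotions, annotators = 100, 7, 10
labels_per_hit, cost_per_hit = 20 * 7, 0.04       # 140 labels per 4-cent HIT

total_labels = headlines * emotions * annotators  # 7000 labels
total_hits = total_labels / labels_per_hit        # 50 HITs
total_cost = total_hits * cost_per_hit            # $2.00
print(total_labels, total_hits, total_cost)
```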
  • Example Task: Word Similarity. 30 word pairs (Rubenstein and Goodenough, 1965). We pay 10 Turkers 2 cents apiece to score all 30 word pairs. Total cost: $0.20. Time to complete: 10.4 minutes.
  • Word Similarity ITA: [plot of correlation vs. number of annotations omitted]
  • Comparison against multiple annotators [graphs omitted]: avg. number of nonexperts per expert = 4
  • Datasets lead the way
    • WSJ + syntactic annotation = Penn TreeBank => statistical parsing
    • Brown corpus + sense labeling = SemCor => WSD
    • TreeBank + role labels = PropBank => SRL
    • political speeches + translations = United Nations parallel corpora => statistical machine translation
    • more: RTE, TimeBank, ACE/MUC, etc.
  • Datasets drive research: statistical parsing (Penn Treebank), semantic role labeling (PropBank), word sense disambiguation (WordNet, SemCor), speech recognition (Switchboard), social network analysis (Enron E-mail Corpus), statistical MT (UN Parallel Text), textual entailment (Pascal RTE)