Slides for EMNLP 2008

Cheap and Fast - But is it Good?
Evaluating Nonexpert Annotations for Natural Language Tasks



Rion Snow, Brendan O’Connor, Daniel Jurafsky, Andrew Y. Ng
The primacy of data

(Banko and Brill, 2001):
“Scaling to Very Very Large Corpora for Natural Language Disambiguation”
Datasets drive research

[Diagram: datasets paired with the research areas they enabled]
• Penn Treebank → statistical parsing
• PropBank → semantic role labeling
• WordNet, SemCor → word sense disambiguation
• Switchboard → speech recognition
• Pascal RTE → textual entailment
• UN Parallel Text → statistical machine translation
The advent of human computation

• Open Mind Common Sense (Singh et al., 2002)
• Games with a Purpose (von Ahn and Dabbish, 2004)
• Online Word Games (Vickrey et al., 2008)
Amazon Mechanical Turk

But what if your task isn’t “fun”?

mturk.com
Using AMT for dataset creation
•   Su et al. (2007): name resolution, attribute extraction

•   Nakov (2008): paraphrasing noun compounds

•   Kaisser and Lowe (2008): sentence-level QA annotation

•   Kaisser et al. (2008): customizing QA summary length

•   Zaenen (2008): evaluating RTE agreement
Using AMT is cheap

Paper                     Labels   Cents/Label
Su et al. (2007)          10,500   1.5
Nakov (2008)              19,018   unreported
Kaisser and Lowe (2008)   24,321   2.0
Kaisser et al. (2008)     45,300   3.7
Zaenen (2008)              4,000   2.0
And it’s fast...




   blog.doloreslabs.com
But is it good?
• Objective: compare nonexpert annotation quality on NLP tasks against gold-standard, expert-annotated data
• Method: pick 5 standard datasets and relabel each data point with 10 new annotations
• Compare Turker agreement with each dataset’s reported expert interannotator agreement
Tasks

• Affect recognition (Strapparava and Mihalcea, 2007):
  fear(“Tropical storm forms in Atlantic”) > fear(“Goal delight for Sheva”)

• Word similarity (Miller and Charles, 1991):
  sim(boy, lad) > sim(rooster, noon)

• Textual entailment (Dagan et al., 2006):
  if “Microsoft was established in Italy in 1985”, then “Microsoft was established in 1985”?

• WSD (Pradhan et al., 2007):
  “a bass on the line” vs. “a funky bass line”

• Temporal annotation (Pustejovsky et al., 2003):
  ran happens before fell in “The horse ran past the barn fell.”
Tasks

Task                  Expert     Unique     Interannotator   Answer
                      Labelers   Examples   Agreement        Type
Affect Recognition    6          700        0.603            numeric
Word Similarity       1          30         0.958            numeric
Textual Entailment    1          800        0.91             binary
Temporal Annotation   1          462        unknown          binary
WSD                   1          177        unknown          ternary
Affect Recognition Interannotator Agreement

• 6 total experts.
• One expert’s ITA is calculated as the average of Pearson correlations from each annotator to the average of the other 5 annotators.

Emotion    1-E ITA
Anger      0.459
Disgust    0.583
Fear       0.711
Joy        0.596
Sadness    0.645
Surprise   0.464
Valence    0.844
All        0.603
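
A minimal Python sketch (not the authors’ code) of this leave-one-out ITA computation, assuming `expert_labels` is a 6 x N NumPy array of numeric affect scores:

import numpy as np

def expert_ita(expert_labels):
    # Average Pearson correlation of each expert against
    # the mean of the remaining experts.
    k = expert_labels.shape[0]
    corrs = []
    for i in range(k):
        rest = np.delete(expert_labels, i, axis=0).mean(axis=0)
        corrs.append(np.corrcoef(expert_labels[i], rest)[0, 1])
    return float(np.mean(corrs))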
Nonexpert ITA

We average over k annotations to create a single “proto-labeler”.

We plot the ITA of this proto-labeler for up to 10 annotations and compare it to the average single-expert ITA.
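
A matching sketch of the proto-labeler curve (illustrative variable names; for brevity this averages the first k annotations rather than sampling annotator subsets):

import numpy as np

def proto_labeler_ita(turk_labels, expert_mean, max_k=10):
    # ITA of the mean of the first k nonexpert annotations, for
    # k = 1..max_k; turk_labels is a 10 x N array and expert_mean
    # is the N-item average of the expert scores.
    return [float(np.corrcoef(turk_labels[:k].mean(axis=0),
                              expert_mean)[0, 1])
            for k in range(1, max_k + 1)]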
Interannotator Agreement

[Plots: proto-labeler correlation vs. number of annotators (2 to 10), one panel per emotion: anger, disgust, fear, joy, sadness, surprise]

Emotion    1-E ITA   10-N ITA
Anger      0.459     0.675
Disgust    0.583     0.746
Fear       0.711     0.689
Joy        0.596     0.632
Sadness    0.645     0.776
Surprise   0.464     0.496
Valence    0.844     0.669
All        0.603     0.694

Number of nonexpert annotators required to match expert ITA, on average: 4
Interannotator Agreement

[Plots: proto-labeler correlation or accuracy vs. number of annotators (2 to 10) for word similarity, RTE, before/after (temporal), and WSD]

Task                  1-E ITA   10-N ITA
Affect Recognition    0.603     0.694
Word Similarity       0.958     0.952
Textual Entailment    0.91      0.897
Temporal Annotation   unknown   0.940
WSD                   unknown   0.994
Error Analysis: WSD

Only 1 “mistake” out of 177 labels:

“The Egyptian president said he would visit Libya today...”

SemEval Task 17 marks this as the “executive officer of a firm” sense, while Turkers voted for the “head of a country” sense.
Error Analysis: RTE

~10 disagreements out of 100:

• Bob Carpenter: “Over half of the residual disagreements between the Turker annotations and the gold standard were of this highly suspect nature and some were just wrong.”
• Bob Carpenter’s full analysis is available as “Fool’s Gold Standard” at http://lingpipe-blog.com/

Close Examples

T: “A car bomb that exploded outside a U.S. military base near Beiji, killed 11 Iraqis.”
H: “A car bomb exploded outside a U.S. base in the northern town of Beiji, killing 11 Iraqis.”
Labeled “TRUE” in PASCAL RTE-1; Turkers vote 6-4 “FALSE”.

T: “Google files for its long awaited IPO.”
H: “Google goes public.”
Labeled “TRUE” in PASCAL RTE-1; Turkers vote 6-4 “FALSE”.
Weighting Annotators

• There are a small number of very prolific, very noisy annotators. If we plot each annotator (task: RTE):

[Scatter plot: per-annotator accuracy (0.4 to 1.0) vs. number of annotations (0 to 800)]

• We should be able to do better than majority voting.
Weighting Annotators

• To infer the true label x_i, we weight each response y_i^w from annotator w using a small gold-standard training set. Under a naive Bayes model, the posterior log-odds for a binary label are

  log [ P(x_i = 1 | y_i) / P(x_i = 0 | y_i) ]
    = Σ_w log [ P(y_i^w | x_i = 1) / P(y_i^w | x_i = 0) ] + log [ P(x_i = 1) / P(x_i = 0) ]

  with each annotator’s response model P(y^w | x) estimated from the gold data.

• We estimate annotator response from 5% of the gold standard test set, and evaluate with 20-fold CV.
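
A small sketch of this gold-calibrated vote for a binary task (the data layout is hypothetical, and Laplace smoothing is added so unseen responses don’t zero out a probability):

import math
from collections import defaultdict

def estimate_confusions(gold_items):
    # gold_items: list of (true_label, {annotator: response}),
    # labels and responses in {0, 1}. Counts start at 1 (Laplace).
    counts = defaultdict(lambda: [[1.0, 1.0], [1.0, 1.0]])
    for x, responses in gold_items:
        for w, y in responses.items():
            counts[w][x][y] += 1.0
    return {w: [[c[x][y] / sum(c[x]) for y in (0, 1)] for x in (0, 1)]
            for w, c in counts.items()}

def calibrated_vote(responses, conf, prior=0.5):
    # Posterior log-odds of x = 1 under the naive Bayes model above;
    # annotators unseen in the gold set contribute nothing.
    log_odds = math.log(prior / (1.0 - prior))
    for w, y in responses.items():
        if w in conf:
            log_odds += math.log(conf[w][1][y] / conf[w][0][y])
    return 1 if log_odds > 0 else 0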
Weighting Annotators

[Plots: accuracy vs. number of annotators for RTE and before/after (temporal), gold-calibrated weighting vs. naive majority voting]

RTE: 4.0% avg. accuracy increase
Temporal: 3.4% avg. accuracy increase

• Several follow-up posts at         http://lingpipe-blog.com
Cost Summary

Task                  Total    Cost in   Time in   Labels /   Labels /
                      Labels   USD       hours     USD        Hour
Affect Recognition    7000     $2.00     5.93      3500       1180.4
Word Similarity       300      $0.20     0.17      1500       1724.1
Textual Entailment    8000     $8.00     89.3      1000       89.59
Temporal Annotation   4620     $13.86    39.9      333.3      115.85
WSD                   1770     $1.76     8.59      1005.7     206.1
All                   21690    $25.82    143.9     840.0      150.7
In Summary

• All collected data and annotator instructions are available at:
  http://ai.stanford.edu/~rion/annotations

• Summary blog post and comments on the Dolores Labs blog:
  http://blog.doloreslabs.com

nlp.stanford.edu    doloreslabs.com    ai.stanford.edu
Supplementary Slides
Training systems on nonexpert annotations

• A simple affect recognition classifier trained on the averaged nonexpert votes outperforms one trained on a single expert annotation.
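
One way to picture the comparison, as an illustrative sketch rather than the paper’s classifier (`X` and the label arrays are assumed feature/score matrices):

import numpy as np

def ridge_fit(X, y, lam=1.0):
    # Closed-form ridge regression: w = (X^T X + lam*I)^-1 X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Train one regressor per target and compare on held-out data:
# w_nonexpert = ridge_fit(X_train, turk_train.mean(axis=0))  # averaged votes
# w_expert    = ridge_fit(X_train, expert_train[0])          # single expert
# Score each by Pearson correlation of X_test @ w with the expert average.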
Where are Turkers?
          United States                       77.1%
              India                            5.3%
           Philippines                         2.8%
            Canada                             2.8%
               UK                              1.9%
            Germany                            0.8%
              Italy                            0.5%
          Netherlands                          0.5%
            Portugal                           0.5%
            Australia                          0.4%

          Remaining 7.3% divided among 78 countries / territories

                         Analysis by Dolores Labs
Who are Turkers?

[Charts: gender, age, education, and annual income distributions]

“Mechanical Turk: The Demographics”, Panos Ipeirotis, NYU
behind-the-enemy-lines.blogspot.com
Why are Turkers?

A. To Kill Time
B. Fruitful way to spend free time
C. Income purposes
D. Pocket change/extra cash
E. For entertainment
F. Challenge, self-competition
G. Unemployed, no regular job, part-time job
H. To sharpen/ To keep mind sharp
I. Learn English




      “Why People Participate on Mechanical Turk, Now Tabulated”, Panos Ipeirotis, NYU
                   behind-the-enemy-lines.blogspot.com
How much does AMT pay?




      “How Much Turking Pays?”, Panos Ipeirotis, NYU
     behind-the-enemy-lines.blogspot.com
Annotation Guidelines: Affective Text
Annotation Guidelines: Word Similarity
Annotation Guidelines: Textual Entailment
Annotation Guidelines: Temporal Ordering
Annotation Guidelines: Word Sense Disambiguation
Affect Recognition

We label 100 headlines for each of 7 emotions.
We pay 4 cents for each batch of 20 headlines (20 × 7 = 140 labels per batch).
Total cost: $2.00
Time to complete: 5.94 hrs
Example Task: Word Similarity

30 word pairs (Rubenstein and Goodenough, 1965)

We pay 10 Turkers 2 cents apiece to score all 30 word pairs.
Total cost: $0.20
Time to complete: 10.4 minutes
Word Similarity ITA

[Plot: proto-labeler correlation (0.84 to 0.96) vs. number of annotations (2 to 10)]

• Comparison against multiple annotators
• (graphs)
• avg. number of nonexperts needed to match one expert: 4
Datasets lead the way

WSJ + syntactic annotation = Penn Treebank => statistical parsing

Brown corpus + sense labeling = SemCor => WSD

Treebank + role labels = PropBank => SRL

political speeches + translations = United Nations parallel corpora => statistical machine translation

more: RTE, TimeBank, ACE/MUC, etc.
Datasets drive research

[Diagram: datasets paired with the research areas they enabled]
• Penn Treebank → statistical parsing
• PropBank → semantic role labeling
• WordNet, SemCor → word sense disambiguation
• Switchboard → speech recognition
• Enron E-mail Corpus → social network analysis
• UN Parallel Text → statistical MT
• Pascal RTE → textual entailment