Crowdsourcing in NLP

  1. What is Crowdsourcing?
     • "Crowdsourcing is the act of a company or institution taking a function once performed by employees and outsourcing it to an undefined and large network of people in the form of an open call." (2006 magazine article)
     • "Crowdsourcing is a type of participative online activity in which an individual, an institution, a non-profit organization, or company proposes to a group of individuals of varying knowledge, heterogeneity, and number, via a flexible open call, the voluntary undertaking of a task. The undertaking of the task, of variable complexity and modularity, and in which the crowd should participate bringing their work, money, knowledge and/or experience, always entails mutual benefit…" (Estellés-Arolas, 2012; integrates 40 definitions from the literature, 2008 onward)
     Sample tasks (difficult for computers but simple for human beings):
     • Identify disease mentions in PubMed abstracts
     • Classify book reviews as positive or negative
  2. Amazon Mechanical Turk (AMT): a crowdsourcing platform launched in 2005
     • Requester
       – Designs the task and prepares the dataset (i.e., the task items)
       – Submits the task to AMT, specifying the number of judgments per task item, the reward paid to each worker per task item, and any worker restrictions (location, accuracy on previous tasks); a minimal posting sketch follows below
     • Workers
       – Work on task items; can work on as many or as few task items as they please
       – Get paid small amounts of money (a few cents per task item)
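
As a rough illustration of the requester side of this workflow, the sketch below posts a single task (a "HIT") through the MTurk API via boto3. Everything concrete in it (title, reward, question form, region) is a placeholder assumption, not the configuration used in the papers discussed on the following slides.

```python
# Rough sketch of posting one task item (a "HIT") to Amazon Mechanical Turk.
# Title, reward, question XML, and region are illustrative placeholders.
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

# MTurk QuestionForm XML describing a single multiple-choice question.
QUESTION_XML = """<QuestionForm xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2005-10-01/QuestionForm.xsd">
  <Question>
    <QuestionIdentifier>sentiment</QuestionIdentifier>
    <QuestionContent><Text>Is this book review positive or negative?</Text></QuestionContent>
    <AnswerSpecification>
      <SelectionAnswer>
        <Selections>
          <Selection><SelectionIdentifier>pos</SelectionIdentifier><Text>Positive</Text></Selection>
          <Selection><SelectionIdentifier>neg</SelectionIdentifier><Text>Negative</Text></Selection>
        </Selections>
      </SelectionAnswer>
    </AnswerSpecification>
  </Question>
</QuestionForm>"""

response = mturk.create_hit(
    Title="Classify a book review as positive or negative",
    Description="Read one short book review and choose its sentiment.",
    Keywords="nlp, classification, sentiment",
    Reward="0.02",                     # a few cents per task item
    MaxAssignments=10,                 # number of independent judgments requested per item
    AssignmentDurationInSeconds=300,   # time a worker has to finish one assignment
    LifetimeInSeconds=86400,           # how long the task stays available
    Question=QUESTION_XML,
)
print("Posted HIT:", response["HIT"]["HITId"])
```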
  3. About the AMT workers
     Who are they? (Ross et al. 2010)
     • Age: average 31, min 18, max 71, median 27
     • Gender: female 55%, male 45%
     • Occupation: 38% full-time, 31% part-time, 31% unemployed
     • Education: 66% college or higher, 33% students
     • Salary: median $20k–$30k
     • Country: USA 57%, India 32%, other 11%
     How do they search for tasks on AMT? (Chilton et al. 2010)
     • Title
     • Reward amount
     • Date posted (newest to oldest)
     • Time allotted
     • Number of task items (most to fewest)
     • Expiration date
  4. Paper 1: Cheap and Fast – But Is It Good? Evaluating Non-Expert Annotations for Natural Language Tasks
     Rion Snow, Brendan O'Connor, Daniel Jurafsky, Andrew Y. Ng
     Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2008
     Motivation
     • Large-scale annotation is vital to NLP research and to developing new algorithms
     • Challenges of expert annotation: financially expensive and time consuming
     • An alternative: explore non-expert annotation (crowdsourcing)
  5. Overview
     • Five natural language tasks (short and simple):
       1. Affect recognition
       2. Word similarity
       3. Recognizing textual entailment (RTE)
       4. Event temporal ordering
       5. Word sense disambiguation
     • Method
       – Post tasks on Amazon Mechanical Turk
       – Request 10 independent judgments (annotations) per task item
     • Evaluate the performance of non-experts
       – Compare annotations with experts
       – Compare machine-learning classifier performance (trained on expert vs. non-expert data)
  6. Task #1: Affect Recognition (original experiment with experts: Strapparava and Mihalcea, 2007, SemEval)
     • Given a textual headline, identify and rate emotions: anger [0,100], disgust [0,100], fear [0,100], joy [0,100], sadness [0,100], surprise [0,100], and overall valence [-100,100]
     • Example headline–annotation pair:
       "Outcry at N Korea 'nuclear test'" with (Anger, 30), (Disgust, 30), (Fear, 30), (Joy, 0), (Sadness, 20), (Surprise, 40), (Valence, -50)
     • Original experiment
       – 1,000 headlines extracted from the New York Times, CNN, and Google News
       – 6 expert annotators per headline
     • Non-expert (crowdsourcing) experiment
       – 100-headline sample
       – 10 annotations per headline (7 affect labels each, i.e., 70 labels per headline)
       – Paid $2.00 to Amazon for collecting 7,000 affect labels
  7. Task #1 Affect Recognition: Results (inter-annotator agreement, ITA)
     • ITA_task = average of ITA_annotator over all annotators
     • ITA_annotator = Pearson correlation between this annotator's labels and the average of the other annotators' labels (a computational sketch follows below)
     • Compared with the original expert experiment, individual experts are better labelers than individual non-experts
     • Non-expert annotations are nevertheless good enough to increase the overall quality of the task
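
A few lines suffice to sketch the ITA measure as defined on this slide. The function name and the toy ratings are illustrative, not data from the paper.

```python
# ITA as described above: correlate each annotator's labels with the average
# of the remaining annotators' labels, then average those correlations.
import numpy as np
from scipy.stats import pearsonr

def task_ita(labels: np.ndarray) -> float:
    """labels has shape (num_annotators, num_items)."""
    per_annotator = []
    for i in range(labels.shape[0]):
        others_avg = np.delete(labels, i, axis=0).mean(axis=0)  # average of the other annotators
        r, _ = pearsonr(labels[i], others_avg)
        per_annotator.append(r)
    return float(np.mean(per_annotator))

# Toy example: three annotators rating "sadness" in [0, 100] for five headlines.
ratings = np.array([
    [20, 80, 10, 60, 40],
    [25, 70, 15, 55, 50],
    [10, 90,  5, 65, 35],
])
print(f"task ITA = {task_ita(ratings):.3f}")
```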
  8. Task #1 Affect Recognition: how many non-expert labelers are equivalent to one expert labeler?
     • Treat n (1, 2, 3, …, 10) non-expert annotators as a single meta-labeler
       – Average the labels over all possible subsets of size n (see the sketch below)
     • Find the minimum number of non-experts (k) needed to rival the performance of an expert
     • On average, it takes 4 non-experts to produce expert-like performance; at that rate, $1.00 generates 875 expert-equivalent labels for this task
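
The subset analysis can be sketched as follows. The expert reference and the synthetic ratings are placeholders, and the cut-off at the end is only one illustrative way to read a k off the curve.

```python
# For each n, average the labels of every subset of n non-experts (the
# "meta-labeler") and measure how well the averaged labels track an expert
# reference. Data and threshold below are synthetic placeholders.
from itertools import combinations
import numpy as np
from scipy.stats import pearsonr

def meta_labeler_curve(nonexpert: np.ndarray, expert_ref: np.ndarray) -> dict:
    """nonexpert: (num_nonexperts, num_items); expert_ref: (num_items,)."""
    curve = {}
    for n in range(1, nonexpert.shape[0] + 1):
        correlations = []
        for subset in combinations(range(nonexpert.shape[0]), n):
            meta_labels = nonexpert[list(subset)].mean(axis=0)   # n annotators as one meta-labeler
            correlations.append(pearsonr(meta_labels, expert_ref)[0])
        curve[n] = float(np.mean(correlations))
    return curve

rng = np.random.default_rng(0)
expert_ref = rng.uniform(0, 100, size=30)                        # stand-in "expert" ratings
nonexpert = np.clip(expert_ref + rng.normal(0, 25, size=(10, 30)), 0, 100)

curve = meta_labeler_curve(nonexpert, expert_ref)
k = min(n for n, r in curve.items() if r >= 0.95 * curve[10])    # illustrative cut-off
print({n: round(r, 3) for n, r in curve.items()}, "-> k =", k)
```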
  9. Task #2: Word Similarity (original experiment with experts: Resnik, 1999)
     • Given a pair of words, e.g., {boy, lad}, provide a numeric similarity judgment in [0, 10]
     • Original experiment
       – 30 pairs of words, 10 expert annotators
       – Expert ITA = 0.958
     • Crowdsourcing experiment
       – 30 pairs of words, 10 annotations per pair
       – Paid a total of $0.20 for 300 annotations; the task was completed within 11 minutes of posting
       – Maximum ITA = 0.952
  10. Task #3: Recognizing Textual Entailment (original experiment: Dagan et al. 2006)
      • Task: determine whether the second sentence can be inferred from the first
        – S1: "Crude Oil Prices Slump"  S2: "Oil prices drop"  Answer: true
        – S1: "The government announced that it plans to raise oil prices"  S2: "Oil prices drop"  Answer: false
      • Original experiment
        – 800 sentence pairs; ITA = 0.91
      • Crowdsourcing experiment
        – 800 sentence pairs, 10 annotators
        – For the aggregate response, used simple majority voting with random tie breaking (sketched below); maximum ITA = 0.89
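
A minimal sketch of that aggregation step: simple majority voting over the judgments for each item, ties broken at random, with accuracy then measured against a gold label. The judgments shown are toy placeholders, not responses from the study.

```python
# Majority voting with random tie breaking, plus accuracy against gold labels.
import random
from collections import Counter

rng = random.Random(0)   # fixed seed so the random tie breaking is reproducible

def majority_vote(judgments):
    counts = Counter(judgments)
    top = max(counts.values())
    winners = [label for label, count in counts.items() if count == top]
    return rng.choice(winners)                       # random tie breaking

def accuracy(per_item_judgments, gold):
    votes = [majority_vote(j) for j in per_item_judgments]
    return sum(v == g for v, g in zip(votes, gold)) / len(gold)

# Toy judgments (True = "entailed") from 10 annotators for the two example pairs.
per_item_judgments = [
    [True, True, False, True, True, False, True, True, True, False],
    [False, False, True, False, False, False, True, False, False, True],
]
gold = [True, False]
print("accuracy vs. gold:", accuracy(per_item_judgments, gold))
```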
  11. Task #5: Word Sense Disambiguation (original experiment: SemEval, Pradhan et al. 2007)
      • Example: "Robert E. Lyons III … was appointed president and chief operating officer…"  What is the sense of "president"?
        – Executive officer of a firm, corporation, or university
        – Head of a country (other than the US)
        – Head of the US; President of the United States
      • Original experiment
        – ITA not reported; provides the gold standard
        – In the original figure, the red line represents the best system's performance on SemEval Task 17 (Cai et al., 2007)
      • Crowdsourcing experiment
        – 177 examples of the noun "president" covering the 3 senses
        – 10 annotators
        – Results aggregated using majority voting with random tie breaking (as in the sketch after slide 10); accuracy calculated w.r.t. the gold standard
  12. Training Classifiers: Non-Experts vs. Experts (Task #1: Affect Recognition)
      • Designed a supervised affect recognition system; trained it on annotated headlines and tested it on new headlines
      • Experiments: training on 100 headlines, testing on 900
      • Notation: e = emotion; t = token (word in a headline); H = headline; H_t = set of headlines containing token t (one possible reading of the model is sketched below)
      • Result: in most cases, training on a single set of non-expert annotations (1-NE) produced a better system than training on a single expert
        – Possible explanation: individual labelers, whether experts or non-experts, tend to be biased; because of the nature of crowdsourcing, even a single set of non-expert annotations is created by multiple non-expert labelers, which may reduce bias
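
The slide gives only the notation, so the scoring rule below is one plausible reading rather than the paper's confirmed model: a token scores an emotion as the average rating that emotion received over the training headlines containing the token (H_t), and a headline is scored by averaging over its tokens.

```python
# One possible reading of the notation above (an assumption, not necessarily
# the paper's exact model): score(e, t) = mean rating of emotion e over the
# training headlines in H_t, and score(e, H) = mean of score(e, t) over t in H.
from collections import defaultdict

def train_token_scores(headlines, emotion_ratings):
    """headlines: list of token lists; emotion_ratings: one rating per headline."""
    per_token = defaultdict(list)
    for tokens, rating in zip(headlines, emotion_ratings):
        for t in set(tokens):
            per_token[t].append(rating)              # rating of every headline in H_t
    return {t: sum(r) / len(r) for t, r in per_token.items()}

def score_headline(token_scores, tokens, default=0.0):
    scores = [token_scores.get(t, default) for t in tokens]
    return sum(scores) / len(scores) if scores else default

# Toy training data: per-headline "fear" ratings in [0, 100].
train_headlines = [["outcry", "at", "nuclear", "test"],
                   ["markets", "rally", "on", "peace", "deal"]]
fear_ratings = [30, 0]
token_scores = train_token_scores(train_headlines, fear_ratings)
print(score_headline(token_scores, ["nuclear", "test", "looms"]))
```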
  13. Summary
      • Individual experts are better annotators than individual non-experts
      • Aggregated non-expert annotations improve annotation quality
      • For many tasks, only a small number of non-experts per item (4 on average) is needed to equal expert performance
      • Systems trained on non-expert annotations performed better, probably because those annotations offer more diversity
  14. Paper 2: Validating Candidate Gene–Mutation Relations in MEDLINE Abstracts via Crowdsourcing
      John Burger, Emily Doughty, Sam Bayer, et al.
      Data Integration in the Life Sciences, 8th International Conference (DILS), 2012
      Goal
      • Identify relationships between genes and mutations in the biomedical literature (mutation grounding)
      Challenge
      • Abstracts contain multiple gene and mutation mentions, so extracting the correct associations is difficult
      Method
      • Identify all mutation and gene mentions using existing tools
      • Identify gene–mutation relationships through crowdsourcing with non-experts
  15. Dataset
      • PubMed abstracts
        – MeSH terms: Mutation, Mutation AND Polymorphism/Genetic
        – Diseases (identified using MetaMap): breast cancer, prostate cancer, autism spectrum disorder
      • 810 abstracts
        – Expert-curated gold standard: 1,608 gene–mutation relationships
      • Working dataset: 250 selected abstracts containing 578 gene–mutation pairs
  16. Method
      Extracting mentions
      • Used the existing tool EMU (Extractor of Mutations) (Doughty et al. 2011)
      • Gene identification: string match against the HUGO and NCBI Gene databases
      • Mutation (SNP) identification: regular expressions (an illustrative pattern is sketched below)
      Extracting relationships
      • Normalized all mentions
      • Generated the cross-product of all gene–mutation pairs within an article
      • Total of 1,398 candidate gene–mutation pairs, each submitted as a task item to Amazon Mechanical Turk
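
The pattern below is only an illustration of regex-based SNP mention detection; EMU's actual rules are far more extensive, and the gene lookup here is a stand-in for the HUGO/NCBI string matching.

```python
# Illustrative only: a toy regex for protein-level substitutions such as
# "R175H" or "Arg117His", plus the cross-product step that turns co-occurring
# gene and mutation mentions into candidate task items.
import re
from itertools import product

AA3 = "Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val"
MUTATION_RE = re.compile(
    rf"\b(?:[ACDEFGHIKLMNPQRSTVWY]\d+[ACDEFGHIKLMNPQRSTVWY]|(?:{AA3})\d+(?:{AA3}))\b"
)

abstract = "We screened BRCA1 and TP53 and observed the R175H and Arg117His variants."
gene_mentions = [g for g in ("BRCA1", "TP53") if g in abstract]   # stand-in for HUGO/NCBI matching
mutation_mentions = MUTATION_RE.findall(abstract)

# Every gene paired with every mutation in the same abstract is one candidate pair.
candidate_pairs = list(product(gene_mentions, mutation_mentions))
print(candidate_pairs)   # 2 genes x 2 mutations -> 4 candidate task items
```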
  17. Sample task item posted for crowdsourcing (screenshot of the task as shown to workers)
  18. Method: Crowdsourcing
      • 5 annotations per task item (candidate association)
      • 1,398 task items plus 467 control items
        – Control items are hidden tests (Amazon uses them to calculate each worker's rating)
      • Task restricted to workers
        – located in the United States only
        – with a 95% rating from previous tasks (expressed as qualification requirements; see the sketch below)
      • Payment
        – 8 cents per abstract to each worker
        – Total: $900
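
Those restrictions map onto MTurk qualification requirements. The sketch below uses the commonly documented built-in qualification type IDs for worker locale and approval rate; treat the exact IDs and field names as something to verify against the current MTurk documentation rather than as the study's actual configuration.

```python
# Worker restrictions expressed as MTurk QualificationRequirements, to be passed
# to create_hit(..., QualificationRequirements=QUALIFICATIONS). The two IDs are
# MTurk's built-in Locale and PercentAssignmentsApproved qualifications.
QUALIFICATIONS = [
    {   # workers located in the United States only
        "QualificationTypeId": "00000000000000000071",
        "Comparator": "EqualTo",
        "LocaleValues": [{"Country": "US"}],
    },
    {   # workers whose previous work was approved at least 95% of the time
        "QualificationTypeId": "000000000000000000L0",
        "Comparator": "GreaterThanOrEqualTo",
        "IntegerValues": [95],
    },
]
```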
  19. Results
      • Mutation recall: 477/550 = 86.7%
      • Gene recall: 257/276 = 93.1%
      • Of the 250 abstracts, 185 were perfect documents (100% recall of both genes and mutations)
      • Crowdsourcing statistics
        – Time to completion: 30 hours
        – 58 workers in total
          • 12 completed only 1 item
          • 22 completed 2–10 items
          • 13 completed 11–100 items
          • 11 completed 100+ items
  20. Results
      Worker accuracy (shown per worker as a figure on the original slide)
      Consensus accuracy (the two non-trivial schemes are sketched below)
      • Simple majority voting: 78.4%
      • Weighted vote (based on workers' ratings): 78.3%
      • Naïve Bayes classifier (estimating the probability that each response is correct): 82.1%
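
The weighted vote and a naive-Bayes-style consensus can be sketched generically as below. The per-worker accuracies would come from the control items, and the scoring here is a textbook formulation, not necessarily the classifier used in the paper.

```python
# Generic sketches of two consensus schemes for a yes/no question
# ("is this candidate gene-mutation relation correct?").
import math

def weighted_vote(responses, worker_accuracy):
    """responses: {worker_id: True/False}; weight each vote by the worker's accuracy."""
    score = sum((1 if answer else -1) * worker_accuracy[w] for w, answer in responses.items())
    return score > 0

def naive_bayes_vote(responses, worker_accuracy, prior_true=0.5):
    """Treat each response as evidence with log-odds weight log(acc / (1 - acc))."""
    log_odds = math.log(prior_true / (1 - prior_true))
    for w, answer in responses.items():
        acc = min(max(worker_accuracy[w], 1e-3), 1 - 1e-3)   # keep probabilities off 0 and 1
        weight = math.log(acc / (1 - acc))
        log_odds += weight if answer else -weight
    return log_odds > 0

# Toy data: 5 responses for one candidate pair; accuracies estimated from control items.
responses = {"w1": True, "w2": True, "w3": False, "w4": False, "w5": True}
worker_accuracy = {"w1": 0.90, "w2": 0.60, "w3": 0.55, "w4": 0.95, "w5": 0.70}
print(weighted_vote(responses, worker_accuracy), naive_bayes_vote(responses, worker_accuracy))
```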
  21. Conclusion
      • It is easy to recruit workers and achieve a fast turnaround time
      • Worker performance varies
      • Although the task required a significant level of biomedical literacy, one worker gave 95% accurate responses
      • It is important to find new ways to identify qualified workers and to aggregate results
