
Crowdsourcing Ambiguity-Aware Ground Truth - Collective Intelligence 2017


The process of gathering ground truth data through human annotation is a major bottleneck in the use of information extraction methods. Crowdsourcing-based approaches are gaining popularity as a way to address the volume of data and the shortage of annotators. Typically, these practices use inter-annotator agreement as a measure of quality. However, this assumption often creates issues in practice. In previous experiments we found that inter-annotator disagreement is usually not captured, either because the number of annotators is too small to capture the full diversity of opinion, or because the crowd data is aggregated with metrics that enforce consensus, such as majority vote. These practices create artificial data that is neither general nor reflective of the ambiguity inherent in the data.

To address these issues, we propose a method for crowdsourcing ground truth that harnesses inter-annotator disagreement. We present an alternative approach for crowdsourcing ground truth data that, instead of enforcing agreement between annotators, captures the ambiguity inherent in semantic annotation through disagreement-aware metrics for aggregating crowdsourcing responses. Based on this principle, we have implemented the CrowdTruth framework for machine-human computation, which first introduced the disagreement-aware metrics and provides a pipeline for processing crowdsourcing data with these metrics.

In this paper, we apply the CrowdTruth methodology to collect data over a set of diverse tasks: medical relation extraction, Twitter event identification, news event extraction and sound interpretation. We show that capturing disagreement is essential for acquiring a high-quality ground truth. We achieve this by comparing the quality of the data aggregated with CrowdTruth metrics against majority vote, a method that enforces consensus among annotators. By applying our analysis over a set of diverse tasks we show that, even though ambiguity manifests differently depending on the task, our theory of inter-annotator disagreement as a property of ambiguity is generalizable.



  1. Crowdsourcing Ambiguity-Aware Ground Truth. Chris Welty, Anca Dumitrache, Oana Inel, Benjamin Timmermans, Lora Aroyo. June 15th, 2017, Collective Intelligence Conference.
  2. Crowdsourcing Myth: Disagreement is Bad. Traditionally, disagreement is considered a measure of poor quality in the annotation task, because the task is poorly defined or the annotators lack training, rather than being accepted as a natural property of semantic interpretation. This makes the elimination of disagreement a goal. What if it is GOOD?
  3. Crowdsourcing Myth: All Examples Are Created Equal. Typically, annotators are asked whether a binary property holds for each example, and are often not given a chance to say that the property may partially hold, or holds but is not clearly expressed. The mathematics of using ground truth treats every example the same: it either matches the correct result or it does not. Poor-quality examples tend to generate high disagreement. Disagreement allows us to weight sentences, giving us the ability to both train and evaluate a machine in a more flexible way. What if they are DIFFERENT?
  4. Related Work on Annotating Ambiguity http://CrowdTruth.org ● Juergens (2013): for word-sense disambiguation, the crowd with ambiguity modeling was able to achieve expert-level quality of annotations. ● Cheatham et al. (2014): current benchmarks in ontology alignment and evaluation are not designed to model uncertainty caused by disagreement between annotators, both expert and crowd. ● Plank et al. (2014): for part-of-speech tagging, most inter-annotator disagreement pointed to debatable cases in linguistic theory. ● Chang et al. (2017): in a workflow of tasks for collecting and correcting labels for text and images, they found that many ambiguous cases cannot be resolved by better annotation guidelines or through worker quality control.
  5. CrowdTruth http://CrowdTruth.org. • Annotator disagreement is signal, not noise. • It is indicative of the variation in human semantic interpretation of signs. • It can indicate ambiguity, vagueness, similarity, over-generality, etc., as well as quality.
  6. Medical Relation Extraction: (1) workers select the medical relations that hold between the two given terms in a sentence; (2) closed task - workers pick from a given set of 14 top UMLS relations; (3) 975 sentences from PubMed abstracts, with distant supervision for term pair extraction; (4) 15 workers per sentence. Twitter Event Extraction: (1) workers select the events expressed in the tweet; (2) closed task - workers pick from a set of 8 events; (3) 2,019 English tweets, crawled based on event hashtags; (4) 7 workers per tweet; (*) no expert annotators. News Event Identification: (1) workers highlight the words/phrases that describe an event in the sentence; (2) open-ended task - workers can pick any words; (3) 200 sentences from the English TimeBank corpus; (4) 15 workers per sentence. Sound Interpretation: (1) workers give tags that describe a sound; (2) open-ended task - workers can pick any tag; (3) 284 sounds from the Freesound database; (4) 10 workers per sound. http://CrowdTruth.org
  7. Medical Relation Extraction http://CrowdTruth.org. Example sentence: "Patients with ACUTE FEVER and nausea could be suffering from INFLUENZA AH1N1." Question: Is ACUTE FEVER – related to → INFLUENZA AH1N1?
  8. Twitter Event Extraction http://CrowdTruth.org
  9. News Event Identification http://CrowdTruth.org
  10. Sound Interpretation http://CrowdTruth.org
  11. The four crowdsourcing tasks, repeated from slide 6: Medical Relation Extraction, Twitter Event Extraction, News Event Identification, Sound Interpretation. http://CrowdTruth.org
  12. Worker Vector - Closed Task (Medical Relation Extraction) http://CrowdTruth.org. [Figure: one worker's selections encoded as a binary vector over the candidate relations.]
  13. Media Unit Vector - Closed Task http://CrowdTruth.org. [Figure: the worker vectors for one sentence summed into a media unit vector, e.g. (4, 3, 0, 0, 5, 1, 0).] (See the first sketch after the slide list.)
  14. Medical Relation Extraction http://CrowdTruth.org. Unclear relationship between the two arguments, reflected in the disagreement.
  15. Medical Relation Extraction http://CrowdTruth.org. Clearly expressed relation between the two arguments, reflected in the agreement.
  16. Twitter Event Extraction http://CrowdTruth.org. Example tweet: "The tension building between China, Japan, U.S., and Vietnam is reaching new heights this week - let's stay tuned! #APAC #powerstruggle" Question: Which of the following EVENTS can you identify in the tweet? [Figure: worker votes spread across several options, including "islands disputed between China and Japan", "anti-China protests in Vietnam", and "None".] Unclear description of events in the tweet, reflected in the disagreement.
  17. Twitter Event Extraction http://CrowdTruth.org. Example tweet: "RT @jc_stubbs: Another tragic day in #Ukraine - More than 50 rebels killed as new leader unleashes assault: http://t.co/wcfU3kyAFX" Question: Which of the following EVENTS can you identify in the tweet? [Figure: worker votes concentrated on the option "Ukraine crisis 2014" (6 votes), with the other options, including "None", near zero.] Clear description of events in the tweet, reflected in the agreement.
  18. Media Unit Vector - Open Task (Sound Interpretation) http://CrowdTruth.org. Worker 1: balloon exploding; Worker 2: gun; Worker 3: gunshot; Worker 4: loud noise, pop; Worker 5: gun, balloon. After clustering the tags (e.g. with word2vec): balloon: 2, gun/gunshot: 3, loud noise: 1, pop: 1. Multiple interpretations of the sound, reflected in the disagreement. (See the open-task sketch after the slide list.)
  19. Sound Interpretation http://CrowdTruth.org. Worker 1: siren; Worker 2: police siren; Worker 3: siren, alarm; Worker 4: siren; Worker 5: annoying. After clustering the tags (e.g. with word2vec): siren: 4, alarm: 1, annoying: 1. Clear meaning of the sound, reflected in the agreement.
  20. News Event Identification http://CrowdTruth.org. Example sentence: "Other than usage in business, Internet technology is also beginning to infiltrate the lifestyle domain." Task: HIGHLIGHT the words/phrases that refer to an EVENT. Worker highlight counts per word: Internet: 5, technology: 4, is: 2, also: 2, beginning: 4, infiltrate: 5, lifestyle: 3, business: 1, usage: 3, Other: 0, domain: 3. Unclear specification of events in the sentence, reflected in the disagreement. (See the highlighting sketch after the slide list.)
  21. News Event Identification http://CrowdTruth.org. Example sentence: "Most of the city's monuments were destroyed including a magnificent tiled mosque which dominated the skyline for centuries." Task: HIGHLIGHT the words/phrases that refer to an EVENT. Worker highlight counts per word: were: 5, destroyed: 12, including: 0, magnificent: 1, tiled: 1, mosque: 1, dominated: 4, skyline: 1, centuries: 0, monuments: 5, city: 3, Most: 1. Clear specification of events in the sentence, reflected in the agreement.
  22. Media Unit - Annotation Score http://CrowdTruth.org. Measures how clearly a media unit expresses an annotation: the cosine between the media unit vector and the unit vector for that annotation. [Figure: a media unit vector and the unit vector for annotation A6, with cosine = .55.] (See the scoring sketch after the slide list.)
  23. Experimental Setup http://CrowdTruth.org. • Goal: which data labeling method is the most accurate? ○ CrowdTruth: media unit-annotation score (continuous value); ○ Majority Vote: decision of the majority of workers (discrete); ○ Single: decision of a single worker, randomly sampled (discrete); ○ Expert: decision of a domain expert (discrete). • Approach: evaluate against a trusted label set, built either from agreement between crowd and expert, or by manual evaluation in cases of disagreement. (A sketch comparing the CrowdTruth and majority vote labelings appears after the slide list.)
  24. Evaluation: F1 score
  25. Evaluation: F1 score. CrowdTruth performs better than Majority Vote and at least as well as Expert. Each task has a different best threshold for the media unit - annotation score. (See the threshold sweep sketch after the slide list.)
  26. Evaluation: number of workers
  27. Evaluation: number of workers. Each task reaches a stable F1 at a different number of workers. Majority Vote never beats CrowdTruth. Sound Interpretation needs more workers.
  28. Experiments showed that: • CrowdTruth performs just as well as domain experts; • the crowd is also cheaper; • the crowd is always available; • capturing ambiguity is essential: majority voting discards important signals in the data, and using only a few annotators for ground truth is faulty; • the optimal number of workers per media unit is task dependent. http://CrowdTruth.org
  29. CrowdTruth.org. Dumitrache et al.: Empirical Methodology for Crowdsourcing Ground Truth. Semantic Web Journal, Special Issue on Human Computation and Crowdsourcing (HC&C) in the Context of the Semantic Web (in review).
  30. Ambiguity in Crowdsourcing http://CrowdTruth.org. [Diagram: Media Unit, Annotation, Worker.]
  31. Ambiguity in Crowdsourcing http://CrowdTruth.org. [Diagram: Media Unit, Annotation, Worker.]
  32. Ambiguity in Crowdsourcing http://CrowdTruth.org. [Diagram: Media Unit, Annotation, Worker.] Disagreement can indicate ambiguity!
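The sketches below are small Python illustrations of the steps shown in the slides; they are not the CrowdTruth framework's actual API, and all relation names, tags, vote counts, and thresholds in them are made up. First, slides 12-13: in a closed task, each worker's judgment becomes a binary vector over the candidate annotations, and the worker vectors for one media unit are summed into a media unit vector.

```python
# Illustrative sketch (not the CrowdTruth library API) of the closed-task
# vectors from slides 12-13. The relation names and judgments are made up.
import numpy as np

RELATIONS = ["cause", "treat", "prevent", "diagnose", "symptom", "side_effect", "none"]

def worker_vector(selected_relations, relations=RELATIONS):
    """Binary vector: 1 for every relation this worker selected, 0 otherwise."""
    return np.array([1 if r in selected_relations else 0 for r in relations])

def media_unit_vector(worker_vectors):
    """Sum of all worker vectors for one sentence (media unit)."""
    return np.sum(worker_vectors, axis=0)

# Three hypothetical workers annotate the same sentence.
judgments = [{"cause"}, {"cause", "symptom"}, {"treat"}]
print(media_unit_vector([worker_vector(j) for j in judgments]))
# -> [2 1 0 0 1 0 0]
```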
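For the open-ended sound task (slides 18-19), free-text tags are first grouped into clusters of equivalent meaning before counting; the slides mention word2vec for that step, so the hand-written cluster map below is only a stand-in, and the tags follow the example on slide 18.

```python
# Illustrative sketch of vectorizing an open-ended task (slides 18-19).
# A real pipeline would cluster free-text tags with an embedding model
# (e.g. word2vec); this hand-written map is a stand-in for that step.
from collections import Counter

TAG_CLUSTERS = {
    "balloon exploding": "balloon",
    "balloon": "balloon",
    "gun": "gun/gunshot",
    "gunshot": "gun/gunshot",
    "loud noise": "loud noise",
    "pop": "pop",
}

def media_unit_vector_open(worker_tags):
    """Count how many workers contributed at least one tag from each cluster."""
    counts = Counter()
    for tags in worker_tags:
        clusters = {TAG_CLUSTERS.get(t, t) for t in tags}  # de-duplicate per worker
        counts.update(clusters)
    return dict(counts)

workers = [["balloon exploding"], ["gun"], ["gunshot"], ["loud noise", "pop"], ["gun", "balloon"]]
print(media_unit_vector_open(workers))
# -> {'balloon': 2, 'gun/gunshot': 3, 'loud noise': 1, 'pop': 1}
```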
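For the news event highlighting task (slides 20-21), the media unit vector counts, for each token of the sentence, how many workers included it in a highlighted span. The sentence fragment and highlights below are illustrative, not the paper's data.

```python
# Illustrative sketch of vectorizing the open-ended highlighting task
# (slides 20-21): count, per token, how many workers highlighted it.
from collections import Counter

def highlight_vector(sentence, worker_highlights):
    """Map each token to the number of workers whose highlights cover it."""
    counts = Counter()
    for spans in worker_highlights:
        covered = {tok for span in spans for tok in span.split()}
        counts.update(tok for tok in sentence.split() if tok in covered)
    return dict(counts)

sentence = "Most of the city's monuments were destroyed"
workers = [["were destroyed"], ["destroyed"], ["monuments were destroyed"]]
print(highlight_vector(sentence, workers))
# -> {'were': 2, 'destroyed': 3, 'monuments': 1}
```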
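Slide 22 defines the media unit - annotation score as the cosine between the media unit vector and the unit vector for one annotation, and slides 23-25 compare thresholding that continuous score against a discrete majority vote. The sketch below assumes a hypothetical media unit vector from 15 workers; the 0.5 threshold is illustrative, since the slides note that the best threshold is task dependent.

```python
# Illustrative sketch of the media unit - annotation score (slide 22) and of
# the CrowdTruth vs. majority vote labelings compared in slides 23-25.
# The vector, worker count, and threshold are made up for illustration.
import numpy as np

def unit_annotation_score(media_unit_vec, annotation_index):
    """Cosine between the media unit vector and the unit vector that is 1
    at the given annotation and 0 elsewhere."""
    v = np.asarray(media_unit_vec, dtype=float)
    unit = np.zeros_like(v)
    unit[annotation_index] = 1.0
    return float(np.dot(v, unit) / (np.linalg.norm(v) * np.linalg.norm(unit)))

def crowdtruth_label(media_unit_vec, annotation_index, threshold=0.5):
    """Positive when the continuous score clears a task-specific threshold."""
    return unit_annotation_score(media_unit_vec, annotation_index) >= threshold

def majority_vote_label(media_unit_vec, annotation_index, n_workers):
    """Positive only when more than half of the workers selected the annotation."""
    return media_unit_vec[annotation_index] > n_workers / 2

media_unit = np.array([4, 3, 0, 0, 5, 1, 0])   # selections from 15 workers
print(round(unit_annotation_score(media_unit, 4), 2))   # ~0.7: clearly expressed
print(crowdtruth_label(media_unit, 4))                  # True
print(majority_vote_label(media_unit, 4, n_workers=15)) # False: signal discarded
```

In this made-up unit, only 5 of 15 workers selected the annotation, so majority vote rejects it even though the continuous score marks it as the most clearly expressed annotation in the unit; this is the kind of signal the slides argue is lost by enforcing consensus.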
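Slide 25 notes that each task has a different best threshold for the media unit - annotation score; a simple way to pick it is to sweep candidate thresholds and keep the one with the highest F1 against the trusted label set. The scores and trusted labels below are made up for illustration.

```python
# Illustrative sketch of picking the score threshold by F1 (slides 24-25).
def f1_score(pred, gold):
    tp = sum(p and g for p, g in zip(pred, gold))
    fp = sum(p and not g for p, g in zip(pred, gold))
    fn = sum((not p) and g for p, g in zip(pred, gold))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

scores = [0.14, 0.42, 0.56, 0.70, 0.90, 0.05]    # media unit - annotation scores
gold   = [False, True, True, True, True, False]  # trusted labels

best_f1, best_t = max((f1_score([s >= t for s in scores], gold), t)
                      for t in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7))
print(best_f1, best_t)   # F1 at the best threshold, and that threshold
```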
