Crowdsourcing Ambiguity-Aware
Ground Truth
Chris Welty, Anca Dumitrache, Oana Inel,
Benjamin Timmermans, Lora Aroyo
June 15th, 2017
Collective Intelligence Conference
• traditionally, disagreement is considered a measure of poor quality in the annotation task, rather than accepted as a natural property of semantic interpretation, because:
  – the task is poorly defined, or
  – annotators lack training
This makes the elimination of disagreement a goal.
What if it is GOOD?
Crowdsourcing Myth: Disagreement is Bad
• typically, annotators are asked whether a binary property holds for each example
• they are often not given a chance to say that the property may partially hold, or holds but is not clearly expressed
• the mathematics of using ground truth treats every example the same: it either matches the correct result or it does not
• poor quality examples tend to generate high disagreement
• disagreement allows us to weight sentences, giving us the ability to both train and evaluate a machine in a more flexible way (see the weighted-training sketch after this slide)
Crowdsourcing Myth: All Examples Are Created Equal
What if they are DIFFERENT?
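The weighting idea in the last bullet can be made concrete with a small sketch. The classifier, features, labels, and quality weights below are illustrative placeholders, not the authors' actual setup; the only point is that a per-sentence agreement score can be passed as a sample weight so that ambiguous examples count less during training.

```python
# Minimal sketch (assumed setup, not the authors' pipeline): weight training
# examples by a per-sentence quality score derived from annotator agreement,
# instead of treating every example the same.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.2, 1.0], [0.9, 0.1], [0.4, 0.7], [0.8, 0.3]])  # toy features
y = np.array([1, 0, 1, 0])                                      # crowd labels
unit_quality = np.array([0.95, 0.90, 0.40, 0.35])               # agreement-based weights (illustrative)

# Low-agreement (ambiguous) sentences contribute less to the training loss.
model = LogisticRegression().fit(X, y, sample_weight=unit_quality)
print(model.predict_proba(X)[:, 1])
```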
Related Work on Annotating Ambiguity
http://CrowdTruth.org
● Jurgens (2013): For word-sense disambiguation, the crowd with ambiguity modeling was able to achieve expert-level annotation quality.
● Cheatham et al. (2014): Current benchmarks in ontology alignment and evaluation are not designed to model uncertainty caused by disagreement between annotators, both expert and crowd.
● Plank et al. (2014): For part-of-speech tagging, most inter-annotator disagreement pointed to debatable cases in linguistic theory.
● Chang et al. (2017): In a workflow of tasks for collecting and correcting labels for text and images, they found that many ambiguous cases cannot be resolved by better annotation guidelines or through worker quality control.
• Annotator disagreement is signal, not noise.
• It is indicative of the variation in human semantic interpretation of signs.
• It can indicate ambiguity, vagueness, similarity, over-generality, etc., as well as quality.
CrowdTruth
http://CrowdTruth.org
Medical Relation Extraction
1. workers select the medical relations that
hold between the given 2 terms in a
sentence
2. closed task - workers pick from given set of
14 top UMLS relations
3. 975 sentences from PubMed abstracts +
distant supervision for term pair extraction
4. 15 workers /sentence
Twitter Event Extraction
1. workers select the events expressed
in the tweet
2. closed task - workers pick from set of
8 events
3. 2,019 English tweets, crawled based
on event hashtags
4. 7 workers /tweet
(*) no expert annotators
News Events Identification
1. workers highlight words/phrases that
describe an event in the sentence
2. open-ended task - workers can pick any
words
3. 200 sentences from English TimeBank
corpus
4. 15 workers /sentence
Sound Interpretation
1. workers give tags that describe a
sound
2. open-ended task - workers can pick
any tag
3. 284 sounds from Freesound
database
4. 10 workers /sound
http://CrowdTruth.org
Medical Relation Extraction
http://CrowdTruth.org
Patients with ACUTE FEVER and nausea could be suffering from
INFLUENZA AH1N1
Is ACUTE FEVER – related to → INFLUENZA AH1N1?
Twitter Event Extraction
http://CrowdTruth.org
[task example shown as an image]
News Event Identification
http://CrowdTruth.org
[task example shown as an image]
Sound Interpretation
http://CrowdTruth.org
[task example shown as an image]
Medical Relation Extraction
1. workers select the medical relations that
hold between the given 2 terms in a
sentence
2. closed task - workers pick from given set of
14 top UMLS relations
3. 975 sentences from PubMed abstracts +
distant supervision for term pair extraction
4. 15 workers /sentence
Twitter Event Extraction
1. workers select the events expressed
in the tweet
2. closed task - workers pick from set of
8 events
3. 2,019 English tweets, crawled based
on event hashtags
4. 7 workers /tweet
(*) no expert annotators
News Events Identification
1. workers highlight words/phrases that
describe an event in the sentence
2. open-ended task - workers can pick any
words
3. 200 sentences from English TimeBank
corpus
4. 15 workers /sentence
Sound Interpretation
1. workers give tags that describe a
sound
2. open-ended task - workers can pick
any tag
3. 284 sounds from Freesound
database
4. 10 workers /sound
http://CrowdTruth.org
[Figure: one worker's vector for a sentence, with a 1 for each relation the worker selected]
Worker Vector - Closed Task
http://CrowdTruth.org
Medical Relation Extraction
[Figure: the worker vectors for a sentence are summed per relation into the media unit vector 0 1 1 0 0 4 3 0 0 5 1 0]
Media Unit Vector - Closed Task
http://CrowdTruth.org
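A minimal sketch of how the two vectors on these slides can be built for a closed task. The relation labels and worker selections below are illustrative placeholders (the real task uses the 14 top UMLS relations), but the construction follows the slides: a binary vector per worker, summed into the media unit vector.

```python
# Sketch of worker vectors and the media unit vector for a closed task.
# The relation list and worker selections are illustrative, not real data.
import numpy as np

RELATIONS = ["treats", "causes", "prevents", "diagnoses", "symptom", "side effect",
             "contraindicates", "location", "manifestation", "associated with",
             "part of", "other"]  # stand-in for the 14 top UMLS relations

def worker_vector(selected, relations=RELATIONS):
    """Binary vector with a 1 for every relation this worker selected."""
    return np.array([1 if r in selected else 0 for r in relations])

def media_unit_vector(selections_per_worker, relations=RELATIONS):
    """Per-relation sum of all worker vectors for one media unit (sentence)."""
    return np.sum([worker_vector(s, relations) for s in selections_per_worker], axis=0)

# Example: three of the workers annotating one sentence
selections = [{"causes", "symptom"}, {"causes"}, {"associated with"}]
print(media_unit_vector(selections))  # [0 2 0 0 1 0 0 0 0 1 0 0]
```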
Unclear relationship between the two arguments reflected
in the disagreement
Medical Relation Extraction
http://CrowdTruth.org
Clearly expressed relation between the two arguments reflected in
the agreement
Medical Relation Extraction
http://CrowdTruth.org
Twitter Event Extraction
http://CrowdTruth.org
The tension building between China, Japan, U.S., and Vietnam is
reaching new heights this week - let's stay tuned! #APAC
#powerstruggle
Which of the following EVENTS can you identify in the tweet?
[Worker votes across the 8 events + None: 4 0 0 0 6 0 0 0 2; the nonzero counts correspond to "islands disputed between China and Japan", "anti China protests in Vietnam", and "None"]
Unclear description of events in tweet
reflected in the disagreement
Twitter Event Extraction
http://CrowdTruth.org
RT @jc_stubbs: Another tragic day in #Ukraine - More than 50
rebels killed as new leader unleashes assault:
http://t.co/wcfU3kyAFX
Which of the following EVENTS can you identify in the tweet?
[Worker votes across the 8 events + None: 0 6 0 0 0 0 0 0 1; the nonzero counts correspond to "Ukraine crisis 2014" and "None"]
Clear description of events in tweet
reflected in the agreement
Media Unit Vector - Open Task
http://CrowdTruth.org
Worker 1: balloon exploding
Worker 2: gun
Worker 3: gunshot
Worker 4: loud noise, pop
Worker 5: gun, balloon
[Clustered tag counts: balloon (2), gun/gunshot (3), loud noise (1), pop (1)]
+ clustering
(e.g. word2vec)
Sound Interpretation
Multiple interpretations of the sound
reflected in the disagreement
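A minimal sketch of how open-ended tags might be grouped before building the media unit vector, as suggested by the "+ clustering (e.g. word2vec)" note above. The greedy clustering rule, the similarity threshold, and the toy embeddings are all assumptions made for illustration; a real pipeline would use word2vec (or similar) vectors.

```python
# Sketch: cluster free-text tags by embedding similarity, then count votes per
# cluster. The embeddings and threshold below are toy values for illustration.
import numpy as np

TOY_EMBEDDINGS = {                      # stand-ins for word2vec vectors
    "balloon":    np.array([1.0, 0.0, 0.0]),
    "gun":        np.array([0.0, 1.0, 0.0]),
    "gunshot":    np.array([0.1, 0.9, 0.0]),
    "loud noise": np.array([0.0, 0.0, 1.0]),
    "pop":        np.array([0.4, 0.3, 0.4]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def cluster_tags(tag_counts, embed, threshold=0.7):
    """Greedily merge each tag into the first cluster whose seed tag is similar enough."""
    clusters = []
    for tag, count in tag_counts.items():
        for c in clusters:
            if cosine(embed[tag], embed[c["seed"]]) >= threshold:
                c["tags"].append(tag)
                c["count"] += count
                break
        else:
            clusters.append({"seed": tag, "tags": [tag], "count": count})
    return clusters

# Tags from the 5 workers on the previous slide ("balloon exploding" counted as "balloon")
tag_counts = {"balloon": 2, "gun": 2, "gunshot": 1, "loud noise": 1, "pop": 1}
for c in cluster_tags(tag_counts, TOY_EMBEDDINGS):
    print("/".join(c["tags"]), c["count"])   # balloon 2, gun/gunshot 3, loud noise 1, pop 1
```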
Sound Interpretation
http://CrowdTruth.org
Worker 1: siren
Worker 2: police siren
Worker 3: siren, alarm
Worker 4: siren
Worker 5: annoying
[Clustered tag counts: siren/police siren (4), alarm (1), annoying (1)]
+ clustering
(e.g. word2vec)
Clear meaning of the sound
reflected in the agreement
News Event Identification
http://CrowdTruth.org
Other than usage in business, Internet technology is also beginning
to infiltrate the lifestyle domain .
HIGHLIGHT the words/phrases that refer to an EVENT.
[Highlight counts per word: Other (0), usage (3), business (1), Internet (5), technology (4), is (2), also (2), beginning (4), infiltrate (5), lifestyle (3), domain (3)]
Unclear specification of events in sentence
reflected in the disagreement
Clear specification of events in sentence
reflected in the agreement
News Event Identification
http://CrowdTruth.org
Most of the city's monuments were destroyed including a
magnificent tiled mosque which dominated the skyline for centuries.
HIGHLIGHT the words/phrases that refer to an EVENT.
[Highlight counts per word: Most (1), city (3), monuments (5), were (5), destroyed (12), including (0), magnificent (1), tiled (1), mosque (1), dominated (4), skyline (1), centuries (0)]
Measures how clearly a media unit expresses an annotation: the cosine between the unit vector for that annotation and the media unit vector.
[Figure: unit vector for annotation A6 compared against the media unit vector 0 1 1 0 0 4 3 0 0 5 1 0; Cosine = .55]
Media Unit - Annotation Score
http://CrowdTruth.org
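A minimal sketch of the computation shown on this slide, using the media unit vector from the closed-task example. Reading annotation A6 as the sixth position of the vector is an assumption made for illustration.

```python
# Media unit - annotation score: cosine between the media unit vector and the
# unit vector for one annotation (here A6, assumed to be the sixth position).
import numpy as np

media_unit = np.array([0, 1, 1, 0, 0, 4, 3, 0, 0, 5, 1, 0])
annotation_a6 = np.zeros(len(media_unit))
annotation_a6[5] = 1.0                      # unit vector for annotation A6

score = media_unit @ annotation_a6 / (
    np.linalg.norm(media_unit) * np.linalg.norm(annotation_a6))
print(round(score, 2))                      # 0.55, matching the slide
```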
• Goal: which is the most accurate data labeling method? (a sketch of the four methods follows below)
  ○ CrowdTruth: media unit - annotation score (continuous value)
  ○ Majority Vote: decision of the majority of workers (discrete)
  ○ Single: decision of a single worker, randomly sampled (discrete)
  ○ Expert: decision of a domain expert (discrete)
• Approach: evaluate against a trusted label set, built from either:
  ○ agreement between crowd and expert, or
  ○ manual evaluation where they disagree
Experimental Setup
http://CrowdTruth.org
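A minimal sketch of the four labeling methods applied to one media unit. The worker votes and the expert label are illustrative; for simplicity the continuous score here is just the fraction of workers selecting the annotation, whereas the actual CrowdTruth score is the cosine-based media unit - annotation score shown earlier.

```python
# Illustrative comparison of the four labeling methods on one media unit.
import random
import numpy as np

votes = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1]  # 15 workers: did they select the annotation?
expert_label = 1                                        # illustrative expert decision

crowdtruth_score = np.mean(votes)                       # continuous (simplified stand-in score)
majority_vote = int(sum(votes) > len(votes) / 2)        # discrete majority decision
single = random.choice(votes)                           # discrete, one randomly sampled worker
expert = expert_label                                   # discrete domain-expert decision

print(crowdtruth_score, majority_vote, single, expert)
```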
Evaluation: F1 score
[Figure: F1 per task for CrowdTruth, Majority Vote, Single, and Expert labels]
CrowdTruth performs better than Majority Vote, and at least as well as Expert.
Each task has a different best threshold on the media unit - annotation score.
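One way to read the threshold remark: the continuous media unit - annotation score is binarized at a threshold before computing F1 against the trusted labels, and the best threshold differs per task. The scores, labels, and threshold grid below are illustrative.

```python
# Sketch of sweeping the binarization threshold on the media unit - annotation
# score and reporting F1 against trusted labels. All data here is illustrative.
import numpy as np
from sklearn.metrics import f1_score

scores = np.array([0.91, 0.15, 0.67, 0.42, 0.88, 0.05, 0.73, 0.30])  # continuous scores
trusted = np.array([1, 0, 1, 0, 1, 0, 1, 1])                          # trusted labels

for threshold in np.arange(0.1, 1.0, 0.1):
    predicted = (scores >= threshold).astype(int)
    print(f"threshold {threshold:.1f}: F1 = {f1_score(trusted, predicted):.2f}")
```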
Evaluation: number of workers
[Figure: F1 per task as a function of the number of workers per media unit]
Each task reaches a stable F1 at a different number of workers.
Majority Vote never beats CrowdTruth.
Sound Interpretation needs more workers.
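A minimal sketch of how the worker-count analysis can be run: subsample k workers per media unit, rebuild a (simplified) score, and track F1 against the trusted labels. All data here is randomly generated for illustration.

```python
# Sketch: F1 as a function of the number of workers per media unit,
# using randomly generated votes and a simplified fraction-of-votes score.
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
n_units, max_workers = 200, 15
votes = rng.integers(0, 2, size=(n_units, max_workers))      # per-unit worker votes
trusted = (votes.mean(axis=1) > 0.5).astype(int)             # stand-in trusted labels

for k in range(1, max_workers + 1):
    subset = votes[:, rng.permutation(max_workers)[:k]]      # keep k random workers
    predicted = (subset.mean(axis=1) >= 0.5).astype(int)     # binarize at 0.5
    print(f"{k:2d} workers: F1 = {f1_score(trusted, predicted):.2f}")
```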
• CrowdTruth performs just as well as domain experts
• the crowd is also cheaper
• the crowd is always available
• capturing ambiguity is essential:
  • majority voting discards important signals in the data
  • using only a few annotators for ground truth is faulty
• the optimal number of workers per media unit is task dependent
Experiments showed that:
http://CrowdTruth.org
CrowdTruth.org
Dumitrache et al.: Empirical Methodology for Crowdsourcing Ground Truth. Semantic Web Journal, Special Issue on Human Computation and Crowdsourcing (HC&C) in the Context of the Semantic Web (in review).
Ambiguity in Crowdsourcing
http://CrowdTruth.org
[Diagram: triangle linking Media Unit, Annotation, and Worker]
Disagreement can indicate ambiguity!
