The process of gathering ground truth data through human annotation is a major bottleneck in the use of information extraction methods. Crowdsourcing-based approaches are gaining popularity as a way to address the scarcity of annotators and the growing volume of data to be labeled. Typically these practices use inter-annotator agreement as a measure of quality. However, the assumption that agreement indicates quality often creates issues in practice. In previous experiments, we found that inter-annotator disagreement is usually not captured, either because the number of annotators is too small to reflect the full diversity of opinion, or because the crowd data is aggregated with metrics that enforce consensus, such as majority vote. These practices create artificial ground truth data that is neither general nor reflective of the ambiguity inherent in the data.
To address these issues, we propose an alternative approach for crowdsourcing ground truth data that, instead of enforcing agreement between annotators, captures the ambiguity inherent in semantic annotation through disagreement-aware metrics for aggregating crowdsourcing responses. Based on this principle, we implemented the CrowdTruth framework for machine-human computation, which first introduced the disagreement-aware metrics and provides a pipeline to process crowdsourcing data with them.
In this paper, we apply the CrowdTruth methodology to collect data over a set of diverse tasks: medical relation extraction, Twitter event identification, news event extraction, and sound interpretation. We show that capturing disagreement is essential for acquiring high-quality ground truth. We do this by comparing the quality of data aggregated with CrowdTruth metrics against majority vote, a method that enforces consensus among annotators. By applying our analysis over a set of diverse tasks we show that, even though ambiguity manifests differently depending on the task, our theory of inter-annotator disagreement as a property of ambiguity generalizes.
2. Crowdsourcing Myth: Disagreement is Bad
• traditionally, disagreement is considered a measure of poor quality in the annotation task because:
  – the task is poorly defined, or
  – the annotators lack training
• rather than accepting disagreement as a natural property of semantic interpretation
• this makes the elimination of disagreement a goal
What if it is GOOD?
3. Crowdsourcing Myth: All Examples Are Created Equal
• typically, annotators are asked whether a binary property holds for each example
• they are often not given a chance to say that the property may partially hold, or holds but is not clearly expressed
• the mathematics of using ground truth treats every example the same: it either matches the correct result or it does not
• poor quality examples tend to generate high disagreement
• disagreement allows us to weight sentences, giving us the ability to both train and evaluate a machine in a more flexible way (see the sketch after this list)
What if they are DIFFERENT?
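A minimal sketch of what such weighting could look like at evaluation time, assuming each example already has a quality score derived from annotator agreement; the scores and labels below are hypothetical, not taken from the datasets in this work:

```python
# Minimal sketch (not from the paper): using a per-example score derived from
# annotator agreement to weight evaluation, instead of treating every example
# as equally reliable. The scores and labels below are hypothetical.
import numpy as np

def weighted_accuracy(y_true, y_pred, scores):
    """Accuracy in which each example counts proportionally to its clarity score."""
    y_true, y_pred, scores = map(np.asarray, (y_true, y_pred, scores))
    correct = (y_true == y_pred).astype(float)
    return float(np.sum(correct * scores) / np.sum(scores))

# The ambiguous third example (score 0.3) influences the result less.
print(weighted_accuracy([1, 0, 1], [1, 0, 0], [0.9, 0.8, 0.3]))  # 0.85
```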
4. Related Work on Annotating Ambiguity
● Jurgens (2013): for word-sense disambiguation, the crowd with ambiguity modeling was able to achieve expert-level quality of annotations.
● Cheatham et al. (2014): current benchmarks in ontology alignment and evaluation are not designed to model uncertainty caused by disagreement between annotators, both expert and crowd.
● Plank et al. (2014): for part-of-speech tagging, most inter-annotator disagreement pointed to debatable cases in linguistic theory.
● Chang et al. (2017): in a workflow of tasks for collecting and correcting labels for text and images, many ambiguous cases could not be resolved by better annotation guidelines or through worker quality control.
5. CrowdTruth
• Annotator disagreement is signal, not noise.
• It is indicative of the variation in human semantic interpretation of signs.
• It can indicate ambiguity, vagueness, similarity, over-generality, etc., as well as quality.
6. Medical Relation Extraction
1. workers select the medical relations that hold between the given 2 terms in a sentence
2. closed task - workers pick from a given set of 14 top UMLS relations
3. 975 sentences from PubMed abstracts + distant supervision for term pair extraction
4. 15 workers / sentence

Twitter Event Extraction
1. workers select the events expressed in the tweet
2. closed task - workers pick from a set of 8 events
3. 2,019 English tweets, crawled based on event hashtags
4. 7 workers / tweet
(*) no expert annotators

News Events Identification
1. workers highlight words/phrases that describe an event in the sentence
2. open-ended task - workers can pick any words
3. 200 sentences from the English TimeBank corpus
4. 15 workers / sentence

Sound Interpretation
1. workers give tags that describe a sound
2. open-ended task - workers can pick any tag
3. 284 sounds from the Freesound database
4. 10 workers / sound
14. Medical Relation Extraction
An unclear relationship between the two arguments is reflected in the disagreement.
15. Medical Relation Extraction
A clearly expressed relation between the two arguments is reflected in the agreement.
16. Twitter Event Extraction
Tweet: "The tension building between China, Japan, U.S., and Vietnam is reaching new heights this week - let's stay tuned! #APAC #powerstruggle"
Which of the following EVENTS can you identify in the tweet?
Worker selections: islands disputed between China and Japan (4), anti China protests in Vietnam (6), None (2); all other events (0).
An unclear description of events in the tweet is reflected in the disagreement.
17. Twitter Event Extraction
Tweet: "RT @jc_stubbs: Another tragic day in #Ukraine - More than 50 rebels killed as new leader unleashes assault: http://t.co/wcfU3kyAFX"
Which of the following EVENTS can you identify in the tweet?
Worker selections: Ukraine crisis 2014 (6), None (1); all other events (0).
A clear description of events in the tweet is reflected in the agreement.
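A minimal sketch of how worker selections for one tweet can be collected into a vector of counts for such a closed task; the label list and worker picks below are illustrative (the real task used 8 candidate events plus "None"), not the actual crowdsourced data:

```python
# Minimal sketch (not the actual CrowdTruth implementation): collecting worker
# selections for one tweet into a vector of counts, one position per candidate
# event label. Labels and picks are illustrative.
from collections import Counter

EVENT_LABELS = ["islands disputed between China and Japan",
                "anti China protests in Vietnam",
                "Ukraine crisis 2014",
                "None"]

def media_unit_vector(worker_annotations, labels=EVENT_LABELS):
    """worker_annotations: one set of selected labels per worker."""
    counts = Counter(label for picks in worker_annotations for label in picks)
    return [counts.get(label, 0) for label in labels]

# Seven hypothetical workers annotating the Ukraine tweet above.
annotations = [{"Ukraine crisis 2014"}] * 6 + [{"None"}]
print(media_unit_vector(annotations))  # [0, 0, 6, 1]
```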
18. Sound Interpretation - Media Unit Vector for an Open Task
Worker 1: balloon exploding
Worker 2: gun
Worker 3: gunshot
Worker 4: loud noise, pop
Worker 5: gun, balloon
After clustering the tags (e.g. with word2vec): balloon (2), gun/gunshot (3), loud noise (1), pop (1).
Multiple interpretations of the sound are reflected in the disagreement.
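The clustering step itself is not spelled out on the slide; below is a minimal sketch of one way to group free-text tags before counting them. The `embed` function is a placeholder for any word or phrase embedding lookup (e.g. word2vec) and is an assumption here, not part of the CrowdTruth code:

```python
# Minimal sketch (assumption, not CrowdTruth code): grouping free-text tags by
# embedding similarity before counting them. `embed` is a placeholder for any
# word/phrase embedding lookup (e.g. word2vec); it is not defined here.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def cluster_tags(tags, embed, threshold=0.7):
    """Greedy clustering: a tag joins the first cluster whose seed is similar enough."""
    clusters = []  # list of (seed_vector, member_tags)
    for tag in tags:
        vec = embed(tag)
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(tag)
                break
        else:
            clusters.append((vec, [tag]))
    return [members for _, members in clusters]

# Usage (with a real embedding function): near-synonyms such as "gun" and
# "gunshot" end up in the same cluster and count towards one vector component.
# cluster_tags(["balloon exploding", "gun", "gunshot", "loud noise", "pop"], embed)
```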
19. Sound Interpretation
Worker 1: siren
Worker 2: police siren
Worker 3: siren, alarm
Worker 4: siren
Worker 5: annoying
After clustering the tags (e.g. with word2vec): siren (4), alarm (1), annoying (1).
The clear meaning of the sound is reflected in the agreement.
20. News Event Identification
Sentence: "Other than usage in business, Internet technology is also beginning to infiltrate the lifestyle domain."
HIGHLIGHT the words/phrases that refer to an EVENT.
Highlight counts per word: Internet (5), technology (4), is (2), also (2), beginning (4), infiltrate (5), lifestyle (3), business (1), usage (3), domain (3), Other (0).
An unclear specification of events in the sentence is reflected in the disagreement.
21. News Event Identification
Sentence: "Most of the city's monuments were destroyed including a magnificent tiled mosque which dominated the skyline for centuries."
HIGHLIGHT the words/phrases that refer to an EVENT.
Highlight counts per word: were (5), destroyed (12), including (0), magnificent (1), tiled (1), mosque (1), dominated (4), skyline (1), centuries (0), monuments (5), city (3), Most (1).
A clear specification of events in the sentence is reflected in the agreement.
22. Media Unit - Annotation Score
Measures how clearly a media unit expresses an annotation.
Example: media unit vector [0 1 1 0 0 4 3 0 0 5 1 0]; unit vector for annotation A6 (1 at the position of A6, 0 elsewhere); cosine between the two vectors = .55.
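As a sketch of the computation behind the slide's example, and assuming the vector layout shown above, the score can be reproduced as a cosine similarity:

```python
# Sketch of the computation behind the slide's example, assuming the vector
# layout shown above: the score is the cosine similarity between the media
# unit vector and the unit vector that singles out one annotation.
import numpy as np

def unit_annotation_score(media_unit_vector, annotation_index):
    mu = np.asarray(media_unit_vector, dtype=float)
    unit = np.zeros(len(mu))
    unit[annotation_index] = 1.0
    return float(np.dot(mu, unit) / (np.linalg.norm(mu) * np.linalg.norm(unit)))

media_unit = [0, 1, 1, 0, 0, 4, 3, 0, 0, 5, 1, 0]
print(round(unit_annotation_score(media_unit, 5), 2))  # 0.55 for annotation A6
```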
23. Experimental Setup
• Goal: which is the most accurate data labeling method?
  ○ CrowdTruth: media unit-annotation score (continuous value)
  ○ Majority Vote: decision of the majority of workers (discrete)
  ○ Single: decision of a single worker, randomly sampled (discrete)
  ○ Expert: decision of a domain expert (discrete)
• Approach: evaluate against a set of trusted labels, established by either:
  ○ agreement between crowd and expert, or
  ○ manual evaluation in cases of disagreement
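A minimal sketch (illustrative only, not the experiment code) contrasting the first two labeling methods on a single media unit, reusing the worker counts from the Ukraine tweet example: majority vote produces a hard per-annotation decision, while the CrowdTruth score stays continuous and is thresholded only later:

```python
# Minimal sketch (illustrative, not the experiment code) contrasting majority
# vote with the continuous media unit-annotation score on one media unit.
import numpy as np

def majority_vote_labels(counts, n_workers):
    """Discrete: an annotation is accepted iff more than half the workers chose it."""
    return [int(c > n_workers / 2) for c in counts]

def unit_annotation_scores(counts):
    """Continuous: cosine of the media unit vector with each annotation's unit vector."""
    mu = np.asarray(counts, dtype=float)
    return list(mu / np.linalg.norm(mu))

counts = [0, 6, 0, 0, 0, 0, 0, 0, 1]   # worker counts per candidate event (7 workers)
print(majority_vote_labels(counts, 7))  # [0, 1, 0, 0, 0, 0, 0, 0, 0]
print([round(s, 2) for s in unit_annotation_scores(counts)])
# [0.0, 0.99, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.16]
```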
25. Evaluation: F1 score
CrowdTruth performs better than Majority Vote, and at least as well as Expert.
Each task has a different best threshold on the media unit - annotation score (see the sketch below).
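One way such task-specific thresholds could be selected is a simple sweep against the trusted labels; this is a sketch with placeholder inputs rather than the actual experimental data, using scikit-learn's f1_score:

```python
# Minimal sketch (placeholder inputs, not the actual experimental data) of how
# a task-specific threshold on the media unit-annotation score could be chosen:
# sweep candidate thresholds and keep the one with the best F1 against the
# trusted labels.
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(scores, trusted_labels, thresholds=np.linspace(0.0, 1.0, 21)):
    """Return the threshold whose binarized scores maximize F1."""
    scores = np.asarray(scores)
    return max(thresholds,
               key=lambda t: f1_score(trusted_labels, (scores >= t).astype(int)))

# scores  = unit-annotation scores per (media unit, annotation) pair
# trusted = trusted 0/1 labels for the same pairs
# print(best_threshold(scores, trusted))
```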
27. Evaluation: number of workers
Each task reaches a stable F1 with a different number of workers.
Majority Vote never beats CrowdTruth.
Sound Interpretation needs more workers than the other tasks.
28. Experiments showed that:
• CrowdTruth performs just as well as domain experts
• the crowd is also cheaper
• the crowd is always available
• capturing ambiguity is essential:
  • majority voting discards important signals in the data
  • using only a few annotators for ground truth is unreliable
• the optimal number of workers per media unit is task dependent
29. CrowdTruth.org
Dumitrache et al.: Empirical Methodology for Crowdsourcing Ground Truth. Semantic Web Journal Special Issue on Human Computation and Crowdsourcing (HC&C) in the Context of the Semantic Web (in review).