Crowdsourcing ambiguity aware ground truth - collective intelligence 2017

Crowdsourcing Ambiguity-Aware
Ground Truth
Chris Welty, Anca Dumitrache, Oana Inel,
Benjamin Timmermans, Lora Aroyo
June 15th, 2017
Collective Intelligence Conference

• rather than accepting
disagreement as a natural
property of semantic
interpretation
• traditionally, disagreement is
considered a measure of poor
quality in the annotation task
because:
– task is poorly defined or
– annotators lack training
This makes the elimination of
disagreement a goal
What if it is GOOD?
Crowdsourcing
Myth: Disagreement
is Bad

• typically annotators are asked whether a
binary property holds for each example
• often not given a chance to say that the
property may partially hold, or holds
but is not clearly expressed
• mathematics of using ground truth
treats every example the same – either
match correct result or not
• poor quality examples tend to generate
high disagreement
• disagreement allows us to weight
sentences, giving us the ability to both
train and evaluate a machine in a more
flexible way
Crowdsourcing
Myth: All Examples
Are Created Equal
What if they are DIFFERENT?

Related Work on Annotating Ambiguity
http://CrowdTruth.org
● Juergens (2013): For word-sense disambiguation, the crowd with
ambiguity modeling was able to achieve expert-level quality of
annotations.
● Cheatham et al. (2014): Current benchmarks in ontology alignment
and evaluation are not designed to model uncertainty caused by
disagreement between annotators, both expert and crowd.
● Plank et al. (2014): For part-of-speech tagging, most inter-annotator
disagreement pointed to debatable cases in linguistic theory.
● Chang et al. (2017): In a workflow of tasks for collecting and
correcting labels for text and images, found that many ambiguous
cases cannot be resolved by better annotation guidelines or through
worker quality control.

• Annotator disagreement
is signal, not noise.
• It is indicative of the
variation in human
semantic interpretation of
signs
• It can indicate ambiguity,
vagueness, similarity,
over-generality, etc,
as well as quality
CrowdTruth

Medical Relation Extraction
1. workers select the medical relations that
hold between the given 2 terms in a
sentence
2. closed task - workers pick from given set of
14 top UMLS relations
3. 975 sentences from PubMed abstracts +
distant supervision for term pair extraction
4. 15 workers /sentence
Twitter Event Extraction
1. workers select the events expressed
in the tweet
2. closed task - workers pick from set of
8 events
3. 2,019 English tweets, crawled based
on event hashtags
4. 7 workers /tweet
(*) no expert annotators
News Events Identification
1. workers highlight words/phrases that
describe an event in the sentence
2. open-ended task - workers can pick any
words
3. 200 sentences from English TimeBank
corpus
Sound Interpretation
1. workers give tags that describe a
sound
2. open-ended task - workers can pick
any tag
3. 284 sounds from Freesound
database

Patients with ACUTE FEVER and nausea could be suffering from
INFLUENZA AH1N1
Is ACUTE FEVER – related to → INFLUENZA AH1N1?

News Event Identification

1 1 1
Worker Vector - Closed Task

1 1 1
1 1
1
1 1
1 1
1 1
1
1
1
0 1 1 0 0 4 3 0 0 5 1 0
Media Unit Vector - Closed Task

Unclear relationship between the two arguments reflected
in the disagreement

Clearly expressed relation between the two arguments reflected in
the agreement

The tension building between China, Japan, U.S., and Vietnam is
reaching new heights this week - let's stay tuned! #APAC
#powerstruggle
Which of the following EVENTS can you identify in the tweet?
4 0 0 0 6 0 0 0 2
islands disputed
between China
and Japan
anti China
protests in
Vietnam
None
Unclear description of events in tweet
reflected in the disagreement

RT @jc_stubbs: Another tragic day in #Ukraine - More than 50
rebels killed as new leader unleashes assault:
http://t.co/wcfU3kyAFX
Which of the following EVENTS can you identify in the tweet?
0 6 0 0 0 0 0 0 1
Ukraine
crisis 2014
None
Clear description of events in tweet
reflected in the agreement

Media Unit Vector - Open Task
Worker 1: balloon exploding
Worker 2: gun
Worker 3: gunshot
Worker 4: loud noise, pop
Worker 5: gun, balloon
2 3 1 1
balloon
gun,gunshot
loudnoise
pop
+ clustering
(e.g. word2vec)
Multiple interpretations of the sound

Worker 1: siren
Worker 2: police siren
Worker 3: siren, alarm
Worker 4: siren
Worker 5: annoying
4 1 1
siren
alarm
annoying
+ clustering
(e.g. word2vec)
Clear meaning of the sound

Other than usage in business, Internet technology is also beginning
to infiltrate the lifestyle domain .
HIGHLIGHT the words/phrases that refer to an EVENT.
5 4 2 2 4 5 3Internet
technology
is
also
beginning
infiltrate
lifestyle
1
business
3
usage
0
Other
3
domain
Unclear specification of events in sentence

Clear specification of events in sentence
Most of the city's monuments were destroyed including a
magnificent tiled mosque which dominated the skyline for centuries.
HIGHLIGHT the words/phrases that refer to an EVENT.
5 12 0 1 1 1 4 1 0
were
destroyed
including
magnificent
tiled
mosque
dominated
skyline
centuries
5
monuments
3
city
1
Most

Measures how clearly a media unit expresses an
annotation
1
0 1 1 0 0 4 3 0 0 5 1 0
Unit vector for
annotation A6
Media Unit Vector
Cosine = .55
Media Unit - Annotation Score

• Goal: what is the most accurate data
labeling method?
○ CrowdTruth: media unit-annotation
score (continuous value)
○ Majority Vote: decision of majority of
workers (discrete)
○ Single: decision of a single worker,
randomly sampled (discrete)
○ Expert: decision of domain expert
(discrete)
• Approach: evaluate with trusted
labels set with either:
○ agreement between crowd and expert, or
○ manual evaluation when disagreement
Experimental Setup

Evaluation: F1 score
CrowdTruth performs better than Majority
Vote, at least as well as Expert.
Each task has different best thresholds in
the media unit - annotation score.

Evaluation: number of workers
Each task reaches stable F1 for different
number of workers.
Majority Vote never beats CrowdTruth.
Sound Interpretation needs more workers.

• CrowdTruth performs just as
well as domain experts
• crowd is also cheaper
• crowd is always available
• capturing ambiguity is essential:
• majority voting discards
important signals in the data
• using only a few annotators for
ground truth is faulty
• optimal number of workers /
media unit is task dependent
Experiments
proved that:

CrowdTruth.org
umitrache et al.: Empirical Methodology for Crowdsourcing Ground
ruth. Semantic Web Journal Special Issue on Human Computation
nd Crowdsourcing (HC&C) in the Context of the Semantic Web (in
review).

Ambiguity in Crowdsourcing
Media Unit
Annotation Worker

Ambiguity in Crowdsourcing
Media Unit
Annotation Worker
Disagreement can indicate ambiguity!

Crowdsourcing ambiguity aware ground truth - collective intelligence 2017

More Related Content

What's hot

Similar to Crowdsourcing ambiguity aware ground truth - collective intelligence 2017

More from Lora Aroyo

Recently uploaded

Crowdsourcing ambiguity aware ground truth - collective intelligence 2017