CrowdTruth Tutorial: Using the Crowd to Understand Ambiguity
3.
1759: computing the date of the return of Halley’s Comet
1892: the term appears in the NYT for the first time
http://crowdtruth.org
4.
• collective decisions of large groups of people
• a group of error-prone decision-makers can be surprisingly good at picking the best choice
• for a binary decision, the odds that the majority of people in a crowd will pick the right answer are greater than the odds that any one of them will pick it on their own
• performance gets better as the group size grows
1785
Marquis de Condorcet
statistics for democracy
“wisdom of crowds”
http://crowdtruth.org
5.
• asked 787 people to guess the weight of an ox
• none got the right answer
• their collective guess was almost perfect
“wisdom of crowds”
Sir Francis Galton
http://crowdtruth.org
6.
WWII Math Rosies
1942: Ballistics calculations and flight trajectories
http://crowdtruth.org
7.
“Who Wants to Be a Millionaire?”
• help from an individual expert or an audience poll
• the majority of the audience was right 91% of the time
• individuals were right only 65% of the time
“wisdom of crowds”
http://crowdtruth.org
8.
humans solving computational problems
too hard for computers alone
Human Computing
http://crowdtruth.org
9.
“we treat human brains as
processors in a distributed system each performing
a small part of a massive computation”
Human Computing
Luis von Ahn
http://crowdtruth.org
10.
• 1979: amateur naturalists Don McCrimmon & Cal Smith first applied the concept of “citizen science”
• 1980: Rick Bonney coined the term “citizen science”
http://crowdtruth.org
11.
• 2005: term “human computation” by Luis von Ahn
• 2005: term “crowdsourcing” by Jeff Howe & Mark Robinson (Wired)
• 2006: Games with a Purpose by Luis von Ahn
• 2006: “The Rise of Crowdsourcing” by Jeff Howe (Wired)
• 2007: reCAPTCHA (company)
• 2011: Duolingo, Luis von Ahn
2005
http://crowdtruth.org
20.
• Human annotators with domain knowledge provide better annotated data, e.g. if you want medical texts annotated for medical relations, you need medical experts
• But experts are expensive & don’t scale
• Multiple perspectives on data can be useful, beyond what experts believe is salient or correct
Crowdsourcing Myth:
Ask the Expert
What if the CROWD IS BETTER?
http://CrowdTruth.org
21.
What is the relation between the highlighted terms?
He was the first physician to identify the relationship
between HEMOPHILIA and HEMOPHILIC ARTHROPATHY.
Experts Better?
Crowd reads the text literally, providing better examples to the machine
experts: cause
crowd: no relation
http://CrowdTruth.org
22.
• traditionally, disagreement is considered a measure of poor quality in the annotation task because:
– the task is poorly defined, or
– the annotators lack training
• this makes the elimination of disagreement a goal, rather than accepting disagreement as a natural property of semantic interpretation
Crowdsourcing Myth:
Disagreement is Bad
What if it is GOOD?
http://CrowdTruth.org
23.
Disagreement Bad?
Does each sentence express the TREAT relation?
ANTIBIOTICS are the first line treatment for indications of TYPHUS. → agreement 95%
Patients with TYPHUS who were given ANTIBIOTICS exhibited side-effects. → agreement 80%
With ANTIBIOTICS in short supply, DDT was used during WWII to control the insect vectors of TYPHUS. → agreement 50%
Disagreement can reflect the degree of clarity in a sentence
http://CrowdTruth.org
24.
Do gold standards capture all this?
http://crowdtruth.org
25.
http://crowdtruth.org
● single human experts → never perfectly correct
● gold standard is measured in inter-annotator agreement → what if disagreeing annotators are both right?
● gold standards → do not account for alternative interpretations & clarity
26.
http://crowdtruth.org
The fallacy of the “one truth” assumption that pervades computational semantics
27.
“the best collective decisions are the result of disagreement, not consensus or compromise”
James Surowiecki
http://crowdtruth.org
28.
Position
disagreement is a signal and not noise
can we harness it?
29.
• Annotator disagreement is signal, not noise.
• It is indicative of the variation in human semantic interpretation of signs.
• It can indicate ambiguity, vagueness, similarity, over-generality, etc., as well as quality.
CrowdTruth
http://CrowdTruth.org
30.
RelEx Task in CrowdFlower
Patients with ACUTE FEVER and nausea could be suffering from INFLUENZA AH1N1.
Is ACUTE FEVER – related to → INFLUENZA AH1N1?
http://crowdtruth.org
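In the task, each worker selects the relations they see between the two highlighted terms; the per-worker choices and their per-sentence aggregate are what the CrowdTruth metrics operate on. Below is a minimal Python sketch of that representation, assuming a small illustrative relation set and set-valued worker judgments; the names and example data are hypothetical, not the actual CrowdFlower template.

from collections import Counter

# Candidate relations offered in the task (illustrative subset, not the real template).
RELATIONS = ["causes", "treats", "part_of", "symptom", "none"]

def worker_vector(selected, relations=RELATIONS):
    # Binary vector: 1 for each relation this worker selected in the sentence.
    return [1 if r in selected else 0 for r in relations]

def sentence_vector(worker_selections, relations=RELATIONS):
    # Element-wise sum of all worker vectors for one sentence.
    counts = Counter(r for sel in worker_selections for r in sel)
    return [counts[r] for r in relations]

# Example: three hypothetical workers judging the ACUTE FEVER / INFLUENZA AH1N1 sentence.
judgments = [{"symptom"}, {"causes", "symptom"}, {"none"}]
print(worker_vector(judgments[0]))   # -> [0, 0, 0, 1, 0]
print(sentence_vector(judgments))    # -> [1, 0, 0, 2, 1]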
33.
Unclear relationship between the two arguments reflected
in the disagreement
http://crowdtruth.org
What’s in a sentence vector?
34.
Clearly expressed relation between the two arguments reflected in
the agreement
http://crowdtruth.org
What’s in a sentence vector?
35.
Patients with ACUTE FEVER and nausea could be suffering from
INFLUENZA AH1N1.
measures how clearly a relation is expressed in a sentence
Sentence-Relation Score
http://crowdtruth.org
Sentence vector (# workers out of 15 selecting each relation):
causes: 10/15, treats: 0/15, partof: 3/15, symptom: 8/15, . . .
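In published CrowdTruth work, this sentence-relation score is computed as the cosine similarity between the sentence vector and the unit vector of the relation, which reduces to the relation’s count divided by the vector norm. A minimal sketch, using the numbers from the slide:

import math

def sentence_relation_score(sentence_vec, relation_index):
    # Cosine between the sentence vector and the unit vector of one relation:
    # how clearly that relation is expressed in the sentence (0..1).
    norm = math.sqrt(sum(v * v for v in sentence_vec))
    return sentence_vec[relation_index] / norm if norm else 0.0

# Sentence vector from the slide: causes=10, treats=0, partof=3, symptom=8 (15 workers).
vec = [10, 0, 3, 8]
for i, rel in enumerate(["causes", "treats", "partof", "symptom"]):
    print(rel, round(sentence_relation_score(vec, i), 2))
# causes scores high (clearly expressed), treats scores 0 (not expressed at all).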
36.
Disagreement is a property of the entire system!
How to model CrowdTruth?
http://crowdtruth.org
Sentence Quality
Relation Quality
Worker Quality
43.
measures the agreement between a worker & all sentences, weighted by sentence & relation quality
[Figure: AVG (Cosine) ∀ sentences between the worker’s sentence vectors (e.g. 0 1 1 0 | 0 1 1 0 | 0 1 1 0) and the corresponding aggregate sentence vectors (0 1 1 0 | 0 4 3 0 | 0 5 1 0), weighted by relation quality and by sentence quality]
Worker-Sentence Agreement
http://crowdtruth.org
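A sketch of the worker-sentence agreement as described on this slide: the average cosine between the worker’s vector and the sentence vector, over the sentences the worker annotated, weighted by sentence quality. The published metric also folds in relation quality and excludes the worker’s own votes from the sentence vector; both are omitted here for brevity, and the data below are hypothetical.

import math

def cosine(u, v):
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv) if nu and nv else 0.0

def worker_sentence_agreement(worker_vecs, sentence_vecs, sentence_quality):
    # Quality-weighted average cosine between this worker's vector and the
    # aggregate sentence vector, over all sentences the worker annotated.
    num = sum(sentence_quality[s] * cosine(worker_vecs[s], sentence_vecs[s])
              for s in worker_vecs)
    den = sum(sentence_quality[s] for s in worker_vecs)
    return num / den if den else 0.0

# Hypothetical worker who annotated two sentences (4 candidate relations each).
worker_vecs = {"s1": [0, 1, 1, 0], "s2": [0, 1, 1, 0]}
sentence_vecs = {"s1": [0, 4, 3, 0], "s2": [0, 5, 1, 0]}
sentence_quality = {"s1": 0.9, "s2": 0.6}
print(round(worker_sentence_agreement(worker_vecs, sentence_vecs, sentence_quality), 2))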
44.
• measures the conditional probability that if a
worker picks the relation in a sentence, then
another worker will also pick it
• weighted by sentence & worker quality
Relation Quality
http://crowdtruth.org
Relation Quality(r) = P(worker i annotates relation r in sentence s | worker j annotates relation r in sentence s), ∀ workers i, j, ∀ sentences s
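A sketch of the definition above as an unweighted estimate: over all sentences and all ordered worker pairs, the fraction of cases where worker i also picked relation r given that worker j picked it. The full metric additionally weights each term by sentence and worker quality; the annotation data below are hypothetical.

from itertools import permutations

def relation_quality(annotations, relation):
    # annotations[s][w] is the set of relations worker w picked in sentence s.
    # Estimates P(i picks `relation` in s | j picks `relation` in s) over all
    # ordered worker pairs (i, j) and all sentences s.
    hits, total = 0, 0
    for s, by_worker in annotations.items():
        for wi, wj in permutations(by_worker, 2):
            if relation in by_worker[wj]:          # condition: j picked r in s
                total += 1
                if relation in by_worker[wi]:      # did i also pick r in s?
                    hits += 1
    return hits / total if total else 0.0

# Hypothetical annotations: 3 workers on 2 sentences.
annotations = {
    "s1": {"w1": {"causes"}, "w2": {"causes", "symptom"}, "w3": {"none"}},
    "s2": {"w1": {"treats"}, "w2": {"causes"}, "w3": {"causes"}},
}
print(round(relation_quality(annotations, "causes"), 2))   # -> 0.5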
46.
• Goals:
○ collect a medical relation extraction gold standard
○ improve the performance of a relation extraction classifier
• Approach:
○ crowdsource 900 medical sentences
○ measure disagreement with CrowdTruth metrics
○ train & evaluate a classifier with the CrowdTruth score
CrowdTruth for medical relation extraction in practice
http://crowdtruth.org
47.
Medical RelEx Classifier Trained with Expert vs.
CrowdTruth Sentence-Relation Score
0.642, p = 0.016
0.638
http://crowdtruth.org
48.
crowd provides training data that is at least as good as, if not better than, experts
http://crowdtruth.org
49.
# of Workers: Impact on Sentence-Relation Score
http://crowdtruth.org
50.
# of Workers: Impact on RelEx Model Performance
only 54 sentences had 15 or more workers
http://crowdtruth.org
51.
• crowd performs just as well as medical experts
• crowd is also cheaper
• crowd is always available
• using only a few annotators for ground truth is faulty
• a minimum of 10 workers per sentence is needed for the highest quality annotations
• CrowdTruth = a solution to the clinical NLP challenge: the lack of ground truth for training & benchmarking
In summary: CrowdTruth for Medical Relation Extraction
http://crowdtruth.org
52.
• Annotator disagreement is signal, not noise.
• It is indicative of the variation in human semantic interpretation of signs.
• It can indicate ambiguity, vagueness, similarity, over-generality, etc., as well as quality.
CrowdTruth
http://CrowdTruth.org