CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session
1. Lora Aroyo Crowd Truth for Cognitive Computing Chris Welty
gathering gold standard annotations for relation extraction
Crowd Truth: Harnessing Disagreement in Crowdsourcing
2.
• Open Domain Question-Answering Machine, that given
– Rich Natural Language Questions
– Over a Broad Domain of Knowledge
• Won a 2-game Jeopardy match against the all-time winners
– viewed by over 50,000,000
3. Cognitive Computing
EXPANDS human cognition and makes the jobs we do easier,
like a cognitive prosthesis, especially when processing
massive data or data that requires human interpretation.
LEARNS as you use it: most machine errors are easy for a
human to detect, and we can instrument usage of systems to
better understand the system and the problem it solves.
INTERACTS naturally. We need to bring machines closer to
their users; we have adapted ourselves enough to them. They should
understand natural language, spoken or written, and be able to process
images and videos. These simple human problems are extremely
complex for machines, but they are hallmarks of a new computing era.
4. Watson MD
• Adapt Watson to Medical QA
• Mainly an NLP task
• Cognitive computing systems need human-annotated data for training,
testing and evaluation
• The human annotation task is one of semantic interpretation
Now answering medical questions!
5. NLP Tasks
Gadolinium agents are useful for patients with renal
impairment, but in patients with severe renal failure
requiring dialysis it presents a risk of nephrogenic
systemic fibrosis.
Mention detection: find the spans (begin, end) of relevant medical
terms (factors) in a passage.
Factor Typing: find the type of each mention
[Figure: the passage with its mentions labeled by type (NER): substance, disorder, disorder, disorder, treatment]
6. NLP Tasks
Gadolinium agents are useful for patients with renal
impairment, but in patients with severe renal failure
requiring dialysis it presents a risk of nephrogenic
systemic fibrosis.
Mention detection: find the spans (begin, end) of relevant medical
terms (factors) in a passage.
Factor Typing: find the type of each mention
Factor (Entity) Identification: find the corresponding ids for a
mentioned factor in a knowledge-base
[Figure: knowledge-base ids attached to the mentions: C0016911, C1408325, C0035078, C1619692, C0019004]
7. NLP Tasks
Gadolinium agents are useful for patients with renal
impairment, but in patients with severe renal failure
requiring dialysis it presents a risk of nephrogenic
systemic fibrosis.
Mention detection: find the spans (begin, end) of relevant medical
terms (factors) in a passage.
Factor Typing: find the type of each mention
Factor (Entity) Identification: find the corresponding ids for a
mentioned factor in a knowledge-base
Relation detection: find relations that are expressed in a passage
between factors.
[Figure: relation arcs between the mentions, labeled: cause, treats, treats, contraindicates]
8. NLP Tasks
Gadolinium agents are useful for patients with renal
impairment, but in patients with severe renal failure
requiring dialysis it presents a risk of nephrogenic
systemic fibrosis.
Mention detection: find the spans (begin, end) of relevant medical
terms (factors) in a passage.
Factor Typing: find the type of each mention
Factor (Entity) Identification: find the corresponding ids for a
mentioned factor in a knowledge-base
Relation detection: find relations that are expressed in a passage
between factors.
Coreference: Find the mentions in a sentence that refer to the same
factor.
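As a rough illustration of mention detection, the sketch below represents each mention as a (begin, end) character span over the example passage. The term list is a hypothetical stand-in for a real mention detector, and `find_spans` is a name invented for this example; nothing here is taken from the Watson pipeline.

```python
passage = (
    "Gadolinium agents are useful for patients with renal impairment, "
    "but in patients with severe renal failure requiring dialysis it "
    "presents a risk of nephrogenic systemic fibrosis."
)

# Hypothetical term list; a real system would detect these automatically.
terms = [
    "Gadolinium agents",
    "renal impairment",
    "severe renal failure",
    "dialysis",
    "nephrogenic systemic fibrosis",
]


def find_spans(text, terms):
    """Return (begin, end, surface) for the first occurrence of each term."""
    spans = []
    for term in terms:
        begin = text.find(term)
        if begin != -1:
            spans.append((begin, begin + len(term), term))
    return sorted(spans)


mentions = find_spans(passage, terms)
for begin, end, surface in mentions:
    print(begin, end, surface)
```

Each span can later carry a type (factor typing) and a knowledge-base id (factor identification) without changing the span representation.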
9. Gold Standard Assumption
• Cognitive systems need to be told what is right & what is wrong
• A gold standard or ground truth
• Performance is measured on test sets vetted by human experts
→ never perfect, always improving against test data
• Historically, gold standards are created assuming that for each
annotated instance there is a single right answer
• Gold standard quality is measured in inter-annotator
agreement → does not account for perspectives, for
reasonable alternative interpretations
11. Disagreement
Gadolinium agents are useful for patients with renal
impairment, but in patients with severe renal failure
requiring dialysis there is a risk of nephrogenic
systemic fibrosis.
Annotation shown on the slide: cause
12. Disagreement
Gadolinium agents are useful for patients with renal
impairment, but in patients with severe renal failure
requiring dialysis there is a risk of nephrogenic
systemic fibrosis.
Annotation shown on the slide: side-effect
The human annotation task is one of semantic interpretation
13. Why do people disagree?
[Figure: diagram relating Sentence, Relation and Worker]
14. Key Question
How do we represent &
measure disagreement in a
way that it can be harnessed?
15. Why do people disagree?
[Figure: Triangle of Reference, with corners Sign, Referent and Observer]
17. Crowd Truth
Annotator disagreement is signal, not
noise.
It is indicative of the variation in
human semantic interpretation of
signs, and can indicate ambiguity,
vagueness, over-generality, etc.
http://www.freefoto.com/preview/01-47-44/Flock-of-Birds
18. Approach Principles
1. understand the range of disagreements by creating a
space of possibilities with frequencies & similarities
2. tolerate, capture & exploit disagreement
3. score machine output based on where it falls in this space
4. adaptable to new annotation tasks
Flickr: auroille
19. Crowd Watson
• Crowdsourcing gold standard data for
o Training Watson in the medical domain, as well as for event extraction,
image annotation, video tagging and summarization
• Crowdsourcing for Domain Adaptation
o How to rapidly acquire knowledge for new domains
• Platforms
o CrowdFlower, Amazon Mechanical Turk
o Crowdsourcing Games with a Purpose, e.g. Dr. Watson, Waisda?
20. Relation Extraction
crowdsourcing gold standard data
Relations overlap in meaning
Sentences are vague and ambiguous
Experts have different interpretations
21.
In distant supervision we take arguments that are known to
be related by a target relation in a knowledge base and we find
all sentences in a corpus that mention both arguments.
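A minimal sketch of that sentence-selection step, assuming a toy knowledge base of (argument, relation, argument) triples and plain substring matching. The triple, the two corpus sentences, and the function name `distant_supervision` are illustrative only; a real pipeline would match normalized entity mentions rather than raw strings.

```python
# Toy knowledge base of (arg1, relation, arg2) triples, for illustration only.
kb = [("gadolinium", "causes", "nephrogenic systemic fibrosis")]

# Toy corpus, for illustration only.
corpus = [
    "Gadolinium presents a risk of nephrogenic systemic fibrosis.",
    "Gadolinium agents are used as contrast media.",
]


def distant_supervision(kb, corpus):
    """Pair every sentence that mentions both arguments with the KB relation."""
    examples = []
    for arg1, relation, arg2 in kb:
        for sentence in corpus:
            lowered = sentence.lower()
            if arg1.lower() in lowered and arg2.lower() in lowered:
                examples.append((sentence, arg1, relation, arg2))
    return examples


candidates = distant_supervision(kb, corpus)
for sentence, arg1, relation, arg2 in candidates:
    print(relation, "|", sentence)
```

Only the first sentence is selected, because the second mentions just one of the two arguments. Whether a selected sentence actually expresses the relation is exactly what the crowd is then asked to judge.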
25. Disagreement for Sentence Clarity
Feeling the way the CHEST expands (PALPATION), can identify areas of
the lung that are full of fluid.
Is CHEST related to PALPATION?
Worker votes per relation: 0 0 0 2 3 0 0 0 1 0 0 4 4 1
Relations that received votes: diagnose, location, associated with, is_a, other, part_of
Unclear relationship between the two arguments,
reflected in the disagreement
26. Disagreement for Sentence Clarity
Redness (HYPERAEMIA), irritation (chemosis) and watering (epiphora)
of the eyes are symptoms common to all forms of CONJUNCTIVITIS.
Is HYPERAEMIA related to CONJUNCTIVITIS?
Worker votes per relation: 0 0 0 1 0 0 0 0 13 0 0 0 0 0
Relations that received votes: symptom, cause
Clearly expressed relation between the two
arguments, reflected in the agreement
27. Sentence-Relation Score
Measures how clearly a sentence
expresses a relation
Sentence vector: 0 1 1 0 0 4 3 0 0 5 1 0
Cosine of the sentence vector with the unit vector for relation R6 = .55
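The .55 on this slide can be reproduced with a few lines of code. The sketch below assumes R6 is the sixth entry of the sentence vector (index 5); the helper names `cosine` and `unit_vector` are chosen for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def unit_vector(index, size):
    """Vector with a single 1 at the position of the chosen relation."""
    return [1 if i == index else 0 for i in range(size)]


# Sentence vector from the slide: number of workers who picked each relation.
sentence_vector = [0, 1, 1, 0, 0, 4, 3, 0, 0, 5, 1, 0]

# Assumption: R6 is the sixth entry (index 5).
r6 = unit_vector(5, len(sentence_vector))
score = cosine(r6, sentence_vector)
print(round(score, 2))  # 0.55
```

Here the score is 4 / sqrt(53): four workers chose R6, but the votes are spread over six relations, so the sentence expresses R6 only moderately clearly.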
28. Worker Disagreement
Measured per worker
Worker-sentence disagreement: AVG (Cosine) between the worker's
sentence vector and the sentence vector
Vector shown on the slide: 0 1 1 0 0 4 3 0 0 5 1 0
29. Crowd Truth Metrics
Relation Extraction
Three parts to understand human interpretations:
§ Sentence
• How good is a sentence for the relation extraction task?
§ Workers
• How well does a worker understand the sentence?
§ Relations
• Is the meaning of the relation clear?
• How ambiguous/confusable is it?
30. Crowd Truth Metrics
Based on the Triangle of Reference
Three parts to understand human interpretations:
§ Sign
• How good is a sign for conveying information?
§ People
• How well does a person understand the sign?
§ Ontology
• Are the distinctions of the ontology clear?
• How ambiguous/confusable are they?
31. The Dark Side of Crowdsourcing Disagreement
• spammers generate disagreement for the wrong reasons
• most spam detection requires gold standard
• Worker-sentence disagreement: the average of all the cosines between
each worker’s sentence vector and the full sentence vector (minus that
worker). Indicates how much a worker disagrees with the crowd on a
sentence basis
• Worker-worker disagreement: a pairwise confusion matrix between workers
and the average agreement across the matrix for each worker. Indicates
whether there are consistently like-minded workers
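A minimal sketch of the worker-sentence measure as described above, assuming annotations are stored as binary vectors per worker and sentence. The data layout and the name `worker_sentence_score` are assumptions of this example. The value computed is the average cosine itself, so a low value means the worker disagrees more with the crowd.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def worker_sentence_score(annotations, worker):
    """Average cosine between a worker's vector and the rest of the crowd.

    annotations: {sentence_id: {worker_id: [0 or 1 per relation]}}
    """
    cosines = []
    for workers in annotations.values():
        if worker not in workers:
            continue
        own = workers[worker]
        # Sentence vector minus this worker: sum of everyone else's vectors.
        rest = [0] * len(own)
        for other, vec in workers.items():
            if other != worker:
                rest = [r + x for r, x in zip(rest, vec)]
        cosines.append(cosine(own, rest))
    return sum(cosines) / len(cosines) if cosines else 0.0


# Toy data: three workers, one sentence, three relations.
annotations = {
    "s1": {"w1": [1, 0, 0], "w2": [1, 0, 0], "w3": [0, 0, 1]},
}
scores = {w: worker_sentence_score(annotations, w) for w in ("w1", "w2", "w3")}
print(scores)
```

In the toy data, w3 picks a relation nobody else chose and gets a score of 0, while w1 and w2 each score about 0.71.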
32. Harnessing Disagreement
• Sentence-relation score: measured for each relation on each sentence as the cosine of
the unit vector for the relation with the sentence vector
• Sentence clarity: for each sentence, the max relation score for that sentence. If all the
workers selected the same relation for a sentence, the max score is 1, indicating a
clear sentence
• Relation similarity: pairwise conditional probability that if relation Ri is annotated in a
sentence, then Rj is as well. Indicates how confusable the linguistic expressions of two
relations are
• Relation ambiguity: max relation similarity for a relation. If a relation is clear, the score is
low
• Relation clarity: max sentence-relation score for a relation over all sentences. If a
relation has a high clarity score, it means that it is at least possible to express the
relation clearly
• Worker quality: avg. cosine of the worker vector with the sentence vector over all sentences the
worker annotated
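The sentence and relation metrics above can be sketched as follows. The three toy sentence vectors and all function names are assumptions of this example, and relation similarity is computed at sentence level (a relation counts as annotated in a sentence if at least one worker chose it), which is one possible reading of the definition.

```python
import math


def sentence_relation_score(vec, r):
    """Cosine of the sentence vector with the unit vector for relation r.

    With a unit vector this reduces to vec[r] / ||vec||.
    """
    norm = math.sqrt(sum(x * x for x in vec))
    return vec[r] / norm if norm else 0.0


def sentence_clarity(vec):
    """Max sentence-relation score over all relations."""
    return max(sentence_relation_score(vec, r) for r in range(len(vec)))


def relation_similarity(vectors, i, j):
    """Estimate P(Rj annotated | Ri annotated) over sentences."""
    with_i = [v for v in vectors if v[i] > 0]
    if not with_i:
        return 0.0
    return sum(1 for v in with_i if v[j] > 0) / len(with_i)


def relation_ambiguity(vectors, i):
    """Max similarity of relation i to any other relation."""
    size = len(vectors[0])
    return max(relation_similarity(vectors, i, j) for j in range(size) if j != i)


def relation_clarity(vectors, r):
    """Max sentence-relation score for relation r over all sentences."""
    return max(sentence_relation_score(v, r) for v in vectors)


# Toy sentence vectors over three relations (votes per relation).
vectors = [
    [13, 1, 0],
    [0, 4, 4],
    [5, 0, 0],
]

print(sentence_clarity(vectors[2]))    # all workers agree
print(relation_ambiguity(vectors, 2))  # R3 always co-occurs with R2
print(relation_clarity(vectors, 0))
```

The third sentence has clarity 1.0 because every worker picked the same relation, while the second sentence splits evenly between two relations and scores about 0.71.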
33. Disagreement metrics
• Diverging opinions cluster around the most
plausible options.
• Identify workers who systematically disagree
1. With the opinion of the majority (worker-sentence disag)
o Compare worker opinion with that of the majority
2. With the rest of their co-workers (worker-worker disag)
o Workers with the same opinion as worker W.
3. + Avg. number of relations / sentence
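A rough sketch of the second and third signals in the list above. The slides do not spell out how pairwise worker agreement is computed, so this example assumes it is the cosine between two workers' vectors on shared sentences, averaged per co-worker and then across co-workers. The data layout and function names are also assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def avg_relations_per_sentence(annotations, worker):
    """Average number of relations the worker selected per sentence."""
    counts = [sum(ws[worker]) for ws in annotations.values() if worker in ws]
    return sum(counts) / len(counts) if counts else 0.0


def worker_worker_agreement(annotations, worker):
    """Mean pairwise agreement of a worker with each co-worker (assumed: cosine)."""
    per_coworker = {}
    for workers in annotations.values():
        if worker not in workers:
            continue
        for other, vec in workers.items():
            if other != worker:
                per_coworker.setdefault(other, []).append(
                    cosine(workers[worker], vec)
                )
    if not per_coworker:
        return 0.0
    means = [sum(c) / len(c) for c in per_coworker.values()]
    return sum(means) / len(means)


# Toy data: w3 selects every relation on every sentence.
annotations = {
    "s1": {"w1": [1, 0, 0], "w2": [1, 0, 0], "w3": [1, 1, 1]},
    "s2": {"w1": [0, 1, 0], "w3": [1, 1, 1]},
}

for w in ("w1", "w2", "w3"):
    print(w, avg_relations_per_sentence(annotations, w),
          round(worker_worker_agreement(annotations, w), 3))
```

In the toy data, w3 selects all three relations each time, which shows up both as a high average number of relations per sentence and as the lowest agreement with co-workers.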
39. Conclusions
• Crowd Truth can help us understand the
diversity of interpretations
• with adequate representation & metrics
• dispense with the “one correct answer” assumption
• Disagreement metrics can be augmented by
content filters for better spam detection
• explanations by workers can be useful
40. The Crew
• Lora Aroyo (VU)
• Chris Welty (IBM)
• Guillermo Soberon (VU)
• Hui Lin (IBM)
• Anca Dumitrache (VU)
• Oana Inel (VU)
• Manfred Overmeen (IBM)
• Robert-Jan Sips (IBM)
42. Questions?
43. Accuracy pred. low quality (1)
44. Accuracy pred. low quality (2)
45. Spamming scenarios
Dev.
• 12 spammers / 110 workers
• 139 "spammed" sentences out of 1302 (11%)
• 100% accuracy spam detection
Test
• 20 spammers / 93 workers
• 386 "spammed" sentences out of 1291 (30%)
• 89% accuracy (10 spammers missed)
Can we do better?
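The percentages on this slide can be re-derived from the raw counts. The accuracy line assumes that accuracy means correctly classified workers over all workers and that there were no false positives; the slide does not state either, so treat that line as a plausibility check only.

```python
def percent(part, whole):
    """Share of part in whole, as a percentage."""
    return 100.0 * part / whole


dev_spammed = percent(139, 1302)   # "spammed" sentences in the dev set
test_spammed = percent(386, 1291)  # "spammed" sentences in the test set

# Assumption: 10 missed spammers are the only errors among the 93 workers.
test_accuracy = percent(93 - 10, 93)

print(round(dev_spammed), round(test_spammed), round(test_accuracy))  # 11 30 89
```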
46. Data collected
• Annotations
o 12 relations + OTH / NON
o Behaviour with respect to the crowd
[Figure: the annotations feed the disagreement filters]
47. Data collected
• Annotations
o 12 relations + OTH / NON
o Behaviour with respect to the crowd
• Explanations
o Selected Words (justify the choice)
o Explanation (for OTHER or NONE)
o Individual behaviour patterns
[Figure: the annotations feed the disagreement filters; the explanations feed the explanation filters]
49. Explanations analysis
Four patterns in worker behaviour indicating
spam:
o No Valid Words were used for the text
o Using the same text for all the annotations
o Using the same text for both "Selected words" and
"Explanation"
o Poor understanding of (not following) the task
instructions:
§ Selecting "None" and "Other" in combination
with other relations
§ Including explanations when they are not required.
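The four patterns above could be checked with simple filters like the ones sketched below. Everything here is an assumption made for illustration: the response layout (a dict per annotation), the flag names, and in particular the reading of "valid words" as words that occur in the annotated sentence.

```python
import re

SPECIAL = {"NONE", "OTHER"}


def tokens(text):
    """Lower-cased word tokens of a text."""
    return set(re.findall(r"\w+", text.lower()))


def no_valid_words(r):
    # Assumption: a valid word is one that occurs in the sentence itself.
    return not (tokens(r["selected_words"]) & tokens(r["sentence"]))


def same_text_for_all(responses):
    texts = {r["selected_words"].strip().lower() for r in responses}
    return len(responses) > 1 and len(texts) == 1


def same_words_and_explanation(r):
    words = r["selected_words"].strip().lower()
    return bool(words) and words == r["explanation"].strip().lower()


def ignores_instructions(r):
    chosen = set(r["relations"])
    special = chosen & SPECIAL
    mixes = bool(special) and bool(chosen - SPECIAL)
    unneeded_explanation = bool(r["explanation"].strip()) and not special
    return mixes or unneeded_explanation


def spam_flags(responses):
    """One boolean per pattern for all responses of a single worker."""
    return {
        "no_valid_words": any(no_valid_words(r) for r in responses),
        "same_text_for_all": same_text_for_all(responses),
        "same_words_and_explanation": any(
            same_words_and_explanation(r) for r in responses
        ),
        "ignores_instructions": any(ignores_instructions(r) for r in responses),
    }


sentence = "Redness of the eyes is a symptom of conjunctivitis."
good = [{"sentence": sentence, "relations": ["symptom"],
         "selected_words": "symptom of", "explanation": ""}]
spam = [{"sentence": sentence, "relations": ["NONE", "cause"],
         "selected_words": "asdf", "explanation": "asdf"}] * 2

print(spam_flags(good))
print(spam_flags(spam))
```

The careful worker raises no flag, while the toy spammer raises all four. In practice each flag would be one input among several rather than a verdict on its own.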
50. Spam patterns analysis
                            None / Other   Rep. Response   Rep. Text   No Valid Words
Spam candidates             22             8               14          12
Overlap with disagreement   18%            37%             36%         42%
30 unique workers were identified ONLY
by the Explanation filters as possible low-quality
workers.
51. Spam patterns analysis
Explanation Filters ⊄ Disagreement metrics
52. Results
• Linear combination of Disagreement metrics
+ Explanation filters
o "No Valid Words" and Avg. Num Relations / sentence are given a
bit more weight than the rest
• Results
o 95% accuracy and .88 F1 score
o 16 spammers out of 20
• Previously, only with disagreement metrics:
o 88% Accuracy, .66 F1 score
o 10 spammers out of 20
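A minimal sketch of what such a linear combination could look like, together with the F1 computation. The feature names, feature values and weights are hypothetical (the slides only say that two signals get "a bit more weight"), and the false-positive count passed to `f1_score` is a placeholder because the slides do not report it.

```python
def spam_score(features, weights):
    """Weighted sum of per-worker signals; higher means more spam-like."""
    return sum(weights[name] * value for name, value in features.items())


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical weights: two signals weighted slightly higher than the rest.
weights = {
    "worker_sentence_disagreement": 1.0,
    "worker_worker_disagreement": 1.0,
    "avg_relations_per_sentence": 1.5,
    "no_valid_words": 1.5,
    "repeated_text": 1.0,
}

# Hypothetical, already normalized signals for one worker.
features = {
    "worker_sentence_disagreement": 0.8,
    "worker_worker_disagreement": 0.6,
    "avg_relations_per_sentence": 0.9,
    "no_valid_words": 1.0,
    "repeated_text": 0.0,
}

score = spam_score(features, weights)
print(score)

# 16 of 20 spammers found; the false-positive count (0) is a placeholder.
print(f1_score(tp=16, fp=0, fn=4))
```

A worker would be flagged when the score crosses a threshold tuned on the development set. The threshold and the weights are exactly the parts this sketch leaves open.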