Truth is a Lie: 7 Myths about Human Annotation @CogComputing Forum 2014


Published on

Big data is having a disruptive impact across the sciences. Human annotation of semantic interpretation tasks is a critical part of big data semantics, but it is based on an antiquated ideal of a single correct truth that needs to be similarly disrupted. We expose seven myths about human annotation, most of which derive from that antiquated ideal of truth, and dispel these myths with examples from our research. We propose a new theory of truth, CrowdTruth, based on the intuition that human interpretation is subjective, and that measuring annotations on the same objects of interpretation (in our examples, sentences) across a crowd will provide a useful representation of their subjectivity and the range of reasonable interpretations.
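
To make that intuition concrete, here is a minimal sketch (in Python, with hypothetical data and helper names; this is not the CrowdTruth framework's API) of keeping the full distribution of crowd answers per sentence instead of collapsing them to a single label:

```python
# Minimal sketch: represent each sentence's crowd annotations as a distribution
# over labels instead of a single "true" label. Data and names are hypothetical.
from collections import Counter

# Hypothetical annotations: sentence id -> labels chosen by individual workers
annotations = {
    "s1": ["TREAT", "TREAT", "TREAT", "TREAT", "OTHER"],
    "s2": ["TREAT", "OTHER", "PREVENT", "TREAT", "OTHER"],
}

def sentence_vector(labels):
    """Fraction of workers choosing each label: the range of interpretations."""
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}

for sid, labels in annotations.items():
    vec = sentence_vector(labels)
    majority = max(vec, key=vec.get)
    # A single-truth pipeline would keep only `majority`; the vector also keeps
    # the disagreement, e.g. "s2" is far less clear-cut than "s1".
    print(sid, vec, "majority:", majority)
```

The vectors preserve exactly the information a majority-vote gold standard discards: how contested each interpretation is.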

Published in: Technology
2 Comments
  • The points about the limitations of the gold standard and the option of CrowdTruth are interesting... However, I need to understand better what forms of recording disagreements would be practical and useful. How would one come up with limited sets of disagreement, and if disagreements are recorded with an open-world assumption, what labels (descriptive terms) will annotators use that capture different degrees/levels of agreement in different contexts, so that annotations will still be useful for machine processing? I am reminded of the challenge in mapping objects (degrees of similarity, or what I called 'semantic proximity', which also incorporated the concept of the context in which assertions are made or agreement is recorded) when two objects are not equal, one is not a subset of another, etc. (similar problems to the gross insufficiency of binary options, and so on).
  • Yes, yes, yes, Lora... excellent presentation. This is what I have been doing with ImageSnippets, too.

Truth is a Lie: 7 Myths about Human Annotation @CogComputing Forum 2014

  1. Truth is a Lie. CrowdTruth: The 7 Myths of Human Annotation. Lora Aroyo
  2. Take Home Message: human annotation of semantic interpretation tasks is a critical part of cognitive systems engineering; standard practice is based on an antiquated ideal of a single correct truth; the talk presents 7 myths of human annotation and a new theory of truth: CrowdTruth
  3. I amar prestar aen... ("The world is changed") • the amount of data & the scale of computation available have increased by a previously inconceivable amount • CS & AI moved out of thought problems to empirical science • current methods pre-date this fundamental shift • the ideal of "one truth" is a lie • crowdsourcing & semantics together correct the fallacy and improve analytic systems. The world has changed: there is a need to form a new theory of truth appropriate to cognitive systems
  4. Semantic Interpretation: semantic interpretation is needed in all sciences – data abstracted into categories – patterns, correlations, associations & implications are extracted. Cognitive Computing: providing some way of scalable semantic interpretation
  5. Traditional Human Annotation: • humans analyze examples: annotations for ground truth = the correct output for each example • machines learn from the examples • ground truth quality is measured by inter-annotator agreement, founded on the ideal of a single, universally constant truth: high agreement = high quality, and disagreement must be eliminated. Current gold standard acquisition & quality evaluation are outdated
  6. Need for Change: • Cognitive Computing increases the need for machines to handle the scale of data • this results in an increasing need for new gold standards able to measure machine performance on tasks that require semantic interpretation. The New Ground Truth is CrowdTruth
  7. The 7 Myths: • One truth: data collection efforts assume one correct interpretation for every example • All examples are created equal: ground truth treats all examples the same, either matching the correct result or not • Detailed guidelines help: if examples cause disagreement, add instructions to limit interpretations • Disagreement is bad: increase the quality of annotation data by reducing disagreement among the annotators • One is enough: most of the annotated examples are evaluated by one person • Experts are better: annotators with domain knowledge provide better annotations • Once done, forever valid: annotations are not updated and new data is not aligned with previous data. These myths directly influence the practice of collecting human-annotated data; they need to be revisited in the context of a changing world & in the face of a new theory of truth (CrowdTruth)
  8. Myth 1: One Truth. What if there are MORE? Current ground truth collection efforts assume one correct interpretation for every example; this ideal of truth is a fallacy for semantic interpretation and needs to be changed
  9. Which mood is most appropriate for each song? Choose one of: Cluster 1 (passionate, rousing, confident, boisterous, rowdy), Cluster 2 (rollicking, cheerful, fun, sweet, amiable, good-natured), Cluster 3 (literate, poignant, wistful, bittersweet, autumnal, brooding), Cluster 4 (humorous, silly, campy, quirky, whimsical, witty, wry), Cluster 5 (aggressive, fiery, tense, anxious, intense, volatile, visceral), or Other (does not fit into any of the 5 clusters). One truth? (Lee and Hu 2012)
  10. Myth 2: All Examples Are Created Equal. What if they are DIFFERENT? • typically annotators are asked whether a binary property holds for each example • they are often not given a chance to say that the property may partially hold, or holds but is not clearly expressed • the mathematics of using ground truth treats every example the same: either it matches the correct result or not • poor quality examples tend to generate high disagreement. Disagreement allows us to weight sentences, i.e. the ability to train & evaluate a machine more flexibly (a sketch of such weighting follows the transcript)
  11. Is the TREAT relation expressed between the highlighted terms? "ANTIBIOTICS are the first line treatment for indications of TYPHUS." (clearly treats) "With ANTIBIOTICS in short supply, DDT was used during World War II to control the insect vectors of TYPHUS." (less clearly treats) Equal training data? Disagreement can indicate the vagueness & ambiguity of sentences
  12. Myth 3: Detailed Guidelines Help. What if they HURT? • Perfuming agreement scores by forcing annotators to make choices they may think are not valid • low annotator agreement is addressed by detailed guidelines for annotators to consistently handle the cases that generate disagreement • this removes potential signal on examples that are ambiguous. Precise annotation guidelines do eliminate disagreement, but they do not increase quality
  13. Which mood cluster is most appropriate for a song? Instructions: "Your task is to listen to the following 30 second music clips and select the most appropriate mood cluster that represents the mood of the music. Try to think about the mood carried by the music and please try to ignore any lyrics. If you feel the music does not fit into any of the 5 clusters please select 'Other'. The descriptions of the clusters are provided in the panel at the top of the page for your reference. Answer the questions carefully. Your work will not be accepted if your answers are inconsistent and/or incomplete." Do restricting guidelines help? Disagreement can indicate problems with the task (Lee and Hu 2012)
  14. Myth 4: Disagreement is Bad. What if it is GOOD? • rather than accepting disagreement as a natural property of semantic interpretation • traditionally, disagreement is considered a measure of poor quality, because the task is poorly defined or the annotators lack training • this makes the elimination of disagreement the GOAL
  15. Does each sentence express the TREAT relation? "ANTIBIOTICS are the first line treatment for indications of TYPHUS." → agreement 95%. "Patients with TYPHUS who were given ANTIBIOTICS exhibited side-effects." → agreement 80%. "With ANTIBIOTICS in short supply, DDT was used during WWII to control the insect vectors of TYPHUS." → agreement 50%. Is disagreement bad? Disagreement can reflect the degree of clarity in a sentence
  16. Myth 5: One is Enough. What if it is NOT ENOUGH? • over 90% of annotated examples are seen by 1-2 annotators • only a small number overlap, in order to measure agreement • five or six popular interpretations can't be captured by one or two people
  17. One Quality? Accumulated results for each relation across all the sentences show that 20 workers/sentence (and higher) yields the same relative disagreement
  18. Myth 6: Experts Are Better. What if the CROWD IS BETTER? • conventional wisdom: human annotators with domain knowledge provide better annotated data, e.g. medical texts should be annotated by medical experts • but experts are expensive & don't scale • multiple perspectives on data can be useful, beyond what experts believe is salient or correct
  19. What is the (medical) relation between the highlighted (medical) terms? • 91% of expert annotations are covered by the crowd • expert annotators reach agreement in only 30% of cases • the most popular crowd vote covers 95% of this expert annotation agreement. Are experts better than the crowd?
  20. Myth 7: Once Done, Forever Valid. What if VALIDITY CHANGES? • perspectives change over time: old training data might contain examples that are not valid, or only partially valid, later • continuous collection of training data over time allows the adaptation of gold standards to changing times, e.g. the popularity of music or levels of education
  21. Which are mentions of terrorists in this sentence? "OSAMA BIN LADEN used money from his own construction company to support the MUHAJADEEN in Afghanistan against Soviet forces." Forever valid? 1990: hero; 2011: terrorist. Both types should be valid: two roles for the same entity, and adaptation of gold standards to changing times
  22. crowdtruth.org (image: Jean-Marc Côté, 1899)
  23. • annotator disagreement is signal, not noise • it is indicative of the variation in human semantic interpretation of signs • it can indicate ambiguity, vagueness, similarity, over-generality, as well as quality. crowdtruth.org
  24. crowdtruth.org
  25. The Team 2013: http://crowd-watson.nl
  26. The Crew 2014
  27. The (almost complete) Team 2014
  28. lora-aroyo.org slideshare.com/laroyo @laroyo crowdtruth.org
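
The sentence-weighting idea referenced on slides 10 and 15 can be sketched in a few lines. The vote counts and the agreement helper below are hypothetical illustrations chosen to echo the 95% / 80% / 50% agreement levels on slide 15; they are not the CrowdTruth framework's own metrics:

```python
# Minimal sketch: use per-sentence agreement on a binary question as a training
# weight instead of discarding examples that annotators disagree on.

def agreement(votes):
    """Fraction of workers giving the majority answer to a yes/no question."""
    yes = sum(votes)
    return max(yes, len(votes) - yes) / len(votes)

# Hypothetical crowd votes (1 = "expresses TREAT", 0 = "does not"), echoing the
# agreement levels shown on slide 15.
sentences = {
    "first line treatment ...":            [1] * 19 + [0],       # ~95%
    "patients ... exhibited side-effects": [1] * 16 + [0] * 4,   # ~80%
    "DDT ... insect vectors":              [1] * 10 + [0] * 10,  # ~50%
}

for text, votes in sentences.items():
    weight = agreement(votes)
    # A learner could scale each example's loss by `weight` (or a clarity score
    # derived from it) so ambiguous sentences count less than clear ones.
    print(f"{text!r}: weight {weight:.2f}")
```

Scaling each example's training loss this way keeps ambiguous sentences in the data while letting clear ones dominate, rather than forcing every example to count as one unambiguous truth.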