Data Excellence:
Better Data for Better AI
ODSC 2020
Lora Aroyo
http://lora-aroyo.org
@laroyo
By Scanned from The Magic of M. C. Escher. (Harry N. Abrams, Inc. ISBN
0-8109-6720-0) by Justin Foote (talk)., Fair use,
https://en.wikipedia.org/w/index.php?curid=3955850
http://lora-aroyo.org @laroyo
TAKE HOME MESSAGE
2
data lifecycle - just like in software - is needed to
guide data research & development practices
data is the compass for AI - AI advances where
there is data
data is at the center - AI systems success
depends on the quality of their data
https://en.wikipedia.org/wiki/Metamorphosis_II
data quality must be addressed in AI practices
- multitude of notions of truth
- necessity for data quality standards
data lifecycle is the backbone for data
excellence tools and practices to stay ahead of
future unintended AI behaviours
http://lora-aroyo.org @laroyo 3
The Rise of the Machines
“AI Winter”
lab experiments
Expert Systems
small scale
experiments
http://lora-aroyo.org @laroyo 4
The Rise of the Machines
“AI Winter” → “AI Breakthroughs in Games”
IBM Watson Jeopardy
DeepMind AlphaGo
beat the humans
http://lora-aroyo.org @laroyo 5
The Rise of the Machines
“AI Winter” → “AI Breakthroughs in Games” → “Real World Tasks”
Health diagnostics
Flue prediction
Weather prediction
Text, Image and Video classification
Text Generation
Text Translation
Conversational AI
support the humans
http://lora-aroyo.org @laroyo 6
Mainstream Deployment of AI
“Real World Tasks” deployed in the wild → Unintended behaviors
Microsoft Tay bot
IBM Watson Oncology
Amazon Rekognition
Google Photos
Apple Face ID
Facebook chat bots
Various Speech Assistants
http://lora-aroyo.org @laroyo 7
getting computers to “see”
the diversity of data
data quality is essential for
guiding AI away from
unintended behaviours
Data is the compass for AI
http://lora-aroyo.org @laroyo 8
The Life of AI Data
“It exists!”
bootstrapping AI with data
Caltech101
LabelMe
Berkley-3D
https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research
http://lora-aroyo.org @laroyo 9
The Life of AI Data
“It exists!” → “It is bigger!”
data hungry AI
ImageNet
SIFT10M
OpenImages
COCO
Web 1T 5-Gram
https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research
http://lora-aroyo.org @laroyo 10
The Life of AI Data
“It exists!” → “It is bigger!” → “It is better!”
but before it got better ...
http://lora-aroyo.org @laroyo 11
The Life of AI Data
“It exists!” → “It is bigger!” → “It is better!”
but before it got better ...
it got worse ...
http://lora-aroyo.org @laroyo 12
Unintended Behaviors in AI
Adapted from “AI in the Open World: Discovering Blind Spots of AI”, SafeAI 2020, Ece Kumar
http://lora-aroyo.org @laroyo 13
The Life of AI Data
“It exists!” → “It is bigger!” → “It is better!”
but before it got better ...
reactive
data improvement
http://lora-aroyo.org @laroyo 14
The Life of AI Data
“It exists!” → “It is bigger!” → “It is better!”
to reach here
we need proactive
data improvement
http://lora-aroyo.org @laroyo 15
The Life of AI Data
Alon Halevy, Peter Norvig, and Fernando Pereira. 2009. The Unreasonable Effectiveness of Data. IEEE Intelligent Systems 24, 2 (2009)
In the decade since then, the research community have done a lot
with quantity, but quality has been left behind
http://lora-aroyo.org @laroyo 16
In the 90’s we introduced standards
to achieve Software reliability
introduced software engineering lifecycle
- requirements, design and testing
established processes for software maintenance
- version control, sharing, documenting
established software quality metrics & processes
Ben Hutchinson, 2020
http://lora-aroyo.org @laroyo 17
Now we need the same for Data
introduce data lifecycle
- requirements, design and testing
establish processes for dataset maintenance
- version control, sharing, documenting
establish data quality metrics & processes
Ben Hutchinson, 2020
http://lora-aroyo.org @laroyo 18
data quality is typically not
caused by software bugs or just
by human errors
dataset are not easy to debug
data quality is typically result of:
- how well a dataset
represent the actual task
- how is the annotation done
- are the quality metrics
adequate
Data Quality is not easy ...
http://lora-aroyo.org @laroyo
it is not easy to give Y/N answer
for most of our AI tasks
19
Do these images depict a GUITAR ?
Data Quality is not only human error
✓
✓ ✓
✘
✘
✘✘✓
✓
http://lora-aroyo.org @laroyo 20
Do these images depict NEW ZEALAND ?
Data Quality should consider context of use
it is not easy to give Y/N answer
for most of our AI tasks
the answer typically depends on
the context, on the task, on the
usage, etc
✓ ✘
✓ ✓ ✘
✘
http://lora-aroyo.org @laroyo 21
Do these images depict a WEDDING ?
Data Quality should include real world diversity
it is not easy to give Y/N answer
for most of our AI tasks
the answer typically depends on
the context, on the task, on the
usage, etc
disagreement is signal for
diversity and should be included
in AI training
✓
✘
✓
✓
✘
✓
http://lora-aroyo.org @laroyo 22
Does the Sentence expresses
Does the sentence express TREATS relation between Chloroquine, Malaria?
Data Quality is difficult even with experts
For prevention of malaria, use only in individuals traveling to malarious
areas where CHLOROQUINE resistant P. falciparum MALARIA
has not been reported.
Rheumatoid arthritis and MALARIA have been treated
with CHLOROQUINE for decades.
Among 56 subjects reporting to a clinic with symptoms of MALARIA
53 (95%) had ordinarily effective levels of CHLOROQUINE in blood.
✓
✘
✓
http://lora-aroyo.org @laroyo
DISAGREEMENT IS SIGNAL
Variety of sources for disagreement
http://lora-aroyo.org @laroyo 24
Does the Sentence expresses
Model of semantic interpretation
TRIANGLE OF MEANING
“Three Sides of CrowdTruth”, Human Computation Journal, v1, 2014, L. Aroyo, C. Welty
Workshop on “Subjectivity, Ambiguity and Disagreement (SAD) in Crowdsourcing”, The Web Conference 2019, https://sadworkshop.wordpress.com/
Annotator disagreement
is signal, not noise
Annotator disagreement
is indicative of
variation in human
interpretation
Annotator disagreement
is indicative of
ambiguity, vagueness,
similarity, over-generality,
& quality
http://lora-aroyo.org @laroyo 25
Three sides of human interpretation
CROWDTRUTH Disagreement provides
guidance in task analysis:
● items with poor semantics
● items with salient terms
● items difficult to classify
● items that are ambiguous
● subjective annotations
● time-sensitive annotations
● difficult annotation tasks
● mis-translated annotations
● users with/without
specific knowledge
● communities of thought
● spammers
You can’t remove the corners…
“Three Sides of CrowdTruth”, Human Computation Journal, v1, 2014, L. Aroyo, C. Welty
http://lora-aroyo.org @laroyo
THE WORLD IS A SMOOTH SPECTRUM OF TRUTH
26
http://lora-aroyo.org @laroyo 27
One truth: knowledge acquisition typically assumes one
correct interpretation for every example
Experts rule: knowledge is captured from domain experts
One is enough: single expert’s knowledge is sufficient
Disagreement bad: when people disagree, they must not
understand the problem
Detailed explanations help: if examples cause
disagreement - adding instructions should help
Once done, forever valid: knowledge is not updated; new
data not aligned with old
All examples are created equal: triples are triples, one is
not more important than another, they are all either true or
false
… and we force the smoothness into a binary form
7 Myths about Human Annotation
“Truth is a Lie: 7 Myths about Human Annotation”, AI Magazine 2014, L. Aroyo, C. Welty
http://lora-aroyo.org @laroyo 28
High Quality Data
represents a phenomena
accurately and consistently over time
and is replicable, reproducible,
and maintainable over time;
has empirical and explanatory power;
and is collected, stored, and used
responsibly.
Rigorous Evaluation of AI Systems workshop, 2019, Human Computation (HCOMP), http://eval.how/
Evaluating Evaluation for AI Systems workshop, 2020, Association for the Advancement of Artificial Intelligence (AAAI), http://eval.how/aaai-2020/
http://lora-aroyo.org @laroyo 29
From Data Quality to Data Excellence
Data Quality is
- a point-estimate of goodness of data
Data Excellence is
- the set of practices and tools that result in
high quality data
http://lora-aroyo.org @laroyo 30
How do we achieve Data Excellence?
Maintainability
Well documented datasets with
owners, which follow best practices
for data at any scale.
Reproducibility
Basic and critical regression tests
for datasets which suppo solid
conclusions for decision making.
Reliability
Datasets which are internally sound
and consistent; factors that a ect
the data are addressed or disclosed.
Fidelity
Data which faithfully, accurately, and
comprehensively represents the
captured phenomenon.
Validity
Datasets which explain aspects of
the phenomena that they represent
in terms of external measures.
1st International Workshop on Data Excellence: http://eval.how/dew2020/
Utility
Data which adequately and
accurately achieves the intended
product behavior.
http://lora-aroyo.org @laroyo 31
much like in software lifecycles, cutting corners at each stage
cascades to subsequent versions, which lead to technical debt
Dataset [Requirements] Analysis
Requirements Analysis
Stakeholder Input
Privacy, compliance
Trust & safety planning
Dataset Maintenance
Updating data over time
Extending to other languages
Version control
Storage and accessibility
Dataset Design
Data acquisition methodology
Rater guidelines
Construct validation
Dataset Testing
Representation metrics
Fairness metrics
Reliability metrics
Approval process
Dataset Implementation
Human labeled data
Logging interaction data
Data
Lifecycle
Ben Hutchinson, 2020
http://lora-aroyo.org @laroyo
TAKE HOME MESSAGE
32
https://en.wikipedia.org/wiki/Metamorphosis_II
data lifecycle - just like in software - is needed to
guide data research & development practices
data is the compass for AI - AI advances where
there is data
data is at the center - AI systems success
depends on the quality of their data
data quality must be addressed in AI practices
- multitude of notions of truth
- necessity for data quality standards
data lifecycle is the backbone for data
excellence tools and practices to stay ahead of
future unintended AI behaviours
http://lora-aroyo.org @laroyo 33
Collaborators
EthicalAI
Ben Hutchinson
Crowd Platform
Amol Wankhede
Anurag Batra
People + AI Research (PAIR)
Nithya Sambasivan
Kristen Olson
Shivani Kapania
Jess Holbrook
Andrew Zaldivar
Mahima Pushkarna
Maysam Moussalem
Praveen Paritosh Ka Wong
Lora Aroyo Devi Krishna
Likert team
Data Excellence:
Better Data for Better AI
ODSC 2020
Lora Aroyo
http://lora-aroyo.org
@laroyo
By Scanned from The Magic of M. C. Escher. (Harry N. Abrams, Inc. ISBN
0-8109-6720-0) by Justin Foote (talk)., Fair use,
https://en.wikipedia.org/w/index.php?curid=3955850
high profile data failure
not bugs in the software, not mistake of humans
problems caused by quality in the data
just like software quality in 90’s - the same has to happen with data
examples of questionable data
crowdtruth relation extraction
how would you annotate it
how do we know and measure the quality of the data
how well does it represent the actual task we are trying to solve
like software we need to establish data quality standards

Data excellence: Better data for better AI

  • 1.
    Data Excellence: Better Datafor Better AI ODSC 2020 Lora Aroyo http://lora-aroyo.org @laroyo By Scanned from The Magic of M. C. Escher. (Harry N. Abrams, Inc. ISBN 0-8109-6720-0) by Justin Foote (talk)., Fair use, https://en.wikipedia.org/w/index.php?curid=3955850
  • 2.
    http://lora-aroyo.org @laroyo TAKE HOMEMESSAGE 2 data lifecycle - just like in software - is needed to guide data research & development practices data is the compass for AI - AI advances where there is data data is at the center - AI systems success depends on the quality of their data https://en.wikipedia.org/wiki/Metamorphosis_II data quality must be addressed in AI practices - multitude of notions of truth - necessity for data quality standards data lifecycle is the backbone for data excellence tools and practices to stay ahead of future unintended AI behaviours
  • 3.
    http://lora-aroyo.org @laroyo 3 TheRise of the Machines “AI Winter” lab experiments Expert Systems small scale experiments
  • 4.
    http://lora-aroyo.org @laroyo 4 TheRise of the Machines “AI Winter” → “AI Breakthroughs in Games” IBM Watson Jeopardy DeepMind AlphaGo beat the humans
  • 5.
    http://lora-aroyo.org @laroyo 5 TheRise of the Machines “AI Winter” → “AI Breakthroughs in Games” → “Real World Tasks” Health diagnostics Flue prediction Weather prediction Text, Image and Video classification Text Generation Text Translation Conversational AI support the humans
  • 6.
    http://lora-aroyo.org @laroyo 6 MainstreamDeployment of AI “Real World Tasks” deployed in the wild → Unintended behaviors Microsoft Tay bot IBM Watson Oncology Amazon Rekognition Google Photos Apple Face ID Facebook chat bots Various Speech Assistants
  • 7.
    http://lora-aroyo.org @laroyo 7 gettingcomputers to “see” the diversity of data data quality is essential for guiding AI away from unintended behaviours Data is the compass for AI
  • 8.
    http://lora-aroyo.org @laroyo 8 TheLife of AI Data “It exists!” bootstrapping AI with data Caltech101 LabelMe Berkley-3D https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research
  • 9.
    http://lora-aroyo.org @laroyo 9 TheLife of AI Data “It exists!” → “It is bigger!” data hungry AI ImageNet SIFT10M OpenImages COCO Web 1T 5-Gram https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research
  • 10.
    http://lora-aroyo.org @laroyo 10 TheLife of AI Data “It exists!” → “It is bigger!” → “It is better!” but before it got better ...
  • 11.
    http://lora-aroyo.org @laroyo 11 TheLife of AI Data “It exists!” → “It is bigger!” → “It is better!” but before it got better ... it got worse ...
  • 12.
    http://lora-aroyo.org @laroyo 12 UnintendedBehaviors in AI Adapted from “AI in the Open World: Discovering Blind Spots of AI”, SafeAI 2020, Ece Kumar
  • 13.
    http://lora-aroyo.org @laroyo 13 TheLife of AI Data “It exists!” → “It is bigger!” → “It is better!” but before it got better ... reactive data improvement
  • 14.
    http://lora-aroyo.org @laroyo 14 TheLife of AI Data “It exists!” → “It is bigger!” → “It is better!” to reach here we need proactive data improvement
  • 15.
    http://lora-aroyo.org @laroyo 15 TheLife of AI Data Alon Halevy, Peter Norvig, and Fernando Pereira. 2009. The Unreasonable Effectiveness of Data. IEEE Intelligent Systems 24, 2 (2009) In the decade since then, the research community have done a lot with quantity, but quality has been left behind
  • 16.
    http://lora-aroyo.org @laroyo 16 Inthe 90’s we introduced standards to achieve Software reliability introduced software engineering lifecycle - requirements, design and testing established processes for software maintenance - version control, sharing, documenting established software quality metrics & processes Ben Hutchinson, 2020
  • 17.
    http://lora-aroyo.org @laroyo 17 Nowwe need the same for Data introduce data lifecycle - requirements, design and testing establish processes for dataset maintenance - version control, sharing, documenting establish data quality metrics & processes Ben Hutchinson, 2020
  • 18.
    http://lora-aroyo.org @laroyo 18 dataquality is typically not caused by software bugs or just by human errors dataset are not easy to debug data quality is typically result of: - how well a dataset represent the actual task - how is the annotation done - are the quality metrics adequate Data Quality is not easy ...
  • 19.
    http://lora-aroyo.org @laroyo it isnot easy to give Y/N answer for most of our AI tasks 19 Do these images depict a GUITAR ? Data Quality is not only human error ✓ ✓ ✓ ✘ ✘ ✘✘✓ ✓
  • 20.
    http://lora-aroyo.org @laroyo 20 Dothese images depict NEW ZEALAND ? Data Quality should consider context of use it is not easy to give Y/N answer for most of our AI tasks the answer typically depends on the context, on the task, on the usage, etc ✓ ✘ ✓ ✓ ✘ ✘
  • 21.
    http://lora-aroyo.org @laroyo 21 Dothese images depict a WEDDING ? Data Quality should include real world diversity it is not easy to give Y/N answer for most of our AI tasks the answer typically depends on the context, on the task, on the usage, etc disagreement is signal for diversity and should be included in AI training ✓ ✘ ✓ ✓ ✘ ✓
  • 22.
    http://lora-aroyo.org @laroyo 22 Doesthe Sentence expresses Does the sentence express TREATS relation between Chloroquine, Malaria? Data Quality is difficult even with experts For prevention of malaria, use only in individuals traveling to malarious areas where CHLOROQUINE resistant P. falciparum MALARIA has not been reported. Rheumatoid arthritis and MALARIA have been treated with CHLOROQUINE for decades. Among 56 subjects reporting to a clinic with symptoms of MALARIA 53 (95%) had ordinarily effective levels of CHLOROQUINE in blood. ✓ ✘ ✓
  • 23.
    http://lora-aroyo.org @laroyo DISAGREEMENT ISSIGNAL Variety of sources for disagreement
  • 24.
    http://lora-aroyo.org @laroyo 24 Doesthe Sentence expresses Model of semantic interpretation TRIANGLE OF MEANING “Three Sides of CrowdTruth”, Human Computation Journal, v1, 2014, L. Aroyo, C. Welty Workshop on “Subjectivity, Ambiguity and Disagreement (SAD) in Crowdsourcing”, The Web Conference 2019, https://sadworkshop.wordpress.com/ Annotator disagreement is signal, not noise Annotator disagreement is indicative of variation in human interpretation Annotator disagreement is indicative of ambiguity, vagueness, similarity, over-generality, & quality
  • 25.
    http://lora-aroyo.org @laroyo 25 Threesides of human interpretation CROWDTRUTH Disagreement provides guidance in task analysis: ● items with poor semantics ● items with salient terms ● items difficult to classify ● items that are ambiguous ● subjective annotations ● time-sensitive annotations ● difficult annotation tasks ● mis-translated annotations ● users with/without specific knowledge ● communities of thought ● spammers You can’t remove the corners… “Three Sides of CrowdTruth”, Human Computation Journal, v1, 2014, L. Aroyo, C. Welty
  • 26.
    http://lora-aroyo.org @laroyo THE WORLDIS A SMOOTH SPECTRUM OF TRUTH 26
  • 27.
    http://lora-aroyo.org @laroyo 27 Onetruth: knowledge acquisition typically assumes one correct interpretation for every example Experts rule: knowledge is captured from domain experts One is enough: single expert’s knowledge is sufficient Disagreement bad: when people disagree, they must not understand the problem Detailed explanations help: if examples cause disagreement - adding instructions should help Once done, forever valid: knowledge is not updated; new data not aligned with old All examples are created equal: triples are triples, one is not more important than another, they are all either true or false … and we force the smoothness into a binary form 7 Myths about Human Annotation “Truth is a Lie: 7 Myths about Human Annotation”, AI Magazine 2014, L. Aroyo, C. Welty
  • 28.
    http://lora-aroyo.org @laroyo 28 HighQuality Data represents a phenomena accurately and consistently over time and is replicable, reproducible, and maintainable over time; has empirical and explanatory power; and is collected, stored, and used responsibly. Rigorous Evaluation of AI Systems workshop, 2019, Human Computation (HCOMP), http://eval.how/ Evaluating Evaluation for AI Systems workshop, 2020, Association for the Advancement of Artificial Intelligence (AAAI), http://eval.how/aaai-2020/
  • 29.
    http://lora-aroyo.org @laroyo 29 FromData Quality to Data Excellence Data Quality is - a point-estimate of goodness of data Data Excellence is - the set of practices and tools that result in high quality data
  • 30.
    http://lora-aroyo.org @laroyo 30 Howdo we achieve Data Excellence? Maintainability Well documented datasets with owners, which follow best practices for data at any scale. Reproducibility Basic and critical regression tests for datasets which suppo solid conclusions for decision making. Reliability Datasets which are internally sound and consistent; factors that a ect the data are addressed or disclosed. Fidelity Data which faithfully, accurately, and comprehensively represents the captured phenomenon. Validity Datasets which explain aspects of the phenomena that they represent in terms of external measures. 1st International Workshop on Data Excellence: http://eval.how/dew2020/ Utility Data which adequately and accurately achieves the intended product behavior.
  • 31.
    http://lora-aroyo.org @laroyo 31 muchlike in software lifecycles, cutting corners at each stage cascades to subsequent versions, which lead to technical debt Dataset [Requirements] Analysis Requirements Analysis Stakeholder Input Privacy, compliance Trust & safety planning Dataset Maintenance Updating data over time Extending to other languages Version control Storage and accessibility Dataset Design Data acquisition methodology Rater guidelines Construct validation Dataset Testing Representation metrics Fairness metrics Reliability metrics Approval process Dataset Implementation Human labeled data Logging interaction data Data Lifecycle Ben Hutchinson, 2020
  • 32.
    http://lora-aroyo.org @laroyo TAKE HOMEMESSAGE 32 https://en.wikipedia.org/wiki/Metamorphosis_II data lifecycle - just like in software - is needed to guide data research & development practices data is the compass for AI - AI advances where there is data data is at the center - AI systems success depends on the quality of their data data quality must be addressed in AI practices - multitude of notions of truth - necessity for data quality standards data lifecycle is the backbone for data excellence tools and practices to stay ahead of future unintended AI behaviours
  • 33.
    http://lora-aroyo.org @laroyo 33 Collaborators EthicalAI BenHutchinson Crowd Platform Amol Wankhede Anurag Batra People + AI Research (PAIR) Nithya Sambasivan Kristen Olson Shivani Kapania Jess Holbrook Andrew Zaldivar Mahima Pushkarna Maysam Moussalem Praveen Paritosh Ka Wong Lora Aroyo Devi Krishna Likert team
  • 34.
    Data Excellence: Better Datafor Better AI ODSC 2020 Lora Aroyo http://lora-aroyo.org @laroyo By Scanned from The Magic of M. C. Escher. (Harry N. Abrams, Inc. ISBN 0-8109-6720-0) by Justin Foote (talk)., Fair use, https://en.wikipedia.org/w/index.php?curid=3955850
  • 35.
    high profile datafailure not bugs in the software, not mistake of humans problems caused by quality in the data just like software quality in 90’s - the same has to happen with data examples of questionable data crowdtruth relation extraction how would you annotate it how do we know and measure the quality of the data how well does it represent the actual task we are trying to solve like software we need to establish data quality standards