When stars align: studies in data
quality, knowledge graphs, and
machine learning
Elena Simperl
@esimperl
Workshop on data quality meets machine learning and knowledge graphs (DQMLKG)
May 2024
Data-centric
AI
Data is the new code in AI/ML
Data defines best possible
functionality of an ML model
The model is a lossy compiler
Knowledge
graphs in
data-centric
AI
Knowledge graphs are
curated trusted sources of AI
data
Used for e.g. embeddings,
XAI, RAG, reasoning
Assuring
knowledge
graphs
I study KGs as socio-technical
constructs
My research
• Explores the links between social and
technical qualities of knowledge
graphs
• Proposes methods and tools to
uncover biases, improve quality, and
increase user trust
We know how good the data is
We know where the data comes from
We audit our data to make it less biased
We know who created the data
We know how the data was created
We know how the data is used
Data-centric
AI @ODI
Vision
A world where
data works for
everyone
Mission
An open,
trustworthy data
ecosystem
ODI strategy
Our independent, non-partisan status, our trusted convening power
and our extensive body of work – over 10 years –
make us unique in the world.
Leadership team
Sir Nigel Shadbolt
Chairman, cofounder
Louise Burke
CEO
Sir Tim Berners-Lee
President, cofounder
theODI.org
During the past decade, we have…
• Contributed to data-driven policies: 30+ public sector bodies
• Incubated early stage enterprises: 191 businesses
• Return on investment (economic benefits): x4 - x14 (£72m - £252m)
• Matched government investment with private sector, philanthropic funding: £18m with £33m
• Trained and delivered data literacy courses to the public and civil servants: 50,000 people
• Created 1,000+ jobs
• £100m+ revenue for UK economy
• Critical infrastructure: Open Banking etc.
• International reach: 43+ countries
Make data AI-ready
• Enable and support high-quality AI
datasets.
• Ensure existing rights are respected as
datasets are created.
• Design and apply responsible data
stewardship and governance.
• Invest in and incentivise public good
datasets for innovation.
• Establish and standardise best practices
in AI data assurance.
Make AI data accessible and
usable
• Work with data holders to study critical datasets
and document them and their limitations.
• Mandate fair and equitable data access for
critical use cases e.g. misinformation, climate.
• Promote, invest in data standards to reduce the
costs of data access, operations, compliance.
• Enable safe access to data by startups and
SMEs.
• Close the data divide with bespoke,
responsibly built AI co-pilots.
Make AI systems use data
responsibly
• Explore data provenance and lineage in training
datasets and document best practices.
• Invest in R&I for AI models that rely less on
massive data collection.
• Design and assess meaningful data licenses.
• Invest in creating more practical toolkits to
inform regulation and reduce compliance costs,
supported by peer learning networks.
• Strengthen data-centric RAI practices through
training for AI engineers and other stakeholders.
Data quality requirements for ML datasets
Standardised metadata about
ML datasets with Croissant
An open format for ML datasets, based on
web standards, that represents ML data
and metadata to
● Reduce friction for using datasets across ML tools and
platforms
● Make it easy to publish, discover and reuse ML datasets
● Support responsible data practices.
Find out more at: mlcommons.org/croissant
Paper: https://arxiv.org/pdf/2403.19546
Croissant is based on schema.org/Dataset
Metadata attributes
Four layers:
• Metadata: general information + ML-
specific attributes
• Resources: the source data organised as
files and sets of files
• Structure: the structure of the resources.
Basic data manipulation (JSON Path,
regex)
• Semantics: ML-specific data
interpretations for interoperability.
Custom types e.g. bounding box and
data organisations e.g. train/test splits.
See: github.com/mlcommons/croissant
mlcommons.org/croissant/1.0
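To make the four layers concrete, here is a minimal sketch of what a Croissant-style JSON-LD description might look like, built as a plain Python dictionary. The dataset name, URLs, file names and field identifiers are all hypothetical, the "@context" is cut down to two prefixes, and the document has not been validated against the Croissant 1.0 specification, so treat it as an illustration of the layering rather than a reference example.

```python
import json

# Simplified, Croissant-style description of a hypothetical dataset.
# Assumptions: reduced "@context", a subset of properties, invented names and URLs.
doc = {
    "@context": {"sc": "https://schema.org/", "cr": "http://mlcommons.org/croissant/"},
    # Layer 1 (metadata): general, schema.org-style information about the dataset.
    "@type": "sc:Dataset",
    "name": "toy_image_labels",
    "description": "Hypothetical labelled images, for illustration only.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    # Layer 2 (resources): single files and sets of files.
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "labels.csv",
            "contentUrl": "https://example.org/labels.csv",
            "encodingFormat": "text/csv",
        },
        {
            "@type": "cr:FileSet",
            "@id": "images",
            "encodingFormat": "image/jpeg",
            "includes": "images/*.jpg",
        },
    ],
    # Layers 3 and 4 (structure and semantics): record sets with typed fields.
    "recordSet": [
        {
            "@type": "cr:RecordSet",
            "@id": "examples",
            "field": [
                {
                    "@type": "cr:Field",
                    "@id": "examples/label",
                    "dataType": "sc:Text",
                    "source": {
                        "fileObject": {"@id": "labels.csv"},
                        "extract": {"column": "label"},
                    },
                },
                {
                    "@type": "cr:Field",
                    "@id": "examples/image",
                    "dataType": "sc:ImageObject",
                    "source": {
                        "fileSet": {"@id": "images"},
                        "extract": {"fileProperty": "content"},
                    },
                },
            ],
        }
    ],
}


def layer_summary(description):
    """Summarise what the description declares in each layer."""
    record_sets = description.get("recordSet", [])
    general = ("name", "description", "license")
    return {
        "metadata": sorted(key for key in description if key in general),
        "resources": len(description.get("distribution", [])),
        "record_sets": len(record_sets),
        "fields": sum(len(rs.get("field", [])) for rs in record_sets),
    }


print(json.dumps(layer_summary(doc), indent=2))
```

The summary helper is not part of any Croissant tooling; it only shows that each layer maps to a distinct part of the document.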
Dataset metadata
● name, description, license, …
● ML-specific attributes: splits, features, labels, …
● Responsible AI attributes
Resources
● FileObject(s): single files (CSV, JSON, Zip, …)
● FileSet(s): directories/sets of homogeneous files (images, text, …)
RecordSets
● Tabular structure over structured and unstructured resources.
● Supports joining & flattening in preparation for ML loading.
● Fields (schema): name, type, references, nesting, ...
Example record set:
field 1 | field 2 | field 3
a | 1 | img1
a | 2 | img2
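The joining and flattening that RecordSets support can be pictured with a small sketch in plain Python. It is not how the Croissant library implements the step; the function name, the in-memory inputs and the choice of "field2" as the join key are assumptions made for illustration, reusing the values from the slide's example table.

```python
def join_records(table_rows, files_by_key, key, target):
    """Return one flat record per table row, joined to a file through `key`.

    table_rows:   rows from a tabular resource, e.g. a CSV file (hypothetical input)
    files_by_key: files from a file set, indexed by the join key (hypothetical input)
    """
    flat = []
    for row in table_rows:
        record = dict(row)  # copy, so the input rows are left untouched
        record[target] = files_by_key[row[key]]
        flat.append(record)
    return flat


# Tabular resource: two rows with a label and an identifier.
rows = [{"field1": "a", "field2": 1}, {"field1": "a", "field2": 2}]
# File set: images indexed by the same identifier.
images = {1: "img1", 2: "img2"}

records = join_records(rows, images, key="field2", target="field3")
for record in records:
    print(record)
```

Each resulting record carries all three fields, which is the flat shape an ML data loader typically expects.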
Croissant for data
users
● Search for ML datasets in Dataset Search
● Download ML datasets from repositories like Kaggle, Hugging Face, OpenML, TensorFlow Datasets catalogue
● Load ML datasets into ML frameworks like TensorFlow, JAX and PyTorch.
Croissant for data publishers
● Visual editor to create
and modify datasets,
automate the description
of the data (e.g., CSV
columns), and get
recommendations to
improve metadata
● Python library to validate,
manipulate and convert
datasets
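As a rough idea of what automating the description of CSV columns could involve, the sketch below infers a field name and a schema.org-style data type for each column using only the Python standard library. It is not the Croissant editor or the Croissant Python library; the function names and the three-way type guess (integer, float, text) are assumptions made for this example.

```python
import csv
import io


def infer_type(values):
    """Guess a schema.org-style type from string values (hypothetical heuristic)."""
    for caster, type_name in ((int, "sc:Integer"), (float, "sc:Float")):
        try:
            for value in values:
                caster(value)
            return type_name
        except ValueError:
            continue  # at least one value did not parse, try the next type
    return "sc:Text"


def describe_csv(text):
    """Return one {"name", "dataType"} entry per column of a CSV string."""
    rows = list(csv.DictReader(io.StringIO(text)))
    if not rows:
        return []  # nothing to infer from a header-only or empty file
    return [
        {"name": column, "dataType": infer_type([row[column] for row in rows])}
        for column in rows[0]
    ]


sample = "id,score,label\n1,0.5,cat\n2,1,dog\n"
fields = describe_csv(sample)
for field in fields:
    print(field)
```

A real tool would also need to handle missing values, dates and nested data, which this heuristic ignores.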
Modular and extensible by design
• Responsible AI: v1 launched
• Use cases e.g. fair ML, safe ML,
crowdsourced labelling,
regulatory compliance,
transparency
• Planned: geospatial, health
Towards Croissant v2
• New attributes for
provenance and lineage,
quality etc
• Alignment with
  • Data Provenance Initiative
  • MLDCAT-AP
  • SAGE DUO (health)
Source: Data Provenance Explorer, 2024
Join the working group!
Also relevant: data-centric ML research
working group
Source: dmlr.ai
Thank you
Elena Simperl
@esimperl
Some slides courtesy of the Croissant working group
@MLCommons
