When stars align: studies in data quality, knowledge graphs, and machine learning

Elena Simperl
@esimperl
Workshop on data quality meets machine learning and knowledge graphs (DQMLKG)
May 2024

Data-centric AI
Data is the new code in AI/ML
Data defines best possible functionality of an ML model
The model is a lossy compiler

Knowledge graphs in data-centric AI
Knowledge graphs are curated trusted sources of AI data
Used for e.g. embeddings, XAI, RAG, reasoning

Assuring knowledge graphs
I study KGs as socio-technical constructs
My research
• Explores the links between social and technical qualities of knowledge graphs
• Proposes methods and tools to uncover biases, improve quality, and increase user trust

We know where the data comes from

We audit our data to make it less biased

We know who created the data

We know how the data was created

We know how the data is used

ODI strategy
Vision
A world where data works for everyone
Mission
An open, trustworthy data ecosystem

Our independent, non-partisan status, our trusted convening power and our extensive body of work (over 10 years) make us unique in the world.
Leadership team
• Sir Nigel Shadbolt: Chairman, cofounder
• Louise Burke: CEO
• Sir Tim Berners-Lee: President, cofounder

During the past decade, we have…
• International reach: 43+ countries
• Contributed to data-driven policies: 30+ public sector bodies
• Incubated early stage enterprises: 191 businesses
• Return on Investment: x4 - x14 (£72m - £252m)
• Matched government investment with private sector, philanthropic funding: £18m with £33m
• Trained and delivered data literacy courses to the public and civil servants: 50,000 people
• Economic benefits: created 1,000+ jobs; £100m+ revenue for UK economy; critical infrastructure: Open Banking etc.
theODI.org

Make data AI-ready
• Enable and support high-quality AI
datasets.
• Ensure existing rights are respected as
datasets are created.
• Design and apply responsible data
stewardship and governance.
• Invest in and incentivise public good
datasets for innovation.
• Establish and standardise best practices
in AI data assurance.

Make AI data accessible and
usable
• Work with data holders to study critical datasets
and document them and their limitations.
• Mandate fair and equitable data access for
critical use cases e.g. misinformation, climate.
• Promote, invest in data standards to reduce the
costs of data access, operations, compliance.
• Enable safe access to data by startups and
SMEs.
• Close the data divide with bespoke,
responsibly built AI co-pilots.

Make AI systems use data
responsibly
• Explore data provenance and lineage in training
datasets and document best practices.
• Invest in R&I for AI models that rely less on
massive data collection.
• Design and assess meaningful data licenses.
• Invest in creating more practical toolkits to
inform regulation and reduce compliance costs,
supported by peer learning networks.
• Strengthen data-centric RAI practices through
training for AI engineers and other stakeholders.

Data quality requirements for ML datasets

Standardised metadata about
ML datasets with Croissant
An open format for ML datasets, based on
web standards, that represents ML data
and metadata to
● Reduce friction for using datasets across ML tools and
platforms
● Make it easy to publish, discover and reuse ML datasets
● Support responsible data practices.
Find out more at: mlcommons.org/croissant
Paper: https://arxiv.org/pdf/2403.19546

Croissant is based on schema.org/Dataset

Metadata attributes
Four layers:
• Metadata: general information + ML-
specific attributes
• Resources: the source data organised as
files and sets of files
• Structure: the structure of the resources.
Basic data manipulation (JSON Path,
regex)
• Semantics: ML-specific data
interpretations for interoperability.
Custom types e.g. bounding box and
data organisations e.g. train/test splits.
See: github.com/mlcommons/croissant
mlcommons.org/croissant/1.0
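
One way to picture the four layers above is as sections of a single JSON-LD document. The sketch below builds a minimal, Croissant-style description as a Python dictionary. It is illustrative only: the dataset name, file names, URL and column name are invented placeholders, the `@context` is heavily abridged, and the property names reflect our understanding of the Croissant 1.0 format, so check them against the specification before reuse.

```python
import json

# Hypothetical dataset description; every name and URL here is a placeholder.
croissant_sketch = {
    # Abridged context (the real Croissant context is considerably longer).
    "@context": {
        "sc": "https://schema.org/",
        "cr": "http://mlcommons.org/croissant/",
    },
    # Layer 1: metadata (general information about the dataset).
    "@type": "sc:Dataset",
    "name": "toy_images",
    "description": "Toy example for illustration.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    # Layer 2: resources (single files and sets of files).
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "labels.csv",
            "contentUrl": "https://example.org/labels.csv",
            "encodingFormat": "text/csv",
        },
        {
            "@type": "cr:FileSet",
            "@id": "images",
            "encodingFormat": "image/png",
            "includes": "images/*.png",
        },
    ],
    # Layers 3 and 4: structure (fields and their sources) and
    # semantics (data types that tell ML tools how to interpret values).
    "recordSet": [
        {
            "@type": "cr:RecordSet",
            "@id": "examples",
            "field": [
                {
                    "@type": "cr:Field",
                    "@id": "examples/label",
                    "dataType": "sc:Text",
                    "source": {
                        "fileObject": {"@id": "labels.csv"},
                        "extract": {"column": "label"},
                    },
                },
                {
                    "@type": "cr:Field",
                    "@id": "examples/image",
                    "dataType": "sc:ImageObject",
                    "source": {
                        "fileSet": {"@id": "images"},
                        "extract": {"fileProperty": "content"},
                    },
                },
            ],
        }
    ],
}


def layer_summary(doc):
    """Count what the sketch declares in each layer."""
    record_sets = doc.get("recordSet", [])
    return {
        "metadata_keys": sorted(
            k for k in doc if k in ("name", "description", "license")
        ),
        "resources": len(doc.get("distribution", [])),
        "record_sets": len(record_sets),
        "fields": sum(len(rs.get("field", [])) for rs in record_sets),
    }


if __name__ == "__main__":
    print(json.dumps(croissant_sketch, indent=2))
    print(layer_summary(croissant_sketch))
```

Keeping all four layers in one document is what lets a single file travel between repositories and ML tools.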
Dataset metadata
● name, description, license, …
● ML-specific attributes: splits, features, labels, …
● Responsible AI attributes
Resources
● FileObject(s): single files (CSV, JSON, Zip, …)
● FileSet(s): directories/sets of homogeneous files (images, text, …)
RecordSets
● Tabular structure over structured and unstructured resources.
● Supports joining & flattening in preparation for ML loading.
● Fields (schema): name, type, references, nesting, ...
Example records (field 1, field 2, field 3): (a, 1, img1), (a, 2, img2)
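
The RecordSet idea in the diagram (rows `a 1 img1` and `a 2 img2`) can be sketched in plain Python: a tabular resource is joined with a set of files so that each record carries both structured values and a reference to unstructured content. This is only an illustration of the concept, not how the Croissant tooling implements it; the CSV content, the `image_id` join key, the file paths and the `build_records` helper are all hypothetical.

```python
import csv
import io

# Hypothetical tabular resource (would be a FileObject such as a CSV file).
TABLE_CSV = "field1,field2,image_id\na,1,img1\na,2,img2\n"

# Hypothetical file set (would be a FileSet of images): image id -> path.
IMAGE_FILES = {
    "img1": "images/img1.png",
    "img2": "images/img2.png",
}


def build_records(table_csv, image_files):
    """Join each table row with its image file to form flat records."""
    records = []
    for row in csv.DictReader(io.StringIO(table_csv)):
        records.append(
            {
                "field1": row["field1"],
                "field2": int(row["field2"]),
                # The join: look up the file referenced by the row.
                "field3": image_files[row["image_id"]],
            }
        )
    return records


if __name__ == "__main__":
    for record in build_records(TABLE_CSV, IMAGE_FILES):
        print(record)
```

A loader built this way hands an ML framework one uniform stream of records, regardless of how the underlying files are organised.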

Croissant for data users
● Search for ML datasets in Dataset Search
● Download ML datasets from repositories like Kaggle, Hugging Face, OpenML, TensorFlow Datasets catalogue
● Load ML datasets into ML frameworks like TensorFlow, JAX and PyTorch.

Croissant for data publishers
● Visual editor to create
and modify datasets,
automate the description
of the data (e.g., CSV
columns), and get
recommendations to
improve metadata
● Python library to validate,
manipulate and convert
datasets
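
To give a feel for what validating a dataset description can involve, here is a small stand-alone sketch. It does not use or reproduce the Croissant Python library: the `validate_metadata` function, the list of required keys and the checks themselves are assumptions chosen for illustration, applied to a Croissant-style dictionary.

```python
# Hypothetical minimum set of metadata keys for this illustration.
REQUIRED_KEYS = ("name", "description", "license")


def validate_metadata(doc):
    """Return a list of human-readable issues found in a dataset description."""
    issues = []

    # Check 1: general metadata is present and non-empty.
    for key in REQUIRED_KEYS:
        if not doc.get(key):
            issues.append(f"missing metadata: {key}")

    # Check 2: at least one resource is declared.
    resources = doc.get("distribution", [])
    if not resources:
        issues.append("no resources declared")
    resource_ids = {r.get("@id") for r in resources}

    # Check 3: every field points at a declared resource.
    for record_set in doc.get("recordSet", []):
        for field in record_set.get("field", []):
            source = field.get("source", {})
            ref = source.get("fileObject") or source.get("fileSet") or {}
            if ref.get("@id") not in resource_ids:
                issues.append(
                    f"field {field.get('@id')!r} references an unknown resource"
                )
    return issues


if __name__ == "__main__":
    incomplete = {"name": "toy", "description": "Toy example."}
    for issue in validate_metadata(incomplete):
        print(issue)
```

Returning a list of issues rather than raising on the first problem fits an editor that wants to show recommendations for improving metadata all at once.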

Modular and extensible by design
• Responsible AI: v1 launched
• Use cases e.g. fair ML, safe ML,
crowdsourced labelling,
regulatory compliance,
transparency
• Planned: geospatial, health

Towards Croissant v2
• New attributes for provenance and lineage, quality etc.
• Alignment with
  • Data Provenance Initiative
  • MLDCAT-AP
  • SAGE DUO (health)
Source: Data Provenance Explorer, 2024

Also relevant: data-centric ML research
working group
Source: dmlr.ai

Thank you
Elena Simperl
@esimperl
Some slides courtesy of the Croissant working group
@MLCommons
