A keynote talk given at the IDEAL 2023 conference (Évora, Portugal, Nov 23, 2023).
Abstract.
The past few years have seen the emergence of what the AI community calls "Data-centric AI", namely the recognition that some of the limiting factors in AI performance lie in the data used for training the models, as much as in the expressiveness and complexity of the models themselves. One analogy is that of a powerful engine that will only run as fast as the quality of the fuel allows. A plethora of recent literature has started to explore the connection between data and models in depth, along with startups that offer "data engineering for AI" services. Some concepts are well-known to the data engineering community, such as incremental data cleaning, multi-source integration, and data bias control; others are more specific to AI applications, for instance the realisation that some samples in the training space are "easier to learn from" than others. In this "position talk" I will suggest that, from an infrastructure perspective, there is an opportunity to efficiently support patterns of complex pipelines where data and model improvements are entangled in a series of iterations. I will focus in particular on end-to-end tracking of data and model versions, as a way to support MLDev and MLOps engineers as they navigate through a complex decision space.
1. Data-centric AI and the convergence of data and model engineering:
opportunities to streamline the end-to-end data value chain
Prof. Paolo Missier
School of Computing
Newcastle University, UK
University of Évora, Portugal
Nov 22-24, 2023
2. Health Data Science is all about the Data
Tracking trajectories of multiple long-term conditions using dynamic patient-cluster associations. Kremer, R.; Raza, S. M.; Eto, F.; Casement, J.; Atallah, C.; Finer, S.; Lendrem, D.; Barnes, M.; Reynolds, N. J.; and Missier, P. In 2022 IEEE International Conference on Big Data (Big Data), pages 4390–4399, December 2022.
3. Data-centric AI (DCAI) – is that a thing?
towardsdatascience.com
https://datacentricai.org/
https://landing.ai/data-centric-ai/
5. A practical view
https://www.vanderschaar-lab.com/data-centric-ai/
[1] Seedat, Nabeel, Fergus Imrie, and Mihaela van der Schaar. 'DC-Check: A Data-Centric AI Checklist to Guide the Development of Reliable Machine Learning Systems'. arXiv, 9 November 2022. http://arxiv.org/abs/2211.05764.
6. Talk outline
• Data Centric AI sounds familiar to database people…
old wine in new bottles?
• What is new:
Iteration patterns that span the entire data / training / deployment / monitoring lifecycle
• A few examples: Training set cleaning, pruning
• Goal: harvesting this knowledge into reusable patterns and libraries
To support end-to-end data science pipelines in a principled way
• How can we do this?
Data versioning, provenance in support of reproducibility and explainability
Current technical solutions, and open challenges
7. Rapidly emerging literature
[2] Zha, Daochen, Zaid Pervaiz Bhat, Kwei-Herng Lai, Fan Yang, and Xia Hu. 'Data-Centric AI: Perspectives and Challenges'. arXiv, 2 April 2023. http://arxiv.org/abs/2301.04819.
[3] Mohammad Hossein Jarrahi, Ali Memariani, and Shion Guha. 2023. The Principles of Data-Centric AI. Commun. ACM 66, 8 (August 2023), 84–92. https://doi.org/10.1145/3571724
[4] Zha, Daochen, Zaid Pervaiz Bhat, Kwei-Herng Lai, Fan Yang, Zhimeng Jiang, Shaochen Zhong, and Xia Hu. 'Data-Centric Artificial Intelligence: A Survey'. arXiv, 11 June 2023. https://doi.org/10.48550/arXiv.2303.10158.
Source: [2]
Source: [3]
8. Hypothesis: DCAI involves extended feedback loops
Source: [5]
[5] Singh, Prerna. 'Systematic Review of Data-Centric Approaches in Artificial Intelligence and Machine Learning'. Data Science and Management 6, no. 3 (1 September 2023): 144–57. https://doi.org/10.1016/j.dsm.2023.06.001.
9. The interesting twist: end-to-end patterns
[Diagram: end-to-end pipeline: Data Selection → Data interventions → Learning-ready data → Competitive Modelling → Evaluation (model performance) → Ops / Monitoring]
Why are these interesting?
- human-in-the-loop pipelines
- challenging to explain, reproduce
- a combination of skills required
10. A closer look: data cleaning for ML
[6] Neutatz, Felix, et al. "From Cleaning before ML to Cleaning for ML." IEEE Data Eng. Bull. 44.1 (2021): 24-41.
Hypothesis: data cleaning needs to take an end-to-end, application-driven approach that integrates cleaning throughout the ML application.
These data interventions are not, by themselves, entirely new to Data Quality practitioners. However, DQ can be model- and application-specific [6].
11. Example: ActiveClean
[7] Krishnan, Sanjay, Jiannan Wang, Eugene Wu, Michael J. Franklin, and Ken Goldberg. 'ActiveClean: Interactive Data Cleaning for Statistical Modeling'. Proceedings of the VLDB Endowment 9, no. 12 (1 August 2016): 948–59. https://doi.org/10.14778/2994509.2994514.
Problem: full cleaning of the training set is not feasible.
An initial model is trained on a dirty training set. The consequence of dirty data is that the wrong loss function is optimized: the best model is a suboptimal point along the "real" loss function.
Approach: incremental cleaning combined with SGD.
Repeat until <cleaning budget reached>:
- ActiveClean selects "dirty" items that are expected to move the model down the gradient
- items are cleaned (e.g. manually) and added to the clean set
Relies on convexity for convergence. Here, "cleaning" means (manually) removing outliers and transforming attributes.
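A minimal sketch of this loop, in Python (model.gradient, model.step, and the clean_fn oracle are hypothetical stand-ins, not the ActiveClean API):

    import numpy as np

    # Sketch of an ActiveClean-style loop [7]. Assumes a convex loss with
    # per-example gradients; dirty_set is a list of (x, y) records.
    def active_clean(model, dirty_set, clean_fn, budget, batch_size=10, lr=0.1):
        cleaned = []
        for _ in range(budget // batch_size):
            # Prioritise dirty records expected to move the model furthest down the gradient
            scores = [np.linalg.norm(model.gradient(x, y)) for (x, y) in dirty_set]
            picked = sorted(np.argsort(scores)[-batch_size:].tolist(), reverse=True)
            batch = [clean_fn(dirty_set[i]) for i in picked]   # oracle cleans them
            for i in picked:                                   # drop from the dirty pool
                dirty_set.pop(i)
            cleaned.extend(batch)
            for (x, y) in batch:                               # SGD step on cleaned data
                model.step(x, y, lr)                           # convexity -> convergence
        return model, cleaned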
12. Some concrete goals
1. Reusability and reproducibility
• Build a curated, reusable library of safe and predictable data+model intervention patterns
2. Versioning: end-to-end pipeline management beyond model management (e.g. MLFlow)
3. Explainability: why was an intervention needed? How and why was a data point affected?
• How can this be achieved?
• What language is needed to express explanations?
• At which level of abstraction?
To provide infrastructure support to streamline interesting end-to-end patterns
13. Use cases from the DataPerf challenges
Aim: to develop data performance benchmarks for ML, complementing the MLPerf benchmarks. Both are part of MLCommons.
Why are these interesting?
- surface and demonstrate novel data intervention approaches
- evidence new data/training interleaving patterns
- offer opportunities for reusability / generalisability
- challenges typically fix dataset + model, focus on optimal interventions
https://www.dataperf.org/
[10] Mazumder, Mark, Colby Banbury, Xiaozhe Yao, Bojan Karlaš, William Gaviria Rojas, Sudnya Diamos, Greg Diamos, et al. 'DataPerf: Benchmarks for Data-Centric AI Development'. arXiv, 13 October 2023. https://doi.org/10.48550/arXiv.2207.10062.
Benchmarks emerge through challenges that demonstrate how model performance can be enhanced through data interventions.
14. Use case 1: "Training Set Debugging" challenge
Context: vision dataset for image classification tasks.
Given: a training set Dtr of labelled images and a classification task T.
Training data from the Open Images V7 dataset: images annotated with image-level labels, object bounding boxes, object segmentation masks…
For reference, we first calculate the best model performance P for a model M trained on a perfectly labelled Dtr and evaluated on an independent test set Dtest.
"Data-Centric" Challenge Scenario and Goals:
- realistically, some of the labels in Dtr are noisy
- Goal: suggest a strategy that reaches a target performance, relative to P, with the minimal number of label fixes ("cleaning")
15. Data Cleaning simulation pattern
[Diagram: evaluator side corrupts labels in Dtr to produce Dn and records target performance using fixed training code; competitor side applies a cleaning priority strategy to Dn to produce D', trains a model M' with the same fixed training code, and evaluates it to produce a score]
A noisy version Dn is generated from Dtr (e.g. by label flipping). Model performance on Dn will be degraded relative to the top performance P achieved by M on Dtr.
Target performance is recorded by training on Dtr and testing on Dtest.
A strategy must suggest a ranking of examples in Dn such that, by "cleaning" those in order, performance approximates top performance.
Strategies are scored on the number of cleaning actions required to achieve 95% of target performance.
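A minimal sketch of this simulation under assumed names (X, y stand for Dtr; X_test, y_test for Dtest; rank_fn is a competitor strategy returning indices of Dn in cleaning order; binary labels and a logistic-regression model are simplifying assumptions):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    def score_strategy(X, y, X_test, y_test, rank_fn,
                       flip_frac=0.2, target_frac=0.95, seed=0):
        rng = np.random.default_rng(seed)
        # Evaluator side: corrupt labels in Dtr by flipping to produce Dn,
        # and record target performance by training on the clean Dtr
        y_noisy = y.copy()
        flipped = rng.choice(len(y), size=int(flip_frac * len(y)), replace=False)
        y_noisy[flipped] = 1 - y_noisy[flipped]        # binary label flipping
        target = accuracy_score(y_test, LogisticRegression().fit(X, y).predict(X_test))

        # Competitor side: clean examples in the suggested order until 95% of
        # target performance; the score is the number of cleaning actions used
        y_cur = y_noisy.copy()
        for n_fixed, idx in enumerate(rank_fn(X, y_noisy), start=1):
            y_cur[idx] = y[idx]                         # one label fix ("cleaning")
            model = LogisticRegression().fit(X, y_cur)  # retrain per fix, for clarity
            if accuracy_score(y_test, model.predict(X_test)) >= target_frac * target:
                return n_fixed
        return len(y)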
16. What can be learnt from this exercise?
[Diagram: inner loop: cleaning strategy → D' → model training → M' → eval; outer MLOps loop selecting Mbest]
The challenge is effectively a simulation of a two-level iterative process.
Challenge winners will have developed and demonstrated new strategies for training set debugging.
However: a strategy may be optimized for dataset Dn, task T, and the pre-selected model.
17. Provenance and versioning
[Diagram: the same two-level loop, with cleaning strategy CSi deriving Di from Dn, followed by model training, M', eval, and an outer MLOps loop around Mbest]
We would like to:
1. Document that Di was derived from Dn using CSi, as part of a longer pipeline
2. Be able to identify what effect CSi had on Dn:
   - which data labels were cleaned
   - why they were cleaned
3. Make sure CSi can be reused safely:
   - specify assumptions and pre-requisites
   - provide examples of past usages
18. Provenance layers I: coarse provenance
Assumptions:
- Dn, Di are atomic units of data
- CS is an atomic unit of processing
Reproducibility: "outer layer" questions:
- Where does Di come from?
- Which version of Di was used to train Mbest?
Derivation:
- Di was derived from Dn using CSi
- Mbest was trained on Di
Attribution:
- CSi was created by <creator C>
[PROV graph: CSi used Dn; Di wasGeneratedBy CSi; Di wasDerivedFrom Dn; CSi wasAssociatedWith C]
19. Provenance layers II: data-granular provenance
Assumptions:
- Dn = {xn_j}, Di = {xi_j}
- CS is an atomic unit of processing
Explainability: data-level questions:
- Which xn_j were cleaned?
- "How dirty was Dn?" In aggregate: how many labels were cleaned to achieve a target performance?
Derivations: for each xi_j that has been cleaned by CSi, xi_j was derived from xn_j.
[PROV graph: CSi used xn_j; xi_j wasGeneratedBy CSi; xi_j wasDerivedFrom xn_j; CSi wasAssociatedWith C]
20. Provenance layers III: data- and process-granular provenance
Assumptions:
- Dn = {xn_j}, Di = {xi_j}
- CS is provenance-aware: able to attach a local explanation to each data derivation
Explainability: CS explains its own actions:
- "Why was xn_j cleaned?"
Derivations: same as previous
- local explanations are algorithm-specific
- some standardization is possible, e.g. human-supplied labels (Active Learning)
[PROV graph: as in Layer II, with a "why" annotation attached to the derivation]
21. Representing provenance
A formal, interoperable data model and syntax for generic provenance constructs (W3C PROV):
- accommodates layers I and II
- extensible to a domain vocabulary, e.g. DC-Check
22. Example: Layers I, II
[PROV graph: CS used xn_j; xi_j wasGeneratedBy CS; xi_j wasDerivedFrom xn_j (annotated with "why"); CS wasAssociatedWith C]
entity(D_noisy, [prov:type="training-set"])
entity(D_clean, [prov:type="training-set"])
entity(e1, [prov:type="training-data", inSet="D_noisy", index=j, val=V])
entity(e2, [prov:type="training-data", inSet="D_clean", index=j, val=W])
agent(C, [prov:type="CS-creator"])
activity(CS, [prov:type="cleaning-strategy", version="v1.0", desc="…"])
wasDerivedFrom(e2, e1)
used(CS, e1)
wasGeneratedBy(e2, CS)
wasAssociatedWith(CS, C)
Surface representation: PROV-N, a relational-like notation (à la Datalog).
Internal representation: property-value graphs! Hint: Neo4j works well…
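As a sketch, the same statements can be built programmatically with the Python prov package (an assumption about tooling, not the talk's implementation):

    from prov.model import ProvDocument   # pip install prov

    doc = ProvDocument()
    doc.add_namespace('ex', 'http://example.org/')
    doc.entity('ex:D_noisy', {'prov:type': 'training-set'})
    doc.entity('ex:D_clean', {'prov:type': 'training-set'})
    e1 = doc.entity('ex:e1', {'prov:type': 'training-data', 'ex:inSet': 'D_noisy'})
    e2 = doc.entity('ex:e2', {'prov:type': 'training-data', 'ex:inSet': 'D_clean'})
    cs = doc.activity('ex:CS', other_attributes={'ex:version': 'v1.0'})
    c = doc.agent('ex:C', {'prov:type': 'CS-creator'})
    doc.used(cs, e1)
    doc.wasGeneratedBy(e2, cs)
    doc.wasDerivedFrom(e2, e1)
    doc.wasAssociatedWith(cs, c)
    print(doc.serialize(indent=2))        # PROV-JSON; other serializations available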
23. Use case 2: training set optimisation
Motivation: training efficiency.
Model performance (test loss) correlates with training data size D according to a power law [11]. However, "Since scalings with N (model size), D (training tokens), Cmin (compute budget) are power-laws, there are diminishing returns with increasing scale." [11]
[11] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
This motivates trying to optimize D (the law's form is sketched below):
1. Redundancy in D leads to wasted training time
2. Not all training examples are equally important for training: which ones should be kept / removed?
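For reference, a minimal sketch of the data scaling law from [11] (the constants are those reported there for language models; treat them as indicative):

    L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D},
    \qquad \alpha_D \approx 0.095, \quad D_c \approx 5.4 \times 10^{13}\ \text{tokens}

Since dL/dD falls off as D grows, each additional training example buys less: the diminishing returns quoted above.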
24. Training set optimization Task 1: reducing redundancy
[12] Abbas, Amro, Kushal Tirumala, Dániel Simig, Surya Ganguli, and Ari S. Morcos. 'SemDeDup: Data-Efficient Learning at Web-Scale through Semantic Deduplication'. arXiv, 22 March 2023. http://arxiv.org/abs/2303.09540.
Approach [12]:
1. Map the training set D to an embedded space – using pre-trained foundation models
2. Cluster all data points in embedded space using k-means
3. Using cosine similarity, identify similar points within each cluster. Threshold and select
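A minimal sketch of steps 2 and 3, assuming step 1 (embedding with a pre-trained foundation model) has already produced an (n_samples x dim) array; k and the similarity threshold are illustrative:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics.pairwise import cosine_similarity

    def semantic_dedup(embeddings, k=100, sim_threshold=0.95):
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(embeddings)   # step 2
        keep = []
        for c in range(k):                                                 # step 3
            idx = np.where(labels == c)[0]
            sims = cosine_similarity(embeddings[idx])
            kept_local = []
            for i in range(len(idx)):
                # keep a point only if it is not a near-duplicate of one already kept
                if all(sims[i, j] < sim_threshold for j in kept_local):
                    kept_local.append(i)
            keep.extend(idx[kept_local].tolist())
        return keep          # indices of the training set to retain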
25. Training set optimization Task 2: pruning easy/hard examples
Main findings from [13]:
1. Not all training examples are created equal: hard vs easy
2. The best pruning strategy depends on the amount of initial data:
- small TS: keep the easy examples
- large TS: keep the hard examples
[13] Sorscher, Ben, et al. "Beyond neural scaling laws: beating power law scaling via data pruning." Advances in Neural Information Processing Systems 35 (2022): 19523–19536.
(Figure reproduced from [13])
A really simple pruning method, very similar to Task 1: "To compute a self-supervised pruning metric for ImageNet, we perform k-means clustering in the embedding space of an ImageNet pre-trained self-supervised model and define the difficulty of each data point by the Euclidean distance to its nearest cluster centroid, or prototype" [13].
Caveat: only tested on ImageNet!
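A sketch of that pruning metric, assuming `embeddings` already holds self-supervised embeddings (an (n_samples x dim) array); function and parameter names are illustrative:

    import numpy as np
    from sklearn.cluster import KMeans

    def prune_by_difficulty(embeddings, keep_frac=0.8, k=1000, keep='hard'):
        km = KMeans(n_clusters=k, n_init=10).fit(embeddings)
        # difficulty = Euclidean distance to the nearest cluster centroid ("prototype")
        difficulty = np.linalg.norm(
            embeddings - km.cluster_centers_[km.labels_], axis=1)
        order = np.argsort(difficulty)          # easy (close to prototype) first
        n_keep = int(keep_frac * len(embeddings))
        # per [13]: keep easy examples for small training sets, hard ones for large
        return order[:n_keep] if keep == 'easy' else order[-n_keep:]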
26. Provenance for training set optimization
This is a classic filter pipeline, only a little more sophisticated:
TSfull → Embed → Cluster → Select → TSopt
Layers I and II are very similar to Use Case 1.
Reproducibility: where does TSopt come from? Black-box / gray-box options:
[PROV graph, black box: Filter used TSfull; TSopt wasGeneratedBy Filter; TSopt wasDerivedFrom TSfull]
[PROV graph, gray box: Embed used TSfull; TSemb wasGeneratedBy Embed; Cluster used TSemb; TSclus wasGeneratedBy Cluster; Select used TSclus; TSopt wasGeneratedBy Select; TSopt wasDerivedFrom TSfull]
27. Provenance for training set optimization / Layer II
Assumptions:
- TSfull = {ti}, TSopt = {ti}
- Filter is an atomic unit of processing
Explainability: data-level questions:
- Which ti were filtered out?
- "How redundant was TSfull?"
Derivations: for each ti that has been removed by Filter, ti was invalidated by Filter.
[PROV graph: Filter used TSfull; each removed ti wasInvalidatedBy Filter]
28. Provenance for training set optimization / Layer III
Assumptions:
- TSfull = {ti}, TSopt = {ti}
- Filter (embed / cluster / select) is provenance-aware
Explainability: why was ti selected/removed?
1. ti belongs to cluster Ch,
2. there exists tj in Ch such that d(ti, tj) < δ, so ti and tj are redundant; tj was selected, ti removed.
A specific (formal) language is needed to express these explanations, from which a natural language rendering can then be created (one possible record format is sketched below).
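A sketch of one possible structured "why" record for the redundancy case (the field names are assumptions, not an established vocabulary):

    from dataclasses import dataclass

    @dataclass
    class RedundancyExplanation:
        removed: str       # id of the removed example ti
        kept: str          # id of the retained near-duplicate tj
        cluster: int       # cluster Ch containing both
        distance: float    # d(ti, tj) in embedding space
        threshold: float   # the redundancy threshold delta

        def render(self) -> str:
            # natural-language rendering derived from the formal record
            return (f"{self.removed} was removed: it lies in cluster {self.cluster} "
                    f"at distance {self.distance:.3f} < {self.threshold} from the "
                    f"retained example {self.kept}, so the two are redundant.")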
29. Representing provenance: Layer III
Need a language to express solution-specific explanations:
- Use case 1: "Why was xn_j cleaned?"
- Use case 2: "Why was ti selected/removed?" Because: (1) ti belongs to cluster Ch, and (2) there exists tj in Ch such that d(ti, tj) < δ, so ti and tj are redundant; tj was selected, ti removed.
[PROV graphs: the Layer II derivation graph (xi_j wasDerivedFrom xn_j) and invalidation graph (ti wasInvalidatedBy Filter), each carrying a "why" annotation]
30. A possible catalogue of DC operators: DC-Check
[1] Seedat, Nabeel, Fergus Imrie, and Mihaela van der Schaar. 'DC-Check: A Data-Centric AI Checklist to Guide the Development of Reliable Machine Learning Systems'. arXiv, 9 November 2022. http://arxiv.org/abs/2211.05764.
31. Data and training
Q1 Continuous data curation:
- automatic labelling (semi-supervised)
- merging / fusing data sources
- data forensics (detect inappropriate data subsets)
Q3 Data quality assessment:
- Verify ground truth
- Some samples are easier to learn from than others
Q4 Synthetic data, data augmentation
- improve dataset coverage
- unbiased data generation
- stress-testing data
Q5 Model architecture and hyperparameter search
Q7 Balancing robustness and fairness with performance
32. Testing and deployment
Q10, Q11 Model evaluation and monitoring:
- Identify model failure scenarios, additional targeted data collection
- Hidden groups and hidden stratification
- Detect training and deployment distribution mismatch issues
Q12 Detect and track data shift:
- Construct new datasets for retraining
- Feedback-driven datasets
- Errors inform the automatic creation, updating of training sets
33. General pattern: data and models interleaved
[Diagram: Data Selection → data interventions (curation, engineering, synthetic data, augmentation) → learning-ready data → competitive modelling → evaluation (model performance) → monitoring; data interventions sit on the MLDev side, monitoring on the MLOps side]
How do we capture these interventions in a principled way?
1. Identify data interventions
2. Track data operations
3. Link to model tracking (MLFlow? and other model management frameworks)
34. Capturing provenance: Layer I
[Diagram: CSi derives Di from Dn; model training produces Mbest; outer MLOps loop]
Use case 1, Layer I (coarse): a process-level observer.
Typical implementation:
- Pandas / Spark Python pipeline, dataframe datasets
- CS can be a method call or a code block:
  1. method call: Di = CS(Dn)
  2. code block: "Begin CS" … "End CS" markers delimit the code that reads Dn and writes Di
[PROV graph: CS used Dn; Di wasGeneratedBy CS; Di wasDerivedFrom Dn, recorded once per observed call or block]
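A minimal sketch of such a process-level observer, as a Python decorator around a method-call-style CS (the decorator, the in-memory PROV_LOG, and clean are illustrative assumptions, not an existing library):

    import functools
    import uuid

    PROV_LOG = []   # stand-in for a real provenance store

    def provenance_observed(activity_name):
        """Wrap a cleaning strategy so each call is recorded as coarse provenance."""
        def wrap(fn):
            @functools.wraps(fn)
            def inner(df_in, *args, **kwargs):
                run_id = f"{activity_name}-{uuid.uuid4().hex[:8]}"
                df_out = fn(df_in, *args, **kwargs)            # Di = CS(Dn)
                PROV_LOG.extend([
                    ("used", run_id, id(df_in)),
                    ("wasGeneratedBy", id(df_out), run_id),
                    ("wasDerivedFrom", id(df_out), id(df_in)),
                ])
                return df_out
            return inner
        return wrap

    @provenance_observed("CS")
    def clean(df):
        return df.dropna()    # the actual cleaning strategy goes here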
35. Capturing provenance: Layer II
Layer II (data-granular): interpreter-level observer.
- Requires an observer at the boundaries of CS, i.e. to tell which x.label values have changed
- The observer has access to individual dataframe elements
- But it is unaware of data transformation semantics
[14] A. Chapman, P. Missier, G. Simonelli, and R. Torlone. 2020. Capturing and querying fine-grained provenance of preprocessing pipelines in data science. Proc. VLDB Endow. 14, 4 (December 2020), 507–520. https://doi.org/10.14778/3436905.3436911
[15] A. Chapman, L. Lauro, P. Missier, and R. Torlone. 2022. DPDS: assisting data science with data provenance. Proc. VLDB Endow. 15, 12 (2022), 3614–3617. https://doi.org/10.14778/3554821.3554857
A possible starting point: Data Provenance for Data Science (DPDS) [14,15].
[PROV graph: CSi used xn_j; xi_j wasGeneratedBy CSi; xi_j wasDerivedFrom xn_j; CSi wasAssociatedWith C]
36. A generic dataframe observer for Pandas
Approach (a minimal sketch follows):
- add an observer to monitor dataframe changes
- mostly transparent to the application
- some control surfaced
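A minimal sketch of the idea (not DPDS itself; class and field names are assumptions): snapshot a dataframe before an operation, diff afterwards, and record cell-level changes and invalidated rows.

    import pandas as pd

    class DataFrameObserver:
        """Snapshot-and-diff observer: records which cells changed across an operation."""
        def __init__(self):
            self.records = []   # (activity, row_index, column, old_value, new_value)

        def watch(self, df: pd.DataFrame, activity: str):
            before = df.copy()
            def commit(after: pd.DataFrame):
                common = before.index.intersection(after.index)
                for col in before.columns.intersection(after.columns):
                    changed = before.loc[common, col].ne(after.loc[common, col])
                    for idx in common[changed.to_numpy()]:   # NaN handling simplified
                        self.records.append(
                            (activity, idx, col, before.at[idx, col], after.at[idx, col]))
                for idx in before.index.difference(after.index):
                    # rows dropped by the operation: invalidated
                    self.records.append((activity, idx, None, "row", "invalidated"))
            return commit

    # Usage sketch: observe a label-fixing step
    obs = DataFrameObserver()
    df = pd.DataFrame({"label": [0, 1, 1]})
    commit = obs.watch(df, "CS")
    df2 = df.copy()
    df2.loc[0, "label"] = 1       # a cleaning action flips one label
    commit(df2)                   # obs.records now holds ('CS', 0, 'label', 0, 1)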
38. Benefits and current limitations
✓ Designed for Layer II provenance, but can be used for Layer I (atomic data, atomic transformation)
✓ Implemented on a Neo4j backend: explainability questions map well to graph queries (Cypher); an example follows
✗ Unaware of operator semantics, so provenance is not always precise
✗ Scalability challenges when provenance is granular and changes are pervasive
✗ Requires summarization / compression
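For instance, the Layer II question "which items of D_noisy were cleaned?" maps onto a Cypher query; a sketch assuming provenance stored as Entity nodes with WAS_DERIVED_FROM relationships (the node labels and property names are assumptions, not DPDS's actual schema):

    from neo4j import GraphDatabase

    QUERY = """
    MATCH (clean:Entity)-[:WAS_DERIVED_FROM]->(dirty:Entity)
    WHERE dirty.inSet = 'D_noisy'
    RETURN dirty.index AS item, dirty.val AS old, clean.val AS new
    """

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
    with driver.session() as session:
        for record in session.run(QUERY):
            print(record["item"], record["old"], "->", record["new"])
    driver.close()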
39. Capturing provenance: Layer III
Layer III (data- and process-granular):
- Requires an explanation generator as part of the transformation logic
- Approach: operators send "explanations" to a provenance server through an API at runtime
- At a chosen granularity: dataset or data item
[PROV graph: as in Layer II, with a "why" annotation on each derivation]
[Diagram: CS transforms D into D', streaming records {xi, x'i, expl_i} to a Prov-DB]
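A minimal sketch of an operator that emits such {xi, x'i, expl_i} records (the endpoint, payload shape, and fix_label oracle are illustrative assumptions):

    import json
    import urllib.request

    PROV_ENDPOINT = "http://localhost:8000/provenance"   # assumed server

    def emit(record: dict):
        # POST one {xi, x'i, expl_i} record to the provenance server
        req = urllib.request.Request(
            PROV_ENDPOINT, data=json.dumps(record).encode(),
            headers={"Content-Type": "application/json"}, method="POST")
        urllib.request.urlopen(req)

    def provenance_aware_clean(items, fix_label):
        cleaned = []
        for i, (x, label) in enumerate(items):
            new_label = fix_label(x, label)      # the operator's own cleaning logic
            cleaned.append((x, new_label))
            if new_label != label:
                emit({"activity": "CS", "item": i, "old": label, "new": new_label,
                      "why": "label disagreed with oracle annotation"})
        return cleaned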
40. Summary of goals and action plan
[Diagram: problem instances flow through Data → Training → Ops; a Prov-DB observes and records them, supporting reproduce/explain and, through a curated data toolkit, enabling reuse]
Goals: to support
- reusability and emerging best practices for complex data intervention + usage patterns
- reproducibility and explainability of pipeline instances
How:
- enable data processing observation / capture
- build a curated catalogue of interventions + usage patterns
- associate provenance with data + model versions
Challenges:
- Observability: instrumenting common runtimes for transparent capture
- Granularity: pick a layer (I, II, III); precision vs scalability: how much do we need?
- "Why?": a vocabulary and language for expressing explanations