In this presentation, given to graduate students at Università Roma Tre, Italy, we suggest that concepts well known in data provenance can be exploited to provide explanations in the context of data-centric AI processes. Through use cases (incremental data cleaning, training set pruning), we build up increasingly complex provenance patterns, culminating in an open question:
how to describe "why" a specific data item has been manipulated as part of data processing, when such processing may consist of a complex data transformation algorithm.
Towards explanations for Data-Centric AI using provenance records
1. Prof. Paolo Missier
School of Computer Science
University of Birmingham, UK
April 5th, 2024
Towards explanations for Data-Centric AI
using provenance records
2. 2
Outline
• Basics of data provenance for DAG pipelines
• Provenance in the context of Data-Centric AI use cases: Levels of detail / granularity
• Data Provenance for Data Science: methods and tooling
• Challenge: Why+provenance
3. 3
Summary of data-centric use cases
1. Model-driven incremental data cleaning
   1. Training set cleaning
   2. Label correction
2. Training set optimization
   1. Removing hard/easy examples
   2. Reducing redundancies
4. 4
Summary of data-centric use cases
Context | Type of operation | Strategy | Data processing and model training
ActiveClean | Select items from the training set for manual cleaning (item transformation: x → x′) | Iterative batch cleaning strategy driven by SGD | ActiveClean processing is interleaved with model training; both stop at the same time.
Training set debugging | Select items from the training set for label correction (item transformation: y → y′) | Rank data points so as to minimize manual corrections | The re-labelling strategy is incremental and interleaved with model retraining; however, the winning strategy was not published, so its generalizability is unclear.
Training set optimization: reducing redundancy by removing similar points | Prune items from the training set (filtering: remove(y)) | Cluster data points in embedded space, select representatives from each cluster | Training set pruning happens before model training.
Training set optimization: reducing redundancy by pruning hard/easy examples | Prune items from the training set | Identify simple / hard examples, sample from those depending on training set size | Training set pruning happens before model training.
5. 5
Reproducibility, explainability
The use cases provide examples of complex data transformations and data filtering operators
We aim to answer three types of questions:
• Which data transformations were applied to raw input dataset(s) to generate the final training set used for modelling? (dataset level)
• Which individual data items were affected by each of the transformations, and what was the effect? (data item level)
• Why was a specific data item transformed?
6. 6
Representing provenance
A formal, interoperable data model and syntax for generic provenance constructs
- accommodates layers I and II
- extensible to a domain vocabulary → e.g. DC-Check
Seedat, Nabeel, Fergus Imrie, and Mihaela van der Schaar. ‘DC-Check: A Data-Centric AI Checklist to Guide the Development of Reliable Machine
Learning Systems’. arXiv, 9 November 2022. http://arxiv.org/abs/2211.05764.
7. 7
The W3C PROV model (2013)
[Figure: the core PROV pattern — a processing activity uses Input 1 … Input n (usage) and generates Output 1 … Output m (generation); each output is related to the inputs it was derived from (derivation)]
8. 8
<event
name>
Basic data derivation pattern: transformation
Consider an abstract data transformation operator: 𝐷 → 𝐷ʹ
[Figure: PROV graph — A used D; D′ wasGeneratedBy A; D′ wasDerivedFrom D]
We can record the provenance of 𝐷ʹ as a derivation from 𝐷
- mediated by some abstract activity 𝐴 that represents the cleaning or pruning operations
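As a concrete illustration (not part of the original slides), this dataset-level pattern can be asserted programmatically. The sketch below uses the `prov` Python package; the `ex:` namespace and identifiers `ex:D`, `ex:Dprime`, `ex:A` are placeholders chosen here.

```python
from prov.model import ProvDocument

# Minimal sketch of the derivation pattern D -> D' mediated by activity A,
# expressed with the W3C PROV data model as implemented by the `prov` package.
doc = ProvDocument()
doc.add_namespace('ex', 'http://example.org/')   # hypothetical namespace

d_in = doc.entity('ex:D')                        # input dataset
d_out = doc.entity('ex:Dprime')                  # transformed dataset
a = doc.activity('ex:A')                         # cleaning / pruning activity

doc.used(a, d_in)                                # A used D
doc.wasGeneratedBy(d_out, a)                     # D' wasGeneratedBy A
doc.wasDerivedFrom(d_out, d_in, activity=a)      # D' wasDerivedFrom D (via A)

print(doc.get_provn())                           # PROV-N serialization
```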
9. 9
Item-level data transformation (1-1)
This high-level provenance is not very informative if we want to account for how 𝐴 operates on each data item
In the simple examples, 𝐴 performs 1-1, item-wise transformations:
𝑥 ∈ 𝐷 → 𝑥ʹ ∈ 𝐷ʹ
where either 𝑥ʹ = 𝑥 or 𝑥ʹ is a clean version of 𝑥
[Figure: item-level PROV graph — A used x_1 … x_n; x′_1 … x′_n wasGeneratedBy A; each x′_i wasDerivedFrom x_i]
PROV-N representation
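A sketch of how these item-level assertions could be emitted and serialized to PROV-N; the loop, the `ex:` namespace, and the identifier scheme are illustrative assumptions, not the slides' own tooling.

```python
from prov.model import ProvDocument

def item_level_provenance(n: int) -> ProvDocument:
    """Assert 1-1 derivations x_i -> x'_i for n items, mediated by activity A."""
    doc = ProvDocument()
    doc.add_namespace('ex', 'http://example.org/')
    a = doc.activity('ex:A')
    for i in range(1, n + 1):
        x_in = doc.entity(f'ex:x{i}')
        x_out = doc.entity(f'ex:x{i}_prime')
        doc.used(a, x_in)
        doc.wasGeneratedBy(x_out, a)
        doc.wasDerivedFrom(x_out, x_in, activity=a)
    return doc

print(item_level_provenance(3).get_provn())   # PROV-N representation, as on the slide
```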
10. 10
Item-level data transformation (1-many)
This notation can also be used to capture M-N transformations
- to represent the effects of data imputation using statistics that affect multiple data points simultaneously
- This can be achieved by adding relationship instances as needed
Example: { wasDerivedFrom(x′_i, y) }_{i=1..n} denotes a single value y ∈ D used to produce multiple values x′_1, …, x′_n
12. 12
Item-level data selection
Here we only need to represent whether each input datapoint survives the selection operator
PROV can be used to assert that operator op has removed datapoint 𝑥 ∈ 𝐷 from its output 𝐷ʹ
There is actually no need to represent the provenance of the surviving datapoints
Suppose op removed m items from D. Using PROV, this is asserted with one invalidation statement per removed item: { wasInvalidatedBy(x_j, op) }_{j=1..m}
13. 13
Data derivation through pipelines
When operators are composed into pipelines, provenance is a composition of the corresponding provenance patterns
Consider a sequential pipeline consisting of abstract data processing operators op1 ... op𝑛 and a training operator Tr
Each op𝑖 takes an input dataset 𝐷 and produces an output 𝐷ʹ: 𝐷ʹ = op𝑖(𝐷)
Similarly, training takes some 𝐷 and produces a model 𝑀: 𝑀 = Tr(𝐷)
Starting from the initial “raw” dataset D_0, and denoting by D_i the intermediate datasets, this pipeline can be written as:
{ D_i = op_i(D_{i-1}) }_{i=1..n},   M = Tr(D_n)
Corresponding provenance:
[Figure: provenance chain — OP_1 used D_0; D_1 wasGeneratedBy OP_1; …; OP_n used D_{n-1}; D_n wasGeneratedBy OP_n; Tr used D_n; M wasGeneratedBy Tr]
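One way to see how the per-operator pattern composes along a pipeline is to generate the chain in a loop. This is an illustrative sketch (operator names op1…opn and the Tr activity are placeholders), not the capture mechanism discussed later in the deck.

```python
from prov.model import ProvDocument

def pipeline_provenance(num_ops: int) -> ProvDocument:
    """Compose the basic derivation pattern over D0 -op1-> D1 ... -opn-> Dn -Tr-> M."""
    doc = ProvDocument()
    doc.add_namespace('ex', 'http://example.org/')
    prev = doc.entity('ex:D0')
    for i in range(1, num_ops + 1):
        op = doc.activity(f'ex:op{i}')
        cur = doc.entity(f'ex:D{i}')
        doc.used(op, prev)
        doc.wasGeneratedBy(cur, op)
        doc.wasDerivedFrom(cur, prev, activity=op)
        prev = cur
    tr = doc.activity('ex:Tr')
    model = doc.entity('ex:M')
    doc.used(tr, prev)
    doc.wasGeneratedBy(model, tr)
    doc.wasDerivedFrom(model, prev, activity=tr)
    return doc
```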
14. 14
Extension to DAG topologies is straightforward
These assertions extend naturally to pipelines with multiple inputs and outputs → Directed Acyclic Graphs
Example: inputs D_0^a, D_0^b, D_0^c are processed independently and eventually merged into D_n:
[Figure: DAG-shaped provenance — OP_1 used D_0^a to generate D_1^a; OP_2 used D_0^b to generate D_1^b; OP_3 generates D_0^bc from D_0^c; OP_4 merges the branches into D_3^abc, with the usual used / wasGeneratedBy relations on each edge]
15. 15
Data Cleaning simulation pattern
[Figure: cleaning challenge setup. Competitor side: a cleaning priority strategy produces D′ from the noisy set Dn; model training on D′ yields M′, which is evaluated. Evaluator side: labels of Dtr are corrupted to produce Dn; fixed training code trained on the clean Dtr provides the target eval score.]
A noisy version Dn is generated from Dtr (e.g. by label flipping).
Target performance P is recorded by training on Dtr and testing on Dtest.
Strategies are scored on the number of cleaning actions required to achieve 95% of the target performance.
- Corrupt some of the labels in Dtr → Dn
- Let Pn be the model performance when using Dn for training; Pn will be less than P
- The strategy must suggest a ranking of examples in Dn such that, by “cleaning” them in order, performance increases towards P
16. 16
What can be learnt from this exercise?
[Figure: two-level loop — a cleaning strategy applied to Dn produces D′; model training on D′ yields M′, which is evaluated; the best model Mbest is selected within an outer MLOps loop]
The challenge is effectively a simulation of a two-level iterative process.
Challenge winners will have developed and demonstrated new strategies for training set debugging.
However: a strategy may be optimized for dataset Dn, task T, and the pre-selected model.
17. 17
Provenance and versioning
[Figure: cleaning strategy CS_i produces D_i from Dn; model training on D_i yields M′, which is evaluated; Mbest is selected within an outer MLOps loop]
We would like to:
1. Document that D_i was derived from Dn using CS_i, as part of a longer pipeline
2. Be able to identify what effect CS_i had on Dn:
   - which data labels were cleaned
   - why they were cleaned
3. Make sure CS_i can be reused safely:
   - specify assumptions and pre-requisites
   - provide examples of past usages
18. 18
Provenance layer I: whole dataset
Assumptions:
- Dn, Di atomic units of data
- CS atomic unit of processing
Reproducibility: “Outer layer” questions
- Where does Di come from?
- Which version of Di was used to train Mbest?
Derivation:
Di was derived from Dn using CSi
Mbest was trained on Di
Attribution:
CSi was created by <creator C>
[Figure: dataset-level PROV graph — CS used Dn; Di wasGeneratedBy CS; Di wasDerivedFrom Dn; CS wasAssociatedWith creator C]
20. 20
Provenance layer II: data-granular provenance
Assumptions:
- Dn = {x^n_j}, Di = {x^i_j}
- CS atomic unit of processing
Explainability: data-level questions
- which x^n_j were cleaned?
- “how dirty was Dn?” — in aggregate: how many labels were cleaned to achieve a target performance?
Derivations:
for each x^i_j that has been cleaned by CS_i:  x^i_j was derived from x^n_j
21. 21
Provenance layer II specification
Same assumptions and data-level questions as the previous slide.
[Figure: item-level PROV graph — for each cleaned item, CS_i used x^n_j; x^i_j wasGeneratedBy CS_i; x^i_j wasDerivedFrom x^n_j; CS_i wasAssociatedWith creator C]
22. 22
Representing entanglements
The term “entanglement” denotes an iterative interleaving of data preparation and modelling.
During a generic iteration i:
- the Assess processor takes a partially cleaned training set D_i along with the current model version M_i trained on D_i, and determines the next batch of items in D_i to be cleaned
- Clean is a separate processor, yielding a new version D_{i+1}
- this is used to train M_{i+1}, and so on
[Figure: entanglement loop — Train(D_0) → M_1; Assess(D_i, M_i) → cleaning targets; Clean → D_{i+1}; Train(D_{i+1}) → M_{i+1}; repeat]
23. 23
Provenance of entanglements
- PROV can be used to express a provenance graph for this process
- the graph must capture an unfolding of the process execution over the set of its iterations
Starting from version D_{i+1} of the data and moving backwards in time:
- D_{i+1} was generated by instance i+1 of the Clean processor
- this took as input the batch of data items identified by Assess_{i+1} as targets for cleaning
- this in turn required M_i, which was generated by the i-th training iteration from D_i
- PROV allows annotations to be added to each entity, activity, and relationship
- these annotations may be drawn from:
  - the standard vocabulary (e.g. role, to qualify the function of a processor in the pipeline)
  - custom vocabularies, for instance to associate performance metrics with each version of the model
[Figure: unfolded provenance graph — activities Clean_{i-1}, Train_{i-1}, Assess_i, Clean_i, Train_i, Assess_{i+1}, Clean_{i+1} connected through entities D_{i-1}, M_{i-1}, cleaning targets, D_i, M_i, D_{i+1}]
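A minimal sketch of how such an unfolding could be asserted per iteration, again using the `prov` package; the identifier scheme and the `ex:accuracy` annotation are assumptions made here for illustration.

```python
from prov.model import ProvDocument

def record_iteration(doc: ProvDocument, i: int, accuracy: float) -> None:
    """Unfold one entanglement iteration: Assess_i -> targets_i -> Clean_i -> D_i -> Train_i -> M_i."""
    d_prev = doc.entity(f'ex:D{i-1}')
    m_prev = doc.entity(f'ex:M{i-1}')
    assess = doc.activity(f'ex:Assess{i}')
    doc.used(assess, d_prev)
    doc.used(assess, m_prev)
    targets = doc.entity(f'ex:targets{i}')
    doc.wasGeneratedBy(targets, assess)

    clean = doc.activity(f'ex:Clean{i}')
    doc.used(clean, targets)
    d_cur = doc.entity(f'ex:D{i}')
    doc.wasGeneratedBy(d_cur, clean)
    doc.wasDerivedFrom(d_cur, d_prev, activity=clean)

    train = doc.activity(f'ex:Train{i}')
    doc.used(train, d_cur)
    # custom annotation: performance metric attached to this model version
    m_cur = doc.entity(f'ex:M{i}', {'ex:accuracy': accuracy})
    doc.wasGeneratedBy(m_cur, train)

doc = ProvDocument()
doc.add_namespace('ex', 'http://example.org/')
for i, acc in enumerate([0.81, 0.84, 0.86], start=1):
    record_iteration(doc, i, acc)
```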
24. 24
Use case 2: training set optimisation
Motivation: training efficiency
→ model performance (test loss) correlates with training data size D according to a power law [11]
However, “Since scalings with N (model size), D (training tokens), Cmin (compute budget) are power-laws,
there are diminishing returns with increasing scale.” [11]
[11] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language
models. arXiv preprint arXiv:2001.08361, 2020.
This motivates trying to optimize D:
1. Redundancy in D leads to wasted training time
2. Not all training examples are equally important for training:
   → which ones should be kept / removed?
25. 25
Training set optimization Task 1: reducing redundancy
[12] Abbas, Amro, Kushal Tirumala, Dániel Simig, Surya Ganguli, and Ari S. Morcos. ‘SemDeDup: Data-Efficient Learning at Web-Scale through
Semantic Deduplication’. arXiv, 22 March 2023. http://arxiv.org/abs/2303.09540.
Approach [12]:
1. Map the training set D to an embedded space – using pre-trained foundation models
2. Cluster all data points in embedded space using k-means
3. Using cosine similarity, identify similar points within each cluster. Threshold and select
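A rough sketch of this kind of semantic deduplication (not the authors' SemDeDup code): it assumes embeddings are already available as a NumPy array, uses scikit-learn's KMeans, and keeps one representative from every group of near-duplicates within a cluster; `n_clusters` and `threshold` are illustrative parameters.

```python
import numpy as np
from sklearn.cluster import KMeans

def semantic_dedup(embeddings: np.ndarray, n_clusters: int = 10, threshold: float = 0.95):
    """Return indices of retained examples after within-cluster near-duplicate removal."""
    # 1. normalise embeddings so that dot products are cosine similarities
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    # 2. cluster all points in the embedded space
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(emb)
    keep = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        retained = []  # representatives kept so far within this cluster
        for i in idx:
            sims = emb[i] @ emb[retained].T if retained else np.array([])
            # 3. keep the point only if it is not too similar to an already-kept one
            if sims.size == 0 or sims.max() < threshold:
                retained.append(i)
        keep.extend(retained)
    return np.array(sorted(keep))
```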
26. 26
Training set optimization Task 2: pruning easy/hard examples
Main findings from [13]:
1. Not all training examples are created equal (hard vs easy)
2. The best pruning strategy depends on the amount of initial data:
   - small training set → keep the easy examples
   - large training set → keep the hard examples
[13] Sorscher, Ben, et al. Beyond neural scaling laws: beating power law scaling via data pruning, Advances in Neural Information Processing
Systems 35 (2022): 19523-19536.
[Figure reproduced from [13]]
A really simple pruning method – very similar to Task 1:
"To compute a self-supervised pruning metric for ImageNet, we perform k-means clustering
in the embedding space of an ImageNet pre-trained self-supervised model and define the
difficulty of each data point by the Euclidean distance to its nearest cluster centroid, or
prototype"
Caveat: only tested on ImageNet!
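A sketch of the self-supervised pruning metric described in the quote, assuming suitable embeddings are already computed; the cluster count, the kept fraction, and the keep-hard/keep-easy switch are illustrative parameters, not the paper's settings.

```python
import numpy as np
from sklearn.cluster import KMeans

def prune_by_prototype_distance(embeddings: np.ndarray, keep_fraction: float,
                                n_clusters: int = 100, keep_hard: bool = True):
    """Score difficulty as distance to the nearest k-means prototype, then keep a fraction."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(embeddings)
    # distance of each point to its own cluster centroid = difficulty score
    difficulty = np.linalg.norm(embeddings - km.cluster_centers_[km.labels_], axis=1)
    order = np.argsort(difficulty)                  # easy -> hard
    k = int(len(embeddings) * keep_fraction)
    kept = order[-k:] if keep_hard else order[:k]   # large sets: keep hard; small sets: keep easy
    return np.sort(kept)
```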
27. 27
Provenance for training set optimization
This is a classic filter pipeline – only a little more sophisticated:
TSfull → Embed → Cluster → Select → TSopt
Layers I and II are very similar to Use Case 1.
Reproducibility:
- Where does TSopt come from? → black / gray box options
[Figure, black box: Filter used TSfull; TSopt wasGeneratedBy Filter; TSopt wasDerivedFrom TSfull]
[Figure, gray box: the same derivation unfolded over Embed, Cluster, Select, with intermediate entities TSemb and TSclus]
28. 28
Provenance for training set optimization / Layer II
Assumptions:
- TSfull = {t_i}, TSopt ⊆ TSfull
- Filter is an atomic unit of processing
Explainability: data-level questions
- which t_i were filtered out?
- “how redundant was TSfull?”
Derivations:
for each t_i that has been removed by Filter:  t_i was invalidated by Filter
[Figure: Filter used TSfull; each removed t_i wasInvalidatedBy Filter]
29. 29
How can we generate these provenance graphs?
Key idea for Layer II (data-granular): an interpreter-level observer
- requires an observer at the boundaries of CS, i.e. able to tell which x.label values have changed
- the observer has access to individual dataframe elements
- but it is unaware of the data transformation semantics
[14] A. Chapman, P. Missier, G. Simonelli, and R. Torlone. 2020. Capturing and querying fine-grained provenance of preprocessing
pipelines in data science. Proc. VLDB Endow. 14, 4 (December 2020), 507–520. https://doi.org/10.14778/3436905.3436911
[15] A. Chapman, L. Lauro, P. Missier, and R. Torlone. 2022. DPDS: assisting data science with data provenance. Proc. VLDB Endow. 15, 12
(2022), 3614–3617. https://doi.org/10.14778/3554821.3554857
Adriane Chapman, Luca Lauro, Paolo Missier, and Riccardo Torlone. 2024. Supporting Better Insights of Data Science Pipelines with Fine-
grained Provenance. ACM Trans. Database Syst. Just Accepted (February 2024). https://doi.org/10.1145/3644385
A starting point:
Data Provenance for Data Science (DPDS)
30. 30
Capturing provenance: Layer I
[Figure: pipeline — CS_i produces D_i from Dn; model training on D_i produces Mbest within an MLOps loop]
Typical implementation:
- Pandas / Spark Python pipeline, dataframe datasets
- CS can be a method call or a code block
Layer I (coarse): process-level observer
1 - method call:  Di = CS(Dn)
2 - code block:   Dn → [ “Begin CS” … “End CS” ] → Di
[Figure: in both cases the recorded provenance is the same — CS used Dn; Di wasGeneratedBy CS; Di wasDerivedFrom Dn]
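A minimal sketch of such a process-level observer for the method-call case: a decorator (illustrative, not DPDS) that wraps CS, records the dataset-level PROV pattern with the `prov` package, and leaves the wrapped code untouched; identifier naming is deliberately simplistic.

```python
import functools
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace('ex', 'http://example.org/')

def observe(name: str):
    """Decorator: record 'output wasDerivedFrom input via activity <name>' around a call."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(df_in, *args, **kwargs):
            df_out = fn(df_in, *args, **kwargs)
            act = doc.activity(f'ex:{name}')
            e_in = doc.entity(f'ex:{name}_input')     # a real observer would version these
            e_out = doc.entity(f'ex:{name}_output')
            doc.used(act, e_in)
            doc.wasGeneratedBy(e_out, act)
            doc.wasDerivedFrom(e_out, e_in, activity=act)
            return df_out
        return inner
    return wrap

@observe('CS')
def CS(Dn):
    # placeholder cleaning step: the observer is agnostic to what CS actually does
    return Dn.copy()
```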
32. 32
Aims
Capture, store and query element-level provenance
- Derivation of each element of each intermediate dataframe (when possible)
- Efficiently, at scale
[Figure: example — df_A (df_-1) and df_B (df_0) are joined into df_1, followed by a fillna step]
34. 34
A generic dataframe observer for Pandas
Approach:
- add an observer to monitor dataframe changes
- mostly transparent to application
- some control surfaced
35. 35
Approach to design (II)
- Grounded in well-known dataframe transformation operators
- Open: accommodates any transformation within three broad classes
39. 39
Data fusion: join and append
D_L ⋈_C D_R   (join on condition C)
D_L ⊎ D_R     (append)
Example: D_L ⋈^inner_{D_L.CId = D_R.CId} D_R
40. 40
Conceptual provenance capture model: templates
Example template instance: α^→_{f1(Age): ageRange}(D) — the operator that derives a new column ageRange by applying f1 to column Age
A different provenance template pt𝜏 is associated with each type 𝜏 of operator
41. 41
Capturing provenance: bindings
At runtime, when operator o of type 𝜏 is executed, the appropriate template pt𝜏 for 𝜏 is selected
Data items from the inputs and outputs of the operator are used to bind the variables in the template
op: {old values: F, I, V} → {new values: F′, J, V′}
+ binding rules:
For i = 1 … n:
  used entities:      [ ⟨F = X_m, I = i, V = D_{i,X_m}⟩ | X_m ∈ X ]
  generated entities: [ ⟨F′ = Y_h, J = i, V = f(D_{i,X})⟩ | Y_h ∈ Y ]
43. 43
Implementation
We use templates in combination with a dataframe diff.
For each input/output pair Din, Dout of dataframes:
1. Compare both the shapes and the values of Din, Dout (*)
2. Use the diff to:
   • select the appropriate template
   • bind the template variables using the relevant values in the two dataframes
   • generate an instantiated provlet
(*) extends to joins and append
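An illustrative sketch of the diff step for the simple (non-join) case: it compares shapes and per-column values of two pandas dataframes and maps the outcome to a template name. The classification labels are simplified stand-ins for the templates summarised on the following slides, not the exact DPDS logic.

```python
import pandas as pd

def dataframe_diff(d_in: pd.DataFrame, d_out: pd.DataFrame) -> dict:
    """Summarise shape and value changes between an input and an output dataframe."""
    diff = {
        'rows_added': len(d_out) > len(d_in),
        'rows_removed': len(d_out) < len(d_in),
        'cols_added': list(set(d_out.columns) - set(d_in.columns)),
        'cols_removed': list(set(d_in.columns) - set(d_out.columns)),
        'changed_cols': [], 'imputed_cols': [],
    }
    shared = [c for c in d_in.columns if c in d_out.columns]
    if len(d_in) == len(d_out):                      # value comparison only when rows align
        for c in shared:
            if d_out[c].isna().sum() < d_in[c].isna().sum():
                diff['imputed_cols'].append(c)       # nulls reduced -> imputation
            elif not d_in[c].equals(d_out[c]):
                diff['changed_cols'].append(c)       # values changed -> 1-1 derivation
    return diff

def select_template(diff: dict) -> str:
    """Pick a provenance template name from the diff (simplified decision logic)."""
    if diff['rows_added']:
        return 'horizontal-augmentation'
    if diff['rows_removed']:
        return 'reduction-by-selection'
    if diff['cols_removed']:
        return 'reduction-by-projection'
    if diff['imputed_cols']:
        return 'data-transformation (imputation)'
    if diff['changed_cols'] or diff['cols_added']:
        return 'data-transformation'
    return 'identity'
```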
44. 44
Running Example
[Figure: example pipeline — Da and Db are left-joined on (K1, K2) into D1; imputing all missing values gives D2; a left join with Dc on (K1, K2) gives D3; imputing E, F gives D4; adding derived columns ‘E4’, ‘Ex’, ‘E1’ gives D5; removing ‘E’ gives D6]

D1 = Da ⋈^left_{K1,K2} Db
D2 = τ_{f1(*)}(D1)
D3 = D2 ⋈^left_{K1,K2} Dc
D4 = τ_{f2(E,F)}(D3)
D5 = α^→_{h(E):{E4,Ex,E1}}(D4)
D6 = π_{Ax,B,Ay,D,C,F,E4,Ex,E1}(D5)
45. 45
Summary: Shape and value changes
Shape changes (per dataframe) — diagnostic questions: rows added? rows removed? columns added? columns removed?
Associated templates: horizontal augmentation; reduction by selection; reduction by projection; data transformation; data transformation (composite)

Value changes (per column) — diagnostic questions: nulls reduced? values changed?
Associated templates: data transformation (imputation); data transformation (1-1 derivations)
46. 46
Running Example
Dataframes | Diff | Template
D1 ← {Da, Db} | explicit join | join provenance pattern
D2 ← D1 | value change, reduced nulls → imputation | data transformation
D3 ← {D2, Dc} | explicit join | join provenance pattern
D4 ← D3 | value change, reduced nulls → imputation | data transformation
D5 ← D4 | shape change, column(s) added | <wait!>
D6 ← D5 | shape change, column(s) removed | data transformation, composite
[Figure: the example pipeline repeated from the previous slide]
52. 52
Representing provenance: Layer III
Need a language to express solution-specific explanations:
[Figure: item-level PROV graph as before (CS used x^n_j; x^i_j wasGeneratedBy CS; x^i_j wasDerivedFrom x^n_j; CS wasAssociatedWith C), now annotated with a “why” element]
- “Why was t_i selected / removed?”
  1. t_i belongs to cluster C_h,
  2. there exists t_j in C_h such that d(t_i, t_j) < δ
  → t_i and t_j are redundant
  → t_j selected, t_i removed
- “Why was x^n_j cleaned?”
[Figure: filter PROV graph as before (Filter used TSfull; t_i wasInvalidatedBy Filter), now annotated with a “why” element]
53. 53
Capturing provenance: Layer III
Layer III (data- and process-granular):
Requires an explanation generator as part of the transformation logic.
Approach: the operator sends “explanations” to a provenance server using an API at runtime,
at a chosen granularity: dataset → data item
[Figure: as before, the item-level PROV graph carries a “why” annotation on each derivation]
[Figure: CS transforms D into D′ and streams records {x_i, x′_i, expl_i} to a provenance store (Prov-DB) at runtime]
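A sketch of what such runtime explanation records could look like on the provenance side; the `ex:why` attribute and the helper below are illustrative assumptions, not an existing provenance-server API.

```python
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace('ex', 'http://example.org/')
cs = doc.activity('ex:CS')

def record_explained_change(item_id: str, explanation: str) -> None:
    """Record x -> x' via CS, attaching a free-text 'why' annotation to the new version."""
    x_old = doc.entity(f'ex:{item_id}')
    x_new = doc.entity(f'ex:{item_id}_prime', {'ex:why': explanation})
    doc.used(cs, x_old)
    doc.wasGeneratedBy(x_new, cs)
    doc.wasDerivedFrom(x_new, x_old, activity=cs)

# example: the operator emits one explanation per transformed item
record_explained_change('x42', 'label flipped back: nearest-prototype vote disagreed with y')
```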
54. 56
Layer III provenance: Preliminary ideas
We frame the general provenance granularity problem in terms of the two orthogonal
dimensions:
- data derivation, from dataset to item-level
- detail of processor behaviour, from class to internal logic
[Granularity matrix — data detail (dataset → item) vs processor detail (class level → logic level):
  class level / dataset:  transformation D → D′;  selection D → D′ ⊆ D
  class level / item:     transformation { x → x′ }, x ∈ D, x′ ∈ D′;  selection { x ∈ D | σ(x) = True }
  logic level / dataset:  processor logic
  logic level / item:     Why x? (transformation, selection);  Why x′? (transformation, augmentation)]
55. 57
Processor logic at dataset level
[Figure: the granularity matrix from the previous slide, focusing on the (dataset-level data, logic-level processor) quadrant]
[Figure: unfolded entanglement provenance with role annotations — activities A_i (assessment), C_i (cleaning), T_i (training); entities D_{i-1}, M_{i-1}, cleaning targets CT_i, D_i, M_i (model); edges labelled used / wgby]
56. 58
Processor logic at item level
[Figure: the granularity matrix again, focusing on the (item-level data, logic-level processor) quadrant]
Item-level “why” questions:
- “why did the assessor A choose x for cleaning?”
- “how did the cleaner C choose the replacement value?”
- “why did x ∈ D get selected for removal from the training set?”
57. 59
A possible vocabulary / library: DC-Check
[4] Seedat, Nabeel, Fergus Imrie, and Mihaela van der Schaar. ‘DC-Check: A Data-Centric AI Checklist to Guide the Development of
Reliable Machine Learning Systems’. arXiv, 9 November 2022. http://arxiv.org/abs/2211.05764.
58. 62
Summary of goals and action plan
[Figure: action plan — observe / record data and training operations from problem instances into a Prov-DB; build a curated data toolkit to enable reuse; reproduce / explain pipeline instances]
Goals: to support
• reusability and emerging best practices for complex data intervention + usage patterns
• reproducibility and explainability of pipeline instances
How:
- enable observation / capture of data processing
- build a curated catalogue of interventions + usage patterns
- associate provenance with data + model versions
Challenges:
- observability: instrumenting common runtimes for transparent capture
- granularity: pick a layer (I-II-III); precision vs scalability → how much do we need?
- a “why?” vocabulary and language for expressing explanations
59. 63
Summary of references
[1] Seedat, Nabeel, Fergus Imrie, and Mihaela van der Schaar. ‘DC-Check: A Data-Centric AI Checklist to Guide the Development of Reliable
Machine Learning Systems’. arXiv, 9 November 2022. http://arxiv.org/abs/2211.05764.
[2] Mohammad Hossein Jarrahi, Ali Memariani, and Shion Guha. 2023. The Principles of Data-Centric AI. Commun. ACM 66, 8 (August 2023), 84–92.
https://doi.org/10.1145/3571724
[3] Zha, Daochen, Zaid Pervaiz Bhat, Kwei-Herng Lai, Fan Yang, and Xia Hu. ‘Data-Centric AI: Perspectives and Challenges’. arXiv, 2 April 2023.
http://arxiv.org/abs/2301.04819.
[4] Zha, Daochen, Zaid Pervaiz Bhat, Kwei-Herng Lai, Fan Yang, Zhimeng Jiang, Shaochen Zhong, and Xia Hu. ‘Data-Centric Artificial Intelligence: A
Survey’. arXiv, 11 June 2023. https://doi.org/10.48550/arXiv.2303.10158.
[5] Singh, Prerna. ‘Systematic Review of Data-Centric Approaches in Artificial Intelligence and Machine Learning’. Data Science and Management 6,
no. 3 (1 September 2023): 144–57. https://doi.org/10.1016/j.dsm.2023.06.001.
[6] Neutatz, Felix, et al. "From Cleaning before ML to Cleaning for ML." IEEE Data Eng. Bull. 44.1 (2021): 24-41.
[7] Mazumder, Mark, Colby Banbury, Xiaozhe Yao, Bojan Karlaš, William Gaviria Rojas, Sudnya Diamos, Greg Diamos, et al. ‘DataPerf: Benchmarks for
Data-Centric AI Development’. arXiv, 13 October 2023. https://doi.org/10.48550/arXiv.2207.10062.
[8] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language
models. arXiv preprint arXiv:2001.08361, 2020.
[9] Abbas, Amro, Kushal Tirumala, Dániel Simig, Surya Ganguli, and Ari S. Morcos. ‘SemDeDup: Data-Efficient Learning at Web-Scale through Semantic
Deduplication’. arXiv, 22 March 2023. http://arxiv.org/abs/2303.09540.
[10] Sorscher, Ben, et al. ‘Beyond Neural Scaling Laws: Beating Power Law Scaling via Data Pruning’. Advances in Neural Information Processing Systems 35 (2022): 19523–19536.
[11] A. Chapman, P. Missier, G. Simonelli, and R. Torlone. 2020. Capturing and querying fine-grained provenance of preprocessing pipelines in data
science. Proc. VLDB Endow. 14, 4 (December 2020), 507–520. https://doi.org/10.14778/3436905.3436911
[12] A. Chapman, L. Lauro, P. Missier, and R. Torlone. 2022. DPDS: assisting data science with data provenance. Proc. VLDB Endow. 15, 12 (2022),
3614–3617. https://doi.org/10.14778/3554821.3554857