Slide 1
Capturing and querying fine-grained provenance of preprocessing pipelines in data science (DP4DS)
Adriane Chapman (1), Paolo Missier (2), Giulia Simonelli (3), Riccardo Torlone (3)
(1) University of Southampton, UK
(2) Newcastle University, UK
(3) Università Roma Tre, Italy
VLDB 2021
Slide 2
[Diagram: end-to-end ML pipeline — data sources → acquisition/wrangling → training/test split → model selection, learning, validation, testing → model usage (predictions), with a provenance trace running alongside.]
Decision points during acquisition and wrangling:
- Source selection
- Sample / population shape
- Cleaning
- Integration
Decision points while preparing for learning:
- Sampling / stratification
- Feature selection
- Feature engineering
- Dimensionality reduction
- Regularisation
- Imputation
- Class rebalancing
- …
[Diagram: pipeline structure with provenance annotations — training/test split, imputation, feature selection, hyperparameters; intermediate datasets D', D'' and configurations C1, C2, C3 feeding model learning M.]
Slide 3
Provenance of what?
Base case:
- opaque program P_O
- coarse-grained dataset
Default provenance: every output depends on every input.

- Transparent program P_T
- coarse-grained datasets
Example:
if c:
    y1 ← x1
else:
    y1 ← x2
y2 ← f(x1, x2)

- Transparent program P_T
- fine-grained datasets

- Transparent pipeline P'_T (a composition of transparent programs P¹_T … Pⁿ_T)
- fine-grained datasets
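The contrast between the opaque base case and the transparent case can be sketched in a few lines of Python. This is an illustrative toy, not the paper's implementation: `coarse_provenance` and `fine_provenance` are hypothetical helpers mirroring the slide's pseudocode, with addition standing in for the unspecified `f`.

```python
# Sketch (illustrative, not the paper's code): coarse vs fine-grained provenance.

def coarse_provenance(inputs, outputs):
    # Opaque program P_O: by default, every output depends on every input.
    return {y: set(inputs) for y in outputs}

def fine_provenance(c, x1, x2):
    # Transparent program P_T: record which inputs each output actually used,
    # following the control flow of the slide's toy program.
    deps = {}
    if c:
        y1, deps["y1"] = x1, {"x1"}
    else:
        y1, deps["y1"] = x2, {"x2"}
    y2 = x1 + x2               # stands in for f(x1, x2)
    deps["y2"] = {"x1", "x2"}
    return {"y1": y1, "y2": y2}, deps
```

When `c` is true, fine-grained provenance correctly records that `y1` depends only on `x1`, whereas the coarse default over-approximates with both inputs.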
Slide 4
Typical operators used in data prep
Example: vertical augmentation
Slide 5
Operators
[Diagram: dataset D transformed by operator op into D'.]
Example: vertical augmentation → adding columns
- Values change
- Shape change
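Vertical augmentation as described above is simply the addition of derived columns. A minimal pandas sketch (toy column names, not the paper's benchmark data):

```python
import pandas as pd

# Sketch: vertical augmentation = adding a derived column.
# This changes both values (new cells are computed) and shape (a new column).
df = pd.DataFrame({"income": [30000, 52000], "expenses": [25000, 40000]})
df["savings"] = df["income"] - df["expenses"]  # new column derived from two others
```

Fine-grained provenance for this step would record each `savings` cell as derived from the `income` and `expenses` cells in the same row.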
Slide 6
Provenance patterns for each operator
Slide 7
Provenance templates
Template + binding rules = instantiated provenance fragment
[Diagram: dataset D transformed by operator op into D'.]
{old values: F, I, V} → {new values: F', J, V'}
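The template mechanism can be sketched as placeholder substitution: a template is a parametric provenance fragment, and binding rules map the operator's observed old/new values onto the template's variables. The `instantiate` function, the record format, and the `$`-prefixed variables below are all illustrative assumptions, not the paper's actual API.

```python
# Sketch of template instantiation (names and record format are hypothetical).
def instantiate(template, bindings):
    # Replace each "$var" placeholder with its bound value; leave other values intact.
    return [{k: bindings.get(v, v) for k, v in record.items()}
            for record in template]

# A tiny template: one new entity, derived from one old entity.
template = [
    {"type": "entity", "id": "$new_cell"},
    {"type": "wasDerivedFrom", "generated": "$new_cell", "used": "$old_cell"},
]

# Binding rules (from observing the operator) fill in the concrete cell ids.
fragment = instantiate(template, {"$new_cell": "D'[3,F']", "$old_cell": "D[3,F]"})
```

Instantiating the same template with different bindings yields one provenance fragment per affected cell.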
Slide 8
This applies to all operators…
9
Making your code provenance-aware
df = pd.DataFrame(…)
# Create a new provenance document
p = pr.Provenance(df, savepath)
# create provanance tracker
tracker=ProvenanceTracker.ProvenanceTracker(df, p)
# instance generation
tracker.df = tracker.df.append({'key2': 'K4'},
ignore_index=True)
# imputation
tracker.df = tracker.df.fillna('imputato')
# feature transformation of column D
tracker.df['D'] = tracker.df['D']*2
# Feature transformation of column key2
tracker.df['key2'] = tracker.df['key2']*2
Approach:
A python tracker object intercepts dataframe
operations
Operations that are channeled through the tracker
generate provenance fragments
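The interception pattern the slide describes can be sketched with a Python property setter: assignments to `.df` are observed, and a fragment is recorded by diffing the old and new dataframe states. `MiniTracker` and its log format are hypothetical simplifications for illustration, not the actual DP4DS tracker.

```python
import pandas as pd

# Minimal sketch of the tracker idea (simplified, hypothetical API).
class MiniTracker:
    def __init__(self, df):
        self._df = df
        self.log = []  # collected (very coarse) provenance fragments

    @property
    def df(self):
        return self._df

    @df.setter
    def df(self, new_df):
        # Intercept the assignment and record which columns appeared/disappeared.
        old_cols, new_cols = set(self._df.columns), set(new_df.columns)
        self.log.append({"added": new_cols - old_cols,
                         "removed": old_cols - new_cols})
        self._df = new_df

tracker = MiniTracker(pd.DataFrame({"A": [1, None]}))
tracker.df = tracker.df.fillna(0)                       # imputation: no shape change
tracker.df = tracker.df.assign(B=lambda d: d["A"] * 2)  # new column B
```

A real tracker would additionally diff cell values to emit per-cell derivations, as the template mechanism on slide 7 requires.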
Slide 10
Shape change example: one-hot encoding
Regular pandas operators are "observed" by the tracker.
The tracker object must stay constantly in sync with the state of the underlying dataframe.
Slide 11
Shape change example: one-hot encoding
After the operator runs, new entities appear with unknown derivation.
Runtime analysis drives provenance construction:
1. Detect the shape-change operator: new columns added.
2. Column 'c' removed.
Inference: space transformation
- The new columns are derived from 'c'
- Column 'c' is invalidated
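The inference step above can be sketched by comparing column sets before and after the operator: if the new columns share a removed column's name as a prefix (pandas' `get_dummies` naming convention), a space transformation is inferred. The function below is an illustrative reconstruction, not the paper's implementation.

```python
import pandas as pd

# Sketch: infer a space transformation from a shape change (illustrative).
def infer_space_transformation(before, after):
    removed = set(before.columns) - set(after.columns)
    added = set(after.columns) - set(before.columns)
    inferred = {}
    for col in removed:
        # get_dummies names new columns "<col>_<value>".
        derived = {c for c in added if c.startswith(col + "_")}
        if derived:
            # New columns derived from `col`; `col` itself is invalidated.
            inferred[col] = derived
    return inferred

df = pd.DataFrame({"c": ["x", "y"], "k": [1, 2]})
encoded = pd.get_dummies(df, columns=["c"])
mapping = infer_space_transformation(df, encoded)
```

Here `mapping` records that the dummy columns were derived from the now-invalidated column `c`.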
Slide 12
Putting it all together
Slide 13
Evaluation – benchmark datasets
Census pipeline:
- Clerical cleaning on every cell (removing blanks)
- Replace all '?' with NaN
- One-hot encode 7 categorical variables
- Map binary labels to 0/1
- Drop one column
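The census pipeline steps can be sketched in pandas on a toy frame. The column names below are illustrative stand-ins (the actual benchmark uses the full census dataset with 7 categorical variables):

```python
import pandas as pd
import numpy as np

# Sketch of the census pipeline on a toy frame (illustrative column names).
df = pd.DataFrame({"workclass": [" Private", "?"],
                   "label": ["<=50K", ">50K"]})

df = df.replace("?", np.nan)                                         # '?' -> NaN
df = df.apply(lambda s: s.str.strip() if s.dtype == object else s)  # remove blanks
df = pd.get_dummies(df, columns=["workclass"])                       # one-hot encode
df["label"] = df["label"].map({"<=50K": 0, ">50K": 1})               # labels -> 0/1
```

Each step is a different operator class (value transformation, space transformation, feature transformation), so each generates a different provenance pattern.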
Slide 14
Evaluation – benchmark queries
Slide 15
Evaluation – provenance capture and query times
Slide 16
Scalability
Synthetic benchmarking datasets created using TPC-DI (*).
- 6 operations tested in isolation (no pipeline)
(*) Meikel Poess, Tilmann Rabl, Hans-Arno Jacobsen, and Brian Caufield. 2014. TPC-DI: The First Industry Benchmark for Data Integration. PVLDB 7(13), Aug. 2014.
Slide 17
Scalability
- IG (instance generation) only affects a small number of data values
- FS (feature selection) touches every data item, but only invalidates cells
- VT (value transformation) and imputation only touch a small number of cells
- FT (feature transformation) and ST (space transformation) are more likely to touch every data item and create new entities
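The per-operator cost differences above can be made concrete with a back-of-the-envelope count of provenance records. The function below is purely illustrative (not a measured cost model from the paper): it encodes the slide's observations as rough record counts.

```python
# Illustrative sketch: rough provenance-record counts per operator class,
# mirroring the scalability observations on the slide (not a measured model).
def prov_records(op, n_rows, n_cols, touched=0):
    if op == "IG":                       # instance generation: only new rows' cells
        return touched * n_cols
    if op == "FS":                       # feature selection: every cell, but
        return n_rows * n_cols           # invalidation-only records (cheap)
    if op in ("VT", "imputation"):       # few modified cells
        return touched
    if op in ("FT", "ST"):               # every cell used AND a new entity generated
        return 2 * n_rows * n_cols
    raise ValueError(op)
```

Under this rough model, FT/ST dominate because each data item contributes both a "used" and a "generated" record, matching the slide's observation that they scale worst.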
Slide 18
Summary
A method and infrastructure for collecting and querying very fine-grained provenance from data processing pipelines.
Practical and efficient, but:
1. Can it be extended to arbitrary Python / pandas programs?
2. What is the killer app for such granular provenance?

Editor's Notes

• #6 (vertical augmentation, formally):
  \newcommand{\f}{\textbf{a}}
  \text{features } X = [\f_1 \ldots \f_k], \quad \text{new features } Y = [\f'_1 \ldots \f'_l]
  New values for each row are obtained by applying $f$ to the values in the $X$ features.