Slide 1
Capturing and querying fine-grained provenance of preprocessing pipelines in data science (DP4DS)
Adriane Chapman (1), Paolo Missier (2), Giulia Simonelli (3), Riccardo Torlone (3)
(1) University of Southampton, UK
(2) Newcastle University, UK
(3) Università Roma Tre, Italy
VLDB 2021
Slide 2
[Diagram: end-to-end ML pipeline — data sources → acquisition/wrangling → training/test split → model selection, learning, validation, testing → model usage (predictions), with a provenance trace running alongside.]
Decision points during acquisition and wrangling:
- Source selection
- Sample / population shape
- Cleaning
- Integration
Decision points while preparing for learning:
- Sampling / stratification
- Feature selection
- Feature engineering
- Dimensionality reduction
- Regularisation
- Imputation
- Class rebalancing
- …
[Diagram: pipeline structure with provenance annotations — training/test split, imputation, feature selection, hyperparameters; intermediate datasets D', D'' and configurations C1, C2, C3 feeding model learning M.]
Slide 3
Provenance of what?
Base case:
- opaque program P_O
- coarse-grained dataset
Default provenance: every output depends on every input.

- Transparent program P_T
- coarse-grained datasets
Example:
if c:
    y1 ← x1
else:
    y1 ← x2
y2 ← f(x1, x2)

- Transparent program P_T
- fine-grained datasets

- Transparent pipeline P'_T (a composition of transparent programs P¹_T … Pⁿ_T)
- fine-grained datasets
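The contrast between the opaque base case and the transparent case can be sketched in a few lines of Python. This is an illustrative toy, not the paper's implementation: `coarse_provenance` and `fine_provenance` are hypothetical helpers mirroring the slide's pseudocode, with addition standing in for the unspecified `f`.

```python
# Sketch (illustrative, not the paper's code): coarse vs fine-grained provenance.

def coarse_provenance(inputs, outputs):
    # Opaque program P_O: by default, every output depends on every input.
    return {y: set(inputs) for y in outputs}

def fine_provenance(c, x1, x2):
    # Transparent program P_T: record which inputs each output actually used,
    # following the control flow of the slide's toy program.
    deps = {}
    if c:
        y1, deps["y1"] = x1, {"x1"}
    else:
        y1, deps["y1"] = x2, {"x2"}
    y2 = x1 + x2               # stands in for f(x1, x2)
    deps["y2"] = {"x1", "x2"}
    return {"y1": y1, "y2": y2}, deps
```

When `c` is true, fine-grained provenance correctly records that `y1` depends only on `x1`, whereas the coarse default over-approximates with both inputs.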
Slide 4
Typical operators used in data prep
Example: vertical augmentation
Slide 5
Operators
[Diagram: dataset D transformed by operator op into D'.]
Example: vertical augmentation → adding columns
- Values change
- Shape change
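Vertical augmentation as described above is simply the addition of derived columns. A minimal pandas sketch (toy column names, not the paper's benchmark data):

```python
import pandas as pd

# Sketch: vertical augmentation = adding a derived column.
# This changes both values (new cells are computed) and shape (a new column).
df = pd.DataFrame({"income": [30000, 52000], "expenses": [25000, 40000]})
df["savings"] = df["income"] - df["expenses"]  # new column derived from two others
```

Fine-grained provenance for this step would record each `savings` cell as derived from the `income` and `expenses` cells in the same row.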
Slide 6
Provenance patterns for each operator
Slide 7
Provenance templates
Template + binding rules = instantiated provenance fragment
[Diagram: dataset D transformed by operator op into D'.]
{old values: F, I, V} → {new values: F', J, V'}
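The template mechanism can be sketched as placeholder substitution: a template is a parametric provenance fragment, and binding rules map the operator's observed old/new values onto the template's variables. The `instantiate` function, the record format, and the `$`-prefixed variables below are all illustrative assumptions, not the paper's actual API.

```python
# Sketch of template instantiation (names and record format are hypothetical).
def instantiate(template, bindings):
    # Replace each "$var" placeholder with its bound value; leave other values intact.
    return [{k: bindings.get(v, v) for k, v in record.items()}
            for record in template]

# A tiny template: one new entity, derived from one old entity.
template = [
    {"type": "entity", "id": "$new_cell"},
    {"type": "wasDerivedFrom", "generated": "$new_cell", "used": "$old_cell"},
]

# Binding rules (from observing the operator) fill in the concrete cell ids.
fragment = instantiate(template, {"$new_cell": "D'[3,F']", "$old_cell": "D[3,F]"})
```

Instantiating the same template with different bindings yields one provenance fragment per affected cell.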
Slide 8
This applies to all operators…
9
Making your code provenance-aware
df = pd.DataFrame(…)
# Create a new provenance document
p = pr.Provenance(df, savepath)
# create provanance tracker
tracker=ProvenanceTracker.ProvenanceTracker(df, p)
# instance generation
tracker.df = tracker.df.append({'key2': 'K4'},
ignore_index=True)
# imputation
tracker.df = tracker.df.fillna('imputato')
# feature transformation of column D
tracker.df['D'] = tracker.df['D']*2
# Feature transformation of column key2
tracker.df['key2'] = tracker.df['key2']*2
Approach:
A python tracker object intercepts dataframe
operations
Operations that are channeled through the tracker
generate provenance fragments
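The interception pattern the slide describes can be sketched with a Python property setter: assignments to `.df` are observed, and a fragment is recorded by diffing the old and new dataframe states. `MiniTracker` and its log format are hypothetical simplifications for illustration, not the actual DP4DS tracker.

```python
import pandas as pd

# Minimal sketch of the tracker idea (simplified, hypothetical API).
class MiniTracker:
    def __init__(self, df):
        self._df = df
        self.log = []  # collected (very coarse) provenance fragments

    @property
    def df(self):
        return self._df

    @df.setter
    def df(self, new_df):
        # Intercept the assignment and record which columns appeared/disappeared.
        old_cols, new_cols = set(self._df.columns), set(new_df.columns)
        self.log.append({"added": new_cols - old_cols,
                         "removed": old_cols - new_cols})
        self._df = new_df

tracker = MiniTracker(pd.DataFrame({"A": [1, None]}))
tracker.df = tracker.df.fillna(0)                       # imputation: no shape change
tracker.df = tracker.df.assign(B=lambda d: d["A"] * 2)  # new column B
```

A real tracker would additionally diff cell values to emit per-cell derivations, as the template mechanism on slide 7 requires.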
Slide 10
Shape change example: one-hot encoding
Regular pandas operators are "observed" by the tracker.
The tracker object must stay constantly in sync with the state of the underlying dataframe.
Slide 11
Shape change example: one-hot encoding
After the operator runs, new entities appear with unknown derivation.
Runtime analysis drives provenance construction:
1. Detect the shape-change operator: new columns added.
2. Column 'c' removed.
Inference: space transformation
- The new columns are derived from 'c'
- Column 'c' is invalidated
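The inference step above can be sketched by comparing column sets before and after the operator: if the new columns share a removed column's name as a prefix (pandas' `get_dummies` naming convention), a space transformation is inferred. The function below is an illustrative reconstruction, not the paper's implementation.

```python
import pandas as pd

# Sketch: infer a space transformation from a shape change (illustrative).
def infer_space_transformation(before, after):
    removed = set(before.columns) - set(after.columns)
    added = set(after.columns) - set(before.columns)
    inferred = {}
    for col in removed:
        # get_dummies names new columns "<col>_<value>".
        derived = {c for c in added if c.startswith(col + "_")}
        if derived:
            # New columns derived from `col`; `col` itself is invalidated.
            inferred[col] = derived
    return inferred

df = pd.DataFrame({"c": ["x", "y"], "k": [1, 2]})
encoded = pd.get_dummies(df, columns=["c"])
mapping = infer_space_transformation(df, encoded)
```

Here `mapping` records that the dummy columns were derived from the now-invalidated column `c`.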
Slide 12
Putting it all together
Slide 13
Evaluation – benchmark datasets
Census pipeline:
- Clerical cleaning on every cell (removing blanks)
- Replace all '?' with NaN
- One-hot encode 7 categorical variables
- Map binary labels to 0/1
- Drop one column
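The census pipeline steps can be sketched in pandas on a toy frame. The column names below are illustrative stand-ins (the actual benchmark uses the full census dataset with 7 categorical variables):

```python
import pandas as pd
import numpy as np

# Sketch of the census pipeline on a toy frame (illustrative column names).
df = pd.DataFrame({"workclass": [" Private", "?"],
                   "label": ["<=50K", ">50K"]})

df = df.replace("?", np.nan)                                         # '?' -> NaN
df = df.apply(lambda s: s.str.strip() if s.dtype == object else s)  # remove blanks
df = pd.get_dummies(df, columns=["workclass"])                       # one-hot encode
df["label"] = df["label"].map({"<=50K": 0, ">50K": 1})               # labels -> 0/1
```

Each step is a different operator class (value transformation, space transformation, feature transformation), so each generates a different provenance pattern.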
Slide 14
Evaluation – benchmark queries
Slide 15
Evaluation – provenance capture and query times
Slide 16
Scalability
Synthetic benchmarking datasets created using TPC-DI (*).
- 6 operations tested in isolation (no pipeline)
(*) Meikel Poess, Tilmann Rabl, Hans-Arno Jacobsen, and Brian Caufield. 2014. TPC-DI: The First Industry Benchmark for Data Integration. PVLDB 7(13), Aug. 2014.
Slide 17
Scalability
- IG (instance generation) only affects a small number of data values
- FS (feature selection) touches every data item, but only invalidates cells
- VT (value transformation) and imputation only touch a small number of cells
- FT (feature transformation) and ST (space transformation) are more likely to touch every data item and create new entities
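The per-operator cost differences above can be made concrete with a back-of-the-envelope count of provenance records. The function below is purely illustrative (not a measured cost model from the paper): it encodes the slide's observations as rough record counts.

```python
# Illustrative sketch: rough provenance-record counts per operator class,
# mirroring the scalability observations on the slide (not a measured model).
def prov_records(op, n_rows, n_cols, touched=0):
    if op == "IG":                       # instance generation: only new rows' cells
        return touched * n_cols
    if op == "FS":                       # feature selection: every cell, but
        return n_rows * n_cols           # invalidation-only records (cheap)
    if op in ("VT", "imputation"):       # few modified cells
        return touched
    if op in ("FT", "ST"):               # every cell used AND a new entity generated
        return 2 * n_rows * n_cols
    raise ValueError(op)
```

Under this rough model, FT/ST dominate because each data item contributes both a "used" and a "generated" record, matching the slide's observation that they scale worst.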
Slide 18
Summary
A method and infrastructure for collecting and querying very fine-grained provenance from data processing pipelines.
Practical and efficient, but:
1. Can it be extended to arbitrary Python / pandas programs?
2. What is the killer app for such granular provenance?

Editor's Notes

• #6 (vertical augmentation, formally):
  \newcommand{\f}{\textbf{a}}
  \text{features } X = [\f_1 \ldots \f_k], \quad \text{new features } Y = [\f'_1 \ldots \f'_l]
  New values for each row are obtained by applying $f$ to the values in the $X$ features.