Capturing and querying fine-grained provenance of preprocessing pipelines in data science (DP4DS)
1. 1
Adriane Chapman (1), Paolo Missier (2), Luca Lauro (3), Riccardo Torlone (3)
(1) University of Southampton, UK
(2) Newcastle University, UK
(3) Università Roma Tre, Italy
[1] Chapman, A.; Missier, P.; Simonelli, G.; and Torlone, R. Capturing and Querying Fine-grained Provenance of Preprocessing Pipelines in Data Science. PVLDB, 14(4): 507–520. January 2021.
[2] Chapman, A.; Missier, P.; Lauro, L.; and Torlone, R. DPDS: Assisting Data Science with Data Provenance. PVLDB, 15(12): 3614–3617. 2022.
2. 2
[Figure: data science pipeline. Labels: Data sources; Acquisition, wrangling; Preparing for learning; Training / test split; Training set; Test set; Model Selection; Model Learning; Model Validation; Model Testing; Model (M); Model Usage; Predictions]
Decision points:
- Source selection
- Sample / population shape
- Cleaning
- Integration
Decision points:
- Sampling / stratification
- Feature selection
- Feature engineering
- Dimensionality reduction
- Regularisation
- Imputation
- Class rebalancing
- …
[Figure: Pipeline structure with provenance annotations. Labels: Provenance trace; Training / test split; Imputation; Feature selection; D’, D’’, …; Training set; Hyperparameters; Model Learning; Model (M); C1, C2, C3]
3. 3
Provenance of what?
Base case:
- Opaque program Po
- Coarse-grained dataset
Default provenance:
- Every output depends on every input
Further cases:
- Transparent program PT, fine-grained datasets
- Transparent pipeline P’T (steps PnT), fine-grained datasets
- Transparent program PT, coarse-grained datasets
Example of a transparent program using a function f:
    if c:
        y1 ← x1
    else:
        y1 ← x2
    y2 ← f(x1, x2)
Runtime: c == True
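The contrast between the default provenance of an opaque program and what a transparent program allows can be sketched in Python. This is only an illustration: the function names and the dictionary-of-sets representation of dependencies are assumptions, not taken from the slides.

```python
def opaque_dependencies(inputs, outputs):
    """Default provenance for an opaque program: every output depends on every input."""
    return {out: set(inputs) for out in outputs}


def transparent_dependencies(c):
    """Dependencies of the example program, given the runtime value of c.

    if c: y1 <- x1  else: y1 <- x2
    y2 <- f(x1, x2)
    """
    deps = {"y1": {"x1"} if c else {"x2"}}
    deps["y2"] = {"x1", "x2"}  # f reads both inputs regardless of c
    return deps


coarse = opaque_dependencies(["x1", "x2"], ["y1", "y2"])
fine = transparent_dependencies(c=True)
```

With `c == True`, the transparent view records that `y1` depends on `x1` only, whereas the opaque view has to assume that it depends on both inputs.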
10. 10
Capturing provenance: Assumptions
- Common data abstraction: (pandas) dataframes
- Observability: the runtime execution of a (Python) program can be observed
- Each input and output dataframe of each operator can be inspected
12. 12
Capturing provenance: bindings
At runtime, when operator o of type 𝜏 is executed, the appropriate template pt𝜏 for 𝜏 is selected
Data items from the inputs and outputs of the operator are used to bind the variables in the template
[Figure: operator op with {old values: F, I, V} and {new values: F’, J, V’}; template + binding rules]
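Template selection and binding might look roughly like the following sketch. The template contents, the `instantiate` helper and the data item identifiers are all hypothetical and chosen for illustration; only the PROV relation names (`used`, `wasGeneratedBy`, `wasDerivedFrom`) are standard.

```python
# Hypothetical template for one operator type. Names in braces are the
# template variables, bound at runtime from the operator's inputs and outputs.
TEMPLATES = {
    "data_transformation": [
        ("used", "{activity}", "{old}"),
        ("wasGeneratedBy", "{new}", "{activity}"),
        ("wasDerivedFrom", "{new}", "{old}"),
    ],
}


def instantiate(op_type, bindings):
    """Select the template for op_type and instantiate it once per binding."""
    template = TEMPLATES[op_type]
    statements = []
    for binding in bindings:
        for relation, subject, obj in template:
            statements.append(
                (relation, subject.format(**binding), obj.format(**binding))
            )
    return statements


# One binding per changed data item; the identifiers are made up.
bindings = [
    {"activity": "impute_1", "old": "D[0,age]@v0", "new": "D[0,age]@v1"},
]
prov = instantiate("data_transformation", bindings)
```

Each binding yields one copy of the template's statements, so the size of the resulting provenance grows with the number of data items the operator touches.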
14. 14
Join provenance pattern -- keys
[Figure: PROV pattern for join keys. Nodes: Left, Right, Output, Join activity. Relations: used (twice), wasGeneratedBy, wasDerivedFrom]
15. 15
Join provenance pattern -- non-key elements
[Figure: PROV pattern for non-key elements. Nodes: Left, Right, Output, Join activity. Relations: used, wasGeneratedBy, wasDerivedFrom]
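One possible reading of the two join patterns is sketched below with pandas: an output key derives from the matching key on both sides, while each non-key value derives from the single input that supplied it. The helper columns `left_row` and `right_row`, the element identifiers and the exact set of emitted statements are assumptions made for this example.

```python
import pandas as pd

left = pd.DataFrame({"K": [1, 2], "A": ["a1", "a2"]})
right = pd.DataFrame({"K": [2, 3], "B": ["b2", "b3"]})

# Carry each input row's position through the join (hypothetical helper columns).
left_ix = left.reset_index().rename(columns={"index": "left_row"})
right_ix = right.reset_index().rename(columns={"index": "right_row"})
joined = left_ix.merge(right_ix, on="K", how="inner").reset_index(drop=True)

statements = []
for out_row, row in joined.iterrows():
    sources = {"left": int(row["left_row"]), "right": int(row["right_row"])}

    # Key element: relates to the matching key on both sides.
    out_key = f"out[{out_row},K]"
    statements.append(("wasGeneratedBy", out_key, "join"))
    for side, src in sources.items():
        statements.append(("used", "join", f"{side}[{src},K]"))
        statements.append(("wasDerivedFrom", out_key, f"{side}[{src},K]"))

    # Non-key elements: each derives from the one input that supplied it.
    for col, side in (("A", "left"), ("B", "right")):
        out_cell = f"out[{out_row},{col}]"
        in_cell = f"{side}[{sources[side]},{col}]"
        statements.append(("wasGeneratedBy", out_cell, "join"))
        statements.append(("used", "join", in_cell))
        statements.append(("wasDerivedFrom", out_cell, in_cell))
```

Only the rows with `K == 2` match here, so the single output row traces back to row 1 of the left dataframe and row 0 of the right one.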
16. 17
Capturing provenance: a more practical approach
The approach just described requires recognizing the type of operation from the source code.
This restricts it to a closed set of operators, which needs to be maintained over time.
We take a more generic route to implementing the same idea:
1. Look at each operator’s input / output dataframes Din, Dout, regardless of the specific operator
2. Dataframe diff: compare both the shapes and the values of Din, Dout (*)
3. Use the diff to:
• Select the appropriate template
• Bind the template variables using the relevant values in the two dataframes
(*) extends to joins, append
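The shape part of the dataframe diff can be sketched as follows. The functions `shape_diff` and `select_template`, and in particular the order in which the conditions are tested, are assumptions made for illustration; the template names are the ones used elsewhere in these slides.

```python
import pandas as pd


def shape_diff(d_in, d_out):
    """Compare the row and column labels of the input and output dataframes."""
    return {
        "rows_added": len(d_out.index.difference(d_in.index)) > 0,
        "rows_removed": len(d_in.index.difference(d_out.index)) > 0,
        "cols_added": len(d_out.columns.difference(d_in.columns)) > 0,
        "cols_removed": len(d_in.columns.difference(d_out.columns)) > 0,
    }


def select_template(diff):
    """Hypothetical mapping from a shape diff to a template name."""
    if diff["cols_added"] and diff["cols_removed"]:
        return "data transformation (composite)"
    if diff["cols_added"]:
        return "horizontal augmentation"
    if diff["cols_removed"]:
        return "reduction by projection"
    if diff["rows_removed"]:
        return "reduction by selection"
    if diff["rows_added"]:
        return "append provenance"
    return "data transformation"  # same shape: only values may have changed


d_in = pd.DataFrame({"A": [1, 2], "B": ["x", "y"]})
# Hand-made one-hot encoding of B: columns B0, B1 added, B removed.
encoded = d_in.drop(columns="B").assign(B0=[1, 0], B1=[0, 1])
selected = d_in[d_in["A"] > 1]

template_encoded = select_template(shape_diff(d_in, encoded))
template_selected = select_template(shape_diff(d_in, selected))
```

Because the diff relies on row and column labels only, it does not need to know which operator produced `Dout`, which is the point of the generic route.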
17. 18
Example
Consider the following sequence: imputation → join → append → one-hot encoding
[Figure: pipeline over dataframes Da, Db, Dc and D1 to D5, with operators Impute K, Join K1=K2, append, Add ‘B0’, ‘B1’, Remove ‘B’]
18. 19
Example
Dataframes | Diff | Template
D1, Da | Value change, reduced number of null values | Data transformation
D2, {Da, Db} | | Join provenance
D3, {D1, D2} | | Append provenance
D4, D3 | Shape change, column(s) added | <wait!>
D5, D4 | Shape change, column(s) removed | Data transformation, composite
[Figure: the pipeline diagram from the previous slide, repeated]
19. 20
Summary: Shape and value changes
Shape changes (decision tree with Y / N branches):
- Questions: Rows added? Rows removed? Columns added? Columns removed?
- Templates: horizontal augmentation; reduction by selection; reduction by projection; data transformation (composite); data transformation
Value changes for each column (decision tree with Y / N branches):
- Questions: Nulls reduced? Values changed?
- Templates: data transformation (imputation); data transformation; 1-1 derivations
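The per-column value check can be sketched in the same spirit. The function `value_diff` and the way it maps the two questions to templates (fewer nulls means imputation, any other change means a plain data transformation) are assumptions; the sketch also assumes that both dataframes have the same rows and columns.

```python
import numpy as np
import pandas as pd


def value_diff(d_in, d_out):
    """Per-column value diff for two dataframes with identical labels."""
    result = {}
    for col in d_in.columns:
        before, after = d_in[col], d_out[col]
        nulls_reduced = after.isna().sum() < before.isna().sum()
        # A cell that is null on both sides counts as unchanged.
        unchanged = (before == after) | (before.isna() & after.isna())
        changed = ~unchanged
        if nulls_reduced:
            template = "data transformation (imputation)"
        elif changed.any():
            template = "data transformation"
        else:
            template = None  # nothing to record for this column
        result[col] = {
            "template": template,
            "changed_rows": d_in.index[changed.to_numpy()].tolist(),
        }
    return result


d_in = pd.DataFrame({"K": [1.0, np.nan, 3.0], "A": ["x", "y", "z"]})
d_out = d_in.fillna({"K": 2.0})
diff = value_diff(d_in, d_out)
```

The list of changed rows is what a binding step would need in order to emit one derivation per modified cell.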
20. 21
Code instrumentation
A Python tracker object intercepts dataframe operations, using an observer pattern.
The tracker collects the values required to generate the bindings.
[Code screenshots: create a provenance object and a tracker object; simple column transform; one-hot encoding; join]
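A minimal sketch of a tracker built on an observer pattern is shown below. It is not the tool's actual API: the `Tracker` class, its `subscribe` and `apply` methods and the `record` callback are hypothetical names used to illustrate how before/after dataframes could be handed to a provenance generator.

```python
import pandas as pd


class Tracker:
    """Hypothetical tracker: holds a dataframe and notifies observers after each operation."""

    def __init__(self, df):
        self.df = df
        self.observers = []

    def subscribe(self, callback):
        """Register a callback(name, d_in, d_out)."""
        self.observers.append(callback)

    def apply(self, name, operation):
        """Run operation on a copy of the dataframe, then notify all observers."""
        d_in = self.df
        d_out = operation(d_in.copy())
        for notify in self.observers:
            notify(name, d_in, d_out)
        self.df = d_out
        return d_out


log = []


def record(name, d_in, d_out):
    # A real observer would diff d_in and d_out and generate bindings here.
    log.append((name, d_in.shape, d_out.shape))


tracker = Tracker(pd.DataFrame({"A": [1, 2], "B": ["x", "y"]}))
tracker.subscribe(record)
tracker.apply("double A", lambda d: d.assign(A=d["A"] * 2))
tracker.apply("drop B", lambda d: d.drop(columns="B"))
```

Keeping the observers separate from the tracked operations means the pipeline code itself only changes where operations are routed through the tracker.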
21. 22
Evaluation – benchmark datasets
Census pipeline:
- Clerical cleaning on every cell (removing blanks)
- Replace all ‘?’ with NaN
- One-hot encoding of 7 categorical variables
- Map binary labels to 0, 1
- Drop one column
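For orientation, the Census steps listed above could be written in pandas roughly as follows. The toy dataframe, its column names and the choice of which column to drop are made up, and only two categorical variables are encoded instead of seven; this is not the benchmark code.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the census data; names and values are illustrative only.
df = pd.DataFrame({
    "workclass": [" Private", " ?", " State-gov"],
    "sex": [" Male", " Female", " Male"],
    "fnlwgt": [100, 200, 300],
    "label": [" <=50K", " >50K", " <=50K"],
})

# Clerical cleaning on every (string) cell: remove surrounding blanks.
str_cols = ["workclass", "sex", "label"]
df[str_cols] = df[str_cols].apply(lambda s: s.str.strip())

# Replace all '?' with NaN.
df = df.replace("?", np.nan)

# One-hot encode the categorical variables.
df = pd.get_dummies(df, columns=["workclass", "sex"])

# Map binary labels to 0, 1.
df["label"] = df["label"].map({"<=50K": 0, ">50K": 1})

# Drop one column.
df = df.drop(columns=["fnlwgt"])
```

Each step changes either the values or the shape of the dataframe, which is what makes a pipeline like this a useful test for diff-based provenance capture.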
27. 28
Tool demo
DPDS: Assisting Data Science with Data Provenance. Chapman, A.; Missier, P.; Lauro, L.; and Torlone, R. PVLDB, 15(12): 3614–3617. 2022. (demo paper)
28. 29
Summary
A method, infrastructure and tooling for collecting, querying, and visualizing very fine-grained provenance from data processing pipelines
Open questions:
1. What is the killer app for such granular provenance?
2. How general is the technique with respect to arbitrary pandas programs?
Editor's Notes
f1, which associates the string "young" with an age less than 25 and the string "adult" otherwise
f2, which computes the average of a set of numbers.
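The two functions in these notes can be written out directly. The Python names `f1` and `f2` simply mirror the notes, and the use of `statistics.mean` is a choice made for this sketch.

```python
from statistics import mean


def f1(age):
    """Map an age to "young" (less than 25) or "adult" (otherwise)."""
    return "young" if age < 25 else "adult"


def f2(numbers):
    """Compute the average of a set of numbers."""
    return mean(numbers)


labels = [f1(age) for age in (18, 25, 40)]
average = f2([18, 25, 40])
```

Note that `f1` maps each input value to exactly one output value, whereas `f2` aggregates a whole set of values into a single result.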