Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)

1
DP4DS: Scalable and efficient provenance collection
from Data Science pipelines
Adriane Chapman (1), Paolo Missier (2), Luca Lauro (3), (Giulia Simonelli (3)), Riccardo Torlone (3)
(1) University of Southampton, UK
(2) Newcastle University, UK
(3) Università Roma Tre, Italy
[1] Chapman, A.; Missier, P.; Simonelli, G.; and Torlone, R. Capturing and Querying Fine-grained Provenance of Preprocessing Pipelines in Data Science. PVLDB, 14(4): 507–520. January 2021.
[2] Chapman, A.; Missier, P.; Lauro, L.; and Torlone, R. DPDS: Assisting Data Science with Data Provenance. PVLDB, 15(12): 3614–3617. 2022.

2
One-slide summary
A tool to collect fine-grained provenance from data processing pipelines
- Specifically for dataframe-based Python scripts (pandas)
- Prototype-level
Demonstrated scalable provenance generation, storage, query
Work in progress:
- Ad hoc provenance compression (but no trivial provenance recorded)
- Demonstrate generality, i.e. with respect to standard relational operators
- Where is this practically useful?

3
Running example: A simple pipeline
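[Figure: pipeline. Da and Db are combined by a left join on (K1, K2) into D1; imputing all missing values gives D2; a left join with Dc on (K1, K2) gives D3; imputing E and F gives D4; one-hot encoding adds columns 'E4', 'Ex', 'E1' (D5) and then removes 'E' (D6).]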
import pandas as pd

# df_A, df_B and df_C are the input dataframes (Da, Db and Dc in the figure)
df = pd.merge(df_A, df_B, on=['key1', 'key2'], how='left')  # join
df = df.fillna('imputed')  # imputation of all missing values
df = pd.merge(df, df_C, on=['key1', 'key2'], how='left')  # join
df = df.fillna(value={'E': 'Ex', 'F': 'Fx'})  # imputation of E and F
# one-hot encoding
c = 'E'
dummies = []
dummies.append(pd.get_dummies(df[c]))
df_dummies = pd.concat(dummies, axis=1)
df = pd.concat((df, df_dummies), axis=1)
df = df.drop([c], axis=1)  # remove the encoded column

4
Aims
Capture, store and query element-level provenance
- Derivation of each element of each intermediate dataframe (when possible)
- Efficiently, at scale
[Figure: element-level provenance graph for a Join followed by fillna, over dataframes df_A (df_-1), df_B (df_0) and df_1]

5
Granularity
Base case:
- Opaque program P0
- Coarse-grained dataset
- Default provenance: every output depends on every input
Transparent program PT, coarse-grained datasets:
if c:
    y1 ← x1
else:
    y1 ← x2
y2 ← f(x1, x2)
Runtime: c == True
Transparent program PT, fine-grained datasets
Transparent pipeline P'T, ..., PnT, fine-grained datasets
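The gap between the default case and the transparent case can be sketched in a few lines. This is an illustration only: the function names and the representation of dependencies as sets of input names are assumptions, not part of DP4DS. It lists which inputs each output of the example program depends on, for the run where c is True.

```python
def coarse_dependencies(inputs, outputs):
    """Default provenance for an opaque program: every output depends on every input."""
    return {out: set(inputs) for out in outputs}


def fine_dependencies(c):
    """Dependencies of the transparent example program, given the runtime value of c."""
    deps = {}
    # y1 copies x1 when c holds, x2 otherwise
    deps["y1"] = {"x1"} if c else {"x2"}
    # y2 = f(x1, x2) always reads both inputs
    deps["y2"] = {"x1", "x2"}
    return deps


coarse = coarse_dependencies(["x1", "x2"], ["y1", "y2"])
fine = fine_dependencies(c=True)
print("coarse:", coarse)
print("fine:  ", fine)
```

With c == True the fine-grained record drops the dependency of y1 on x2, which the default provenance cannot do.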

6
Approach to design (I)
Provenance capture control surfaced at program source level:
p = pr.Provenance(df_A, '', savepath)
# create provenance tracker
tracker = ProvenanceTracker.ProvenanceTracker(df_A, p)
# …
# imputation
tracker.df = tracker.df.fillna(value={'E': 'Ex', 'F': 'Fx'})
# one-hot encoding
c = 'E'
dummies = []
dummies.append(pd.get_dummies(tracker.df[c]))
df_dummies = pd.concat(dummies, axis=1)
tracker.df = pd.concat((tracker.df, df_dummies), axis=1)
tracker.df = tracker.df.drop([c], axis=1)

7
Approach to design (II)
- Grounded in well-known dataframe transformation operators
- Open: accommodates any transformation within three broad classes

8
Data reduction
- Projection, Selection
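As a small illustration of the two reduction operators in pandas (the toy dataframe and its column names are made up for this sketch):

```python
import pandas as pd

# Toy dataframe; names and values are illustrative only
df = pd.DataFrame(
    {"name": ["a", "b", "c"], "age": [25, 40, 31], "zip": ["111", "222", "333"]}
)

# Projection: keep a subset of the columns, all rows survive
projected = df[["name", "age"]]

# Selection: keep the rows satisfying a predicate, all columns survive
selected = df[df["age"] > 30]

print(projected.shape, selected.shape)
# The surviving index labels identify the input row behind each output row
print(list(selected.index))
```

Because pandas preserves index labels and column names, each surviving element can be traced back to the input element at the same (index, column) position.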

9
Data augmentation
Vertical augmentation
group by gender
avg(age)
Horizontal augmentation

10
Data transformation
Example: data imputation. Here f replaces nulls in column Zip with the most frequent value.
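A minimal pandas sketch of this imputation, assuming a toy dataframe (the data and variable names are illustrative, not taken from the talk):

```python
import pandas as pd

# Toy data: one missing Zip value
df = pd.DataFrame({"Name": ["a", "b", "c", "d"], "Zip": ["111", None, "111", "222"]})

# f: replace nulls in Zip with the most frequent non-null value
most_frequent = df["Zip"].mode().iloc[0]
imputed = df.copy()
imputed["Zip"] = df["Zip"].fillna(most_frequent)

# Cells that were null before and are filled after: the elements f actually changed
changed = df["Zip"].isna() & imputed["Zip"].notna()
print(imputed)
print(changed.tolist())
```

The `changed` mask is the kind of per-element information a provenance record for this operator would need in order to separate imputed cells from cells that were copied unchanged.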

11
Data fusion: join and append
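Both fusion operators can be illustrated with plain pandas. The sketch below is not the DP4DS capture mechanism; it only shows that pandas itself can expose where each output row came from, using the `indicator` option of `pd.merge` and the `keys` option of `pd.concat`. The dataframes are made up.

```python
import pandas as pd

left = pd.DataFrame({"key1": [1, 2], "key2": ["x", "y"], "A": ["a1", "a2"]})
right = pd.DataFrame({"key1": [1], "key2": ["x"], "B": ["b1"]})
more = pd.DataFrame({"key1": [3], "key2": ["z"], "A": ["a3"]})

# Join: indicator=True adds a _merge column naming the input(s) behind each row
joined = pd.merge(left, right, on=["key1", "key2"], how="left", indicator=True)

# Append: keys= labels every output row with the dataframe it was taken from
appended = pd.concat([left, more], keys=["left", "more"])

print(joined)
print(appended)
```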

13
Conceptual provenance capture model: templates
A different provenance template pt𝜏 is associated with each type 𝜏 of operator

14
Capturing provenance: bindings
At runtime, when operator o of type 𝜏 is executed, the appropriate template pt𝜏 for 𝜏 is selected
Data items from the inputs and outputs of the operator are used to bind the variables in the template
[Figure: operator op maps {old values: F, I, V} to {new values: F', J, V'}; template + binding rules]
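The template-and-binding idea can be sketched as follows. Everything here is hypothetical: the template contents, the `$` variable convention, the element identifiers and the `bind` helper are assumptions made for illustration, and only the relation names (used, wasGeneratedBy, wasDerivedFrom) come from the W3C PROV vocabulary.

```python
# Hypothetical template for a value-transforming operator.
# Terms starting with "$" are variables to be bound at runtime.
TRANSFORMATION_TEMPLATE = [
    ("$new", "wasGeneratedBy", "$op"),
    ("$op", "used", "$old"),
    ("$new", "wasDerivedFrom", "$old"),
]


def bind(template, bindings):
    """Instantiate a template by replacing each variable with its bound value."""
    return [tuple(bindings.get(term, term) for term in triple) for triple in template]


# Bindings taken from one changed element of an operator's input and output
provlet = bind(
    TRANSFORMATION_TEMPLATE,
    {"$old": "D3[row=3,col=E]", "$new": "D4[row=3,col=E]", "$op": "fillna#2"},
)
for statement in provlet:
    print(statement)
```

One binding set is produced per affected data element, so the same template yields many small instantiated fragments for a single operator execution.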

15
This applies to all operators

19
Implementation
We use templates in combination with dataframe diff.
For each input/output pair Din, Dout of dataframes:
1. Compare both the shapes and values of Din, Dout (*)
2. Use the diff to:
• Select the appropriate template
• Bind the template variables using the relevant values in the two dataframes
• Generate an instantiated provlet
(*) extends to joins, append
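Step 1 can be sketched with pandas as below. This is a simplified stand-in rather than the DP4DS implementation: `dataframe_diff` is a hypothetical helper, and it assumes that index labels and column names are unique and preserved by the operator.

```python
import pandas as pd


def dataframe_diff(d_in, d_out):
    """Summarise shape and value changes between an input and an output dataframe."""
    shape = {
        "rows_added": len(d_out.index.difference(d_in.index)) > 0,
        "rows_removed": len(d_in.index.difference(d_out.index)) > 0,
        "columns_added": list(d_out.columns.difference(d_in.columns)),
        "columns_removed": list(d_in.columns.difference(d_out.columns)),
    }
    # Compare values only on the rows and columns present on both sides
    rows = d_in.index.intersection(d_out.index)
    cols = d_in.columns.intersection(d_out.columns)
    before, after = d_in.loc[rows, cols], d_out.loc[rows, cols]
    # Two nulls count as "unchanged"; any other inequality counts as a change
    both_null = before.isna() & after.isna()
    changed = ~((before == after) | both_null)
    return shape, changed


d_in = pd.DataFrame({"E": ["E1", None], "F": [None, "F2"]})
d_out = d_in.fillna(value={"E": "Ex", "F": "Fx"})
shape, changed = dataframe_diff(d_in, d_out)
print(shape)
print(changed)
```

The shape summary drives the choice of template, while the boolean `changed` mask points at the elements whose values supply the bindings.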

20
Running Example
[Figure: running example pipeline, repeated from slide 3]

21
Summary: Shape and value changes
Shape changes (decision diagram):
- Questions: rows added? rows removed? columns added? columns removed?
- Templates: horizontal augmentation; reduction by selection; reduction by projection; data transformation (composite); data transformation
Value changes for each column (decision diagram):
- Questions: nulls reduced? values changed?
- Templates: data transformation (imputation); data transformation; 1-1 derivations
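The per-column value checks could be coded along these lines. The mapping from answers to template names is one plausible reading of the slide, not a confirmed rule: nulls reduced gives imputation, otherwise any changed value gives a generic data transformation, otherwise 1-1 derivations. `classify_column` is a hypothetical helper.

```python
import pandas as pd


def classify_column(before, after):
    """Pick a template name for one column from its value changes (assumed mapping)."""
    nulls_reduced = after.isna().sum() < before.isna().sum()
    # A cell is unchanged if the values are equal or both are null
    same = (before == after) | (before.isna() & after.isna())
    if nulls_reduced:
        return "data transformation (imputation)"
    if not same.all():
        return "data transformation"
    return "1-1 derivations"


d_in = pd.DataFrame({"E": ["E1", None], "F": ["f1", "f2"], "G": [1, 2]})
d_out = d_in.assign(E=d_in["E"].fillna("Ex"), G=d_in["G"] * 10)

templates = {col: classify_column(d_in[col], d_out[col]) for col in d_in.columns}
print(templates)
```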

22
Running Example
Dataframes / Diff / Template
D1 ← {Da, Db}: Explicit join provenance pattern
D2 ← D1: value change, reduced nulls → imputation; Data transformation
D3 ← {D2, Dc}: Explicit join provenance pattern
D4 ← D3: value change, reduced nulls → imputation; Data transformation
D5 ← D4: Shape change, column(s) added; <wait!>
D6 ← D5: Shape change, column(s) removed; Data transformation, composite

23
Evaluation: Provenance capture times

24
Evaluation: Provenance query times on Neo4j

25
Scalability
Synthetic Benchmarking datasets created using TPC-DI

26
Scalability: capture and storage / TPC-DI datasets
[Panels: basic operators; join + append operators]

28
Summary and open questions
- Fine-grained provenance collection from data processing pipelines
- Specifically for dataframe-based Python scripts
- Demonstrated scalable provenance generation, storage, query
- Work in progress
- Open question: does it help explain data science findings from real pipelines?
