Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
DP4DS: Scalable and efficient provenance collection
from Data Science pipelines
Adriane Chapman1, Paolo Missier2, Luca Lauro3, (Giulia Simonelli3), Riccardo Torlone3
(1) University of Southampton, UK
(2) Newcastle University, UK
(3) Università Roma Tre, Italy
[1] Chapman, A.; Missier, P.; Simonelli, G.; and Torlone, R. Capturing and Querying Fine-grained Provenance of Preprocessing Pipelines in Data Science. PVLDB, 14(4): 507–520, January 2021.
[2] Chapman, A.; Missier, P.; Lauro, L.; and Torlone, R. DPDS: Assisting Data Science with Data Provenance. PVLDB, 15(12): 3614–3617, 2022.
One-slide summary
A tool to collect fine-grained provenance from data processing pipelines
- Specifically for dataframe-based Python (pandas) scripts
- Prototype-level
Demonstrated scalable provenance generation, storage, and query
Work in progress:
- Ad hoc provenance compression (trivial provenance is not recorded)
- Demonstrating generality, i.e. with respect to the standard relational operators
- Where is this practically useful?
Aims
Capture, store and query element-level provenance
- Derivation of each element of each intermediate dataframe (when possible)
- Efficiently, at scale
[Figure: pipeline fragment — dataframes df_A (df_-1) and df_B (df_0) are combined by a Join into df_1, followed by fillna]
Granularity
Base case:
- Opaque program P0, coarse-grained dataset
- Default provenance: every output depends on every input
Increasing transparency:
- Transparent program PT, fine-grained datasets
- Transparent pipeline P'T (a chain of transparent programs PnT), fine-grained datasets
- Transparent program PT, coarse-grained datasets
Fine-grained dependencies may only be determined at runtime, e.g. for

    if c:
        y1 = x1
    else:
        y1 = x2
    y2 = f(x1, x2)

At runtime, c == True, so y1 is derived from x1 only, while y2 is derived from both x1 and x2.
Approach to design (I)
Provenance capture control surfaced at program source level:
p = pr.Provenance(df_A, '', savepath)
# create provenance tracker
tracker = ProvenanceTracker.ProvenanceTracker(df_A, p)
# …
# imputation
tracker.df = tracker.df.fillna(value={'E': 'Ex', 'F': 'Fx'})
# one-hot encoding
c = 'E'
dummies = []
dummies.append(pd.get_dummies(tracker.df[c]))
df_dummies = pd.concat(dummies, axis=1)
tracker.df = pd.concat((tracker.df, df_dummies), axis=1)
tracker.df = tracker.df.drop([c], axis=1)
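The tracker pattern above can be sketched as follows: a minimal, hypothetical stand-in for DPDS's ProvenanceTracker that snapshots every assignment to `.df`, so that each intermediate dataframe can later be diffed against its predecessor.

```python
import pandas as pd

class MiniTracker:
    """Minimal sketch (not the DPDS implementation): each assignment to
    .df snapshots the dataframe so input/output pairs can be diffed."""

    def __init__(self, df):
        self._df = df
        self.history = [df.copy()]  # snapshots of every intermediate dataframe

    @property
    def df(self):
        return self._df

    @df.setter
    def df(self, new_df):
        self.history.append(new_df.copy())  # record the new intermediate state
        self._df = new_df

tracker = MiniTracker(pd.DataFrame({"E": [None, "e2"], "F": ["f1", None]}))
tracker.df = tracker.df.fillna(value={"E": "Ex", "F": "Fx"})
print(len(tracker.history))  # 2 snapshots: before and after imputation
```

Surfacing capture at the source level this way means the script's logic is unchanged: only the assignments are routed through the tracker.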
Approach to design (II)
- Grounded in well-known dataframe transformation operators
- Open: accommodates any transformation falling within three broad classes of operators
Conceptual provenance capture model: templates
A different provenance template pt𝜏 is associated with each type 𝜏 of operator
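One possible shape for such templates is sketched below. This is illustrative only: the variable names and the dictionary encoding are assumptions, whereas the actual templates are PROV documents with entities, activities, and derivation relations.

```python
# Hypothetical sketch: templates keyed by operator type, with $-prefixed
# variables that are bound at runtime from the operator's inputs/outputs.
TEMPLATES = {
    "data_transformation": {
        "activity": "$op",
        "used": ["$old_value"],        # input entity (old cell value)
        "generated": ["$new_value"],   # output entity (new cell value)
        "derivation": ("$new_value", "$old_value"),
    },
    "reduction_by_selection": {
        "activity": "$op",
        "used": ["$removed_row"],
        "invalidated": ["$removed_row"],  # rows dropped by the selection
    },
}

def instantiate(template_type, bindings):
    """Replace each $variable in the template with its runtime binding."""
    def subst(x):
        if isinstance(x, str) and x.startswith("$"):
            return bindings[x[1:]]
        if isinstance(x, (list, tuple)):
            return type(x)(subst(v) for v in x)
        return x
    return {k: subst(v) for k, v in TEMPLATES[template_type].items()}

provlet = instantiate("data_transformation",
                      {"op": "fillna",
                       "old_value": "D1[0,'E']=NaN",
                       "new_value": "D2[0,'E']='Ex'"})
print(provlet["activity"])  # fillna
```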
Capturing provenance: bindings
At runtime, when operator o of type 𝜏 is executed, the appropriate template pt𝜏 for 𝜏 is selected
Data items from the inputs and outputs of the operator are used to bind the variables in the template
[Figure: operator op transforms old values {F, I, V} into new values {F', J, V'}; combined with the binding rules, these values instantiate the template]
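For same-shape value changes, the binding step can be sketched as below (a hypothetical helper, not the DPDS code): each changed cell yields one set of bindings pairing the old and new values for the template variables.

```python
import pandas as pd

def bindings_for_value_changes(d_in, d_out):
    """Sketch: for a same-shape input/output pair, collect one binding
    per changed cell (old value -> new value)."""
    bindings = []
    for col in d_in.columns:
        for idx in d_in.index:
            old, new = d_in.at[idx, col], d_out.at[idx, col]
            changed = (pd.isna(old) and not pd.isna(new)) or \
                      (not pd.isna(old) and old != new)
            if changed:
                bindings.append({"index": idx, "column": col,
                                 "old_value": old, "new_value": new})
    return bindings

d_in = pd.DataFrame({"E": [None, "e2"], "F": ["f1", "f2"]})
d_out = d_in.fillna({"E": "Ex"})
print(bindings_for_value_changes(d_in, d_out))
# one binding: row 0, column 'E', NaN -> 'Ex'
```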
Implementation
We use templates in combination with dataframe diff.
For each input/output pair (Din, Dout) of dataframes:
1. Compare both the shapes and values of Din and Dout (*)
2. Use the diff to:
• Select the appropriate template
• Bind the template variables using the relevant values in the two dataframes
• Generate an instantiated provlet
(*) extends to joins and append
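The two steps can be sketched in pandas as follows. This is a minimal illustration under assumed names: the shape tests and template labels mirror the classification described here, but the function itself is not the DPDS implementation, and the binding step is elided.

```python
import pandas as pd

def capture(d_in, d_out):
    """Sketch of the capture loop: diff the input/output pair,
    select a template, and return a (partial) provlet."""
    # Step 1: compare shapes
    if set(d_out.columns) - set(d_in.columns):
        template = "horizontal_augmentation"   # columns added
    elif set(d_in.columns) - set(d_out.columns):
        template = "reduction_by_projection"   # columns removed
    elif len(d_out) < len(d_in):
        template = "reduction_by_selection"    # rows removed
    else:
        template = "data_transformation"       # same shape: value changes
    # Step 2: bind template variables with the relevant values (elided)
    return {"template": template,
            "in_shape": d_in.shape, "out_shape": d_out.shape}

d1 = pd.DataFrame({"K": [1, 2], "E": [None, "e"]})
d2 = d1.fillna({"E": "Ex"})
print(capture(d1, d2)["template"])  # data_transformation
```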
Running Example
[Figure: pipeline — Da and Db are left-joined on (K1, K2) into D1; imputing E, F yields D2; a left join with Dc on (K1, K2) yields D3; imputing all missing values yields D4; adding columns 'E4', 'Ex', 'E1' yields D5; removing 'E' yields D6]
Summary: Shape and value changes
Shape changes. The diff asks: rows added? rows removed? columns added? columns removed? The answers select a template:
- Columns added → Horizontal augmentation
- Rows removed → Reduction by selection
- Columns removed → Reduction by projection
- Columns both added and removed → Data transformation (composite)
- No shape change → Data transformation
Value changes for each column:
- Nulls reduced? → Data transformation (imputation)
- Values changed? → Data transformation (1-1 derivations)
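The per-column value-change rules can be sketched as below (a hypothetical helper, not the DPDS code): fewer nulls in the output column indicates imputation; otherwise, changed values indicate a 1-1 derivation.

```python
import pandas as pd

def classify_value_change(col_in, col_out):
    """Sketch of the value-change rules for one column."""
    if col_out.isna().sum() < col_in.isna().sum():
        return "data_transformation (imputation)"   # nulls reduced
    if (col_in.fillna("\0") != col_out.fillna("\0")).any():
        return "data_transformation (1-1 derivations)"  # values changed
    return None  # column unchanged

# imputation: a null is filled in
e_in = pd.Series([None, "e2"])
print(classify_value_change(e_in, e_in.fillna("Ex")))
# -> data_transformation (imputation)

# 1-1 derivation: every value is mapped to a new one
age = pd.Series([20, 30])
label = age.map(lambda a: "young" if a < 25 else "adult")
print(classify_value_change(age, label))
# -> data_transformation (1-1 derivations)
```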
Running Example

Output  Input(s)   Diff                               Template
D1      {Da, Db}   —                                  Explicit join provenance pattern
D2      D1         value change, reduced nulls        Data transformation (imputation)
D3      {D2, Dc}   —                                  Explicit join provenance pattern
D4      D3         value change, reduced nulls        Data transformation (imputation)
D5      D4         shape change, column(s) added      <wait!>
D6      D5         shape change, column(s) removed    Data transformation (composite)
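The running example can be reproduced in pandas with toy data (all data values are hypothetical; only the pipeline structure comes from the slides). The final two steps are a one-hot encoding of E: the dummy columns 'E4', 'Ex', 'E1' are added, then 'E' is removed.

```python
import pandas as pd

# Toy inputs (values hypothetical)
da = pd.DataFrame({"K1": [1, 2, 3], "K2": ["a", "b", "c"], "E": [None, "E1", "E4"]})
db = pd.DataFrame({"K1": [1, 2, 3], "K2": ["a", "b", "c"], "F": ["f1", None, "f3"]})
dc = pd.DataFrame({"K1": [1, 2, 3], "K2": ["a", "b", "c"], "G": [None, "g2", "g3"]})

d1 = da.merge(db, on=["K1", "K2"], how="left")         # left join (K1, K2)
d2 = d1.fillna({"E": "Ex", "F": "Fx"})                 # impute E, F
d3 = d2.merge(dc, on=["K1", "K2"], how="left")         # left join (K1, K2)
d4 = d3.fillna("Gx")                                   # impute all missing
d5 = pd.concat([d4, pd.get_dummies(d4["E"])], axis=1)  # add 'E4', 'Ex', 'E1'
d6 = d5.drop(columns=["E"])                            # remove 'E'
print(sorted(d6.columns))  # ['E1', 'E4', 'Ex', 'F', 'G', 'K1', 'K2']
```

Note that D4 → D5 adds columns (horizontal augmentation) and D5 → D6 removes one; taken together they match the composite data-transformation case.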
Summary and open questions
- Fine-grained provenance collection from data processing pipelines
- Specifically for dataframe-based Python scripts
- Demonstrated scalable provenance generation, storage, and query
- Work in progress
- But does it help explain data science findings from real pipelines?
Example transformation functions:
- f1, which associates the string "young" to an age less than 25 and the string "adult" otherwise
- f2, which computes the average of a set of numbers
ST needs to create provenance data for every new value in the new column. Join (JO) and Append (AP) operations require more time, as they need to generate a quite large quantity of provenance.