1
DP4DS: Scalable and efficient provenance collection
from Data Science pipelines
Adriane Chapman (1), Paolo Missier (2), Luca Lauro (3), (Giulia Simonelli (3)), Riccardo Torlone (3)
(1) University of Southampton, UK
(2) Newcastle University, UK
(3) Università Roma Tre, Italy
[1] Chapman, A.; Missier, P.; Simonelli, G.; and Torlone, R., Capturing and Querying Fine-grained Provenance of Preprocessing Pipelines in Data Science. PVLDB, 14(4): 507–520. January 2021.
[2] Chapman, A.; Missier, P.; Lauro, L.; and Torlone, R., DPDS: Assisting Data Science with Data Provenance. PVLDB, 15(12): 3614–3617. 2022.
2
One-slide summary
A tool to collect fine-grained provenance from data processing pipelines
- Specifically for dataframe-based Python scripts (pandas)
- Prototype-level
Demonstrated scalable provenance generation, storage, query
Work in progress:
- Ad hoc provenance compression (but no trivial provenance recorded)
- Demonstrate generality, i.e. with respect to standard relational operators
- Where is this practically useful?
3
Running example: A simple pipeline
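[Figure: pipeline diagram. Da and Db are left-joined on (K1, K2) to give D1; all missing values are imputed to give D2; D2 is left-joined with Dc on (K1, K2) to give D3; E and F are imputed to give D4; one-hot encoding adds columns 'E4', 'Ex', 'E1' to give D5 and removes 'E' to give D6.]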
import pandas as pd

# df_A, df_B, df_C: input dataframes sharing the key columns 'key1' and 'key2'
df = pd.merge(df_A, df_B, on=['key1', 'key2'], how='left')  # join
df = df.fillna('imputed')  # imputation of all missing values
df = pd.merge(df, df_C, on=['key1', 'key2'], how='left')  # join
df = df.fillna(value={'E': 'Ex', 'F': 'Fx'})  # imputation of E and F
# one-hot encoding
c = 'E'
dummies = []
dummies.append(pd.get_dummies(df[c]))
df_dummies = pd.concat(dummies, axis=1)
df = pd.concat((df, df_dummies), axis=1)
df = df.drop([c], axis=1)  # remove the encoded column from df
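To try the pipeline on concrete data, a minimal sketch with toy inputs follows. The contents of df_A, df_B and df_C and the columns A and B are invented for illustration; only the names key1, key2, E and F come from the slide. The function name run_pipeline is also hypothetical.

```python
import pandas as pd

# Hypothetical toy inputs; only key1, key2, E and F appear in the slides.
df_A = pd.DataFrame({"key1": [1, 2, 3], "key2": ["a", "b", "c"], "A": ["a1", None, "a3"]})
df_B = pd.DataFrame({"key1": [1, 2], "key2": ["a", "b"], "B": ["b1", None]})
df_C = pd.DataFrame({"key1": [1, 3], "key2": ["a", "c"], "E": ["E1", "E4"], "F": ["f1", None]})

def run_pipeline(df_A, df_B, df_C):
    df = pd.merge(df_A, df_B, on=["key1", "key2"], how="left")  # D1: join
    df = df.fillna("imputed")                                    # D2: impute all missing
    df = pd.merge(df, df_C, on=["key1", "key2"], how="left")     # D3: join
    df = df.fillna(value={"E": "Ex", "F": "Fx"})                 # D4: impute E and F
    dummies = pd.get_dummies(df["E"])                            # one column per value of E
    df = pd.concat((df, dummies), axis=1)                        # D5: columns added
    return df.drop(["E"], axis=1)                                # D6: column E removed

result = run_pipeline(df_A, df_B, df_C)
print(result)
```

With these inputs, the row with no match in df_C receives 'Ex' for E, so the encoded output carries one indicator column for each of 'E1', 'E4' and 'Ex'.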
4
Aims
Capture, store and query element-level provenance
- Derivation of each element of each intermediate dataframe (when possible)
- Efficiently, at scale
[Figure: element-level derivation across a Join followed by fillna, over the dataframes df_A (df_-1), df_B (df_0) and df_1.]
5
Granularity
Base case:
- Opaque program Po
- Coarse-grained dataset
Default provenance:
- Every output depends on every input
Other cases:
- Transparent program PT, fine-grained datasets
- Transparent pipeline P'T, fine-grained datasets
- Transparent program PT, coarse-grained datasets:
if c:
    y1 ← x1
else:
    y1 ← x2
y2 ← f(x1, x2)
Runtime: c == True
[Figure: one diagram per case, showing the programs Po, PT and the pipeline P'T with their input and output datasets.]
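The contrast between the default provenance of an opaque program and the provenance of the transparent conditional can be sketched as follows. The function names are hypothetical, and the sketch records data dependencies only (the control dependency on c is left out).

```python
def opaque_provenance(inputs, outputs):
    # Default for an opaque program: every output depends on every input.
    return {out: set(inputs) for out in outputs}

def transparent_provenance(c):
    # Dependencies follow the branch actually taken at runtime.
    y1_sources = {"x1"} if c else {"x2"}
    # y2 = f(x1, x2) reads both inputs regardless of the branch.
    return {"y1": y1_sources, "y2": {"x1", "x2"}}

coarse = opaque_provenance(["x1", "x2"], ["y1", "y2"])
fine = transparent_provenance(c=True)
print(coarse)
print(fine)
```

With c == True, the transparent view records that y1 derives from x1 only, whereas the default view would also list x2.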
6
Approach to design (I)
Provenance capture control surfaced at program source level:
p = pr.Provenance(df_A, '', savepath)
# create provenance tracker
tracker = ProvenanceTracker.ProvenanceTracker(df_A, p)
# …
# Imputation
tracker.df = tracker.df.fillna(value={'E': 'Ex', 'F': 'Fx'})
# one-hot encoding
c = 'E'
dummies = []
dummies.append(pd.get_dummies(tracker.df[c]))
df_dummies = pd.concat(dummies, axis=1)
tracker.df = pd.concat((tracker.df, df_dummies), axis=1)
tracker.df = tracker.df.drop([c], axis=1)
7
Approach to design (II)
- Grounded in well-known dataframe transformation operators
- Open: accommodates any transformation within three broad classes
8
Data reduction
- Projection, Selection
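A minimal pandas sketch of the two reduction operators follows. The dataframe contents are invented; the point is that reduction preserves the index, so the removed rows and columns can be read off by comparing input and output.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"name": ["ann", "bob", "cy"],
                   "age": [31, 17, 45],
                   "zip": ["A1", "B2", "C3"]})

projected = df[["name", "age"]]   # projection: drops column 'zip'
selected = df[df["age"] >= 18]    # selection: drops rows failing the predicate

# Each surviving row keeps its index label, so it maps back to its source row.
removed_columns = set(df.columns) - set(projected.columns)
removed_rows = df.index.difference(selected.index)
print(removed_columns, list(removed_rows))
```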
9
Data augmentation
Vertical augmentation
group by gender
avg(age)
Horizontal augmentation
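The slide's figure is not available, so the following sketch rests on an assumption: horizontal augmentation adds rows and vertical augmentation adds columns. It applies the slide's aggregation (group by gender, average age) both ways on invented data.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"gender": ["F", "M", "F", "M"], "age": [30, 40, 50, 20]})

# Adds a column: each row receives the average age of its gender group.
with_column = df.assign(avg_age=df.groupby("gender")["age"].transform("mean"))

# Adds rows: the per-group averages are appended as new rows.
group_rows = df.groupby("gender", as_index=False)["age"].mean()
with_rows = pd.concat([df, group_rows], ignore_index=True)

print(with_column)
print(with_rows)
```

In both variants every new value depends on all the age values in its group, which is the many-to-one derivation a provenance record would need to capture.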
10
Data transformation
Example: data imputation. Here f replaces nulls in column Zip with the most frequent value.
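A minimal sketch of this imputation in pandas, on invented data (the column Name and all values are assumptions; only the column Zip comes from the slide):

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"Name": ["ann", "bob", "cy", "di"],
                   "Zip": ["NE1", None, "NE1", "SO17"]})

# f: replace nulls in Zip with the most frequent non-null value.
most_frequent = df["Zip"].mode().iloc[0]
imputed = df.assign(Zip=df["Zip"].fillna(most_frequent))

# Cells that went from null to a value: the ones whose derivation
# a provenance record would need to explain.
changed = df["Zip"].isna() & imputed["Zip"].notna()
print(imputed)
print(changed.tolist())
```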
11
Data fusion: join and append
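One way to keep track of where each fused row came from is to carry the source row positions through the operation. This is an illustrative sketch, not the tool's mechanism; the helper columns left_row, right_row and source, and all data values, are assumptions.

```python
import pandas as pd

# Hypothetical data for illustration only.
left = pd.DataFrame({"K1": [1, 2], "K2": ["a", "b"], "A": ["a1", "a2"]})
right = pd.DataFrame({"K1": [1], "K2": ["a"], "B": ["b1"]})
more = pd.DataFrame({"K1": [3], "K2": ["c"], "A": ["a3"]})

# Join: record each input's row position so every output row names its sources.
l = left.assign(left_row=range(len(left)))
r = right.assign(right_row=range(len(right)))
joined = l.merge(r, on=["K1", "K2"], how="left")

# Append: tag each row with the dataframe it came from.
appended = pd.concat([left.assign(source="left"), more.assign(source="more")],
                     ignore_index=True)

print(joined)
print(appended)
```

In the left join, a null right_row marks an output row that has no contributing row in the right input.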
13
Conceptual provenance capture model: templates
A different provenance template pt𝜏 is associated with each type 𝜏 of operator
14
Capturing provenance: bindings
At runtime, when operator o of type 𝜏 is executed, the appropriate template pt𝜏 for 𝜏 is selected
Data items from the inputs and outputs of the operator are used to bind the variables in the template
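The template-and-binding idea can be sketched as follows. The template representation (triples with $-variables), the relation names (borrowed from W3C PROV), the identifiers such as D1[2,Zip], and the function bind are all assumptions made for illustration; the slides do not show the actual template format.

```python
from string import Template

# Hypothetical template for a data-transformation operator: one output
# element derived from one input element through an operator execution.
TEMPLATES = {
    "data_transformation": [
        ("$out", "wasGeneratedBy", "$op"),
        ("$op", "used", "$inp"),
        ("$out", "wasDerivedFrom", "$inp"),
    ],
}

def bind(op_type, bindings):
    """Instantiate the template for op_type by substituting its variables."""
    return [
        tuple(Template(term).substitute(bindings) for term in triple)
        for triple in TEMPLATES[op_type]
    ]

# Bind the variables with data items taken from the operator's input and output.
provlet = bind("data_transformation",
               {"op": "fillna_1", "inp": "D1[2,Zip]", "out": "D2[2,Zip]"})
for statement in provlet:
    print(statement)
```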
op: {old values: F, I, V} → {new values: F', J, V'}
+ binding rules
15
This applies to all operators
19
Implementation
We use templates in combination with dataframe diff.
For each input/output pair Din, Dout of dataframes:
1. Compare both the shapes and values of Din, Dout (*)
2. Use the diff to:
• Select the appropriate template
• Bind the template variables using the relevant values in the two dataframes
• Generate an instantiated provlet
(*) extends to joins, append
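Step 1 can be sketched as a function that compares the shapes and values of Din and Dout. This is an illustrative implementation under the assumption that rows keep their index labels across the operation; the function name and the keys of the returned dictionary are hypothetical.

```python
import pandas as pd

def dataframe_diff(d_in, d_out):
    """Compare shapes and values of an input/output dataframe pair."""
    shared_rows = d_in.index.intersection(d_out.index)
    shared_cols = d_in.columns.intersection(d_out.columns)
    before = d_in.loc[shared_rows, shared_cols]
    after = d_out.loc[shared_rows, shared_cols]
    # A cell is unchanged if the values are equal or both are null.
    unchanged = (before == after) | (before.isna() & after.isna())
    return {
        "rows_added": len(d_out.index.difference(d_in.index)),
        "rows_removed": len(d_in.index.difference(d_out.index)),
        "cols_added": list(d_out.columns.difference(d_in.columns)),
        "cols_removed": list(d_in.columns.difference(d_out.columns)),
        "changed_cells": [(r, c) for c in shared_cols for r in shared_rows
                          if not unchanged.at[r, c]],
        "nulls_reduced": int(before.isna().sum().sum()) > int(after.isna().sum().sum()),
    }

# Hypothetical input; the fillna call is the one from the running example.
d_in = pd.DataFrame({"E": ["E1", None], "F": [None, "f2"]})
d_out = d_in.fillna(value={"E": "Ex", "F": "Fx"})
diff = dataframe_diff(d_in, d_out)
print(diff)
```

The changed cells are exactly the ones whose identifiers would be used to bind template variables in step 2.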
20
Running Example
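[Figure: pipeline diagram. Da and Db are left-joined on (K1, K2) to give D1; all missing values are imputed to give D2; D2 is left-joined with Dc on (K1, K2) to give D3; E and F are imputed to give D4; adding columns 'E4', 'Ex', 'E1' gives D5 and removing 'E' gives D6.]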
21
Summary: Shape and value changes
Shape changes:
- Rows added? Y → template: horizontal augmentation
- Rows removed? Y → template: reduction by selection
- Columns removed? Y → template: reduction by projection
- Columns added? Y, then columns removed? Y → template: data transformation (composite); N → template: data transformation
Value changes for each column:
- Nulls reduced? Y → template: data transformation (imputation)
- Values changed? Y → template: data transformation
- 1-1 derivations
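The checks on this slide can be read as a decision procedure from a diff to a template name. The sketch below is one possible reading: the order of the checks, the handling of added columns as "pending" until the next step, the dictionary keys, and the None result for an empty diff are all assumptions, since the original decision diagram is not available.

```python
def select_template(diff):
    """Map a dataframe diff (a dict of observations) to a template name."""
    if diff.get("rows_added"):
        return "horizontal augmentation"
    if diff.get("rows_removed"):
        return "reduction by selection"
    if diff.get("cols_added"):
        # Assumption: defer the decision; columns that are added and then
        # removed in the next step form one composite data transformation.
        return "pending"
    if diff.get("cols_removed"):
        if diff.get("previous_step_pending"):
            return "data transformation (composite)"
        return "reduction by projection"
    if diff.get("nulls_reduced"):
        return "data transformation (imputation)"
    if diff.get("values_changed"):
        return "data transformation"
    return None  # assumption: nothing is recorded when nothing changed

print(select_template({"nulls_reduced": True, "values_changed": True}))
print(select_template({"cols_removed": ["E"], "previous_step_pending": True}))
```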
22
Running Example
Dataframes, diff and template:
- D1 ← {Da, Db}: explicit join provenance pattern
- D2 ← D1: value change, reduced nulls → imputation. Template: data transformation
- D3 ← {D2, Dc}: explicit join provenance pattern
- D4 ← D3: value change, reduced nulls → imputation. Template: data transformation
- D5 ← D4: shape change, column(s) added. <wait!>
- D6 ← D5: shape change, column(s) removed. Template: data transformation, composite
23
Evaluation: Provenance capture times
24
Evaluation: Provenance query times on Neo4j
25
Scalability
Synthetic benchmarking datasets created using TPC-DI
26
Scalability: capture and storage / TPC-DI datasets
[Figure panels: basic operators; join + append operators]
28
Summary and open questions
- Fine-grained provenance collection from data processing pipelines
- Specifically for dataframe-based Python scripts
- Demonstrated scalable provenance generation, storage, query
- Work in progress
- But does it help explain data science findings from real pipelines?

Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)

  • 1. DP4DS: Scalable and efficient provenance collection from Data Science pipelines
      Adriane Chapman (1), Paolo Missier (2), Luca Lauro (3), (Giulia Simonelli (3)), Riccardo Torlone (3)
      (1) University of Southampton, UK; (2) Newcastle University, UK; (3) Università Roma Tre, Italy
      [1] Chapman, A.; Missier, P.; Simonelli, G.; and Torlone, R., Capturing and Querying Fine-grained Provenance of Preprocessing Pipelines in Data Science. PVLDB, 14(4): 507–520. January 2021.
      [2] Chapman, A.; Missier, P.; Lauro, L.; and Torlone, R., DPDS: Assisting Data Science with Data Provenance. PVLDB, 15(12): 3614–3617. 2022.
  • 2. One-slide summary
      A tool to collect fine-grained provenance from data processing pipelines
      - Specifically for dataframe-based Python scripts (Pandas)
      - Prototype-level
      Demonstrated scalable provenance generation, storage, query
      Work in progress:
      - Ad hoc provenance compression (but no trivial provenance recorded)
      - Demonstrate generality, i.e. with respect to standard relational operators
      - Where is this practically useful?
  • 3. Running example: A simple pipeline
      [Figure: {Da, Db} → left join (K1,K2) → D1 → impute all missing → D2; {D2, Dc} → left join (K1,K2) → D3 → impute E,F → D4 → one-hot encoding (add 'E4', 'Ex', 'E1') → D5 → remove 'E' → D6]
      df = pd.merge(df_A, df_B, on=['key1', 'key2'], how='left')  # join
      df = df.fillna('imputed')  # imputation
      df = pd.merge(df, df_C, on=['key1', 'key2'], how='left')  # join
      df = df.fillna(value={'E':'Ex', 'F':'Fx'})  # imputation
      # one-hot encoding
      c = 'E'
      dummies = []
      dummies.append(pd.get_dummies(df[c]))
      df_dummies = pd.concat(dummies, axis=1)
      df = pd.concat((df, df_dummies), axis=1)
      df = df.drop([c], axis=1)  # drop from the pipeline result (the slide reads df_A.drop, which would discard the steps above)
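
The running example on slide 3 has no input data, so it cannot be run as shown. The following is a minimal runnable sketch of the same six steps. The contents of `df_A`, `df_B` and `df_C` are invented for illustration; only the key columns (`key1`, `key2`), the columns `E` and `F`, and the values `'E1'`, `'E4'`, `'Ex'` follow the slide. Each intermediate dataframe is kept under its own name (`d1` to `d6`) so that it can be inspected.

```python
import pandas as pd

# Toy inputs: the values are assumptions, only the column roles follow the slide.
df_A = pd.DataFrame({"key1": [1, 2, 3], "key2": ["a", "b", "c"], "A": [10, 20, 30]})
df_B = pd.DataFrame({"key1": [1, 2], "key2": ["a", "b"], "B": ["x", None]})
df_C = pd.DataFrame({"key1": [1, 3], "key2": ["a", "c"],
                     "E": ["E1", "E4"], "F": ["f1", None]})

d1 = pd.merge(df_A, df_B, on=["key1", "key2"], how="left")  # D1: left join
d2 = d1.fillna("imputed")                                    # D2: impute all missing
d3 = pd.merge(d2, df_C, on=["key1", "key2"], how="left")     # D3: left join
d4 = d3.fillna(value={"E": "Ex", "F": "Fx"})                 # D4: impute E, F
d5 = pd.concat((d4, pd.get_dummies(d4["E"])), axis=1)        # D5: one-hot encode E
d6 = d5.drop(["E"], axis=1)                                  # D6: remove E

print(d6)
```

Keeping every intermediate result is what makes a pairwise comparison of consecutive dataframes possible later on.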
  • 4. Aims
      Capture, store and query element-level provenance
      - Derivation of each element of each intermediate dataframe (when possible)
      - Efficiently, at scale
      [Figure: df_A (df_-1), df_B (df_0), Join, df_1, fillna]
  • 5. Granularity
      - Base case: opaque program P0, coarse-grained dataset. Default provenance: every output depends on every input
      - Transparent program PT, coarse-grained datasets. Example: if c: y1 ← x1 else: y1 ← x2; y2 ← f(x1, x2). Runtime: c == True
      - Transparent program PT, fine-grained datasets
      - Transparent pipeline P'T (steps PnT), fine-grained datasets
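
The difference between the opaque base case and the transparent program on slide 5 can be sketched as two dependency maps. The function names below are hypothetical and only illustrate the idea: with an opaque program, every output is assumed to depend on every input, while a transparent program records only the inputs actually used on the branch taken at runtime.

```python
def coarse_provenance(inputs, outputs):
    """Opaque program: assume every output depends on every input."""
    return {out: set(inputs) for out in outputs}


def fine_provenance(c):
    """Transparent program from the slide, tracing the branch taken at runtime."""
    deps = {}
    deps["y1"] = {"x1"} if c else {"x2"}  # if c: y1 <- x1 else: y1 <- x2
    deps["y2"] = {"x1", "x2"}             # y2 <- f(x1, x2)
    return deps


print(coarse_provenance(["x1", "x2"], ["y1", "y2"]))
print(fine_provenance(c=True))
```

With `c == True`, the fine-grained map drops the spurious dependency of `y1` on `x2` that the default provenance would report.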
  • 6. Approach to design (I)
      Provenance capture control surfaced at program source level:
      p = pr.Provenance(df_A, '', savepath)
      # create provenance tracker
      tracker = ProvenanceTracker.ProvenanceTracker(df_A, p)
      # …
      # Imputation
      tracker.df = tracker.df.fillna(value={'E':'Ex', 'F':'Fx'})
      # one-hot encoding
      c = 'E'
      dummies = []
      dummies.append(pd.get_dummies(tracker.df[c]))
      df_dummies = pd.concat(dummies, axis=1)
      tracker.df = pd.concat((tracker.df, df_dummies), axis=1)
      tracker.df = tracker.df.drop([c], axis=1)
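
Slide 6 shows that every transformation is assigned back to `tracker.df`. One plausible way such an assignment can act as a capture point is a property setter that sees both the old and the new dataframe. The class below is a hypothetical stand-in written for illustration; it is not the DP4DS `ProvenanceTracker`, and its `log` attribute is an assumption.

```python
import pandas as pd


class ToyTracker:
    """Hypothetical stand-in, NOT the DP4DS ProvenanceTracker."""

    def __init__(self, df):
        self._df = df
        self.log = []  # one entry per assignment to .df

    @property
    def df(self):
        return self._df

    @df.setter
    def df(self, new_df):
        # Interception point: the previous and the new dataframe are both
        # available here, so they could be compared before the swap.
        self.log.append({"shape_in": self._df.shape, "shape_out": new_df.shape})
        self._df = new_df


tracker = ToyTracker(pd.DataFrame({"E": ["E1", None], "F": [None, "f2"]}))
tracker.df = tracker.df.fillna(value={"E": "Ex", "F": "Fx"})  # imputation
tracker.df = tracker.df.drop(["E"], axis=1)                   # remove E
print(tracker.log)
```

The user's script stays almost unchanged: only the variable holding the dataframe is replaced by `tracker.df`.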
  • 7. Approach to design (II)
      - Grounded in well-known dataframe transformation operators
      - Open: accommodates any transformation within three broad classes
  • 9. Data augmentation
      - Vertical augmentation. Example: group by gender, avg(age)
      - Horizontal augmentation
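
The two kinds of augmentation on slide 9 can be illustrated in pandas. This sketch assumes that vertical augmentation adds derived rows (here, the average age per gender) and that horizontal augmentation adds a derived column (here, a young/adult label with a threshold of 25). The sample data and the column name `age_group` are invented.

```python
import pandas as pd

people = pd.DataFrame({"gender": ["F", "M", "F", "M"], "age": [30, 40, 50, 20]})

# Vertical augmentation (assumed reading): append rows computed by
# "group by gender, avg(age)".
avg_rows = people.groupby("gender", as_index=False)["age"].mean()
vertical = pd.concat([people, avg_rows], ignore_index=True)

# Horizontal augmentation (assumed reading): add a column derived from age.
horizontal = people.assign(
    age_group=people["age"].map(lambda a: "young" if a < 25 else "adult")
)

print(vertical)
print(horizontal)
```

In the vertical case each new row derives from a group of input rows; in the horizontal case each new cell derives from the cells of its own row.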
  • 10. Data transformation
      Example: data imputation. Here f replaces nulls with the most frequent value, for column Zip
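
The imputation example on slide 10 can be written in a few lines of pandas. The sample rows and the `Name` column are invented; only the column `Zip` and the "most frequent value" rule come from the slide.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["a", "b", "c", "d"],
                   "Zip": ["NE1", None, "NE1", "SO17"]})

# f: replace nulls in Zip with the most frequent non-null value.
most_frequent = df["Zip"].mode().iloc[0]
imputed = df.assign(Zip=df["Zip"].fillna(most_frequent))

# Only the cells that were null receive a new value.
changed_rows = df.index[df["Zip"].isna()].tolist()

print(imputed)
print(changed_rows)
```

The shape of the dataframe is unchanged, and only the previously null cells differ between input and output.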
  • 11. Data fusion: join and append
  • 12. Conceptual provenance capture model: templates
      A different provenance template pt𝜏 is associated with each type 𝜏 of operator
  • 13. Capturing provenance: bindings
      At runtime, when operator o of type 𝜏 is executed, the appropriate template pt𝜏 for 𝜏 is selected.
      Data items from the inputs and outputs of the operator are used to bind the variables in the template.
      op: {old values: F, I, V} → {new values: F', J, V'} + binding rules
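
The template-and-binding mechanism of slides 12 and 13 can be sketched as string substitution. Everything in this example is an assumption made for illustration: the template layout, the PROV-style relation names (`used`, `wasGeneratedBy`, `wasDerivedFrom`), and the identifiers in `binding` are not taken from the DP4DS implementation.

```python
# Hypothetical template for a value-changing operator.
# Variables are written as "{name}" placeholders.
TEMPLATE = [
    ("used", "{op}", "{old}"),
    ("wasGeneratedBy", "{new}", "{op}"),
    ("wasDerivedFrom", "{new}", "{old}"),
]


def bind(template, binding):
    """Instantiate a template by replacing each variable with its bound value."""
    return [tuple(term.format(**binding) for term in stmt) for stmt in template]


# Binding built from one input cell and one output cell (invented identifiers).
binding = {"op": "fillna_1", "old": "D3[2,E]=null", "new": "D4[2,E]=Ex"}
provlet = bind(TEMPLATE, binding)

for statement in provlet:
    print(statement)
```

The template is fixed per operator type, while a new binding is produced for every data item the operator touches.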
  • 14. This applies to all operators
  • 15. Implementation
      We use templates in combination with dataframe diff.
      For each input/output pair Din, Dout of dataframes:
      1. Compare both the shapes and values of Din, Dout (*)
      2. Use the diff to:
         • Select the appropriate template
         • Bind the template variables using the relevant values in the two dataframes
         • Generate an instantiated provlet
      (*) extends to joins, append
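
The two steps on slide 15 can be sketched for a single input/output pair as follows. The mapping from diff outcomes to template names in `select_template` is an assumption made for illustration, as are the function names and the shape of the records; joins and append are not covered.

```python
import pandas as pd


def select_template(d_in, d_out):
    """Step 1: compare shapes, then values, and name a template (assumed mapping)."""
    cols_in, cols_out = set(d_in.columns), set(d_out.columns)
    if cols_out - cols_in:
        return "horizontal augmentation"
    if cols_in - cols_out:
        return "reduction by projection"
    if len(d_out) < len(d_in):
        return "reduction by selection"
    if len(d_out) > len(d_in):
        return "vertical augmentation"
    return "data transformation"


def changed_cells(d_in, d_out):
    """Step 2: for same-shape frames, one (row, column, old, new) record per changed cell."""
    records = []
    for col in d_in.columns:
        old, new = d_in[col], d_out[col]
        unchanged = (old == new) | (old.isna() & new.isna())
        for idx in d_in.index[(~unchanged).to_numpy()]:
            records.append((idx, col, old.at[idx], new.at[idx]))
    return records


d_in = pd.DataFrame({"E": ["E1", None], "F": [None, "f2"]})
d_out = d_in.fillna(value={"E": "Ex", "F": "Fx"})

print(select_template(d_in, d_out))
print(changed_cells(d_in, d_out))
```

Each record carries the values needed to bind a template for one cell, and cells that did not change produce no record.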
  • 16. Running Example
      [Figure: {Da, Db} → left join (K1,K2) → D1 → impute all missing → D2; {D2, Dc} → left join (K1,K2) → D3 → impute E,F → D4 → one-hot encoding (add 'E4', 'Ex', 'E1') → D5 → remove 'E' → D6]
  • 17. Summary: Shape and value changes
      Shape changes: Rows added? Rows removed? Columns added? Columns removed?
      Templates: horizontal augmentation, reduction by selection, reduction by projection, data transformation (composite), data transformation
      Value changes for each column: Nulls reduced? Values changed?
      Templates: data transformation (imputation), data transformation, 1-1 derivations
      [Figure: decision diagram with Y/N branches from each question to a template]
  • 18. Running Example
      Dataframes: diff; template
      - D1 ← {Da, Db}: explicit join provenance pattern
      - D2 ← D1: value change, reduced nulls → imputation; data transformation
      - D3 ← {D2, Dc}: explicit join provenance pattern
      - D4 ← D3: value change, reduced nulls → imputation; data transformation
      - D5 ← D4: shape change, column(s) added; <wait!>
      - D6 ← D5: shape change, column(s) removed; data transformation, composite
  • 22. Scalability: capture and storage / TCI-DI datasets
      [Plots: basic operators; join + append operators]
  • 23. Summary and open questions
      - Fine-grained provenance collection from data processing pipelines
      - Specifically for dataframe-based Python scripts
      - Demonstrated scalable provenance generation, storage, query
      - Work in progress
      - But does it help explain data science findings from real pipelines?

Editor's Notes

  1. \begin{align*}
       & D_1=\tau_{f(K)}(D_a)\\
       & D_2=D_b \join^{\tt outer}_{K_1=K_2} D_c\\
       & D_3=D_1 \union D_2\\
       & D_4=\horaug_{h(B)}(D_3)\\
       & D_5=\pi_{\{A,B_0, B_1\}}(D_4)
     \end{align*}
  2. $f_1$ associates the string \emph{young} with an age less than 25 and the string \emph{adult} otherwise; $f_2$ computes the average of a set of numbers.
  3. \begin{align*}
       & D_1=D_a \join^{\tt left}_{K_1,K_2} D_b\\
       & D_2=\tau_{f_1(*)}(D_1)\\
       & D_3=D_2 \join^{\tt left}_{K_1,K_2} D_c\\
       & D_4=\tau_{f_2(E,F)}(D_3)\\
       & D_5=\horaug_{h(E):\{E_4, E_x, E_1\}}(D_4)\\
       & D_6=\pi_{\{A_x, B, A_y, D, C, F, E_4, E_x, E_1\}}(D_5)
     \end{align*}
  4. ST needs to create provenance data for every new value in the new column. Join (JO) and Append (AP) operations take longer because they generate a fairly large quantity of provenance.