1
Capturing and querying fine-grained provenance of preprocessing pipelines in data science (DP4DS)
Adriane Chapman (1), Paolo Missier (2), Luca Lauro (3), Riccardo Torlone (3)
(1) University of Southampton, UK
(2) Newcastle University, UK
(3) Università Roma Tre, Italy
[1] Chapman, A.; Missier, P.; Simonelli, G.; and Torlone, R. Capturing and Querying Fine-grained Provenance of Preprocessing Pipelines in Data Science. PVLDB, 14(4): 507–520. January 2021.
[2] Chapman, A.; Missier, P.; Lauro, L.; and Torlone, R. DPDS: Assisting Data Science with Data Provenance. PVLDB, 15(12): 3614–3617. 2022.
2
[Diagram: data science pipeline producing a model M. Stages shown: Data sources; Acquisition, wrangling; Preparing for learning; Training / test split (Training set, Test set); Model Selection; Model Learning; Model Validation; Model Testing; Model Usage; Predictions]
Decision points:
- Source selection
- Sample / population shape
- Cleaning
- Integration
Decision points:
- Sampling / stratification
- Feature selection
- Feature engineering
- Dimensionality reduction
- Regularisation
- Imputation
- Class rebalancing
- …
[Diagram: Pipeline structure with provenance annotations. Elements shown: Provenance trace; Imputation; Feature selection; intermediate datasets D’, D’’, …; Training / test split; Training set; Hyperparameters; Model Learning; model M; labels C1, C2, C3]
3
Provenance of what?
Base case:
- Opaque program Po
- Coarse-grained dataset
Default provenance:
- Every output depends on every input
Beyond the base case:
- Transparent program PT, fine-grained datasets
- Transparent pipeline P’T (diagram boxes labelled PnT), fine-grained datasets
- Transparent program PT, coarse-grained datasets
Example of a transparent program PT:
    if c:
        y1 ← x1
    else:
        y1 ← x2
    y2 ← f(x1, x2)
Runtime: c == True, so y1 depends only on x1, while y2 depends on both x1 and x2
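The contrast between default and fine-grained provenance on this toy program can be sketched in a few lines of Python. This is an illustration only: the function names `run_transparent` and `default_provenance` and the sample values are assumptions, not part of the slides.

```python
def default_provenance(inputs, outputs):
    """Opaque program: assume every output depends on every input."""
    return {out: set(inputs) for out in outputs}


def run_transparent(x1, x2, c, f):
    """Run the toy program and record which inputs each output actually used."""
    if c:
        y1, y1_deps = x1, {"x1"}
    else:
        y1, y1_deps = x2, {"x2"}
    y2, y2_deps = f(x1, x2), {"x1", "x2"}
    return {"y1": y1, "y2": y2}, {"y1": y1_deps, "y2": y2_deps}


# Hypothetical inputs; c is True at runtime, as on the slide.
values, fine = run_transparent(3, 4, c=True, f=lambda a, b: a + b)
coarse = default_provenance(["x1", "x2"], ["y1", "y2"])

print("fine-grained:", fine)
print("default:     ", coarse)
```

Observing the branch taken at runtime is what lets the fine-grained trace drop the dependency of y1 on x2.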
4
Typical operators used in data prep
5
Data reduction
- Conditional projection
- Selection
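As a minimal sketch of the two reduction operators in pandas, assuming a small made-up dataframe and an arbitrary null-ratio threshold (neither comes from the slides):

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 41, 33, 58],
    "zip": ["NE1", None, None, None],
    "income": [30, 52, 47, 61],
})

# Selection: keep only the rows that satisfy a predicate (rows are removed).
selected = df[df["age"] > 30]

# Conditional projection: keep only the columns that satisfy a condition,
# here "at most half of the values are null" (columns are removed).
keep = [col for col in df.columns if df[col].isna().mean() <= 0.5]
projected = df[keep]

print(selected.index.tolist())
print(list(projected.columns))
```

Because pandas keeps the original row labels after filtering, each surviving row can still be matched to the input row it came from.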
6
Data augmentation
Vertical augmentation
Horizontal augmentation
avg(age)
group by age
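A possible pandas rendering of the two augmentation operators follows. It assumes that vertical augmentation adds derived columns and horizontal augmentation adds derived rows. The dataframe, the `group` column and the `age_decade` feature are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["ann", "bob", "cy"],
    "group": ["a", "a", "b"],
    "age": [20, 40, 30],
})

# Vertical augmentation (assumed meaning: new columns derived from existing ones).
vertical = df.assign(age_decade=df["age"] // 10)

# Horizontal augmentation (assumed meaning: new rows derived from existing ones),
# here one aggregate row per group holding the average age.
agg = df.groupby("group", as_index=False)["age"].mean()
agg["name"] = "avg"
horizontal = pd.concat([df, agg], ignore_index=True)

print(vertical.shape, horizontal.shape)
```

In the horizontal case each new row depends on all the rows of its group, which is the kind of many-to-one derivation a provenance template would need to record.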
7
Data transformation
Example: data imputation. Here f replaces the nulls in column Zip with the most frequent value of that column
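The imputation example can be sketched in pandas as follows; the sample data is made up, and only the column name Zip comes from the slide.

```python
import pandas as pd

d_in = pd.DataFrame({
    "Name": ["ann", "bob", "cy", "di"],
    "Zip": ["NE1", None, "NE1", "SO17"],
})

# f: replace nulls in Zip with the most frequent value of the column.
most_frequent = d_in["Zip"].mode().iloc[0]
d_out = d_in.assign(Zip=d_in["Zip"].fillna(most_frequent))

# Cells that were null before and filled after are the imputed ones.
changed = d_in["Zip"].isna() & d_out["Zip"].notna()
imputed_rows = d_out.index[changed].tolist()

print(most_frequent, imputed_rows)
```

Since the most frequent value is computed over the whole column, an imputed cell arguably depends on every non-null Zip value, not just on its own previous content.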
8
Data fusion: join and append
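A short pandas sketch of join and append that keeps enough information to trace each output row back to its inputs. The dataframes and the helper column names `left_row` and `right_row` are assumptions made for illustration.

```python
import pandas as pd

left = pd.DataFrame({"K1": [1, 2, 3], "A": ["a1", "a2", "a3"]})
right = pd.DataFrame({"K2": [2, 3, 4], "B": ["b2", "b3", "b4"]})

# Join: carry the original row labels along as columns so they survive the merge.
joined = (
    left.reset_index().rename(columns={"index": "left_row"})
    .merge(
        right.reset_index().rename(columns={"index": "right_row"}),
        left_on="K1", right_on="K2", how="inner",
    )
)

# Append: stack two dataframes; keys= records which input each row came from.
extra = pd.DataFrame({"K1": [9], "A": ["a9"]})
appended = pd.concat([left, extra], keys=["left", "extra"])

print(joined[["left_row", "right_row"]].values.tolist())
print(appended.index.tolist())
```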
9
Provenance model
10
Capturing provenance: Assumptions
- Common data abstraction: (pandas) dataframes
- Observability: runtime execution of a (Python) program can be observed
- Each input and output dataframe of each operator can be inspected
11
Capturing provenance: templates
A different provenance template pt𝜏 is associated with each type 𝜏 of operator
12
Capturing provenance: bindings
At runtime, when operator o of type 𝜏 is executed, the appropriate template pt𝜏 for 𝜏 is selected
Data items from the inputs and outputs of the operator are used to bind the variables in the template
op
{old values: F, I, V} → {new values: F’, J, V’}
+
Binding rules
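The template-plus-bindings idea can be sketched in plain Python. Everything here is hypothetical: the template contents, the `$`-prefixed variable convention, and the reading of F, I, V as feature, row index and value are assumptions, not the authors' implementation.

```python
# Hypothetical template for a value-changing operator; variables start with "$".
TEMPLATE = [
    ("used", "$op", "$old"),
    ("wasGeneratedBy", "$new", "$op"),
    ("wasDerivedFrom", "$new", "$old"),
]


def instantiate(template, bindings):
    """Expand the template once per binding (a dict of variable -> value)."""
    statements = []
    for binding in bindings:
        for relation, subject, obj in template:
            statements.append(
                (relation, binding.get(subject, subject), binding.get(obj, obj))
            )
    return statements


# One binding per changed cell: (F, I, V) before and (F', J, V') after.
bindings = [
    {"$op": "impute#1", "$old": ("Zip", 1, None), "$new": ("Zip", 1, "NE1")},
]
statements = instantiate(TEMPLATE, bindings)
for statement in statements:
    print(statement)
```

Keeping the template separate from the bindings means the same pattern can be reused for every cell an operator touches, with only the bound values changing.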
13
This applies to all operators
14
Join provenance pattern -- keys
[Diagram: Join activity with entities Left, Right and Output. Relations shown: used (twice), wasGeneratedBy, wasDerivedFrom]
15
Join provenance pattern -- non-key elements
[Diagram: Join activity with entities Left, Right and Output. Relations shown: used, wasGeneratedBy, wasDerivedFrom]
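One way to read the two join patterns is that an output key element is derived from the matching key elements on both sides, while a non-key element is derived from the single input element it was copied from. The sketch below encodes that reading as PROV-style triples. The reading itself, the function name and the tuple layout are assumptions.

```python
def join_provenance(activity, out_row, left_row, right_row,
                    key_cols, left_cols, right_cols):
    """Emit PROV-style triples for one output row of a join (assumed pattern)."""
    triples = []
    # Key elements: the output key is derived from both input keys.
    for out_col, (l_col, r_col) in key_cols.items():
        out = ("Output", out_row, out_col)
        left = ("Left", left_row, l_col)
        right = ("Right", right_row, r_col)
        triples += [
            ("used", activity, left),
            ("used", activity, right),
            ("wasGeneratedBy", out, activity),
            ("wasDerivedFrom", out, left),
            ("wasDerivedFrom", out, right),
        ]
    # Non-key elements: each output element is derived from one input element.
    for side, row, cols in (("Left", left_row, left_cols),
                            ("Right", right_row, right_cols)):
        for col in cols:
            out, src = ("Output", out_row, col), (side, row, col)
            triples += [
                ("used", activity, src),
                ("wasGeneratedBy", out, activity),
                ("wasDerivedFrom", out, src),
            ]
    return triples


triples = join_provenance(
    "join#1", out_row=0, left_row=1, right_row=0,
    key_cols={"K": ("K1", "K2")}, left_cols=["A"], right_cols=["B"],
)
for triple in triples:
    print(triple)
```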
17
Capturing provenance: a more practical approach
The approach just described requires recognizing the type of each operation from the source code
This restricts capture to a closed set of operators → the set needs to be maintained over time
We take a more generic route to implementing the same idea:
1. Look at each operator’s input / output dataframes Din, Dout, regardless of the specific operator
2. Dataframe diff: compare both the shapes and the values of Din, Dout (*)
3. Use the diff to:
• Select the appropriate template
• Bind the template variables using the relevant values in the two dataframes
(*) extends to joins, append
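The shape part of the dataframe diff, and its use to pick a template, might look like the sketch below. The function names and the exact mapping from diff to template name are assumptions. In particular, the slides' own example defers the decision when only columns are added, whereas this sketch simply labels that case "vertical augmentation".

```python
import pandas as pd


def shape_diff(d_in, d_out):
    """Compare row labels and column names of an operator's input and output."""
    return {
        "rows_added": sorted(set(d_out.index) - set(d_in.index)),
        "rows_removed": sorted(set(d_in.index) - set(d_out.index)),
        "cols_added": [c for c in d_out.columns if c not in d_in.columns],
        "cols_removed": [c for c in d_in.columns if c not in d_out.columns],
    }


def select_template(diff):
    """Map a shape diff to a template name (the mapping is an assumption)."""
    if diff["rows_added"]:
        return "horizontal augmentation"
    if diff["rows_removed"]:
        return "reduction by selection"
    if diff["cols_added"] and diff["cols_removed"]:
        return "data transformation (composite)"
    if diff["cols_added"]:
        return "vertical augmentation"
    if diff["cols_removed"]:
        return "reduction by projection"
    return "data transformation"


d_in = pd.DataFrame({"K": [1, 2, 3], "B": ["x", "y", "x"]})
after_selection = d_in[d_in["K"] > 1]
after_projection = d_in.drop(columns="B")

print(select_template(shape_diff(d_in, after_selection)))
print(select_template(shape_diff(d_in, after_projection)))
```

The row and column labels reported by the diff are also the values one would feed into the template bindings.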
18
Example
Consider the following sequence: Imputation → join → append → one-hot encoding
[Diagram: dataframes Da, Db, Dc and D1 to D5, linked by the steps Impute K, Join K1=K2, append, Add ‘B0’, ‘B1’, Remove ‘B’]
19
Example
Dataframes | Diff | Template
D1, Da | Value change, reduced number of null values | Data transformation
D2, {Da, Db} | | Join provenance
D3, {D1, D2} | | Append provenance
D4, D3 | Shape change, column(s) added | <wait!>
D5, D4 | Shape change, column(s) removed | Data transformation, composite
20
Summary: Shape and value changes
Shape changes (flowchart of yes/no checks): Rows added? Rows removed? Columns added? Columns removed?
Templates: horizontal augmentation; reduction by selection; reduction by projection; data transformation (composite); data transformation
Value changes for each column (flowchart of yes/no checks): Nulls reduced? Values changed?
Templates: data transformation (imputation); data transformation; 1-1 derivations
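The per-column value checks can be sketched as below. The function names are hypothetical, the order of the two checks is an assumption, and the "1-1 derivations" case is not modelled.

```python
import pandas as pd


def value_diff(d_in, d_out):
    """For each shared column: did nulls decrease, and which rows changed value?"""
    report = {}
    rows = d_in.index.intersection(d_out.index)
    for col in d_in.columns.intersection(d_out.columns):
        before, after = d_in.loc[rows, col], d_out.loc[rows, col]
        # Two nulls count as equal; a null replaced by a value counts as changed.
        same = (before == after) | (before.isna() & after.isna())
        report[col] = {
            "nulls_reduced": int(after.isna().sum()) < int(before.isna().sum()),
            "changed_rows": rows[~same.to_numpy(dtype=bool)].tolist(),
        }
    return report


def column_template(entry):
    """Pick a template name for one column (the mapping is an assumption)."""
    if entry["nulls_reduced"]:
        return "data transformation (imputation)"
    if entry["changed_rows"]:
        return "data transformation"
    return None  # unchanged column: nothing selected by this sketch


d_in = pd.DataFrame({"Zip": ["NE1", None, "NE1", "SO17"], "Age": [20, 30, 40, 50]})
d_out = d_in.assign(Zip=d_in["Zip"].fillna("NE1"))
report = value_diff(d_in, d_out)
templates = {col: column_template(entry) for col, entry in report.items()}
print(templates)
```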
21
Code instrumentation
A Python tracker object intercepts dataframe operations, using an observer pattern
The tracker collects the values required to generate the bindings
Create a provenance object and a tracker object
Simple column transform
One-hot encoding
join
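The code shown on this slide is not part of the transcript. As a stand-in, here is a minimal observer-pattern sketch of a tracker. The class names `ProvenanceStore` and `Tracker` and their methods are hypothetical and do not reflect the authors' API.

```python
import pandas as pd


class ProvenanceStore:
    """Observer: receives one record per tracked dataframe operation."""

    def __init__(self):
        self.records = []

    def notify(self, description, d_in, d_out):
        self.records.append({
            "operation": description,
            "cols_added": [c for c in d_out.columns if c not in d_in.columns],
            "cols_removed": [c for c in d_in.columns if c not in d_out.columns],
            "rows_in": len(d_in),
            "rows_out": len(d_out),
        })


class Tracker:
    """Subject: wraps a dataframe and reports every operation to its observers."""

    def __init__(self, df, observers):
        self._df = df
        self._observers = list(observers)

    @property
    def df(self):
        return self._df

    def apply(self, description, operation):
        d_in = self._df.copy()          # snapshot of the input
        d_out = operation(self._df)     # run the actual pandas operation
        for observer in self._observers:
            observer.notify(description, d_in, d_out)
        self._df = d_out
        return self


store = ProvenanceStore()
tracker = Tracker(pd.DataFrame({"A": [1, 2], "B": ["x", "y"]}), [store])
tracker.apply("column transform", lambda d: d.assign(A=d["A"] * 2))
tracker.apply("one-hot encoding", lambda d: pd.get_dummies(d, columns=["B"]))
print(store.records)
```

Routing every operation through `apply` gives the tracker both the input and the output dataframe, which is what the diff-based template selection needs.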
22
Evaluation – benchmark datasets
Census pipeline:
- Clerical cleaning on every cell (removing blanks)
- Replace all ‘?’ with NaN
- One-hot encoding of 7 categorical variables
- Map binary labels to 0,1
- Drop one column
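To make the steps concrete, here is a toy pandas version of such a pipeline. The data, the column names and the order of the steps are assumptions, and only two categorical columns are encoded instead of seven.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data in the style of a census table.
raw = pd.DataFrame({
    "workclass": [" Private", " ?", " State-gov"],
    "sex": [" Male", " Female", " Male"],
    "fnlwgt": [77516, 83311, 215646],
    "label": [" <=50K", " >50K", " <=50K"],
})

df = raw.copy()

# Clerical cleaning: strip surrounding blanks from the text columns.
for col in ["workclass", "sex", "label"]:
    df[col] = df[col].str.strip()

# Replace all '?' placeholders with NaN.
df = df.replace("?", np.nan)

# Map the binary labels to 0 and 1.
df["label"] = df["label"].map({"<=50K": 0, ">50K": 1})

# Drop one column.
df = df.drop(columns=["fnlwgt"])

# One-hot encode the categorical variables.
df = pd.get_dummies(df, columns=["workclass", "sex"])

print(df.columns.tolist())
```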
23
Evaluation – benchmark pipelines
24
Evaluation: Provenance capture times
25
Evaluation: Provenance query times on Neo4j
26
Scalability: provenance query times
Synthetic Benchmarking datasets created using TPC-DI
27
Scalability: operations on TPC-DI datasets
Panels: basic operators; join + append operators
28
Tool demo
DPDS: Assisting Data Science with Data Provenance (demo paper). Chapman, A.; Missier, P.; Lauro, L.; and Torlone, R. PVLDB, 15(12): 3614–3617. 2022.
29
Summary
A method, infrastructure and tooling for collecting, querying, and visualizing very fine-grained provenance from data processing pipelines
Open questions:
1. What is the killer app for such granular provenance?
2. How general is the technique with respect to arbitrary pandas programs?

Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 

Capturing and querying fine-grained provenance of preprocessing pipelines in data science (DP4DS)

  • 1. 1 Capturing and querying fine-grained provenance of preprocessing pipelines in data science (DP4DS). Adriane Chapman (1), Paolo Missier (2), Luca Lauro (3), Riccardo Torlone (3). (1) University of Southampton, UK; (2) Newcastle University, UK; (3) Università Roma Tre, Italy. [1] Chapman, A.; Missier, P.; Simonelli, G.; and Torlone, R., Capturing and Querying Fine-grained Provenance of Preprocessing Pipelines in Data Science. PVLDB, 14(4): 507–520. January 2021. [2] Chapman, A.; Missier, P.; Lauro, L.; and Torlone, R., DPDS: Assisting Data Science with Data Provenance. PVLDB, 15(12): 3614–3617. 2022.
  • 2. 2 Pipeline structure with provenance annotations (diagram). Stages shown: Data sources; Acquisition, wrangling; Preparing for learning; Training / test split (Training set, Test set); Model Selection; Model Learning; Model Validation; Model Testing; model M; Model Usage; Predictions. Decision points (first list): source selection; sample / population shape; cleaning; integration. Decision points (second list): sampling / stratification; feature selection; feature engineering; dimensionality reduction; regularisation; imputation; class rebalancing; … Provenance trace elements: D’, D’’, Imputation, Feature selection, Training / test split, Training set, Model Learning, M, Hyperparameters, C1, C2, C3.
  • 3. 3 Provenance of what? Base case: an opaque program Po over a coarse-grained dataset. Default provenance: every output depends on every input. Further cases: a transparent program PT over coarse-grained datasets; a transparent program PT over fine-grained datasets; a transparent pipeline P’T … PnT over fine-grained datasets. Example of a transparent program: if c: y1 ← x1, else: y1 ← x2; y2 ← f(x1, x2). At runtime c == True, so y1 derives from x1 only, while y2 derives from both x1 and x2.
  • 5. 5 Data reduction - Conditional projection - Selection
  • 6. 6 Data augmentation Vertical augmentation Horizontal augmentation avg(age) group by age
  • 7. 7 Data transformation Example: data imputation. Here f replaces nulls with the most frequent value, for column Zip
  • 8. 8 Data fusion: join and append
  • 10. 10 Capturing provenance: Assumptions - Common data abstraction: (Pandas) dataframes - Observability: runtime execution of a (python) program can be observed - Each input and output dataframe to each operator can be inspected
  • 11. 11 Capturing provenance: templates A different provenance template pt𝜏 is associated with each type 𝜏 of operator
  • 12. 12 Capturing provenance: bindings. At runtime, when operator o of type 𝜏 is executed, the appropriate template pt𝜏 for 𝜏 is selected. Data items from the inputs and outputs of the operator are used to bind the variables in the template: op {old values: F, I, V} → {new values: F’, J, V’} + binding rules.
  • 13. 13 This applies to all operators
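  • 14. 14 Join provenance pattern: keys. Diagram elements: the Join activity; the entities Left, Right and Output; the relations used, wasGeneratedBy and wasDerivedFrom.
  • 15. 15 Join provenance pattern: non-key elements. Diagram elements: the Join activity; the entities Left, Right and Output; the relations used, wasGeneratedBy and wasDerivedFrom.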
  • 16. 17 Capturing provenance: a more practical approach. The approach just described requires recognizing the type of operation from the source code. This restricts it to a closed set of operators, which needs to be maintained over time. We take a more generic route to implementing the same idea: (1) look at each operator’s input / output dataframes Din, Dout, regardless of the specific operator; (2) dataframe diff: compare both the shapes and the values of Din, Dout (*); (3) use the diff to select the appropriate template and to bind the template variables using the relevant values in the two dataframes. (*) Extends to joins and append.
  • 17. 18 Example. Consider the following sequence: imputation → join → append → one-hot encoding. Diagram: Da → Impute K → D1; Db, Dc → Join K1=K2 → D2; D1, D2 → append → D3; D3 → Add ‘B0’, ‘B1’ → D4; D4 → Remove ‘B’ → D5.
  • 18. 19 Example (table: Dataframes | Diff | Template). D1, Da | value change, reduced number of null values | data transformation. D2, {Db, Dc} | (join) | join provenance. D3, {D1, D2} | (append) | append provenance. D4, D3 | shape change, column(s) added | (wait: decision deferred to the next step). D5, D4 | shape change, column(s) removed | data transformation, composite.
  • 19. 20 Summary: shape and value changes (decision diagram; the Y/N branch labels are not recoverable from the extracted text). Shape changes tested: rows added? rows removed? columns added? columns removed? Templates selected from shape changes: horizontal augmentation; reduction by selection; reduction by projection; data transformation (composite); data transformation. Value changes tested for each column: nulls reduced? values changed? Templates selected from value changes: data transformation (imputation); data transformation; 1-1 derivations.
  • 20. 21 Code instrumentation. A Python tracker object intercepts dataframe operations, using an observer pattern. The tracker collects the values required to generate the bindings. Code snippets shown on the slide: creating a provenance object and a tracker object; a simple column transform; one-hot encoding; join.
  • 21. 22 Evaluation: benchmark datasets. Census pipeline: clerical cleaning on every cell (removing blanks); replace all ‘?’ with NaN; one-hot encoding of 7 categorical variables; map binary labels to 0, 1; drop one column.
  • 25. 26 Scalability: provenance query times. Synthetic benchmarking datasets created using TPC-DI.
  • 26. 27 Scalability: operations on TPC-DI datasets. Basic operators; join + append operators.
  • 27. 28 Tool demo DPDS: Assisting Data Science with Data Provenance. Chapman, A.; Missier, P.; Lauro, L.; and Torlone, R. PVLDB, 15(12): 3614 – 3617. 2022. (demo paper)
  • 28. 29 Summary. A method, infrastructure and tooling for collecting, querying, and visualizing very fine-grained provenance from data processing pipelines. Open questions: 1. What is the killer app for such granular provenance? 2. How general is the technique with respect to arbitrary pandas programs?
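
Slides 11 and 12 describe selecting a provenance template for an operator type and binding its variables to data items taken from the operator's inputs and outputs. The following is a minimal sketch of that idea in plain Python, not the authors' implementation. The names `Cell`, `TRANSFORM_TEMPLATE` and `bind` are hypothetical, and the three-statement template is an assumption built from the PROV relation names that appear on slides 14 and 15.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """One data item, identified by feature (column), index (row) and value."""
    feature: str
    index: int
    value: object

    @property
    def ident(self) -> str:
        return f"{self.feature}[{self.index}]={self.value!r}"


# Hypothetical template for a value-changing operator: each statement is
# (relation, subject, object) over the variables {activity}, {old}, {new}.
TRANSFORM_TEMPLATE = [
    ("used", "{activity}", "{old}"),
    ("wasGeneratedBy", "{new}", "{activity}"),
    ("wasDerivedFrom", "{new}", "{old}"),
]


def bind(template, activity: str, old: Cell, new: Cell):
    """Instantiate the template by substituting concrete data items for its variables."""
    env = {"activity": activity, "old": old.ident, "new": new.ident}
    return [
        (relation, subject.format(**env), obj.format(**env))
        for relation, subject, obj in template
    ]


# Example: an imputation step fills a null in column Zip, row 1.
statements = bind(
    TRANSFORM_TEMPLATE,
    activity="impute#1",
    old=Cell("Zip", 1, None),
    new=Cell("Zip", 1, "NE1"),
)
for statement in statements:
    print(statement)
```

In this sketch one binding is produced per changed cell, which is what makes the resulting provenance fine-grained; a real system would also need versioned identifiers to tell the old and new items apart.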

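The dataframe diff from slides 16 to 19 can be sketched roughly as follows, assuming pandas and a single input dataframe per operator. The function name `select_template` is hypothetical, and the exact mapping from diff outcomes to template names is an assumption: the decision diagram on slide 19 is only partly recoverable, the label "vertical augmentation" for added rows is borrowed from slide 6, and the deferred decision for one-hot encoding (slide 18) is not modelled.

```python
import pandas as pd


def select_template(d_in: pd.DataFrame, d_out: pd.DataFrame) -> str:
    """Choose a template name from the shape and value diff of D_in and D_out."""
    cols_added = [c for c in d_out.columns if c not in d_in.columns]
    cols_removed = [c for c in d_in.columns if c not in d_out.columns]

    # Shape changes are tested first.
    if cols_added and cols_removed:
        return "data transformation (composite)"
    if cols_added:
        return "horizontal augmentation"
    if cols_removed:
        return "reduction by projection"
    if len(d_out) > len(d_in):
        return "vertical augmentation"
    if len(d_out) < len(d_in):
        return "reduction by selection"

    # Same shape: test value changes column by column.
    nulls_reduced = any(
        d_out[c].isna().sum() < d_in[c].isna().sum() for c in d_in.columns
    )
    if nulls_reduced:
        return "data transformation (imputation)"
    if not d_in.equals(d_out):
        return "data transformation"
    return "no change"


d_in = pd.DataFrame({"K": [1.0, None, 3.0], "A": ["x", "y", "z"]})
print(select_template(d_in, d_in.fillna({"K": 1.0})))  # nulls in K are filled
print(select_template(d_in, d_in[["A"]]))              # column K is dropped
```

Comparing only row counts is a simplification: an operator that removes and adds the same number of rows would be classified by its value changes instead.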
Editor's Notes

  1. $f_1$, which associates the string \emph{young} to an age less than 25 and the string \emph{adult} otherwise; $f_2$, which computes the average of a set of numbers.
  2. \begin{align*} D_1 &= \tau_{f(K)}(D_a)\\ D_2 &= D_b \join^{\tt outer}_{K_1=K_2} D_c\\ D_3 &= D_1 \union D_2\\ D_4 &= \horaug_{h(B)}(D_3)\\ D_5 &= \pi_{\{A,B_0,B_1\}}(D_4) \end{align*}
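
The five steps in note 2 can be sketched in pandas as below. The deck does not give the schemas or contents of $D_a$, $D_b$ and $D_c$, so the columns and values used here are invented for illustration. Taking $f$ to be most-frequent-value imputation (as in the slide 7 example) and $h$ to be a two-category one-hot encoding of B are likewise assumptions.

```python
import pandas as pd

# Illustrative inputs (assumed schemas, not from the slides).
d_a = pd.DataFrame({"K": [1.0, None, 1.0], "A": ["a1", "a2", "a3"], "B": ["u", "v", "u"]})
d_b = pd.DataFrame({"K1": [10, 20], "A": ["a4", "a5"]})
d_c = pd.DataFrame({"K2": [10, 30], "B": ["v", "u"]})

# D1: data transformation, imputing nulls in K with the most frequent value.
d1 = d_a.assign(K=d_a["K"].fillna(d_a["K"].mode().iloc[0]))

# D2: outer join of Db and Dc on K1 = K2.
d2 = d_b.merge(d_c, how="outer", left_on="K1", right_on="K2")

# D3: append (union) of D1 and D2; columns missing on one side become NaN.
d3 = pd.concat([d1, d2], ignore_index=True)

# D4: horizontal augmentation, adding indicator columns B0 and B1 derived from B.
categories = sorted(d3["B"].dropna().unique())
d4 = d3.copy()
for i, value in enumerate(categories):
    d4[f"B{i}"] = (d4["B"] == value).astype(int)

# D5: projection onto A, B0, B1, which removes B and the key columns.
d5 = d4[["A", "B0", "B1"]]
print(d5)
```

Each intermediate dataframe is kept in its own variable so that consecutive pairs such as (D3, D4) can be compared by a dataframe diff.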