UP & DOWN: Improving
Provenance Precision by
Combining Workflow- and
Trace-Land Information
By Khalid Belhajjame, Paris-Dauphine University, LAMSADE
In collaboration with:
Saumen Dey1, David Koop2, Tianhong Song1, Bertram Ludäscher1, Paolo
Missier3
1 University of California
2 New York University
3 Newcastle University
Scientific Workflows
•  We have seen a dramatic increase in
the number of scientists who use
scientific modules as building blocks in the
composition of their experiments
•  In 2011, the EBI recorded 21
million invocations of the scientific
modules it hosts
•  Typically, an experiment is designed as a
workflow whose steps represent
invocations of scientific modules, which
we refer to as actors.
2
Scientific Workflow and
Provenance
•  Many scientific workflows have been instrumented to capture
workflow execution events.
•  Typically, such events record the actors that were invoked as
part of the workflow execution as well as the data used as
input and output by actor invocations.
•  Collected execution events can be used, amongst other
applications, to understand the lineage of a given workflow
result.
•  In practice, however, this application is limited by the black box
nature of the actors implementing the steps of the workflow
3
Data Dependencies
[Figure: actor A with input ports I1, I2 and output ports O1, O2, drawn with several alternative internal port dependencies.]
•  In general, if there are n input ports and m output ports for an actor, then
there are $2^{mn}$ possible internal port dependencies.
•  With k such actors in a workflow, there are $2^{kmn}$ possible port dependency
models of the workflow.
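The counting argument above can be checked with a short Python sketch. It assumes that a dependency model is simply a subset of the candidate (input port, output port) pairs; the function names are illustrative and not taken from the framework.

```python
from itertools import chain, combinations, product

def candidate_edges(inputs, outputs):
    """All (input, output) port pairs that could carry a dependency."""
    return list(product(inputs, outputs))

def dependency_models(inputs, outputs):
    """Yield every subset of the candidate edges, i.e. every possible model."""
    edges = candidate_edges(inputs, outputs)
    return chain.from_iterable(
        combinations(edges, r) for r in range(len(edges) + 1)
    )

def model_count(n_inputs, m_outputs, k_actors=1):
    """Number of models for k actors with n inputs and m outputs each."""
    return 2 ** (k_actors * m_outputs * n_inputs)

models = list(dependency_models(["I1", "I2"], ["O1", "O2"]))
print(len(models))                    # 16
print(model_count(2, 2))              # 16
print(model_count(2, 2, k_actors=3))  # 4096
```

Explicit enumeration is only practical for small actors, since the number of models doubles with every additional candidate edge.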
4
Related Work: Looking Inside
the Black Box
•  Capturing Data Provenance using Dynamic
Instrumentation (by Manolis Stamatogiannakis, Paul
Groth, and Herbert Bos in IPAW 2014)
•  The authors present DataTracker, which captures data provenance based on
taint tracking, a technique used in the security and reverse
engineering fields.
5
Framework for Generating Possible
Data Dependency Models
•  We set out to build a framework for generating
possible data dependency models within a workflow
•  By dependency model, we mean a graph in which
the nodes represent the input and output ports of the
actors within the workflow, and the edges specify
derivation dependencies between the ports in
question.
•  The objective is to reduce the space of possible port
dependency models to those for which we have
evidence that they hold.
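One way to picture this reduction is the following Python sketch. It assumes that evidence comes in two forms, edges confirmed to hold and edges known not to hold, and that a model is a set of (input port, output port) edges. The function name and the example evidence are hypothetical.

```python
from itertools import combinations

def consistent_models(candidate_edges, must_hold=frozenset(), cannot_hold=frozenset()):
    """Enumerate models that contain every confirmed edge and no refuted edge."""
    undecided = [e for e in candidate_edges
                 if e not in must_hold and e not in cannot_hold]
    for r in range(len(undecided) + 1):
        for extra in combinations(undecided, r):
            yield frozenset(must_hold) | frozenset(extra)

edges = [("I1", "O1"), ("I1", "O2"), ("I2", "O1"), ("I2", "O2")]
models = list(consistent_models(
    edges,
    must_hold={("I1", "O1")},      # hypothetical evidence: O1 depends on I1
    cannot_hold={("I2", "O1")},    # hypothetical evidence: O1 does not depend on I2
))
print(len(models))  # 4
```

Each edge that is confirmed or refuted halves the number of remaining models, so here two pieces of evidence reduce 16 candidates to 4.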
6
Elements of the Solution
1.  Dependencies specified during workflow design: the
workflow designer indicates which dependencies he or
she knows to hold or not to hold.
2.  Infer data dependencies by examining the workflow
provenance traces.
3.  Identify data dependencies through actor probing.
7
1. Using Workflow Specifications
•  The idea here is to use dependencies that are specified at the
level of the workflow to infer derivation relationships at the
level of the provenance graph.
[Figure: a workflow with the actors source, normalize, filter, and sink; an execution with the invocations source:1, normalize:1, filter:1, and sink:1 over the data items d1 to d8.]
Fig. 1. (a) An example workflow specification, (b) an execution showing the first invocation step of each actor, and (c) the corresponding data dependencies.
The figure is taken from: S. Bowers, T. McPhillips, B. Ludäscher. Declarative Rules for Inferring Fine-Grained Data Provenance
from Scientific Workflow Execution Traces. IPAW, 2012
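A minimal Python sketch of this idea follows. It assumes that the designer declares dependencies as (actor, input port, output port) triples and that the trace records which data item was seen on which port of which invocation. The `Event` record, the port name `c`, and the sample trace are hypothetical and are not read off the figure.

```python
from collections import namedtuple

# One record per data item observed on a port during an invocation.
Event = namedtuple("Event", "data actor run port direction")

def infer_derivations(port_deps, events):
    """Lift workflow-level port dependencies to trace-level derivation pairs.

    port_deps: set of (actor, in_port, out_port) declared by the designer.
    Returns a set of (derived_item, source_item) pairs.
    """
    derived_by = set()
    for out in events:
        if out.direction != "out":
            continue
        for inp in events:
            same_invocation = (inp.actor, inp.run) == (out.actor, out.run)
            if (inp.direction == "in" and same_invocation
                    and (out.actor, inp.port, out.port) in port_deps):
                derived_by.add((out.data, inp.data))
    return derived_by

# Hypothetical declaration: port y of normalize depends on port x only.
port_deps = {("normalize", "x", "y")}
events = [
    Event("d2", "normalize", 1, "x", "in"),
    Event("d3", "normalize", 1, "c", "in"),
    Event("d5", "normalize", 1, "y", "out"),
]
print(infer_derivations(port_deps, events))  # {('d5', 'd2')}
```

Without the declared dependency, a black-box reading of the same trace would have to assume that d5 may derive from both d2 and d3.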
8
2. Examining Provenance
Traces
The idea here is to trawl provenance traces generated by the workflow,
identify derivation relationships, which may have been manually specified
by a user, and infer dependencies between actor ports.
9
2. Examining Provenance
Traces
[Figure: provenance trace over the data items T, U, V, W, X, Y, and Z.]
derBy(Z,Y) and derBy(Z,W)
We conclude that:
The output port associated with Z depends on the input ports associated with
W and Y.
10
2. Examining Provenance
Traces
More generally, given provenance traces Gs, the derBy
relations in Gs are analyzed
A derBy edge could specify the dependency between (i) an output and an
input of some actor or (ii) an output of one actor and an
input of another actor. The framework infers these constraints
using an algorithm, which is represented by the
Datalog rules 4 and 5 shown below. Based on these two
inferences, we get that port P2 depends on P1, which is
workflow-specification-level information obtained by analyzing
the derBy/2 relation from the provenance graph. Note
that (i) provides an exact dependency, whereas (ii) provides an
overestimate.
(4) pdep(P1,P2) :-
        derBy(Y1,X1), data(X1,P1,A,R,_),
        data(Y1,P2,A,R,_).
(5) pdep(P1,P2) :-
        derBy(Y1,X1), data(X1,P1,A,R,_),
        data(D,P2,A,R,_), pdep*(D,Y1),
        pdep*(X1,D).
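Rule (4) can be read as a join between the derBy relation and two data facts of the same invocation. The Python sketch below evaluates that rule only. It does not cover rule (5), and it drops the fifth, ignored argument of data/5. The port names `in1`, `in2`, and `out1` and the sample facts are hypothetical.

```python
def pdep_rule4(der_by, data):
    """Evaluate rule (4): derive port pairs from same-invocation derBy edges.

    der_by: set of (y, x) pairs meaning data item y was derived from x.
    data:   set of (item, port, actor, run) tuples.
    Returns a set of (P1, P2) pairs, read as "port P2 depends on port P1".
    """
    result = set()
    for (y1, x1) in der_by:
        for (item_in, p1, actor, run) in data:
            if item_in != x1:
                continue
            for (item_out, p2, actor2, run2) in data:
                if item_out == y1 and (actor2, run2) == (actor, run):
                    result.add((p1, p2))
    return result

der_by = {("Z", "Y"), ("Z", "W")}
data = {
    ("W", "in1", "A", 1),
    ("Y", "in2", "A", 1),
    ("Z", "out1", "A", 1),
}
print(sorted(pdep_rule4(der_by, data)))  # [('in1', 'out1'), ('in2', 'out1')]
```

The nested loops mirror the rule body literally; a Datalog engine or an indexed join would be the natural choice for traces of realistic size.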
Input port dependencies can also be inferred by analyzing
the input and output data values. The framework
computes the constraints using an algorithm, which is [...]
11
3. Inferring Port Dependencies
Through Actor Probing
•  An additional source of information that we can use
for inferring port dependencies is through actor
probing.
•  Given an actor A, to identify whether the data values
produced by a given output port of A depend on
the data values used to feed a given input port of A,
we probe the actor A by invoking it with different
data values and analyze the values delivered by the
actor.
12
3. Inferring Port Dependencies
Through Actor Probing
•  Consider that the actor A has two input ports I1 and I2 and two
output ports O1 and O2
•  To determine whether the data values delivered by the output O1
depend on those used to feed the input I2, we invoke the actor
A using multiple pairs to feed its input ports I1 and I2:
•  If the output values of O1 are the same for all invocations, then
we conclude that the output O1 may not depend on the input I2.
[...] produced by a given output port of A depend on the data values used to feed a
given input port of A, we probe the actor A by invoking it using different data
values.
The approach is perhaps better explained using an example. Consider that
the actor A has two input ports I1 and I2 and two output ports O1 and O2.
To determine if the data values delivered by the output port O1 depend on the
data values used to feed the input port I2, we invoke the actor A using multiple
pairs to feed its input ports I1 and I2:
$\langle v_{I1}, v^{i}_{I2} \rangle, \quad 1 \le i \le n$
such that the values $v^{i}_{I2}$ are different. $v_{I1}$ is a valid value for the input port I1,
i.e., it belongs to the domain of its legal values, and $v^{i}_{I2}$ are valid values for the
input port I2.
We then examine the values delivered by the output port O1: $v^{i}_{O1}$, $1 \le i \le n$.
If these output values are the same for all invocations, then we conclude that the
data values of the output port O1 may not be dependent on the data values used [...]
[Figure: actor A with input ports I1, I2 and output ports O1, O2.]
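The probing procedure can be sketched in Python as follows. The sketch assumes that an actor can be called as a function taking its input ports as keyword arguments and returning a dictionary of output values, that it is deterministic, and that output values are hashable. `output_varies` and `example_actor` are illustrative names, not part of the framework.

```python
def output_varies(actor, fixed_inputs, probed_port, probe_values, output_port):
    """Return True if varying `probed_port` changed the value on `output_port`."""
    observed = set()
    for value in probe_values:
        # Hold the other inputs fixed and vary only the probed port.
        inputs = dict(fixed_inputs, **{probed_port: value})
        outputs = actor(**inputs)
        observed.add(outputs[output_port])
    return len(observed) > 1

def example_actor(I1, I2):
    """Hypothetical actor: O1 uses only I1, O2 uses both inputs."""
    return {"O1": I1 * 2, "O2": I1 + I2}

print(output_varies(example_actor, {"I1": 3}, "I2", [1, 2, 5], "O1"))  # False
print(output_varies(example_actor, {"I1": 3}, "I2", [1, 2, 5], "O2"))  # True
```

A `False` result only suggests that the output may not depend on the probed input, because the chosen probe values might not exercise an existing dependency.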
13
Prototype
•  We have implemented the
framework: given a
workflow specification and
provenance traces of its
executions, it generates the possible
port dependency models.
•  Example: Climate Data
Collector Workflow
14
Climate Data Collector
Workflow
•  Climate Data Collector workflow represented by our framework.
•  There are 32768 possible dependency models for this workflow!
15
Climate Data Collector
Workflow
pdep(p02,p01).
pdep(p03,p01).
pdep(p04,p01).
pdep(p05,p01).
pdep(p06,p01).
pdep(p10,p07).
pdep(p10,p08).
pdep(p11,p09).
pdep(p14,p12).
pdep(p14,p13).
pdep(p17,p15).
pdep(p17,p16).
16
Climate Data Collector
Workflow
•  Using our framework, we reduced the number of possible dependency
models to 6
17
Limitations and Future Work
•  We assumed that the port dependencies are static, by
which we mean that they are valid across runs.
•  This is not the case in general unfortunately!
•  The framework that we presented in this paper needs to
be refined in order to cater for the identification of
dynamic dependencies.
•  We treat all port dependencies equally.
•  A classification of dependencies is needed to give
the user a better understanding of the kind of
relationship between the input and output ports.
18
Limitations and Future Work
•  We did not consider input port dependencies.
•  A substantial number of actors (modules) have this
kind of dependency.
[Figure: example with the labels database, sequence, and report.]
19

Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 

Tapp 2014 (belhajjame)

  • 1. UP & DOWN: Improving Provenance Precision by Combining Workflow- and Trace-Land Information. By Khalid Belhajjame, Paris-Dauphine University, LAMSADE. In collaboration with: Saumen Dey (1), David Koop (2), Tianhong Song (1), Bertram Ludäscher (1), Paolo Missier (3). (1) University of California, (2) New York University, (3) Newcastle University
  • 2. Scientific Workflows • We have recorded a dramatic increase in the number of scientists who use scientific modules as building blocks in the composition of their experiments. • In 2011, the EBI recorded 21 million invocations of the scientific modules they host. • Typically, an experiment is designed as a workflow, the steps of which represent invocations of scientific modules, which we will refer to as actors.
  • 3. Scientific Workflow and Provenance • Many scientific workflows have been instrumented to capture workflow execution events. • Typically, such events record the actors that were invoked as part of the workflow execution, as well as the data used as input and output by actor invocations. • Collected execution events can be used, amongst other applications, to understand the lineage of a given workflow result. • In practice, however, this application is limited by the black-box nature of the actors implementing the steps of the workflow.
  • 4. Data Dependencies [Figure: actor A with input ports I1, I2 and output ports O1, O2, drawn with several alternative internal port dependencies] • In general, if an actor has n input ports and m output ports, then there are $2^{mn}$ possible internal port dependency models, since each of the mn input-output pairs may or may not be a dependency. • With k such actors in a workflow, there are $2^{kmn}$ possible port dependency models of the workflow.
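The counting argument on this slide can be checked with a short sketch. The function names `count_models` and `enumerate_models` are illustrative assumptions, not part of the authors' framework; the code simply treats a dependency model as a subset of the input-output port pairs.

```python
from itertools import product


def count_models(n_inputs, m_outputs, k_actors=1):
    """Number of candidate port dependency models: 2^(k*m*n)."""
    return 2 ** (k_actors * m_outputs * n_inputs)


def enumerate_models(inputs, outputs):
    """Yield every candidate model for one actor as a set of (input, output) edges."""
    pairs = [(i, o) for i in inputs for o in outputs]
    # Each pair is either present or absent in a model.
    for mask in product([False, True], repeat=len(pairs)):
        yield {pair for pair, present in zip(pairs, mask) if present}


# The actor A of the slide: two input ports and two output ports.
models = list(enumerate_models(["I1", "I2"], ["O1", "O2"]))
print(len(models))            # 16
print(count_models(2, 2))     # 16
print(count_models(2, 2, 3))  # 4096 for three such actors
```

The exponential growth in k, m and n is what motivates pruning the space with evidence rather than enumerating it.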
  • 5. Related Work: Looking Inside the Black Box • Capturing Data Provenance using Dynamic Instrumentation (by Manolis Stamatogiannakis, Paul Groth, and Herbert Bos, IPAW 2014). • DataTracker, which captures data provenance based on taint tracking, a technique used in the security and reverse engineering fields.
  • 6. Framework for Generating Possible Data Dependency Models • We set out to build a framework for generating possible data dependency models within a workflow. • By dependency model, we mean a graph in which the nodes represent the input and output ports of the actors within the workflow, and the edges specify derivation dependencies between those ports. • The objective is to reduce the space of possible port dependency models to those for which we have evidence that they hold.
  • 7. Elements of the Solution 1. Dependencies specified during workflow design: the workflow designer indicates the dependencies that s/he knows do or do not hold. 2. Infer data dependencies by examining the workflow provenance traces. 3. Identify data dependencies through actor probing.
  • 8. 1. Using Workflow Specifications • The idea here is to use dependencies that are specified at the level of the workflow to infer derivation relationships at the level of the provenance graph. [Figure: Fig. 1. (a) An example workflow specification with the actors source, normalize, filter and sink, (b) an execution showing the first invocation step of each actor, and (c) the corresponding data dependencies.] The figure is taken from: S. Bowers, T. McPhillips, B. Ludäscher. Declarative Rules for Inferring Fine-Grained Data Provenance from Scientific Workflow Execution Traces. IPAW, 2012
  • 9. 2. Examining Provenance Traces The idea here is to trawl the provenance traces generated by the workflow, identify derivation relationships (which may have been manually specified by a user), and use them to infer dependencies between actor ports.
  • 10. 2. Examining Provenance Traces [Figure: provenance trace over the data items T, U, V, W, X, Y and Z] Given derBy(Z,Y) and derBy(Z,W), we conclude that the output port associated with Z depends on the input ports associated with W and Y.
  • 11. 2. Examining Provenance Traces More generally, given provenance traces Gs, the derBy relations in Gs are analyzed. [...] edge could specify the dependency of (i) an output and an input of some actor or (ii) an output of one actor and an input of another actor. The framework infers these constraints using an algorithm, which is represented by the datalog rules 4 and 5 as shown below. Based on these two inferences, we get that port P2 depends on P1, which is workflow-specification-level information obtained by analyzing the derBy/2 relation from the provenance graph. Note that (i) provides an exact dependency, but (ii) provides an overestimate. (4) pdep(P1,P2) :- derBy(Y1,X1), data(X1,P1,A,R,_), data(Y1,P2,A,R,_). (5) pdep(P1,P2) :- derBy(Y1,X1), data(X1,P1,A,R,_), data(D,P2,A,R,_), pdep*(D,Y1), pdep*(X1,D). Input port dependencies can also be inferred by analyzing the input and output data values. The framework computes the constraints using an algorithm, which is [...]
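Rule (4) can be read as a join over the trace facts, and the following minimal sketch evaluates it in Python rather than Datalog. The `Data` record layout (item, port, actor, run, plus an unused fifth field) is an assumption that mirrors the arity of `data/5` on the slide, and the port and actor names in the example are hypothetical. Rule (5) is not covered.

```python
from typing import NamedTuple


class Data(NamedTuple):
    """One data/5 fact: item seen at a port of an actor in a run (5th field unused here)."""
    item: str
    port: str
    actor: str
    run: str
    extra: object = None


def infer_pdep_rule4(der_by, data):
    """pdep(P1,P2) :- derBy(Y1,X1), data(X1,P1,A,R,_), data(Y1,P2,A,R,_)."""
    pdep = set()
    for y1, x1 in der_by:
        for dx in data:
            if dx.item != x1:
                continue
            for dy in data:
                # Both items must belong to the same actor A in the same run R.
                if dy.item == y1 and dy.actor == dx.actor and dy.run == dx.run:
                    pdep.add((dx.port, dy.port))
    return pdep


# Facts modelled on the previous slide: Z was derived from Y and from W.
der_by = [("Z", "Y"), ("Z", "W")]
data = [
    Data("W", "in1", "actorA", "run1"),
    Data("Y", "in2", "actorA", "run1"),
    Data("Z", "out1", "actorA", "run1"),
]
print(sorted(infer_pdep_rule4(der_by, data)))
# [('in1', 'out1'), ('in2', 'out1')]
```

Each resulting pair (P1, P2) reads "port P2 depends on port P1", matching the argument order of `pdep` in the rule.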
  • 12. 3. Inferring Port Dependencies Through Actor Probing • An additional source of information that we can use for inferring port dependencies is actor probing. • Given an actor A, to identify whether the data values produced by a given output port of A depend on the data values used to feed a given input port of A, we probe the actor A by invoking it with different data values and analyze the values it delivers.
  • 13. 3. Inferring Port Dependencies Through Actor Probing • Consider that the actor A has two input ports I1 and I2 and two output ports O1 and O2. • To determine whether the data values delivered by the output O1 depend on those used to feed the input I2, we invoke the actor A using multiple pairs of values to feed its input ports I1 and I2. • If the output values of O1 are the same for all invocations, then we conclude that the output O1 may not depend on the input I2. Excerpt from the paper: [...] produced by a given output port of A depends on the data values used to feed a given input port of A, we probe the actor A by invoking it using different data values. The approach is perhaps better explained using an example. Consider that the actor A has two input ports I1 and I2 and two output ports O1 and O2. To determine if the data values delivered by the output port O1 depend on the data values used to feed the input port I2, we invoke the actor A using multiple pairs to feed its input ports I1 and I2: $\langle v_{I1}, v^{i}_{I2} \rangle$, $1 \le i \le n$, such that the values $v^{i}_{I2}$ are different. $v_{I1}$ is a valid value for the input port I1, i.e., it belongs to the domain of its legal values, and the $v^{i}_{I2}$ are valid values for the input port I2. We then examine the values delivered by the output port O1: $v^{i}_{O1}$, $1 \le i \le n$. If these output values are the same for all invocations, then we conclude that the data values of the output port O1 may not be dependent on the data values used [...] [Figure: actor A with input ports I1, I2 and output ports O1, O2]
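The probing procedure can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the actor is modelled as a deterministic Python callable that takes its input ports as keyword arguments and returns a dict of output ports, and `probe_output_varies` and `toy_actor` are hypothetical names.

```python
def probe_output_varies(actor, fixed_inputs, probed_port, probe_values, output_port):
    """Invoke the actor with one input varied and the others held fixed.

    Returns True if the observed output changed across invocations, which is
    evidence of a dependency. False only means the output *may not* depend on
    the probed input, since other probe values could still reveal a dependency.
    """
    observed = []
    for value in probe_values:
        inputs = dict(fixed_inputs)
        inputs[probed_port] = value  # vary only the probed port
        outputs = actor(**inputs)
        observed.append(outputs[output_port])
    return any(v != observed[0] for v in observed[1:])


def toy_actor(I1, I2):
    """Hypothetical actor: O1 depends only on I1, O2 depends on both inputs."""
    return {"O1": I1 * 2, "O2": I1 + I2}


# Hold I1 fixed at a valid value and feed different values to I2.
print(probe_output_varies(toy_actor, {"I1": 3}, "I2", [1, 2, 3], "O1"))  # False
print(probe_output_varies(toy_actor, {"I1": 3}, "I2", [1, 2, 3], "O2"))  # True
```

A `False` result prunes nothing with certainty, which is why the slide words the conclusion as "may not depend"; a `True` result rules out every model that lacks the corresponding edge.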
  • 14. Prototype • We have an implementation of the framework which, given a workflow specification and provenance traces of its executions, generates the possible port dependency models. • Example: Climate Data Collection Workflow
  • 15. Climate Data Collector Workflow • Climate Data Collector workflow represented by our framework. • There are 32768 possible dependency models for this workflow!
  • 17. Climate Data Collector Workflow • Using our framework, we reduced the number of possible dependency models to 6.
  • 18. Limitations and Future Work • We assumed that the port dependencies are static, by which we mean that they are valid across runs. • Unfortunately, this is not the case in general! • The framework that we presented in this paper needs to be refined in order to cater for the identification of dynamic dependencies. • We treat all port dependencies equally. • A classification of dependencies is needed to give the user a better understanding of the kind of relationship between the input and output ports.
  • 19. Limitations and Future Work • We did not consider input port dependencies. • A substantial number of actors (modules) have this kind of dependency. [Figure: example actor with the labels database, sequence and report]