Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010

Missier, P., Ludäscher, B., Bowers, S., Anand, M. K., Altintas, I., Dey, S., et al. (2010). Linking Multiple Workflow Provenance Traces for Interoperable Collaborative Science. Proc. 5th Workshop on Workflows in Support of Large-Scale Science (WORKS).

  • Well-supported: Kepler, Taverna, Pegasus, VisTrails...

    1. <ul><li>Paolo Missier (1), Bertram Ludäscher (2), Shawn Bowers (3), </li></ul><ul><li>Saumen Dey (2), Anandarup Sarkar (3), Biva Shrestha (4), </li></ul><ul><li>Ilkay Altintas (5), Manish Kumar Anand (5), Carole Goble (1) </li></ul>Linking Multiple Workflow Provenance Traces for Interoperable Collaborative Science <ul><li>(1) School of Computer Science, University of Manchester </li></ul><ul><li>(2) Dept. of Computer Science, University of California, Davis </li></ul><ul><li>(3) Dept. of Computer Science, Gonzaga University </li></ul><ul><li>(4) Dept. of Computer Science, Appalachian State University </li></ul><ul><li>(5) San Diego Supercomputer Center, University of California, San Diego </li></ul>WORKS’10, New Orleans
    2. Context: Data Sharing <ul><li>Implicit collaboration through data sharing </li></ul><ul><ul><li>Alice uses n-th generation input dataset x and produces (n+1)-st generation output dataset z </li></ul></ul><ul><ul><li>… as part of run R<sub>A</sub> of workflow W<sub>A</sub> </li></ul></ul><ul><ul><li>… output z is published in some data-space. </li></ul></ul><ul><ul><li>Bob uses Alice’s output z and produces (n+2)-nd generation dataset v </li></ul></ul><ul><ul><li>… using workflow W<sub>B</sub>, possibly with pre-processing f </li></ul></ul><ul><ul><li>Alice and Bob may not know each other </li></ul></ul>
    3. Motivation: Virtual Joint Experiments <ul><li>How do we ensure that Charlie gets a complete account of the history of W<sub>C</sub>’s outputs? </li></ul><ul><li>How do we ensure that Alice gets her due (partial) credit when Charlie uses Bob’s data v? </li></ul><ul><ul><li>→ traces T<sub>A</sub> and T<sub>B</sub> will be critical </li></ul></ul><ul><ul><li>→ need to compose them to obtain T<sub>C</sub> </li></ul></ul>We can view the composition W<sub>C</sub> as a new, virtual workflow
    4. Provenance Composition: the Data Tree of Life (DToL) <ul><li>We can formulate our questions in terms of provenance of the datasets produced by virtual workflow W<sub>C</sub>: </li></ul><ul><ul><li>What is the complete provenance of v? </li></ul></ul><ul><li>Answering the question requires tracing v’s derivation all the way to x </li></ul><ul><li>But, to achieve this, we need to ensure: </li></ul><ul><ul><li>T<sub>A</sub> and T<sub>B</sub> are properly connected </li></ul></ul><ul><ul><li>Provenance queries run seamlessly over and across T<sub>A</sub> and T<sub>B</sub> </li></ul></ul>
    5. Test scenario: 1st Provenance Challenge Workflow <ul><li>DataONE Summer-of-Code Project </li></ul><ul><ul><li>Split First Provenance Challenge workflow at various points </li></ul></ul><ul><ul><li>Publish Part-I from system X, use as input for Part-II on system Y </li></ul></ul><ul><ul><ul><li>X, Y in {Kepler/SDF, Kepler/COMAD, Taverna} </li></ul></ul></ul>
    6. Common Model of Provenance (approx. OPM). Data provenance for a single workflow run is well understood. T<sub>A</sub> is a trace instance of W<sub>A</sub>: there is a homomorphism h : T<sub>A</sub> ➔ W<sub>A</sub>, e.g. h(x<sub>1</sub> ➔ a<sub>1</sub>) = h(x<sub>2</sub> ➔ a<sub>2</sub>) = X ➔ A, h(a<sub>1</sub> ➔ y<sub>1</sub>) = h(a<sub>2</sub> ➔ y<sub>2</sub>) = A ➔ Y, ... <ul><li>Workflow spec: digraph </li></ul><ul><ul><ul><li>W = (V<sub>W</sub>, E<sub>W</sub>) </li></ul></ul></ul><ul><li>V<sub>W</sub> = A ∪ C </li></ul><ul><li>actors A (processors) </li></ul><ul><li>channels C (FIFO data buffers) </li></ul><ul><li>E<sub>W</sub> = E<sub>in</sub> ∪ E<sub>out</sub> </li></ul><ul><li>in edges E<sub>in</sub> ⊆ C × A </li></ul><ul><li>out edges E<sub>out</sub> ⊆ A × C </li></ul><ul><li>Trace graph: acyclic digraph </li></ul><ul><ul><ul><li>T = (V<sub>T</sub>, E<sub>T</sub>) </li></ul></ul></ul><ul><li>V<sub>T</sub> = I ∪ D (invocations I, data D) </li></ul><ul><li>E<sub>T</sub> = E<sub>read</sub> ∪ E<sub>write</sub> </li></ul><ul><li>read edges E<sub>read</sub> ⊆ D × I </li></ul><ul><li>write edges E<sub>write</sub> ⊆ I × D </li></ul>
    7. Data and Invocation Dependencies (ddep, idep) <ul><li>read, write are natural observables for a workflow run </li></ul><ul><li>possible additional relations (recorded or inferred): </li></ul><ul><ul><li>invocation dependencies, explicit or via read/write: “a<sub>2</sub> depends on a<sub>1</sub>” because a<sub>1</sub> has written data d and a<sub>2</sub> has read d </li></ul></ul><ul><ul><li>data dependencies, explicit or via read/write: “d<sub>2</sub> depends on d<sub>1</sub>” because some actor invocation a read d<sub>1</sub> prior to writing d<sub>2</sub> </li></ul></ul>(Note: in some models of computation the rules above are not correct)
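As a rough sketch of how the two inferred relations could be computed from the observables, the Python below joins `read` and `write` pairs. The tuple layouts and the function name `infer_dependencies` are assumptions made for this example, and the sketch ignores the "prior to" timing condition, so it inherits the slide's caveat that the rules do not hold in every model of computation.

```python
def infer_dependencies(read, write):
    """Infer idep and ddep from observed read/write events.

    read:  set of (data, invocation) pairs
    write: set of (invocation, data) pairs
    Both results hold (dependent, dependency) pairs.
    """
    # a2 depends on a1 if a1 wrote some d that a2 read
    idep = {(a2, a1) for (a1, d) in write for (d_read, a2) in read if d == d_read}
    # d2 depends on d1 if some invocation read d1 and wrote d2
    ddep = {(d2, d1) for (d1, a) in read for (a_write, d2) in write if a == a_write}
    return idep, ddep


# Illustrative run: a1 reads x1 and writes y1; b1 reads y1 and writes z1.
read = {("x1", "a1"), ("y1", "b1")}
write = {("a1", "y1"), ("b1", "z1")}
idep, ddep = infer_dependencies(read, write)
```

Here `idep` contains the single pair `("b1", "a1")`, and `ddep` contains `("y1", "x1")` and `("z1", "y1")`.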
    8. Provenance queries <ul><li>Local (“non-closure”) queries on a trace T: </li></ul><ul><ul><li>Find the data and traces published by Alice / Bob </li></ul></ul><ul><ul><li>Find the inputs, outputs, and intermediate data products of T </li></ul></ul><ul><ul><li>Find (selected) actors and channels used in T </li></ul></ul><ul><ul><li>Find inputs and outputs of an invocation a<sub>i</sub> in T </li></ul></ul>Easy and not very interesting. E.g. the answer to (3) is just the set of nodes in h(T) <ul><li>Closure queries: </li></ul><ul><li>operate on the transitive closure ddep* over ddep </li></ul><ul><li>suppose ddep* spans multiple traces T<sub>A</sub>, T<sub>B</sub> </li></ul><ul><li>we must define the standard query so that it operates on the composition of T<sub>A</sub>, T<sub>B</sub> </li></ul>
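A closure query over a single trace can be sketched as a graph traversal. The example below assumes ddep is stored as a set of (dependent, dependency) pairs; that layout and the function name `ddep_closure` are illustrative choices, not something the slides prescribe.

```python
def ddep_closure(ddep, start):
    """Return every data item that `start` transitively depends on (ddep*)."""
    # Index the relation: dependent -> set of direct dependencies.
    deps_of = {}
    for dependent, dependency in ddep:
        deps_of.setdefault(dependent, set()).add(dependency)

    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for dep in deps_of.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


# Illustrative single-trace dependencies: z depends on y, y depends on x.
single_trace_ddep = {("z", "y"), ("y", "x")}
lineage_of_z = ddep_closure(single_trace_ddep, "z")
```

The traversal stops wherever the recorded relation stops, which is why a query confined to one trace cannot see past the point where data was handed over from another run.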
    9. Issues in Provenance Composition <ul><li>Main problems and approaches: </li></ul><ul><ul><li>heterogeneity of both workflow and provenance models </li></ul></ul>Closure queries now must span multiple provenance traces <ul><li>I - Trace disconnect: </li></ul><ul><ul><li>problem: traces that should “join” on the shared data are really disconnected </li></ul></ul><ul><ul><li>approach: make the data sharing process itself provenance-aware </li></ul></ul><ul><li>II - Model heterogeneity: </li></ul><ul><ul><li>problem: different workflow and provenance models </li></ul></ul><ul><ul><li>approach: common provenance model with local ➔ global mapping </li></ul></ul><ul><li>III - Data identifiers mismatch: </li></ul><ul><ul><li>problem: different workflows adopt different data identification schemes </li></ul></ul><ul><ul><li>approach: assert data equivalence as part of provenance </li></ul></ul>
    10. Part I – Provenance Stitching <ul><li>The missing link: make every data copy step provenance-aware </li></ul><ul><li>r: data reference in store S </li></ul><ul><li>trace-equivalence of data items d in S, d’ in S’: d ≃ d’ if d’ is obtained by copying d from S to S’ </li></ul>
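One way to picture a provenance-aware copy step is a copy function that records the equivalence assertion as a side effect. Everything in this Python sketch is hypothetical: the `Store` class, the way new references are generated, and the global `equivalences` set are assumptions made only to show the idea, not the prototype's actual design.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """A hypothetical data store mapping references to data values."""
    name: str
    items: dict = field(default_factory=dict)


# Recorded assertions d ≃ d', each as ((store, ref), (store, ref)).
equivalences = set()


def copy(r, source, target):
    """Copy the item referenced by r from source to target and record d ≃ d'."""
    new_ref = f"{target.name}/{len(target.items)}"   # assumed naming scheme
    target.items[new_ref] = source.items[r]
    equivalences.add(((source.name, r), (target.name, new_ref)))
    return new_ref


# Illustrative use: copy item "r1" from store S into store S2.
s = Store("S", {"r1": "dataset z"})
s2 = Store("S2")
r_copy = copy("r1", s, s2)
```

After the call, `equivalences` holds one pair linking `("S", "r1")` to the new reference in `S2`, which is the link a later query can use to join two otherwise disconnected traces.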
    11. Part II – Mapping to a Common Provenance Model <ul><li>Mapping rules (= code, queries) defined from Kepler and Taverna provenance models to common model (details omitted): </li></ul>In the result T<sub>P</sub>, each reference r found in T<sub>S</sub> is replaced with ρ(r) <ul><li>OPM used as intermediate target model </li></ul><ul><li>… doesn’t “nail” everything </li></ul><ul><ul><li>a mixed blessing … </li></ul></ul><ul><li>… but team-work made it work! </li></ul>
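The replacement of each reference r by ρ(r) can be sketched as a simple rewrite over trace edges. The example assumes a trace is a set of (source, target) pairs and that ρ is a dictionary; the identifiers such as `kepler:42` and `pub:x` are invented for illustration, and nodes without an entry in ρ (for instance invocations) are left unchanged by assumption.

```python
def rename_trace(trace_edges, rho):
    """Return a copy of the trace in which each reference r becomes rho[r]."""
    return {(rho.get(u, u), rho.get(v, v)) for u, v in trace_edges}


# Hypothetical system-specific trace T_S and renaming map rho.
t_s = {("kepler:42", "a1"), ("a1", "kepler:43")}
rho = {"kepler:42": "pub:x", "kepler:43": "pub:z"}
t_p = rename_trace(t_s, rho)
```

The resulting `t_p` has the same shape as `t_s` but refers to data only through the public identifiers.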
    12. Part III – Data Identifier Reconciliation <ul><li>We have seen that the copy operation r’ = copy(r, S, S’) on shared data store S generates a data equivalence assertion </li></ul><ul><li>It also keeps track of ID mappings, which are added to a renaming map from a set of S-specific references to a set of public references </li></ul>
    13. Extended (across-runs) Provenance Queries <ul><li>Closure queries are redefined on the extended provenance trace that includes trace-equivalences d ≃ d’ </li></ul>
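The slide's own definition is not preserved in this transcript, so the following Python is only one plausible reading: the closure traversal follows ddep edges and, in addition, steps from a copy d’ back to its original d whenever d ≃ d’ has been asserted. The pair layouts, the direction of the equivalence step, and the name `extended_closure` are all assumptions.

```python
def extended_closure(ddep, equiv, start):
    """Transitive lineage of `start` over ddep plus equivalences d ≃ d'.

    ddep:  set of (dependent, dependency) pairs, possibly from several traces
    equiv: set of (original, copy) pairs asserted by copy steps
    """
    predecessors = {}
    for dependent, dependency in ddep:
        predecessors.setdefault(dependent, set()).add(dependency)
    for original, copied in equiv:
        # Assumption: a copy's history continues with the original item.
        predecessors.setdefault(copied, set()).add(original)

    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for prev in predecessors.get(node, ()):
            if prev not in seen:
                seen.add(prev)
                stack.append(prev)
    return seen


# Alice's trace: z depends on x. Bob's trace: v depends on z2, a copy of z.
ddep_all = {("z", "x"), ("v", "z2")}
equiv = {("z", "z2")}
lineage_of_v = extended_closure(ddep_all, equiv, "v")
```

Without the equivalence the lineage of `v` ends at `z2`; with it, the traversal continues through `z` to Alice's input `x`.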
    14. Prototype Architecture
    15. Conclusions 1/2 <ul><li>In theory, provenance interoperability should be solved/easy using e.g. OPM </li></ul><ul><li>In practice it isn’t (cf. Provenance Challenge workshops), e.g. </li></ul><ul><ul><li>different mappings to OPM </li></ul></ul><ul><ul><li>different identifier schemes </li></ul></ul><ul><ul><li>traces broken “at the seams” </li></ul></ul><ul><li>Summer-of-code DToL prototype demonstrates feasibility of provenance-aware collaboration / workflow interoperation through data </li></ul><ul><ul><li>Extends potential of provenance analysis beyond isolated workflow-based experiments </li></ul></ul><ul><li>Findings relevant for data preservation in </li></ul><ul><ul><li>Tracing data access is key </li></ul></ul>
    16. Conclusions 2/2 <ul><li>DataONE: </li></ul><ul><ul><li>http://www.dataone.org/ </li></ul></ul><ul><li>Data Tree-of-Life (DToL Summer Project) </li></ul><ul><ul><li>https://sites.google.com/site/datatolproject/ </li></ul></ul><ul><li>Runtime wf systems interoperability can be very hard </li></ul><ul><ul><li>… and benefits not clear (unless “layered” approach w/ different roles of wf systems) </li></ul></ul><ul><ul><li>wf provenance interoperability to the rescue! </li></ul></ul><ul><li>Next Steps: </li></ul><ul><ul><li>DataONE Working Group on Provenance for Scientific Workflows </li></ul></ul><ul><ul><li>Develop DOPM (DataONE Provenance Model; OPM++) </li></ul></ul>
