Report on Provenance Challenge 3, at the PC3 meeting, 2009

Amsterdam, June 10-11, 2009



  1. Provenance Challenge 3 and Taverna. Paolo Missier, Information Management Group, School of Computer Science, University of Manchester, UK. Provenance Challenge 3 meeting, Amsterdam, June 10-11, 2009.
  6. Interpreting the challenge:
     1. Implement the challenge workflow ➡ expose the same amount of data that is visible in the Trident version of the workflow.
     2. Answer the core (and optional) challenge queries ➡ is the current Taverna Provenance (TP) query model sufficient to answer the queries?
     3. Produce and export OPM ➡ export the smallest OPM graph that contains the query answer (the graph for the entire run is a special "degenerate" case).
     4. Import and consume OPM ➡ map an OPM graph to an instance of a TP causal graph, then use TP queries to answer the same challenge queries as above.
  11. Our challenges:
     ➡ Expose the same amount of data that is visible in the Trident version of the workflow: map the control flow to a pure dataflow model.
     ➡ Is the current Taverna Provenance (TP) query model sufficient to answer the queries? The challenge queries gave us good new requirements.
     ➡ Export the smallest OPM graph that contains the query answer: OPM generation is now fully integrated into TP query answering.
     ➡ Map an OPM graph to an instance of a TP causal graph and use TP queries to answer the same challenge queries: TP query processing requires a representation of the originating workflow, which required generating the "minimal plausible originating dataflow" (MPOD) for the OPM graph.
  15. The challenge workflow as a Taverna dataflow. One step produces a list; the downstream processors consume one item of the list at a time, so the resulting iteration over the lists occurs automatically. The workflow contains both data links and control links ("first LoadCSVFileIntoTable, then UpdateComputedColumns").
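The implicit iteration described on this slide — a processor declared to consume a single item, when handed a list, is applied to each item and the results are collected back into a list — can be sketched as follows. This is an illustrative model only, not Taverna's actual iteration-strategy engine; the function and parameter names are invented for the sketch.

```python
def iterate(processor, value, depth_expected=0):
    """Apply `processor` to `value`, recursing into lists whenever the
    value is nested more deeply than the processor's input port expects.

    depth_expected: the list depth the input port declares (0 = a single
    item). Extra depth triggers implicit iteration, and the results are
    collected back into a list of the same shape.
    """
    def depth(v):
        return 1 + depth(v[0]) if isinstance(v, list) and v else 0

    if depth(value) > depth_expected:
        return [iterate(processor, item, depth_expected) for item in value]
    return processor(value)

# A processor that consumes one item at a time...
square = lambda x: x * x

# ...fed a list: the iteration over the list happens automatically.
print(iterate(square, [1, 2, 3]))  # [1, 4, 9]
```

The same rule applied recursively also covers nested lists, which mirrors how a deeper-than-expected input produces a correspondingly nested output.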
  19. TP query model and new requirements:
     - Queries are purely structural, with no semantics.
     - Queries can be answered only at the same level of granularity as that of the service encapsulation of the data in the workflow.
     - Challenge query 1 ("detailed version") seems to require more knowledge of data dependencies than can be obtained from this structural model.
     - The level-of-detection ID is not available unless the LoadCSVFileIntoTable black box is opened up, i.e., CALL SYSCS_UTIL.SYSCS_IMPORT_TABLE (?,?,?,?,?,?,?).
     (Figure: trace the lineage of values observed here... at these points in the workflow.)
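A purely structural lineage query of the kind this slide describes can be sketched as a backward traversal over recorded data dependencies, with no interpretation of what the processors compute. The dependency records and names below are illustrative, not the actual TP schema.

```python
# Structural lineage: each produced value is recorded together with the
# processor that produced it and the values it was derived from.
# Identifiers here are hypothetical examples.
dependencies = {
    # value produced: (producing processor, values it was derived from)
    "row:1":   ("LoadCSVFileIntoTable", ["csv:file"]),
    "table:1": ("UpdateComputedColumns", ["row:1"]),
}

def lineage(value):
    """Return every (processor, value) pair on the path from `value`
    back to the workflow inputs -- structure only, no semantics."""
    trace = []
    stack = [value]
    while stack:
        v = stack.pop()
        if v in dependencies:
            proc, sources = dependencies[v]
            trace.append((proc, v))
            stack.extend(sources)
    return trace

print(lineage("table:1"))
# [('UpdateComputedColumns', 'table:1'), ('LoadCSVFileIntoTable', 'row:1')]
```

Because the traversal only follows recorded edges, anything hidden inside a black-box service (such as the import-table call above) is invisible to it, which is exactly the granularity limitation the slide notes.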
  21. Example: query 3.
     query.vars = LoadCSVFileIntoTable / LoadCSVFileIntoTableOutput / 1
     query.processors = ALL
  24. Integrated OPM generation as part of TP query:
     ➡ The answer to any TP query can be viewed as an OPM graph.
     ➡ It is encoded as RDF/XML using the Tupelo provenance API.
  26. MPOD generation rules - I:
     - MPOD = Minimal Plausible Originating Dataflow, induced from an OPM graph.
     - The TP query model requires a workflow structure; this is a first approximation, subject to refinement.
     - A wasGeneratedBy(R) edge A → P maps to an output port R on process P; a used(R) edge P → A maps to an input port R on P.
     (Figure: an example OPM graph and the dataflow induced from its wgb and used edges.)
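The two port-generation rules above can be sketched as a small induction procedure: every wasGeneratedBy(R) edge contributes an output port, every used(R) edge an input port, and a data link is drawn wherever the same artifact is generated by one process and used by another. The OPM fragment and function name below are illustrative.

```python
# Inducing a minimal plausible originating dataflow (MPOD) from an OPM
# graph. Example edges are hypothetical.
was_generated_by = [  # (artifact, role, process): A --wgb(R)--> P
    ("A1", "R1", "P1"), ("A2", "R2", "P2"),
]
used = [              # (process, role, artifact): P --used(R)--> A
    ("P3", "R3", "A1"), ("P3", "R4", "A2"),
]

def induce_mpod(wgb, used_edges):
    ports = {}  # process -> {"in": input ports, "out": output ports}
    def p(proc):
        return ports.setdefault(proc, {"in": set(), "out": set()})
    producer = {}
    for artifact, role, proc in wgb:
        p(proc)["out"].add(role)          # rule 1: output port R on P
        producer[artifact] = (proc, role)
    links = []  # data links: (producer, out-port) -> (consumer, in-port)
    for proc, role, artifact in used_edges:
        p(proc)["in"].add(role)           # rule 2: input port R on P
        if artifact in producer:          # shared artifact => data link
            links.append((producer[artifact], (proc, role)))
    return ports, links

ports, links = induce_mpod(was_generated_by, used)
print(sorted(ports["P3"]["in"]))  # ['R3', 'R4']
print(links[0])                   # (('P1', 'R1'), ('P3', 'R3'))
```

The result has exactly the shape the slide's example graph suggests: P1 and P2 each feed one input port of P3.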
  27. MPOD generation rules - II:
     - wasDeterminedFrom is a derived property: A2 wasDeterminedFrom A1 is usually inferred, i.e., there exist P, R1, R2 such that A2 wasGeneratedBy(R2) P and P used(R1) A1.
     - Note 1: if the corresponding "wasGeneratedBy" and "used" edges are not found, then new P, R1, R2 are created and added to the graph. However, in all cases encountered so far wasDeterminedFrom was inferred: P, R1, R2 appear in existing wgb(R) and used(R) edges.
     - Note 2: wasControlledBy and wasTriggeredBy are ignored for now.
     - Note 3: a separate MPOD is created for each account in the OPM graph.
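The inference in the first bullet can be sketched directly: A2 is determined from A1 whenever some process both used A1 and generated A2. The fallback in Note 1 (inventing fresh P, R1, R2 when no such process exists) is omitted here; edges and names are illustrative.

```python
# Inferring derived wasDeterminedFrom edges from existing wgb and used
# edges, as on the slide. Example data is hypothetical.
wgb = [("A2", "R2", "P")]          # artifact A2 wasGeneratedBy(R2) P
used_edges = [("P", "R1", "A1")]   # process P used(R1) A1

def infer_wdf(wgb_edges, used_edges):
    generated_by = {a: p for a, _r, p in wgb_edges}   # A2 -> generating P
    uses = {}                                         # P -> artifacts it used
    for p, _r, a in used_edges:
        uses.setdefault(p, set()).add(a)
    edges = set()
    for a2, p in generated_by.items():
        for a1 in uses.get(p, ()):
            edges.add((a2, a1))    # A2 wasDeterminedFrom A1
    return edges

print(infer_wdf(wgb, used_edges))  # {('A2', 'A1')}
```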
  28. Importing OPM for the challenge:
     - OPM contributions successfully imported so far: UC Davis, NCSA, SOTON (links on the PC3 / UoM MPOD wiki page).
     - Example (UC Davis): which operation executions were strictly necessary for the Image table to contain a particular (non-computed) value?
     query.variables: LoadCSVFileIntoTable:2 / out
     query.processors = ALL
  29. Example: query 3 result.
     - Ideally, the imported graph plus its MPOD allow provenance queries to be submitted to the imported graph just as if it were a native TP graph.
     - The answer is itself viewed as a new OPM graph.
  31. Lossless mappings using OPM and MPOD (TP = Taverna Provenance model):
     - Provenance query: an execution of f yields trace(f); a query Q over it returns Q(trace(f)), the subgraph containing the query answer only, viewed as OPM.
     - Export to OPM: export is just a query exp(trace(f)) that returns the entire trace.
     - Import from OPM: an imported graph exp(trace(f)) is mapped back to a TP instance via its MPOD, and can be exported again.
     - When is this transformation lossless?
  35. More on losslessness and OPM. Starting from a Taverna dataflow f: execute to obtain trace(f), export it to OPM as exp(trace(f)), import it back into TP as f', trace(f'), and export again. The question is whether the two exports coincide. This is indeed lossless when f is itself a Taverna dataflow:
     export(import(export(trace(f)))) =?= export(trace(f))
     (requires proof).
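The losslessness condition above is a round-trip property, and it can be stated as an executable check. The trace representation and the two mappings below are toy stand-ins for the TP and OPM models, invented for the sketch; the point is only the shape of the equation.

```python
# Round-trip losslessness: export(import(export(trace)))) == export(trace).
def export_opm(trace):
    # TP trace -> OPM: keep only the causal edges (toy projection)
    return frozenset(trace["edges"])

def import_opm(opm_graph):
    # OPM -> TP: rebuild a trace; TP-internal detail not represented in
    # OPM (e.g. iteration structure) cannot be recovered, so it is empty
    return {"edges": set(opm_graph), "internal": {}}

trace = {"edges": {("P1", "P2"), ("P2", "P3")}, "internal": {"iters": 3}}

round_trip = export_opm(import_opm(export_opm(trace)))
print(round_trip == export_opm(trace))  # True
```

Note that the round trip is lossless only up to what export preserves: the "internal" part of the trace is gone after import, yet the two exports still agree, which is exactly the sense in which the slide's equation can hold even though import does not reconstruct the original dataflow in full.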
