Invited talk @ DCC09 workshop

Presentation at the REPRISE workshop, Digital Curation Conference 2009, London


  1. 1. Research objects, myExperiment, and Open Provenance for collaborative E-science
     REPRISE workshop - IDCC’09, London
     Paolo Missier, Information Management Group, School of Computer Science, University of Manchester, UK
     with additional material by Sean Bechhofer and Matthew Gamble, e-Labs design group, University of Manchester
     (Logo text: Scientific Workflow Management System; Janus Provenance)
  4. 4. Momentum on sharing and collaboration
     Special issue of Nature on Data Sharing (Sept. 2009)
       • timeliness requires rapid sharing
       • repurposing
       • the Human Genome project use case
     • Ongoing debate in several communities
       – Clinical trials [1]
       – Earth Sciences: ESIP, data preservation / stewardship, 2009
       – Long established in some communities: Atmospheric sciences, 1998 [2]
     • Science Commons recommendations for Open Science
       – Open Science recommendations from Science Commons (July 2008) [link]
     The Toronto group: Toronto International Data Release Workshop Authors, "Prepublication data sharing", Nature 461, 168-170 (10 September 2009), doi:10.1038/461168a; published online 9 September 2009
     http://www.nature.com/news/specials/datasharing/index.html
  11. 11. Reference scenario
     (Diagram labels: workflow specification + input dataset; workflow execution; outcome (data); outcome (provenance); Research Object packaging; browse, query, unbundle, reuse; an open "?" next to the execution step)
     Data-mediated implicit collaboration
  12. 12. Collaboration through data
     What is needed for B to make sense of A’s data?
     1. Packaging:
        – standards for self-descriptive data + metadata bundles: Research Objects
     2. Content:
        – data format standardization efforts
        – metadata representation
          • process provenance
            – workflow provenance
     3. Container:
        – a repository for Research Objects
  22. 22. Paul’s Pack, Paul’s Research Object: QTL
     (Diagram labels)
     Items: Workflow 16, Workflow 13, Results, Logs, Slides, Paper, Common pathways
     Relations between items: produces, included in, published in, feeds into
     Layers: Representation, Domain Relations, Aggregation, Metadata
  23. 23. ORE: representing generic aggregations
     (Diagram labels: Resource Map; Data structure (descriptor))
     http://www.openarchives.org/ore/1.0/primer.html section 4
     A. Pepe, M. Mayernik, C.L. Borgman, and H. Van de Sompel, "From Artifacts to Aggregations: Modeling Scientific Life Cycles on the Semantic Web," Journal of the American Society for Information Science and Technology (JASIST), to appear, 2009.
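The aggregation idea behind Research Objects can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ResearchObject` class and its methods are hypothetical, plain Python structures stand in for the ORE vocabulary and its RDF serializations, and the specific relations between items are guesses, since the slide's diagram is not recoverable from this transcript.

```python
from dataclasses import dataclass, field


@dataclass
class ResearchObject:
    """Hypothetical aggregation: a named bundle of resources plus typed relations."""
    name: str
    resources: set = field(default_factory=set)
    relations: list = field(default_factory=list)  # (subject, predicate, object)

    def aggregate(self, resource):
        """Add a resource to the bundle."""
        self.resources.add(resource)

    def relate(self, subject, predicate, obj):
        """Record a relation; both ends must already be aggregated."""
        for r in (subject, obj):
            if r not in self.resources:
                raise ValueError(f"{r!r} is not aggregated by {self.name!r}")
        self.relations.append((subject, predicate, obj))

    def objects_of(self, subject, predicate):
        """Return the objects related to `subject` by `predicate`, in insertion order."""
        return [o for s, p, o in self.relations if s == subject and p == predicate]


def build_example():
    """Illustrative bundle; the edges below are assumptions, not the slide's exact graph."""
    ro = ResearchObject("QTL Research Object")
    for r in ["Workflow 16", "Results", "Logs", "Slides", "Paper"]:
        ro.aggregate(r)
    ro.relate("Workflow 16", "produces", "Results")
    ro.relate("Workflow 16", "produces", "Logs")
    ro.relate("Results", "included in", "Slides")
    ro.relate("Results", "published in", "Paper")
    return ro
```

Requiring both ends of a relation to be aggregated first keeps the bundle self-descriptive: everything a relation mentions travels with the object.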
  26. 26. Content: Workflow provenance
     A detailed trace of workflow execution
     - tasks performed, data transformations
     - inputs used, outputs produced
     (Figure: workflow fragment with nodes lister, get pathways by genes1, merge pathways, gene_id, concat gene pathway ids, output, pathway_genes)
  27. 27. Why provenance matters, if done right
     • To establish quality, relevance, trust
     • To track information attribution through complex transformations
     • To describe one’s experiment to others, for understanding / reuse
     • To provide evidence in support of scientific claims
     • To enable post hoc process analysis for improvement, re-design
     The W3C Incubator on Provenance has been collecting numerous use cases:
     http://www.w3.org/2005/Incubator/prov/wiki/Use_Cases#
  28. 28. What users expect to learn
     • Causal relations:
       - which pathways come from which genes?
       - which processes contributed to producing an image?
       - which process(es) caused data to be incorrect?
       - which data caused a process to fail?
     • Process and data analytics:
       – analyze variations in output vs an input parameter sweep (multiple process runs)
       – how often has my favourite service been executed? on what inputs?
       – who produced this data?
       – how often does this pathway turn up when the input genes range over a certain set S?
     (Figure: workflow fragment with nodes lister, get pathways by genes1, merge pathways, gene_id, concat gene pathway ids, output, pathway_genes)
  29. 29. Open Provenance Model
     • graph of causal dependencies involving data and processors
     • not necessarily generated by a workflow!
     • v1.0.1 currently open for comments
     Goal: standardize causal dependencies to enable provenance metadata exchange
     (Figure: edge types A --wasGeneratedBy(R)--> P and P --used(R)--> A; example graph over artifacts A1-A4 and processes P1-P3 with edges wgb(R1), wgb(R2), used(R3), used(R4), wgb(R5), wgb(R6))
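A minimal sketch of such a causal graph, and of the "which pathways come from which genes?" style of question, might look as follows. The `OPMGraph` class is hypothetical and is not an API from the OPM specification; it keeps only the two edge types named on the slide (used, wasGeneratedBy) and assumes each artifact is generated by at most one process. The node names are borrowed from the workflow figure purely for illustration.

```python
from collections import defaultdict


class OPMGraph:
    """Hypothetical, simplified OPM-style graph of artifacts and processes."""

    def __init__(self):
        self.used = defaultdict(set)   # process -> {(artifact, role)}
        self.was_generated_by = {}     # artifact -> (process, role)

    def add_used(self, process, artifact, role):
        """Record the edge: process used(role) artifact."""
        self.used[process].add((artifact, role))

    def add_was_generated_by(self, artifact, process, role):
        """Record the edge: artifact wasGeneratedBy(role) process."""
        self.was_generated_by[artifact] = (process, role)

    def lineage(self, artifact):
        """Return every artifact that `artifact` causally depends on."""
        seen = set()
        stack = [artifact]
        while stack:
            current = stack.pop()
            if current not in self.was_generated_by:
                continue  # a workflow input: nothing upstream
            process, _role = self.was_generated_by[current]
            for source, _role in self.used[process]:
                if source not in seen:
                    seen.add(source)
                    stack.append(source)
        return seen


def example_graph():
    """Illustrative two-step pipeline; the edges are assumptions."""
    g = OPMGraph()
    g.add_used("get_pathways_by_genes", "gene_id", "in")
    g.add_was_generated_by("pathways", "get_pathways_by_genes", "out")
    g.add_used("merge_pathways", "pathways", "in")
    g.add_was_generated_by("pathway_genes", "merge_pathways", "out")
    return g
```

Calling `example_graph().lineage("pathway_genes")` walks backwards through both processes and returns the intermediate `pathways` artifact together with the original `gene_id` input.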
  30. 30. The 3rd provenance challenge
     • Chosen workflow from the Pan-STARRS project
       – Panoramic Survey Telescope & Rapid Response System
     • http://twiki.ipaw.info/bin/view/Challenge/ThirdProvenanceChallenge
     • Goal:
       – demonstrate “provenance interoperability” at query level
  31. 31. The 3rd provenance challenge workflow
     (Figure: workflow steps read input file, load database, verify)
  35. 35. OPM and query-interoperability
     Team A: encode W as WA; execute (run WA) to obtain prov(WA); query Q gives Q(prov(WA)); export gives OPM(prov(WA))
     Team B: import gives PWA = import(OPM(prov(WA))); execute query Q gives Q(PWA)
     ?: how does Q(PWA) compare with Q(prov(WA))
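The round trip in this scenario can be sketched as below. This is an assumption-laden illustration, not any team's implementation: JSON stands in for the exchange format, the function names are hypothetical, and edges are reduced to `(kind, effect, cause, role)` tuples. The point is only that interoperability at query level means the same query gives the same answer on both sides of the export/import step.

```python
import json


def export_opm(edges):
    """Team A: serialize (kind, effect, cause, role) edges to a neutral string."""
    return json.dumps(sorted(edges))


def import_opm(text):
    """Team B: rebuild the edge set from the exchanged string."""
    return {tuple(edge) for edge in json.loads(text)}


def query_generated_by(edges, process):
    """Example query Q: which artifacts did `process` generate?"""
    return {
        effect
        for kind, effect, cause, _role in edges
        if kind == "wasGeneratedBy" and cause == process
    }


def example_provenance():
    """Illustrative trace; step names follow the challenge workflow, artifacts are invented."""
    return {
        ("used", "load database", "input file", "in"),
        ("wasGeneratedBy", "database rows", "load database", "out"),
        ("used", "verify", "database rows", "in"),
        ("wasGeneratedBy", "verification report", "verify", "out"),
    }
```

With `prov_a = example_provenance()` and `prov_b = import_opm(export_opm(prov_a))`, the check for query-level interoperability is that `query_generated_by(prov_a, "verify")` equals `query_generated_by(prov_b, "verify")`.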
  38. 38. OPM in Taverna
     ➡ the answer to any TP query can be viewed as an OPM graph
     ➡ encoded as RDF/XML (using the Tupelo provenance API)
  42. 42. Additional requirements
     • Artifact values require uniform common identifier scheme
       – each group used artifacts to refer to its own data results
       – but those results were expressed using proprietary naming conventions
       – Linked Data in OPM?
     • OPM accounts for structural causal relationships
       – additional domain-specific knowledge required
       – attaching semantic annotations to OPM graph nodes
     • OPM graphs can grow very large
       – reduce size by exporting only query results
     • Taverna approach
       – multiple levels of abstraction
         • through OPM accounts (“points of view”)
  47. 47. Query results as OPM graphs
     (Diagram: encode W as WA; execute (run WA) to obtain prov(WA); query Q gives Q(prov(WA)); export gives OPM(Q(prov(WA))))
     - Approach implemented in Taverna 2.1
     - Internal provenance DB with ad hoc query language
     - To be released soon
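Exporting OPM(Q(prov(WA))) rather than OPM(prov(WA)) amounts to exporting a subgraph. The sketch below illustrates that idea only; it is not Taverna's provenance query language, and `query_subgraph` is a hypothetical name. Edges are simplified to `(effect, cause)` pairs, and the "query" is a backward trace from one target node.

```python
def query_subgraph(edges, target):
    """Keep only the edges reachable backwards from `target`.

    Each edge is an (effect, cause) pair meaning: effect depends on cause.
    """
    keep = set()
    visited = {target}
    frontier = [target]
    while frontier:
        node = frontier.pop()
        for effect, cause in edges:
            if effect == node:
                keep.add((effect, cause))
                if cause not in visited:
                    visited.add(cause)
                    frontier.append(cause)
    return keep


def example_edges():
    """Illustrative trace with one branch that is irrelevant to `report`."""
    return {
        ("report", "verify"),
        ("verify", "database rows"),
        ("database rows", "load database"),
        ("load database", "input file"),
        ("debug log", "load database"),  # not upstream of `report`
    }
```

Here `query_subgraph(example_edges(), "report")` drops the `debug log` edge, so the exported graph has four edges instead of five; on a real trace the unrelated part is typically much larger.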
  51. 51. Full-fledged data-mediated collaborations
     exp. A: workflow A + input A; Research Object with result datasets A and result provenance A
     result A → input B
     exp. B: workflow B + input B; Research Object with result datasets B and result provenance B
  54. 54. Full-fledged data-mediated collaborations
     workflow A + input A; workflow B + input B; result A → input B
     Research Object with result datasets A and B and result provenance A+B
     Provenance composition accounts for implicit collaboration
     Aligned with focus of upcoming Provenance Challenge 4: “connect my provenance to yours” into a whole OPM provenance graph.
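Provenance composition can be sketched as a graph union over a shared identifier. This is a minimal illustration under assumptions: `compose` and `upstream` are hypothetical names, edges are `(effect, cause)` pairs, and the `same_as` mapping stands in for whatever common identifier scheme links B's input to A's result.

```python
def compose(prov_a, prov_b, same_as):
    """Union two edge sets after renaming B's local identifiers via `same_as`."""
    def rename(node):
        return same_as.get(node, node)

    return set(prov_a) | {(rename(effect), rename(cause)) for effect, cause in prov_b}


def upstream(edges, node):
    """Return everything `node` depends on, directly or transitively."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for effect, cause in edges:
            if effect == current and cause not in seen:
                seen.add(cause)
                stack.append(cause)
    return seen


# Illustrative traces for the two experiments (names are assumptions).
PROV_A = {("result_A", "workflow_A"), ("workflow_A", "input_A")}
PROV_B = {("result_B", "workflow_B"), ("workflow_B", "input_B")}
```

Without the mapping, `upstream(PROV_A | PROV_B, "result_B")` stops at `input_B`. With `compose(PROV_A, PROV_B, {"input_B": "result_A"})`, the trace from `result_B` continues through `result_A` back to `input_A`, which is the implicit collaboration made explicit.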
  55. 55. Contacts
     The myGrid Consortium (Manchester, Southampton): http://mygrid.org.uk
     myExperiment: http://www.myexperiment.org
     Janus Provenance
     Me: pmissier@acm.org
