SlideShare a Scribd company logo
1 of 16
Download to read offline
GoldenTrail: Retrieving the Data History that Matters
                                  from a Comprehensive Provenance Repository

                                   Paolo Missier, Newcastle University, UK
                Bertram Ludäscher, Saumen Dey, Michael Wang, Tim McPhillips, UC Davis, USA
                         Shawn Bowers and Michael Agun, Gonzaga University, USA
                                     Ilkay Altintas, UC San Diego, USA


                                                     IDCC
                                            Bristol, 6-7 Dec. 2011




Wednesday, December 7, 2011
IDCC ’11 - P.Missier et al.   Prologue: DCC “REPRISE” workshop, 2009




    2

Wednesday, December 7, 2011
IDCC ’11 - P.Missier et al.   “Virtual experimental science” (DCC’09)




    3

Wednesday, December 7, 2011
Provenance in experimental science

                                 A provenance trace is an account of the history of a data item through
                                 multiple processing steps

                                  • Instrumental to verification and reuse of results -- Trustworthiness
                                  • Enabler for “reproducible science” [1]

                                     provenance trace (graph)                           how did d4 come to be?
                                                                                        what other datasets contributed to it?
                                                                                        which processes were involved?
                                   d1                               d4

                                           i1       d3       i2

                                    d2                               d5
   IDCC ’11 - P.Missier et al.




                                    i1 used d1 and d2        d4, d5 were generated by i2


                                   [1] Mesirov , Jill, P. (2010). Accessible Reproducible Research. Science, 327.
        4
                                   Retrieved from www.sciencemag.org
Wednesday, December 7, 2011
Prior work on provenance composition
                                   2010: the DataTree Of Life summer project [2]
                                 • Provenance stitching:
                                 • Multiple, independently produced provenance traces expressed using
                                   the Open Provenance Model (OPM) can be “joined up” on shared
                                   datasets
                                 • provided the data resides in a provenance-aware data repository.


                                                                       i3   d7
                                                                                       Limitations:
                                    d1                    d4     d4

                                         i1    d3   i2

                                    d2                    d5     d5         d8
                                                                                        – automated “stitching” requires data
                                                                       i4
                                                                                          ID mapping and provenance-aware
                                                                                          data copy operations
                                                                 d6         d9


                                                                                        – in general, it requires human
                                                                                          intervention
   IDCC ’11 - P.Missier et al.




                                 [2] Missier, P., Ludascher, B., Bowers, S., Anand, M. K., Altintas, I., Dey, S., Sarkar, A., et al.
                                 (2010). Linking Multiple Workflow Provenance Traces for Interoperable Collaborative Science.
        5                        Proc.s 5th Workshop on Workflows in Support of Large-Scale Science (WORKS).

Wednesday, December 7, 2011
A broader vision
                                 • Experimental science is explorative and evolutionary
                                   – many experiments, few will succeed
                                   – from parameter sweeps to changes in methods

                                 • E-science infrastructure should be able to capture the exploration
                                   process in addition to the “good” results
                                   – Implicit collaboration becomes “just” a special scenario
   IDCC ’11 - P.Missier et al.




        6



Wednesday, December 7, 2011
A broader vision
                                 • Experimental science is explorative and evolutionary
                                   – many experiments, few will succeed
                                   – from parameter sweeps to changes in methods

                                 • E-science infrastructure should be able to capture the exploration
                                   process in addition to the “good” results
                                   – Implicit collaboration becomes “just” a special scenario




                                 • Golden Data: the dataset(s) that scientists decide to share/publish
                                 • Golden Trail: an account of how the Golden Data was obtained
                                   • a view over the provenance of the entire experiments history
                                   • describes a virtual experiment
   IDCC ’11 - P.Missier et al.




        6



Wednesday, December 7, 2011
Approach: a generalized provenance base
                                   PBase Requirements

                                 • Account for multiplicity of
                                   – workflow specifications and runs
                                   – workflow models
                                   – users

                                 • Capture details of every execution into a persistent provenance
                                   repository
                                 • Let scientists upload new provenance traces
                                 • Support the provenance stitching process interactively
                                 • Support queries on the provenance base to compute Golden Trails
   IDCC ’11 - P.Missier et al.




        7



Wednesday, December 7, 2011
Goal and associated technical challenges

                                       Long-term goal of the DataONE Provenance Working Group:
                                          To offer an extensible framework for building PBases



                                 • The Open Provenance Model is adequate for describing traces of
                                   workflow execution: “trace-land”
                                   – to be superseded by PROV-DM, currently W3C Public Working Draft (*)

                                 • But we also need to record workflow specifications: “workflow-land”
                                   – by supporting multiple heterogeneous workflow models
                                   – e.g. ASKALON, Galaxy, Kepler, Taverna, Pegasus, Vistrails, etc.
                                   – currently only Kepler (UCSD, UC Davis), Taverna (myGrid, UK) supported

                                 • Integration with the DataONE data preservation architecture
   IDCC ’11 - P.Missier et al.




                                   – Provenance base as a new type of Member Node



        8                        (*) FPWD as of October, 2001: http://www.w3.org/TR/2011/WD-prov-dm-20111018/

Wednesday, December 7, 2011
D-OPM - a minimal model
                                 • Trace-land inspired by the OPM
                                 • Workflow-land inspired by Janus [1]




                                 • Actor, a single computational step within a workflow
                                 • Run: a single execution of an entire workflow
                                 • Actor invocations: executions of individual steps that either Use or
                                   Generate Data Items
   IDCC ’11 - P.Missier et al.




                                 • Attribution: reference to users who run the workflow and thus “own” the
                                   traces.
                                 [1] Missier, P., Sahoo, S. S., Zhao, J., Sheth, A., & Goble, C. (2010). Janus: from Workflows
        9                        to Semantic Provenance and Linked Open Data. Procs. IPAW 2010. Troy, NY.

Wednesday, December 7, 2011
GoldenTrail PBase architecture
                                                    7@(-$
                                                  /)*(-012($

                                                                                     78&91:$!7/$                                       ;<(-=$!7/$
                                                            !-18I$
                                       #-12($51-@(-$
                                                         O'@<1&'K1P9)$
                                                                                                !"#$%&'()*+,(-.(-$/)*(-012($3!"#$45%6$
                                                 C1*1$,*9-($
                                                                                        78&91:$                                             ;<(-=$

                                                                          >?@*-12*$#-12($51-@(-$                                   CJ#$        !-18I.'
                                 • UI: upload a new trace                                                                         51-@(-$         K$
                                                                                                                                                         N/#$

                                 • Trace Parser                          #1.(-)1$     A(8&(-$      %9B1:$
                                                                          51-@(-$     51-@(-$      51-@(-$
                                    – maps native formats to                                                                >?@*-12*$CD$;<(-=$/)*(-012($
                                      D-OPM                              >?@*-12*$CD$78&91:$/)*(-012($
                                 • Graph Visualization
                                    – displays provenance                                                    >?@*-12*$CD$/)*(-012($
                                      graphs
                                                                                    E(9FG$4H,#$>5/$                             L=,;M$CD$/)*(-012($
                                 • Data Store: provenance
                                   store
   IDCC ’11 - P.Missier et al.




                                                                                    E(9FG$!-18I$CD$                                   L=,;M$CD$




   10



Wednesday, December 7, 2011
Provenance queries
                                 • Exploit the synergy between workflow-land and trace-land
                                               Data-level and                            Workflow-              User-related
                                             actor-level queries                       level queries              queries

                                      Ancestor / Descendant queries
                                      (backwards / forward traversal)            Find all data that flowed    Find all data items
                                                                                  through a workflow W       used / generated on
                                                                                     during one run R          behalf of a user
                                  Find all Actors that     Find all data D’
                                    contributed to /     that contributed to /
                                     impacted the           impacted the
                                    generation of D        generation of D



                                                           d1                             d4     i3     d7

                                                                   i1     d3      i2

                                                           d2                             d5            d8
                                                used
   IDCC ’11 - P.Missier et al.




                                                                                                 i4
                                                                    genBy                  d6           d9



   11



Wednesday, December 7, 2011
UI - upload




                                  Provide user name
                                   and the workflow
                                        name

                                    Workflow system
                                 (e.g. Kepler, COMAD,
                                      Taverna, etc)

                                 Browse the trace file
                                    to be loaded
   IDCC ’11 - P.Missier et al.




  12

Wednesday, December 7, 2011
UI - query



                                  Select provenance detail
                                 level and dependency type

                                      Filter results using
                                           conditions


                                  Add additional conditions
   IDCC ’11 - P.Missier et al.




                                        Query conditions

  13

Wednesday, December 7, 2011
UI - results rendering




                                 In tabular format


                                     In graphical format
   IDCC ’11 - P.Missier et al.




  14

Wednesday, December 7, 2011
Summary, Ongoing work
                                 • GoldenTrail: a “Provenance Base” for workflow-related datasets
                                    – across users
                                    – across workflow models
                                    – across sessions

                                    – dedicated provenance model and query layer


                                 • State:
                                    – early prototype completed (summer 2011) [1]


                                 • Ongoing work within the DataONE project, Provenance Working
                                   Group
                                    – PBase to be integrated into DataONE as Member Node
                                    – Ongoing engagement with the scientific workflow community
                                       • get buy-in on the PBase idea
   IDCC ’11 - P.Missier et al.




                                       • collect feedback on current prototype
                                       • collect additional use cases

   15
                                 [1] Dec. 6 2011: prototype available at: http://lore.genomecenter.ucdavis.edu:8080/GoldenApp/
Wednesday, December 7, 2011

More Related Content

Viewers also liked

provenance of lists - TAPP'11 Mini-tutorial
provenance of lists - TAPP'11 Mini-tutorialprovenance of lists - TAPP'11 Mini-tutorial
provenance of lists - TAPP'11 Mini-tutorialPaolo Missier
 
Invited talk @Roma La Sapienza, April '07
Invited talk @Roma La Sapienza, April '07Invited talk @Roma La Sapienza, April '07
Invited talk @Roma La Sapienza, April '07Paolo Missier
 
Repro pdiff-talk (invited, Humboldt University, Berlin)
Repro pdiff-talk (invited, Humboldt University, Berlin)Repro pdiff-talk (invited, Humboldt University, Berlin)
Repro pdiff-talk (invited, Humboldt University, Berlin)Paolo Missier
 
Scalable Whole-Exome Sequence Data Processing Using Workflow On A Cloud
Scalable Whole-Exome Sequence Data Processing Using Workflow On A CloudScalable Whole-Exome Sequence Data Processing Using Workflow On A Cloud
Scalable Whole-Exome Sequence Data Processing Using Workflow On A Cloud Paolo Missier
 

Viewers also liked (6)

provenance of lists - TAPP'11 Mini-tutorial
provenance of lists - TAPP'11 Mini-tutorialprovenance of lists - TAPP'11 Mini-tutorial
provenance of lists - TAPP'11 Mini-tutorial
 
Tapp11 presentation
Tapp11 presentationTapp11 presentation
Tapp11 presentation
 
Invited talk @Roma La Sapienza, April '07
Invited talk @Roma La Sapienza, April '07Invited talk @Roma La Sapienza, April '07
Invited talk @Roma La Sapienza, April '07
 
Охота на Работу!EXCLUSIVE
Охота на Работу!EXCLUSIVEОхота на Работу!EXCLUSIVE
Охота на Работу!EXCLUSIVE
 
Repro pdiff-talk (invited, Humboldt University, Berlin)
Repro pdiff-talk (invited, Humboldt University, Berlin)Repro pdiff-talk (invited, Humboldt University, Berlin)
Repro pdiff-talk (invited, Humboldt University, Berlin)
 
Scalable Whole-Exome Sequence Data Processing Using Workflow On A Cloud
Scalable Whole-Exome Sequence Data Processing Using Workflow On A CloudScalable Whole-Exome Sequence Data Processing Using Workflow On A Cloud
Scalable Whole-Exome Sequence Data Processing Using Workflow On A Cloud
 

Similar to Paper talk: Idcc 11

OAI7 Research Objects
OAI7 Research ObjectsOAI7 Research Objects
OAI7 Research Objectsseanb
 
DataOne - Suzie Allard - RDAP12
DataOne - Suzie Allard - RDAP12DataOne - Suzie Allard - RDAP12
DataOne - Suzie Allard - RDAP12ASIS&T
 
Metadata in general and Dublin Core in specific; some experiences
Metadata in general and Dublin Core in specific; some experiencesMetadata in general and Dublin Core in specific; some experiences
Metadata in general and Dublin Core in specific; some experiencesKerstin Forsberg
 
e-Science, Research Data and Libaries
e-Science, Research Data and Libariese-Science, Research Data and Libaries
e-Science, Research Data and LibariesRob Grim
 
VO Course 12: Workflows & the Wf4Ever project
VO Course 12: Workflows & the Wf4Ever projectVO Course 12: Workflows & the Wf4Ever project
VO Course 12: Workflows & the Wf4Ever projectJoint ALMA Observatory
 
Research Data Sharing LERU
Research Data Sharing LERU Research Data Sharing LERU
Research Data Sharing LERU LIBER Europe
 
Cloud Programming Models: eScience, Big Data, etc.
Cloud Programming Models: eScience, Big Data, etc.Cloud Programming Models: eScience, Big Data, etc.
Cloud Programming Models: eScience, Big Data, etc.Alexandru Iosup
 
SEAD Datanet and Sustainability Science
SEAD Datanet and Sustainability Science SEAD Datanet and Sustainability Science
SEAD Datanet and Sustainability Science Robert H. McDonald
 
RDFC2012 Open Access to Research Data
RDFC2012 Open Access to Research DataRDFC2012 Open Access to Research Data
RDFC2012 Open Access to Research DataGudmundur Thorisson
 
Educating a New Breed of Data Scientists for Scientific Data Management
Educating a New Breed of Data Scientists for Scientific Data Management Educating a New Breed of Data Scientists for Scientific Data Management
Educating a New Breed of Data Scientists for Scientific Data Management Jian Qin
 
g-Social - Enhancing e-Science Tools with Social Networking Functionality
g-Social - Enhancing e-Science Tools with Social Networking Functionalityg-Social - Enhancing e-Science Tools with Social Networking Functionality
g-Social - Enhancing e-Science Tools with Social Networking FunctionalityNicholas Loulloudes
 
Data Science: An Emerging Field for Future Jobs
Data Science: An Emerging Field for Future JobsData Science: An Emerging Field for Future Jobs
Data Science: An Emerging Field for Future JobsJian Qin
 
ESI Supplemental Webinar 2 - DataONE presentation slides
ESI Supplemental Webinar 2 - DataONE presentation slides ESI Supplemental Webinar 2 - DataONE presentation slides
ESI Supplemental Webinar 2 - DataONE presentation slides DuraSpace
 
Keeping things in context a comparative evaluation of focus plus context scre...
Keeping things in context a comparative evaluation of focus plus context scre...Keeping things in context a comparative evaluation of focus plus context scre...
Keeping things in context a comparative evaluation of focus plus context scre...Debaleena Chattopadhyay
 
Curation of scientifica data: Challenges for repositories
Curation of scientifica data: Challenges for repositoriesCuration of scientifica data: Challenges for repositories
Curation of scientifica data: Challenges for repositoriesChris Rusbridge
 

Similar to Paper talk: Idcc 11 (20)

OAI7 Research Objects
OAI7 Research ObjectsOAI7 Research Objects
OAI7 Research Objects
 
DataOne - Suzie Allard - RDAP12
DataOne - Suzie Allard - RDAP12DataOne - Suzie Allard - RDAP12
DataOne - Suzie Allard - RDAP12
 
Research Objects in Wf4Ever
Research Objects in Wf4EverResearch Objects in Wf4Ever
Research Objects in Wf4Ever
 
Metadata in general and Dublin Core in specific; some experiences
Metadata in general and Dublin Core in specific; some experiencesMetadata in general and Dublin Core in specific; some experiences
Metadata in general and Dublin Core in specific; some experiences
 
e-Science, Research Data and Libaries
e-Science, Research Data and Libariese-Science, Research Data and Libaries
e-Science, Research Data and Libaries
 
VO Course 12: Workflows & the Wf4Ever project
VO Course 12: Workflows & the Wf4Ever projectVO Course 12: Workflows & the Wf4Ever project
VO Course 12: Workflows & the Wf4Ever project
 
Research Data Sharing LERU
Research Data Sharing LERU Research Data Sharing LERU
Research Data Sharing LERU
 
Cloud Programming Models: eScience, Big Data, etc.
Cloud Programming Models: eScience, Big Data, etc.Cloud Programming Models: eScience, Big Data, etc.
Cloud Programming Models: eScience, Big Data, etc.
 
Michener Plenary PPSR2012
Michener Plenary PPSR2012Michener Plenary PPSR2012
Michener Plenary PPSR2012
 
SEAD Datanet and Sustainability Science
SEAD Datanet and Sustainability Science SEAD Datanet and Sustainability Science
SEAD Datanet and Sustainability Science
 
NISO Forum, Denver, Sept. 24, 2012: Scientific discovery and innovation in an...
NISO Forum, Denver, Sept. 24, 2012: Scientific discovery and innovation in an...NISO Forum, Denver, Sept. 24, 2012: Scientific discovery and innovation in an...
NISO Forum, Denver, Sept. 24, 2012: Scientific discovery and innovation in an...
 
RDFC2012 Open Access to Research Data
RDFC2012 Open Access to Research DataRDFC2012 Open Access to Research Data
RDFC2012 Open Access to Research Data
 
Educating a New Breed of Data Scientists for Scientific Data Management
Educating a New Breed of Data Scientists for Scientific Data Management Educating a New Breed of Data Scientists for Scientific Data Management
Educating a New Breed of Data Scientists for Scientific Data Management
 
報告
報告報告
報告
 
Digital Science
Digital ScienceDigital Science
Digital Science
 
g-Social - Enhancing e-Science Tools with Social Networking Functionality
g-Social - Enhancing e-Science Tools with Social Networking Functionalityg-Social - Enhancing e-Science Tools with Social Networking Functionality
g-Social - Enhancing e-Science Tools with Social Networking Functionality
 
Data Science: An Emerging Field for Future Jobs
Data Science: An Emerging Field for Future JobsData Science: An Emerging Field for Future Jobs
Data Science: An Emerging Field for Future Jobs
 
ESI Supplemental Webinar 2 - DataONE presentation slides
ESI Supplemental Webinar 2 - DataONE presentation slides ESI Supplemental Webinar 2 - DataONE presentation slides
ESI Supplemental Webinar 2 - DataONE presentation slides
 
Keeping things in context a comparative evaluation of focus plus context scre...
Keeping things in context a comparative evaluation of focus plus context scre...Keeping things in context a comparative evaluation of focus plus context scre...
Keeping things in context a comparative evaluation of focus plus context scre...
 
Curation of scientifica data: Challenges for repositories
Curation of scientifica data: Challenges for repositoriesCuration of scientifica data: Challenges for repositories
Curation of scientifica data: Challenges for repositories
 

More from Paolo Missier

Towards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsTowards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsPaolo Missier
 
Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...Paolo Missier
 
Data-centric AI and the convergence of data and model engineering: opportunit...
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...Paolo Missier
 
Realising the potential of Health Data Science: opportunities and challenges ...
Realising the potential of Health Data Science:opportunities and challenges ...Realising the potential of Health Data Science:opportunities and challenges ...
Realising the potential of Health Data Science: opportunities and challenges ...Paolo Missier
 
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)Paolo Missier
 
A Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overviewA Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overviewPaolo Missier
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Paolo Missier
 
Tracking trajectories of multiple long-term conditions using dynamic patient...
Tracking trajectories of  multiple long-term conditions using dynamic patient...Tracking trajectories of  multiple long-term conditions using dynamic patient...
Tracking trajectories of multiple long-term conditions using dynamic patient...Paolo Missier
 
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...Paolo Missier
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcarePaolo Missier
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcarePaolo Missier
 
Data Provenance for Data Science
Data Provenance for Data ScienceData Provenance for Data Science
Data Provenance for Data SciencePaolo Missier
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Paolo Missier
 
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...Paolo Missier
 
Data Science for (Health) Science: tales from a challenging front line, and h...
Data Science for (Health) Science:tales from a challenging front line, and h...Data Science for (Health) Science:tales from a challenging front line, and h...
Data Science for (Health) Science: tales from a challenging front line, and h...Paolo Missier
 
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Analytics of analytics pipelines:from optimising re-execution to general Dat...Analytics of analytics pipelines:from optimising re-execution to general Dat...
Analytics of analytics pipelines: from optimising re-execution to general Dat...Paolo Missier
 
ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...Paolo Missier
 
ReComp, the complete story: an invited talk at Cardiff University
ReComp, the complete story:  an invited talk at Cardiff UniversityReComp, the complete story:  an invited talk at Cardiff University
ReComp, the complete story: an invited talk at Cardiff UniversityPaolo Missier
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Paolo Missier
 
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...Paolo Missier
 

More from Paolo Missier (20)

Towards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsTowards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance records
 
Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...
 
Data-centric AI and the convergence of data and model engineering: opportunit...
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...
 
Realising the potential of Health Data Science: opportunities and challenges ...
Realising the potential of Health Data Science:opportunities and challenges ...Realising the potential of Health Data Science:opportunities and challenges ...
Realising the potential of Health Data Science: opportunities and challenges ...
 
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
 
A Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overviewA Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overview
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
 
Tracking trajectories of multiple long-term conditions using dynamic patient...
Tracking trajectories of  multiple long-term conditions using dynamic patient...Tracking trajectories of  multiple long-term conditions using dynamic patient...
Tracking trajectories of multiple long-term conditions using dynamic patient...
 
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcare
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcare
 
Data Provenance for Data Science
Data Provenance for Data ScienceData Provenance for Data Science
Data Provenance for Data Science
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
 
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
 
Data Science for (Health) Science: tales from a challenging front line, and h...
Data Science for (Health) Science:tales from a challenging front line, and h...Data Science for (Health) Science:tales from a challenging front line, and h...
Data Science for (Health) Science: tales from a challenging front line, and h...
 
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Analytics of analytics pipelines:from optimising re-execution to general Dat...Analytics of analytics pipelines:from optimising re-execution to general Dat...
Analytics of analytics pipelines: from optimising re-execution to general Dat...
 
ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...
 
ReComp, the complete story: an invited talk at Cardiff University
ReComp, the complete story:  an invited talk at Cardiff UniversityReComp, the complete story:  an invited talk at Cardiff University
ReComp, the complete story: an invited talk at Cardiff University
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
 
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
 

Recently uploaded

INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxHumphrey A Beña
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPCeline George
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Mark Reed
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Celine George
 
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfphamnguyenenglishnb
 
DATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersDATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersSabitha Banu
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxDr.Ibrahim Hassaan
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONHumphrey A Beña
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPCeline George
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxAnupkumar Sharma
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designMIPLM
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Celine George
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYKayeClaireEstoconing
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTiammrhaywood
 

Recently uploaded (20)

INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERP
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 
OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...
 
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
 
DATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersDATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginners
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptx
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERP
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
 
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptxLEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
 

Paper talk: Idcc 11

  • 1. GoldenTrail: Retrieving the Data History that Matters from a Comprehensive Provenance Repository Paolo Missier, Newcastle University, UK Bertram Ludäscher, Saumen Dey, Michael Wang, Tim McPhillips, UC Davis, USA Shawn Bowers and Michael Agun, Gonzaga University, USA Ilkay Altintas, UC San Diego, USA IDCC Bristol, 6-7 Dec. 2011 Wednesday, December 7, 2011
  • 2. IDCC ’11 - P.Missier et al. Prologue: DCC “REPRISE” workshop, 2009 2 Wednesday, December 7, 2011
  • 3. IDCC ’11 - P.Missier et al. “Virtual experimental science” (DCC’09) 3 Wednesday, December 7, 2011
  • 4. Provenance in experimental science A provenance trace is an account of the history of a data item through multiple processing steps • Instrumental to verification and reuse of results -- Trustworthiness • Enabler for “reproducible science” [1] provenance trace (graph) how did d4 come to be? what other datasets contributed to it? which processes were involved? d1 d4 i1 d3 i2 d2 d5 IDCC ’11 - P.Missier et al. i1 used d1 and d2 d4, d5 were generated by i2 [1] Mesirov , Jill, P. (2010). Accessible Reproducible Research. Science, 327. 4 Retrieved from www.sciencemag.org Wednesday, December 7, 2011
  • 5. Prior work on provenance composition 2010: the DataTree Of Life summer project [2] • Provenance stitching: • Multiple, independently produced provenance traces expressed using the Open Provenance Model (OPM) can be “joined up” on shared datasets • provided the data resides in a provenance-aware data repository. i3 d7 Limitations: d1 d4 d4 i1 d3 i2 d2 d5 d5 d8 – automated “stitching” requires data i4 ID mapping and provenance-aware data copy operations d6 d9 – in general, it requires human intervention IDCC ’11 - P.Missier et al. [2] Missier, P., Ludascher, B., Bowers, S., Anand, M. K., Altintas, I., Dey, S., Sarkar, A., et al. (2010). Linking Multiple Workflow Provenance Traces for Interoperable Collaborative Science. 5 Proc.s 5th Workshop on Workflows in Support of Large-Scale Science (WORKS). Wednesday, December 7, 2011
  • 6. A broader vision • Experimental science is explorative and evolutionary – many experiments, few will succeed – from parameter sweeps to changes in methods • E-science infrastructure should be able to capture the exploration process in addition to the “good” results – Implicit collaboration becomes “just” a special scenario IDCC ’11 - P.Missier et al. 6 Wednesday, December 7, 2011
  • 7. A broader vision • Experimental science is explorative and evolutionary – many experiments, few will succeed – from parameter sweeps to changes in methods • E-science infrastructure should be able to capture the exploration process in addition to the “good” results – Implicit collaboration becomes “just” a special scenario • Golden Data: the dataset(s) that scientists decide to share/publish • Golden Trail: an account of how the Golden Data was obtained • a view over the provenance of the entire experiments history • describes a virtual experiment IDCC ’11 - P.Missier et al. 6 Wednesday, December 7, 2011
  • 8. Approach: a generalized provenance base PBase Requirements • Account for multiplicity of – workflow specifications and runs – workflow models – users • Capture details of every execution into a persistent provenance repository • Let scientists upload new provenance traces • Support the provenance stitching process interactively • Support queries on the provenance base to compute Golden Trails IDCC ’11 - P.Missier et al. 7 Wednesday, December 7, 2011
  • 9. Goal and associated technical challenges Long-term goal of the DataONE Provenance Working Group: To offer an extensible framework for building PBases • The Open Provenance Model is adequate for describing traces of workflow execution: “trace-land” – to be superseded by PROV-DM, currently W3C Public Working Draft (*) • But we also need to record workflow specifications: “workflow-land” – by supporting multiple heterogeneous workflow models – e.g. ASKALON, Galaxy, Kepler, Taverna, Pegasus, Vistrails, etc. – currently only Kepler (UCSD, UC Davis), Taverna (myGrid, UK) supported • Integration with the DataONE data preservation architecture IDCC ’11 - P.Missier et al. – Provenance base as a new type of Member Node 8 (*) FPWD as of October, 2001: http://www.w3.org/TR/2011/WD-prov-dm-20111018/ Wednesday, December 7, 2011
  • 10. D-OPM - a minimal model • Trace-land inspired by the OPM • Workflow-land inspired by Janus [1] • Actor, a single computational step within a workflow • Run: a single execution of an entire workflow • Actor invocations: executions of individual steps that either Use or Generate Data Items IDCC ’11 - P.Missier et al. • Attribution: reference to users who run the workflow and thus “own” the traces. [1] Missier, P., Sahoo, S. S., Zhao, J., Sheth, A., & Goble, C. (2010). Janus: from Workflows 9 to Semantic Provenance and Linked Open Data. Procs. IPAW 2010. Troy, NY. Wednesday, December 7, 2011
  • 11. GoldenTrail PBase architecture 7@(-$ /)*(-012($ 78&91:$!7/$ ;<(-=$!7/$ !-18I$ #-12($51-@(-$ O'@<1&'K1P9)$ !"#$%&'()*+,(-.(-$/)*(-012($3!"#$45%6$ C1*1$,*9-($ 78&91:$ ;<(-=$ >?@*-12*$#-12($51-@(-$ CJ#$ !-18I.' • UI: upload a new trace 51-@(-$ K$ N/#$ • Trace Parser #1.(-)1$ A(8&(-$ %9B1:$ 51-@(-$ 51-@(-$ 51-@(-$ – maps native formats to >?@*-12*$CD$;<(-=$/)*(-012($ D-OPM >?@*-12*$CD$78&91:$/)*(-012($ • Graph Visualization – displays provenance >?@*-12*$CD$/)*(-012($ graphs E(9FG$4H,#$>5/$ L=,;M$CD$/)*(-012($ • Data Store: provenance store IDCC ’11 - P.Missier et al. E(9FG$!-18I$CD$ L=,;M$CD$ 10 Wednesday, December 7, 2011
  • 12. Provenance queries • Exploit the synergy between workflow-land and trace-land Data-level and Workflow- User-related actor-level queries level queries queries Ancestor / Descendant queries (backwards / forward traversal) Find all data that flowed Find all data items through a workflow W used / generated on during one run R behalf of a user Find all Actors that Find all data D’ contributed to / that contributed to / impacted the impacted the generation of D generation of D d1 d4 i3 d7 i1 d3 i2 d2 d5 d8 used IDCC ’11 - P.Missier et al. i4 genBy d6 d9 11 Wednesday, December 7, 2011
  • 13. UI - upload Provide user name and the workflow name Workflow system (e.g. Kepler, COMAD, Taverna, etc) Browse the trace file to be loaded IDCC ’11 - P.Missier et al. 12 Wednesday, December 7, 2011
  • 14. UI - query Select provenance detail level and dependency type Filter results using conditions Add additional conditions IDCC ’11 - P.Missier et al. Query conditions 13 Wednesday, December 7, 2011
  • 15. UI - results rendering In tabular format In graphical format IDCC ’11 - P.Missier et al. 14 Wednesday, December 7, 2011
  • 16. Summary, Ongoing work • GoldenTrail: a “Provenance Base” for workflow-related datasets – across users – across workflow models – across sessions – dedicated provenance model and query layer • State: – early prototype completed (summer 2011) [1] • Ongoing work within the DataONE project, Provenance Working Group – PBase to be integrated into DataONE as Member Node – Ongoing engagement with the scientific workflow community • get buy-in on the PBase idea IDCC ’11 - P.Missier et al. • collect feedback on current prototype • collect additional use cases 15 [1] Dec. 6 2011: prototype available at: http://lore.genomecenter.ucdavis.edu:8080/GoldenApp/ Wednesday, December 7, 2011