Use Case for D-PROV: Querying Provenance Traces Produced by Workflows Enacted by Different Systems
Khalid Belhajjame, Fernando Seabra Chirigati, Victor Cuevas
Context and Objective
• D-PROV is a model that captures workflow definitions, their provenance, and the provenance of the results obtained by executing them. It is expressive enough to capture the definitions of workflows and the provenance traces specified in multiple workflow systems, in particular Kepler, Taverna and VisTrails.
• D-PROV provides users with integrated access to workflow definitions and their associated provenance traces.
• It uses (and extends) the W3C PROV model to capture the provenance traces produced by the execution of such workflows.
• The objective of this use case is to show that D-PROV users can query (and combine) provenance traces produced by (equivalent) workflows that are specified and enacted in different systems, namely Taverna and VisTrails.
• Note that although the use case focuses on two equivalent workflows, D-PROV is expected, more generally, to allow users to query and combine provenance traces of workflows that are not necessarily equivalent.
Approach
• The approach adopted in the use case is a four-step process:
  • Step 1 (done): Enact the workflows within their native systems
  • Step 2 (done): Export the provenance traces in the native formats of the workflow systems
  • Step 3 (ongoing): Map the workflows and associated provenance traces to D-PROV
  • Step 4: Query the provenance traces produced by the workflow systems using D-PROV
Workflows
We used two (equivalent) workflows specified within Taverna and VisTrails. Both workflows implement a simple in-silico experiment for pathway analysis. Given gene IDs, the workflows fetch the corresponding pathways. To do so, they make use of two KEGG web services.
[Figures: Taverna workflow; VisTrails workflow]
Provenance Traces
• The two workflows were enacted within their respective systems using different (yet overlapping) sets of gene IDs as inputs.
• The provenance traces were then captured and exported in different formats:
  • For the Taverna workflow, we used the PROV-O and Janus formats
  • For the VisTrails workflow, we used its own provenance format (based on XML) and OPM
• The workflows and their provenance are accessible through myExperiment: http://www.myexperiment.org/packs/317.html
• The workflows and their provenance traces are now being mapped to D-PROV.
Queries
Once the mapping is done, we would like to issue queries such as the ones specified below against D-PROV:
• Q1: Give the pathways that were produced by the pathway analysis workflow (as defined within D-PROV), specifying the gene IDs that were used as inputs to that workflow.
  • The result of this query should be the union of the pathways returned by the Taverna and VisTrails workflows, together with the gene IDs used as inputs to both workflows.
• Q2: Give the pathways that were produced by the Taverna workflow and that are associated with gene IDs that were not used as inputs to the VisTrails workflow.
  • This is a diff query.
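The intended semantics of Q1 (union) and Q2 (diff) can be illustrated with a minimal Python sketch. It is not the D-PROV query interface: the `Record` type, the `TRACES` list, and the functions `q1` and `q2` are hypothetical names introduced here, and the gene and pathway values are placeholders rather than real KEGG identifiers or actual trace content. The sketch assumes that, once mapped, each trace entry can be reduced to a (system, gene ID, pathway) triple.

```python
from typing import NamedTuple


class Record(NamedTuple):
    """One simplified trace entry (hypothetical shape, not the D-PROV schema)."""
    system: str   # workflow system that produced the entry
    gene_id: str  # gene ID used as workflow input
    pathway: str  # pathway returned for that gene ID


# Placeholder data with overlapping inputs; not real identifiers or results.
TRACES = [
    Record("Taverna", "gene:A", "path:1"),
    Record("Taverna", "gene:B", "path:2"),
    Record("VisTrails", "gene:B", "path:2"),
    Record("VisTrails", "gene:C", "path:3"),
]


def q1(traces):
    """Q1 (union): all (gene ID, pathway) pairs produced by either system."""
    return {(r.gene_id, r.pathway) for r in traces}


def q2(traces):
    """Q2 (diff): Taverna pairs whose gene ID was not a VisTrails input."""
    vistrails_inputs = {r.gene_id for r in traces if r.system == "VisTrails"}
    return {
        (r.gene_id, r.pathway)
        for r in traces
        if r.system == "Taverna" and r.gene_id not in vistrails_inputs
    }


print(sorted(q1(TRACES)))
print(sorted(q2(TRACES)))
```

With the placeholder data, Q1 merges the entries of both systems and collapses the shared pair, while Q2 keeps only the Taverna entry whose gene ID never appears among the VisTrails inputs.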