citation:
Missier, P., Sahoo, S. S., Zhao, J., Sheth, A., & Goble, C. (2010). Janus: from Workflows to Semantic Provenance and Linked Open Data. Procs. IPAW 2010. Troy, NY.
Paper talk @ Ipaw 2010: Janus: from Workflows to Semantic Provenance and Linked Open Data
1. Janus
Provenance
Janus:
from Workflows to Semantic Provenance
and Linked Open Data
Paolo Missier Jun Zhao Satya S. Sahoo
Carole Goble Amit Sheth
University of Manchester, UK University of Oxford, UK Wright State University, USA
IPAW’ 10
Troy, NY
June 15-16, 2010
2. Key ideas
• Janus:
– a semantic provenance model with domain-specific
extensions
– designed around the Taverna workflow model
• From domain-agnostic provenance graphs
• To domain-aware graphs through explicit
annotations
• From local provenance graphs and queries
scoped to the graph
• To
– Graphs published as Linked Data
– Queries extended into the Web of Data
2
Janus -- IPAW, Troy, NY, June 15-17, 2010
4. Baseline provenance of a workflow run
QTL → mmu:12575
Ensembl Genes
v1 ... vn w
Ensembl Gene → Ensembl Gene →
Uniprot Gene Entrez Gene
path:mmu04012
exec
Uniprot Gene → Entrez Gene → a1 ... an b1 ... bm
Kegg Gene Kegg Gene
mmu:26416
merge gene IDs
path:mmu04010 y11 ymn
Gene → Pathway
...
path:mmu04010→derives_from→mmu:26416
path:mmu04012→derives_from→mmu:12575
• The graph encodes all direct data dependency relations
• Baseline query model: compute paths amongst sets of nodes
• Transitive closure over data dependency relations
4
Janus -- IPAW, Troy, NY, June 15-17, 2010
5. Reference user questions
QTL →
Ensembl Genes Q0: Find all intermediate and initial input
values that contribute to the computation of a
certain output value.
Ensembl Gene → Ensembl Gene →
Uniprot Gene Entrez Gene
Q1. Find all those genes within the input QTL
Uniprot Gene → Entrez Gene →
region that are involved in a given KEGG
Kegg Gene Kegg Gene pathway.
merge gene IDs
Q2: Find all Uniprot-sourced genes
Gene → Pathway
Q3: Find all Entrez genes that encode proteins involved in ATP binding (go:
0005524).
Q4: List relevant PubMed publications for the pathways listed in the result set.
5
Janus -- IPAW, Troy, NY, June 15-17, 2010
6. Query types and model capabilities
Query formulation effort Annotation requirements Query Scope
Q0 - Requires knowledge of
process structure and
data values
No annotations required Single run graph or
Multi-run graphs
- Graphical query
constructor may be
available
Q1 Q2
Requires domain annotations
Use of domain terms Single run graph or
on workflow tasks and on data
facilitates query formulation Multi-run graphs
values
Q3 Q4
- Use of domain terms - Requires domain annotations
facilitates query on workflow tasks and on
formulation. data values
The Web of Data
- Can be integrated with - Relies on completeness of
browsers for LoD sources Linked Data Sources
6
Janus -- IPAW, Troy, NY, June 15-17, 2010
7. Query types and model capabilities
Query formulation effort Annotation requirements Query Scope
Q0 - Requires knowledge of
process structure and
data values
No annotations required Single run graph or
Multi-run graphs
- Graphical query
constructor may be
available
Q1 Q2
Requires domain annotations
Use of domain terms Single run graph or
on workflow tasks and on data
facilitates query formulation Multi-run graphs
values
Q3 Q4
- Use of domain terms - Requires domain annotations
facilitates query on workflow tasks and on
formulation. data values
The Web of Data
- Can be integrated with - Relies on completeness of
browsers for LoD sources Linked Data Sources
6
Janus -- IPAW, Troy, NY, June 15-17, 2010
8. Janus: baseline model of provenance
• The semantic provenance model is an OWL ontology
– defined for domain-agnostic provenance graphs
– naturally extensible to domain concepts
• extends the Provenir upper ontology [*]
– Itself an extension of the Basic Formal Ontology (BFO)
• abstract concepts include data, process, and agent
– Provenir adds 11 types of relationships:
• partonomy relations
• temporal information
• precedence
• causal relationships
• ...
[*] S. Sahoo and A. Sheth. Provenir ontology: Towards a Framework for eScience
Provenance Management, Knoesis Center Tech Report, 2009.
7
Janus -- IPAW, Troy, NY, June 15-17, 2010
15. Desiderata: from structure to values annotations
protein
sequence
X1 X2 X3
interpro
P match
report
Y1 Y2
processor
interpro spec
scan
11
Janus -- IPAW, Troy, NY, June 15-17, 2010
16. Desiderata: from structure to values annotations
protein
sequence
X1 X2 X3
interpro
P match
report
Y1 Y2
processor
interpro spec
scan
protein
sequence
port
v1 v2 v3 value
X1
exec
X2 X3
port
P_inst
interpro Y1 Y2
processor interpro
scan exec
w1 w2 match
report 11
Janus -- IPAW, Troy, NY, June 15-17, 2010
17. Annotations propagation rules
X rdf:type Port C = {c} X has value type c
X has value v v rdf:type PortValue
v rdf:type C denotes
data type
in the PL
protein sense
has_value_type sequence
X1 X2 X3
interpro
P match
report
Y1 Y2
processor
interpro spec
scan
12
Janus -- IPAW, Troy, NY, June 15-17, 2010
18. Annotations propagation rules
X rdf:type Port C = {c} X has value type c
X has value v v rdf:type PortValue
v rdf:type C denotes
data type
in the PL
protein sense
has_value_type sequence
X1 X2 X3
interpro
P match
report
Y1 Y2
processor
?
interpro spec port
scan v1 v2 v3 value
X1 X2 X3
port
P_inst
interpro Y1 Y2
processor interpro
scan exec
w1 w2 match
report
12
Janus -- IPAW, Troy, NY, June 15-17, 2010
19. Annotations propagation rules
X rdf:type Port C = {c} X has value type c
X has value v v rdf:type PortValue
v rdf:type C denotes
data type
in the PL
protein sense
has_value_type sequence
X1 X2 X3
interpro
P match
report
Y1 Y2
protein
processor sequence
interpro spec port
scan v1 v2 v3 value
X1 X2 X3
port
P_inst
interpro Y1 Y2
processor interpro
scan exec
w1 w2 match
report
12
Janus -- IPAW, Troy, NY, June 15-17, 2010
20. Annotations as semantic overlay
v1 w1
X3
Y2
Provenance graph has_port_value has_port_value
X2
P
fragment
Y1
X1
vn wm
has-input-type Pathway has-output-type
search
service
instance-of
instance-of instance-of
Gene v1 w1 Pathway
has-source
X3
has-source
has_port_value Y2 has_port_value
X2
P
instance-of
instance-of
Y1
Kegg wm
X1
vn Kegg
has-source has-source
13
Janus -- IPAW, Troy, NY, June 15-17, 2010
21. Example Janus domain-aware fragment
<rdf:Description rdf:about="http://purl.org/net/taverna/janus/test1625">
<janus:has_iteration>[]</janus:has_iteration>
<rdf:type rdf:resource="http://purl.org/net/taverna/janus#port_value"/>
<rdf:type rdf:resource="http://purl.org/obo/owl/sequence#gene"/>
<janus:has_source rdf:resource="http://purl.org/net/taverna/janus#KEGG"/>
</rdf:Description>
this is rule-
defined, too
14
Janus -- IPAW, Troy, NY, June 15-17, 2010
24. I - Mapping data values to LoD URIs
In our prototype we map data values to Bio2RDF as follows:
Entrez Genes
Uniprot Genes
KEGG Genes
KEGG Pathways
<rdf:Description rdf:about="http://purl.org/net/taverna/janus/create_report/entrezGeneId">
<janus:has_value_binding rdf:resource="http://purl.org/net/taverna/janus/test18"/>
<rdf:Description rdf:about="http://purl.org/net/taverna/janus/test18">
<rdf:type rdf:resource="http://purl.org/net/taverna/janus#port_value"/>
<rdfs:comment>11835</rdfs:comment>
<rdf:type rdf:resource="http://purl.org/obo/owl/sequence#gene"/>
<janus:has_source rdf:resource="http://purl.org/net/taverna/janus#entrez_gene"/>
<rdf:Description rdf:about="http://purl.org/net/taverna/janus/test18">
<rdfs:seeAlso rdf:resource="http://bio2rdf.org/geneid:11835"/>
16
Janus -- IPAW, Troy, NY, June 15-17, 2010
25. II - extended queries in the LoD setting
Strategy:
- use the SQUIN LoD query engine to query multiple “Web of Data” sources
- only Bio2RDF in our case
- combine graph patterns on local provenance with conditions on remote LoD
graphs
Q5: Find all Entrez genes that encode proteins involved in ATP binding
(GO:0005524).
PREFIX uniprot: <http://purl.uniprot.org/core/>
PREFIX : <http://www.taverna.org.uk/janus#>
SELECT distinct ?entrezgene
WHERE {
?protein uniprot:classifiedWith <http://bio2rdf.org/go:0005524> . Bio2RDF
?entrezgene <http://bio2rdf.org/bio2rdf_resource:xPath> ?protein .
?gene rdfs:seeAlso ?entrezgene
?gene rdf:type :port_gene
?gene :has_source :entrez_gene . }
local provenance graph
17
Janus -- IPAW, Troy, NY, June 15-17, 2010
26. Current status
Current Taverna provenance architecture:
Lab prototype Production
– “Export as...” Janus RDF – “native” (relational) graphs
– currently only queried using – simple, efficient query language on
SPARQL native provenance
e
– manually published
– manually annotated
Taverna
Taverna <<Events stream>> Provenance
runtime events capture
query <scope ... /> Lineage
<select .../> relational
query
<focus .../>
processor DB
query response
OPM graph
complete Janus RDF
graph exporter
Provenance
API (Java)
18
Janus -- IPAW, Troy, NY, June 15-17, 2010
27. Summary, and moving forward
• Janus: a semantic model for workflow provenance
– OWL ontology, extension of Provenir
– should include attribution + system level provenance
– alignment with OPM?
• Domain-aware graphs through annotations:
– automatically propagated from workflow annotations when
possible
– but in practice no real workflows are annotated
• LoD integration:
– powerful provenance publishing and query broadening
– mapping rules currently limited
– no completeness guarantee -- all joins are outer joins!
19
Janus -- IPAW, Troy, NY, June 15-17, 2010