The document discusses scientific workflow management systems and collaboration in workflow-based science. It notes that collaboration requires that a scientist be able to make sense of third-party data, and that this requires the data to be accompanied by provenance metadata that describes how the data was generated and processed. The concept of a "Research Object" is introduced as a way to package scientific data and workflows together with provenance and other related information to enable collaboration and reuse.
Structured Occurrence Network for provenance: talk for IPAW'12 paper - Paolo Missier
The document discusses using structured occurrence networks (SONs) to model provenance. SONs extend occurrence networks (ONs) to represent the activity of complex systems through relationships between multiple ONs. The goal is to explore using SONs as a formal model of provenance, viewing data as an evolving system and agents as also evolving systems. Communication SONs are introduced to capture communication between concurrently proceeding ONs. This establishes patterns for representing workflow and multi-layered provenance using SONs.
SWPM'12 report on the Dagstuhl seminar on Semantic Data Management - Paolo Missier
The document summarizes discussions that took place at a Dagstuhl seminar on provenance in semantic data management in April 2012. Key points discussed include:
1) The need for provenance-specific benchmarks and reference data sets to better understand provenance usage and properties.
2) Proposals to collect provenance traces from various domains in a community repository using the PROV standard for interoperability.
3) Challenges of representing and reasoning with uncertain provenance information from sources like sensors, NLP, and human errors.
PDT: Personal Data from Things, and its provenance - Paolo Missier
This document discusses various aspects of the Internet of Things (IoT), including potential architectures and stacks, connectivity and evolution. It examines use cases at different scales, from individual sensors to smart cities. The role of metadata and data provenance is explored for IoT applications involving science, personal data from sensors, and devices that make autonomous decisions. Issues of data ownership, privacy and user control are important considerations for personal data generated by IoT devices. The relationship between IoT and machine-to-machine communication is also briefly discussed.
The document discusses scientific workflow management systems and provenance. It notes that momentum is growing around data sharing, as evidenced by a special issue of Nature on the topic. Effective data sharing requires standards for packaging data with metadata into self-descriptive research objects, as well as representation of process provenance using workflow descriptions. Provenance captures causal relationships in scientific data and is important for understanding, reusing, and validating others' work. The Open Provenance Model aims to standardize provenance representation.
Paper presentations: UK e-Science AHM meeting, 2005 - Paolo Missier
The document describes an ontology-based approach to handling information quality in e-science. It presents an initial quality framework that captures scientists' quality requirements and allows defining domain-specific quality characteristics. It introduces a web service that annotates datasets with quality metrics based on how well their elements conform to relevant ontologies, using transcriptomics as an example domain. The approach aims to make quality definitions reusable and the computation of quality measurements over large datasets cost-effective.
Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collecti... - Paolo Missier
The document discusses fine-grained provenance tracking of workflow data products. It presents a functional model for collection-oriented workflow processing that models workflows operating on nested collections. This model generalizes simple iteration to arbitrary collection depths and handles multiple input collections through a generalized cross product operation. The model aims to enable efficient provenance querying by traversing the workflow graph instead of the potentially larger provenance graph.
This document discusses encoding provenance graphs and PROV constraints using Datalog rules. It maps PROV notation graphs to a database of facts and encodes most PROV constraints as Datalog rules. This allows for declarative specification of provenance graphs with deductive inference, enabling validation of graphs and rapid prototyping of analysis algorithms. Some limitations include inability to encode certain constraints and attributes in graph relations. The approach provides a proof of concept for representing and reasoning over provenance graphs with Datalog.
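To make the encoding idea concrete, here is a minimal Python sketch of Datalog-style evaluation over PROV-like facts. The predicate names follow PROV vocabulary, but the transitive-derivation rule and the naive fixpoint loop are our own illustration, not the paper's actual rule set:

```python
# Datalog-style evaluation over PROV-like facts, as a naive fixpoint loop.
# Illustrative only: the transitive-derivation rule below is invented for
# the example and is not one of the normative PROV constraints.
facts = {
    ("wasDerivedFrom", "e3", "e2"),
    ("wasDerivedFrom", "e2", "e1"),
    ("wasGeneratedBy", "e3", "a2"),
    ("used", "a2", "e2"),
}

def derived_star(facts):
    """derived*(X, Z) :- wasDerivedFrom(X, Y), derived*(Y, Z)."""
    wdf = {(x, y) for (p, x, y) in facts if p == "wasDerivedFrom"}
    inferred = set(wdf)              # base case: every direct derivation
    while True:                      # iterate the rule to a fixpoint
        new = {(x, z) for (x, y) in wdf for (y2, z) in inferred if y == y2}
        if new <= inferred:
            return {("derived*", x, z) for (x, z) in inferred}
        inferred |= new

print(derived_star(facts))
# includes ('derived*', 'e3', 'e1'), inferred from the two direct derivations
```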
The document discusses porting genome sequencing data processing pipelines from scripted HPC implementations to workflow models on the cloud. This allows the pipelines to be more scalable, flexible, and evolvable. Tracking provenance is also important for using results as clinical evidence and analyzing differences when the pipelines change. Preliminary tests on the Microsoft Azure cloud show potential cost savings from improved resource utilization.
The document discusses integrating data from multiple sources on-the-fly without prior knowledge of the schemas. It proposes using approximate entity reconciliation, which leverages techniques like record linkage, approximate joins, and adaptive query processing. The key challenges are trading off completeness of integration for query response time and implementing a hybrid join algorithm that switches between exact and approximate joins to optimize this tradeoff.
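The summary above does not spell out the algorithm; a minimal sketch of the hybrid-join idea, under our own assumptions (illustrative record sets, difflib string similarity, an arbitrary 0.85 threshold), might look like this:

```python
from difflib import SequenceMatcher

def hybrid_join(left, right, key, threshold=0.85):
    """Illustrative hybrid join: exact hash join first, then an approximate
    pass (string similarity) for left rows with no exact partner. The
    threshold trades completeness of integration against response time."""
    index = {}
    for row in right:  # build phase of the exact hash join
        index.setdefault(row[key], []).append(row)

    matches, unmatched = [], []
    for row in left:   # probe phase
        if row[key] in index:
            matches += [(row, r) for r in index[row[key]]]
        else:
            unmatched.append(row)

    for row in unmatched:  # approximate pass: entity reconciliation by similarity
        for k, rows in index.items():
            if SequenceMatcher(None, row[key], k).ratio() >= threshold:
                matches += [(row, r) for r in rows]
    return matches

left = [{"name": "J. Smith"}, {"name": "A. Jones"}]
right = [{"name": "J. Smith"}, {"name": "A. Jonees"}]
print(hybrid_join(left, right, "name"))  # both pairs matched, one approximately
```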
Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010 - Paolo Missier
Missier, P., Ludascher, B., Bowers, S., Anand, M. K., Altintas, I., Dey, S., et al. (2010). Linking Multiple Workflow Provenance Traces for Interoperable Collaborative Science. Proceedings of the 5th Workshop on Workflows in Support of Large-Scale Science (WORKS).
ProvAbs: model, policy, and tooling for abstracting PROV graphs - Paolo Missier
This document presents ProvAbs, a model, policy language, and tool for abstracting PROV graphs to enable partial disclosure of provenance data. The model groups nodes in a PROV graph and replaces them with a new abstract node while preserving the graph's validity. A policy assigns sensitivity levels to nodes and drives the node selection for abstraction. The ProvAbs tool implements the abstraction model and allows interactively exploring policy settings and clearances to generate abstract views of a PROV graph.
Your data won’t stay smart forever: exploring the temporal dimension of (big ... - Paolo Missier
Much of the knowledge produced through data-intensive computations is liable to decay over time, as the underlying data drifts, and the algorithms, tools, and external data sources used for processing change and evolve. Your genome, for example, does not change over time, but our understanding of it does. How often should we look back at it, in the hope of gaining new insight, e.g. into genetic diseases, and how much does that cost when you scale re-analysis to an entire population?
The "total cost of ownership" of knowledge derived from data (TCO-DK) includes the cost of refreshing the knowledge over time in addition to the initial analysis, but is often not a primary consideration.
The ReComp project aims to provide models, algorithms, and tools to help humans understand TCO-DK, i.e., the nature and impact of changes in data, and assess the costs and benefits of knowledge refresh.
In this talk we try to map the scope of ReComp by presenting a number of patterns that cover typical analytics scenarios where re-computation is appropriate. We describe two such scenarios in which we are conducting small-scale, proof-of-concept ReComp experiments to help us sketch the general ReComp architecture. This initial exercise reveals a multiplicity of problems and research challenges, which will inform the rest of the project.
Big Data Quality Panel: Diachron Workshop @EDBT - Paolo Missier
1) Traditional approaches to ensuring data quality such as quality assurance and curation face challenges from big data's volume, velocity, and variety characteristics.
2) It is difficult to determine general thresholds for when data quality issues can be ignored as the importance varies between different analytics algorithms.
3) The ReComp decision support system aims to use metadata about past analytics tasks to determine when knowledge needs to be refreshed due to changes in big data or models.
The lifecycle of reproducible science data and what provenance has got to do ... - Paolo Missier
The document discusses various aspects of ensuring reproducibility in scientific research through provenance. It begins by providing an overview of the data lifecycle and challenges to reproducibility as experiments and components evolve. It then discusses different levels of reproducibility (rerun, repeat, replicate, reproduce) and approaches to analyzing differences in workflow provenance traces to understand how changes impact results. The remainder of the document describes specific systems and tools developed by the author and collaborators that use provenance to improve reproducibility, including data packaging with Research Objects, provenance recording and analysis workflows with YesWorkflow, process virtualization using TOSCA, and provenance differencing with Pdiff.
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central - Paolo Missier
This document discusses moving whole exome sequencing pipelines to the cloud using e-Science Central workflow management. The goal is to process 3000 exomes from neurological patients in a scalable and cost-effective way. Current scripts are being ported to e-Science Central for improved abstraction, execution, and provenance tracking. Provenance will help compare results from different pipeline versions and support clinical diagnosis. Initial testing with 300 exomes will begin, with full scalability testing planned for September 2014.
The document discusses using SNPs (single nucleotide polymorphisms) to help identify candidate genes associated with quantitative traits. It presents SNPit, a database that integrates data from Ensembl, dbSNP and Perlegen to rank SNPs based on differences between resistant and susceptible mouse strains. SNPit supports exploratory analysis of large genomic regions to help focus candidate gene searches for traits like disease susceptibility. The goal is to complement existing methods and automate parts of the process to accelerate disease gene identification.
Design and Development of a Provenance Capture Platform for Data Science - Paolo Missier
A talk given at the DATAPLAT workshop, co-located with the IEEE ICDE conference (May 2024, Utrecht, NL).
Data Provenance for Data Science is our attempt to provide a foundation to add explainability to data-centric AI.
It is a prototype, with lots of work still to do.
Towards explanations for Data-Centric AI using provenance records - Paolo Missier
In this presentation, given to graduate students at Università Roma Tre, Italy, we suggest that concepts well known in Data Provenance can be exploited to provide explanations in the context of data-centric AI processes. Through use cases (incremental data cleaning, training set pruning), we build up increasingly complex provenance patterns, culminating in an open question:
how to describe "why" a specific data item has been manipulated as part of data processing, when such processing may consist of a complex data transformation algorithm.
Interpretable and robust hospital readmission predictions from Electronic Hea... - Paolo Missier
A talk given at the BDA4HM workshop, IEEE BigData conference, Dec. 2023.
Please see the paper here:
https://drive.google.com/file/d/1vN08G0FWxOSH1Yeak5AX6a0sr5-EBbAt/view
Data-centric AI and the convergence of data and model engineering: opportunit... - Paolo Missier
A keynote talk given at the IDEAL 2023 conference (Evora, Portugal, Nov 23, 2023).
Abstract.
The past few years have seen the emergence of what the AI community calls "Data-centric AI", namely the recognition that some of the limiting factors in AI performance lie in the data used for training the models, as much as in the expressiveness and complexity of the models themselves. One analogy is that of a powerful engine that will only run as fast as the quality of the fuel allows. A plethora of recent literature has begun to explore the connection between data and models in depth, along with startups that offer "data engineering for AI" services. Some concepts are well known to the data engineering community, including incremental data cleaning, multi-source integration, and data bias control; others are more specific to AI applications, for instance the realisation that some samples in the training space are "easier to learn from" than others. In this position talk I will suggest that, from an infrastructure perspective, there is an opportunity to efficiently support patterns of complex pipelines where data and model improvements are entangled in a series of iterations. I will focus in particular on end-to-end tracking of data and model versions, as a way to support MLDev and MLOps engineers as they navigate a complex decision space.
Realising the potential of Health Data Science: opportunities and challenges ... - Paolo Missier
This document summarizes a presentation on opportunities and challenges for applying health data science and AI in healthcare. It discusses the potential of predictive, preventative, personalized and participatory (P4) approaches using large health datasets. However, it notes major challenges including data sparsity, imbalance, inconsistency and high costs. Case studies on liver disease and COVID datasets demonstrate issues requiring data engineering. Ensuring explanations and human oversight are also key to adopting AI in clinical practice. Overall, the document outlines a complex landscape and the need for better data science methods to realize the promise of data-driven healthcare.
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science) - Paolo Missier
This document describes DP4DS, a tool to collect fine-grained provenance from data processing pipelines. Specifically, it can collect provenance from dataframe-based Python scripts. It demonstrates scalable provenance generation, storage, and querying. Current work includes improving provenance compression techniques and demonstrating the tool's generality for standard relational operators. Open questions remain around how useful fine-grained provenance is for explaining findings from real data science pipelines.
A Data-centric perspective on Data-driven healthcare: a short overview - Paolo Missier
A brief intro to the data challenges associated with working with healthcare data, with a few examples, both from the literature and our own, of traditional approaches (Latent Class Analysis, Topic Modelling) and a perspective on language-based modelling for Electronic Health Records (EHR).
Probably more references than actual content in here!
Capturing and querying fine-grained provenance of preprocessing pipelines in ... - Paolo Missier
This document describes a method for capturing and querying fine-grained provenance from data science preprocessing pipelines. It captures provenance at the dataframe level by comparing inputs and outputs to identify transformations. Templates are used to represent common transformations like joins and appends. The approach was evaluated on benchmark datasets and pipelines, showing overhead from provenance capture is low and queries are fast even for large datasets. Scalability was demonstrated on datasets up to 1TB in size. A tool called DPDS was also developed to assist with data science provenance.
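As a rough illustration of the input/output comparison idea (not DPDS's actual mechanism), the following sketch diffs two pandas dataframes that share an index and emits one provenance record per cell the operation changed; all function and field names here are ours:

```python
import pandas as pd

def cell_level_provenance(df_in, df_out, op_name):
    """Compare an input and an output dataframe (same index/columns) and emit
    fine-grained provenance: one record per cell the operation touched.
    Illustrative only: real capture also handles joins, appends, drops, etc."""
    records = []
    common_cols = [c for c in df_in.columns if c in df_out.columns]
    for col in common_cols:
        for idx in df_in.index.intersection(df_out.index):
            before, after = df_in.at[idx, col], df_out.at[idx, col]
            if pd.isna(before) and pd.isna(after):
                continue  # both missing: nothing changed
            if before != after:
                records.append({"op": op_name, "row": idx, "col": col,
                                "used": before, "generated": after})
    return pd.DataFrame(records)

df_in = pd.DataFrame({"age": [23, None, 41]})
df_out = df_in.fillna(df_in["age"].mean())  # an imputation step in a pipeline
print(cell_level_provenance(df_in, df_out, "impute-mean"))
```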
Tracking trajectories of multiple long-term conditions using dynamic patient... - Paolo Missier
The document proposes tracking trajectories of multiple long-term conditions using dynamic patient-cluster associations. It uses topic modeling to identify disease clusters from patient timelines and quantifies how patients associate with clusters over time. Preliminary results on 143,000 patients from UK Biobank show varying stability of patient associations with clusters. Further work aims to better define stability and identify causes of instability.
Digital biomarkers for preventive personalised healthcare - Paolo Missier
A talk given to the Alan Turing Institute, UK, Oct 2021, reporting on the preliminary results and ongoing research in our lab, on self-monitoring using accelerometers for healthcare applications
The document discusses data provenance for data science applications. It proposes automatically generating and storing metadata that describes how data flows through a machine learning pipeline. This provenance information could help address questions about model predictions, data processing decisions, and regulatory requirements for high-risk AI systems. Capturing provenance at a fine-grained level incurs overhead but enables detailed queries. The approach was evaluated on performance and scalability. Provenance may help with transparency, explainability and oversight as required by new regulations.
Capturing and querying fine-grained provenance of preprocessing pipelines in ... - Paolo Missier
A talk given at the VLDB 2021 conference, August 2021, presenting our paper:
Capturing and Querying Fine-grained Provenance of Preprocessing Pipelines in Data Science. Chapman, A., Missier, P., Simonelli, G., & Torlone, R. PVLDB, 14(4):507–520, January, 2021.
http://doi.org/10.14778/3436905.3436911
Quo vadis, provenancer? Cui prodest? Our own trajectory: provenance of data... - Paolo Missier
The document discusses provenance in the context of data science and artificial intelligence. It provides bibliometric data on publications related to data/workflow provenance from 2000 to the present. Recent trends include increased focus on applications in computing and engineering fields. Blockchain is discussed as a method for capturing fine-grained provenance. The document also outlines challenges around explainability, transparency and accountability for high-risk AI systems according to new EU regulations, and argues that provenance techniques may help address these challenges by providing traceability of system functioning and operation monitoring.
Analytics of analytics pipelines: from optimising re-execution to general Dat... - Paolo Missier
This document discusses using data provenance to optimize re-execution of analytics pipelines and enable transparency in data science workflows. It proposes a framework called ReComp that selectively recomputes parts of expensive analytics workflows when inputs change based on provenance data. It also discusses applying provenance techniques to collect fine-grained data on data preparation steps in machine learning pipelines to help explain model decisions and data transformations. Early results suggest provenance can be collected with reasonable overhead and enables useful queries about pipeline execution.
ReComp: optimising the re-execution of analytics pipelines in response to cha... - Paolo Missier
Paolo Missier presented on optimizing the re-execution of analytics pipelines in response to changes in input data. The talk discussed using provenance to selectively re-run parts of workflows impacted by changes. ProvONE combines process structure and runtime provenance to enable granular re-execution. The ReComp framework detects and quantifies data changes, estimates impact, and selectively re-executes relevant sub-processes to optimize re-running workflows in response to evolving data.
ReComp, the complete story: an invited talk at Cardiff University - Paolo Missier
The document describes the ReComp framework for efficiently recomputing analytics processes when changes occur. ReComp uses provenance data from past executions to estimate the impact of changes and selectively re-execute only affected parts of processes. It identifies changes, computes data differences, and estimates impacts on past outputs to determine the minimum re-executions needed. For genomic analysis workflows, ReComp reduced re-executions from 495 to 71 by caching intermediate data and re-running only impacted fragments. The framework is customizable via difference and impact functions tailored to specific applications and data types.
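ReComp's difference and impact functions are application-specific; a minimal sketch of the selection logic they drive might look like the following, where the function names and the threshold are illustrative assumptions, not ReComp's actual API:

```python
def select_for_reexecution(executions, old_input, new_input,
                           diff_fn, impact_fn, threshold=0.1):
    """Sketch of ReComp-style selective re-computation: quantify the input
    change once, estimate its impact on each past execution from provenance,
    and re-run only executions whose estimated impact crosses a threshold.
    diff_fn/impact_fn stand in for the customizable, application-specific
    difference and impact functions."""
    delta = diff_fn(old_input, new_input)
    return [e for e in executions if impact_fn(e, delta) >= threshold]

# Toy instantiation: inputs are sets of variant annotations; impact is the
# fraction of an execution's used variants that changed.
diff_fn = lambda old, new: old ^ new                      # symmetric difference
impact_fn = lambda e, delta: len(e["used"] & delta) / len(e["used"])

executions = [{"id": 1, "used": {"v1", "v2"}}, {"id": 2, "used": {"v3"}}]
print(select_for_reexecution(executions, {"v1", "v2", "v3"},
                             {"v1b", "v2", "v3"}, diff_fn, impact_fn))
# only execution 1 is selected for re-run
```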
"Scaling RAG Applications to serve millions of users", Kevin GoedeckeFwdays
How we managed to grow and scale a RAG application from zero to thousands of users in 7 months. Lessons from technical challenges around managing high load for LLMs, RAGs and Vector databases.
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor IvaniukFwdays
At this talk we will discuss DDoS protection tools and best practices, discuss network architectures and what AWS has to offer. Also, we will look into one of the largest DDoS attacks on Ukrainian infrastructure that happened in February 2022. We'll see, what techniques helped to keep the web resources available for Ukrainians and how AWS improved DDoS protection for all customers based on Ukraine experience
Must Know Postgres Extension for DBA and Developer during MigrationMydbops
Mydbops Opensource Database Meetup 16
Topic: Must-Know PostgreSQL Extensions for Developers and DBAs During Migration
Speaker: Deepak Mahto, Founder of DataCloudGaze Consulting
Date & Time: 8th June | 10 AM - 1 PM IST
Venue: Bangalore International Centre, Bangalore
Abstract: Discover how PostgreSQL extensions can be your secret weapon! This talk explores how key extensions enhance database capabilities and streamline the migration process for users moving from other relational databases like Oracle.
Key Takeaways:
* Learn about crucial extensions like oracle_fdw, pgtt, and pg_audit that ease migration complexities.
* Gain valuable strategies for implementing these extensions in PostgreSQL to achieve license freedom.
* Discover how these key extensions can empower both developers and DBAs during the migration process.
* Don't miss this chance to gain practical knowledge from an industry expert and stay updated on the latest open-source database trends.
Mydbops Managed Services specializes in taking the pain out of database management while optimizing performance. Since 2015, we have been providing top-notch support and assistance for the top three open-source databases: MySQL, MongoDB, and PostgreSQL.
Our team offers a wide range of services, including assistance, support, consulting, 24/7 operations, and expertise in all relevant technologies. We help organizations improve their database's performance, scalability, efficiency, and availability.
Contact us: info@mydbops.com
Visit: https://www.mydbops.com/
Follow us on LinkedIn: https://in.linkedin.com/company/mydbops
For more details and updates, please follow up the below links.
Meetup Page : https://www.meetup.com/mydbops-databa...
Twitter: https://twitter.com/mydbopsofficial
Blogs: https://www.mydbops.com/blog/
Facebook(Meta): https://www.facebook.com/mydbops/
ScyllaDB is making a major architecture shift. We’re moving from vNode replication to tablets – fragments of tables that are distributed independently, enabling dynamic data distribution and extreme elasticity. In this keynote, ScyllaDB co-founder and CTO Avi Kivity explains the reason for this shift, provides a look at the implementation and roadmap, and shares how this shift benefits ScyllaDB users.
Discover top-tier mobile app development services, offering innovative solutions for iOS and Android. Enhance your business with custom, user-friendly mobile applications.
In the realm of cybersecurity, offensive security practices act as a critical shield. By simulating real-world attacks in a controlled environment, these techniques expose vulnerabilities before malicious actors can exploit them. This proactive approach allows manufacturers to identify and fix weaknesses, significantly enhancing system security.
This presentation delves into the development of a system designed to mimic Galileo's Open Service signal using software-defined radio (SDR) technology. We'll begin with a foundational overview of both Global Navigation Satellite Systems (GNSS) and the intricacies of digital signal processing.
The presentation culminates in a live demonstration. We'll showcase the manipulation of Galileo's Open Service pilot signal, simulating an attack on various software and hardware systems. This practical demonstration serves to highlight the potential consequences of unaddressed vulnerabilities, emphasizing the importance of offensive security practices in safeguarding critical infrastructure.
AppSec PNW: Android and iOS Application Security with MobSFAjin Abraham
Mobile Security Framework - MobSF is a free and open source automated mobile application security testing environment designed to help security engineers, researchers, developers, and penetration testers to identify security vulnerabilities, malicious behaviours and privacy concerns in mobile applications using static and dynamic analysis. It supports all the popular mobile application binaries and source code formats built for Android and iOS devices. In addition to automated security assessment, it also offers an interactive testing environment to build and execute scenario based test/fuzz cases against the application.
This talk covers:
Using MobSF for static analysis of mobile applications.
Interactive dynamic security assessment of Android and iOS applications.
Solving Mobile app CTF challenges.
Reverse engineering and runtime analysis of Mobile malware.
How to shift left and integrate MobSF/mobsfscan SAST and DAST in your build pipeline.
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...Jason Yip
The typical problem in product engineering is not bad strategy, so much as “no strategy”. This leads to confusion, lack of motivation, and incoherent action. The next time you look for a strategy and find an empty space, instead of waiting for it to be filled, I will show you how to fill it in yourself. If you’re wrong, it forces a correction. If you’re right, it helps create focus. I’ll share how I’ve approached this in the past, both what works and lessons for what didn’t work so well.
What is an RPA CoE? Session 2 – CoE RolesDianaGray10
In this session, we will review the players involved in the CoE and how each role impacts opportunities.
Topics covered:
• What roles are essential?
• What place in the automation journey does each role play?
Speaker:
Chris Bolin, Senior Intelligent Automation Architect Anika Systems
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...Fwdays
Direct losses from downtime in 1 minute = $5-$10 thousand dollars. Reputation is priceless.
As part of the talk, we will consider the architectural strategies necessary for the development of highly loaded fintech solutions. We will focus on using queues and streaming to efficiently work and manage large amounts of data in real-time and to minimize latency.
We will focus special attention on the architectural patterns used in the design of the fintech system, microservices and event-driven architecture, which ensure scalability, fault tolerance, and consistency of the entire system.
Essentials of Automations: Exploring Attributes & Automation ParametersSafe Software
Building automations in FME Flow can save time, money, and help businesses scale by eliminating data silos and providing data to stakeholders in real-time. One essential component to orchestrating complex automations is the use of attributes & automation parameters (both formerly known as “keys”). In fact, it’s unlikely you’ll ever build an Automation without using these components, but what exactly are they?
Attributes & automation parameters enable the automation author to pass data values from one automation component to the next. During this webinar, our FME Flow Specialists will cover leveraging the three types of these output attributes & parameters in FME Flow: Event, Custom, and Automation. As a bonus, they’ll also be making use of the Split-Merge Block functionality.
You’ll leave this webinar with a better understanding of how to maximize the potential of automations by making use of attributes & automation parameters, with the ultimate goal of setting your enterprise integration workflows up on autopilot.
From Natural Language to Structured Solr Queries using LLMsSease
This talk draws on experimentation to enable AI applications with Solr. One important use case is to use AI for better accessibility and discoverability of the data: while User eXperience techniques, lexical search improvements, and data harmonization can take organizations to a good level of accessibility, a structural (or “cognitive” gap) remains between the data user needs and the data producer constraints.
That is where AI – and most importantly, Natural Language Processing and Large Language Model techniques – could make a difference. This natural language, conversational engine could facilitate access and usage of the data leveraging the semantics of any data source.
The objective of the presentation is to propose a technical approach and a way forward to achieve this goal.
The key concept is to enable users to express their search queries in natural language, which the LLM then enriches, interprets, and translates into structured queries based on the Solr index’s metadata.
This approach leverages the LLM’s ability to understand the nuances of natural language and the structure of documents within Apache Solr.
The LLM acts as an intermediary agent, offering a transparent experience to users automatically and potentially uncovering relevant documents that conventional search methods might overlook. The presentation will include the results of this experimental work, lessons learned, best practices, and the scope of future work that should improve the approach and make it production-ready.
Introduction of Cybersecurity with OSS at Code Europe 2024Hiroshi SHIBATA
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
Dandelion Hashtable: beyond billion requests per second on a commodity serverAntonios Katsarakis
This slide deck presents DLHT, a concurrent in-memory hashtable. Despite efforts to optimize hashtables, that go as far as sacrificing core functionality, state-of-the-art designs still incur multiple memory accesses per request and block request processing in three cases. First, most hashtables block while waiting for data to be retrieved from memory. Second, open-addressing designs, which represent the current state-of-the-art, either cannot free index slots on deletes or must block all requests to do so. Third, index resizes block every request until all objects are copied to the new index. Defying folklore wisdom, DLHT forgoes open-addressing and adopts a fully-featured and memory-aware closed-addressing design based on bounded cache-line-chaining. This design offers lock-free index operations and deletes that free slots instantly, (2) completes most requests with a single memory access, (3) utilizes software prefetching to hide memory latencies, and (4) employs a novel non-blocking and parallel resizing. In a commodity server and a memory-resident workload, DLHT surpasses 1.6B requests per second and provides 3.5x (12x) the throughput of the state-of-the-art closed-addressing (open-addressing) resizable hashtable on Gets (Deletes).
Northern Engraving | Modern Metal Trim, Nameplates and Appliance PanelsNorthern Engraving
What began over 115 years ago as a supplier of precision gauges to the automotive industry has evolved into being an industry leader in the manufacture of product branding, automotive cockpit trim and decorative appliance trim. Value-added services include in-house Design, Engineering, Program Management, Test Lab and Tool Shops.
In our second session, we shall learn all about the main features and fundamentals of UiPath Studio that enable us to use the building blocks for any automation project.
📕 Detailed agenda:
Variables and Datatypes
Workflow Layouts
Arguments
Control Flows and Loops
Conditional Statements
💻 Extra training through UiPath Academy:
Variables, Constants, and Arguments in Studio
Control Flow in Studio
1. Scientific Workflow Management System
Janus
Provenance
Towards systematic information exchange and reuse in e-laboratories
AGU Fall meeting, Dec. 2009
Paolo Missier
Information Management Group
School of Computer Science, University of Manchester, UK
with additional material by Sean Bechhofer and Matthew Gamble,
e-Labs design group, University of Manchester
2. Momentum on sharing and collaboration
Special issue of Nature on Data Sharing (Sept. 2009)
http://www.nature.com/news/specials/datasharing/index.html
3. Momentum on sharing and collaboration
Special issue of Nature on Data Sharing (Sept. 2009)
• timeliness requires rapid sharing
• repurposing
• the Human Genome project use case
http://www.nature.com/news/specials/datasharing/index.html
4. Momentum on sharing and collaboration
Special issue of Nature on Data Sharing (Sept. 2009)
• timeliness requires rapid sharing
• repurposing
• the Human Genome project use case
http://www.nature.com/news/specials/datasharing/index.html
• Debate is much further along in Earth Sciences
– ESIP - data preservation / stewardship, 2009
– Long established in some communities - Atmospheric sciences, 1998 [1]
• Science Commons recommendations for Open Science (July 2008) [link]
[1] Strebel DE, Landis DR, Huemmrich KF, Newcomer JA, Meeson BW: The FIFE Data Publication Experiment. Journal of the Atmospheric Sciences 1998, 55:1277-1283
5. Collaboration in workflow-based science
[Diagram: a workflow specification plus an input dataset feed a workflow execution.]
6. Collaboration in workflow-based science
[Diagram: the workflow execution now produces two outcomes: data and provenance.]
7. Collaboration in workflow-based science
[Diagram: the outcomes (data and provenance) are packaged into a Research Object.]
9. Collaboration in workflow-based science
[Diagram: a third-party scientist, Paul, can browse, query, unbundle, and reuse the Research Object.]
10. Collaboration in workflow-based science
[Diagram: as above; sharing the Research Object establishes a data-mediated, implicit collaboration with Paul.]
11. Collaboration in workflow-based science
What is needed for Paul to make sense of third party data?
[Diagram: the data-mediated implicit collaboration scenario from the previous slides.]
15. Paul's Pack
[Diagram: Paul's Pack, a QTL Research Object, built around a "Common pathways" study.]
16. Paul's Pack
[Diagram: the pack now shows its contents: Workflow 16 and Workflow 13 with their Results, plus Logs, Slides, and a Paper.]
17. Paul's Pack
[Diagram: as above, with a Representation layer added.]
18. Paul's Pack
[Diagram: as above, with Domain Relations added alongside the Representation layer.]
19. Paul's Pack
[Diagram: the domain relations are now labelled: each workflow produces its Results; the items are included in the pack; Workflow 16's results feed into Workflow 13; and results are published in the Slides and the Paper.]
20. Paul's Pack
[Diagram: as above, with the Metadata and Aggregation layers completing the Research Object.]
21. ORE: representing generic aggregations
[Diagram: an ORE Resource Map (the descriptor) describes an aggregation data structure.]
http://www.openarchives.org/ore/1.0/primer.html section 4
A. Pepe, M. Mayernik, C.L. Borgman, and H. Van de Sompel, "From Artifacts to Aggregations: Modeling Scientific Life Cycles on the Semantic Web," Journal of the American Society for Information Science and Technology (JASIST), to appear, 2009.
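As an illustration of the ORE vocabulary in use, here is a minimal rdflib sketch of a Resource Map describing an aggregation; the example URIs and the aggregated resources are invented:

```python
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import RDF

ORE = Namespace("http://www.openarchives.org/ore/terms/")

g = Graph()
rem = URIRef("http://example.org/pack/rem")          # the Resource Map (descriptor)
agg = URIRef("http://example.org/pack/aggregation")  # the aggregation it describes

g.add((rem, RDF.type, ORE.ResourceMap))
g.add((rem, ORE.describes, agg))
g.add((agg, RDF.type, ORE.Aggregation))
# the aggregated resources: a workflow, its results, and a paper (made up)
for part in ("workflow16", "results16", "paper"):
    g.add((agg, ORE.aggregates, URIRef(f"http://example.org/pack/{part}")))

print(g.serialize(format="turtle"))
```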
23. Content: Workflow provenance
A detailed trace of workflow execution
- tasks performed, data transformations
- inputs used, outputs produced
25. Content: Workflow provenance
A detailed trace of workflow execution
- tasks performed, data transformations
- inputs used, outputs produced
[Diagram: an example workflow whose processors (lister, get pathways by genes1, merge pathways, concat gene pathway ids) take a gene_id input and produce the output pathway_genes.]
26. Why provenance matters, if done right
• To establish quality, relevance, trust
• To track information attribution through complex transformations
• To describe one’s experiment to others, for understanding / reuse
• To provide evidence in support of scientific claims
• To enable post hoc process analysis for improvement, re-design
The W3C Incubator on Provenance has been collecting numerous use cases:
http://www.w3.org/2005/Incubator/prov/wiki/Use_Cases#
27. What users expect to learn
• Causal relations:
- which pathways come from which genes?
- which processes contributed to producing an image?
- which process(es) caused data to be incorrect?
- which data caused a process to fail?
• Process and data analytics:
– analyze variations in output vs an input parameter sweep (multiple process runs)
– how often has my favourite service been executed? on what inputs?
– who produced this data?
– how often does this pathway turn up when the input genes range over a certain set S?
[Diagram: the example pathway workflow from slide 25.]
28. Open Provenance Model
• graph of causal dependencies involving data and processors
• not necessarily generated by a workflow!
• v1.1 out soon
Goal: standardize causal dependencies to enable provenance metadata exchange
[Diagram: OPM edge types: an artifact A wasGeneratedBy (R) a process P, and a process P used (R) an artifact A, where R is a role; an example graph links artifacts A1...A4 and processes P1...P3 through used and wasGeneratedBy (wgb) edges with roles R1...R6.]
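To illustrate how causal dependencies in such a graph can be traversed, here is a small Python sketch (node and role names invented) that walks used and wasGeneratedBy edges to find everything that caused an artifact:

```python
# Illustrative OPM-style graph: edges point from effect to cause, so the
# transitive causes of a node are found by walking used and
# wasGeneratedBy (wgb) edges. All identifiers are invented.
used = {"P1": [("A1", "R3")]}                     # process used artifact (role)
wgb = {"A3": [("P1", "R5")], "A1": [("P2", "R1")]}  # artifact generated by process

def causes(node, used, wgb):
    """All artifacts and processes that transitively caused `node`."""
    edges = {**{p: [a for a, _ in deps] for p, deps in used.items()},
             **{a: [p for p, _ in deps] for a, deps in wgb.items()}}
    seen, frontier = set(), [node]
    while frontier:
        n = frontier.pop()
        for cause in edges.get(n, []):
            if cause not in seen:
                seen.add(cause)
                frontier.append(cause)
    return seen

print(causes("A3", used, wgb))  # {'P1', 'A1', 'P2'}
```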
29. Additional requirements on OPM
• Artifact values require a uniform common identifier scheme
– Linked Data in OPM?
• OPM accounts for structural causal relationships
– additional domain-specific knowledge required
– attaching semantic annotations to OPM graph nodes
• OPM graphs can grow very large
– reduce size by exporting only query results
• Taverna approach
– multiple levels of abstraction
• through OPM accounts (“points of view”)
31. Query results as OPM graphs
[Diagram: execute workflow W, producing the provenance trace prov(W); run a query Q over the trace; export the result Q(prov(W)) as an OPM graph OPM(Q(prov(W))).]
- Approach implemented in the Taverna 2.1 workflow system (just released!)
- Internal provenance DB with ad hoc query language
32. Full-fledged data-mediated collaborations
[Diagram: experiment A: workflow A plus input A produce result A, packaged with its provenance and datasets A into Research Object A.]
34. Full-fledged data-mediated collaborations
[Diagram: as above, with result A now flowing on as input B (result A → input B).]
35. Full-fledged data-mediated collaborations
[Diagram: experiment B consumes result A as input B; workflow B plus input B produce result B, packaged with its provenance and datasets B into Research Object B.]
36. Full-fledged data-mediated collaborations
[Diagram: the two experiments combine into a single Research Object aggregating results A+B, their provenance, and datasets A+B.]
37. Full-fledged data-mediated collaborations
[Diagram: as above.]
Provenance composition accounts for implicit collaboration
38. Full-fledged data-mediated collaborations
[Diagram: as above.]
Provenance composition accounts for implicit collaboration
Aligned with the focus of the upcoming Provenance Challenge 4: “connect my provenance to yours” into a whole OPM provenance graph.
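A minimal sketch of the composition idea: two OPM-style graphs stitched together on the shared artifact (result A = input B), after which upstream queries span both experiments. Identifiers are illustrative and presuppose the common artifact-naming scheme discussed on slide 29:

```python
# Sketch of composing two OPM-style provenance graphs across a collaboration:
# experiment B used experiment A's result as its input, so the shared
# artifact identifier stitches the two graphs into one. Names are invented.
prov_a = {("resultA", "wasGeneratedBy", "workflowA"),
          ("workflowA", "used", "inputA")}
prov_b = {("resultB", "wasGeneratedBy", "workflowB"),
          ("workflowB", "used", "resultA")}  # result A -> input B

composed = prov_a | prov_b  # one graph; 'resultA' is the shared node

def upstream(node, graph):
    """Everything that contributed to `node` in the composed graph."""
    direct = {cause for (effect, _, cause) in graph if effect == node}
    return direct | {n for c in direct for n in upstream(c, graph)}

print(upstream("resultB", composed))
# {'workflowB', 'resultA', 'workflowA', 'inputA'} -- spans both experiments
```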
39. Contacts
The myGrid Consortium (Manchester, Southampton)
http://mygrid.org.uk
http://www.myexperiment.org
Me: pmissier@acm.org