The document discusses scientific workflow management systems and provenance. It notes that momentum is growing around data sharing, as evidenced by a special issue of Nature on the topic. Effective data sharing requires standards for packaging data with metadata into self-descriptive research objects, as well as representation of process provenance using workflow descriptions. Provenance captures causal relationships in scientific data and is important for understanding, reusing, and validating others' work. The Open Provenance Model aims to standardize provenance representation.
This document discusses mapping ontologies from multiple datasets in the Linked Open Data cloud to the PROTON upper-level ontology. It presents an approach to semantically mapping classes and properties from datasets like DBpedia, Freebase and GeoNames to PROTON in order to provide a unified vocabulary for querying across datasets. The mappings were developed using both automated and manual methods. Statistics on the ontology extensions and mappings are provided, as well as examples of SPARQL queries over the mapped data. Future work includes publishing the mapped ontologies and extending the mappings to additional datasets.
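A unified vocabulary is what lets a single query span several sources. As a hedged illustration only, the rdflib sketch below runs a SPARQL query over a few toy triples; the namespace, class, and property names are placeholders, not the actual PROTON vocabulary or the paper's mappings.

```python
# Sketch: one SPARQL query over data mapped to a shared upper ontology.
# The proton: names below are illustrative stand-ins.
from rdflib import Graph, Literal, Namespace, RDF

PROTON = Namespace("http://example.org/proton#")  # placeholder namespace
g = Graph()

# Toy triples standing in for DBpedia/Freebase/GeoNames data after mapping.
g.add((PROTON.Vienna, RDF.type, PROTON.City))
g.add((PROTON.Vienna, PROTON.populationCount, Literal(1897000)))

query = """
PREFIX proton: <http://example.org/proton#>
SELECT ?city ?population WHERE {
    ?city a proton:City ;
          proton:populationCount ?population .
}
"""
for row in g.query(query):
    print(row.city, row.population)
```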
Inquiry Optimization Technique for a Topic Map Database (tmra)
This document proposes an inquiry optimization technique for topic map databases. It discusses using an object-oriented data model for topic map databases to improve query performance compared to a relational model. The document defines cost estimation formulas to help the database system select the optimal retrieval route, either following associations or searching by topic, when answering queries. An experiment is needed to evaluate the effectiveness of using these cost estimations to optimize queries of a topic map database.
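To make the route-selection idea concrete, here is a hypothetical Python sketch; the cost formulas are illustrative placeholders, not the ones defined in the paper.

```python
# Hypothetical cost-based route selection for a topic map query.

def cost_follow_associations(fanout: float, depth: int) -> float:
    """Estimated nodes visited when traversing associations from a start topic."""
    return sum(fanout ** d for d in range(1, depth + 1))

def cost_search_by_topic(num_topics: int, selectivity: float) -> float:
    """Estimated cost of scanning the topic index and filtering."""
    return num_topics * selectivity

def choose_route(fanout: float, depth: int, num_topics: int, selectivity: float) -> str:
    assoc = cost_follow_associations(fanout, depth)
    topic = cost_search_by_topic(num_topics, selectivity)
    return "follow-associations" if assoc <= topic else "search-by-topic"

print(choose_route(fanout=3.0, depth=2, num_topics=100_000, selectivity=0.01))
```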
The document provides an overview of materials informatics and the Materials Genome Initiative. It discusses how materials informatics uses data-driven approaches and techniques from fields like signal processing, machine learning and statistics to generate structure-property-processing linkages from materials science data and improve understanding of materials behavior. This includes extracting features from materials microstructure, using statistical analysis and data mining to discover relationships and create predictive models, and evaluating how knowledge has improved.
Progress Towards Leveraging Natural Language Processing for Collecting Experi... (Anubhav Jain)
1. The document discusses using natural language processing (NLP) algorithms to extract useful information from unstructured text sources in materials science literature to help organize the world's materials science information and enable new search and analysis capabilities.
2. It describes a project called Matscholar that applies NLP techniques like named entity recognition and relation extraction to millions of article abstracts to build a searchable database with summarized materials property and application data.
3. The approach involves collecting text sources, developing machine learning models trained on annotated examples to extract entities and relations, and integrating the extracted structured data with materials property databases to enable new search and analysis functions (a minimal extraction sketch follows below).
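A minimal sketch of the entity-extraction step using the Hugging Face transformers pipeline; "dslim/bert-base-NER" is a general-purpose NER model used here as a stand-in, since Matscholar's materials-specific models are not reproduced.

```python
# Named entity recognition over an abstract with a pretrained model.
from transformers import pipeline

ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

abstract = ("We report the thermoelectric properties of Bi2Te3 thin films "
            "grown by molecular beam epitaxy.")

for entity in ner(abstract):
    # Each extracted entity carries a label, a confidence score, and offsets.
    print(entity["entity_group"], entity["word"], round(float(entity["score"]), 3))
```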
Applications of Natural Language Processing to Materials Design (Anubhav Jain)
This document discusses using natural language processing (NLP) techniques to extract useful information from unstructured text sources in materials science literature. It describes how NLP models can be trained on large datasets of materials science publications to perform tasks like chemistry-aware search, summarizing material properties, and suggesting synthesis methods. The models are developed using techniques like word embeddings, LSTM networks, and named entity recognition. The goal is to organize materials science knowledge from text into a database called Matscholar to enable new applications of the information.
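As a toy illustration of the word-embedding idea behind chemistry-aware search, the gensim sketch below trains embeddings on three stand-in "abstracts" (real systems train on millions); terms used in similar contexts end up close in the embedding space.

```python
from gensim.models import Word2Vec

corpus = [
    "Bi2Te3 is a promising thermoelectric material with low thermal conductivity".split(),
    "PbTe thermoelectric devices show a high figure of merit at elevated temperature".split(),
    "GaN is a wide bandgap semiconductor used in light emitting diodes".split(),
]

model = Word2Vec(sentences=corpus, vector_size=50, window=5, min_count=1, epochs=50)

# Nearest neighbours in embedding space drive "related material" suggestions.
print(model.wv.most_similar("thermoelectric", topn=3))
```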
Data Integration at the Ontology Engineering Group (Oscar Corcho)
Presentation on the data integration work at OEG-UPM (http://www.oeg-upm.net/), given at the CredIBLE workshop in Sophia-Antipolis (October 15th, 2012).
Evaluating Machine Learning Algorithms for Materials Science using the Matben... (Anubhav Jain)
1) The document discusses evaluating machine learning algorithms for materials science using the Matbench protocol.
2) Matbench provides standardized datasets, testing procedures, and an online leaderboard to benchmark and compare machine learning performance.
3) This allows different groups to evaluate algorithms independently and identify best practices for materials science predictions (a usage sketch follows below).
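A minimal usage sketch following the documented Matbench API; the "predict the training mean" model is a deliberately trivial placeholder for a real algorithm.

```python
import numpy as np
from matbench.bench import MatbenchBenchmark

mb = MatbenchBenchmark(autoload=False, subset=["matbench_expt_gap"])

for task in mb.tasks:
    task.load()
    for fold in task.folds:
        train_inputs, train_outputs = task.get_train_and_val_data(fold)
        test_inputs = task.get_test_data(fold, include_target=False)
        # Trivial baseline: predict the mean of the training targets.
        predictions = np.full(len(test_inputs), np.mean(train_outputs))
        task.record(fold, predictions)

print(mb.matbench_expt_gap.scores)  # per-fold metrics
```

Because every submission runs the same folds and records predictions the same way, results from different groups are directly comparable on the leaderboard.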
Duplicate Detection of Records in Queries using Clustering (IJORCS)
The problem of detecting and eliminating duplicated data is one of the major problems in the broad area of data cleaning and data quality in data warehouses. Many times, the same logical real-world entity has multiple representations in the data warehouse. Duplicate elimination is hard because duplicates arise from several types of errors, such as typographical errors and different representations of the same logical value. It is also important to detect and clean equivalence errors, because a single equivalence error may result in several duplicate tuples. Recent research efforts have focused on duplicate elimination in data warehouses. This entails matching inexact duplicate records, i.e., records that refer to the same real-world entity without being syntactically equivalent. This paper focuses on efficient detection and elimination of duplicate data. The main objective of this research work is to detect exact and inexact duplicates by using duplicate detection and elimination rules, thereby improving the quality of the data.
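A toy sketch of the exact/inexact distinction; normalized edit-distance similarity stands in for the paper's detection and elimination rules, which are not reproduced here.

```python
from difflib import SequenceMatcher

def is_exact_duplicate(a: dict, b: dict) -> bool:
    return a == b

def is_inexact_duplicate(a: dict, b: dict, threshold: float = 0.85) -> bool:
    """Match records that refer to the same entity despite small variations."""
    text_a = " ".join(str(v).lower() for v in a.values())
    text_b = " ".join(str(v).lower() for v in b.values())
    return SequenceMatcher(None, text_a, text_b).ratio() >= threshold

r1 = {"name": "John Smith", "city": "New York"}
r2 = {"name": "Jon Smith", "city": "New York"}   # typographical variation
print(is_exact_duplicate(r1, r2))    # False
print(is_inexact_duplicate(r1, r2))  # True
```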
This document discusses DT's core analytical competencies in data engineering, analytics, and quantitative skills. It describes capabilities in areas such as data architecture, ETL, spatial data services, data transformation, reporting, data mining, spatial data mining, and quantitative skills in statistics, machine learning, spatial statistics and other applied mathematics. It also provides examples of analytics applied to problems involving time series anomaly detection, correlation, aggregation, graphs, movement patterns, and classification. Teams have degrees from top universities and expertise in fields like computer science, engineering, mathematics and social sciences.
Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu... (IJORCS)
The document proposes a privacy-preserving approach for hierarchical document clustering using maximal frequent item sets (MFI). First, MFI are identified from document collections using the Apriori algorithm to define clusters precisely. Then, the same MFI-based similarity measure is used to construct a hierarchy of clusters. This approach decreases dimensionality and avoids duplicate documents, thereby protecting individual copyrights. The methodology and algorithm are described in detail.
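A small illustration of the maximal-frequent-itemset idea using mlxtend's Apriori implementation; the documents and support threshold are toy choices, not the paper's.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

docs = [
    ["data", "mining", "cluster"],
    ["data", "mining", "privacy"],
    ["data", "cluster", "privacy"],
    ["data", "mining", "cluster"],
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(docs).transform(docs), columns=te.columns_)
frequent = apriori(onehot, min_support=0.5, use_colnames=True)

# Keep only maximal itemsets: those not contained in a larger frequent itemset.
itemsets = list(frequent["itemsets"])
maximal = [s for s in itemsets if not any(s < t for t in itemsets)]
print(maximal)  # shared maximal itemsets then drive the similarity measure
```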
TMS workshop on machine learning in materials science: Intro to deep learning... (BrianDeCost)
This presentation is intended as a high-level introduction to deep learning and its applications in materials science. The intended audience is materials scientists and engineers.
Disclaimers: the second half of this presentation is intended as a broad overview of deep learning applications in materials science; due to time limitations it is not intended to be comprehensive. As a review of the field, this necessarily includes work that is not my own. If my own name is not included explicitly in the reference at the bottom of a slide, I was not involved in that work.
Any mention of commercial products in this presentation is for information only; it does not imply recommendation or endorsement by NIST.
This document contains four exam papers for a Data Warehousing and Data Mining course. Each paper contains 8 questions with sub-questions worth varying points. The questions cover topics such as data mining processes, differences between operational databases and data warehouses, data transformation techniques, data mining query languages, classification algorithms like naive Bayes and decision trees, clustering methods, and mining time-series, text and web data.
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An... (PyData)
Artificial intelligence is emerging as a new paradigm in materials science. This talk describes how physical intuition and (insightful) machine learning can solve the complicated task of structure recognition in materials at the nanoscale.
Open Source Tools for Materials Informatics (Anubhav Jain)
This document discusses open source tools for materials informatics, including Matminer and Matscholar. Matminer is a library of descriptors for materials science data that can generate features for machine learning models. It includes over 60 featurizer classes and supports scikit-learn. Matscholar applies natural language processing to over 2 million materials science abstracts to extract keywords and enable improved literature searching. The document argues that open datasets like Matbench and automated tools like Automatminer could help lower barriers for developing machine learning models in materials science by making it easier to obtain training data and evaluate model performance.
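A minimal matminer sketch using the ElementProperty featurizer with its "magpie" preset, which is part of the real library; the composition is an arbitrary example input.

```python
from matminer.featurizers.composition import ElementProperty
from pymatgen.core import Composition

featurizer = ElementProperty.from_preset("magpie")
comp = Composition("Fe2O3")

features = featurizer.featurize(comp)   # numeric feature vector for ML models
labels = featurizer.feature_labels()    # matching human-readable names

for name, value in list(zip(labels, features))[:5]:
    print(f"{name}: {value:.3f}")
```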
Assessing Factors Underpinning PV Degradation through Data Analysis (Anubhav Jain)
The document discusses using PVPRO methods and large-scale data analysis to distinguish system and module degradation in PV systems. It involves 3 main tasks: 1) Developing an algorithm to detect off-maximum power point operation and compare it to existing tools. 2) Applying PVPRO to additional datasets to refine methods and perform degradation analysis on 25 large PV systems. 3) Connecting bill-of-materials data to degradation results from accelerated stress tests through data-driven analysis and publishing findings while anonymizing data.
The Status of ML Algorithms for Structure-property Relationships Using Matb... (Anubhav Jain)
The document discusses the development of Matbench, a standardized benchmark for evaluating machine learning algorithms for materials property prediction. Matbench includes 13 standardized datasets covering a variety of materials prediction tasks. It employs a nested cross-validation procedure to evaluate algorithms and ranks submissions on an online leaderboard. This allows for reproducible evaluation and comparison of different algorithms. Matbench has provided insights into which algorithm types work best for certain prediction problems and has helped measure overall progress in the field. Future work aims to expand Matbench with more diverse datasets and evaluation procedures to better represent real-world materials design challenges.
Extracting and Making Use of Materials Data from Millions of Journal Articles... (Anubhav Jain)
- The document discusses using natural language processing techniques to extract materials data from millions of journal articles.
- It aims to organize the world's information on materials science by using NLP models to extract useful data from unstructured text sources like research literature in an automated manner.
- The process involves collecting raw text data, developing machine learning models to extract entities and relationships, and building search interfaces to make the extracted data accessible.
The document discusses DataONE, a project aimed at improving data repository interoperability and advancing best practices in data lifecycle management. It focuses on enabling access to multiple external data repositories from within a HUB environment. This would allow users to aggregate and integrate disparate datasets for new analyses, and enable reproducible workflows. The goal is to address issues around scattered and dispersed data by improving discovery, integration and long-term preservation of datasets.
1. Materials Informatics uses Python tools like RDKit for analyzing molecular structures and properties.
2. ORGAN and MolGAN are two generative models that use GANs to generate novel molecular structures based on SMILES strings, with ORGAN incorporating reinforcement learning to optimize for desired properties.
3. Tools like RDKit enable analyzing molecular fingerprints and descriptors that can be used for machine learning applications in materials informatics (see the sketch below).
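A short RDKit sketch of the fingerprint and descriptor computations mentioned above; ethanol is an arbitrary example molecule.

```python
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors

mol = Chem.MolFromSmiles("CCO")  # ethanol

# Morgan (circular) fingerprint: a fixed-length bit vector usable as ML input.
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
print(fp.GetNumOnBits(), "bits set out of", fp.GetNumBits())

# Scalar descriptors: physicochemical properties computed from the structure.
print("MolWt:", Descriptors.MolWt(mol))
print("LogP:", Descriptors.MolLogP(mol))
```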
Presentation held at the "Workshop on Knowledge Evolution and Ontology Dynamics" co-located with ISWC 2011. Related to the paper http://ceur-ws.org/Vol-784/evodyn1.pdf
Automating materials science workflows with pymatgen, FireWorks, and atomate (Anubhav Jain)
FireWorks is a workflow management system that allows researchers to define and execute complex computational materials science workflows on local or remote computing resources in an automated manner. It provides features such as error detection and recovery, job scheduling, provenance tracking, and remote file access. The atomate library builds on FireWorks to provide a high-level interface for common materials simulation procedures like structure optimization, band structure calculation, and property prediction using popular codes like VASP. Together, these tools aim to make high-throughput computational materials discovery and design more accessible to researchers.
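A minimal FireWorks sketch that defines and stores a two-step workflow; the shell commands are placeholders for the real simulation steps that atomate's ready-made workflows wire in.

```python
from fireworks import Firework, LaunchPad, ScriptTask, Workflow

fw1 = Firework(ScriptTask.from_str('echo "relax structure"'), name="relax")
fw2 = Firework(ScriptTask.from_str('echo "compute band structure"'), name="bands")

# fw2 runs after fw1; FireWorks tracks the state and provenance of each step.
wf = Workflow([fw1, fw2], {fw1: [fw2]}, name="toy materials workflow")

lp = LaunchPad()  # assumes a local MongoDB with default settings
lp.add_wf(wf)
# Jobs are then executed with e.g. `rlaunch rapidfire` from the command line.
```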
SRbench is a benchmark for streaming RDF storage engines that was developed by Ying Zhang and Peter Boncz of CWI Amsterdam. It uses real-world linked open data sets and defines queries and implementations in natural language and languages like SPARQLStream and C-SPARQL to evaluate streaming RDF databases. The benchmark addresses the challenges of streaming RDF data by using appropriate datasets from the linked open data cloud and supporting semantics in stream queries. Future work will focus on performance evaluation and verifying benchmark results.
The document discusses integrating data from multiple sources on-the-fly without prior knowledge of the schemas. It proposes using approximate entity reconciliation, which leverages techniques like record linkage, approximate joins, and adaptive query processing. The key challenges are trading off completeness of integration for query response time and implementing a hybrid join algorithm that switches between exact and approximate joins to optimize this tradeoff.
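An illustrative sketch of that tradeoff (not the paper's hybrid algorithm): run the cheap exact join first, then fall back to a costlier similarity join only for the entities the exact join missed.

```python
from difflib import SequenceMatcher

left = [{"name": "Acme Corp."}, {"name": "Globex"}]
right = [{"name": "ACME Corporation"}, {"name": "Globex"}]

def exact_join(l, r):
    return [(a, b) for a in l for b in r if a["name"] == b["name"]]

def approximate_join(l, r, threshold=0.6):
    def sim(x, y):
        return SequenceMatcher(None, x.lower(), y.lower()).ratio()
    return [(a, b) for a in l for b in r if sim(a["name"], b["name"]) >= threshold]

pairs = exact_join(left, right)
matched = {id(a) for a, _ in pairs}
leftovers = [a for a in left if id(a) not in matched]
pairs += approximate_join(leftovers, right)  # approximate pass, only where needed
print([(a["name"], b["name"]) for a, b in pairs])
```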
Paper presentations: UK e-science AHM meeting, 2005 (Paolo Missier)
The document describes an ontology-based approach to handling information quality in e-science. It presents an initial quality framework that captures scientists' quality requirements and allows defining domain-specific quality characteristics. It introduces a web service that annotates datasets with quality metrics based on how well their elements conform to relevant ontologies, using transcriptomics as an example domain. The approach aims to make quality definitions reusable and the computation of quality measurements over large datasets cost-effective.
The document discusses porting genome sequencing data processing pipelines from scripted HPC implementations to workflow models on the cloud. This allows the pipelines to be more scalable, flexible, and evolvable. Tracking provenance is also important for using results as clinical evidence and analyzing differences when the pipelines change. Preliminary tests on the Microsoft Azure cloud show potential cost savings from improved resource utilization.
The document discusses scientific workflow management systems and collaboration in workflow-based science. It notes that collaboration requires that a scientist be able to make sense of third-party data, and that this requires the data to be accompanied by provenance metadata that describes how the data was generated and processed. The concept of a "Research Object" is introduced as a way to package scientific data and workflows together with provenance and other related information to enable collaboration and reuse.
PDT: Personal Data from Things, and its provenance (Paolo Missier)
This document discusses various aspects of the Internet of Things (IoT), including potential architectures and stacks, connectivity and evolution. It examines use cases at different scales, from individual sensors to smart cities. The role of metadata and data provenance is explored for IoT applications involving science, personal data from sensors, and devices that make autonomous decisions. Issues of data ownership, privacy and user control are important considerations for personal data generated by IoT devices. The relationship between IoT and machine-to-machine communication is also briefly discussed.
Structured Occurrence Network for provenance: talk for IPAW'12 paper (Paolo Missier)
The document discusses using structured occurrence networks (SONs) to model provenance. SONs extend occurrence networks (ONs) to represent the activity of complex systems through relationships between multiple ONs. The goal is to explore using SONs as a formal model of provenance, viewing data as an evolving system and agents as also evolving systems. Communication SONs are introduced to capture communication between concurrently proceeding ONs. This establishes patterns for representing workflow and multi-layered provenance using SONs.
Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collecti... (Paolo Missier)
The document discusses fine-grained provenance tracking of workflow data products. It presents a functional model for collection-oriented workflow processing that models workflows operating on nested collections. This model generalizes simple iteration to arbitrary collection depths and handles multiple input collections through a generalized cross product operation. The model aims to enable efficient provenance querying by traversing the workflow graph instead of the potentially larger provenance graph.
SWPM12 report on the Dagstuhl seminar on Semantic Data Management (Paolo Missier)
The document summarizes discussions that took place at a Dagstuhl seminar on provenance in semantic data management in April 2012. Key points discussed include:
1) The need for provenance-specific benchmarks and reference data sets to better understand provenance usage and properties.
2) Proposals to collect provenance traces from various domains in a community repository using the PROV standard for interoperability.
3) Challenges of representing and reasoning with uncertain provenance information from sources like sensors, NLP, and human errors.
This document discusses encoding provenance graphs and PROV constraints using Datalog rules. It maps PROV notation graphs to a database of facts and encodes most PROV constraints as Datalog rules. This allows for declarative specification of provenance graphs with deductive inference, enabling validation of graphs and rapid prototyping of analysis algorithms. Some limitations include the inability to encode certain constraints and attributes in graph relations. The approach provides a proof of concept for representing and reasoning over provenance graphs with Datalog.
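A small Python sketch of the encoding idea: the provenance graph becomes a fact base and a Datalog-style rule is applied by forward chaining. The transitive-closure rule over derivation edges is shown purely to illustrate the mechanism; it is not itself one of the normative PROV constraints.

```python
# Fact base: (predicate, subject, object) triples from a PROV graph.
facts = {
    ("wasDerivedFrom", "report", "cleaned_data"),
    ("wasDerivedFrom", "cleaned_data", "raw_data"),
}

def saturate(facts):
    """derived(X, Z) :- derived(X, Y), derived(Y, Z), applied to a fixpoint."""
    while True:
        new = {("wasDerivedFrom", x, z)
               for (p1, x, y1) in facts if p1 == "wasDerivedFrom"
               for (p2, y2, z) in facts if p2 == "wasDerivedFrom" and y1 == y2}
        if new <= facts:
            return facts
        facts |= new

for fact in sorted(saturate(set(facts))):
    print(fact)
```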
Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010 (Paolo Missier)
Missier, P., Ludascher, B., Bowers, S., Anand, M. K., Altintas, I., Dey, S., et al. (2010). Linking Multiple Workflow Provenance Traces for Interoperable Collaborative Science. Proc.s 5th Workshop on Workflows in Support of Large-Scale Science (WORKS).
ProvAbs: model, policy, and tooling for abstracting PROV graphs (Paolo Missier)
This document presents ProvAbs, a model, policy language, and tool for abstracting PROV graphs to enable partial disclosure of provenance data. The model groups nodes in a PROV graph and replaces them with a new abstract node while preserving the graph's validity. A policy assigns sensitivity levels to nodes and drives the node selection for abstraction. The ProvAbs tool implements the abstraction model and allows interactively exploring policy settings and clearances to generate abstract views of a PROV graph.
Big Data Quality Panel: Diachron Workshop @EDBT (Paolo Missier)
1) Traditional approaches to ensuring data quality such as quality assurance and curation face challenges from big data's volume, velocity, and variety characteristics.
2) It is difficult to determine general thresholds for when data quality issues can be ignored as the importance varies between different analytics algorithms.
3) The ReComp decision support system aims to use metadata about past analytics tasks to determine when knowledge needs to be refreshed due to changes in big data or models.
Your data won’t stay smart forever: exploring the temporal dimension of (big ... (Paolo Missier)
Much of the knowledge produced through data-intensive computations is liable to decay over time, as the underlying data drifts, and the algorithms, tools, and external data sources used for processing change and evolve. Your genome, for example, does not change over time, but our understanding of it does. How often should we look back at it, in the hope of gaining new insight, e.g. into genetic diseases, and how much does that cost when you scale re-analysis to an entire population?
The "total cost of ownership” of knowledge derived from data (TCO-DK) includes the cost of refreshing the knowledge over time in addition to the initial analysis, but is often not a primary consideration.
The ReComp project aims to provide models, algorithms, and tools to help humans understand TCO-DK, i.e., the nature and impact of changes in data, and assess the cost and benefits of knowledge refresh.
In this talk we try to map the scope of ReComp by giving a number of patterns that cover typical analytics scenarios where re-computation is appropriate. We specifically describe two such scenarios, where we are conducting small-scale, proof-of-concept ReComp experiments to help us sketch the general ReComp architecture. This initial exercise reveals a multiplicity of problems and research challenges, which will inform the rest of the project.
The lifecycle of reproducible science data and what provenance has got to do ... (Paolo Missier)
The document discusses various aspects of ensuring reproducibility in scientific research through provenance. It begins by providing an overview of the data lifecycle and challenges to reproducibility as experiments and components evolve. It then discusses different levels of reproducibility (rerun, repeat, replicate, reproduce) and approaches to analyzing differences in workflow provenance traces to understand how changes impact results. The remainder of the document describes specific systems and tools developed by the author and collaborators that use provenance to improve reproducibility, including data packaging with Research Objects, provenance recording and analysis workflows with YesWorkflow, process virtualization using TOSCA, and provenance differencing with Pdiff.
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central (Paolo Missier)
This document discusses moving whole exome sequencing pipelines to the cloud using e-Science Central workflow management. The goal is to process 3000 exomes from neurological patients in a scalable and cost-effective way. Current scripts are being ported to e-Science Central for improved abstraction, execution, and provenance tracking. Provenance will help compare results from different pipeline versions and support clinical diagnosis. Initial testing with 300 exomes will begin, with full scalability testing planned for September 2014.
The document discusses using SNPs (single nucleotide polymorphisms) to help identify candidate genes associated with quantitative traits. It presents SNPit, a database that integrates data from Ensembl, dbSNP and Perlegen to rank SNPs based on differences between resistant and susceptible mouse strains. SNPit supports exploratory analysis of large genomic regions to help focus candidate gene searches for traits like disease susceptibility. The goal is to complement existing methods and automate parts of the process to accelerate disease gene identification.
Scientific discovery and innovation in an era of data-intensive science
William (Bill) Michener, Professor and Director of e-Science Initiatives for University Libraries, University of New Mexico; DataONE Principal Investigator
The scope and nature of biological, environmental and earth sciences research are evolving rapidly in response to environmental challenges such as global climate change, invasive species and emergent diseases. Scientific studies are increasingly focusing on long-term, broad-scale, and complex questions that require massive amounts of diverse data collected by remote sensing platforms and embedded environmental sensor networks; collaborative, interdisciplinary science teams; and new tools that promote scientific data preservation, discovery, and innovation. This talk describes the challenges facing scientists as they transition into this new era of data intensive science, presents current solutions, and lays out a roadmap to the future where new information technologies significantly increase the pace of scientific discovery and innovation.
Software tools for high-throughput materials data generation and data mining (Anubhav Jain)
Atomate and matminer are open-source Python libraries for high-throughput materials data generation and data mining. Atomate makes it easy to automatically generate large datasets by running standardized computational workflows with different simulation packages. Matminer contains tools for featurizing materials data and integrating it with machine learning algorithms and data visualization methods. Both aim to accelerate materials discovery by automating and standardizing computational workflows and data analysis tasks.
The document discusses challenges in the modern research workflow and information landscape. It notes that the definition of "information" is evolving as research cycles and processes change. Additionally, it highlights issues around access to information, wasted resources, and imperfections in the existing system. The document advocates that systems need to adapt and suggests we can do better.
The document discusses research workflows and information needs that are changing as research becomes more data-driven and digital. It notes the complexity of information that researchers now deal with, including data, code, and non-digital materials. Additionally, it highlights issues around access, rewards, and incentives in the current system and the need to better support evolving research practices.
The Future of Digital Science - World Science Forum 2011 (Kaitlin Thaney)
(1) The document discusses how digital science is changing the research workflow by making more information available digitally. However, there are still blocking points like accessing non-digital materials and sharing results.
(2) The approach presented aims to address these issues by developing tools that integrate both digital and non-digital mediums to help researchers, machines, and decision makers. This includes tracking parameters, expiration dates, and calibration of non-digital materials.
(3) The goal is to use technology to help coordinate research where feasible, reduce duplication, and help measure the impact and reputation of research in an imperfect digital system.
Materials Data Facility: Streamlined and automated data sharing, discovery, ... (Ian Foster)
Reviews recent results from the Materials Data Facility. Thanks in particular to Ben Blaiszik, Jonathon Goff, and Logan Ward, and the Globus data search team. Some features shown here are still in beta. We are grateful to NIST for their support.
"Towards a Science of Reproducible Science?" DPRMA Workshop talk at JCDL 2013, Indianapolis, 25th July 2013. Workshop website is http://dprma.oerc.ox.ac.uk/
The paper is:
David De Roure. 2013. Towards computational research objects. In Proceedings of the 1st International Workshop on Digital Preservation of Research Methods and Artefacts (DPRMA '13). ACM, New York, NY, USA, 16-19. DOI=10.1145/2499583.2499590 http://doi.acm.org/10.1145/2499583.2499590
Sharing massive data analysis: from provenance to linked experiment reports (Gaignard Alban)
The document discusses scientific workflows, provenance, and linked data. It covers:
1) Scientific workflows can automate data analysis at scale, abstract complex processes, and capture provenance for transparency.
2) Provenance represents the origin and history of data and can be represented using standards like PROV. It allows reasoning about how results were produced.
3) Capturing and publishing provenance as linked open data can help make scientific results more reusable and queryable, but challenges remain around multi-site studies and producing human-readable reports.
Preserving the Inputs and Outputs of Scholarship (tsbbbu)
Tim Babbitt discusses the changing context of research and scholarship due to digitization and the internet. The inputs and outputs of research are increasingly digital and complex, including data, code, presentations, and more. ProQuest has a history of preserving scholarship through microfilming and is exploring how to preserve the full range of digital scholarly outputs and their linkages in a sustainable way. Key questions include balancing new and old preservation methods and moving beyond preserving individual objects to also preserving networks and linkages between scholarly works.
How to best manage your data to make the most of it for your research, with the ODAM framework (Open Data for Access and Mining): give open access to your data and make it ready to be mined.
Where is the opportunity for libraries in the collaborative data infrastructure? (LIBER Europe)
Presentation by Susan Reilly at Bibsys2013 on the opportunities for libraries and their role in the collaborative data infrastructure. Looks at data sharing, authentication, preservation and advocacy.
This document provides an overview of where and how artificial intelligence (AI) is used in materials science. It discusses several key areas:
1) Hypothesis generation using archival data and machine learning to predict new materials.
2) Data acquisition, cleaning, and feature identification using AI techniques like denoising and artifact removal from experimental data.
3) Knowledge extraction from large datasets using unsupervised learning methods like non-negative matrix factorization to identify materials phases (a minimal sketch follows after this list).
4) Closing the materials discovery loop with demonstrations of autonomous materials research systems that integrate computation, autonomous synthesis and characterization using AI.
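A minimal scikit-learn sketch of the non-negative matrix factorization step from point 3), on synthetic data: two hypothetical "phase" signatures are mixed in varying proportions and then recovered.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
pure = np.array([[1.0, 0.0, 2.0, 0.5],
                 [0.0, 3.0, 0.5, 1.0]])   # hypothetical phase signatures
weights = rng.random((10, 2))             # mixing fractions for 10 samples
X = weights @ pure                        # observed (mixed) measurements

model = NMF(n_components=2, init="nndsvda", max_iter=1000)
W = model.fit_transform(X)  # per-sample component weights
H = model.components_       # recovered component signatures
print(H.round(2))
```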
This document summarizes Rob Grim's presentation on e-Science, research data, and the role of libraries. It discusses the Open Data Foundation's work in promoting metadata standards like DDI and SDMX. It also outlines the research data lifecycle and how metadata management can help libraries support research through services like data registration, archiving, discovery and access. Finally, it provides examples of how Tilburg University library supports research data through services aligned with data availability, discovery, access and delivery.
Functional and Architectural Requirements for Metadata: Supporting Discovery... (Jian Qin)
The tremendous growth in digital data has led to an increase in metadata initiatives for different types of scientific data, as evident in Ball’s survey (2009). Although individual communities have specific needs, there are shared goals that need to be recognized if systems are to effectively support data sharing within and across all domains. This paper considers this need, and explores systems requirements that are essential for metadata supporting the discovery and management of scientific data. The paper begins with an introduction and a review of selected research specific to metadata modeling in the sciences. Next, the paper’s goals are stated, followed by the presentation of valuable systems requirements. The results include a base-model with three chief principles: principle of least effort, infrastructure service, and portability. The principles are intended to support “data user” tasks. Results also include a set of defined user tasks and functions, and applications scenarios.
Publishing of Scientific Data - Science Foundation Ireland Summit 2010 (jodischneider)
This document discusses trends in publishing scientific data, including requirements to deposit data, citing data through identifiers like DOIs, considering data itself as a publication in data journals or databases, and including interactive data within publications. It also outlines new roles for working with scientific data, such as data scientists and curators who extract facts from literature to populate databases and ensure data quality.
This presentation discusses managing research data through the data life cycle. It begins with an overview of the research life cycle and embedding the data life cycle within it. Key aspects of data management are then covered, including why manage data, ethical and legal issues, requirements for data sharing and retention, and creating a data management plan. The rest of the presentation delves into each stage of the data life cycle, providing best practices for data collection, organization, security, storage, documentation, processing, analysis, and long-term preservation or sharing. File formats, metadata, repositories, and bibliographic resources are also addressed.
This document discusses using cloud services to facilitate materials data sharing and analysis. It proposes a "Discovery Cloud" that would allow researchers to easily store, curate, discover, and analyze materials data without needing local software or hardware. This cloud platform could accelerate discovery by automating workflows and reducing costs through on-demand scalability. It would also make long-term data preservation simpler. The document highlights Globus research data management services as an example of cloud tools that could help address the dual challenges of treating data as both a rare treasure to preserve and a "deluge" to efficiently manage.
This document discusses the challenges of collecting, storing, and analyzing large volumes of internet measurement data. It examines issues such as distributed and resilient data collection, handling multi-timescale and heterogeneous data from various sources, and developing standardized tools and formats. The paper proposes the "datapository" - an internet data repository designed to address these challenges through a collaborative framework for data sharing, storage, and analysis tools. The goal is to help both network operators and researchers more effectively harness the wealth of data available.
Design and Development of a Provenance Capture Platform for Data Science (Paolo Missier)
A talk given at the DATAPLAT workshop, co-located with the IEEE ICDE conference (May 2024, Utrecht, NL).
Data Provenance for Data Science is our attempt to provide a foundation to add explainability to data-centric AI.
It is a prototype, with lots of work still to do.
Towards explanations for Data-Centric AI using provenance records (Paolo Missier)
In this presentation, given to graduate students at Università Roma Tre, Italy, we suggest that concepts well-known in Data Provenance can be exploited to provide explanations in the context of data-centric AI processes. Through use cases (incremental data cleaning, training set pruning), we build up increasingly complex provenance patterns, culminating in an open question:
how to describe "why" a specific data item has been manipulated as part of data processing, when such processing may consist of a complex data transformation algorithm.
Interpretable and robust hospital readmission predictions from Electronic Hea... (Paolo Missier)
A talk given at the BDA4HM workshop, IEEE BigData conference, Dec. 2023
Please see the paper here:
https://drive.google.com/file/d/1vN08G0FWxOSH1Yeak5AX6a0sr5-EBbAt/view
Data-centric AI and the convergence of data and model engineering: opportunit... (Paolo Missier)
A keynote talk given to the IDEAL 2023 conference (Evora, Portugal Nov 23, 2023).
Abstract.
The past few years have seen the emergence of what the AI community calls "Data-centric AI", namely the recognition that some of the limiting factors in AI performance lie in the data used for training the models, as much as in the expressiveness and complexity of the models themselves. One analogy is that of a powerful engine that will only run as fast as the quality of the fuel allows. A plethora of recent literature has started to explore the connection between data and models in depth, along with startups that offer "data engineering for AI" services. Some concepts are well-known to the data engineering community, including incremental data cleaning, multi-source integration, or data bias control; others are more specific to AI applications, for instance the realisation that some samples in the training space are "easier to learn from" than others. In this "position talk" I will suggest that, from an infrastructure perspective, there is an opportunity to efficiently support patterns of complex pipelines where data and model improvements are entangled in a series of iterations. I will focus in particular on end-to-end tracking of data and model versions, as a way to support MLDev and MLOps engineers as they navigate through a complex decision space.
Realising the potential of Health Data Science: opportunities and challenges ... (Paolo Missier)
This document summarizes a presentation on opportunities and challenges for applying health data science and AI in healthcare. It discusses the potential of predictive, preventative, personalized and participatory (P4) approaches using large health datasets. However, it notes major challenges including data sparsity, imbalance, inconsistency and high costs. Case studies on liver disease and COVID datasets demonstrate issues requiring data engineering. Ensuring explanations and human oversight are also key to adopting AI in clinical practice. Overall, the document outlines a complex landscape and the need for better data science methods to realize the promise of data-driven healthcare.
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science) (Paolo Missier)
This document describes DP4DS, a tool to collect fine-grained provenance from data processing pipelines. Specifically, it can collect provenance from dataframe-based Python scripts. It demonstrates scalable provenance generation, storage, and querying. Current work includes improving provenance compression techniques and demonstrating the tool's generality for standard relational operators. Open questions remain around how useful fine-grained provenance is for explaining findings from real data science pipelines.
A Data-centric perspective on Data-driven healthcare: a short overview (Paolo Missier)
A brief intro to the data challenges associated with working with healthcare data, with a few examples, both from the literature and our own, of traditional approaches (Latent Class Analysis, Topic Modelling), and a perspective on language-based modelling for Electronic Health Records (EHR).
probably more references than actual content in here!
Capturing and querying fine-grained provenance of preprocessing pipelines in ... (Paolo Missier)
This document describes a method for capturing and querying fine-grained provenance from data science preprocessing pipelines. It captures provenance at the dataframe level by comparing inputs and outputs to identify transformations. Templates are used to represent common transformations like joins and appends. The approach was evaluated on benchmark datasets and pipelines, showing overhead from provenance capture is low and queries are fast even for large datasets. Scalability was demonstrated on datasets up to 1TB in size. A tool called DPDS was also developed to assist with data science provenance.
Tracking trajectories of multiple long-term conditions using dynamic patient...Paolo Missier
The document proposes tracking trajectories of multiple long-term conditions using dynamic patient-cluster associations. It uses topic modeling to identify disease clusters from patient timelines and quantifies how patients associate with clusters over time. Preliminary results on 143,000 patients from UK Biobank show varying stability of patient associations with clusters. Further work aims to better define stability and identify causes of instability.
Digital biomarkers for preventive personalised healthcarePaolo Missier
A talk given to the Alan Turing Institute, UK, Oct 2021, reporting on the preliminary results and ongoing research in our lab, on self-monitoring using accelerometers for healthcare applications
The document discusses data provenance for data science applications. It proposes automatically generating and storing metadata that describes how data flows through a machine learning pipeline. This provenance information could help address questions about model predictions, data processing decisions, and regulatory requirements for high-risk AI systems. Capturing provenance at a fine-grained level incurs overhead but enables detailed queries. The approach was evaluated on performance and scalability. Provenance may help with transparency, explainability and oversight as required by new regulations.
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Paolo Missier
a talk given at the VLDB 2021 conference, August, 2021, presenting our paper:
Capturing and Querying Fine-grained Provenance of Preprocessing Pipelines in Data Science. Chapman, A., Missier, P., Simonelli, G., & Torlone, R. PVLDB, 14(4):507–520, January, 2021.
http://doi.org/10.14778/3436905.3436911
Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...Paolo Missier
The document discusses provenance in the context of data science and artificial intelligence. It provides bibliometric data on publications related to data/workflow provenance from 2000 to the present. Recent trends include increased focus on applications in computing and engineering fields. Blockchain is discussed as a method for capturing fine-grained provenance. The document also outlines challenges around explainability, transparency and accountability for high-risk AI systems according to new EU regulations, and argues that provenance techniques may help address these challenges by providing traceability of system functioning and operation monitoring.
Analytics of analytics pipelines:from optimising re-execution to general Dat...Paolo Missier
This document discusses using data provenance to optimize re-execution of analytics pipelines and enable transparency in data science workflows. It proposes a framework called ReComp that selectively recomputes parts of expensive analytics workflows when inputs change based on provenance data. It also discusses applying provenance techniques to collect fine-grained data on data preparation steps in machine learning pipelines to help explain model decisions and data transformations. Early results suggest provenance can be collected with reasonable overhead and enables useful queries about pipeline execution.
ReComp: optimising the re-execution of analytics pipelines in response to cha...Paolo Missier
Paolo Missier presented on optimizing the re-execution of analytics pipelines in response to changes in input data. The talk discussed using provenance to selectively re-run parts of workflows impacted by changes. ProvONE combines process structure and runtime provenance to enable granular re-execution. The ReComp framework detects and quantifies data changes, estimates impact, and selectively re-executes relevant sub-processes to optimize re-running workflows in response to evolving data.
ReComp, the complete story: an invited talk at Cardiff UniversityPaolo Missier
The document describes the ReComp framework for efficiently recomputing analytics processes when changes occur. ReComp uses provenance data from past executions to estimate the impact of changes and selectively re-execute only affected parts of processes. It identifies changes, computes data differences, and estimates impacts on past outputs to determine the minimum re-executions needed. For genomic analysis workflows, ReComp reduced re-executions from 495 to 71 by caching intermediate data and re-running only impacted fragments. The framework is customizable via difference and impact functions tailored to specific applications and data types.
Building RAG with self-deployed Milvus vector database and Snowpark Container...Zilliz
This talk will give hands-on advice on building RAG applications with an open-source Milvus database deployed as a docker container. We will also introduce the integration of Milvus with Snowpark Container Services.
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfMalak Abu Hammad
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slackshyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Climate Impact of Software Testing at Nordic Testing DaysKari Kakkonen
My slides at Nordic Testing Days 6.6.2024
The talk discusses the climate impact and sustainability of software testing. ICT and testing must carry their part of the global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be extended with sustainability and then measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
UiPath Test Automation using UiPath Test Suite series, part 6DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
The UiPath Test Automation with generative AI and Open AI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with Open AI's advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/building-and-scaling-ai-applications-with-the-nx-ai-manager-a-presentation-from-network-optix/
Robin van Emden, Senior Director of Data Science at Network Optix, presents the “Building and Scaling AI Applications with the Nx AI Manager,” tutorial at the May 2024 Embedded Vision Summit.
In this presentation, van Emden covers the basics of scaling edge AI solutions using the Nx tool kit. He emphasizes the process of developing AI models and deploying them globally. He also showcases the conversion of AI models and the creation of effective edge AI pipelines, with a focus on pre-processing, model conversion, selecting the appropriate inference engine for the target hardware and post-processing.
van Emden shows how Nx can simplify the developer's life and facilitate a rapid transition from concept to production-ready applications. He provides valuable insights into developing scalable and efficient edge AI solutions, with a strong focus on practical implementation.
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
1. Scientific Workflow Management System | Janus | Provenance
Research Objects, myExperiment, and Open Provenance for collaborative E-science
REPRISE workshop - IDCC'09
Paolo Missier
Information Management Group
School of Computer Science, University of Manchester, UK
with additional material by Sean Bechhofer and Matthew Gamble,
e-Labs design group, University of Manchester
4. Momentum on sharing and collaboration
Special issue of Nature on Data Sharing (Sept. 2009)
• timeliness requires rapid sharing
• repurposing
• the Human Genome project use case
• Ongoing debate in several communities
– Clinical trials [1]
– Earth Sciences: ESIP data preservation / stewardship, 2009
– Long established in some communities: Atmospheric sciences, 1998 [2]
• Open Science recommendations from Science Commons (July 2008) [link]
Toronto International Data Release Workshop Authors ("the Toronto group"), "Prepublication data sharing," Nature 461, 168-170 (10 September 2009), doi:10.1038/461168a. http://www.nature.com/news/specials/datasharing/index.html
12. Collaboration through data
What is needed for B to make sense of A's data?
1. Packaging: standards for self-descriptive data + metadata bundles (Research Objects)
2. Content: data format standardization efforts; metadata representation, including process provenance (workflow provenance)
3. Container: a repository for Research Objects
22. Paul's Pack
(Diagram: "Paul's Pack", a QTL Research Object aggregating Workflow 16 and Workflow 13 together with their Results, Logs, Slides, a Paper, and common pathways. Edges carry relations such as "produces", "included in", "feeds into", and "published in"; the slide builds up three layers over the pack: Representation, Domain Relations, and Aggregation, plus the Metadata that records them.)
23. ORE: representing generic aggregations
Resource Map: a data structure (descriptor) for an aggregation
http://www.openarchives.org/ore/1.0/primer.html, section 4
A. Pepe, M. Mayernik, C.L. Borgman, and H.V. Sompel, "From Artifacts to Aggregations: Modeling Scientific Life Cycles on the Semantic Web," Journal of the American Society for Information Science and Technology (JASIST), to appear, 2009.
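As an illustration (not from the talk), here is a minimal sketch of an ORE Resource Map for a pack like Paul's, using the Python rdflib library and the ORE terms vocabulary; all URIs and pack contents are invented for the example.

from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import RDF

ORE = Namespace("http://www.openarchives.org/ore/terms/")

g = Graph()
g.bind("ore", ORE)
rem = URIRef("http://example.org/pack/resource-map")   # hypothetical URIs
agg = URIRef("http://example.org/pack/aggregation")

# The resource map describes the aggregation, which aggregates the pack's parts.
g.add((rem, RDF.type, ORE.ResourceMap))
g.add((rem, ORE.describes, agg))
g.add((agg, RDF.type, ORE.Aggregation))
for part in ("workflow16", "results", "logs", "slides", "paper"):
    g.add((agg, ORE.aggregates, URIRef("http://example.org/pack/" + part)))

print(g.serialize(format="turtle"))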
27. Content: Workflow provenance
A detailed trace of workflow execution:
- tasks performed, data transformations
- inputs used, outputs produced
(Diagram: an example workflow with processors "lister", "get pathways by genes1", "merge pathways", and "concat gene pathway ids", taking a gene_id input and producing a pathway_genes output.)
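To make "detailed trace" concrete, here is a minimal sketch of what such a trace might record for one run of the pathway workflow shown; the processor names follow the slide, while the data values are invented.

# One illustrative run, recorded step by step: task, inputs used, outputs produced.
trace = [
    {"task": "lister", "inputs": {}, "outputs": {"genes": ["g1", "g2"]}},
    {"task": "get pathways by genes1", "inputs": {"gene_id": "g1"},
     "outputs": {"pathways": ["p1", "p2"]}},
    {"task": "get pathways by genes1", "inputs": {"gene_id": "g2"},
     "outputs": {"pathways": ["p2"]}},
    {"task": "merge pathways", "inputs": {"pathways": [["p1", "p2"], ["p2"]]},
     "outputs": {"merged": ["p1", "p2"]}},
    {"task": "concat gene pathway ids", "inputs": {"merged": ["p1", "p2"]},
     "outputs": {"pathway_genes": "p1;p2"}},
]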
28. Why provenance matters, if done right
• To establish quality, relevance, trust
• To track information attribution through complex transformations
• To describe one’s experiment to others, for understanding / reuse
• To provide evidence in support of scientific claims
• To enable post hoc process analysis for improvement, re-design
The W3C Incubator on Provenance has been collecting numerous use cases:
http://www.w3.org/2005/Incubator/prov/wiki/Use_Cases#
29. What users expect to learn
• Causal relations:
- which pathways come from which genes?
- which processes contributed to producing an image?
- which process(es) caused data to be incorrect?
- which data caused a process to fail?
• Process and data analytics:
– analyze variations in output vs. an input parameter sweep (multiple process runs)
– how often has my favourite service been executed? on what inputs?
– who produced this data?
– how often does this pathway turn up when the input genes range over a certain set S?
(The same example workflow, with processors "lister", "get pathways by genes1", "merge pathways", and "concat gene pathway ids", is shown alongside.)
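Several of these questions reduce to simple lookups or traversals over such a trace. A sketch, over the illustrative trace above, of "which pathways come from which genes?":

def pathways_by_gene(trace):
    """Answer 'which pathways come from which genes?' from the trace."""
    result = {}
    for step in trace:
        if step["task"] == "get pathways by genes1":
            result[step["inputs"]["gene_id"]] = step["outputs"]["pathways"]
    return result

print(pathways_by_gene(trace))   # {'g1': ['p1', 'p2'], 'g2': ['p2']}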
30. Open Provenance Model
• graph of causal dependencies involving data and processors
• not necessarily generated by a workflow!
• v1.0.1 currently open for comments
Edge types: an artifact A wasGeneratedBy (role R) a process P; a process P used (role R) an artifact A.
Goal: standardize causal dependencies to enable provenance metadata exchange.
(Diagram: a small example OPM graph over artifacts A1-A4 and processes P1-P3, with wasGeneratedBy and used edges labelled by roles R1-R6.)
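A minimal sketch of an OPM-style graph and a causal-ancestry query over it; the node names mirror the slide's example, but the wiring and the dictionary encoding are ours, not part of OPM.

# wasGeneratedBy maps each artifact to the process that generated it;
# used maps each process to the artifacts it consumed.
edges = {
    "wasGeneratedBy": {"A3": "P1", "A4": "P2", "A5": "P3"},
    "used": {"P1": ["A1"], "P2": ["A2"], "P3": ["A3", "A4"]},
}

def ancestry(artifact, graph):
    """Every process and artifact that causally contributed to `artifact`."""
    seen, frontier = set(), [artifact]
    while frontier:
        a = frontier.pop()
        p = graph["wasGeneratedBy"].get(a)
        if p is not None and p not in seen:
            seen.add(p)
            for consumed in graph["used"].get(p, []):
                seen.add(consumed)
                frontier.append(consumed)
    return seen

print(sorted(ancestry("A5", edges)))  # ['A1', 'A2', 'A3', 'A4', 'P1', 'P2', 'P3']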
31. The 3rd Provenance Challenge
• Chosen workflow from the Pan-STARRS project
– Panoramic Survey Telescope & Rapid Response System
• http://twiki.ipaw.info/bin/view/Challenge/ThirdProvenanceChallenge
• Goal: demonstrate "provenance interoperability" at query level
36. OPM and query-interoperability
(Diagram: Team A encodes workflow W as WA, runs WA to obtain prov(WA), runs query Q over it, and exports OPM(prov(WA)). Team B imports the graph as PWA = import(OPM(prov(WA))) and runs the same query Q. Do Q(PWA) and Q(prov(WA)) agree?)
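A toy version of that round trip, reusing the `edges` graph and `ancestry` query sketched above; the JSON encoding merely stands in for a real OPM serialization.

import json

def export_opm(graph):
    """Team A: serialize the graph into a JSON stand-in for an OPM document."""
    return json.dumps(graph)

def import_opm(payload):
    """Team B: rebuild the graph from the exchanged document."""
    return json.loads(payload)

# Query-level interoperability: the same query Q should give the same
# answer on the imported graph as on the original.
Q = lambda g: sorted(ancestry("A5", g))
PWA = import_opm(export_opm(edges))
assert Q(PWA) == Q(edges)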
43. Additional requirements
• Artifact values require a uniform, common identifier scheme
– each group used artifacts to refer to its own data results
– but those results were expressed using proprietary naming conventions
– Linked Data in OPM?
• OPM accounts for structural causal relationships
– additional domain-specific knowledge required
– attaching semantic annotations to OPM graph nodes
• OPM graphs can grow very large
– reduce size by exporting only query results
– Taverna approach: multiple levels of abstraction through OPM accounts ("points of view")
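One plausible reading of accounts as "points of view", sketched in Python: each assertion is tagged with the accounts that contain it, and extracting an account yields a smaller graph at the chosen level of abstraction. The tagging scheme is illustrative, not OPM syntax.

# Each assertion (edge) is tagged with the accounts that contain it.
assertions = [
    {"edge": ("A5", "wasGeneratedBy", "P3"), "accounts": {"summary", "detailed"}},
    {"edge": ("P3", "used", "A3"), "accounts": {"detailed"}},
    {"edge": ("P3", "used", "A4"), "accounts": {"detailed"}},
]

def view(assertions, account):
    """Keep only the assertions made within the given account."""
    return [a["edge"] for a in assertions if account in a["accounts"]]

print(view(assertions, "summary"))  # coarse view: [('A5', 'wasGeneratedBy', 'P3')]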
48. Query results as OPM graphs
(Diagram: as before, but now Team A exports only the query result, OPM(Q(prov(WA))), rather than the full graph prov(WA).)
- Approach implemented in Taverna 2.1
- Internal provenance DB with ad hoc query language
- To be released soon
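A sketch of the size-reduction idea, again over the toy `edges` graph and `ancestry` query from above: export only the fragment of the graph that a query result depends on, rather than the whole graph.

def export_query_result(graph, artifact):
    """Export only the fragment of the graph that `artifact` depends on."""
    keep = ancestry(artifact, graph) | {artifact}
    return {
        "wasGeneratedBy": {a: p for a, p in graph["wasGeneratedBy"].items()
                           if a in keep and p in keep},
        "used": {p: [a for a in arts if a in keep]
                 for p, arts in graph["used"].items() if p in keep},
    }

fragment = export_query_result(edges, "A3")
# keeps only A1, P1, and A3: strictly smaller than the full graph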
55. Full-fledged data-mediated collaborations
(Diagram: experiment A runs workflow A + input A and publishes a Research Object containing result A, its provenance, and datasets A. Result A becomes input B for workflow B; experiment B's Research Object then contains results, provenance, and datasets for A+B combined.)
Provenance composition accounts for implicit collaboration.
Aligned with the focus of the upcoming Provenance Challenge 4: "connect my provenance to yours" into a whole OPM provenance graph.
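A sketch of provenance composition over the same toy encoding: a `same_as` map records that A's result is B's input, so a single ancestry query then spans both runs. Process and artifact names are assumed disjoint across the two graphs, and all names here are invented.

def compose(prov_a, prov_b, same_as):
    """Merge two OPM-style graphs; `same_as` maps B's input artifacts to the
    A results they correspond to (result A -> input B)."""
    rename = lambda a: same_as.get(a, a)
    merged = {"wasGeneratedBy": dict(prov_a["wasGeneratedBy"]),
              "used": {p: list(arts) for p, arts in prov_a["used"].items()}}
    for art, proc in prov_b["wasGeneratedBy"].items():
        merged["wasGeneratedBy"][rename(art)] = proc
    for proc, arts in prov_b["used"].items():
        merged["used"].setdefault(proc, []).extend(rename(a) for a in arts)
    return merged

prov_b = {"wasGeneratedBy": {"B_out": "PB"}, "used": {"PB": ["B_in"]}}
whole = compose(edges, prov_b, {"B_in": "A5"})
print(sorted(ancestry("B_out", whole)))  # reaches back through A's run too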