This document discusses provenance and research objects. It introduces key concepts from the PROV model including entities, activities, and agents. It explains how research objects can bundle digital resources from a scientific experiment along with provenance and context. Finally, it provides an example of capturing provenance from workflow runs using the Common Workflow Language and storing it in a research object bundle.
bioexcel.eu
Provenance and Research Objects
Stian Soiland-Reyes
eScience Lab, The University of Manchester
2017-11-03, Aix-en-Provence
CESAB workshop: Reproducible Workflows
orcid.org/0000-0001-9842-9718 @soilandreyes
This work is licensed under a Creative Commons Attribution 4.0 International License.
Attribution
Who collected this sample? Who helped?
Which lab performed the sequencing?
Who did the data analysis?
Who wrote the analysis workflow?
Who made the data set used by analysis?
Who curated the results?
[Diagram: the entity "Data" wasAttributedTo the agent "Alice"; "Alice" actedOnBehalfOf "The lab"]
Why do I need this?
i. To be recognized for my work
ii. Who should I give credit to?
iii. Who should I complain to?
iv. Can I trust them?
v. Who should I make friends with?
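The attribution relations above can be sketched as a small set of PROV-style triples. This is a toy Python representation, not a real PROV serialization; a production system would use a PROV library and namespaced identifiers, and the names "Alice" and "TheLab" simply follow the diagram:

```python
# Minimal sketch of PROV attribution as subject-predicate-object triples
# (illustrative only, names taken from the diagram above).
triples = [
    ("Data", "prov:wasAttributedTo", "Alice"),    # who made the data
    ("Alice", "prov:actedOnBehalfOf", "TheLab"),  # delegation: Alice worked for the lab
]

def attributed_agents(entity, triples):
    """Follow wasAttributedTo, then actedOnBehalfOf, to credit everyone involved."""
    agents = [o for s, p, o in triples if s == entity and p == "prov:wasAttributedTo"]
    # Also credit anyone the direct agents acted on behalf of
    for agent in list(agents):
        agents += [o for s, p, o in triples if s == agent and p == "prov:actedOnBehalfOf"]
    return agents

print(attributed_agents("Data", triples))  # ['Alice', 'TheLab']
```

Answering "who should I give credit to?" then becomes a traversal of these links rather than guesswork.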
Derivation
Which sample was this metagenome sequenced from?
Which metagenome was this sequence extracted from?
Which sequence was the basis for the results?
What is the previous revision of the new results?
[Diagram: Metagenome wasDerivedFrom Sample; Sequence wasDerivedFrom Metagenome; New results wasDerivedFrom Sequence; New results wasRevisionOf Old results; also wasQuotedFrom and wasInfluencedBy links]
Why do I need this?
i. To verify consistency (did I use the correct sequence?)
ii. To find the latest revision
iii. To backtrack where a divergence appeared after a change
iv. To credit work I depend on
v. Auditing and defence for peer review
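Backtracking a derivation chain like the one in the diagram is a simple graph walk. A minimal sketch, with entity names following the diagram (a real system would query wasDerivedFrom statements from a PROV store):

```python
# Sketch: walking prov:wasDerivedFrom links to answer questions like
# "which sample was this metagenome sequenced from?" (illustrative only).
derived_from = {
    "new-results": "sequence",
    "sequence": "metagenome",
    "metagenome": "sample",
}
revision_of = {"new-results": "old-results"}

def lineage(entity):
    """Backtrack the full derivation chain of an entity."""
    chain = []
    while entity in derived_from:
        entity = derived_from[entity]
        chain.append(entity)
    return chain

print(lineage("new-results"))  # ['sequence', 'metagenome', 'sample']
```

The same walk supports auditing: if the chain does not pass through the expected sequence, an inconsistency has crept in somewhere.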
Activities
What happened? When? Who?
What was used and generated?
Why was this workflow started?
Which workflow ran? Where?
[Diagram: the Sequencing activity used Sample and generated Metagenome; Sequencing wasAssociatedWith Alice, who hadRole Lab technician; the Workflow run used Metagenome, wasStartedBy/wasAssociatedWith Workflow server, wasStartedAt "2012-06-21", wasInformedBy Sequencing, and hadPlan Workflow definition; Results wasGeneratedBy Workflow run]
Why do I need this?
i. To see which analysis was performed
ii. To find out who did what
iii. What was the metagenome used for?
iv. To understand the whole process: "make me a Methods section"
v. To track down inconsistencies
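The "make me a Methods section" idea can be sketched directly: record each activity with its used/generated entities and associated agent, then render them as prose. This is a toy illustration with names taken from the diagram, not the PROV-JSON serialization:

```python
# Sketch: PROV-style activity records (used / wasGeneratedBy / wasAssociatedWith)
# rendered as a crude "Methods section". Illustrative only.
activities = [
    {"activity": "Sequencing", "used": ["Sample"], "generated": ["Metagenome"],
     "agent": "Alice", "role": "Lab technician"},
    {"activity": "Workflow run", "used": ["Metagenome"], "generated": ["Results"],
     "agent": "Workflow server", "plan": "Workflow definition",
     "startedAt": "2012-06-21"},
]

def methods_section(activities):
    lines = []
    for a in activities:
        line = f"{a['activity']} used {', '.join(a['used'])} and generated {', '.join(a['generated'])}"
        line += f" (associated with {a['agent']}"
        if "role" in a:
            line += f" as {a['role']}"
        if "startedAt" in a:
            line += f", started {a['startedAt']}"
        line += ")."
        lines.append(line)
    return "\n".join(lines)

print(methods_section(activities))
```

Because the activities are linked (wasInformedBy, used/generated chains), the same records also support tracking down inconsistencies across the whole process.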
A Research Object bundles and relates digital resources of a scientific experiment/investigation + context:
Data used and results produced in the experimental study
Methods employed to produce and analyse that data
Provenance and settings for the experiments
People involved in the investigation
Annotations about these resources, to improve understanding and interpretation
Common Workflow Language Viewer
Snapshots evolving CWL files in GitHub
Permalink to snapshot the workflow; identifier for the RO
Download as a Research Object Bundle
CWL files packaged in an RO
CWL RO + added richness
Lift out parts into the manifest
Provenance from cwltool (Farah Z Khan):
Modify cwltool reference implementation to capture provenance
Generates a BagIt Research Object
Mints identifiers for data and run
Captures intermediate values
Workflow activities as PROV
wfdesc, OPMW, ProvONE
http://doi.org/10.7490/f1000research.1114781.1
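One way a provenance-capturing runner can mint stable identifiers for data and intermediate values is by content hash, so the same bytes always get the same identifier regardless of filename. A minimal sketch; the `data/<xx>/<hash>` layout shown here is an assumption for illustration, not necessarily the exact scheme used by cwltool:

```python
import hashlib

def mint_data_id(content: bytes) -> str:
    """Content-addressed identifier: same bytes -> same id, wherever they occur.

    Layout (data/<first two hex chars>/<full sha1>) is illustrative only.
    """
    digest = hashlib.sha1(content).hexdigest()
    return f"data/{digest[:2]}/{digest}"

print(mint_data_id(b"ACGT"))
```

Content addressing also deduplicates: an intermediate value reused by several workflow steps is stored and identified once.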
S. Woodman, H. Hiden, P. Watson, P. Missier Achieving Reproducibility by Combining Provenance with Service and Workflow Versioning. In: The 6th Workshop on Workflows in Support of Large-Scale Science. 2011, Seattle
Sequencing machines: Illumina
The wfdesc model describes the abstract workflow. It can be automatically extracted from an existing workflow definition (e.g. Taverna workflows as found on myExperiment) or be made manually by users.
Even if the workflow system used is no longer applicable, this model gives a description of the workflow at a level that can be reimplemented in other languages, such as done in SHIWA with (??).
The wfdesc model gives hooks to annotate the different steps with description, purpose, tasks and example values, and is also the recipe behind the abstract workflow provenance (next slide).
wfprov is not intended to be a provenance model; rather, it provides a place where other models can be hooked in: a "convergence layer". It is easily mappable to OPM and PROV.
It relates to the wfdesc model, where you can see an actual workflow run and relate artifacts found aggregated in the RO with their provenance within the workflow run.
This is how we represent a workflow run as a Workflow Results RO Bundle.
We aggregate the workflow outputs, the workflow definition, the inputs used for execution, a description of the execution environment, external URI references (such as the project homepage) and attribution to the scientists who contributed to the bundle.
This effectively forms a Research Object, all tied together by the RO Bundle Manifest, which is in JSON-LD format (normal JSON that is also valid RDF).
Mimetype: application/vnd.wf4ever.robundle+zip
ZIP or BagIt folder structure
JSON and YAML
Linked-ISA
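The ZIP flavour of an RO Bundle follows the ePub convention: a `mimetype` entry comes first in the archive and is stored uncompressed, so the type can be sniffed without unpacking. A minimal sketch, assuming illustrative file names and a stripped-down manifest:

```python
import json
import zipfile

def create_bundle(path):
    """Write a toy RO Bundle ZIP: mimetype first and uncompressed,
    manifest under .ro/, plus one aggregated file (names illustrative)."""
    with zipfile.ZipFile(path, "w") as z:
        # mimetype must be the first entry, STORED (uncompressed)
        z.writestr(zipfile.ZipInfo("mimetype"),
                   "application/vnd.wf4ever.robundle+zip",
                   compress_type=zipfile.ZIP_STORED)
        manifest = {"@context": ["https://w3id.org/bundle/context"],
                    "aggregates": ["/workflow.cwl"]}
        z.writestr(".ro/manifest.json", json.dumps(manifest, indent=2))
        z.writestr("workflow.cwl", "cwlVersion: v1.0\n")

create_bundle("example.bundle.zip")
```

Because the result is plain ZIP, any unzip tool can open it, while RO-aware tools find the manifest in `.ro/manifest.json`.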
It would be the same wherever the git commit lives, so the links can also be generated locally with a git checkout, e.g. as we're doing in the cwltool reference runner provenance when we need to refer to which workflow was run.
This solved the problem we had in Taverna, where we didn't know where the workflow lived.
We still might not know that, but if it's a public workflow and it is later visualized, then CWL Viewer can show it.
Future-proof!
BioCompute Objects (BCO) is a community-driven project backed by the FDA and George Washington University to standardize the exchange of High-Throughput Sequencing workflows for regulatory submissions between the FDA, pharma, bioinformatics platform providers and researchers. There is a particular challenge for regulatory bodies like the FDA in areas like personalized medicine: to review and approve the bioinformatics side, they need to inspect and in some cases replicate the computational analytical workflow. The challenge here is not just the usual reproducibility concern of packaging software and providing the required datasets, but also human understanding of what has been done, by expressing the higher-level steps of the workflow, their parameter spaces and algorithm settings.
At the heart of the BCO is a domain-specific object model which captures this essential information without going into details of the actual execution.
The BCO is expressed as a JSON format, which also includes additional metadata and external identifiers.
If we look at this JSON briefly, it is split into metadata, a brief overview of the pipeline with arguments and scripts. The actual workflow definition is defined outside. In addition we define the parametric domain, and for verification the input and output domains. This allows inspection to see the scope of the analysis.
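The domain split described above can be sketched as plain JSON built in Python. The field names and values here are an approximation for illustration, not the authoritative BCO schema; the ORCID and file names are placeholders:

```python
# Illustrative sketch of the domain split in a BioCompute Object.
# Field names approximate the BCO draft; values are hypothetical placeholders.
import json

bco = {
    "provenance_domain": {
        "name": "HTS analysis",
        "contributors": [{"name": "Alice",
                          "orcid": "https://orcid.org/0000-0000-0000-0000"}],
    },
    "execution_domain": {"script": ["pipeline.py"]},       # pipeline overview
    "parametric_domain": [{"param": "min_quality",          # algorithm settings
                           "value": "30"}],
    "io_domain": {"input_subdomain": [{"uri": "reads.sam"}],   # for verification
                  "output_subdomain": [{"uri": "variants.csv"}]},
}

print(json.dumps(bco, indent=2)[:120])
```

The point of the split is that a reviewer can inspect scope (parametric and I/O domains) without executing anything; the actual workflow definition lives outside the BCO.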
Many of these are actually external links.
The authors and contributors are identified using ORCID, which is a de-facto standard identifier for researchers;
Cross-references are given within the pipeline; scripts can be provided in any language, like Python.
The workflow can be specified using Common Workflow Language – which gives portability as well as capturing execution environment, e.g. which Python version to use for the scripts.
The referenced data files are of course in multiple formats, like CSV or – for sequencing data - more specific formats like SAM
Now while the BCO references these resources in several places in its JSON structure, some may also be indirectly referenced. For instance the CWL workflow might reference particular Docker images that capture the Python version to use.
W3C PROV files might be provided, which can explain more detailed provenance of workflows; this might however become specific to the workflow engine used, and might not directly identify all the resources seen in the BCO.
While we can identify authors with ORCID, they might author different parts of the BCO. If you made a clever Python script used by a BCO, then it is only FAIR that you should be attributed, even if you were nowhere in the vicinity when the BCO was later created. So you can think of these pink, green and blue arrows here as each giving a partial picture of the whole BioCompute Object.
There is also the question of how to move the BCO around – the JSON has many external references as well as relative references to plain files – how can you capture it all without understanding all of the BCO spec?
We are looking at using the BagIt Research Objects for this purpose.
BagIt is a digital archive format used by the Library of Congress and digital repositories. It handles checksums and completeness of files, even if they are large or external.
Research Object (RO) is a model for capturing and describing research outputs; embedding data, executables, publications, metadata, provenance and annotations. Although it is a general model, ROs have been used in particular for capturing reproducible workflows.
The combination of these, ro-bagit, has recently been used by the NIH-funded Big Data for Discovery Service for transferring and archiving very large HTS datasets in a location-independent way, so naturally this could be a good choice for how to archive BCOs.
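The BagIt side of this is simple enough to sketch: a bag declares itself in `bagit.txt` and lists payload checksums in `manifest-sha256.txt`, so completeness and integrity can be verified after transfer. A minimal illustration, not the full BagIt spec (no tag manifests, no fetch.txt for external files):

```python
import hashlib
import os

def make_bag(bag_dir, payload):
    """Write a minimal bag: payload files under data/, declaration in
    bagit.txt, sha256 checksums in manifest-sha256.txt.
    `payload` maps relative file names to bytes (illustrative)."""
    data_dir = os.path.join(bag_dir, "data")
    os.makedirs(data_dir, exist_ok=True)
    lines = []
    for rel, content in payload.items():
        with open(os.path.join(data_dir, rel), "wb") as f:
            f.write(content)
        digest = hashlib.sha256(content).hexdigest()
        lines.append(f"{digest}  data/{rel}")
    with open(os.path.join(bag_dir, "bagit.txt"), "w") as f:
        f.write("BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n")
    with open(os.path.join(bag_dir, "manifest-sha256.txt"), "w") as f:
        f.write("\n".join(lines) + "\n")

make_bag("mybag", {"results.csv": b"a,b\n1,2\n"})
```

Verification is the reverse walk: re-hash each file under `data/` and compare against the manifest, flagging anything missing or altered.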
So here the manifest of the Research Object ties everything together.
The manifest is in JSON-LD format – so it is Linked Data – but you don’t have to know unless you really want to – it is also just JSON.
The manifest **aggregates** all the other resources, including the BCO, but also external resources as well as outside references like identifiers.org.
The aggregation also provides attribution and provenance of each resource, so contributors get the credit they deserve. This is of course also important for regulatory purposes, e.g. to check if the latest version of a tool was used.
An important aspect of research objects is also to capture annotations, using the W3C Web Annotation Model. This allows any part of the BCO to be further described; textually or semantically; so you are not limited to what is supported by the specification of BCO or Research Object. In particular this might be where community-driven standards like BioSchemas can be used.
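Putting the pieces together, a manifest that aggregates resources, records attribution and attaches annotations might look like the following. Property names follow the RO Bundle manifest style, but the URIs, ORCID and file names are illustrative placeholders:

```python
# Sketch of an RO manifest (JSON-LD) aggregating a BCO, an external
# reference, and a Web Annotation. Values are hypothetical placeholders.
import json

manifest = {
    "@context": ["https://w3id.org/bundle/context"],
    "id": "/",
    "aggregates": [
        # attribution per resource, identified by ORCID (placeholder value)
        {"uri": "/bco.json",
         "authoredBy": {"orcid": "https://orcid.org/0000-0000-0000-0000"}},
        # external resources can be aggregated by reference
        {"uri": "https://example.org/dataset"},
    ],
    "annotations": [
        # W3C Web Annotation style: a body describing a target resource
        {"about": "/bco.json", "content": "/annotations/review-notes.txt"},
    ],
}

print(json.dumps(manifest, indent=2))
```

Because the manifest is JSON-LD, it is simultaneously plain JSON for tooling that does not care about Linked Data, and valid RDF for tooling that does.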