This document discusses provenance and research objects. It introduces key concepts from the PROV model including entities, activities, and agents. It explains how research objects can bundle digital resources from a scientific experiment along with provenance and context. Finally, it provides an example of capturing provenance from workflow runs using the Common Workflow Language and storing it in a research object bundle.
bioexcel.eu
Provenance and Research Objects
Stian Soiland-Reyes
eScience Lab, The University of Manchester
2017-11-03, Aix-en-Provence
CESAB workshop: Reproducible Workflows
orcid.org/0000-0001-9842-9718 @soilandreyes
This work is licensed under a Creative Commons Attribution 4.0 International License.
Attribution
Who collected this sample? Who helped?
Which lab performed the sequencing?
Who did the data analysis?
Who wrote the analysis workflow?
Who made the data set used by analysis?
Who curated the results?
[Diagram: the entity "Data" wasAttributedTo the agent "Alice"; "Alice" actedOnBehalfOf "The lab"]
Why do I need this?
i. To be recognized for my work
ii. Who should I give credit to?
iii. Who should I complain to?
iv. Can I trust them?
v. Who should I make friends with?
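The attribution relations above can be sketched as a small set of PROV-style triples. This is a toy Python representation, not a real PROV serialization; a production system would use a PROV library and namespaced identifiers, and the names "Alice" and "TheLab" simply follow the diagram:

```python
# Minimal sketch of PROV attribution as subject-predicate-object triples
# (illustrative only, names taken from the diagram above).
triples = [
    ("Data", "prov:wasAttributedTo", "Alice"),    # who made the data
    ("Alice", "prov:actedOnBehalfOf", "TheLab"),  # delegation: Alice worked for the lab
]

def attributed_agents(entity, triples):
    """Follow wasAttributedTo, then actedOnBehalfOf, to credit everyone involved."""
    agents = [o for s, p, o in triples if s == entity and p == "prov:wasAttributedTo"]
    # Also credit anyone the direct agents acted on behalf of
    for agent in list(agents):
        agents += [o for s, p, o in triples if s == agent and p == "prov:actedOnBehalfOf"]
    return agents

print(attributed_agents("Data", triples))  # ['Alice', 'TheLab']
```

Answering "who should I give credit to?" then becomes a traversal of these links rather than guesswork.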
Derivation
Which sample was this metagenome sequenced from?
Which metagenome was this sequence extracted from?
Which sequence was the basis for the results?
What is the previous revision of the new results?
[Diagram: Metagenome wasDerivedFrom Sample; Sequence wasDerivedFrom Metagenome; New results wasDerivedFrom Sequence; New results wasRevisionOf Old results; also wasQuotedFrom and wasInfluencedBy links]
Why do I need this?
i. To verify consistency (did I use the correct sequence?)
ii. To find the latest revision
iii. To backtrack where a divergence appeared after a change
iv. To credit work I depend on
v. Auditing and defence for peer review
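Backtracking a derivation chain like the one in the diagram is a simple graph walk. A minimal sketch, with entity names following the diagram (a real system would query wasDerivedFrom statements from a PROV store):

```python
# Sketch: walking prov:wasDerivedFrom links to answer questions like
# "which sample was this metagenome sequenced from?" (illustrative only).
derived_from = {
    "new-results": "sequence",
    "sequence": "metagenome",
    "metagenome": "sample",
}
revision_of = {"new-results": "old-results"}

def lineage(entity):
    """Backtrack the full derivation chain of an entity."""
    chain = []
    while entity in derived_from:
        entity = derived_from[entity]
        chain.append(entity)
    return chain

print(lineage("new-results"))  # ['sequence', 'metagenome', 'sample']
```

The same walk supports auditing: if the chain does not pass through the expected sequence, an inconsistency has crept in somewhere.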
Activities
What happened? When? Who?
What was used and generated?
Why was this workflow started?
Which workflow ran? Where?
[Diagram: the Sequencing activity used Sample and generated Metagenome; Sequencing wasAssociatedWith Alice, who hadRole Lab technician; the Workflow run used Metagenome, wasStartedBy/wasAssociatedWith Workflow server, wasStartedAt "2012-06-21", wasInformedBy Sequencing, and hadPlan Workflow definition; Results wasGeneratedBy Workflow run]
Why do I need this?
i. To see which analysis was performed
ii. To find out who did what
iii. What was the metagenome used for?
iv. To understand the whole process: "make me a Methods section"
v. To track down inconsistencies
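The "make me a Methods section" idea can be sketched directly: record each activity with its used/generated entities and associated agent, then render them as prose. This is a toy illustration with names taken from the diagram, not the PROV-JSON serialization:

```python
# Sketch: PROV-style activity records (used / wasGeneratedBy / wasAssociatedWith)
# rendered as a crude "Methods section". Illustrative only.
activities = [
    {"activity": "Sequencing", "used": ["Sample"], "generated": ["Metagenome"],
     "agent": "Alice", "role": "Lab technician"},
    {"activity": "Workflow run", "used": ["Metagenome"], "generated": ["Results"],
     "agent": "Workflow server", "plan": "Workflow definition",
     "startedAt": "2012-06-21"},
]

def methods_section(activities):
    lines = []
    for a in activities:
        line = f"{a['activity']} used {', '.join(a['used'])} and generated {', '.join(a['generated'])}"
        line += f" (associated with {a['agent']}"
        if "role" in a:
            line += f" as {a['role']}"
        if "startedAt" in a:
            line += f", started {a['startedAt']}"
        line += ")."
        lines.append(line)
    return "\n".join(lines)

print(methods_section(activities))
```

Because the activities are linked (wasInformedBy, used/generated chains), the same records also support tracking down inconsistencies across the whole process.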
A Research Object bundles and relates digital resources of a scientific experiment/investigation + context:
Data used and results produced in the experimental study
Methods employed to produce and analyse that data
Provenance and settings for the experiments
People involved in the investigation
Annotations about these resources, to improve understanding and interpretation
Common Workflow Language Viewer
Snapshots evolving CWL files in GitHub
Permalink to snapshot the workflow; identifier for the RO
Download as a Research Object Bundle
CWL files packaged in an RO
CWL RO + added richness
Lift out parts into the manifest
Provenance from cwltool (Farah Z Khan):
Modify cwltool reference implementation to capture provenance
Generates a BagIt Research Object
Mints identifiers for data and run
Captures intermediate values
Workflow activities as PROV
wfdesc, OPMW, ProvONE
http://doi.org/10.7490/f1000research.1114781.1
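One way a provenance-capturing runner can mint stable identifiers for data and intermediate values is by content hash, so the same bytes always get the same identifier regardless of filename. A minimal sketch; the `data/<xx>/<hash>` layout shown here is an assumption for illustration, not necessarily the exact scheme used by cwltool:

```python
import hashlib

def mint_data_id(content: bytes) -> str:
    """Content-addressed identifier: same bytes -> same id, wherever they occur.

    Layout (data/<first two hex chars>/<full sha1>) is illustrative only.
    """
    digest = hashlib.sha1(content).hexdigest()
    return f"data/{digest[:2]}/{digest}"

print(mint_data_id(b"ACGT"))
```

Content addressing also deduplicates: an intermediate value reused by several workflow steps is stored and identified once.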
S. Woodman, H. Hiden, P. Watson, P. Missier Achieving Reproducibility by Combining Provenance with Service and Workflow Versioning. In: The 6th Workshop on Workflows in Support of Large-Scale Science. 2011, Seattle
Sequencing machines: Illumina
The wfdesc model describes the abstract workflow. It can be automatically extracted from an existing workflow definition (e.g. Taverna workflows as found on myExperiment) or be made manually by users.
Even if the workflow system used is no longer applicable, this model gives a description of the workflow at a level that can be reimplemented in other languages, such as done in SHIWA with (??).
The wfdesc model gives hooks to annotate the different steps with description, purpose, tasks and example values, and is also the recipe behind the abstract workflow provenance (next slide).
wfprov is not intended to be a provenance model; rather, it provides a place where other models can be hooked in: a "convergence layer". It is easily mappable to OPM and PROV.
It relates to the wfdesc model, where you can see an actual workflow run and relate artifacts found aggregated in the RO with their provenance within the workflow run.
This is how we represent a workflow run as a Workflow Results RO Bundle.
We aggregate the workflow outputs, the workflow definition, the inputs used for execution, a description of the execution environment, external URI references (such as the project homepage) and attribution to the scientists who contributed to the bundle.
This effectively forms a Research Object, all tied together by the RO Bundle Manifest, which is in JSON-LD format (normal JSON that is also valid RDF).
Mimetype: application/vnd.wf4ever.robundle+zip
ZIP or BagIt folder structure
JSON and YAML
Linked-ISA
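The ZIP flavour of an RO Bundle follows the ePub convention: a `mimetype` entry comes first in the archive and is stored uncompressed, so the type can be sniffed without unpacking. A minimal sketch, assuming illustrative file names and a stripped-down manifest:

```python
import json
import zipfile

def create_bundle(path):
    """Write a toy RO Bundle ZIP: mimetype first and uncompressed,
    manifest under .ro/, plus one aggregated file (names illustrative)."""
    with zipfile.ZipFile(path, "w") as z:
        # mimetype must be the first entry, STORED (uncompressed)
        z.writestr(zipfile.ZipInfo("mimetype"),
                   "application/vnd.wf4ever.robundle+zip",
                   compress_type=zipfile.ZIP_STORED)
        manifest = {"@context": ["https://w3id.org/bundle/context"],
                    "aggregates": ["/workflow.cwl"]}
        z.writestr(".ro/manifest.json", json.dumps(manifest, indent=2))
        z.writestr("workflow.cwl", "cwlVersion: v1.0\n")

create_bundle("example.bundle.zip")
```

Because the result is plain ZIP, any unzip tool can open it, while RO-aware tools find the manifest in `.ro/manifest.json`.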
It would be the same wherever the git commit lives, so the links can also be generated locally with a git checkout, e.g. as we're doing in the cwltool reference runner provenance when we need to refer to which workflow was run.
This solved the problem we had in Taverna, where we didn't know where the workflow lived.
We still might not know that, but if it's a public workflow and it is later visualized, then CWL Viewer can show it.
Future-proof!
BioCompute Objects (BCO) is a community-driven project backed by the FDA and George Washington University to standardize the exchange of High-Throughput Sequencing workflows for regulatory submissions between the FDA, pharma, bioinformatics platform providers and researchers. There is a particular challenge for regulatory bodies like the FDA in areas like personalized medicine: to review and approve the bioinformatics side, they need to inspect and in some cases replicate the computational analytical workflow. The challenge here is not just the usual reproducibility concern of packaging software and providing the required datasets, but also human understanding of what has been done, by expressing the higher-level steps of the workflow, their parameter spaces and algorithm settings.
At the heart of the BCO is a domain-specific object model which captures this essential information without going into details of the actual execution.
The BCO is expressed as a JSON format, which also includes additional metadata and external identifiers.
If we look at this JSON briefly, it is split into metadata, a brief overview of the pipeline with arguments and scripts. The actual workflow definition is defined outside. In addition we define the parametric domain, and for verification the input and output domains. This allows inspection to see the scope of the analysis.
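The domain split described above can be sketched as plain JSON built in Python. The field names and values here are an approximation for illustration, not the authoritative BCO schema; the ORCID and file names are placeholders:

```python
# Illustrative sketch of the domain split in a BioCompute Object.
# Field names approximate the BCO draft; values are hypothetical placeholders.
import json

bco = {
    "provenance_domain": {
        "name": "HTS analysis",
        "contributors": [{"name": "Alice",
                          "orcid": "https://orcid.org/0000-0000-0000-0000"}],
    },
    "execution_domain": {"script": ["pipeline.py"]},       # pipeline overview
    "parametric_domain": [{"param": "min_quality",          # algorithm settings
                           "value": "30"}],
    "io_domain": {"input_subdomain": [{"uri": "reads.sam"}],   # for verification
                  "output_subdomain": [{"uri": "variants.csv"}]},
}

print(json.dumps(bco, indent=2)[:120])
```

The point of the split is that a reviewer can inspect scope (parametric and I/O domains) without executing anything; the actual workflow definition lives outside the BCO.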
Many of these are actually external links.
The authors and contributors are identified using ORCID, which is a de-facto standard identifier for researchers;
Cross-references are given within the pipeline; scripts can be provided in any language, like Python.
The workflow can be specified using Common Workflow Language – which gives portability as well as capturing execution environment, e.g. which Python version to use for the scripts.
The referenced data files are of course in multiple formats, like CSV or – for sequencing data - more specific formats like SAM
Now while the BCO references these resources in several places in its JSON structure, some may also be indirectly referenced. For instance the CWL workflow might reference particular Docker images that capture the Python version to use.
W3C PROV files might be provided, which can explain more detailed provenance of workflows; this might however become specific to the workflow engine used, and might not directly identify all the resources seen in the BCO.
While we can identify authors with ORCID, they might author different parts of the BCO. If you made a clever Python script used by a BCO, then it is only FAIR that you should be attributed, even if you were nowhere in the vicinity when the BCO was later created. So you can think of these pink, green and blue arrows here as each giving a partial picture of the whole BioCompute Object.
There is also the question of how to move the BCO around – the JSON has many external references as well as relative references to plain files – how can you capture it all without understanding all of the BCO spec?
We are looking at using the BagIt Research Objects for this purpose.
BagIt is a digital archive format used by the Library of Congress and digital repositories. It handles checksums and completeness of files, even if they are large or external.
Research Object (RO) is a model for capturing and describing research outputs; embedding data, executables, publications, metadata, provenance and annotations. Although it is a general model, ROs have been used in particular for capturing reproducible workflows.
The combination of these, ro-bagit, has recently been used by the NIH-funded Big Data for Discovery Service for transferring and archiving very large HTS datasets in a location-independent way, so naturally this could be a good choice for how to archive BCOs.
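The BagIt side of this is simple enough to sketch: a bag declares itself in `bagit.txt` and lists payload checksums in `manifest-sha256.txt`, so completeness and integrity can be verified after transfer. A minimal illustration, not the full BagIt spec (no tag manifests, no fetch.txt for external files):

```python
import hashlib
import os

def make_bag(bag_dir, payload):
    """Write a minimal bag: payload files under data/, declaration in
    bagit.txt, sha256 checksums in manifest-sha256.txt.
    `payload` maps relative file names to bytes (illustrative)."""
    data_dir = os.path.join(bag_dir, "data")
    os.makedirs(data_dir, exist_ok=True)
    lines = []
    for rel, content in payload.items():
        with open(os.path.join(data_dir, rel), "wb") as f:
            f.write(content)
        digest = hashlib.sha256(content).hexdigest()
        lines.append(f"{digest}  data/{rel}")
    with open(os.path.join(bag_dir, "bagit.txt"), "w") as f:
        f.write("BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n")
    with open(os.path.join(bag_dir, "manifest-sha256.txt"), "w") as f:
        f.write("\n".join(lines) + "\n")

make_bag("mybag", {"results.csv": b"a,b\n1,2\n"})
```

Verification is the reverse walk: re-hash each file under `data/` and compare against the manifest, flagging anything missing or altered.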
So here the manifest of the Research Object ties everything together.
The manifest is in JSON-LD format – so it is Linked Data – but you don’t have to know unless you really want to – it is also just JSON.
The manifest **aggregates** all the other resources, including the BCO, but also external resources as well as outside references like identifiers.org.
The aggregation also provides attribution and provenance of each resource, so contributors get the credit they deserve. This is of course also important for regulatory purposes, e.g. to check if the latest version of a tool was used.
An important aspect of research objects is also to capture annotations, using the W3C Web Annotation Model. This allows any part of the BCO to be further described; textually or semantically; so you are not limited to what is supported by the specification of BCO or Research Object. In particular this might be where community-driven standards like BioSchemas can be used.
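Putting the pieces together, a manifest that aggregates resources, records attribution and attaches annotations might look like the following. Property names follow the RO Bundle manifest style, but the URIs, ORCID and file names are illustrative placeholders:

```python
# Sketch of an RO manifest (JSON-LD) aggregating a BCO, an external
# reference, and a Web Annotation. Values are hypothetical placeholders.
import json

manifest = {
    "@context": ["https://w3id.org/bundle/context"],
    "id": "/",
    "aggregates": [
        # attribution per resource, identified by ORCID (placeholder value)
        {"uri": "/bco.json",
         "authoredBy": {"orcid": "https://orcid.org/0000-0000-0000-0000"}},
        # external resources can be aggregated by reference
        {"uri": "https://example.org/dataset"},
    ],
    "annotations": [
        # W3C Web Annotation style: a body describing a target resource
        {"about": "/bco.json", "content": "/annotations/review-notes.txt"},
    ],
}

print(json.dumps(manifest, indent=2))
```

Because the manifest is JSON-LD, it is simultaneously plain JSON for tooling that does not care about Linked Data, and valid RDF for tooling that does.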