Presentation at BOSC 2013 / ISMB 2013. (PowerPoint 2013 source)
PDF: https://www.slideshare.net/soilandreyes/2013-0719bosc-2013myexperimentresearchobjectsslides
See also poster at http://www.slideshare.net/soilandreyes/2013-0718bosc-2013myexperimentresearchobjectsposter-24242509 or
submitted abstract: https://docs.google.com/document/d/1jaAuPV-EnbsyI14L56HKHBQP7eDVfeXGLlK-LwohnWw/edit?usp=sharing
We have evolved Research Objects as a mechanism to preserve digital resources related to research, by providing mechanisms, formats and architecture for describing aggregated resources (hypothesis, workflow, datasets, scripts, services), their relations (is input for, explains, used by), provenance (graph was derived from dataset A, B and C) and attribution (who contributed what, and when?).
The website myExperiment is already popular for collaborating on, publishing and sharing scientific workflows, however we have found that for understanding and preserving a workflow over time, its definition is not enough, specially faced with workflow decay, services and tools that change over time. We have therefore adapted the research object model as a foundation for the myExperiment packs, allowing uploading of workflow runs, inputs, outputs and other files relevant to the workflow, relating them with annotations and integrated the Wf4Ever architecture for performing decay analysis and tracking a research object’s evolution as it and its constituent resources change over time.
[2024]Digital Global Overview Report 2024 Meltwater.pdf
2013-07-19 myExperiment research objects, beyond workflows and packs (PPTX)
1. myExperiment Research Objects:
Beyond Workflows and Packs
Stian Soiland-Reyes
myGrid, University of Manchester
BOSC 2013, ISMB, Berlin, 2013-07-19
This work is licensed under a
Creative Commons Attribution 3.0 Unported License
2. Motivation: Scientific workflows
Coordinated execution of
services and linked resources
Dataflow between services
Web services (SOAP, REST)
Command line tools
Scripts
User interactions
Components (nested workflows)
Method becomes:
Documented visually
Shareable as single definition
Reusable with new inputs
Repurposable other services
Reproducible?
http://www.myexperiment.org/workflows/3355
http://www.taverna.org.uk/
http://www.biovel.eu/
3. But workflows are complex machines
Outputs
Inputs
Configuration
Components
http://www.myexperiment.org/workflows/3355
• Will it still work after a year? 10 years?
• Expanding components, we see a workflow involves a
series of specific tools and services which
• Depend on datasets, software libraries, other tools
• Are often poorly described or understood
• Over time evolve, change, break or are replaced
• User interactions are not reproducible
• But can be tracked and replayed
4. Electronic Paper Not Enough
Hypothesis Experiment
Result
Analysis
Conclusions
Investigation
Data Data
Electronic
paper
Publish
http://www.force11.org/beyondthepdf2http://figshare.com/
Open Research movement: Openly share the data of your experiments
http://datadryad.org/
6. What is in a research object?
A Research Object bundles and relates digital
resources of a scientific experiment or
investigation:
Data used and results produced in
experimental study
Methods employed to produce and analyse
that data
Provenance and settings for the experiments
People involved in the investigation
Annotations about these resources, that are
essential to the understanding and
interpretation of the scientific outcomes
captured by a research object
http://www.researchobject.org/
7. Gathering everything
Research Objects (RO) aggregate related resources, their
provenance and annotations
Conveys “everything you need to know” about a
study/experiment/analysis/dataset/workflow
Shareable, evolvable, contributable, citable
ROs have their own provenance and lifecycles
8. Why Research Objects?
i. To share your research materials
(RO as a social object)
ii. To facilitate reproducibility and reuse of methods
iii. To be recognized and cited
(even for constituent resources)
iv. To preserve results and prevent decay
(curation of workflow definition;
using provenance for partial rerun)
13. Annotations in research objects
Types: “This document contains an hypothesis”
Relations: “These datasets are consumed by that tool”
Provenance: “These results came from this workflow run”
Descriptions: “Purpose of this step is to filter out invalid data”
Comments: “This method looks useful, but how do I install it?”
Examples: “This is how you could use it”
14. What is provenance?
By Dr Stephen Dann
licensed under Creative Commons Attribution-ShareAlike 2.0 Generic
http://www.flickr.com/photos/stephendann/3375055368/
Attribution
who did it?
Derivation
how did it change?
Activity
what happens to it?
Licensing
can I use it?
Attributes
what is it?
Origin
where is it from?
Annotations
what do others say about it?
Abstraction levels
shallots, sign, photo or flickr page?
Aggregation
what is it part of?
Date and tool
when was it made?
using what?
15. Attribution
Who collected this sample? Who helped?
Which lab performed the sequencing?
Who did the data analysis?
Who wrote the analysis workflow?
Who made the data set used by analysis?
Who curated the results?
Why do I need this?
i. To be recognized for my work
ii. Who should I give credits to?
iii. Who should I complain to?
iv. Can I trust them?
v. Who should I make friends with?
Alice
The
lab
Data
wasAttributedTo
actedOnBehalfOf
16. Derivation
Which sample was this metagenome sequenced from?
Which meta-genomes was this sequence extracted from?
Which sequence was the basis for the results?
What is the previous revision of the new results?
Why do I need this?
i. To verify consistency (did I use
the correct sequence?)
ii. To find the latest revision
iii. To backtrack where a diversion
appeared after a change
iv. To credit work I depend on
v. Auditing and defence for
peer review
wasDerivedFrom
wasQuotedFrom
Sequence
New
results
wasDerivedFrom
Sample
Meta -
genome
Old
results
wasRevisionOf
wasInfluencedBy
17. Activities
What happened? When? Who?
What was used and generated?
Why was this workflow started?
Which workflow ran? Where?
Why do I need this?
i. To see which analysis was performed
ii. To find out who did what
iii. What was the metagenome
used for?
iv. To understand the whole process
“make me a Methods section”
v. To track down inconsistencies
used
wasGeneratedBy
wasStartedAt
"2012-06-21"
Metagenome
Sample
wasAssociatedWith
Workflow
server
wasInformedBy
wasStartedBy
Workflow
run
wasGeneratedBy
Results
Sequencing
wasAssociatedWith
Alice
hadPlan
Workflow
definition
hadRole
Lab
technician
Results
19. Provenance of what?
Who made the (content of) research object? Who
maintains it?
Who wrote this document? Who uploaded it?
Which CSV was this Excel file imported from?
Who wrote this description? When? How did we get
it?
What is the state of this RO? (Live or Published?)
What did the research object look like before?
(Revisions) – are there newer versions?
Which research objects are derived from this RO?
20. Research object model at a glance
Research
Object
Resource
Resource
Resource
Annotation
Annotation
Annotation
oa:hasTarget
Resource
Resource
Annotation graph
oa:hasBody
ore:aggregates
Manifest
21. Wf4Ever architecture
Semantic REST API
RDF triple store
(RO structure,
Annotations)
RO index
Uploaded files
PortalChecklist
service
Command
line
Workflow
runner
...
22. Saving a research object:
RO bundle
Single, transferrable research object
Self-contained snapshot
Which files in ZIP, which are URIs? (Up to user/application)
Regular ZIP file, explored and unpacked with standard tools
JSON manifest is programmatically accessible without RDF
understanding
Works offline and in desktop applications – no REST API
access required
Basis for RO-enabled file formats, e.g. Taverna run bundle
Exchanged with myExperiment and RO tools
23. Example RO bundle:
Workflow Results
workflowrun.prov.ttl
(RDF)
outputA.txt
outputC.jpg
outputB/
https://w3id.org/bundl
intermediates/
1.txt
2.txt
3.txt
de/def2e58b-50e2-4949-9980-fd310166621a.txt
inputA.txt
workflow
URI
references
attribution
execution
environment
Aggregating in Research Object
ZIP folder structure (RO Bundle)
mimetype
application/vnd.wf4ever.robundle+zip
.ro/manifest.json
25. Summary
Provide provenance records: Use (and extend) PROV.
Share your methods (tools, workflows), not just your data
Make research objects: Collect resources and relate them
Pick your RO integration level to adapt:
1. Conceptual model
2. RO Core ontology (aggregation and annotation)
3. Workflow description and provenance
4. Wf4Ever architecture for APIs
5. myExperiment as user interface
Editor's Notes
http://purl.org/wf4ever/model
http://purl.org/wf4ever/model
Typing of resources and relating them to each-other are individual annotations
Quality metrics are monitored over time, so as the research object evolves (say gaining additional annotations), the health of the annotations can be indicated graphically.
Sequencing machines:illumina
W3C standardisation body HTML
So for Wf4Ever, what kind of provenance are we talking about? It is a bit more down to earth as we are generally just talking about files and documents, but still, as you form a research object combining many different resources, the different dimensions of provenance become increasingly important to track.
So let’s have a look at what a Research Object looks like. The core is the concept of the Research Object itself, which you may also known as an ORE aggregation. This is described by the manifest, which is simply an RDF file. The RO aggregates a series of resources – in Linked Data these could be anywhere in the world. Additionally it aggregates a set of annotations, which we know is the link between a target resource (here aggregated in the RO), and an body resource. In Wf4Ever we typically provide the body as a separate RDF Graph, so that we can use existing vocabularies to describe and relate the resources.
So not everyone have access to set up a RESTful semantic web servers, in particular we’ve run into this with desktop applications – users just want to save files and then they decide where they are stored. So we decided to write a serialization format for Research Object, which we call the RO Bundle.We wanted this to be accessible for applicaton developers, so we’ve adopted ZIP and JSON, and in a way this would let you create research objects and make annotations without ever seing any RDF.
This shows how the JSON manifest focuses on the most common aspect of a research object – who made it? When? What is aggregated – files in the ZIP but also external URIs – up to the application or person making the bundle to decide what is to be included in the ZIP. Annotations are included at the bottom here, we see that there’s an annotation “about” (target) the analysis JPEG, and the content (the body) is within the annotations/ folder. Similarly, the next annotations relates the external resource (a blog post) with our aggregation of a resource.This is processable as JSON-LD – so it is not just JSON, it is also RDF, and out comes normal ORE aggregations and OA annotations.
This is how we represent a workflow run as a Workflow Results RO Bundle. We aggregate the workflowoutputs, , workflow definition, the inputs used for execution, a description of the execution environment, external URI references (such as the project homepage) and attribution to scientists who contributed to the bundle. This effectively forms a Research Object, all tied together by the RO Bundle Manifest, which is in JSON-LD format. (normal JSON that is also valid RDF).
So we have recently formed a W3C Community Group for Research Object, which has gathered significant interest, 75 participant. As you see, I am one of the chairs, and so is Rob which you already know from OA group. We are just starting up, and our focus in the RO community group is rather how to practically use Research Objects as a concept than to specify a new model – we’ll refer to existing models where it’s appropriate, but also explore other models which could be described as research objects.