Research Objects
Preserving scientific data and methods
Stian Soiland-Reyes, Khalid Belhajjame
School of Computer Science, Univ of Manchester
myGrid NIHBI meet-up Manchester 2013-01-17
Agenda
» Preserving digital science
» The Research Object
» Anatomy
» Lifecycle
» Wf4Ever Tools
» Future developments
Computation Processes in Today’s Research
» Research is being conducted in an increasingly digital and
online environment
» This has led to the emergence of new digital artifacts
» In some respects, these objects can be regarded as data
» However, some objects include the description of the
research method that is captured as a computational
process
» Such processes encapsulate the knowledge related to the
generation, (re)use and general transformation of data in
experimental sciences
[Figure labels: Raw data, Computational process, Results]
Scientific Workflow
In this work, we focus on a particular kind of computational process called a scientific
workflow
» A scientific workflow is a precise, executable
description of a scientific procedure - a series of
analysis operations connected using data links
» Each operation represents the execution of a
computational process
» Can be supplied by independently developed
web services
» Can also use existing data sources that are
accessible on the Web
Preservation Challenges
Preservation challenges concern the executable aspects of workflows and their vulnerability to the volatility of the
resources required for their execution
» Changes by 3rd parties
» Workflow may produce
different lists at different
times
» Workflow may become
inoperable
» Workflow decay – The execution of the workflow may fail or yield different results,
due to dependencies on resources and services subject to independent changes,
e.g., EMBL-EBI. Even workflows that depend on local resources are vulnerable.
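One way to notice workflow decay is to re-run a workflow and compare its outputs with fingerprints recorded from an earlier run. The following is a minimal sketch under stated assumptions: `check_decay`, `digest` and the output names are hypothetical, outputs are assumed to be byte strings, and this is not a description of any Wf4Ever tool.

```python
import hashlib
from typing import Callable, Dict


def digest(data: bytes) -> str:
    """Return a SHA-256 hex digest used as a fingerprint of one output."""
    return hashlib.sha256(data).hexdigest()


def check_decay(run_workflow: Callable[[], Dict[str, bytes]],
                recorded: Dict[str, str]) -> Dict[str, str]:
    """Re-run a workflow and classify every previously recorded output.

    Each output is reported as "same", "changed" or "missing".
    If the run itself raises, every output is reported as "failed".
    """
    try:
        outputs = run_workflow()
    except Exception:
        # The workflow has become inoperable, e.g. a remote service is gone.
        return {name: "failed" for name in recorded}

    status = {}
    for name, old_digest in recorded.items():
        if name not in outputs:
            status[name] = "missing"
        elif digest(outputs[name]) == old_digest:
            status[name] = "same"
        else:
            status[name] = "changed"
    return status
```

A "changed" status does not by itself mean the new result is wrong; it only signals that a dependency has moved and the result needs review.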
Repeat (within lab) and Reproduce (between labs)
[Figure labels, on both sides: Materials, Methods, Instruments, Laboratory; in between: Publication, Data, Models, Techniques, Algorithms]
» Replicate / Repeat: exactly replicate the original experiment and experimental conditions. Eliminate change. Observe.
» Provenance: attribution, credit
» Reproduce: run the experiment with differences in experimental conditions. Compare to test for the same result. Observe.
[Figure labels: Context; Investigation, Study, Experiment; Capture, Curate, Discover, Use, Reuse, Preserve]
From Electronic papers to Research objects
[Figure labels: Scientists; Electronic paper; Research Object; Hypothesis, Experiments, Annotations, Results, Provenance, Datasets]
Why research objects?
A research object aggregates all elements deemed necessary to
understand research investigations
Promote reuse, sharing
Enable the verification of reproducibility of the results
Trackable, versionable, referenceable
Anatomy of a research object
[Figure: class diagram of the research object model]
» Classes: ro:ResearchObject, ro:Manifest, ro:Resource, ro:Folder, ro:FolderEntry, ro:SemanticAnnotation
» Properties: ore:aggregates, ore:describes, ore:proxyFor, ore:proxyIn, ro:annotatesAggregatedResource, ao:body
» Also shown: a "subclass of" link and an RDF file
Grounding Workflow-centric Research
Objects Using Semantic Technologies
Workflow-centric research objects are encoded using RDF, according to a set
of ontologies that are publicly available
Research objects extend the Object Reuse and Exchange (ORE) model to
represent aggregation.
Grounding Workflow-centric Research
Objects Using Semantic Technologies
We use the Annotation Ontology (AO) to annotate research object
resources and their relationships.
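As an illustration of how an annotation ties a resource to its description, the sketch below spells out the statements behind one annotation. It is a minimal sketch, not the normative model: the property and class names are taken from the anatomy slide, their pairing reflects one reading of that diagram, and the `Annotation` class and the identifiers `:annotation1`, `:workflow1` and `:annotation1-body.rdf` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Triple = Tuple[str, str, str]


@dataclass(frozen=True)
class Annotation:
    name: str    # identifier of the annotation itself
    target: str  # aggregated resource that is being annotated
    body: str    # RDF file that holds the annotation content

    def triples(self) -> List[Triple]:
        """Return the annotation as (subject, predicate, object) statements."""
        return [
            (self.name, "rdf:type", "ro:SemanticAnnotation"),
            (self.name, "ro:annotatesAggregatedResource", self.target),
            (self.name, "ao:body", self.body),
        ]


# Hypothetical identifiers, for illustration only.
example = Annotation(":annotation1", ":workflow1", ":annotation1-body.rdf")
for statement in example.triples():
    print(" ".join(statement), ".")
```

Keeping the body in a separate RDF file means the description can be replaced or extended without touching the annotated resource.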
Relating resources in a research object
[Figure: graph relating the resources of a research object]
» Resources: Workflow_16, Workflow_13, QTL, Common pathways, Results, Logs, Metadata, Paper, Slides
» Relations: produces, feeds into, included in, published in
The provenance of the RO elements is key to understanding, comparing and debugging scientific workflows, and to
verifying the validity of a claim made within the context of an RO
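To make the role of provenance concrete, the sketch below walks backwards through recorded relations to find everything a given resource depends on. It is only a sketch: the relation names come from the slide, but the specific edges in `EDGES` are illustrative assumptions (the exact arrows of the figure are not recoverable here), and `upstream` is a hypothetical helper.

```python
from collections import deque
from typing import Dict, List, Tuple

# (subject, relation, object). Illustrative edges only, not the slide's graph.
EDGES: List[Tuple[str, str, str]] = [
    ("Workflow_16", "produces", "QTL results"),
    ("QTL results", "feeds into", "Workflow_13"),
    ("Workflow_13", "produces", "Common pathways"),
    ("Common pathways", "published in", "Paper"),
]


def upstream(target: str, edges: List[Tuple[str, str, str]]) -> List[str]:
    """Return every resource from which `target` can be reached, nearest first."""
    incoming: Dict[str, List[str]] = {}
    for subject, _relation, obj in edges:
        incoming.setdefault(obj, []).append(subject)

    seen = {target}
    order: List[str] = []
    queue = deque([target])
    while queue:  # breadth-first walk against the direction of the edges
        node = queue.popleft()
        for source in incoming.get(node, []):
            if source not in seen:
                seen.add(source)
                order.append(source)
                queue.append(source)
    return order


print(upstream("Paper", EDGES))
```

A reviewer checking a claim in the paper could use such a walk to list the results and workflows that the claim ultimately rests on.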
Evolution of a research object
[Figure: a scientist's Live RO is copied into RO snapshots and into an Archived RO; a later Live RO starts from a copy]
» Events: my supervisor calls me to report my work; my supervisor calls me again and we decide to publish our RO+paper; reviews received and final version published; a new PhD student continues my work
» Relations: <<copy>>, <<copy, filter and curate>>, <<versionOf>>
» RO snapshot (Scientist): identified by a URI; some metadata; some curation; mostly private (for my group, or for my group and for paper reviewers)
» Archived RO (Librarian/Curator): identified by a URI; good metadata and curation; mostly public
PROV standard - Basis for evolution model
Candidate Recommendation
http://www.w3.org/TR/prov-primer/
Current Status and Ongoing Work
Models/spec v0.1 public: http://purl.org/wf4ever/model
- Upcoming revision v0.2: (Q1 2013)
• Minor additions to workflow model terms
• “RO Terms” – Upper user level view of RO: hypothesis, results – many are “shortcuts” for structured model
- TODO: Update annotation model to Open Annotation Data Model (OAC)
- TODO: PAV for detailed authorship provenance
Showing, managing and sharing Research Objects through the
myExperiment web site
http://www.myexperiment.org/
Open Annotation Data Model
Community Draft
“Almost final” spec: 2013-01-28
Roll out meeting in Manchester: March 2013
http://www.openannotation.org/spec/core/
Some of the shared digital artefacts of digital research are executable in the sense that they describe an automated process which generates results. For example, an object might contain raw…
… i.e., a multi-step process, like a script, that coordinates multiple components and tasks and orchestrates the flow of data. … such as running a program, submitting a query to a database, submitting a job to a computational facility, or invoking a service over the Web to use a remote resource.
It is therefore important to record the provenance of workflow outputs, i.e., the sources of information and processes involved in producing a particular list. … to changes in operating systems, data management sustainability and access to computational infrastructure. We note that workflows have many of the properties of software, such as the composition of components with external dependencies, and hence some aspects of software preservation [10] are applicable.
ORE defines standards for the description and exchange of aggregations of Web resources. Using ORE, a workflow-centric research object is defined as a resource that aggregates other resources, i.e., workflow(s), provenance, other objects and annotations. For example, an RDF Turtle snippet can specify that a research object identified by :wro aggregates a workflow template :pathway_wf_sp, a workflow run :pathway_wf_run, and an annotation :wfannot.
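The aggregation described in this note can be written out as Turtle. The sketch below generates such a snippet from Python. Several things are assumptions: the underscores in the identifiers (the extracted text shows spaces), the base namespace `http://example.org/ro/`, the helper name `aggregation_turtle`, and typing the research object as a plain `ore:Aggregation` rather than with a term from the RO ontology.

```python
from typing import List

ORE = "http://www.openarchives.org/ore/terms/"  # OAI-ORE vocabulary namespace
BASE = "http://example.org/ro/"                 # hypothetical base for ":" names


def aggregation_turtle(research_object: str, resources: List[str]) -> str:
    """Serialise 'research_object aggregates resources' as a Turtle document."""
    if not resources:
        raise ValueError("an aggregation needs at least one resource")
    aggregated = ", ".join(f":{name}" for name in resources)
    lines = [
        f"@prefix ore: <{ORE}> .",
        f"@prefix : <{BASE}> .",
        "",
        f":{research_object} a ore:Aggregation ;",
        f"    ore:aggregates {aggregated} .",
    ]
    return "\n".join(lines)


print(aggregation_turtle("wro", ["pathway_wf_sp", "pathway_wf_run", "wfannot"]))
```

The generated statement lists the workflow template, the workflow run and the annotation as objects of a single `ore:aggregates` predicate, which is how Turtle abbreviates three separate triples.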
The elements that compose a Research Object may differ from one to another, and this difference may have consequences for the level of reproducibility that can be guaranteed. At one end of the spectrum, the Research Object is represented by a paper. As we progress to the other end, the Research Object is enriched to include elements such as the workflow implementing the computation, annotations describing the experiment implemented and the hypothesis investigated, and provenance traces of past executions of the workflow.
Assessing the reproducibility of computations described using electronic papers can be tedious: a paper may just sketch the method implemented by the computation in question, without delving into the details that are necessary to check that the results obtained, or claimed, in the paper can be reproduced. Verifying the reproducibility of ROs at the other end of the spectrum is less difficult. The provenance trace provides data examples to re-enact the workflow and the means to verify that the results of workflow executions are comparable with prior results.
To ensure the preservation of a workflow and the reproducibility of its results, the RO needs to be managed and curated throughout the lifecycle of the associated workflow.
We will now illustrate the research object lifecycle through a small example that shows how all the resources contained in a research object are bundled as the scientific experiment progresses. This example lifecycle is summarized graphically on the slide.
A research object normally starts its life as an empty Live Research Object, with a first design of the experiments to be performed (which determines what workflows and resources will be added, by either retrieving them from an existing platform or creating them from scratch). The research object is then filled incrementally by aggregating workflows that are being created, reused or re-purposed, along with datasets, documents, etc. Any of these components can be changed or removed at any point in time.
In our scenario, we observe several points in time when this Live Research Object gets copied and kept as a Research Object snapshot, which aims to reflect the status of the research object at a given point in time. Such a snapshot may be useful to release the current version of the research outcome of an experiment, submit it to be peer reviewed or to be published (with the appropriate access control mechanisms), share it with supervisors or collaborators, or for acknowledgement and citation purposes.
A snapshot may also contain a paper describing the research object in general and the experiment in particular, depending on the policies of the corresponding scientific communication channel, e.g., workshop, conference or journal. Such snapshots have their own identifiers, and may even be preserved, since it may be useful to track the evolution of the research object over time, so as to allow, for example, retrieval of a previous state of the research object or reporting the evolution of the research to funding agencies.
At some point in time, the research object may get published and archived as what we call an Archived Research Object, with a permanent identifier. Such a version of our research object may be a complete copy of our Live Research Object, or it may be the result of a filtering or curation process in which only some parts of the information available in the aggregation are actually published for others to reuse. As illustrated on the slide, a user can use an existing Archived Research Object as a starting point for his or her research, e.g., to repurpose it or its parts, in which case a new Live Research Object is created based on the existing Archived Research Object.
This is only one of the many potential scenarios that could be foreseen for the lifecycle of a workflow-centric research object, and we are currently defining different storyboards for their evolution. One important aspect to highlight is that the research object aggregates new objects during its whole lifecycle. The annotation process during the lifecycle of experimentation allows the generation of sufficient metadata about the research objects to support preservation and sharing. Therefore, when a scientist decides to preserve a research object, most of the annotations needed for that preservation process will already be available inside it.
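The lifecycle described in these notes can be sketched as three operations on a simple data structure: snapshot, archive, and starting a new live object from an archive. This is a minimal sketch under stated assumptions: the `ResearchObject` class, its field names, the function names and the `urn:example:` identifiers are all hypothetical, and recording `version_of` as a link from a snapshot or archive back to the live object is one reading of the <<versionOf>> arrows.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class ResearchObject:
    uri: str
    state: str = "live"  # "live", "snapshot" or "archived"
    resources: Dict[str, str] = field(default_factory=dict)  # name -> content
    version_of: Optional[str] = None  # URI of the live RO this was copied from


def snapshot(live: ResearchObject, uri: str) -> ResearchObject:
    """Copy the live RO as it is now; later edits to it do not affect the copy."""
    return ResearchObject(uri, "snapshot", dict(live.resources), live.uri)


def archive(live: ResearchObject, uri: str,
            keep: Callable[[str], bool] = lambda name: True) -> ResearchObject:
    """Copy, filter and curate: only resources accepted by `keep` are archived."""
    kept = {name: content for name, content in live.resources.items() if keep(name)}
    return ResearchObject(uri, "archived", kept, live.uri)


def start_from(archived: ResearchObject, uri: str) -> ResearchObject:
    """Create a new live RO whose initial content is copied from an archived one."""
    return ResearchObject(uri, "live", dict(archived.resources))


live = ResearchObject("urn:example:live")
live.resources["workflow"] = "v1"
first = snapshot(live, "urn:example:snapshot-1")
live.resources["workflow"] = "v2"            # the live RO keeps changing
live.resources["lab-notes"] = "private"
public = archive(live, "urn:example:archived", keep=lambda n: n != "lab-notes")
print(first.resources, public.resources)
```

Copying the resource dictionary at each step is what makes snapshots stable reference points while the live object keeps evolving.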