Some of the shared digital artefacts of digital research are executable in the sense that they describe an automated process which generates results. For example, an object might contain raw…
, i.e., a multi-step process to coordinate multiple components and tasks, like a script, that orchestrates the flow of data. Such tasks include running a program, submitting a query to a database, submitting a job to a computational facility, or invoking a service over the Web to use a remote resource.
It is therefore important to record the provenance of workflow outputs, i.e., the sources of information and processes involved in producing a particular list […] to changes in operating systems, data management sustainability and access to computational infrastructure. We note that workflows have many of the properties of software, such as the composition of components with external dependencies, and hence some aspects of software preservation are applicable.
ORE defines standards for the description and exchange of aggregations of Web resources. Using ORE, a workflow-centric research object is defined as a resource that aggregates other resources, i.e., workflow(s), provenance, other objects and annotations. For example, in RDF Turtle one can state that a research object identified by :wro aggregates a workflow template :pathway_wf_sp, a workflow run :pathway_wf_run, and an annotation :wfannot.
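The aggregation described above can be sketched as follows. This is a minimal illustration rather than the authors' own snippet: the function name `aggregation_turtle`, the `http://example.org/ro/` base namespace and the underscore spelling of the resource names are assumptions; only the `ore:aggregates` relation and the `ro:ResearchObject` class are taken from the slides.

```python
# Sketch: emit a Turtle description of a workflow-centric research object.
# The base namespace and resource names are illustrative assumptions.

PREFIXES = {
    "": "http://example.org/ro/",                     # hypothetical base namespace
    "ore": "http://www.openarchives.org/ore/terms/",  # OAI-ORE vocabulary
    "ro": "http://purl.org/wf4ever/ro#",              # Wf4Ever RO vocabulary
}


def aggregation_turtle(research_object, aggregated):
    """Return Turtle stating that `research_object` aggregates each resource."""
    lines = [f"@prefix {prefix}: <{uri}> ." for prefix, uri in PREFIXES.items()]
    lines.append("")
    lines.append(f"{research_object} a ro:ResearchObject ;")
    # One ore:aggregates predicate with a comma-separated object list.
    objects = " ,\n        ".join(aggregated)
    lines.append(f"    ore:aggregates {objects} .")
    return "\n".join(lines)


turtle = aggregation_turtle(
    ":wro", [":pathway_wf_sp", ":pathway_wf_run", ":wfannot"]
)
print(turtle)
```

A real research object would normally be built with an RDF library and published under its own URI; plain string formatting is used here only to keep the sketch dependency-free.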
The elements that compose a Research Object may differ from one to another, and this difference may have consequences on the level of reproducibility that can be guaranteed. At one end of the spectrum, the Research Object is represented by a paper. As we progress to the other end, the Research Object is enriched to include elements such as the workflow implementing the computation, annotations describing the experiment implemented and the hypothesis investigated, and provenance traces of past executions of the workflow.
Assessing the reproducibility of computations described using electronic papers can be tedious: a paper may just sketch the method implemented by the computation in question, without delving into details that are necessary to check that the results obtained, or claimed, in the paper can be reproduced. Verifying the reproducibility of ROs at the other end of the spectrum is less difficult. The provenance trace provides data examples to re-enact the workflow and means to verify that the results of workflow executions are comparable with prior results.
To ensure the preservation of a workflow and the reproducibility of its results, the RO needs to be managed and curated throughout the lifecycle of the associated workflow.
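The comparison with prior results mentioned above can be sketched as follows. This is a simplified illustration under stated assumptions: the "provenance trace" is reduced to a list of recorded inputs with output checksums, `verify_rerun` and `example_workflow` are hypothetical names, and byte-for-byte equality is only the strictest possible notion of "comparable".

```python
import hashlib


def checksum(value: bytes) -> str:
    """Return a SHA-256 hex digest used to compare outputs."""
    return hashlib.sha256(value).hexdigest()


def verify_rerun(workflow, trace):
    """Re-enact `workflow` on recorded inputs and compare with recorded outputs.

    `trace` is a list of (input_bytes, expected_output_checksum) pairs,
    a simplified stand-in for a real provenance trace.
    Returns one boolean per recorded execution.
    """
    return [checksum(workflow(data)) == expected for data, expected in trace]


# Hypothetical workflow: normalise a gene list (upper case, unique, sorted).
def example_workflow(data: bytes) -> bytes:
    genes = sorted(set(data.decode().upper().split()))
    return "\n".join(genes).encode()


# Record a "past execution", then check that a re-run matches it.
recorded_input = b"brca1 tp53 brca1"
trace = [(recorded_input, checksum(example_workflow(recorded_input)))]

results = verify_rerun(example_workflow, trace)
# A workflow whose behaviour has changed no longer matches the trace.
decayed = verify_rerun(lambda data: data, trace)
print(results, decayed)
```

In practice outputs such as ranked lists or numeric results may need a tolerance-based or domain-specific comparison instead of checksums.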
We will now illustrate the research object lifecycle through a small example that shows how all the resources contained in a research object are bundled as the scientific experiment progresses. This example lifecycle is summarized graphically on the slide.
A research object normally starts its life as an empty Live Research Object, with a first design of the experiments to be performed (which determines what workflows and resources will be added, by either retrieving them from an existing platform or creating them from scratch). The research object is then filled incrementally by aggregating the workflows that are being created, reused or repurposed, along with datasets, documents, etc. Any of these components can be changed or removed at any point in time.
In our scenario, we observe several points in time when this Live Research Object is copied and kept as a Research Object snapshot, which aims to reflect the status of the research object at a given point in time. Such a snapshot may be useful to release the current version of the research outcome of an experiment, submit it for peer review or publication (with the appropriate access control mechanisms), share it with supervisors or collaborators, or for acknowledgement and citation purposes.
A snapshot may also contain a paper describing the research object in general and the experiment in particular, depending on the policies of the corresponding scientific communication channel, e.g., workshop, conference or journal. Such snapshots have their own identifiers, and may even be preserved, since it may be useful to track the evolution of the research object over time, so as to allow, for example, retrieval of a previous state of the research object or reporting the evolution of the research to funding agencies.
At some point in time, the research object may get published and archived, in what we know as an Archived Research Object, with a permanent identifier.
Such a version of our research object may be the result of completely copying our Live Research Object, or it may be the result of some filtering or curation process where only some parts of the information available in the aggregation are actually published for others to reuse. As illustrated in Figure 4, a user can use an existing Archived Research Object as a starting point for his or her research, e.g., to repurpose it or its parts, in which case a new Live Research Object is created based on the existing Archived Research Object. This is only one of the many potential scenarios that could be foreseen for the lifecycle of a workflow-centric research object, and we are currently defining different storyboards for their evolution. One important aspect to highlight is that during its whole lifecycle, the research object is aggregating new objects. The annotation process during the lifecycle of experimentation allows the generation of sufficient metadata about the research objects to support preservation and sharing. Therefore, when a scientist decides to preserve a research object, most of the annotations needed for that preservation process will already be available inside it.
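The lifecycle described above (live, snapshot, archived, and back to a new live object) can be sketched as a small data model. This is an illustration only, not the Wf4Ever implementation: the class, its method names, the `urn:example:` identifiers and the `based_on` field are assumptions made for the example.

```python
import copy
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class ResearchObject:
    uri: str
    state: str  # "live", "snapshot" or "archived"
    resources: Dict[str, str] = field(default_factory=dict)
    version_of: Optional[str] = None  # URI of the live RO this was copied from
    based_on: Optional[str] = None    # URI of the archived RO this was started from

    def aggregate(self, name: str, content: str) -> None:
        """Add or change a resource; only live research objects are mutable."""
        if self.state != "live":
            raise ValueError("only a Live Research Object can be modified")
        self.resources[name] = content

    def snapshot(self, uri: str) -> "ResearchObject":
        """Copy the current status into a snapshot with its own identifier."""
        return ResearchObject(uri, "snapshot", copy.deepcopy(self.resources),
                              version_of=self.uri)

    def archive(self, uri: str,
                keep: Callable[[str], bool] = lambda name: True) -> "ResearchObject":
        """Copy, optionally filtering which resources are published."""
        kept = {n: c for n, c in self.resources.items() if keep(n)}
        return ResearchObject(uri, "archived", kept, version_of=self.uri)

    def repurpose(self, uri: str) -> "ResearchObject":
        """Start a new live research object from an archived one."""
        if self.state != "archived":
            raise ValueError("only an Archived Research Object can be repurposed")
        return ResearchObject(uri, "live", copy.deepcopy(self.resources),
                              based_on=self.uri)


live = ResearchObject("urn:example:live-1", "live")
live.aggregate("workflow.t2flow", "v1")
snap = live.snapshot("urn:example:snapshot-1")
live.aggregate("workflow.t2flow", "v2")
live.aggregate("notes.txt", "private notes")
archived = live.archive("urn:example:archived-1", keep=lambda n: n != "notes.txt")
new_live = archived.repurpose("urn:example:live-2")
print(snap.resources, archived.resources, new_live.state)
```

The snapshot keeps the earlier workflow version even after the live object changes, which is what makes it possible to retrieve a previous state later.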
2013-01-17 Research Object
Research Objects: Preserving scientific data and methods
Stian Soiland-Reyes, Khalid Belhajjame
School of Computer Science, Univ of Manchester
myGrid NIHBI meet-up, Manchester, 2013-01-17
Agenda
» Preserving digital science
» The Research Object
  » Anatomy
  » Lifecycle
» Wf4Ever Tools
» Future developments
Computation Processes in Today’s Research
» Research is being conducted in an increasingly digital and online environment
» This has led to the emergence of new digital artifacts
» In some respects, these objects can be regarded as data
» However, some objects include the description of the research method that is captured as a computational process
» Such processes encapsulate the knowledge related to the generation, (re)use and general transformation of data in experimental sciences
[Diagram labels: Raw data, Computational process, Results]
Scientific Workflow
In this work, we focus on a particular kind of computational process called a scientific workflow
» A scientific workflow is a precise, executable description of a scientific procedure: a series of analysis operations connected using data links
» Each operation represents the execution of a computational process
  » Can be supplied by independently developed web services
  » Can also use existing data sources that are accessible on the Web
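The idea of operations connected by data links can be sketched in a few lines. This is a toy illustration under stated assumptions, not how any particular workflow system executes workflows: `run_workflow`, the operation names and the input name `qtl_region` are hypothetical, local functions stand in for web services, and cyclic links are not detected.

```python
def run_workflow(operations, data_links, inputs):
    """Execute operations so that each one runs after the data it depends on.

    operations: {name: function}
    data_links: {operation name: {parameter name: source name}}, where a
                source is either a workflow input or another operation
    inputs:     {input name: value}
    Returns a dict with every input and every operation output.
    """
    values = dict(inputs)

    def evaluate(name):
        # Compute an operation's output once, after evaluating its sources.
        if name not in values:
            links = data_links.get(name, {})
            kwargs = {param: evaluate(src) for param, src in links.items()}
            values[name] = operations[name](**kwargs)
        return values[name]

    for name in operations:
        evaluate(name)
    return values


# Local stand-ins for services that a real workflow might invoke remotely.
operations = {
    "fetch_genes": lambda region: [f"{region}-gene{i}" for i in (1, 2, 3)],
    "filter_genes": lambda genes: [g for g in genes if not g.endswith("2")],
    "count": lambda genes: len(genes),
}
data_links = {
    "fetch_genes": {"region": "qtl_region"},
    "filter_genes": {"genes": "fetch_genes"},
    "count": {"genes": "filter_genes"},
}

result = run_workflow(operations, data_links, {"qtl_region": "chr17"})
print(result["filter_genes"], result["count"])
```

Because every intermediate value is kept in the returned dictionary, the same structure can also serve as a very simple record of what each step produced.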
Preservation Challenges
The challenges stem from the executable nature of workflows and their vulnerability to the volatility of the resources required for their execution
» Changes by 3rd parties
» Workflow may produce different lists at different times
» Workflow may become inoperable
» Workflow decay: the execution of the workflow may fail or yield different results, due to dependencies on resources and services subject to independent changes, e.g., EMBL-EBI. Even workflows that depend on local resources are vulnerable.
[Figure: Repeat vs. Reproduce]
» Replicate / Repeat (within lab): Exactly replicate the original experiment and experimental conditions. Eliminate change. Observe.
» Reproduce (between labs): Run experiment with differences in experimental conditions. Compare to test for same result. Observe.
» Other figure labels: Publication; Materials; Methods; Data; Instruments; Laboratory; Models, Techniques, Algorithms; Provenance; Attribution; Credit; Context; Investigation, Study, Experiment; Capture, Curate, Discover, Use, Reuse, Preserve
Why research objects?
» A research object aggregates all elements deemed necessary to understand research investigations
» Promote reuse, sharing
» Enable the verification of reproducibility of the results
» Trackable, versionable, referenceable
Anatomy of a research object
[Diagram: RO model. Classes: ro:ResearchObject, ro:Manifest, ro:Resource, ro:Folder, ro:FolderEntry, ro:SemanticAnnotation, RDF file. Properties: ore:aggregates, ore:describes, ore:proxyFor, ore:proxyIn, ro:annotatesAggregatedResource, ao:body. Also shows "Subclass of" relations.]
Grounding Workflow-centric Research Objects Using Semantic Technologies
» Workflow-centric research objects are encoded using RDF, according to a set of ontologies that are publicly available
» Research objects extend the Object Reuse and Exchange (ORE) model to represent aggregation
Grounding Workflow-centric Research Objects Using Semantic Technologies
» We use the Annotation Ontology (AO) to annotate research object resources and their relationships
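An annotation of an aggregated resource can be sketched in Turtle, generated here from Python. This is an illustrative guess at the shape of such an annotation, built only from the terms named on the anatomy slide (`ro:SemanticAnnotation`, `ro:annotatesAggregatedResource`, `ao:body`); the function name, the `http://example.org/ro/` namespace and the resource names are assumptions, and the prefix URIs should be checked against the published ontologies.

```python
# Sketch: emit a Turtle description of an annotation on an aggregated resource.
PREFIXES = {
    "": "http://example.org/ro/",         # hypothetical base namespace
    "ro": "http://purl.org/wf4ever/ro#",  # RO vocabulary (verify against the model)
    "ao": "http://purl.org/ao/",          # Annotation Ontology (verify before use)
}


def annotation_turtle(annotation, target, body):
    """Return Turtle linking an annotation to its target and its RDF body."""
    header = [f"@prefix {prefix}: <{uri}> ." for prefix, uri in PREFIXES.items()]
    statement = [
        f"{annotation} a ro:SemanticAnnotation ;",
        f"    ro:annotatesAggregatedResource {target} ;",
        f"    ao:body {body} .",
    ]
    return "\n".join(header + [""] + statement)


annotation = annotation_turtle(":wfannot", ":pathway_wf_sp", ":wfannot_body")
print(annotation)
```

The body is a separate resource (an RDF file in the anatomy diagram), so the statements about the workflow live outside the manifest and can be replaced without touching the aggregation itself.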
Relating resources in research object
[Diagram: resources Workflow_16, Workflow_13, QTL, Common pathways, Results, Logs, Metadata, Paper and Slides, linked by the relations "produces", "Included in", "Feeds into" and "Published in"]
The provenance of the RO elements is key to understanding, comparing and debugging scientific workflows, and to verifying the validity of a claim made within the context of an RO.
Evolution of a research object
[Diagram: lifecycle of a research object]
» Actors: Scientist, Librarian/Curator
» Events: "My supervisor calls me to report my work"; "My supervisor calls me again and we decide to publish our RO+paper"; "Reviews received and final version published"; "A new PhD student continues my work"
» Relations: <<copy>>, <<copy, filter and curate>>, <<versionOf>>
» Live RO
» RO snapshot: identified by a URI; some metadata; some curation; mostly private (for my group, or for my group and for paper reviewers)
» Archived RO: identified by a URI; good metadata and curation; mostly public
PROV standard - Basis for evolution model
» Candidate Recommendation
» http://www.w3.org/TR/prov-primer/
Current Status and Ongoing Work
» Models/spec v0.1 public: http://purl.org/wf4ever/model
  - Upcoming revision v0.2 (Q1 2013):
    • Minor additions to workflow model terms
    • “RO Terms”: upper, user-level view of RO (hypothesis, results); many are “shortcuts” for the structured model
  - TODO: Update annotation model to Open Annotation Data Model (OAC)
  - TODO: PAV for detailed authorship provenance
» Showing, managing and sharing of Research Objects through the myExperiment web site: http://www.myexperiment.org/
Open Annotation Data Model
» Community Draft
» “Almost final” spec: 2013-01-28
» Roll-out meeting in Manchester: March 2013
» http://www.openannotation.org/spec/core/