A dip into
myGrid, University of Manchester
HARMONY 2014, Manchester, 2014-04-24
This work is licensed under a
Creative Commons Attribution 3.0 Unported
Saving a research object:
Single, transferrable research object
Which files in ZIP, which are URIs? (Up to
Regular ZIP file, explored and unpacked with standard
JSON manifest is programmatically accessible without
Works offline and in desktop applications – no REST
API access required
Basis for RO-enabled file formats, e.g. Taverna run
ZIP-based format (Adobe UCF,
Workflow Results Bundle
Aggregating in Research Object
ZIP folder structure (RO Bundle)
What is aggregated? File in ZIP or external URI
Who made the RO? When?
External URIs placed in folders
External annotation, e.g. blogpost
JSON-LD context RDF
Note: JSON "quotes" not shown above for brevity
Defines RDF triples:
<http://dbpedia.org/resource/John_Lennon> <http://xmlns.com/foaf/0.1/name> "John Lennon" .
<http://dbpedia.org/resource/John_Lennon> <http://schema.org/birthDate> "1940-10-09" .
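The two John Lennon triples on this slide can be reproduced with a very small sketch. It assumes a hand-written context that maps the hypothetical keys `name` and `born` to the FOAF and schema.org properties in the triples, and it only handles a flat object with plain string values. It is an illustration of the idea, not a conforming JSON-LD processor.

```python
import json

# Assumption: an inline, hand-written context. The keys "name" and "born"
# are illustrative; only the property URIs come from the slide.
doc = json.loads("""
{
  "@context": {
    "name": "http://xmlns.com/foaf/0.1/name",
    "born": "http://schema.org/birthDate"
  },
  "@id": "http://dbpedia.org/resource/John_Lennon",
  "name": "John Lennon",
  "born": "1940-10-09"
}
""")


def to_triples(document):
    """Map each plain key of a flat JSON-LD-style object to one triple line."""
    context = document["@context"]
    subject = document["@id"]
    lines = []
    for key, value in document.items():
        if key.startswith("@"):
            continue  # skip JSON-LD keywords such as @context and @id
        predicate = context[key]
        lines.append(f'<{subject}> <{predicate}> "{value}" .')
    return lines


triples = to_triples(doc)
for line in triples:
    print(line)
```

The point of the context is that the application only ever touches the short JSON keys; the mapping to full property URIs is declared once, up front.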
RO Bundle manifest as RDF
API for RO bundles
When I first heard about Provenance, I thought it was something French, like Provence. Provenance is classically understood as where something comes from (Origin); as in this example – are the shallots from Holland or France? Was there some kind of Derivation that changed their nationality? Obviously, if we are going to talk about something’s provenance, we have to be clear about what that thing is. The shallots? The sign? The picture? Or this Flickr page? Provenance also covers other aspects, mainly Attribution (who did it?), Dates (when?), and Activities (what happened?). There are Attributes to describe the state of the thing. Perhaps not always considered provenance, but relevant anyway, are Aggregations (one thing is part of another), Licensing (can I use it?) and of course Annotations – what do others say about it?
Let’s take an example of a biomedical lab that sequences genome data. There would be lots of questions relating to attribution – different people play different roles, and may even act on behalf of others. We can call these Agents – things that can perform actions. People are obvious agents, as are Organizations (like The Lab), but Software can also be an active agent.
When we talk about things, or entities, we might want to relate them to each other. An extracted genome can be said to be derived from the sample. The sequence we select from the genome is a kind of quote. The result we get from analysing this is derived from the sequence, and is a revision of the old result – which again has its own chain of influences which might differ.
Activities are what is happening – typically using existing entities and generating new ones, under the control of one or more agents. Taken together, you can describe a whole lineage of activities that generate and consume each other’s entities.
So these three classes are at the core of the W3C PROV model, which we have helped build. An Entity is derived from other entities, and attributed to an Agent. An Activity uses one entity and generates another, and is associated with an agent.
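As a rough sketch of how the three classes relate, the genome lab example can be written down as plain tuples. The property names are the PROV-O terms for the relations just described; the class layout, the `relate` helper and the resource names (`sample`, `genome`, `the-lab`, `sequencing`) are assumptions made up for illustration.

```python
from dataclasses import dataclass

PROV = "http://www.w3.org/ns/prov#"


@dataclass(frozen=True)
class Entity:
    name: str


@dataclass(frozen=True)
class Agent:
    name: str


@dataclass(frozen=True)
class Activity:
    name: str


def relate(subject, prop, obj):
    """Return one (subject, property, object) statement using a PROV-O term."""
    return (subject.name, PROV + prop, obj.name)


# Hypothetical resources from the lab example
sample = Entity("sample")
genome = Entity("genome")
lab = Agent("the-lab")
sequencing = Activity("sequencing")

statements = [
    relate(genome, "wasDerivedFrom", sample),      # Entity derived from Entity
    relate(genome, "wasAttributedTo", lab),        # Entity attributed to Agent
    relate(sequencing, "used", sample),            # Activity used Entity
    relate(genome, "wasGeneratedBy", sequencing),  # Entity generated by Activity
    relate(sequencing, "wasAssociatedWith", lab),  # Activity associated with Agent
]

for statement in statements:
    print(statement)
```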
Most of the user-contributed content in a research object is recorded as annotations
Typing of resources and relating them to each other are individual annotations
The annotation framework basically allows “any annotation”, so we had to write guidelines on which annotation properties we recommend and “natively” understand. We reused existing vocabularies like Dublin Core Terms, PROV and PAV, but also had to make our own more specific vocabularies.
Not everyone is able to set up a RESTful semantic web server. In particular, we’ve run into this with desktop applications – users just want to save files and then decide for themselves where they are stored. So we decided to write a serialization format for Research Objects, which we call the RO Bundle. We wanted this to be accessible to application developers, so we’ve adopted ZIP and JSON, and in a way this lets you create research objects and make annotations without ever seeing any RDF.
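To give a feel for how little is needed, here is a minimal sketch that writes a bundle-like ZIP with nothing but the Python standard library. The layout (a `mimetype` entry stored first, a manifest at `.ro/manifest.json`), the media type string and the manifest key names reflect my reading of the RO Bundle specification and should be treated as assumptions to check against the spec; the file names, creator and date are made up.

```python
import io
import json
import zipfile

# Assumption: media type and manifest path as I understand the RO Bundle spec
MIMETYPE = "application/vnd.wf4ever.robundle+zip"
MANIFEST_PATH = ".ro/manifest.json"

# Hypothetical manifest content; key names are assumptions
manifest = {
    "@context": ["https://w3id.org/bundle/context"],
    "id": "/",
    "createdOn": "2014-04-24T12:00:00Z",
    "createdBy": {"name": "Alice Example"},
    "aggregates": [
        {"uri": "/results/analysis.txt"},     # file inside the ZIP
        {"uri": "http://example.com/blog/"},  # external URI
    ],
}


def write_bundle(manifest, files):
    """Write the mimetype entry, the JSON manifest and the payload files."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as bundle:
        # The mimetype entry goes first and is stored uncompressed
        bundle.writestr("mimetype", MIMETYPE, compress_type=zipfile.ZIP_STORED)
        bundle.writestr(MANIFEST_PATH, json.dumps(manifest, indent=2))
        for path, content in files.items():
            bundle.writestr(path, content)
    return buffer.getvalue()


data = write_bundle(manifest, {"results/analysis.txt": "42\n"})

# Reading it back needs only a ZIP reader and a JSON parser, no RDF tooling
with zipfile.ZipFile(io.BytesIO(data)) as bundle:
    names = bundle.namelist()
    loaded = json.loads(bundle.read(MANIFEST_PATH))

print(names)
print(loaded["createdBy"]["name"])
```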
So let’s have a look at what a Research Object looks like. The core is the concept of the Research Object itself, which you may also know as an ORE aggregation. This is described by the manifest, which is simply an RDF file. The RO aggregates a series of resources – in Linked Data these could be anywhere in the world. Additionally it aggregates a set of annotations, where an annotation is the link between a target resource (here aggregated in the RO) and a body resource. In Wf4Ever we typically provide the body as a separate RDF Graph, so that we can use existing vocabularies to describe and relate the resources.
This is how we represent a workflow run as a Workflow Results RO Bundle. We aggregate the workflow outputs, the workflow definition, the inputs used for execution, a description of the execution environment, external URI references (such as the project homepage) and attribution to the scientists who contributed to the bundle. This effectively forms a Research Object, all tied together by the RO Bundle Manifest, which is in JSON-LD format (normal JSON that is also valid RDF).
This shows how the JSON manifest focuses on the most common aspects of a research object – who made it? When? What is aggregated – files in the ZIP, but also external URIs; it is up to the application or person making the bundle to decide what is included in the ZIP. Annotations are included at the bottom here. We see that there’s an annotation “about” (the target) the analysis JPEG, and its content (the body) is within the annotations/ folder. Similarly, the next annotation relates an external resource (a blog post) to a resource we aggregate. This is processable as JSON-LD – so it is not just JSON, it is also RDF, and out come normal ORE aggregations and OA annotations.
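Reading such a manifest as plain JSON might look like the following sketch. The manifest is a made-up miniature of the one described here; the key names (`aggregates`, `uri`, `annotations`, `about`, `content`) follow my understanding of the RO Bundle manifest and are assumptions, as are the file names and the creator.

```python
import json
from urllib.parse import urlsplit

# Hypothetical manifest, parsed as ordinary JSON
manifest = json.loads("""
{
  "createdOn": "2014-04-24T12:00:00Z",
  "createdBy": {"name": "Alice Example"},
  "aggregates": [
    {"uri": "/analysis.jpeg"},
    {"uri": "http://example.com/project/"}
  ],
  "annotations": [
    {"about": "/analysis.jpeg", "content": "annotations/analysis.rdf"},
    {"about": "/analysis.jpeg", "content": "http://example.com/blog/"}
  ]
}
""")


def is_external(reference):
    """A reference with a URI scheme (http, https, ...) lives outside the ZIP."""
    return bool(urlsplit(reference).scheme)


in_zip = [a["uri"] for a in manifest["aggregates"] if not is_external(a["uri"])]
external = [a["uri"] for a in manifest["aggregates"] if is_external(a["uri"])]

# Each annotation links a target ("about") to a body ("content")
links = [
    (ann["about"], ann["content"], "external" if is_external(ann["content"]) else "in ZIP")
    for ann in manifest["annotations"]
]

print("Made by", manifest["createdBy"]["name"], "on", manifest["createdOn"])
print("In ZIP:", in_zip)
print("External:", external)
for target, body, where in links:
    print(f"{body} ({where}) annotates {target}")
```

Nothing here needs an RDF library; a JSON-LD processor is only required when the same manifest should be consumed as ORE aggregations and OA annotations.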
Here’s another example: light-weight usage of RDFa to turn a normal index.html into a research object. Here the author is given as the creator of the RO, and the Excel files that helped form this analysis are aggregated by the research object. This way of using the Research Object model requires no infrastructure or special packaging – and we have augmented this page to also offer a downloadable RO Bundle, so you can get all the aggregated resources in one go.