2013 06-24 Wf4Ever: Annotating research objects (PPTX)
Open Annotation Rollout, Manchester, 2013-06-25
See also PDF version:

  • Most of the user-contributed content in a research object is recorded as annotations
  • Typing of resources and relating them to each other are individual annotations
  • This shows the quality assessment of a research object – a Minimal Information Model forms a checklist which indicates the preservability of a research object. Here, many of the checklist items indicate the existence of particular annotations, while others check the availability of HTTP resources. Annotations are additionally picked up for the UI (title/description) – this highlights the importance of the annotation model. This was implemented before OA, and the format of the description was not recorded, hence the HTML is rendered as plain text.
  • Quality metrics are monitored over time, so as the research object evolves (say gaining additional annotations), the health of the annotations can be indicated graphically.
  • The annotation framework basically allows “any annotation”, so we had to write guidelines on which annotation properties we are going to recommend and “natively” understand. We reused existing vocabularies like Dublin Core Terms, PROV and PAV, but also had to make our own more specific vocabularies.
  • When I first heard about Provenance, I thought it was something French, like Provence. Provenance is classically understood as where something is coming from (Origin); like in this example – are the shallots from Holland or France? Was there some kind of Derivation that changed their nationality? Obviously, if we are going to talk about something’s provenance, we have to be clear about what that thing is. The shallots? The sign? The picture? Or this Flickr page? Provenance also covers other aspects, mainly Attribution (who did it), Dates (when?), and Activities (what happened). There are Attributes to describe the state of the thing. Perhaps not always considered provenance, but still relevant, are aggregations (one thing is part of another), Licensing (can I use it?) and of course Annotations – what do others say about it?
  • Let’s take an example of a biomedical lab that sequences genome data. There would be lots of questions relating to attribution – different people play different roles, and even act on behalf of others. We can call these Agents – things that can perform stuff. People are obvious agents, as are Organizations (like the lab), but Software can also be an active agent.
  • When we talk about things, or entities, we might want to relate them to each other. An extracted genome can be said to be derived from the sample. The sequence we select from the genome is a kind of quote. The result we get from analysing this is derived from the sequence, and is a revision of the old result – which again has its own chain of influences which might differ.
  • Activities are what is happening – typically using existing entities and generating new ones, somewhat under the control of one or more agents. Taken together, you can describe a whole lineage of activities that generate and consume each other’s entities.
  • So these three classes are at the core of the W3C PROV model, which we have helped build. The Entity is derived from other entities, and attributed to an Agent. An Activity uses one entity and generates another, and is associated with an agent.
  • So for Wf4Ever, what kind of provenance are we talking about? It is a bit more down to earth as we are generally just talking about files and documents, but still, as you form a research object combining many different resources, the different dimensions of provenance become increasingly important to track.
  • So let’s have a look at what a Research Object looks like. The core is the concept of the Research Object itself, which you may also know as an ORE aggregation. This is described by the manifest, which is simply an RDF file. The RO aggregates a series of resources – in Linked Data these could be anywhere in the world. Additionally it aggregates a set of annotations; each annotation, as we know, is the link between a target resource (here aggregated in the RO) and a body resource. In Wf4Ever we typically provide the body as a separate RDF graph, so that we can use existing vocabularies to describe and relate the resources.
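This structure can be sketched with plain Python dicts – a minimal illustration only, with made-up example URIs and a simplified shape that is not the normative Wf4Ever manifest format:

```python
# Sketch of the Research Object structure: an aggregation of resources
# plus annotations linking a target to a separate RDF-graph body.
# All URIs below are hypothetical examples.
research_object = {
    "@type": ["ore:Aggregation", "ro:ResearchObject"],
    # The manifest (an RDF file) describes the aggregation:
    "ore:isDescribedBy": "manifest.rdf",
    # Aggregated resources may live anywhere on the web (Linked Data):
    "ore:aggregates": [
        "http://example.com/ro/workflow1.t2flow",
        "http://example.com/ro/data/input2.txt",
        {   # Annotations are aggregated too; each links target to body
            "@type": "oa:Annotation",
            "oa:hasTarget": "http://example.com/ro/workflow1.t2flow",
            # The body is a separate RDF graph describing the target:
            "oa:hasBody": "http://example.com/ro/.ro/ann/workflow1.ttl",
        },
    ],
}

# The annotation body itself would be a small RDF graph, e.g. in Turtle:
body_turtle = '<workflow1.t2flow> dct:title "Example workflow" .'

# Collect annotation targets from the aggregation:
targets = [a["oa:hasTarget"] for a in research_object["ore:aggregates"]
           if isinstance(a, dict) and a.get("@type") == "oa:Annotation"]
print(targets)  # → ['http://example.com/ro/workflow1.t2flow']
```

The point of the shape is that the annotation is just another aggregated thing, while its body stays a separate resource.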
  • So how is this model exposed in the Wf4Ever architecture? In the back end there is a Blob store, which can store any odd file, and a Graph store, into which we import any RDF file as a named graph. This is exposed as a SPARQL endpoint. The manifest is one of those files, so you can query Research Object aggregations, their annotations and the annotation bodies in a single SPARQL query. Each of the different types is then exposed as a REST resource. Compound resources are simply PUT or GET almost directly into the Blob store. The ORE Proxy is a mechanism for “uploading” a URL; this is how we keep external references. The annotation graphs (the bodies) are also simply uploaded as files, and then related to the resource(s) they describe by making an Annotation within the manifest. So the Manifest is the key RDF document, as it contains both the aggregation and the annotations – but importantly not the bodies of those annotations (which are typically either 1 line or 1000s of lines of RDF), as those are separate files.
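The single-query claim can be sketched as follows. The prefixes are the standard ORE/OA/Dublin Core namespaces, but the graph layout and the dct:title triple inside the body graph are illustrative assumptions, not a query taken from the Wf4Ever codebase:

```python
# Sketch of one SPARQL query joining manifest and annotation bodies:
# because the manifest and every body are named graphs in the same
# store, aggregation, annotation and body content join in one query.
QUERY = """
PREFIX ore: <http://www.openarchives.org/ore/terms/>
PREFIX oa:  <http://www.w3.org/ns/oa#>
PREFIX dct: <http://purl.org/dc/terms/>

SELECT ?ro ?resource ?title WHERE {
  GRAPH ?manifest {
    ?ro ore:aggregates ?resource, ?ann .
    ?ann a oa:Annotation ;
         oa:hasTarget ?resource ;
         oa:hasBody ?body .
  }
  # The annotation body is itself a named graph in the same store:
  GRAPH ?body {
    ?resource dct:title ?title .
  }
}
"""
print(QUERY.count("GRAPH"))
```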
  • So where do our annotations come from? Firstly, we import existing annotations from existing formats – like extracting annotations from within workflow definitions. Secondly, users annotate research objects by interacting with the website, filling in title, description, etc. Thirdly, we have software agents that are automatically invoked; for instance, we have a transformation service that extracts the workflow structure as an RDF graph by parsing the native workflow format. Provenance traces of workflow runs can be inspected to relate files in the RO with generated outputs in the workflow run.
  • So which aspects of the OA model are we using? Well, it comes down to just oa:hasTarget and oa:hasBody. We use the oa:Annotation as the anchor for provenance about who made the relation, and keep separate provenance on the body resource – as the person stating it could differ.
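The split between provenance of the link and provenance of the statement can be sketched like this; the names, identifiers and the exact PAV properties chosen are illustrative assumptions:

```python
# Sketch: the oa:Annotation anchors who attached the link, while the
# body resource carries who stated the content. Names are made up.
annotation = {
    "@id": "urn:uuid:ann-1",
    "@type": "oa:Annotation",
    "oa:hasTarget": "workflow1.t2flow",
    "oa:hasBody": ".ro/annotations/workflow1.ttl",
    # Provenance of the link itself (who made the relation):
    "pav:createdBy": "Bob",
}

# Provenance of the body is kept on the body resource, since the
# person who stated the content can differ from who attached it:
body_provenance = {
    "@id": ".ro/annotations/workflow1.ttl",
    "pav:authoredBy": "Alice",
}

print(annotation["pav:createdBy"], body_provenance["pav:authoredBy"])
```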
  • myExperiment, a social networking site for sharing workflows, already has the aspects of commenting, etc. – it does have an RDF interface, but one using our own vocabulary. As we upgrade myExperiment to be based on research objects, we will move to use the OA model for motivations – the proposed OA motivations largely cover what we already do. We also have an issue in that we have many annotations on compound resources. For instance, within the Taverna workbench it is possible to annotate individual steps of a workflow with descriptions, examples and so on. Within a research object these are referred to by their full URI – but how do you find this URI, as the compound resources of the workflow definition are not aggregated? Currently we have to climb through the generated workflow structure graph to find the processors, which we find through a separate annotation on the workflow definition file. If we use the OA selector mechanism, we should be able to also relate those deep annotations as SpecificResources on the workflow definition.
  • Now I have to be fair: we started on this quite a bit before OA existed. In fact, when we started there seemed to be two competing models. But as we inspected OAC and AO and wondered about how we would use them, we came to the conclusion that, at least for our purposes, they are technically equivalent. So we just went with a political choice: AO was already supported by Utopia, which is made in Manchester, so we went with that. However, our enquiries about this also helped trigger the formation of the OA community group, which we have participated in. We are looking forward to using the OA model as we now revise the RO model – the transition is not hard, as we only used the 2 properties for target and body, which map 1:1. In the Research Object Bundle, a way to save a whole research object as a single ZIP file, we have already adopted OA.
  • Not everyone has access to set up RESTful semantic web servers; in particular we’ve run into this with desktop applications – users just want to save files and then decide where they are stored. So we decided to write a serialization format for Research Objects, which we call the RO Bundle. We wanted this to be accessible to application developers, so we’ve adopted ZIP and JSON; in a way this lets you create research objects and make annotations without ever seeing any RDF.
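A bundle of this kind can be sketched with the standard library alone. The manifest field names and the media-type string below are loose illustrations of what the slides describe, not the normative RO Bundle specification:

```python
import io
import json
import zipfile

# Sketch of writing a minimal RO-Bundle-like ZIP with a JSON manifest.
# Field names and the media type are illustrative assumptions.
manifest = {
    "createdBy": {"name": "Alice"},
    "aggregates": [
        {"uri": "/outputA.txt"},                      # file inside the ZIP
        {"uri": "http://example.com/blog/analysis"},  # external reference
    ],
    "annotations": [
        {"about": "/outputA.txt",
         "content": "/.ro/annotations/outputA.ttl"},
    ],
}

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    # By convention the first entry declares the media type, stored
    # uncompressed (ODF/EPUB-style packaging); value is an assumption:
    z.writestr(zipfile.ZipInfo("mimetype"),
               "application/vnd.wf4ever.robundle+zip",
               compress_type=zipfile.ZIP_STORED)
    z.writestr(".ro/manifest.json", json.dumps(manifest, indent=2))
    z.writestr("outputA.txt", "42\n")

# Any standard ZIP tool (or library) can read it back, no RDF needed:
with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as z:
    m = json.loads(z.read(".ro/manifest.json"))
    print(m["aggregates"][0]["uri"])  # → /outputA.txt
```

This is the accessibility argument in miniature: a developer can produce and consume the bundle with ZIP and JSON tooling alone.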
  • This is how we represent a workflow run as a Workflow Results RO Bundle. We aggregate the workflow outputs, the workflow definition, the inputs used for execution, a description of the execution environment, external URI references (such as the project homepage) and attribution to the scientists who contributed to the bundle. This effectively forms a Research Object, all tied together by the RO Bundle Manifest, which is in JSON-LD format (normal JSON that is also valid RDF).
  • This shows how the JSON manifest focuses on the most common aspects of a research object: who made it? When? What is aggregated – files in the ZIP but also external URIs; it is up to the application or person making the bundle to decide what is included in the ZIP. Annotations are included at the bottom here; we see that there’s an annotation “about” (the target) the analysis JPEG, and the content (the body) is within the annotations/ folder. Similarly, the next annotation relates the external resource (a blog post) with our aggregated resource. This is processable as JSON-LD – so it is not just JSON, it is also RDF, and out come normal ORE aggregations and OA annotations.
  • Here’s another example of light-weight usage of RDFa, to turn a normal index.html into a research object. Here the author is given as a creator of the RO, and the Excel files that helped form this analysis are aggregated by the research object. This way of using the Research Object model requires no infrastructure or special packaging – and we have augmented this page to also have a downloadable RO Bundle, so you can get all the aggregated resources in one go.
  • So we have recently formed a W3C Community Group for Research Objects, which has gathered significant interest: 75 participants. As you see, I am one of the chairs, and so is Rob, whom you already know from the OA group. We are just starting up, and our focus in the RO community group is more on how to practically use Research Objects as a concept than on specifying a new model – we’ll refer to existing models where appropriate, but also explore other models which could be described as research objects.


  • 1. Wf4Ever: Annotating research objects
    Stian Soiland-Reyes, Sean Bechhofer
    myGrid, University of Manchester
    Open Annotation Rollout, Manchester, 2013-06-24
    This work is licensed under a Creative Commons Attribution 3.0 Unported License
  • 2. Motivation: Scientific workflows
    Coordinated execution of services and linked resources
    Dataflow between services: Web services (SOAP, REST), Command line tools, Scripts, User interactions, Components (nested workflows)
    Method becomes: Documented visually, Shareable as a single definition, Reusable with new inputs, Repurposable with other services. Reproducible?
  • 3. But workflows are complex machines
    Outputs, Inputs, Configuration, Components
    Will it still work after a year? 10 years?
    Expanding components, we see a workflow involves a series of specific tools and services which depend on datasets, software libraries and other tools; are often poorly described or understood; and over time evolve, change, break or are replaced
    User interactions are not reproducible, but can be tracked and replayed
  • 4. Electronic Paper Not Enough
    (Diagram: Investigation – Hypothesis, Experiment, Data, Result, Analysis, Conclusions; only the electronic paper is published)
    Research movement: Openly share the data of your experiments
  • 5. Research Object (RO) goal: Openly share everything about your experiments, including how those things are related
  • 6. What is in a research object?
    A Research Object bundles and relates digital resources of a scientific experiment or investigation:
    Data used and results produced in an experimental study
    Methods employed to produce and analyse that data
    Provenance and settings for the experiments
    People involved in the investigation
    Annotations about these resources, that are essential to the understanding and interpretation of the scientific outcomes captured by a research object
  • 7. Gathering everything
    Research Objects (RO) aggregate related resources, their provenance and annotations
    Conveys “everything you need to know” about a study/experiment/analysis/dataset/workflow
    Shareable, evolvable, contributable, citable
    ROs have their own provenance and lifecycles
  • 8. Why Research Objects?
    i. To share your research materials (RO as a social object)
    ii. To facilitate reproducibility and reuse of methods
    iii. To be recognized and cited (even for constituent resources)
    iv. To preserve results and prevent decay (curation of workflow definition; using provenance for partial rerun)
  • 9. A Research object
  • 10. Quality assessment of a research object
  • 11. Quality monitoring
  • 12. Annotations in research objects
    Types: “This document contains an hypothesis”
    Relations: “These datasets are consumed by that tool”
    Provenance: “These results came from this workflow run”
    Descriptions: “Purpose of this step is to filter out invalid data”
    Comments: “This method looks useful, but how do I install it?”
    Examples: “This is how you could use it”
  • 13. Annotation guidelines – which properties?
    Descriptions: dct:title, dct:description, rdfs:comment, dct:publisher, dct:license, dct:subject
    Provenance: dct:created, dct:creator, dct:modified, pav:providedBy, pav:authoredBy, pav:contributedBy, roevo:wasArchivedBy, pav:createdAt
    Provenance relations: prov:wasDerivedFrom, prov:wasRevisionOf, wfprov:usedInput, wfprov:wasOutputFrom
    Social networking: oa:Tag, mediaont:hasRating, roterms:technicalContact, cito:isDocumentedBy, cito:isCitedBy
    Dependencies: dcterms:requires, roterms:requiresHardware, roterms:requiresSoftware, roterms:requiresDataset
    Typing: wfdesc:Workflow, wf4ever:Script, roterms:Hypothesis, roterms:Results, dct:BibliographicResource
  • 14. What is provenance?
    (Photo by Dr Stephen Dann, licensed under Creative Commons Attribution-ShareAlike 2.0 Generic)
    Attribution: who did it? Derivation: how did it change? Activity: what happens to it? Licensing: can I use it? Attributes: what is it? Origin: where is it from? Annotations: what do others say about it? Aggregation: what is it part of? Date and tool: when was it made? Using what?
  • 15. Attribution
    Who collected this sample? Who helped? Which lab performed the sequencing? Who did the data analysis? Who curated the results? Who produced the raw data this analysis is based on? Who wrote the analysis workflow?
    Why do I need this?
    i. To be recognized for my work
    ii. Who should I give credits to?
    iii. Who should I complain to?
    iv. Can I trust them?
    v. Who should I make friends with?
    Roles: prov:wasAttributedTo, prov:actedOnBehalfOf, dct:creator, dct:publisher, pav:authoredBy, pav:contributedBy, pav:curatedBy, pav:createdBy, pav:importedBy, pav:providedBy, ...
    Agent types: Person, Organization, SoftwareAgent
    (Diagram: Data wasAttributedTo Alice, who actedOnBehalfOf The lab)
  • 16. Derivation
    Which sample was this metagenome sequenced from? Which meta-genomes was this sequence extracted from? Which sequence was the basis for the results? What is the previous revision of the new results?
    Why do I need this?
    i. To verify consistency (did I use the correct sequence?)
    ii. To find the latest revision
    iii. To backtrack where a diversion appeared after a change
    iv. To credit work I depend on
    v. Auditing and defence for peer review
    (Diagram: Sample, Meta-genome, Sequence and New results linked by wasDerivedFrom and wasQuotedFrom; New results wasRevisionOf Old results, wasInfluencedBy)
  • 17. Activities
    What happened? When? Who? What was used and generated? Why was this workflow started? Which workflow ran? Where?
    Why do I need this?
    i. To see which analysis was performed
    ii. To find out who did what
    iii. What was the metagenome used for?
    iv. To understand the whole process (“make me a Methods section”)
    v. To track down inconsistencies
    (Diagram: Sequencing used Sample, wasAssociatedWith Alice (hadRole Lab technician) and generated Metagenome; the Workflow run used Metagenome, wasStartedAt "2012-06-21", wasStartedBy, wasInformedBy, wasAssociatedWith Workflow server, hadPlan Workflow definition; Results wasGeneratedBy Workflow run)
  • 18. PROV model
    © 2013 W3C® (MIT, ERCIM, Keio, Beihang), All Rights Reserved. Provenance Working Group
  • 19. Provenance of what?
    Who made the (content of the) research object? Who maintains it?
    Who wrote this document? Who uploaded it?
    Which CSV was this Excel file imported from?
    Who wrote this description? When? How did we get it?
    What is the state of this RO? (Live or Published?)
    What did the research object look like before? (Revisions) – are there newer versions?
    Which research objects are derived from this RO?
  • 20. Research object model at a glance
    (Diagram: a Research Object («ore:Aggregation», «ro:ResearchObject») is described by a Manifest («ore:ResourceMap», «ro:Manifest») and ore:aggregates both Resources («ore:AggregatedResource», «ro:Resource») and Annotations («oa:Annotation», «ro:AggregatedAnnotation»); each Annotation has oa:hasTarget pointing at aggregated Resources and oa:hasBody pointing at an Annotation graph («trig:Graph»))
  • 21. Wf4Ever architecture
    (Diagram: Resources are uploaded to a Blob store; if RDF, they are imported as named graphs into a Graph store exposed via SPARQL. REST resources: Research object, Manifest, Annotation, Annotation graph; an ORE Proxy redirects to external references)
  • 22. Where do RO annotations come from?
    Imported from uploaded resources, e.g. embedded in workflow-specific format (creator: unknown!)
    Created by users filling in Title, Description etc. on website
    By automatically invoked software agents, e.g.:
    A workflow transformation service extracts the workflow structure as RDF from the native workflow format
    Provenance trace from a workflow run, which describes the origin of aggregated output files in the research object
  • 23. How we are using the OA model
    Multiple oa:Annotation contained within the manifest RDF and aggregated by the RO.
    Provenance (PAV, PROV) on oa:Annotation (who made the link) and body resource (who stated it)
    Typically a single oa:hasTarget, either the RO or an aggregated resource.
    oa:hasBody to a trig:Graph resource (read: RDF file) with the “actual” annotation as RDF:
    <workflow1> dct:title "The wonderful workflow" .
    Multiple oa:hasTarget for relationships, e.g. graph body:
    <workflow1> roterms:inputSelected <input2.txt> .
  • 24. What should we also be using?
    Motivations
    myExperiment: commenting, describing, moderating, questioning, replying, tagging – made our own vocabulary as OA did not exist
    Selectors on compound resources
    E.g. description on processors within a workflow in a workflow definition. How do you find this if you only know the workflow definition file?
    Currently: Annotations on separate URIs for each component, described in the workflow structure graph, which is the body of an annotation targeting the workflow definition file
    Importing/referring to annotations from other OA systems (how to discover those?)
  • 25. What is the benefit of OA for us?
    Existing vocabulary – no need for our project to try to specify and agree on our own way of tracking annotations.
    Potential interoperability with third-party annotation tools
    E.g. we want to annotate a figure in a paper and relate it to a dataset in a research object – don’t want to write another tool for that!
    Existing annotations (pre research object) in Taverna and myExperiment map easily to OA model
  • 26. History lesson (AO/OAC/OA)
    When forming the Wf4Ever Research Object model, we found:
    Open Annotation Collaboration (OAC)
    Annotation Ontology (AO)
    What was the difference? Technically, for Wf4Ever’s purposes: they are equivalent
    Political choice: AO – supported by Utopia (Manchester)
    We encouraged the formation of the W3C Open Annotation Community Group and a joint model
    Next: Research Object model v0.2 and RO Bundle will use the OA model – since we only used 2 properties, mapping is 1:1
  • 27. Saving a research object: RO bundle
    Single, transferrable research object
    Self-contained snapshot
    Which files in ZIP, which are URIs? (Up to user/application)
    Regular ZIP file, explored and unpacked with standard tools
    JSON manifest is programmatically accessible without RDF understanding
    Works offline and in desktop applications – no REST API access required
    Basis for RO-enabled file formats, e.g. Taverna run bundle
    Exchanged with myExperiment and RO tools
  • 28. Workflow Results Bundle
    (Diagram: ZIP folder structure (RO Bundle) with mimetype (application/…), workflowrun.prov.ttl (RDF), outputA.txt, outputB/, outputC.jpg, shown alongside the corresponding Research Object)
  • 29. RO Bundle
    What is aggregated? File in ZIP or external URI
    Who made the RO? When? Who?
    External URIs placed in folders
    Embedded annotation
    External annotation, e.g. blog post
    JSON-LD context → RDF
    (RO JSON "quotes" not shown above for brevity)
  • 30. Research Object as RDFa
    <body resource="" typeOf="ore:Aggregation ro:ResearchObject">
    <h3 property="dc:title">Common Motifs in Scientific Workflows:<br>An Empirical Analysis</h3>
    <li><a property="ore:aggregates" href="t2_workflow_set_eSci2012.v.0.9_FGCS.xls" typeOf="ro:Resource">Analytics for Taverna workflows</a></li>
    <li><a property="ore:aggregates" href="WfCatalogue-AdditionalWingsDomains.xlsx" typeOf="ro:Resource">Analytics for Wings workflows</a></li>
    <span property="dc:creator prov:wasAttributedTo" resource=""></span>
  • 31. W3C community group for RO