Presenting PROV-O, PAV, Open Annotation Model and Research Object (RO).

  1. Provenance and annotations. Stian Soiland-Reyes, myGrid, University of Manchester. HeRC CHIPSET meeting, Manchester, 2013-12-16. This work is licensed under a Creative Commons Attribution 3.0 Unported License.
  2. What is provenance? Origin: where is it from? Derivation: how did it change? Activity: what happens to it? Image by Dr Stephen Dann, licensed under Creative Commons Attribution-ShareAlike 2.0 Generic: http://www.flickr.com/photos/stephendann/3375055368/
  3. What is provenance? Attribution: who did it? Date and tool: when was it made, and using what? Aggregation: what is it part of? Attributes: what is it? Annotations: what do others say about it? Licensing: can I use it? Image by Dr Stephen Dann, licensed under Creative Commons Attribution-ShareAlike 2.0 Generic: http://www.flickr.com/photos/stephendann/3375055368/
  4. Attribution. Who collected this sample? Who helped? Which lab performed the sequencing? Who did the data analysis? Who curated the results? Who produced the raw data this analysis is based on? Who wrote the analysis workflow? Diagram labels: Data, Alice, The lab; wasAttributedTo, actedOnBehalfOf; Roles; Agent types (Person, Organization, SoftwareAgent). Why do I need this? i. To be recognized for my work. ii. Who should I give credit to? iii. Who should I complain to? iv. Can I trust them? v. Who should I make friends with? Properties: prov:wasAttributedTo, prov:actedOnBehalfOf, dct:creator, dct:publisher, pav:authoredBy, pav:contributedBy, pav:curatedBy, pav:createdBy, pav:importedBy, pav:providedBy, ... http://practicalprovenance.wordpress.com/
  5. Derivation. Which sample was this metagenome sequenced from? Which metagenomes was this sequence extracted from? Which sequence was the basis for the results? What is the previous revision of the new results? Diagram labels: Sample, Metagenome, Sequence, Old results, New results; wasDerivedFrom, wasQuotedFrom, wasInfluencedBy, wasRevisionOf. Why do I need this? i. To verify consistency (did I use the correct sequence?). ii. To find the latest revision. iii. To backtrack where a diversion appeared after a change. iv. To credit work I depend on. v. Auditing and defence for peer review.
  6. Activities. What happened? When? Who? What was used and generated? Why was this workflow started? Which workflow ran? Where? Diagram labels: Alice, Lab technician, Sample, Sequencing, "2012-06-21", Metagenome, Workflow run, Results, Workflow server, Workflow definition; hadRole, wasAssociatedWith, used, wasGeneratedBy, wasStartedBy, wasStartedAt, wasInformedBy, hadPlan. Why do I need this? i. To see which analysis was performed. ii. To find out who did what. iii. What was the metagenome used for? iv. To understand the whole process: "make me a Methods section". v. To track down inconsistencies.
  7. Provenance Working Group: PROV model. Copyright © 2013 W3C® (MIT, ERCIM, Keio, Beihang), All Rights Reserved. http://www.w3.org/TR/prov-primer/
  8. PROV implementations. Legend: PROV-N, PROV-O, PROV-XML, PROV-JSON. Implementations: AERS-LD; agentSwitch; Amalgame; Annotation Inference Framework; APROVeD: Automatic Provenance Derivation; checker.pl; CollabMap; cProv; csv2rdf4lod-automation; D2R Server; DataFAQs; DBpedia; DeFacto; Dublin Core to PROV mapping; Earth System Science Server; Global Change Information System; Hedgehog; Human Computation ontology; Informed Rural Passenger Information Infrastructure; ISO_19115_Lineage; Music Ontology; OBIAMA; OECD Linked Data; Open Provenance Model for Workflows (OPMW); OpenUp Prov; Oracle Enterprise Transactions Controls Governor; PAV Provenance, Authoring and Versioning; PML 3.0; Policy Reasoning Framework; PoN; P-plan; PROV Python library; prov-api; prov-check; Provenance Environment (ProvEn) Services; Provenance for Earth Science; Provenance server; Provenance Vocabulary; Prov-gen; PROV-N to Neo4J DB mapping; PROVoKing; Prov Toolbox; ProvValidator; provx2o; Pubby; PubFlow Provenance Archive; Quality Assessment Framework; QuerioCity research prototype; Raw2LD; recoprov; roevo; Semantic Proteomics Dashboard (SemPoD); SIGNA; StatJR eBook system; SysPro; Taverna; tavernaprov; Tinga Provenance Service; Triplify; TWC Healthdata; University of Southampton Open Data; WebLabPROV; wfprov; Wings Provenance Export; Yanfeng Shu. Source (2013-04-16): http://www.w3.org/TR/prov-implementations/ http://dx.doi.org/10.6084/m9.figshare.878099
  9. Open Annotation Data Model. Copyright © 2012-2013 the Contributors to the Open Annotation Core Data Model Specification, published by the Open Annotation Community Group under the W3C Community Contributor License Agreement (CLA). http://www.openannotation.org/spec/core/core.html
  10. Example: David’s slides are about ClinicalCodes. Option 1: The FOAF vocabulary. The primaryTopic property relates a document to the main thing that the document is about. Diagram: http://dev.mygrid.org.uk/wiki/download/attachments/16384498/daspringate_clinicalcodes_HeRC.pdf foaf:primaryTopic https://clinicalcodes.rss.mhs.man.ac.uk/
  11. Example: David’s slides are about ClinicalCodes. Option 2: Open Annotation Data Model. Diagram: an annotation with oa:hasBody and oa:hasTarget, relating https://clinicalcodes.rss.mhs.man.ac.uk/ and http://dev.mygrid.org.uk/wiki/download/attachments/16384498/daspringate_clinicalcodes_HeRC.pdf
  12. Annotations have provenance. Diagram labels: annotation, oa:hasTarget, oa:hasBody, oa:annotatedBy; pav:authoredBy, pav:retrievedBy, pav:createdBy; foaf:name "David A. Springate", foaf:name "Stian Soiland-Reyes"; © 2013 David A. Springate. Who is the “creator” of the slides: is it David or Stian? With PAV we can differentiate content authoring from upload. http://purl.org/pav/html
  13. Annotations have provenance. Diagram labels: annotation, oa:hasTarget, oa:hasBody, oa:annotatedBy; pav:authoredBy, pav:retrievedBy, pav:createdBy; foaf:name "David A. Springate", foaf:name "Stian Soiland-Reyes"; © 2013 David A. Springate; http://orcid.org/0000-0001-9842-9718. Which David…? Need a common identifier → ORCID. http://purl.org/pav/html
  14. Annotations as first-class citizens. Diagram: annotation with oa:hasTarget, oa:hasBody and oa:motivatedBy. Motivations: oa:bookmarking, oa:classifying, oa:commenting, oa:describing, oa:editing, oa:highlighting, oa:identifying, oa:linking, oa:moderating, oa:questioning, oa:replying, oa:tagging, … Serializations shown: JSON, Turtle.
  15. Provenance of what? Who made the (content of) this data set? Who maintains it? Who wrote this document? Who uploaded it? Which CSV was this Excel file imported from? Who wrote this description? When? How did we get it? What is the state of these guidelines? Are they official? What did the guidelines look like before (revisions), and are there newer versions? What new resources have been derived from this data set?
  16. Research Object (RO). Goal of research objects: openly share everything about your experiments, including how those things are related. http://www.researchobject.org/
  17. What is in a research object? A Research Object bundles and relates digital resources of a scientific experiment or investigation: data used and results produced in an experimental study; methods employed to produce and analyse that data; provenance and settings for the experiments; people involved in the investigation; annotations about these resources that are essential to the understanding and interpretation of the scientific outcomes captured by a research object. http://www.researchobject.org/
  18. Gathering everything. Research Objects (RO) aggregate related resources, their provenance and annotations. They convey “everything you need to know” about a study/experiment/analysis/dataset/workflow. Shareable, evolvable, contributable, citable. ROs have their own provenance and lifecycles.
  19. Research object model at a glance. Diagram: a Research Object ore:aggregates Resources and Annotations; each Annotation oa:hasTarget a Resource and oa:hasBody an Annotation graph. Also labelled: Manifest.
  20. Why Research Objects? i. To share your research materials (RO as a social object). ii. To facilitate reproducibility and reuse of methods. iii. To be recognized and cited (even for constituent resources). iv. To preserve results and prevent decay (curation of workflow definition; using provenance for partial rerun).
  21. A research object: http://alpha.myexperiment.org/packs/387
  22. Annotations in research objects. Types: “This document contains a hypothesis”. Relations: “These datasets are consumed by that tool”. Provenance: “These results came from this workflow run”. Descriptions: “Purpose of this step is to filter out invalid data”. Comments: “This method looks useful, but how do I install it?” Examples: “This is how you could use it”.
  23. Annotation guidelines: which properties? Descriptions: dct:title, dct:description, rdfs:comment, dct:publisher, dct:license, dct:subject. Provenance: dct:created, dct:creator, dct:modified, pav:providedBy, pav:authoredBy, pav:contributedBy, roevo:wasArchivedBy, pav:createdAt. Provenance relations: prov:wasDerivedFrom, prov:wasRevisionOf, wfprov:usedInput, wfprov:wasOutputFrom. Social networking: oa:Tag, mediaont:hasRating, roterms:technicalContact, cito:isDocumentedBy, cito:isCitedBy. Dependencies: dcterms:requires, roterms:requiresHardware, roterms:requiresSoftware, roterms:requiresDataset. Typing: wfdesc:Workflow, wf4ever:Script, roterms:Hypothesis, roterms:Results, dct:BibliographicResource.
  24. Saving a research object: RO bundle. Single, transferable research object. Self-contained snapshot. Which files go in the ZIP and which are URIs is up to the user/application. Regular ZIP file, explored and unpacked with standard tools. JSON manifest is programmatically accessible without RDF understanding. Works offline and in desktop applications; no REST API access required. Basis for RO-enabled file formats, e.g. Taverna run bundle. Exchanged with myExperiment and RO tools.
  25. Workflow Results Bundle. ZIP folder structure (RO Bundle), entries shown: mimetype (application/vnd.wf4ever.robundle+zip); .ro/manifest.json; inputA.txt; outputA.txt; outputB/ (1.txt, 2.txt, 3.txt); outputC.jpg; intermediates/; de/def2e58b-50e2-4949-9980-fd310166621a.txt; workflowrun.prov.ttl (RDF). Callouts: URI references, execution environment, aggregating in Research Object, attribution, workflow. https://w3id.org/bundl
  26. RO Bundle: .ro/manifest.json. Callouts: JSON-LD context → RDF (http://json-ld.org/); RO provenance: who made the RO, and when? (http://orcid.org/); what is aggregated? (file in ZIP or external URI, format, who?); external URIs placed in folders; embedded annotation; external annotation, e.g. blogpost. Note: JSON "quotes" not shown above for brevity. https://w3id.org/bundle
  27. Research Object as RDFa. http://www.oeg-upm.net/files/dgarijo/motifAnalysisSite/ <body resource="http://www.oeg-upm.net/files/dgarijo/motifAnalysisSite/" typeOf="ore:Aggregation ro:ResearchObject"> <span property="dc:creator prov:wasAttributedTo" resource="http://delicias.dia.fi.upm.es/members/DGarijo/#me"></span> <h3 property="dc:title">Common Motifs in Scientific Workflows: <br>An Empirical Analysis</h3> <li><a property="ore:aggregates" href="t2_workflow_set_eSci2012.v.0.9_FGCS.xls" typeOf="ro:Resource">Analytics for Taverna workflows</a></li> <li><a property="ore:aggregates" href="WfCatalogue-AdditionalWingsDomains.xlsx" typeOf="ro:Resource">Analytics for Wings workflows</a></li> http://mayor2.dia.fi.upm.es/oeg-upm/files/dgarijo/motifAnalysisSite/
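
The RO bundle slides (24 to 26) describe a regular ZIP file with a `mimetype` entry and a JSON manifest at `.ro/manifest.json` that can be read without any RDF tooling. The following is a minimal Python sketch of that idea, using only the standard library. The manifest key names (`createdOn`, `createdBy`, `aggregates`, `file`, `uri`, `mediatype`), the placeholder ORCID, the example.org URL and the helper function names are illustrative assumptions, not taken from the slides; the specification at https://w3id.org/bundle is the place to check the real structure. Writing `mimetype` first and uncompressed is likewise an assumption, borrowed from how other ZIP-based formats handle it.

```python
import io
import json
import zipfile

# Media type shown on the "Workflow Results Bundle" slide.
MIMETYPE = "application/vnd.wf4ever.robundle+zip"

# Hypothetical manifest: key names and values are illustrative assumptions.
EXAMPLE_MANIFEST = {
    "createdOn": "2013-12-16T12:00:00Z",
    "createdBy": {
        "name": "Alice",
        "orcid": "http://orcid.org/0000-0000-0000-0000",  # placeholder
    },
    "aggregates": [
        {"file": "/outputA.txt", "mediatype": "text/plain"},  # stored in the ZIP
        {"uri": "http://example.org/blog/post"},              # external resource
    ],
}


def write_bundle(manifest, files):
    """Return the bytes of a ZIP laid out like the bundle on the slides."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as bundle:
        # A bare ZipInfo defaults to ZIP_STORED, so this entry is uncompressed.
        bundle.writestr(zipfile.ZipInfo("mimetype"), MIMETYPE)
        bundle.writestr(".ro/manifest.json", json.dumps(manifest, indent=2))
        for path, content in files.items():
            bundle.writestr(path, content)
    return buffer.getvalue()


def read_manifest(data):
    """Check the mimetype entry, then parse the manifest as plain JSON."""
    with zipfile.ZipFile(io.BytesIO(data)) as bundle:
        if bundle.read("mimetype").decode("ascii") != MIMETYPE:
            raise ValueError("not an RO bundle")
        return json.loads(bundle.read(".ro/manifest.json"))


def split_aggregates(manifest):
    """Separate resources stored in the ZIP from external URIs."""
    local, external = [], []
    for entry in manifest.get("aggregates", []):
        if "file" in entry:
            local.append(entry["file"])
        elif "uri" in entry:
            external.append(entry["uri"])
    return local, external


if __name__ == "__main__":
    data = write_bundle(EXAMPLE_MANIFEST, {"outputA.txt": "example output"})
    manifest = read_manifest(data)
    print(manifest["createdBy"]["name"], split_aggregates(manifest))
```

Because the manifest is ordinary JSON, a desktop application could answer "who made the RO?" and "what is aggregated?" offline with nothing beyond a ZIP reader and a JSON parser, which is the design point the slides make.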