Welcome
2018
Sponsored by
Centre of Excellence for Computational
Biomolecular Research; H2020 grant 675728.
Research Object
Community Update
Carole Goble, Stian Soiland-Reyes, Sean Bechhofer
The University of Manchester, UK
carole.goble@manchester.ac.uk
RO2018, 29 October 2018, Amsterdam, 2018.
Satellite workshop of IEEE 14th International Conference on e-Science 2018
2010 – Research Objects not PDFs
Research has many
components, of
many types
Exchange of all the components of
an investigation
Computational instruments break or
need to be maintained
Science, and its products, evolve
Overcome
fragmentation
Bundle and relate
components
Support preservation,
reproducibility, reuse
Release
evolving
research
Accelerate
exchange
Research Object Framework
Bechhofer et al (2013) https://doi.org/10.1016/j.future.2011.08.004
Bechhofer et al (2010) https://eprints.soton.ac.uk/268555/
carry machine processable
metadata common and specific
to different object types
bundle together and relate digital resources
with their context
snapshot, cite,
exchange
Standards-based generic metadata framework
Data used and results produced
Methods used to produce /analyse that data
Provenance and settings, People involved,
Annotations understanding & interpretation
6
Howard Ratner,
Chair STM Future Labs Committee, CEO EVP Nature Publishing Group
Director of Development for CHORUS (Clearinghouse for the Open Research of US) STM
Innovations Seminar 2012
http://www.youtube.com/watch?v=p-W4iLjLTrQ&list=PLC44A300051D052E5
Container
“Unbounded” Objects
Bags of things and external references to things
A Digital Package Object
Type composed of many
interrelated elements that
bundles together and
relates digital resources of
a scientific investigation
with context.
A Metadata Object
that represents
properties in common
across all research
artefacts types,
common PIDs and
metadata
Output FilesInput Files Intermediates
Parameters
Configurations
Workflow
Run
Provenance
Narrative
ExecutionWorkflow
Engine
Tools / Codes
Resources
Author Workflow
Container
Metadata
Workflow RO
Workflow RO
Describe and run workflows, and
the command line tools they
orchestrate, supporting
containers to be portable,
transparent and interoperable .
Describe the workflow inputs,
outputs, tools and data with
controlled vocabularies /
ontologies
EDAM
Describe the provenance of
the workflow
Software components are
containerised to be portable
Workflow systems run the
CWL workflow
Gather the CWL
workflow
descriptions + rich
context, provenance
using multi-tiered
descriptions
Snapshot workflow.
Relate it to other
objects.
Archive formats
to contain the
objectContainer
Metadata
https://www.commonwl.org/
https://view.commonwl.org/workflows/github.com/mnneveau/cancer-genomics-
workflow/blob/master/detect_variants/detect_variants.cwl
Manifest
CWL
Annotations
Under the hood
https://osf.io/h59uh/
https://doi.org/10.1101/191783
Inspect and replicate the
computational analytical
workflow to review and
approve the
bioinformatics
Standardize exchange
of HTS workflows for
regulatory submissions
between FDA, pharma,
bioinformatics platform
providers and
researchers
Technology
Independent.
The least possible.
The simplest feasible. Low tech.
Low user overhead and thin client
Graceful
degradation.
Desiderata
IDENTIFIER
Container
Manifest Profile
Descriptions
what
else
is needed
Dependencies
Versioning
its evolution
what
should
be there
Checklists
Provenance
where it
came from
ids
Tailored metadata profiles to describe a RO
All
Type Specific
Implementation
specific
Validate
Container
Manifest Profile
Descriptions
Tailored metadata profiles to describe a RO
general purpose to drive scalable infrastructure
towards generic approaches …
Container
Profile
Tailored metadata profiles to describe a RO
general purpose to drive scalable infrastructure
Manifest
Construction
Manifest
Profile
Description
https://w3id.org/ro/2016-01-28
Manifest Construction
Includes wfdesc and wfprov
Basis of CWL and CWLProv
Time for a review….?
e.g. RO type to help tools..
Container Profiles
Specification for a structured ZIP-file, based on the ePub and Adobe
UCF specifications
Research Object Bundle 1.0
https://researchobject.github.io/specifications/bundle/
Specifies a file system structure for transferring and
archiving a collection of files, including their checksums to
verify and validate content and brief metadata.
https://github.com/ResearchObject/bagit-ro
mechanism for serialization
and transport consistency,
capture identity, annotations and
provenance of the resources
Big Data
collections
of arbitrary
referenced
content
https://github.com/fair-research/bdbag
Manifest Profile Description
general construction & validation tooling
Linked Data and RDF Shapes
Validate graph-based data against a set
of conditions
Shapes Constraint Language
Gamble,Zhao, Klyne,Goble. IEEE eScience 2012,
http://dx.doi.org/10.1109/eScience.2012.6404489
Minim model for defining checklists
ro-show
• RO pre-processing to merge to
single graph
• RDF Shape that indicates to
follow links
• Bespoke validators / unpackers
to iterate over the RO
[Lilian Gorea,Oluwatomide Fasugba 2018]
ResearchObject drivers
Goble, De Roure, Bechhofer, Accelerating KnowledgeTurns, DOI: 10.1007/978-3-642-37186-8_1
Exchange & Commons
Preservation and fixed point publishing
Reproducibility and execution
Active “release” research
Workflows Models Data
NIH Data Commons European Open Science Cloud
CWL Workflow Collaboratory
Seeding critical mass
Community
Tools Driver
ResearchObject Gaps
RO Profiles
– Templates & Standard types
General tooling
– Construction,Validation,Viewing
Handling RO composition
– Nesting, Complex & mixed types
– Less flexibility -> easier tools
RO Life cycles stewardship
– Fixed snapshot
– Living objects
– Rot, mutations, cloning
– References
RO Circulation
– Credit, tracking
Community fragmentation
Digital object
initiatives (e.g. GEDE)
Specification
Build a Community
Acknowledgements
Barend Mons
Sean Bechhofer
Matthew Gamble
Raul Palma
Jun Zhao
Mark Robinson
AlanWilliams
Norman Morrison
Stian Soiland-Reyes
Tim Clark
Alejandra Gonzalez-Beltran
Philippe Rocca-Serra
Ian Cottam
Susanna Sansone
KristianGarza
Daniel Garijo
Catarina Martins
Iain Buchan
Michael Crusoe
Rob Finn
Carl Kesselman
Ian Foster
Kyle Chard
Vahan Simonyan
Ravi Madduri
Raja Mazumder
GilAlterovitz,
Denis Dean II
Durga Addepalli
Wouter Haak
Anita De Waard
Paul Groth
Oscar Corcho
Josh Sommer
Project ID: 675728

Research Object Community Update

  • 1.
    Welcome 2018 Sponsored by Centre ofExcellence for Computational Biomolecular Research; H2020 grant 675728.
  • 2.
    Research Object Community Update CaroleGoble, Stian Soiland-Reyes, Sean Bechhofer The University of Manchester, UK carole.goble@manchester.ac.uk RO2018, 29 October 2018, Amsterdam, 2018. Satellite workshop of IEEE 14th International Conference on e-Science 2018
  • 3.
    2010 – ResearchObjects not PDFs Research has many components, of many types Exchange of all the components of an investigation Computational instruments break or need to be maintained Science, and its products, evolve
  • 4.
    Overcome fragmentation Bundle and relate components Supportpreservation, reproducibility, reuse Release evolving research Accelerate exchange
  • 5.
    Research Object Framework Bechhoferet al (2013) https://doi.org/10.1016/j.future.2011.08.004 Bechhofer et al (2010) https://eprints.soton.ac.uk/268555/ carry machine processable metadata common and specific to different object types bundle together and relate digital resources with their context snapshot, cite, exchange Standards-based generic metadata framework Data used and results produced Methods used to produce /analyse that data Provenance and settings, People involved, Annotations understanding & interpretation
  • 6.
    6 Howard Ratner, Chair STMFuture Labs Committee, CEO EVP Nature Publishing Group Director of Development for CHORUS (Clearinghouse for the Open Research of US) STM Innovations Seminar 2012 http://www.youtube.com/watch?v=p-W4iLjLTrQ&list=PLC44A300051D052E5
  • 7.
    Container “Unbounded” Objects Bags ofthings and external references to things A Digital Package Object Type composed of many interrelated elements that bundles together and relates digital resources of a scientific investigation with context. A Metadata Object that represents properties in common across all research artefacts types, common PIDs and metadata
  • 8.
    Output FilesInput FilesIntermediates Parameters Configurations Workflow Run Provenance Narrative ExecutionWorkflow Engine Tools / Codes Resources Author Workflow Container Metadata Workflow RO
  • 9.
    Workflow RO Describe andrun workflows, and the command line tools they orchestrate, supporting containers to be portable, transparent and interoperable . Describe the workflow inputs, outputs, tools and data with controlled vocabularies / ontologies EDAM Describe the provenance of the workflow Software components are containerised to be portable Workflow systems run the CWL workflow Gather the CWL workflow descriptions + rich context, provenance using multi-tiered descriptions Snapshot workflow. Relate it to other objects. Archive formats to contain the objectContainer Metadata https://www.commonwl.org/
  • 10.
  • 11.
    https://osf.io/h59uh/ https://doi.org/10.1101/191783 Inspect and replicatethe computational analytical workflow to review and approve the bioinformatics Standardize exchange of HTS workflows for regulatory submissions between FDA, pharma, bioinformatics platform providers and researchers
  • 12.
    Technology Independent. The least possible. Thesimplest feasible. Low tech. Low user overhead and thin client Graceful degradation. Desiderata IDENTIFIER
  • 13.
    Container Manifest Profile Descriptions what else is needed Dependencies Versioning itsevolution what should be there Checklists Provenance where it came from ids Tailored metadata profiles to describe a RO All Type Specific Implementation specific
  • 14.
    Validate Container Manifest Profile Descriptions Tailored metadataprofiles to describe a RO general purpose to drive scalable infrastructure towards generic approaches …
  • 15.
    Container Profile Tailored metadata profilesto describe a RO general purpose to drive scalable infrastructure Manifest Construction Manifest Profile Description
  • 16.
    https://w3id.org/ro/2016-01-28 Manifest Construction Includes wfdescand wfprov Basis of CWL and CWLProv Time for a review….? e.g. RO type to help tools..
  • 17.
    Container Profiles Specification fora structured ZIP-file, based on the ePub and Adobe UCF specifications Research Object Bundle 1.0 https://researchobject.github.io/specifications/bundle/ Specifies a file system structure for transferring and archiving a collection of files, including their checksums to verify and validate content and brief metadata. https://github.com/ResearchObject/bagit-ro mechanism for serialization and transport consistency, capture identity, annotations and provenance of the resources Big Data collections of arbitrary referenced content https://github.com/fair-research/bdbag
  • 18.
    Manifest Profile Description generalconstruction & validation tooling Linked Data and RDF Shapes Validate graph-based data against a set of conditions Shapes Constraint Language Gamble,Zhao, Klyne,Goble. IEEE eScience 2012, http://dx.doi.org/10.1109/eScience.2012.6404489 Minim model for defining checklists ro-show • RO pre-processing to merge to single graph • RDF Shape that indicates to follow links • Bespoke validators / unpackers to iterate over the RO [Lilian Gorea,Oluwatomide Fasugba 2018]
  • 19.
    ResearchObject drivers Goble, DeRoure, Bechhofer, Accelerating KnowledgeTurns, DOI: 10.1007/978-3-642-37186-8_1 Exchange & Commons Preservation and fixed point publishing Reproducibility and execution Active “release” research
  • 20.
  • 21.
    NIH Data CommonsEuropean Open Science Cloud CWL Workflow Collaboratory
  • 22.
    Seeding critical mass Community ToolsDriver ResearchObject Gaps RO Profiles – Templates & Standard types General tooling – Construction,Validation,Viewing Handling RO composition – Nesting, Complex & mixed types – Less flexibility -> easier tools RO Life cycles stewardship – Fixed snapshot – Living objects – Rot, mutations, cloning – References RO Circulation – Credit, tracking Community fragmentation Digital object initiatives (e.g. GEDE) Specification
  • 23.
  • 24.
    Acknowledgements Barend Mons Sean Bechhofer MatthewGamble Raul Palma Jun Zhao Mark Robinson AlanWilliams Norman Morrison Stian Soiland-Reyes Tim Clark Alejandra Gonzalez-Beltran Philippe Rocca-Serra Ian Cottam Susanna Sansone KristianGarza Daniel Garijo Catarina Martins Iain Buchan Michael Crusoe Rob Finn Carl Kesselman Ian Foster Kyle Chard Vahan Simonyan Ravi Madduri Raja Mazumder GilAlterovitz, Denis Dean II Durga Addepalli Wouter Haak Anita De Waard Paul Groth Oscar Corcho Josh Sommer Project ID: 675728

Editor's Notes

  • #6 Nested content Heterogeneous elements. Distributed and embedded content. Externally stewarded content. Checklists + Checksums Standards-based generic metadata framework A unit for exchanges and turning…. Validation, Verification Provenance Dependencies Versions Checklists Variance Portability Transparent Processes Bechhofer et al (2013) Why linked data is not enough for scientists https://doi.org/10.1016/j.future.2011.08.004 Bechhofer et al (2010) Research Objects: Towards Exchange and Reuse of Digital Knowledge, https://eprints.soton.ac.uk/268555/ Fixed and Living
  • #8 This is what we are turning More than Data, More than one store Lets make this more concrete before we go into the HOW Standards-based metadata framework for logically and physically bundling resources with context, http://researchobject.org Bigger on the inside than the outside - external referencing Research Object Framework Standards-based metadata framework for logically and physically bundling resources with context
  • #9 In Seven Bridges Language The abstract one is the task (blue) The results as well (green) is the analysis Studies are lots of analysis
  • #10 Community led standard way of expressing and running workflows and the command line tools they orchestrate, supporting containers for portability.
  • #11 JSON-LD
  • #13 Predated the FAIR Principles Element enumeration Identification & citation Description tracking attributes (metadata) and origins (provenance) of contents. Simplicity - low user overhead and thin (no) client
  • #14 Research Objects are designed to: be tailored to be domain or type specific; work at many levels of granularity, with their own identifiers, citation metadata; and be snapshots or living to suit their place in the research lifecycle. Data used and results produced … Methods employed to produce and analyse that data ​… Provenance and settings … People involved ​… Annotations understanding & interpretation …
  • #16 Describe how to describe It’s a scaffold In the presentation we outline features of the latest specifications (https://w3id.org/ro/2016-01-28) Under the hood building blocks
  • #18 BagIt is an Internet Draft that specifies a file system structure for transferring and archiving a collection of files, including their checksums and brief metadata. Research Object bundles is a specification for a structured ZIP-file, based on the ePub and Adobe UCF specifications. The bundle serializes a Research Object, embedding some or all of its resources within the ZIP file, and list the RO content in a manifest, in addition to embedding and referencing annotations and provenance. A BagIt bag can be considered a mechanism for serialization and transport consistency, while a Research Object can be considered a way to capture identity, annotations and provenance of the resources. As such, the two formats complement each-other. They are however not directly compatible. This GitHub repository explains by example a profile for a BagIt bag to also be a Research Object.
  • #19 https://www.slideshare.net/jelabra/shex-vs-shacl ShEx is schema based SHACL is constraint based encoding manifest content profiles using Linked Data and RDF Shapes in order to support general validation tools (https://github.com/researchobject/ro-show) validating graph-based data against a set of conditions. Among others, SHACL includes features to express conditions that constrain the number of values that a property may have, the type of such values, numeric ranges, string matching patterns, and logical combinations of such constraints. SHACL also includes an extension mechanism to express more complex conditions in languages such as SPARQL. A SHACL validation engine takes as input a data graph and a graph containing shapes declarations and produces a validation report that can be consumed by tools. All these graphs can be represented in any Resource Description Framework (RDF) serialization formats including JSON-LD or Turtle. The adoption of SHACL may influence the future of linked data.[2]
  • #20 Exchange & Commons: The transfer of knowledge, data and results between the services and actors and the development of RO Commons to enable reuse and sharing. References to remote content allows for access management due to privacy restrictions or data scales. Commons & Catalogues Publishing, Exchange between people and platforms Sharing, Training Preservation: Snapshots of state of a collection of resources for fixed-point publishing. Conservation Repair Archive Reproducibility: Describing the structured collection, its components and its context in a rich enough way to support inspection and interpretation by people, and re-execution and comparison by computational machinery. Execution Replication Reproducibility Preservation Portability Active “release” oriented research: Accumulating metadata to reflect versions, new configurations of content, evolutions, relationships between objects, and metadata reflecting who has added to the object or used it. Active Research Release Evolution Snapshots Remixing, Comparison, Review Automated processing
  • #22 RO Composer
  • #23 Nested content Heterogeneous elements. Distributed and embedded content. Externally stewarded content. Checklists + Checksums Community forking & fragmentation Arose from workflow sharing and preservation Atomicity Granularity Aggregation Composition, Fragmentation Versioning Forking Cloning Portability Dependency management