RO-Crate: A framework for packaging research products into FAIR Research Objects
Mar. 3, 2021
RO-Crate: A framework for packaging research products into FAIR Research Objects presented to Research Data Alliance RDA Data Fabric/GEDE FAIR Digital Object meeting. 2021-02-25
RO-Crate: A framework for packaging research
products into FAIR Research Objects
Carole Goble
The University of Manchester
ELIXIR-UK
https://orcid.org/0000-0003-1219-2137
@caroleannegoble
This work is licensed under a
Creative Commons Attribution 4.0 International License
Stian Soiland-Reyes
The University of Manchester
BioExcel Centre of Excellence
The University of Amsterdam
https://orcid.org/0000-0001-9842-9718
@soilandreyes
RDA IG Data Fabric, FAIR Digital Object, 2021-02-25
tl;dr
Web standards-based metadata framework for bundling resources with their
context into citable reproducible packages
Machine actionable Metadata + Identifiers + Web protocols => FAIR
What and Why
Examples
How and Tools
Alignment with FDOF
Bechhofer et al (2013) Why linked data is not enough for scientists, https://doi.org/10.1016/j.future.2011.08.004
Bechhofer et al (2010) Research Objects: Towards Exchange and Reuse of Digital Knowledge, https://eprints.soton.ac.uk/268555/
Many Objects are the Outcomes of Research
Each object has its own metadata and repositories
All are first class citizens and are required to make research FAIR+R
integrated view
over fragmented
resources using
PIDs and metadata
Encapsulated content and references to external resources
The RO package has its own metadata, can be registered and deposited in its own
right, unpackaged and accessed, activated and reproduced if appropriate
self-describing, chiefly metadata, objects
RO Metadata file Structured metadata about the RO and content
files
links to web
resources
RO Content
Archive file format / packaging system
BagIt, zip
OCFL, Git
type
id
description
datePublished
…
directories
license
author organisation
self-describing, chiefly metadata, objects
RO Metadata file Structured metadata about the RO and content
image file
links to web
resources
RO Content
Archive file format / packaging system
directory of data
type, id
description
datePublished
creator
size
format …
https://zenodo.org/record/3541888
https://github.com/o/script
type
id
description
datePublished
…
license
author organisation
Linked Data
approach
Self-describing, chiefly metadata, objects
Strict Structure, Open ended content
How do we describe the metadata?
• PIDs + JSON-LD + Schema.org descriptors
• Opinionated profile of schema.org
• Linked Data by Stealth: JSON with gradual path to
extensibility with LD – e.g. ad-hoc terms
• Example-driven documentation
How can I add additional metadata?
• Schema.org, domain ontologies
How do I define a checklist of what is
expected to be in a type of RO?
• RO-Crate Profiles
http://www.researchobject.org/ro-crate/
Standard Web
Mark-up
Summary: RO-Crate in a nutshell
Practical lightweight approach to packaging research data
entities (any object) with metadata
Aggregate files and/or any URI-addressable content, with
contextual information to aid decisions about re-use: Who
What When Where Why How.
Web Native Machine readable. Human readable. Search
engine friendly. Familiar.
Extensible and Incremental: add additional metadata; nested
and typed by their profile.
Open Community effort
http://www.researchobject.org/ro-crate/
Cultural Heritage: A data curation service for
endangered languages: 500,000 files in
28,624 items and 574 collections
long term preservation and
accessibility of research data objects
https://arkisto-platform.github.io/
[Marco La Rosa, Peter Sefton]
Scalable verified
collections of references
Processing big genomic & clinical data
distributed over multiple locations
NIH Data Commons
[Chard, et al 2016]
https://doi.org/10.1109/BigData.2016.7840618
minids
Retain and archive processed datasets
Reference and transfer large data on demand
Controlled access to sensitive data
[Kesselman, Foster]
13 EU Life Science Research Infrastructures
Data and Method Thematic Commons
Sharing data, tools and workflows in
the cloud
Workflow + data + provenance interchange, stewardship,
recording dependencies -> portability &
reuse/reproducibility
https://www.eosc-life.eu/
Embedding into Infrastructures & Standards
EGI-ACE data spaces for
Earth Science researchers
FAIR and GDPR compliant data
storage and sharing fabric Science
Mesh using Cloud Services for
Synchronization and Sharing
exchange between
genomics platform and
repository.
standardise and share
analyses generated
from genome
sequencing.
IEEE 2791-2020
Supporting the Research Life Cycle
Exchange &
Import/Export
Report & Archive
Share & Access
Reuse & Reproduce
Living Objects
Describe the objects
as they are being
created using open
standards and tools
Release the objects
as they are created,
updated and used
Figure: RDMKit, https://rdmkit.elixir-europe.org/
Science 2.0 Repositories: Time for a Change in Scholarly Communication. Assante, Candela, Castelli, Manghi, Pagano, D-Lib 2015
Machine-processable
Standards
Low tech
Graceful degradation
Commodity tooling
Incremental
Multi-platform
Technology Independent
Keep it
simple
E X A M P L E S
Developer
friendliness
Just enough complexity / standards
• sufficient extra benefits from what already
exists…without compromising developer entry-level
experience so they do their own thing
Just Enough Linked Data Just In Time
• simplifications instead of generalizations
Retain Linked Data benefits
• querying, graph stores, vocabularies, clickable URIs
Plus the developer needs
• documentation, examples, libraries, tools,
community …
Limited flexibility frees up developers
Familiarity is important for uptake
From FAIR data to FAIR Digital Objects
FAIR Digital Objects for Science: From Data Pieces to Actionable Knowledge Units:
https://doi.org/10.3390/publications8020021
https://op.europa.eu/en/publication-detail/-/publication/d787ea54-6a87-11eb-aeb5-01aa75ed71a1/language-en/format-PDF/source-190308283
FAIR Digital Objects
Actionable knowledge unit
Digital butterfly – digital twins
Bags of references
courtesy Dimitris Koureas
Coordinator DiSSCo EU Research
Infrastructure
Specimen object image
courtesy of Alex Hardisty
[Hardisty et al, 2020]
Specimen Data Refinery
Workflows to Digitise Natural History Specimens
FAIR Digital Objects -> Packaged + Actionable
FAIR Digital Object
Framework
Open Digital Specimen
Workflow Infrastructure
RO-Crate
+ Same conceptual workflow;
multiple executable flavours
+
Data and Contextual entities
described within RO-Crate Metadata File
Base vocabulary & types: schema.org
Cross-references to further contextual entities
RO-Crate principle:
Reuse existing PIDs and URLs
..but always describe entities which lack a
human-readable resolution
Metadata
{
  "@id": "https://orcid.org/0000-0002-8367-6908",
  "@type": "Person",
  "affiliation": { "@id": "https://ror.org/03f0f6041" },
  "name": "J. Xuan"
}
{
  "@id": "https://ror.org/03f0f6041",
  "@type": "Organization",
  "name": "University of Technology Sydney",
  "url": "https://www.uts.edu.au/"
}
{
  "@id": "figure.png",
  "@type": ["File", "ImageObject"],
  "name": "XXL-CT-scan of an XXL Tyrannosaurus rex skull",
  "identifier": "https://doi.org/10.5281/zenodo.3479743",
  "author": { "@id": "https://orcid.org/0000-0002-8367-6908" },
  "encodingFormat": "image/png"
}
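Because the entities above refer to each other only by "@id", a consumer can follow the cross-references with plain JSON tooling and no RDF library. The following is a minimal Python sketch, assuming the three entities sit in a list as they would under "@graph" in an RO-Crate Metadata File. The helper names index_by_id and resolve are hypothetical and not part of any RO-Crate library.

```python
# Entities copied from the slide; the list stands in for "@graph".
graph = [
    {
        "@id": "https://orcid.org/0000-0002-8367-6908",
        "@type": "Person",
        "affiliation": {"@id": "https://ror.org/03f0f6041"},
        "name": "J. Xuan",
    },
    {
        "@id": "https://ror.org/03f0f6041",
        "@type": "Organization",
        "name": "University of Technology Sydney",
        "url": "https://www.uts.edu.au/",
    },
    {
        "@id": "figure.png",
        "@type": ["File", "ImageObject"],
        "name": "XXL-CT-scan of an XXL Tyrannosaurus rex skull",
        "identifier": "https://doi.org/10.5281/zenodo.3479743",
        "author": {"@id": "https://orcid.org/0000-0002-8367-6908"},
        "encodingFormat": "image/png",
    },
]


def index_by_id(entities):
    """Map each entity's @id to the entity itself (hypothetical helper)."""
    return {entity["@id"]: entity for entity in entities}


def resolve(index, value):
    """Follow a {"@id": ...} reference; return any other value unchanged."""
    if isinstance(value, dict) and set(value) == {"@id"}:
        # Unknown identifiers are left as bare references.
        return index.get(value["@id"], value)
    return value


index = index_by_id(graph)
figure = index["figure.png"]
author = resolve(index, figure["author"])
organisation = resolve(index, author["affiliation"])
print(author["name"], "/", organisation["name"])
```

Keeping the graph flat means every entity is reachable with a single dictionary lookup, which is the kind of low-tech access the "Linked Data by Stealth" approach aims for.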
No need to put all the metadata in RO-Crate!
https://biocompute-objects.github.io/bco-ro-crate/
The specification provides details on how to extend RO-Crate formally or informally
However, in many cases a domain-specific format for metadata may already exist
Co-exist as a side-car metadata file within the Research Object
Example: BioCompute Object has its own JSON format, typed & indicated from RO-Crate
Persistent IDs
Gradual ascent towards FAIR
Base line: Relative paths from RO-Crate Metadata File
Use cases: Describing dataset on desktop
Ad-hoc web-hosting (e.g. GitHub pages)
Institutional archives (e.g. Oxford Common File Layout)
Reuse existing PIDs and URLs
Use cases: Large data, not a file (e.g. database), reference datasets
Cite/reference existing resources (e.g. via identifiers.org)
Distinguish and crosslink contextual entities
Make paths absolute, using location-independent PIDs
(UUIDs, Naming Things with Hashes, ARCP)
Use cases: Found RO-Crate “in the wild”, ZIP archives, workflow outputs
Assign PIDs to RO-Crate (and its entities)
Use cases: Long-term availability, citations, permalinks
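The step from relative paths to location-independent identifiers can be sketched as follows. This is an illustration only: it assumes the ARCP URI form arcp://uuid,<UUID>/<path>, picks a random UUID for the crate, and uses hypothetical helper names (arcp_base, absolutise). Existing PIDs and URLs are left untouched, in line with the reuse principle above.

```python
import uuid
from urllib.parse import urlparse


def arcp_base(crate_uuid):
    """Build an ARCP base URI of the form arcp://uuid,<UUID>/ for a crate."""
    return f"arcp://uuid,{crate_uuid}/"


def absolutise(entity_id, base):
    """Return an absolute identifier; existing PIDs and URLs are kept as-is."""
    if urlparse(entity_id).scheme:
        return entity_id  # already absolute, e.g. an ORCID or DOI URL
    if entity_id.startswith("./"):
        entity_id = entity_id[2:]  # "./" is the crate root itself
    return base + entity_id


# A fresh random UUID identifies this copy of the crate. This is an
# assumption of the sketch; a hash-based identifier would serve as well.
base = arcp_base(uuid.uuid4())
for entity_id in ["./", "figure.png", "https://ror.org/03f0f6041"]:
    print(entity_id, "->", absolutise(entity_id, base))
```

The scheme test is deliberately simple: a relative file name containing a colon before any slash would be mistaken for an absolute URI, so a production tool would need a stricter check.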
Bit Sequences
Base line: Files on disk within RO-Crate Root folder + URLs
Downloadable Web resources listed w/ content size and access time.
Packaging/archiving RO-Crate Root folder:
BagIt (RFC8493): Manifest of all (local) files and their checksums
Use case: Ensure all files are transferred/archived
BDBag: Include external files by ARK/MinID PIDs
Use case: “thin” RO-Crate, Big Data, shared immutable files
OCFL: Archival storage with revision tracking, infrequent changes
Git LFS: Tracked frequent changes, collaborative editing
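To illustrate the BagIt use case of ensuring that all files are transferred, the sketch below computes the kind of checksum listing a BagIt manifest holds, one "<checksum> <path>" line per file. It is not a BagIt implementation: a real bag also needs a bagit.txt declaration and a data/ payload directory, which this example does not create. The function name sha256_manifest and the file names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_manifest(root):
    """Return manifest-style lines ("<checksum>  <path>") for files under root."""
    root = Path(root)
    lines = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.relative_to(root).as_posix()}")
    return lines


# Demonstration on a throwaway directory standing in for an RO-Crate Root.
with tempfile.TemporaryDirectory() as tmp:
    crate = Path(tmp)
    (crate / "ro-crate-metadata.json").write_text("{}", encoding="utf-8")
    (crate / "data").mkdir()
    (crate / "data" / "empty.txt").write_bytes(b"")
    manifest = sha256_manifest(crate)

for line in manifest:
    print(line)
```

A receiver can recompute the same listing after transfer and compare it line by line to detect missing or altered files.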
Repository
Base line: zip/tar download
e.g. “supplementary data”, GitHub release
Export from web platforms
e.g. workflowhub.eu, Galaxy workflow system
Deposit in general/institutional
repositories
e.g. Zenodo, Mendeley Data
Deposit in domain-specific
repositories
e.g. GBIF
Deposit individual files
e.g. ARK, S3, B2SHARE
Deposit metadata
e.g. WikiData, nanopubs
What can FDO learn from RO-Crate?
Not everything is known beforehand
• Existing PID metadata not always relevant
• Allow contextualized metadata
• Types not always known in advance
• Allow “casting” or reinterpretation
• Operations not always known in advance
• Allow open-ended generation and use
Use wheels already invented
• Fit into researchers’ existing working practices and
familiar technologies.
• Aim for gradual improvements.
• Reuse existing technology
• .. But only when not too complex
• Reuse existing PID infrastructure,
including URLs
• Human consumers recognize and click hyperlinks
• Build on existing metadata standards
• Which are both simple and extensible
Data and Contextual entities are equally
important
Remember people … especially Developers
• Keep human consumption in focus
• Ensure metadata is easily rendered and edited
• Provide Best Practice guidance
• Firm, but not too restrictive
• Developer producer and consumer friendly
• Rather than academically elegant
What can RO-Crate learn from FDO?
Provide stronger guidance on PID and availability
Recommend deposit infrastructure for end-users and framework developers
Tools to assist – e.g. generate Zenodo Datacite metadata from RO-Crate
Provide stronger typing of RO-Crates and their content
Profiles as first approach
Expose potential operations on an RO-Crate
Build general RO-Crate services, e.g. index
Turtles all the way down!
Document better how RO-Crates can/should be nested
How to choose granularity of RO-Crates?
Tools to liberate/reuse an entity from a single RO-Crate
RO-Crate as part of the FDO ecosystem
A bag of references + metadata
Framework for actionable objects
RO-Crates enable active operations for FDOs
FDO offers additional infrastructure and
practice
A bag of references + metadata
A metadata framework for FDO
Developer friendly practical and web-native
implementations
Community
Infrastructure applications
The Synthesys+ project gives a concrete
use case for practically joining the two up.
FAIR Principles for Digital Research Objects
FAIR all the way down
Unbounded FAIR
Distributed FAIR
Living FAIR
Analogous to FAIR Software
FAIR RO-Crate is a practical start
Fig from EOSC Interoperability Framework
Editor's Notes
Overcome fragmentation
Identifiers to locate things
Organisation to structure & link things
Annotations about the things
@type: MUST be Dataset
@id: MUST end with / and SHOULD be the string ./
name: SHOULD identify the dataset to humans well enough to disambiguate it from other RO-Crates
description: SHOULD further elaborate on the name to provide a summary of the context in which the dataset is important.
datePublished: MUST be a string in ISO 8601 date format and SHOULD be specified to at least the precision of a day, MAY be a timestamp down to the millisecond.
license: SHOULD link to a Contextual Entity in the RO-Crate Metadata File with a name and description. MAY have a URI (e.g. for Creative Commons or Open Source licenses). MAY, if necessary, be a textual description of how the RO-Crate may be used.
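The rules above that can be tested mechanically lend themselves to a small checker. The following Python sketch is one possible reading of them, not an official validator: the function name check_root is hypothetical, the date test is a loose shape check rather than full ISO 8601 validation, and reporting a missing name, description or license as a plain problem is a choice made for this example.

```python
import re

# Loose shape check: YYYY, YYYY-MM, YYYY-MM-DD, or YYYY-MM-DDT<time>.
ISO_DATE = re.compile(r"^\d{4}(-\d{2}(-\d{2}(T\S+)?)?)?$")


def check_root(entity):
    """Return a list of problems found in a Root Data Entity dict."""
    problems = []

    types = entity.get("@type")
    types = types if isinstance(types, list) else [types]
    if "Dataset" not in types:
        problems.append("@type MUST be Dataset")

    entity_id = entity.get("@id", "")
    if not entity_id.endswith("/"):
        problems.append("@id MUST end with /")
    elif entity_id != "./":
        problems.append("@id SHOULD be the string ./")

    for key in ("name", "description", "license"):
        if not entity.get(key):
            problems.append(f"{key} is missing")

    published = entity.get("datePublished")
    match = ISO_DATE.match(published) if isinstance(published, str) else None
    if match is None:
        problems.append("datePublished MUST be a string in ISO 8601 date format")
    elif match.group(2) is None:
        # Group 2 is the day part; without it the precision is too coarse.
        problems.append("datePublished SHOULD be precise to at least a day")

    return problems


# Hypothetical example entity; all values are placeholders.
root = {
    "@id": "./",
    "@type": "Dataset",
    "name": "Example crate",
    "description": "A small example dataset.",
    "datePublished": "2021-02-25",
    "license": {"@id": "https://creativecommons.org/licenses/by/4.0/"},
}
print(check_root(root))
```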
This document specifies a method, known as RO-Crate (Research Object Crate), of organizing file-based data with associated metadata, using linked data principles, in both human and machine readable formats, with the ability to include additional domain-specific metadata.
The core of RO-Crate is a JSON-LD file, the RO-Crate Metadata File, named ro-crate-metadata.json. This file contains structured metadata about the dataset as a whole (the Root Data Entity) and, optionally, about some or all of its files. This provides a simple way to, for example, assert the authors (e.g. people, organizations) of the RO-Crate or one of its files, or to capture more complex provenance for files, such as how they were created using software and equipment.
While providing the formal specification for RO-Crate, this document also aims to be a practical guide for software authors to create tools for generating and consuming research data packages, with explanation by examples.
At the basic level, an RO-Crate is a collection of files and resources represented as a Schema.org Dataset, that together form a meaningful unit for the purposes of communication, citation, distribution, preservation, etc. The RO-Crate Metadata File describes the RO-Crate, and MUST be stored in the RO-Crate Root.
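As a rough illustration of the paragraph above, the sketch below writes a minimal ro-crate-metadata.json into a directory acting as the RO-Crate Root. It assumes RO-Crate version 1.1 for the "@context" and "conformsTo" URIs, which these notes do not state, and the function names and example values are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def minimal_metadata(name, description, date_published, license_id, license_name):
    """Build a minimal RO-Crate Metadata File as a plain dict."""
    return {
        # Assumption: RO-Crate 1.1 context and specification URIs.
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {
                # Descriptor entity: says which entity is the root dataset.
                "@id": "ro-crate-metadata.json",
                "@type": "CreativeWork",
                "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
                "about": {"@id": "./"},
            },
            {
                # Root Data Entity: the RO-Crate as a Schema.org Dataset.
                "@id": "./",
                "@type": "Dataset",
                "name": name,
                "description": description,
                "datePublished": date_published,
                "license": {"@id": license_id},
            },
            {
                # Contextual entity describing the license.
                "@id": license_id,
                "@type": "CreativeWork",
                "name": license_name,
            },
        ],
    }


def write_crate(root, metadata):
    """Store the metadata file directly in the RO-Crate Root."""
    path = Path(root) / "ro-crate-metadata.json"
    path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return path


metadata = minimal_metadata(
    "Example crate",
    "A small example dataset.",
    "2021-02-25",
    "https://creativecommons.org/licenses/by/4.0/",
    "Creative Commons Attribution 4.0 International",
)
with tempfile.TemporaryDirectory() as tmp:
    written = write_crate(tmp, metadata)
    reloaded = json.loads(written.read_text(encoding="utf-8"))
print(reloaded["@graph"][1]["name"])
```

Because the result is ordinary JSON, the same file can be produced by any language with a JSON serializer.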
While RO-Crate is well catered for describing a Dataset as files and relevant metadata that are contained by the RO-Crate in the sense of living within the same root directory, RO-Crates can also reference external resources which are stored or accessed separately, via absolute URIs. This is particularly recommended where some resources cannot be co-hosted for practical or legal reasons, or if the RO-Crate itself is primarily web-based.
It is important to note that the RO-Crate Metadata File is not an exhaustive manifest or inventory, that is, it does not necessarily list or describe all files in the package. Rather it is focused on providing a sufficient amount of metadata to understand and use the content, and is designed to be compatible with existing and future approaches that do have full inventories / manifests and integrity checks, e.g. by using checksums, such as BagIt and Oxford Common File Layout OCFL Objects.
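Since the metadata file need not list every file, a tool may want to report which files are present but undescribed, without treating that as an error. The sketch below shows one way to do so. It assumes that an entity whose "@id" ends with "/" covers every file beneath that directory, which is a simplification made for this example, and the name undescribed_files is hypothetical.

```python
import tempfile
from pathlib import Path


def undescribed_files(root, graph):
    """List files under root that no entity in graph describes."""
    root = Path(root)
    described = set()
    for entity in graph:
        entity_id = entity["@id"]
        described.add(entity_id[2:] if entity_id.startswith("./") else entity_id)

    missing = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        rel = path.relative_to(root)
        # A file counts as described if it, or an enclosing directory, is listed.
        covering = {rel.as_posix()} | {
            parent.as_posix() + "/"
            for parent in rel.parents
            if parent.as_posix() != "."
        }
        if not covering & described:
            missing.append(rel.as_posix())
    return missing


graph = [
    {"@id": "ro-crate-metadata.json"},
    {"@id": "./"},
    {"@id": "figure.png"},
    {"@id": "data/"},
]
with tempfile.TemporaryDirectory() as tmp:
    crate = Path(tmp)
    for name in ["ro-crate-metadata.json", "figure.png", "data/a.csv", "scratch/notes.txt"]:
        target = crate / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("", encoding="utf-8")
    leftover = undescribed_files(crate, graph)
print(leftover)
```

A complete inventory with checksums remains the job of a packaging layer such as BagIt or OCFL.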
The intention is that RO-Crates can work well with a variety of archive file formats, e.g. tar, zip, etc., and approaches to capturing file manifests and file fixity, such as BagIt, OCFL and git (see also appendix Combining with other packaging schemes). An RO-Crate can also be hosted on the web or mainly refer to web resources, although extra care to ensure persistence and consistency should be taken for archiving such RO-Crates.
Data might
More flexible than XML Schemas
More straightforward than OWL
Linked Data “exotics” there for when the time is right if needed by the right people
Portland Common Data Model for digital library & repository content
Metadata is Flat JSON file
Descriptions of the Objects
Descriptions of the RO-Crate
Descriptions relating the Objects
human and machine readable formats,
additional domain-specific metadata.
PARADISEC and The University of Melbourne
University of Technology Sydney
PARADISEC (the Pacific And Regional Archive for Digital Sources in Endangered Cultures) has operated a data curation service for endangered languages for 18 years. Currently consisting of 500,000 files in 28,624 items and 574 collections, there is over 90TB of data under management. The repository, whilst still functional and fit for purpose, is showing signs of its age. Using the Arkisto Platform, the Modern PARADISEC project is updating the catalog whilst also enabling a path for smaller regional repositories to exist and to easily move their content across collections and institutions.
NIH Data Commons
transferring and archiving very large HTS datasets in a location-independent way
Keep the context of data content together when it's scattered. Scalability
ReproZip can automatically pack your research along with all necessary data files, libraries, environment variables and options into a self-contained bundle. https://www.reprozip.org/
One pressure is towards bundling objects within objects. There’s nothing in DRS for a structured manifest within a bundle.
RO-Crate as interchange format and integration with Zenodo, using Describo
Data Cubes for efficient and scalable structured data access and discovery
Research Objects as the mechanism to manage scientific research activities and connect associated resources
RO snapshot/releases published to EOSC repositories and scholarly communication platforms (e.g., B2Share, Zenodo)
We started in theory, but we work in practice
ROs sit in the middle
To be FAIR
each digital object type has its own metadata requirements,
and may have its own repositories and registries
https://op.europa.eu/en/publication-detail/-/publication/d787ea54-6a87-11eb-aeb5-01aa75ed71a1/language-en/format-PDF/source-190308283
https://www.eoscsecretariat.eu/sites/default/files/eosc-interoperability-framework-v1.0.pdf
Schwardmann (2020), Digital Objects – FAIR Digital Objects: Which Services Are Required? Data Science Journal
EOSC Interoperability Framework Draft (2020)
Hardisty A, et al (2020) Conceptual design blueprint for the DiSSCo digitization infrastructure RIO 6: e54280.
DONA Digital Object Architecture Digital Object Interface Protocol (2018)
https://fairdigitalobjectframework.org/
How to make this practical!
So let’s do a quick reminder about what are the key components of FAIR Digital Objects:
Multiple levels of Digital objects:
- A Persistent Identifier to locate the DO
- A set of Operations to use the DO
- Metadata about the DO – itself a DO
- All of these represented as bit sequences that can be transferred or stored in repository
- Finally a Collection is a type of DO that aggregates other DOs and of course has its own metadata
For RO-Crate, how do we approach FDO? We have implementations for each of these concepts:
The RO-Crate dataset itself is of course what we can call a Collection.
It aggregates a set of data entities, but also contextual entities. I’ll come back to those.
We describe the RO-Crate and the entities in a metadata file
We can assign persistent identifiers to the RO-Crate and parts of it
We can store the bit sequences, ideally with something like BagIt, to ensure completeness
And deposit all of this in regular repositories like Zenodo or GitHub
RO-Crate recommends how to structure metadata.
FDO RO-Crate profile
What are the operations on RO-Crate metadata?
Re-visit Wf4Ever Research Object APIs? (basically CRUD)
RO-Crate as a Linked Data Platform (LDP) Container?
Registration of RO-Crate types/profiles
How can RO-Crate users use FDO to resolve PID/storage challenges?