Research Objects and their instantiation as RO-Crate: motivation, explanation, examples, history and lessons, and opportunities for scholarly communications, delivered virtually to 17th Italian Research Conference on Digital Libraries
fundamental of entomology all in one topics of entomology
The swings and roundabouts of a decade of fun and games with Research Objects
1. The swings and roundabouts
of a decade of fun and games
with Research Objects
Professor Carole Goble
The University of Manchester, ELIXIR-UK
Software Sustainability Institute UK
carole.goble@manchester.ac.uk
17th Italian Research Conference on Digital Libraries (IRCDL) 2021, 18th February 2021
2. I work at supporting
Data Driven
Research at three
scales
Across the research
data lifecycle
Life Sciences,
Biodiversity
social sciences, astronomy,
digital libraries, chemistry etc
EU Research
Infrastructures
4. Methods
De novo assembly and binning
Raw reads from each run were first assembled with SPAdes v.3.10.020 with option --meta21.
Thereafter, MetaBAT 215 (v.2.12.1) was used to bin the assemblies using a minimum contig
length threshold of 2,000 bp (option --minContig 2000) and default parameters. Depth of
coverage required for the binning was inferred by mapping the raw reads back to their
assemblies with BWA-MEM v.0.7.1645 and then calculating the corresponding read depths
of each individual contig with samtools v.1.546 (‘samtools view -Sbu’ followed by ‘samtools
sort’) together with the jgi_summarize_bam_contig_depths function from MetaBAT 2. The
QS of each metagenome-assembled genome (MAG) was estimated with CheckM v.1.0.722
using the lineage_wf workflow and calculated as: level of completeness − 5 ×
contamination. Ribosomal RNAs (rRNAs) were detected with the cmsearch function from
INFERNAL v.1.1.247 (options -Z 1000 --hmmonly --cut_ga) using the Rfam48 covariance
models of the bacterial 5S, 16S and 23S rRNAs. Total alignment length was inferred by the
sum of all non-overlapping hits. Each gene was considered present if more than 80% of the
expected sequence length was contained in the MAG. Transfer RNAs (tRNAs) were identified
with tRNAscan-s.e. v.2.049 using the bacterial tRNA model (option -B) and default
parameters. Classification into high- and medium-quality MAGs was based on the criteria
defined by the minimum information about a metagenome-assembled genome (MIMAG)
standards23 (high: >90% completeness and <5% contamination, presence of 5S, 16S and 23S
rRNA genes, and at least 18 tRNAs; medium: ≥ 50% completeness and <10% contamination).
Assignment of MAGs to reference databases
Four reference databases were used to classify the set of MAGs recovered from the human
gut assemblies: HR, RefSeq, GenBank and a collection of MAGs from public datasets. HR
comprised a total of 2,468 high-quality genomes (>90% completeness, <5% contamination)
retrieved from both the HMP catalogue (https://www.hmpdacc.org/catalog/) and the
HGG8. From the RefSeq database, we used all the complete bacterial genomes available (n
= 8,778) as of January 2018. In the case of GenBank, a total of 153,359 bacterial and 4,053
eukaryotic genomes (3,456 fungal and 597 protozoan genomes) deposited as of August
2018 were considered. Lastly, we surveyed 18,227 MAGs from the largest datasets publicly
available as of August 201813,16,17,18,19, including those deposited in the Integrated Microbial
Genomes and Microbiomes (IMG/M) database52. For each database, the function ‘mash
sketch’ from Mash v.2.053 was used to convert the reference genomes into a MinHash
sketch with default k-mer and sketch sizes. Then, the Mash distance between each MAG
and the set of references was calculated with ‘mash dist’ to find the best match (that is, the
reference genome with the lowest Mash distance). Subsequently, each MAG and its closest
relative were aligned with dnadiff v.1.3 from MUMmer 3.2354 to compare each pair of
genomes with regard to the fraction of the MAG aligned (aligned query, AQ) and ANI.
https://doi.org/10.1038/s41586-019-0965-1
scripts, workflows, SOPs & datasets
6. Objects are the Outcomes of Research
research outcomes are more than just publications and data
software, models, workflows, SOPs, lab protocols….
all are first class citizens of scholarship
information required to make research
FAIR and Reproducible (FAIR+R) …
7. From FAIR data to FAIR Digital Objects
To be FAIR
each digital object
type has its own
metadata
requirements,
and may have its
own repositories
and registries
FAIR Digital Objects for Science: From Data Pieces to Actionable
Knowledge Units: https://doi.org/10.3390/publications8020021
8. From FAIR data to FAIR Digital Objects
The FAIR Guiding Principles for Data Stewardship and Management
Scientific Data 3, 160018 (2016) doi:10.1038/sdata.2016.18
9. Objects are the Outcomes of Research
Each object has its own metadata and repositories
13. shareable, cite-able, exchangeable resource, with versioning and snapshots
Research Object
general framework - research outcomes related and bundled together
14. enriching resources and collections with additional
information required to make research FAIR and reproducible
metadata describing content and context
dependencies, versions, relationships, provenance, annotations …
DCAT
Schema.org
EML
MIAPPE
CodeMeta SBML
Bioschemas/CWL DataCite
15. integrated view over fragmented resources using PIDs
bigger on the inside than the outside
encapsulated content and references to external resources
16. has its own metadata, can be registered and deposited in its own right,
unpackaged and accessed, activated and reproduced if appropriate
encapsulated content and references to external resources
17. RO is the Unit of Knowledge Exchange
Between scholarly
communication and
research infrastructure
platforms
->
Between researchers
18. From the big picture
to the small picture in
practice
19. standards-based metadata framework for bundling resources
with context into citable reproducible packages
researchobject.org
Linked Data
Bechhofer et al (2013)Why linked data is not enough for scientists https://doi.org/10.1016/j.future.2011.08.004
Bechhofer et al (2010) Research Objects:Towards Exchange and Reuse of Digital Knowledge, https://eprints.soton.ac.uk/268555/
Since 2010…
20. self-describing, chiefly metadata, objects
RO Metadata file Structured metadata about the RO and content
files
links to web
resources
RO Content
Archive file format / packaging system
BagIt, zip
OCFL, Git
type
id
description
datePublished
…
directories
license
author organisation
21. self-describing, chiefly metadata, objects
RO Metadata file Structured metadata about the RO and content
image file
links to web
resources
RO Content
Archive file format / packaging system
directory of data
type, id
description
datePublished
creator
size
format …
"
https://zenodo.org/record/3541888
https://github/script
type
id
description
datePublished
…
license
author organisation
“Web Linked Data”
approach
22. self-describing, chiefly metadata, objects
RO Metadata file Structured metadata about the RO and content
image file
links to web
resources
RO Content
Archive file format / packaging system
directory of data
type, id
description
datePublished
creator
size
format …
"
https://zenodo.org/record/3541888
type
id
description
datePublished
…
license
author organisation
Workflow RO
“Web Linked Data”
approach
23. self-describing, chiefly metadata, objects
How do we describe the metadata?
PIDs + JSON-LD + Schema.org
descriptors
How can I add additional metadata?
Schema.org, domain ontologies
How do I define a checklist of what is
expected to be in a type of RO?
RO Profile
http://www.researchobject.org/ro-crate/
Standard
Web
Mark-up
24. RO-Crate in a nutshell
Practical lightweight approach to packaging research data
entities (any object) with metadata
Aggregate files and/or any URI-addressable content, with
contextual information to aid decisions about re-use:Who
WhatWhenWhereWhy How.
Web Native Machine readable. Human readable. Search
engine friendly. Familiar.
Extensible and Incremental: add additional metadata; nested
and typed by their profile.
Open Community effort
38 contributors
fromAU, EU and US
http://www.researchobject.org/ro-crate/
25. Lifting the Skirts to look at the
underwear
https://w3id.org/ro/crate/1.1
Released 30th October 2020
http://www.researchobject.org/ro-crate/
28. Cultural Heritage: A data curation service
for endangered languages: 500,000 files in
28,624 items and 574 collections
long term preservation and
accessibility of research data objects
https://arkisto-platform.github.io/
[Marco La Rosa, Peter Sefton]
29. Describe the data as
it’s being created
using open standards
and tools
source: https://guides.library.ucsc.edu/datamanagement/
[Marco La Rosa, Peter Sefton]
30. Scalable verified
collections of references
Processing big genomic & clinical data
distributed over multiple locations
NIH Data Commons
[Chard, et al 2016]
https://doi.org/10.1109/BigData.2016.7840618
minids
Retain and archive processed datasets
Reference and transfer large data on demand
Controlled access to sensitive data
[Kesselman, Foster]
31. Data and Method Commons
13 EU Life Science Research Infrastructures
Sharing data, tools and workflows in the cloud
100+ data collections
method commons
32. Data and Method Commons
13 EU Life Science Research Infrastructures
exchanging and preserving
secure data objects
Sharing data, tools and workflows in the cloud
interchange, stewardship,
recording dependencies -> portability &
reuse/reproducibility of workflow objects
34. Do Research
Assemble
Methods, Materials
Analyse
Results
Quality
Assessment
Track and Credit
Disseminate
Deposit &
Licence
Publish
Share
Results
Manage
Results
Science 2.0 Repositories: Time for a Change in Scholarly Communication Assante, Candela, Castelli, Manghi, Pagano, D-Lib 2015
Experiment
Observe
Simulate
Describe and release the data , workflows, research
as its being created, updated and used
35. Getting embedded into the European Open Science Cloud
Collaborative environments and
EGI-ACE data spaces for Earth
Science researchers
RO-Crate as interchange format and
integration with Zenodo, using Describo
Pan-European natively FAIR
and GDPR compliant data
storage and sharing fabric
Science Mesh using Cloud
Services for Synchronization
and Sharing (CS3)
RO to manage research activities
and release snapshots to Zenodo
36. exchange between
genomics platform and
repository.
standardise and share
analyses generated from
genome sequencing.
Scholarly Communication Drivers
Exchange & Import/Export
Reuse & Reproduce
Report & Archive
Living Objects
Share & Access
Figure: RDMKit, https://rdmkit.elixir-europe.org/
39. Activation &
research in a
project
Adoption by
other projects
Adoption
Adoption in
mainstream
infrastructure
40. Research ObjectVersion 1
2010-2017
http://wf4ever.org/
Sharing of data-intensive computational workflows
Preservation and managing their decay
• documentation of workflows
• interconnections between workflows and related
resources (e.g., datasets, publications, etc.)
• documentation of their dependencies and their outputs
• social aspects
Type-cast not just for workflows or biology!
41. Howard Ratner,
Chair STM Future Labs Committee, CEO EVP Nature PublishingGroup
Director of Development for CHORUS
2012
2017
42. adoption stuck in inner groups
not mainstream or outsiders
a swing back to basics
reboot!
43. Machine-processable
Standards
Low tech
Graceful degradation
Commodity tooling
Incremental
Multi-platform
Technology Independent
Keep it
simple
E X A M P L E S
Developer
friendliness
A swing back to basics
"just enough complexity / standards”
sufficient extra benefits from what
already exists…
…without compromising the developer
entry-level experience so much that they
rather do their own thing.
44. Version 1
Linked Data Purity
Construct
metadata
file
Describe
metadata
content
Mismash of
ontologies
combined together
to describe
metadata file
RDF Shapes
Represent profiles
Hard to make
checklists
SHACL, ShEx
W3C Web Annotation Vocabulary
OAI Object Exchange and Reuse
45. Indeed.
Linked Data is not enough.
And sometimes its too much.
Research Infrastructures:
“digital technologies (hardware, software),
resources (data, services, digital libraries,
standards), comms (protocols, access rights,
networks), people and organisational
structures”
46. Tensions
Research Infrastructures sit in the middle
Academic
Viewpoint
Infrastructure
Viewpoint
Green field site
Theoretical purity
Use latest thing
Proof of concept
Sophistication
Narrow developer audience
Strive for super generic
The end
Exposing the tech
Pre-existing platforms
Practicality
Use things that work
Production
Simplicity & Familiarity
Wide developer audience
Several specific is ok!
The means
Hiding the tech
47. 2018: Digital Libraries & Developer friendly reboot
Peter Sefton
Semantic Web world vs RealWorld
Open Repositories & Dig Lib Community
DataCrate
simple web
stack
RO
rich RDF
stack
+
fewer features
easier to understand, conceptually simpler
opinionated guide to current best practices
software stacks widely used on the Web
48. Just Enough Linked Data Just InTime
simplifications rather than generalizations
Retain benefits of Linked Data
querying, graph stores,
vocabularies, clickable URIs
customization and conventions
The stuff a developer needs
documentation, examples, libraries,
tools, community
limited flexibility frees up developers
familiarity is important
Linked Data “exotics” there for when the time is right if needed by the right people
+
49. Developer friendly ->Tooling, Libraries
How can I use it?
While we’re mostly focusing on the specification, some tools already exist for working with
RO-Crates:
● Describo interactive desktop application to create, update and export RO-Crates for
different profiles. (~ beta)
● CalcyteJS is a command-line tool to help create RO-Crates and HTML-readable
rendering (~ beta)
● ro-crate - JavaScript/NodeJS library for RO-Crate rendering as HTML. (~ beta)
● ro-crate-js - utility to render HTML from RO-Crate (~ alpha)
● ro-crate-ruby Ruby library to consume/produce RO-Crates (~ alpha)
● ro-crate-py Python library to consume/produce RO-Crates (~ planning)
These applications use or expose RO-Crates:
● Workflow Hub imports and exports Workflow RO-Crates
● OCFL-indexer NodeJS application that walks the Oxford Common File Layout on the
file system, validate RO-Crate Metadata Files and parse into objects registered in
Elasticsearch. (~ alpha)
● ONI indexer
● ocfl-tools
● ocfl-viewer
● Research Object Composer is a REST API for gradually building and depositing
Research Objects according to a pre-defined profile. (RO-Crate support alpha)
● … (yours?)
https://uts-eresearch.github.io/describo/
50. People! A Community
Sponsored
RO2018, Amsterdam
e-Science Conference
https://www.researchobject.org/ro-crate/#contribute
Diverse set of people
Variety of stakeholders
Set of collective norms
Open platform for communication
RO2018
51. People! Leadership, Champions, Early Adopters
we had to reboot the community twice….
Stian Soiland-Reyes
The University of Manchester, UK
Peter Sefton
University ofTechnology Sydney Infrastructure buy in lifts from project to mainstream
52. tendency to first focus on
providers…
…but (developer) consumers
build the ecosystem…
53. Consumers
Archivist and library specialists know the
importance of metadata and standards…
… and for things to work 5, 10, 20 years later.
End-users and Developers handle legacy …
…. their field, repositories, institutions , journals
etc. will always be lagging behind the curve
Researchers, to make their object FAIR …
… do not have the resources or know-how….
Specification &
standards-based
realisation
Incremental &
Extensible, Practical,
Familiar
Built in to their
platforms they use in
their day-to-day work
55. Back to the Big Picture.
Opportunities
for Scholarly Communications
56. I concentrated on developers….what about end users?
Big Picture
User-Ware
Familiar tools in
the research
process
Infrastructure
Under-Ware
Familiar tech in
the infrastructure
Registries
Repositories
Tools
Applications
Built in
On ramps
Metadata
automation
57. I concentrated on developers….what about end users?
Infrastructure
RO-Crate Commons
Ecosystem
Knowledge graphs Citation and tracking
Reproducibility
Release updates
New services
Built by
developers
New user
capabilities
Living and actionable publications
whose content transcends platforms
threaded publications
FAIR Digital
Objects
58. Back to FAIR Digital Objects
Actionable knowledge unit
Digital butterfly – digital twins
Bags of references
courtesy Dimitris Koureas
Coordinator DiSSCo EU Research
Infrastructure
Specimen object image
courtesy of Alex Hardisty
[Hardisty et al, 2020]
59. European Open Science Cloud
EOSC Interoperability Framework
Specimen Data Refinery
Workflows to Digitise
Natural History Specimens
FAIR Digital Object
Framework
Open Digital Specimen
Workflow Infrastructure
RO-Crate
+
https://op.europa.eu/en/publication-detail/-/publication/d787ea54-6a87-11eb-aeb5-
01aa75ed71a1/language-en/format-PDF/source-190308283
60. FAIR Principles for Digital Research Objects
FAIR all the way down
Unbounded FAIR
Distributed FAIR
Living FAIR
Analogous to FAIR Software
FAIR RO-Crate is a practical start
Fig from EOSC Interoperability Framework
61. Preparing for Digital Objects…..
Digital library community allies!
Make Research Object’s normative
– Promote to researchers but …
– ... target Research Infrastructures to deliver
– Move RO(-Crate)s across the scholarly ecosystem
Developer friendliness matters
– Persuasive design
FAIR principles for Research Objects….
– Unifying the vision with the practical
Release
paradigm
instead of a
Publish one
62. Barend Mons
Sean Bechhofer
Matthew Gamble
Raul Palma
Jun Zhao
Mark Robinson
AlanWilliams
Norman Morrison
Stian Soiland-Reyes
Tim Clark
Alejandra Gonzalez-Beltran
Philippe Rocca-Serra
Ian Cottam
Susanna Sansone
Kristian Garza
DanielGarijo
Catarina Martins
Iain Buchan
Michael Crusoe
Rob Finn
StuartOwen
Finn Bacall
Bert Droebeke
Laura Rodríguez Navas
Ignacio Eguinoa
Carl Kesselman
Ian Foster
Kyle Chard
Vahan Simonyan
Ravi Madduri
Raja Mazumder
GilAlterovitz
Denis Dean II
DurgaAddepalli
Wouter Haak
Anita DeWaard
Paul Groth
Oscar Corcho
Peter Sefton
Eoghan Ó Carragáin
FrederikCoppens
Jasper Koehorst
Simone Leo
Nick Juty
LJ Garcia Castro
Karl Sebby
Alexander Kanitz
Ana Trisovic
http://researchobject.org
Gavin Kennedy
Mark Graves
José María Fernández Jose
Manuel Gomez-Perez
Jason A. Clark
Salvador Capella-Gutierrez
Alasdair J. G. Gray
Kristi Holmes
Giacomo Tartari
Hervé Ménager
Paul Walk
Brandon Whitehead
Erich Bremer
Mark Wilkinson
Jen Harrow
Marco La Rosa
And many more....
Editor's Notes
The swings and roundabouts of a decade of fun and games with Research Objects
http://ircdl2021.dei.unipd.it/
17th Italian Research Conference on Digital Libraries Information and Research Science connecting to Digital and Library Science
Over the past decade we have been working on the concept of Research Objects http://researchobject.org). We worked at the 20,000 feet level as a new form of FAIR+R scholarly communication and at the 5 feet level as a packaging approach for assembling all the elements of a research investigation, like the workflows, data and tools associated with a publication. Even in our earliest papers [1] we recognised that there was more to success than just using Linked Data. We swing from fancy RDF models and SHACL that specialists could use to simple RO-Crates and JSON-LD that mortal programmers could use. Success depends on a use case need, a community and tools which we found particularly in the world of computational workflows right from the outset. All along there have been politics and that has recently got even hotter. All fun and games!
[1] Sean Bechhofer, et al (2013) Why Linked Data is Not Enough for Scientists FGCS 29(2)599-611 https://doi.org/10.1016/j.future.2011.08.004 v
Data pipeline & analysis reporting & reproducibility
https://op.europa.eu/en/publication-detail/-/publication/d787ea54-6a87-11eb-aeb5-01aa75ed71a1/language-en/format-PDF/source-190308283
https://www.eoscsecretariat.eu/sites/default/files/eosc-interoperability-framework-v1.0.pdf
Schwardmann (2020), Digital Objects – FAIR Digital Objects: Which Services Are Required? Data Science Journal
EOSC Interoperability Framework Draft (2020)
Hardisty A, et al (2020) Conceptual design blueprint for the DiSSCo digitization infrastructure RIO 6: e54280.
DONA Digital Object Architecture Digital Object Interface Protocol (2018)
https://fairdigitalobjectframework.org/
The FAIR Guiding Principles for Data Stewardship and Management
Scientific Data 3, 160018 (2016) doi:10.1038/sdata.2016.18
Annotations about these resources, to improve understanding and interpretation
Data used and results produced in experimental study
Methods employed to produce and analyse that data
Provenance and settings for the experiments
People, specimens, equipment etc involved in the investigation
Overcome fragmentation
Overcome fragmentation
Added after presentation
So we came up with ROs
Identifiers to locate things
Organisation to structure & link things
Annotations about the things
@type: MUST be Dataset
@id: MUST end with / and SHOULD be the string ./
name: SHOULD identify the dataset to humans well enough to disambiguate it from other RO-Crates
description: SHOULD further elaborate on the name to provide a summary of the context in which the dataset is important.
datePublished: MUST be a string in ISO 8601 date format and SHOULD be specified to at least the precision of a day, MAY be a timestamp down to the millisecond.
license: SHOULD link to a Contextual Entity in the RO-Crate Metadata File with a name and description. MAY have a URI (eg for Creative Commons or Open Source licenses). MAY, if necessary be a textual description of how the RO-Crate may be used.
This document specifies a method, known as RO-Crate (Research Object Crate), of organizing file-based data with associated metadata, using linked data principles, in both human and machine readable formats, with the ability to include additional domain-specific metadata.
The core of RO-Crate is a JSON-LD file, the RO-Crate Metadata File, named ro-crate-metadata.json. This file contains structured metadata about the dataset as a whole (the Root Data Entity) and, optionally, about some or all of its files. This provides a simple way to, for example, assert the authors (e.g. people, organizations) of the RO-Crate or one its files, or to capture more complex provenance for files, such as how they were created using software and equipment.
While providing the formal specification for RO-Crate, this document also aims to be a practical guide for software authors to create tools for generating and consuming research data packages, with explanation by examples.
At the basic level, an RO-Crate is a collection of files and resources represented as a Schema.org Dataset, that together form a meaningful unit for the purposes of communication, citation, distribution, preservation, etc. The RO-Crate Metadata File describes the RO-Crate, and MUST be stored in the RO-Crate Root.
While RO-Crate is well catered for describing a Dataset as files and relevant metadata that are contained by the RO-Crate in the sense of living within the same root directory, RO-Crates can also reference external resources which are stored or accessed separately, via absolute URIs. This is particularly recommended where some resources cannot be co-hosted for practical or legal reasons, or if the RO-Crate itself is primarily web-based.
It is important to note that the RO-Crate Metadata File is not an exhaustive manifest or inventory, that is, it does not necessarily list or describe all files in the package. Rather it is focused on providing sufficient amount of metadata to understand and use the content, and is designed to be compatible with existing and future approaches that do have full inventories / manifest and integrity checks, e.g. by using checksums, such as BagIt and Oxford Common File Layout OCFL Objects.
The intention is that RO-Crates can work well with a variety of archive file formats, e.g. tar, zip, etc., and approaches to capturing file manifests and file fixity, such as BagIt, OCFL and git (see also appendix Combining with other packaging schemes). An RO-Crate can also be hosted on the web or mainly refer to web resources, although extra care to ensure persistence and consistency should be taken for archiving such RO-Crates.
Data might
@type: MUST be Dataset
@id: MUST end with / and SHOULD be the string ./
name: SHOULD identify the dataset to humans well enough to disambiguate it from other RO-Crates
description: SHOULD further elaborate on the name to provide a summary of the context in which the dataset is important.
datePublished: MUST be a string in ISO 8601 date format and SHOULD be specified to at least the precision of a day, MAY be a timestamp down to the millisecond.
license: SHOULD link to a Contextual Entity in the RO-Crate Metadata File with a name and description. MAY have a URI (eg for Creative Commons or Open Source licenses). MAY, if necessary be a textual description of how the RO-Crate may be used.
This document specifies a method, known as RO-Crate (Research Object Crate), of organizing file-based data with associated metadata, using linked data principles, in both human and machine readable formats, with the ability to include additional domain-specific metadata.
The core of RO-Crate is a JSON-LD file, the RO-Crate Metadata File, named ro-crate-metadata.json. This file contains structured metadata about the dataset as a whole (the Root Data Entity) and, optionally, about some or all of its files. This provides a simple way to, for example, assert the authors (e.g. people, organizations) of the RO-Crate or one its files, or to capture more complex provenance for files, such as how they were created using software and equipment.
While providing the formal specification for RO-Crate, this document also aims to be a practical guide for software authors to create tools for generating and consuming research data packages, with explanation by examples.
At the basic level, an RO-Crate is a collection of files and resources represented as a Schema.org Dataset, that together form a meaningful unit for the purposes of communication, citation, distribution, preservation, etc. The RO-Crate Metadata File describes the RO-Crate, and MUST be stored in the RO-Crate Root.
While RO-Crate is well catered for describing a Dataset as files and relevant metadata that are contained by the RO-Crate in the sense of living within the same root directory, RO-Crates can also reference external resources which are stored or accessed separately, via absolute URIs. This is particularly recommended where some resources cannot be co-hosted for practical or legal reasons, or if the RO-Crate itself is primarily web-based.
It is important to note that the RO-Crate Metadata File is not an exhaustive manifest or inventory, that is, it does not necessarily list or describe all files in the package. Rather it is focused on providing sufficient amount of metadata to understand and use the content, and is designed to be compatible with existing and future approaches that do have full inventories / manifest and integrity checks, e.g. by using checksums, such as BagIt and Oxford Common File Layout OCFL Objects.
The intention is that RO-Crates can work well with a variety of archive file formats, e.g. tar, zip, etc., and approaches to capturing file manifests and file fixity, such as BagIt, OCFL and git (see also appendix Combining with other packaging schemes). An RO-Crate can also be hosted on the web or mainly refer to web resources, although extra care to ensure persistence and consistency should be taken for archiving such RO-Crates.
Other ROs might be embedded or linked
An RO might JUST be metadata and a structured
@type: MUST be Dataset
@id: MUST end with / and SHOULD be the string ./
name: SHOULD identify the dataset to humans well enough to disambiguate it from other RO-Crates
description: SHOULD further elaborate on the name to provide a summary of the context in which the dataset is important.
datePublished: MUST be a string in ISO 8601 date format and SHOULD be specified to at least the precision of a day, MAY be a timestamp down to the millisecond.
license: SHOULD link to a Contextual Entity in the RO-Crate Metadata File with a name and description. MAY have a URI (eg for Creative Commons or Open Source licenses). MAY, if necessary be a textual description of how the RO-Crate may be used.
This document specifies a method, known as RO-Crate (Research Object Crate), of organizing file-based data with associated metadata, using linked data principles, in both human and machine readable formats, with the ability to include additional domain-specific metadata.
The core of RO-Crate is a JSON-LD file, the RO-Crate Metadata File, named ro-crate-metadata.json. This file contains structured metadata about the dataset as a whole (the Root Data Entity) and, optionally, about some or all of its files. This provides a simple way to, for example, assert the authors (e.g. people, organizations) of the RO-Crate or one its files, or to capture more complex provenance for files, such as how they were created using software and equipment.
While providing the formal specification for RO-Crate, this document also aims to be a practical guide for software authors to create tools for generating and consuming research data packages, with explanation by examples.
At the basic level, an RO-Crate is a collection of files and resources represented as a Schema.org Dataset, that together form a meaningful unit for the purposes of communication, citation, distribution, preservation, etc. The RO-Crate Metadata File describes the RO-Crate, and MUST be stored in the RO-Crate Root.
While RO-Crate is well catered for describing a Dataset as files and relevant metadata that are contained by the RO-Crate in the sense of living within the same root directory, RO-Crates can also reference external resources which are stored or accessed separately, via absolute URIs. This is particularly recommended where some resources cannot be co-hosted for practical or legal reasons, or if the RO-Crate itself is primarily web-based.
It is important to note that the RO-Crate Metadata File is not an exhaustive manifest or inventory, that is, it does not necessarily list or describe all files in the package. Rather it is focused on providing sufficient amount of metadata to understand and use the content, and is designed to be compatible with existing and future approaches that do have full inventories / manifest and integrity checks, e.g. by using checksums, such as BagIt and Oxford Common File Layout OCFL Objects.
The intention is that RO-Crates can work well with a variety of archive file formats, e.g. tar, zip, etc., and approaches to capturing file manifests and file fixity, such as BagIt, OCFL and git (see also appendix Combining with other packaging schemes). An RO-Crate can also be hosted on the web or mainly refer to web resources, although extra care to ensure persistence and consistency should be taken for archiving such RO-Crates.
Portland Common Data Model for digital library & repository content
Metadata is Flat JSON file
Descriptions of the Objects
Descriptions of the RO-Crate
Descriptions relating the Objects
human and machine readable formats,
additional domain-specific metadata.
Added after presentation
Released 30th October 2020
We wanted to be platform/community agnostic
PARADISEC and The University of Melbourne
University of Technology Sydney
PARADISEC (the Pacific And Regional Archive for Digital Sources in Endangered Cultures) has operated a data curation service for endangered languages for 18 years. Currently consisting of 500,000 files in 28,624 items and 574 collections, there is over 90TB of data under management. The repository, whilst still functional and fit for purpose, is showing signs of its age. Using the Arkisto Platform, the Modern PARADISEC project is updating the catalog whilst also enabling a path for smaller regional repositories to exist and to easily move their content across collections and institutions.
For research data to be Findable, and Interoperable and Reusable, it has to have metadata that makes it so - and this needs to be both human and machine readable so that finding aids can be built and people can search, or write programs to discover and use data. Also for research data to be distributed it has to be *packaged* into distributable objects - otherwise there is no digital object. This talk looks at one tool that implements the Research Object Crate standard that supports such FAIR packaging.
NIH Data Commons
transferring and archiving very large HTS datasets in a location-independent way
keep the context of data content together when its scattered. Scalability
The Data-Driven Open Science, e-Laboratories
ROs all the way along, not just the end
Inside and outside
Elsewhere, on-date
Within, during
Reproducible Research Environment
Integrated infrastructure for producing and working with reproducible research.
Reproducible Research Publication Environment
Distributing and reviewing; academic credit; legal licensing etc.
Such a merge between research infrastructure and research marketplace would overcome known "cultural barriers" and the "methodological barriers" mentioned above. In the RI scope, research products publishing would:
be facilitated by the very ICT services that are generating them;
take advantage of research activity awareness and product interlinking;
support access rights issues that fit the need of the community; and
subsume the major costs of storing and curating products. In other words, by bringing marketplace features within the RI, researchers can finally achieve their best effort in terms of scholarly communication.
Science 2.0 Repositories: Time for a Change in Scholarly Communication
Massimiliano Assante, Leonardo Candela, Donatella Castelli, Paolo Manghi and Pasquale PaganoIstituto di Scienza e Tecnologie dell'Informazione, Consiglio Nazionale delle Ricerche, Italy{assante, candela, castelli, manghi, pagano}@isti.cnr.it DOI: 10.1045/january2015-assante
Data Cubes for efficient and scalable structured data access and discovery
Research Objects as the mechanism to manage scientific research activities and connect associated resources
RO snapshot/releases published to EOSC repositories and scholarly communication platforms (e.g., B2Share, Zenodo
Journey
Journey
Howard Ratner, Chair STM Future Labs Committee, CEO EVP Nature Publishing GroupDirector of Development for CHORUS (Clearinghouse for the Open Research of US) STM Innovations Seminar 2012
projects
https://www.researchobject.org/specs/
W3C Web Annotation Vocabulary
OAI Object Exchange and Reuse
Describe manifest content
Wf4Ever RO ontology, Wf4Ever ROEvo …
Dublin Core, FOAF, SIOC, Creative Commons, PROV, PAV…
The first implementation of Research Objects in 2009 for sharing workflows in myExperiment https://doi.org/10.1093/nar/gkq429 was based on RDF ontologies http://eprints.soton.ac.uk/id/eprint/267787, building on Dublin Core, FOAF, SIOC, Creative Commons, OAI-ORE and (later) DBPedia to form myExperiment ontologies for describing social networking, attribution and creditation, annotations, aggregation packs, experiments, view statistics, contributions, and workflow components. http://web.archive.org/web/20091115080336/http://rdf.myexperiment.org/ontologies Programmatic access to Research Objects was facilitated with an RDF endpoint that exposed individual myExperiment resources, also queriable from a SPARQL endpoint, both using the myExperiment vocabularies and RDF formats RDF/XML and Turtle.
Re-use of existing vocabularies. Making lots of new vocabularies. - too many??
Stack becomes very large to learn and use.
RDF as its own worst enemy.
Need for simplicity. Cater for "web developer".
Re-use of existing vocabularies. Making lots of new vocabularies. - too many??
Stack becomes very large to learn and use.
RDF as its own worst enemy.
Need for simplicity. Cater for "web developer".
Sometimes its too much
The irony
It is for research infrastructures not research
We started in theory, but we work in practice
ROs sit in the middle
“Just enough standards” cf. schema.org
What I would like to see in the end is how we are NOT throwing out the Linked Data foundation - and that we want to retain all the benefits, e.g. querying, graph stores, vocabularies, clickable URI as identifiers - but they are mere things we want to keep in our toolbox rather than a goal in itself - and as such the toolbox also have the other things developers expects; documentation, examples, libraries, simplifications rather than generalizations [removing unneeded degrees of freedom like multiple RDF formats]; showing them how to do the Right Thing (tm) almost by accident. Then they can - if they want to and when time is right - pick up some of those shiny tools in the bottom of the toolbox.
Just like in cars - they benefit a lot from reusing standards - like communication buses, agreeing on mounts and screws, almost interchangable parts so that manufacturers can mix and match when they put it together. But you as a consumer don't care until you realize your exhaust pipe is broken and the mechanic now can choose how to fix it rather than be restrained to one single manufacturer.
some kind of "Why not just plain JSON?" point perhaps. We are not after the exotic features that you will hear about in the rest of the conference, but the bog standard tooling that arguably you could also do in a NoSQL database with enough customization and conventions. But rather than invent our own C&C we retain the Linked Data conventions - "Just enough standards" - following the lead of schema.org
Sponsored by BioExcel
(GitHub, Google Docs, monthly telcons)
Cross cutting Infrastructure buy in lifts from boutiques to mainstream
Easy to make, easy to use?
Hence desktop solutions like Describo are also important. Not sure if you mentioned Describo!
Library & Information Science Network
Then to can enable the big picture. Get early benefits
Snapshotting evolving papers can we get to the released evolving paper?
RO Middleware – added services on top, like a RO commons
Big picture new scholarship ….
Small picture new middleware…..
[Hardisty A, et al (2020) Conceptual design blueprint for the DiSSCo digitization infrastructure RIO 6: e54280.
How to make this practical!
How do we persuade Scientists?
We do not
That’s why we need to persuade the infrastructure people
What other things do we need to do
Small Picture – you don’t
Developer Friendliness Matters
Its underware – customers are infrastructures
Marrying the FDO vision with the RO
Embed in Research Infrastructures