The swings and roundabouts of a decade of fun and games with Research Objects

The swings and roundabouts
of a decade of fun and games
with Research Objects
Professor Carole Goble
The University of Manchester, ELIXIR-UK
Software Sustainability Institute UK
carole.goble@manchester.ac.uk
17th Italian Research Conference on Digital Libraries (IRCDL) 2021, 18th February 2021

I work at supporting
Data Driven
Research at three
scales
Across the research
data lifecycle
Life Sciences,
Biodiversity
social sciences, astronomy,
digital libraries, chemistry etc
EU Research
Infrastructures

Data availability
Methods
Source data
Suppl info

Methods
De novo assembly and binning
Raw reads from each run were first assembled with SPAdes v.3.10.020 with option --meta21.
Thereafter, MetaBAT 215 (v.2.12.1) was used to bin the assemblies using a minimum contig
length threshold of 2,000 bp (option --minContig 2000) and default parameters. Depth of
coverage required for the binning was inferred by mapping the raw reads back to their
assemblies with BWA-MEM v.0.7.1645 and then calculating the corresponding read depths
of each individual contig with samtools v.1.546 (‘samtools view -Sbu’ followed by ‘samtools
sort’) together with the jgi_summarize_bam_contig_depths function from MetaBAT 2. The
QS of each metagenome-assembled genome (MAG) was estimated with CheckM v.1.0.722
using the lineage_wf workflow and calculated as: level of completeness − 5 ×
contamination. Ribosomal RNAs (rRNAs) were detected with the cmsearch function from
INFERNAL v.1.1.247 (options -Z 1000 --hmmonly --cut_ga) using the Rfam48 covariance
models of the bacterial 5S, 16S and 23S rRNAs. Total alignment length was inferred by the
sum of all non-overlapping hits. Each gene was considered present if more than 80% of the
expected sequence length was contained in the MAG. Transfer RNAs (tRNAs) were identified
with tRNAscan-s.e. v.2.049 using the bacterial tRNA model (option -B) and default
parameters. Classification into high- and medium-quality MAGs was based on the criteria
defined by the minimum information about a metagenome-assembled genome (MIMAG)
standards23 (high: >90% completeness and <5% contamination, presence of 5S, 16S and 23S
rRNA genes, and at least 18 tRNAs; medium: ≥ 50% completeness and <10% contamination).
Assignment of MAGs to reference databases
Four reference databases were used to classify the set of MAGs recovered from the human
gut assemblies: HR, RefSeq, GenBank and a collection of MAGs from public datasets. HR
comprised a total of 2,468 high-quality genomes (>90% completeness, <5% contamination)
retrieved from both the HMP catalogue (https://www.hmpdacc.org/catalog/) and the
HGG8. From the RefSeq database, we used all the complete bacterial genomes available (n
= 8,778) as of January 2018. In the case of GenBank, a total of 153,359 bacterial and 4,053
eukaryotic genomes (3,456 fungal and 597 protozoan genomes) deposited as of August
2018 were considered. Lastly, we surveyed 18,227 MAGs from the largest datasets publicly
available as of August 201813,16,17,18,19, including those deposited in the Integrated Microbial
Genomes and Microbiomes (IMG/M) database52. For each database, the function ‘mash
sketch’ from Mash v.2.053 was used to convert the reference genomes into a MinHash
sketch with default k-mer and sketch sizes. Then, the Mash distance between each MAG
and the set of references was calculated with ‘mash dist’ to find the best match (that is, the
reference genome with the lowest Mash distance). Subsequently, each MAG and its closest
relative were aligned with dnadiff v.1.3 from MUMmer 3.2354 to compare each pair of
genomes with regard to the fraction of the MAG aligned (aligned query, AQ) and ANI.
https://doi.org/10.1038/s41586-019-0965-1
scripts, workflows, SOPs & datasets

scripts, workflows, SOPs & datasets
https://doi.org/10.1038/s41586-019-0965-1

Objects are the Outcomes of Research
research outcomes are more than just publications and data
software, models, workflows, SOPs, lab protocols….
all are first class citizens of scholarship
information required to make research
FAIR and Reproducible (FAIR+R) …

From FAIR data to FAIR Digital Objects
To be FAIR
each digital object
type has its own
metadata
requirements,
and may have its
own repositories
and registries
FAIR Digital Objects for Science: From Data Pieces to Actionable
Knowledge Units: https://doi.org/10.3390/publications8020021

From FAIR data to FAIR Digital Objects
The FAIR Guiding Principles for Data Stewardship and Management
Scientific Data 3, 160018 (2016) doi:10.1038/sdata.2016.18

Objects are the Outcomes of Research
Each object has its own metadata and repositories

De-contextualised
Static, Fragmented
Lost Semantic linking
Contextualised
Active, Unified
Semantic linking
Buried in a
PDF figure
Scattered Reporting and Reading

Data Files
Methods: Scripts,
CompWorkflows
URLs to
external
data
Since 2007…
https://myexperiment.org

Systems Biology
Structured, interrelated objects in context
Data files
SOP docs
Models
URLs to
resources
Since 2008…
https://fairdomhub.org

shareable, cite-able, exchangeable resource, with versioning and snapshots
Research Object
general framework - research outcomes related and bundled together

enriching resources and collections with additional
information required to make research FAIR and reproducible
metadata describing content and context
dependencies, versions, relationships, provenance, annotations …
DCAT
Schema.org
EML
MIAPPE
CodeMeta SBML
Bioschemas/CWL DataCite

integrated view over fragmented resources using PIDs
bigger on the inside than the outside
encapsulated content and references to external resources

has its own metadata, can be registered and deposited in its own right,
unpackaged and accessed, activated and reproduced if appropriate
encapsulated content and references to external resources

RO is the Unit of Knowledge Exchange
Between scholarly
communication and
research infrastructure
platforms
->
Between researchers

From the big picture
to the small picture in
practice

standards-based metadata framework for bundling resources
with context into citable reproducible packages
researchobject.org
Linked Data
Bechhofer et al (2013)Why linked data is not enough for scientists https://doi.org/10.1016/j.future.2011.08.004
Bechhofer et al (2010) Research Objects:Towards Exchange and Reuse of Digital Knowledge, https://eprints.soton.ac.uk/268555/
Since 2010…

self-describing, chiefly metadata, objects
RO Metadata file Structured metadata about the RO and content
files
links to web
resources
RO Content
Archive file format / packaging system
BagIt, zip
OCFL, Git
type
id
description
datePublished
…
directories
license
author organisation

image file
links to web
resources
RO Content
directory of data
type, id
description
datePublished
creator
size
format …
"
https://zenodo.org/record/3541888
https://github/script
type
id
description
datePublished
…
license
author organisation
“Web Linked Data”
approach

image file
links to web
resources
RO Content
directory of data
type, id
description
datePublished
creator
size
format …
"
https://zenodo.org/record/3541888
type
id
description
datePublished
…
license
author organisation
Workflow RO
“Web Linked Data”
approach

How do we describe the metadata?
PIDs + JSON-LD + Schema.org
descriptors
How can I add additional metadata?
Schema.org, domain ontologies
How do I define a checklist of what is
expected to be in a type of RO?
RO Profile
http://www.researchobject.org/ro-crate/
Standard
Web
Mark-up

RO-Crate in a nutshell
Practical lightweight approach to packaging research data
entities (any object) with metadata
Aggregate files and/or any URI-addressable content, with
contextual information to aid decisions about re-use:Who
WhatWhenWhereWhy How.
Web Native Machine readable. Human readable. Search
engine friendly. Familiar.
Extensible and Incremental: add additional metadata; nested
and typed by their profile.
Open Community effort
38 contributors
fromAU, EU and US

Lifting the Skirts to look at the
underwear
https://w3id.org/ro/crate/1.1
Released 30th October 2020

Community
specific
Combine Archive
Publisher
specific
[Bergmann et al. 2014].
Platform
specific
DataONE
A trend

Cultural Heritage: A data curation service
for endangered languages: 500,000 files in
28,624 items and 574 collections
long term preservation and
accessibility of research data objects
https://arkisto-platform.github.io/
[Marco La Rosa, Peter Sefton]

Describe the data as
it’s being created
using open standards
and tools
source: https://guides.library.ucsc.edu/datamanagement/
[Marco La Rosa, Peter Sefton]

Scalable verified
collections of references
Processing big genomic & clinical data
distributed over multiple locations
NIH Data Commons
[Chard, et al 2016]
https://doi.org/10.1109/BigData.2016.7840618
minids
Retain and archive processed datasets
Reference and transfer large data on demand
Controlled access to sensitive data
[Kesselman, Foster]

Data and Method Commons
13 EU Life Science Research Infrastructures
Sharing data, tools and workflows in the cloud
100+ data collections
method commons

Data and Method Commons
13 EU Life Science Research Infrastructures
exchanging and preserving
secure data objects
Sharing data, tools and workflows in the cloud
interchange, stewardship,
recording dependencies -> portability &
reuse/reproducibility of workflow objects

https://workflowhub.eu/
Profile

Do Research
Assemble
Methods, Materials
Analyse
Results
Quality
Assessment
Track and Credit
Disseminate
Deposit &
Licence
Publish
Share
Results
Manage
Results
Science 2.0 Repositories: Time for a Change in Scholarly Communication Assante, Candela, Castelli, Manghi, Pagano, D-Lib 2015
Experiment
Observe
Simulate
Describe and release the data , workflows, research
as its being created, updated and used

Getting embedded into the European Open Science Cloud
Collaborative environments and
EGI-ACE data spaces for Earth
Science researchers
RO-Crate as interchange format and
integration with Zenodo, using Describo
Pan-European natively FAIR
and GDPR compliant data
storage and sharing fabric
Science Mesh using Cloud
Services for Synchronization
and Sharing (CS3)
RO to manage research activities
and release snapshots to Zenodo

exchange between
genomics platform and
repository.
standardise and share
analyses generated from
genome sequencing.
Scholarly Communication Drivers
Exchange & Import/Export
Reuse & Reproduce
Report & Archive
Living Objects
Share & Access
Figure: RDMKit, https://rdmkit.elixir-europe.org/

Swings and Roundabouts
why is it called “RO-Crate”??

Activation &
research in a
project
Adoption by
other projects
Adoption
Adoption in
mainstream
infrastructure

Research ObjectVersion 1
2010-2017
http://wf4ever.org/
Sharing of data-intensive computational workflows
Preservation and managing their decay
• documentation of workflows
• interconnections between workflows and related
resources (e.g., datasets, publications, etc.)
• documentation of their dependencies and their outputs
• social aspects
Type-cast  not just for workflows or biology!

Howard Ratner,
Chair STM Future Labs Committee, CEO EVP Nature PublishingGroup
Director of Development for CHORUS
2012
2017

adoption stuck in inner groups
not mainstream or outsiders
a swing back to basics
reboot!

Machine-processable
Standards
Low tech
Graceful degradation
Commodity tooling
Incremental
Multi-platform
Technology Independent
Keep it
simple
E X A M P L E S
Developer
friendliness
A swing back to basics
"just enough complexity / standards”
sufficient extra benefits from what
already exists…
…without compromising the developer
entry-level experience so much that they
rather do their own thing.

Version 1
Linked Data Purity
Construct
metadata
file
Describe
metadata
content
Mismash of
ontologies
combined together
to describe
metadata file
RDF Shapes
Represent profiles
Hard to make
checklists
SHACL, ShEx
W3C Web Annotation Vocabulary
OAI Object Exchange and Reuse

Indeed.
Linked Data is not enough.
And sometimes its too much.
Research Infrastructures:
“digital technologies (hardware, software),
resources (data, services, digital libraries,
standards), comms (protocols, access rights,
networks), people and organisational
structures”

Tensions
Research Infrastructures sit in the middle
Academic
Viewpoint
Infrastructure
Viewpoint
Green field site
Theoretical purity
Use latest thing
Proof of concept
Sophistication
Narrow developer audience
Strive for super generic
The end
Exposing the tech
Pre-existing platforms
Practicality
Use things that work
Production
Simplicity & Familiarity
Wide developer audience
Several specific is ok!
The means
Hiding the tech

2018: Digital Libraries & Developer friendly reboot
Peter Sefton
Semantic Web world vs RealWorld
Open Repositories & Dig Lib Community
DataCrate
simple web
stack
RO
rich RDF
stack
+
fewer features
easier to understand, conceptually simpler
opinionated guide to current best practices
software stacks widely used on the Web

Just Enough Linked Data Just InTime
simplifications rather than generalizations
Retain benefits of Linked Data
querying, graph stores,
vocabularies, clickable URIs
customization and conventions
The stuff a developer needs
documentation, examples, libraries,
tools, community
limited flexibility frees up developers
familiarity is important
Linked Data “exotics” there for when the time is right if needed by the right people
+

Developer friendly ->Tooling, Libraries
How can I use it?
While we’re mostly focusing on the specification, some tools already exist for working with
RO-Crates:
● Describo interactive desktop application to create, update and export RO-Crates for
different profiles. (~ beta)
● CalcyteJS is a command-line tool to help create RO-Crates and HTML-readable
rendering (~ beta)
● ro-crate - JavaScript/NodeJS library for RO-Crate rendering as HTML. (~ beta)
● ro-crate-js - utility to render HTML from RO-Crate (~ alpha)
● ro-crate-ruby Ruby library to consume/produce RO-Crates (~ alpha)
● ro-crate-py Python library to consume/produce RO-Crates (~ planning)
These applications use or expose RO-Crates:
● Workflow Hub imports and exports Workflow RO-Crates
● OCFL-indexer NodeJS application that walks the Oxford Common File Layout on the
file system, validate RO-Crate Metadata Files and parse into objects registered in
Elasticsearch. (~ alpha)
● ONI indexer
● ocfl-tools
● ocfl-viewer
● Research Object Composer is a REST API for gradually building and depositing
Research Objects according to a pre-defined profile. (RO-Crate support alpha)
● … (yours?)
https://uts-eresearch.github.io/describo/

People! A Community
Sponsored
RO2018, Amsterdam
e-Science Conference
https://www.researchobject.org/ro-crate/#contribute
Diverse set of people
Variety of stakeholders
Set of collective norms
Open platform for communication
RO2018

People! Leadership, Champions, Early Adopters
we had to reboot the community twice….
Stian Soiland-Reyes
The University of Manchester, UK
Peter Sefton
University ofTechnology Sydney Infrastructure buy in lifts from project to mainstream

tendency to first focus on
providers…
…but (developer) consumers
build the ecosystem…

Consumers
Archivist and library specialists know the
importance of metadata and standards…
… and for things to work 5, 10, 20 years later.
End-users and Developers handle legacy …
…. their field, repositories, institutions , journals
etc. will always be lagging behind the curve
Researchers, to make their object FAIR …
… do not have the resources or know-how….
Specification &
standards-based
realisation
Incremental &
Extensible, Practical,
Familiar
Built in to their
platforms they use in
their day-to-day work

Problem Community Reference
Examples
Tools
Best Practice Guides
Tutorials
Embedded
Go Round and Round and Round & Maintain Momentum

Back to the Big Picture.
Opportunities
for Scholarly Communications

I concentrated on developers….what about end users?
Big Picture
User-Ware
Familiar tools in
the research
process
Infrastructure
Under-Ware
Familiar tech in
the infrastructure
Registries
Repositories
Tools
Applications
Built in
On ramps
Metadata
automation

I concentrated on developers….what about end users?
Infrastructure
RO-Crate Commons
Ecosystem
Knowledge graphs Citation and tracking
Reproducibility
Release updates
New services
Built by
developers
New user
capabilities
Living and actionable publications
whose content transcends platforms
threaded publications
FAIR Digital
Objects

Back to FAIR Digital Objects
Actionable knowledge unit
Digital butterfly – digital twins
Bags of references
courtesy Dimitris Koureas
Coordinator DiSSCo EU Research
Infrastructure
Specimen object image
courtesy of Alex Hardisty
[Hardisty et al, 2020]

European Open Science Cloud
EOSC Interoperability Framework
Specimen Data Refinery
Workflows to Digitise
Natural History Specimens
FAIR Digital Object
Framework
Open Digital Specimen
Workflow Infrastructure
RO-Crate
+
https://op.europa.eu/en/publication-detail/-/publication/d787ea54-6a87-11eb-aeb5-
01aa75ed71a1/language-en/format-PDF/source-190308283

FAIR Principles for Digital Research Objects
FAIR all the way down
Unbounded FAIR
Distributed FAIR
Living FAIR
Analogous to FAIR Software
FAIR RO-Crate is a practical start
Fig from EOSC Interoperability Framework

Preparing for Digital Objects…..
Digital library community allies!
Make Research Object’s normative
– Promote to researchers but …
– ... target Research Infrastructures to deliver
– Move RO(-Crate)s across the scholarly ecosystem
Developer friendliness matters
– Persuasive design
FAIR principles for Research Objects….
– Unifying the vision with the practical
Release
paradigm
instead of a
Publish one

Barend Mons
Sean Bechhofer
Matthew Gamble
Raul Palma
Jun Zhao
Mark Robinson
AlanWilliams
Norman Morrison
Stian Soiland-Reyes
Tim Clark
Alejandra Gonzalez-Beltran
Philippe Rocca-Serra
Ian Cottam
Susanna Sansone
Kristian Garza
DanielGarijo
Catarina Martins
Iain Buchan
Michael Crusoe
Rob Finn
StuartOwen
Finn Bacall
Bert Droebeke
Laura Rodríguez Navas
Ignacio Eguinoa
Carl Kesselman
Ian Foster
Kyle Chard
Vahan Simonyan
Ravi Madduri
Raja Mazumder
GilAlterovitz
Denis Dean II
DurgaAddepalli
Wouter Haak
Anita DeWaard
Paul Groth
Oscar Corcho
Peter Sefton
Eoghan Ó Carragáin
FrederikCoppens
Jasper Koehorst
Simone Leo
Nick Juty
LJ Garcia Castro
Karl Sebby
Alexander Kanitz
Ana Trisovic
http://researchobject.org
Gavin Kennedy
Mark Graves
José María Fernández Jose
Manuel Gomez-Perez
Jason A. Clark
Salvador Capella-Gutierrez
Alasdair J. G. Gray
Kristi Holmes
Giacomo Tartari
Hervé Ménager
Paul Walk
Brandon Whitehead
Erich Bremer
Mark Wilkinson
Jen Harrow
Marco La Rosa
And many more....

The swings and roundabouts of a decade of fun and games with Research Objects

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to The swings and roundabouts of a decade of fun and games with Research Objects

Similar to The swings and roundabouts of a decade of fun and games with Research Objects (20)

More from Carole Goble

More from Carole Goble (12)

Recently uploaded

Recently uploaded (20)

The swings and roundabouts of a decade of fun and games with Research Objects

Editor's Notes