RO-Crate: A framework for packaging research products into FAIR Research Objects

  1. RO-Crate: A framework for packaging research products into FAIR Research Objects. Carole Goble (The University of Manchester; ELIXIR-UK; https://orcid.org/0000-0003-1219-2137; @caroleannegoble). Stian Soiland-Reyes (The University of Manchester; BioExcel Centre of Excellence; The University of Amsterdam; https://orcid.org/0000-0001-9842-9718; @soilandreyes). RDA IG Data Fabric, FAIR Digital Object, 2021-02-25. This work is licensed under a Creative Commons Attribution 4.0 International License.
  2. tl;dr Web standards-based metadata framework for bundling resources with their context into citable reproducible packages. Machine actionable. Metadata + Identifiers + Web protocols => FAIR • What and Why • Examples • How and Tools • Alignment with FDOF. Bechhofer et al (2013) Why linked data is not enough for scientists, https://doi.org/10.1016/j.future.2011.08.004 Bechhofer et al (2010) Research Objects: Towards Exchange and Reuse of Digital Knowledge, https://eprints.soton.ac.uk/268555/
  3. https://www.researchobject.org/
  4. Many Objects are the Outcomes of Research Each object has its own metadata and repositories All are first class citizens and are required to make research FAIR+R
  5. Reporting and Reading. Scattered, buried in a PDF figure: De-contextualised; Static, Fragmented; Lost semantic linking; Rebuild to be reproducible. Versus: Contextualised; Active, Unified; Semantic linking.
  6. integrated view over fragmented resources using PIDs and metadata Encapsulated content and references to external resources The RO package has its own metadata, can be registered and deposited in its own right, unpackaged and accessed, activated and reproduced if appropriate
  7. Self-describing, chiefly metadata, objects. RO Metadata file: structured metadata about the RO and content (type, id, description, datePublished, …, license, author, organisation). RO Content: files, directories, links to web resources. Archive file format / packaging system: BagIt, zip, OCFL, Git.
  8. Self-describing, chiefly metadata, objects. Linked Data approach. RO Metadata file: structured metadata about the RO and content (type, id, description, datePublished, …, license, author, organisation). RO Content: image file, directory of data, links to web resources (https://zenodo.org/record/3541888, https://github.com/o/script), each described by type, id, description, datePublished, creator, size, format, … Archive file format / packaging system.
  9. Self-describing, chiefly metadata, objects Strict Structure, Open ended content How do we describe the metadata? • PIDs + JSON-LD + Schema.org descriptors • Opinionated profile of schema.org • Linked Data by Stealth: JSON with gradual path to extensibility with LD – e.g. ad-hoc terms • Example-driven documentation How can I add additional metadata? • Schema.org, domain ontologies How do I define a checklist of what is expected to be in a type of RO? • RO-Crate Profiles http://www.researchobject.org/ro-crate/ Standard Web Mark-up
  10. https://w3id.org/ro/crate/1.1 JSON with a flat list of: - Data Entities (e.g. files, dirs, DBs) - Contextual Entities (e.g. people) Objects reference each other by @id
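A minimal sketch of how a consumer might read this flat structure, assuming the metadata file has already been loaded as a Python dict. The helper names `index_by_id` and `resolve` are hypothetical, and the sample graph is illustrative, reusing identifiers that appear elsewhere in the slides.

```python
import json

# Illustrative metadata document; values borrowed from the slide examples.
METADATA = json.loads("""
{
  "@context": "https://w3id.org/ro/crate/1.1/context",
  "@graph": [
    {"@id": "ro-crate-metadata.json", "@type": "CreativeWork",
     "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
     "about": {"@id": "./"}},
    {"@id": "./", "@type": "Dataset",
     "name": "Presentation of user survey 2018",
     "author": {"@id": "https://orcid.org/0000-0002-8367-6908"},
     "hasPart": [{"@id": "communities-2018.csv"}]},
    {"@id": "communities-2018.csv", "@type": "File"},
    {"@id": "https://orcid.org/0000-0002-8367-6908", "@type": "Person",
     "name": "J. Xuan"}
  ]
}
""")


def index_by_id(metadata):
    """Map each entity's @id to the entity, since @graph is a flat list."""
    return {entity["@id"]: entity for entity in metadata["@graph"]}


def resolve(entities, reference):
    """Follow an {"@id": ...} reference; return None if it is not described."""
    return entities.get(reference["@id"])


entities = index_by_id(METADATA)
root = entities["./"]                        # the Root Data Entity
author = resolve(entities, root["author"])   # a Contextual Entity
print(author["name"])  # prints: J. Xuan
```

Because no entity is nested inside another, a single dictionary lookup is enough to follow any cross-reference, which is part of what keeps the format approachable for plain JSON tooling.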
  11. Summary: RO-Crate in a nutshell Practical lightweight approach to packaging research data entities (any object) with metadata Aggregate files and/or any URI-addressable content, with contextual information to aid decisions about re-use: Who What When Where Why How. Web Native Machine readable. Human readable. Search engine friendly. Familiar. Extensible and Incremental: add additional metadata; nested and typed by their profile. Open Community effort http://www.researchobject.org/ro-crate/
  12. Tooling https://www.researchobject.org/ro-crate/implementations.html User facing Infrastructure facing Software libraries https://www.npmjs.com/package/ro-crate https://github.com/ResearchObject/ro-crate-ruby https://pypi.org/project/rocrate/ https://uts-eresearch.github.io/describo/
  13. Cultural Heritage: A data curation service for endangered languages: 500,000 files in 28,624 items and 574 collections long term preservation and accessibility of research data objects https://arkisto-platform.github.io/ [Marco La Rosa, Peter Sefton]
  14. Scalable verified collections of references. Processing big genomic & clinical data distributed over multiple locations. NIH Data Commons. minids [Chard, et al 2016] https://doi.org/10.1109/BigData.2016.7840618 Retain and archive processed datasets. Reference and transfer large data on demand. Controlled access to sensitive data. [Kesselman, Foster]
  15. 13 EU Life Science Research Infrastructures Data and Method Thematic Commons Sharing data, tools and workflows in the cloud Workflow + data + provenance interchange, stewardship, recording dependencies -> portability & reuse/reproducibility https://www.eosc-life.eu/
  16. https://workflowhub.eu/workflows/98 RO-Crate as interchange format
  17. https://workflowhub.eu/ Profile Same conceptual workflow; multiple executable flavours for different workflow engines and specific use-cases
  18. Embedding into Infrastructures & Standards EGI-ACE data spaces for Earth Science researchers FAIR and GDPR compliant data storage and sharing fabric Science Mesh using Cloud Services for Synchronization and Sharing exchange between genomics platform and repository. standardise and share analyses generated from genome sequencing. IEEE 2791-2020
  19. Supporting the Research Life Cycle Exchange & Import/Export Report & Archive Share & Access Reuse & Reproduce Living Objects Describe the objects as they are being created using open standards and tools Release the objects as they are created, updated and used Figure: RDMKit, https://rdmkit.elixir-europe.org/ Science 2.0 Repositories: Time for a Change in Scholarly Communication Assante, Candela, Castelli, Manghi, Pagano, D-Lib 2015
  20. Machine-processable Standards Low tech Graceful degradation Commodity tooling Incremental Multi-platform Technology Independent Keep it simple EXAMPLES Developer friendliness Just enough complexity / standards • sufficient extra benefits from what already exists… without compromising developer entry-level experience so they do their own thing Just Enough Linked Data Just In Time • simplifications instead of generalizations Retain Linked Data benefits • querying, graph stores, vocabularies, clickable URIs Plus the developer needs • documentation, examples, libraries, tools, community … Limited flexibility frees up developers Familiarity is important for uptake
  21. From FAIR data to FAIR Digital Objects FAIR Digital Objects for Science: From Data Pieces to Actionable Knowledge Units: https://doi.org/10.3390/publications8020021 https://op.europa.eu/en/publication-detail/-/publication/d787ea54-6a87-11eb-aeb5-01aa75ed71a1/language-en/format-PDF/source-190308283
  22. FAIR Digital Objects Actionable knowledge unit Digital butterfly – digital twins Bags of references courtesy Dimitris Koureas Coordinator DiSSCo EU Research Infrastructure Specimen object image courtesy of Alex Hardisty [Hardisty et al, 2020]
  23. Specimen Data Refinery Workflows to Digitise Natural History Specimens FAIR Digital Objects -> Packaged + Actionable FAIR Digital Object Framework Open Digital Specimen Workflow Infrastructure RO-Crate + Same conceptual workflow; multiple executable flavours +
  24. Adapted from https://doi.org/10.3390/publications8020021 FAIR Digital Object
  25. Adapted from https://doi.org/10.3390/publications8020021 RO-Crate as FAIR Digital Object https://w3id.org/ Data Entity Contextual Entity RO-Crate metadata file RO-Crate http://schema.org/Dataset
  26. RO-Crate model as UML (https://www.researchobject.org/ro-crate/1.1/structure.html). A valid RO-Crate JSON-LD graph MUST describe: 1. The RO-Crate Metadata File Descriptor 2. The Root Data Entity 3. Zero or more Data Entities 4. Zero or more Contextual Entities. The RO-Crate Metadata File «schema.org/CreativeWork» describes (schema.org/about) the RO-Crate Root Data Entity «schema.org/Dataset» and declares "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"}. Relations shown: schema.org/hasPart, schema.org/about (describes), schema.org/mentions, subClassOf (is-a). Data Entities: «schema.org/CreativeWork», «schema.org/MediaObject», «schema.org/Dataset», «schema.org/ImageObject», «bioschemas.org/ComputationalWorkflow». Contextual Entities: «schema.org/Thing», «schema.org/Organization», «schema.org/Place», «schema.org/IndividualProduct», «schema.org/Person».
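The four-part structure listed on this slide can be sketched as a small check. This is only an illustration under stated assumptions: the function name `split_entities` is hypothetical, the descriptor is looked up under the 1.1 file name `ro-crate-metadata.json` only, and "data entity" is approximated as anything reachable from the root through `hasPart`.

```python
RO_CRATE_1_1 = "https://w3id.org/ro/crate/1.1"


def as_list(value):
    """JSON-LD values may be a single item or a list; normalise to a list."""
    return value if isinstance(value, list) else [value]


def split_entities(graph):
    """Return (root_id, data_ids, contextual_ids) or raise ValueError."""
    by_id = {entity["@id"]: entity for entity in graph}

    # 1. The RO-Crate Metadata File Descriptor
    descriptor = by_id.get("ro-crate-metadata.json")
    if descriptor is None:
        raise ValueError("missing RO-Crate Metadata File Descriptor")
    conforms = [c.get("@id") for c in as_list(descriptor.get("conformsTo", []))]
    if RO_CRATE_1_1 not in conforms:
        raise ValueError("descriptor does not declare conformsTo RO-Crate 1.1")

    # 2. The Root Data Entity, found by following the descriptor's "about"
    root_id = descriptor.get("about", {}).get("@id")
    root = by_id.get(root_id)
    if root is None or "Dataset" not in as_list(root.get("@type", [])):
        raise ValueError("'about' does not point at a described Dataset")

    # 3. Data Entities: reachable from the root via hasPart (assumption)
    data_ids, stack = set(), [root]
    while stack:
        for part in as_list(stack.pop().get("hasPart", [])):
            part_id = part["@id"]
            if part_id in by_id and part_id != root_id and part_id not in data_ids:
                data_ids.add(part_id)
                stack.append(by_id[part_id])

    # 4. Contextual Entities: everything else that is described
    contextual_ids = set(by_id) - data_ids - {root_id, descriptor["@id"]}
    return root_id, sorted(data_ids), sorted(contextual_ids)


GRAPH = [
    {"@id": "ro-crate-metadata.json", "@type": "CreativeWork",
     "conformsTo": {"@id": RO_CRATE_1_1}, "about": {"@id": "./"}},
    {"@id": "./", "@type": "Dataset", "hasPart": [{"@id": "figure.png"}],
     "author": {"@id": "https://orcid.org/0000-0002-8367-6908"}},
    {"@id": "figure.png", "@type": ["File", "ImageObject"]},
    {"@id": "https://orcid.org/0000-0002-8367-6908", "@type": "Person"},
]

print(split_entities(GRAPH))
```

Items 3 and 4 are "zero or more", so a graph holding only the descriptor and the root would pass this check with two empty lists.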
  27. Collection. JSON-LD preamble:
      { "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
      RO-Crate metadata file descriptor:
      { "@type": "CreativeWork",
        "@id": "ro-crate-metadata.json",
        "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
        "about": { "@id": "./" } }
      RO-Crate root dataset, which aggregates data entities described with contextual entities:
      { "@id": "./",
        "identifier": "https://doi.org/10.5281/zenodo.1009240",
        "@type": "Dataset",
        "hasPart": [
          { "@id": "cp7glop.ai" },
          { "@id": "lots_of_little_files/" },
          { "@id": "communities-2018.csv" },
          { "@id": "https://doi.org/10.4225/59/59672c09f4a4b" },
          { "@id": "SciDataCon Presentations/AAA_Pilot_Project_Abstract.html" } ],
        "author": { "@id": "https://orcid.org/0000-0002-8367-6908" },
        "publisher": { "@id": "https://ror.org/03f0f6041" },
        "citation": { "@id": "https://doi.org/10.1109/TCYB.2014.2386282"},
        "name": "Presentation of user survey 2018" },
      Flat list of metadata per entity:
      { "@id": "cp7glop.ai",
        "@type": "File",
        "name": "Diagram showing trend to increase", … }, …
  28. Data and Contextual entities described within RO-Crate Metadata File. Base vocabulary & types: schema.org. Cross-references to further contextual entities. RO-Crate principle: Reuse existing PIDs and URLs, but always describe entities which lack a human-readable resolution. Metadata:
      { "@id": "https://orcid.org/0000-0002-8367-6908",
        "@type": "Person",
        "affiliation": { "@id": "https://ror.org/03f0f6041" },
        "name": "J. Xuan" }
      { "@id": "https://ror.org/03f0f6041",
        "@type": "Organization",
        "name": "University of Technology Sydney",
        "url": "https://www.uts.edu.au/" }
      { "@id": "figure.png",
        "@type": ["File", "ImageObject"],
        "name": "XXL-CT-scan of an XXL Tyrannosaurus rex skull",
        "identifier": "https://doi.org/10.5281/zenodo.3479743",
        "author": {"@id": "https://orcid.org/0000-0002-8367-6908"},
        "encodingFormat": "image/png" }
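One way a tool might support the "always describe entities" principle is to list references that point at nothing described in the crate. The sketch below is an assumption about how such a helper could look (`iter_references` and `undescribed_references` are hypothetical names). A reported identifier is not necessarily an error, since an absolute PID may resolve on the web; it is only a hint that a description could be added.

```python
def iter_references(value):
    """Yield every @id referenced from a property value (dict, list or scalar)."""
    if isinstance(value, dict):
        if "@id" in value:
            yield value["@id"]
    elif isinstance(value, list):
        for item in value:
            yield from iter_references(item)


def undescribed_references(graph):
    """Return sorted @ids that are referenced but have no entity in the graph."""
    described = {entity["@id"] for entity in graph}
    missing = set()
    for entity in graph:
        for key, value in entity.items():
            if key.startswith("@"):
                continue  # skip JSON-LD keywords such as @id and @type
            for ref in iter_references(value):
                if ref not in described:
                    missing.add(ref)
    return sorted(missing)


# The Person and File from the slide, with the Organization left out on purpose.
GRAPH = [
    {"@id": "https://orcid.org/0000-0002-8367-6908", "@type": "Person",
     "affiliation": {"@id": "https://ror.org/03f0f6041"}, "name": "J. Xuan"},
    {"@id": "figure.png", "@type": ["File", "ImageObject"],
     "author": {"@id": "https://orcid.org/0000-0002-8367-6908"},
     "encodingFormat": "image/png"},
]

print(undescribed_references(GRAPH))  # prints: ['https://ror.org/03f0f6041']
```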
  29. No need to put all the metadata in RO-Crate! https://biocompute-objects.github.io/bco-ro-crate/ • The specification provides details on how to extend RO-Crate formally or informally • However, in many cases a domain-specific format for metadata may already exist • It can co-exist as a side-car metadata file within the Research Object • Example: BioCompute Object has its own JSON format, typed & indicated from RO-Crate
  30. Persistent IDs: Gradual ascent towards FAIR • Base line: Relative paths from RO-Crate Metadata File. Use cases: Describing dataset on desktop; Ad-hoc web-hosting (e.g. GitHub pages); Institutional archives (e.g. Oxford Common File Layout) • Reuse existing PIDs and URLs. Use cases: Large data, not a file (e.g. database), reference datasets; Cite/reference existing resources (e.g. via identifiers.org); Distinguish and crosslink contextual entities • Make paths absolute, using location-independent PIDs (UUIDs, Naming Things with Hashes, ARCP). Use cases: Found RO-Crate “in the wild”, ZIP archives, workflow outputs • Assign PIDs to RO-Crate (and its entities). Use cases: Long-term availability, citations, permalinks
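The "make paths absolute" step can be illustrated with a UUID-based ARCP base URI. This is a sketch under assumptions: `arcp_base` and `absolutise` are hypothetical helpers, the base is built by plain string formatting in the `arcp://uuid,<uuid>/` form, and paths are joined by concatenation because Python's `urljoin` does not apply relative resolution to schemes it does not know.

```python
import uuid
from urllib.parse import quote, urlsplit


def arcp_base(crate_uuid=None):
    """Build a location-independent base URI; a random UUID is used if none is given."""
    if crate_uuid is None:
        crate_uuid = uuid.uuid4()
    return f"arcp://uuid,{crate_uuid}/"


def absolutise(entity_id, base):
    """Leave absolute URIs untouched; anchor relative paths on the base URI."""
    if urlsplit(entity_id).scheme:
        return entity_id  # already a URL or PID, e.g. a DOI link
    path = entity_id[2:] if entity_id.startswith("./") else entity_id
    return base + quote(path)  # percent-encode spaces and similar characters


base = arcp_base(uuid.UUID("32a423d6-52ab-47e3-a9cd-54f418a48571"))
for entity_id in ["./", "communities-2018.csv", "lots_of_little_files/",
                  "https://doi.org/10.4225/59/59672c09f4a4b"]:
    print(entity_id, "->", absolutise(entity_id, base))
```

The relative identifiers stay meaningful inside the crate, while the absolute form gives each entity a name that does not depend on where a ZIP archive happened to be unpacked.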
  31. Bit Sequences • Base line: Files on disk within RO-Crate Root folder + URLs. Downloadable Web resources listed w/ content size and access time. Packaging/archiving RO-Crate Root folder: • BagIt (RFC8493): Manifest of all (local) files and their checksums. Use case: Ensure all files are transferred/archived • BDBag: Include external files by ARK/MinID PIDs. Use case: “thin” RO-Crate, Big Data, shared immutable files • OCFL: Archival storage with revision tracking, infrequent changes • Git LFS: Tracked frequent changes, collaborative editing
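To make the BagIt use case concrete, the sketch below computes the kind of checksum manifest that lets a receiver confirm all files arrived. It is a simplified illustration, not a complete BagIt implementation: the helper names are hypothetical, only a SHA-256 payload manifest is produced, and the bag declaration and tag files that RFC 8493 also requires are left out. It assumes the RO-Crate root sits in the bag's `data/` payload directory.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large data files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sha256_manifest(bag_dir):
    """Return manifest lines of the form '<checksum>  <path relative to the bag>'."""
    bag_dir = Path(bag_dir)
    lines = []
    for path in sorted((bag_dir / "data").rglob("*")):
        if path.is_file():
            relative = path.relative_to(bag_dir).as_posix()
            lines.append(f"{sha256_of(path)}  {relative}")
    return lines


# Demonstration on a throwaway bag holding an empty metadata file.
with tempfile.TemporaryDirectory() as tmp:
    payload = Path(tmp) / "data"
    payload.mkdir()
    (payload / "ro-crate-metadata.json").write_text("{}")
    manifest = sha256_manifest(tmp)

print("\n".join(manifest))
```

Written to a file such as `manifest-sha256.txt`, these lines complement the RO-Crate Metadata File, which by design is not an exhaustive inventory of the package.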
  32. Repository Base line: zip/tar download  e.g. “supplementary data”, GitHub release Export from web platforms  e.g. workflowhub.eu, Galaxy workflow system Deposit in general/institutional repositories  e.g. Zenodo, Mendeley Data Deposit in domain-specific repositories  e.g. GBIF Deposit individual files  e.g. ARK, S3, B2SHARE Deposit metadata  e.g. WikiData, nanopubs
  33. What can FDO learn from RO-Crate? Not everything is known before hand • Existing PID metadata not always relevant • Allow contextualized metadata • Types not always known in advance • Allow “casting” or reinterpretation • Operations not always known in advance • Allow open-ended generation and use Use wheels already invented • Fit into researchers’ existing working practices and familiar technologies. • Aim for gradual improvements. • Reuse existing technology • .. But only when not too complex • Reuse existing PID infrastructure, including URLs • Human consumers recognize and click hyperlinks • Build on existing metadata standards • Which is both simple and extendable Data and Contextual entities are equally important Remember people … especially Developers • Keep human consumption in focus • Ensure metadata is easily rendered and edited • Provide Best Practice guidance • Firm, but not too restrictive • Developer producer and consumer friendly • Rather than academically elegant
  34. What can RO-Crate learn from FDO? Provide stronger guidance on PID and availability  Recommend deposit infrastructure for end-users and framework developers  Tools to assist – e.g. generate Zenodo Datacite metadata from RO-Crate Provide stronger typing of RO-Crates and its content  Profiles as first approach Expose potential operations on an RO-Crate  Build general RO-Crate services, e.g. index Turtles all the way down!  Document better how RO-Crates can/should be nested  How to choose granularity of RO-Crates?  Tools to liberate/reuse an entity from a single RO-Crate
  35. RO-Crate as part of the FDO ecosystem A bag of references + metadata Framework for actionable objects RO-Crates enable active operations for FDOs FDO offers additional infrastructure and practice A bag of references + metadata A metadata framework for FDO Developer friendly practical and web-native implementations Community Infrastructure applications The Synthesys+ Project gives a concrete use case for practically joining the two up.
  36. Join us! https://about.workflowhub.eu/ Bi-weekly calls & Slack. https://www.researchobject.org/ro-crate/ Bi-weekly calls & GitHub.
  37. Acknowledgements • RO-Crate Community https://www.researchobject.org/ro-crate/community • Workflow Hub Club https://about.workflowhub.eu/acknowledgements/ • Funding: H2020-INFRAEOSC-2018-2 824087; H2020-INFRAEDI-2018-1 823830; H2020-INFRAIA-2017-1 730976; H2020-INFRADEV-2019-2 871118; H2020-INFRAIA-2018-1 823827
  38. FAIR Principles for Digital Research Objects FAIR all the way down Unbounded FAIR Distributed FAIR Living FAIR Analogous to FAIR Software FAIR RO-Crate is a practical start Fig from EOSC Interoperability Framework

Editor's Notes

  1. Overcome fragmentation
  2. Identifiers to locate things. Organisation to structure & link things. Annotations about the things.
     • @type: MUST be Dataset
     • @id: MUST end with / and SHOULD be the string ./
     • name: SHOULD identify the dataset to humans well enough to disambiguate it from other RO-Crates
     • description: SHOULD further elaborate on the name to provide a summary of the context in which the dataset is important.
     • datePublished: MUST be a string in ISO 8601 date format and SHOULD be specified to at least the precision of a day, MAY be a timestamp down to the millisecond.
     • license: SHOULD link to a Contextual Entity in the RO-Crate Metadata File with a name and description. MAY have a URI (e.g. for Creative Commons or Open Source licenses). MAY, if necessary, be a textual description of how the RO-Crate may be used.
     This document specifies a method, known as RO-Crate (Research Object Crate), of organizing file-based data with associated metadata, using linked data principles, in both human and machine readable formats, with the ability to include additional domain-specific metadata.
     The core of RO-Crate is a JSON-LD file, the RO-Crate Metadata File, named ro-crate-metadata.json. This file contains structured metadata about the dataset as a whole (the Root Data Entity) and, optionally, about some or all of its files. This provides a simple way to, for example, assert the authors (e.g. people, organizations) of the RO-Crate or one of its files, or to capture more complex provenance for files, such as how they were created using software and equipment.
     While providing the formal specification for RO-Crate, this document also aims to be a practical guide for software authors to create tools for generating and consuming research data packages, with explanation by examples.
     At the basic level, an RO-Crate is a collection of files and resources represented as a Schema.org Dataset, that together form a meaningful unit for the purposes of communication, citation, distribution, preservation, etc.
The RO-Crate Metadata File describes the RO-Crate, and MUST be stored in the RO-Crate Root. While RO-Crate is well catered for describing a Dataset as files and relevant metadata that are contained by the RO-Crate in the sense of living within the same root directory, RO-Crates can also reference external resources which are stored or accessed separately, via absolute URIs. This is particularly recommended where some resources cannot be co-hosted for practical or legal reasons, or if the RO-Crate itself is primarily web-based. It is important to note that the RO-Crate Metadata File is not an exhaustive manifest or inventory, that is, it does not necessarily list or describe all files in the package. Rather it is focused on providing sufficient amount of metadata to understand and use the content, and is designed to be compatible with existing and future approaches that do have full inventories / manifest and integrity checks, e.g. by using checksums, such as BagIt and Oxford Common File Layout OCFL Objects. The intention is that RO-Crates can work well with a variety of archive file formats, e.g. tar, zip, etc., and approaches to capturing file manifests and file fixity, such as BagIt, OCFL and git (see also appendix Combining with other packaging schemes). An RO-Crate can also be hosted on the web or mainly refer to web resources, although extra care to ensure persistence and consistency should be taken for archiving such RO-Crates.
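The property rules quoted in note 2 (@type, @id, name, description, datePublished, license) lend themselves to a small check that separates MUST violations from SHOULD advice. The sketch below is one possible reading, with assumptions stated: `check_root` is a hypothetical helper, a missing `datePublished` is treated as an error, the SHOULD properties are only tested for presence, and the date test accepts only the subset of ISO 8601 that Python's `datetime.fromisoformat` parses on the running version.

```python
from datetime import datetime


def check_root(root):
    """Return (errors, warnings) for a Root Data Entity dict."""
    errors, warnings = [], []

    types = root.get("@type", [])
    if isinstance(types, str):
        types = [types]
    if "Dataset" not in types:
        errors.append("@type MUST be Dataset")

    root_id = root.get("@id", "")
    if not root_id.endswith("/"):
        errors.append("@id MUST end with /")
    elif root_id != "./":
        warnings.append("@id SHOULD be ./")

    try:
        # Accepts e.g. "2021-02-25" or "2021-02-25T10:30:00"
        datetime.fromisoformat(root.get("datePublished"))
    except (TypeError, ValueError):
        errors.append("datePublished MUST be an ISO 8601 string")

    # Simplification: only presence is checked, not what the value links to.
    for key in ("name", "description", "license"):
        if key not in root:
            warnings.append(f"{key} SHOULD be present")

    return errors, warnings


good = {
    "@id": "./", "@type": "Dataset",
    "name": "Presentation of user survey 2018",
    "description": "Illustrative description for this sketch",
    "datePublished": "2021-02-25",
    "license": {"@id": "https://creativecommons.org/licenses/by/4.0/"},
}
bad = {"@id": "crate", "@type": "File", "datePublished": "last week"}

print(check_root(good))
print(check_root(bad))
```

Keeping errors and warnings apart mirrors the graceful degradation theme of the talk: a crate with thin metadata is still usable, and tooling can nudge authors towards richer descriptions over time.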
  3. Data might
  4. More flexible than XML Schemas. More straightforward than OWL. Linked Data “exotics” there for when the time is right, if needed by the right people. Portland Common Data Model for digital library & repository content. Metadata is a flat JSON file: descriptions of the Objects; descriptions of the RO-Crate; descriptions relating the Objects. Human and machine readable formats, additional domain-specific metadata.
  5. PARADISEC and The University of Melbourne; University of Technology Sydney. PARADISEC (the Pacific And Regional Archive for Digital Sources in Endangered Cultures) has operated a data curation service for endangered languages for 18 years. It currently consists of 500,000 files in 28,624 items and 574 collections, with over 90TB of data under management. The repository, whilst still functional and fit for purpose, is showing signs of its age. Using the Arkisto Platform, the Modern PARADISEC project is updating the catalog whilst also enabling a path for smaller regional repositories to exist and to easily move their content across collections and institutions.
  6. NIH Data Commons: transferring and archiving very large HTS datasets in a location-independent way; keeping the context of data content together when it's scattered. Scalability.
  7. ReproZip can automatically pack your research along with all necessary data files, libraries, environment variables and options into a self-contained bundle. https://www.reprozip.org/ One pressure is towards bundling objects within objects. There’s nothing in DRS for a structured manifest within a bundle. RO-Crate as interchange format and integration with Zenodo, using Describo. Data Cubes for efficient and scalable structured data access and discovery. Research Objects as the mechanism to manage scientific research activities and connect associated resources. RO snapshot/releases published to EOSC repositories and scholarly communication platforms (e.g., B2Share, Zenodo).
  8. We started in theory, but we work in practice ROs sit in the middle
  9. To be FAIR each digital object type has its own metadata requirements, and may have its own repositories and registries https://op.europa.eu/en/publication-detail/-/publication/d787ea54-6a87-11eb-aeb5-01aa75ed71a1/language-en/format-PDF/source-190308283 https://www.eoscsecretariat.eu/sites/default/files/eosc-interoperability-framework-v1.0.pdf Schwardmann (2020), Digital Objects – FAIR Digital Objects: Which Services Are Required? Data Science Journal EOSC Interoperability Framework Draft (2020) Hardisty A, et al (2020) Conceptual design blueprint for the DiSSCo digitization infrastructure RIO 6: e54280. DONA Digital Object Architecture Digital Object Interface Protocol (2018) https://fairdigitalobjectframework.org/
  10. [Hardisty A, et al (2020) Conceptual design blueprint for the DiSSCo digitization infrastructure RIO 6: e54280.] How to make this practical!
  11. So let’s do a quick reminder about what are the key components of FAIR Digital Objects: Multiple levels of Digital objects: - A Persistent Identifier to locate the DO - A set of Operations to use the DO - Metadata about the DO – itself a DO - All of these represented as bit sequences that can be transferred or stored in repository - Finally a Collection is a type of DO that aggregates other DOs and of course has its own metadata
  12. For RO-Crate, how do we approach FDO? We have implementations for each of these concepts: The RO-Crate dataset itself is of course what we can call a Collection. It aggregates a set of data entities, but also contextual entities. I’ll come back to those. We describe the RO-Crate and the entities in a metadata file. We can assign persistent identifiers to the RO-Crate and parts of it. We can store the bit sequences, ideally with something like BagIt, to ensure completeness. And deposit all of this in regular repositories like Zenodo or GitHub.
  13. RO-Crate recommends how to structure metadata.  FDO RO-Crate profile What are the operations on RO-Crate metadata? Re-visit Wf4Ever Research Object APIs? (basically CRUD) RO-Crate as a Linked Data Platform (LDP) Container? Registration of RO-Crate types/profiles How can RO-Crate users use FDO to resolve PID/storage challenges?