Successfully reported this slideshow.

The swings and roundabouts of a decade of fun and games with Research Objects

0

Share

1 of 51
1 of 51

The swings and roundabouts of a decade of fun and games with Research Objects

0

Share

Download to read offline

https://zbmed.github.io/damalos DaMaLOS 2020 workshop at ISWC 2020
The swings and roundabouts of a decade of fun and games with Research Objects
Over the past decade we have been working on the concept of Research Objects http://researchobject.org). We worked at the 20,000 feet level as a new form of FAIR+R scholarly communication and at the 5 feet level as a packaging approach for assembling all the elements of a research investigation, like the workflows, data and tools associated with a publication. Even in our earliest papers [1] we recognised that there was more to success than just using Linked Data. We swing from fancy RDF models and SHACL that specialists could use to simple RO-Crates and JSON-LD that mortal programmers could use. Success depends on a use case need, a community and tools which we found particularly in the world of computational workflows right from the outset. All along there have been politics and that has recently got even hotter. All fun and games!
[1] Sean Bechhofer, et al (2013) Why Linked Data is Not Enough for Scientists FGCS 29(2)599-611 https://doi.org/10.1016/j.future.2011.08.004 v

https://zbmed.github.io/damalos DaMaLOS 2020 workshop at ISWC 2020
The swings and roundabouts of a decade of fun and games with Research Objects
Over the past decade we have been working on the concept of Research Objects http://researchobject.org). We worked at the 20,000 feet level as a new form of FAIR+R scholarly communication and at the 5 feet level as a packaging approach for assembling all the elements of a research investigation, like the workflows, data and tools associated with a publication. Even in our earliest papers [1] we recognised that there was more to success than just using Linked Data. We swing from fancy RDF models and SHACL that specialists could use to simple RO-Crates and JSON-LD that mortal programmers could use. Success depends on a use case need, a community and tools which we found particularly in the world of computational workflows right from the outset. All along there have been politics and that has recently got even hotter. All fun and games!
[1] Sean Bechhofer, et al (2013) Why Linked Data is Not Enough for Scientists FGCS 29(2)599-611 https://doi.org/10.1016/j.future.2011.08.004 v

More Related Content

Related Books

Free with a 14 day trial from Scribd

See all

The swings and roundabouts of a decade of fun and games with Research Objects

  1. 1. The swings and roundabouts of a decade of fun and games with Research Objects Carole Goble The University of Manchester Researchobject.org carole.goble@manchester.ac.uk DaMaLOS workshop 02 Nov 2020 at ISWC 2020
  2. 2. Special Acknowledgement Stian Soiland-Reyes The University of Manchester, UK
  3. 3. Our RO start up – what, why and how… ROs In the Large –TheVision A new form of Scholarly Communication. RDM support throughout the research cycle. ROs in the Small –The Implementation Packaging digital components. Referencing physical components.
  4. 4. OurWorld of FAIRThematic Research Infrastructures (aka Cyberinfrastructure) – Biology, Biodiversity “facilities that provide resources and services for research communities to conduct research and foster innovation….they may be single-sited, distributed, or virtual. • major scientific equipment or sets of instruments • collections, archives or scientific data • computing systems and communication networks • any other research and innovation infrastructure of a unique nature which is open to external users” European Commission Research Software Engineers https://ec.europa.eu/info/research-and-innovation/strategy/european-research-infrastructures_en
  5. 5. FAIR Data Commons Diverse Research Objects – models, data, pipelines, lab protocols and SOPs, provenance... citable, exchangeable, publishable, preserved, executable objects and collections of objects. Assemble and share large scale, multi- element datasets. Secure referencing and moving of sensitive data. Zoo of catalogues & resources. Across 13 Research Infrastructures. Reproduce, port, share, and execute analytics & pipelines NIH Data Commons
  6. 6. FAIR Digital Objects PIDs + Metadata Sounds like Linked Data! The FAIR Guiding Principles for Data Stewardship and Management Scientific Data 3, 160018 (2016) doi:10.1038/sdata.2016.18
  7. 7. Motivation in 2007. Still is. The different objects that are the research …..documentation Structured, interrelated objects in context ….documentation The narrative paper The narrative paper
  8. 8. From Manuscripts to Research Objects “An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship.The actual scholarship is the complete software development environment, [the complete data] and the complete set of instructions which generated the figures.” David Donoho, “Wavelab and Reproducible Research,” 1995 research outcomes more than just publications data software, models, workflows, SOPs, lab protocols are first class citizens of scholarship added information required to make research FAIR and Reproducible (FAIR+R) …
  9. 9. Do Research Project Research Infrastructure Services Assemble Methods, Materials Analyse Results Quality Assessment Track and Credit Disseminate Deposit & Licence Public Marketplace Services Publish Share Results Any research product Selected products Manage Results ROs all the way along, not just the end Science 2.0 Repositories: Time for a Change in Scholarly Communication Assante, Candela, Castelli, Manghi, Pagano, D-Lib 2015 Ainsworth and Buchan: e-Labs and Work Objects: Towards Digital Health Economies, EuropeComm 2009, pp 205-216 Experiment ObserveSimulate
  10. 10. Digital Objects electric butterflies Digital twins Actionable knowledge units FAIR Digital Objects courtesy Dimitris Koureas Coordinator DiSSCo EU Research Infrastructure Specimen object image courtesy of Alex Hardisty
  11. 11. Digital Objects as First Class Entities FAIR Digital Object Framework A KnowledgeGraph of FDOs • Schwardmann (2020), DigitalObjects – FAIR DigitalObjects:Which ServicesAre Required? Data Science Journal • EOSC Interoperability Framework Draft (2020) • HardistyA, et al (2020) Conceptual design blueprint for the DiSSCo digitization infrastructure RIO 6: e54280. • DONA DigitalObjectArchitecture DigitalObject Interface Protocol (2018) • https://fairdigitalobjectframework.org/
  12. 12. From Manuscripts to FAIR+R Research Objects research objects related and bundled together … one shareable, cite-able, exchangable resource that can be versioned and snapshot … metadata describing context and content of objects dependencies, versions, relationships, provenance … enough to be reproducible virtual objects , links to physical objects (people, specimens, equipment) integrated view over fragmented & scattered specialised repositories internal content and references
  13. 13. Bigger on the inside than the outside Packaging internal content and references From Manuscripts to FAIR+R Research Objects
  14. 14. Commons Currency Credit Archival preservation Reproducibility Portability Virtual Witnessing Releasing Living Objects Goble, De Roure, Bechhofer, Accelerating KnowledgeTurns, DOI: 10.1007/978-3-642-37186-8_1 RDM role FAIR ROs Analogous to software FAIR Enough Structure: Composite Dynamic: Versioning Executable: Portability Virtual: References Maintenance: Decay PID resolution? Metadata? Access? Licences?
  15. 15. Packaging of Digital Objects Driver: ComputationalWorkflows http://wf4ever.org/ Preservation of computational workflows in data- intensive science • Workflow-centric Research Objects • Computational workflows, provenance of executions, interconnections between workflows and related resources (e.g., datasets, publications, etc.), social aspects in the experiments. • Wf-centric RO creation & management best practices • analysis and management of decay in workflows.
  16. 16. Methods (..) De novo assembly and binning Raw reads from each run were first assembled with SPAdes v.3.10.020 with option --meta21. Thereafter, MetaBAT 215 (v.2.12.1) was used to bin the assemblies using a minimum contig length threshold of 2,000 bp (option --minContig 2000) and default parameters. Depth of coverage required for the binning was inferred by mapping the raw reads back to their assemblies with BWA-MEM v.0.7.1645 and then calculating the corresponding read depths of each individual contig with samtools v.1.546 (‘samtools view -Sbu’ followed by ‘samtools sort’) together with the jgi_summarize_bam_contig_depths function from MetaBAT 2. The QS of each metagenome-assembled genome (MAG) was estimated with CheckM v.1.0.722 using the lineage_wf workflow and calculated as: level of completeness − 5 × contamination. Ribosomal RNAs (rRNAs) were detected with the cmsearch function from INFERNAL v.1.1.247 (options -Z 1000 --hmmonly --cut_ga) using the Rfam48 covariance models of the bacterial 5S, 16S and 23S rRNAs. Total alignment length was inferred by the sum of all non-overlapping hits. Each gene was considered present if more than 80% of the expected sequence length was contained in the MAG. Transfer RNAs (tRNAs) were identified with tRNAscan-s.e. v.2.049 using the bacterial tRNA model (option -B) and default parameters. Classification into high- and medium-quality MAGs was based on the criteria defined by the minimum information about a metagenome-assembled genome (MIMAG) standards23 (high: >90% completeness and <5% contamination, presence of 5S, 16S and 23S rRNA genes, and at least 18 tRNAs; medium: ≥ 50% completeness and <10% contamination). (...) (..) Assignment of MAGs to reference databases Four reference databases were used to classify the set of MAGs recovered from the human gut assemblies: HR, RefSeq, GenBank and a collection of MAGs from public datasets. HR comprised a total of 2,468 high-quality genomes (>90% completeness, <5% contamination) retrieved from both the HMP catalogue (https://www.hmpdacc.org/catalog/) and the HGG8. From the RefSeq database, we used all the complete bacterial genomes available (n = 8,778) as of January 2018. In the case of GenBank, a total of 153,359 bacterial and 4,053 eukaryotic genomes (3,456 fungal and 597 protozoan genomes) deposited as of August 2018 were considered. Lastly, we surveyed 18,227 MAGs from the largest datasets publicly available as of August 201813,16,17,18,19, including those deposited in the Integrated Microbial Genomes and Microbiomes (IMG/M) database52. For each database, the function ‘mash sketch’ from Mash v.2.053 was used to convert the reference genomes into a MinHash sketch with default k-mer and sketch sizes. Then, the Mash distance between each MAG and the set of references was calculated with ‘mash dist’ to find the best match (that is, the reference genome with the lowest Mash distance). Subsequently, each MAG and its closest relative were aligned with dnadiff v.1.3 from MUMmer 3.2354 to compare each pair of genomes with regard to the fraction of the MAG aligned (aligned query, AQ) and ANI. (..) https://doi.org/10.1038/s41586-019-0965-1 Data pipeline & analysis reporting & reproducibility
  17. 17. Data pipeline & analysis reporting & reproducibility https://doi.org/10.1038/s41586-019-0965-1 Lots of scripts & datasets
  18. 18. EBI’s MGnify metagenomics workflows as workflows using aWfMS Workflow description Input data files Command line tools, containerised tools, workflows Output data files
  19. 19. Standards-based metadata framework for bundling resources (physically and logically) with context into citable reproducible packages. researchobject.org Linked Data!! Bechhofer et al (2013)Why linked data is not enough for scientists https://doi.org/10.1016/j.future.2011.08.004 Bechhofer et al (2010) Research Objects:Towards Exchange and Reuse of Digital Knowledge, https://eprints.soton.ac.uk/268555/
  20. 20. A Research Object bundles and relates digital resources of a scientific experiment/investigation + context Data used and results produced in experimental study Methods employed to produce and analyse that data Provenance and settings for the experiments People, specimens, equipment etc involved in the investigation Annotations about these resources, to improve understanding and interpretation Package
  21. 21. Manifest Construction Manifest Identification to locate things Aggregates to link things together Annotations about things & their relationships Package Research Objects => Metadata Objects Manifest Description Type Checklists what should be there Provenance where it came from Versioning its evolution Dependencies what else is needed Manifest Base Profile, Profiles for RO types
  22. 22. We are in a SemanticWeb Conference…. Linked Data Middleware • Manifests described using Linked Data – Identifiers to resources, including people (orcid) – OWL / RDF / SPARQL / JSON-LD • Mismash of specialized ontologies – Construct the manifest itself • W3CWeb AnnotationVocabulary • OAI Object Exchange and Reuse – Describe manifest content • Wf4Ever RO ontology,Wf4Ever ROEvo … • Dublin Core, FOAF, SIOC, Creative Commons, PROV, PAV… • RDF shapes (SHACL, ShEx) – Capture requirements, expectations and validate profiles – Hard to express checklists
  23. 23. https://view.commonwl.org/workflows/github.com/mnneveau/cancer-genomics- Manifest CWL Annotations Evidence!
  24. 24. Influence? Publishers… Howard Ratner, Chair STM Future Labs Committee, CEO EVP Nature Publishing Group Director of Development for CHORUS
  25. 25. European Open Science Cloud Interoperability Framework EOSC Interoperability Framework (v1.0) May 2020, Draft for community consultation Chair:Oscar Corcho, UPM 2021 we start to combine
  26. 26. Used? Yes ….…… NIH DataCommons transferring and archiving very large HTS datasets in a location- independent way keep the context of data content together when its scattered. Scalability https://doi.org/10.1109/BigData.2016.7840618 A framework for standardizing and sharing computations and analyses generated from High-throughput genome sequencing. Standardized as IEEE 2791-2020 https://doi.org/10.1371/journal.pbio.3000099 Virtualized collaborative working environment for Earth Science researchers to share resources (data, workflows), ideas, knowledge, and results. NIH DataSTAGE RO Composer to exchange between Seven Bridges Platform genomics platform and the Mendeley Data repository https://doi.org/10.1016/j.future.2019.03.046
  27. 27. Used? Yes ….…… Exchange Reproducibility Archival Active Objects
  28. 28. Activation & Research Championing Adoption Phase 1 2010 - 2015 Phase 2 2015 – 2018 Phase 3 2017 -
  29. 29. time to reflect….
  30. 30. Machine-processable Standards Low tech Graceful degradation Commodity tooling Incremental Multi-platform Technology Independent Keep it simple E X A M P L E S Developer friendliness Desiderata & Norms Balance and prioritise • "just enough complexity" or “just enough standards” so… • sufficient extra benefits from what already exists (Linked Data, vocabularies, tooling, validation, transformation • without compromising the developer entry-level experience so much that they rather do their own thing.
  31. 31. Research ObjectTensions Research Infrastructures sit in the middle Academic Viewpoint Infrastructure Viewpoint Green field site Theoretical purity Use latest thing Proof of concept Sophistication Narrow developer audience Strive for super generic The end Exposing the tech Pre-existing platforms Practicality Use things that work Production Simplicity Wide developer audience Several specific is ok! The means Hiding the tech
  32. 32. Ladder Model of OSS Adoption (adapted from Carbone P., Value Derived from Open Source is a Function of Maturity Levels) [FLOSS@Sycracuse] "it's better, initially, to make a small number of users really love you than a large number kind of like you" Paul Buchheit paulbuchheit.blogspot.com
  33. 33. Not really mortal developer friendly • “Easy to make, hard to use…” • Daunting Linked Data tech stack • Being too clever – Infer what is in the object and what kind of object it is – Massive reuse of ontologies • Make developers (and researchers) lives easier not more demanding….
  34. 34. Developer friendliness matters Reinvent with fewer features Easy to understand and simple conceptually… … with strong opinionated guide to current best practices … using software stacks widely used on theWeb
  35. 35. Indeed. Linked Data is not enough. Research Infrastructures: “digital technologies (hardware, software), resources (data, services, digital libraries, standards), comms (protocols, access rights, networks), people and organisational structures”
  36. 36. CommunityUse Driver Tools Developer friendly – so possible Incentives – so rewarding Adoption path – so acceptable Metadata capture Early benefit Exchange, reproducibility, executable objects Portability between platforms, Archiving Platform & user buy-in & consensus Passionate, dedicated leadership Active engaged community, seed Support Linked Data and a Spec is not enough and sometimes too much GuidesReference examples
  37. 37. Research Object Reboot
  38. 38. CommunitySponsored RO2018, Amsterdam e-Science Conference BDBags Keynote
  39. 39. Swing Back to Basics Peter Sefton BagIT data profile + schema.org + JSON-LD annotations Semantic Web world vs RealWorld DataCrate from the Open Repository community "As a researcher…I’m a bit b****y fed up with Data Management", Cameron Neylon
  40. 40. Be HumbleArchivist and library people know the importance of metadata and standards… … and for things to work 5, 10, 20 years later. End-users need to have their own way to “bypass the system”… …. their field, repositories, institutions , journals etc. will always be lagging behind the curve Most who want to make their data is FAIR … … do not have the resources or knowledge to start championing all of this to all levels & need tools and ramps. http://www.lisbdnet.com/ https://ischools.org/
  41. 41. A RO-Crate Community! https://www.researchobject.org/ro-crate/#contribute https://github.com/researchobject/ro-crate/issues/1 A Merger • A diverse set of people • A variety of stakeholders • A set of collective norms • A open platform that facilitates communication (GitHub, Google Docs, monthly telcons)
  42. 42. RO-Crate https://w3id.org/ro/crate RO-Crate is a community effort to establish a lightweight approach to packaging research data with their metadata. It is based on schema.org annotations in JSON-LD, and aims to make best-practice in formal metadata description accessible and practical for use in a wider variety of situations, from an individual researcher working with a folder of data, to large data-intensive computational research environments. RO-Crate is the marriage of Research Objects with DataCrate. It aims to build on their respective strengths, but also to draw on lessons learned from those projects and similar research data packaging efforts. For more details, see RO-Crate background. The RO-Crate specification details how to capture a set of files and resources as a dataset with associated metadata – including contextual entities like people, organizations, publishers, funding, licensing, provenance, workflows, geographical places, subjects and repositories. A growing list of RO-Crate tools and libraries simplify creation and consumption of RO-Crates, including the graphical interface Describo. The RO-Crate community help shape the specification or get help with using it!
  43. 43. Reinvent as Lightweight Underware Linked data but Developer Friendly Easy to understand and simple conceptually… … with strong opinionated guide to current best practices … using software stacks widely used on theWeb BagIT data profile , schema.org, JSON-LD, JSONSchema Flattened compacted JSON-LD, no need for RDF libraries Example driven rather than strict specification Implementers add additional metadata using schema.org types and properties Data entities are files/directories or web resources Boundness of elements is explicit Single graph, data structure depth 1 Swung a bit far…. and swung back…
  44. 44. Tooling! How can I use it? While we’re mostly focusing on the specification, some tools already exist for working with RO-Crates: ● Describo interactive desktop application to create, update and export RO-Crates for different profiles. (~ beta) ● CalcyteJS is a command-line tool to help create RO-Crates and HTML-readable rendering (~ beta) ● ro-crate - JavaScript/NodeJS library for RO-Crate rendering as HTML. (~ beta) ● ro-crate-js - utility to render HTML from RO-Crate (~ alpha) ● ro-crate-ruby Ruby library to consume/produce RO-Crates (~ alpha) ● ro-crate-py Python library to consume/produce RO-Crates (~ planning) These applications use or expose RO-Crates: ● Workflow Hub imports and exports Workflow RO-Crates ● OCFL-indexer NodeJS application that walks the Oxford Common File Layout on the file system, validate RO-Crate Metadata Files and parse into objects registered in Elasticsearch. (~ alpha) ● ONI indexer ● ocfl-tools ● ocfl-viewer ● Research Object Composer is a REST API for gradually building and depositing Research Objects according to a pre-defined profile. (RO-Crate support alpha) ● … (yours?) https://uts-eresearch.github.io/describo/
  45. 45. • RDF and schema.org but you don't need to know. • Extend RO-Crate …. – Add your own schema.org types and properties. – Add in your own ontologies …and it still works! https://arkisto-platform.github.io/case-studies/ Under-ware
  46. 46. Driver! Profile for workflows https://about.workflowhub.eu/Workflow-RO-Crate/
  47. 47. Infrastructure families Driver! Profile for workflows On-boarding developers Web and dev friendly Workflow-RO-Crate profiles RO in practice External references – logically & physically contained – versions, snapshots, multi-typed, active, multi- stewarded, multi-authored, governance…
  48. 48. More than plain JSON, Just Enough Linked Data Retain benefits of Linked Data in the toolbox • querying, graph stores, vocabularies, clickable URI as identifiers) • customization and conventions Plus all the other stuff a developer expects • documentation, examples, libraries, tools • simplifications rather than generalizations (less flexibility frees up developers) • “Just enough standards” cf. schema.org Linked Data “exotics” there for when the time is right if needed by the right people.
  49. 49. Keep your eye on the target….. How do we make RO’s normative? – Propaganda and incentive models to scientists, target the Research Infrastructures to deliver. – Digital library community allies! Developer friendliness matters – Underware, incremental, ramps, embed, metadata automation, persuasive design Linked Data has a role – As a means but it is not an end. – Simpler version of Linked Data makes an adoption path (cf. Knowledge Graphs, schema.org, JSON- LD) FAIR principles for Research Objects…. – Unifying the vision with the practical
  50. 50. Acknowledgements Barend Mons Sean Bechhofer Matthew Gamble Raul Palma Jun Zhao Mark Robinson AlanWilliams Norman Morrison Stian Soiland-Reyes Tim Clark Alejandra Gonzalez-Beltran Philippe Rocca-Serra Ian Cottam Susanna Sansone Kristian Garza Daniel Garijo Catarina Martins Iain Buchan Michael Crusoe Rob Finn StuartOwen Finn Bacall Bert Droebeke Laura Rodríguez Navas Ignacio Eguinoa Carl Kesselman Ian Foster Kyle Chard Vahan Simonyan Ravi Madduri Raja Mazumder GilAlterovitz Denis Dean II DurgaAddepalli Wouter Haak Anita DeWaard Paul Groth Oscar Corcho Peter Sefton Eoghan Ó Carragáin Frederik Coppens Jasper Koehorst Simone Leo Nick Juty LJ Garcia Castro Karl Sebby Alexander Kanitz Ana Trisovic http://researchobject.org Gavin Kennedy Mark Graves José María Fernández Jose Manuel Gomez-Perez Jason A. Clark Salvador Capella-Gutierrez Alasdair J. G. Gray Kristi Holmes Giacomo Tartari Hervé Ménager Paul Walk Brandon Whitehead Erich Bremer Mark Wilkinson Jen Harrow And many more....

Editor's Notes

  • More information on our website: https://zbmed.github.io/damalos
    The swings and roundabouts of a decade of fun and games with Research Objects
    Over the past decade we have been working on the concept of Research Objects http://researchobject.org). We worked at the 20,000 feet level as a new form of FAIR+R scholarly communication and at the 5 feet level as a packaging approach for assembling all the elements of a research investigation, like the workflows, data and tools associated with a publication. Even in our earliest papers [1] we recognised that there was more to success than just using Linked Data. We swing from fancy RDF models and SHACL that specialists could use to simple RO-Crates and JSON-LD that mortal programmers could use. Success depends on a use case need, a community and tools which we found particularly in the world of computational workflows right from the outset. All along there have been politics and that has recently got even hotter. All fun and games!
    [1] Sean Bechhofer, et al (2013) Why Linked Data is Not Enough for Scientists FGCS 29(2)599-611 https://doi.org/10.1016/j.future.2011.08.004 v
  • I stopped living in research land a long time ago…
    RIs not academic …

  • https://datascience.nih.gov/CancerResearchDataCommons
    We make the rocket launchers for the rocket science
    We are the microscope, telescope, datascope instrument makers
  • Use Linked data
    Documentation and reporting
  • https://www.dona.net/sites/default/files/2018-11/DOIPv2Spec_1.pdf
    https://www.eoscsecretariat.eu/sites/default/files/eosc-interoperability-framework-v1.0.pdf
  • The Data-Driven Open Science, e-Laboratories


    Inside and outside
    Elsewhere, on-date
    Within, during
    Reproducible Research Environment
    Integrated infrastructure for producing and working with reproducible research.

    Reproducible Research Publication Environment
    Distributing and reviewing; academic credit; legal licensing etc.




    Such a merge between research infrastructure and research marketplace would overcome known "cultural barriers" and the "methodological barriers" mentioned above. In the RI scope, research products publishing would:
    be facilitated by the very ICT services that are generating them;
    take advantage of research activity awareness and product interlinking;
    support access rights issues that fit the need of the community; and
    subsume the major costs of storing and curating products. In other words, by bringing marketplace features within the RI, researchers can finally achieve their best effort in terms of scholarly communication.

    Science 2.0 Repositories: Time for a Change in Scholarly Communication
    Massimiliano Assante, Leonardo Candela, Donatella Castelli, Paolo Manghi and Pasquale Pagano Istituto di Scienza e Tecnologie dell'Informazione, Consiglio Nazionale delle Ricerche, Italy {assante, candela, castelli, manghi, pagano}@isti.cnr.it DOI: 10.1045/january2015-assante

  • Beyond packaging and not just at the end!
    A knowledge graph of the butterfly
  • FDO Framework
    PIDs
    Byte-stream described by metadata
    https://www.dona.net/digitalobjectarchitecture
    https://fairdigitalobjectframework.org/
    FDO types and type definition registries
    https://www.dona.net/sites/default/files/2018-11/DOIPv2Spec_1.pdf
    https://www.eoscsecretariat.eu/sites/default/files/eosc-interoperability-framework-v1.0.pdf
    https://www.rd-alliance.org/group/gede-group-european-data-experts-rda/wiki/gede-digital-object-topic-group
  • A Knowledge Graph of the investigation
  • No just encapsulated
    References

  • RO in practice

    Interpretation, Comparison, Portability, Reuse, Credit, Citation
    Interpretation, Comparison, Preservation, Repair, Portability, Reuse, Execution
    Active Research, Evolving codes, New data, Portability, Reuse, Execution
  • A knowledge graph about a workflow.
  • Most workflows in papers are ad-hoc, artisan
    So workflows are freedom ….
    Move fast, break grow it
    Bring enterprise scale
    Its so brittle, data science



  • So we came up with ROs
  • ROs combine containers and incremental metadata

    Like DataONE Packages
    Its all about the metadata
  • The first implementation of Research Objects in 2009 for sharing workflows in myExperiment https://doi.org/10.1093/nar/gkq429 was based on RDF ontologies http://eprints.soton.ac.uk/id/eprint/267787, building on Dublin Core, FOAF, SIOC, Creative Commons, OAI-ORE and (later) DBPedia to form myExperiment ontologies for describing social networking, attribution and creditation, annotations, aggregation packs, experiments, view statistics, contributions, and workflow components. http://web.archive.org/web/20091115080336/http://rdf.myexperiment.org/ontologies Programmatic access to Research Objects was facilitated with an RDF endpoint that exposed individual myExperiment resources, also queriable from a SPARQL endpoint, both using the myExperiment vocabularies and RDF formats RDF/XML and Turtle.
    Re-use of existing vocabularies. Making lots of new vocabularies. - too many??
    Stack becomes very large to learn and use.
    RDF as its own worst enemy.
    Need for simplicity. Cater for "web developer".
    Re-use of existing vocabularies. Making lots of new vocabularies. - too many??
    Stack becomes very large to learn and use.
    RDF as its own worst enemy.
    Need for simplicity. Cater for "web developer".

  • JSON-LD
  • Howard Ratner, Chair STM Future Labs Committee, CEO EVP Nature Publishing Group Director of Development for CHORUS (Clearinghouse for the Open Research of US) STM Innovations Seminar 2012
    projects
  • But really friends, and 1 platform and 1 project
  • Championing through bubble projects
  • Desiderata, norms
    But did we really do this?

    But it's about prioritizing, so say we use a production-ready database but put a bleeding-edge research object application on top, like how Peter Sefton's team are putting things quickly together. but of course it's way simplified
    kind of "just enough complexity" (or just enough standards!) to get sufficient extra benefits from what already exists (Linked Data, vocabularies, tooling, validation, transformation etc) without compromising the developer entry-level experience too much that they rather do their own thing (another triangle wheel).
  • We started in theory, but we work in practice
    ROs sit in the middle
  • Inspection states
    Conform to a profile – STILL do not have profile validation…
    Re-use of existing vocabularies. Making lots of new vocabularies. - too many??
    Stack becomes very large to learn and use.
    RDF as its own worst enemy.
    Need for simplicity. Cater for "web developer".




  • Heath Robinson
    Base Type

    nested ROs?
    specialised Workflow-RO-Crate profiles for different WfMS
    RO-Crate -> WRO-Crate -> Galaxy-WRO-Crate
    JSon schema
    Data Crate examples…. From

    you can point to also how RDF graph have been reinvented as knowledge graphs - just the same thing with less features, and then suddenly it took off

    14:45
    Linked Data and Semantic web are the same

    Stian, 14:45
    as it became understandable!


    Stian, 14:45
    and schema.org vs. all the mismash of specialized ontologies
  • Sometimes its too much
  • For Practice, A Village

    Sustainability – get some traction….
  • Sponsored by BioExcel
  • University of Technology Sydney
    "As a researcher…I’m a bit bloody fed up with Data Management" https://cameronneylon.net/blog/as-a-researcher-im-a-bit-bloody-fed-up-with-data-management/
  • Hence desktop solutions like Describo are also important. Not sure if you mentioned Describo!
    Library & Information Science Network

  • We did oversimplify to just the datacrate !
    Data structure Depth 1

    Adopt software stacks widely used on the Web
    BagIT data profile , schema.org, JSON-LD, JSONSchema
    Flattened JSON-LD, no need for RDF libraries
    A self-described container using LD as a core principle
    Data entities are files/directories or web resources
    Easy to understand and simple conceptually
    Data structure depth limited to 1
    Boundness of elements is explicit
    Strong opinionated guide to current best practices 
    Example driven rather than strict specification
    Implementers can add additional metadata using other schema.org types and properties

    KGs are simplified and then could be adopted.

    We swung too far and threw away the remote referencing – hence RO-Crate 1.1
    Next RO-Crate and containers…

    Conceptual Definition
    A key premise of RO-Crate is the existence of a wide variety of resources on the Web that can help describe research. As such RO-Crate relies on concepts from the Web and in particular linked data. 
    Linked Data as a core principle
    Linked Data [cite] is a core principle of RO-Crate, and IRIs are therefore used to identify the RO-Crate, its constituent parts and metadata descriptions, as well as to the properties and classes used in the metadata. IRIs are a generalization of URIs (which include well-known http/https URLs), permitting international Unicode characters without %-encoding [RFC3987]. 
    Using Linked Data, where consumers can follow the links for more (ideally both human- or machine-readable) information, the RO-Crate can then be sufficiently self-described and related using global identifiers, without needing to recursively fully describe every referenced entity.
    The base of Linked Data and shared vocabularies also means multiple RO-Crates and other Linked Data resources can be indexed, combined, queried, integrated or transformed using existing Semantic Web technologies like SPARQL and knowledge graph triple stores.
    RO-Crate is a self-described container
    An RO-Crate is defined as a self-described Root Data Entity, which describes and contains data entities, which are further described using contextual entities.  A  data entity is primarily a file (bytes on disk somewhere) or a directory (set of named files); while a contextual entity exists outside the information system (e.g. a Person) and its bytes are primarily metadata.
    The Root Data Entity is a directory, the RO-Crate Root, identified by the presence of the RO-Crate Metadata File ro-crate-metadata.json. This is a JSON-LD file that describes the RO-Crate, its content and related metadata using Linked Data. 
    The minimal requirements for the root data entity metadata is name, description and datePublished, as well as a contextual entity identifying its license; but additional metadata is frequently added depending on the purpose of the particular RO-Crate.
    Because the RO-Crate is just a set of related files and directories, it can be stored, transferred or published in multiple ways, such as BagIt [RFC8493], Oxford Common File Layout [OCFL], downloadable ZIP archives in Zenodo or dedicated online repositories, as well as published directly on the web, e.g. using GitHub Pages. Combined with Linked Data identifiers this caters for a diverse set of storage and access requirements across different scientific domains, e.g. metagenomics workflows producing ~100 GB of genome data, or cultural heritage records with access restrictions for personally identifiable data.
    Data Entities are described using Contextual Entities
    RO-Crate distinguishes between data and contextual entities in a similar way to HTTP terminology information (data) and non-information (contextual) resource [HttpRange-14]. While the data entities in the most common scenario are files and directories located by relative IRI references within the RO-Crate Root, they can also be web resources identified with absolute http/https IRIs.
    As both types of entities are identified by IRIs, their distinction is allowed to be blurry; data entities can be located anywhere and be complex, and the contextual entities can have a Web presence beyond their description inside the RO-Crate. For instance  <https://orcid.org/0000-0002-1825-0097> is primarily an identifier for a person, but is secondarily also a web page that describes that person and their academic work. 
    Any particular IRI might appear as a contextual entity in one RO-Crate, and a data entity in another; their distinction is that data entities can be considered to be contained or captured by the RO-Crate, while the contextual entities are mainly explaining the RO-Crate and its entities. It follows that an RO-Crate should have one or more data entities, although this is not a formal requirement.
    Best Practice Guide rather than strict specification
    RO-Crate builds on Linked Data standards and community-evolved vocabularies like JSON-LD and schema.org, but rather than enforce additional constraints and limits using perhaps intimidating technologies like RDF shapes (SHACL, ShEx), RO-Crate aims instead to build a set of best practice guides on how to apply the existing standards in a common way to describe research outputs and their provenance.
    As such the RO-Crate specification [RO-Crate 1.1] can be seen as an opinionated and example-driven guide to writing schema.org metadata as JSON-LD, and leaves it open for implementers to add additional metadata using other schema.org types and properties, or even using other Linked Data vocabularies/ontologies or their own ad-hoc terms.
    This does not mean that RO-Crate has not got constraints, crucially the specification is quite strict on the style of the flattened JSON-LD to facilitate lightweight developer consumption and generation of the RO-Crate Metadata file without the need for RDF libraries.
    Checking Simplicity
    As previously discussed, one aim of RO-Crate is to be conceptually simple. This simplicity has been repeatable checked and confirmed through a community process. See for example discussion at (GitHub issue #71 on ad-hoc vocabularies). 
    To further verify this idea, we have formalized the RO-Crate definition (see Appendix X). An important result of this exercise is that the underlying data structure of RO-Crate is a depth limited tree of 1 level. The formalization also emphasizes the boundness of the structure, namely,  that elements are specifically identified as part of the RO-Crate or are typed as being references external to the structure (i.e. ContextualEntities).
  • Active objects, versioning, nesting, references
  • STILL A LONG WAY TO GO! Still need better guides

    Two draft profiles
    ComputationalWorkflow v 0.5
    10 mandatory, 15 recommended, 7 optional fields
    FormalParameter v 0.1
    3 mandatory, 1 recommended, 3 optional fields

    Two draft types
    ComputationalWorkflow v 0.1
    FormalParameter v 0.1

    Mappings to DataCite & OpenAIRE
  • What I would like to see in the end is how we are NOT throwing out the Linked Data foundation - and that we want to retain all the benefits, e.g. querying, graph stores, vocabularies, clickable URI as identifiers - but they are mere things we want to keep in our toolbox rather than a goal in itself - and as such the toolbox also have the other things developers expects; documentation, examples, libraries, simplifications rather than generalizations [removing unneeded degrees of freedom like multiple RDF formats]; showing them how to do the Right Thing (tm) almost by accident. Then they can - if they want to and when time is right - pick up some of those shiny tools in the bottom of the toolbox.

    Just like in cars - they benefit a lot from reusing standards - like communication buses, agreeing on mounts and screws, almost interchangable parts so that manufacturers can mix and match when they put it together. But you as a consumer don't care until you realize your exhaust pipe is broken and the mechanic now can choose how to fix it rather than be restrained to one single manufacturer.

    some kind of "Why not just plain JSON?" point perhaps. We are not after the exotic features that you will hear about in the rest of the conference, but the bog standard tooling that arguably you could also do in a NoSQL database with enough customization and conventions. But rather than invent our own C&C we retain the Linked Data conventions - "Just enough standards" - following the lead of schema.org
  • Keep your eye on the target

    How do we persuade Scientists?
    We do not
    That’s why we need to persuade the infrastructure people

    What other things do we need to do
    Small Picture – you don’t
    Developer Friendliness Matters
    Its underware – customers are infrastructures
    Marrying the FDO vision with the RO

    Embed in Research Infrastructures

  • ×