Reproducible Research:
how could Research Objects help
Professor Carole Goble
The University of Manchester, UK
ELIXIR Interoperability & Head of Node UK
Software Sustainability Institute UK
carole.goble@manchester.ac.uk
21st Genomics Standards Consortium meeting, 21 May 2019, Vienna
Being FAIR
Flipping through Nature in April 2019….
Flawed
Design
& Practice
Poor
Reporting &
Availability
Scientific publications
• announce a result
• convince readers to trust it
• enable a peer to reuse it or compare against it.
Wet Lab Experimental science
• describe the results
• provide a clear enough
description of the materials and
protocol to allow successful
repetition and extension [Jill
Mesirov 2010]
Dry Lab Computational science
• describe the results
• provide the complete software
development environment,
data, instructions, techniques
[David Donoho 1995]
Why?
Not reporting the
design sufficiently
Not enough metadata on
the data or methods to
understand, repeat,
compare, rerun
Reporting & Availability irreproducibility?
The method isn’t
transparently,
comprehensively and
accurately reported
Not being able to access
the data, rerun the
method in your
environment, have all the
components you need
portability
preservation
packaging
hosting
robustness
description
ids
steps, provenance
access
dependencies
Flipping through Nature in April 2019….
Reproduce and reuse computations
Transparently communicate the
way computations are performed
Disambiguate interpretation of
inputs/parameters/results
Safely (re)run computations ported
onto different platforms
Human and computer readable
definitions for the provenance of
computation, types for the data and
results
The Data and the Methods
Method Reproducibility
the provision of enough detail about
study procedures and data so the
same procedures could, in theory or
in actuality, be exactly repeated.
Result Reproducibility
the same results from the conduct of
an independent study whose
procedures are as closely matched
to the original experiment as possible
Procedure = Software, SOP, Lab Protocol, Workflow, Script.
Tools, Technologies, Techniques. A whole bunch of them together.
Goodman et al., Science Translational Medicine 8 (341), 2016
Flipping through Nature in April 2019….
DATA
UMGS genomes
• in ENA ERP108418
Other datasets:
• ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/umgs_analyses/
Supplementary Tables
• Excel spreadsheets at the publishers
Flipping through Nature in April 2019….
METHODS
Pointers to scripts, tools and toolkits
• https://pypi.org/project/mg-toolkit/
• R v3.4.1; Python v2.7.5 and v3.6.5; SPAdes
v3.10.0; MetaBAT v2.12.1; BWA v0.7.16;
samtools v1.5; CheckM v1.0.7; Mash v2.0;
MUMmer v3.23; specI v1.0; MUSCLE
v3.8.31; DIAMOND v0.9.17.118; prodigal
v2.6.3; InterProScan v5.27-66.0;
antiSMASH 4; ALDEx2; sourmash v2.0.0a4;
phytools v0.6-44; GhostKOALA; VirFinder
v1.1; CompareM v0.0.23; MEGAHIT v1.1.3;
MetaWRAP v1.0; MaxBin v2.2.4;
mltools v0.3.5; RAxML v8.1.15; CD-HIT v4.7;
tRNAscan-SE v2.0; INFERNAL v1.1.2; dRep
v2.2
• Parameter settings? Configurations?
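A flat list of tool names and versions like the one above can at least be made machine-readable. The following is a minimal sketch, assuming entries are separated by semicolons and written as `Name vX.Y`; the function names `parse_tool_list` and `unversioned` are hypothetical and not part of mg-toolkit or any tool named on the slide. It shows how entries without a stated version surface immediately; parameter settings and configurations would still have to be recorded separately.

```python
import re

# One entry looks like "SPAdes v3.10.0" or "antiSMASH 4": a name, whitespace,
# an optional "v", then a version that starts with a digit.
ENTRY = re.compile(r"^(?P<name>\S+)\s+v?(?P<version>\d\S*)$")


def parse_tool_list(text):
    """Split a semicolon-separated methods list into name/version records."""
    records = []
    for raw in text.split(";"):
        entry = raw.strip()
        if not entry:
            continue
        match = ENTRY.match(entry)
        if match:
            records.append({"name": match["name"], "version": match["version"]})
        else:
            # No recognisable version: keep the name and flag the gap.
            records.append({"name": entry, "version": None})
    return records


def unversioned(records):
    """Return the names of tools that were listed without a version."""
    return [record["name"] for record in records if record["version"] is None]


if __name__ == "__main__":
    listed = "SPAdes v3.10.0; MetaBAT v2.12.1; antiSMASH 4; ALDEx2; GhostKOALA"
    print(unversioned(parse_tool_list(listed)))
```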
De-contextualised
Static, Fragmented
Lost Semantic linking
Contextualised
Active, Unified
Semantic linking
Buried in a
PDF figure
Dissemination Fragmentation
Community specific approaches …
Scharm M, Wendland F, Peters M, Wolfien M, Theile T, Waltemath D
SEMS, University of Rostock
zip-like file with a manifest & metadata
- Bundle files - Keep provenance
- Exchange data - Ship results
Bergmann, F.T. (2014). COMBINE archive and OMEX format: one file to share all information
to reproduce a modeling project. BMC Bioinformatics, 15(1), 1.
Combine Archive
Systems Biology
Systems Medicine
https://sems.uni-rostock.de/projects/combinearchive/
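The "zip-like file with a manifest & metadata" idea can be sketched in a few lines. This is an illustration only, assuming a hypothetical JSON manifest named `manifest.json` with a checksum per entry; it is not the OMEX manifest format that the COMBINE archive actually defines, and `build_bundle` and `read_manifest` are made-up names.

```python
import hashlib
import io
import json
import zipfile


def build_bundle(files):
    """Pack files into an in-memory zip together with a manifest of its entries.

    `files` maps an archive path to the bytes stored at that path.
    """
    manifest = {
        "entries": [
            {
                "path": path,
                "sha256": hashlib.sha256(data).hexdigest(),
                "size": len(data),
            }
            for path, data in sorted(files.items())
        ]
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # The manifest travels inside the bundle, next to the files it describes.
        archive.writestr("manifest.json", json.dumps(manifest, indent=2))
        for path, data in files.items():
            archive.writestr(path, data)
    return buffer.getvalue()


def read_manifest(bundle_bytes):
    """Open a bundle and return its manifest as a dictionary."""
    with zipfile.ZipFile(io.BytesIO(bundle_bytes)) as archive:
        return json.loads(archive.read("manifest.json"))


if __name__ == "__main__":
    bundle = build_bundle({"model.xml": b"<model/>", "results.csv": b"a,b\n1,2\n"})
    for entry in read_manifest(bundle)["entries"]:
        print(entry["path"], entry["size"], entry["sha256"][:12])
```

A recipient can recompute the checksums after unpacking to confirm that the shipped files are the ones the manifest describes.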
Research Objects
Bundled
together**
Digital
objects*
• PIDs
• Metadata
*Turning FAIR into reality: Final report and action plan from the European Commission expert group on FAIR data, Nov 2018
** Bechhofer et al (2013) Why linked data is not enough for scientists https://doi.org/10.1016/j.future.2011.08.004
Bechhofer et al (2013) Why linked data is not enough for scientists https://doi.org/10.1016/j.future.2011.08.004
Bechhofer et al (2010) Research Objects: Towards Exchange and Reuse of Digital Knowledge, https://eprints.soton.ac.uk/268555/
machine processable metadata
in common and specific to
different object types.
bundle together and relate
digital resources with their
context into a unit.
snapshot | cite | exchange
Research Object
Framework
Container
“Unbounded” Objects: External references to things
A Digital Package Object Type
composed of many interrelated
elements that bundles together and
relates digital resources of a scientific
investigation with context.
A Metadata Object
represents properties in
common across all research
artefact types, common
PIDs and metadata
Bigger on the
inside than
the outside
Archive formats to
encode the object
Container Profile
Manifest Construction Profile
Model & format for
constructing the manifest
Standards!
OADM
OAI
Manifest Content Profile
About the Object
What is in the Object
Tailored to the Object type
Validate - what we expect to be there
Domain
Ontologies
PROV
GitHub
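The "Validate - what we expect to be there" step of a Manifest Content Profile could look roughly like this. It is a sketch under stated assumptions: the manifest is a plain dictionary with an `id` and an `aggregates` list, each item carries a hypothetical `role` key, and the required roles are invented for a workflow-type object. A real profile would more likely be expressed declaratively, for example as shapes checked by a validator, rather than as hand-written checks.

```python
# Hypothetical profile for a workflow-type object: these roles must be present.
REQUIRED_ROLES = {"workflow", "input", "output"}


def validate_manifest(manifest, required_roles=REQUIRED_ROLES):
    """Return a list of problems; an empty list means the manifest fits the profile."""
    problems = []
    # "About the Object": the object itself needs an identifier.
    if not manifest.get("id"):
        problems.append("manifest has no identifier")
    # "What is in the Object": collect the roles of all aggregated items.
    roles_present = {item.get("role") for item in manifest.get("aggregates", [])}
    # "Tailored to the Object type": report every required role that is absent.
    for role in sorted(required_roles - roles_present):
        problems.append(f"missing item with role '{role}'")
    return problems


if __name__ == "__main__":
    incomplete = {"aggregates": [{"uri": "workflow.cwl", "role": "workflow"}]}
    for problem in validate_manifest(incomplete):
        print(problem)
```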
Workflow
EBI’s MGnify
metagenomics pipelines
Workflow description
Input data files
Command line tools,
containerised tools,
workflows
Output data files
Why
Who
How
What
When
Where
SOP, Lab Protocol
Publication
Workflow
Content Profile
Workflow Workflow Run
Workflow
“Node”
Describes computational
workflows to be portable,
scalable & interoperable
with different workflow
systems and
containerised tools
Bundles the CWL workflow
descriptions
Adds context, provenance,
examples, validation data …
Snapshots workflow.
Relates it to other objects -
studies, data collections,
SOPs and Lab protocols …
https://www.commonwl.org/
Description of tools, inputs and
outputs.
Ontology markup using EDAM
CWL files in GitHub
Or export from native
platforms
Bundles it all together
Example input files
Validation tests
Links to research study
Software components are
containerised to make them portable and
handle software dependencies
Manifest
Annotations about the content of the manifest
SHACL
Create
Validate
Curate
Explore
https://view.commonwl.org/workflows/github.com/mnneveau/cancer-genomics-workflow/blob/master/detect_variants/detect_variants.cwl
For the
JSON fans…
For example: CWL Provenance
Data lineage and licence/citation tracking
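Data lineage tracking can be illustrated with a toy provenance log. This is a sketch, not the CWL provenance format: it assumes each workflow step is recorded as a dictionary with hypothetical `name`, `inputs` and `outputs` keys, and `trace_lineage` is a made-up helper that walks back from an output to every step that contributed to it.

```python
def trace_lineage(target, steps):
    """List the steps that contributed to `target`, nearest step first."""
    # Index every output file by the step that produced it.
    produced_by = {out: step for step in steps for out in step["outputs"]}
    lineage, seen, pending = [], set(), [target]
    while pending:
        item = pending.pop()
        step = produced_by.get(item)
        # Raw inputs have no producing step; already-visited steps are skipped.
        if step is None or step["name"] in seen:
            continue
        seen.add(step["name"])
        lineage.append(step["name"])
        # Continue upstream through everything this step consumed.
        pending.extend(step["inputs"])
    return lineage


if __name__ == "__main__":
    run = [
        {"name": "assemble", "inputs": ["reads.fastq"], "outputs": ["contigs.fa"]},
        {"name": "bin", "inputs": ["contigs.fa"], "outputs": ["bins.tsv"]},
    ]
    print(trace_lineage("bins.tsv", run))
```

The same walk could carry licence and citation metadata along with each step, which is what makes licence/citation tracking possible once lineage is recorded.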
EDAM
Ontology
CWL-enabled WfMS
Which machines?
Parameters?
Configurations?
Flipping through Nature in April 2019….
METHODS
Inspect and replicate the
computational analytical
workflow to review and
approve the bioinformatics
Standardize exchange of
HTS workflows for
regulatory submissions
between FDA, pharma,
bioinformatics platform
providers and researchers
“Parametric
domain”
IEEE P2791 BioCompute Working Group
http://biocomputeobject.org
Sharing
commons
co-development
publishing
Exchange
rich description
portability,
interoperability
reproducibility
recomputation
Active Releasing
changes
updates
Stewardship
preservation
maintenance
Challenges
Objectness
• nesting, citing, lifecycles,
governance…
Content Profiles
• machine processable
accuracy and detail
Tooling
• Embedded into platforms,
on ramps
NIH Data Commons
Big data distributed over multiple locations,
Efficiently and safely moved on demand
ROs are verified collections of references
[Chard et al. 2016]
European Open Science Cloud Commons
Tools and Workflow
Collaboratories
RO-based
Workflow Commons
Getting into Practice
Getting into Practice Commonwl.org
Acknowledgements
Stian Soiland-Reyes
Michael Crusoe
Rob Finn
Kyle Chard
Daniel Garijo
Barend Mons
Sean Bechhofer
Matthew Gamble
Raul Palma
Jun Zhao
Mark Robinson
Alan Williams
Norman Morrison
Tim Clark
Alejandra Gonzalez-Beltran
Philippe Rocca-Serra
Ian Cottam
Susanna Sansone
Kristian Garza
Catarina Martins
Iain Buchan
Carl Kesselman
Ian Foster
Vahan Simonyan
Ravi Madduri
Raja Mazumder
Gil Alterovitz
Denis Dean II
Durga Addepalli
Wouter Haak
Anita De Waard
Paul Groth
Oscar Corcho
CWL and RO communities
Project ID: 675728
