Reproducible Research: how could Research Objects help, given at 21st Genomic Standards Consortium Meeting
Dates: May 20-23, 2019
https://press3.mcs.anl.gov/gensc/meetings/gsc21/
1. Reproducible Research:
how could Research Objects help
Professor Carole Goble
The University of Manchester, UK
ELIXIR Interoperability & Head of Node UK
Software Sustainability Institute UK
carole.goble@manchester.ac.uk
21st Genomic Standards Consortium meeting, 21 May 2019, Vienna
3. Flipping through Nature in April 2019….
Flawed
Design
& Practice
Poor
Reporting &
Availability
4. Scientific publications
• announce a result
• convince readers to trust it
• enable a peer to reuse it or compare against it.
Wet Lab Experimental science
• describe the results
• provide a clear enough
description of the materials and
protocol to allow successful
repetition and extension [Jill
Mesirov 2010]
Dry Lab Computational science
• describe the results
• provide the complete software
development environment,
data, instructions, techniques
[David Donoho 1995]
Why?
5. Not reporting the
design sufficiently
Not enough metadata on
the data or methods to
understand, repeat,
compare, rerun
Reporting & Availability irreproducibility?
The method isn’t
transparently,
comprehensively and
accurately reported
Not being able to access
the data, rerun the
method in your
environment, have all the
components you need
portability
preservation
packaging
hosting
robustness
description
ids
steps, provenance
access
dependencies
6. Flipping through Nature in April 2019….
Reproduce and reuse computations
Transparently communicate the
way computations are performed
Disambiguate interpretation of
inputs/parameters/results
Safely (re)run computations ported
onto different platforms
Human and computer readable
definitions for the provenance of
computation, types for the data and
results
7. The Data and the Methods
Method Reproducibility
the provision of enough detail about
study procedures and data so the
same procedures could, in theory or
in actuality, be exactly repeated.
Result Reproducibility
the same results from the conduct of
an independent study whose
procedures are as closely matched
to the original experiment as possible
Procedure = Software, SOP, Lab Protocol, Workflow, Script.
Tools, Technologies, Techniques. A whole bunch of them together.
Goodman et al., Science Translational Medicine 8 (341), 2016
8. Flipping through Nature in April 2019….
DATA
UMGS genomes
• in ENA ERP108418
Other datasets:
• ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/umgs_analyses/
Supplementary Tables
• Excel spreadsheets at the publishers
9. Flipping through Nature in April 2019….
METHODS
Pointers to scripts, tools and toolkits
• https://pypi.org/project/mg-toolkit/
• R v3.4.1; Python v2.7.5 and v3.6.5; SPAdes v3.10.0; MetaBAT v2.12.1;
BWA v0.7.16; samtools v1.5; CheckM v1.0.7; Mash v2.0; MUMmer v3.23;
specI v1.0; MUSCLE v3.8.31; DIAMOND v0.9.17.118; prodigal v2.6.3;
InterProScan v5.27-66.0; antiSMASH 4; ALDEx2; sourmash v2.0.0a4;
phytools v0.6-44; GhostKOALA; VirFinder v1.1; CompareM v0.0.23;
MEGAHIT v1.1.3; MetaWRAP v1.0; MaxBin v2.2.4; mltools v0.3.5;
RAxML v8.1.15; CD-HIT v4.7; tRNAscan-SE v2.0; INFERNAL v1.1.2; dRep v2.2
• Parameter settings? Configurations?
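The gap this slide points at (versions are listed, parameters are not) can be made visible in a machine-readable record. The following is a minimal sketch, not part of the talk: it parses a subset of the version list above into records and leaves `parameters` as `None` to mark settings that were not reported. The function name and record layout are assumptions made for illustration.

```python
import json

# A subset of the tool list from the slide, in its original "Name vX" form.
TOOLS = "SPAdes v3.10.0; MetaBAT v2.12.1; BWA v0.7.16; samtools v1.5; CheckM v1.0.7"


def parse_tools(listing):
    """Turn 'Name vX; Name vY' into a list of name/version records."""
    records = []
    for entry in listing.split(";"):
        # Split on the last ' v' so the tool name itself is left untouched.
        name, _, version = entry.strip().rpartition(" v")
        # parameters is None because the slide does not report any settings.
        records.append({"name": name, "version": version, "parameters": None})
    return records


records = parse_tools(TOOLS)
print(json.dumps(records, indent=2))
```

A record like this says which tools ran, but every `None` is a question a reader would still have to ask the authors.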
11. Community specific approaches …
Scharm M, Wendland F, Peters M, Wolfien M, Theile T, Waltemath D
SEMS, University of Rostock
zip-like file with a manifest & metadata
- Bundle files - Keep provenance
- Exchange data - Ship results
Bergmann, F.T. (2014). COMBINE archive and OMEX format: one file to share all information
to reproduce a modeling project. BMC bioinformatics,15(1), 1.
Combine Archive
Systems Biology
Systems Medicine
https://sems.unirostock.de/projects/combinearchive/
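The "zip-like file with a manifest & metadata" idea can be sketched in a few lines. This is an illustration only and does not produce the OMEX format: the real COMBINE archive defines its own manifest, whereas the `manifest.json` layout, the `bundle` function and the example file names below are assumptions.

```python
import io
import json
import zipfile


def bundle(files, metadata):
    """Pack files plus a manifest into an in-memory zip and return its bytes."""
    manifest = {
        "metadata": metadata,
        "entries": [{"path": path} for path in sorted(files)],
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # The manifest travels inside the archive, next to the files it lists.
        archive.writestr("manifest.json", json.dumps(manifest, indent=2))
        for path, content in files.items():
            archive.writestr(path, content)
    return buffer.getvalue()


# Hypothetical project content, for illustration only.
data = bundle(
    {"model.xml": "<model/>", "results/output.csv": "time,value\n0,1\n"},
    {"title": "Example modelling project"},
)

# A recipient reads the manifest first to learn what the bundle contains.
with zipfile.ZipFile(io.BytesIO(data)) as archive:
    names = archive.namelist()
    manifest = json.loads(archive.read("manifest.json"))
print(names)
```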
12. Research Objects
Bundled
together**
Digital
objects*
• PIDs
• Metadata
*Turning FAIR into reality: Final report and action plan from the European Commission expert group on FAIR data, Nov 2018
** Bechhofer et al (2013) Why linked data is not enough for scientists https://doi.org/10.1016/j.future.2011.08.004
13. Bechhofer et al (2013) Why linked data is not enough for scientists https://doi.org/10.1016/j.future.2011.08.004
Bechhofer et al (2010) Research Objects: Towards Exchange and Reuse of Digital Knowledge, https://eprints.soton.ac.uk/268555/
machine processable metadata
in common and specific to
different object types.
bundle together and relate
digital resources with their
context into a unit.
snapshot | cite | exchange
Research Object
Framework
14. Container
“Unbounded” Objects: External references to things
A Digital Package Object Type
composed of many interrelated
elements that bundles together and
relates digital resources of a scientific
investigation with context.
A Metadata Object
represents properties in
common across all research
artefact types, common
PIDs and metadata
Bigger on the
inside than
the outside
15. Archive formats to
encode the object
Container Profile
Manifest Construction Profile
Model & format for
constructing the manifest
Standards!
OADM
OAI
Manifest Content Profile
About the Object
What is in the Object
Tailored to the Object type
Validate: what is expected to be there
Domain
Ontologies
PROV
GitHub
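The role of a manifest content profile ("what is expected to be there") can be sketched as a check of a manifest against a list of expectations. This is a simplified illustration, not the Research Object specification: the `about` and `contents` keys, the profile layout and the type names are all assumptions.

```python
def validate(manifest, profile):
    """Return a list of problems; an empty list means the manifest fits the profile."""
    problems = []
    # "About the Object": metadata the profile requires.
    for field in profile["required_metadata"]:
        if field not in manifest.get("about", {}):
            problems.append(f"missing metadata: {field}")
    # "What is in the Object": content types the profile requires.
    present = {entry["type"] for entry in manifest.get("contents", [])}
    for expected in profile["required_types"]:
        if expected not in present:
            problems.append(f"missing content of type: {expected}")
    return problems


# Hypothetical profile tailored to a workflow object.
WORKFLOW_PROFILE = {
    "required_metadata": ["title", "creator"],
    "required_types": ["workflow", "example-input"],
}

manifest = {
    "about": {"title": "Variant calling"},
    "contents": [{"path": "workflow.cwl", "type": "workflow"}],
}

problems = validate(manifest, WORKFLOW_PROFILE)
print(problems)
```

Different object types would get different profiles, while the checking code stays the same.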
18. Describes computational
workflows to be portable,
scalable & interoperable
with different workflow
systems and
containerised tools
Bundles the CWL workflow
descriptions
Adds context, provenance,
examples, validation data …
Snapshots workflow.
Relates it to other objects -
studies, data collections,
SOPs and Lab protocols …
https://www.commonwl.org/
19. Description of tools, inputs and
outputs.
Ontology markup using EDAM
CWL files in GitHub
Or export from native
platforms
Bundles it all together
Example input files
Validation tests
Links to research study
Software components are
containerised to make them portable and
handle software dependencies
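To make "description of tools, inputs and outputs" concrete, here is a minimal sketch that builds a small CWL-style tool description as a Python dictionary and prints it as JSON. It is intended to follow the shape of a CWL v1.0 `CommandLineTool` with a `DockerRequirement` hint, but it has not been checked against a CWL validator here, and the container image name is a placeholder assumption.

```python
import json

# A tool description: what to run, what it takes in, what it gives back.
tool = {
    "cwlVersion": "v1.0",
    "class": "CommandLineTool",
    "baseCommand": "echo",
    # Containerising the tool is what makes it portable across platforms.
    # The image name is a placeholder chosen for this sketch.
    "hints": {"DockerRequirement": {"dockerPull": "debian:stable-slim"}},
    "inputs": {
        "message": {
            "type": "string",
            # Position 1 places the value as the first command-line argument.
            "inputBinding": {"position": 1},
        }
    },
    "outputs": [],
}

document = json.dumps(tool, indent=2)
print(document)
```

A research object would then bundle a description like this with example inputs, validation tests and links to the study it belongs to.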
20. Manifest
Annotations about the content of the manifest
SHACL
Create
Validate
Curate
Explore
https://view.commonwl.org/workflows/github.com/mnneveau/cancer-genomics-workflow/blob/master/detect_variants/detect_variants.cwl
For the
JSON fans…
21. For example: CWL Provenance
Data lineage and licence/citation tracking
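Data lineage and licence tracking can be sketched as a walk over "was derived from" records. This is an illustration of the idea, not the CWL provenance format: the file names, the licence assignments and both function names are assumptions.

```python
def lineage(item, derived_from):
    """Return every upstream ancestor of item, nearest first, without repeats."""
    seen = []
    queue = list(derived_from.get(item, []))
    while queue:
        current = queue.pop(0)
        if current not in seen:
            seen.append(current)
            queue.extend(derived_from.get(current, []))
    return seen


def licences_to_honour(item, derived_from, licences):
    """Collect the licences attached to anything the item was derived from."""
    return sorted({licences[a] for a in lineage(item, derived_from) if a in licences})


# Hypothetical derivation records for a small variant-calling run.
DERIVED_FROM = {
    "variants.vcf": ["aligned.bam", "reference.fa"],
    "aligned.bam": ["reads.fastq", "reference.fa"],
}
# Hypothetical licence assignments for the source data.
LICENCES = {"reads.fastq": "CC-BY-4.0", "reference.fa": "CC0-1.0"}

ancestors = lineage("variants.vcf", DERIVED_FROM)
required = licences_to_honour("variants.vcf", DERIVED_FROM, LICENCES)
print(ancestors, required)
```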
24. Inspect and replicate the
computational analytical
workflow to review and
approve the bioinformatics
Standardize exchange of
HTS workflows for
regulatory submissions
between FDA, pharma,
bioinformatics platform
providers and researchers
“Parametric
domain”
IEEE P2791 BioCompute Working Group
http://biocomputeobject.org
26. NIH Data Commons
Big data distributed over multiple locations,
Efficiently and safely moved on demand
ROs are verified collections of references
[Chard, et al 2016]
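"Verified collections of references" can be read as: the object holds pointers to remote data plus enough information to check what comes back. The sketch below illustrates that reading with a length and SHA-256 checksum per reference. It is not the format described by Chard et al.; the field names and the example.org URL are assumptions.

```python
import hashlib


def make_reference(url, content):
    """Describe remote content by location, size and checksum instead of copying it."""
    return {
        "url": url,
        "length": len(content),
        "sha256": hashlib.sha256(content).hexdigest(),
    }


def verify(reference, content):
    """Check that fetched bytes match what the reference promised."""
    return (
        len(content) == reference["length"]
        and hashlib.sha256(content).hexdigest() == reference["sha256"]
    )


# Hypothetical remote file; the bytes stand in for a real download.
original = b"ACGTACGT\n"
reference = make_reference("https://example.org/data/reads.fastq", original)

intact = verify(reference, original)
tampered = verify(reference, b"ACGTACGA\n")
print(intact, tampered)
```

Because only the references move with the object, large data can stay where it is until someone actually needs it.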
27. European Open Science Cloud Commons
Tools and Workflow
Collaboratories
RO-based
Workflow Commons
30. Acknowledgements
Stian Soiland-Reyes
Michael Crusoe
Rob Finn
Kyle Chard
Daniel Garijo
Barend Mons
Sean Bechhofer
Matthew Gamble
Raul Palma
Jun Zhao
Mark Robinson
Alan Williams
Norman Morrison
Tim Clark
Alejandra Gonzalez-Beltran
Philippe Rocca-Serra
Ian Cottam
Susanna Sansone
Kristian Garza
Catarina Martins
Iain Buchan
Carl Kesselman
Ian Foster
Vahan Simonyan
Ravi Madduri
Raja Mazumder
Gil Alterovitz
Denis Dean II
Durga Addepalli
Wouter Haak
Anita De Waard
Paul Groth
Oscar Corcho
CWL and RO communities
Project ID: 675728