Reproducibility
in Scientific Data Analysis
Samuel Lampa @smllmp
PhD Student
Pharmaceutical Bioinformatics at pharmb.io
with Assoc. Prof. Ola Spjuth @ola_spjuth
@ Dept. of Pharm. Biosci. / Uppsala University
Farmbio BioScience Seminar – Dec 16 2016
Structure of this talk
Reproducibility in Scientific Data Analysis …
● What is it?
● Why is it important?
● Why is it a problem?
● What can we do about it?
● What does pharmb.io do about it?
What is it?
“it” = reproducibility in scientific data analysis
reproducible ≠ replicable
reproducible ≠ correct
Why is it important?
● Data generation is increasingly automated
→ More and more focus on data analysis
● Culture of replicability not (yet) as established
in computational as in classical disciplines
● “it is the only thing that an investigator can
guarantee about a study”
simplystatistics.org/2014/06/06/the-real-reason-reproducible-research-is-important
Why is it a problem?
wet lab data analysis?
Why is it a problem?
● Complexity of computing environment
– Software versions, Data versions ...
● More black box components
● Assumptions about the computing
environment often left out
● Manual steps often left out
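One small way to make assumptions about the computing environment explicit is to save a snapshot of it next to the results. The sketch below is only an illustration, not something shown in the talk: the function name `environment_snapshot` and the package names in the demo are placeholders, and a real project would list the packages it actually depends on.

```python
import json
import platform
import sys
from importlib import metadata


def environment_snapshot(packages):
    """Return a dict with the Python version, the OS, and package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record missing packages explicitly instead of skipping them.
            versions[name] = None
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }


if __name__ == "__main__":
    # "numpy" and "pandas" are placeholder names for this illustration.
    snapshot = environment_snapshot(["numpy", "pandas"])
    print(json.dumps(snapshot, indent=2, sort_keys=True))
```

Storing this output together with each result file means a later reader does not have to guess which software versions were in use.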
What can we do about it?
Utopia: Infrastructure for all data and
computations to be inspected and re-run
with other data and parameters by anyone
But: We can’t wait for that
In the meantime: Even small steps towards
reproducibility will help. Start today!
General themes
Know exactly what data and results mean
Know exactly how results were obtained
Be able to get the same result independently
More concretely ...
Know exactly what data and results mean
– Open standards, Ontologies, Data formats
Know exactly how results were obtained
– Keeping track of manual steps, parameters, versions of
software and data ...
– Version control
– Automation (scripts)
Be able to get the same result independently
– Code, data, and scripts: make it all available!
Sandve GK, Nekrutenko A, Taylor J, Hovig E. Ten Simple Rules for
Reproducible Computational Research. PLoS Comput Biol.
2013;9(10):1-4. dx.doi.org/10.1371/journal.pcbi.1003285
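As a hedged illustration of "know exactly how results were obtained", the sketch below stores a checksum of the input data, the parameters, and a fixed random seed together with the result. Everything in it is an assumption made for the example: the function names, the one-number-per-line input format, and the toy analysis (the mean of a random subsample) are not taken from the talk or from the cited paper.

```python
import hashlib
import json
import os
import random
import tempfile


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def analyse(input_path, params):
    """Toy analysis: mean of a seeded random subsample, plus its provenance."""
    with open(input_path) as handle:
        values = [float(line) for line in handle if line.strip()]
    rng = random.Random(params["seed"])  # fixed seed makes the run repeatable
    sample = rng.sample(values, params["sample_size"])
    return {
        "input_sha256": sha256_of_file(input_path),
        "params": params,
        "result": sum(sample) / len(sample),
    }


if __name__ == "__main__":
    # Write a small throwaway input file so the example is self-contained.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("1\n2\n3\n4\n5\n")
        path = tmp.name
    record = analyse(path, {"seed": 42, "sample_size": 3})
    print(json.dumps(record, indent=2, sort_keys=True))
    os.remove(path)
```

Running the analysis twice on the same file with the same parameters gives an identical record, and a changed input file shows up as a changed checksum.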
FAIR Principles
for data and metadata
F - Findable
A - Accessible
I - Interoperable
R - Reusable
Wilkinson MD, Dumontier M, Aalbersberg IjJ, et al.
The FAIR Guiding Principles for scientific data management and
stewardship. Sci Data. 2016;3:160018. doi:10.1038/sdata.2016.18.
What does pharmb.io do about it?
● Open data, open source, open standards
Promoting and using as much as possible
● BioImg.org
Store Virtual Machines & Containers
● Semantic Data Technologies
Machine readability - Avoiding ambiguity
● Re-runnable computational experiments
Via workflows, containers, infrastructure as code
O’Boyle NM, Guha R, Willighagen EL, et al.
Open Data, Open Source and Open Standards in chemistry: The
Blue Obelisk five years on. J Cheminform. 2011;3(10):1-16.
doi:10.1186/1758-2946-3-37
BioImg.org
Dahlö M, Haziza F, Kallio A, Korpelainen E, Bongcam-Rudloff E, Spjuth O.
BioImg.org: A catalog of virtual machine images for the life sciences.
Bioinform Biol Insights. 2015;9:125-128. doi:10.4137/BBI.S28636.
Martin Dahlö
Semantic Data Technologies
Lampa S, Willighagen E, Kohonen P, King A, Vrandečić D, Grafström R,
Spjuth O. RDFIO: Extending Semantic MediaWiki for interoperable
biomedical data management. J Biomed Sem. Submitted.
Re-runnable experiments
via containers
(and infrastructure as code)
Marco Capuccini
github.com/kubenow/KubeNow
github.com/mcapuccini/SparkNow
Re-runnable experiments
via workflows
Lampa S, Alvarsson J, Spjuth O. Towards agile large-scale predictive modelling in
drug discovery with flow-based programming design principles.
J Cheminform. 2016;8(1):67. doi:10.1186/s13321-016-0179-6.
Thank you
pharmb.io
