Stephen D. Turner, Ph.D.
Bioinformatics Core Director
University of Virginia School of Medicine
bioinformatics.virginia.edu
@strnr
Tools for Improving Rigor &
Reproducibility in Bioinformatics
Slides: bit.ly/madssci2018repro
We Are in the Middle of a
New Movement in Genomics
• Genomics/bioinformatics is advancing at a grueling pace
- New questions
- New study designs
- New technologies, new [fill-in-the-blank]-seq
• New movements have:
- Leaders / method developers / early adopters
- First followers
- Everybody else
• New technology leads to more reproducibility issues
CORES!
Reproducibility is hard!
• Genomics data is too large and high-dimensional
to easily inspect or visualize.
• Workflows involve multiple steps and it's hard
to inspect every step.
• Unlike in the wet lab, we don't always know
what to expect of our genomics data analysis.
• It can be hard to distinguish good from bad
results.
Reproducibility: What's in it for you?
• Your future self will thank you
- Re-running analysis with different parameters
- Re-running analysis with new data
- Documentation
• Faster/cheaper
- Modular workflows
- Reusable code chunks
• Makes collaboration with others easier
"Robust research is about doing small things that
stack the deck in your favor to prevent mistakes."
–Vince Buffalo, author of Bioinformatics Data Skills (2015).
Obstacles to Reproducibility (a non-comprehensive list)
1. Bioinformatics software
2. Pipeline / workflow management
3. Documentation
4. Data / code sharing
Bioinformatics Software
Bioinformatics Software
• Bioinformatics software implements complex algorithms.
- Dozens of parameters, endless permutations
- Defaults not always optimal
• Perception:
ACACTCGCATCCGCACATCGCACTA
GGTCAGCATACGCCGACTCCGACCG
GCGCTATCGCCAGCGGAAATCGCAA
• Reality: Software is written by smart people, but:
- Not software engineers
- Not using good practice (version control,
modularization, commentary, testing)
- Unable to offer long-term maintenance / support
- Focus on graduating / publishing, not support
- Not always easily available
• Missing or incomplete documentation
• Distribution is missing files
• Missing third party package
• Dependencies failed to build
• Runtime error
• Internal compiler error
• My last week:
- samtools: error while loading shared libraries: libbz2.so.1.0: cannot open shared object file
- error while loading shared libraries: libz.so.1: failed to map segment from shared object: Operation not permitted
- /lib64/libc.so.6: version `GLIBC_2.14' not found
https://twitter.com/ianholmes/status/288689712636493824
Package managers
• Linux: apt-get, yum
• Mac OS: homebrew, macports
• Windows: ?????
• Cross-platform: ?????
Conda
• Cross-platform package manager: Win, Mac, Linux
• Language agnostic (can be used to install C/C++,
Fortran, Go, R, Python, Perl, Java, etc.).
• User-installable – no admin/root privileges needed.
• Each package is described by a recipe that defines its
dependencies and a build script that installs it.
• Channels: conda provides many common packages by
default. Additional channels add more.
• Isolated environments
- Versions and tools can be managed per-project
- Avoids conflicts and version incompatibilities between projects
- Environments can be shared via simple text files
Conda: Main commands
• conda create -n <environment>
• source activate <environment> (conda activate <environment> in conda 4.4 and later)
• conda search <package>
• conda install <package>
• conda upgrade <package> (alias of conda update)
• conda uninstall <package> (alias of conda remove)
Conda: example
• Create a new environment named madssci:
    conda create -n madssci
• Activate that environment:
    source activate madssci
• Install some packages:
    conda install blast bioconductor-flowcore
• Install a particular version:
    conda install samtools=0.1.19
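The environment built in the example above can be written down as one of the "simple text files" conda uses for sharing. A minimal sketch in Python, assuming the usual environment-file layout (name, channels, dependencies). The channel list and the render_environment helper are illustrative assumptions, not part of the talk:

```python
def render_environment(name, channels, packages):
    """Return the text of a conda environment file.

    packages maps a package name to a pinned version string,
    or to None when no version is pinned.
    """
    lines = [f"name: {name}", "channels:"]
    lines += [f"  - {channel}" for channel in channels]
    lines.append("dependencies:")
    for package, version in sorted(packages.items()):
        spec = f"{package}={version}" if version else package
        lines.append(f"  - {spec}")
    return "\n".join(lines) + "\n"


# The environment name and packages come from the example above.
# The channel list is an assumption made for illustration.
spec = render_environment(
    name="madssci",
    channels=["conda-forge", "bioconda", "defaults"],
    packages={
        "blast": None,
        "bioconductor-flowcore": None,
        "samtools": "0.1.19",
    },
)
print(spec)
```

Saved as environment.yml, text like this can be passed to conda env create -f environment.yml to rebuild the environment on another machine.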
Bioconda
• bioconda.github.io
• Bioconda is a channel for the conda package manager
• Repository for more than 3,000 bioinformatics
packages ready to use with conda install
• >250 contributors have added/updated recipes
• Preprint: Grüning, Björn, et al. "Bioconda: A sustainable
and comprehensive software distribution for the life
sciences." bioRxiv (2017): 207092.
https://www.biorxiv.org/content/early/2017/10/27/207092
• See also: "Nature TechBlog: Bioconda Promises to Ease
Bioinformatics Software Installation Woes"
http://blogs.nature.com/naturejobs/2017/11/03/techblog-bioconda-promises-to-ease-bioinformatics-software-installation-woes/
Docker
• docker.com
• Lightweight virtualization technology
• Package software with all of its dependencies into an isolated "container"
• Containers have everything needed to run: code, system tools & libraries
• Like VMs: portable, which supports reproducibility.
• Unlike VMs: containers virtualize the OS instead of the hardware, which makes them
more efficient and more portable: near-native performance, near-instant startup, small
images, easy to share.
• https://www.docker.com/what-container
• https://blog.docker.com/2016/03/containers-are-not-vms/
Containers are an abstraction at
the app layer that packages code
and dependencies together.
Multiple containers can run on the
same machine and share the OS
kernel with other containers, each
running as isolated processes in
user space. Containers take up
less space than VMs (container
images are typically tens of MBs in
size), and start almost instantly.
Virtual machines (VMs) are an
abstraction of physical hardware
turning one server into many
servers. The hypervisor allows
multiple VMs to run on a single
machine. Each VM includes a full
copy of an operating system, one
or more apps, necessary binaries
and libraries - taking up tens of
GBs. VMs can also be slow to
boot.
Pipeline / workflow
management
Pipeline / Workflow Management
• Bioinformatics data analysis: series of steps
involving many different programs tied together
with file-based inputs and outputs. E.g.:
Pipeline / Workflow Management
• Simple solution: simple (bash) script
- List of commands
- Pros: quick, easy, portable, universal
- Cons: not scalable, no re-entry / partial execution,
assumes dependency availability, difficult / no
parallelization
• Workflow management systems
- Make (installed on most systems)
- Snakemake
- Nextflow
- Galaxy
- Many more: github.com/pditommaso/awesome-pipeline
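The re-entry / partial execution that a plain script lacks comes down to one rule, which Make applies to files: rerun a step only when its output is missing or older than its inputs. A minimal Python sketch of that rule follows. The step functions and file names are hypothetical stand-ins for real tools:

```python
import tempfile
from pathlib import Path


def needs_update(target, sources):
    """True if the target is missing or older than any of its sources."""
    target = Path(target)
    if not target.exists():
        return True
    target_mtime = target.stat().st_mtime
    return any(Path(s).stat().st_mtime > target_mtime for s in sources)


def run_step(target, sources, action, log):
    """Run action only when needed and record the decision in log."""
    if needs_update(target, sources):
        action(sources, target)
        log.append(f"ran {Path(target).name}")
    else:
        log.append(f"skipped {Path(target).name}")


def uppercase(sources, target):
    # Hypothetical first step: normalize the input text.
    Path(target).write_text(Path(sources[0]).read_text().upper())


def count_lines(sources, target):
    # Hypothetical second step: summarize the normalized file.
    total = sum(len(Path(s).read_text().splitlines()) for s in sources)
    Path(target).write_text(f"{total}\n")


log = []
with tempfile.TemporaryDirectory() as workdir:
    reads = Path(workdir) / "reads.txt"
    reads.write_text("acgt\nttga\n")
    upper = Path(workdir) / "upper.txt"
    counts = Path(workdir) / "counts.txt"
    steps = [(upper, [reads], uppercase), (counts, [upper], count_lines)]
    for _ in range(2):  # the second pass finds every output up to date
        for target, sources, action in steps:
            run_step(target, sources, action, log)
    result = counts.read_text()

print(log)
```

Workflow managers layer more on top of this (dependency graphs, cluster submission, containers), but the skip-if-up-to-date check is what lets an interrupted run pick up where it stopped.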
Nextflow
• nextflow.io
• Di Tommaso, Paolo, et al. "Nextflow enables reproducible
computational workflows." Nature biotechnology 35.4
(2017): 316-319.
• Features:
- Free
- Actively developed
- Supports docker containers
- Easy parallelization, implicitly defined
- Continuous checkpointing & resumed execution
- Easily portable across architectures (SGE, LSF,
SLURM, PBS, Amazon AWS, ...).
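Implicit parallelization relies on per-sample tasks being independent of one another. A rough Python sketch of that idea using a thread pool follows. It is not how Nextflow is implemented, and the sample names and sequences are made up for illustration:

```python
from concurrent.futures import ThreadPoolExecutor


def gc_fraction(sequence):
    """Fraction of G and C bases in one sequence."""
    return sum(base in "GC" for base in sequence) / len(sequence)


# Hypothetical per-sample inputs; no task needs another task's result.
samples = {
    "sample_a": "ACACTCGCAT",
    "sample_b": "GGTCAGCATA",
    "sample_c": "GCGCTATCGC",
}

# Each sample is handed to the pool as its own task.
with ThreadPoolExecutor(max_workers=3) as pool:
    values = pool.map(gc_fraction, samples.values())
    results = dict(zip(samples, values))

print(results)
```

A workflow manager makes the same decision from the declared inputs and outputs of each step, and dispatches the tasks as separate processes or cluster jobs rather than threads.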
Beware of Pipelineitis
• “Pipelines” can kill your creativity and force
you to think too rigidly.
• Don’t “pipeline” too early, if at all.
• Does it even need to be pipeline-ified?
• Who’s running it?
- You, once: don’t pipeline-ify. Document, move along.
- You, 2-5 times: documented script?
- You, 10+ times: consider pipeline-ifying.
- Others: create sharable pipeline
• See: Loman & Watson. "So you want to be a
computational biologist?" Nat Biotechnol 31
(2013): 996-998.
Documentation
Dynamic Documentation: RMarkdown
• R: widely used for data science & bioinformatics
• Markdown: a simple markup language that allows you to
render structured/formatted documents from plain text.
• RMarkdown: embeds R code in a Markdown
document.
- Write documents that execute embedded code and
integrate the results into the final report.
- Allows you to keep code and documentation together.
- Easily re-render the document, re-running analysis
and re-incorporating results on the fly.
- Many output formats: PDF, DOCX, HTML, EPUB, ...
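The re-rendering idea can be sketched without R. Below is a minimal Python analogue, assuming a report whose numbers are filled in from a fresh computation each time it is rendered. The template text and the render function are hypothetical, and RMarkdown itself works on R code chunks rather than string templates:

```python
from statistics import mean
from string import Template

# Narrative text with placeholders where computed results belong.
REPORT = Template(
    "# Read length summary\n\n"
    "We analyzed $n reads with a mean length of $mean_len bp.\n"
)


def render(read_lengths):
    """Recompute the statistics and fill them into the report text."""
    return REPORT.substitute(
        n=len(read_lengths),
        mean_len=f"{mean(read_lengths):.1f}",
    )


first = render([100, 150, 125])
second = render([100, 150, 125, 75])  # new data, same document
print(first)
print(second)
```

Because the numbers are never typed by hand, the text cannot drift out of sync with the analysis when the data or parameters change.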
[Figure: write plain text document, embed R code, rendered output report]
[Figure: rendered reports using output: pdf_document and output: word_document]
Jupyter notebooks
• jupyter.org
• Jupyter: open source project to
develop software, standards,
services across many languages
• Jupyter notebook: free
application to create documents
containing live code,
visualizations, narrative text.
- Supports >40 programming
languages
- Easily shared
- Interactive output
- Multi-user versions for
companies, classrooms, labs
Data / code sharing
Sharing Code
• State of the art, early 2000s:
- "Data/code available upon request"
- "Code available on <lab website>"
- None of the above
• Schultheiss, Sebastian J., et al. "Persistence and
availability of web services in computational
biology." PloS one 6.9 (2011): e24914.
- Surveyed ~1000 web services published in NAR
2003-2009
- ~30% unavailable
- ~80% developed by students / non-permanent
researchers
• Russell, Pamela H., et al. "A large-scale analysis of
bioinformatics code on GitHub." bioRxiv (2018):
321919.
• github.com is becoming the de facto standard for
archiving and sharing code
Sharing any research output
• figshare.com
- Free
- Upload any file format
- Get a DOI
- 5 GB max file size
- 20 GB private space
- Unlimited public space
- Launched 2012
- Hosted on S3, multiple redundant copies
- SLA: 10 yr persistence
• zenodo.org
- Free
- Upload any file format
- Get a DOI
- 50 GB per record
- Higher quota by request
- Unlimited records
- Launched 2013
- Hosted at CERN (est. 1954), with defined program of ≥20 years
- about.zenodo.org/policies/
- about.zenodo.org/principles/
• osf.io
- Free
- Upload any file format
- Get a DOI
- 5 GB per file
- Connect to any external storage provider
- Launched 2013
- Preservation fund guaranteeing 50+ years of persistent availability
- osf.io/faq
• Slides: bit.ly/madssci2018repro, doi.org/10.5281/zenodo.1255003
Other Resources
• Wilson, et al. "Good enough practices in scientific computing." PLoS
computational biology 13.6 (2017): e1005510.
https://doi.org/10.1371/journal.pcbi.1005510
• Wilson, et al. "Best practices for scientific computing." PLoS
biology 12.1 (2014): e1001745.
https://doi.org/10.1371/journal.pbio.1001745
Other Resources
• 2017: Ten simple rules for making research software more robust:
http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005412
• 2017: Ten simple rules for responsible big data research:
http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005399
• 2017: Ten Simple Rules to Enable Multi-site Collaborations through Data Sharing:
http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005278
• 2016: Ten Simple Rules for Digital Data Storage:
http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005097
• 2016: Ten Simple Rules for Effective Statistical Practice:
http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004961
• 2015: Ten Simple Rules for Creating a Good Data Management Plan:
http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004525
• 2015: Ten Simple Rules for Experiments’ Provenance:
http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004384
• 2015: Ten Simple Rules for a Computational Biologist’s Laboratory Notebook:
http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004385
• 2015: Ten Simple Rules for Reducing Overoptimistic Reporting in Methodological Computational Research:
http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004191
• 2014: Ten Simple Rules for the Care and Feeding of Scientific Data:
http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003542
• 2014: Ten Simple Rules for Effective Computational Research:
http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003506
• 2013: Ten Simple Rules for Reproducible Computational Research:
http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285
• 2012: Ten Simple Rules for the Open Development of Scientific Software:
http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002802
• 2014: Ten Simple Rules for Writing a PLOS Ten Simple Rules Article:
http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003858
• collections.plos.org/ten-simple-rules
Other Resources
• Baker, Monya. “1,500 Scientists Lift the Lid on Reproducibility.” Nature News, vol.
533, no. 7604, May 2016, p. 452. www.nature.com, doi:10.1038/533452a.
• Grüning, Björn, et al. “Practical Computational Reproducibility in the Life Sciences.”
BioRxiv, Oct. 2017, p. 200683. www.biorxiv.org, doi:10.1101/200683.
• Leek, Jeff. "A Few Things That Would Reduce Stress around Reproducibility/Replicability
in Science." Simply Statistics, November 2017:
https://simplystatistics.org/2017/11/21/rr-sress/.
• Mesirov, Jill P. “Accessible Reproducible Research.” Science, vol. 327, no. 5964, Jan.
2010, pp. 415–16. science.sciencemag.org, doi:10.1126/science.1179653.
• Munafò, Marcus R., et al. “A Manifesto for Reproducible Science.” Nature Human
Behaviour, vol. 1, no. 1, Jan. 2017, p. 0021. www.nature.com, doi:10.1038/s41562-016-0021.
• Patil, Prasad, et al. “A Statistical Definition for Reproducibility and Replicability.”
BioRxiv, July 2016, p. 066803. www.biorxiv.org, doi:10.1101/066803.
• Russell, Pamela, et al. “A Large-Scale Analysis of Bioinformatics Code on GitHub.”
BioRxiv, May 2018, p. 321919. www.biorxiv.org, doi:10.1101/321919.
• Schultheiss, Sebastian J., et al. “Persistence and Availability of Web Services in
Computational Biology.” PLOS ONE, vol. 6, no. 9, Sept. 2011, p. e24914. PLoS
Journals, doi:10.1371/journal.pone.0024914.
Thank you
bit.ly/madssci2018repro
doi.org/10.5281/zenodo.1255003

Tools for Improving Rigor & Reproducibility in Bioinformatics

  • 1. Stephen D. Turner, Ph.D. Bioinformatics Core Director University of Virginia School of Medicine bioinformatics.virginia.edu @strnr Tools for Improving Rigor & Reproducibility in Bioinformatics Slides: bit.ly/madssci2018repro
  • 2. We Are in the Middle of a New Movement in Genomics • Genomics/bioinformatics advancing at grueling pace - New questions - New study designs - New technologies, new [fill-in-the-blank]-seq • New movements have: - Leaders / method developers / early adopters - First followers - Everybody else • New technology leads to more reproducibility issues 2 CORES!
  • 3. Reproducibility is hard! • Genomics data is too large and high dimensional to easily inspect or visualize. • Workflows involve multiple steps and it's hard to inspect every step. • Unlike in the wet lab, we don't always know what to expect of our genomics data analysis. • It can be hard to distinguish good from bad results. 3
  • 4. 4
  • 5. Reproducibility:
 What's in it for you? • Your future self will thank you - Re-running analysis with different parameters - Re-running analysis with new data - Documentation • Faster/cheaper - Modular workflows - Reusable code chunks • Makes collaboration with others easier 5 "Robust research is about doing small things that stack the deck in your favor to prevent mistakes." –Vince Buffalo, author of Bioinformatics Data Skills (2015).
  • 6. Obstacles to Reproducibility (a non-comprehensive list)
      1. Bioinformatics software
      2. Pipeline / workflow management
      3. Documentation
      4. Data / code sharing
  • 8. Bioinformatics Software
      • Bioinformatics software implements complex algorithms.
        - Dozens of parameters, endless permutations
        - Defaults not always optimal
      • Perception: (slide graphic) ACACTCGCATCCGCACATCGCACTA GGTCAGCATACGCCGACTCCGACCG GCGCTATCGCCAGCGGAAATCGCAA
  • 9. Bioinformatics Software
      • Bioinformatics software implements complex algorithms.
        - Dozens of parameters, endless permutations
        - Defaults not always optimal
      • Reality: Software is written by smart people, but:
        - Not software engineers
        - Not using good practice (version control, modularization, commentary, testing)
        - Unable to offer long-term maintenance / support
        - Focus on graduating / publishing, not support
        - Not always easily available
  • 10.
      • Missing or incomplete documentation
      • Distribution is missing files
      • Missing third-party package
      • Dependencies failed to build
      • Runtime error
      • Internal compiler error
      • My last week:
        - samtools: error while loading shared libraries: libbz2.so.1.0: cannot open shared object file
        - error while loading shared libraries: libz.so.1: failed to map segment from shared object: Operation not permitted
        - /lib64/libc.so.6: version `GLIBC_2.14' not found
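Errors like the "cannot open shared object file" messages above can often be diagnosed before running a tool. The following is a minimal sketch, assuming a Linux system where `ldd` is available (on macOS the rough equivalent is `otool -L`). The helper name `missing_libs` is hypothetical and not part of the slides.

```shell
#!/usr/bin/env bash
# missing_libs PROGRAM: list shared libraries the dynamic loader cannot
# resolve for PROGRAM. Prints nothing when every dependency is found, or when
# ldd cannot inspect the file (for example a script or a static binary).
missing_libs() {
  local program
  program=$(command -v "$1") || { echo "not on PATH: $1" >&2; return 1; }
  ldd "$program" 2>/dev/null | grep 'not found' || true
}

# Any program name on PATH works here; on the slide it would be samtools.
missing_libs sh

# On glibc-based systems the first line reports the installed C library
# version, which is what a "GLIBC_2.14 not found" error is complaining about.
ldd --version 2>/dev/null | head -n 1
```

A line such as `libbz2.so.1.0 => not found` in the output points at the library to install or to add to the loader path.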
  • 12. Package managers
      • Linux: apt-get, yum
      • Mac OS: homebrew, macports
      • Windows: ?????
      • Cross-platform: ?????
  • 13. Conda
      • Cross-platform package manager: Win, Mac, Linux
      • Language agnostic (can be used to install C/C++, Fortran, Go, R, Python, Perl, Java, etc.).
      • User-installable: no admin/root privileges needed.
      • Describes each package with a recipe that defines its dependencies and a build script that installs it.
      • Channels: conda provides many common packages by default. Additional channels add more.
      • Isolated environments
        - Versions and tools can be managed per-project
        - No conflicts or version incompatibilities between projects
        - Environments can be shared via simple text files
  • 14. Conda: Main commands
      • conda create -n <environment>
      • source activate <environment> (on conda 4.4 and later: conda activate <environment>)
      • conda search <package>
      • conda install <package>
      • conda upgrade <package>
      • conda uninstall <package>
  • 15. Conda: example
      • Create a new environment named madssci:
        conda create -n madssci
      • Activate that environment:
        source activate madssci
      • Install some packages:
        conda install blast bioconductor-flowcore
      • Install a particular version:
        conda install samtools=0.1.19
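The Conda slide says environments can be shared via simple text files. A minimal sketch of what that might look like for the madssci environment follows. The file name `environment.yml` and the channel list are assumptions, and whether these exact pins still resolve against current channels is not something the slides confirm.

```shell
# Write an environment file by hand. An existing environment can also be
# exported with:  conda env export -n madssci > environment.yml
cat > environment.yml <<'EOF'
name: madssci
channels:            # assumed channel setup, not stated on the slide
  - conda-forge
  - bioconda
dependencies:
  - blast
  - bioconductor-flowcore
  - samtools=0.1.19
EOF

# A collaborator could then recreate and enter the environment with:
#   conda env create -f environment.yml
#   source activate madssci      # "conda activate madssci" on conda >= 4.4
cat environment.yml
```

Keeping this file under version control next to the analysis code records which tool versions produced a given result.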
  • 16. Bioconda
      • bioconda.github.io
      • Bioconda is a channel for the conda package manager
      • Repository for more than 3,000 bioinformatics packages ready to use with conda install
      • >250 contributors have added/updated recipes
      • Preprint: Grüning, Björn, et al. "Bioconda: A sustainable and comprehensive software distribution for the life sciences." bioRxiv (2017): 207092. https://www.biorxiv.org/content/early/2017/10/27/207092
      • See also: "Nature TechBlog: Bioconda Promises to Ease Bioinformatics Software Installation Woes" http://blogs.nature.com/naturejobs/2017/11/03/techblog-bioconda-promises-to-ease-bioinformatics-software-installation-woes/
  • 17. Docker
      • docker.com
      • Lightweight virtualization technology
      • Package software with all of its dependencies into an isolated "container"
      • Containers have everything needed to run: code, system tools & libraries
      • Like VMs: portable = reproducibility!
      • Unlike VMs: containers virtualize the OS instead of the hardware = more efficient, more portable. Near-native performance, instant startup, small images. Easy to share.
      • https://www.docker.com/what-container
      • https://blog.docker.com/2016/03/containers-are-not-vms/
      • Containers are an abstraction at the app layer that packages code and dependencies together. Multiple containers can run on the same machine and share the OS kernel with other containers, each running as isolated processes in user space. Containers take up less space than VMs (container images are typically tens of MBs in size), and start almost instantly.
      • Virtual machines (VMs) are an abstraction of physical hardware turning one server into many servers. The hypervisor allows multiple VMs to run on a single machine. Each VM includes a full copy of an operating system, one or more apps, necessary binaries and libraries, taking up tens of GBs. VMs can also be slow to boot.
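To make "package software with all of its dependencies" concrete, here is a minimal sketch of a Dockerfile that bakes one pinned tool into an image. The base image `continuumio/miniconda3`, the image tag `madssci-samtools`, and the file `example.bam` are assumptions for illustration. The slides do not prescribe this setup, and the pinned version may or may not still resolve.

```shell
# Write a small Dockerfile describing the image contents.
cat > Dockerfile <<'EOF'
# Base image that ships conda (assumed; any image providing conda would do)
FROM continuumio/miniconda3
# Install a pinned tool version so every container built from this file matches
RUN conda install -y -c conda-forge -c bioconda samtools=0.1.19
# Default command when a container starts
CMD ["samtools"]
EOF

# Build the image, then run the tool in a throwaway container. Both commands
# need a running Docker daemon, so they are shown here as comments:
#   docker build -t madssci-samtools .
#   docker run --rm -v "$PWD":/data madssci-samtools samtools view -H /data/example.bam
cat Dockerfile
```

The Dockerfile is itself a plain text file, so it can be versioned and shared alongside the analysis code.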
  • 19. Pipeline / Workflow Management
      • Bioinformatics data analysis: a series of steps involving many different programs tied together with file-based inputs and outputs. E.g.: [figure not included in transcript]
  • 20. Pipeline / Workflow Management
      • Simple solution: simple (bash) script
        - List of commands
        - Pros: quick, easy, portable, universal
        - Cons: not scalable, no re-entry / partial execution, assumes dependency availability, difficult / no parallelization
      • Workflow management systems
        - Make (installed on most systems)
        - Snakemake
        - Nextflow
        - Galaxy
        - Many more: github.com/pditommaso/awesome-pipeline
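The "no re-entry / partial execution" drawback of a plain bash script can be softened with a small guard around each step. This is a minimal sketch, not a feature of any tool named on the slide. The helper `run_step` and the file names are hypothetical, and `sort` and `uniq` stand in for real analysis programs.

```shell
#!/usr/bin/env bash
set -euo pipefail

# run_step OUTPUT COMMAND...: run COMMAND only when OUTPUT does not exist yet.
# Output goes to a temporary file first, so a step that fails halfway never
# leaves a partial OUTPUT behind that a later run would mistake for finished.
run_step() {
  local output=$1
  shift
  if [ -e "$output" ]; then
    echo "skip  $output (already exists)"
    return 0
  fi
  echo "run   $output"
  "$@" > "$output.tmp"
  mv "$output.tmp" "$output"
}

workdir=$(mktemp -d)
printf 'ACGT\nGGCC\nACGT\n' > "$workdir/reads.txt"

# Two file-based steps, each consuming the previous step's output.
run_step "$workdir/sorted.txt" sort "$workdir/reads.txt"
run_step "$workdir/unique.txt" uniq -c "$workdir/sorted.txt"

# Re-running a finished step is a no-op, which gives crude re-entry.
run_step "$workdir/sorted.txt" sort "$workdir/reads.txt"
```

This only checks whether an output exists. It does not notice changed inputs, track dependencies, or parallelize, which is the gap that Make (via timestamps) and the other workflow managers listed above are meant to fill.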
  • 21. Nextflow
      • nextflow.io
      • Di Tommaso, Paolo, et al. "Nextflow enables reproducible computational workflows." Nature biotechnology 35.4 (2017): 316-319.
      • Features:
        - Free
        - Actively developed
        - Supports docker containers
        - Easy parallelization, implicitly defined
        - Continuous checkpointing & resumed execution
        - Easily portable across architectures (SGE, LSF, SLURM, PBS, Amazon AWS, ...).
  • 22. Beware of Pipelineitis
      • “Pipelines” can kill your creativity and force you to think too rigidly.
      • Don’t “pipeline” too early, if at all.
      • Does it even need to be pipeline-ified?
      • Who’s running it?
        - You, once: don’t pipeline-ify. Document, move along.
        - You, 2-5 times: documented script?
        - You, 10+ times: consider pipeline-ifying.
        - Others: create a sharable pipeline
      • See: Loman & Watson. "So you want to be a computational biologist?" Nat Biotechnol 31 (2013): 996-998.
  • 24. Dynamic Documentation: RMarkdown
      • R: widely used for data science & bioinformatics
      • Markdown: a simple markup language that allows you to render structured/formatted documents from plain text.
      • RMarkdown: embeds R code in a Markdown document.
        - Write documents that execute embedded code and integrate the results into the final report.
        - Allows you to keep code and documentation together.
        - Easily re-render the document, re-running analysis and re-incorporating results on the fly.
        - Many output formats: PDF, DOCX, HTML, EPUB, ...
  • 25. [Figure: RMarkdown workflow. Write plain text document; embed R code; rendered output report]
  • 27. Jupyter notebooks
      • jupyter.org
      • Jupyter: open source project to develop software, standards, services across many languages
      • Jupyter notebook: free application to create documents containing live code, visualizations, narrative text.
        - Supports >40 programming languages
        - Easily shared
        - Interactive output
        - Multi-user versions for companies, classrooms, labs
  • 28. Data / code sharing
  • 29. Sharing Code
      • State of the art, early 2000s
        - "Data/code available upon request"
        - "Code available on <lab website>"
        - None of the above
      • Schultheiss, Sebastian J., et al. "Persistence and availability of web services in computational biology." PloS one 6.9 (2011): e24914.
        - Surveyed ~1000 web services published in NAR 2003-2009
        - ~30% unavailable
        - ~80% developed by students / non-permanent researchers
      • Russell, Pamela H., et al. "A large-scale analysis of bioinformatics code on GitHub." bioRxiv (2018): 321919.
      • github.com is becoming the de facto standard for archiving and sharing code
  • 30. Sharing any research output
      • figshare.com
        - Free
        - Upload any file format
        - Get a DOI
        - 5 GB max file size
        - 20 GB private space
        - Unlimited public space
        - Launched 2012
        - Hosted on S3, multiple redundant copies
        - SLA: 10 yr persistence
      • zenodo.org
        - Free
        - Upload any file format
        - Get a DOI
        - 50 GB per record
        - Higher quota by request
        - Unlimited records
        - Launched 2013
        - Hosted at CERN (est 1954), with defined program of ≥20 years
        - about.zenodo.org/policies/
        - about.zenodo.org/principles/
      • osf.io
        - Free
        - Upload any file format
        - Get a DOI
        - 5 GB per file
        - Connect to any external storage provider
        - Launched 2013
        - Preservation fund guaranteeing 50+ years of persistent availability
        - osf.io/faq
  • 32. Other Resources
      • Wilson, et al. "Good enough practices in scientific computing." PLoS computational biology 13.6 (2017): e1005510. https://doi.org/10.1371/journal.pcbi.1005510
      • Wilson, et al. "Best practices for scientific computing." PLoS biology 12.1 (2014): e1001745. https://doi.org/10.1371/journal.pbio.1001745
  • 33. Other Resources
      • 2017: Ten simple rules for making research software more robust: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005412
      • 2017: Ten simple rules for responsible big data research: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005399
      • 2017: Ten Simple Rules to Enable Multi-site Collaborations through Data Sharing: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005278
      • 2016: Ten Simple Rules for Digital Data Storage: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005097
      • 2016: Ten Simple Rules for Effective Statistical Practice: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004961
      • 2015: Ten Simple Rules for Creating a Good Data Management Plan: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004525
      • 2015: Ten Simple Rules for Experiments’ Provenance: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004384
      • 2015: Ten Simple Rules for a Computational Biologist’s Laboratory Notebook: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004385
      • 2015: Ten Simple Rules for Reducing Overoptimistic Reporting in Methodological Computational Research: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004191
      • 2014: Ten Simple Rules for the Care and Feeding of Scientific Data: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003542
      • 2014: Ten Simple Rules for Effective Computational Research: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003506
      • 2013: Ten Simple Rules for Reproducible Computational Research: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285
      • 2012: Ten Simple Rules for the Open Development of Scientific Software: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002802
      • 2014: Ten Simple Rules for Writing a PLOS Ten Simple Rules Article: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003858
      • collections.plos.org/ten-simple-rules
  • 34. Other Resources
      • Baker, Monya. “1,500 Scientists Lift the Lid on Reproducibility.” Nature News, vol. 533, no. 7604, May 2016, p. 452. www.nature.com, doi:10.1038/533452a.
      • Grüning, Björn, et al. “Practical Computational Reproducibility in the Life Sciences.” BioRxiv, Oct. 2017, p. 200683. www.biorxiv.org, doi:10.1101/200683.
      • Leek, Jeff. “A Few Things That Would Reduce Stress around Reproducibility/Replicability in Science.” Simply Statistics, November 2017: https://simplystatistics.org/2017/11/21/rr-sress/.
      • Mesirov, Jill P. “Accessible Reproducible Research.” Science, vol. 327, no. 5964, Jan. 2010, pp. 415–16. science.sciencemag.org, doi:10.1126/science.1179653.
      • Munafò, Marcus R., et al. “A Manifesto for Reproducible Science.” Nature Human Behaviour, vol. 1, no. 1, Jan. 2017, p. 0021. www.nature.com, doi:10.1038/s41562-016-0021.
      • Patil, Prasad, et al. “A Statistical Definition for Reproducibility and Replicability.” BioRxiv, July 2016, p. 066803. www.biorxiv.org, doi:10.1101/066803.
      • Russell, Pamela, et al. “A Large-Scale Analysis of Bioinformatics Code on GitHub.” BioRxiv, May 2018, p. 321919. www.biorxiv.org, doi:10.1101/321919.
      • Schultheiss, Sebastian J., et al. “Persistence and Availability of Web Services in Computational Biology.” PLOS ONE, vol. 6, no. 9, Sept. 2011, p. e24914. PLoS Journals, doi:10.1371/journal.pone.0024914.
  • 35. Stephen D. Turner, Ph.D. Bioinformatics Core Director University of Virginia School of Medicine bioinformatics.virginia.edu @strnr Thank you bit.ly/madssci2018repro doi.org/10.5281/zenodo.1255003