The document summarizes a proposed method called "Continuous Analysis" that aims to improve reproducibility in computational research. Continuous Analysis combines continuous integration (CI) practices with computational research to automatically verify reproducibility. It proposes using Docker containers to capture computational environments and CI tools like Drone to automatically rebuild images and rerun analyses on code changes, flagging any differences in results. An experiment applying Continuous Analysis to two genomic analyses demonstrated it could more easily detect changes between versions.
Reproducibility of computational workflows is automated using continuous analysis
1. Reproducibility of computational workflows is automated using continuous analysis
Brett K Beaulieu-Jones, Casey S Greene
Nature Biotechnology, vol.35, No.4, pp.342-346, 2017.
April 20th, 2017
Ph.D. Student Kento Aoyama
Akiyama Laboratory
Department of Computer Science, School of Computing
Tokyo Institute of Technology
2. Nature Biotechnology
• Top Scientific Journal in
biological, biomedical, agricultural
and environmental sciences
• 2-year IF: 43.113 (2016)
• For comparison: Nature, IF = 38.138 (2016)
Source: http://www.nature.com/npg_/company_info/journal_metrics.html
Journal Information 2
nature biotechnology, April 2017, vol.35 no.4
3. Brett K Beaulieu-Jones¹, Casey S Greene²
1. Genomics and Computational Biology Graduate Group,
Perelman School of Medicine,
University of Pennsylvania
(Twitter: @beaulieujones)
2. Department of Systems Pharmacology and Translational
Therapeutics,
Perelman School of Medicine,
University of Pennsylvania
(Twitter: @GreeneScientist)
Authors Information 3
4. Target Problem
Reproducibility of computational research
Proposed Method
Continuous Integration + Computational Research
= Continuous Analysis
Continuous Analysis can automatically
verify the research reproducibility
• Easy to reproduce, review, and cooperate
What is the value of this research ? 4
[GitHub] https://greenelab.github.io/continuous_analysis/
6. Research reproducibility is crucial for science
However, 90% of surveyed researchers acknowledged a reproducibility crisis [1]
Background | Reproducibility Crisis 6
[1] Baker, M. 1,500 scientists lift the lid on reproducibility. Nature 533, 452–454 (2016).
7. Reproducibility Problems
• lack of experimental details
• data, parameters, code, etc.
• lack of information about the machine environment
• software versions, libraries, operating systems, etc.
Computational research should be reproducible
Background | Reproducibility Spectrum 7
Peng, R.D. Reproducible research in computational science.
Science 334, 1226–1227 (2011).
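The missing environment information listed above can be recorded alongside the results. The following is a minimal Python sketch, not part of the original paper; the function name `capture_environment` and the example package names are assumptions chosen for illustration.

```python
import json
import platform
from importlib import metadata


def capture_environment(packages):
    """Return a dict describing the interpreter, OS, and selected package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "python": platform.python_version(),
        "os": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
    }


if __name__ == "__main__":
    # "pip" and "numpy" are example package names only (assumption).
    print(json.dumps(capture_environment(["pip", "numpy"]), indent=2, sort_keys=True))
```

Saving this record next to each result file is one lightweight way to document software versions; a container image, discussed later in the slides, captures the environment more completely.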
8. Background | Reproducibility in Biology 8
18 articles published in Nature Genetics (2005, 2006)
• could not be reproduced (10 articles, 56%)
• could be reproduced with discrepancies (6 articles, 33%)
Ioannidis, J.P.A. et al. “Repeatability of published microarray gene expression
analyses”, Nat. Genet. 41, 149–155 (2009)
10. Survey of Differential Gene Expression Research
• Probe information is necessary for reproduction
• A probe is an oligonucleotide of a specific sequence
used to measure transcript expression levels
BrainArray Custom CDF [1]
• A popular source of probe set description files
• Published and maintained by Dai, M. et al. [1]
• Knowing the Custom CDF version lets readers verify the exact probe set definitions
The authors analyzed 200 articles that cited Dai, M. et al. [1].
Reproducibility on RNA-Analysis 10
[1] Dai, M. et al. Evolving gene/transcript definitions significantly alter the
interpretation of GeneChip data. Nucleic Acids Res. 33, e175 (2005).
11. Reporting of Custom CDF in articles 11
a) Most recent 100 articles
51% of articles did NOT report the version of the Custom CDF
b) Most highly cited 100 articles
64% of articles did NOT report the version of the Custom CDF
cannot download (14 Nov. 2016)
12. How different versions affect the analysis result
To measure the effects,
• download different versions of the Custom CDF
• use the same data set
• normal HeLa cells and HeLa cells
in which TIA1 and TIAR (TIAL1) were knocked down
Comparing the results
• same source code
• same data set
• different versions of BrainArray Custom CDF (18, 19, 20)
• different versions of software packages
Effects on Analysis Result 12
13. Figure 2a. differential gene expression
analysis of HeLa cells 13
Each version identified a
different number of significantly
altered genes.
• e.g.) 15 genes were identified as
significant in version 19,
but not in version 18.
…
Analysis results are
NOT reproducible without
the exact versions of the software
and dataset
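The comparison between versions amounts to set differences over the genes each version calls significant. A minimal Python sketch follows; the gene names and sets are hypothetical placeholders, not results from the paper, and `compare_versions` is a name introduced here for illustration.

```python
from itertools import combinations


def compare_versions(significant_by_version):
    """For each pair of versions, list genes called significant in only one of them."""
    report = {}
    for a, b in combinations(sorted(significant_by_version), 2):
        genes_a = significant_by_version[a]
        genes_b = significant_by_version[b]
        report[(a, b)] = {
            "shared": sorted(genes_a & genes_b),
            f"only_{a}": sorted(genes_a - genes_b),
            f"only_{b}": sorted(genes_b - genes_a),
        }
    return report


# Hypothetical gene sets for illustration only; not data from the paper.
example = {
    "v18": {"GENE_A", "GENE_B"},
    "v19": {"GENE_A", "GENE_B", "GENE_C"},
    "v20": {"GENE_A", "GENE_C"},
}
for pair, diff in compare_versions(example).items():
    print(pair, diff)
```

Any non-empty `only_...` list signals that the two versions disagree on which genes are significantly altered.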
14. Figure 2b. container-based approaches 14
Using Docker[1] containers
improves reproducibility
• Docker can create an “image”
that contains the software environment
• Docker allows users to run the
exact same apps in any environment
Using Docker containers allowed
versions to be matched and
produced the same results.
[1] https://www.docker.com
15. • Docker is useful for reproducible workflow
• same versions of software
• same version of dataset
• isolation from host OS software environment
• Image tags are useful for managing software
releases and paper revisions.
Supplementary Information
• An overview of Docker (container virtualization) is attached at the end of these slides.
Docker for reproducible workflow 15
17. Resolving the Reproducibility Problem
To avoid version mismatches in data & software
• Docker can share an executable container
that contains the data & software
But sometimes we need to upgrade the software.
Then the results must be checked again.
Automatic verification is needed.
An automatic & verifiable software development
approach
Continuous Integration (CI)
Continuous Analysis 17
18. Continuous Integration (CI)[1]
• is a software engineering practice for fast development
• automatically builds, runs tests, and produces analytics,
triggered by the version control system (e.g. git)
About Continuous Integration 18
[1] Booch, Grady (1991). Object Oriented Design: With Applications. Benjamin
Cummings. p. 209. ISBN 9780805300918.
[2] Travis CI, https://travis-ci.org/
e.g.) Travis CI[2] badge
19. 1. A developer pushes commits to the repository
2. Test script is executed automatically on CI service
3. Test result is generated automatically
e.g.) Travis CI 19
e.g.) https://github.com/galaxyproject/galaxy
20. e.g.) CI on Product Development 20
figure: https://developer.xamarin.com/guides/cross-platform/ci/intro_to_ci/
e.g.) Xamarin Test Cloud
21. Docker provides environment reproducibility
• same version of dataset
• same version of software
• easy to build the environment (Dockerfile)
• easy to share the environment (Docker Hub)
• Continuous Analysis can verify
reproducibility of computational research
• automatically tests the reproducibility
• automatically updates results
Continuous Analysis 21
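One way to picture "automatically tests the reproducibility" is to compare checksums of freshly generated outputs against those recorded for the previous run. The sketch below is an illustration under that assumption and is not the authors' implementation; all function names and file names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def build_manifest(output_dir):
    """Map each file under output_dir (by relative path) to its digest."""
    root = Path(output_dir)
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_manifests(expected, actual):
    """Return sorted relative paths that were added, removed, or changed."""
    common = set(expected) & set(actual)
    return {
        "added": sorted(set(actual) - set(expected)),
        "removed": sorted(set(expected) - set(actual)),
        "changed": sorted(k for k in common if expected[k] != actual[k]),
    }


# Demo with throwaway files standing in for analysis outputs.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    (out / "table.csv").write_text("gene,p\nA,0.01\n")
    expected = build_manifest(out)  # manifest recorded for the previous run
    (out / "table.csv").write_text("gene,p\nA,0.02\n")  # a result changed
    (out / "figure.txt").write_text("new output")  # a new output appeared
    report = diff_manifests(expected, build_manifest(out))
print(report)
```

Byte-level comparison is deliberately strict: outputs that embed timestamps or depend on unseeded randomness will be flagged even when the analysis is otherwise unchanged, so such files may need a looser comparison.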
23. Workflow 23
1. Push source code changes
2. (Generate the base Docker image from Dockerfile)
3. Read parameters and commands from YAML files
• Users can describe and execute arbitrary commands using YAML
e.g.) pre-processing, data-analysis, etc.
4. Generate the outputs to another branch
• result data, figures, logs (managed in VCS)
5. Update the latest Docker Image
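Steps 3 and 4 of the workflow above can be sketched in Python as a small runner that executes a list of commands in order and stores their logs. This is only an illustration: the `PIPELINE` list stands in for a parsed YAML file, its keys `name` and `command` are hypothetical, and a real CI service such as Drone would run the commands inside the Docker image rather than on the host.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Stand-in for a parsed YAML file; the keys "name" and "command" are hypothetical.
PIPELINE = [
    {"name": "preprocess", "command": [sys.executable, "-c", "print('cleaned 3 samples')"]},
    {"name": "analyze", "command": [sys.executable, "-c", "print('analysis done')"]},
]


def run_pipeline(steps, output_dir):
    """Run each step in order, saving its stdout to <output_dir>/<name>.log."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    logs = {}
    for step in steps:
        # check=True raises an error and stops the pipeline at the first failing step.
        result = subprocess.run(step["command"], capture_output=True, text=True, check=True)
        (out / f"{step['name']}.log").write_text(result.stdout)
        logs[step["name"]] = result.stdout.strip()
    return logs


# Demo: write logs to a temporary directory instead of a results branch.
with tempfile.TemporaryDirectory() as tmp:
    logs = run_pipeline(PIPELINE, tmp)
    written = sorted(p.name for p in Path(tmp).iterdir())
print(logs, written)
```

In the workflow described on this slide, the output directory would then be committed to a separate branch so that results, figures, and logs are tracked in version control.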
24. Drone
• Continuous Integration Open Source Software
• https://github.com/drone/drone
• Easy to set up using a Docker container
• (almost same as other CI services)
GitHub
• Online Git Repository
• BitBucket and GitLab are also available
System Components 24
26. The authors introduced this system into their own work
• “Denoising Autoencoders for Phenotype Stratification
(DAPS): Preprint Release”
• http://doi.org/10.5281/zenodo.46165
They ran two example analyses:
• a phylogenetic tree–building analysis
• an RNA-seq differential expression analysis
(detailed information is in the Online Methods)
Experiments 26
28. • Continuous analysis provides verifiable scientific
software in a fully specified environment
• easy to get a reproducible environment using Docker
• the environment is automatically kept up-to-date
• It allows reviewers, editors and readers to assess
reproducibility without a large time commitment
Discussion | Conclusion 28
29. • It may be impractical to run it on
large computational analyses at every commit
• A cloud computing environment can resolve this,
but it requires auto-provisioning skills
• It is possible to skip CI steps using a registered phrase
• It does not address reproducibility in the broader
sense:
• robustness of results to parameter settings
• starting conditions
• partitions in the data
(these are not target of this research)
Discussion | Limitations 29
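Skipping CI with a registered phrase can be pictured as a simple check on the commit message. The sketch below is hypothetical: the phrases `[skip ci]` and `[ci skip]` are assumed examples, and the actual phrase depends on the CI tool and its configuration.

```python
# Assumed skip phrases for illustration; consult the CI tool's documentation for the real ones.
SKIP_PHRASES = ("[skip ci]", "[ci skip]")


def should_run_analysis(commit_message, skip_phrases=SKIP_PHRASES):
    """Return False when the commit message contains any skip phrase (case-insensitive)."""
    message = commit_message.lower()
    return not any(phrase.lower() in message for phrase in skip_phrases)


print(should_run_analysis("Fix typo in README [skip ci]"))
print(should_run_analysis("Update normalization step"))
```

A phrase like this lets authors avoid rerunning an expensive analysis for commits that cannot affect the results, such as documentation changes.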
30. Linux Container
• virtualizes host resources as containers
• Filesystem, hostname, IPC, PID, Network, User, etc.
• can be used like Virtual Machines
Linux Kernel Features
• Containers share the same host kernel
• namespace[1], chroot, cgroup, SELinux, etc.
Container-based Virtualization 30
[1] E. W. Biederman. “Multiple instances of the global Linux namespaces.”,
In Proceedings of the 2006 Ottawa Linux Symposium, 2006.
[Figure: a machine whose Linux kernel space is shared by two containers, each running two processes]
31. Docker [1]
• Most popular Linux Container management platform
• Many useful components and services
Linux Container Management Tools 31
[1] Solomon Hykes and others. “What is Docker?” - https://www.docker.com/what-docker
[2] W. Bhimji, S. Canon, D. Jacobsen, L. Gerhardt, M. Mustafa, and J. Porter, “Shifter : Containers for
HPC,” Cray User Group, pp. 1–12, 2016.
[3] “Singularity” - http://singularity.lbl.gov/
[Figure: container management tools labeled [1], [2], and [3]]
32. Easy container sharing – Docker Hub 32
Portability & Reproducibility
• Easy to share the application environment via Docker Hub
• Containers can be executed on other host machine
[Figure: a Dockerfile (apt-get install …, wget …, …, make) generates an image (App, Bins/Libs) on an Ubuntu host running Docker Engine. The image is shared via Docker Hub (push / pull) and runs as a container (App, Bins/Libs) on both the Ubuntu host and a CentOS host running Docker Engine.]
34. Linux Container – Performance [1] 34
[1] W. Felter, A. Ferreira, R. Rajamony, and J. Rubio, “An updated performance comparison of virtual
machines and Linux containers,” IEEE International Symposium on Performance Analysis of Systems and
Software, pp.171-172, 2015. (IBM Research Report, RC25482 (AUS1407-001), 2014.)
[Figure: bar chart of performance ratio relative to native (axis 0.00 to 1.00) for Native, Docker, KVM, and KVM-tuned on three benchmarks: PXZ [MB/s], Linpack [GFLOPS], and Random Access [GUPS]. Bar labels shown: 0.96, 1.00, 0.98, 0.78, 0.83, 0.99, 0.82, 0.98.]