Reproducibility of computational
workflows is automated using
continuous analysis
Brett K Beaulieu-Jones, Casey S Greene
Nature Biotechnology, vol.35, No.4, pp.342-346, 2017.
April 20th, 2017
Ph.D. Student Kento Aoyama
Akiyama Laboratory
Department of Computer Science, School of Computing
Tokyo Institute of Technology
Nature Biotechnology
• Top scientific journal in the
biological, biomedical, agricultural
and environmental sciences
• 2-year IF: 43.113 (2016)
• cf. Nature, IF = 38.138 (2016)
Source: http://www.nature.com/npg_/company_info/journal_metrics.html
Journal Information 2
nature biotechnology, April 2017, vol.35 no.4
Brett K Beaulieu-Jones1, Casey S Greene2
1. Genomics and Computational Biology Graduate Group,
Perelman School of Medicine,
University of Pennsylvania
(Twitter: @beaulieujones)
2. Department of Systems Pharmacology and Translational
Therapeutics,
Perelman School of Medicine,
University of Pennsylvania
(Twitter: @GreeneScientist)
Authors Information 3
Target Problem
Reproducibility of computational research
Proposed Method
Continuous Integration + Computational Research
= Continuous Analysis
Continuous Analysis can automatically
verify the research reproducibility
• Easy to reproduce, review, and collaborate
What is the value of this research? 4
[GitHub] https://greenelab.github.io/continuous_analysis/
1. Background
2. Result (Survey)
3. Proposed Method (Architecture)
4. Experiments
5. Discussion, Conclusion
Outline 5
Research reproducibility is crucial for science
But 90% of surveyed researchers acknowledged a reproducibility crisis [1]
Background | Reproducibility Crisis 6
[1] Baker, M. 1,500 scientists lift the lid on reproducibility. Nature 533, 452–454 (2016).
Reproducibility Problems
• missing details of the experiment
• data, parameters, code, etc.
• missing information about the machine environment
• software versions, libraries, operating systems, etc.
Computational research should be reproducible
Background | Reproducibility Spectrum 7
Peng, R.D. Reproducible research in computational science.
Science 334, 1226–1227 (2011).
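The machine environment details named above (software versions, libraries, operating system) can be recorded next to the results. The following is a minimal sketch using only the Python standard library; the function name `capture_environment` and the example package name are illustrative assumptions and do not come from the paper.

```python
import json
import platform
import sys
from importlib import metadata


def capture_environment(packages):
    """Return a dict describing the interpreter, OS, and selected package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the absence explicitly instead of silently skipping it.
            versions[name] = None
    return {
        "python": sys.version.split()[0],
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "packages": versions,
    }


if __name__ == "__main__":
    # "pip" is only an example name; list whatever the analysis imports.
    print(json.dumps(capture_environment(["pip"]), indent=2, sort_keys=True))
```

Saving this record with every result makes it possible to tell later which versions produced a given figure or table.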
Background | Reproducibility in Biology 8
18 articles published in Nature Genetics (2005, 2006)
• could not be reproduced (10 articles, 56%)
• could be reproduced with discrepancies (6 articles, 33%)
Ioannidis, J.P.A. et al. “Repeatability of published microarray gene expression
analyses”, Nat. Genet. 41, 149–155 (2009)
Result (Survey)
9
Survey of Differential Gene Expression Research
• Probe information is necessary for reproduction
• a probe is an oligonucleotide of a specific sequence
that is used to measure transcript expression levels
BrainArray Custom CDF [1]
• A popular source of probe set description files
• Published and maintained by Dai, M. et al. [1]
• The Custom CDF version pins down the exact probe set definitions
The authors analyzed 200 articles that cited Dai, M. et al. [1].
Reproducibility on RNA-Analysis 10
[1] Dai, M. et al. Evolving gene/transcript definitions significantly alter the
interpretation of GeneChip data. Nucleic Acids Res. 33, e175 (2005).
Reporting of Custom CDF in articles 11
a) Most recent 100 articles
51% of articles did NOT report the version of the Custom CDF
b) Most highly cited 100 articles
64% of articles did NOT report the version of the Custom CDF
cannot download (14 Nov. 2016)
How different versions affect the analysis result
To measure the effects,
• download different versions of the Custom CDF
• use the same data set
• normal HeLa cells and HeLa cells
in which TIA1 and TIAR (TIAL1) were knocked down
Comparing the results
• same source code
• same data set
• different versions of BrainArray Custom CDF (18, 19, 20)
• different versions of software packages
Effects on Analysis Result 12
Figure 2a. differential gene expression
analysis of HeLa cells 13
Each version identified
a different number of significantly
altered genes.
• e.g.) 15 genes were identified as
significant in version 19,
but not in version 18.
…
Analysis results are
NOT reproducible without
the exact versions of the software
and data set
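The comparison described above (same code and data, different Custom CDF versions) comes down to set differences between the lists of significant genes. Below is a small sketch of that bookkeeping. The gene identifiers are placeholders chosen for illustration and are NOT results from the paper, and `compare_versions` is a hypothetical helper name.

```python
def compare_versions(significant_by_version, old, new):
    """Return genes gained, lost, and shared when moving from `old` to `new`."""
    old_genes = significant_by_version[old]
    new_genes = significant_by_version[new]
    return {
        "gained": sorted(new_genes - old_genes),
        "lost": sorted(old_genes - new_genes),
        "shared": sorted(old_genes & new_genes),
    }


# Placeholder identifiers only; these are not data from the paper.
significant = {
    18: {"GENE_A", "GENE_B", "GENE_C"},
    19: {"GENE_B", "GENE_C", "GENE_D"},
}

diff = compare_versions(significant, old=18, new=19)
print(diff)
```

A non-empty "gained" or "lost" list signals that the annotation version alone changed the conclusions.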
Figure 2b. container-based approaches 14
Using Docker[1] containers
improves reproducibility
• Docker can create an “image”
that contains the software environment
• Docker allows users to run the
exact same applications in any environment
Using Docker containers allowed
versions to be matched and
produced the same result.
[1] https://www.docker.com
• Docker is useful for reproducible workflow
• same versions of software
• same version of dataset
• isolation from host OS software environment
• Image tags are useful for managing software
releases and paper revisions.
Supplementary Information
• An overview of Docker (container virtualization) is attached at the end of these slides.
Docker for reproducible workflow 15
Proposed Method
16
Resolving the Reproducibility Problem
To avoid version mismatches in data & software
• Docker can share an executable container
that contains the data & software
But sometimes the software needs to be upgraded.
The results then have to be re-checked.
Automatic verification is needed.
An automatic & verifiable software development
approach
Continuous Integration (CI)
Continuous Analysis 17
Continuous Integration (CI)[1]
• is a software engineering practice for fast development
• automatically builds, runs tests, and produces analytics,
triggered by the version control system (e.g. git)
About Continuous Integration 18
[1] Booch, Grady (1991). Object Oriented Design: With Applications. Benjamin
Cummings. p. 209. ISBN 9780805300918. Retrieved 2014-08-18.
[2] Travis CI, https://travis-ci.org/
e.g.) Travis CI[2] badge
1. A developer pushes commits to the repository
2. The test script is executed automatically on the CI service
3. The test result is generated automatically
e.g.) Travis CI 19
e.g.) https://github.com/galaxyproject/galaxy
e.g.) CI on Product Development 20
figure: https://developer.xamarin.com/guides/cross-platform/ci/intro_to_ci/
e.g.) Xamarin Test Cloud
Docker provides environment reproducibility
• same version of dataset
• same version of software
• easy to build the environment (Dockerfile)
• easy to share the environment (Docker Hub)
• Continuous Analysis can verify
reproducibility of computational research
• automatically tests the reproducibility
• automatically updates results
Continuous Analysis 21
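One simple way to test reproducibility automatically is to compare the outputs of a rerun with digests recorded from an earlier run. The sketch below assumes byte-identical outputs are the criterion, which is stricter than what the slides state and may not suit figures that embed timestamps. `digest` and `changed_outputs` are hypothetical helper names, and the file contents are placeholders.

```python
import hashlib
import tempfile
from pathlib import Path


def digest(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def changed_outputs(output_dir, recorded):
    """List files whose digest differs from, or is missing in, `recorded`."""
    changed = []
    for path in sorted(Path(output_dir).iterdir()):
        if path.is_file() and recorded.get(path.name) != digest(path):
            changed.append(path.name)
    return changed


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    # Placeholder output file standing in for a real analysis result.
    (out / "table.csv").write_text("gene,p\nGENE_A,0.01\n")
    recorded = {"table.csv": digest(out / "table.csv")}
    first_check = changed_outputs(out, recorded)    # nothing has changed yet
    (out / "table.csv").write_text("gene,p\nGENE_A,0.02\n")
    second_check = changed_outputs(out, recorded)   # the rerun wrote a different table
    print(first_check, second_check)
```

A CI job could fail the build, or flag the commit for review, whenever the returned list is not empty.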
Fig.3 Continuous Analysis Workflow 22
Workflow 23
1. Push source code changes
2. (Generate the base Docker image from Dockerfile)
3. Read parameters and commands from YAML files
• Users can describe and execute arbitrary commands using YAML
e.g.) pre-processing, data-analysis, etc.
4. Generate the outputs to another branch
• result data, figures, logs (managed in VCS)
5. Update the latest Docker Image
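The core of step 3 above is running a list of commands in order and stopping when one fails. Here is a rough sketch of that loop. It is not Drone's implementation: the step names and commands are hypothetical stand-ins for what a YAML file would list, and parsing the YAML itself is left out.

```python
import subprocess
import sys


def run_steps(steps):
    """Run each (name, argv) step in order; stop at the first failure.

    Returns a list of (name, return_code, captured_stdout) tuples.
    """
    results = []
    for name, argv in steps:
        proc = subprocess.run(argv, capture_output=True, text=True)
        results.append((name, proc.returncode, proc.stdout))
        if proc.returncode != 0:
            break  # later steps depend on earlier ones, so do not continue
    return results


# Hypothetical steps; a real setup would read these from the YAML file.
steps = [
    ("pre-process", [sys.executable, "-c", "print('cleaned')"]),
    ("tests", [sys.executable, "-c", "import sys; sys.exit(1)"]),
    ("analysis", [sys.executable, "-c", "print('never reached')"]),
]

results = run_steps(steps)
for name, code, _ in results:
    print(name, "ok" if code == 0 else "failed")
```

Because the second step exits with a non-zero status, the analysis step never runs, which mirrors how a failing test blocks result publication.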
Drone
• Continuous Integration Open Source Software
• https://github.com/drone/drone
• Easy to set up using a Docker container
• (works almost the same as other CI services)
GitHub
• Online Git Repository
• BitBucket and GitLab are also available
System Components 24
.drone.yml Example Configuration
https://greenelab.github.io/continuous_analysis/
https://github.com/greenelab/continuous_analysis/blob/master/.drone.yml
Example of YAML file 25
# choose the base docker image
image: brettbj/continuous_analysis_base
script:
  # run pre-process
  # run tests
  # perform analysis
  # publish results
publish:
  docker:
    # docker details
The authors introduced this system into their own work
• “Denoising Autoencoders for Phenotype Stratification
(DAPS): Preprint Release”
• http://doi.org/10.5281/zenodo.46165
They ran 2 example analyses:
• a phylogenetic tree–building analysis
• an RNA-seq differential expression analysis
(detailed information is in the Online Methods)
Experiments 26
Experiments Result (Fig.4) 27
Changes in the output figures are easy to compare
• Continuous analysis provides verifiable scientific
software in a fully specified environment
• a reproducible environment is easy to obtain using Docker
• the environment is automatically kept up to date
• It allows reviewers, editors and readers to assess
reproducibility without a large time commitment
Discussion | Conclusion 28
• It may be impractical to run it at every commit
for computationally expensive analyses
• A cloud computing environment can resolve this,
but it requires auto-provisioning skills
• CI steps can be skipped using a registered phrase
• It does not address reproducibility in the broader
sense:
• robustness of results to parameter settings
• starting conditions
• partitions in the data
(these are outside the scope of this research)
Discussion | Limitations 29
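The skip mechanism mentioned above can be pictured as a simple check for a phrase before the analysis starts. The sketch below assumes the phrase appears in the commit message and that it is "[ci skip]" or "[skip ci]". Both points are assumptions made for illustration; the phrases a given CI service recognizes should be checked in its documentation.

```python
# Assumed skip phrases; verify them against your CI service.
SKIP_PHRASES = ("[ci skip]", "[skip ci]")


def should_run_analysis(commit_message, skip_phrases=SKIP_PHRASES):
    """Return False when the commit message contains a skip phrase."""
    lowered = commit_message.lower()
    return not any(phrase in lowered for phrase in skip_phrases)


print(should_run_analysis("Fix typo in README [ci skip]"))
print(should_run_analysis("Update normalization step"))
```

This keeps documentation-only commits from triggering an expensive rerun while still rerunning everything else.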
Linux Container
• virtualizes host resources as containers
• Filesystem, hostname, IPC, PID, Network, User, etc.
• can be used like virtual machines
Linux Kernel Features
• Containers share the same host kernel
• namespace[1], chroot, cgroup, SELinux, etc.
Container-based Virtualization 30
[1] E. W. Biederman. “Multiple instances of the global Linux namespaces.”,
In Proceedings of the 2006 Ottawa Linux Symposium, 2006.
[Figure: one machine with a single Linux kernel space hosting two containers, each running two processes]
Docker [1]
• Most popular Linux Container management platform
• Many useful components and services
Linux Container Management Tools 31
[1] Solomon Hykes and others. “What is Docker?” - https://www.docker.com/what-docker
[2] W. Bhimji, S. Canon, D. Jacobsen, L. Gerhardt, M. Mustafa, and J. Porter, “Shifter : Containers for
HPC,” Cray User Group, pp. 1–12, 2016.
[3] “Singularity” - http://singularity.lbl.gov/
[Figure: logos of the container tools cited as [1], [2], and [3]]
Easy container sharing – Docker Hub 32
Portability & Reproducibility
• Easy to share the application environment via Docker Hub
• Containers can be executed on other host machines
[Figure: a Dockerfile (apt-get install …, wget …, …, make) generates an image (App, Bins/Libs) on an Ubuntu host running Docker Engine; the image is shared by pushing it to Docker Hub and pulling it on a CentOS host running Docker Engine, where it runs as a container]
AUFS (Advanced multi layered unification filesystem) [1]
• AUFS is Docker's default filesystem
• Layers can be reused in other container images
• AUFS helps software reproducibility
Docker - Filesystem 33
[1] Advanced multi layered unification filesystem. http://aufs.sourceforge.net, 2014.
Docker Container (image)
f49eec89601e 129.5 MB ubuntu:16.04 (base image)
366a03547595 39.85 MB
ef122501292c 3.6 MB
e50c89716342 15.4 KB
5aec9aa5462c 1.17 MB
0d3cccd04bdb 1.07 MB
tag: beta
tag: version-1.0
tag: version-1.0.2
tag: version-1.1
tag: latest
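Layer reuse means that two images built on the same lower layers store those layers only once. The sketch below illustrates the saving with four of the layer sizes from the listing above. The grouping of layers into "image-a" and "image-b" is an assumption made for illustration, and 15.4 KB is written as 0.0154 MB using decimal units.

```python
# Layer sizes in MB, taken from the listing above (15.4 KB written as 0.0154 MB).
LAYER_MB = {
    "f49eec89601e": 129.5,
    "366a03547595": 39.85,
    "ef122501292c": 3.6,
    "e50c89716342": 0.0154,
}

# Assumed composition for illustration: each image is a stack of layers.
IMAGES = {
    "image-a": ["f49eec89601e", "366a03547595", "ef122501292c"],
    "image-b": ["f49eec89601e", "366a03547595", "e50c89716342"],
}


def naive_size(images):
    """Total size if every image stored its own copy of every layer."""
    return sum(LAYER_MB[layer] for layers in images.values() for layer in layers)


def shared_size(images):
    """Total size when identical layers are stored only once."""
    unique = {layer for layers in images.values() for layer in layers}
    return sum(LAYER_MB[layer] for layer in unique)


print(f"without sharing: {naive_size(IMAGES):.2f} MB")
print(f"with sharing:    {shared_size(IMAGES):.2f} MB")
```

The difference between the two totals is exactly the size of the layers the images have in common.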
Linux Container – Performance [1] 34
[1] W. Felter, A. Ferreira, R. Rajamony, and J. Rubio, “An updated performance comparison of virtual
machines and Linux containers,” IEEE International Symposium on Performance Analysis of Systems and
Software, pp.171-172, 2015. (IBM Research Report, RC25482 (AUS1407-001), 2014.)
[Figure: bar chart of performance ratio (relative to native) for PXZ [MB/s], Linpack [GFLOPS], and Random Access [GUPS]; series: Native, Docker, KVM, KVM-tuned; data labels as extracted: 0.96, 1.00, 0.98, 0.78, 0.83, 0.99, 0.82, 0.98]

More Related Content

What's hot

KVM and docker LXC Benchmarking with OpenStack
KVM and docker LXC Benchmarking with OpenStackKVM and docker LXC Benchmarking with OpenStack
KVM and docker LXC Benchmarking with OpenStackBoden Russell
 
Why everyone is excited about Docker (and you should too...) - Carlo Bonamic...
Why everyone is excited about Docker (and you should too...) -  Carlo Bonamic...Why everyone is excited about Docker (and you should too...) -  Carlo Bonamic...
Why everyone is excited about Docker (and you should too...) - Carlo Bonamic...Codemotion
 
The Lies We Tell Our Code (#seascale 2015 04-22)
The Lies We Tell Our Code (#seascale 2015 04-22)The Lies We Tell Our Code (#seascale 2015 04-22)
The Lies We Tell Our Code (#seascale 2015 04-22)Casey Bisson
 
Docker 初探,實驗室中的運貨鯨
Docker 初探,實驗室中的運貨鯨Docker 初探,實驗室中的運貨鯨
Docker 初探,實驗室中的運貨鯨Ruoshi Ling
 
Docker Tips And Tricks at the Docker Beijing Meetup
Docker Tips And Tricks at the Docker Beijing MeetupDocker Tips And Tricks at the Docker Beijing Meetup
Docker Tips And Tricks at the Docker Beijing MeetupJérôme Petazzoni
 
當專案漸趕,當遷移也不再那麼難 (Ship Your Projects with Docker EcoSystem)
當專案漸趕,當遷移也不再那麼難 (Ship Your Projects with Docker EcoSystem)當專案漸趕,當遷移也不再那麼難 (Ship Your Projects with Docker EcoSystem)
當專案漸趕,當遷移也不再那麼難 (Ship Your Projects with Docker EcoSystem)Ruoshi Ling
 
Introduction to Docker, December 2014 "Tour de France" Bordeaux Special Edition
Introduction to Docker, December 2014 "Tour de France" Bordeaux Special EditionIntroduction to Docker, December 2014 "Tour de France" Bordeaux Special Edition
Introduction to Docker, December 2014 "Tour de France" Bordeaux Special EditionJérôme Petazzoni
 
Docker Warsaw Meetup 12/2017 - DockerCon 2017 Recap
Docker Warsaw Meetup 12/2017 - DockerCon 2017 RecapDocker Warsaw Meetup 12/2017 - DockerCon 2017 Recap
Docker Warsaw Meetup 12/2017 - DockerCon 2017 RecapKrzysztof Sobczak
 
Streamline your development environment with docker
Streamline your development environment with dockerStreamline your development environment with docker
Streamline your development environment with dockerGiacomo Bagnoli
 
Docker on openstack by OpenSource Consulting
Docker on openstack by OpenSource ConsultingDocker on openstack by OpenSource Consulting
Docker on openstack by OpenSource ConsultingOpen Source Consulting
 
Docker - container and lightweight virtualization
Docker - container and lightweight virtualization Docker - container and lightweight virtualization
Docker - container and lightweight virtualization Sim Janghoon
 
Docker 101 - from 0 to Docker in 30 minutes
Docker 101 - from 0 to Docker in 30 minutesDocker 101 - from 0 to Docker in 30 minutes
Docker 101 - from 0 to Docker in 30 minutesLuciano Fiandesio
 
Container sig#1 ansible-container
Container sig#1 ansible-containerContainer sig#1 ansible-container
Container sig#1 ansible-containerNaoya Hashimoto
 
Securing Containers, One Patch at a Time - Michael Crosby, Docker
Securing Containers, One Patch at a Time - Michael Crosby, DockerSecuring Containers, One Patch at a Time - Michael Crosby, Docker
Securing Containers, One Patch at a Time - Michael Crosby, DockerDocker, Inc.
 
First steps to docker
First steps to dockerFirst steps to docker
First steps to dockerGuilhem Marty
 
Continuous delivery with docker
Continuous delivery with dockerContinuous delivery with docker
Continuous delivery with dockerJohan Janssen
 

What's hot (20)

KVM and docker LXC Benchmarking with OpenStack
KVM and docker LXC Benchmarking with OpenStackKVM and docker LXC Benchmarking with OpenStack
KVM and docker LXC Benchmarking with OpenStack
 
Docker as an every day work tool
Docker as an every day work toolDocker as an every day work tool
Docker as an every day work tool
 
Why everyone is excited about Docker (and you should too...) - Carlo Bonamic...
Why everyone is excited about Docker (and you should too...) -  Carlo Bonamic...Why everyone is excited about Docker (and you should too...) -  Carlo Bonamic...
Why everyone is excited about Docker (and you should too...) - Carlo Bonamic...
 
The Lies We Tell Our Code (#seascale 2015 04-22)
The Lies We Tell Our Code (#seascale 2015 04-22)The Lies We Tell Our Code (#seascale 2015 04-22)
The Lies We Tell Our Code (#seascale 2015 04-22)
 
Docker 初探,實驗室中的運貨鯨
Docker 初探,實驗室中的運貨鯨Docker 初探,實驗室中的運貨鯨
Docker 初探,實驗室中的運貨鯨
 
App container rkt
App container rktApp container rkt
App container rkt
 
Docker Tips And Tricks at the Docker Beijing Meetup
Docker Tips And Tricks at the Docker Beijing MeetupDocker Tips And Tricks at the Docker Beijing Meetup
Docker Tips And Tricks at the Docker Beijing Meetup
 
當專案漸趕,當遷移也不再那麼難 (Ship Your Projects with Docker EcoSystem)
當專案漸趕,當遷移也不再那麼難 (Ship Your Projects with Docker EcoSystem)當專案漸趕,當遷移也不再那麼難 (Ship Your Projects with Docker EcoSystem)
當專案漸趕,當遷移也不再那麼難 (Ship Your Projects with Docker EcoSystem)
 
Introduction to Docker, December 2014 "Tour de France" Bordeaux Special Edition
Introduction to Docker, December 2014 "Tour de France" Bordeaux Special EditionIntroduction to Docker, December 2014 "Tour de France" Bordeaux Special Edition
Introduction to Docker, December 2014 "Tour de France" Bordeaux Special Edition
 
Introducing Docker
Introducing DockerIntroducing Docker
Introducing Docker
 
Docker Warsaw Meetup 12/2017 - DockerCon 2017 Recap
Docker Warsaw Meetup 12/2017 - DockerCon 2017 RecapDocker Warsaw Meetup 12/2017 - DockerCon 2017 Recap
Docker Warsaw Meetup 12/2017 - DockerCon 2017 Recap
 
Streamline your development environment with docker
Streamline your development environment with dockerStreamline your development environment with docker
Streamline your development environment with docker
 
Docker on openstack by OpenSource Consulting
Docker on openstack by OpenSource ConsultingDocker on openstack by OpenSource Consulting
Docker on openstack by OpenSource Consulting
 
Docker - container and lightweight virtualization
Docker - container and lightweight virtualization Docker - container and lightweight virtualization
Docker - container and lightweight virtualization
 
Docker 101 - from 0 to Docker in 30 minutes
Docker 101 - from 0 to Docker in 30 minutesDocker 101 - from 0 to Docker in 30 minutes
Docker 101 - from 0 to Docker in 30 minutes
 
ABCs of docker
ABCs of dockerABCs of docker
ABCs of docker
 
Container sig#1 ansible-container
Container sig#1 ansible-containerContainer sig#1 ansible-container
Container sig#1 ansible-container
 
Securing Containers, One Patch at a Time - Michael Crosby, Docker
Securing Containers, One Patch at a Time - Michael Crosby, DockerSecuring Containers, One Patch at a Time - Michael Crosby, Docker
Securing Containers, One Patch at a Time - Michael Crosby, Docker
 
First steps to docker
First steps to dockerFirst steps to docker
First steps to docker
 
Continuous delivery with docker
Continuous delivery with dockerContinuous delivery with docker
Continuous delivery with docker
 

Similar to Reproducibility of computational workflows is automated using continuous analysis

Evaluation of Container Virtualized MEGADOCK System in Distributed Computing ...
Evaluation of Container Virtualized MEGADOCK System in Distributed Computing ...Evaluation of Container Virtualized MEGADOCK System in Distributed Computing ...
Evaluation of Container Virtualized MEGADOCK System in Distributed Computing ...Kento Aoyama
 
naveed-kamran-software-architecture-agile
naveed-kamran-software-architecture-agilenaveed-kamran-software-architecture-agile
naveed-kamran-software-architecture-agileNaveed Kamran
 
Software Analytics - Achievements and Challenges
Software Analytics - Achievements and ChallengesSoftware Analytics - Achievements and Challenges
Software Analytics - Achievements and ChallengesTao Xie
 
2018 ABRF Tools for improving rigor and reproducibility in bioinformatics
2018 ABRF Tools for improving rigor and reproducibility in bioinformatics2018 ABRF Tools for improving rigor and reproducibility in bioinformatics
2018 ABRF Tools for improving rigor and reproducibility in bioinformaticsStephen Turner
 
Reproducibility - The myths and truths of pipeline bioinformatics
Reproducibility - The myths and truths of pipeline bioinformaticsReproducibility - The myths and truths of pipeline bioinformatics
Reproducibility - The myths and truths of pipeline bioinformaticsSimon Cockell
 
Chemical Databases and Open Chemistry on the Desktop
Chemical Databases and Open Chemistry on the DesktopChemical Databases and Open Chemistry on the Desktop
Chemical Databases and Open Chemistry on the DesktopMarcus Hanwell
 
A personal journey towards more reproducible networking research
A personal journey towards more reproducible networking researchA personal journey towards more reproducible networking research
A personal journey towards more reproducible networking researchOlivier Bonaventure
 
Introduction to OpenSees by Frank McKenna
Introduction to OpenSees by Frank McKennaIntroduction to OpenSees by Frank McKenna
Introduction to OpenSees by Frank McKennaopenseesdays
 
Reproducible research: practice
Reproducible research: practiceReproducible research: practice
Reproducible research: practiceC. Tobin Magle
 
An Updated Performance Comparison of Virtual Machines and Linux Containers
An Updated Performance Comparison of Virtual Machines and Linux ContainersAn Updated Performance Comparison of Virtual Machines and Linux Containers
An Updated Performance Comparison of Virtual Machines and Linux ContainersKento Aoyama
 
Evaluating Open Source Security Software
Evaluating Open Source Security SoftwareEvaluating Open Source Security Software
Evaluating Open Source Security SoftwareJohn ILIADIS
 
E Afgan - Zero to a bioinformatics analysis platform in four minutes
E Afgan - Zero to a bioinformatics analysis platform in four minutesE Afgan - Zero to a bioinformatics analysis platform in four minutes
E Afgan - Zero to a bioinformatics analysis platform in four minutesJan Aerts
 
2019 03-11 bio it-world west genepattern notebook slides
2019 03-11 bio it-world west genepattern notebook slides2019 03-11 bio it-world west genepattern notebook slides
2019 03-11 bio it-world west genepattern notebook slidesMichael Reich
 
Hands on kubernetes_container_orchestration
Hands on kubernetes_container_orchestrationHands on kubernetes_container_orchestration
Hands on kubernetes_container_orchestrationAmir Hossein Sorouri
 
Mixing d ps building architecture on the cross cutting example
Mixing d ps building architecture on the cross cutting exampleMixing d ps building architecture on the cross cutting example
Mixing d ps building architecture on the cross cutting examplecorehard_by
 
Docker: Containers for Data Science
Docker: Containers for Data ScienceDocker: Containers for Data Science
Docker: Containers for Data ScienceAlessandro Adamo
 
OpenRepGrid – An Open Source Software for the Analysis of Repertory Grids
OpenRepGrid – An Open Source Software for the Analysis of Repertory GridsOpenRepGrid – An Open Source Software for the Analysis of Repertory Grids
OpenRepGrid – An Open Source Software for the Analysis of Repertory GridsMark Heckmann
 
A Tool for Optimizing Java 8 Stream Software via Automated Refactoring
A Tool for Optimizing Java 8 Stream Software via Automated RefactoringA Tool for Optimizing Java 8 Stream Software via Automated Refactoring
A Tool for Optimizing Java 8 Stream Software via Automated RefactoringRaffi Khatchadourian
 

Similar to Reproducibility of computational workflows is automated using continuous analysis (20)

Evaluation of Container Virtualized MEGADOCK System in Distributed Computing ...
Evaluation of Container Virtualized MEGADOCK System in Distributed Computing ...Evaluation of Container Virtualized MEGADOCK System in Distributed Computing ...
Evaluation of Container Virtualized MEGADOCK System in Distributed Computing ...
 
naveed-kamran-software-architecture-agile
naveed-kamran-software-architecture-agilenaveed-kamran-software-architecture-agile
naveed-kamran-software-architecture-agile
 
Software Analytics - Achievements and Challenges
Software Analytics - Achievements and ChallengesSoftware Analytics - Achievements and Challenges
Software Analytics - Achievements and Challenges
 
2018 ABRF Tools for improving rigor and reproducibility in bioinformatics
2018 ABRF Tools for improving rigor and reproducibility in bioinformatics2018 ABRF Tools for improving rigor and reproducibility in bioinformatics
2018 ABRF Tools for improving rigor and reproducibility in bioinformatics
 
Reproducibility - The myths and truths of pipeline bioinformatics
Reproducibility - The myths and truths of pipeline bioinformaticsReproducibility - The myths and truths of pipeline bioinformatics
Reproducibility - The myths and truths of pipeline bioinformatics
 
Chemical Databases and Open Chemistry on the Desktop
Chemical Databases and Open Chemistry on the DesktopChemical Databases and Open Chemistry on the Desktop
Chemical Databases and Open Chemistry on the Desktop
 
A personal journey towards more reproducible networking research
A personal journey towards more reproducible networking researchA personal journey towards more reproducible networking research
A personal journey towards more reproducible networking research
 
Introduction to OpenSees by Frank McKenna
Introduction to OpenSees by Frank McKennaIntroduction to OpenSees by Frank McKenna
Introduction to OpenSees by Frank McKenna
 
Reproducible research: practice
Reproducible research: practiceReproducible research: practice
Reproducible research: practice
 
An Updated Performance Comparison of Virtual Machines and Linux Containers
An Updated Performance Comparison of Virtual Machines and Linux ContainersAn Updated Performance Comparison of Virtual Machines and Linux Containers
An Updated Performance Comparison of Virtual Machines and Linux Containers
 
Evaluating Open Source Security Software
Evaluating Open Source Security SoftwareEvaluating Open Source Security Software
Evaluating Open Source Security Software
 
Design For Testability
Design For TestabilityDesign For Testability
Design For Testability
 
E Afgan - Zero to a bioinformatics analysis platform in four minutes
E Afgan - Zero to a bioinformatics analysis platform in four minutesE Afgan - Zero to a bioinformatics analysis platform in four minutes
E Afgan - Zero to a bioinformatics analysis platform in four minutes
 
2019 03-11 bio it-world west genepattern notebook slides
2019 03-11 bio it-world west genepattern notebook slides2019 03-11 bio it-world west genepattern notebook slides
2019 03-11 bio it-world west genepattern notebook slides
 
Hands on kubernetes_container_orchestration
Hands on kubernetes_container_orchestrationHands on kubernetes_container_orchestration
Hands on kubernetes_container_orchestration
 
Mixing d ps building architecture on the cross cutting example
Mixing d ps building architecture on the cross cutting exampleMixing d ps building architecture on the cross cutting example
Mixing d ps building architecture on the cross cutting example
 
G3 talk rld_2
G3 talk rld_2G3 talk rld_2
G3 talk rld_2
 
Docker: Containers for Data Science
Docker: Containers for Data ScienceDocker: Containers for Data Science
Docker: Containers for Data Science
 
OpenRepGrid – An Open Source Software for the Analysis of Repertory Grids
OpenRepGrid – An Open Source Software for the Analysis of Repertory GridsOpenRepGrid – An Open Source Software for the Analysis of Repertory Grids
OpenRepGrid – An Open Source Software for the Analysis of Repertory Grids
 
A Tool for Optimizing Java 8 Stream Software via Automated Refactoring
A Tool for Optimizing Java 8 Stream Software via Automated RefactoringA Tool for Optimizing Java 8 Stream Software via Automated Refactoring
A Tool for Optimizing Java 8 Stream Software via Automated Refactoring
 

Recently uploaded

Terpineol and it's characterization pptx
Terpineol and it's characterization pptxTerpineol and it's characterization pptx
Terpineol and it's characterization pptxMuhammadRazzaq31
 
LUNULARIA -features, morphology, anatomy ,reproduction etc.
LUNULARIA -features, morphology, anatomy ,reproduction etc.LUNULARIA -features, morphology, anatomy ,reproduction etc.
LUNULARIA -features, morphology, anatomy ,reproduction etc.Cherry
 
Taphonomy and Quality of the Fossil Record
Taphonomy and Quality of the  Fossil RecordTaphonomy and Quality of the  Fossil Record
Taphonomy and Quality of the Fossil RecordSangram Sahoo
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusNazaninKarimi6
 
Concept of gene and Complementation test.pdf
Concept of gene and Complementation test.pdfConcept of gene and Complementation test.pdf
Concept of gene and Complementation test.pdfCherry
 
Understanding Partial Differential Equations: Types and Solution Methods
Understanding Partial Differential Equations: Types and Solution MethodsUnderstanding Partial Differential Equations: Types and Solution Methods
Understanding Partial Differential Equations: Types and Solution Methodsimroshankoirala
 
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptxClimate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptxDiariAli
 
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....muralinath2
 
Energy is the beat of life irrespective of the domains. ATP- the energy curre...
Energy is the beat of life irrespective of the domains. ATP- the energy curre...Energy is the beat of life irrespective of the domains. ATP- the energy curre...
Energy is the beat of life irrespective of the domains. ATP- the energy curre...Nistarini College, Purulia (W.B) India
 
Plasmid: types, structure and functions.
Plasmid: types, structure and functions.Plasmid: types, structure and functions.
Plasmid: types, structure and functions.Cherry
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learninglevieagacer
 
FAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical ScienceFAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical ScienceAlex Henderson
 
FS P2 COMBO MSTA LAST PUSH past exam papers.
FS P2 COMBO MSTA LAST PUSH past exam papers.FS P2 COMBO MSTA LAST PUSH past exam papers.
FS P2 COMBO MSTA LAST PUSH past exam papers.takadzanijustinmaime
 
Role of AI in seed science Predictive modelling and Beyond.pptx
Role of AI in seed science  Predictive modelling and  Beyond.pptxRole of AI in seed science  Predictive modelling and  Beyond.pptx
Role of AI in seed science Predictive modelling and Beyond.pptxArvind Kumar
 
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRings
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRingsTransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRings
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRingsSérgio Sacani
 
Use of mutants in understanding seedling development.pptx
Use of mutants in understanding seedling development.pptxUse of mutants in understanding seedling development.pptx
Use of mutants in understanding seedling development.pptxRenuJangid3
 
Cot curve, melting temperature, unique and repetitive DNA
Cot curve, melting temperature, unique and repetitive DNACot curve, melting temperature, unique and repetitive DNA
Cot curve, melting temperature, unique and repetitive DNACherry
 
Genome Projects : Human, Rice,Wheat,E coli and Arabidopsis.
Genome Projects : Human, Rice,Wheat,E coli and Arabidopsis.Genome Projects : Human, Rice,Wheat,E coli and Arabidopsis.
Genome Projects : Human, Rice,Wheat,E coli and Arabidopsis.Cherry
 
Pteris : features, anatomy, morphology and lifecycle
Pteris : features, anatomy, morphology and lifecyclePteris : features, anatomy, morphology and lifecycle
Pteris : features, anatomy, morphology and lifecycleCherry
 


Reproducibility of computational workflows is automated using continuous analysis

  • 1. Reproducibility of computational workflows is automated using continuous analysis Brett K Beaulieu-Jones, Casey S Greene Nature Biotechnology, vol.35, No.4, pp.342-346, 2017. April 20th, 2017 Ph.D. Student Kento Aoyama Akiyama Laboratory Department of Computer Science, School of Computing Tokyo Institute of Technology
  • 2. Journal Information: Nature Biotechnology • A top scientific journal in the biological, biomedical, agricultural and environmental sciences • 2-year IF: 43.113 (2016) • For comparison, Nature: IF = 38.138 (2016) • Source: http://www.nature.com/npg_/company_info/journal_metrics.html • Issue: Nature Biotechnology, April 2017, vol. 35, no. 4
  • 3. Authors Information: Brett K Beaulieu-Jones (1), Casey S Greene (2) • (1) Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania (Twitter: @beaulieujones) • (2) Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania (Twitter: @GreeneScientist)
  • 4. What is the value of this research? • Target problem: reproducibility of computational research • Proposed method: Continuous Integration + Computational Research = Continuous Analysis • Continuous analysis can automatically verify the reproducibility of research • It makes work easy to reproduce, review, and collaborate on • [GitHub] https://greenelab.github.io/continuous_analysis/
  • 5. Outline: 1. Background 2. Result (Survey) 3. Proposed Method (Architecture) 4. Experiments 5. Discussion, Conclusion
  • 6. Background | Reproducibility Crisis: Research reproducibility is crucial for science, but 90% of researchers surveyed acknowledged a reproducibility crisis [1] • [1] Baker, M. 1,500 scientists lift the lid on reproducibility. Nature 533, 452–454 (2016).
  • 7. Background | Reproducibility Spectrum: Reproducibility problems • Missing experimental details: data, parameters, code, etc. • Missing information about the machine environment: software versions, libraries, operating systems, etc. • Computational research should be reproducible • Peng, R.D. Reproducible research in computational science. Science 334, 1226–1227 (2011).
  • 8. Background | Reproducibility in Biology: Of 18 articles published in Nature Genetics (2005, 2006) • 10 articles (56%) could not be reproduced • 6 articles (33%) could be reproduced with discrepancies • Ioannidis, J.P.A. et al. Repeatability of published microarray gene expression analyses. Nat. Genet. 41, 149–155 (2009).
  • 10. Reproducibility on RNA-Analysis: Survey of differential gene expression research • Probe information is necessary for reproduction • A probe is an oligonucleotide of a specific sequence used to measure transcript expression levels • BrainArray Custom CDF [1] • A popular source of probe set description files • Published and maintained by Dai, M. et al. • The Custom CDF version identifies the exact probe set definitions used • The authors analyzed 200 articles that cited Dai, M. et al. [1] • [1] Dai, M. et al. Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data. Nucleic Acids Res. 33, e175 (2005).
  • 11. Reporting of Custom CDF in articles • a) Most recent 100 articles: 51% did NOT report the Custom CDF version • b) 100 most highly cited articles: 64% did NOT report the Custom CDF version • Figure note: cannot download (14 Nov. 2016)
  • 12. Effects on Analysis Result: How different versions affect the analysis result • To measure the effects: download different versions of the Custom CDF and use the same data set (normal HeLa cells and HeLa cells in which TIA1 and TIAR (TIAL1) were knocked down) • Comparing the results: same source code, same data set, different versions of the BrainArray Custom CDF (18, 19, 20), different versions of software packages
  • 13. Figure 2a. Differential gene expression analysis of HeLa cells • Each version identified a different number of significantly altered genes • e.g., 15 genes were identified as significant in version 19 but not in version 18 … • Analysis results are NOT reproducible without the exact versions of the software and data set
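The version comparison on this slide amounts to a set difference between the genes that each Custom CDF version flags as significant. A minimal Python sketch of that comparison follows; the gene identifiers and the helper name `newly_significant` are hypothetical placeholders, not data or code from the paper.

```python
def newly_significant(old_hits, new_hits):
    """Return genes flagged as significant in the new version only."""
    return sorted(set(new_hits) - set(old_hits))


# Hypothetical placeholder gene lists, not data from the paper.
hits_v18 = ["GENE_A", "GENE_B", "GENE_C"]
hits_v19 = ["GENE_B", "GENE_C", "GENE_D", "GENE_E"]

only_v19 = newly_significant(hits_v18, hits_v19)
only_v18 = newly_significant(hits_v19, hits_v18)
print("significant in v19 only:", only_v19)
print("significant in v18 only:", only_v18)
```

Running the same comparison for every pair of versions gives the kind of per-version discrepancy counts summarized in the figure.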
  • 14. Figure 2b. Container-based approaches • Using Docker [1] containers improves reproducibility • Docker can create an "image" that contains the software environment • Docker allows users to run exactly the same applications in any environment • Using Docker containers allowed the versions to be matched and produced the same results • [1] https://www.docker.com
  • 15. Docker for reproducible workflow • Docker is useful for reproducible workflows: same versions of software, same version of the data set, isolation from the host OS software environment • Image tags are useful for managing software releases and paper revisions • Supplementary information on Docker (container virtualization) is attached at the end of these slides
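Docker fixes software versions at the image level. The same idea can be illustrated inside a single Python environment by comparing installed package versions against a pinned list. This is only a sketch of the "same versions of software" requirement, not the authors' method; the package names and version numbers are hypothetical.

```python
from importlib import metadata


def find_mismatches(pinned, installed):
    """Map each package whose installed version differs from its pin to (expected, found)."""
    return {
        name: (expected, installed.get(name))
        for name, expected in pinned.items()
        if installed.get(name) != expected
    }


def installed_versions(names):
    """Look up installed versions; packages that are not installed map to None."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Hypothetical pins for illustration only.
pinned = {"example-analysis-lib": "1.2.0", "example-plot-lib": "0.9.1"}
mismatches = find_mismatches(pinned, installed_versions(pinned))
for name, (expected, found) in sorted(mismatches.items()):
    print(f"{name}: expected {expected}, found {found}")
```

A check like this could run as an early step of an automated analysis so that a version drift is reported before any results are regenerated.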
  • 17. Continuous Analysis: Resolving the reproducibility problem • To avoid problems with data and software versions, Docker can share an executable container that contains both the data and the software • But sometimes the software needs to be upgraded, and the results must then be re-checked • Automatic verification is therefore needed • Continuous Integration (CI) is an automatic, verifiable software development approach
  • 18. About Continuous Integration • Continuous Integration (CI) [1] is a software engineering practice for fast development • Builds, tests, and analytics run automatically, triggered by the version control system (e.g. git) • e.g., the Travis CI [2] badge • [1] Booch, G. Object Oriented Design: With Applications. Benjamin Cummings, p. 209 (1991). ISBN 9780805300918. • [2] Travis CI, https://travis-ci.org/
  • 19. e.g.) Travis CI • 1. A developer pushes commits to the repository • 2. The test script is executed automatically on the CI service • 3. The test result is generated automatically • e.g., https://github.com/galaxyproject/galaxy
  • 20. e.g.) CI on Product Development: Xamarin Test Cloud • Figure: https://developer.xamarin.com/guides/cross-platform/ci/intro_to_ci/
  • 21. Continuous Analysis • Docker provides environment reproducibility: same version of the data set, same version of the software, easy to build the environment (Dockerfile), easy to share the environment (Docker Hub) • Continuous analysis can verify the reproducibility of computational research: it automatically tests reproducibility and automatically updates results
  • 23. Workflow • 1. Push source code changes • 2. (Generate the base Docker image from the Dockerfile) • 3. Read parameters and commands from YAML files; users can describe and execute any commands in YAML, e.g., pre-processing, data analysis, etc. • 4. Write the outputs to another branch: result data, figures, logs (managed in the VCS) • 5. Update the latest Docker image
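Step 3 of the workflow, running a configured list of commands in order, can be sketched in Python as follows. This is a simplified illustration of how a CI service treats a script section, not Drone's actual implementation; the step names and commands are hypothetical stand-ins for real analysis scripts.

```python
import subprocess
import sys


def run_steps(steps):
    """Run (name, argv) steps in order; stop at the first non-zero exit code."""
    log = []
    for name, argv in steps:
        result = subprocess.run(argv, capture_output=True, text=True)
        log.append((name, result.returncode))
        if result.returncode != 0:
            break
    return log


# Hypothetical placeholder steps; a real pipeline would call analysis scripts.
steps = [
    ("pre-process", [sys.executable, "-c", "print('pre-processing')"]),
    ("tests", [sys.executable, "-c", "raise SystemExit(0)"]),
    ("analysis", [sys.executable, "-c", "print('analysis')"]),
]
log = run_steps(steps)
print(log)
```

Stopping at the first failing step mirrors the usual CI behavior: outputs are only published when every earlier step has succeeded.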
  • 24. System Components • Drone: open-source continuous integration software (https://github.com/drone/drone); easy to set up using a Docker container; works almost the same way as other CI services • GitHub: online Git repository hosting; BitBucket and GitLab are also available
  • 25. Example of YAML file: .drone.yml example configuration • https://greenelab.github.io/continuous_analysis/ • https://github.com/greenelab/continuous_analysis/blob/master/.drone.yml
      # choose the base docker image
      image: brettbj/continuous_analysis_base
      script:
        # run pre-process
        # run tests
        # perform analysis
      # publish results
      publish:
        docker:
          # docker details
  • 26. Experiments • The authors applied this system to their own work: "Denoising Autoencoders for Phenotype Stratification (DAPS): Preprint Release", http://doi.org/10.5281/zenodo.46165 • They ran two example analyses: a phylogenetic tree-building analysis and an RNA-seq differential expression analysis (details are in the Online Methods)
  • 27. Experimental Results (Fig. 4) • Changes in the output figures are easy to compare
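In the slides, changed outputs are compared through the version control system. As an illustrative alternative, and not the authors' method, the same question ("which outputs changed between two runs?") can be answered by hashing every output file before and after a run. The file names and contents below are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def digest_outputs(directory):
    """Map each file's path relative to directory to its SHA-256 hex digest."""
    root = Path(directory)
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_outputs(before, after):
    """List paths that were added, removed, or modified between two digest maps."""
    return sorted(p for p in set(before) | set(after) if before.get(p) != after.get(p))


# Simulate two runs that write hypothetical outputs into a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    (out / "figure.svg").write_text("<svg>v1</svg>")
    (out / "table.csv").write_text("gene,p\n")
    before = digest_outputs(out)
    (out / "figure.svg").write_text("<svg>v2</svg>")
    after = digest_outputs(out)

changed = changed_outputs(before, after)
print(changed)
```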
  • 28. Discussion | Conclusion • Continuous analysis provides verifiable scientific software in a fully specified environment • A reproducible environment is easy to obtain using Docker • The environment is automatically kept up to date • It allows reviewers, editors and readers to assess reproducibility without a large time commitment
  • 29. Discussion | Limitations • Running large computational analyses at every commit may be impractical • Cloud computing can address this, but it requires auto-provisioning skills • CI steps can be skipped by including a registered phrase • It does not address reproducibility in the broader sense: robustness of results to parameter settings, starting conditions, and partitions in the data (these are outside the scope of this research)
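The skip mechanism mentioned above can be sketched as a simple check on the commit message. The slide does not name the phrase, so both the phrases `[ci skip]` and `[skip ci]` and the assumption that the phrase appears in the commit message are illustrative; the exact phrase depends on the CI service in use.

```python
# Assumed skip phrases; the phrase actually recognized depends on the CI service.
SKIP_PHRASES = ("[ci skip]", "[skip ci]")


def should_run_analysis(commit_message):
    """Return False when the commit message contains a skip phrase."""
    lowered = commit_message.lower()
    return not any(phrase in lowered for phrase in SKIP_PHRASES)


print(should_run_analysis("Fix typo in README [ci skip]"))
print(should_run_analysis("Update normalization step"))
```

This lets documentation-only commits avoid re-running an expensive analysis while code changes still trigger it.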
  • 30. Container-based Virtualization • Linux containers virtualize host resources (filesystem, hostname, IPC, PID, network, user, etc.) as containers and can be used like virtual machines • Linux kernel features: containers share the same host kernel; namespaces [1], chroot, cgroups, SELinux, etc. • [Diagram: one machine with a single Linux kernel space hosting two containers, each running its own processes] • [1] E. W. Biederman. "Multiple instances of the global Linux namespaces." In Proceedings of the 2006 Ottawa Linux Symposium, 2006.
  • 31. Linux Container Management Tools • Docker [1]: the most popular Linux container management platform, with many useful components and services • Tools shown on the slide: [1] [2] [3] • [1] Solomon Hykes and others. "What is Docker?" https://www.docker.com/what-docker • [2] W. Bhimji, S. Canon, D. Jacobsen, L. Gerhardt, M. Mustafa, and J. Porter, "Shifter : Containers for HPC," Cray User Group, pp. 1–12, 2016. • [3] "Singularity", http://singularity.lbl.gov/
  • 32. Easy container sharing – Docker Hub • Portability & Reproducibility • It is easy to share the application environment via Docker Hub • Containers can be executed on other host machines • [Diagram: on an Ubuntu host running Docker Engine, an image (app, bins/libs) is generated from a Dockerfile (apt-get install …, wget …, …, make) and pushed to Docker Hub; a CentOS host running Docker Engine pulls the shared image and runs it as a container]
  • 33. Docker - Filesystem • AUFS (Advanced multi layered unification filesystem) [1] • Docker uses AUFS as its default filesystem • Layers can be reused in other container images • AUFS helps software reproducibility • [Diagram: a Docker container image built as a stack of layers: f49eec89601e (129.5 MB, ubuntu:16.04 base image), 366a03547595 (39.85 MB), ef122501292c (3.6 MB), e50c89716342 (15.4 KB), 5aec9aa5462c (1.17 MB), 0d3cccd04bdb (1.07 MB); tags: beta, version-1.0, version-1.0.2, version-1.1, latest] • [1] Advanced multi layered unification filesystem. http://aufs.sourceforge.net, 2014.
  • 34. Linux Container – Performance [1] • [Figure: bar chart of performance ratio relative to native for PXZ [MB/s], Linpack [GFLOPS], and Random Access [GUPS]; series: Native, Docker, KVM, KVM-tuned] • [1] W. Felter, A. Ferreira, R. Rajamony, and J. Rubio, "An updated performance comparison of virtual machines and Linux containers," IEEE International Symposium on Performance Analysis of Systems and Software, pp. 171-172, 2015. (IBM Research Report, RC25482 (AUS1407-001), 2014.)