Poster X-Meeting 2015 - Improving automation, reproducibility and installation of genomic pipelines with Docker.
Authors: Marcel Caraciolo and Felipe Figueiredo
[Figure: downstream pipeline stages. Statistics: Picard (target coverage), SamTools (BAM file statistics). Variant detection: SamTools (variant calling -> VCF), GATK (variant calling -> VCF).]
M. P. Caraciolo¹, F. V. Figueiredo¹, V. Monteiro¹
¹ Genomika Diagnósticos
Improving automation, reproducibility and
installation of genomic analysis pipelines with Docker
ABSTRACT
Bioinformatics pipelines usually rely on a combination of several components, and
deploying them incurs a substantial configuration and maintenance burden.
Genomics and variant analysis pipelines are normally difficult to install, configure and
deploy. We tackled this issue with a scalable and repeatable approach using Docker
containers (lightweight virtualization). By encapsulating NGS workflows in
containers, a user can quickly deploy any pipeline version in any environment and
overcome several issues of the commonly used virtual-machine approaches.
The goal is to share our experience in developing, distributing and running
pipelines encapsulated in containers using Docker.
bioinfo@genomika.com.br | genomika.com.br
Rua Senador José Henrique, 224, Alfred Nobel, Sala 1301 | Recife, PE | Brazil
INTRODUCTION AND MOTIVATION
The current approach using VMs lacks portability, has substantial
overhead (disk, CPU, RAM) and requires resources to be
provisioned statically. The tools used in the pipelines are generally
installed by automated scripts that may break because a tool no longer
exists or a version is incorrect. For biologists the problem is more
critical: finding and installing the required software is difficult,
documentation is limited, and obtaining good results requires
experience.
WHAT IS DOCKER?
Benchmarks
Dockerized Pipeline Approach 1
Dockerized Pipeline Approach (in progress)
REFERENCES
Boettiger C. 2015. An introduction to Docker for reproducible research. ACM SIGOPS Operating Systems
Review, Special Issue on Repeatability and Sharing of Experimental Artifacts 49(1):71-79.
Di Tommaso P, Chatzou M, Baraja P, Notredame C. 2014. Nextflow: a novel tool for highly scalable
computational pipelines.
Di Tommaso P, Palumbo E, Chatzou M, Prieto P, Heuer ML, Notredame C. 2015. The impact of Docker containers
on the performance of genomic pipelines. PeerJ PrePrints 3:e1428.
doi:10.7287/peerj.preprints.1171v2.
Felter W, Ferreira A, Rajamony R, Rubio J. 2014. An updated performance comparison of virtual machines
and Linux containers. IBM Research. Available at http://ibm.co/V55Otq (accessed 1 June 2015).
PIPELINE ARCHITECTURE BEFORE CONTAINERS
OUR APPROACH
Time is expressed in minutes. The mean and the standard deviation were estimated from 10
separate runs. Slowdown represents the ratio of the mean execution time with Docker to the
mean execution time when Docker was not used.
Mean execution times for pipelines and tasks with and without Docker.
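The statistics described in this note can be reproduced with a few lines of Python. The sketch below is only an illustration: the run times are hypothetical values, not the poster's measurements, and the use of the sample standard deviation is an assumption, since the poster does not say which estimator was used.

```python
from statistics import mean, stdev


def summarize(native_runs, docker_runs):
    """Summarize two sets of run times (minutes) and compute the slowdown."""
    native_mean = mean(native_runs)
    docker_mean = mean(docker_runs)
    return {
        "native_mean": native_mean,
        "docker_mean": docker_mean,
        # Sample standard deviation (assumption: the estimator is not stated).
        "native_std": stdev(native_runs),
        "docker_std": stdev(docker_runs),
        # Slowdown: mean time with Docker divided by mean time without Docker.
        "slowdown": docker_mean / native_mean,
    }


# Hypothetical run times in minutes, for illustration only.
native = [100.0, 102.0, 98.0, 101.0, 99.0]
docker = [103.0, 105.0, 101.0, 104.0, 102.0]

summary = summarize(native, docker)
print(f"slowdown = {summary['slowdown']:.3f}")
```

A slowdown close to 1 means that running inside a container adds little overhead relative to native execution.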
Docker is open-source software that isolates the tools
and software involved in processing and makes it easier
to recreate a snapshot of the pipeline's current
environment for reproducibility, without manually
re-installing specific versions of software.
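One way to capture such a snapshot is a Dockerfile that installs each tool at an exact version. The sketch below renders one from Python. It is a minimal illustration under stated assumptions: the tools are assumed to be installable as apt packages, the base image `ubuntu:14.04` is an arbitrary choice, and `X.Y.Z` is a placeholder rather than a real version; none of this is taken from the poster.

```python
def render_dockerfile(base_image, pinned_packages):
    """Render Dockerfile text that installs every package at a pinned version."""
    # apt accepts "name=version" to request one exact package version.
    installs = " ".join(
        f"{name}={version}" for name, version in sorted(pinned_packages.items())
    )
    return (
        f"FROM {base_image}\n"
        f"RUN apt-get update && apt-get install -y {installs}\n"
    )


# Placeholder versions for illustration only; replace with the versions in use.
dockerfile = render_dockerfile(
    "ubuntu:14.04",
    {"bwa": "X.Y.Z", "samtools": "X.Y.Z"},
)
print(dockerfile)
```

Keeping the rendered file under version control alongside the pipeline code records which tool versions produced a given result.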
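[Figure: virtual machine vs. container stacks. VM stack: Server, Host OS, Hypervisor, then a Guest OS, Bins/Libs and App (A, B, ...) per virtual machine. Container stack: Server, Host OS, Docker Engine, then Bins/Libs and App (A, B, ...) per container.]
[Figure: base container layout. Labels: BWA, SamTools, Workflow, base container; input/output is mounted from a mounted volume or a volume container.]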
Single base container:
Pros: single container, easy to maintain.
Cons: VM-like approach; huge, monolithic container that is
difficult to share (against the Docker philosophy).
One container per tool:
Pros: completely modularized; workflow components are
easy to re-use and share.
Cons: “Container hell”?
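The two layouts compared above can be sketched as `docker run` invocations built in Python. This is an illustration under stated assumptions, not the authors' implementation: the image names (`example/...`), the script `run_workflow.sh`, the host directory `/srv/run1` and all file names are hypothetical. The commands are only built and printed here, not executed.

```python
import shlex


def docker_run(image, command, data_dir, mount_point="/data"):
    """Build the argv for `docker run` with a host directory mounted for input/output."""
    return ["docker", "run", "--rm", "-v", f"{data_dir}:{mount_point}", image, *command]


DATA_DIR = "/srv/run1"  # hypothetical host directory holding config and fastq files

# Single base container: one image (hypothetical) runs the whole workflow.
single = [
    docker_run("example/pipeline:1.0", ["run_workflow.sh", "/data/config.yml"], DATA_DIR),
]

# One container per tool: each step uses its own image (hypothetical names),
# and the steps exchange files through the shared mounted directory.
modular = [
    docker_run(
        "example/bwa:1.0",
        ["sh", "-c", "bwa mem /data/ref.fa /data/sample.fastq > /data/sample.sam"],
        DATA_DIR,
    ),
    docker_run(
        "example/samtools:1.0",
        ["samtools", "view", "-b", "-o", "/data/sample.bam", "/data/sample.sam"],
        DATA_DIR,
    ),
]

for argv in single + modular:
    print(shlex.join(argv))
```

Each list could be passed to `subprocess.run(argv, check=True)` on a host with Docker installed. The modular variant needs one image per tool, which is the maintenance cost the "Container hell" remark points at.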
Pipeline                          Tasks  Mean task time   Mean execution time  Execution time std. deviation  Slowdown
                                         Native  Docker   Native   Docker      Native  Docker
Variant calling pipeline for WES  48     26.5    27.1     1254.4   1293.8      4.9     2.6                    1.022
[Figure panel labels: VM; Container]
[Figure: pipeline steps, from the diagram]
BWA: mapping and pairing
SamTools: format conversion -> BAM
Picard: remove duplicates -> BAM
SamTools: remove reads with mapQV=0 -> BAM
GATK: local realignment around indels; quality score recalibration -> BAM
Result: final alignment in BAM format
Other labels in the figure: IGV, Tool, config file, input fastq
[Figure: modular layout with containerized apps. Container A: BWA; Container B: SamTools; Container C: Tool. Other labels: Workflow, config file, input fastq; input/output is mounted from a mounted volume or a volume container.]