An approach to develop reproducible bioinformatics pipelines using Docker and Anduril.


Reproducible bioinformatics pipelines with Docker and Anduril

  1. Reproducible Bioinformatics Pipelines with Docker & Anduril
     Christian Frech, PhD
     Bioinformatician at Children's Cancer Research Institute, Vienna
     CeMM Special Seminar, September 25th, 2015
  2. Why care about reproducible pipelines in bioinformatics?
     • For your (future) self
       • Quickly re-run analysis with different parameters/tools
       • Best documentation of how results have been produced
     • For others
       • Allow others to easily reproduce your findings (“reproducibility crisis”)*
       • Code re-use between projects and colleagues
     *) http://theconversation.com/science-is-in-a-reproducibility-crisis-how-do-we-resolve-it-16998
  3. Obstacles to computational reproducibility
     • Software/script not available (even upon request)
     • Black box: code (or even virtual machine) available, but no documentation on how to run it
     • Dependency hell: software and documentation available, but (too) difficult to get running
     • Code rot: code breaks over time due to software updates
     • 404 Not Found: unstable URLs, e.g. links to lab homepages
     Go figure…
  4. Computational pipelines to the rescue
     • In bioinformatics, data analysis typically consists of a series of heterogeneous programs strung together via file-based inputs and outputs
       • Example: FASTQ -> alignment (BWA) -> variant calling (GATK) -> variant annotation (SnpEff) -> custom R script
     • Simple automation via (bash/R/Python/Perl) scripting has its limitations
       • No error checking
       • No partial execution
       • No parallelization
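To make two of these limitations concrete, here is a minimal sketch of what a pipeline framework adds on top of a plain script: a step is skipped when its output is already newer than its inputs (partial execution), and a failing command stops the run (error checking). The function names `needs_rebuild` and `run_step` are hypothetical and not taken from any framework named in the slides; parallelization is deliberately left out.

```python
import subprocess
from pathlib import Path


def needs_rebuild(inputs, output):
    """Return True if the output is missing or older than any input file."""
    out = Path(output)
    if not out.exists():
        return True
    out_mtime = out.stat().st_mtime
    return any(Path(i).stat().st_mtime > out_mtime for i in inputs)


def run_step(command, inputs, output):
    """Run one pipeline step unless its output is already up to date.

    Returns True if the command was executed and False if it was skipped.
    """
    if not needs_rebuild(inputs, output):
        return False
    # check=True raises CalledProcessError when the exit code is non-zero,
    # so a broken step cannot silently feed bad data to the next one.
    subprocess.run(command, check=True)
    return True
```

Chaining several `run_step` calls in dependency order gives a crude version of the FASTQ -> alignment -> variant calling example: after a failure, a re-run only repeats the steps whose outputs are missing or stale.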
  5. No shortage of pipeline frameworks
     • Script-based
       • GNU Make, Snakemake, Bpipe, Ruffus, Drake, Rake, Nextflow, …
     • GUI-based
       • Galaxy, GenePattern, Chipster, Taverna, Pegasus, …
     • Various commercial solutions for more standardized workflows (e.g. RNA-seq)
       • Geared toward biologists without programming skills (“point-and-click”)
     See also https://www.biostars.org/p/79, https://www.biostars.org/p/91301/
  6. Personal wish list for pipeline framework
     • Script-based (maximum flexibility, minimum overhead)
     • Powerful scripting language
     • Cluster integration (preferably via slurm)
     • Modular (allow code re-use b/w projects and colleagues)
     • Component library for frequent tasks (e.g. join two CSV files)
     • Reporting (HTML, PDF) to share results
     • Free & open-source
     • Bundle scripts/data with execution environment
  7. What’s wrong with good ol’ GNU make?
     PRO
     • Available on all Linux platforms
     • Stood the test of time (developed in 1970s)
     • Rapid development (Bash scripting + target rules)
     • Parallel execution of independent targets (-j parameter)
     CON
     • No cluster support
     • Arcane syntax, cryptic pattern rules
     • Half-baked multi-output rules
     • No type checking (everything is a generic file)
     • Difficult to modularize (code re-use)
     • Rebuild not triggered by recipe change
     • No reporting
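The point about recipe changes can be illustrated with a small sketch: make compares file timestamps only, so editing the command of a rule does not invalidate its target. One way around this, shown here as an assumption-laden illustration rather than as how Anduril or any other framework actually does it, is to store a hash of the command next to the output and compare it on the next run. The stamp-file convention and all function names are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def recipe_digest(command):
    """Hash the command line so that edits to the recipe become detectable."""
    return hashlib.sha256(json.dumps(command).encode("utf-8")).hexdigest()


def recipe_changed(command, stamp_file):
    """Return True if the command differs from the one recorded in stamp_file."""
    stamp = Path(stamp_file)
    return not stamp.exists() or stamp.read_text() != recipe_digest(command)


def record_recipe(command, stamp_file):
    """Store the digest once the step has completed successfully."""
    Path(stamp_file).write_text(recipe_digest(command))
```

A step would then be re-run when either its inputs are newer than its output or `recipe_changed` returns True, with `record_recipe` called after each successful run.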
  8. Anduril
     http://www.anduril.org
  9. Anduril
     • Developed since 2008 at Biomedicum Systems Biology Laboratory, Helsinki, Finland
       • http://research.med.helsinki.fi/gsb/hautaniemi/
     • Built for scientific data analysis with focus on bioinformatics
     • Proprietary workflow scripting language “Anduril script”
       • Possibility to embed native code (Bash/R/Python/Perl)
       • Version 2 will switch to Scala
     • Open source & free
     • Significo (http://www.significo.fi/) is a commercial spin-off offering Anduril consulting services
     • No widespread adoption (yet?)
  10. Anduril features
      • Script-based (maximum flexibility, less overhead)
      • Expressive scripting language
      • Cluster integration (preferably via slurm)
      • Modular to allow code re-use (b/w projects and colleagues)
      • Ready-made component library for frequent analysis steps
      • Reporting (HTML, PDF) to share results
      • Free & open-source
      • Bundle scripts/data with execution environment (X)
  11. Example workflow: RNA-seq alignment with GSNAP
      Anduril script:

          inputBamDir = INPUT(path="/data/bam", recursive=false)
          inputBamFiles = Folder2Array(folder1 = inputBamDir, filePattern = "C57C3ACXX_CV_([^_]+)_.*[.]bam$")
          alignedBams = record()
          for bam : std.iterArray(inputBamFiles) {
              gsnap = GSNAP (
                  reads   = INPUT(path=bam.file),
                  options = "--npaths=1 --max-mismatches=1 --novelsplicing=0",
                  @cpu    = 10,
                  @memory = 40000,
                  @name   = "gsnap_" + bam.key
              )
              alignedBams[bam.key] = gsnap.alignment
          }

      Execute with (distributed execution on cluster):

          $ anduril run workflow.and --exec-mode slurm
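The `filePattern` in the script above contains one capture group. The sketch below shows what that regular expression extracts, on the assumption (not confirmed by the slides) that `Folder2Array` uses the captured text as the key of each array entry, which would explain the `bam.key` references in the loop. The function `files_to_record` and the file names used with it are made up for illustration.

```python
import re

# Same regular expression as the filePattern in the Anduril script.
FILE_PATTERN = re.compile(r"C57C3ACXX_CV_([^_]+)_.*[.]bam$")


def files_to_record(file_names, pattern=FILE_PATTERN):
    """Map the first capture group of each matching file name to that name.

    File names that do not match the pattern are ignored.
    """
    record = {}
    for name in file_names:
        match = pattern.search(name)
        if match:
            record[match.group(1)] = name
    return record
```

With hypothetical names such as `C57C3ACXX_CV_S1_L001.bam`, the key would be the sample identifier `S1`, which the workflow then reuses in `@name` and as the index into `alignedBams`.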
  12. Embedding native R code in Anduril script
      Convert UCSC to Ensembl chromosome names in a CSV file containing column ‘chrom’:

          ensembl = REvaluate(
              table1 = ucsc,
              script = StringInput(content=
                  '''
                  table.out <- table1
                  table.out$chrom <- gsub("^chr", "", table.out$chrom)
                  '''
              )
          )

      Also supports inlining of Bash, Python, Java, and Perl scripts
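For readers who do not use R, the transformation performed by the embedded snippet can be sketched in plain Python. This is a standalone illustration of the `gsub("^chr", "", ...)` step only, not an Anduril component; the function name `ucsc_to_ensembl` is hypothetical.

```python
import csv
import io
import re


def ucsc_to_ensembl(csv_text):
    """Strip a leading 'chr' from the 'chrom' column of a CSV document."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Equivalent of gsub("^chr", "", chrom): only a prefix is removed.
        row["chrom"] = re.sub(r"^chr", "", row["chrom"])
        writer.writerow(row)
    return out.getvalue()
```

Like the R snippet, this only removes the prefix; it does not rename the mitochondrial chromosome, which UCSC calls `chrM` and Ensembl calls `MT`.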
  13. Anduril features
      • Script-based (maximum flexibility, less overhead)
      • Expressive scripting language
      • Cluster integration (preferably via slurm)
      • Modular to allow code re-use (b/w projects and colleagues)
      • Ready-made component library for frequent analysis steps
      • Reporting (HTML, PDF) to share results
      • Free & open-source
      • Bundle scripts/data with execution environment (?)
  14. Docker
      • “Lightweight” virtualization technology for Unix-based systems
      • Processes run in isolated namespaces (“containers”), but share same kernel
      • Like VMs: containers portable between systems -> reproducibility!
      • Unlike VMs: instant startup, no resource pre-allocation -> better hardware utilization
      [Figure: VM vs. container]
  15. How to bundle workflow with execution environment?
      Solution 1: Anduril, workflow, and components 1-3 in a single container
      • Pro: Single container, easy to maintain
      • Con: VM-like approach; huge, monolithic container, difficult to share (against Docker philosophy)
      Solution 2: Anduril and workflow, with each component in its own container (Container A: Component 1, Container B: Component 2, Container C: Component 3)
      • Pro: Completely modularized, easy to re-use/share workflow components
      • Con: “container hell”?
  16. Hybrid solution
      • Project- and user-specific components installed in master container
      • Shared components installed in common container (e.g. container “RNA-seq”)
      • “Docker inside docker”
      • Pro: Workflow completely containerized (= portable); only shared components in common containers
      • Con: Still (but greatly reduced) overhead for container maintenance
      [Diagram labels: Workflow, Anduril, Master container, Container A, Components 1-3]
  17. Dockerized GSNAP in Anduril

          inputBamDir = INPUT(path="/data/bam", recursive=false)
          inputBamFiles = Folder2Array(folder1 = inputBamDir, filePattern = "C57C3ACXX_CV_([^_]+)_.*[.]bam$")
          alignedBams = record()
          for bam : std.iterArray(inputBamFiles) {
              gsnap = GSNAP (
                  reads   = INPUT(path=bam.file),
                  options = "--npaths=1 --max-mismatches=1 --novelsplicing=0",
                  docker  = "cfrech/anduril-gsnap-2015-09-21",
                  @cpu    = 10,
                  @memory = 40000,
                  @name   = "gsnap_" + bam.key
              )
              alignedBams[bam.key] = gsnap.alignment
          }
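The slides do not show how the `docker` parameter is turned into an actual container invocation. As a rough sketch under several assumptions, a component wrapper could assemble a `docker run` command line like the one built below. Assumed here: the image exposes a `gsnap` executable, the data directory is mounted at the same path inside the container, and `@memory` is given in megabytes. The function `docker_command` is hypothetical; only the `docker run` options `--rm`, `-v`, `--cpus`, and `--memory` are real Docker CLI flags.

```python
import shlex


def docker_command(image, tool_args, data_dir, cpus=None, memory_mb=None):
    """Build the argument list for running a tool inside a container."""
    # --rm removes the container after exit; -v mounts the data directory
    # at the same path inside the container (an assumption of this sketch).
    argv = ["docker", "run", "--rm", "-v", f"{data_dir}:{data_dir}"]
    if cpus is not None:
        argv += ["--cpus", str(cpus)]
    if memory_mb is not None:
        argv += ["--memory", f"{memory_mb}m"]
    argv.append(image)
    argv += list(tool_args)
    return argv


gsnap_args = ["gsnap"] + shlex.split("--npaths=1 --max-mismatches=1 --novelsplicing=0")
command = docker_command(
    "cfrech/anduril-gsnap-2015-09-21", gsnap_args, "/data/bam",
    cpus=10, memory_mb=40000,
)
```

Returning a list rather than a single string keeps the arguments safe to pass to `subprocess.run` without shell quoting issues.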
  18. So, Anduril is great… but
      • Proprietary scripting language
        • Biggest hurdle for widespread adoption IMO
        • Will likely improve with version 2 (which uses Scala)
      • Documentation opaque for beginners
        • WANTED: Simple step-by-step guide to build your first Anduril workflow
      • High upfront investment to get going (because of the above)
      • In-lining Bash/R/Perl/Python should be simpler
        • Currently too much clutter when using “BashEvaluate” and the like
      • Coding in Anduril sometimes “feels heavy” compared to other frameworks (e.g. GNU Make)
        • Will improve with fluency in workflow scripting language
  19. Anduril RNA-seq case study
  20. RNA-seq case study
      Step 1: Configure Anduril workflow

          title = "My project long title"
          shortName = "My project short title"
          authors = "Christian Frech"

          // analyses to run
          runNetworkAnalysis = true
          runMutationAnalysis = true
          runGSEA = true

          // constants
          PROJECT_BASE="/mnt/projects/myproject"
          gtf = INPUT(path=PROJECT_BASE+"/data/Homo_sapiens.GRCh37.75.etv6runx1.gtf.gz")
          referenceGenomeFasta = INPUT(path="/data/reference/human_g1k_v37.fasta")
          ...

      + description of samples, sample groups, and group comparisons in external CSV file
  21. RNA-seq case study
      Step 2: Run Anduril workflow on cluster

          $ anduril run main.and --exec-mode slurm

  22. RNA-seq case study
      Step 3: Go for lunch
  23. RNA-seq case study
      Step 4: Study PDF report
  24. What follows are screenshots from this PDF report
  25. QC: Read counts
  26. QC: Gene body coverage
  27. QC: Distribution of expression values per sample
  28. QC: Sample PCA & heatmap
  29. Volcano plot for each comparison
  30. Table report of DEGs for each comparison
  31. Expression values of top differentially expressed genes per comparison
  32. GO term enrichment for each comparison
  33. Interaction network of DEGs for each comparison
  34. Chromosomal distribution of DEGs
  35. GSEA heat map summarizing all comparisons
      • Rows = enriched gene sets
      • Columns = comparisons
      • Value = normalized enrichment score (NES)
      • Red = enriched for up-regulated genes
      • Blue = enriched for down-regulated genes
      • * = significant (FDR < 0.05)
      • ** = highly significant (FDR < 0.01)
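The significance markers in this legend follow a simple threshold rule, sketched below. The thresholds come from the slide; the function names and the two-decimal cell format are assumptions made for illustration and are not taken from the report code.

```python
def significance_mark(fdr):
    """Return the legend marker for a given false discovery rate."""
    if fdr < 0.01:
        return "**"  # highly significant
    if fdr < 0.05:
        return "*"   # significant
    return ""        # not significant


def cell_label(nes, fdr):
    """Format one heat map cell as NES plus significance marker."""
    return f"{nes:.2f}{significance_mark(fdr)}"
```

Checking the stricter threshold first matters: an FDR below 0.01 is also below 0.05, so the order of the two tests decides which marker is shown.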
  36. Future developments
      • Push new Anduril components to public repository (needs some refactoring, documentation, test cases)
      • Help on Anduril2 manuscript
      • Port custom Makefiles to Anduril (ongoing)
      • Cloud deployment of dockerized workflow
        • Couple slurm to AWS EC2
        • Automatic spin-up of docker-enabled AMIs serving as computing nodes
  37. In the (not so) distant future …

          $ docker pull cfrech/frech2015_et_al
          $ docker run cfrech/frech2015_et_al --use-cloud --max-nodes 300 --out output
          $ evince output/figure1.pdf
  38. Further reading
      • Discussion thread on Docker & Anduril: https://groups.google.com/forum/#!msg/anduril-dev/Et8-YG9O-Aw
  39. Acknowledgement
      • Marko Laakso (Significo)
      • Sirku Kaarinen (Significo)
      • Kristian Ovaska (Valuemotive)
      • Pekka Lehti (Valuemotive)
      • Ville Rantanen (University of Helsinki, Hautaniemi lab)
      • Nuno Andrade (CCRI)
      • Andreas Heitger (CCRI)