Cloudgene - A MapReduce based Workflow Management System

Cloudgene is a freely available platform to improve the usability of MapReduce programs by providing a graphical user interface for the execution, the import and export of data and the reproducibility of workflows on in-house (private clouds) and rented clusters (public clouds).

  1. Cloudgene: A MapReduce based Workflow Management System
     Lukas Forer and Sebastian Schönherr, Division of Genetic Epidemiology, Medical University of Innsbruck, Austria
     UPPNEX Workshop, January 2015
  2. Motivation: Bioinformatics
     • Next Generation Sequencing (NGS)
       – Sequencing the whole genome at low cost
       – Gigabytes of data produced per experiment
       – Allows data production at high scale
     • Data generation is no longer the bottleneck
     • Data processing is the current bottleneck
       – A single workstation is not sufficient
       – Supercomputers are too expensive
  3. MapReduce
     • Commodity computing
       – Parallel computing on a large number of low-budget components
     • MapReduce
       – Parallelization framework
       – Enables the analysis of large data sets
       – The user writes the map and reduce functions
       – The framework takes care of fault tolerance, data distribution and load balancing
       – Apache Hadoop: open-source implementation
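The division of labour on this slide (the user writes map and reduce, the framework does the rest) can be illustrated with a minimal single-process sketch in plain Python. It is not Hadoop code: `map_kmers`, `shuffle`, `reduce_counts` and `run_job` are hypothetical names chosen for illustration, and the in-memory `shuffle` only stands in for the grouping, distribution and fault tolerance that a real framework provides.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_kmers(read: str, k: int) -> Iterator[Tuple[str, int]]:
    """Map step (user code): emit (k-mer, 1) for every k-mer in one read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Stand-in for the framework: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step (user code): sum the counts collected for one k-mer."""
    return key, sum(values)


def run_job(reads: Iterable[str], k: int) -> Dict[str, int]:
    """Run map, shuffle and reduce sequentially in a single process."""
    pairs = (pair for read in reads for pair in map_kmers(read, k))
    grouped = shuffle(pairs)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


counts = run_job(["ACGTAC", "GTACGT"], k=3)
```

Only `map_kmers` and `reduce_counts` correspond to what a user would write; everything else is what the framework is assumed to take over.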
  4. MapReduce in Bioinformatics (1)
     • Hadoop MapReduce libraries for Bioinformatics
       – Hadoop-BAM: Manipulation of aligned next-generation sequencing data (supports BAM, SAM, FASTQ, FASTA, QSEQ, BCF, and VCF)
       – SeqPig: Processing NGS data with Apache Pig; presenting UDFs for frequent tasks; using Hadoop-BAM
       – BioPig: Processing NGS data with Apache Pig; presenting UDFs
       – Biodoop: MapReduce suite for sequence alignments / manipulation of aligned records; written in Python
     • DNA: Alignment algorithms based on Hadoop
       – CloudBurst: Based on RMAP (seed-and-extend algorithm). Map: extracting k-mers of reference, non-overlapping k-mers of reads (as keys). Reduce: end-to-end alignments of seeds
       – Seal: Based on BWA (version 0.5.9). Map: alignment using BWA (on a previously created internal file format). Reduce: remove duplicates (optional)
       – Crossbow: Based on Bowtie / SOAPsnp. Map: executing Bowtie on chunks. Reduce: SNP calling using SOAPsnp
     • RNA: Analysis based on Hadoop
       – MyRNA: Pipeline for calculating differential gene expression in RNA; including Bowtie
       – FX: RNA-Seq analysis tool
       – Eoulsan: RNA-Seq analysis tool
     • Non-Hadoop based approaches
       – GATK: MapReduce-like framework including a rich set of tools for quality assurance, alignment and variant calling; not based on Hadoop MapReduce
  5. MapReduce in Bioinformatics (2)
     • Bioinformatics MapReduce applications
       – Available only on a per-tool basis
       – Cover one aspect of a larger data analysis pipeline
       – Hard to use for scientists without a background in computer science
     • Popular workflow systems
       – Enable this level of abstraction for traditional tools
       – Do not support tools based on MapReduce
     • Missing: a system that enables building MapReduce workflows
  6. Cloudgene
     • System to execute MapReduce programs graphically and combine them into workflows
     • One platform, many programs
       – Integration of existing MapReduce programs without source code adaptations
       – Create workflows using MapReduce, Apache Pig, R or Unix command-line programs
     • Runs in your browser
  7. Cloudgene: Overview
     [Diagram labels: Bioinformatics Workflows; Cloudgene-MapRed (MapReduce Workflow Manager)]
     • Requires a compatible cluster to execute workflows
       – Small and medium-sized research institutes can hardly afford their own clusters
       – Cloud computing: rent computer hardware from different providers (e.g. Amazon, HP)
  8. Cloudgene: Overview
     [Diagram labels: Bioinformatics Workflows; Cloudgene-MapRed (MapReduce Workflow Manager); Cloudgene-Cluster (Infrastructure Manager)]
  9. Cloudgene: Advantages
  10. Architecture
  11. Workflow Composition
      • New MapReduce algorithms can be integrated easily
      • Existing MapReduce algorithms can be integrated without changes to their source code
      • Cloudgene uses its own workflow language
      • Workflow Definition Language (WDL)
        – Formal description of tasks and workflow steps
        – Property-based, using YAML syntax
        – Supports heterogeneous software components (MapReduce, R and Unix command-line programs)
        – Basic workflow control patterns (loops and conditions)
  12. Workflow Composition
      • Example of a simple WDL manifest file (annotations on the slide):
        – Command-line parameters
        – Inputs: set by the user through the web interface
        – Outputs: created by tasks (intermediate or persistent)
  13. Workflow Composition
      • The user interface is created automatically
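One way to picture how a user interface could be derived from declared workflow inputs is the sketch below. It rests on assumptions throughout: the `manifest` dictionary, its keys (`id`, `description`, `type`, `value`), the type names and the `render_form` helper are hypothetical and do not reproduce Cloudgene's actual WDL schema or its web front end.

```python
from html import escape

# Hypothetical manifest, written as a Python dict instead of a YAML file.
manifest = {
    "name": "kmer-count",
    "inputs": [
        {"id": "reads", "description": "Input reads", "type": "folder"},
        {"id": "k", "description": "k-mer length", "type": "number", "value": "3"},
    ],
}

# Assumed mapping from declared input types to HTML input widgets.
WIDGETS = {"folder": "file", "number": "number", "text": "text"}


def render_form(workflow: dict) -> str:
    """Build one labelled <input> element per declared workflow input."""
    rows = []
    for spec in workflow["inputs"]:
        widget = WIDGETS.get(spec["type"], "text")  # unknown types fall back to text
        name = escape(spec["id"], quote=True)
        label = escape(spec["description"])
        value = spec.get("value")
        value_attr = "" if value is None else ' value="{}"'.format(escape(str(value), quote=True))
        rows.append('<label>{} <input type="{}" name="{}"{}></label>'.format(
            label, widget, name, value_attr))
    return "<form>\n" + "\n".join(rows) + "\n</form>"


form = render_form(manifest)
```

Because the form is computed from the declaration alone, adding an input to the manifest adds a field to the form without any interface code being written, which is the property the slide describes.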
  14. Workflow Execution Engine
      1. Creates a dependency graph based on the WDL file and user input
      2. Optimizes the graph to minimize the execution time (i.e. caching)
      3. Schedules and submits jobs to the Hadoop cluster
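The three steps above can be sketched with Python's standard `graphlib` module (Python 3.9 or later). This is an illustration under assumptions, not Cloudgene's engine: the step names are hypothetical, "caching" is interpreted here as skipping steps whose results already exist, and the final submission to Hadoop is reduced to returning an ordered list.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def plan(steps: Dict[str, Set[str]], cached: Set[str]) -> List[str]:
    """Return the steps that still need to run, dependencies first.

    `steps` maps each step name to the set of steps it depends on.
    `cached` holds steps whose outputs are assumed to be reusable;
    this sketch does not check whether their inputs have changed.
    """
    order = TopologicalSorter(steps).static_order()
    return [step for step in order if step not in cached]


# Hypothetical three-step workflow: align -> call_variants -> report
steps = {
    "align": set(),
    "call_variants": {"align"},
    "report": {"call_variants"},
}

to_submit = plan(steps, cached={"align"})
```

With the alignment result cached, only the two downstream steps remain to be submitted, in dependency order.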
  15. Web Interface
  16. Workflow Results
      [Screenshot labels: used parameters; download links to result files]
  17. Supported Technologies
      • Apache Hadoop MapReduce
      • Apache Pig
      • RMarkdown
        – Ideal for generating HTML files with charts, statistics, …
      • Unix command-line programs
        – Cloudgene automatically exports all HDFS files
        – No manual file staging between HDFS and the POSIX filesystem needed
      • Advantage: composition of hybrid workflows is possible
  18. Other Features
      • Authentication and user management
      • Parameter tracking
      • HDFS workspace
        – Hides the HDFS filesystem from the end user
        – Imports data from Amazon S3 buckets, HTTP and (S)FTP servers, file uploads, ...
        – Facilitates the management of datasets on the cluster
  19. Preview: Cloudgene 2.0
      • Interface for web services
        – Same WDL file, but different interface
        – User registration
        – Intelligent queuing
        – User notification
      • Examples:
        –
        –
  20. Preview: Cloudgene 2.0
      • Generic data analysis platform
        – Integration of additional data processing models
      [Diagram: Cloudgene on Hadoop 1.0 (MapReduce) compared with Cloudgene on Hadoop 2.0 (YARN: MapReduce, Spark, Giraph, …)]
  21. Conclusion
      • Website –
      • Virtual Machine –
      • Getting started –
      • Developer Guide –
  22. Acknowledgements
      • Cloudgene – Lukas Forer (@lukfor) and Sebastian Schoenherr (@seppinho)
      • Imputation with Minimac – Goncalo Abecasis, Christian Fuchsberger
      • mtDNA-Server – Hansi Weißensteiner
      • Univ.-Prof. Florian Kronenberg – Head of the Division of Genetic Epidemiology, Medical University of Innsbruck