
Cloudgene - A MapReduce based Workflow Management System




Cloudgene is a freely available platform to improve the usability of MapReduce programs by providing a graphical user interface for the execution, the import and export of data and the reproducibility of workflows on in-house (private clouds) and rented clusters (public clouds).



  1. Cloudgene A MapReduce based Workflow Management System Lukas Forer and Sebastian Schönherr Division of Genetic Epidemiology Medical University of Innsbruck, Austria UPPNEX Workshop - January 2015
  2. Page 2 Motivation: Bioinformatics • Next Generation Sequencing (NGS) – Sequencing the whole genome at low cost – Gigabytes of produced data per experiment – Allows data production at high scale • Data generation is not the bottleneck anymore • Data processing as the current bottleneck – Single workstation not sufficient – Supercomputers too expensive
  3. Page 3 MapReduce • Commodity computing – Parallel computing on a large number of low-budget components • MapReduce – Parallelization framework – Enables analyzing large data – User writes map/reduce functions – Framework takes care of fault tolerance, data distribution, load balancing – Apache Hadoop: open-source implementation
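The division of labor described on this slide, where the user writes only a map and a reduce function, can be sketched in plain Python. This is not Hadoop itself; the shuffle/sort phase the framework normally provides is simulated here with an in-memory sort:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    # Map: emit an intermediate (word, 1) pair for every word
    for word in line.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    # Reduce: aggregate all counts observed for one key
    return (word, sum(counts))

def run_mapreduce(lines):
    # Shuffle/sort: group intermediate pairs by key, as the
    # framework would do between the map and reduce phases
    pairs = sorted(kv for line in lines for kv in map_fn(line))
    return [reduce_fn(key, (c for _, c in group))
            for key, group in groupby(pairs, key=itemgetter(0))]

print(run_mapreduce(["to be or", "not to be"]))
# → [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

Because `map_fn` and `reduce_fn` are stateless, the framework is free to run many instances of each in parallel on different chunks of the input, which is exactly why the model scales.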
  4. Page 4 MapReduce in Bioinformatics (1)
     Hadoop MapReduce libraries for Bioinformatics:
     – Hadoop-BAM: Manipulation of aligned next-generation sequencing data (supports BAM, SAM, FASTQ, FASTA, QSEQ, BCF, and VCF)
     – SeqPig: Processing NGS data with Apache Pig; presenting UDFs for frequent tasks; uses Hadoop-BAM
     – BioPig: Processing NGS data with Apache Pig; presenting UDFs
     – Biodoop: MapReduce suite for sequence alignments / manipulation of aligned records; written in Python
     DNA – alignment algorithms based on Hadoop:
     – CloudBurst: Based on RMAP (seed-and-extend algorithm). Map: extracting k-mers of the reference, non-overlapping k-mers of reads (as keys). Reduce: end-to-end alignments of seeds
     – Seal: Based on BWA (version 0.5.9). Map: alignment using BWA (on a previously created internal file format). Reduce: remove duplicates (optional)
     – Crossbow: Based on Bowtie / SOAPsnp. Map: executing Bowtie on chunks. Reduce: SNP calling using SOAPsnp
     RNA – analysis based on Hadoop:
     – MyRNA: Pipeline for calculating differential gene expression in RNA; includes Bowtie
     – FX: RNA-Seq analysis tool
     – Eoulsan: RNA-Seq analysis tool
     Non-Hadoop based approaches:
     – GATK: MapReduce-like framework including a rich set of tools for quality assurance, alignment and variant calling; not based on Hadoop MapReduce
  5. Page 5 MapReduce in Bioinformatics (2) • Bioinformatics MapReduce applications – Available only on a per-tool basis – Cover one aspect of a larger data analysis pipeline – Hard to use for scientists without a background in computer science • Popular workflow systems – Enable this level of abstraction for the traditional tools – Do not support tools based on MapReduce • Missing: a system which enables building MapReduce workflows
  6. Page 6 Cloudgene • System to execute MapReduce programs graphically and combine them into workflows • One platform – many programs – Integration of existing MapReduce programs without source code adaptations – Create workflows using MapReduce, Apache Pig, R or Unix command-line programs • Runs in your browser
  7. Page 7 Cloudgene: Overview [diagram: Cloudgene-MapRed, a MapReduce Workflow Manager, running Bioinformatics Workflows] • Requires a compatible cluster to execute workflows – Small/medium-sized research institutes can hardly afford own clusters – Cloud computing: rent computer hardware from different providers (e.g. Amazon, HP)
  8. Page 8 Cloudgene: Overview [diagram: Cloudgene-MapRed (MapReduce Workflow Manager) and Cloudgene-Cluster (Infrastructure Manager) running Bioinformatics Workflows]
  9. Page 9 Cloudgene: Advantages
  10. Page 10 Architecture
  11. Page 11 Workflow Composition • New MapReduce algorithms can be integrated easily • Integration of existing MapReduce algorithms without adaptations in source code • Cloudgene uses its own workflow language • Workflow Definition Language (WDL) – Formal description of tasks and workflow steps – Property-based and uses the YAML syntax – Supports heterogeneous software components (MapReduce, R and Unix command-line programs) – Basic workflow control patterns (loops and conditions)
  12. Page 13 Workflow Composition • Example of a simple WDL manifest file – Command-line parameters – Inputs: are set by the user through the web interface – Outputs: are created by tasks (intermediate or persistent)
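The manifest example on this slide did not survive the export. The sketch below illustrates what such a WDL file could look like, combining a MapReduce step with an R step; the field names are illustrative and not necessarily Cloudgene's exact schema.

```yaml
# Hypothetical WDL manifest sketch; keys are illustrative,
# not necessarily Cloudgene's exact schema.
name: wordcount-demo
description: Count words with a MapReduce step, then render an R report

inputs:                 # set by the user through the web interface
  - id: input
    description: Input text files
    type: hdfs-folder

outputs:                # created by tasks (intermediate or persistent)
  - id: counts
    description: Word counts
    type: hdfs-folder

steps:
  - name: Count words
    jar: wordcount.jar
    params: $input $counts
  - name: Render report
    rmd: report.Rmd
    params: $counts
```

Inputs and outputs are declared once with an `id` and then referenced in the steps via `$id`, which is what lets the engine wire the tasks together and build the user interface automatically.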
  13. Page 14 Workflow Composition • The user interface is created automatically
  14. Page 15 Workflow Execution Engine 1. Creates a dependency graph based on the WDL file and user input 2. Optimizes the graph to minimize the execution time (i.e. caching) 3. Schedules and submits jobs to the Hadoop Cluster
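The three engine steps on this slide can be sketched in a few lines, assuming a hypothetical task graph and using Python's stdlib `graphlib` for the ordering. The real engine submits to a Hadoop cluster; here the "submission" is just the computed execution order:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Step 1: hypothetical dependency graph derived from a WDL file and
# user input; each task maps to the tasks whose outputs it consumes.
tasks = {
    "import": set(),
    "align":  {"import"},
    "qc":     {"import"},
    "report": {"align", "qc"},
}

# Step 2: optimization via caching; tasks whose outputs already
# exist (here assumed for "import") are skipped entirely.
cached = {"import"}

# Step 3: schedule the remaining tasks in dependency order
order = [task for task in TopologicalSorter(tasks).static_order()
         if task not in cached]
print(order)  # dependency-respecting order, e.g. align/qc before report
```

`static_order()` guarantees every task appears after all of its dependencies, which is exactly the property the scheduler needs before submitting jobs.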
  15. Page 16 Web Interface
  16. Page 17 Workflow Results • Used parameters • Download links to result files
  17. Page 18 Supported Technologies • Apache Hadoop MapReduce • Apache Pig • RMarkdown – Ideal to generate HTML files with charts, statistics, … • Unix command-line programs – Cloudgene automatically exports all HDFS files – No manual file staging between HDFS and the POSIX filesystem needed! • Advantage: composition of hybrid workflows possible
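The automatic staging mentioned on this slide can be sketched as follows. This is an illustration, not Cloudgene's implementation: writing local files stands in for a real HDFS fetch (e.g. via `hdfs dfs -get`), and `hdfs_files` is a hypothetical mapping from an HDFS path to its content.

```python
import os
import subprocess
import tempfile

def run_unix_step(hdfs_files, command):
    """Sketch of staging for a command-line step: materialize HDFS
    outputs in a local POSIX directory, then run the program there."""
    workdir = tempfile.mkdtemp()
    for path, data in hdfs_files.items():
        local = os.path.join(workdir, os.path.basename(path))
        with open(local, "wb") as f:
            f.write(data)  # stand-in for copying the file out of HDFS
    # The Unix program sees ordinary local files, no HDFS awareness needed
    return subprocess.run(command, cwd=workdir,
                          capture_output=True, text=True)

# Usage: a plain `wc -l` over a reduce output that "came from" HDFS
result = run_unix_step({"/user/demo/part-r-00000": b"a\nb\nc\n"},
                       ["wc", "-l", "part-r-00000"])
print(result.stdout)  # line count of the staged file
```

The point of the hybrid-workflow advantage is exactly this: a MapReduce step writes to HDFS, and an unmodified Unix tool in the next step consumes the same data as plain files.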
  18. Page 19 Other Features • Authentication and user management • Parameter tracking • HDFS workspace – Hides the HDFS filesystem from the end user – Importing data from Amazon S3 buckets, HTTP and (S)FTP servers, file uploads, ... – Facilitates the management of datasets on the cluster
  19. Page 20 Preview: Cloudgene 2.0 • Interface for web services – Same WDL file, but different interface – User registration – Intelligent queuing – User notification • Examples: – –
  20. Page 21 Preview: Cloudgene 2.0 • Generic data analysis platform – Integration of additional data processing models [diagram: Cloudgene on Hadoop 1.0 MapReduce vs. Cloudgene on Hadoop 2.0 YARN with MapReduce, Spark, Giraph, …]
  21. Page 22 Conclusion • Website – • Virtual Machine – • Getting started – • Developer Guide –
  22. Page 23 Acks • Cloudgene – Lukas Forer (@lukfor) and Sebastian Schoenherr (@seppinho) • Imputation with Minimac – Goncalo Abecasis, Christian Fuchsberger • mtDNA-Server – Hansi Weißensteiner • Univ.-Prof. Florian Kronenberg – Head of the Division of Genetic Epidemiology, Medical University of Innsbruck

Editor's Notes

  • Welcome everybody to the defense of my PhD thesis.

    In the next 20 minutes I will give you an overview of the results and outcomes of my thesis.

    The main topic is the efficient analysis of data in the field of bioinformatics.

  • NGS makes it possible to sequence the whole genome.

    This is done in an extremely parallel way and enables sequencing the genome at low cost and high scale.

    This has the consequence that more and more data will be produced.

    So the bottleneck is no longer the data production in the lab, but its analysis.

    This is because one experiment produces gigabytes of data.

    Therefore, one single workstation is not sufficient for the data analysis, and supercomputers are often too expensive!
  • So one solution for that problem is to use commodity computing.

    That means we use a large number of normal, cheap computing components and perform our analysis on them in parallel.

    And one approach which was developed specifically for that kind of infrastructure is MapReduce.

    It is a parallelization framework developed by Google in 2004 and enables analyzing large data efficiently in parallel.

    The user writes only the map and the reduce function, and the framework takes care of fault tolerance, data distribution and load balancing.

    All the stuff we need in a parallel computing environment.

    The map and reduce functions are stateless and can be executed in parallel, and therefore this approach scales very well!

    Apache Hadoop is an open-source implementation of MapReduce.

  • As this table shows, several MapReduce applications already exist in the field of bioinformatics, and there is high potential.

    For example, there are algorithms available for mapping short reads to a reference or for RNA analysis.
  • But the problem of such approaches is that they are available only on a per-tool basis.

    In genetics we often need large workflows which consist of several steps to analyze data.

    But those tools cover only one aspect of such a pipeline.

    Moreover, for biologists without a background in computer science they are very hard to use.

    Most popular workflow systems, such as Galaxy, enable this abstraction only for traditional tools and not for MapReduce.

    So a system which enables building such MapReduce workflows is missing.

  • So the aims of my thesis can be divided into two parts:

    First, developing a system to compose complex workflows of multiple MapReduce tools. This is done by abstracting all the technical details.

    Second, evaluating this system by applying it to bioinformatics. For that reason I adapted three different workflows to MapReduce.

    The first workflow is for genotype imputation, the second for genome-wide association studies and the last one detects copy number variations.
  • The first aim was solved by implementing a Workflow Execution Engine called Cloudgene MapRed.

    And on top of this I have integrated the three workflows.

    Cloudgene-MapRed requires a compatible cluster to execute the pipeline.

    Especially for small research institutes it can be hard to afford and maintain their own cluster.

    So a possible solution is cloud computing, which enables renting computer hardware from different providers, for example Amazon.

    So they can use the rented resources on demand.
  • To overcome these issues, Sebastian developed in his thesis an infrastructure manager which enables launching and managing a Hadoop cluster through the browser.

    So it is possible to run the same workflows on a local cluster, on a private cloud or on a public cloud.

    This whole system is called Cloudgene, and in my presentation today I talk about the workflow execution engine and one of the three workflows.

    And this workflow is called imputation server.
  • On this slide you can see the advantages of Cloudgene compared with the manual approach.
  • This workflow manager assists scientists in executing and monitoring workflows.

    The core of the architecture is the execution engine.

    As you can see in this picture, the workflow execution engine operates on a Hadoop cluster.

    Therefore data reliability and fault tolerance are provided.

    The workflow engine contains an optimizer which tries to minimize the execution time by using caching mechanisms.

    Moreover, it contains a data manager for importing and exporting datasets.

    The system has a REST API in order to communicate with clients.

    In our case the client is a web application.
  • The Workflow composition in Cloudgene was developed with two aims in mind:

    First, it should be possible to implement new algorithms easily.

    Second, it should be possible to integrate existing algorithms without source code adaptations.

    For that reason, I developed a new workflow language, called WDL, which is used by Cloudgene.

    It enables a formal description of workflows and their tasks.

    It is property-based and uses a human-readable syntax.

    It supports different software components as tasks.

    And it supports some basic control patterns like conditions and loops.
  • Here is a very simple example of such a workflow written in WDL.

    I don't want to go into too much detail, but you can define inputs and outputs and then reuse them in your tasks.
  • Based on this manifest file, we automatically create a user interface which can be used to submit the job with different parameters and datasets.

    And when the user clicks on the submit button then the workflow engine comes into play.
  • We have the WDL manifest file with the workflow structure, and the user input which is used to execute it.

    Based on this information, a graph is created which contains all tasks and their dependencies.

    Then the optimizer tries to minimize the graph by using caching.

    And finally, based on this graph, a task execution plan is created which is used to submit the jobs to the cluster.
  • Once the job is submitted, we can monitor the progress.
  • When the job is complete, we can download the result files directly through the browser, and all used parameters are tracked.
  • Besides the Hadoop technologies, we also support other useful technologies.

    For example RMarkdown to create HTML reports.

    Or any other Unix command-line program. In this case Cloudgene automatically exports files from HDFS to the local filesystem, so an intuitive combination of these technologies is possible.
  • The next step of this project is to turn Cloudgene into a more generic data analysis cloud platform.

    Therefore we plan to integrate additional big data computation models so that Cloudgene is not limited to MapReduce.

    One possibility is to integrate YARN, which is the new version of Hadoop and acts as a middle layer between Hadoop and MapReduce.

    So we can also support other models, for example for graph data processing and in-memory calculations.