Cloudgene
A MapReduce based Workflow Management System
Lukas Forer and Sebastian Schönherr
Division of Genetic Epidemiology
Medical University of Innsbruck, Austria
UPPNEX Workshop - January 2015
Page 2
Motivation: Bioinformatics
• Next Generation Sequencing (NGS)
– Sequencing the whole genome at low cost
– Gigabytes of data produced per experiment
– Allows data production at high scale
• Data generation is no longer the bottleneck
• Data processing is the current bottleneck
– A single workstation is not sufficient
– Supercomputers are too expensive
Page 3
MapReduce
• Commodity computing
– Parallel computing on a large number of low-budget components
• MapReduce
– Parallelization framework
– Enables analyzing large data sets
– The user writes only the map and reduce functions
– The framework takes care of fault tolerance, data distribution, and load balancing
– Apache Hadoop: open-source implementation
Page 4
MapReduce in Bioinformatics (1)
• Hadoop MapReduce libraries for bioinformatics
– Hadoop-BAM: manipulation of aligned next-generation sequencing data (supports BAM, SAM, FASTQ, FASTA, QSEQ, BCF, and VCF)
– SeqPig: processing NGS data with Apache Pig; provides UDFs for frequent tasks; uses Hadoop-BAM
– BioPig: processing NGS data with Apache Pig; provides UDFs
– Biodoop: MapReduce suite for sequence alignment and manipulation of aligned records; written in Python
• DNA alignment algorithms based on Hadoop
– CloudBurst: based on RMAP (seed-and-extend algorithm); Map: extracts k-mers of the reference and non-overlapping k-mers of the reads (as keys); Reduce: end-to-end alignment of seeds
– Seal: based on BWA (version 0.5.9); Map: alignment using BWA (on a previously created internal file format); Reduce: removes duplicates (optional)
– Crossbow: based on Bowtie and SOAPsnp; Map: executes Bowtie on chunks; Reduce: SNP calling using SOAPsnp
• RNA analysis based on Hadoop
– MyRNA: pipeline for calculating differential gene expression in RNA; includes Bowtie
– FX: RNA-Seq analysis tool
– Eoulsan: RNA-Seq analysis tool
• Non-Hadoop-based approaches
– GATK: MapReduce-like framework including a rich set of tools for quality assurance, alignment, and variant calling; not based on Hadoop MapReduce
Page 5
MapReduce in Bioinformatics (2)
• Bioinformatics MapReduce Applications
– Available only on a per-tool basis
– Cover one aspect of a larger data analysis pipeline
– Hard to use for scientists without a background in computer science
• Popular workflow systems
– Enable this level of abstraction for the traditional tools
– Do not support tools based on MapReduce
Missing: a system that enables building MapReduce workflows
Page 6
Cloudgene
• System to graphically execute MapReduce programs and combine them into workflows
• One platform – many programs
– Integration of existing MapReduce programs without
source code adaptations
– Create workflows using MapReduce, Apache Pig, R or
Unix command-line programs
• Runs in your browser
Page 7
Cloudgene: Overview
Cloudgene-MapRed
MapReduce Workflow Manager
Bioinformatics Workflows
• Requires a compatible cluster to execute workflows
– Small and medium-sized research institutes can hardly afford their own clusters
– Cloud computing: rent computer hardware from different
providers (e.g. Amazon, HP)
Page 8
Cloudgene: Overview
Cloudgene-MapRed
MapReduce Workflow Manager
Cloudgene-Cluster
Infrastructure Manager
Bioinformatics Workflows
Page 9
Cloudgene: Advantages
Page 10
Architecture
Page 11
Workflow Composition
• New MapReduce algorithms can be integrated easily
• Existing MapReduce algorithms can be integrated without adaptations in the source code
• Cloudgene uses its own workflow language: the Workflow Definition Language (WDL)
– Formal description of tasks and workflow steps
– Property-based and uses the YAML syntax
– Supports heterogeneous software components (MapReduce, R, and Unix command-line programs)
– Basic workflow control patterns (loops and conditions)
Page 13
Workflow Composition
• Example of a simple WDL manifest file (a sketch follows below)
– Command-line parameters
– Inputs: set by the user through the web interface
– Outputs: created by tasks (intermediate or persistent)
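A minimal sketch of such a manifest, assuming the property names of the Cloudgene WDL format; the workflow name, jar, and parameter names below are illustrative placeholders, not a workflow shipped with Cloudgene:

```yaml
# Illustrative WDL manifest sketch: inputs become form fields in the
# generated web interface, outputs are created by the tasks.
name: align-reads                  # hypothetical workflow name
description: Align reads against a reference genome
version: 1.0

mapred:
  steps:
    # A single MapReduce task; $reference, $reads and $output are
    # placeholders resolved from the user's input at submission time.
    - name: Alignment
      jar: aligner.jar             # hypothetical MapReduce jar
      params: $reference $reads $output

  inputs:
    - id: reads
      description: Input Reads
      type: hdfs-folder            # filled by the user via the web interface
    - id: reference
      description: Reference Sequence
      type: hdfs-file

  outputs:
    - id: output
      description: Alignment Results
      type: hdfs-folder            # intermediate or persistent result
```

The web interface shown on the next slide is generated directly from the input and output declarations of such a manifest.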
Page 14
Workflow Composition
• The user interface is created automatically
Page 15
Workflow Execution Engine
1. Creates a dependency graph from the WDL file and the user input
2. Optimizes the graph to minimize execution time (e.g. through caching)
3. Schedules and submits jobs to the Hadoop cluster
Page 16
Web Interface
Page 17
Workflow Results
• Used parameters
• Download links to result files
Page 18
Supported Technologies
• Apache Hadoop MapReduce
• Apache Pig
• RMarkdown
– Ideal for generating HTML files with charts, statistics, …
• Unix command-line programs
– Cloudgene automatically exports all HDFS files
– No manual file staging between HDFS and the POSIX filesystem needed!
Advantage: composition of hybrid workflows is possible (see the sketch below)
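A hedged sketch of what such a hybrid step list could look like, reusing the manifest conventions from the example above; the cmd and rmd step keys and all tool names are assumptions for illustration, not confirmed Cloudgene syntax:

```yaml
# Hypothetical hybrid workflow: a MapReduce step followed by a Unix
# command-line step and an RMarkdown report. Cloudgene stages files
# between HDFS and the local filesystem automatically, so the later
# steps can consume the MapReduce output without manual copying.
mapred:
  steps:
    - name: Variant Calling
      jar: caller.jar               # hypothetical MapReduce jar
      params: $input $vcf

    - name: Annotation
      cmd: annotate.sh $vcf $table  # Unix tool; $vcf is exported from HDFS first

    - name: Report
      rmd: report.Rmd               # RMarkdown step rendering an HTML report
      params: $table $report
```

The point of the hybrid design is that each step sees ordinary files: Cloudgene performs the HDFS-to-POSIX staging between steps.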
Page 19
Other Features
• Authentication and user management
• Parameter tracking
• HDFS workspace
– Hides the HDFS filesystem from the end user
– Imports data from Amazon S3 buckets, HTTP and (S)FTP servers, file uploads, ...
– Facilitates the management of datasets on the cluster
Page 20
Preview: Cloudgene 2.0
• Interface for web-services
– Same WDL file, but different interface
– User Registration
– Intelligent Queuing
– User Notification
• Examples:
– https://imputationserver.sph.umich.edu
– http://mtdna-server.uibk.ac.at
Page 21
Preview: Cloudgene 2.0
• Generic data analysis platform
– Integration of additional data processing models
– Cloudgene on Hadoop 1.0: MapReduce only
– Cloudgene on Hadoop 2.0: YARN as a middle layer, supporting MapReduce, Spark, Giraph, …
Page 22
Conclusion
• Website
– http://cloudgene.uibk.ac.at
• Virtual Machine
– https://bioimg.org/cloudgene
• Getting started
– http://cloudgene.uibk.ac.at/getting-started
• Developer Guide
– http://cloudgene.uibk.ac.at/developer-guide
Page 23
Acks
• Cloudgene
– Lukas Forer (@lukfor) and Sebastian Schoenherr
(@seppinho)
• Imputation with Minimac
– Goncalo Abecasis, Christian Fuchsberger
• mtDNA-Server
– Hansi Weißensteiner
• Univ.-Prof. Florian Kronenberg
– Head of the Division of Genetic Epidemiology,
Medical University of Innsbruck

Editor's Notes

  • #2 Welcome, everybody, to the defense of my PhD thesis. In the next 20 minutes I will give you an overview of the results and outcomes of my thesis. The main topic is the efficient analysis of data in the field of bioinformatics.
  • #3 NGS makes it possible to sequence the whole genome. This is done in an extremely parallel way and enables sequencing at low cost and high scale. The consequence is that more and more data is produced, so the bottleneck is no longer the data production in the lab, but its analysis. One experiment produces gigabytes of data; therefore, a single workstation is not sufficient for the data analysis, and supercomputers are often too expensive!
  • #4 So one solution to that problem is commodity computing. That means we use a large number of normal, cheap computing components and perform our analysis on them in parallel. One approach that was developed specifically for that kind of infrastructure is MapReduce. It is a parallelization framework developed by Google in 2004 and enables analyzing large data efficiently in parallel. The user writes only the map and the reduce function, and the framework takes care of fault tolerance, data distribution, and load balancing: all the things we need in a parallel computing environment. The map and reduce functions are stateless and can be executed in parallel, so this approach scales very well! Apache Hadoop is an open-source implementation of MapReduce.
  • #5 As this table shows, there already exist several MapReduce applications in the field of bioinformatics, and the potential is high. For example, there are algorithms available for mapping short reads to a reference or for RNA analysis.
  • #6 But the problem with such approaches is that they are available only on a per-tool basis. In genetics we often need large workflows consisting of several steps to analyze data, but those tools cover only one aspect of such a pipeline. Moreover, for biologists without a background in computer science they are very hard to use. Most popular workflow systems, such as Galaxy, enable this abstraction only for traditional tools and not for MapReduce. So a system that enables building such MapReduce workflows is missing.
  • #7 So the aims of my thesis can be classified into two parts: first, developing a system to compose complex workflows out of multiple MapReduce tools, which is done by abstracting away all the technical details; second, evaluating this system by applying it to bioinformatics. For that reason I adapted three different workflows to MapReduce. The first workflow is for genotype imputation, the second for genome-wide association studies, and the last one detects copy number variations.
  • #8 The first aim was addressed by implementing a workflow execution engine called Cloudgene-MapRed, and on top of this I integrated the three workflows. Cloudgene-MapRed requires a compatible cluster to execute the pipeline. Especially for small research institutes it can be hard to afford and maintain their own cluster. A possible solution is cloud computing, which makes it possible to rent computer hardware from different providers, for example Amazon, and to use the rented resources on demand.
  • #9 To overcome these issues, Sebastian developed in his thesis an infrastructure manager that makes it possible to launch and manage a Hadoop cluster through the browser. So it is possible to run the same workflows on a local cluster, on a private cloud, or on a public cloud. The whole system is called Cloudgene, and in my presentation today I talk about the workflow execution engine and one of the three workflows, the Imputation Server.
  • #10 On this slide you can see the advantages of Cloudgene compared with the manual approach.
  • #11 This workflow manager assists scientists in executing and monitoring workflows. The core of the architecture is the execution engine. As you can see in this picture, the workflow execution engine operates on a Hadoop cluster, so data reliability and fault tolerance are provided. The engine contains an optimizer that tries to minimize the execution time by using caching mechanisms. Moreover, it contains a data manager for importing and exporting datasets. The system has a REST API in order to communicate with clients; in our case the client is a web application.
  • #12 The workflow composition in Cloudgene was developed with two aims in mind: first, it should be possible to implement new algorithms easily; second, it should be possible to integrate existing algorithms without source code adaptations. For that reason I developed a new workflow language, called WDL, which is used by Cloudgene. It enables a formal description of workflows and their tasks, is property-based, uses a human-readable syntax, supports different software components as tasks, and supports some basic control patterns like conditions and loops.
  • #14 Here is a very simple example of such a workflow written in WDL. I don't want to go into too much detail, but you can define inputs and outputs and then reuse them in your tasks.
  • #15 Based on this manifest file, we automatically create a user interface that can be used to submit the job with different parameters and datasets. When the user clicks the submit button, the workflow engine comes into play.
  • #16 We have the WDL manifest file with the workflow structure, and the user input that is used to execute it. Based on this information, a graph is created that contains all tasks and their dependencies. Then the optimizer tries to minimize the graph by using caching. Finally, based on this graph, a task execution plan is created, which is used to submit the jobs to the cluster.
  • #17 Once the job is submitted, we can monitor the progress.
  • #18 When the job is complete, we can download the result files directly through the browser, and all used parameters are tracked.
  • #19 Besides the Hadoop technologies, we also support other useful technologies: for example, RMarkdown to create HTML reports, or any other Unix command-line program. In this case Cloudgene automatically exports files from HDFS to the local filesystem, so an intuitive combination of these technologies is possible.
  • #22 The next step of this project is to turn Cloudgene into a more generic data analysis cloud platform. Therefore we plan to integrate additional big data computation models so that cloudgene is not limited to mapreduce. One possibility is to integrate YARN which is the new version of hadoop and is a middle layer between hadoop and mapreduce. So we can support also other models for example for graph data processing and in-memory calculations.