A MapReduce based Workflow Management System
Lukas Forer and Sebastian Schönherr
Division of Genetic Epidemiology
Medical University of Innsbruck, Austria
UPPNEX Workshop - January 2015
• Next Generation Sequencing (NGS)
– Sequencing the whole genome at low cost
– Gigabytes of produced data per experiment
– Allows data production at high scale
• Data generation is no longer the bottleneck
• Data processing is the current bottleneck
– Single Workstation not sufficient
– Super-Computers too expensive
• Commodity computing
– Parallel computing on a large number of low-budget machines
– Parallelization framework
– Enables analyzing large data
– User writes map/reduce function
– Framework takes care of fault tolerance, data distribution and load balancing
– Apache Hadoop: Open Source implementation
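As a toy illustration of the programming model described above (not Hadoop's actual Java API; every name in the sketch is made up for illustration), counting k-mers in sequencing reads looks like this:

```python
from collections import defaultdict

# Toy MapReduce run: count k-mers across a set of reads.
# map_fn and reduce_fn are the only user-written parts; a real
# framework (e.g. Hadoop) would shard, shuffle and retry them.

def map_fn(read, k=3):
    """Emit (k-mer, 1) pairs for one read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1

def reduce_fn(key, values):
    """Sum the counts collected for one k-mer."""
    return key, sum(values)

def run_mapreduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)            # the "shuffle" phase
    for record in inputs:
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

counts = run_mapreduce(["GATTAC", "ATTACA"], map_fn, reduce_fn)
print(counts["ATT"])  # "ATT" occurs once in each read -> 2
```

Because `map_fn` and `reduce_fn` are stateless, the framework is free to run many copies of each in parallel, which is exactly why the model scales.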
MapReduce in Bioinformatics (1)
• Manipulation of aligned next-generation sequencing data (supports BAM, SAM, FASTQ, FASTA, QSEQ, BCF, and VCF)
• Processing NGS data with Apache Pig; presenting UDFs for frequent tasks; uses Hadoop-BAM
• BioPig: processing NGS data with Apache Pig; presenting UDFs
• MapReduce suite for sequence alignments / manipulation of aligned records; written in …
• Based on RMAP (seed-and-extend algorithm); Map: extracting k-mers of the reference and non-overlapping k-mers of the reads (as keys); Reduce: end-to-end alignments of seeds
• Based on BWA (version 0.5.9); Map: alignment using BWA (on a previously created internal file format); Reduce: remove duplicates (optional)
• Based on Bowtie / SOAPsnp; Map: executing Bowtie on chunks; Reduce: SNP calling using SOAPsnp
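The seed-and-extend mapping used by the RMAP-based tool above can be sketched in a few lines; this shows only the seeding step (the reduce phase would extend each candidate into a full alignment), and all function names are illustrative:

```python
def overlapping_kmers(seq, k):
    """All k-mers of the reference with their offsets (map phase, reference side)."""
    return [(seq[i:i + k], i) for i in range(len(seq) - k + 1)]

def nonoverlapping_kmers(seq, k):
    """Non-overlapping k-mers of a read with their offsets (map phase, read side)."""
    return [(seq[i:i + k], i) for i in range(0, len(seq) - k + 1, k)]

# Shared k-mers act as MapReduce keys; each match proposes a
# candidate placement of the read on the reference.
ref_seeds = overlapping_kmers("ACGTACGTGG", 4)
read_seeds = nonoverlapping_kmers("ACGTGG", 4)
candidates = [g_off - r_off
              for kmer, r_off in read_seeds
              for g, g_off in ref_seeds if g == kmer]
print(candidates)  # the read's seed "ACGT" matches reference offsets 0 and 4
```

Grouping by k-mer is what makes this a natural MapReduce job: the shuffle phase brings together every read seed and reference position sharing the same key.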
RNA Analysis
• MyRNA: pipeline for calculating differential gene expression in RNA; including Bowtie
• FX: RNA-Seq analysis tool
• Eoulsan: RNA-Seq analysis tool
• MapReduce-like framework including a rich set of tools for quality assurance, alignment and variant calling; not based on Hadoop MapReduce
MapReduce in Bioinformatics (2)
• Bioinformatics MapReduce Applications
– Available only on a per-tool basis
– Cover one aspect of a larger data analysis pipeline
– Hard to use for scientists without a background in computer science
• Popular workflow systems
– Enable this level of abstraction for the traditional tools
– Do not support tools based on MapReduce
Missing: a system which enables building MapReduce workflows
• System to execute MapReduce programs graphically
and combine them to workflows
• One platform – many programs
– Integration of existing MapReduce programs without
source code adaptations
– Create workflows using MapReduce, Apache Pig, R or
Unix command-line programs
• Runs in your browser
MapReduce Workflow Manager
Bioinformatics Workflows
• Requires a compatible cluster to execute workflows
– Small/medium-sized research institutes can hardly afford their own clusters
– Cloud computing: rent computer hardware from different
providers (e.g. Amazon, HP)
MapReduce Workflow Manager
Bioinformatics Workflows
• New MapReduce algorithms can be integrated easily
• Integration of existing MapReduce algorithms without
adaptations in source code
• Cloudgene uses its own workflow language
• Workflow Definition Language (WDL)
– Formal description of tasks and workflow steps
– Property-based and uses the YAML syntax
– Supports heterogeneous software components
(MapReduce, R and unix command-line programs)
– Basic workflow control patterns (loops and conditions)
• Example of a simple WDL manifest file
– Inputs are set by the user through the web interface
– Outputs are created by the workflow steps
• The user interface is created automatically from the manifest
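A minimal sketch of what such a manifest could look like; the exact keys and their spelling vary across Cloudgene versions, so take the names below as illustrative rather than authoritative:

```yaml
name: Lower Case Reads
version: 1.0.0
workflow:
  inputs:
    - id: input
      description: Input Reads (FASTQ)
      type: hdfs-folder
  steps:
    - name: Convert Reads
      jar: lowercase-reads.jar
      params: $input $output
  outputs:
    - id: output
      description: Converted Reads
      type: hdfs-folder
```

The `$input` and `$output` placeholders are resolved from the user's form values; the web form itself is generated from the `inputs` and `outputs` sections, which is why no UI code has to be written per tool.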
Workflow Execution Engine
1. Creates a dependency graph based on the WDL file and user input
2. Optimizes the graph to minimize the execution time (e.g. by caching)
3. Schedules and submits jobs to the Hadoop Cluster
4. Provides access to result files through the browser
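The first three steps can be pictured as a topological sort over the task graph with cached tasks pruned out; this is a simplified stand-in for the real engine, not Cloudgene code:

```python
from graphlib import TopologicalSorter  # Python 3.9+ standard library

# Tasks and their dependencies, as they might be derived from a
# WDL file plus user input (names are hypothetical):
# align depends on import; call on align; report on call.
graph = {
    "align": {"import"},
    "call": {"align"},
    "report": {"call"},
}

# A cached task (e.g. an unchanged data import) can be dropped
# before scheduling -- the optimizer's caching step in miniature.
cached = {"import"}
plan = [t for t in TopologicalSorter(graph).static_order() if t not in cached]
print(plan)  # ['align', 'call', 'report']
```

The resulting execution plan is then submitted job by job to the Hadoop cluster, each task starting only after its predecessors have finished.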
• Apache Hadoop MapReduce
• Apache Pig
• R (RMarkdown)
– Ideal to generate HTML files with charts, statistics, …
• Unix command line programs
– Cloudgene automatically exports all HDFS files
– No manual file staging between HDFS and POSIX filesystem
Advantage: Composition of
hybrid Workflows possible
• Authentication and User-Management
• Parameter Tracking
• HDFS Workspace
– Hides the HDFS filesystem from the end-user
– Importing Data from Amazon S3 Buckets,
HTTP and (S)FTP Servers, File Uploads, ...
– Facilitates the management of datasets on HDFS
Preview: Cloudgene 2.0
• Interface for web-services
– Same WDL file, but different interface
– User Registration
– Intelligent Queuing
– User Notification
Preview: Cloudgene 2.0
• Generic data analysis platform
– Integration of additional data processing models
– MapReduce, Spark, Giraph, …
• Virtual Machine
• Getting started
• Developer Guide
– Lukas Forer (@lukfor) and Sebastian Schoenherr
• Imputation with Minimac
– Goncalo Abecasis, Christian Fuchsberger
– Hansi Weißensteiner
• Univ.-Prof. Florian Kronenberg
– Head of the Division of Genetic Epidemiology,
Medical University of Innsbruck
Welcome everybody to the defense of my PhD thesis.
In the next 20 minutes I will give you an overview of the results and outcomes of my thesis.
The main topic is the efficient analysis of data in the field of bioinformatics.
NGS enables sequencing of the whole genome.
This is done in an extremely parallel way and enables sequencing the genome at low cost and high scale.
This has the consequence that more and more data will be produced.
So the bottleneck is no longer the data production in the lab, but its analysis.
This is because one experiment produces gigabytes of data.
Therefore, one single workstation is not sufficient for the data analysis, and supercomputers are often too expensive!
So one solution for that problem is to use commodity computing.
That means we take a large number of normal, cheap computing components and use them to perform our analysis in parallel.
And one approach which was developed specifically for that kind of infrastructure is MapReduce.
It is a parallelization framework developed by Google in 2004 and enables analyzing large data efficiently in parallel.
The user writes only the map and the reduce function, and the framework takes care of fault tolerance, data distribution and load balancing.
All the stuff we need in a parallel computing environment.
The map and reduce functions are stateless and can be executed in parallel, and therefore this approach scales very well!
Apache Hadoop is an open source implementation of MapReduce.
As this table shows, there already exist several MapReduce apps in the field of bioinformatics, and there is high potential.
For example, there are algorithms available for mapping short reads to a reference or for RNA analysis.
But the problem with such approaches is that they are available only on a per-tool basis.
In genetics we often need large workflows which consist of several steps to analyze data.
But those tools cover only one aspect of such a pipeline.
Moreover, for biologists without a background in computer science it is very hard to use them.
Most popular workflow systems such as Galaxy enable this abstraction only for traditional tools and not for MapReduce.
So a system which enables building such MapReduce workflows is missing.
So the aims of my thesis can be classified into two parts:
First, developing a system to compose complex workflows out of multiple MapReduce tools. This is done by abstracting all the technical details.
Second, evaluating this system by applying it to bioinformatics. For that reason I adapted three different workflows to MapReduce.
The first workflow is for genotype imputation, the second for genome-wide association studies and the last one detects copy number variations.
The first aim was solved by implementing a workflow execution engine called Cloudgene MapRed.
And on top of this I have integrated the three workflows.
Cloudgene MapRed requires a compatible cluster to execute the pipeline.
Especially for small research institutes it can be hard to afford and maintain their own cluster.
So a possible solution is cloud computing, which enables renting computer hardware from different providers, for example Amazon.
So they can use the rented resources on demand.
To overcome these issues, Sebastian developed in his thesis an infrastructure manager which enables launching and managing a Hadoop cluster through the browser.
So it is possible to run the same workflows on a local cluster, on a private cloud or on a public cloud.
This whole system is called Cloudgene, and in my presentation today I talk about the workflow execution engine and one of the three workflows.
And this workflow is called Imputation Server.
On this slide you can see the advantages of Cloudgene compared with the manual approach.
This workflow manager assists scientists in executing and monitoring workflows.
The core of the architecture is the execution engine.
As you can see in this picture, the workflow execution engine operates on a Hadoop cluster.
Therefore data reliability and fault tolerance are provided.
The workflow engine contains an optimizer which tries to minimize the execution time by using caching mechanisms.
Moreover, it contains a data manager for importing and exporting datasets.
The system has a REST API in order to communicate with clients.
In our case the client is a web application.
The workflow composition in Cloudgene was developed with two aims in mind:
First, it should be possible to implement new algorithms easily.
Second, it should be possible to integrate existing algorithms without source code adaptations.
For that reason, I developed a new workflow language which is called WDL and is used by Cloudgene.
It enables a formal description of workflows and their tasks.
It is property-based and uses a human-readable syntax.
It supports different software components as tasks.
And it supports some basic control patterns like conditions and loops.
Here is a very simple example of such a workflow written in WDL.
I don't want to go into too much detail, but you can define inputs and outputs and then reuse them in your tasks.
Based on this manifest file, we automatically create a user interface which can be used to submit the job with different parameters and datasets.
And when the user clicks on the submit button, the workflow engine comes into play.
We have the WDL manifest file with the workflow structure, and the user input which is used to execute it.
Based on this information a graph is created which contains all tasks and their dependencies.
Then the optimizer tries to minimize the graph by using caching.
And finally, based on this graph, a task execution plan is created which is used to submit the jobs to the cluster.
Once the job is submitted, we can monitor the progress.
When the job is complete, we can download the result files directly through the browser, and all used parameters are tracked.
Besides the Hadoop technologies, we also support other useful technologies.
For example, RMarkdown to create HTML reports.
Or any other Unix command-line program. In this case Cloudgene automatically exports files from HDFS to the local filesystem, so an intuitive combination of these technologies is possible.
The next step of this project is to turn Cloudgene into a more generic data analysis cloud platform.
Therefore we plan to integrate additional big data computation models so that Cloudgene is not limited to MapReduce.
One possibility is to integrate YARN, the resource-management layer introduced in the new version of Hadoop, which sits between the cluster and computation frameworks such as MapReduce.
So we can also support other models, for example for graph data processing and in-memory calculations.