Delivering Bioinformatics MapReduce Applications in the Cloud
Lukas Forer*, Tomislav Lipić**, Sebastian Schönherr*, Hansi Weißensteiner*,
Davor Davidović**, Florian Kronenberg* and Enis Afgan**
* Division of Genetic Epidemiology, Medical University of Innsbruck, Austria
** Center for Informatics and Computing, Ruđer Bošković Institute, Zagreb, Croatia
2. Page 2
21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr
Introduction
• Cooperation between
– Division of Genetic Epidemiology, Innsbruck,
Austria
– Ruđer Bošković Institute (RBI), Zagreb, Croatia
• Project aim: developing a bioinformatics platform for the cloud
3. Page 3
Motivation: Bioinformatics
• More and more data is produced
• Next-Generation Sequencing (NGS)
– Allows us to sequence the complete human
genome (3.2 billion positions)
– Decreasing costs → increasing data production
Bottleneck is no longer the data production in the
laboratory, but the analysis!
4. Page 4
Motivation: Next Generation Sequencing
• NGS Data Characteristics
– Data size in terabyte scale
– Independent text data rows
– Batch processing required for analysis
• MapReduce
– One possible candidate for batch processing
– Scalable approach to manage and process large data efficiently
“High potential for MapReduce in Genomics”
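To make the model concrete, here is a minimal sketch of MapReduce applied to NGS-like data: counting k-mers across independent reads. This is an illustrative toy (plain Python, sequential), not any of the tools discussed; in Hadoop the map and reduce functions below would run in parallel over data chunks.

```python
# Illustrative sketch of the MapReduce model on NGS-like data: counting
# k-mers over a set of independent reads. Map emits (key, 1) pairs;
# reduce sums the counts per key. In Hadoop these phases run in parallel.
from collections import defaultdict

K = 3  # k-mer length (toy value)

def map_read(read):
    """Map phase: emit a (k-mer, 1) pair for every k-mer in one read."""
    for i in range(len(read) - K + 1):
        yield read[i:i + K], 1

def reduce_kmers(pairs):
    """Reduce phase: sum the counts per k-mer key."""
    counts = defaultdict(int)
    for kmer, n in pairs:
        counts[kmer] += n
    return dict(counts)

reads = ["GATTACA", "TTACAGA"]  # independent data rows, as on the slide
pairs = (p for read in reads for p in map_read(read))
counts = reduce_kmers(pairs)
print(counts["TTA"])  # "TTA" occurs once in each read, so this prints 2
```

Because each read is processed independently, the map phase scales out over as many nodes as are available, which is exactly what makes the model attractive for terabyte-scale NGS data.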
5. Page 5
MapReduce in Bioinformatics
• Hadoop MapReduce libraries for bioinformatics
– Hadoop-BAM: manipulation of aligned next-generation sequencing data (supports BAM, SAM, FASTQ, FASTA, QSEQ, BCF, and VCF)
– SeqPig: processing NGS data with Apache Pig; provides UDFs for frequent tasks; uses Hadoop-BAM
– BioPig: processing NGS data with Apache Pig; provides UDFs
– Biodoop: MapReduce suite for sequence alignment and manipulation of aligned records; written in Python
• DNA alignment algorithms based on Hadoop
– CloudBurst: based on RMAP (seed-and-extend algorithm); Map: extracts k-mers of the reference and non-overlapping k-mers of the reads (as keys); Reduce: end-to-end alignment of seeds
– Seal: based on BWA (version 0.5.9); Map: alignment using BWA (on a previously created internal file format); Reduce: duplicate removal (optional)
– Crossbow: based on Bowtie and SOAPsnp; Map: executes Bowtie on chunks; Reduce: SNP calling using SOAPsnp
• RNA analysis based on Hadoop
– MyRNA: pipeline for calculating differential gene expression in RNA; includes Bowtie
– FX: RNA-Seq analysis tool
– Eoulsan: RNA-Seq analysis tool
• Non-Hadoop-based approaches
– GATK: MapReduce-like framework including a rich set of tools for quality assurance, alignment, and variant calling; not based on Hadoop MapReduce
6. Page 6
Problem 1: Complex Analysis Pipelines
• Bioinformatics MapReduce Applications
– available only on a per-tool basis
– cover one aspect of a larger data analysis
pipeline
– Hard to use for scientists without background
in Computer Science
Needed: a system that enables building MapReduce workflows
7. Page 7
Cloudgene: Overview
• Cloudgene
– MapReduce-based workflow system
– assists scientists in executing and monitoring workflows graphically
– data management (import/export)
– tracks workflow parameters → reproducibility
• Supports Apache’s big-data stack
S. Schönherr, L. Forer, H. Weißensteiner, F. Kronenberg, G. Specht, and A. Kloss-Brandstätter. Cloudgene: a graphical execution platform
for MapReduce programs on private and public clouds. BMC Bioinformatics, 13(1):200, Jan. 2012.
8. Page 8
Cloudgene: Workflow Composition
• Reuse existing Apache Hadoop programs
• No adaptations in source code needed
• All metadata about tasks and workflows is defined in a single file
– the manifest file
[Diagram: existing MapReduce program + manifest file → integration into Cloudgene]
9. Page 9
Cloudgene: Workflow Composition
• Manifest file defines the workflow
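As a sketch of the idea, a manifest wrapping an existing Hadoop jar might look roughly like this. The key names and values below are illustrative assumptions, not the exact Cloudgene schema; consult the Cloudgene documentation for the real format.

```yaml
# Hypothetical manifest sketch: wraps an existing Hadoop jar as a Cloudgene
# step with typed inputs and outputs (key names are illustrative).
name: cloudburst
version: 1.0
mapred:
  steps:
    - name: Align reads
      jar: cloudburst.jar
      params: $reads $reference $output
  inputs:
    - id: reads
      description: Read sequences
      type: hdfs-folder
    - id: reference
      description: Reference genome
      type: hdfs-folder
  outputs:
    - id: output
      description: Alignments
      type: hdfs-folder
```

From a file like this, Cloudgene can render the input form, launch the jar with the bound parameters, and track the outputs, without touching the tool's source code.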
10. Page 10
Cloudgene: Interface
[Screenshot: job results page showing the used parameters and download links to the result files]
11. Page 11
Problem 2: Missing Infrastructure
• Cloudgene requires a functional, compatible cluster
– Small- and medium-sized research institutes can hardly afford their own clusters
• Possible Solution: Cloud Computing
– Rent computer hardware from different providers (e.g.
AWS, HP)
– Use resources and services on demand
Needed: a system that enables delivering MapReduce clusters in the cloud
12. Page 12
CloudMan: Overview
• Enables launching/managing an analysis platform on a cloud infrastructure via a web browser
– Delivers a scalable cluster-in-the-cloud
• Preconfigures a number of applications
– Workflow System: Galaxy
– Batch scheduler: Sun Grid Engine (SGE)
– MapReduce using Hadoop and SGE
integration
Afgan E, Chapman B, Taylor J. CloudMan as a platform for tool, data, and analysis distribution. BMC Bioinformatics, 2012.
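CloudMan reads its configuration from the instance "user data" supplied at launch. The sketch below composes such a user-data string in Python; the key names follow common CloudMan conventions but should be checked against the current CloudMan documentation, and all values are hypothetical. The actual launch call (shown commented out, boto 2 style) requires cloud credentials and is outside this sketch.

```python
# Sketch: composing the user-data string a CloudMan instance expects at boot.
# Key names follow CloudMan conventions (assumption; verify against the docs);
# all values here are placeholders.

def cloudman_user_data(cluster_name, password, access_key, secret_key):
    """Build a minimal CloudMan user-data document as a YAML-style string."""
    return (
        f"cluster_name: {cluster_name}\n"
        f"password: {password}\n"
        f"access_key: {access_key}\n"
        f"secret_key: {secret_key}\n"
    )

user_data = cloudman_user_data("demo-cluster", "choose-a-password",
                               "YOUR-ACCESS-KEY", "YOUR-SECRET-KEY")

# With boto 2 this would then be passed when starting the CloudMan image, e.g.:
# import boto.ec2
# conn = boto.ec2.connect_to_region("us-east-1", ...)
# conn.run_instances("ami-...", instance_type="m1.large", user_data=user_data)
```

The web wizard on the next slides does exactly this behind the scenes: it collects the cluster name, password, and credentials and hands them to the new instance as user data.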
13. Page 13
Configure the Cluster
14. Page 14
Manage the Cluster
15. Page 15
Use the Cluster
16. Page 16
CloudMan-as-a-Platform
• CloudMan implements a service dependency
management framework
• Enables creation of user-specific cloud platforms
Idea: Implementing Cloudgene as a CloudMan
service
17. Page 17
Cloudgene meets CloudMan
Cloudgene
A Bioinformatics
MapReduce workflow
framework
CloudMan
A cloud manager for
delivering application
execution environments
[Diagram: the CloudMan platform runs on clouds such as AWS, OpenStack, Eucalyptus, and OpenNebula; provides job managers (SGE, Hadoop, HTCondor, ...); and hosts Galaxy for traditional batch/cluster jobs and Cloudgene for MapReduce tools]
18. Page 18
New Opportunities
• One single data analysis cloud platform
– MapReduce and SGE integrated
– Additionally, more big-data computational models can be supported in the future
19. Page 19
Conclusion
• MapReduce
– simple but effective programming model
– well suited for the Cloud
– still limited to a small number of highly qualified domain experts
• Cloudgene
– as a possible MapReduce Workflow Manager
• CloudMan
– as a possible Cloud Manager
• Implementing Cloudgene as a CloudMan service
– Enables domain experts to incorporate and make use of
additional big data models
20. Page 20
Acknowledgements
• CloudMan
– Enis Afgan
– Tomislav Lipić
– Davor Davidović
• Cloudgene
– Lukas Forer
– Sebastian Schönherr
– Hansi Weißensteiner
– Florian Kronenberg
Editor's Notes
Welcome to my presentation.
I am happy to be here and to have the opportunity to talk about our cooperation project, which is called „Delivering Bioinformatics MapReduce Applications in the Cloud”.
The solution I present is a cooperation between our institute at the Medical University of Innsbruck and the Ruđer Bošković Institute in Zagreb.
The aim of the project is to develop a bioinformatics platform for the cloud; this work is a feasibility study.
I start my talk with a short introduction to the data we use. The data is produced by a technology called NGS, which enables sequencing the whole genome.
This is done in a massively parallel way and makes it possible to sequence a genome in adequate time and at adequate cost.
The consequence is that more and more data is being produced.
So the bottleneck is no longer the data production in the lab, but its analysis.
The size of the data from NGS devices is in the terabytes, and it consists of independent, semi-structured data rows.
And to analyze this data we need batch processing.
One possible candidate for analyzing such large data efficiently is MapReduce.
So there is a high potential for MapReduce in genomics.
This table shows that several MapReduce applications already exist in this field,
for example for mapping short reads to a reference, or for RNA analysis.
But the problem with such approaches is that they are available only on a per-tool basis.
To analyze data we need large pipelines, and the available tools cover only one aspect.
Moreover, for biologists without a background in computer science they are very hard to use.
Most popular workflow systems, such as Galaxy, provide this abstraction only for traditional tools and not for MapReduce.
So a system that enables building such MapReduce workflows is missing.
To overcome this issue we developed Cloudgene, a MapReduce-based workflow system,
which assists scientists in executing and monitoring workflows.
Moreover, we provide simple data management in order to import and export datasets.
Finally, all parameters are tracked, which improves the reproducibility of experiments.
At the moment we support the whole Apache big-data stack, that means MapReduce, Pig, and HDFS.
To build workflows, existing Hadoop programs can be reused, and no adaptation of the source code is needed.
All metadata is defined in a single manifest file.
For example, here we have an existing tool called CloudBurst, and here we have the corresponding manifest file that enables integrating it into Cloudgene.
Based on this manifest file, we automatically create a user interface which can be used to submit the job with different parameters and datasets.
When the job is complete, we can download the result files directly through the browser, and all used parameters are tracked.
A workflow manager like Cloudgene requires a compatible cluster.
Especially for small research institutes it can be hard to afford and maintain their own cluster.
So a possible solution is cloud computing, which enables renting computer hardware from different providers, for example Amazon.
They can then use the rented resources on demand.
But the problem here is a missing system that enables delivering MapReduce clusters in the cloud.
To overcome this issue, our cooperation partners from Croatia developed CloudMan, which solves this problem.
It enables launching and managing an analysis platform on a cloud infrastructure.
So a scalable cluster-in-the-cloud can be delivered through the browser.
Moreover, CloudMan preconfigures this new cluster with a lot of applications,
for example Galaxy, a batch scheduler, MapReduce, or HTCondor.
In a first step, the user configures the cluster through a simple web wizard.
For example, the type of the cluster can be selected, or the size of the data volume.
In a second step, the user can manage the cluster.
That means he or she can add new nodes, remove nodes, or finally terminate the cluster.
Moreover, CloudMan provides a very nice feature that enables automatic scaling.
After that, CloudMan installs the required applications on the new cluster.
For example, in this case Galaxy was automatically installed and can be used by the user to execute non-Hadoop workflows.
CloudMan implements a service dependency management framework which separates different modules.
That means the infrastructure layer is separated from the application execution environment and from the applications and the datasets.
This feature makes it possible to swap these modules for alternative services and implementations in order to create a user-specific cloud platform.
So we can implement Cloudgene as a CloudMan service in order to provide a MapReduce bioinformatics platform in the cloud.
The overall picture is to use CloudMan as the cloud manager to create a cluster with the needed environment,
and then to use Cloudgene on top of this cluster to execute and manage MapReduce workflows.
This results in a new platform which provides several tools for bioinformatics use cases.
So we have the user, who uses Cloudgene or Galaxy on top of the clusters created by CloudMan.
Moreover, through a cloud abstraction, this platform can be instantiated on different cloud providers.
So we have one single data analysis platform which integrates MapReduce and SGE.
At the moment Cloudgene uses only MapReduce.
But Hadoop 2 introduced a new layer between MapReduce and Hadoop. This layer is called YARN; it is a job manager that enables integrating other parallelization models like Spark or Giraph.
This enables us to use other big-data computational models.
Moreover, the platform already integrates several bioinformatics workflows.
To conclude my presentation, we saw that MapReduce is a simple but effective programming model.
It is well suited for the cloud, but still limited to a small number of domain experts with knowledge in computer science.
We presented Cloudgene as a possible MapReduce workflow manager and CloudMan as a possible cloud manager.
Moreover, the implementation of Cloudgene as a CloudMan service enables scientists to make use of several workflows based on big-data models.
CloudMan was developed by Enis, Tomislav, and Davor from Croatia,
and Cloudgene was developed by Sebastian, Hansi, Florian, and me.
Now I am at the end of my presentation, so thank you very much for your attention.