L Forer - Cloudgene: an execution platform for MapReduce programs in public and private clouds
1. Cloudgene - an execution platform for MapReduce programs in public and private clouds
Lukas Forer, Sebastian Schönherr, Hansi Weißensteiner
University of Innsbruck, Austria
Medical University Innsbruck, Austria
BOSC 2012
2. MapReduce
[Figure: serial approach vs. parallel approach; cluster; cloud (private, public)]
How can we support scientists who use (our) MapReduce programs?
Simplify the execution of MapReduce programs, including data management
Simplify access to a working MapReduce cluster
Keep sensitive data protected
MapReduce: Simplified Data Processing on Large Clusters - Dean & Ghemawat - 2004
3. MapReduce in Genetics
CloudBurst
highly sensitive read mapping with MapReduce; Schatz, 2009
Crossbow
Searching for SNPs with cloud computing; Langmead et al., 2009
Myrna
Cloud-scale RNA-sequencing differential expression analysis with Myrna; Langmead et al., 2010
Seal
a Distributed Short Read Mapping and Duplicate Removal Tool; Pireddu et al., 2012
Hadoop BAM
directly manipulating next generation sequencing data in the cloud; Niemenmaa et al., 2012
CloudBioLinux
CloudBioLinux: pre-configured and on-demand bioinformatics computing for the genomics community; Krampis et al., 2012
4. Difficulties with MapReduce
Additional steps when setting up a cluster in a public environment
Required steps once the cluster is up and running and Hadoop is installed
5. Approaches
Possible approaches
  Program-specific approach
    Implement a GUI for every program
    Redundant work for the developer
    Heterogeneity
  Workflow systems
    Galaxy, Taverna, Mobyle
    Possible, but no HDFS support; black box
Our approach for Hadoop MapReduce
  One GUI for different programs
  Feedback, standardized import/export
  Integration of programs via a plugin interface
6. What is Cloudgene?
Open-source platform to improve the usability of Hadoop MapReduce jobs
Provides a graphical web interface for their execution
Programs can be integrated by writing a simple configuration file
Public cloud & private cloud
Sets up a cluster in the cloud and installs all data on it
History of executed jobs with defined input/output parameters
Runs in your browser
[Figure labels: Myrna, CloudBurst, Seal, Crossbow, CloudBioLinux, Cloudgene]
8. Features
Programs are easy to integrate
  Standard MapReduce programs (Java -> CloudBurst)
  Streaming jobs (e.g. mapper and reducer written in Perl -> Myrna)
  Command-line programs (e.g. using Pydoop -> Seal)
Data can be imported from different sources
  S3 / HTTP / FTP
  Import of huge datasets
Export results to S3 (public cloud)
Connect different MapReduce programs into a pipeline
Install additional programs via a web repository
9. Features
Cloudgene can be used on private and public clusters
  Private cloud: sensitive data, local data
  Public cloud: data on S3, no in-house cluster available
Open source
11. Cloudgene in Action
How to integrate a new program into Cloudgene
1. Implement the program (or use an existing one)
2. Write a plugin configuration file
12. Cloudgene in Action
Step 1 - Implement a program, executable via the command line
e.g. FastQ pre-processing with MapReduce
base quality / sequence quality / duplication levels / length distribution
hadoop jar exomePreprocessing.jar -input exomeData \
    -step baseJob -encoding 0 -output resultsOutput
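To make the MapReduce framing of this pre-processing step concrete, here is a minimal in-process sketch of one of the listed metrics, the read length distribution. It is not the code inside exomePreprocessing.jar: the function names and the sample reads are illustrative assumptions, and a real job would run the map and reduce phases on a Hadoop cluster rather than in a single Python process.

```python
from collections import defaultdict
from itertools import islice


def parse_fastq(lines):
    """Yield (read_id, sequence, quality) from FASTQ text, four lines per record."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if len(record) < 4:
            return  # end of input (or a truncated trailing record)
        header, sequence, _plus, quality = (field.strip() for field in record)
        yield header[1:], sequence, quality


def map_length(record):
    """Map phase: emit one (read length, 1) pair per read."""
    _read_id, sequence, _quality = record
    yield len(sequence), 1


def reduce_counts(pairs):
    """Shuffle and reduce phase: sum the emitted counts per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def length_distribution(lines):
    """Run the map phase over all reads, then reduce the emitted pairs."""
    pairs = (pair for record in parse_fastq(lines) for pair in map_length(record))
    return reduce_counts(pairs)


# Made-up sample reads, for illustration only.
SAMPLE_FASTQ = """@read1
ACGT
+
IIII
@read2
ACGTAC
+
IIIIII
@read3
TTGA
+
IIII
""".splitlines()

if __name__ == "__main__":
    print(length_distribution(SAMPLE_FASTQ))
```

The other metrics on the slide (base quality, sequence quality, duplication levels) would follow the same pattern with a different key and value emitted by the map function.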
13. Cloudgene in Action
Step 2 - Write configuration file including 3 parts
Part 1 – General information:
14. Cloudgene in Action
Step 2 - Write configuration file including 3 parts
Part 2 – Public cloud information:
15. Cloudgene in Action
Step 2 - Write configuration file including 3 parts
Part 3 – MapReduce information:
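The three parts of the configuration file can be pictured with a small sketch. The configuration shown on the slides is not reproduced in this text, so every field name below (general, cluster, mapred, params and so on) and every value other than the Step 1 command-line arguments is a hypothetical placeholder, not Cloudgene's actual format. The sketch only illustrates how a descriptor with a general part, a public-cloud part and a MapReduce part could be turned into the Step 1 command line.

```python
# Hypothetical plugin descriptor with the three parts named on the slides.
# Field names are assumptions for illustration, not Cloudgene's real schema.
PLUGIN = {
    "general": {"name": "FastQ pre-processing", "version": "0.0-example"},
    "cluster": {"provider": "example-public-cloud", "nodes": 2},
    "mapred": {
        "jar": "exomePreprocessing.jar",
        # Ordered (flag, parameter name) pairs; values are supplied by the user.
        "params": [
            ("-input", "input"),
            ("-step", "step"),
            ("-encoding", "encoding"),
            ("-output", "output"),
        ],
    },
}


def build_command(plugin, values):
    """Return the `hadoop jar` argument list for a descriptor and user values."""
    mapred = plugin["mapred"]
    command = ["hadoop", "jar", mapred["jar"]]
    for flag, name in mapred["params"]:
        if name not in values:
            raise KeyError(f"missing value for parameter {name!r}")
        command += [flag, str(values[name])]
    return command


if __name__ == "__main__":
    user_values = {
        "input": "exomeData",
        "step": "baseJob",
        "encoding": 0,
        "output": "resultsOutput",
    }
    print(" ".join(build_command(PLUGIN, user_values)))
```

Keeping the parameter list in the descriptor is what would let one generic interface serve different programs: the platform only needs to collect a value for each declared parameter and assemble the command.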