L Forer - Cloudgene: an execution platform for MapReduce programs in public and private clouds

Presentation at BOSC 2012 by L Forer - Cloudgene: an execution platform for MapReduce programs in public and private clouds


  1. Cloudgene - an execution platform for MapReduce programs in public and private clouds
     - Lukas Forer, Sebastian Schönherr, Hansi Weißensteiner
     - University of Innsbruck, Austria; Medical University Innsbruck, Austria
     - BOSC 2012
  2. MapReduce
     - Diagram labels: serial approach vs. parallel approach on a cluster; cloud (private, public)
     - How to support scientists when using (our) MapReduce programs?
       - Simplify the execution of MapReduce programs, including data management
       - Simplify access to a working MapReduce cluster
       - Maintain data sensitivity
     - Reference: Dean & Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", 2004
  3. MapReduce in Genetics
     - CloudBurst: Highly sensitive read mapping with MapReduce; Schatz, 2009
     - Crossbow: Searching for SNPs with cloud computing; Langmead et al., 2009
     - Myrna: Cloud-scale RNA-sequencing differential expression analysis with Myrna; Langmead et al., 2010
     - Seal: A distributed short read mapping and duplicate removal tool; Pireddu et al., 2012
     - Hadoop-BAM: Directly manipulating next generation sequencing data in the cloud; Niemenmaa et al., 2012
     - CloudBioLinux: Pre-configured and on-demand bioinformatics computing for the genomics community; Krampis et al., 2012
  4. Difficulties with MapReduce
     - Additional steps when setting up a cluster in a public environment
     - Required steps once the cluster is up and running and Hadoop is installed
  5. Approaches
     - Program-specific approach
       - Implement a GUI for every program
       - Redundant work for the developer
       - Heterogeneity
     - Workflow systems
       - Galaxy, Taverna, Mobyle
       - Possible, but no HDFS support; black box
     - Our approach for Hadoop MapReduce
       - One GUI for different programs
       - Feedback, standardized import/export
       - Integration of programs via a plugin interface
  6. What is Cloudgene?
     - Open-source platform to improve the usability of Hadoop MapReduce jobs
     - Provides a graphical web interface for their execution
     - Programs can be integrated by writing a simple configuration file
     - Public cloud & private cloud
     - Sets up a cluster in the cloud and installs all data on it
     - History of executed jobs with defined input/output parameters
     - Runs in your browser
     - Diagram labels: Myrna, CloudBurst, Seal, Crossbow, CloudBioLinux, Cloudgene
  7. Cloudgene
  8. Features
     - Programs are easy to integrate
       - Standard MapReduce programs (Java -> CloudBurst)
       - Streaming jobs (e.g. Mapper and Reducer using Perl -> Myrna)
       - Command line programs (e.g. using Pydoop -> Seal)
     - Data can be imported from different sources
       - S3 / HTTP / FTP
       - Import of huge datasets
     - Export results to S3 (public cloud)
     - Connect different MapReduce programs to a pipeline
     - Install additional programs via a web repository
  9. Features
     - Cloudgene can be used on private and public clusters
       - Sensitive data, local data: private cloud
       - Data on S3, no in-house cluster available: public cloud
     - Open source
  10. Summary
  11. Cloudgene in Action
      - How to integrate a new program in Cloudgene
        1. Implement the program (or use an existing one)
        2. Write the plugin configuration file
  12. Cloudgene in Action
      - Step 1: Implement a program that is executable via the command line
      - Example: FastQ pre-processing with MapReduce (base quality / sequence quality / duplication levels / length distribution)
      - Command: hadoop jar exomePreprocessing.jar -input exomeData -step baseJob -encoding 0 -output resultsOutput
  13. Cloudgene in Action
      - Step 2: Write a configuration file with 3 parts
      - Part 1: General information
  14. Cloudgene in Action
      - Step 2: Write a configuration file with 3 parts
      - Part 2: Public cloud information
  15. Cloudgene in Action
      - Step 2: Write a configuration file with 3 parts
      - Part 3: MapReduce information
  16. Cloudgene in Action
  17. Cloudgene in Action
  18. Cloudgene in Action
  19. Cloudgene in Action
  20. Cloudgene in Action
      - Different application, different GUI
  21. Technologies
      - Apache Hadoop: http://hadoop.apache.org
      - Apache Whirr: http://whirr.apache.org
      - Restlet: http://www.restlet.org
      - ExtJS: http://www.sencha.com
      - H2: http://www.h2database.com
  22. Evaluation
      - Chart: runtime in seconds (axis from 0 to 4000 sec), split into Setup, Import, Calculation, and Export, for Cloudgene vs. Amazon EMR
      - Amazon Elastic MapReduce (EMR)
        - Graphical execution for MapReduce programs
        - Excellent solution for public clouds
        - Combination with S3
        - But: data sensitivity, reproducibility, additional costs
  23. Integrated programs
      - Wordcount, Grep, etc.
      - CloudBurst (logo image)
      - Exome Preprocessing
      - Finding SNPs
  24. Acknowledgements
      - Sebastian Schönherr, Lukas Forer, Hansi Weissensteiner
      - Anita Kloss-Brandstätter, Florian Kronenberg, Günther Specht
      - Project website: http://cloudgene.uibk.ac.at
      - Source code: http://github.com/genepi
      - Thanks to the Open Source Community
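
Slide 2 contrasts a serial approach with the parallel MapReduce approach, and slide 23 lists Wordcount among the integrated programs. As a rough illustration of the programming model only, the sketch below runs the map, shuffle, and reduce steps of a word count in memory on a single machine. It is not Hadoop code and not taken from the talk; all function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on an in-memory input."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["map reduce map", "reduce cluster"]))
```

On a real cluster the map calls run in parallel on different input splits and the framework performs the shuffle; here the three steps are simply called one after another so the data flow is visible.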

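
Step 1 on slide 12 requires a program that can be started from the command line with its inputs and outputs passed as parameters. The sketch below shows one way such a command could be assembled from named parameters, reusing the jar name and flags from slide 12. The helper `build_hadoop_command` is hypothetical and is not part of Cloudgene or Hadoop; the code only builds the argument list and does not run Hadoop.

```python
import shlex


def build_hadoop_command(jar, **options):
    """Return the argument list for `hadoop jar <jar> -name value ...`.

    Hypothetical helper: each keyword argument becomes one `-name value`
    pair, in the order in which the arguments were given.
    """
    command = ["hadoop", "jar", jar]
    for name, value in options.items():
        command += [f"-{name}", str(value)]
    return command


if __name__ == "__main__":
    cmd = build_hadoop_command(
        "exomePreprocessing.jar",
        input="exomeData",
        step="baseJob",
        encoding=0,
        output="resultsOutput",
    )
    # Print the command as a single shell-quoted string.
    print(shlex.join(cmd))
```

Keeping the command as a list of arguments rather than one string avoids quoting problems if a path contains spaces; the list could be handed to `subprocess.run` on a machine where Hadoop is installed.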