L Forer - Cloudgene: an execution platform for MapReduce programs in public and private clouds

Presentation at BOSC2012 by L Forer - Cloudgene: an execution platform for MapReduce programs in public and private clouds


  1. Cloudgene: an execution platform for MapReduce programs in public and private clouds
     Lukas Forer, Sebastian Schönherr, Hansi Weißensteiner
     University of Innsbruck, Austria; Medical University Innsbruck, Austria
     BOSC 2012
  2. [Diagram labels: serial approach vs. parallel approach (MapReduce cluster); private and public cloud]
     How to support scientists when using (our) MapReduce programs?
     - Simplify the execution of MapReduce programs, including data management
     - Simplify access to a working MapReduce cluster
     - Maintain data sensitivity
     Reference: Dean & Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", 2004
  3. MapReduce in Genetics
     - CloudBurst: highly sensitive read mapping with MapReduce (Schatz, 2009)
     - Crossbow: Searching for SNPs with cloud computing (Langmead et al., 2009)
     - Myrna: Cloud-scale RNA-sequencing differential expression analysis with Myrna (Langmead et al., 2010)
     - Seal: a Distributed Short Read Mapping and Duplicate Removal Tool (Pireddu et al., 2012)
     - Hadoop-BAM: directly manipulating next generation sequencing data in the cloud (Niemenmaa et al., 2012)
     - CloudBioLinux: pre-configured and on-demand bioinformatics computing for the genomics community (Krampis et al., 2012)
  4. Difficulties with MapReduce
     - Additional steps when setting up a cluster in a public environment
     - Required steps once the cluster is up and running and Hadoop is installed
  5. Approaches
     Possible approaches:
     - Program-specific approach: implement a GUI for every program (redundant work for the developer, heterogeneity)
     - Workflow systems (Galaxy, Taverna, Mobyle): possible, but no HDFS support; black box
     Our approach for Hadoop MapReduce:
     - One GUI for different programs
     - Feedback, standardized import/export
     - Integration of programs via a plugin interface
  6. What is Cloudgene?
     - Open-source platform to improve the usability of Hadoop MapReduce jobs
     - Provides a graphical web interface for their execution
     - Programs can be integrated by writing a simple configuration file
     - Supports public and private clouds
     - Sets up a cluster in the cloud and installs all data on it
     - Keeps a history of executed jobs with their defined input/output parameters
     - Runs in your browser
     [Diagram labels: Myrna, CloudBurst, Seal, Crossbow, CloudBioLinux, Cloudgene]
  7. Cloudgene
  8. Features
     - Programs are easy to integrate:
       - standard MapReduce programs (Java, e.g. CloudBurst)
       - streaming jobs (e.g. Mapper and Reducer written in Perl, as in Myrna)
       - command-line programs (e.g. using Pydoop, as in Seal)
     - Data can be imported from different sources (S3, HTTP, FTP), including huge datasets
     - Results can be exported to S3 (public cloud)
     - Different MapReduce programs can be connected into a pipeline
     - Additional programs can be installed via a web repository
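The pipeline feature listed on this slide (one program's output feeding the next program) can be illustrated with a small sketch. This is only an illustration under stated assumptions, not Cloudgene's implementation: the `Step` class, the `chain` function, the jar names, and the directory layout are hypothetical, and the `-input`/`-output` flags are borrowed from the exome pre-processing command on slide 12 rather than being a convention every Hadoop program follows.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One MapReduce program in a pipeline (hypothetical structure)."""
    name: str
    jar: str


def chain(steps: List[Step], first_input: str, workdir: str) -> List[List[str]]:
    """Build one `hadoop jar` argument list per step.

    Each step reads the directory written by the previous step,
    so the programs form a linear pipeline.
    """
    commands = []
    current_input = first_input
    for index, step in enumerate(steps):
        output = f"{workdir}/{index:02d}-{step.name}"
        # Assumption: every program accepts -input and -output flags.
        commands.append(
            ["hadoop", "jar", step.jar, "-input", current_input, "-output", output]
        )
        current_input = output  # the next step consumes this output
    return commands


if __name__ == "__main__":
    pipeline = [Step("preprocess", "preprocess.jar"), Step("snps", "snps.jar")]
    for command in chain(pipeline, "rawReads", "work"):
        print(" ".join(command))
```

The commands are only constructed here, not executed; running them would require a working Hadoop installation.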
  9. Features
     Cloudgene can be used on private and public clusters:
     - Sensitive data, local data: private cloud
     - Data on S3, no in-house cluster available: public cloud
     Open source
  10. Summary
  11. Cloudgene in Action
      How to integrate a new program in Cloudgene:
      1. Implement the program (or use an existing one)
      2. Write the plugin configuration file
  12. Cloudgene in Action
      Step 1: Implement a program that is executable via the command line
      Example: FastQ pre-processing with MapReduce (base quality, sequence quality, duplication levels, length distribution)
        hadoop jar exomePreprocessing.jar -input exomeData -step baseJob -encoding 0 -output resultsOutput
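A graphical front end for such a program has to turn named parameter values into a command line like the one on this slide. The sketch below shows one way to do that; it is a hypothetical helper (`build_hadoop_command` is not a Cloudgene function), and it assumes that every parameter is passed as a `-name value` pair, which matches the slide's example but is not guaranteed for other programs.

```python
import shlex
from typing import Dict, List


def build_hadoop_command(jar: str, params: Dict[str, str]) -> List[str]:
    """Turn named parameters into a `hadoop jar` argument list.

    Assumption: each parameter is passed as `-name value`.
    """
    command = ["hadoop", "jar", jar]
    for name, value in params.items():
        command += [f"-{name}", str(value)]
    return command


# Parameter names and values are taken from the slide's example command.
command = build_hadoop_command(
    "exomePreprocessing.jar",
    {
        "input": "exomeData",
        "step": "baseJob",
        "encoding": "0",
        "output": "resultsOutput",
    },
)
print(shlex.join(command))
```

Keeping the command as a list of arguments, rather than one string, avoids quoting problems if the list is later handed to `subprocess.run`.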
  13. Cloudgene in Action
      Step 2: Write the configuration file, which has 3 parts
      Part 1: General information (configuration listing not included in this transcript)
  14. Cloudgene in Action
      Step 2: Write the configuration file, which has 3 parts
      Part 2: Public cloud information (configuration listing not included in this transcript)
  15. Cloudgene in Action
      Step 2: Write the configuration file, which has 3 parts
      Part 3: MapReduce information (configuration listing not included in this transcript)
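Since the configuration listings themselves are missing from the transcript, the sketch below only illustrates the three-part idea (general information, public cloud information, MapReduce information). Every key, value, and function name in it is hypothetical and should not be read as Cloudgene's real configuration format; the parameter template simply reuses the command from slide 12.

```python
import shlex
from string import Template
from typing import Any, Dict, List

# Hypothetical plugin description. All keys and values are illustrative
# assumptions, not Cloudgene's actual configuration schema.
plugin_config: Dict[str, Any] = {
    "general": {"name": "Exome Preprocessing", "version": "1.0"},
    "cluster": {"instance_type": "example-type", "nodes": 4},
    "mapred": {
        "jar": "exomePreprocessing.jar",
        "params": "-input $input -step baseJob -encoding 0 -output $output",
        "inputs": ["input"],
        "outputs": ["output"],
    },
}

REQUIRED_PARTS = ("general", "cluster", "mapred")


def missing_parts(config: Dict[str, Any]) -> List[str]:
    """Return the names of required parts that the config lacks."""
    return [part for part in REQUIRED_PARTS if part not in config]


def render_command(config: Dict[str, Any], values: Dict[str, str]) -> List[str]:
    """Fill the parameter template with user-supplied values."""
    mapred = config["mapred"]
    arguments = Template(mapred["params"]).substitute(values)
    return ["hadoop", "jar", mapred["jar"], *shlex.split(arguments)]


if __name__ == "__main__":
    print(missing_parts(plugin_config))
    print(render_command(plugin_config, {"input": "exomeData", "output": "resultsOutput"}))
```

The point of the design is that the declared inputs and outputs are enough to generate a form, while the template determines how the form values reach the program.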
  16. Cloudgene in Action
  17. Cloudgene in Action
  18. Cloudgene in Action
  19. Cloudgene in Action
  20. Cloudgene in Action: different application, different GUI
  21. Technologies
      - Apache Hadoop: http://hadoop.apache.org
      - Apache Whirr: http://whirr.apache.org
      - Restlet: http://www.restlet.org
      - ExtJS: http://www.sencha.com
      - H2: http://www.h2database.com
  22. Evaluation
      [Chart comparing Cloudgene and Amazon EMR; y-axis 0 to 4000 sec; segments: Setup, Import, Calculation, Export]
      Amazon Elastic MapReduce (EMR):
      - Graphical execution for MapReduce programs
      - Excellent solution for public clouds
      - Combination with S3
      - But: data sensitivity, reproducibility, additional costs
  23. Integrated programs
      - Wordcount, Grep, etc.
      - [Image: CloudBurst logo]
      - Exome Preprocessing
      - Finding SNPs
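Wordcount, the first program listed on this slide, is the usual introductory MapReduce example. The sketch below shows the map, shuffle, and reduce phases in a single Python process. It is an illustration of the programming model only: it does not use Hadoop, and the function names are chosen for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework would between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word: str, counts: List[int]) -> Tuple[str, int]:
    """Sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce sequentially over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["the quick fox", "the lazy dog"]))
```

On a real cluster the map calls run in parallel on different parts of the input and the grouping step is handled by the framework, which is where the speed-up over the serial approach comes from.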
  24. Acknowledgements
      - Sebastian Schönherr, Lukas Forer, Hansi Weissensteiner
      - Anita Kloss-Brandstätter, Florian Kronenberg, Günther Specht
      - Thanks to the Open Source Community
      Project website: http://cloudgene.uibk.ac.at
      Source code: http://github.com/genepi
