L Forer - Cloudgene: an execution platform for MapReduce programs in public and private clouds

Presentation at BOSC2012 by L Forer - Cloudgene: an execution platform for MapReduce programs in public and private clouds

Transcript

  • 1. Cloudgene: an execution platform for MapReduce programs in public and private clouds. Lukas Forer, Sebastian Schönherr, Hansi Weißensteiner. University of Innsbruck, Austria; Medical University Innsbruck, Austria. BOSC 2012
  • 2. How to support scientists when using (our) MapReduce programs?
      • Simplify the execution of MapReduce programs, including data management
      • Simplify access to a working MapReduce cluster
      • Maintain data sensitivity
      • Diagram labels: serial approach; parallel approach; MapReduce cluster; cloud (private, public)
      • Reference: Dean & Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", 2004
  • 3. MapReduce in Genetics
      • CloudBurst: highly sensitive read mapping with MapReduce (Schatz, 2009)
      • Crossbow: searching for SNPs with cloud computing (Langmead et al., 2009)
      • Myrna: cloud-scale RNA-sequencing differential expression analysis with Myrna (Langmead et al., 2010)
      • Seal: a distributed short read mapping and duplicate removal tool (Pireddu et al., 2012)
      • Hadoop BAM: directly manipulating next generation sequencing data in the cloud (Matti Niemenmaa et al., 2012)
      • CloudBioLinux: pre-configured and on-demand bioinformatics computing for the genomics community (Krampis et al., 2012)
  • 4. Difficulties with MapReduce
      • Additional steps when setting up a cluster in a public environment
      • Required steps once the cluster is up and running and Hadoop is installed
  • 5. Approaches
      • Program-specific approach: implement a GUI for every program; redundant work for the developer; heterogeneity
      • Workflow systems (Galaxy, Taverna, Mobyle): possible, but no HDFS support; black box
      • Our approach for Hadoop MapReduce: one GUI for different programs; feedback; standardized import/export; integration of programs via a plugin interface
  • 6. What is Cloudgene?
      • Open-source platform to improve the usability of Hadoop MapReduce jobs
      • Provides a graphical web interface for their execution
      • Programs can be integrated by writing a simple configuration file
      • Works with public and private clouds
      • Sets up a cluster in the cloud and installs all data on it
      • Keeps a history of executed jobs with defined input/output parameters
      • Runs in your browser
      • Diagram labels: Myrna, CloudBurst, Seal, Crossbow, CloudBioLinux, Cloudgene
  • 7. Cloudgene
  • 8. Features
      • Programs are easy to integrate: standard MapReduce programs (Java -> CloudBurst); streaming jobs (e.g. Mapper and Reducer using Perl -> Myrna); command line programs (e.g. using Pydoop -> Seal)
      • Data can be imported from different sources: S3 / HTTP / FTP
      • Import of huge datasets
      • Export results to S3 (public cloud)
      • Connect different MapReduce programs to a pipeline
      • Install additional programs via a web repository
  • 9. Features: Cloudgene can be used on private and public clusters
      • Sensitive data, local data: private cloud
      • Data on S3, no in-house cluster available: public cloud
      • Open source
  • 10. Summary
  • 11. Cloudgene in Action: how to integrate a new program in Cloudgene
      • Step 1: implement the program (or use an existing one)
      • Step 2: write the plugin configuration file
  • 12. Cloudgene in Action. Step 1: implement a program that is executable via the command line, e.g. FastQ pre-processing with MapReduce (base quality / sequence quality / duplication levels / length distribution):
      hadoop jar exomePreprocessing.jar -input exomeData -step baseJob -encoding 0 -output resultsOutput
  • 13. Cloudgene in Action. Step 2: write the configuration file, which has 3 parts. Part 1: general information
  • 14. Cloudgene in Action. Step 2: write the configuration file, which has 3 parts. Part 2: public cloud information
  • 15. Cloudgene in Action. Step 2: write the configuration file, which has 3 parts. Part 3: MapReduce information
  • 16. Cloudgene in Action
  • 17. Cloudgene in Action
  • 18. Cloudgene in Action
  • 19. Cloudgene in Action
  • 20. Cloudgene in Action: different application, different GUI
  • 21. Technologies
      • Apache Hadoop: http://hadoop.apache.org
      • Apache Whirr: http://whirr.apache.org
      • Restlet: http://www.restlet.org
      • ExtJS: http://www.sencha.com
      • H2: http://www.h2database.com
  • 22. Evaluation: Amazon Elastic MapReduce (EMR)
      • Graphical execution for MapReduce programs
      • Excellent solution for public clouds
      • Combination with S3
      • But: data sensitivity, reproducibility, additional costs
      • [Chart: Cloudgene vs. Amazon EMR; time axis from 0 sec to 4000 sec; phases: Setup, Import, Calculation, Export]
  • 23. Integrated programs
      • Wordcount, Grep, etc.
      • CloudBurst [logo image from the cloudburst-bio SourceForge wiki]
      • Exome Preprocessing
      • Finding SNPs
  • 24. Acknowledgements
      • Sebastian Schönherr, Lukas Forer, Hansi Weissensteiner, Anita Kloss-Brandstätter, Florian Kronenberg, Günther Specht
      • Project website: http://cloudgene.uibk.ac.at
      • Source code: http://github.com/genepi
      • Thanks to the open source community
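
Slide 12 shows the one concrete command line in the talk. As a minimal sketch of how a wrapper might assemble and launch that command, the Python below builds the argument list from the values on the slide. The function names `build_hadoop_command` and `run_job` are hypothetical and are not part of Cloudgene; the flags are copied verbatim from the slide, and their meaning is defined by the exome pre-processing program itself.

```python
import shlex
import shutil
import subprocess


def build_hadoop_command(jar, input_path, step, encoding, output_path):
    """Return the argument list for the command line shown on slide 12.

    Hypothetical helper: the flag names come from the slide, nothing else
    is assumed about the program behind them.
    """
    return [
        "hadoop", "jar", jar,
        "-input", input_path,
        "-step", step,
        "-encoding", str(encoding),
        "-output", output_path,
    ]


def run_job(args):
    """Run the job if a `hadoop` executable is on PATH, else return None."""
    if shutil.which(args[0]) is None:
        return None  # no Hadoop installation available on this machine
    return subprocess.run(args, check=False).returncode


command = build_hadoop_command(
    "exomePreprocessing.jar", "exomeData", "baseJob", 0, "resultsOutput"
)
# Print the command exactly as it would be typed in a shell.
print(shlex.join(command))
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when dataset names contain spaces.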
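
Slide 8 lists streaming jobs as one kind of program that can be integrated, and slide 23 names Wordcount among the integrated programs. The sketch below illustrates the streaming pattern with a word count in Python. It is an assumption-laden illustration, not code from Cloudgene or Myrna (which the slide says uses Perl): in a real Hadoop Streaming job the mapper and reducer would be separate scripts reading standard input and writing standard output, while here they are plain generators and `simulate` is a hypothetical stand-in for the sort that Hadoop performs between the two phases.

```python
from itertools import groupby


def mapper(lines):
    """Emit one tab-separated "word<TAB>1" record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Sum the counts of consecutive records that share the same word.

    Like a streaming reducer, this relies on the input being sorted so
    that all records for one word are adjacent.
    """
    pairs = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def simulate(lines):
    """Local stand-in for map, sort, reduce (hypothetical helper)."""
    return list(reducer(sorted(mapper(lines))))


result = simulate(["a b a", "b c"])
print(result)
```

Keeping the mapper and reducer as functions over iterables makes them easy to check locally before wrapping each one in a script for a cluster run.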