L Forer - Cloudgene: an execution platform for MapReduce programs in public and private clouds

Jan Aerts, Assistant Professor at Leuven University
Cloudgene - an execution platform for MapReduce programs in public and private clouds

Lukas Forer, Sebastian Schönherr, Hansi Weißensteiner

University of Innsbruck, Austria
Medical University Innsbruck, Austria

BOSC 2012
MapReduce
    [Diagram: serial approach vs. parallel approach; cluster; cloud (private, public)]

       How to support scientists when using (our) MapReduce programs?
           Simplify the execution of MapReduce programs, including data management
           Simplify access to a working MapReduce cluster
           Respect data sensitivity




2
                      MapReduce: Simplified Data Processing on Large Clusters - Dean & Ghemawat - 2004
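
To make the serial versus parallel contrast concrete, here is a minimal single-process sketch of the MapReduce pattern in Python, using word counting as the example. It is not Hadoop code; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions and only show how the work splits into independent map calls and per-key reduce calls.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Each map call depends only on its own line, so a framework such as
    # Hadoop can run these calls on different machines in parallel.
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["ACGT ACGT TTGA", "ttga acgt"]))
```

In this sketch everything runs in one process; the point is only that neither the map calls nor the per-key reduce calls share state, which is what allows a cluster to distribute them.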
MapReduce in Genetics
    CloudBurst
           highly sensitive read mapping with MapReduce; Schatz, 2009
    Crossbow
           Searching for SNPs with cloud computing; Langmead et al., 2009
    Myrna
           Cloud-scale RNA-sequencing differential expression analysis with Myrna; Langmead et al., 2010
    Seal
           a distributed short read mapping and duplicate removal tool; Pireddu et al., 2011
    Hadoop-BAM
           directly manipulating next generation sequencing data in the cloud; Niemenmaa et al., 2012
    CloudBioLinux
           pre-configured and on-demand bioinformatics computing for the genomics community; Krampis et al., 2012

3
Difficulties with MapReduce


    Additional steps when setting up a cluster in a public environment

    Required steps when the cluster is up and running and Hadoop is installed




4
Approaches
    Possible approaches
      Program-specific approach
         Implement a GUI for every program
         Redundant work for the developer
         Heterogeneity

      Workflow systems
         Galaxy, Taverna, Mobyle
         Possible, but no HDFS support; MapReduce programs run as a black box

    Our approach for Hadoop MapReduce
         One GUI for different programs
         Feedback, Standardized Import/Export
         Integration of programs via a plugin interface
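
The idea of one GUI serving different programs through a plugin interface can be sketched as a small registry: each program contributes a description of its parameters, and a generic front end builds its form from that description. The following Python sketch is an illustration only. The classes `Parameter` and `Plugin`, the field kinds, and the registry are hypothetical and do not reproduce Cloudgene's actual plugin format.

```python
from dataclasses import dataclass, field


@dataclass
class Parameter:
    name: str    # identifier handed to the program
    label: str   # text shown next to the form field
    kind: str    # hypothetical field type: "text", "number" or "hdfs-folder"


@dataclass
class Plugin:
    name: str
    parameters: list = field(default_factory=list)


# Hypothetical registry mapping a program name to its description.
REGISTRY = {}


def register(plugin):
    """Add a program description to the registry."""
    REGISTRY[plugin.name] = plugin


def render_form(plugin_name):
    """Describe the form a generic GUI could build for one registered program."""
    plugin = REGISTRY[plugin_name]
    return [f"{p.label} [{p.kind}]" for p in plugin.parameters]


register(Plugin("wordcount", [Parameter("input", "Input folder", "hdfs-folder")]))
register(Plugin("grep", [
    Parameter("input", "Input folder", "hdfs-folder"),
    Parameter("pattern", "Search pattern", "text"),
]))

if __name__ == "__main__":
    for name in REGISTRY:
        print(name, render_form(name))
```

The design choice illustrated here is that the GUI code never changes when a program is added; only a new description is registered.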

5
What is Cloudgene?
    Open-source platform to improve the usability of Hadoop
    MapReduce jobs
       Provides a graphical web interface for their execution
       Programs can be integrated by writing a simple configuration file
       Public cloud & private cloud
          Sets up a cluster in the cloud and installs all data on it
       History of executed jobs with defined input/output parameters
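
A job history with defined input and output parameters could be kept in a small relational table. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in for whatever database the platform uses; the table name `job`, its columns, and the helper functions are assumptions made for illustration and are not taken from Cloudgene.

```python
import json
import sqlite3


def open_history(path=":memory:"):
    """Create the (hypothetical) job history table if it does not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS job ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " program TEXT NOT NULL,"
        " inputs TEXT NOT NULL,"
        " outputs TEXT NOT NULL)"
    )
    return conn


def record_job(conn, program, inputs, outputs):
    """Store one executed job together with its input and output parameters."""
    cur = conn.execute(
        "INSERT INTO job (program, inputs, outputs) VALUES (?, ?, ?)",
        (program, json.dumps(inputs), json.dumps(outputs)),
    )
    conn.commit()
    return cur.lastrowid


def list_jobs(conn):
    """Return all recorded jobs, oldest first, with parameters decoded."""
    rows = conn.execute("SELECT id, program, inputs, outputs FROM job ORDER BY id")
    return [(i, p, json.loads(a), json.loads(b)) for i, p, a, b in rows]
```

Storing the parameters alongside each run is what makes a job repeatable later with the same settings.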


    Runs in your browser

    [Diagram: Myrna, CloudBurst, Seal, Crossbow, CloudBioLinux; Cloudgene]
6
Cloudgene




7
Features
    Programs can be integrated easily
       standard MapReduce programs (Java, e.g. CloudBurst)
       streaming jobs (e.g. Mapper and Reducer written in Perl, as in Myrna)
       command line programs (e.g. using Pydoop, as in Seal)

    Data can be imported from different sources
       S3 / HTTP / FTP
       Import of huge datasets
       Export results to S3 (public cloud)

    Connect different MapReduce programs into a pipeline
    Install additional programs via a web repository
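
Importing from S3, HTTP or FTP implies that the platform first has to decide which kind of source a given location is. A minimal Python sketch of that dispatch step, based on the URL scheme, might look as follows. The function name `classify_source` and the decision to treat `https` as HTTP are assumptions for illustration, not Cloudgene behaviour.

```python
from urllib.parse import urlparse

# Assumed mapping from URL scheme to the import sources named on the slide.
SUPPORTED_SCHEMES = {"s3": "S3", "http": "HTTP", "https": "HTTP", "ftp": "FTP"}


def classify_source(url):
    """Return which import source a URL belongs to, or raise ValueError."""
    scheme = urlparse(url).scheme.lower()
    try:
        return SUPPORTED_SCHEMES[scheme]
    except KeyError:
        raise ValueError(f"unsupported import source: {url!r}") from None


if __name__ == "__main__":
    for url in ("s3://example-bucket/reads.fastq", "ftp://example.org/reads.fastq"):
        print(url, "->", classify_source(url))
```

A real importer would follow this step with a source-specific download into HDFS; that part is omitted here.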
8
Features

    Cloudgene can be used on private and public clusters
       Private cloud: sensitive data, local data
       Public cloud: data on S3, no in-house cluster available

    Open source


9
Summary




10
Cloudgene in Action




     How to integrate a new program in Cloudgene
       1. Implement the program (or use existing)
       2. Write plugin configuration file




11
Cloudgene in Action



     Step 1 - Implement a program, executable via the command line


     e.g. FastQ pre-processing with MapReduce
          base quality / sequence quality / duplication levels / length distribution

          hadoop jar exomePreprocessing.jar -input exomeData -step baseJob -encoding 0 -output resultsOutput
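
Because the program is driven entirely by command line arguments, a front end only needs to turn user-supplied values into an argument list. The Python sketch below assembles the command shown above from a dictionary of parameters. The helper `build_hadoop_command` is a hypothetical name, and the sketch assumes every parameter is passed as `-key value`, as in the example.

```python
import shlex


def build_hadoop_command(jar, params):
    """Build the argument list for `hadoop jar <jar> -key value ...`."""
    argv = ["hadoop", "jar", jar]
    for key, value in params.items():
        # Each parameter becomes a "-key" flag followed by its value.
        argv += [f"-{key}", str(value)]
    return argv


# Parameter values taken from the example command on the slide.
argv = build_hadoop_command(
    "exomePreprocessing.jar",
    {"input": "exomeData", "step": "baseJob", "encoding": 0, "output": "resultsOutput"},
)

if __name__ == "__main__":
    print(shlex.join(argv))
```

On a machine with Hadoop installed, the resulting list could be handed to `subprocess.run(argv, check=True)`; passing a list rather than a single string avoids shell quoting problems with user input.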




12
Cloudgene in Action



     Step 2 - Write configuration file including 3 parts


     Part 1 – General information:




13
Cloudgene in Action



     Step 2 - Write configuration file including 3 parts


     Part 2 – Public cloud information:




14
Cloudgene in Action



     Step 2 - Write configuration file including 3 parts


     Part 3 – MapReduce information:




15
Cloudgene in Action




16
Cloudgene in Action




17
Cloudgene in Action




18
Cloudgene in Action




19
Cloudgene in Action

     Different application – different GUI




20
Technologies
     Apache Hadoop
          http://hadoop.apache.org
     Apache Whirr
          http://whirr.apache.org
     Restlet
          http://www.restlet.org
     ExtJS
          http://www.sencha.com
     H2
          http://www.h2database.com



21
Evaluation

     Amazon Elastic MapReduce (EMR)
       Graphical execution for MapReduce programs
       Excellent solution for public clouds
           Combination with S3
     but
           Data sensitivity
           Reproducibility
           Additional costs

     [Chart: stacked bars for Cloudgene and Amazon EMR; y-axis 0 to 4000 sec; segments: Setup, Import, Calculation, Export]




22
Integrated programs


 Wordcount, Grep, etc.
 CloudBurst [logo]
 In-house: Exome Preprocessing, Finding SNPs
23
Acknowledgements



Sebastian Schönherr, Lukas Forer, Hansi Weißensteiner
Anita Kloss-Brandstätter, Florian Kronenberg, Günther Specht
Thanks to the Open Source Community

Project website: http://cloudgene.uibk.ac.at
Source code: http://github.com/genepi




24
