Delivering Bioinformatics MapReduce Applications in the Cloud

  1. Delivering Bioinformatics MapReduce Applications in the Cloud
     Lukas Forer*, Tomislav Lipić**, Sebastian Schönherr*, Hansi Weißensteiner*, Davor Davidović**, Florian Kronenberg*, and Enis Afgan**
     * Division of Genetic Epidemiology, Medical University of Innsbruck, Austria
     ** Center for Informatics and Computing, Ruđer Bošković Institute, Zagreb, Croatia
  2. Introduction (21 May 2014, DC VIS - Distributed Computing, Visualization and Biomedical Engineering, www.mipro.hr)
     • Cooperation between the Division of Genetic Epidemiology, Innsbruck, Austria, and the Ruđer Bošković Institute (RBI), Zagreb, Croatia
     • Aim of the project: developing a bioinformatics platform for the cloud
  3. Motivation: Bioinformatics
     • More and more data is being produced
     • Next-Generation Sequencing (NGS) allows us to sequence the complete human genome (3.2 billion positions)
     • Decreasing costs → increasing data production
     The bottleneck is no longer data production in the laboratory, but the analysis!
  4. Motivation: Next-Generation Sequencing
     • NGS data characteristics: data size at terabyte scale; independent text data rows; batch processing required for analysis files
     • MapReduce: one possible candidate for batch processing; a scalable approach to manage and process large data efficiently
     "High potential for MapReduce in genomics"
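The batch-processing fit can be made concrete with a toy example. The sketch below (plain Python imitating the Hadoop Streaming map and reduce phases; the k-mer length and the toy reads are illustrative assumptions, not taken from the slides) counts k-mers across reads - exploiting exactly the independent-rows property the slide highlights:

```python
from itertools import groupby

K = 4  # illustrative k-mer length, not taken from the slides

def mapper(reads):
    """Emit (k-mer, 1) pairs. Each read is processed independently,
    which is what makes this step trivially parallelizable."""
    for read in reads:
        for i in range(len(read) - K + 1):
            yield read[i:i + K], 1

def reducer(pairs):
    """Sum the counts per k-mer, mirroring a Hadoop reduce step
    (sort by key, then aggregate each key group)."""
    for kmer, group in groupby(sorted(pairs), key=lambda p: p[0]):
        yield kmer, sum(count for _, count in group)

if __name__ == "__main__":
    reads = ["ACGTACGT", "ACGTTTTT"]  # toy reads
    for kmer, count in reducer(mapper(reads)):
        print(kmer, count)
```

In a real Hadoop deployment the mapper and reducer would run as separate streaming processes over HDFS-resident read files; the local pipeline above only illustrates the dataflow.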
  5. MapReduce in Bioinformatics
     Hadoop MapReduce libraries for bioinformatics:
     • Hadoop-BAM: manipulation of aligned next-generation sequencing data (supports BAM, SAM, FASTQ, FASTA, QSEQ, BCF, and VCF)
     • SeqPig: processing NGS data with Apache Pig; presents UDFs for frequent tasks; uses Hadoop-BAM
     • BioPig: processing NGS data with Apache Pig; presents UDFs
     • Biodoop: MapReduce suite for sequence alignments / manipulation of aligned records; written in Python
     DNA - alignment algorithms based on Hadoop:
     • CloudBurst: based on RMAP (seed-and-extend algorithm); map: extracts k-mers of the reference and non-overlapping k-mers of the reads (as keys); reduce: end-to-end alignments of seeds
     • Seal: based on BWA (version 0.5.9); map: alignment using BWA (on a previously created internal file format); reduce: remove duplicates (optional)
     • Crossbow: based on Bowtie / SOAPsnp; map: executes Bowtie on chunks; reduce: SNP calling using SOAPsnp
     RNA - analysis based on Hadoop:
     • MyRNA: pipeline for calculating differential gene expression in RNA; includes Bowtie
     • FX: RNA-Seq analysis tool
     • Eoulsan: RNA-Seq analysis tool
     Non-Hadoop-based approaches:
     • GATK: MapReduce-like framework including a rich set of tools for quality assurance, alignment, and variant calling; not based on Hadoop MapReduce
  6. Problem 1: Complex Analysis Pipelines
     • Bioinformatics MapReduce applications are available only on a per-tool basis, cover one aspect of a larger data analysis pipeline, and are hard to use for scientists without a background in computer science
     Needed: a system which enables building MapReduce workflows
  7. Cloudgene: Overview
     • Cloudgene is a MapReduce-based workflow system that assists scientists in executing and monitoring workflows graphically, provides data management (import/export), and tracks workflow parameters → reproducibility
     • Supports Apache's big-data stack
     S. Schönherr, L. Forer, H. Weißensteiner, F. Kronenberg, G. Specht, and A. Kloss-Brandstätter. Cloudgene: a graphical execution platform for MapReduce programs on private and public clouds. BMC Bioinformatics, 13(1):200, Jan. 2012.
  8. Cloudgene: Workflow Composition
     • Reuses existing Apache Hadoop programs
     • No adaptations of the source code needed
     • All metadata about tasks and workflows is defined in one single file - the manifest file
     (Figure: a MapReduce program and its manifest file for Cloudgene)
  9. Cloudgene: Workflow Composition
     • The manifest file defines the workflow
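To give a feel for what such a manifest contains, here is a hypothetical sketch for integrating CloudBurst (the example tool named in the speaker notes). The field names and layout are illustrative assumptions and may differ from the actual Cloudgene manifest schema:

```yaml
# Hypothetical Cloudgene manifest sketch - field names are illustrative
# and may not match the real Cloudgene schema.
name: cloudburst
description: Highly sensitive short-read mapping with MapReduce
version: 1.0
mapred:
  steps:
    - name: Align reads
      jar: cloudburst.jar
      params: $reference $reads $output
  inputs:
    - id: reference
      description: Reference genome (HDFS path)
      type: hdfs-file
    - id: reads
      description: Short reads (HDFS path)
      type: hdfs-file
  outputs:
    - id: output
      description: Alignment results
      type: hdfs-folder
```

The key point from the slide is that all of this metadata lives in one declarative file, so the Hadoop program itself needs no source-code changes.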
  10. Cloudgene: Interface
     • Shows the used parameters and download links to the result files
  11. Problem 2: Missing Infrastructure
     • Cloudgene requires a functional, compatible cluster; small/medium-sized research institutes can hardly afford their own clusters
     • Possible solution: cloud computing - rent computer hardware from different providers (e.g. AWS, HP) and use resources and services on demand
     Needed: a system which enables delivering MapReduce clusters in the cloud
  12. CloudMan: Overview
     • Enables launching/managing an analysis platform on a cloud infrastructure via a web browser - delivers a scalable cluster-in-the-cloud
     • Preconfigures a number of applications: workflow system (Galaxy), batch scheduler (Sun Grid Engine, SGE), MapReduce using Hadoop and SGE integration
     Afgan E, Chapman B, Taylor J. CloudMan as a platform for tool, data, and analysis distribution. BMC Bioinformatics 2012.
  13. Configure the Cluster
  14. Manage the Cluster
  15. Use the Cluster
  16. CloudMan-as-a-Platform
     • CloudMan implements a service dependency management framework
     • Enables the creation of user-specific cloud platforms
     Idea: implement Cloudgene as a CloudMan service
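The core idea of a service dependency management framework can be sketched in a few lines: each service declares which services must be running first, and a topological sort yields a valid startup order. This is a minimal illustration of the concept, not CloudMan's actual code; the dependency graph below is an assumption loosely mirroring the services named on the slides (Cloudgene depending on Hadoop is the new service this talk proposes):

```python
from graphlib import TopologicalSorter  # Python 3.9+ standard library

# Illustrative dependency graph: service -> set of services it requires.
# The names echo the slides; the edges are assumptions for the sketch.
dependencies = {
    "filesystem": set(),
    "database": {"filesystem"},
    "sge": {"filesystem"},
    "hadoop": {"filesystem", "sge"},   # slides: Hadoop with SGE integration
    "galaxy": {"database", "sge"},
    "cloudgene": {"hadoop"},           # the proposed new CloudMan service
}

def startup_order(deps):
    """Return a start order in which every service comes
    after all of its declared dependencies."""
    return list(TopologicalSorter(deps).static_order())

if __name__ == "__main__":
    print(startup_order(dependencies))
```

Because each module only names its dependencies, any module can be swapped for an alternative implementation without touching the rest of the graph - which is what makes user-specific platforms (such as adding Cloudgene) possible.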
  17. Cloudgene meets CloudMan
     • Cloudgene: a bioinformatics MapReduce workflow framework
     • CloudMan: a cloud manager for delivering application execution environments
     • CloudMan platform: runs on AWS, OpenStack, OpenNebula, and Eucalyptus; provides SGE, Hadoop, HTCondor, ...; hosts Cloudgene (MapReduce tools) and Galaxy (traditional tools, batch cluster jobs)
  18. New Opportunities
     • One single data analysis cloud platform with MapReduce and SGE integrated
     • Additionally, more big-data computational models can be supported in the future
  19. Conclusion
     • MapReduce: a simple but effective programming model, well suited for the cloud, but still limited to a small number of highly qualified domain experts
     • Cloudgene as a possible MapReduce workflow manager
     • CloudMan as a possible cloud manager
     • Implementing Cloudgene as a CloudMan service enables domain experts to incorporate and make use of additional big-data models
  20. Acknowledgements
     • CloudMan: Enis Afgan, Tomislav Lipić, Davor Davidović
     • Cloudgene: Lukas Forer, Sebastian Schönherr, Hansi Weißensteiner, Florian Kronenberg

Editor's Notes

  • Welcome to my presentation. I am happy to be here and to have the opportunity to talk about our cooperation project, which is called "Delivering Bioinformatics MapReduce Applications in the Cloud".
  • The solution which I present is a cooperation between our institute at the Medical University of Innsbruck and the Ruđer Bošković Institute in Zagreb.

    The aim of the project is to develop a bioinformatics platform for the cloud.

    It is a feasibility study.
  • I start my talk with a short introduction about the data we use. The data is produced by a technology called NGS, which makes it possible to sequence the whole genome.

    This is done in a massively parallel way, so the genome can be sequenced in adequate time and at adequate cost.

    The consequence is that more and more data will be produced.

    So the bottleneck is no longer the data production in the lab, but its analysis.
  • The size of such data from NGS devices is in the terabytes, and it consists of independent, semi-structured data rows.

    To analyze this data we need batch processing.

    One possible candidate to analyze this large data efficiently is MapReduce.

    So there is a high potential for MapReduce in genomics.
  • This table shows that several MapReduce applications already exist in the field of bioinformatics.

    For example, for mapping short reads to a reference or for RNA analysis.
  • But the problem with such approaches is that they are available only on a per-tool basis.

    To analyze data we need large pipelines, and the available tools cover only one aspect.

    Moreover, for biologists without a background in computer science they are very hard to use.

    Most popular workflow systems, such as Galaxy, enable this abstraction only for traditional tools and not for MapReduce.

    So a system which enables building such MapReduce workflows is missing.
  • To overcome this issue we developed Cloudgene, a MapReduce-based workflow system

    which assists scientists in executing and monitoring workflows.

    Moreover, we provide simple data management in order to import and export datasets.

    Finally, all parameters are tracked, which improves the reproducibility of experiments.

    At the moment we support the whole Apache big-data stack, that means MapReduce, Pig, and HDFS.
  • To build workflows, existing Hadoop programs can be reused and no adaptation of the source code is needed.

    All metadata is defined in a single manifest file.

    For example, here we have an existing tool called CloudBurst, and here we have the corresponding manifest file which enables its integration into Cloudgene.
  • Based on this manifest file, we automatically create a user interface which can be used to submit the job with different parameters and datasets.
  • When the job is complete, we can download the result files directly through the browser, and all used parameters are tracked.
  • A workflow manager like Cloudgene requires a compatible cluster.

    Especially for small research institutes it can be hard to afford and maintain their own cluster.

    A possible solution is cloud computing, which makes it possible to rent computer hardware from different providers, for example Amazon.

    So they can use the rented resources on demand.

    But the problem here is the missing system which enables delivering MapReduce clusters in the cloud.
  • To overcome this issue, our cooperation partners from Croatia developed CloudMan, which solves this problem.

    It enables launching and managing an analysis platform on a cloud infrastructure.

    So through the browser a scalable cluster-in-the-cloud can be delivered.

    Moreover, CloudMan preconfigures this new cluster with a number of applications.

    For example Galaxy, a batch scheduler, MapReduce, or HTCondor.
  • So in a first step, the user configures the cluster through a simple web wizard.

    For example, the type of the cluster or the size of the data volume can be selected.
  • In a second step, the users can manage their cluster.

    That means she or he can add new nodes, remove nodes, or finally terminate the cluster.

    Moreover, CloudMan provides a very nice feature which enables automatic scaling.
  • After that, CloudMan installs the required applications on the new cluster.

    For example, in this case Galaxy was automatically installed and can be used to execute non-Hadoop workflows.
  • CloudMan implements a service dependency management framework which separates different modules.

    That means the infrastructure layer is separated from the application execution environment and from the applications and the datasets.

    This feature makes it possible to swap these modules with alternative services and implementations in order to create a user-specific cloud platform.

    So we can implement Cloudgene as a CloudMan service in order to provide a MapReduce bioinformatics platform in the cloud.
  • So the overall picture is to use CloudMan as the cloud manager to create a cluster with the needed environment.

    Then we use Cloudgene on top of this cluster to execute and manage MapReduce workflows.

    This results in a new platform which provides several tools for bioinformatics use cases.

    So we have the user, who uses Cloudgene or Galaxy on top of the clusters created by CloudMan.

    Moreover, through a cloud abstraction this platform can be instantiated on different cloud providers.
  • So we have one single data analysis platform which integrates MapReduce and SGE.

    At the moment Cloudgene uses only MapReduce.

    But Hadoop 2 introduced a new layer between MapReduce and HDFS. This layer is called YARN; it is a job manager which makes it possible to integrate other parallelization models like Spark or Giraph.

    This enables us to use other big-data computational models.

    Moreover, the platform already integrates several bioinformatics workflows.
  • To conclude my presentation, we saw that MapReduce is a simple but effective programming model.

    It is well suited for the cloud but still limited to a small number of domain experts with knowledge in computer science.

    We presented Cloudgene as a possible MapReduce workflow manager and CloudMan as a possible cloud manager.

    Moreover, the implementation of Cloudgene as a CloudMan service enables scientists to make use of several workflows based on big-data models.
  • CloudMan was developed by Enis, Tomislav, and Davor from Croatia.

    And Cloudgene was developed by Sebastian, Hansi, Florian, and me.

    Now I am at the end of my presentation, so thank you very much for your attention.