Successfully reported this slideshow.

Cloudflow – A Framework for MapReduce Pipeline Development in Biomedical Research

1

Share

Upcoming SlideShare
Taylor bosc2010
Taylor bosc2010
Loading in …3
×
1 of 17
1 of 17

Cloudflow – A Framework for MapReduce Pipeline Development in Biomedical Research

1

Share

Cloudflow is a MapReduce pipeline framework, which is based on a similar concept as JavaFlume or Apache Crunch. In contrast to these existing approaches, Cloudflow was developed to simplify the pipeline creation in biomedical research, especially in the field of Genetics. For that purpose Cloudflow supports a variety of NGS data formats and contains a rich collection of built-in operations for analyzing such kind of datasets (e.g. quality checks, mapping reads or variation calling).

Cloudflow is a MapReduce pipeline framework, which is based on a similar concept as JavaFlume or Apache Crunch. In contrast to these existing approaches, Cloudflow was developed to simplify the pipeline creation in biomedical research, especially in the field of Genetics. For that purpose Cloudflow supports a variety of NGS data formats and contains a rich collection of built-in operations for analyzing such kind of datasets (e.g. quality checks, mapping reads or variation calling).

More Related Content

Related Books

Free with a 14 day trial from Scribd

See all

Cloudflow – A Framework for MapReduce Pipeline Development in Biomedical Research

  1. 1. Cloudflow – A Framework for MapReduce Pipeline Development in Biomedical Research Lukas Forer1,2, Enis Afgan3,4, Hansi Weißensteiner1,2, Davor Davidović3, Günther Specht2, Florian Kronenberg1, Sebastian Schönherr1,2 1 Division of Genetic Epidemiology, Medical University of Innsbruck, Innsbruck, Austria 2 Institute of Computer Science, Research Group Databases and Information Systems, Innsbruck, Austria 3 Center for Informatics and Computing, Ruđer Bošković Institute, Zagreb, Croatia 4 Department of Biology, Johns Hopkins University, Baltimore MD, USA
  2. 2. Page 2 28 May, 2015 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Introduction • Cooperation between – Division of Genetic Epidemiology, Innsbruck, Austria – Ruđer Bošković Institute (RBI), Zagreb, Croatia • Project • Aim: Developing a Bioinformatics Platform for the Cloud
  3. 3. Page 3 28 May, 2015 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Motivation: Bioinformatics • More and more data is produced • Next-Generation Sequencing (NGS) – Allows us to sequence the complete human genome (3.2 billion positions) – Decreasing Costs  Increasing Data Production Bottleneck is no longer the data production in the laboratory, but the analysis! Bottleneck is no longer the data production in the laboratory, but the analysis!
  4. 4. Page 4 28 May, 2015 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Motivation: Next Generation Sequencing • NGS Data Characteristics – Data size in terabyte scale – Independent text data rows – Batch processing required for analysis files • MapReduce – One possible candidate for batch processing – scalable approach to manage and process large data efficiently “High potential for MapReduce in Genomics”“High potential for MapReduce in Genomics”
  5. 5. Page 5 28 May, 2015 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Challenges using MapReduce • writing a MapReduce job can be a challenging task – Hard to write – Hard to maintain – Hard to test • high-level languages on top of Hadoop (e.g. PIG) – allow writing MapReduce jobs in form of queries – SeqPig or BioPig provide scripts for NGS data – Not developed to combine different scripts to a pipelines Prevents Bioinformatics from using MapReducePrevents Bioinformatics from using MapReduce
  6. 6. Page 6 28 May, 2015 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Cloudflow • a MapReduce pipeline framework • Provides a rich and highly abstracted Java API • Comparable with FlumeJava and Apache Crunch • For Big Data in Genetics – provides a several built-in operations for NGS • Cloudflow pipelines are easy to integrate into graphical execution engines like Cloudgene
  7. 7. Page 7 28 May, 2015 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Main Concept • Main concept is to break complex data analysis steps into basic operations – Transform Operation – Summarize Operation • Operations are independent from Hadoop MapReduce Easy to write, reuse and test!Easy to write, reuse and test!
  8. 8. Page 8 28 May, 2015 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Cloudflow: Operations • Transform Operation • Example TransformTransform 0..n Records0..n Records1 Record1 Record Split by WordSplit by Word1 Sentence1 Sentence 1..n Words1..n Words
  9. 9. Page 9 28 May, 2015 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Cloudflow: Operations • Summarize Operation • Example SummarizeSummarize Grouped Records Grouped Records 0..n Records0..n Records CountCountWordsWords FrequenceFrequence
  10. 10. Page 10 28 May, 2015 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Cloudflow: Pipelines • Pipeline • Example TransformTransform SummarizeSummarizeTransformTransform TransformTransform SentenceSentence CountCount Split by Word Split by Word FrequenceFrequenceFilterFilter
  11. 11. Page 11 28 May, 2015 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Cloudflow: Execution • Execute the same pipeline – On your local development machine (testdata) – On a remote Hadoop Cluster (with big-data) SentenceSentence CountCount Split by Word Split by Word FrequenceFrequenceFilterFilter
  12. 12. Page 12 28 May, 2015 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Cloudflow: Execution as MapReduce jobs
  13. 13. Page 13 28 May, 2015 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Supported Data Formats and Operations
  14. 14. Page 14 28 May, 2015 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Deploying Pipelines as a Service Cloudgene A Bioinformatics MapReduce workflow framework Cloudgene A Bioinformatics MapReduce workflow framework CloudMan A cloud manager for delivering application execution environments CloudMan A cloud manager for delivering application execution environments Cloudflow A Framework for MapReduce Pipeline Development in Biomedical Research Cloudflow A Framework for MapReduce Pipeline Development in Biomedical Research
  15. 15. Page 15 28 May, 2015 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Evaluation
  16. 16. Page 16 28 May, 2015 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Conclusion • MapReduce – simple but effective programming model – well suited for the Cloud – still limited to a small number of highly qualified domain experts • Cloudflow – a MapReduce pipeline framework – Provides a rich and highly abstracted Java API – Comparable with FlumeJava and Apache Crunch – For Big Data in Genetics – https://github.com/genepi/cloudflow
  17. 17. Page 17 28 May, 2015 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Acknowledge • CloudMan – Enis Afgan – Davor Davidović – Tomislav Lipić • Cloudgene and Cloudflow – Lukas Forer – Sebastian Schönherr – Hansi Weißensteiner – Florian Kronenberg

Editor's Notes

  • The solution which i present is a cooperation between our intstitute at the med uni innsbruck and the university of zagreb.
    The aim of the project is to develop a bioinformatics platform for the cloud
    feasibility study
  • I start my talk with a short introduction about the data we use. The data is produced by a technology called NGS which enables to sequence the whole genome.
    This is done in a massively parallel way and enables to sequnce the genome in adequate time and with adequate coses.
    This has the consequence that more and more data will produce.
    So the bottleneck is no longer the data production in the lab, but ist analysis.
  • the sizife of such data from NGS devices are terabytes and they are independet semi-structed data rows.
    And to analyze this data we need batch processing.
    One possible candidate to anaylze this large data efficiently is mapreduce.
    So there is a high potential for mapreduce in genomics.
  • To conclude my presentation, we saw that mapred is a simple but effective programming mode.
    It is wellsuited for the cloud but still mimited to a small number of domain experts with knowledge in computer science.
    We presented cloudgene as a posssible mapred workflowmanager and cloudman as a posssible cloud manager.
    Moreover, the implementation of cloudgene as a cloudman service enable scientists to make use of several workflows based on big data models.
  • Cloudman was developed by enis, tamislav and davor from croatia.
    And cloudgene was developed by sebm, hansi, florian and me.
    Now i am at the end of my presentation so thank you very much for your attention.
  • ×