Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Cloudflow – A Framework for MapReduce Pipeline Development in Biomedical Research

280 views

Published on

Cloudflow is a MapReduce pipeline framework, which is based on a similar concept as JavaFlume or Apache Crunch. In contrast to these existing approaches, Cloudflow was developed to simplify the pipeline creation in biomedical research, especially in the field of Genetics. For that purpose Cloudflow supports a variety of NGS data formats and contains a rich collection of built-in operations for analyzing such kind of datasets (e.g. quality checks, mapping reads or variation calling).

Published in: Technology
  • Be the first to comment

  • Be the first to like this

Cloudflow – A Framework for MapReduce Pipeline Development in Biomedical Research

  1. 1. Cloudflow – A Framework for MapReduce Pipeline Development in Biomedical Research Lukas Forer1,2, Enis Afgan3,4, Hansi Weißensteiner1,2, Davor Davidović3, Günther Specht2, Florian Kronenberg1, Sebastian Schönherr1,2 1 Division of Genetic Epidemiology, Medical University of Innsbruck, Innsbruck, Austria 2 Institute of Computer Science, Research Group Databases and Information Systems, Innsbruck, Austria 3 Center for Informatics and Computing, Ruđer Bošković Institute, Zagreb, Croatia 4 Department of Biology, Johns Hopkins University, Baltimore MD, USA
  2. 2. Page 2 28 May, 2015 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Introduction • Cooperation between – Division of Genetic Epidemiology, Innsbruck, Austria – Ruđer Bošković Institute (RBI), Zagreb, Croatia • Project • Aim: Developing a Bioinformatics Platform for the Cloud
  3. 3. Page 3 28 May, 2015 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Motivation: Bioinformatics • More and more data is produced • Next-Generation Sequencing (NGS) – Allows us to sequence the complete human genome (3.2 billion positions) – Decreasing Costs  Increasing Data Production Bottleneck is no longer the data production in the laboratory, but the analysis! Bottleneck is no longer the data production in the laboratory, but the analysis!
  4. 4. Page 4 28 May, 2015 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Motivation: Next Generation Sequencing • NGS Data Characteristics – Data size in terabyte scale – Independent text data rows – Batch processing required for analysis files • MapReduce – One possible candidate for batch processing – scalable approach to manage and process large data efficiently “High potential for MapReduce in Genomics”“High potential for MapReduce in Genomics”
  5. 5. Page 5 28 May, 2015 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Challenges using MapReduce • writing a MapReduce job can be a challenging task – Hard to write – Hard to maintain – Hard to test • high-level languages on top of Hadoop (e.g. PIG) – allow writing MapReduce jobs in form of queries – SeqPig or BioPig provide scripts for NGS data – Not developed to combine different scripts to a pipelines Prevents Bioinformatics from using MapReducePrevents Bioinformatics from using MapReduce
  6. 6. Page 6 28 May, 2015 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Cloudflow • a MapReduce pipeline framework • Provides a rich and highly abstracted Java API • Comparable with FlumeJava and Apache Crunch • For Big Data in Genetics – provides a several built-in operations for NGS • Cloudflow pipelines are easy to integrate into graphical execution engines like Cloudgene
  7. 7. Page 7 28 May, 2015 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Main Concept • Main concept is to break complex data analysis steps into basic operations – Transform Operation – Summarize Operation • Operations are independent from Hadoop MapReduce Easy to write, reuse and test!Easy to write, reuse and test!
  8. 8. Page 8 28 May, 2015 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Cloudflow: Operations • Transform Operation • Example TransformTransform 0..n Records0..n Records1 Record1 Record Split by WordSplit by Word1 Sentence1 Sentence 1..n Words1..n Words
  9. 9. Page 9 28 May, 2015 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Cloudflow: Operations • Summarize Operation • Example SummarizeSummarize Grouped Records Grouped Records 0..n Records0..n Records CountCountWordsWords FrequenceFrequence
  10. 10. Page 10 28 May, 2015 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Cloudflow: Pipelines • Pipeline • Example TransformTransform SummarizeSummarizeTransformTransform TransformTransform SentenceSentence CountCount Split by Word Split by Word FrequenceFrequenceFilterFilter
  11. 11. Page 11 28 May, 2015 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Cloudflow: Execution • Execute the same pipeline – On your local development machine (testdata) – On a remote Hadoop Cluster (with big-data) SentenceSentence CountCount Split by Word Split by Word FrequenceFrequenceFilterFilter
  12. 12. Page 12 28 May, 2015 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Cloudflow: Execution as MapReduce jobs
  13. 13. Page 13 28 May, 2015 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Supported Data Formats and Operations
  14. 14. Page 14 28 May, 2015 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Deploying Pipelines as a Service Cloudgene A Bioinformatics MapReduce workflow framework Cloudgene A Bioinformatics MapReduce workflow framework CloudMan A cloud manager for delivering application execution environments CloudMan A cloud manager for delivering application execution environments Cloudflow A Framework for MapReduce Pipeline Development in Biomedical Research Cloudflow A Framework for MapReduce Pipeline Development in Biomedical Research
  15. 15. Page 15 28 May, 2015 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Evaluation
  16. 16. Page 16 28 May, 2015 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Conclusion • MapReduce – simple but effective programming model – well suited for the Cloud – still limited to a small number of highly qualified domain experts • Cloudflow – a MapReduce pipeline framework – Provides a rich and highly abstracted Java API – Comparable with FlumeJava and Apache Crunch – For Big Data in Genetics – https://github.com/genepi/cloudflow
  17. 17. Page 17 28 May, 2015 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Acknowledge • CloudMan – Enis Afgan – Davor Davidović – Tomislav Lipić • Cloudgene and Cloudflow – Lukas Forer – Sebastian Schönherr – Hansi Weißensteiner – Florian Kronenberg

×