Cloudflow – A Framework for MapReduce Pipeline Development in Biomedical Research

Lukas Forer1,2, Enis Afgan3,4, Hansi Weißensteiner1,2, Davor Davidović3, Günther Specht2,
Florian Kronenberg1, Sebastian Schönherr1,2
1 Division of Genetic Epidemiology, Medical University of Innsbruck, Innsbruck, Austria
2 Institute of Computer Science, Research Group Databases and Information Systems, Innsbruck, Austria
3 Center for Informatics and Computing, Ruđer Bošković Institute, Zagreb, Croatia
4 Department of Biology, Johns Hopkins University, Baltimore MD, USA
28 May, 2015 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr
Introduction
• Cooperation between
– Division of Genetic Epidemiology, Innsbruck,
Austria
– Ruđer Bošković Institute (RBI), Zagreb, Croatia
• Project
• Aim: Developing a Bioinformatics Platform for the Cloud
Motivation: Bioinformatics
• More and more data is produced
• Next-Generation Sequencing (NGS)
– Allows us to sequence the complete human
genome (3.2 billion positions)
– Decreasing costs → increasing data production
Bottleneck is no longer the data production in the
laboratory, but the analysis!
Motivation: Next Generation Sequencing
• NGS Data Characteristics
– Data size in terabyte scale
– Independent text data rows
– Batch processing required for analysis files
• MapReduce
– One possible candidate for batch processing
– Scalable approach to manage and process large data sets efficiently
“High potential for MapReduce in Genomics”
Challenges using MapReduce
• Writing a MapReduce job can be a challenging task
– Hard to write
– Hard to maintain
– Hard to test
• High-level languages on top of Hadoop (e.g. Pig)
– Allow writing MapReduce jobs in the form of queries
– SeqPig and BioPig provide scripts for NGS data
– Not designed for combining different scripts into pipelines
This prevents bioinformatics from using MapReduce
Cloudflow
• a MapReduce pipeline framework
• Provides a rich and highly abstracted Java API
• Comparable with FlumeJava and Apache Crunch
• For Big Data in Genetics
– Provides several built-in operations for NGS
• Cloudflow pipelines are easy to integrate into
graphical execution engines like Cloudgene
Main Concept
• Main concept is to break complex data analysis
steps into basic operations
– Transform Operation
– Summarize Operation
• Operations are independent from Hadoop
MapReduce
Easy to write, reuse and test!
Cloudflow: Operations
• Transform Operation
– Transform: 1 record → 0..n records
• Example
– Split by Word: 1 sentence → 1..n words
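The Transform operation on this slide (one record in, zero or more records out) can be sketched in plain Java. This is only an illustration of the idea: the names `TransformSketch`, `Transform` and `SPLIT_BY_WORD` are hypothetical and are not taken from Cloudflow's API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names for illustration only; this is not Cloudflow's API.
class TransformSketch {

    // A transform maps one input record to zero or more output records.
    interface Transform<I, O> {
        List<O> apply(I record);
    }

    // "Split by Word": one sentence in, one record per word out.
    static final Transform<String, String> SPLIT_BY_WORD = sentence -> {
        List<String> words = new ArrayList<>();
        for (String token : sentence.trim().split("\\s+")) {
            if (!token.isEmpty()) {   // a blank sentence yields no records
                words.add(token);
            }
        }
        return words;
    };

    public static void main(String[] args) {
        // prints [to, be, or, not, to, be]
        System.out.println(SPLIT_BY_WORD.apply("to be or not to be"));
    }
}
```

Because the operation is an ordinary function with no Hadoop types in its signature, it can be called directly from a unit test.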
Cloudflow: Operations
• Summarize Operation
– Summarize: grouped records → 0..n records
• Example
– Count: words → frequency
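The Summarize operation (grouped records in, zero or more records out) can be sketched the same way. The grouping step below is done in memory with a `TreeMap`, which stands in for the shuffle phase that Hadoop would perform on a cluster; `SummarizeSketch`, `Summarize`, `COUNT` and `countWords` are hypothetical names, not Cloudflow's API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical names for illustration only; this is not Cloudflow's API.
class SummarizeSketch {

    // A summarize step receives all records grouped under one key
    // and emits zero or more output records.
    interface Summarize<K, V, O> {
        List<O> apply(K key, List<V> group);
    }

    // "Count": emit one "word<TAB>frequency" record per group.
    static final Summarize<String, String, String> COUNT =
            (word, group) -> Collections.singletonList(word + "\t" + group.size());

    // Group identical words (sorted by key), then summarize each group.
    static List<String> countWords(List<String> words) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String w : words) {
            groups.computeIfAbsent(w, k -> new ArrayList<>()).add(w);
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : groups.entrySet()) {
            out.addAll(COUNT.apply(e.getKey(), e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> words = java.util.Arrays.asList("to", "be", "or", "not", "to", "be");
        countWords(words).forEach(System.out::println);
    }
}
```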
Cloudflow: Pipelines
• Pipeline
– A chain of Transform and Summarize operations
• Example
[Figure: word-count pipeline with the stages Sentence, Split by Word, Count, Filter, Frequency]
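A pipeline chains such operations so that the output of one stage becomes the input of the next. The sketch below is a minimal in-memory version, assuming a stage order of split, filter, count; the slide's figure does not survive in this transcript, so that order is an assumption. All names (`PipelineSketch`, `transform`, `count`, `run`, `wordCount`) are hypothetical and do not reflect Cloudflow's API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical names for illustration only; this is not Cloudflow's API.
class PipelineSketch {

    // Each stage consumes the full list of records and produces a new list.
    private final List<Function<List<String>, List<String>>> stages = new ArrayList<>();

    // Transform stage: apply a 1 -> 0..n function to every record independently.
    PipelineSketch transform(Function<String, List<String>> op) {
        stages.add(records -> {
            List<String> out = new ArrayList<>();
            for (String r : records) {
                out.addAll(op.apply(r));
            }
            return out;
        });
        return this;
    }

    // Summarize stage: group identical records, emit "record<TAB>count" per group.
    PipelineSketch count() {
        stages.add(records -> {
            Map<String, Integer> counts = new TreeMap<>();
            for (String r : records) {
                counts.merge(r, 1, Integer::sum);
            }
            List<String> out = new ArrayList<>();
            counts.forEach((k, v) -> out.add(k + "\t" + v));
            return out;
        });
        return this;
    }

    // Run all stages in order on an in-memory input.
    List<String> run(List<String> input) {
        List<String> current = input;
        for (Function<List<String>, List<String>> stage : stages) {
            current = stage.apply(current);
        }
        return current;
    }

    // Word count: split sentences, drop words shorter than minLength, count the rest.
    static PipelineSketch wordCount(int minLength) {
        return new PipelineSketch()
                .transform(s -> Arrays.asList(s.trim().split("\\s+")))
                .transform(w -> w.length() >= minLength
                        ? Collections.singletonList(w)
                        : Collections.<String>emptyList())
                .count();
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("to be or not to be", "be quick");
        wordCount(2).run(input).forEach(System.out::println);
    }
}
```

This sketch only runs in memory. The point it illustrates is that the pipeline definition is separate from how it is executed, which is what would allow the same definition to be handed to a different runner.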
Cloudflow: Execution
• Execute the same pipeline
– On your local development machine (with test data)
– On a remote Hadoop cluster (with big data)
[Figure: the same word-count pipeline (Sentence, Split by Word, Count, Filter, Frequency)]
Cloudflow: Execution as MapReduce jobs
Supported Data Formats and Operations
Deploying Pipelines as a Service
• Cloudgene: a bioinformatics MapReduce workflow framework
• CloudMan: a cloud manager for delivering application execution environments
• Cloudflow: a framework for MapReduce pipeline development in biomedical research
Evaluation
Conclusion
• MapReduce
– simple but effective programming model
– well suited for the Cloud
– still limited to a small number of highly qualified domain experts
• Cloudflow
– a MapReduce pipeline framework
– Provides a rich and highly abstracted Java API
– Comparable with FlumeJava and Apache Crunch
– For Big Data in Genetics
– https://github.com/genepi/cloudflow
Acknowledgements
• CloudMan
– Enis Afgan
– Davor Davidović
– Tomislav Lipić
• Cloudgene and Cloudflow
– Lukas Forer
– Sebastian Schönherr
– Hansi Weißensteiner
– Florian Kronenberg

Editor's Notes

  1. The solution I present is a cooperation between our institute at the Medical University of Innsbruck and the University of Zagreb. The aim of the project, a feasibility study, is to develop a bioinformatics platform for the cloud.
  2. I start my talk with a short introduction to the data we use. The data is produced by a technology called NGS, which makes it possible to sequence the whole genome. This is done in a massively parallel way, so the genome can be sequenced in adequate time and at adequate cost. As a consequence, more and more data is produced, and the bottleneck is no longer data production in the lab but its analysis.
  3. Data from NGS devices reaches terabytes in size and consists of independent, semi-structured data rows. To analyze this data we need batch processing. One possible candidate for analyzing such large data efficiently is MapReduce, so there is high potential for MapReduce in genomics.
  4. To conclude my presentation: we saw that MapReduce is a simple but effective programming model. It is well suited for the cloud but still limited to a small number of domain experts with knowledge of computer science. We presented Cloudgene as a possible MapReduce workflow manager and CloudMan as a possible cloud manager. Moreover, the implementation of Cloudgene as a CloudMan service enables scientists to make use of several workflows based on big data models.
  5. CloudMan was developed by Enis, Tomislav and Davor from Croatia, and Cloudgene was developed by Sebastian, Hansi, Florian and me. I am now at the end of my presentation, so thank you very much for your attention.