Delivering Bioinformatics MapReduce Applications in the Cloud

Delivering Bioinformatics MapReduce
Applications in the Cloud
Lukas Forer*, Tomislav Lipić**, Sebastian Schönherr*, Hansi Weißensteiner*,
Davor Davidović**, Florian Kronenberg* and Enis Afgan**
* Division of Genetic Epidemiology, Medical University of Innsbruck, Austria
** Center for Informatics and Computing, Ruđer Bošković Institute, Zagreb, Croatia
Page 2
21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr
Introduction
• Cooperation between
– Division of Genetic Epidemiology, Innsbruck,
Austria
– Ruđer Bošković Institute (RBI), Zagreb, Croatia
• Project
• Aim: Developing a Bioinformatics Platform for the Cloud
Page 3
21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr
Motivation: Bioinformatics
• More and more data is produced
• Next-Generation Sequencing (NGS)
– Allows us to sequence the complete human
genome (3.2 billion positions)
– Decreasing Costs  Increasing Data
Production
Bottleneck is no longer the data production in the
laboratory, but the analysis!
Page 4
21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr
Motivation: Next Generation Sequencing
• NGS Data Characteristics
– Data size in terabyte scale
– Independent text data rows
– Batch processing required for analysis files
• MapReduce
– One possible candidate for batch processing
– scalable approach to manage and process large data
efficiently
“High potential for MapReduce in Genomics”
Page 5
21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr
MapReduce in Bioinformatics
Hadoop
MapReduce
libraries for
Bioinformatics
Hadoop BAM
Manipulation of aligned next-generation sequencing data (supports BAM, SAM, FASTQ,
FASTA, QSEQ, BCF, and VCF)
SeqPig
Processing NGS data with Apache Pig; Presenting UDFs for frequent tasks; using Hadoop-
BAM
BioPig Processing NGS data with Apache Pig; Presenting UDFs
Biodoop
MapReduce suite for sequence alignments / manipulation of aligned records; written in
Python
DNA -
Alignment
algorithms
based on
Hadoop
CloudBurst
Based on RMAP (seed-and-extend algorithm) Map: Extracting k-mers of reference, non-
overlapping k-mers of reads (as keys) Reduce: End-to-end alignments of seeds
Seal
Based on BWA (version 0.5.9) Map: Alignment using BWA (on a previously created internal
file format) Reduce: Remove duplicates (optional)
Crossbow
Based on Bowtie / SOAPsnp
Map: Executing Bowtie on chunks
Reduce: SNP calling using SOAPsnp
RNA - Analysis
based on
Hadoop
MyRNA Pipeline for calculating differential gene expression in RNA; including Bowtie
FX RNA-Seq analysis tool
Eoulsan RNA-Seq analysis tool
Non-Hadoop
based
Approaches
GATK
MapReduce-like framework including a rich set of tools for quality assurance, alignment and
variant calling; not based on Hadoop MapReduce
Page 6
21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr
Problem 1: Complex Analysis Pipelines
• Bioinformatics MapReduce Applications
– available only on a per-tool basis
– cover one aspect of a larger data analysis
pipeline
– Hard to use for scientists without background
in Computer Science
Needed: System which enables building
MapReduce workflows
Page 7
21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr
Cloudgene: Overview
• Cloudgene
– MapReduce based Workflow System
– assists scientists in executing and monitoring
workflows graphically
– data management (import/export)
– workflow parameter track→ reproducibility
• Supports Apache’s Big-Data stack
S. Schönherr, L. Forer, H. Weißensteiner, F. Kronenberg, G. Specht, and A. Kloss-Brandstätter. Cloudgene: a graphical execution platform
for MapReduce programs on private and public clouds. BMC Bioinformatics, 13(1):200, Jan. 2012.
Page 8
21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr
Cloudgene: Workflow Composition
• Reuse existing Apache Hadoop programs
• No adaptations in source code needed
• All meta data about tasks and workflows
are defined in one single file
– the manifest file
MapReduce
program
Manifest File
for Cloudgene
Page 9
21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr
Cloudgene: Workflow Composition
• Manifest file defines the workflow
Page 10
21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr
Cloudgene: Interface
Used
Parameters
Download links
to result files
Page 11
21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr
Problem 2: Missing Infrastructure
• Cloudgene requires a functional compatible cluster
– Small/Medium sized research institutes can hardly afford
own clusters
• Possible Solution: Cloud Computing
– Rent computer hardware from different providers (e.g.
AWS, HP)
– Use resources and services on demand
Needed: System which enables delivering
MapReduce clusters in the cloud
Page 12
21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr
CloudMan: Overview
• Enables launching/managing a analysis
platform on a cloud infrastructure via a web
browser
– Delivers a scalable cluster-in-the-cloud
• Preconfigure a number of applications
– Workflow System: Galaxy
– Batch scheduler: Sun Grid Engine (SGE)
– MapReduce using Hadoop and SGE
integration
Afgan E, Chapman B, Taylor J. CloudMan as a platform for tool, data, and analysis distribution. BMC Bioinformatics
2012;
Page 13
21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr
Configure the Cluster
Page 14
21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr
Manage the Cluster
Page 15
21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr
Use the Cluster
Page 16
21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr
CloudMan-as-a-Platform
• CloudMan implements a service dependency
management framework
• Enables creation of user-specific cloud platforms
Idea: Implementing Cloudgene as a CloudMan
service
Page 17
21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr
Cloudgene meets CloudMan
Cloudgene
A Bioinformatics
MapReduce workflow
framework
CloudMan
A cloud manager for
delivering application
execution environments
CloudMan platform
OpenNebulaOpenNebula
AWS OpenStack Eucalyptus
SGE Hadoop HTCondor ...
GalaxyCloudgene
MapReduce tools Traditional tools
Batch
Cluster jobs
Page 18
21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr
New Opportunities
• One single data analysis cloud platform
– MapReduce and SGE integrated
– Additionally, more big data computational
models can be supported in future
CloudMan CloudMan
Page 19
21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr
Conclusion
• MapReduce
– simple but effective programming model
– well suited for the Cloud
– still limited to a small number of highly qualified domain experts
• Cloudgene
– as a possible MapReduce Workflow Manager
• CloudMan
– as a possible Cloud Manager
• Implementing Cloudgene as a CloudMan service
– Enables domain experts to incorporate and make use of
additional big data models
Page 20
21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr
Acknowledge
• CloudMan
– Enis Afgan
– Tomislav Lipić
– Davor Davidović
• Cloudgene
– Lukas Forer
– Sebastian Schönherr
– Hansi Weißensteiner
– Florian Kronenberg
1 of 20

Recommended

C cerin piv2017_c by
C cerin piv2017_cC cerin piv2017_c
C cerin piv2017_cBertrand Tavitian
237 views40 slides
Modern Computing: Cloud, Distributed, & High Performance by
Modern Computing: Cloud, Distributed, & High PerformanceModern Computing: Cloud, Distributed, & High Performance
Modern Computing: Cloud, Distributed, & High Performanceinside-BigData.com
956 views61 slides
The convergence of HPC and BigData: What does it mean for HPC sysadmins? by
The convergence of HPC and BigData: What does it mean for HPC sysadmins?The convergence of HPC and BigData: What does it mean for HPC sysadmins?
The convergence of HPC and BigData: What does it mean for HPC sysadmins?inside-BigData.com
1.7K views22 slides
Deep Hybrid DataCloud by
Deep Hybrid DataCloudDeep Hybrid DataCloud
Deep Hybrid DataCloudEOSC-hub project
105 views10 slides
MAP-REDUCE IMPLEMENTATIONS: SURVEY AND PERFORMANCE COMPARISON by
MAP-REDUCE IMPLEMENTATIONS: SURVEY AND PERFORMANCE COMPARISONMAP-REDUCE IMPLEMENTATIONS: SURVEY AND PERFORMANCE COMPARISON
MAP-REDUCE IMPLEMENTATIONS: SURVEY AND PERFORMANCE COMPARISONijcsit
311 views8 slides

More Related Content

What's hot

Hourglass: a Library for Incremental Processing on Hadoop by
Hourglass: a Library for Incremental Processing on HadoopHourglass: a Library for Incremental Processing on Hadoop
Hourglass: a Library for Incremental Processing on HadoopMatthew Hayes
13.6K views29 slides
Programming Modes and Performance of Raspberry-Pi Clusters by
Programming Modes and Performance of Raspberry-Pi ClustersProgramming Modes and Performance of Raspberry-Pi Clusters
Programming Modes and Performance of Raspberry-Pi ClustersAM Publications
270 views8 slides
GPU accelerated Large Scale Analytics by
GPU accelerated Large Scale AnalyticsGPU accelerated Large Scale Analytics
GPU accelerated Large Scale AnalyticsSuleiman Shehu
605 views11 slides
Summary of the Deployment Scenarios and Functional Requirements by
Summary of the Deployment Scenarios and Functional RequirementsSummary of the Deployment Scenarios and Functional Requirements
Summary of the Deployment Scenarios and Functional RequirementsArchiver
614 views19 slides
DESY / XFEL Deployment Scenarios by
DESY / XFEL Deployment Scenarios  DESY / XFEL Deployment Scenarios
DESY / XFEL Deployment Scenarios Archiver
726 views13 slides
BUILDING A PRIVATE HPC CLOUD FOR COMPUTE AND DATA-INTENSIVE APPLICATIONS by
BUILDING A PRIVATE HPC CLOUD FOR COMPUTE AND DATA-INTENSIVE APPLICATIONSBUILDING A PRIVATE HPC CLOUD FOR COMPUTE AND DATA-INTENSIVE APPLICATIONS
BUILDING A PRIVATE HPC CLOUD FOR COMPUTE AND DATA-INTENSIVE APPLICATIONSijccsa
826 views20 slides

What's hot(19)

Hourglass: a Library for Incremental Processing on Hadoop by Matthew Hayes
Hourglass: a Library for Incremental Processing on HadoopHourglass: a Library for Incremental Processing on Hadoop
Hourglass: a Library for Incremental Processing on Hadoop
Matthew Hayes13.6K views
Programming Modes and Performance of Raspberry-Pi Clusters by AM Publications
Programming Modes and Performance of Raspberry-Pi ClustersProgramming Modes and Performance of Raspberry-Pi Clusters
Programming Modes and Performance of Raspberry-Pi Clusters
AM Publications270 views
GPU accelerated Large Scale Analytics by Suleiman Shehu
GPU accelerated Large Scale AnalyticsGPU accelerated Large Scale Analytics
GPU accelerated Large Scale Analytics
Suleiman Shehu605 views
Summary of the Deployment Scenarios and Functional Requirements by Archiver
Summary of the Deployment Scenarios and Functional RequirementsSummary of the Deployment Scenarios and Functional Requirements
Summary of the Deployment Scenarios and Functional Requirements
Archiver 614 views
DESY / XFEL Deployment Scenarios by Archiver
DESY / XFEL Deployment Scenarios  DESY / XFEL Deployment Scenarios
DESY / XFEL Deployment Scenarios
Archiver 726 views
BUILDING A PRIVATE HPC CLOUD FOR COMPUTE AND DATA-INTENSIVE APPLICATIONS by ijccsa
BUILDING A PRIVATE HPC CLOUD FOR COMPUTE AND DATA-INTENSIVE APPLICATIONSBUILDING A PRIVATE HPC CLOUD FOR COMPUTE AND DATA-INTENSIVE APPLICATIONS
BUILDING A PRIVATE HPC CLOUD FOR COMPUTE AND DATA-INTENSIVE APPLICATIONS
ijccsa826 views
Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial) by Robert Grossman
Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)
Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)
Robert Grossman1.1K views
How Data Volume Affects Spark Based Data Analytics on a Scale-up Server by Ahsan Javed Awan
How Data Volume Affects Spark Based Data Analytics on a Scale-up ServerHow Data Volume Affects Spark Based Data Analytics on a Scale-up Server
How Data Volume Affects Spark Based Data Analytics on a Scale-up Server
Ahsan Javed Awan230 views
Gisruk2013 addy edit2 by Addy Pope
Gisruk2013 addy edit2Gisruk2013 addy edit2
Gisruk2013 addy edit2
Addy Pope1.3K views
Performance Characterization of In-Memory Data Analytics on a Modern Cloud Se... by Ahsan Javed Awan
Performance Characterization of In-Memory Data Analytics on a Modern Cloud Se...Performance Characterization of In-Memory Data Analytics on a Modern Cloud Se...
Performance Characterization of In-Memory Data Analytics on a Modern Cloud Se...
Ahsan Javed Awan293 views
A Survey on Data Mapping Strategy for data stored in the storage cloud 111 by NavNeet KuMar
A Survey on Data Mapping Strategy for data stored in the storage cloud  111A Survey on Data Mapping Strategy for data stored in the storage cloud  111
A Survey on Data Mapping Strategy for data stored in the storage cloud 111
NavNeet KuMar96 views
Apache Big_Data Europe event: "Integrators at work! Real-life applications of... by BigData_Europe
Apache Big_Data Europe event: "Integrators at work! Real-life applications of...Apache Big_Data Europe event: "Integrators at work! Real-life applications of...
Apache Big_Data Europe event: "Integrators at work! Real-life applications of...
BigData_Europe676 views
rasdaman: from barebone Arrays to DataCubes by EUDAT
rasdaman: from barebone Arrays to DataCubesrasdaman: from barebone Arrays to DataCubes
rasdaman: from barebone Arrays to DataCubes
EUDAT290 views
IRJET- Big Data-A Review Study with Comparitive Analysis of Hadoop by IRJET Journal
IRJET- Big Data-A Review Study with Comparitive Analysis of HadoopIRJET- Big Data-A Review Study with Comparitive Analysis of Hadoop
IRJET- Big Data-A Review Study with Comparitive Analysis of Hadoop
IRJET Journal24 views
Archiver omc cern_deployment_scenarios_technical_details by Archiver
Archiver omc cern_deployment_scenarios_technical_detailsArchiver omc cern_deployment_scenarios_technical_details
Archiver omc cern_deployment_scenarios_technical_details
Archiver 410 views
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work by IRJET Journal
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame WorkA Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
IRJET Journal32 views
IJSRED-V2I3P43 by IJSRED
IJSRED-V2I3P43IJSRED-V2I3P43
IJSRED-V2I3P43
IJSRED27 views
Journées ABES 2014 - Projet CIB - Uwe Rich by ABES
Journées ABES 2014 - Projet CIB - Uwe Rich Journées ABES 2014 - Projet CIB - Uwe Rich
Journées ABES 2014 - Projet CIB - Uwe Rich
ABES1.9K views
Big Graph : Tools, Techniques, Issues, Challenges and Future Directions by csandit
Big Graph : Tools, Techniques, Issues, Challenges and Future Directions Big Graph : Tools, Techniques, Issues, Challenges and Future Directions
Big Graph : Tools, Techniques, Issues, Challenges and Future Directions
csandit111 views

Viewers also liked

الهوية الرقمية على مواقع التواصل الاجتماعي by
الهوية الرقمية على مواقع التواصل الاجتماعيالهوية الرقمية على مواقع التواصل الاجتماعي
الهوية الرقمية على مواقع التواصل الاجتماعيFatma Esa
1.2K views67 slides
Caravane Bio [Mohammed Benbouida, AMBS, Morocco] by
Caravane Bio [Mohammed Benbouida, AMBS, Morocco]Caravane Bio [Mohammed Benbouida, AMBS, Morocco]
Caravane Bio [Mohammed Benbouida, AMBS, Morocco]UNESCO Venice Office
974 views30 slides
Kallio Chipster Bosc2009 by
Kallio Chipster Bosc2009Kallio Chipster Bosc2009
Kallio Chipster Bosc2009bosc
813 views16 slides
استراتيجيات العلوم والتكنولوجيا والتجديد العالمية المعاصرة (ST&I) by
 استراتيجيات العلوم والتكنولوجيا والتجديد العالمية المعاصرة (ST&I) استراتيجيات العلوم والتكنولوجيا والتجديد العالمية المعاصرة (ST&I)
استراتيجيات العلوم والتكنولوجيا والتجديد العالمية المعاصرة (ST&I)Prof. Tafida Ghanem
1.1K views22 slides
مهارات+1 by
مهارات+1مهارات+1
مهارات+1Mosab-Khayat
622 views15 slides
Supporting bioinformatics applications with hybrid multi-cloud services by
Supporting bioinformatics applications with hybrid multi-cloud servicesSupporting bioinformatics applications with hybrid multi-cloud services
Supporting bioinformatics applications with hybrid multi-cloud servicesAhmed Abdullah
948 views45 slides

Viewers also liked(20)

الهوية الرقمية على مواقع التواصل الاجتماعي by Fatma Esa
الهوية الرقمية على مواقع التواصل الاجتماعيالهوية الرقمية على مواقع التواصل الاجتماعي
الهوية الرقمية على مواقع التواصل الاجتماعي
Fatma Esa1.2K views
Kallio Chipster Bosc2009 by bosc
Kallio Chipster Bosc2009Kallio Chipster Bosc2009
Kallio Chipster Bosc2009
bosc813 views
استراتيجيات العلوم والتكنولوجيا والتجديد العالمية المعاصرة (ST&I) by Prof. Tafida Ghanem
 استراتيجيات العلوم والتكنولوجيا والتجديد العالمية المعاصرة (ST&I) استراتيجيات العلوم والتكنولوجيا والتجديد العالمية المعاصرة (ST&I)
استراتيجيات العلوم والتكنولوجيا والتجديد العالمية المعاصرة (ST&I)
Prof. Tafida Ghanem1.1K views
Supporting bioinformatics applications with hybrid multi-cloud services by Ahmed Abdullah
Supporting bioinformatics applications with hybrid multi-cloud servicesSupporting bioinformatics applications with hybrid multi-cloud services
Supporting bioinformatics applications with hybrid multi-cloud services
Ahmed Abdullah948 views
Lt npsti process-and_forms_april_2011 by Mosab-Khayat
Lt npsti process-and_forms_april_2011Lt npsti process-and_forms_april_2011
Lt npsti process-and_forms_april_2011
Mosab-Khayat480 views
تسويق خدمات المعلومات by u083125
تسويق خدمات المعلوماتتسويق خدمات المعلومات
تسويق خدمات المعلومات
u0831253.8K views
الثقافة المعلوماتية في الجامعات مكتبة جامعة 6 أكتوبر نوفمبر 2012م by Prof. Sherif Shaheen
الثقافة المعلوماتية في الجامعات   مكتبة جامعة 6 أكتوبر نوفمبر 2012مالثقافة المعلوماتية في الجامعات   مكتبة جامعة 6 أكتوبر نوفمبر 2012م
الثقافة المعلوماتية في الجامعات مكتبة جامعة 6 أكتوبر نوفمبر 2012م
From Sunset To Sunrise by guest11219b
From Sunset To SunriseFrom Sunset To Sunrise
From Sunset To Sunrise
guest11219b270 views
الثقافة التقنية والمواطنة الالكترونية by Nazzal Th. Alenezi
الثقافة التقنية والمواطنة الالكترونيةالثقافة التقنية والمواطنة الالكترونية
الثقافة التقنية والمواطنة الالكترونية
Nazzal Th. Alenezi1.8K views
دور القطاع الخاص في تعزيز مفاهيم الثقافة المعلوماتية و المعرفية by e-Marefa
دور القطاع الخاص في تعزيز مفاهيم الثقافة المعلوماتية و المعرفيةدور القطاع الخاص في تعزيز مفاهيم الثقافة المعلوماتية و المعرفية
دور القطاع الخاص في تعزيز مفاهيم الثقافة المعلوماتية و المعرفية
e-Marefa283 views
ABT 609 PPT by Jane Awah
ABT 609 PPTABT 609 PPT
ABT 609 PPT
Jane Awah819 views

Similar to Delivering Bioinformatics MapReduce Applications in the Cloud

Cloudflow – A Framework for MapReduce Pipeline Development in Biomedical Rese... by
Cloudflow – A Framework for MapReduce Pipeline Development in Biomedical Rese...Cloudflow – A Framework for MapReduce Pipeline Development in Biomedical Rese...
Cloudflow – A Framework for MapReduce Pipeline Development in Biomedical Rese...Lukas Forer
408 views17 slides
Cloudgene - A MapReduce based Workflow Management System by
Cloudgene - A MapReduce based Workflow Management SystemCloudgene - A MapReduce based Workflow Management System
Cloudgene - A MapReduce based Workflow Management SystemLukas Forer
938 views22 slides
Apache Big_Data Europe event: "Demonstrating the Societal Value of Big & Smar... by
Apache Big_Data Europe event: "Demonstrating the Societal Value of Big & Smar...Apache Big_Data Europe event: "Demonstrating the Societal Value of Big & Smar...
Apache Big_Data Europe event: "Demonstrating the Societal Value of Big & Smar...BigData_Europe
808 views40 slides
L Forer - Cloudgene: an execution platform for MapReduce programs in public a... by
L Forer - Cloudgene: an execution platform for MapReduce programs in public a...L Forer - Cloudgene: an execution platform for MapReduce programs in public a...
L Forer - Cloudgene: an execution platform for MapReduce programs in public a...Jan Aerts
882 views24 slides
BigDataEurope @BDVA Summit2016 2: Societal Pilots by
BigDataEurope @BDVA Summit2016 2: Societal PilotsBigDataEurope @BDVA Summit2016 2: Societal Pilots
BigDataEurope @BDVA Summit2016 2: Societal PilotsBigData_Europe
496 views43 slides
How to use NCI's national repository of big spatial data collections by
How to use NCI's national repository of big spatial data collectionsHow to use NCI's national repository of big spatial data collections
How to use NCI's national repository of big spatial data collectionsARDC
905 views48 slides

Similar to Delivering Bioinformatics MapReduce Applications in the Cloud(20)

Cloudflow – A Framework for MapReduce Pipeline Development in Biomedical Rese... by Lukas Forer
Cloudflow – A Framework for MapReduce Pipeline Development in Biomedical Rese...Cloudflow – A Framework for MapReduce Pipeline Development in Biomedical Rese...
Cloudflow – A Framework for MapReduce Pipeline Development in Biomedical Rese...
Lukas Forer408 views
Cloudgene - A MapReduce based Workflow Management System by Lukas Forer
Cloudgene - A MapReduce based Workflow Management SystemCloudgene - A MapReduce based Workflow Management System
Cloudgene - A MapReduce based Workflow Management System
Lukas Forer938 views
Apache Big_Data Europe event: "Demonstrating the Societal Value of Big & Smar... by BigData_Europe
Apache Big_Data Europe event: "Demonstrating the Societal Value of Big & Smar...Apache Big_Data Europe event: "Demonstrating the Societal Value of Big & Smar...
Apache Big_Data Europe event: "Demonstrating the Societal Value of Big & Smar...
BigData_Europe808 views
L Forer - Cloudgene: an execution platform for MapReduce programs in public a... by Jan Aerts
L Forer - Cloudgene: an execution platform for MapReduce programs in public a...L Forer - Cloudgene: an execution platform for MapReduce programs in public a...
L Forer - Cloudgene: an execution platform for MapReduce programs in public a...
Jan Aerts882 views
BigDataEurope @BDVA Summit2016 2: Societal Pilots by BigData_Europe
BigDataEurope @BDVA Summit2016 2: Societal PilotsBigDataEurope @BDVA Summit2016 2: Societal Pilots
BigDataEurope @BDVA Summit2016 2: Societal Pilots
BigData_Europe496 views
How to use NCI's national repository of big spatial data collections by ARDC
How to use NCI's national repository of big spatial data collectionsHow to use NCI's national repository of big spatial data collections
How to use NCI's national repository of big spatial data collections
ARDC905 views
Bonazzi commons bd2 k ahm 2016 v2 by Vivien Bonazzi
Bonazzi commons bd2 k ahm 2016 v2Bonazzi commons bd2 k ahm 2016 v2
Bonazzi commons bd2 k ahm 2016 v2
Vivien Bonazzi738 views
Serverless stream processing of Debezium data change events with Knative | De... by Red Hat Developers
Serverless stream processing of Debezium data change events with Knative | De...Serverless stream processing of Debezium data change events with Knative | De...
Serverless stream processing of Debezium data change events with Knative | De...
Red Hat Developers3.2K views
What's New in Cytoscape by Keiichiro Ono
What's New in CytoscapeWhat's New in Cytoscape
What's New in Cytoscape
Keiichiro Ono1.2K views
Paralyzing Bioinformatics Applications Using Conducive Hadoop Cluster by IOSR Journals
Paralyzing Bioinformatics Applications Using Conducive Hadoop ClusterParalyzing Bioinformatics Applications Using Conducive Hadoop Cluster
Paralyzing Bioinformatics Applications Using Conducive Hadoop Cluster
IOSR Journals342 views
SC1 Workshop 2 General Introduction to BDE by BigData_Europe
SC1 Workshop 2 General Introduction to BDESC1 Workshop 2 General Introduction to BDE
SC1 Workshop 2 General Introduction to BDE
BigData_Europe1.5K views
Microsoft's Hadoop Story by Michael Rys
Microsoft's Hadoop StoryMicrosoft's Hadoop Story
Microsoft's Hadoop Story
Michael Rys4K views
Virtual BenchLearning - I-BiDaaS - Industrial-Driven Big Data as a Self-Servi... by Big Data Value Association
Virtual BenchLearning - I-BiDaaS - Industrial-Driven Big Data as a Self-Servi...Virtual BenchLearning - I-BiDaaS - Industrial-Driven Big Data as a Self-Servi...
Virtual BenchLearning - I-BiDaaS - Industrial-Driven Big Data as a Self-Servi...
Platform introduction & Summary by BigData_Europe
Platform introduction & SummaryPlatform introduction & Summary
Platform introduction & Summary
BigData_Europe456 views
Big data-hadoop-training-course-content-content by Training Institute
Big data-hadoop-training-course-content-contentBig data-hadoop-training-course-content-content
Big data-hadoop-training-course-content-content
Vol 10 No 1 - February 2014 by ijcsbi
Vol 10 No 1 - February 2014Vol 10 No 1 - February 2014
Vol 10 No 1 - February 2014
ijcsbi69 views
BDE SC3.3 Workshop - BDE Platform: Technical overview by BigData_Europe
 BDE SC3.3 Workshop -  BDE Platform: Technical overview BDE SC3.3 Workshop -  BDE Platform: Technical overview
BDE SC3.3 Workshop - BDE Platform: Technical overview
BigData_Europe276 views

Recently uploaded

AI: mind, matter, meaning, metaphors, being, becoming, life values by
AI: mind, matter, meaning, metaphors, being, becoming, life valuesAI: mind, matter, meaning, metaphors, being, becoming, life values
AI: mind, matter, meaning, metaphors, being, becoming, life valuesTwain Liu 刘秋艳
35 views16 slides
Upskilling the Evolving Workforce with Digital Fluency for Tomorrow's Challen... by
Upskilling the Evolving Workforce with Digital Fluency for Tomorrow's Challen...Upskilling the Evolving Workforce with Digital Fluency for Tomorrow's Challen...
Upskilling the Evolving Workforce with Digital Fluency for Tomorrow's Challen...NUS-ISS
28 views70 slides
Architecting CX Measurement Frameworks and Ensuring CX Metrics are fit for Pu... by
Architecting CX Measurement Frameworks and Ensuring CX Metrics are fit for Pu...Architecting CX Measurement Frameworks and Ensuring CX Metrics are fit for Pu...
Architecting CX Measurement Frameworks and Ensuring CX Metrics are fit for Pu...NUS-ISS
37 views54 slides
Future of Learning - Yap Aye Wee.pdf by
Future of Learning - Yap Aye Wee.pdfFuture of Learning - Yap Aye Wee.pdf
Future of Learning - Yap Aye Wee.pdfNUS-ISS
41 views11 slides
Top 10 Strategic Technologies in 2024: AI and Automation by
Top 10 Strategic Technologies in 2024: AI and AutomationTop 10 Strategic Technologies in 2024: AI and Automation
Top 10 Strategic Technologies in 2024: AI and AutomationAutomationEdge Technologies
14 views14 slides
Voice Logger - Telephony Integration Solution at Aegis by
Voice Logger - Telephony Integration Solution at AegisVoice Logger - Telephony Integration Solution at Aegis
Voice Logger - Telephony Integration Solution at AegisNirmal Sharma
17 views1 slide

Recently uploaded(20)

AI: mind, matter, meaning, metaphors, being, becoming, life values by Twain Liu 刘秋艳
AI: mind, matter, meaning, metaphors, being, becoming, life valuesAI: mind, matter, meaning, metaphors, being, becoming, life values
AI: mind, matter, meaning, metaphors, being, becoming, life values
Upskilling the Evolving Workforce with Digital Fluency for Tomorrow's Challen... by NUS-ISS
Upskilling the Evolving Workforce with Digital Fluency for Tomorrow's Challen...Upskilling the Evolving Workforce with Digital Fluency for Tomorrow's Challen...
Upskilling the Evolving Workforce with Digital Fluency for Tomorrow's Challen...
NUS-ISS28 views
Architecting CX Measurement Frameworks and Ensuring CX Metrics are fit for Pu... by NUS-ISS
Architecting CX Measurement Frameworks and Ensuring CX Metrics are fit for Pu...Architecting CX Measurement Frameworks and Ensuring CX Metrics are fit for Pu...
Architecting CX Measurement Frameworks and Ensuring CX Metrics are fit for Pu...
NUS-ISS37 views
Future of Learning - Yap Aye Wee.pdf by NUS-ISS
Future of Learning - Yap Aye Wee.pdfFuture of Learning - Yap Aye Wee.pdf
Future of Learning - Yap Aye Wee.pdf
NUS-ISS41 views
Voice Logger - Telephony Integration Solution at Aegis by Nirmal Sharma
Voice Logger - Telephony Integration Solution at AegisVoice Logger - Telephony Integration Solution at Aegis
Voice Logger - Telephony Integration Solution at Aegis
Nirmal Sharma17 views
Attacking IoT Devices from a Web Perspective - Linux Day by Simone Onofri
Attacking IoT Devices from a Web Perspective - Linux Day Attacking IoT Devices from a Web Perspective - Linux Day
Attacking IoT Devices from a Web Perspective - Linux Day
Simone Onofri15 views
The details of description: Techniques, tips, and tangents on alternative tex... by BookNet Canada
The details of description: Techniques, tips, and tangents on alternative tex...The details of description: Techniques, tips, and tangents on alternative tex...
The details of description: Techniques, tips, and tangents on alternative tex...
BookNet Canada121 views
.conf Go 2023 - How KPN drives Customer Satisfaction on IPTV by Splunk
.conf Go 2023 - How KPN drives Customer Satisfaction on IPTV.conf Go 2023 - How KPN drives Customer Satisfaction on IPTV
.conf Go 2023 - How KPN drives Customer Satisfaction on IPTV
Splunk88 views
.conf Go 2023 - Data analysis as a routine by Splunk
.conf Go 2023 - Data analysis as a routine.conf Go 2023 - Data analysis as a routine
.conf Go 2023 - Data analysis as a routine
Splunk93 views
Perth MeetUp November 2023 by Michael Price
Perth MeetUp November 2023 Perth MeetUp November 2023
Perth MeetUp November 2023
Michael Price15 views
Beyond the Hype: What Generative AI Means for the Future of Work - Damien Cum... by NUS-ISS
Beyond the Hype: What Generative AI Means for the Future of Work - Damien Cum...Beyond the Hype: What Generative AI Means for the Future of Work - Damien Cum...
Beyond the Hype: What Generative AI Means for the Future of Work - Damien Cum...
NUS-ISS34 views
Web Dev - 1 PPT.pdf by gdsczhcet
Web Dev - 1 PPT.pdfWeb Dev - 1 PPT.pdf
Web Dev - 1 PPT.pdf
gdsczhcet55 views
Empathic Computing: Delivering the Potential of the Metaverse by Mark Billinghurst
Empathic Computing: Delivering  the Potential of the MetaverseEmpathic Computing: Delivering  the Potential of the Metaverse
Empathic Computing: Delivering the Potential of the Metaverse
Mark Billinghurst470 views
Understanding GenAI/LLM and What is Google Offering - Felix Goh by NUS-ISS
Understanding GenAI/LLM and What is Google Offering - Felix GohUnderstanding GenAI/LLM and What is Google Offering - Felix Goh
Understanding GenAI/LLM and What is Google Offering - Felix Goh
NUS-ISS41 views
Black and White Modern Science Presentation.pptx by maryamkhalid2916
Black and White Modern Science Presentation.pptxBlack and White Modern Science Presentation.pptx
Black and White Modern Science Presentation.pptx
maryamkhalid291614 views
Igniting Next Level Productivity with AI-Infused Data Integration Workflows by Safe Software
Igniting Next Level Productivity with AI-Infused Data Integration Workflows Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Safe Software225 views
[2023] Putting the R! in R&D.pdf by Eleanor McHugh
[2023] Putting the R! in R&D.pdf[2023] Putting the R! in R&D.pdf
[2023] Putting the R! in R&D.pdf
Eleanor McHugh38 views
Emerging & Future Technology - How to Prepare for the Next 10 Years of Radica... by NUS-ISS
Emerging & Future Technology - How to Prepare for the Next 10 Years of Radica...Emerging & Future Technology - How to Prepare for the Next 10 Years of Radica...
Emerging & Future Technology - How to Prepare for the Next 10 Years of Radica...
NUS-ISS16 views
handbook for web 3 adoption.pdf by Liveplex
handbook for web 3 adoption.pdfhandbook for web 3 adoption.pdf
handbook for web 3 adoption.pdf
Liveplex19 views

Delivering Bioinformatics MapReduce Applications in the Cloud

  • 1. Delivering Bioinformatics MapReduce Applications in the Cloud Lukas Forer*, Tomislav Lipić**, Sebastian Schönherr*, Hansi Weißensteiner*, Davor Davidović**, Florian Kronenberg* and Enis Afgan** * Division of Genetic Epidemiology, Medical University of Innsbruck, Austria ** Center for Informatics and Computing, Ruđer Bošković Institute, Zagreb, Croatia
  • 2. Page 2 21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Introduction • Cooperation between – Division of Genetic Epidemiology, Innsbruck, Austria – Ruđer Bošković Institute (RBI), Zagreb, Croatia • Project • Aim: Developing a Bioinformatics Platform for the Cloud
  • 3. Page 3 21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Motivation: Bioinformatics • More and more data is produced • Next-Generation Sequencing (NGS) – Allows us to sequence the complete human genome (3.2 billion positions) – Decreasing Costs  Increasing Data Production Bottleneck is no longer the data production in the laboratory, but the analysis!
  • 4. Page 4 21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Motivation: Next Generation Sequencing • NGS Data Characteristics – Data size in terabyte scale – Independent text data rows – Batch processing required for analysis files • MapReduce – One possible candidate for batch processing – scalable approach to manage and process large data efficiently “High potential for MapReduce in Genomics”
  • 5. Page 5 21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr MapReduce in Bioinformatics Hadoop MapReduce libraries for Bioinformatics Hadoop BAM Manipulation of aligned next-generation sequencing data (supports BAM, SAM, FASTQ, FASTA, QSEQ, BCF, and VCF) SeqPig Processing NGS data with Apache Pig; Presenting UDFs for frequent tasks; using Hadoop- BAM BioPig Processing NGS data with Apache Pig; Presenting UDFs Biodoop MapReduce suite for sequence alignments / manipulation of aligned records; written in Python DNA - Alignment algorithms based on Hadoop CloudBurst Based on RMAP (seed-and-extend algorithm) Map: Extracting k-mers of reference, non- overlapping k-mers of reads (as keys) Reduce: End-to-end alignments of seeds Seal Based on BWA (version 0.5.9) Map: Alignment using BWA (on a previously created internal file format) Reduce: Remove duplicates (optional) Crossbow Based on Bowtie / SOAPsnp Map: Executing Bowtie on chunks Reduce: SNP calling using SOAPsnp RNA - Analysis based on Hadoop MyRNA Pipeline for calculating differential gene expression in RNA; including Bowtie FX RNA-Seq analysis tool Eoulsan RNA-Seq analysis tool Non-Hadoop based Approaches GATK MapReduce-like framework including a rich set of tools for quality assurance, alignment and variant calling; not based on Hadoop MapReduce
  • 6. Page 6 21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Problem 1: Complex Analysis Pipelines • Bioinformatics MapReduce Applications – available only on a per-tool basis – cover one aspect of a larger data analysis pipeline – Hard to use for scientists without background in Computer Science Needed: System which enables building MapReduce workflows
  • 7. Page 7 21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Cloudgene: Overview • Cloudgene – MapReduce based Workflow System – assists scientists in executing and monitoring workflows graphically – data management (import/export) – workflow parameter track→ reproducibility • Supports Apache’s Big-Data stack S. Schönherr, L. Forer, H. Weißensteiner, F. Kronenberg, G. Specht, and A. Kloss-Brandstätter. Cloudgene: a graphical execution platform for MapReduce programs on private and public clouds. BMC Bioinformatics, 13(1):200, Jan. 2012.
  • 8. Page 8 21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Cloudgene: Workflow Composition • Reuse existing Apache Hadoop programs • No adaptations in source code needed • All meta data about tasks and workflows are defined in one single file – the manifest file MapReduce program Manifest File for Cloudgene
  • 9. Page 9 21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Cloudgene: Workflow Composition • Manifest file defines the workflow
  • 10. Page 10 21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Cloudgene: Interface Used Parameters Download links to result files
  • 11. Page 11 21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Problem 2: Missing Infrastructure • Cloudgene requires a functional compatible cluster – Small/Medium sized research institutes can hardly afford own clusters • Possible Solution: Cloud Computing – Rent computer hardware from different providers (e.g. AWS, HP) – Use resources and services on demand Needed: System which enables delivering MapReduce clusters in the cloud
  • 12. Page 12 21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr CloudMan: Overview • Enables launching/managing a analysis platform on a cloud infrastructure via a web browser – Delivers a scalable cluster-in-the-cloud • Preconfigure a number of applications – Workflow System: Galaxy – Batch scheduler: Sun Grid Engine (SGE) – MapReduce using Hadoop and SGE integration Afgan E, Chapman B, Taylor J. CloudMan as a platform for tool, data, and analysis distribution. BMC Bioinformatics 2012;
  • 13. Page 13 21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Configure the Cluster
  • 14. Page 14 21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Manage the Cluster
  • 15. Page 15 21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Use the Cluster
  • 16. Page 16 21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr CloudMan-as-a-Platform • CloudMan implements a service dependency management framework • Enables creation of user-specific cloud platforms Idea: Implementing Cloudgene as a CloudMan service
  • 17. Page 17 21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Cloudgene meets CloudMan Cloudgene A Bioinformatics MapReduce workflow framework CloudMan A cloud manager for delivering application execution environments CloudMan platform OpenNebulaOpenNebula AWS OpenStack Eucalyptus SGE Hadoop HTCondor ... GalaxyCloudgene MapReduce tools Traditional tools Batch Cluster jobs
  • 18. Page 18 21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr New Opportunities • One single data analysis cloud platform – MapReduce and SGE integrated – Additionally, more big data computational models can be supported in future CloudMan CloudMan
  • 19. Page 19 21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Conclusion • MapReduce – simple but effective programming model – well suited for the Cloud – still limited to a small number of highly qualified domain experts • Cloudgene – as a possible MapReduce Workflow Manager • CloudMan – as a possible Cloud Manager • Implementing Cloudgene as a CloudMan service – Enables domain experts to incorporate and make use of additional big data models
  • 20. Page 20 21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Acknowledge • CloudMan – Enis Afgan – Tomislav Lipić – Davor Davidović • Cloudgene – Lukas Forer – Sebastian Schönherr – Hansi Weißensteiner – Florian Kronenberg

Editor's Notes

  1. Welcome to my presentation. I am happy to be here and to have the opportunity to talk about our approach which enables to deliver bioinf mapred apps in the cloud I am happy to be here and to have the opportunity to talk about our cooperation project, which is called „Delivering Bioinformatics MapReduce Applications in the Cloud”
  2. The solution which i present is a cooperation between our intstitute at the med uni innsbruck and the university of zagreb. The aim of the project is to develop a bioinformatics platform for the cloud feasibility study
  3. I start my talk with a short introduction about the data we use. The data is produced by a technology called NGS which enables to sequence the whole genome. This is done in a massively parallel way and enables to sequnce the genome in adequate time and with adequate coses. This has the consequence that more and more data will produce. So the bottleneck is no longer the data production in the lab, but ist analysis.
  4. the sizife of such data from NGS devices are terabytes and they are independet semi-structed data rows. And to analyze this data we need batch processing. One possible candidate to anaylze this large data efficiently is mapreduce. So there is a high potential for mapreduce in genomics.
  5. This table shows that there exist already several Mapreduce apps in the field of mapreduce. For example for mapping shot reads to reference ot for rna analysis.
  6. But the problem of such pproaches is that thei are available only on a per-tool basis To analyze data we need large pipeline and the avaialble tools cover only one apsect. Moreover, for biologists without background in cs it is very hard to use them Most ov popular workflow systems such as galaxy enable this abstraction only for tradional tools and not for mappreduce. So a system which enable building such mapreduce workflows is missing.
  7. To overcome this issue we developed cloudgene, a mpreduce based workflow system Which assts scients in executing and monitoring worklfows Moreover, we provide a simple data management in order to import and exporrt datasets. Finally, all parameters are tracked which improves the reproducibility of experiments. At the moment we support the whole apache big data stack, that menans mapred, pig and hdfs
  8. To build workflow existing hadoop programs can be reused and no adaption in source code is needed All meta data are defined in a single manifest file For example, here we have an existing tool called cloudburst, and here we have the correspong manifest file which enables to integrate it into cloudgene
  9. Based on this manifest file, we create automatically a userinterface which can be used to submit the job with different parameters and datasets.
  10. When the job is complete, we can download the results files directly through the browser and all used parameters are tracked.
  11. A workflow manager like cloudgene requires a compatible cluster Especially for smale research institutes it can be hard to afford and mantain their own cluster So possible solution is cloud computing which enables to rent compoter hardware from different providers for example amazon. So they can use the rented resources on demand. But the problem here is a missing system which enables devlivering mapreduce clusters in the cloud
  12. To overcome this issues, our cooperation partners from croatia developed cloudman which solves this problem It enables to launch and manage an analysis paltofrm on a cloud infratrsture. So through the browser a scalable cluster-in-the-cloud can be delivered. Moreover, cloudman preconfigures this new cluster with a lot of applications For example galaxy, a batach scheduler, mapreduce or HTCondor.
  13. So in a first step, the user configures the cluster through a simple web wizard. For example the type of the cluster can be selected or the size of the data volume
  14. In a second step, the users can manage it cluster. That means she or he can add new nodes, remove nodes or finally terminates the clusters. Moreover, cloudman provides a very nice feature which enables an automatically scaling.
  15. After that, cloudman installs the required application on the the new cluster. For example, in this case galaxy was automatically installed and can be used by the user to execute non-hadoop workflows.
  16. Cloudman implements a service dependency management framework which spepates different modules. That means, the infrastruce layer is separted by the application execution environement and from the applications and the datasets. This features enable to switch this modules with alternative service and implementation in order to create a user pecific cloud platform. So we can implement cloudgene as a cloudman service in order to provide a mpareduce bioinformatics platform in the cloud.
  17. So the overall picture is to use cloudman as the cloudmanager to create a cluster with the need envirnoment. An then we use cloudgene on top of this cluster to execute and manage mapreduce workflows. This results in a new platform which provides several tools for bioinformatic usecases. So we have the user which uses cloudgene or galaxy on top of the clusters create by cloudman. Moreover, through a cloudabstraction this platform can be instantiated on different coud providers.
  18. So we have one single data analysis platform which integrates MapReduce and SGE. At the moment cloudgene uses only mapreduce. But hadoop 2 introduced a new layer bettwen mapreduce and hadoop. This layer is called yarn and is jobmanager which enables to integrate other parallelizing models like sparkl or giraph. This enables as to use other big data computational modesl Moreover, the platform integratesalready several bioformatics workflows.
  19. To conclude my presentation, we saw that mapred is a simple but effective programming mode. It is wellsuited for the cloud but still mimited to a small number of domain experts with knowledge in computer science. We presented cloudgene as a posssible mapred workflowmanager and cloudman as a posssible cloud manager. Moreover, the implementation of cloudgene as a cloudman service enable scientists to make use of several workflows based on big data models.
  20. Cloudman was developed by enis, tamislav and davor from croatia. And cloudgene was developed by sebm, hansi, florian and me. Now i am at the end of my presentation so thank you very much for your attention.