BigBWA: approaching the Burrows–Wheeler
aligner to Big Data technologies
Dongseo University
Division of Computer & Information Engineering
Machine Learning Research Lab
Presented by:
Ahmed A. Absi
Bioinformatics Advance Access published September 5, 2015
Outline
• Introduction
• Motivation
• Proposed work
• Performance Results
• Conclusion
• My opinion
• Current Progress
Introduction
• Evolving scientific instruments and the rapid sophistication of
computing systems have resulted in large-scale scientific
simulations and data analysis workflows.
• As more and more scientific data is generated, our ability to
effectively manage and process such data also needs to evolve.
• Genomics has become heavily dependent on sequence alignment
tools, which are computationally intensive.
Retrieved on 22nd Nov, 2015 from http://epilepsygenetics.net/2014/06/27/when-will-we-have-the-1000-epilepsy-genome/
Burrows–Wheeler Aligner (BWA)
• Widely used similarity search tool
• Heuristic seed-and-extend approach
• Uses “look-up” tables to shorten search time
• Performs both global and local alignment
• Among the fastest and most frequently used sequence alignment tools
Burrows–Wheeler Aligner (BWA) S/W Package
• Uses the Burrows–Wheeler Transform to “index” the human genome, allowing
memory-efficient and fast string matching between sequence reads and the
reference genome.
• BWA: short-read algorithm; alters the read sequence so that it matches
the reference exactly.
• BWA-SW: long-read algorithm; samples reference subsequences and
performs Smith-Waterman alignment between the subsequences and the
read.
• BWA-MEM:
- Similar features to BWA-SW
- Long-read alignment
- Seed and extend with SW
- Finds larger gaps
- Faster; generally supersedes BWA-SW
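To illustrate how a Burrows–Wheeler Transform index supports fast string matching between a read and a reference, the following is a minimal sketch of exact matching by backward search. It is a toy under stated assumptions, not BWA's implementation: the function names `bwt_index` and `exact_match` are hypothetical, the suffix array is built by naive sorting, and the occurrence table is stored uncompressed, whereas a production aligner would use a compressed, sampled index and would also tolerate mismatches and gaps.

```python
def bwt_index(text):
    """Build a suffix array, the C table, and occurrence counts for text.

    Hypothetical helper for illustration; naive construction, small inputs only.
    """
    text += "$"  # sentinel, sorts before A, C, G, T
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each suffix in sorted order
    bwt = "".join(text[i - 1] for i in suffix_array)
    alphabet = sorted(set(text))
    # C[c] = number of characters in text strictly smaller than c
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return suffix_array, C, occ


def exact_match(pattern, suffix_array, C, occ):
    """Return sorted start positions of pattern, found by backward search."""
    lo, hi = 0, len(suffix_array)  # half-open interval of matching suffixes
    for c in reversed(pattern):
        if c not in C:
            return []
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[i] for i in range(lo, hi))


reference = "ACGTACGTTAGC"
index = bwt_index(reference)
print(exact_match("ACG", *index))
```

Each character of the read narrows the interval of matching suffixes with two table lookups, so the search cost depends on the read length rather than on the reference length, which is why indexing the reference once pays off when many reads are aligned.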
Motivation
• The amount of sequence data is growing rapidly. Such rapid
growth creates an obstacle for next-generation sequence
processing.
• Sequence alignment is a very time-consuming process. This
problem becomes even more noticeable as millions or billions
of reads need to be aligned.
• Therefore, NGS professionals demand scalable solutions that
boost the performance of aligners in order to obtain
results in reasonable time.
Proposed Approach: BigBWA
• BigBWA is a new tool that takes advantage of Hadoop as a Big Data
technology to increase the performance of BWA. The main advantages of
the tool are the following:
- The alignment process is performed in parallel, which reduces
execution times.
- BigBWA is fault tolerant, exploiting the fault tolerance capabilities of
the underlying Big Data technology on which it is based.
- No modifications to BWA are required to use BigBWA. As a
consequence, any release of BWA (future or legacy) will be
compatible with BigBWA.
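The three advantages listed above (parallel alignment, relaunching failed work, and leaving the aligner untouched) can be sketched on a single machine as follows. This is only an illustration under stated assumptions and not BigBWA's code: a thread pool stands in for Hadoop, `toy_aligner` stands in for a call to an unmodified BWA on one subset of reads, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_reads(reads, n_chunks):
    """Split the reads into at most n_chunks contiguous subsets."""
    size = max(1, -(-len(reads) // n_chunks))  # ceiling division
    return [reads[i:i + size] for i in range(0, len(reads), size)]


def run_with_retry(align_chunk, chunk, max_attempts=3):
    """Re-run a failed mapper, loosely mimicking relaunch of a failed task."""
    for attempt in range(1, max_attempts + 1):
        try:
            return align_chunk(chunk)
        except Exception:
            if attempt == max_attempts:
                raise


def parallel_align(reads, align_chunk, n_mappers=4):
    """Map: align each subset independently. Reduce: merge the outputs."""
    chunks = split_reads(reads, n_mappers)
    with ThreadPoolExecutor(max_workers=n_mappers) as pool:
        # pool.map returns results in the same order as the input chunks
        partial = list(pool.map(lambda c: run_with_retry(align_chunk, c), chunks))
    # Reduce step: concatenate per-chunk outputs into one result
    return [record for part in partial for record in part]


def toy_aligner(chunk):
    """Hypothetical stand-in for running an unmodified aligner on one subset."""
    return [f"{name}\taligned" for name in chunk]


reads = [f"read{i}" for i in range(10)]
for line in parallel_align(reads, toy_aligner, n_mappers=3):
    print(line)
```

Because the aligner is passed in as an opaque function, the surrounding split, retry, and merge logic does not depend on which aligner version is used, which mirrors the compatibility argument made on the slide.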
Proposed Approach: BigBWA
• BigBWA divides the computation into Map and Reduce phases.
• In the Map phase, BigBWA splits the reads into subsets and assigns
each subset to a mapper process. Each mapper applies the chosen
BWA algorithm to the reads assigned to it by BigBWA.
• If any of the mappers fails, BigBWA automatically launches
another identical mapper process to replace the faulty one.
• In the Reduce phase, the output files produced by the mappers are
merged into a single result.
BigBWA Similar Approaches
• SEAL (Pireddu et al., 2011): uses Pydoop, a Python implementation of the
MapReduce programming model that runs on top of Hadoop. It allows
users to write their programs in Python, calling BWA methods.
• pBWA (Peters et al., 2012): uses a standard parallel programming
paradigm to parallelize BWA. pBWA lacks fault-tolerance mechanisms.
• The most important differences between these tools and BigBWA are:
- SEAL and pBWA only work with a particular modified version of BWA, whereas BigBWA
works directly with the original BWA implementation, keeping compatibility with future
and legacy BWA versions.
- Both SEAL and pBWA are based on a BWA version that does not include the new
BWA-MEM algorithm. Therefore, according to the authors, BigBWA is the first tool to handle
the parallelization of the BWA-MEM algorithm using Big Data technologies.
Evaluation Performance
• Experimental configuration
- 5 nodes on an Amazon AWS cluster, 16 cores per node, Intel Xeon CPUs at 2.5 GHz
- 488 GB RAM
- r3.4xlarge instance type
- Hadoop version 2.6.0
- 1000 Genomes Project datasets: 3.9, 13.4, and 54.7 GB
Evaluation Performance: Datasets
• Experimental configuration
Evaluation Results
• Comparison of the performance for the BWA algorithm
Evaluation Results
• Comparison of the performance for the BWA-MEM algorithm
Conclusion
• This paper introduces up-to-date long-read sequence
alignment algorithms in bioinformatics.
• BigBWA is a new tool that uses the Big Data technology
Hadoop to boost the performance of the Burrows–Wheeler
aligner (BWA).
• Significant reductions in execution time were observed
when using this tool. In addition, BigBWA is fault tolerant
and does not require any modification of the original BWA
source code.
My opinion
Q & A
Thank You!

More Related Content

What's hot

A Survey on Resource Allocation & Monitoring in Cloud Computing
A Survey on Resource Allocation & Monitoring in Cloud ComputingA Survey on Resource Allocation & Monitoring in Cloud Computing
A Survey on Resource Allocation & Monitoring in Cloud ComputingMohd Hairey
 
Improving resource utilisation in the cloud environment using multivariate pr...
Improving resource utilisation in the cloud environment using multivariate pr...Improving resource utilisation in the cloud environment using multivariate pr...
Improving resource utilisation in the cloud environment using multivariate pr...Shrabanee Swagatika
 
Task scheduling Survey in Cloud Computing
Task scheduling Survey in Cloud ComputingTask scheduling Survey in Cloud Computing
Task scheduling Survey in Cloud ComputingRamandeep Kaur
 
Task Scheduling methodology in cloud computing
Task Scheduling methodology in cloud computing Task Scheduling methodology in cloud computing
Task Scheduling methodology in cloud computing Qutub-ud- Din
 
Genetic Algorithm for task scheduling in Cloud Computing Environment
Genetic Algorithm for task scheduling in Cloud Computing EnvironmentGenetic Algorithm for task scheduling in Cloud Computing Environment
Genetic Algorithm for task scheduling in Cloud Computing EnvironmentSwapnil Shahade
 
Distributed in memory processing of all k nearest neighbor queries
Distributed in memory processing of all k nearest neighbor queriesDistributed in memory processing of all k nearest neighbor queries
Distributed in memory processing of all k nearest neighbor queriesieeepondy
 
Optimization of energy consumption in cloud computing datacenters
Optimization of energy consumption in cloud computing datacenters Optimization of energy consumption in cloud computing datacenters
Optimization of energy consumption in cloud computing datacenters IJECEIAES
 
task scheduling in cloud datacentre using genetic algorithm
task scheduling in cloud datacentre using genetic algorithmtask scheduling in cloud datacentre using genetic algorithm
task scheduling in cloud datacentre using genetic algorithmSwathi Rampur
 
A Review on Scheduling in Cloud Computing
A Review on Scheduling in Cloud ComputingA Review on Scheduling in Cloud Computing
A Review on Scheduling in Cloud Computingijujournal
 
Large-Scaled Insurance Analytics Using Tweedie Models in Apache Spark with Ya...
Large-Scaled Insurance Analytics Using Tweedie Models in Apache Spark with Ya...Large-Scaled Insurance Analytics Using Tweedie Models in Apache Spark with Ya...
Large-Scaled Insurance Analytics Using Tweedie Models in Apache Spark with Ya...Databricks
 
An optimized scientific workflow scheduling in cloud computing
An optimized scientific workflow scheduling in cloud computingAn optimized scientific workflow scheduling in cloud computing
An optimized scientific workflow scheduling in cloud computingDIGVIJAY SHINDE
 
MACHINE LEARNING ON MAPREDUCE FRAMEWORK
MACHINE LEARNING ON MAPREDUCE FRAMEWORKMACHINE LEARNING ON MAPREDUCE FRAMEWORK
MACHINE LEARNING ON MAPREDUCE FRAMEWORKAbhi Jit
 
Fast Range Aggregate Queries for Big Data Analysis
Fast Range Aggregate Queries for Big Data AnalysisFast Range Aggregate Queries for Big Data Analysis
Fast Range Aggregate Queries for Big Data AnalysisIRJET Journal
 
Implementing Workload Postponing In Cloudsim to Maximize Renewable Energy Uti...
Implementing Workload Postponing In Cloudsim to Maximize Renewable Energy Uti...Implementing Workload Postponing In Cloudsim to Maximize Renewable Energy Uti...
Implementing Workload Postponing In Cloudsim to Maximize Renewable Energy Uti...IJERA Editor
 
On Traffic-Aware Partition and Aggregation in Map Reduce for Big Data Applica...
On Traffic-Aware Partition and Aggregation in Map Reduce for Big Data Applica...On Traffic-Aware Partition and Aggregation in Map Reduce for Big Data Applica...
On Traffic-Aware Partition and Aggregation in Map Reduce for Big Data Applica...dbpublications
 
Qo s aware scientific application scheduling algorithm in cloud environment
Qo s aware scientific application scheduling algorithm in cloud environmentQo s aware scientific application scheduling algorithm in cloud environment
Qo s aware scientific application scheduling algorithm in cloud environmentAlexander Decker
 
Task Scheduling in Grid Computing.
Task Scheduling in Grid Computing.Task Scheduling in Grid Computing.
Task Scheduling in Grid Computing.Ramandeep Kaur
 

What's hot (20)

A Survey on Resource Allocation & Monitoring in Cloud Computing
A Survey on Resource Allocation & Monitoring in Cloud ComputingA Survey on Resource Allocation & Monitoring in Cloud Computing
A Survey on Resource Allocation & Monitoring in Cloud Computing
 
Improving resource utilisation in the cloud environment using multivariate pr...
Improving resource utilisation in the cloud environment using multivariate pr...Improving resource utilisation in the cloud environment using multivariate pr...
Improving resource utilisation in the cloud environment using multivariate pr...
 
Task scheduling Survey in Cloud Computing
Task scheduling Survey in Cloud ComputingTask scheduling Survey in Cloud Computing
Task scheduling Survey in Cloud Computing
 
Task Scheduling methodology in cloud computing
Task Scheduling methodology in cloud computing Task Scheduling methodology in cloud computing
Task Scheduling methodology in cloud computing
 
Genetic Algorithm for task scheduling in Cloud Computing Environment
Genetic Algorithm for task scheduling in Cloud Computing EnvironmentGenetic Algorithm for task scheduling in Cloud Computing Environment
Genetic Algorithm for task scheduling in Cloud Computing Environment
 
Distributed in memory processing of all k nearest neighbor queries
Distributed in memory processing of all k nearest neighbor queriesDistributed in memory processing of all k nearest neighbor queries
Distributed in memory processing of all k nearest neighbor queries
 
Shubhankar pawade resume
Shubhankar pawade resumeShubhankar pawade resume
Shubhankar pawade resume
 
Optimization of energy consumption in cloud computing datacenters
Optimization of energy consumption in cloud computing datacenters Optimization of energy consumption in cloud computing datacenters
Optimization of energy consumption in cloud computing datacenters
 
task scheduling in cloud datacentre using genetic algorithm
task scheduling in cloud datacentre using genetic algorithmtask scheduling in cloud datacentre using genetic algorithm
task scheduling in cloud datacentre using genetic algorithm
 
A Review on Scheduling in Cloud Computing
A Review on Scheduling in Cloud ComputingA Review on Scheduling in Cloud Computing
A Review on Scheduling in Cloud Computing
 
Large-Scaled Insurance Analytics Using Tweedie Models in Apache Spark with Ya...
Large-Scaled Insurance Analytics Using Tweedie Models in Apache Spark with Ya...Large-Scaled Insurance Analytics Using Tweedie Models in Apache Spark with Ya...
Large-Scaled Insurance Analytics Using Tweedie Models in Apache Spark with Ya...
 
An optimized scientific workflow scheduling in cloud computing
An optimized scientific workflow scheduling in cloud computingAn optimized scientific workflow scheduling in cloud computing
An optimized scientific workflow scheduling in cloud computing
 
MACHINE LEARNING ON MAPREDUCE FRAMEWORK
MACHINE LEARNING ON MAPREDUCE FRAMEWORKMACHINE LEARNING ON MAPREDUCE FRAMEWORK
MACHINE LEARNING ON MAPREDUCE FRAMEWORK
 
Fast Range Aggregate Queries for Big Data Analysis
Fast Range Aggregate Queries for Big Data AnalysisFast Range Aggregate Queries for Big Data Analysis
Fast Range Aggregate Queries for Big Data Analysis
 
Implementing Workload Postponing In Cloudsim to Maximize Renewable Energy Uti...
Implementing Workload Postponing In Cloudsim to Maximize Renewable Energy Uti...Implementing Workload Postponing In Cloudsim to Maximize Renewable Energy Uti...
Implementing Workload Postponing In Cloudsim to Maximize Renewable Energy Uti...
 
On Traffic-Aware Partition and Aggregation in Map Reduce for Big Data Applica...
On Traffic-Aware Partition and Aggregation in Map Reduce for Big Data Applica...On Traffic-Aware Partition and Aggregation in Map Reduce for Big Data Applica...
On Traffic-Aware Partition and Aggregation in Map Reduce for Big Data Applica...
 
Scheduling in cloud
Scheduling in cloudScheduling in cloud
Scheduling in cloud
 
Resisting skew accumulation
Resisting skew accumulationResisting skew accumulation
Resisting skew accumulation
 
Qo s aware scientific application scheduling algorithm in cloud environment
Qo s aware scientific application scheduling algorithm in cloud environmentQo s aware scientific application scheduling algorithm in cloud environment
Qo s aware scientific application scheduling algorithm in cloud environment
 
Task Scheduling in Grid Computing.
Task Scheduling in Grid Computing.Task Scheduling in Grid Computing.
Task Scheduling in Grid Computing.
 

Similar to Ahmed Absi slides bigbwa

Introduction to GCP Data Flow Presentation
Introduction to GCP Data Flow PresentationIntroduction to GCP Data Flow Presentation
Introduction to GCP Data Flow PresentationKnoldus Inc.
 
Introduction to GCP DataFlow Presentation
Introduction to GCP DataFlow PresentationIntroduction to GCP DataFlow Presentation
Introduction to GCP DataFlow PresentationKnoldus Inc.
 
sudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJAsudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJANicolas Poggi
 
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce Framework
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce FrameworkBIGDATA- Survey on Scheduling Methods in Hadoop MapReduce Framework
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce FrameworkMahantesh Angadi
 
Analyzing & Visualizing Cloud Data With Power BI
Analyzing & Visualizing Cloud Data With Power BIAnalyzing & Visualizing Cloud Data With Power BI
Analyzing & Visualizing Cloud Data With Power BIRichard Harbridge
 
Paralyzing Bioinformatics Applications Using Conducive Hadoop Cluster
Paralyzing Bioinformatics Applications Using Conducive Hadoop ClusterParalyzing Bioinformatics Applications Using Conducive Hadoop Cluster
Paralyzing Bioinformatics Applications Using Conducive Hadoop ClusterIOSR Journals
 
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduceBIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduceMahantesh Angadi
 
Benchmarking Hadoop and Big Data
Benchmarking Hadoop and Big DataBenchmarking Hadoop and Big Data
Benchmarking Hadoop and Big DataNicolas Poggi
 
Implementing load balancing algorithm in middleware system of volunteer cloud...
Implementing load balancing algorithm in middleware system of volunteer cloud...Implementing load balancing algorithm in middleware system of volunteer cloud...
Implementing load balancing algorithm in middleware system of volunteer cloud...Gargee Hiray
 
IRJET- Big Data Processes and Analysis using Hadoop Framework
IRJET- Big Data Processes and Analysis using Hadoop FrameworkIRJET- Big Data Processes and Analysis using Hadoop Framework
IRJET- Big Data Processes and Analysis using Hadoop FrameworkIRJET Journal
 
Big Data Benchmarking Tutorial
Big Data Benchmarking TutorialBig Data Benchmarking Tutorial
Big Data Benchmarking TutorialTilmann Rabl
 
GPU accelerated Large Scale Analytics
GPU accelerated Large Scale AnalyticsGPU accelerated Large Scale Analytics
GPU accelerated Large Scale AnalyticsSuleiman Shehu
 
DOC Power-Bi-Guidance.pdf
DOC Power-Bi-Guidance.pdfDOC Power-Bi-Guidance.pdf
DOC Power-Bi-Guidance.pdfssusere8fdd1
 
IEEE 2014 JAVA SERVICE COMPUTING PROJECTS Decentralized enactment of bpel pro...
IEEE 2014 JAVA SERVICE COMPUTING PROJECTS Decentralized enactment of bpel pro...IEEE 2014 JAVA SERVICE COMPUTING PROJECTS Decentralized enactment of bpel pro...
IEEE 2014 JAVA SERVICE COMPUTING PROJECTS Decentralized enactment of bpel pro...IEEEFINALYEARSTUDENTPROJECTS
 
2014 IEEE JAVA SERVICE COMPUTING PROJECT Decentralized enactment of bpel proc...
2014 IEEE JAVA SERVICE COMPUTING PROJECT Decentralized enactment of bpel proc...2014 IEEE JAVA SERVICE COMPUTING PROJECT Decentralized enactment of bpel proc...
2014 IEEE JAVA SERVICE COMPUTING PROJECT Decentralized enactment of bpel proc...IEEEFINALYEARSTUDENTPROJECT
 
2014 IEEE JAVA SERVICE COMPUTING PROJECT Decentralized enactment of bpel proc...
2014 IEEE JAVA SERVICE COMPUTING PROJECT Decentralized enactment of bpel proc...2014 IEEE JAVA SERVICE COMPUTING PROJECT Decentralized enactment of bpel proc...
2014 IEEE JAVA SERVICE COMPUTING PROJECT Decentralized enactment of bpel proc...IEEEBEBTECHSTUDENTSPROJECTS
 
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Sumeet Singh
 
Orca: A Modular Query Optimizer Architecture for Big Data
Orca: A Modular Query Optimizer Architecture for Big DataOrca: A Modular Query Optimizer Architecture for Big Data
Orca: A Modular Query Optimizer Architecture for Big DataEMC
 
Tips tricks to speed nw bi 2009
Tips tricks to speed  nw bi  2009Tips tricks to speed  nw bi  2009
Tips tricks to speed nw bi 2009HawaDia
 

Similar to Ahmed Absi slides bigbwa (20)

Introduction to GCP Data Flow Presentation
Introduction to GCP Data Flow PresentationIntroduction to GCP Data Flow Presentation
Introduction to GCP Data Flow Presentation
 
Introduction to GCP DataFlow Presentation
Introduction to GCP DataFlow PresentationIntroduction to GCP DataFlow Presentation
Introduction to GCP DataFlow Presentation
 
sudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJAsudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJA
 
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce Framework
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce FrameworkBIGDATA- Survey on Scheduling Methods in Hadoop MapReduce Framework
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce Framework
 
Analyzing & Visualizing Cloud Data With Power BI
Analyzing & Visualizing Cloud Data With Power BIAnalyzing & Visualizing Cloud Data With Power BI
Analyzing & Visualizing Cloud Data With Power BI
 
Paralyzing Bioinformatics Applications Using Conducive Hadoop Cluster
Paralyzing Bioinformatics Applications Using Conducive Hadoop ClusterParalyzing Bioinformatics Applications Using Conducive Hadoop Cluster
Paralyzing Bioinformatics Applications Using Conducive Hadoop Cluster
 
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduceBIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
 
Benchmarking Hadoop and Big Data
Benchmarking Hadoop and Big DataBenchmarking Hadoop and Big Data
Benchmarking Hadoop and Big Data
 
Implementing load balancing algorithm in middleware system of volunteer cloud...
Implementing load balancing algorithm in middleware system of volunteer cloud...Implementing load balancing algorithm in middleware system of volunteer cloud...
Implementing load balancing algorithm in middleware system of volunteer cloud...
 
IRJET- Big Data Processes and Analysis using Hadoop Framework
IRJET- Big Data Processes and Analysis using Hadoop FrameworkIRJET- Big Data Processes and Analysis using Hadoop Framework
IRJET- Big Data Processes and Analysis using Hadoop Framework
 
Recommendation engine
Recommendation engineRecommendation engine
Recommendation engine
 
Big Data Benchmarking Tutorial
Big Data Benchmarking TutorialBig Data Benchmarking Tutorial
Big Data Benchmarking Tutorial
 
GPU accelerated Large Scale Analytics
GPU accelerated Large Scale AnalyticsGPU accelerated Large Scale Analytics
GPU accelerated Large Scale Analytics
 
DOC Power-Bi-Guidance.pdf
DOC Power-Bi-Guidance.pdfDOC Power-Bi-Guidance.pdf
DOC Power-Bi-Guidance.pdf
 
IEEE 2014 JAVA SERVICE COMPUTING PROJECTS Decentralized enactment of bpel pro...
IEEE 2014 JAVA SERVICE COMPUTING PROJECTS Decentralized enactment of bpel pro...IEEE 2014 JAVA SERVICE COMPUTING PROJECTS Decentralized enactment of bpel pro...
IEEE 2014 JAVA SERVICE COMPUTING PROJECTS Decentralized enactment of bpel pro...
 
2014 IEEE JAVA SERVICE COMPUTING PROJECT Decentralized enactment of bpel proc...
2014 IEEE JAVA SERVICE COMPUTING PROJECT Decentralized enactment of bpel proc...2014 IEEE JAVA SERVICE COMPUTING PROJECT Decentralized enactment of bpel proc...
2014 IEEE JAVA SERVICE COMPUTING PROJECT Decentralized enactment of bpel proc...
 
2014 IEEE JAVA SERVICE COMPUTING PROJECT Decentralized enactment of bpel proc...
2014 IEEE JAVA SERVICE COMPUTING PROJECT Decentralized enactment of bpel proc...2014 IEEE JAVA SERVICE COMPUTING PROJECT Decentralized enactment of bpel proc...
2014 IEEE JAVA SERVICE COMPUTING PROJECT Decentralized enactment of bpel proc...
 
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
 
Orca: A Modular Query Optimizer Architecture for Big Data
Orca: A Modular Query Optimizer Architecture for Big DataOrca: A Modular Query Optimizer Architecture for Big Data
Orca: A Modular Query Optimizer Architecture for Big Data
 
Tips tricks to speed nw bi 2009
Tips tricks to speed  nw bi  2009Tips tricks to speed  nw bi  2009
Tips tricks to speed nw bi 2009
 

Recently uploaded

Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 

Recently uploaded (20)

Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...

Ahmed Absi slides bigbwa

  • 1. BigBWA: approaching the Burrows–Wheeler aligner to Big Data technologies Dongseo University Division of Computer & Information Engineering Machine Learning Research Lab Presented by: Ahmed A. Absi Bioinformatics Advance Access published September 5, 2015
  • 2. Outline • Introduction • Motivation • Proposed work • Performance Results • Conclusion • My opinion • Current Progress
  • 3. Introduction • Evolving scientific instruments and the rapid sophistication of computing systems have resulted in large-scale scientific simulations and data analysis workflows. • As more and more scientific data is generated, our ability to effectively manage and process such data also needs to evolve. • Genomics has become heavily dependent on sequence alignment tools, which are computationally intensive.
  • 4. Introduction Retrieved on 22nd Nov, 2015 from http://epilepsygenetics.net/2014/06/27/when-will-we-have-the-1000-epilepsy-genome/
  • 5. Burrows–Wheeler Aligner (BWA) • Widely used similarity search tool • Heuristic seed-and-extend approach • Uses “look-up” tables to shorten search time • Performs both global and local alignment • Among the fastest and most frequently used sequence alignment tools
  • 6. Burrows–Wheeler Aligner (BWA) software package • Uses the Burrows–Wheeler Transform to “index” the human genome, allowing memory-efficient and fast string matching between a sequence read and the reference genome. • BWA: short-read algorithm; alters the read sequence so that it matches the reference exactly. • BWA-SW: long-read algorithm; samples reference subsequences and performs Smith-Waterman (SW) alignment between the subsequences and the read. • BWA-MEM: similar features to BWA-SW; long-read alignment; seed and extend with SW; finds larger gaps; faster, and generally supersedes BWA-SW.
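The indexing idea on this slide can be illustrated with a toy sketch. The code below is not BWA's implementation: it builds the Burrows–Wheeler Transform by sorting all rotations (far too slow for a genome) and then counts exact matches of a read with backward search. The function names `bwt` and `count_matches` are illustrative assumptions, and real aligners use compressed occurrence tables rather than rescanning the string.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler Transform via sorted rotations (toy version)."""
    s = text + "$"  # sentinel; sorts before A, C, G, T and lowercase letters
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_matches(bwt_str: str, pattern: str) -> int:
    """Count exact occurrences of pattern in the original text
    using backward search over the transformed string."""
    # first_row[c] = index of the first sorted rotation starting with c
    first_row = {}
    for i, ch in enumerate(sorted(bwt_str)):
        first_row.setdefault(ch, i)

    def occ(ch: str, i: int) -> int:
        # occurrences of ch in bwt_str[:i]; a real index precomputes this
        return bwt_str.count(ch, 0, i)

    lo, hi = 0, len(bwt_str)  # half-open range of matching rotations
    for ch in reversed(pattern):
        if ch not in first_row:
            return 0
        lo = first_row[ch] + occ(ch, lo)
        hi = first_row[ch] + occ(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    index = bwt("banana")
    print(index)                        # annb$aa
    print(count_matches(index, "ana"))  # 2
```

The search touches the index once per pattern character, regardless of the reference length, which is the property that makes this family of indexes attractive for aligning many short reads against one large reference.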
  • 7. Motivation  The amount of sequence data is growing rapidly. Such rapid growth of sequence data will create obstacle for next-generation sequence processing.  Sequence alignment is a very time-consuming process. This problem becomes even more noticeable as millions and billions of reads need to be aligned.  Therefore, NGS professionals demand scalable solutions to boost the performance of the aligners in order to obtain the results in reasonable time.
  • 8. Proposed Approach: BigBWA  BigBWA, a new tool that takes advantage of Hadoop as Big Data technology to increase the performance of BWA. The main advantages of our tool are the following:  The alignment process is performed in parallel which reduces the execution times  BigBWA is fault tolerant, exploiting the fault tolerance capabilities of the underlying Big Data technology on which it is based.  No modifications to BWA are required to use BigBWA. As a consequence, any release of BWA (future or legacy) will be compatible with BigBWA.
  • 9. Proposed Approach: BigBWA  BigBWA divides the computation into Map and Reduce phases.  In the Map phase, BigBWA splits the reads into subsets, mapping each subset to a mapper process. Each mapper is responsible for applying the considered BWA algorithm using as input the reads assigned by BigBWA.  In case any of the mappers fails, BigBWA would automatically launch another identical mapper process to replace the faulty one.  In the reducer phase those files are merged into one unique solution.
  • 10. BigBWA Similar Approaches • SEAL (Pireddu et al., 2011): uses Pydoop, a Python implementation of the MapReduce programming model that runs on top of Hadoop. It allows users to write their programs in Python, calling BWA methods. • pBWA (Peters et al., 2012): uses a standard parallel programming paradigm to parallelize BWA, and lacks fault-tolerance mechanisms. • The most important differences between these tools and BigBWA are: • SEAL and pBWA only work with a particular modified version of BWA, whereas BigBWA works directly with the original BWA implementation, keeping compatibility with future and legacy BWA versions. • Both SEAL and pBWA are based on a BWA version that does not include the newer BWA-MEM algorithm. Therefore, to the best of the authors' knowledge, BigBWA is the first tool to handle the parallelization of the BWA-MEM algorithm using Big Data technologies.
  • 11. Evaluation Performance: Experimental Configuration • Amazon AWS cluster of 5 nodes (r3.4xlarge instance type), 16 Intel Xeon CPUs at 2.5 GHz • 488 GB RAM • Hadoop version 2.6.0 • 1000 Genomes Project datasets: 3.9, 13.4, and 54.7 GB
  • 12. Evaluation Performance: Datasets (experimental configuration)
  • 13. Evaluation Results: comparison of the performance for the BWA algorithm
  • 14. Evaluation Results: comparison of the performance for the BWA-MEM algorithm
  • 15. Conclusion  This paper introduce up-to-date long read sequence alignment algorithms in bioinformatics.  BigBWA is a new tool that uses the Big Data technology Hadoop to boost the performance of the Burrows–Wheeler aligner (BWA).  Important reductions in the execution times were observed when using this tool. In addition, BigBWA is fault tolerant and it does not require any modification of the original BWA source code.
  • 17. Q & A Thank You!