This document discusses BigBWA, a tool that uses Hadoop's MapReduce framework to parallelize the Burrows-Wheeler Aligner (BWA) and improve its performance for large genomic datasets. BigBWA divides the read alignment process into map and reduce phases to distribute the work across multiple nodes. It provides fault tolerance and is compatible with different versions of BWA without requiring modifications. Evaluation on genome datasets ranging from 3.9GB to 54.7GB showed significant reductions in execution time compared to the serial BWA implementation. BigBWA represents the first approach to parallelize BWA-MEM, an important long read alignment algorithm, using big data technologies.
1. BigBWA: approaching the Burrows–Wheeler aligner to Big Data technologies
Dongseo University
Division of Computer & Information Engineering
Machine Learning Research Lab
Presented by:
Ahmed A. Absi
Bioinformatics Advance Access published September 5, 2015
3. Evolving scientific instruments and the rapid sophistication of
computing systems have resulted in large-scale scientific
simulations and data analysis workflows.
As more and more scientific data is generated, our ability to
effectively manage and process such data also needs to evolve.
Genomics has become heavily dependent on sequence alignment
tools, which are computationally intensive.
Introduction
4. Introduction
Retrieved on 22nd Nov, 2015 from http://epilepsygenetics.net/2014/06/27/when-will-we-have-the-1000-epilepsy-genome/
5. • Widely used similarity search tool
• Heuristic seed-and-extend approach
• Uses “look-up” tables to shorten search time
• Performs both Global and Local alignment
• One of the fastest and most frequently used sequence alignment tools
Burrows–Wheeler Aligner (BWA)
6. Uses the Burrows–Wheeler Transform to “index” the human genome, allowing
memory-efficient and fast string matching between a sequence read and
the reference genome.
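To make the indexing idea concrete, here is a minimal sketch of a Burrows–Wheeler index with exact-match backward search. It is an illustration only, not BWA's implementation: the function names `bwt_index` and `find_exact` are hypothetical, the suffix array is built by naive sorting (workable only for tiny strings), and the occurrence table is stored uncompressed.

```python
def bwt_index(text):
    """Build a toy index: suffix array, C table and occurrence counts."""
    text += "$"  # sentinel; sorts before A, C, G, T
    # Naive suffix array: fine for a demo, far too slow for a genome.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT = character preceding each sorted suffix (wraps to '$' for i == 0).
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # c[ch] = number of characters in text strictly smaller than ch.
    c, total = {}, 0
    for ch in alphabet:
        c[ch] = total
        total += text.count(ch)
    # occ[ch][i] = number of occurrences of ch in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, b in enumerate(bwt):
        for ch in alphabet:
            occ[ch][i + 1] = occ[ch][i] + (1 if b == ch else 0)
    return sa, c, occ


def find_exact(pattern, sa, c, occ):
    """Return sorted start positions of exact matches of pattern."""
    lo, hi = 0, len(sa)  # half-open interval of matching suffixes
    for ch in reversed(pattern):  # backward search: last character first
        if ch not in c:
            return []
        lo = c[ch] + occ[ch][lo]
        hi = c[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


sa, c, occ = bwt_index("GATTACAGATTACA")
print(find_exact("ATTA", sa, c, occ))
```

Each character of the read narrows the suffix-array interval in constant time, which is why the search cost depends on the read length rather than on the size of the reference.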
BWA: Short-read algorithm; alters the read sequence so that it matches
the reference exactly.
BWA-SW: Long-read algorithm; samples reference subsequences and
performs Smith-Waterman alignment between the subsequences and the
read.
BWA-MEM: - Similar features to BWA-SW
- Long-read alignment
- Seed and extend with SW
- Finds larger gaps
- Faster! Generally supersedes BWA-SW
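The Smith-Waterman (SW) step mentioned above can be sketched as a small dynamic program. This is a simplified illustration, not the code used by BWA-SW or BWA-MEM: the function name `smith_waterman` and the scoring values are assumptions, and a linear gap penalty is used for brevity where production aligners typically use more elaborate gap models.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-2):
    """Return (best local alignment score, (end_i, end_j)) for a and b.

    Scoring values are illustrative defaults, not BWA's parameters.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # h[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    h = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at 0 lets an alignment restart anywhere (local alignment).
            h[i][j] = max(0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    return best, best_pos


score, end = smith_waterman("TTACGTT", "ACG")
print(score, end)
```

In a seed-and-extend aligner, a computation of this kind is only run around seed hits found through the index, which keeps the quadratic cost confined to small regions of the reference.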
Burrows–Wheeler Aligner (BWA) S/W Package
7. Motivation
The amount of sequence data is growing rapidly. Such rapid
growth will create an obstacle for next-generation
sequence processing.
Sequence alignment is a very time-consuming process. This
problem becomes even more noticeable as millions and billions
of reads need to be aligned.
Therefore, NGS professionals demand scalable solutions to
boost the performance of the aligners in order to obtain the
results in reasonable time.
8. Proposed Approach: BigBWA
BigBWA is a new tool that takes advantage of Hadoop as a Big Data
technology to increase the performance of BWA. The main advantages of
our tool are the following:
The alignment process is performed in parallel, which reduces
execution times.
BigBWA is fault tolerant, exploiting the fault tolerance capabilities of
the underlying Big Data technology on which it is based.
No modifications to BWA are required to use BigBWA. As a
consequence, any release of BWA (future or legacy) will be
compatible with BigBWA.
9. Proposed Approach: BigBWA
BigBWA divides the computation into Map and Reduce phases.
In the Map phase, BigBWA splits the reads into subsets, mapping
each subset to a mapper process. Each mapper is responsible for
applying the considered BWA algorithm using as input the reads
assigned by BigBWA.
If any of the mappers fails, BigBWA automatically launches
another identical mapper process to replace the faulty one.
In the Reduce phase, the output files produced by the mappers are
merged into a single result.
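The Map and Reduce steps described above can be simulated locally in a few lines. This is a sketch under stated assumptions, not BigBWA's code: it uses neither Hadoop nor BWA, the helper names (`split_fastq`, `run_mapper`, `reduce_outputs`) are hypothetical, and `toy_align` is a stand-in that only reports the read name and length where a real mapper would invoke the aligner.

```python
def split_fastq(lines, n_chunks):
    """Group FASTQ lines into 4-line records and split them into subsets."""
    records = [lines[i:i + 4] for i in range(0, len(lines), 4)]
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def run_mapper(chunk, align, max_attempts=2):
    """Apply align to every record; rerun the whole chunk if it fails."""
    for attempt in range(1, max_attempts + 1):
        try:
            return [align(record) for record in chunk]
        except RuntimeError:
            # Mimics launching an identical replacement mapper.
            if attempt == max_attempts:
                raise


def reduce_outputs(partials):
    """Merge per-mapper outputs into one list, preserving chunk order."""
    return [line for part in partials for line in part]


def toy_align(record):
    """Hypothetical stand-in for an aligner call: read name and length."""
    name, seq = record[0].lstrip("@"), record[1]
    return f"{name}\t{len(seq)}"


# Five synthetic reads; each FASTQ record is name, sequence, '+', qualities.
lines = []
for k in range(5):
    lines += [f"@r{k}", "ACGT" * (k + 1), "+", "I" * 4 * (k + 1)]

chunks = split_fastq(lines, 2)
merged = reduce_outputs(run_mapper(chunk, toy_align) for chunk in chunks)
print("\n".join(merged))
```

Because each subset of reads is aligned independently of the others, a failed mapper can be rerun on its own subset without affecting the rest, which is what makes the retry in `run_mapper` safe.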
10. SEAL (Pireddu et al., 2011): uses Pydoop, a Python implementation of the
MapReduce programming model that runs on top of Hadoop. It allows
users to write their programs in Python, calling BWA methods.
pBWA (Peters et al., 2012): uses a standard parallel programming
paradigm to parallelize BWA. pBWA lacks fault-tolerance mechanisms.
The most important differences between these tools and BigBWA are:
SEAL and pBWA only work with a particular modified version of BWA, whereas BigBWA
works directly with the original BWA implementation, keeping compatibility with future
and legacy BWA versions.
Both SEAL and pBWA are based on an earlier BWA version, which does not include the new
BWA-MEM algorithm. Therefore, to the best of our knowledge, BigBWA is the first tool to handle
the parallelization of the BWA-MEM algorithm using Big Data technologies.
BigBWA Similar Approaches
15. Conclusion
This paper introduces up-to-date long-read sequence
alignment algorithms in bioinformatics.
BigBWA is a new tool that uses the Big Data technology
Hadoop to boost the performance of the Burrows–Wheeler
aligner (BWA).
Important reductions in the execution times were observed
when using this tool. In addition, BigBWA is fault tolerant
and it does not require any modification of the original BWA
source code.