NGBT_poster_v0.4

Improved Needleman-Wunsch algorithm for NGS
alignment in Hadoop Data Clusters
Vineetha V and Achuthsankar S. Nair
Department of Computational Biology & Bioinformatics, University of Kerala,
Thiruvananthapuram, Kerala
Vineetha V
Research Scholar, Dept. of Computational Biology and Bioinformatics, University of Kerala
Technology Architect, Infosys Ltd
Email: vineevishnu@gmail.com
Phone: +919446175215
Contact References
• Comparing two or more sequences to locate a
series of identical characters or character
patterns.
• Also accounts for spaces or gaps and mismatches
(which correspond to mutations).
Types:
• Global Sequence Alignment
- Finds the best alignment
across the entire sequences.
• Local Sequence Alignment
- Finds regions of high
similarity in parts of the
sequences.
• Multiple Sequence Alignment (aligning more
than 2 sequences at a time) is one of the most
useful tools and helps in almost every application
of Bioinformatics:
- Phylogenetic Analysis
- Structure Prediction
- Sequence Similarity
Sequence Alignment
• Input sequence files are loaded into HDFS.
• The sequence file name is taken as the key and the
file content as the value for the Mapper function.
• The Mapper performs pairwise alignment of all possible
combinations.
• The query sequence file name together with the target file
name is taken as the key and the score as the value for the
Reducer function.
• The Reducer function combines the results and provides
the final result.
• A large number of NGS short reads can be aligned
without a second level of parallelization within each
pairwise alignment.
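The key/value flow described above can be imitated in plain Python without a Hadoop cluster. The sketch below is only an illustration under stated assumptions: it does not use the Hadoop API, the function names (`global_score`, `map_pairs`, `reduce_scores`) and the scoring values (match +1, mismatch -1, gap -1) are hypothetical, and the reducer simply keeps the best score per key because the poster does not say how results are combined.

```python
from collections import defaultdict
from itertools import combinations


def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score only, keeping two matrix rows in memory."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def map_pairs(files):
    """Map step (simulated): for every pair of input files, emit
    ((query_name, target_name), score)."""
    for (q_name, q_seq), (t_name, t_seq) in combinations(sorted(files.items()), 2):
        yield (q_name, t_name), global_score(q_seq, t_seq)


def reduce_scores(pairs):
    """Reduce step (simulated): group by key and keep the best score.
    Keeping the maximum is an assumption made for this sketch."""
    grouped = defaultdict(list)
    for key, score in pairs:
        grouped[key].append(score)
    return {key: max(scores) for key, scores in grouped.items()}


if __name__ == "__main__":
    # Hypothetical file names and toy sequences standing in for files in HDFS.
    files = {"a.fa": "ACGT", "b.fa": "AGT", "c.fa": "ACGT"}
    for key, score in sorted(reduce_scores(map_pairs(files)).items()):
        print(key, score)
```

In a real deployment each emitted pair would be handled by a separate map task; here the generator stands in for that distribution.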
Data Challenges
A parallelized implementation of the Needleman-Wunsch
algorithm using the Hadoop framework
Needleman-Wunsch Algorithm
• Create M*N matrix (M & N - Input sequence size)
• Fill up the matrix based on character similarity
• Trace back to find optimum alignment
Time & Space Complexity:
Pairwise Alignment: M*N
Multiple Sequence Alignment: M^N
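The three steps listed above (create the matrix, fill it, trace back) can be sketched in Python as follows. This is a minimal illustration rather than the poster's implementation: the function name `needleman_wunsch` and the scoring values (match +1, mismatch -1, linear gap penalty -1) are assumptions.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    m, n = len(a), len(b)

    def sub(i, j):
        # Substitution score for a[i-1] against b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Step 1: create the (m+1) x (n+1) matrix; row 0 and column 0
    # hold the scores of aligning a prefix against gaps only.
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap
    for j in range(1, n + 1):
        score[0][j] = j * gap

    # Step 2: fill each cell from its diagonal, upper and left neighbours.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # match or mismatch
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Step 3: trace back from the bottom-right corner to recover one
    # optimal alignment.
    out_a, out_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[m][n], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))  # (2, 'ACGT', 'A-GT')
```

Both the running time and the memory of this version grow as M*N, which is the pairwise cost stated above.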
• Form all possible combinations of input sequence pairs.
• Perform the pairwise alignment of each pair in parallel.
• Each pairwise alignment can be further divided into
multiple chunks that are aligned in parallel where no
dependencies exist. The number of chunks equals the
number of computing nodes available.
• Merge the alignment results to get the final score and
alignment.
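One place where no dependencies exist inside a single pairwise alignment is along an anti-diagonal of the score matrix: each cell needs only its upper, left and upper-left neighbours, which all lie on the two preceding anti-diagonals. The sketch below assumes this wavefront scheme, which the poster does not spell out, and only simulates the chunks sequentially. The function name `wavefront_score`, the `workers` parameter and the scoring values are hypothetical.

```python
def wavefront_score(a, b, workers=3, match=1, mismatch=-1, gap=-1):
    """Global alignment score computed one anti-diagonal at a time.

    Cells on the same anti-diagonal (i + j == d) are independent of each
    other, so each diagonal is split into `workers` chunks. The chunks are
    processed one after another here; on a cluster each chunk could go to
    a different node.
    """
    m, n = len(a), len(b)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap
    for j in range(1, n + 1):
        score[0][j] = j * gap

    for d in range(2, m + n + 1):
        # All interior cells (i, j) with i + j == d.
        cells = [(i, d - i) for i in range(max(1, d - n), min(m, d - 1) + 1)]
        # Split the diagonal into as many chunks as there are workers.
        chunks = [cells[k::workers] for k in range(workers)]
        for chunk in chunks:
            for i, j in chunk:
                sub = match if a[i - 1] == b[j - 1] else mismatch
                score[i][j] = max(
                    score[i - 1][j - 1] + sub,
                    score[i - 1][j] + gap,
                    score[i][j - 1] + gap,
                )
    return score[m][n]


if __name__ == "__main__":
    print(wavefront_score("ACGT", "AGT"))
```

The result does not depend on the number of chunks, because no cell in a chunk reads a value written by another chunk on the same diagonal.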
• Hadoop framework to handle the large amount of
data. Deployed on low-cost commodity hardware;
moves computation to the data.
• MapReduce programming to split the input for
processing and combine the results (parallel
execution).
• HDFS for storing multiple chunks of data in multiple
copies (fault tolerant).
Proposed Solution
For N input sequences of size M:
Complexity of sequential implementation: M^N
Both time and space increase exponentially as the
input size increases.
Hadoop-based parallel implementation:
The possible pairwise combinations of N input
sequences: C(N,2) = N(N-1)/2
Complexity of pairwise alignment when sequences
can be divided into 'b' blocks: M^2/b
MSA complexity: (M^2/b) * (N(N-1)/2)
where b is equal to the number of nodes available
for processing.
(Time taken is reduced to a linear level).
With parallelization at input sequence file level
alone: M^2 * (N(N-1)/2)
As the number of nodes increases, the time required
for processing decreases.
Computational space requirements remain the
same.
A trade-off between time and space exists.
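To make the expressions above concrete, the following Python sketch evaluates them for one set of numbers. The function names (`pair_count`, `estimated_cells`) and the example values (10 sequences of length 100, 3 blocks) are illustrative assumptions, and the figures are counts of matrix cells, not measured running times.

```python
def pair_count(n):
    """Number of pairwise combinations of n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


def estimated_cells(n, m, b=1):
    """Matrix cells per node: (m^2 / b) * n(n-1)/2.

    b = 1 corresponds to parallelization at the input file level alone.
    """
    return (m ** 2) * pair_count(n) / b


if __name__ == "__main__":
    n, m = 10, 100
    print(pair_count(n))               # 45
    print(estimated_cells(n, m))       # 450000.0
    print(estimated_cells(n, m, b=3))  # 150000.0
```

The total number of cells stored is the same in both cases, which matches the statement that space requirements remain unchanged while time falls with the number of nodes.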
• Hadoop Framework takes care of the
coordination between master and slave nodes.
• HDFS along with MapReduce eases the parallel
execution of tasks.
• Data & Compute parallelism
Analysis
• Highly scalable solution: capacity grows by adding
commodity hardware.
• Suitable for NGS data due to its ability to handle
massive data loads.
• Computational space is still a point of concern,
which needs to be addressed by improving the
underlying algorithm.
Conclusion & Future Work
Enhanced sequencing technologies (NGS) produce
sequence data on an unparalleled scale, and there is
a need to scale alignment solutions so that they can
handle huge volumes of input data.
E.g., the Illumina HiSeq X™ Ten generates up to 6 billion
sequence reads per run.
MSA is an NP-complete problem: the number of
computational steps increases exponentially with the
size of the problem.
• Approximation Algorithms: find an approximate
solution reasonably fast. Trade-off between
quality and performance. E.g., Progressive, Branch
& Bound
• Parallel Processing: parallelize the execution to
reduce processing time. User-defined
coordination among nodes. E.g., MPI
• Big Data Frameworks: frameworks designed for
handling huge data volumes. Inherent capability
to take care of multi-node coordination and fault
tolerance. E.g., Hadoop, Spark
Methodology
Result
• A 3-node virtual cluster is used for the implementation.
• The classic dynamic programming algorithm at the core
ensures accuracy.
• Parallel execution in the Hadoop framework reduced
the time taken for alignment.
• For small input sizes, the sequential implementation
gives better performance, but as the input
size increases, the sequential
implementation time increases exponentially
and the Hadoop-based parallel implementation
provides better performance.
Hadoop - Overview
Challenges
Solution Options
Data Deluge
Storage: Storing the massive amount of data being generated
Processing: Analyzing this huge data is highly time and space consuming
Reporting: Reporting the results of analysis needs better visualization tools
Design
