Improved Needleman-Wunsch algorithm for NGS
alignment in Hadoop Data Clusters
Vineetha V and Achuthsankar S. Nair
Department of Computational Biology & Bioinformatics, University of Kerala,
Thiruvananthapuram, Kerala
Vineetha V
Research Scholar, Dept. of Computational Biology and Bioinformatics, University of Kerala
Technology Architect, Infosys Ltd
Email: vineevishnu@gmail.com
Phone: +919446175215
Contact References
• Comparing two or more sequences to locate a
series of identical characters or character
patterns.
• Also considers spaces or gaps and mismatches
(which correspond to mutations).
Types:
• Global Sequence Alignment
- Finds the best alignment
across the entire sequences.
• Local Sequence Alignment
- Finds regions of high
similarity in parts of the sequences.
• Multiple Sequence Alignment (aligning more
than 2 sequences at a time) is one of the most
useful tools and helps in almost every application
of Bioinformatics.
-Phylogenetic Analysis
-Structure Prediction
-For Sequence Similarity
Sequence Alignment
• Input Sequence Files are loaded into HDFS.
• Sequence File name is taken as the key and the
content as value for the Mapper function.
• Performs pairwise alignment of all possible
combinations.
• Query Sequence File name along with Target File
name is taken as the key and score as the value for the
Reducer function.
• Reducer function combines the result and provides
the final result.
• A large number of NGS short reads can be aligned
without a second level of parallelization at the pairwise
alignment step.
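The mapper and reducer roles described above can be sketched in plain Python, without a Hadoop cluster. This is a minimal in-memory simulation, not the authors' implementation: the function names `nw_score`, `map_phase` and `reduce_phase`, the scoring values (+1 match, -1 mismatch, -1 gap), and the choice to keep the maximum score per key in the reducer are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score only, keeping two rows of the matrix."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def map_phase(records):
    """records maps file name -> sequence content (the mapper's key and value).

    Emits ((query_name, target_name), score) for every pair of files.
    """
    for (qname, qseq), (tname, tseq) in combinations(sorted(records.items()), 2):
        yield (qname, tname), nw_score(qseq, tseq)


def reduce_phase(pairs):
    """Group emitted scores by (query, target) key and keep one score per key.

    Taking the maximum is an assumption; the poster only says the reducer
    "combines the result".
    """
    grouped = defaultdict(list)
    for key, score in pairs:
        grouped[key].append(score)
    return {key: max(scores) for key, scores in grouped.items()}


if __name__ == "__main__":
    reads = {"r1.fa": "ACGT", "r2.fa": "ACT", "r3.fa": "ACGT"}
    for key, score in sorted(reduce_phase(map_phase(reads)).items()):
        print(key, score)
```

In a real Hadoop job each mapper sees only its own input split, so producing all pair combinations would need an extra step (for example a distributed cache or a pre-computed pair list); the sketch skips that detail.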
Data Challenges
A parallelized implementation of Needleman-Wunsch
algorithm using Hadoop framework
Needleman-Wunsch Algorithm
• Create M*N matrix (M & N - Input sequence size)
• Fill up the matrix based on character similarity
• Trace back to find optimum alignment
Time & Space Complexity –
Pairwise Alignment – M*N
Multiple Sequence Alignment – M^N
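The three steps above (create the matrix, fill it, trace back) can be sketched as follows. This is a minimal textbook version, assuming a simple scoring scheme of +1 for a match, -1 for a mismatch and -1 per gap; the poster does not state the scores it used, and the function name `needleman_wunsch` is chosen here for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    m, n = len(a), len(b)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Step 1: create the matrix. Row 0 and column 0 hold gap-only prefixes.
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap
    for j in range(1, n + 1):
        score[0][j] = j * gap

    # Step 2: fill the matrix. Each cell needs its upper, left and diagonal neighbours.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Step 3: trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[m][n], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "ACT"))
```

Both the fill loop and the stored matrix grow with M*N, which is the pairwise time and space cost quoted above.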
• Form all possible combinations of input sequence pairs.
• Perform pairwise alignment of each pair in parallel.
• Pairwise alignment can be further divided into
multiple chunks that are aligned in parallel where no
dependencies exist. The number of chunks is the same as
the number of computing nodes available.
• Merge the results of alignment to get the final score and
alignment.
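One common way to find chunks "where no dependencies exist" inside a single pairwise alignment is the anti-diagonal (wavefront) order: every cell on the anti-diagonal i + j = d depends only on cells from diagonals d-1 and d-2, so cells on the same diagonal can be filled independently. The poster does not say which chunking scheme was used, so the sketch below is an assumed illustration. It uses a local thread pool in place of cluster nodes, and `nw_score_wavefront` and `n_chunks` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def nw_score_wavefront(a, b, n_chunks=3, match=1, mismatch=-1, gap=-1):
    """Global alignment score, filling the matrix one anti-diagonal at a time."""
    m, n = len(a), len(b)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap
    for j in range(1, n + 1):
        score[0][j] = j * gap

    def fill(cells):
        # Every cell here reads only diagonals d-1 and d-2, which are complete.
        for i, j in cells:
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        for d in range(2, m + n + 1):
            # All interior cells (i, j) with i + j == d.
            cells = [(i, d - i) for i in range(max(1, d - n), min(m, d - 1) + 1)]
            # One chunk per worker, mirroring "chunks = number of nodes".
            chunks = [cells[k::n_chunks] for k in range(n_chunks)]
            # list() waits for every chunk before the next diagonal starts.
            list(pool.map(fill, chunks))
    return score[m][n]


if __name__ == "__main__":
    print(nw_score_wavefront("ACGT", "ACT"))
```

The per-diagonal barrier is the cost of this scheme: short diagonals near the matrix corners leave some workers idle, so the speed-up is below the ideal factor of `n_chunks`.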
• Hadoop Framework to handle the large amount of
data. Deployed on low cost commodity hardware,
Moves Computation to Data.
• MapReduce programming to split the input for
processing and combine the result. (Parallel
Execution)
• HDFS for storing multiple chunks of data in multiple
copies (Fault tolerant)
Proposed Solution
For N input sequences of size M:
Complexity of sequential implementation: M^N
Both time and space increase exponentially as the
input size increases.
Hadoop based parallel implementation:
The possible pairwise combinations of N input
sequences: C(N, 2) = N(N-1)/2
Complexity of pairwise alignment when sequences
can be divided into 'b' blocks: M^2/b,
where b is equal to the number of nodes available
for processing.
MSA complexity: (M^2/b) * (N(N-1)/2)
(Time taken is reduced to a linear level).
With parallelization at input sequence file level
alone: M^2 * (N(N-1)/2)
As the number of nodes increases, the time required
for processing reduces.
Computational space requirements remain the
same.
A trade-off between time and space exists.
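To make the formulas above concrete, the short sketch below counts matrix cells for a given number of sequences, sequence length and number of blocks. It is only an arithmetic illustration of the stated cost model, assuming perfect division of work across blocks; `alignment_cost` and its parameter names are hypothetical.

```python
from math import comb


def alignment_cost(n_sequences, seq_len, n_blocks=1):
    """Cell counts under the poster's model: (M^2 / b) * (N(N-1)/2)."""
    pairs = comb(n_sequences, 2)        # N(N-1)/2 pairwise combinations
    cells_per_pair = seq_len ** 2       # M^2 cells for one pairwise matrix
    total_cells = pairs * cells_per_pair
    return {
        "pairs": pairs,
        "total_cells": total_cells,
        "cells_per_block": total_cells / n_blocks,
    }


if __name__ == "__main__":
    # 10 sequences of length 100 on 1 node versus 3 nodes.
    print(alignment_cost(10, 100, n_blocks=1))
    print(alignment_cost(10, 100, n_blocks=3))
```

For 10 sequences of length 100 this gives 45 pairs and 450,000 cells in total, or 150,000 cells per block with b = 3. The total number of cells, and therefore the space needed, does not change with b.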
• Hadoop Framework takes care of the
coordination between master and slave nodes.
• HDFS along with MapReduce eases the parallel
execution of tasks.
• Data & Compute parallelism
Analysis
• Highly scalable solution with additional
commodity hardware.
• Suitable for NGS data due to its ability to handle
massive data loads.
• Computational space is still a point of concern,
which needs to be addressed by improving the
underlying algorithm.
Conclusion & Future Work
Enhanced sequencing technologies (NGS) produce
sequence data on an unparalleled scale, and
alignment solutions need to scale to handle
huge volumes of input data.
E.g., the Illumina HiSeq X™ Ten generates up to 6 billion
sequence reads per run.
MSA is an NP-complete problem – computational
steps increase exponentially with the size of the
problem.
• Approximation Algorithms – Find an approximate
solution reasonably fast. Trade-off between
quality and performance. E.g., Progressive, Branch
& Bound
• Parallel Processing – Parallelize the execution to
reduce processing time. User-defined
coordination among nodes. E.g., MPI
• Big Data Frameworks – Frameworks designed for
handling huge data volumes. Inherent capability
to take care of multi-node coordination and fault
tolerance. E.g., Hadoop, Spark
Methodology
Result
• A 3-node virtual cluster was used for the implementation.
• The classic dynamic programming algorithm at the core
ensures accuracy.
• Parallel execution in the Hadoop framework reduced
the time taken for alignment.
• For small input sizes, the sequential implementation
gives better performance, but as the input
size increases, the sequential
implementation time increases exponentially
and the Hadoop based parallel implementation
provides better performance.
Hadoop - Overview
Challenges
Solution Options
Data Deluge
Storage: Storing the massive amount of data being generated.
Processing: Analyzing this huge data is highly time & space consuming.
Reporting: Reporting the results of analysis needs better visualization tools.
Design
1. S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid
sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, March 1970. ISSN 0022-2836.
2. Sudha Sadasivam G, Baktavatchalam G. A novel approach to multiple sequence alignment using Hadoop data grids. Int
J Bioinform Res Appl. 2010; 6(5):472-83.
3. Sara A. Shehab, Arabi Keshk, Hany Mahgoub. Fast Dynamic Algorithm for Sequence Alignment based on Bioinformatics.
International Journal of Computer Applications (0975 – 8887), Volume 37, No. 7, January 2012.
4. Angana Chakraborty and Sanghamitra Bandyopadhyay. FOGSAA: Fast Optimal Global Sequence Alignment Algorithm.
Scientific Reports 3, Article number: 1746. dx.doi.org/10.1038/srep01746
5. http://hadoop.apache.org/
6. http://www.ibm.com/developerworks/library/j-seqalign/

