Bioinformatics_Sequence Analysis

Sequence Alignment in Bioinformatics:
Introduction:
In bioinformatics, a sequence alignment is a way of arranging sequences of DNA, RNA, or
protein to identify regions of similarity that may indicate functional, structural, or evolutionary
relationships between the sequences. It is an important first step toward the structural and
functional analysis of newly determined sequences. An alignment is made either between
a known sequence and an unknown sequence or between two unknown sequences. The known
sequence is called the reference sequence and the unknown sequence is called the query sequence.
As new biological sequences are being generated at an exponential rate, sequence comparison is
becoming increasingly important for drawing functional and evolutionary inferences.
Types of Sequence Alignment:
Sequence Alignment is of two types, namely:
1. Global Alignment, and 2. Local Alignment
1. Global Alignment:
Global alignment matches the residues of two sequences across their entire length. It is best
suited to closely related sequences of similar length. Global alignment programs are based on
the Needleman-Wunsch algorithm.
In global alignment, two sequences to be aligned are assumed to be generally similar over their
entire length. Alignment is carried out from beginning to end of both sequences to find the best
possible alignment across the entire length between the two sequences.
Applications of global sequence alignment are:
• Comparing two genes with the same function (e.g., in human vs. mouse).
• Comparing two proteins with similar function.
BOTMT:604
Bioinformatics and Biophysics
Prepared By-
Dr. Sangeeta Das.
Assistant Professor, Department of Botany, Bahona College, Jorhat, Assam, India.

2. Local Alignment:
Local alignment matches the regions of two sequences that are most similar to each other.
Local alignment programs are based on the Smith-Waterman algorithm.
Unlike global alignment, local alignment does not assume that the two sequences in question
have similarity over the entire length. It only finds local regions with the highest level of
similarity between the two sequences and aligns these regions without regard for the alignment
of the rest of the sequence regions.
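The way a local alignment isolates only the best-matching region can be sketched in Python. This is a minimal illustration of the Smith-Waterman idea, not a production implementation; the function name and the scoring values (match = 2, mismatch = -1, gap = -1) are assumptions chosen for the example.

```python
def smith_waterman(seq1, seq2, match=2, mismatch=-1, gap=-1):
    """Return (best score, aligned part of seq1, aligned part of seq2).

    Scoring values are illustrative assumptions, not taken from the notes.
    """
    rows, cols = len(seq1) + 1, len(seq2) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill: a cell may never drop below 0, which lets an alignment
    # start afresh anywhere in the two sequences.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # align the two residues
                          H[i - 1][j] + gap,     # gap in seq2
                          H[i][j - 1] + gap)     # gap in seq1
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback: start at the highest-scoring cell and stop at the first 0.
    a1, a2 = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if seq1[i - 1] == seq2[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            a1.append(seq1[i - 1]); a2.append(seq2[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            a1.append(seq1[i - 1]); a2.append("-")
            i -= 1
        else:
            a1.append("-"); a2.append(seq2[j - 1])
            j -= 1
    return best, "".join(reversed(a1)), "".join(reversed(a2))


print(smith_waterman("TTTACGTTT", "GGACGGG"))
```

Only the shared region is reported; the flanking residues of both sequences are left out of the alignment, which is the behaviour described above.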
Applications of local sequence alignment are:
• Searching for local similarities in large sequences (e.g., newly sequenced genomes).
• Looking for conserved domains or motifs in two proteins.
Methods of Sequence Alignment:
There are two methods of sequence alignment:
A. Pairwise Sequence Alignment method, and B. Multiple Sequence Alignment Method.
A. Pairwise Sequence Alignment method:
Pairwise sequence alignment methods are used to find the best-matching piecewise (local or
global) alignments of two query sequences.
Pairwise alignments can only be used between two sequences at a time, but they are efficient
to calculate.
The three primary methods of producing pairwise sequence alignments are:
1. Dot matrix method (old method)
2. Dynamic programming (DP) algorithm (advanced method)
3. Word or k-tuple methods
1. Dot Matrix Method:
A dot matrix is a grid in which the matching residues of two sequences are represented as
dots, while positions without a match are left blank. It is also known as a dot plot.
In a dot matrix, the nucleotides of one sequence are written from left to right along the top row
and those of the other sequence are written from top to bottom along the left side (column) of
the matrix. At every point where the two nucleotides are the same, a dark dot is placed at the
intersection of the row and column. Runs of consecutive dots form diagonal lines that mark
regions of similarity. The resulting graph is called a dot plot, which is a type of recurrence plot.
Each dot in the plot represents a matching nucleotide or amino acid.
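The construction described above can be sketched in a few lines of Python. This is a minimal text-based illustration; the function names and the characters used for dots and blanks are assumptions made for the example.

```python
def dot_matrix(top_seq, left_seq, dot="*", blank="."):
    """Return one string per row of the matrix.

    top_seq runs left to right along the top, left_seq runs top to bottom
    along the left side. A dot marks every position where residues match.
    """
    return ["".join(dot if a == b else blank for a in top_seq)
            for b in left_seq]


def format_dot_plot(top_seq, left_seq):
    """Add the sequence labels around the matrix for display."""
    lines = ["  " + " ".join(top_seq)]
    for residue, row in zip(left_seq, dot_matrix(top_seq, left_seq)):
        lines.append(residue + " " + " ".join(row))
    return "\n".join(lines)


print(format_dot_plot("GAATTCAGTTA", "GGATCGA"))
```

Diagonal runs of `*` in the printed grid correspond to stretches where the two sequences agree.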
The dot matrix method is a qualitative method. It makes sequences very simple to analyze.
However, it takes much time to analyze large sequences.
Applications of the dot matrix method are detection of:
• Sequence similarity between two nucleotide sequences or two amino acid sequences.
• Insertion of short stretches in a DNA or amino acid sequence.
• Deletion of short stretches from a DNA or amino acid sequence.
• Repeats or inverted repeats in a DNA or amino acid sequence.
Fig.1: Nucleic acid dot plots.
2. Dynamic Programming Method:
Dynamic programming is a method for solving problems that require a series of decisions, each
of which must be the best one available. It was introduced by Richard Bellman in the 1950s.
The word programming here denotes finding an acceptable plan of action, not computer
programming. The method compares every pair of characters in the two sequences and generates
an alignment that is optimal under the chosen scoring scheme. It is useful for aligning the
nucleotide sequences of DNA and the amino acid sequences of the proteins coded by that DNA.
However, it is computationally demanding: the time and memory required grow with the product
of the two sequence lengths. Each alignment has its own score, and several different alignments
may have identical or nearly identical scores, so dynamic programming methods may produce
more than one optimal alignment. Careful choice of the scoring parameters (match and mismatch
scores, gap penalties) is therefore important and may discriminate between alignments with
similar scores.
Global alignment programs are based on the Needleman-Wunsch algorithm and local alignment
programs on the Smith-Waterman algorithm. Both algorithms are derived from the basic dynamic
programming algorithm.
Dynamic programming is a three step process that involves:
1) Breaking of the problem into small sub-problems.
2) Solving sub-problems using recursive methods.
3) Construction of an optimal solution for the original problem from the optimal solutions of
the sub-problems.
Example:
Alignment:
Sequence 1: G A A T T C A G T T A
Sequence 2: G G A T C G A
So M = 11 and N = 7 (the length of sequence #1 and sequence #2, respectively)
A simple scoring scheme is assumed where
• S(i,j) = 1 if the residue at position i of sequence #1 is the same as the residue at position
j of sequence #2 (match score); otherwise
• S(i,j) = 0 (mismatch score)
• w = 0 (gap penalty).
There are three steps in dynamic programming methods:
1. Initialization 2. Matrix fill (scoring), and 3. Traceback (alignment).
1. Initialization Step:
The first step in the global alignment dynamic programming approach is to create a matrix with
M + 1 columns and N + 1 rows where M and N correspond to the size of the sequences to be
aligned.
Since the gap penalty in this example is 0, the first row and first column of the matrix can be
initially filled with 0.
2. Matrix Fill Step:
One possible (inefficient) solution of the matrix fill step finds the maximum global alignment
score by starting in the upper left-hand corner of the matrix and finding the maximal score
M(i,j) for each position in the matrix.
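The maximal score at each position is usually written as the standard Needleman-Wunsch recurrence shown below. It is given here as a sketch in the notation of the scoring scheme above (S for the match/mismatch score, w for the gap penalty), with i indexing sequence #1 and j indexing sequence #2; the annotation of which move opens a gap in which sequence follows from that indexing assumption.

```latex
M_{i,j} = \max
\begin{cases}
  M_{i-1,j-1} + S_{i,j} & \text{match or mismatch (diagonal move)} \\
  M_{i,j-1} + w         & \text{gap in sequence \#1} \\
  M_{i-1,j} + w         & \text{gap in sequence \#2}
\end{cases}
```

With the example values (S = 1 for a match, S = 0 for a mismatch, w = 0), each cell is simply the largest of its diagonal neighbour plus the match score, its left neighbour, and its upper neighbour.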
After filling in all the values the score matrix is as follows:
3. Traceback Step:
The traceback step determines the actual alignment(s) that result in the maximum score.
Giving an alignment of:
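The three steps (initialization, matrix fill, traceback) can be sketched in Python as follows. This is a minimal illustration that uses the scoring scheme of the example (match = 1, mismatch = 0, gap = 0) as default values; the function name and the tie-breaking order in the traceback (diagonal first) are assumptions, so it returns only one of possibly several optimal alignments.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=0, gap=0):
    """Return (score, aligned seq1, aligned seq2) for a global alignment."""
    m, n = len(seq1), len(seq2)

    # 1. Initialization: N + 1 rows by M + 1 columns.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + gap
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + gap

    # 2. Matrix fill: best of diagonal, left and upper neighbours.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if seq1[j - 1] == seq2[i - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i][j - 1] + gap,
                              score[i - 1][j] + gap)

    # 3. Traceback from the lower right-hand corner to the origin.
    a1, a2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if seq1[j - 1] == seq2[i - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                a1.append(seq1[j - 1]); a2.append(seq2[i - 1])
                i -= 1; j -= 1
                continue
        if j > 0 and score[i][j] == score[i][j - 1] + gap:
            a1.append(seq1[j - 1]); a2.append("-")   # gap in sequence 2
            j -= 1
        else:
            a1.append("-"); a2.append(seq2[i - 1])   # gap in sequence 1
            i -= 1
    return score[n][m], "".join(reversed(a1)), "".join(reversed(a2))


total, top, bottom = needleman_wunsch("GAATTCAGTTA", "GGATCGA")
print(total)
print(top)
print(bottom)
```

For the two example sequences the maximum score is 6, i.e. six matched positions. Because the gap and mismatch scores are both 0 here, several different alignments reach that score, which illustrates why the choice of scoring parameters matters.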
3. Word Method or K-Tuple Method:
Word methods are heuristic: they are not guaranteed to find an optimal alignment solution, but
they are significantly more efficient than dynamic programming. This makes them useful in
large-scale database searches, where the aim is to find whether there is a significant match
to the query sequence. Word methods are used in the database search tools FASTA and
BLAST. They identify a series of short, non-overlapping subsequences (words) of the query
sequence, which are then matched against candidate database sequences.
In the FASTA method, the user defines a value k to use as the word length with which to search
the database. The method is slower but more sensitive at lower values of k, which are also
preferred for searches involving a very short query sequence. BLAST provides a number of
algorithms optimized for particular types of queries, such as searching for distantly related
sequence matches. It was developed as a faster alternative to FASTA without sacrificing much
accuracy. Like FASTA, BLAST uses a word search of length k, but it evaluates only the most
significant word matches rather than every word match.
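The word-matching idea can be sketched in Python as below. This is a simplified illustration and not the FASTA or BLAST implementation: the function name is hypothetical, every k-length window of both sequences is indexed (an assumption made for simplicity), only exact word matches are reported, and the extension, scoring and statistics steps of the real tools are omitted.

```python
from collections import defaultdict


def find_word_hits(query, subject, k):
    """Return sorted (query_pos, subject_pos) pairs where a k-letter word
    of the query occurs exactly in the subject (0-based positions)."""
    # Index every k-letter word of the subject by its start position.
    index = defaultdict(list)
    for s_pos in range(len(subject) - k + 1):
        index[subject[s_pos:s_pos + k]].append(s_pos)

    # Look up each k-letter word of the query in the index.
    hits = []
    for q_pos in range(len(query) - k + 1):
        for s_pos in index.get(query[q_pos:q_pos + k], []):
            hits.append((q_pos, s_pos))
    return sorted(hits)


hits = find_word_hits("GATTACA", "TTACAGG", k=3)
print(hits)
# Hits that share the same value of query_pos - subject_pos lie on one
# diagonal and are candidates for extension into a longer alignment.
print(sorted({q - s for q, s in hits}))
```

A smaller k yields more hits (greater sensitivity, more work), while a larger k yields fewer hits and a faster search, which mirrors the trade-off described for FASTA.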
B. Multiple Sequence Alignment Method:
In a multiple sequence alignment, homologous residues among a set of sequences are aligned
together in columns. Here, homologous is meant in both the structural and evolutionary sense.
Multiple sequence alignment (MSA) is generally the alignment of three or more biological
sequences (protein or nucleic acid) of similar length. From the output, homology can be inferred
and the evolutionary relationships between the sequences can be studied.
Types of MSA methods: The following are the multiple sequence alignment methods:
1. Dynamic Programming approach, 2. Progressive method and 3. Iterative method.
1. Dynamic Programming approach:
Dynamic programming can in principle be applied to align any number of sequences. It computes
an optimal alignment for a given score function. However, its running time grows exponentially
with the number of sequences, so it is not typically used in practice.
2. Progressive method:
In this method, pairwise global alignment is first performed for all possible pairs of sequences.
These pairs are then ranked on the basis of their similarity.
The most similar sequences are aligned together, and then less related sequences are added to
the alignment progressively, one by one, until the complete query set has been incorporated.
This method is also called the hierarchical method or tree method.
The progressive method is one of the fastest approaches, considerably faster than adapting
pairwise dynamic programming to multiple sequences, which can become a very slow process for
more than a few sequences.
One major disadvantage of this method is its reliance on a good alignment of the first two
sequences. Errors there can propagate throughout the rest of the process. An alternative
approach is the iterative method.
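The ordering step of the progressive method can be sketched in Python. This illustration covers only how the joining order is chosen, not the alignment itself. Several simplifying assumptions are made: similarity is the longest-common-subsequence length (the global score under match = 1, mismatch = 0, gap = 0) divided by the longer sequence length, sequences are added greedily rather than via a full guide tree, and all function names are hypothetical.

```python
from itertools import combinations


def lcs_score(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(max(prev[j - 1] + (x == y), prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def progressive_order(seqs):
    """seqs maps a name to a sequence. Return the names in the order in
    which they would be added: most similar pair first, then the
    sequence most similar to any already included."""
    names = list(seqs)
    sim = {}
    for a, b in combinations(names, 2):
        longest = max(len(seqs[a]), len(seqs[b]))
        sim[frozenset((a, b))] = lcs_score(seqs[a], seqs[b]) / longest

    first_pair = max(combinations(names, 2), key=lambda p: sim[frozenset(p)])
    order = list(first_pair)
    remaining = [n for n in names if n not in order]
    while remaining:
        nxt = max(remaining,
                  key=lambda n: max(sim[frozenset((n, o))] for o in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order


example = {"s1": "GATTACA", "s2": "GATTACC", "s3": "GCTTGCA", "s4": "TTTTTTT"}
print(progressive_order(example))
```

Because the first pair fixes the starting point for everything added later, a poor choice or a poor alignment at that stage carries through to the final result, which is the weakness noted above.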
Steps involved in Multiple Sequence alignment are as follows:
A. Pairwise sequence alignment:
B. Multiple Sequence Alignment following the tree from A.
3. Iterative Method:
This method performs a series of steps that produce successively better approximations to the
alignment of many sequences. In this method, total reliance on the initial pairwise sequence
alignment is avoided. Instead, the multiple sequence alignment is re-iterated, starting with the
pairwise re-alignment of sequences within subgroups, followed by the re-alignment of the
subgroups. The choice of subgroups can be made via sequence relations on the guide tree,
random selection, and so on.
Iterative methods attempt to improve on the weak point of the progressive methods: the heavy
dependence on the accuracy of the initial pairwise alignment. The iterative method is an
optimization method and may use machine learning approaches such as genetic algorithms and
Hidden Markov Models. Its disadvantages are inherited from optimization methods, i.e., the
process can get trapped in local minima and can be much slower.