SEQUENCE ALIGNMENT: Global vs Local
Sequence
• A sequence in biology is the one-dimensional ordering of monomers,
covalently linked within a biopolymer.
• May also be referred to as the primary structure of a biological
macromolecule.
• In bioinformatics, refers to a DNA, RNA or protein sequence.
Sequence alignment
• Procedure of comparing two or more sequences by searching for a
series of individual characters or character patterns that are in the
same order in the sequences.
• Two sequences are aligned by writing them across a page in two rows.
• Identical or similar characters are placed in the same column, and
non-identical characters can either be placed in the same column as a
mismatch or opposite a gap in the other sequence.
• In an optimal alignment, non-identical characters and gaps are placed
to bring as many identical or similar characters as possible into
vertical register.
• Sequences that can be readily aligned in this manner are said to be
similar.
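The column-by-column view described above can be sketched in a few lines of Python. This is only an illustration: the function name `column_summary`, the `-` gap symbol and the example rows are assumptions, not part of the original slides.

```python
def column_summary(row_a: str, row_b: str, gap: str = "-") -> dict:
    """Count matches, mismatches and gaps in two already-aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    counts = {"match": 0, "mismatch": 0, "gap": 0}
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            counts["gap"] += 1       # character placed opposite a gap
        elif a == b:
            counts["match"] += 1     # identical characters in one column
        else:
            counts["mismatch"] += 1  # non-identical characters in one column
    return counts


# Two rows written one above the other, as on the slide.
print(column_summary("GATTACA", "GA-TGCA"))
```

An alignment with more matching columns and fewer mismatches and gaps is the better candidate for the optimal alignment.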
Two types of sequence alignment:
–Global alignment
–Local alignment
Fig.: Distinction between Global and Local alignment of two sequences
• Global alignment
– Attempts to align the entire sequence using as many characters as possible,
up to both ends of each sequence.
– Sequences that are quite similar and approximately the same length are
suitable candidates for global alignment.
– The Needleman-Wunsch algorithm is used to produce a global alignment between
pairs of DNA or protein sequences.
• Local alignment
– Stretches of sequence with the highest density of matches are aligned.
– Generates one or more islands of matches or subalignments in the aligned
sequences.
– Suitable for aligning sequences that are similar along some of their lengths
but dissimilar in others, sequences that differ in length, or sequences that
share a conserved region or domain.
– The Smith-Waterman algorithm is used to produce local alignments between pairs
of DNA or protein sequences.
Dynamic Programming
• Method for solving a complex problem by breaking it down into a
collection of simpler sub-problems, solving each of these sub-problems
just once and storing their solutions, ideally in a memory-based
data structure.
• The next time the same sub-problem occurs, instead of recomputing
its solution, one simply looks up the previously computed solution,
thereby saving computation time at the expense of a modest
expenditure in storage space.
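As a rough illustration of "solve once, store, look up", the sketch below computes the length of the longest common subsequence of two strings. The problem choice and the name `lcs_length` are assumptions made for the example; the storage is Python's standard `functools.lru_cache`.

```python
from functools import lru_cache


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""

    @lru_cache(maxsize=None)  # stores each (i, j) result after its first computation
    def solve(i: int, j: int) -> int:
        if i == len(a) or j == len(b):
            return 0  # one string is exhausted
        if a[i] == b[j]:
            return 1 + solve(i + 1, j + 1)
        # The two branches share many sub-problems; the cache answers repeats.
        return max(solve(i + 1, j), solve(i, j + 1))

    return solve(0, 0)


print(lcs_length("GATTACA", "GCATGCU"))
```

Without the cache the same `(i, j)` pairs would be recomputed many times; with it, each pair is solved once, at the cost of storing one integer per pair.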
Three steps in dynamic programming:
• Initialization
• Matrix fill (scoring)
• Traceback (alignment)
• Initialization:
– Involves creating a matrix with M+1 columns and N+1 rows, where
M and N are the lengths of the two sequences to be aligned.
– The first row and the first column are initialized with scores
corresponding to gap penalties.
• Matrix fill (scoring)
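– The score at each position is the maximum of three candidates: the
diagonal neighbour's score plus the match or mismatch score, the score
of the cell above minus the gap penalty, and the score of the cell to
the left minus the gap penalty.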
• Traceback (alignment)
– Traceback starts from the last block (bottom right) and continues until
the first block (top left) in the matrix.
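The initialization and matrix-fill steps can be sketched as follows. The function name `fill_matrix`, the scoring values (match +1, mismatch -1, gap -1) and the linear gap penalty are illustrative assumptions; the slides do not fix a scoring scheme.

```python
def fill_matrix(seq_a: str, seq_b: str,
                match: int = 1, mismatch: int = -1, gap: int = -1) -> list:
    """Initialize and fill a global-alignment score matrix (no traceback)."""
    cols = len(seq_a) + 1  # M + 1 columns
    rows = len(seq_b) + 1  # N + 1 rows
    score = [[0] * cols for _ in range(rows)]

    # Initialization: first row and first column hold cumulative gap penalties.
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        score[i][0] = i * gap

    # Matrix fill: each cell is the best of diagonal, up and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq_a[j - 1] == seq_b[i - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,  # align the two characters
                              score[i - 1][j] + gap,    # gap in seq_a
                              score[i][j - 1] + gap)    # gap in seq_b
    return score


for row in fill_matrix("AC", "AC"):
    print(row)
```

The bottom-right cell holds the score of the best global alignment; the traceback step recovers the alignment itself from this matrix.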
Final alignment
Needleman-Wunsch algorithm
• Based on dynamic programming.
• The optimal score at each position is calculated by adding the current
match score to previously scored positions and subtracting gap
penalties (if applicable).
• Each matrix position may have a positive or negative score or zero.
• The Needleman-Wunsch algorithm maximizes the alignment score, and
hence the number of matches, between the sequences along the entire
length of the sequences.
• Traceback starts at the last block and ends at the first block.
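A minimal sketch of global alignment with traceback, under assumed scores (match +1, mismatch -1, linear gap penalty -1). The name `needleman_wunsch` and the tie-breaking order in the traceback (diagonal first, then up, then left) are choices made for this example; other optimal alignments may exist.

```python
def needleman_wunsch(a: str, b: str,
                     match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (aligned_a, aligned_b, score) for one optimal global alignment."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)

    # Traceback: from the last cell back to the first cell.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")  # gap in b
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])  # gap in a
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), F[n][m]


top, bottom, score = needleman_wunsch("ACGT", "AGT")
print(top)
print(bottom)
print(score)
```

Because the traceback always runs from the last cell to the first, every character of both sequences appears in the output, which is what makes the alignment global.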
Smith-Waterman algorithm
• Based on dynamic programming but modified to give high-scoring local matches.
• Slightly different from the Needleman-Wunsch algorithm.
• The main differences are:
– The scoring system must include negative scores for mismatches, and
– When a DP scoring matrix value becomes negative it is set to zero, which has
the effect of terminating any alignment up to that point.
• Traceback starts at the block with the highest score and ends at the
first block containing zero.
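The two differences listed above (negative values reset to zero, traceback from the highest score to a zero) can be sketched as below. The name `smith_waterman` and the scores (match +2, mismatch -1, linear gap penalty -1) are assumptions for illustration, and only the first highest-scoring local alignment is reported.

```python
def smith_waterman(a: str, b: str,
                   match: int = 2, mismatch: int = -1, gap: int = -1):
    """Return (aligned_a, aligned_b, score) for one best local alignment."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]  # first row and column stay zero
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # A value that would be negative is set to zero instead.
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback: from the highest score until a zero is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")  # gap in b
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])  # gap in a
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), best


print(smith_waterman("TTTACGTAAA", "CCCACGTGGG"))
```

In this example only the shared `ACGT` island is reported; the dissimilar flanking regions are left out of the alignment rather than being forced into columns.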
THANK YOU