INTRODUCTION TO
SEQUENCE ALIGNMENT
PART 2
Content
• METHOD TO WRITE AN ALIGNMENT
• A MATCH, A GAP, AND INDELS
• REPRESENTATION OF SUBSTITUTION, DELETION,
AND INSERTION IN TRACES
• FEATURES OF GAP
• CAUSES OF GAPS
• OCCURRENCE OF GAPS
• TYPES OF GAPS AND GAP PENALTIES
Constant
Linear
Affine
Convex
Profile-based variable gap penalties
• Highlights of gap and gap penalty
• Example of assigning gaps and gap penalties
METHOD TO WRITE AN ALIGNMENT
When two symbolic representations of DNA or protein sequences
are arranged next to one another so that their most similar elements are
juxtaposed, they are said to be aligned.
• Alignments are conventionally shown as traces. In a symbolic sequence,
each base or residue monomer is represented by a single-letter code.
• The convention is to print the single-letter codes for the constituent
monomers in order in a fixed-width font (from the N-terminal to the
C-terminal end of a protein sequence, or from 5' to 3' of a nucleic acid
molecule).
• This assumes that the monomers are evenly spaced along the single
dimension of the molecule's primary structure.
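The printing convention above can be sketched in a few lines of Python. This is only an illustration: the function name `format_trace` and the use of `|` to mark identical residues are assumptions made here, not part of the original slides.

```python
def format_trace(top: str, bottom: str) -> str:
    """Return a three-line trace: top row, match markers, bottom row.

    Both rows must already be aligned (equal length, '-' for gaps).
    """
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    # Mark a column with '|' only when both rows carry the same residue.
    markers = "".join(
        "|" if a == b and a != "-" else " " for a, b in zip(top, bottom)
    )
    return "\n".join([top, markers, bottom])


print(format_trace("VCGED", "VCG-D"))
```

Printed in a fixed-width font, each column of the output corresponds to one position of the alignment.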
A MATCH, A GAP, AND INDELS
• Every element in a trace is either a match or a gap.
• A MATCH: Where a residue in one of two aligned sequences is identical to its counterpart
in the other, the corresponding letter codes in the two sequences are vertically
aligned in the trace.
• A GAP: When a residue in one sequence seems to have been deleted since the assumed
divergence of the sequence from its counterpart, its "absence" is labelled by a dash in
the derived sequence. These dashes represent "gaps" in one or other sequence (i.e.
insertion and deletion mutations are annotated as gaps in the sequences).
• GAPPING: The action of inserting such spacers is known as gapping. A deletion in one
sequence is symmetric with an insertion in the other, i.e. when a residue appears to have
been inserted to produce a longer sequence 'A', a dash appears opposite it in the
unaugmented sequence 'B'.
• INDELS: The two types of mutation (insertions and deletions) are therefore referred to
together as indels. If we imagine that at some point one of the sequences was identical
to its primitive homologue, then a trace can represent the three ways in which sequences
diverge through mutation: substitution, deletion, and insertion.
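As a rough sketch of how the columns of a trace map onto these events, the following Python function labels each column. It assumes the first row is treated as the reference (ancestral) sequence; the function name and label strings are hypothetical choices for illustration.

```python
def classify_columns(reference: str, derived: str) -> list[str]:
    """Label each alignment column relative to the reference row."""
    if len(reference) != len(derived):
        raise ValueError("aligned rows must have equal length")
    labels = []
    for ref, der in zip(reference, derived):
        if ref == "-" and der == "-":
            raise ValueError("a column cannot be a gap in both rows")
        if der == "-":
            labels.append("deletion")      # residue missing from derived row
        elif ref == "-":
            labels.append("insertion")     # residue added in derived row
        elif ref == der:
            labels.append("match")
        else:
            labels.append("substitution")  # aligned but different residues
    return labels


print(classify_columns("VCGED", "VCG-D"))
# ['match', 'match', 'match', 'deletion', 'match']
```

Swapping the two rows turns every "deletion" into an "insertion", which is the symmetry that the term indel captures.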
REPRESENTATION OF SUBSTITUTION,
DELETION, AND INSERTION IN TRACES
• A trace can represent a substitution (such as a point accepted mutation;
e.g. amino acid V changes to I because of a change in its codon in the DNA).
• A trace can represent a deletion (a residue or subsequence is deleted
from a sequence; e.g. amino acid E is deleted from the sequence because
its codon is absent from the DNA):
VCGED
VCG-D
• A trace can represent an insertion (a residue or subsequence is
inserted into a sequence; e.g. amino acid L is inserted in the sequence
because its codon has been added to the DNA).
FEATURES AND IMPORTANCE OF GAP AND
GAP PENALTY
• A gap is a maximal consecutive run of spaces in a single string of a
given alignment.
• It corresponds to an atomic insertion or deletion of a substring. The
inserted or deleted material often comprises an entire subsequence and
often arises from a single mutational event.
• Because single mutational events can create gaps of different sizes,
each gap needs to be scored as a whole when aligning two sequences.
• The Gap program considers all possible alignments and gap positions
between two sequences.
• It creates a global alignment that maximizes the number of matched
residues and minimizes the number and size of gaps.
• A scoring matrix is used to assign values for symbol matches.
• In addition, a gap creation penalty and a gap extension penalty are
required to limit the insertion of gaps into the alignment.
• Gap uses the alignment method of Needleman and Wunsch (1970), which has
been shown to be equivalent to that of Sellers (1974).
• The algorithm of Needleman and Wunsch finds the alignment of two
complete sequences that maximizes the number of matches.
• Treating a run of adjacent gap positions as one larger gap avoids
assigning an excessively high cost to what was probably a single mutation.
• For instance, two protein sequences may be relatively similar yet
differ at certain intervals because one protein has a different subunit
from the other. Representing these differing subsequences as gaps allows
us to treat such cases as "good matches" even though there are long
consecutive runs of indel operations in the sequence.
• Therefore, using a good gap penalty model avoids low scores in
alignments and improves the chances of finding a true alignment.
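The Needleman and Wunsch method mentioned above can be sketched as follows. This is a minimal teaching version, not the Gap program itself: it assumes a simple match/mismatch score instead of a scoring matrix and a single per-position gap penalty instead of separate creation and extension penalties. The function name and default scores are illustrative.

```python
def needleman_wunsch(a: str, b: str, match: int = 1,
                     mismatch: int = -1, gap: int = -1):
    """Global alignment by dynamic programming (per-position gap penalty).

    Returns (score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align residues
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


print(needleman_wunsch("VCGED", "VCGD"))
# (3, 'VCGED', 'VCG-D')
```

With these scores the four matches contribute +4 and the single gap position -1, giving 3.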
CAUSES OF GAPS
1. A single mutation can create a gap (very common).
2. Errors in DNA replication can result in the repetition of strings of
bases.
3. Unequal crossover in meiosis can lead to insertion or deletion of
strings of bases.
4. Translocation of DNA between chromosomes.
5. Retrovirus insertion.
OCCURRENCE OF GAPS
1. Before the first character of a string
2. Inside the string
3. After the last character of a string
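The three positions listed above, together with the definition of a gap as a maximal run of dashes, can be illustrated with a short Python sketch. The aligned string used here is made up for illustration, and the names `find_gaps`, "leading", "internal" and "trailing" are assumptions rather than terms from the slides.

```python
import re


def find_gaps(aligned: str) -> list[tuple[int, int, str]]:
    """Return (start, length, position) for each maximal run of '-'."""
    gaps = []
    for run in re.finditer(r"-+", aligned):
        if run.start() == 0:
            where = "leading"    # before the first character
        elif run.end() == len(aligned):
            where = "trailing"   # after the last character
        else:
            where = "internal"   # inside the string
        gaps.append((run.start(), run.end() - run.start(), where))
    return gaps


print(find_gaps("--ACG-TT---"))
# [(0, 2, 'leading'), (5, 1, 'internal'), (8, 3, 'trailing')]
```

Each tuple describes one gap as a whole, which is the unit that the gap penalties in the following slides are applied to.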
TYPES OF GAPS AND GAP PENALTIES
Types of gap penalties are as follows-
1. Constant
2. Linear
3. Affine
4. Convex
5. Profile-based variable gap penalties
CONSTANT GAP PENALTY
This is the simplest type of gap penalty: a fixed negative score is
given to every gap, regardless of its length.
Example: two short DNA sequences are aligned, with each '-' depicting a
gap of one base pair. If each match is worth 1 point and the whole gap
-1, an alignment with 7 matches and a single gap has a total score of
7 - 1 = 6.
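The arithmetic above can be reproduced with a small Python sketch. The two sequences are an assumption chosen so that the alignment has 7 matches and one gap, consistent with the score in the text; the original figure is not available. Mismatched columns are given a score of 0 here, which is also an assumption.

```python
import re


def score_constant_gap(top: str, bottom: str,
                       match: int = 1, gap_penalty: int = -1) -> int:
    """Score an alignment with one fixed penalty per gap, whatever its length."""
    matches = sum(1 for a, b in zip(top, bottom) if a == b and a != "-")
    # Each maximal run of '-' in either row counts as a single gap.
    n_gaps = len(re.findall(r"-+", top)) + len(re.findall(r"-+", bottom))
    return matches * match + n_gaps * gap_penalty


print(score_constant_gap("ATTGACCTGA", "AT---CCTGA"))  # 6
```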
LINEAR GAP PENALTY
Compared to the constant gap penalty, the linear gap penalty takes into
account the length (L) of each insertion/deletion in the gap.
If the penalty for each inserted/deleted element is B and the length of
the gap is L, the total gap penalty is the product of the two, B × L.
This method favors shorter gaps, with the total score decreasing with
each additional gap position.
Unlike the constant gap penalty, the size of the gap is considered. With
a match score of 1 and a penalty of -1 per gap position, the same
7 matches and a gap of length 3 score 7 - 3 = 4.
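A minimal Python sketch of the linear scheme follows. As before, the sequences are an illustrative assumption (7 matches, one gap of length 3), and mismatches are scored as 0.

```python
def score_linear_gap(top: str, bottom: str,
                     match: int = 1, per_position: int = -1) -> int:
    """Score an alignment with a penalty of B for every gap position (B * L)."""
    matches = sum(1 for a, b in zip(top, bottom) if a == b and a != "-")
    gap_positions = top.count("-") + bottom.count("-")
    return matches * match + gap_positions * per_position


print(score_linear_gap("ATTGACCTGA", "AT---CCTGA"))  # 4
```

The same alignment scores 6 under a constant penalty of -1 per gap, so the only difference is that each of the three gap positions is now charged.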
AFFINE GAP PENALTY
• The most widely used gap penalty function is the affine gap penalty, which
combines the components of both the constant and linear gap penalties, taking
the form A + (B × L).
• This introduces new terms: A is the gap opening penalty, B the gap
extension penalty, and L the length of the gap.
• Gap opening refers to the cost required to open a gap of any length, and gap
extension to the cost of extending the length of an existing gap by 1.
• The affine gap penalty encourages the extension of existing gaps rather
than the introduction of new gaps.
• It is often unclear what the values of A and B should be,
as they differ according to purpose. In general, if the interest
is in finding closely related matches (e.g. removal of vector
sequence during genome sequencing), a higher gap penalty
should be used to reduce gap openings.
• On the other hand, the gap penalty should be lowered when
the interest is in finding a more distant match.
• The relationship between A and B also influences gap size.
If the size of the gap is important, a small A and a large B
(costlier to extend a gap) are used, and vice versa.
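The affine form A + (B × L) can be sketched in Python as below. The values A = 5 and B = 1 and the two alignments are hypothetical, chosen only to show why one long gap is cheaper than several short ones under this scheme; penalties are expressed as positive costs.

```python
import re


def affine_penalty(length: int, gap_open: float, gap_extend: float) -> float:
    """Penalty for one gap of the given length: A + B * L."""
    return gap_open + gap_extend * length


def total_affine_penalty(top: str, bottom: str,
                         gap_open: float, gap_extend: float) -> float:
    """Sum the affine penalty over every maximal run of '-' in both rows."""
    runs = re.findall(r"-+", top) + re.findall(r"-+", bottom)
    return sum(affine_penalty(len(run), gap_open, gap_extend) for run in runs)


# One gap of length 3 versus three separate gaps of length 1.
print(total_affine_penalty("ATTGACCTGA", "AT---CCTGA", 5, 1))  # 8
print(total_affine_penalty("ATTGACCTGA", "AT-GA-CT-A", 5, 1))  # 18
```

Both alignments contain three gap positions, but the second pays the opening cost A three times, which is how the affine penalty favors extending an existing gap.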
CONVEX GAP PENALTY
• Using the affine gap penalty requires assigning fixed penalty values for
both opening and extending a gap. This can be too rigid for use in a
biological context.
• The logarithmic gap penalty takes the form G(L) = A + C ln L and was
proposed because studies had shown that the distribution of indel sizes
obeys a power law.
• Another issue raised with affine gaps is that they favor alignments
with shorter gaps.
• The logarithmic gap penalty was devised to modify the affine gap penalty
so that long gaps are penalized less heavily. In practice, however,
logarithmic models have been found to produce poorer alignments than
affine models.
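To make the difference between the affine and logarithmic forms concrete, the sketch below tabulates both for a few gap lengths. The parameter values (A = 5, B = 1, C = 1) are arbitrary assumptions for illustration.

```python
import math


def affine_penalty(length: int, a: float, b: float) -> float:
    """Affine form: A + B * L."""
    return a + b * length


def log_penalty(length: int, a: float, c: float) -> float:
    """Logarithmic (convex) form: G(L) = A + C * ln(L), for L >= 1."""
    return a + c * math.log(length)


for length in (1, 2, 5, 10, 50):
    print(length,
          affine_penalty(length, 5.0, 1.0),
          round(log_penalty(length, 5.0, 1.0), 2))
```

Under the affine form every extra gap position costs the same amount B, whereas under the logarithmic form each additional position costs less than the one before, so long gaps are penalized comparatively lightly.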
PROFILE-BASED VARIABLE GAP PENALTIES
• Profile-profile alignment algorithms are powerful tools for detecting protein
homology relationships with improved alignment accuracy.
• Profile-profile alignments are based on the statistical indel frequency
profiles from multiple sequence alignments generated by PSI-BLAST
searches.
• Rather than using substitution matrices to measure the similarity of amino
acid pairs, profile-profile alignment methods require a profile-based scoring
function to measure the similarity of profile vector pairs.
• Profile-profile alignments employ gap penalty functions.
• The gap information is usually used in the form of indel frequency profiles,
which are more specific to the sequences being aligned.
• ClustalW and MAFFT adopted this kind of gap penalty
determination for their multiple sequence alignments.
• Alignment accuracy can be improved using this model,
especially for proteins with low sequence identity.
• Some profile-profile alignment algorithms also use
secondary structure information as one term in their scoring
functions, which improves alignment accuracy.
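The idea of deriving position-specific gap penalties from an indel frequency profile can be sketched as follows. This is a deliberately simplified, hypothetical scheme: the scaling rule `base_open * (1 - gap frequency)`, the `floor` parameter, and the toy alignment are all assumptions for illustration and are not the formulas used by ClustalW, MAFFT, or any specific profile-profile method.

```python
def gap_frequencies(msa: list[str]) -> list[float]:
    """Fraction of rows that have a gap in each column of the alignment."""
    width = len(msa[0])
    if any(len(row) != width for row in msa):
        raise ValueError("all rows of the alignment must have equal length")
    return [sum(row[col] == "-" for row in msa) / len(msa)
            for col in range(width)]


def position_specific_open_penalties(msa: list[str],
                                     base_open: float = 10.0,
                                     floor: float = 0.1) -> list[float]:
    """Lower the gap opening penalty where gaps are frequent in the profile.

    Hypothetical rule: penalty = base_open * max(floor, 1 - gap frequency).
    """
    return [base_open * max(floor, 1.0 - freq)
            for freq in gap_frequencies(msa)]


toy_msa = ["VCGED", "VCG-D", "VC--D", "VCGED"]
print(gap_frequencies(toy_msa))                   # [0.0, 0.0, 0.25, 0.5, 0.0]
print(position_specific_open_penalties(toy_msa))  # [10.0, 10.0, 7.5, 5.0, 10.0]
```

Columns where related sequences frequently carry indels become cheaper places to open a gap, while conserved columns keep the full penalty.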
HIGHLIGHTS OF GAP AND GAP PENALTY
1. By inserting a gap into a sequence alignment it is
possible to achieve a good residue-to-residue alignment at some
other neighboring point in the sequence.
2. Insertion of gaps into pairwise sequence alignments allows the
alignment to be extended into regions where one sequence may
have lost or gained sequence characters not found in the other.
3. A penalty is subtracted for each gap introduced into an
alignment because the gap introduces uncertainty into the
alignment.
4. The gap penalty is used to help decide whether or not to
accept a gap.
5. If the gap penalty is very low or omitted (so that gaps can be
introduced at any position), a high alignment score is achievable even
between unrelated or random sequences, which is not desired.
6. Genetically, it is expected that a protein will accept a different
residue in a position rather than having parts of its sequence
chopped away or inserted.
7. Gaps (deletions or insertions) should therefore
be rarer than point mutations or substitutions. Gaps are still
introduced into alignments to optimize the alignment score.
8. It may be concluded that a variety of gap penalties (from zero
to some significant punishment) must be tried, and from these one
must determine the effect each has on the result.
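Points 5 and 8 can be explored with a small experiment: compute the optimal global alignment score of two sequences for several gap penalties and see how the score responds. The sketch below uses a score-only dynamic programming routine with a per-position gap penalty; the two sequences are arbitrary strings chosen for illustration, and the match and mismatch scores are assumptions.

```python
def global_score(a: str, b: str, match: int = 1,
                 mismatch: int = -1, gap: int = -1) -> int:
    """Optimal global alignment score (per-position gap penalty), score only."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # align residues
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]


# Two arbitrary sequences with no intended relationship.
seq_x, seq_y = "GATTACAGATTACA", "CCTGTGCCTTGGCA"
for gap in (0, -1, -2, -4):
    print(gap, global_score(seq_x, seq_y, gap=gap))
```

As the gap penalty approaches zero the best achievable score can only stay the same or rise, because gaps become free to place wherever they create extra matches. This is why a very low penalty can make even unrelated sequences appear to align well.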
EXAMPLE OF ASSIGNING GAPS AND GAP
PENALTIES
• This example is an extension of Advanced Dynamic Programming. The scores
used are +2 for a match, -2 for a gap, and -1 for a mismatch.
Fig.: alignment scored with the regular gap penalty.
Fig.: assignment of affine gap penalties to the first alignment.
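Since the figures are not reproduced here, the following Python sketch illustrates the same idea with made-up data. It uses the scores stated above (+2 match, -1 mismatch, -2 per gap position for the regular penalty). The two sequences, the two alternative alignments, and the affine parameters (opening -2, extension -1 per position) are hypothetical and are not taken from the original figures.

```python
import re

MATCH, MISMATCH, GAP = 2, -1, -2  # scores stated in the text


def column_score(top: str, bottom: str) -> int:
    """Score of the columns that contain no gap."""
    total = 0
    for a, b in zip(top, bottom):
        if a != "-" and b != "-":
            total += MATCH if a == b else MISMATCH
    return total


def gap_runs(top: str, bottom: str) -> list[int]:
    """Lengths of all maximal runs of '-' in either row."""
    return [len(run) for row in (top, bottom)
            for run in re.findall(r"-+", row)]


def regular_score(top: str, bottom: str) -> int:
    """Regular penalty: GAP for every gap position."""
    return column_score(top, bottom) + GAP * sum(gap_runs(top, bottom))


def affine_score(top: str, bottom: str,
                 gap_open: int = -2, gap_extend: int = -1) -> int:
    """Affine penalty: gap_open + gap_extend * L for each gap of length L."""
    penalty = sum(gap_open + gap_extend * n for n in gap_runs(top, bottom))
    return column_score(top, bottom) + penalty


# Two ways of writing the same pair of sequences (ACCTTG and ACTG).
one_gap = ("ACCTTG", "AC--TG")   # a single gap of length 2
two_gaps = ("ACCTTG", "A-CT-G")  # two gaps of length 1

print(regular_score(*one_gap), regular_score(*two_gaps))  # 4 4
print(affine_score(*one_gap), affine_score(*two_gaps))    # 4 2
```

Under the regular penalty the two ways of writing the alignment score the same, while the affine penalty prefers the version with a single longer gap.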
• Fig.: the regular gap penalty alignment can also be written like this without
changing the score.
• Fig.: rescoring of the second alignment using affine gap penalties.
REFERENCES (PART 1 & 2)
1. Point accepted mutation. https://en.wikipedia.org/wiki/Point_accepted_mutation
2. Adansonian Classification - Medical Definition from MediLexicon.
www.medilexicon.com/dictionary/18016
3. S.C. Rastogi, Namita Mendiratta, Parag Rastogi. Bioinformatics Concepts,
Skills & Applications. CBS Publishers & Distributors, New Delhi.
http://www.cbspd.com
4. D.R. Westhead, J.H. Parish and R.M. Twyman. Instant Notes Bioinformatics.
Viva Books Private Limited.
5. https://en.m.wikipedia.org/wiki/Simple_matching_coefficient
6. Sequence alignment. https://www.bioinformatics.org/wiki/Sequence_alignment
7. Gap penalty. https://en.wikipedia.org/wiki/Gap_penalty#Types
8. N.J. Chikhale and V.S. Gomase. Bioinformatics Theory and Practice.
Himalaya Publishing House. www.himpub.com
