Course: B.Sc Biochemistry
Subject: Basic of Bioinformatics
Unit: III
OUTLINE
 Sequence Alignment
 Scoring Alignments and Substitution Matrices
 Inserting Gaps
 Dynamic Programming
 Database Searches
Sequence Alignment
 Comparing sequences for
– Similarity
– Homology
 Prediction of function of genes and proteins
 Construction of phylogeny
 Finding motifs
Sequence Alignment - HOMOLOGY
 Orthologues : any gene pairwise relation
where the ancestor node is a speciation
event. Often have similar function
 Paralogues : any gene pairwise relation
where the ancestor node is a duplication
event. Paralogs tend to have different
functions
Sequence Alignment - HOMOLOGY
1.
Sequence Alignment - HOMOLOGY
2.
Sequence Alignment - PHYLOGENY
3.
Sequence Alignment – PROTEIN
FUNCTIONS
4.
Scoring Alignments and Substitution
Matrices
 The quality of an alignment is measured by
giving it a quantitative score
 The simplest way of quatifying similarity
between two sequences is percentage
identity.
– Simply measured by counting the number of
identical bases or amino acids matched
between the aligned sequences.
Scoring Alignments and Substitution
Matrices
 The dot-plot
gives a visual
assesment of
similarity based
on identity.
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]5.
Scoring Alignments and Substitution
Matrices
 Percentage identity is a relatively crude
measure and does bot give a complete
picture of the degree of similarity of two
sequences.
 Scoring identical matches 1 and mismatches
as 0 ignores the fact that the type of amino
acids involved is highly significant.
Scoring Alignments and Substitution
Matrices
 Genuine matches may not be identical:
Seq1: T H I S I S A S E Q U E N C E
Seq1: T H A T _ _ _ S E Q U E N C E
Isoleucine – Alanine: both hydrophobic
Serine – Threonine : both polar
Scoring Alignments and Substitution
Matrices
 Scoring pairs of amino acids:
– with similar properties  higher scores
– With different properties  lower scores
Scoring Alignments and Substitution
Matrices
 To assign scores for alignmens use
SUBSTITUTION MATRICES
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]
5.
Scoring Alignments and Substitution
Matrices
 Different types of substitution matrices are
being used based on:
– The number of mutations required for
convertion of one amino acid to the other
– Similarities in physicochemical properties.
Scoring Alignments and Substitution
Matrices
 PAM substitution matrices:
– Use closely related protein sequences to
derive substitution frequencies
– Accepted Point Mutations per 100 residues
 250 PAM  250 mutation on 100 residues
Scoring Alignments and Substitution
Matrices
 BLOSUM substitution matrices:
– BLOcks of Amino Acid SUbstitution Matrix
– Use mutation data from highly conserved
local regions
– BLOSUM 62  62% identity
Scoring Alignments and Substitution
Matrices
 Which matrix to use ?
– Depends on the problem properties,
– Distantly related sequences : PAM 250 –
BLOSUM 50
– Closely related sequences: PAM 120,
BLOSUM 80
Scoring Alignments and Substitution
Matrices
 Which matrix to use ?
– Some special purpose matrices (SLIM and
PHAT are designed for membrane proteins)
– The length of the sequende is important
 Short sequences  PAM 40 or BLOSUM 80
 Long sequences  PAM 250 or BLOSUM 50
Scoring Alignments and Substitution
Matrices
 BLOSUM – 62 and PAM 120
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum] 6.
Inserting Gaps
 Gap insertion requires a scoring penalty (gap
penalty).
 To achieve correct matches gaps are
required
 Alignment programs use gap penalties to
limit the introduction of gaps in the
alignments
Inserting Gaps
 Insertions tend to be several residues long
rather than just a single residue long
– Fewer insertions and deletions occur in sequences
of structural importance
– Smaller penalty on lengthening an existing gap
(gap extension penalty) than introducing a new
gap
– Gap penaly is high  the number of gaps will be
decreased
– Gap penalty is low  more and large gaps will be
inserted.
Inserting Gaps
 Choosing gap penalties:
– Linear
– Affine
 Gap open penalty
 Gap extension penlty
Dynamic Programming
 Global and Local alignments
 Pairwise and Multiple alignments
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum] 7.
 For a pair of sequences there is a large
number of possible alignments.
 2 sequences of length 1000 have
appriximately 10600
different alignments.
Dynamic Programming
 Dynamic Programming:
– Problem can be divided into many smaller parts.
– Optimal alignment will not contain parts that are
not themselves optimal.
– Start from sufficiently short sub-sequences.
– Alignement is additive:
Dynamic Programming
 Needleman and Wunsch were the first to
propose this method.
 Find optimal global alignments.
 Align sequences:
– Seq1: x (x1x2x3…xm)
– Seq1: y (y1y2y3…yn)
Dynamic Programming
 s(a,b) = score of aligning a and b
 F(i,j) = optimal similarity of X(1:i) and Y(1:j)
 Recurrence relation:
– F(i,0) = Σ s(X(k), gap), 0 <= k <= i
– F(0,j) =Σ s(gap, B(k)), 0 <= k <= j
– F(i,j) = max [ F(i,j-1) + s(gap,Y(j),
F(i-1,j) + s(X(i),gap),
F(i-1, j-1) + s(X(i), Y(j)]
– Assume linear gap penalty
Dynamic Programming
Dynamic Programming
 Matrix S of optimal scores of sub-sequence
alignments.
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]
9.
Dynamic Programming
S(I, T) = -1,
10.
Dynamic Programming
S(I, H) = -3,
S(I, gap) = -8,
S(gap, H) = -8
Recurrence relation:
F(i,j) = max [ F(i,j-1) + s(gap,Y(j),
F(i-1,j) + s(X(i),gap),
F(i-1, j-1) + s(X(i), Y(j)]
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]
11.
Dynamic Programming
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]
12.
Dynamic Programming
–Linear gap penalty (E=4)
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]
13.
Dynamic Programming
 Semi – global alignment:
– When we treat terminal gaps differently than
internal gaps
– How to modify dynamic programming to be able
to make semi – global alignment ?
Dynamic Programming
 Local alignment:
– If we compare a sequence to whole genome
– Find sub-strings whose optimal global
alignment value is maximum
Dynamic Programming
 What is the difference between global and
local alignment ?
 Can we define the recuernce relation of local
alignment similar to global alignment ?
Recurrence relation of GLOBAL ALIGNMENT:
(Needleman & Wunsch)
– F(i,0) = Σ s(X(k), gap), 0 <= k <= i
– F(0,j) =Σ s(gap, B(k)), 0 <= k <= j
– F(i,j) = max [ F(i,j-1) + s(gap,Y(j),
F(i-1,j) + s(X(i),gap),
F(i-1, j-1) + s(X(i), Y(j)]
Dynamic Programming
Recurrence relation of LOCAL ALIGNMENT:
(Smith-Waterman)
– F(i,0) = 0
– F(0,j) = 0
– F(i,j) = max [ 0,
F(i,j-1) + s(gap,Y(j),
F(i-1,j) + s(X(i),gap),
F(i-1, j-1) + s(X(i), Y(j)]
Dynamic Programming
Database Searches
 FASTA and BLAST
 Use some heuristics
 Dynamic Programming Complexity
– Time O(n*m)
– Space O(n*m)
Database Searches FASTA
 Good local alignment should have some
exact match subsequence.
 Find all k-tuples. (k=1-2 for proteins, 3-6 for
DNA sequences)
 Protein k – tuples  nc, sp, … (k = 2)
 Nucleotide k – tuples  TAAA, CTCC,…(k = 4)
Database Searches FASTA
 If k = 3 for nucleotide sequences.
– There will be 64 possible k – tuples
– Assign a number e( ):
 e(A) = 0, e(C) = 1, e(G) = 2, e(T) = 3
 Each 3 – tuples are represented as  xi xi+1xi+2
 Assign a number to each 3 – tuple
– Ci = e(xi)42
+ e(xi+1)41
+ e(xi+2)40
– For example: AAA 
 AAA 042
+ 041
+ 040
= 0
 CAA 142
+ 041
+ 040
= 16
Database Searches FASTA
 Find each occurance of k – tuples in the
sequences.
 Chaining  Look – Up Tables
 Consider TAAAACTCTAAC (if k = 3):
3 - tuples Position
AAA (0) 2, 3
AAC (1) 4, 10
AAG (2) 0
AAT (3) 0
… …
Database Searches BLAST
 Use short words to search the database
sequence.
 Searches for k – mers that will score above a
threshold (T) value when aligned with query k -
mer (Remember FASTA looks for k – tuples
which are identical).
 Use a scheme based on finite state automata
(Remember FASTA use hashing and chaining
fot rapid identification of k - tuples)
Database Searches BLAST
 From Query Sequence, create query words
(for protein sequences word size is 3)
Database Searches BLAST
 Blast uses a list of high scoring words created
from words similar to query words. Considers
the words with a score bigger than a threshold
value.
Database Searches BLAST
 Scan each database sequence for an exact
match to the list of words.
 Word hits are then extended in either direction
in an attempt to generate an alignment with a
score exceeding the threshold of "S".
Database Searches BLAST
 Keep only the extended matches that have a
score at least S.
 Determine statistical significance of each
remaining match.
Database Searches BLAST
http://blast.ncbi.nlm.nih.gov/Blast.cgi
1.
14.
Database Searches BLAST
15.
Database Searches BLAST
16.
Database Searches BLAST
17.
Database Searches BLAST
18.
Database Searches HISTORY
 1970: NW
 1980: SW
 1985: FASTA
 1989: BLAST
Books and Web References
 Books Name :
1. Introduction To Bioinformatics by T. K. Attwood
2. BioInformatics by Sangita
3. Basic Bioinformatics by S.Ignacimuthu, s.j.
 http://en.wikipedia.org/wiki/Sequence_alignment
 http://pages.cs.wisc.edu/~bsettles/ibs08/lectures/02-alignment.pdf
 http://www.ks.uiuc.edu/Training/Tutorials/science/bioinformatics-
tutorial/bioinformatics.pdf
 M. Zvelebil, J. O. Baum, “Understanding Bioinformatics”, 2008,
Garland Science
 Andreas D. Baxevanis, B.F. Francis Ouellette, “Bioinformatics: A
practical guide to the analysis of genes and proteins”, 2001,
Wiley.54
Images References
 1.http://gorbi.irb.hr/files/5712/7497/9729/Slide09.jpg
 2.http://www.ensembl.org/info/genome/compara/tree_exa
mple1.png
 3.http://www.nature.com/nature/journal/v496/n7445/imag
es/nature12027-f1.2.jpg
 4.
http://upload.wikimedia.org/wikipedia/commons/e/e6/Spo
mbe_Pop2p_protein_structure_rainbow.png
 5. & 6. Book: Basic Bioinformatics by S.Ignacimuthu, s.j.
 7. to 13. Book: Basic Bioinformatics by S.Ignacimuthu, s.j.
 14. to 18. http://blast.ncbi.nlm.nih.gov/Blast.cgi

B.sc biochem i bobi u 3.1 sequence alignment

  • 1.
    Course: B.Sc Biochemistry Subject:Basic of Bioinformatics Unit: III
  • 2.
    OUTLINE  Sequence Alignment Scoring Alignments and Substitution Matrices  Inserting Gaps  Dynamic Programming  Database Searches
  • 3.
    Sequence Alignment  Comparingsequences for – Similarity – Homology  Prediction of function of genes and proteins  Construction of phylogeny  Finding motifs
  • 4.
    Sequence Alignment -HOMOLOGY  Orthologues : any gene pairwise relation where the ancestor node is a speciation event. Often have similar function  Paralogues : any gene pairwise relation where the ancestor node is a duplication event. Paralogs tend to have different functions
  • 5.
  • 6.
  • 7.
  • 8.
    Sequence Alignment –PROTEIN FUNCTIONS 4.
  • 9.
    Scoring Alignments andSubstitution Matrices  The quality of an alignment is measured by giving it a quantitative score  The simplest way of quatifying similarity between two sequences is percentage identity. – Simply measured by counting the number of identical bases or amino acids matched between the aligned sequences.
  • 10.
    Scoring Alignments andSubstitution Matrices  The dot-plot gives a visual assesment of similarity based on identity. [“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]5.
  • 11.
    Scoring Alignments andSubstitution Matrices  Percentage identity is a relatively crude measure and does bot give a complete picture of the degree of similarity of two sequences.  Scoring identical matches 1 and mismatches as 0 ignores the fact that the type of amino acids involved is highly significant.
  • 12.
    Scoring Alignments andSubstitution Matrices  Genuine matches may not be identical: Seq1: T H I S I S A S E Q U E N C E Seq1: T H A T _ _ _ S E Q U E N C E Isoleucine – Alanine: both hydrophobic Serine – Threonine : both polar
  • 13.
    Scoring Alignments andSubstitution Matrices  Scoring pairs of amino acids: – with similar properties  higher scores – With different properties  lower scores
  • 14.
    Scoring Alignments andSubstitution Matrices  To assign scores for alignmens use SUBSTITUTION MATRICES [“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum] 5.
  • 15.
    Scoring Alignments andSubstitution Matrices  Different types of substitution matrices are being used based on: – The number of mutations required for convertion of one amino acid to the other – Similarities in physicochemical properties.
  • 16.
    Scoring Alignments andSubstitution Matrices  PAM substitution matrices: – Use closely related protein sequences to derive substitution frequencies – Accepted Point Mutations per 100 residues  250 PAM  250 mutation on 100 residues
  • 17.
    Scoring Alignments andSubstitution Matrices  BLOSUM substitution matrices: – BLOcks of Amino Acid SUbstitution Matrix – Use mutation data from highly conserved local regions – BLOSUM 62  62% identity
  • 18.
    Scoring Alignments andSubstitution Matrices  Which matrix to use ? – Depends on the problem properties, – Distantly related sequences : PAM 250 – BLOSUM 50 – Closely related sequences: PAM 120, BLOSUM 80
  • 19.
    Scoring Alignments andSubstitution Matrices  Which matrix to use ? – Some special purpose matrices (SLIM and PHAT are designed for membrane proteins) – The length of the sequende is important  Short sequences  PAM 40 or BLOSUM 80  Long sequences  PAM 250 or BLOSUM 50
  • 20.
    Scoring Alignments andSubstitution Matrices  BLOSUM – 62 and PAM 120 [“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum] 6.
  • 21.
    Inserting Gaps  Gapinsertion requires a scoring penalty (gap penalty).  To achieve correct matches gaps are required  Alignment programs use gap penalties to limit the introduction of gaps in the alignments
  • 22.
    Inserting Gaps  Insertionstend to be several residues long rather than just a single residue long – Fewer insertions and deletions occur in sequences of structural importance – Smaller penalty on lengthening an existing gap (gap extension penalty) than introducing a new gap – Gap penaly is high  the number of gaps will be decreased – Gap penalty is low  more and large gaps will be inserted.
  • 23.
    Inserting Gaps  Choosinggap penalties: – Linear – Affine  Gap open penalty  Gap extension penlty
  • 24.
    Dynamic Programming  Globaland Local alignments  Pairwise and Multiple alignments [“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum] 7.
  • 25.
     For apair of sequences there is a large number of possible alignments.  2 sequences of length 1000 have appriximately 10600 different alignments. Dynamic Programming
  • 26.
     Dynamic Programming: –Problem can be divided into many smaller parts. – Optimal alignment will not contain parts that are not themselves optimal. – Start from sufficiently short sub-sequences. – Alignement is additive: Dynamic Programming
  • 27.
     Needleman andWunsch were the first to propose this method.  Find optimal global alignments.  Align sequences: – Seq1: x (x1x2x3…xm) – Seq1: y (y1y2y3…yn) Dynamic Programming
  • 28.
     s(a,b) =score of aligning a and b  F(i,j) = optimal similarity of X(1:i) and Y(1:j)  Recurrence relation: – F(i,0) = Σ s(X(k), gap), 0 <= k <= i – F(0,j) =Σ s(gap, B(k)), 0 <= k <= j – F(i,j) = max [ F(i,j-1) + s(gap,Y(j), F(i-1,j) + s(X(i),gap), F(i-1, j-1) + s(X(i), Y(j)] – Assume linear gap penalty Dynamic Programming
  • 29.
    Dynamic Programming  MatrixS of optimal scores of sub-sequence alignments. [“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum] 9.
  • 30.
  • 31.
    Dynamic Programming S(I, H)= -3, S(I, gap) = -8, S(gap, H) = -8 Recurrence relation: F(i,j) = max [ F(i,j-1) + s(gap,Y(j), F(i-1,j) + s(X(i),gap), F(i-1, j-1) + s(X(i), Y(j)] [“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum] 11.
  • 32.
  • 33.
    Dynamic Programming –Linear gappenalty (E=4) [“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum] 13.
  • 34.
    Dynamic Programming  Semi– global alignment: – When we treat terminal gaps differently than internal gaps – How to modify dynamic programming to be able to make semi – global alignment ?
  • 35.
    Dynamic Programming  Localalignment: – If we compare a sequence to whole genome – Find sub-strings whose optimal global alignment value is maximum
  • 36.
    Dynamic Programming  Whatis the difference between global and local alignment ?  Can we define the recuernce relation of local alignment similar to global alignment ?
  • 37.
    Recurrence relation ofGLOBAL ALIGNMENT: (Needleman & Wunsch) – F(i,0) = Σ s(X(k), gap), 0 <= k <= i – F(0,j) =Σ s(gap, B(k)), 0 <= k <= j – F(i,j) = max [ F(i,j-1) + s(gap,Y(j), F(i-1,j) + s(X(i),gap), F(i-1, j-1) + s(X(i), Y(j)] Dynamic Programming
  • 38.
    Recurrence relation ofLOCAL ALIGNMENT: (Smith-Waterman) – F(i,0) = 0 – F(0,j) = 0 – F(i,j) = max [ 0, F(i,j-1) + s(gap,Y(j), F(i-1,j) + s(X(i),gap), F(i-1, j-1) + s(X(i), Y(j)] Dynamic Programming
  • 39.
    Database Searches  FASTAand BLAST  Use some heuristics  Dynamic Programming Complexity – Time O(n*m) – Space O(n*m)
  • 40.
    Database Searches FASTA Good local alignment should have some exact match subsequence.  Find all k-tuples. (k=1-2 for proteins, 3-6 for DNA sequences)  Protein k – tuples  nc, sp, … (k = 2)  Nucleotide k – tuples  TAAA, CTCC,…(k = 4)
  • 41.
    Database Searches FASTA If k = 3 for nucleotide sequences. – There will be 64 possible k – tuples – Assign a number e( ):  e(A) = 0, e(C) = 1, e(G) = 2, e(T) = 3  Each 3 – tuples are represented as  xi xi+1xi+2  Assign a number to each 3 – tuple – Ci = e(xi)42 + e(xi+1)41 + e(xi+2)40 – For example: AAA   AAA 042 + 041 + 040 = 0  CAA 142 + 041 + 040 = 16
  • 42.
    Database Searches FASTA Find each occurance of k – tuples in the sequences.  Chaining  Look – Up Tables  Consider TAAAACTCTAAC (if k = 3): 3 - tuples Position AAA (0) 2, 3 AAC (1) 4, 10 AAG (2) 0 AAT (3) 0 … …
  • 43.
    Database Searches BLAST Use short words to search the database sequence.  Searches for k – mers that will score above a threshold (T) value when aligned with query k - mer (Remember FASTA looks for k – tuples which are identical).  Use a scheme based on finite state automata (Remember FASTA use hashing and chaining fot rapid identification of k - tuples)
  • 44.
    Database Searches BLAST From Query Sequence, create query words (for protein sequences word size is 3)
  • 45.
    Database Searches BLAST Blast uses a list of high scoring words created from words similar to query words. Considers the words with a score bigger than a threshold value.
  • 46.
    Database Searches BLAST Scan each database sequence for an exact match to the list of words.  Word hits are then extended in either direction in an attempt to generate an alignment with a score exceeding the threshold of "S".
  • 47.
    Database Searches BLAST Keep only the extended matches that have a score at least S.  Determine statistical significance of each remaining match.
  • 48.
  • 49.
  • 50.
  • 51.
  • 52.
  • 53.
    Database Searches HISTORY 1970: NW  1980: SW  1985: FASTA  1989: BLAST
  • 54.
    Books and WebReferences  Books Name : 1. Introduction To Bioinformatics by T. K. Attwood 2. BioInformatics by Sangita 3. Basic Bioinformatics by S.Ignacimuthu, s.j.  http://en.wikipedia.org/wiki/Sequence_alignment  http://pages.cs.wisc.edu/~bsettles/ibs08/lectures/02-alignment.pdf  http://www.ks.uiuc.edu/Training/Tutorials/science/bioinformatics- tutorial/bioinformatics.pdf  M. Zvelebil, J. O. Baum, “Understanding Bioinformatics”, 2008, Garland Science  Andreas D. Baxevanis, B.F. Francis Ouellette, “Bioinformatics: A practical guide to the analysis of genes and proteins”, 2001, Wiley.54
  • 55.
    Images References  1.http://gorbi.irb.hr/files/5712/7497/9729/Slide09.jpg 2.http://www.ensembl.org/info/genome/compara/tree_exa mple1.png  3.http://www.nature.com/nature/journal/v496/n7445/imag es/nature12027-f1.2.jpg  4. http://upload.wikimedia.org/wikipedia/commons/e/e6/Spo mbe_Pop2p_protein_structure_rainbow.png  5. & 6. Book: Basic Bioinformatics by S.Ignacimuthu, s.j.  7. to 13. Book: Basic Bioinformatics by S.Ignacimuthu, s.j.  14. to 18. http://blast.ncbi.nlm.nih.gov/Blast.cgi