B.sc biochem i bobi u 3.1 sequence alignment

Course: B.Sc Biochemistry
Subject: Basic of Bioinformatics
Unit: III

OUTLINE
 Sequence Alignment
 Scoring Alignments and Substitution Matrices
 Inserting Gaps
 Dynamic Programming
 Database Searches

Sequence Alignment
 Comparing sequences for
– Similarity
– Homology
 Prediction of function of genes and proteins
 Construction of phylogeny
 Finding motifs

Sequence Alignment - HOMOLOGY
 Orthologues : any gene pairwise relation
where the ancestor node is a speciation
event. Often have similar function
 Paralogues : any gene pairwise relation
where the ancestor node is a duplication
event. Paralogs tend to have different
functions

1.

2.

Sequence Alignment - PHYLOGENY
3.

Sequence Alignment – PROTEIN
FUNCTIONS
4.

Scoring Alignments and Substitution
Matrices
 The quality of an alignment is measured by
giving it a quantitative score
 The simplest way of quatifying similarity
between two sequences is percentage
identity.
– Simply measured by counting the number of
identical bases or amino acids matched
between the aligned sequences.

Matrices
 The dot-plot
gives a visual
assesment of
similarity based
on identity.
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]5.

Matrices
 Percentage identity is a relatively crude
measure and does bot give a complete
picture of the degree of similarity of two
sequences.
 Scoring identical matches 1 and mismatches
as 0 ignores the fact that the type of amino
acids involved is highly significant.

Matrices
 Genuine matches may not be identical:
Seq1: T H I S I S A S E Q U E N C E
Seq1: T H A T _ _ _ S E Q U E N C E
Isoleucine – Alanine: both hydrophobic
Serine – Threonine : both polar

Matrices
 Scoring pairs of amino acids:
– with similar properties  higher scores
– With different properties  lower scores

Matrices
 To assign scores for alignmens use
SUBSTITUTION MATRICES
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]
5.

Matrices
 Different types of substitution matrices are
being used based on:
– The number of mutations required for
convertion of one amino acid to the other
– Similarities in physicochemical properties.

Matrices
 PAM substitution matrices:
– Use closely related protein sequences to
derive substitution frequencies
– Accepted Point Mutations per 100 residues
 250 PAM  250 mutation on 100 residues

Matrices
 BLOSUM substitution matrices:
– BLOcks of Amino Acid SUbstitution Matrix
– Use mutation data from highly conserved
local regions
– BLOSUM 62  62% identity

Matrices
 Which matrix to use ?
– Depends on the problem properties,
– Distantly related sequences : PAM 250 –
BLOSUM 50
– Closely related sequences: PAM 120,
BLOSUM 80

Matrices
 Which matrix to use ?
– Some special purpose matrices (SLIM and
PHAT are designed for membrane proteins)
– The length of the sequende is important
 Short sequences  PAM 40 or BLOSUM 80
 Long sequences  PAM 250 or BLOSUM 50

Matrices
 BLOSUM – 62 and PAM 120
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum] 6.

Inserting Gaps
 Gap insertion requires a scoring penalty (gap
penalty).
 To achieve correct matches gaps are
required
 Alignment programs use gap penalties to
limit the introduction of gaps in the
alignments

Inserting Gaps
 Insertions tend to be several residues long
rather than just a single residue long
– Fewer insertions and deletions occur in sequences
of structural importance
– Smaller penalty on lengthening an existing gap
(gap extension penalty) than introducing a new
gap
– Gap penaly is high  the number of gaps will be
decreased
– Gap penalty is low  more and large gaps will be
inserted.

Inserting Gaps
 Choosing gap penalties:
– Linear
– Affine
 Gap open penalty
 Gap extension penlty

Dynamic Programming
 Global and Local alignments
 Pairwise and Multiple alignments
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum] 7.

 For a pair of sequences there is a large
number of possible alignments.
 2 sequences of length 1000 have
appriximately 10600
different alignments.
Dynamic Programming

 Dynamic Programming:
– Problem can be divided into many smaller parts.
– Optimal alignment will not contain parts that are
not themselves optimal.
– Start from sufficiently short sub-sequences.
– Alignement is additive:
Dynamic Programming

 Needleman and Wunsch were the first to
propose this method.
 Find optimal global alignments.
 Align sequences:
– Seq1: x (x1x2x3…xm)
– Seq1: y (y1y2y3…yn)
Dynamic Programming

 s(a,b) = score of aligning a and b
 F(i,j) = optimal similarity of X(1:i) and Y(1:j)
 Recurrence relation:
– F(i,0) = Σ s(X(k), gap), 0 <= k <= i
– F(0,j) =Σ s(gap, B(k)), 0 <= k <= j
– F(i,j) = max [ F(i,j-1) + s(gap,Y(j),
F(i-1,j) + s(X(i),gap),
F(i-1, j-1) + s(X(i), Y(j)]
– Assume linear gap penalty
Dynamic Programming

Dynamic Programming
 Matrix S of optimal scores of sub-sequence
alignments.
9.

Dynamic Programming
S(I, T) = -1,
10.

Dynamic Programming
S(I, H) = -3,
S(I, gap) = -8,
S(gap, H) = -8
Recurrence relation:
F(i,j) = max [ F(i,j-1) + s(gap,Y(j),
F(i-1,j) + s(X(i),gap),
F(i-1, j-1) + s(X(i), Y(j)]
11.

Dynamic Programming
12.

Dynamic Programming
–Linear gap penalty (E=4)
13.

Dynamic Programming
 Semi – global alignment:
– When we treat terminal gaps differently than
internal gaps
– How to modify dynamic programming to be able
to make semi – global alignment ?

Dynamic Programming
 Local alignment:
– If we compare a sequence to whole genome
– Find sub-strings whose optimal global
alignment value is maximum

Dynamic Programming
 What is the difference between global and
local alignment ?
 Can we define the recuernce relation of local
alignment similar to global alignment ?

Recurrence relation of GLOBAL ALIGNMENT:
(Needleman & Wunsch)
– F(i,0) = Σ s(X(k), gap), 0 <= k <= i
– F(0,j) =Σ s(gap, B(k)), 0 <= k <= j
– F(i,j) = max [ F(i,j-1) + s(gap,Y(j),
F(i-1,j) + s(X(i),gap),
F(i-1, j-1) + s(X(i), Y(j)]
Dynamic Programming

Recurrence relation of LOCAL ALIGNMENT:
(Smith-Waterman)
– F(i,0) = 0
– F(0,j) = 0
– F(i,j) = max [ 0,
F(i,j-1) + s(gap,Y(j),
F(i-1,j) + s(X(i),gap),
F(i-1, j-1) + s(X(i), Y(j)]
Dynamic Programming

Database Searches
 FASTA and BLAST
 Use some heuristics
 Dynamic Programming Complexity
– Time O(n*m)
– Space O(n*m)

Database Searches FASTA
 Good local alignment should have some
exact match subsequence.
 Find all k-tuples. (k=1-2 for proteins, 3-6 for
DNA sequences)
 Protein k – tuples  nc, sp, … (k = 2)
 Nucleotide k – tuples  TAAA, CTCC,…(k = 4)

 If k = 3 for nucleotide sequences.
– There will be 64 possible k – tuples
– Assign a number e( ):
 e(A) = 0, e(C) = 1, e(G) = 2, e(T) = 3
 Each 3 – tuples are represented as  xi xi+1xi+2
 Assign a number to each 3 – tuple
– Ci = e(xi)42
+ e(xi+1)41
+ e(xi+2)40
– For example: AAA 
 AAA 042
+ 041
+ 040
= 0
 CAA 142
+ 041
+ 040
= 16

 Find each occurance of k – tuples in the
sequences.
 Chaining  Look – Up Tables
 Consider TAAAACTCTAAC (if k = 3):
3 - tuples Position
AAA (0) 2, 3
AAC (1) 4, 10
AAG (2) 0
AAT (3) 0
… …

Database Searches BLAST
 Use short words to search the database
sequence.
 Searches for k – mers that will score above a
threshold (T) value when aligned with query k -
mer (Remember FASTA looks for k – tuples
which are identical).
 Use a scheme based on finite state automata
(Remember FASTA use hashing and chaining
fot rapid identification of k - tuples)

 From Query Sequence, create query words
(for protein sequences word size is 3)

 Blast uses a list of high scoring words created
from words similar to query words. Considers
the words with a score bigger than a threshold
value.

 Scan each database sequence for an exact
match to the list of words.
 Word hits are then extended in either direction
in an attempt to generate an alignment with a
score exceeding the threshold of "S".

 Keep only the extended matches that have a
score at least S.
 Determine statistical significance of each
remaining match.

http://blast.ncbi.nlm.nih.gov/Blast.cgi
1.
14.

Database Searches HISTORY
 1970: NW
 1980: SW
 1985: FASTA
 1989: BLAST

Books and Web References
 Books Name :
1. Introduction To Bioinformatics by T. K. Attwood
2. BioInformatics by Sangita
3. Basic Bioinformatics by S.Ignacimuthu, s.j.
 http://en.wikipedia.org/wiki/Sequence_alignment
 http://pages.cs.wisc.edu/~bsettles/ibs08/lectures/02-alignment.pdf
 http://www.ks.uiuc.edu/Training/Tutorials/science/bioinformatics-
tutorial/bioinformatics.pdf
 M. Zvelebil, J. O. Baum, “Understanding Bioinformatics”, 2008,
Garland Science
 Andreas D. Baxevanis, B.F. Francis Ouellette, “Bioinformatics: A
practical guide to the analysis of genes and proteins”, 2001,
Wiley.54

Images References
 1.http://gorbi.irb.hr/files/5712/7497/9729/Slide09.jpg
 2.http://www.ensembl.org/info/genome/compara/tree_exa
mple1.png
 3.http://www.nature.com/nature/journal/v496/n7445/imag
es/nature12027-f1.2.jpg
 4.
http://upload.wikimedia.org/wikipedia/commons/e/e6/Spo
mbe_Pop2p_protein_structure_rainbow.png
 5. & 6. Book: Basic Bioinformatics by S.Ignacimuthu, s.j.
 7. to 13. Book: Basic Bioinformatics by S.Ignacimuthu, s.j.
 14. to 18. http://blast.ncbi.nlm.nih.gov/Blast.cgi

B.sc biochem i bobi u 3.1 sequence alignment

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to B.sc biochem i bobi u 3.1 sequence alignment

Similar to B.sc biochem i bobi u 3.1 sequence alignment (20)

More from Rai University

More from Rai University (20)

B.sc biochem i bobi u 3.1 sequence alignment