SlideShare a Scribd company logo
1 of 60
CS177 Lecture 4
Sequence Alignment
Tom Madej 10.04.04
Overview
• The alignment problem.
• Homology, divergence, convergence.
• Amino acid substitution matrices.
• Pairwise sequence alignment algorithms: dynamic
programming, BLAST.
• Multiple sequence alignment.
• PSI-BLAST, position specific score matrices.
• Databases of multiple sequence alignments.
What is a sequence alignment?
• A linear, one-to-one correspondence between some of
the symbols in one sequence with some of the symbols
in another sequence.
• May be DNA or protein sequences.
• A (sequence) alignment can also be derived from a
superposition of two protein structures, it is then
sometimes called a structure alignment.
Exercise!
• From http://www.ncbi.nlm.nih.gov/Structure/ go to the
MMDB summary page for 1npq.
• Click on the colored bar for chain “A”.
• Scroll down and select the VAST neighbor 1KR7 A.
• Click on “View 3D Structure” at the top to view the
structural superposition using Cn3D.
• The sequence alignment will be shown in a separate
window.
Divergence and Convergence
• Two proteins may be similar because of divergence from
a common ancestor (i.e. they are homologous).
• … or, perhaps two proteins may be similar because they
perform similar functions and are thereby constrained,
even though they arose independently (functional
convergence hypothesis, they are then called
analogous).
Divergence vs. Convergence
• Convergence can occur; e.g. there exist enzymes with
different overall structures but remarkably similar
arrangements of active site residues to carry out a
similar function.
• So how can one establish homology?
Homology
“… whenever statistically significant sequence or structural
similarity between proteins or protein domains is
observed, this is an indication of their divergent evolution
from a common ancestor or, in other words, evidence of
homology.”
E.V. Koonin and M.Y. Galperin, Sequence – Evolution – Function,
Kluwer 2003
Two arguments…
• We see a continuous range of sequence similarity.
Convergence is extremely unlikely for highly similar
protein families. It then appears implausible to invoke it
for less similar families.
• The same or very similar functions may be carried out by
proteins with very different structures (folds). Therefore,
functional constraints cannot force convergence of the
sequence or structure between proteins.
Sequence identity for VAST neighbors of
1NQP A (a globin)
Global and local alignment
• Alignment programs may be modified, e.g. by scoring
parameters, to produce global or local alignments.
• Local alignments tend to be more useful, as it is highly
possible that a significant similarity may encompass only
a portion of one or both sequences being compared.
DNA and proteins
• Much more sensitive comparison is possible between
protein sequences than DNA sequences.
• One reason is that the third codon in the genetic code is
highly redundant, and introduces noise into DNA
comparisons.
• Another reason is that the physico-chemical properties of
the amino acid residues provide information that is highly
relevant to comparison.
Molecular Cell Biology, Lodish et al.
Fig. 3-2.
Molecular Cell Biology, Lodish et. al
Fig. 3-2.
We are faced with several problems…
• How do you quantify amino acid similarity?
• How can you handle gaps in the alignment?
• How do you define the overall similarity score?
• Can you compute an optimal alignment?
• Can you compute an alignment efficiently?
• Can you calculate statistical significance?
Amino acid substitution matrices
• Ab initio approaches; e.g. assign scores based on
number of mutations needed to transform one codon to
another, or on other physico-chemical measures of a.a.
similarity.
• Empirical, i.e. derive statistics about a.a. substitutions
from collections of alignments. (These work the best).
Computation of scores, empirical approach
sij = ln(qij/(pipj))/λ
• sij is the score for substituting amino acid type i by type j.
• qij is the observed frequency of substitutions of a.a. type i
by a.a. type j (in the training set).
• pi is the background frequency for a.a. type i in the
training set.
• λ is a positive constant.
Substitution matrices
• There are many different matrices available.
• The most commonly used are the BLOSUM or PAM
series.
• BLAST defaults to the BLOSUM62 matrix. The
BLOSUM matrices have been shown to be more
sensitive than the PAM matrices for database searches.
Gap penalties
• There is no suitable theory for gap penalties.
• The most common type of gap penalty is the affine gap
penalty: g = a + bx, where a is the gap opening penalty,
b is the gap extension penalty, and x is the number of
gapped-out residues.
• Typical values, e.g. a = 10 and b = 1 for BLAST.
Dynamic programming overview
• Dynamic programming; a computer algorithmic
technique invented in the 1940’s.
• Has applications to many types of problems.
• Key properties of problems solvable with DP: the optimal
solution typically contains optimal solutions to
subproblems, and only a “small” number of subproblems
are needed for the optimal solution. (T.H. Cormen et al.,
Introduction to Algorithms, McGraw-Hill 1990).
Dynamic programming algorithm for computing the
score of the optimal alignment
For a sequence S = a1, a2, …, an let Sj = a1, a2, …, aj
Align(Si,S’
j) = the score of the highest scoring alignment between S1
i,S2
j
S(ai, a’j)= similarity score between amino acids ai and a’j
given by a scoring matrix like PAM, BLOSUM
g – gap penalty
Align(Si,S’
j)= max
Align(Si-1,S’
j-1)+ S(ai, a’j)
Align(Si,S’
j-1) - g
Align(Si-1,S’
j) - g
{
Organizing the computation – dynamic programming table
Align(Si-1,S’
j-1)+ s(ai, a’j)
Align(Si-1,S’
j) - g
Align(Si,S’
j-1) - g
Align(Si,S’
j) = max
{
j
i
Align[i, j] =
Align
+s(ai,aj)
max
Example of DP computation with
g = 0; match = 1; mismatch = 0
Maximal Common Subsequence
A T T G C G C G C A T
A
T
G
C
T
T
A
A
C
C
A
+1
max
0 0 0 0 0 0 0 0 0 0 0 0
0 1 1 1 1 1 1 1 1 1 1 1
0 1 2 2 2 2 2 2 2
0 1 2 2 3
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
initialization
Example of DP computation with
g = 2; match = 2; mismatch = -1
A T T G C G C G C A T
A
T
G
C
T
T
A
A
C
C
A
+2
max
-2
-2
Initialization (penalty for starting with a gap)
0 -2 -4 -6 -8 -10 -12 -14 -16 -18 -20 -22
-2 2 0 -2 -4
-4 0 4 2
-6 -2 2 3
-8 -4
-10
-12
-14
-16
-18
-20
-22
The iterative algorithm
m = |S|; n = |S’|
for i  0 to m do A[i,0]- i * g
for j  0 to n do A[0,j] - j * g
for i  0 to m do
for j  0 to n
A[i,j]max (
A[i-1,j] – g
A[i-1,j-1] + s(i,j)
A[i,j-1] – g
)
return(A[m,n])
Complexity of the DP algorithm
• Time O(nm); space O(nm) where n, m are the lengths of
the two sequences.
• Space complexity can be reduced to O(n) by not storing
the entries of dynamic programming table that are no
longer needed for the computation (keep current row and
the previous row only).
DP technicalities…
• One uses a separate table to trace back the computation
and determine the actual optimal alignment.
• If the gap penalty has different opening and extension
costs, then the algorithm becomes a little more
complicated (cf. Chapter 3 in Mount).
Exercise!
• Copy and paste the sequences for 1nqpA and 1kr7A into
a notepad.
• Go to the web site: http://pir.georgetown.edu
• Select “SSearch Alignment” from the “Search and
Retrieval” pulldown menu.
• Copy in your sequences (use FASTA format and remove
numbers) and then run SSEARCH (Smith-Waterman
algorithm, a DP alignment method).
SSEARCH (pir.georgetown.edu)
BLAST (Basic Local Alignment Search Tool)
• Extremely fast, can be on the order of 50-100 times
faster than Smith-Waterman.
• Method of choice for database searches.
• Statistical theory for significance of results.
• Heuristic; does not guarantee optimal results.
• Many variants, e.g. PHI-, PSI-, RPS-BLAST.
Why database searches?
• Gene finding.
• Assigning likely function to a gene.
• Identifying regulatory elements.
• Understanding genome evolution.
• Assisting in sequence assembly.
• Finding relations between genes.
Issues in database searches
• Speed.
• Relevance of the search results (selectivity).
• Recovering all information of interest (sensitivity).
– The results depend on the search parameters, e.g. gap
penalty, scoring matrix.
– Sometimes searches with more than one matrix should be
performed.
Main idea of BLAST
• Homologous sequences are likely to contain a short high
scoring similarity region, a hit. Each hit gives a seed that
BLAST tries to extend on both sides.
Some BLAST terminology
word – substring of a sequence
word pair – pair of words of the same length
score of a word pair – score of the gapless alignment of the two
words:
V A L M R
V A K N S score = 4 + 4 – 2 – 2 - 1 = 3 (BLOSUM62)
HSP – high scoring word pair
Main steps of BLAST
• Parameters: w = length of a hit; T = min. score of a hit (for
proteins: w = 3, T = 13 (BLOSUM62)).
• Step 1: Given query sequence Q, compile the list of possible
words which form with words in Q high scoring word pairs.
• Step 2: Scan database for exact matching with the list of words
compiled in step 1.
• Step 3: Extend the hits from step 2.
• Step 4: Evaluate significance of extended hits from step 3.
Step 1: Find high scoring words
• For every word x of length w in Q make a list of words that
when aligned to x score at least T.
• Example: Let x = AIV then the score for AIA is 4+4+0
(dropped) and for AII 4+4+3 (taken).
• The number of words in the list depends on w and T, and
is usually much less than 203 (typically about 50).
Step 2 – Finding hits
• Scan database for exact matching with the list of
words compiled in step1 :
• Two techniques.
– Hash table.
– Keyword tree (there is a finite-automaton based method for
exact matching with a set of patterns represented as a tree).
Step 3: Extending hits
• Parameter: X (controlled by a user).
• Extend the hits in both ways along diagonal (ungapped
alignment) until score drops more than X relative to the
best score yet attained.
• Return the score of the highest scoring segment pair
(HSP).
extensions
E-values, P-values
• E-value, Expectation value; this is the expected number
of hits of at least the given score, that you would expect
by random chance for the search database.
• P-value, Probability value; this is the probability that a hit
would attain at least the given score, by random chance
for the search database.
• E-values are easier to interpret than P-values.
• If the E-value is small enough, e.g. no more than 0.10,
then it is essentially a P-value.
Karlin-Altschul statistics
• Expected number of HSPs with score ≥ S is:
E = KmNe –λS
• m = length of query sequence
• N = database size in residues
• K scales the search space size
• λ a scale for the scoring system
The bit score
• This formula “normalizes” a raw score:
S’ = (λS – ln K)/ln 2
• The E-value is then given by:
E = mN 2 –S’
• Allows direct comparison of E-values, even with differing
scoring matrices.
Karlin and Altschul provided a theory for computing statistical
significance
• Assumptions:
– The scoring matrix M must be such that the score for a random
alignment is negative.
– n, m (lengths of the aligned sequences) are large.
– The amino acid distribution p(x) is in the query sequence and the
data base is the same.
– Positive score is possible (i.e. M has at least one positive entry).
P(S<x) = exp (-e –l(x-m) ) thus:
P(S>=x) =1- exp (-e –l(x-m)) )
Score of high scoring sequence pairs follows extreme value
distribution
normal
Extreme values
l – decay constant
u – value of the peak
Extreme value distribution for sequence alignment
Property of extreme value distribution:
P(S<x) = exp(-e –l(x-m)) 
P(S>=x) =1- exp(-e –l(x-m))
m – location (zero in the fig from last slide), l scale;
For random sequence alignment:
m = ln Kmn/ l
K- constant that depends on p(x) and scoring matrix M
Since 1-exp(-x) ~ x and substituting for m and s:
P(S>=x) ~ e –l(x-m) = Kmn e –lx
E=value-expected number of random scores above
x
• E-value = KNme–lx
(Expected number of sequences scoring at least x observed
by chance, it is approximately same as p value for p value <
0.1 )
Refinement of the basic algorithm-the two hit method
• Observation: HSPs of interest are long and can
contain multiple hits a relatively short distance
apart.
• Central idea: Look for non-overlapping pairs of
hits that are of distance at most d on the same
diagonal.
• Benefits:
– Can reduce word size w from 3 two 2 without losing
sensitivity (actually sensitivity of two-hit BLAST is
higher).
– Since extending a hit requires a diagonal partner,
many fewer hits are extended, resulting in
increased speed.
Gapped BLAST
• Find two non-overlapping hits of score at least T and
distance at most d from one another.
• Invoke ungapped extension.
• If the HSP generated has normalized score at least Sg
bits then start gapped extension.
• Report resulting alignment if it has sufficiently
significant (very small) E-value.
Gapped BLAST statistics
• Theory does not work.
• Simulations indicate that for the best hits scores for local
alignments follow an extreme value distribution.
• Method approximates l and m to match experimental distribution -
l and m can be computed from the median and variation of the
experimental distribution.
• BLAST approach – simulate the distribution for set of scoring
matrices and a number of gap penalties. BLAST offers a choice of
parameters from this pre-computed set.
Multiple Sequence Alignment
• A multiple alignment of sequences from a protein family
makes the conserved features much more apparent.
• Much more difficult to compute than pairwise alignments.
• The most commonly used and newer programs use the
“progressive alignment strategy”.
• There are also important databases of multiple
alignments for protein families.
Multiple alignment of globins from CDD
Progressive alignment
• Determine an (approximate) phylogenetic tree for the
sequences.
• Construct the multiple alignment by merging pairwise
alignments based on the phylogenetic tree, the most
closely related sequences first, etc.
• The idea is, if you are careless about the order and
merge distantly related sequences too soon in the
process, then errors in the alignment may occur and
propagate.
Multiple sequence alignment programs
• CLUSTALW, Thompson et al. NAR 1994.
• T-Coffee (Tree-based Consistency Objective Function for
alignment Evaluation), C. Notredame et al. JMB 2000.
• MUSCLE (Multiple sequence comparison by log
expectation), R. Edgar NAR 2004.
PSI-BLAST
• Position Specific Iterated BLAST
• As a first step runs a (regular) BLAST.
• Hits that cross the threshold are used to construct a
position specific score matrix (PSSM).
• A new search is done using the PSSM to find more
remotely related sequences.
• The last two steps are iterated until convergence.
PSSM (Position Specific Score Matrix)
• One column per residue in the query sequence.
• Per-column residue frequencies are computed so that
log-odds scores may be assigned to each residue type in
each column.
• There are difficulties; e.g. pseudo-counts are needed if
there are not a lot of sequences, the sequences must be
weighted to compensate for redundancy.
Exercise!
• Go to the NCBI BLAST web site.
• Click on the link to PHI- and PSI- BLAST.
• Choose the search database to be “pdb”.
• Enter the sequence for 1h1bA, Human Neutrophil
Elastase.
• How many iterations does it take before 1dleB (Factor B
Serine Protease domain) shows up as a significant hit?
Modifying thresholds…
• On occasion, it can prove useful to modify (increase) the
inclusion threshold parameter.
• The user can also manually select hits to include in the
PSSM that do not meet the threshold, e.g. if one is
certain they are homologs to the query.
• Of course, one must always be extremely careful when
doing so!
HMMs
• Hidden Markov Models, a type of statistical model.
• Have numerous applications (including outside of
bioinformatics).
• HMMs were used to construct Pfam, a database of
multiple alignments for protein families (HMMer).
CDD Search
• Conserved Domain Database (CDD) at NCBI.
• Contains alignments from Smart, Pfam, COG, KOG, and
LOAD databases.
• Many multiple alignments are manually curated.
• PSSMs derived from multiple alignments may be
searched with RPS-BLAST (Reverse Position Specific
BLAST).
Thank you to Dr. Teresa Przytycka for
slides about dynamic programming
and BLAST.

More Related Content

Similar to lecture4.ppt Sequence Alignmentaldf sdfsadf

B.sc biochem i bobi u 3.1 sequence alignment
B.sc biochem i bobi u 3.1 sequence alignmentB.sc biochem i bobi u 3.1 sequence alignment
B.sc biochem i bobi u 3.1 sequence alignmentRai University
 
Bioinformatics t5-databasesearching v2014
Bioinformatics t5-databasesearching v2014Bioinformatics t5-databasesearching v2014
Bioinformatics t5-databasesearching v2014Prof. Wim Van Criekinge
 
B.sc biochem i bobi u 3.2 algorithm + blast
B.sc biochem i bobi u 3.2 algorithm + blastB.sc biochem i bobi u 3.2 algorithm + blast
B.sc biochem i bobi u 3.2 algorithm + blastRai University
 
B.sc biochem i bobi u 3.2 algorithm + blast
B.sc biochem i bobi u 3.2 algorithm + blastB.sc biochem i bobi u 3.2 algorithm + blast
B.sc biochem i bobi u 3.2 algorithm + blastRai University
 
Sequence Alignment
Sequence AlignmentSequence Alignment
Sequence AlignmentRavi Gandham
 
Basics of bioinformatics
Basics of bioinformaticsBasics of bioinformatics
Basics of bioinformaticsAbhishek Vatsa
 
Sequence homology search and multiple sequence alignment(1)
Sequence homology search and multiple sequence alignment(1)Sequence homology search and multiple sequence alignment(1)
Sequence homology search and multiple sequence alignment(1)AnkitTiwari354
 
2016 bioinformatics i_alignments_wim_vancriekinge
2016 bioinformatics i_alignments_wim_vancriekinge2016 bioinformatics i_alignments_wim_vancriekinge
2016 bioinformatics i_alignments_wim_vancriekingeProf. Wim Van Criekinge
 
2015 bioinformatics alignments_wim_vancriekinge
2015 bioinformatics alignments_wim_vancriekinge2015 bioinformatics alignments_wim_vancriekinge
2015 bioinformatics alignments_wim_vancriekingeProf. Wim Van Criekinge
 
AI 바이오 (4일차).pdf
AI 바이오 (4일차).pdfAI 바이오 (4일차).pdf
AI 바이오 (4일차).pdfH K Yoon
 
Sequence-analysis-pairwise-alignment.pdf
Sequence-analysis-pairwise-alignment.pdfSequence-analysis-pairwise-alignment.pdf
Sequence-analysis-pairwise-alignment.pdfsriaisvariyasundar
 
Blast fasta
Blast fastaBlast fasta
Blast fastayaghava
 
Sequence alignment global vs. local
Sequence alignment  global vs. localSequence alignment  global vs. local
Sequence alignment global vs. localbenazeer fathima
 
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...journal ijrtem
 
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...IJRTEMJOURNAL
 
Protein threading using context specific alignment potential ismb-2013
Protein threading using context specific alignment potential ismb-2013Protein threading using context specific alignment potential ismb-2013
Protein threading using context specific alignment potential ismb-2013Sheng Wang
 

Similar to lecture4.ppt Sequence Alignmentaldf sdfsadf (20)

B.sc biochem i bobi u 3.1 sequence alignment
B.sc biochem i bobi u 3.1 sequence alignmentB.sc biochem i bobi u 3.1 sequence alignment
B.sc biochem i bobi u 3.1 sequence alignment
 
Bioinformatics t5-databasesearching v2014
Bioinformatics t5-databasesearching v2014Bioinformatics t5-databasesearching v2014
Bioinformatics t5-databasesearching v2014
 
B.sc biochem i bobi u 3.2 algorithm + blast
B.sc biochem i bobi u 3.2 algorithm + blastB.sc biochem i bobi u 3.2 algorithm + blast
B.sc biochem i bobi u 3.2 algorithm + blast
 
B.sc biochem i bobi u 3.2 algorithm + blast
B.sc biochem i bobi u 3.2 algorithm + blastB.sc biochem i bobi u 3.2 algorithm + blast
B.sc biochem i bobi u 3.2 algorithm + blast
 
Sequence Alignment
Sequence AlignmentSequence Alignment
Sequence Alignment
 
Basics of bioinformatics
Basics of bioinformaticsBasics of bioinformatics
Basics of bioinformatics
 
02-alignment.pdf
02-alignment.pdf02-alignment.pdf
02-alignment.pdf
 
Sequence homology search and multiple sequence alignment(1)
Sequence homology search and multiple sequence alignment(1)Sequence homology search and multiple sequence alignment(1)
Sequence homology search and multiple sequence alignment(1)
 
Ch06 alignment
Ch06 alignmentCh06 alignment
Ch06 alignment
 
2016 bioinformatics i_alignments_wim_vancriekinge
2016 bioinformatics i_alignments_wim_vancriekinge2016 bioinformatics i_alignments_wim_vancriekinge
2016 bioinformatics i_alignments_wim_vancriekinge
 
2015 bioinformatics alignments_wim_vancriekinge
2015 bioinformatics alignments_wim_vancriekinge2015 bioinformatics alignments_wim_vancriekinge
2015 bioinformatics alignments_wim_vancriekinge
 
AI 바이오 (4일차).pdf
AI 바이오 (4일차).pdfAI 바이오 (4일차).pdf
AI 바이오 (4일차).pdf
 
Database Searching
Database SearchingDatabase Searching
Database Searching
 
Sequence-analysis-pairwise-alignment.pdf
Sequence-analysis-pairwise-alignment.pdfSequence-analysis-pairwise-alignment.pdf
Sequence-analysis-pairwise-alignment.pdf
 
Blast fasta
Blast fastaBlast fasta
Blast fasta
 
Bioinformatica t4-alignments
Bioinformatica t4-alignmentsBioinformatica t4-alignments
Bioinformatica t4-alignments
 
Sequence alignment global vs. local
Sequence alignment  global vs. localSequence alignment  global vs. local
Sequence alignment global vs. local
 
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...
 
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...
 
Protein threading using context specific alignment potential ismb-2013
Protein threading using context specific alignment potential ismb-2013Protein threading using context specific alignment potential ismb-2013
Protein threading using context specific alignment potential ismb-2013
 

More from alizain9604

Cell Communication4.ppt123457899987523412
Cell Communication4.ppt123457899987523412Cell Communication4.ppt123457899987523412
Cell Communication4.ppt123457899987523412alizain9604
 
proteinfolding-170226165229.pptx12345747
proteinfolding-170226165229.pptx12345747proteinfolding-170226165229.pptx12345747
proteinfolding-170226165229.pptx12345747alizain9604
 
13-miller-chap-7a-lecture.ppt1234578904578
13-miller-chap-7a-lecture.ppt123457890457813-miller-chap-7a-lecture.ppt1234578904578
13-miller-chap-7a-lecture.ppt1234578904578alizain9604
 
13-miller-chap-15-lecture (2).ppt1234578
13-miller-chap-15-lecture (2).ppt123457813-miller-chap-15-lecture (2).ppt1234578
13-miller-chap-15-lecture (2).ppt1234578alizain9604
 
signalling.ppt12345789009875431234578754345
signalling.ppt12345789009875431234578754345signalling.ppt12345789009875431234578754345
signalling.ppt12345789009875431234578754345alizain9604
 
Regulation of gene expression.ppt234578w3e45
Regulation of gene expression.ppt234578w3e45Regulation of gene expression.ppt234578w3e45
Regulation of gene expression.ppt234578w3e45alizain9604
 
1589353475-fermentation.ppt12345789934578
1589353475-fermentation.ppt123457899345781589353475-fermentation.ppt12345789934578
1589353475-fermentation.ppt12345789934578alizain9604
 
celltocellcommunication-101021235148-phpapp01.ppt
celltocellcommunication-101021235148-phpapp01.pptcelltocellcommunication-101021235148-phpapp01.ppt
celltocellcommunication-101021235148-phpapp01.pptalizain9604
 
bicatalysispresentation1-211210145704.pptx
bicatalysispresentation1-211210145704.pptxbicatalysispresentation1-211210145704.pptx
bicatalysispresentation1-211210145704.pptxalizain9604
 
13-miller-chap-15-lecture (1).ppt23457834
13-miller-chap-15-lecture (1).ppt2345783413-miller-chap-15-lecture (1).ppt23457834
13-miller-chap-15-lecture (1).ppt23457834alizain9604
 
signalling (1).ppt12345777788888885555554
signalling (1).ppt12345777788888885555554signalling (1).ppt12345777788888885555554
signalling (1).ppt12345777788888885555554alizain9604
 
BLAST AND FASTA.pptx12345789999987544321234
BLAST AND FASTA.pptx12345789999987544321234BLAST AND FASTA.pptx12345789999987544321234
BLAST AND FASTA.pptx12345789999987544321234alizain9604
 
transcriptionfactor-180830142612345.pptx
transcriptionfactor-180830142612345.pptxtranscriptionfactor-180830142612345.pptx
transcriptionfactor-180830142612345.pptxalizain9604
 
Ch15 Cell Signaling and Communication.ppt
Ch15 Cell Signaling and Communication.pptCh15 Cell Signaling and Communication.ppt
Ch15 Cell Signaling and Communication.pptalizain9604
 
13-miller-chap-15-lecture.ppt1234578900000000000000009875444333333333333333332
13-miller-chap-15-lecture.ppt123457890000000000000000987544433333333333333333213-miller-chap-15-lecture.ppt1234578900000000000000009875444333333333333333332
13-miller-chap-15-lecture.ppt1234578900000000000000009875444333333333333333332alizain9604
 
1ystr-211201195417.pptx212345783457890345
1ystr-211201195417.pptx2123457834578903451ystr-211201195417.pptx212345783457890345
1ystr-211201195417.pptx212345783457890345alizain9604
 
mixed Sample and LCN.pptx2w345789o345789o
mixed Sample and LCN.pptx2w345789o345789omixed Sample and LCN.pptx2w345789o345789o
mixed Sample and LCN.pptx2w345789o345789oalizain9604
 
Y_Workshop_WI_planz (3).ppt12345789999987543
Y_Workshop_WI_planz (3).ppt12345789999987543Y_Workshop_WI_planz (3).ppt12345789999987543
Y_Workshop_WI_planz (3).ppt12345789999987543alizain9604
 
Production of Vaccine.pptx123457888889999990000
Production of Vaccine.pptx123457888889999990000Production of Vaccine.pptx123457888889999990000
Production of Vaccine.pptx123457888889999990000alizain9604
 
Bill Holmberg.ppt1234578999999234578912345
Bill Holmberg.ppt1234578999999234578912345Bill Holmberg.ppt1234578999999234578912345
Bill Holmberg.ppt1234578999999234578912345alizain9604
 

More from alizain9604 (20)

Cell Communication4.ppt123457899987523412
Cell Communication4.ppt123457899987523412Cell Communication4.ppt123457899987523412
Cell Communication4.ppt123457899987523412
 
proteinfolding-170226165229.pptx12345747
proteinfolding-170226165229.pptx12345747proteinfolding-170226165229.pptx12345747
proteinfolding-170226165229.pptx12345747
 
13-miller-chap-7a-lecture.ppt1234578904578
13-miller-chap-7a-lecture.ppt123457890457813-miller-chap-7a-lecture.ppt1234578904578
13-miller-chap-7a-lecture.ppt1234578904578
 
13-miller-chap-15-lecture (2).ppt1234578
13-miller-chap-15-lecture (2).ppt123457813-miller-chap-15-lecture (2).ppt1234578
13-miller-chap-15-lecture (2).ppt1234578
 
signalling.ppt12345789009875431234578754345
signalling.ppt12345789009875431234578754345signalling.ppt12345789009875431234578754345
signalling.ppt12345789009875431234578754345
 
Regulation of gene expression.ppt234578w3e45
Regulation of gene expression.ppt234578w3e45Regulation of gene expression.ppt234578w3e45
Regulation of gene expression.ppt234578w3e45
 
1589353475-fermentation.ppt12345789934578
1589353475-fermentation.ppt123457899345781589353475-fermentation.ppt12345789934578
1589353475-fermentation.ppt12345789934578
 
celltocellcommunication-101021235148-phpapp01.ppt
celltocellcommunication-101021235148-phpapp01.pptcelltocellcommunication-101021235148-phpapp01.ppt
celltocellcommunication-101021235148-phpapp01.ppt
 
bicatalysispresentation1-211210145704.pptx
bicatalysispresentation1-211210145704.pptxbicatalysispresentation1-211210145704.pptx
bicatalysispresentation1-211210145704.pptx
 
13-miller-chap-15-lecture (1).ppt23457834
13-miller-chap-15-lecture (1).ppt2345783413-miller-chap-15-lecture (1).ppt23457834
13-miller-chap-15-lecture (1).ppt23457834
 
signalling (1).ppt12345777788888885555554
signalling (1).ppt12345777788888885555554signalling (1).ppt12345777788888885555554
signalling (1).ppt12345777788888885555554
 
BLAST AND FASTA.pptx12345789999987544321234
BLAST AND FASTA.pptx12345789999987544321234BLAST AND FASTA.pptx12345789999987544321234
BLAST AND FASTA.pptx12345789999987544321234
 
transcriptionfactor-180830142612345.pptx
transcriptionfactor-180830142612345.pptxtranscriptionfactor-180830142612345.pptx
transcriptionfactor-180830142612345.pptx
 
Ch15 Cell Signaling and Communication.ppt
Ch15 Cell Signaling and Communication.pptCh15 Cell Signaling and Communication.ppt
Ch15 Cell Signaling and Communication.ppt
 
13-miller-chap-15-lecture.ppt1234578900000000000000009875444333333333333333332
13-miller-chap-15-lecture.ppt123457890000000000000000987544433333333333333333213-miller-chap-15-lecture.ppt1234578900000000000000009875444333333333333333332
13-miller-chap-15-lecture.ppt1234578900000000000000009875444333333333333333332
 
1ystr-211201195417.pptx212345783457890345
1ystr-211201195417.pptx2123457834578903451ystr-211201195417.pptx212345783457890345
1ystr-211201195417.pptx212345783457890345
 
mixed Sample and LCN.pptx2w345789o345789o
mixed Sample and LCN.pptx2w345789o345789omixed Sample and LCN.pptx2w345789o345789o
mixed Sample and LCN.pptx2w345789o345789o
 
Y_Workshop_WI_planz (3).ppt12345789999987543
Y_Workshop_WI_planz (3).ppt12345789999987543Y_Workshop_WI_planz (3).ppt12345789999987543
Y_Workshop_WI_planz (3).ppt12345789999987543
 
Production of Vaccine.pptx123457888889999990000
Production of Vaccine.pptx123457888889999990000Production of Vaccine.pptx123457888889999990000
Production of Vaccine.pptx123457888889999990000
 
Bill Holmberg.ppt1234578999999234578912345
Bill Holmberg.ppt1234578999999234578912345Bill Holmberg.ppt1234578999999234578912345
Bill Holmberg.ppt1234578999999234578912345
 

Recently uploaded

College Call Girls New Alipore - For 7001035870 Cheap & Best with original Ph...
College Call Girls New Alipore - For 7001035870 Cheap & Best with original Ph...College Call Girls New Alipore - For 7001035870 Cheap & Best with original Ph...
College Call Girls New Alipore - For 7001035870 Cheap & Best with original Ph...anamikaraghav4
 
Dakshineswar Call Girls ✔ 8005736733 ✔ Hot Model With Sexy Bhabi Ready For Se...
Dakshineswar Call Girls ✔ 8005736733 ✔ Hot Model With Sexy Bhabi Ready For Se...Dakshineswar Call Girls ✔ 8005736733 ✔ Hot Model With Sexy Bhabi Ready For Se...
Dakshineswar Call Girls ✔ 8005736733 ✔ Hot Model With Sexy Bhabi Ready For Se...aamir
 
Nayabad Call Girls ✔ 8005736733 ✔ Hot Model With Sexy Bhabi Ready For Sex At ...
Nayabad Call Girls ✔ 8005736733 ✔ Hot Model With Sexy Bhabi Ready For Sex At ...Nayabad Call Girls ✔ 8005736733 ✔ Hot Model With Sexy Bhabi Ready For Sex At ...
Nayabad Call Girls ✔ 8005736733 ✔ Hot Model With Sexy Bhabi Ready For Sex At ...aamir
 
VIP Call Girls Asansol Ananya 8250192130 Independent Escort Service Asansol
VIP Call Girls Asansol Ananya 8250192130 Independent Escort Service AsansolVIP Call Girls Asansol Ananya 8250192130 Independent Escort Service Asansol
VIP Call Girls Asansol Ananya 8250192130 Independent Escort Service AsansolRiya Pathan
 
Fun Call Girls In Goa 7028418221 Escort Service In Morjim Beach Call Girl
Fun Call Girls In Goa 7028418221 Escort Service In Morjim Beach Call GirlFun Call Girls In Goa 7028418221 Escort Service In Morjim Beach Call Girl
Fun Call Girls In Goa 7028418221 Escort Service In Morjim Beach Call GirlApsara Of India
 
Verified Call Girls Esplanade - [ Cash on Delivery ] Contact 8250192130 Escor...
Verified Call Girls Esplanade - [ Cash on Delivery ] Contact 8250192130 Escor...Verified Call Girls Esplanade - [ Cash on Delivery ] Contact 8250192130 Escor...
Verified Call Girls Esplanade - [ Cash on Delivery ] Contact 8250192130 Escor...anamikaraghav4
 
Kolkata Call Girl Bagbazar 👉 8250192130 ❣️💯 Available With Room 24×7
Kolkata Call Girl Bagbazar 👉 8250192130 ❣️💯 Available With Room 24×7Kolkata Call Girl Bagbazar 👉 8250192130 ❣️💯 Available With Room 24×7
Kolkata Call Girl Bagbazar 👉 8250192130 ❣️💯 Available With Room 24×7Riya Pathan
 
(DIVYA) Dhanori Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(DIVYA) Dhanori Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(DIVYA) Dhanori Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(DIVYA) Dhanori Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...ranjana rawat
 
VIP Call Girls Sonagachi - 8250192130 Escorts Service 50% Off with Cash ON De...
VIP Call Girls Sonagachi - 8250192130 Escorts Service 50% Off with Cash ON De...VIP Call Girls Sonagachi - 8250192130 Escorts Service 50% Off with Cash ON De...
VIP Call Girls Sonagachi - 8250192130 Escorts Service 50% Off with Cash ON De...anamikaraghav4
 
Housewife Call Girls Sonagachi - 8250192130 Booking and charges genuine rate ...
Housewife Call Girls Sonagachi - 8250192130 Booking and charges genuine rate ...Housewife Call Girls Sonagachi - 8250192130 Booking and charges genuine rate ...
Housewife Call Girls Sonagachi - 8250192130 Booking and charges genuine rate ...Riya Pathan
 
Call Girls Nashik Gayatri 7001305949 Independent Escort Service Nashik
Call Girls Nashik Gayatri 7001305949 Independent Escort Service NashikCall Girls Nashik Gayatri 7001305949 Independent Escort Service Nashik
Call Girls Nashik Gayatri 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
Call Girl Nashik Saloni 7001305949 Independent Escort Service Nashik
Call Girl Nashik Saloni 7001305949 Independent Escort Service NashikCall Girl Nashik Saloni 7001305949 Independent Escort Service Nashik
Call Girl Nashik Saloni 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
Low Rate Call Girls Gulbarga Anika 8250192130 Independent Escort Service Gulb...
Low Rate Call Girls Gulbarga Anika 8250192130 Independent Escort Service Gulb...Low Rate Call Girls Gulbarga Anika 8250192130 Independent Escort Service Gulb...
Low Rate Call Girls Gulbarga Anika 8250192130 Independent Escort Service Gulb...Riya Pathan
 
Call Girls In Karnal O8860008073 Sector 6 7 8 9 Karnal Escorts
Call Girls In Karnal O8860008073 Sector 6 7 8 9 Karnal EscortsCall Girls In Karnal O8860008073 Sector 6 7 8 9 Karnal Escorts
Call Girls In Karnal O8860008073 Sector 6 7 8 9 Karnal EscortsApsara Of India
 
Contact:- 8860008073 Call Girls in Karnal Escort Service Available at Afforda...
Contact:- 8860008073 Call Girls in Karnal Escort Service Available at Afforda...Contact:- 8860008073 Call Girls in Karnal Escort Service Available at Afforda...
Contact:- 8860008073 Call Girls in Karnal Escort Service Available at Afforda...Apsara Of India
 
Air-Hostess Call Girls Shobhabazar | 8250192130 At Low Cost Cash Payment Booking
Air-Hostess Call Girls Shobhabazar | 8250192130 At Low Cost Cash Payment BookingAir-Hostess Call Girls Shobhabazar | 8250192130 At Low Cost Cash Payment Booking
Air-Hostess Call Girls Shobhabazar | 8250192130 At Low Cost Cash Payment BookingRiya Pathan
 
Kolkata Call Girls Service +918240919228 - Kolkatanightgirls.com
Kolkata Call Girls Service +918240919228 - Kolkatanightgirls.comKolkata Call Girls Service +918240919228 - Kolkatanightgirls.com
Kolkata Call Girls Service +918240919228 - Kolkatanightgirls.comKolkata Call Girls
 
Call Girl Nashik Amaira 7001305949 Independent Escort Service Nashik
Call Girl Nashik Amaira 7001305949 Independent Escort Service NashikCall Girl Nashik Amaira 7001305949 Independent Escort Service Nashik
Call Girl Nashik Amaira 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
VIP Call Girls Service Banjara Hills Hyderabad Call +91-8250192130
VIP Call Girls Service Banjara Hills Hyderabad Call +91-8250192130VIP Call Girls Service Banjara Hills Hyderabad Call +91-8250192130
VIP Call Girls Service Banjara Hills Hyderabad Call +91-8250192130Suhani Kapoor
 
👙 Kolkata Call Girls Shyam Bazar 💫💫7001035870 Model escorts Service
👙  Kolkata Call Girls Shyam Bazar 💫💫7001035870 Model escorts Service👙  Kolkata Call Girls Shyam Bazar 💫💫7001035870 Model escorts Service
👙 Kolkata Call Girls Shyam Bazar 💫💫7001035870 Model escorts Serviceanamikaraghav4
 

Recently uploaded (20)

College Call Girls New Alipore - For 7001035870 Cheap & Best with original Ph...
College Call Girls New Alipore - For 7001035870 Cheap & Best with original Ph...College Call Girls New Alipore - For 7001035870 Cheap & Best with original Ph...
College Call Girls New Alipore - For 7001035870 Cheap & Best with original Ph...
 
Dakshineswar Call Girls ✔ 8005736733 ✔ Hot Model With Sexy Bhabi Ready For Se...
Dakshineswar Call Girls ✔ 8005736733 ✔ Hot Model With Sexy Bhabi Ready For Se...Dakshineswar Call Girls ✔ 8005736733 ✔ Hot Model With Sexy Bhabi Ready For Se...
Dakshineswar Call Girls ✔ 8005736733 ✔ Hot Model With Sexy Bhabi Ready For Se...
 
Nayabad Call Girls ✔ 8005736733 ✔ Hot Model With Sexy Bhabi Ready For Sex At ...
Nayabad Call Girls ✔ 8005736733 ✔ Hot Model With Sexy Bhabi Ready For Sex At ...Nayabad Call Girls ✔ 8005736733 ✔ Hot Model With Sexy Bhabi Ready For Sex At ...
Nayabad Call Girls ✔ 8005736733 ✔ Hot Model With Sexy Bhabi Ready For Sex At ...
 
VIP Call Girls Asansol Ananya 8250192130 Independent Escort Service Asansol
VIP Call Girls Asansol Ananya 8250192130 Independent Escort Service AsansolVIP Call Girls Asansol Ananya 8250192130 Independent Escort Service Asansol
VIP Call Girls Asansol Ananya 8250192130 Independent Escort Service Asansol
 
Fun Call Girls In Goa 7028418221 Escort Service In Morjim Beach Call Girl
Fun Call Girls In Goa 7028418221 Escort Service In Morjim Beach Call GirlFun Call Girls In Goa 7028418221 Escort Service In Morjim Beach Call Girl
Fun Call Girls In Goa 7028418221 Escort Service In Morjim Beach Call Girl
 
Verified Call Girls Esplanade - [ Cash on Delivery ] Contact 8250192130 Escor...
Verified Call Girls Esplanade - [ Cash on Delivery ] Contact 8250192130 Escor...Verified Call Girls Esplanade - [ Cash on Delivery ] Contact 8250192130 Escor...
Verified Call Girls Esplanade - [ Cash on Delivery ] Contact 8250192130 Escor...
 
Kolkata Call Girl Bagbazar 👉 8250192130 ❣️💯 Available With Room 24×7
Kolkata Call Girl Bagbazar 👉 8250192130 ❣️💯 Available With Room 24×7Kolkata Call Girl Bagbazar 👉 8250192130 ❣️💯 Available With Room 24×7
Kolkata Call Girl Bagbazar 👉 8250192130 ❣️💯 Available With Room 24×7
 
(DIVYA) Dhanori Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(DIVYA) Dhanori Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(DIVYA) Dhanori Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(DIVYA) Dhanori Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 
VIP Call Girls Sonagachi - 8250192130 Escorts Service 50% Off with Cash ON De...
VIP Call Girls Sonagachi - 8250192130 Escorts Service 50% Off with Cash ON De...VIP Call Girls Sonagachi - 8250192130 Escorts Service 50% Off with Cash ON De...
VIP Call Girls Sonagachi - 8250192130 Escorts Service 50% Off with Cash ON De...
 
Housewife Call Girls Sonagachi - 8250192130 Booking and charges genuine rate ...
Housewife Call Girls Sonagachi - 8250192130 Booking and charges genuine rate ...Housewife Call Girls Sonagachi - 8250192130 Booking and charges genuine rate ...
Housewife Call Girls Sonagachi - 8250192130 Booking and charges genuine rate ...
 
Call Girls Nashik Gayatri 7001305949 Independent Escort Service Nashik
Call Girls Nashik Gayatri 7001305949 Independent Escort Service NashikCall Girls Nashik Gayatri 7001305949 Independent Escort Service Nashik
Call Girls Nashik Gayatri 7001305949 Independent Escort Service Nashik
 
Call Girl Nashik Saloni 7001305949 Independent Escort Service Nashik
Call Girl Nashik Saloni 7001305949 Independent Escort Service NashikCall Girl Nashik Saloni 7001305949 Independent Escort Service Nashik
Call Girl Nashik Saloni 7001305949 Independent Escort Service Nashik
 
Low Rate Call Girls Gulbarga Anika 8250192130 Independent Escort Service Gulb...
Low Rate Call Girls Gulbarga Anika 8250192130 Independent Escort Service Gulb...Low Rate Call Girls Gulbarga Anika 8250192130 Independent Escort Service Gulb...
Low Rate Call Girls Gulbarga Anika 8250192130 Independent Escort Service Gulb...
 
Call Girls In Karnal O8860008073 Sector 6 7 8 9 Karnal Escorts
Call Girls In Karnal O8860008073 Sector 6 7 8 9 Karnal EscortsCall Girls In Karnal O8860008073 Sector 6 7 8 9 Karnal Escorts
Call Girls In Karnal O8860008073 Sector 6 7 8 9 Karnal Escorts
 
Contact:- 8860008073 Call Girls in Karnal Escort Service Available at Afforda...
Contact:- 8860008073 Call Girls in Karnal Escort Service Available at Afforda...Contact:- 8860008073 Call Girls in Karnal Escort Service Available at Afforda...
Contact:- 8860008073 Call Girls in Karnal Escort Service Available at Afforda...
 
Air-Hostess Call Girls Shobhabazar | 8250192130 At Low Cost Cash Payment Booking
Air-Hostess Call Girls Shobhabazar | 8250192130 At Low Cost Cash Payment BookingAir-Hostess Call Girls Shobhabazar | 8250192130 At Low Cost Cash Payment Booking
Air-Hostess Call Girls Shobhabazar | 8250192130 At Low Cost Cash Payment Booking
 
Kolkata Call Girls Service +918240919228 - Kolkatanightgirls.com
Kolkata Call Girls Service +918240919228 - Kolkatanightgirls.comKolkata Call Girls Service +918240919228 - Kolkatanightgirls.com
Kolkata Call Girls Service +918240919228 - Kolkatanightgirls.com
 
Call Girl Nashik Amaira 7001305949 Independent Escort Service Nashik
Call Girl Nashik Amaira 7001305949 Independent Escort Service NashikCall Girl Nashik Amaira 7001305949 Independent Escort Service Nashik
Call Girl Nashik Amaira 7001305949 Independent Escort Service Nashik
 
VIP Call Girls Service Banjara Hills Hyderabad Call +91-8250192130
VIP Call Girls Service Banjara Hills Hyderabad Call +91-8250192130VIP Call Girls Service Banjara Hills Hyderabad Call +91-8250192130
VIP Call Girls Service Banjara Hills Hyderabad Call +91-8250192130
 
👙 Kolkata Call Girls Shyam Bazar 💫💫7001035870 Model escorts Service
👙  Kolkata Call Girls Shyam Bazar 💫💫7001035870 Model escorts Service👙  Kolkata Call Girls Shyam Bazar 💫💫7001035870 Model escorts Service
👙 Kolkata Call Girls Shyam Bazar 💫💫7001035870 Model escorts Service
 

lecture4.ppt Sequence Alignmentaldf sdfsadf

  • 1. CS177 Lecture 4 Sequence Alignment Tom Madej 10.04.04
  • 2. Overview • The alignment problem. • Homology, divergence, convergence. • Amino acid substitution matrices. • Pairwise sequence alignment algorithms: dynamic programming, BLAST. • Multiple sequence alignment. • PSI-BLAST, position specific score matrices. • Databases of multiple sequence alignments.
  • 3. What is a sequence alignment? • A linear, one-to-one correspondence between some of the symbols in one sequence with some of the symbols in another sequence. • May be DNA or protein sequences. • A (sequence) alignment can also be derived from a superposition of two protein structures, it is then sometimes called a structure alignment.
  • 4. Exercise! • From http://www.ncbi.nlm.nih.gov/Structure/ go to the MMDB summary page for 1npq. • Click on the colored bar for chain “A”. • Scroll down and select the VAST neighbor 1KR7 A. • Click on “View 3D Structure” at the top to view the structural superposition using Cn3D. • The sequence alignment will be shown in a separate window.
  • 5. Divergence and Convergence • Two proteins may be similar because of divergence from a common ancestor (i.e. they are homologous). • … or, perhaps two proteins may be similar because they perform similar functions and are thereby constrained, even though they arose independently (functional convergence hypothesis, they are then called analogous).
  • 6. Divergence vs. Convergence • Convergence can occur; e.g. there exist enzymes with different overall structures but remarkably similar arrangements of active site residues to carry out a similar function. • So how can one establish homology?
  • 7. Homology “… whenever statistically significant sequence or structural similarity between proteins or protein domains is observed, this is an indication of their divergent evolution from a common ancestor or, in other words, evidence of homology.” E.V. Koonin and M.Y. Galperin, Sequence – Evolution – Function, Kluwer 2003
  • 8. Two arguments… • We see a continuous range of sequence similarity. Convergence is extremely unlikely for highly similar protein families. It then appears implausible to invoke it for less similar families. • The same or very similar functions may be carried out by proteins with very different structures (folds). Therefore, functional constraints cannot force convergence of the sequence or structure between proteins.
  • 9. Sequence identity for VAST neighbors of 1NQP A (a globin)
  • 10. Global and local alignment • Alignment programs may be modified, e.g. by scoring parameters, to produce global or local alignments. • Local alignments tend to be more useful, as it is highly possible that a significant similarity may encompass only a portion of one or both sequences being compared.
  • 11. DNA and proteins • Much more sensitive comparison is possible between protein sequences than DNA sequences. • One reason is that the third codon in the genetic code is highly redundant, and introduces noise into DNA comparisons. • Another reason is that the physico-chemical properties of the amino acid residues provide information that is highly relevant to comparison.
  • 12. Molecular Cell Biology, Lodish et al. Fig. 3-2.
  • 13. Molecular Cell Biology, Lodish et. al Fig. 3-2.
  • 14. We are faced with several problems… • How do you quantify amino acid similarity? • How can you handle gaps in the alignment? • How do you define the overall similarity score? • Can you compute an optimal alignment? • Can you compute an alignment efficiently? • Can you calculate statistical significance?
  • 15. Amino acid substitution matrices • Ab initio approaches; e.g. assign scores based on number of mutations needed to transform one codon to another, or on other physico-chemical measures of a.a. similarity. • Empirical, i.e. derive statistics about a.a. substitutions from collections of alignments. (These work the best).
  • 16. Computation of scores, empirical approach sij = ln(qij/(pipj))/λ • sij is the score for substituting amino acid type i by type j. • qij is the observed frequency of substitutions of a.a. type i by a.a. type j (in the training set). • pi is the background frequency for a.a. type i in the training set. • λ is a positive constant.
  • 17. Substitution matrices • There are many different matrices available. • The most commonly used are the BLOSUM or PAM series. • BLAST defaults to the BLOSUM62 matrix. The BLOSUM matrices have been shown to be more sensitive than the PAM matrices for database searches.
  • 18.
  • 19. Gap penalties • There is no suitable theory for gap penalties. • The most common type of gap penalty is the affine gap penalty: g = a + bx, where a is the gap opening penalty, b is the gap extension penalty, and x is the number of gapped-out residues. • Typical values, e.g. a = 10 and b = 1 for BLAST.
  • 20. Dynamic programming overview • Dynamic programming; a computer algorithmic technique invented in the 1940’s. • Has applications to many types of problems. • Key properties of problems solvable with DP: the optimal solution typically contains optimal solutions to subproblems, and only a “small” number of subproblems are needed for the optimal solution. (T.H. Cormen et al., Introduction to Algorithms, McGraw-Hill 1990).
  • 21. Dynamic programming algorithm for computing the score of the optimal alignment For a sequence S = a1, a2, …, an let Sj = a1, a2, …, aj Align(Si,S’ j) = the score of the highest scoring alignment between S1 i,S2 j S(ai, a’j)= similarity score between amino acids ai and a’j given by a scoring matrix like PAM, BLOSUM g – gap penalty Align(Si,S’ j)= max Align(Si-1,S’ j-1)+ S(ai, a’j) Align(Si,S’ j-1) - g Align(Si-1,S’ j) - g {
  • 22. Organizing the computation – dynamic programming table Align(Si-1,S’ j-1)+ s(ai, a’j) Align(Si-1,S’ j) - g Align(Si,S’ j-1) - g Align(Si,S’ j) = max { j i Align[i, j] = Align +s(ai,aj) max
  • 23. Example of DP computation with g = 0; match = 1; mismatch = 0 Maximal Common Subsequence A T T G C G C G C A T A T G C T T A A C C A +1 max 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 1 2 2 2 2 2 2 2 0 1 2 2 3 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 initialization
  • 24. Example of DP computation with g = 2; match = 2; mismatch = -1 A T T G C G C G C A T A T G C T T A A C C A +2 max -2 -2 Initialization (penalty for starting with a gap) 0 -2 -4 -6 -8 -10 -12 -14 -16 -18 -20 -22 -2 2 0 -2 -4 -4 0 4 2 -6 -2 2 3 -8 -4 -10 -12 -14 -16 -18 -20 -22
  • 25. The iterative algorithm m = |S|; n = |S’| for i  0 to m do A[i,0]- i * g for j  0 to n do A[0,j] - j * g for i  0 to m do for j  0 to n A[i,j]max ( A[i-1,j] – g A[i-1,j-1] + s(i,j) A[i,j-1] – g ) return(A[m,n])
  • 26. Complexity of the DP algorithm • Time O(nm); space O(nm) where n, m are the lengths of the two sequences. • Space complexity can be reduced to O(n) by not storing the entries of dynamic programming table that are no longer needed for the computation (keep current row and the previous row only).
  • 27. DP technicalities… • One uses a separate table to trace back the computation and determine the actual optimal alignment. • If the gap penalty has different opening and extension costs, then the algorithm becomes a little more complicated (cf. Chapter 3 in Mount).
  • 28. Exercise! • Copy and paste the sequences for 1nqpA and 1kr7A into a notepad. • Go to the web site: http://pir.georgetown.edu • Select “SSearch Alignment” from the “Search and Retrieval” pulldown menu. • Copy in your sequences (use FASTA format and remove numbers) and then run SSEARCH (Smith-Waterman algorithm, a DP alignment method).
  • 30. BLAST (Basic Local Alignment Search Tool) • Extremely fast, can be on the order of 50-100 times faster than Smith-Waterman. • Method of choice for database searches. • Statistical theory for significance of results. • Heuristic; does not guarantee optimal results. • Many variants, e.g. PHI-, PSI-, RPS-BLAST.
  • 31. Why database searches? • Gene finding. • Assigning likely function to a gene. • Identifying regulatory elements. • Understanding genome evolution. • Assisting in sequence assembly. • Finding relations between genes.
  • 32. Issues in database searches • Speed. • Relevance of the search results (selectivity). • Recovering all information of interest (sensitivity). – The results depend on the search parameters, e.g. gap penalty, scoring matrix. – Sometimes searches with more than one matrix should be performed.
  • 33. Main idea of BLAST • Homologous sequences are likely to contain a short high scoring similarity region, a hit. Each hit gives a seed that BLAST tries to extend on both sides.
  • 34. Some BLAST terminology word – substring of a sequence word pair – pair of words of the same length score of a word pair – score of the gapless alignment of the two words: V A L M R V A K N S score = 4 + 4 – 2 – 2 - 1 = 3 (BLOSUM62) HSP – high scoring word pair
  • 35. Main steps of BLAST • Parameters: w = length of a hit; T = min. score of a hit (for proteins: w = 3, T = 13 (BLOSUM62)). • Step 1: Given query sequence Q, compile the list of possible words which form with words in Q high scoring word pairs. • Step 2: Scan database for exact matching with the list of words compiled in step 1. • Step 3: Extend the hits from step 2. • Step 4: Evaluate significance of extended hits from step 3.
  • 36. Step 1: Find high scoring words • For every word x of length w in Q make a list of words that when aligned to x score at least T. • Example: Let x = AIV then the score for AIA is 4+4+0 (dropped) and for AII 4+4+3 (taken). • The number of words in the list depends on w and T, and is usually much less than 203 (typically about 50).
  • 37. Step 2 – Finding hits • Scan database for exact matching with the list of words compiled in step1 : • Two techniques. – Hash table. – Keyword tree (there is a finite-automaton based method for exact matching with a set of patterns represented as a tree).
  • 38. Step 3: Extending hits • Parameter: X (controlled by a user). • Extend the hits in both ways along diagonal (ungapped alignment) until score drops more than X relative to the best score yet attained. • Return the score of the highest scoring segment pair (HSP). extensions
  • 39. E-values, P-values • E-value, Expectation value; this is the expected number of hits of at least the given score, that you would expect by random chance for the search database. • P-value, Probability value; this is the probability that a hit would attain at least the given score, by random chance for the search database. • E-values are easier to interpret than P-values. • If the E-value is small enough, e.g. no more than 0.10, then it is essentially a P-value.
  • 40. Karlin-Altschul statistics • Expected number of HSPs with score ≥ S is: E = KmNe –λS • m = length of query sequence • N = database size in residues • K scales the search space size • λ a scale for the scoring system
  • 41. The bit score • This formula “normalizes” a raw score: S’ = (λS – ln K)/ln 2 • The E-value is then given by: E = mN 2 –S’ • Allows direct comparison of E-values, even with differing scoring matrices.
  • 42. Karlin and Altschul provided a theory for computing statistical significance • Assumptions: – The scoring matrix M must be such that the score for a random alignment is negative. – n, m (lengths of the aligned sequences) are large. – The amino acid distribution p(x) is in the query sequence and the data base is the same. – Positive score is possible (i.e. M has at least one positive entry).
  • 43. P(S<x) = exp (-e –l(x-m) ) thus: P(S>=x) =1- exp (-e –l(x-m)) ) Score of high scoring sequence pairs follows extreme value distribution normal Extreme values l – decay constant u – value of the peak
  • 44. Extreme value distribution for sequence alignment Property of extreme value distribution: P(S<x) = exp(-e –l(x-m))  P(S>=x) =1- exp(-e –l(x-m)) m – location (zero in the fig from last slide), l scale; For random sequence alignment: m = ln Kmn/ l K- constant that depends on p(x) and scoring matrix M Since 1-exp(-x) ~ x and substituting for m and s: P(S>=x) ~ e –l(x-m) = Kmn e –lx
  • 45. E=value-expected number of random scores above x • E-value = KNme–lx (Expected number of sequences scoring at least x observed by chance, it is approximately same as p value for p value < 0.1 )
  • 46. Refinement of the basic algorithm-the two hit method • Observation: HSPs of interest are long and can contain multiple hits a relatively short distance apart. • Central idea: Look for non-overlapping pairs of hits that are of distance at most d on the same diagonal. • Benefits: – Can reduce word size w from 3 two 2 without losing sensitivity (actually sensitivity of two-hit BLAST is higher). – Since extending a hit requires a diagonal partner, many fewer hits are extended, resulting in increased speed.
  • 47. Gapped BLAST • Find two non-overlapping hits of score at least T and distance at most d from one another. • Invoke ungapped extension. • If the HSP generated has normalized score at least Sg bits then start gapped extension. • Report resulting alignment if it has sufficiently significant (very small) E-value.
  • 48. Gapped BLAST statistics • Theory does not work. • Simulations indicate that for the best hits scores for local alignments follow an extreme value distribution. • Method approximates l and m to match experimental distribution - l and m can be computed from the median and variation of the experimental distribution. • BLAST approach – simulate the distribution for set of scoring matrices and a number of gap penalties. BLAST offers a choice of parameters from this pre-computed set.
  • 49. Multiple Sequence Alignment • A multiple alignment of sequences from a protein family makes the conserved features much more apparent. • Much more difficult to compute than pairwise alignments. • The most commonly used and newer programs use the “progressive alignment strategy”. • There are also important databases of multiple alignments for protein families.
  • 50. Multiple alignment of globins from CDD
  • 51. Progressive alignment • Determine an (approximate) phylogenetic tree for the sequences. • Construct the multiple alignment by merging pairwise alignments based on the phylogenetic tree, the most closely related sequences first, etc. • The idea is, if you are careless about the order and merge distantly related sequences too soon in the process, then errors in the alignment may occur and propagate.
  • 52. Multiple sequence alignment programs • CLUSTALW, Thompson et al. NAR 1994. • T-Coffee (Tree-based Consistency Objective Function for alignment Evaluation), C. Notredame et al. JMB 2000. • MUSCLE (Multiple sequence comparison by log expectation), R. Edgar NAR 2004.
  • 53.
  • 54. PSI-BLAST • Position Specific Iterated BLAST • As a first step runs a (regular) BLAST. • Hits that cross the threshold are used to construct a position specific score matrix (PSSM). • A new search is done using the PSSM to find more remotely related sequences. • The last two steps are iterated until convergence.
  • 55. PSSM (Position Specific Score Matrix) • One column per residue in the query sequence. • Per-column residue frequencies are computed so that log-odds scores may be assigned to each residue type in each column. • There are difficulties; e.g. pseudo-counts are needed if there are not a lot of sequences, the sequences must be weighted to compensate for redundancy.
  • 56. Exercise! • Go to the NCBI BLAST web site. • Click on the link to PHI- and PSI- BLAST. • Choose the search database to be “pdb”. • Enter the sequence for 1h1bA, Human Neutrophil Elastase. • How many iterations does it take before 1dleB (Factor B Serine Protease domain) shows up as a significant hit?
  • 57. Modifying thresholds… • On occasion, it can prove useful to modify (increase) the inclusion threshold parameter. • The user can also manually select hits to include in the PSSM that do not meet the threshold, e.g. if one is certain they are homologs to the query. • Of course, one must always be extremely careful when doing so!
  • 58. HMMs • Hidden Markov Models, a type of statistical model. • Have numerous applications (including outside of bioinformatics). • HMMs were used to construct Pfam, a database of multiple alignments for protein families (HMMer).
  • 59. CDD Search • Conserved Domain Database (CDD) at NCBI. • Contains alignments from Smart, Pfam, COG, KOG, and LOAD databases. • Many multiple alignments are manually curated. • PSSMs derived from multiple alignments may be searched with RPS-BLAST (Reverse Position Specific BLAST).
  • 60. Thank you to Dr. Teresa Przytycka for slides about dynamic programming and BLAST.