4. sequence alignment.pptx
1. DNA and Protein sequence
alignments:
Pairwise alignment
Dot Plots
Substitution Matrices (PAM,
BLOSUM)
Computer applications for
Biosciences and Bioinformatics
Module III b
2. Sequence alignment
To align and score a pair of sequences (DNA or
protein)
To find the correspondences between substrings in
the sequences such that the similarity score is
maximized
Why do alignment?
To find out homology: similarity due to descent
from a common ancestor
Often we can infer homology from similarity
Thus we can sometimes infer structure/function
from sequence similarity
3. Sequence analysis tools that depend on
pairwise comparison
• Multiple alignments
• Profile and HMM making (used to search for
protein families and domains)
• 3D protein structure prediction
• Phylogenetic analysis
• Construction of certain substitution matrices
• Similarity searches in a database
4. Homology
Members of a family are called homologs or
homologous molecules.
Homologous sequences can be divided into two
groups
– orthologous sequences: sequences that differ
because they are found in different species (e.g.
human α-globin and mouse α-globin)
– paralogous sequences: sequences that differ
because of a gene duplication event (e.g. human
α-globin and human β-globin, various versions of
both)
5. Issues in Sequence Alignment
The sequences we are comparing probably differ
in length
There may be only a relatively small region in
the sequences that match
We want to allow partial matches (i.e. some
amino acid pairs are more substitutable than
others)
Variable length regions may have been
inserted/deleted from the common ancestral
sequence
6. Applications
Sequence alignment arises in many fields:
• Molecular biology
• Inexact text matching (e.g. spell checkers; web page search)
• Speech recognition
In general:
• The precise definition of what constitutes an alignment may
vary by field, and even within a field.
• Many different alignments of two sequences are possible, so to
select among them one requires an objective (score) function
on alignments.
• The number of possible alignments of two sequences grows
exponentially with the length of the sequences. Good
algorithms are required.
8. Important questions
Q. What do we want to align and how?
A: Two sequences (nucleotide or protein) through pairwise
alignment
Or a query sequence against the sequences in a database, to find
similar sequences (similarity search)
Q. How do we “score” an alignment?
Simple scoring (match = 1, mismatch = 0)
Dot plots (graphical representation)
Substitution matrices (PAM and BLOSUM) [s(a,b)
indicates the score of aligning character a with character b;
it also accounts for the relative substitutability of amino acid
pairs in the context of evolution]
Gap penalty function: w(k) indicates the cost of a gap of
length k
9. Q. How do we find the “best” alignment?
A: Alignment algorithms
An alignment program tries to find the best alignment between
two sequences given the scoring system.
Alignment types
Global: alignment between the complete sequence A and the
complete sequence B
Local: alignment between a subsequence of A and a subsequence
of B
Computer implementation (Algorithms)
Dynamic programming
Global: Needleman-Wunsch
Local: Smith-Waterman
Heuristic algorithms (faster but approximate)
BLAST
FASTA
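The two dynamic programming methods named above can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes a linear gap penalty and illustrative scores (match = 1, mismatch = -1, gap = -2), and it returns only the best score, not the alignment itself. The function names are hypothetical.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b (linear gap penalty)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]


def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score between any subsequences of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets a local alignment restart anywhere.
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + pair,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            best = max(best, score[i][j])
    return best
```

The only structural differences between the two are the initialisation of the first row and column, the floor at 0, and where the final score is read from (bottom-right cell for global, maximum cell for local).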
10. Pairwise alignment
The alignment of two sequences (DNA or
protein) is a relatively straightforward
computational problem.
There are lots of possible alignments.
Two sequences can always be aligned.
Sequence alignments have to be scored.
Often there is more than one solution with
the same score.
11. Sequence comparison through pairwise
alignments
Goal of pairwise comparison is to find conserved
regions (if any) between two sequences
Extrapolate information about our sequence
using the known characteristics of the other
sequence
12. Evolution of sequences
Sequences evolve through mutation and selection
[Selective pressure is different for each residue
position in a protein (i.e. conservation of active
site, structure, charge, etc.)]
Modular nature of proteins [Nature keeps re-using
domains]
Alignments try to tell the evolutionary story of the
proteins
13. Relationships
14. Some Definitions
Identity
• Proportion of pairs of identical residues between two aligned sequences.
• Generally expressed as a percentage.
• This value strongly depends on how the two sequences are aligned.
Similarity
• Proportion of pairs of similar residues between two aligned sequences.
• Whether two residues are similar is determined by a substitution matrix.
• This value also depends strongly on how the two sequences are aligned,
as well as on the substitution matrix used.
Homology
• Two sequences are homologous if and only if they have a common
ancestor.
• There is no such thing as a level of homology! (It is either yes or no.)
Note: Homologous sequences do not necessarily serve the same function...
Nor are they always highly similar: structure may be conserved while
sequence is not
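A small sketch can show why identity "strongly depends on how the two sequences are aligned". It assumes the two sequences are already aligned (equal length, "-" for a gap) and that the denominator is the number of alignment columns; other conventions for the denominator exist. The function name is hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns holding two identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return 100.0 * identical / len(aligned_a)


# The same two sequences (ACGT and ACTGA) under two different alignments.
print(percent_identity("AC-GT", "ACTGA"))  # gap placed in the middle
print(percent_identity("ACGT-", "ACTGA"))  # gap placed at the end
```

The two calls compare the same pair of sequences, yet the first alignment has three identical columns out of five and the second only two.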
15. Consider a set S (say, globins) and a test t that tries to detect
members of S (for example, through a pairwise comparison
with another globin).
True positive
• A protein is a true positive if it belongs to S and is detected
by t.
True negative
• A protein is a true negative if it does not belong to S and is
not detected by t.
False positive
• A protein is a false positive if it does not belong to S and is
(incorrectly) detected by t.
False negative
• A protein is a false negative if it belongs to S and is not
detected by t (but should be).
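The four definitions above map directly onto set operations. In this sketch the protein names are made up for illustration, and `classify` is a hypothetical helper, not part of any library.

```python
def classify(all_proteins, members, detected):
    """Split proteins into the four outcome classes for a set S and a test t.

    members  -- proteins that really belong to S
    detected -- proteins that the test t reports as belonging to S
    """
    all_proteins, members, detected = (
        set(all_proteins), set(members), set(detected)
    )
    return {
        "true_positive": members & detected,
        "false_negative": members - detected,
        "false_positive": detected - members,
        "true_negative": all_proteins - members - detected,
    }


# Hypothetical example: three globins, two other proteins.
proteins = {"globin_1", "globin_2", "globin_3", "kinase_1", "protease_1"}
globins = {"globin_1", "globin_2", "globin_3"}
hits = {"globin_1", "globin_2", "kinase_1"}
result = classify(proteins, globins, hits)
```

Here `globin_3` ends up as a false negative and `kinase_1` as a false positive.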
16. Example
The set of all globins and a test to identify them
Consider:
A set S (say, globins: G)
A test t that tries to detect members of S (for
example, through a pairwise comparison with
another globin).
17. Concept of a sequence alignment
Pairwise Alignment:
Explicit mapping between the residues of 2
sequences
Tolerant to errors (mismatches, insertion /
deletions or indels)
Evaluation of the alignment in a biological context
(significance)
18. Number of alignments
There are many ways to align two sequences
Consider the sequence fragments below: a simple
alignment shows some conserved portions
Number of possible alignments for 2 sequences of length 1000 residues:
more than 10^600 gapped alignments
(for comparison: Avogadro's number is about 10^24; the estimated number of atoms in the universe is about 10^80)
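The size of this number can be checked with a short recurrence. The sketch below assumes one particular definition of "alignment": every column is either residue/residue, residue/gap or gap/residue, and alignments that differ only in column order around gaps are counted separately. Other definitions give different counts, but the growth is exponential in each case.

```python
def count_alignments(n, m):
    """Number of gapped alignments of sequences of length n and m.

    An alignment ends in one of three column types, so
    f(i, j) = f(i-1, j) + f(i, j-1) + f(i-1, j-1), with f(i, 0) = f(0, j) = 1.
    """
    previous = [1] * (m + 1)
    for _ in range(n):
        current = [1]
        for j in range(1, m + 1):
            current.append(previous[j] + current[j - 1] + previous[j - 1])
        previous = current
    return previous[m]


# Python integers are arbitrary precision, so the exact value is available.
digits = len(str(count_alignments(1000, 1000)))
```

Under this counting convention the value for two sequences of length 1000 has more than 600 decimal digits, in line with the figure quoted above.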
19. What is a good alignment ?
We need a way to evaluate the biological meaning
of a given alignment
Intuitively we "know" that the following alignment:
We can express this notion more rigorously, by using a scoring
system
20. Scoring system
Simple alignment scores
A simple way (but not the best) to score an
alignment is to count 1 for each match and 0 for
each mismatch.
21. Importance of the scoring system
Discrimination of significant biological alignments
Based on physico-chemical properties of amino-acids
Hydrophobicity, acid / base, steric properties, ...
Scoring system scales are arbitrary
Based on biological sequence information
Substitutions observed in structural or evolutionary alignments of well
studied protein families
Scoring systems have a probabilistic foundation
Substitution matrices
In proteins some mismatches are more acceptable than others
Substitution matrices give a score for each substitution of one amino acid
by another
Dot Plots or Diagonal plots
Produces a graphical representation of similarity regions.
22. Dot Plots
A dot plot gives an overview of all possible alignments
23. In a dot plot, each diagonal corresponds to a
possible (ungapped) alignment
25. Concept of a dot plot
• Produces a graphical representation of similarity regions.
• The horizontal and vertical dimensions correspond to the compared sequences.
• A region of similarity stands out as a diagonal
A Simple example
A dot is placed at each position where two
residues match.
The colour of the dot can be chosen
according to the substitution value in the
substitution matrix
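A text-only dot plot is enough to see the diagonal. The sketch below places a `*` wherever two residues are identical; the optional `window` parameter is an assumption added here as one common way to reduce noise, by requiring a run of identical residues before a dot is drawn. The function name is hypothetical.

```python
def dot_plot(seq_x, seq_y, window=1):
    """Return the plot as a list of strings.

    Columns follow seq_x (horizontal), rows follow seq_y (vertical).
    A '*' at row i, column j means the words of length `window`
    starting at seq_y[i] and seq_x[j] are identical.
    """
    rows = []
    for i in range(len(seq_y) - window + 1):
        row = ""
        for j in range(len(seq_x) - window + 1):
            same = seq_y[i:i + window] == seq_x[j:j + window]
            row += "*" if same else "."
        rows.append(row)
    return rows


for line in dot_plot("ACGT", "ACGT"):
    print(line)
# *...
# .*..
# ..*.
# ...*
```

With `window=1` a repetitive sequence such as "AAC" against itself produces off-diagonal dots; with `window=2` only the main diagonal remains.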
26. Limitations of a dot plot
• It is a visual aid.
• It does not provide an alignment.
• With simple identity matching, the plot often contains too
much noise to be useful unless it is filtered
27. Protein Scoring Systems
Scoring matrices reflect:
• the number of mutations needed to convert one amino acid into another
• chemical similarity
• observed mutation frequencies
• the probability of occurrence of each amino acid
28. Substitution Matrices (Log odds matrices)
Two popular sets of matrices for protein
sequences
PAM matrices [Dayhoff et al., 1978]
BLOSUM matrices [Henikoff & Henikoff, 1992]
Both try to capture the relative substitutability of
amino acid pairs in the context of evolution
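The "log odds" in the slide title refers to the general form that both matrix families share. In the notation below, which is introduced here and not taken from the slides, q_ab is the frequency with which residues a and b are observed aligned in trusted alignments, p_a and p_b are their background frequencies, and λ is a scaling constant:

```latex
s(a,b) = \frac{1}{\lambda} \, \log \frac{q_{ab}}{p_a \, p_b}
```

A positive score means the pair is seen together more often than expected by chance, a negative score means less often, and a score of zero means exactly as often as chance predicts.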
29. PAM series (Dayhoff M., 1968, 1972, 1978)
PAM (Percent Accepted Mutation) matrices: a family of
matrices such as PAM80, PAM120 and PAM250
A unit introduced by Dayhoff et al. to quantify the amount of
evolutionary change in a protein sequence.
The number in a PAM matrix name represents the evolutionary
distance between the sequences on which the matrix is
based
The PAM-1 matrix reflects an average change of 1% of all
amino acid positions.
PAM250 = 250 accepted mutations per 100 residues (a position can
mutate more than once, so the sequences still share some identity).
Greater numbers mean greater evolutionary distance
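Higher PAM matrices are obtained by extrapolation: the PAM-1 mutation probability matrix is multiplied by itself n times to model n PAM units of change. The sketch below uses a toy two-letter alphabet with made-up probabilities, purely to show why 250 PAMs does not mean that every position differs. `TOY_PAM1` and the function names are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
        for i in range(size)
    ]


def pam_n(pam1, n):
    """Mutation probability matrix after n PAM units (pam1 to the power n)."""
    size = len(pam1)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, pam1)
    return result


# Toy PAM-1: each letter has a 1% chance of changing per PAM unit.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = pam_n(TOY_PAM1, 250)
```

In this toy model a letter still has slightly more than a 50% chance of being unchanged after 250 steps, because repeated and reverse mutations hit the same position. The real matrices use 20 amino acids and are then converted to log-odds scores.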
30. Percent Accepted Mutation.
A PAM(x) substitution matrix is a look-up table in
which scores for each amino acid substitution
have been calculated based on the frequency of
that substitution in closely related proteins that
have experienced a certain amount (x) of
evolutionary divergence.
Based on 1572 observed changes in 71 groups of closely related proteins
Old standard matrix: PAM250
33. BLOSUM matrices
Different BLOSUMn matrices are calculated
independently from BLOCKS (ungapped local
alignments)
For BLOSUMn, sequences within a block that share at least n percent
identity are clustered and counted as a single sequence
BLOSUM62 represents closer sequences than
BLOSUM45
The number in the matrix name (e.g. 62 in
BLOSUM62) refers to the percentage of sequence
identity used to build the matrix.
Greater numbers mean smaller evolutionary
distance.
34. BLOSUM series (Henikoff S. & Henikoff
JG., PNAS, 1992)
Blocks Substitution Matrix.
A substitution matrix in which scores for each position are
derived from observations of the frequencies of substitutions
in blocks of local alignments in related proteins.
Each matrix is tailored to a particular evolutionary distance.
In the BLOSUM62 matrix, for example, the alignment from which
scores were derived was created using sequences sharing no
more than 62% identity.
Sequences more identical than 62% are represented by a single
sequence in the alignment so as to avoid over-weighting
closely related family members.
Based on alignments in the BLOCKS database: Standard matrix:
BLOSUM62
37. TIPS on choosing a scoring matrix
Generally, BLOSUM matrices perform better
than PAM matrices for local similarity searches
(Henikoff & Henikoff, 1993).
When comparing closely related proteins one
should use lower PAM or higher BLOSUM
matrices, for distantly related proteins higher
PAM or lower BLOSUM matrices.
For database searching the commonly used
matrix is BLOSUM62.
38. Limitations of Substitution Matrices
Substitution matrices do not take into account
long range interactions between residues.
They assume that identical residues are equal
(whereas in real life a residue at the active site
is under different evolutionary constraints than the
same residue outside the active site)
They assume the rate of evolution to be constant.
39. Gaps
Insertions or deletions
Proteins often contain regions where residues have
been inserted or deleted during evolution
There are constraints on where these insertions and
deletions can happen (between structural or
functional elements like: alpha helices, active site,
etc.)
40. Why Gap Penalties?
The optimal alignment of two similar sequences is
usually that which
maximizes the number of matches and
minimizes the number of gaps.
There is a tradeoff between these two - adding gaps
reduces mismatches
Permitting the insertion of arbitrarily many gaps can
lead to high scoring alignments of non-
homologous sequences.
Penalizing gaps forces alignments to have relatively
few gaps.
41. Gap Penalties
How to balance gaps with mismatches?
Gaps must get a steep penalty, or else you’ll end
up with nonsense alignments.
In real sequences, multi-base (or multi-residue)
gaps are quite common
(they arise from genetic insertion/deletion events)
“Affine” gap penalties give a big penalty for each
new gap, but a much smaller “gap extension”
penalty.
42. Gap opening and extension penalties
Costs of gaps in alignments
We want to simulate as closely as possible the evolutionary
mechanisms involved in gap occurrence.
Example
Two alignments with identical number of gaps but very different
gap distribution.
We may prefer one large gap to several small ones (e.g. poorly
conserved loops between well-conserved helices)
Gap opening penalty
Counted each time a gap is opened in an alignment (some
programs include the first extension into this penalty)
Gap extension penalty
Counted for each extension of a gap in an alignment
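The two penalties combine into the gap function w(k). The sketch below uses illustrative values (opening 10, extension 1) and a flag for the two conventions mentioned above, since programs differ on whether the opening penalty already covers the first gap position. The function name is hypothetical.

```python
def gap_cost(length, gap_open=10, gap_extend=1, open_includes_first=False):
    """Affine cost w(k) of a single gap spanning `length` positions."""
    if length <= 0:
        return 0
    # Some programs charge extension for every position, others
    # treat the first position as paid for by the opening penalty.
    extensions = length - 1 if open_includes_first else length
    return gap_open + gap_extend * extensions


one_long_gap = gap_cost(3)          # a single gap of length 3
three_short_gaps = 3 * gap_cost(1)  # three separate gaps of length 1
```

With these values one gap of length 3 costs 13, while three gaps of length 1 cost 33, so the affine scheme prefers one large gap to several small ones.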
43. Example
With a match score of 1 and a mismatch score of
0
With an opening penalty of 10 and extension
penalty of 1, we have the following score:
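As an illustration with hypothetical sequences (not the slide's own example), the scoring can be sketched in a few lines. The sketch assumes that every gap position pays the extension penalty in addition to the opening penalty; programs that fold the first position into the opening penalty would give slightly different totals.

```python
def affine_alignment_score(aligned_a, aligned_b, match=1, mismatch=0,
                           gap_open=10, gap_extend=1):
    """Score two aligned sequences ('-' marks a gap) with affine gap costs."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    previous_gap_in = None  # which sequence carried a gap in the last column
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot contain two gaps")
        if x == "-" or y == "-":
            gap_in = "a" if x == "-" else "b"
            if gap_in != previous_gap_in:
                score -= gap_open  # a new gap starts here
            score -= gap_extend
            previous_gap_in = gap_in
        else:
            score += match if x == y else mismatch
            previous_gap_in = None
    return score


# Same two sequences, same number of gap positions, different distribution.
one_gap = affine_alignment_score("ACGTACGT", "ACG--CGT")
two_gaps = affine_alignment_score("ACGTACGT", "ACG-C-GT")
```

The first alignment has six matches and one gap of length 2 (6 - 12 = -6); the second has five matches and two gaps of length 1 (5 - 22 = -17).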
44. Statistical evaluation of results
Alignments are evaluated according to their score
• Raw score
It is the sum of the amino acid substitution scores and gap penalties (gap
opening and gap extension)
Depends on the scoring system (substitution matrix, etc.)
Different alignments should not be compared based only on the raw score
It is possible that a "bad" long alignment gets a better raw score than a very
good short alignment.
We need a normalised score to compare alignments
We need to evaluate the biological meaning of the score (p-value, e-value).
• Normalised score
Is independent of the scoring system
Allows the comparison of different alignments
Units: expressed in bits
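One widely used normalisation, from Karlin-Altschul statistics as applied in BLAST, converts a raw score S into a bit score S' = (λS - ln K) / ln 2. The parameters λ and K depend on the scoring system and must be supplied by the caller; no particular values are assumed here, and the function names are hypothetical.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalised score in bits from a raw score and the parameters lambda, K."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expected_hits(bits, query_length, database_length):
    """Expected number of chance hits with at least this bit score (E = m n 2^-S')."""
    return query_length * database_length * 2.0 ** (-bits)
```

Because λ and K absorb the scale of the substitution matrix and gap penalties, bit scores obtained with different scoring systems can be compared directly.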
45. Distribution of alignment scores - Extreme Value
Distribution
Random sequences and alignment scores
Sequence alignment scores between random
sequences are distributed following an extreme
value distribution (EVD).
46. Extreme Value Distribution
High scoring random alignments have a low probability.
The EVD allows us to compute the probability with which
our biological alignment could be due to randomness
(to chance).
Caveat: a threshold for significant alignments still has to be chosen.
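For ungapped local alignments the extreme value distribution has a known closed form. In the notation below, which is introduced here, m and n are the lengths of the two sequences (or of the query and the database), and λ and K are parameters that depend on the scoring system:

```latex
P(S \ge x) \approx 1 - \exp\!\left(-K \, m \, n \, e^{-\lambda x}\right)
```

The exponent K m n e^{-λx} is the expected number of chance alignments scoring at least x. For gapped alignments the same form is generally used, with λ and K estimated empirically.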
47. Statistics derived from the scores
p-value
Probability that an alignment with this score occurs by
chance in a database of this size
The closer the p-value is to 0, the more significant the alignment
e-value
Number of matches with this score one can expect to find by
chance in a database of this size
The closer the e-value is to 0, the more significant the alignment
Relationship between e-value and p-value:
In a database containing N sequences, e = p × N, where p is the
p-value of a single pairwise comparison
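The two statistics can be converted into each other. The sketch below assumes that chance hits follow a Poisson distribution, which gives p = 1 - exp(-e) for the probability of at least one chance hit in the whole search; the function names are hypothetical.

```python
import math


def p_value_from_e_value(e_value):
    """Probability of at least one chance hit, given the expected number of hits."""
    # expm1 keeps precision when e_value is very small.
    return -math.expm1(-e_value)


def e_value_from_single_p(p_single, database_size):
    """Expected chance hits when each of N comparisons has probability p_single."""
    return p_single * database_size
```

For small values the two numbers are nearly identical (an e-value of 0.001 corresponds to a p-value of about 0.001), while a large e-value such as 10 corresponds to a p-value very close to 1.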