4. Sequence alignment
Alignment: comparing two (pairwise) or more (multiple)
sequences by searching for a series of identical or
similar characters in the sequences.
- Similarity: residues with the same physicochemical properties.
- Identity: identical residues.
MVNLTSDEKTAVLALWNKVDVEDCGGE
|| ||  ||||| ||| || |   |||
MVHLTPEEKTAVNALWGKVNVDAVGGE
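As a small illustration of identity, the following sketch counts the identical columns in the two sequences shown above. The function name `identity_stats` is a hypothetical helper, not part of any library, and the sketch assumes the two rows are already aligned and of equal length.

```python
def identity_stats(row_a: str, row_b: str) -> tuple[int, float]:
    """Count identical, non-gap columns in two equal-length alignment rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    identical = sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")
    return identical, 100.0 * identical / len(row_a)


seq_top = "MVNLTSDEKTAVLALWNKVDVEDCGGE"
seq_bottom = "MVHLTPEEKTAVNALWGKVNVDAVGGE"

count, percent = identity_stats(seq_top, seq_bottom)
print(f"{count} identical positions out of {len(seq_top)} ({percent:.1f}%)")
```

Measuring similarity rather than identity would additionally need a grouping of residues by physicochemical properties or a substitution matrix, which this sketch leaves out.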
5. Sequence alignment: why?
• Proteins and genes can be compared through the
similarity of their sequences because similar
proteins or genes are related by evolution;
they have a common ancestor.
• Random mutations in the sequences accumulate
over time, so that proteins or genes that have a
common ancestor far back in time are not as similar
as proteins or genes that diverged from each other
more recently.
6. Alignment
• A way of arranging objects or strings of symbols to
find the similarities and differences
between them.
• In bioinformatics, it is the arrangement
of sequences (DNA, RNA or protein) to find
the regions of similarity and difference, from
which homology can be inferred.
9. Why perform pairwise sequence
alignment?
• Finding homology between two sequences
• Example: protein prediction (sequence or
structure):
similar sequence (or structure)
→ similar function
10. Local Vs. Global
• Global alignment compares the sequences along their
entire length and gives the best overall alignment, but it
may fail to find the local regions of similarity, which are
exactly where the domain and motif information lies.
• Local alignment finds the regions with the highest level
of similarity without forcing the rest of the sequences to
align. It is best for finding motifs even when the two
sequences are otherwise different.
11. Local vs. Global
• Local alignment: finds regions of high similarity in
parts of the sequences
• Global alignment: finds the best alignment across
the entire two sequences
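The difference between the two modes can be sketched with the two classic dynamic programming recurrences: Needleman-Wunsch for global alignment and Smith-Waterman for local alignment. The scoring values (match +1, mismatch -1, gap -1) and the example sequences are assumptions chosen for illustration; only the scores are computed, not the alignments themselves.

```python
def _sub(a: str, b: str, match: int, mismatch: int) -> int:
    return match if a == b else mismatch


def global_score(a: str, b: str, match=1, mismatch=-1, gap=-1) -> int:
    """Needleman-Wunsch: best score when both sequences are aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            curr.append(max(
                prev[j - 1] + _sub(a[i - 1], b[j - 1], match, mismatch),
                prev[j] + gap,      # gap in b
                curr[j - 1] + gap,  # gap in a
            ))
        prev = curr
    return prev[-1]


def local_score(a: str, b: str, match=1, mismatch=-1, gap=-1) -> int:
    """Smith-Waterman: best score over any pair of sub-regions (never below 0)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            cell = max(
                0,  # start a new local alignment here
                prev[j - 1] + _sub(a[i - 1], b[j - 1], match, mismatch),
                prev[j] + gap,
                curr[j - 1] + gap,
            )
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# Two sequences that share only the short region "CGT".
print(global_score("AAAACGT", "CGTTTTT"))
print(local_score("AAAACGT", "CGTTTTT"))
```

In this example the global score is negative because the unrelated flanking residues must be aligned or gapped, while the local score reflects only the shared region.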
12. Evolutionary changes in sequences
Three types of nucleotide changes:
1. Substitution: a replacement of one (or more)
sequence characters by another, e.g. AAGA → AACA
2. Insertion: an insertion of one (or more) sequence
characters
3. Deletion: a deletion of one (or more) sequence
characters
Insertion + Deletion = Indel
13. Choosing an alignment
• Many different alignments between two
sequences are possible:

AAGCTGAATTCGAA
AGGCTCATTTCTGA

A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-

AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-

• How can one determine which is the best alignment?
14. Exercise
Compute the score of each of the following alignments.

AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-

A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-

Scoring scheme:
• Match: +1
• Mismatch: -2
• Indel: -1

Substitution matrix:
     A   C   G   T
A    1  -2  -2  -2
C   -2   1  -2  -2
G   -2  -2   1  -2
T   -2  -2  -2   1

Gap penalty (opening = extending): -1
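One way to check the exercise is to score each alignment column by column with the scheme given (match +1, mismatch -2, indel -1, with no separate gap opening cost). The function name `score_alignment` is a hypothetical helper used only for this sketch.

```python
MATCH, MISMATCH, INDEL = 1, -2, -1


def score_alignment(row_a: str, row_b: str) -> int:
    """Sum the column scores of two equal-length alignment rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += INDEL      # every gap position costs the same
        elif a == b:
            total += MATCH
        else:
            total += MISMATCH
    return total


first = score_alignment("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-")
second = score_alignment("A-AGCTGAATTC--GAA", "AG-GCTCA-TTTCTGA-")
print(first, second)  # 2 -1
```

Under this scheme the first alignment (10 matches, 2 mismatches, 4 indels) scores higher than the second (9 matches, 2 mismatches, 6 indels).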
15. Open Reading Frames (ORFs)
• 6 possible reading frames
– frames 1, 2 and 3 in the 5’ to 3’ direction of the given strand
– frames 1, 2 and 3 in the 5’ to 3’ direction
of the complementary strand
• The different reading frames give
entirely different proteins.
• Each gene uses a single reading frame, so
once the ribosome gets started, it just has
to count off groups of 3 bases to produce
the proper protein.
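The six reading frames can be sketched by splitting a DNA string and its reverse complement into codons at offsets 0, 1 and 2. The frame labels "+1" to "-3" and the example sequence are assumptions made for illustration; finding actual ORFs would also require scanning for start and stop codons, which is not shown.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the complementary strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def codons(dna: str, offset: int) -> list[str]:
    """Split dna into complete triplets, starting at the given offset."""
    return [dna[i:i + 3] for i in range(offset, len(dna) - 2, 3)]


def six_frames(dna: str) -> dict[str, list[str]]:
    """Codons for frames +1..+3 (given strand) and -1..-3 (complementary strand)."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = codons(dna, offset)
        frames[f"-{offset + 1}"] = codons(rc, offset)
    return frames


frames = six_frames("ATGAAATAGC")
for name in sorted(frames):
    print(name, " ".join(frames[name]))
```

Each frame yields a different codon series, which is why the frames translate to entirely different proteins.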
16. PAM matrices
• Family of matrices: PAM 80, PAM 120, PAM 250, …
• The number of a PAM matrix (the n in PAMn) represents
the evolutionary distance between the sequences on which
the matrix is based, in accepted point mutations per 100 residues
• The (i, j) cell of the PAMn mutation probability matrix denotes
the probability that amino acid i will be replaced by amino acid j
over evolutionary distance n: $P_{i \to j,\,n}$. The scores used
for alignment are log-odds values derived from these probabilities.
• Greater n denotes greater evolutionary distance
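In the Dayhoff model, matrices for larger distances are obtained by multiplying the PAM1 mutation probability matrix by itself n times. The sketch below shows this extrapolation on a toy two-residue alphabet; the values in `TOY_PAM1` are invented for illustration and are not real amino-acid data.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as nested lists."""
    size = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_pow(matrix, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, matrix)
    return result


# Hypothetical PAM1 for a two-letter alphabet: each residue changes with
# probability 0.01 per PAM unit (rows are "from", columns are "to").
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

for n in (1, 80, 120, 250):
    pam_n = mat_pow(TOY_PAM1, n)
    print(f"PAM{n}: P(residue 0 -> residue 1) = {pam_n[0][1]:.4f}")
```

The replacement probability grows with n, which matches the statement that greater n denotes greater evolutionary distance.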
17. BLOSUM matrices
• Different BLOSUMn matrices are calculated independently
from BLOCKS (ungapped, manually created local alignments)
• BLOSUMn is based on BLOCKS in which sequences that share
at least n percent identity are clustered together
• The (i, j) cell in a BLOSUM matrix denotes the log-odds ratio
of the observed frequency of amino acids i and j in the same
position in the data to the frequency expected by chance:
$\log\left(p_{ij} / (q_i q_j)\right)$
• Higher n denotes higher identity between the
sequences on which the matrix is based
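The log-odds formula can be tried on a toy example. The pair frequencies in `PAIR_FREQ` are invented numbers for a two-residue alphabet (they are not taken from BLOCKS), and base-2 logarithms are an assumption made for readability; published BLOSUM matrices additionally scale and round the values.

```python
import math

# Hypothetical observed frequencies of ordered residue pairs in aligned columns.
PAIR_FREQ = {
    ("L", "L"): 0.40,
    ("K", "K"): 0.40,
    ("L", "K"): 0.10,
    ("K", "L"): 0.10,
}


def background(pair_freq):
    """Background frequency q_i of each residue, derived from the pair table."""
    q = {}
    for (a, _b), p in pair_freq.items():
        q[a] = q.get(a, 0.0) + p
    return q


def log_odds(pair_freq, base=2.0):
    """Score each pair as log(p_ij / (q_i * q_j))."""
    q = background(pair_freq)
    return {(a, b): math.log(p / (q[a] * q[b]), base)
            for (a, b), p in pair_freq.items()}


scores = log_odds(PAIR_FREQ)
for pair, value in sorted(scores.items()):
    print(pair, round(value, 2))
```

Pairs seen more often than chance predicts get a positive score, and pairs seen less often get a negative one.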
18. BLAST
(Basic Local Alignment Search Tool)
• The BLAST program was designed by Eugene
Myers, Stephen Altschul, Warren Gish, David J.
Lipman and Webb Miller at the NIH and was
published in J. Mol. Biol. in 1990.
• Objective: find high-scoring ungapped segments
among related sequences
• One of the most widely used bioinformatics programs,
in part because the algorithm emphasizes speed over sensitivity.
19. • An algorithm for comparing primary biological
sequence information to find the similarity
between sequences.
• Emphasizes regions of local alignment to
detect relationships among sequences that
share only isolated regions of similarity.
• Not only a tool for visualizing alignments but
also a basis for comparing structure and
function.
20. Steps for BLAST
• BLAST searches for exact matches of a small fixed length
between the query sequence and the database sequences; each
such match is called a seed.
• BLAST tries to extend the match in both directions,
starting at the seed, as an ungapped alignment. The result is a
High-scoring Segment Pair (HSP).
• The highest-scoring HSPs are presented in the final report.
They are called Maximal-scoring Segment Pairs (MSPs).
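The seed-and-extend idea can be sketched in a few lines. This is a simplified illustration rather than the BLAST implementation: the word size `w`, the match/mismatch scores and the drop-off threshold `x_drop` are assumed values, and real BLAST uses a substitution matrix and further heuristics.

```python
def find_seeds(query: str, subject: str, w: int = 3) -> list[tuple[int, int]]:
    """Return (query_pos, subject_pos) for every exact word match of length w."""
    index = {}
    for j in range(len(subject) - w + 1):
        index.setdefault(subject[j:j + w], []).append(j)
    return [(i, j)
            for i in range(len(query) - w + 1)
            for j in index.get(query[i:i + w], [])]


def extend_seed(query, subject, qi, sj, w=3, match=1, mismatch=-1, x_drop=2):
    """Extend a seed without gaps in both directions; return (score, query segment)."""

    def walk(q, s, step):
        # Keep the best running gain; stop once it falls x_drop below that best.
        gain = best_gain = 0
        kept = n = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            gain += match if query[q] == subject[s] else mismatch
            n += 1
            if gain > best_gain:
                best_gain, kept = gain, n
            elif best_gain - gain >= x_drop:
                break
            q += step
            s += step
        return best_gain, kept

    right_gain, right_len = walk(qi + w, sj + w, +1)
    left_gain, left_len = walk(qi - 1, sj - 1, -1)
    score = w * match + left_gain + right_gain
    return score, query[qi - left_len:qi + w + right_len]


query, subject = "GATTACAGG", "CCATTACATT"
seeds = find_seeds(query, subject)
print(seeds)
print(extend_seed(query, subject, *seeds[0]))  # (6, 'ATTACA')
```

Each extended seed plays the role of an HSP; ranking the HSPs by score and reporting the best ones corresponds to the last step on the slide.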
21. BLAST then performs a gapped alignment
between the query sequence and the database
sequence using a variation of the Smith-Waterman
algorithm. Statistically significant alignments
are then displayed to the user.
22. BLAST PROGRAMS
• BLASTP: protein query sequence against a protein
database, allowing for gaps.
• BLASTN: DNA query sequence against a DNA database,
allowing for gaps.
• BLASTX: DNA query sequence, translated into all six
reading frames, against a protein database, allowing for
gaps.
• TBLASTN: protein query sequence against a DNA
database, translated into all six reading frames, allowing
for gaps.
• TBLASTX: DNA query sequence, translated into all six
reading frames, against a DNA database, translated into
all six reading frames (No gaps allowed)
23. PSI-BLAST
(Position-Specific Iterated BLAST)
• Used to find distant relatives of a protein.
• First, a list of all closely related proteins is
created. These proteins are combined into a
general "profile", a position-specific scoring matrix.
• This profile is then used as the query, and the
search is repeated to find more distantly
related sequences.
• PSI-BLAST is much more sensitive in picking
up distant evolutionary relationships than a
standard protein-protein BLAST.
25. Matrix
• A key element in evaluating the quality of a
pairwise sequence alignment is the
"substitution matrix", which assigns a score for
aligning any possible pair of residues.
• BLAST offers BLOSUM and PAM matrices.
28. The Score Matrix
Sequence A: ACDEFGH (columns)
Sequence B: HICDYGH (rows)
Cell X(i, j) holds the score for aligning residue A_i with residue B_j.

     A   C   D   E   F   G   H
H   -2  -3  -1   0  -3  -2   8
I   -1  -1  -3  -3   0  -4  -3
C    0   9  -3  -4  -2  -3  -3
D   -2  -3   6   2  -3  -1  -1
Y   -2  -2  -3  -2   3  -3   2
G    0  -3  -1  -2  -3   6  -2
H   -2  -3  -1   0  -3  -2   8

Example alignment:
-ACDEFGH
HICD-YGH
• Gaps: positions aligned with "-"
• Similarity: different residues with a positive score (F/Y)
• Identity: identical residues (C/C, D/D, G/G, H/H)
29. Paths in the Score Matrix
-ACDEFGH
HICD-YGH
• Each step of a path through the score matrix is a match
(diagonal step), an insertion or a deletion (horizontal or
vertical step).
• Alignments are in a one-to-one correspondence
with score matrix paths.
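The path for the alignment -ACDEFGH / HICD-YGH can be scored by reading one cell of the score matrix per aligned pair and charging a penalty per gap. The matrix values are the ones shown on the slides (sequence A in columns, sequence B in rows); the gap penalty of -4 is an assumption, since the slides do not state one.

```python
COLUMNS = "ACDEFGH"  # residues of sequence A
ROWS = {             # residues of sequence B (both H rows of the slide are identical)
    "H": [-2, -3, -1, 0, -3, -2, 8],
    "I": [-1, -1, -3, -3, 0, -4, -3],
    "C": [0, 9, -3, -4, -2, -3, -3],
    "D": [-2, -3, 6, 2, -3, -1, -1],
    "Y": [-2, -2, -3, -2, 3, -3, 2],
    "G": [0, -3, -1, -2, -3, 6, -2],
}
GAP = -4  # assumed gap penalty, not given in the source


def describe_alignment(row_a: str, row_b: str, gap: int = GAP):
    """Return (score, identities, similarities, gaps) for one alignment path."""
    score = identities = similarities = gaps = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score += gap
            gaps += 1
            continue
        cell = ROWS[b][COLUMNS.index(a)]
        score += cell
        if a == b:
            identities += 1
        elif cell > 0:  # different residues with a positive score
            similarities += 1
    return score, identities, similarities, gaps


print(describe_alignment("-ACDEFGH", "HICD-YGH"))  # (23, 4, 1, 2)
```

A different path through the matrix gives a different alignment and, in general, a different score.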
30. Low Complexity Regions
• Amino acid or DNA sequence regions that carry very
little information because of their highly biased composition
– histidine-rich domains in proteins
– poly-A tails in DNA sequences
– poly-G tails in nucleotide sequences
– runs of purines
– runs of pyrimidines
– runs of a single amino acid, etc.
31. E-value
• Depends on database size
• Indicates the number of database
matches expected as a result of random
chance
• The lower the E-value, the more significant
the match and the less likely it is the
result of random chance
32. E = m × n × p
E = E-value
m = total number of residues in the database
n = number of residues in the query sequence
p = probability that a high-scoring segment pair is the result of
random chance
33. • E-value between 10^-50 and 0.01: homology
• E-value between 0.01 and 10: not significant, but may
hint at remote homology
• E-value > 10: unrelated, or too distantly related to be detected
34. Bit Score
• Measures sequence similarity independently of
query sequence length and database size; it is normalized
from the raw pairwise alignment score
• The higher the bit score, the more significant the match
• $S' = (\lambda S - \ln K) / \ln 2$
S' = bit score
S = raw alignment score
λ = Gumbel distribution constant
K = constant associated with the scoring matrix
(λ and K are two statistical parameters)
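As a worked example, the bit score formula can be combined with the standard Karlin-Altschul relation E = m × n × 2^(-S'), which plays the role of E = m × n × p with p = 2^(-S'). The values of `LAMBDA` and `K` below are illustrative assumptions; in practice they depend on the scoring matrix and gap penalties and are reported by the search program.

```python
import math

LAMBDA = 0.267  # assumed value, for illustration only
K = 0.041       # assumed value, for illustration only


def bit_score(raw_score: float, lam: float = LAMBDA, k: float = K) -> float:
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score: float, m: float, n: float,
            lam: float = LAMBDA, k: float = K) -> float:
    """Expected number of chance hits: E = m * n * 2 ** (-S')."""
    return m * n * 2.0 ** (-bit_score(raw_score, lam, k))


raw = 100            # hypothetical raw alignment score
db_residues = 1e9    # m: hypothetical database size
query_residues = 250 # n: hypothetical query length

print(f"bit score = {bit_score(raw):.1f}")
print(f"E-value   = {e_value(raw, db_residues, query_residues):.2e}")
```

Doubling the database size doubles the E-value for the same alignment, while the bit score stays unchanged.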
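35. Low Complexity Regions (LCR)
Masking:
(I) Hard masking: masked residues are replaced by N (nucleotides)
or X (amino acids)
(II) Soft masking: masked residues are converted to lowercase
Programs for masking:
(i) SEG: regions with a high frequency of a few residues are declared LCR
(ii) RepeatMasker: a sequence region scoring above a
certain threshold is declared LCR. Its residues are
masked with N’s or X’s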
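The difference between hard and soft masking can be sketched with a deliberately simple detector that treats any run of one repeated character as low complexity. This is a stand-in for illustration only and is not how SEG or RepeatMasker work; the `min_run` threshold and the function name `mask_runs` are assumptions.

```python
import re


def mask_runs(seq: str, min_run: int = 6, mode: str = "hard",
              mask_char: str = "N") -> str:
    """Mask runs of a single repeated character that are at least min_run long.

    mode="hard" replaces the run with mask_char (N for nucleotides, X for proteins);
    mode="soft" keeps the residues but writes them in lowercase.
    """
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))

    def replace(match):
        run = match.group(0)
        return mask_char * len(run) if mode == "hard" else run.lower()

    return pattern.sub(replace, seq.upper())


dna = "ACGT" + "A" * 8 + "GCT"
print(mask_runs(dna, mode="hard"))  # ACGTNNNNNNNNGCT
print(mask_runs(dna, mode="soft"))  # ACGTaaaaaaaaGCT
```

Soft masking preserves the original residues, so a search program can ignore the region when seeding but still use it when extending an alignment.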
37. BLAST result page
• The BLAST result page is divided into 3 parts.
• Part 1 contains information on the program version, the database
used, the reference and the length of the query sequence.
• Part 2 shows the conserved regions and a graphical representation
of the alignments, where each line represents the alignment of
the query sequence with one database sequence.
• It shows the results in 5 different colors depending on the bit
score.
• Part 3 contains the list of database sequences found to be
similar in the search, and a detailed view of each
alignment along with its bit score, E-value, identities, positives
and gaps.
42. BLAST Preferred
• BLAST uses a substitution matrix to find
matching words, while FASTA identifies identical
matching words using a hashing procedure. By
default FASTA scans smaller window sizes.
Thus it gives more sensitive results than
BLAST, with better coverage of
homologs, but it is usually slower than BLAST.
43. • BLAST uses low-complexity masking, which means it
may have higher specificity than FASTA
because false positives are reduced
• BLAST sometimes gives multiple best-scoring
alignments from the same sequence; FASTA
returns only one final alignment