bioinformatics lecture 2.pptx and computational Boilogygy

• The ultimate goal of bioinformatics is to be
able to predict the biological processes in
health and disease.

• Genetic information is stored in the cell in the
form of biological macromolecules, such as
nucleic acids and proteins.

Alignments and phylogenetic
trees

EVOLUTIONARY BASIS OF
SEQUENCE ALIGNMENT
• evolution is defined as “descent with
modification” from a common ancestor.
• At the molecular level, the modification means
changes in DNA and protein sequence, and
corresponding changes in protein function
• As mutations accumulate in sequences derived
from an ancestral sequence,
• The derived sequences diverge from one another
over time, but sections of the sequences may still
retain enough similarity to allow identification of
a common ancestry

SEQUENCE ALIGNMENT
• Evolutionary change in a sequence does not always have
to be large; slight changes in certain crucial sections of a
sequence can have profound functional consequences.
• Expectedly, sequence comparison through sequence
alignment is central to most bioinformatics analysis.
• It is the first step towards understanding the evolutionary
relationship and the pattern of divergence between two
sequences.

SEQUENCE ALIGNMENT
• The relationship between two sequences also
helps predict the potential function of an
unknown sequence, thereby indicating protein
family relationship

SEQUENCE IDENTITY, SIMILARITY,
AND HOMOLOGY
• Sequence identity means the same residues
being present at corresponding positions in
two sequences being compared.
• A measure % of the proportion of exact
matches
• It means the same amino acids (for proteins);
or same bases (Nucleotides)

AND HOMOLOGY
• Sequence similarity means similar residues
being present at corresponding positions in
the two sequences being compared.
• A measure % of identical and similar bases
residues.
• For nucleic acids, sequence similarity and
sequence identity are the same However,

AND HOMOLOGY
• Protein sequence similarity: a measure of
identical and similar amino acid residues.
• for proteins, it involves amino acids with
similar physicochemical and functional
properties. E.g. substitution of lysine and
arginine by one another
• both are positively charged hydrophilic amino
acids

AND HOMOLOGY
• Substitution of isoleucine, leucine, and valine
by one another
• Substitution of serine and threonine by one
another
• Similar substitutions are also referred to as
conservative substitutions

AND HOMOLOGY
• A conservative amino acid substitution is not
expected to disrupt the structural/functional
attributes of the protein.

Sequence Homology
• Sequence homology is an evolutionary term
that been misused in most literature to
denote sequence similarity or identity.
• Sequences are called homologous if they have
a common evolutionary origin that is, if
• they are derived from a common ancestral
sequence.

Sequence Homology
• So, sequences are either homologous or not
homologous and there is no quantitation of
homology.
• However, even now, expressions like “high
homology,” “significant homology,” and even
specifying a “% homology” are very widely used.
• Such usage has no reference to the evolutionary
underpinning of the term homology.

Sequence Homology
• The root of the term homology goes back to the early
evolutionary literature, where organs having similar
structure and anatomical origin but performing
different functions (hence morphologically different)
were called homologous organs.
• Examples of homologous organs are bats’ wings,
whales’ flippers, and human hands;
• these are all mammalian forelimbs that are
morphologically different because they are adapted
to perform different functions

Sequence Homology
• Conversely, organs having different structure
and anatomical origin but performing the
same function (hence morphologically similar)
were called analogous organs.
• Such a character state (analogous organs)
shared by a set of species but not present in
their common ancestor is also called
homoplasy

Sequence Homology
• Examples of analogous organs/homoplasy are
bats’ wings and butterflies’ wings, and
dolphins’ flippers and sharks’ fins.
• Homoplasy is the result of convergent
evolution in which unrelated species develop
similar morphological structures because of
adaptation to the same or a similar
environment

Sequence Homology
• Homologous genes in different species
• performing the same function are called
orthologs
• E.g metallothionein-1 genes in rat and mouse
are also orthologs.

SEQUENCE IDENTITY AND
SEQUENCE SIMILARITY
• Sequence identity and sequence similarity can be
calculated based on the proportion of identical
and similar amino acids, respectively
• % Identity PIDP =
• NO of identical amino acids/NO of total amino
acids *100
• % similarity =
• % of identical aminoacids /No of similar
substitutionsof totalaminoacids *100

• Understanding the concept of sequence alignment: the assignment of residue–residue correspondences.
• • Knowing how to construct and interpret dotplots, and understanding the relationship between dotplots and
• alignments.
• • Being able to define and distinguish the Hamming distance and Levenshtein distance as measures of dissimilarity
of
• character strings.
• • Understanding the basis of scoring schemes for string alignment, including substitution matrices and gap
penalties.
• • Appreciating the difference between global alignments and local alignments, and understanding the use of
• approximate methods for quick screening of databases.
• • Understanding the significance of Z scores, and knowing how to interpret P values and E values returned by
• database searches.
• • Being able to interpret multiple alignments of amino acid sequences, and to make inferences from multiple
sequence
• alignments about protein structures.
• • Being able to define and distinguish the concepts of homology, similarity, clustering, and phylogeny.
• • Becoming expert in the use of PSI-BLAST and related programs.
• • Appreciating the use of profile methods and hidden Markov models in database searching.
• • Understanding the contents and significance of phylogenetic trees, and the methods available for deriving them,
• including maximum parsimony and maximum likelihood; knowing the role and use of an outgroup in derivation of
a
• phylogenetic tree.

Introduction to sequence alignment
Given two or more sequences, we wish to:
• measure their similarity;
• determine the residue–residue
correspondences;
• observe patterns of conservation and
variability;
• infer evolutionary relationships

• If we can do these, we will be in a good
position to go fishing in data banks for related
sequences

• Sequence alignment is the identification of residue–
residue correspondences. i.e. arranging sequences to
identify regions of similarity.
• Suggest relationships involving structure function and
evolution of the sequences from common ancestor
sequence.
• It is the basic tool of bioinformatics.
• Any assignment of correspondences that preserves the
order of the residues within the sequences is an
alignment.
• Gaps may be introduced to make the sequence align

sequence alignment
• Given two text strings
first string a b c d e
second string a c d e f
• A reasonable alignment would be:
a b c d e -
a - c d e f

sequence alignment
• We must define criteria so that an algorithm
can choose the best alignment.
• For the sequences
gctgaacg and ctataatc:

sequence alignment
• An uninformative alignment:
• - - - - - - - g c t g c t a t a a t c
• c t a t a a t c - - - - - - - - - - -
• An alignment without gaps g c t g a a c g
: c t a t a a t c
• An alignment g with gaps g c t g a - a - - c g
- - c t - a t a a t c
• And another: g c t g - a a - c g
- c t a t a a t c -

sequence alignment
• You may think the last of these alignments
are the best. To confirm this, and to decide
whether it is the best of all possibilities,
• we need to compute a score reflecting the
quality of each possible alignment
• Minor variations in the scoring scheme may
change the ranking of alignments,
• ClustalW is a free net-based sequence
alignment program

Types of sequence Alignment
According to the number of sequences.
2.Multiple sequence alignment (MSA)
1. Pairwise sequence alignment (PSA)

Pairwise Sequence Alignment (PSA)
• We compare only two sequences with each
other to find the region of similarity or
dissimilarity.
• It helps in finding the evolutionary history of
the sequences
• Consider two hypothetical protein sequences
• Seq: L I A H G S V M L N
• Seq: L I G H G S A M L P

Multiple Sequence Alignment (MSA)
• We compare more than two sequences with each
other to find out gene families, protein families,
enzyme active site, functional and evolutionary
relatedness etc.
• Seq A: L I – H G S V M L –
• seqA: L I – H G S A – L P
• Seq: – G H – S A – L P
• Multiple sequence alignment of hypothetical
sequence alignment

According to sequence length
• 1. Global sequence alignment
• 2. Local sequence alignment

Global sequence alignment
• A global sequence-alignment method aligns and
compares two sequences along their entire
length,
• comes up with the best alignment that displays
the maximum number of nucleotides or amino
acids aligned.
• works the best when the sequences are similar
in character and length.
• The algorithm that drives global alignment is
• the NeedlemanWunsch.

Global alignment
• Because global alignment displays the best
alignment between two sequences using the
• entire sequence, it may miss a small region of
biological importance.
• This is a trade-off in global alignment
• Two of the available web servers for pairwise
global alignment are EMBL-EBI EMBOSS
(http://www.ebi .ac.uk/Tools/psa/), and NCBI
specialized BLAST.

Local sequence alignment
• is intended to find the most similar regions
in two sequences being aligned.
• The algorithm that drives local alignment is the
SmithWaterman algorithm
• Obviously, local alignment is useful for
• sequences that are not similar in character and
length, yet are suspected to contain small regions
of similarity, such as biologically important
motifs.

Continue
• SeqA: T A G C – G C – G T
• SeqB: T A – C A – C A G T
• Global sequence alignment
• SeqA: C G A T A A C G T A T
• SeqB: - - A T A A C - - - -
• Local sequence alignment

LALIGN pairwise comparison. LALIGN output of pairwise comparison of mlst-1 (BAB03272.1; partial sequence)
and moatp-
2 (BAB12445.1; partial sequence) each containing the hypothetical sequence “THATISGREATANDFANTASTIC.”
LALIGN produces an overall
alignment of two protein sequences and also finds matching subsegments shared by these two input
sequences. Note that in LALIGN the
identities are reported by two dots and similar substitutions are reported by one dot.

NCBI BLAST pairwise alignment. The two partial sequences depicted in Figure 6.5 were also aligned using NCBI
BLAST
pairwise alignment. Like LALIGN, NCBI BLAST pairwise alignment also produces an overall alignment of two
protein sequences, and
also finds matching subsegments shared by these two sequences. The hypothetical sequence
“THATISGREATANDFANTASTIC” has been
identified as a subsegment of 100% identity between the two proteins

Why compare sequences?
• The extent of similarity between sequences
provides information on functional, structural
and evolutionary relationships.
• We can easily predict the structure of a gene
or a regulatory role of nucleotide sequence in
case of similar DNA sequences,
• Predict 3D structure and biochemical function
of proteins can be predicted based on
sequence similarities alone

• Sequences that originate from common ancestor
are termed homologous sequences
• Paralogous: are homologous sequences present
in one organism that perform distinct but related
functions and arose by gene dublication
• Orthologous: homologous sequences present in
different species,presumed to perform the the
same function and arose during speciation.
• Xenolougous: are present in different species and
arose by horizontal or lateral gene transfer.

Quantitative measure of
sequence similarities

Hamming distance
• This is the total number of mismatches
present in two sequence of equal length.
• dH (mn) = total NO of mismatches present in
alignment.
• E.g. Seq. A = T C C G T T A G T
• seq. B = T G C G T C A C T
• The hamming distance b/w A and B = 3 bc
only 3 mismatches are present.

Edit distance or levenshtein distance
1966
• This is the minimun NO of operations needed
to change one sequence into other, where the
edit operations are insertion, deletion or
substitution of a single character.
• E.g. using hypotetical sequences
• X= TGCTGAT and Y = TCCTAAGT; it takes 4
operations to convert X to Y hence the edit
distance is 4

• TGCTGAT
• TCGCTGAT insertion of C at 2nd
• TCGTGAT deletion of C at 4th position
• TCGTAAT substitution of G to A at 5th pos.
• TCGTAAGT insertion of G at 7th position
Edit distance or levenshtein distance

Gaps and Gap penalties
• These are introduced for the sake of getting
best alignment between sequences.
• When gap is introduced the gap penalty must
be introduced in the scoring system.
• Gap penalty may be gap opening or gap
extension penalty.
• Openig are highly penalized than extension

SCORING MATRIX
• A raw alignment score can be calculated based
on the following simple formula:
• Total score for identities+ total score for
mismatches - total gap penalty.
• For both nucleic acids and proteins, the
alignment score is calculated using a scoring
matrix

Nucleotides scoring matrices
• A scoring matrix is a set of values representing
the likelihood of one residue being substituted by
another during sequence divergence through
evolution.
• This is why the scoring matrix is also known as
the substitution matrix.
• A scoring matrix for comparing DNA sequences
• can be simple because there are only four
nucleotides and the mutation frequencies are
assumed to be equal

Nucleotides scoring matrices
• Positive value or high score is given for a match
• Positive value or low score is given for mismatch: e.g
+1 for a match, −1 for a mismatch
• frequency of transition mutations (purine replaced by
purine or pyrimidine replaced by pyrimidine) is higher
than trans version mutations (purine replaced by
pyrimidine or vice versa)
• The observed number of substitutions may be less than
what actually occurred
• scoring matrix is still used,; NUC4.2 and NUC4.4 DNA
scoring matrices. These matrices can be obtained from
the NCBI (ftp://ftp.ncbi.nih.gov/blast/matrices/).

Amino Acid
For proteins, We might group the amino acids
• into classes of similar physicochemical type, and
score +1 for a match within residue class and –1
for residues in different classes
• Scoring matrices for amino-acid substitutions are
more complex, reflecting
1. similarity of physicochemical properties,
2. the likelihood of one amino acid being
substituted by another at a particular position in
homologous proteins.
• The scoring matrices for proteins are 20 X 20
matrices. Two well-known types of scoring
matrices for proteins are PAM and BLOSUM.

PAM Matrices
Margaret Dayhoff and colleagues in 1978
• PAM (point accepted mutation—that is,
accepted
• also called percent accepted mutation)
• A PAM represents a substitution of one amino
acid by another that has been fixed by natural
selection because either it does not alter the
protein function or it is beneficial to the
organism.

• In a PAM1 matrix, which is the original PAM matrix
generated, a PAM unit is an evolutionary time over
which 1% of the amino acids in a sequence are
expected to undergo accepted mutations, resulting
in 1% sequence divergence
• Construction of a PAM1 matrix begins with
alignment of the full-length sequences,
reconstruction of the phylogenetic tree, and
determination of the ancestral sequences for the
internal nodes of the tree

• Each computed ancestral sequence is then used
to calculate the number and frequency of
substitutions in the sequences along each branch
arising from the node
• The values in the matrix represent the probability
that the amino acid in a column will be replaced
by the amino acid in row in a given evolutionary
time (1 PAM unit in a PAM1 matrix).
• From the computed probability, the percent
probability can be determined

• In order to deal with protein sequences that
are more diverged and distantly related, other
PAM matrices, such as PAM100 and PAM250,
were generated.
• These later PAM matrices were generated by
multiplying the PAM1 matrix by itself
hundreds of times

PET91 Matrix
(Jones et al.13 updated the PAM matrix)
• Pair exchange table 1991
• PET91 takes into account the substitutions
that were poorly represented in the original
Dayhoff matrix.
• The overall character of PAM and PET91
matrices is similar

PAM score and percentage sequence
identity

The entries are in alphabetical order of the three-letter amino acid names. Only
the lower triangles of the matrices are shown, as the substitution probabilities are
taken as symmetric. (This is not because we are sure that the rate of any
substitution is the same as the rate of its reverse, but because we cannot
determine the differences between the two rates.) The Dayhoff PAM250 matrix
(MDM78):

BLOSUM
Steven Henikoff and Jorja Henikoff in 1992
• BLOSUM (blocks substitution matrices) scoring
represents an alternative set of scoring
matrices, which are widely used in sequence
alignment algorithms
• Developed based on multiple alignment of 500
groups of related protein sequences, which
yielded 2000 blocks of conserved amino-acid
patterns.
• Blocks are un gapped multiple sequence
alignments corresponding to the most
conserved regions of the proteins involved.

• Different BLOSUM matrices differ in the %
sequence identity used in clustering.
• BLOSUM62 means that the sequences used
to create this matrix have approximately 62%
identity
• Substitution frequencies weigh more heavily
by protein sequences having less than 62%
identity

• BLOSUM62 is useful for aligning and scoring
proteins that show less than 62% identity
• GSFEIGNLLLII
• GSFEMGNLLVIV
• GSFEIGNLLLIV
• GGFEIGNLLVIV
• GGFEIGNLLVIV
• The conserved amino acids are shaded for
identification

BLOSUM62 substitution matrix
made by writing the amino acids in alphabetical order.

Differences b/w PAM and BLOSUM
PAM160.
1. constructed based on an
evolutionary model—that
is, from the estimation of
mutation rates through
constructing phylogenetic
trees and inferring the
ancestral sequence
2. BLOSUM matrices are
suitable for local sequence
alignments
BLOSUM62
1. BLOSUM matrices are
constructed based on
direct observation of
ungapped multiple
alignment-driven sequence
relationships.
2. PAM matrices are often
used for reconstructing
phylogenetic trees.

Differences b/w PAM and BLOSUM
PAM160.
3. PAM matrix construction involves
global alignment of the full-length
sequences consisting of both
conserved and diverged regions
4.More tolerable to hydrophilic-
amino-acid substitution
5. Less tolerant to hydrophobic-
amino-acid Substitution less
tolerable
6. for rare amino acids, such as
cysteine and tryptophan, typically
less tolerant to mismatches.
BLOSUM62
3. BLOSUM matrix construction
involves local sequence
alignment of conserved
sequence blocks.
4. Less tolerant to hydrophilic-
amino-acid substitution,
5. More tolerant to hydrophobic-
amino-acid Substitution
6. for rare amino acids, such as
cysteine and tryptophan,
typically more tolerant to
mismatches

What is an algorithm?
• An algorithm is a sequence of unambiguous
instructions for solving a problem, i.e., for
obtaining a required output for any legitimate
input in a finite amount of time.

METHODS OF SEQUENCE ALIGNMENT
• The methods of sequence alignment works on
nucleotides and protein sequences
• Required to perform both local and global
alignments
• 1. DOT PLOT Or DOT MATRIX
• 2. Dynamic programming
• 3. Heuristic Methods
• FASTA and BLAST

The dot plot
• The dotplot is a simple picture that gives an
overview of the similarities between two
sequences
• It is a graphical way of comparing two sequences
in a two dimensional matrix.
• Not practical when comparing large sequences

The dot plot
1. If a residue match is found a dot is placed
within the graph.
2. line linking the dots in diagonals indicates
sequence alignment
3. a self comparison would results in a main
diagonal for perfect matching
4. Parallel diagonals represent internal
repeats(repititive regions)

Dotplot showing identities between short name (DOROTHYHODGKIN)
and full name (DOROTHYCROWFOOTHODGKIN) of a famous protein
crystallographer

Dotplots showing identities between a repetitive
sequence and itself

The palindrome reveals itself as a stretch of matches perpendicular to the
main diagonal. This This is not just word play: regions in DNA recognized
by transcriptional regulators or restriction enzymes have
sequences related to palindromes, crossing from one strand to the other
Dotplot showing identities between the palindromic sequence MAX I
STAYAWAY AT SIX AM and itself

DOROTHY––––––––HODGKIN
DOROTHYCROWFOOTHODGKIN
Any path through the dotplot from upper left to lower right passes through a succession of cells,
each of which picks out a pair of positions, one from the row and one from the column, that are
matched in the alignment that corresponds to that path; or that indicates a gap in one of the
sequences. The path need not pass through filled-in points only. However, the more filled-in points
on the path, the more matching residues in the alignment.

bioinformatics lecture 2.pptx and computational Boilogygy

bioinformatics lecture 2.pptx and computational Boilogygy

Recommended

Recommended

More Related Content

Similar to bioinformatics lecture 2.pptx and computational Boilogygy

Similar to bioinformatics lecture 2.pptx and computational Boilogygy (20)

More from MUHAMMEDBAWAYUSUF

More from MUHAMMEDBAWAYUSUF (20)

Recently uploaded

Recently uploaded (20)

bioinformatics lecture 2.pptx and computational Boilogygy

Editor's Notes