DNA and Protein sequence
alignments:
Pairwise alignment
Dot Plots
Substitution Matrices (PAM,
BLOSUM)
Computer applications for
Biosciences and Bioinformatics
Module III b
Sequence alignment
To align and score a pair of sequences (DNA or
protein)
To find the correspondences between substrings in
the sequences such that the similarity score is
maximized
Why do alignment?
To find out homology: similarity due to descent
from a common ancestor
Often we can infer homology from similarity
Thus we can sometimes infer structure/function
from sequence similarity
Sequence analysis tools depending on
pairwise comparison
• Multiple alignments
• Profile and HMM making (used to search for
protein families and domains)
• 3D protein structure prediction
• Phylogenetic analysis
• Construction of certain substitution matrices
• Similarity searches in a database
Homology
Members of a family are called homologs or
homologous molecules.
Homologous sequences can be divided into two
groups
– orthologous sequences: sequences that differ
because they are found in different species (e.g.
human α-globin and mouse α-globin)
– paralogous sequences: sequences that differ
because of a gene duplication event (e.g. human
α-globin and human β-globin, various versions of
both)
Issues in Sequence Alignment
The sequences we are comparing probably differ
in length
There may be only a relatively small region in
the sequences that match
We want to allow partial matches (i.e. some
amino acid pairs are more substitutable than
others)
Variable length regions may have been
inserted/deleted from the common ancestral
sequence
Applications
Sequence alignment arises in many fields:
• Molecular biology
• Inexact text matching (e.g. spell checkers; web page search)
• Speech recognition
In general:
• The precise definition of what constitutes an alignment may
vary by field, and even within a field.
• Many different alignments of two sequences are possible, so to
select among them one requires an objective (score) function
on alignments.
• The number of possible alignments of two sequences grows
exponentially with the length of the sequences. Good
algorithms are required.
Important questions
Q. What do we want to align and how?
A: Two sequences (nucleotide or protein), through pairwise
alignment
Or: a query sequence against the sequences in a database, to find
similar sequences (a similarity search built on pairwise comparisons)
Q. How do we “score” an alignment?
• Simple scoring (match = 1, mismatch = 0)
• Dot plots (graphical representation)
• Substitution matrices (PAM and BLOSUM) [s(a,b)
indicates score of aligning character a with character b;
also accounts for relative substitutability of amino acid
pairs in the context of evolution]
• Gap penalty function: w(k) indicates cost of a gap of
length k
Q. How do we find the “best” alignment?
A: Alignment algorithms
An alignment program tries to find the best alignment between
two sequences given the scoring system.
Alignment types
Global: alignment between the complete sequence A and the
complete sequence B
Local: alignment between a sub-sequence of A and a sub-sequence
of B
Computer implementation (Algorithms)
Dynamic programming
• Global: Needleman-Wunsch
• Local: Smith-Waterman
Heuristic algorithms (faster but approximate)
• BLAST
• FASTA
Pairwise alignment
The alignment of two sequences (DNA or
protein) is a relatively straightforward
computational problem.
There are lots of possible alignments.
Two sequences can always be aligned.
Sequence alignments have to be scored.
Often there is more than one solution with
the same score.
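As a minimal sketch of the dynamic-programming idea behind Needleman-Wunsch (global) and Smith-Waterman (local), the function below fills a score table cell by cell. The function name align_score and the scoring values (match +1, mismatch -1, linear gap penalty -2) are illustrative assumptions, not values from these slides, and only the optimal score is returned, not the alignment itself.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Optimal alignment score by dynamic programming (linear gap penalty).

    local=False: global alignment in the style of Needleman-Wunsch.
    local=True:  local alignment in the style of Smith-Waterman.
    """
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalised.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(
                table[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                table[i - 1][j] + gap,       # gap in b
                table[i][j - 1] + gap,       # gap in a
            )
            if local:
                cell = max(cell, 0)          # a local alignment may restart anywhere
            table[i][j] = cell
            best = max(best, cell)
    return best if local else table[-1][-1]


print(align_score("ACGT", "ACT"))                      # global score
print(align_score("TTACGTT", "GGACGGG", local=True))   # local score
```

The table has (len(a)+1) x (len(b)+1) cells, so time and memory grow with the product of the sequence lengths; this is why heuristic programs are preferred for database searches.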
Sequence comparison through pairwise
alignments
Goal of pairwise comparison is to find conserved
regions (if any) between two sequences
Extrapolate information about our sequence
using the known characteristics of the other
sequence
Evolution of sequences
Sequences evolve through mutation and selection
[Selective pressure is different for each residue
position in a protein (i.e. conservation of active
site, structure, charge, etc.)]
Modular nature of proteins [Nature keeps re-using
domains]
Alignments try to tell the evolutionary story of the
proteins
Relationships
Example of alignment: textual view
Two similar regions of the Drosophila melanogaster
Slit and Notch proteins
Some Definitions
Identity
• Proportion of pairs of identical residues between two aligned sequences.
• Generally expressed as a percentage.
• This value strongly depends on how the two sequences are aligned.
Similarity
• Proportion of pairs of similar residues between two aligned sequences.
• Whether two residues are similar is determined by a substitution matrix.
• This value also depends strongly on how the two sequences are aligned,
as well as on the substitution matrix used.
Homology
• Two sequences are homologous if and only if they have a common
ancestor.
• There is no such thing as a level of homology! (It is either yes or no.)
Note: Homologous sequences do not necessarily serve the same function...
Nor are they always highly similar: structure may be conserved while
sequence is not
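A small sketch of how percent identity can be computed from an aligned pair, and of why it depends on the alignment. The function name percent_identity and the choice of denominator (columns without a gap) are assumptions; other tools divide by the full alignment length or by the shorter sequence length.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of identical residue pairs among the gap-free columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns where both sequences have a residue.
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b)
             if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    identical = sum(1 for x, y in pairs if x == y)
    return 100.0 * identical / len(pairs)


# The same two sequences, aligned in two different ways:
print(percent_identity("ACGT", "ACGT"))
print(percent_identity("ACGT-", "-ACGT"))
```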
Consider a set S (say, globins) and a test t that tries to detect
members of S (for example, through a pairwise comparison
with another globin).
True positive
• A protein is a true positive if it belongs to S and is detected
by t.
True negative
• A protein is a true negative if it does not belong to S and is
not detected by t.
False positive
• A protein is a false positive if it does not belong to S and is
(incorrectly) detected by t.
False negative
• A protein is a false negative if it belongs to S and is not
detected by t (but should be).
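The four outcomes above can be written as a small decision function. This is only a sketch: the function name classify and the example proteins with their membership and detection flags are hypothetical.

```python
from collections import Counter


def classify(belongs_to_s, detected_by_t):
    """Name the outcome of test t for one protein."""
    if belongs_to_s and detected_by_t:
        return "true positive"
    if not belongs_to_s and not detected_by_t:
        return "true negative"
    if not belongs_to_s and detected_by_t:
        return "false positive"
    return "false negative"


# Hypothetical results: (protein, is it a globin?, did the test report it?)
results = [
    ("protein_1", True, True),
    ("protein_2", True, False),
    ("protein_3", False, False),
    ("protein_4", False, True),
]
counts = Counter(classify(in_s, hit) for _, in_s, hit in results)
print(counts)
```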
Example
The set of all globins and a test to identify them
Consider:
A set S (say, globins: G)
A test t that tries to detect members of S (for
example, through a pairwise comparison with
another globin).
Concept of a sequence alignment
Pairwise Alignment:
Explicit mapping between the residues of 2
sequences
Tolerant to errors (mismatches, insertion /
deletions or indels)
Evaluation of the alignment in a biological context
(significance)
Number of alignments
There are many ways to align two sequences
Consider the sequence fragments below: a simple
alignment shows some conserved portions
Number of possible alignments for 2 sequences of length 1000 residues:
more than 10^600 gapped alignments
(Avogadro's number ≈ 10^24; estimated number of atoms in the universe ≈ 10^80)
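The size of this number can be checked with a short sketch. The slides do not say which counting convention they use, so two are shown as assumptions: a recurrence in which every alignment column is a residue pair, a gap in the first sequence, or a gap in the second sequence, and the simpler binomial count C(2n, n). The function name count_gapped_alignments is hypothetical.

```python
import math


def count_gapped_alignments(n, m):
    """Count alignments of sequences of length n and m with the recurrence
    f(i, j) = f(i-1, j) + f(i, j-1) + f(i-1, j-1), with f(i, 0) = f(0, j) = 1."""
    prev = [1] * (m + 1)
    for _ in range(n):
        cur = [1]
        for j in range(1, m + 1):
            cur.append(prev[j] + cur[j - 1] + prev[j - 1])
        prev = cur
    return prev[m]


print(count_gapped_alignments(2, 2))
# Number of decimal digits of each count for two sequences of length 1000:
print(len(str(count_gapped_alignments(1000, 1000))))
print(len(str(math.comb(2000, 1000))))
```

Under either convention the count for two 1000-residue sequences exceeds 10^600, so enumerating all alignments is not an option.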
What is a good alignment ?
We need a way to evaluate the biological meaning
of a given alignment
Intuitively we "know" that the following alignment:
We can express this notion more rigorously, by using a scoring
system
Scoring system
Simple alignment scores
A simple way (but not the best) to score an
alignment is to count 1 for each match and 0 for
each mismatch.
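A minimal sketch of this simple scoring scheme. The function name simple_score is an assumption, and so is the choice to score a column containing a gap ("-") like a mismatch.

```python
def simple_score(aligned_a, aligned_b, match=1, mismatch=0):
    """Sum 1 for each identical column and 0 for every other column."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    return sum(
        match if x == y and x != "-" else mismatch
        for x, y in zip(aligned_a, aligned_b)
    )


print(simple_score("GATTACA", "GACTATA"))
```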
Importance of the scoring system
Discrimination of significant biological alignments
Based on physico-chemical properties of amino-acids
Hydrophobicity, acid / base, steric properties, ...
Scoring system scales are arbitrary
Based on biological sequence information
Substitutions observed in structural or evolutionary alignments of well
studied protein families
Scoring systems have a probabilistic foundation
Substitution matrices
In proteins some mismatches are more acceptable than others
Substitution matrices give a score for each substitution of one amino acid
by another
Dot Plots or Diagonal plots
Produces a graphical representation of similarity regions.
Dot Plots
A dot plot gives an overview of all possible alignments
In a dot plot, each diagonal corresponds to a
possible (ungapped) alignment
Insertions and deletions in a dot plot
Concept of a dot plot
• Produces a graphical representation of similarity regions.
• The horizontal and vertical dimensions correspond to the compared sequences.
• A region of similarity stands out as a diagonal
A Simple example
A dot is placed at each position where two
residues match.
The colour of the dot can be chosen
according to the substitution value in the
substitution matrix
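A text-only sketch of the simplest dot plot, with a dot wherever two residues are identical. The function names dot_plot and render are hypothetical, and colouring by substitution value is left out.

```python
def dot_plot(seq_x, seq_y):
    """Boolean matrix: rows follow seq_y, columns follow seq_x."""
    return [[x == y for x in seq_x] for y in seq_y]


def render(seq_x, seq_y):
    """Draw the matrix as text, '*' for a match and '.' otherwise."""
    lines = ["  " + " ".join(seq_x)]
    for y, row in zip(seq_y, dot_plot(seq_x, seq_y)):
        lines.append(y + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("GATTACA", "GATTACA"))
```

Comparing a sequence with itself gives an unbroken main diagonal; the scattered off-diagonal dots come from chance identities and are the noise that real dot plot programs try to suppress.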
Limitations of a dot plot
• It is a visual aid.
• It does not provide an alignment.
• A simple residue-by-residue dot plot often contains too
much noise to be useful
Protein Scoring Systems
Scoring matrices reflect:
• number of mutations needed to convert one amino acid to another
• chemical similarity
• observed mutation frequencies
• the probability of occurrence of each amino acid
Substitution Matrices (Log odds matrices)
Two popular sets of matrices for protein
sequences
PAM matrices [Dayhoff et al., 1978]
BLOSUM matrices [Henikoff & Henikoff, 1992]
Both try to capture the relative substitutability of
amino acid pairs in the context of evolution
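A sketch of the log-odds idea: the score s(a,b) compares how often a and b are seen aligned in trusted alignments (q_ab) with how often they would pair by chance (p_a times p_b). The function name log_odds_score, the scale factor of 2 (half-bit units) and all frequencies below are illustrative assumptions, not values from the PAM or BLOSUM papers.

```python
import math


def log_odds_score(q_ab, p_a, p_b, scale=2.0):
    """s(a, b) = scale * log2(q_ab / (p_a * p_b)), rounded to an integer."""
    return round(scale * math.log2(q_ab / (p_a * p_b)))


# Made-up frequencies for illustration only.
print(log_odds_score(0.04, 0.1, 0.1))    # pair seen more often than chance
print(log_odds_score(0.01, 0.1, 0.1))    # pair seen as often as chance
print(log_odds_score(0.0025, 0.1, 0.1))  # pair seen less often than chance
```

A positive score means the pair occurs more often in related sequences than expected by chance, and a negative score means it occurs less often.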
PAM series (Dayhoff M., 1968, 1972, 1978)
PAM (Point Accepted Mutation, also expanded as Percent Accepted
Mutation) matrices: family of matrices PAM 80, PAM 120, PAM 250
A unit introduced by Dayhoff et al. to quantify the amount of
evolutionary change in a protein sequence.
The number with a PAM matrix represents the evolutionary
distance between the sequences on which the matrix is
based
Greater numbers denote greater distances
The PAM-1 matrix reflects an average change of 1% of all
amino acid positions.
PAM250 = 250 mutations per 100 residues.
Greater numbers mean bigger evolutionary distance
Percent Accepted Mutation.
A PAM(x) substitution matrix is a look-up table in
which scores for each amino acid substitution
have been calculated based on the frequency of
that substitution in closely related proteins that
have experienced a certain amount (x) of
evolutionary divergence.
Based on 1572 accepted point mutations observed in 71 groups of closely related proteins
Old standard matrix: PAM250
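PAM matrices for larger distances are conventionally extrapolated by multiplying the PAM-1 mutation probability matrix by itself; these slides do not spell this step out, so the sketch below is supplementary. The 2-letter alphabet and the matrix toy_pam1 are made up for illustration, and the helper names mat_mul and mat_pow are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    base = m
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


# Toy PAM-1: each of the two letters changes with probability 1%.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = mat_pow(toy_pam1, 250)
print(toy_pam250[0])
```

Because repeated and back mutations hit the same position, 250 accepted mutations per 100 residues does not mean that every residue differs; in the real PAM250 model a fraction of positions is still expected to be identical.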
Substitution matrix (PAM 250)
Alignment score
BLOSUM matrices
Different BLOSUMn matrices are calculated
independently from BLOCKS (ungapped local
alignments)
BLOSUMn is built from BLOCKS in which sequences that share
at least n percent identity are clustered and counted as one
BLOSUM62 represents closer sequences than
BLOSUM45
The number in the matrix name (e.g. 62 in
BLOSUM62) refers to the percentage of sequence
identity used to build the matrix.
Greater numbers mean smaller evolutionary
distance.
BLOSUM series (Henikoff S. & Henikoff
JG., PNAS, 1992)
Blocks Substitution Matrix.
A substitution matrix in which scores for each position are
derived from observations of the frequencies of substitutions
in blocks of local alignments in related proteins.
Each matrix is tailored to a particular evolutionary distance.
In the BLOSUM62 matrix, for example, the alignment from which
scores were derived was created using sequences sharing no
more than 62% identity.
Sequences more identical than 62% are represented by a single
sequence in the alignment so as to avoid over-weighting
closely related family members.
Based on alignments in the BLOCKS database.
Standard matrix: BLOSUM62
TIPS on choosing a scoring matrix
Generally, BLOSUM matrices perform better
than PAM matrices for local similarity searches
(Henikoff & Henikoff, 1993).
When comparing closely related proteins one
should use lower PAM or higher BLOSUM
matrices; for distantly related proteins, higher
PAM or lower BLOSUM matrices.
For database searching the commonly used
matrix is BLOSUM62.
Limitations of Substitution Matrices
Substitution matrices do not take into account
long range interactions between residues.
They assume that identical residues are equal
(whereas in real life a residue at the active site
has other evolutionary constraints than the
same residue outside of the active site)
They assume evolution rate to be constant.
Gaps
Insertions or deletions
Proteins often contain regions where residues have
been inserted or deleted during evolution
There are constraints on where these insertions and
deletions can happen (between structural or
functional elements like: alpha helices, active site,
etc.)
Why Gap Penalties?
The optimal alignment of two similar sequences is
usually that which
maximizes the number of matches and
minimizes the number of gaps.
There is a tradeoff between these two - adding gaps
reduces mismatches
Permitting the insertion of arbitrarily many gaps can
lead to high scoring alignments of non-
homologous sequences.
Penalizing gaps forces alignments to have relatively
few gaps.
Gap Penalties
How to balance gaps with mismatches?
Gaps must get a steep penalty, or else you’ll end
up with nonsense alignments.
In real sequences, multi-base (or multi-residue)
gaps are quite common, because a single genetic
insertion/deletion event can span several positions
“Affine” gap penalties give a big penalty for each
new gap, but a much smaller “gap extension”
penalty.
Gap opening and extension penalties
Costs of gaps in alignments
We want to simulate as closely as possible the evolutionary
mechanisms involved in gap occurrence.
Example
Two alignments with identical number of gaps but very different
gap distribution.
We may prefer one large gap to several small ones (e.g. poorly
conserved loops between well-conserved helices)
Gap opening penalty
Counted each time a gap is opened in an alignment (some
programs include the first extension into this penalty)
Gap extension penalty
Counted for each extension of a gap in an alignment
Example
With a match score of 1 and a mismatch score of
0
With an opening penalty of 10 and extension
penalty of 1, we have the following score:
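A minimal sketch of how such a score can be computed for an alignment that is already given. It assumes the convention w(k) = opening + extension x (k - 1), i.e. the opening penalty covers the first gap position, which is one of the variants mentioned above; the function name affine_score and the example alignments are hypothetical.

```python
import re


def affine_score(aligned_a, aligned_b, match=1, mismatch=0,
                 gap_open=10, gap_extend=1):
    """Score an existing alignment with affine gap penalties."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    # Residue pairs: columns without a gap character.
    for x, y in zip(aligned_a, aligned_b):
        if x != "-" and y != "-":
            score += match if x == y else mismatch
    # Gaps: every run of '-' pays one opening plus extensions.
    for row in (aligned_a, aligned_b):
        for gap in re.findall(r"-+", row):
            score -= gap_open + gap_extend * (len(gap) - 1)
    return score


# Same number of gap characters, different gap distribution:
print(affine_score("ACGT---ACGT", "ACGTTTTACGT"))  # one gap of length 3
print(affine_score("ACGT-A-C-GT", "ACGTTTTACGT"))  # three gaps of length 1
```

With these penalties the single long gap scores far better than three separate gaps, which matches the preference stated above.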
Statistical evaluation of results
Alignments are evaluated according to their score
• Raw score
It is the sum of the amino acid substitution scores and gap penalties (gap
opening and gap extension)
Depends on the scoring system (substitution matrix, etc.)
Different alignments should not be compared based only on the raw score
It is possible that a "bad" long alignment gets a better raw score than a very
good short alignment.
We need a normalised score to compare alignments
We need to evaluate the biological meaning of the score (p-value, e-value).
• Normalised score
Is independent of the scoring system
Allows the comparison of different alignments
Units: expressed in bits
Distribution of alignment scores - Extreme Value
Distribution
Random sequences and alignment scores
Sequence alignment scores between random
sequences are distributed following an extreme
value distribution (EVD).
Extreme Value Distribution
High scoring random alignments have a low probability.
The EVD allows us to compute the probability with which
our biological alignment could be due to randomness
(to chance).
Caveat: finding the threshold of significant alignments.
Statistics derived from the scores
p-value
Probability that an alignment with this score occurs by
chance in a database of this size
The closer the p-value is to 0, the more significant the alignment
e-value
Number of matches with this score one can expect to find by
chance in a database of this size
The closer the e-value is to 0, the more significant the alignment
Relationship between e-value and p-value:
In a database containing N sequences, e = p x N
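A sketch of how these statistics relate, using the Karlin-Altschul formulas on which BLAST-style statistics are based: the bit score is S' = (lambda x S - ln K) / ln 2 and the expected number of chance hits is E = m x n x 2^(-S'), with m and n the query and database lengths. The parameters lambda and K depend on the scoring system; the values used below are placeholders for illustration, and all function names are hypothetical.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalised score in bits: (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expected_hits(bits, query_len, db_len):
    """E-value: expected number of chance hits with at least this bit score."""
    return query_len * db_len * 2.0 ** (-bits)


def p_from_e(e_value):
    """Probability of at least one chance hit, given the expected number."""
    return 1.0 - math.exp(-e_value)


def e_from_p(p_value, n_sequences):
    """The relationship stated on the slide: e = p x N."""
    return p_value * n_sequences


# Placeholder parameters, for illustration only.
bits = bit_score(raw_score=100, lam=0.3, k=0.1)
e = expected_hits(bits, query_len=300, db_len=10_000_000)
print(bits, e, p_from_e(e))
```

For small values p_from_e(e) is almost equal to e, which is why e-values and p-values are often used interchangeably for highly significant hits.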

4. sequence alignment.pptx

  • 1. DNA and Protein sequence alignments: Pairwise alignment Dot Plots Substitution Matrices (PAM, BLOSUM) Computer applications for Biosciences and Bioinformatics Module III b
  • 2. Sequence alignment To align and score a pair of sequences (DNA or protein) To find the correspondences between substrings in the sequences such that the similarity score is maximized Why do alignment? To find out homology: similarity due to descent from a common ancestor Often we can infer homology from similarity Thus we can sometimes infer structure/function from sequence similarity
  • 3. Sequence analysis tools depending on pairwise comparison • Multiple alignments • Profile and HMM making (used to search for protein families and domains) • 3D protein structure prediction • Phylogenetic analysis • Construction of certain substitution matrices • Similarity searches in a database
  • 4. Homology Members of a family are called homologs or homologous molecules. Homologous sequences can be divided into two groups – orthologous sequences: sequences that differ because they are found in different species (e.g. human α-globin and mouse α-globin) – paralogous sequences: sequences that differ because of a gene duplication event (e.g. human α-globin and human β-globin, various versions of both)
  • 5. Issues in Sequence Alignment The sequences we are comparing probably differ in length There may be only a relatively small region in the sequences that match We want to allow partial matches (i.e. some amino acid pairs are more substitutable than others) Variable length regions may have been inserted/deleted from the common ancestral sequence
  • 6. Applications Sequence alignment arises in many fields: • Molecular biology • Inexact text matching (e.g. spell checkers; web page search) • Speech recognition In general: • The precise definition of what constitutes an alignment may vary by field, and even within a field. • Many different alignments of two sequences are possible, so to select among them one requires an objective (score) function on alignments. • The number of possible alignments of two sequences grows exponentially with the length of the sequences. Good algorithms are required.
  • 7.
  • 8. Important questions Q. What do we want to align and how? A: Two sequences (nucleotide or protein), through pairwise alignment; or a query sequence against every sequence in a database, to find similar sequences (database similarity search). Q. How do we “score” an alignment? • Simple scoring (match = 1, mismatch = 0) • Dot plots (graphical representation) • Substitution matrices (PAM and BLOSUM): s(a,b) is the score of aligning character a with character b, and accounts for the relative substitutability of amino acid pairs in the context of evolution • Gap penalty function: w(k) is the cost of a gap of length k
  • 9. Q. How do we find the “best” alignment? A: Alignment algorithms. An alignment program tries to find the best alignment between two sequences given the scoring system. Alignment types • Global: alignment between the complete sequence A and the complete sequence B • Local: alignment between a subsequence of A and a subsequence of B Computer implementation (algorithms) Dynamic programming • Global: Needleman-Wunsch • Local: Smith-Waterman Heuristic algorithms (faster but approximate) • BLAST • FASTA
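The dynamic programming idea behind the global (Needleman-Wunsch) case can be sketched as follows. This is a minimal illustration, not the slides' own implementation: it assumes a linear gap penalty and the scores match = 1, mismatch = -1, gap = -2, all of which are chosen here only for the example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap penalty (illustrative scores)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,    # gap in b
                              score[i][j - 1] + gap)    # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

Smith-Waterman differs mainly in clamping each cell at zero and tracing back from the highest-scoring cell, which is what makes the alignment local.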
  • 10. Pairwise alignment The alignment of two sequences (DNA or protein) is a relatively straightforward computational problem. There are lots of possible alignments. Two sequences can always be aligned. Sequence alignments have to be scored. Often there is more than one solution with the same score.
  • 11. Sequence comparison through pairwise alignments Goal of pairwise comparison is to find conserved regions (if any) between two sequences Extrapolate information about our sequence using the known characteristics of the other sequence
  • 12. Evolution of sequences Sequences evolve through mutation and selection [Selective pressure is different for each residue position in a protein (i.e. conservation of active site, structure, charge, etc.)] Modular nature of proteins [Nature keeps re-using domains] Alignments try to tell the evolutionary story of the proteins Relationships
  • 13. Example of Alignment-textual view Two similar regions of the Drosophila melanogaster Slit and Notch proteins
  • 14. Some Definitions Identity • Proportion of pairs of identical residues between two aligned sequences. • Generally expressed as a percentage. • This value strongly depends on how the two sequences are aligned. Similarity • Proportion of pairs of similar residues between two aligned sequences. • Whether two residues are similar is determined by a substitution matrix. • This value also depends strongly on how the two sequences are aligned, as well as on the substitution matrix used. Homology • Two sequences are homologous if and only if they have a common ancestor. • There is no such thing as a level of homology! (It's either yes or no) Note: Homologous sequences do not necessarily serve the same function, nor are they always highly similar: structure may be conserved while sequence is not.
  • 15. Consider a set S (say, globins) and a test t that tries to detect members of S (for example, through a pairwise comparison with another globin). True positive • A protein is a true positive if it belongs to S and is detected by t. True negative • A protein is a true negative if it does not belong to S and is not detected by t. False positive • A protein is a false positive if it does not belong to S and is (incorrectly) detected by t. False negative • A protein is a false negative if it belongs to S and is not detected by t (but should be).
  • 16. Example The set of all globins and a test to identify them Consider: A set S (say, globins: G) A test t that tries to detect members of S (for example, through a pairwise comparison with another globin).
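The four outcomes defined above can be tallied mechanically once the set S and the test results are known. The sketch below assumes both are available as Python sets; the protein names are hypothetical placeholders, not data from the slides.

```python
def classify(universe, members, detected):
    """Count TP, TN, FP and FN for a test over a collection of proteins.

    universe: all proteins examined
    members:  proteins that truly belong to the set S
    detected: proteins flagged by the test t
    """
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for protein in universe:
        in_s = protein in members
        hit = protein in detected
        if in_s and hit:
            counts["TP"] += 1      # belongs to S and is detected
        elif not in_s and not hit:
            counts["TN"] += 1      # not in S and not detected
        elif not in_s and hit:
            counts["FP"] += 1      # not in S but (incorrectly) detected
        else:
            counts["FN"] += 1      # in S but missed by the test
    return counts


if __name__ == "__main__":
    # Hypothetical example data
    universe = {"globin_1", "globin_2", "globin_3", "kinase_1", "kinase_2"}
    members = {"globin_1", "globin_2", "globin_3"}
    detected = {"globin_1", "globin_2", "kinase_1"}
    print(classify(universe, members, detected))
```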
  • 17. Concept of a sequence alignment Pairwise Alignment: Explicit mapping between the residues of 2 sequences Tolerant to errors (mismatches, insertions / deletions or indels) Evaluation of the alignment in a biological context (significance)
  • 18. Number of alignments There are many ways to align two sequences Consider the sequence fragments below: a simple alignment shows some conserved portions Number of possible alignments for 2 sequences of length 1000 residues: more than 10^600 gapped alignments (Avogadro's number is about 10^24; the estimated number of atoms in the universe is about 10^80)
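One way to see how fast the number of gapped alignments grows is to count them with a simple recurrence. The sketch below assumes that every distinct sequence of alignment columns (match/mismatch, gap in the first sequence, gap in the second) counts as a different alignment; other counting conventions give different but still astronomically large totals.

```python
def count_alignments(n, m):
    """Number of gapped alignments of sequences of length n and m.

    Each alignment ends in one of three column types, so
    f(i, j) = f(i-1, j) + f(i, j-1) + f(i-1, j-1), with f(i, 0) = f(0, j) = 1.
    """
    prev = [1] * (m + 1)          # row for i = 0
    for _ in range(n):
        cur = [1]                 # f(i, 0) = 1
        for j in range(1, m + 1):
            cur.append(prev[j] + cur[j - 1] + prev[j - 1])
        prev = cur
    return prev[m]


if __name__ == "__main__":
    for length in (1, 2, 3, 10):
        print(length, count_alignments(length, length))
    # Number of decimal digits for two sequences of 1000 residues
    print(len(str(count_alignments(1000, 1000))))
```

Python integers have arbitrary precision, so the count for two 1000-residue sequences can be computed exactly, even though no program could enumerate the alignments themselves.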
  • 19. What is a good alignment? We need a way to evaluate the biological meaning of a given alignment. Intuitively we "know" when one alignment is better than another. We can express this notion more rigorously by using a scoring system.
  • 20. Scoring system Simple alignment scores A simple way (but not the best) to score an alignment is to count 1 for each match and 0 for each mismatch.
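As a small illustration of this simple scheme, the sketch below scores two already-aligned strings of equal length, counting 1 per identical pair and 0 otherwise. Treating a gap character ('-') as a non-match and the example sequences are assumptions made here for the illustration.

```python
def simple_score(x, y):
    """Score an alignment with match = 1 and mismatch = 0."""
    if len(x) != len(y):
        raise ValueError("aligned sequences must have the same length")
    # A column scores 1 only when both residues are identical and not gaps
    return sum(1 for a, b in zip(x, y) if a == b and a != "-")


def percent_identity(x, y):
    """Identity as a percentage of the aligned columns."""
    return 100.0 * simple_score(x, y) / len(x)


if __name__ == "__main__":
    print(simple_score("GATTACA", "GACTATA"))
    print(percent_identity("GATTACA", "GACTATA"))
```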
  • 21. Importance of the scoring system: discrimination of significant biological alignments. Scores can be based on physico-chemical properties of amino acids (hydrophobicity, acid/base, steric properties, ...), in which case the scoring scales are arbitrary; or on biological sequence information (substitutions observed in structural or evolutionary alignments of well-studied protein families), in which case the scoring system has a probabilistic foundation. Substitution matrices: in proteins some mismatches are more acceptable than others; substitution matrices give a score for each substitution of one amino acid by another. Dot plots or diagonal plots: produce a graphical representation of similarity regions.
  • 22. Dot Plots A dot plot gives an overview of all possible alignments
  • 23. In a dot plot, each diagonal corresponds to a possible (ungapped) alignment
  • 24. Insertions and deletions in a dot plot
  • 25. Concept of a dot plot • Produces a graphical representation of similarity regions. • The horizontal and vertical dimensions correspond to the compared sequences. • A region of similarity stands out as a diagonal A Simple example A dot is placed at each position where two residues match. The colour of the dot can be chosen according to the substitution value in the substitution matrix
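  • 26. Limitations of a dot plot • It is a visual aid. • It does not provide an alignment. • The simple method (one dot per matching residue pair) produces dot plots with too much noise to be useful, especially for long sequences, unless the matches are filtered, for example with a sliding window and a threshold.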
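The dot plot idea, and one common way of reducing its noise, can be sketched in a few lines. The function below is an illustrative text-mode version: with window = 1 it places a dot at every matching residue pair, and with a larger window it marks a position only when at least min_matches residues agree along the diagonal. The parameter names and the '*' / '.' rendering are choices made for this example.

```python
def dot_plot(seq_x, seq_y, window=1, min_matches=1):
    """Return the dot plot as a list of strings, one row per position in seq_y.

    seq_x runs along the horizontal axis and seq_y along the vertical axis.
    A cell is marked when at least min_matches of the window residues
    starting at that cell match along the diagonal.
    """
    rows = []
    for i in range(len(seq_y) - window + 1):
        row = []
        for j in range(len(seq_x) - window + 1):
            matches = sum(seq_y[i + k] == seq_x[j + k] for k in range(window))
            row.append("*" if matches >= min_matches else ".")
        rows.append("".join(row))
    return rows


if __name__ == "__main__":
    # Unfiltered plot: every single-residue match is shown
    print("\n".join(dot_plot("GATTACA", "GATCACA")))
    print()
    # Filtered plot: only runs of 3 consecutive matches are kept
    print("\n".join(dot_plot("GATTACA", "GATCACA", window=3, min_matches=3)))
```

Regions of similarity still show up as diagonals after filtering, while isolated chance matches disappear.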
  • 27. Protein Scoring Systems Scoring matrices reflect: • the number of mutations needed to convert one amino acid to another • chemical similarity • observed mutation frequencies • the probability of occurrence of each amino acid
  • 28. Substitution Matrices (Log odds matrices) Two popular sets of matrices for protein sequences PAM matrices [Dayhoff et al., 1978] BLOSUM matrices [Henikoff & Henikoff, 1992] Both try to capture the relative substitutability of amino acid pairs in the context of evolution
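The "log odds" in the slide title refers to scoring a pair of residues by the logarithm of how much more often the pair is observed in alignments of related proteins than expected by chance. A minimal sketch, assuming the pair frequency q_ab and the background frequencies p_a and p_b are already known (the numbers used below are made up for illustration, and the scale factor is a free parameter):

```python
import math


def log_odds(q_ab, p_a, p_b, scale=1.0):
    """Log-odds score s(a, b) = scale * log2(q_ab / (p_a * p_b)).

    q_ab: observed frequency of a aligned with b in related sequences
    p_a, p_b: background frequencies of a and b
    A positive score means the pair occurs more often than chance predicts.
    """
    return scale * math.log2(q_ab / (p_a * p_b))


if __name__ == "__main__":
    # Hypothetical frequencies, not taken from any real matrix
    print(round(log_odds(0.02, 0.1, 0.1), 3))   # pair seen twice as often as chance
    print(round(log_odds(0.005, 0.1, 0.1), 3))  # pair seen half as often as chance
```

Published matrices multiply these values by a scale factor and round them to integers, which is why matrix entries are small whole numbers.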
  • 29. PAM series (Dayhoff M., 1968, 1972, 1978) PAM (Point Accepted Mutation, often read as Percent Accepted Mutation) matrices: a family of matrices such as PAM 80, PAM 120 and PAM 250. PAM is a unit introduced by Dayhoff et al. to quantify the amount of evolutionary change in a protein sequence. The number attached to a PAM matrix represents the evolutionary distance between the sequences on which the matrix is based; greater numbers denote greater distances. The PAM-1 matrix reflects an average change of 1% of all amino acid positions. PAM250 = 250 mutations per 100 residues.
  • 30. Percent Accepted Mutation. A PAM(x) substitution matrix is a look-up table in which scores for each amino acid substitution have been calculated based on the frequency of that substitution in closely related proteins that have experienced a certain amount (x) of evolutionary divergence. Based on 1572 accepted point mutations observed in 71 groups of closely related proteins. Old standard matrix: PAM250
  • 33. BLOSUM matrices Different BLOSUMn matrices are calculated independently from BLOCKS (ungapped local alignments). BLOSUMn is built from BLOCKS in which sequences sharing at least n percent identity are clustered and counted as a single sequence. BLOSUM62 represents closer sequences than BLOSUM45. The number in the matrix name (e.g. 62 in BLOSUM62) refers to the percentage of sequence identity used to build the matrix. Greater numbers mean smaller evolutionary distance.
  • 34. BLOSUM series (Henikoff S. & Henikoff JG., PNAS, 1992) Blocks Substitution Matrix. A substitution matrix in which scores for each position are derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins. Each matrix is tailored to a particular evolutionary distance. In the BLOSUM62 matrix, for example, the alignment from which scores were derived was created using sequences sharing no more than 62% identity. Sequences more identical than 62% are represented by a single sequence in the alignment so as to avoid over-weighting closely related family members. Based on alignments in the BLOCKS database. Standard matrix: BLOSUM62
  • 35.
  • 36.
  • 37. TIPS on choosing a scoring matrix Generally, BLOSUM matrices perform better than PAM matrices for local similarity searches (Henikoff & Henikoff, 1993). When comparing closely related proteins one should use lower PAM or higher BLOSUM matrices, for distantly related proteins higher PAM or lower BLOSUM matrices. For database searching the commonly used matrix is BLOSUM62.
  • 38. Limitations of Substitution Matrices Substitution matrices do not take into account long range interactions between residues. They assume that identical residues are equal (whereas in real life a residue at the active site has other evolutionary constraints than the same residue outside of the active site) They assume evolution rate to be constant.
  • 39. Gaps Insertions or deletions Proteins often contain regions where residues have been inserted or deleted during evolution There are constraints on where these insertions and deletions can happen (between structural or functional elements like: alpha helices, active site, etc.)
  • 40. Why Gap Penalties? The optimal alignment of two similar sequences is usually that which maximizes the number of matches and minimizes the number of gaps. There is a tradeoff between these two: adding gaps reduces mismatches. Permitting the insertion of arbitrarily many gaps can lead to high scoring alignments of non-homologous sequences. Penalizing gaps forces alignments to have relatively few gaps.
  • 41. Gap Penalties How to balance gaps with mismatches? Gaps must get a steep penalty, or else you’ll end up with nonsense alignments. In real sequences, multi-base (or multi-residue) gaps are quite common genetic insertion/deletion events. “Affine” gap penalties give a big penalty for each new gap, but a much smaller “gap extension” penalty.
  • 42. Gap opening and extension penalties Costs of gaps in alignments We want to simulate as closely as possible the evolutionary mechanisms involved in gap occurrence. Example Two alignments with identical number of gaps but very different gap distribution. We may prefer one large gap to several small ones (e.g. poorly conserved loops between well-conserved helices) Gap opening penalty Counted each time a gap is opened in an alignment (some programs include the first extension into this penalty) Gap extension penalty Counted for each extension of a gap in an alignment
  • 43. Example With a match score of 1 and a mismatch score of 0 With an opening penalty of 10 and extension penalty of 1, we have the following score:
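The effect of these parameters can be checked with a short script. The sketch below scores an existing gapped alignment with match = 1, mismatch = 0, opening penalty 10 and extension penalty 1. It assumes the convention in which the opening penalty covers the first gap position, so a gap of length k costs 10 + (k - 1); programs that charge the first extension separately would give slightly different totals. The two example alignments are invented for illustration.

```python
def affine_score(x, y, match=1, mismatch=0, gap_open=10, gap_extend=1):
    """Score a gapped alignment (two equal-length strings using '-' for gaps)."""
    if len(x) != len(y):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    prev_gap = None  # which row held a gap in the previous column, if any
    for a, b in zip(x, y):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a == "-" or b == "-":
            row = "x" if a == "-" else "y"
            # Continuing a gap in the same row is an extension; otherwise open a new gap
            score -= gap_extend if row == prev_gap else gap_open
            prev_gap = row
        else:
            score += match if a == b else mismatch
            prev_gap = None
    return score


if __name__ == "__main__":
    # Same number of matches and gap characters, different gap distribution
    print(affine_score("GATTACA", "GA---CA"))  # one gap of length 3
    print(affine_score("GATTACA", "G-T-A-A"))  # three gaps of length 1
```

Both alignments contain four matches and three gap characters, but the single long gap is penalized far less than three separate gaps, which is the behaviour the affine scheme is designed to produce.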
  • 44. Statistical evaluation of results Alignments are evaluated according to their score • Raw score It is the sum of the amino acid substitution scores and gap penalties (gap opening and gap extension) Depends on the scoring system (substitution matrix, etc.) Different alignments should not be compared based only on the raw score It is possible that a "bad" long alignment gets a better raw score than a very good short alignment. We need a normalised score to compare alignments We need to evaluate the biological meaning of the score (p-value, e-value). • Normalised score Is independent of the scoring system Allows the comparison of different alignments Units: expressed in bits
  • 45. Distribution of alignment scores - Extreme Value Distribution Random sequences and alignment scores Sequence alignment scores between random sequences are distributed following an extreme value distribution (EVD).
  • 46. Extreme Value Distribution High scoring random alignments have a low probability. The EVD allows us to compute the probability with which our biological alignment could be due to randomness (to chance). Caveat: finding the threshold of significant alignments.
  • 47. Statistics derived from the scores p-value: probability that an alignment with this score occurs by chance in a database of this size. The closer the p-value is to 0, the better the alignment. e-value: number of matches with this score one can expect to find by chance in a database of this size. The closer the e-value is to 0, the better the alignment. Relationship between e-value and p-value: in a database containing N sequences, e = p × N, where p is taken as the chance probability for a single pairwise comparison.
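The normalised (bit) score, e-value and p-value can be connected with the Karlin-Altschul formulas used by BLAST-style programs: S' = (λS - ln K) / ln 2, E = m · n · 2^(-S') and P = 1 - exp(-E). The sketch below applies them directly. The values of λ and K depend on the scoring system, and the ones used in the example, like the raw score and the query and database lengths, are placeholders chosen for illustration rather than values from the slides.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalised score in bits: S' = (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, query_len, db_len):
    """Expected number of chance hits with at least this bit score."""
    return query_len * db_len * 2.0 ** (-bits)


def p_value(e):
    """Probability of at least one chance hit: P = 1 - exp(-E)."""
    return -math.expm1(-e)


if __name__ == "__main__":
    # Placeholder parameters, for illustration only
    lam, k = 0.267, 0.041
    bits = bit_score(80, lam, k)
    e = e_value(bits, query_len=300, db_len=1_000_000)
    print(round(bits, 1), e, p_value(e))
```

For small e-values the p-value and e-value are nearly equal, which is why the two are often used interchangeably when judging strong hits.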