DNA and Protein sequence
alignments:
Pairwise alignment
Dot Plots
Substitution Matrices (PAM,
BLOSUM)
Computer applications for
Biosciences and Bioinformatics
Module III b
Sequence alignment
To align and score a pair of sequences (DNA or
protein)
To find the correspondences between substrings in
the sequences such that the similarity score is
maximized
Why do alignment?
To find out homology: similarity due to descent
from a common ancestor
Often we can infer homology from similarity
Thus we can sometimes infer structure/function
from sequence similarity
Sequence analysis tools that depend on
pairwise comparison
• Multiple alignments
• Profile and HMM making (used to search for
protein families and domains)
• 3D protein structure prediction
• Phylogenetic analysis
• Construction of certain substitution matrices
• Similarity searches in a database
Homology
Members of a family are called homologs or
homologous molecules.
Homologous sequences can be divided into two
groups
– orthologous sequences: sequences that differ
because they are found in different species (e.g.
human α-globin and mouse α-globin)
– paralogous sequences: sequences that differ
because of a gene duplication event (e.g. human
α-globin and human β-globin, various versions of
both)
Issues in Sequence Alignment
The sequences we are comparing probably differ
in length
There may be only a relatively small region in
the sequences that match
We want to allow partial matches (i.e. some
amino acid pairs are more substitutable than
others)
Variable length regions may have been
inserted/deleted from the common ancestral
sequence
Applications
Sequence alignment arises in many fields:
• Molecular biology
• Inexact text matching (e.g. spell checkers; web page search)
• Speech recognition
In general:
• The precise definition of what constitutes an alignment may
vary by field, and even within a field.
• Many different alignments of two sequences are possible, so to
select among them one requires an objective (score) function
on alignments.
• The number of possible alignments of two sequences grows
exponentially with the length of the sequences. Good
algorithms are required.
Important questions
Q. What do we want to align and how?
A: Two sequences (nucleotide or protein) through pairwise
alignment
Or: a query sequence against the sequences in a database, through a
database similarity search (a series of pairwise comparisons)
Q. How do we “score” an alignment?
• Simple scoring (match = 1, mismatch = 0)
• Dot plots (graphical representation)
• Substitution matrices (PAM and BLOSUM) [s(a,b)
indicates score of aligning character a with character b;
also accounts for relative substitutability of amino acid
pairs in the context of evolution]
• Gap penalty function: w(k) indicates cost of a gap of
length k
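The s(a,b) and w(k) notation above can be made concrete with a small sketch. It assumes the simple match/mismatch scoring from the list and a hypothetical linear gap penalty; the function names and the penalty value are illustrative choices, not taken from the slides.

```python
from itertools import groupby


def simple_score(a, b):
    """Simple scoring: match = 1, mismatch = 0."""
    return 1 if a == b else 0


def linear_gap(k, per_position=2):
    """Hypothetical gap penalty w(k): cost grows linearly with gap length k."""
    return per_position * k


def alignment_score(row1, row2, s=simple_score, w=linear_gap, gap="-"):
    """Sum s(a, b) over aligned residue pairs, minus w(k) for every gap run."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    total = 0
    # Substitution part: every column that holds two residues.
    for a, b in zip(row1, row2):
        if a != gap and b != gap:
            total += s(a, b)
    # Gap part: each maximal run of gap characters in a row is one gap of length k.
    for row in (row1, row2):
        for is_gap, run in groupby(row, key=lambda c: c == gap):
            if is_gap:
                total -= w(len(list(run)))
    return total


# Five matching columns (5) minus one gap of length 2 (w(2) = 4) gives 1.
print(alignment_score("GATTACA", "GA--ACA"))
```

Passing a different s or w (for example a substitution matrix lookup) changes the scoring system without changing the summation.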
Q. How do we find the “best” alignment?
A: Alignment algorithms
An alignment program tries to find the best alignment between
two sequences given the scoring system.
Alignment types
Global: alignment between the complete sequence A and the
complete sequence B
Local: alignment between a sub-sequence of A and a sub-sequence
of B
Computer implementation (Algorithms)
Dynamic programming
• Global: Needleman-Wunsch
• Local: Smith-Waterman
Heuristic algorithms (faster but approximate)
• BLAST
• FASTA
Pairwise alignment
The alignment of two sequences (DNA or
protein) is a relatively straightforward
computational problem.
There are lots of possible alignments.
Two sequences can always be aligned.
Sequence alignments have to be scored.
Often there is more than one solution with
the same score.
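As a minimal sketch of how a pairwise global alignment can be computed, the following implements Needleman-Wunsch dynamic programming. The scoring values (match = 1, mismatch = -1, linear gap = -2) are assumptions chosen for illustration, and the traceback returns only one of the possibly several optimal alignments.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming with a linear gap penalty."""
    n, m = len(x), len(y)
    # F[i][j] = best score for aligning the prefixes x[:i] and y[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,  # align x[i-1] with y[j-1]
                          F[i - 1][j] + gap,    # x[i-1] against a gap
                          F[i][j - 1] + gap)    # y[j-1] against a gap
    # Traceback from the bottom-right corner to recover one optimal alignment.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1])
            ay.append(y[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1])
            ay.append("-")
            i -= 1
        else:
            ax.append("-")
            ay.append(y[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


score, row_x, row_y = needleman_wunsch("GATTACA", "GATACA")
print(score)
print(row_x)
print(row_y)
```

Smith-Waterman local alignment uses the same recurrence, but clamps every cell at zero and starts the traceback from the highest-scoring cell.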
Sequence comparison through pairwise
alignments
Goal of pairwise comparison is to find conserved
regions (if any) between two sequences
Extrapolate information about our sequence
using the known characteristics of the other
sequence
Evolution of sequences
Sequences evolve through mutation and selection
[Selective pressure is different for each residue
position in a protein (i.e. conservation of active
site, structure, charge, etc.)]
Modular nature of proteins [Nature keeps re-using
domains]
Alignments try to tell the evolutionary story of the
proteins
Relationships
Example of Alignment-textual view
Two similar regions of the Drosophila melanogaster
Slit and Notch proteins
Some Definitions
Identity
• Proportion of pairs of identical residues between two aligned sequences.
• Generally expressed as a percentage.
• This value strongly depends on how the two sequences are aligned.
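A short sketch of the identity calculation shows why the value depends on the alignment. It assumes that the denominator is the number of columns without gaps; other conventions (full alignment length, length of the shorter sequence) are also in use, so this is one option rather than the definition.

```python
def percent_identity(row1, row2, gap="-"):
    """Percentage of gap-free aligned columns that hold identical residues."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    # Keep only the columns where both rows contain a residue.
    pairs = [(a, b) for a, b in zip(row1, row2) if a != gap and b != gap]
    if not pairs:
        return 0.0
    identical = sum(1 for a, b in pairs if a == b)
    return 100.0 * identical / len(pairs)


# The same two sequences, aligned in two different ways.
print(percent_identity("GATTACA", "GA--ACA"))  # 5 of 5 residue pairs identical
print(percent_identity("GATTACA", "GAACA--"))  # 3 of 5 residue pairs identical
```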
Similarity
• Proportion of pairs of similar residues between two aligned sequences.
• Whether two residues are similar is determined by a substitution matrix.
• This value also depends strongly on how the two sequences are aligned,
as well as on the substitution matrix used.
Homology
• Two sequences are homologous if and only if they have a common
ancestor.
• There is no such thing as a level of homology! (It's either yes or no.)
Note: Homologous sequences do not necessarily serve the same function...
Nor are they always highly similar: structure may be conserved while
sequence is not
Consider a set S (say, globins) and a test t that tries to detect
members of S (for example, through a pairwise comparison
with another globin).
True positive
• A protein is a true positive if it belongs to S and is detected
by t.
True negative
• A protein is a true negative if it does not belong to S and is
not detected by t.
False positive
• A protein is a false positive if it does not belong to S and is
(incorrectly) detected by t.
False negative
• A protein is a false negative if it belongs to S and is not
detected by t (but should be).
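The four outcomes above can be written as a small classification function. In this sketch the set S and the set of proteins detected by t are plain Python sets; the detected set is made up purely for illustration and does not represent a real search result.

```python
def classify(protein, members_of_s, detected_by_t):
    """Label one protein according to set membership and the test outcome."""
    in_s = protein in members_of_s
    detected = protein in detected_by_t
    if in_s and detected:
        return "true positive"
    if not in_s and not detected:
        return "true negative"
    if not in_s and detected:
        return "false positive"
    return "false negative"


# S: a few globins. The detected set is a hypothetical outcome of test t.
globins = {"alpha-globin", "beta-globin", "myoglobin"}
detected = {"alpha-globin", "beta-globin", "lysozyme"}

for name in ["alpha-globin", "myoglobin", "lysozyme", "trypsin"]:
    print(name, "->", classify(name, globins, detected))
```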
Example
The set of all globins and a test to identify them
Consider:
A set S (say, globins: G)
A test t that tries to detect members of S (for
example, through a pairwise comparison with
another globin).
Concept of a sequence alignment
Pairwise Alignment:
Explicit mapping between the residues of 2
sequences
Tolerant to errors (mismatches, insertion /
deletions or indels)
Evaluation of the alignment in a biological context
(significance)
Number of alignments
There are many ways to align two sequences
Consider the sequence fragments below: a simple
alignment shows some conserved portions
Number of possible alignments for 2 sequences of length 1000 residues:
more than 10^600 gapped alignments
(Avogadro's number is about 10^24; the estimated number of atoms in the universe is about 10^80)
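The order of magnitude quoted above can be checked with a short calculation. The sketch assumes one common counting convention, in which two alignments are the same if they pair up the same residues; under that convention two sequences of length n have C(2n, n) global alignments. Other conventions give even larger counts.

```python
from math import comb


def count_alignments(n):
    """Global alignments of two length-n sequences, counted as C(2n, n).

    Assumption: alignments are distinguished only by which residue pairs
    are matched, not by the order of adjacent insertions and deletions.
    """
    return comb(2 * n, n)


count = count_alignments(1000)
# Python integers are exact, so the number of decimal digits can be read off.
print(len(str(count)), "digits")
print(count > 10**600)
```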
What is a good alignment ?
We need a way to evaluate the biological meaning
of a given alignment
Intuitively we "know" that some alignments of the same two sequences
are better than others.
We can express this notion more rigorously by using a scoring
system
Scoring system
Simple alignment scores
A simple way (but not the best) to score an
alignment is to count 1 for each match and 0 for
each mismatch.
Importance of the scoring system
Discrimination of significant biological alignments
Based on physico-chemical properties of amino-acids
Hydrophobicity, acid / base, sterical properties, ...
Scoring system scales are arbitrary
Based on biological sequence information
Substitutions observed in structural or evolutionary alignments of well
studied protein families
Scoring systems have a probabilistic foundation
Substitution matrices
In proteins some mismatches are more acceptable than others
Substitution matrices give a score for each substitution of one amino acid
by another
Dot Plots or Diagonal plots
Produces a graphical representation of similarity regions.
Dot Plots
A dot plot gives an overview of all possible alignments
In a dot plot, each diagonal corresponds to a
possible (ungapped) alignment
Insertions and deletions in a dot plot
Concept of a dot plot
• Produces a graphical representation of similarity regions.
• The horizontal and vertical dimensions correspond to the compared sequences.
• A region of similarity stands out as a diagonal
A Simple example
A dot is placed at each position where two
residues match.
The colour of the dot can be chosen
according to the substitution value in the
substitution matrix
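The simple dot plot described above can be sketched in a few lines. This version marks exact matches only and renders the plot as text; the function names are illustrative, and colouring by substitution value is left out.

```python
def dot_plot(seq_x, seq_y):
    """Matrix with True wherever a residue of seq_y equals a residue of seq_x.

    Rows follow seq_y (vertical axis), columns follow seq_x (horizontal axis).
    """
    return [[a == b for b in seq_x] for a in seq_y]


def render(matrix, seq_x, seq_y):
    """Draw the plot as text: '*' for a dot, '.' for an empty cell."""
    lines = ["  " + seq_x]
    for residue, row in zip(seq_y, matrix):
        lines.append(residue + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


plot = dot_plot("GATTACA", "GATACA")
print(render(plot, "GATTACA", "GATACA"))
```

Runs of dots along a diagonal correspond to ungapped matching regions; a shift from one diagonal to a neighbouring one indicates an insertion or deletion.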
Limitations of a dot plot
• It is a visual aid.
• It does not provide an alignment.
• This method produces dot plots with too
much noise to be useful
Protein Scoring Systems
Scoring matrices reflect:
• % of mutations to convert one to another
• chemical similarity
• observed mutation frequencies
• the probability of occurrence of each amino acid
Substitution Matrices (Log odds matrices)
Two popular sets of matrices for protein
sequences
PAM matrices [Dayhoff et al., 1978]
BLOSUM matrices [Henikoff & Henikoff, 1992]
Both try to capture the relative substitutability of
amino acid pairs in the context of evolution
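The log-odds idea behind both matrix families can be sketched as follows: a score compares how often two residues are seen aligned with how often they would be paired by chance. The frequencies below are invented for illustration, and published matrices additionally scale and round the scores.

```python
from math import log2


def log_odds_score(pair_freq, freq_a, freq_b):
    """s(a, b) = log2(q_ab / (p_a * p_b)), in bits.

    pair_freq      q_ab: observed frequency of a aligned with b
    freq_a, freq_b p_a, p_b: background frequencies of a and b
    """
    return log2(pair_freq / (freq_a * freq_b))


# Hypothetical frequencies: both residues have background frequency 0.1,
# so a chance pairing is expected with frequency 0.01.
print(log_odds_score(0.02, 0.1, 0.1))   # seen twice as often as chance: positive
print(log_odds_score(0.005, 0.1, 0.1))  # seen half as often as chance: negative
```

A positive score means the pair is observed more often than expected by chance, a negative score means less often.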
PAM series (Dayhoff M., 1968, 1972, 1978)
PAM (Percent Accepted Mutation) matrices: Family of
matrices PAM 80, PAM 120, PAM 250
A unit introduced by Dayhoff et al. to quantify the amount of
evolutionary change in a protein sequence.
The number with a PAM matrix represents the evolutionary
distance between the sequences on which the matrix is
based
Greater numbers denote greater distances
The PAM-1 matrix reflects an average change of 1% of all
amino acid positions.
PAM250 = 250 mutations per 100 residues.
Greater numbers mean bigger evolutionary distance
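Higher PAM matrices are obtained by multiplying the PAM-1 mutation probability matrix by itself. The sketch below uses a toy two-letter alphabet with a 1% change per PAM unit, which is an assumption made to keep the example small; real PAM matrices cover the 20 amino acids.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Toy PAM-1-like matrix: each residue changes with probability 0.01 per PAM unit.
pam1 = [[0.99, 0.01],
        [0.01, 0.99]]
pam250 = mat_pow(pam1, 250)

# Probability that a position holds a different residue after 250 PAM units.
print(round(pam250[0][1], 4))
```

Even after 250 mutations per 100 residues the sequences are not 250% different: repeated and back mutations hit the same positions, so the observed difference saturates (below 50% in this two-letter toy).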
Percent Accepted Mutation.
A PAM(x) substitution matrix is a look-up table in
which scores for each amino acid substitution
have been calculated based on the frequency of
that substitution in closely related proteins that
have experienced a certain amount (x) of
evolutionary divergence.
Based on 1572 observed changes (accepted point mutations) in 71 groups of closely related proteins
Old standard matrix: PAM250
Substitution matrix (PAM 250)
Alignment score
BLOSUM matrices
Different BLOSUMn matrices are calculated
independently from BLOCKS (ungapped local
alignments)
BLOSUMn is built from BLOCKS in which sequences that share at
least n percent identity are clustered and counted as a single sequence
BLOSUM62 represents closer sequences than
BLOSUM45
The number in the matrix name (e.g. 62 in
BLOSUM62) refers to the percentage of sequence
identity used to build the matrix.
Greater numbers mean smaller evolutionary
distance.
BLOSUM series (Henikoff S. & Henikoff
JG., PNAS, 1992)
Blocks Substitution Matrix.
A substitution matrix in which scores for each position are
derived from observations of the frequencies of substitutions
in blocks of local alignments in related proteins.
Each matrix is tailored to a particular evolutionary distance.
In the BLOSUM62 matrix, for example, the alignment from which
scores were derived was created using sequences sharing no
more than 62% identity.
Sequences more identical than 62% are represented by a single
sequence in the alignment so as to avoid over-weighting
closely related family members.
Based on alignments in the BLOCKS database
Standard matrix: BLOSUM62
TIPS on choosing a scoring matrix
Generally, BLOSUM matrices perform better
than PAM matrices for local similarity searches
(Henikoff & Henikoff, 1993).
When comparing closely related proteins one
should use lower PAM or higher BLOSUM
matrices, for distantly related proteins higher
PAM or lower BLOSUM matrices.
For database searching the commonly used
matrix is BLOSUM62.
Limitations of Substitution Matrices
Substitution matrices do not take into account
long range interactions between residues.
They assume that identical residues are equal
(whereas in real life a residue at the active site
has other evolutionary constraints than the
same residue outside of the active site)
They assume evolution rate to be constant.
Gaps
Insertions or deletions
Proteins often contain regions where residues have
been inserted or deleted during evolution
There are constraints on where these insertions and
deletions can happen (between structural or
functional elements like: alpha helices, active site,
etc.)
Why Gap Penalties?
The optimal alignment of two similar sequences is
usually that which
maximizes the number of matches and
minimizes the number of gaps.
There is a tradeoff between these two - adding gaps
reduces mismatches
Permitting the insertion of arbitrarily many gaps can
lead to high scoring alignments of non-
homologous sequences.
Penalizing gaps forces alignments to have relatively
few gaps.
Gap Penalties
How to balance gaps with mismatches?
Gaps must get a steep penalty, or else you’ll end
up with nonsense alignments.
In real sequences, multi-base (or multi-residue)
gaps are quite common, reflecting
genetic insertion/deletion events
“Affine” gap penalties give a big penalty for each
new gap, but a much smaller “gap extension”
penalty.
Gap opening and extension penalties
Costs of gaps in alignments
We want to simulate as closely as possible the evolutionary
mechanisms involved in gap occurrence.
Example
Two alignments with identical number of gaps but very different
gap distribution.
We may prefer one large gap to several small ones (e.g. poorly
conserved loops between well-conserved helices)
Gap opening penalty
Counted each time a gap is opened in an alignment (some
programs include the first extension into this penalty)
Gap extension penalty
Counted for each extension of a gap in an alignment
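The two penalties defined above combine into an affine gap cost. The sketch below uses an opening penalty of 10 and an extension penalty of 1 as defaults; the flag name is hypothetical and covers both conventions, depending on whether the opening penalty already includes the first gap position.

```python
def affine_gap_penalty(k, opening=10, extension=1, opening_includes_first=True):
    """Cost of a single gap of length k under an affine scheme."""
    if k <= 0:
        return 0
    if opening_includes_first:
        # Opening covers the first gap position; each further position is an extension.
        return opening + extension * (k - 1)
    # Opening is charged once, and every gap position also pays an extension.
    return opening + extension * k


# One gap of length 3 versus three separate gaps of length 1.
print(affine_gap_penalty(3))      # 10 + 2 * 1 = 12
print(3 * affine_gap_penalty(1))  # 3 * 10 = 30
```

With these values a single long gap is much cheaper than several short ones, which matches the preference for one large gap in a poorly conserved loop.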
Example
With a match score of 1 and a mismatch score of
0
With an opening penalty of 10 and extension
penalty of 1, we have the following score:
Statistical evaluation of results
Alignments are evaluated according to their score
• Raw score
It is the sum of the amino acid substitution scores and gap penalties (gap
opening and gap extension)
Depends on the scoring system (substitution matrix, etc.)
Different alignments should not be compared based only on the raw score
It is possible that a "bad" long alignment gets a better raw score than a very
good short alignment.
We need a normalised score to compare alignments
We need to evaluate the biological meaning of the score (p-value, e-value).
• Normalised score
Is independent of the scoring system
Allows the comparison of different alignments
Units: expressed in bits
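A raw score can be converted to bits with the Karlin-Altschul normalisation S' = (lambda * S - ln K) / ln 2, where lambda and K are statistical parameters of the scoring system. The parameter values in this sketch are placeholders for illustration only; real values depend on the substitution matrix and gap penalties in use.

```python
from math import log


def bit_score(raw_score, lam, k):
    """Normalised score in bits: (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - log(k)) / log(2)


# Illustrative parameters (assumed, not taken from the slides).
LAMBDA = 0.267
K = 0.041

for raw in (40, 80):
    print(raw, "->", round(bit_score(raw, LAMBDA, K), 1), "bits")
```

Because lambda and K absorb the scale of the scoring system, bit scores obtained with different matrices can be compared directly.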
Distribution of alignment scores - Extreme Value
Distribution
Random sequences and alignment scores
Sequence alignment scores between random
sequences are distributed following an extreme
value distribution (EVD).
Extreme Value Distribution
High scoring random alignments have a low probability.
The EVD allows us to compute the probability with which
our biological alignment could be due to randomness
(to chance).
Caveat: finding the threshold of significant alignments.
Statistics derived from the scores
p-value
Probability that an alignment with this score occurs by
chance in a database of this size
The closer the p-value is towards 0, the better the alignment
e-value
Number of matches with this score one can expect to find by
chance in a database of this size
The closer the e-value is towards 0, the better the alignment
Relationship between e-value and p-value:
In a database containing N sequences, e = p x N
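The statistics above can be sketched numerically. The p-value function uses the extreme value form P(S >= x) = 1 - exp(-K * m * n * exp(-lambda * x)) for a single comparison of sequences of lengths m and n, and the e-value uses the relation e = p x N given above. Treating p as the probability for one pairwise comparison, and all parameter values, are assumptions made for illustration.

```python
from math import exp, expm1


def evd_p_value(score, lam, k, m, n):
    """Chance probability of a score >= `score` in one pairwise comparison."""
    expected = k * m * n * exp(-lam * score)
    # 1 - exp(-expected), computed with expm1 to stay accurate for tiny values.
    return -expm1(-expected)


def e_value(p, n_sequences):
    """Expected number of chance hits in a database of N sequences: e = p x N."""
    return p * n_sequences


# Illustrative parameters (assumed): lambda, K, sequence lengths, database size.
p = evd_p_value(score=60, lam=0.267, k=0.041, m=300, n=300)
print(p)
print(e_value(p, 500_000))
```

Higher scores give smaller p-values, and for a fixed p-value the e-value grows with the size of the database searched.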

  • 2. Sequence alignment To align and score a pair of sequences (DNA or protein) To find the correspondences between substrings in the sequences such that the similarity score is maximized Why do alignment? To find out homology: similarity due to descent from a common ancestor Often we can infer homology from similarity Thus we can sometimes infer structure/function from sequence similarity
  • 3. Sequence analysis tools depending on pairwise comparison • Multiple alignments • Profile and HMM making (used to search for protein families and domains) • 3D protein structure prediction • Phylogenetic analysis • Construction of certain substitution matrices • Similarity searches in a database
  • 4. Homology Members of a family are called homologs or homologous molecules. Homologous sequences can be divided into two groups: – orthologous sequences: sequences that differ because they are found in different species (e.g. human α-globin and mouse α-globin) – paralogous sequences: sequences that differ because of a gene duplication event (e.g. human α-globin and human β-globin, various versions of both)
  • 5. Issues in Sequence Alignment The sequences we are comparing probably differ in length There may be only a relatively small region in the sequences that match We want to allow partial matches (i.e. some amino acid pairs are more substitutable than others) Variable length regions may have been inserted/deleted from the common ancestral sequence
  • 6. Applications Sequence alignment arises in many fields: • Molecular biology • Inexact text matching (e.g. spell checkers; web page search) • Speech recognition In general: • The precise definition of what constitutes an alignment may vary by field, and even within a field. • Many different alignments of two sequences are possible, so to select among them one requires an objective (score) function on alignments. • The number of possible alignments of two sequences grows exponentially with the length of the sequences. Good algorithms are required.
  • 8. Important questions Q. What do we want to align and how? A: Two sequences (nucleotide or protein), through pairwise alignment; or a query sequence against a database, to find similar sequences (database similarity search). Q. How do we “score” an alignment? • Simple scoring (match = 1, mismatch = 0) • Dot plots (graphical representation) • Substitution matrices (PAM and BLOSUM): s(a,b) is the score of aligning character a with character b, and accounts for the relative substitutability of amino acid pairs in the context of evolution • Gap penalty function: w(k) is the cost of a gap of length k
  • 9. Q. How do we find the “best” alignment? A: Alignment algorithms. An alignment program tries to find the best alignment between two sequences given the scoring system. Alignment types • Global: alignment between the complete sequence A and the complete sequence B • Local: alignment between a subsequence of A and a subsequence of B Computer implementation (algorithms) Dynamic programming • Global: Needleman-Wunsch • Local: Smith-Waterman Heuristic algorithms (faster but approximate) • BLAST • FASTA
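The two dynamic programming methods named above can be sketched in a few lines. This is a minimal illustration, not the slides' own code: it returns only the best score (no traceback), uses a simple linear gap cost, and the function names and default scores are assumptions chosen for the example.

```python
def global_score(a, b, match=1, mismatch=0, gap=-1):
    """Needleman-Wunsch: best score for aligning all of a with all of b."""
    # Row 0: aligning a prefix of b against nothing costs one gap per residue.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a against nothing
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + s,   # align a[i-1] with b[j-1]
                            prev[j] + gap,     # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman: best score over all pairs of subsequences."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets a local alignment restart anywhere.
            cell = max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best
```

The only structural differences between the two are the initialisation (gap costs versus zeros), the floor at 0 in each cell, and where the final score is read (bottom-right cell versus the maximum over the table).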
  • 10. Pairwise alignment The alignment of two sequences (DNA or protein) is a relatively straightforward computational problem. There are lots of possible alignments. Two sequences can always be aligned. Sequence alignments have to be scored. Often there is more than one solution with the same score.
  • 11. Sequence comparison through pairwise alignments Goal of pairwise comparison is to find conserved regions (if any) between two sequences Extrapolate information about our sequence using the known characteristics of the other sequence
  • 12. Evolution of sequences Sequences evolve through mutation and selection [Selective pressure is different for each residue position in a protein (i.e. conservation of active site, structure, charge, etc.)] Modular nature of proteins [Nature keeps re-using domains] Alignments try to tell the evolutionary story of the proteins Relationships
  • 13. Example of alignment: textual view Two similar regions of the Drosophila melanogaster Slit and Notch proteins
  • 14. Some Definitions Identity • Proportion of pairs of identical residues between two aligned sequences. • Generally expressed as a percentage. • This value strongly depends on how the two sequences are aligned. Similarity • Proportion of pairs of similar residues between two aligned sequences. • Whether two residues are similar is determined by a substitution matrix. • This value also depends strongly on how the two sequences are aligned, as well as on the substitution matrix used. Homology • Two sequences are homologous if and only if they have a common ancestor. • There is no such thing as a level of homology! (It's either yes or no.) Note: Homologous sequences do not necessarily serve the same function, nor are they always highly similar: structure may be conserved while sequence is not.
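As a rough sketch of the identity definition above: the function below computes percent identity for two already-aligned strings. The choice of denominator (aligned residue pairs, ignoring columns that contain a gap) is an assumption; other conventions exist and give different values. The function name is hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residue pairs in a given pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns where both sequences have a residue (assumed convention).
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b)
             if x != gap and y != gap]
    if not pairs:
        return 0.0
    identical = sum(1 for x, y in pairs if x == y)
    return 100.0 * identical / len(pairs)
```

Calling it on two different alignments of the same pair of sequences gives different percentages, which is the dependence on the alignment that the definition warns about.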
  • 15. Consider a set S (say, globins) and a test t that tries to detect members of S (for example, through a pairwise comparison with another globin). True positive • A protein is a true positive if it belongs to S and is detected by t. True negative • A protein is a true negative if it does not belong to S and is not detected by t. False positive • A protein is a false positive if it does not belong to S and is (incorrectly) detected by t. False negative • A protein is a false negative if it belongs to S and is not detected by t (but should be).
  • 16. Example The set of all globins and a test to identify them Consider: A set S (say, globins: G) A test t that tries to detect members of S (for example, through a pairwise comparison with another globin).
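The four outcomes can be written down directly. This is a small sketch under the assumption that, for each protein, we know whether it truly belongs to S and whether the test t reported it; the protein names in the usage note are made up for illustration.

```python
from collections import Counter


def classify(in_set, detected):
    """Return the outcome label for one protein."""
    if in_set and detected:
        return "true positive"
    if in_set and not detected:
        return "false negative"
    if not in_set and detected:
        return "false positive"
    return "true negative"


def tally(results):
    """Count outcomes over (name, in_set, detected) records."""
    return Counter(classify(in_set, detected) for _, in_set, detected in results)
```

For example, `tally([("globin_1", True, True), ("globin_2", True, False), ("kinase_1", False, True)])` counts one true positive, one false negative and one false positive.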
  • 17. Concept of a sequence alignment Pairwise Alignment: Explicit mapping between the residues of 2 sequences Tolerant to errors (mismatches, insertions / deletions or indels) Evaluation of the alignment in a biological context (significance)
  • 18. Number of alignments There are many ways to align two sequences. Consider the sequence fragments below: a simple alignment shows some conserved portions. Number of possible alignments for 2 sequences of length 1000 residues: more than 10^600 gapped alignments (for comparison: Avogadro's number is about 10^24; the estimated number of atoms in the universe is about 10^80)
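The size of that number can be checked with exact integer arithmetic. The slide does not say which counting convention it uses, so two common ones are sketched here as assumptions: the binomial coefficient C(2n, n), and the count of all column-by-column alignments (each column is a residue pair, a gap in the first sequence, or a gap in the second), which follows a three-term recurrence.

```python
from math import comb


def count_gapped_alignments(m, n):
    """Number of alignments of sequences of length m and n, counting every
    distinct sequence of columns (pair, gap-in-a, gap-in-b)."""
    prev = [1] * (n + 1)  # one way to align anything against an empty sequence
    for _ in range(m):
        curr = [1]
        for j in range(1, n + 1):
            # Last column is gap-in-b, gap-in-a, or a residue pair.
            curr.append(prev[j] + curr[j - 1] + prev[j - 1])
        prev = curr
    return prev[n]


# Binomial estimate for two sequences of 1000 residues.
binomial_estimate = comb(2000, 1000)
```

Both counts exceed 10^600 for two sequences of 1000 residues, which is why exhaustive enumeration is not an option and dynamic programming is used instead.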
  • 19. What is a good alignment ? We need a way to evaluate the biological meaning of a given alignment Intuitively we "know" that the following alignment: We can express this notion more rigorously, by using a scoring system
  • 20. Scoring system Simple alignment scores A simple way (but not the best) to score an alignment is to count 1 for each match and 0 for each mismatch.
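A sketch of this simple scheme, assuming the two sequences are already aligned to the same length and that a column containing a gap character simply scores as a mismatch:

```python
def simple_score(aligned_a, aligned_b, match=1, mismatch=0):
    """Score an alignment by counting 1 per match and 0 per mismatch."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    return sum(match if x == y else mismatch
               for x, y in zip(aligned_a, aligned_b))
```

With the default values the score is just the number of identical columns, so it cannot distinguish a conservative substitution from a drastic one.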
  • 21. Importance of the scoring system Discrimination of significant biological alignments. Based on physico-chemical properties of amino acids: hydrophobicity, acid / base, steric properties, ... (scoring system scales are arbitrary). Based on biological sequence information: substitutions observed in structural or evolutionary alignments of well studied protein families (scoring systems have a probabilistic foundation). Substitution matrices: in proteins some mismatches are more acceptable than others; substitution matrices give a score for each substitution of one amino acid by another. Dot plots or diagonal plots: produce a graphical representation of similarity regions.
  • 22. Dot Plots A dot plot gives an overview of all possible alignments
  • 23. In a dot plot, each diagonal corresponds to a possible (ungapped) alignment
  • 24. Insertions and deletions in a dot plot
  • 25. Concept of a dot plot • Produces a graphical representation of similarity regions. • The horizontal and vertical dimensions correspond to the compared sequences. • A region of similarity stands out as a diagonal A Simple example A dot is placed at each position where two residues match. The colour of the dot can be chosen according to the substitution value in the substitution matrix
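The simple example can be sketched as a boolean matrix with one row per residue of the first sequence and one column per residue of the second. The text rendering with `*` and `.` is an assumption made for illustration; real tools draw a graphic.

```python
def dot_plot(seq_a, seq_b):
    """Matrix with True wherever a residue of seq_a equals one of seq_b."""
    return [[x == y for y in seq_b] for x in seq_a]


def render(plot):
    """Render the matrix as text: '*' for a dot, '.' for an empty cell."""
    return "\n".join("".join("*" if cell else "." for cell in row)
                     for row in plot)
```

Printing `render(dot_plot("GATTACA", "GATTACA"))` shows the main diagonal of matches plus scattered off-diagonal dots from chance matches, which is the noise that windowed or thresholded dot plots try to suppress.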
  • 26. Limitations of a dot plot • It is a visual aid. • It does not provide an alignment. • Placing a dot at every single-residue match often produces plots with too much noise to be useful.
  • 27. Protein Scoring Systems Scoring matrices reflect: • % of mutations to convert one to another • chemical similarity • observed mutation frequencies • the probability of occurrence of each amino acid
  • 28. Substitution Matrices (Log odds matrices) Two popular sets of matrices for protein sequences PAM matrices [Dayhoff et al., 1978] BLOSUM matrices [Henikoff & Henikoff, 1992] Both try to capture the relative substitutability of amino acid pairs in the context of evolution
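The term "log odds" can be made concrete with the usual general form of such a score. The symbols below are not defined in the slides and are introduced here as assumptions: q_{ab} is the frequency with which residues a and b are observed aligned in trusted alignments, and p_a, p_b are the background frequencies of each residue.

```latex
s(a,b) = \log \frac{q_{ab}}{p_a \, p_b}
```

A positive score means the pair is seen aligned more often than expected by chance, a negative score less often. Published matrices typically scale and round these values to integers.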
  • 29. PAM series (Dayhoff M., 1968, 1972, 1978) PAM (Point Accepted Mutation, also expanded as Percent Accepted Mutation) matrices: a family of matrices, e.g. PAM80, PAM120, PAM250. PAM is a unit introduced by Dayhoff et al. to quantify the amount of evolutionary change in a protein sequence. The number attached to a PAM matrix represents the evolutionary distance between the sequences on which the matrix is based; greater numbers denote greater distances. The PAM1 matrix reflects an average change of 1% of all amino acid positions. PAM250 corresponds to 250 accepted mutations per 100 residues (a position can change more than once, so sequences at this distance still share some identity).
  • 30. Percent Accepted Mutation. A PAM(x) substitution matrix is a look-up table in which scores for each amino acid substitution have been calculated based on the frequency of that substitution in closely related proteins that have experienced a certain amount (x) of evolutionary divergence. Based on 1572 observed accepted point mutations in 71 families of closely related proteins. Old standard matrix: PAM250
  • 33. BLOSUM matrices Different BLOSUMn matrices are calculated independently from BLOCKS (ungapped local alignments). For BLOSUMn, sequences within a block that share at least n percent identity are clustered and counted as a single sequence, so the matrix reflects substitutions between sequences that are less than n percent identical. BLOSUM62 represents closer sequences than BLOSUM45. The number in the matrix name (e.g. 62 in BLOSUM62) refers to the percentage of sequence identity used as the clustering threshold when building the matrix. Greater numbers mean smaller evolutionary distance.
  • 34. BLOSUM series (Henikoff S. & Henikoff JG., PNAS, 1992) Blocks Substitution Matrix. A substitution matrix in which scores for each position are derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins. Each matrix is tailored to a particular evolutionary distance. In the BLOSUM62 matrix, for example, the alignment from which scores were derived was created using sequences sharing no more than 62% identity. Sequences more identical than 62% are represented by a single sequence in the alignment so as to avoid over-weighting closely related family members. Based on alignments in the BLOCKS database. Standard matrix: BLOSUM62
  • 37. TIPS on choosing a scoring matrix Generally, BLOSUM matrices perform better than PAM matrices for local similarity searches (Henikoff & Henikoff, 1993). When comparing closely related proteins one should use lower PAM or higher BLOSUM matrices, for distantly related proteins higher PAM or lower BLOSUM matrices. For database searching the commonly used matrix is BLOSUM62.
  • 38. Limitations of Substitution Matrices Substitution matrices do not take into account long range interactions between residues. They assume that identical residues are equal (whereas in real life a residue at the active site has other evolutionary constraints than the same residue outside of the active site) They assume evolution rate to be constant.
  • 39. Gaps Insertions or deletions Proteins often contain regions where residues have been inserted or deleted during evolution There are constraints on where these insertions and deletions can happen (between structural or functional elements like: alpha helices, active site, etc.)
  • 40. Why Gap Penalties? The optimal alignment of two similar sequences is usually that which maximizes the number of matches and minimizes the number of gaps. There is a tradeoff between these two: adding gaps reduces mismatches. Permitting the insertion of arbitrarily many gaps can lead to high scoring alignments of non-homologous sequences. Penalizing gaps forces alignments to have relatively few gaps.
  • 41. Gap Penalties How to balance gaps with mismatches? Gaps must get a steep penalty, or else you’ll end up with nonsense alignments. In real sequences, multi-base (or multi-amino-acid) gaps are quite common genetic insertion/deletion events. “Affine” gap penalties give a big penalty for each new gap, but a much smaller “gap extension” penalty.
  • 42. Gap opening and extension penalties Costs of gaps in alignments We want to simulate as closely as possible the evolutionary mechanisms involved in gap occurrence. Example Two alignments with identical number of gaps but very different gap distribution. We may prefer one large gap to several small ones (e.g. poorly conserved loops between well-conserved helices) Gap opening penalty Counted each time a gap is opened in an alignment (some programs include the first extension into this penalty) Gap extension penalty Counted for each extension of a gap in an alignment
  • 43. Example With a match score of 1 and a mismatch score of 0 With an opening penalty of 10 and extension penalty of 1, we have the following score:
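Since the slide's example alignments are not reproduced in this transcript, here is a hedged sketch of how such a score could be computed with the stated values (match 1, mismatch 0, opening 10, extension 1). It assumes the convention in which the opening penalty covers the first gap position and each further position of the same gap costs one extension; as noted earlier, programs differ on this. The function name and test sequences are made up.

```python
def affine_score(aligned_a, aligned_b, match=1, mismatch=0,
                 gap_open=10, gap_extend=1, gap="-"):
    """Score an existing alignment with affine gap penalties."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    prev_gap_in = None  # which sequence the previous column's gap was in
    for x, y in zip(aligned_a, aligned_b):
        if x == gap and y == gap:
            raise ValueError("a column cannot contain two gaps")
        if x == gap or y == gap:
            where = "a" if x == gap else "b"
            # Continuing a gap in the same sequence is an extension;
            # anything else opens a new gap.
            score -= gap_extend if where == prev_gap_in else gap_open
            prev_gap_in = where
        else:
            score += match if x == y else mismatch
            prev_gap_in = None
    return score
```

Under these assumptions, `affine_score("ACGTTTACGT", "ACG---ACGT")` gives 7 matches minus one gap of length 3 (10 + 1 + 1), whereas `affine_score("ACGTTTACGT", "A-G-T-ACGT")` has the same number of matches and gap positions but pays three opening penalties, so the single long gap scores much better.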
  • 44. Statistical evaluation of results Alignments are evaluated according to their score. • Raw score: the sum of the amino acid substitution scores and gap penalties (gap opening and gap extension). It depends on the scoring system (substitution matrix, etc.). Different alignments should not be compared based only on the raw score: it is possible that a "bad" long alignment gets a better raw score than a very good short alignment. We need a normalised score to compare alignments, and we need to evaluate the biological meaning of the score (p-value, e-value). • Normalised score: is independent of the scoring system and allows the comparison of different alignments. Units: expressed in bits.
  • 45. Distribution of alignment scores - Extreme Value Distribution Random sequences and alignment scores Sequence alignment scores between random sequences are distributed following an extreme value distribution (EVD).
  • 46. Extreme Value Distribution High scoring random alignments have a low probability. The EVD allows us to compute the probability with which our biological alignment could be due to randomness (to chance). Caveat: finding the threshold of significant alignments.
  • 47. Statistics derived from the scores p-value: probability that an alignment with this score occurs by chance in a database of this size. The closer the p-value is to 0, the better the alignment. e-value: number of matches with this score one can expect to find by chance in a database of this size. The closer the e-value is to 0, the better the alignment. Relationship between e-value and p-value: in a database containing N sequences, e = p × N
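The slides give no formulas for the normalised score or the e-value, so the sketch below uses the Karlin-Altschul form familiar from BLAST as an assumption: E = K·m·n·e^(−λS), bit score S' = (λS − ln K)/ln 2, and P = 1 − e^(−E). Here m and n are the query and database lengths, and λ and K depend on the scoring system; the values in the usage note are illustrative only. The last function is the slide's own relation e = p × N.

```python
import math


def bit_score(raw_score, lam, K):
    """Normalised score in bits: S' = (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def expected_hits(raw_score, m, n, lam, K):
    """E-value: E = K * m * n * exp(-lam * S)."""
    return K * m * n * math.exp(-lam * raw_score)


def p_from_e(e_value):
    """Probability of at least one chance hit: P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


def e_from_p(p_value, n_sequences):
    """The slide's relation: e = p * N for a database of N sequences."""
    return p_value * n_sequences
```

For instance, with the illustrative values `lam=0.318` and `K=0.134`, `expected_hits(80, 300, 10**8, 0.318, 0.134)` and `300 * 10**8 * 2 ** -bit_score(80, 0.318, 0.134)` agree, which shows why bit scores can be compared across scoring systems: once the score is in bits, the e-value depends only on the search space size.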