Protein Evolution and Sequence Analysis
Central Premise
Significant sequence similarity allows one to assign function to an unknown
protein based on the properties of known proteins, and is a direct
consequence of evolutionary relationships.
Homolog- A gene/protein related to a second gene/protein by descent from a
common ancestral gene, whether by speciation or by gene duplication.
Ortholog- Genes/proteins in different species that evolved from a common
ancestral gene by speciation and that usually retain the same function.
Paralog- Genes/proteins related by duplication of a common ancestral gene;
the copies often evolve new functions, even if related to that of the ancestor.
Speciation- Splitting of one lineage into two species, after which each copy
of an ancestral gene evolves independently of the other.
Convergent evolution- Evolution of similar features or properties in
genes/proteins of different genetic lineages.
Divergent and Convergent Evolution Among the Serine Proteases
[Figure: structures of trypsin (PDB 3NKK), chymotrypsin (PDB 1ACB), their
overlay, and subtilisin (PDB 1SBT)]
Mechanisms Involved in Molecular Evolution of Genes/Proteins
Mutation- Stochastic single point changes in the genetic material due to
errors in DNA replication, radiation exposure, chemical or environmental
stressors, or viruses and transposable elements. Occurs at a slow but roughly
constant rate (molecular clock) of 10^-9 to 10^-8 mutations per base per
generation. In eukaryotes, splicing errors that retain introns are a further
source of change.
Recombination- Exchange of genes or portions of genes between different
chromosomes to create new combinations of elements.
Gene duplication- Duplication of a gene or portions of a gene, one of
which continues the original function and the other is free to evolve and
acquire new functions.
Retrotransposition- Incorporation of mRNA sequences back into DNA,
frequently inserting into new locations with different expression patterns.
The mechanisms by which new genes/proteins arise make it possible to use
sequence analysis to infer functional and structural relationships among
different sequences.
AGGCTTAGCAAA........TCAGGGCCTAATGCG
|||||||| |||        ||||||||||| |||
AGGCTTAGGAAACTTCCTAGTCAGGGCCTAAAGCG
The above pairwise alignment could be scored by giving +1 for each
identical nucleotide, 0 for a mismatch, -4 for opening a gap, and -1 for
each extension of the gap. There are 25 identities, and the single 8-position
gap costs 4 + 7 = 11, so score = 25 - 11 = 14.
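The scoring rule above can be checked with a few lines of Python. This is a minimal sketch, not a standard tool: the function name `score_alignment` and its parameters are made up for illustration, and it assumes the gap-opening penalty covers the first gap position, which is the convention that reproduces the slide's total.

```python
def score_alignment(top, bottom, match=1, mismatch=0,
                    gap_open=-4, gap_extend=-1, gap_chars=".-"):
    """Score a pre-aligned pair of sequences column by column.

    Assumption: the first position of a gap costs gap_open and every
    further position of the same gap costs gap_extend.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    in_gap = False
    for a, b in zip(top, bottom):
        if a in gap_chars or b in gap_chars:
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score


# The nucleotide alignment from the slide (8-position gap in the top row).
top = "AGGCTTAGCAAA" + "." * 8 + "TCAGGGCCTAATGCG"
bottom = "AGGCTTAGGAAA" + "CTTCCTAG" + "TCAGGGCCTAAAGCG"
print(score_alignment(top, bottom))  # 25 matches - 11 gap penalty = 14
```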
Sequence alignments are methods of arranging DNA, RNA, or protein
sequences to identify regions of similarity or identity with the goal of
inferring structure, function, or both.
Sequence searches and alignments using DNA/RNA are usually not as
informative as searches and alignments using protein sequences.
However, DNA/RNA alignments are intuitively easier to understand, as the
nucleotide example above shows.
ARDTGQEPSSFWNLILMY.........DSCVIVHKKMSLEIRVH
|      | | |     |         |||  | | ||   |||
AKKSAEQPTSYWDIVILYESTDKNDSGDSCTLVKKRMSIQLRVH
Unlike nucleotide sequence alignments, which are either identical or
not identical at a given position, protein sequence alignments include
“shades of grey” where one might acknowledge that a T is sort of
equivalent to an S. But how equivalent? What number would you
assign to an S-T mismatch? And what about gaps? Since alanine is
a common amino acid, couldn’t the A-A match be by chance? Since
Trp and Cys are uncommon, should those matches be given higher
scores?
Therefore, accurately aligning sequences and accurately finding
related sequences are approximately the same problem.
Protein sequence alignments are much more complicated but are
more informative because each position can take 20 possible values
(the amino acids) rather than 4 (the bases).
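One way to make the "shades of grey" concrete is to classify each aligned column as identical, similar, or different. The sketch below does this for the protein alignment shown above. The similarity groups in `SIMILAR_GROUPS` are an illustrative assumption chosen for this example, not a published substitution matrix; a real analysis would use PAM or BLOSUM scores instead.

```python
# Hypothetical groupings of chemically similar residues (illustration only).
SIMILAR_GROUPS = ["ST", "DE", "KR", "ILVM", "FYW", "NQ"]


def classify_columns(top, bottom, gap_chars=".-"):
    """Count identical, similar, different, and gapped columns."""
    counts = {"identical": 0, "similar": 0, "different": 0, "gap": 0}
    for a, b in zip(top, bottom):
        if a in gap_chars or b in gap_chars:
            counts["gap"] += 1
        elif a == b:
            counts["identical"] += 1
        elif any(a in group and b in group for group in SIMILAR_GROUPS):
            counts["similar"] += 1
        else:
            counts["different"] += 1
    return counts


# The protein alignment from the slide (9-position gap in the top row).
prot_top = "ARDTGQEPSSFWNLILMY" + "." * 9 + "DSCVIVHKKMSLEIRVH"
prot_bottom = "AKKSAEQPTSYWDIVILY" + "ESTDKNDSG" + "DSCTLVKKRMSIQLRVH"
print(classify_columns(prot_top, prot_bottom))
```

Under these assumed groups, a sizeable share of the non-identical columns count as similar, which is exactly the information an identity-only score throws away.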
Multiple Sequence Alignments
Sequence comparisons fall into two categories: local alignments, in
which regions of large sequences are compared to identify regions of
similarity such as domains, and global alignments, in which
sequences of similar length are compared to analyze overall similarity.
Various methods are available depending on the assumptions of the
algorithm and the types of sequences to be analyzed. All require a
scoring scheme for dealing with similarities, gaps, and insertions.
Clustal is a commonly used global alignment algorithm for performing
multiple sequence alignments. The algorithm is executed in three stages:
(1) A pairwise comparison is performed across all sequences; (2) The
pairwise information is used to create a guide tree; (3) The guide tree is
used to build the final alignment, starting from the most similar
sequences.
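The first two stages can be sketched in a few lines. This is a simplified illustration, not Clustal's actual implementation: it assumes toy sequences of equal length so that the pairwise distance is just the fraction of mismatched positions, and it builds the guide tree by average-linkage clustering. The names `pairwise_distance`, `guide_tree`, and the sequences themselves are hypothetical.

```python
from itertools import combinations


def pairwise_distance(a, b):
    """Stage 1 (simplified): fraction of positions that differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def guide_tree(seqs):
    """Stage 2: join the closest clusters first (average linkage).

    seqs maps a name to an equal-length sequence. Returns nested tuples.
    """
    names = list(seqs)
    leaf_dist = {
        frozenset((m, n)): pairwise_distance(seqs[m], seqs[n])
        for m, n in combinations(names, 2)
    }
    # Each cluster is (subtree, list of member names).
    clusters = [(name, [name]) for name in names]

    def average_distance(i, j):
        members_i, members_j = clusters[i][1], clusters[j][1]
        total = sum(leaf_dist[frozenset((a, b))]
                    for a in members_i for b in members_j)
        return total / (len(members_i) * len(members_j))

    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda pair: average_distance(*pair))
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


toy_seqs = {
    "seqA": "ACGTACGTAC",
    "seqB": "ACGTACGTAA",
    "seqC": "ACGAACGAAA",
    "seqD": "TTGAACGAAT",
}
print(guide_tree(toy_seqs))
```

Reading the nested tuples from the inside out gives the order in which stage 3 would align the sequences: the closest pair first, then progressively more distant sequences.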
PAM (Point Accepted Mutation) matrices
• Are derived from studying global alignments of well-characterized protein families.
• PAM1 = only 1% of residues have changed (i.e. short evolutionary distance)
• Raise PAM1 to the 250th power to get PAM250: 250 accepted mutations per
100 residues (greater evolutionary distance). Because a site can change more
than once, this still leaves about 20% sequence identity.
• Therefore,
a PAM 30 would be used to analyze more closely related proteins,
a PAM 400 is used for finding and analyzing distantly related proteins.
• PAMx = (PAM1)^x
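The relation between PAM1 and PAMx is a matrix power, which a toy model can illustrate. The sketch below is not the Dayhoff data: it assumes a uniform 20-state mutation matrix in which every residue stays unchanged with probability 0.99 per step and changes to each of the other 19 with equal probability. The names `toy_pam1` and `mat_pow` are made up for this example.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = m
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


N = 20  # number of amino acids
# Toy PAM1: 1% chance of change per step, spread evenly (an assumption).
toy_pam1 = [[0.99 if i == j else 0.01 / (N - 1) for j in range(N)]
            for i in range(N)]
toy_pam250 = mat_pow(toy_pam1, 250)

# Probability that a residue is the same after 250 steps.
identity_250 = toy_pam250[0][0]
print(round(identity_250, 3))
```

Even in this crude model, roughly 12% of positions are still identical after 250 steps, because repeated and reverse changes hit the same sites. The real matrices use residue-specific mutabilities, so the observed figure differs.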
Blocks substitution matrices (BLOSUM)
Are derived from studying local alignments (blocks) of sequences from related proteins
that share no more than X% identity.
1) In other words, one might use the portions of aligned sequences from related proteins
that have no more than 62% identity (in the portions or blocks) to derive the BLOSUM
62 scoring matrix.
2) One might use only the blocks that have <80% identity to derive the BLOSUM 80
matrix.
3) BLOSUM and PAM substitution matrices are numbered in opposite directions:
a) The higher the number of the BLOSUM matrix (BLOSUM X), the more closely related
proteins you are looking for.
b) The higher the number of the PAM matrix (PAM X), the more distantly related proteins
you are looking for.
Gap penalties – Intuitively one recognizes that there should be a penalty
for introducing (requiring) a gap during identification/alignment of a given
sequence. But if two sequences are related, the gaps may well be located
in loop regions which are more tolerant of mutational events and probably
have little impact on structure. Therefore, a new gap should be penalized,
but extending an existing gap should be penalized very little.
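This open-plus-extend scheme is usually called an affine gap penalty. A minimal sketch, reusing the -4/-1 values from the nucleotide example and assuming the opening penalty covers the first gap position (the function name `affine_gap_cost` is hypothetical):

```python
def affine_gap_cost(length, gap_open=4, gap_extend=1):
    """Cost of one contiguous gap of the given length."""
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


# One long gap is much cheaper than many scattered single-position gaps.
one_long_gap = affine_gap_cost(8)            # 4 + 7 * 1 = 11
eight_short_gaps = 8 * affine_gap_cost(1)    # 8 * 4 = 32
print(one_long_gap, eight_short_gaps)
```

The comparison shows the intended effect: a single insertion or deletion event, such as a changed loop, is penalized far less than eight independent ones.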
Filtering – many protein and nucleotide sequences contain simple repeats or
regions of low sequence complexity. These are usually masked (excluded) in
searches and alignments because they produce spurious matches.
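A simple way to see what "low complexity" means is to measure the Shannon entropy of the residues in a sliding window. The sketch below is an illustration only, not the algorithm used by production filters; the window size, the 1.5-bit threshold, and the example sequence are arbitrary assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the residue composition of a window."""
    counts = Counter(window)
    total = len(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mask_low_complexity(seq, window=12, threshold=1.5, mask_char="X"):
    """Replace every residue in a low-entropy window with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for k in range(start, start + window):
                masked[k] = mask_char
    return "".join(masked)


# Hypothetical sequence with a poly-Q stretch in the middle.
example = "MKTAYIAKQR" + "Q" * 14 + "GSLVDEHWNP"
print(mask_low_complexity(example))
```

The poly-Q run and a few flanking residues are masked, while the compositionally diverse ends are left intact.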
Significance of a “hit” during a search - More important than an arbitrary
score is an estimate of the likelihood of finding a hit through pure chance
(the lower the value, the more certain the match). Ergo the “Expectation
value” or E-value. E-values can be as low as 10^-70.
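As a rough illustration of how an E-value behaves, the sketch below uses the relation E = m × n × 2^(-S'), where S' is a normalized bit score, m the query length, and n the database length. The function name and the numbers are hypothetical, and real search tools apply further corrections (for example, effective lengths), so treat this as a sketch under those assumptions.

```python
def expect_value(bit_score, query_len, db_len):
    """Expected number of chance hits scoring at least bit_score."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical search: 300-residue query against 100 million residues.
e_50 = expect_value(50, 300, 1e8)
e_51 = expect_value(51, 300, 1e8)
e_big_db = expect_value(50, 300, 2e8)
print(e_50, e_51, e_big_db)
```

Each additional bit of score halves the E-value, while doubling the database doubles it, which is why the same alignment becomes less significant as databases grow.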
Useful Bioinformatics Sites
National Center for Biotechnology Information (NCBI)- National Institutes of
Health sponsored sites with a rich array of resources and databases.
[http://www.ncbi.nlm.nih.gov/pubmed]
ExPASy (Swiss Institute of Bioinformatics)- Large number of different
tools for sequence and function analysis. [http://www.expasy.org/tools/]
RCSB Protein Data Bank- Largest database of curated protein structures.
[http://www.rcsb.org/pdb/home/home.do]
BioGRID- Large database of curated protein interaction datasets.
[http://thebiogrid.org/]
Osprey- Software and interactome analysis tools for visualizing interaction
data sets. [http://en.bio-soft.net/protein/Osprey.html]
Tree of Life website- Database information on phylogenetic relationships
among organisms with useful link outs. [http://tolweb.org/tree/]


Editor's Notes

  • #9 PAM 250 corresponds to 250 accepted mutations per 100 residues. Because the same site can change more than once, roughly 20% of residues are still expected to be identical.