PROTEIN EVOLUTION
Function and Human Health
Daniel Gaston, PhD October 30th, 2014
WHY DO WE CARE?
Why does all of this evolution stuff matter anyway?
Why it matters
• Pure scientific curiosity
• Knowledge is intrinsically valuable, regardless of applications
• Critical for truly understanding function
• Translating research/knowledge between model organisms
• Evolution shapes population genetics
• Critical for understanding how mutations cause disease
Why it matters
• Ecology, ecological interactions, diversity
• Antibiotic resistance
• Microbiome
• Cancer
Major Groups of Organisms
Bacteria
Archaea
Eukaryotes
Major Groups of Eukaryotes
You are here
A Brief History of Life on Earth
Time
4.5B: Origin of the Earth
3 – 4B: Origin of Life
2.7B: Bacteria
1.5B: Eukaryotes
1B: Animals
Definitions
• Homology
• Descent from a common ancestor
• All or nothing, no such thing as percent homology
• Divergence
• Change in two sequences over time, after splitting from a common ancestor
• Convergence
• Similarity due to independent evolutionary events
• On the amino acid level: rare and difficult to prove
EVOLUTION IN PROTEINS
Processes
Two Groups of Processes
• Mutation
• Provides raw material of evolution
• Many different processes and mechanisms
• Happens within individuals
• Selection and Drift
• Happens within populations of organisms
• Affect the frequency of mutations within populations over time
Types of mutation (figure):
• Point mutation: AGTCCAAGGCCTTAA -> AGTTCAAGGCCTTAA
• Insertion (CCTTA): AGTCCAAGGCCTTAA -> AGTCCAAGGCCTTACCTTAA
• Deletion (AAGG): AGTCCAAGGCCTTAA -> AGTCC-CCTTAA
• Inversion: AGTCCAAGGCCTTAA -> AGTCCCCTTCCTTAA
• Translocation: AGTCCAAGGCCTTAA + GGTCCTGGAATTCAG -> AGTCCAAGGCC + GGTCCTGGAATTCAGTTAA
• Duplication: AGTCCAAGGCC -> AGTCCAAGGCCAGTCCAAGGCC
• Recombination (figure labels: AAGG, AGGC): AGTCCAAGGCCTTAA -> AGTCCAAAGGCTTAA
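The sequence-level mutation types above can be reproduced with simple string operations. This is a minimal sketch, not part of the original slides: the function names and 0-based positions are illustrative assumptions, and the inversion is modelled as a reverse complement of the inverted segment, which matches the example sequence shown on the slide.

```python
# Illustrative helpers (hypothetical names) that reproduce the slide's examples.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def point_mutation(seq, pos, base):
    """Replace the base at 0-based position pos."""
    return seq[:pos] + base + seq[pos + 1:]

def insertion(seq, pos, fragment):
    """Insert fragment before 0-based position pos."""
    return seq[:pos] + fragment + seq[pos:]

def deletion(seq, start, end):
    """Delete the half-open interval [start, end)."""
    return seq[:start] + seq[end:]

def inversion(seq, start, end):
    """Replace [start, end) with its reverse complement."""
    segment = seq[start:end].translate(COMPLEMENT)[::-1]
    return seq[:start] + segment + seq[end:]

def duplication(seq):
    """Tandem duplication of the whole segment."""
    return seq + seq

def translocation(donor, acceptor, start):
    """Move the tail of donor (from start onward) onto the end of acceptor."""
    return donor[:start], acceptor + donor[start:]

seq = "AGTCCAAGGCCTTAA"
print(point_mutation(seq, 3, "T"))
print(insertion(seq, 14, "CCTTA"))
print(deletion(seq, 5, 9))
print(inversion(seq, 5, 9))
print(duplication("AGTCCAAGGCC"))
print(translocation(seq, "GGTCCTGGAATTCAG", 11))
```

Note that the deletion helper returns the sequence without the gap character that the slide uses to mark the deleted bases.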
Exon Shuffling
Figure labels (before): Exon 1, Exon 2, Exon 3; Domain 1, Domain 2
Figure labels (after): Exon 1, Exon 2, Exon 3; Domain 2, Domain A
Genomic Scale Mutations
Gene 1 Gene 2 Gene 3
Genomic Scale Mutations
Gene 1 Gene 2 Gene 1a
Mutational Processes
• Generally arise as unrepaired mismatches during DNA replication
• Some repair processes introduce mutation
• Chemical processes change non-replicating DNA
• Multicellularity means that not all acquired (somatic) mutations are heritable
• Humans:
• de novo mutation rate of 1.2 x 10^-8 per nucleotide per generation
• ~70 per child
• Majority of paternal origin
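The "~70 per child" figure follows roughly from the per-nucleotide rate. The sketch below is a back-of-envelope check only: the haploid genome size of about 3.1 billion base pairs is an assumption not stated on the slide, and the function name is hypothetical.

```python
MUTATION_RATE = 1.2e-8       # per nucleotide per generation (from the slide)
HAPLOID_GENOME_BP = 3.1e9    # assumption: approximate haploid human genome size

def expected_de_novo(rate=MUTATION_RATE, haploid_bp=HAPLOID_GENOME_BP):
    """Expected new mutations per child across both inherited genome copies."""
    return rate * 2 * haploid_bp

print(round(expected_de_novo(), 1))
```

Under these assumptions the estimate lands in the low-to-mid 70s, consistent with the slide's rounded value.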
SELECTION AND DRIFT
Polymorphisms and Populations, Oh My!
Mutations, Polymorphisms, Substitutions
• Mutations: Appear in individuals within a population
• Sometimes used in human genetics specifically to describe pathogenic or disease-causing variation
• Polymorphism: An unfixed mutation of varying frequency
within a population
• In human genetics generally used to describe functionally
neutral/benign variation. Often must have a frequency of >5%
• Substitution: A fixed mutation. All individuals within a
population have the mutation
• Most often used when comparing two or more species
Selection and Drift
• Fitness
• Measured as the number of offspring that survive to reproduce themselves
• Positive Selection
• Rare
• Mutation confers some fitness advantage
• Negative Selection
• Frequent
• Mutation confers a fitness disadvantage
• Neutral
• Mutation has little to no impact on fitness
• Most frequent
Nearly Neutral Theory
Genetic Drift in Action
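Drift can be illustrated with a small simulation. The following is a minimal sketch of a Wright-Fisher style model, which is a standard textbook model swapped in here for illustration and not necessarily what the original figure showed. The population size, starting frequency, and function name are all assumptions.

```python
import random

def wright_fisher(pop_size, start_freq, generations, seed=0):
    """Track one allele's frequency under drift alone (no selection, no mutation)."""
    rng = random.Random(seed)
    n_alleles = 2 * pop_size          # diploid population
    freq = start_freq
    trajectory = [freq]
    for _generation in range(generations):
        # Each allele copy in the next generation is drawn at random from the current pool.
        count = sum(1 for _copy in range(n_alleles) if rng.random() < freq)
        freq = count / n_alleles
        trajectory.append(freq)
    return trajectory

# Smaller populations drift faster toward loss (0.0) or fixation (1.0).
for size in (10, 100, 1000):
    print(size, wright_fisher(size, 0.5, 50, seed=1)[-1])
```

Because no selection term is included, any change in frequency here is due to sampling alone, which is the point of the neutral case described above.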
Examples of Positive Selection
• MHC Genes
• Balancing selection: favours diversity at loci
• Many genes involved in metabolism and digestion
• Accelerated evolution over last ~10,000 years
• Adaptation to Agriculture
• Human adaptations to high altitude
• EPAS1, PPARA, EGLN1 (Tibetans)
• CBARA1, VAV3, ARNT2, THRB (Ethiopian Highlanders)
• EGLN1 (Andean Peruvians)
Mutation at the Codon Level
Synonymous (Silent) Mutation: Codon still codes for the same amino acid
Non-Synonymous Mutation: Codon now codes for a different amino acid (missense), a premature stop codon (nonsense), or alters a start codon
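The codon-level categories above can be checked mechanically against the standard genetic code. This is a minimal sketch: the function name, the `is_start_codon` flag, and the "start-loss" label are illustrative assumptions, and only the standard nuclear code is covered.

```python
# Standard genetic code, bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def classify_codon_change(ref, alt, is_start_codon=False):
    """Classify a single-codon change using the slide's categories."""
    if is_start_codon and ref == "ATG" and alt != "ATG":
        return "start-loss"
    ref_aa, alt_aa = CODON_TABLE[ref], CODON_TABLE[alt]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    # Loss of a stop codon is not treated separately in this sketch.
    return "missense"

print(classify_codon_change("CTT", "CTC"))   # Leu -> Leu
print(classify_codon_change("GAA", "GTA"))   # Glu -> Val
print(classify_codon_change("TGG", "TGA"))   # Trp -> stop
```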
PROTEIN FUNCTION AND STRUCTURE
Impacts on Evolution
Evolutionary Rates and Constraints
• Evolution is only partially random
• Mutations (quasi-random, non-uniform distribution of possibilities)
• Drift (Random)
• Selection (Non-random)
• Evolutionary rate at the protein level is the number of fixed amino acid substitutions over evolutionary time
• Measured by comparing sequences between two or more species
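A rough version of this rate can be computed from two aligned protein sequences. The sketch below assumes a gap-free pairwise alignment and uses the Poisson correction for multiple substitutions at the same site, a common simple model that the slides do not name. The sequences and the divergence time are made up for illustration.

```python
import math

def p_distance(seq_a, seq_b):
    """Proportion of aligned sites that differ (assumes equal length, no gaps)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)

def poisson_distance(p):
    """Estimated substitutions per site, correcting for repeated hits."""
    return -math.log(1.0 - p)

def substitution_rate(distance, divergence_time_years):
    """Substitutions per site per year; both lineages have been evolving, hence 2T."""
    return distance / (2.0 * divergence_time_years)

# Hypothetical aligned fragments from two species and a made-up divergence time.
p = p_distance("MKTAYIAK", "MKTAYIAR")
d = poisson_distance(p)
print(p, d, substitution_rate(d, 100e6))
```

Comparing this value across proteins, or across sites within one protein, gives the relative rates discussed in the following slides.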
Evolutionary Rates and Constraints
• Different proteins have different overall rates of evolution
• Functional necessity
• Structural necessity
• Number of protein-protein interactions
• Different regions within a protein have different rates of
evolution
• Functional constraint
• Structural constraint
Evolutionary Rates and Constraints
All Eukaryotes site rates (63 taxa) mapped on Lobster Enolase
Low rates: blue; high rates: red
Site rate categories 1 and 2 (slowest sites)
Site rate categories 3 and 4
Site rate categories 5 and 6
Site rate categories 7 and 8 (fastest sites)
Evolutionary Rate: Structure/Function Relationship
• Rates are generally slowest near the centre of the protein and fastest on the exterior
• Distance to catalytic centre
• Hydrophobic packing of the interior
• Spatial/size constraints in interior
• More loops and alpha-helices on exterior
• How does this change for structural proteins like tubulin or actin?
PRACTICAL APPLICATIONS
Identifying Disease Causing Genes
• Lynch Syndrome
• Autosomal dominant cancer syndrome
• Defective mismatch repair
• Increased risk of many cancers, particularly colorectal
Identifying the Gene using Evolutionary Reasoning
• Inactivation of genes known to be involved in mismatch repair in E. coli and yeast leads to a ‘mutator’ phenotype
• Microsatellite instability observed
• Searched for homologous genes in humans based on microsatellite instability
• Identified MLH1 and MSH2
• Sequenced genes in Lynch syndrome patients and identified mutations
Identifying Likely Pathogenic Mutations
• Needle in a stack of needles (Exome and Genome Sequencing)
• Each individual human carries ~70 new mutations
• Can be hundreds to thousands of shared variants between small numbers of individuals in a family
Evolutionary Profile of Pathogenic Mutations
• Highly conserved amino acids more likely to be functionally important
• Highly conserved genes more likely to be indispensable
• Conservation alone can be misleading
• Factor in evolutionary history and relatedness of species being compared
• Best tools use many sources of information and high-level machine learning
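A simple per-site conservation score shows both the idea and its limits. This sketch scores each alignment column by Shannon entropy, scaled so that 1.0 means fully conserved. The toy alignment and function name are assumptions, gaps are not handled, and the score ignores how closely related the sequences are, which is exactly the caveat raised above.

```python
import math
from collections import Counter

MAX_ENTROPY = math.log2(20)   # 20 standard amino acids

def column_conservation(alignment):
    """Return one score per column: 1.0 = fully conserved, lower = more variable."""
    scores = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column)
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        scores.append(1.0 - entropy / MAX_ENTROPY)
    return scores

# Hypothetical gap-free alignment of three homologous fragments.
alignment = ["MKLV", "MKIV", "MRLV"]
for position, score in enumerate(column_conservation(alignment), start=1):
    print(position, round(score, 3))
```

A variant at a column scoring near 1.0 would be a stronger candidate than one at a variable column, but closely related species inflate conservation, so production tools weight sequences by phylogeny and combine several lines of evidence.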
Exome Sequencing for Disease: Gastric Cancer
o Older age of diagnosis
o Often diagnosed at later stages as symptoms similar to many common diseases
o 3rd leading cause of cancer death worldwide: 730,000 deaths per year
o 90% of cases are sporadic
o Most cases of familial clustering due to shared environmental factors
o 60% of hereditary cases caused by mutations in the gene CDH1
Genomic Regions | Number of Exomes | Number of Variants <5% Allele Frequency in Regions of Interest | Number With Medium or High Impact
All Affected    | 3                | 14                                                             | 0
Siblings Only   | 2                | 9550                                                           | 525

Filtering steps (figure): All Variants in Exome -> Variants in Shared Regions -> Variant Frequency in Population -> Variant Impact -> Candidates
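The filtering funnel above can be sketched as a chain of list filters. This is an illustration under stated assumptions, not the pipeline used in the study: the `Variant` fields, the impact labels, and the gene names are hypothetical, and only the <5% allele frequency threshold and the medium/high impact criterion come from the slide.

```python
from dataclasses import dataclass

@dataclass
class Variant:
    gene: str                # hypothetical gene label
    in_shared_region: bool   # lies in a region shared by the affected relatives
    allele_frequency: float  # population allele frequency, 0.0 to 1.0
    impact: str              # "LOW", "MEDIUM", or "HIGH"

def filter_candidates(variants, max_af=0.05, impacts=("MEDIUM", "HIGH")):
    """Apply the slide's steps in order: shared region, rare, then impactful."""
    shared = [v for v in variants if v.in_shared_region]
    rare = [v for v in shared if v.allele_frequency < max_af]
    return [v for v in rare if v.impact in impacts]

# Made-up example variants.
variants = [
    Variant("GENE_A", True, 0.001, "HIGH"),
    Variant("GENE_B", True, 0.200, "HIGH"),    # too common
    Variant("GENE_C", False, 0.001, "HIGH"),   # not in a shared region
    Variant("GENE_D", True, 0.010, "LOW"),     # low predicted impact
]
print([v.gene for v in filter_candidates(variants)])
```

Ordering the cheap filters first keeps each later step working on a smaller list, which matters when the starting set is a whole exome.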
MAP3K6
Figure labels: Protein Kinase domain (ATP Binding, Proton Acceptor); Coiled-Coil
Variants shown: D200Y, V207G, H506Y*, F849Sfs*142, P946L, P958T
Functional Divergence
• Duplicated genes (paralogs)
• Can diverge in function as well as sequence
Gene 1 Gene 2 Gene 1a
Types of Functional Divergence
• Subfunctionalization
• Specialize and retain only a subset of ancestral function
• Neofunctionalization
• Gain a new function, lose the ancestral one
• Subneofunctionalization
• Specialize and elaborate
Functional Divergence and Protein Families
Detecting Functional Divergence
Glyceraldehyde-3-Phosphate Dehydrogenase
Cytosol: Glycolysis
Glyceraldehyde-3-Phosphate + NAD+ + Pi <-> 1,3-Bisphosphoglycerate + NADH + H+
Plastid: Calvin Cycle
Glyceraldehyde-3-Phosphate + NADP+ + Pi <-> 1,3-Bisphosphoglycerate + NADPH + H+
GAPDH Structure
Divergent and Convergent Evolution in GAPDH
• Many sites predicted to be functionally divergent
• 69 in the green group (GapA/B)
• 26 in GapC1
• 20 in both GapC1 and GapA/B
GAPDH Functional Residues

Bioc4700 2014 Guest Lecture
