Gene sequences accumulate substitutions at a roughly constant rate, so gene sequences can be used to time divergences. This is referred to as a ‘Molecular Clock’.
• The idea of a molecular clock was initially suggested by Zuckerkandl and Pauling in 1962. They noted that rates of amino acid replacements in animal haemoglobins were roughly proportional to real time, as judged against the fossil record.
• The “constancy” of the molecular clock is particularly striking when compared to the obvious variation in the rates of morphological evolution (e.g. the existence of “living fossils”).
- different genomic systems, e.g. repair mechanisms
- different regions, genes or sites in a molecule
(together referred to as lineage effects - a neutralist explanation)
The existence of “nearly” neutral mutations and fluctuations in population size (the nearly neutral theory).
Natural selection - species adapt to variable environments.
The molecular clock can vary over time
- how constant is the environment?
- how neutral is evolution?
Average Rates of Nucleotide Substitution in Different Organisms

Organism/Genome                               Substitution rate (per site, per year)
Plant chloroplast DNA                         ~1 x 10^-9
Mammalian nuclear DNA                         3.5 x 10^-9
Plant nuclear DNA                             ~5 x 10^-9
E. coli and Salmonella enterica bacteria      ~5 x 10^-9
Drosophila nuclear DNA                        1.5 x 10^-8
Mammalian mitochondrial DNA                   5.7 x 10^-8
HIV-1                                         6.6 x 10^-3
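As a rough illustration of how one of these rates can be used to time a divergence, the sketch below assumes a strict clock and no multiple substitutions at the same site, so that two lineages separated for t years differ by d = 2rt substitutions per site. The function name and the pairwise distance are hypothetical; only the rate is taken from the list above.

```python
def divergence_time_years(distance_per_site, rate_per_site_per_year):
    """Years since two lineages split, assuming a strict molecular clock.

    Both lineages accumulate substitutions independently, so the pairwise
    distance d grows at twice the per-lineage rate: d = 2 * r * t.
    """
    if rate_per_site_per_year <= 0:
        raise ValueError("rate must be positive")
    return distance_per_site / (2.0 * rate_per_site_per_year)


# Hypothetical pairwise distance (0.07 substitutions per site) combined
# with the mammalian nuclear DNA rate quoted above (3.5e-9).
t = divergence_time_years(0.07, 3.5e-9)
print(f"{t:.3g} years")
```

The factor of 2 matters: forgetting it doubles the estimated age of the split.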
Constant Molecular Clocks are Difficult to Obtain Under Natural Selection
• The rate of substitution (k) of mutations with a selective advantage depends on:
i. effective population size (Ne)
ii. degree of selective advantage (s)
iii. mutation rate (μ)
k = 4 Ne s μ
How true is THAT for HIV?
• For natural selection to produce a molecular clock, population sizes, selection pressures and mutation rates must all be constant over evolutionary time.
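A minimal sketch of why this makes a clock fragile: because k is a product of three quantities, a change in any one of them changes the rate proportionally. All numerical values below are hypothetical and chosen only to show the effect of a tenfold drop in effective population size.

```python
def adaptive_substitution_rate(n_e, s, mu):
    """Substitution rate of advantageous mutations, k = 4 * Ne * s * mu."""
    return 4.0 * n_e * s * mu


# Hypothetical values: same selection and mutation, different Ne.
baseline = adaptive_substitution_rate(n_e=1e4, s=0.01, mu=1e-9)
bottleneck = adaptive_substitution_rate(n_e=1e3, s=0.01, mu=1e-9)

print(f"baseline k   = {baseline:.2e}")
print(f"bottleneck k = {bottleneck:.2e}")
print(f"ratio        = {baseline / bottleneck:.1f}")
```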
Testing the Molecular Clock
• So, is there a good molecular clock?
• There are a variety of ways to test the molecular clock:
i. The dispersion index, R(t)
ii. The relative rate test
iii. The likelihood ratio test using ML statistics
Maximum Likelihood Tests of the Molecular Clock
[Figure: two trees of Human, Chimp, Gorilla, Orang-utan and Gibbon, with axes labelled “time” and “substitutions”; log Likelihood = -2660.61 and log Likelihood = -2659.18]
• Likelihood Ratio Test: the difference in log likelihood can be tested directly
LRT = Chidist(2 x ABS(difference in lnL), df), with df = n - 2
(not significantly different in this case - primate mitochondrial DNA)
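The test can be sketched with the two log likelihoods quoted here. This is an illustration only: it takes the lower value (-2660.61) to be the clock model, since a constrained model cannot fit better than the unconstrained one, and it takes n to be the number of sequences (5 primates, so df = 3). The helper `chi2_sf` is a hand-written stand-in for a spreadsheet CHIDIST function and works for integer degrees of freedom only.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable (integer df >= 1)."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    # Start from the closed form for df = 2 (even) or df = 1 (odd) ...
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))
    # ... then step up with Q(a + 1, h) = Q(a, h) + h^a e^-h / Gamma(a + 1).
    while a < df / 2.0:
        q += h ** a * math.exp(-h) / math.gamma(a + 1.0)
        a += 1.0
    return q


def clock_lrt(lnl_clock, lnl_no_clock, n_taxa):
    """Return (test statistic, degrees of freedom, p-value) for the clock test."""
    stat = 2.0 * abs(lnl_no_clock - lnl_clock)
    df = n_taxa - 2
    return stat, df, chi2_sf(stat, df)


stat, df, p = clock_lrt(-2660.61, -2659.18, n_taxa=5)
print(f"LRT statistic = {stat:.2f}, df = {df}, p = {p:.2f}")
```

A p-value well above 0.05 means the clock cannot be rejected, which matches the "not significantly different" verdict for these data.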
Measurably Evolving Populations
The population is heterochronously sampled, spanning hundreds or thousands of generations, and contains a significant amount of genetic variation. Hence, this typically includes either:
1. Organisms with rapid evolution and a short generation time, e.g. RNA viruses
2. Organisms with a wide range of sampling dates, e.g. ancient DNA samples

Organism             Locus    Sampling interval (years)    Sequence length    Mutation rate (per site, per year)
HIV-1                env      8.3                          660                1.0 x 10^-2
Human Influenza A    HA1      13                           987                5.7 x 10^-3
Dengue-4             E        38                           1485               7.9 x 10^-4
Adelie Penguin       HVR-1    ~6500                        326                9.3 x 10^-7
Brown Bear           HVR-1    ~59000                       195                4.3 x 10^-7
Maximum Likelihood Estimation of Viral Substitution Rates
Programme: “Tip-Date” or “Rhino”
• Construct a rooted maximum likelihood tree
• Optimise branch lengths under a single rate, with relative tip positions consistent with isolation dates
• Test the molecular clock using a likelihood ratio test
• Estimate confidence intervals
• RNA viruses often have different sampling times. Small differences can have big effects.
[Figure: time axis 1970 to 2000, labelled “substitution rate”]
• Coalescent theory tells us how phylogenies of sampled populations are affected by changes in population size and structure (demography).
• The descent of lineages is traced backwards in time, to the point when they share common ancestral alleles. The number of lineages is reduced at each coalescent event (creating nodes on the tree).
• The probability that two sequences share a common ancestor in the previous generation (i.e. that a coalescent event occurs) is 1/(2N). Therefore the probability that any two sequences shared a common ancestor G generations ago is approximately: f(G) = (1/(2N)) e^(-(G-1)/(2N))
Therefore the probability that sequences sampled randomly from a population share a common ancestor is dependent on population size.
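The formula can be checked numerically. The sketch below compares the exact geometric probability (no coalescence for G-1 generations, then coalescence) with the exponential form quoted above, and reports the mean waiting time of 2N generations. Function names and the population sizes are illustrative assumptions.

```python
import math


def p_coalesce_exact(g, n):
    """Probability that two lineages first coalesce exactly g generations ago."""
    p = 1.0 / (2.0 * n)
    return p * (1.0 - p) ** (g - 1)


def p_coalesce_approx(g, n):
    """Exponential approximation: (1/(2N)) * exp(-(G - 1) / (2N))."""
    return math.exp(-(g - 1) / (2.0 * n)) / (2.0 * n)


def mean_coalescence_time(n):
    """Mean of the geometric waiting time, in generations."""
    return 2.0 * n


for n in (100, 10_000):  # hypothetical population sizes
    exact = p_coalesce_exact(50, n)
    approx = p_coalesce_approx(50, n)
    print(n, f"{exact:.3e}", f"{approx:.3e}", mean_coalescence_time(n))
```

The larger the population, the longer two randomly sampled sequences are expected to take to find their common ancestor.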
[Diagram: Phylogeny, Coalescent Theory, Demographic History]
• Changes in population size affect the distribution of coalescent times (i.e. when in time branching events occur).
• In a constant-sized population more coalescent events occur near the tips than near the root. In a growing population coalescent events are concentrated towards the root, because the population was smaller in the past and coalescent events are more likely in a small population (i.e. drift is more powerful in small populations).
[Figure: The Coalescent. Constant (“endemic”) population, big N; growing (“epidemic”) population, small N]
• It is therefore possible to distinguish continually large populations from those that have only recently grown in size.
[Figure: slow growth versus rapid growth; time axis running from small population to large population]
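The contrast between the two tree shapes can be sketched numerically. The code below computes expected node times for a constant population and then pushes them through the standard time change for exponential growth, assuming N(t) = N0 e^(-rt) going back in time. Mapping expected values through this transformation is an illustration of the shape only, not an exact expectation, and every parameter value is hypothetical.

```python
import math


def expected_node_times_constant(n_samples, pop_size):
    """Expected cumulative coalescent times (generations before present).

    With k lineages, each of the k(k-1)/2 pairs coalesces with probability
    1/(2N) per generation, so the expected wait is 2N / (k(k-1)/2).
    """
    times, t = [], 0.0
    for k in range(n_samples, 1, -1):
        t += 2.0 * pop_size / (k * (k - 1) / 2.0)
        times.append(t)
    return times


def rescale_for_exponential_growth(times, growth_rate):
    """Map constant-size times onto a population that grew exponentially.

    Assumes N(t) = N0 * exp(-r * t) backwards in time, which gives the
    time change t = ln(1 + r * tau) / r.
    """
    return [math.log(1.0 + growth_rate * tau) / growth_rate for tau in times]


constant = expected_node_times_constant(n_samples=10, pop_size=1000)
growing = rescale_for_exponential_growth(constant, growth_rate=0.01)

for label, times in (("constant", constant), ("growing", growing)):
    relative = [t / times[-1] for t in times]  # 0 = tips, 1 = root
    print(label, [round(x, 2) for x in relative])
```

In the constant case most nodes sit close to the tips, with one long branch to the root; under growth the same nodes sit much closer to the root, giving the star-like "epidemic" shape.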
Models of Demographic History
• Constant size (endemic) population: 1 parameter, population size (N)
• Exponentially growing (epidemic) population: 2 parameters, current population size (N0) and rate of growth (r)
• More complex models:
- logistic (growth slows down toward the present)
- expansion (sudden change in growth rate)
• Estimate all parameters (e.g. N0, r) from the tree structure
• These nested models can be compared using the likelihood ratio test
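For the simplest model, estimating the parameter from the tree structure can be sketched directly. Assuming all sequences are sampled at the same time and that waiting times are exponential with rate C(k,2)/(2N) while k lineages remain, the likelihood is maximised at 2N = sum(C(k,2) * t_k) / (number of coalescent events). The function name and the interval values are hypothetical.

```python
def estimate_constant_pop_size(intervals):
    """Maximum likelihood N under the constant-size coalescent.

    `intervals` is a list of (k, t) pairs: the tree spent t generations
    with k lineages, and each interval ends in one coalescent event.
    """
    if not intervals:
        raise ValueError("need at least one coalescent interval")
    total = sum(k * (k - 1) / 2.0 * t for k, t in intervals)
    two_n = total / len(intervals)
    return two_n / 2.0


# Hypothetical intervals for 5 sequences, set to their expected lengths
# when N = 1000 (expected wait with k lineages is 2N / C(k,2)).
intervals = [(5, 200.0), (4, 2000.0 / 6), (3, 2000.0 / 3), (2, 2000.0)]
print(round(estimate_constant_pop_size(intervals)))
```

The exponential-growth model needs numerical optimisation over N0 and r, which is omitted here.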
Assumptions of the Model
A) Lineages coalesce independently
B) No more than one coalescent event can occur in a single generation
C) The time-scale is so large that it can be represented as continuous
• Works best for neutral mutations subject to genetic drift in non-recombining populations - i.e. in this case any change in the structure of the genealogy must be due to demographic processes, rather than fitness differences (where fitter alleles produce more branches).
Estimating Demographic History of HIV-1 Subtype C
Step 1 Sequence selection
Large range of dates, e.g. 1989-2001
Monophyletic (relative to a comparison group, e.g. subtype B)
Length of sequences available: optimise sequence length against sample size.
Make a Neighbour Joining tree, check this tree and remove identical / almost identical sequences
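The filtering step can also be sketched in code. This swaps the visual check of the Neighbour Joining tree for a simple pairwise p-distance threshold, which is an assumption of this example rather than the procedure described above. Sequence names and sequences are hypothetical, and the input is assumed to be an aligned set of equal-length sequences.

```python
def p_distance(seq_a, seq_b):
    """Proportion of sites that differ between two aligned sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned, equal length, non-empty")
    diffs = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return diffs / len(seq_a)


def drop_near_identical(alignment, threshold=0.0):
    """Keep only sequences more than `threshold` away from all kept ones.

    The first sequence of each near-identical group is retained.
    """
    kept = {}
    for name, seq in alignment.items():
        if all(p_distance(seq, other) > threshold for other in kept.values()):
            kept[name] = seq
    return kept


# Hypothetical toy alignment.
alignment = {
    "a": "ACGTACGTAC",
    "b": "ACGTACGTAC",  # identical to "a"
    "c": "ACGTACGTAT",  # one difference from "a"
    "d": "TTGTACGGAT",
}
print(list(drop_near_identical(alignment)))
print(list(drop_near_identical(alignment, threshold=0.1)))
```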
Estimate all parameters under a realistic evolutionary model, e.g. GTR + gamma, and derive the best ML tree.
Rooting the tree: e.g. outgroup rooting.
Add in a distantly related sequence, like another subtype.
Subtype B is the most distantly related sequence. The sequence(s) closest to the root of the tree are defined as the outgroup. Return to your original tree and use this sequence to root the tree (under rooting options).
The likelihood ratio test tells us whether we are justified in assuming a molecular clock. If a clock exists then the difference is not significant.
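LRT = Chidist(2 x ABS(lnL(VR) - lnL(clock)), df)
df = n - 2, where n is the number of sequences and lnL(VR) and lnL(clock) are the log likelihoods without and with the clock constraint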
This is a very strict measure of a molecular clock. Look at root-to-tip regression lines.
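A root-to-tip regression can be sketched as an ordinary least-squares fit of each sequence's distance from the root against its sampling date. Under a clock the slope estimates the substitution rate and the x-intercept estimates the date of the root. The data below are hypothetical, and the function is a plain illustration rather than the method of any particular programme.

```python
def root_to_tip_regression(dates, distances):
    """Least-squares fit of root-to-tip distance on sampling date.

    Returns (rate, root_date, r_squared). root_date is None when the
    slope is not positive, i.e. when there is no usable clock signal.
    """
    n = len(dates)
    if n < 2 or n != len(distances):
        raise ValueError("need matching lists with at least two points")
    mean_x = sum(dates) / n
    mean_y = sum(distances) / n
    sxx = sum((x - mean_x) ** 2 for x in dates)
    syy = sum((y - mean_y) ** 2 for y in distances)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dates, distances))
    if sxx == 0:
        raise ValueError("all sampling dates are identical")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    root_date = -intercept / slope if slope > 0 else None
    return slope, root_date, r_squared


# Hypothetical sampling years and root-to-tip distances (subs per site).
dates = [1989, 1992, 1995, 1998, 2001]
distances = [0.060, 0.069, 0.073, 0.084, 0.090]
rate, root_date, r2 = root_to_tip_regression(dates, distances)
print(f"rate = {rate:.4f} per site per year, root = {root_date:.0f}, R^2 = {r2:.2f}")
```

A tight, positive regression line suggests broadly clock-like data even when the likelihood ratio test rejects a strict clock.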
Using The Clock: 1. Timing the origin of the epidemic
TMRCA (years since MRCA) = tree node height / substitution rate
No significant difference between the timing of the two subtypes. Subtype C has a slightly lower point estimate for the rate but broader CIs.
The rates can be applied to other data sets, provided they cover the same gene region.
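The TMRCA calculation and the effect of a broad confidence interval on the rate can be sketched as follows. All numbers are hypothetical, and the interval ignores uncertainty in the node height itself, so it is narrower than a full analysis would give.

```python
def tmrca_years(node_height, substitution_rate):
    """Years since the MRCA: node height (subs/site) / rate (subs/site/year)."""
    if substitution_rate <= 0:
        raise ValueError("substitution rate must be positive")
    return node_height / substitution_rate


def tmrca_interval(node_height, rate_low, rate_high):
    """Range of TMRCA implied by lower and upper bounds on the rate.

    A faster rate gives a more recent MRCA, so the bounds swap over.
    """
    return (tmrca_years(node_height, rate_high),
            tmrca_years(node_height, rate_low))


def origin_year(latest_sample_year, node_height, substitution_rate):
    """Calendar year of the MRCA, counting back from the latest sample."""
    return latest_sample_year - tmrca_years(node_height, substitution_rate)


# Hypothetical node height and rate estimates.
print(tmrca_years(0.1, 0.0025))
print(tmrca_interval(0.1, rate_low=0.002, rate_high=0.004))
print(origin_year(2001, 0.1, 0.0025))
```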
Within subtype C confidence intervals overlap. Subtype B and C show different demographic histories.
Subtype C has a slower exponential phase than subtype B
Subtype C on a global level is showing a logistic trend, not yet significant, but in Africa it is still exponentially growing.
Potential Applications:
- Comparing growth rates within different groups, e.g. risk group, HLA type, or the spread of different clades.
- Detecting decreases in epidemic growth rate.
Conclusions
Molecular clocks can be used to:
a.) time the origin of an epidemic
b.) determine population dynamics
But:
c.) your estimates are only as good as your clock
d.) HIV is subject to variable rates of evolution among branches: new models that allow for this are needed (relaxed clocks).