Rates, Clocks and Coalescents: Presentation Transcript
Polly R. Walker, D.Phil Student, Dept of Zoology, University of Oxford
Molecular Clocks and HIV-1
Summary of Talk
Measurably Evolving Populations (MEPs)
Methods for measuring evolution
Application of the molecular clock
Estimating divergence times
Population dynamics using coalescent theory
Demonstration: HIV-1 in South Africa.
The Molecular Clock
Gene sequences accumulate substitutions at a constant rate, therefore we can use gene sequences to time divergences. This is referred to as a ‘Molecular Clock’.
• The idea of a molecular clock was initially suggested by Zuckerkandl and Pauling in 1962. They noted that rates of amino acid replacements in animal haemoglobins were roughly proportional to real time, as judged against the fossil record.
• The “constancy” of the molecular clock is particularly striking when compared to the obvious variation in the rates of morphological evolution (e.g. the existence of “living fossils”).
There is No “Universal” Molecular Clock
Sources of variation in the Clock:
Mutation rates are variable through time
- different generation times of organism
- different metabolic rates
- different genomic systems, e.g. repair mechanisms
- different genes, regions or sites in a molecule
(together referred to as lineage effects - a neutralist explanation)
The existence of “nearly” neutral mutations and fluctuations in population size (the nearly neutral theory).
Natural selection - species adapt to variable environments.
The molecular clock can vary over time
- how constant is the environment?
- how neutral is evolution?
Average Rates of Nucleotide Substitution in Different Organisms

Organism/Genome                              Substitution Rate (per site, per year)
Plant chloroplast DNA                        ~1 x 10^-9
Mammalian nuclear DNA                        3.5 x 10^-9
Plant nuclear DNA                            ~5 x 10^-9
E. coli and Salmonella enterica bacteria     ~5 x 10^-9
Drosophila nuclear DNA                       1.5 x 10^-8
Mammalian mitochondrial DNA                  5.7 x 10^-8
HIV-1                                        6.6 x 10^-3
Constant Molecular Clocks are Difficult to Obtain Under Natural Selection
• The rate of substitution of mutations with selective advantage depends on:
i. effective population size ($N_e$)
ii. degree of selective advantage ($s$)
iii. mutation rate ($\mu$)
$k = 4 N_e s \mu$
How true is THAT for HIV?
• For natural selection to produce a molecular clock, population sizes, selection pressures and mutation rates must all be constant over evolutionary time.
Testing the Molecular Clock
• So, is there a good molecular clock?
• There are a variety of ways to test the molecular clock:
i. The dispersion index, R(t)
ii. The relative rate test
iii. The likelihood ratio test using ML statistics
Maximum Likelihood Tests of the Molecular Clock
[Figure: two trees of the same five taxa (Human, Chimp, Gorilla, Orang-utan, Gibbon); axis labels: time, substitutions; log likelihood = -2660.61 and log likelihood = -2659.18]
• Likelihood Ratio Test: the difference in log likelihood can be tested directly. $\mathrm{LRT} = 2\,\lvert \Delta \ln L \rvert$, compared against a chi-squared distribution with df = n - 2, where n is the number of sequences (not significantly different in this case - primate mitochondrial DNA).
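As a rough illustration of the likelihood ratio test on this slide, the sketch below plugs in the two log likelihoods quoted for the primate trees. The function names `chi2_sf` and `clock_lrt` are hypothetical, and the chi-squared tail probability is computed with a small standard-library routine rather than a statistics package. The slide does not say which tree produced which value, so the statistic uses the absolute difference.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable with integer df."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    # Start from the closed form for df = 2 (even) or df = 1 (odd) ...
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    # ... then use Q(a + 1, y) = Q(a, y) + y^a e^-y / Gamma(a + 1).
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q

def clock_lrt(lnl_one, lnl_other, n_sequences):
    """Return (statistic, df, p) for the clock test with df = n - 2."""
    stat = 2.0 * abs(lnl_one - lnl_other)
    df = n_sequences - 2
    return stat, df, chi2_sf(stat, df)

# Log likelihoods quoted on the slide for the five primate sequences.
stat, df, p = clock_lrt(-2660.61, -2659.18, n_sequences=5)
print(f"LRT = {stat:.2f}, df = {df}, p = {p:.3f}")
```

With these inputs the p-value is well above 0.05, which agrees with the slide's note that the two trees are not significantly different.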
Measurably Evolving Populations
The population is heterochronously sampled, spanning hundreds or thousands of generations, and contains a significant amount of genetic variation. Hence, this typically includes either:
1. Organisms with rapid evolution and short generation times, e.g. RNA viruses
2. Organisms with a wide range of sampling dates, e.g. ancient DNA samples

Organism             Locus    Sampling Interval / y    Sequence Length    Mutation rate (site^-1 y^-1)
HIV-1                env      8.3                      660                1.0 x 10^-2
Human Influenza A    HA1      13                       987                5.7 x 10^-3
Dengue-4             E        38                       1485               7.9 x 10^-4
Adelie Penguin       HVR-1    ~6500                    326                9.3 x 10^-7
Brown Bear           HVR-1    ~59000                   195                4.3 x 10^-7
Maximum Likelihood Estimation of Viral Substitution Rates
Programme “Tip-Date” or “Rhino”
• Construct rooted maximum likelihood tree
• Optimise branch lengths under a single rate with relative tip positions consistent with isolation dates
• Test molecular clock using a likelihood ratio test
• Estimate confidence intervals
• RNA viruses often have different sampling times. Small differences can have big effects.
[Figure: dated tree labelled "substitution rate", time axis 1970, 1980, 1990, 2000]
• Coalescent theory tells us how phylogenies of sampled populations are affected by changes in population size and structure (demography).
• The descent of lineages is traced backwards in time, to the point when they share common ancestral alleles. The number of lineages is reduced at each coalescent event (creating nodes on the tree).
• The probability that two sequences share a common ancestor in the previous generation (i.e. that a coalescent event occurs) is $1/2N$. Not coalescing for $G - 1$ generations and then coalescing has probability $(1/2N)(1 - 1/2N)^{G-1}$, so for large $N$ the probability that any two sequences shared a common ancestor $G$ generations ago is approximately: $f(G) = \frac{1}{2N}\, e^{-(G-1)/2N}$
Therefore the probability that sequences sampled randomly from a population share a common ancestor is dependent on population size.
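The dependence on population size can be checked numerically. The sketch below compares the exact generation-by-generation probability with the exponential form f(G) given above; the function names and the value N = 1000 are illustrative assumptions, not taken from the talk.

```python
import math

def coalescence_prob_exact(G, N):
    """P(two lineages first coalesce exactly G generations ago), discrete form."""
    p = 1.0 / (2 * N)
    return p * (1.0 - p) ** (G - 1)

def coalescence_prob_approx(G, N):
    """Exponential approximation f(G) = (1/2N) exp(-(G-1)/2N)."""
    return (1.0 / (2 * N)) * math.exp(-(G - 1) / (2 * N))

N = 1000  # hypothetical population size
for G in (1, 100, 1000, 5000):
    exact = coalescence_prob_exact(G, N)
    approx = coalescence_prob_approx(G, N)
    print(f"G = {G:5d}  exact = {exact:.6e}  approx = {approx:.6e}")
```

The two forms agree closely whenever 2N is large, and both shrink as N grows, which is why coalescence is slower in big populations.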
[Diagram labels: Phylogeny, Coalescent Theory, Demographic History]
• Changes in population size affect the distribution of coalescent times (i.e. when in time branching events occur).
• In a constant-sized population more coalescent events occur near the tips than near the root, but in a growing population coalescent events occur more towards the root, because the population was smaller then and coalescent events were therefore more likely (i.e. drift is more powerful in small populations).
[Figure: The Coalescent. Panels: Constant (“endemic”) and Growing (“epidemic”); labels: Big N, Small N]
• It is therefore possible to distinguish continually large populations from those that have only recently grown in size.
[Figure: trees under slow growth and rapid growth; axis: time; labels: small population, large population]
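The contrast between constant and growing populations can be sketched with a small simulation. This is a minimal illustration under stated assumptions, not the method used in the talk: the function names, the sample size of 20, the present-day size of 20000 gene copies and the growth rate of 0.01 per generation are all hypothetical. Waiting times are drawn for a constant population and then rescaled for exponential growth by solving for the time at which the same amount of "coalescent opportunity" has accumulated.

```python
import math
import random

def simulate_coalescent_times(n, two_n0, r=0.0, rng=random):
    """Times (generations before present) of the n - 1 coalescent events.

    two_n0: number of gene copies at present (the 2N of the slides).
    r: exponential growth rate per generation, forward in time (0 = constant).
    """
    t = 0.0
    times = []
    for k in range(n, 1, -1):
        pairs = k * (k - 1) / 2.0
        # Waiting time in units of the present-day population size.
        x = rng.expovariate(1.0) / pairs
        if r == 0.0:
            t += x * two_n0
        else:
            # Backwards in time the population shrinks as two_n0 * exp(-r t),
            # so the same amount of drift accumulates in fewer generations.
            t = math.log(math.exp(r * t) + r * two_n0 * x) / r
        times.append(t)
    return times

def mean_relative_depth(times):
    """Average position of coalescent events: 0 = tips, 1 = root."""
    return sum(t / times[-1] for t in times) / len(times)

rng = random.Random(1)
const = [mean_relative_depth(simulate_coalescent_times(20, 20000, 0.0, rng))
         for _ in range(200)]
growth = [mean_relative_depth(simulate_coalescent_times(20, 20000, 0.01, rng))
          for _ in range(200)]
const_avg = sum(const) / len(const)
growth_avg = sum(growth) / len(growth)
print(f"constant size:  mean relative depth = {const_avg:.2f}")
print(f"growing:        mean relative depth = {growth_avg:.2f}")
```

In runs of this sketch the growing population places its coalescent events closer to the root on average, which is the star-like "epidemic" shape described above.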
Models of Demographic History
• Constant size (endemic) population:
- 1 parameter, population size ($N$)
• Exponentially growing (epidemic) population:
- 2 parameters, current population size ($N_0$) and rate of growth ($r$)
• More complex models:
- logistic (growth slows down toward the present)
- expansion (sudden change in growth rate)
• Estimate all parameters (e.g. $N_0$, $r$) from tree structure
• Can compare these nested models using the likelihood ratio test
Assumptions of the Model
A) Lineages coalesce independently
B) No more than one coalescent event can occur in a single generation
C) The time-scale is so large that it can be represented as continuous
• Works best for neutral mutations subject to genetic drift in non-recombining populations, i.e. in this case any change in the structure of the genealogy must be due to demographic processes rather than fitness differences (i.e. fit alleles produce more branches).
Estimating Demographic History of HIV-1 Subtype C
Step 1 Sequence selection
Large range of dates, e.g. 1989-2001
Monophyletic (to comparison group, e.g. subtype B)
Length of sequences available: optimise length against sample size.
Step 1. Sequence Alignment
Use Clustal AND manual alignment, e.g. Se-Al version
Remove all incomplete codons and any codons flagged (*, ?), and check that the alignment is in the correct reading frame.
[Figure: sequences are out-of-frame]
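The codon clean-up step could be sketched as below. This is not how Se-Al works internally; it is a hypothetical helper (`strip_bad_codons`) that assumes the alignment is a dictionary of equal-length nucleotide strings already in frame from position 0, treats any codon containing a character other than A, C, G or T as incomplete (shown as ? when translated), and also drops stop codons (shown as * when translated).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def strip_bad_codons(alignment):
    """Drop every codon column in which any sequence is incomplete or a stop."""
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("sequences must be aligned to the same length")
    usable = length - length % 3  # ignore a trailing partial codon
    cleaned = {name: [] for name in names}
    for start in range(0, usable, 3):
        codons = [alignment[name][start:start + 3].upper() for name in names]
        keep = all(set(c) <= set("ACGT") and c not in STOP_CODONS
                   for c in codons)
        if keep:
            for name, codon in zip(names, codons):
                cleaned[name].append(codon)
    return {name: "".join(parts) for name, parts in cleaned.items()}

# Hypothetical two-sequence alignment: codon 2 has a gap, codon 4 has a stop.
example = {"seqA": "ATGAA-TGGTAA", "seqB": "ATGAAATGGCCC"}
print(strip_bad_codons(example))
```

Removing whole codon columns, rather than single sites, keeps the remaining alignment in frame.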
Step 2. ML tree construction
Make a Neighbour Joining tree, check this tree and remove identical / almost identical sequences
Estimate all parameters under a realistic evolutionary model, e.g. GTR + gamma, and derive the best ML tree.
Rooting the tree: e.g. outgroup rooting.
Add in a distantly related sequence, like another subtype.
Subtype B is the most distantly related sequence. The closest sequence/s to the root of the tree is defined as the outgroup. Return to your original tree and use this sequence to root the tree (under rooting options).
Step 3 Tip-dating the Tree
Prepare correct input format: must have sequence file in nexus format, rooted tree file, and tip dating information
Use the same evolutionary model here as you used to generate the tree (get commands from the manual).
Estimate the rate of evolution (absrate) and confidence intervals (interval tree) using bootstrapping.
Begin RHINO;
    NUCMODEL TYPE=GTR;
    TREEMODEL TYPE=TIPDATES;
    SITEMODEL TYPE=GAMMA;
    OPTIMIZE;
    STATUS param;
    interval tree:absrate;
End;
Carry out the likelihood ratio test: is it significant?
Rhino Version 1.2
http//evolve/ox.ac.uk
Macintosh version - Runs on MacOS9 and MacOSX
UNIX/Linux version - could be compiled for Windows
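The likelihood ratio test tells us whether we are justified in assuming a molecular clock. If the clock holds, the difference in log likelihood between the clock model and the variable-rate (VR) model is not significant.
$\mathrm{LRT} = 2\,\lvert \ln L(\mathrm{VR}) - \ln L(\mathrm{clock}) \rvert$, compared against a chi-squared distribution with
df = n - 2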
This is a very strict measure of a molecular clock. Also look at root-to-tip regression lines.
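A root-to-tip regression can be sketched in a few lines. This is an ordinary least-squares fit, not the maximum likelihood procedure used by Rhino: the slope estimates the substitution rate and the x-intercept estimates the date of the root. The function name and the sampling dates and distances below are hypothetical, chosen only to show the calculation.

```python
def root_to_tip_regression(dates, distances):
    """Least-squares fit of root-to-tip distance against sampling date.

    Returns (rate, root_date): the slope in substitutions per site per year
    and the date at which the fitted line crosses zero distance.
    """
    n = len(dates)
    mean_x = sum(dates) / n
    mean_y = sum(distances) / n
    sxx = sum((x - mean_x) ** 2 for x in dates)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dates, distances))
    rate = sxy / sxx
    intercept = mean_y - rate * mean_x
    return rate, -intercept / rate

# Hypothetical tips: sampling year and root-to-tip distance (subs per site).
dates = [1989, 1992, 1995, 1998, 2001]
distances = [0.118, 0.126, 0.141, 0.150, 0.166]
rate, root_date = root_to_tip_regression(dates, distances)
print(f"rate = {rate:.4f} subs/site/year, root date = {root_date:.0f}")
```

A tight, positive linear trend supports clock-like behaviour; wide scatter around the line suggests rate variation among lineages.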
Using The Clock: 1. Timing the origin of the epidemic
$\mathrm{TMRCA} = \dfrac{\text{tree node height}}{\text{substitution rate}} = \text{years since MRCA}$
• No significant difference between the timing of the two subtypes.
• Subtype C has a slightly lower point estimate for the rate but broader CIs.
• The rates can be applied to other data sets, provided they cover the same gene region.
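The TMRCA calculation is a single division, shown here as a small sketch. The function names and all numbers are hypothetical and are not results from the talk. The node height is assumed to be measured in substitutions per site back from the most recently sampled tip.

```python
def tmrca_years(node_height, substitution_rate):
    """Years since the MRCA = node height / substitution rate."""
    if substitution_rate <= 0:
        raise ValueError("substitution rate must be positive")
    return node_height / substitution_rate

def origin_year(most_recent_sample_year, node_height, substitution_rate):
    """Calendar year of the MRCA, counted back from the latest sample."""
    return most_recent_sample_year - tmrca_years(node_height, substitution_rate)

# Illustrative values only: height 0.1 subs/site, rate 0.005 subs/site/year.
print(tmrca_years(0.1, 0.005))
print(origin_year(2001, 0.1, 0.005))
```

Because the rate sits in the denominator, the confidence interval on the rate carries straight through to the date, so broad CIs on the rate give broad CIs on the origin.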
1. Determine the maximum likelihood population growth model.
2. Estimate the parameter rho under the best-fit growth model.
3. Scale skyline plots according to the substitution rate.
4. Estimate parameter R, which is the growth rate in units of year^-1, or rho/
5. Estimate the doubling time: Doubling time (years) = $\dfrac{\ln(2)}{R}$
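The doubling-time formula follows from exponential growth: a population growing as exp(R t) doubles when R t = ln 2. A minimal sketch, with a hypothetical function name and illustrative growth rates that are not estimates from the talk:

```python
import math

def doubling_time_years(growth_rate_per_year):
    """Doubling time in years for exponential growth at rate R per year."""
    if growth_rate_per_year <= 0:
        raise ValueError("growth rate must be positive for a doubling time")
    return math.log(2) / growth_rate_per_year

# Illustrative growth rates only.
for R in (0.1, 0.5, 1.0):
    print(f"R = {R:.1f} per year -> doubling time = "
          f"{doubling_time_years(R):.2f} years")
```

A lower R gives a longer doubling time, which is one way to express a slower exponential phase.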
Within subtype C confidence intervals overlap. Subtype B and C show different demographic histories.
Subtype C has a slower exponential phase than subtype B
Subtype C on a global level is showing a logistic trend, not yet significant, but in Africa it is still exponentially growing.
Potential Applications: Comparing growth rates within different groups, e.g. risk group, HLA type, or the spread of different clades. Detecting decreases in epidemic growth rate.
Conclusions
Molecular clocks can be used to:
a.) time the origin of an epidemic
b.) determine population dynamics
• Your estimates are only as good as your clock.
• HIV is subject to variable rates of evolution among branches: new models that allow for this (relaxed clocks) are needed.