Rates, Clocks and Coalescents


  • Rate is a function not just of how quickly mutations occur in a sequence, but also of how the mutations that occur become fixed in the population. Because the pressures which drive evolution vary so much between species, there is no universal clock. Variation in the tick rate of your clock depends on a number of considerations. Firstly, generation time affects how quickly mutations are able to become fixed. Secondly, genomic systems differ: single- or double-stranded genome, RNA or DNA, and what the repair mechanism is like. Thirdly, regions of the genome, genes and sites differ in how neutral their evolution is. At sites that are functionally constrained, the fixation of substitutions is far less likely than at neutral sites. Equally, sites at which diversity is favoured, such as the MHC in humans or the hypervariable regions of HIV-1, are likely to evolve much faster than their flanking regions. The rate can also vary over time, i.e. evolution can be episodic, as environmental factors may affect the mutation rate, the fixation rate or the generation time. So when thinking about the molecular clock you want for your studies, you must consider these two questions: how constant is the environment, and how neutral is evolution? In the case of HIV-1, the environment is constantly changing and the selection pressures on the virus vary greatly between and within hosts.
  • This is just to show you some examples of rates in different genes and different organisms, to give you an idea of how much faster HIV evolves compared with bacterial DNA.
  • But many organisms and many genes are subject to natural selection, and HIV is almost always under selection, due to the immune system, antiviral drugs, cellular adaptation, etc. In the presence of natural selection the molecular clock doesn’t fall down, but it does become more complicated: it now depends on several other factors, including the effective population size and the degree of selection, in addition to the mutation rate.
  • So there are ways to test for the presence of a molecular clock. Two early methods were the dispersion index and the relative rate test, but nowadays, with the introduction of maximum likelihood methods for determining rates, we use the likelihood ratio test, and this is the one I am going to tell you about.
  • The likelihood ratio test asks whether a tree built under a model that allows rates to vary among branches explains the data significantly better than a tree built under a model that assumes a constant rate of evolution along all branches. If a true molecular clock exists, the likelihoods of the two trees will not differ significantly, as in this example of primate mitochondrial DNA, so rates of evolution are roughly constant among branches. On the left is an ultrametric tree in which we have standardised the branch lengths into years; on the right is the variable rates tree, in which the horizontal axis is in substitutions.
  • RNA viruses often fall into the category of MEPs on account of their short generation times and high mutation rates; because of this, samples spanning ten years or more can give us an accurate measure of evolutionary rate. It is very important that we take account of these different sampling times, as even a small difference in sampling time can have a big effect and bias the tree. So we need to date the tips of the tree: the tip date model.
  • The coalescent is a population genetics model which describes the relationship between a population’s demographic history and the phylogeny of individuals sampled at random from that population. The phylogenies of sampled populations contain information about the effective population size and how it has changed over time. The key to thinking about the coalescent is how the number of lineages in the tree changes as you go backwards in time. As you go backwards down the tree, the number of lineages decreases each time a coalescent event occurs. The relationship between the coalescent and genetic drift is essential: the probability of fixation of a neutral mutation by genetic drift is 1/2N (N is the population size). This is also the probability that two sequences sampled at random from this population share a common ancestor in the previous generation, and we can calculate this probability for any earlier generation using the following equation. The take-home message is that the probability of a coalescent event depends mainly on Ne. I’ll come back to this in the next slide.
  • So, when looking at tree topologies, the structure of the tree is altered both by the effective population size and by the rate of growth. In a large population, or a rapidly growing one, the probability that two randomly sampled individuals are related in the previous generation is smaller than in a small population.
  • These are the basic assumptions of the model, and they are equivalent to saying that a small sample is taken from a very large population. In HIV this is usually the case. So the coalescent model can be applied to many situations; however, it is likely to work better in some than in others. For example, positive selection can cause problems for the coalescent process. It works best for neutral evolution, where coalescent events behave like a genetic drift process. Selection, however, can alter the phylogeny by introducing sweep-like effects (a fit sequence giving rise to many more branches than less fit ones). Also, recombination introduces large uncertainties into the tree, and different regions of the sequence may have different ancestries. For HIV we know that recombination is very high, but in this case it seems that the more star-like the phylogeny is, the less the impact of recombination.
  • So, I am going to carry out a demonstration of some of these methods and take you through the whole process from start to finish. First of all we are going to estimate a molecular clock for a particular gene from HIV-1, gag; then we are going to test how good our clock is; and after that we will apply the calculated substitution rate to coalescent methods and use it to calculate the growth rate of the HIV-1 subtype C epidemic. The first step in this process is to select our sequences. You can use your own sequences for population demographics, but for estimating the substitution rate we need to maximise the spread of sampling times. We also need to consider the length of our sequences and the number of samples. I suggest that if you have a decent sequence length, e.g. 1600 bases, then you only need about 39 sequences; more than this will take forever to analyse. If the sequences are shorter, use more of them. I am using subtype C gag; this is my data set.
  • Sequence alignment. I do most of my alignment using Se-Al, which is a completely manual alignment program. If you are downloading sequences from Los Alamos then some alignment is already done for you, and you just have to go through removing inserted gaps and getting it into the correct reading frame. With gag this is simple, but with e.g. env regions you may want to run it through Clustal auto-alignment first to save time, although you will still need to do some manual alignment. If you are using a Mac then Se-Al is available free from the Oxford website. It is very easy to switch between amino acids and nucleotides and to shift reading frames. Just be sure the alignment is in the correct reading frame and that there are no ? or * codons. You can tell it is out of frame if you see columns of stop codons downstream. You need to do this because such codons could cause problems reading the data files later on. This may involve removing some genetic information; just make sure you don’t remove too much. Do the same thing with all other data sets.
  • The next thing we need to do is find the best tree for this data set. After we have done the alignment in Se-Al, or another program, we export it in a format for PAUP, i.e. a nexus file. First of all we make an NJ tree, without branch swapping, so that we have a starting point from which we can begin optimising the tree. *Start demonstration* Before we go any further we can use this tree to make sure there are no identical or very closely related sequences which may have come from the same patient, as these will skew the Genie result. Execute the file; under Analysis choose heuristic search; under starting tree options, get by NJ; for branch swapping choose NNI. Nearest neighbour interchange is not the best branch swapping algorithm, but it will improve our tree considerably. Now we want to improve this tree by using a more realistic evolutionary model. I have found that the general time reversible model with the gamma rate parameter gives the best likelihoods for HIV. So we use the sequence data we have to estimate the parameter values under this model. Change the settings to likelihood, then open the likelihood settings. Set the model to GTR and estimate the rate matrix, base frequencies, the proportion of invariable sites and the gamma shape parameter. Then get the tree scores and use all these parameters to derive the tree, with TBR branch swapping. If possible run this on a Unix machine or an unused machine, because it takes a long time. To root the tree, add in an outgroup, e.g. another subtype. Make an NJ tree and do NNI swapping. The sequence closest to the outgroup root is the root of the tree. Return to your original tree and select this sequence under outgroup rooting.
  • Now that we have the best tree we can tip date it (i.e. account for different sampling times) and then use it to estimate the rate at which substitutions occur along the tree. This is done using another Oxford programme called Rhino. Rhino is based on a previous program called TipDate, and hence is referred to as TipDate in the literature so far. For Rhino we use one input file which contains all the information the programme needs to calculate a rate of evolution and tip date the branches of the tree. *Open CgagSR in BBEdit, then point out the formats: the nexus data and PAUP tree, then the dates and the instructions script at the bottom of the file. Open Rhino and show load and run. Then open the output file.* The rate estimated for gag is 0.0015, which is about what we expected. The likelihood ratio test can be done in Excel. Basically, take the ML value for this tip-dated tree and the ML value of the variable rates tree (from the PAUP log file).
  • This type of analysis for subtypes B and C gives the above results. You can see that the estimated rates are not significantly different: the CIs overlap. However, when we do the likelihood ratio test we find that the variable rates tree gives a better likelihood than the tree in which we have enforced a molecular clock, so there is some rate variation among branches of the tree. This doesn’t screw things up completely. Obviously it tells us that we would be better off using a model of evolution that allows a more ‘relaxed clock’, but we can see from the root-to-tip regression plots that there is molecular clock-like behaviour even though it is not a strict clock. Within the confidence intervals, that rate should be an average of the rate across all branches of the tree.
  • Once you have a rate estimate it is quite easy to work out the timing of the epidemics, the TMRCA. You can see that the B estimate has narrower intervals because of the inclusion of older sequences. However, the rates overlap with C, and we find no evidence here of a different timing for the two epidemics. These rate estimates are also transferable, i.e. once you have a reliable rate value, you can look at your own sequences and constrain the tree using a defined rate value. E.g. African C sequences, and South Africa, shown here.
  • Open the Genie program and load the file CgagSRtiptree.out. Say that the commands are available in the manual. First we want to log what we are doing. Next, change the optimisation algorithm to differential evolution. The optimiser is the mathematical algorithm under which the likelihood is optimised. Differential evolution is a genetic algorithm, and though it is the slowest it is the best to use in this case. (Find out what this means.) Then estimate the likelihood of each model of interest: mod con ml est etc. Mention that for a more complex model (i.e. one with more parameters) to be chosen over a simpler one, its log likelihood has to be more than 1.92 higher than that of the simpler model. Hence in the case of CgagSR the exponential model is the best. The logistic model gives a better likelihood, but not significantly so. We can now draw our skyline plots. These plots estimate demographic history using a stepwise function: Ne at time T is constant within each step but changes between steps. *Show PowerPoint figure.* The number of steps in the plot is the number of demographic parameters, so in the classic plot this is the number of coalescent events in the tree. In the generalised skyline plot, however, these steps are grouped together by maximising the AICc (the corrected Akaike information criterion); never mind about this. Without inputting the substitution rate the plots are represented in units of branch length, but you can scale your plot by the rate and show history in years before the present day. Also, by scaling the tree you can estimate the growth rate R and use this to calculate the doubling time in years. The results for this analysis are:

    1. 1. Polly R. Walker D. Phil Student Dept of Zoology, University of Oxford Molecular Clocks and HIV-1
    2. 2. Summary of Talk • Molecular clocks • Measurably Evolving Populations (MEPs) • Methods for measuring evolution • Coalescent theory • Application of the molecular clock - Estimating divergence times - Population dynamics using coalescent theory • Demonstration: HIV-1 in South Africa.
    3. 3. The Molecular Clock • Gene sequences accumulate substitutions at a roughly constant rate, therefore we can use gene sequences to time divergences. This is referred to as a ‘Molecular Clock’. • The idea of a molecular clock was initially suggested by Zuckerkandl and Pauling in 1962. They noted that rates of amino acid replacements in animal haemoglobins were roughly proportional to real time, as judged against the fossil record. • The “constancy” of the molecular clock is particularly striking when compared to the obvious variation in the rates of morphological evolution (e.g. the existence of “living fossils”).
    4. 4. There is No “Universal” Molecular Clock Sources of variation in the clock: • Mutation rates are variable through time - different generation times of organisms - different metabolic rates - different genomic systems, e.g. repair mechanisms - different regions, genes or sites in a molecule (together referred to as lineage effects - a neutralist explanation) • The existence of “nearly” neutral mutations and fluctuations in population size (the nearly neutral theory). • Natural selection - species adapt to variable environments. • The molecular clock can vary over time - how constant is the environment? - how neutral is evolution?
    5. 5. Average Rates of Nucleotide Substitution in Different Organisms
       Organism/Genome | Substitution Rate (per site, per year)
       Plant chloroplast DNA | ~1 x 10^-9
       Mammalian nuclear DNA | 3.5 x 10^-9
       Plant nuclear DNA | ~5 x 10^-9
       E. coli and Salmonella enterica bacteria | ~5 x 10^-9
       Drosophila nuclear DNA | 1.5 x 10^-8
       Mammalian mitochondrial DNA | 5.7 x 10^-8
       HIV-1 | 6.6 x 10^-3
    6. 6. Constant Molecular Clocks are Difficult to Obtain Under Natural Selection • The rate of substitution of mutations with selective advantage depends on: i. effective population size (4 Ne) ii. degree of selective advantage (s) iii. mutation rate (m), so that k = 4 Ne s m. How true is THAT for HIV? • For natural selection to produce a molecular clock, population sizes, selection pressures and mutation rates must be constant over evolutionary time.
    7. 7. Testing the Molecular Clock • So, is there a good molecular clock? • There are a variety of ways to test the molecular clock. i. The dispersion index, R(t) ii. The relative rate test iii. The Likelihood Ratio test using ML statistics.
    8. 8. Maximum Likelihood Tests of the Molecular Clock [Figure: two trees of Human, Chimp, Gorilla, Orang-utan and Gibbon. Clock tree, horizontal axis in time: log likelihood = -2660.61. Variable rates tree, horizontal axis in substitutions: log likelihood = -2659.18.] • Likelihood Ratio Test: the differences in log likelihood can be compared directly. LRT = Chidist(2 x ABS(ΔlnL), df = n - 2), where n is the number of sequences (not significantly different in this case - primate mitochondrial DNA)
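The test on this slide can be reproduced outside Excel. The following is a minimal sketch, assuming the two log likelihoods shown on the slide and n = 5 sequences; the function names `chi2_sf` and `clock_lrt` are hypothetical, and the chi-square tail probability is computed with a closed form that only holds for whole-number degrees of freedom.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        # Even df: finite sum exp(-y) * sum_{k < df/2} y^k / k!
        term = math.exp(-y)
        total = term
        for k in range(1, df // 2):
            term *= y / k
            total += term
        return total
    # Odd df: start from the df = 1 case and add one term per extra 2 df.
    total = math.erfc(math.sqrt(y))
    for j in range((df - 1) // 2):
        total += math.exp(-y) * y ** (j + 0.5) / math.gamma(j + 1.5)
    return total


def clock_lrt(lnl_clock: float, lnl_variable: float, n_taxa: int):
    """Return (test statistic, degrees of freedom, p-value) for the clock test."""
    statistic = 2.0 * abs(lnl_variable - lnl_clock)
    df = n_taxa - 2
    return statistic, df, chi2_sf(statistic, df)


# Values from the primate mitochondrial DNA example on the slide.
stat, df, p = clock_lrt(-2660.61, -2659.18, 5)
print(f"2*dlnL = {stat:.2f}, df = {df}, p = {p:.3f}")
```

A p-value above 0.05 means the clock is not rejected, which agrees with the slide's statement that the two trees are not significantly different.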
    9. 9. Measurably Evolving Populations Populations that are heterochronously sampled, spanning hundreds or thousands of generations, and that contain a significant amount of genetic variation. Hence, this typically includes either 1. Organisms with rapid evolution and short generation times, e.g. RNA viruses 2. Organisms with a wide range of sampling dates, e.g. ancient DNA samples
       Organism | Locus | Sampling interval (years) | Sequence length | Mutation rate μ (per site per year)
       HIV-1 | env | 8.3 | 660 | 1.0 x 10^-2
       Human Influenza A | HA1 | 13 | 987 | 5.7 x 10^-3
       Dengue-4 | E | 38 | 1485 | 7.9 x 10^-4
       Adelie Penguin | HVR-1 | ~6500 | 326 | 9.3 x 10^-7
       Brown Bear | HVR-1 | ~59000 | 195 | 4.3 x 10^-7
    10. 10. Maximum Likelihood Estimation of Viral Substitution Rates Programme: “Tip-Date” or “Rhino” • Construct rooted maximum likelihood tree • Optimise branch lengths under a single rate with relative tip positions consistent with isolation dates • Test molecular clock using a likelihood ratio test • Estimate confidence intervals • RNA viruses often have different sampling times. Small differences can have big effects. [Figure: tip-dated tree; time axis 1970 to 2000; label: substitution rate]
    11. 11. The Coalescent • Tells us how phylogenies of sampled populations are affected by changes in population size and structure (demography). • The descent of lineages is traced backwards in time, to the point when they share common ancestral alleles. The number of lineages is reduced at each coalescent event (creating nodes on the tree). • The probability that two sequences share a common ancestor in the previous generation (i.e. that a coalescent event occurs) is 1/2N. Therefore the probability that any two sequences shared a common ancestor a number of generations (G) ago is: f(G) = (1/2N) e^{-(G-1)/2N} • Therefore the probability that sequences sampled randomly from a population share a common ancestor is dependent on population size. [Diagram: Phylogeny - Coalescent Theory - Demographic History]
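To see where the formula on this slide comes from, it can be compared with the exact generation-by-generation calculation. This is a small sketch with hypothetical function names: it assumes a constant population of size N in which two lineages coalesce with probability 1/2N per generation, so the exact waiting time is geometric and the slide's expression is its exponential approximation.

```python
import math


def coalescence_prob_exact(g: int, n: int) -> float:
    """Exact probability that two lineages first coalesce g generations ago."""
    p = 1.0 / (2 * n)
    # No coalescence for g - 1 generations, then coalescence in generation g.
    return p * (1.0 - p) ** (g - 1)


def coalescence_prob_approx(g: int, n: int) -> float:
    """Exponential approximation: f(G) = (1/2N) e^{-(G-1)/2N}."""
    return math.exp(-(g - 1) / (2 * n)) / (2 * n)


if __name__ == "__main__":
    # Illustrative population sizes only; not taken from the talk.
    for n in (100, 10_000):
        for g in (1, 50, 500):
            exact = coalescence_prob_exact(g, n)
            approx = coalescence_prob_approx(g, n)
            print(f"N={n:>6} G={g:>4} exact={exact:.3e} approx={approx:.3e}")
```

The two columns agree closely once N is large, and recent coalescence is much more likely when N is small, which is the slide's point about population size.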
    12. 12. The Coalescent • Changes in population size affect the distribution of coalescent times (i.e. when in time branching events occur). • In a constant sized population more coalescent events occur near the tips than the root, but in a growing population coalescent events occur more towards the root, because the population size was smaller then, so that coalescent events are more likely (i.e. drift is more powerful in small populations). [Figure: trees for Constant (“endemic”) and Growing (“epidemic”) populations, with labels Big N and Small N] • Therefore it is possible to distinguish continually large populations from those that have only recently grown in size.
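The constant-size case can be made concrete with expected waiting times. This is a sketch under stated assumptions: a constant population of size N, a pairwise coalescence probability of 1/2N per generation as on the previous slide, and a hypothetical function name. With k lineages there are k(k-1)/2 pairs, so the expected wait until the next coalescent event is 2N divided by the number of pairs.

```python
def expected_intervals(n_samples: int, pop_size: float):
    """Expected generations spent with k lineages, for k = n_samples down to 2."""
    intervals = []
    for k in range(n_samples, 1, -1):
        pairs = k * (k - 1) / 2
        # Each pair coalesces with probability 1/(2N) per generation.
        intervals.append((k, 2 * pop_size / pairs))
    return intervals


if __name__ == "__main__":
    # Illustrative sample size and population size; not taken from the talk.
    for k, generations in expected_intervals(5, 1000):
        print(f"{k} lineages: about {generations:.0f} generations")
```

The waits are short while many lineages remain and long for the final pair, so in a constant population most coalescent events sit close to the tips.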
    13. 13. [Figure: tree shapes under slow growth and rapid growth, and for a small population and a large population; axis: time]
    14. 14. Models of Demographic History • Constant size (endemic) population: 1 parameter, population size (N) • Exponentially growing (epidemic) population: 2 parameters, current population size (N0) and rate of growth (r) • More complex models: - logistic (growth slows down toward the present) - expansion (sudden change in growth rate) • Estimate all parameters (e.g. N0, r) from tree structure • Can compare these nested models using the likelihood ratio test
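The comparison of nested models can be sketched as a simple rule. The speaker notes give a threshold of 1.92 log likelihood units for accepting one extra parameter, which is half of 3.84, the 5% chi-square critical value for one degree of freedom. The function name and the log likelihood values below are hypothetical, and the sketch assumes each model in the list adds exactly one parameter to the one before it.

```python
THRESHOLD = 1.92  # half of 3.84, the 5% chi-square critical value for 1 df


def select_model(models):
    """Pick a model from a list of (name, lnL) ordered simplest to most complex.

    Assumes each model adds exactly one parameter to the previous one.
    """
    best_name, best_lnl = models[0]
    for name, lnl in models[1:]:
        if lnl - best_lnl > THRESHOLD:
            best_name, best_lnl = name, lnl
        else:
            # The extra parameter is not justified; stop at the simpler model.
            break
    return best_name


# Hypothetical log likelihoods, for illustration only.
candidates = [("constant", -100.0), ("exponential", -95.0), ("logistic", -94.0)]
print(select_model(candidates))
```

With these made-up values the logistic model has the highest likelihood, but the gain over the exponential model is below 1.92, so the exponential model is kept.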
    15. 15. Assumptions of the Model A) Lineages coalesce independently B) No more than one coalescent event can occur in a single generation C) The time-scale is so large that it can be represented as continuous • Works best for neutral mutations subject to genetic drift in non-recombining populations - i.e. in this case any change in the structure of the genealogy must be due to demographic processes, rather than fitness differences (i.e. fit alleles produce more branches).
    16. 16. Estimating Demographic History of HIV-1 Subtype C • Step 1 Sequence selection • Large range of dates, e.g. 1989-2001 • Monophyletic (relative to a comparison group, e.g. subtype B) • Length of sequences available: optimise length against sample size. 1986: C.ET.86.ETH2220 1993: C.IN.93.N904 C.IN.93.IN905 C.IN.93.IN101 C.IN.93.IN99 1995: C.IN.95.IN21068 C.IN.95.IN21301 1996: C.BW.96.BW17B05 C.BW.96.BWM032 C.BW.96.BW0504 C.BW.96.BW1626 C.ZM.96.ZM651 C.ZM.96.ZM751 1997: C.ZA.97.ZA012 1998: C.TZ.98.TZ013 C.TZ.98.TZ017 C.ZA.98.TV001 C.ZA.TV002 1999: C.ZA.99.DU151 C.ZA.99.DU179 C.BW.99.BW47547 C.BW.99.BWMC168 2000: C.BW.00.BW18595 C.BW.00.BW18802 C.BW.00.BW192113 C.BW.00.BW20361 C.BW.00.BW20636 Example: CgagSR - ntax = 29, nchar = 1659 Los Alamos Sequence Database (http://hiv-web.lanl.gov)
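The sequence names on this slide appear to follow a subtype.country.year.isolate pattern, so sampling years for tip dating can be read from the names. The sketch below assumes that pattern and a two-digit year; the function name and the `pivot` used to choose the century are hypothetical, and names that do not fit the pattern are returned as `None` so they can be checked by hand.

```python
def sampling_year(name: str, pivot: int = 30):
    """Extract a four-digit sampling year from a name like 'C.ET.86.ETH2220'.

    Two-digit years at or below `pivot` are read as 20xx, others as 19xx.
    Returns None when the name does not follow the expected pattern.
    """
    fields = name.split(".")
    if len(fields) < 4:
        return None
    year = fields[2]
    if len(year) != 2 or not year.isdigit():
        return None
    two_digit = int(year)
    return 2000 + two_digit if two_digit <= pivot else 1900 + two_digit


for seq in ("C.ET.86.ETH2220", "C.BW.00.BW18595", "C.ZA.TV002"):
    print(seq, sampling_year(seq))
```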
    17. 17. Step 1. Sequence Alignment Using Clustal AND manual alignment, e.g. Se-Al. Remove all incomplete or stop codons (*, ?), and make sure the alignment is in the correct reading frame. [Figure: alignment in which the sequences are out-of-frame]
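The codon check described on this slide can also be automated before the manual pass. This is a minimal sketch, assuming nucleotide sequences that start in frame, with `-` for alignment gaps and `?` for unknown bases; the function name and the problem labels are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def frame_problems(seq: str):
    """List (codon index, codon, problem) for codons that need attention."""
    seq = seq.upper()
    problems = []
    for start in range(0, len(seq), 3):
        codon = seq[start:start + 3]
        index = start // 3
        if len(codon) < 3:
            problems.append((index, codon, "incomplete"))
        elif codon in STOP_CODONS:
            problems.append((index, codon, "stop"))
        elif "?" in codon:
            problems.append((index, codon, "ambiguous"))
        elif "-" in codon and codon != "---":
            # A gap that covers only part of a codon shifts the reading frame.
            problems.append((index, codon, "incomplete"))
    return problems


print(frame_problems("ATGAAATAGCCC"))
```

Many stop codons reported downstream of one position usually mean the sequence is out of frame from that point, as the speaker notes describe.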
    18. 18. Step 2. ML tree construction • Make a Neighbour Joining tree, check this tree and remove identical / almost identical sequences • Estimate all parameters under a realistic evolutionary model, e.g. GTR + gamma, and derive the best ML tree. • Rooting the tree: e.g. outgroup rooting. - Add in a distantly related sequence, like another subtype. Subtype B is the most distantly related sequence. The closest sequence/s to the root of the tree is defined as the outgroup. Return to your original tree and use this sequence to root the tree (under rooting options)
    19. 19. Step 3 Tip-dating the Tree • Prepare the correct input format: you must have a sequence file in nexus format, a rooted tree file, and tip dating information • Use the same evolutionary model here as you used to generate the tree (get commands from the manual). • Estimate the rate of evolution (absrate) and confidence intervals (interval tree) using bootstrapping.
       Begin RHINO;
         NUCMODEL TYPE=GTR;
         TREEMODEL TYPE=TIPDATES;
         SITEMODEL TYPE=GAMMA;
         OPTIMIZE;
         STATUS param;
         interval tree:absrate;
       End;
       • Carry out the likelihood ratio test: is it significant? Rhino Version 1.2 http//evolve/ox.ac.uk Macintosh version - Runs on MacOS9 and MacOSX UNIX/Linux version - could be compiled for Windows
    20. 20. • The likelihood ratio test tells us whether we are justified in assuming a molecular clock. If a clock exists then the difference is not significant. • LRT = Chidist(2 x ABS(lnL(VR) - lnL(clock)), df), with df = n - 2 • This is a very strict measure of a molecular clock. Look at root-to-tip regression lines.
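A root-to-tip regression can be sketched with ordinary least squares: each tip's genetic distance from the root is regressed on its sampling date, the slope estimates the substitution rate, and the date at which the fitted line reaches zero distance estimates the age of the root. The function name and the data points below are hypothetical; the points were constructed to lie exactly on a line, whereas real data scatter around it.

```python
def root_to_tip_regression(dates, distances):
    """Least-squares fit of root-to-tip distance on sampling date.

    Returns (slope, root_date): the slope is the rate per year and
    root_date is where the fitted line crosses zero distance.
    """
    n = len(dates)
    mean_x = sum(dates) / n
    mean_y = sum(distances) / n
    sxx = sum((x - mean_x) ** 2 for x in dates)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dates, distances))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, -intercept / slope


# Hypothetical sampling years and root-to-tip distances (substitutions per site).
years = [1986, 1993, 1996, 2000]
dists = [0.0390, 0.0495, 0.0540, 0.0600]
rate, root_year = root_to_tip_regression(years, dists)
print(f"rate = {rate:.4f} per site per year, root at about {root_year:.0f}")
```

How tightly the points cluster around the line gives a visual impression of clock-like behaviour even when the strict likelihood ratio test rejects the clock.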
    21. 21. Using The Clock: 1. Timing the origin of the epidemic TMRCA = tree node height / substitution rate μ = years since MRCA. No significant difference between the timing of the two subtypes. Subtype C has a slightly lower point estimate for rate but broader CIs. Can apply the rates to other data sets, provided it is the same gene region
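The conversion on this slide is a single division, and the uncertainty in the rate carries straight through to the date. This sketch uses the gag rate of 0.0015 quoted in the speaker notes; the node height and the rate bounds are hypothetical values chosen for illustration, as are the function names.

```python
def tmrca_years(node_height: float, rate: float) -> float:
    """Years since the MRCA: node height (subs/site) over rate (subs/site/year)."""
    if rate <= 0:
        raise ValueError("substitution rate must be positive")
    return node_height / rate


def tmrca_interval(node_height: float, rate_low: float, rate_high: float):
    """Range of TMRCA values implied by lower and upper bounds on the rate."""
    # A faster rate implies a more recent ancestor, so the bounds swap over.
    return tmrca_years(node_height, rate_high), tmrca_years(node_height, rate_low)


# Hypothetical root height of 0.06 substitutions per site.
print(tmrca_years(0.06, 0.0015))
print(tmrca_interval(0.06, 0.0012, 0.0020))
```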
    22. 22. Population Dynamics 1. Determine the maximum likelihood population growth model. 2. Estimate the parameter rho under the best-fit growth model. 3. Scale skyline plots according to the substitution rate. 4. Estimate parameter R, the growth rate in units of year^-1, by rescaling rho with the substitution rate μ. 5. Estimate the doubling time: Doubling time (years) = ln(2) / R
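The doubling time follows directly from the exponential growth model: a population growing at rate R per year doubles when e^(R t) = 2, so t = ln(2) / R. A minimal sketch, with a hypothetical function name and a made-up growth rate:

```python
import math


def doubling_time(growth_rate: float) -> float:
    """Doubling time in years for an exponential growth rate in units of year^-1."""
    if growth_rate <= 0:
        raise ValueError("doubling time is only defined for positive growth rates")
    return math.log(2) / growth_rate


# Hypothetical growth rate of 0.5 per year, for illustration only.
print(f"{doubling_time(0.5):.2f} years")
```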
    23. 23. Results • Within subtype C confidence intervals overlap. Subtypes B and C show different demographic histories. • Subtype C has a slower exponential phase than subtype B. • Subtype C on a global level is showing a logistic trend, not yet significant, but in Africa it is still growing exponentially. Potential Applications: Comparing growth rates within different groups, e.g. risk group, HLA type, or the spread of different clades. Detecting decreases in epidemic growth rate.
    24. 24. Conclusions Molecular clocks can be used to: a) time the origin of an epidemic b) determine population dynamics. • Your estimates are only as good as your clock. • HIV is subject to variable rates of evolution among branches: it needs new models which allow for this (relaxed clocks).