Improved Models of Molecular Evolution in Statistical Phylogenetics - Stephane Guindon

In this talk, I will present new models of molecular evolution suitable for phylogeny estimation. These models provide an improved description of the variation of rates of evolution along genomes and during the course of evolution. I will present examples that demonstrate the superiority of these new models and show how to use them in PhyML.

    Presentation Transcript

    • Improved models of molecular evolution in statistical phylogenetics. Stéphane Guindon, Department of Statistics, The University of Auckland, New Zealand
    • Outline: 1 Introduction; 2 Variability of rates across sites; 3 Variability of rates across sites and lineages; 4 Variability of selection regimes across sites and lineages; 5 Conclusion
    • Data used in phylogenetics. Multiple sources: (alignment of) homologous sequences; calibration (typically fossil data); geography and environment (e.g., GIS data). In this talk, I will mainly focus on pre-determined alignment(s) of orthologous coding sequences.
    • Phylogenetic model: a “hybrid” object made of a discrete parameter, the tree topology, and multiple continuous parameters such as branch lengths, substitution rates between pairs of characters (nucleotides, amino acids, codons), population sizes, migration rates, etc. Estimation relies on the likelihood, i.e., the probability of the data given the model parameter values. Inference is Bayesian or by maximum likelihood, depending on the type of problem and the amount of time and computing resources available.
    • Likelihood: $\Pr(S_1, S_2, \ldots, S_6 \mid M) = \Pr(S_1 = \mathrm{AAAA} \mid M) \times \Pr(S_2 = \mathrm{CGGC} \mid M) \times \ldots \times \Pr(S_6 = \mathrm{GGAA} \mid M)$
    • Likelihood: $4^n$ combinations (for nucleotides), which is not computationally tractable for the vast majority of data sets. Clever tree traversal algorithm by Felsenstein (1981): only $4 \times 4 \times n$ operations are required, so analyses can go up to 5,000 to 10,000 sequences.
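
The tree traversal idea can be sketched in a few lines. This is only an illustration under stated assumptions: a rooted binary tree encoded as nested tuples, a Jukes-Cantor substitution model, and uniform root frequencies. The function names and the toy tree are hypothetical and are not taken from the talk or from PhyML.

```python
import math

STATES = "ACGT"

def jc69_prob(i, j, length):
    """Jukes-Cantor probability of ending in state j after starting in i."""
    decay = math.exp(-4.0 * length / 3.0)
    return 0.25 + 0.75 * decay if i == j else 0.25 - 0.25 * decay

def partial_likelihoods(node, column):
    """Pr(observed tips below `node` | `node` is in state x), for each x.

    A tip is an int indexing into `column`; an internal node is a tuple
    of (child, branch_length) pairs.
    """
    if isinstance(node, int):
        return [1.0 if column[node] == s else 0.0 for s in STATES]
    result = [1.0] * 4
    for child, length in node:
        below = partial_likelihoods(child, column)
        for x in range(4):
            # 4 x 4 multiplications per edge, hence about 4 * 4 * n overall.
            result[x] *= sum(jc69_prob(x, y, length) * below[y] for y in range(4))
    return result

def site_likelihood(tree, column):
    """Sum over root states, weighted by (assumed) uniform root frequencies."""
    return sum(0.25 * p for p in partial_likelihoods(tree, column))

def log_likelihood(tree, columns):
    """Sites are independent, so per-site log-likelihoods add up."""
    return sum(math.log(site_likelihood(tree, c)) for c in columns)

# Toy rooted tree ((tip0:0.1, tip1:0.2):0.05, tip2:0.3) and three alignment columns.
inner = ((0, 0.1), (1, 0.2))
toy_tree = ((inner, 0.05), (2, 0.3))
print(log_likelihood(toy_tree, ["AAA", "CGC", "GGA"]))
```

Each internal node is visited once per site, which is what keeps the cost linear in the number of sequences instead of exponential.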
    • Core of the likelihood. $N(t)$: number of substitutions in the short time interval $[0, t]$. Poisson process: $\Pr(N(t + dt) - N(t) = 1) \simeq \lambda\,dt$, $\Pr(N(t + dt) - N(t) = 0) \simeq 1 - \lambda\,dt$, $\Pr(N(t + dt) - N(t) \geq 2) \simeq 0$. Poisson probability: $\Pr(N(t) = k) = \frac{(\lambda t)^k}{k!} e^{-\lambda t}$
    • Core of the likelihood: 0/1 data, 2 substitutions. $\begin{pmatrix} p_{0\to0} p_{0\to0} + p_{0\to1} p_{1\to0} & p_{0\to0} p_{0\to1} + p_{0\to1} p_{1\to1} \\ p_{1\to0} p_{0\to0} + p_{1\to1} p_{1\to0} & p_{1\to0} p_{0\to1} + p_{1\to1} p_{1\to1} \end{pmatrix} = \begin{pmatrix} p_{0\to0} & p_{0\to1} \\ p_{1\to0} & p_{1\to1} \end{pmatrix} \times \begin{pmatrix} p_{0\to0} & p_{0\to1} \\ p_{1\to0} & p_{1\to1} \end{pmatrix}$. In general, for $k \geq 0$ substitutions, the probabilities of change from one state to another are given by $R^k$, where $R$ is the matrix of single-substitution probabilities $p_{i\to j}$.
    • Core of the likelihood. Combine the Poisson probability of k substitutions happening with $R^k$ in order to derive $P(t)$, the matrix of transition probabilities between states in time t: $P(t) = \sum_{k=0}^{\infty} R^k \frac{(\mu t)^k e^{-\mu t}}{k!} = e^{-\mu t} \sum_{k=0}^{\infty} \frac{(R \mu t)^k}{k!} = e^{-\mu t} e^{R \mu t} = e^{(R - I)\mu t}$. Writing $Q = R - I$ gives $P(\mu t) = e^{Q \mu t}$ and, with $l = \mu t$, $P(l) = e^{Ql}$.
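
The identity between the Poisson-weighted sum of powers of R and the matrix exponential can be checked numerically. The sketch below assumes a toy two-state model in which every substitution flips the state, so that R = [[0, 1], [1, 0]] and Q = R - I; for this special case the matrix exponential has the simple closed form coded in `two_state_closed_form`. All function names are hypothetical.

```python
import math

def mat_mul(a, b):
    """Plain-Python product of two square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def poisson_weighted_sum(R, mu_t, terms=60):
    """Truncated sum over k of R^k * Pr(k substitutions), k ~ Poisson(mu_t)."""
    n = len(R)
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # R^0
    weight = math.exp(-mu_t)  # Poisson probability of k = 0
    total = [[0.0] * n for _ in range(n)]
    for k in range(terms):
        for i in range(n):
            for j in range(n):
                total[i][j] += weight * power[i][j]
        power = mat_mul(power, R)   # R^(k+1)
        weight *= mu_t / (k + 1)    # Poisson probability of k + 1
    return total

def two_state_closed_form(l):
    """exp(Q l) for Q = [[-1, 1], [1, -1]], i.e. R - I with R = [[0, 1], [1, 0]]."""
    same = 0.5 * (1.0 + math.exp(-2.0 * l))
    return [[same, 1.0 - same], [1.0 - same, same]]

R = [[0.0, 1.0], [1.0, 0.0]]
series = poisson_weighted_sum(R, 0.7)
exact = two_state_closed_form(0.7)
print(series[0][0], exact[0][0])
```

Real implementations do not sum the series; they typically diagonalise Q once and reuse the decomposition for every edge length.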
    • Outline: 2 Variability of rates across sites
    • Simplest model: same rate matrix Q throughout the tree; sites are independent and identically distributed (iid); edges all have the same length l.
    • Standard model, no variation across sites: same rate matrix throughout the tree; sites are iid; each edge has its own length.
    • Standard model, variation across sites: same rate matrix throughout the tree; each edge has its own length; sites are still iid, with $\pi_{\text{fast}} + \pi_{\text{slow}} = 1$ and $\pi_{\text{fast}} r_{\text{fast}} + \pi_{\text{slow}} r_{\text{slow}} = 1$.
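
A two-class rate mixture can be illustrated on the smallest possible case. The sketch assumes two sequences joined by a single edge and a Jukes-Cantor model (both are illustrative choices, not the setting of the talk): the site likelihood is the average of the class-specific likelihoods, each computed with the edge length multiplied by the class rate. Function names and numeric values are hypothetical.

```python
import math

def jc69_pair_likelihood(x, y, distance):
    """Likelihood of observing nucleotides x and y at evolutionary distance `distance`."""
    decay = math.exp(-4.0 * distance / 3.0)
    p = 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay
    return 0.25 * p  # uniform frequency of x times Pr(x -> y)

def mixture_site_likelihood(x, y, edge_length, weights, rates):
    """Average the site likelihood over rate classes (weights pi_c, rates r_c)."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("class weights must sum to 1")
    if abs(sum(w * r for w, r in zip(weights, rates)) - 1.0) > 1e-9:
        raise ValueError("mean rate must equal 1")
    return sum(w * jc69_pair_likelihood(x, y, r * edge_length)
               for w, r in zip(weights, rates))

# Half of the sites fast (r = 1.8), half slow (r = 0.2): the mean rate is 1.
weights, rates = [0.5, 0.5], [1.8, 0.2]
print(mixture_site_likelihood("A", "G", 0.3, weights, rates))
```

The constraint on the mean rate keeps edge lengths interpretable as expected numbers of substitutions per site.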
    • Continuous Gamma model
    • Discrete Gamma model (Yang, 1994)
    • Transition to new models. Benefit of Yang’s discrete Gamma approach: one parameter (α) determines the values of all the $r_i$. Limiting the number of parameters to estimate is convenient from a computational perspective. In practice, the variation of rates across sites is a strong feature of molecular evolution, i.e., estimating α is easy. Modern genetic data sets are much bigger than they were in the 1990s, and computers are also much faster. It is high time we moved on: designing more flexible models of rate variation is relatively straightforward.
    • FreeRate model. Non-parametric estimation of the $\pi_i$ and $r_i$: estimate these parameters under the two constraints $\sum_i \pi_i = 1$ and $\sum_i \pi_i r_i = 1$.
    • FreeRate model. Main drawback: the greater number of parameters to estimate, one for the discrete gamma model vs. $2C - 2$ for FreeRate with C rate classes. Benefits: more flexibility in modelling the variability of rates across sites; possibility to select the “best” number of rate classes using a sound statistical approach (e.g., likelihood ratio tests).
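
One simple way to satisfy the two FreeRate constraints is to rescale unconstrained positive values, as sketched below. This is an assumption made for illustration, not a description of how PhyML enforces the constraints; the function names are hypothetical. The parameter count follows directly: C weights plus C rates, minus the two constraints.

```python
def normalise_freerate(raw_weights, raw_rates):
    """Rescale positive values so that sum(pi_i) = 1 and sum(pi_i * r_i) = 1."""
    total = sum(raw_weights)
    weights = [w / total for w in raw_weights]
    mean_rate = sum(w * r for w, r in zip(weights, raw_rates))
    rates = [r / mean_rate for r in raw_rates]
    return weights, rates

def free_parameters(n_classes):
    """C weights + C rates - 2 constraints."""
    return 2 * n_classes - 2

def lrt_statistic(lnl_simple, lnl_complex):
    """Likelihood ratio statistic comparing two nested models."""
    return 2.0 * (lnl_complex - lnl_simple)

weights, rates = normalise_freerate([2.0, 1.0, 1.0], [0.1, 1.0, 4.0])
print(weights, rates, free_parameters(len(weights)))
```

When comparing C and C + 1 classes, the difference in free parameters is 2; the usual chi-square approximation for the statistic should be treated with caution because some parameters may sit on a boundary.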
    • Results: nucleotide data sets
    • Results: amino-acid data sets
    • Prediction of amino-acid diversity
    • Summary. FreeRate generally fits data better than +Γ4, at similar computational cost. Soubrier et al. (2012, Mol. Biol. Evol.): FreeRate returns more accurate estimates of node ages compared to +Γ4. In PhyML: command-line option --freerate (or --freerates).
    • Outline: 3 Variability of rates across sites and lineages
    • Actual rate patterns (?)
    • Modelling site-specific rate patterns. Each site and each edge has its own rate of evolution: this is the no-common-mechanism model, which has poor statistical properties. Alternative: each edge has the same distribution of rates.
    • The Integrated Length (IL) approach. The length of a branch is a random variable, characterized by a mean (number of substitutions) and a variance. In the current implementation, the variance is proportional to the mean (one extra parameter for the whole tree compared to the standard approach).
    • The Integrated Length (IL) approach. Integrate over a wide range of scenarios, including the good ones... and the not so good ones.
    • Theory. Standard approach: $P(l) = e^{Ql}$. IL approach: $P(l) = \int_0^{\infty} e^{Ql} p(l)\,dl$. If l is distributed as $\Gamma(\alpha, \beta)$, then $P(\alpha, \beta) = (I - \beta Q)^{-\alpha}$. Same computational cost as that of the standard approach.
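
The closed form for Gamma-distributed edge lengths can be checked on a toy case. The sketch assumes that β is the scale parameter of the Gamma distribution and uses a two-state model with Q = [[-1, 1], [1, -1]], whose eigenvalues are 0 and -2; the probability of observing the same state at both ends of an edge is then (1 + e^{-2l})/2 under the standard approach and (1 + (1 + 2β)^{-α})/2 under the integrated one. Function names and parameter values are hypothetical.

```python
import math

def gamma_pdf(l, alpha, beta):
    """Gamma density with shape alpha and scale beta (assumed parameterisation)."""
    return l ** (alpha - 1.0) * math.exp(-l / beta) / (math.gamma(alpha) * beta ** alpha)

def p_same_standard(l):
    """Diagonal entry of exp(Q l) for the two-state matrix Q = [[-1, 1], [1, -1]]."""
    return 0.5 * (1.0 + math.exp(-2.0 * l))

def p_same_integrated_numeric(alpha, beta, upper=40.0, steps=100000):
    """Midpoint-rule approximation of the integral of exp(Q l) p(l) dl (diagonal entry)."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        l = (i + 0.5) * h
        total += p_same_standard(l) * gamma_pdf(l, alpha, beta) * h
    return total

def p_same_integrated_closed(alpha, beta):
    """Diagonal entry of (I - beta Q)^(-alpha) for the same two-state Q."""
    return 0.5 * (1.0 + (1.0 + 2.0 * beta) ** (-alpha))

alpha, beta = 2.0, 0.15  # mean edge length alpha * beta = 0.3
print(p_same_integrated_numeric(alpha, beta), p_same_integrated_closed(alpha, beta))
```

Because the closed form only requires raising the eigenvalues of I - βQ to the power -α, the cost matches that of computing a matrix exponential from an eigen-decomposition.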
    • Results: nucleotide data sets
    • Results: amino-acid data sets
    • Summary. IL incurs approximately the same computational cost as the standard model. The standard model is nested within IL (IL has one extra parameter), which opens an avenue for hypothesis testing. The Gamma distribution is a good model for the branch length if the rate of evolution fluctuates according to a (geometric) Brownian process (Guindon, Syst. Biol., 2013). Large improvement for a small proportion of data sets. In PhyML: command-line option --il.
    • Outline: 4 Variability of selection regimes across sites and lineages
    • Codon models. Use alignments of homologous coding sequences to estimate the ratio of non-synonymous to synonymous (dN/dS) substitution rates. The Q matrix is now 61 by 61 (instead of 4 × 4 or 20 × 20). We are no longer interested in the variation of the overall rate at which substitutions accumulate. Rather, we focus on the variation of dN/dS.
    • Variation across sites: M2a model
    • M2a model: a different viewpoint. Consider a new model where each state of the Markov model is a combination of a codon state and a selection regime. The M2a model is then defined by a rate matrix Q with dimension (3 × 61) by (3 × 61): $Q = \begin{pmatrix} Q_{\omega_0} & 0 & 0 \\ 0 & Q_{\omega_1} & 0 \\ 0 & 0 & Q_{\omega_2} \end{pmatrix}$
    • Extending M2a: branch-site model. $Q = \begin{pmatrix} Q_{\omega_0} & 0 & 0 \\ 0 & Q_{\omega_1} & 0 \\ 0 & 0 & Q_{\omega_2} \end{pmatrix} + \varrho \begin{pmatrix} - & \pi_{\omega_1} I & \pi_{\omega_2} I \\ \pi_{\omega_0} I & - & \pi_{\omega_2} I \\ \pi_{\omega_0} I & \pi_{\omega_1} I & - \end{pmatrix}$. Guindon, Rodrigo, Dyer, Huelsenbeck (2004, PNAS)
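
The block structure of this rate matrix can be assembled as sketched below. The example uses toy 2-state blocks in place of the 61-codon matrices, and it assumes that the diagonal entries are whatever makes each row sum to zero, as is usual for a rate matrix. The function name, the block values, the class frequencies and the switching rate are all hypothetical.

```python
def switching_rate_matrix(blocks, class_freqs, rho):
    """Block-diagonal within-class rates plus switches between selection classes.

    blocks[a] is the rate matrix of class a; a switch from class a to class b
    keeps the current state and occurs at rate rho * class_freqs[b].
    """
    n = len(blocks[0])
    size = n * len(blocks)
    Q = [[0.0] * size for _ in range(size)]
    for a, block in enumerate(blocks):
        for i in range(n):
            for j in range(n):
                if i != j:
                    Q[a * n + i][a * n + j] = block[i][j]
            for b, freq in enumerate(class_freqs):
                if b != a:
                    Q[a * n + i][b * n + i] = rho * freq
    # Assumption: each diagonal entry is minus the sum of the other entries in its row.
    for index, row in enumerate(Q):
        row[index] = -sum(v for col, v in enumerate(row) if col != index)
    return Q

toy_blocks = [
    [[-0.2, 0.2], [0.2, -0.2]],   # stands in for Q under omega_0
    [[-1.0, 1.0], [1.0, -1.0]],   # stands in for Q under omega_1
    [[-3.0, 3.0], [3.0, -3.0]],   # stands in for Q under omega_2
]
toy_freqs = [0.6, 0.3, 0.1]
Q = switching_rate_matrix(toy_blocks, toy_freqs, rho=0.5)
for row in Q:
    print([round(v, 3) for v in row])
```

Setting the switching rate to zero recovers the block-diagonal matrix, in which a site never leaves its selection class.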
    • Shan et al. (2009, Mol. Biol. Evol.)
    • Standard branch-site model. Gold standard: the branch-site model (PAML), where the user specifies a priori which branches are likely to be affected by positive selection at some sites. The stochastic branch-site model does not require such prior information. Also, the standard branch-site model assumes that the same branches undergo positive selection in different regions of the alignment. The stochastic branch-site approach does not impose that constraint.
    • Simulations
    • Power to detect positive selection. Truth: 50% A + 50% F.
      Data set | std-BS with tree A | std-BS with tree B | std-BS with tree C | std-BS with tree D | std-BS with tree E | std-BS multi | sto-BS
      XU       | 0.912              | 0.020              | 0.346              | 0.006              | 0.172              | 0.022        | 0.148
      XV       | 0.916              | 0.036              | 0.346              | 0.010              | 0.150              | 0.040        | 0.178
      XW       | 1.000              | 0.190              | 0.976              | 0.166              | 0.000              | 0.264        | 0.682
      XU, XV: 20% of the sites evolve under positive selection (on green edges only). XW: 40% of the sites evolve under positive selection (on green edges only). Lu & Guindon (2013, Mol. Biol. Evol.)
    • Summary. Strong prior on where positive selection might have occurred: use the standard branch-site model. Exploratory analysis: the stochastic branch-site model performs better. It also clearly outperforms the MEME approach implemented in HyPhy (Murrell et al., 2012). The stochastic branch-site model is implemented in fitmodel: http://code.google.com/p/fitmodel.
    • Outline: 5 Conclusion
    • Conclusion. The FreeRate model almost systematically outperforms the standard (+Γ4) one. The IL approach brings significant improvement for a smaller fraction of the alignments (why?). The stochastic branch-site model of codon evolution is well suited for exploratory analysis where one does not have a clear idea a priori about the lineages evolving under positive selection. The future? More data means more variability to account for, so improving models of molecular evolution is (still) essential. Better models are needed for other sources of data, in particular spatial coordinates of collected sequences and fossils.