Life Exists Beyond GTR - Ben Kaehler


  1. Life Exists Beyond GTR
     Ben Kaehler
     John Curtin School of Medical Research, Australian National University
     © Ben Kaehler 2013
  2. Overview
     • Modelling evolution
     • Relaxing some assumptions
     • Quantifying plausibility
  3. Motivation
     GTR Overestimates Genetic Distance
  4. Programming in Science
     • Use source control
     • Write unit tests
     • Read this: http://software-carpentry.org/
  5. Setup
     • Consider only genetic data (DNA or protein sequences)
     • Take “genes” to be orthologous sequences
     • Assume that each position (nucleotide or codon) in a gene has evolved on the same bifurcating tree
     • Assume that each position in a gene has evolved independently under identical processes
  6. PyCogent
     Knight et al. (2007). PyCogent: a toolkit for making sense from sequence. Genome Biol 8, R171.
  7. Sequences in PyCogent
     • Loading and saving sequence collections
     • Mapping between DNA and protein sequences for alignment
     • Filtering by position and for gaps
  8. The Substitution Process
     • Evolution along each branch is a continuous-time, time-homogeneous Markov process
     • A Markov process has no memory
     • Between speciation events, substitution rates are constant
     (Thanks Gavin)
  9. Some Maths
     • Marginal probability vector:
       $$\pi = \begin{pmatrix} \pi_A & \pi_C & \pi_G & \pi_T \end{pmatrix}$$
     • Transition probability matrix:
       $$P = \begin{pmatrix} P(A|A) & P(C|A) & P(G|A) & P(T|A) \\ P(A|C) & P(C|C) & P(G|C) & P(T|C) \\ P(A|G) & P(C|G) & P(G|G) & P(T|G) \\ P(A|T) & P(C|T) & P(G|T) & P(T|T) \end{pmatrix}$$
     • Markov process:
       $$\pi(t) = \pi(0) P(t)$$
     • Substitution rate matrix:
       $$P(t) = e^{Qt}$$
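The relations π(t) = π(0)P(t) and P(t) = e^{Qt} can be checked numerically. The following is a minimal pure-Python sketch, not PyCogent code: the rate values in `OFF_DIAGONAL` are made up for illustration, and `make_rate_matrix`, `mat_exp` and the other helper names are hypothetical. The matrix exponential is approximated by scaling and squaring with a truncated Taylor series.

```python
# Hypothetical off-diagonal substitution rates; rows and columns ordered A, C, G, T.
OFF_DIAGONAL = [
    [0.0, 0.2, 0.6, 0.1],
    [0.3, 0.0, 0.2, 0.7],
    [0.5, 0.1, 0.0, 0.2],
    [0.1, 0.8, 0.3, 0.0],
]


def make_rate_matrix(off_diag):
    """Copy the off-diagonal rates and set each diagonal so that rows sum to zero."""
    n = len(off_diag)
    q = [row[:] for row in off_diag]
    for i in range(n):
        q[i][i] = -sum(off_diag[i][j] for j in range(n) if j != i)
    return q


def mat_mul(a, b):
    """Plain matrix product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(q, t, squarings=10, terms=12):
    """Approximate exp(Q t): Taylor series on Q t / 2**squarings, then repeated squaring."""
    n = len(q)
    scale = t / (2 ** squarings)
    scaled = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term becomes scaled**k / k!
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def vec_mat(v, m):
    """Row vector times matrix."""
    n = len(v)
    return [sum(v[i] * m[i][j] for i in range(n)) for j in range(n)]


Q = make_rate_matrix(OFF_DIAGONAL)
P_half = mat_exp(Q, 0.5)                     # P(t) for t = 0.5
pi0 = [0.25, 0.25, 0.25, 0.25]               # assumed starting composition
pi_half = vec_mat(pi0, P_half)               # pi(t) = pi(0) P(t)
P_composed = mat_mul(mat_exp(Q, 0.3), mat_exp(Q, 0.2))  # should match P(0.5)
```

Each row of `P_half` is a probability distribution, and `P_composed` agreeing with `P_half` illustrates the time-homogeneous, memoryless property: P(s + t) = P(s)P(t).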
  10. Submodels
     • It is common practice to place constraints on Q to reduce the parameter space and impose
       • desirable phylogenetic properties, and
       • computational tractability and interpretability
     • The most general time-reversible model is called GTR (aka REV)
       • time reversibility means statistical agnosticism to the direction of time
       • time reversibility implies stationarity
  11. Stationarity and Time Reversibility
     • For all t:
       $$\pi(t) = \pi(0) = \pi$$
       so
       $$\pi Q = 0$$
     • This means the base composition of every gene must be approximately equal
     • GTR has the desirable properties that:
       • it can be reconstructed from just two genes and
       • it doesn’t matter where you put the root
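To make stationarity and time reversibility concrete, here is a small sketch that builds a GTR rate matrix in one common parameterisation, q_ij = s_ij π_j for i ≠ j with symmetric exchangeabilities s_ij, and then checks πQ = 0 and the detailed-balance condition π_i q_ij = π_j q_ji. The values of `pi` and `exchangeabilities` are invented for illustration, and `gtr_rate_matrix` is a hypothetical helper, not a PyCogent function.

```python
BASES = "ACGT"

# Hypothetical stationary base frequencies (A, C, G, T) and symmetric exchangeabilities.
pi = [0.1, 0.2, 0.3, 0.4]
exchangeabilities = {
    ("A", "C"): 1.0, ("A", "G"): 4.0, ("A", "T"): 0.5,
    ("C", "G"): 0.8, ("C", "T"): 3.5, ("G", "T"): 1.2,
}


def gtr_rate_matrix(pi, exch):
    """q_ij = s_ij * pi_j for i != j; diagonals chosen so that each row sums to zero."""
    n = len(BASES)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                key = (BASES[min(i, j)], BASES[max(i, j)])
                q[i][j] = exch[key] * pi[j]
        q[i][i] = -sum(q[i])  # the diagonal is still 0.0 here, so this sums the off-diagonals
    return q


Q = gtr_rate_matrix(pi, exchangeabilities)

# Stationarity: pi Q should be the zero vector.
pi_Q = [sum(pi[i] * Q[i][j] for i in range(4)) for j in range(4)]

# Time reversibility (detailed balance): pi_i q_ij == pi_j q_ji for every pair.
max_imbalance = max(abs(pi[i] * Q[i][j] - pi[j] * Q[j][i])
                    for i in range(4) for j in range(4))
```

Both `pi_Q` and `max_imbalance` come out as zero up to floating-point rounding, which is the sense in which reversibility implies stationarity under this parameterisation.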
  12. Constraints on Q
     [Figure: The Felsenstein Hierarchy of substitution models: GTR, SYM, TN93, HKY85, F84, F81, CS05, JC69, K3ST, K80]
     Pachter, L., & Sturmfels, B. (Eds.). (2005). Algebraic statistics for computational biology (Vol. 13). Cambridge University Press.
  13. Constraints on Q
     [Figure: The Felsenstein Hierarchy, showing GTR and TN93] (Pachter & Sturmfels, 2005)
  14. Constraints on Q
     [Figure: The Felsenstein Hierarchy, showing HKY85] (Pachter & Sturmfels, 2005)
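The matrices from these slides did not survive in the transcript. As a reminder of what a constraint on Q looks like, HKY85 is usually written as below, with states ordered A, C, G, T, a single transition/transversion ratio κ, and each diagonal entry (shown as a dot) set so that its row sums to zero. Whether the slide used this exact parameterisation is an assumption.

```latex
Q_{\mathrm{HKY85}} \propto
\begin{pmatrix}
\cdot & \pi_C & \kappa\pi_G & \pi_T \\
\pi_A & \cdot & \pi_G & \kappa\pi_T \\
\kappa\pi_A & \pi_C & \cdot & \pi_T \\
\pi_A & \kappa\pi_C & \pi_G & \cdot
\end{pmatrix}
```

In this form HKY85 is the special case of GTR in which the two transition exchangeabilities (A↔G and C↔T) equal κ and the four transversion exchangeabilities equal 1.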
  15. Fitting Models in PyCogent
     • Fit HKY85, GTR
     • PyCogent parameters relate to the matrices above
     • Generalising models
  16. Non-Stationarity
     • We introduce non-stationarity by allowing Q to vary independently of π
     • The base composition is no longer uniquely determined by Q so it is free to vary across the phylogeny
     • There are issues regarding identifiability
     • The location of the root matters
     • But life exists beyond GTR
  17. Identifiability
     • Nature gives us frequencies for each possible column:

       Dog  Pangolin  Rhino  FalseVamp  TombBat
       A    A         A      A          A
       C    A         A      A          A
       G    A         A      A          A
       T    A         A      A          A
       A    C         A      A          A
       C    C         A      A          A
       G    C         A      A          A
       T    C         A      A          A
       …    …         …      …          …
       (× 4^5 possible columns)

     • If there are two sets of parameters that fit the observed frequencies equally well, we say that the model is not identifiable
     • GTR is identifiable for two or more sequences
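The "frequencies for each possible column" are simply counts of alignment column patterns. A minimal sketch, assuming a toy alignment with invented sequences (the taxon names are taken from the slide, and `column_counts` is a hypothetical helper):

```python
from collections import Counter

# Toy alignment: made-up sequences, one string per taxon, all of equal length.
alignment = {
    "Dog":       "ACGTACGTAA",
    "Pangolin":  "ACGTACGTAC",
    "Rhino":     "ACGAACGTAA",
    "FalseVamp": "ACGTACCTAA",
    "TombBat":   "ACGTACGTAA",
}


def column_counts(aln):
    """Count how often each column pattern (one base per taxon, in dict order) occurs."""
    names = list(aln)
    length = len(aln[names[0]])
    if any(len(aln[name]) != length for name in names):
        raise ValueError("all sequences must have the same length")
    return Counter("".join(aln[name][i] for name in names) for i in range(length))


counts = column_counts(alignment)
n_possible = 4 ** len(alignment)  # number of distinct nucleotide patterns for 5 taxa
```

With five taxa there are 4^5 = 1024 possible patterns, so a short alignment observes only a few of them; a model is fitted to this table of pattern counts.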
  18. Consistency
     • If the maximum likelihood (ML) estimate of a model converges to the true model as sequence length increases, we say that the estimate is consistent
     • If we do not constrain Q,
       • the ML estimates are consistent for three or more sequences
       • a sensible, continuous-time model that achieves provable consistency but is more general than this non-stationary model would be difficult to devise
       • some mild constraints on P and hence Q are still necessary
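  19. The Three Taxon Topology
     [Figure: a three-taxon tree with root distribution $\pi(0)$ and leaves Dog, TombBat and Pangolin; each branch carries its own rate matrix and length: $(Q_D, t_D)$, $(Q_T, t_T)$ and $(Q_P, t_P)$]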
  20. Fitting Non-Stationary Models in PyCogent
     • Demonstration of model generalisation
     • Fit GTR and General
  21. Mild Constraints
     • For the General model to be identifiable (and consistent), its Ps must be Reconstructible from Rows
     • If the Ps are Reconstructible from Rows, you can’t relabel states at internal nodes to achieve the same likelihood values
     • We can check a criterion that implies that a matrix is Reconstructible from Rows: Diagonal Largest in Column
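Diagonal Largest in Column (DLC) is easy to test directly: in every column of P, the diagonal entry must be strictly larger than every other entry in that column. The sketch below is a plain-Python illustration with made-up matrices; `is_dlc` is a hypothetical helper and not the PyCogent implementation the talk refers to.

```python
def is_dlc(p):
    """True if, in every column j, the diagonal entry p[j][j] strictly exceeds all other entries."""
    n = len(p)
    return all(p[j][j] > p[i][j] for j in range(n) for i in range(n) if i != j)


# A made-up transition matrix close to the identity: satisfies DLC.
p_good = [
    [0.70, 0.10, 0.10, 0.10],
    [0.05, 0.80, 0.05, 0.10],
    [0.10, 0.10, 0.75, 0.05],
    [0.10, 0.05, 0.05, 0.80],
]

# Swapping the first two rows mimics relabelling two states at a node: DLC now fails.
p_relabelled = [p_good[1], p_good[0], p_good[2], p_good[3]]

good_ok = is_dlc(p_good)
relabelled_ok = is_dlc(p_relabelled)
```

The relabelled matrix has the same rows in a different order, which is exactly the ambiguity that the Reconstructible from Rows condition rules out.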
  22. Diagonal Largest in Column
     • Check Diagonal Largest in Column
     • Availability in PyCogent is forthcoming
  23. Quantifying Plausibility
     • Likelihood ratio tests between rungs on the Felsenstein Hierarchy have been used to justify the use of more general models
     • Best of a bad lot is still bad
     • We can use parametric bootstrap to at least outright reject implausible models
     • We use the G-statistic to quantify goodness-of-fit for the bootstraps
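The G-statistic compares observed counts O_i with the counts E_i expected under a fitted model: G = 2 Σ O_i ln(O_i / E_i), where categories with zero observed count contribute nothing. A minimal sketch follows; the function name `g_statistic` and the example counts are illustrative only.

```python
import math


def g_statistic(observed, expected):
    """G = 2 * sum(O * ln(O / E)); categories with O == 0 contribute zero."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)


# Perfect agreement gives G = 0; a mismatch gives a positive value.
g_perfect = g_statistic([10, 20], [10, 20])
g_mismatch = g_statistic([30, 10], [20, 20])
```

In the phylogenetic setting the categories would be alignment column patterns, with expected counts taken from the fitted model's pattern probabilities.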
  24. Parametric Bootstrap
     • A really simple parametric bootstrap
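The code shown on this slide is not in the transcript. As a stand-in, here is a deliberately tiny parametric bootstrap on a toy problem rather than a phylogenetic one: the "model" assumes base frequencies with π_A = π_T and π_C = π_G, is fitted by maximum likelihood, and is judged by the G-statistic. All names (`fit_gc_model`, `parametric_bootstrap`, and so on) and all counts are hypothetical.

```python
import math
import random

BASES = "ACGT"


def g_statistic(observed, expected):
    """G = 2 * sum(O * ln(O / E)); categories with O == 0 contribute zero."""
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)


def fit_gc_model(counts):
    """ML fit of the toy model pi_G = pi_C = gc/2 and pi_A = pi_T = (1 - gc)/2."""
    n = sum(counts.values())
    gc = (counts["G"] + counts["C"]) / n
    return {"A": (1 - gc) / 2, "C": gc / 2, "G": gc / 2, "T": (1 - gc) / 2}


def model_g(counts):
    """Fit the toy model to the counts and return the G-statistic of that fit."""
    n = sum(counts.values())
    probs = fit_gc_model(counts)
    return g_statistic([counts[b] for b in BASES], [n * probs[b] for b in BASES])


def simulate(probs, n, rng):
    """Draw n bases from the fitted model and return their counts."""
    draws = rng.choices(BASES, weights=[probs[b] for b in BASES], k=n)
    return {b: draws.count(b) for b in BASES}


def parametric_bootstrap(counts, n_reps=200, seed=0):
    """Return (observed G, bootstrap p-value) for the toy model."""
    rng = random.Random(seed)
    n = sum(counts.values())
    observed_g = model_g(counts)
    fitted = fit_gc_model(counts)
    # Simulate from the fitted model, refit on each replicate, and collect G.
    null_gs = [model_g(simulate(fitted, n, rng)) for _ in range(n_reps)]
    p_value = (1 + sum(g >= observed_g for g in null_gs)) / (n_reps + 1)
    return observed_g, p_value


# Invented counts: the first set violates pi_A = pi_T, the second satisfies it.
g_bad, p_bad = parametric_bootstrap({"A": 400, "C": 250, "G": 250, "T": 100})
g_ok, p_ok = parametric_bootstrap({"A": 250, "C": 250, "G": 250, "T": 250})
```

A small p-value means the observed G is larger than almost anything the fitted model generates, so the model can be rejected as implausible; the same recipe applies when the model is a substitution model and the data are alignment columns.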
  25. Life Exists Beyond GTR
  26. Application: Genetic Distance
     • Genetic distance is fundamental in the field of molecular evolution
     • Common to use the Expected Number of Substitutions (ENS) along a branch:
       $$d_{ENS} = -\int_0^t \pi(s)\,ds \cdot \mathrm{diag}(Q)$$
       where diag(Q) is the vector of diagonal entries of Q, which are non-positive
     • For stationary models π is constant so
       $$d_{ENS} = -\pi \cdot \mathrm{diag}(Q)\, t$$
     • PyCogent automatically calibrates Q so that
       $$-\pi \cdot \mathrm{diag}(Q) = 1$$
     • Which means that in PyCogent, t, the branch length, always equals ENS, but only for stationary models
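The difference between the stationary and non-stationary cases can be seen numerically. This sketch takes ENS to be minus the integral over [0, t] of π(s)·diag(Q), with π(s) obeying dπ/ds = πQ, and integrates both together with a classical Runge-Kutta (RK4) scheme. It uses an F81-style Q calibrated so that −π·diag(Q) = 1; the frequencies, the starting composition and all function names are assumptions made for illustration, and this is not the method used in the talk.

```python
def f81_rate_matrix(pi):
    """F81-style Q (q_ij = pi_j for i != j), scaled so that -pi . diag(Q) = 1."""
    n = len(pi)
    q = [[pi[j] if i != j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        q[i][i] = -sum(q[i])  # the diagonal is 0.0 at this point
    scale = -sum(pi[i] * q[i][i] for i in range(n))
    return [[x / scale for x in row] for row in q]


def expected_substitutions(pi0, q, t, steps=1000):
    """Integrate d(pi)/ds = pi Q and d(ENS)/ds = -pi . diag(Q) from 0 to t with RK4."""
    n = len(pi0)

    def deriv(state):
        p = state[:n]
        flow = [sum(p[i] * q[i][j] for i in range(n)) for j in range(n)]
        subs_rate = -sum(p[i] * q[i][i] for i in range(n))
        return flow + [subs_rate]

    h = t / steps
    state = list(pi0) + [0.0]  # the last component accumulates ENS
    for _ in range(steps):
        k1 = deriv(state)
        k2 = deriv([x + 0.5 * h * k for x, k in zip(state, k1)])
        k3 = deriv([x + 0.5 * h * k for x, k in zip(state, k2)])
        k4 = deriv([x + h * k for x, k in zip(state, k3)])
        state = [x + h * (a + 2 * b + 2 * c + d) / 6
                 for x, a, b, c, d in zip(state, k1, k2, k3, k4)]
    return state[n]


pi = [0.1, 0.2, 0.3, 0.4]          # hypothetical stationary frequencies (A, C, G, T)
Q = f81_rate_matrix(pi)
t = 0.5

ens_stationary = expected_substitutions(pi, Q, t)                        # starts at pi
ens_non_stationary = expected_substitutions([0.7, 0.1, 0.1, 0.1], Q, t)  # starts elsewhere
```

When the process starts at the stationary distribution, ENS equals the branch length t; when it starts elsewhere, the same Q and t give a different ENS, so t alone no longer measures genetic distance.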
  27. Doing it Right
     • How to extract Q, π, and t in PyCogent
     • The Van Loan method for exponential integration
     • Availability in PyCogent is forthcoming
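The slide names the Van Loan method without giving details. The identity usually meant by that name expresses an integral of a matrix exponential as a block of a larger matrix exponential; how exactly it is applied in the talk is an assumption. With I the 4×4 identity and π(s) = π(0)e^{Qs}:

```latex
\exp\!\left( t \begin{pmatrix} Q & I \\ 0 & 0 \end{pmatrix} \right)
  = \begin{pmatrix} e^{Qt} & \int_0^t e^{Qs}\,ds \\ 0 & I \end{pmatrix},
\qquad
d_{ENS} = -\,\pi(0) \left( \int_0^t e^{Qs}\,ds \right) \mathrm{diag}(Q)
```

Under this reading, a single 8×8 matrix exponential yields both P(t) and the integral needed for ENS, with no numerical quadrature.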
  28. Conclusion
     • The PyCogent API provides programmatic access to leading edge phylogenetic tools
     • Python is a great language for writing any extensions you need
     • We will put it all together in this afternoon’s workshop
