Quantifying MCMC exploration of phylogenetic tree space

A talk given at the Algorithms for Threat Detection Program Review, March 10th, 2014.

  1. 1. Quantifying MCMC exploration of phylogenetic tree space Christopher Whidden and Frederick “Erick” A. Matsen IV Fred Hutchinson Cancer Research Center http://matsen.fhcrc.org @ematsen
  2. 2. Phylogenetics: reconstruct evolutionary history from DNA [Figure labels: DNA or RNA sequence data; "phylogenetics"; armadillo, human, rat, giraffe]
  3. 3. Phylogenetics helps us learn how HIV-1 came to be Etienne, Hahn, Sharp, Matsen and Emerman, Cell Host & Microbe, 2013
  4. 4. We are fond of statistical approaches to phylogenetics These are important when one needs a clear notion of uncertainty, as in medicine, epidemiology, and biodefense!
  5. 5. We are fond of statistical approaches to phylogenetics In particular, Bayesian methods fall into this category and have become quite popular. ACATGGCTC... ATACGTTCC... TTACGGTTC... ATCCGGTAC... ATACAGTCT... ... We can’t solve for this posterior distribution, but we can satisfy our needs by getting a big sample from it.
  6. 6. Markov chain Monte Carlo (MCMC) Metropolis et al., 1953. Set up a simulation such that the amount of time spent in a given state is proportional to the posterior probability of that state.
  7. 7. Here we want a posterior on trees If we want to use the same strategy to get a posterior on phylogenetic trees... [Figure: aligned sequences ACATGGCTC..., ATACGTTCC..., TTACGGTTC..., ATCCGGTAC..., ATACAGTCT..., ...] ...we need a way to move from one phylogenetic tree to another.
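  8. 8. Subtree-prune-regraft (SPR) definition [Figure: a tree on leaves 1 to 6; the subtree on leaves 2 and 3 is pruned, leaving leaves 1, 4, 5, 6, and then regrafted to give the leaf order 1 4 5 2 3 6]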
  9. 9. The set of trees as a graph connected by SPR moves (Figure from Mossel and Vigoda, Science, 2005).
  10. 10. This graph is connected, and every tree has nonzero posterior probability, so MCMC works† We are guaranteed to converge to the posterior distribution on trees by using Metropolis-Hastings moves built on these SPRs. That is, by bouncing around “tree space” we can get a good idea of a set of good trees. † That is, it works if we run the MCMC forever
  11. 11. We can’t run it forever. News flash: 5 million < ∞
  12. 12. With pathological data, can be hard to traverse peaks [Figure: axis labeled "goodness"]
  13. 13. We wanted to know: does this happen in real data sets? Lots of discussion in literature, but few clear conclusions. In order to understand the reasons differentiating “easy” and “difficult” data sets for phylogenetic MCMC, we wanted to make it possible to visualize tree space with a relevant geometry. So, what trees are close to each other in terms of SPR moves?
  14. 14. dSPR: how many SPR moves from one tree to another? Say $T_1 \sim T_2$ if there is an SPR transformation of $T_1$ to $T_2$. $d_{\mathrm{SPR}}(T, S) = \min_{T \sim T_1 \sim \cdots \sim T_k = S} k$ This distance is NP-hard to compute. That’s no fun!
  15. 15. Meet Chris Whidden, algorithms strongman In a series of four very technical papers, Chris took exact computation of dSPR from O(infeasible) to O(feasible). Then he joined my group!
  16. 16. Let’s take some common data sets and see what we see These are completely standard data sets of the sort that biologists analyze every day: slowly evolving nuclear, mitochondrial, or chloroplast genes. Also used as examples in: Lakner et al., Syst. Biol., 2008; Höhna and Drummond, Syst. Biol., 2012; Larget, Syst. Biol., 2013
  17. 17. Interested in high probability subsets of the SPR graph
  18. 18. Summarize by subsetting to high probability nodes node size proportional to posterior probability, and color shows distance to the highest PP tree.
  19. 19. The top 4096 trees for a data set
  20. 20. The top 4096 trees for a data set What's up with this stuff? Is it important? Is it difficult for the MCMC to see?
  21. 21. Commute time definition Commute time for a node y : how long to make the round trip from y to the highest posterior probability tree and back? Any round trip path counts!
  23. 23. Commute time plot for this data set
  24. 24. The separation is problematic indeed Yep, those parts of the posterior are important and MCMC has trouble entering them.
  25. 25. Trees with 95% of posterior probability for another data set
  26. 26. We can use our methods to identify source of bottlenecks [Figure: two trees on the same set of taxa, from Hyla_cinerea to Latimeria_chalumnae] These are the trees at the two peaks of the connected components. Indeed, it’s very tricky to get between them!
  27. 27. Multidimensional scaling visualizations via dSPR
  28. 28. In general, a new way to explore tree space
  29. 29. Our applications: it’s party time Automatic identification of (multiple) peaks in posteriors; performance of Metropolis-coupled Markov chain Monte Carlo for getting between peaks; accuracy of new “mean-field” posterior probability approximations; the first topological convergence diagnostic. These empirical investigations set the stage for additional theoretical development, and suggest new ways to move around tree space. This will translate into better phylogenetic uncertainty estimates, and hence better preparedness and response to biological threats.
  30. 30. Thank you Robert Beiko (Dalhousie University); Aaron Darling (University of Technology, Sydney); Connor McCoy (Fred Hutchinson Cancer Research Center). NSF award 1223057
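
Slide 6 describes MCMC as a simulation in which the time spent in a state is proportional to that state's posterior probability. The sketch below illustrates that idea on a toy problem and is not the sampler used in the talk. It assumes a handful of states arranged in a ring, unnormalized weights standing in for posterior probabilities, and a symmetric "step left or right" proposal; the function name `metropolis` and all parameters are hypothetical.

```python
import random
from collections import Counter


def metropolis(weights, steps, seed=0):
    """Run a Metropolis chain on states 0..n-1 arranged in a ring.

    weights: unnormalized target probabilities (assumed positive).
    Returns the fraction of steps spent in each state.
    """
    rng = random.Random(seed)
    n = len(weights)
    state = 0
    visits = Counter()
    for _ in range(steps):
        # Symmetric proposal: move one position left or right on the ring.
        proposal = (state + rng.choice((-1, 1))) % n
        # Accept with probability min(1, target ratio); otherwise stay put.
        if rng.random() < min(1.0, weights[proposal] / weights[state]):
            state = proposal
        visits[state] += 1
    return [visits[i] / steps for i in range(n)]


if __name__ == "__main__":
    # Time spent should approach weights / sum(weights) = 0.1, 0.2, 0.3, 0.4.
    print(metropolis([1, 2, 3, 4], steps=200_000))
```

Because the proposal is symmetric, the plain Metropolis acceptance ratio is enough here; an asymmetric proposal would need the Hastings correction.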
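
Slides 8 and 14 define an SPR move and the distance dSPR as the fewest SPR moves between two trees. The following brute-force sketch makes those definitions concrete for tiny rooted binary trees written as nested tuples. It is an assumption-laden illustration: it uses rooted SPR, enumerates neighbors exhaustively, and finds the distance by breadth-first search, so it is only usable for a few leaves and is not the exact algorithm from the papers mentioned on slide 15. All function names are hypothetical.

```python
from collections import deque


def min_leaf(t):
    return min(min_leaf(c) for c in t) if isinstance(t, tuple) else t


def canon(t):
    """Canonical form: order the two children of every node by smallest leaf."""
    if not isinstance(t, tuple):
        return t
    a, b = canon(t[0]), canon(t[1])
    return (a, b) if min_leaf(a) < min_leaf(b) else (b, a)


def nodes(t, path=()):
    """Yield (path, subtree) for every node; a path is a tuple of child indices."""
    yield path, t
    if isinstance(t, tuple):
        for i, child in enumerate(t):
            yield from nodes(child, path + (i,))


def replace(t, path, new):
    if not path:
        return new
    children = list(t)
    children[path[0]] = replace(t[path[0]], path[1:], new)
    return tuple(children)


def spr_neighbors(t):
    """All distinct trees one rooted SPR move away from t."""
    t = canon(t)
    out = set()
    for path, sub in nodes(t):
        if not path:
            continue  # the whole tree cannot be pruned
        parent = t
        for i in path[:-1]:
            parent = parent[i]
        sibling = parent[1 - path[-1]]
        # Prune: drop `sub` and let its sibling take the parent's place.
        rest = replace(t, path[:-1], sibling)
        # Regraft: attach `sub` above every node of what remains.
        for target_path, target in nodes(rest):
            cand = canon(replace(rest, target_path, (target, sub)))
            if cand != t:
                out.add(cand)
    return out


def spr_distance(a, b):
    """Fewest rooted SPR moves from a to b, by breadth-first search."""
    start, goal = canon(a), canon(b)
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        tree, dist = queue.popleft()
        if tree == goal:
            return dist
        for nb in spr_neighbors(tree):
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, dist + 1))
    raise ValueError("trees must share the same leaf set")
```

For example, `spr_distance((((1, 2), 3), 4), (((1, 2), 4), 3))` is 1: prune leaf 3 and regraft it above the root of what remains. The search space grows far too quickly for this approach to work on real data sets, which is the problem slide 15 refers to.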
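
Slide 21 defines the commute time for a node y as the length of the round trip from y to the highest posterior probability tree and back. One way to make this concrete, assuming the walk is a Markov chain with a known transition matrix, is to add the two expected hitting times. The sketch below solves the standard hitting-time equations with plain Gaussian elimination; the talk does not say how commute times were actually computed, so treat this as an illustration, with hypothetical function names.

```python
def hitting_times(P, target):
    """Expected number of steps to first reach `target` from each state.

    P is a row-stochastic transition matrix given as a list of lists.
    Solves h[i] = 1 + sum_j P[i][j] * h[j] for i != target, with h[target] = 0.
    Assumes `target` is reachable from every state.
    """
    n = len(P)
    idx = [i for i in range(n) if i != target]
    m = len(idx)
    # Augmented matrix for (I - Q) h = 1, where Q drops the target row/column.
    A = [[(1.0 if i == j else 0.0) - P[i][j] for j in idx] + [1.0] for i in idx]
    for col in range(m):
        # Partial pivoting, then eliminate this column from all other rows.
        pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(m):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    h = [0.0] * n
    for k, i in enumerate(idx):
        h[i] = A[k][m] / A[k][k]
    return h


def commute_time(P, x, y):
    """Expected round-trip time: from y to x, then from x back to y."""
    return hitting_times(P, x)[y] + hitting_times(P, y)[x]


if __name__ == "__main__":
    # Two states: leaving state 0 takes 1/0.5 = 2 steps on average,
    # leaving state 1 takes 1/0.25 = 4, so the round trip is 6.
    P = [[0.5, 0.5], [0.25, 0.75]]
    print(commute_time(P, 0, 1))
```

A large commute time relative to the graph distance between two trees is the kind of signal slide 24 reads as a bottleneck.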
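
Slides 18, 19 and 25 restrict attention to high probability trees, such as the top 4096 trees or the trees holding 95% of the posterior probability. A minimal sketch of the second kind of subset, assuming posterior probabilities are estimated by how often each topology appears in the MCMC sample, might look like this; `credible_set` and its arguments are hypothetical names, and the trees are represented by any hashable label.

```python
from collections import Counter


def credible_set(samples, mass=0.95):
    """Smallest set of most-frequent trees covering at least `mass` of the sample.

    samples: iterable of hashable tree labels, one per MCMC sample.
    Returns a list of (tree, estimated posterior probability) pairs,
    most probable first.
    """
    counts = Counter(samples)
    total = sum(counts.values())
    chosen, covered = [], 0
    for tree, count in counts.most_common():
        chosen.append((tree, count / total))
        covered += count
        if covered >= mass * total:
            break
    return chosen


if __name__ == "__main__":
    sample = ["A"] * 6 + ["B"] * 3 + ["C"]
    # Covers 85% of the sample with the two most frequent trees.
    print(credible_set(sample, mass=0.85))
```

The first entry of the result is the highest posterior probability tree, which is the reference point for the distance coloring on slide 18 and for the commute times on slide 21.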
