A probabilistic parsimonious model for species tree reconstruction

Talk presented at the Evolution Meeting 2013 (http://www.evolutionmeeting.org/engine/search/index.php?func=detail&aid=478)

Transcript

  • 1. A probabilistic parsimonious model for species tree reconstruction ● Leonardo de Oliveira Martins (leomrtns@uvigo.es) ● David Posada (dposada@uvigo.es) ● with invaluable help from Klaus Schliep and Diego Mallo
  • 2. What do we want ● To estimate species trees given arbitrary gene families ← can contain paralogs, missing data, etc. ● To account for uncertainty in gene tree and species tree estimation ← some gene families may be more informative, or maybe we don't have signal at all ● To allow for several sources of disagreement ← real data can seldom be explained by just one biological phenomenon ● Fast computation ← improvement provided by slower, fully probabilistic methods may be elusive, and they can benefit from our output nonetheless
  • 3. Outline ● Model of gene family evolution ● Parsimonious estimation of disagreement * reconciliation * distance between trees ● Hierarchical Bayesian model ● Examples * comparing many trees * simulation * TreeFam data set
  • 4. Model for the evolution of gene families [diagram: species tree S; G1, D1; G2, D2; ...; Gn, Dn]
  • 5. Model for the evolution of gene families [diagram: S, G1, D1] P(G|S) Our assumption: we just need to consider the simplest explanation for the difference between the gene and species trees ● distance between G and S ● we may use several such simple explanations
  • 6. Model for the evolution of gene families [diagram: S, G1, D1] P(G|S) Our assumption: we just need to consider the simplest explanation for the difference between the gene and species trees ● distance between G and S ● we may use several such simple explanations ● work with unrooted gene trees ● penalize gene trees very different from species tree ● ML supertrees: Rodrigo and Steel. 2008. Syst Biol 57: 243
  • 7. Outline ● Model of gene family evolution ● Parsimonious estimation of disagreement * reconciliation * distance between trees ● Hierarchical Bayesian model ● Examples * comparing many trees * simulation * TreeFam data set
  • 8. Quantifying the disagreement [figure: gene tree, species tree, reconciliation] ● assuming deepcoal: 1 deepcoal ● assuming duplosses: 1 dup, 3 losses ● assuming HGT: 1 event
  • 9. Quantifying the disagreement [figure: gene tree, species tree, reconciliation] ● assuming deepcoal: 1 deepcoal ● assuming duplosses: 1 dup, 3 losses ● assuming HGT: 1 event ● Stochastic error/nonparametric
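
To make the "assuming duplosses" count concrete, here is a minimal sketch of duplication counting by LCA reconciliation, a standard gene tree parsimony technique. It is an illustration only, not the authors' code: the function names are hypothetical, trees are assumed to be rooted and written as nested tuples with leaves labelled by species name, and the example trees are not the ones shown on the slide. The talk works with unrooted gene trees, which would additionally require considering rootings; that step is omitted here.

```python
def clades(tree):
    """Return the leaf set (as a frozenset) of every node in a nested-tuple tree."""
    found = set()

    def visit(node):
        if isinstance(node, str):
            leaves = frozenset([node])
        else:
            leaves = frozenset().union(*(visit(child) for child in node))
        found.add(leaves)
        return leaves

    visit(tree)
    return found


def lca_map(species_set, species_clades):
    """Smallest species-tree clade containing species_set (the LCA mapping)."""
    return min((c for c in species_clades if species_set <= c), key=len)


def count_duplications(gene_tree, species_tree):
    """Count gene-tree nodes that map to the same species-tree node as a child."""
    species_clades = clades(species_tree)
    duplications = 0

    def visit(node):
        nonlocal duplications
        if isinstance(node, str):
            return lca_map(frozenset([node]), species_clades)
        child_maps = [visit(child) for child in node]
        here = lca_map(frozenset().union(*child_maps), species_clades)
        if here in child_maps:  # node and a child share the same mapping
            duplications += 1
        return here

    visit(gene_tree)
    return duplications


species = (("A", "B"), "C")
print(count_duplications((("A", "C"), "B"), species))  # discordant gene tree
print(count_duplications((("A", "B"), "C"), species))  # concordant gene tree
```

Gene families with paralogs are handled by repeating a species name among the gene tree leaves, for example `(("A", "A"), ("B", "C"))`.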
  • 10. Outline ● Model of gene family evolution ● Parsimonious estimation of disagreement * reconciliation * distance between trees ● Hierarchical Bayesian model ● Examples * comparing many trees * simulation * TreeFam data set
  • 11. Quantifying the disagreement – other measures mul-tree version: Chaudhary R, Burleigh JG, Fernández-Baca D (2013) Inferring Species Trees from Incongruent Multi-Copy Gene Trees Using the Robinson-Foulds Distance. arXiv:1210.2665
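
For reference, the classic single-copy Robinson-Foulds distance can be sketched in a few lines: it counts the non-trivial bipartitions found in one unrooted tree but not the other. This is a minimal illustration under stated assumptions, not the mul-tree version cited on the slide: trees are nested tuples with unique leaf labels, both trees share the same leaf set, and the function names are hypothetical.

```python
def bipartitions(tree):
    """Non-trivial splits of an unrooted tree given as a nested tuple."""
    subtree_leaf_sets = []

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(visit(child) for child in node))
        subtree_leaf_sets.append(leaves)
        return leaves

    all_leaves = visit(tree)
    reference = min(all_leaves)  # fixed leaf used to pick one side of each split
    splits = set()
    for side in subtree_leaf_sets:
        # keep only splits with at least two leaves on each side
        if 2 <= len(side) <= len(all_leaves) - 2:
            splits.add(all_leaves - side if reference in side else side)
    return splits


def rf_distance(tree1, tree2):
    """Number of splits present in exactly one of the two trees."""
    return len(bipartitions(tree1) ^ bipartitions(tree2))


print(rf_distance((("A", "B"), ("C", "D")), (("A", "C"), ("B", "D"))))
```

Because each split is stored as the side that excludes a fixed reference leaf, the result does not depend on where the nested-tuple representation happens to be rooted.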
  • 12. Quantifying the disagreement – other measures de Oliveira Martins et al. (2008) Phylogenetic Detection of Recombination with a Bayesian Prior on the Distance between Trees. PLoS ONE 3(7): e2651.
  • 13. Quantifying the disagreement – other measures see also: Whidden et al. (2013) Supertrees based on the subtree prune-and-regraft distance. PeerJ PrePrints 1:e18v1
  • 14. Quantifying the disagreement – other measures Hdist similar to: Nye TMW, Liò P, Gilks WR (2006) A novel algorithm and web-based tool for comparing two alternative phylogenetic trees. Bioinformatics 22: 117-119
  • 15. Now we have estimates for these ● assuming deepcoal: 1 deepcoal ● assuming duplosses: 1 dup, 3 losses ● assuming HGT: 1 event ● Stochastic error/nonparametric
  • 16. Now we have estimates for these ● assuming deepcoal: 1 deepcoal (gene tree parsimony) ● assuming duplosses: 1 dup, 3 losses ● assuming HGT: 1 event ● Stochastic error/nonparametric
  • 17. Now we have estimates for these ● assuming deepcoal: 1 deepcoal (gene tree parsimony) ● assuming duplosses: 1 dup, 3 losses (gene tree parsimony) ● assuming HGT: 1 event ● Stochastic error/nonparametric
  • 18. Now we have estimates for these ● assuming deepcoal: 1 deepcoal (gene tree parsimony) ● assuming duplosses: 1 dup, 3 losses (gene tree parsimony) ● assuming HGT: 1 event ((approximate) dSPR) ● Stochastic error/nonparametric
  • 19. Now we have estimates for these ● assuming deepcoal: 1 deepcoal (gene tree parsimony) ● assuming duplosses: 1 dup, 3 losses (gene tree parsimony) ● assuming HGT: 1 event ((approximate) dSPR) ● Stochastic error/nonparametric: RF, Hdist
  • 20. Considering several measures of disagreement: Thus we can incorporate e.g. duplications and losses while accounting for HGT and random errors Easy to include other distances in the future
  • 21. Considering several measures of disagreement: Thus we can incorporate e.g. duplications and losses while accounting for HGT and random errors Easy to include other distances in the future Problem: the normalization constant Ref.: Bryant D, Steel M (2009) Computing the Distribution of a Tree Metric. TCBB: 420 – 426 Solution: importance sampling estimate of Z(.) E.g.: Rodrigue N, Kleinman CL, Philippe H, Lartillot N (2009) Computational Methods for Evaluating Phylogenetic Models of Coding Sequence Evolution with Dependence between Codons. Mol Biol Evol 26: 1663-1676.
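
The normalization problem and its importance sampling solution can be illustrated on a toy example. The sketch below assumes an exponential penalty on several distances, P(G|S) = exp(-Σ_k λ_k d_k(G,S)) / Z(λ), in the spirit of the ML supertree model cited earlier; the exact form used in the talk is not given in the transcript. The tree space is replaced by a small made-up list of distance vectors so that the exact Z(λ) can be computed for comparison, and all names, the λ values, and the uniform proposal are hypothetical.

```python
import math
import random


def unnormalized_weight(distances, lambdas):
    """exp(-sum_k lambda_k * d_k) for one gene tree's vector of distances."""
    return math.exp(-sum(lam * d for lam, d in zip(lambdas, distances)))


def exact_z(space, lambdas):
    """Normalization constant by full enumeration (only feasible for toy spaces)."""
    return sum(unnormalized_weight(d, lambdas) for d in space)


def importance_sampling_z(space, lambdas, proposal, n_samples, rng):
    """Estimate Z as the average of weight / proposal probability over draws."""
    draws = rng.choices(range(len(space)), weights=proposal, k=n_samples)
    total = sum(unnormalized_weight(space[i], lambdas) / proposal[i] for i in draws)
    return total / n_samples


# Made-up toy "tree space": one (d_RF, d_SPR) pair per candidate gene tree.
toy_space = [(0, 0), (2, 1), (2, 1), (4, 1), (4, 2), (4, 2), (6, 2), (6, 3)]
lambdas = (0.5, 0.3)  # hypothetical penalty strengths
uniform = [1.0 / len(toy_space)] * len(toy_space)  # proposal probabilities, sum to 1

rng = random.Random(1)
estimate = importance_sampling_z(toy_space, lambdas, uniform, 20000, rng)
exact = exact_z(toy_space, lambdas)
print(f"importance sampling: {estimate:.3f}, exact: {exact:.3f}")
```

In a real tree space enumeration is impossible, which is why only the sampled estimate is available; a proposal concentrated near the species tree would typically lower the variance compared with the uniform one used here.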
  • 22. Outline ● Model of gene family evolution ● Parsimonious estimation of disagreement * reconciliation * distance between trees ● Hierarchical Bayesian model ● Examples * comparing many trees * simulation * TreeFam data set
  • 23. Distribution of gene trees: probabilistic model [diagram: species tree S; G1, D1, Q1; ...; Gn, Dn, Qn]
  • 24. Distribution of gene trees: probabilistic model [diagram: species tree S; G1, D1, Q1; ...; Gn, Dn, Qn; λdup_1 ... λdup_n with λdup prior]
  • 25. Distribution of gene trees: probabilistic model [diagram: species tree S; G1, D1, Q1; ...; Gn, Dn, Qn; λdup_1 ... λdup_n, λloss_1 ... λloss_n, λspr_1 ... λspr_n with λdup prior, λloss prior, λspr prior]
  • 26. Distribution of gene trees: probabilistic model [diagram: species tree S; G1 ... Gn; λdup_1 ... λdup_n, λloss_1 ... λloss_n, λspr_1 ... λspr_n with λdup prior, λloss prior, λspr prior] ● Importance Sampling: so we can use complex, state-of-the-art software for phylogenetic inference
  • 27. Distribution of gene trees: probabilistic model [diagram: species tree S; G1 ... Gn; λdup_1 ... λdup_n, λloss_1 ... λloss_n, λspr_1 ... λspr_n with λdup prior, λloss prior, λspr prior; label: Input] ● Importance Sampling: so we can use complex, state-of-the-art software for phylogenetic inference
  • 28. Distribution of gene trees: probabilistic model [diagram: species tree S; G1 ... Gn; λdup_1 ... λdup_n, λloss_1 ... λloss_n, λspr_1 ... λspr_n with λdup prior, λloss prior, λspr prior; label: Output] ● Importance Sampling: so we can use complex, state-of-the-art software for phylogenetic inference
  • 29. Distribution of gene trees: probabilistic model [diagram: species tree S; G1 ... Gn; λdup_1 ... λdup_n, λloss_1 ... λloss_n, λspr_1 ... λspr_n with λdup prior, λloss prior, λspr prior; label: Output] ● Importance Sampling: so we can use complex, state-of-the-art software for phylogenetic inference ● We should not rely on single estimates of gene phylogenies. E.g.: Boussau B, Szollosi GJ, Duret L, Gouy M, Tannier E, Daubin V. (2012) Genome-scale coestimation of species and gene trees. Genome research 23: 323-330.
  • 30. Outline ● Model of gene family evolution ● Parsimonious estimation of disagreement * reconciliation * distance between trees ● Hierarchical Bayesian model ● Examples * comparing many trees * simulation * TreeFam data set
  • 31. Example: distances between gene families ● 567 single-copy gene trees for 23 species. Data from: Salichos L, Rokas A (2013) Inferring ancient divergences requires genes with strong phylogenetic signals. Nature 497: 327–331 ● Analysis under a model where only RF, Hdist and dSPR are considered ● Not interested in the data set per se (unreliable) ● Use it just as a didactic tool to show how the model works
  • 32. Example: distances between gene families ● 567 single-copy gene trees for 23 species. Data from: Salichos L, Rokas A (2013) Inferring ancient divergences requires genes with strong phylogenetic signals. Nature 497: 327–331 ● Analysis under a model where only RF, Hdist and dSPR are considered ● Not interested in the data set per se (unreliable) ● Use it just as a didactic tool to show how the model works [plot panels: RF, Hdist, SPR]
  • 33. Example: distances between gene families [plot panels: RF, Hdist, SPR]
  • 34. Example: distances between gene families ● Posterior samples [plot panels: RF, Hdist, SPR]
  • 35. Example: distances between gene families ● Posterior samples ● best estimate [plot panels: RF, Hdist, SPR]
  • 36. Outline ● Model of gene family evolution ● Parsimonious estimation of disagreement * reconciliation * distance between trees ● Hierarchical Bayesian model ● Examples * comparing many trees * simulation * TreeFam data set
  • 37. Analysis of simulated data sets ● Fully probabilistic simulation of gene trees by Diego Mallo and David Posada ● Birth and death of new loci, conditioned on a multispecies coalescent, followed by sequence evolution ● We use gene trees only, and simulate tree inference error. Idea from: Rasmussen MD, Kellis M (2012) Unified modeling of gene duplication, loss, and coalescence using a locus tree. Genome Res. 22: 755-765
  • 38. Analysis of simulated data sets – results
  • 39. Analysis of simulated data sets – results
  • 40. Analysis of simulated data sets – results
  • 41. Outline ● Model of gene family evolution ● Parsimonious estimation of disagreement * reconciliation * distance between trees ● Hierarchical Bayesian model ● Examples * comparing many trees * simulation * TreeFam data set
  • 42. Single copy genes from Drosophila (TreeFam) ● 4591 informative, single-copy gene families ● (TreeFam database has 14250 informative gene families)
  • 43. Single copy genes from Drosophila (TreeFam) ● 4591 informative, single-copy gene families ● (TreeFam database has 14250 informative gene families)
  • 44. Single copy genes from Drosophila (TreeFam) ● 4591 informative, single-copy gene families Estimated species tree: ● Root location uncertain
  • 45. Single copy genes from Drosophila (TreeFam) ● 4591 informative, single-copy gene families Estimated species tree: ● Root location uncertain ● Only one unrooted topology
  • 46. Large gene families from Drosophila (TreeFam) ● 43 gene families with 102~295 tips
  • 47. Large gene families from Drosophila (TreeFam) ● 43 gene families with 102~295 tips best species tree: ~100%
  • 48. To recap, our model can ● Estimate species trees given arbitrary gene families ← can contain paralogs, missing data, etc. The larger, the better, especially for rooting the species tree ● Account for uncertainty in gene tree and species tree estimation ← some gene families may be more informative, or maybe we don't have signal at all. Do not assume gene trees are known: embrace ignorance! ● Allow for several sources of disagreement ← real data can seldom be explained by just one biological phenomenon. Different gene families may be the product of distinct processes ● Be fast ← improvement provided by slower, fully probabilistic methods may be elusive, and they can benefit from our output nonetheless. It's parallelized, and all distances can be calculated very fast.
  • 49. Check out http://darwin.uvigo.es for announcements, code, slides... Thank you!