A probabilistic parsimonious model
for species tree reconstruction

Leonardo de Oliveira Martins
David Posada

Talk presented at the Evolution Meeting 2013 (http://www.evolutionmeeting.org/engine/search/index.php?func=detail&aid=478)


  1. A probabilistic parsimonious model for species tree reconstruction. Leonardo de Oliveira Martins (leomrtns@uvigo.es) and David Posada (dposada@uvigo.es), with invaluable help from Klaus Schliep and Diego Mallo
  2. What do we want
     ● To estimate species trees given arbitrary gene families ← these can contain paralogs, missing data, etc.
     ● To account for uncertainty in gene tree and species tree estimation ← some gene families may be more informative, or maybe we don't have signal at all
     ● To allow for several sources of disagreement ← real data can seldom be explained by just one biological phenomenon
     ● Fast computation ← the improvement provided by slower, fully probabilistic methods may be elusive, and they can benefit from our output nonetheless
  3. Outline
     Model of gene family evolution
     Parsimonious estimation of disagreement
       * reconciliation
       * distance between trees
     Hierarchical Bayesian model
     Examples
       * comparing many trees
       * simulation
       * TreeFam data set
  4. Model for the evolution of gene families [diagram with nodes S, G1/D1, G2/D2, ..., Gn/Dn]
  5–6. Model for the evolution of gene families [diagram with nodes S, G1, D1 and the term P(G|S)]
     Our assumption: we just need to consider the simplest explanation for the difference between the gene and species trees
     ● we may use several such simple explanations
     ● distance between G and S
     ML supertrees: Rodrigo and Steel. 2008. Syst Biol 57: 243
     ● work with unrooted gene trees
     ● penalize gene trees very different from the species tree
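The penalty idea on the model slide can be sketched in a few lines. This is only an illustration under assumptions that are not on the slides: trees are written as nested tuples, the distance is the Robinson-Foulds (RF) distance that the talk lists later among its measures, the penalty is exponential in the distance in the spirit of ML supertrees, and `lam` is a hypothetical penalty parameter. The value returned is an unnormalized log-probability.

```python
def leaf_set(tree):
    """Leaf names under a node; a tree is a nested tuple of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Nontrivial bipartitions of the tree, read as unrooted."""
    taxa = leaf_set(tree)
    ref = min(taxa)  # each split is stored as the side without `ref`
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = taxa - clade if ref in clade else clade
        if 1 < len(side) < len(taxa) - 1:
            found.add(side)
        return clade

    visit(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Robinson-Foulds distance: number of splits found in only one tree."""
    return len(splits(tree_a) ^ splits(tree_b))


def log_penalty(gene_tree, species_tree, lam=1.0):
    """Unnormalized log P(G|S) = -lam * d(G, S); `lam` is an assumed parameter."""
    return -lam * rf_distance(gene_tree, species_tree)


species = ((("A", "B"), "C"), ("D", "E"))
gene = ((("A", "C"), "B"), ("D", "E"))
print(rf_distance(gene, species))   # 2
print(log_penalty(gene, species))   # -2.0
```

The two example trees share the split DE|ABC and differ in one split each, so the distance is 2. Any other tree distance could be passed through the same exponential penalty.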
  8–9. Quantifying the disagreement [figure columns: gene tree, species tree, reconciliation]
     ● assuming deep coalescence (deepcoal): 1 deepcoal
     ● assuming duplications and losses (duplosses): 1 dup, 3 losses
     ● assuming HGT: 1 event
     ● stochastic error/nonparametric
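As a rough illustration of how a reconciliation turns into a count, the sketch below counts duplications with the standard LCA mapping: a gene tree node is a duplication when it maps to the same species tree node as one of its children. It is not the authors' implementation. It assumes rooted trees written as nested tuples and gene leaves labelled directly with their species name (both hypothetical conventions), and it counts duplications only, not losses or deep coalescences.

```python
def collect_clades(tree, out):
    """Append the leaf set of every node of a rooted tree (nested tuples) to `out`."""
    if isinstance(tree, str):
        clade = frozenset([tree])
    else:
        clade = frozenset().union(*[collect_clades(child, out) for child in tree])
    out.append(clade)
    return clade


def count_duplications(gene_tree, species_tree):
    """Number of gene tree nodes that the LCA mapping marks as duplications."""
    species_clades = []
    collect_clades(species_tree, species_clades)

    def lca(species):
        # smallest species tree clade that contains all the given species
        return min((c for c in species_clades if species <= c), key=len)

    duplications = 0

    def visit(node):
        nonlocal duplications
        if isinstance(node, str):
            return lca(frozenset([node]))
        child_maps = [visit(child) for child in node]
        here = lca(frozenset().union(*child_maps))
        if here in child_maps:  # same mapping as a child: duplication
            duplications += 1
        return here

    visit(gene_tree)
    return duplications


species = (("A", "B"), "C")
gene = (("A", "C"), ("B", "C"))  # two copies of the gene in species C
print(count_duplications(gene, species))     # 1
print(count_duplications(species, species))  # 0
```

In the example both children of the gene tree root already map to the species tree root, so the root is the single duplication.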
  11. Quantifying the disagreement – other measures. Mul-tree version: Chaudhary R, Burleigh JG, Fernández-Baca D (2013) Inferring Species Trees from Incongruent Multi-Copy Gene Trees Using the Robinson-Foulds Distance. arXiv:1210.2665
  12. Quantifying the disagreement – other measures. de Oliveira Martins et al. (2008) Phylogenetic Detection of Recombination with a Bayesian Prior on the Distance between Trees. PLoS ONE 3(7): e2651.
  13. Quantifying the disagreement – other measures. See also: Whidden et al. (2013) Supertrees based on the subtree prune-and-regraft distance. PeerJ PrePrints 1:e18v1
  14. Quantifying the disagreement – other measures. Hdist similar to: Nye TMW, Liò P, Gilks WR (2006) A novel algorithm and web-based tool for comparing two alternative phylogenetic trees. Bioinformatics 22: 117-119
  15–19. Now we have estimates for these
     ● assuming deepcoal: 1 deepcoal (gene tree parsimony)
     ● assuming duplosses: 1 dup, 3 losses (gene tree parsimony)
     ● assuming HGT: 1 event ((approximate) dSPR)
     ● stochastic error/nonparametric (RF, Hdist)
  20–21. Considering several measures of disagreement:
     Thus we can incorporate e.g. duplications and losses while accounting for HGT and random errors
     Easy to include other distances in the future
     Problem: the normalization constant. Ref.: Bryant D, Steel M (2009) Computing the Distribution of a Tree Metric. TCBB: 420–426
     Solution: importance sampling estimate of Z(.). E.g.: Rodrigue N, Kleinman CL, Philippe H, Lartillot N (2009) Computational Methods for Evaluating Phylogenetic Models of Coding Sequence Evolution with Dependence between Codons. Mol Biol Evol 26: 1663-1676.
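To make the normalization problem concrete, here is a toy sketch of an importance sampling estimate of Z. Everything in it is an assumption made for illustration: the "tree space" is a made-up list of 15 distances, the penalty is exponential with a hypothetical parameter `lam`, and the proposal is uniform over trees. With so few trees Z can also be summed exactly, which lets the estimate be checked.

```python
import math
import random

# Hypothetical toy "tree space": distance of each of 15 candidate gene trees
# from the species tree (made-up values, not taken from the talk).
distances = [0] + [2] * 4 + [4] * 10
lam = 0.5  # assumed penalty parameter


def exact_z(distances, lam):
    """Z(lam) by brute force: sum of exp(-lam * d) over every tree."""
    return sum(math.exp(-lam * d) for d in distances)


def importance_z(distances, lam, n_samples, rng):
    """Estimate Z with a uniform proposal q(G) = 1/N over the N trees."""
    n_trees = len(distances)
    total = 0.0
    for _ in range(n_samples):
        d = distances[rng.randrange(n_trees)]   # draw a tree from q
        total += math.exp(-lam * d) * n_trees   # importance weight f(G) / q(G)
    return total / n_samples


rng = random.Random(1)
print(exact_z(distances, lam))
print(importance_z(distances, lam, 20000, rng))
```

For realistic numbers of species the trees cannot be enumerated, so only the sampled version is available; the proposal would then have to be a distribution over trees that is easy to sample from.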
  23–25. Distribution of gene trees: probabilistic model [diagram: species tree S; nodes Gi, Di, Qi for each gene family i = 1..n; per-family parameters λdup_i, λloss_i, λspr_i, ...; priors λdupprior, λlossprior, λsprprior]
  26–29. Distribution of gene trees: probabilistic model [diagram: S, G1 ... Gn, the same λ parameters and priors, an "Importance Sampling" label, and "Input" and "Output" markers]
     ● Importance sampling, so we can use complex, state-of-the-art software for phylogenetic inference
     ● We should not rely on single estimates of gene phylogenies
     E.g.: Boussau B, Szollosi GJ, Duret L, Gouy M, Tannier E, Daubin V. (2012) Genome-scale coestimation of species and gene trees. Genome Research 23: 323-330.
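One way to read the role of importance sampling here is as a reweighting of gene trees sampled by external software. The derivation below is a sketch under assumptions that the transcript does not confirm: $D_i$ is taken to be the sequence data of family $i$, the external program is assumed to sample $m$ gene trees $G_i^{(j)}$ from a posterior $P(G \mid D_i)$ computed under a flat prior on topologies, and $P(G \mid S)$ is the distance-based penalty.

```latex
\begin{align}
P(D_i \mid S) &= \sum_{G} P(D_i \mid G)\, P(G \mid S) \\
              &\propto \sum_{G} P(G \mid D_i)\, P(G \mid S)
               \approx \frac{1}{m} \sum_{j=1}^{m} P\bigl(G_i^{(j)} \mid S\bigr),
\qquad G_i^{(j)} \sim P(G \mid D_i).
\end{align}
```

The proportionality holds because, under a flat prior, $P(G \mid D_i)$ differs from $P(D_i \mid G)$ only by a factor that does not depend on $S$. Averaging over the sample, rather than plugging in one best gene tree, is what avoids relying on a single estimate of each gene phylogeny.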
  31–32. Example: distances between gene families
     ● 567 single-copy gene trees for 23 species. Data from: Salichos L, Rokas A (2013) Inferring ancient divergences requires genes with strong phylogenetic signals. Nature 497: 327–331
     ● Analysis under a model where only RF, Hdist and dSPR are considered
     ● Not interested in the data set per se (unreliable)
     ● Used just as a didactic tool to show how the model works
     [figure panels: RF, Hdist, SPR]
  33–35. Example: distances between gene families [figure panels: RF, Hdist, SPR; labels: posterior samples, best estimate]
  37. Analysis of simulated data sets
     ● Fully probabilistic simulation of gene trees by Diego Mallo and David Posada
     ● Birth and death of new loci, conditioned on a multispecies coalescent, followed by sequence evolution
     ● We use gene trees only, and simulate tree inference error
     Idea from: Rasmussen MD, Kellis M (2012) Unified modeling of gene duplication, loss, and coalescence using a locus tree. Genome Res. 22: 755-765
  38–40. Analysis of simulated data sets – results [figures only]
  42–43. Single copy genes from Drosophila (TreeFam)
     ● 4591 informative, single-copy gene families
     ● (TreeFam database has 14250 informative gene families)
  44–45. Single copy genes from Drosophila (TreeFam)
     ● 4591 informative, single-copy gene families
     Estimated species tree:
     ● Root location uncertain
     ● Only one unrooted topology
  46–47. Large gene families from Drosophila (TreeFam)
     ● 43 gene families with 102 to 295 tips
     ● best species tree: ~100%
  48. To recap, our model can
     ● Estimate species trees given arbitrary gene families ← these can contain paralogs, missing data, etc. The larger, the better, especially for rooting the species tree
     ● Account for uncertainty in gene tree and species tree estimation ← some gene families may be more informative, or maybe we don't have signal at all. Do not assume gene trees are known: embrace ignorance!
     ● Allow for several sources of disagreement ← real data can seldom be explained by just one biological phenomenon. Different gene families may be the product of distinct processes
     ● Be fast ← the improvement provided by slower, fully probabilistic methods may be elusive, and they can benefit from our output nonetheless. It's parallelized, and all distances can be calculated very fast.
  49. Check out http://darwin.uvigo.es for announcements, code, slides... Thank you!
