Phylogenomic Supertrees. ORP Bininda-Emonds

Published in: Technology

  1. 1. Phylogenomic supertrees: the end of the road or the light at the end of the tunnel? Olaf R. P. Bininda-Emonds Friedrich-Schiller-Universität Jena
  2. 2. Outline <ul><li>what are supertrees? </li></ul><ul><ul><li>“traditional” supertrees </li></ul></ul><ul><ul><li>the threat from phylogenomics </li></ul></ul><ul><li>supertrees in the future </li></ul><ul><ul><li>a paradigm shift </li></ul></ul><ul><ul><li>deconstructing divide-and-conquer </li></ul></ul><ul><li>challenges for the future </li></ul>
  3. 3. What is a supertree? <ul><li>results from the combination of many smaller, overlapping trees to form a single larger one </li></ul><ul><ul><li>allows inferences of relationships that cannot be made from any single source tree </li></ul></ul><ul><li>as old as systematics itself? </li></ul><ul><ul><li>“vertical” (taxonomic) substitution </li></ul></ul><ul><li>still in use </li></ul><ul><ul><li>e.g., Tree of Life, larger supertrees </li></ul></ul>
  4. 4. Formal supertree construction [figure: overlapping source trees combined into a supertree either by agreement (consensus-like techniques) or by optimization (coding technique + optimization criterion)]
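
The "coding technique" on this slide can be illustrated with matrix representation in the style of Baum and Ragan, where every clade of every source tree becomes one binary character. This is a minimal sketch, assuming source trees are supplied as nested tuples of taxon names (a hypothetical input format chosen for illustration); the function names are likewise illustrative.

```python
def leaves(tree):
    """Return the set of taxon names under a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def clades(tree):
    """Yield the leaf set of every internal node below the root."""
    if isinstance(tree, str):
        return
    for child in tree:
        if not isinstance(child, str):
            yield leaves(child)
            yield from clades(child)


def matrix_representation(source_trees):
    """Code each clade as one character: 1 = in the clade, 0 = in the
    source tree but outside the clade, ? = absent from that source tree."""
    all_taxa = sorted(set().union(*(leaves(t) for t in source_trees)))
    rows = {taxon: [] for taxon in all_taxa}
    for tree in source_trees:
        present = leaves(tree)
        for clade in clades(tree):
            for taxon in all_taxa:
                if taxon not in present:
                    rows[taxon].append("?")
                else:
                    rows[taxon].append("1" if taxon in clade else "0")
    return {taxon: "".join(states) for taxon, states in rows.items()}


# Two small overlapping source trees (made-up example data).
trees = [(("A", "B"), ("C", "D")), (("B", "C"), "E")]
for taxon, states in matrix_representation(trees).items():
    print(taxon, states)
```

The resulting matrix would then be analyzed under an optimization criterion such as parsimony, which is the second half of the optimization route.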
  5. 5. “Traditional” supertrees
  6. 6. A supertree of extant mammals <ul><li>4510 of the 4554 species listed in Wilson and Reeder (1993) </li></ul><ul><li>from Bininda-Emonds et al. (2007) </li></ul>[figure: supertree with Monotremata, Marsupialia, Afrotheria, Xenarthra, Laurasiatheria and Euarchontoglires labelled, plus a “You are here” marker]
  7. 7. A supertree of extant birds <ul><li>5985 extant species (Davis and Page, semi-publ. data) </li></ul><ul><li>phylogeny from Johnson (2001) </li></ul>
  8. 8. Criticisms of supertrees <ul><li>one step removed from the real data </li></ul><ul><ul><li>loss of information reduces accuracy </li></ul></ul><ul><ul><li>prevents “signal enhancement” </li></ul></ul><ul><ul><li>potential for data duplication </li></ul></ul><ul><li>can produce unsupported clades </li></ul><ul><li>invalid as phylogenetic hypotheses </li></ul><ul><ul><li>summary statement (i.e., consensus) </li></ul></ul><ul><ul><li>cannot interpret supertree biologically </li></ul></ul><ul><li>not necessary due to the molecular revolution (stop-gap method) </li></ul>
  9. 9. <ul><li>“Not many people build them [supertrees], and my sense is that their lifetime is limited: as gene sequence data becomes increasingly easy to acquire, supertrees will lose their value.” </li></ul><ul><li>Anonymous review of proposed supertree book (2001) </li></ul>
  10. 10. MRP supertree of extant Carnivora <ul><li>all 271 extant species </li></ul><ul><li>274 source trees from 177 literature sources </li></ul><ul><li>13 nested supertrees </li></ul><ul><li>from Bininda-Emonds et al. (1999) </li></ul>
  11. 11. Carnivora sequences in GenBank <ul><li>as of January 1, 1996: 677 sequences, 48 species, 12 new species / yr </li></ul>[figure: number (log scale, 1 to 10 000 000) vs. year (1990 to 2005)]
  12. 12. Carnivora sequences in GenBank <ul><li>as of March 12, 2004: 1 984 623 sequences, 197 species, 13.1 new species / yr </li></ul><ul><li>from Bininda-Emonds (2005) </li></ul>[figure: number (log scale, 1 to 10 000 000) vs. year (1990 to 2005)]
  13. 13. Distribution of GenBank data <ul><li>1 984 623 sequences, but: </li></ul><ul><ul><li>1 976 358 are for domestic dog (99.6%) </li></ul></ul><ul><ul><li>4365 are for domestic cat (0.2%) </li></ul></ul><ul><ul><li>3900 for remaining 195 species (or 20.0 sequences / species) </li></ul></ul><ul><li>191 of the 219 Martes americana sequences are cyt b </li></ul><ul><li>225 of the 302 Phoca vitulina sequences are tRNA-Pro </li></ul>
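
The breakdown on this slide can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted on the slide; the dictionary keys are just labels.

```python
total = 1_984_623
counts = {
    "domestic dog": 1_976_358,
    "domestic cat": 4_365,
    "remaining 195 species": 3_900,
}

# The three groups should account for every Carnivora sequence.
assert sum(counts.values()) == total

for label, n in counts.items():
    print(f"{label}: {n} sequences ({100 * n / total:.1f}%)")

per_species = counts["remaining 195 species"] / 195
print(f"{per_species:.1f} sequences / species for the remaining taxa")
```

The three counts sum exactly to the GenBank total, which is why two domesticated species can dominate the database while most of the order remains sparsely sampled.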
  14. 14. The molecular revolution <ul><li>molecular databases are currently highly incomplete and data are not randomly distributed </li></ul><ul><ul><li>33+ genome projects for mammals </li></ul></ul><ul><ul><li>ESTs: lots of bps, but comparatively few species </li></ul></ul><ul><li>“data availability matrix” for green plants (from Sanderson and Driskell, 2003) </li></ul>[figure: matrix of genes vs. species]
  15. 15. A paradigm shift <ul><li>traditional, literature-based supertree construction probably ultimately endangered </li></ul><ul><ul><li>but more so for some groups than for others </li></ul></ul><ul><li>any future role in phylogenetics likely as an analytical tool </li></ul><ul><ul><li>traditional → mixed data analyses </li></ul></ul><ul><ul><li>divide-and-conquer → homogeneous data analyses </li></ul></ul>
  16. 16. Partitioned analyses <ul><li>utility of pure sequence-based analyses for large, taxonomically broad studies increasingly questioned </li></ul><ul><ul><li>alignment problems → loss of data </li></ul></ul><ul><ul><li>saturation / signal dropout </li></ul></ul><ul><li>increasing trend for mixed analyses using data that require different models and assumptions: </li></ul><ul><ul><li>e.g., morphology, DNA sequence data, AA alignments, RCGs, gene order, gene content, … </li></ul></ul><ul><li>mixed-data analyses might benefit from a “traditional” supertree approach </li></ul><ul><ul><li>i.e., supertree represents end result of analysis </li></ul></ul>[figure: workflow labelled “supertree construction” and “conventional analysis”]
  17. 17. Analyzing DNA supermatrices <ul><li>partitioned approach incorporating supertrees needed around turn of century </li></ul><ul><li>less need today through advances in hardware (clusters and parallel computing) and software (faster algorithms and “tricks”) </li></ul><ul><ul><li>ever larger phylogenetic problems now increasingly feasible (esp. in a likelihood framework), with bootstraps and mixed model analyses </li></ul></ul>[figure: workflow labelled “supertree construction” and, twice, “conventional analysis”]
  18. 18. Archimedean phylogenetics “Give me a cluster large enough and a data set on which to work, and I shall derive the phylogeny.”
  19. 19. [figure: divide-and-conquer pipeline with subtree optimization (conventional analysis), supertree construction, and global optimization (conventional analysis)] <ul><li>adapted from Roshan et al. (2004) </li></ul>
  20. 20. [table: Stage (divide; subtree optimization; supertree construction by BUILD or MR / O; global optimization) rated for Speed and Accuracy; one cell marked n/a]
  21. 21. subsample data (4, 8, 16, …, 1024, 2048 taxa) simulate data (K2P, ti:tv = 2.0,  = 0.5,  = 0.1, 2000 bp) phylogenetic analysis (NJ, weighted MP, ML, or ML-DCM3) compare to pruned model tree
  22. 22. Sampling schemes <ul><li>“clade sampling” </li></ul><ul><li>“random sampling” </li></ul>
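
The difference between the two sampling schemes can be sketched as follows. This is only an illustration: it assumes a fully balanced model tree whose leaves are numbered left to right, so that every aligned block of 2^k consecutive leaves is a clade. The talk does not state the shape of the model tree, and both function names are hypothetical.

```python
import random


def random_sample(n_taxa, size, rng):
    """Pick `size` taxa uniformly at random from the whole tree."""
    return sorted(rng.sample(range(n_taxa), size))


def clade_sample(n_taxa, size, rng):
    """Pick one complete clade of `size` taxa.

    Assumes a fully balanced tree with leaves numbered left to right, so
    every aligned block of 2**k consecutive leaves forms a clade.
    """
    if size < 1 or size & (size - 1) or n_taxa % size:
        raise ValueError("size must be a power of two that divides n_taxa")
    start = rng.randrange(n_taxa // size) * size
    return list(range(start, start + size))


rng = random.Random(1)
print(clade_sample(4096, 16, rng))   # 16 neighbouring taxa forming one clade
print(random_sample(4096, 16, rng))  # 16 taxa scattered across the tree
```

Clade sampling yields closely related taxa, whereas random sampling typically yields taxa separated by long stretches of the pruned tree.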
  23. 23. [table: Stage (divide; subtree optimization; supertree construction by BUILD or MR / O; global optimization) rated for Speed and Accuracy; one cell marked n/a]
  24. 24. Divide step <ul><li>investigated chiefly by Daniel Huson, Tandy Warnow, Usman Roshan and colleagues </li></ul><ul><ul><li>developed disk-covering methods (DCMs) </li></ul></ul><ul><ul><li>fastest current implementation is Recursive-Iterative-DCM3 (Rec-I-DCM3) </li></ul></ul><ul><li>sampling strategy for divide step crucial </li></ul><ul><ul><li>Roshan et al. (2004) noted that the performance gain depends on the quality of the initial decomposition </li></ul></ul><ul><ul><li>due to effects on analysis times of subtree optimization step </li></ul></ul>
  25. 25. Scaling of accuracy <ul><li>from Bininda-Emonds and Stamatakis (2006) </li></ul>[figure: average similarity to model tree (1 – d<sub>S</sub>) vs. size of subsampled tree (log scale), for NJ, MP, ML and ML-DCM3 under random and clade sampling]
  26. 26. Accuracy and sampling strategy <ul><li>from Bininda-Emonds and Stamatakis (2006) </li></ul>[figure: ratio of average similarity (clade / random sampling) vs. size of subsampled tree (log scale), for MP, NJ, ML and ML-DCM3]
  27. 27. Scaling of analysis time <ul><li>from Bininda-Emonds and Stamatakis (2006) </li></ul>[figure: average analysis time in seconds (log scale) vs. size of subsampled tree (log scale), for NJ, MP, ML and ML-DCM3 under random and clade sampling]
  28. 28. Analysis time and sampling strategy <ul><li>from Bininda-Emonds and Stamatakis (2006) </li></ul>[figure: ratio of average analysis time (clade / random sampling) vs. size of subsampled tree (log scale), for MP, NJ, ML and ML-DCM3]
  29. 29. [table: Stage (divide; subtree optimization; supertree construction by BUILD or MR / O; global optimization) rated for Speed and Accuracy; one cell marked n/a]
  30. 30. Supertree step <ul><li>two main alternative strategies: BUILD-based vs. matrix representation / optimization (MR / O) based </li></ul><ul><li>problem: </li></ul><ul><ul><li>BUILD is fast, but shows poor accuracy </li></ul></ul><ul><ul><li>MR / O shows good accuracy, but is deadly slow </li></ul></ul><ul><li>can we devise a supertree method that combines speed and accuracy? </li></ul><ul><ul><li>BUILD shows more promise: MR / O will always be slow </li></ul></ul><ul><ul><li>NB: accuracy ≠ resolution! </li></ul></ul>
  31. 31. Problems with BUILD <ul><li>many BUILD-derived algorithms: </li></ul><ul><ul><li>BUILD, MinCutSupertree, BUILD-with-Distances, AncestralBUILD, MultiLevelSupertree, PhySIC, … </li></ul></ul><ul><li>MinCut is the most widely known and the basis for many other methods </li></ul><ul><ul><li>tends to approximate Adams consensus (at least empirically) </li></ul></ul><ul><ul><li>tends to favour larger source trees (= size bias) </li></ul></ul><ul><ul><li>tends to spit out single conflicting taxa at each step, yielding very unbalanced, comb-like trees </li></ul></ul>
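
For orientation, the core of plain BUILD (Aho et al.) can be sketched in a few lines. This is a minimal illustration, assuming the input is a list of rooted triples written as `((a, b), c)`, meaning a and b are more closely related to each other than either is to c. The representation and the function name are illustrative choices, and none of the derived methods listed above is implemented here.

```python
def build(taxa, triples):
    """Plain BUILD on rooted triples ((a, b), c).

    Returns a nested tuple, or None if the triples are incompatible.
    """
    taxa = sorted(taxa)
    if len(taxa) == 1:
        return taxa[0]

    # Union-find over taxa: join a and b for every triple fully inside `taxa`.
    parent = {t: t for t in taxa}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    relevant = [((a, b), c) for (a, b), c in triples
                if a in parent and b in parent and c in parent]
    for (a, b), _ in relevant:
        parent[find(a)] = find(b)

    # Each connected component becomes one child of the current node.
    components = {}
    for t in taxa:
        components.setdefault(find(t), []).append(t)
    if len(components) == 1:
        return None  # conflict: the taxa cannot be split at this node

    subtrees = []
    for members in components.values():
        subtree = build(members, relevant)
        if subtree is None:
            return None
        subtrees.append(subtree)
    return tuple(subtrees)


triples = [(("A", "B"), "C"), (("C", "D"), "A"), (("A", "B"), "D")]
print(build(["A", "B", "C", "D"], triples))
```

The point at which this sketch returns `None` is where MinCutSupertree-style methods differ: rather than giving up on conflicting input, they cut the graph to force a split.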
  32. 32. Does divide-and-conquer work? <ul><li>it should / could: </li></ul><ul><ul><li>tremendous speed gain from analyzing many smaller problems: $\sum_{1}^{n} \mathrm{time}_x \ll \mathrm{time}_{nx}$ </li></ul></ul><ul><ul><li>accuracy ~flat with respect to problem size </li></ul></ul>
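
To see why the inequality on this slide should hold, suppose that analysis time grows as a power law in the number of taxa, $\mathrm{time}_x = c\,x^{k}$ with $k > 1$. This is purely an illustrative assumption, not a claim made in the talk, and $c$ and $k$ are hypothetical constants. Then $n$ analyses of $x$ taxa each compare with one analysis of $nx$ taxa as follows:

```latex
\sum_{1}^{n} \mathrm{time}_x = n \, c \, x^{k}
\qquad\text{versus}\qquad
\mathrm{time}_{nx} = c \, (n x)^{k} = n^{k} \, c \, x^{k},
\qquad\text{so}\qquad
\frac{\mathrm{time}_{nx}}{\sum_{1}^{n} \mathrm{time}_x} = n^{k-1}.
```

Under this assumption, with $k = 2$ and $n = 256$ subproblems, the single large analysis would take 256 times as long as all the small analyses combined.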
  33. 33. Number of analyses <ul><li>from Bininda-Emonds and Stamatakis (2006) </li></ul>[figure: number of analyses (log scale) vs. size of subsampled tree (log scale), for NJ, MP and ML under random and clade sampling; reference label “= 4096 taxa”] <ul><li>e.g., can run ~250 000 MP analyses of 16 clade-sampled taxa (≈ 4 000 000 taxa in total) in the time taken to analyze 4096 taxa simultaneously </li></ul>
  34. 34. Does divide-and-conquer work? <ul><li>it should / could: </li></ul><ul><ul><li>tremendous speed gain from analyzing many smaller problems: $\sum_{1}^{n} \mathrm{time}_x \ll \mathrm{time}_{nx}$ </li></ul></ul><ul><ul><li>accuracy ~flat with respect to problem size </li></ul></ul><ul><li>but these potential savings aren’t realized in full empirically … </li></ul>
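  35. 35. Analyses of full 4096-taxon data set <ul><li>from Bininda-Emonds and Stamatakis (2006) </li></ul><table><tr><th>Method</th><th>Accuracy (1 – d<sub>S</sub>)</th><th>Time taken (seconds)</th></tr><tr><td>NJ</td><td>0.857</td><td>193</td></tr><tr><td>MP</td><td>0.917</td><td>69 392</td></tr><tr><td>ML (“standard hill climbing”)</td><td>0.923</td><td>303 450</td></tr><tr><td>ML-DCM3</td><td>0.921</td><td>195 371</td></tr></table><ul><li>ML-DCM3 is 1.55x faster than ML (“standard hill climbing”) </li></ul>
  36. 36. Analyses of full data set <ul><li>from Bininda-Emonds and Stamatakis (2006) </li></ul><table><tr><th>Method</th><th>Accuracy (1 – d<sub>S</sub>)</th><th>Time taken (seconds)</th></tr><tr><td>NJ</td><td>0.857</td><td>193</td></tr><tr><td>MP</td><td>0.917</td><td>69 392</td></tr><tr><td>ML (“standard hill climbing”)</td><td>0.923</td><td>303 450</td></tr><tr><td>ML (“fast hill climbing”)</td><td>0.912</td><td>38 737</td></tr><tr><td>ML-DCM3</td><td>0.921</td><td>195 371</td></tr></table><ul><li>ML (“fast hill climbing”) is 5.04x faster than ML-DCM3 </li></ul>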
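
The 1.55x and 5.04x annotations on these two slides appear to be ratios of the reported run times. That reading is an inference from the numbers rather than something the slides state, and the short check below reproduces both figures from the times given.

```python
# Run times in seconds for the full 4096-taxon data set, as reported on the slides.
times = {
    "NJ": 193,
    "MP": 69_392,
    "ML (standard hill climbing)": 303_450,
    "ML (fast hill climbing)": 38_737,
    "ML-DCM3": 195_371,
}


def speedup(slower, faster):
    """How many times faster `faster` ran than `slower`."""
    return times[slower] / times[faster]


print(f"{speedup('ML (standard hill climbing)', 'ML-DCM3'):.2f}x")  # 1.55x
print(f"{speedup('ML-DCM3', 'ML (fast hill climbing)'):.2f}x")      # 5.04x
```

Read this way, the divide-and-conquer method gains only modestly on a standard search and is itself overtaken by a faster conventional search, at a small cost in accuracy (0.912 vs. 0.921).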
  37. 37. What’s the problem? <ul><li>bottleneck remains terminal global optimization step </li></ul><ul><ul><li>any excessive branch swapping will slow it down </li></ul></ul><ul><ul><li>but branch swapping crucial for accuracy </li></ul></ul><ul><li>therefore, key is to provide as accurate a starting tree as possible </li></ul><ul><ul><li>DCM3 method seems to be providing only a slightly better tree than NJ (PHYML) or greedy MP (RAxML) </li></ul></ul>
  38. 38. Possible solutions: input <ul><li>improve accuracy of supertree by any or all of: </li></ul><ul><ul><li>increasing coverage by analyzing more subtrees with more overlap </li></ul></ul><ul><ul><li>including several larger backbone trees </li></ul></ul><ul><ul><li>deriving support values for subtrees (e.g., fast bootstrapping) to enable weighted supertree analysis </li></ul></ul><ul><li>time is available for these steps </li></ul><ul><ul><li>also lend themselves to parallelization </li></ul></ul>
  39. 39. Possible solutions: analysis <ul><li>optimize global optimization step using constraints </li></ul><ul><ul><li>minimize amount of intensive branch-swapping and tree surfing </li></ul></ul><ul><ul><li>idea in DCM-based methods (“refinement” of SCM supertree) </li></ul></ul><ul><li>supertree serves as starting tree and constraint tree </li></ul><ul><ul><li>crucial that supertree is accurate (NB accuracy ≠ resolution!) </li></ul></ul><ul><ul><li>can also judge node support empirically and constrain only well-supported nodes </li></ul></ul>
  40. 40. What’s the answer? <ul><li>increasing technological sophistication will keep increasing range of conventional analyses </li></ul><ul><ul><li>analyses of ≤10000 taxa now feasible and usually without parallelization </li></ul></ul><ul><li>but, does a divide-and-conquer + supertree framework have a role beyond this? </li></ul><ul><ul><li>theoretically yes, but only by solving a number of challenges </li></ul></ul>
  41. 41. Challenges for the future <ul><li>divide (+ subtree optimization) steps </li></ul><ul><ul><li>find subtree size(s) or combinations thereof that maximize speed and especially accuracy </li></ul></ul><ul><ul><li>find optimal sampling scheme: clade and backbone vs clade-like sampling </li></ul></ul><ul><ul><li>do general rules-of-thumb exist or do parameters need to be empirically determined on a case-by-case basis? </li></ul></ul><ul><ul><li>alternatives to disk-covering methods? </li></ul></ul><ul><li>supertree step </li></ul><ul><ul><li>can we find a method that is fast like BUILD and accurate like MR / O methods? → PhySIC??? </li></ul></ul><ul><li>global-optimization step </li></ul><ul><ul><li>have to weigh costs (no error correction) vs benefits (speed!) of searching under constraints </li></ul></ul>
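  42. 42. Bicliques [figure: bipartite graph linking taxa A–G to genes 1–8] <table><tr><th>Taxa \ Genes</th><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>6</th><th>7</th><th>8</th></tr><tr><th>A</th><td>+</td><td>–</td><td>–</td><td>–</td><td>–</td><td>–</td><td>–</td><td>–</td></tr><tr><th>B</th><td>+</td><td>+</td><td>–</td><td>–</td><td>–</td><td>–</td><td>–</td><td>–</td></tr><tr><th>C</th><td>–</td><td>+</td><td>+</td><td>+</td><td>+</td><td>–</td><td>–</td><td>–</td></tr><tr><th>D</th><td>–</td><td>+</td><td>+</td><td>+</td><td>+</td><td>+</td><td>–</td><td>–</td></tr><tr><th>E</th><td>–</td><td>+</td><td>+</td><td>+</td><td>+</td><td>–</td><td>–</td><td>–</td></tr><tr><th>F</th><td>–</td><td>–</td><td>–</td><td>–</td><td>–</td><td>+</td><td>+</td><td>–</td></tr><tr><th>G</th><td>–</td><td>–</td><td>–</td><td>–</td><td>–</td><td>–</td><td>+</td><td>+</td></tr></table> maximal biclique = K<sub>4,3</sub> (genes 2–5 × taxa C, D, E)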
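
The biclique on this slide can be recovered by brute force from the taxon-by-gene matrix. The sketch below is only an illustration: it scores a biclique by its number of filled cells (an assumed reading of "maximal"), enumerates every subset of taxa, and is therefore usable only on toy inputs like this one. The function name is hypothetical.

```python
from itertools import combinations

# Taxon -> set of genes sampled for it, transcribed from the matrix on the slide.
coverage = {
    "A": {1},
    "B": {1, 2},
    "C": {2, 3, 4, 5},
    "D": {2, 3, 4, 5, 6},
    "E": {2, 3, 4, 5},
    "F": {6, 7},
    "G": {7, 8},
}


def largest_biclique(coverage):
    """Return (taxa, genes) of the complete block with the most cells."""
    best = ((), set())
    for k in range(1, len(coverage) + 1):
        for taxa in combinations(sorted(coverage), k):
            # Genes shared by every taxon in the subset form a complete block.
            genes = set.intersection(*(coverage[t] for t in taxa))
            if len(taxa) * len(genes) > len(best[0]) * len(best[1]):
                best = (taxa, genes)
    return best


taxa, genes = largest_biclique(coverage)
print(taxa, sorted(genes))  # ('C', 'D', 'E') [2, 3, 4, 5]
```

The result is the block of four genes by three taxa, matching the K 4,3 noted on the slide.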
  43. 43. Extending bicliques <ul><li>quasi-bicliques </li></ul><ul><ul><li>allow a certain proportion of missing edges </li></ul></ul><ul><li>as input for a supertree analysis </li></ul><ul><ul><li>essentially build bicliques of bicliques → bicliques that overlap for at least two taxa, but no sequences </li></ul></ul>[figure: bipartite graph linking taxa A–G to genes 1–8]
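
A quasi-biclique relaxes the requirement that every taxon-gene cell be filled. The check below is a minimal sketch of that idea; the `max_missing` threshold and the function name are hypothetical, since the talk does not say what proportion of missing edges would be tolerated.

```python
def is_quasi_biclique(coverage, taxa, genes, max_missing=0.1):
    """True if at most `max_missing` of the taxa x genes cells are empty.

    `coverage` maps each taxon to the set of genes sampled for it;
    `max_missing` is a hypothetical tolerance, not a value from the talk.
    """
    cells = len(taxa) * len(genes)
    if cells == 0:
        return False
    filled = sum(len(coverage[t] & set(genes)) for t in taxa)
    return (cells - filled) / cells <= max_missing


# Three taxa from the example matrix, extended from genes 2-5 to genes 2-6.
coverage = {"C": {2, 3, 4, 5}, "D": {2, 3, 4, 5, 6}, "E": {2, 3, 4, 5}}
block = (["C", "D", "E"], [2, 3, 4, 5, 6])
print(is_quasi_biclique(coverage, *block, max_missing=0.10))  # 2 of 15 cells empty
print(is_quasi_biclique(coverage, *block, max_missing=0.15))
```

Adding gene 6 leaves 2 of 15 cells empty (about 13%), so the block is rejected at a 10% tolerance and accepted at 15%. This shows how the tolerance trades data completeness against matrix size.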
  44. 44. Challenges for the future <ul><li>divide (+ subtree optimization) steps </li></ul><ul><ul><li>find subtree size(s) or combinations thereof that maximize speed and especially accuracy </li></ul></ul><ul><ul><li>find optimal sampling scheme: clade and backbone vs clade-like sampling </li></ul></ul><ul><ul><li>do general rules-of-thumb exist or do parameters need to be empirically determined on a case-by-case basis? </li></ul></ul><ul><ul><li>alternatives to disk-covering methods? </li></ul></ul><ul><li>supertree step </li></ul><ul><ul><li>can we find a method that is fast like BUILD and accurate like MR / O methods? → PhySIC??? </li></ul></ul><ul><li>global-optimization step </li></ul><ul><ul><li>have to weigh costs (no error correction) vs benefits (speed!) of searching under constraints </li></ul></ul>
