Phylogenomic Supertrees. O. R. P. Bininda-Emonds
 

Usage Rights

CC Attribution-NonCommercial-ShareAlike License

    Presentation Transcript

    • Phylogenomic supertrees: the end of the road or the light at the end of the tunnel? Olaf R. P. Bininda-Emonds, Friedrich-Schiller-Universität Jena
    • Outline
      • what are supertrees?
        • “traditional” supertrees
        • the threat from phylogenomics
      • supertrees in the future
        • a paradigm shift
        • deconstructing divide-and-conquer
      • challenges for the future
    • What is a supertree?
      • results from the combination of many smaller, overlapping trees to form a single larger one
        • allows inferences of relationships that cannot be made from any single source tree
      • as old as systematics itself?
        • “vertical” (taxonomic) substitution
      • still in use
        • e.g., Tree of Life, larger supertrees
    • Formal supertree construction
      [Figure: source trees on overlapping subsets of taxa A–L combined into a supertree; two routes are labelled, “Agreement” (consensus-like techniques) and “Optimization” (coding technique plus optimization criterion)]
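The "coding technique" route can be illustrated with the usual matrix-representation coding: each clade of each source tree becomes one binary character, scored 1 for taxa inside the clade, 0 for taxa in that source tree but outside the clade, and ? for taxa absent from that source tree. The sketch below is a minimal illustration, not the speaker's implementation; the nested-tuple tree format and the function names `leaves`, `clades`, and `mrp_matrix` are assumptions made for the example, and the all-zero outgroup row that is often added is left out.

```python
def leaves(tree):
    """Return the set of leaf names in a nested-tuple tree (leaves are strings)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clades(tree):
    """Return the leaf set of every internal node below the root."""
    found = []

    def walk(node, is_root):
        if isinstance(node, str):
            return
        if not is_root:
            found.append(leaves(node))
        for child in node:
            walk(child, False)

    walk(tree, True)
    return found


def mrp_matrix(source_trees):
    """Code every clade of every source tree as one column of 1 / 0 / ?."""
    all_taxa = sorted(set().union(*(leaves(t) for t in source_trees)))
    rows = {taxon: [] for taxon in all_taxa}
    for tree in source_trees:
        present = leaves(tree)
        for clade in clades(tree):
            for taxon in all_taxa:
                if taxon not in present:
                    rows[taxon].append("?")   # taxon missing from this source tree
                elif taxon in clade:
                    rows[taxon].append("1")   # taxon inside the clade
                else:
                    rows[taxon].append("0")   # in the tree, outside the clade
    return {taxon: "".join(chars) for taxon, chars in rows.items()}


# Two hypothetical overlapping source trees: ((A,B),C) and ((B,C),D).
example = mrp_matrix([(("A", "B"), "C"), (("B", "C"), "D")])
for taxon, row in example.items():
    print(taxon, row)
```

The resulting matrix would then be handed to an optimization criterion (parsimony in the case of MRP); that search step is not shown here.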
    • “Traditional” supertrees
    • A supertree of extant mammals 4510 of the 4554 species listed in Wilson and Reeder (1993)
      • from Bininda-Emonds et al. (2007)
      [Figure: mammal supertree with the major clades labelled Monotremata, Marsupialia, Afrotheria, Xenarthra, Laurasiatheria, and Euarchontoglires, plus a “You are here” marker]
    • A supertree of extant birds
      • 5985 extant species (Davis and Page, semi-publ. data)
      • phylogeny from Johnson (2001)
    • Criticisms of supertrees
      • one step removed from the real data
        • loss of information reduces accuracy
        • prevents “signal enhancement”
        • potential for data duplication
      • can produce unsupported clades
      • invalid as phylogenetic hypotheses
        • summary statement (i.e., consensus)
        • cannot interpret supertree biologically
      • not necessary due to the molecular revolution (stop-gap method)
      • “Not many people build them [supertrees], and my sense is that their lifetime is limited: as gene sequence data becomes increasingly easy to acquire, supertrees will lose their value.”
      • Anonymous review of proposed supertree book (2001)
    • MRP supertree of extant Carnivora
      • all 271 extant species
      • 274 source trees from 177 literature sources
      • 13 nested supertrees
      • from Bininda-Emonds et al. (1999)
    • Carnivora sequences in GenBank
      • as of January 1, 1996
        • 677 sequences
        • 48 species
        • 12 new species / yr
      [Figure: number (log scale, 1 to 10 000 000) against year (1990 to 2005)]
    • Carnivora sequences in GenBank
      • as of March 12, 2004
        • 1 984 623 sequences
        • 197 species
        • 13.1 new species / yr
      • from Bininda-Emonds (2005)
      [Figure: number (log scale, 1 to 10 000 000) against year (1990 to 2005)]
    • Distribution of GenBank data
      • 1 984 623 sequences
        • 1 976 358 are for domestic dog (99.6%)
        • 4365 are for domestic cat (0.2%)
        • 3900 for remaining 195 species (or 20.0 sequences / species)
      • but:
        • 191 of the 219 Martes americana sequences are cyt b
        • 225 of the 302 Phoca vitulina sequences are tRNA-Pro
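The figures on this slide are internally consistent, which a few lines of arithmetic can confirm. This is only a check of the numbers quoted above; the variable names are made up for the example.

```python
total_sequences = 1_984_623
counts = {
    "domestic dog": 1_976_358,
    "domestic cat": 4_365,
    "remaining 195 species": 3_900,
}

# The three groups should account for every sequence.
assert sum(counts.values()) == total_sequences

# Percentage share of each group, rounded to one decimal as on the slide.
shares = {name: round(100 * n / total_sequences, 1) for name, n in counts.items()}

# Average number of sequences per species outside dog and cat.
per_species = counts["remaining 195 species"] / 195

print(shares)
print(per_species)
```

The two "but" examples make the same point at a finer scale: even within a species, most sequences can come from a single marker (191 / 219 for Martes americana, 225 / 302 for Phoca vitulina).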
    • The molecular revolution
      • molecular databases are currently highly incomplete and data are not randomly distributed
        • 33+ genome projects for mammals
        • ESTs: lots of bps, but comparatively few species
      • “data availability matrix” for green plants (from Sanderson and Driskell, 2003)
      [Figure: matrix of genes against species]
    • A paradigm shift
      • traditional, literature-based supertree construction probably ultimately endangered
        • but more so for some groups than for others
      • any future role in phylogenetics likely as an analytical tool
        • traditional → mixed data analyses
        • divide-and-conquer → homogeneous data analyses
    • Partitioned analyses
      • utility of pure sequence-based analyses for large, taxonomically broad studies increasingly questioned
        • alignment problems → loss of data
        • saturation / signal dropout
      • increasing trend for mixed analyses using data that require different models and assumptions:
        • e.g., morphology, DNA sequence data, AA alignments, RCGs, gene order, gene content, …
      • mixed-data analyses might benefit from a “traditional” supertree approach
        • i.e., supertree represents end result of analysis
      [Figure: workflow labelled “conventional analysis” and “supertree construction”]
    • Analyzing DNA supermatrices
      • partitioned approach incorporating supertrees needed around turn of century
      • less need today through advances in hardware (clusters and parallel computing) and software (faster algorithms and “tricks”)
        • ever larger phylogenetic problems now increasingly feasible (esp. in a likelihood framework), with bootstraps and mixed model analyses
      [Figure: workflow labelled “conventional analysis”, “supertree construction”, and “conventional analysis”]
    • Archimedean phylogenetics
      • “Give me a cluster large enough and a data set on which to work, and I shall derive the phylogeny.”
      • adapted from Roshan et al. (2004)
      [Figure: workflow labelled “subtree optimization (conventional analysis)”, “supertree construction”, and “global optimization (conventional analysis)”]
    • [Table: speed and accuracy at each stage: divide; subtree optimization; supertree construction (BUILD or MR / O); global optimization. Only the cell entry “n/a” survives.]
    • Sampling schemes
      • subsample data (4, 8, 16, …, 1024, 2048 taxa)
        • “clade sampling”
        • “random sampling”
      • simulate data (K2P, ti:tv = 2.0, [symbol lost] = 0.5, [symbol lost] = 0.1, 2000 bp)
      • phylogenetic analysis (NJ, weighted MP, ML, or ML-DCM3)
      • compare to pruned model tree
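The two sampling schemes can be sketched as follows. The slide does not spell out how clades were chosen, so this example assumes that "clade sampling" means taking all leaves of one subtree of the requested size and that "random sampling" means drawing taxa uniformly from the whole tree. The nested-tuple tree format, the balanced toy model tree, and the function names are likewise assumptions for illustration.

```python
import random


def leaf_names(tree):
    """List the leaf names of a nested-tuple tree (leaves are strings)."""
    if isinstance(tree, str):
        return [tree]
    names = []
    for child in tree:
        names.extend(leaf_names(child))
    return names


def random_sample(tree, k, rng):
    """Random sampling: draw k taxa uniformly from the whole tree."""
    return rng.sample(leaf_names(tree), k)


def clade_sample(tree, k, rng):
    """Clade sampling (assumed meaning): return one subtree's leaves, of size k."""
    candidates = []

    def walk(node):
        names = leaf_names(node)
        if len(names) == k:
            candidates.append(names)
        if not isinstance(node, str):
            for child in node:
                walk(child)

    walk(tree)
    return rng.choice(candidates) if candidates else None


def balanced_tree(names):
    """Build a fully balanced toy model tree over the given taxon names."""
    if len(names) == 1:
        return names[0]
    mid = len(names) // 2
    return (balanced_tree(names[:mid]), balanced_tree(names[mid:]))


rng = random.Random(0)
model_tree = balanced_tree([f"t{i}" for i in range(16)])
clade_taxa = clade_sample(model_tree, 4, rng)
random_taxa = random_sample(model_tree, 4, rng)
print(clade_taxa, random_taxa)
```

In either scheme the sampled taxa would then be analysed and the result compared with the model tree pruned to the same taxa, as listed on the slide.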
    • [Table: speed and accuracy at each stage: divide; subtree optimization; supertree construction (BUILD or MR / O); global optimization. Only the cell entry “n/a” survives.]
    • Divide step
      • investigated chiefly by Daniel Huson, Tandy Warnow, Usman Roshan and colleagues
        • developed disk-covering methods (DCMs)
        • fastest current implementation is Recursive-Iterative-DCM3 (Rec-I-DCM3)
      • sampling strategy for divide step crucial
        • Roshan et al. (2004) noted that the performance gain depends on the quality of the initial decomposition
        • due to its effects on the analysis times of the subtree optimization step
    • Scaling of accuracy
      • from Bininda-Emonds and Stamatakis (2006)
      [Figure: average similarity to model tree (1 – d_S, 0.750 to 1.000) against size of subsampled tree (log scale, 1 to 10000) for NJ, MP, ML, and ML-DCM3, each under random and clade sampling]
    • Accuracy and sampling strategy
      • from Bininda-Emonds and Stamatakis (2006)
      [Figure: ratio of average similarity (clade / random sampling, 0.95 to 1.15) against size of subsampled tree (log scale, 1 to 10000) for MP, NJ, ML, and ML-DCM3]
    • Scaling of analysis time
      • from Bininda-Emonds and Stamatakis (2006)
      [Figure: average analysis time (seconds, log scale, 0.01 to 100000) against size of subsampled tree (log scale, 1 to 10000) for NJ, MP, ML, and ML-DCM3, each under random and clade sampling]
    • Analysis time and sampling strategy
      • from Bininda-Emonds and Stamatakis (2006)
      [Figure: ratio of average analysis time (clade / random sampling, 0.0 to 1.5) against size of subsampled tree (log scale, 1 to 10000) for MP, NJ, ML, and ML-DCM3]
    • [Table: speed and accuracy at each stage: divide; subtree optimization; supertree construction (BUILD or MR / O); global optimization. Only the cell entry “n/a” survives.]
    • Supertree step
      • two main alternative strategies: BUILD-based vs. matrix representation / optimization based
      • problem:
        • BUILD is fast, but shows poor accuracy
        • MR / O shows good accuracy, but is deadly slow
      • can we devise a supertree method that combines speed and accuracy?
        • BUILD shows more promise: MR / O will always be slow
        • NB: accuracy ≠ resolution!
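The speed of BUILD comes from its simple recursion, which can be sketched for rooted triplets. In this minimal version a triplet `(a, b, c)` is taken to mean "a and b are closer to each other than either is to c"; the function name `build`, the input format, and the nested-tuple output are assumptions made for the example rather than any particular program's interface.

```python
def build(taxa, triplets):
    """Return a nested-tuple tree displaying all triplets, or None on conflict."""
    taxa = frozenset(taxa)
    if len(taxa) == 1:
        return next(iter(taxa))
    if len(taxa) == 2:
        return tuple(sorted(taxa))

    # Keep only triplets whose three taxa are all in the current set.
    relevant = [t for t in triplets if set(t) <= taxa]

    # Union-find over taxa: join a and b for every relevant triplet ab|c.
    parent = {x: x for x in taxa}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, _ in relevant:
        parent[find(a)] = find(b)

    components = {}
    for x in taxa:
        components.setdefault(find(x), set()).add(x)

    # A single component means the triplets cannot all be displayed.
    if len(components) == 1:
        return None

    subtrees = []
    for component in sorted(components.values(), key=min):
        subtree = build(component, relevant)
        if subtree is None:
            return None
        subtrees.append(subtree)
    return tuple(subtrees)


compatible = build("ABCD", [("A", "B", "C"), ("C", "D", "A")])
conflicting = build("ABC", [("A", "B", "C"), ("A", "C", "B")])
print(compatible, conflicting)
```

The second call shows the weakness behind the accuracy problem: plain BUILD simply fails on conflicting input, and the derived methods differ mainly in how they resolve such conflicts.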
    • Problems with BUILD
      • many BUILD-derived algorithms:
        • BUILD, MinCutSupertree, BUILD-with-Distances, AncestralBUILD, MultiLevelSupertree, PhySIC, …
      • MinCut the most widely known and basis for many other methods
        • tends to approximate Adams consensus (at least empirically)
        • tends to favour larger source trees (= size bias)
        • tends to spit out single conflicting taxa at each step, yielding very unbalanced, comb-like trees
    • Does divide-and-conquer work?
      • it should / could:
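        • tremendous speed gain from analyzing many smaller problems: $\sum_{1}^{n} \mathrm{time}_{x} \ll \mathrm{time}_{nx}$ (n analyses of x taxa each take far less time than one analysis of nx taxa)
        • accuracy ~flat with respect to problem size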
    • Number of analyses
      • from Bininda-Emonds and Stamatakis (2006)
      [Figure: number of analyses (log scale, 1 to 10000000) against size of subsampled tree (log scale, 1 to 10000) for NJ, MP, and ML, each under random and clade sampling; reference line labelled “= 4096 taxa”]
      • e.g., can run ~250 000 MP analyses of 16 clade-sampled taxa (≈ 4 000 000 taxa in total) in the time taken to analyze 4096 taxa simultaneously
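The scale of this example is easier to see with the arithmetic written out. The first two quantities restate the slide; the coverage figure is derived here, under the added assumption that the 4096 taxa are first split into non-overlapping subsets of 16.

```python
full_taxa = 4096        # size of the full data set
subset_size = 16        # taxa per small analysis
small_analyses = 250_000  # approximate number of small MP analyses (from the slide)

# Total number of taxa processed across all small analyses.
taxa_processed = small_analyses * subset_size

# Fewest non-overlapping subsets needed to cover every taxon once (assumption).
subsets_for_one_pass = full_taxa // subset_size

# How many complete passes over the data set the same time budget would allow.
passes = small_analyses / subsets_for_one_pass

print(taxa_processed, subsets_for_one_pass, round(passes))
```

On these numbers the time budget of one simultaneous analysis would cover the full data set several hundred times over with 16-taxon subsets, which is the headroom that the later slides propose to spend on extra overlap, backbone trees, and support values.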
    • Does divide-and-conquer work?
      • it should / could:
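        • tremendous speed gain from analyzing many smaller problems: $\sum_{1}^{n} \mathrm{time}_{x} \ll \mathrm{time}_{nx}$ (n analyses of x taxa each take far less time than one analysis of nx taxa)
        • accuracy ~flat with respect to problem size
      • but these potential savings aren’t realized in full empirically …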
    • Analyses of full 4096-taxon data set
      • from Bininda-Emonds and Stamatakis (2006)
      • method: accuracy (1 – d_S); time taken (seconds)
        • NJ: 0.857; 193
        • MP: 0.917; 69 392
        • ML (“standard hill climbing”): 0.923; 303 450
        • ML-DCM3: 0.921; 195 371
      • 1.55x: ratio of the ML (“standard hill climbing”) time to the ML-DCM3 time
    • Analyses of full data set
      • from Bininda-Emonds and Stamatakis (2006)
      • method: accuracy (1 – d_S); time taken (seconds)
        • NJ: 0.857; 193
        • MP: 0.917; 69 392
        • ML (“standard hill climbing”): 0.923; 303 450
        • ML (“fast hill climbing”): 0.912; 38 737
        • ML-DCM3: 0.921; 195 371
      • 5.04x: ratio of the ML-DCM3 time to the ML (“fast hill climbing”) time
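The 1.55x and 5.04x labels on these two slides match ratios of the tabulated run times, as the short check below shows. The dictionary keys and the helper name `speedup` are made up for the example; the times are the ones on the slides.

```python
# Time taken (seconds) for the full 4096-taxon data set, from the slides.
seconds = {
    "NJ": 193,
    "MP": 69_392,
    "ML standard": 303_450,
    "ML fast": 38_737,
    "ML-DCM3": 195_371,
}


def speedup(slower, faster):
    """How many times faster the second method is than the first."""
    return seconds[slower] / seconds[faster]


dcm3_vs_standard = round(speedup("ML standard", "ML-DCM3"), 2)
fast_vs_dcm3 = round(speedup("ML-DCM3", "ML fast"), 2)
print(dcm3_vs_standard, fast_vs_dcm3)
```

Read this way, divide-and-conquer (ML-DCM3) saves roughly a third of the standard ML run time, while a faster conventional search beats it about fivefold at a small cost in accuracy (0.912 against 0.921).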
    • What’s the problem?
      • bottleneck remains terminal global optimization step
        • any excessive branch swapping will slow it down
        • but branch swapping crucial for accuracy
      • therefore, key is to provide as accurate a starting tree as possible
        • DCM3 method seems to be providing only a slightly better tree than NJ (PHYML) or greedy MP (RAxML)
    • Possible solutions: input
      • improve accuracy of supertree by any or all of:
        • increasing coverage by analyzing more subtrees with more overlap
        • including several larger backbone trees
        • deriving support values for subtrees (e.g., fast bootstrapping) to enable weighted supertree analysis
      • time is available for these steps
        • also lend themselves to parallelization
    • Possible solutions: analysis
      • optimize global optimization step using constraints
        • minimize amount of intensive branch-swapping and tree surfing
        • idea in DCM-based methods (“refinement” of SCM supertree)
      • supertree serves as starting tree and constraint tree
        • crucial that supertree is accurate (NB accuracy ≠ resolution!)
        • can also judge node support empirically and constrain only well-supported nodes
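Constraining only well-supported nodes amounts to filtering the supertree's clades by a support cutoff before the global optimization step. The sketch below assumes each clade carries a support value between 0 and 1; the clades, the support values, the 0.7 cutoff, and the function name are all hypothetical.

```python
def clades_to_constrain(clade_support, threshold):
    """Return the clades whose support meets the threshold, in a stable order."""
    kept = [clade for clade, support in clade_support.items() if support >= threshold]
    return sorted(kept, key=lambda clade: sorted(clade))


# Hypothetical supertree clades with hypothetical support values.
support = {
    frozenset({"A", "B"}): 0.95,
    frozenset({"A", "B", "C"}): 0.55,
    frozenset({"D", "E"}): 0.80,
}

constraints = clades_to_constrain(support, threshold=0.7)
print(constraints)
```

Clades below the cutoff stay unconstrained, so the search can still correct them; the well-supported ones are fixed and cost no branch swapping.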
    • What’s the answer?
      • increasing technological sophistication will keep increasing range of conventional analyses
        • analyses of ≤10000 taxa now feasible and usually without parallelization
      • but, does a divide-and-conquer + supertree framework have a role beyond this?
        • theoretically yes, but only by solving a number of challenges
    • Challenges for the future
      • divide (+ subtree optimization) steps
        • find subtree size(s) or combinations thereof that maximize speed and especially accuracy
        • find optimal sampling scheme: clade and backbone vs clade-like sampling
        • do general rules-of-thumb exist or do parameters need to be empirically determined on a case-by-case basis?
        • alternatives to disk-covering methods?
      • supertree step
        • can we find a method that is fast like BUILD and accurate like MR / O methods? → PhySIC???
      • global-optimization step
        • have to weigh costs (no error correction) vs benefits (speed!) of searching under constraints
    • Bicliques
      [Figure: bipartite graph linking taxa A–G to genes 1–8]
      • taxa (rows) against genes (columns):
              1 2 3 4 5 6 7 8
            A + – – – – – – –
            B + + – – – – – –
            C – + + + + – – –
            D – + + + + + – –
            E – + + + + – – –
            F – – – – – + + –
            G – – – – – – + +
      • maximal biclique = K 4,3
    • Extending bicliques
      • quasi-bicliques
        • allow a certain proportion of missing edges
      • as input for a supertree analysis
        • essentially build bicliques of bicliques: bicliques that overlap for at least two taxa, but no sequences
      [Figure: bipartite graph linking taxa A–G to genes 1–8]
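The K 4,3 block on the Bicliques slide can be recovered from the taxa-by-genes matrix with a brute-force search. The sketch assumes that + marks a gene available for a taxon and that the "largest" biclique is the one covering the most matrix cells; exhaustive enumeration is only practical for toy matrices like this one, and the function name is made up for the example.

```python
from itertools import combinations

# Genes marked + for each taxon in the matrix on the Bicliques slide.
availability = {
    "A": {1},
    "B": {1, 2},
    "C": {2, 3, 4, 5},
    "D": {2, 3, 4, 5, 6},
    "E": {2, 3, 4, 5},
    "F": {6, 7},
    "G": {7, 8},
}


def largest_biclique(availability):
    """Return (taxa, genes) maximizing len(taxa) * len(genes) by exhaustive search."""
    best_taxa, best_genes = frozenset(), frozenset()
    taxa = sorted(availability)
    for size in range(1, len(taxa) + 1):
        for group in combinations(taxa, size):
            # Genes shared by every taxon in the group.
            shared = set.intersection(*(availability[t] for t in group))
            if len(group) * len(shared) > len(best_taxa) * len(best_genes):
                best_taxa, best_genes = frozenset(group), frozenset(shared)
    return best_taxa, best_genes


block_taxa, block_genes = largest_biclique(availability)
print(sorted(block_taxa), sorted(block_genes))
```

A quasi-biclique search would relax the `set.intersection` step to tolerate a set proportion of missing cells, at the cost of a harder optimization problem.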
    • Challenges for the future
      • divide (+ subtree optimization) steps
        • find subtree size(s) or combinations thereof that maximize speed and especially accuracy
        • find optimal sampling scheme: clade and backbone vs clade-like sampling
        • do general rules-of-thumb exist or do parameters need to be empirically determined on a case-by-case basis?
        • alternatives to disk-covering methods?
      • supertree step
        • can we find a method that is fast like BUILD and accurate like MR / O methods? → PhySIC???
      • global-optimization step
        • have to weigh costs (no error correction) vs benefits (speed!) of searching under constraints