Interpreting ‘tree space’ in the context of very large empirical datasets
Joe Parker 
School of Biological and Chemical...
Seminar presented at the Maths Department, University of Portsmouth, 19th November 2014

Evolutionary biologists represent actual or hypothesised evolutionary relations between living organisms using phylogenies: directed bifurcating graphs (trees) that describe evolutionary processes in terms of speciation or splitting events (nodes) and elapsed evolutionary time or distance (edges). Molecular evolution itself is largely dominated by mutations in DNA sequences, a stochastic process. Traditionally, probabilistic models of molecular evolution and phylogenies are fitted to DNA sequence data by maximum likelihood, on the assumption that a single simple phylogeny will approximate the evolution of a majority of DNA positions in the dataset. However, modern studies now routinely sample several orders of magnitude more DNA positions, and this assumption no longer holds. Unfortunately, our conception of ‘tree space’, a notional multidimensional surface containing all possible phylogenies, is extremely imprecise, and techniques for modelling how phylogenies fit very large datasets are similarly limited. I will show the background to this field and present some of the challenges arising from the present limited analytical framework.


  1. Interpreting ‘tree space’ in the context of very large empirical datasets
     Joe Parker, School of Biological and Chemical Sciences, Queen Mary University of London
  2. Topics
     • What evolutionary biology is
       – And what we do in the lab
     • Introducing phylogenies (trees / digraphs)
     • Molecular evolution
     • Tests involving phylogeny comparison
     • Problems in phylogeny comparison
     • Conclusion / thanks / questions
  3. Introduction to our work (1/5)
  4. A Tale of Bats and Whales
  5. The Prestin gene & high-frequency hearing
  6. Evolution
  7. Prestin evolution
     Human          NDLTRNRFFENPALWELLFH… SIHDAVLGSQLREALAEQEASAPPSQ
     Rat            NDLTSNRFFENPALKELLFH… SIHDAVLGSQVREAMAEQETTVLPPQ
     Dog            NDLTQNRFFENPALKELLFH… SIHDAVLGSQLREALAEQEASALPPQ
     Dolphin        SDLTRNQFFENPALLDLLFH… SIHDAVLGSLVREALAEKEAAAATPQ
     Horseshoe Bat  SDLTRNRFFENPALLDLLFH… SIHDAVLGSLVREALEEKEAAAATPQ
  8. Introduction to phylogenies (2/5)
  9. Phylogenies
     • Phylogenies are directed graphs that show evolutionary relations between taxa
     • Or our hypotheses about them
  10. Comparative approaches
  11. Tree space
     • Phylogeneticists often talk about tree space: the set of all possible trees
     • Within tree space, two graphs are said to be adjacent if they differ at, e.g., one internal node
     • Trees are said to be ‘near’ if they are similar, e.g. separated by only a few rearrangements
     • It is not actually a well-defined concept, however
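To give a feel for how quickly "the set of all possible trees" on slide 11 grows, here is a minimal sketch that counts labelled bifurcating topologies using the standard double-factorial formulas: (2n − 3)!! rooted trees and (2n − 5)!! unrooted trees for n tips. The function names are illustrative only and do not come from the talk.

```python
def count_rooted_trees(n_tips: int) -> int:
    """Number of rooted, bifurcating, labelled trees on n_tips: (2n - 3)!!"""
    if n_tips < 1:
        raise ValueError("need at least one tip")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 3 (empty product for n <= 2).
    for odd in range(3, 2 * n_tips - 2, 2):
        count *= odd
    return count


def count_unrooted_trees(n_tips: int) -> int:
    """Number of unrooted, bifurcating, labelled trees on n_tips: (2n - 5)!!"""
    if n_tips < 3:
        return 1
    # An unrooted tree on n tips corresponds to a rooted tree on n - 1 tips.
    return count_rooted_trees(n_tips - 1)


print(count_rooted_trees(4), count_unrooted_trees(4))   # 15 3
print(count_rooted_trees(10))                           # 34459425
```

Even at ten taxa there are tens of millions of rooted topologies, which is why tree space can only be sampled, not enumerated, for realistic datasets.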
  12. Introduction to molecular evolution (3/5)
  13. Molecular evolution
     • Molecular evolution is the study of the processes by which DNA sequences change over time
     • Stochastic changes dominate over short time-scales, but over longer ones directional natural selection is apparent
     • Normally modelled as a stochastic process
     • Unlike classical physical phenomena, largely understood as a statistical, not a mechanical, phenomenon
  14. Simple model: Jukes-Cantor 69
     • Letters {A, C, G, T}
     • Equal frequencies at equilibrium
     • Each base changes to each of the three others at rate u / 3 per unit time
     • e.g. A → C in time t:
       $$\Pr(C \mid A, u, t) = \frac{1}{4}\left(1 - e^{-\frac{4}{3}ut}\right)$$
     More generally: Felsenstein (2004) Inferring Phylogenies. Sinauer Associates, Sunderland, MA (Following model figures and formulae: ibid.)
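The Jukes-Cantor formula on slide 14 can be sketched in a few lines. This is an illustrative implementation, not code from the talk: the function name is hypothetical, and the "no change" case is obtained from the requirement that the four probabilities sum to one.

```python
import math

BASES = "ACGT"


def jc69_prob(from_base: str, to_base: str, u: float, t: float) -> float:
    """Probability of observing to_base after time t, starting from from_base,
    under Jukes-Cantor with total substitution rate u."""
    decay = math.exp(-4.0 * u * t / 3.0)
    if from_base == to_base:
        # 1 minus three times the single-change probability below
        return 0.25 + 0.75 * decay
    # Pr(C | A, u, t) = 1/4 * (1 - exp(-4/3 * u * t))
    return 0.25 * (1.0 - decay)


# Each row of the transition matrix sums to one for any u * t.
row = [jc69_prob("A", b, u=1.0, t=0.5) for b in BASES]
print(row, sum(row))
```

As u·t grows, every entry tends to 1/4, the equilibrium frequency stated on the slide; at t = 0 the probability of any change is zero.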
  15. Maximum likelihood
     • One of the most popular frameworks for understanding and modelling molecular evolution and phylogenies
     • Likelihood of data given model, phylogeny:
       $$L = \Pr(D \mid T) = \prod_{i=1}^{m} \Pr(D^{(i)} \mid T)$$
     • Likelihood-maximisation gives a way to parameterise model and/or phylogeny
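As a small worked instance of the site-wise likelihood product on slide 15, the sketch below treats the simplest possible "phylogeny": two aligned sequences joined by a single branch under Jukes-Cantor. Everything here is an assumption for illustration: the toy DNA sequences are invented, the parameter `ut` is the product of rate and time, and the grid search is a stand-in for a proper optimiser, not the method used in the talk.

```python
import math


def jc69_log_likelihood(seq_a: str, seq_b: str, ut: float) -> float:
    """ln L = sum over sites of ln Pr(D(i) | T) for two sequences
    separated by a branch of length ut under Jukes-Cantor."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    decay = math.exp(-4.0 * ut / 3.0)
    p_same = 0.25 + 0.75 * decay
    p_diff = 0.25 * (1.0 - decay)
    log_l = 0.0
    for a, b in zip(seq_a, seq_b):
        # Equilibrium frequency 1/4 for the first base, then the transition.
        log_l += math.log(0.25 * (p_same if a == b else p_diff))
    return log_l


# Hypothetical toy alignment: 20 sites, 4 of which differ.
seq_a = "ACGTACGTACGTACGTACGT"
seq_b = "GCGTATGTACATACGCACGT"

# Crude likelihood maximisation over a grid of branch lengths.
grid = [i / 1000.0 for i in range(1, 2001)]
ut_ml = max(grid, key=lambda ut: jc69_log_likelihood(seq_a, seq_b, ut))

# Closed-form Jukes-Cantor estimate for comparison.
p = sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)
ut_analytic = -0.75 * math.log(1.0 - 4.0 * p / 3.0)
print(ut_ml, ut_analytic)
```

Working with the sum of log-likelihoods rather than the raw product avoids numerical underflow once the number of sites m becomes large, which is the regime the talk is concerned with.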
  16. Assumptions: independence of sites (1), independence of branches (2)
       $$L = \Pr(D \mid T) = \prod_{i=1}^{m} \Pr(D^{(i)} \mid T)$$
       $$\Pr(D^{(i)} \mid T) = \sum_{x}\sum_{y}\sum_{z}\sum_{w} \Pr(A, C, C, C, G, x, y, z, w \mid T)$$
  17. Phylogenomics
     • Advances mean data sets are now several orders of magnitude larger
     • Shift in emphasis from ML on specific phylogenies to statistics of all
     Image credits: flickr/stephenjjohnson, Illumina.com, spectrum.ieee.org
  18. Phylogenomics
     • Stochastic property of molecular evolution becomes apparent in large datasets
     • Goodness-of-fit varies by site / gene for a single phylogeny / model
     • Corollary: goodness-of-fit varies amongst models for a single genome
  19. Hypothesis-comparison tests using multiple phylogenies (4/5)
  20. Convergence detection by ΔSSLS - Parker et al. (2013)
     • De novo genomes:
       – four taxa
       – 2,321 protein-coding loci
       – 801,301 codons
     • Published:
       – 18 genomes
     • ~69,000 simulated datasets
     • ~3,500 cluster cores
     $$\Delta SSLS_i = \ln L_{i,H_0} - \ln L_{i,H_a}$$
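The per-site statistic on slide 20 is a simple difference of log-likelihoods. The sketch below assumes the site-wise ln L values under each tree have already been computed by some phylogenetics package; the numbers are invented for illustration and the function name is hypothetical.

```python
def delta_ssls(site_lnl_h0, site_lnl_ha):
    """Per-site ΔSSLS_i = ln L_(i,H0) - ln L_(i,Ha)."""
    if len(site_lnl_h0) != len(site_lnl_ha):
        raise ValueError("both hypotheses must be scored on the same sites")
    return [h0 - ha for h0, ha in zip(site_lnl_h0, site_lnl_ha)]


# Hypothetical site-wise log-likelihoods for one locus under two trees.
lnl_h0 = [-2.0, -3.5, -1.0]
lnl_ha = [-2.5, -3.0, -1.0]

per_site = delta_ssls(lnl_h0, lnl_ha)
locus_mean = sum(per_site) / len(per_site)
print(per_site, locus_mean)   # [0.5, -0.5, 0.0] 0.0
```

By the definition above, a positive value means the site fits H0 better and a negative value means it fits Ha better; averaging over sites gives one summary value per locus.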
  21. Our pipeline for detecting genome-wide convergence
  22. [Figure panel label: mean = 0.05]
  23. [Figure panel labels: mean = 0.05, mean = -0.01, mean = -0.08]
  24. Continuous distributions
     • Output approximates a continuous distribution
     • Comparing alternative hypotheses, it is apparent that the selection of tree largely determines location, skew, etc. (perhaps as expected)
     • But given that distribution tails are considered significant, the meaning and comparability of values in these tails is problematic
  25. Significance by simulation
     • Very common technique in evolutionary biology: simulate a large dataset under the null model, compare with empirical
     • In this context, simulate data to get unexpectedness U:
       $$U = 1 - \mathrm{cdf}(\Delta SSLS_{H_0 - H_a} \mid j)$$
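One way to read the unexpectedness statistic on slide 25 is as one minus the empirical cumulative distribution of the simulated null values, evaluated at the observed ΔSSLS. The sketch below makes that assumption explicit; the conditioning on j from the slide is not modelled, and the simulated values are invented placeholders, not output from the pipeline.

```python
from bisect import bisect_right


def unexpectedness(observed: float, simulated_null) -> float:
    """U = 1 - ecdf(observed), where ecdf is the fraction of simulated
    null values that are less than or equal to the observed value."""
    ordered = sorted(simulated_null)
    if not ordered:
        raise ValueError("need at least one simulated value")
    ecdf = bisect_right(ordered, observed) / len(ordered)
    return 1.0 - ecdf


# Hypothetical ΔSSLS values simulated under the null model.
null_values = [-0.2, -0.1, 0.0, 0.1, 0.2]
print(unexpectedness(0.1, null_values))   # 4 of 5 values are <= 0.1, so U is about 0.2
```

With only a handful of simulated values U is very coarse; the resolution of the tails is limited by the number of simulated datasets, which is part of why the deck reports tens of thousands of simulations.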
  26. Problems in multiple-hypothesis phylogeny comparisons (5/5)
  27. Multiple hypotheses
     • Alternative hypotheses drawn from tree space
     • Same dataset, different Ha, different U
     • What U expected for Ha?
     • More simulation, with multiple draws from tree space: Uc = U – mean Uc
  28. Tree space
     • In the context of ML, tree space can be thought of as the distance in lnL units (or any other related statistic*) between two trees with otherwise identical models / data
     • In our previous results this appeared continuous
     • This may be misleading; in reality tree space, or derived statistics, can be highly discontinuous
  29. Multiple comparisons
     • However, we recall that distance in tree space, or the shape of tree space, is not well determined
     • How to sample effectively to control U (as Uc)?
     • How to compare Uc for Ha?
     • Sample every point (tree)?
     • Sample lots?
     • Sample systematically? Inverse-distance? Etc.
  30. Tree space
     • Previously, with small empirical datasets, we assumed a single phylogeny was a good descriptor of most/many sites
     • With large datasets this may not be true
       – Small adjustments give a better fit for many sites
       – And so do some large rearrangements
     • Perhaps a better definition of tree space is needed
     • Considering two Ha equidistant from H0
  31. Tree distance properties
     • Scalar distances informative
     • Triagonality
     • Proportional to L for a given model(?)
     • Vectors informative (?)
  32. Tree distance candidates
     • Statistic or model-based measures:
       – Parsimony, ML or amino-acid/nucleotide distance
       – ΔlnL
     • Topology-based measures:
       – Number / type of rearrangement moves, e.g.
         • Nearest-neighbour interchange
         • Subtree prune-and-regraft
         • Tree bisection-and-reconnection
     • Algorithm-based measures:
       – Number of algorithm move steps
       – Wall clock time
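To make the first topology-based measure on slide 32 concrete, here is a minimal sketch of a nearest-neighbour interchange (NNI) around a single internal edge. The representation is an assumption chosen for brevity: an internal edge is described only by the four subtrees hanging off it, written as an unordered pair of unordered pairs. A full implementation over arbitrary trees would need a proper tree data structure.

```python
def split(left, right):
    """Unordered description of an internal edge: the two subtrees on one
    side versus the two subtrees on the other."""
    return frozenset([frozenset(left), frozenset(right)])


def nni_neighbours(a, b, c, d):
    """Given an internal edge separating subtrees (a, b) from (c, d),
    return the two arrangements reachable by one NNI move."""
    return {split((a, c), (b, d)), split((a, d), (b, c))}


# Four taxa: the unrooted tree ((A,B),(C,D)) has exactly one internal edge.
start = split(("A", "B"), ("C", "D"))
neighbours = nni_neighbours("A", "B", "C", "D")
print(len(neighbours), start in neighbours)   # 2 False
```

For four taxa there are only three unrooted topologies, so every tree is one NNI move from every other; with more taxa each internal edge contributes two neighbours, and the minimum number of such moves between two trees is one candidate for a distance in tree space.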
  33. Acknowledgements
     • School of Biological and Chemical Sciences, Queen Mary, University of London – Rossiter Group
       – Prof. Steve Rossiter (PI)
       – Drs Kalina Davies, Georgia Tsagkogeorga, Michael McGowen, Mao Xiuguang
       – Seb Bailey, Kim Warren
     • Others:
       – Profs Richard Nichols, Andrew Leitch (SBCS)
       – Drs Yannick Wurm, Richard Buggs, Chris Faulkes, Steve Le Comber (SBCS)
       – Drs Chris Walker & Rob Horton (GridPP HTC)
     • Sanger Centre
       – Dr James Cotton
     Photo (L-R): Joe Parker; Georgia Tsagkogeorga; Kalina Davies; Steve Rossiter; Xiuguang Mao; Seb Bailey