Species and Gene Trees: History, Inference, and Visualization - Joseph Heled
 

Computer cycles have been getting exponentially cheaper for the past 40 years, and the cost of DNA sequencing is declining even faster. Both technological achievements are responsible for the rebirth of phylogenetics, where the digital information in genetic data is used to infer the relatedness of organisms by employing powerful statistical methods. In my talk I will briefly review the history of inferring species family trees, introduce a recent Bayesian method which uses data from multiple organisms in closely related species, and show how the method's output can be visualized to better understand the results.

Usage Rights

© All Rights Reserved

    Presentation Transcript

    • Introduction The Coalescent Bayesian Inference MCMC Visualisation Tree Shape Taxa Position
    • Pre-Darwin phylogenetic trees
    • The Origin's sole figure
    • The Cytochrome C Gene Tree (Fitch, 1967)
    • Processes of speciation; evolution of traits; biogeography; epidemiology; co-evolution (host/parasite); domestication
    • Selecting a “Duck”
    • The Molecular Clock (early ’60s)
    • Models of Sequence Evolution: JC69 model (Jukes and Cantor, 1969)
    • The Kingman Coalescent (1982)
    • Wright-Fisher Population (1931):
      • The individuals were randomly sampled from a population of size N.
      • The parent of any individual is chosen uniformly at random from all potential parents.
    • The Coalescent: The larger the population, the further back in time (on average) you have to travel to reach the common ancestor.
    • The Coalescent for multiple individuals: The waiting time for the first common ancestor of two individuals out of m (going backwards in time) is exponential with a rate of $\binom{m}{2} / N_e$. $N_e$ is the Wright-Fisher effective population size.
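The waiting-time rule on this slide can be turned into a small simulation. The sketch below is illustrative only and is not code from the talk: it assumes time is measured in generations, applies the slide's rate $\binom{k}{2} / N_e$ while k lineages remain, and the function names `coalescent_times` and `expected_tmrca` are made up for this example.

```python
import random
from math import comb


def coalescent_times(m, ne, rng):
    """Sample the waiting times between successive coalescent events.

    While k lineages remain, the wait is exponential with rate C(k, 2) / ne.
    """
    times = []
    for k in range(m, 1, -1):
        rate = comb(k, 2) / ne
        times.append(rng.expovariate(rate))
    return times


def expected_tmrca(m, ne):
    """Expected time to the common ancestor of all m lineages (sum of 1/rate)."""
    return sum(ne / comb(k, 2) for k in range(m, 1, -1))


rng = random.Random(1)
m, ne = 5, 1000.0
sims = [sum(coalescent_times(m, ne, rng)) for _ in range(20000)]
print("simulated mean:", sum(sims) / len(sims))
print("expected mean: ", expected_tmrca(m, ne))
```

Doubling `ne` doubles every expected waiting time, which is the "larger population, longer wait" statement from the previous slide.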
    • From Models to Inference
    • Bayes’ Theorem (a Reminder): $P(A \wedge B) = P(A)\,P(B|A)$
    • Bayes’ Theorem (2): $P(B)\,P(A|B) = P(A \wedge B) = P(A)\,P(B|A)$, hence $P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$
    • Bayesian Inference
    • Models (so far):
      • Substitution model: A stochastic process for the evolution (change) of genetic data (sequences) over time.
      • Clock model: How substitution rates change over time.
      • Coalescent model: A stochastic process for the ancestral relationship between a group of homologous sequences from several individuals.
    • Models (Math Notation): Coalescent model: $f(T|N_e)$. Substitution model: $f(G|T)$, where G is the gene (sequence data) and T is the ancestral relationships (tree).
    • The Biological Species Concept: The conventional definition of “a species” amongst evolutionary biologists is “a group of organisms whose members interbreed among themselves, but are separated from other groups by genetically-based barriers to gene flow.” (Jerry Coyne, “Why Evolution is True” blog)
    • The Species “tree”
    • The Gene(s) tree
    • Species Tree Ancestral Reconstruction
    • Multiple Individuals from each Species
    • Multispecies coalescent: Kingman Coalescent per Species Tree Branch
    • Multiple Independent Loci
    • The Multispecies Posterior: $P(S|D) = \int_g P(S, g|D) \propto \int_g P(D|S, g)\,P(S, g) = \int_g P(D|g)\,P(S, g) = \int_g P(D|g)\,P(g|S)\,P(S)$
    • A Complex Posterior: $P(S|D) = \int_g f(S, g|D) = \frac{\int_g f(D|S, g)\,f(S, g)}{f(D)}$
    • Problem 1: f(D), the prior probability of obtaining data D. $P(S|D) = \frac{\int_g f(D|S, g)\,f(S, g)}{f(D)}$. We don’t know the value of f(D): $f(D) = \int_{g,S} f(D|S, g)\,f(S, g)$. However, it is a constant.
    • Problem 2: The Whole Damn Thing. The posterior is a distribution defined by a complex multidimensional integral.
    • Enter MCMC: Markov Chain Monte Carlo (MCMC) is a class of methods for stochastically sampling from probability distributions based on constructing a Markov chain.
    • Very short history of MCMC:
      • 1953: Metropolis algorithm published in Journal of Chemical Physics (Metropolis et al.)
      • 1970: Hastings algorithm in Biometrika (Hastings)
      • 1974: Gibbs sampler and Hammersley-Clifford theorem paper by Besag
      • 1980s: Image analysis and spatial statistics adopted MCMC algorithms; they were not popular elsewhere due to the lack of computing power
      • 1995: Reversible jump algorithm in Biometrika (Green)
      Source: groundtruth.info/AstroStat/slog/2008/mcmc-historyo
    • MCMC in a nutshell
    • MCMC in a nutshell (2)
    • MCMC in a nutshell (3)
    • MCMC in a nutshell (4)
    • MCMC in a nutshell (5): (Figure: two states, A with weight 2 and B with weight 1.) If from either A or B we propose to go to A or B with equal probability, then the flow from A to B is 2/3 · 1/4 = 1/6, and from B to B it is 1/3 · 1/2 = 1/6, which is 1/3 in total.
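The flow arithmetic on this slide can be checked with exact fractions. This is a minimal sketch, assuming the two-state target with weights 2 (A) and 1 (B) and the usual Metropolis acceptance probability min(1, f(y)/f(x)); the names `f`, `q`, `accept` and `transition` are chosen for this example only.

```python
from fractions import Fraction

# Unnormalised target weights: A has weight 2, B has weight 1.
f = {"A": Fraction(2), "B": Fraction(1)}
# Symmetric proposal: from either state, propose A or B with probability 1/2.
q = {x: {y: Fraction(1, 2) for y in f} for x in f}


def accept(x, y):
    """Metropolis acceptance probability for a proposed move x -> y."""
    return min(Fraction(1), f[y] / f[x])


def transition(x, y):
    """Probability that one Metropolis step started at x ends at y."""
    if x != y:
        return q[x][y] * accept(x, y)
    # Staying put: propose x itself, or propose another state and reject it.
    return q[x][x] + sum(q[x][z] * (1 - accept(x, z)) for z in f if z != x)


total = sum(f.values())
pi = {x: f[x] / total for x in f}  # normalised target: A = 2/3, B = 1/3
flow_into_b = {x: pi[x] * transition(x, "B") for x in f}
print(flow_into_b, sum(flow_into_b.values()))
```

The total flow into B equals the target probability of B, so the chain leaves the target distribution unchanged.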
    • MCMC in a nutshell (6): Hastings Ratio (x to y) $= \frac{p(y \to x)\,f(y)}{p(x \to y)\,f(x)}$. So far we had $p(y \to x) = p(x \to y)$, that is, the probability of going from x to y was equal to that of going back.
    • MCMC in a nutshell (7): If at A we always propose to go to B, but from B we propose A or B with equal probability, that is, p(A → A) = 0, p(A → B) = 1, p(B → A) = 1/2, p(B → B) = 1/2, then $HR(A \to B) = \frac{1/2 \cdot 1}{1 \cdot 2} = 1/4$ and $HR(B \to A) = 4$.
    • MCMC in a nutshell (8): (Figure: two states, A with weight 2 and B with weight 1.) Flow from B to A is 1/3 · 1/2 = 1/6, and from A to A it is 2/3 · 3/4 = 1/2, which is 2/3 in total.
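The asymmetric-proposal example can be verified the same way. This is a sketch under the assumption that a proposed move from x to y is accepted with probability min(1, HR(x → y)), with target weights 2 for A and 1 for B; the helper names are illustrative and not taken from the talk.

```python
from fractions import Fraction

f = {"A": Fraction(2), "B": Fraction(1)}  # unnormalised target weights
q = {
    "A": {"A": Fraction(0), "B": Fraction(1)},        # at A always propose B
    "B": {"A": Fraction(1, 2), "B": Fraction(1, 2)},  # at B propose A or B equally
}


def hastings_ratio(x, y):
    """p(y -> x) f(y) / (p(x -> y) f(x))."""
    return (q[y][x] * f[y]) / (q[x][y] * f[x])


def accept(x, y):
    """Acceptance probability for a proposed move x -> y."""
    return min(Fraction(1), hastings_ratio(x, y))


def transition(x, y):
    """Probability that one Metropolis-Hastings step started at x ends at y."""
    moves = [z for z in f if z != x and q[x][z] > 0]
    if x != y:
        return q[x][y] * accept(x, y) if y in moves else Fraction(0)
    # Staying put: propose x itself, or propose another state and reject it.
    return q[x][x] + sum(q[x][z] * (1 - accept(x, z)) for z in moves)


total = sum(f.values())
pi = {x: f[x] / total for x in f}
flow_into_a = {x: pi[x] * transition(x, "A") for x in f}
print(hastings_ratio("A", "B"), hastings_ratio("B", "A"))
print(flow_into_a, sum(flow_into_a.values()))
```

Without the p(y → x) / p(x → y) correction the chain would spend too little time at A, because every step from A proposes leaving it.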
    • Tree(s) Visualisation
    • Traditional Tree Visualization (Figure: tree with tips Cyan, con, oax, FLSJ, sum, int, coast, ins, ult, couc, pot, woll, aria, arib and numeric node labels ranging from 0.38 to 1.)
    • Traditional Tree Visualization (2)
    • Species Tree with Population Sizes
    • Species Tree with Gene Trees
    • Species Tree (Densitree)
    • The “Star Tree”
    • Species Tree (Densitree)
    • Taxa Order Matters (When Drawing Multiple Trees) (Figure labels: frequencies 73%, 17%, 10%; tip orders 0 1 2 3 4 5 6, 0 1 5 6 2 3 4, and 2 3 4 5 6 0 1.)
    • Some Orders are Better than Others (Figure labels: frequencies 73%, 17%, 10%; tip orders 0 1 2 3 4 5 6, 0 1 5 6 2 3 4, 2 3 4 5 6 0 1, and 2 3 4 0 1 5 6.)
    • Disadvantages:
      • Population size changes in one branch have a visual effect on other branches.
      • Fails when trying to show the whole posterior à la DensiTree.
      • No obvious way to extend to trees with constant population size per branch.
    • The Imperial AT-AT Tree: Provide some space between branches.
    • A Double Act: Target species tree (blue) and BEAST posterior summary (orange).
    • Species Tree with Constant Population Sizes: To extend to constant branches, we need a rule to place the bottom of the branch on top of the descendant branches. We use the proportion rule.
    • Position, Position, Position: A species tree specifies heights and widths. The challenge is to pick good X-axis positions. The star tree builds the tree from the root towards the tips. Building from the tips towards the root is simpler when drawing species trees. When building from the tips, the descendants' X-positions determine the parent position.
    • Position, Position, Position: However, there are many ways to place nodes. Here are four of them:
      • Descendants Mean: Halfway between direct descendants.
      • Tips Mean: Average of all tips in the sub-tree.
      • Middle: Halfway between the rightmost tip of the left sub-tree and the leftmost tip of the right sub-tree.
      • Balanced by Population: At the point minimizing the difference between branch bottom and top centers.
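The first three placement rules can be sketched in a few lines. This is an illustration only, assuming a binary tree written as nested tuples with tips placed at x = 0, 1, 2, ... in left-to-right order; "Balanced by Population" is left out because it needs the branch widths. The names `place` and `root_x` are hypothetical.

```python
def place(tree, method, counter):
    """Return (x of this node, x-positions of all tips below it)."""
    if isinstance(tree, str):
        # Tips are placed left to right at consecutive integer positions.
        x = float(counter[0])
        counter[0] += 1
        return x, [x]
    left, right = tree
    lx, ltips = place(left, method, counter)
    rx, rtips = place(right, method, counter)
    tips = ltips + rtips
    if method == "descendants_mean":
        x = (lx + rx) / 2                   # halfway between direct descendants
    elif method == "tips_mean":
        x = sum(tips) / len(tips)           # average of all tips in the sub-tree
    elif method == "middle":
        x = (max(ltips) + min(rtips)) / 2   # between the two facing inner tips
    else:
        raise ValueError(f"unknown method: {method}")
    return x, tips


def root_x(tree, method):
    """X-position of the root under the given placement rule."""
    return place(tree, method, [0])[0]


balanced = (("a", "b"), ("c", "d"))
unbalanced = ((("a", "b"), "c"), "d")
for method in ("descendants_mean", "tips_mean", "middle"):
    print(method, root_x(balanced, method), root_x(unbalanced, method))
```

On the balanced tree all three rules put the root at the same position; on the unbalanced (ladder-like) tree they diverge.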
    • Node Positioning: The methods are similar for balanced trees. The difference is in the handling of unbalanced trees. (Figure panels: D-Mean, T-Mean, Middle, Balanced.)
    • Node Positioning (Figure panels: D-Mean, Balanced.)
    • Species Trees Posterior (Figure: tips oax, con, int, ins, coast, sum, FLSJ, ult, couc, pot, woll, arib, aria, Cyan; axes with ticks from 0 to 80 and from 0 to 15.)
    • Gene Trees Within Species Trees: Preliminary. Next we would like to draw the gene tree within the species tree. Hurdle 1: Obtain a suitable gene tree. The gene tree has to be compatible with the species tree. This is not a problem when drawing a specific MCMC state, but is a problem when using summary trees.
    • Tips Positioning: Hurdle 2: Branches with non-constant width complicate the positioning of tips. (Figure: example panels.)
    • Tips Positioning (automatic): The placement ensures that extrema points are at least a given distance apart (horizontally). (Figure: example panel.)
    • Tips Positioning (automatic): Even with Python, it is far from trivial to implement. Remember, we want a tight fit, and placing should work for all modes of internal positioning. Basically, we build the tree from the bottom up by joining clades. The X position of the extrema points for the clade is a linear function of the spacing between the sub-trees (where the spacing inside the two sub-trees is fixed). So each extremum point sets a lower limit on the spacing, and the largest is taken as the final separation. The best way to pick the minimum distance for a tree still needs to be worked out.
    • Drawing the Gene Tree: Hurdle 3: A suitable policy for drawing the gene tree. We reuse the ideas of the Star Tree. Given the position of an internal node, the left/right branches are drawn as straight lines towards the “middle” of the left/right sub-trees. But we still need to handle the species transitions. From the bottom up, we (linearly) map the lineages leaving the branch to the top of the branch. The top of the clade is then put in the middle of the mapped taxa.
    • Visual Clutter (Figure: gene tree with tips g0 to g15 contained in a species tree of speciesA to speciesE; axis from 0 to 900.)
    • Gene Tree Inside Species Tree: As-Is (Figure: tips ID0 to ID6.)
    • Reducing Visual Clutter (1): (Figure: tip orders a b x and b a x.) In the order b a x, the size (number of tips) of the sub-tree (b,x) is 2, but the span (number of tips between leftmost and rightmost) is 3.
    • Reducing Visual Clutter (2): For every sequential arrangement of the gene tree taxa we can get a rough measure of the amount of crossings, $\sum_{n \in \text{Internal Nodes}} \left( \mathrm{span}(n) - \mathrm{size}(n) \right)$. size(n) is the number of taxa in the sub-tree. span(n) is the number of taxa in the group bounded by the leftmost and rightmost tips of the sub-tree. The difference is the excess taxa, the number of potential lineages that may need to cross out of the clade. Note that valid arrangements depend on the orientation of the species tree, so optimization should be over both species ordering and gene tips arrangements compatible with that order. Since the number of arrangements is typically large, we resort to multiple tries of hill climbing. The number of tries might be fixed or bounded by time.
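The excess-taxa measure and the restart strategy can be sketched as follows. This is a simplified illustration, not the talk's implementation: it assumes the gene tree is given as nested tuples of tip names, uses random pairwise swaps as the hill-climbing move, and ignores the constraint that arrangements must be compatible with the species tree order. The names `clades`, `excess` and `hill_climb` are made up for this example.

```python
import random


def clades(tree):
    """Return the tip set of every internal node of a nested-tuple tree."""
    found = []

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        tips = frozenset().union(*(walk(child) for child in node))
        found.append(tips)
        return tips

    walk(tree)
    return found


def excess(order, tree_clades):
    """Sum over internal nodes of span(n) - size(n) for a given tip order."""
    index = {tip: i for i, tip in enumerate(order)}
    total = 0
    for clade in tree_clades:
        xs = [index[tip] for tip in clade]
        span = max(xs) - min(xs) + 1
        total += span - len(clade)
    return total


def hill_climb(tree, tips, tries=20, steps=200, seed=0):
    """Several hill-climbing runs from random orders; keep the best result."""
    rng = random.Random(seed)
    tree_clades = clades(tree)
    best_order, best_score = None, None
    for _ in range(tries):
        order = list(tips)
        rng.shuffle(order)
        score = excess(order, tree_clades)
        for _ in range(steps):
            i, j = rng.sample(range(len(order)), 2)
            order[i], order[j] = order[j], order[i]
            new_score = excess(order, tree_clades)
            if new_score <= score:
                score = new_score
            else:
                order[i], order[j] = order[j], order[i]  # undo a worsening swap
        if best_score is None or score < best_score:
            best_order, best_score = list(order), score
    return best_order, best_score


gene_tree = ("a", ("b", "x"))
tree_clades = clades(gene_tree)
print(excess(["a", "b", "x"], tree_clades), excess(["b", "a", "x"], tree_clades))
print(hill_climb(gene_tree, ["b", "a", "x"]))
```

For the three-tip example from the previous slide, the order b a x scores 1 (the sub-tree (b,x) has span 3 and size 2) while a b x scores 0. Here `tries` is fixed; a time-bounded loop would be an equally valid stopping rule.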
    • Unresolved Conflicts: Optimized tree on the right.