Phylogenetics Analysis in R

3,542 views
3,235 views

Published on

Phylogenetics Analysis in R

  1. 1. Phylogenetic Analyses in R Klaus Schliep Universidad de Vigo Porto, 15–16 July 2013
  2. 2. Outline Getting started Data Structures Distance based methods Maximum Parsimony Maximum likelihood
  3. 3. Section 1 Getting started
  4. 4. About This slides should give a short introduction into phylogenetic reconstruction in R. It focuses mostly on the packages ape and phangorn. I have to thank Emmanuel Paradis for his work on ape. The slides are produced with literate programming using Latex, Beamer, Sweave and R. So all the code and graphics are ”real”!
  5. 5. Help To install an R package it is good to have administrator rights. Download R from www.cran.r-project.org. You can easily install packges from within R: > install.packages("phangorn") > install.packages("phytools") > install.packages("pegas") > install.packages("seqLogo") > q() Then you can load the packages you need: > library("phangorn") > library("seqLogo")
  6. 6. Help The R homepage provides lots of general documentation, faqs, etc. There are help pages for all the functions and most of them contain examples. > library(help="phangorn") > help.start() > ?pml > help(pml) > example(pml) > vignette("Ancestral") Copy and paste the parts of the code in the examples is a good start. If you prefer reading a book (even they are fast outdated): Paradis, E. (2012) Analysis of Phylogenetics and Evolution with R (Second Edition) New York: Springer There is a mailing list stat.ethz.ch/mailman/listinfo/r-sig-phylo where you can ask questions, after browsing through the archive.
  7. 7. Section 2 Data Structures
  8. 8. Data Structures Reminder: 1. Data in R are made of vector + attribute(s) (and combinations of these). Vector: a series of elements all of the same kind (a list is a vector of pointers). 2. The class is the attribute determining the action of generic functions (plot, summary, etc.) We will make heavily use of the following 3 data structures: 1. phyDat: sequences (DNA, AA, codons, user defined) in phangorn 2. DNAbin: DNA sequences (ape format) 3. phylo: phylogenetic trees
  9. 9. Class phylo This class represents phylogenetic trees. The tip labels may be replicated, the node labels (which may be absent). Input: 1. read.tree: Newick files 2. read.nexus: NEXUS files If the file contains several trees, these two functions return an object of class multiPhylo which is a list of trees of class phylo. And you can write objects of class phylo using write.tree or write.nexus.
  10. 10. Plotting trees ape has great plotting capabilities. > help(plot.phylo) Some simple example > tree <- rtree(10) > par(mfrow=c(2,2), mar=rep(0,4)) > plot(tree) > plot(tree, type="fan") > plot(tree, type="unrooted") > plot(tree, type="cladogram")
  11. 11. Plotting trees t9 t10 t4 t8 t5 t3 t6 t1 t2 t7 t9 t10 t4 t8 t5 t3 t6 t1 t2 t7 t9 t10 t4 t8 t5 t3 t6 t1 t2 t7 t9 t10 t4 t8 t5 t3 t6 t1 t2 t7
  12. 12. Transforming trees There are many functions in ape and phangorn to transform trees (i.e. objects of class phylo) > root(tree, outgroup) > drop.tip(tree, "t1") > extract.clade(phy, 1) > bind.tree(tree1, tree2) > unroot(tree) > multi2di(tree) > di2multi(tree) > nni(tree) > rSPR(tree)
  13. 13. Class phyDat The starting point for phylogenetic reconstruction are sequence alignments. ape can call clustal,tcoffee and muscle and phyloch can call mafft, prank and gblocks. More frequently you will just read in an alignment > align1 <- read.phyDat("myfile") phangorn (phyDat) and ape (DNAbin) use different formats to represent alignments, but it is easy to convert formats. > align2 <- read.dna("myfile") # ape format > align3 <- as.phyDat(align1) # phangorn format
  14. 14. Section 3 Distance based methods
  15. 15. Distance based methods Distance methods take a distance or dissimilarity matrix as input. Ultrametric Additive upgmaa fastme.ols wpgmaa fastme.bal nj UNJa bionj a in phangorn the rest in ape. Fast methods O(n2) or O(n3) → big data sets can be analysed. Distances can be calculated for different kinds of data. In phylogenetics often used to compute starting trees for ML, MP or inside species tree methods.
  16. 16. Distance based methods > set.seed(1) > bs <- bootstrap.phyDat(Laurasiatherian, FUN = function(x > class(bs) <- 'multiPhylo' > cnet = consensusNet(bs, .3) > plot(cnet, show.tip.label=FALSE, show.nodes=TRUE)
  17. 17. Consensusnetwork
  18. 18. Section 4 Maximum Parsimony
  19. 19. Maximum parsimony In contrast to the distance methods (maximum) parsimony uses sequence alignments as input. The target is to minimize an optimality criterion, i.e. a score to a tree, given the data. For the parsimony method the score is the minimal number of substitutions needed to account for the data on a phylogeny. > data(Laurasiatherian) > tree = nj(dist.ml(Laurasiatherian)) > parsimony(tree, Laurasiatherian) [1] 9776 > tree2 = optim.parsimony(tree, Laurasiatherian, trace=FALSE, rearrangement="SPR") > parsimony(tree2, Laurasiatherian) [1] 9713 > tree3 = pratchet(Laurasiatherian, rearrangement="SPR", t
  20. 20. Branch and bound Normally it is not possible to evaluate an optimality criterion for all trees, as there are just too many trees. > sapply(3:10, howmanytrees, FALSE) [1] 1 3 15 105 945 10395 [7] 135135 2027025 > howmanytrees(20, FALSE) [1] 2.216431e+20 For small datasets it is possible to find all most parsimonious trees using a branch and bound algorithm. For datasets with more than 10 taxa this can take a long time and depends strongly on how tree like the data are. > besttree <- bab(subset(Laurasiatherian,1:10), trace=0) > parsimony(besttree, Laurasiatherian) [1] 2695
  21. 21. Ancestral reconstruction To reconstruct ancestral sequences we first load some data and reconstruct a tree: > primates = read.phyDat("primates.dna") > tree = pratchet(primates, trace=0) > tree = acctran(tree, primates) > parsimony(tree, primates) [1] 746 In parsimony analysis the edge length represent the observed number of changes. Reconstructiong ancestral states therefore defines also the edge lengths of a tree. However there can exist several equally parsimonious reconstructions or states can be ambiguous and therefore edge length can differ (e.g. ”MPR”or ”ACCTRAN”). > anc.acctran = ancestral.pars(tree, primates, "ACCTRAN") > anc.mpr = ancestral.pars(tree, primates, "MPR")
  22. 22. Ancestral reconstruction > seqLogo( t(subset(anc.mpr, getRoot(tree), 1:20)[[1]]), i 1 2 3 4 5 6 7 8 910 12 14 16 18 20 Position 0 0.2 0.4 0.6 0.8 1Probability
  23. 23. Ancestral reconstruction MPR > plotAnc(tree, anc.mpr, 17) > title("MPR") Mouse Bovine Lemur Tarsier Squir Monk Jpn Macaq Rhesus Mac Crab−E.Mac BarbMacaq Gibbon Orang Gorilla Chimp Human a c g t MPR
  24. 24. Ancestral reconstruction ACCTRAN > plotAnc(tree, anc.acctran, 17) > title("ACCTRAN") Mouse Bovine Lemur Tarsier Squir Monk Jpn Macaq Rhesus Mac Crab−E.Mac BarbMacaq Gibbon Orang Gorilla Chimp Human a c g t ACCTRAN
  25. 25. Section 5 Maximum likelihood
  26. 26. Maximum Likelihood ”[In 1961] I had visions of evolutionary tree estimation being much the same [than linkage estimation] but with the addition of the need to estimate the form of the tree itself, surely a fatal complexity: my intuition was that there would be insufficient data for the task.” —A.W.F. Edwards (2009) Phylogenetic likelihood is the probability f (x|θ, τ) of observing an alignment X given a model of (nucleotide) substitution with parameters θ and phylogenetic tree τ. L(θ, τ, x) = N i=1 f (xi |θ, τ) where N is the number of sites in the alignment. It is common to maximise the log-likelihood function (θ, τ, x) = N i=1 log (f (xi |θ, τ)) which also maximises L(θ, τ, x).
  27. 27. Applications in phylogenetics Felsenstein (1981) introduced the pruning algorithm which made the computation of the likelihood feasible. Let nodes j and k have a direct ancestor h. We can estimate the conditional likelihood Lh(xh) =   xj Lj (xj )pxj ,xh (tj )   × xk Lk(xk)pxk ,xh (tk) The likelihood of the tree is evaluated by traversing the tree in postorder fashion from the tips towards the root. For unrooted trees, a root can be chosen arbitrarily as our models are time-reversible. We get the likelihood of the tree if we multiply the conditional likelihood of the root node r with the base composition π, as fh(x|θ, τ) = xr πxr Lr (xr ), These formulas can be adapted to estimate ancestral sequences.
  28. 28. ML in phylogenetics 5 6 7 human chimp gorilla orangutan
  29. 29. ML in phylogenetics a a g t
  30. 30. ML in phylogenetics 1|0|0|0 1|0|0|0 0|0|1|0 0|0|0|1
  31. 31. ML in phylogenetics 1|0|0|0 1|0|0|0 0|0|1|0 0|0|0|1 0.000988|0.000031|0.000595|0.000744 0.027161|0.000559|0.016240|0.000559 0.923613|0.000168|0.000168|0.000169
  32. 32. Finding the best topology A binary unrooted tree has 5 edges and 3 distinct topologies. Here are the general formulas for binary unrooted trees: 2n − 3 edges (2n − 5)!! = 1 × 3 × 5 × · · · × (2n − 3) topologies Rooted binary trees have 2n − 2 edges and (2n − 5)!! topologies. A function exists for this: > howmanytrees(4, rooted=FALSE) [1] 3 > howmanytrees(10, rooted=FALSE) [1] 2027025 > howmanytrees(20, rooted=FALSE) [1] 2.216431e+20
  33. 33. Finding the best trees The strategy of evaluating the likelihood criterion for all trees in order to find the best tree topoology is in most cases highly impracticable. Instead, local tree rearrangements are used to search locally within the tree space. The idea behind such a heuristic is to use a starting tree and search locally for improved scores (parsimony, maximum likelihood, Least-Squares), until no further rearrangements can lead to a tree with a better score.
  34. 34. Nearest neighbor interchange For any internal edge of a binary tree there exist three different ways to connect its four subtrees, one of which is the current tree. A B C D A C B D A D B C
  35. 35. Modelling rate variation We assume that the substitution rate varies between different sites (intron vs. exon, codon positions, etc). Two approaches are commonly used: define different partitions model rate variation with different rate categories, with a (discrete) Γ distribution and/or proportion of variables sites
  36. 36. Comparing trees and models The phylogenetic likelihood allows us to compare many different models or trees. There is often a bias vs. variance trade-off. Simple models are easy to interpret but can often be biased. MSE Variance Bias2 number of parameters
  37. 37. Comparing trees and models The phylogenetic likelihood allows us to compare many different models or trees. If two models are nested - that is, one model can be described as a special case of the other – then we can directly compare their likelihoods under their ML parameter estimates for a fixed tree using a likelihood ratio test (LRT) For non nested models we can use the Akaike Information Criteria (AIC) or the Bayesian Information Criteria (BIC): AIC = − (θ, τ, x) + 2 ∗ df BIC = − (θ, τ, x) + ln(n) ∗ df where df is the number of parameters of the model and n the number of sites. Or use the Shimodaira-Hasegawa test or similar bootstrap approaches.
  38. 38. Detection of molecular adaptation We look at each triplet of nucletides and assume that only one nucleotide can be replaced at a time. Then we can distinguish between nucleotide substitutions that result in the same amino acid (synonymous substitutions) or a different amino acid (non-synonymous substitutions). The ratio dN/dS of non-synonymous to synonymous substitutions can be an indication of the kind of selective pressure acting on the codon site. Under negative selection, we expect that non-synonymous substitutions will accumulate more slowly than synonymous ones. And under positive or diversifying selection, we expect more amino acid changing replacements.
  39. 39. Applications with phangorn The two main functions are pml to set up the model and optim.pml for optimising parameters and the tree with ML. Example session for Jukes Cantor, GTR and GTR+Γ+I model: > data(Laurasiatherian) > tr <- nj(dist.ml(Laurasiatherian)) > m0 <- pml(tr, Laurasiatherian) > m.jc69 <- optim.pml(m0, optNni=TRUE) > m.gtr <- optim.pml(m0, optNni=TRUE, model="GTR") > m.gtr.G.I <- optim.pml(update(m.gtr, k=4), model= "GTR", optNni=TRUE, optGamma=TRUE, optInv=TRUE) By default, only the edge lengths are optimized. Currently phangorn only supports NNI tree rearrangements (equivalent to PhyML vers. 2)
  40. 40. There exist several useful generic functions like update, anova or AIC for objects of class pml. > methods(class="pml") [1] anova.pml logLik.pml plot.pml print.pml [5] update.pml vcov.pml For example we can compare the different models as they are nested with likelihood ratio test: > anova(m.jc69, m.gtr, m.gtr.G.I) Likelihood Ratio Test Table Log lik. Df Df change Diff log lik. Pr(>|Chi|) 1 -54113 91 2 -50603 99 8 7020 < 2.2e-16 *** 3 -44527 101 2 12151 < 2.2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘
  41. 41. Partition models pmlPart(global ∼ local, object, model) global local bf bf Q Q inv inv shape shape edge edge rate nni Each component can be only used once in the formula.
  42. 42. Partition models There are two different ways to set up partition models. 1. Setting up partition models for different genes. > fit1 <- pml(tree, g1) > fit2 <- pml(tree, g2) > fit3 <- pml(tree, g3) > fit4 <- pml(tree, g4) > genePart <- pmlPart(Q + bf ∼ edge, list(fit1, fit2, fit3, fit4), optRooted=TRUE) > trees <- lapply(genePart$fits, function(x)x$tree) > class(trees) <- "multiPhylo" > densiTree(trees, type="phylogram", col="red") where g1, g2, g3 and g4 are objects of class phyDat.
  43. 43. ML in phylogenetics Scer Spar Smik Skud Sbay Scas Sklu Calb
  44. 44. Partition models 2. Partitioning via a weight matrix. > woody <- phyDat(woodmouse) > tree <- nj(dist.ml(woody)) > fit <- pml(tree, woody) > w <- attr(woody, "index") > weight <- table(w, rep(c(1,2,3), length=length(w))) > codonPart <- pmlPart(edge ∼ rate, fit, model=c("JC", "JC", "GTR"), weight=weight)
  45. 45. Model / tree comparison Alternatively we can use the Shimodaira-Hasegawa test to check for differences between models: > SH.test(m.jc69, m.gtr, m.gtr.G.I) Trees ln L Diff ln L p-value [1,] 1 -54112.74 9585.685 0.0000 [2,] 2 -50602.74 6075.683 0.0000 [3,] 3 -44527.06 0.000 0.5911
  46. 46. Model selection Two possibilities ape: phymltest > write.phyDat(woody, "woody.phy") > out <- phymltest("woody.phy", execname = "~/phyml") phangorn: modelTest > mt <- modelTest(Laurasiatherian, model=c("JC", "F81", "HKY", "GTR")) modelTest works also for amino acid models similar to ProtTest. > mt <- modelTest(myAAData, model=c("WAG", "JTT", "LG","Dayhoff"))
  47. 47. Model Selection Model df logLik AIC BIC 1 JC 91.00 -54303.67 108789.35 109341.20 2 JC+I 92.00 -50673.32 101530.63 102088.55 3 JC+G 92.00 -48684.10 97552.19 98110.11 4 JC+G+I 93.00 -48605.03 97396.06 97960.05 5 F81 94.00 -54212.64 108613.27 109183.32 6 F81+I 95.00 -50549.53 101289.06 101865.17 7 F81+G 95.00 -48500.49 97190.99 97767.10 8 F81+G+I 96.00 -48416.26 97024.51 97606.69 9 HKY 95.00 -51275.86 102741.72 103317.83 10 HKY+I 96.00 -47451.73 95095.45 95677.63 11 HKY+G 96.00 -44893.11 89978.23 90560.40 12 HKY+G+I 97.00 -44770.18 89734.36 90322.60 13 GTR 99.00 -50759.89 101717.79 102318.16 14 GTR+I 100.00 -47081.77 94363.55 94969.98 15 GTR+G 100.00 -44759.49 89718.99 90325.42 16 GTR+G+I 101.00 -44624.02 89450.04 90062.54
  48. 48. Bootstrap > bs <- bootstrap.pml(m.gtr, bs=100, optNni=TRUE) > plotBS(m.gtr$tree, bs, type="phylo", bs.adj=c(.5,0)) Platypus Wallaroo Possum Bandicoot Opposum Armadillo Elephant Aardvark Tenrec Hedghog Gymnure Mole Shrew Rbat FlyingFox RyFlyFox FruitBat LongTBat Horse Donkey WhiteRhino IndianRhin Pig Alpaca Cow Sheep Hippo FinWhale BlueWhale SpermWhale Rabbit Pika Squirrel Dormouse GuineaPig Mouse Vole CaneRat Baboon Human Loris Cebus Cat Dog HarbSeal FurSeal GraySeal 10058 100 100 100 58 93 100100 100100 64 58 100 86 100 100 98 96 100100 87 100 44 79 100 88 97 64 86 73 75 100 5489 100 70 47 91 55 68 67 100 100
  49. 49. Codon Models qij =    0 if i and j differ in more than one position πj for synonymous transversion πj κ for synonymous transition πj ω for non-synonymous transversion πj ωκ for non-synonymous transition or if we make abstraction of pij (frequency of base j): qij =    0 if i and j differ in more than one position 1 for synonymous transversion κ for synonymous transition ω for non-synonymous transversion ωκ for non-synonymous transition where ω is the dN/dS ratio, κ the transition transversion ratio and πj is the the equilibrium frequencies of codon j.
  50. 50. Codon Models > (dat <- phyDat(as.character(yeast), "CODON")) > tree <- nj(dist.ml(yeast)) > fit <- pml(tree, dat) > ctr <- pml.control(trace=0) > fit0 <- optim.pml(fit, control = ctr) > fit1 <- optim.pml(fit0, model="codon1", control=ctr) > fit2 <- optim.pml(fit0, model="codon2", control=ctr) > fit3 <- optim.pml(fit0, model="codon3", control=ctr) Model κ ω codon0 1 1 codon1 free free codon2 1 free codon3 free 1 Additionally, the equilibrium frequencies of the codons πj can be estimated setting the parameter optBf=TRUE.
  51. 51. Codon Models > anova(fit0, fit2, fit1) Likelihood Ratio Test Table Log lik. Df Df change Diff log lik. Pr(>|Chi|) 1 -1054762 13 2 -648282 14 1 812961 < 2.2e-16 *** 3 -642807 15 1 10949 < 2.2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ > anova(fit0, fit3, fit1) Likelihood Ratio Test Table Log lik. Df Df change Diff log lik. Pr(>|Chi|) 1 -1054762 13 2 -708674 14 1 692176 < 2.2e-16 *** 3 -642807 15 1 131735 < 2.2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘

×