Comparative analysis including
phylogeny in R
AE Zanne et al. Nature 000, 1-4 (2013) doi:10.1038/nature12872
Time-calibrated maximum-likelihood estimate of the molecular phylogeny for 31,749 species of seed plants.
Bianca A. Santini
@myoldowlisdead
b.santini@sheffield.ac.uk
What is a phylogeny?
• Hypothesis that explains the evolutionary relationship among taxa*
*taxa: species, or higher taxonomic levels.
Also genes, or sequences
node
terminals/tips/leaves
root
A B C
D
Internal
branch
branch
External branch
ModifiedfromNatureScitable:
http://www.nature.com/scitable/topicpage/reading-a-phylogenetic-
tree-the-meaning-of-41956#
Why use phylogenies…
…in comparative analysis?
• Comparative analyses are used to assess the ecological significance of a
particular trait
• However, because there is a shared history…
a) they are not statistically independent data
b) the feature under study might exist
because of shared ancestry
One example of data analyzed without a
phylogeny
• Salisbury’s data (1927, yes 88yrs ago!)
• Observations
– Differences in stomata density (SD)
between sun(>) and shade (<) leaves
• Measured stomatal density and related
them to life-form, habitat type
• Conclusion: SD increases with exposure
SD
Trees Shrubs Herbs Woody Herbs
plants
SD
Marginal
herbs
Understory
herbs
Photo by A. Vazquez-Lobo
Re-analysis of Salisbury’s data
• Independent contrasts (Felsenstein,
1985)
– Introduces a phylogeny
– The trait changes along the branches of the
tree, should be associated to changes in the
explanatory variable
– no. of times the traits changes in concert
with the environmental variable
(agreements)
vs.
no. of times they do not (disagreements)
– Sign test
SD
Trees Shrubs Herbs
Woody Herbs
plants
SD
Marginal
herbs
Understory
herbs
Kelly and Beerling, 1995 (68 yrs after)
How to get your own phylogeny ?
phylomatic (uses a megatree)
http://phylodiversity.net/phylomatic/
Use and trim and already published
phylogeny:
- Dryad: http://datadryad.org/
- Ecological Archives (from the esa)
phyloGenerator (uses gen bank sequences)
http://willpearse.github.io/phyloGenerator/
This is if you don’t have the sequences, or are
not planning to get them.
Package CAPER : pgls()
Similar approach as in Independent Contrasts, but uses a matrix
of variances and covariances (tree)
N.B. If interested in phylogenies and evolution
analyses: geiger, adephylo, picante, phylolm, ape…
pgls: phylogenetic generalized least squares
what do you need?
1. Phylogeny
2. Data
– Make sure the rows in your data frame are the same as the tips of your tree
i.e. your data: Juncus bufonius
tree: Juncus_bufonius
> my.data$underscore.name=gsub(" ","_",my.data$underscore.name)
– Make sure you have one observation per species per trait:
3. Put them into a comparative.data()
Species names Leaf area Seed mass
Juncus_bufonius 120.2 0.24
Setaria_pumila 91.2 6.91
#1)PHYLOGENY
> tree<-read.tree ("Vascular_Plants_rooted.dated.tre”) ##or read.nexus()
> tree <- congeneric.merge(tree,my.data$underscore.name) ##pez package
Number of species in tree before: 401
Number of species in tree now: 550
> tree
Phylogenetic tree with 550 tips and 393 internal nodes.
Tip labels:
Gladiolus_italicus, Juncus_squarrosus, Juncus_bufonius, Bolboschoenus_maritimus, Isolepis_setacea,
Cyperus_fuscus, ...
Node labels:
, , , , , , …
Rooted; includes branch lengths.
pgls()
#You can always check for synonyms and replace (taxize)
> my.data$underscore.name<-recode(my.data$underscore.name, "'Aegilops_geniculata' =
'Aegilops_ovata'")
> plot.phylo(tree, cex=0.45, type="radial", edge.color=c("red", "orange", "blue"))
pgls()
##trim your tree
> tree <- drop.tip(tree, setdiff(tree$tip.label, my.data$underscore.name))
#2) YOUR DATA
> dat<-data.frame(read.csv(“mydata.csv",header=T))
#3)PUT THEM together in comparative data, which will drop rows with NAs for you
and match the rows to the tips of the phylogeny :D
> cdat <- comparative.data(data = dat, phy = tree, names.col = ”underscore.name”,
scope=leaf.area~seed.mass, vcv=TRUE) #na.omit=FALSE #warn.dropped=TRUE
> cdat$dropped #to see what has been dropped.
#to get the phylogenetic signal : lambda=‘ML’
#0 is a star phylogeny (no phylo signal), and 1 is an structured phylogeny, or all is explained by the
phylogeny.
> fit= pgls(leaf.area~seed.mass, cdat, lambda='ML')
> summary(fit)
pgls()
> summary(fit)
Call:
pgls(formula = leaf.area ~ seed.mass, data = dat,
lambda = "ML")
Residuals:
Min 1Q Median 3Q Max
-0.176405 -0.046501 0.003632 0.047885 0.227434
Branch length transformations:
kappa [Fix] : 1.000
lambda [ ML] : 0.863
lower bound : 0.000, p = < 2.22e-16
upper bound : 1.000, p = < 2.22e-16
95.0% CI : (0.771, 0.919)
delta [Fix] : 1.000
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.730721 0.318898 8.563 4.441e-16 ***
seed.mass 0.442324 0.042318 10.452 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.07308 on 373 degrees of freedom
Multiple R-squared: 0.2265, Adjusted R-squared: 0.2245
F-statistic: 109.2 on 1 and 373 DF, p-value: < 2.2e-16
But what if you want to analyze
factors?
#CAPER has some bugs…
> cdat <- comparative.data(data = dat, phy = tree, names.col = ”underscore.name”,scope =
leaf area ~nitro.class, vcv=TRUE)
> fit= pgls(leaf area ~nitro.class, cdat, lambda='ML')
> anova(fit)
Error in terms.formula(formula,data=data):
invalid model formula in ExtractVars
#solve it like (below)
> fit= pgls(leaf.area~nitro.class, cdat, lambda='ML')
> fit1= pgls(leaf.area~1, cdat, lambda='ML')
> anova(fit, fit1)
Error in anova.pglslist(object, ...) :
models were fitted with different branch length transformations.
##If you click on summary, you’ll see
Call:
pgls(formula = leaf.area~ nitro.class, data = dat,
lambda = "ML")
Residuals:
Min 1Q Median 3Q Max
-0.200441 -0.049287 -0.002017 0.051002 0.200019
Branch length transformations:
kappa [Fix] : 1.000
lambda [ ML] : 0.867
lower bound : 0.000, p = < 2.22e-16
upper bound : 1.000, p = < 2.22e-16
95.0% CI : (0.772, 0.926)
delta [Fix] : 1.000
Call:
pgls(formula = leaf.area ~ 1, data = dat, lambda = "ML")
Residuals:
Min 1Q Median 3Q Max
-0.274585 -0.061252 0.003683 0.053449 0.253368
Branch length transformations:
kappa [Fix] : 1.000
lambda [ ML] : 0.896
lower bound : 0.000, p = < 2.22e-16
upper bound : 1.000, p = < 2.22e-16
95.0% CI : (0.825, 0.940)
delta [Fix] : 1.000
#giving both models the same value
> fit= pgls(leaf.area~nitro.class, cdat, lambda=0.885)
> fit1= pgls(leaf.area~1, cdat, lambda=0.885)
> anova(fit, fit1)
 You can also use gls(), instead of lambda do method=‘ML’
 Visualize your tree, always exciting!
 In R
> plot(tree)
> help(plot.phylo) #install ape
#and: http://www.r-phylo.org/wiki/Main_Page
 Use FigTree (drop the file and it will do the phylogeny for you)
 phytools
>plot.phylo(tree, cex=0.3)
>plot.pylo(tree, cex=0.45, type="cladogram", show.Ep.label=FALSE
>plot.phylo(tree, cex=0.45, type="fan", edge.color=c("red", "orange", "green","blue"), edge.lty=5)
Thanks
Bianca A. Santini
@myoldowlisdead
b.santini@sheffield.ac.uk
From Freckleton et al. 2002.

Phylogeny in R - Bianca Santini Sheffield R Users March 2015

  • 1.
    Comparative analysis including phylogenyin R AE Zanne et al. Nature 000, 1-4 (2013) doi:10.1038/nature12872 Time-calibrated maximum-likelihood estimate of the molecular phylogeny for 31,749 species of seed plants. Bianca A. Santini @myoldowlisdead b.santini@sheffield.ac.uk
  • 2.
    What is aphylogeny? • Hypothesis that explains the evolutionary relationship among taxa* *taxa: species, or higher taxonomic levels. Also genes, or sequences node terminals/tips/leaves root A B C D Internal branch branch External branch ModifiedfromNatureScitable: http://www.nature.com/scitable/topicpage/reading-a-phylogenetic- tree-the-meaning-of-41956#
  • 3.
    Why use phylogenies… …incomparative analysis? • Comparative analyses are used to assess the ecological significance of a particular trait • However, because there is a shared history… a) they are not statistically independent data b) the feature under study might exist because of shared ancestry
  • 4.
    One example ofdata analyzed without a phylogeny • Salisbury’s data (1927, yes 88yrs ago!) • Observations – Differences in stomata density (SD) between sun(>) and shade (<) leaves • Measured stomatal density and related them to life-form, habitat type • Conclusion: SD increases with exposure SD Trees Shrubs Herbs Woody Herbs plants SD Marginal herbs Understory herbs Photo by A. Vazquez-Lobo
  • 5.
    Re-analysis of Salisbury’sdata • Independent contrasts (Felsenstein, 1985) – Introduces a phylogeny – The trait changes along the branches of the tree, should be associated to changes in the explanatory variable – no. of times the traits changes in concert with the environmental variable (agreements) vs. no. of times they do not (disagreements) – Sign test SD Trees Shrubs Herbs Woody Herbs plants SD Marginal herbs Understory herbs Kelly and Beerling, 1995 (68 yrs after)
  • 6.
    How to getyour own phylogeny ? phylomatic (uses a megatree) http://phylodiversity.net/phylomatic/ Use and trim and already published phylogeny: - Dryad: http://datadryad.org/ - Ecological Archives (from the esa) phyloGenerator (uses gen bank sequences) http://willpearse.github.io/phyloGenerator/ This is if you don’t have the sequences, or are not planning to get them.
  • 7.
    Package CAPER :pgls() Similar approach as in Independent Contrasts, but uses a matrix of variances and covariances (tree) N.B. If interested in phylogenies and evolution analyses: geiger, adephylo, picante, phylolm, ape…
  • 8.
    pgls: phylogenetic generalizedleast squares what do you need? 1. Phylogeny 2. Data – Make sure the rows in your data frame are the same as the tips of your tree i.e. your data: Juncus bufonius tree: Juncus_bufonius > my.data$underscore.name=gsub(" ","_",my.data$underscore.name) – Make sure you have one observation per species per trait: 3. Put them into a comparative.data() Species names Leaf area Seed mass Juncus_bufonius 120.2 0.24 Setaria_pumila 91.2 6.91
  • 9.
    #1)PHYLOGENY > tree<-read.tree ("Vascular_Plants_rooted.dated.tre”)##or read.nexus() > tree <- congeneric.merge(tree,my.data$underscore.name) ##pez package Number of species in tree before: 401 Number of species in tree now: 550 > tree Phylogenetic tree with 550 tips and 393 internal nodes. Tip labels: Gladiolus_italicus, Juncus_squarrosus, Juncus_bufonius, Bolboschoenus_maritimus, Isolepis_setacea, Cyperus_fuscus, ... Node labels: , , , , , , … Rooted; includes branch lengths. pgls() #You can always check for synonyms and replace (taxize) > my.data$underscore.name<-recode(my.data$underscore.name, "'Aegilops_geniculata' = 'Aegilops_ovata'") > plot.phylo(tree, cex=0.45, type="radial", edge.color=c("red", "orange", "blue"))
  • 10.
    pgls() ##trim your tree >tree <- drop.tip(tree, setdiff(tree$tip.label, my.data$underscore.name)) #2) YOUR DATA > dat<-data.frame(read.csv(“mydata.csv",header=T)) #3)PUT THEM together in comparative data, which will drop rows with NAs for you and match the rows to the tips of the phylogeny :D > cdat <- comparative.data(data = dat, phy = tree, names.col = ”underscore.name”, scope=leaf.area~seed.mass, vcv=TRUE) #na.omit=FALSE #warn.dropped=TRUE > cdat$dropped #to see what has been dropped.
  • 11.
    #to get thephylogenetic signal : lambda=‘ML’ #0 is a star phylogeny (no phylo signal), and 1 is an structured phylogeny, or all is explained by the phylogeny. > fit= pgls(leaf.area~seed.mass, cdat, lambda='ML') > summary(fit) pgls() > summary(fit) Call: pgls(formula = leaf.area ~ seed.mass, data = dat, lambda = "ML") Residuals: Min 1Q Median 3Q Max -0.176405 -0.046501 0.003632 0.047885 0.227434 Branch length transformations: kappa [Fix] : 1.000 lambda [ ML] : 0.863 lower bound : 0.000, p = < 2.22e-16 upper bound : 1.000, p = < 2.22e-16 95.0% CI : (0.771, 0.919) delta [Fix] : 1.000 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 2.730721 0.318898 8.563 4.441e-16 *** seed.mass 0.442324 0.042318 10.452 < 2.2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.07308 on 373 degrees of freedom Multiple R-squared: 0.2265, Adjusted R-squared: 0.2245 F-statistic: 109.2 on 1 and 373 DF, p-value: < 2.2e-16
  • 12.
    But what ifyou want to analyze factors? #CAPER has some bugs… > cdat <- comparative.data(data = dat, phy = tree, names.col = ”underscore.name”,scope = leaf area ~nitro.class, vcv=TRUE) > fit= pgls(leaf area ~nitro.class, cdat, lambda='ML') > anova(fit) Error in terms.formula(formula,data=data): invalid model formula in ExtractVars
  • 13.
    #solve it like(below) > fit= pgls(leaf.area~nitro.class, cdat, lambda='ML') > fit1= pgls(leaf.area~1, cdat, lambda='ML') > anova(fit, fit1) Error in anova.pglslist(object, ...) : models were fitted with different branch length transformations. ##If you click on summary, you’ll see Call: pgls(formula = leaf.area~ nitro.class, data = dat, lambda = "ML") Residuals: Min 1Q Median 3Q Max -0.200441 -0.049287 -0.002017 0.051002 0.200019 Branch length transformations: kappa [Fix] : 1.000 lambda [ ML] : 0.867 lower bound : 0.000, p = < 2.22e-16 upper bound : 1.000, p = < 2.22e-16 95.0% CI : (0.772, 0.926) delta [Fix] : 1.000 Call: pgls(formula = leaf.area ~ 1, data = dat, lambda = "ML") Residuals: Min 1Q Median 3Q Max -0.274585 -0.061252 0.003683 0.053449 0.253368 Branch length transformations: kappa [Fix] : 1.000 lambda [ ML] : 0.896 lower bound : 0.000, p = < 2.22e-16 upper bound : 1.000, p = < 2.22e-16 95.0% CI : (0.825, 0.940) delta [Fix] : 1.000
  • 14.
    #giving both modelsthe same value > fit= pgls(leaf.area~nitro.class, cdat, lambda=0.885) > fit1= pgls(leaf.area~1, cdat, lambda=0.885) > anova(fit, fit1)
  • 15.
     You canalso use gls(), instead of lambda do method=‘ML’  Visualize your tree, always exciting!  In R > plot(tree) > help(plot.phylo) #install ape #and: http://www.r-phylo.org/wiki/Main_Page  Use FigTree (drop the file and it will do the phylogeny for you)  phytools
  • 16.
  • 17.
  • 18.
    >plot.phylo(tree, cex=0.45, type="fan",edge.color=c("red", "orange", "green","blue"), edge.lty=5)
  • 19.
  • 20.

Editor's Notes

  • #3 This are different trees of representing the same relationships Tips are our data, and Nodes represent common ancestors and speciation events External branches connect a tip an a node Internal branches (or internodes) connect two nodes Internal nodes , they represent putative ancestors for the sample viruses.