1. Precursors GLMMs Results Conclusions References
Open-source tools for estimation and inference
using generalized linear mixed models
Ben Bolker
McMaster University
Departments of Mathematics & Statistics and Biology
7 April 2011
Ben Bolker McMaster University Departments of Mathematics & Statistics and Biology
Open-source GLMMs
Outline
1 Precursors
Examples
Definitions
2 GLMMs
Estimation
Inference: tests
Inference: confidence intervals
3 Results
Glycera
Arabidopsis
4 Conclusions
Examples
Coral protection by symbionts
[Figure: coral predation data. Labels: number of predation events; number of blocks; symbionts (none, shrimp, crabs, both).]
Glossary: data
Fixed effects: predictors where interest is in specific levels
Random effects (RE): predictors where interest is in the distribution
rather than in specific levels (e.g. blocks) (Gelman, 2005)
Crossed RE: multiple REs where levels of one occur in more than
one level of another (e.g. block × year; cf. nested)
See http://lme4.r-forge.r-project.org/book/ and
Pinheiro and Bates (2000)
Data challenges
Estimation                      Computation     Inference
Small # RE levels (<5–6)        Large n         Small N (< 40)
Overdispersion                  Multiple REs    Small n
Crossed REs                     Crossed REs
Spatial/temporal correlation
Unusual distributions
(Gamma, neg. binom. ...)
Definitions
Generalized linear models
Distributions from exponential family
(Poisson, binomial, Gaussian, Gamma,
neg. binomial (known k) . . . )
Means = linear functions of predictors
on scale of link function (identity, log, logit, . . . )
$Y \sim D(g^{-1}(X\beta), \phi)$
φ often set to 1 (Poisson, binomial) except for
quasilikelihood approaches
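A minimal sketch in base R of the definition above, using simulated data. The coefficients in `beta`, the sample size, and the Poisson/log-link choice are illustrative assumptions, not values from the talk's examples.

```r
# Simulate Poisson data whose mean is linear on the log (link) scale,
# then recover the coefficients with glm(). All values are illustrative.
set.seed(1)
n    <- 200
x    <- runif(n)
beta <- c(0.5, 1.2)                  # assumed "true" intercept and slope
eta  <- beta[1] + beta[2] * x        # linear predictor X beta
y    <- rpois(n, lambda = exp(eta))  # inverse link g^{-1} = exp gives the mean
fit  <- glm(y ~ x, family = poisson(link = "log"))
coef(fit)                            # estimates should be close to beta
```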
Generalized linear mixed models
Add random effects:
$Y \sim D(g^{-1}(X\beta + Zu), \phi)$
$u \sim \mathrm{MVN}(0, \Sigma)$
Synonyms: multilevel, hierarchical models
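As a worked instance of the definition above, a Poisson GLMM with a log link and one random intercept per block could be written as follows. The block index $j$, the covariate $x_{ij}$, and the variance $\sigma_b^2$ are notation assumed here for illustration.

```latex
\begin{aligned}
\eta_{ij} &= \beta_0 + \beta_1 x_{ij} + u_j, \\
u_j &\sim \mathrm{N}(0, \sigma_b^2), \\
Y_{ij} \mid u_j &\sim \mathrm{Poisson}\bigl(\exp(\eta_{ij})\bigr)
\end{aligned}
```

In this special case $Z$ is the indicator matrix assigning each observation to its block, and $\Sigma = \sigma_b^2 I$.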
Marginal likelihood
Likelihood (Prob(data|parameters)) requires integrating over
possible values of the REs to get the marginal likelihood, e.g.:
likelihood of the $i$th obs. in block $j$ is $L(x_{ij} \mid \theta_j, \sigma_w^2)$
likelihood of a particular block mean $\theta_j$ is $L(\theta_j \mid 0, \sigma_b^2)$
marginal likelihood is
$\int L(x_{ij} \mid \theta_j, \sigma_w^2) \, L(\theta_j \mid 0, \sigma_b^2) \, d\theta_j$
Balances (dispersion of RE around 0) against (dispersion of data
conditional on RE)
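The integral above can be checked numerically in the one case where a closed form is available. This is a sketch for a single Gaussian observation in a single block; the values of `sigma_w`, `sigma_b`, and `x` are assumptions chosen for illustration.

```r
# Marginal likelihood of one observation x in one block, Gaussian case.
sigma_w <- 1.0   # assumed within-block SD
sigma_b <- 2.0   # assumed among-block SD
x       <- 1.5   # assumed observed value

# Integrand: L(x | theta, sigma_w^2) * L(theta | 0, sigma_b^2)
integrand <- function(theta) {
  dnorm(x, mean = theta, sd = sigma_w) * dnorm(theta, mean = 0, sd = sigma_b)
}
marg_numeric <- integrate(integrand, lower = -Inf, upper = Inf,
                          rel.tol = 1e-8)$value

# Closed form for this special case: x ~ N(0, sigma_w^2 + sigma_b^2)
marg_exact <- dnorm(x, mean = 0, sd = sqrt(sigma_w^2 + sigma_b^2))
c(numeric = marg_numeric, exact = marg_exact)
```

For non-Gaussian responses there is generally no closed form, so the integral has to be approximated.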
RE examples
Coral symbionts: simple experimental blocks, RE affects
intercept (overall probability of predation in block)
Glycera: applied to cells from 10 individuals, RE again affects
intercept (cell survival prob.)
Arabidopsis: region (3 levels, treated as fixed) / population /
genotype: affects intercept (overall fruit set) as well as
treatment effects (nutrients, herbivory, interaction)
Estimation
Penalized quasi-likelihood (PQL)
alternates between two steps: estimate a GLM using known RE variances
to calculate weights; then estimate an LMM given the GLM fit (Breslow,
2004)
flexible (allows spatial/temporal correlations, crossed REs)
biased for small unit samples (e.g. counts < 5, binary or
low-survival data)
widely used (SAS PROC GLIMMIX, R MASS:glmmPQL), including in ≈
90% of small-unit-sample cases
Laplace approximation
approximate marginal likelihood
for given β, θ (RE parameters), find the conditional modes by
penalized iteratively reweighted least squares; then use a
second-order Taylor expansion around the conditional modes
more accurate than PQL
reasonably fast and flexible
lme4:glmer, glmmML, glmmADMB, R2ADMB (AD Model Builder)
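The idea can be sketched in base R for a single block of Poisson counts with one random intercept. This is not how the packages listed above implement it: the counts `y`, the intercept `b0`, and the RE standard deviation `sigma` are assumed values, and the conditional mode is found with a one-dimensional optimizer rather than penalized IRLS.

```r
# Laplace approximation for one block of Poisson counts with random intercept u.
y     <- c(2, 4, 3, 5)  # assumed counts in one block
b0    <- 1.0            # assumed fixed intercept (log scale)
sigma <- 0.7            # assumed RE standard deviation

# Log of the integrand: conditional log-likelihood plus log density of u
h <- function(u) {
  sum(dpois(y, lambda = exp(b0 + u), log = TRUE)) + dnorm(u, 0, sigma, log = TRUE)
}

# Conditional mode of u
u_hat <- optimize(h, interval = c(-5, 5), maximum = TRUE)$maximum

# Analytic second derivative of h at the mode (negative at a maximum)
h2 <- -length(y) * exp(b0 + u_hat) - 1 / sigma^2

# A second-order Taylor expansion of h turns the integral into a Gaussian one
laplace <- exp(h(u_hat)) * sqrt(2 * pi / -h2)

# Brute-force numerical integral for comparison
brute <- integrate(function(u) sapply(u, function(ui) exp(h(ui))),
                   lower = -Inf, upper = Inf, rel.tol = 1e-8)$value
c(laplace = laplace, brute = brute)
```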
Adaptive Gauss-Hermite quadrature (AGQ)
as above, but compute additional terms in the integral
(typically 8, but often up to 20)
most accurate
slowest, hence not flexible (2–3 RE at most, maybe only 1)
lme4:glmer, glmmML, repeated
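A base-R sketch of the same idea for a single random intercept, again not the packages' own code. The Gauss-Hermite nodes come from the Golub-Welsch eigenvalue construction; the data `y` and parameters `b0`, `sigma` are assumed values, and `gauss_hermite()` and `agq()` are helper names introduced here for illustration.

```r
# Gauss-Hermite nodes and weights (weight function exp(-z^2)) via Golub-Welsch.
gauss_hermite <- function(n) {
  i <- seq_len(n - 1)
  J <- matrix(0, n, n)
  J[cbind(i, i + 1)] <- sqrt(i / 2)   # symmetric tridiagonal Jacobi matrix
  J[cbind(i + 1, i)] <- sqrt(i / 2)
  e <- eigen(J, symmetric = TRUE)
  list(nodes = e$values, weights = sqrt(pi) * e$vectors[1, ]^2)
}

y <- c(2, 4, 3, 5); b0 <- 1.0; sigma <- 0.7   # assumed data and parameters
h <- function(u) {
  sum(dpois(y, lambda = exp(b0 + u), log = TRUE)) + dnorm(u, 0, sigma, log = TRUE)
}

u_hat <- optimize(h, interval = c(-5, 5), maximum = TRUE)$maximum  # conditional mode
s     <- 1 / sqrt(length(y) * exp(b0 + u_hat) + 1 / sigma^2)       # curvature-based scale

# "Adaptive": nodes are centred at the mode and scaled by the curvature
agq <- function(n) {
  gh <- gauss_hermite(n)
  u  <- u_hat + sqrt(2) * s * gh$nodes
  sqrt(2) * s * sum(gh$weights * exp(gh$nodes^2) * exp(sapply(u, h)))
}

brute <- integrate(function(u) sapply(u, function(ui) exp(h(ui))),
                   lower = -Inf, upper = Inf, rel.tol = 1e-8)$value
c(agq8 = agq(8), brute = brute)
```

The cost grows quickly with the number of random effects, because the number of nodes multiplies across dimensions.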
Bayesian approaches
Bayesians have to do nasty integrals anyway (to normalize the
posterior probability density)
various flavours of stochastic Bayesian computation (Gibbs
sampling, MCMC, etc.)
generally slower but more flexible
solves many problems of assessing confidence intervals
must specify priors, assess convergence
specialized: glmmAK, MCMCglmm (Hadfield, 2010), INLA
general: glmmBUGS, R2WinBUGS, BRugs
(WinBUGS/OpenBUGS), R2jags, rjags (JAGS)
Overdispersion (slight tangent)
Variance greater than expected from statistical model
Quasi-likelihood approaches: MASS:glmmPQL
Extended distributions (e.g. negative binomial): glmmADMB
Observation-level random effects (e.g. lognormal-Poisson):
lme4
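One common diagnostic, sketched in base R under assumed simulated data, is the ratio of the Pearson chi-square statistic to the residual degrees of freedom. The negative binomial generator and its `size` value are assumptions made to produce overdispersed counts; a ratio well above 1 suggests overdispersion relative to the Poisson model.

```r
# Fit a Poisson GLM to overdispersed (negative binomial) counts and
# estimate the dispersion as Pearson chi-square / residual df.
set.seed(2)
n  <- 300
x  <- runif(n)
mu <- exp(1 + x)                      # assumed mean structure
y  <- rnbinom(n, mu = mu, size = 2)   # variance = mu + mu^2 / size > mu

fit     <- glm(y ~ x, family = poisson)
phi_hat <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
phi_hat                               # should be clearly larger than 1 here
```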
Inference: tests
Wald tests [non-quadratic likelihood surfaces]
For OLS/linear models, likelihood surface is quadratic; only
asymptotically true for GLM(M)s
Wald tests (e.g. typical results of summary) assume a quadratic
surface, based on curvature (information matrix)
always approximate, sometimes awful (Hauck-Donner effect)
do model comparison (F, score or likelihood ratio tests [LRT])
instead
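A small base-R sketch comparing the two kinds of p-value for one coefficient in an ordinary logistic GLM (no random effects). The simulated data and coefficient values are assumptions; with a small sample the Wald and likelihood ratio p-values can differ noticeably.

```r
# Wald z test vs. likelihood ratio test for the slope in a logistic GLM.
set.seed(3)
n <- 30
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(0.5 + x))   # assumed true model

fit1 <- glm(y ~ x, family = binomial)   # working model
fit0 <- glm(y ~ 1, family = binomial)   # null model without x

# Wald test: estimate / SE, compared with a standard normal
wald_p <- coef(summary(fit1))["x", "Pr(>|z|)"]

# LRT: twice the log-likelihood difference, compared with chi-square(1)
lrt_stat <- as.numeric(2 * (logLik(fit1) - logLik(fit0)))
lrt_p    <- pchisq(lrt_stat, df = 1, lower.tail = FALSE)

c(wald = wald_p, lrt = lrt_p)
```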
But . . .
Conditional F tests [Uncertainty in scale parameters]
Model comparison: in general
$-2 \log L = D = \sum_i \mathrm{deviance}_i / \phi$
Classical linear models:
deviance and $\hat{\phi}$ are both $\chi^2$
distributed, so $D \sim F(\nu_1, \nu_2)$
Denominator degrees of freedom (df) ($\nu_2$) for complex
(unbalanced, crossed, R-side effects) models?
Approximations: Satterthwaite, Kenward-Roger (Kenward
and Roger, 1997; Schaalje et al., 2002)
Is $D$ really $\sim F$ in these situations?
Scale parameters usually not estimated in GLMMs (Gamma,
quasi-likelihood cases only).
But . . .
Likelihood ratio tests [non-normality of likelihood]
What about cases where φ is specified (e.g. ≡ 1)?
in GLM(M) case, numerator is only asymptotically $\chi^2$ anyway
Bartlett corrections (Cordeiro et al., 1994; Cordeiro and
Ferrari, 1998), higher-order asymptotics: cond [neither
extended to GLMMs!]
Tests of random effects [boundary problems]
LRT depends on null hypothesis being within the parameter’s
feasible range (Goldman and Whelan, 2000; Molenberghs and
Verbeke, 2007)
violated e.g. by $H_0: \sigma^2 = 0$
In simple cases null distribution is a mixture of $\chi^2$
(e.g. $0.5\chi^2_0 + 0.5\chi^2_1$; emdbook:dchibarsq)
ignoring this leads to conservative tests (e.g. true p-value =
$\tfrac{1}{2}$ · nominal p-value)
simulation-based testing: RLRsim
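The halving of the p-value in the simple one-variance case can be sketched in base R. The statistic `lrt_stat` is a hypothetical value, not a result from the talk.

```r
# LRT for a single variance component, H0: sigma^2 = 0 (on the boundary).
lrt_stat <- 2.5   # hypothetical observed likelihood ratio statistic

# Naive p-value: treats the null distribution as chi-square with 1 df
p_naive <- pchisq(lrt_stat, df = 1, lower.tail = FALSE)

# Mixture p-value: 0.5 * chi^2_0 (point mass at zero) + 0.5 * chi^2_1
p_mix <- 0.5 * as.numeric(lrt_stat <= 0) +
         0.5 * pchisq(lrt_stat, df = 1, lower.tail = FALSE)

c(naive = p_naive, mixture = p_mix)   # mixture is half of naive when lrt_stat > 0
```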
Information-theoretic approaches
Above issues apply, but less well understood (Greven, 2008;
Greven and Kneib, 2010)
AIC is asymptotic
“corrected” AIC (AICc) (Hurvich and Tsai, 1989) derived
for linear models, widely used but not tested elsewhere
(Richards, 2005)
For comparing models with different REs,
or for AICc, what is p (the number of parameters)?
AICcmodavg, MuMIn
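For reference, the usual AICc formula as a small base-R function. `aicc()` is a helper name introduced here and is not the interface of the packages listed above; for a mixed model, the value passed as `p` is exactly the open question raised on this slide.

```r
# AICc = AIC + 2 p (p + 1) / (n - p - 1), with AIC = -2 logLik + 2 p.
# loglik: maximized log-likelihood; p: number of parameters; n: sample size.
aicc <- function(loglik, p, n) {
  stopifnot(n - p - 1 > 0)            # the correction is undefined otherwise
  aic <- -2 * loglik + 2 * p
  aic + 2 * p * (p + 1) / (n - p - 1)
}

# Hypothetical values: AIC = 206, correction = 24 / 16 = 1.5
aicc(loglik = -100, p = 3, n = 20)
```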
Parametric bootstrapping
fit null model to data
simulate “data” from null model
fit null and working model, compute likelihood difference
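repeat to estimate null distribution
library(lme4)  # provides simulate() and refit() for fitted mixed models
# one bootstrap replicate: m0 = fitted null model, m1 = fitted working model
pboot <- function(m0, m1) {
  s <- simulate(m0)[[1]]       # response vector simulated from the null model
  L0 <- logLik(refit(m0, s))   # refit both models to the simulated response
  L1 <- logLik(refit(m1, s))
  as.numeric(2 * (L1 - L0))    # likelihood ratio statistic
}
# fm2 (null) and fm1 (working) are models fitted beforehand
replicate(1000, pboot(fm2, fm1))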
Finite-sample problems
How far are we from “asymptopia”?
How much data
(number of samples, number of RE levels)?
How many parameters
(number of fixed-effect parameters, number of RE levels,
number of RE parameters)?
Hope (#data) − (#parameters) ≫ 1, but if not?
Levels of focus
How many parameters does an RE take?
Somewhere between q and r (e.g., between 1 and the number of levels
for a variance), because of shrinkage
Conditional vs. marginal AIC
Similar issues with Deviance Information Criterion
(Spiegelhalter et al., 2002)
Inference: confidence intervals
Wald tests
a sometimes-crude approximation
computationally easy, especially for many-parameter models
use Wald Z (assume “residual df” large)? Or t, guessing at
the residual df?
Available from most packages
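A base-R sketch of a Wald Z interval, shown for an ordinary Poisson GLM rather than a GLMM so that it needs no extra packages. The simulated data are an assumption; the same "estimate plus or minus z times SE" construction applies to any fit that reports estimates and standard errors.

```r
# Wald Z confidence intervals: estimate +/- z * SE, from the information matrix.
set.seed(4)
n <- 100
x <- runif(n)
y <- rpois(n, lambda = exp(0.3 + 0.8 * x))   # assumed true model

fit <- glm(y ~ x, family = poisson)
est <- coef(fit)
se  <- sqrt(diag(vcov(fit)))                 # SEs from the curvature of the log-likelihood
z   <- qnorm(0.975)                          # "residual df" treated as large

wald_ci <- cbind(lower = est - z * se, upper = est + z * se)
wald_ci
```

In base R, `confint.default(fit)` computes the same intervals.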
Profile confidence intervals
Tedious to program
Computationally challenging
Inherits finite-size sample problems from LRT
lme4a (in development/soon!)
Bayesian posterior intervals
Marginal quantile or highest posterior density intervals
Computationally “free” with results of stochastic Bayesian
computation
Easily extended to confidence intervals on predictions, etc.
Post hoc Markov chain Monte Carlo sampling available for
some packages (glmmADMB, R2ADMB, eventually lme4a)
Summary
Large data
computation can be limiting
asymptotics better
Small data
RE variances may be poorly estimated or set to zero
(informative priors can help)
inference tricky
Glycera
[Figure: estimated effect on survival (x-axis, −20 to 30) for each model term: main effects (Oxygen, Osm, Cu, H2S), all 2-way and 3-way interactions among them, and the 4-way interaction Osm : Cu : H2S : Oxygen.]
Parametric bootstrap results
[Figure: inferred p value (y-axis) against true p value (x-axis), both from 0.02 to 0.08, in four panels: H2S, Anoxia, Osm, Cu.]
Arabidopsis
Arabidopsis: AIC comparison of REs
[Figure: ∆AIC (x-axis, 0 to 6) for alternative RE structures: nointeract; int(popu); and each combination of int(gen), nut(gen), clip(gen) crossed (X) with int(popu), nut(popu), clip(popu).]
Arabidopsis: fits with and without nutrient(genotype)
[Figure: regression estimates (x-axis, −1.0 to 1.5), two estimates per term, for nutrient8:amdclipped, statusTransplant, statusPetri.Plate, rack2, amdclipped, nutrient8.]
Primary tools
lme4: multiple/crossed REs, (profiling): fast
MCMCglmm: Bayesian, very flexible
glmmADMB: negative binomial, zero-inflated etc.
Most flexible: R2ADMB/AD Model Builder,
R2WinBUGS/WinBUGS/R2jags/JAGS, INLA
Loose ends
Overdispersion and zero-inflation: MCMCglmm, glmmADMB
Spatial and temporal correlation (R-side effects):
MASS:glmmPQL (sort of), GLMMarp, INLA;
WinBUGS, AD Model Builder
Additive models: amer, gamm4, mgcv
Penalized methods (Jiang, 2008) (?)
Hierarchical GLMs: hglm, HGLMMM
Marginal models: geepack, gee
To be done
Many holes in knowledge (but what can be done?)
Faster algorithms, more parallel computation
Lots of implementation and clean-up
Benefits & costs of staying within the GLMM framework
Benefits & costs of diversity
More info: glmm.wikidot.com
Acknowledgements
Data: Josh Banta and Massimo Pigliucci (Arabidopsis);
Adrian Stier and Sea McKeon (coral symbionts); Courtney
Kagan, Jocelynn Ortega, David Julian (Glycera);
Co-authors: Mollie Brooks, Connie Clark, Shane Geange, John
Poulsen, Hank Stevens, Jada White
References
Breslow, N.E., 2004. In D.Y. Lin and P.J. Heagerty, editors, Proceedings of the second Seattle symposium in
biostatistics: Analysis of correlated data, pages 1–22. Springer. ISBN 0387208623.
Cordeiro, G.M. and Ferrari, S.L.P., 1998. Journal of Statistical Planning and Inference, 71(1-2):261–269. ISSN
0378-3758. doi:10.1016/S0378-3758(98)00005-6.
Cordeiro, G.M., Paula, G.A., and Botter, D.A., 1994. International Statistical Review / Revue Internationale de
Statistique, 62(2):257–274. ISSN 03067734. doi:10.2307/1403512.
Gelman, A., 2005. Annals of Statistics, 33(1):1–53. doi:10.1214/009053604000001048.
Goldman, N. and Whelan, S., 2000. Molecular Biology and Evolution, 17(6):975–978.
Greven, S., 2008. Non-Standard Problems in Inference for Additive and Linear Mixed Models. Cuvillier Verlag,
Göttingen, Germany. ISBN 3867274916.
Greven, S. and Kneib, T., 2010. Biometrika, 97(4):773–789.
Hadfield, J.D., 2010. Journal of Statistical Software, 33(2):1–22. ISSN 1548-7660.
Hurvich, C.M. and Tsai, C., 1989. Biometrika, 76(2):297–307. doi:10.1093/biomet/76.2.297.
Jiang, J., 2008. The Annals of Statistics, 36(4):1669–1692. ISSN 0090-5364. doi:10.1214/07-AOS517.
Kenward, M.G. and Roger, J.H., 1997. Biometrics, 53(3):983–997.
Molenberghs, G. and Verbeke, G., 2007. The American Statistician, 61(1):22–27.
doi:10.1198/000313007X171322.
Pinheiro, J.C. and Bates, D.M., 2000. Mixed-effects models in S and S-PLUS. Springer, New York. ISBN
0-387-98957-9.
Richards, S.A., 2005. Ecology, 86(10):2805–2814. doi:10.1890/05-0074.
Schaalje, G., McBride, J., and Fellingham, G., 2002. Journal of Agricultural, Biological & Environmental Statistics,
7(14):512–524.
Spiegelhalter, D.J., Best, N., et al., 2002. Journal of the Royal Statistical Society B, 64:583–640.