Open source GLMM tools: Concordia
1. Precursors GLMMs Results Conclusions References
Open-source tools for estimation and inference
using generalized linear mixed models
Ben Bolker
McMaster University
Departments of Mathematics & Statistics and Biology
3 July 2011
Ben Bolker McMaster University Departments of Mathematics & Statistics and Biology
Open-source GLMMs
Outline
1 Precursors
Definitions
Examples
Challenges
2 GLMMs
Estimation
Inference
3 Results
Coral symbionts
Glycera
Arabidopsis
4 Conclusions
Conclusions
Definitions
Fixed effects (FE): predictors where interest is in the specific levels
Random effects (RE): predictors where interest is in the distribution rather than the levels (e.g. blocks) [5]
Mixed models: statistical models with both FEs and REs
Linear mixed models (LMMs): linear effects, normal responses, normal REs
Generalized linear models (GLMs): linearizable effects, exponential-family responses
Generalized linear mixed models (GLMMs): LMMs + GLMs, i.e. linearizable effects, exponential-family responses, normal REs (on the linearized scale)
GLMMs
Distributions from the exponential family (Poisson, binomial, Gaussian, Gamma, NegBinom(k), ...)
Means = linear functions of predictors on the scale of the link function (identity, log, logit, ...)
GLMMs (cont.)
Linear predictor: $\eta = X\beta + Zu$
Random effects: $u \sim \mathrm{MVN}(0, \Sigma)$
Response: $Y \sim D(g^{-1}(\eta), \phi)$ ($\phi$ often $\equiv 1$)
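The three pieces of the model definition can be read as a recipe for simulating data. The sketch below is only an illustration: the Poisson response with log link, the single blocking factor, and all numeric values (`beta`, `sigma_b`, the sample sizes) are assumptions chosen for the example, not values from the talk.

```r
set.seed(1)
n_block <- 10                      # number of RE levels (assumed)
n_per   <- 5                       # observations per block (assumed)
block <- factor(rep(seq_len(n_block), each = n_per))
x <- rnorm(n_block * n_per)        # one continuous predictor

X <- cbind(1, x)                   # fixed-effect design matrix
Z <- model.matrix(~ 0 + block)     # RE design matrix: one indicator column per block
beta    <- c(0.5, 0.3)             # fixed-effect coefficients (assumed)
sigma_b <- 0.7                     # RE standard deviation (assumed)

u   <- rnorm(n_block, mean = 0, sd = sigma_b)   # u ~ MVN(0, sigma_b^2 I)
eta <- drop(X %*% beta + Z %*% u)               # linear predictor
y   <- rpois(length(eta), lambda = exp(eta))    # response; inverse log link is exp
```

Here `Z %*% u` simply adds each block's random intercept to every observation in that block.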
Marginal likelihood
Likelihood (Prob(data|parameters)) requires integrating over the possible values of the REs to get the marginal likelihood, e.g.:
likelihood of the $i$th observation in block $j$ is $L(x_{ij} \mid \theta_j, \sigma_w^2)$
likelihood of a particular block mean $\theta_j$ is $L(\theta_j \mid 0, \sigma_b^2)$
marginal likelihood is $\int L(x_{ij} \mid \theta_j, \sigma_w^2) \, L(\theta_j \mid 0, \sigma_b^2) \, d\theta_j$
Balance (dispersion of RE around 0) with (dispersion of data conditional on RE)
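For the normal-normal case the integral can be checked numerically. The following is a minimal sketch, assuming made-up data for a single block and made-up values of the two standard deviations; the conditional likelihood is taken as the product over the observations in the block, and the closed form used for comparison is the multivariate normal density with covariance $\sigma_w^2 I + \sigma_b^2 J$.

```r
x <- c(1.2, 0.4, 2.1)   # hypothetical observations from one block
sigma_w <- 1            # within-block SD (assumed)
sigma_b <- 2            # between-block SD (assumed)

# Integrand: likelihood of the block's data given theta, times the
# density of theta under the random-effects distribution
integrand <- function(theta) {
  cond <- vapply(theta,
                 function(th) prod(dnorm(x, mean = th, sd = sigma_w)),
                 numeric(1))
  cond * dnorm(theta, mean = 0, sd = sigma_b)
}
marg_numeric <- integrate(integrand, lower = -Inf, upper = Inf,
                          rel.tol = 1e-8)$value

# Closed form for comparison: x ~ MVN(0, sigma_w^2 I + sigma_b^2 J)
n <- length(x)
Sigma <- diag(sigma_w^2, n) + sigma_b^2
log_marg_exact <- -0.5 * (n * log(2 * pi) + log(det(Sigma)) +
                            drop(t(x) %*% solve(Sigma, x)))

c(numeric = log(marg_numeric), exact = log_marg_exact)
```

With non-normal responses no such closed form exists, which is why the estimation methods discussed later approximate this integral.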
Bayesian solution?
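Bayesians should not feel smug: they are stuck with the normalizing constant
$\mathrm{Posterior}(\beta, \theta, \Sigma) = \dfrac{\mathrm{Prior}(\beta, \theta, \Sigma) \, L(x_{ij} \mid \beta, \theta) \, L(\theta \mid \Sigma)}{\iiint (\ldots) \, d\beta \, d\theta \, d\Sigma}$ (!!)
and similar issues with marginal posteriors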
Examples
Coral protection by symbionts
[Figure: number of blocks (y-axis, 0 to 10) for each symbiont treatment (x-axis: none, shrimp, crabs, both), subdivided by number of predation events (0, 1, 2).]
Challenges
Data challenges: estimation
Small # RE levels (<5–6) [modes at zero]
Crossed REs [unusual setup]
Spatial/temporal correlation structure (“R-side” effects)
Overdispersion
Unusual distributions (Gamma, negative binomial, ...)
Data challenges: computation
Large n (of course)
Multiple REs (dimensionality)
Crossed REs
Inference
Any departures from classical LMMs
Small N (<40)
Small n
Inference on components of Σ (boundary effects, df)
RE examples
Coral symbionts: simple experimental blocks, RE affects
intercept (overall probability of predation in block)
Glycera: applied to cells from 10 individuals, RE again affects
intercept (cell survival prob.)
Arabidopsis: region (3 levels, treated as fixed) / population /
genotype: affects intercept (overall fruit set) as well as
treatment effects (nutrients, herbivory, interaction)
Estimation
Penalized quasi-likelihood (PQL)
alternate steps of estimating the GLM, using known RE variances to calculate weights, and estimating LMMs given the GLM fit [2]
flexible (e.g. spatial/temporal correlations)
biased for small unit samples (e.g. counts < 5, binary or low-survival data)
widely used: SAS PROC GLIMMIX, R MASS:glmmPQL
marginal models: generalized estimating equations (geepack, geese)
Laplace approximation
approximate marginal likelihood
for given β, θ, find the conditional modes by penalized iteratively reweighted least squares (PIRLS); then use a second-order Taylor expansion around the conditional modes
more accurate than PQL
reasonably fast and flexible
lme4:glmer, glmmML, glmmADMB, R2ADMB (AD Model Builder)
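The idea can be sketched in base R for the simplest possible case: a single block of Poisson counts with one random intercept. Everything below is an illustrative assumption (the counts `y`, the intercept `beta`, the RE standard deviation `sigma`), and a one-dimensional search with `optimize` stands in for PIRLS; this is not how any of the packages above implement it.

```r
y <- c(2, 0, 3, 1)   # hypothetical counts from one block
beta  <- 0           # fixed intercept on the log scale (assumed)
sigma <- 1           # RE standard deviation (assumed)
n <- length(y)

# Joint log-density of the data and the random intercept u (vectorized in u)
h <- function(u) {
  sum(y) * (beta + u) - n * exp(beta + u) - sum(lgamma(y + 1)) +
    dnorm(u, mean = 0, sd = sigma, log = TRUE)
}

# Conditional mode of u (stand-in for PIRLS)
u_hat <- optimize(h, interval = c(-10, 10), maximum = TRUE)$maximum
# Analytic second derivative of h at the mode
h2 <- -n * exp(beta + u_hat) - 1 / sigma^2

# Laplace approximation: second-order Taylor expansion of h around u_hat,
# integrated analytically as a Gaussian
log_laplace <- h(u_hat) + 0.5 * log(2 * pi) - 0.5 * log(-h2)

# Brute-force numerical integration for comparison (rescaled for stability)
log_exact <- h(u_hat) +
  log(integrate(function(u) exp(h(u) - h(u_hat)),
                lower = -Inf, upper = Inf, rel.tol = 1e-8)$value)

c(laplace = log_laplace, exact = log_exact)
```

The two values should be close but not identical; the gap is the approximation error that quadrature methods aim to reduce.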
(adaptive) Gauss-Hermite quadrature (AGHQ)
as above, but compute additional terms in the integral
(typically 8, but often up to 20)
most accurate
slowest, hence not flexible (2–3 RE at most, maybe only 1)
lme4:glmer, glmmML, gamlss.mx:gamlssNP, repeated
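A minimal base-R sketch of adaptive Gauss-Hermite quadrature for one random intercept follows. The data and parameter values are invented for illustration, the helper names `gauss_hermite` and `log_aghq` are hypothetical, and the nodes and weights are computed with the Golub-Welsch eigenvalue method rather than taken from any of the packages listed above.

```r
y <- c(2, 0, 3, 1)   # hypothetical counts from one block
beta  <- 0           # fixed intercept on the log scale (assumed)
sigma <- 1           # RE standard deviation (assumed)
n <- length(y)

# Joint log-density of the data and the random intercept u (vectorized in u)
h <- function(u) {
  sum(y) * (beta + u) - n * exp(beta + u) - sum(lgamma(y + 1)) +
    dnorm(u, mean = 0, sd = sigma, log = TRUE)
}

# Gauss-Hermite nodes and weights for the weight function exp(-x^2),
# via the Golub-Welsch eigenvalue method (requires k >= 2)
gauss_hermite <- function(k) {
  stopifnot(k >= 2)
  idx <- seq_len(k - 1)
  J <- matrix(0, k, k)
  J[cbind(idx, idx + 1)] <- sqrt(idx / 2)
  J[cbind(idx + 1, idx)] <- sqrt(idx / 2)
  e <- eigen(J, symmetric = TRUE)
  list(nodes = e$values, weights = sqrt(pi) * e$vectors[1, ]^2)
}

# "Adaptive": centre the nodes at the conditional mode and scale them
# by the curvature there
u_hat <- optimize(h, interval = c(-10, 10), maximum = TRUE)$maximum
s_hat <- 1 / sqrt(n * exp(beta + u_hat) + 1 / sigma^2)

log_aghq <- function(k) {
  gh <- gauss_hermite(k)
  u <- u_hat + sqrt(2) * s_hat * gh$nodes
  h(u_hat) + log(sqrt(2) * s_hat *
                   sum(gh$weights * exp(gh$nodes^2 + h(u) - h(u_hat))))
}

# Brute-force numerical integration for comparison
log_exact <- h(u_hat) +
  log(integrate(function(u) exp(h(u) - h(u_hat)),
                lower = -Inf, upper = Inf, rel.tol = 1e-8)$value)

c(aghq_8 = log_aghq(8), exact = log_exact)
```

The cost grows as the number of nodes raised to the number of random effects per integral, which is one way to see why the method is limited to a few REs.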
Variations
Hierarchical GLMs (hglm, HGLMMM)
Monte Carlo methods: MCEM [1], MCMLE (bernor) [18], sequential MC (pomp), data cloning (dclone)
Bayesian approaches
Monte Carlo approaches: MCMC (Gibbs sampling,
Metropolis-Hastings, etc.)
slow but flexible
makes marginal inference easy
must specify priors, assess convergence
specialized: glmmAK, MCMCglmm [9], INLA
general: BUGS (glmmBUGS, R2WinBUGS, BRugs, WinBUGS,
OpenBUGS, R2jags, rjags, JAGS)
Extensions
Overdispersion: variance > expected from statistical model
  Quasi-likelihood approaches: MASS:glmmPQL
  Extended distributions (e.g. negative binomial): glmmADMB, gamlss.mx:gamlssNP
  Observation-level random effects (e.g. lognormal-Poisson): lme4
Zero-inflation: overabundance of zeros in a discrete distribution
  zero-inflated models: glmmADMB, MCMCglmm
  hurdle models: MCMCglmm
Inference
Wald tests/CIs
Easy (e.g. typical results of summary): assume a quadratic log-likelihood surface, based on the information matrix at the MLE
always approximate, sometimes awful (Hauck-Donner effect)
often bad for variance estimates
available from most direct-maximization packages
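As a concrete reminder of what a Wald interval is, the sketch below builds one by hand for an ordinary logistic GLM. A plain `glm` fit on simulated data is used as a stand-in so that the example runs in base R; the same estimate-plus-or-minus-z-times-SE construction is what summary-style output from mixed-model packages typically reports.

```r
set.seed(42)
x <- rnorm(100)
y <- rbinom(100, size = 1, prob = plogis(-0.5 + x))   # simulated binary data
fit <- glm(y ~ x, family = binomial)

est <- coef(fit)
se  <- sqrt(diag(vcov(fit)))   # from the inverse information matrix at the MLE
z   <- qnorm(0.975)

# Wald 95% intervals: symmetric around the estimate on the link scale
wald_ci <- cbind(lower = est - z * se, upper = est + z * se)
wald_ci
```

For this model the hand-built interval matches `confint.default(fit)`; profile-likelihood intervals would generally differ, and more so as the likelihood surface departs from quadratic.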
Likelihood ratio tests/profile confidence intervals
Model comparison is relatively easy
Profiling is expensive, and not (yet) available ... (lme4a for LMMs)
in the GLM(M) case, the numerator is only asymptotically $\chi^2$ anyway: Bartlett corrections [3;4], higher-order asymptotics: cond [neither extended to GLMMs!]
OK if N − n, N ≫ 40?
Conditional F tests
What if the scale parameter ($\phi$) is estimated (e.g. Gaussian, Gamma, quasi-likelihood)?
In classical LMMs, $-2 \log L \sim F(\nu_1, \nu_2)$
For non-classical LMMs (unbalanced, crossed, R-side) or GLMMs, $\nu_2$ is poorly defined:
Kenward-Roger, Satterthwaite approximations [12;16]
unimplemented except in SAS (partially in Genstat)
Tests/CIs of variances [boundary problems]
LRT depends on the null hypothesis being within the parameter’s feasible range [6;13]
violated e.g. by $H_0: \sigma^2 = 0$
In simple cases the null distribution is a mixture of $\chi^2$ distributions (e.g. $0.5\chi^2_0 + 0.5\chi^2_1$: emdbook:dchibarsq)
simulation-based testing: RLRsim
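For the simple 50:50 mixture, the corrected p value is easy to compute by hand, since $\chi^2_0$ is a point mass at zero. The sketch below uses only base R; the function name `p_chibarsq` and the example statistic are made up for illustration.

```r
# Upper-tail p value under the 0.5*chi^2_0 + 0.5*chi^2_1 mixture.
# chi^2_0 is a point mass at 0, so it contributes nothing for lrt > 0.
p_chibarsq <- function(lrt) {
  ifelse(lrt > 0, 0.5 * pchisq(lrt, df = 1, lower.tail = FALSE), 1)
}

lrt <- 2.5   # hypothetical LRT statistic for H0: sigma^2 = 0
c(naive   = pchisq(lrt, df = 1, lower.tail = FALSE),
  mixture = p_chibarsq(lrt))
```

In this case the naive $\chi^2_1$ p value is exactly twice the mixture p value, so ignoring the boundary makes the test conservative.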
Information-theoretic approaches
Above issues apply, but are less well understood [7;8]
AIC is asymptotic
“corrected” AIC (AICc) [10] derived for linear models, widely used but not tested elsewhere [14]
For comparing models with different REs, or for AICc, what is p? Conditional AIC (cAIC) [8;19]; level-of-focus issue: see also the Deviance Information Criterion (DIC) [17]
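For reference, the standard definitions make the "what is p?" question concrete. Here $L$ is the maximized likelihood, $p$ the number of estimated parameters and $n$ the number of observations; in a mixed model both $p$ (how to count REs) and $n$ (observations or groups) are ambiguous.

```latex
\mathrm{AIC} = -2 \log L + 2p,
\qquad
\mathrm{AIC}_c = \mathrm{AIC} + \frac{2p(p+1)}{n - p - 1}
```

The correction term vanishes as $n \to \infty$ with $p$ fixed, so AICc matters mainly for small samples.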
Bootstrapping
1 fit null model to data
2 simulate “data” from null model
3 fit null and working model, compute likelihood difference
4 repeat to estimate null distribution
confidence intervals?
simulate/refit methods; bootMer in lme4a (LMMs only!)
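# One parametric-bootstrap replicate: simulate a response from the null
# model m0, refit both models to it, and return the LRT statistic
pboot <- function(m0, m1) {
  s <- simulate(m0)
  as.numeric(2 * (logLik(refit(m1, s)) - logLik(refit(m0, s))))
}
# Null distribution of the statistic; here fm2 plays the role of the null
# model and fm1 the working model (fits not shown)
replicate(1000, pboot(fm2, fm1))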
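The four steps can be run end to end in base R if ordinary Poisson GLMs stand in for the mixed models. This is only a sketch under that substitution: the data are simulated, `update` plays the role that `refit` plays for lme4 fits, and the final line turns the simulated null distribution into a p value, a step the snippet above leaves implicit.

```r
set.seed(101)
d <- data.frame(x = rnorm(40))
d$y <- rpois(40, lambda = exp(0.2))            # simulated data; x has no true effect

m0 <- glm(y ~ 1, family = poisson, data = d)   # null model
m1 <- glm(y ~ x, family = poisson, data = d)   # working model
obs <- as.numeric(2 * (logLik(m1) - logLik(m0)))

# One replicate: simulate from the null, refit both models, return the statistic
pboot <- function(m0, m1, data) {
  data$y <- simulate(m0)[[1]]
  as.numeric(2 * (logLik(update(m1, data = data)) -
                    logLik(update(m0, data = data))))
}

null_dist <- replicate(200, pboot(m0, m1, d))

# Bootstrap p value, counting the observed statistic as one of the draws
p_boot <- (1 + sum(null_dist >= obs)) / (1 + length(null_dist))
p_boot
```

Adding one to numerator and denominator keeps the estimated p value away from exactly zero when the number of replicates is small.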
Bayesian inference
Marginal highest posterior density intervals (or quantiles)
Computationally “free” with results of stochastic Bayesian
computation
Easily extended to prediction intervals etc. etc.
Post hoc Markov chain Monte Carlo sampling available for
some packages (glmmADMB, R2ADMB, eventually lme4a)
Bottom line
Large data: computation slow (maximization methods
fastest), inference easy (asymptotics)
Bayesian computation slow, inference easy (posterior samples)
Small data: computation fast
RE variances may be poorly estimated/set to zero (upcoming:
penalty/prior term in blmer within arm)
inference tricky, may need bootstrapping
Glycera
Glycera: MCMCglmm fit
[Figure: estimated effect on survival (x-axis, −20 to 30) for each model term, grouped as main effects (Osm, Cu, H2S, Oxygen), 2-way interactions, and 3-way and higher interactions (up to Osm : Cu : H2S : Oxygen).]
Parametric bootstrap results
[Figure: inferred p value (y-axis) against true p value (x-axis), both on log scales from 0.001 to 0.5, in four panels (Osm, Cu, H2S, Anoxia), with one curve per reference distribution (normal, t7, t14).]
Arabidopsis
Arabidopsis: AIC comparison of RE models
[Figure: ∆AIC (x-axis, 0 to 6) for each random-effects model: nointeract; int(popu); and each combination of int, nut or clip at the genotype level with int, nut or clip at the population level, e.g. int(gen) X int(popu), nut(gen) X clip(popu).]
Arabidopsis: fits with and without nutrient(genotype)
[Figure: regression estimates (x-axis, −1.0 to 1.5) from the two fits for the terms nutrient8:amdclipped, statusTransplant, statusPetri.Plate, rack2, amdclipped, and nutrient8.]
Conclusions
Primary tools
lme4: multiple/crossed REs, (profiling): fast
MCMCglmm: Bayesian, very flexible
glmmADMB: negative binomial, zero-inflated etc.
Flexible tools:
AD Model Builder (and interfaces)
BUGS/JAGS (and interfaces)
INLA [15]
Outlook
Computation: faster algorithms, parallel computation
Inference: mostly computational?
Implementation: extensions (e.g. L1-penalized approaches [11]), consistency (profile, simulate, predict)
Benefits & costs of staying within the GLMM framework
Benefits & costs of diversity
More info: http://glmm.wikidot.com
Acknowledgements
Data: Josh Banta and Massimo Pigliucci (Arabidopsis);
Adrian Stier and Sea McKeon (coral symbionts); Courtney
Kagan, Jocelynn Ortega, David Julian (Glycera);
Co-authors: Mollie Brooks, Connie Clark, Shane Geange, John
Poulsen, Hank Stevens, Jada White
References
[1] Booth JG & Hobert JP, 1999. Journal of the Royal Statistical Society. Series B, 61(1):265–285. doi:10.1111/1467-9868.00176. URL http://links.jstor.org/sici?sici=1369-7412(1999)61%3A1%3C265%3AMGLMML%3E2.0.CO%3B2-C.
[2] Breslow NE, 2004. In DY Lin & PJ Heagerty, eds., Proceedings of the second Seattle symposium in biostatistics: Analysis of correlated data, pp. 1–22. Springer. ISBN 0387208623.
[3] Cordeiro GM & Ferrari SLP, Aug. 1998. Journal of Statistical Planning and Inference, 71(1-2):261–269. ISSN 0378-3758. doi:10.1016/S0378-3758(98)00005-6. URL http://www.sciencedirect.com/science/article/B6V0M-3V5CVRT-M/2/190f68a684dd08c569a7836ff59568e4.
[4] Cordeiro GM, Paula GA, & Botter DA, 1994. International Statistical Review / Revue Internationale de Statistique, 62(2):257–274. ISSN 03067734. doi:10.2307/1403512. URL http://www.jstor.org/stable/1403512.
[5] Gelman A, 2005. Annals of Statistics, 33(1):1–53. doi:10.1214/009053604000001048.
[6] Goldman N & Whelan S, 2000. Molecular Biology and Evolution, 17(6):975–978.
[7] Greven S, 2008. Non-Standard Problems in Inference for Additive and Linear Mixed Models. Cuvillier Verlag, Göttingen, Germany. ISBN 3867274916. URL http://www.cuvillier.de/flycms/en/html/30/-UickI3zKPS,3cEY=/Buchdetails.html?SID=wVZnpL8f0fbc.
[8] Greven S & Kneib T, 2010. Biometrika, 97(4):773–789. URL http://www.bepress.com/jhubiostat/paper202/.
[9] Hadfield JD, Feb. 2010. Journal of Statistical Software, 33(2):1–22. ISSN 1548-7660. URL http://www.jstatsoft.org/v33/i02.
[10] Hurvich CM & Tsai C, Jun. 1989. Biometrika, 76(2):297–307. doi:10.1093/biomet/76.2.297. URL http://biomet.oxfordjournals.org/content/76/2/297.abstract.
[11] Jiang J, Aug. 2008. The Annals of Statistics, 36(4):1669–1692. ISSN 0090-5364. doi:10.1214/07-AOS517. URL http://projecteuclid.org/euclid.aos/1216237296.
[12] Kenward MG & Roger JH, 1997. Biometrics, 53(3):983–997.
[13] Molenberghs G & Verbeke G, 2007. The American Statistician, 61(1):22–27. doi:10.1198/000313007X171322.
[14] Richards SA, 2005. Ecology, 86(10):2805–2814. doi:10.1890/05-0074.
[15] Rue H, Martino S, & Chopin N, 2009. Journal of the Royal Statistical Society, Series B, 71(2):319–392.
[16] Schaalje G, McBride J, & Fellingham G, 2002. Journal of Agricultural, Biological & Environmental Statistics, 7(14):512–524. URL http://www.ingentaconnect.com/content/asa/jabes/2002/00000007/00000004/art00004.
[17] Spiegelhalter DJ, Best N et al., 2002. Journal of the Royal Statistical Society B, 64:583–640.
[18] Sung YJ, Jul. 2007. The Annals of Statistics, 35(3):990–1011. ISSN 0090-5364. doi:10.1214/009053606000001389. URL http://projecteuclid.org/euclid.aos/1185303995. Mathematical Reviews number (MathSciNet): MR2341695; Zentralblatt MATH identifier: 1124.62009.
[19] Vaida F & Blanchard S, Jun. 2005. Biometrika, 92(2):351–370. doi:10.1093/biomet/92.2.351. URL http://biomet.oxfordjournals.org/cgi/content/abstract/92/2/351.
Extras
Spatial and temporal correlation (R-side effects): MASS:glmmPQL (sort of), GLMMarp, INLA; WinBUGS, AD Model Builder
Additive models: amer, gamm4, mgcv
Penalized methods [11]