open-source GLMM tools


Open-source tools for estimation and inference
using generalized linear mixed models

Ben Bolker

McMaster University
Departments of Mathematics & Statistics and Biology

7 April 2011

Ben Bolker McMaster University Departments of Mathematics & Statistics and Biology
Open-source GLMMs


Outline
1 Precursors
  Examples
  Definitions
2 GLMMs
  Estimation
  Inference: tests
  Inference: confidence intervals
3 Results
  Glycera
  Arabidopsis
4 Conclusions

Examples


Coral protection by symbionts

[Figure: number of blocks (y-axis, 0 to 10) by symbiont treatment (x-axis: none, shrimp, crabs, both), broken down by number of predation events (0, 1, 2).]


Environmental stress: Glycera cell survival

[Figure: cell survival (scale 0.0 to 1.0) as a function of H2S (0, 0.03, 0.1, 0.32) and copper (0, 33.3, 66.6, 133.3); panels: anoxia and normoxia at Osm = 12.8, 22.4, 32, 41.6, 51.2.]


Arabidopsis response to fertilization & clipping
panel: nutrient, color: genotype

[Figure: log(1 + fruit set) (y-axis, 0 to 5) for unclipped vs. clipped plants; panels: nutrient 1 and nutrient 8.]


Glossary: data

Fixed effects: predictors where interest is in specific levels
Random effects (RE): predictors where interest is in the distribution rather than in the levels (e.g. blocks) (Gelman, 2005)
Crossed RE: multiple REs where levels of one occur in more than one level of another (e.g. block × year; cf. nested)
See http://lme4.r-forge.r-project.org/book/ and Pinheiro and Bates (2000)


Data challenges

Estimation: small # of RE levels (< 5–6); overdispersion; crossed REs; spatial/temporal correlation; unusual distributions (Gamma, neg. binom., ...)
Computation: large n; multiple REs; crossed REs
Inference: small N (< 40); small n

Definitions


Generalized linear models

Distributions from the exponential family (Poisson, binomial, Gaussian, Gamma, neg. binomial (known k), ...)
Means = linear functions of predictors on the scale of the link function (identity, log, logit, ...)

$Y \sim D(g^{-1}(X\beta), \phi)$

$\phi$ often set to 1 (Poisson, binomial) except for quasi-likelihood approaches


Generalized linear mixed models

Add random effects:
$Y \sim D(g^{-1}(X\beta + Zu), \phi)$
$u \sim \mathrm{MVN}(0, \Sigma)$

Synonyms: multilevel, hierarchical models
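
As a minimal sketch of the model above, the following base-R code simulates data from a Poisson GLMM with a log link and one random intercept per block. The sample sizes, coefficient values and RE standard deviation are made-up illustration values, not taken from the examples in this talk.

```r
set.seed(1)

n_block <- 10                       # number of RE levels (assumed)
n_per   <- 5                        # observations per block (assumed)
block   <- factor(rep(seq_len(n_block), each = n_per))
x       <- runif(n_block * n_per)   # one continuous predictor

X <- model.matrix(~ x)              # fixed-effect design matrix
Z <- model.matrix(~ 0 + block)      # RE design matrix: one indicator column per block

beta <- c(0.5, 1)                          # fixed-effect coefficients (assumed)
u    <- rnorm(n_block, mean = 0, sd = 0.7) # u ~ MVN(0, Sigma) with Sigma = 0.7^2 * I

eta <- drop(X %*% beta + Z %*% u)   # linear predictor X beta + Z u
y   <- rpois(length(eta), lambda = exp(eta))  # inverse link of log is exp
```

Each row of Z has exactly one nonzero entry, so `Z %*% u` simply adds the block's random intercept to every observation in that block.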


Marginal likelihood

Likelihood (Prob(data|parameters)) requires integrating over possible values of the REs to get the marginal likelihood, e.g.:
likelihood of the $i$th obs. in block $j$ is $L(x_{ij} \mid \theta_j, \sigma_w^2)$
likelihood of a particular block mean $\theta_j$ is $L(\theta_j \mid 0, \sigma_b^2)$
marginal likelihood is $\int L(x_{ij} \mid \theta_j, \sigma_w^2) \, L(\theta_j \mid 0, \sigma_b^2) \, d\theta_j$
Balance (dispersion of RE around 0) with (dispersion of data conditional on RE)
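
The integral above can be sketched numerically for the simplest case, a Gaussian response with a Gaussian block mean, where a closed form exists for comparison. The data values and the standard deviations below are made up for illustration; `marg_lik` is a hypothetical helper name, not a function from any package.

```r
# Marginal likelihood of one block's observations x, integrating out
# the block mean theta_j ~ N(0, sigma_b^2); x_ij | theta_j ~ N(theta_j, sigma_w^2).
marg_lik <- function(x, sigma_w, sigma_b) {
  integrand <- function(theta) {
    sapply(theta, function(th) {
      prod(dnorm(x, mean = th, sd = sigma_w)) * dnorm(th, mean = 0, sd = sigma_b)
    })
  }
  # abs.tol = 0 so that only the relative tolerance controls accuracy
  integrate(integrand, lower = -Inf, upper = Inf,
            rel.tol = 1e-8, abs.tol = 0)$value
}

x <- c(0.3, -0.1, 0.8)   # made-up observations from a single block
sigma_w <- 1
sigma_b <- 2
ml_numeric <- marg_lik(x, sigma_w, sigma_b)

# Closed-form check: marginally, x ~ MVN(0, sigma_w^2 I + sigma_b^2 J)
n <- length(x)
V <- sigma_w^2 * diag(n) + sigma_b^2 * matrix(1, n, n)
ml_exact <- exp(-0.5 * (n * log(2 * pi) + log(det(V)) +
                        drop(t(x) %*% solve(V, x))))
```

For non-Gaussian responses no such closed form is available, which is why the approximations discussed under Estimation are needed.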


Shrinkage

[Figure: Arabidopsis block estimates; y-axis: mean(log) fruit set (−15 to 5); x-axis: genotype (0 to 25).]


RE examples

Coral symbionts: simple experimental blocks, RE affects
intercept (overall probability of predation in block)
Glycera: applied to cells from 10 individuals, RE again affects
intercept (cell survival prob.)
Arabidopsis: region (3 levels, treated as fixed) / population /
genotype: affects intercept (overall fruit set) as well as
treatment effects (nutrients, herbivory, interaction)

Estimation


Penalized quasi-likelihood (PQL)

alternate between two steps: estimate a GLM using known (current) RE variances to calculate weights, then estimate an LMM given the GLM fit (Breslow, 2004)
flexible (allows spatial/temporal correlations, crossed REs)
biased for small unit samples (e.g. counts < 5, binary or low-survival data)
widely used: SAS PROC GLIMMIX, R MASS:glmmPQL; used in ≈ 90% of small-unit-sample cases


Laplace approximation

approximate the marginal likelihood
for given β, θ (RE parameters), find the conditional modes by penalized iteratively reweighted least squares; then use a second-order Taylor expansion around the conditional modes
more accurate than PQL
reasonably fast and flexible
lme4:glmer, glmmML, glmmADMB, R2ADMB (AD Model Builder)
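
The idea can be sketched for a single block of Poisson counts with one random intercept. This is a toy illustration in base R, not how lme4 or the other packages implement it: the mode is found with `optimize` rather than penalized IRLS, and the counts, `beta` and `sigma` are made-up values. The helper names (`log_joint`, `laplace_marg`, `exact_marg`) are hypothetical.

```r
# Log of the integrand: Poisson counts y in one block, log link,
# random intercept u ~ N(0, sigma^2), fixed intercept beta.
log_joint <- function(u, y, beta, sigma) {
  sum(dpois(y, lambda = exp(beta + u), log = TRUE)) +
    dnorm(u, mean = 0, sd = sigma, log = TRUE)
}

laplace_marg <- function(y, beta, sigma) {
  # conditional mode of u for the given beta and sigma
  opt <- optimize(log_joint, interval = c(-10, 10),
                  y = y, beta = beta, sigma = sigma, maximum = TRUE)
  u_hat <- opt$maximum
  # minus the second derivative of log_joint at the mode (analytic here)
  H <- length(y) * exp(beta + u_hat) + 1 / sigma^2
  # second-order Taylor expansion turns the integral into a Gaussian one
  exp(opt$objective) * sqrt(2 * pi / H)
}

exact_marg <- function(y, beta, sigma) {
  f <- function(u) sapply(u, function(ui) exp(log_joint(ui, y, beta, sigma)))
  integrate(f, lower = -10, upper = 10, rel.tol = 1e-8, abs.tol = 0)$value
}

y   <- c(2, 0, 3, 1)                       # made-up counts from one block
lap <- laplace_marg(y, beta = 0.2, sigma = 1)
ex  <- exact_marg(y, beta = 0.2, sigma = 1)
lap / ex                                   # ratio near 1 when the approximation is good
```

With several REs the exact integral becomes multidimensional, while the Laplace approximation still only needs the mode and the curvature there.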


(Adaptive) Gauss-Hermite quadrature (AGQ)

as above, but compute additional terms in the integral (typically 8, but often up to 20)
most accurate
slowest, hence least flexible (2–3 REs at most, maybe only 1)
lme4:glmer, glmmML, repeated
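
A sketch of plain (non-adaptive) Gauss-Hermite quadrature in base R follows. The nodes and weights are computed with the Golub-Welsch eigenvalue method, which is an implementation choice made here for self-containedness; `gauss_hermite` is a hypothetical helper name. The adaptive version used for GLMMs additionally centres and scales the nodes at the conditional mode of each RE, which this sketch does not do.

```r
# Nodes and weights for integrals of the form  int exp(-x^2) f(x) dx
gauss_hermite <- function(n) {
  i <- seq_len(n - 1)
  J <- matrix(0, n, n)                  # Jacobi matrix of the Hermite recurrence
  J[cbind(i, i + 1)] <- sqrt(i / 2)
  J[cbind(i + 1, i)] <- sqrt(i / 2)
  e <- eigen(J, symmetric = TRUE)
  list(nodes = e$values,
       weights = sqrt(pi) * e$vectors[1, ]^2)
}

# Expectation of g(u) for u ~ N(0, sigma^2), via the substitution u = sqrt(2) * sigma * x
gh_expect <- function(g, sigma, n = 8) {
  gh <- gauss_hermite(n)
  sum(gh$weights * g(sqrt(2) * sigma * gh$nodes)) / sqrt(pi)
}

approx <- gh_expect(exp, sigma = 1, n = 8)   # E[exp(u)], u ~ N(0, 1)
exact  <- exp(0.5)                           # lognormal mean, exp(sigma^2 / 2)
```

The cost grows as (number of nodes)^(number of REs), which is the reason for the restriction to very few REs noted above.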


Bayesian approaches

Bayesians have to do nasty integrals anyway (to normalize the
posterior probability density)
various flavours of stochastic Bayesian computation (Gibbs
sampling, MCMC, etc.)
generally slower but more flexible
solves many problems of assessing confidence intervals
must specify priors, assess convergence
specialized: glmmAK, MCMCglmm (Hadfield, 2010), INLA
general: glmmBUGS, R2WinBUGS, BRugs
(WinBUGS/OpenBUGS), R2jags, rjags (JAGS)


Overdispersion (slight tangent)

Variance greater than expected from statistical model
Quasi-likelihood approaches: MASS:glmmPQL
Extended distributions (e.g. negative binomial): glmmADMB
Observation-level random effects (e.g. lognormal-Poisson): lme4
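
To make "variance greater than expected" concrete, the sketch below simulates lognormal-Poisson counts (a Poisson response with an observation-level random effect on the log scale), fits an ordinary Poisson GLM that ignores the extra variation, and computes the usual Pearson dispersion statistic. All parameter values are made up for illustration.

```r
set.seed(42)
n   <- 200
x   <- runif(n)
eps <- rnorm(n, mean = 0, sd = 1)   # observation-level random effect (assumed sd)
y   <- rpois(n, lambda = exp(1 + 0.5 * x + eps))

# Poisson GLM that ignores the observation-level variation
fit <- glm(y ~ x, family = poisson)

# Sum of squared Pearson residuals over residual df;
# values well above 1 indicate overdispersion
dispersion <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
dispersion
```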


Comparison of coral symbiont results

[Figure: regression estimates (−6 to 2) for the terms Symbiont, Crab vs. Shrimp, and Added symbiont, compared across GLM (fixed), GLM (pooled), PQL, Laplace, and AGQ.]

Inference: tests


Wald tests [non-quadratic likelihood surfaces]

For OLS/linear models, the likelihood surface is quadratic; this is only asymptotically true for GLM(M)s
Wald tests (e.g. typical results of summary) assume a quadratic surface, based on curvature (information matrix)
always approximate, sometimes awful (Hauck-Donner effect)
do model comparison (F, score or likelihood ratio tests [LRT]) instead
But . . .


Conditional F tests [uncertainty in scale parameters]

Model comparison: in general $-2 \log L = D = \mathrm{deviance}_i / \phi$
Classical linear models: deviance and $\hat{\phi}$ are both $\chi^2$ distributed, so $D \sim F(\nu_1, \nu_2)$
Denominator degrees of freedom (df) ($\nu_2$) for complex (unbalanced, crossed, R-side effects) models?
Approximations: Satterthwaite, Kenward-Roger (Kenward and Roger, 1997; Schaalje et al., 2002)
Is $D$ really $\sim F$ in these situations?
Scale parameters usually not estimated in GLMMs (Gamma, quasi-likelihood cases only).
But . . .

Likelihood ratio tests [non-normality of likelihood]

What about cases where $\phi$ is specified (e.g. $\equiv 1$)?
in the GLM(M) case, the numerator is only asymptotically $\chi^2$ anyway
Bartlett corrections (Cordeiro et al., 1994; Cordeiro and Ferrari, 1998), higher-order asymptotics: cond [neither extended to GLMMs!]


Tests of random effects [boundary problems]

LRT depends on the null hypothesis being within the parameter's feasible range (Goldman and Whelan, 2000; Molenberghs and Verbeke, 2007)
violated e.g. by $H_0: \sigma^2 = 0$
In simple cases the null distribution is a mixture of $\chi^2$ distributions (e.g. $0.5\chi^2_0 + 0.5\chi^2_1$; emdbook:dchibarsq)
ignoring this leads to conservative tests (e.g. true p-value = $\frac{1}{2} \cdot$ nominal p-value)
simulation-based testing: RLRsim
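
For the simple one-variance case, the mixture p-value can be sketched in base R without any extra package. The test statistic below is a hypothetical value chosen only for illustration.

```r
# LRT statistic for H0: sigma^2 = 0 (hypothetical value)
lrt_stat <- 2.9

# Naive p-value, treating the statistic as chi-squared with 1 df
p_nominal <- pchisq(lrt_stat, df = 1, lower.tail = FALSE)

# Mixture 0.5 * chi^2_0 + 0.5 * chi^2_1: chi^2_0 is a point mass at zero,
# so for a positive statistic only the chi^2_1 component contributes
p_mixture <- 0.5 * pchisq(lrt_stat, df = 1, lower.tail = FALSE)

c(nominal = p_nominal, mixture = p_mixture)
```

The halving applies only to this simple case; with several variance parameters or correlated REs the mixture weights differ, which is where simulation-based testing comes in.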


Information-theoretic approaches

Above issues apply, but are less well understood (Greven, 2008; Greven and Kneib, 2010)
AIC is asymptotic
“corrected” AIC (AICc) (Hurvich and Tsai, 1989) was derived for linear models; widely used but not tested elsewhere (Richards, 2005)
For comparing models with different REs, or for AICc, what is p?
AICcmodavg, MuMIn
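
For reference, the usual small-sample correction is AICc = AIC + 2p(p + 1)/(n − p − 1). A minimal base-R sketch for an ordinary linear model, where p and n are unambiguous, is given below; `aicc` is a hypothetical helper name and `cars` is a data set shipped with R. For mixed models the choice of p (and arguably n) is exactly the open question raised above, so this sketch should not be applied to them uncritically.

```r
aicc <- function(fit) {
  p <- attr(logLik(fit), "df")   # number of estimated parameters (incl. residual variance)
  n <- nobs(fit)                 # number of observations
  AIC(fit) + 2 * p * (p + 1) / (n - p - 1)
}

fit <- lm(dist ~ speed, data = cars)
c(AIC = AIC(fit), AICc = aicc(fit))
```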


Parametric bootstrapping

fit null model to data
simulate “data” from null model
fit null and working model, compute likelihood difference
repeat to estimate null distribution
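    library(lme4)  # provides simulate() and refit() methods for fitted mixed models
    pboot <- function(m0, m1) {
      s <- simulate(m0)[[1]]                  # one response vector simulated from the null model
      L0 <- as.numeric(logLik(refit(m0, s)))  # refit the null model to the simulated response
      L1 <- as.numeric(logLik(refit(m1, s)))  # refit the working model to the same response
      2 * (L1 - L0)                           # likelihood ratio statistic
    }
    # fm2 = null model, fm1 = working model, both fitted beforehand
    nulldist <- replicate(1000, pboot(fm2, fm1))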
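
The four steps listed above can also be sketched end to end in base R, including the final conversion of the simulated null distribution into a p-value. To stay self-contained this uses ordinary Poisson GLMs on made-up data instead of mixed models, and `update()` in place of lme4's `refit()`; the helper name `pboot_glm` and all parameter values are assumptions for illustration.

```r
set.seed(101)
n <- 40
d <- data.frame(x = runif(n))
d$y <- rpois(n, lambda = exp(0.5 + 0.3 * d$x))   # made-up data

m0 <- glm(y ~ 1, family = poisson, data = d)     # null model
m1 <- glm(y ~ x, family = poisson, data = d)     # working model
obs <- as.numeric(2 * (logLik(m1) - logLik(m0))) # observed LR statistic

pboot_glm <- function(m0, m1, data) {
  data$y <- simulate(m0)[[1]]                    # simulate "data" from the null model
  L0 <- logLik(update(m0, data = data))          # refit null model
  L1 <- logLik(update(m1, data = data))          # refit working model
  as.numeric(2 * (L1 - L0))
}

nulldist <- replicate(200, pboot_glm(m0, m1, d)) # estimated null distribution

# Bootstrap p-value; the +1 terms count the observed statistic itself
p_boot <- (1 + sum(nulldist >= obs)) / (1 + length(nulldist))
```

Only 200 replicates are used here to keep the run time short; more replicates give a finer-grained p-value.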


Finite-sample problems

How far are we from “asymptopia”?
How much data (number of samples, number of RE levels)?
How many parameters (number of fixed-effect parameters, number of RE levels, number of RE parameters)?
Hope (#data) − (#parameters) ≫ 1, but if not?


Levels of focus

how many parameters does a RE take?
Somewhere between q and r (e.g., 1 and the number of levels
for a variance) . . . shrinkage
Conditional vs. marginal AIC
Similar issues with Deviance Information Criterion
(Spiegelhalter et al., 2002)

Inference: confidence intervals

Wald tests

a sometimes-crude approximation
computationally easy, especially for many-parameter models
use Wald Z (assume “residual df” large)? Or t, guessing at
the residual df?
Available from most packages


Profile confidence intervals

Tedious to program
Computationally challenging
Inherits finite-size sample problems from LRT
lme4a (in development/soon!)


Bayesian posterior intervals

Marginal quantile or highest posterior density intervals
Computationally “free” with results of stochastic Bayesian computation
Easily extended to confidence intervals on predictions, etc.
Post hoc Markov chain Monte Carlo sampling available for some packages (glmmADMB, R2ADMB, eventually lme4a)


Summary

Large data
computation can be limiting
asymptotics better
Small data
RE variances may be poorly estimated or set to zero (informative priors can help)
inference tricky

Glycera


[Figure: estimated effects on survival (−60 to 60) for Osm, Cu, H2S, Anoxia and all their two-, three- and four-way interactions, compared across glmer, glmmML, glmer(OD), glmer(OD:2), and MCMCglmm.]


[Figure: effect on survival (−20 to 30) for the main effects (Oxygen, Osm, Cu, H2S), the 2-way and 3-way interactions, and the 4-way interaction Osm : Cu : H2S : Oxygen.]


Parametric bootstrap results

[Figure: inferred p value vs. true p value (both 0.02 to 0.08); panels: H2S, Anoxia, Osm, Cu.]

Arabidopsis


Arabidopsis: AIC comparison of REs

[Figure: ∆AIC (0 to 6) for models with different RE structures: nointeract; int(popu); int(gen) X int(popu); int(gen) X nut(popu); int(gen) X clip(popu); nut(gen) X int(popu); nut(gen) X nut(popu); nut(gen) X clip(popu); clip(gen) X int(popu); clip(gen) X nut(popu); clip(gen) X clip(popu).]


Arabidopsis: fits with and without nutrient(genotype)

[Figure: regression estimates (−1.0 to 1.5) for nutrient8:amdclipped, statusTransplant, statusPetri.Plate, rack2, amdclipped, and nutrient8, each shown for the two fits.]

Primary tools

lme4: multiple/crossed REs, (profiling): fast
MCMCglmm: Bayesian, very flexible
glmmADMB: negative binomial, zero-inflated etc.
Most flexible: R2ADMB/AD Model Builder,
R2WinBUGS/WinBUGS/R2jags/JAGS, INLA

Loose ends

Overdispersion and zero-inflation: MCMCglmm, glmmADMB
Spatial and temporal correlation (R-side effects): MASS:glmmPQL (sort of), GLMMarp, INLA; WinBUGS, AD Model Builder
Additive models: amer, gamm4, mgcv
Penalized methods (Jiang, 2008) (?)
Hierarchical GLMs: hglm, HGLMMM
Marginal models: geepack, gee

To be done

Many holes in knowledge (but what can be done?)
Faster algorithms, more parallel computation
Lots of implementation and clean-up
Benefits & costs of staying within the GLMM framework
Benefits & costs of diversity
More info: glmm.wikidot.com

Acknowledgements

Data: Josh Banta and Massimo Pigliucci (Arabidopsis);
Adrian Stier and Sea McKeon (coral symbionts); Courtney
Kagan, Jocelynn Ortega, David Julian (Glycera);
Co-authors: Mollie Brooks, Connie Clark, Shane Geange, John
Poulsen, Hank Stevens, Jada White

References
Breslow, N.E., 2004. In D.Y. Lin and P.J. Heagerty, editors, Proceedings of the second Seattle symposium in
biostatistics: Analysis of correlated data, pages 1–22. Springer. ISBN 0387208623.
Cordeiro, G.M. and Ferrari, S.L.P., 1998. Journal of Statistical Planning and Inference, 71(1-2):261–269. ISSN
0378-3758. doi:10.1016/S0378-3758(98)00005-6.
Cordeiro, G.M., Paula, G.A., and Botter, D.A., 1994. International Statistical Review / Revue Internationale de
Statistique, 62(2):257–274. ISSN 03067734. doi:10.2307/1403512.
Gelman, A., 2005. Annals of Statistics, 33(1):1–53. doi:10.1214/009053604000001048.
Goldman, N. and Whelan, S., 2000. Molecular Biology and Evolution, 17(6):975–978.
Greven, S., 2008. Non-Standard Problems in Inference for Additive and Linear Mixed Models. Cuvillier Verlag,
Göttingen, Germany. ISBN 3867274916.
Greven, S. and Kneib, T., 2010. Biometrika, 97(4):773–789.
Hadfield, J.D., 2010. Journal of Statistical Software, 33(2):1–22. ISSN 1548-7660.
Hurvich, C.M. and Tsai, C., 1989. Biometrika, 76(2):297–307. doi:10.1093/biomet/76.2.297.
Jiang, J., 2008. The Annals of Statistics, 36(4):1669–1692. ISSN 0090-5364. doi:10.1214/07-AOS517.
Kenward, M.G. and Roger, J.H., 1997. Biometrics, 53(3):983–997.
Molenberghs, G. and Verbeke, G., 2007. The American Statistician, 61(1):22–27.
doi:10.1198/000313007X171322.
Pinheiro, J.C. and Bates, D.M., 2000. Mixed-effects models in S and S-PLUS. Springer, New York. ISBN
0-387-98957-9.
Richards, S.A., 2005. Ecology, 86(10):2805–2814. doi:10.1890/05-0074.
Schaalje, G., McBride, J., and Fellingham, G., 2002. Journal of Agricultural, Biological & Environmental Statistics,
7(14):512–524.
Spiegelhalter, D.J., Best, N., et al., 2002. Journal of the Royal Statistical Society B, 64:583–640.