Generalized Nonlinear Models in R

Heather Turner (1, 2), David Firth (2) and Ioannis Kosmidis (3)
(1) Independent consultant
(2) University of Warwick, UK
(3) UCL, UK
Turner, Firth & Kosmidis, GNM in R, ERCIM 2013

Generalized Linear Models
A GLM is made up of a linear predictor
$$\eta = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$$
and two functions:
- a link function that describes how the mean, $E(Y) = \mu$, depends on the linear predictor
  $$g(\mu) = \eta$$
- a variance function that describes how the variance, $\mathrm{Var}(Y)$, depends on the mean
  $$\mathrm{Var}(Y) = \phi V(\mu)$$
  where the dispersion parameter $\phi$ is a constant.
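In R, the link and variance functions of a GLM are bundled in a family object. A minimal sketch using the base R poisson() family, for which g is the log and V(µ) = µ (the values of mu are illustrative only):

```r
# Family objects in base R carry the link g, its inverse and the variance function V.
fam <- poisson(link = "log")

mu  <- c(0.5, 2, 10)         # illustrative means
eta <- fam$linkfun(mu)       # g(mu) = log(mu)
fam$linkinv(eta)             # inverse link, recovers mu
fam$variance(mu)             # V(mu) = mu for the Poisson family
```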

Generalized Nonlinear Models
A generalized nonlinear model (GNM) is the same as a GLM
except that we have
$$g(\mu) = \eta(x; \beta)$$
where $\eta(x; \beta)$ is nonlinear in the parameters $\beta$.
Equivalently, a GNM is an extension of the nonlinear least squares model in which the
variance of Y is allowed to depend on the mean.
Using a nonlinear predictor can produce a more parsimonious and
interpretable model.

Example: Mental Health Status
A study of 1660 children from Manhattan recorded their mental
impairment and parents’ socioeconomic status (Agresti, 2002)
[Figure: mental health status (MHS: well, mild, moderate, impaired) by parents' socioeconomic status (SES: A to F)]

Independence
A simple analysis of these data might be to test for independence of
MHS and SES using a chi-squared test.
This is equivalent to testing the goodness-of-fit of the independence
model
$$\log(\mu_{rc}) = \alpha_r + \beta_c$$
Such a test compares the independence model to the saturated model
$$\log(\mu_{rc}) = \alpha_r + \beta_c + \gamma_{rc}$$
which may be over-complex.
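The equivalence between the chi-squared test and the goodness-of-fit of the independence model can be checked numerically. The sketch below does not use the mental health data; as a stand-in it uses the hair by eye colour margin of the HairEyeColor table that ships with base R.

```r
# Cross-classify hair and eye colour (summing over sex) from a built-in table.
tab <- margin.table(HairEyeColor, c(1, 2))
dat <- as.data.frame(tab)    # columns: Hair, Eye, Freq

# Independence model log(mu_rc) = alpha_r + beta_c, fitted as a Poisson GLM.
indep <- glm(Freq ~ Hair + Eye, family = poisson, data = dat)

# Pearson goodness-of-fit statistic of the independence model ...
X2_glm <- sum(residuals(indep, type = "pearson")^2)
# ... and the statistic reported by the classical chi-squared test.
X2_test <- unname(chisq.test(tab)$statistic)

c(glm = X2_glm, chisq.test = X2_test, df = df.residual(indep))
```

The two statistics agree up to the convergence tolerance of glm, and the residual degrees of freedom equal (rows − 1)(columns − 1).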

Row-column Association
One intermediate model is the Row-Column association model:
$$\log(\mu_{rc}) = \alpha_r + \beta_c + \phi_r \psi_c$$
(Goodman, 1979), an example of a multiplicative interaction model.
For the Mental Health data:
## Analysis of Deviance Table
##
## Model 1: Freq ~ SES + MHS
## Model 2: Freq ~ SES + MHS + Mult(SES, MHS)
## Model 3: Freq ~ SES + MHS + SES:MHS
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1        15       47.4
## 2         8        3.6  7     43.8  2.3e-07
## 3         0        0.0  8      3.6     0.89
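A comparison of this kind can be sketched with the gnm package as follows. This assumes gnm is installed and uses the mentalHealth data frame shipped with it, whose response is named count rather than Freq as in the output above; the contrast settings follow the package's help page for these data.

```r
library(gnm)   # assumes the gnm package is installed from CRAN

set.seed(1)    # the multiplicative term is started from random values

# SES and MHS are ordered factors; use treatment contrasts for the main effects.
mentalHealth$MHS <- C(mentalHealth$MHS, treatment)
mentalHealth$SES <- C(mentalHealth$SES, treatment)

indep <- gnm(count ~ SES + MHS,
             family = poisson, data = mentalHealth, verbose = FALSE)
rc    <- gnm(count ~ SES + MHS + Mult(SES, MHS),
             family = poisson, data = mentalHealth, verbose = FALSE)
sat   <- gnm(count ~ SES + MHS + SES:MHS,
             family = poisson, data = mentalHealth, verbose = FALSE)

# Analysis of deviance for the three nested models.
anova(indep, rc, sat, test = "Chisq")
```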

Parameterisation
The independence model was defined earlier in an over-parameterised form:
$$\log(\mu_{rc}) = \alpha_r + \beta_c = (\alpha_r + 1) + (\beta_c - 1) = \alpha_r^* + \beta_c^*$$
Identifiability constraints may be imposed
- to fix a one-to-one mapping between parameter values and distributions
- to enable interpretation of parameters
Standard Implementation
The standard approach of all major statistical software packages is to
apply the identifiability constraints in the construction of the model
$$g(\mu) = X\beta$$
so that rank(X) is equal to the number of parameters p.
Then the inverse in the score equations of the IWLS algorithm
$$\beta^{(r+1)} = \left(X^T W^{(r)} X\right)^{-1} X^T W^{(r)} z^{(r)}$$
exists.
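The IWLS update can be written out directly for a full-rank design. The following is a minimal sketch for a Poisson model with log link on simulated data (the data, starting values and stopping rule are illustrative assumptions, not the internals of any package):

```r
# Simulate a small Poisson regression (illustrative data).
set.seed(42)
n <- 200
x <- rnorm(n)
y <- rpois(n, exp(0.5 + 0.3 * x))
X <- cbind(1, x)               # full-rank design: rank(X) = p = 2

beta <- c(log(mean(y)), 0)     # simple starting values
for (r in 1:25) {
  eta <- drop(X %*% beta)
  mu  <- exp(eta)              # inverse of the log link
  W   <- mu                    # IWLS weights for Poisson with log link
  z   <- eta + (y - mu) / mu   # working response
  # beta^(r+1) = (X' W X)^{-1} X' W z
  beta_new <- drop(solve(crossprod(X, W * X), crossprod(X, W * z)))
  done <- max(abs(beta_new - beta)) < 1e-10
  beta <- beta_new
  if (done) break
}

# Compare with the estimates from glm().
rbind(iwls = unname(beta), glm = unname(coef(glm(y ~ x, family = poisson))))
```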

Alternative Implementation
The gnm package for R works with over-parameterised models, where
rank(X) < p, and uses the generalised inverse in the IWLS updates:
$$\beta^{(r+1)} = \left(X^T W^{(r)} X\right)^{-} X^T W^{(r)} z^{(r)}$$
This approach is more useful for GNMs, where it is much harder to
define standard rules for specifying identifiability constraints.
Rather, identifiability constraints can be applied post-fitting for
inference and interpretation.
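As a sketch of the idea, the IWLS update still works when the design matrix is rank deficient, provided a generalised inverse is used. Here MASS::ginv (the Moore-Penrose inverse) stands in for the generalised inverse; this is an assumption made for illustration and not necessarily the inverse computed inside gnm. The data are simulated.

```r
library(MASS)   # ginv(): Moore-Penrose generalised inverse (MASS ships with R)

set.seed(42)
g <- gl(3, 20)                          # a three-level factor
y <- rpois(60, c(2, 4, 8)[g])
X <- cbind(1, model.matrix(~ g - 1))    # intercept plus all three dummies: rank 3 < p = 4

beta <- drop(ginv(X) %*% log(y + 0.5))  # crude starting values
for (r in 1:50) {
  eta <- drop(X %*% beta)
  mu  <- exp(eta)
  z   <- eta + (y - mu) / mu
  # beta^(r+1) = (X' W X)^- X' W z, with W = diag(mu)
  beta_new <- drop(ginv(crossprod(X, mu * X)) %*% crossprod(X, mu * z))
  done <- max(abs(beta_new - beta)) < 1e-10
  beta <- beta_new
  if (done) break
}

beta                                    # one of infinitely many equivalent solutions
head(exp(drop(X %*% beta)))             # fitted means are nevertheless unique
```

The individual coefficients are not identified, but the fitted means equal the group means of y, exactly as for a constrained fit.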

Estimation of GNMs
GNMs present further technical difficulties compared with GLMs:
- automatic generation of starting values is hard
- the likelihood may have multiple optima
The default approach of the gnm function in package gnm is to:
- generate starting values randomly for nonlinear parameters, and from a GLM fit for linear parameters
- use a one-parameter-at-a-time Newton method to update nonlinear parameters
- use the generalized IWLS to update all parameters
Consequently, the parameterisation returned is random.

Parameterisation of RC Model
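The RC model is invariant to changes in scale or location of the
interaction parameters:
$$\log(\mu_{rc}) = \alpha_r + \beta_c + \phi_r \psi_c = \alpha_r + \beta_c + (2\phi_r)(0.5\psi_c) = \alpha_r + (\beta_c - \psi_c) + (\phi_r + 1)\psi_c$$
One way to constrain these parameters is as follows:
$$\phi_r^* = \frac{\phi_r - \dfrac{\sum_{r'} w_{r'} \phi_{r'}}{\sum_{r'} w_{r'}}}{\sqrt{\sum_{r'} w_{r'} \left(\phi_{r'} - \dfrac{\sum_{r''} w_{r''} \phi_{r''}}{\sum_{r''} w_{r''}}\right)^2}}$$
where $w_r$ is the row probability, say, so that
$$\sum_r w_r \phi_r^* = 0, \qquad \sum_r w_r (\phi_r^*)^2 = 1$$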
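The constraint amounts to a weighted centring and scaling of the scores. A small sketch in base R, where the function name standardise_scores and the numeric values are hypothetical and chosen only for illustration:

```r
# Centre and scale a vector of scores phi using weights w (e.g. row probabilities).
standardise_scores <- function(phi, w) {
  centred <- phi - sum(w * phi) / sum(w)   # subtract the weighted mean
  centred / sqrt(sum(w * centred^2))       # scale to unit weighted sum of squares
}

phi <- c(0.8, 0.9, 0.1, -0.2, -1.3, -2.0)      # hypothetical unconstrained scores
w   <- c(0.15, 0.15, 0.20, 0.25, 0.15, 0.10)   # hypothetical row probabilities
phi_star <- standardise_scores(phi, w)

sum(w * phi_star)      # weighted mean: 0 up to rounding
sum(w * phi_star^2)    # weighted sum of squares: 1 up to rounding

# The result does not depend on the location or (positive) scale of the input.
all.equal(standardise_scores(2 * phi + 3, w), phi_star)
```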

Row and Column Scores
These scores and their standard errors can be obtained via the
getContrasts function in the gnm package
##                   Estimate Std. Error
## Mult(., MHS).SESA     1.11       0.30
## Mult(., MHS).SESB     1.12       0.31
## Mult(., MHS).SESC     0.37       0.32
## Mult(., MHS).SESD    -0.03       0.27
## Mult(., MHS).SESE    -1.01       0.31
## Mult(., MHS).SESF    -1.82       0.28
##                          Estimate Std. Error
## Mult(SES, .).MHSwell         1.68       0.19
## Mult(SES, .).MHSmild         0.14       0.20
## Mult(SES, .).MHSmoderate    -0.14       0.28
## Mult(SES, .).MHSimpaired    -1.41       0.17
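A call of roughly the following form produces standardised scores of this kind. It is a sketch assuming the gnm package and its mentalHealth data (response named count); the use of marginal probabilities for ref, scaleRef and scaleWeights is an assumption about how the scores above were obtained.

```r
library(gnm)   # assumes the gnm package is installed from CRAN

set.seed(1)
mentalHealth$MHS <- C(mentalHealth$MHS, treatment)
mentalHealth$SES <- C(mentalHealth$SES, treatment)
rc <- gnm(count ~ SES + MHS + Mult(SES, MHS),
          family = poisson, data = mentalHealth, verbose = FALSE)

# Marginal row and column probabilities, used as weights in the constraints.
rowProbs <- with(mentalHealth, tapply(count, SES, sum) / sum(count))
colProbs <- with(mentalHealth, tapply(count, MHS, sum) / sum(count))

# Scores centred and scaled with respect to the marginal probabilities.
rowScores <- getContrasts(rc, pickCoef(rc, "[.]SES"),
                          ref = rowProbs, scaleRef = rowProbs,
                          scaleWeights = rowProbs)
colScores <- getContrasts(rc, pickCoef(rc, "[.]MHS"),
                          ref = colProbs, scaleRef = colProbs,
                          scaleWeights = colProbs)
rowScores
colScores
```

Because the fitted parameterisation is random, the signs of the row and column scores may both be reversed relative to the output above; their product is unaffected.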

Stereotype Model
The stereotype model (Anderson, 1984) is suitable for ordered
categorical data. It is a special case of the multinomial logistic model
$$\mathrm{pr}(y_i = c \mid x_i) = \frac{\exp(\beta_{0c} + \beta_c^T x_i)}{\sum_r \exp(\beta_{0r} + \beta_r^T x_i)}$$
in which only the scale of the relationship with the covariates changes
between categories:
$$\mathrm{pr}(y_i = c \mid x_i) = \frac{\exp(\beta_{0c} + \gamma_c \beta^T x_i)}{\sum_r \exp(\beta_{0r} + \gamma_r \beta^T x_i)}$$

Poisson Trick
The stereotype model can be fitted as a GNM by re-expressing the
categorical data as category counts $Y_i = (Y_{i1}, \dots, Y_{ik})$.
Assuming a Poisson distribution for $Y_{ic}$, the joint distribution of $Y_i$ is
Multinomial$(N_i, p_{i1}, \dots, p_{ik})$ conditional on the total count $N_i$.
The expected counts are then $\mu_{ic} = N_i p_{ic}$ and the parameters of the
stereotype model can be estimated through fitting
$$\log \mu_{ic} = \log(N_i) + \log(p_{ic}) = \alpha_i + \beta_{0c} + \gamma_c \sum_r \beta_r x_{ir}$$
where the “nuisance” parameters $\alpha_i$ ensure that the multinomial
denominators are reproduced exactly, as required.

Augmented Least Squares
A disadvantage of using the Poisson trick is that the number of
nuisance parameters can be large, making computation slow.
The algorithm can be adapted using augmented least squares.
For an ordinary least squares model,
$$\left((y|X)^T (y|X)\right)^{-1} = \begin{pmatrix} y^T y & y^T X \\ X^T y & X^T X \end{pmatrix}^{-1} = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}$$
where $A_{11}$, $A_{12}$ and $A_{22}$ are functions of $y^T y$, $X^T y$ and $X^T X$.
Then it can be shown that
$$\hat\beta = (X^T X)^{-1} X^T y = -\frac{A_{21}}{A_{11}}$$
requiring only the first row (column) of the inverse to be found.
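The identity can be checked numerically in a few lines of base R. The simulated data below are illustrative only:

```r
set.seed(1)
n <- 50
X <- cbind(1, rnorm(n), rnorm(n))
y <- drop(X %*% c(1, 2, -1)) + rnorm(n)

# Inverse of the augmented cross-product matrix (y|X)'(y|X).
A <- solve(crossprod(cbind(y, X)))

# Only the first column is needed: beta-hat = -A21 / A11.
beta_aug <- -A[-1, 1] / A[1, 1]

# Usual least squares solution for comparison.
beta_ols <- drop(solve(crossprod(X), crossprod(X, y)))

rbind(augmented = unname(beta_aug), ols = unname(beta_ols))
```

As a by-product, 1 / A[1, 1] is the residual sum of squares of the fit.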

Application to Nuisance Parameters I
The same approach can be applied to the IWLS algorithm, letting
$$\tilde{X} = W^{1/2} (z|X)$$
Now let
$$\tilde{X} = (U|V)$$
where V is the part of the design matrix corresponding to the
nuisance factor.
U is an nk × p matrix, where n is the number of nuisance parameters,
k is the number of categories and p is the number of model
parameters, typically with n >> p.
V is an nk × n matrix of dummy variables identifying each individual.

Application to Nuisance Parameters II
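Then
$$(\tilde{X}^T \tilde{X})^{-} = \begin{pmatrix} U^T U & U^T V \\ V^T U & V^T V \end{pmatrix}^{-} = \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}$$
Again, only the first row (column) of this generalised inverse is
required to estimate $\hat\beta$, so we are only interested in $B_{11}$ and $B_{12}$:
$$B_{11} = \left(U^T U - U^T V (V^T V)^{-1} V^T U\right)^{-}$$
$$B_{21} = B_{12}^T = -(V^T V)^{-1} V^T U B_{11}$$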

Elimination of the Nuisance Factor
$U^T U$ is p × p, therefore not expensive to compute.
$V^T V$ and $V^T U$ can be computed without constructing the large
nk × n matrix V, due to the structure of V:
- $V^T V$ is diagonal and the non-zero elements can be computed directly
- $V^T U$ is equivalent to aggregating the rows of U by levels of the nuisance factor
Thus we only need to construct the U matrix, saving memory and
reducing the computational burden.
This approach is invoked using the eliminate argument to gnm.
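The two shortcuts can be illustrated in base R. The sketch below is unweighted and uses small simulated matrices, so ordinary inverses suffice; V is constructed only to verify the shortcuts and would not be built in practice. It is not the code used inside gnm.

```r
set.seed(1)
n <- 5; k <- 3; p <- 2
id <- gl(n, k)                            # nuisance factor: n individuals, k rows each
U  <- matrix(rnorm(n * k * p), n * k, p)
V  <- model.matrix(~ id - 1)              # nk x n dummy matrix, for checking only

# Shortcuts that never touch V:
VtV_diag <- as.vector(table(id))          # diagonal of V'V: rows per level
VtU      <- rowsum(U, id)                 # V'U: rows of U summed within levels

# B11 from the p x p Schur complement U'U - U'V (V'V)^{-1} V'U.
S   <- crossprod(U) - crossprod(VtU / VtV_diag, VtU)
B11 <- solve(S)

# Check against the top-left block of the full inverse.
full <- solve(crossprod(cbind(U, V)))
all.equal(B11, full[1:p, 1:p], check.attributes = FALSE)
```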

Example: Back Pain Data
For 101 patients, 3 prognostic variables were recorded at baseline,
then after 3 weeks the level of back pain was recorded (Anderson,
1984).
These data can be converted to counts using the
expandCategorical function, giving for the first record:
##     x1 x2 x3                 pain count id
## 1    1  1  1                worse     0  1
## 1.1  1  1  1                 same     1  1
## 1.2  1  1  1   slight.improvement     0  1
## 1.3  1  1  1 moderate.improvement     0  1
## 1.4  1  1  1   marked.improvement     0  1
## 1.5  1  1  1      complete.relief     0  1

Back Pain Model
The expanded data set has only 606 records and the total number of
parameters is only 115 (9 nonlinear), so the model is quick to fit:
library(gnm)
# expand the categorical response into counts, one row per patient and category
backPainLong <- expandCategorical(backPain, "pain")
system.time({
    m <- gnm(count ~ id + pain + Mult(pain, x1 + x2 + x3),
             family = poisson, data = backPainLong, verbose = FALSE)
})[3]
## elapsed
##   0.268
However, eliminating the nuisance parameters (the id effects) reduces the run time by
more than two thirds, showing the potential of this technique.
system.time(m2 <- update(m, eliminate = id))[3]
## elapsed
##   0.088

Rasch Models
Rasch models are used in Item Response Theory to model the binary
responses of subjects over a set of items.
The simplest one parameter logistic (1PL) model has the form
$$\log\frac{\pi_{is}}{1 - \pi_{is}} = \alpha_i + \gamma_s$$
The one-dimensional Rasch model extends the 1PL as follows:
$$\log\frac{\pi_{is}}{1 - \pi_{is}} = \alpha_i + \beta_i \gamma_s$$
where $\beta_i$ measures the discrimination of item i: the larger $\beta_i$ the
steeper the item-response function that maps $\gamma_s$ to $\pi_{is}$.
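The effect of the discrimination parameter can be seen by evaluating the item-response function directly. A short sketch in base R, where the function name item_response and the parameter values are illustrative:

```r
# Item-response function of the one-dimensional Rasch model:
# probability of a positive response given subject parameter gamma.
item_response <- function(gamma, alpha, beta) plogis(alpha + beta * gamma)

gamma <- seq(-3, 3, by = 1.5)
item_response(gamma, alpha = 0, beta = 1)   # 1PL case: unit discrimination
item_response(gamma, alpha = 0, beta = 3)   # larger beta: steeper around gamma = 0
```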

Example: US House of Representatives
Votes on 20 roll calls selected by Americans for Democratic Action (ADA)
[Figure: vote (For, Against) of each representative on each roll call, by party (Democrat, Republican, Other)]
Roll calls shown: BankruptcyOverhaul.Yes, ErgonomicsRuleDisapproval.No, IncomeTaxReduction.No, MarriageTaxReduction.Yes, EstateTaxRelief.Yes, FetalProtection.No, SchoolVouchers.No, TaxCutReconciliationBill.No, CampaignFinanceReform.No, FlagDesecration.No, FaithBasedInitiative.Yes, ChinaNormalizedTradeRelations.Yes, ANWRDrillingBan.Yes, PatientsRightsHMOLiability.No, PatientsBillOfRights.No, DomesticPartnerBenefits.No, USMilitaryPersonnelOverseasAbortions.Yes, AntiTerrorismAuthority.No, EconomicStimulus.No, TradePromotionAuthorityFastTrack.No

Complete Separation
For representatives that always vote “For” or “Against” the ADA
position, maximum likelihood will produce infinite $\gamma_s$ estimates, so
that the fitted probabilities are 0 or 1.
Two possible remedies:
1. Add $\delta$ to $y_{is}$ and $2\delta$ to the totals $n_{is}$
   - hard to quantify effect of adjustment
   - different $\delta$ give different results
2. Bias reduction (Firth, 1993; Kosmidis and Firth, 2009)
   - requires identifiable parameterization
Bias Reduction in the 1D Rasch Model
ML estimates are obtained by solving the score equations, which for
the one-dimensional Rasch model with $\theta = (\alpha^T, \beta^T, \gamma^T)^T$ are
$$U_t = \sum_{i=1}^{I} \sum_{s=1}^{S} (y_{is} - n_{is}\pi_{is})\, z_{ist} = 0$$
where $z_{ist} = \partial\eta_{is}/\partial\theta_t$.
The bias reduction method of Kosmidis and Firth (2009) works by
adjusting the scores, in this case giving
$$U_t^* = \sum_{i=1}^{I} \sum_{s=1}^{S} \left( y_{is} + \tfrac{1}{2} h_{is} + c_{is} v_{is} - (n_{is} + h_{is})\pi_{is} \right) z_{ist} = 0$$
where $v_{is}$, $h_{is}$ and $c_{is}$ depend on the model parameters.

Identifiability in the 1D Rasch Model
In order to identify the parameters in the 1D Rasch model
$$\log\frac{\pi_{is}}{1 - \pi_{is}} = \alpha_i + \beta_i \gamma_s$$
the scale of the $\beta_i$ and the location of the $\gamma_s$ must be constrained.
This can be achieved by fixing one of the $\beta_i$ and one of the $\gamma_s$.
Here we will select one $\beta_i$ and one $\gamma_s$ at random and fix them to their
ML estimates based on data that have been $\delta$-adjusted.

Bias Reduction Algorithm
The bias adjustment suggests the following iterative scheme:
1. Evaluate bias adjusted responses and totals given $\theta^{(i)}$
2. Fit the 1D Rasch model to the adjusted data using ML
Unfortunately the $c_{is}$ quantities are unbounded and can produce
adjusted $y_{is} < 0$ or $> n_{is}$
- redefine $y_{is}$ and $n_{is}$ to avoid this
Adding a further iteration loop to IWLS adds significantly to the
computation time, therefore good starting values are important
- if ML estimates are finite, use these
- else use ML estimates found by $\delta$ adjustment
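The two-step scheme can be sketched for a simpler model than the Rasch model: an ordinary linear logistic regression, for which the adjusted score of Firth (1993) uses responses y + h/2 and totals n + h with no c term. The data are hypothetical and completely separated, the starting values and stopping rule are illustrative, and convergence of this simple iteration is not guaranteed in general.

```r
x <- c(-3, -2, -1, 1, 2, 3)
y <- c(0, 0, 0, 1, 1, 1)     # hypothetical, completely separated: ML slope is infinite
n <- rep(1, length(y))
X <- cbind(1, x)

beta <- c(0, 0)              # starting values
for (iter in 1:200) {
  # Step 1: adjusted responses and totals at the current estimate.
  p  <- plogis(drop(X %*% beta))
  XW <- sqrt(n * p * (1 - p)) * X
  h  <- rowSums((XW %*% solve(crossprod(XW))) * XW)   # leverages h_i
  y_adj <- y + h / 2
  n_adj <- n + h
  # Step 2: ML fit to the adjusted data (non-integer counts trigger a warning).
  fit <- suppressWarnings(
    glm.fit(X, y_adj / n_adj, weights = n_adj, family = binomial())
  )
  beta_new <- unname(fit$coefficients)
  done <- max(abs(beta_new - beta)) < 1e-8
  beta <- beta_new
  if (done) break
}
beta   # finite estimates despite the separation
```

In the Rasch model the predictor is nonlinear, so the additional c and v terms enter the adjusted responses and the safeguards listed above become necessary.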

Liberality of US Representatives
All the $\hat\beta_i$ are $< 0$, hence smaller $\hat\gamma_s$ implies larger probability of
voting for the ADA position, i.e. more liberal.

Comparison Intervals
Adding intervals based on quasi-standard errors that are invariant to
the parameter constraints (Firth and de Menezes, 2004):

Summary
- Working with over-parameterized models enables a general framework to be implemented for GNMs
- Some of the computational methods from GLMs can be applied directly to GNMs...
- ...whilst others require much more work!
Further examples can be found in the help files and manual
accompanying the gnm package, which is available on CRAN.

References
Agresti, A. (2002). Categorical Data Analysis (2nd ed.). New York: Wiley.
Anderson, J. A. (1984). Regression and Ordered Categorical Variables. J. R. Statist. Soc. B 46(1), 1–30.
Firth, D. (1993). Bias reduction of maximum likelihood estimates. Biometrika 80(1), 27–38.
Firth, D. and R. X. de Menezes (2004). Quasi-variances. Biometrika 91, 65–80.
Goodman, L. A. (1979). Simple models for the analysis of association in cross-classifications having ordered categories. J. Amer. Statist. Assoc. 74, 537–552.
Kosmidis, I. and D. Firth (2009). Bias reduction in exponential family nonlinear models. Biometrika 96(4), 793–804.
