From L to N: Nonlinear Predictors in Generalized Models

  1. From L to N: Nonlinear Predictors in Generalized Models. Heather Turner, Independent Statistical/R Consultant, owing much to David Firth, University of Warwick. GLM 40 Years On, 2012.
  2. From L to N
     In a GLM we have
     $$g(\mu) = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$$
     and $\mathrm{Var}(Y) = \phi V(\mu)$.
     A generalized nonlinear model (GNM) is the same as a GLM except that we have
     $$g(\mu) = \eta(x; \beta)$$
     where $\eta(x; \beta)$ is nonlinear in the parameters $\beta$.
  3. Motivation
     GNMs may be thought of as:
     • an extension of nonlinear least squares, using a nonlinear function of a continuous variable to model a non-Gaussian response
     • an extension of GLMs, using nonlinear functions of parameters to produce a more parsimonious and interpretable model.
  4. Example: Mental Health Status
     The following contingency table cross-classifies a sample of 1660 residents of Manhattan by child's mental impairment and parents' socioeconomic status (Agresti, 2002).
     ##    MHS
     ## SES well mild moderate impaired
     ##   A   64   94       58       46
     ##   B   57   94       54       40
     ##   C   57  105       65       60
     ##   D   72  141       77       94
     ##   E   36   97       54       78
     ##   F   21   71       54       71
  5. Independence
     A simple analysis of these data might be to test for independence of MHS and SES using a chi-squared test. This is equivalent to testing the goodness-of-fit of the independence model
     $$\log(\mu_{rc}) = \alpha_r + \beta_c$$
     Such a test compares the independence model to the saturated model
     $$\log(\mu_{rc}) = \alpha_r + \beta_c + \gamma_{rc}$$
     which may be over-complex.
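The goodness-of-fit comparison for the independence model can be sketched numerically. The following is a minimal illustration in plain Python (not the R code used for the talk), with the counts taken from the mental health table; the variable names are illustrative. Under independence the fitted means are the products of the row and column totals divided by the grand total, and the deviance should agree, up to rounding, with the residual deviance of 47.4 on 15 degrees of freedom reported for Model 1 in the analysis of deviance table on the next slide.

```python
import math

# Counts from the mental health table: rows are SES A-F,
# columns are MHS well, mild, moderate, impaired.
counts = [
    [64, 94, 58, 46],
    [57, 94, 54, 40],
    [57, 105, 65, 60],
    [72, 141, 77, 94],
    [36, 97, 54, 78],
    [21, 71, 54, 71],
]

row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]
total = sum(row_totals)

# Fitted means under independence: (row total * column total) / grand total.
fitted = [[r * c / total for c in col_totals] for r in row_totals]

pairs = [(n, m) for n_row, m_row in zip(counts, fitted)
         for n, m in zip(n_row, m_row)]

# Deviance (likelihood-ratio statistic) relative to the saturated model.
deviance = 2 * sum(n * math.log(n / m) for n, m in pairs)

# Pearson chi-squared statistic for the same comparison.
pearson = sum((n - m) ** 2 / m for n, m in pairs)

# Residual degrees of freedom: (rows - 1) * (columns - 1).
df = (len(row_totals) - 1) * (len(col_totals) - 1)

print(total, df, round(deviance, 1), round(pearson, 1))
```

Both statistics are referred to a chi-squared distribution on `df` degrees of freedom.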
  6. Row-column Association
     One intermediate model is the Row-Column association model:
     $$\log(\mu_{rc}) = \alpha_r + \beta_c + \phi_r \psi_c$$
     (Goodman, 1979), an example of a multiplicative interaction model. For the Mental Health data:
     ## Analysis of Deviance Table
     ##
     ## Model 1: Freq ~ SES + MHS
     ## Model 2: Freq ~ SES + MHS + Mult(SES, MHS)
     ## Model 3: Freq ~ SES + MHS + SES:MHS
     ##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
     ## 1        15       47.4
     ## 2         8        3.6  7     43.8  2.3e-07
     ## 3         0        0.0  8      3.6     0.89
  7. Parameterisation
     The independence model was defined earlier in an over-parameterised form:
     $$\log(\mu_{rc}) = \alpha_r + \beta_c = (\alpha_r + 1) + (\beta_c - 1) = \alpha_r^* + \beta_c^*$$
     Identifiability constraints may be imposed to fix a one-to-one mapping between parameter values and distributions, to enable interpretation of parameters.
  8. Standard Implementation
     The standard approach of all major statistical software packages is to apply the identifiability constraints in the construction of the model
     $$g(\mu) = X\beta$$
     so that $\mathrm{rank}(X)$ is equal to the number of parameters $p$. Then the inverse in the score equations of the IWLS algorithm
     $$\beta^{(r+1)} = \left(X^T W^{(r)} X\right)^{-1} X^T W^{(r)} z^{(r)}$$
     exists.
  9. Alternative Implementation
     An alternative is to keep models in their over-parameterised form, so that $\mathrm{rank}(X) < p$, and use the generalised inverse in the IWLS updates:
     $$\beta^{(r+1)} = \left(X^T W^{(r)} X\right)^{-} X^T W^{(r)} z^{(r)}$$
     This approach is more useful for GNMs, since in this case it is much harder to define standard rules for specifying identifiability constraints. Rather, identifiability constraints can be applied post-fitting for inference and interpretation.
  10. Estimation of GNMs
     GNMs present further technical difficulties compared with GLMs:
     • automatic generation of starting values is hard
     • the likelihood may have multiple optima.
     The default approach used in the gnm package for R is as follows:
     1. generate starting values randomly for the nonlinear parameters, and from a GLM fit for the linear parameters
     2. use a one-parameter-at-a-time Newton method to update the nonlinear parameters
     3. use the generalized IWLS to update all parameters.
     Consequently, the parameterisation returned is random.
  11. Parameterisation of RC Model
     The RC model is invariant to changes in scale or location of the interaction parameters:
     $$\log(\mu_{rc}) = \alpha_r + \beta_c + \phi_r \psi_c = \alpha_r + \beta_c + (2\phi_r)(0.5\psi_c) = \alpha_r + (\beta_c - \psi_c) + (\phi_r + 1)\psi_c$$
     One way to constrain these parameters is as follows:
     $$\phi_r^* = \frac{\phi_r - \frac{\sum_r w_r \phi_r}{\sum_r w_r}}{\sqrt{\sum_r w_r \left(\phi_r - \frac{\sum_r w_r \phi_r}{\sum_r w_r}\right)^2}}$$
     where $w_r$ is the row probability, say, so that
     $$\sum_r w_r \phi_r^* = 0, \qquad \sum_r w_r (\phi_r^*)^2 = 1$$
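The invariance and the constraint can be checked numerically. The sketch below uses made-up parameter values for a hypothetical 3 x 2 table (they are not estimates from the talk) and assumes the constraint is weighted centring followed by scaling to unit weighted sum of squares, which is the form that gives the two stated identities. The function names are illustrative.

```python
import math

# Hypothetical parameter values for a 3 x 2 table (illustration only).
alpha = [0.5, -0.2, 0.1]
beta = [1.0, 0.3]
phi = [0.8, -0.4, 1.5]
psi = [0.6, -1.1]


def predictor(alpha, beta, phi, psi):
    """Linear predictor alpha_r + beta_c + phi_r * psi_c for every cell."""
    return [[a + b + f * s for b, s in zip(beta, psi)]
            for a, f in zip(alpha, phi)]


eta = predictor(alpha, beta, phi, psi)

# Change of scale: (2 phi_r)(0.5 psi_c).
eta_scaled = predictor(alpha, beta,
                       [2 * f for f in phi], [0.5 * s for s in psi])

# Change of location: (phi_r + 1) psi_c, compensated by beta_c - psi_c.
eta_shifted = predictor(alpha, [b - s for b, s in zip(beta, psi)],
                        [f + 1 for f in phi], psi)


def standardise(scores, weights):
    """Centre by the weighted mean, then scale to unit weighted sum of squares."""
    mean = sum(w * s for w, s in zip(weights, scores)) / sum(weights)
    centred = [s - mean for s in scores]
    scale = math.sqrt(sum(w * c ** 2 for w, c in zip(weights, centred)))
    return [c / scale for c in centred]


w = [0.2, 0.5, 0.3]  # assumed row probabilities
phi_star = standardise(phi, w)

print(sum(wi * f for wi, f in zip(w, phi_star)))        # close to 0
print(sum(wi * f ** 2 for wi, f in zip(w, phi_star)))   # close to 1
```

All three parameter sets give the same linear predictor, which is why the scores need constraints before they can be interpreted.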
  12. Row and Column Scores
     The row and column scores for the RC model are
     ##                   Estimate Std. Error
     ## Mult(., MHS).SESA     1.11       0.30
     ## Mult(., MHS).SESB     1.12       0.31
     ## Mult(., MHS).SESC     0.37       0.32
     ## Mult(., MHS).SESD    -0.03       0.27
     ## Mult(., MHS).SESE    -1.01       0.31
     ## Mult(., MHS).SESF    -1.82       0.28

     ##                          Estimate Std. Error
     ## Mult(SES, .).MHSwell         1.68       0.19
     ## Mult(SES, .).MHSmild         0.14       0.20
     ## Mult(SES, .).MHSmoderate    -0.14       0.28
     ## Mult(SES, .).MHSimpaired    -1.41       0.17
     As one might expect, the scores are ordered for both factors, suggesting the model for the dependence structure might be simplified further.
  13. Biplot Model
     Biplots are graphical displays of data arrays which represent the objects that index all dimensions of the array on the same plot. So for a two-way table, a biplot represents both the rows and columns at the same time. The biplot is constructed from a rank-2 representation of the data. Here we consider the generalized bilinear model
     $$g(\mu_{ij}) = \alpha_{1i} \beta_{1j} + \alpha_{2i} \beta_{2j}$$
  14. Example: Leaf Blotch Data
     The proportion of leaf area affected by leaf blotch was recorded for 10 varieties of barley grown at nine sites (Gabriel, 1998). Thus the response is a continuous variable in $[0, 1]$. Wedderburn (1974) suggested modelling these data using a logit link and a variance proportional to the square of that of the binomial, i.e. $V(\mu) = \mu^2 (1 - \mu)^2$, which gives a quasi-likelihood model.
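A small sketch of the ingredients of this quasi-likelihood model, in plain Python with illustrative function names. It assumes the usual IWLS working weight $1 / \{V(\mu)\, g'(\mu)^2\}$ with unit prior weights; under the logit link, $g'(\mu) = 1/\{\mu(1-\mu)\}$, so with Wedderburn's variance function the working weights are constant.

```python
import math


def logit(mu):
    """Logit link g(mu) = log(mu / (1 - mu))."""
    return math.log(mu / (1.0 - mu))


def inv_logit(eta):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))


def logit_derivative(mu):
    """Derivative of the logit link, g'(mu) = 1 / (mu (1 - mu))."""
    return 1.0 / (mu * (1.0 - mu))


def wedderburn_variance(mu):
    """Variance function V(mu) = mu^2 (1 - mu)^2."""
    return mu ** 2 * (1.0 - mu) ** 2


def iwls_weight(mu):
    """Working weight 1 / (V(mu) g'(mu)^2), assuming unit prior weights."""
    return 1.0 / (wedderburn_variance(mu) * logit_derivative(mu) ** 2)


weights = [iwls_weight(mu) for mu in (0.01, 0.2, 0.5, 0.9)]
print(weights)  # every entry equals 1 up to rounding error
```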
  15. Geometrical Interpretation
     Given the bilinear model
     $$\mathrm{logit}(\mu_{ij}) = \alpha_{1i} \beta_{1j} + \alpha_{2i} \beta_{2j}$$
     the effect of site $i$ can be represented by the point $(\alpha_{1i}, \alpha_{2i})$ in the space spanned by the linearly independent basis vectors
     $$a_1 = (\alpha_{11}, \alpha_{12}, \dots, \alpha_{19})^T, \qquad a_2 = (\alpha_{21}, \alpha_{22}, \dots, \alpha_{29})^T$$
  16. Visualising Sites and Varieties
     Thus we can represent the sites and varieties separately as follows.
     [Figure: two panels, "Site Effects" and "Variety Effects", each plotting Component 2 against Component 1.]
  17. Obtaining Orthogonal Bases
     Given the SVD of the matrix of predictors
     $$\eta = U D V^T$$
     matrices of orthogonal basis vectors on the same scale are given by
     $$A = U D^{1/2}, \qquad B = D^{1/2} V^T$$
     The model stays the same, but the parameterisation changes.
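A short check of why this works, assuming the usual SVD convention that $U$ and $V$ have orthonormal columns and $D$ is diagonal with non-negative entries:

```latex
\begin{align*}
AB &= U D^{1/2} D^{1/2} V^T = U D V^T = \eta, \\
A^T A &= D^{1/2} U^T U D^{1/2} = D, \\
B B^T &= D^{1/2} V^T V D^{1/2} = D.
\end{align*}
```

The first line shows that the predictor is unchanged. The other two show that the columns of $A$ and the rows of $B$ are orthogonal and share the same squared lengths, the diagonal entries of $D$, which is what puts sites and varieties on a common scale.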
  18. Biplot
     [Figure: two biplots for the barley data (sites: A-I; varieties: 1-9, X), each plotting Component 2 against Component 1, with an h-axis and a v-axis marked.]
  19. Model Refinement
     The biplot suggests that the sites could be represented by points along a line, with co-ordinates
     $$(\gamma_i, \delta_0)$$
     and the varieties by points on two lines perpendicular to the site line:
     $$(\nu_0 + \nu_1 I(j \in \{2, 3, 6\}), \omega_j)$$
     This corresponds to the following simplification of the bilinear model:
     $$\alpha_{1i} \beta_{1j} + \alpha_{2i} \beta_{2j} \approx \gamma_i (\nu_0 + \nu_1 I(j \in \{2, 3, 6\})) + \delta_0 \omega_j$$
     or equivalently
     $$\gamma_i (\nu_0 + \nu_1 I(j \in \{2, 3, 6\})) + \omega_j$$
  20. Double Additive Model
     Gabriel (1998) described the model derived from the biplot as the double additive model. An analysis of deviance confirms that this model is adequate for the leaf blotch data.
     ## Analysis of Deviance Table
     ##
     ## Model 1: y ~ 0 + Mult(site, variety, inst = 1) + Mult(site,
     ##     variety, inst = 2)
     ## Model 2: y ~ variety + Mult(site, variety.binary) - 1
     ##   Resid. Df Resid. Dev  Df Deviance Pr(>Chi)
     ## 1        56         41
     ## 2        71         51 -15    -9.94      0.8
  21. Stereotype Model
     The stereotype model (Anderson, 1984) is suitable for ordered categorical data. It is a special case of the multinomial logistic model
     $$\mathrm{pr}(y_i = c \mid x_i) = \frac{\exp(\beta_{0c} + \beta_c^T x_i)}{\sum_r \exp(\beta_{0r} + \beta_r^T x_i)}$$
     in which only the scale of the relationship with the covariates changes between categories:
     $$\mathrm{pr}(y_i = c \mid x_i) = \frac{\exp(\beta_{0c} + \gamma_c \beta^T x_i)}{\sum_r \exp(\beta_{0r} + \gamma_r \beta^T x_i)}$$
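To make the structure concrete, here is a minimal sketch of the stereotype probabilities in plain Python. The parameter values are hypothetical (three categories, two covariates) and the function name is illustrative. The point to notice is that every category shares the single covariate score $\beta^T x_i$, scaled by its own multiplier $\gamma_c$.

```python
import math


def stereotype_probs(x, beta0, gamma, beta):
    """Category probabilities for covariate vector x under the stereotype model."""
    score = sum(b * xi for b, xi in zip(beta, x))           # beta^T x
    linear = [b0 + g * score for b0, g in zip(beta0, gamma)]
    top = max(linear)                                       # for numerical stability
    weights = [math.exp(v - top) for v in linear]
    total = sum(weights)
    return [wt / total for wt in weights]


# Hypothetical parameter values (illustration only).
beta0 = [0.0, 0.5, -0.3]   # category intercepts
gamma = [0.0, 0.4, 1.0]    # category-specific multipliers
beta = [1.2, -0.7]         # common covariate coefficients

probs = stereotype_probs([1.0, 2.0], beta0, gamma, beta)

# Log odds of category c against category 0:
# (beta0_c - beta0_0) + (gamma_c - gamma_0) * beta^T x.
log_odds_2_vs_0 = math.log(probs[2] / probs[0])
print(probs, log_odds_2_vs_0)
```

With these values $\beta^T x = -0.2$, so the log odds of the third category against the first is $-0.3 + 1.0 \times (-0.2) = -0.5$.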
  22. Poisson Trick
     The stereotype model can be fitted as a GNM by re-expressing the categorical data as category counts $Y_i = (Y_{i1}, \dots, Y_{ik})$. Assuming a Poisson distribution for $Y_{ic}$, the joint distribution of $Y_i$ is $\mathrm{Multinomial}(N_i, p_{i1}, \dots, p_{ik})$ conditional on the total count $N_i$. The expected counts are then $\mu_{ic} = N_i p_{ic}$ and the parameters of the stereotype model can be estimated through fitting
     $$\log \mu_{ic} = \log(N_i) + \log(p_{ic}) = \alpha_i + \beta_{0c} + \gamma_c \sum_r \beta_r x_{ir}$$
     where the "nuisance" parameters $\alpha_i$ ensure that the multinomial denominators are reproduced exactly, as required.
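The role of the nuisance parameter can be illustrated directly. In this sketch (plain Python, hypothetical parameter values, illustrative names) $\alpha_i$ is set to $\log N_i$ minus the log of the multinomial denominator, which is the value that makes the expected Poisson counts for individual $i$ sum to $N_i$.

```python
import math

# Hypothetical stereotype-model parameters (illustration only).
beta0 = [0.0, 0.5, -0.3]
gamma = [0.0, 0.4, 1.0]
beta = [1.2, -0.7]


def category_terms(x):
    """beta0_c + gamma_c * beta^T x for each category c."""
    score = sum(b * xi for b, xi in zip(beta, x))
    return [b0 + g * score for b0, g in zip(beta0, gamma)]


def nuisance_alpha(x, n_i):
    """alpha_i = log(N_i) - log(sum over categories of exp(category term))."""
    return math.log(n_i) - math.log(sum(math.exp(t) for t in category_terms(x)))


def expected_counts(x, n_i):
    """Poisson means mu_ic = exp(alpha_i + beta0_c + gamma_c * beta^T x)."""
    alpha_i = nuisance_alpha(x, n_i)
    return [math.exp(alpha_i + t) for t in category_terms(x)]


x_i, n_i = [1.0, 2.0], 1   # one individual observed once
mu_i = expected_counts(x_i, n_i)
print(mu_i, sum(mu_i))     # the means sum to N_i
```

Dividing `mu_i` by `n_i` recovers the stereotype probabilities, so the Poisson and multinomial formulations describe the same fitted values.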
  23. Augmented Least Squares
     A disadvantage of using the Poisson trick is that the number of nuisance parameters can be large, making computation slow. The algorithm can be adapted using augmented least squares. For an ordinary least squares model,
     $$\left((y|X)^T (y|X)\right)^{-1} = \begin{pmatrix} y^T y & y^T X \\ X^T y & X^T X \end{pmatrix}^{-1} = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}$$
     where $A_{11}$, $A_{12}$ and $A_{22}$ are functions of $y^T y$, $X^T y$ and $X^T X$. Then it can be shown that
     $$\hat{\beta} = (X^T X)^{-1} X^T y = -\frac{A_{21}}{A_{11}}$$
     requiring only the first row (column) of the inverse to be found.
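The identity can be verified on a toy regression. The sketch below is plain Python with made-up data and a small Gauss-Jordan inverse written for the purpose (a production implementation would use a linear algebra library). It inverts the augmented cross-product matrix and compares $-A_{21}/A_{11}$ with the ordinary normal-equations solution.

```python
def crossprod(rows):
    """Z^T Z for a matrix stored as a list of rows."""
    p = len(rows[0])
    return [[sum(r[j] * r[k] for r in rows) for k in range(p)] for j in range(p)]


def invert(matrix):
    """Matrix inverse by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * pv for v, pv in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


# Made-up data: response y, design matrix X with an intercept and one covariate.
y = [1.1, 2.9, 5.2, 6.8, 9.1]
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]

# Augmented route: invert (y|X)^T (y|X) and read off the first column.
Z = [[yi] + xi for yi, xi in zip(y, X)]
A = invert(crossprod(Z))
beta_augmented = [-A[j][0] / A[0][0] for j in range(1, len(A))]

# Direct route: (X^T X)^{-1} X^T y.
XtX_inv = invert(crossprod(X))
Xty = [sum(xi[j] * yi for xi, yi in zip(X, y)) for j in range(len(X[0]))]
beta_direct = [sum(XtX_inv[j][k] * Xty[k] for k in range(len(Xty)))
               for j in range(len(Xty))]

print(beta_augmented, beta_direct)
```

As a by-product, $A_{11}$ is the reciprocal of the residual sum of squares, so the augmented matrix is only invertible when the fit is not exact.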
  24. Application to Nuisance Parameters I
     The same approach can be applied to the IWLS algorithm, letting
     $$\tilde{X} = W^{1/2} (z|X)$$
     Now let
     $$\tilde{X} = (U|V)$$
     where $V$ is the part of the design matrix corresponding to the nuisance factor.
     $U$ is an $nk \times p$ matrix, where $n$ is the number of nuisance parameters, $k$ is the number of categories and $p$ is the number of model parameters, typically with $n \gg p$.
     $V$ is an $nk \times n$ matrix of dummy variables identifying each individual.
  25. Application to Nuisance Parameters II
     Then
     $$(\tilde{X}^T \tilde{X})^{-} = \begin{pmatrix} U^T U & U^T V \\ V^T U & V^T V \end{pmatrix}^{-} = \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}$$
     Again, only the first row (column) of this generalised inverse is required to estimate $\hat{\beta}$, so we are only interested in $B_{11}$ and $B_{12}$.
     $$B_{11} = \left(U^T U - U^T V (V^T V)^{-1} V^T U\right)^{-}$$
     $$B_{12} = -(V^T V)^{-1} V^T U B_{11}$$
  26. Elimination of the Nuisance Factor
     $U^T U$ is $p \times p$, therefore not expensive to compute.
     $V^T V$ and $V^T U$ can be computed without constructing the large $nk \times n$ matrix $V$, due to the structure of $V$:
     • $V^T V$ is diagonal and the non-zero elements can be computed directly
     • $V^T U$ is equivalent to aggregating the rows of $U$ by levels of the nuisance factor.
     Thus we only need to construct the $U$ matrix, saving memory and reducing the computational burden.
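The two shortcuts can be checked on a tiny example. The sketch below is plain Python with made-up values ($n = 3$ individuals, $k = 2$ categories, $p = 2$ columns in $U$) and, as a simplifying assumption, treats $V$ as a plain 0/1 dummy matrix, ignoring the IWLS weights. The explicit $V$ is built only to confirm that the shortcut gives the same answer.

```python
# Nuisance factor level (individual) for each of the nk = 6 rows.
ids = [0, 0, 1, 1, 2, 2]
n = 3

# Made-up U matrix with p = 2 columns, one row per individual-category pair.
U = [[1.0, 0.5], [0.0, 2.0], [1.0, 1.5], [0.0, -1.0], [1.0, 0.0], [0.0, 3.0]]
p = len(U[0])


def t_matmul(A, B):
    """A^T B for matrices stored as lists of rows."""
    return [[sum(A[i][j] * B[i][k] for i in range(len(A)))
             for k in range(len(B[0]))] for j in range(len(A[0]))]


# Explicit route (for checking only): build the nk x n dummy matrix V.
V = [[1.0 if level == j else 0.0 for j in range(n)] for level in ids]
VtV_explicit = t_matmul(V, V)
VtU_explicit = t_matmul(V, U)

# Shortcut: never build V.
# The diagonal of V^T V is the number of rows at each level,
# and V^T U is the sum of the rows of U within each level.
VtV_diagonal = [0.0] * n
VtU = [[0.0] * p for _ in range(n)]
for level, row in zip(ids, U):
    VtV_diagonal[level] += 1.0
    for k, value in enumerate(row):
        VtU[level][k] += value

print(VtV_diagonal)
print(VtU)
```

The shortcut needs storage proportional to $n \times p$ rather than $nk \times n$, which is where the memory saving comes from when $n$ is large.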
  27. Example: Back Pain Data
     For 101 patients, 3 prognostic variables were recorded at baseline, then after 3 weeks the level of back pain was recorded (Anderson, 1984). These data were converted to counts, for example for the first record:
     ##     x1 x2 x3 pain                 count id
     ## 1    1  1  1 worse                    0  1
     ## 1.1  1  1  1 same                     1  1
     ## 1.2  1  1  1 slight.improvement       0  1
     ## 1.3  1  1  1 moderate.improvement     0  1
     ## 1.4  1  1  1 marked.improvement       0  1
     ## 1.5  1  1  1 complete.relief          0  1
  28. Back Pain Model
     In this example, the expanded data set is not that long (606 records) and the total number of parameters is only 115 (9 nonlinear), so the model does not take long to fit (< 1s!). However, eliminating the linear parameters reduces the computation time by almost two-thirds, showing the potential of this technique.
     Compare the stereotype model to the multinomial logistic model:
     ## Analysis of Deviance Table
     ##
     ## Model 1: count ~ pain + Mult(pain, x1 + x2 + x3) - 1
     ## Model 2: count ~ pain + pain:x1 + pain:x2 + pain:x3 - 1
     ##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
     ## 1       493        303
     ## 2       485        299  8     4.08     0.85
  29. Identifiability Constraints
     In order to make the category-specific multipliers identifiable, we must constrain both the location and scale. A simple way to do this is to set the first multiplier to zero and fix the coefficient of the first covariate to one.
     ##                      estimate    SE quasiSE quasiVar
     ## worse                   0.000 0.000  1.7797  3.16745
     ## same                   -3.710 1.826  0.4281  0.18330
     ## slight.improvement     -3.510 1.792  0.4025  0.16198
     ## moderate.improvement   -2.633 1.669  0.5519  0.30454
     ## marked.improvement     -4.612 1.895  0.3133  0.09817
     ## complete.relief        -5.372 2.000  0.4920  0.24202
     Quasi standard errors (Firth and de Menezes, 2004) are invariant to the choice of reference class.
  30. Comparison Intervals
     [Figure: intervals based on quasi standard errors; the estimate is plotted against pain category (worse, same, slight improvement, moderate improvement, marked improvement, complete relief).]
  31. Summary
     Moving from GLMs to GNMs presents some technical difficulties, but provides a framework that covers several useful models. Further examples can be found in the help files and manual accompanying the gnm package, which is available on CRAN.
  32. References
     • Agresti, A. (2002). Categorical Data Analysis (2nd ed.). New York: Wiley.
     • Anderson, J. A. (1984). Regression and Ordered Categorical Variables. J. R. Statist. Soc. B 46 (1), 1–30.
     • Firth, D. and R. X. de Menezes (2004). Quasi-variances. Biometrika 91, 65–80.
     • Gabriel, K. R. (1998). Generalised bilinear regression. Biometrika 85, 689–700.
     • Goodman, L. A. (1979). Simple models for the analysis of association in cross-classifications having ordered categories. J. Amer. Statist. Assoc. 74, 537–552.
     • Wedderburn, R. W. M. (1974). Quasi-likelihood Functions, Generalized Linear Models, and the Gauss-Newton Method. Biometrika 61, 439–447.
