
Modeling Big Count Data: An IRLS Framework for COM-Poisson Regression and GAM

Developed a two-stage IRLS algorithm for Conway-Maxwell-Poisson regression. Further, we extended this approach to implement an additive model.

Published in: Data & Analytics

  1. Modeling Big Count Data: An IRLS Framework for COM-Poisson Regression and GAM. Suneel Chatla, Galit Shmueli. November 12, 2016. Institute of Service Science, National Tsing Hua University, Taiwan (R.O.C.)
  2. Table of contents: 1. Speed Dating Experiment: Count Data Models 2. Motivation 3. An IRLS Framework 4. Simulation Study: Comparison of IRLS with MLE 5. A CMP Generalized Additive Model 6. Results & Conclusions
  3. Speed Dating Experiment: Count Data Models
  4. Speed dating experiment. Fisman et al. (2006) conducted a speed dating experiment to evaluate gender differences in mate selection (data: https://www.kaggle.com/annavictoria/speed-dating-experiment). Total sessions: 14. Decision: 1 or 0. Attractiveness: 1-10. Intelligence: 1-10. Ambition: 1-10. ... Control variables.
  5. Outcome/count variables. Matches: when both persons decide Yes. Tot.Yes: total number of Yes decisions for each subject in a particular session.
  6. Summary Statistics

| Statistic | N | Mean | St. Dev. | Min | Max |
|---|---|---|---|---|---|
| matches | 531 | 2.524 | 2.304 | 0 | 14 |
| Tot.Yes | 531 | 6.433 | 4.361 | 0 | 21 |
| Tot.partner | 531 | 15.311 | 4.967 | 5 | 22 |
| age | 531 | 26.303 | 3.735 | 18 | 55 |
| perc.samerace | 531 | 0.391 | 0.242 | 0.000 | 0.833 |
| avg.intcor | 531 | 0.190 | 0.167 | -0.298 | 0.569 |
| attr | 531 | 6.195 | 1.122 | 1.818 | 10.000 |
| sinc | 531 | 7.205 | 1.108 | 2.773 | 10.000 |
| intel | 531 | 7.381 | 0.988 | 3.409 | 10.000 |
| func | 531 | 6.438 | 1.103 | 2.682 | 10.000 |
| amb | 531 | 6.812 | 1.133 | 3.091 | 10.000 |
| shar | 531 | 5.511 | 1.333 | 1.409 | 10.000 |
| like | 531 | 6.157 | 1.072 | 1.682 | 10.000 |
| prob | 531 | 5.234 | 1.525 | 0.778 | 10.000 |
| mean.agep | 531 | 26.314 | 1.674 | 20.444 | 31.667 |
| attr_o | 531 | 6.200 | 1.186 | 2.333 | 8.688 |
| sinc_o | 531 | 7.224 | 0.690 | 4.167 | 9.000 |
| intel_o | 531 | 7.410 | 0.614 | 4.875 | 9.150 |
| fun_o | 531 | 6.438 | 1.015 | 2.625 | 8.615 |
| amb_o | 531 | 6.827 | 0.756 | 4.600 | 8.842 |
| shar_o | 531 | 5.498 | 0.942 | 1.375 | 7.700 |
| like_o | 531 | 6.161 | 0.873 | 2.333 | 8.300 |
| prob_o | 531 | 5.256 | 0.736 | 3.200 | 7.200 |
| Tot.part.Yes | 531 | 6.420 | 4.128 | 0 | 20 |
  7. Tools: Poisson Regression; Negative Binomial Regression; Conway-Maxwell-Poisson (CMP) Regression.
  8. The CMP distribution. From Shmueli et al. (2005), $Y \sim \mathrm{CMP}(\lambda, \nu)$ implies
$$P(Y = y) = \frac{\lambda^y}{(y!)^\nu \, Z(\lambda, \nu)}, \quad y = 0, 1, 2, \ldots$$
$$Z(\lambda, \nu) = \sum_{s=0}^{\infty} \frac{\lambda^s}{(s!)^\nu}$$
for $\lambda > 0$, $\nu \geq 0$. The CMP distribution includes three well-known distributions as special cases: Poisson ($\nu = 1$); geometric ($\nu = 0$, $\lambda < 1$); Bernoulli ($\nu \to \infty$, with probability $\frac{\lambda}{1+\lambda}$).
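As a numerical illustration (not from the deck), the pmf and the truncated normalizing constant can be computed as follows; `cmp_z` and `cmp_pmf` are hypothetical names, and the truncation rule is a simple sketch rather than the paper's exact scheme:

```python
import math

def cmp_z(lam, nu, tol=1e-10, max_terms=1000):
    """Normalizing constant Z(lam, nu) = sum_{s>=0} lam^s / (s!)^nu, truncated
    once a term falls below tol relative to the running sum."""
    z, term, s = 0.0, 1.0, 0
    while s < max_terms:
        z += term
        s += 1
        term *= lam / s ** nu  # ratio of term s to term s-1
        if term < tol * z:
            break
    return z + term

def cmp_pmf(y, lam, nu):
    """P(Y = y) for Y ~ CMP(lam, nu), computed on the log scale for stability."""
    logp = y * math.log(lam) - nu * math.lgamma(y + 1) - math.log(cmp_z(lam, nu))
    return math.exp(logp)
```

With `nu = 1` this reduces to the Poisson pmf, which gives a quick sanity check of the special-case claim above.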
  9. CMP distribution for different (λ, ν) combinations. [Figure: pmf plots for λ ∈ {2, 8, 15} crossed with ν ∈ {0.5, 0.75, 1, 3}.]
  10. CMP Regression. CMP regression models can be formulated as follows:
$$\log(\lambda) = X\beta \quad (1)$$
$$\log(\nu) = Z\gamma \quad (2)$$
Maximizing the log-likelihood with respect to the parameters $\beta$ and $\gamma$ yields the following score equations (Sellers and Shmueli, 2010):
$$U = \frac{\partial \log L}{\partial \beta} = X^T \left( y - E(y) \right) \quad (3)$$
$$V = \frac{\partial \log L}{\partial \gamma} = \nu Z^T \left( -\log(y!) + E(\log(y!)) \right) \quad (4)$$
  11. Motivation
  12. Exploration of Speed Dating data. [Figure: scatterplots of log(Tot.Yes) against Sincerity (Others), Intelligence (Others), Sincerity, and Fun seeking.]
  13. More flexibility? Generalized Additive Models: smoothing splines; penalized splines. Both implementations depend on the iteratively reweighted least squares (IRLS) estimation framework. At present, no IRLS framework is available for CMP!
  14. An IRLS framework
  15. Update for each iteration:
$$\mathcal{I} \begin{bmatrix} \beta \\ \gamma \end{bmatrix}^{(m)} = \mathcal{I} \begin{bmatrix} \beta \\ \gamma \end{bmatrix}^{(m-1)} + \begin{bmatrix} U \\ V \end{bmatrix}$$
where $\mathcal{I}$ is the information matrix. This implies the following equations:
$$X^T \Sigma_y X \beta^{(m)} - X^T \Sigma_{y,\log(y!)} \nu Z \gamma^{(m)} = X^T \Sigma_y X \beta^{(m-1)} - X^T \Sigma_{y,\log(y!)} \nu Z \gamma^{(m-1)} + X^T (y - E(y))$$
and
$$-\nu Z^T \Sigma_{y,\log(y!)} X \beta^{(m)} + \nu^2 Z^T \Sigma_{\log(y!)} Z \gamma^{(m)} = -\nu Z^T \Sigma_{y,\log(y!)} X \beta^{(m-1)} + \nu^2 Z^T \Sigma_{\log(y!)} Z \gamma^{(m-1)} + \nu Z^T \left( -\log(y!) + E(\log(y!)) \right)$$
  16. Holding the other parameter block fixed, the updates decouple into the two equations
$$X^T \Sigma_y X \beta^{(m)} = X^T \Sigma_y X \beta^{(m-1)} + X^T (y - E(y)) \quad (5)$$
$$\nu^2 Z^T \Sigma_{\log(y!)} Z \gamma^{(m)} = \nu^2 Z^T \Sigma_{\log(y!)} Z \gamma^{(m-1)} + \nu Z^T \left( -\log(y!) + E(\log(y!)) \right). \quad (6)$$
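A minimal sketch of the β-update in equation (5), with ν held fixed. It assumes $\Sigma_y$ is the diagonal matrix of $\mathrm{Var}(Y_i)$ (a reading of the notation, not confirmed by the deck), and computes $E(Y_i)$ and $\mathrm{Var}(Y_i)$ by simple series truncation; function names are illustrative, not from the paper's code:

```python
import numpy as np

def cmp_mean_var(lam, nu, tol=1e-12, max_terms=500):
    """E(Y) and Var(Y) for CMP(lam, nu) via truncation of the series."""
    z = m1 = m2 = 0.0
    term, s = 1.0, 0
    while s < max_terms:
        z += term
        m1 += s * term
        m2 += s * s * term
        s += 1
        term *= lam / s ** nu
        if term < tol:
            break
    mean = m1 / z
    return mean, m2 / z - mean ** 2

def irls_beta_step(X, y, beta, nu):
    """One update of eq. (5), gamma (hence nu) held fixed:
    solve X' W X beta_new = X' W X beta_old + X' (y - mu), W = diag(Var(Y_i))."""
    lam = np.exp(X @ beta)
    mv = np.array([cmp_mean_var(l, nu) for l in lam])
    mu, w = mv[:, 0], mv[:, 1]
    A = X.T @ (w[:, None] * X)
    rhs = A @ beta + X.T @ (y - mu)
    return np.linalg.solve(A, rhs)
```

For ν = 1 this coincides with the familiar Fisher-scoring step for Poisson regression with a log link, since mean and variance both equal λ.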
  17. Algorithm: https://arxiv.org/abs/1610.08244
  18. Practical issues. Initial values: $\lambda = (y + 0.1)^\nu$ and $\nu = 0.2$. Calculation of cumulants: bounding error of $10^{-8}$ or $10^{-10}$; asymptotic expressions. Stopping criterion: based on $-2 \sum l(y_i; \hat{\lambda}_i, \hat{\nu}_i)$. Step size: step halving.
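The cumulant calculation with a bounding error can be sketched as below, here for $E(\log Y!)$, the quantity appearing in the γ-update. This is an illustrative truncation-based version with a hypothetical name, not the paper's exact scheme:

```python
import math

def cmp_elog_fact(lam, nu, tol=1e-10, max_terms=1000):
    """E(log Y!) for Y ~ CMP(lam, nu): accumulate lgamma(s+1)-weighted series
    terms and stop once the next term drops below the error bound tol."""
    z = acc = 0.0
    term, s = 1.0, 0
    while s < max_terms:
        z += term                          # running normalizing constant
        acc += math.lgamma(s + 1) * term   # running sum for E(log Y!)
        s += 1
        term *= lam / s ** nu
        if term < tol:
            break
    return acc / z
```

The same loop pattern extends to the other cumulants (means, variances, and the covariance of $Y$ and $\log Y!$) needed by the IRLS weights.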
  19. Simulation Study: Comparison of IRLS with MLE
  20. Study design. We compare our IRLS algorithm with the existing implementation, which maximizes the likelihood directly (through `optim` in R). (a) Set sample size $n = 100$. (b) Generate $x_1 \sim U(0, 1)$ and $x_2 \sim N(0, 1)$. (c) Calculate $x_3 = 0.2 x_1 + U(0, 0.3)$ and $x_4 = 0.3 x_2 + N(0, 0.1)$ (to create correlated variables). (d) Generate $y \sim \mathrm{CMP}(\lambda, \nu)$ with $\log(\lambda) = 0.05 + 0.5 x_1 - 0.5 x_2 + 0.25 x_3 - 0.25 x_4$ and $\nu \in \{0.5, 2, 5\}$.
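The design above can be sketched as follows. `rcmp` (inverse-CDF sampling from the truncated pmf) is an illustrative stand-in; the study does not say which CMP sampler it used:

```python
import math
import random

def rcmp(lam, nu, rng, tol=1e-12):
    """Draw from CMP(lam, nu) by inverting the truncated pmf (unnormalized)."""
    terms, term, s = [1.0], 1.0, 0
    while term > tol and s < 5000:
        s += 1
        term *= lam / s ** nu
        terms.append(term)
    u = rng.random() * sum(terms)
    cum = 0.0
    for y, t in enumerate(terms):
        cum += t
        if u <= cum:
            return y
    return len(terms) - 1

rng = random.Random(2016)
n = 100
x1 = [rng.random() for _ in range(n)]             # U(0, 1)
x2 = [rng.gauss(0, 1) for _ in range(n)]          # N(0, 1)
x3 = [0.2 * a + rng.uniform(0, 0.3) for a in x1]  # correlated with x1
x4 = [0.3 * b + rng.gauss(0, 0.1) for b in x2]    # correlated with x2
nu = 0.5                                          # study also uses 2 and 5
y = [rcmp(math.exp(0.05 + 0.5*a - 0.5*b + 0.25*c - 0.25*d), nu, rng)
     for a, b, c, d in zip(x1, x2, x3, x4)]
```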
  21. Results. [Figure: boxplots of IRLS and MLE estimates of the coefficients of x1-x4 and log(ν), for ν = 0.5, 2, 5.]
  22. A CMP Generalized Additive Model
  23. Additive Model
$$\log(\lambda) = \alpha + \sum_{j=1}^{p} f_j(X_j)$$
$$\log(\nu) = Z\gamma$$
where $f_j$ ($j = 1, 2, \ldots, p$) are smooth functions of the $p$ predictors.
  24. Backfitting. Based on Hastie and Tibshirani (1990) and Wood (2006), the algorithm is as follows:
1. Initialize: $f_j = f_j^{(0)}$, $j = 1, \ldots, p$.
2. Cycle over $j = 1, \ldots, p, 1, \ldots, p, \ldots$:
$$f_j = S_j\!\left(y - \sum_{k \neq j} f_k \,\Big|\, x_j\right)$$
3. Continue step 2 until the individual functions stop changing.
This adds one more nested loop inside the IRLS framework!
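The backfitting cycle can be sketched as follows, with a plain least-squares line standing in for the spline smoothers $S_j$ (illustrative only; names are hypothetical, and the deck's implementation uses proper splines):

```python
import numpy as np

def linear_smoother(x, r):
    """Least-squares line fit: the simplest stand-in for a smoother S_j."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, r, rcond=None)
    return X @ coef

def backfit(y, xs, smoother=linear_smoother, tol=1e-8, max_iter=200):
    """Backfitting: cycle over predictors, smoothing the partial residuals."""
    n, p = len(y), len(xs)
    alpha = float(np.mean(y))
    f = [np.zeros(n) for _ in range(p)]
    for _ in range(max_iter):
        change = 0.0
        for j in range(p):
            partial = y - alpha - sum(f[k] for k in range(p) if k != j)
            fj = smoother(xs[j], partial)
            fj -= fj.mean()  # center each f_j for identifiability
            change = max(change, float(np.abs(fj - f[j]).max()))
            f[j] = fj
        if change < tol:
            break
    return alpha, f
```

With linear smoothers and noiseless linear data, the cycle converges to the ordinary least-squares fit, which makes the fixed point easy to verify.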
  25. Results & Conclusions
  26. Comparison of Regression models on Tot.Yes (standard errors in parentheses)

| | Poisson | Negative Binomial | CMP |
|---|---|---|---|
| (Intercept) | 0.49 (0.43) | 0.59 (0.55) | 0.14 (0.33) |
| GenderMale | 0.05 (0.04) | 0.05 (0.06) | 0.03 (0.03) |
| age | -0.01 (0.01) | -0.01 (0.01) | -0.004 (0.004) |
| Tot.partner | 0.07*** (0.00) | 0.07*** (0.01) | 0.04*** (0.003) |
| avg.intcor | -0.04 (0.11) | -0.04 (0.15) | -0.02 (0.09) |
| attr | 0.19*** (0.03) | 0.18*** (0.04) | 0.11*** (0.02) |
| sinc | -0.06 (0.03) | -0.05 (0.04) | -0.04 (0.02) |
| intel | 0.05 (0.04) | 0.06 (0.05) | 0.03 (0.03) |
| func | 0.03 (0.04) | 0.04 (0.05) | 0.02 (0.03) |
| amb | -0.12*** (0.03) | -0.13** (0.04) | -0.07** (0.02) |
| shar | 0.10*** (0.02) | 0.10*** (0.03) | 0.06*** (0.02) |
| mean.agep | -0.01 (0.01) | -0.01 (0.02) | -0.007 (0.009) |
| attr_o | -0.10*** (0.02) | -0.10*** (0.03) | -0.06*** (0.02) |
| sinc_o | 0.02 (0.04) | 0.02 (0.05) | 0.01 (0.03) |
| intel_o | 0.08 (0.05) | 0.08 (0.07) | 0.05 (0.04) |
| fun_o | -0.01 (0.03) | -0.01 (0.04) | -0.003 (0.02) |
| amb_o | -0.00 (0.04) | -0.01 (0.05) | 0.0005 (0.03) |
| shar_o | 0.02 (0.03) | 0.03 (0.04) | 0.01 (0.02) |
| ν | | | 0.53*** |
| AIC | 2844.92 | 2777.24 | 2751.7 |
| BIC | 3011.64 | 2948.23 | 2922.66 |
| Log Likelihood | -1383.46 | -1348.62 | -1335.33 |
| Deviance | 970.04 | 637.25 | |
| Num. obs. | 531 | 531 | 531 |

*** p < 0.001, ** p < 0.01, * p < 0.05
  27. Comparison of Additive Models on Tot.Yes (dependent variable: Tot.Yes)

| | CMP (Chi.Sq) | Poisson (Chi.Sq) |
|---|---|---|
| s(sinc) | 7.16 | 11.53** |
| s(func) | 7.51 | 11.40** |
| s(sinc_o) | 13.96** | 29.30*** |
| s(intel_o) | 14.06** | 13.26*** |
| ν | 0.56 | |
| AIC | 2737.03 | 2804.77 |

Note: * p < 0.1; ** p < 0.05; *** p < 0.01

It is the other person's behavior, more than our own traits, that guides whether we select him or her.
  28. Summary. The IRLS framework is far more efficient than the existing likelihood-based method and provides more flexibility. Since CMP is computationally heavier than other GLMs, some matrix computations could be parallelized in order to increase speed. The IRLS framework allows CMP to support other modeling extensions, such as the LASSO. Full paper: https://arxiv.org/abs/1610.08244. Source code: https://github.com/SuneelChatla/cmp
  29. Suggestions and Questions?
  30. References. Fisman, R., Iyengar, S. S., Kamenica, E., and Simonson, I. (2006). Gender differences in mate selection: Evidence from a speed dating experiment. The Quarterly Journal of Economics, pages 673-697. Hastie, T. J. and Tibshirani, R. J. (1990). Generalized Additive Models, volume 43. CRC Press. Sellers, K. F. and Shmueli, G. (2010). A flexible regression model for count data. Annals of Applied Statistics, 4(2):943-961. Shmueli, G., Minka, T. P., Kadane, J. B., Borle, S., and Boatwright, P. (2005). A useful distribution for fitting discrete data: Revival of the Conway-Maxwell-Poisson distribution. Journal of the Royal Statistical Society: Series C (Applied Statistics), 54(1):127-142.
  31. Wood, S. (2006). Generalized Additive Models: An Introduction with R. CRC Press.
