Diamond mixed effects models in Python

By Timothy Sweetser
PyData New York City 2017

Generalized linear mixed effects models, ubiquitous in social science research, are rarely seen in applied data science work despite their relevance and simplicity. We will discuss this class of statistical models and their usefulness in recommender systems, and present a fast, scalable Python solver for them called Diamond.

  1. 1. Diamond: Mixed Effects Models in Python Timothy Sweetser Stitch Fix http://github.com/stitchfix/diamond tsweetser@stitchfix.com @hacktuarial November 27, 2017 Timothy Sweetser (Stitch Fix) Diamond November 27, 2017 1 / 32
  2. 2. Overview: (1) context and motivation; (2) what is the mixed effects model; (3) application to recommender systems; (4) computation; (5) diamond; (6) appendix
  3. 3. context and motivation: Stitch Fix
  4. 4. what is the mixed effects model. Refresher: Linear Model. $y \sim N(X\beta, \sigma^2 I)$, where $y$ is $n \times 1$, $X$ is $n \times p$, $\beta$ is an unknown vector of length $p$, and $\sigma^2$ is an unknown, nonnegative constant.
  5. 5. what is the mixed effects model. Mixed Effects Model. $y \mid b \sim N(X\beta + Zb, \sigma^2 I)$. We have a second set of features, $Z$, which is $n \times q$; the coefficients on $Z$ are $b \sim N(0, \Sigma)$, where $\Sigma$ is $q \times q$.
  7. 7. what is the mixed effects model. Simple example of a mixed effects model. You think there is some relationship between a woman’s height and the ideal length of jeans for her: $\text{length} = \alpha + \beta \cdot \text{height} + \epsilon$. But you think the length might need to be shorter or longer, depending on the silhouette of the jeans. In other words, you want $\alpha$ to vary by silhouette.
  8. 8. what is the mixed effects model. Why might silhouette affect length ∼ height? [Figure: skinny vs. bootcut jeans]
  9. 9. what is the mixed effects model. Linear model: formula. Linear models can be expressed in formula notation, used by patsy, statsmodels, and R:
       import statsmodels.formula.api as smf
       lm = smf.ols('length ~ 1 + height', data=train_df).fit()
     In math, this means $\text{length} = X\beta + \epsilon$, with for example $X_i = [1.0, 64.0]$. $\beta$ is what we want to learn, using (customer, item) data from jeans that fit well.
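To make the link between the formula and the matrix form concrete, here is a minimal sketch that builds the design matrix for `length ~ 1 + height` by hand and solves the least-squares problem with NumPy. The data are synthetic and the coefficient values (2.0 and 0.45) are made up for illustration; they are not from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic (height, length) pairs in inches; the "true" coefficients are invented.
n = 200
height = rng.normal(64.0, 3.0, size=n)
true_alpha, true_beta = 2.0, 0.45
length = true_alpha + true_beta * height + rng.normal(0.0, 0.5, size=n)

# 'length ~ 1 + height' means: one column of ones (the intercept) and one height column.
X = np.column_stack([np.ones(n), height])

# Ordinary least squares estimate of beta = (intercept, slope).
beta_hat, *_ = np.linalg.lstsq(X, length, rcond=None)

# Prediction for a customer who is 64 inches tall: X_i = [1.0, 64.0].
x_new = np.array([1.0, 64.0])
predicted_length = x_new @ beta_hat
```

Under these assumptions, `beta_hat` should land close to the invented coefficients, and `smf.ols` on the same data would be expected to give the same estimates.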
  10. 10. what is the mixed effects model. Linear model: illustration. [Figure: linear model fit]
  11. 11. what is the mixed effects model. Mixed effects: formula. Now, allow the intercept to vary by silhouette:
       mix = smf.mixedlm('length ~ 1 + height', data=train_df,
                         re_formula='1', groups='silhouette',
                         use_sparse=True).fit()
  12. 12. what is the mixed effects model. Mixed effects: illustration. [Figure: mixed effects fit]
  13. 13. what is the mixed effects model. Mixed effects regularization. $y \mid b \sim N(X\beta + Zb, \sigma^2 I)$. Sort by silhouette: $Z = \begin{pmatrix} \mathbf{1}_{\text{bootcut}} & 0 & 0 & 0 \\ 0 & \mathbf{1}_{\text{skinny}} & 0 & 0 \\ 0 & 0 & \mathbf{1}_{\text{straight}} & 0 \\ 0 & 0 & 0 & \mathbf{1}_{\text{wide}} \end{pmatrix}$. $X$ is $n \times 2$; $Z$ is $n \times 4$.
  14. 14. what is the mixed effects model. Matrices and formulas: mixed effects. $Zb = \begin{pmatrix} \mathbf{1}_{\text{bootcut}} & 0 & 0 & 0 \\ 0 & \mathbf{1}_{\text{skinny}} & 0 & 0 \\ 0 & 0 & \mathbf{1}_{\text{straight}} & 0 \\ 0 & 0 & 0 & \mathbf{1}_{\text{wide}} \end{pmatrix} \begin{pmatrix} \mu_{\text{bootcut}} \\ \mu_{\text{skinny}} \\ \mu_{\text{straight}} \\ \mu_{\text{wide}} \end{pmatrix}$. Each $\mu_{\text{silhouette}}$ is drawn from $N(0, \sigma^2)$. This allows for deviations from the average effects, $\mu$ and $\beta$, by silhouette, to the extent that the data support it.
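The block structure of $Z$ and the shrinkage of the per-silhouette intercepts can be sketched with NumPy. This is an illustration under stated assumptions, not the solver used by statsmodels or diamond: the data are synthetic, and both the residual variance `sigma2` and the random-intercept variance `tau2` are treated as known (hypothetical values). With the variances fixed, the coefficients solve a ridge-like linear system (Henderson's mixed model equations) in which only $b$ is penalized.

```python
import numpy as np

rng = np.random.default_rng(1)

silhouettes = np.array(["bootcut", "skinny", "straight", "wide"])
n = 400
# Sorting by silhouette gives Z the block layout shown on the slide.
group = np.sort(rng.integers(0, 4, size=n))
height = rng.normal(64.0, 3.0, size=n)

X = np.column_stack([np.ones(n), height])  # n x 2: intercept and height
Z = np.eye(4)[group]                       # n x 4: one indicator column per silhouette

sigma2 = 0.25  # residual variance, assumed known for this sketch
tau2 = 1.0     # variance of the silhouette intercepts, assumed known
b_true = rng.normal(0.0, np.sqrt(tau2), size=4)
y = X @ np.array([2.0, 0.45]) + Z @ b_true + rng.normal(0.0, np.sqrt(sigma2), size=n)

# Minimize ||y - X beta - Z b||^2 / sigma2 + ||b||^2 / tau2.
# Only the b block is penalized; beta is left unpenalized.
W = np.hstack([X, Z])
penalty = np.diag([0.0, 0.0] + [sigma2 / tau2] * 4)
coef = np.linalg.solve(W.T @ W + penalty, W.T @ y)
beta_hat, b_hat = coef[:2], coef[2:]
```

Without the penalty, the intercept column and the four indicator columns would be perfectly collinear; the penalty on $b$ is what makes the system solvable, and it pulls the silhouette offsets toward zero.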
  16. 16. application to recommender systems. A basic model. rating ∼ 1 + (1 | user_id) + (1 | item_id). In math, this means $r_{ui} = \mu + \alpha_u + \beta_i + \epsilon_{ui}$, where $\mu$ is an unknown constant, $\alpha_u \sim N(0, \sigma^2_{\text{user}})$, and $\beta_i \sim N(0, \sigma^2_{\text{item}})$. Some items are more popular than others; some users are more picky than others.
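As a quick illustration of what this model says about the data, the sketch below simulates ratings from $r_{ui} = \mu + \alpha_u + \beta_i + \epsilon_{ui}$. All numeric values (the mean, the three standard deviations, the counts of users and items) are arbitrary choices for the example, not estimates from the talk.

```python
import numpy as np

rng = np.random.default_rng(2)

n_users, n_items, n_obs = 500, 300, 50_000
mu = 3.5
sigma_user, sigma_item, sigma_eps = 0.5, 0.8, 1.0  # illustrative values only

alpha = rng.normal(0.0, sigma_user, size=n_users)  # how picky or generous each user is
beta = rng.normal(0.0, sigma_item, size=n_items)   # how popular each item is

user = rng.integers(0, n_users, size=n_obs)
item = rng.integers(0, n_items, size=n_obs)
rating = mu + alpha[user] + beta[item] + rng.normal(0.0, sigma_eps, size=n_obs)

# Marginally, the rating variance is the sum of the three variance components.
expected_var = sigma_user**2 + sigma_item**2 + sigma_eps**2
```

Two ratings of the same item share the same $\beta_i$, so they are correlated even though the noise terms are independent; that shared term is what the random effect captures.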
  17. 17. application to recommender systems. Add features. rating ∼ 1 + (1 + item_feature1 + item_feature2 | user_id) + (1 + user_feature1 + user_feature2 | item_id). Now $\alpha_u \sim N(0, \Sigma_{\text{user}})$ and $\beta_i \sim N(0, \Sigma_{\text{item}})$. The good: we’re using features; we learn individual and shared preferences; it helps with new items and new users. The bad: scales as $O(p^2)$.
  18. 18. application to recommender systems. Comments. rating ∼ 1 + (1 + item_feature1 + item_feature2 | user_id) + (1 + user_feature1 + user_feature2 | item_id). This is a parametric model, and much less flexible than trees, neural networks, or matrix factorization. But you don’t have to choose! You can use an ensemble, or use this as a feature in another model.
  21. 21. computation. How can you fit models like this? We were using R’s lme4 package. Maximum likelihood computation works like this: (1) estimate the covariance structure of the random effects, $\Sigma$; (2) given $\Sigma$, estimate the coefficients $\beta$ and $b$; (3) with these, compute the loglikelihood; (4) repeat until convergence. This doesn’t scale well with the number of observations, $n$. lme4 supports a variety of generalized linear models, but is not optimized for any one in particular. Is it really necessary to update the hyperparameters $\Sigma$ every time you estimate the coefficients?
  22. 22. computation. diamond. Diamond solves a similar problem using these tricks: input $\Sigma$ (conditional on $\Sigma$, the optimization problem is convex); use the Hessian of the L2-penalized loglikelihood function (pencil + paper) for logistic regression and for cumulative logistic regression, for ordinal responses: if $Y \in \{1, 2, 3, \ldots, J\}$, $\log \frac{\Pr(Y \le j)}{1 - \Pr(Y \le j)} = \alpha_j + \beta^T x$ for $j = 1, 2, \ldots, J - 1$; quasi-Newton optimization techniques from Minka 2003.
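To show why fixing $\Sigma$ makes the problem tractable, here is a minimal sketch of a penalized logistic regression solved with plain Newton steps. It is not diamond's implementation: the slide mentions quasi-Newton techniques, whereas this sketch uses the full Hessian for brevity, and the function name `fit_penalized_logistic`, the synthetic data, and the variance `sigma2` are all assumptions made for the example.

```python
import numpy as np

def fit_penalized_logistic(W, y, precision, n_iter=25, tol=1e-8):
    """Newton's method for: minimize -loglik(w) + 0.5 * w' precision w."""
    w = np.zeros(W.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(W @ w)))
        grad = W.T @ (p - y) + precision @ w
        # Hessian of the penalized negative loglikelihood; positive definite here,
        # which is why the problem is convex once the precision is fixed.
        hess = W.T @ (W * (p * (1.0 - p))[:, None]) + precision
        step = np.linalg.solve(hess, grad)
        w -= step
        if np.max(np.abs(step)) < tol:
            break
    return w

rng = np.random.default_rng(3)
n, n_groups = 2000, 5
group = rng.integers(0, n_groups, size=n)
# Columns: a fixed intercept, then one random-intercept indicator per group.
W = np.column_stack([np.ones(n), np.eye(n_groups)[group]])
b_true = rng.normal(0.0, 1.0, size=n_groups)
prob = 1.0 / (1.0 + np.exp(-(0.2 + b_true[group])))
y = (rng.random(n) < prob).astype(float)

sigma2 = 1.0  # random-intercept variance, supplied as an input rather than estimated
precision = np.diag([0.0] + [1.0 / sigma2] * n_groups)  # no penalty on the fixed intercept
w_hat = fit_penalized_logistic(W, y, precision)
```

The precision matrix plays the role of $\Sigma^{-1}$: it is passed in once, so each fit is a single convex optimization rather than an outer loop over covariance estimates.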
  24. 24. computation. Other solvers. How else could you fit mixed effects models? “Exact” methods: Full Bayes via MCMC, e.g. PyStan, PyMC3, Edward; diamond, but you must specify the hyperparameters $\Sigma$; statsmodels, which only supports linear regression for Gaussian-distributed outcomes; R/lme4. Approximate methods: simple, global L2 regularization; Full Bayes via variational inference; moment-based methods.
  25. 25. diamond. Speed test. MovieLens, 20M observations like (userId, movieId, rating). Binarize the (ordinal!) rating → 1(rating > 3.5); this is well-balanced. Fit a model like rating ∼ 1 + (1 | user_id) + (1 | item_id).
  26. 26. diamond. diamond.
       import numpy as np
       import pandas as pd
       from diamond.glms.logistic import LogisticRegression

       train_df = ...
       # Prior variances of the random intercepts, one row per grouping factor.
       priors_df = pd.DataFrame({
           'group': ['userId', 'movieId'],
           'var1': ['intercept'] * 2,
           'var2': [np.nan, np.nan],
           'vcov': [0.9, 1.0]
       })
       m = LogisticRegression(train_df=train_df, priors_df=priors_df)
       results = m.fit('liked ~ 1 + (1 | userId) + (1 | movieId)',
                       tol=1e-5, max_its=200, verbose=True)
  27. 27. diamond. Speed test vs. sklearn. Diamond: estimate the covariance on a sample of 1M observations in R (one-time, 60 minutes), giving $\sigma^2_{\text{user}} = 0.9$ and $\sigma^2_{\text{movie}} = 1.0$; it then takes 83 minutes on my laptop to fit in diamond. sklearn LogisticRegression: use cross validation to estimate the regularization (one-time, takes 24 minutes; grid search would be a fairer comparison); the refit takes 1 minute.
  28. 28. diamond. diamond vs. sklearn predictions. Global L2 regularization is a good approximation for this problem, but may not work as well when $\sigma^2_{\text{user}} \gg \sigma^2_{\text{item}}$ or vice versa, or for models with more features.
  29. 29. diamond. diamond vs. R. lme4 takes more than 360 minutes to fit.
  30. 30. diamond. diamond vs. moment-based. An active area of research by statisticians at Stanford, NYU, and elsewhere. Very fast to fit simple models using the method of moments, e.g. rating ∼ 1 + (1 + x | user_id) or rating ∼ 1 + (1 | user_id) + (1 | item_id); fitting this to MovieLens 20M took 4 minutes. But not rating ∼ 1 + (1 + x | user_id) + (1 | item_id).
  31. 31. diamond. diamond vs. variational inference. I fit this model in under 5 minutes using Edward, and didn’t have to input $\Sigma$. VI is very promising!
  32. 32. diamond. Why use diamond? http://github.com/stitchfix/diamond. It scales well with the number of observations (compared to pure R, MCMC); solves the exact problem (compared to variational, moment-based); scales OK with P (compared to simple global L2); and supports ordinal logistic regression: if $Y \in \{1, 2, 3, \ldots, J\}$, $\log \frac{\Pr(Y \le j)}{1 - \Pr(Y \le j)} = \alpha_j + \beta^T x$ for $j = 1, 2, \ldots, J - 1$. Reference: Agresti, Categorical Data Analysis.
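The cumulative logit formula can be turned into category probabilities by inverting the logit and differencing. The sketch below does this with NumPy; the function name `ordinal_probabilities`, the cutpoints, and the coefficients are hypothetical values chosen for the example and are unrelated to diamond's API.

```python
import numpy as np

def ordinal_probabilities(alpha, beta, x):
    """Return Pr(Y = j) for j = 1..J, given J - 1 increasing cutpoints alpha."""
    # Pr(Y <= j) for j = 1..J-1, from log-odds alpha_j + beta' x.
    cumulative = 1.0 / (1.0 + np.exp(-(alpha + beta @ x)))
    cumulative = np.append(cumulative, 1.0)  # Pr(Y <= J) = 1
    # Successive differences of the cumulative probabilities.
    return np.diff(cumulative, prepend=0.0)

alpha = np.array([-1.0, 0.0, 1.5])  # J = 4 categories, so 3 cutpoints
beta = np.array([0.5, -0.25])
x = np.array([1.0, 2.0])            # here beta' x = 0

probs = ordinal_probabilities(alpha, beta, x)
```

Because the cutpoints are increasing, the cumulative probabilities are increasing too, so every category probability is positive and they sum to one.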
  33. 33. diamond. Summary. Mixed effects models are useful for recommender systems and other data science applications; they can be hard to fit for large datasets; they play well with other kinds of models; diamond, moment-based approaches, and variational inference are good ways to estimate models quickly.
  34. 34. diamond. Discussion.
  35. 35. diamond. References I. Patrick Perry (2015), Moment Based Estimation for Hierarchical Models, https://arxiv.org/abs/1504.04941; Alan Agresti (2012), Categorical Data Analysis, 3rd Ed., ISBN-13 978-0470463635; Gao + Owen (2016), Estimation and Inference for Very Large Linear Mixed Effects Models, https://arxiv.org/abs/1610.08088; Edward, A Library for probabilistic modeling, inference, and criticism, https://github.com/blei-lab/edward
  36. 36. diamond. References II. Minka (2003), A comparison of numerical optimizers for logistic regression, https://tminka.github.io/papers/logreg/minka-logreg.pdf; lme4, https://cran.r-project.org/web/packages/lme4/vignettes/lmer.pdf
  37. 37. appendix. Regularization. Usual L2 regularization: if each $\beta_i \sim N(0, \frac{1}{\lambda})$, then $\min_{\beta} \; \text{loss} + \frac{1}{2} \beta^T (\lambda I_p) \beta$. Here, the four $b$ coefficient vectors are samples from $N(0, \Sigma)$. If we knew $\Sigma$, the regularization would be $\min_{b} \; \text{loss} + \frac{1}{2} b^T \begin{pmatrix} \Sigma^{-1} & 0 & 0 & 0 \\ 0 & \Sigma^{-1} & 0 & 0 \\ 0 & 0 & \Sigma^{-1} & 0 \\ 0 & 0 & 0 & \Sigma^{-1} \end{pmatrix} b$.
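The two penalties on the appendix slide can be compared numerically. This sketch uses made-up numbers for $\lambda$, $\Sigma$, and the coefficient vectors; it only illustrates that the block-diagonal quadratic form is the sum of per-group quadratic forms $\frac{1}{2} b_k^T \Sigma^{-1} b_k$.

```python
import numpy as np

# Usual L2 penalty: 0.5 * beta' (lambda * I_p) beta.
lam = 2.0
beta = np.array([0.5, -1.0, 0.25])
l2_penalty = 0.5 * beta @ (lam * np.eye(beta.size)) @ beta

# Mixed effects penalty: four coefficient vectors, each assumed drawn from N(0, Sigma).
Sigma = np.array([[1.0, 0.3],
                  [0.3, 0.5]])  # illustrative 2 x 2 covariance
Sigma_inv = np.linalg.inv(Sigma)
b_groups = np.array([[0.2, -0.1],
                     [-0.5, 0.3],
                     [0.0, 0.4],
                     [1.0, -0.2]])
b = b_groups.ravel()  # stack the four vectors into one long vector

# Block-diagonal matrix with Sigma^{-1} repeated four times.
block_precision = np.kron(np.eye(4), Sigma_inv)
mixed_penalty = 0.5 * b @ block_precision @ b

# The same quantity, computed one group at a time.
per_group_penalty = 0.5 * sum(bk @ Sigma_inv @ bk for bk in b_groups)
```

Setting $\Sigma = \frac{1}{\lambda} I$ recovers the usual L2 penalty, which is one way to see global L2 regularization as a special case.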
