Diamond mixed effects models in Python

Diamond: Mixed Effects Models in Python
Timothy Sweetser
Stitch Fix
http://github.com/stitchfix/diamond
tsweetser@stitchfix.com
@hacktuarial
November 27, 2017

Overview
1 context and motivation
2 what is the mixed effects model
3 application to recommender systems
4 computation
5 diamond
6 appendix

context and motivation
Stitch Fix

what is the mixed effects model
Refresher: Linear Model
y ∼ N(Xβ, σ²I)
y is n x 1
X is n x p
β is an unknown vector of length p
σ² is an unknown, nonnegative constant

Mixed Effects Model
y|b ∼ N(Xβ + Zb, σ²I)
We have a second set of features, Z, n x q
the coefficients on Z are b ∼ N(0, Σ)
Σ is q x q
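
To make the notation concrete, here is a minimal simulation sketch of the model above. The sizes, the group labels, and the values of β, Σ, and σ² are illustrative assumptions, not numbers from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 200, 2, 4                      # observations, fixed features, random features

X = np.column_stack([np.ones(n), rng.normal(size=n)])   # n x p, first column is the intercept
groups = rng.integers(0, q, size=n)      # hypothetical group label for each row
Z = np.eye(q)[groups]                    # n x q indicator matrix, one column per group

beta = np.array([1.0, 0.5])              # fixed effects (assumed values)
Sigma = 0.3 * np.eye(q)                  # q x q covariance of the random effects (assumed)
sigma2 = 1.0                             # residual variance (assumed)

b = rng.multivariate_normal(np.zeros(q), Sigma)          # b ~ N(0, Sigma)
noise = rng.normal(scale=np.sqrt(sigma2), size=n)
y = X @ beta + Z @ b + noise             # y | b ~ N(X beta + Z b, sigma2 * I)
```

Fitting reverses this process: from y, X, and Z, recover β, b, and the variance parameters.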

simple example of a mixed effects model
You think there is some relationship between a woman’s height and the
ideal length of jeans for her:
length = α + β · height + ε
But you think the length might need to be shorter or longer, depending
on the silhouette of the jeans. In other words, you want α to vary by
silhouette.

why might silhouette affect length ∼ height?
[Figure: skinny vs. bootcut jeans]

linear model: formula
Linear models can be expressed in formula notation, used by patsy,
statsmodels, and R
import statsmodels.formula.api as smf
lm = smf.ols('length ~ 1 + height', data=train_df).fit()
in math, this means length = Xβ + ε
each row of X holds the intercept and the height, e.g. X_i = [1.0, 64.0]
β is what we want to learn, using (customer, item) data from jeans
that fit well

linear model: illustration

mixed effects: formula
Now, allow the intercept to vary by silhouette
mix = smf.mixedlm('length ~ 1 + height',
                  data=train_df,
                  re_formula='1',
                  groups='silhouette',
                  use_sparse=True).fit()

illustration

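mixed effects regularization
y|b ∼ N(Xβ + Zb, σ²I)
Sort by silhouette:
\[
Z = \begin{bmatrix}
\mathbf{1}_{\text{bootcut}} & 0 & 0 & 0 \\
0 & \mathbf{1}_{\text{skinny}} & 0 & 0 \\
0 & 0 & \mathbf{1}_{\text{straight}} & 0 \\
0 & 0 & 0 & \mathbf{1}_{\text{wide}}
\end{bmatrix}
\]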
X is n x 2
Z is n x 4

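matrices and formulas - mixed effects
\[
Zb = \begin{bmatrix}
\mathbf{1}_{\text{bootcut}} & 0 & 0 & 0 \\
0 & \mathbf{1}_{\text{skinny}} & 0 & 0 \\
0 & 0 & \mathbf{1}_{\text{straight}} & 0 \\
0 & 0 & 0 & \mathbf{1}_{\text{wide}}
\end{bmatrix}
\begin{bmatrix}
\mu_{\text{bootcut}} \\
\mu_{\text{skinny}} \\
\mu_{\text{straight}} \\
\mu_{\text{wide}}
\end{bmatrix}
\]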
Each µ_silhouette is drawn from N(0, σ²)
This allows for deviations from the average effects, µ and β, by
silhouette, to the extent that the data support it
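
As a small sketch of how Z and Zb look in practice, the snippet below builds the indicator matrix from a column of silhouette labels. The six rows and the µ values are made up for illustration.

```python
import numpy as np

silhouettes = np.array(["bootcut", "bootcut", "skinny", "straight", "wide", "wide"])
# levels come back sorted; codes gives the column index for each row
levels, codes = np.unique(silhouettes, return_inverse=True)

Z = np.zeros((len(silhouettes), len(levels)))
Z[np.arange(len(silhouettes)), codes] = 1.0     # one indicator column per silhouette

# hypothetical values for mu_bootcut, mu_skinny, mu_straight, mu_wide
mu = np.array([0.8, -0.5, 0.1, 0.4])
offsets = Z @ mu                                # each row picks up its own silhouette's shift
```

Each row of Z has a single 1, so Zb simply adds the matching silhouette's intercept shift to that row.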

application to recommender systems
a basic model
rating ∼ 1 + (1|user_id) + (1|item_id)
In math, this means
r_ui = µ + α_u + β_i + ε_ui
where
µ is an unknown constant
α_u ∼ N(0, σ²_user)
β_i ∼ N(0, σ²_item)

some items are more popular than others
some users are more picky than others
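
A rough sketch of what the normal prior on the user effects does. To keep the arithmetic closed-form, it assumes a simplified model with user effects only (no item effects) and with µ and both variances known; the function name and the numbers are hypothetical.

```python
import numpy as np

def shrunken_user_effects(ratings, user_idx, n_users, mu, var_user, var_noise):
    """Posterior mean of each alpha_u in r = mu + alpha_u + noise, variances known."""
    resid_sum = np.bincount(user_idx, weights=ratings - mu, minlength=n_users)
    counts = np.bincount(user_idx, minlength=n_users)
    # equals (mean residual) * n_u * var_user / (n_u * var_user + var_noise)
    return var_user * resid_sum / (counts * var_user + var_noise)

ratings = np.array([5.0, 5.0, 5.0, 5.0, 1.0])
user_idx = np.array([0, 0, 0, 0, 1])        # user 0 has four ratings, user 1 has one
alpha = shrunken_user_effects(ratings, user_idx, n_users=2, mu=3.0,
                              var_user=1.0, var_noise=1.0)
# alpha is [1.6, -1.0]; the raw mean residuals are +2.0 and -2.0
```

Both users sit 2 points from the global mean, but the user with a single rating is pulled further toward zero than the user with four.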

add features
rating ∼ 1 + (1 + item_feature1 + item_feature2|user_id) +
(1 + user_feature1 + user_feature2|item_id)
Now,
α_u ∼ N(0, Σ_user)
β_i ∼ N(0, Σ_item)
the good: we’re using features! The model learns individual and shared preferences,
which helps with new items and new users
the bad: scales as O(p²), where p is the number of features

comments
rating ∼ 1 + (1 + item_feature1 + item_feature2|user_id) +
(1 + user_feature1 + user_feature2|item_id)
this is a parametric model, and much less flexible than trees, neural
networks, or matrix factorization
but you don’t have to choose!
you can use an ensemble, or use this as a feature in another model

computation
computation
How can you fit models like this? We were using R’s lme4 package
Maximum likelihood computation works like this:

Estimate covariance structure of random effects, Σ
given Σ, estimate coefficients β and b
with these, compute loglikelihood
repeat until convergence
Doesn’t scale well with number of observations, n
lme4 supports a variety of generalized linear models, but is not
optimized for any one in particular
Is it really necessary to update hyperparameters Σ every time you
estimate the coefficients?
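
For the linear Gaussian case, the step "given Σ, estimate coefficients β and b" can be sketched as one linear solve (Henderson's mixed model equations). This is an illustration under assumed, known Σ and σ², not lme4's actual implementation; the function name and the simulated data are hypothetical.

```python
import numpy as np

def solve_given_sigma(X, Z, y, Sigma, sigma2):
    """Solve for (beta, b) in the linear mixed model with Sigma and sigma2 held fixed."""
    p = X.shape[1]
    W = np.hstack([X, Z])
    penalty = np.zeros((W.shape[1], W.shape[1]))
    penalty[p:, p:] = sigma2 * np.linalg.inv(Sigma)    # only b is penalized, not beta
    coef = np.linalg.solve(W.T @ W + penalty, W.T @ y)
    return coef[:p], coef[p:]

rng = np.random.default_rng(1)
n, q = 300, 4
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Z = np.eye(q)[rng.integers(0, q, size=n)]              # group indicator matrix
Sigma, sigma2 = 0.5 * np.eye(q), 1.0                   # assumed known here
b_true = rng.multivariate_normal(np.zeros(q), Sigma)
y = X @ np.array([2.0, 0.7]) + Z @ b_true + rng.normal(size=n)

beta_hat, b_hat = solve_given_sigma(X, Z, y, Sigma, sigma2)
```

The outer loop then scores this solution with the log-likelihood and proposes a new Σ, which is the part that gets expensive as n grows.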

computation
diamond
Diamond solves a similar problem using these tricks:
Input Σ. Conditional on Σ, the optimization problem is convex
Use Hessian of L2 penalized loglikelihood function (pencil + paper)
logistic regression
cumulative logistic regression, for ordinal responses
if Y ∈ {1, 2, 3, ..., J},
\[
\log \frac{\Pr(Y \le j)}{1 - \Pr(Y \le j)} = \alpha_j + \beta^T x
\quad \text{for } j = 1, 2, \ldots, J - 1
\]
quasi-Newton optimization techniques from Minka 2003
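
To illustrate the first two tricks, here is a minimal sketch of plain Newton's method for logistic regression with a fixed L2 penalty built from Σ. It is not Diamond's code: Diamond uses the quasi-Newton techniques cited above, and the function name, data, and Σ below are assumptions for the example.

```python
import numpy as np

def fit_penalized_logistic(W, y, precision, tol=1e-8, max_its=50):
    """Newton's method for logistic loss + 0.5 * coef' precision coef, precision fixed."""
    coef = np.zeros(W.shape[1])
    for _ in range(max_its):
        prob = 1.0 / (1.0 + np.exp(-(W @ coef)))
        grad = W.T @ (prob - y) + precision @ coef
        hess = W.T @ (W * (prob * (1.0 - prob))[:, None]) + precision
        step = np.linalg.solve(hess, grad)
        coef = coef - step
        if np.max(np.abs(step)) < tol:
            break
    return coef

rng = np.random.default_rng(3)
n, q = 500, 5
groups = rng.integers(0, q, size=n)
W = np.column_stack([np.ones(n), np.eye(q)[groups]])    # intercept + group indicators
true_coef = np.concatenate([[0.3], rng.normal(size=q)])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(W @ true_coef)))).astype(float)

Sigma = 1.0 * np.eye(q)                                 # assumed known, as Diamond requires
precision = np.zeros((q + 1, q + 1))
precision[1:, 1:] = np.linalg.inv(Sigma)                # the intercept is left unpenalized
coef_hat = fit_penalized_logistic(W, y, precision)
```

Because Σ is fixed, the penalized objective is convex and each Newton step is a single linear solve.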


computation
other solvers
How else could you fit mixed effects models?
"Exact" methods
Full Bayes: MCMC, e.g. PyStan, PyMC3, Edward
diamond, but you must specify the hyperparameters Σ
statsmodels, which only supports linear regression for Gaussian-distributed
outcomes
R/lme4
Approximate methods
Simple, global L2 regularization
Full Bayes: Variational Inference
moment-based methods

diamond
Speed test
MovieLens, 20M observations like (userId, movieId, rating)
binarize (ordinal!) rating → 1(rating > 3.5)
this is well-balanced
Fit a model like liked ∼ 1 + (1|userId) + (1|movieId)

diamond
diamond
import numpy as np
import pandas as pd
from diamond.glms.logistic import LogisticRegression

train_df = ...
# prior variances of the user and movie random intercepts
priors_df = pd.DataFrame({
    'group': ['userId', 'movieId'],
    'var1': ['intercept'] * 2,
    'var2': [np.nan, np.nan],
    'vcov': [0.9, 1.0]
})
m = LogisticRegression(train_df=train_df, priors_df=priors_df)
results = m.fit('liked ~ 1 + (1 | userId) + (1 | movieId)',
                tol=1e-5, max_its=200, verbose=True)

diamond
Speed test vs. sklearn
Diamond
estimate covariance on a sample of 1M observations in R: a one-time cost of 60 minutes
σ²_user = 0.9, σ²_movie = 1.0
takes 83 minutes on my laptop to fit in diamond
sklearn LogisticRegression
use cross validation to estimate regularization: a one-time cost of 24 minutes
grid search would be a fairer comparison
refit takes 1 minute

diamond
diamond vs. sklearn predictions
Global L2 regularization is a good approximation for this problem, but may
not work as well when σ²_user >> σ²_item (or vice versa), or for models with
more features

diamond
diamond vs. R
lme4 takes more than 360 minutes to fit

diamond
diamond vs. moment-based
active area of research by statisticians at Stanford, NYU, elsewhere
very fast to fit simple models using method of moments
e.g. rating ∼ 1 + (1 + x|user_id)
or rating ∼ 1 + (1|user_id) + (1|item_id)
Fitting this to MovieLens 20M took 4 minutes
but not rating ∼ 1 + (1 + x|user_id) + (1|item_id)

diamond
diamond vs. variational inference
I fit this model in under 5 minutes using Edward, and didn’t have to
input Σ.
VI is very promising!

diamond
why use diamond?
http://github.com/stitchfix/diamond
scales well with number of observations (compared to pure R, MCMC)
solves the exact problem (compared to variational, moment-based)
scales ok with P (compared to simple global L2)
supports ordinal logistic regression
if Y ∈ {1, 2, 3, ..., J},
\[
\log \frac{\Pr(Y \le j)}{1 - \Pr(Y \le j)} = \alpha_j + \beta^T x
\quad \text{for } j = 1, 2, \ldots, J - 1
\]
Reference: Agresti, Categorical Data Analysis
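
As a quick sketch of how the cumulative logit formula turns into per-category probabilities, the snippet below inverts the logit and takes differences. The cutpoints α_j, the coefficients β, and the feature vector x are hypothetical values, and the function is an illustration rather than part of Diamond's API.

```python
import numpy as np

def category_probabilities(alpha, beta, x):
    """P(Y = j) for j = 1..J when logit P(Y <= j) = alpha_j + beta'x."""
    cum = 1.0 / (1.0 + np.exp(-(alpha + beta @ x)))    # P(Y <= j) for j = 1..J-1
    cum = np.concatenate([[0.0], cum, [1.0]])          # P(Y <= 0) = 0 and P(Y <= J) = 1
    return np.diff(cum)

alpha = np.array([-1.0, 0.0, 1.5])     # increasing cutpoints, so J = 4 categories
beta = np.array([0.5, -0.2])
probs = category_probabilities(alpha, beta, x=np.array([1.0, 2.0]))
```

The cutpoints must be increasing in j for every category probability to be nonnegative.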

diamond
summary
mixed effects models are useful for recommender systems and other
data science applications
they can be hard to fit for large datasets
they play well with other kinds of models
diamond, moment-based approaches, and variational inference are
good ways to estimate models quickly

diamond
discussion

diamond
References I
Patrick Perry (2015). Moment Based Estimation for Hierarchical Models. https://arxiv.org/abs/1504.04941
Alan Agresti (2012). Categorical Data Analysis, 3rd Ed. ISBN-13 978-0470463635
Gao and Owen (2016). Estimation and Inference for Very Large Linear Mixed Effects Models. https://arxiv.org/abs/1610.08088
Edward. A library for probabilistic modeling, inference, and criticism. https://github.com/blei-lab/edward

diamond
References II
Minka (2003). A comparison of numerical optimizers for logistic regression. https://tminka.github.io/papers/logreg/minka-logreg.pdf
lme4. https://cran.r-project.org/web/packages/lme4/vignettes/lmer.pdf

appendix
regularization
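Usual L2 regularization. If each β_i ∼ N(0, 1/λ),
\[
\min_{\beta} \; \text{loss} + \frac{1}{2} \beta^T (\lambda I_p) \beta
\]
Here, the four b coefficient vectors are samples from N(0, Σ). If we knew
Σ, the regularization would be
\[
\min_{b} \; \text{loss} + \frac{1}{2} b^T
\begin{bmatrix}
\Sigma^{-1} & 0 & 0 & 0 \\
0 & \Sigma^{-1} & 0 & 0 \\
0 & 0 & \Sigma^{-1} & 0 \\
0 & 0 & 0 & \Sigma^{-1}
\end{bmatrix}
b
\]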
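
A small numerical sketch of the block-diagonal penalty, assuming a made-up 2 x 2 Σ shared by four groups. It shows that the matrix form equals a sum of per-group quadratic forms.

```python
import numpy as np

Sigma = np.array([[1.0, 0.3],
                  [0.3, 0.5]])            # hypothetical covariance shared by every group
n_groups = 4
rng = np.random.default_rng(2)
b = rng.normal(size=(n_groups, Sigma.shape[0]))    # one coefficient vector per group

# block-diagonal matrix with Sigma^{-1} repeated once per group
precision = np.kron(np.eye(n_groups), np.linalg.inv(Sigma))
b_flat = b.ravel()                                  # stack the group vectors end to end
penalty = 0.5 * b_flat @ precision @ b_flat

# the same penalty, computed group by group
penalty_by_group = 0.5 * sum(b_g @ np.linalg.solve(Sigma, b_g) for b_g in b)
```

When Σ = (1/λ)I, this collapses to the usual global L2 penalty, (λ/2) times the squared norm of b.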
