6. Goals
Propose “relevant” items to customers
Retention
Exploration
Up-sell
Personalized offers
recommend items for a customer given their history of activities (transactions, browsing history, favourites)
Similar items
substitutions
bundles - frequently bought together
. . .
9. Netflix prize
~ 480k users, 18k movies, 100m ratings
sparsity ~ 99% (100M ratings fill only ~1.2% of the 480k × 18k matrix)
goal is to reduce RMSE by 10% - from 0.9514 to 0.8563
10. Implicit feedback
noisy feedback (clicks, likes, purchases, searches, . . . )
much easier to collect
wider user/item coverage
usually sparsity > 99.9%
One-Class Collaborative Filtering
observed entries are positive preferences
should have high confidence
missing entries in the matrix are a mix of negative and positive preferences
treat them as negative with low confidence
we cannot really distinguish whether a user did not click a banner because of lack of interest or lack of awareness
11. Evaluation
Recap: we only care about producing a small set of highly relevant items.
RMSE is a bad metric - it has a very weak connection to business goals.
We are only interested in the precision of the retrieved items:
space on the screen is limited
only order matters - the most relevant items should be at the top
12. Ranking - Mean average precision
$$AveragePrecision = \frac{\sum_{k=1}^{n} P(k) \times rel(k)}{\text{number of relevant documents}}$$
## index relevant precision_at_k
## 1: 1 0 0.0000000
## 2: 2 0 0.0000000
## 3: 3 1 0.3333333
## 4: 4 0 0.2500000
## 5: 5 0 0.2000000
map@5 = 0.1566667
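A minimal R sketch (not from the original slides) that reproduces the table above; note that the slide's map@5 averages precision over all 5 positions, while the AveragePrecision formula only counts positions where rel(k) = 1:

```r
# One user's ranked list: rel is a 0/1 relevance vector in ranking order.
rel <- c(0, 0, 1, 0, 0)
p_at_k <- cumsum(rel) / seq_along(rel)        # precision at each cutoff k
data.frame(index = seq_along(rel), relevant = rel, precision_at_k = p_at_k)

mean(p_at_k)                           # 0.1566667 - the slide's map@5
sum(p_at_k * rel) / max(sum(rel), 1)   # 0.3333333 - AP per the formula above
```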
13. Ranking - Normalized Discounted Cumulative Gain
Intuition is the same as for MAP@K, but graded relevance values are also taken into account:
$$DCG_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2(i + 1)}$$

$$nDCG_p = \frac{DCG_p}{IDCG_p}$$

$$IDCG_p = \sum_{i=1}^{|REL|} \frac{2^{rel_i} - 1}{\log_2(i + 1)}$$
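A small self-contained R helper matching the formulas above (a sketch, not from the slides):

```r
# DCG of a ranked list with graded relevance rel, truncated at position p.
dcg_at_p <- function(rel, p = length(rel)) {
  rel <- rel[seq_len(min(p, length(rel)))]
  sum((2^rel - 1) / log2(seq_along(rel) + 1))
}

# nDCG: normalize by the DCG of the ideal (descending-relevance) ordering.
ndcg_at_p <- function(rel, p = length(rel)) {
  dcg_at_p(rel, p) / dcg_at_p(sort(rel, decreasing = TRUE), p)
}

ndcg_at_p(c(3, 2, 0, 1), p = 4)  # ~0.99 for this nearly ideal ranking
```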
14. Approaches
Content based
good for cold start
not personalized
Collaborative filtering
vanilla collaborative filtering
matrix factorizations
. . .
Hybrid and context aware recommender systems
best of both worlds
15. Focus today
WRMF (Weighted Regularized Matrix Factorization) - Collaborative Filtering for
Implicit Feedback Datasets (2008)
efficient learning with accelerated approximate Alternating Least Squares
inference time
Linear-Flow - Practical Linear Models for Large-Scale One-Class Collaborative
Filtering (2016)
efficient truncated SVD
cheap cross-validation with full path regularization
16. Matrix Factorizations
Users can be described by a small number of latent factors $p_{uk}$
Items can be described by a small number of latent factors $q_{ki}$
20. Truncated SVD
Take the k largest singular values:

$$X \approx U_k D_k V_k^T$$

- $X_k \in \mathbb{R}^{m \times n}$
- $U_k$, $V_k$ - columns are orthonormal bases (the dot product of any 2 columns is zero, each column has unit norm)
- $D_k$ - diagonal matrix with the singular values on the diagonal

Truncated SVD is the best rank-k approximation of the matrix X in terms of the Frobenius norm:

$$\min \|X - U_k D_k V_k^T\|_F$$

User and item embeddings come from splitting $D_k$ between the factors, e.g.

$$P = U_k D_k^{1/2}, \quad Q = D_k^{1/2} V_k^T, \quad X \approx PQ$$
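A quick R illustration using the irlba package for the truncated SVD of a sparse matrix (one convenient choice among several):

```r
library(Matrix)
library(irlba)

set.seed(42)
X <- rsparsematrix(1000, 500, density = 0.01)  # toy user-item matrix

k <- 10
s <- irlba(X, nv = k)                # truncated SVD: s$u, s$d, s$v

P <- s$u %*% diag(sqrt(s$d))         # user embeddings
Q <- diag(sqrt(s$d)) %*% t(s$v)      # item embeddings, so X ~ P %*% Q
```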
21. Issue with truncated SVD for “explicit” feedback
Optimal in terms of Frobenius norm - it takes the zeros in the ratings matrix into account:

$$RMSE = \sqrt{\frac{1}{|users| \times |items|} \sum_{u \in users,\ i \in items} (r_{ui} - \hat{r}_{ui})^2}$$

This overfits the data.

The objective should measure the error only on "observed" ratings:

$$RMSE = \sqrt{\frac{1}{|Observed|} \sum_{u,i \in Observed} (r_{ui} - \hat{r}_{ui})^2}$$
22. SVD-like matrix factorization with ALS
$$J = \sum_{u,i \in Observed} (r_{ui} - p_u \cdot q_i)^2 + \lambda (\|P\|^2 + \|Q\|^2)$$

Given Q fixed, solve for each $p_u$:

$$\min_{p_u} \sum_{i \in Observed(u)} (r_{ui} - q_i \cdot p_u)^2 + \lambda \sum_{j=1}^{k} p_{uj}^2$$

Given P fixed, solve for each $q_i$:

$$\min_{q_i} \sum_{u \in Observed(i)} (r_{ui} - p_u \cdot q_i)^2 + \lambda \sum_{j=1}^{k} q_{ij}^2$$

Each update is a ridge regression: $p_u = (Q^T Q + \lambda I)^{-1} Q^T r_u$, $q_i = (P^T P + \lambda I)^{-1} P^T r_i$ (with Q, P and r restricted to the observed entries).
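A toy sketch of one ALS sweep for the observed-only objective (dense matrices and NA for missing ratings, for clarity; real implementations use sparse structures):

```r
set.seed(1)
n_u <- 50; n_i <- 40; k <- 5; lambda <- 0.1
R <- matrix(NA_real_, n_u, n_i)
obs <- sample(length(R), 400)
R[obs] <- sample(1:5, length(obs), replace = TRUE)

P <- matrix(rnorm(n_u * k, sd = 0.1), n_u, k)
Q <- matrix(rnorm(n_i * k, sd = 0.1), n_i, k)

# Given Q fixed: p_u = (Q_u' Q_u + lambda I)^-1 Q_u' r_u,
# where Q_u and r_u are restricted to the items user u actually rated.
for (u in seq_len(n_u)) {
  idx <- which(!is.na(R[u, ]))
  if (length(idx) == 0) next
  Qu <- Q[idx, , drop = FALSE]
  P[u, ] <- solve(crossprod(Qu) + lambda * diag(k),
                  crossprod(Qu, R[u, idx]))
}
# ...then the symmetric update for each q_i with P fixed, and iterate.
```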
23. “Collaborative Filtering for Implicit Feedback Datasets”
WRMF - Weighted Regularized Matrix Factorization
“Default” approach
Proposed in 2008, but still widely used in industry (even at YouTube)
several high-quality open-source implementations
$$J = \sum_{u,i} C_{ui} (P_{ui} - X_u Y_i^T)^2 + \lambda (\|X\|_F^2 + \|Y\|_F^2)$$

Preferences are binary:

$$P_{ui} = \begin{cases} 1 & \text{if } R_{ui} > 0 \\ 0 & \text{otherwise} \end{cases}$$

Confidence: $C_{ui} = 1 + f(R_{ui})$ (in the paper, $f(R_{ui}) = \alpha R_{ui}$)
24. Alternating Least Squares for implicit feedback
For fixed Y:

$$\frac{dL}{dx_u} = -2 \sum_{i \in items} c_{ui} (p_{ui} - x_u^T y_i) y_i + 2\lambda x_u$$
$$= -2 \sum_{i \in items} c_{ui} (p_{ui} - y_i^T x_u) y_i + 2\lambda x_u$$
$$= -2 Y^T C^u p(u) + 2 Y^T C^u Y x_u + 2\lambda x_u$$

Setting $dL/dx_u = 0$ for the optimal solution gives $(Y^T C^u Y + \lambda I) x_u = Y^T C^u p(u)$

$x_u$ can be obtained by solving a system of linear equations:

$$x_u = \mathrm{solve}(Y^T C^u Y + \lambda I,\ Y^T C^u p(u))$$
25. Alternating Least Squares for implicit feedback
Similarly, for fixed X:

$$\frac{dL}{dy_i} = -2 X^T C^i p(i) + 2 X^T C^i X y_i + 2\lambda y_i$$

$$y_i = \mathrm{solve}(X^T C^i X + \lambda I,\ X^T C^i p(i))$$

Another optimization:

$$X^T C^i X = X^T X + X^T (C^i - I) X$$
$$Y^T C^u Y = Y^T Y + Y^T (C^u - I) Y$$

$X^T X$ and $Y^T Y$ can be precomputed, and $C^u - I$ is non-zero only at the observed entries (see the sketch below).
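A sketch of the resulting user update in R (als_user_step is a hypothetical helper; C is never materialized since $c_{ui} = 1 + \alpha r_{ui}$ differs from 1 only on observed items):

```r
# One WRMF user update: x_u = solve(Y' C^u Y + lambda I, Y' C^u p(u)).
# YtY = crossprod(Y) is precomputed once per sweep.
als_user_step <- function(u, R, Y, lambda, alpha, YtY) {
  k   <- ncol(Y)
  idx <- which(R[u, ] > 0)                  # items user u interacted with
  Yu  <- Y[idx, , drop = FALSE]
  cu  <- 1 + alpha * R[u, idx]              # confidences on observed items
  A   <- YtY + crossprod(Yu, (cu - 1) * Yu) + lambda * diag(k)
  b   <- crossprod(Yu, cu)                  # p(u) = 1 on observed items
  solve(A, b)
}
```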
26. Accelerated Approximate Alternating Least Squares
$$y_i = \mathrm{solve}(X^T C^i X + \lambda I,\ X^T C^i p(i))$$

Iterative methods:
Conjugate Gradient
Coordinate Descent
Run a fixed number of steps (usually 3-4 is enough), as in the sketch below.
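For reference, a generic textbook conjugate-gradient sketch for the symmetric positive-definite systems above (not taken from any particular library):

```r
# Approximate solve of A x = b with a few CG steps (A symmetric PD).
cg_solve <- function(A, b, steps = 3) {
  x <- rep(0, length(b))
  r <- b - drop(A %*% x)           # residual
  p <- r                           # search direction
  for (s in seq_len(steps)) {
    Ap    <- drop(A %*% p)
    alpha <- sum(r * r) / sum(p * Ap)
    x     <- x + alpha * p
    r_new <- r - alpha * Ap
    beta  <- sum(r_new * r_new) / sum(r * r)
    r     <- r_new
    p     <- r + beta * p
  }
  x
}
```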
27. Inference time
How to make recommendations for new users?
There are no user embeddings, since these users are not in the original matrix!
28. Inference time
Make one ALS step with the item embedding matrix fixed => get the new user's embedding:

given Y fixed, $C^{u_{new}}$ - confidence of the new user's interactions

$$x_{u_{new}} = \mathrm{solve}(Y^T C^{u_{new}} Y + \lambda I,\ Y^T C^{u_{new}} p(u_{new}))$$

$$scores = X_{new} Y^T$$
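A fold-in sketch reusing the als_user_step() helper from the earlier WRMF sketch (hypothetical names throughout):

```r
# Score a brand-new user from their interaction row r_new (length n_items).
recommend_new_user <- function(r_new, Y, lambda, alpha, n = 10) {
  YtY    <- crossprod(Y)
  x_new  <- als_user_step(1, matrix(r_new, nrow = 1), Y, lambda, alpha, YtY)
  scores <- drop(Y %*% x_new)             # scores = x_new' Y'
  seen   <- which(r_new > 0)
  head(setdiff(order(scores, decreasing = TRUE), seen), n)
}
```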
29. WRMF Implementations
python implicit - implements Conjugate Gradient. With GPU support recently!
R reco - implements Conjugate Gradient
Spark ALS
Quora qmf
Google tensorflow
30. Linear-Flow
The idea is to learn an item-item similarity matrix W from the data.

First:

$$\min_W J = \|X - XW_k\|_F^2 + \lambda \|W_k\|_F^2$$

with the constraint:

$$rank(W) \le k$$
31. Linear-Flow observations
1. Without L2 regularization the optimal solution is $W_k = Q_k Q_k^T$, where $SVD_k(X) = P_k \Sigma_k Q_k^T$
2. Without the $rank(W) \le k$ constraint the optimal solution is just the ridge regression solution: $W = (X^T X + \lambda I)^{-1} X^T X$ - infeasible at this scale.
32. Linear-Flow reparametrization
$$SVD_k(X) = P_k \Sigma_k Q_k^T$$

Let $W = Q_k Y$:

$$\underset{Y}{\mathrm{argmin}}\ \|X - X Q_k Y\|_F^2 + \lambda \|Q_k Y\|_F^2$$

Motivation:

$\lambda = 0 \Rightarrow W = Q_k Q_k^T$, and the solution of the current problem is then $Y = Q_k^T$
33. Linear-Flow closed-form solution
Notice that if $Q_k$ is orthogonal, then $\|Q_k Y\|_F = \|Y\|_F$

Solve $\|X - X Q_k Y\|_F^2 + \lambda \|Y\|_F^2$

Simple ridge regression with a closed-form solution:

$$Y = (Q_k^T X^T X Q_k + \lambda I)^{-1} Q_k^T X^T X$$

Very cheap inversion of a $k \times k$ matrix!
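A compact R sketch of the whole Linear-Flow fit, again using irlba for the truncated SVD (an assumed choice):

```r
library(Matrix)
library(irlba)

linear_flow <- function(X, k = 100, lambda = 1) {
  Q <- irlba(X, nv = k)$v         # right singular vectors, n_items x k
  Z <- crossprod(X %*% Q, X)      # Z = Q' X' X, k x n_items
  Y <- solve(Z %*% Q + lambda * diag(k), Z)
  list(Q = Q, Y = Y)              # W = Q %*% Y; scores = X %*% Q %*% Y
}
```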
34. Linear-Flow hassle-free cross-validation
$$Y = (Q_k^T X^T X Q_k + \lambda I)^{-1} Q_k^T X^T X$$

How to find lambda with cross-validation?

pre-compute $Z = Q_k^T X^T X$, so $Y = (Z Q_k + \lambda I)^{-1} Z$
pre-compute $Z Q_k$
notice that the value of lambda affects only the diagonal of $Z Q_k + \lambda I$
generate a sequence of lambdas (say, of length 50) based on the min/max diagonal values
solving 50 small rank-k ridge regressions is super fast, as sketched below
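Continuing the sketch above (X and Q as before; map_at_k() stands for whatever validation metric you use and is hypothetical):

```r
Z  <- crossprod(X %*% Q, X)   # Q' X' X, computed once
ZQ <- Z %*% Q                 # k x k, computed once
d  <- diag(ZQ)                # positive: ~ squared singular values

lambdas <- exp(seq(log(min(d)), log(max(d)), length.out = 50))
for (lambda in lambdas) {
  Y <- solve(ZQ + lambda * diag(length(d)), Z)  # cheap k x k solve
  # scores <- X_valid %*% (Q %*% Y); then e.g. map_at_k(scores, holdout)
}
```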
36. Suggestions
start simple - SVD, WRMF
design proper cross-validation - both objective and data split
think about how to incorporate business logic (for example, how to exclude certain items)
use single-machine implementations
think about inference time
don't waste time on libraries/articles/blog posts which demonstrate MF with dense matrices