Massive Matrix Factorization : Applications to collaborative filtering

Dictionary Learning for
Massive Matrix Factorization
Arthur Mensch, Julien Mairal
Gaël Varoquaux, Bertrand Thirion
Inria Parietal, Inria Thoth
October 6, 2016

Introduction
Why am I here ?
Inria Parietal: machine learning for neuro-imaging
(fMRI data)
Matrix factorization: major ingredient in fMRI analysis
Very large datasets (2 TB): we designed faster algorithms
These algorithms can be used in collaborative filtering
[Figure: X (voxels × time) = D (k spatial maps) × A (time)]
Work presented at ICML 2016
Arthur Mensch Dictionary Learning for Massive Matrix Factorization 1 / 28

Outline
1 Matrix factorization for recommender systems
Collaborative filtering
Matrix factorization formulation
Existing methods
2 Subsampled online dictionary learning
Dictionary learning – existing methods
Handling missing values efficiently
New algorithm
3 Results
Setting
Benchmarks
Parameter setting

Collaborative platform
n users rate a fraction of p items
(e.g. movies, restaurants)
Estimate ratings for recommendation
Use the ratings of other users for recommendation

How to predict ratings ?
Credit: [Bell and Koren, 2007]
Joe likes We Were Soldiers and Black Hawk Down.
Bob and Alice like the same films, and also like Saving Private Ryan.
Joe should watch Saving Private Ryan, because all of them indeed like war films.
Need to uncover topics in items

Predicting ratings with scalar products
Embeddings to model the existence of genre/category/topics
Representative vectors for users and items:
$(\alpha_j)_{1 \le j \le n}, \quad (d_i)_{1 \le i \le p} \in \mathbb{R}^k$
q-th coefficient of $d_i$, $\alpha_j$ = affinity with the "topic" q
Ratings $x_{ij}$ (item i, user j):
$x_{ij} = d_i^\top \alpha_j$ (+ biases)
= common affinity for topics
[Figure: rating $x_{ij}$ as the scalar product of the item vector $d_i$ and the user vector $\alpha_j$ over k topics]
Learning problem: estimate D and A with known ratings

Matrix factorization
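[Figure: the observed matrix $1_\Omega X$ (p × n) approximated by the product of D (p × k) and A (k × n)]
$X \in \mathbb{R}^{p \times n} \approx DA \in \mathbb{R}^{p \times k} \times \mathbb{R}^{k \times n}$
Constraints / penalty on factors D and A
We only observe $1_\Omega X$, where Ω is the set of ratings provided by users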
Recommender systems : millions of users, millions of items
How to scale matrix factorization to very large datasets ?

Formalism
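Finding representations in $\mathbb{R}^k$ for items and users:
$$\min_{D \in \mathbb{R}^{p \times k},\, A \in \mathbb{R}^{k \times n}} \sum_{(i,j) \in \Omega} (x_{ij} - d_i^\top \alpha_j)^2 + \lambda \left( \|D\|_F^2 + \|A\|_F^2 \right)$$
where the objective can be written
$$\|1_\Omega (X - DA)\|_2^2 + \lambda \left( \|D\|_F^2 + \|A\|_F^2 \right)$$
$1_\Omega$: indicator of the set Ω of known ratings
$\ell_2$ reconstruction loss, $\ell_2$ penalty for generalization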
Existing methods
Alternating minimization
Stochastic gradient descent
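The masked objective from the formalism above can be evaluated with a few lines of NumPy. This is only an illustrative sketch: the function name `masked_objective`, the boolean `mask` standing in for $1_\Omega$, and the toy sizes are assumptions, not part of the talk.

```python
import numpy as np

def masked_objective(X, mask, D, A, lam):
    """Squared error on known ratings plus Frobenius penalties on D and A."""
    residual = mask * (X - D @ A)  # entries outside Omega contribute zero
    return np.sum(residual ** 2) + lam * (np.sum(D ** 2) + np.sum(A ** 2))

# Toy data (hypothetical sizes): p items, n users, k topics
rng = np.random.default_rng(0)
p, n, k = 6, 5, 2
D = rng.normal(size=(p, k))      # item vectors d_i are the rows of D
A = rng.normal(size=(k, n))      # user vectors alpha_j are the columns of A
X = D @ A                        # ratings generated exactly by the model
mask = rng.random((p, n)) < 0.5  # True where a rating is known

loss_exact = masked_objective(X, mask, D, A, lam=0.0)
loss_reg = masked_objective(X, mask, D, A, lam=0.1)
```

Only the entries selected by `mask` enter the reconstruction term, which is why the cost of one evaluation can in principle scale with the number of known ratings rather than with p × n.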

Existing methods
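Alternating minimization
Minimize over A, D: alternate between
$$D = \operatorname*{argmin}_{D \in \mathbb{R}^{p \times k}} \sum_{(i,j) \in \Omega} (x_{ij} - d_i^\top \alpha_j)^2 + \lambda \|D\|_F^2$$
$$A = \operatorname*{argmin}_{A \in \mathbb{R}^{k \times n}} \sum_{(i,j) \in \Omega} (x_{ij} - d_i^\top \alpha_j)^2 + \lambda \|A\|_F^2$$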
No hyperparameters
Slow and memory expensive: use all ratings at each iteration
a.k.a. coordinate descent (variation in parameter update order)
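As a rough illustration of the alternation above, each half-step reduces to one small ridge regression per item (row of D) or per user (column of A), restricted to the known ratings. The function names and toy data below are assumptions made for this sketch; it is not the solver benchmarked in the talk.

```python
import numpy as np

def als_step_items(X, mask, D, A, lam):
    """Exact minimization over D with A fixed: one ridge problem per item."""
    k = D.shape[1]
    for i in range(X.shape[0]):
        obs = mask[i]                          # users who rated item i
        A_obs = A[:, obs]                      # shape (k, s_i)
        G = A_obs @ A_obs.T + lam * np.eye(k)
        D[i] = np.linalg.solve(G, A_obs @ X[i, obs])
    return D

def als_step_users(X, mask, D, A, lam):
    """Exact minimization over A with D fixed: one ridge problem per user."""
    k = A.shape[0]
    for j in range(X.shape[1]):
        obs = mask[:, j]                       # items rated by user j
        D_obs = D[obs]                         # shape (s_j, k)
        G = D_obs.T @ D_obs + lam * np.eye(k)
        A[:, j] = np.linalg.solve(G, D_obs.T @ X[obs, j])
    return A

def objective(X, mask, D, A, lam):
    return np.sum((mask * (X - D @ A)) ** 2) + lam * (np.sum(D ** 2) + np.sum(A ** 2))

# Toy problem (hypothetical sizes)
rng = np.random.default_rng(0)
p, n, k, lam = 30, 40, 3, 0.1
X = rng.normal(size=(p, k)) @ rng.normal(size=(k, n))
mask = rng.random((p, n)) < 0.3
D = rng.normal(size=(p, k))
A = rng.normal(size=(k, n))

losses = [objective(X, mask, D, A, lam)]
for _ in range(10):
    D = als_step_items(X, mask, D, A, lam)
    A = als_step_users(X, mask, D, A, lam)
    losses.append(objective(X, mask, D, A, lam))
```

Every sweep touches all known ratings, which matches the "slow and memory expensive" remark: the per-iteration cost grows with the size of Ω.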

Existing methods
Stochastic gradient descent
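$$\min_{A, D} \sum_{(i,j) \in \Omega} f_{ij}(A, D), \qquad f_{ij}(A, D) \overset{\text{def}}{=} (x_{ij} - d_i^\top \alpha_j)^2 + \frac{1}{c_j} \lambda \|\alpha_j\|_2^2 + \frac{1}{c_i} \lambda \|d_i\|_2^2$$
Gradient step for each rating:
$$(A_t, D_t) \leftarrow (A_{t-1}, D_{t-1}) - \frac{1}{c_t} \nabla_{(A,D)} f_{ij}(A_{t-1}, D_{t-1})$$
Fast and memory efficient – won the Netflix prize
Very sensitive to step sizes ($1/c_t$) – need to cross-validate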
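A minimal sketch of the per-rating gradient step, under two assumptions that the slide does not state: $c_i$ and $c_j$ are taken to be the number of ratings of item i and user j, and a constant step size replaces the schedule $1/c_t$. All function and variable names are hypothetical.

```python
import numpy as np

def sgd_epoch(rows, cols, vals, D, A, lam, step, c_item, c_user, rng):
    """One pass over the known ratings in random order, updating D and A in place."""
    for idx in rng.permutation(len(vals)):
        i, j, x = rows[idx], cols[idx], vals[idx]
        d_i, a_j = D[i].copy(), A[:, j].copy()
        err = x - d_i @ a_j                    # residual on this rating
        # Gradient of f_ij with respect to d_i and alpha_j
        grad_d = -2.0 * err * a_j + 2.0 * lam / c_item[i] * d_i
        grad_a = -2.0 * err * d_i + 2.0 * lam / c_user[j] * a_j
        D[i] -= step * grad_d
        A[:, j] -= step * grad_a

def rmse(rows, cols, vals, D, A):
    pred = np.sum(D[rows] * A[:, cols].T, axis=1)
    return np.sqrt(np.mean((vals - pred) ** 2))

# Toy problem (hypothetical sizes), ratings stored as (row, col, value) triplets
rng = np.random.default_rng(0)
p, n, k = 30, 40, 3
X = rng.normal(size=(p, k)) @ rng.normal(size=(k, n))
rows, cols = np.nonzero(rng.random((p, n)) < 0.3)
vals = X[rows, cols]
c_item = np.maximum(np.bincount(rows, minlength=p), 1)  # assumed meaning of c_i
c_user = np.maximum(np.bincount(cols, minlength=n), 1)  # assumed meaning of c_j
D = 0.3 * rng.normal(size=(p, k))
A = 0.3 * rng.normal(size=(k, n))

rmse_before = rmse(rows, cols, vals, D, A)
for _ in range(30):
    sgd_epoch(rows, cols, vals, D, A, lam=0.1, step=0.01,
              c_item=c_item, c_user=c_user, rng=rng)
rmse_after = rmse(rows, cols, vals, D, A)
```

Each update reads and writes only one row of D and one column of A, which is what makes the method memory efficient; the `step` argument is the hyperparameter the slide warns about.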

Towards a new algorithm
Best of both worlds ?
Fast and memory efficient algorithm
Not very sensitive to hyperparameter settings
Subsampled online dictionary learning
Builds upon the online dictionary learning algorithm
popular in computer vision and interpretable learning (fMRI)
Adapt it to handle missing values efficiently

Dictionary learning
Recall: recommender system formalism
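Non-masked matrix factorization with $\ell_2$ penalty:
$$\min_{D \in \mathbb{R}^{p \times k},\, A \in \mathbb{R}^{k \times n}} \sum_{j=1}^{n} \|x_j - D\alpha_j\|_2^2 + \lambda \left( \|D\|_F^2 + \|A\|_F^2 \right)$$
Penalties can be richer, and made into constraints
Dictionary learning
Learn the left side factor [Olshausen and Field, 1997]
$$\min_{D \in \mathcal{C}} \sum_{j=1}^{n} \|x_j - D\alpha_j\|_2^2 + \lambda\,\Omega(\alpha_j), \qquad \alpha_j = \operatorname*{argmin}_{\alpha \in \mathbb{R}^k} \|x_j - D\alpha\|_2^2 + \lambda\,\Omega(\alpha)$$
Naive approach: alternating minimization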

Online dictionary learning [Mairal et al., 2010]
At iteration t, select xt in {xj }j (user ratings), improve D
Single iteration complexity ∝ sample dimension O(p)
(Dt)t converges in a few epochs (one for large n)
[Figure: columns $x_t$ of X (p × n) are streamed one at a time; each is coded as $\alpha_t$ on the dictionary D (p × k)]
Very efficient in computer vision / networks / fMRI / hyperspectral images
Can we use it efficiently for recommender systems?

In short: Handling missing values
Handle large n: batch → online (stream samples $x_t$ from the p × n matrix X)
Handle missing values: online → online + partial (stream masked samples $M_t x_t$; unknown entries are ignored and left unaccessed)
Leverage streaming + partial access to samples

In detail: online dictionary learning
Objective function involves latent codes (right side factor)
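$$\min_{D \in \mathcal{C}} \frac{1}{t} \sum_{i=1}^{t} \|x_i - D\alpha_i^*(D)\|_2^2, \qquad \alpha_i^*(D) = \operatorname*{argmin}_{\alpha} \frac{1}{2} \|x_i - D\alpha\|_2^2 + \lambda\,\Omega(\alpha)$$
Replace latent codes by codes computed with old dictionaries
Build an upper-bounding surrogate function
$$\min_{D} \frac{1}{t} \sum_{i=1}^{t} \|x_i - D\alpha_i\|_2^2, \qquad \alpha_i = \operatorname*{argmin}_{\alpha} \frac{1}{2} \|x_i - D_{i-1}\alpha\|_2^2 + \lambda\,\Omega(\alpha)$$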
Minimize surrogate — updateable online at low cost

Algorithm outline
1 Compute code: uses $x_t$, so complexity depends on p
$$\alpha_t = \operatorname*{argmin}_{\alpha \in \mathbb{R}^k} \|x_t - D_{t-1}\alpha\|_2^2 + \lambda\,\Omega(\alpha)$$
2 Update the surrogate function: complexity in O(p)
$$g_t(D) = \frac{1}{t} \sum_{i=1}^{t} \|x_i - D\alpha_i\|_2^2 = \operatorname{Tr}\left( \frac{1}{2} D^\top D A_t - D^\top B_t \right) \quad \text{(up to a constant term and scaling)}$$
$$A_t = \left(1 - \frac{1}{t}\right) A_{t-1} + \frac{1}{t} \alpha_t \alpha_t^\top, \qquad B_t = \left(1 - \frac{1}{t}\right) B_{t-1} + \frac{1}{t} x_t \alpha_t^\top$$
3 Minimize surrogate: complexity in O(p)
$$D_t = \operatorname*{argmin}_{D \in \mathcal{C}} g_t(D), \qquad \nabla g_t(D) = D A_t - B_t$$
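The three steps of the outline can be sketched in NumPy as follows. Two simplifying assumptions are made here and are not taken from the talk: the penalty is $\Omega(\alpha) = \|\alpha\|_2^2$, so the code has a closed form, and the constraint set $\mathcal{C}$ is the set of dictionaries whose columns lie in the unit $\ell_2$ ball. The function name `odl_step` and the toy stream are hypothetical.

```python
import numpy as np

def odl_step(x, D, A_stat, B_stat, t, lam):
    """One online dictionary learning iteration on sample x (t starts at 1)."""
    k = D.shape[1]
    # 1. Code computation (ridge penalty assumed, closed form)
    alpha = np.linalg.solve(D.T @ D + lam * np.eye(k), D.T @ x)
    # 2. Surrogate update: running averages of alpha alpha^T and x alpha^T
    A_stat = (1 - 1 / t) * A_stat + np.outer(alpha, alpha) / t
    B_stat = (1 - 1 / t) * B_stat + np.outer(x, alpha) / t
    # 3. Surrogate minimization: one pass of block coordinate descent on the
    #    columns of D, using the gradient D A_t - B_t, then projection
    for j in range(k):
        if A_stat[j, j] > 1e-12:
            d = D[:, j] - (D @ A_stat[:, j] - B_stat[:, j]) / A_stat[j, j]
            D[:, j] = d / max(1.0, np.linalg.norm(d))  # project on unit ball
    return D, A_stat, B_stat, alpha

# Toy stream (hypothetical sizes): one column of X per iteration
rng = np.random.default_rng(0)
p, k, n_samples = 20, 4, 200
X = rng.normal(size=(p, k)) @ rng.normal(size=(k, n_samples))
D = rng.normal(size=(p, k))
D /= np.linalg.norm(D, axis=0)
A_stat = np.zeros((k, k))
B_stat = np.zeros((p, k))
codes = []
for t in range(1, n_samples + 1):
    D, A_stat, B_stat, alpha = odl_step(X[:, t - 1], D, A_stat, B_stat, t, lam=0.1)
    codes.append(alpha)
```

Steps 1 and 3 both touch all p rows of D, and step 2 touches all p rows of `B_stat`, which is the O(p) dependence flagged in the outline.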

Specification for a new algorithm
[Figure: stream of masked samples $M_t x_t$ from the p × n matrix; unknown entries are ignored]
Constrained: use only known ratings from Ω
Efficient: single iteration in O(s), where s is the # of ratings provided by user t
Principled: follows the online matrix factorization algorithm as much as possible

Missing values in practice
Data stream: $(x_t)_t$ → masked $(M_t x_t)_t$ = ratings from user t
Dimension: p (all items) → s (rated items)
Use only $M_t x_t$ in algorithm computation → complexity in O(s)
[Figure: stream of masked samples $M_t x_t$ from the p × n matrix; unknown entries are ignored]
Adaptation to make
Modify all parts of the algorithm to obtain O(s) complexity:
1 Code computation
2 Surrogate update
3 Surrogate minimization

Subsampled online dictionary learning
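Check out the paper!
Original online MF
1 Code computation
$$\alpha_t = \operatorname*{argmin}_{\alpha \in \mathbb{R}^k} \|x_t - D_{t-1}\alpha\|_2^2 + \lambda\,\Omega(\alpha)$$
2 Surrogate aggregation
$$A_t = \frac{1}{t} \sum_{i=1}^{t} \alpha_i \alpha_i^\top, \qquad B_t = B_{t-1} + \frac{1}{t} \left( x_t \alpha_t^\top - B_{t-1} \right)$$
3 Surrogate minimization
$$D^j \leftarrow p^\perp_{\mathcal{C}^r_j} \left( D^j - \frac{1}{(A_t)_{j,j}} \left( D A_t^j - B_t^j \right) \right)$$
Our algorithm
1 Code computation: masked loss
$$\alpha_t = \operatorname*{argmin}_{\alpha \in \mathbb{R}^k} \|M_t (x_t - D_{t-1}\alpha)\|_2^2 + \lambda\,\frac{\operatorname{rk} M_t}{p}\,\Omega(\alpha)$$
2 Surrogate aggregation
$$A_t = \frac{1}{t} \sum_{i=1}^{t} \alpha_i \alpha_i^\top, \qquad B_t = B_{t-1} + \frac{1}{\sum_{i=1}^{t} M_i} \left( M_t x_t \alpha_t^\top - M_t B_{t-1} \right)$$
3 Surrogate minimization
$$M_t D^j \leftarrow p^\perp_{\mathcal{C}_j} \left( M_t D^j - \frac{1}{(A_t)_{j,j}} M_t \left( D A_t^j - B_t^j \right) \right)$$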
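To make the masked code computation concrete, here is a small sketch assuming the penalty $\Omega(\alpha) = \|\alpha\|_2^2$ (the talk leaves Ω generic) and representing the mask $M_t$ by the index array of rated items, so that $\operatorname{rk} M_t = s$. The function name `masked_code` and the toy sizes are hypothetical.

```python
import numpy as np

def masked_code(x_obs, obs_idx, D, lam):
    """Code for one user computed from the s rated items only.

    Solves min_alpha ||M_t (x_t - D alpha)||^2 + lam * (s / p) * ||alpha||^2,
    where M_t selects the rows listed in obs_idx.
    """
    p, k = D.shape
    s = len(obs_idx)
    D_obs = D[obs_idx]                              # (s, k): rows of rated items
    G = D_obs.T @ D_obs + lam * (s / p) * np.eye(k)
    return np.linalg.solve(G, D_obs.T @ x_obs)

# Toy user (hypothetical sizes): s ratings out of p items
rng = np.random.default_rng(0)
p, k, s, lam = 50, 4, 10, 0.5
D = rng.normal(size=(p, k))
obs_idx = rng.choice(p, size=s, replace=False)
x_obs = rng.normal(size=s)
alpha = masked_code(x_obs, obs_idx, D, lam)
```

Only the s selected rows of D are read, so the cost of this step scales with s rather than p; when every item is rated (s = p) the formula reduces to the unmasked ridge code.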

Experiments
Validation: test RMSE (rating prediction) vs CPU time
Baseline: coordinate descent solver [Yu et al., 2012] for
$$\min_{D \in \mathbb{R}^{p \times k},\, A \in \mathbb{R}^{k \times n}} \sum_{(i,j) \in \Omega} (x_{ij} - d_i^\top \alpha_j)^2 + \lambda \left( \|D\|_F^2 + \|A\|_F^2 \right)$$
Fastest solver available apart from SGD, which has hyperparameters to tune
Our method has a learning rate with little influence
Datasets: MovieLens, Netflix
Publicly available
Larger ones exist in industry...

Results
Scalable algorithm: speed-up improves with size

Performance
Dataset      Test RMSE        Convergence time    Speed-up
             CD      SODL     CD        SODL
ML 1M        0.872   0.866    6 s       8 s       ×0.75
ML 10M       0.802   0.799    223 s     60 s      ×3.7
NF (140M)    0.938   0.934    1714 s    256 s     ×6.8
Outperforms coordinate descent beyond 10M ratings
Same prediction performance
Speed-up of 6.8× on Netflix
Simple model: RMSE is not state-of-the-art

Robustness to learning rate
Learning rate in algorithm to be set in [0.75, 1] (← theory)
In practice: Just set it in [0.8, 1]
[Figure: RMSE on test set vs. epoch for learning rates β from 0.75 to 1.00; panels: MovieLens 10M and Netflix]

Conclusion
Take-home message
Online matrix factorization can be adapted to handle missing values efficiently, with very good performance in recommender systems
[Figure: stream of masked samples $M_t x_t$ from the p × n matrix; unknown entries are ignored]
Algorithm usable in any rich model involving matrix factorization
Python package http://github.com/arthurmensch/modl
Article/slides at http://amensch.fr/publications

Questions ?

Appendix: Resting-state fMRI
Online dictionary learning: 235 h run time, 1 full epoch
Online dictionary learning: 10 h run time, 1/24 epoch
Proposed method: 10 h run time, 1/2 epoch, reduction r=12
Qualitatively, usable maps are obtained 10× faster

Bibliography I
[Bell and Koren, 2007] Bell, R. M. and Koren, Y. (2007).
Lessons from the Netflix prize challenge.
ACM SIGKDD Explorations Newsletter, 9(2):75–79.
[Mairal et al., 2010] Mairal, J., Bach, F., Ponce, J., and Sapiro, G. (2010).
Online learning for matrix factorization and sparse coding.
The Journal of Machine Learning Research, 11:19–60.
[Olshausen and Field, 1997] Olshausen, B. A. and Field, D. J. (1997).
Sparse coding with an overcomplete basis set: A strategy employed by V1?
Vision Research, 37(23):3311–3325.
[Yu et al., 2012] Yu, H.-F., Hsieh, C.-J., and Dhillon, I. (2012).
Scalable coordinate descent approaches to parallel matrix factorization for recommender systems.
In Proceedings of the International Conference on Data Mining, pages 765–774. IEEE.
