Dictionary Learning for
Massive Matrix Factorization
Arthur Mensch, Julien Mairal
Gaël Varoquaux, Bertrand Thirion
Inria Parietal, Inria Thoth
October 6, 2016
Introduction
Why am I here ?
Inria Parietal: machine learning for neuro-imaging
(fMRI data)
Matrix factorization: major ingredient in fMRI analysis
Very large datasets (2 TB): we designed faster algorithms
These algorithms can be used in collaborative filtering
[Figure: X (voxels × time) = D (voxels × k spatial maps) times A (k × time)]
Work presented at ICML 2016
Arthur Mensch Dictionary Learning for Massive Matrix Factorization 1 / 28
Outline
1 Matrix factorization for recommender systems
Collaborative filtering
Matrix factorization formulation
Existing methods
2 Subsampled online dictionary learning
Dictionary learning – existing methods
Handling missing values efficiently
New algorithm
3 Results
Setting
Benchmarks
Parameter setting
Collaborative filtering
Collaborative platform
n users rate a fraction of p items
e.g. movies, restaurants
Estimate ratings for recommendation
Use the ratings of other users for recommendation
How to predict ratings ?
Credit: [Bell and Koren, 2007]
Joe likes We Were Soldiers and Black Hawk Down.
Bob and Alice like the same films, and also like Saving Private Ryan.
Joe should watch Saving Private Ryan, since all three of them like war films.
Need to uncover topics in items
Predicting ratings with scalar products
Embeddings model the existence of genres/categories/topics
Representative vectors for users and items: $(\alpha_j)_{1 \le j \le n}$, $(d_i)_{1 \le i \le p} \in \mathbb{R}^k$
q-th coefficient of $d_i$, $\alpha_j$ = affinity with the “topic” q
Ratings $x_{ij}$ (item i, user j): $x_{ij} = d_i^\top \alpha_j$ (+ biases) = common affinity for topics
[Figure: rating $x_{ij}$ as the scalar product of $d_i$ and $\alpha_j$ over k topics]
Learning problem: estimate D and A with known ratings
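A minimal sketch of the prediction rule above, assuming NumPy and toy random embeddings (the sizes, the seed and the helper name `predict_rating` are illustrative, not from the talk); biases are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, k = 5, 4, 3  # items, users, topics (toy sizes, chosen for illustration)

D = rng.normal(size=(p, k))  # row i holds the item embedding d_i
A = rng.normal(size=(k, n))  # column j holds the user embedding alpha_j


def predict_rating(D, A, i, j):
    """Predicted rating of item i by user j: the scalar product d_i^T alpha_j."""
    return float(D[i] @ A[:, j])


# All p x n predicted ratings at once: the matrix product D A.
X_hat = D @ A
```

Each entry of `X_hat` is the same scalar product, which is why estimating D and A amounts to a matrix factorization problem.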
Matrix factorization
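[Figure: masked rating matrix $1_\Omega \odot X$ (p × n) ≈ D (p × k) times A (k × n)]
$X \in \mathbb{R}^{p \times n} \approx DA$, with $D \in \mathbb{R}^{p \times k}$ and $A \in \mathbb{R}^{k \times n}$
Constraints / penalty on factors D and A
We only observe $1_\Omega \odot X$, where $\Omega$ is the set of ratings provided by users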
Recommender systems : millions of users, millions of items
How to scale matrix factorization to very large datasets ?
Formalism
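Finding representations in $\mathbb{R}^k$ for items and users:
$$\begin{aligned}
&\min_{D \in \mathbb{R}^{p \times k},\, A \in \mathbb{R}^{k \times n}} \sum_{(i,j) \in \Omega} (x_{ij} - d_i^\top \alpha_j)^2 + \lambda \left( \|D\|_F^2 + \|A\|_F^2 \right) \\
&= \min_{D,\, A} \; \|1_\Omega \odot (X - DA)\|_2^2 + \lambda \left( \|D\|_F^2 + \|A\|_F^2 \right)
\end{aligned}$$
$1_\Omega$: indicator of the set of known ratings
$\ell_2$ reconstruction loss, $\ell_2$ penalty for generalization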
Existing methods
Alternated minimization
Stochastic gradient descent
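As an illustration of the objective above, here is a small sketch that evaluates the masked reconstruction loss plus the Frobenius penalties. It assumes NumPy and a dense 0/1 (or boolean) `mask` array standing in for $1_\Omega$; the function name is hypothetical, and a real recommender dataset would use a sparse representation instead.

```python
import numpy as np


def masked_objective(X, mask, D, A, lam):
    """Squared error on observed entries plus lam * (||D||_F^2 + ||A||_F^2)."""
    residual = mask * (X - D @ A)  # unobserved entries contribute zero
    penalty = np.sum(D ** 2) + np.sum(A ** 2)
    return float(np.sum(residual ** 2) + lam * penalty)
```

Only the entries selected by `mask` enter the loss, so the values stored in the unobserved positions of `X` are irrelevant.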
Existing methods
Alternated minimization
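Minimize over A, D: alternate between
$$D = \operatorname*{argmin}_{D \in \mathbb{R}^{p \times k}} \sum_{(i,j) \in \Omega} (x_{ij} - d_i^\top \alpha_j)^2 + \lambda \|D\|_F^2$$
$$A = \operatorname*{argmin}_{A \in \mathbb{R}^{k \times n}} \sum_{(i,j) \in \Omega} (x_{ij} - d_i^\top \alpha_j)^2 + \lambda \|A\|_F^2$$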
No hyperparameters
Slow and memory expensive: use all ratings at each iteration
a.k.a. coordinate descent (variation in parameter update order)
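The alternation above can be sketched as follows. This is an illustrative NumPy implementation, not the solver benchmarked later in the talk: it assumes a dense boolean `mask` for $\Omega$, and each sub-problem is solved exactly as a small ridge regression. The names `objective` and `als_pass` are hypothetical.

```python
import numpy as np


def objective(X, mask, D, A, lam):
    """Masked squared error plus Frobenius penalties on D and A."""
    err = (X - D @ A)[mask]
    return float(np.sum(err ** 2) + lam * (np.sum(D ** 2) + np.sum(A ** 2)))


def als_pass(X, mask, D, A, lam):
    """One alternating pass: update every alpha_j with D fixed, then every d_i."""
    k = D.shape[1]
    D, A = D.copy(), A.copy()
    for j in range(X.shape[1]):
        rows = mask[:, j]              # items rated by user j
        Dj = D[rows]
        A[:, j] = np.linalg.solve(Dj.T @ Dj + lam * np.eye(k), Dj.T @ X[rows, j])
    for i in range(X.shape[0]):
        cols = mask[i]                 # users who rated item i
        Ai = A[:, cols]
        D[i] = np.linalg.solve(Ai @ Ai.T + lam * np.eye(k), Ai @ X[i, cols])
    return D, A
```

Every pass touches all known ratings, which is the cost the slide points out; in exchange, each block update is an exact minimization, so the objective cannot increase.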
Existing methods
Stochastic gradient descent
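$$\min_{A, D} \sum_{(i,j) \in \Omega} f_{ij}(A, D), \qquad f_{ij}(A, D) \overset{\text{def}}{=} (x_{ij} - d_i^\top \alpha_j)^2 + \frac{1}{c_j} \lambda \|\alpha_j\|_2^2 + \frac{1}{c_i} \lambda \|d_i\|_2^2$$
Gradient step for each rating:
$$(A_t, D_t) \leftarrow (A_{t-1}, D_{t-1}) - \frac{1}{c_t} \nabla_{(A,D)} f_{ij}(A_{t-1}, D_{t-1})$$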
Fast and memory efficient – won the Netflix prize
Very sensitive to step sizes (ct) – need to cross-validate
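A sketch of one such gradient step for a single rating, assuming NumPy. The slide does not define $c_i$ and $c_j$; the code simply takes them as inputs (a common choice, assumed here, is the number of ratings of item i and of user j), and the step size is passed as `step` in place of $1/c_t$. The function names are hypothetical.

```python
import numpy as np


def f_ij(x_ij, d_i, alpha_j, lam, c_i, c_j):
    """Per-rating loss: squared error plus rescaled l2 penalties."""
    err = x_ij - d_i @ alpha_j
    return float(err ** 2 + lam / c_j * alpha_j @ alpha_j + lam / c_i * d_i @ d_i)


def sgd_step(x_ij, d_i, alpha_j, lam, c_i, c_j, step):
    """One gradient step on f_ij; only d_i and alpha_j are touched."""
    err = x_ij - d_i @ alpha_j
    grad_d = -2.0 * err * alpha_j + 2.0 * lam / c_i * d_i
    grad_a = -2.0 * err * d_i + 2.0 * lam / c_j * alpha_j
    return d_i - step * grad_d, alpha_j - step * grad_a
```

Each step only reads one rating and two embedding vectors, hence the low memory footprint; the result depends directly on `step`, which is the sensitivity mentioned above.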
Towards a new algorithm
Best of both worlds ?
Fast and memory efficient algorithm
Low sensitivity to hyperparameter settings
Subsampled online dictionary learning
Builds upon the online dictionary learning algorithm
popular in computer vision and interpretable learning (fMRI)
Adapt it to handle missing values efficiently
Dictionary learning
Recall: recommender system formalism
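Non-masked matrix factorization with $\ell_2$ penalty:
$$\min_{D \in \mathbb{R}^{p \times k},\, A \in \mathbb{R}^{k \times n}} \sum_{j=1}^{n} \|x_j - D \alpha_j\|_2^2 + \lambda \left( \|D\|_F^2 + \|A\|_F^2 \right)$$
Penalties can be richer, and made into constraints
Dictionary learning
Learn the left side factor [Olshausen and Field, 1997]
$$\min_{D \in \mathcal{C}} \sum_{j=1}^{n} \|x_j - D \alpha_j\|_2^2 + \lambda \Omega(\alpha_j), \qquad \alpha_j = \operatorname*{argmin}_{\alpha \in \mathbb{R}^k} \|x_j - D \alpha\|_2^2 + \lambda \Omega(\alpha)$$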
Naive approach: alternated minimization
Online dictionary learning [Mairal et al., 2010]
At iteration t, select $x_t$ in $\{x_j\}_j$ (user ratings), improve D
Single iteration complexity ∝ sample dimension: O(p)
$(D_t)_t$ converges in a few epochs (one for large n)
[Figure: columns $x_t$ of X (p × n) are streamed, each approximated as $D \alpha_t$ with D of size p × k]
Very efficient in computer vision / networks / fMRI / hyperspectral images
Can we use it efficiently for recommender systems ?
In short: Handling missing values
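Handle large n: batch → online (stream samples $x_t$ from X, of size p × n)
Handle missing values: online → online + partial (stream masked samples $M_t x_t$)
[Figure: stream of samples $x_t$ vs. stream of masked samples $M_t x_t$; unknown entries are ignored and never accessed]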
Leverage streaming + partial access to samples
In detail: online dictionary learning
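Objective function involves latent codes (right side factor)
$$\min_{D \in \mathcal{C}} \frac{1}{t} \sum_{i=1}^{t} \frac{1}{2} \|x_i - D \alpha_i^*(D)\|_2^2, \qquad \alpha_i^*(D) = \operatorname*{argmin}_{\alpha} \frac{1}{2} \|x_i - D \alpha\|_2^2 + \lambda \Omega(\alpha)$$
Replace latent codes by codes computed with old dictionaries
Build an upper-bounding surrogate function
$$\min_{D \in \mathcal{C}} \frac{1}{t} \sum_{i=1}^{t} \frac{1}{2} \|x_i - D \alpha_i\|_2^2, \qquad \alpha_i = \operatorname*{argmin}_{\alpha} \frac{1}{2} \|x_i - D_{i-1} \alpha\|_2^2 + \lambda \Omega(\alpha)$$
Minimize the surrogate, which can be updated online at low cost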
In detail: online dictionary learning
Algorithm outline
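1 Compute code: uses $x_t$, so complexity depends on p
$$\alpha_t = \operatorname*{argmin}_{\alpha \in \mathbb{R}^k} \frac{1}{2} \|x_t - D_{t-1} \alpha\|_2^2 + \lambda \Omega(\alpha)$$
2 Update the surrogate function: complexity in O(p)
$$g_t(D) = \frac{1}{t} \sum_{i=1}^{t} \frac{1}{2} \|x_i - D \alpha_i\|_2^2 = \mathrm{Tr}\left( \frac{1}{2} D^\top D A_t - D^\top B_t \right) + \text{const}$$
$$A_t = \left(1 - \frac{1}{t}\right) A_{t-1} + \frac{1}{t} \alpha_t \alpha_t^\top, \qquad B_t = \left(1 - \frac{1}{t}\right) B_{t-1} + \frac{1}{t} x_t \alpha_t^\top$$
3 Minimize surrogate: complexity in O(p)
$$D_t = \operatorname*{argmin}_{D \in \mathcal{C}} g_t(D), \qquad \nabla g_t(D) = D A_t - B_t$$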
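The three steps can be sketched in a few lines of NumPy. This is a simplified illustration, not the reference implementation: it assumes $\Omega(\alpha) = \frac{1}{2}\|\alpha\|_2^2$ so that the code has a closed form, takes $\mathcal{C}$ to be the set of dictionaries whose columns have $\ell_2$ norm at most 1, and minimizes the surrogate with a single pass of projected block coordinate descent over the columns of D. The function name `online_dl` is hypothetical.

```python
import numpy as np


def online_dl(X, k, lam, n_iter, seed=0):
    """Online dictionary learning on the columns of X (p x n), ridge codes."""
    rng = np.random.default_rng(seed)
    p, n = X.shape
    D = rng.normal(size=(p, k))
    D /= np.linalg.norm(D, axis=0)       # start with unit-norm columns (inside C)
    A = np.zeros((k, k))                 # running average of alpha alpha^T
    B = np.zeros((p, k))                 # running average of x alpha^T
    for t in range(1, n_iter + 1):
        x = X[:, rng.integers(n)]        # draw one sample from the stream
        # 1. Code computation with the previous dictionary (ridge closed form).
        alpha = np.linalg.solve(D.T @ D + lam * np.eye(k), D.T @ x)
        # 2. Surrogate update.
        A = (1 - 1 / t) * A + np.outer(alpha, alpha) / t
        B = (1 - 1 / t) * B + np.outer(x, alpha) / t
        # 3. Surrogate minimization: one projected pass over the columns of D.
        for j in range(k):
            if A[j, j] > 1e-12:
                d = D[:, j] - (D @ A[:, j] - B[:, j]) / A[j, j]
                D[:, j] = d / max(1.0, np.linalg.norm(d))
    return D
```

Steps 1 to 3 each read or write full length-p vectors, which is where the O(p) cost per iteration comes from.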
Specification for a new algorithm
[Figure: stream of masked samples $M_t x_t$ from the p × n matrix; unobserved entries are ignored]
Constrained: use only known ratings from Ω
Efficient: single iteration in O(s), where s is the number of ratings provided by user t
Principled: follows the online matrix factorization algorithm as much as possible
Missing values in practice
Data stream: $(x_t)_t$ → masked $(M_t x_t)_t$ = ratings from user t
Dimension: p (all items) → s (rated items)
Use only $M_t x_t$ in algorithm computation → complexity in O(s)
[Figure: stream of masked samples $M_t x_t$ from the p × n matrix; unobserved entries are ignored]
Adaptation to make
Modify all parts of the algorithm to obtain O(s) complexity
1 Code computation
2 Surrogate update
3 Surrogate minimization
Subsampled online dictionary learning
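Check out the paper!
Original online MF
1 Code computation
$$\alpha_t = \operatorname*{argmin}_{\alpha \in \mathbb{R}^k} \frac{1}{2} \|x_t - D_{t-1} \alpha\|_2^2 + \lambda \Omega(\alpha)$$
2 Surrogate aggregation
$$A_t = \frac{1}{t} \sum_{i=1}^{t} \alpha_i \alpha_i^\top, \qquad B_t = B_{t-1} + \frac{1}{t} \left( x_t \alpha_t^\top - B_{t-1} \right)$$
3 Surrogate minimization
$$D^j \leftarrow p^{\perp}_{\mathcal{C}^r_j} \left( D^j - \frac{1}{(A_t)_{j,j}} \left( D A_t^j - B_t^j \right) \right)$$
Our algorithm
1 Code computation: masked loss
$$\alpha_t = \operatorname*{argmin}_{\alpha \in \mathbb{R}^k} \frac{1}{2} \|M_t (x_t - D_{t-1} \alpha)\|_2^2 + \lambda \frac{\operatorname{rk} M_t}{p} \Omega(\alpha)$$
2 Surrogate aggregation
$$A_t = \frac{1}{t} \sum_{i=1}^{t} \alpha_i \alpha_i^\top, \qquad B_t = B_{t-1} + \frac{1}{\sum_{i=1}^{t} M_i} \left( M_t x_t \alpha_t^\top - M_t B_{t-1} \right)$$
3 Surrogate minimization
$$M_t D^j \leftarrow p^{\perp}_{\mathcal{C}_j} \left( M_t D^j - \frac{1}{(A_t)_{j,j}} M_t \left( D A_t^j - B_t^j \right) \right)$$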
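To illustrate why the masked code computation costs O(s) rather than O(p), here is a sketch of step 1 of the proposed algorithm. It assumes NumPy and $\Omega(\alpha) = \frac{1}{2}\|\alpha\|_2^2$ (the slides leave $\Omega$ general), and represents the mask $M_t$ by the index array `idx` of the s rated items; the function name `masked_code` is hypothetical.

```python
import numpy as np


def masked_code(x_obs, idx, D, lam):
    """Code of one user computed from the s observed ratings only.

    x_obs: (s,) observed ratings; idx: (s,) indices of the rated items.
    """
    p, k = D.shape
    D_obs = D[idx]                        # s x k: rows of D for the rated items
    scaled_lam = lam * len(idx) / p       # lambda * rk(M_t) / p
    gram = D_obs.T @ D_obs + scaled_lam * np.eye(k)
    return np.linalg.solve(gram, D_obs.T @ x_obs)
```

Only s rows of D are read, so the cost is O(s k^2 + k^3) regardless of p. The surrogate update and minimization can be restricted to the same rows in a similar way, as written in steps 2 and 3 above.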
Experiments
Validation: test RMSE (rating prediction) vs CPU time
Baseline: coordinate descent solver [Yu et al., 2012] for
$$\min_{D \in \mathbb{R}^{p \times k},\, A \in \mathbb{R}^{k \times n}} \sum_{(i,j) \in \Omega} (x_{ij} - d_i^\top \alpha_j)^2 + \lambda \left( \|D\|_F^2 + \|A\|_F^2 \right)$$
Fastest solver available apart from SGD, which requires tuning hyperparameters
Our method has a learning rate, but it has little influence
Datasets: MovieLens, Netflix
Publicly available
Larger ones exist in industry...
Results
Scalable algorithm: speed-up improves with size
Performance
Dataset      Test RMSE         Convergence time    Speed-up
             CD      SODL      CD        SODL
ML 1M        0.872   0.866     6 s       8 s       ×0.75
ML 10M       0.802   0.799     223 s     60 s      ×3.7
NF (140M)    0.938   0.934     1714 s    256 s     ×6.8
Outperforms coordinate descent beyond 10M ratings
Same prediction performance
Speed-up 6.8× on Netflix
Simple model: RMSE is not state-of-the-art
Robustness to learning rate
Learning rate in algorithm to be set in [0.75, 1] (← theory)
In practice: Just set it in [0.8, 1]
[Figure: RMSE on test set vs. epoch for learning rates β ranging from 0.75 to 1.00; panels: MovieLens 10M and Netflix]
Conclusion
Take-home message
Online matrix factorization can be adapted to handle missing values efficiently, with very good performance in recommender systems
[Figure: stream of masked samples $M_t x_t$ from the p × n matrix; unobserved entries are ignored]
Algorithm usable in any rich model involving matrix factorization
Python package http://github.com/arthurmensch/modl
Article/slides at http://amensch.fr/publications
Questions ?
Appendix: Resting-state fMRI
Online dictionary learning: 235 h run time, 1 full epoch
Online dictionary learning: 10 h run time, 1/24 epoch
Proposed method: 10 h run time, 1/2 epoch, reduction r=12
Qualitatively, usable maps are obtained 10× faster
Bibliography I
[Bell and Koren, 2007] Bell, R. M. and Koren, Y. (2007).
Lessons from the Netflix prize challenge.
ACM SIGKDD Explorations Newsletter, 9(2):75–79.
[Mairal et al., 2010] Mairal, J., Bach, F., Ponce, J., and Sapiro, G. (2010).
Online learning for matrix factorization and sparse coding.
The Journal of Machine Learning Research, 11:19–60.
[Olshausen and Field, 1997] Olshausen, B. A. and Field, D. J. (1997).
Sparse coding with an overcomplete basis set: A strategy employed by V1?
Vision Research, 37(23):3311–3325.
[Yu et al., 2012] Yu, H.-F., Hsieh, C.-J., and Dhillon, I. (2012).
Scalable coordinate descent approaches to parallel matrix factorization for
recommender systems.
In Proceedings of the International Conference on Data Mining, pages
765–774. IEEE.

Health Advances
 

Massive Matrix Factorization: Applications to collaborative filtering

  • 1. Dictionary Learning for Massive Matrix Factorization. Arthur Mensch, Julien Mairal, Gaël Varoquaux, Bertrand Thirion. Inria Parietal, Inria Thoth. October 6, 2016.
  • 2. Introduction: why am I here? Inria Parietal works on machine learning for neuroimaging (fMRI data). Matrix factorization is a major ingredient in fMRI analysis. The datasets are very large (2 TB), so we designed faster algorithms. These algorithms can also be used in collaborative filtering. [Figure: $X$ (voxels × time) = $D$ ($k$ spatial maps) × $A$ (time).] Work presented at ICML 2016. Arthur Mensch, Dictionary Learning for Massive Matrix Factorization, 1 / 28
  • 3. Outline. (1) Matrix factorization for recommender systems: collaborative filtering; matrix factorization formulation; existing methods. (2) Subsampled online dictionary learning: dictionary learning, existing methods; handling missing values efficiently; new algorithm. (3) Results: setting; benchmarks; parameter setting.
  • 5. Collaborative filtering. On a collaborative platform, $n$ users rate a fraction of $p$ items (e.g. movies, restaurants). The goal is to estimate ratings for recommendation, using the ratings of other users.
  • 6. How to predict ratings? Credit: [Bell and Koren, 2007]. Joe likes We Were Soldiers and Black Hawk Down. Bob and Alice like the same films, and also like Saving Private Ryan. Joe should watch Saving Private Ryan, because all three of them evidently like war films. We need to uncover topics in items.
  • 9. Predicting ratings with scalar products. Embeddings model the existence of genres, categories, or topics. Representative vectors for users and items: $(\alpha_j)_{1 \le j \le n}$ and $(d_i)_{1 \le i \le p}$, all in $\mathbb{R}^k$. The $q$-th coefficient of $d_i$ or $\alpha_j$ is its affinity with "topic" $q$. The rating $x_{ij}$ of item $i$ by user $j$ is $x_{ij} = d_i^\top \alpha_j$ (+ biases), i.e. the common affinity for topics. [Figure: $x_{ij}$ as the scalar product of $d_i$ and $\alpha_j$ over $k$ topics.] Learning problem: estimate $D$ and $A$ from the known ratings.
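The scalar-product model on this slide can be illustrated with a minimal NumPy sketch. The sizes and random embeddings below are made up for illustration, biases are left out, and `predict_rating` is a hypothetical helper name, not part of any package mentioned in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, k = 5, 4, 3  # items, users, topics (illustrative sizes)

D = rng.normal(size=(p, k))  # row i is the item embedding d_i
A = rng.normal(size=(k, n))  # column j is the user embedding alpha_j


def predict_rating(D, A, i, j):
    """Predicted rating of item i by user j: the scalar product of d_i and alpha_j."""
    return float(D[i] @ A[:, j])


# All p x n predicted ratings at once: entry (i, j) is the same scalar product.
X_hat = D @ A
```

Stacking the embeddings into $D$ and $A$ is what turns rating prediction into the matrix factorization problem of the following slides.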
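  • 10. Matrix factorization. [Figure: the masked $p \times n$ matrix $1_\Omega \odot X$ approximated by the product of $D$ ($p \times k$) and $A$ ($k \times n$).] $X \in \mathbb{R}^{p \times n} \approx DA$, with $D \in \mathbb{R}^{p \times k}$ and $A \in \mathbb{R}^{k \times n}$. Constraints or penalties are placed on the factors $D$ and $A$. We only observe $1_\Omega \odot X$, where $\Omega$ is the set of ratings provided by users. Recommender systems involve millions of users and millions of items: how can matrix factorization scale to very large datasets?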
  • 11. Formalism. Finding representations in $\mathbb{R}^k$ for items and users: $\min_{D \in \mathbb{R}^{p \times k},\, A \in \mathbb{R}^{k \times n}} \sum_{(i,j) \in \Omega} (x_{ij} - d_i^\top \alpha_j)^2 + \lambda (\|D\|_F^2 + \|A\|_F^2) = \|1_\Omega \odot (X - DA)\|_2^2 + \lambda (\|D\|_F^2 + \|A\|_F^2)$, where $1_\Omega$ is the indicator of the set of known ratings. This is an $\ell_2$ reconstruction loss with an $\ell_2$ penalty for generalization. Existing methods: alternated minimization and stochastic gradient descent.
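As a sanity check on the two ways of writing this objective, here is a small sketch that evaluates it both as a masked matrix norm and as an explicit sum over $\Omega$. The function names, sizes and random data are illustrative assumptions. The sketch also assumes $X$ is stored densely with finite values at unobserved entries, so that multiplying by the mask zeroes them out.

```python
import numpy as np


def masked_objective(X, mask, D, A, lam):
    """||1_Omega * (X - DA)||^2 + lam * (||D||_F^2 + ||A||_F^2)."""
    residual = mask * (X - D @ A)  # unobserved entries contribute zero
    penalty = lam * (np.sum(D ** 2) + np.sum(A ** 2))
    return float(np.sum(residual ** 2) + penalty)


def sum_objective(X, mask, D, A, lam):
    """The same objective as an explicit sum over the pairs (i, j) in Omega."""
    loss = sum((X[i, j] - D[i] @ A[:, j]) ** 2 for i, j in zip(*np.nonzero(mask)))
    return float(loss + lam * (np.sum(D ** 2) + np.sum(A ** 2)))


rng = np.random.default_rng(0)
p, n, k = 6, 5, 2
X = rng.normal(size=(p, n))
mask = rng.random((p, n)) < 0.5  # True where a rating is known
D = rng.normal(size=(p, k))
A = rng.normal(size=(k, n))
```

Both functions return the same value up to floating-point rounding. The masked form is convenient for small dense examples, while the explicit sum reflects what a solver actually touches when only the ratings in $\Omega$ are stored.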
  • 12. Existing methods: alternated minimization. To minimize over $A$ and $D$, alternate between $D = \operatorname{argmin}_{D \in \mathbb{R}^{p \times k}} \sum_{(i,j) \in \Omega} (x_{ij} - d_i^\top \alpha_j)^2 + \lambda \|D\|_F^2$ and $A = \operatorname{argmin}_{A \in \mathbb{R}^{k \times n}} \sum_{(i,j) \in \Omega} (x_{ij} - d_i^\top \alpha_j)^2 + \lambda \|A\|_F^2$. No hyperparameters. Slow and memory-expensive: it uses all ratings at each iteration. Also known as coordinate descent (a variation in the parameter update order).
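The alternation above can be sketched as follows. Each half-step decouples into one small ridge regression per item (or per user), restricted to the observed ratings. This is a plain alternating least-squares sketch under that reading, not the coordinate descent solver benchmarked later in the talk; `update_rows` and `als` are hypothetical names, and `mask` is assumed to be a boolean array.

```python
import numpy as np


def update_rows(X, mask, F, lam):
    """For each row r of X, solve the ridge problem
    min_w sum_{c observed in row r} (X[r, c] - w . F[:, c])^2 + lam * ||w||^2."""
    k = F.shape[0]
    W = np.zeros((X.shape[0], k))
    for r in range(X.shape[0]):
        obs = mask[r]                      # boolean selector of observed entries
        F_obs = F[:, obs]                  # k x (number of observed entries)
        G = F_obs @ F_obs.T + lam * np.eye(k)
        W[r] = np.linalg.solve(G, F_obs @ X[r, obs])
    return W


def als(X, mask, k, lam=0.1, n_iter=20, seed=0):
    """Alternate exact minimization over D (items) and A (users)."""
    rng = np.random.default_rng(seed)
    p, n = X.shape
    D = rng.normal(scale=0.1, size=(p, k))
    A = rng.normal(scale=0.1, size=(k, n))
    for _ in range(n_iter):
        D = update_rows(X, mask, A, lam)            # items against user codes
        A = update_rows(X.T, mask.T, D.T, lam).T    # users against item codes
    return D, A
```

Every iteration loops over all rows and all columns, which is the "uses all ratings at each iteration" cost noted on the slide.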
  • 13. Existing methods: stochastic gradient descent. $\min_{A, D} \sum_{(i,j) \in \Omega} f_{ij}(A, D)$, with $f_{ij}(A, D) \stackrel{\text{def}}{=} (x_{ij} - d_i^\top \alpha_j)^2 + \frac{1}{c_j} \lambda \|\alpha_j\|_2^2 + \frac{1}{c_i} \lambda \|d_i\|_2^2$. Gradient step for each rating: $(A_t, D_t) \leftarrow (A_{t-1}, D_{t-1}) - \frac{1}{c_t} \nabla_{(A,D)} f_{ij}(A_{t-1}, D_{t-1})$. Fast and memory-efficient: won the Netflix prize. Very sensitive to the step sizes ($c_t$), which need to be cross-validated.
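A per-rating gradient step for $f_{ij}$ might look like the sketch below. Two things are assumptions, since the slide does not define them: $c_i$ and $c_j$ are taken to be the number of ratings of item $i$ and of user $j$, and the step size $1/c_t$ is replaced by a fixed `step` parameter. The synthetic data and all names are illustrative.

```python
import numpy as np


def sgd_epoch(ratings, D, A, lam, step, item_counts, user_counts, rng):
    """One pass over the ratings in random order, updating D and A in place."""
    for idx in rng.permutation(len(ratings)):
        i, j, x = ratings[idx]
        d_i, a_j = D[i].copy(), A[:, j].copy()
        err = x - d_i @ a_j
        # Gradients of f_ij with respect to d_i and alpha_j.
        D[i] -= step * (-2 * err * a_j + 2 * lam / item_counts[i] * d_i)
        A[:, j] -= step * (-2 * err * d_i + 2 * lam / user_counts[j] * a_j)


def squared_error(ratings, D, A):
    return sum((x - D[i] @ A[:, j]) ** 2 for i, j, x in ratings)


rng = np.random.default_rng(0)
p, n, k = 6, 8, 2
X = rng.normal(size=(p, k)) @ rng.normal(size=(k, n))  # synthetic low-rank ratings
ratings = [(i, j, X[i, j]) for i in range(p) for j in range(n) if rng.random() < 0.8]
item_counts = np.bincount([i for i, _, _ in ratings], minlength=p)  # assumed c_i
user_counts = np.bincount([j for _, j, _ in ratings], minlength=n)  # assumed c_j

D = rng.normal(scale=0.5, size=(p, k))
A = rng.normal(scale=0.5, size=(k, n))
before = squared_error(ratings, D, A)
for _ in range(50):
    sgd_epoch(ratings, D, A, lam=0.01, step=0.02,
              item_counts=item_counts, user_counts=user_counts, rng=rng)
after = squared_error(ratings, D, A)
```

Each update touches only one row of $D$ and one column of $A$, which is why the method is memory-efficient. The fixed `step` is the hyperparameter whose tuning the slide warns about.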
  • 14. Towards a new algorithm. Best of both worlds? We want a fast and memory-efficient algorithm that is not very sensitive to hyperparameter settings. Subsampled online dictionary learning builds upon the online dictionary learning algorithm, popular in computer vision and interpretable learning (fMRI), and adapts it to handle missing values efficiently.
  • 16. Dictionary learning. Recall the recommender system formalism, here as non-masked matrix factorization with an $\ell_2$ penalty: $\min_{D \in \mathbb{R}^{p \times k},\, A \in \mathbb{R}^{k \times n}} \sum_{j=1}^{n} \|x_j - D \alpha_j\|_2^2 + \lambda (\|D\|_F^2 + \|A\|_F^2)$. Penalties can be richer, and can be made into constraints. Dictionary learning learns the left-side factor [Olshausen and Field, 1997]: $\min_{D \in \mathcal{C}} \sum_{j=1}^{n} \|x_j - D \alpha_j\|_2^2 + \lambda \Omega(\alpha_j)$, with $\alpha_j = \operatorname{argmin}_{\alpha \in \mathbb{R}^k} \|x_j - D \alpha\|_2^2 + \lambda \Omega(\alpha)$. Naive approach: alternated minimization.
  • 17. Online dictionary learning [Mairal et al., 2010]. At iteration $t$, select $x_t$ in $\{x_j\}_j$ (user ratings) and improve $D$. The complexity of a single iteration scales with the sample dimension, $O(p)$. $(D_t)_t$ converges in a few epochs (one for large $n$). [Figure: samples $x_t$ streamed from the $p \times n$ data matrix, each approximated by $D \alpha_t$ with $k$ components.] Very efficient in computer vision, networks, fMRI, and hyperspectral images. Can we use it efficiently for recommender systems?
  • 18. In short: handling missing values. Going from batch to online handles large $n$; going from online to online + partial handles missing values. [Figure: the full $p \times n$ matrix $X$, the stream of samples $x_t$, and the stream of masked samples $M_t x_t$, whose unknown entries are ignored and never accessed.] Leverage streaming and partial access to samples.
  • 19. In detail: online dictionary learning. The objective function involves latent codes (the right-side factor): $\min_{D \in \mathcal{C}} \frac{1}{t} \sum_{i=1}^{t} \|x_i - D \alpha_i^*(D)\|_2^2$, with $\alpha_i^*(D) = \operatorname{argmin}_{\alpha} \frac{1}{2} \|x_i - D \alpha\|_2^2 + \lambda \Omega(\alpha)$. Replace the latent codes by codes computed with old dictionaries, which builds an upper-bounding surrogate function: $\min_{D \in \mathcal{C}} \frac{1}{t} \sum_{i=1}^{t} \|x_i - D \alpha_i\|_2^2$, with $\alpha_i = \operatorname{argmin}_{\alpha} \frac{1}{2} \|x_i - D_{i-1} \alpha\|_2^2 + \lambda \Omega(\alpha)$. Minimize the surrogate, which can be updated online at low cost.
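  • 21. In detail: online dictionary learning. Algorithm outline. (1) Compute the code from $x_t$, with a complexity that depends on $p$: $\alpha_t = \operatorname{argmin}_{\alpha \in \mathbb{R}^k} \|x_t - D_{t-1} \alpha\|_2^2 + \lambda \Omega(\alpha)$. (2) Update the surrogate function, with complexity in $O(p)$: $g_t(D) = \frac{1}{t} \sum_{i=1}^{t} \|x_i - D \alpha_i\|_2^2$, which equals $\operatorname{Tr}(\frac{1}{2} D^\top D A_t - D^\top B_t)$ up to a constant factor and an additive constant independent of $D$, where $A_t = (1 - \frac{1}{t}) A_{t-1} + \frac{1}{t} \alpha_t \alpha_t^\top$ and $B_t = (1 - \frac{1}{t}) B_{t-1} + \frac{1}{t} x_t \alpha_t^\top$. (3) Minimize the surrogate, with complexity in $O(p)$: $D_t = \operatorname{argmin}_{D \in \mathcal{C}} g_t(D)$, using $\nabla g_t = D A_t - B_t$.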
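The three steps of this outline can be put together in a compact sketch. Several choices below are assumptions made to keep the example self-contained: the penalty $\Omega$ is taken to be the squared $\ell_2$ norm (so the code has a closed-form ridge solution), the constraint set $\mathcal{C}$ is taken to be columns of $\ell_2$ norm at most 1, and the surrogate is minimized approximately with a single pass of block coordinate descent over the columns. All function names are hypothetical.

```python
import numpy as np


def project_columns(D):
    """Project each column (atom) of D onto the unit l2 ball."""
    return D / np.maximum(np.linalg.norm(D, axis=0), 1.0)


def online_dictionary_learning(X, k, lam=0.1, n_epochs=1, seed=0):
    """Stream the columns of X (p x n) and return a p x k dictionary."""
    rng = np.random.default_rng(seed)
    p, n = X.shape
    D = project_columns(rng.normal(size=(p, k)))
    A = np.zeros((k, k))  # running average of alpha alpha^T
    B = np.zeros((p, k))  # running average of x alpha^T
    t = 0
    for _ in range(n_epochs):
        for j in rng.permutation(n):
            t += 1
            x = X[:, j]
            # 1. Code computation (ridge solve, assuming Omega = squared l2 norm).
            alpha = np.linalg.solve(D.T @ D + lam * np.eye(k), D.T @ x)
            # 2. Surrogate update.
            A = (1 - 1 / t) * A + np.outer(alpha, alpha) / t
            B = (1 - 1 / t) * B + np.outer(x, alpha) / t
            # 3. Surrogate minimization: one block coordinate descent pass,
            #    using the gradient D A - B column by column.
            for q in range(k):
                if A[q, q] > 1e-12:
                    D[:, q] -= (D @ A[:, q] - B[:, q]) / A[q, q]
                    D[:, q] /= max(np.linalg.norm(D[:, q]), 1.0)
    return D
```

Every step works on full vectors of length $p$, which is the per-iteration cost that the subsampled variant later reduces to the number of rated items.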
  • 22. Specification for a new algorithm. [Figure: stream of masked samples $M_t x_t$ from the $p \times n$ matrix; unrated entries are ignored.] Constrained: use only known ratings from $\Omega$. Efficient: a single iteration in $O(s)$, where $s$ is the number of ratings provided by user $t$. Principled: follows the online matrix factorization algorithm as much as possible.
  • 24. Missing values in practice. Data stream: $(x_t)_t$ becomes the masked stream $(M_t x_t)_t$, the ratings from user $t$. Dimension: $p$ (all items) becomes $s$ (rated items). Use only $M_t x_t$ in the algorithm computations, for a complexity in $O(s)$. [Figure: stream of masked samples $M_t x_t$ from the $p \times n$ matrix; unrated entries are ignored.] Adaptation to make: modify all parts of the algorithm to obtain $O(s)$ complexity, namely (1) code computation, (2) surrogate update, (3) surrogate minimization.
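  • 25. Subsampled online dictionary learning (see the paper for details). Original online MF: (1) code computation, $\alpha_t = \operatorname{argmin}_{\alpha \in \mathbb{R}^k} \|x_t - D_{t-1} \alpha\|_2^2 + \lambda \Omega(\alpha)$; (2) surrogate aggregation, $A_t = \frac{1}{t} \sum_{i=1}^{t} \alpha_i \alpha_i^\top$ and $B_t = B_{t-1} + \frac{1}{t} (x_t \alpha_t^\top - B_{t-1})$; (3) surrogate minimization, $D^j \leftarrow p^\perp_{\mathcal{C}^r_j} \big( D^j - \frac{1}{(A_t)_{j,j}} (D A_t^j - B_t^j) \big)$. Our algorithm: (1) code computation with a masked loss, $\alpha_t = \operatorname{argmin}_{\alpha \in \mathbb{R}^k} \|M_t (x_t - D_{t-1} \alpha)\|_2^2 + \lambda \frac{\operatorname{rk} M_t}{p} \Omega(\alpha)$; (2) surrogate aggregation, $A_t = \frac{1}{t} \sum_{i=1}^{t} \alpha_i \alpha_i^\top$ and $B_t = B_{t-1} + \frac{1}{\sum_{i=1}^{t} M_i} (M_t x_t \alpha_t^\top - M_t B_{t-1})$; (3) surrogate minimization, $M_t D^j \leftarrow p^\perp_{\mathcal{C}_j} \big( M_t D^j - \frac{1}{(A_t)_{j,j}} M_t (D A_t^j - B_t^j) \big)$.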
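A single iteration of the masked variant could be sketched as below, touching only the rows of $D$ and $B$ that correspond to the items rated by the current user. This is a simplified reading of the slide, not the authors' implementation. The assumptions are: $\Omega$ is the squared $\ell_2$ norm, the mask $M_t$ is represented by an index array `idx` of rated items, the division by $\sum_i M_i$ is done with a per-item counter, and the projection step is replaced by a simple rule that prevents the norm of the updated rows from growing. All names are hypothetical.

```python
import numpy as np


def sodl_step(D, A, B, counts, t, idx, x_obs, lam):
    """One iteration for a user who rated items `idx` with values `x_obs`.

    D (p x k), A (k x k), B (p x k) and counts (p,) are updated in place.
    """
    p, k = D.shape
    s = len(idx)                 # number of rated items, i.e. rk M_t
    D_obs = D[idx]               # s x k: only the rated rows are read
    # 1. Code computation on the masked loss, penalty rescaled by s / p.
    G = D_obs.T @ D_obs + lam * s / p * np.eye(k)
    alpha = np.linalg.solve(G, D_obs.T @ x_obs)
    # 2. Surrogate aggregation: A averages over all samples seen so far,
    #    B is averaged per item over the samples in which the item was rated.
    A += (np.outer(alpha, alpha) - A) / t
    counts[idx] += 1
    B[idx] += (np.outer(x_obs, alpha) - B[idx]) / counts[idx][:, None]
    # 3. Surrogate minimization restricted to the rated rows.
    for q in range(k):
        if A[q, q] > 1e-12:
            old_norm = np.linalg.norm(D[idx, q])
            new = D[idx, q] - (D[idx] @ A[:, q] - B[idx, q]) / A[q, q]
            new_norm = np.linalg.norm(new)
            # Assumed stand-in for the projection: never let the rated part
            # of a column grow, so column norms stay bounded without
            # touching the unrated rows.
            if new_norm > old_norm and new_norm > 0:
                new *= old_norm / new_norm
            D[idx, q] = new
    return alpha
```

Apart from the $k \times k$ matrix $A$, every array access goes through `idx`, so the cost of one iteration depends on $s$ and $k$ but not on the total number of items $p$.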
  • 27. Experiments. Validation: test RMSE (rating prediction) versus CPU time. Baseline: coordinate descent solver [Yu et al., 2012] for $\min_{D \in \mathbb{R}^{p \times k},\, A \in \mathbb{R}^{k \times n}} \sum_{(i,j) \in \Omega} (x_{ij} - d_i^\top \alpha_j)^2 + \lambda (\|D\|_F^2 + \|A\|_F^2)$. It is the fastest solver available apart from SGD, which has more hyperparameters. Our method has a learning rate with little influence. Datasets: MovieLens and Netflix, both publicly available; larger ones exist in industry.
  • 28. Results. Scalable algorithm: the speed-up improves with dataset size.
  • 29. Performance (CD: coordinate descent; SODL: subsampled online dictionary learning).
      Dataset     Test RMSE (CD / SODL)   Convergence time (CD / SODL)   Speed-up
      ML 1M       0.872 / 0.866           6 s / 8 s                      ×0.75
      ML 10M      0.802 / 0.799           223 s / 60 s                   ×3.7
      NF (140M)   0.938 / 0.934           1714 s / 256 s                 ×6.8
    SODL outperforms coordinate descent beyond 10M ratings, with the same prediction performance. The speed-up is 6.8× on Netflix. The model is simple: its RMSE is not state-of-the-art.
  • 30. Robustness to learning rate. In theory, the learning rate $\beta$ of the algorithm must be set in [0.75, 1]. In practice, just set it in [0.8, 1]. [Figure: RMSE on the test set versus epoch, for learning rates $\beta$ from 0.75 to 1.00; panels: MovieLens 10M and Netflix.]
  • 32. Conclusion. Take-home message: online matrix factorization can be adapted to handle missing values efficiently, with very good performance in recommender systems. [Figure: stream of masked samples $M_t x_t$ from the $p \times n$ matrix; unrated entries are ignored.] The algorithm is usable in any rich model involving matrix factorization. Python package: http://github.com/arthurmensch/modl Article and slides: http://amensch.fr/publications Questions?
  • 33. Appendix: resting-state fMRI. Online dictionary learning: 235 h run time for 1 full epoch; 10 h run time for 1/24 epoch. Proposed method: 10 h run time for 1/2 epoch, with reduction r = 12. Qualitatively, usable maps are obtained 10× faster.
  • 34. Bibliography.
      [Bell and Koren, 2007] Bell, R. M. and Koren, Y. (2007). Lessons from the Netflix prize challenge. ACM SIGKDD Explorations Newsletter, 9(2):75–79.
      [Mairal et al., 2010] Mairal, J., Bach, F., Ponce, J., and Sapiro, G. (2010). Online learning for matrix factorization and sparse coding. The Journal of Machine Learning Research, 11:19–60.
      [Olshausen and Field, 1997] Olshausen, B. A. and Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311–3325.
      [Yu et al., 2012] Yu, H.-F., Hsieh, C.-J., and Dhillon, I. (2012). Scalable coordinate descent approaches to parallel matrix factorization for recommender systems. In Proceedings of the International Conference on Data Mining, pages 765–774. IEEE.