Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Dictionary Learning for Massive Matrix Factorization

1,413 views

Published on

Talk given by Arthur Mensch, Inria Parietal, during the RecsysFR meetup on October 6th 2016.

Published in: Internet
  • Be the first to comment

  • Be the first to like this

Dictionary Learning for Massive Matrix Factorization

  1. 1. Dictionary Learning for Massive Matrix Factorization Arthur Mensch, Julien Mairal Ga¨el Varoquaux, Bertrand Thirion Inria Parietal, Inria Thoth October 6, 2016
  2. 2. Introduction Why am I here ? Inria Parietal: machine learning for neuro-imaging (fMRI data) Matrix factorization: major ingredient in fMRI analysis Very large datasets (2 TB): we designed faster algorithms These algorithms can be used in collaborative filtering D AX Voxels Time = k spatial maps Time x 1 Work presented at ICML 2016 Arthur Mensch Dictionary Learning for Massive Matrix Factorization 1 / 28
  3. 3. D´eroul´e 1 Matrix factorization for recommender systems Collaborative filtering Matrix factorization formulation Existing methods 2 Subsampled online dictionary learning Dictionary learning – existing methods Handling missing values efficiently New algorithm 3 Results Setting Benchmarks Parameter setting Arthur Mensch Dictionary Learning for Massive Matrix Factorization 2 / 28
  4. 4. D´eroul´e 1 Matrix factorization for recommender systems Collaborative filtering Matrix factorization formulation Existing methods 2 Subsampled online dictionary learning Dictionary learning – existing methods Handling missing values efficiently New algorithm 3 Results Setting Benchmarks Parameter setting Arthur Mensch Dictionary Learning for Massive Matrix Factorization 3 / 28
  5. 5. Collaborative filtering Collaborative platform n users rate a fraction of p items e.g movies, restaurants Estimate ratings for recommendation Use the ratings of other users for recommendation Arthur Mensch Dictionary Learning for Massive Matrix Factorization 4 / 28
  6. 6. How to predict ratings ? Credit: [Bell and Koren, 2007] Joe like We were soldiers, Black Hawk down. Bob and Alice like the same films, and also like Saving private Ryan. Joe should watch Saving private Ryan, because all of them indeed likes war films. Need to uncover topics in items Arthur Mensch Dictionary Learning for Massive Matrix Factorization 5 / 28
  7. 7. Predicting rate with scalar products Embeddings to model the existence of genre/category/topics Representative vectors for users and items: (αj )1≤j≤n, (di )1≤i≤p ∈ Rk q-th coefficient of di , αj = affinity with the “topic” q xij αj di k topics di αj 1 Arthur Mensch Dictionary Learning for Massive Matrix Factorization 6 / 28
  8. 8. Predicting rate with scalar products Embeddings to model the existence of genre/category/topics Representative vectors for users and items: (αj )1≤j≤n, (di )1≤i≤p ∈ Rk q-th coefficient of di , αj = affinity with the “topic” q Ratings xij (item i, user j): xij = di αj ( + biases) = Common affinity for topics xij αj di k topics di αj 1 Arthur Mensch Dictionary Learning for Massive Matrix Factorization 6 / 28
  9. 9. Predicting rate with scalar products Embeddings to model the existence of genre/category/topics Representative vectors for users and items: (αj )1≤j≤n, (di )1≤i≤p ∈ Rk q-th coefficient of di , αj = affinity with the “topic” q Ratings xij (item i, user j): xij = di αj ( + biases) = Common affinity for topics xij αj di k topics di αj 1 Learning problem: estimate D and A with known ratings Arthur Mensch Dictionary Learning for Massive Matrix Factorization 6 / 28
  10. 10. Matrix factorization 1Ω X AD p n k n = 1 X ∈ Rp×n ≈ DA ∈ Rp×k × Rk×n Constraints / penalty on factors D and A We only observe 1Ω X — Ω set of ratings provided by users Recommender systems : millions of users, millions of items How to scale matrix factorization to very large datasets ? Arthur Mensch Dictionary Learning for Massive Matrix Factorization 7 / 28
  11. 11. Formalism Finding representation in Rk for items and users: min D∈Rp×k A∈Rk×n (i,j)∈Ω (xij − di αj )2 + λ( D 2 F + A 2 F ) = 1Ω (X − DA) 2 2 + λ( D 2 F + A 2 F ) 1Ω set of knownratings 2 reconstruction loss — 2 penalty for generalization Existing methods Alternated minimization Stochastic gradient descent Arthur Mensch Dictionary Learning for Massive Matrix Factorization 8 / 28
  12. 12. Existing methods Alternated minimization Minimize over A, D: alternate between D = min D∈Rp×k (i,j)∈Ω (xij − di αj )2 + λ D 2 F A = min A∈Rk×n (i,j)∈Ω (xij − di αj )2 + λ A 2 F No hyperparameters Slow and memory expensive: use all ratings at each iteration a.k.a. coordinate descent (variation in parameter update order) Arthur Mensch Dictionary Learning for Massive Matrix Factorization 9 / 28
  13. 13. Existing methods Stochastic gradient descent min A,D (i,j)∈Ω fij (A, B) def = (xij − di αj )2 + 1 cj λ αj 2 2 + 1 ci λ di 2 2 Gradient step for each rating: (At, Dt) ← (At−1, Dt−1)− 1 ct (A,D)fij (At−1, Dt−1) Fast and memory efficient – won the Netflix prize Very sensitive to step sizes (ct) – need to cross-validate Arthur Mensch Dictionary Learning for Massive Matrix Factorization 10 / 28
  14. 14. Towards a new algorithm Best of both worlds ? Fast and memory efficient algorithm Little sensitive to hyperparameter setting Subsampled online dictionary learning Builds upon the online dictionary learning algorithm popular in computer vision and interpretable learning (fMRI) Adapt it to handle missing values efficiently Arthur Mensch Dictionary Learning for Massive Matrix Factorization 11 / 28
  15. 15. D´eroul´e 1 Matrix factorization for recommender systems Collaborative filtering Matrix factorization formulation Existing methods 2 Subsampled online dictionary learning Dictionary learning – existing methods Handling missing values efficiently New algorithm 3 Results Setting Benchmarks Parameter setting Arthur Mensch Dictionary Learning for Massive Matrix Factorization 12 / 28
  16. 16. Dictionary learning Recall: recommender system formalism Non-masked matrix factorization with 2 penalty: min D∈Rp×k A∈Rk×n n j=1 (xj − D αj )2 + λ( D 2 F + A 2 F ) Penalties can be richer, and made into constraints Dictionary learning Learn the left side factor [Olshausen and Field, 1997] min D∈C n j=1 xj −Dαj 2 2 +λΩ(αj ) αj = argmin α∈Rk xi −Dα 2 2 +λΩ(α) Naive approach: alternated minimization Arthur Mensch Dictionary Learning for Massive Matrix Factorization 13 / 28
  17. 17. Online dictionary learning [Mairal et al., 2010] At iteration t, select xt in {xj }j (user ratings), improve D Single iteration complexity ∝ sample dimension O(p) (Dt)t converges in a few epochs (one for large n) xt αtD p n k n =Stream 1 Very efficient in computer vision / networks / fMRI / hyperspectral images Can we use it efficiently for recommender systems ? Arthur Mensch Dictionary Learning for Massive Matrix Factorization 14 / 28
  18. 18. In short: Handling missing values X p n xt Steam Handle large n n Handle missing values Online → online + partial Batch → online Mtxt Stream Ignore Unknown Unaccessed 1 Leverage streaming + partial access to samples Arthur Mensch Dictionary Learning for Massive Matrix Factorization 15 / 28
  19. 19. In detail: online dictionary learning Objective function involves latent codes (right side factor) min D∈C 1 t t i=1 xi − Dα∗ i (D) 2 2, α∗ i (D) = argmin α 1 2 xi − Dα 2 2 + λΩ(α) Replace latent codes by codes computed with old dictionaries Build an upper-bounding surrogate function min 1 t t i=1 xi −Dαi 2 2 αi = argmin α 1 2 xi −Di−1α 2 2+λΩ(α) Minimize surrogate — updateable online at low cost Arthur Mensch Dictionary Learning for Massive Matrix Factorization 16 / 28
  20. 20. In detail: online dictionary learning Algorithm outline 1 Compute code αt = argmin α∈Rk xt − Dt−1α 2 2 + λΩ(αt) 2 Update the surrogate function gt = 1 t t i=1 xi − Dαi 2 2 = Tr ( 1 2 D DAt − D Bt) At = (1 − 1 t )At−1 + 1 t αtαt Bt = (1 − 1 t )Bt−1 + 1 t xtαt 3 Minimize surrogate Dt = argmin D∈C gt(D) gt = DAt − Bt Arthur Mensch Dictionary Learning for Massive Matrix Factorization 17 / 28
  21. 21. In detail: online dictionary learning Algorithm outline 1 Compute code – xt → complexity depends on p αt = argmin α∈Rk xt − Dt−1α 2 2 + λΩ(αt) 2 Update the surrogate function – Complexity in O(p) gt = 1 t t i=1 xi − Dαi 2 2 = Tr ( 1 2 D DAt − D Bt) At = (1 − 1 t )At−1 + 1 t αtαt Bt = (1 − 1 t )Bt−1 + 1 t xtαt 3 Minimize surrogate – Complexity in O(p) Dt = argmin D∈C gt(D) gt = DAt − Bt Arthur Mensch Dictionary Learning for Massive Matrix Factorization 17 / 28
  22. 22. Specification for a new algorithm Mtxt Stream Ignore p n 1 Constrained : use only known ratings from Ω Efficient: single iteration in O(s), # of ratings provided by user t Principled: follows the online matrix factorization algorithm as much as possible Arthur Mensch Dictionary Learning for Massive Matrix Factorization 18 / 28
  23. 23. Missing values in practice Data stream: (xt)t → masked (Mtxt)t = ratings from user t Dimension: p (all items) → s (rated items) Use only Mtxt in algorithm computation → complexity in O(s) Mtxt Stream Ignore p n 1 Arthur Mensch Dictionary Learning for Massive Matrix Factorization 19 / 28
  24. 24. Missing values in practice Data stream: (xt)t → masked (Mtxt)t = ratings from user t Dimension: p (all items) → s (rated items) Use only Mtxt in algorithm computation → complexity in O(s) Mtxt Stream Ignore p n 1 Adaptation to make Modify all parts of the algorithm to obtain O(s) complexity 1 Code computation 2 Surrogate update 3 Surrogate minimization Arthur Mensch Dictionary Learning for Massive Matrix Factorization 19 / 28
  25. 25. Subsampled online dictionary learning Check out paper ! Original online MF 1 Code computation αt = argmin α∈Rk xt − Dt−1α 2 2 + λΩ(αt ) 2 Surrogate aggregation At = 1 t t i=1 αi αi Bt = Bt−1 + 1 t (xt αt − Bt−1) 3 Surrogate minimization Dj ← p⊥ Cr j (Dj − 1 (At )j,j (DAj t −Bj t )) Our algorithm 1 Code computation: masked loss αt = argmin α∈Rk Mt (xt − Dt−1α) 2 2 + λ rk Mt p Ω(αt ) 2 Surrogate aggregation At = 1 t t i=1 αi αi Bt = Bt−1 + 1 t i=1 Mi (Mt xt αt − Mt Bt−1) 3 Surrogate minimization Mt Dj ← p⊥ Cj (Mt Dj − 1 (At )j,j Mt (D(Aj t − (Bj t )) Arthur Mensch Dictionary Learning for Massive Matrix Factorization 20 / 28
  26. 26. D´eroul´e 1 Matrix factorization for recommender systems Collaborative filtering Matrix factorization formulation Existing methods 2 Subsampled online dictionary learning Dictionary learning – existing methods Handling missing values efficiently New algorithm 3 Results Setting Benchmarks Parameter setting Arthur Mensch Dictionary Learning for Massive Matrix Factorization 21 / 28
  27. 27. Experiments Validation : Test RMSE (rating prediction) vs CPU time Baseline : Coordinate descent solver [Yu et al., 2012] for min D∈Rp×k A∈Rk×n (i,j)∈Ω (xij − di αj )2 + λ( D 2 F + A 2 F ) Fastest solver available apart from SGD — hyperparameters ↑ Our method has a learning rate with little influence Datasets : Movielens, Netflix Publicly available Larger one in the industry... Arthur Mensch Dictionary Learning for Massive Matrix Factorization 22 / 28
  28. 28. Results Scalable algorithm: speed-up improves with size Arthur Mensch Dictionary Learning for Massive Matrix Factorization 23 / 28
  29. 29. Performance Dataset Test RMSE Convergence time Speed CD SODL CD SODL -up ML 1M 0.872 0.866 6 s 8 s ×0.75 ML 10M 0.802 0.799 223 s 60 s ×3.7 NF (140M) 0.938 0.934 1714 s 256 s ×6.8 Outperform coordinate descent beyond 10M ratings Same prediction performance Speed-up 6.8× on Netflix Simple model: RMSE is not state-of-the-art Arthur Mensch Dictionary Learning for Massive Matrix Factorization 24 / 28
  30. 30. Robustness to learning rate Learning rate in algorithm to be set in [0.75, 1] (← theory) In practice: Just set it in [0.8, 1] 1 10 40Epoch 0.80 0.81 0.82 0.83 0.84 0.85 0.86 0.87 RMSEontestset Learning rate β0.75 0.78 0.81 0.83 0.86 0.89 0.92 0.94 0.97 1.00 MovieLens 10M .1 1 10 20 0.93 0.94 0.95 0.96 0.97 0.98 0.99 Netflix Arthur Mensch Dictionary Learning for Massive Matrix Factorization 25 / 28
  31. 31. Conclusion Take-home message Online matrix factorization can be adapted to handle missing value efficiently, with very good performance in reccommender system Mtxt Stream Ignore p n 1Algorithm usable in any rich model involving matrix factorization Python package http://github.com/arthurmensch/modl Article/slides at http://amensch.fr/publications Arthur Mensch Dictionary Learning for Massive Matrix Factorization 26 / 28
  32. 32. Conclusion Take-home message Online matrix factorization can be adapted to handle missing value efficiently, with very good performance in reccommender system Mtxt Stream Ignore p n 1Algorithm usable in any rich model involving matrix factorization Python package http://github.com/arthurmensch/modl Article/slides at http://amensch.fr/publications Questions ? Arthur Mensch Dictionary Learning for Massive Matrix Factorization 26 / 28
  33. 33. Appendix: Resting-state fMRI Online dictionary learning 235 h run time 1 full epoch 10 h run time 1 24 epoch Proposed method 10 h run time 1 2 epoch, reduction r=12 Qualitatively, usable maps are obtained 10× faster Arthur Mensch Dictionary Learning for Massive Matrix Factorization 27 / 28
  34. 34. Bibliography I [Bell and Koren, 2007] Bell, R. M. and Koren, Y. (2007). Lessons from the Netflix prize challenge. ACM SIGKDD Explorations Newsletter, 9(2):75–79. [Mairal et al., 2010] Mairal, J., Bach, F., Ponce, J., and Sapiro, G. (2010). Online learning for matrix factorization and sparse coding. The Journal of Machine Learning Research, 11:19–60. [Olshausen and Field, 1997] Olshausen, B. A. and Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311–3325. [Yu et al., 2012] Yu, H.-F., Hsieh, C.-J., and Dhillon, I. (2012). Scalable coordinate descent approaches to parallel matrix factorization for recommender systems. In Proceedings of the International Conference on Data Mining, pages 765–774. IEEE. Arthur Mensch Dictionary Learning for Massive Matrix Factorization 28 / 28

×