
Fast ALS-based matrix factorization for explicit and implicit feedback datasets



If you want to learn more, get in touch with us through http://www.gravityrd.com



  1. Fast ALS-based matrix factorization for explicit and implicit feedback datasets
     István Pilászy, Dávid Zibriczky, Domonkos Tikk
     Gravity R&D Ltd., www.gravityrd.com
     28 September 2010
  2. Collaborative filtering
  3. Problem setting
     [figure: a sparse user-item rating matrix with a few known ratings, e.g. 5, 4, 3, 2, 1]
  4. Ridge Regression
  5. Ridge Regression
     - Optimal solution: w = (X^T X + λI)^(-1) X^T y
  6. Ridge Regression
     - Computing the optimal solution: w = (X^T X + λI)^(-1) X^T y
     - Matrix inversion is costly: O(K^3) for K features
     - Sum of squared errors of the optimal solution: 0.055
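
A minimal numpy sketch of this closed-form solve. The data below is random and only illustrative, and `lam` is just a name for the regularization weight λ; neither comes from the slides.

import numpy as np

def ridge_regression(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam*I)^{-1} X^T y."""
    K = X.shape[1]
    # Solving the K x K system is cheaper and more stable than forming the
    # inverse explicitly, but it is still O(K^3) in the number of features.
    return np.linalg.solve(X.T @ X + lam * np.eye(K), X.T @ y)

# Illustrative data (not the example from the slides):
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 5))   # 6 examples, 5 features
y = rng.normal(size=6)
w = ridge_regression(X, y, lam=0.1)
print("sum of squared errors:", np.sum((X @ w - y) ** 2))
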
  7. RR1: RR with coordinate descent
     - Idea: optimize only one variable of w at a time
     - Start with zero: w = 0
     - Sum of squared errors: 24.6
  8. RR1: RR with coordinate descent
     - Idea: optimize only one variable of w at a time
     - Start with zero, then optimize w_1
     - Sum of squared errors: 7.5
  9. RR1: RR with coordinate descent
     - Idea: optimize only one variable of w at a time
     - Start with zero, then optimize w_1, then optimize w_2
     - Sum of squared errors: 6.2
  10. RR1: RR with coordinate descent
      - Idea: optimize only one variable of w at a time
      - Start with zero, then optimize w_1, then w_2, then w_3
      - Sum of squared errors: 5.7
  11. RR1: RR with coordinate descent
      - Idea: optimize only one variable of w at a time
      - ... w_4
      - Sum of squared errors: 5.4
  12. RR1: RR with coordinate descent
      - Idea: optimize only one variable of w at a time
      - ... w_5
      - Sum of squared errors: 5.0
  13. RR1: RR with coordinate descent
      - Idea: optimize only one variable of w at a time
      - ... w_1 again
      - Sum of squared errors: 3.4
  14. RR1: RR with coordinate descent
      - Idea: optimize only one variable of w at a time
      - ... w_2 again
      - Sum of squared errors: 2.9
  15. RR1: RR with coordinate descent
      - Idea: optimize only one variable of w at a time
      - ... w_3 again
      - Sum of squared errors: 2.7
  16. RR1: RR with coordinate descent
      - Idea: optimize only one variable of w at a time
      - ... after a while: sum of squared errors: 0.055
      - No remarkable difference from the exact RR solution
      - Cost: O(n·K) per epoch for n examples and K features, with e epochs in total
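
A minimal sketch of RR1 in the same numpy setting, reusing the X, y, lam names of the previous sketch; each epoch sweeps all K coordinates over the n examples, so one epoch costs O(n·K).

import numpy as np

def rr1(X, y, lam, epochs=1, w=None):
    """RR1: ridge regression by coordinate descent, one variable of w at a time."""
    n, K = X.shape
    w = np.zeros(K) if w is None else w.astype(float).copy()
    r = y - X @ w                                  # current residual y - Xw
    for _ in range(epochs):
        for j in range(K):
            x_j = X[:, j]
            r += x_j * w[j]                        # remove w_j's contribution
            w[j] = x_j @ r / (x_j @ x_j + lam)     # 1-D ridge solve for w_j alone
            r -= x_j * w[j]                        # put the updated w_j back
    return w

# With enough epochs the result approaches the closed-form solution,
# e.g. rr1(X, y, lam=0.1, epochs=50) vs. ridge_regression(X, y, 0.1).
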
  17. Matrix factorization
      - The rating matrix R of size (M x N) is approximated as the product of two lower-rank matrices: R ≈ P Q^T
      - P: user feature matrix of size (M x K)
      - Q: item (movie) feature matrix of size (N x K)
      - K: number of features
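
In code the approximation is just a matrix product; the sizes and random values below are illustrative, not from the slides.

import numpy as np

M, N, K = 4, 6, 2                 # users, items, latent features (illustrative)
rng = np.random.default_rng(0)
P = rng.normal(size=(M, K))       # user feature matrix
Q = rng.normal(size=(N, K))       # item feature matrix

R_hat = P @ Q.T                   # predicted rating matrix, shape (M, N)
r_hat = P[1] @ Q[3]               # predicted rating of user 1 for item 3
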
  18. Matrix factorization for explicit feedback
      [figure: a small worked example of R being approximated by the product of P and Q]
  19. Finding P and Q
      - Init Q randomly
      - Find p_1
  20. Finding p_1 with RR
      - Optimal solution: p_1 = (Q_1^T Q_1 + λI)^(-1) Q_1^T r_1, where Q_1 holds the feature vectors of the items rated by user 1
  21. Finding p_1 with RR
      [figure: the worked example from slide 19 with p_1 now computed, e.g. p_1 = (2.3, 3.2)]
  22. Alternating Least Squares (ALS)
      - Initialize Q randomly
      - Repeat:
        - Recompute P:
          - Compute p_1 with RR
          - Compute p_2 with RR
          - ... (for each user)
        - Recompute Q:
          - Compute q_1 with RR
          - ... (for each item)
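
A minimal numpy sketch of this ALS loop for explicit feedback. Storing R densely with np.nan for missing ratings is an illustrative simplification, not how the slides (or a production system) store the data.

import numpy as np

def als_explicit(R, K=10, lam=0.1, epochs=10):
    """ALS for explicit feedback. R is a dense (M x N) array where missing
    ratings are np.nan (illustrative storage)."""
    M, N = R.shape
    rng = np.random.default_rng(0)
    P = rng.normal(scale=0.1, size=(M, K))
    Q = rng.normal(scale=0.1, size=(N, K))
    for _ in range(epochs):
        # Alternate: recompute every p_u with Q fixed, then every q_i with P fixed.
        for A, B, data in ((P, Q, R), (Q, P, R.T)):
            for u in range(A.shape[0]):
                mask = ~np.isnan(data[u])          # ratings known for this row
                X, y = B[mask], data[u, mask]
                A[u] = np.linalg.solve(X.T @ X + lam * np.eye(K), X.T @ y)
    return P, Q
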
  23. ALS1: ALS with RR1
      - ALS relies on RR:
        - recomputation of vectors with RR
        - when recomputing p_1, the previously computed value is ignored
      - ALS1 relies on RR1:
        - optimize the previously computed p_1, one scalar at a time
        - the previously computed value is not lost
        - run RR1 for only one epoch
      - ALS is just an approximation method; likewise ALS1.
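
The only change from ALS to ALS1 is the inner solve; a sketch of one half-step that reuses the rr1() function from the earlier sketch (function and argument names are illustrative).

import numpy as np

def als1_half_step(A, B, data, lam):
    """One ALS1 half-step: refine each row of A (e.g. P) against B (e.g. Q) with a
    single warm-started RR1 epoch. Unlike ALS, the previous value of A[u] is the
    starting point rather than being discarded."""
    for u in range(A.shape[0]):
        mask = ~np.isnan(data[u])
        if mask.any():
            A[u] = rr1(B[mask], data[u, mask], lam, epochs=1, w=A[u])
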
  24. Implicit feedback
      [figure: R is now a binary (watched / not watched) matrix, approximated by P and Q]
  25. Implicit feedback: IALS
      - The matrix is fully specified: every (user, item) cell has a value (1 = watched, 0 = not watched).
      - Zeros are less important, but still matter. Many 0s, few 1s.
      - Recall that the RR solution needs only X^T X and X^T y.
      - Idea (Hu, Koren, Volinsky):
        - consider a user who watched nothing (the null user)
        - compute X^T X and X^T y for this null user and cache them
        - when recomputing p_1, compare her to the null user
        - starting from the cached X^T X and X^T y, update them according to the differences (only the watched items)
        - in this way, only the number of 1s affects performance, not the number of 0s
      - IALS: alternating least squares with this trick.
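
A minimal numpy sketch of the user-side IALS update with this caching trick, assuming the common confidence model c_ui = 1 + alpha for watched items and preference 1/0. The `watched` layout, alpha, and the function name are assumptions, not from the slides.

import numpy as np

def ials_update_users(P, Q, watched, lam, alpha):
    """One IALS half-step with the cached null-user trick.
    watched[u] is the list of item indices user u consumed; preferences are 1 for
    watched items and 0 otherwise, with confidence c_ui = 1 + alpha on the 1s."""
    K = Q.shape[1]
    QtQ = Q.T @ Q                         # X^T X of the null user, computed once
    for u in range(P.shape[0]):
        A = QtQ + lam * np.eye(K)         # start from the cached null-user matrices
        b = np.zeros(K)
        for i in watched[u]:              # correct them only for the watched items
            q = Q[i]
            A += alpha * np.outer(q, q)   # extra confidence on the 1s
            b += (1.0 + alpha) * q        # X^T y contribution (preference is 1)
        P[u] = np.linalg.solve(A, b)
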
  26. Implicit feedback: IALS1
      - The RR1 trick cannot be applied here :(
  27. Implicit feedback: IALS1
      - The RR1 trick cannot be applied here :(
      - But, wait...!
  28. Implicit feedback: IALS1
      - X^T X is just a matrix.
      - No matter how many items we have, its dimension is the same (K x K).
      - If we are lucky, we can find K items which generate this matrix.
      - What if we are unlucky? We can still create synthetic items.
      - Assume that the null user did not watch these K items.
      - X^T X and X^T y stay the same, if the synthetic items are created appropriately.
  29. Implicit feedback: IALS1
      - Can we find a Z matrix such that Z is small (K x K) and Z^T Z = X^T X?
      - We can, by eigenvalue decomposition.
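
A sketch of how such a Z can be computed; taking X^T X to be Q^T Q of the current item feature matrix is an assumption about the slides' notation.

import numpy as np

def synthetic_items(Q):
    """Build a small (K x K) matrix Z with Z^T Z = Q^T Q via eigendecomposition.
    Its K rows can be fed to RR1 as synthetic items standing in for all the 0s."""
    d, V = np.linalg.eigh(Q.T @ Q)        # Q^T Q is symmetric positive semidefinite
    d = np.clip(d, 0.0, None)             # guard against tiny negative eigenvalues
    Z = np.diag(np.sqrt(d)) @ V.T         # Z^T Z = V diag(d) V^T = Q^T Q
    return Z

# Check: np.allclose(synthetic_items(Q).T @ synthetic_items(Q), Q.T @ Q) should hold.
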
  30. Implicit feedback: IALS1
      - If a user watched N items, we can run RR1 with N+K examples
      - To recompute p_u, we need O((N+K)·K) steps (assuming 1 epoch)
      - Is it better in practice than the O(K^3 + K^2·N) of IALS?
  31. Evaluation of ALS vs. ALS1
      - Probe10 RMSE on the Netflix Prize dataset, after 25 epochs
  32. Evaluation of ALS vs. ALS1
      - Time-accuracy tradeoff
  33. Evaluation of IALS vs. IALS1
      - Average Relative Position on the test subset of a proprietary implicit feedback dataset, after 20 epochs. Lower is better.
  34. Evaluation of IALS vs. IALS1
      - Time-accuracy tradeoff
  35. Conclusions
      - We learned two tricks:
        - ALS1: RR1 can be used instead of RR in ALS
        - IALS1: a few synthetic examples can replace the many unwatched items
      - ALS and IALS are approximation algorithms, so why not make them even more approximate
      - ALS1 and IALS1 offer better time-accuracy tradeoffs, especially when K is large.
      - They can be up to 10x faster (or even 100x faster, for unrealistically large K values).
      - TODO: precision, recall, other datasets.
  36. Thank you for your attention. Questions?
