Fast ALS-based matrix factorization for explicit and implicit feedback datasets

If you want to learn more, get in touch with us through http://www.gravityrd.com


  1. Fast ALS-based matrix factorization for explicit and implicit feedback datasets. István Pilászy, Dávid Zibriczky, Domonkos Tikk. Gravity R&D Ltd., www.gravityrd.com. 28 September 2010
  2. Collaborative filtering
  3. Problem setting: a sparse user-item rating matrix with only a few known ratings (example values shown on the slide)
  4. Ridge Regression
  5. Ridge Regression. Optimal solution: w = (X^T X + λI)^(-1) X^T y
  6. Ridge Regression. Computing the optimal solution requires inverting a K x K matrix, which is costly. Sum of squared errors of the optimal solution on the running example: 0.055
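
The closed-form ridge solution used on slides 5 and 6 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code; the variable names and the regularization value are assumptions.

```python
import numpy as np

def ridge_regression(X, y, lam=0.1):
    """Closed-form ridge regression: w = (X^T X + lambda*I)^-1 X^T y."""
    K = X.shape[1]
    A = X.T @ X + lam * np.eye(K)   # K x K matrix; inverting/solving it costs O(K^3)
    b = X.T @ y
    return np.linalg.solve(A, b)    # solving the linear system is cheaper and more stable than an explicit inverse

# tiny usage example with random data
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = rng.normal(size=20)
w = ridge_regression(X, y)
print(((X @ w - y) ** 2).sum())     # sum of squared errors of the optimal solution
```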
  7. RR1: RR with coordinate descent. Idea: optimize only one variable of w at a time. Start with zero. Sum of squared errors: 24.6
  8. RR1: RR with coordinate descent. Start with zero, then optimize w1. Sum of squared errors: 7.5
  9. RR1: RR with coordinate descent. Start with zero, then optimize w1, then optimize w2. Sum of squared errors: 6.2
  10. RR1: RR with coordinate descent. Start with zero, then optimize w1, then w2, then w3. Sum of squared errors: 5.7
  11. RR1: RR with coordinate descent. ... then w4. Sum of squared errors: 5.4
  12. RR1: RR with coordinate descent. ... then w5. Sum of squared errors: 5.0
  13. RR1: RR with coordinate descent. ... then w1 again. Sum of squared errors: 3.4
  14. RR1: RR with coordinate descent. ... then w2 again. Sum of squared errors: 2.9
  15. RR1: RR with coordinate descent. ... then w3 again. Sum of squared errors: 2.7
  16. RR1: RR with coordinate descent. ... after a while the sum of squared errors is 0.055: no remarkable difference from the exact RR solution. Cost: proportional to the number of examples n and the number of epochs e
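
The RR1 scheme of slides 7 to 16 (cyclic coordinate descent on the ridge objective, one coordinate of w at a time) can be sketched as follows. Each coordinate update is the one-dimensional ridge solution; names and defaults are illustrative.

```python
import numpy as np

def rr1(X, y, lam=0.1, epochs=10, w0=None):
    """Ridge regression via cyclic coordinate descent (the RR1 scheme)."""
    n, K = X.shape
    w = np.zeros(K) if w0 is None else w0.copy()   # start from zero, or warm-start from a previous solution
    resid = y - X @ w                              # residual maintained incrementally
    col_sq = (X ** 2).sum(axis=0)                  # x_j^T x_j for each coordinate
    for _ in range(epochs):
        for j in range(K):
            resid += X[:, j] * w[j]                # put coordinate j's contribution back into the residual
            w[j] = X[:, j] @ resid / (col_sq[j] + lam)  # re-solve the 1-D ridge problem for w_j
            resid -= X[:, j] * w[j]                # remove the updated contribution again
    return w
```

One epoch touches every example once per coordinate, so the cost per epoch grows with n and K, matching the "n examples, e epochs" cost noted on slide 16.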
  17. Matrix factorization: the rating matrix R of size (M x N) is approximated as the product of two lower-rank matrices, R ≈ P Q^T, where P is the user feature matrix of size (M x K), Q is the item (movie) feature matrix of size (N x K), and K is the number of features
  18. Matrix factorization for explicit feedback: a numeric example of R ≈ P Q^T (example matrices shown on the slide)
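
With the factorization R ≈ P Q^T, a predicted rating is just the dot product of a user's row of P and an item's row of Q. A minimal sketch with made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, K = 4, 6, 2          # users, items, features (illustrative sizes)
P = rng.random((M, K))     # user feature matrix
Q = rng.random((N, K))     # item feature matrix

R_hat = P @ Q.T            # full predicted rating matrix, M x N
r_ui = P[1] @ Q[3]         # predicted rating of user index 1 for item index 3
```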
  19. Finding P and Q: initialize Q randomly, then find p1 (the slide shows the example matrices with the entries of p1 still unknown)
  20. Finding p1 with RR. Optimal solution: p1 = (Q1^T Q1 + λI)^(-1) Q1^T r1, where Q1 holds the feature vectors of the items rated by user 1 and r1 holds those ratings
  21. Finding p1 with RR (the slide shows the recomputed values of p1 in the example)
  22. Alternating Least Squares (ALS):
      - Initialize Q randomly
      - Repeat:
          - Recompute P: compute p1 with RR, compute p2 with RR, ... (for each user)
          - Recompute Q: compute q1 with RR, ... (for each item)
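
A compact sketch of the ALS loop on slide 22, using the closed-form RR step for each user and item vector over only the observed ratings. This is an illustrative implementation with assumed names and hyperparameters, not the authors' code.

```python
import numpy as np

def als(R, K=10, lam=0.1, n_iter=10, seed=0):
    """ALS for explicit feedback. R is a dense M x N array with np.nan marking missing ratings."""
    rng = np.random.default_rng(seed)
    M, N = R.shape
    P = np.zeros((M, K))
    Q = rng.normal(scale=0.1, size=(N, K))   # initialize Q randomly
    known = ~np.isnan(R)
    for _ in range(n_iter):
        for u in range(M):                   # recompute P, one user at a time, with RR
            items = np.where(known[u])[0]
            Qu = Q[items]
            A = Qu.T @ Qu + lam * np.eye(K)
            P[u] = np.linalg.solve(A, Qu.T @ R[u, items])
        for i in range(N):                   # recompute Q, one item at a time, with RR
            users = np.where(known[:, i])[0]
            Pi = P[users]
            A = Pi.T @ Pi + lam * np.eye(K)
            Q[i] = np.linalg.solve(A, Pi.T @ R[users, i])
    return P, Q
```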
  23. ALS1: ALS with RR1. ALS relies on RR: each vector is recomputed from scratch with RR, so when recomputing p1 the previously computed value is ignored. ALS1 relies on RR1 instead: it optimizes the previously computed p1 one scalar at a time, so the previous value is not lost, and RR1 is run for only one epoch. ALS is just an approximation method, and likewise ALS1.
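
ALS1 replaces the closed-form RR step with a single warm-started RR1 epoch, so the previous p_u is the starting point instead of being thrown away. A sketch of one user update, reusing the rr1 function from the earlier sketch (names are illustrative):

```python
import numpy as np

def recompute_user_als1(P, Q, R, known, u, lam=0.1):
    """One ALS1 user update: a single RR1 epoch, warm-started from the current p_u."""
    items = np.where(known[u])[0]
    Qu = Q[items]                                    # feature vectors of the items user u rated
    P[u] = rr1(Qu, R[u, items], lam=lam, epochs=1, w0=P[u])
```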
  24. Implicit feedback: the rating matrix R now contains 1 where the user watched the item and 0 elsewhere (example binary matrix and its factorization shown on the slide)
  25. Implicit feedback: IALS. The matrix is fully specified: every user-item pair has a value. Zeros are less important, but still important; there are many 0s and few 1s. Recall that the RR solution only needs X^T X and X^T y. Idea (Hu, Koren, Volinsky): consider a user who watched nothing (the null user); compute X^T X and X^T y once for this null user; when recomputing p1, compare the user to the null user and update the cached X^T X and X^T y according to the differences. In this way only the number of 1s affects performance, not the number of 0s. IALS: alternating least squares with this trick.
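
One common way to write the Hu-Koren-Volinsky update is sketched below. The K x K Gram matrix Q^T Q of the null user is cached once per sweep, and each real user only adds corrections for the items they watched, so the work depends on the number of 1s. The confidence weight alpha for watched items is an assumption of this sketch; the slide does not spell out the exact weighting.

```python
import numpy as np

def ials_user_sweep(R01, Q, lam=0.1, alpha=40.0):
    """Recompute all user vectors for binary feedback R01 (an M x N array of 0/1)."""
    M, _ = R01.shape
    K = Q.shape[1]
    P = np.zeros((M, K))
    QtQ = Q.T @ Q                                  # cached "null user" matrix, computed once per sweep
    for u in range(M):
        items = np.where(R01[u] > 0)[0]            # only the watched items matter
        Qu = Q[items]
        A = QtQ + alpha * (Qu.T @ Qu) + lam * np.eye(K)   # correction relative to the null user
        b = (1.0 + alpha) * Qu.sum(axis=0)         # Q_u^T c_u for binary preferences
        P[u] = np.linalg.solve(A, b)
    return P
```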
  26. Implicit feedback: IALS1. The RR1 trick cannot be applied here.
  27. Implicit feedback: IALS1. The RR1 trick cannot be applied here... but wait!
  28. Implicit feedback: IALS1. X^T X is just a matrix: no matter how many items we have, its dimension is the same (K x K). If we are lucky, we can find K items that generate this matrix. What if we are unlucky? We can still create synthetic items. Assume that the null user did not watch these K synthetic items. X^T X and X^T y stay the same if the synthetic items are created appropriately.
  29. Implicit feedback: IALS1. Can we find a matrix Z such that Z is small (K x K) and Z^T Z = X^T X? We can, by eigenvalue decomposition.
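
The eigendecomposition trick on slide 29 in a minimal sketch: since X^T X is symmetric positive semidefinite, it factors as V diag(d) V^T, and Z = diag(sqrt(d)) V^T gives K synthetic item rows with Z^T Z = X^T X. Function and variable names are illustrative.

```python
import numpy as np

def synthetic_items(Q):
    """Return a K x K matrix Z of synthetic items with Z^T Z == Q^T Q (up to numerical error)."""
    G = Q.T @ Q                        # K x K Gram matrix over all items
    d, V = np.linalg.eigh(G)           # G = V diag(d) V^T, with d >= 0 since G is PSD
    d = np.clip(d, 0.0, None)          # guard against tiny negative eigenvalues from round-off
    Z = np.sqrt(d)[:, None] * V.T      # Z = diag(sqrt(d)) V^T
    return Z

# quick check that the K synthetic rows reproduce the Gram matrix:
# rng = np.random.default_rng(0); Q = rng.normal(size=(1000, 8))
# Z = synthetic_items(Q); print(np.allclose(Z.T @ Z, Q.T @ Q))
```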
  30. Implicit feedback: IALS1. If a user watched N items, we can run RR1 with N + K examples (the watched items plus the K synthetic ones). Recomputing p_u then takes on the order of (N + K) K steps (assuming one epoch). Is this better in practice than the roughly K^3 + N K^2 steps per user of IALS?
  31. Evaluation of ALS vs. ALS1: Probe10 RMSE on the Netflix Prize dataset, after 25 epochs
  32. Evaluation of ALS vs. ALS1: time-accuracy tradeoff
  33. Evaluation of IALS vs. IALS1: Average Relative Position on the test subset of a proprietary implicit feedback dataset, after 20 epochs. Lower is better.
  34. Evaluation of IALS vs. IALS1: time-accuracy tradeoff
  35. Conclusions. We learned two tricks: ALS1, where RR1 is used instead of RR inside ALS, and IALS1, where a few synthetic examples replace the non-watching of many items. ALS and IALS are approximation algorithms anyway, so why not make them even more approximate. ALS1 and IALS1 offer better time-accuracy tradeoffs, especially when K is large: they can be up to 10x faster (or even 100x faster for unrealistically large K values). TODO: precision, recall, other datasets.
  36. Thank you for your attention. Questions?
