
Matrix Factorizations for Recommender Systems

Scalable algorithms for recommender systems in one-class collaborative filtering settings: WRMF and Linear-Flow.


  1. Matrix Factorizations for Recommender Systems. Dmitriy Selivanov (selivanov.dmitriy@gmail.com), 2017-11-16
  2. Recommender systems are everywhere (figure)
  3. Recommender systems are everywhere (figure)
  4. Recommender systems are everywhere (figure)
  5. Recommender systems are everywhere (figure)
  6. Goals: propose "relevant" items to customers - retention, exploration, up-sale.
     - Personalized offers: recommended items for a customer given a history of activities (transactions, browsing history, favourites)
     - Similar items: substitutions, bundles (frequently bought together), ...
  7. Live demo. Dataset - LastFM-360K: 360k users, 160k artists, 17M observations, sparsity ≈ 0.9997.
  8. Explicit feedback. Ratings, likes/dislikes, purchases: cleaner data, but smaller and harder to collect. $RMSE^2 = \frac{1}{|D|}\sum_{(u,i) \in D}(r_{ui} - \hat{r}_{ui})^2$
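To make the objective concrete, a minimal R sketch computing it over the observed set only (the toy ratings are assumed, not from the deck):

```r
# RMSE over the observed set D only (toy numbers for illustration)
r     <- c(5, 3, 4, 1)        # observed ratings r_ui for (u, i) in D
r_hat <- c(4.5, 3.2, 3.8, 2)  # model predictions for the same pairs
rmse  <- sqrt(mean((r - r_hat)^2))
rmse                          # ~0.577
```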
  9. Netflix prize. ~480k users, 18k movies, 100M ratings, sparsity ~99%. The goal was to reduce RMSE by 10%: from 0.9514 to 0.8563.
  10. Implicit feedback. Noisy signals (clicks, likes, purchases, searches, ...); much easier to collect; wider user/item coverage; usually sparsity > 99.9%. One-Class Collaborative Filtering:
      - observed entries are positive preferences and should get high confidence
      - missing entries in the matrix are a mix of negative and positive preferences; treat them as negative with low confidence
      - we cannot really distinguish whether a user did not click a banner because of lack of interest or lack of awareness
  11. Evaluation. Recap: we only care about producing a small set of highly relevant items. RMSE is a bad metric - it has a very weak connection to business goals. We are only interested in the precision of retrieved items: space on the screen is limited, and only order matters - the most relevant items should be at the top.
  12. Ranking - Mean Average Precision. $AveragePrecision = \frac{\sum_{k=1}^{n} P(k) \times rel(k)}{\text{number of relevant documents}}$
      index  relevant  precision_at_k
      1      0         0.0000000
      2      0         0.0000000
      3      1         0.3333333
      4      0         0.2500000
      5      0         0.2000000
      map@5 = 0.1566667
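The table can be reproduced with a few lines of R. Note that the map@5 value on the slide averages precision@k over all top-5 positions; `ap_at_k` below is an assumed helper name, a minimal sketch rather than the deck's code:

```r
# Average precision at k as computed in the table above: the mean of
# precision@k over the top-k positions
ap_at_k <- function(relevant, k = length(relevant)) {
  relevant <- relevant[seq_len(k)]
  p_at_k <- cumsum(relevant) / seq_len(k)  # precision after each position
  mean(p_at_k)
}
ap_at_k(c(0, 0, 1, 0, 0))  # 0.1566667, matches map@5 on the slide
```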
  13. Ranking - Normalized Discounted Cumulative Gain. The intuition is the same as for MAP@K, but it also takes the graded value of relevance into account: $DCG_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2(i + 1)}$, $nDCG_p = \frac{DCG_p}{IDCG_p}$, $IDCG_p = \sum_{i=1}^{|REL|} \frac{2^{rel_i} - 1}{\log_2(i + 1)}$
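A minimal R sketch of these formulas (`ndcg_at_k` is an assumed helper; the graded relevance vector is made up):

```r
# nDCG@k straight from the formulas above; rel is a graded relevance
# vector in ranked order
ndcg_at_k <- function(rel, k = length(rel)) {
  dcg <- function(r) sum((2^r - 1) / log2(seq_along(r) + 1))
  ideal <- sort(rel, decreasing = TRUE)      # best possible ordering
  dcg(rel[seq_len(k)]) / dcg(ideal[seq_len(k)])
}
ndcg_at_k(c(0, 2, 3, 0, 1), k = 5)
```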
  14. Approaches. Content-based: good for cold start, but not personalized. Collaborative filtering: vanilla collaborative filtering, matrix factorizations, ... Hybrid and context-aware recommender systems: the best of both worlds.
  15. Focus today. WRMF (Weighted Regularized Matrix Factorization) - "Collaborative Filtering for Implicit Feedback Datasets" (2008): efficient learning with accelerated approximate Alternating Least Squares; inference time. Linear-Flow - "Practical Linear Models for Large-Scale One-Class Collaborative Filtering" (2016): efficient truncated SVD; cheap cross-validation over the full regularization path.
  16. Matrix factorizations. Users can be described by a small number of latent factors $p_{uk}$; items can be described by a small number of latent factors $q_{ki}$.
  17. Sparse data (figure: sparse user-item interaction matrix, users × items)
  18. Low-rank matrix factorization: $R = P \times Q$ (figure: users × items matrix factorized into users × factors and factors × items)
  19. Reconstruction (figure: the users × items matrix approximated by the product of the factor matrices)
  20. Truncated SVD. Take the k largest singular values: $X \approx U_k D_k V_k^T = X_k \in R^{m \times n}$. The columns of $U_k$ and $V_k$ are orthonormal bases (the dot product of any two columns is zero, each has unit norm); $D_k$ is a diagonal matrix with the singular values on the diagonal. Truncated SVD is the best rank-k approximation of the matrix X in terms of the Frobenius norm $||X - U_k D_k V_k^T||_F$. Factors: $P = U_k D_k^{1/2}$, $Q = D_k^{1/2} V_k^T$.
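A small R illustration of the decomposition using base `svd()` on a dense toy matrix (for large sparse data a truncated solver such as irlba would be used instead; all data here are assumptions):

```r
# Truncated SVD of a dense toy matrix with base svd()
set.seed(1)
X <- matrix(rnorm(100 * 20), 100, 20)
k <- 5
s   <- svd(X, nu = k, nv = k)
D_k <- diag(s$d[1:k])
X_k <- s$u %*% D_k %*% t(s$v)             # best rank-k approximation of X
P   <- s$u %*% diag(sqrt(s$d[1:k]))       # user factors
Q   <- diag(sqrt(s$d[1:k])) %*% t(s$v)    # item factors
norm(X - X_k, "F")                        # Frobenius reconstruction error
```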
  21. Issue with truncated SVD for "explicit" feedback. It is optimal in terms of the Frobenius norm, so it takes the zeros in the rating matrix into account: $RMSE^2 = \frac{1}{|users| \times |items|}\sum_{u \in users,\ i \in items}(r_{ui} - \hat{r}_{ui})^2$. This overfits the data. The objective should measure the error only on "observed" ratings: $RMSE^2 = \frac{1}{|Observed|}\sum_{u,i \in Observed}(r_{ui} - \hat{r}_{ui})^2$
  22. SVD-like matrix factorization with ALS. $J = \sum_{u,i \in Observed}(r_{ui} - p_u \times q_i)^2 + \lambda(||P||^2 + ||Q||^2)$. Given Q fixed, solve for each $p_u$: $\min \sum_{i \in Observed(u)}(r_{ui} - p_u \times q_i)^2 + \lambda \sum_{j=1}^{k} p_{uj}^2$. Given P fixed, solve for each $q_i$: $\min \sum_{u \in Observed(i)}(r_{ui} - p_u \times q_i)^2 + \lambda \sum_{j=1}^{k} q_{ij}^2$. Each subproblem is a ridge regression: $p_u = (Q^T Q + \lambda I)^{-1} Q^T r_u$, $q_i = (P^T P + \lambda I)^{-1} P^T r_i$
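One full ALS sweep over these ridge problems might look like the following dense R sketch (toy shapes assumed; this is not production code):

```r
# One ALS sweep for explicit feedback. R is a users x items matrix with NA
# for unobserved ratings; P (users x k) and Q (items x k) are the factors.
als_sweep <- function(R, P, Q, lambda) {
  k <- ncol(P)
  for (u in seq_len(nrow(R))) {
    obs <- which(!is.na(R[u, ]))           # items rated by user u
    Qo  <- Q[obs, , drop = FALSE]
    P[u, ] <- solve(crossprod(Qo) + lambda * diag(k), crossprod(Qo, R[u, obs]))
  }
  for (i in seq_len(ncol(R))) {
    obs <- which(!is.na(R[, i]))           # users who rated item i
    Po  <- P[obs, , drop = FALSE]
    Q[i, ] <- solve(crossprod(Po) + lambda * diag(k), crossprod(Po, R[obs, i]))
  }
  list(P = P, Q = Q)
}
```

Repeating the sweep until J stops improving gives the usual ALS training loop.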
  23. "Collaborative Filtering for Implicit Feedback Datasets". WRMF - Weighted Regularized Matrix Factorization; the "default" approach. Proposed in 2008, but still widely used in industry (even at YouTube); several high-quality open-source implementations exist. $J = \sum_{u,i} C_{ui}(P_{ui} - X_u Y_i)^2 + \lambda(||X||_F + ||Y||_F)$. Preferences are binary: $P_{ui} = 1$ if $R_{ui} > 0$, $0$ otherwise. Confidence: $C_{ui} = 1 + f(R_{ui})$
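A possible construction of the preference and confidence matrices in R with the Matrix package (`alpha` is an assumed hyperparameter; the paper's linear form $C = 1 + \alpha R$ is used):

```r
# Building binary preferences and confidences from raw interaction counts
library(Matrix)
R <- sparseMatrix(i = c(1, 1, 2, 3), j = c(2, 3, 1, 3),
                  x = c(5, 1, 2, 4), dims = c(3, 3))  # toy counts
P <- 1 * (R > 0)            # binary preferences: 1 wherever any interaction
alpha <- 40
C_minus_1 <- alpha * R      # sparse part of C = 1 + alpha * R; the dense "+1"
                            # is handled implicitly via the X'X trick below
```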
  24. Alternating Least Squares for implicit feedback. For fixed Y: $dL/dx_u = -2\sum_{i} c_{ui}(p_{ui} - x_u^T y_i)y_i + 2\lambda x_u = -2\sum_{i} c_{ui}(p_{ui} - y_i^T x_u)y_i + 2\lambda x_u = -2 Y^T C^u p(u) + 2 Y^T C^u Y x_u + 2\lambda x_u$, where $C^u$ is the diagonal matrix of user u's confidences $c_{ui}$. Setting $dL/dx_u = 0$ for the optimal solution gives $(Y^T C^u Y + \lambda I) x_u = Y^T C^u p(u)$, so $x_u$ can be obtained by solving a system of linear equations: $x_u = solve(Y^T C^u Y + \lambda I,\ Y^T C^u p(u))$
  25. Alternating Least Squares for implicit feedback. Similarly, for fixed X: $dL/dy_i = -2 X^T C^i p(i) + 2 X^T C^i X y_i + 2\lambda y_i$, so $y_i = solve(X^T C^i X + \lambda I,\ X^T C^i p(i))$. Another optimization: $X^T C^i X = X^T X + X^T (C^i - I) X$ and $Y^T C^u Y = Y^T Y + Y^T (C^u - I) Y$, where $X^T X$ and $Y^T Y$ can be precomputed and $C^i - I$ is nonzero only at observed entries.
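Putting the precomputation trick into code, a sketch of a single user update (`C_minus_1` and `P` as in the sketch above; `Y` is an assumed dense items × k factor matrix):

```r
# One user update with the Y'Y precomputation trick
update_user <- function(u, Y, C_minus_1, P, lambda, YtY = crossprod(Y)) {
  cu <- C_minus_1[u, ]                  # c_ui - 1: zero for unobserved items
  pu <- P[u, ]                          # binary preferences of user u
  # Y' Cu Y = Y'Y + Y' (Cu - I) Y; Cu - I is nonzero only on observed items
  A <- YtY + crossprod(Y * cu, Y) + lambda * diag(ncol(Y))
  b <- crossprod(Y, (1 + cu) * pu)      # Y' Cu p(u)
  drop(solve(A, b))
}
```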
  26. Accelerated Approximate Alternating Least Squares. $y_i = solve(X^T C^i X + \lambda I,\ X^T C^i p(i))$. Instead of solving the system exactly, use iterative methods - Conjugate Gradient or Coordinate Descent - with a fixed number of steps (usually 3-4 is enough).
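A generic conjugate-gradient solver with a fixed number of steps might look like this sketch (an illustration of the idea, not the deck's implementation):

```r
# Few-step conjugate gradient for A x = b, A symmetric positive definite
cg_solve <- function(A, b, x = rep(0, length(b)), steps = 3) {
  r <- b - drop(A %*% x)                # initial residual
  p <- r
  for (s in seq_len(steps)) {
    Ap    <- drop(A %*% p)
    alpha <- sum(r * r) / sum(p * Ap)   # optimal step along p
    x     <- x + alpha * p
    r_new <- r - alpha * Ap
    beta  <- sum(r_new * r_new) / sum(r * r)
    p     <- r_new + beta * p           # next conjugate direction
    r     <- r_new
  }
  x
}
```

Warm-starting `x` from the previous ALS iteration is what makes a handful of steps sufficient.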
  27. Inference time. How to make recommendations for new users? There are no user embeddings, since these users are not in the original matrix!
  28. Inference time. Make one ALS step with the item embedding matrix fixed to get the new user embeddings. Given fixed Y and $C^{u_{new}}$ - the confidences of the new user-item interactions: $x_{u_{new}} = solve(Y^T C^{u_{new}} Y + \lambda I,\ Y^T C^{u_{new}} p(u_{new}))$, then $scores = X_{new} Y^T$
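A fold-in sketch for a brand-new user, reusing names from the ALS sketches above (`new_r` is an assumed vector of the new user's interaction counts over all items; `alpha`, `lambda`, `Y`, `YtY` as before):

```r
# Fold-in for a new user: one ALS step with Y fixed
cu <- alpha * new_r                         # c_ui - 1 for the new user
pu <- as.numeric(new_r > 0)                 # binary preferences
A  <- YtY + crossprod(Y * cu, Y) + lambda * diag(ncol(Y))
b  <- crossprod(Y, (1 + cu) * pu)
x_new  <- drop(solve(A, b))                 # embedding of the new user
scores <- drop(Y %*% x_new)                 # rank all items for this user
```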
  29. WRMF implementations: Python implicit - implements Conjugate Gradient, with GPU support recently! R reco - implements Conjugate Gradient. Spark ALS. Quora qmf. Google TensorFlow. (Titles on the original slide are clickable links.)
  30. Linear-Flow. The idea is to learn an item-item similarity matrix W from the data: $\min J = ||X - X W_k||_F + \lambda ||W_k||_F$ with the constraint $rank(W) \le k$
  31. Linear-Flow observations. 1. Without L2 regularization the optimal solution is $W_k = Q_k Q_k^T$, where $SVD_k(X) = P_k \Sigma_k Q_k^T$. 2. Without the $rank(W) \le k$ constraint the optimal solution is just the ridge regression solution $W = (X^T X + \lambda I)^{-1} X^T X$ - infeasible, since the item-item matrix is too large to invert.
  32. Linear-Flow reparametrization. $SVD_k(X) = P_k \Sigma_k Q_k^T$. Let $W = Q_k Y$: $\underset{Y}{argmin}\ ||X - X Q_k Y||_F + \lambda ||Q_k Y||_F$. Motivation: for $\lambda = 0$, $W = Q_k Q_k^T$ is also the solution of the current problem, with $Y = Q_k^T$.
  33. Linear-Flow closed-form solution. Notice that if the columns of $Q_k$ are orthonormal, then $||Q_k Y||_F = ||Y||_F$. So solve $||X - X Q_k Y||_F + \lambda ||Y||_F$ - a simple ridge regression with the closed-form solution $Y = (Q_k^T X^T X Q_k + \lambda I)^{-1} Q_k^T X^T X$. Only a very cheap inversion of a $k \times k$ matrix is needed!
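The whole closed-form fit takes a few lines in R; a sketch assuming irlba for the truncated SVD (`X` is the sparse user-item matrix; `k` and `lambda` are assumed hyperparameters):

```r
# Closed-form Linear-Flow fit
library(Matrix)
library(irlba)
svd_k <- irlba(X, nv = k)                # truncated SVD: X ~ P_k S_k Q_k'
Qk <- svd_k$v                            # items x k
XQ <- X %*% Qk                           # users x k
A  <- as.matrix(crossprod(XQ)) + lambda * diag(k)  # Qk' X'X Qk + lambda I
Z  <- as.matrix(crossprod(XQ, X))        # Qk' X'X  (k x items)
Y  <- solve(A, Z)                        # ridge solution, k x items
# W = Qk %*% Y would be the item-item matrix; in practice it is kept
# factorized as (Qk, Y), since the dense items x items matrix is huge
```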
  34. Linear-Flow hassle-free cross-validation. $Y = (Q_k^T X^T X Q_k + \lambda I)^{-1} Q_k^T X^T X$. How to find lambda with cross-validation? Pre-compute $Z = Q_k^T X^T X$, so $Y = (Z Q_k + \lambda I)^{-1} Z$; pre-compute $Z Q_k$; notice that the value of lambda affects only the diagonal of $Z Q_k$; generate a sequence of lambdas (say, of length 50) based on the min/max diagonal values; solving 50 ridge regressions of small rank is super fast.
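A sketch of the resulting regularization path, reusing `Z`, `Qk`, and `k` from the sketch above (the 1e-6 floor on the diagonal is an assumed guard):

```r
# Cheap lambda path: lambda only shifts the diagonal of Z Qk
ZQ <- Z %*% Qk                                     # k x k, precomputed once
lo <- max(min(diag(ZQ)), 1e-6)
hi <- max(diag(ZQ))
lambdas <- exp(seq(log(lo), log(hi), length.out = 50))
for (lam in lambdas) {
  Y_lam <- solve(ZQ + lam * diag(k), Z)            # one small ridge solve
  # ... evaluate map@k / ndcg@k on a validation split here ...
}
```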
  35. Linear-Flow hassle-free cross-validation (figure)
  36. Suggestions. Start simple - SVD, WRMF. Design proper cross-validation - both the objective and the data split. Think about how to incorporate business logic (for example, how to exclude certain items). Use single-machine implementations. Think about inference time. Don't waste time on libraries/articles/blog posts which demonstrate matrix factorization on dense matrices.
  37. Questions? http://dsnotes.com/tags/recommender-systems/, https://github.com/dselivanov/reco. Contacts: selivanov.dmitriy@gmail.com, https://github.com/dselivanov, https://www.linkedin.com/in/dselivanov1
