Latent factor models for Collaborative Filtering


Published in: Education, Technology

  1. AIM3 – Scalable Data Analysis and Data Mining
     11 – Latent factor models for Collaborative Filtering
     Sebastian Schelter, Christoph Boden, Volker Markl
     Fachgebiet Datenbanksysteme und Informationsmanagement, Technische Universität Berlin
     20.06.2012, DIMA – TU Berlin
  2. Recap: Item-Based Collaborative Filtering
     • compute pairwise similarities of the columns of the rating matrix using some similarity measure
     • store the top 20 to 50 most similar items per item in the item-similarity matrix
     • prediction: use a weighted sum over all items similar to the unknown item that have been rated by the current user:
       $p_{ui} = \frac{\sum_{j \in S(i,u)} s_{ij} \, r_{uj}}{\sum_{j \in S(i,u)} |s_{ij}|}$
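The weighted-sum prediction on this slide can be sketched in a few lines of plain Python. This is only an illustration: the dictionary layout of the ratings and similarities, the function name `predict_item_based` and the cutoff `k` are assumptions, not part of the lecture.

```python
def predict_item_based(user_ratings, similarities, item, k=20):
    """Predict a rating as a similarity-weighted average.

    user_ratings: dict item -> rating of the current user (r_uj)
    similarities: dict (item_a, item_b) -> similarity s_ij (assumed layout)
    """
    # S(i, u): items rated by the user for which a similarity to `item` is stored
    neighbors = [(similarities[(item, j)], r)
                 for j, r in user_ratings.items()
                 if j != item and (item, j) in similarities]
    # keep only the k most similar neighbors
    neighbors.sort(key=lambda pair: abs(pair[0]), reverse=True)
    neighbors = neighbors[:k]
    denominator = sum(abs(s) for s, _ in neighbors)
    if denominator == 0:
        return None  # no usable neighbors, no prediction
    return sum(s * r for s, r in neighbors) / denominator


# toy data: the user rated items A and B, we predict item C
ratings_of_user = {"A": 5.0, "B": 3.0}
sims = {("C", "A"): 0.8, ("C", "B"): 0.2}
prediction = predict_item_based(ratings_of_user, sims, "C")
```

With these toy values the prediction is pulled towards the rating of item A, because A is the more similar item.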
  3. Drawbacks of similarity-based neighborhood methods
     • the assumption that a rating is determined by all the user's ratings for commonly co-rated items is hard to justify in general
     • lack of bias correction
     • every co-rated item is looked at in isolation: if a movie is similar to „Lord of the Rings“, do we want each part of the trilogy to contribute as a separate similar item?
     • the best choice of similarity measure is found by experimentation, not derived from mathematical reasoning
  4. Latent factor models
     ■ Idea
     • ratings are deeply influenced by a set of factors that are very specific to the domain (e.g. amount of action in movies, complexity of characters)
     • these factors are in general not obvious: we might be able to think of some of them, but it is hard to estimate their impact on the ratings
     • the goal is to infer those so-called latent factors from the rating data by using mathematical techniques
  5. Latent factor models
     ■ Approach
     • users and items are characterized by $n_f$ latent factors: each user and item is mapped onto a latent feature space, $u_i, m_j \in \mathbb{R}^{n_f}$
     • each rating is approximated by the dot product of the user feature vector and the item feature vector: $r_{ij} \approx m_j^T u_i$
     • prediction of unknown ratings also uses this dot product
     • squared error as a measure of loss: $(r_{ij} - m_j^T u_i)^2$
  6. Latent factor models
     ■ Approach
     • decomposition of the rating matrix into the product of a user feature matrix and an item feature matrix: $R \approx U M^T$
     • row in U: vector of a user's affinity to the features
     • row in M: vector of an item's relation to the features
     • closely related to Singular Value Decomposition, which produces an optimal low-rank approximation of a matrix
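To make the decomposition concrete, the following sketch builds the dense approximation of the rating matrix from two small feature matrices. Every entry is the dot product of one row of U with one row of M. The helper names and the toy numbers are made up for illustration.

```python
def dot(a, b):
    """Dot product of two equally long feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def reconstruct(U, M):
    """Return the approximation U M^T as a list of rows (users x items)."""
    return [[dot(u_i, m_j) for m_j in M] for u_i in U]


# two users and three items, described by two latent factors each
U = [[1.0, 0.0],
     [0.5, 2.0]]
M = [[4.0, 1.0],
     [2.0, 0.5],
     [0.0, 2.0]]
R_hat = reconstruct(U, M)  # R_hat[i][j] approximates the rating r_ij
```

In practice the full product is rarely materialized, since only single entries are needed for predictions.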
  7. Latent factor models
     ■ Properties of the decomposition
     • automatically ranks features by their „impact“ on the ratings
     • features might not necessarily be intuitively understandable
  8. Latent factor models
     ■ Problematic situation with explicit feedback data
     • the rating matrix is not only sparse but only partially defined: missing entries cannot be interpreted as 0, they are just unknown
     • standard decomposition algorithms like the Lanczos method for SVD are not applicable
     ■ Solution
     • decomposition has to be done using the known ratings only
     • find the set of user and item feature vectors that minimizes the squared error to the known ratings:
       $\min_{U,M} \sum_{i,j} (r_{ij} - m_j^T u_i)^2$, where the sum runs over the known ratings only
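The point that only known ratings enter the error can be shown with a sparse representation. The sketch below assumes the known ratings are stored in a dictionary keyed by (user, item); unknown entries are simply absent and therefore never treated as 0. Names and numbers are illustrative.

```python
def squared_error(known_ratings, U, M):
    """Sum of squared errors over the known ratings only."""
    total = 0.0
    for (i, j), r in known_ratings.items():
        prediction = sum(u * m for u, m in zip(U[i], M[j]))
        total += (r - prediction) ** 2
    return total


# only 2 of the 6 possible entries of a 2 x 3 rating matrix are known
known = {(0, 0): 5.0, (1, 2): 3.0}
U = [[1.0, 0.0], [0.5, 2.0]]
M = [[4.0, 1.0], [2.0, 0.5], [0.0, 2.0]]
error = squared_error(known, U, M)
```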
  9. Latent factor models
     ■ quality of the decomposition is not measured by the reconstruction error on the original data, but by how well it generalizes to unseen data
     ■ regularization is necessary to avoid overfitting
     ■ the model has hyperparameters (regularization, learning rate) that need to be chosen
     ■ process: split the data into training, validation and test set
       □ train the model using the training set
       □ choose hyperparameters according to performance on the validation set
       □ evaluate generalization on the test set
       □ ensure that each data point is used in each role once (cross-validation)
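A minimal sketch of the splitting process, using only the standard library. The split fractions, the seed and the neutral names `tuning` (used to choose hyperparameters) and `holdout` (used only to estimate generalization) are assumptions chosen for illustration.

```python
import random


def three_way_split(ratings, tune_fraction=0.1, holdout_fraction=0.1, seed=42):
    """Shuffle the ratings and cut them into training, tuning and holdout parts."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    n_tune = round(len(shuffled) * tune_fraction)
    n_holdout = round(len(shuffled) * holdout_fraction)
    tuning = shuffled[:n_tune]
    holdout = shuffled[n_tune:n_tune + n_holdout]
    training = shuffled[n_tune + n_holdout:]
    return training, tuning, holdout


def cross_validation_folds(ratings, k=5, seed=42):
    """Yield (training, held_out) pairs; every rating is held out exactly once."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[f::k] for f in range(k)]
    for f in range(k):
        training = [x for g in range(k) if g != f for x in folds[g]]
        yield training, folds[f]


# 100 synthetic (user, item, rating) triples
ratings = [(user, item, 1.0 + (user + item) % 5)
           for user in range(10) for item in range(10)]
training, tuning, holdout = three_way_split(ratings)
folds = list(cross_validation_folds(ratings, k=5))
```

For recommender data one might additionally want every user and item to appear in the training part; this sketch does not enforce that.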
  10. Stochastic Gradient Descent
     • add a regularization term:
       $\min_{U,M} \sum_{i,j} \left[ (r_{ij} - m_j^T u_i)^2 + \lambda \left( \|u_i\|^2 + \|m_j\|^2 \right) \right]$
     • loop through all ratings in the training set and compute the associated prediction error: $e_{ij} = r_{ij} - m_j^T u_i$
     • modify the parameters in the opposite direction of the gradient, with learning rate $\gamma$:
       $u_i \leftarrow u_i + \gamma \, (e_{ij} m_j - \lambda u_i)$
       $m_j \leftarrow m_j + \gamma \, (e_{ij} u_i - \lambda m_j)$
     • problem: the approach is inherently sequential (although recent research might have unveiled a parallelization technique)
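The update rule can be tried out on a toy problem. The sketch below is a plain-Python illustration under assumed settings (learning rate 0.01, regularization 0.02, two factors, small positive random initialization, 200 passes over the data); none of these values come from the lecture.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sgd_factorize(ratings, n_users, n_items, n_factors=2,
                  gamma=0.01, lam=0.02, epochs=200, seed=0):
    """Factorize (user, item, rating) triples with stochastic gradient descent."""
    rng = random.Random(seed)
    # small random starting values for all feature vectors
    U = [[rng.uniform(0.0, 0.5) for _ in range(n_factors)] for _ in range(n_users)]
    M = [[rng.uniform(0.0, 0.5) for _ in range(n_factors)] for _ in range(n_items)]
    order = list(ratings)
    losses = []
    for _ in range(epochs):
        rng.shuffle(order)  # visit the ratings in random order
        for i, j, r in order:
            e = r - dot(M[j], U[i])  # prediction error for this rating
            for f in range(n_factors):
                u_f, m_f = U[i][f], M[j][f]
                # step against the gradient of the regularized squared error
                U[i][f] = u_f + gamma * (e * m_f - lam * u_f)
                M[j][f] = m_f + gamma * (e * u_f - lam * m_f)
        # training error after this pass, kept only for inspection
        losses.append(sum((r - dot(M[j], U[i])) ** 2 for i, j, r in ratings))
    return U, M, losses


ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
U, M, losses = sgd_factorize(ratings, n_users=3, n_items=3)
```

Note that each update reads the vectors written by the previous one, which is exactly why the loop is hard to parallelize.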
  11. Alternating Least Squares with Weighted λ-Regularization
     ■ Model
     • feature matrices are modeled directly by using only the observed ratings
     • add a regularization term to avoid overfitting
     • minimize the regularized error:
       $f(U, M) = \sum_{i,j} (r_{ij} - m_j^T u_i)^2 + \lambda \left( \sum_i n_{u_i} \|u_i\|^2 + \sum_j n_{m_j} \|m_j\|^2 \right)$
       where $n_{u_i}$ is the number of ratings of user $i$ and $n_{m_j}$ the number of ratings of item $j$
     ■ Solving technique
     • fixing one of the unknowns (U or M) turns this into a simple quadratic problem
     • alternate between fixing U and M until convergence („Alternating Least Squares“)
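A compact sketch of the alternating scheme in plain Python. With M fixed, each user vector solves the small linear system $(\sum_j m_j m_j^T + \lambda n_{u_i} I)\, u_i = \sum_j r_{ij} m_j$ over the items the user rated, and symmetrically for the items. The Gauss-Jordan helper, the function names, the initialization and the parameter values are assumptions for illustration; a real implementation would use a linear algebra library.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[k]] for k, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[k][n] / aug[k][k] for k in range(n)]


def update_side(observed_per_row, fixed, n_factors, lam):
    """Recompute all vectors of one side while the other side stays fixed."""
    updated = []
    for observed in observed_per_row:  # list of (other_index, rating)
        n = len(observed)
        if n == 0:
            updated.append([0.0] * n_factors)
            continue
        # A = sum of v v^T over the rated entries + lam * n * I
        A = [[lam * n if a == b else 0.0 for b in range(n_factors)]
             for a in range(n_factors)]
        rhs = [0.0] * n_factors
        for other, r in observed:
            v = fixed[other]
            for a in range(n_factors):
                rhs[a] += r * v[a]
                for b in range(n_factors):
                    A[a][b] += v[a] * v[b]
        updated.append(solve(A, rhs))
    return updated


def als_wr(ratings, n_users, n_items, n_factors=2, lam=0.05, iterations=10, seed=0):
    rng = random.Random(seed)
    by_user = [[] for _ in range(n_users)]
    by_item = [[] for _ in range(n_items)]
    for i, j, r in ratings:
        by_user[i].append((j, r))
        by_item[j].append((i, r))
    M = [[rng.uniform(0.0, 1.0) for _ in range(n_factors)] for _ in range(n_items)]
    U = [[0.0] * n_factors for _ in range(n_users)]
    for _ in range(iterations):
        U = update_side(by_user, M, n_factors, lam)  # fix M, solve for U
        M = update_side(by_item, U, n_factors, lam)  # fix U, solve for M
    return U, M


def objective(ratings, U, M, lam):
    """Regularized error f(U, M) with rating-count weighted penalties."""
    n_u, n_m, err = [0] * len(U), [0] * len(M), 0.0
    for i, j, r in ratings:
        n_u[i] += 1
        n_m[j] += 1
        err += (r - dot(M[j], U[i])) ** 2
    reg = sum(n * dot(u, u) for n, u in zip(n_u, U))
    reg += sum(n * dot(m, m) for n, m in zip(n_m, M))
    return err + lam * reg


ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
U1, M1 = als_wr(ratings, 3, 3, iterations=1)
U10, M10 = als_wr(ratings, 3, 3, iterations=10)
f1, f10 = objective(ratings, U1, M1, 0.05), objective(ratings, U10, M10, 0.05)
```

Because every half-step solves its subproblem exactly, the objective cannot increase from one iteration to the next, so `f10` is never larger than `f1` here.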
  12. ALS-WR is scalable
     ■ Which properties make this approach scalable?
     • all the feature vectors in one iteration can be computed independently of each other
     • only a small portion of the data is necessary to compute a feature vector
     ■ Parallelization with Map/Reduce
     • Computing user feature vectors: the mappers need to send each user's rating vector and the feature vectors of his/her rated items to the same reducer
     • Computing item feature vectors: the mappers need to send each item's rating vector and the feature vectors of the users who rated it to the same reducer
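The data flow for computing user feature vectors can be imitated with ordinary Python functions standing in for the mapper, the shuffle and the reducer. This is not Hadoop code, and to keep the reducer short it assumes a single latent factor, so the least-squares solve collapses to one division. All function names are hypothetical.

```python
from collections import defaultdict


def map_phase(ratings, M):
    """Mapper: key each rating by its user and attach the item's feature value."""
    for i, j, r in ratings:
        yield i, (r, M[j][0])  # single-factor assumption: one value per item


def shuffle_phase(pairs):
    """Group all values with the same key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_user(values, lam):
    """Reducer: one-factor regularized least squares for a single user."""
    n = len(values)  # number of ratings of this user
    numerator = sum(r * m for r, m in values)
    denominator = sum(m * m for _, m in values) + lam * n
    return [numerator / denominator]


ratings = [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0)]
M = [[2.0], [1.0]]  # current item features
groups = shuffle_phase(map_phase(ratings, M))
U = {i: reduce_user(values, lam=0.5) for i, values in groups.items()}
```

Each reducer call touches only one user's ratings and the vectors of the items that user rated, which is the property that lets the calls run independently.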
  13. Incorporating biases
     ■ Problem: explicit feedback data is highly biased
       □ some users tend to rate more extremely than others
       □ some items tend to get higher ratings than others
     ■ Solution: explicitly model biases
       □ the bias of a rating is modeled as a combination of the overall average rating $\mu$, the user bias $b_i$ and the item bias $b_j$: $b_{ij} = \mu + b_i + b_j$
       □ the rating bias can be incorporated into the prediction: $\hat{r}_{ij} = \mu + b_i + b_j + m_j^T u_i$
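A small numeric sketch of the biased prediction. The average is computed from a handful of toy ratings, while the bias values and feature vectors are simply made up; how the biases are estimated is not covered by this example.

```python
def average_rating(ratings):
    """Overall average of all known ratings (mu)."""
    return sum(ratings) / len(ratings)


def predict_with_bias(mu, user_bias, item_bias, u_i, m_j):
    """mu + user bias + item bias + dot product of the feature vectors."""
    interaction = sum(a * b for a, b in zip(u_i, m_j))
    return mu + user_bias + item_bias + interaction


mu = average_rating([4.0, 3.0, 5.0, 2.0])
# hypothetical values: a slightly critical user and a well-liked item
prediction = predict_with_bias(mu, user_bias=-0.5, item_bias=0.75,
                               u_i=[1.0, 0.5], m_j=[0.5, 1.0])
```

The dot product now only has to explain what remains after the average and the two biases are removed.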
  14. Latent factor models
     ■ implicit feedback data is very different from explicit data!
       □ e.g. use the number of clicks on a product page of an online shop
       □ the whole matrix is defined!
       □ no negative feedback
       □ interactions that did not happen produce zero values
       □ however, we should have only little confidence in these (maybe the user never had the chance to interact with these items)
       □ using standard decomposition techniques like SVD would give us a decomposition that is biased towards the zero entries, again not applicable
  15. Latent factor models
     ■ Solution for working with implicit data: weighted matrix factorization
     ■ create a binary preference matrix P: $p_{ij} = 1$ if $r_{ij} > 0$, and $p_{ij} = 0$ if $r_{ij} = 0$
     ■ each entry in this matrix can be weighted by a confidence function: $c(i,j) = 1 + \alpha \, r_{ij}$
       □ zero values should get low confidence
       □ values that are based on a lot of interactions should get high confidence
     ■ confidence is incorporated into the model
       □ the factorization will ‚prefer‘ more confident values:
         $f(U, M) = \sum_{i,j} c(i,j) \, (p_{ij} - m_j^T u_i)^2 + \lambda \left( \sum_i \|u_i\|^2 + \sum_j \|m_j\|^2 \right)$
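The confidence-weighted objective can be evaluated directly on a tiny interaction matrix. In this sketch the confidence is taken to be linear in the interaction count with a scaling parameter `alpha`; the values of `alpha` and `lam` and the toy factors are arbitrary choices for illustration.

```python
def preference(r):
    """Binary preference: 1 if any interaction happened, else 0."""
    return 1.0 if r > 0 else 0.0


def confidence(r, alpha):
    """Confidence grows linearly with the number of interactions."""
    return 1.0 + alpha * r


def weighted_loss(R, U, M, alpha, lam):
    """Confidence-weighted squared error over ALL entries plus regularization."""
    total = 0.0
    for i, row in enumerate(R):
        for j, r in enumerate(row):
            prediction = sum(u * m for u, m in zip(U[i], M[j]))
            total += confidence(r, alpha) * (preference(r) - prediction) ** 2
    penalty = sum(x * x for vec in U for x in vec)
    penalty += sum(x * x for vec in M for x in vec)
    return total + lam * penalty


# click counts of two users on two items
R = [[3, 0],
     [0, 1]]
U = [[1.0], [0.0]]
M = [[1.0], [0.0]]
loss = weighted_loss(R, U, M, alpha=2.0, lam=0.5)
```

Unlike the explicit-feedback case, the loop runs over every cell of the matrix, zeros included, but the zeros carry the minimum confidence of 1.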
  16. Sources
     • Sarwar et al.: „Item-Based Collaborative Filtering Recommendation Algorithms“, 2001
     • Koren et al.: „Matrix Factorization Techniques for Recommender Systems“, 2009
     • Funk: „Netflix Update: Try This at Home“, 2006
     • Zhou et al.: „Large-scale Parallel Collaborative Filtering for the Netflix Prize“, 2008
     • Hu et al.: „Collaborative Filtering for Implicit Feedback Datasets“, 2008