AIM3 – Scalable Data Analysis and Data Mining
11 – Latent factor models for Collaborative Filtering

Sebastian Schelter, Christoph Boden, Volker Markl

Fachgebiet Datenbanksysteme und Informationsmanagement
Technische Universität Berlin

20.06.2012
http://www.dima.tu-berlin.de/
Recap: Item-Based Collaborative Filtering

    • compute pairwise similarities of the columns of the rating matrix using some similarity measure
    • store the top 20 to 50 most similar items per item in the item-similarity matrix
    • prediction: use a weighted sum over all items similar to the unknown item that have been rated by the current user (see the sketch below)

p_{ui} = \frac{\sum_{j \in S(i,u)} s_{ij} \, r_{uj}}{\sum_{j \in S(i,u)} |s_{ij}|}
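A minimal sketch of this prediction rule, assuming a dense numpy rating matrix R (with 0 marking unknown ratings) and a precomputed item-item similarity matrix S; the function name and neighborhood size are illustrative:

import numpy as np

def predict_item_based(R, S, u, i, k=20):
    """Predict user u's rating for item i from the k most similar rated items."""
    rated = np.flatnonzero(R[u])                      # items the user has rated
    # keep the k items most similar to i among the user's rated items: S(i, u)
    neighbours = rated[np.argsort(-S[i, rated])][:k]
    numerator = np.sum(S[i, neighbours] * R[u, neighbours])
    denominator = np.sum(np.abs(S[i, neighbours]))
    return numerator / denominator if denominator > 0 else 0.0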
Drawbacks of similarity-based neighborhood methods

    • the assumption that a rating is determined by the user's ratings for all commonly co-rated items is hard to justify in general

    • lack of bias correction

    • every co-rated item is looked at in isolation: say a movie is similar to „Lord of the Rings“, do we want each part of the trilogy to contribute as a separate similar item?

    • the best choice of similarity measure is based on experimentation, not on mathematical reasoning
Latent factor models

■ Idea

    • ratings are deeply influenced by a set of factors that are very specific to the domain (e.g. amount of action in movies, complexity of characters)

    • these factors are in general not obvious; we might be able to think of some of them, but it is hard to estimate their impact on the ratings

    • the goal is to infer these so-called latent factors from the rating data using mathematical techniques
Latent factor models

■ Approach

    • users and items are characterized by latent factors; each user and item is mapped onto a latent feature space: u_i, m_j \in \mathbb{R}^f

    • each rating is approximated by the dot product of the user feature vector and the item feature vector: r_{ij} \approx m_j^T u_i

    • prediction of unknown ratings also uses this dot product

    • squared error as a measure of loss: \left( r_{ij} - m_j^T u_i \right)^2
Latent factor models

■ Approach

    • decomposition of the rating matrix into the product of a user feature matrix and an item feature matrix: R ≈ U M^T
    • row in U: vector of a user's affinity to the features
    • row in M: vector of an item's relation to the features

    • closely related to Singular Value Decomposition, which produces an optimal low-rank approximation of a matrix (illustrated below)
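A small numpy sketch of this connection: on a fully observed matrix, a truncated SVD yields the best rank-f approximation in the squared-error sense, and its factors can be folded into U and M. The toy matrix and the choice f = 2 are illustrative:

import numpy as np

# a fully observed toy rating matrix (5 users x 4 items)
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5],
              [5, 5, 0, 0]], dtype=float)

f = 2  # number of latent factors
U_svd, s, Vt = np.linalg.svd(R, full_matrices=False)

# split the singular values between the two factors: R ≈ U @ M.T
U = U_svd[:, :f] * np.sqrt(s[:f])   # user feature matrix, rows are the u_i
M = Vt[:f, :].T * np.sqrt(s[:f])    # item feature matrix, rows are the m_j

print(np.round(U @ M.T, 1))         # rank-f reconstruction of R

Note that this treats the 0 entries as actual zeros, which is exactly what the next slides argue is not admissible for explicit feedback data.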
Latent factor models


■ Properties of the decomposition
   • automatically ranks features by their „impact“ on the ratings
   • features might not necessarily be intuitively understandable




Latent factor models

■ Problematic situation with explicit feedback data

    • the rating matrix is not only sparse, but also only partially defined; missing entries cannot be interpreted as 0, they are simply unknown
    • standard decomposition algorithms like the Lanczos method for SVD are therefore not applicable

Solution

    • the decomposition has to be computed using the known ratings only
    • find the set of user and item feature vectors that minimizes the squared error to the known ratings:

\min_{U,M} \sum_{(i,j)} \left( r_{ij} - m_j^T u_i \right)^2
Latent factor models

■ the quality of the decomposition is not measured by the reconstruction error on the original data, but by the generalization to unseen data
■ regularization is necessary to avoid overfitting

■ the model has hyperparameters (regularization, learning rate) that need to be chosen

■ process: split the data into training, test and validation sets (sketched below)
    □ train the model using the training set
    □ choose hyperparameters according to performance on the test set
    □ evaluate generalization on the validation set
    □ ensure that each datapoint is used in each set once (cross-validation)
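A minimal sketch of such a split over the known ratings, assuming they are given as (user, item, rating) triples; the split fractions are illustrative, and for full cross-validation one would rotate which fold plays which role:

import numpy as np

def three_way_split(ratings, train=0.8, test=0.1, seed=42):
    """Shuffle the known (user, item, rating) triples and split them three ways."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(ratings))
    n_train = int(train * len(ratings))
    n_test = int(test * len(ratings))
    training = [ratings[i] for i in idx[:n_train]]
    testing = [ratings[i] for i in idx[n_train:n_train + n_test]]
    validation = [ratings[i] for i in idx[n_train + n_test:]]
    return training, testing, validation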
Stochastic Gradient Descent

    • add a regularization term:

\min_{U,M} \sum_{(i,j)} \left( r_{ij} - m_j^T u_i \right)^2 + \lambda \left( \| u_i \|^2 + \| m_j \|^2 \right)

    • loop through all ratings in the training set and compute the associated prediction error:

e_{ij} = r_{ij} - m_j^T u_i

    • modify the parameters in the opposite direction of the gradient:

u_i \leftarrow u_i + \gamma \left( e_{ij} m_j - \lambda u_i \right)
m_j \leftarrow m_j + \gamma \left( e_{ij} u_i - \lambda m_j \right)

    • problem: the approach is inherently sequential (although recent research might have unveiled a parallelization technique)
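A minimal sketch of this training loop, again assuming the ratings are given as (user, item, rating) triples; the hyperparameter values are illustrative, not tuned:

import numpy as np

def sgd_factorize(ratings, num_users, num_items, f=10,
                  gamma=0.01, lam=0.05, epochs=20, seed=0):
    """Learn U and M by stochastic gradient descent on the known ratings."""
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(num_users, f))   # user feature vectors u_i
    M = rng.normal(scale=0.1, size=(num_items, f))   # item feature vectors m_j
    for _ in range(epochs):
        for i, j, r in ratings:
            e = r - M[j] @ U[i]                      # prediction error e_ij
            u_old = U[i].copy()                      # update both from old values
            U[i] += gamma * (e * M[j] - lam * U[i])
            M[j] += gamma * (e * u_old - lam * M[j])
    return U, M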
Alternating Least Squares with Weighted λ-Regularization

■ Model

    • the feature matrices are modeled directly, using only the observed ratings
    • add a regularization term to avoid overfitting
    • minimize the regularized error of:

f(U, M) = \sum_{(i,j)} \left( r_{ij} - m_j^T u_i \right)^2 + \lambda \left( \sum_i n_{u_i} \| u_i \|^2 + \sum_j n_{m_j} \| m_j \|^2 \right)

      (n_{u_i} and n_{m_j} denote the number of ratings of user i and of item j)

Solving technique

    • fixing one of the two unknowns turns this into a simple quadratic problem with a closed-form least-squares solution
    • rotate between fixing U and M until convergence („Alternating Least Squares“, sketched below)
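A sketch of one half-iteration (recomputing all user vectors with M fixed), assuming a dense rating matrix plus a boolean mask of known entries and that every user has at least one rating; the item update is symmetric:

import numpy as np

def als_user_step(R, mask, M, lam):
    """Recompute every user feature vector with the item factors M fixed."""
    f = M.shape[1]
    U = np.zeros((R.shape[0], f))
    for i in range(R.shape[0]):
        rated = np.flatnonzero(mask[i])
        M_i = M[rated]                        # factors of the items user i rated
        n_ui = len(rated)                     # weighted-λ regularization term
        A = M_i.T @ M_i + lam * n_ui * np.eye(f)
        b = M_i.T @ R[i, rated]
        U[i] = np.linalg.solve(A, b)          # closed-form least-squares solution
    return U

One would alternate this step with the symmetric item step until the change in f(U, M) becomes small.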
ALS-WR is scalable

■ Which properties make this approach scalable?

    • all the feature vectors in one iteration can be computed independently of each other
    • only a small portion of the data is necessary to compute a single feature vector

Parallelization with Map/Reduce (data flow sketched below)

    • computing user feature vectors: the mappers need to send each user's rating vector and the feature vectors of his/her rated items to the same reducer

    • computing item feature vectors: the mappers need to send each item's rating vector and the feature vectors of the users who rated it to the same reducer
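A local simulation of that data flow for the user side, not an actual Hadoop job: grouping by user stands in for the map and shuffle phases, and each group is then solved independently, which is what a reducer would do. Names and structure are illustrative:

from collections import defaultdict
import numpy as np

def recompute_user_vectors(ratings, M, lam):
    """Simulate the Map/Reduce flow: group by user, then solve per user.

    ratings: iterable of (user, item, rating); M: dict item -> feature vector.
    """
    groups = defaultdict(list)              # 'map' + shuffle: key by user
    for u, i, r in ratings:
        groups[u].append((M[i], r))
    U = {}
    for u, pairs in groups.items():         # 'reduce': one user per call
        M_u = np.array([m for m, _ in pairs])
        r_u = np.array([r for _, r in pairs])
        f = M_u.shape[1]
        A = M_u.T @ M_u + lam * len(pairs) * np.eye(f)
        U[u] = np.linalg.solve(A, M_u.T @ r_u)
    return U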
Incorporating biases

■ Problem: explicit feedback data is highly biased
    □ some users tend to rate more extremely than others
    □ some items tend to get higher ratings than others

■ Solution: explicitly model the biases
    □ the bias of a rating is modeled as a combination of the overall average rating μ, the user bias and the item bias:

b_{ij} = \mu + b_i + b_j

    □ the rating bias can be incorporated into the prediction (see the sketch below):

\hat{r}_{ij} = \mu + b_i + b_j + m_j^T u_i
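A sketch of how the earlier SGD loop could be extended to learn the biases jointly with the factors, assuming numpy arrays throughout; the bias updates follow the same gradient pattern as the factor updates, and the parameter values are again illustrative:

def sgd_step_biased(U, M, b_user, b_item, mu, i, j, r, gamma=0.01, lam=0.05):
    """One SGD update for the biased model r̂_ij = μ + b_i + b_j + m_j^T u_i."""
    e = r - (mu + b_user[i] + b_item[j] + M[j] @ U[i])   # prediction error
    b_user[i] += gamma * (e - lam * b_user[i])
    b_item[j] += gamma * (e - lam * b_item[j])
    u_old = U[i].copy()
    U[i] += gamma * (e * M[j] - lam * U[i])
    M[j] += gamma * (e * u_old - lam * M[j])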
Latent factor models

■ implicit feedback data is very different from explicit data!

    □ e.g. use the number of clicks on a product page of an online shop

    □ the whole matrix is defined!
    □ there is no negative feedback
    □ interactions that did not happen produce zero values
    □ however, we should have only low confidence in these zeros (maybe the user never had the chance to interact with these items)

    □ using standard decomposition techniques like SVD would give us a decomposition that is biased towards the zero entries, so they are again not applicable
Latent factor models

■ Solution for working with implicit data: weighted matrix factorization

■ create a binary preference matrix P:

p_{ij} = \begin{cases} 1 & r_{ij} > 0 \\ 0 & r_{ij} = 0 \end{cases}

■ each entry in this matrix can be weighted by a confidence function:
    □ zero values should get low confidence
    □ values that are based on a lot of interactions should get high confidence

c(i, j) = 1 + \alpha r_{ij}

■ the confidence is incorporated into the model (see the sketch below)
    □ the factorization will ‚prefer‘ more confident values

f(U, M) = \sum_{i,j} c(i, j) \left( p_{ij} - m_j^T u_i \right)^2 + \lambda \left( \sum_i \| u_i \|^2 + \sum_j \| m_j \|^2 \right)
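A sketch of the user half of one alternating update for this weighted objective, following the closed-form solution of Hu et al.; the values of alpha and lam are illustrative, and the per-user confidence matrix is built densely here for clarity (the paper avoids this with a (C − I) decomposition trick):

import numpy as np

def implicit_user_step(R, M, alpha=40.0, lam=0.1):
    """Recompute user vectors for weighted matrix factorization on implicit data.

    R: (num_users, num_items) interaction counts; fully defined, zeros included.
    """
    P = (R > 0).astype(float)          # binary preference matrix
    C = 1.0 + alpha * R                # confidence for every single entry
    f = M.shape[1]
    U = np.zeros((R.shape[0], f))
    for i in range(R.shape[0]):
        Ci = np.diag(C[i])             # per-entry confidences of user i
        A = M.T @ Ci @ M + lam * np.eye(f)
        b = M.T @ Ci @ P[i]
        U[i] = np.linalg.solve(A, b)
    return U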
Sources


   • Sarwar et al.: „Item-Based Collaborative Filtering
     Recommendation Algorithms“, 2001
   • Koren et al.: „Matrix Factorization Techniques for Recommender
     Systems“, 2009
   • Funk: „Netflix Update: Try This at Home“,
     http://sifter.org/~simon/journal/20061211.html, 2006
   • Zhou et al.: „Large-scale Parallel Collaborative Filtering for the
     Netflix Prize“, 2008
   • Hu et al.: „Collaborative Filtering for Implicit Feedback
     Datasets“, 2008




