Recommender Systems from A to Z
Part 1: The Right Dataset
Part 2: Model Training
Part 3: Model Evaluation
Part 4: Real-Time Deployment
1. Introduction
Optimization problem, linear regression and Stochastic Gradient Descent (SGD)
2. Baseline models
Global average, user average and item-item models
3. Basic linear models
Least Squares (LS) and Regularized Least Squares (RLS)
4. Matrix factorization
Matrix Factorization: analytical and numerical solutions
5. Non-linear models
Basic and complex Deep Learning models
Introduction
Model training – Introduction
Explicit vs Implicit feedback

                   | Explicit feedback (users' ratings) | Implicit feedback (users' clicks)
Example Domains    | Movies, TV shows, music            | Marketplaces, businesses
Example Data type  | Like/Dislike, stars                | Clicks, play-time, purchases
Complexity         | Clean, costly, easy to interpret   | Noisy, cheap, difficult to interpret
Model training – Introduction
Recommendation engine types

Recommendation engine
├── Content-based
├── Collaborative filtering
│   ├── Memory-based: Item-Item, User-User
│   └── Model-based: User-Item
└── Hybrid engine

Model          | When?               | Solution strategies
Content-based  | Item cold start     | Least Squares, Deep Learning
Item-Item      | n_users >> n_items  | Affinity Matrix
User-User      | n_users << n_items  | KNN, Affinity Matrix
User-Item      | Better performance  | Matrix Factorization, Deep Learning
Model training – Introduction - Optimization
Optimization problem (definitions)

R ∈ R^(m×n): sparse matrix of ratings, with m users and n items (the available dataset)
U ∈ R^(m×k): dense matrix of user embeddings (unknown)
I ∈ R^(k×n): dense matrix of item embeddings (unknown)

Each row of R holds one user's ratings, from the ratings of User #1 down to the ratings of User #m for Items #1 to #n; each row of U is the embedding of one user, and each column of I is the embedding of one item, so that R ≈ U·I.
Model training – Introduction - Optimization
Optimization problem (basic formulation with RMSE)
Our goal is to find U and I such that the difference between each known rating in R and the product of the corresponding user and item embeddings is minimal:

$\min_{U, I} \sqrt{\tfrac{1}{|\mathcal{R}|} \sum_{(u,i) \in \mathcal{R}} \left( r_{ui} - U_u \cdot I_i \right)^2}$

where $\mathcal{R}$ is the set of observed (user, item) ratings.
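As a minimal NumPy sketch of this objective (the triple-list representation of the observed ratings is illustrative, not the deck's code):

import numpy as np

def rmse(observed, U, I):
    # observed: list of (user, item, rating) triples; U is m×k, I is k×n
    errors = [r - U[u] @ I[:, i] for u, i, r in observed]
    return np.sqrt(np.mean(np.square(errors)))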
Model training – Introduction - Optimization
Optimization problem (more complex formulation)

Content-based: $\min_{U} \| R - U I \|^2$, where I (the available data) holds the item features.
Content-based with Regularization: $\min_{U} \| R - U I \|^2 + \lambda \| U \|^2$, where the $\lambda$ term is regularization to avoid overfitting.

Take home
● In content-based models we already know I (the item features)
● We can find a linear solution to this problem using Least Squares
Model training – Introduction - Optimization
Optimization problem (more complex formulation)

Collaborative-filtering: $\min_{U, I} \| R - U I \|^2$, where only the observed ratings in R are the available data.
Collaborative-filtering with Regularization: $\min_{U, I} \| R - U I \|^2 + \lambda \left( \| U \|^2 + \| I \|^2 \right)$, where the $\lambda$ terms are regularization to avoid overfitting.

Take home
● In collaborative-filtering we want to find both U and I (the user and item embeddings)
● We can find a linear solution to this problem using Matrix Factorization and SGD
Model training – Introduction - Optimization
How to analytically solve an optimization problem?
Let's start with the simplest optimization problem: linear regression without regularization.
With $X \in \mathbb{R}^{m \times n}$, $y \in \mathbb{R}^{m}$ and $m > n$, we want to find $W$ such that:

$\min_{W} \| y - X W \|^2 \quad \Rightarrow \quad W = (X^\top X)^{-1} X^\top y$

(Add a column of ones to X to support $w_0$; the targets in $y$ are scalar numbers.)
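As a minimal NumPy sketch (the data here is made up for illustration), the closed-form solution can be computed without explicitly inverting $X^\top X$:

import numpy as np

# Made-up regression problem: m=100 samples, n=3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X = np.hstack([np.ones((100, 1)), X])        # add a column of ones to support w0
w_true = np.array([0.5, 2.0, -1.0, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=100)  # noisy scalar targets

# Solve the normal equation (X^T X) w = X^T y; never compute the inverse
w = np.linalg.solve(X.T @ X, X.T @ y)
# Equivalent and more numerically robust:
# w, *_ = np.linalg.lstsq(X, y, rcond=None)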
Model training – Introduction - Optimization
How to numerically solve an optimization problem?
Gradient descent: start with random values for W and repeatedly move in the opposite direction of the gradient of the cost J(W):

$W \leftarrow W - \alpha \nabla J(W)$

Stochastic Gradient Descent performs the same update, but estimates the gradient by taking just one sample at a time.
Model training – Introduction - Optimization

Gradient Descent algorithm
for epoch in n_epochs:
● compute the predictions for all the samples
● compute the error between truth and predictions
● compute the gradient using all the samples
● update the parameters of the model

Stochastic Gradient Descent algorithm
for epoch in n_epochs:
● shuffle the samples
● for sample in n_samples:
○ compute the predictions for the sample
○ compute the error between truth and predictions
○ compute the gradient using the sample
○ update the parameters of the model

Mini-Batch Gradient Descent algorithm
for epoch in n_epochs:
● shuffle the batches
● for batch in n_batches:
○ compute the predictions for the batch
○ compute the error for the batch
○ compute the gradient for the batch
○ update the parameters of the model
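As a minimal NumPy sketch of the mini-batch variant applied to the linear regression above (toy hyper-parameters are assumptions):

import numpy as np

def minibatch_gd(X, y, lr=0.01, n_epochs=100, batch_size=16, seed=0):
    """Mini-batch gradient descent for least-squares linear regression."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])          # start with random values for W
    for epoch in range(n_epochs):
        idx = rng.permutation(len(y))        # shuffle the samples into batches
        for start in range(0, len(y), batch_size):
            batch = idx[start:start + batch_size]
            error = X[batch] @ w - y[batch]             # error between predictions and truth
            grad = 2 * X[batch].T @ error / len(batch)  # gradient for the batch
            w -= lr * grad                              # move against the gradient
    return w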
Model training – Introduction - Optimization
Gradient Descent comparison

                   | Gradient Descent         | Stochastic Gradient Descent | Mini-Batch Gradient Descent
Gradient           | Exact, over all samples  | Estimated from one sample   | Estimated over one batch
Speed              | Very fast (vectorized)   | Slow (sample by sample)     | Fast (vectorized)
Memory             | O(dataset)               | O(1)                        | O(batch)
Convergence        | Needs more epochs        | Needs fewer epochs          | Between GD and SGD
Gradient Stability | Smooth parameter updates | Noisy parameter updates     | Between GD and SGD
Model training – Introduction - Optimization
A Problem with Implicit Feedback
With datasets containing only unary positive feedback (e.g. click history), every observed entry is positive, so the model never sees a negative example to discriminate against.

Negative Sampling
Common fix: add randomly drawn (user, item) pairs with rating r=0, sampled from a uniform distribution over the dataset.
● Expresses "unknown items" for users
● Acts as a regularizer
● Also works for explicit feedback
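A minimal NumPy sketch of uniform negative sampling (the rejection loop and function name are illustrative, not the deck's exact code):

import numpy as np

def add_negative_samples(rows, cols, n_users, n_items, n_neg, seed=0):
    """Draw random (user, item) pairs not in the positive set; label them r=0."""
    rng = np.random.default_rng(seed)
    positives = set(zip(rows.tolist(), cols.tolist()))
    neg_rows, neg_cols = [], []
    while len(neg_rows) < n_neg:
        u = int(rng.integers(n_users))   # user drawn from a uniform distribution
        i = int(rng.integers(n_items))   # item drawn from a uniform distribution
        if (u, i) not in positives:      # keep only unobserved pairs
            neg_rows.append(u)
            neg_cols.append(i)
    return np.array(neg_rows), np.array(neg_cols), np.zeros(n_neg)  # ratings = 0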
Baseline models
Model Training – Baseline models
Introduction
● Before starting to train models, always compute a baseline
● Baselines are very useful for debugging more complex models
● As a general rule:
○ Very basic models can't capture all the details in the training data and tend to underfit
○ Very complex models capture every detail in the training data and tend to overfit
● Note: throughout this presentation we will use RMSE to compare model performance
Model Training – Baseline models
Global Average
Average = 3.64
Predict 3.64 for every validation rating.
RMSE = sqrt(mean((2 - 3.64)^2 + (1 - 3.64)^2 + …)) = sqrt(4.13)
Model Training – Baseline models
Global average - Numpy code
import numpy as np
from scipy.sparse import csr_matrix

rows = np.array([0,0,0,1,1,2,2,2,2,3,3,3,4,4,5,5,5])
cols = np.array([0,1,5,3,5,0,1,2,4,0,3,5,0,2,1,3,4])
data = np.array([2,5,4,1,5,2,4,5,4,4,5,1,5,2,1,4,2])
ratings = csr_matrix((data, (rows, cols)), shape=(6, 6))

# 80/20 train/validation split over the observed ratings
idx = np.random.permutation(data.size)
idx_train = idx[0:int(idx.size*0.8)]
idx_valid = idx[int(idx.size*0.8):]

# Predict the training-set mean for every validation rating
global_avg = data[idx_train].mean()
rmse = np.sqrt(((data[idx_valid] - global_avg)**2).mean())
Model Training – Baseline models
User average
Average u1 = 4.50, u2 = 5.00, u3 = 3.67, u4 = 2.50, u5 = 5.00, u6 = 2.50
Predict each user's own average for their validation ratings.
RMSE = sqrt(mean((2 - 4.5)^2 + (1 - 5.0)^2 + …)) = sqrt(6.15)
Model Training – Baseline models
User average - Numpy code
rows = np.array([0,0,0,1,1,2,2,2,2,3,3,3,4,4,5,5,5])
cols = np.array([0,1,5,3,5,0,1,2,4,0,3,5,0,2,1,3,4])
data = np.array([2,5,4,1,5,2,4,5,4,4,5,1,5,2,1,4,2])
ratings = csr_matrix((data, (rows, cols)), shape=(6, 6))

idx = np.random.permutation(data.size)
idx_train = idx[0:int(idx.size*0.8)]
idx_valid = idx[int(idx.size*0.8):]
ratings_train = csr_matrix((data[idx_train], (rows[idx_train], cols[idx_train])), shape=(6, 6))

# Per-user average over the training ratings only
count_per_row = (ratings_train > 0).sum(axis=1).A1
sum_per_row = ratings_train.sum(axis=1).A1.astype('float32')
user_avg = sum_per_row / count_per_row   # NaN if a user has no training ratings

# Index data/rows with the same permutation so predictions and truths stay aligned
rmse = np.sqrt(((data[idx_valid] - user_avg[rows[idx_valid]])**2).mean())
Model Training – Baseline models
Item-Item
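The original slide presents this model only as a figure; below is a minimal sketch of one common item-item approach, a cosine-similarity affinity matrix (the exact weighting is an assumption, not the slide's method):

import numpy as np

def item_item_predict(R):
    """Score items via a cosine-similarity affinity matrix built from R."""
    norms = np.linalg.norm(R, axis=0, keepdims=True) + 1e-9
    sim = (R / norms).T @ (R / norms)        # item-item cosine affinity matrix
    # Predict each user's score for every item as a similarity-weighted
    # average of the ratings that user has already given.
    weights = np.abs(sim).sum(axis=0, keepdims=True) + 1e-9
    return R @ sim / weights

R = np.array([[2., 5., 0., 0., 0., 4.],
              [0., 0., 0., 1., 0., 5.]])    # toy dense ratings (0 = unknown)
print(item_item_predict(R).round(2))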
Basic linear models
Model Training – Basic linear models
Content Based - Standard Least Squares model
● Goal: a very basic linear model
● Data: the matrix of item features I (may be sparse)
● Pre-processing: use PCA to reduce the dimension of I
● Solve: $\min_{U} \| R - U I \|^2$
● Solution is Least Squares: $U = R I^\top (I I^\top)^{-1}$

Never compute the inverse!
(1) Use numpy: numpy.linalg.solve(I @ I.T, I @ R.T)
(2) Use Cholesky decomposition: (I @ I.T) is a positive definite matrix!
Model Training – Basic linear models
Content Based - Regularized Least Squares model
● Goal: avoid overfitting
● Method: Tikhonov Regularization (a.k.a. Ridge Regression)
● Solve: $\min_{U} \| R - U I \|^2 + \lambda \| U \|^2$
● Solution is Regularized Least Squares: $U = R I^\top (I I^\top + \lambda \mathbb{1})^{-1}$, with $\mathbb{1}$ the identity matrix
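A minimal NumPy sketch of both solutions on toy data (shapes and the λ value are illustrative):

import numpy as np

rng = np.random.default_rng(0)
R = rng.integers(1, 6, size=(20, 8)).astype(float)   # toy ratings: 20 users x 8 items
I = rng.normal(size=(4, 8))                          # toy item features: 4 features x 8 items

# Least Squares: solve (I I^T) U^T = I R^T instead of inverting
U_ls = np.linalg.solve(I @ I.T, I @ R.T).T

# Regularized Least Squares (Tikhonov / ridge) with lambda = 0.1
lam = 0.1
U_rls = np.linalg.solve(I @ I.T + lam * np.eye(I.shape[0]), I @ R.T).T

print(np.abs(R - U_rls @ I).mean())   # reconstruction error on the toy data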
Matrix factorization
Model Training – Matrix Factorization
Matrix Factorization
● If we don't have I, we need to use Matrix Factorization techniques to find a linear solution to our problem.
● Now we want to solve the following optimization problem:

$\min_{U, I} \| R - U I \|^2 + \lambda \left( \| U \|^2 + \| I \|^2 \right)$

Solutions:
● Analytical: SVD
● Numerical: ALS, SGD
Model Training – Matrix Factorization
Matrix Factorization - Graphical interpretation
The sparse m×n ratings matrix R is approximated by the product of two thin dense factors: U (m×k, user embeddings) times I (k×n, item embeddings), i.e. $R \approx U I$.
Model Training – Matrix Factorization
Analytical solution - Singular Value Decomposition (SVD)
● Optimal Solution
● Closed Form, readily available in scikit-learn
● O(n^3) algorithm, does not scale
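For instance, a sketch using scikit-learn's TruncatedSVD on the `ratings` csr_matrix from the baseline slides (note it treats missing entries as zeros, a known caveat of applying plain SVD to ratings):

from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=2, random_state=0)
U = svd.fit_transform(ratings)     # (m, k) user embeddings
I = svd.components_                # (k, n) item embeddings
R_hat = U @ I                      # dense low-rank reconstruction of R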
Model Training – Matrix Factorization
Numerical solution - Alternating Least Square (ALS)
Initialize: U and I with small random values
Iterate: fix I and solve the least-squares problem for U, then fix U and solve for I
● Solving least squares is easy
● Scales to big datasets
● Distributed implementations are available (e.g. on Spark)
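A minimal dense NumPy sketch of ALS (it treats R as fully observed, which is a simplification; production ALS solves per-row over the observed entries only):

import numpy as np

def als(R, k=2, lam=0.1, n_iters=20, seed=0):
    """Alternate two regularized least-squares solves until convergence."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    U = rng.normal(scale=0.1, size=(m, k))
    I = rng.normal(scale=0.1, size=(k, n))
    for _ in range(n_iters):
        # Fix I, solve for U:  (I I^T + lam*Id) U^T = I R^T
        U = np.linalg.solve(I @ I.T + lam * np.eye(k), I @ R.T).T
        # Fix U, solve for I:  (U^T U + lam*Id) I = U^T R
        I = np.linalg.solve(U.T @ U + lam * np.eye(k), U.T @ R)
    return U, I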
Model Training – Matrix Factorization
Numerical solution - Stochastic Gradient Descent (SGD)
We are using SGD -> one sample at a time. For each observed rating $r_{ui}$, compute the error $e_{ui} = r_{ui} - U_u \cdot I_i$ and update:

$U_u \leftarrow U_u + \alpha (e_{ui} I_i - \lambda U_u)$
$I_i \leftarrow I_i + \alpha (e_{ui} U_u - \lambda I_i)$

(The original slides show the resulting training curve over 100 epochs.)
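A minimal NumPy sketch of these updates, looping over the observed ratings one sample at a time (hyper-parameters are illustrative):

import numpy as np

def sgd_mf(rows, cols, data, shape, k=2, lr=0.01, lam=0.1, n_epochs=100, seed=0):
    """SGD matrix factorization over the observed ratings only."""
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(shape[0], k))
    I = rng.normal(scale=0.1, size=(k, shape[1]))
    for _ in range(n_epochs):
        for s in rng.permutation(data.size):     # one sample at a time
            u, i, r = rows[s], cols[s], data[s]
            e = r - U[u] @ I[:, i]               # prediction error for this sample
            U[u] += lr * (e * I[:, i] - lam * U[u])
            I[:, i] += lr * (e * U[u] - lam * I[:, i])  # uses the fresh U[u], a common simplification
    return U, I

# Reusing the toy arrays from the baseline slides:
# U, I = sgd_mf(rows, cols, data, shape=(6, 6))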
Non-linear models
Model Training – Non-linear models
Simple Deep Learning model for collaborative filtering
Model Training – Complex Deep Learning models
More complex Deep Learning model for collaborative filtering
Model Training – Complex Deep Learning models
Training with Deep Learning
● Use a Deep Learning framework (e.g. PyTorch, TensorFlow)
● ...or at least an analytical-gradient library (e.g. Theano, Chainer)
● Acceleration heuristics (e.g. AdaGrad, Nesterov, RMSProp, Adam, NAdam)
● DropOut / BatchNorm
● Watch out for sparse momentum updates! Most Deep Learning frameworks don't support them
● Hyper-parameter optimization and architecture search (e.g. Gaussian Processes)
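A minimal PyTorch sketch of the simple embedding model (the dot-product-plus-biases architecture is an assumption, since the original slides show it only as diagrams):

import torch
import torch.nn as nn

class MFNet(nn.Module):
    """Learned user and item embeddings combined by a dot product plus biases."""
    def __init__(self, n_users, n_items, k=16):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, k)
        self.item_emb = nn.Embedding(n_items, k)
        self.user_bias = nn.Embedding(n_users, 1)
        self.item_bias = nn.Embedding(n_items, 1)

    def forward(self, users, items):
        dot = (self.user_emb(users) * self.item_emb(items)).sum(dim=1)
        return dot + self.user_bias(users).squeeze(1) + self.item_bias(items).squeeze(1)

model = MFNet(n_users=6, n_items=6)
opt = torch.optim.Adam(model.parameters(), lr=0.01)
users = torch.tensor([0, 0, 1])
items = torch.tensor([0, 1, 3])
ratings = torch.tensor([2., 5., 1.])
for _ in range(100):                  # minimize MSE on the observed ratings
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(users, items), ratings)
    loss.backward()
    opt.step()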
Conclusions
Model Training – Conclusions
Conclusions

                 | Global Avg | User Avg | Item-Item      | Linear      | Linear + Reg    | Matrix Fact    | Deep Learning
Domains          | Baseline   | Baseline | users >> items | Known "I"   | Known "I"       | Unknown "I"    | Extra datasets
Model Complexity | Trivial    | Trivial  | Simple         | Linear      | Linear          | Linear         | Non-linear
Time Complexity  | +          | +        | +++            | ++++        | ++++            | ++++           | ++
Overfit/Underfit | Underfit   | Underfit | May Underfit   | May Overfit | May Perform Bad | May Overfit    | Can Overfit
Hyper-Params     | 0          | 0        | 0              | 1           | 2               | 2–3            | many
Implementation   | Numpy      | Numpy    | Numpy          | Numpy       | Numpy           | LightFM, Spark | NNet libraries
Model Training – Conclusions
Take home
● Always start with the simplest, stupidest models
● Spend time on simple interpretable models to debug your codebase and clean your data
● Gradually increase the complexity of your models
● Add more regularization as soon as a complex model performs worse than a simpler model
Questions
Thank YOU!
