Recommender Systems from A to Z
Part 1: The Right Dataset
Part 2: Model Training
Part 3: Model Evaluation
Part 4: Real-Time Deployment
1. Introduction
Train/Valid split, underfitting and overfitting
Learning Curve in recommendation engines
2. Evaluation functions
Basic metrics for recommender engines (Precision, Recall, TPR, TNR...)
Regression, Classification, Ranking metrics
3. Loss functions
Optimization problems and loss function properties
Regression, Classification, Ranking losses
4. Practical recommendations
Regularization, HP optimization, embedding evaluation.
Introduction
Previous Meetup Recap: Recommendation Engine Types
Recommendation
engine
Content-based
Collaborative-filtering
Hybrid engine
Memory-based
Model-based
Item-Item
User-User
User-Item
Model         | When? / Problem definition | Solution strategies
Content-based | Item cold start            | Least Squares, Deep Learning
Item-Item     | n_users >> n_items         | Affinity Matrix
User-User     | n_users << n_items         | KNN, Affinity Matrix
User-Item     | Better performance         | Matrix Factorization, Deep Learning
Previous Meetup Recap: Recommendation Engine Models
                 | Global Avg | User Avg | Item-Item      | Linear      | Linear + Reg    | Matrix Fact    | Deep Learning
Domains          | Baseline   | Baseline | users >> items | Known “I”   | Known “I”       | Unknown “I”    | Extra datasets
Model Complexity | Trivial    | Trivial  | Simple         | Linear      | Linear          | Linear         | Non-linear
Time Complexity  | +          | +        | +++            | ++++        | ++++            | ++++           | ++
Overfit/Underfit | Underfit   | Underfit | May Underfit   | May Overfit | May Perform Bad | May Overfit    | Can Overfit
Hyper-Params     | 0          | 0        | 0              | 1           | 2               | 2–3            | many
Implementation   | Numpy      | Numpy    | Numpy          | Numpy       | Numpy           | LightFM, Spark | NNet libraries
Optimization Problem – Matrix Factorization Example
R̂ = U · Iᵀ (or R)
Optimization Problem – Matrix Factorization Example
Optimization problem (definitions)
Sparse matrix of ratings with
m users and n items
Dense matrix of users embeddings
Dense matrix of items embeddings
Optimization Problem – Matrix Factorization Example
Optimization problem (definitions)
Ratings of User #1
Embedding of User #1
Embedding of Item #1
Sparse matrix of ratings with
m users and n items
Dense matrix of users embeddings
Dense matrix of items embeddings
Ratings of User #m
To Item #n
Optimization Problem – Matrix Factorization Example
Optimization problem (definitions)
AVAILABLE DATASET: only the sparse rating matrix R is observed; U and I are the unknowns (“?”) to be learned
Sparse matrix of ratings with
m users and n items
Dense matrix of users embeddings
Dense matrix of items embeddings
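To make the shapes concrete, here is a minimal numpy sketch of how the three matrices relate; the sizes m, n and the embedding dimension k are made up for illustration and are not from the original slides:

```python
import numpy as np

m, n, k = 1000, 500, 32           # users, items, embedding size (illustrative values)
U = np.random.randn(m, k)         # dense matrix of user embeddings (m x k)
I = np.random.randn(n, k)         # dense matrix of item embeddings (n x k)

R_hat = U @ I.T                   # dense reconstruction of the m x n rating matrix

# Only a sparse subset of R is actually observed; a single prediction is a dot product:
predicted_rating = U[0] @ I[10]   # predicted rating of user 0 for item 10
```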
Optimization Problem – Matrix Factorization Example
Our goal is to find U and I such that the difference between each datapoint in R and the product of the corresponding user and item embeddings is minimal.
R̂ = U · Iᵀ (or R)
Optimization Problem – Matrix Factorization Example
Our goal is to find U and I such that the difference between each datapoint in R and the product of the corresponding user and item embeddings is minimal.
R̂ = U · Iᵀ (or R)
1. What type of data do we have?
2. What properties are we looking for in our outputs?
- Exact rating vs like/dislike vs ranking predictions
3. How are we going to solve the problem?
Ask the Right Questions
(1) What type of data do we have?
(2) What properties are we looking for in our outputs?
(3) How are we going to solve the problem?
(4) Which hyper-parameters of my model are the best?
(5) Which model is the best?
Business decisions
Technical decisions
Ask the Right Questions
(1) What type of data do we have?
(2) What properties are we looking for in our outputs?
(3) How are we going to solve the problem?
(4) Which hyper-parameters of my model are the best?
(5) Which model is the best?
EVALUATION FUNCTIONS
LOSS FUNCTIONS
RANDOM SEARCH, GP
COMPARE METRICS
ML FOR RECOMMENDATION
Business decisions
Technical decisions
Objective Types (from a data point of view)
Classification
● click/no-click
● like/dislike/missing
● estimated probability of like (e.g. watch time)
Regression
● absolute rating (e.g. from 1/5 to 5/5)
● number of interactions
Ranking
● estimated order of preference (e.g. watch time)
● pairwise comparisons
Unsupervised
● clustering of items
● clustering of users
Choosing the Right Objective (from business point of view)
Absolute Predictions vs Relative Predictions
Does only the order of the predictions matter?
Sensitivity vs Specificity
Is a false positive worse than a false negative?
Skewness
Is misclassifying an all-star favorite worse than misclassifying a casual like?
Choosing the Right Objective (from business point of view)
Absolute Predictions vs Relative Predictions
Does only the order of the predictions matter?
Sensitivity vs Specificity
Is a false positive worse than a false negative?
Skewness
Is misclassifying an all-star favorite worse than misclassifying a casual like?
LOSS FUNCTION THAT PENALIZES ERRORS ON ALL-STAR RATINGS MORE HEAVILY
RANKING LOSS FUNCTION
CLASSIFICATION LOSS FUNCTION
Cross Validation – In Traditional Machine Learning
[Figure: 4-fold cross-validation; the data is split into folds 1–4 and each fold takes one turn as the validation set: (1 2 3 4), (4 1 2 3), (3 4 1 2), (2 3 4 1)]
Cross Validation – In Recommendation Engines
Dataset
Cross Validation – In Recommendation Engines
Split such that every user is present in both train and valid
Stronger: split such that every user's interactions are divided 80/20 between train and valid
Dataset
Underfitting and Overfitting
Underfitting and Overfitting
[Figure: three models of increasing complexity evaluated on new samples; left: the model fails to learn relations in the data (underfit), middle: the model is a good fit for the data, right: the model fails to generalize (overfit)]
Underfitting and Overfitting
[Figure: training (sample) vs validation error as model complexity increases, annotated with the underfitting and overfitting regions]
Underfitting and Overfitting
[Figure: loss function or metric plotted against the training epoch]
Mini-Batch Gradient Descent
for epoch in n_epochs:
● shuffle the batches
● for batch in n_batches:
○ compute the predictions for the batch
○ compute the error for the batch
○ compute the gradient for the batch
○ update the parameters of the model
● plot error vs epoch
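The pseudo-code above can be fleshed out into a runnable sketch. This is only an illustration (not the presenter's exact training loop), assuming the observed ratings come as (user, item, rating) triplets and using plain numpy SGD on a squared-error loss:

```python
import numpy as np

def train_mf(triplets, n_users, n_items, k=16, lr=0.01, n_epochs=20, batch_size=256):
    """Mini-batch SGD for matrix factorization on a squared-error loss (illustrative)."""
    rng = np.random.default_rng(0)
    U = 0.1 * rng.standard_normal((n_users, k))
    I = 0.1 * rng.standard_normal((n_items, k))
    data = np.asarray(triplets, dtype=float)
    history = []
    for epoch in range(n_epochs):
        rng.shuffle(data)                                   # shuffle the batches
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            u, i, r = batch[:, 0].astype(int), batch[:, 1].astype(int), batch[:, 2]
            pred = np.sum(U[u] * I[i], axis=1)              # predictions for the batch
            err = pred - r                                  # error for the batch
            grad_U = err[:, None] * I[i]                    # gradient for the batch
            grad_I = err[:, None] * U[u]
            np.add.at(U, u, -lr * grad_U)                   # update the parameters
            np.add.at(I, i, -lr * grad_I)
        u, i, r = data[:, 0].astype(int), data[:, 1].astype(int), data[:, 2]
        history.append(np.mean((np.sum(U[u] * I[i], axis=1) - r) ** 2))
    return U, I, history                                    # plot history: error vs epoch
```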
Underfitting and Overfitting
A very simple way of checking underfitting
[Figure: distribution of the ground-truth Y vs the predicted Y; the model is always predicting (almost) the same value, which signals underfitting]
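A minimal sketch of this check (illustrative, not from the slides): if the predictions vary far less than the ground truth does, the model is essentially predicting a constant.

```python
import numpy as np

def looks_underfit(y_true, y_pred, ratio=0.1):
    """Flag a model whose predictions barely vary compared with the ground truth."""
    return np.std(y_pred) < ratio * np.std(y_true)
```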
Evaluation Functions
What do we want to evaluate?
Classification
● True Positive Rate (TPR)
● True Negative Rate (TNR)
● Precision
● F-measure
Regression
● Mean Square Error (MSE)
Ranking
● Recall@K
● Precision@K
● CG, DCG, nDCG
Ranking/Classification metrics
● AUC
Some common evaluation functions
Regression
Mean Square Error (MSE)
● Easy to compute
● Linear gradient
● Can also be used as loss function
Mean Absolute Error (MAE)
● Easy to compute
● Easy to interpret
● Discontinuous gradient
● Not well suited as a loss function (non-smooth at zero)
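Both metrics are one-liners over the observed ratings; a minimal numpy sketch:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Square Error over the observed (user, item) pairs."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def mae(y_true, y_pred):
    """Mean Absolute Error over the observed (user, item) pairs."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))
```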
Classification – Precision vs Recall
TS = Toy Story
KP = Kung Fu Panda
TD = How to train your dragon
A = Annabelle
[Figure: the user's liked items (TS1–TS4, KP1–KP3), disliked items (A1, A2), and each model's recommendations]
Model 1 recommendations: TS1, TS2, TS3, KP1, KP2
Model 2 recommendations: TS1, TS2, TS3, TS4, KP1, KP2, KP3, A1, A2
Classification – Precision vs Recall
TS = Toy Story
KP = Kung Fu Panda
TD = How to train your dragon
A = Annabelle
Model 1: Precision = 5/5 = 1, Recall = 5/7
Model 2: Precision = 7/9, Recall = 7/7 = 1
Classification 1/2
True Positive Rate (a.k.a TPR, Recall, Sensitivity)
● Easy to understand
● Useful for likes/dislikes datasets
● Measure of global bias of a model
● 0 <= TPR <=1 (higher is better)
True Negative Rate (a.k.a TNR, Selectivity, Specificity)
● Easy to understand
● Useful for likes/dislikes datasets
● Measure of global bias of a model
● 0 <= TNR <=1 (higher is better)
Classification 2/2
Precision
● Easy to understand
● Useful for likes/dislikes datasets
● Measures the quality of the recommendations
● 0 <= Precision <=1 (higher is better)
F-measure
● Balance precision and recall
● Not good for recommendation, because it doesn’t take True Negatives into account
● 0 <= F-measure <= 1 (higher is better)
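For reference, the standard definitions behind these four metrics, written in terms of the like/dislike confusion matrix (TP, TN, FP, FN):

```latex
\mathrm{TPR} = \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
\mathrm{TNR} = \frac{TN}{TN + FP}, \qquad
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```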
Ranking 1/3
Recall@K
● Count the positive items among the top K items predicted for each user
● Divide that number by the user's total number of positive items
● A perfect score of 1 is achievable if the user has K or fewer positive items and they all appear in the predicted top K
● Independent of the exact values of the predictions, only their relative rank matters
Movie           | True rating | Predicted score
Toy Story 1     | 1.0         | 0.9
Toy Story 2     | 0.9         | 0.7
Kung Fu Panda 1 | 0.7         | 0.1
Kung Fu Panda 2 | 0.6         | -0.1
Annabelle 1     | -0.2        | 0.4
K = 3
TOP K = ?
TOP K Positive = ?
Total Positive = ?
Recall@K = ?
Ranking 1/3
Recall@K
● Count the positive items among the top K items predicted for each user
● Divide that number by the user's total number of positive items
● A perfect score of 1 is achievable if the user has K or fewer positive items and they all appear in the predicted top K
● Independent of the exact values of the predictions, only their relative rank matters
Movie           | True rating | Predicted score
Toy Story 1     | 1.0         | 0.9
Toy Story 2     | 0.9         | 0.7
Kung Fu Panda 1 | 0.7         | 0.1
Kung Fu Panda 2 | 0.6         | -0.1
Annabelle 1     | -0.2        | 0.4
K = 3
TOP K = {TS1, TS2, A1}
TOP K Positive = {TS1, TS2} = 2
Total Positive = 4
Recall@K = 2 / 4
Ranking 1/3
Recall@K
● In math terms: Recall@K = |{top-K predicted items} ∩ {positive items}| / |{positive items}|
Ranking 2/3
Precision@K
● Count the positive items among the top K items predicted for each user
● Divide that number by K for each user
● A perfect score of 1 is achievable if the user has K or more positive items and the top K contains only positives
● Independent of the exact values of the predictions, only their relative rank matters
Movie           | True rating | Predicted score
Toy Story 1     | 1.0         | 0.9
Toy Story 2     | 0.9         | 0.7
Kung Fu Panda 1 | 0.7         | 0.1
Kung Fu Panda 2 | 0.6         | -0.1
Annabelle 1     | -0.2        | 0.4
K = 3
TOP K = ?
TOP K Positive = ?
Precision@K = ?
Ranking 2/3
Precision@K
● Count the positive items among the top K items predicted for each user
● Divide that number by K for each user
● A perfect score of 1 is achievable if the user has K or more positive items and the top K contains only positives
● Independent of the exact values of the predictions, only their relative rank matters
Movie           | True rating | Predicted score
Toy Story 1     | 1.0         | 0.9
Toy Story 2     | 0.9         | 0.7
Kung Fu Panda 1 | 0.7         | 0.1
Kung Fu Panda 2 | 0.6         | -0.1
Annabelle 1     | -0.2        | 0.4
K = 3
TOP K = {TS1, TS2, A1}
TOP K Positive = {TS1, TS2} = 2
Precision@K = 2 / 3
Ranking 2/3
Precision@K
● In math terms: Precision@K = |{top-K predicted items} ∩ {positive items}| / K
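A minimal Python sketch of both metrics for a single user; `scores` (the model's predicted scores) and `positives` (a boolean mask of the user's positive items) are illustrative names, reproducing the Toy Story example above:

```python
import numpy as np

def recall_precision_at_k(scores, positives, k):
    """Recall@K and Precision@K for one user; only the rank of the scores matters."""
    scores = np.asarray(scores)
    positives = np.asarray(positives, dtype=bool)
    top_k = np.argsort(-scores)[:k]          # indices of the K highest-scored items
    hits = positives[top_k].sum()            # positives that made it into the top K
    recall = hits / max(positives.sum(), 1)
    precision = hits / k
    return recall, precision

# Slide example: TS1, TS2, KP1, KP2 are positives; A1 is a negative
scores    = [0.9, 0.7, 0.1, -0.1, 0.4]       # predicted scores
positives = [True, True, True, True, False]  # positives according to the true ratings
print(recall_precision_at_k(scores, positives, k=3))  # -> (0.5, 0.666...)
```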
Ranking 3/3
CG, DCG, and nDCG
● CG: Sum the true ratings of the Top K items predicted for each user
● DCG: Weight by position in Top K; nDCG: Normalize in [0, 1]
● A perfect nDCG is 1 if the ranking of the prediction is the same as the ranking of the true ratings
● The bigger the score the better
Movie           | True rating | Predicted score
Toy Story 1     | 1           | 0.9
Toy Story 2     | 0.9         | 0.7
Kung Fu Panda 1 | 0.7         | 0.1
Kung Fu Panda 2 | 0.6         | -0.1
Annabelle 1     | -0.2        | 0.4
K = 3
TOP K = ?
CG = ?
DCG = ?
Ranking 3/3
CG, DCG, and nDCG
● CG: Sum the true ratings of the Top K items predicted for each user
● DCG: Weight by position in Top K; nDCG: Normalize in [0, 1]
● A perfect nDCG is 1 if the ranking of the prediction is the same as the ranking of the true ratings
● The bigger the score the better
Movie           | True rating | Predicted score
Toy Story 1     | 1           | 0.9
Toy Story 2     | 0.9         | 0.7
Kung Fu Panda 1 | 0.7         | 0.1
Kung Fu Panda 2 | 0.6         | -0.1
Annabelle 1     | -0.2        | 0.4
K = 3
TOP K = {TS1, TS2, A1}
CG = 1.0 + 0.9 - 0.2 = 1.7
DCG = 1/1 + 0.9/2 - 0.2/3 ≈ 1.38
Ranking 3/3
CG, DCG, and nDCG
● CG: Sum the true ratings of the Top K items predicted for each user
● DCG: Weight by position in Top K; nDCG: Normalize in [0, 1]
● A perfect nDCG is 1 if the ranking of the prediction is the same as the ranking of the true ratings
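A minimal sketch of CG, DCG and nDCG using the 1/position discount shown on the slides (note that many libraries use a log2 discount instead); `true_ratings` and `scores` are illustrative names:

```python
import numpy as np

def cg_dcg_ndcg(true_ratings, scores, k):
    """CG, DCG and nDCG of the top K items ranked by predicted score (illustrative)."""
    true_ratings = np.asarray(true_ratings, dtype=float)
    order = np.argsort(-np.asarray(scores))[:k]     # top K items by predicted score
    discounts = 1.0 / np.arange(1, len(order) + 1)  # 1/position discount, as on the slide
    cg = true_ratings[order].sum()
    dcg = (true_ratings[order] * discounts).sum()
    ideal = np.sort(true_ratings)[::-1][:k]         # best possible ordering of the true ratings
    idcg = (ideal * discounts).sum()
    return cg, dcg, dcg / idcg                      # nDCG = DCG normalized by the ideal DCG

# Slide example: CG = 1.7, DCG ≈ 1.38, nDCG ≈ 0.82
print(cg_dcg_ndcg([1.0, 0.9, 0.7, 0.6, -0.2], [0.9, 0.7, 0.1, -0.1, 0.4], k=3))
```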
Hybrid Ranking/Classification
AUC
● Vary positive prediction threshold (not just 0)
● Compute TPR and FPR for all possible positive thresholds
● Build Receiver Operating Characteristic (ROC) curve
● Integrate Area Under the ROC Curve (AUC)
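In practice the ROC curve and its area are rarely computed by hand; a minimal sketch with scikit-learn, assuming binary like/dislike labels and arbitrary real-valued model scores:

```python
from sklearn.metrics import roc_auc_score

y_true  = [1, 1, 1, 1, 0]             # like / dislike labels
y_score = [0.9, 0.7, 0.1, -0.1, 0.4]  # model scores; only their ordering matters

print(roc_auc_score(y_true, y_score))  # area under the ROC curve
```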
Loss functions
Loss Functions vs Evaluation Functions
Evaluation Metrics
● Expensive to evaluate
● Often not smooth
● Often not even differentiable
Loss Functions
● Smooth approximations of your evaluation metric
● Well suited for SGD
Loss Functions: How are we going to solve the problem?
Classification loss
● Logistic
● Cross Entropy
● Kullback-Leibler Divergence
Regression loss
● Mean Square Error (MSE)
Ranking loss
● WARP
● BPR
Some common loss functions
Optimization Problems – Basic Formulation with RMSE
Goal: find U and I s.t. the difference between each datapoint in R and the product of the corresponding user and item embeddings is minimal:
$\mathrm{RMSE}(U, I) = \sqrt{\tfrac{1}{|\Omega|} \sum_{(u,i) \in \Omega} \left(R_{ui} - U_u \cdot I_i\right)^2}$, where Ω is the set of observed ratings and R̂ = U · Iᵀ (or R)
Optimization Problems – General Formulation
Goal: find U and I s.t. the loss function J is minimized.
$\min_{U, I} \; J(R, \hat{R})$ with $\hat{R} = U \cdot I^\top$ (or R)
Convex vs Non-Convex Optimization
Convex Non-convex
Convex Optimization
Non-Convex Optimization
Loss Functions – Regression
Mean Square Error
● Typically used loss function for regression: it is smooth and easy to understand.
Regularized Mean Square Error
● Mean square error plus regularization to avoid overfitting.
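The exact formulas on the original slide were images; a common written form of the two objectives, with Frobenius-norm regularization weighted by λ, is:

```latex
J_{\mathrm{MSE}}(U, I) = \sum_{(u,i)\ \mathrm{observed}} \left(R_{ui} - U_u \cdot I_i\right)^2,
\qquad
J_{\mathrm{reg}}(U, I) = J_{\mathrm{MSE}}(U, I) + \lambda \left(\lVert U \rVert_F^2 + \lVert I \rVert_F^2\right)
```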
Loss Functions – Classification
Logistic
● Typically used loss function for classification. Smooth gradient around zero, steep for large errors.
Loss Functions – Classification
Logistic
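The formula on this slide was an image; the standard logistic loss for a label y ∈ {-1, +1} and a model score s is:

```latex
\ell_{\mathrm{logistic}}(y, s) = \log\left(1 + e^{-y s}\right)
```

This is equivalent to the cross-entropy between the true label and the sigmoid of the score (see the appendix slide at the end).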
Loss Functions – Ranking
Weighted Approximate-Rank Pairwise (WARP)
● Approximates DCG-like evaluation metrics
● Smooth and tractable computation
Bayesian Personalised Ranking (BPR)
● Approximates AUC
● Smooth and tractable computation
● Requires binary comparisons (good for binary comparison feedback)
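Both losses are available out of the box in LightFM (already mentioned in the model recap table); a minimal usage sketch with a toy positive-feedback matrix:

```python
import numpy as np
from scipy.sparse import coo_matrix
from lightfm import LightFM

# toy positive-only interaction matrix: 3 users x 4 items (illustrative)
interactions = coo_matrix(np.array([[1, 0, 1, 0],
                                    [0, 1, 0, 0],
                                    [1, 1, 0, 1]]))

# loss="warp" targets the top of the list (precision@k / DCG-like), loss="bpr" targets AUC
model = LightFM(no_components=8, loss="warp")
model.fit(interactions, epochs=10)

# score every item for user 0
print(model.predict(np.array([0, 0, 0, 0]), np.array([0, 1, 2, 3])))
```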
Practical Recommendations
Practical Recommendations
(1) Always compute baseline metrics
(2) Always analyze underfitting vs overfitting
(3) Always do hyperparameter optimization
(4) Always compute multiple metrics for your models
(5) Always analyze the clustering properties of the items/users
(6) Always ask for feedback from end users
COMPARING WITH GLOBAL MODELS IS EASY
IF OVERFITTING, USE REGULARIZATION
GRID SEARCH OR GAUSSIAN PROCESS
TPR, TNR, PRECISION, ETC.
ITEM/ITEM SIMILARITIES
EVERYTHING IS ABOUT USER TASTE
(1) Always compute baseline metrics
                 | Global Avg | User Avg | Item-Item      | Linear      | Linear + Reg    | Matrix Fact    | Deep Learning
Domains          | Baseline   | Baseline | users >> items | Known “I”   | Known “I”       | Unknown “I”    | Extra datasets
Model Complexity | Trivial    | Trivial  | Simple         | Linear      | Linear          | Linear         | Non-linear
Time Complexity  | +          | +        | +++            | ++++        | ++++            | ++++           | ++
Overfit/Underfit | Underfit   | Underfit | May Underfit   | May Overfit | May Perform Bad | May Overfit    | Can Overfit
Hyper-Params     | 0          | 0        | 0              | 1           | 2               | 2–3            | many
Implementation   | Numpy      | Numpy    | Numpy          | Numpy       | Numpy           | LightFM, Spark | NNet libraries
(2) Always analyze underfitting vs overfitting
Model-based
● Dropout
● Bagging
Loss-based regularization
● ℓ1 norm: best convex approximation of the sparsity-inducing ℓ0 norm
● ℓ2 norm: very smooth, easy to optimize
Data Augmentation
● Negative Sampling
(3) Always do hyperparameter optimization
Grid Search
Brute force over all the combinations of the parameters
Exponential cost: with 20 hyper-parameters and only 10 values each, you need 10^20 complete runs
Random Search
Uniformly sample combinations of the parameters
Very easy to implement, very useful in practice
Gaussian Process Optimization
Meta-learning of the validation error given hyper-parameters
Solves the exploration/exploitation tradeoff
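A minimal sketch of random search (the middle option above), assuming a hypothetical `fit_and_score(params)` function that trains a model with the given hyper-parameters and returns a validation metric to maximize:

```python
import random

def random_search(fit_and_score, n_trials=50, seed=0):
    """Uniformly sample hyper-parameter combinations and keep the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {                                  # one uniformly sampled combination
            "n_components": rng.choice([8, 16, 32, 64]),
            "learning_rate": 10 ** rng.uniform(-4, -1),
            "regularization": 10 ** rng.uniform(-6, -2),
        }
        score = fit_and_score(params)               # e.g. validation precision@k
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```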
(3) Always do hyperparameter optimization
[Figure: hyper-parameter search results for a metric to minimize and for a metric to maximize]
(4) Always compute multiple metrics for your models
(5) Always analyze the clustering properties of the items/users
Item embeddings
● In general, we combine item embeddings with: FEATURES | IMAGE EMBS | NLP EMBS
● After getting the embeddings, we always compute Top-K similarities on well-known items
● We use the item embeddings to create clusters and analyze how good they are
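A minimal sketch of the Top-K similarity sanity check on item embeddings, using cosine similarity with plain numpy (function and variable names are illustrative):

```python
import numpy as np

def top_k_similar(item_embeddings, item_idx, k=5):
    """Return the k items whose embeddings are closest (cosine) to item_idx."""
    E = item_embeddings / np.linalg.norm(item_embeddings, axis=1, keepdims=True)
    sims = E @ E[item_idx]                       # cosine similarity to the query item
    neighbours = np.argsort(-sims)               # most similar first
    return [int(j) for j in neighbours if j != item_idx][:k]
```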
(6) Always ask end users for feedback
RECOMMENDATION IS ALL ABOUT USERS’ TASTE
ASK THEM FOR FEEDBACK!!
Conclusions
Losses and metrics summary table
Name Category loss eval batch-SGD support implicit Comments
MSE Regr ✓ ✓ ✓ ✓ linear gradient
MAE Regr ✓ ✓ easy to interpret
Logistic / XE / KL Classif ✓ ✓ ✓ ✓ flexible truth
Exponential Classif ✓ ✓ exploding gradient
Recall (global) Classif ✓ ✓ ✓ requires negative
Precision (global) Classif ✓ ✓ ✓ requires negative
F-measure (global) Classif ✓ ✓ ✓ requires negative
MRR Ranking ✓ considers only 1 item
nDCG Ranking ✓ requires rank
WARP Ranking ✓ for nDCG, p@k, r@k
AUC Hybrid ✓ ✓ ✓ requires negative
BPR Hybrid ✓ ✓ for AUC
Recall@k Hybrid ✓ requires ≤k positives
Precision@k Hybrid ✓ requires ≥k positives
Questions
Thank YOU!
Negative Sampling
Problem
● Unary feedback: with positive-only data, the best-fitting model is one that always predicts “1” for every user and item.
● In general:
○ your model is used in real life to predict (user, item) pairs outside the sparse dataset.
○ you can’t train on the full (#users x #items) dense matrix.
Negative Sampling Solution
● unary → binary (e.g. click/missing); binary → ternary (e.g. like/dislike/missing)
● the sampling strategy matters a lot (i.e. how to split train and valid)
● how many negative samples you draw matters a lot
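A minimal sketch of negative sampling for unary (click/missing) feedback, where each observed positive gets a configurable number of unobserved (user, item) pairs labelled 0; the `negatives_per_positive` ratio is one of the knobs the slide says matters a lot (all names are illustrative):

```python
import numpy as np

def add_negative_samples(positives, n_items, negatives_per_positive=4, seed=0):
    """positives: iterable of (user, item) pairs with implicit positive feedback."""
    rng = np.random.default_rng(seed)
    observed = set(map(tuple, positives))
    samples = [(u, i, 1) for u, i in positives]          # keep the positives, labelled 1
    for u, _ in positives:
        drawn = 0
        while drawn < negatives_per_positive:
            j = int(rng.integers(n_items))
            if (u, j) not in observed:                   # only unobserved pairs become negatives
                samples.append((u, j, 0))
                drawn += 1
    return np.array(samples)
```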
Negative Sampling
Split the negative feedback between train and valid in the same proportion
Underfitting and Overfitting – Take Home
(1) For cross-validation, split the data such that (almost) every user appears in both training and validation
(2) Use negative sampling to avoid overfitting in your models
(3) Always use learning curves to get more insights about underfitting vs overfitting
(4) Compute mean and variance of your predictions to get insights about underfitting vs overfitting
Loss Functions – Classification
Logistic
● Equivalent to the cross-entropy between the truth and the predicted probability (for a 2-class model)
● Equivalent to the Kullback-Leibler divergence between the truth and the predicted probability
● Often used for deep-learning based recommendation engines
● Smooth gradient around zero and steep for large errors