Recommender Systems from A to Z
Part 1: The Right Dataset
Part 2: Model Training
Part 3: Model Evaluation
Part 4: Real-Time Deployment
1. Introduction
Train/Valid split, underfitting and overfitting
Learning Curve in recommendation engines
2. Evaluation functions
Basic metrics for recommender engines (Precision, Recall, TPR, TNR...)
Regression, Classification, Ranking metrics
3. Loss functions
Optimization problems and loss function properties
Regression, Classification, Ranking losses
4. Practical recommendations
Regularization, HP optimization, embedding evaluation.
Introduction
Previous Meetup Recap: Recommendation Engine Types
Recommendation
engine
Content-based
Collaborative-filtering
Hybrid engine
Memory-based
Model-based
Item-Item
User-User
User-Item
Model         | When? / Problem definition | Solution strategies
Content-based | Item cold start            | Least Squares, Deep Learning
Item-Item     | n_users >> n_items         | Affinity Matrix
User-User     | n_users << n_items         | KNN, Affinity Matrix
User-Item     | Better performance         | Matrix Factorization, Deep Learning
Previous Meetup Recap: Recommendation Engine Models
                 | Global Avg | User Avg | Item-Item      | Linear      | Linear + Reg    | Matrix Fact    | Deep Learning
Domains          | Baseline   | Baseline | users >> items | Known “I”   | Known “I”       | Unknown “I”    | Extra datasets
Model Complexity | Trivial    | Trivial  | Simple         | Linear      | Linear          | Linear         | Non-linear
Time Complexity  | +          | +        | +++            | ++++        | ++++            | ++++           | ++
Overfit/Underfit | Underfit   | Underfit | May Underfit   | May Overfit | May Perform Bad | May Overfit    | Can Overfit
Hyper-Params     | 0          | 0        | 0              | 1           | 2               | 2–3            | many
Implementation   | Numpy      | Numpy    | Numpy          | Numpy       | Numpy           | LightFM, Spark | NNet libraries
Optimization Problem – Matrix Factorization Example
R̂ = U · Iᵀ (or R)
Optimization Problem – Matrix Factorization Example
Optimization problem (definitions)
Sparse matrix of ratings with
m users and n items
Dense matrix of users embeddings
Dense matrix of items embeddings
Optimization Problem – Matrix Factorization Example
Optimization problem (definitions)
Ratings of User #1
Embedding of User #1
Embedding of Item #1
Sparse matrix of ratings with
m users and n items
Dense matrix of users embeddings
Dense matrix of items embeddings
Ratings of User #m
To Item #n
Optimization Problem – Matrix Factorization Example
Optimization problem (definitions)
AVAILABLE DATASET: only the sparse rating matrix R is observed; U and I are the unknowns (“?”) to be learned
Sparse matrix of ratings with
m users and n items
Dense matrix of users embeddings
Dense matrix of items embeddings
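To make the shapes concrete, here is a minimal numpy sketch of how the three matrices relate; the sizes m, n and the embedding dimension k are made up for illustration and are not from the original slides:

```python
import numpy as np

m, n, k = 1000, 500, 32           # users, items, embedding size (illustrative values)
U = np.random.randn(m, k)         # dense matrix of user embeddings (m x k)
I = np.random.randn(n, k)         # dense matrix of item embeddings (n x k)

R_hat = U @ I.T                   # dense reconstruction of the m x n rating matrix

# Only a sparse subset of R is actually observed; a single prediction is a dot product:
predicted_rating = U[0] @ I[10]   # predicted rating of user 0 for item 10
```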
Optimization Problem – Matrix Factorization Example
Our goal is to find U and I such that the difference between each datapoint in R and the product of the corresponding user and item embeddings is minimal.
R̂ = U · Iᵀ (or R)
Optimization Problem – Matrix Factorization Example
Our goal is to find U and I such that the difference between each datapoint in R and the product of the corresponding user and item embeddings is minimal.
R̂ = U · Iᵀ (or R)
1. What type of data do we have?
2. What properties are we looking for in our outputs?
- Exact rating vs like/dislike vs ranking predictions
3. How are we going to solve the problem?
Ask the Right Questions
(1) What type of data do we have?
(2) What properties are we looking for in our outputs?
(3) How are we going to solve the problem?
(4) Which hyper-parameters of my model are the best?
(5) Which model is the best?
Business decisions
Technical decisions
Ask the Right Questions
(1) What type of data do we have?
(2) What properties are we looking for in our outputs?
(3) How are we going to solve the problem?
(4) Which hyper-parameters of my model are the best?
(5) Which model is the best?
EVALUATION FUNCTIONS
LOSS FUNCTIONS
RANDOM SEARCH, GP
COMPARE METRICS
ML FOR RECOMMENDATION
Business decisions
Technical decisions
Objective Types (from a data point of view)
Classification
● click/no-click
● like/dislike/missing
● estimated probability of like (e.g. watch time)
Regression
● absolute rating (e.g. from 1/5 to 5/5)
● number of interactions
Ranking
● estimated order of preference (e.g. watch time)
● pairwise comparisons
Unsupervised
● clustering of items
● clustering of users
Choosing the Right Objective (from business point of view)
Absolute Predictions vs Relative Predictions
Does only the order of the predictions matter?
Sensitivity vs Specificity
Is a false positive worse than a false negative?
Skewness
Is misclassifying an all-star favorite worse than misclassifying a casual like?
Choosing the Right Objective (from business point of view)
Absolute Predictions vs Relative Predictions
Does only the order of the predictions matter?
Sensitivity vs Specificity
Is a false positive worse than a false negative?
Skewness
Is misclassifying an all-star favorite worse than misclassifying a casual like?
LOSS FUNCTION THAT PENALIZES ERRORS ON ALL-STAR RATINGS MORE HEAVILY
RANKING LOSS FUNCTION
CLASSIFICATION LOSS FUNCTION
Cross Validation – In Traditional Machine Learning
[Figure: 4-fold cross-validation; the data is split into folds 1–4 and each fold takes one turn as the validation set: (1 2 3 4), (4 1 2 3), (3 4 1 2), (2 3 4 1)]
Cross Validation – In Recommendation Engines
Dataset
Cross Validation – In Recommendation Engines
Split such that every user is present in both train and valid
Stronger: split such that every user's interactions are divided 80/20 between train and valid
Dataset
Underfitting and Overfitting
Underfitting and Overfitting
[Figure: three models of increasing complexity evaluated on new samples; left: the model fails to learn relations in the data (underfit), middle: the model is a good fit for the data, right: the model fails to generalize (overfit)]
Underfitting and Overfitting
[Figure: training (sample) vs validation error as model complexity increases, annotated with the underfitting and overfitting regions]
Underfitting and Overfitting
[Figure: loss function or metric plotted against the training epoch]
Mini-Batch Gradient Descent
for epoch in n_epochs:
● shuffle the batches
● for batch in n_batches:
○ compute the predictions for the batch
○ compute the error for the batch
○ compute the gradient for the batch
○ update the parameters of the model
● plot error vs epoch
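The pseudo-code above can be fleshed out into a runnable sketch. This is only an illustration (not the presenter's exact training loop), assuming the observed ratings come as (user, item, rating) triplets and using plain numpy SGD on a squared-error loss:

```python
import numpy as np

def train_mf(triplets, n_users, n_items, k=16, lr=0.01, n_epochs=20, batch_size=256):
    """Mini-batch SGD for matrix factorization on a squared-error loss (illustrative)."""
    rng = np.random.default_rng(0)
    U = 0.1 * rng.standard_normal((n_users, k))
    I = 0.1 * rng.standard_normal((n_items, k))
    data = np.asarray(triplets, dtype=float)
    history = []
    for epoch in range(n_epochs):
        rng.shuffle(data)                                   # shuffle the batches
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            u, i, r = batch[:, 0].astype(int), batch[:, 1].astype(int), batch[:, 2]
            pred = np.sum(U[u] * I[i], axis=1)              # predictions for the batch
            err = pred - r                                  # error for the batch
            grad_U = err[:, None] * I[i]                    # gradient for the batch
            grad_I = err[:, None] * U[u]
            np.add.at(U, u, -lr * grad_U)                   # update the parameters
            np.add.at(I, i, -lr * grad_I)
        u, i, r = data[:, 0].astype(int), data[:, 1].astype(int), data[:, 2]
        history.append(np.mean((np.sum(U[u] * I[i], axis=1) - r) ** 2))
    return U, I, history                                    # plot history: error vs epoch
```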
Underfitting and Overfitting
A very simple way of checking underfitting
[Figure: distribution of the ground-truth Y vs the predicted Y; the model is always predicting (almost) the same value, which signals underfitting]
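A minimal sketch of this check (illustrative, not from the slides): if the predictions vary far less than the ground truth does, the model is essentially predicting a constant.

```python
import numpy as np

def looks_underfit(y_true, y_pred, ratio=0.1):
    """Flag a model whose predictions barely vary compared with the ground truth."""
    return np.std(y_pred) < ratio * np.std(y_true)
```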
Evaluation Functions
What do we want to evaluate?
Classification
● True Positive Rate (TPR)
● True Negative Rate (TNR)
● Precision
● F-measure
Regression
● Mean Square Error (MSE)
Ranking
● Recall@K
● Precision@K
● CG, DCG, nDCG
Ranking/Classification metrics
● AUC
Some common evaluation functions
Regression
Mean Square Error (MSE)
● Easy to compute
● Linear gradient
● Can also be used as loss function
Mean Absolute Error (MAE)
● Easy to compute
● Easy to interpret
● Discontinuous gradient
● Not well suited as a loss function (non-smooth at zero)
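Both metrics are one-liners over the observed ratings; a minimal numpy sketch:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Square Error over the observed (user, item) pairs."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def mae(y_true, y_pred):
    """Mean Absolute Error over the observed (user, item) pairs."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))
```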
Classification – Precision vs Recall
TS = Toy Story
KP = Kung Fu Panda
TD = How to train your dragon
A = Annabelle
[Figure: the user's liked items (TS1–TS4, KP1–KP3), disliked items (A1, A2), and each model's recommendations]
Model 1 recommendations: TS1, TS2, TS3, KP1, KP2
Model 2 recommendations: TS1, TS2, TS3, TS4, KP1, KP2, KP3, A1, A2
Classification – Precision vs Recall
TS = Toy Story
KP = Kung Fu Panda
TD = How to train your dragon
A = Annabelle
Model 1: Precision = 5/5 = 1, Recall = 5/7
Model 2: Precision = 7/9, Recall = 7/7 = 1
Classification 1/2
True Positive Rate (a.k.a TPR, Recall, Sensitivity)
● Easy to understand
● Useful for likes/dislikes datasets
● Measure of global bias of a model
● 0 <= TPR <=1 (higher is better)
True Negative Rate (a.k.a TNR, Selectivity, Specificity)
● Easy to understand
● Useful for likes/dislikes datasets
● Measure of global bias of a model
● 0 <= TNR <=1 (higher is better)
Classification 2/2
Precision
● Easy to understand
● Useful for likes/dislikes datasets
● Measures the quality of the recommendations
● 0 <= Precision <=1 (higher is better)
F-measure
● Balance precision and recall
● Not good for recommendation, because it doesn’t take True Negatives into account
● 0 <= F-measure <= 1 (higher is better)
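For reference, the standard definitions behind these four metrics, written in terms of the like/dislike confusion matrix (TP, TN, FP, FN):

```latex
\mathrm{TPR} = \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
\mathrm{TNR} = \frac{TN}{TN + FP}, \qquad
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```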
Ranking 1/3
Recall@K
● Count the positive items among the top K items predicted for each user
● Divide that number by the user's total number of positive items
● A perfect score of 1 is achievable if the user has K or fewer positive items and they all appear in the predicted top K
● Independent of the exact values of the predictions, only their relative rank matters
Movie           | True rating | Predicted score
Toy Story 1     | 1.0         | 0.9
Toy Story 2     | 0.9         | 0.7
Kung Fu Panda 1 | 0.7         | 0.1
Kung Fu Panda 2 | 0.6         | -0.1
Annabelle 1     | -0.2        | 0.4
K = 3
TOP K = ?
TOP K Positive = ?
Total Positive = ?
Recall@K = ?
Ranking 1/3
Recall@K
● Count the positive items among the top K items predicted for each user
● Divide that number by the user's total number of positive items
● A perfect score of 1 is achievable if the user has K or fewer positive items and they all appear in the predicted top K
● Independent of the exact values of the predictions, only their relative rank matters
Movie           | True rating | Predicted score
Toy Story 1     | 1.0         | 0.9
Toy Story 2     | 0.9         | 0.7
Kung Fu Panda 1 | 0.7         | 0.1
Kung Fu Panda 2 | 0.6         | -0.1
Annabelle 1     | -0.2        | 0.4
K = 3
TOP K = {TS1, TS2, A1}
TOP K Positive = {TS1, TS2} = 2
Total Positive = 4
Recall@K = 2 / 4
Ranking 1/3
Recall@K
● In math terms: Recall@K = |{top-K predicted items} ∩ {positive items}| / |{positive items}|
Ranking 2/3
Precision@K
● Count the positive items among the top K items predicted for each user
● Divide that number by K for each user
● A perfect score of 1 is achievable if the user has K or more positive items and the top K contains only positives
● Independent of the exact values of the predictions, only their relative rank matters
Movie           | True rating | Predicted score
Toy Story 1     | 1.0         | 0.9
Toy Story 2     | 0.9         | 0.7
Kung Fu Panda 1 | 0.7         | 0.1
Kung Fu Panda 2 | 0.6         | -0.1
Annabelle 1     | -0.2        | 0.4
K = 3
TOP K = ?
TOP K Positive = ?
Precision@K = ?
Ranking 2/3
Precision@K
● Count the positive items among the top K items predicted for each user
● Divide that number by K for each user
● A perfect score of 1 is achievable if the user has K or more positive items and the top K contains only positives
● Independent of the exact values of the predictions, only their relative rank matters
Movie           | True rating | Predicted score
Toy Story 1     | 1.0         | 0.9
Toy Story 2     | 0.9         | 0.7
Kung Fu Panda 1 | 0.7         | 0.1
Kung Fu Panda 2 | 0.6         | -0.1
Annabelle 1     | -0.2        | 0.4
K = 3
TOP K = {TS1, TS2, A1}
TOP K Positive = {TS1, TS2} = 2
Precision@K = 2 / 3
Ranking 2/3
Precision@K
● In math terms: Precision@K = |{top-K predicted items} ∩ {positive items}| / K
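A minimal Python sketch of both metrics for a single user; `scores` (the model's predicted scores) and `positives` (a boolean mask of the user's positive items) are illustrative names, reproducing the Toy Story example above:

```python
import numpy as np

def recall_precision_at_k(scores, positives, k):
    """Recall@K and Precision@K for one user; only the rank of the scores matters."""
    scores = np.asarray(scores)
    positives = np.asarray(positives, dtype=bool)
    top_k = np.argsort(-scores)[:k]          # indices of the K highest-scored items
    hits = positives[top_k].sum()            # positives that made it into the top K
    recall = hits / max(positives.sum(), 1)
    precision = hits / k
    return recall, precision

# Slide example: TS1, TS2, KP1, KP2 are positives; A1 is a negative
scores    = [0.9, 0.7, 0.1, -0.1, 0.4]       # predicted scores
positives = [True, True, True, True, False]  # positives according to the true ratings
print(recall_precision_at_k(scores, positives, k=3))  # -> (0.5, 0.666...)
```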
Ranking 3/3
CG, DCG, and nDCG
● CG: Sum the true ratings of the Top K items predicted for each user
● DCG: Weight by position in Top K; nDCG: Normalize in [0, 1]
● A perfect nDCG is 1 if the ranking of the prediction is the same as the ranking of the true ratings
● The bigger the score the better
Movie           | True rating | Predicted score
Toy Story 1     | 1           | 0.9
Toy Story 2     | 0.9         | 0.7
Kung Fu Panda 1 | 0.7         | 0.1
Kung Fu Panda 2 | 0.6         | -0.1
Annabelle 1     | -0.2        | 0.4
K = 3
TOP K = ?
CG = ?
DCG = ?
Ranking 3/3
CG, DCG, and nDCG
● CG: Sum the true ratings of the Top K items predicted for each user
● DCG: Weight by position in Top K; nDCG: Normalize in [0, 1]
● A perfect nDCG is 1 if the ranking of the prediction is the same as the ranking of the true ratings
● The bigger the score the better
Movie           | True rating | Predicted score
Toy Story 1     | 1           | 0.9
Toy Story 2     | 0.9         | 0.7
Kung Fu Panda 1 | 0.7         | 0.1
Kung Fu Panda 2 | 0.6         | -0.1
Annabelle 1     | -0.2        | 0.4
K = 3
TOP K = {TS1, TS2, A1}
CG = 1.0 + 0.9 - 0.2 = 1.7
DCG = 1/1 + 0.9/2 - 0.2/3 ≈ 1.38
Ranking 3/3
CG, DCG, and nDCG
● CG: Sum the true ratings of the Top K items predicted for each user
● DCG: Weight by position in Top K; nDCG: Normalize in [0, 1]
● A perfect nDCG is 1 if the ranking of the prediction is the same as the ranking of the true ratings
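A minimal sketch of CG, DCG and nDCG using the 1/position discount shown on the slides (note that many libraries use a log2 discount instead); `true_ratings` and `scores` are illustrative names:

```python
import numpy as np

def cg_dcg_ndcg(true_ratings, scores, k):
    """CG, DCG and nDCG of the top K items ranked by predicted score (illustrative)."""
    true_ratings = np.asarray(true_ratings, dtype=float)
    order = np.argsort(-np.asarray(scores))[:k]     # top K items by predicted score
    discounts = 1.0 / np.arange(1, len(order) + 1)  # 1/position discount, as on the slide
    cg = true_ratings[order].sum()
    dcg = (true_ratings[order] * discounts).sum()
    ideal = np.sort(true_ratings)[::-1][:k]         # best possible ordering of the true ratings
    idcg = (ideal * discounts).sum()
    return cg, dcg, dcg / idcg                      # nDCG = DCG normalized by the ideal DCG

# Slide example: CG = 1.7, DCG ≈ 1.38, nDCG ≈ 0.82
print(cg_dcg_ndcg([1.0, 0.9, 0.7, 0.6, -0.2], [0.9, 0.7, 0.1, -0.1, 0.4], k=3))
```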
Hybrid Ranking/Classification
AUC
● Vary positive prediction threshold (not just 0)
● Compute TPR and FPR for all possible positive thresholds
● Build Receiver Operating Characteristic (ROC) curve
● Integrate Area Under the ROC Curve (AUC)
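In practice the ROC curve and its area are rarely computed by hand; a minimal sketch with scikit-learn, assuming binary like/dislike labels and arbitrary real-valued model scores:

```python
from sklearn.metrics import roc_auc_score

y_true  = [1, 1, 1, 1, 0]             # like / dislike labels
y_score = [0.9, 0.7, 0.1, -0.1, 0.4]  # model scores; only their ordering matters

print(roc_auc_score(y_true, y_score))  # area under the ROC curve
```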
Loss functions
Loss Functions vs Evaluation Functions
Evaluation Metrics
● Expensive to evaluate
● Often not smooth
● Often not even differentiable
Loss Functions
● Smooth approximations of your evaluation metric
● Well suited for SGD
Loss Functions: How are we going to solve the problem?
Classification loss
● Logistic
● Cross Entropy
● Kullback-Leibler Divergence
Regression loss
● Mean Square Error (MSE)
Ranking loss
● WARP
● BPR
Some common loss functions
Optimization Problems – Basic Formulation with RMSE
Goal: find U and I s.t. the difference between each datapoint in R and the product of the corresponding user and item embeddings is minimal:
$\mathrm{RMSE}(U, I) = \sqrt{\tfrac{1}{|\Omega|} \sum_{(u,i) \in \Omega} \left(R_{ui} - U_u \cdot I_i\right)^2}$, where Ω is the set of observed ratings and R̂ = U · Iᵀ (or R)
Optimization Problems – General Formulation
Goal: find U and I s.t. the loss function J is minimized.
$\min_{U, I} \; J(R, \hat{R})$ with $\hat{R} = U \cdot I^\top$ (or R)
Convex vs Non-Convex Optimization
Convex Non-convex
Convex Optimization
Non-Convex Optimization
Loss Functions – Regression
Mean Square Error
● Typically used loss function for regression: it is smooth and easy to understand.
Regularized Mean Square Error
● Mean square error plus regularization to avoid overfitting.
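The exact formulas on the original slide were images; a common written form of the two objectives, with Frobenius-norm regularization weighted by λ, is:

```latex
J_{\mathrm{MSE}}(U, I) = \sum_{(u,i)\ \mathrm{observed}} \left(R_{ui} - U_u \cdot I_i\right)^2,
\qquad
J_{\mathrm{reg}}(U, I) = J_{\mathrm{MSE}}(U, I) + \lambda \left(\lVert U \rVert_F^2 + \lVert I \rVert_F^2\right)
```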
Loss Functions – Classification
Logistic
● Typically used loss function for classification. Smooth gradient around zero, steep for large errors.
Loss Functions – Classification
Logistic
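The formula on this slide was an image; the standard logistic loss for a label y ∈ {-1, +1} and a model score s is:

```latex
\ell_{\mathrm{logistic}}(y, s) = \log\left(1 + e^{-y s}\right)
```

This is equivalent to the cross-entropy between the true label and the sigmoid of the score (see the appendix slide at the end).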
Loss Functions – Ranking
Weighted Approximate-Rank Pairwise (WARP)
● Approximates DCG-like evaluation metrics
● Smooth and tractable computation
Bayesian Personalised Ranking (BPR)
● Approximates AUC
● Smooth and tractable computation
● Requires binary comparisons (good for binary comparison feedback)
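Both losses are available out of the box in LightFM (already mentioned in the model recap table); a minimal usage sketch with a toy positive-feedback matrix:

```python
import numpy as np
from scipy.sparse import coo_matrix
from lightfm import LightFM

# toy positive-only interaction matrix: 3 users x 4 items (illustrative)
interactions = coo_matrix(np.array([[1, 0, 1, 0],
                                    [0, 1, 0, 0],
                                    [1, 1, 0, 1]]))

# loss="warp" targets the top of the list (precision@k / DCG-like), loss="bpr" targets AUC
model = LightFM(no_components=8, loss="warp")
model.fit(interactions, epochs=10)

# score every item for user 0
print(model.predict(np.array([0, 0, 0, 0]), np.array([0, 1, 2, 3])))
```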
Practical Recommendations
Practical Recommendations
(1) Always compute baseline metrics
(2) Always analyze underfitting vs overfitting
(3) Always do hyperparameter optimization
(4) Always compute multiple metrics for your models
(5) Always analyze the clustering properties of the items/users
(6) Always ask for feedback from end users
COMPARING WITH GLOBAL MODELS IS EASY
IF OVERFITTING, USE REGULARIZATION
GRID SEARCH OR GAUSSIAN PROCESS
TPR, TNR, PRECISION, ETC.
ITEM/ITEM SIMILARITIES
EVERYTHING IS ABOUT USER TASTE
(1) Always compute baseline metrics
                 | Global Avg | User Avg | Item-Item      | Linear      | Linear + Reg    | Matrix Fact    | Deep Learning
Domains          | Baseline   | Baseline | users >> items | Known “I”   | Known “I”       | Unknown “I”    | Extra datasets
Model Complexity | Trivial    | Trivial  | Simple         | Linear      | Linear          | Linear         | Non-linear
Time Complexity  | +          | +        | +++            | ++++        | ++++            | ++++           | ++
Overfit/Underfit | Underfit   | Underfit | May Underfit   | May Overfit | May Perform Bad | May Overfit    | Can Overfit
Hyper-Params     | 0          | 0        | 0              | 1           | 2               | 2–3            | many
Implementation   | Numpy      | Numpy    | Numpy          | Numpy       | Numpy           | LightFM, Spark | NNet libraries
(2) Always analyze underfitting vs overfitting
Model-based
● Dropout
● Bagging
Loss-based regularization
● ℓ1 norm: best convex approximation of the sparsity-inducing ℓ0 norm
● ℓ2 norm: very smooth, easy to optimize
Data Augmentation
● Negative Sampling
(3) Always do hyperparameter optimization
Grid Search
Brute force over all the combinations of the parameters
Exponential cost: with 20 hyper-parameters and only 10 values each, you need 10^20 complete runs
Random Search
Uniformly sample combinations of the parameters
Very easy to implement, very useful in practice
Gaussian Process Optimization
Meta-learning of the validation error given hyper-parameters
Solves the exploration/exploitation tradeoff
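A minimal sketch of random search (the middle option above), assuming a hypothetical `fit_and_score(params)` function that trains a model with the given hyper-parameters and returns a validation metric to maximize:

```python
import random

def random_search(fit_and_score, n_trials=50, seed=0):
    """Uniformly sample hyper-parameter combinations and keep the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {                                  # one uniformly sampled combination
            "n_components": rng.choice([8, 16, 32, 64]),
            "learning_rate": 10 ** rng.uniform(-4, -1),
            "regularization": 10 ** rng.uniform(-6, -2),
        }
        score = fit_and_score(params)               # e.g. validation precision@k
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```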
(3) Always do hyperparameter optimization
[Figure: hyper-parameter search results for a metric to minimize and for a metric to maximize]
(4) Always compute multiple metrics for your models
(5) Always analyze the clustering properties of the items/users
Item embeddings
● In general, we combine item embeddings with: FEATURES | IMAGE EMBS | NLP EMBS
● After getting the embeddings, we always compute Top-K similarities on well-known items
● We use the item embeddings to create clusters and analyze how good they are
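A minimal sketch of the Top-K similarity sanity check on item embeddings, using cosine similarity with plain numpy (function and variable names are illustrative):

```python
import numpy as np

def top_k_similar(item_embeddings, item_idx, k=5):
    """Return the k items whose embeddings are closest (cosine) to item_idx."""
    E = item_embeddings / np.linalg.norm(item_embeddings, axis=1, keepdims=True)
    sims = E @ E[item_idx]                       # cosine similarity to the query item
    neighbours = np.argsort(-sims)               # most similar first
    return [int(j) for j in neighbours if j != item_idx][:k]
```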
(6) Always ask end users for feedback
RECOMMENDATION IS ALL ABOUT USERS’ TASTE
ASK THEM FOR FEEDBACK!!
Conclusions
Losses and metrics summary table
Name Category loss eval batch-SGD support implicit Comments
MSE Regr ✓ ✓ ✓ ✓ linear gradient
MAE Regr ✓ ✓ easy to interpret
Logistic / XE / KL Classif ✓ ✓ ✓ ✓ flexible truth
Exponential Classif ✓ ✓ exploding gradient
Recall (global) Classif ✓ ✓ ✓ requires negative
Precision (global) Classif ✓ ✓ ✓ requires negative
F-measure (global) Classif ✓ ✓ ✓ requires negative
MRR Ranking ✓ considers only 1 item
nDCG Ranking ✓ requires rank
WARP Ranking ✓ for nDCG, p@k, r@k
AUC Hybrid ✓ ✓ ✓ requires negative
BPR Hybrid ✓ ✓ for AUC
Recall@k Hybrid ✓ requires ≤k positives
Precision@k Hybrid ✓ requires ≥k positives
Questions
Thank YOU!
Negative Sampling
Problem
● Unary feedback: with positive-only data, the best-fitting model is one that always predicts “1” for every user and item.
● In general:
○ your model is used in real life to predict (user, item) pairs outside the sparse dataset.
○ you can’t train on the full (#users x #items) dense matrix.
Negative Sampling Solution
● unary → binary (e.g. click/missing); binary → ternary (e.g. like/dislike/missing)
● the sampling strategy matters a lot (i.e. how to split train and valid)
● how many negative samples you draw matters a lot
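A minimal sketch of negative sampling for unary (click/missing) feedback, where each observed positive gets a configurable number of unobserved (user, item) pairs labelled 0; the `negatives_per_positive` ratio is one of the knobs the slide says matters a lot (all names are illustrative):

```python
import numpy as np

def add_negative_samples(positives, n_items, negatives_per_positive=4, seed=0):
    """positives: iterable of (user, item) pairs with implicit positive feedback."""
    rng = np.random.default_rng(seed)
    observed = set(map(tuple, positives))
    samples = [(u, i, 1) for u, i in positives]          # keep the positives, labelled 1
    for u, _ in positives:
        drawn = 0
        while drawn < negatives_per_positive:
            j = int(rng.integers(n_items))
            if (u, j) not in observed:                   # only unobserved pairs become negatives
                samples.append((u, j, 0))
                drawn += 1
    return np.array(samples)
```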
Negative Sampling
Split the negative feedback between train and valid in the same proportion
Underfitting and Overfitting – Take Home
(1) For cross-validation, split the data such that (almost) every user appears in both training and validation
(2) Use negative sampling to avoid overfitting in your models
(3) Always use learning curves to get more insights about underfitting vs overfitting
(4) Compute mean and variance of your predictions to get insights about underfitting vs overfitting
Loss Functions – Classification
Logistic
● Equivalent to the cross-entropy between the truth and the predicted probability (for a 2-class model)
● Equivalent to the Kullback-Leibler divergence between the truth and the predicted probability
● Often used for deep-learning based recommendation engines
● Smooth gradient around zero and steep for large errors