Toward a new Protocol to evaluate Recommender Systems

Frank Meyer, Françoise Fessant, Fabrice Clérot, Eric Gaussier
Franck.meyer@orange.com

University Joseph Fourier & Orange
RecSys 2012 – Workshop on Recommendation Utility Evaluation
2012 – v1.18

R&D

Summary
• Introduction

1. Industrial tasks for recommender systems

2. Industrial (offline) protocol

3. Main results

• Conclusion and future works

Recommender systems
• For industrial applications
  • Amazon, Google News, YouTube (Google), ContentWise, BeeHive (IBM), ...
• as for well-known academic systems
  • Fab, More, Twittomender, ...
• recommendation is multifaceted
  • pushing items, sorting items, linking items, ...
  • and cannot be reduced to predicting a score of interest (a rating) of a user u for an item i.

• What is a good recommender system?
  • just a system that predicts ratings accurately for the top N blockbusters and the top M big users?
  • ... or something else?


Industrial point of view
• Main goals of automatic recommendation:
  • to increase sales
  • to increase the audience (click rates, ...)
  • to increase customer satisfaction and loyalty

• Main needs (analysis at Orange: TV, Video On Demand, shows, web radios, ...)
  1. Helping all the users: big users and small users
  2. Recommending all the items: frequently purchased/viewed items, rarely purchased/viewed items
  3. Helping users with different identified problems
     1. Should I take this item?
     2. Should I take this item or that one?
     3. What should interest me in this catalog?
     4. What is similar to this item?


We propose 4 key functions
• Help to Explore (navigate)
  Example: given an item i used as a context, give N items similar to i.

• Help to Decide
  Example: given a user u and an item i, give a predictive score of interest of u for i (a rating).

• Help to Compare
  Example: given a user u and a list of items i1,…,in, sort the items in decreasing order of their score of interest for u.

• Help to Discover
  Example: given a user u, give N interesting items for u.
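
As an illustration only, the four key functions can be read as four small operations built on a score of interest and an item-item similarity. The sketch below assumes toy, hand-written data; the names SCORES, SIMILARITY, CATALOG and the function names are hypothetical and do not come from the slides.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical toy data: scores of interest (user, item) and item-item similarities.
SCORES: Dict[Tuple[str, str], float] = {
    ("u1", "i1"): 4.5, ("u1", "i2"): 2.0, ("u1", "i3"): 3.5,
}
SIMILARITY: Dict[Tuple[str, str], float] = {
    ("i1", "i2"): 0.1, ("i1", "i3"): 0.8, ("i2", "i3"): 0.4,
}
CATALOG: List[str] = ["i1", "i2", "i3"]


def decide(user: str, item: str) -> float:
    """Help to Decide: predictive score of interest of the user for the item."""
    return SCORES[(user, item)]


def compare(user: str, items: Sequence[str]) -> List[str]:
    """Help to Compare: sort the items by decreasing score of interest."""
    return sorted(items, key=lambda i: decide(user, i), reverse=True)


def discover(user: str, n: int, known: Sequence[str] = ()) -> List[str]:
    """Help to Discover: the N most interesting items not already known."""
    candidates = [i for i in CATALOG if i not in known]
    return compare(user, candidates)[:n]


def similarity(a: str, b: str) -> float:
    """Symmetric lookup in the toy similarity table."""
    return SIMILARITY.get((a, b), SIMILARITY.get((b, a), 0.0))


def explore(item: str, n: int) -> List[str]:
    """Help to Explore: the N items most similar to the context item."""
    others = [j for j in CATALOG if j != item]
    return sorted(others, key=lambda j: similarity(item, j), reverse=True)[:n]
```

In this reading, Decide, Compare and Discover all rest on the same score of interest, while Explore needs a separate item-item similarity.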


Decide / Compare / Discover / Explore

Function: Decide
  Quality criteria: the rating prediction must be precise. Extreme errors must be penalized because they more often lead to a wrong decision.
  Measure: existing measure, RMSE.

Function: Compare
  Quality criteria: the ranking prediction must be good for any couple of items of the catalog (not only for a Top N).
  Measure: existing measure, NDPM (or number of compatible orders).

Function: Discover
  Quality criteria: the recommendation must be useful. Problem: if one recommends only well-known blockbusters (e.g. Star Wars, Titanic...), one will be precise but not useful!
  Measure: existing measure, Precision. We introduce the Impact Measure.

Function: Explore
  Quality criteria: problem: the semantic relevance cannot be evaluated without users' feedback.
  Measure: we introduce a validation method for a similarity measure.


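Known vs Unknown, Risky vs Safe
Recommending an item to a user, along two axes: the probability that the user already knows the item, and the probability that the user likes the item.

• High probability that the user knows the item, low probability that the user likes it: bad recommendation. But the item is generally known by name by the user.
• High probability that the user knows the item, high probability that the user likes it: trivial recommendation. Correct, but not often useful.
• Low probability that the user knows the item, low probability that the user likes it: very bad recommendation. The user does not know the item: if he trusts the system, he will be misled.
• Low probability that the user knows the item, high probability that the user likes it: very good recommendation. This is the Help to Discover.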


Measuring the Help to Discover
• Average Measure of Impact (AMI), built on the same two axes: the probability that the user already knows the item and the probability that the user likes the item.

Recommendation impact
                                     Impact if the user       Impact if the user
                                     dislikes the item        likes the item
Recommending a popular item          slightly negative        slightly positive
Recommending a rare, unknown item    strongly negative        strongly positive

Components of the AMI formula:
• Z: the list of recommended items
• H: the list of logs (u,i,r) in the Test Set
• Impact: rarity of the item × relative rating of the user u (relative to her mean of ratings)
• Size of the catalog (normalization)
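
The exact AMI formula is not recoverable from the slide, so the following is only one possible reading of the components listed above. It assumes that the rarity of an item is the catalog size divided by the number of ratings of the item in the learning logs, that the relative rating is the test rating minus the user's mean rating in the learning logs, and that the sum of impacts is averaged over the size of the list Z. All three choices, and every name in the code, are assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Log = Tuple[str, str, float]  # (userID, itemID, rating)


def average_measure_of_impact(
    recommended: List[Tuple[str, str]],  # list Z of (userID, itemID) recommendations
    learn_logs: List[Log],
    test_logs: List[Log],                # list H of logs (u, i, r) in the Test Set
    catalog_size: int,
) -> float:
    """Hypothetical AMI: mean over Z of rarity(i) * (r_ui - mean rating of u)."""
    item_count = Counter(i for _, i, _ in learn_logs)
    ratings_by_user: Dict[str, List[float]] = defaultdict(list)
    for u, _, r in learn_logs:
        ratings_by_user[u].append(r)
    user_mean = {u: sum(rs) / len(rs) for u, rs in ratings_by_user.items()}
    test_rating = {(u, i): r for u, i, r in test_logs}

    total = 0.0
    for u, i in recommended:
        # Only recommendations actually rated in the Test Set can be evaluated.
        if (u, i) not in test_rating or u not in user_mean or item_count[i] == 0:
            continue
        rarity = catalog_size / item_count[i]            # ASSUMPTION: normalized rarity
        relative_rating = test_rating[(u, i)] - user_mean[u]
        total += rarity * relative_rating
    # ASSUMPTION: the average is taken over the whole list Z.
    return total / len(recommended) if recommended else 0.0
```

With this reading, a rare item that the user rates above her mean contributes a strongly positive impact, and a rare item rated below her mean a strongly negative one, which matches the impact table.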

Principle of the protocol
Datasets used: MovieLens 1M and Netflix.

No long-tail distribution was detected in either the Netflix or the MovieLens dataset, so we use the simplest segmentation, based on the mean number of ratings: light/heavy users, popular/unpopular items.

Pipeline:
1. The logs (userID, itemID, rating) are split into a Learn set and a Test set, with the simple mean-based item/user segmentation.
2. A model is built from the Learn set.
3. The model is evaluated on the Test set with three measures:
   • RMSE: for each (userID, itemID) in Test, generate a rating prediction and compare it with the true rating.
   • %COMP (% compatible): for each list of itemIDs for each userID in Test, sort the list according to the ratings and compare the strict orders of the ratings with the order given by the model.
   • AMI: for each userID in Test, generate a list of recommended items; for each of these items actually rated by userID in Test, evaluate the relevance.
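
A minimal sketch of the first two measures of the protocol, RMSE and %COMP, is given below. It assumes that a strict order is a pair of items to which the same user gave two different ratings in Test, and that such a pair is compatible when the model's predictions rank the two items the same way. The function names and the `predict` callable are hypothetical.

```python
import math
from itertools import combinations
from typing import Callable, Dict, List, Tuple

Log = Tuple[str, str, float]  # (userID, itemID, rating)


def rmse(test_logs: List[Log], predict: Callable[[str, str], float]) -> float:
    """Root mean squared error between predicted and true ratings in Test."""
    squared_errors = [(predict(u, i) - r) ** 2 for u, i, r in test_logs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def percent_compatible(test_logs: List[Log], predict: Callable[[str, str], float]) -> float:
    """Share (in %) of the user's strict orders that the model reproduces."""
    by_user: Dict[str, List[Tuple[str, float]]] = {}
    for u, i, r in test_logs:
        by_user.setdefault(u, []).append((i, r))

    compatible = strict = 0
    for u, rated in by_user.items():
        for (i, r_i), (j, r_j) in combinations(rated, 2):
            if r_i == r_j:
                continue  # ties are not strict orders
            strict += 1
            # Same sign means the model orders the pair as the user did.
            if (predict(u, i) - predict(u, j)) * (r_i - r_j) > 0:
                compatible += 1
    return 100.0 * compatible / strict if strict else float("nan")
```

Both functions take any predictor, so the same Test set can be scored for each algorithm and for each user/item segment.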

We will use 4 algorithms to validate the protocol
• Uniform Random Predictor
  • Returns a rating between 1 and 5 (min and max) drawn from a uniform random distribution
• Default Predictor: (mean of item + mean of user) / 2
  • Robust mean of the items: requires at least 10 ratings on the item, otherwise uses only the user's mean
• K-Nearest Neighbor item method
  • Uses K nearest neighbors per item, a scoring method detailed below, and a similarity measure called Weighted Pearson. Uses the Default Predictor when an item cannot be predicted
  • Ref: Candillier, L., Meyer, F., Fessant, F. (2008). Designing Specific Weighted Similarity Measures to Improve Collaborative Filtering Systems. ICDM 2008: 242-255
• Fast factorization method
  • Fast Factorization Algorithm, with F factors, known as Gravity ("BRISMF" implementation)
  • Ref: Takács, G., Pilászy, I., Németh, B., Tikk, D. (2009). Scalable Collaborative Filtering Approaches for Large Recommender Systems. Journal of Machine Learning Research 10: 623-656
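
The Default Predictor described above is simple enough to sketch directly. The version below follows the stated rule: average of the item's mean and the user's mean, and the user's mean alone when the item has fewer than 10 ratings. The fallback to the global mean for a user absent from the learning logs is an assumption added here, and the function names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Log = Tuple[str, str, float]  # (userID, itemID, rating)


def build_default_predictor(
    learn_logs: List[Log], min_item_ratings: int = 10
) -> Callable[[str, str], float]:
    """Return predict(user, item) = (mean of item + mean of user) / 2."""
    if not learn_logs:
        raise ValueError("learn_logs must not be empty")

    user_sum: Dict[str, float] = {}
    user_n: Dict[str, int] = {}
    item_sum: Dict[str, float] = {}
    item_n: Dict[str, int] = {}
    for u, i, r in learn_logs:
        user_sum[u] = user_sum.get(u, 0.0) + r
        user_n[u] = user_n.get(u, 0) + 1
        item_sum[i] = item_sum.get(i, 0.0) + r
        item_n[i] = item_n.get(i, 0) + 1
    global_mean = sum(r for _, _, r in learn_logs) / len(learn_logs)

    def predict(user: str, item: str) -> float:
        # ASSUMPTION: unknown users fall back to the global mean rating.
        user_mean = user_sum[user] / user_n[user] if user in user_n else global_mean
        # Robust item mean: only trusted with at least min_item_ratings ratings.
        if item_n.get(item, 0) >= min_item_ratings:
            return (item_sum[item] / item_n[item] + user_mean) / 2
        return user_mean

    return predict
```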


What about "Help to Explore"?

• How to compare the "semantic quality" of the link between 2 items?

• Principle
  • Define a similarity measure that can be extracted from the model
  • Use the similarity measure to build an item-item similarity matrix
  • Use the similarity matrix as the model of a KNN item-item recommender system
  • If this system performs well on RMSE, %COMP, and AMI, then the semantic quality of the similarity measure must be good

• Application
  • for a KNN-item model this is immediate (there is an intrinsic similarity)
  • for a matrix factorization model, we can use a similarity measure (such as Pearson) computed on the items' factors
  • for a random rating predictor, this is not applicable
  • for a mean-based rating predictor, this is not applicable
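
For the matrix factorization case, the idea of computing a similarity on the items' factors can be sketched as follows. This is a plain Pearson correlation between factor vectors followed by a brute-force nearest-neighbor search; the factor values would come from the factorization model, and the function names and data layout are assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two factor vectors of equal length."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant vector: correlation undefined, treated as 0
    return cov / math.sqrt(var_x * var_y)


def nearest_items(
    item_factors: Dict[str, Sequence[float]], k: int
) -> Dict[str, List[Tuple[str, float]]]:
    """For each item, the k most similar other items with their similarity."""
    neighbors: Dict[str, List[Tuple[str, float]]] = {}
    for i, f_i in item_factors.items():
        sims = [(j, pearson(f_i, f_j)) for j, f_j in item_factors.items() if j != i]
        sims.sort(key=lambda pair: pair[1], reverse=True)
        neighbors[i] = sims[:k]
    return neighbors
```

The resulting neighbor lists play the role of the item-item similarity matrix, which can then be plugged into a KNN item-item recommender and scored with RMSE, %COMP and AMI.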


Evaluating "Help to Explore" for Gravity
1. Start from the items × users matrix of ratings (rows of items, columns of users).
2. Gravity (fast matrix factorization) produces a matrix of users' factors (not used) and a matrix of items' factors.
3. Items' similarity computations and a K Nearest Neighbors search, using the matrix of items' factors, give the similarity matrix (KNN) of the items, which is a model for a recommender system.
4. This similarity matrix feeds a KNN-based recommender system.
5. The quality of this similarity matrix can then be evaluated via RMSE, %COMP, AMI...




Finding 1: different performances according to the segments
We observe a decrease in performance of more than 25% between the heavy user / popular item segment and the light user / unpopular item segment.

[Figure: two panels, "RMSE for Gravity on Netflix" (x-axis: number of factors, 0 to 70) and "RMSE for KNN on Netflix" (x-axis: number of KNN, 0 to 200); y-axis: RMSE.]

The 4 segments analyzed:
• RMSE Light users, Unpopular items (Luser Uitem)
• RMSE Light users, Popular items (Luser Pitem)
• RMSE Heavy users, Unpopular items (Huser Uitem)
• RMSE Heavy users, Popular items (Huser Pitem)

+ RMSE (global)
+ Default predictor
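
The four segments can be obtained with a simple mean-based split. The sketch below assumes that a user is "heavy" (Huser) when she has more ratings in the learning logs than the mean number of ratings per user, and that an item is "popular" (Pitem) when it has more ratings than the mean number of ratings per item. The strict inequality and the use of the learning logs for the counts are assumptions, as are the names.

```python
from collections import Counter
from typing import Dict, List, Tuple

Log = Tuple[str, str, float]  # (userID, itemID, rating)


def segment_logs(learn_logs: List[Log], test_logs: List[Log]) -> Dict[str, List[Log]]:
    """Split the test logs into the 4 user/item segments (mean-based thresholds)."""
    user_count = Counter(u for u, _, _ in learn_logs)
    item_count = Counter(i for _, i, _ in learn_logs)
    mean_per_user = sum(user_count.values()) / len(user_count)
    mean_per_item = sum(item_count.values()) / len(item_count)

    segments: Dict[str, List[Log]] = {
        "Huser Pitem": [], "Huser Uitem": [], "Luser Pitem": [], "Luser Uitem": [],
    }
    for u, i, r in test_logs:
        user_tag = "Huser" if user_count[u] > mean_per_user else "Luser"
        item_tag = "Pitem" if item_count[i] > mean_per_item else "Uitem"
        segments[user_tag + " " + item_tag].append((u, i, r))
    return segments
```

Each measure can then be computed once per segment, in addition to the global value.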

Finding 2: RMSE not strictly linked to the other performances
• For RMSE, the light user / popular item segment is easier to optimize than the light user / unpopular item segment.
• For ranking, the light user / popular item segment is as difficult to optimize as the light user / unpopular item segment.

[Figure: two panels, "RMSE for Gravity on Netflix" (y-axis: RMSE) and "Ranking compatibility for Gravity - Netflix" (y-axis: % compatible); x-axis: number of factors. Curves: the four user/item segments (light or heavy users, popular or unpopular items), plus the global value and the global Default predictor. Example on 2 segments.]

Finding 2 (continued): RMSE not strictly linked to the other performances
Globally, Gravity is better than KNN for RMSE, but worse than KNN for the Average Measure of Impact.

[Figure: three panels. "RMSE for Gravity on Netflix" (x-axis: number of factors) and "RMSE for KNN on Netflix" (x-axis: number of KNN), with one curve per user/item segment plus the global RMSE and the Default predictor. "Average Measure of Impact - Netflix": bar chart comparing Random Pred, Default Pred, KNN (K=100) and Gravity (F=32).]

Global results
Help to Decide / Compare / Discover

• Gravity dominates for the RMSE measure
• KNN dominates on the heavy user segments
• The Default Predictor is very useful for unpopular (i.e. infrequent) item segments


Comparing native similarities with Gravity-based similarities
Similarities measured by applying a Pearson similarity on the items' factors given by Gravity (16 factors):
1. KNN item-item can be performed on a factorized matrix with little performance loss (and faster!).
2. Gravity can be used for the "Help to Explore" function.

                                    Native KNN       KNN computed on Gravity's items' factors
                                    (K=100)          (K=100, number of factors=16)
RMSE                                0.8440           0.8691
Ranking: % compatible               77.03%           75.67%
Precision                           91.90%           86.39%
AMI                                 2.043            2.025
Global time of the modeling task    5290 seconds     3758 seconds
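
To put "little performance loss (and faster!)" into numbers, the relative change of each measure can be computed from the table. The short script below only reuses the figures reported above; the dictionary keys and the helper name are illustrative.

```python
# Figures copied from the comparison table (Netflix).
native_knn = {"rmse": 0.8440, "comp": 77.03, "precision": 91.90, "ami": 2.043, "time_s": 5290}
gravity_knn = {"rmse": 0.8691, "comp": 75.67, "precision": 86.39, "ami": 2.025, "time_s": 3758}


def relative_change(new: float, old: float) -> float:
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


changes = {name: relative_change(gravity_knn[name], native_knn[name]) for name in native_knn}

for name, value in changes.items():
    print(f"{name}: {value:+.2f}%")
```

On these figures, the KNN computed on Gravity's factors has an RMSE about 3% higher and a modeling time about 29% shorter than the native KNN.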





Conclusion: contributions
• As industrial recommendation is multifaceted
  • we proposed to list the key functions of recommendation
    • Help to Decide, Help to Compare, Help to Discover, Help to Explore
    • Note for Help to Explore: the similarity feature is mandatory for a recommender system
  • we proposed to define a dual segmentation of items and users
    • just being very accurate on big users and blockbuster items is not very useful

• For a new offline protocol to evaluate recommender systems
  • we proposed to use the recommender's key functions with the dual segmentation
    • mapping key functions to measures
    • adding the measure of Impact to evaluate the "Help to Discover" function
    • adding a method to evaluate the "Help to Explore" function
  • we demonstrated its utility
    • RMSE (Decide) is not strictly linked to the quality of the other functions (Compare, Discover, Explore), so it is very dangerous to evaluate a recommender system only with RMSE (no guarantee on the other measures!)
    • The mapping of the best algorithm for each couple (function, segment) could be exploited to improve the global performance
    • We also saw empirically that the KNN approach could be virtualized, computing the similarities between items in a factorized space built, for instance, by Gravity


Future works: 3 main axes

1. Evaluation of the quality of the 4 core functions using an online
A/B Testing protocol

2. Hybrid switch system: the best algorithm for the adapted task
according to the user-item-segment

3. KNN virtualization via matrix factorization


Annexes


About this work...

• Frank Meyer: Recommender systems in industrial contexts. CoRR abs/1203.4487 (2012)

• Frank Meyer, Françoise Fessant, Fabrice Clérot and Eric Gaussier: Toward a New Protocol to Evaluate Recommender Systems. Workshop on Recommendation Utility Evaluation, RecSys 2012. Dublin.

• Frank Meyer, Françoise Fessant: Reperio: A Generic and Flexible Industrial Recommender System. Web Intelligence 2011: 502-505. Lyon.


Classic mathematical representation of the recommendation problem
[Figure: a sparse ratings matrix with thousands of items (i1, i2, ..., ik, ..., im) in rows and thousands of users (u1, u2, ..., ul, ..., un) in columns. Filled cells hold known ratings of interest (1 to 5); cells marked "?" are the ratings of interest to predict.]

Well-known industrial example: Item-to-Item recommendation (Amazon™)

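Multifaceted analysis: measures
• RMSE: computed from the predicted rating and the real rating of each log, averaged over the number of logs in the Test Set:
  RMSE = \sqrt{ \frac{1}{|T|} \sum_{(u,i,r) \in T} (\hat{r}_{u,i} - r)^2 }
  where T is the Test Set, r the real rating and \hat{r}_{u,i} the predicted rating.
• NDPM: computed from the number of contradictory orders, the number of compatible orders and the number of strict orders given by the user.
• % compatible: on the same dataset and for the same user, directly usable.
• Precision: computed over the number of recommended items actually evaluable in the Test Set.
• AMI: Average Measure of Impact.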

Comparing native similarities with Gravity-based similarities
Similarities measured by applying a Pearson similarity on the items' factors given by Gravity (16 factors):
• Gravity can be used for the "Help to Explore" function.
• KNN item-item can be performed on a factorized matrix with little performance loss!


Reperio C-V5
• Centralized mode, example of a movie recommender


Reperio E-V2
• Embedded mode, example of a TV program recommender

