Transcript of "Toward a new Protocol to evaluate Recommender Systems"
1.
Toward a New Protocol to Evaluate Recommender Systems
Frank Meyer, Françoise Fessant, Fabrice Clerot, Eric Gaussier
Franck.meyer@orange.com
University Joseph Fourier & Orange
RecSys 2012, Workshop on Recommendation Utility Evaluation
2012, v1.18, Orange R&D
2.
Summary
Introduction
1. Industrial tasks for recommender systems
2. Industrial (offline) protocol
3. Main results
Conclusion and future work
Orange R&D Orange FT-group
4.
Recommender systems. For industrial applications (Amazon, Google News, YouTube (Google), ContentWise, BeeHive (IBM), ...) as well as for well-known academic realizations (Fab, More, Twittomender, ...), recommendation is multi-faceted: pushing items, sorting items, linking items... It cannot be reduced to predicting a rating, i.e. a score of interest of a user u for an item i. What is a good recommender system? Just a system accurate at rating prediction for the top N blockbusters and the top M big users? ... or something else?
6.
Industrial point of view. The main goals of automatic recommendation are to increase sales, to increase the audience (click rates, ...), and to increase customers' satisfaction and loyalty. Main needs (analysis at Orange: TV, Video on Demand, shows, web radios, ...):
1. Helping all the users: heavy users and light users
2. Recommending all the items: frequently purchased/viewed items and rarely purchased/viewed items
3. Helping users with different identified problems:
   1. Should I take this item?
   2. Should I take this item or that one?
   3. What could interest me in this catalog?
   4. What is similar to this item?
7.
We propose 4 key functions:
- Help to Explore (navigate). Example: given an item i used as a context, give N items similar to i.
- Help to Decide. Example: given a user u and an item i, give a predictive score of interest of u for i (a rating).
- Help to Compare. Example: given a user u and a list of items i1, ..., in, sort the items in decreasing order of the score of interest for u.
- Help to Discover. Example: given a user u, give N interesting items for u.
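The four functions can be sketched as a minimal interface. This is an illustration, not code from the deck; the class and method names are assumptions.

```python
from abc import ABC, abstractmethod

class Recommender(ABC):
    """Illustrative interface for the deck's four key functions."""

    @abstractmethod
    def decide(self, user, item):
        """Help to Decide: predictive score of interest of `user` for `item`."""

    def compare(self, user, items):
        """Help to Compare: sort `items` by decreasing predicted interest for `user`."""
        return sorted(items, key=lambda i: self.decide(user, i), reverse=True)

    @abstractmethod
    def discover(self, user, n):
        """Help to Discover: N interesting items for `user`."""

    @abstractmethod
    def explore(self, item, n):
        """Help to Explore: N items similar to the context `item`."""
```

Note that Help to Compare can be derived from Help to Decide by sorting, which is exactly how the protocol later evaluates ranking from predicted ratings.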
8.
Decide / Compare / Discover / Explore: quality criteria and measures
- Decide. The rating prediction must be precise; extreme errors must be penalized because they may more often lead to a wrong decision. Existing measure: RMSE.
- Compare. The ranking prediction must be good for any couple of items of the catalog (not only for a Top N). Existing measure: NDPM (or number of compatible orders).
- Discover. The recommendation must be useful. Existing measure: precision. Problem: if one recommends only well-known blockbusters (e.g. Star Wars, Titanic...), one will be precise but not useful! We introduce the Impact measure.
- Explore. Problem: the semantic relevance is not evaluable without user feedback. We introduce a validation method for a similarity measure.
10.
Known vs unknown, risky vs safe. Recommending an item for a user can be placed on two axes: the probability that the user already knows the item, and the probability that the user likes the item.
- Known and liked: trivial recommendation, correct but not often useful, since the item is generally known by name by the user.
- Known and disliked: bad recommendation.
- Unknown and disliked: very bad recommendation; if the user trusts the system, he will be misled.
- Unknown and liked: very good recommendation; this is Help to Discover.
11.
Measuring the Help to Discover: the Average Measure of Impact (AMI). The impact of a recommendation combines the probability that the user already knows the item and the probability that the user likes it:
- Recommending a popular item: slightly negative impact if the user dislikes the item, slightly positive if the user likes it.
- Recommending a rare, unknown item: strongly negative impact if the user dislikes the item, strongly positive if the user likes it.
The impact is computed from the rarity of the recommended item in the catalog (used as a normalization) multiplied by the relative rating of the user u (relative to her mean of ratings), over the list Z of recommended items and the list H of logs (u, i, r) in the test set.
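The slide gives the ingredients of the impact (rarity of the item, rating relative to the user's mean) but not the exact formula; the sketch below combines them in one plausible way and is an assumption, not the deck's definition.

```python
import math

def average_measure_of_impact(recommended, test_logs, item_counts, user_means, n_logs):
    """Sketch of an impact measure in the spirit of the slide: rarity of the
    recommended item (normalized over the catalog) times the rating of the
    user relative to her mean. The exact combination is an assumption."""
    impacts = []
    for u, items in recommended.items():
        for i in items:
            r = test_logs.get((u, i))
            if r is None:  # only items actually rated in the test set are evaluable
                continue
            rarity = -math.log(item_counts[i] / n_logs)  # rare item -> large rarity
            relative = r - user_means[u]                 # like -> positive, dislike -> negative
            impacts.append(rarity * relative)
    return sum(impacts) / len(impacts) if impacts else 0.0
```

With this shape, a liked rare item scores much higher than a liked blockbuster, which is the behavior the quadrant slide calls for.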
12.
Principle of the protocol. Datasets used: MovieLens 1M and Netflix, as logs of (userID, itemID, rating) split into a Learn set and a Test set. No long-tail distribution was detected in the Netflix or MovieLens datasets, so we use the simplest segmentation, based on the mean number of ratings: light/heavy users and popular/unpopular items (a simple mean-based item/user segmentation). The model is learned on the Learn set and evaluated on the Test set:
- RMSE: for each (userID, itemID) in Test, generate a rating prediction and compare it with the true rating.
- %COMP (% compatible): for each list of itemIDs of each userID in Test, sort the list according to the ratings and compare the strict orders given by the user with the order given by the model.
- AMI: for each userID in Test, generate a list of recommended items; for each of these items actually rated by userID in Test, evaluate the relevance.
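The mean-based segmentation above can be sketched in a few lines; treating "above the mean count" as heavy/popular is the simple threshold the slide describes, though the exact cut-off is our assumption.

```python
def mean_based_segments(logs):
    """Sketch of the simple mean-based segmentation: a user is 'heavy' if she
    has more ratings than the average user, an item is 'popular' if it has
    more ratings than the average item (threshold choice is an assumption)."""
    user_counts, item_counts = {}, {}
    for u, i, _r in logs:
        user_counts[u] = user_counts.get(u, 0) + 1
        item_counts[i] = item_counts.get(i, 0) + 1
    user_mean = sum(user_counts.values()) / len(user_counts)
    item_mean = sum(item_counts.values()) / len(item_counts)
    heavy = {u for u, c in user_counts.items() if c > user_mean}
    popular = {i for i, c in item_counts.items() if c > item_mean}
    return heavy, popular
```

Crossing the two sets yields the four segments (heavy/light user x popular/unpopular item) used in the findings below.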
13.
We will use 4 algorithms to validate the protocol:
- Uniform random predictor: returns a rating between 1 and 5 (min and max) drawn from a uniform distribution.
- Default predictor: (mean of item + mean of user) / 2, with a robust item mean: it requires at least 10 ratings on the item, otherwise only the user's mean is used.
- K-nearest-neighbor item method: uses K nearest neighbors per item, a scoring method detailed below, and a similarity measure called weighted Pearson. Falls back on the default predictor when an item cannot be predicted. Ref: Candillier, L., Meyer, F., Fessant, F. (2008). Designing Specific Weighted Similarity Measures to Improve Collaborative Filtering Systems. ICDM 2008: 242-255.
- Fast factorization method: fast factorization algorithm with F factors, known as Gravity ("BRISMF" implementation). Ref: Takács, G., Pilászy, I., Németh, B., Tikk, D. (2009). Scalable Collaborative Filtering Approaches for Large Recommender Systems. Journal of Machine Learning Research 10: 623-656.
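The default predictor is simple enough to state directly; this sketch follows the rule given on the slide (the function name and signature are ours).

```python
def default_predictor(user_mean, item_mean, item_count, min_ratings=10):
    """Default predictor from the deck: (item mean + user mean) / 2, with a
    robust item mean that needs at least `min_ratings` ratings on the item;
    otherwise only the user's mean is used."""
    if item_count < min_ratings:
        return user_mean
    return (item_mean + user_mean) / 2
```

Despite its simplicity, this baseline turns out to be very useful on the unpopular-item segments, as the global results below show.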
14.
What about "Help to Explore"? How to compare the "semantic quality" of the link between 2 items?
Principle: define a similarity measure that can be extracted from the model; use it to build an item-item similarity matrix; use the similarity matrix as the model of a KNN item-item recommender system; if this system obtains good performances for RMSE, %COMP and AMI, then the semantic quality of the similarity measure must be good.
Application:
- for a KNN-item model, this is immediate (there is an intrinsic similarity);
- for a matrix factorization model, we can use a similarity measure (such as Pearson) computed on the items' factors;
- for a random rating predictor, this is not applicable;
- for a mean-based rating predictor, this is not applicable.
15.
Evaluating "Help to Explore" for Gravity. Gravity (fast matrix factorization) factorizes the items x users matrix of ratings into a matrix of items' factors and a matrix of users' factors (the latter is not used here). Item-item similarity computations and a K-nearest-neighbors search on the matrix of items' factors produce a similarity matrix of the items, which serves as the model of a KNN-based recommender system. The quality of this similarity matrix can then be evaluated via RMSE, %COMP, AMI...
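The Pearson-on-item-factors step can be sketched as follows; this is a minimal illustration of the idea, not the deck's implementation.

```python
import math

def pearson(x, y):
    """Pearson correlation between two items' factor vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def k_nearest_items(item, factors, k):
    """Build the KNN entry of `item` from the matrix of items' factors."""
    scored = [(pearson(factors[item], factors[j]), j) for j in factors if j != item]
    return [j for _s, j in sorted(scored, reverse=True)[:k]]
```

Running `k_nearest_items` over every item yields the similarity matrix that the KNN-based recommender then uses as its model.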
17.
Finding 1: different performances according to the segments. We observe a decrease in performance of more than 25% between the heavy-user/popular-item segment and the light-user/unpopular-item segment.
[Figures: RMSE for Gravity on Netflix (vs. number of factors) and RMSE for KNN on Netflix (vs. number of neighbors), plotted for the 4 segments analyzed (heavy/light users x popular/unpopular items), plus the global RMSE and the default predictor.]
18.
Finding 2: RMSE not strictly linked to the other performances. Example on 2 segments: the light-user/popular-item segment is easier to optimize than the light-user/unpopular-item segment for RMSE, but is as difficult to optimize as the light-user/unpopular-item segment for ranking.
[Figures: RMSE for Gravity on Netflix and ranking compatibility (%COMP) for Gravity on Netflix, as functions of the number of factors, for the four segments, plus the default predictor.]
19.
Finding 2 (continued): RMSE not strictly linked to the other performances. Globally, Gravity is better than KNN for RMSE, but is worse than KNN for the Average Measure of Impact.
[Figures: RMSE for KNN on Netflix (vs. number of neighbors) and RMSE for Gravity on Netflix (vs. number of factors) per segment; bar chart of the Average Measure of Impact on Netflix for the random predictor, the default predictor, KNN (K=100) and Gravity (F=32).]
20.
Global results for Help to Decide / Compare / Discover: Gravity dominates for the RMSE measure; KNN dominates on the heavy-user segments; the default predictor is very useful on the unpopular (i.e. infrequent) item segments.
21.
Comparing native similarities with Gravity-based similarities. Similarities are measured by applying a Pearson similarity on the items' factors given by Gravity (16 factors):
1. KNN item-item can be performed on a factorized matrix with little performance loss (and faster!).
2. Gravity can be used for the "Help to Explore" function.

                                    Native KNN (K=100)   KNN on Gravity's item factors (K=100, 16 factors)
RMSE                                0.8440               0.8691
Ranking: % compatible               77.03%               75.67%
Precision                           91.90%               86.39%
AMI                                 2.043                2.025
Global time of the modeling task    5290 seconds         3758 seconds
23.
Conclusion: contributions. As industrial recommendation is multi-faceted:
- We proposed to list the key functions of recommendation: Help to Decide, Help to Compare, Help to Discover, Help to Explore. Note for Help to Explore: the similarity feature is mandatory for a recommender system.
- We proposed to define a dual segmentation of items and users: just being very accurate on big users and blockbuster items is not very useful.
For a new offline protocol to evaluate recommender systems:
- We proposed to cross the recommender's key functions with the dual segmentation, mapping key functions to measures, adding the Impact measure to evaluate the "Help to Discover" function and a method to evaluate the "Help to Explore" function.
- We demonstrated its utility: RMSE (Decide) is not strictly linked to the quality of the other functions (Compare, Discover, Explore), so it is very dangerous to evaluate a recommender system only with RMSE (no guarantee on the other measures!). The mapping of the best algorithm for each couple (function, segment) could be exploited to improve the global performances. We also saw empirically that the KNN approach can be virtualized, computing the similarities between items on a factorized space built, for instance, by Gravity.
24.
Future work: 3 main axes
1. Evaluation of the quality of the 4 core functions using an online A/B testing protocol
2. Hybrid switch system: the best algorithm for the adapted task according to the user-item segment
3. KNN virtualization via matrix factorization
26.
About this work:
- Frank Meyer: Recommender Systems in Industrial Contexts. CoRR abs/1203.4487 (2012).
- Frank Meyer, Françoise Fessant, Fabrice Clérot, Eric Gaussier: Toward a New Protocol to Evaluate Recommender Systems. Workshop on Recommendation Utility Evaluation, RecSys 2012, Dublin.
- Frank Meyer, Françoise Fessant: Reperio: A Generic and Flexible Industrial Recommender System. Web Intelligence 2011: 502-505, Lyon.
27.
Classic mathematical representation of the recommendation problem: a ratings matrix with thousands of items as rows (i1, ..., ik, ..., im) and thousands of users as columns (u1, u2, ..., ul, ..., un). Some cells contain known ratings of interest (e.g. values from 1 to 5); the remaining cells ("?") are ratings of interest to predict.
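Since most cells are "?", such a matrix is in practice stored sparsely, keeping only the known ratings; a minimal sketch (toy values, names ours):

```python
# Sparse representation of the ratings matrix: only known ratings are stored;
# every absent (user, item) pair is a "?" to predict.
ratings = {
    ("u1", "i1"): 4, ("u2", "i1"): 2, ("u1", "i2"): 5,  # toy values
}

def known(user, item):
    """True if the rating of `user` for `item` is observed, False if it is a '?'."""
    return (user, item) in ratings
```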
28.
Well-known industrial example: item-to-items recommendation (Amazon™).
29.
Multi-faceted analysis: measures.
- RMSE: compares the predicted rating with the real rating, over the number of logs in the test set.
- NDPM / % compatible: on a same dataset and a same user, compares the number of compatible orders with the number of contradictory orders, over the strict orders directly usable given by the user.
- Precision: over the number of recommended items actually evaluable in the test set.
- AMI: Average Measure of Impact.
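The first two measures are easy to make concrete; this sketch implements RMSE and a pairwise compatible-orders ratio in the spirit of the slide (the exact tie handling in the deck's %COMP is an assumption).

```python
import math

def rmse(pairs):
    """RMSE between (predicted, real) rating pairs over the test-set logs."""
    return math.sqrt(sum((p - r) ** 2 for p, r in pairs) / len(pairs))

def percent_compatible(user_ratings, predicted_scores):
    """Share of the user's strict order pairs that the model reproduces."""
    compatible = contradictory = 0
    items = list(user_ratings)
    for a in items:
        for b in items:
            if user_ratings[a] > user_ratings[b]:  # strict order given by the user
                if predicted_scores[a] > predicted_scores[b]:
                    compatible += 1
                elif predicted_scores[a] < predicted_scores[b]:
                    contradictory += 1
    total = compatible + contradictory
    return compatible / total if total else 1.0
```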
30.
Comparing native similarities with Gravity-based similarities. Similarities are measured by applying a Pearson similarity on the items' factors given by Gravity (16 factors): Gravity can be used for the "Help to Explore" function, and KNN item-item can be performed on a factorized matrix with little performance loss!
31.
Reperio C-V5: centralized mode, example of a movie recommender.
32.
Reperio E-V2: embedded mode, example of a TV program recommender.