A Top-N Recommender System Evaluation Protocol Inspired by Deployed Systems
The evaluation of recommender systems is crucial for their development. In today's recommendation landscape there are many standardized recommendation algorithms and approaches; however, there exists no standardized method for the experimental setup of evaluation -- not even for widely used measures such as precision and root-mean-square error. This creates a setting where the comparison of recommendation results on the same datasets becomes problematic. In this paper, we propose an evaluation protocol specifically developed with the recommendation use case in mind, i.e. the recommendation of one or several items to an end user. The protocol attempts to closely mimic the scenario of a deployed (production) recommendation system, taking specific user aspects into consideration and allowing a comparison of small- and large-scale recommendation systems. The protocol is evaluated on common recommendation datasets and compared to traditional recommendation settings found in the research literature. Our results show that the proposed model can better capture the quality of a recommender system than traditional evaluation does, and is not affected by characteristics of the data (e.g. size, sparsity, etc.).

Presentation Transcript

  • A Top-N Recommender System Evaluation Protocol Inspired by Deployed Systems
    Alan Said, Alejandro Bellogín, Arjen De Vries
    CWI
    @alansaid, @abellogin, @arjenpdevries
  • Outline
    – Evaluation
      – Real world
      – Offline
    – Not an algorithmic comparison!
      – A comparison of evaluation
    – Protocol
    – Experiments & Results
    – Conclusions
  • EVALUATION
  • Evaluation
    – Does p@10 in [Smith, 2010a] measure the same quality as p@10 in [Smith, 2012b]?
      – Even if it does:
        – is the underlying data the same?
        – was cross-validation performed similarly?
        – etc.
  • Evaluation
    – What metrics should we use?
    – How should we evaluate?
      – Relevance criteria for test items
      – Cross-validation (n-fold, random)
    – Should all users and items be treated the same way?
      – Do certain users and items reflect different evaluation qualities?
  • Offline Evaluation
    Recommender system accuracy evaluation is currently based on methods from IR/ML:
      – One training set
      – One test set
      – (One validation set)
      – Algorithms are trained on the training set
      – Evaluation uses metric@N (e.g. p@N, where N is a page size)
        – Even when N is larger than the number of test items
        – p@N = 1.0 is then (almost) impossible
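A minimal Python sketch of precision@N (not from the slides; the item names and data are illustrative), showing why p@N = 1.0 is out of reach whenever a user has fewer than N relevant items in the test set:

    def precision_at_n(recommended, relevant, n):
        """Fraction of the top-n recommended items that are relevant."""
        top_n = recommended[:n]
        hits = sum(1 for item in top_n if item in relevant)
        return hits / n  # always divides by n, not by the number of relevant items

    # A user with only 3 relevant test items can never reach p@10 = 1.0:
    recommended = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
    relevant = {"a", "b", "c"}
    print(precision_at_n(recommended, relevant, 10))  # at most 0.3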
  • Evaluation in production
    – One dynamic training set
      – All of the available data at a certain point in time
      – Continuously updated
    – No test set
      – Only live user interactions
      – Clicked/purchased items are good recommendations
    Can we simulate this offline?
  • Evaluation Protocol
    – Based on "real world" concepts
    – Uses as much available data as possible
    – Trains algorithms once per user and evaluation setting (e.g. N)
    – Evaluates p@N when there are exactly N correct items in the test set
      – possible p@N = 1 (gold standard)
  • Evaluation Protocol
    Three concepts:
    1. Personalized training & test sets
       – Use all available information about the system for the candidate user
       – Different test/training sets for different levels of N
    2. Candidate item selection (items in test sets)
       – Only "good" items go in test sets (no random 80%-20% splits)
       – How "good" an item is is based on each user's personal preference
    3. Candidate user selection (users in test sets)
       – Candidate users must have items in the training set
       – When evaluating p@N, each user in the test set should have N items in the test set
         – Effectively, precision becomes R-precision
    Train each algorithm once for each user in the test set and once for each N.
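A minimal Python sketch of the per-user logic behind these three concepts (an assumption of how it could look, not the authors' implementation): a personalized split is built for one candidate user and one level N, candidate items are the user's ratings above a personal relevance threshold, and precision over exactly N test items reduces to R-precision. The rating data structure and the mean-plus-standard-deviation threshold are illustrative choices.

    import statistics

    def personalized_split(ratings, user, n):
        """Personalized training/test split for one candidate user and one level N.

        ratings: dict mapping user -> {item: rating} (illustrative format).
        Returns (train, test_items), or None if the user has fewer than n items
        above their personal relevance threshold and is not a candidate at this n.
        """
        user_ratings = ratings[user]
        values = list(user_ratings.values())
        # Assumed threshold: the user's mean rating plus one standard deviation
        threshold = statistics.mean(values) + statistics.pstdev(values)

        good_items = [i for i, r in user_ratings.items() if r >= threshold]
        if len(good_items) < n:
            return None

        test_items = set(good_items[:n])  # exactly n "good" items in the test set
        # Training data: everything in the system except the candidate user's test items
        train = {u: {i: r for i, r in items.items()
                     if not (u == user and i in test_items)}
                 for u, items in ratings.items()}
        return train, test_items

    def r_precision(recommended, test_items):
        """With exactly n test items, p@n equals R-precision."""
        n = len(test_items)
        return sum(1 for item in recommended[:n] if item in test_items) / n

The recommender would then be retrained on train once per (user, N) combination, matching the "train each algorithm once for each user and once for each N" step.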
  • Evaluation Protocol
    [protocol overview diagram]
  • EXPERIMENTS
  • Experiments
    – Datasets:
      – MovieLens 100k
        – Minimum 20 ratings per user
        – 943 users
        – 6.43% density
        – Not realistic
      – MovieLens 1M sample
        – 100k ratings
        – 1000 users
        – 3.0% density
      [plots: number of users vs. number of ratings for each dataset]
    – Algorithms:
      – SVD
      – User-based CF (kNN)
      – Item-based CF
  • Experimental Settings
    – According to the proposed protocol:
      – Evaluate R-precision for N = [1, 5, 10, 20, 50, 100]
      – Users evaluated at N must have at least N items rated above the relevance threshold (RT)
        – RT depends on the user's mean rating and standard deviation
      – Number of runs: |N| * |users|
    – Baseline:
      – Evaluate p@N for N = [1, 5, 10, 20, 50, 100]
      – 80%-20% training-test split
        – Items in the test set rated at least 3
      – Number of runs: 1
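A small sketch contrasting the number of evaluation runs in the two settings (the function names, the rating-tuple format, and the seed are illustrative assumptions): the proposed protocol schedules one run per (user, N) pair, while the baseline uses a single random 80%-20% split with test items rated at least 3 counted as relevant.

    import random

    N_LEVELS = [1, 5, 10, 20, 50, 100]

    def protocol_runs(users):
        """Proposed protocol: one run per (user, N) pair, i.e. |N| * |users| runs."""
        return [(user, n) for user in users for n in N_LEVELS]

    def baseline_split(ratings, test_fraction=0.2, seed=42):
        """Baseline: a single random 80%-20% split over (user, item, rating) tuples."""
        rng = random.Random(seed)
        shuffled = list(ratings)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * (1 - test_fraction))
        return shuffled[:cut], shuffled[cut:]

    def relevant_test_items(test, threshold=3):
        """Baseline relevance criterion: test items rated at least 3."""
        return {(user, item) for (user, item, rating) in test if rating >= threshold}

    users = ["u%d" % i for i in range(943)]  # e.g. MovieLens 100k
    print(len(protocol_runs(users)))         # 6 * 943 = 5658 runs for the protocol
    # versus a single training run for the baseline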
  • Results
    [plot: User-based CF, ML1M sample]
  • Results
    [plots: User-based CF (ML1M sample), User-based CF (ML100k), SVD (ML1M sample), SVD (ML1M sample)]
  • Results
    What about time?
      – |N| * |users| runs vs. 1?
      – Trade-off between a realistic evaluation and complexity?
  • Conclusions
    – We can emulate a realistic production scenario by creating personalized training/test sets and evaluating them for each candidate user separately
    – We can see how well a recommender performs at different levels of recall (page size)
    – We can compare against a gold standard
    – We can reduce evaluation time
  • Questions?
    – Thanks!
    – Also, check out:
      – ACM TIST Special Issue on RecSys Benchmarking: bit.ly/RecSysBe
      – The ACM RecSys Wiki: www.recsyswiki.com