
I like it... I like it Not



Presentation of our work on analyzing natural noise in user ratings for recommender systems, given at the UMAP 2009 conference in Trento, Italy.


  1. I like it... I like it not <ul><ul><li>Evaluating User Ratings Noise in </li></ul></ul><ul><ul><li>Recommender Systems </li></ul></ul><ul><ul><li>Xavier Amatriain (@xamat), Josep M. Pujol, Nuria Oliver </li></ul></ul><ul><ul><li>Telefonica Research </li></ul></ul>
  2. Recommender Systems are everywhere <ul><li>Netflix: 2/3 of the movies rented were recommended </li></ul><ul><li>Google News: recommendations generate 38% more clickthrough </li></ul><ul><li>Amazon: 35% of sales come from recommendations </li></ul><ul><li>“We are leaving the age of Information and entering the Age of Recommendation” - The Long Tail (Chris Anderson) </li></ul>
  3. The Netflix Prize <ul><li>500K users x 17K movie titles = 100M ratings = $1M (if you “only” improve the existing system by 10%: from 0.95 to 0.85 RMSE!) </li></ul><ul><ul><li>This is what Netflix thinks a 10% improvement is worth for their business </li></ul></ul><ul><ul><li>49K contestants on 40K teams from 184 countries </li></ul></ul><ul><ul><li>41K valid submissions from 5K teams; 64 submissions in the “last 24 hours” </li></ul></ul>
  4. But is there a limit to RS accuracy? <ul><li>Evolution of accuracy in the Netflix Prize </li></ul>
  5. The Magic Barrier <ul><li>Magic Barrier = limit on prediction accuracy due to noise in the original data </li></ul><ul><li>Natural Noise = involuntary noise introduced by users when giving feedback </li></ul><ul><ul><li>Due to (a) mistakes, and (b) lack of resolution in the personal rating scale (e.g. on a 1-to-5 scale, a 2 may mean the same as a 3 for some users and some items) </li></ul></ul><ul><li>Magic Barrier >= Natural Noise Threshold </li></ul><ul><ul><li>We cannot predict with less error than the resolution of the original data </li></ul></ul>
  6. The Question in the Wind
  7. Our related research questions <ul><li>Q1. Are users inconsistent when providing explicit feedback to Recommender Systems via the common rating procedure? </li></ul><ul><li>Q2. How large is the prediction error due to these inconsistencies? </li></ul><ul><li>Q3. What factors affect user inconsistencies? </li></ul>
  8. Experimental Setup (I) <ul><li>Test-retest procedure: you need at least 3 trials to separate </li></ul><ul><ul><li>Reliability: how much you can trust the instrument you are using (i.e. ratings) </li></ul></ul><ul><ul><ul><li>r = r12·r23 / r13 </li></ul></ul></ul><ul><ul><li>Stability: drift in user opinion </li></ul></ul><ul><ul><ul><li>s12 = r13/r23; s23 = r13/r12; s13 = r13² / (r12·r23) </li></ul></ul></ul><ul><li>Users rated movies in 3 trials </li></ul><ul><ul><li>Trial 1 <-> 24 h <-> Trial 2 <-> 15 days <-> Trial 3 </li></ul></ul>
  9. Experimental Setup (II) <ul><li>100 movies selected from the Netflix dataset via stratified random sampling on popularity </li></ul><ul><li>Ratings on a 1-to-5 star scale </li></ul><ul><ul><li>Special “not seen” symbol </li></ul></ul><ul><li>Trials 1 and 3 = random order; trial 2 = ordered by popularity </li></ul><ul><li>118 participants </li></ul>
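One simple way to read “stratified random sampling on popularity” is: rank movies by rating count, cut the ranking into equal-size strata, and draw uniformly within each stratum. The sketch below is an assumption about the procedure, not the paper's exact code; stratum count and sample sizes are illustrative:

```python
import random

def stratified_sample(movies, popularity, n_strata=4, per_stratum=25, seed=42):
    # Rank movies by popularity (e.g. number of ratings), split the
    # ranking into n_strata equal-size buckets, then sample uniformly
    # within each bucket so all popularity levels are represented.
    rng = random.Random(seed)
    ranked = sorted(movies, key=lambda m: popularity[m], reverse=True)
    stratum_size = len(ranked) // n_strata
    sample = []
    for i in range(n_strata):
        stratum = ranked[i * stratum_size:(i + 1) * stratum_size]
        sample.extend(rng.sample(stratum, min(per_stratum, len(stratum))))
    return sample
```

With 4 strata of 25 movies each, this yields a 100-movie set that spans the popularity range instead of being dominated by blockbusters.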
  10. Results
  11. Comparison to Netflix Data <ul><li>The distribution of the number of ratings per movie is very similar to Netflix, but the average rating is lower (our users did not voluntarily choose what to rate) </li></ul>
  12. Test-retest Stability and Reliability <ul><li>Overall reliability = 0.924 (good reliabilities are expected to be > 0.9) </li></ul><ul><ul><li>Removing mild ratings yields higher reliabilities, while removing extreme ratings yields lower ones </li></ul></ul><ul><li>Stabilities: s12 = 0.973, s23 = 0.977, and s13 = 0.951 </li></ul><ul><ul><li>Stabilities might also be accounting for a “learning effect” (note s12 < s23) </li></ul></ul>
  13. Analysis of User Inconsistencies <ul><li>Effect of the “not seen” option. Given a pair of consecutive trials: </li></ul><ul><ul><li>More than 10% of the items rated in a trial are then not rated in the following one </li></ul></ul><ul><ul><li>More than 20% of the items are rated in only one of the two trials </li></ul></ul><ul><li>RMSE due to inconsistencies </li></ul><ul><ul><li>Higher between R1 and R3 (same order, longer time) </li></ul></ul><ul><ul><li>Lower between R2 and R3 (removed “learning” effects?) </li></ul></ul>
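The per-pair RMSE figures above can be computed over the items a user rated in both trials. A minimal sketch, assuming each trial is a dict of movie-id to rating with “not seen” items simply absent (the intersection variant; a union variant would additionally penalize rate/not-rate flips):

```python
import math

def inconsistency_rmse(trial_a, trial_b):
    # RMSE between two re-rating trials by the same user, computed
    # over the items rated in BOTH trials (the intersection).
    common = trial_a.keys() & trial_b.keys()
    if not common:
        return 0.0
    se = sum((trial_a[m] - trial_b[m]) ** 2 for m in common)
    return math.sqrt(se / len(common))
```

For example, two trials that agree on two of three shared items and differ by one star on the third give an RMSE of sqrt(1/3), roughly 0.58, already in the range of the per-pair errors reported here.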
  14. Impacting Variables (I) <ul><li>Rating Scale Effect </li></ul><ul><ul><li>Extreme ratings are more consistent </li></ul></ul><ul><ul><li>2 and 3 are the least consistent values </li></ul></ul><ul><ul><li>34% of inconsistencies are between 2 and 3, and 25% between 3 and 4 </li></ul></ul><ul><ul><li>90% of inconsistencies are ±1 star </li></ul></ul>
  15. Impacting Variables (II) <ul><li>Item Order Effect </li></ul><ul><ul><li>R1 is the trial with the most inconsistencies </li></ul></ul><ul><ul><li>R3 has fewer, but not when excluding “not seen” (the learning effect improves “not seen” discrimination) </li></ul></ul><ul><ul><li>R2 minimizes inconsistencies because of its order (reducing the “contrast effect”) </li></ul></ul>
  16. Impacting Variables (and III) <ul><li>User Rating Speed Effect </li></ul><ul><ul><li>Evaluation time decreases as the survey progresses in R1 and R3 (users losing attention, but also learning) </li></ul></ul><ul><ul><li>In R2, evaluation time decreases until users reach the segment of “popular” movies </li></ul></ul><ul><ul><li>Rating speed is not correlated with inconsistencies </li></ul></ul>
  17. Long-term Errors and Stability <ul><li>New trial 7 months later with a subset of the users (36 out of the 118 in the original set) </li></ul><ul><ul><li>R1 <-> 15 days <-> R3 <-> 7 months <-> R4: all in the same random order </li></ul></ul><ul><li>New reliability (significantly lower): r = 0.8763 (below 0.9) </li></ul><ul><li>New stabilities (still high): s12 = 1.0025, s34 = 0.9706, and s14 = 0.9730 </li></ul><ul><li>RMSE (much higher): </li></ul><ul><ul><li>R13 = 0.6143, R14 = 0.6822, and R34 = 0.6835 for the intersection; R13 = 0.7445, R14 = 0.8156, and R34 = 0.8014 for the union </li></ul></ul>
  18. Conclusions <ul><li>Recommender Systems (and related Collaborative Filtering applications) are becoming extremely popular </li></ul><ul><ul><li>Large research investments go into coming up with better algorithms </li></ul></ul><ul><ul><li>However, understanding user feedback is often far more important for the end result </li></ul></ul><ul><li>To lower the Magic Barrier, RS should find ways of obtaining better, less noisy feedback from users, and model user response in the algorithm </li></ul>
  19. I like it... I like it not <ul><ul><li>Thanks! </li></ul></ul><ul><ul><li>Questions? </li></ul></ul>