I like it... I like it Not

Slides from our work analyzing natural noise in user ratings for recommender systems, presented at the UMAP 2009 conference in Trento, Italy.

  1. I like it... I like it not
     - Evaluating User Ratings Noise in Recommender Systems
     - Xavier Amatriain (@xamat), Josep M. Pujol, Nuria Oliver
     - Telefonica Research
  2. Recommender Systems are everywhere
     - Netflix: 2/3 of the movies rented were recommended
     - Google News: recommendations generate 38% more clickthrough
     - Amazon: 35% of sales come from recommendations
     - "We are leaving the age of Information and entering the Age of Recommendation" - The Long Tail (Chris Anderson)
  3. The Netflix Prize
     - 500K users x 17K movie titles = 100M ratings = $1M (if you "only" improve the existing system by 10%: from 0.95 to 0.85 RMSE; see the arithmetic check below)
       - This is what Netflix thinks a 10% improvement is worth for their business
       - 49K contestants on 40K teams from 184 countries
       - 41K valid submissions from 5K teams; 64 submissions in the "last 24 hours"
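A quick check of the 10% figure (a minimal sketch; the RMSE numbers on the slide are rounded):

```python
# Relative RMSE improvement implied by the slide's (rounded) numbers.
baseline_rmse = 0.95   # existing system, as rounded on the slide
target_rmse = 0.85     # roughly the prize threshold after a 10% improvement

improvement = (baseline_rmse - target_rmse) / baseline_rmse
print(f"relative improvement: {improvement:.1%}")  # about 10.5%, i.e. the "10%" goal
```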
  4. But, is there a limit to RS accuracy?
     - Evolution of accuracy in the Netflix Prize
  5. The Magic Barrier
     - Magic Barrier = limit on prediction accuracy due to noise in the original data
     - Natural Noise = involuntary noise introduced by users when giving feedback
       - Due to (a) mistakes, and (b) lack of resolution in the personal rating scale (e.g. on a 1-to-5 scale, a 2 may mean the same as a 3 for some users and some items)
     - Magic Barrier >= Natural Noise Threshold
       - We cannot predict with less error than the resolution of the original data (a sketch of this noise floor follows below)
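One way to make the natural-noise threshold concrete is to treat the RMSE between two ratings of the same user/item pair as a rough floor on achievable prediction error. A minimal sketch with made-up repeated ratings (the data and variable names are illustrative, not from the paper):

```python
import math

# Hypothetical re-ratings: {(user, item): (rating in trial A, rating in trial B)}
repeated = {
    ("u1", "m1"): (4, 4),
    ("u1", "m2"): (2, 3),
    ("u2", "m1"): (5, 5),
    ("u2", "m3"): (3, 2),
}

# RMSE between two ratings of the same user/item pair: no algorithm can be
# expected to predict more precisely than users re-rate their own items.
squared = [(a - b) ** 2 for a, b in repeated.values()]
noise_floor = math.sqrt(sum(squared) / len(squared))
print(f"natural-noise RMSE floor ~ {noise_floor:.3f}")
```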
  6. The Question in the Wind
  7. Our related research questions
     - Q1. Are users inconsistent when providing explicit feedback to Recommender Systems via the common rating procedure?
     - Q2. How large is the prediction error due to these inconsistencies?
     - Q3. What factors affect user inconsistencies?
  8. Experimental Setup (I)
     - Test-retest procedure: you need at least 3 trials to separate
       - Reliability: how much you can trust the instrument you are using (i.e. ratings)
         - r = r12 * r23 / r13
       - Stability: drift in user opinion
         - s12 = r13 / r23; s23 = r13 / r12; s13 = r13^2 / (r12 * r23) (see the sketch below)
     - Users rated movies in 3 trials
       - Trial 1 <-> 24 h <-> Trial 2 <-> 15 days <-> Trial 3
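A minimal sketch of the test-retest computation, assuming the three trials are aligned into rating vectors over items rated in all of them (the ratings below are toy data; NumPy is only used for the pairwise Pearson correlations):

```python
import numpy as np

# Toy ratings of the same six items in the three trials, aligned by item.
t1 = np.array([4, 2, 5, 3, 1, 4])
t2 = np.array([4, 2, 5, 3, 2, 4])
t3 = np.array([4, 3, 5, 3, 2, 4])

def corr(a, b):
    """Pearson correlation between two aligned rating vectors."""
    return np.corrcoef(a, b)[0, 1]

r12, r23, r13 = corr(t1, t2), corr(t2, t3), corr(t1, t3)

# Reliability and stabilities, using the formulas from the slide.
reliability = r12 * r23 / r13
s12 = r13 / r23
s23 = r13 / r12
s13 = r13**2 / (r12 * r23)

print(f"r={reliability:.3f}  s12={s12:.3f}  s23={s23:.3f}  s13={s13:.3f}")
```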
  9. Experimental Setup (II)
     - 100 movies selected from the Netflix dataset by stratified random sampling on popularity (a sampling sketch follows below)
     - Ratings on a 1-to-5 star scale
       - Special "not seen" symbol
     - Trials 1 and 3: random order; trial 2: ordered by popularity
     - 118 participants
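A minimal sketch of stratified random sampling on popularity; the number of strata and movies per stratum are assumptions for illustration, since the paper only states that 100 movies were sampled this way:

```python
import random

def stratified_sample(movie_popularity, n_strata=10, per_stratum=10, seed=1):
    """Sample movies uniformly within equal-size popularity strata.

    movie_popularity: {movie_id: number of ratings in the source dataset}
    """
    rng = random.Random(seed)
    ranked = sorted(movie_popularity, key=movie_popularity.get)  # least to most popular
    stratum_size = len(ranked) // n_strata
    sample = []
    for i in range(n_strata):
        stratum = ranked[i * stratum_size:(i + 1) * stratum_size]
        sample.extend(rng.sample(stratum, per_stratum))
    return sample  # movies spread across the whole popularity range
```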
  10. Results
  11. Comparison to Netflix Data
     - The distribution of the number of ratings per movie is very similar to Netflix, but the average rating is lower (our users are not voluntarily choosing what to rate)
  12. Test-retest Stability and Reliability
     - Overall reliability = 0.924 (good reliabilities are expected to be > 0.9)
       - Removing mild ratings yields higher reliability, while removing extreme ratings yields lower reliability
     - Stabilities: s12 = 0.973, s23 = 0.977, and s13 = 0.951
       - Stabilities might also be accounting for a "learning effect" (note s12 < s23)
  13. Analysis of User Inconsistencies
     - Effect of "not seen". Given a pair of consecutive trials:
       - More than 10% of items rated in one trial are not rated in the following one
       - More than 20% of items are rated in only one of the two trials
     - RMSE due to inconsistencies (see the sketch below)
       - Higher between R1 and R3 (same order, longer time)
       - Lower between R2 and R3 (removed "learning" effects?)
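A minimal sketch of the pairwise-inconsistency RMSE, computed over the items that received a numeric rating (not "not seen") in both trials; the dictionaries and the NOT_SEEN marker are illustrative:

```python
import math

NOT_SEEN = None  # marker for the special "not seen" answer

def pairwise_rmse(trial_a, trial_b):
    """RMSE between two trials over items rated (numerically) in both."""
    common = [i for i, r in trial_a.items()
              if r is not NOT_SEEN and trial_b.get(i, NOT_SEEN) is not NOT_SEEN]
    errors = [(trial_a[i] - trial_b[i]) ** 2 for i in common]
    return math.sqrt(sum(errors) / len(errors))

# Toy example: one item is "not seen" in R1 but rated in R3, two ratings change.
r1 = {"m1": 4, "m2": 2, "m3": NOT_SEEN, "m4": 5}
r3 = {"m1": 5, "m2": 2, "m3": 3,        "m4": 4}
print(f"RMSE(R1, R3) = {pairwise_rmse(r1, r3):.3f}")  # computed over m1, m2, m4
```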
  14. Impacting Variables (I)
     - Rating Scale Effect
       - Extreme ratings are more consistent
       - 2 and 3 are the least consistent values
       - 34% of inconsistencies are between 2 and 3, and 25% between 3 and 4
       - 90% of inconsistencies are within ±1 star (a tallying sketch follows below)
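A minimal sketch of how the per-value breakdown of inconsistencies can be tallied from pairs of ratings given to the same item in two trials (the pairs below are toy data):

```python
from collections import Counter

# (rating in trial A, rating in trial B) for the same user/item, toy data.
rating_pairs = [(2, 3), (3, 3), (4, 4), (3, 4), (2, 2), (2, 3), (5, 5), (3, 2)]

inconsistent = [tuple(sorted(p)) for p in rating_pairs if p[0] != p[1]]
counts = Counter(inconsistent)
total = len(inconsistent)
for pair, n in counts.most_common():
    print(f"{pair}: {n / total:.0%} of inconsistencies")

# Share of inconsistencies that are off by exactly one star.
within_one = sum(1 for a, b in inconsistent if abs(a - b) == 1) / total
print(f"±1-star inconsistencies: {within_one:.0%}")
```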
  15. Impacting Variables (II)
     - Item Order Effect
       - R1 is the trial with the most inconsistencies
       - R3 has fewer, but not when excluding "not seen" (the learning effect improves "not seen" discrimination)
       - R2 minimizes inconsistencies because of its ordering (reducing the "contrast effect")
  16. Impacting Variables (and III)
     - User Rating Speed Effect
       - Evaluation time decreases as the survey progresses in R1 and R3 (users lose attention but also learn)
       - In R2, evaluation time decreases until users reach the segment of "popular" movies
       - Rating speed is not correlated with inconsistencies (a correlation check is sketched below)
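The last claim can be checked by correlating per-item rating time with a binary inconsistency flag (a point-biserial correlation, which is just Pearson's r). A minimal sketch with toy data:

```python
import numpy as np

# Per-item rating time in seconds and whether that rating was inconsistent
# across trials (toy data).
rating_time  = np.array([3.2, 1.8, 2.5, 4.0, 1.2, 2.9, 3.5, 2.0])
inconsistent = np.array([0,   1,   0,   0,   1,   1,   0,   0])

r = np.corrcoef(rating_time, inconsistent)[0, 1]
print(f"corr(rating time, inconsistency) = {r:.2f}")
```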
  17. Long-term Errors and Stability
     - New trial 7 months later with a subset of the users (36 out of the 118 in the original set)
       - R1 <-> 15 days <-> R3 <-> 7 months <-> R4: all in the same random order
     - New reliability (significantly lower): r = 0.8763 (less than 0.9)
     - New stabilities (still high): s12 = 1.0025, s34 = 0.9706, and s14 = 0.9730
     - RMSE (much higher):
       - R13 = 0.6143, R14 = 0.6822, and R34 = 0.6835 for the intersection; R13 = 0.7445, R14 = 0.8156, and R34 = 0.8014 for the union
  18. Conclusions
     - Recommender Systems (and related Collaborative Filtering applications) are becoming extremely popular
       - Large research investments go into coming up with better algorithms
       - However, understanding user feedback is often far more important for the end result
     - To lower the Magic Barrier, RS should find ways of obtaining better, less noisy feedback from users, and model user response in the algorithm
  19. I like it... I like it not
     - Thanks!
     - Questions?
