
User-driven Approaches to Recsys


Presentation summarizing some of our latest research in Recommender Systems


  1. It's all about the User... User-driven Approaches to the Recommendation Problem. Xavier Amatriain, Telefonica Research
  2. But first...
  3. About me: up until 2005
  4. About me: 2005–2007
  5. About me: 2007–...
  6. But first... About Telefonica and Telefonica R&D
  7. Telefonica is a fast-growing Telecom:
     - 1989: about 12 million subscribers; basic telephone and data services; operations in Spain; about 71,000 professionals; revenue 4,273 M€, EPS 0.45 €
     - 2000: about 68 million customers; wireline and mobile voice, data and Internet services; operations in 16 countries; about 149,000 professionals; revenue 28,485 M€, EPS 0.67 €
     - 2008: about 260 million customers; integrated ICT solutions for all customers; operations in 25 countries; about 257,000 professionals; revenue 57,946 M€, EPS 1.63 €
     (EPS: earnings per share)
  8. Currently among the largest in the world: Telco sector worldwide ranking by market cap (US$ bn). Source: Bloomberg, 06/12/09
  9. Telefonica R&D (TID) is the Research and Development Unit of the Telefónica Group. Mission: “To contribute to the improvement of the Telefónica Group's competitiveness through technological innovation.”
     - Founded in 1988; largest private R&D center in Spain
     - More than 1,100 professionals; five centers in Spain and two in Latin America
     - Telefónica was in 2008 the first Spanish company by R&D investment and the third in the EU: applied research 61 M€, R&D 594 M€, products/services/processes development and technological innovation 4,384 M€
  10. Internet Scientific Areas: Content Distribution and P2P; Wireless and Mobile Systems; Social Networks; Future Internet. Topics include next-generation managed P2P-TV, content distribution networks, delay-tolerant bulk distribution, wireless bundling, device-to-device content distribution, mobile-data-based cloud computing, information propagation, social search engines, infrastructure for large-scale social analysis, and network transparency.
  11. Multimedia Scientific Areas: Multimedia Core (multimedia data analysis, search & retrieval over video, audio, image, music, text, and sensor data; understanding, summarization, visualization); Mobile and Ubicomp (context awareness, urban computing, mobile multimedia & search, wearable systems, physiological monitoring); HCC (multimodal user interfaces; expression, gesture, and emotion recognition; personalization & recommendation; super telepresence).
  12. Data Mining & User Modeling Areas:
     - Social Network Analysis & Business Intelligence: analytical CRM; trend-spotting, service propagation & churn; social graph analysis (construction, dynamics); application to new services (technology for development)
     - User Modeling: cognitive, socio-cultural, and contextual modeling; behavioral user modeling (service-use patterns)
     - Data Mining: integration of statistical & knowledge-based techniques; stream mining; large-scale & distributed machine learning
  13. Index: now seriously, this is where the index should go!
  14. Introduction: What are Recommender Systems?
  15. The Age of Search has come to an end... long live the Age of Recommendation! Chris Anderson in “The Long Tail”: “We are leaving the age of information and entering the age of recommendation.” CNN Money, “The race to create a 'smart' Google”: “The Web, they say, is leaving the era of search and entering one of discovery. What's the difference? Search is what you do when you're looking for something. Discovery is when something wonderful that you didn't know existed, or didn't know how to ask for, finds you.”
  16. Information overload: “People read around 10 MB worth of material a day, hear 400 MB a day, and see one MB of information every second” (The Economist, November 2006)
  17. The value of recommendations. Netflix: 2/3 of the movies rented are recommended. Google News: recommendations generate 38% more clickthrough. Amazon: 35% of sales come from recommendations. Choicestream: 28% of people would buy more music if they found what they liked.
  18. The “Recommender problem”: estimate a utility function that can automatically predict how much a user will like an item that is unknown to her, based on past behavior, relations to other users, item similarity, context, ...
  19. The “Recommender problem”, formally: let C be the set of all users and S the set of all items that can be recommended (e.g. books, movies, or restaurants). Let u be a utility function that measures the usefulness of item s to user c, i.e. u : C × S → R, where R is a totally ordered set. Then, for each user c ∈ C, we want to choose the item s′ ∈ S that maximizes u. The utility of an item is usually represented by a rating, but it can also be an arbitrary function, including a profit function.
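The formal statement above boils down to an argmax over estimated utilities. A minimal sketch (the utility table here is made-up toy data, not a learned model):

```python
def recommend(utilities, user):
    """Return the item s' maximizing u(user, s) over all items s."""
    items = utilities[user]
    return max(items, key=items.get)

# Toy utility function u : C x S -> R as a nested dict.
u = {
    "alice": {"book": 4.2, "movie": 3.1, "restaurant": 4.8},
    "bob":   {"book": 2.0, "movie": 4.5, "restaurant": 3.3},
}
print(recommend(u, "alice"))  # -> restaurant
```

In practice the hard part is estimating u for the (user, item) pairs that have never been observed; the slides that follow are about where those estimates come from.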
  20. Approaches to Recommendation:
     - Collaborative Filtering: recommend items based only on users' past behavior. User-based: find users similar to me and recommend what they liked. Item-based: find items similar to those I have previously liked.
     - Content-based: recommend based on features inherent to the items
     - Social recommendations (trust-based)
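As a rough illustration of the item-based variant listed above (a sketch over toy ratings, not the exact method used in the talk): items are compared via cosine similarity over the users who rated both, and a prediction is a similarity-weighted average of the target user's own past ratings.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two items, each a dict user -> rating."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    num = sum(a[u] * b[u] for u in common)
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den

def predict_item_based(ratings, user, target):
    """ratings: dict item -> {user: rating}. Predict user's rating of target."""
    num = den = 0.0
    for item, by_user in ratings.items():
        if item == target or user not in by_user:
            continue
        s = cosine(ratings[target], by_user)
        num += s * by_user[user]
        den += abs(s)
    return num / den if den else 0.0

# Toy data: three movies rated by three users on a 1-5 scale.
ratings = {
    "matrix":    {"u1": 5, "u2": 4, "u3": 1},
    "inception": {"u1": 5, "u2": 5},
    "titanic":   {"u2": 2, "u3": 5},
}
p = predict_item_based(ratings, "u3", "inception")
```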
  21. Recommendation Techniques
  22. The Netflix Prize: 500K users × 17K movie titles = 100M ratings = $1M (if you “only” improve the existing system by 10%, from 0.95 to 0.85 RMSE!). 49K contestants on 40K teams from 184 countries; 41K valid submissions from 5K teams; 64 submissions per day. The winning approach uses hundreds of predictors from several teams. Is this general? Why did it take so long?
  23. What works: it depends on the domain and the particular problem. However, in the general case it has been shown that (currently) the best isolated approach is CF. Item-based is in general more efficient and better, but mixing CF approaches can improve results. Other approaches can be hybridized to improve results in specific cases (the cold-start problem...). What matters: data preprocessing (outlier removal, denoising, removal of global effects such as each user's average); “smart” dimensionality reduction using matrix factorization such as SVD; combining classifiers.
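The preprocessing-plus-SVD recipe above can be sketched on a toy 4×3 rating matrix (illustrative only; this is the shape of the pipeline, not a Netflix-grade implementation): subtract each user's mean rating (a "global effect"), then keep only the top-k singular vectors as a low-rank approximation.

```python
import numpy as np

# Toy ratings; 0 marks a missing rating.
R = np.array([[5, 3, 0],
              [4, 0, 4],
              [1, 1, 5],
              [2, 0, 4]], dtype=float)
mask = R > 0

# Remove the per-user average (computed over observed ratings only).
user_mean = (R.sum(axis=1) / mask.sum(axis=1))[:, None]
X = np.where(mask, R - user_mean, 0.0)

# Low-rank reconstruction with the top-k singular triplets.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
X_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
R_hat = X_hat + user_mean  # add the user means back to get predictions
```

Predictions for the missing cells are then read directly out of `R_hat`.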
  24. I like it... I like it not: Evaluating User Ratings Noise in Recommender Systems. Xavier Amatriain (@xamat), Josep M. Pujol, Nuria Oliver. Telefonica Research
  25. The Recommender Problem: two ways to address it. 1. Improve the algorithm.
  26. The Recommender Problem: two ways to address it. 2. Improve the input data. Time for data cleaning!
  27. User feedback is noisy
  28. Natural noise limits our user model (“DID YOU HEAR WHAT I LIKE??!!”)... and our prediction accuracy
  29. The Magic Barrier. Magic barrier = limit on prediction accuracy due to noise in the original data. Natural noise = involuntary noise introduced by users when giving feedback, due to (a) mistakes and (b) lack of resolution in the personal rating scale (e.g. on a 1-to-5 scale, a 2 may mean the same as a 3 for some users and some items). Magic barrier ≥ natural noise threshold: we cannot predict with less error than the resolution in the original data.
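One way to see the barrier concretely (toy numbers, not the study's data): compare a user's own ratings of the same items across two sessions. The RMSE between the two sessions is disagreement with oneself, a floor that no algorithm can predict away.

```python
from math import sqrt

def rmse(a, b):
    """Root mean squared error between two equal-length rating lists."""
    assert len(a) == len(b) and a
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

# Same user, same five movies, rated in two separate sessions.
trial_1 = [4, 2, 5, 3, 1]
trial_2 = [4, 3, 5, 2, 1]
noise_floor = rmse(trial_1, trial_2)  # -> sqrt(0.4), about 0.632
```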
  30. Our related research questions. Q1: Are users inconsistent when providing explicit feedback to recommender systems via the common rating procedure? Q2: How large is the prediction error due to these inconsistencies? Q3: What factors affect user inconsistencies?
  31. Experimental Setup (I). Test-retest procedure: you need at least 3 trials to separate reliability from stability. Reliability (how much you can trust the instrument you are using, i.e. the ratings): r = r12·r23/r13. Stability (drift in user opinion): s12 = r13/r23; s23 = r13/r12; s13 = r13²/(r12·r23). Users rated movies in 3 trials: Trial 1 <-> 24 h <-> Trial 2 <-> 15 days <-> Trial 3.
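The test-retest formulas above follow directly from the three pairwise correlations between trials. A small sketch with illustrative correlation values (not the study's):

```python
def reliability(r12, r23, r13):
    """Test-retest reliability from the three pairwise correlations."""
    return r12 * r23 / r13

def stabilities(r12, r23, r13):
    """Opinion drift between each pair of trials."""
    return r13 / r23, r13 / r12, r13 ** 2 / (r12 * r23)

# Toy pairwise correlations between trials 1-2, 2-3, and 1-3.
r12, r23, r13 = 0.90, 0.92, 0.88
r = reliability(r12, r23, r13)
s12, s23, s13 = stabilities(r12, r23, r13)
```

Note that s13 = s12·s23 by construction: drift over the whole period is the product of the drifts over the two sub-periods.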
  32. Experimental Setup (II). 100 movies selected from the Netflix dataset by stratified random sampling on popularity. Ratings on a 1-to-5 star scale, plus a special “not seen” symbol. Trials 1 and 3 in random order; trial 2 ordered by popularity. 118 participants.
  33. Results
  34. Comparison to Netflix data: the distribution of the number of ratings per movie is very similar to Netflix, but the average rating is lower (users are not voluntarily choosing what to rate).
  35. Test-retest reliability and stability. Overall reliability = 0.924 (good reliabilities are expected to be > 0.9). Removing mild ratings yields higher reliabilities, while removing extreme ratings yields lower ones. Stabilities: s12 = 0.973, s23 = 0.977, and s13 = 0.951. Stabilities might also be accounting for a “learning effect” (note s12 < s23).
  36. Users are inconsistent: the probability of an inconsistency given an original rating.
  37. Users are inconsistent: the percentage of inconsistencies given an original rating. Mild ratings are noisier.
  38. Users are inconsistent: the percentage of inconsistencies given an original rating. Negative ratings are noisier.
  39. Prediction accuracy: pairwise RMSE between trials, over both the intersection (∩) and the union (∪) of the two rating sets.
      Trials   #Ti    #Tj    #∩     #∪     RMSE∩   RMSE∪
      T1, T2   2185   1961   1838   2308   0.573   0.707
      T1, T3   2185   1909   1774   2320   0.637   0.765
      T2, T3   1969   1909   1730   2140   0.557   0.694
  40. Prediction accuracy (cont.): the maximum error is between the trials that are most distant in time (T1, T3).
  41. Prediction accuracy (cont.): significantly less error whenever the 2nd trial is involved.
  42. Algorithm robustness to natural noise: RMSE for different recommendation algorithms when predicting each of the trials.
      Alg./Trial       T1       T2       T3       Tworst/Tbest
      User Average     1.2011   1.1469   1.1945   4.7%
      Item Average     1.0555   1.0361   1.0776   4%
      User-based kNN   0.9990   0.9640   1.0171   5.5%
      Item-based kNN   1.0429   1.0031   1.0417   4%
      SVD              1.0244   0.9861   1.0285   4.3%
  43. Algorithm robustness to natural noise (cont.): trial 2 is consistently the least noisy.
  44. Algorithm robustness to NN (2): RMSE for different recommendation algorithms when predicting ratings in one trial (testing) from ratings in another (training).
      Training-Testing   T1-T2    T1-T3    T2-T3
      User Average       1.1585   1.2095   1.2036
      Movie Average      1.0305   1.0648   1.0637
      User-based kNN     0.9693   1.0143   1.0184
      Item-based kNN     1.0009   1.0406   1.0590
      SVD                0.9741   1.0491   1.0118
  45. Algorithm robustness to NN (2) (cont.): noise is minimized when we predict trial 2.
  46. Let's recap: users are inconsistent; inconsistencies can depend on many things, including how the items are presented; inconsistencies produce natural noise; and natural noise reduces our prediction accuracy independently of the algorithm.
  47. Item order effect: R1 is the trial with the most inconsistencies. R3 has fewer, but not when excluding “not seen” (the learning effect improves “not seen” discrimination). R2 minimizes inconsistencies because of its ordering (reducing the “contrast effect”).
  48. User rating speed effect: evaluation time decreases as the survey progresses in R1 and R3 (users losing attention, but also learning). In R2, evaluation time decreases until users reach the segment of “popular” movies. Rating speed is not correlated with inconsistencies.
  49. So... what can we do?
  50. Different proposals. To deal with noise in user feedback we have so far proposed 3 different approaches: 1. Denoise user feedback using a re-rating approach (RecSys09). 2. Instead of regular users, take feedback from experts, whom we expect to be less noisy (SIGIR09). 3. Combine ensembles of datasets to identify which works better for each user (IJCAI09).
  51. Rate it Again: Increasing Recommendation Accuracy by User Re-Rating. Xavier Amatriain (with J.M. Pujol, N. Tintarev, N. Oliver), Telefonica Research
  52. Rate it again: by asking users to rate items again we can remove noise from the dataset, with improvements of up to 14% in accuracy! Because we don't want all users to re-rate all items, we design ways to do partial denoising: data-dependent (only denoise extreme ratings) and user-dependent (detect “noisy” users).
  53. Algorithm: given a rating dataset where (some) items have been re-rated, two fairness conditions apply: 1. The algorithm should remove as few ratings as possible (i.e. only when there is some certainty that the rating is only adding noise). 2. The algorithm should not make up new ratings, but decide which of the existing ones are valid.
  54. Algorithm, one-source re-rating case: defined via a “milding” function (formula on the original slide).
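The milding function itself does not survive in this export, so as a stand-in here is a minimal rule that satisfies the two fairness conditions above (illustrative only, not the paper's exact function): keep an original rating when its re-rating agrees within one star, discard it otherwise, and never fabricate new rating values.

```python
def denoise(original, rerated, tol=1):
    """One-source denoising sketch.
    original, rerated: dicts item -> rating for one user.
    Keep an original rating only if the re-rating (when present)
    agrees within `tol` stars; never invent new values."""
    clean = {}
    for item, r in original.items():
        if item not in rerated or abs(r - rerated[item]) <= tol:
            clean[item] = r
    return clean

orig = {"a": 5, "b": 3, "c": 2}
again = {"a": 5, "b": 1}   # "b" is inconsistent by 2 stars; "c" was not re-rated
clean = denoise(orig, again)  # -> {"a": 5, "c": 2}
```

Items that were never re-rated are kept untouched, which keeps the removal rate low (condition 1), and the output contains only values the user actually entered (condition 2).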
  55. Results. One-source re-rating (Denoised⊚Denoising):
      Dataset          T1⊚T2    ΔT1     T1⊚T3    ΔT1     T2⊚T3    ΔT2
      User-based kNN   0.8861   11.3%   0.8960   10.3%   0.8984   6.8%
      SVD              0.9121   11.0%   0.9274   9.5%    0.9159   7.1%
      Two-source re-rating (denoising T1 with the other two trials):
      Dataset          T1⊚(T2, T3)   ΔT1
      User-based kNN   0.8647        13.4%
      SVD              0.8800        14.1%
  56. Denoise outliers: improvement in RMSE when doing one-source denoising, as a function of the percentage of denoised ratings and users, selecting only noisy users and extreme ratings.
  57. The Wisdom of the Few: A Collaborative Filtering Approach Based on Expert Opinions from the Web. Xavier Amatriain (@xamat), Josep M. Pujol, Nuria Oliver (Telefonica Research, Barcelona); Neal Lathia (UCL, London)
  58. Crowds are not always wise. Collaborative filtering is the preferred approach for recommender systems: recommendations are drawn from your past behavior and that of similar users in the system. Standard CF approach: find your neighbors among the set of other users, then recommend things your neighbors liked and you have not “seen”. Problem: predictions are based on a large dataset that is sparse and noisy.
  59. Overview of the approach. Expert = an individual we can trust to have produced thoughtful, consistent and reliable evaluations (ratings) of items in a given domain. Expert-based collaborative filtering finds neighbors in a reduced set of experts instead of regular users: 1. Identify domain experts with reliable ratings. 2. For each user, compute the “expert neighbors”. 3. Compute recommendations as in standard kNN CF.
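The three steps above can be sketched roughly as follows (toy data and a plain cosine similarity; the actual system uses the crawled critic ratings and a tuned kNN):

```python
from math import sqrt

def sim(u, v):
    """Cosine similarity over co-rated items (u, v: dicts item -> rating)."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    num = sum(u[i] * v[i] for i in common)
    return num / (sqrt(sum(x * x for x in u.values())) *
                  sqrt(sum(x * x for x in v.values())))

def predict_from_experts(user, experts, item, k=2):
    """kNN prediction where neighbours come from the expert set only."""
    neigh = sorted(((sim(user, e), e) for e in experts if item in e),
                   key=lambda t: t[0], reverse=True)[:k]
    den = sum(s for s, _ in neigh)
    return sum(s * e[item] for s, e in neigh) / den if den else 0.0

# Toy expert profiles and one regular user who has not seen "m3".
experts = [{"m1": 5, "m2": 4, "m3": 2},
           {"m1": 1, "m2": 2, "m3": 5}]
user = {"m1": 5, "m2": 5}
p = predict_from_experts(user, experts, "m3")
```

Because the expert set is small, this neighbourhood search is cheap enough to run on the client, which is what enables the privacy advantage mentioned on the next slide.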
  60. Advantages of the approach:
     - Noise: experts introduce less natural noise
     - Malicious ratings: the dataset can be monitored to avoid shilling
     - Data sparsity: a reduced set of domain experts can be motivated to rate items
     - Cold-start problem: experts rate items as soon as they are available
     - Scalability: the dataset is several orders of magnitude smaller
     - Privacy: recommendations can be computed locally
  61. Mining the web for expert ratings: collections of expert ratings can be obtained almost directly on the web; we crawled the Rotten Tomatoes movie-critics mash-up. Only those critics (169) with more than 250 ratings in the Netflix dataset were used.
  62. Dataset analysis, summary. Experts are much less sparse; they rate movies all over the rating scale instead of being biased towards rating only “good” movies (different incentives), yet they seem to consistently agree on the good movies; they have a lower overall standard deviation per movie (they tend to agree more than regular users); and they tend to deviate less from their personal average rating.
  63. Evaluation procedure: use the 169 experts to predict ratings from 10,000 users sampled from the Netflix dataset. Prediction MAE using an 80-20 holdout procedure (5-fold cross-validation). Top-N precision by classifying items as “recommendable” given a threshold. Results show expert CF behaves similarly to standard CF, but... we have a user study backing up the approach.
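A rough sketch of the top-N precision metric mentioned above (threshold and data are illustrative): an item counts as "recommendable" when its true rating reaches the threshold, and precision is the fraction of the N recommended items that turn out to be recommendable.

```python
def precision_at_n(predicted, actual, n=3, threshold=4):
    """predicted/actual: dicts item -> rating. Recommend the top-n
    predicted items; precision = fraction whose actual rating >= threshold."""
    top = sorted(predicted, key=predicted.get, reverse=True)[:n]
    hits = sum(1 for item in top if actual.get(item, 0) >= threshold)
    return hits / n

# Toy predictions vs. held-out true ratings.
pred  = {"a": 4.8, "b": 4.1, "c": 3.9, "d": 2.0}
truth = {"a": 5,   "b": 3,   "c": 4,   "d": 2}
score = precision_at_n(pred, truth)  # top-3 = a, b, c; hits = a, c -> 2/3
```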
  64. User study: 57 participants, with only 14.5 ratings per participant. 50% of the users consider expert-based CF to be good or very good. Expert-based CF was the only algorithm with an average rating over 3 (on a 0-4 scale).
  65. Current work: music recommendations (using ...), mobile geo-located recommendations, ...
  66. Adaptive Data Sources: Collaborative Filtering With Adaptive Information Sources (ITWP @ IJCAI). With Neal Lathia, UCL (London)
  67. Adaptive data sources: like-minded? (similarity), friends? (trust), experts? (reputation), all feeding into user modeling.
  68. Adaptive data sources. Given: a simple, un-tuned kNN predictor and multiple information sources. A problem: users are subjective, and accuracy varies with the source. A promise: optimally classifying users to their best source produces incredibly accurate predictions.
  69. Conclusions
  70. Conclusions: for many applications such as recommender systems (but also search, advertising, and even networks), understanding data and users is vital. Algorithms can only be as good as the data they use as input. The importance of user/data mining is going to be a growing trend in many areas in the coming years.
  71. Thanks! Questions? Xavier Amatriain