
It's all About the Data



Presentation about Data-driven approaches to Recommender Systems. This was an invited talk at Boston University (July 2009)



  1. It's all about the Data... Data-driven Approaches to the Recommendation Problem
     Xavier Amatriain, Telefonica Research
  2. But first... About Telefonica and Telefonica R&D
  3. Telefonica is a fast-growing Telecom
                   1989                      2000                           2008
     Geographies   Spain                     Operations in 16 countries     Operations in 25 countries
     Services      Basic telephone and       Wireline and mobile voice,     Integrated ICT solutions
                   data services             data and Internet services     for all customers
     Clients       About 12 million          About 68 million               About 260 million
                   subscribers               customers                      customers
     Staff         About 71,000              About 149,000                  About 257,000
                   professionals             professionals                  professionals
     Finances      Rev: 4,273 M€             Rev: 28,485 M€                 Rev: 57,946 M€
                   EPS(1): 0.45 €            EPS(1): 0.67 €                 EPS: 1.63 €
     (1) EPS: Earnings per share
  4. Currently among the largest in the world: Telco sector worldwide ranking by market cap (US$ bn). Source: Bloomberg, 06/12/09
  5. Leader in South America
     Total accesses (as of March '09): 159.5 million
     Accesses by country: Argentina: 20.9 million; Brazil: 61.4 million; Central America: 6.1 million; Colombia: 12.6 million; Chile: 10.1 million; Ecuador: 3.3 million; Mexico: 15.7 million; Peru: 15.2 million; Uruguay: 1.5 million; Venezuela: 12.0 million
     (The original map also showed each country's wireline and mobile market rank: 1st or 2nd in every market.)
     Notes:
     - Central America includes Guatemala, Panama, El Salvador and Nicaragua
     - The total accesses figure includes narrowband Internet accesses of Terra Brasil and Terra Colombia, and broadband Internet accesses of Terra Brasil, Telefónica de Argentina, Terra Guatemala and Terra México. Data as of March '09
  6. And a significant footprint in Europe
     Total accesses (as of March '09): 93.8 million
     Accesses by country: Spain: 47.2 million; UK: 20.8 million; Germany: 16.0 million; Ireland: 1.7 million; Czech Republic: 7.7 million; Slovakia: 0.4 million
     (The original map also showed each country's wireline and mobile market rank.)
     Data as of March '09
  7. Telefonica R&D (TID) is the Research and Development Unit of the Telefónica Group
     MISSION: "To contribute to the improvement of the Telefónica Group's competitiveness through technological innovation"
     - Founded in 1988
     - Largest private R&D center in Spain
     - More than 1100 professionals
     - Five centers in Spain and two in Latin America
     Telefónica was in 2008 the first Spanish company by R&D investment and the third in the EU:
     - Technological innovation (products / services / processes development): 4,384 M€
     - R&D: 594 M€
     - Applied research: 61 M€
  8. TID Scientific Groups: Publications, Patents, Tech Transfer
     - Pablo Rodriguez: Internet Scientific Director
     - Nuria Oliver: Multimedia Scientific Director
     - Data Mining and User Modeling: Acting Scientific Director
  9. Internet Scientific Areas
     - Content Distribution and P2P: next-generation managed P2P-TV; Future Internet: content networking; delay-tolerant bulk distribution; network transparency
     - Social Networks: information propagation; social search engines; infrastructure for social-based cloud computing
     - Wireless and Mobile Systems: wireless bundling; device-to-device content distribution; large-scale mobile data analysis
  10. Multimedia Scientific Areas
     - Multimedia Core: multimedia data analysis, search & retrieval; video, audio, image, music, text, and sensor data; understanding, summarization, visualization
     - Mobile and Ubicomp: context awareness; urban computing; mobile multimedia & search; wearable physiological monitoring
     - HCC: multimodal user interfaces; expression, gesture, and emotion recognition; personalization & recommender systems; super telepresence
  11. Data Mining & User Modeling Areas
     DATA MINING
     - Integration of statistical & knowledge-based techniques
     - Stream mining
     - Large-scale & distributed machine learning
     USER MODELING
     - Application to new services (technology for development)
     - Cognitive, socio-cultural, and contextual modeling
     - Behavioral user modeling (service-use patterns)
     SOCIAL NETWORK ANALYSIS & BUSINESS INTELLIGENCE
     - Analytical CRM
     - Trend-spotting, service propagation & churn
     - Social graph analysis (construction, dynamics)
  12. I like it... I like it not: Evaluating User Ratings Noise in Recommender Systems
      Xavier Amatriain (@xamat), Josep M. Pujol, Nuria Oliver (Telefonica Research)
  13. Recommender Systems are everywhere
      - Netflix: 2/3 of the movies rented were recommended
      - Google News: recommendations generate 38% more clickthrough
      - Amazon: 35% of sales come from recommendations
      - "We are leaving the age of Information and entering the Age of Recommendation" (Chris Anderson, The Long Tail)
  14. The Netflix Prize
      - 500K users x 17K movie titles = 100M ratings = $1M (if you "only" improve the existing system by 10%: from 0.95 to 0.85 RMSE)
      - This is what Netflix thinks a 10% improvement is worth for their business
      - 49K contestants on 40K teams from 184 countries
      - 41K valid submissions from 5K teams; 64 submissions in the "last 24 hours"
  15. But, is there a limit to RS accuracy?
      (Figure: evolution of accuracy in the Netflix Prize)
  16. The Magic Barrier
      - Magic Barrier = limit on prediction accuracy due to noise in the original data
      - Natural Noise = involuntary noise introduced by users when giving feedback
        - Due to (a) mistakes, and (b) lack of resolution in the personal rating scale (e.g., on a 1-to-5 scale, a 2 may mean the same as a 3 for some users and some items)
      - Magic Barrier >= Natural Noise Threshold: we cannot predict with less error than the resolution in the original data
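The magic-barrier argument can be illustrated with a quick simulation (all numbers here are hypothetical, not the study's): even a predictor that knows each user's true opinion exactly cannot score below the noise users inject when they rate.

```python
import math
import random

random.seed(0)

# Hypothetical setup: each user holds a "true" opinion of a movie, but the
# rating they actually enter carries natural noise (std = 0.5 here,
# left unclipped for simplicity).
true_opinions = [random.uniform(1, 5) for _ in range(10000)]
observed_ratings = [t + random.gauss(0, 0.5) for t in true_opinions]

# RMSE of a "perfect" predictor that outputs the true opinion:
rmse = math.sqrt(
    sum((t - r) ** 2 for t, r in zip(true_opinions, observed_ratings))
    / len(observed_ratings)
)
# rmse lands close to 0.5, the std of the injected noise: no algorithm can
# beat the noise level of the data it is evaluated against.
```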
  17. The Question in the Wind
  18. Our related research questions
      - Q1. Are users inconsistent when providing explicit feedback to Recommender Systems via the common rating procedure?
      - Q2. How large is the prediction error due to these inconsistencies?
      - Q3. What factors affect user inconsistencies?
  19. Experimental Setup (I)
      - Test-retest procedure: you need at least 3 trials to separate
        - Reliability: how much you can trust the instrument you are using (i.e., ratings): r = r12 * r23 / r13
        - Stability: drift in user opinion: s12 = r13/r23; s23 = r13/r12; s13 = r13^2/(r12 * r23)
      - Users rated movies in 3 trials: Trial 1 <-> 24 h <-> Trial 2 <-> 15 days <-> Trial 3
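The reliability and stability formulas can be written down directly; a minimal sketch, where the r_ij inputs are hypothetical trial-pair correlations rather than the study's actual numbers:

```python
def reliability(r12, r23, r13):
    """Test-retest reliability of the rating instrument: r = r12 * r23 / r13."""
    return r12 * r23 / r13

def stabilities(r12, r23, r13):
    """Opinion drift between trial pairs: (s12, s23, s13)."""
    s12 = r13 / r23
    s23 = r13 / r12
    s13 = r13 ** 2 / (r12 * r23)
    return s12, s23, s13

# With equal pairwise correlations there is no drift: every stability is 1,
# and the reliability equals the common correlation.
r = reliability(0.9, 0.9, 0.9)
s12, s23, s13 = stabilities(0.9, 0.9, 0.9)
```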
  20. Experimental Setup (II)
      - 100 movies selected from the Netflix dataset by stratified random sampling on popularity
      - Ratings on a 1-to-5 star scale, plus a special "not seen" symbol
      - Trials 1 and 3: random order; trial 2: ordered by popularity
      - 118 participants
  21. Results
  22. Comparison to Netflix Data
      - The distribution of the number of ratings per movie is very similar to Netflix, but the average rating is lower (users are not voluntarily choosing what to rate)
  23. Test-retest Stability and Reliability
      - Overall reliability = 0.924 (good reliabilities are expected to be > 0.9)
        - Removing mild ratings yields higher reliabilities, while removing extreme ratings yields lower ones
      - Stabilities: s12 = 0.973, s23 = 0.977, and s13 = 0.951
        - Stabilities might also be accounting for a "learning effect" (note s12 < s23)
  24. Analysis of User Inconsistencies
      - Effect of "not seen". Given a pair of consecutive trials:
        - More than 10% of items rated in one trial are not rated in the following one
        - More than 20% of items are rated in only one trial
      - RMSE due to inconsistencies:
        - Higher between R1 and R3 (same order, longer time)
        - Lower between R2 and R3 (removed "learning" effects?)
  25. Impacting Variables (I)
      - Rating Scale Effect
        - Extreme ratings are more consistent
        - 2 and 3 are the least consistent ratings
        - 34% of inconsistencies are between 2 and 3, and 25% between 3 and 4
        - 90% of inconsistencies are ±1
  26. Impacting Variables (II)
      - Item Order Effect
        - R1 is the trial with the most inconsistencies
        - R3 has fewer, but not when excluding "not seen" (the learning effect improves "not seen" discrimination)
        - R2 minimizes inconsistencies because of its order (reducing the "contrast effect")
  27. Impacting Variables (and III)
      - User Rating Speed Effect
        - Evaluation time decreases as the survey progresses in R1 and R3 (users losing attention, but also learning)
        - In R2, evaluation time decreases until users reach the segment of "popular" movies
        - Rating speed is not correlated with inconsistencies
  28. Long-term Errors and Stability
      - New trial 7 months later with a subset of the users (36 out of the 118 in the original set)
        - R1 <-> 15 days <-> R3 <-> 7 months <-> R4, all in the same random order
      - New reliability (significantly lower): r = 0.8763 (less than 0.9)
      - New stabilities (still high): s12 = 1.0025, s34 = 0.9706, and s14 = 0.9730
      - RMSE (much higher): R13 = 0.6143, R14 = 0.6822, and R34 = 0.6835 for the intersection; R13 = 0.7445, R14 = 0.8156, and R34 = 0.8014 for the union
  29. Conclusions
      - Recommender Systems (and related Collaborative Filtering applications) are becoming extremely popular
        - Large research investments go into coming up with better algorithms
        - However, understanding user feedback is often much more important for the end result
      - To lower the Magic Barrier, RS should find ways of obtaining better and less noisy feedback from users, and model user response in the algorithm
  30. So... What can we do?
  31. Different proposals
      - To deal with noise in user feedback, we have so far proposed 3 different approaches:
        1. Instead of regular users, take feedback from experts, which we expect to be less noisy (SIGIR09)
        2. Combine ensembles of datasets to identify which works better for each user (IJCAI09)
        3. Denoise user feedback by using a re-rating approach (Recsys09)
      - I will detail (1) and briefly mention the other 2
  32. The Wisdom of the Few: A Collaborative Filtering Approach Based on Expert Opinions from the Web
      Xavier Amatriain (@xamat), Josep M. Pujol, Nuria Oliver (Telefonica Research, Barcelona); Neal Lathia (UCL, London)
  33. First, a little quiz
      - Name that book... "It is really only experts who can reliably account for their reactions"
  34. Crowds are not always wise
      - Collaborative filtering is the preferred approach for Recommender Systems
        - Recommendations are drawn from your past behavior and that of similar users in the system
        - Standard CF approach:
          1. Find your neighbors among the set of other users
          2. Recommend things that your neighbors liked and you have not "seen"
      - Problem: predictions are based on a large dataset that is sparse and noisy
  35. Overview of the Approach
      - Expert = an individual we can trust to have produced thoughtful, consistent and reliable evaluations (ratings) of items in a given domain
      - Expert-based Collaborative Filtering: find neighbors from a reduced set of experts instead of regular users
        1. Identify domain experts with reliable ratings
        2. For each user, compute "expert neighbors"
        3. Compute recommendations as in standard kNN CF
  36. Advantages of the Approach
      - Noise: experts introduce less natural noise
      - Malicious ratings: the dataset can be monitored to avoid shilling
      - Data sparsity: a reduced set of domain experts can be motivated to rate items
      - Cold-start problem: experts rate items as soon as they are available
      - Scalability: the dataset is several orders of magnitude smaller
      - Privacy: recommendations can be computed locally
  37. Take-home message
      - Expert Collaborative Filtering
        - Is a new approach to recommendation, but it builds on standard CF
        - Addresses many of standard CF's shortcomings
        - At least in some conditions, users prefer it over standard CF approaches
  38. User study
  39. User Study
      - 57 participants, only 14.5 ratings/participant
      - 50% of the users consider expert-based CF to be good or very good
      - Expert-based CF: the only algorithm with an average rating over 3 (on a 0-4 scale)
  40. User Study
      - Responses to the questions "The recommendation list includes movies I like/dislike" (1-4 Likert scale)
      - Expert-based CF clearly outperforms the other methods
  41. Expert Collaborative Filtering
  42. Expert-based CF
      - Given a user u ∈ U and a similarity threshold δ, find the set of experts E' ⊆ E such that ∀e ∈ E': sim(u, e) ≥ δ
      - Confidence threshold τ = the minimum number of expert neighbors who must have rated the item in order to trust their prediction
        - Given an item i, find E'' ⊆ E' s.t. ∀e ∈ E'': r_ei ≠ unrated, and let n = |E''|:
          - if n < τ ⇒ no prediction; the user mean is returned
          - if n ≥ τ ⇒ the rating can be predicted: a similarity-weighted average of the ratings input by each expert e ∈ E''
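A minimal sketch of this prediction step, with the similarity and confidence thresholds named delta and tau, and all data invented for illustration:

```python
import numpy as np

def predict_expert_cf(expert_sims, expert_ratings, user_mean,
                      delta=0.01, tau=10):
    """Predict one item's rating from expert neighbors.

    expert_sims: similarity of the target user to each expert
    expert_ratings: each expert's rating of the item, or None if unrated
    """
    # E'': experts similar enough to the user (sim >= delta) who rated the item
    pairs = [(s, r) for s, r in zip(expert_sims, expert_ratings)
             if s >= delta and r is not None]
    if len(pairs) < tau:        # n < tau: no prediction, fall back to user mean
        return user_mean
    sims = np.array([s for s, _ in pairs])
    ratings = np.array([r for _, r in pairs])
    # n >= tau: similarity-weighted average of the expert ratings
    return float(np.dot(sims, ratings) / sims.sum())
```

With twelve equally similar experts all rating an item 4, the prediction is 4.0; with fewer than tau qualifying experts, the user's mean rating is returned instead.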
  43. Experts vs. Users Analysis
  44. Mining the Web for Expert Ratings
      - Collections of expert ratings can be obtained almost directly on the web: we crawled the Rotten Tomatoes movie-critics mash-up
        - Only those critics (169) with more than 250 ratings of movies in the Netflix dataset were used
  45. Dataset Analysis (# ratings)
      - Sparsity coefficient: 0.01 (users) vs. 0.07 (experts)
      - The average movie has 1K user ratings vs. 100 expert ratings
      - The average expert rated 400 movies; 10% rated > 1K
  46. Dataset Analysis (average)
      - Users: average movie rating ~0.55 (3.2⋆); 10% of movies ≤ 0.45 (2.8⋆), 10% ≥ 0.7 (3.8⋆)
      - Experts: average movie rating ~0.6 (3.4⋆); 10% ≤ 0.4 (2.6⋆), 10% ≥ 0.8 (4.2⋆)
      - User ratings are centered around 0.7 (3.8⋆)
      - Expert ratings are centered around 0.6 (3.4⋆), with small variability: only 10% of experts have a mean score ≤ 0.55 (3.2⋆) and another 10% ≥ 0.7 (3.8⋆)
  47. Dataset Analysis (std)
      - Users:
        - std per movie centered around 0.25 (1⋆), little variation
        - std per user centered around 0.25, larger variability
      - Experts:
        - lower std per movie (0.15), with larger variation
        - average std per expert = 0.2, small variability
  48. Dataset Analysis: Summary
      - Experts...
        - are much less sparse
        - rate movies all over the rating scale instead of being biased towards rating only "good" movies (different incentives)
        - but seem to consistently agree on the good movies
        - have a lower overall standard deviation per movie: they tend to agree more than regular users
        - tend to deviate less from their personal average rating
  49. Experimental Results
  50. Evaluation Procedure
      - Use the 169 experts to predict ratings for 10,000 users sampled from the Netflix dataset
      - Prediction MAE using an 80-20 holdout procedure (5-fold cross-validation)
      - Top-N precision by classifying items as "recommendable" given a threshold
      - Still, take the results with a grain of salt... we have a user study backing up the approach
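The two metrics can be sketched in a few lines; the inputs here are invented (the study used the sampled Netflix ratings), and the "recommendable" definition is the assumed one from the slide:

```python
def mae(predictions, truths):
    """Mean absolute error between predicted and actual ratings."""
    return sum(abs(p - t) for p, t in zip(predictions, truths)) / len(truths)

def top_n_precision(recommended, true_ratings, threshold=4):
    """Fraction of recommended items that are actually 'recommendable',
    i.e. whose true rating reaches the threshold."""
    hits = sum(1 for item in recommended if true_ratings[item] >= threshold)
    return hits / len(recommended)

print(mae([4.0, 3.0, 5.0], [5.0, 3.0, 4.5]))          # 0.5
print(top_n_precision(["a", "b"], {"a": 5, "b": 3}))  # 0.5
```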
  51. Results: Prediction MAE
      - Setting our parameters to τ = 10 and δ = 0.01, we obtain a MAE of 0.781 and a coverage of 97.7%
        - Expert-CF yields a significant accuracy improvement with respect to using the experts' average
        - Accuracy is worse than standard CF (but with better coverage)
  52. Role of Thresholds
      - MAE is inversely proportional to the similarity threshold (δ) up to the 0.06 mark, where it starts to increase as we move to higher δ values
        - Below 0.0 it degrades rapidly: too many experts
      - Coverage decreases as we increase δ
        - At the optimal MAE point of 0.06, coverage is still above 70%
      - MAE as a function of the confidence threshold (τ), for δ = 0.0 and δ = 0.01 (optimal around 9)
  53. Comparison to standard CF
      - Standard NN CF has a MAE around 10% lower, but its coverage is also 10% lower
      - Expert-CF only works worse for the 10% of users with the lowest MAE
  54. Results 2: Top-N Precision
      - Precision of the top-N recommendations as a function of the "recommendable" threshold
      - For a threshold of 4, NN-CF outperforms expert-based CF, but if we lower it to 3 they are almost equal
  55. Conclusions
      - A different approach to the recommendation problem
      - At least in some conditions, users prefer recommendations from similar experts over those from similar users
      - Expert-based CF has the potential to address many of standard CF's shortcomings
  56. Future/Current Work
      - We are currently exploring its performance in other domains and implementing a distributed expert-based CF application (work with Jae-Wook Ahn, University of Pittsburgh)
  57. Adaptive Data Sources: Collaborative Filtering With Adaptive Information Sources (ITWP @ IJCAI)
      With Neal Lathia, UCL (London)
  58. Adaptive data sources
      - User modeling: experts? friends? like-minded?
      - Similarity, trust, reputation
  59. Adaptive Data Sources
      - Given: a simple, un-tuned kNN predictor and multiple information sources
      - A problem: users are subjective; accuracy varies with the source
      - A promise: optimal classification of users to their best source produces incredibly accurate predictions
  60. Adaptive Data Sources
      - A question: how do we classify users to a source set?
  61. Rate it Again: Increasing Recommendation Accuracy by User re-Rating (Recsys 09, New York, October 2009)
  62. Rate it again
      - By asking users to rate items again we can remove noise from the dataset
        - Improvements of up to 14% in accuracy!
      - Because we don't want all users to re-rate all items, we design ways to do partial denoising:
        - Data-dependent: only denoise extreme ratings
        - User-dependent: detect "noisy" users
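A sketch of the data-dependent variant: pick only the extreme ratings for a second pass, then let the re-rating replace the original. Both the cut-offs and the "keep the re-rating" policy are illustrative assumptions here, not the paper's exact rules:

```python
def select_for_rerating(ratings, low=1, high=5):
    """Data-dependent partial denoising: only extreme ratings
    (cut-offs assumed for illustration) are sent back to the user."""
    return [key for key, r in ratings.items() if r <= low or r >= high]

def denoise(ratings, reratings):
    """Replace each re-rated value with the second rating
    (one simple policy among several possible)."""
    cleaned = dict(ratings)
    cleaned.update(reratings)
    return cleaned

original = {("u1", "m1"): 5, ("u1", "m2"): 3, ("u2", "m1"): 1}
to_rerate = select_for_rerating(original)   # the two extreme ratings
cleaned = denoise(original, {("u1", "m1"): 4})
```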
  63. Conclusions
      - For many applications such as Recommender Systems (but also search, advertising, and even networks), understanding data and users is vital
      - Algorithms can only be as good as the data they use as input
      - The use of user-mining techniques is going to be a growing trend in many areas in the coming years
  64. The Wisdom of the Few
      Thanks! Questions?