Netflix Recommendations - Beyond the 5 Stars


Published on

Talk I gave on 10/22/2012 at the ACM SF-Bay Area chapter meeting hosted at LinkedIn

Published in: Technology


  1. 1. Netflix Recommendations: Beyond the 5 Stars. ACM SF-Bay Area, October 22, 2012. Xavier Amatriain, Personalization Science and Engineering, Netflix. @xamat
  2. 2. Outline 1. The Netflix Prize & the Recommendation Problem 2. Anatomy of Netflix Personalization 3. Data & Models 4. And… a) Consumer (Data) Science b) Or Software Architectures
  3. 3. 3
  4. 4. What we were interested in: § High quality recommendations Proxy question: § Accuracy in predicted rating § Improve by 10% = $1 million! Results: • Top 2 algorithms (SVD and RBM) still in production
  5. 5. What about the final prize ensembles? § Our offline studies showed they were too computationally intensive to scale § Expected improvement not worth the engineering effort § Plus… focus had already shifted to other issues that had more impact than rating prediction.
  6. 6. Change of focus 2006 → 2012
  7. 7. Anatomy of Netflix Personalization: Everything is a Recommendation
  8. 8. Everything is personalized [Figure labels: Ranking, Rows] Note: Recommendations are per household, not individual user
  9. 9. Top 10: Personalization awareness and Diversity [Figure: Top 10 row with titles annotated by household member: All, Dad, Dad & Mom, Daughter, All, All?, Daughter, Son, Mom, Mom]
  10. 10. Support for Recommendations: Social Support
  11. 11. Social Recommendations
  12. 12. Watch again & Continue Watching
  13. 13. Genres
  14. 14. Genre rows § Personalized genre rows focus on user interest § Also provide context and “evidence” § Important for member satisfaction – moving personalized rows to top on devices increased retention § How are they generated? § Implicit: based on user’s recent plays, ratings, & other interactions § Explicit taste preferences § Hybrid: combine the above § Also take into account: § Freshness - has this been shown before? § Diversity - avoid repeating tags and genres, limit number of TV genres, etc.
  15. 15. Genres - personalization
  16. 16. Genres - personalization
  17. 17. Genres - explanations
  18. 18. Genres - explanations
  19. 19. Genres – user involvement
  20. 20. Genres – user involvement
  21. 21. Similars §  Displayed in many different contexts §  In response to user actions/ context (search, queue add…) §  More like… rows
  22. 22. Anatomy of a Personalization - Recap § Everything is a recommendation: not only rating prediction, but also ranking, row selection, similarity… § We strive to make it easy for the user, but… § We want the user to be aware and be involved in the recommendation process § Deal with implicit/explicit and hybrid feedback § Add support/explanations for recommendations § Consider issues such as diversity or freshness
  23. 23. Data & Models
  24. 24. Big Data @Netflix § Almost 30M subscribers § Ratings: 4M/day § Searches: 3M/day § Plays: 30M/day § 2B hours streamed in Q4 2011 § 1B hours in June 2012
  25. 25. Smart Models § Logistic/linear regression § Elastic nets § SVD and other MF models § Restricted Boltzmann Machines § Markov Chains § Different clustering approaches § LDA § Association Rules § Gradient Boosted Decision Trees § …
  26. 26. SVD: $X_{[m \times n]} = U_{[m \times r]} \, S_{[r \times r]} \, (V_{[n \times r]})^T$ § X: m x n matrix (e.g., m users, n videos) § U: m x r matrix (m users, r concepts) § S: r x r diagonal matrix (strength of each ‘concept’) (r: rank of the matrix) § V: n x r matrix (n videos, r concepts)
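As a small worked instance of the factorization on this slide (the numbers are made up for illustration and are not from the talk), take m = 2 users, n = 2 videos, and a ratings matrix of rank r = 1:

```latex
X = \begin{pmatrix} 2 & 2 \\ 1 & 1 \end{pmatrix}
  = \underbrace{\frac{1}{\sqrt{5}} \begin{pmatrix} 2 \\ 1 \end{pmatrix}}_{U\ [2 \times 1]}
    \;\underbrace{\begin{pmatrix} \sqrt{10} \end{pmatrix}}_{S\ [1 \times 1]}
    \;\underbrace{\frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \end{pmatrix}}_{V^T\ [1 \times 2]}
```

Here there is a single "concept" of strength $\sqrt{10}$: both videos load on it equally, and the first user expresses it twice as strongly as the second.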
  27. 27. Simon Funk’s SVD § One of the most interesting findings during the Netflix Prize came out of a blog post § Incremental, iterative, and approximate way to compute the SVD using gradient descent
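The gradient-descent idea can be sketched in a few lines of plain Python. This is a minimal illustration, not the code from the blog post or from Netflix: the function names `funk_svd` and `rmse`, the learning rate, the regularization constant, and the choice to update all factors jointly (rather than one factor at a time) are assumptions made for the example.

```python
import random


def funk_svd(ratings, n_users, n_items, n_factors=2,
             lr=0.02, reg=0.02, epochs=500, seed=0):
    """Approximate a factorization using only the observed ratings.

    ratings: list of (user_index, item_index, rating) triples.
    Returns (P, Q): per-user and per-item factor vectors.
    Hyperparameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, v, r in ratings:
            pred = sum(pf * qf for pf, qf in zip(P[u], Q[v]))
            err = r - pred
            for f in range(n_factors):
                pu, qv = P[u][f], Q[v][f]
                # Stochastic gradient step on the regularized squared error.
                P[u][f] += lr * (err * qv - reg * pu)
                Q[v][f] += lr * (err * pu - reg * qv)
    return P, Q


def rmse(ratings, P, Q):
    """Root mean squared error of the factor model on the given ratings."""
    se = sum((r - sum(a * b for a, b in zip(P[u], Q[v]))) ** 2
             for u, v, r in ratings)
    return (se / len(ratings)) ** 0.5
```

Because the loop only visits observed (user, item, rating) triples, missing entries never need to be imputed, which is what makes this approach practical on a sparse ratings matrix.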
  28. 28. SVD for Rating Prediction § User factor vectors $p_u \in \mathbb{R}^f$ and item factor vectors $q_v \in \mathbb{R}^f$ § Baseline $b_{uv} = \mu + b_u + b_v$ (user & item deviation from average) § Predict rating as $\hat{r}_{uv} = b_{uv} + p_u^T q_v$ § SVD++ (Koren et al.) asymmetric variation with implicit feedback: $\hat{r}_{uv} = b_{uv} + q_v^T \left( |R(u)|^{-\frac{1}{2}} \sum_{j \in R(u)} (r_{uj} - b_{uj}) x_j + |N(u)|^{-\frac{1}{2}} \sum_{j \in N(u)} y_j \right)$ § Where: § $q_v, x_v, y_v \in \mathbb{R}^f$ are three item factor vectors § Users are not parametrized, but rather represented by: § R(u): items rated by user u § N(u): items for which the user has given implicit preference (e.g. rated vs. not rated)
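To make the asymmetric variation concrete, the sketch below evaluates the prediction for one user and one item: the baseline plus the item factor vector dotted with a user profile built from the user's rated items R(u) and implicit items N(u). The function name `predict_rating` and its argument layout are hypothetical, and the factors are assumed to be already trained.

```python
def predict_rating(v, rated, implicit, mu, b_user, b_item, q, x, y):
    """Predict the rating of item v for one user.

    rated:    dict item -> observed rating, i.e. R(u)
    implicit: iterable of items with implicit preference, i.e. N(u)
    mu, b_user, b_item: global mean, user bias, per-item biases
    q, x, y:  per-item factor vectors (lists indexed by item)
    """
    implicit = list(implicit)
    n_factors = len(q[v])
    profile = [0.0] * n_factors

    if rated:
        norm_r = len(rated) ** -0.5  # |R(u)|^(-1/2)
        for j, r_uj in rated.items():
            b_uj = mu + b_user + b_item[j]
            for k in range(n_factors):
                profile[k] += norm_r * (r_uj - b_uj) * x[j][k]

    if implicit:
        norm_n = len(implicit) ** -0.5  # |N(u)|^(-1/2)
        for j in implicit:
            for k in range(n_factors):
                profile[k] += norm_n * y[j][k]

    b_uv = mu + b_user + b_item[v]
    return b_uv + sum(qk * pk for qk, pk in zip(q[v], profile))
```

Since the user is represented only through R(u) and N(u), a user with no history falls back to the baseline, and a new rating changes the prediction without retraining any user-specific parameters.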
  29. 29. Artificial Neural Networks – 4 generations § 1st - Perceptrons (~60s) § Single layer of hand-coded features § Linear activation function § Fundamentally limited in what they can learn to do. § 2nd - Back-propagation (~80s) § Back-propagate error signal to get derivatives for learning § Non-linear activation function § 3rd - Belief Networks (~90s) § Directed acyclic graph composed of (visible & hidden) stochastic variables with weighted connections. § Infer the states of the unobserved variables & learn interactions between variables to make network more likely to generate observed data.
  30. 30. Restricted Boltzmann Machines § Restrict the connectivity to make learning easier. § Only one layer of hidden units. § Although multiple layers are possible § No connections between hidden units. § Hidden units are independent given the visible states. § So we can quickly get an unbiased sample from the posterior distribution over hidden “causes” when given a data-vector § RBMs can be stacked to form Deep Belief Nets (DBN) – 4th generation of ANNs [Figure: one layer of hidden units (j) connected to one layer of visible units (i)]
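Because hidden units are conditionally independent given the visible layer, sampling the hidden "causes" is a single pass with no iteration. The sketch below assumes binary visible and hidden units, which is a simplification chosen for illustration (the talk does not spell out the unit types); the function names and the weight layout `weights[i][j]` are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def hidden_probabilities(visible, weights, hidden_bias):
    """p(h_j = 1 | v) for every hidden unit j, computed independently.

    visible:     list of 0/1 values, one per visible unit i
    weights:     weights[i][j] connects visible unit i to hidden unit j
    hidden_bias: one bias per hidden unit j
    """
    return [
        sigmoid(hidden_bias[j]
                + sum(v_i * weights[i][j] for i, v_i in enumerate(visible)))
        for j in range(len(hidden_bias))
    ]


def sample_hidden(visible, weights, hidden_bias, rng):
    """Draw one exact sample of the hidden layer given a data vector."""
    probs = hidden_probabilities(visible, weights, hidden_bias)
    return [1 if rng.random() < p else 0 for p in probs]
```

For example, `sample_hidden([1, 0, 1], weights, hidden_bias, random.Random(0))` returns one 0/1 value per hidden unit; no connections between hidden units means no unit's sample depends on another's.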
  31. 31. RBM for the Netflix Prize
  32. 32. Ranking: Key algorithm, sorts titles in most contexts
  33. 33. Ranking § Ranking = Scoring + Sorting + Filtering bags of movies for presentation to a user § Goal: Find the best possible ordering of a set of videos for a user within a specific context in real-time § Objective: maximize consumption § Aspirations: Played & “enjoyed” titles have best score § Akin to CTR forecast for ads/search results § Factors: § Accuracy § Novelty § Diversity § Freshness § Scalability § …
  34. 34. Ranking § Popularity is the obvious baseline § Ratings prediction is a clear secondary data input that allows for personalization § We have added many other features (and tried many more that have not proved useful) § What about the weights? § Based on A/B testing § Machine-learned
  35. 35. Example: Two features, linear model § Linear model: $f_{rank}(u,v) = w_1 \, p(v) + w_2 \, r(u,v) + b$ [Figure: titles plotted by Popularity and Predicted Rating, numbered 1 to 5 by final ranking]
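The two-feature linear model can be sketched as a scoring function followed by a sort, where p(v) is the popularity of video v and r(u,v) is the predicted rating of v for user u. The function names, the candidate tuples, and the weight values below are hypothetical; the talk only says the weights come from A/B testing or machine learning.

```python
def f_rank(popularity, predicted_rating, w1, w2, b=0.0):
    """Linear ranking score: w1 * p(v) + w2 * r(u, v) + b."""
    return w1 * popularity + w2 * predicted_rating + b


def rank_titles(candidates, w1, w2, b=0.0):
    """Sort (title, popularity, predicted_rating) tuples by descending score."""
    return sorted(candidates,
                  key=lambda c: f_rank(c[1], c[2], w1, w2, b),
                  reverse=True)


# Made-up candidates: (title, popularity in [0, 1], predicted rating in [1, 5]).
candidates = [("A", 0.9, 2.0), ("B", 0.2, 4.5), ("C", 0.5, 3.5)]
personalized = [t for t, _, _ in rank_titles(candidates, w1=1.0, w2=1.0)]
popularity_heavy = [t for t, _, _ in rank_titles(candidates, w1=5.0, w2=0.1)]
```

With equal weights the predicted rating dominates and the order is B, C, A; with a heavy popularity weight it flips to A, C, B, which shows how the weights trade off popularity against personalization.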
  36. 36. Ranking
  37. 37. Ranking
  38. 38. Ranking
  39. 39. Ranking
  40. 40. Learning to rank § Machine learning problem: goal is to construct ranking model from training data § Training data can have partial order or binary judgments (relevant/not relevant). § Resulting order of the items typically induced from a numerical score § Learning to rank is a key element for personalization § You can treat the problem as a standard supervised classification problem
  41. 41. Learning to Rank Approaches 1. Pointwise § Ranking function minimizes loss function defined on individual relevance judgment § Ranking score based on regression or classification § Ordinal regression, Logistic regression, SVM, GBDT, … 2. Pairwise § Loss function is defined on pair-wise preferences § Goal: minimize number of inversions in ranking § Ranking problem is then transformed into the binary classification problem § RankSVM, RankBoost, RankNet, FRank…
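The pairwise view can be illustrated with two small functions: one counts inversions (pairs the model orders against the relevance judgments), and one computes a smooth logistic loss over the same pairs, which is the general shape of objective used by RankNet-style methods. This is a generic sketch rather than the implementation of any of the named algorithms, and both function names are hypothetical.

```python
import math
from itertools import combinations


def _preference_pairs(relevance):
    """Yield (better, worse) index pairs with strictly different relevance."""
    for i, j in combinations(range(len(relevance)), 2):
        if relevance[i] > relevance[j]:
            yield i, j
        elif relevance[j] > relevance[i]:
            yield j, i


def count_inversions(scores, relevance):
    """Number of pairs where the more relevant item is not scored higher."""
    return sum(1 for hi, lo in _preference_pairs(relevance)
               if scores[hi] <= scores[lo])


def pairwise_logistic_loss(scores, relevance):
    """Mean of log(1 + exp(-(s_hi - s_lo))) over all preference pairs."""
    pairs = list(_preference_pairs(relevance))
    if not pairs:
        return 0.0
    total = sum(math.log1p(math.exp(-(scores[hi] - scores[lo])))
                for hi, lo in pairs)
    return total / len(pairs)
```

Each pair acts as one binary classification example ("is the first item preferred over the second?"), which is the transformation the slide refers to; the logistic loss is differentiable, while the inversion count is not.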
  42. 42. Learning to rank - metrics § Quality of ranking measured using metrics as § Normalized Discounted Cumulative Gain: $NDCG = \frac{DCG}{IDCG}$, with $DCG = relevance_1 + \sum_{i=2}^{n} \frac{relevance_i}{\log_2 i}$ § Mean Reciprocal Rank (MRR): $MRR = \frac{1}{|H|} \sum_{h \in H} \frac{1}{rank(h_i)}$ § Fraction of Concordant Pairs (FCP): $FCP = \frac{\sum_{i \neq j} CP(x_i, x_j)}{\frac{n(n-1)}{2}}$ § Others… § But, it is hard to optimize machine-learned models directly on these measures (they are not differentiable) § Recent research on models that directly optimize ranking measures
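The three metrics on this slide are short enough to write out directly. The sketch below follows the DCG form shown on the slide (first item undiscounted, then relevance divided by log2 of the position). The function names are hypothetical, and treating a pair as concordant only when both the scores and the relevance values are strictly ordered the same way is an assumption made for the example.

```python
import math
from itertools import combinations


def dcg(relevances):
    """relevance_1 + sum over i >= 2 of relevance_i / log2(i)."""
    return sum(rel if i == 1 else rel / math.log2(i)
               for i, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG of the given ordering divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def mrr(first_relevant_ranks):
    """Mean of 1 / rank of the first relevant item, one rank per query."""
    return sum(1.0 / r for r in first_relevant_ranks) / len(first_relevant_ranks)


def fcp(scores, relevance):
    """Concordant pairs divided by n(n-1)/2."""
    n = len(scores)
    if n < 2:
        return 0.0
    concordant = sum(
        1 for i, j in combinations(range(n), 2)
        if (scores[i] - scores[j]) * (relevance[i] - relevance[j]) > 0
    )
    return concordant / (n * (n - 1) / 2)
```

All three depend on the scores only through the ordering they induce, so a small change in a score either changes nothing or causes a jump. That is why they cannot be optimized directly with gradient methods.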
  43. 43. Learning to Rank Approaches 3. Listwise a. Indirect Loss Function § RankCosine: similarity between ranking list and ground truth as loss function § ListNet: KL-divergence as loss function by defining a probability distribution § Problem: optimization of listwise loss function may not optimize IR metrics b. Directly optimizing IR measures (difficult since they are not differentiable) § Directly optimize IR measures through Genetic Programming § Directly optimize measures with Simulated Annealing § Gradient descent on smoothed version of objective function (e.g. CLiMF presented at RecSys 2012 or TFMAP at SIGIR 2012) § SVM-MAP relaxes the MAP metric by adding it to the SVM constraints § AdaRank uses boosting to optimize NDCG
  44. 44. Similars § Different similarities computed from different sources: metadata, ratings, viewing data… § Similarities can be treated as data/features § Machine Learned models improve our concept of “similarity”
  45. 45. Data & Models - Recap § All sorts of feedback from the user can help generate better recommendations § Need to design systems that capture and take advantage of all this data § The right model is as important as the right data § It is important to come up with new theoretical models, but also need to think about application to a domain, and practical issues § Rating prediction models are only part of the solution to recommendation (think about ranking, similarity…)
  46. 46. More data or better models? Really? Anand Rajaraman: Stanford & Senior VP at Walmart Global eCommerce (former Kosmix)
  47. 47. More data or better models? Sometimes, it’s not about more data
  48. 48. More data or better models? [Banko and Brill, 2001] Norvig: “Google does not have better Algorithms, only more Data” Many features / low-bias models
  49. 49. More data or better models? Sometimes, it’s not about more data [Figure: model performance vs. sample size (actual Netflix system); x-axis from 0 to 6,000,000, y-axis from 0 to 0.09]
  50. 50. More data or better models? Data without a sound approach = noise
  51. 51. Consumer (Data) Science
  52. 52. Consumer Science § Main goal is to effectively innovate for customers § Innovation goals § “If you want to increase your success rate, double your failure rate.” – Thomas Watson, Sr., founder of IBM § The only real failure is the failure to innovate § Fail cheaply § Know why you failed/succeeded
  53. 53. Consumer (Data) Science 1. Start with a hypothesis: § Algorithm/feature/design X will increase member engagement with our service, and ultimately member retention 2. Design a test § Develop a solution or prototype § Think about dependent & independent variables, control, significance… 3. Execute the test 4. Let data speak for itself
  54. 54. Offline/Online testing process: Offline testing (days) → [success] → Online A/B testing (weeks to months) → [success] → Rollout feature to all users; a [fail] branch is also shown
  55. 55. Offline testing § Optimize algorithms offline § Measure model performance, using metrics such as: § Mean Reciprocal Rank, Normalized Discounted Cumulative Gain, Fraction of Concordant Pairs, Precision/Recall & F-measures, AUC, RMSE, Diversity… § Offline performance used as an indication to make informed decisions on follow-up A/B tests § A critical (and unsolved) issue is how offline metrics can correlate with A/B test results. § Extremely important to define a coherent offline evaluation framework (e.g. How to create training/testing datasets is not trivial)
  56. 56. Executing A/B tests § Many different metrics, but ultimately trust user engagement (e.g. hours of play and customer retention) § Think about significance and hypothesis testing § Our tests usually have thousands of members and 2-20 cells § A/B Tests allow you to try radical ideas or test many approaches at the same time. § We typically have hundreds of customer A/B tests running § Decisions on the product always data-driven
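As one way to "think about significance" when comparing two cells, the sketch below runs a two-sided two-proportion z-test on a binary outcome such as "member retained". The talk does not say which statistical test Netflix uses, so the choice of test, the function name `two_proportion_z_test`, and the example counts are all assumptions made for illustration.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference between two proportions.

    Returns (z, p_value). A positive z means cell B has the higher rate.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("outcome has no variance; the test is undefined")
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: control cell A retains 800 of 1000 members, test cell B 850 of 1000.
z, p_value = two_proportion_z_test(800, 1000, 850, 1000)
```

With 2-20 cells per test and hundreds of tests running, many such comparisons happen at once, so in practice some correction for multiple comparisons would also be worth considering.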
  57. 57. What to measure § OEC: Overall Evaluation Criteria § In an A/B test framework, the measure of success is key § Short-term metrics do not always align with long-term goals § E.g. CTR: generating more clicks might mean that our recommendations are actually worse § Use long-term metrics such as LTV (lifetime value) whenever possible § At Netflix, we use member retention
  58. 58. What to measure § Short-term metrics can sometimes be informative, and may allow for faster decision-making § At Netflix we use many, such as hours streamed by users or % of hours from a given algorithm § But, be aware of several caveats of using early decision mechanisms: initial effects appear to trend. See “Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained” [Kohavi et al., KDD 2012]
  59. 59. Consumer Data Science - Recap § Consumer Data Science aims to innovate for the customer by running experiments and letting data speak § This is mainly done through online A/B testing § However, we can speed up innovation by experimenting offline § But, both for online and offline experimentation, it is important to choose the right metric and experimental framework
  60. 60. Architectures
  61. 61. Technology http://
  62. 62. 62
  63. 63. Event & Data Distribution
  64. 64. Event & Data Distribution • UI devices should broadcast many different kinds of user events • Clicks • Presentations • Browsing events • … • Events vs. data • Some events only need to be propagated and trigger an action (low latency, low information per event) • Others need to be processed and “turned into” data (higher latency, higher information quality). • And… there are many in between • Real-time event flow managed through internal tool (Manhattan) • Data flow mostly managed through Hadoop.
  65. 65. Offline Jobs
  66. 66. Offline Jobs • Two kinds of offline jobs • Model training • Batch offline computation of recommendations/intermediate results • Offline queries either in Hive or Pig • Need a publishing mechanism that solves several issues • Notify readers when result of query is ready • Support different repositories (S3, Cassandra…) • Handle errors, monitoring… • We do this through Hermes
  67. 67. Computation
  68. 68. Computation • Two ways of computing personalized results • Batch/offline • Online • Each approach has pros/cons • Offline + Allows more complex computations + Can use more data - Cannot react to quick changes - May result in staleness • Online + Can respond quickly to events + Can use most recent data - May fail because of SLA - Cannot deal with “complex” computations • It’s not an either/or decision • Both approaches can be combined
  69. 69. Signals & Models
  70. 70. Signals & Models • Both offline and online algorithms are based on three different inputs: • Models: previously trained from existing data • (Offline) Data: previously processed and stored information • Signals: fresh data obtained from live services • User-related data • Context data (session, date, time…)
  71. 71. Results
  72. 72. Results • Recommendations can be serviced from: • Previously computed lists • Online algorithms • A combination of both • The decision on where to service the recommendation from can respond to many factors including context. • Also, important to think about the fallbacks (what if plan A fails) • Previously computed lists/intermediate results can be stored in a variety of ways • Cache • Cassandra • Relational DB
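The "what if plan A fails" point can be sketched as a serving function that tries the online algorithm first, then a previously computed list, then a non-personalized default. Everything here is hypothetical: the function name `serve_recommendations`, the in-memory dict standing in for a cache or Cassandra lookup, and the latency budget are assumptions, and a production system would enforce the SLA with real timeouts rather than checking elapsed time after the call returns.

```python
import time


def serve_recommendations(user_id, online_ranker, precomputed,
                          popular_fallback, budget_seconds=0.05):
    """Return (recommendations, source), preferring the freshest source.

    online_ranker:    callable user_id -> list of titles (may raise or be slow)
    precomputed:      dict user_id -> list of titles from an offline job
    popular_fallback: non-personalized list used as the last resort
    """
    start = time.monotonic()
    try:
        recs = online_ranker(user_id)
        # Simplification: only checks the budget after the call has returned.
        if recs and time.monotonic() - start <= budget_seconds:
            return recs, "online"
    except Exception:
        pass  # Plan A failed; fall through to the stored results.

    if user_id in precomputed:
        return precomputed[user_id], "precomputed"
    return popular_fallback, "popular"
```

Returning the source alongside the list also makes it possible to monitor how often each path is used, which ties in with the monitoring slides later in the talk.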
  73. 73. Alerts and Monitoring § A non-trivial concern in large-scale recommender systems § Monitoring: continuously observe quality of system § Alert: fast notification if quality of system goes below a certain pre-defined threshold § Questions: § What do we need to monitor? § How do we know something is “bad enough” to alert?
  74. 74. What to monitor § Staleness § Monitor time since last data update [Figure annotation: Did something go wrong here?]
  75. 75. What to monitor § Algorithmic quality § Monitor different metrics by comparing what users do and what your algorithm predicted they would do
  76. 76. What to monitor § Algorithmic quality § Monitor different metrics by comparing what users do and what your algorithm predicted they would do [Figure annotation: Did something go wrong here?]
  77. 77. What to monitor § Algorithmic source for users § Monitor how users interact with different algorithms [Figure labels: Algorithm X, New version; annotation: Did something go wrong here?]
  78. 78. When to alert § Alerting thresholds are hard to tune § Avoid unnecessary alerts (the “learn-to-ignore problem”) § Avoid important issues being noticed before the alert happens § Rules of thumb § Alert on anything that will impact user experience significantly § Alert on issues that are actionable § If a noticeable event happens without an alert… add a new alert for next time
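A minimal sketch of threshold-based alerting for two of the monitored quantities, staleness and algorithmic quality, is shown below. The function name `check_alerts`, the metric definitions, and the threshold values are hypothetical; choosing thresholds that avoid both false alarms and missed issues is exactly the tuning problem the slide describes.

```python
def check_alerts(seconds_since_update, max_staleness_seconds,
                 predicted_rate, observed_rate, max_relative_drop):
    """Return the list of alerts that should fire for one monitoring pass.

    seconds_since_update: time since the last data update (staleness)
    predicted_rate:       what the algorithm predicted users would do
    observed_rate:        what users actually did
    max_relative_drop:    tolerated shortfall of observed vs. predicted
    """
    alerts = []
    if seconds_since_update > max_staleness_seconds:
        alerts.append("staleness")
    if predicted_rate > 0:
        relative_drop = (predicted_rate - observed_rate) / predicted_rate
        if relative_drop > max_relative_drop:
            alerts.append("algorithmic_quality")
    return alerts
```

For instance, with a six-hour staleness limit and a 20% tolerated drop, `check_alerts(30000, 21600, 0.30, 0.29, 0.2)` fires only the staleness alert, because the observed rate is within tolerance of the prediction.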
  79. 79. Conclusions
  80. 80. The Personalization Problem § The Netflix Prize simplified the recommendation problem to predicting ratings § But… § User ratings are only one of the many data inputs we have § Rating predictions are only part of our solution § Other algorithms such as ranking or similarity are very important § We can reformulate the recommendation problem § Function to optimize: probability a user chooses something and enjoys it enough to come back to the service
  81. 81. More data + Better models + More accurate metrics + Better approaches & architectures: Lots of room for improvement!
  82. 82. Thanks! We’re hiring! Xavier Amatriain (@xamat)