Tutorial on Deep Learning in Recommender System, Lars summer school 2019


I had a fun time giving a tutorial on deep learning in recommender systems at the Latin America School on Recommender Systems (LARS) in Fortaleza, Brazil.

Published in: Engineering

  1. 1. Tutorial on Deep Learning in Recommender Systems Anoop Deoras ACM - LARS, Fortaleza 10/10/2019 @adeoras
  2. 2. Outline of the tutorial ● Models from Linear Family: ■ Matrix Factorization, Asymmetric Matrix Factorization, SLIM and Topic Models, .. ● Models from Non-Linear Family: ■ Variational Autoencoders, Sequence and Convolutional models, .. ● Modeling the Context ● Interpreting the inner workings of a Neural Network Recommender Model ● Reinforcement Learning in RecSys
  3. 3. ~150M Members, 190 Countries
  4. 4. Personalization ● Recommendation systems are a means to an end. ● Our primary goal: ○ Maximize Netflix members’ enjoyment of the shows they select ■ Enjoyment integrated over time ○ Minimize the time it takes to find those shows ■ Interaction cost integrated over time
  5. 5. Everything is a recommendation!
  6. 6. Ordering of the titles in each row is personalized
  7. 7. Selection and placement of the row types is personalized
  8. 8. Profile 1 Profile 2 Personalized Images
  9. 9. Personalized Messages
  10. 10. Impracticality of Showing everything
  11. 11. We Personalize our recommendation! This Talk Answers: HOW ?
  12. 12. Background ● 2006-2009: Netflix Prize: ○ >10% improvement, win $1,000,000 ● The top performing model(s) ended up being variations of Matrix Factorization [SVD++, Koren, et al] ● Although Netflix’s rec system has moved on, MF is still the foundational method on which most collaborative filtering systems are based today. ● Next, let us discuss how the models evolved and where each one applies.
  13. 13. Models from Linear Family
  14. 14. Matrix of Observed Ratings 1.0 2.0 3.0 4.0 5.0 1.0 2.0 Users Videos Observed Ratings ● User-Item rating matrix ● Ratings are explicit feedback ● Very large but sparse
  15. 15. Matrix Completion 1.0 ? ? 2.0 ? 3.0 ? ? ? 4.0 ? 5.0 1.0 ? 2.0 Users Videos Observed Ratings ● The problem is that of completing the matrix. ● Helps us generalize and recommend titles user has not rated.
  16. 16. Factorizing the Matrix ● The problem is that of completing the matrix. ● Helps us generalize and recommend titles the user has not rated.
  17. 17. Alt Least Squares / Grad Desc... ● Minimize the Frobenius norm.
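The gradient-descent route named on this slide can be sketched as follows. This is a minimal illustration, assuming a squared-error loss over the observed entries only plus a small L2 penalty; the function name `factorize`, the hyperparameters, and the toy triples (built from the seven rating values on slide 14, with arbitrarily chosen positions) are assumptions, not the tutorial's own setup.

```python
import numpy as np

def factorize(ratings, n_users, n_items, rank=2, lr=0.05, reg=0.01, epochs=500, seed=0):
    """Fit R ~ P @ Q.T by SGD over the observed (user, item, rating) triples only."""
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((n_users, rank))  # user factors
    Q = 0.1 * rng.standard_normal((n_items, rank))  # item factors
    for _ in range(epochs):
        for u, i, r in ratings:
            p_u = P[u].copy()                 # keep the old user factor for the item update
            err = r - p_u @ Q[i]              # residual on this observed entry
            P[u] += lr * (err * Q[i] - reg * p_u)
            Q[i] += lr * (err * p_u - reg * Q[i])
    return P, Q

# Toy observed ratings (hypothetical positions in a 4 x 4 matrix).
ratings = [(0, 0, 1.0), (0, 3, 2.0), (1, 1, 3.0), (2, 0, 4.0),
           (2, 2, 5.0), (3, 0, 1.0), (3, 2, 2.0)]
P, Q = factorize(ratings, n_users=4, n_items=4)
completed = P @ Q.T  # dense matrix: every "?" now has a predicted value
mse = np.mean([(r - completed[u, i]) ** 2 for u, i, r in ratings])
```

Alternating least squares optimizes the same objective but solves for `P` with `Q` fixed, then for `Q` with `P` fixed, in closed form.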
  18. 18. Problems with Explicit Feedback ● Ratings are extremely hard to get. ○ Signal becomes sparser ○ Netflix data was 0.1% dense !! ● They are not calibrated. ○ Some users will never give a 5 ○ Some users will always give a 5
  19. 19. From Explicit to Implicit 1 1 1 1 1 1 1 Users Videos Observed Preference ● User-Item preference matrix ● Values are implicit feedback ● Very large but sparse. ● The problem is that of completing the matrix. ● Helps us generalize and recommend titles the user has not interacted with yet. ● NOTE: you will need negatives (0s)
  20. 20. Alt Least Squares / Grad Desc...
  21. 21. Scoring Q P Videos Users Linear Factor Interaction Item factor for video-j User factor for user-i
  22. 22. How does one obtain User factors ? ● We cannot possibly store a latent vector for every user because: ○ Potentially too many users and we train on a subset ○ New users come to the service ● At runtime, what do we know ? ○ Item latent representations and the user’s ratings/interactions (e.g. plays)
  23. 23. Alternating Least Squares -- “Fold-In” User Videos Observed Plays 1 1 ● User-Item preference vector ● Values are implicit feedback ● Just run a least squares optimization during serving
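The fold-in step can be sketched as a single ridge-regression solve per user, with the item factors held fixed. This is a minimal sketch, assuming an unweighted squared loss against the user's 0/1 play vector; `fold_in_user`, `reg`, and the random toy item factors are illustrative names and values, and production implicit-feedback ALS typically adds per-entry confidence weights that are omitted here.

```python
import numpy as np

def fold_in_user(Q, played, reg=0.1):
    """Solve min_p ||r - Q p||^2 + reg * ||p||^2 for one user, with item factors Q fixed."""
    n_items, rank = Q.shape
    r = np.zeros(n_items)
    r[played] = 1.0                      # implicit feedback: 1 for played items, 0 elsewhere
    A = Q.T @ Q + reg * np.eye(rank)     # normal equations of the ridge problem
    b = Q.T @ r
    return np.linalg.solve(A, b)

rng = np.random.default_rng(0)
Q = rng.standard_normal((6, 3))          # toy item factors; in practice learned offline
p_new = fold_in_user(Q, played=[1, 4])   # factor for a user never seen in training
scores = Q @ p_new                       # one score per item for this user
```

Only a `rank x rank` system is solved per request, which is cheap but still an optimization step at serving time.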
  24. 24. Asymmetric Matrix Factorization ● Similar setup to MF. ● Start with a sparse matrix with implicit/explicit feedback ● Similar goal, approximate R with product of two latent matrices
  25. 25. Asymmetric Matrix Factorization (AMF) (MF) (AMF) N(U) all the videos U played Video embedding over user history
  26. 26. AMF viewed as successive matrix ops (MF) (AMF) ● Similar setup to MF. ● Start with a sparse matrix with implicit/explicit feedback ● Similar goal, approximate R with product of two latent matrices Indicator Function Item Embedding Item Embedding
  27. 27. AMF as a linear-Neural Network
  28. 28. No Least Squares solver necessary ● Look up plays/interactions the user had with items in the catalog of interest. ○ Average the latent representations of items ○ That’s it. ● Frees you up from the tedious (alternating) least squares optimization during inference / serving.
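The lookup-and-average inference described on this slide can be sketched in a few lines. This assumes two embedding tables, one used to represent the user's history and one used to score candidates; the names `E_in`, `E_out`, and `amf_scores`, and the random toy values, are hypothetical.

```python
import numpy as np

def amf_scores(E_in, E_out, played):
    """Score every item for a user represented as the mean embedding of played items."""
    user_vec = E_in[played].mean(axis=0)   # look up and average, no solver involved
    return E_out @ user_vec

rng = np.random.default_rng(0)
E_in = rng.standard_normal((5, 3))    # item embeddings used on the history side
E_out = rng.standard_normal((5, 3))   # item embeddings used on the scoring side
scores = amf_scores(E_in, E_out, played=[0, 2])
top_item = int(np.argmax(scores))
```

In practice the already-played items would usually be filtered out before taking the top results.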
  29. 29. SLIM (Sparse Linear Method) ≈R I(R) Diagonal replaced with zeros Y items items 0 (AMF) (SLIM) SLIM: Sparse Linear Methods for top N Recommendation, Ning, ICDM 2011
  30. 30. Basic Intuition behind Soft Clustering Models ● Imagine you walked into a room full of movie enthusiasts, from all over the world, from all walks of life, and your goal was to come out with a great movie recommendation. ● Would you take a popular vote ? Would that satisfy you ?
  31. 31. Basic Intuition behind Soft Clustering Models ● Now consider forming groups of people with similar taste based on the videos that they previously enjoyed.
  32. 32. Basic Intuition behind Soft Clustering Models ● Describe yourself using what you have watched. ● Try to associate yourself with these groups and obtain a weighted “personalized” popularity vote.
  33. 33. User’s distribution over the topics 0.15 0.63 0.22
  34. 34. Topic’s internal distribution over videos 0.05 0.09 0.10
  35. 35. Topic Models (Latent Dirichlet Alloc) K U P α θ φt v β Total Topics Taste Convex Combinations of topics proportions and movie proportions within topic
  36. 36. Topic Models (LDA): Scoring Q P Videos Users Topic Conditional distribution for video-j Distribution over topics for user-i
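The scoring rule on this slide is a convex combination: the probability of a video for a user is the user's topic proportions times each topic's distribution over videos. A small sketch, reusing the topic proportions from slide 33 and an otherwise made-up topic-by-video matrix `phi` (its values are assumptions for illustration):

```python
import numpy as np

# User-by-topic proportions (one user, three topics), values from slide 33.
theta = np.array([[0.15, 0.63, 0.22]])

# Topic-by-video distributions; each row sums to 1 (illustrative values).
phi = np.array([
    [0.05, 0.09, 0.10, 0.76],
    [0.40, 0.30, 0.20, 0.10],
    [0.25, 0.25, 0.25, 0.25],
])

# P(video j | user i) = sum_k theta[i, k] * phi[k, j]
scores = theta @ phi
ranking = np.argsort(-scores[0])  # videos ordered from most to least likely
```

This has the same `P @ Q` form as matrix factorization, except that the rows of both factors are constrained to be probability distributions.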
  37. 37. LDA a special case of MF MF LDA
  38. 38. Relating all the models ● LDA as a special case of MF ○ Latent factors are prob distributions ● AMF is equivalent to 1 hidden layer linear feedforward ● AMF as a special case of MF ● SLIM as a special case of AMF
  39. 39. Contextualizing these models ● Netflix/YouTube/.. use case: ○ Want to model country, time of day, day of week, device, .. ● Country as the context, some challenges: ○ Each country offers a different catalog. How do we model it ? ● Time of day, day of week as the context, some challenges: ○ Discrete or continuous variables ?
  40. 40. Country as the context in LDA models Country A catalog Country B catalog Users in Country A play both Friends and HIMYM Users in Country B cannot play both Friends and HIMYM Model is forced to split HIMYM plays. topic k : Outcome: Parameters are being consumed to explain catalog differences. topic j: Topic with high mass on HIMYM and Friends Topic with high mass on HIMYM
  41. 41. Catalogue Censoring in Topic Models K U P α θ φt v β Total Topics Taste c Censoring pattern m Global Recommendation System for Overlapping Media Catalogue, Todd, US Patent App
  42. 42. Censoring in MF/AMF ? ● Censored multinomials cannot be formed as elegantly as in LDA. ● We can sample negatives only from the country’s catalog. ○ Why waste the model’s energy on demoting titles that the user will never see anyway ?
  43. 43. Time context in Topic Models K U P α θ φk v β Total Topics Taste t Observed time µ Topics over Time: A Non-Markov Continuous-Time Model of Topical Trends, Wang, KDD 2006
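Restricting negatives to the country's catalog can be sketched as below. This is a minimal illustration assuming uniform sampling without replacement; `sample_negatives`, `catalog_mask`, and the toy catalog are hypothetical names and values, and a real system might sample proportionally to popularity instead.

```python
import numpy as np

def sample_negatives(catalog_mask, played, n_neg, rng):
    """Draw negatives uniformly from in-catalog items the user has not played."""
    candidates = np.flatnonzero(catalog_mask)        # items available in this country
    candidates = np.setdiff1d(candidates, played)    # drop the user's positives
    return rng.choice(candidates, size=n_neg, replace=False)

rng = np.random.default_rng(0)
catalog_mask = np.array([True, True, False, True, False, True])  # items 2 and 4 unavailable
negatives = sample_negatives(catalog_mask, played=[0], n_neg=2, rng=rng)
```

Out-of-catalog items never appear as negatives, so no gradient is spent pushing their scores down.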
  44. 44. SIMPLE !
  45. 45. Fully contextualizing Topic Models K U P α θ φk v β Total Topics Taste t Observed time µ c Censoring pattern m
  46. 46. What about device ? Yet another Random Variable Gibbs Sampling derivation Never mind
  47. 47. Let’s enter the world of non linearity ● Let us make LDA non linear -- aka Variational Autoencoder ● Let us make AMF non linear -- aka Auto Encoder ● Let us go beyond the world of generative models. ○ How about conditional models ? ■ Ffwd, LSTMs, CNN … ● Let’s make these models context ready. ● Let’s talk about Reinforcement Learning in RecSys
  48. 48. Models from Non-Linear Family
  49. 49. Why use DL in RecSys ? ● Better generalization beyond linear models for user-item interactions. ● Unified representation of heterogeneous signals (e.g. add image/audio/textual content as side information to item embeddings via convolutional NNs). ● Exploitation of sequential information in actions leading up to recommendation (e.g. LSTM on viewing/purchase/search history to predict what will be watched/purchased/searched next). ● DL toolkits provide unprecedented flexibility in experimenting with loss functions (e.g. in toolkits like TensorFlow/PyTorch/Keras, switching the loss from a classification loss to a ranking loss is trivial; the optimization is taken care of).
  50. 50. The most attractive properties of NNs ● End to end differentiable ○ Helpful in the RL setting, for instance ● Provide suitable inductive biases catered to the input data ● NNs are composable → gigantic end to end differentiable NN ○ Toolkits such as TensorFlow make it simple to implement ● Indispensable for multi modal data, such as text and images ○ News recommendation, for instance
  51. 51. RecSys has a BIG small data problem ● While we may have millions (even billions) of users and millions of items, RecSys is nevertheless a small data problem ○ Each user interacts with only a small, finite number of items ● Neural networks naturally project discrete tokens into a continuous space ○ Collaborative filtering in continuous space
  52. 52. Drawbacks ● Lack of interpretability ○ Although we have had some breakthroughs at Netflix ● Needs Big Data ○ DL models, being very high capacity models, flourish under large training datasets. ○ Often poor performance is reported on small setups. ● HyperParam tuning ○ Hyper params are unique to the problem setting.
  53. 53. NN models categorized into 4 main categories ● AutoEncoder like ● Matrix Factorization like ● Conditional Models (Language model) like ● Hybrids
  54. 54. AutoEncoder Family Feed Forward k-hot play (t-n)... play (t-1) reconstruction ● General idea: reconstruct input. ● Reconstruction helps generalize to unseen items.
  55. 55. Earliest adaptation: Restricted Boltzmann Machines From recent presentation by Alexandros Karatzoglou One hidden layer. User feedback on the items interacted with is propagated back to all items. Very similar to an autoencoder!
  56. 56. Collaborative Denoising Auto-Encoder Collaborative Denoising Auto-Encoders for Top-N Recommender Systems, Wu, WSDM 2016 ● Treats the feedback on items y that the user U has interacted with (input layer) as a noisy version of the user’s preferences on all items (output layer) ● Introduces a user specific input node and hidden bias node, while the item weights are shared across all users.
  57. 57. From AMF to a Neural Network Non linearities (AMF) (NN)
  58. 58. Variational Autoencoders zu u Taste fθ 𝞵 𝞼 u Encoder Decoder fѰ fѰ DNN Soft-max over entire vocabulary Variational Autoencoders for Collaborative Filtering, Liang et al.
  59. 59. Non Linear Factorization Models RNN/Ffwd .. Feed Forward User,Cntxt play (t-n)... play (t-1)cntxt item metadata Item Prob of play ...
  60. 60. Neural Network Matrix Factorization ● Treats Matrix Factorization from non linearity perspective. ● Neural Network Matrix Factorization, Dziugaite, arxiv 2015
  61. 61. Deep Factorization Machines DeepFM: A Factorization-Machine based Neural Network for CTR Prediction, Guo et al, IJCAI 2017 ● Treats Factorization Machine from non linearity perspective. ● Higher order feature interactions -- Deep NN ● Lower order feature interactions -- Factorization Machine
  62. 62. Wide + Deep Models for Recommendations In a recommender setting, you may want to train with a wide set of cross-product feature transformations , so that the model essentially memorizes these sparse feature combinations (rules): Meh! Yay! Wide and Deep Learning for Recommender Systems, Cheng et al, RecSys (2016)
  63. 63. Wide + Deep Models for Recommendations (W+D) On the other hand, you may want the ability to generalize using the representational power of a deep network. But deep nets can over-generalize.
  64. 64. Wide + Deep = Memorization + Generalization Best of both worlds: Jointly train a deep + wide network. The cross-feature transformation in the wide model component can memorize all those sparse, specific rules, while the deep model component can generalize to similar items via embeddings.
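The combination described on this slide can be sketched as a single logit that sums a linear term over cross-product features (the wide part) and the output of a small MLP over embeddings (the deep part). Only the forward pass is shown; all weights, shapes, and names below are made-up assumptions, and joint training would backpropagate one loss through both parts.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def wide_and_deep(x_cross, emb, w_wide, W1, b1, w2, bias):
    """P(click) = sigmoid(wide linear term + deep MLP term + bias)."""
    wide_logit = w_wide @ x_cross                 # memorization: sparse cross features
    hidden = np.maximum(0.0, W1 @ emb + b1)       # generalization: ReLU layer over embeddings
    deep_logit = w2 @ hidden
    return sigmoid(wide_logit + deep_logit + bias)

x_cross = np.array([1.0, 0.0, 1.0])   # binary cross-product features that fired
emb = np.array([0.2, -0.1])           # concatenated dense embeddings (toy)
w_wide = np.array([0.8, -0.3, 0.5])
W1 = np.array([[0.5, -0.4], [0.3, 0.9], [-0.7, 0.1]])
b1 = np.zeros(3)
w2 = np.array([0.6, -0.2, 0.4])
p_click = wide_and_deep(x_cross, emb, w_wide, W1, b1, w2, bias=-0.5)
```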
  65. 65. Wide + Deep Models in Google Product Wide + Deep Model for app recommendations.
  66. 66. Generative versus Conditional Models ● All the models we saw till now were Generative Models ● They describe the data, the observation ● Often we care about knowing what is the next play a user will watch or the next thing our user will buy ● How about conditional models ? ○ Model probability of next play directly ○ Prob (play | everything we know about the user at time t) ● Borrow from Language Modeling community ○ Prob (next word | all the words before)
  67. 67. Neural Multi Class Models play (t-n) ... play (t-1) cntxt Soft-max over entire vocabulary play (t-n)... play (t-1)cntxt Soft-max over entire vocabulary N-GRAM BoW-n Feed Forward User,Cntxt P(next-video | <user, cntxt>)
  68. 68. The YouTube Recommendation model A two stage approach with two deep networks: ● The candidate generation network takes events from the user’s YouTube activity history as input and retrieves a small subset (hundreds) of videos from a large corpus. These candidates are intended to be generally relevant to the user with high precision. The candidate generation network only provides broad personalization via collaborative filtering. ● The ranking network scores each video according to a desired objective function using a rich set of features describing the video and user. The highest scoring videos are presented to the user, ranked by their score. Deep Neural Networks for YouTube Recommendations, Covington et al, RecSys (2016)
  69. 69. The YouTube Recommendation model Stage One Deep candidate generation model architecture ● Embedded sparse features are concatenated with dense features. Embeddings are averaged before concatenation to transform variable sized bags of sparse IDs into fixed-width vectors suitable for input to the hidden layers. ● All hidden layers are fully connected. ● In training, a cross-entropy loss is minimized with gradient descent on the output of the sampled softmax. ● At serving, an approximate nearest neighbor lookup is performed to generate hundreds of candidate video recommendations.
  70. 70. The YouTube Recommendation model Stage Two Deep ranking network architecture ● Uses embedded categorical features (both univalent and multivalent) with shared embeddings and powers of normalized continuous features. ● All layers are fully connected. In practice, hundreds of features are fed into the network.
  71. 71. Neural Multi Class Sequential Models play (t-1) cntxt Soft-max over entire vocabulary state (t-1) RNN Family play (t-2) ... play (t-1) Soft-max over entire vocabulary cntxt play (t-4) play (t-3) play (t-n) play (t-n+1) CNN Family state (t) Recurrent Convolution P(next-video | <user, cntxt>)
  72. 72. Session-based recommendation with Recurrent Neural Networks (GRU4Rec) ● Treat each user session as sequence of clicks ● Predict next item in the session sequence Session-based recommendation with Recurrent Neural Networks, Hidasi et al, ICLR (2016)
  73. 73. Adding Item metadata to GRU4Rec: Parallel RNN ● Separate RNNs for each input type ○ Item ID ○ Image feature vector obtained from CNN (last avg. pooling layer)
  74. 74. Results (internal Netflix dataset)
  75. 75. Notes on Context Modeling
  76. 76. Modeling Context in Traditional Models ● Often hard, challenging ● Mathematical complexities ● Data sparsity aggravates convergence and generalization ● MF/AMF does not even offer a principled way to encode context.
  77. 77. Catalogue Censoring for Country Context in NN ● Create a censored mask with out of catalogue videos ● Mask the output layer (logits) ● Use the masked layer for cross entropy loss. ● Save model energy from figuring out the catalogue differences. play (t-n)... play (t-1)country Soft-max over entire vocabulary Feed Forward
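The masking steps on this slide can be sketched as a softmax cross entropy computed after out-of-catalogue logits are set to negative infinity. This is a NumPy illustration of the idea rather than any particular toolkit's API; `masked_cross_entropy` and the toy values are assumptions.

```python
import numpy as np

def masked_cross_entropy(logits, in_catalog, target):
    """Cross entropy over a softmax restricted to in-catalogue items."""
    masked = np.where(in_catalog, logits, -np.inf)   # censor out-of-catalogue videos
    masked = masked - masked.max()                   # stabilize before exponentiating
    log_probs = masked - np.log(np.exp(masked).sum())
    return -log_probs[target], np.exp(log_probs)

logits = np.array([2.0, 1.0, 0.5, -1.0])
in_catalog = np.array([True, False, True, True])     # video 1 is not offered in this country
loss, probs = masked_cross_entropy(logits, in_catalog, target=2)
```

Censored videos receive exactly zero probability, so they contribute no gradient and the network does not have to learn the catalogue differences itself. The target must be an in-catalogue video.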
  78. 78. Continuous time context ● Continuous serving time value can be used directly. ● No bucketization necessary. ● User features can be enhanced to include time features for the plays. play (t-n)... play (t-1)time Soft-max over entire vocabulary Feed Forward
  79. 79. What about device ? Yet another Random Variable input node Gibbs Sampling derivation
  80. 80. The power of explicit User-item interaction ● MF/AMF perform an explicit user-item interaction ● DeepFM and Wide+Deep style models try to embed factorization machines in their architectures ○ They go for explicit feature interactions ● In conditional models (RNN, CNN etc), features are simply concatenated. ○ You would need a pretty deep network of non linear layers to learn explicit interactions
  81. 81. Latent Cross idea ● Do an explicit interaction of context variables with user features. ● Move context variables closer to prediction layer. Latent Cross: Making use of Context in Recurrent Recommender Systems, Beutel et al, WSDM (2018)
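The explicit interaction in the Latent Cross idea can be sketched as an elementwise product between a hidden state and one plus the sum of the context embeddings, following the form given in the cited paper. The function name and toy vectors are illustrative assumptions.

```python
import numpy as np

def latent_cross(hidden, context_embs):
    """Elementwise modulation: hidden * (1 + sum of context embeddings)."""
    return hidden * (1.0 + np.sum(context_embs, axis=0))

hidden = np.array([1.0, -2.0, 0.5])          # e.g. an RNN state or user representation
context_embs = np.array([
    [0.1, 0.0, -0.2],                         # hypothetical time-of-day embedding
    [0.0, 0.5, 0.2],                          # hypothetical device embedding
])
crossed = latent_cross(hidden, context_embs)
```

With all context embeddings at zero the operation is the identity, so context acts as a learned multiplicative adjustment rather than just another concatenated input.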
  82. 82. NN provides simplicity: V/S
  83. 83. Some Notes on Interpretability
  84. 84. Interpreting a CNN CF Model ● Deeper CNN layers have discovered higher level features in images: ○ Edges ○ Faces etc ● What would a CNN learn if it is trained on user-item interaction dataset? ○ Can it discover semantic topics ?
  85. 85. Interpreting a CNN CF Model Horror Filter Kids Filter Narcotics Filter
  86. 86. Reinforcement Learning in RecSys
  87. 87. RecSys has a very big small data problem ● Industrial RecSys (Netflix, Youtube, ..) deals with very large action space. ● Many millions (billions) of users and many millions of items. ● Each user, however, interacts with only finite items, making user-item patterns very sparse.
  88. 88. RecSys has a non stationary problem ● The underlying dynamics are constantly changing: ○ New items come in ○ Old items go out ○ Popularities go up and down ○ Members travel from one country to another ○ Users’ tastes change with time
  89. 89. RecSys’ main focus ● Address sparsity ● Address temporal dynamics ● .. But mostly from the point of view of maximizing short term rewards. ○ Often myopic ■ Watch minutes for the next play ○ Business often cares about long term user rewards: ■ Satisfaction ⇒ Joy ⇒ Member retention
  90. 90. RecSys meets Reinforcement Learning (RL) ● Reinforcement Learning is a framework to optimize for long term rewards: ○ Making robots walk ○ Learning a game of Go .. ● It’s time for us to marry RecSys with RL if we want to optimize for long term member satisfaction.
  91. 91. Some preliminaries: Markov Decision Process (MDP) Everything we know about the user Our Recs Changing User preferences Long term reward that our user gives us Some starting point / user state Penalizer for achieving the long term reward late
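The labels on this slide match the usual ingredients of a discounted MDP objective. One standard way to write it is shown below; the symbols are assumed here rather than taken from the slide: $s_t$ is everything we know about the user, $a_t$ is our recs, $P$ captures the changing user preferences, $r$ is the reward the user gives us, $\rho_0$ is the starting user state, and $\gamma \in (0, 1]$ is the penalizer for achieving the reward late.

```latex
\max_{\pi} \; J(\pi)
  = \mathbb{E}_{\tau \sim \pi}\!\left[ \sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t) \right],
\qquad
s_0 \sim \rho_0, \quad
a_t \sim \pi(\cdot \mid s_t), \quad
s_{t+1} \sim P(\cdot \mid s_t, a_t)
```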
  92. 92. Policy, π = RecSys model in RL setting ● Maximization can be done using our good old friend -- SGD ● We need the gradient of the E[] term ● But first: ● Suppose we have many, many user trajectories under π, i.e. a user joins, we recommend, the user watches and gets satisfaction, we update our recs, the user watches again and gets more satisfaction … ● Then we want the π that best learns how to update the recs so as to maximize accumulated satisfaction
  93. 93. The log trick reveals: RL = Supervised Learning ● Weighted log likelihood ● Typically, choose weights to be durations ● Other considerations: ○ User’s explicit feedback ○ User’s churn Top-K Off-Policy Correction for a REINFORCE Recommender System, Chen et al, WSDM (2019)
  94. 94. A tractable framework for systematic exploration ● RL ⇒ supervised learning: keeps the problem tractable while still allowing systematic explore and exploit. ● More robust recommender systems. ● Reward and action-state can co-exist ○ Rating = reward ○ Play = action-state ● It’s off policy though ⇒ needs some correction terms, which is often a challenge
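The log trick referred to here is the standard REINFORCE identity. A short derivation, assuming a policy $\pi_\theta$ with parameters $\theta$, trajectories $\tau$ with total reward $R(\tau)$, and $N$ logged trajectories (this notation is an assumption, not the slide's own):

```latex
\nabla_\theta \, \mathbb{E}_{\tau \sim \pi_\theta}\big[ R(\tau) \big]
  = \sum_{\tau} R(\tau)\, \nabla_\theta \pi_\theta(\tau)
  = \sum_{\tau} R(\tau)\, \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)
  = \mathbb{E}_{\tau \sim \pi_\theta}\big[ R(\tau)\, \nabla_\theta \log \pi_\theta(\tau) \big]
  \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t} R^{(n)}_{t}\,
      \nabla_\theta \log \pi_\theta\big(a^{(n)}_{t} \mid s^{(n)}_{t}\big)
```

The transition probabilities do not depend on $\theta$, so they drop out of $\nabla_\theta \log \pi_\theta(\tau)$. What remains is the gradient of a log likelihood of the logged actions, with each term weighted by its reward $R^{(n)}_{t}$ (for example a play duration), which is why the slide calls it a weighted log likelihood.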
  95. 95. Multi Task Learning ● Conflicting objectives ○ Biasing towards relevance ○ Biasing towards popularity ○ Biasing towards videos users like ○ Biasing towards videos users play ● Implicit bias in the data ○ Position bias ● 2 objectives: ○ Engagement ○ Satisfaction Recommending what to watch next: a multitask ranking system, Zhao et al, RecSys (2019)
  96. 96. Conclusion and Gratitude
  97. 97. Some concluding remarks ● Ratings are sparse and noisy and uncalibrated ● With scale, higher capacity models DO work ● However, simplicity still pays off when starting up in a new domain ● Interpretation helps to answer: ○ Why did you recommend me THAT ● RL is the new kid on the block. He is cool. Why ? ○ Myopic versus Long term is a thing to worry about
  98. 98. My gratitude ● Sincere Thanks to the entire LARS organizing committee. ● Thanks to everyone who listened to me for 2 hours !! ● Thanks to Alexandros Karatzoglou (Google) and Netflix Colleagues: Dawen Liang, Ehtsham Elahi and Aish Fenton. THANK YOU !