Tutorial on Deep Learning in Recommender Systems
Anoop Deoras
ACM - LARS, Fortaleza, 10/10/2019
@adeoras
I had a fun time giving a tutorial on the topic of deep learning in recommender systems at the Latin America School on Recommender Systems (LARS) in Fortaleza, Brazil.

1. Tutorial on Deep Learning in Recommender Systems
Anoop Deoras, ACM - LARS, Fortaleza, 10/10/2019, @adeoras

2. Outline of the tutorial
● Models from the Linear Family:
  ■ Matrix Factorization, Asymmetric Matrix Factorization, SLIM and Topic Models, ..
● Models from the Non-Linear Family:
  ■ Variational Autoencoders, Sequence and Convolutional models, ..
● Modeling the Context
● Interpreting the inner workings of a Neural Network Recommender Model
● Reinforcement Learning in RecSys

3. ~150M Members, 190 Countries

4. Personalization
● Recommendation Systems are a means to an end.
● Our primary goal:
  ○ Maximize Netflix members' enjoyment of the selected show
    ■ Enjoyment integrated over time
  ○ Minimize the time it takes to find them
    ■ Interaction cost integrated over time
5. Everything is a recommendation!

6. Ordering of the titles in each row is personalized

7. Selection and placement of the row types is personalized

8. Personalized Images (Profile 1 vs. Profile 2)

9. Personalized Messages

10. Impracticality of showing everything

11. We personalize our recommendations! This talk answers: HOW?
12. Background
● 2006-2009: Netflix Prize:
  ○ >10% improvement, win $1,000,000
● The top performing model(s) ended up being variations of Matrix Factorization [SVD++, Koren et al.]
● Although Netflix's rec system has moved on, MF is still the foundational method on which most collaborative filtering systems are based today.
● Next, let us discuss model evolution and their applicability.

13. Models from the Linear Family
14. Matrix of Observed Ratings
[figure: a users × videos matrix with a handful of observed ratings (1.0, 2.0, 3.0, 4.0, 5.0, …)]
● User-Item rating matrix
● Ratings are explicit feedback
● Very large but sparse

15. Matrix Completion
[figure: the same users × videos matrix with '?' in every unobserved cell]
● The problem is that of completing the matrix.
● Helps us generalize and recommend titles the user has not rated.

16. Factorizing the Matrix
● The problem is that of completing the matrix.
● Helps us generalize and recommend titles the user has not rated.

17. Alternating Least Squares / Gradient Descent
● Minimize the Frobenius norm.
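The slide names the objective but not the procedure. As a concrete illustration (not code from the deck), here is a minimal NumPy sketch of completing a toy rating matrix R ≈ PQᵀ by alternating ridge-regularized least-squares solves; the toy matrix, latent dimension k, and regularization weight lam are all illustrative assumptions.

```python
import numpy as np

# Toy observed-rating matrix (0 = unobserved), users x videos.
R = np.array([[1.0, 0.0, 0.0, 2.0, 0.0],
              [3.0, 0.0, 0.0, 0.0, 0.0],
              [0.0, 4.0, 0.0, 5.0, 1.0],
              [0.0, 0.0, 2.0, 0.0, 0.0]])
mask = R > 0                      # which entries are observed
k, lam = 2, 0.1                   # latent dimension and L2 weight (illustrative)
rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(R.shape[0], k))   # user factors
Q = rng.normal(scale=0.1, size=(R.shape[1], k))   # item factors

for _ in range(20):               # alternate closed-form least-squares solves
    for u in range(R.shape[0]):   # fix Q, solve for each user factor
        idx = mask[u]
        A = Q[idx].T @ Q[idx] + lam * np.eye(k)
        P[u] = np.linalg.solve(A, Q[idx].T @ R[u, idx])
    for i in range(R.shape[1]):   # fix P, solve for each item factor
        idx = mask[:, i]
        A = P[idx].T @ P[idx] + lam * np.eye(k)
        Q[i] = np.linalg.solve(A, P[idx].T @ R[idx, i])

print(np.round(P @ Q.T, 2))       # completed matrix: score(u, i) = p_u . q_i
```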
18. Problems with Explicit Feedback
● Ratings are extremely hard to get.
  ○ Signal becomes sparser
  ○ Netflix data was 0.1% dense !!
● They are not calibrated.
  ○ Some users will never give a 5
  ○ Some users will always give a 5

19. From Explicit to Implicit
[figure: a users × videos matrix with 1s marking observed plays (preferences)]
● User-Item preference matrix
● Values are implicit feedback
● Very large but sparse.
● The problem is that of completing the matrix.
● Helps us generalize and recommend titles the user has not interacted with yet.
● NOTE: you will need negatives (0s)

20. Alternating Least Squares / Gradient Descent

21. Scoring
[figure: the score for user i and video j is the linear interaction of the user factor (row i of P, over users) with the item factor (row j of Q, over videos)]

22. How does one obtain User factors?
● We cannot possibly store user latent vectors because:
  ○ There are potentially too many users, and we train on a subset
  ○ New users come to the service
● At runtime, what do we know?
  ○ Item latent representations and the user's ratings/interactions (e.g. plays)

23. Alternating Least Squares -- "Fold-In"
[figure: a single user's observed-plays vector over the videos]
● User-Item preference vector
● Values are implicit feedback
● Just run the least-squares optimization during serving
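As a concrete illustration of fold-in, here is a small NumPy sketch with a hypothetical fold_in helper: the trained item factors Q are frozen, and only the new user's least-squares problem is solved at serving time. The regularization weight and the all-ones implicit targets are assumptions, not the deck's exact formulation.

```python
import numpy as np

def fold_in(Q, played_items, lam=0.1):
    """Solve the per-user least-squares problem at serving time.

    Q            : (num_items, k) frozen item-factor matrix from training
    played_items : indices of items the user interacted with (implicit 1s)
    Returns the user's latent factor p_u.
    """
    Qi = Q[played_items]                       # factors of the played items
    A = Qi.T @ Qi + lam * np.eye(Q.shape[1])   # ridge-regularized normal equations
    b = Qi.T @ np.ones(len(played_items))      # implicit targets are all 1
    return np.linalg.solve(A, b)

# Example: score all items for a user who played items 0 and 3.
Q = np.random.default_rng(0).normal(size=(5, 2))
p_u = fold_in(Q, [0, 3])
scores = Q @ p_u                               # rank items by p_u . q_i
```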
24. Asymmetric Matrix Factorization
● Similar setup to MF.
● Start with a sparse matrix with implicit/explicit feedback
● Similar goal: approximate R with the product of two latent matrices

25. Asymmetric Matrix Factorization (AMF)
[figure: MF scores with a learned user factor; AMF replaces it with a video embedding aggregated over the user's history, N(U) being all the videos U played]

26. AMF viewed as successive matrix ops
[figure: R approximated by an indicator (play) matrix multiplied by an item embedding, then by a second item embedding]
● Similar setup to MF.
● Start with a sparse matrix with implicit/explicit feedback
● Similar goal: approximate R with the product of two latent matrices

27. AMF as a linear Neural Network

28. No Least Squares solver necessary
● Look up the plays/interactions the user had with items in the catalog of interest.
  ○ Average the latent representations of those items
  ○ That's it.
● Frees you up from the tedious (alternating) least-squares optimization during inference / serving.
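A minimal NumPy sketch of that serving recipe, assuming two illustrative item-embedding tables A (history side) and B (scoring side): the user representation is just the mean of the embeddings of the items in the play history, so no solver is needed.

```python
import numpy as np

# AMF-style serving sketch: the "user factor" is the average of input-side
# item embeddings over the user's play history.
num_items, k = 5, 2
rng = np.random.default_rng(0)
A = rng.normal(size=(num_items, k))   # input-side item embeddings (history encoder)
B = rng.normal(size=(num_items, k))   # output-side item embeddings (scoring)

history = [0, 3]                      # items the user played, N(U)
p_u = A[history].mean(axis=0)         # user representation from history alone
scores = B @ p_u                      # score every item; no per-user solve needed
```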
29. SLIM (Sparse Linear Method)
[figure: R ≈ I(R) · Y, where Y is an items × items weight matrix whose diagonal is replaced with zeros; contrasted with the AMF factorization]
● SLIM: Sparse Linear Methods for Top-N Recommendation, Ning et al., ICDM 2011
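For illustration, one common way to fit SLIM-style item-item weights is one elastic-net regression per item column with non-negative weights and a zero diagonal. The sketch below uses scikit-learn's ElasticNet; the hyperparameters and the mapping of the L1/L2 strengths onto alpha/l1_ratio are illustrative, not the paper's exact setup.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def fit_slim(R, l1=0.001, l2=0.001):
    """Fit a SLIM-style item-item weight matrix column by column.

    R : (num_users, num_items) binary play matrix.
    Each item's column is regressed on all other items' columns with an
    L1+L2 penalty and non-negative weights; the diagonal is forced to zero.
    """
    n_items = R.shape[1]
    W = np.zeros((n_items, n_items))
    model = ElasticNet(alpha=l1 + l2, l1_ratio=l1 / (l1 + l2),
                       positive=True, fit_intercept=False, max_iter=1000)
    for j in range(n_items):
        X = R.copy()
        X[:, j] = 0                      # exclude the item itself (zero diagonal)
        model.fit(X, R[:, j])
        W[:, j] = model.coef_
    return W

R = (np.random.default_rng(0).random((50, 8)) > 0.7).astype(float)
W = fit_slim(R)
scores = R @ W                           # recommend unseen items with high scores
```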
30. Basic Intuition behind Soft Clustering Models
● Imagine you walked into a room full of movie enthusiasts, from all over the world, from all walks of life, and your goal was to come out with a great movie recommendation.
● Would you take a popular vote? Would that satisfy you?

31. Basic Intuition behind Soft Clustering Models
● Now consider forming groups of people with similar taste, based on the videos that they previously enjoyed.

32. Basic Intuition behind Soft Clustering Models
● Describe yourself using what you have watched.
● Try to associate yourself with these groups and obtain a weighted "personalized" popularity vote.

33. User's distribution over the topics
[figure: a user's topic proportions, e.g. 0.15, 0.63, 0.22]

34. Topic's internal distribution over videos
[figure: a topic's probabilities over individual videos, e.g. 0.05, 0.09, 0.10]

35. Topic Models (Latent Dirichlet Allocation)
[plate diagram: Dirichlet prior α over each of U users' taste θ; prior β over the K topic-video distributions φ; each of a user's P plays v is drawn from a topic t]
● Convex combinations of topic proportions and of movie proportions within each topic

36. Topic Models (LDA): Scoring
[figure: the score for user i and video j combines the user's distribution over topics (row i of P, over users) with each topic's conditional distribution over videos (column j of Q)]
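The scoring itself is exactly the convex combination described above; a tiny NumPy sketch with made-up topic proportions:

```python
import numpy as np

# The probability of video j for user u is a convex combination of the user's
# topic proportions and each topic's distribution over videos (illustrative numbers).
theta_u = np.array([0.15, 0.63, 0.22])        # user's distribution over 3 topics
phi = np.array([[0.05, 0.60, 0.35],           # topic 0's distribution over 3 videos
                [0.09, 0.01, 0.90],           # topic 1
                [0.10, 0.80, 0.10]])          # topic 2
scores = theta_u @ phi                         # P(video | user); sums to 1
```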
37. LDA as a special case of MF
[figure: the MF and LDA scoring equations side by side]

38. Relating all the models
● LDA as a special case of MF
  ○ The latent factors are probability distributions
● AMF is equivalent to a 1-hidden-layer linear feedforward network
● AMF as a special case of MF
● SLIM as a special case of AMF

39. Contextualizing these models
● Netflix/YouTube/.. use case:
  ○ Want to model country, time of day, day of week, device, ..
● Country as the context, some challenges:
  ○ Each country offers a different catalog. How do we model it?
● Time of day, day of week as the context, some challenges:
  ○ Discrete or continuous variables?

40. Country as the context in LDA models
● Country A catalog vs. Country B catalog: users in Country A play both Friends and HIMYM; users in Country B cannot play both.
● The model is forced to split the HIMYM plays:
  ○ topic k: high mass on both HIMYM and Friends
  ○ topic j: high mass on HIMYM alone
● Outcome: parameters are being consumed to explain catalog differences.

41. Catalogue Censoring in Topic Models
[plate diagram: the LDA model augmented with a per-user censoring pattern c and catalogue mask m]
● Global Recommendation System for Overlapping Media Catalogue, Todd et al., US Patent Application

42. Censoring in MF/AMF?
● Censored multinomials cannot be formed as elegantly as in LDA.
● We can sample negatives only from the country's catalog.
  ○ Why waste the model's energy demoting titles the user will never see anyway?

43. Time context in Topic Models
[plate diagram: the LDA model augmented with an observed play time t drawn from a per-topic distribution µ]
● Topics over Time: A Non-Markov Continuous-Time Model of Topic Trends, Wang et al., KDD 2006

44. SIMPLE !

45. Fully contextualizing Topic Models
[plate diagram: the LDA model augmented with both the observed time t (with µ) and the censoring pattern c with mask m]

46. What about device?
● Yet another random variable
● Gibbs sampling derivation
● Never mind
47. Let's enter the world of non-linearity
● Let us make LDA non-linear -- aka the Variational Autoencoder
● Let us make AMF non-linear -- aka the Autoencoder
● Let us go beyond the world of generative models.
  ○ How about conditional models?
    ■ Ffwd, LSTMs, CNNs …
● Let's make these models context ready.
● Let's talk about Reinforcement Learning in RecSys

48. Models from the Non-Linear Family

49. Why use DL in RecSys?
● Better generalization beyond linear models for user-item interactions.
● Unified representation of heterogeneous signals (e.g. add image/audio/textual content as side information to item embeddings via convolutional NNs).
● Exploitation of sequential information in the actions leading up to a recommendation (e.g. an LSTM on viewing/purchase/search history to predict what will be watched/purchased/searched next).
● DL toolkits provide unprecedented flexibility in experimenting with loss functions (e.g. in toolkits like TensorFlow/PyTorch/Keras, switching from a classification loss to a ranking loss is trivial; the optimization is taken care of).

50. The most attractive properties of NNs
● End-to-end differentiable
  ○ Helpful in the RL setting, for instance
● Provide suitable inductive biases catered to the input data
● NNs compose → a gigantic end-to-end differentiable NN
  ○ Toolkits such as TensorFlow make this simple to implement
● Indispensable for multi-modal data, such as text and images
  ○ News recommendation, for instance

51. RecSys has a BIG small data problem
● While we may have millions (even billions) of users and millions of items, RecSys is nevertheless a small data problem
  ○ Each user interacts with only a finite number of items
● Neural networks naturally project discrete tokens into a continuous space
  ○ Collaborative filtering in continuous space

52. Drawbacks
● Lack of interpretability
  ○ Although we have had some breakthroughs at Netflix
● Needs Big Data
  ○ DL models, being very high capacity models, flourish under large training datasets.
  ○ Often poor performance is reported on small setups.
● Hyperparameter tuning
  ○ Hyperparameters are unique to the problem setting.

53. NN models categorized into 4 main categories
● AutoEncoder like
● Matrix Factorization like
● Conditional Models (Language model) like
● Hybrids

54. AutoEncoder Family
[figure: a k-hot vector of plays (t-n) … (t-1) fed through a feed-forward network to a reconstruction of the same vector]
● General idea: reconstruct the input.
● Reconstruction helps generalize to unseen items.
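A minimal PyTorch sketch of this family (an illustrative denoising autoencoder over a k-hot play vector, not the exact architecture of any one paper); the layer sizes, tanh non-linearity, and dropout-based input corruption are assumptions.

```python
import torch
import torch.nn as nn

class PlayAutoencoder(nn.Module):
    """Reconstruct a user's k-hot play vector from a (noisy) version of itself."""
    def __init__(self, num_items, hidden=200, dropout=0.5):
        super().__init__()
        self.corrupt = nn.Dropout(dropout)          # denoising: drop some input plays
        self.encode = nn.Linear(num_items, hidden)
        self.decode = nn.Linear(hidden, num_items)

    def forward(self, x):
        h = torch.tanh(self.encode(self.corrupt(x)))
        return self.decode(h)                        # logits over all items

model = PlayAutoencoder(num_items=10000)
x = torch.zeros(1, 10000)
x[0, [3, 42, 977]] = 1.0                             # the user's plays (k-hot)
loss = nn.functional.binary_cross_entropy_with_logits(model(x), x)
loss.backward()                                      # train to reconstruct; recommend by ranking logits of unseen items
```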
55. Earliest adaptation: Restricted Boltzmann Machines
● (Figure from a recent presentation by Alexandros Karatzoglou)
● One hidden layer.
● The user's feedback on the items interacted with is propagated back to all items.
● Very similar to an autoencoder!

56. Collaborative Denoising Auto-Encoder
● Treats the feedback on the items y that user U has interacted with (input layer) as a noisy version of the user's preferences over all items (output layer)
● Introduces a user-specific input node and hidden bias node, while the item weights are shared across all users.
● Collaborative Denoising Auto-Encoders for Top-N Recommender Systems, Wu et al., WSDM 2016

57. From AMF to a Neural Network
[figure: the AMF matrix product (AMF) with non-linearities inserted between the layers (NN)]

58. Variational Autoencoders
[figure: an encoder DNN maps the user's play vector u to the 𝞵 and 𝞼 of a latent taste vector z_u; the decoder maps z_u to a softmax over the entire vocabulary]
● Variational Autoencoders for Collaborative Filtering, Liang et al.
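A compact PyTorch sketch of the idea behind Liang et al.'s model, not their exact architecture: encode the play vector into a Gaussian over a latent taste vector z_u, sample via the reparameterization trick, and decode into a softmax over the whole catalog. Layer sizes and the KL weight are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultVAE(nn.Module):
    def __init__(self, num_items, latent=64):
        super().__init__()
        self.enc = nn.Linear(num_items, 2 * latent)   # outputs [mu, log-variance]
        self.dec = nn.Linear(latent, num_items)

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization trick
        logits = self.dec(z)                                      # softmax over the entire vocabulary
        return logits, mu, logvar

model = MultVAE(num_items=10000)
x = torch.zeros(1, 10000)
x[0, [3, 42, 977]] = 1.0
logits, mu, logvar = model(x)
recon = -(F.log_softmax(logits, dim=-1) * x).sum()                # multinomial log-likelihood
kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum()         # KL to the unit-Gaussian prior
loss = recon + 0.2 * kl                                           # beta-weighted ELBO (beta is illustrative)
loss.backward()
```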
59. Non-Linear Factorization Models
[figure: a user/context tower (feed-forward over user and context) and an item tower (RNN/feed-forward over plays (t-n) … (t-1), context, and item metadata) combined to give the probability of a play]

60. Neural Network Matrix Factorization
● Treats Matrix Factorization from a non-linearity perspective.
● Neural Network Matrix Factorization, Dziugaite et al., arXiv 2015

61. Deep Factorization Machines
● Treats the Factorization Machine from a non-linearity perspective.
● Higher-order feature interactions -- deep NN
● Lower-order feature interactions -- Factorization Machine
● DeepFM: A Factorization-Machine based Neural Network for CTR Prediction, Guo et al., IJCAI 2017

62. Wide + Deep Models for Recommendations
● In a recommender setting, you may want to train with a wide set of cross-product feature transformations, so that the model essentially memorizes these sparse feature combinations (rules).
[figure: example memorized rules, annotated "Meh!" and "Yay!"]
● Wide and Deep Learning for Recommender Systems, Cheng et al., RecSys 2016

63. Wide + Deep Models for Recommendations (W+D)
● On the other hand, you may want the ability to generalize using the representational power of a deep network. But deep nets can over-generalize.

64. Wide + Deep = Memorization + Generalization
● Best of both worlds: jointly train a deep + wide network.
● The cross-feature transformation in the wide model component can memorize all those sparse, specific rules, while the deep model component can generalize to similar items via embeddings.
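A minimal PyTorch sketch of the jointly trained wide + deep logit described above; the feature choices, layer sizes, and the single-id deep tower are simplifications for illustration, not Google's implementation.

```python
import torch
import torch.nn as nn

class WideAndDeep(nn.Module):
    """Sum the logit of a wide linear model (over sparse cross features)
    and a deep MLP (over dense embeddings of categorical features)."""
    def __init__(self, num_cross_features, num_ids, embed_dim=32):
        super().__init__()
        self.wide = nn.Linear(num_cross_features, 1)        # memorization
        self.embed = nn.Embedding(num_ids, embed_dim)       # generalization
        self.deep = nn.Sequential(
            nn.Linear(embed_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, cross_x, id_x):
        # cross_x: (batch, num_cross_features) multi-hot cross-product features
        # id_x:    (batch,) categorical id (e.g. the candidate item)
        return self.wide(cross_x) + self.deep(self.embed(id_x))   # joint logit

model = WideAndDeep(num_cross_features=500, num_ids=10000)
logit = model(torch.zeros(4, 500), torch.tensor([1, 7, 42, 99]))
prob = torch.sigmoid(logit)                                       # e.g. P(install | impression)
```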
65. Wide + Deep Models in a Google Product
● Wide + Deep model for app recommendations.

66. Generative versus Conditional Models
● All the models we have seen so far were generative models
● They describe the data, the observations
● Often we care about knowing the next play a user will watch, or the next thing our user will buy
● How about conditional models?
  ○ Model the probability of the next play directly
  ○ Prob(play | everything we know about the user at time t)
● Borrow from the language modeling community
  ○ Prob(next word | all the words before)

67. Neural Multi-Class Models
[figure: N-GRAM and BoW-n variants -- plays (t-n) … (t-1) plus context fed through a feed-forward network into a softmax over the entire vocabulary, modeling P(next-video | <user, cntxt>)]
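A small PyTorch sketch of the BoW-n variant: average the embeddings of the last n plays, concatenate a context embedding, and train a softmax over the full item vocabulary. All sizes and the context encoding are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NextPlayModel(nn.Module):
    """Bag-of-plays feed-forward model for P(next-video | user history, context)."""
    def __init__(self, num_items, num_contexts, dim=64):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, dim)
        self.ctx_emb = nn.Embedding(num_contexts, dim)
        self.hidden = nn.Linear(2 * dim, dim)
        self.out = nn.Linear(dim, num_items)      # softmax over the entire vocabulary

    def forward(self, history, ctx):
        h = self.item_emb(history).mean(dim=1)    # average the last n plays (BoW-n)
        x = torch.cat([h, self.ctx_emb(ctx)], dim=-1)
        return self.out(torch.relu(self.hidden(x)))

model = NextPlayModel(num_items=10000, num_contexts=8)
history = torch.tensor([[3, 42, 977]])            # the user's last plays
ctx = torch.tensor([2])                           # e.g. a device or time-of-day bucket
loss = F.cross_entropy(model(history, ctx), torch.tensor([5]))   # next play is item 5
loss.backward()
```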
68. The YouTube Recommendation model
A two-stage approach with two deep networks:
● The candidate generation network takes events from the user's YouTube activity history as input and retrieves a small subset (hundreds) of videos from a large corpus. These candidates are intended to be generally relevant to the user with high precision. The candidate generation network only provides broad personalization via collaborative filtering.
● The ranking network scores each video according to a desired objective function using a rich set of features describing the video and the user. The highest-scoring videos are presented to the user, ranked by their score.
● Deep Neural Networks for YouTube Recommendations, Covington et al., RecSys 2016

69. The YouTube Recommendation model -- Stage One
Deep candidate generation model architecture:
● Embedded sparse features are concatenated with dense features. Embeddings are averaged before concatenation to transform variable-sized bags of sparse IDs into fixed-width vectors suitable as input to the hidden layers.
● All hidden layers are fully connected.
● In training, a cross-entropy loss is minimized with gradient descent on the output of the sampled softmax.
● At serving, an approximate nearest neighbor lookup is performed to generate hundreds of candidate video recommendations.

70. The YouTube Recommendation model -- Stage Two
Deep ranking network architecture:
● Uses embedded categorical features (both univalent and multivalent) with shared embeddings, and powers of normalized continuous features.
● All layers are fully connected. In practice, hundreds of features are fed into the network.

71. Neural Multi-Class Sequential Models
[figure: RNN family (recurrent state over plays) and CNN family (convolution over the last n plays), each feeding a softmax over the entire vocabulary to model P(next-video | <user, cntxt>)]

72. Session-based recommendation with Recurrent Neural Networks (GRU4Rec)
● Treat each user session as a sequence of clicks
● Predict the next item in the session sequence
● Session-based recommendation with Recurrent Neural Networks, Hidasi et al., ICLR 2016
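A minimal PyTorch sketch in the GRU4Rec spirit: a GRU over the session's clicks feeding a softmax over the item vocabulary. The original paper additionally uses session-parallel mini-batches and ranking losses, which are omitted here; all sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GRUNextItem(nn.Module):
    """Session model: a GRU over the click sequence, then a softmax over items."""
    def __init__(self, num_items, dim=64):
        super().__init__()
        self.emb = nn.Embedding(num_items, dim)
        self.gru = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, num_items)

    def forward(self, clicks):
        h, _ = self.gru(self.emb(clicks))     # hidden state at every step
        return self.out(h[:, -1])             # logits for the next item

model = GRUNextItem(num_items=10000)
session = torch.tensor([[3, 42, 977]])        # clicks so far in the session
loss = F.cross_entropy(model(session), torch.tensor([5]))   # next click is item 5
loss.backward()
```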
73. Adding Item metadata to GRU4Rec: Parallel RNN
● Separate RNNs for each input type
  ○ Item ID
  ○ Image feature vector obtained from a CNN (last average-pooling layer)

74. Results (internal Netflix dataset)

75. Notes on Context Modeling

76. Modeling Context in Traditional Models
● Often hard, challenging
● Mathematical complexities
● Data sparsity aggravates convergence and generalization
● MF/AMF does not even offer a principled way to encode context.

77. Catalogue Censoring for Country Context in NN
[figure: plays (t-n) … (t-1) plus country fed through a feed-forward network into a softmax over the entire vocabulary]
● Create a censoring mask from the out-of-catalogue videos
● Mask the output layer (logits)
● Use the masked layer for the cross-entropy loss.
● Saves the model's energy from figuring out the catalogue differences.
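A minimal PyTorch sketch of the masking step: out-of-catalogue logits are set to -inf before the softmax / cross-entropy, so those items receive zero probability and no gradient pressure. The toy mask and sizes are illustrative.

```python
import torch
import torch.nn.functional as F

# Catalogue censoring on the output layer of a next-play model.
num_items = 6
logits = torch.randn(1, num_items)                         # output of the network
in_catalogue = torch.tensor([[1, 1, 0, 1, 0, 1]], dtype=torch.bool)

masked_logits = logits.masked_fill(~in_catalogue, float("-inf"))
loss = F.cross_entropy(masked_logits, torch.tensor([3]))   # target must be in-catalogue
probs = torch.softmax(masked_logits, dim=-1)               # out-of-catalogue items get 0 probability
```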
78. Continuous time context
[figure: plays (t-n) … (t-1) plus time fed through a feed-forward network into a softmax over the entire vocabulary]
● The continuous serving-time value can be used directly.
● No bucketization necessary.
● User features can be enhanced to include time features for the plays.

79. What about device?
● Yet another random variable → just another input node; no Gibbs sampling derivation needed.

80. The power of explicit User-item interaction
● MF/AMF does an explicit user-item interaction
● DeepFM and Wide+Deep style models try to embed factorization machines in their architectures
  ○ They go for explicit feature interactions
● In conditional models (RNN, CNN, etc.), features are simply concatenated.
  ○ You would need a pretty deep network of non-linear layers to learn explicit interactions

81. Latent Cross idea
● Do an explicit interaction of context variables with user features.
● Move context variables closer to the prediction layer.
● Latent Cross: Making Use of Context in Recurrent Recommender Systems, Beutel et al., WSDM 2018
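A minimal PyTorch sketch of a multiplicative (1 + context) interaction with the hidden user state just before the output layer, in the spirit of the latent-cross idea; the dimensions and the specific gating form are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Instead of concatenating the context, multiply its embedding elementwise
# with the hidden user state right before prediction: an explicit interaction.
dim, num_items, num_contexts = 64, 10000, 8
ctx_emb = nn.Embedding(num_contexts, dim)
out = nn.Linear(dim, num_items)

hidden = torch.randn(1, dim)                  # user state from an RNN / feed-forward tower
ctx = torch.tensor([2])                       # e.g. a device id
crossed = hidden * (1 + ctx_emb(ctx))         # multiplicative (1 + context) cross
logits = out(crossed)                         # softmax over the item vocabulary
```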
82. NN provides simplicity
[figure: the fully contextualized graphical model versus a neural network that simply takes the context as additional inputs]

83. Some Notes on Interpretability

84. Interpreting a CNN CF Model
● Deeper CNN layers have discovered higher-level features in images:
  ○ Edges
  ○ Faces, etc.
● What would a CNN learn if it is trained on a user-item interaction dataset?
  ○ Can it discover semantic topics?

85. Interpreting a CNN CF Model
[figure: learned filters that respond to horror, kids, and narcotics titles]

86. Reinforcement Learning in RecSys

87. RecSys has a very big small data problem
● Industrial RecSys (Netflix, YouTube, ..) deals with a very large action space.
● Many millions (billions) of users and many millions of items.
● Each user, however, interacts with only finitely many items, making user-item patterns very sparse.

88. RecSys has a non-stationarity problem
● The underlying dynamics are constantly changing:
  ○ New items come in
  ○ Old items go out
  ○ Popularities go up and down
  ○ Members travel from one country to another
  ○ Users' tastes change with time

89. RecSys' main focus
● Address sparsity
● Address temporal dynamics
● .. but mostly from the point of view of maximizing short-term rewards.
  ○ Often myopic
    ■ Watch minutes for the next play
  ○ Business often cares about long-term user rewards:
    ■ Satisfaction ⇒ Joy ⇒ Member retention

90. RecSys meets Reinforcement Learning (RL)
● Reinforcement Learning is a framework for optimizing long-term rewards:
  ○ Making robots walk
  ○ Learning the game of Go ..
● It is time for us to marry RecSys with RL if we want to optimize for long-term member satisfaction.

91. Some preliminaries: Markov Decision Process (MDP)
[figure: the MDP ingredients mapped onto RecSys -- state: everything we know about the user; actions: our recs; transitions: changing user preferences; reward: the long-term reward the user gives us; start state: some starting point / user state; discount: a penalizer for achieving the long-term reward late]

92. Policy, π = the RecSys model in the RL setting
● The maximization can be done using our good old friend -- SGD
● We need the gradient of the E[] term
● But first:
  ○ [equation on the slide]
● If we have many, many user trajectories under π, i.e. a user joins, we recommend, he/she watches, gets satisfaction, we update our recs, the user watches again, gets more satisfaction ….
● Then we want the π which best learns how to update the recs so as to maximize accumulated satisfaction.

93. The log trick reveals: RL = Supervised Learning
● Weighted log likelihood
● Typically, choose the weights to be durations
● Other considerations:
  ○ User's explicit feedback
  ○ User's churn
● Top-K Off-Policy Correction for a REINFORCE Recommender System, Chen et al., WSDM 2019
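A minimal PyTorch sketch of that supervised-learning view: a reward-weighted negative log likelihood over logged (state, recommended item, reward) examples, with watch durations as the weights. The off-policy correction terms from Chen et al. are omitted, and the numbers are illustrative.

```python
import torch
import torch.nn.functional as F

# The policy-gradient objective, after the log trick, becomes a reward-weighted
# log likelihood over logged impressions -- i.e. weighted supervised learning.
num_items = 10000
logits = torch.randn(3, num_items, requires_grad=True)   # policy outputs for 3 logged impressions
actions = torch.tensor([5, 42, 977])                      # items the user actually played
rewards = torch.tensor([12.0, 0.5, 33.0])                 # e.g. watch durations as weights

log_probs = F.log_softmax(logits, dim=-1)[torch.arange(3), actions]
loss = -(rewards * log_probs).mean()                      # reward-weighted negative log likelihood
loss.backward()
```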
94. A tractable framework to do systematic exploration
● RL ⇒ supervised learning: lets us stay tractable and yet do systematic explore and exploit.
● More robust recommender systems.
● Reward and action-state can co-exist
  ○ Rating = reward
  ○ Play = action-state
● It's off-policy, though ⇒ we need some correction terms. Often a challenge.

95. Multi-Task Learning
● Conflicting objectives
  ○ Biasing towards relevance
  ○ Biasing towards popularity
  ○ Biasing towards videos users like
  ○ Biasing towards videos users play
● Implicit bias in the data
  ○ Position bias
● 2 objectives:
  ○ Engagement
  ○ Satisfaction
● Recommending What to Watch Next: A Multitask Ranking System, Zhao et al., RecSys 2019

96. Conclusion and Gratitude

97. Some concluding remarks
● Ratings are sparse, noisy, and uncalibrated
● With scale, higher-capacity models DO work
● However, simplicity still pays off when starting up in a new domain
● Interpretation helps answer:
  ○ Why did you recommend me THAT?
● RL is the new kid on the block. He is cool. Why?
  ○ Myopic versus long term is a thing to worry about

98. My gratitude
● Sincere thanks to the entire LARS organizing committee.
● Thanks to everyone who listened to me for 2 hours !!
● Thanks to Alexandros Karatzoglou (Google) and Netflix colleagues: Dawen Liang, Ehtsham Elahi and Aish Fenton.
THANK YOU !