Deep Learning for
Recommender Systems
Yves Raimond & Justin Basilico
January 25, 2017
Re·Work Deep Learning Summit San Francisco
@moustaki @JustinBasilico
The value of recommendations
● A few seconds to find something
great to watch…
● Can only show a few titles
● Enjoyment directly impacts
customer satisfaction
● Generates over $1B per year of
Netflix revenue
● How? Personalize everything
Deep learning for
recommendations: a first try
Traditional Recommendation Setup
[Figure: binary user × item interaction matrix]
A Matrix Factorization view
[Figure: the interaction matrix factorized as R ≈ UVᵀ, with user factors U and item factors V]
A Feed-Forward Network view
[Figure: the user and item embeddings U and V combined by a feed-forward layer instead of a dot product]
A (deeper) feed-forward view
[Figure: the same embeddings fed through several hidden layers; trained with a mean squared loss?]
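Both views can be written down directly. A minimal sketch, assuming PyTorch (layer sizes and names are illustrative, not from the talk): matrix factorization scores a user–item pair with a dot product of its embeddings, while the feed-forward view learns how to combine those same embeddings.

```python
# Illustrative sketch (PyTorch assumed): MF vs. a feed-forward net over the same embeddings.
import torch
import torch.nn as nn

class MatrixFactorization(nn.Module):
    def __init__(self, n_users, n_items, rank=32):
        super().__init__()
        self.user = nn.Embedding(n_users, rank)
        self.item = nn.Embedding(n_items, rank)

    def forward(self, u, i):
        # Score is the dot product of the user and item embeddings.
        return (self.user(u) * self.item(i)).sum(dim=-1)

class FeedForwardRecommender(nn.Module):
    def __init__(self, n_users, n_items, dim=32, hidden=(64, 32)):
        super().__init__()
        self.user = nn.Embedding(n_users, dim)
        self.item = nn.Embedding(n_items, dim)
        layers, in_dim = [], 2 * dim
        for h in hidden:
            layers += [nn.Linear(in_dim, h), nn.ReLU()]
            in_dim = h
        layers += [nn.Linear(in_dim, 1)]
        self.mlp = nn.Sequential(*layers)

    def forward(self, u, i):
        # The network learns the embedding combination instead of fixing it to a dot product.
        x = torch.cat([self.user(u), self.item(i)], dim=-1)
        return self.mlp(x).squeeze(-1)

# Both can be trained with the same MSE loss on observed (user, item, rating) triples:
# loss = nn.MSELoss()(model(users, items), ratings)
```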
A quick & dirty experiment
● MovieLens-20M
○ Binarized ratings
○ Two weeks validation, two weeks test
● Comparing two models
○ ‘Standard’ MF, with hyperparameters:
■ L2 regularization
■ Rank
○ Feed-forward net, with hyperparameters:
■ L2 regularization (for all layers + embeddings)
■ Embeddings dimensionality
■ Number of hidden layers
■ Hidden layer dimensionalities
■ Activations
● After hyperparameter search for both models, what do we get?
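For illustration only, the hyperparameter search can be pictured as random sampling over the two spaces listed above; the ranges below are hypothetical, not the ones actually used in the experiment.

```python
# Hypothetical search spaces mirroring the bullets above; all ranges are illustrative.
import random

mf_space = {
    "l2": [1e-6, 1e-5, 1e-4, 1e-3, 1e-2],
    "rank": [8, 16, 32, 64, 128],
}

ffnet_space = {
    "l2": [1e-6, 1e-5, 1e-4, 1e-3],       # applied to all layers + embeddings
    "embedding_dim": [8, 16, 32, 64],
    "num_hidden_layers": [1, 2, 3],
    "hidden_dim": [32, 64, 128, 256],
    "activation": ["relu", "tanh", "elu"],
}

def sample(space):
    """Draw one random configuration from a search space."""
    return {name: random.choice(values) for name, values in space.items()}

for _ in range(100):
    config = sample(ffnet_space)
    # ... train and evaluate a model with `config` on the validation split
```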
What’s going on?
● Very similar models: representation learning
through embeddings, MSE loss, gradient-based
optimization
● Main difference is that we can learn a different
embedding combination than a dot product
● … but embeddings are arbitrary representations
● … and capturing pairwise interactions through a
feed-forward net requires a very large model
Conclusion?
● Not much benefit from a deep model over a
properly tuned matrix factorization in the
‘traditional’ recommendation setup
● … Is this talk over?
Breaking the ‘traditional’ recsys setup
● Adding extra data / inputs
● Modeling different facets of users and items
● Alternative framings of the problem
Alternative data
Content-based side information
● VBPR: helping cold-start by augmenting item
factors with visual factors from CNNs [He et al.,
2015]
● Content2Vec [Nedelec et al., 2017]
● Learning to approximate MF item embeddings
from content [Dieleman, 2014]
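As a rough, hedged sketch of the VBPR idea (simplified; the bias terms and the BPR ranking loss from the paper are omitted, and all sizes are illustrative): item factors are augmented with visual factors projected from pre-computed CNN features, so cold-start items still get a content-based score.

```python
# Hedged sketch of augmenting item factors with CNN visual features (VBPR-style).
import torch
import torch.nn as nn

class VisualFactorModel(nn.Module):
    def __init__(self, n_users, n_items, cnn_dim=4096, k=32, k_visual=16):
        super().__init__()
        self.gamma_u = nn.Embedding(n_users, k)         # latent user factors
        self.gamma_i = nn.Embedding(n_items, k)         # latent item factors
        self.theta_u = nn.Embedding(n_users, k_visual)  # visual user factors
        self.proj = nn.Linear(cnn_dim, k_visual)        # projects CNN features to visual factors

    def forward(self, u, i, cnn_features):
        latent = (self.gamma_u(u) * self.gamma_i(i)).sum(-1)
        visual = (self.theta_u(u) * self.proj(cnn_features)).sum(-1)
        return latent + visual  # cold-start items still contribute via the visual term
```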
Metadata-based side information
● Factorization Machines [Rendle, 2010] with
side-information
○ Extending the factorization framework to an arbitrary
number of inputs (sketched below)
● Meta-Prod2Vec [Vasile et al., 2016]
○ Regularize item embeddings using side-information
● DCF [Li et al., 2016]
● Using associated textual information for
recommendations [Bansal et al., 2016]
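A minimal sketch of the second-order Factorization Machine score from the first bullet (assuming NumPy; the feature layout is illustrative): user id, item id, and any side information are all just entries of one feature vector.

```python
# Hedged sketch of a second-order Factorization Machine score (Rendle, 2010).
import numpy as np

def fm_score(x, w0, w, V):
    """x: (n_features,), w0: scalar bias, w: (n_features,), V: (n_features, k)."""
    linear = w0 + w @ x
    # Pairwise interactions sum_{i<j} <V_i, V_j> x_i x_j, computed in O(n*k)
    # with the standard reformulation.
    xv = V.T @ x                    # (k,)
    x2v2 = (V ** 2).T @ (x ** 2)    # (k,)
    pairwise = 0.5 * np.sum(xv ** 2 - x2v2)
    return linear + pairwise

# Example: 10 sparse features (user, item, metadata one-hots), k=4 factors.
rng = np.random.default_rng(0)
n, k = 10, 4
x = np.zeros(n); x[[1, 5, 8]] = 1.0
print(fm_score(x, 0.0, rng.normal(size=n), rng.normal(size=(n, k))))
```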
YouTube Recommendations
● Two-stage ranker:
candidate generation
(shrinking the set of items to
rank) and ranking
(classifying actual
impressions)
● Two feed-forward, fully
connected networks with
hundreds of features
[Covington et al., 2016]
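A hedged sketch of the two-stage structure (not the actual YouTube system): cheap dot-product retrieval over the whole catalog (approximate nearest-neighbor search in practice), then a heavier fully connected ranker applied only to the shortlist.

```python
# Hedged sketch of a two-stage recommender: candidate generation + ranking.
import torch
import torch.nn as nn

def generate_candidates(user_vec, item_matrix, k=500):
    # Stage 1: dot-product retrieval over the catalog
    # (production systems would use approximate nearest-neighbor search).
    scores = item_matrix @ user_vec
    return torch.topk(scores, k).indices

class Ranker(nn.Module):
    # Stage 2: a fully connected net over rich per-impression features.
    def __init__(self, n_features, hidden=(256, 128)):
        super().__init__()
        layers, in_dim = [], n_features
        for h in hidden:
            layers += [nn.Linear(in_dim, h), nn.ReLU()]
            in_dim = h
        layers += [nn.Linear(in_dim, 1)]
        self.net = nn.Sequential(*layers)

    def forward(self, features):
        return self.net(features).squeeze(-1)
```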
Alternative models
Restricted Boltzmann Machines
● RBMs for Collaborative
Filtering [Salakhutdinov,
Mnih & Hinton, 2007]
● Part of the ensemble that
won the $1M Netflix Prize
● Used in our rating
prediction system for
several years
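For intuition, a simplified contrastive-divergence (CD-1) update for a binary RBM is sketched below; the model in [Salakhutdinov, Mnih & Hinton, 2007] uses softmax visible units over ratings with weights tied across users, which is omitted here.

```python
# Simplified sketch: one CD-1 update for a binary RBM on watch vectors.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b_vis, b_hid, lr=0.05):
    """One contrastive-divergence update on a batch of binary visible vectors v0: (batch, n_vis)."""
    # Positive phase: sample hidden units given the data.
    p_h0 = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Negative phase: reconstruct visibles, then recompute hidden probabilities.
    p_v1 = sigmoid(h0 @ W.T + b_vis)
    p_h1 = sigmoid(p_v1 @ W + b_hid)
    # Approximate gradient and parameter update.
    W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / len(v0)
    b_vis += lr * (v0 - p_v1).mean(axis=0)
    b_hid += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_vis, b_hid
```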
Auto-encoders
● RBMs are hard to train
● CF-NADE [Zheng et al., 2016]
○ Define (random) orderings over conditionals
and model with a neural network
● Denoising auto-encoders: CDL [Wang et
al., 2015], CDAE [Wu et al., 2016]
● Variational auto-encoders [Liang et al.,
2017]
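A minimal sketch of a denoising auto-encoder for collaborative filtering in the spirit of CDAE (sizes, activation, and corruption rate are illustrative; the user-specific input node from the paper is omitted):

```python
# Hedged sketch: corrupt a user's binary interaction vector, reconstruct the full vector.
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    def __init__(self, n_items, hidden=200, dropout=0.5):
        super().__init__()
        self.corrupt = nn.Dropout(dropout)      # input corruption
        self.encoder = nn.Linear(n_items, hidden)
        self.decoder = nn.Linear(hidden, n_items)

    def forward(self, x):
        h = torch.tanh(self.encoder(self.corrupt(x)))
        return self.decoder(h)                  # logits over all items

# Training: binary cross-entropy between the reconstruction and the *uncorrupted*
# interaction vector; recommend the highest-scoring unseen items.
```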
(*)2Vec
● Prod2Vec [Grbovic et al., 2015],
Item2Vec [Barkan & Koenigstein,
2016], Pin2Vec [Ma, 2017]
● Item-item co-occurrence
factorization (instead of user-item
factorization)
● The two approaches can be
blended [Liang et al., 2016]
[Figure: prod2vec (skip-gram over item sequences) vs. user2vec (continuous bag of words)]
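The (*)2Vec recipe is essentially word2vec run over interaction sequences instead of sentences. A hedged sketch, assuming gensim ≥ 4 (item ids and parameters are illustrative):

```python
# Hedged sketch of the prod2vec / item2vec recipe: skip-gram over item-id "sentences".
from gensim.models import Word2Vec

sessions = [
    ["item_12", "item_7", "item_42", "item_7"],
    ["item_42", "item_3", "item_12"],
    # ... one list of item ids per user or per session
]

model = Word2Vec(
    sentences=sessions,
    vector_size=64,   # embedding dimensionality
    window=5,         # co-occurrence window within a sequence
    min_count=1,
    sg=1,             # skip-gram, as in prod2vec
)

# Item-item similarity from the learned co-occurrence embeddings:
print(model.wv.most_similar("item_42", topn=5))
```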
Wide + Deep models
● Wide model:
memorize
sparse, specific
rules
● Deep model:
generalize to
similar items via
embeddings
[Cheng et al., 2016]
[Figure: the wide component (many parameters due to cross-product features) and the deep component feeding a joint prediction]
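A hedged sketch of the wide & deep idea, heavily simplified relative to [Cheng et al., 2016] (in practice the deep part consumes many embedded and dense features, not just an item id):

```python
# Hedged sketch: linear "wide" part over sparse cross-product features (memorization)
# plus a "deep" part over embeddings (generalization), trained jointly.
import torch
import torch.nn as nn

class WideAndDeep(nn.Module):
    def __init__(self, n_cross_features, n_items, emb_dim=32, hidden=(128, 64)):
        super().__init__()
        self.wide = nn.Linear(n_cross_features, 1)   # many parameters, sparse input
        self.item_emb = nn.Embedding(n_items, emb_dim)
        layers, in_dim = [], emb_dim
        for h in hidden:
            layers += [nn.Linear(in_dim, h), nn.ReLU()]
            in_dim = h
        layers += [nn.Linear(in_dim, 1)]
        self.deep = nn.Sequential(*layers)

    def forward(self, cross_features, item_ids):
        logit = self.wide(cross_features) + self.deep(self.item_emb(item_ids))
        return torch.sigmoid(logit).squeeze(-1)
```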
Alternative framings
Sequence prediction
● Treat recommendations as a
sequence classification problem
○ Input: sequence of user actions
○ Output: next action
● E.g. GRU4Rec [Hidasi et al., 2016]
○ Input: sequence of items in a
session
○ Output: next item in the session
● Also co-evolution: [Wu et al., 2017],
[Dai et al., 2017]
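A minimal next-item sketch in the spirit of GRU4Rec, assuming PyTorch; the actual model uses session-parallel mini-batches and ranking losses, which are omitted here.

```python
# Hedged sketch: embeddings -> GRU -> softmax over the catalog for next-item prediction.
import torch
import torch.nn as nn

class NextItemGRU(nn.Module):
    def __init__(self, n_items, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.emb = nn.Embedding(n_items, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_items)

    def forward(self, item_seq):
        # item_seq: (batch, seq_len) of item ids observed so far
        h, _ = self.gru(self.emb(item_seq))
        return self.out(h[:, -1])   # logits for the next item

# Train with cross-entropy against the actual next item in each sequence.
```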
Contextual sequence prediction
● Input: sequence of contextual user actions, plus
current context
● Output: probability of next action
● E.g. “Given all the actions a user has taken so far,
what’s the most likely video they’re going to play right
now?”
● E.g. [Smirnova & Vasile, 2017], [Beutel et al., 2018]
Contextual sequence data
[Figure: a per-user sequence of timestamped actions with their context, ordered in time, with the next action to predict marked ‘?’]
Time-sensitive sequence prediction
● Recommendations are actions at a moment in time
○ Proper modeling of time and system dynamics is critical
● Experiment on a Netflix internal dataset
○ Context:
■ Discrete time
● Day-of-week: Sunday, Monday, …
● Hour-of-day
■ Continuous time (Timestamp)
○ Predict next play (temporal split data)
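A hedged sketch of one way to wire in the discrete time context listed above (module names, dimensions, and the choice of a GRU are illustrative, not the internal model): embed day-of-week and hour-of-day, concatenate them with the item embedding at each step, and condition the next-play prediction on the current context.

```python
# Hedged sketch: contextual sequence model with discrete time features.
import torch
import torch.nn as nn

class ContextualSequenceModel(nn.Module):
    def __init__(self, n_items, emb_dim=64, ctx_dim=8, hidden_dim=128):
        super().__init__()
        self.item_emb = nn.Embedding(n_items, emb_dim)
        self.dow_emb = nn.Embedding(7, ctx_dim)     # day-of-week
        self.hour_emb = nn.Embedding(24, ctx_dim)   # hour-of-day
        self.gru = nn.GRU(emb_dim + 2 * ctx_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim + 2 * ctx_dim, n_items)

    def forward(self, items, dows, hours, next_dow, next_hour):
        # items/dows/hours: (batch, seq_len); next_*: (batch,) current context
        x = torch.cat(
            [self.item_emb(items), self.dow_emb(dows), self.hour_emb(hours)], dim=-1)
        h, _ = self.gru(x)
        ctx = torch.cat([self.dow_emb(next_dow), self.hour_emb(next_hour)], dim=-1)
        return self.out(torch.cat([h[:, -1], ctx], dim=-1))  # next-play logits
```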
Results
Other framings
● Causality in recommendations
○ Explicitly modeling the consequence of a recommender system’s intervention
[Schnabel et al., 2016]
● Recommendation as question answering
○ E.g. “I loved Billy Madison, My Neighbor Totoro, Blades of Glory, Bio-Dome,
Clue, and Happy Gilmore. I’m looking for a Music movie.” [Dodge et al., 2016]
● Deep Reinforcement Learning for
recommendations [Zhao et al, 2017]
Conclusion
Takeaways
● Deep Learning can work well for Recommendations... when you go
beyond the classic problem definition
● Similarities between DL and MF are a good thing: Lots of MF work
can be translated to DL
● Lots of open areas to improve recommendations using deep
learning
● Think beyond solving existing problems with new tools; instead, consider
what new problems those tools can solve
More Resources
● RecSys 2017 tutorial by Karatzoglou and Hidasi
● RecSys Summer School slides by Hidasi
● DLRS Workshop 2016, 2017
● Recommenders Shallow/Deep by Sudeep Das
● Survey paper by Zhang, Yao & Sun
● GitHub repo of papers by Nandi
Thank you.
@moustaki @JustinBasilico
Yves Raimond & Justin Basilico
Yes, we’re hiring...
