Recent Trends in
Personalization:
A Netflix Perspective
Justin Basilico
ICML 2019 Adaptive & Multi-Task Learning Workshop
2019-06-15
@JustinBasilico
Why do we personalize?
Help members find content to watch and enjoy, to maximize member satisfaction and retention
Spark joy
What do we personalize?
From what we recommend (Ranking): the ordering of videos is personalized
... to how we construct a page (Rows): the selection and placement of rows is personalized
... to what images to select: personalized images
... to reaching out to our members
Everything is a recommendation!
Over 80% of what people watch comes from our recommendations
Overview in [Gomez-Uribe & Hunt, 2016]
Isn’t this solved yet?
○ Every person is unique with a variety of interests
○ Help people find what they want when they’re not sure what they want
○ Large datasets but small data per user
… and potentially biased by the output of your system
○ Cold-start problems on all sides
○ Non-stationary, context-dependent, mood-dependent
○ More than just accuracy: Diversity, novelty, freshness, fairness, ...
○ ...
No, personalization is hard!
Some recent trends in approaching these challenges:
1. Deep Learning
2. Causality
3. Bandits & Reinforcement Learning
4. Fairness
5. Experience Personalization
Trending Now
Trend 1: Deep Learning in
Recommendations
~2012: Deep Learning becomes popular in Machine Learning
~2017: Deep Learning becomes popular in Recommender Systems
What took so long?
Traditional Recommendations
Collaborative Filtering: recommend items that similar users have chosen
[Figure: binary user-item interaction matrix (rows = users, columns = items)]
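As a rough illustration (a toy Python sketch, not Netflix code), user-based collaborative filtering on a binary interaction matrix can score items by how often similar users chose them; the recommend() function and the cosine-weighted neighborhood are illustrative assumptions:

import numpy as np

# Toy binary user-item interaction matrix (rows = users, columns = items)
R = np.array([[0, 1, 0, 1, 0],
              [0, 0, 1, 1, 0],
              [1, 0, 0, 1, 1],
              [0, 1, 0, 0, 0],
              [0, 0, 0, 0, 1]], dtype=float)

def recommend(user, R, k=2):
    # Cosine similarity between the target user and every other user
    norms = np.linalg.norm(R, axis=1, keepdims=True) + 1e-9
    sims = (R / norms) @ (R[user] / norms[user])
    sims[user] = 0.0                      # exclude the user themselves
    scores = sims @ R                     # weight other users' choices by similarity
    scores[R[user] > 0] = -np.inf         # do not re-recommend items already seen
    return np.argsort(-scores)[:k]

print(recommend(0, R))                    # top-2 unseen items for user 0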
A Matrix Factorization view
R ≈ U Vᵀ, trained with a mean squared loss ‖R − U Vᵀ‖²

A Feed-Forward Network view
[Figure: the same factorization drawn as a one-layer network with user embeddings U and item embeddings V]

A (deeper) feed-forward view
[Figure: additional hidden layers between the embeddings U, V and the output]

Mean squared loss? … isn't always the best

… but opens up many possibilities
[Figure: input interactions X with timestamps (2018-12-23 19:32:10, 2018-12-24 12:05:53, 2019-01-02 15:40:22) → Avg / Stack / Sequence → DNN / RNN / CNN → Softmax output p(Y)]
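A minimal PyTorch sketch of the matrix-factorization-as-network view above; the dimensions, optimizer, and binary labels are illustrative assumptions, not the production setup:

import torch
import torch.nn as nn

class MFNet(nn.Module):
    def __init__(self, n_users, n_items, dim=32):
        super().__init__()
        self.U = nn.Embedding(n_users, dim)   # user factors
        self.V = nn.Embedding(n_items, dim)   # item factors

    def forward(self, users, items):
        # Dot product of user and item embeddings ~ an entry of U V^T
        return (self.U(users) * self.V(items)).sum(dim=-1)

model = MFNet(n_users=1000, n_items=500)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
users = torch.randint(0, 1000, (64,))
items = torch.randint(0, 500, (64,))
labels = torch.randint(0, 2, (64,)).float()                  # observed interactions
loss = nn.functional.mse_loss(model(users, items), labels)   # mean squared loss
loss.backward()
opt.step()

Replacing the squared loss with a softmax over the catalog, and the dot product with deeper layers, gives the feed-forward variants sketched in the figure above.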
Sequence prediction
● Treat recommendations as a sequence classification problem
○ Input: sequence of user actions
○ Output: next action
● E.g. GRU4Rec [Hidasi et al., 2016]
○ Input: sequence of items in a session
○ Output: next item in the session
● Also co-evolution: [Wu et al., 2017], [Dai et al., 2017]
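A simplified GRU4Rec-style sketch (PyTorch; the sizes are made up, and details of the paper such as session-parallel mini-batches and ranking losses are omitted):

import torch
import torch.nn as nn

class NextItemGRU(nn.Module):
    def __init__(self, n_items, dim=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(n_items, dim)
        self.gru = nn.GRU(dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_items)

    def forward(self, item_seq):               # item_seq: (batch, seq_len)
        h, _ = self.gru(self.embed(item_seq))  # (batch, seq_len, hidden)
        return self.out(h[:, -1])              # logits for the next item

model = NextItemGRU(n_items=10000)
seq = torch.randint(0, 10000, (32, 20))        # 32 sessions of length 20
next_item = torch.randint(0, 10000, (32,))
loss = nn.functional.cross_entropy(model(seq), next_item)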
Leveraging other data
● Example: YouTube Recommender [Covington et al., 2016]
● Two-stage ranker: candidate generation (shrinking the set of items to rank) and ranking (classifying actual impressions)
● Two feed-forward, fully connected networks with hundreds of features
Contextual sequence data
[Figure: per-user sequence of timestamped plays with context (2017-12-10 15:40:22 … 2017-12-30 20:42:13) over time, with the next item "?" to predict]
Time-sensitive sequence prediction
● Proper modeling of time and system dynamics is critical
○ Recommendations are actions at a moment in time
● Experiment on a Netflix internal dataset
○ Input: Sequence of past plays and time context
■ Discrete time: Day-of-week (Mon, Tue, …) & Hour-of-day
■ Continuous time (aka timestamp)
○ Label: Predict next play (temporal split data)
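A hedged sketch of adding discrete time context to a sequence model: embed day-of-week and hour-of-day and concatenate them with the item embedding at each step. The exact feature set and architecture of the internal experiment may differ.

import torch
import torch.nn as nn

class ContextualNextItemGRU(nn.Module):
    def __init__(self, n_items, dim=64, hidden=128):
        super().__init__()
        self.item = nn.Embedding(n_items, dim)
        self.dow = nn.Embedding(7, 8)     # day-of-week
        self.hod = nn.Embedding(24, 8)    # hour-of-day
        self.gru = nn.GRU(dim + 16, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_items)

    def forward(self, items, dow, hod):
        x = torch.cat([self.item(items), self.dow(dow), self.hod(hod)], dim=-1)
        h, _ = self.gru(x)
        return self.out(h[:, -1])          # logits for the next play

model = ContextualNextItemGRU(n_items=10000)
logits = model(torch.randint(0, 10000, (4, 10)),
               torch.randint(0, 7, (4, 10)),
               torch.randint(0, 24, (4, 10)))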
Results
Trend 2: Causality
From Correlation to Causation
● Most recommendation algorithms are correlational
○ Some early recommendation algorithms literally computed correlations between users and items
● Did you watch a movie because you liked it? Or because we showed it to you? Or both?
p(Y|X) → p(Y|X, do(R))
(Spurious-correlation example from http://www.tylervigen.com/spurious-correlations)
Feedback loops
[Diagram: feedback cycle: impression bias inflates plays → inflated item popularity → more impressions → more plays → oscillations in the distribution of genre recommendations]
Feedback loops can cause biases to be reinforced by the recommendation system!
[Chaney et al., 2018]: simulations showing that this can reduce the usefulness of the system
Lots of feedback loops...
[Diagram, built up over several slides: the closed loop (watches → training data → model → recs → watches) is the danger zone; search provides an open-loop source of training data from outside the recommendation loop]
Debiasing Recommendations
● IPS Estimator for MF [Schnabel et al., 2016]
○ Train a debiasing model and reweight the data
● Causal Embeddings [Bonner & Vasile, 2018]
○ Jointly learn debiasing model and task model
○ Regularize the two towards each other
● Doubly-Robust MF [Wang et al., 2019]
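The IPS idea in [Schnabel et al., 2016], roughly: reweight each observed interaction by the inverse of its estimated propensity (the probability it was exposed/rated). A toy numpy sketch, assuming the propensities have already been estimated:

import numpy as np

def ips_mse(preds, labels, propensities):
    # Inverse-propensity-weighted squared error over observed entries
    w = 1.0 / np.clip(propensities, 1e-3, 1.0)   # clip to limit variance
    return float(np.mean(w * (preds - labels) ** 2))

preds = np.array([4.2, 3.1, 2.5])
labels = np.array([5.0, 3.0, 2.0])
prop = np.array([0.8, 0.2, 0.05])   # popular items are over-exposed
print(ips_mse(preds, labels, prop))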
Trend 3: Bandits &
Reinforcement Learning in
Recommendations
Why contextual bandits for recommendations?
● Uncertainty around user interests and new items
● Sparse and indirect feedback
● Changing trends
● Break feedback loops
● Want to explore to learn
Early news example: [Li et al., 2010]
Bart [McInerney et al., 2018]
● Bandit selecting both items and explanations for
Spotify homepage
● Factorization Machine with epsilon-greedy exploration over a personalized candidate set
● Counterfactual risk minimization to train the bandit
Which artwork to show?
Artwork Personalization as
Contextual Bandit
● Environment: Netflix homepage
● Context: Member, device, page, etc.
● Learner: Artwork selector for a show
● Action: Display specific image for show
● Reward: Member has positive engagement
[Diagram: Artwork Selector choosing an image to display]
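A toy epsilon-greedy contextual bandit for picking one of several images for a title; the linear reward model, feature vector, and update rule are illustrative assumptions, not the production artwork selector:

import numpy as np

rng = np.random.default_rng(0)

class EpsGreedyImageBandit:
    def __init__(self, n_images, n_features, eps=0.1, lr=0.05):
        self.w = np.zeros((n_images, n_features))   # one linear reward model per image
        self.eps, self.lr = eps, lr

    def select(self, context):
        if rng.random() < self.eps:                 # explore
            return int(rng.integers(len(self.w)))
        return int(np.argmax(self.w @ context))     # exploit

    def update(self, image, context, reward):
        pred = self.w[image] @ context
        self.w[image] += self.lr * (reward - pred) * context   # SGD step

bandit = EpsGreedyImageBandit(n_images=5, n_features=8)
ctx = rng.normal(size=8)                 # member / device / page features
img = bandit.select(ctx)
bandit.update(img, ctx, reward=1.0)      # member had positive engagement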
Offline Replay Results
● Bandit finds good images
● Personalization is better
● Artwork variety matters
● Personalization wiggles
around best images
[Figure: lift in Replay for the various algorithms compared to the Random baseline]
More info in our blog post
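A sketch of the replay estimator behind these offline results: keep only logged events where the evaluated policy would have chosen the same image as the (randomized) logging policy, and average their rewards. This assumes uniformly random logging; otherwise the kept events should be reweighted by propensities.

import numpy as np

def replay_reward(contexts, logged_actions, logged_rewards, policy):
    # Keep only events where the evaluated policy agrees with the logged action
    kept = [r for x, a, r in zip(contexts, logged_actions, logged_rewards)
            if policy(x) == a]
    return float(np.mean(kept)) if kept else float("nan")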
Going Long-Term
● Want to maximize long-term user satisfaction and retention
● Involves many user visits, recommendation actions and delayed reward
● … sounds like Reinforcement Learning
Challenges of Reinforcement Learning for Recommendations
● High-dimensional action space: Recommending a single item is O(|C|); typically want to do ranking or page construction, which is combinatorial
● High-dimensional state space: Users are represented in the state, along with the relevant history
● Off-policy training: Need to learn from existing system actions
● Concurrency: Don't observe full trajectories; need to learn simultaneously from many interactions
● Changing action space: New actions (items) become available and need to be cold-started
● No good simulator: Requires knowing feedback for the user on recommended items
List-wise [Zhao et al., 2017] or page-wise [Zhao et al., 2018] recommendation
Embeddings for actions, based on [Dulac-Arnold et al., 2016]
GAN-inspired user simulator [Chen et al., 2019]
● Generator chooses the user's action in response to a recommendation
● Reward model trained like a discriminator
● LSTM or position-weight architecture
● Learning over sets via cascading Deep Q-Networks
○ Different Q function per position
Policy Gradient for YouTube Recommendations [Chen et al., 2019]
● Train candidate generator using REINFORCE
● Exploration done using softmax with temperature
● Off-policy correction with adaptation for top-k recommendations
● Trust region policy optimization to keep close to logging policy
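A simplified sketch of the top-K off-policy corrected REINFORCE loss from [Chen et al., 2019]: importance-weight each logged action by pi/beta (capped) and scale by the top-K correction K(1 - pi)^(K-1). The cap value and K here are illustrative assumptions.

import torch

def topk_off_policy_loss(log_pi, pi, beta, returns, k=16):
    # log_pi: log pi(a|s) from the trained policy (requires grad)
    # pi, beta: detached action probabilities under the new / logging policies
    # returns: observed long-term rewards
    iw = (pi / beta.clamp(min=1e-6)).clamp(max=10.0)   # capped importance weight
    lambda_k = k * (1.0 - pi).pow(k - 1)               # top-K correction factor
    return -(iw * lambda_k * returns * log_pi).mean()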
Trend 4: Fairness
Personalization has a big impact on people's lives
How do we make sure that it is fair?
Calibrated Recommendations [Steck, 2018]
● Fairness as matching distribution of user interests
● Accuracy as an objective can lead to unbalanced predictions
● Simple example:
○ User: history of 30 action and 70 romance plays
○ Expectation: a recommendation list of 30% action and 70% romance
○ Reality: a 100% romance list maximizes accuracy
● Many recommendation algorithms exhibit this behavior, exaggerating the dominant interests and crowding out less frequent ones
Calibration Metric
- Genre distribution of each item is given (or other categorization)
- Genre distribution of the user's play history
  … add a prior for other genres (for diversity)
- Genre distribution of the recommended list
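A hedged sketch of a calibration metric in the spirit of [Steck, 2018]: compare the genre distribution of the user's history p with that of the recommended list q via KL(p || q~), where q~ mixes in a small prior so every genre has support. The exact weighting in the paper may differ.

import numpy as np

def calibration_kl(p, q, alpha=0.01):
    p, q = np.asarray(p, float), np.asarray(q, float)
    q_tilde = (1 - alpha) * q + alpha * p      # small prior so q has full support
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q_tilde[mask])))

history_genres = [0.7, 0.3, 0.0]   # 70% romance, 30% action
rec_genres = [1.0, 0.0, 0.0]       # the all-romance list that maximizes accuracy
print(calibration_kl(history_genres, rec_genres))   # large KL = poorly calibrated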
Calibration Results (MovieLens 20M)
Baseline model (wMF): many users receive uncalibrated recommendations
After reranking (submodular reranker): recommendations are much more calibrated (smaller KL divergence)
[Figure: user density vs. calibration metric (KL divergence), before and after reranking]
Fairness through Pairwise Comparisons
[Beutel et al., 2019]
● Recommendations are fair if likelihood of clicked item being ranked above
an unclicked item is the same across two groups
○ Intra-group pairwise accuracy - Restrict to pairs within group
○ Inter-group pairwise accuracy - Restrict to pairs between groups
● Training: Add a pairwise regularizer, using randomized data to collect fairness feedback
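A toy sketch of the pairwise accuracies above: estimate, from logged clicks, the probability that a clicked item from one group is scored above an unclicked item from the same group (intra-group) or a different group (inter-group). The function and argument names are illustrative.

def group_pairwise_accuracy(scores, clicked, groups, g_pos, g_neg):
    # P(score of a clicked item from g_pos > score of an unclicked item from g_neg)
    pos = [s for s, c, g in zip(scores, clicked, groups) if c == 1 and g == g_pos]
    neg = [s for s, c, g in zip(scores, clicked, groups) if c == 0 and g == g_neg]
    pairs = [float(p > n) for p in pos for n in neg]
    return sum(pairs) / len(pairs) if pairs else float("nan")

# Intra-group: g_pos == g_neg; inter-group: g_pos != g_neg.
# Fairness asks these accuracies to be similar across groups.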
Trend 5:
Experience Personalization
Personalizing how we recommend
(not just what we recommend…)
● Algorithm level: Ideal balance of diversity, popularity,
novelty, freshness, etc. may depend on the person
● Display level: How you present items or explain
recommendations can also be personalized
● Interaction level: Balancing the needs of lean-back
users and power users
Page/Slate Optimization
● Select multiple actions that go together and receive feedback on the group
● Personalizing based on within-session browsing behavior [Wu et al., 2015]
● Off-policy evaluation for slates [Swaminathan et al., 2016]
● Slate optimization as VAE [Jiang et al., 2019]
● Marginal posterior sampling for slate bandits
[Dimakopoulou et al., 2019]
More dimensions to personalize
Rows
Trailer
Evidence
Synopsis
Image
Row Title
Metadata
Ranking
More Adaptive UI
Evolution of our Personalization Approach
[Figure: Rating (e.g. 4.7 stars) → Ranking → Pages → Experience]
Potential Connections with
Multi-Task / Meta Learning?
Applications as tasks
● Many related personalization tasks in a
recommender system
● Examples:
○ [Zhao et al., 2015] - Outputs for different tasks
○ [Bansal et al., 2016] - Jointly learn to recommend
and predict metadata for items
○ [Ma et al., 2018] - Jointly learn watch and enjoy
○ [Lu et al., 2018] - Jointly learn for rating prediction
and explanation
○ [Hadash et al., 2018] - Jointly learn ranking and
rating prediction
[Diagram: user history and context feeding task-specific outputs: ranking, page, rating, explanation, search, image, ...]
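A hedged shared-bottom multi-task sketch: one representation of the user history feeds several task heads (here, next-item ranking and rating prediction). The heads and dimensions are illustrative, not any of the cited architectures.

import torch
import torch.nn as nn

class SharedBottomRecommender(nn.Module):
    def __init__(self, n_items, dim=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(n_items, dim)
        self.shared = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.rank_head = nn.Linear(hidden, n_items)   # next-item ranking logits
        self.rating_head = nn.Linear(hidden, 1)       # rating prediction

    def forward(self, history):                            # history: (batch, seq_len) item ids
        h = self.shared(self.embed(history).mean(dim=1))   # average-pooled history
        return self.rank_head(h), self.rating_head(h)

model = SharedBottomRecommender(n_items=10000)
rank_logits, rating = model(torch.randint(0, 10000, (8, 15)))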
Other views
● Users-as-tasks: Treat each user as a task and learn from other users
○ Example: [Ning & Karypis, 2010] finds similar users and does support vector regression
● Items-as-tasks: Treat each item as a separate model to learn
● Contexts-as-tasks: Treat different contexts (time, device, region, …)
as separate tasks
● Domains-as-tasks: Leverage representations of users in one domain
to help in another (e.g. different kinds of items, different genres)
○ Example: [Li et al., 2009] on movies <-> books
Conclusion
1. Deep Learning
2. Causality
3. Bandits & Reinforcement Learning
4. Fairness
5. Experience Personalization
6. Multi-task & Meta Learning?
Lots of opportunity for Machine Learning
in Personalization
Thank you
Questions?
@JustinBasilico Yes, we’re hiring...
Justin Basilico
