Netflix uses a variety of techniques to provide personalized recommendations to users. Some key aspects include:
1. Netflix recommendations are generated using both offline and online techniques. Offline techniques allow for more complex computations but results may become stale, while online techniques can respond quickly but have stricter time constraints.
2. Recommendations are generated using a variety of data sources and machine learning models, including SVD, RBMs, gradient boosted trees, and other techniques. Both the data and models are important for generating high quality recommendations.
3. Netflix evaluates recommendations using both offline testing and online A/B testing. Offline testing is used to vet new models and ideas before launching online tests involving real users.
1. Netflix Recommendations: Beyond the 5 Stars
ACM SF-Bay Area
October 22, 2012
Xavier Amatriain
Personalization Science and Engineering – Netflix
@xamat
2. Outline
1. The Netflix Prize & the Recommendation Problem
2. Anatomy of Netflix Personalization
3. Data & Models
4. And…
a) Consumer (Data) Science
b) Or Software Architectures
4. What we were interested in:
High quality recommendations
Proxy question: accuracy in predicted rating
Improve by 10% = $1 million!
Results
• Top 2 algorithms still in production: SVD and RBM
5. What about the final prize ensembles?
Our offline studies showed they were too computationally intensive to scale
Expected improvement was not worth the engineering effort
Plus… focus had already shifted to other issues that had more impact than rating prediction.
14. Genre rows
Personalized genre rows focus on user interest
Also provide context and “evidence”
Important for member satisfaction – moving personalized rows to the top on devices increased retention
How are they generated?
Implicit: based on user’s recent plays, ratings, & other interactions
Explicit: taste preferences
Hybrid: combine the above
Also take into account:
Freshness – has this been shown before?
Diversity – avoid repeating tags and genres, limit the number of TV genres, etc.
21. Displayed in many different contexts
In response to user actions/context (search, queue add…)
More like… rows
Similars
22. Anatomy of Netflix Personalization - Recap
Everything is a recommendation: not only rating prediction, but also ranking, row selection, similarity…
We strive to make it easy for the user, but…
We want the user to be aware of, and involved in, the recommendation process
Deal with implicit/explicit and hybrid feedback
Add support/explanations for recommendations
Consider issues such as diversity or freshness
24. Big Data @Netflix
Almost 30M subscribers
Ratings: 4M/day
Searches: 3M/day
Plays: 30M/day
2B hours streamed in Q4 2011
1B hours in June 2012
25. Smart Models
Logistic/linear regression
Elastic nets
SVD and other MF models
Restricted Boltzmann Machines
Markov Chains
Different clustering approaches
LDA
Association Rules
Gradient Boosted Decision Trees
…
26. SVD
$X_{m \times n} = U_{m \times r} \, S_{r \times r} \, (V_{n \times r})^T$
X: m x n matrix (e.g., m users, n videos)
U: m x r matrix (m users, r concepts)
S: r x r diagonal matrix (strength of each ‘concept’; r = rank of the matrix)
V: n x r matrix (n videos, r concepts)
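As an illustration, here is a minimal numpy sketch of this decomposition on a toy dense matrix (in practice the rating matrix is sparse and mostly missing, which is why the Prize-era methods on the next slides approximate the factorization instead of computing it exactly):

```python
import numpy as np

# Toy dense "rating" matrix: m users x n videos (real data is sparse).
m, n, r = 5, 4, 2
X = np.random.rand(m, n)

# Full SVD, then keep only the top-r "concepts".
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_r = U[:, :r]                 # m x r: users in concept space
S_r = np.diag(s[:r])           # r x r: strength of each concept
V_r = Vt[:r, :].T              # n x r: videos in concept space

X_approx = U_r @ S_r @ V_r.T   # best rank-r approximation of X
```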
27. Simon Funk’s SVD
One of the most interesting findings during the Netflix Prize came out of a blog post
An incremental, iterative, and approximate way to compute the SVD using gradient descent
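A minimal sketch of that idea, learning user and item factors by stochastic gradient descent over the observed ratings only (variable names, learning rate, and regularization values are illustrative, not Funk's exact settings):

```python
import numpy as np

def funk_svd(ratings, n_users, n_items, r=20, lr=0.005, reg=0.02, epochs=20):
    """Approximate the SVD by SGD over observed ratings.
    ratings: iterable of (user_index, item_index, rating) triples."""
    P = np.random.normal(scale=0.1, size=(n_users, r))   # user factors
    Q = np.random.normal(scale=0.1, size=(n_items, r))   # item factors
    for _ in range(epochs):
        for u, i, x in ratings:
            pu = P[u].copy()
            err = x - pu @ Q[i]                          # prediction error
            P[u] += lr * (err * Q[i] - reg * pu)         # gradient step on user
            Q[i] += lr * (err * pu - reg * Q[i])         # gradient step on item
    return P, Q
```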
28. SVD for Rating Prediction
User factor vectors $p_u \in \mathbb{R}^f$ and item factor vectors $q_v \in \mathbb{R}^f$
Baseline $b_{uv} = \mu + b_u + b_v$ (user & item deviation from average)
Predict rating as $r'_{uv} = b_{uv} + p_u^T q_v$
SVD++ (Koren et al.): asymmetric variation with implicit feedback
$$r'_{uv} = b_{uv} + q_v^T \left( |R(u)|^{-\frac{1}{2}} \sum_{j \in R(u)} (r_{uj} - b_{uj})\, x_j + |N(u)|^{-\frac{1}{2}} \sum_{j \in N(u)} y_j \right)$$
Where $q_v, x_v, y_v \in \mathbb{R}^f$ are three item factor vectors
Users are not parametrized, but rather represented by:
R(u): items rated by user u
N(u): items for which the user has given implicit preference (e.g., rated vs. not rated)
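A sketch of the SVD++ prediction above for one (user, video) pair, assuming the factors have already been trained (all arrays are hypothetical inputs mirroring the symbols in the formula):

```python
import numpy as np

def predict_svdpp(b_uv, q_v, x, y, rated, implicit, r, b):
    """rated: R(u); implicit: N(u); r[j], b[j]: the user's rating of item j
    and its baseline; q_v: this video's factor vector; x[j], y[j]: the two
    extra item factor vectors from the formula above."""
    zero = np.zeros_like(q_v)
    explicit = (sum((r[j] - b[j]) * x[j] for j in rated) / np.sqrt(len(rated))
                if rated else zero)
    feedback = (sum(y[j] for j in implicit) / np.sqrt(len(implicit))
                if implicit else zero)
    return b_uv + q_v @ (explicit + feedback)
```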
29. Artificial Neural Networks – 4 generations
1st - Perceptrons (~60s)
Single layer of hand-coded features
Linear activation function
Fundamentally limited in what they can learn to do.
2nd - Back-propagation (~80s)
Back-propagate error signal to get derivatives for learning
Non-linear activation function
3rd - Belief Networks (~90s)
Directed acyclic graph composed of (visible & hidden) stochastic variables with weighted connections.
Infer the states of the unobserved variables & learn interactions between variables to make the network more likely to generate observed data.
30. Restricted Boltzmann Machines
Restrict the connectivity to make learning easier.
Only one layer of hidden units (although multiple layers are possible).
No connections between hidden units.
Hidden units are independent given the visible states, so we can quickly get an unbiased sample from the posterior distribution over hidden “causes” when given a data-vector (see the sketch below).
RBMs can be stacked to form Deep Belief Nets (DBN) – the 4th generation of ANNs
[Diagram: bipartite graph of visible units i and hidden units j]
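That conditional independence is what makes inference cheap. Here is a minimal numpy sketch of sampling the hidden units given a visible data-vector, for a binary RBM with assumed pre-trained weights and biases:

```python
import numpy as np

def sample_hidden(v, W, b, rng=None):
    """One exact sample from P(h | v). v: visible vector (n_visible,);
    W: weights (n_visible, n_hidden); b: hidden biases (n_hidden,).
    Because hidden units are independent given v, each unit is an
    independent coin flip with probability sigmoid(v @ W + b)."""
    if rng is None:
        rng = np.random.default_rng()
    p_h = 1.0 / (1.0 + np.exp(-(v @ W + b)))     # sigmoid per hidden unit
    return (rng.random(p_h.shape) < p_h).astype(float)
```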
33. Ranking
Ranking = Scoring + Sorting + Filtering bags of movies for presentation to a user
Goal: find the best possible ordering of a set of videos for a user, within a specific context, in real-time
Objective: maximize consumption
Aspiration: played & “enjoyed” titles have the best scores
Akin to CTR forecast for ads/search results
Factors
Accuracy
Novelty
Diversity
Freshness
Scalability
…
34. Ranking
Popularity is the obvious baseline
Ratings prediction is a clear secondary data input that allows for personalization
We have added many other features (and tried many more that have not proved useful)
What about the weights?
Based on A/B testing
Machine-learned
35. Example: Two features, linear model
[Figure: candidate videos plotted on two axes, popularity and predicted rating; the linear model below induces the final ranking 1-5]
Linear Model: $f_{rank}(u,v) = w_1 \, p(v) + w_2 \, r(u,v) + b$
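A direct transcription of that two-feature model as a sketch (the weight values are placeholders; on the slides they would come from A/B testing or machine learning, as discussed above):

```python
def frank(popularity, predicted_rating, w1=0.5, w2=0.5, b=0.0):
    """Linear ranking score for one (user, video) pair: w1*p(v) + w2*r(u,v) + b."""
    return w1 * popularity + w2 * predicted_rating + b

# Rank candidate videos by descending score.
candidates = [("A", 0.9, 3.1), ("B", 0.2, 4.8), ("C", 0.5, 4.0)]  # (id, p(v), r(u,v))
ranked = sorted(candidates, key=lambda c: frank(c[1], c[2]), reverse=True)
```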
40. Learning to rank
Machine learning problem: the goal is to construct a ranking model from training data
Training data can have partial order or binary judgments (relevant/not relevant)
The resulting order of the items is typically induced from a numerical score
Learning to rank is a key element for personalization
You can treat the problem as a standard supervised classification problem
41. Learning to Rank Approaches
1. Pointwise
Ranking function minimizes a loss function defined on individual relevance judgments
Ranking score based on regression or classification
Ordinal regression, logistic regression, SVM, GBDT, …
2. Pairwise
Loss function is defined on pairwise preferences
Goal: minimize the number of inversions in the ranking (see the sketch after this list)
The ranking problem is then transformed into a binary classification problem
RankSVM, RankBoost, RankNet, FRank…
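To make the pairwise idea concrete, here is a minimal sketch of a hinge-style pairwise objective over preference pairs, in the spirit of RankSVM (a simplified illustration, not any production loss function):

```python
def pairwise_hinge_loss(scores, prefs, margin=1.0):
    """scores: model score per item; prefs: (i, j) pairs meaning item i
    should rank above item j. A pair contributes loss when it is inverted
    or separated by less than the margin, so minimizing this loss
    minimizes ranking inversions."""
    return sum(max(0.0, margin - (scores[i] - scores[j])) for i, j in prefs)

# e.g., item 0 should beat item 1, and item 2 should beat item 1
loss = pairwise_hinge_loss(scores=[2.0, 1.5, 0.8], prefs=[(0, 1), (2, 1)])
```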
42. Learning to rank - metrics
Quality of ranking is measured using metrics such as:
Normalized Discounted Cumulative Gain (NDCG)
Mean Reciprocal Rank (MRR)
Fraction of Concordant Pairs (FCP)
Others…
But it is hard to optimize machine-learned models directly on these measures (they are not differentiable)
Recent research on models that directly optimize ranking measures
$$\mathrm{NDCG} = \frac{\mathrm{DCG}}{\mathrm{IDCG}}, \qquad \mathrm{DCG} = \mathrm{relevance}_1 + \sum_{i=2}^{n} \frac{\mathrm{relevance}_i}{\log_2 i}$$
$$\mathrm{MRR} = \frac{1}{|H|} \sum_{h_i \in H} \frac{1}{\mathrm{rank}(h_i)}$$
$$\mathrm{FCP} = \frac{\sum_{i \neq j} \mathrm{CP}(x_i, x_j)}{n(n-1)/2}$$
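A plain-Python sketch of these definitions (toy helper functions for illustration; inputs are assumed to be small in-memory lists):

```python
import math

def dcg(rels):
    """rels: graded relevance values in ranked order."""
    return rels[0] + sum(r / math.log2(i) for i, r in enumerate(rels[1:], start=2))

def ndcg(rels):
    ideal = dcg(sorted(rels, reverse=True))          # IDCG: best possible ordering
    return dcg(rels) / ideal if ideal > 0 else 0.0

def mrr(first_relevant_ranks):
    """Mean of 1/rank of the first relevant item, over a set of queries H."""
    return sum(1.0 / r for r in first_relevant_ranks) / len(first_relevant_ranks)

def fcp(scores, truth):
    """Fraction of item pairs whose predicted order agrees with the truth."""
    concordant, total = 0, 0
    for i in range(len(scores)):
        for j in range(i + 1, len(scores)):
            if truth[i] != truth[j]:
                total += 1
                concordant += (scores[i] - scores[j]) * (truth[i] - truth[j]) > 0
    return concordant / total if total else 0.0
```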
43. Learning to Rank Approaches
3. Listwise
a. Indirect loss function
RankCosine: similarity between the ranking list and the ground truth as loss function
ListNet: KL-divergence as loss function, by defining a probability distribution
Problem: optimizing a listwise loss function may not optimize IR metrics
b. Directly optimizing IR measures (difficult since they are not differentiable)
Directly optimize IR measures through genetic programming
Directly optimize measures with simulated annealing
Gradient descent on a smoothed version of the objective function (e.g., CLiMF presented at RecSys 2012, or TFMAP at SIGIR 2012)
SVM-MAP relaxes the MAP metric by adding it to the SVM constraints
AdaRank uses boosting to optimize NDCG
44. Similars
Different similarities computed from different sources: metadata, ratings, viewing data…
Similarities can be treated as data/features
Machine-learned models improve our concept of “similarity”
45. Data & Models - Recap
All sorts of feedback from the user can help generate better recommendations
Need to design systems that capture and take advantage of all this data
The right model is as important as the right data
It is important to come up with new theoretical models, but we also need to think about their application to a domain, and about practical issues
Rating prediction models are only part of the solution to recommendation (think about ranking, similarity…)
46. More data or better models?
Really?
Anand Rajaraman: Stanford & Senior VP at Walmart Global eCommerce (formerly Kosmix)
48. More data or better models?
[Figure: learning curves from Banko and Brill, 2001]
Norvig: “Google does not have better algorithms, only more data”
Many features / low-bias models
52. Consumer Science
Main goal is to effectively innovate for customers
Innovation goals:
“If you want to increase your success rate, double your failure rate.” – Thomas Watson, Sr., founder of IBM
The only real failure is the failure to innovate
Fail cheaply
Know why you failed/succeeded
53. Consumer (Data) Science
1. Start with a hypothesis:
Algorithm/feature/design X will increase member engagement with our service, and ultimately member retention
2. Design a test
Develop a solution or prototype
Think about dependent & independent variables, control, significance…
3. Execute the test
4. Let the data speak for itself
54. Offline/Online testing process
[Flow diagram: Offline testing (days) → on success → Online A/B testing (weeks to months) → on success → Rollout feature to all users; on failure, return to offline testing]
55. Offline testing
Optimize algorithms offline
Measure model performance using metrics such as:
Mean Reciprocal Rank, Normalized Discounted Cumulative Gain, Fraction of Concordant Pairs, Precision/Recall & F-measures, AUC, RMSE, Diversity…
Offline performance is used as an indication to make informed decisions on follow-up A/B tests
A critical (and unsolved) issue is how well offline metrics correlate with A/B test results
Extremely important to define a coherent offline evaluation framework (e.g., how to create training/testing datasets is not trivial)
56. Executing A/B tests
Many different metrics, but we ultimately trust user engagement (e.g., hours of play and customer retention)
Think about significance and hypothesis testing (a sketch follows below)
Our tests usually have thousands of members and 2-20 cells
A/B tests allow you to try radical ideas or test many approaches at the same time
We typically have hundreds of customer A/B tests running
Decisions on the product are always data-driven
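As one way of making the significance point concrete, here is a minimal sketch of comparing retention between a control cell and a test cell with a two-proportion z-test (the choice of test, the function, and the numbers are illustrative assumptions, not Netflix's actual methodology):

```python
import math

def two_proportion_ztest(retained_a, n_a, retained_b, n_b):
    """z-statistic for H0: both cells have the same retention rate."""
    p_a, p_b = retained_a / n_a, retained_b / n_b
    pooled = (retained_a + retained_b) / (n_a + n_b)      # rate under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# e.g., 4,100 of 5,000 members retained in control vs. 4,230 of 5,000 in test
z = two_proportion_ztest(4100, 5000, 4230, 5000)  # |z| > 1.96 ~ significant at 5%
```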
57. What to measure
OEC: Overall Evaluation Criterion
In an A/B test framework, the measure of success is key
Short-term metrics do not always align with long-term goals
E.g., CTR: generating more clicks might mean that our recommendations are actually worse
Use long-term metrics such as LTV (lifetime value) whenever possible
At Netflix, we use member retention
58. What to measure
Short-term metrics can sometimes be informative, and may allow for faster decision-making
At Netflix we use many, such as hours streamed by users or % of hours from a given algorithm
But be aware of several caveats of using early decision mechanisms
Initial effects appear to trend
See “Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained” [Kohavi et al., KDD 2012]
59. Consumer Data Science - Recap
Consumer Data Science aims to innovate for the customer by running experiments and letting the data speak
This is mainly done through online A/B testing
However, we can speed up innovation by experimenting offline
But for both online and offline experimentation, it is important to choose the right metric and experimental framework
64. Event & Data Distribution
• UI devices should broadcast many different kinds of user events
• Clicks
• Presentations
• Browsing events
• …
• Events vs. data
• Some events only need to be propagated and trigger an action (low latency, low information per event)
• Others need to be processed and “turned into” data (higher latency, higher information quality)
• And… there are many in between
• Real-time event flow is managed through an internal tool (Manhattan)
• Data flow is mostly managed through Hadoop
66. Offline Jobs
• Two kinds of offline jobs
• Model training
• Batch offline computation of recommendations/intermediate results
• Offline queries are written in either Hive or Pig
• Need a publishing mechanism that solves several issues
• Notify readers when the result of a query is ready
• Support different repositories (S3, Cassandra…)
• Handle errors, monitoring…
• We do this through Hermes
68. Computation
• Two ways of computing personalized results
• Batch/offline
• Online
• Each approach has pros/cons
• Offline
+ Allows more complex computations
+ Can use more data
- Cannot react to quick changes
- May result in staleness
• Online
+ Can respond quickly to events
+ Can use the most recent data
- May fail because of SLA
- Cannot deal with “complex” computations
• It’s not an either/or decision
• Both approaches can be combined
70. Signals & Models
• Both offline and online algorithms are based on three different inputs:
• Models: previously trained from existing data
• (Offline) Data: previously processed and stored information
• Signals: fresh data obtained from live services
• User-related data
• Context data (session, date, time…)
72. Results
• Recommendations can be serviced from:
• Previously computed lists
• Online algorithms
• A combination of both
• The decision on where to service the recommendation from can respond to many factors, including context
• Also important to think about fallbacks – what if plan A fails? (a sketch follows below)
• Previously computed lists/intermediate results can be stored in a variety of ways
• Cache
• Cassandra
• Relational DB
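A minimal sketch of that fallback chain (all names here are hypothetical stand-ins, not real Netflix components; the point is the ordered degradation from online computation to precomputed lists to an unpersonalized default):

```python
def serve_recommendations(user_id, context, online, cache, fallback):
    """online: callable that may raise TimeoutError (e.g., an SLA miss);
    cache: dict-like store of precomputed lists; fallback: callable
    returning an unpersonalized default list."""
    try:
        return online(user_id, context)       # plan A: fresh, contextual results
    except TimeoutError:
        pass                                  # fall through to precomputed results
    recs = cache.get(user_id)                 # plan B: offline batch output
    return recs if recs is not None else fallback()  # plan C: e.g., popularity
```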
73. Alerts and Monitoring
A non-trivial concern in large-scale recommender systems
Monitoring: continuously observe the quality of the system
Alerting: fast notification if the quality of the system goes below a certain pre-defined threshold
Questions:
What do we need to monitor?
How do we know something is “bad enough” to alert on?
74. What to monitor
Staleness
Monitor time since the last data update
[Chart: time since last update, with an annotated spike: “Did something go wrong here?”]
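A minimal sketch of such a staleness check (the threshold is an arbitrary illustration; in practice it would be tuned per data source):

```python
import time

STALENESS_THRESHOLD_S = 6 * 3600  # illustrative: alert after 6 hours without updates

def is_stale(last_update_ts, now=None):
    """Return True if the time since the last data update exceeds the threshold."""
    now = time.time() if now is None else now
    return (now - last_update_ts) > STALENESS_THRESHOLD_S
```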
75-76. What to monitor
Algorithmic quality
Monitor different metrics by comparing what users do with what your algorithm predicted they would do
[Chart: quality metric over time, with an annotated anomaly: “Did something go wrong here?”]
77. What to monitor
Algorithmic source for users
Monitor how users interact with different algorithms
[Chart: share of user interactions per algorithm; after a new version of Algorithm X launches: “Did something go wrong here?”]
78. When to alert
Alerting thresholds are hard to tune
Avoid unnecessary alerts (the “learn-to-ignore problem”)
Avoid important issues being noticed before the alert happens
Rules of thumb:
Alert on anything that will impact user experience significantly
Alert on issues that are actionable
If a noticeable event happens without an alert… add a new alert for next time
80. The Personalization Problem
The Netflix Prize simplified the recommendation problem to predicting ratings
But…
User ratings are only one of the many data inputs we have
Rating predictions are only part of our solution
Other algorithms such as ranking or similarity are very important
We can reformulate the recommendation problem
Function to optimize: the probability that a user chooses something and enjoys it enough to come back to the service
81. More data +
Better models +
More accurate metrics +
Better approaches & architectures
Lots of room for improvement!