Netflix uses a variety of techniques to provide personalized recommendations to users. Some key aspects include:
1. Netflix recommendations are generated using both offline and online techniques. Offline techniques allow for more complex computations but results may become stale, while online techniques can respond quickly but have stricter time constraints.
2. Recommendations are generated using a variety of data sources and machine learning models, including SVD, RBMs, gradient boosted trees, and other techniques. Both the data and models are important for generating high quality recommendations.
3. Netflix evaluates recommendations using both offline testing and online A/B testing. Offline testing is used to vet new models and ideas before launching online tests involving real users.
1. Netflix Recommendations: Beyond the 5 Stars
ACM SF-Bay Area
October 22, 2012
Xavier Amatriain
Personalization Science and Engineering – Netflix
@xamat
2. Outline
1. The Netflix Prize & the Recommendation Problem
2. Anatomy of Netflix Personalization
3. Data & Models
4. And…
a) Consumer (Data) Science
b) Or Software Architectures
4. What we were interested in:
High quality recommendations
Proxy question: accuracy in predicted rating
Improve by 10% = $1 million!
Results
• Top 2 algorithms still in production: SVD and RBM
5. What about the final prize ensembles?
Our offline studies showed they were too computationally intensive to scale
Expected improvement was not worth the engineering effort
Plus… focus had already shifted to other issues that had more impact than rating prediction.
14. Genre rows
Personalized genre rows focus on user interest
Also provide context and “evidence”
Important for member satisfaction – moving personalized rows to the top on devices increased retention
How are they generated?
Implicit: based on user’s recent plays, ratings, & other interactions
Explicit: taste preferences
Hybrid: combine the above
Also take into account:
Freshness – has this been shown before?
Diversity – avoid repeating tags and genres, limit the number of TV genres, etc.
21. Displayed in many different contexts
In response to user actions/context (search, queue add…)
More like… rows
Similars
22. Anatomy of Netflix Personalization - Recap
Everything is a recommendation: not only rating prediction, but also ranking, row selection, similarity…
We strive to make it easy for the user, but…
We want the user to be aware of, and involved in, the recommendation process
Deal with implicit/explicit and hybrid feedback
Add support/explanations for recommendations
Consider issues such as diversity or freshness
24. Big Data @Netflix
Almost 30M subscribers
Ratings: 4M/day
Searches: 3M/day
Plays: 30M/day
2B hours streamed in Q4 2011
1B hours in June 2012
25. Smart Models
Logistic/linear regression
Elastic nets
SVD and other MF models
Restricted Boltzmann Machines
Markov Chains
Different clustering approaches
LDA
Association Rules
Gradient Boosted Decision Trees
…
26. SVD
$X_{m \times n} = U_{m \times r} \, S_{r \times r} \, (V_{n \times r})^T$
X: m x n matrix (e.g., m users, n videos)
U: m x r matrix (m users, r concepts)
S: r x r diagonal matrix (strength of each ‘concept’; r = rank of the matrix)
V: n x r matrix (n videos, r concepts)
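As an illustration, here is a minimal numpy sketch of this decomposition on a toy dense matrix (in practice the rating matrix is sparse and mostly missing, which is why the Prize-era methods on the next slides approximate the factorization instead of computing it exactly):

```python
import numpy as np

# Toy dense "rating" matrix: m users x n videos (real data is sparse).
m, n, r = 5, 4, 2
X = np.random.rand(m, n)

# Full SVD, then keep only the top-r "concepts".
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_r = U[:, :r]                 # m x r: users in concept space
S_r = np.diag(s[:r])           # r x r: strength of each concept
V_r = Vt[:r, :].T              # n x r: videos in concept space

X_approx = U_r @ S_r @ V_r.T   # best rank-r approximation of X
```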
27. Simon Funk’s SVD
One of the most interesting findings during the Netflix Prize came out of a blog post
An incremental, iterative, and approximate way to compute the SVD using gradient descent
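A minimal sketch of that idea, learning user and item factors by stochastic gradient descent over the observed ratings only (variable names, learning rate, and regularization values are illustrative, not Funk's exact settings):

```python
import numpy as np

def funk_svd(ratings, n_users, n_items, r=20, lr=0.005, reg=0.02, epochs=20):
    """Approximate the SVD by SGD over observed ratings.
    ratings: iterable of (user_index, item_index, rating) triples."""
    P = np.random.normal(scale=0.1, size=(n_users, r))   # user factors
    Q = np.random.normal(scale=0.1, size=(n_items, r))   # item factors
    for _ in range(epochs):
        for u, i, x in ratings:
            pu = P[u].copy()
            err = x - pu @ Q[i]                          # prediction error
            P[u] += lr * (err * Q[i] - reg * pu)         # gradient step on user
            Q[i] += lr * (err * pu - reg * Q[i])         # gradient step on item
    return P, Q
```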
28. SVD for Rating Prediction
User factor vectors $p_u \in \mathbb{R}^f$ and item factor vectors $q_v \in \mathbb{R}^f$
Baseline $b_{uv} = \mu + b_u + b_v$ (user & item deviation from average)
Predict rating as $r'_{uv} = b_{uv} + p_u^T q_v$
SVD++ (Koren et al.): asymmetric variation with implicit feedback
$$r'_{uv} = b_{uv} + q_v^T \left( |R(u)|^{-\frac{1}{2}} \sum_{j \in R(u)} (r_{uj} - b_{uj})\, x_j + |N(u)|^{-\frac{1}{2}} \sum_{j \in N(u)} y_j \right)$$
Where $q_v, x_v, y_v \in \mathbb{R}^f$ are three item factor vectors
Users are not parametrized, but rather represented by:
R(u): items rated by user u
N(u): items for which the user has given implicit preference (e.g., rated vs. not rated)
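A sketch of the SVD++ prediction above for one (user, video) pair, assuming the factors have already been trained (all arrays are hypothetical inputs mirroring the symbols in the formula):

```python
import numpy as np

def predict_svdpp(b_uv, q_v, x, y, rated, implicit, r, b):
    """rated: R(u); implicit: N(u); r[j], b[j]: the user's rating of item j
    and its baseline; q_v: this video's factor vector; x[j], y[j]: the two
    extra item factor vectors from the formula above."""
    zero = np.zeros_like(q_v)
    explicit = (sum((r[j] - b[j]) * x[j] for j in rated) / np.sqrt(len(rated))
                if rated else zero)
    feedback = (sum(y[j] for j in implicit) / np.sqrt(len(implicit))
                if implicit else zero)
    return b_uv + q_v @ (explicit + feedback)
```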
29. Artificial Neural Networks – 4 generations
1st - Perceptrons (~60s)
Single layer of hand-coded features
Linear activation function
Fundamentally limited in what they can learn to do.
2nd - Back-propagation (~80s)
Back-propagate error signal to get derivatives for learning
Non-linear activation function
3rd - Belief Networks (~90s)
Directed acyclic graph composed of (visible & hidden) stochastic variables with weighted connections.
Infer the states of the unobserved variables & learn interactions between variables to make the network more likely to generate observed data.
30. Restricted Boltzmann Machines
Restrict the connectivity to make learning easier.
Only one layer of hidden units (although multiple layers are possible).
No connections between hidden units.
Hidden units are independent given the visible states, so we can quickly get an unbiased sample from the posterior distribution over hidden “causes” when given a data-vector (see the sketch below).
RBMs can be stacked to form Deep Belief Nets (DBN) – the 4th generation of ANNs
[Diagram: bipartite graph of visible units i and hidden units j]
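That conditional independence is what makes inference cheap. Here is a minimal numpy sketch of sampling the hidden units given a visible data-vector, for a binary RBM with assumed pre-trained weights and biases:

```python
import numpy as np

def sample_hidden(v, W, b, rng=None):
    """One exact sample from P(h | v). v: visible vector (n_visible,);
    W: weights (n_visible, n_hidden); b: hidden biases (n_hidden,).
    Because hidden units are independent given v, each unit is an
    independent coin flip with probability sigmoid(v @ W + b)."""
    if rng is None:
        rng = np.random.default_rng()
    p_h = 1.0 / (1.0 + np.exp(-(v @ W + b)))     # sigmoid per hidden unit
    return (rng.random(p_h.shape) < p_h).astype(float)
```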
33. Ranking
Ranking = Scoring + Sorting + Filtering bags of movies for presentation to a user
Goal: find the best possible ordering of a set of videos for a user, within a specific context, in real-time
Objective: maximize consumption
Aspiration: played & “enjoyed” titles have the best scores
Akin to CTR forecast for ads/search results
Factors
Accuracy
Novelty
Diversity
Freshness
Scalability
…
34. Ranking
Popularity is the obvious baseline
Ratings prediction is a clear secondary data input that allows for personalization
We have added many other features (and tried many more that have not proved useful)
What about the weights?
Based on A/B testing
Machine-learned
35. Example: Two features, linear model
[Figure: candidate videos plotted on two axes, popularity and predicted rating; the linear model below induces the final ranking 1-5]
Linear Model: $f_{rank}(u,v) = w_1 \, p(v) + w_2 \, r(u,v) + b$
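A direct transcription of that two-feature model as a sketch (the weight values are placeholders; on the slides they would come from A/B testing or machine learning, as discussed above):

```python
def frank(popularity, predicted_rating, w1=0.5, w2=0.5, b=0.0):
    """Linear ranking score for one (user, video) pair: w1*p(v) + w2*r(u,v) + b."""
    return w1 * popularity + w2 * predicted_rating + b

# Rank candidate videos by descending score.
candidates = [("A", 0.9, 3.1), ("B", 0.2, 4.8), ("C", 0.5, 4.0)]  # (id, p(v), r(u,v))
ranked = sorted(candidates, key=lambda c: frank(c[1], c[2]), reverse=True)
```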
40. Learning to rank
Machine learning problem: the goal is to construct a ranking model from training data
Training data can have partial order or binary judgments (relevant/not relevant)
The resulting order of the items is typically induced from a numerical score
Learning to rank is a key element for personalization
You can treat the problem as a standard supervised classification problem
41. Learning to Rank Approaches
1. Pointwise
Ranking function minimizes a loss function defined on individual relevance judgments
Ranking score based on regression or classification
Ordinal regression, logistic regression, SVM, GBDT, …
2. Pairwise
Loss function is defined on pairwise preferences
Goal: minimize the number of inversions in the ranking (see the sketch after this list)
The ranking problem is then transformed into a binary classification problem
RankSVM, RankBoost, RankNet, FRank…
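To make the pairwise idea concrete, here is a minimal sketch of a hinge-style pairwise objective over preference pairs, in the spirit of RankSVM (a simplified illustration, not any production loss function):

```python
def pairwise_hinge_loss(scores, prefs, margin=1.0):
    """scores: model score per item; prefs: (i, j) pairs meaning item i
    should rank above item j. A pair contributes loss when it is inverted
    or separated by less than the margin, so minimizing this loss
    minimizes ranking inversions."""
    return sum(max(0.0, margin - (scores[i] - scores[j])) for i, j in prefs)

# e.g., item 0 should beat item 1, and item 2 should beat item 1
loss = pairwise_hinge_loss(scores=[2.0, 1.5, 0.8], prefs=[(0, 1), (2, 1)])
```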
42. Learning to rank - metrics
Quality of ranking is measured using metrics such as:
Normalized Discounted Cumulative Gain (NDCG)
Mean Reciprocal Rank (MRR)
Fraction of Concordant Pairs (FCP)
Others…
But it is hard to optimize machine-learned models directly on these measures (they are not differentiable)
Recent research on models that directly optimize ranking measures
$$\mathrm{NDCG} = \frac{\mathrm{DCG}}{\mathrm{IDCG}}, \qquad \mathrm{DCG} = \mathrm{relevance}_1 + \sum_{i=2}^{n} \frac{\mathrm{relevance}_i}{\log_2 i}$$
$$\mathrm{MRR} = \frac{1}{|H|} \sum_{h_i \in H} \frac{1}{\mathrm{rank}(h_i)}$$
$$\mathrm{FCP} = \frac{\sum_{i \neq j} \mathrm{CP}(x_i, x_j)}{n(n-1)/2}$$
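A plain-Python sketch of these definitions (toy helper functions for illustration; inputs are assumed to be small in-memory lists):

```python
import math

def dcg(rels):
    """rels: graded relevance values in ranked order."""
    return rels[0] + sum(r / math.log2(i) for i, r in enumerate(rels[1:], start=2))

def ndcg(rels):
    ideal = dcg(sorted(rels, reverse=True))          # IDCG: best possible ordering
    return dcg(rels) / ideal if ideal > 0 else 0.0

def mrr(first_relevant_ranks):
    """Mean of 1/rank of the first relevant item, over a set of queries H."""
    return sum(1.0 / r for r in first_relevant_ranks) / len(first_relevant_ranks)

def fcp(scores, truth):
    """Fraction of item pairs whose predicted order agrees with the truth."""
    concordant, total = 0, 0
    for i in range(len(scores)):
        for j in range(i + 1, len(scores)):
            if truth[i] != truth[j]:
                total += 1
                concordant += (scores[i] - scores[j]) * (truth[i] - truth[j]) > 0
    return concordant / total if total else 0.0
```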
43. Learning to Rank Approaches
3. Listwise
a. Indirect loss function
RankCosine: similarity between the ranking list and the ground truth as loss function
ListNet: KL-divergence as loss function, by defining a probability distribution
Problem: optimizing a listwise loss function may not optimize IR metrics
b. Directly optimizing IR measures (difficult since they are not differentiable)
Directly optimize IR measures through genetic programming
Directly optimize measures with simulated annealing
Gradient descent on a smoothed version of the objective function (e.g., CLiMF presented at RecSys 2012, or TFMAP at SIGIR 2012)
SVM-MAP relaxes the MAP metric by adding it to the SVM constraints
AdaRank uses boosting to optimize NDCG
44. Similars
Different similarities computed from different sources: metadata, ratings, viewing data…
Similarities can be treated as data/features
Machine-learned models improve our concept of “similarity”
45. Data & Models - Recap
All sorts of feedback from the user can help generate better recommendations
Need to design systems that capture and take advantage of all this data
The right model is as important as the right data
It is important to come up with new theoretical models, but we also need to think about their application to a domain, and about practical issues
Rating prediction models are only part of the solution to recommendation (think about ranking, similarity…)
46. More data or better models?
Really?
Anand Rajaraman: Stanford & Senior VP at Walmart Global eCommerce (formerly Kosmix)
48. More data or better models?
[Figure: learning curves from Banko and Brill, 2001]
Norvig: “Google does not have better algorithms, only more data”
Many features / low-bias models
52. Consumer Science
Main goal is to effectively innovate for customers
Innovation goals:
“If you want to increase your success rate, double your failure rate.” – Thomas Watson, Sr., founder of IBM
The only real failure is the failure to innovate
Fail cheaply
Know why you failed/succeeded
53. Consumer (Data) Science
1. Start with a hypothesis:
Algorithm/feature/design X will increase member engagement with our service, and ultimately member retention
2. Design a test
Develop a solution or prototype
Think about dependent & independent variables, control, significance…
3. Execute the test
4. Let the data speak for itself
54. Offline/Online testing process
[Flow diagram: Offline testing (days) → on success → Online A/B testing (weeks to months) → on success → Rollout feature to all users; on failure, return to offline testing]
55. Offline testing
Optimize algorithms offline
Measure model performance using metrics such as:
Mean Reciprocal Rank, Normalized Discounted Cumulative Gain, Fraction of Concordant Pairs, Precision/Recall & F-measures, AUC, RMSE, Diversity…
Offline performance is used as an indication to make informed decisions on follow-up A/B tests
A critical (and unsolved) issue is how well offline metrics correlate with A/B test results
Extremely important to define a coherent offline evaluation framework (e.g., how to create training/testing datasets is not trivial)
56. Executing A/B tests
Many different metrics, but we ultimately trust user engagement (e.g., hours of play and customer retention)
Think about significance and hypothesis testing (a sketch follows below)
Our tests usually have thousands of members and 2-20 cells
A/B tests allow you to try radical ideas or test many approaches at the same time
We typically have hundreds of customer A/B tests running
Decisions on the product are always data-driven
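As one way of making the significance point concrete, here is a minimal sketch of comparing retention between a control cell and a test cell with a two-proportion z-test (the choice of test, the function, and the numbers are illustrative assumptions, not Netflix's actual methodology):

```python
import math

def two_proportion_ztest(retained_a, n_a, retained_b, n_b):
    """z-statistic for H0: both cells have the same retention rate."""
    p_a, p_b = retained_a / n_a, retained_b / n_b
    pooled = (retained_a + retained_b) / (n_a + n_b)      # rate under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# e.g., 4,100 of 5,000 members retained in control vs. 4,230 of 5,000 in test
z = two_proportion_ztest(4100, 5000, 4230, 5000)  # |z| > 1.96 ~ significant at 5%
```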
57. What to measure
OEC: Overall Evaluation Criterion
In an A/B test framework, the measure of success is key
Short-term metrics do not always align with long-term goals
E.g., CTR: generating more clicks might mean that our recommendations are actually worse
Use long-term metrics such as LTV (lifetime value) whenever possible
At Netflix, we use member retention
58. What to measure
Short-term metrics can sometimes be informative, and may allow for faster decision-making
At Netflix we use many, such as hours streamed by users or % of hours from a given algorithm
But be aware of several caveats of using early decision mechanisms
Initial effects appear to trend
See “Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained” [Kohavi et al., KDD 2012]
59. Consumer Data Science - Recap
Consumer Data Science aims to innovate for the customer by running experiments and letting the data speak
This is mainly done through online A/B testing
However, we can speed up innovation by experimenting offline
But for both online and offline experimentation, it is important to choose the right metric and experimental framework
64. Event & Data Distribution
• UI devices should broadcast many different kinds of user events
• Clicks
• Presentations
• Browsing events
• …
• Events vs. data
• Some events only need to be propagated and trigger an action (low latency, low information per event)
• Others need to be processed and “turned into” data (higher latency, higher information quality)
• And… there are many in between
• Real-time event flow is managed through an internal tool (Manhattan)
• Data flow is mostly managed through Hadoop
66. Offline Jobs
• Two kinds of offline jobs
• Model training
• Batch offline computation of recommendations/intermediate results
• Offline queries are written in either Hive or Pig
• Need a publishing mechanism that solves several issues
• Notify readers when the result of a query is ready
• Support different repositories (S3, Cassandra…)
• Handle errors, monitoring…
• We do this through Hermes
68. Computation
• Two ways of computing personalized results
• Batch/offline
• Online
• Each approach has pros/cons
• Offline
+ Allows more complex computations
+ Can use more data
- Cannot react to quick changes
- May result in staleness
• Online
+ Can respond quickly to events
+ Can use the most recent data
- May fail because of SLA
- Cannot deal with “complex” computations
• It’s not an either/or decision
• Both approaches can be combined
70. Signals & Models
• Both offline and online algorithms are based on three different inputs:
• Models: previously trained from existing data
• (Offline) Data: previously processed and stored information
• Signals: fresh data obtained from live services
• User-related data
• Context data (session, date, time…)
72. Results
• Recommendations can be serviced from:
• Previously computed lists
• Online algorithms
• A combination of both
• The decision on where to service the recommendation from can respond to many factors, including context
• Also important to think about fallbacks – what if plan A fails? (a sketch follows below)
• Previously computed lists/intermediate results can be stored in a variety of ways
• Cache
• Cassandra
• Relational DB
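A minimal sketch of that fallback chain (all names here are hypothetical stand-ins, not real Netflix components; the point is the ordered degradation from online computation to precomputed lists to an unpersonalized default):

```python
def serve_recommendations(user_id, context, online, cache, fallback):
    """online: callable that may raise TimeoutError (e.g., an SLA miss);
    cache: dict-like store of precomputed lists; fallback: callable
    returning an unpersonalized default list."""
    try:
        return online(user_id, context)       # plan A: fresh, contextual results
    except TimeoutError:
        pass                                  # fall through to precomputed results
    recs = cache.get(user_id)                 # plan B: offline batch output
    return recs if recs is not None else fallback()  # plan C: e.g., popularity
```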
73. Alerts and Monitoring
A non-trivial concern in large-scale recommender systems
Monitoring: continuously observe the quality of the system
Alerting: fast notification if the quality of the system goes below a certain pre-defined threshold
Questions:
What do we need to monitor?
How do we know something is “bad enough” to alert on?
74. What to monitor
Staleness
Monitor time since the last data update
[Chart: time since last update, with an annotated spike: “Did something go wrong here?”]
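A minimal sketch of such a staleness check (the threshold is an arbitrary illustration; in practice it would be tuned per data source):

```python
import time

STALENESS_THRESHOLD_S = 6 * 3600  # illustrative: alert after 6 hours without updates

def is_stale(last_update_ts, now=None):
    """Return True if the time since the last data update exceeds the threshold."""
    now = time.time() if now is None else now
    return (now - last_update_ts) > STALENESS_THRESHOLD_S
```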
75-76. What to monitor
Algorithmic quality
Monitor different metrics by comparing what users do with what your algorithm predicted they would do
[Chart: quality metric over time, with an annotated anomaly: “Did something go wrong here?”]
77. What to monitor
Algorithmic source for users
Monitor how users interact with different algorithms
[Chart: share of user interactions per algorithm; after a new version of Algorithm X launches: “Did something go wrong here?”]
78. When to alert
Alerting thresholds are hard to tune
Avoid unnecessary alerts (the “learn-to-ignore problem”)
Avoid important issues being noticed before the alert happens
Rules of thumb:
Alert on anything that will impact user experience significantly
Alert on issues that are actionable
If a noticeable event happens without an alert… add a new alert for next time
80. The Personalization Problem
The Netflix Prize simplified the recommendation problem to predicting ratings
But…
User ratings are only one of the many data inputs we have
Rating predictions are only part of our solution
Other algorithms such as ranking or similarity are very important
We can reformulate the recommendation problem
Function to optimize: the probability that a user chooses something and enjoys it enough to come back to the service
81. More data +
Better models +
More accurate metrics +
Better approaches & architectures
Lots of room for improvement!