This document summarizes a presentation given by Xavier Amatriain from Netflix on their recommendation system and personalization techniques. Netflix uses a variety of machine learning models like SVD, RBMs, and linear regression to make personalized recommendations. They also personalize other aspects of the user experience like rankings, genres, and similar item suggestions. Netflix collects massive amounts of user data from ratings, searches, and streaming to train these models. The goal is to provide high quality recommendations that are accurate, novel, diverse, and increase user engagement.
Deeper Things: How Netflix Leverages Deep Learning in Recommendations and Se... - Sudeep Das, Ph.D.
In this talk, we will provide an overview of Deep Learning methods applied to personalization and search at Netflix. We will set the stage by describing the unique challenges faced at Netflix in the areas of recommendations and information retrieval. Then we will delve into how we leverage a blend of traditional algorithms and emergent deep learning methods and new types of embeddings, especially hyperbolic space embeddings, to address these challenges.
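Hyperbolic (Poincaré-ball) embeddings are attractive for hierarchical catalog data because distances grow explosively toward the boundary of the ball, letting tree-like structures embed with low distortion. As a rough illustration of the geometry (not code from the talk), the Poincaré distance between two points inside the unit ball is:

```python
import math

def poincare_distance(u, v):
    """Hyperbolic distance between two points strictly inside the unit ball
    (Poincare ball model). Distances blow up near the boundary, which is
    what lets tree-like hierarchies embed with low distortion."""
    norm_u = sum(x * x for x in u)
    norm_v = sum(x * x for x in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - norm_u) * (1.0 - norm_v)))
```

For example, the distance from the origin to (0.5, 0) is about 1.10, more than twice the Euclidean 0.5, and it diverges as either point approaches the boundary.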
A Multi-Armed Bandit Framework For Recommendations at Netflix - Jaya Kawale
In this talk, we present a general multi-armed bandit framework for recommendations on the Netflix homepage. We present two example case studies using MABs at Netflix - a) Artwork Personalization to recommend personalized visuals for each of our members for the different titles and b) Billboard recommendation to recommend the right title to be watched on the Billboard.
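As a minimal sketch of the bandit idea behind both case studies (illustrative only; the production systems, arm definitions, and reward signals are far richer), an epsilon-greedy policy over hypothetical artwork variants might look like:

```python
import random

def epsilon_greedy_select(counts, rewards, epsilon=0.1, rng=random):
    """Pick an arm (e.g. an artwork variant): with probability epsilon
    explore a random arm, otherwise exploit the best empirical mean reward."""
    arms = list(counts)
    if rng.random() < epsilon:
        return rng.choice(arms)
    return max(arms, key=lambda a: rewards[a] / counts[a] if counts[a] else 0.0)

def record(counts, rewards, arm, reward):
    """Log one impression and its observed reward (e.g. 1.0 for a play)."""
    counts[arm] += 1
    rewards[arm] += reward
```

The production systems described in the talk use more statistically efficient policies (e.g. Thompson sampling and contextual bandits), but the explore/exploit trade-off above is the common core.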
Talk with Yves Raimond at the GPU Tech Conference on March 28, 2018 in San Jose, CA.
Abstract:
In this talk, we will survey how Deep Learning methods can be applied to personalization and recommendations. We will cover why standard Deep Learning approaches don't perform better than typical collaborative filtering techniques. Then we will go over recently published research at the intersection of Deep Learning and recommender systems, looking at how they integrate new types of data, explore new models, or change the recommendation problem statement. We will also highlight some of the ways that neural networks are used at Netflix and how we can use GPUs to train recommender systems. Finally, we will highlight promising new directions in this space.
Past, Present & Future of Recommender Systems: An Industry Perspective - Justin Basilico
Slides from our talk at the RecSys 2016 conference in Boston, MA on 2016-09-18, giving our perspective on important areas for future work in recommender systems.
(Presented at the Deep Learning Re-Work SF Summit on 01/25/2018)
In this talk, we go through the traditional recommendation systems set-up, and show that deep learning approaches in that set-up don't bring a lot of extra value. We then focus on different ways to leverage these techniques, most of which rely on breaking away from that traditional set-up: providing additional data to your recommendation algorithm, modeling different facets of user/item interactions, and, most importantly, re-framing the recommendation problem itself. In particular, we show a few results obtained by casting the problem as a contextual sequence prediction task, and using it to model time (a very important dimension in most recommendation systems).
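A toy stand-in for the contextual sequence-prediction framing (the talk uses far richer models, e.g. recurrent networks; this is only to make the re-framing concrete) is a next-item table keyed on context plus the previous title:

```python
from collections import Counter, defaultdict

class ContextualNextItem:
    """Toy contextual sequence model: predict the next title from the previous
    one plus a context key (e.g. time-of-day). A deliberately simple stand-in
    for the sequence models discussed in the talk, not Netflix's code."""

    def __init__(self):
        self.table = defaultdict(Counter)

    def fit(self, sessions):
        # sessions: iterable of (context, [item1, item2, ...]) viewing sessions
        for context, items in sessions:
            for prev, nxt in zip(items, items[1:]):
                self.table[(context, prev)][nxt] += 1

    def predict(self, context, prev):
        counts = self.table.get((context, prev))
        return counts.most_common(1)[0][0] if counts else None
```

Even this trivial model shows why the re-framing matters: the same previous title can lead to different predictions in different contexts.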
Personalizing "The Netflix Experience" with Deep Learning - Anoop Deoras
These are the slides from my talk presented at the AI Next Con conference in Seattle in Jan 2019. Here I talk in a bit more detail about the intuition behind collaborative filtering and go a bit deeper into the details of non-linear deep-learned models.
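For readers new to the collaborative-filtering intuition mentioned here, a bare-bones matrix factorization trained by SGD (an illustrative sketch with made-up ratings, not the talk's models) captures the core idea that the deeper, non-linear models generalize:

```python
import random

def factorize(ratings, n_users, n_items, k=2, lr=0.05, reg=0.02, epochs=200, seed=0):
    """Plain matrix factorization by SGD: approximate r_ui ~ p_u . q_i.
    ratings is a list of (user, item, rating) triples."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(P[u][f] * Q[i][f] for f in range(k))
            err = r - pred
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)   # regularized SGD step
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q
```

Non-linear deep models replace the inner product `p_u . q_i` with a learned function of the user and item representations, but the latent-factor intuition is the same.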
Presentation at the Netflix Expo session at RecSys 2020 virtual conference on 2020-09-24. It provides an overview of recommendation and personalization at Netflix and then highlights some of the things we’ve been working on as well as some important open research questions in the field of recommendations.
Tutorial on Deep Learning in Recommender Systems, LARS Summer School 2019 - Anoop Deoras
I had a fun time giving a tutorial on the topic of deep learning in recommender systems at the Latin America School on Recommender Systems (LARS) in Fortaleza, Brazil.
Déjà Vu: The Importance of Time and Causality in Recommender Systems - Justin Basilico
Talk at RecSys 2017 in Como, Italy on 2017-08-29.
Abstract:
Time plays a key role in recommendation. Handling it properly is especially critical when using recommender systems in real-world applications, which may not be as clear when doing research with historical data. In this talk, we will discuss some of the important challenges of handling time in recommendation algorithms at Netflix. We will focus on challenges related to how our users, items, and systems all change over time. We will then discuss some strategies for tackling these challenges, which revolve around proper treatment of causality in our systems.
A Human Perspective on Algorithmic Similarity (RecSys 2020, September 2020) - Zachary Schendel
In the Netflix user interface (UI), when a row or UI element is named “Because you Watched...”, “More Like This”, or “Because you added to your list”, the overarching goal is to recommend a movie or TV show that a member might like based on the fact that they took a meaningful action on a source item. We have employed similar recommendations in many UI elements: on the homepage as a row of recommendations, after you click into a title, or as a piece of information about why a member should watch a title.
From an algorithmic perspective, there are many ways to define a “successful” similar recommendation. We sought to broaden that definition of success. To this end, the Consumer Insights team recently completed a suite of research projects to explore the intricacies of member perceptions of similar recommendations. The Netflix Consumer Insights team employs qualitative (e.g., in-depth interviews) and quantitative (e.g., surveys) research methods, interfacing directly with Netflix members to uncover pain points that can inspire new product innovation. The research concluded that, while the typical member believes movies are broadly similar when they share a common genre or theme, similarity is more complex, nuanced, and personal than we might have imagined. The vernacular we use in the UI implies that there should be at least some kind of relationship between the source item and the recommendations that follow. Many of our similar recommendations felt “out of place”, mostly because the relationship between the source item and the recommendation was unclear or absent. When similar recommendations tell a completely misleading, incorrect, or confusing story, member trust can be broken.
We will structure the presentation around three new insights that our research found to have an influence on the perception of similarity in the context of Netflix as well as the research methods used to uncover those insights. First, the reason a member loves a given movie will vary. For example, do you want to watch other baseball movies like Field of Dreams, or would you prefer other romances like Field of Dreams? Second, members are more or less flexible about how similar a recommendation actually needs to be depending on the properties of and their interactions with the canvas containing the recommendation. For example, a Because You Watched row on the homepage implies vaguer similarity while a More Like This gallery behind a click into the source item implies stricter similarity. Finally, even when we held the UI element constant, we found that similar recommendations are only valuable in some contexts. After finishing a movie, a member might prefer a similar recommendation one day and a change of pace the next. Research methods discussed will include Inverse Multi-Dimensional Scaling [1], survey experimentation, and ways to apply qualitative research to improve algorithmic recommendations.
At Netflix, we take the context of the member seriously.
In this keynote talk, we will see how modeling contextual factors such as time or device can help members find the right content at the right moment.
In the end, the goal is to maximize member satisfaction and retention.
These slides go through which contextual factors matter for the video service and why we chose to use them or not.
At Netflix, we try to provide the best personalized video recommendations to our members. To do this, we need to adapt our recommendations for each contextual situation, which depends on information such as time or device. In this talk, I will describe how state of the art Contextual Recommendations are used at Netflix. A first example of contextual adaptation is the model that powers the Continue Watching row. It uses a feature-based approach with a carefully constructed training set to learn how to adapt to the context of the member. Next, I will dive into more modern approaches such as Tensor Factorization and LSTMs and share some results from deployments of these methods. I will highlight lessons learned and some common pitfalls of using these powerful methods in industrial scale systems. Finally, I will touch upon system reliability, choice of optimization metrics, hidden costs, risks and benefits of using highly adaptive systems.
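To make the tensor-factorization idea concrete (a toy CP-style sketch with hypothetical data, not the deployed model), a (user, item, context) triple can be scored as the inner product of three factor vectors, trained by SGD just like plain matrix factorization:

```python
import random

def tensor_factorize(obs, n_users, n_items, n_ctx, k=2, lr=0.05, reg=0.02,
                     epochs=400, seed=0):
    """CP-style tensor factorization: score(u, i, c) ~ sum_f U[u][f]*V[i][f]*C[c][f].
    obs is a list of (user, item, context, rating) tuples."""
    rng = random.Random(seed)
    U = [[rng.uniform(0.1, 0.5) for _ in range(k)] for _ in range(n_users)]
    V = [[rng.uniform(0.1, 0.5) for _ in range(k)] for _ in range(n_items)]
    C = [[rng.uniform(0.1, 0.5) for _ in range(k)] for _ in range(n_ctx)]
    for _ in range(epochs):
        for u, i, c, r in obs:
            pred = sum(U[u][f] * V[i][f] * C[c][f] for f in range(k))
            err = r - pred
            for f in range(k):
                uf, vf, cf = U[u][f], V[i][f], C[c][f]
                U[u][f] += lr * (err * vf * cf - reg * uf)
                V[i][f] += lr * (err * uf * cf - reg * vf)
                C[c][f] += lr * (err * uf * vf - reg * cf)
    return U, V, C
```

The context factor vector lets the same user/item pair receive different scores in different contexts, which is exactly the adaptation the talk describes (e.g. morning vs. evening, TV vs. phone).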
Personalized Page Generation for Browsing Recommendations - Justin Basilico
Talk from First Workshop on Recommendation Systems for TV and Online Video at RecSys 2014 in Foster City, CA on 2014-10-10 about how we personalize the layout of the Netflix homepage to make it easier for people to browse the recommendations to quickly find something to watch and enjoy.
Crafting Recommenders: the Shallow and the Deep of it! - Sudeep Das, Ph.D.
I present a brief review, and an outlook on the rapid changes happening in the field of recommendation engine research on the heels of the deep learning revolution!
Talk from QCon SF on 2018-11-05
For many years, the main goal of the Netflix personalized recommendation system has been to get the right titles in front of each of our members at the right time. With a catalog spanning thousands of titles and a diverse member base spanning over a hundred million accounts, recommending the titles that are just right for each member is crucial. But the job of recommendation does not end there. Why should you care about any particular title we recommend? What can we say about a new and unfamiliar title that will pique your interest? How do we convince you that a title is worth watching? Answering these questions is critical in helping our members discover great content, especially for unfamiliar titles. One way to do this is to consider the artwork or imagery we use to visually portray each title. If the artwork representing a title captures something compelling to you, then it acts as a gateway into that title and gives you some visual “evidence” for why the title might be good for you. Selecting good artwork is important because it may be the first time a member becomes aware of a title (and sometimes the only time), so it must speak to them in a meaningful way. In this talk, we will present an approach for personalizing the artwork we show for each title on the Netflix homepage. We will look at how to frame this as a machine learning problem using contextual multi-armed bandits in a recommendation system setting. We will also describe the algorithmic and system challenges involved in getting this type of approach for artwork personalization to succeed at Netflix scale. Finally, we will discuss some of the future opportunities that we see to expand and improve upon this approach.
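A heavily simplified sketch of the contextual-bandit framing (per-arm linear reward models with epsilon-greedy exploration; the arm names, features, and the far more involved production policy are all hypothetical here) looks like:

```python
import random

class ContextualEpsilonGreedy:
    """Contextual bandit sketch: one linear reward model per arm (e.g. per
    artwork variant), trained online, with epsilon-greedy exploration."""

    def __init__(self, arms, d, epsilon=0.1, lr=0.1, seed=0):
        self.w = {a: [0.0] * d for a in arms}  # per-arm weight vectors
        self.epsilon = epsilon
        self.lr = lr
        self.rng = random.Random(seed)

    def score(self, arm, x):
        return sum(wi * xi for wi, xi in zip(self.w[arm], x))

    def select(self, x):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.w))
        return max(self.w, key=lambda a: self.score(a, x))

    def update(self, arm, x, reward):
        # One SGD step toward the observed reward for the shown arm.
        err = reward - self.score(arm, x)
        self.w[arm] = [wi + self.lr * err * xi
                       for wi, xi in zip(self.w[arm], x)]
```

The context vector `x` would encode member and situational features, so different members (or the same member in different situations) can be shown different artwork for the same title.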
Shallow and Deep Latent Models for Recommender Systems - Anoop Deoras
In this presentation, we survey latent models, starting with shallow and progressing towards deep, as applied to personalization and recommendations. After providing an overview of the Netflix recommender system, we discuss research at the intersection of deep learning, natural language processing and recommender systems and how they relate to traditional collaborative filtering techniques. We will present case studies in the space of deep latent variable models applied to recommender systems.
Personalization at Netflix - Making Stories Travel - Sudeep Das, Ph.D.
I give a high level overview of how personalization at Netflix helps our members find titles that spark joy, as well as help stories travel across the world.
Recommendation systems today are widely used across many applications such as multimedia content platforms, social networks, and ecommerce, to provide suggestions to users that are most likely to fulfill their needs, thereby improving the user experience. Academic research, to date, largely focuses on the performance of recommendation models in terms of ranking quality or accuracy measures, which often don’t directly translate into improvements in the real world. In this talk, we present some of the most interesting challenges that we face in the personalization efforts at Netflix. The goal of this talk is to shine a light on challenging research problems in industrial recommendation systems and start a conversation about exciting areas of future research.
Artwork Personalization at Netflix (RecSys 2018) - Fernando Amat
For many years, the main goal of the Netflix personalized recommendation system has been to get the right titles in front of our members at the right time. But the job of recommendation does not end there. The homepage should be able to convey to the member enough evidence of why a title may be good for her, especially for shows that the member has never heard of. One way to address this challenge is to personalize the way we portray the titles on our service. An important aspect of how to portray titles is through the artwork or imagery we display to visually represent each title. The artwork may highlight an actor that you recognize, capture an exciting moment like a car chase, or contain a dramatic scene that conveys the essence of a movie or show. It is important to select good artwork because it may be the first time a member becomes aware of a title (and sometimes the only time), so it must speak to them in a meaningful way. In this talk, we will present an approach for personalizing the artwork we use on the Netflix homepage. The system selects an image for each member and video to give better visual evidence for why the title might be appealing to that particular member.
Building Large-scale Real-world Recommender Systems - RecSys 2012 Tutorial - Xavier Amatriain
There is more to recommendation algorithms than rating prediction. And, there is more to recommender systems than algorithms. In this tutorial, given at the 2012 ACM Recommender Systems Conference in Dublin, I review things such as different interaction and user feedback mechanisms, offline experimentation and AB testing, or software architectures for Recommender Systems.
In the world of recommendation systems, there are various theories and algorithms that work together to give the best results. Among these, the core recommendation algorithm is crucial. This paper will provide an introduction to some fundamental algorithms used in recommendation systems. These algorithms are like building blocks that help make recommendations more effective.
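As one example of such a building block (a generic illustration, not taken from this paper), item-based collaborative filtering recommends the item whose rating vector is most similar to a seed item:

```python
import math

def cosine(a, b):
    """Cosine similarity between two rating vectors (0.0 where a user skipped)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def most_similar(item, item_vectors):
    """Item-based CF building block: among the other items, return the one
    whose rating vector is most similar to the seed item's."""
    return max((other for other in item_vectors if other != item),
               key=lambda other: cosine(item_vectors[item], item_vectors[other]))
```

Here each vector holds one rating per user; items rated highly by the same users end up close in cosine similarity, which is the basis of "more like this" style recommendations.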
Facets and Pivoting for Flexible and Usable Linked Data Exploration - Roberto García
The success of Open Data initiatives has increased the amount of data available on the Web. Unfortunately, most of this data is only available in raw tabular form, which makes analysis and reuse quite difficult for non-experts. Linked Data principles allow for a more sophisticated approach by making explicit both the structure and semantics of the data. However, from the end-user viewpoint, these datasets continue to be monolithic files that are completely opaque, or that can only be explored through tedious semantic queries. Our objective is to help the user grasp what kinds of entities are in the dataset, how they are interrelated, which are their main properties and values, and so on. Rhizomer is a tool for data publishing whose interface provides a set of components borrowed from Information Architecture (IA) that facilitate awareness of the dataset at hand. It automatically generates navigation menus and facets based on the kinds of things in the dataset and how they are described through metadata properties and values. Moreover, motivated by recent tests with end-users, it also provides the possibility to pivot among the faceted views created for each class of resources in the dataset.
Understanding Content Using Deep Learning for NLP - Jaya Kawale
Tubi is an ad-supported video-on-demand service that allows its users to watch content online. For much of the content, there is a large amount of textual data in the form of user reviews, synopses, title plots, and even Wikipedia articles. Furthermore, there is a large amount of metadata in the form of actors, ratings, year of release, studio, etc. In this talk, I will present some of the challenges in understanding this data and present our platform for content understanding.
Efficient Filtering in Pub-Sub Systems Using BDD - Nabeel Yoosuf
Slides prepared based on the paper "Efficient Filtering in Publish-Subscribe Systems using BDD" by Alexis Campailla, Sagar Chaki, Edmund Clarke, Somesh Jha, and Helmut Veith.
[SOCRS2013] Differential Context Modeling in Collaborative Filtering - Yong Zheng
Abstract: Context-aware recommender systems (CARS) try to adapt their recommendations to users’ specific contextual situations. In many recommender systems, particularly those based on collaborative filtering (CF), the additional contextual constraints may lead to increased sparsity in the user preference data, thus fewer matches between the current user context and previous situations. Our earlier work proposed two approaches to deal with this problem – differential context relaxation (DCR) and differential context weighting (DCW) – and we have successfully examined them using user-based collaborative filtering (UBCF). In this paper, we put DCR and DCW into one framework called differential context modeling (DCM). As a general framework, DCM can be applied to recommendation algorithms other than UBCF. We expand the application of DCM to two other CF approaches: item-based CF and the slope one recommender. Predictive performance is evaluated on two real-world data sets, and experimental results demonstrate that applying DCM to those two algorithms improves predictive accuracy compared with our baselines: context-free CF algorithms and contextual pre-filtering algorithms.
Data/AI driven product development: from video streaming to telehealthXavier Amatriain
Healthcare is different from any other application domain, or is it not? While it is true that there are specific aspects, such as high-stakes decisions and a complex regulatory framework, that make healthcare somewhat different, it is also the case that many of the lessons learned from building data-driven products in other domains translate remarkably well into healthcare. This is particularly so because healthcare is also a user-facing domain, where users can be patients or healthcare professionals. Given that data has been shown to improve user experience while ensuring quality and scalability, few would argue that healthcare cannot benefit from being much more data-driven than it has traditionally been.
In this talk, I describe how decades of experience building impactful data and AI solutions into user-facing products can be leveraged to revolutionize telehealth. At Curai, we combine approaches such as state-of-the-art large language models with expert systems in areas such as NLP, vision, and automated diagnosis to augment and scale doctors, and to improve user experience and healthcare outcomes. We will see some of those applications while analyzing the role of data and ML algorithms in making them possible.
AI-driven product innovation: from Recommender Systems to COVID-19Xavier Amatriain
AI/Machine Learning has become an integral part of many household tech products, from Netflix to our phones. In this talk I will draw from my experience driving AI teams at some of those companies to showcase how AI can positively impact products as different as Netflix and Curai, an online telehealth service.
With half of the world’s population lacking access to healthcare services, and 30% of the adult population in the US having health insurance coverage inadequate for even basic access to services, it should have been clear that a pandemic like COVID-19 would strain the global healthcare system well beyond its maximum capacity. In this context, many are trying to embrace and encourage the use of telehealth as a way to provide safe and convenient access to care. However, telehealth by itself cannot scale to cover all our needs unless we improve scalability and efficiency through AI and automation.
In this talk, we will describe how our work on combining the latest AI advances with medical experts and online access has the potential to change the landscape of healthcare access and provide 24/7 quality healthcare. By combining areas such as NLP, vision, and automatic diagnosis, we can augment and scale doctors. We will describe our work on combining expert systems with deep learning to build state-of-the-art medical diagnostic models that are also able to model the unknowns. We will also show our work on using language models for medical Q&A. More importantly, we will describe how those approaches have been used to address the urgent and immediate needs of the current pandemic.
AI for COVID-19: An online virtual care approachXavier Amatriain
Slides for the talk I gave at the AI and COVID-19 virtual conference at Stanford. Video here: https://hai.stanford.edu/events/covid-19-and-ai-virtual-conference/video-archive
From one to zero: Going smaller as a growth strategyXavier Amatriain
This talk was designed for engineering managers. Having been at companies of all sizes, I recommend that managers who want to grow go smaller. At the same time, I reflect on which important things remain constant regardless of size and context, and which don't.
Deep learning has accomplished impressive feats in areas such as voice recognition, image processing, and natural language processing. Deep learning enthusiasts have rushed to predict that this family of algorithms is likely to take over most other applications in the near future. This focus on deep architectures seems to have cast a shadow over more “traditional” machine learning and data science approaches, leaving researchers and practitioners alike wondering whether there is any point in investing in feature engineering or simpler models.
In this talk, I will go over what deep learning can and cannot do for you, both now and in the near future. I will also describe how different approaches will continue to be needed, and why their demand will likely grow despite the rise of deep learning. I will support my claims not only by looking at recent publications, but also by using practical examples drawn from my experience at companies at the forefront of machine learning applications, such as Quora.
Past, present, and future of Recommender Systems: an industry perspectiveXavier Amatriain
Keynote for the ACM Intelligent User Interface conference in 2016 in Sonoma, CA. I start with the past by talking about the Recommender Problem, and the Netflix Prize. Then I go into the Present and the Future by talking about approaches that go beyond rating prediction and ranking and by finishing with some of the most important lessons learned over the years. Throughout my talk I put special emphasis on the relation between algorithms and the User Interface.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. However, fostering a culture of innovation takes a great deal of work. It takes vision, leadership, and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud, and open source: exploring how these areas are likely to mature and develop over the short and long term, and considering how organisations can position themselves to adapt and thrive.
DevOps and Testing slides at DASA ConnectKari Kakkonen
Slides by Rik Marselis and me from the DASA Connect conference on 30 May 2024. We discuss what testing is, what agile testing is, and finally what testing in DevOps looks like. We also held a lovely workshop in which the participants explored different ways to think about quality and testing in different parts of the DevOps infinity loop.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Search and Society: Reimagining Information Access for Radical FuturesBhaskar Mitra
The field of information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build, inspired by diverse, explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies need to be explicitly articulated, and we need to develop theories of change in the context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
1. Netflix Recommendations: Beyond the 5 Stars
ACM SF-Bay Area, October 22, 2012
Xavier Amatriain
Personalization Science and Engineering, Netflix
@xamat
2. Outline
1. The Netflix Prize & the Recommendation Problem
2. Anatomy of Netflix Personalization
3. Data & Models
4. And…
a) Consumer (Data) Science
b) Or Software Architectures
4. SVD / RBM
What we were interested in:
§ High quality recommendations
Proxy question:
§ Accuracy in predicted rating
§ Improve by 10% = $1 million!
Results:
§ Top 2 algorithms (SVD and RBM) still in production
5. What about the final prize ensembles?
§ Our offline studies showed they were too computationally intensive to scale
§ Expected improvement not worth the engineering effort
§ Plus… focus had already shifted to other issues that had more impact than rating prediction.
14. Genre rows
§ Personalized genre rows focus on user interest
§ Also provide context and “evidence”
§ Important for member satisfaction - moving personalized rows to the top on devices increased retention
§ How are they generated?
§ Implicit: based on user’s recent plays, ratings, & other interactions
§ Explicit taste preferences
§ Hybrid: combine the above
§ Also take into account:
§ Freshness - has this been shown before?
§ Diversity - avoid repeating tags and genres, limit number of TV genres, etc.
21. Similars
§ Displayed in many different contexts
§ In response to user actions/context (search, queue add…)
§ “More like…” rows
22. Anatomy of a Personalization - Recap
§ Everything is a recommendation: not only rating prediction, but also ranking, row selection, similarity…
§ We strive to make it easy for the user, but…
§ We want the user to be aware of and involved in the recommendation process
§ Deal with implicit/explicit and hybrid feedback
§ Add support/explanations for recommendations
§ Consider issues such as diversity or freshness
24. Big Data @Netflix
§ Almost 30M subscribers
§ Ratings: 4M/day
§ Searches: 3M/day
§ Plays: 30M/day
§ 2B hours streamed in Q4 2011
§ 1B hours in June 2012
25. Smart Models
§ Logistic/linear regression
§ Elastic nets
§ SVD and other MF models
§ Restricted Boltzmann Machines
§ Markov Chains
§ Different clustering approaches
§ LDA
§ Association Rules
§ Gradient Boosted Decision Trees
§ …
26. SVD
X[m x n] = U[m x r] S[r x r] (V[n x r])^T
§ X: m x n matrix (e.g., m users, n videos)
§ U: m x r matrix (m users, r concepts)
§ S: r x r diagonal matrix (strength of each ‘concept’; r: rank of the matrix)
§ V: n x r matrix (n videos, r concepts), so V^T is r x n
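The decomposition on this slide can be sketched with NumPy; the ratings matrix below is a made-up toy example, and keeping only the top-r singular values gives the low-rank "concepts" approximation:

```python
import numpy as np

# Hypothetical toy ratings matrix: 4 users x 3 videos (values are invented).
X = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 1.0],
              [1.0, 1.0, 5.0],
              [0.0, 1.0, 4.0]])

# Full SVD: X = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the top-r "concepts" for a low-rank approximation.
r = 2
X_approx = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

# The rank-r approximation is optimal in the least-squares sense.
error = np.linalg.norm(X - X_approx)
print(f"Reconstruction error with r={r}: {error:.3f}")
```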
27. Simon Funk’s SVD
§ One of the most interesting findings during the Netflix Prize came out of a blog post
§ Incremental, iterative, and approximate way to compute the SVD using gradient descent
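A minimal sketch of the Funk-style approach: instead of computing a full SVD, learn user and item factor vectors by stochastic gradient descent over the observed ratings only. The data, learning rate, and regularization below are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, r = 4, 3, 2
# (user, item, rating) triples -- toy data
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 2, 5.0), (3, 2, 4.0)]

P = 0.1 * rng.standard_normal((n_users, r))   # user factors
Q = 0.1 * rng.standard_normal((n_items, r))   # item factors
lr, reg = 0.01, 0.02

for epoch in range(500):
    for u, v, rating in ratings:
        err = rating - P[u] @ Q[v]            # prediction error on this rating
        pu = P[u].copy()                      # keep old value for the Q update
        P[u] += lr * (err * Q[v] - reg * P[u])
        Q[v] += lr * (err * pu - reg * Q[v])

print("Predicted rating for (user 0, item 0):", P[0] @ Q[0])
```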
28. SVD for Rating Prediction
§ User factor vectors pu ∈ ℝ^f and item factor vectors qv ∈ ℝ^f
§ Baseline buv = µ + bu + bv (user & item deviation from average)
§ Predict rating as r'uv = buv + pu^T qv
§ SVD++ (Koren et al.): asymmetric variation with implicit feedback
r'uv = buv + qv^T ( |R(u)|^(-1/2) Σ_{j∈R(u)} (ruj - buj) xj + |N(u)|^(-1/2) Σ_{j∈N(u)} yj )
§ Where:
§ qv, xv, yv ∈ ℝ^f are three item factor vectors
§ Users are not parametrized, but rather represented by:
§ R(u): items rated by user u
§ N(u): items for which the user has given implicit preference (e.g. rated vs. not rated)
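The baseline-plus-factors prediction r'uv = buv + pu^T qv can be sketched as follows. This shows the plain SVD predictor, not the full SVD++ variant, and all biases and factor values are invented for illustration:

```python
import numpy as np

mu = 3.6                      # global mean rating (invented)
b_u = {0: 0.3}                # user 0 rates 0.3 above average
b_v = {7: -0.2}               # item 7 is rated 0.2 below average

rng = np.random.default_rng(1)
f = 8                         # number of latent factors
p = {0: 0.1 * rng.standard_normal(f)}   # user factor vector p_u
q = {7: 0.1 * rng.standard_normal(f)}   # item factor vector q_v

def predict(u, v):
    # r'uv = (mu + b_u + b_v) + p_u . q_v
    baseline = mu + b_u[u] + b_v[v]
    return baseline + p[u] @ q[v]

print(predict(0, 7))
```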
29. Artificial Neural Networks - 4 generations
§ 1st - Perceptrons (~60s)
§ Single layer of hand-coded features
§ Linear activation function
§ Fundamentally limited in what they can learn to do.
§ 2nd - Back-propagation (~80s)
§ Back-propagate error signal to get derivatives for learning
§ Non-linear activation function
§ 3rd - Belief Networks (~90s)
§ Directed acyclic graph composed of (visible & hidden) stochastic variables with weighted connections.
§ Infer the states of the unobserved variables & learn interactions between variables to make the network more likely to generate the observed data.
30. Restricted Boltzmann Machines
§ Restrict the connectivity to make learning easier.
§ Only one layer of hidden units (although multiple layers are possible).
§ No connections between hidden units.
§ Hidden units are independent given the visible states.
§ So we can quickly get an unbiased sample from the posterior distribution over hidden “causes” when given a data-vector.
§ RBMs can be stacked to form Deep Belief Nets (DBN) - the 4th generation of ANNs
[Figure: bipartite graph with a visible layer (units i) fully connected to a hidden layer (units j), with no within-layer connections]
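The conditional-independence property on this slide is what makes inference in an RBM cheap: given a visible vector, every hidden unit's activation probability p(h_j = 1 | v) = sigmoid(b_j + Σ_i v_i W_ij) can be computed (and sampled) in a single matrix product. A minimal sketch with made-up weights:

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 3
W = 0.1 * rng.standard_normal((n_visible, n_hidden))  # toy weight matrix
b_hidden = np.zeros(n_hidden)                          # hidden biases

v = np.array([1, 0, 1, 1, 0, 0])  # a binary data vector

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One pass, no iteration: hidden units are independent given v.
p_hidden = sigmoid(b_hidden + v @ W)
h_sample = (rng.random(n_hidden) < p_hidden).astype(int)
print(p_hidden, h_sample)
```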
32. Ranking
Key algorithm; sorts titles in most contexts
33. Ranking
§ Ranking = Scoring + Sorting + Filtering bags of movies for presentation to a user
§ Goal: find the best possible ordering of a set of videos for a user within a specific context in real-time
§ Objective: maximize consumption
§ Aspirations: played & “enjoyed” titles have best score
§ Akin to CTR forecast for ads/search results
§ Factors: accuracy, novelty, diversity, freshness, scalability, …
34. Ranking
§ Popularity is the obvious baseline
§ Ratings prediction is a clear secondary data input that allows for personalization
§ We have added many other features (and tried many more that have not proved useful)
§ What about the weights?
§ Based on A/B testing
§ Machine-learned
35. Example: Two features, linear model
Linear model: frank(u,v) = w1 p(v) + w2 r(u,v) + b
[Figure: titles 1-5 plotted with Popularity on the x-axis and Predicted Rating on the y-axis; the linear model combines the two features into the final ranking]
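The two-feature linear model on this slide can be sketched as follows; the weights and candidate titles are invented for illustration:

```python
# frank(u, v) = w1 * popularity(v) + w2 * predicted_rating(u, v) + b
w1, w2, b = 0.4, 0.6, 0.0   # toy weights (in practice A/B-tested or learned)

# (title, popularity score, predicted rating for this user) -- toy values
candidates = [("A", 0.9, 3.1), ("B", 0.2, 4.8), ("C", 0.5, 4.0)]

def frank(popularity, predicted_rating):
    return w1 * popularity + w2 * predicted_rating + b

# Final ranking: sort candidates by the combined score, best first.
ranked = sorted(candidates, key=lambda t: frank(t[1], t[2]), reverse=True)
print([title for title, _, _ in ranked])
```

A highly rated but unpopular title ("B" above) can outrank a popular one once the rating weight dominates, which is the point of blending the two features.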
40. Learning to rank
§ Machine learning problem: the goal is to construct a ranking model from training data
§ Training data can have partial order or binary judgments (relevant/not relevant)
§ Resulting order of the items typically induced from a numerical score
§ Learning to rank is a key element for personalization
§ You can treat the problem as a standard supervised classification problem
41. Learning to Rank Approaches
1. Pointwise
§ Ranking function minimizes a loss function defined on individual relevance judgments
§ Ranking score based on regression or classification
§ Ordinal regression, Logistic regression, SVM, GBDT, …
2. Pairwise
§ Loss function is defined on pair-wise preferences
§ Goal: minimize the number of inversions in the ranking
§ The ranking problem is then transformed into a binary classification problem
§ RankSVM, RankBoost, RankNet, FRank…
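The pairwise objective, minimizing the number of inversions, can be illustrated by counting the pairs whose score order disagrees with their relevance order. The scores and labels below are toy values:

```python
from itertools import combinations

# Each item is (model_score, true_relevance); values are invented.
items = [(2.5, 3), (1.0, 2), (3.0, 1)]

def count_inversions(items):
    inv = 0
    for (s1, r1), (s2, r2) in combinations(items, 2):
        # An inversion: the relevance order and the score order disagree.
        if (r1 - r2) * (s1 - s2) < 0:
            inv += 1
    return inv

print(count_inversions(items))
```

Pairwise methods such as RankSVM or RankNet train a classifier to get each such pair the right way around, which drives this count toward zero.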
42. Learning to rank - metrics
§ Quality of ranking measured using metrics such as:
§ Normalized Discounted Cumulative Gain: NDCG = DCG / IDCG, where DCG = relevance_1 + Σ_{i=2}^{n} relevance_i / log2 i
§ Mean Reciprocal Rank: MRR = (1/|H|) Σ_{h∈H} 1/rank(h)
§ Fraction of Concordant Pairs: FCP = Σ_{i≠j} CP(x_i, x_j) / (n(n-1)/2)
§ Others…
§ But, it is hard to optimize machine-learned models directly on these measures (they are not differentiable)
§ Recent research on models that directly optimize ranking measures
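The metric definitions above can be sketched directly, following the slide's DCG formula (which discounts from position 2 by log2 i); the relevance lists and ranks are toy inputs:

```python
import math

def dcg(relevances):
    # DCG = rel_1 + sum_{i>=2} rel_i / log2(i); list is in ranked order.
    return relevances[0] + sum(r / math.log2(i)
                               for i, r in enumerate(relevances[1:], start=2))

def ndcg(relevances):
    # Normalize by the DCG of the ideal (descending-relevance) ordering.
    ideal = sorted(relevances, reverse=True)
    return dcg(relevances) / dcg(ideal)

def mrr(first_relevant_ranks):
    # MRR = (1/|H|) * sum of 1 / rank of the first relevant item per query.
    return sum(1.0 / r for r in first_relevant_ranks) / len(first_relevant_ranks)

print(ndcg([3, 2, 3, 0, 1]), mrr([1, 3, 2]))
```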
43. Learning to Rank Approaches
3. Listwise
a. Indirect Loss Function
§ RankCosine: similarity between ranking list and ground truth as loss function
§ ListNet: KL-divergence as loss function by defining a probability distribution
§ Problem: optimization of listwise loss function may not optimize IR metrics
b. Directly optimizing IR measures (difficult since they are not differentiable)
§ Directly optimize IR measures through Genetic Programming
§ Directly optimize measures with Simulated Annealing
§ Gradient descent on a smoothed version of the objective function (e.g. CLiMF presented at RecSys 2012 or TFMAP at SIGIR 2012)
§ SVM-MAP relaxes the MAP metric by adding it to the SVM constraints
§ AdaRank uses boosting to optimize NDCG
44. Similars
§ Different similarities computed from different sources: metadata, ratings, viewing data…
§ Similarities can be treated as data/features
§ Machine Learned models improve our concept of “similarity”
45. Data & Models - Recap
§ All sorts of feedback from the user can help generate better recommendations
§ Need to design systems that capture and take advantage of all this data
§ The right model is as important as the right data
§ It is important to come up with new theoretical models, but also need to think about application to a domain, and practical issues
§ Rating prediction models are only part of the solution to recommendation (think about ranking, similarity…)
46. More data or better models?
Really?
Anand Rajaraman: Stanford & Senior VP at Walmart Global eCommerce (former Kosmix)
47. More data or better models?
Sometimes, it’s not about more data
48. More data or better models?
[Banko and Brill, 2001]
Norvig: “Google does not have better Algorithms, only more Data”
Many features / low-bias models
49. More data or better models?
[Plot: model performance vs. sample size for an actual Netflix system; x-axis 0 to 6M training examples, y-axis 0 to 0.09]
Sometimes, it’s not about more data
50. More data or better models?
Data without a sound approach = noise
52. Consumer Science
§ Main goal is to effectively innovate for customers
§ Innovation goals
§ “If you want to increase your success rate, double your failure rate.” - Thomas Watson, Sr., founder of IBM
§ The only real failure is the failure to innovate
§ Fail cheaply
§ Know why you failed/succeeded
53. Consumer (Data) Science
1. Start with a hypothesis:
§ Algorithm/feature/design X will increase member engagement with our service, and ultimately member retention
2. Design a test
§ Develop a solution or prototype
§ Think about dependent & independent variables, control, significance…
3. Execute the test
4. Let data speak for itself
54. Offline/Online testing process
[Diagram: Offline testing (days) → on success → Online A/B testing (weeks to months) → on success → Rollout feature to all users; on failure, return to the start]
55. Offline testing
§ Optimize algorithms offline
§ Measure model performance, using metrics such as:
§ Mean Reciprocal Rank, Normalized Discounted Cumulative Gain, Fraction of Concordant Pairs, Precision/Recall & F-measures, AUC, RMSE, Diversity…
§ Offline performance used as an indication to make informed decisions on follow-up A/B tests
§ A critical (and unsolved) issue is how well offline metrics correlate with A/B test results
§ Extremely important to define a coherent offline evaluation framework (e.g. how to create training/testing datasets is not trivial)
56. Executing A/B tests
§ Many different metrics, but ultimately we trust user engagement (e.g. hours of play and customer retention)
§ Think about significance and hypothesis testing
§ Our tests usually have thousands of members and 2-20 cells
§ A/B tests allow you to try radical ideas or test many approaches at the same time
§ We typically have hundreds of customer A/B tests running
§ Decisions on the product are always data-driven
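Significance testing for an A/B cell can be sketched with a two-proportion z-test on, say, retention rates in control vs. treatment; the counts below are invented for illustration:

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    # z-test for the difference between two proportions (pooled variance).
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Toy example: 84.0% retention in control vs. 87.0% in the treatment cell.
z = two_proportion_z(success_a=840, n_a=1000, success_b=870, n_b=1000)
print(f"z = {z:.2f}")  # |z| > 1.96 would be significant at the 5% level
```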
57. What to measure
§ OEC: Overall Evaluation Criterion
§ In an A/B test framework, the measure of success is key
§ Short-term metrics do not always align with long-term goals
§ E.g. CTR: generating more clicks might mean that our recommendations are actually worse
§ Use long-term metrics such as LTV (lifetime value) whenever possible
§ At Netflix, we use member retention
58. What to measure
§ Short-term metrics can sometimes be informative, and may allow for faster decision-making
§ At Netflix we use many, such as hours streamed by users or % hours from a given algorithm
§ But be aware of several caveats of using early decision mechanisms
[Chart annotation: initial effects appear to trend]
See “Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained” [Kohavi et al., KDD 2012]
59. Consumer Data Science - Recap
§ Consumer Data Science aims to innovate for the customer by running experiments and letting data speak
§ This is mainly done through online A/B testing
§ However, we can speed up innovation by experimenting offline
§ But, both for online and offline experimentation, it is important to choose the right metric and experimental framework
64. Event & Data Distribution
• UI devices should broadcast many different kinds of user events
• Clicks
• Presentations
• Browsing events
• …
• Events vs. data
• Some events only need to be propagated and trigger an action (low latency, low information per event)
• Others need to be processed and “turned into” data (higher latency, higher information quality)
• And… there are many in between
• Real-time event flow managed through an internal tool (Manhattan)
• Data flow mostly managed through Hadoop.
66. Offline Jobs
• Two kinds of offline jobs
• Model training
• Batch offline computation of recommendations/intermediate results
• Offline queries either in Hive or Pig
• Need a publishing mechanism that solves several issues
• Notify readers when the result of a query is ready
• Support different repositories (S3, Cassandra…)
• Handle errors, monitoring…
• We do this through Hermes
68. Computation
• Two ways of computing personalized results
• Batch/offline
• Online
• Each approach has pros/cons
• Offline
+ Allows more complex computations
+ Can use more data
- Cannot react to quick changes
- May result in staleness
• Online
+ Can respond quickly to events
+ Can use most recent data
- May fail because of SLA
- Cannot deal with “complex” computations
• It’s not an either/or decision
• Both approaches can be combined
70. Signals & Models
• Both offline and online algorithms are based on three different inputs:
• Models: previously trained from existing data
• (Offline) Data: previously processed and stored information
• Signals: fresh data obtained from live services
• User-related data
• Context data (session, date, time…)
72. Results
• Recommendations can be serviced from:
• Previously computed lists
• Online algorithms
• A combination of both
• The decision on where to service the recommendation from can respond to many factors, including context
• Also important to think about the fallbacks (what if plan A fails?)
• Previously computed lists/intermediate results can be stored in a variety of ways
• Cache
• Cassandra
• Relational DB
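The fallback idea above ("what if plan A fails") can be sketched as a chain of recommendation sources tried in order; every function here is a hypothetical stand-in, not Netflix's actual services:

```python
def online_recommendations(user_id):
    # Simulate the online path missing its SLA.
    raise TimeoutError("online service missed its SLA")

def precomputed_recommendations(user_id):
    # e.g. a previously computed list fetched from a cache or Cassandra.
    return ["title_42", "title_7", "title_19"]

def popularity_fallback():
    # Last resort: an unpersonalized popularity list.
    return ["top_1", "top_2", "top_3"]

def serve(user_id):
    for source in (online_recommendations, precomputed_recommendations):
        try:
            return source(user_id)
        except Exception:
            continue  # plan A failed; fall through to the next source
    return popularity_fallback()

print(serve(user_id=123))
```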
73. Alerts and Monitoring
§ A non-trivial concern in large-scale recommender systems
§ Monitoring: continuously observe the quality of the system
§ Alerting: fast notification if the quality of the system goes below a certain pre-defined threshold
§ Questions:
§ What do we need to monitor?
§ How do we know something is “bad enough” to alert on?
74. What to monitor
§ Staleness
§ Monitor time since last data update
[Chart annotation: “Did something go wrong here?”]
75. What to monitor
§ Algorithmic quality
§ Monitor different metrics by comparing what users do and what your algorithm predicted they would do
76. What to monitor
§ Algorithmic quality
§ Monitor different metrics by comparing what users do and what your algorithm predicted they would do
[Chart annotation: “Did something go wrong here?”]
77. What to monitor
§ Algorithmic source for users
§ Monitor how users interact with different algorithms
[Chart annotations: “Algorithm X”, “New version”, “Did something go wrong here?”]
78. When to alert
§ Alerting thresholds are hard to tune
§ Avoid unnecessary alerts (the “learn-to-ignore problem”)
§ Avoid important issues being noticed before the alert happens
§ Rules of thumb
§ Alert on anything that will impact user experience significantly
§ Alert on issues that are actionable
§ If a noticeable event happens without an alert… add a new alert for next time
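Combining the staleness monitor from slide 74 with a pre-defined alerting threshold might look like the sketch below; the 6-hour threshold is invented for illustration:

```python
import time

# Alert when the last data update is older than a pre-defined threshold.
STALENESS_THRESHOLD_SECONDS = 6 * 3600  # illustrative: alert if data is >6h old

def staleness_alert(last_update_ts, now=None):
    now = now if now is not None else time.time()
    age = now - last_update_ts
    return age > STALENESS_THRESHOLD_SECONDS

# Fixed "now" so the example is deterministic.
now = 1_000_000.0
print(staleness_alert(now - 7 * 3600, now=now))  # 7h old -> True (alert)
print(staleness_alert(now - 1 * 3600, now=now))  # 1h old -> False (fresh)
```

Tuning the threshold is the hard part the slide points at: too low and alerts get ignored, too high and users notice problems before the alert fires.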
80. The Personalization Problem
§ The Netflix Prize simplified the recommendation problem to predicting ratings
§ But…
§ User ratings are only one of the many data inputs we have
§ Rating predictions are only part of our solution
§ Other algorithms such as ranking or similarity are very important
§ We can reformulate the recommendation problem
§ Function to optimize: the probability that a user chooses something and enjoys it enough to come back to the service
81. More data +
Better models +
More accurate metrics +
Better approaches & architectures
Lots of room for improvement!