Building Industrial-scale Real-world Recommender Systems

September 11, 2012

Xavier Amatriain
Personalization Science and Engineering - Netflix
@xamat
Outline
1.  Anatomy of Netflix Personalization
2.  Data & Models
3.  Consumer (Data) Science
4.  Architectures
Anatomy of Netflix Personalization
Everything is a Recommendation

Everything is personalized
[Figure annotations: Ranking, Rows]
Note: Recommendations are per household, not individual user
4
Top 10
Personalization awareness
[Figure annotations: All, Dad, Dad&Mom, Daughter, All, All?, Daughter, Son, Mom, Mom]
Diversity
5
Support for Recommendations
[Figure annotation: Social Support]
6
Watch again & Continue Watching




                                  7
Genres




8
Genre rows
§  Personalized genre rows focus on user interest
   §  Also provide context and “evidence”
   §  Important for member satisfaction – moving personalized rows to the top on
       devices increased retention
§  How are they generated?
   §  Implicit: based on user’s recent plays, ratings, & other interactions
   §  Explicit taste preferences
   §  Hybrid: combine the above
   §  Also take into account:
      §  Freshness – has this been shown before?
      §  Diversity – avoid repeating tags and genres, limit number of TV genres, etc.
Genres - personalization




                           10
Genres - personalization




                           11
Genres – explanations
12

Genres – explanations
13
Genres – user involvement




                            14
Genres – user involvement




                            15
Similars
 §  Displayed in
     many different
     contexts
    §  In response to
        user actions/
        context (search,
        queue add…)
    §  More like… rows
Anatomy of Personalization - Recap
§  Everything is a recommendation: not only rating
    prediction, but also ranking, row selection, similarity…
§  We strive to make it easy for the user, but…
§  We want the user to be aware and be involved in the
    recommendation process
§  Deal with implicit/explicit and hybrid feedback
§  Add support/explanations for recommendations
§  Consider issues such as diversity or freshness
                                                               17
Data & Models

Big Data
§  Plays
§  Behavior
§  Geo-Information
§  Time
§  Ratings
§  Searches
§  Impressions
§  Device info
§  Metadata
§  Social
§  …

19
Big Data   §  25M+ subscribers
@Netflix   §  Ratings: 4M/day
           §  Searches: 3M/day
           §  Plays: 30M/day
           §  2B hours streamed in Q4
               2011
           §  1B hours in June 2012



                                         20
Models
§    Logistic/linear regression
§    Elastic nets
§    Matrix Factorization
§    Markov Chains
§    Clustering
§    LDA
§    Association Rules
§    Gradient Boosted Decision Trees
§  …

                                        21
Rating Prediction




                    22
2007 Progress Prize
§  KorBell team (AT&T) improved by 8.43%
§  Spent ~2,000 hours
§  Combined 107 prediction algorithms with a linear
    equation
§  Gave us the source code
2007 Progress Prize
§  Top 2 algorithms
   §  SVD - Prize RMSE: 0.8914
   §  RBM - Prize RMSE: 0.8990
§  Linear blend Prize RMSE: 0.88
§  Limitations
   §  Designed for 100M ratings, we have 5B ratings
   §  Not adaptable as users add ratings
   §  Performance issues
§  Currently in use as part of Netflix’s rating prediction component
SVD
X[m x n] = U[m x r] S[r x r] (V[n x r])^T

§  X: m x n matrix (e.g., m users, n videos)
§  U: m x r matrix (m users, r concepts)
§  S: r x r diagonal matrix (strength of each ‘concept’) (r: rank of the matrix)
§  V: n x r matrix (n videos, r concepts)
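As a quick illustration (not part of the original slides), here is a minimal NumPy sketch of this decomposition on a toy ratings matrix, keeping only the top r singular values; the matrix and variable names are made up for the example.

import numpy as np

# Toy ratings matrix X: m users x n videos (unrated entries are 0 here for simplicity)
X = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4],
              [0, 1, 5, 4]], dtype=float)

# Full SVD: X = U S V^T
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the top-r "concepts"
r = 2
X_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]   # rank-r approximation of X

print("rank-%d reconstruction error: %.3f" % (r, np.linalg.norm(X - X_r)))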
Simon Funk’s SVD
§  One of the most
    interesting findings
    during the Netflix
    Prize came out of a
    blog post
§  Incremental, iterative,
    and approximate way
    to compute the SVD
    using gradient
    descent
                              http://sifter.org/~simon/journal/20061211.html   26
SVD for Rating Prediction
§  Associate each user with a user-factors vector pu ∈ ℜf
§  Associate each item with an item-factors vector qv ∈ ℜf
§  Define a baseline estimate buv = µ + bu + bv to account for
    user and item deviation from the average
§  Predict rating using the rule

    $r'_{uv} = b_{uv} + p_u^T q_v$

27
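A minimal sketch (Python/NumPy, illustrative only, not Netflix's code) of fitting this biased factor model with stochastic gradient descent, in the spirit of the incremental approach from the previous slide; the function name and hyperparameter values are assumptions.

import numpy as np

def train_mf_sgd(ratings, n_users, n_items, n_factors=20,
                 lr=0.005, reg=0.02, n_epochs=20):
    """ratings: list of (user, item, rating) triples with 0-based ids.
    Learns mu, user/item biases and factors for r'_uv = b_uv + p_u^T q_v."""
    ratings = list(ratings)
    mu = np.mean([r for _, _, r in ratings])              # global average
    bu, bi = np.zeros(n_users), np.zeros(n_items)         # user / item biases
    P = np.random.normal(0, 0.1, (n_users, n_factors))    # user factors p_u
    Q = np.random.normal(0, 0.1, (n_items, n_factors))    # item factors q_v

    for _ in range(n_epochs):
        np.random.shuffle(ratings)
        for u, v, r in ratings:
            err = r - (mu + bu[u] + bi[v] + P[u].dot(Q[v]))
            bu[u] += lr * (err - reg * bu[u])
            bi[v] += lr * (err - reg * bi[v])
            pu, qv = P[u].copy(), Q[v].copy()
            P[u] += lr * (err * qv - reg * pu)
            Q[v] += lr * (err * pu - reg * qv)
    return mu, bu, bi, P, Q

def predict(mu, bu, bi, P, Q, u, v):
    return mu + bu[u] + bi[v] + P[u].dot(Q[v])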
SVD++
§  Koren et al. proposed an asymmetric variation that includes
    implicit feedback:

    $r'_{uv} = b_{uv} + q_v^T \left( |R(u)|^{-\frac{1}{2}} \sum_{j \in R(u)} (r_{uj} - b_{uj})\, x_j + |N(u)|^{-\frac{1}{2}} \sum_{j \in N(u)} y_j \right)$

§  Where
   §  qv, xv, yv ∈ ℜf are three item factor vectors
   §  Users are not parametrized, but rather represented by:
      §  R(u): items rated by user u
      §  N(u): items for which the user has given an implicit preference (e.g. rated
          vs. not rated)

28
RBM

First generation neural networks (~60s)
§  Perceptrons (~1960)
   §  Single layer of hand-coded features
   §  Linear activation function
   §  Fundamentally limited in what they can learn to do.
[Figure: input units (features) → non-adaptive hand-coded features → output units
(class labels: Like / Hate)]
Second generation neural networks (~80s)
[Figure: input features → hidden layers (non-linear activation function) → outputs;
compare output to correct answer to compute error signal; back-propagate error
signal to get derivatives for learning]
Belief Networks (~90s)
§  Directed acyclic graph composed of stochastic
    variables with weighted connections.
§  Can observe some of the variables
§  Solve two problems:
   §  Inference: Infer the states of the unobserved variables.
   §  Learning: Adjust the interactions between variables
       to make the network more likely to generate the observed data.
[Figure: stochastic hidden causes at the top, visible effects at the bottom]
Restricted Boltzmann Machine
§  Restrict the connectivity to make learning easier.
   §  Only one layer of hidden units.
      §  Although multiple layers are possible
   §  No connections between hidden units.
§  Hidden units are independent given the visible states.
   §  So we can quickly get an unbiased sample from the posterior distribution over
       hidden “causes” when given a data-vector
§  RBMs can be stacked to form Deep Belief Nets (DBN)
[Figure: bipartite graph of hidden units j and visible units i]
RBM for the Netflix Prize




                            34
What about the final prize ensembles?
§  Our offline studies showed they were too
    computationally intensive to scale
§  Expected improvement not worth the
    engineering effort
§  Plus, focus had already shifted to other
    issues that had more impact than rating
    prediction...

                                               35
Ranking
Key algorithm, sorts titles in most contexts
Ranking
§  Ranking = Scoring + Sorting + Filtering bags of movies for presentation to a user
§  Goal: Find the best possible ordering of a set of videos for a user within a specific
    context in real-time
§  Objective: maximize consumption
§  Aspirations: Played & “enjoyed” titles have best score
§  Akin to CTR forecast for ads/search results
§  Factors
   §  Accuracy
   §  Novelty
   §  Diversity
   §  Freshness
   §  Scalability
   §  …
Ranking
§  Popularity is the obvious baseline
§  Ratings prediction is a clear secondary data
    input that allows for personalization
§  We have added many other features (and tried
    many more that have not proved useful)
§  What about the weights?
  §  Based on A/B testing
  §  Machine-learned
Example: Two features, linear model
[Figure: candidate videos plotted by Predicted Rating (y-axis) vs. Popularity (x-axis);
points labeled 1-5 indicate the final ranking]

Linear Model:  frank(u,v) = w1 p(v) + w2 r(u,v) + b

39
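To make this concrete, here is a toy Python sketch of scoring and sorting a candidate set with this two-feature linear model; the weight values, helper names, and candidate data are placeholders (in practice the weights are A/B-tested or machine-learned, as noted on the previous slide).

# Hypothetical two-feature linear ranker: popularity p(v) and predicted rating r(u, v)
W1, W2, B = 0.4, 0.6, 0.0   # placeholder weights (learned or A/B-tested in practice)

def frank(popularity, predicted_rating, w1=W1, w2=W2, b=B):
    """Ranking score for one (user, video) pair."""
    return w1 * popularity + w2 * predicted_rating + b

def rank_videos(candidates):
    """candidates: list of (video_id, popularity, predicted_rating).
    Returns video ids sorted by descending ranking score."""
    scored = [(frank(p, r), vid) for vid, p, r in candidates]
    return [vid for _, vid in sorted(scored, reverse=True)]

# Example: popularity normalized to [0, 1], predicted rating rescaled from a 1-5 scale
print(rank_videos([("A", 0.9, 3.2 / 5), ("B", 0.4, 4.8 / 5), ("C", 0.7, 4.0 / 5)]))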
Results




          40
Learning to rank
§  Machine learning problem: goal is to construct ranking
    model from training data
§  Training data can have partial order or binary judgments
    (relevant/not relevant).
§  Resulting order of the items typically induced from a
    numerical score
§  Learning to rank is a key element for personalization
§  You can treat the problem as a standard supervised
    classification problem

                                                               41
Learning to Rank Approaches
1.  Pointwise
   §    Ranking function minimizes loss function defined on individual
         relevance judgment
   §    Ranking score based on regression or classification
   §    Ordinal regression, Logistic regression, SVM, GBDT, …
2.  Pairwise
   §    Loss function is defined on pair-wise preferences
   §    Goal: minimize number of inversions in ranking
   §    Ranking problem is then transformed into the binary classification
         problem
   §    RankSVM, RankBoost, RankNet, FRank…
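As an illustration of the pairwise idea (reducing ranking to binary classification over preference pairs), here is a small hedged Python sketch; the feature construction, the toy data, and the use of scikit-learn's LogisticRegression are assumptions for the example, not a specific system described in the slides.

import numpy as np
from sklearn.linear_model import LogisticRegression

def to_pairwise(X, y):
    """Turn pointwise examples (features X, relevance y) from one ranking list
    into pairwise examples: feature differences labeled by which item wins."""
    Xp, yp = [], []
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] > y[j]:                 # item i preferred over item j
                Xp.append(X[i] - X[j]); yp.append(1)
                Xp.append(X[j] - X[i]); yp.append(0)
    return np.array(Xp), np.array(yp)

# Toy list: 4 items, 2 features each, with graded relevance judgments
X = np.array([[0.9, 0.2], [0.3, 0.8], [0.5, 0.5], [0.1, 0.1]])
y = np.array([3, 2, 2, 0])

Xp, yp = to_pairwise(X, y)
clf = LogisticRegression().fit(Xp, yp)      # RankSVM-style idea with a logistic loss

# Rank items by the learned linear score w^T x
scores = X @ clf.coef_.ravel()
print(np.argsort(-scores))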
Learning to rank - metrics
§  Quality of ranking measured using metrics such as:
   §  Normalized Discounted Cumulative Gain (NDCG)

      $NDCG = \frac{DCG}{IDCG}$  where  $DCG = relevance_1 + \sum_{i=2}^{n} \frac{relevance_i}{\log_2 i}$  and IDCG is the DCG of the ideal ranking

   §  Mean Reciprocal Rank (MRR)

      $MRR = \frac{1}{|H|} \sum_{h \in H} \frac{1}{rank(h)}$  where the h are the positive “hits” from the user

   §  Mean Average Precision (MAP)

      $MAP = \frac{\sum_{n=1}^{N} AveP(n)}{N}$  where N can be the number of users, items… and $P = \frac{tp}{tp + fp}$

43
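For reference, a small Python sketch of two of these metrics exactly as defined above (DCG in the relevance_1 + sum of rel_i/log2(i) form, and MRR over the user's hits); the function names and example inputs are illustrative.

import math

def dcg(relevances):
    """DCG = rel_1 + sum_{i>=2} rel_i / log2(i), positions are 1-based."""
    return sum(rel if i == 1 else rel / math.log2(i)
               for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """NDCG = DCG of the given ranking / DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

def mrr(ranks_of_hits):
    """MRR = (1/|H|) * sum over hits h of 1/rank(h), ranks are 1-based."""
    return sum(1.0 / r for r in ranks_of_hits) / len(ranks_of_hits)

# Example: relevance of items in the order the ranker showed them
print(ndcg([3, 2, 0, 1]))        # vs. the ideal order [3, 2, 1, 0]
print(mrr([1, 3, 10]))           # positions at which the user's "hits" appeared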
Learning to rank - metrics
§  Quality of ranking measured using metrics such as:
   §  Fraction of Concordant Pairs (FCP)
      §  Given items xi and xj, a user preference P and a ranking method R, a concordant
          pair (CP) is $\{x_i, x_j\}$ such that $P(x_i) > P(x_j) \Leftrightarrow R(x_i) < R(x_j)$
      §  Then  $FCP = \frac{\sum_{i \neq j} CP(x_i, x_j)}{n(n-1)/2}$
   §  Others…
§  But, it is hard to optimize machine-learned models directly on
    these measures
   §  They are not differentiable
§  Recent research on models that directly optimize ranking
    measures

44
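A minimal sketch of FCP as defined above, comparing a user-preference score P with the positions R produced by a ranking method; the inputs are made up, and tied preferences are simply skipped, which is one of several possible conventions.

from itertools import combinations

def fcp(preference, rank_position):
    """Fraction of Concordant Pairs.
    preference[i]: user's true preference score for item i (higher = better)
    rank_position[i]: position the ranker gave item i (lower = ranked higher)
    A pair (i, j) is concordant when P(i) > P(j) <=> R(i) < R(j)."""
    n = len(preference)
    concordant = sum(
        1 for i, j in combinations(range(n), 2)
        if preference[i] != preference[j]
        and (preference[i] > preference[j]) == (rank_position[i] < rank_position[j])
    )
    return concordant / (n * (n - 1) / 2)

# Example: 4 items, true preferences vs. the positions the ranker assigned
print(fcp(preference=[4.5, 3.0, 2.0, 5.0], rank_position=[2, 3, 4, 1]))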
Learning to Rank Approaches
3.  Listwise
   a.  Directly optimizing IR measures (difficult since they are not differentiable)
      §  Directly optimize IR measures through Genetic Programming
      §  Directly optimize measures with Simulated Annealing
      §  Gradient descent on a smoothed version of the objective function
      §  SVM-MAP relaxes the MAP metric by adding it to the SVM constraints
      §  AdaRank uses boosting to optimize NDCG
   b.  Indirect Loss Function
      §  RankCosine uses similarity between the ranking list and the ground truth as
          loss function
      §  ListNet uses KL-divergence as loss function by defining a probability
          distribution
      §  Problem: optimization of the listwise loss function does not necessarily
          optimize IR metrics
Similars

           §  Different similarities computed
               from different sources: metadata,
               ratings, viewing data…
           §  Similarities can be treated as
               data/features
           §  Machine Learned models
               improve our concept of “similarity”




                                                46
Data & Models - Recap
§  All sorts of feedback from the user can help generate better
    recommendations
§  Need to design systems that capture and take advantage of
    all this data
§  The right model is as important as the right data
§  It is important to come up with new theoretical models, but
    also need to think about application to a domain, and practical
    issues
§  Rating prediction models are only part of the solution to
    recommendation (think about ranking, similarity…)

                                                                      47
Consumer
(Data) Science
Consumer Science
§  Main goal is to effectively innovate for customers
§  Innovation goals
  §  “If you want to increase your success rate, double
      your failure rate.” – Thomas Watson, Sr., founder of
      IBM
  §  The only real failure is the failure to innovate
  §  Fail cheaply
  §  Know why you failed/succeeded

                                                             49
Consumer (Data) Science
1.  Start with a hypothesis:
   §  Algorithm/feature/design X will increase member engagement
       with our service, and ultimately member retention
2.  Design a test
   §  Develop a solution or prototype
   §  Think about dependent & independent variables, control,
       significance…
3.  Execute the test
4.  Let data speak for itself

                                                                    50
Offline/Online testing process
[Flow: Offline testing (days) → [success] → Online A/B testing (weeks to months) →
[success] → Rollout Feature to all users; [fail] → back to offline testing]

51
Offline testing process
[Flow:
Initial Hypothesis → Decide Model → Train Model offline → Test offline →
Hypothesis validated offline?
   [yes] → Online A/B testing: Rollout Prototype → Wait for Results → Analyze Results →
           Significant improvement on users? → [yes] → Rollout Feature to all users
   [no] → Try different model? → [yes] → Decide Model; [no] → Reformulate Hypothesis
   [fail] (from the A/B test) → Reformulate Hypothesis]

52
Offline testing
§  Optimize algorithms offline
§  Measure model performance, using metrics such as:
   §  Mean Reciprocal Rank, Normalized Discounted Cumulative Gain, Fraction of
       Concordant Pairs, Precision/Recall & F-measures, AUC, RMSE, Diversity…

§  Offline performance used as an indication to make informed
    decisions on follow-up A/B tests
§  A critical (and unsolved) issue is how offline metrics can
    correlate with A/B test results.
§  Extremely important to define a coherent offline evaluation
    framework (e.g. How to create training/testing datasets is not
    trivial)

                                                                                  53
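As one illustration of why dataset construction matters, here is a hedged Python sketch of a time-based split (train on interactions before a cutoff, evaluate on what each user did afterwards, so the model never sees the future); the data layout is hypothetical and other splitting schemes are of course possible.

from collections import defaultdict

def temporal_split(events, cutoff_ts):
    """events: iterable of (user_id, item_id, timestamp) interactions.
    Train on everything before the cutoff; the test set holds, per user,
    the items they interacted with afterwards."""
    train, test = [], defaultdict(set)
    for user, item, ts in events:
        if ts < cutoff_ts:
            train.append((user, item, ts))
        else:
            test[user].add(item)
    # Only evaluate users who have both training history and test interactions
    seen_users = {u for u, _, _ in train}
    test = {u: items for u, items in test.items() if u in seen_users}
    return train, test

events = [("u1", "a", 10), ("u1", "b", 30), ("u2", "a", 5), ("u2", "c", 40), ("u3", "b", 50)]
train, test = temporal_split(events, cutoff_ts=25)
print(len(train), test)   # u3 has no history before the cutoff, so it is dropped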
Online A/B testing process
[Flow (continuing from the offline loop):
Hypothesis validated offline? → [success] → Design A/B Test → Choose Control Group →
Rollout Prototype → Wait for Results → Analyze Results →
Significant improvement on users?
   [yes] → Rollout Feature to all users
   [no] → Reformulate Hypothesis
Offline loop as before: Decide Model → Train Model offline → Offline testing →
Hypothesis validated offline? [no] → Try different model? / Reformulate Hypothesis]

54
Executing A/B tests
§  Many different metrics, but ultimately trust user
    engagement (e.g. hours of play and customer retention)
§  Think about significance and hypothesis testing
   §  Our tests usually have thousands of members and 2-20 cells
§  A/B Tests allow you to try radical ideas or test many
    approaches at the same time.
   §  We typically have hundreds of customer A/B tests running
§  Decisions on the product always data-driven

                                                                    55
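For illustration, a hedged sketch of the kind of significance check one might run on a retention-style metric from two test cells (a simple two-proportion z-test); the numbers are made up and this is not a description of Netflix's internal tooling. With many cells, corrections for multiple comparisons would also matter.

import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """z-test for the difference between two retention (conversion) rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Made-up example: control cell vs. one test cell, members retained after a month
z, p = two_proportion_z_test(success_a=8400, n_a=10000, success_b=8550, n_b=10000)
print("z = %.2f, p = %.4f" % (z, p))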
What to measure
§  OEC: Overall Evaluation Criteria
§  In an AB test framework, the measure of success is key
§  Short-term metrics do not always align with long term
    goals
   §  E.g. CTR: generating more clicks might mean that our
       recommendations are actually worse
§  Use long term metrics such as LTV (Life time value)
    whenever possible
   §  In Netflix, we use member retention
                                                              56
What to measure
§  Short-term metrics can sometimes be informative, and
    may allow for faster decision-making
   §  At Netflix we use many, such as hours streamed by users or
       % hours from a given algorithm
§  But, be aware of several caveats of using early decision
    mechanisms
   [Figure note: initial effects appear to trend. See “Trustworthy Online Controlled
   Experiments: Five Puzzling Outcomes Explained” (Kohavi et al., KDD 2012)]

57
Consumer Data Science - Recap
§  Consumer Data Science aims to innovate for the
    customer by running experiments and letting data speak
§  This is mainly done through online AB Testing
§  However, we can speed up innovation by experimenting
    offline
§  But, both for online and offline experimentation, it is
    important to choose the right metric and experimental
    framework

                                                              58
Architectures



                59
Technology




http://techblog.netflix.com
60
61
Event & Data
Distribution




               62
Event & Data Distribution
•  UI devices should broadcast many
   different kinds of user events
    •    Clicks
    •    Presentations
    •    Browsing events
    •    …
•  Events vs. data
    •  Some events only need to be
       propagated and trigger an action
       (low latency, low information per
       event)
    •  Others need to be processed and
       “turned into” data (higher latency,
       higher information quality).
    •  And… there are many in between
•  Real-time event flow managed
   through internal tool (Manhattan)
•  Data flow mostly managed through
   Hadoop.

                                             63
Offline Jobs




               64
Offline Jobs
•  Two kinds of offline jobs
     •  Model training
     •  Batch offline computation of
        recommendations/
        intermediate results
•  Offline queries either in Hive or
   PIG
•  Need a publishing mechanism
   that solves several issues
     •  Notify readers when result of
        query is ready
     •  Support different repositories
        (s3, cassandra…)
     •  Handle errors, monitoring…
     •  We do this through Hermes
                                         65
Computation




              66
Computation
•  Two ways of computing personalized
   results
    •  Batch/offline
    •  Online
•  Each approach has pros/cons
    •  Offline
         +    Allows more complex computations
         +    Can use more data
         -    Cannot react to quick changes
         -    May result in staleness
    •  Online
         +    Can respond quickly to events
         +    Can use most recent data
         -    May fail because of SLA
         -    Cannot deal with “complex”
              computations
•  It’s not an either/or decision
    •  Both approaches can be combined

                                                 67
Signals & Models




                   68
Signals & Models

•  Both offline and online algorithms are
   based on three different inputs:
    •  Models: previously trained from
       existing data
    •  (Offline) Data: previously
       processed and stored information
    •  Signals: fresh data obtained from
       live services
        •  User-related data
        •  Context data (session, date,
           time…)



                                          69
Results




          70
Results
•  Recommendations can be serviced
   from:
    •  Previously computed lists
    •  Online algorithms
    •  A combination of both
•  The decision on where to service the
   recommendation from can respond to
   many factors including context.
•  Also, important to think about the
   fallbacks (what if plan A fails)
•  Previously computed lists/intermediate
   results can be stored in a variety of
   ways
     •  Cache
     •  Cassandra
     •  Relational DB
                                            71
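A toy sketch of the "plan A with fallbacks" idea for serving a row: try the online algorithm within an SLA, fall back to a previously computed list, and finally to an unpersonalized default. All names here are hypothetical; the slides do not describe a specific API.

import concurrent.futures

# A shared pool for online ranking calls (illustrative)
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=8)

def serve_row(user_id, online_ranker, precomputed_store, default_row, sla_seconds=0.1):
    """Return a list of video ids, degrading gracefully if plan A fails."""
    # Plan A: online computation, bounded by an SLA
    try:
        return _pool.submit(online_ranker, user_id).result(timeout=sla_seconds)
    except Exception:
        pass  # timeout or online failure: fall through

    # Plan B: previously computed list (e.g. from a cache / Cassandra)
    cached = precomputed_store.get(user_id)
    if cached:
        return cached

    # Plan C: unpersonalized default (e.g. popularity)
    return default_row

Note that the timed-out online call is simply abandoned rather than cancelled; a real service would also need to account for that wasted work.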
Alerts and Monitoring
§  A non-trivial concern in large-scale recommender
    systems
§  Monitoring: continuously observe quality of system
§  Alert: fast notification if quality of system goes below a
    certain pre-defined threshold
§  Questions:
   §  What do we need to monitor?
   §  How do we know something is “bad enough” to alert


                                                                 72
What to monitor
§  Staleness
   §  Monitor time since last data update
[Figure: data-update metric over time, annotated “Did something go wrong here?”]

73
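A toy sketch of the staleness check described above: track when each data source was last updated and flag any source older than its threshold. The source names and thresholds are made up for illustration.

import time

# Maximum tolerated age per data source, in seconds (illustrative thresholds)
STALENESS_THRESHOLDS = {
    "ratings_snapshot": 24 * 3600,
    "popularity_ranking": 6 * 3600,
    "similars_lists": 48 * 3600,
}

def check_staleness(last_update_ts, now=None):
    """last_update_ts: dict of source -> unix timestamp of its last successful update.
    Returns the list of (source, age) pairs that should trigger an alert."""
    now = now or time.time()
    alerts = []
    for source, max_age in STALENESS_THRESHOLDS.items():
        age = now - last_update_ts.get(source, 0)   # missing source counts as stale
        if age > max_age:
            alerts.append((source, age))
    return alerts

for source, age in check_staleness({"ratings_snapshot": time.time() - 30 * 3600,
                                    "popularity_ranking": time.time() - 3600,
                                    "similars_lists": time.time() - 10 * 3600}):
    print("ALERT: %s has not been updated for %.1f hours" % (source, age / 3600))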
What to monitor
§  Algorithmic quality
   §  Monitor different metrics by comparing what users do and what
       your algorithm predicted they would do




                                                                       74
What to monitor
§  Algorithmic quality
   §  Monitor different metrics by comparing what users do and what
       your algorithm predicted they would do
[Figure: same metric plot as before, annotated “Did something go wrong here?”]

75
What to monitor
§  Algorithmic source for users
   §  Monitor how users interact with different algorithms
[Figure: usage of Algorithm X vs. a new version over time, annotated “Did something
go wrong here?”]

76
When to alert
§  Alerting thresholds are hard to tune
   §  Avoid unnecessary alerts (the “learn-to-ignore problem”)
   §  Avoid important issues being noticed before the alert happens
§  Rules of thumb
   §  Alert on anything that will impact user experience significantly
   §  Alert on issues that are actionable
   §  If a noticeable event happens without an alert… add a new alert
       for next time



                                                                          77
Conclusions

              78
The Personalization Problem
§  The Netflix Prize simplified the recommendation problem
    to predicting ratings
§  But…
  §  User ratings are only one of the many data inputs we have
  §  Rating predictions are only part of our solution
     §  Other algorithms such as ranking or similarity are very important

§  We can reformulate the recommendation problem
  §  Function to optimize: probability a user chooses something and
      enjoys it enough to come back to the service

                                                                             79
More to Recsys than Algorithms
§  Not only is there more to algorithms than rating
    prediction
§  There is more to Recsys than algorithms
   §  User Interface & Feedback
   §  Data
   §  AB Testing
   §  Systems & Architectures




                                                       80
More data +
         Better models +
     More accurate metrics +
Better approaches & architectures
  Lots of room for improvement!
                                    81
We’re hiring!




Xavier Amatriain (@xamat)
 xamatriain@netflix.com
