Machine Learning &
Recommender Systems

@ Netflix Scale

November, 2013
Xavier Amatriain
Director - Algorithms Engineering @ Netflix

@xamat
SVD

What we were interested in:
■ High quality recommendations

Proxy question:
■ Accuracy in predicted rating
■ Improve by 10% = $1million!

Results
●

Top 2 algorithms still in
production

RBM
From the Netflix
Prize to today

2006

2013
Everything is

Personalized
Everything is personalized
Ranking

Over 75% of what
people watch
comes from a
recommendation
Top 10
Personalization awareness

Diversity
Support for Recommendations

Social Support
Genre Rows
Similars
EVERYTHING is a Recommendation
Data
&
Models
Big Data @Netflix

■ > 40M subscribers
■ Ratings: ~5M/day
■ Searches: >3M/day
Time
■ Plays:
Geo-information > 50M/day
■ Streamed hours:
○ 5B hours in Q3 2013
Impressions

Device Info

Metadata
Social

Ratings
Member Behavior

Demographics
Smart Models

■ Regression models (Logistic,
Linear, Elastic nets)
■ SVD & other MF models
■ Factorization Machines
■ Restricted Boltzmann Machines
■ Markov Chains & other graph
models
■ Clustering (from k-means to
HDP)
■ Deep ANN
■ LDA
■ Association Rules
■ GBDT/RF
■ …
SVD for Rating Prediction
■ User factor vectors
■ Baseline (bias)

and item-factors vectors
(user & item deviation

from average)
■ Predict rating as
■ SVD++ (Koren et. Al) asymmetric variation w.
implicit feedback

■ Where
■
are three item factor vectors
■ Users are not parametrized, but rather represented by:
■ R(u): items rated by user u & N(u): items for which the user
has given implicit preference (e.g. rated/not rated)
Restricted Boltzmann Machines
■ Restrict the connectivity in ANN to
make learning easier.

■

Only one layer of hidden units.

■

■

Although multiple layers are possible

No connections between hidden
units.

■ Hidden units are independent given
the visible states..
■ RBMs can be stacked to form Deep
Belief Networks (DBN) – 4th generation
of ANNs
Ranking
■ Ranking = Scoring + Sorting + Filtering
bags of movies for presentation to a user
■ Key algorithm, sorts titles in most contexts
■ Goal: Find the best possible ordering of a
set of videos for a user within a specific
context in real-time
■ Objective: maximize consumption &
“enjoyment”

■ Factors
■
■
■
■
■
■

Accuracy
Novelty
Diversity
Freshness
Scalability
…
Example: Two features, linear model

2
3
4

5

Popularity

Linear Model:
frank(u,v) = w1 p(v) + w2 r(u,v) + b

Final Ranking

Predicted Rating

1
Example: Two features, linear model

2
3
4

5

Popularity

Final Ranking

Predicted Rating

1
Ranking
More data or
better
models?
More data or better models?

Really?

Anand Rajaraman: Former Stanford Prof. &
Senior VP at Walmart
More data or better models?
Sometimes, it’s not
about more data
More data or better models?
[Banko and Brill, 2001]

Norvig: “Google does not
have better Algorithms,
only more Data”

Many features/
low-bias models
More data or better models?

Sometimes, it’s not
about more data
“Data without a sound approach = noise”
Smart
Architectures
Technology Stack

http://techblog.netflix.com
Cloud Computing at Netflix
▪ Layered services
▪ 100s of services and applications

▪ Clusters: Horizontal scaling
▪ 10,000s of EC2 instances

▪ Auto-scale with demand
▪ Plan for failure
▪ Replication
▪ Fail fast
▪ State is bad

▪ Simian Army: Induce failures to ensure
resiliency
System Overview
▪ Blueprint for multiple
personalization algorithm
services
▪ Ranking
▪ Row selection
▪ Ratings
▪ …

▪ Recommendation involving
multi-layered Machine
Learning
Event & Data Distribution
▪ Collect actions
▪ Plays, browsing, searches, ratings, etc.

▪ Events
▪ Small units
▪ Time sensitive

▪ Data
▪ Dense information
▪ Processed for further use
▪ Saved
Online Computation
▪ Synchronous computation in
response to a member request
▪ Pros:

▪ Good for:
▪ Simple algorithms
▪ Model application

▪ Access to most fresh data

▪ Business logic

▪ Knowledge of full request context

▪ Context-dependence

▪ Compute only what is necessary

▪ Interactivity

▪ Cons:
▪ Strict Service Level Agreements
▪ Must respond quickly … in all
cases
▪ Requires high availability

▪ Limited view of data

www.netflix.com
Offline Computation
▪ Asynchronous computation done
on a regular schedule
▪ Pros:

▪ Good for:
▪ Batch learning
▪ Model training

▪ Can handle large data

▪ Complex algorithms

▪ Can do bulk processing

▪ Precomputing

▪ Relaxed time constraints

▪ Cons:
▪ Cannot react quickly
▪ Results can become stale
Nearline Computation
▪ Asynchronous computation in
response to a member event
▪ Pros:

▪ Good for:
▪ Incremental learning
▪ User-oriented algorithms

▪ Can keep data fresh

▪ Moderate complexity algorithms

▪ Can run moderate complexity
algorithms

▪ Keeping precomputed results
fresh

▪ Can average computational cost
across users
▪ Change from actions

▪ Cons:
▪ Has some delay
▪ Done in event context
Where to place components?
▪ Example: Matrix Factorization
▪ Offline:
▪ Collect sample of play data
▪ Run batch learning algorithm to produce
factorization
▪ Publish item factors

▪ Nearline:

X
X≈UVt
Aui=b

sij=uivj

▪ Solve user factors

sij

▪ Compute user-item products
▪ Combine

▪ Online:
▪ Presentation-context filtering
▪ Serve recommendations

sij>t

V
Recommendation Results
▪ Precomputed results
▪ Fetch from data store
▪ Post-process in context

▪ Generated on the fly
▪ Collect signals, apply model

▪ Combination
▪ Dynamically choose
▪ Fallbacks
Conclusion
More data +
Smarter models +
More accurate metrics +
Better system architectures
Lots of room for improvement!
Xavier Amatriain (@xamat)
xavier@netflix.com

Thanks!

We’re hiring!

Qcon SF 2013 - Machine Learning & Recommender Systems @ Netflix Scale

  • 1.
    Machine Learning & RecommenderSystems @ Netflix Scale November, 2013 Xavier Amatriain Director - Algorithms Engineering @ Netflix @xamat
  • 2.
    SVD What we wereinterested in: ■ High quality recommendations Proxy question: ■ Accuracy in predicted rating ■ Improve by 10% = $1million! Results ● Top 2 algorithms still in production RBM
  • 3.
    From the Netflix Prizeto today 2006 2013
  • 4.
  • 5.
    Everything is personalized Ranking Over75% of what people watch comes from a recommendation
  • 6.
  • 7.
  • 8.
  • 9.
  • 10.
    EVERYTHING is aRecommendation
  • 11.
  • 12.
    Big Data @Netflix ■> 40M subscribers ■ Ratings: ~5M/day ■ Searches: >3M/day Time ■ Plays: Geo-information > 50M/day ■ Streamed hours: ○ 5B hours in Q3 2013 Impressions Device Info Metadata Social Ratings Member Behavior Demographics
  • 13.
    Smart Models ■ Regressionmodels (Logistic, Linear, Elastic nets) ■ SVD & other MF models ■ Factorization Machines ■ Restricted Boltzmann Machines ■ Markov Chains & other graph models ■ Clustering (from k-means to HDP) ■ Deep ANN ■ LDA ■ Association Rules ■ GBDT/RF ■ …
  • 14.
    SVD for RatingPrediction ■ User factor vectors ■ Baseline (bias) and item-factors vectors (user & item deviation from average) ■ Predict rating as ■ SVD++ (Koren et. Al) asymmetric variation w. implicit feedback ■ Where ■ are three item factor vectors ■ Users are not parametrized, but rather represented by: ■ R(u): items rated by user u & N(u): items for which the user has given implicit preference (e.g. rated/not rated)
  • 15.
    Restricted Boltzmann Machines ■Restrict the connectivity in ANN to make learning easier. ■ Only one layer of hidden units. ■ ■ Although multiple layers are possible No connections between hidden units. ■ Hidden units are independent given the visible states.. ■ RBMs can be stacked to form Deep Belief Networks (DBN) – 4th generation of ANNs
  • 16.
    Ranking ■ Ranking =Scoring + Sorting + Filtering bags of movies for presentation to a user ■ Key algorithm, sorts titles in most contexts ■ Goal: Find the best possible ordering of a set of videos for a user within a specific context in real-time ■ Objective: maximize consumption & “enjoyment” ■ Factors ■ ■ ■ ■ ■ ■ Accuracy Novelty Diversity Freshness Scalability …
  • 17.
    Example: Two features,linear model 2 3 4 5 Popularity Linear Model: frank(u,v) = w1 p(v) + w2 r(u,v) + b Final Ranking Predicted Rating 1
  • 18.
    Example: Two features,linear model 2 3 4 5 Popularity Final Ranking Predicted Rating 1
  • 19.
  • 20.
  • 21.
    More data orbetter models? Really? Anand Rajaraman: Former Stanford Prof. & Senior VP at Walmart
  • 22.
    More data orbetter models? Sometimes, it’s not about more data
  • 23.
    More data orbetter models? [Banko and Brill, 2001] Norvig: “Google does not have better Algorithms, only more Data” Many features/ low-bias models
  • 24.
    More data orbetter models? Sometimes, it’s not about more data
  • 25.
    “Data without asound approach = noise”
  • 26.
  • 27.
  • 28.
    Cloud Computing atNetflix ▪ Layered services ▪ 100s of services and applications ▪ Clusters: Horizontal scaling ▪ 10,000s of EC2 instances ▪ Auto-scale with demand ▪ Plan for failure ▪ Replication ▪ Fail fast ▪ State is bad ▪ Simian Army: Induce failures to ensure resiliency
  • 29.
    System Overview ▪ Blueprintfor multiple personalization algorithm services ▪ Ranking ▪ Row selection ▪ Ratings ▪ … ▪ Recommendation involving multi-layered Machine Learning
  • 30.
    Event & DataDistribution ▪ Collect actions ▪ Plays, browsing, searches, ratings, etc. ▪ Events ▪ Small units ▪ Time sensitive ▪ Data ▪ Dense information ▪ Processed for further use ▪ Saved
  • 31.
    Online Computation ▪ Synchronouscomputation in response to a member request ▪ Pros: ▪ Good for: ▪ Simple algorithms ▪ Model application ▪ Access to most fresh data ▪ Business logic ▪ Knowledge of full request context ▪ Context-dependence ▪ Compute only what is necessary ▪ Interactivity ▪ Cons: ▪ Strict Service Level Agreements ▪ Must respond quickly … in all cases ▪ Requires high availability ▪ Limited view of data www.netflix.com
  • 32.
    Offline Computation ▪ Asynchronouscomputation done on a regular schedule ▪ Pros: ▪ Good for: ▪ Batch learning ▪ Model training ▪ Can handle large data ▪ Complex algorithms ▪ Can do bulk processing ▪ Precomputing ▪ Relaxed time constraints ▪ Cons: ▪ Cannot react quickly ▪ Results can become stale
  • 33.
    Nearline Computation ▪ Asynchronouscomputation in response to a member event ▪ Pros: ▪ Good for: ▪ Incremental learning ▪ User-oriented algorithms ▪ Can keep data fresh ▪ Moderate complexity algorithms ▪ Can run moderate complexity algorithms ▪ Keeping precomputed results fresh ▪ Can average computational cost across users ▪ Change from actions ▪ Cons: ▪ Has some delay ▪ Done in event context
  • 34.
    Where to placecomponents? ▪ Example: Matrix Factorization ▪ Offline: ▪ Collect sample of play data ▪ Run batch learning algorithm to produce factorization ▪ Publish item factors ▪ Nearline: X X≈UVt Aui=b sij=uivj ▪ Solve user factors sij ▪ Compute user-item products ▪ Combine ▪ Online: ▪ Presentation-context filtering ▪ Serve recommendations sij>t V
  • 35.
    Recommendation Results ▪ Precomputedresults ▪ Fetch from data store ▪ Post-process in context ▪ Generated on the fly ▪ Collect signals, apply model ▪ Combination ▪ Dynamically choose ▪ Fallbacks
  • 36.
  • 37.
    More data + Smartermodels + More accurate metrics + Better system architectures Lots of room for improvement!
  • 38.