This document summarizes how Netflix uses Spark and GraphX to power its recommender system at scale. It describes two machine learning problems: generating item rankings using graph diffusion algorithms such as Topic Sensitive PageRank, and finding item clusters using LDA. It shows how these algorithms can be implemented iteratively in GraphX by representing the data as graphs and propagating vertex attributes. Performance comparisons show that GraphX can outperform alternative implementations on large datasets thanks to its parallelism. Lessons learned include the importance of regular checkpointing, and that multicore implementations are more efficient for smaller datasets that fit in memory.
4. Introduction
● Goal: Help members find content that they’ll enjoy, to maximize satisfaction and retention
● Core part of the product
○ Every impression is a recommendation
6. Main Challenge - Scale
● Algorithms @ Netflix scale
○ > 62M members
○ > 50 countries
○ > 1,000 device types
○ > 100M hours / day
● Distributed machine learning algorithms help with scale
○ Spark and GraphX
8. Spark and GraphX
● Spark - a distributed in-memory computation engine built on Resilient Distributed Datasets (RDDs)
● GraphX - extends RDDs to multigraphs and provides graph analytics
● Convenient and fast, all the way from prototyping (iSpark, Zeppelin) to production
9. Two Machine Learning Problems
● Generate a ranking of items with respect to a given item from an interaction graph
○ Graph diffusion algorithms (e.g. Topic Sensitive PageRank)
● Find clusters of related items using co-occurrence data
○ Probabilistic graphical models (Latent Dirichlet Allocation)
15. Topic Sensitive PageRank @ Netflix
● A popular graph diffusion algorithm
● Captures vertex importance with respect to a particular vertex
● e.g. for the topic “Seattle”
16. Iteration 0
We start by activating a single node: “Seattle”.
[Graph diagram: the “Seattle” vertex linked to neighboring titles by edges labeled “related to”, “shot in”, “featured in”, and “cast”.]
17. Iteration 1
With some probability we follow outbound edges; otherwise we go back to the origin.
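The update rule described on this slide (follow an outbound edge, or jump back to the origin with some restart probability) is personalized PageRank computed by power iteration. A minimal, framework-free sketch in Python; the toy graph, vertex names, and α = 0.15 are illustrative assumptions, not the Netflix data or the GraphX implementation:

```python
# Personalized PageRank ("random walk with restart") by power iteration.
# With probability alpha the walk restarts at the origin vertex;
# otherwise it follows a uniformly chosen outbound edge.

def topic_sensitive_pagerank(out_edges, origin, alpha=0.15, iters=50):
    vertices = list(out_edges)
    # Iteration 0: all activation mass on the origin vertex.
    rank = {v: 1.0 if v == origin else 0.0 for v in vertices}
    for _ in range(iters):
        nxt = {v: alpha if v == origin else 0.0 for v in vertices}
        for v, neighbors in out_edges.items():
            share = (1.0 - alpha) * rank[v] / len(neighbors)
            for u in neighbors:
                nxt[u] += share
        rank = nxt
    return rank

# Tiny interaction graph around a "Seattle" topic vertex (made up).
graph = {
    "Seattle": ["Sleepless in Seattle", "Frasier", "Grey's Anatomy"],
    "Sleepless in Seattle": ["Seattle", "Frasier"],
    "Frasier": ["Seattle"],
    "Grey's Anatomy": ["Seattle", "Frasier"],
}
ranks = topic_sensitive_pagerank(graph, origin="Seattle")
# ranks is a probability distribution over vertices, peaked at the origin.
```

Because every vertex here has outbound edges, the activation values stay a probability distribution (they sum to 1) across iterations.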
20. GraphX Implementation
● Running one propagation for each possible starting node would be slow
● Instead, keep a vector of activation probabilities at each vertex
● Use GraphX to run all propagations in parallel
21. Topic Sensitive PageRank in GraphX
[Diagram: activation probabilities stored as vertex attributes, with one entry per starting vertex (activation starting from vertex 1, from vertex 2, from vertex 3, ...), so all propagations advance together in each iteration.]
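The trick on this slide can be sketched without GraphX: give every vertex a vector with one activation entry per starting vertex, and advance all propagations in a single pass. In matrix form, the whole batch of walks is one iteration over a matrix R whose column j is the activation vector for the walk restarting at vertex j. A toy NumPy sketch; the 4-vertex graph and α are illustrative assumptions:

```python
import numpy as np

# Column-stochastic transition matrix P for a toy 4-vertex graph:
# P[u, v] = probability of stepping to vertex u from vertex v (assumed edges).
P = np.array([
    [0.0, 0.5, 1.0, 0.5],
    [1/3, 0.0, 0.0, 0.0],
    [1/3, 0.5, 0.0, 0.5],
    [1/3, 0.0, 0.0, 0.0],
])
alpha = 0.15
n = P.shape[0]

# R[:, j] is the activation vector for the walk restarting at vertex j.
# Row i of R is the per-vertex attribute from the slide: vertex i's
# activation probability for every possible starting vertex.
R = np.eye(n)
for _ in range(50):
    R = alpha * np.eye(n) + (1 - alpha) * P @ R

# Every column of R is a probability distribution over the vertices.
```

One matrix product per iteration updates all n propagations at once, which is the same structure GraphX exploits by propagating the whole activation vector along each edge in parallel.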
24. LDA @ Netflix
● A popular clustering / latent factors model
● Discovers clusters/topics of related videos from Netflix data
● e.g. a topic of Animal Documentaries
26. LDA - Graphical Model
Question: How do we parallelize inference?
Answer: Read the conditional independencies off the model.
31. Gibbs Sampler 2 (Uncollapsed)
● Sample topic labels in a given document in parallel
● Sample topic labels in different documents in parallel
● Suitable for GraphX
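The reason the uncollapsed sampler parallelizes is that, given the topic-word distributions φ and the document-topic distributions θ, every token's topic label is conditionally independent of every other token's. A toy single-machine sketch in Python; the corpus, K, and hyperparameters are all illustrative assumptions, and this is plain NumPy rather than the GraphX implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

docs = [[0, 1, 1, 2], [2, 3, 3, 0], [1, 2, 0, 0]]  # toy word ids (assumed)
V, K = 4, 2             # vocabulary size, number of topics
alpha, beta = 0.5, 0.5  # symmetric Dirichlet hyperparameters

# Random initial topic label for every token.
z = [rng.integers(K, size=len(d)) for d in docs]

for _ in range(100):
    # Step 1: resample the global topic-word distributions phi
    # from their Dirichlet posterior given the current labels.
    phi = np.empty((K, V))
    for k in range(K):
        counts = np.zeros(V)
        for d, zd in zip(docs, z):
            for w, t in zip(d, zd):
                if t == k:
                    counts[w] += 1
        phi[k] = rng.dirichlet(counts + beta)
    # Step 2: given phi and theta, every token's label is conditionally
    # independent -- this loop could run in parallel across documents
    # and across tokens within a document, which is the structure the
    # slide calls "suitable for GraphX".
    for i, d in enumerate(docs):
        counts = np.bincount(z[i], minlength=K)
        theta = rng.dirichlet(counts + alpha)  # doc-topic distribution
        for j, w in enumerate(d):
            p = theta * phi[:, w]
            z[i][j] = rng.choice(K, p=p / p.sum())
```

In the graph view, documents and words are vertices, tokens are edges, and step 2 is an edge-parallel operation over the bipartite document-word graph.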
57. What we learned so far ...
● Where is the cross-over point for your iterative ML algorithm?
○ GraphX brings performance benefits if you’re on the right side of that point
○ GraphX lets you easily throw more hardware at a problem
● GraphX is very useful (and fast) for other graph processing tasks
○ Data pre-processing
○ Efficient joins
59. What we learned so far ...
● Regularly save the state
○ With a 99.9% per-iteration success rate, what’s the probability of successfully running 1,000 iterations?
○ ~36%
● Multi-core machine learning (r3.8xl, 32 threads, 220 GB) is very efficient
○ if your data fits in the memory of a single machine!
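The ~36% figure follows from compounding the per-iteration failure probability, which is why regular checkpointing matters for long-running iterative jobs. A quick check (the 99.9% rate is the slide's; the code is just the arithmetic):

```python
import math

p_success_per_iter = 0.999
iterations = 1000

# Probability that all 1,000 iterations succeed with no checkpointing.
p_all = p_success_per_iter ** iterations

# Equivalent closed form: exp(n * ln p) = exp(-1.0005...) ~ 0.368
approx = math.exp(iterations * math.log(p_success_per_iter))

print(f"{p_all:.3f}")  # prints 0.368, i.e. roughly a 36% chance
```

Checkpointing every k iterations bounds the work lost to a failure at k iterations, at the cost of periodic state-saving I/O.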