Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineering Group at Netflix at MLconf SEA - 5/01/15
Spark and GraphX in the Netflix Recommender System: We at Netflix strive to deliver maximum enjoyment and entertainment to our millions of members across the world. We do so by having great content and by constantly innovating on our product. A key strategy to optimize both is to follow a data-driven method. Data allows us to find optimal approaches to applications such as content buying or our renowned personalization algorithms. But, in order to learn from this data, we need to be smart about the algorithms we use, how we apply them, and how we can scale them to our volume of data (over 50 million members and 5 billion hours streamed over three months). In this talk we describe how Spark and GraphX can be leveraged to address some of our scale challenges. In particular, we share insights and lessons learned on how to run large probabilistic clustering and graph diffusion algorithms on top of GraphX, making it possible to apply them at Netflix scale.

1. Spark and GraphX in the Netflix Recommender System. Ehtsham Elahi and Yves Raimond, Algorithms Engineering, Netflix. MLconf Seattle 2015
2. Machine Learning @ Netflix
3. Introduction ● Goal: Help members find content that they’ll enjoy, to maximize satisfaction and retention ● Core part of product ○ Every impression is a recommendation
4. Main Challenge - Scale ● Algorithms @ Netflix Scale ○ > 62M Members ○ > 50 Countries ○ > 1000 device types ○ > 100M Hours / day ● Distributed Machine Learning algorithms help with Scale
5. Main Challenge - Scale ● Algorithms @ Netflix Scale ○ > 62M Members ○ > 50 Countries ○ > 1000 device types ○ > 100M Hours / day ● Distributed Machine Learning algorithms help with Scale ○ Spark And GraphX
6. Spark and GraphX
7. Spark And GraphX ● Spark - a distributed in-memory computational engine using Resilient Distributed Datasets (RDDs) ● GraphX - extends RDDs to multigraphs and provides graph analytics ● Convenient and fast, all the way from prototyping (iSpark, Zeppelin) to production
8. Two Machine Learning Problems ● Generate a ranking of items with respect to a given item from an interaction graph ○ Graph Diffusion algorithms (e.g. Topic Sensitive Pagerank) ● Find clusters of related items using co-occurrence data ○ Probabilistic Graphical Models (Latent Dirichlet Allocation)
9. Iterative Algorithms in GraphX [figure: example graph with vertices v1-v7, annotated with a vertex attribute and an edge attribute]
10. Iterative Algorithms in GraphX: GraphX represents the graph as RDDs, e.g. VertexRDD, EdgeRDD
11. Iterative Algorithms in GraphX: GraphX provides APIs to propagate and update attributes
12. Iterative Algorithms in GraphX: an iterative algorithm proceeds by creating updated graphs
13. Graph Diffusion algorithms
14. Topic Sensitive Pagerank @ Netflix ● A popular graph diffusion algorithm ● Captures vertex importance with regard to a particular vertex ● e.g. for the topic “Seattle”
15. Iteration 0: We start by activating a single node, “Seattle” [figure: graph with edges labeled “related to”, “shot in”, “featured in”, “cast”]
16. Iteration 1: With some probability, we follow outbound edges; otherwise we go back to the origin.
17. Iteration 2: Vertex accumulates higher mass
18. Iteration 2: And again, until convergence
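The walk in slides 15-18 is a random walk with restart: with probability 1 - α mass flows along outbound edges, and with probability α it jumps back to the seed. As a minimal illustration (not the Netflix implementation), here is the power iteration in plain Scala on a made-up four-vertex graph, using α = 0.15 as in the appendix:

```scala
// Topic-sensitive PageRank by power iteration on a tiny hypothetical graph.
// Vertices are Ints; outEdges maps a vertex to the vertices it links to.
val outEdges: Map[Int, List[Int]] =
  Map(1 -> List(2, 3), 2 -> List(3), 3 -> List(1), 4 -> List(3))

val alpha = 0.15          // restart (teleport) probability
val seed  = 1             // the single activated node, e.g. "Seattle"
val vertices = outEdges.keys.toList

// One diffusion step: push each vertex's mass along its outbound edges,
// then mix in the restart jump back to the seed.
def step(p: Map[Int, Double]): Map[Int, Double] = {
  val spread = for {
    (v, mass) <- p.toList
    targets = outEdges(v)
    t <- targets
  } yield (t, (1 - alpha) * mass / targets.size)
  val pushed = spread.groupBy(_._1).map { case (v, xs) => v -> xs.map(_._2).sum }
  vertices.map(v => v -> (pushed.getOrElse(v, 0.0) + (if (v == seed) alpha else 0.0))).toMap
}

// Start with all mass on the seed and iterate toward convergence.
val init = vertices.map(v => v -> (if (v == seed) 1.0 else 0.0)).toMap
val ranks = (1 to 50).foldLeft(init)((p, _) => step(p))
```

Because every vertex here has outbound edges, the total mass stays 1; vertex 4 has no inbound edges, so its activation stays 0, while vertices near the seed accumulate mass.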
19. GraphX implementation ● Running one propagation for each possible starting node would be slow ● Keep a vector of activation probabilities at each vertex ● Use GraphX to run all propagations in parallel
20. Topic Sensitive Pagerank in GraphX [figure: activation probabilities as vertex attributes; each vertex stores one activation probability per starting vertex (vertex 1, vertex 2, vertex 3, ...)]
21. Example graph diffusion results: “Matrix”, “Zombies”, “Seattle”
22. Distributed Clustering algorithms
23. LDA @ Netflix ● A popular clustering/latent factors model ● Discovers clusters/topics of related videos from Netflix data ● e.g., a topic of Animal Documentaries
24. LDA - Graphical Model. Question: How to parallelize inference?
25. LDA - Graphical Model. Question: How to parallelize inference? Answer: Read conditional independencies in the model
26. Gibbs Sampler 1 (Semi Collapsed)
27. Gibbs Sampler 1 (Semi Collapsed): Sample topic labels in a given document sequentially; sample topic labels in different documents in parallel
28. Gibbs Sampler 2 (UnCollapsed)
29. Gibbs Sampler 2 (UnCollapsed): Sample topic labels in a given document in parallel; sample topic labels in different documents in parallel
30. Gibbs Sampler 2 (UnCollapsed), suitable for GraphX: Sample topic labels in a given document in parallel; sample topic labels in different documents in parallel
31. Distributed Gibbs Sampler [figure: a distributed parameterized graph for LDA with 3 Topics; a bipartite graph of words w1-w3 and documents d1-d2, each vertex carrying a 3-dimensional topic distribution]
32. Distributed Gibbs Sampler [figure: same graph, highlighting a document vertex]
33. Distributed Gibbs Sampler [figure: same graph, highlighting a word vertex]
34. Distributed Gibbs Sampler [figure: same graph; an edge connects a word and a document if the word appeared in the document]
35. Distributed Gibbs Sampler: (vertex, edge, vertex) = triplet
36. Distributed Gibbs Sampler: Categorical distribution for the triplet using vertex attributes
37. Distributed Gibbs Sampler: Categorical distributions for all triplets
38. Distributed Gibbs Sampler: Sample topics for all edges
39. Distributed Gibbs Sampler: Neighborhood aggregation for topic histograms
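Slides 35-39 can be sketched in plain Scala: each (document, edge, word) triplet defines a categorical over topics with weights proportional to the product of the two vertex attributes, one topic label is sampled per edge, and the labels are then histogrammed at each vertex. The graph and attribute values below are hypothetical, not the Netflix data:

```scala
import scala.util.Random

val rng = new Random(7)
val numTopics = 3

// Hypothetical vertex attributes: per-document and per-word topic weights.
val theta = Map("d1" -> Array(0.3, 0.6, 0.1), "d2" -> Array(0.2, 0.5, 0.3))
val phi   = Map("w1" -> Array(0.3, 0.4, 0.1), "w2" -> Array(0.3, 0.2, 0.8))
val edges = List(("d1", "w1"), ("d1", "w2"), ("d2", "w1"), ("d2", "w2"))

// Draw one sample from an (unnormalized) categorical distribution.
def sampleCategorical(weights: Array[Double]): Int = {
  val u = rng.nextDouble() * weights.sum
  var acc = 0.0
  var k = 0
  while (acc + weights(k) < u && k < weights.length - 1) { acc += weights(k); k += 1 }
  k
}

// Slides 36-38: per-triplet categorical, then a topic sample for every edge.
val edgeTopics = edges.map { case (d, w) =>
  val weights = Array.tabulate(numTopics)(k => theta(d)(k) * phi(w)(k))
  (d, w, sampleCategorical(weights))
}

// Slide 39: neighborhood aggregation - a topic histogram at each vertex.
def histogram(labels: Seq[Int]): Array[Int] = {
  val h = Array.fill(numTopics)(0)
  labels.foreach(h(_) += 1)
  h
}
val docHist  = edgeTopics.groupBy(_._1).map { case (d, ts) => d -> histogram(ts.map(_._3)) }
val wordHist = edgeTopics.groupBy(_._2).map { case (w, ts) => w -> histogram(ts.map(_._3)) }
```

Because the topic draws only read the (fixed) vertex attributes, every edge can be sampled independently, which is exactly what makes the uncollapsed sampler parallelizable.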
40. Distributed Gibbs Sampler: Realize samples from Dirichlet to update the graph
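For slide 40, a standard way to realize a Dirichlet sample is to draw independent Gamma variates and normalize. The sketch below uses the Marsaglia-Tsang Gamma sampler (valid for shape ≥ 1, which holds once a prior of at least 1 is added to the counts); the counts and prior are illustrative assumptions, not values from the talk:

```scala
import scala.util.Random

val rng = new Random(42)

// Marsaglia-Tsang sampler for Gamma(shape, 1), valid for shape >= 1.
def sampleGamma(shape: Double): Double = {
  val d = shape - 1.0 / 3.0
  val c = 1.0 / math.sqrt(9.0 * d)
  var result = -1.0
  while (result < 0) {
    val x = rng.nextGaussian()
    val v = math.pow(1.0 + c * x, 3)
    if (v > 0) {
      val u = rng.nextDouble()
      if (math.log(u) < 0.5 * x * x + d - d * v + d * math.log(v)) result = d * v
    }
  }
  result
}

// A Dirichlet(alphas) draw is a vector of independent Gamma(alpha_k) draws, normalized.
def sampleDirichlet(alphas: Array[Double]): Array[Double] = {
  val gammas = alphas.map(sampleGamma)
  val total = gammas.sum
  gammas.map(_ / total)
}

// Topic histogram from the aggregation step, plus a (hypothetical) prior of 1.0.
val counts = Array(1.0, 2.0, 0.0)
val newAttribute = sampleDirichlet(counts.map(_ + 1.0))
```

The resulting vector is a fresh topic distribution that can be written back as the vertex attribute for the next iteration.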
41. Example LDA Results: Cluster of Bollywood Movies, Cluster of Kids shows, Cluster of Western movies
42. GraphX performance comparison
43. Algorithm Implementations ● Topic Sensitive Pagerank ○ Broadcast graph adjacency matrix ○ Scala/Breeze code, triggered by Spark ● LDA ○ Single machine ○ Multi-threaded Java code ● All implementations are Netflix internal code
44. Performance Comparison
45. Performance Comparison: Open Source DBPedia dataset
46. Performance Comparison: Alternative Implementation: Scala code triggered by Spark on a cluster
47. Performance Comparison: Sublinear rise in time with GraphX vs linear rise in the Alternative
48. Performance Comparison: Doubling the size of the cluster: 2.0x speedup in the Alternative Impl vs 1.2x in GraphX
49. Performance Comparison: A large number of vertices propagated in parallel leads to large shuffle data, causing failures in GraphX for small clusters
50. Performance Comparison: Netflix dataset, Number of Topics = 100
51. Performance Comparison: Multi-core implementation: single-machine multi-threaded Java code
52. Performance Comparison: GraphX setup: 8x the resources of the Multi-Core setup
53. Performance Comparison: Wikipedia dataset (16 x r3.2xl) (Databricks)
54. Performance Comparison: For very large datasets, GraphX outperforms the multi-core UnCollapsed implementation
55. Lessons Learned
56. What we learned so far ... ● Where is the cross-over point for your iterative ML algorithm? ○ GraphX brings performance benefits if you’re on the right side of that point ○ GraphX lets you easily throw more hardware at a problem ● GraphX very useful (and fast) for other graph processing tasks ○ Data pre-processing ○ Efficient joins
57. What we learned so far ... ● Regularly save the state ○ With a 99.9% success rate, what’s the probability of successfully running 1,000 iterations? ● Multi-Core Machine Learning (r3.8xl, 32 threads, 220 GB) is very efficient ○ if your data fits in the memory of a single machine!
58. What we learned so far ... ● Regularly save the state ○ With a 99.9% success rate, what’s the probability of successfully running 1,000 iterations? ○ ~36% ● Multi-Core Machine Learning (r3.8xl, 32 threads, 220 GB) is very efficient ○ if your data fits in the memory of a single machine!
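The ~36% on slide 58 is just the survival probability of 1,000 independent iterations at a 99.9% per-iteration success rate, which lands close to 1/e:

```scala
// Probability that 1,000 iterations all succeed at a 99.9% per-iteration rate.
val p = math.pow(0.999, 1000)   // ≈ 0.3677, close to 1/e ≈ 0.3679
```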
59. We’re hiring! (come talk to us) https://jobs.netflix.com/
60. Appendix
61. Using GraphX
scala> val edgesFile = "/data/mlconf-graphx/edges.txt"
scala> sc.textFile(edgesFile).take(5).foreach(println)
0 1
2 3
2 4
2 5
2 6
scala> val mapping = sc.textFile("/data/mlconf-graphx/uri-mapping.csv")
scala> mapping.take(5).foreach(println)
http://dbpedia.org/resource/Drew_Finerty,3663393
http://dbpedia.org/resource/1998_JGTC_season,4148403
http://dbpedia.org/resource/Eucalyptus_bosistoana,3473416
http://dbpedia.org/resource/Wilmington,234049
http://dbpedia.org/resource/Wetter_(Ruhr),884940
62. Creating a GraphX graph
scala> val graph = GraphLoader.edgeListFile(sc, edgesFile, false, 100)
graph: org.apache.spark.graphx.Graph[Int,Int] = org.apache.spark.graphx.impl.GraphImpl@547a8dc1
scala> graph.edges.count
res3: Long = 16090021
scala> graph.vertices.count
res4: Long = 4548083
63. Pagerank in GraphX
scala> val ranks = graph.staticPageRank(10, 0.15).vertices
scala> val resources = mapping.map { row =>
         val fields = row.split(",")
         (fields.last.toLong, fields.head)
       }
scala> val ranksByResource = resources.join(ranks).map {
         case (id, (resource, rank)) => (resource, rank)
       }
scala> ranksByResource.top(3)(Ordering.by(_._2)).foreach(println)
(http://dbpedia.org/resource/United_States,15686.671749384182)
(http://dbpedia.org/resource/Animal,6530.621240073025)
(http://dbpedia.org/resource/United_Kingdom,5780.806077968981)
64. Topic-sensitive pagerank in GraphX
● Initialization:
○ Construct a message VertexRDD holding initial activation probabilities at each vertex (sparse vector with one non-zero)
● Propagate message along outbound edges using flatMap
○ (Involves shuffling)
● Sum incoming messages at each vertex
○ aggregateUsingIndex, summing up sparse vectors
○ join the message to the old graph to create a new one
● count to materialize the new graph
● unpersist to clean up old graph and message
● Repeat for fixed number of iterations or until convergence
● Zeppelin notebook, using DBpedia data
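The steps above can be mimicked in plain Scala, with a Map standing in for the sparse activation vector: messages fan out along outbound edges (the flatMap step) and are then summed per destination vertex (the aggregateUsingIndex step), with the restart mass mixed back in for each seed. The toy graph and restart probability below are hypothetical:

```scala
// Each vertex holds a sparse vector: Map(seedVertex -> activation probability).
type Sparse = Map[Int, Double]

val outEdges: Map[Int, List[Int]] = Map(1 -> List(2), 2 -> List(1, 3), 3 -> List(1))
val vertices = outEdges.keys.toList
val alpha = 0.15

// Initial message: every vertex is the seed of its own propagation.
val init: Map[Int, Sparse] = vertices.map(v => v -> Map(v -> 1.0)).toMap

def addSparse(a: Sparse, b: Sparse): Sparse =
  (a.keySet ++ b.keySet).map(k => k -> (a.getOrElse(k, 0.0) + b.getOrElse(k, 0.0))).toMap

// One iteration: fan messages out along outbound edges (flatMap), sum the
// incoming sparse vectors per vertex, then add each seed's restart mass.
def iterate(msg: Map[Int, Sparse]): Map[Int, Sparse] = {
  val sent = for {
    (v, vec) <- msg.toList
    targets = outEdges(v)
    t <- targets
  } yield t -> vec.map { case (seed, p) => seed -> (1 - alpha) * p / targets.size }
  val summed = sent.groupBy(_._1).map { case (v, xs) => v -> xs.map(_._2).reduce(addSparse) }
  vertices.map(v => v -> addSparse(summed.getOrElse(v, Map.empty), Map(v -> alpha))).toMap
}

val after = (1 to 30).foldLeft(init)((m, _) => iterate(m))
```

Running all seeds through one pass like this is the point of slide 19: a single propagation updates every seed's activation vector at once, so the per-seed mass stays normalized throughout.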
65. Distributed Gibbs Sampler in GraphX
1) Initialize the Document-Word graph, G
2) For each triplet in G,
a) Construct a categorical distribution using the vertex attributes (P(topic | document), P(word | topic))
b) Sample a topic label from the categorical distribution
3) Aggregate topic labels on the Vertex id
4) Sample vertex attributes from the Dirichlet distribution
a) This involves computing and distributing a marginal over the Topic matrix; this materializes the graph in every iteration
5) Join vertices with updated attributes with the graph and repeat from step 2
Note: Steps 2 and 3 can be accomplished jointly using the aggregateMessages method on the Graph
66. References ● Topic Sensitive Pagerank [Haveliwala, 2002] ● Latent Dirichlet Allocation [Blei, 2003]
