Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineering Group at Netflix at MLconf SEA - 5/01/15
The document discusses the implementation of Spark and GraphX in Netflix's recommender system, aiming to enhance content discovery for over 62 million members. It details the challenges of scaling machine learning algorithms across various devices and countries, and presents solutions using distributed algorithms to perform tasks like graph diffusion and clustering. Findings indicate that GraphX significantly improves performance for large datasets, and the talk highlights lessons learned regarding iterative machine learning and state management.
Explains the role of algorithms for member satisfaction and content recommendations to maximize retention.
Discusses the scale of Netflix's service with over 62M members across 50 countries and the need for distributed machine learning.
Introduces Spark as a distributed computational engine and GraphX for graph analytics using Resilient Distributed Datasets.
Outlines two machine learning problems: ranking items and finding related clusters using Graph Diffusion Algorithms and Probabilistic Graphical Models.
Describes how GraphX represents graphs and utilizes APIs for propagating and updating attributes in iterative algorithms.
Details on the Topic Sensitive Pagerank, a graph diffusion algorithm, which captures vertex importance regarding specific topics.
Explains the iterative process where nodes accumulate mass through connections until convergence is achieved.
Outlines the implementation process in GraphX to run multiple propagations of Topic Sensitive Pagerank in parallel.
Introduces Latent Dirichlet Allocation as a method for identifying video clusters and topics within Netflix's data.
Addresses the inference parallelization challenge in LDA and presents approaches using Gibbs Sampling.
Discusses the structure of distributed Gibbs Samplers for LDA, including word-document relationships and sampling processes.
Presents example results from LDA showing clusters of movies and kids shows identified by topics.
Compares the performance of algorithms in handling datasets, highlighting GraphX efficiencies against alternatives.
Summarizes key insights around the performance benefits of GraphX and the importance of iterative algorithm design.
Notes that Netflix is hiring and provides coding and data handling examples in GraphX.
Technical instructions and references related to the implementation of Pagerank and distributed Gibbs Sampler in GraphX.
Introduction
● Goal: Help members find content that they’ll enjoy, to maximize satisfaction and retention
● Core part of the product
○ Every impression is a recommendation
Main Challenge - Scale
● Algorithms @ Netflix Scale
○ > 62M Members
○ > 50 Countries
○ > 1000 device types
○ > 100M Hours / day
● Distributed Machine Learning algorithms help with Scale
○ Spark And GraphX
Spark And GraphX
● Spark - Distributed in-memory computational engine using Resilient Distributed Datasets (RDDs)
● GraphX - extends RDDs to Multigraphs and provides graph analytics
● Convenient and fast, all the way from prototyping (iSpark, Zeppelin) to Production
Two Machine Learning Problems
● Generate a ranking of items with respect to a given item from an interaction graph
○ Graph Diffusion algorithms (e.g. Topic Sensitive Pagerank)
● Find clusters of related items using co-occurrence data
○ Probabilistic Graphical Models (Latent Dirichlet Allocation)
Topic Sensitive Pagerank @ Netflix
● Popular graph diffusion algorithm
● Captures vertex importance with respect to a particular vertex
● e.g. for the topic “Seattle”
Iteration 0
We start by activating a single node: “Seattle”
[Figure: neighborhood graph around “Seattle”, with edges labeled “related to”, “shot in”, “featured in”, and “cast”]
Iteration 1
With some probability, we follow outbound edges; otherwise we go back to the origin.
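The walk described above can be simulated directly. A toy single-machine sketch, assuming a small made-up neighborhood graph and a 0.15 restart probability (none of this is the talk's actual code); visit frequencies of a long walk approximate the diffusion scores:

```scala
import scala.util.Random

object TopicWalk {
  // Assumed toy graph: origin plus three neighbors, for illustration only
  val out: Map[String, Seq[String]] = Map(
    "Seattle" -> Seq("NodeA", "NodeB", "NodeC"),
    "NodeA"   -> Seq("Seattle"),
    "NodeB"   -> Seq("Seattle"),
    "NodeC"   -> Seq("Seattle")
  )
  val restart = 0.15 // probability of jumping back to the origin

  // Run one walker for `steps` steps and count visits per node
  def visitCounts(origin: String, steps: Int, seed: Long): Map[String, Int] = {
    val rng = new Random(seed)
    var node = origin
    var counts = Map(origin -> 1)
    for (_ <- 1 to steps) {
      node =
        if (rng.nextDouble() < restart || out.getOrElse(node, Nil).isEmpty) origin
        else { val ns = out(node); ns(rng.nextInt(ns.size)) }
      counts = counts.updated(node, counts.getOrElse(node, 0) + 1)
    }
    counts
  }
}
```

Because every non-restart step leaves the origin or returns to it, the origin dominates the visit counts, matching the intuition that diffusion scores measure proximity to the topic node.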
GraphX implementation
● Running one propagation for each possible starting node would be slow
● Keep a vector of activation probabilities at each vertex
● Use GraphX to run all propagations in parallel
Topic Sensitive Pagerank in GraphX
[Figure: each vertex attribute is a vector of activation probabilities, one entry per starting vertex (starting from vertex 1, vertex 2, vertex 3, ...)]
LDA @ Netflix
● A popular clustering/latent factors model
● Discovers clusters/topics of related videos from Netflix data
● e.g., a topic of Animal Documentaries
LDA - Graphical Model
Question: How to parallelize inference?
Answer: Read conditional independencies in the model
Gibbs Sampler 2 (Uncollapsed)
Suitable for GraphX
Sample topic labels in a given document in parallel
Sample topic labels in different documents in parallel
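Why this parallelizes: in the uncollapsed sampler, once the topic mixtures (theta) and word distributions (phi) are held fixed, each token's label depends only on its own document's row of theta and its word's column of phi, not on any other token's label. A minimal sketch under assumed toy parameters (all names and numbers here are illustrative, not from the talk):

```scala
import scala.util.Random

object UncollapsedSampling {
  // Assumed toy parameters: theta(d)(k) = P(topic k | doc d), phi(k)(w) = P(word w | topic k)
  val theta: Vector[Vector[Double]] = Vector(Vector(0.9, 0.1), Vector(0.2, 0.8))
  val phi: Vector[Vector[Double]]   = Vector(Vector(0.5, 0.5, 0.0), Vector(0.0, 0.5, 0.5))

  // Draw an index proportional to the given unnormalized weights
  def sampleCategorical(weights: Vector[Double], rng: Random): Int = {
    var u = rng.nextDouble() * weights.sum
    var k = 0
    while (k < weights.length - 1 && u >= weights(k)) { u -= weights(k); k += 1 }
    k
  }

  // One token's topic label uses only theta(doc) and phi(_)(word):
  // no other token's label appears, so tokens (and documents) sample in parallel
  def sampleToken(doc: Int, word: Int, rng: Random): Int =
    sampleCategorical(Vector.tabulate(theta(doc).length)(k => theta(doc)(k) * phi(k)(word)), rng)

  def sampleCorpus(docs: Vector[Vector[Int]], seed: Long): Vector[Vector[Int]] = {
    val rng = new Random(seed)
    docs.zipWithIndex.map { case (words, d) => words.map(w => sampleToken(d, w, rng)) }
  }
}
```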
What we learned so far ...
● Where is the cross-over point for your iterative ML algorithm?
○ GraphX brings performance benefits if you’re on the right side of that point
○ GraphX lets you easily throw more hardware at a problem
● GraphX is very useful (and fast) for other graph processing tasks
○ Data pre-processing
○ Efficient joins
What we learned so far ...
● Regularly save the state
○ With a 99.9% per-iteration success rate, what’s the probability of successfully running 1,000 iterations?
○ ~36%
● Multi-core machine learning (r3.8xl, 32 threads, 220 GB) is very efficient
○ if your data fits in the memory of a single machine!
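The ~36% on the slide follows from compounding the per-iteration success rate:

```scala
object IterationSuccess {
  // If each iteration succeeds independently with probability 0.999, the
  // chance that all 1,000 iterations succeed is 0.999^1000, roughly e^-1,
  // since 1000 * ln(0.999) is approximately -1
  val pAllSucceed: Double = math.pow(0.999, 1000)
}
```

Hence the advice to checkpoint regularly: even a very reliable per-iteration step fails most long runs when compounded a thousand times.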
Creating a GraphX graph
scala> val graph = GraphLoader.edgeListFile(sc, edgesFile, false, 100)
graph: org.apache.spark.graphx.Graph[Int,Int] = org.apache.spark.graphx.
impl.GraphImpl@547a8dc1
scala> graph.edges.count
res3: Long = 16090021
scala> graph.vertices.count
res4: Long = 4548083
Pagerank in GraphX
scala> val ranks = graph.staticPageRank(10, 0.15).vertices
// mapping: RDD[String] of "resource,id" lines, loaded earlier
scala> val resources = mapping.map { row =>
val fields = row.split(",")
(fields.last.toLong, fields.head)
}
scala> val ranksByResource = resources.join(ranks).map {
case (id, (resource, rank)) => (resource, rank)
}
scala> ranksByResource.top(3)(Ordering.by(_._2)).foreach(println)
(http://dbpedia.org/resource/United_States,15686.671749384182)
(http://dbpedia.org/resource/Animal,6530.621240073025)
(http://dbpedia.org/resource/United_Kingdom,5780.806077968981)
Topic-sensitive pagerank in GraphX
● Initialization:
○ Construct a message VertexRDD holding initial activation probabilities
at each vertex (sparse vector with one non-zero)
● Propagate message along outbound edges using flatMap
○ (Involves shuffling)
● Sum incoming messages at each vertex
○ aggregateUsingIndex, summing up sparse vectors
○ join the message to the old graph to create a new one
● count to materialize the new graph
● unpersist to clean up old graph and message
● Repeat for fixed number of iterations or until convergence
● Zeppelin notebook, using DBpedia data
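The steps above can be mirrored in a single-machine sketch (the toy graph, damping value, and names are illustrative assumptions, not the talk's actual code); the comments map each stage to the GraphX operation named in the outline:

```scala
object TspLoop {
  type Msg = Map[Int, Double] // sparse vector: seed vertex -> activation probability

  // Assumed toy directed graph: src -> outbound neighbors
  val out: Map[Int, Seq[Int]] = Map(1 -> Seq(2), 2 -> Seq(3), 3 -> Seq(1))
  val alpha = 0.15

  def run(iters: Int): Map[Int, Msg] = {
    // Initialization: each vertex holds a sparse vector with one non-zero (its own seed)
    var msg: Map[Int, Msg] = out.keys.map(v => v -> Map(v -> 1.0)).toMap
    for (_ <- 1 to iters) {
      // flatMap: propagate each vertex's activations along its outbound edges
      val sent = for {
        (src, vec) <- msg.toSeq
        dst        <- out(src)
        (seed, p)  <- vec.toSeq
      } yield (dst, seed, (1 - alpha) * p / out(src).size)
      // aggregateUsingIndex: sum incoming sparse vectors per vertex,
      // adding back the teleport mass at each seed vertex
      val teleport = out.keys.map(v => (v, v, alpha))
      msg = (sent ++ teleport)
        .groupBy(_._1)
        .map { case (v, ms) =>
          v -> ms.groupBy(_._2).map { case (s, xs) => s -> xs.map(_._3).sum }
        }
      // In GraphX we would now join the messages onto a new graph, count to
      // materialize it, and unpersist the old graph and message RDDs
    }
    msg
  }
}
```

Mass is conserved per seed on each iteration (the (1 - alpha) that leaves along edges plus the alpha teleport sums to 1), which is a quick sanity check when debugging the distributed version.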
Distributed Gibbs Sampler in GraphX
1) Initialize Document-Word graph, G
2) For each triplet in G,
a) Construct a categorical distribution using vertex attributes (P(topic | document), P(word | topic))
b) Sample a topic label from the categorical distribution
3) Aggregate topic labels on the Vertex id
4) Sample vertex attributes from the Dirichlet distribution
a) This involves computing and distributing a marginal over the topic matrix, which materializes the graph in every iteration
5) Join vertices with updated attributes with the graph and repeat from
step 2
Note: Steps 2 and 3 can be accomplished jointly using the aggregateMessages method on the Graph
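Steps 2 and 3 can be sketched single-machine as follows (a hedged illustration with assumed toy data and names; step 4's Dirichlet resampling is left as a comment since it only consumes the aggregated counts):

```scala
import scala.util.Random

object GibbsSweep {
  val K = 2 // number of topics

  // Assumed vertex attributes: docTopic(d)(k) ~ P(topic k | doc d),
  // topicWord(k)(w) ~ P(word w | topic k)
  val docTopic:  Vector[Vector[Double]] = Vector(Vector(0.7, 0.3), Vector(0.4, 0.6))
  val topicWord: Vector[Vector[Double]] = Vector(Vector(0.6, 0.4, 0.0), Vector(0.0, 0.4, 0.6))

  // Document-word "triplets": (doc, word) edges of the bipartite graph
  val triplets: Seq[(Int, Int)] = Seq((0, 0), (0, 1), (1, 1), (1, 2))

  def sample(weights: Vector[Double], rng: Random): Int = {
    var u = rng.nextDouble() * weights.sum
    var k = 0
    while (k < weights.length - 1 && u >= weights(k)) { u -= weights(k); k += 1 }
    k
  }

  // Steps 2+3: sample a topic label per triplet, then aggregate labels per
  // vertex (in GraphX both happen in one aggregateMessages call over the edges)
  def sweep(seed: Long): (Map[Int, Map[Int, Int]], Map[Int, Map[Int, Int]]) = {
    val rng = new Random(seed)
    val labeled = triplets.map { case (d, w) =>
      val z = sample(Vector.tabulate(K)(k => docTopic(d)(k) * topicWord(k)(w)), rng)
      (d, w, z)
    }
    val byDoc = labeled.groupBy(_._1).map { case (d, xs) =>
      d -> xs.groupBy(_._3).map { case (z, ys) => z -> ys.size } }
    val byWord = labeled.groupBy(_._2).map { case (w, xs) =>
      w -> xs.groupBy(_._3).map { case (z, ys) => z -> ys.size } }
    // Step 4 would resample docTopic rows from Dirichlet(prior + byDoc counts)
    // and topicWord rows from Dirichlet(prior + byWord counts); step 5 joins
    // the updated attributes back onto the graph and the sweep repeats
    (byDoc, byWord)
  }
}
```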