- 1. Spark and GraphX in the Netflix Recommender System. Ehtsham Elahi and Yves Raimond, Algorithms Engineering, Netflix. MLconf Seattle 2015
- 2. Machine Learning @ Netflix
- 3. Introduction ● Goal: Help members find content that they’ll enjoy to maximize satisfaction and retention ● Core part of product ○ Every impression is a recommendation
- 4. Main Challenge - Scale ● Algorithms @ Netflix Scale ○ > 62 M Members ○ > 50 Countries ○ > 1000 device types ○ > 100M Hours / day ● Distributed Machine Learning algorithms help with Scale
- 5. Main Challenge - Scale ● Algorithms @ Netflix Scale ○ > 62 M Members ○ > 50 Countries ○ > 1000 device types ○ > 100M Hours / day ● Distributed Machine Learning algorithms help with Scale ○ Spark And GraphX
- 6. Spark and GraphX
- 7. Spark And GraphX ● Spark: a distributed in-memory computational engine built on Resilient Distributed Datasets (RDDs) ● GraphX: extends RDDs to multigraphs and provides graph analytics ● Convenient and fast, all the way from prototyping (iSpark, Zeppelin) to production
- 8. Two Machine Learning Problems ● Generate ranking of items with respect to a given item from an interaction graph ○ Graph Diffusion algorithms (e.g. Topic Sensitive Pagerank) ● Find Clusters of related items using co-occurrence data ○ Probabilistic Graphical Models (Latent Dirichlet Allocation)
- 9. Iterative Algorithms in GraphX [Figure: example graph with vertices v1, v2, v3, v4, v6, v7; vertices carry a vertex attribute and edges carry an edge attribute]
- 10. Iterative Algorithms in GraphX [Same figure] GraphX represents the graph as RDDs, e.g. VertexRDD, EdgeRDD
- 11. Iterative Algorithms in GraphX [Same figure] GraphX provides APIs to propagate and update attributes
- 12. Iterative Algorithms in GraphX [Same figure] An iterative algorithm proceeds by creating updated graphs
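
Slides 9 to 12 describe the general pattern: attributes live on vertices and edges, messages are propagated along edges, and each iteration yields a new graph instead of mutating the old one. The following is a minimal plain-Scala sketch of that pattern, not GraphX itself. `TinyGraph`, `aggregate`, `updated` and `step` are hypothetical names chosen for illustration, and in-memory collections stand in for `VertexRDD` and `EdgeRDD`.

```scala
// Plain-Scala analogy of the "new graph per iteration" style (no Spark needed).
// All names here are illustrative assumptions, not GraphX API.
object TinyGraphSketch {
  // Immutable graph: vertex attributes keyed by id, plus directed edges (src, dst, attr).
  final case class TinyGraph[V, E](vertices: Map[Long, V], edges: Seq[(Long, Long, E)]) {

    // Propagate: each edge sends one message to its destination;
    // messages arriving at the same vertex are combined with `merge`.
    def aggregate[M](send: (V, E) => M)(merge: (M, M) => M): Map[Long, M] =
      edges
        .map { case (src, dst, attr) => dst -> send(vertices(src), attr) }
        .groupBy(_._1)
        .map { case (dst, msgs) => dst -> msgs.map(_._2).reduce(merge) }

    // Update: build a NEW graph whose vertex attributes combine the old attribute
    // with the (optional) aggregated message; the old graph is left untouched.
    def updated[M](messages: Map[Long, M])(f: (V, Option[M]) => V): TinyGraph[V, E] =
      copy(vertices = vertices.map { case (id, v) => id -> f(v, messages.get(id)) })
  }

  // One iteration: every vertex receives the weighted sum of its in-neighbours' values.
  def step(g: TinyGraph[Double, Double]): TinyGraph[Double, Double] = {
    val msgs = g.aggregate((srcValue, weight) => srcValue * weight)(_ + _)
    g.updated(msgs)((_, incoming) => incoming.getOrElse(0.0))
  }
}
```

Calling `TinyGraphSketch.step` repeatedly produces a chain of graphs, which is the shape of computation the slides attribute to GraphX. In GraphX the older graphs would additionally need to be unpersisted, as slide 64 notes.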
- 13. Graph Diffusion algorithms
- 14. ● Popular graph diffusion algorithm ● Capturing vertex importance with regards to a particular vertex ● e.g. for the topic “Seattle” Topic Sensitive Pagerank @ Netflix
- 15. Iteration 0: We start by activating a single node, “Seattle”. [Figure: graph whose edges are labelled “related to”, “shot in”, “featured in” and “cast”]
- 16. Iteration 1: With some probability, we follow outbound edges, otherwise we go back to the origin.
- 17. Iteration 2: Vertex accumulates higher mass
- 18. Iteration 2: And again, until convergence
- 19. GraphX implementation ● Running one propagation for each possible starting node would be slow ● Keep a vector of activation probabilities at each vertex ● Use GraphX to run all propagations in parallel
- 20. Topic Sensitive Pagerank in GraphX: activation probabilities as vertex attributes. [Figure: each vertex holds a vector whose entries are the activation probability starting from vertex 1, starting from vertex 2, starting from vertex 3, ...]
- 21. Example graph diffusion results: “Matrix”, “Zombies”, “Seattle”
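
Slides 15 to 18 describe the diffusion for one origin: mass starts on a single vertex, follows outbound edges with some probability, and otherwise returns to the origin. The sketch below is one possible single-origin version in plain Scala. It is an illustration under assumptions, not the Netflix implementation: the `restart` parameter, the even split across outbound edges, and sending dead-end mass back to the origin are all choices made here.

```scala
// Single-origin topic-sensitive PageRank on an in-memory adjacency map (illustrative sketch).
object PersonalizedPageRankSketch {
  // outEdges: vertex -> outbound neighbours. restart: probability of jumping back to the origin.
  def run(outEdges: Map[String, Seq[String]], origin: String,
          restart: Double, iterations: Int): Map[String, Double] = {
    val vertices: Set[String] = outEdges.keySet ++ outEdges.values.flatten + origin
    // Iteration 0: all probability mass sits on the origin.
    var mass: Map[String, Double] =
      vertices.map(v => v -> (if (v == origin) 1.0 else 0.0)).toMap
    for (_ <- 1 to iterations) {
      // With probability (1 - restart), mass follows outbound edges, split evenly.
      val pushed = mass.toSeq.flatMap { case (v, m) =>
        val outs = outEdges.getOrElse(v, Seq.empty)
        if (outs.isEmpty) Seq(origin -> (1 - restart) * m) // dead end: assumption made here
        else outs.map(n => n -> (1 - restart) * m / outs.size)
      }
      val followed = pushed.groupBy(_._1).map { case (v, ms) => v -> ms.map(_._2).sum }
      // Total mass is always 1, so the restart step adds exactly `restart` to the origin.
      mass = vertices.map { v =>
        v -> (followed.getOrElse(v, 0.0) + (if (v == origin) restart else 0.0))
      }.toMap
    }
    mass
  }
}
```

For example, `PersonalizedPageRankSketch.run(Map("Seattle" -> Seq("A", "B"), "A" -> Seq("Seattle"), "B" -> Seq("A")), "Seattle", 0.15, 50)` returns a distribution over the three vertices in which the origin holds the most mass. The vertex names `A` and `B` are placeholders.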
- 22. Distributed Clustering algorithms
- 23. LDA @ Netflix ● A popular clustering/latent factors model ● Discovers clusters/topics of related videos from Netflix data ● e.g., a topic of Animal Documentaries
- 24. LDA - Graphical Model Question: How to parallelize inference?
- 25. LDA - Graphical Model Question: How to parallelize inference? Answer: Read conditional independencies in the model
- 26. Gibbs Sampler 1 (Semi Collapsed)
- 27. Gibbs Sampler 1 (Semi Collapsed) ● Sample topic labels within a given document: sequentially ● Sample topic labels in different documents: in parallel
- 28. Gibbs Sampler 2 (UnCollapsed)
- 29. Gibbs Sampler 2 (UnCollapsed) ● Sample topic labels within a given document: in parallel ● Sample topic labels in different documents: in parallel
- 30. Gibbs Sampler 2 (UnCollapsed) ● Suitable for GraphX ● Sample topic labels within a given document: in parallel ● Sample topic labels in different documents: in parallel
- 31. Distributed Gibbs Sampler: a distributed parameterized graph for LDA with 3 topics. [Figure: word vertices w1, w2, w3 and document vertices d1, d2, each carrying a 3-element attribute vector]
- 32. Distributed Gibbs Sampler [Same figure, document vertices d1 and d2 labelled “document”]
- 33. Distributed Gibbs Sampler [Same figure, word vertices w1, w2, w3 labelled “word”]
- 34. Distributed Gibbs Sampler [Same figure] Edge: present if the word appeared in the document
- 35. Distributed Gibbs Sampler [Same figure] (vertex, edge, vertex) = triplet
- 36. Distributed Gibbs Sampler [Same figure] Categorical distribution for the triplet, built from the vertex attributes
- 37. Distributed Gibbs Sampler [Same figure] Categorical distributions for all triplets
- 38. Distributed Gibbs Sampler [Same figure, edges now carry sampled topic labels: 1, 1, 2, 0] Sample topics for all edges
- 39. Distributed Gibbs Sampler [Figure: each vertex now holds a topic-count histogram] Neighborhood aggregation for topic histograms
- 40. Distributed Gibbs Sampler [Figure: vertex attribute vectors replaced by new samples] Realize samples from Dirichlet to update the graph
- 41. Example LDA Results: cluster of Bollywood movies, cluster of kids shows, cluster of Western movies
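
Slides 35 to 39 walk through the per-edge part of the sampler: build a categorical distribution for each (document, edge, word) triplet from the two vertex attributes, sample a topic per edge, then aggregate the labels into a histogram per vertex. Below is a plain-Scala sketch of those two steps on in-memory maps. The names `sampleCategorical` and `sampleAndAggregate` and the map-based representation are assumptions for illustration. On GraphX the same logic would run over triplets rather than a local `Seq`.

```scala
import scala.util.Random

// Per-edge topic sampling plus neighbourhood aggregation (illustrative, no Spark).
object GibbsEdgeSamplingSketch {
  // Draw an index from an unnormalised categorical distribution.
  def sampleCategorical(weights: Array[Double], rng: Random): Int = {
    val u = rng.nextDouble() * weights.sum
    var cumulative = 0.0
    var k = 0
    while (k < weights.length - 1) {
      cumulative += weights(k)
      if (u < cumulative) return k
      k += 1
    }
    weights.length - 1
  }

  // docTopic(d)(k) plays the role of P(topic k | document d);
  // wordTopic(w)(k) plays the role of P(word w | topic k).
  // edges are (document, word) pairs.
  def sampleAndAggregate(docTopic: Map[String, Array[Double]],
                         wordTopic: Map[String, Array[Double]],
                         edges: Seq[(String, String)],
                         rng: Random): (Map[String, Array[Int]], Map[String, Array[Int]]) = {
    val numTopics = docTopic.head._2.length
    // Per triplet: multiply the two vertex attributes element-wise and sample a topic.
    val labelled = edges.map { case (d, w) =>
      val weights = Array.tabulate(numTopics)(k => docTopic(d)(k) * wordTopic(w)(k))
      (d, w, sampleCategorical(weights, rng))
    }
    // Neighbourhood aggregation: one topic histogram per vertex.
    def histogram(keyed: Seq[(String, Int)]): Map[String, Array[Int]] =
      keyed.groupBy(_._1).map { case (v, pairs) =>
        val counts = new Array[Int](numTopics)
        pairs.foreach { case (_, topic) => counts(topic) += 1 }
        v -> counts
      }
    // First map: histograms on document vertices; second map: histograms on word vertices.
    (histogram(labelled.map(t => (t._1, t._3))), histogram(labelled.map(t => (t._2, t._3))))
  }
}
```

Because every edge is sampled independently given the current vertex attributes, the `edges.map` step has no ordering constraint, which is the property slide 30 points to when it calls the uncollapsed sampler suitable for GraphX.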
- 42. GraphX performance comparison
- 43. Algorithm Implementations ● Topic Sensitive Pagerank ○ Broadcast graph adjacency matrix ○ Scala/Breeze code, triggered by Spark ● LDA ○ Single machine ○ Multi-threaded Java code ● All implementations are Netflix Internal Code
- 44. Performance Comparison
- 45. Performance Comparison Open Source DBPedia dataset
- 46. Performance Comparison Alternative Implementation: Scala code triggered by Spark on a cluster
- 47. Performance Comparison Sublinear rise in time with GraphX Vs Linear rise in the Alternative
- 48. Performance Comparison: doubling the size of the cluster gives a 2.0x speedup in the alternative implementation vs. 1.2x in GraphX
- 49. Performance Comparison: a large number of vertices propagated in parallel leads to large shuffle data, causing failures in GraphX on small clusters
- 50. Performance Comparison Netflix dataset Number of Topics = 100
- 51. Performance Comparison: multi-core implementation is single-machine, multi-threaded Java code
- 52. Performance Comparison: the GraphX setup uses 8x the resources of the multi-core setup
- 53. Performance Comparison Wikipedia dataset (16 x r3.2xl) (Databricks)
- 54. Performance Comparison GraphX for very large datasets outperforms the multi-core unCollapsed Impl
- 55. Lessons Learned
- 56. What we learned so far ... ● Where is the cross-over point for your iterative ML algorithm? ○ GraphX brings performance benefits if you’re on the right side of that point ○ GraphX lets you easily throw more hardware at a problem ● GraphX very useful (and fast) for other graph processing tasks ○ Data pre-processing ○ Efficient joins
- 57. What we learned so far ... ● Regularly save the state ○ With a 99.9% success rate, what’s the probability of successfully running 1,000 iterations? ● Multi-core machine learning (r3.8xl, 32 threads, 220 GB) is very efficient ○ if your data fits in the memory of a single machine!
- 58. What we learned so far ... ● Regularly save the state ○ With a 99.9% success rate, what’s the probability of successfully running 1,000 iterations? ○ ~37% (0.999^1000 ≈ 0.368) ● Multi-core machine learning (r3.8xl, 32 threads, 220 GB) is very efficient ○ if your data fits in the memory of a single machine!
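
The checkpointing argument on slides 57 and 58 is a product of per-iteration success probabilities, assuming iterations fail independently. A small sketch of that arithmetic follows. The object and method names are made up for illustration, and the 100-iteration checkpoint interval in the usage note is an assumed value, not one from the talk.

```scala
// Probability that a run of independent iterations completes without any failure.
object CheckpointOdds {
  def allSucceed(perIterationSuccess: Double, iterations: Int): Double =
    math.pow(perIterationSuccess, iterations.toDouble)
}
```

`CheckpointOdds.allSucceed(0.999, 1000)` evaluates to roughly 0.37, in line with the slide's estimate. If state were saved every 100 iterations, each segment would succeed with probability `CheckpointOdds.allSucceed(0.999, 100)`, roughly 0.90, and a failure would only cost the current segment.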
- 59. We’re hiring! (come talk to us) https://jobs.netflix.com/
- 60. Appendix
- 61. Using GraphX

      scala> val edgesFile = "/data/mlconf-graphx/edges.txt"
      scala> sc.textFile(edgesFile).take(5).foreach(println)
      0 1
      2 3
      2 4
      2 5
      2 6
      scala> val mapping = sc.textFile("/data/mlconf-graphx/uri-mapping.csv")
      scala> mapping.take(5).foreach(println)
      http://dbpedia.org/resource/Drew_Finerty,3663393
      http://dbpedia.org/resource/1998_JGTC_season,4148403
      http://dbpedia.org/resource/Eucalyptus_bosistoana,3473416
      http://dbpedia.org/resource/Wilmington,234049
      http://dbpedia.org/resource/Wetter_(Ruhr),884940
- 62. Creating a GraphX graph

      scala> val graph = GraphLoader.edgeListFile(sc, edgesFile, false, 100)
      graph: org.apache.spark.graphx.Graph[Int,Int] = org.apache.spark.graphx.impl.GraphImpl@547a8dc1
      scala> graph.edges.count
      res3: Long = 16090021
      scala> graph.vertices.count
      res4: Long = 4548083
- 63. Pagerank in GraphX

      scala> val ranks = graph.staticPageRank(10, 0.15).vertices
      scala> val resources = mapping.map { row =>
               val fields = row.split(",")
               // the vertex id is the last field; the resource URI is the first
               (fields.last.toLong, fields.head)
             }
      scala> val ranksByResource = resources.join(ranks).map {
               case (id, (resource, rank)) => (resource, rank)
             }
      scala> ranksByResource.top(3)(Ordering.by(_._2)).foreach(println)
      (http://dbpedia.org/resource/United_States,15686.671749384182)
      (http://dbpedia.org/resource/Animal,6530.621240073025)
      (http://dbpedia.org/resource/United_Kingdom,5780.806077968981)
- 64. Topic-sensitive pagerank in GraphX
    ● Initialization:
        ○ Construct a message VertexRDD holding initial activation probabilities at each vertex (sparse vector with one non-zero)
    ● Propagate message along outbound edges using flatMap
        ○ (Involves shuffling)
    ● Sum incoming messages at each vertex
        ○ aggregateUsingIndex, summing up sparse vectors
        ○ join the message to the old graph to create a new one
    ● count to materialize the new graph
    ● unpersist to clean up old graph and message
    ● Repeat for fixed number of iterations or until convergence
    ● Zeppelin notebook, using DBpedia data
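
The steps on slide 64 can be mimicked on local collections to show the data flow: a sparse activation vector per vertex, one message per edge, a per-destination sum of sparse vectors, and a join back into the vertex state. This is a hedged plain-Scala analogy and not the GraphX code from the talk. `Activation`, `addSparse` and `run` are hypothetical names, `groupBy` plus `reduce` stands in for `aggregateUsingIndex`, and mass at vertices without outbound edges is simply dropped.

```scala
// All-origins diffusion with sparse vectors on in-memory collections (illustrative sketch).
object SparseDiffusionSketch {
  // Sparse activation vector: origin vertex -> probability mass currently held.
  type Activation = Map[Long, Double]

  // Element-wise sum of two sparse vectors.
  def addSparse(a: Activation, b: Activation): Activation =
    b.foldLeft(a) { case (acc, (origin, m)) =>
      acc.updated(origin, acc.getOrElse(origin, 0.0) + m)
    }

  def run(edges: Seq[(Long, Long)], restart: Double, iterations: Int): Map[Long, Activation] = {
    val vertices = edges.flatMap { case (s, d) => Seq(s, d) }.distinct
    val outDegree = edges.groupBy(_._1).map { case (s, es) => s -> es.size }
    // Initialization: each vertex holds a sparse vector with one non-zero (itself).
    var state: Map[Long, Activation] = vertices.map(v => v -> Map(v -> 1.0)).toMap
    for (_ <- 1 to iterations) {
      // Propagate: every edge carries a scaled copy of its source's vector.
      val messages: Seq[(Long, Activation)] = edges.map { case (s, d) =>
        d -> state(s).map { case (o, m) => o -> (1 - restart) * m / outDegree(s) }
      }
      // Sum incoming sparse vectors at each destination vertex.
      val summed = messages.groupBy(_._1).map { case (d, ms) =>
        d -> ms.map(_._2).reduce(addSparse)
      }
      // Join: the new state adds the restart mass for the vertex's own propagation.
      state = vertices.map { v =>
        v -> addSparse(summed.getOrElse(v, Map.empty[Long, Double]), Map(v -> restart))
      }.toMap
    }
    state
  }
}
```

After `run`, `state(v)(o)` is the activation of vertex `v` in the propagation that started at origin `o`, so every starting vertex is processed in the same pass. The vectors grow denser with each iteration, which is consistent with the large shuffle sizes mentioned on slide 49.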
- 65. Distributed Gibbs Sampler in GraphX
    1) Initialize the document-word graph, G
    2) For each triplet in G:
        a) Construct a categorical distribution using the vertex attributes (P(topic | document), P(word | topic))
        b) Sample a topic label from the categorical distribution
    3) Aggregate topic labels on the vertex id
    4) Sample vertex attributes from a Dirichlet distribution
        a) This involves computing and distributing a marginal over the topic matrix, which materializes the graph in every iteration
    5) Join the vertices with updated attributes to the graph and repeat from step 2
    Note: Steps 2 and 3 can be accomplished jointly using the aggregateMessages method on the Graph
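
Step 4 on slide 65 resamples the vertex attributes from Dirichlet distributions given the aggregated topic histograms. One common way to do this is to normalise independent Gamma draws, sketched below in plain Scala. The hyperparameters `alpha` and `beta`, the Marsaglia-Tsang Gamma sampler, and all names are assumptions introduced here, since the slides do not say how the sampling was implemented. The per-topic totals in `sampleWordTopics` correspond to the marginal over the topic matrix that the slide says has to be computed and distributed.

```scala
import scala.util.Random

// Dirichlet updates for document and word vertices via normalised Gamma draws (sketch).
object DirichletUpdateSketch {
  // Marsaglia-Tsang sampler for Gamma(shape, 1); shapes below 1 use the usual boosting step.
  def sampleGamma(shape: Double, rng: Random): Double =
    if (shape < 1.0)
      sampleGamma(shape + 1.0, rng) * math.pow(rng.nextDouble(), 1.0 / shape)
    else {
      val d = shape - 1.0 / 3.0
      val c = 1.0 / math.sqrt(9.0 * d)
      var result = -1.0
      while (result < 0.0) {
        val x = rng.nextGaussian()
        val v = math.pow(1.0 + c * x, 3)
        if (v > 0.0 && math.log(rng.nextDouble()) < 0.5 * x * x + d - d * v + d * math.log(v))
          result = d * v
      }
      result
    }

  // Document vertex: P(topic | document) ~ Dirichlet(alpha + topic histogram).
  def sampleDocTopics(counts: Array[Int], alpha: Double, rng: Random): Array[Double] = {
    val g = counts.map(n => sampleGamma(n + alpha, rng))
    val total = g.sum
    g.map(_ / total)
  }

  // Word vertices: for each topic, P(word | topic) is a Dirichlet over the whole vocabulary,
  // so the normaliser is a per-topic sum across ALL word vertices.
  def sampleWordTopics(counts: Map[String, Array[Int]], beta: Double,
                       rng: Random): Map[String, Array[Double]] = {
    val gammas = counts.map { case (w, c) => w -> c.map(n => sampleGamma(n + beta, rng)) }
    val numTopics = counts.head._2.length
    val topicTotals = Array.tabulate(numTopics)(k => gammas.values.map(_(k)).sum)
    gammas.map { case (w, g) => w -> Array.tabulate(numTopics)(k => g(k) / topicTotals(k)) }
  }
}
```

Document vertices can be updated locally, while word vertices need the shared per-topic totals. That global dependency is a plausible reason the slide notes that this step materializes the graph in every iteration.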
- 66. References ● Topic Sensitive Pagerank [Haveliwala, 2002] ● Latent Dirichlet Allocation [Blei, 2003]