
- 1. 22.06.2015 DIMA – TU Berlin 1 Fachgebiet Datenbanksysteme und Informationsmanagement Technische Universität Berlin http://www.dima.tu-berlin.de/ Hot Topics in Information Management PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs Igor Shevchenko Mentor: Sebastian Schelter
- 2. 22.06.2015 DIMA – TU Berlin 2 Agenda 1. Natural Graphs: Properties and Problems; 2. PowerGraph: Vertex Cut and Vertex Programs; 3. GAS Decomposition; 4. Vertex Cut Partitioning; 5. Delta Caching; 6. Applications and Evaluation; Paper: Gonzalez et al. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs.
- 3. 22.06.2015 DIMA – TU Berlin 3 ■ Natural graphs are graphs derived from real-world or natural phenomena; ■ Graphs are big: billions of vertices and edges and rich metadata; Natural graphs have Power-Law Degree Distribution Natural Graphs
- 4. 22.06.2015 DIMA – TU Berlin 4 Power-Law Degree Distribution (Andrei Broder et al. Graph structure in the web)
- 5. 22.06.2015 DIMA – TU Berlin 5 ■ We want to analyze natural graphs; ■ Essential for Data Mining and Machine Learning; Goal Identify influential people and information; Identify special nodes and communities; Model complex data dependencies; Target ads and products; Find communities; Flow scheduling;
- 6. 22.06.2015 DIMA – TU Berlin 6 ■ Existing distributed graph computation systems perform poorly on natural graphs (Gonzalez et al. OSDI ’12); ■ The reason is presence of high degree vertices; Problem High Degree Vertices: Star-like motif
- 7. 22.06.2015 DIMA – TU Berlin 7 Possible problems with high degree vertices: ■ Limited single-machine resources; ■ Work imbalance; ■ Sequential computation; ■ Communication costs; ■ Graph partitioning; Applicable to: ■ Hadoop; GraphLab; Pregel (Piccolo); Problem Continued
- 8. 22.06.2015 DIMA – TU Berlin 8 ■ High degree vertices can exceed the memory capacity of a single machine; ■ Store edge meta-data and adjacency information; Problem: Limited Single-Machine Resources
- 9. 22.06.2015 DIMA – TU Berlin 9 ■ The power-law degree distribution can lead to significant work imbalance and frequent barriers; ■ For ex. with synchronous execution (Pregel): Problem: Work Imbalance
- 10. 22.06.2015 DIMA – TU Berlin 10 ■ No parallelization of individual vertex-programs; ■ Edges are processed sequentially; ■ Locking does not scale well to high degree vertices (for ex. in GraphLab); Problem: Sequential Computation Sequentially process edges Asynchronous execution requires heavy locking
- 11. 22.06.2015 DIMA – TU Berlin 11 ■ Generate and send large amount of identical messages (for ex. in Pregel); ■ This results in communication asymmetry; Problem: Communication Costs
- 12. 22.06.2015 DIMA – TU Berlin 12 ■ Natural graphs are difficult to partition; ■ Pregel and GraphLab use random (hashed) partitioning on natural graphs thus maximizing the network communication; Problem: Graph Partitioning
- 13. 22.06.2015 DIMA – TU Berlin 13 Problem: Graph Partitioning Continued ■ Natural graphs are difficult to partition; ■ Pregel and GraphLab use random (hashed) partitioning on natural graphs, thus maximizing the network communication; Expected fraction of edges cut: 1 − 1/M, where M = number of machines; Examples: ■ 10 machines: 90% of edges cut; ■ 100 machines: 99% of edges cut;
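The percentages on this slide follow directly from the formula; a minimal sketch (function name is illustrative) makes the arithmetic explicit:

```python
def expected_cut_fraction(num_machines: int) -> float:
    # Under random (hashed) vertex placement, an edge is cut whenever its
    # two endpoints hash to different machines: probability 1 - 1/M.
    return 1.0 - 1.0 / num_machines

print(expected_cut_fraction(10))   # 90% of edges cut
print(expected_cut_fraction(100))  # 99% of edges cut
```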
- 14. 22.06.2015 DIMA – TU Berlin 14 ■ GraphLab and Pregel are not well suited for computations on natural graphs; Reasons: ■ Challenges of high-degree vertices; ■ Low quality partitioning; Solution: ■ PowerGraph new abstraction; In Summary
- 15. 22.06.2015 DIMA – TU Berlin 15 PowerGraph
- 16. 22.06.2015 DIMA – TU Berlin 16 Two approaches for partitioning the graph in a distributed environment: ■ Edge Cut; ■ Vertex Cut; Partition Techniques
- 17. 22.06.2015 DIMA – TU Berlin 17 ■ Used by Pregel and GraphLab abstractions; ■ Evenly assign vertices to machines; Edge Cut
- 18. 22.06.2015 DIMA – TU Berlin 18 ■ Used by the PowerGraph abstraction; ■ Evenly assign edges to machines; Vertex Cut The strong point of the paper 4 edges 4 edges
- 19. 22.06.2015 DIMA – TU Berlin 19 Think like a Vertex [Malewicz et al. SIGMOD’10] User-defined Vertex-Program: 1. Runs on each vertex; 2. Interactions are constrained by graph structure; Pregel and GraphLab also use this concept, where parallelism is achieved by running multiple vertex programs simultaneously; Vertex Programs
- 20. 22.06.2015 DIMA – TU Berlin 20 ■ Vertex cut distributes a single vertex-program across several machines; ■ Allows to parallelize high-degree vertices; GAS Decomposition The strong point of the paper
- 21. 22.06.2015 DIMA – TU Berlin 21 Generalize the vertex-program into three phases: 1. Gather Accumulate information about neighborhood; 2. Apply Apply accumulated value to center vertex; 3. Scatter Update adjacent edges and vertices; GAS Decomposition Gather, Apply and Scatter are user-defined functions; The strong point of the paper
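As a concrete illustration of the three phases, PageRank (the paper's running example) can be phrased as user-defined Gather, Sum, and Apply functions. This is a minimal single-machine sketch, not PowerGraph's actual API; the graph representation and driver loop are assumptions:

```python
DAMPING = 0.85

def gather(nbr_rank, nbr_out_degree):
    # Gather: the contribution of one in-neighbor.
    return nbr_rank / nbr_out_degree

def g_sum(a, b):
    # User-defined commutative, associative sum.
    return a + b

def apply(acc):
    # Apply: fold the accumulated value into the center vertex.
    return (1 - DAMPING) + DAMPING * acc

def pagerank(in_nbrs, out_degree, iters=20):
    # Toy synchronous driver running Gather and Apply on every vertex.
    # (Scatter, which would activate changed neighbors, is omitted here.)
    ranks = {v: 1.0 for v in in_nbrs}
    for _ in range(iters):
        new_ranks = {}
        for v, nbrs in in_nbrs.items():
            acc = 0.0
            for u in nbrs:                     # Gather over the in-edges of v
                acc = g_sum(acc, gather(ranks[u], out_degree[u]))
            new_ranks[v] = apply(acc)
        ranks = new_ranks
    return ranks
```

On a 3-cycle, where every vertex has one in-neighbor of out-degree 1, each rank sits at the fixed point r = 0.15 + 0.85·r = 1.0.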
- 22. 22.06.2015 DIMA – TU Berlin 22 ■ Executed on the edges in parallel; ■ Accumulate information about neighborhood; Gather Phase
- 23. 22.06.2015 DIMA – TU Berlin 23 ■ Executed on the central vertex; ■ Apply accumulated value to center vertex; Apply Phase
- 24. 22.06.2015 DIMA – TU Berlin 24 ■ Executed on the neighboring vertices in parallel; ■ Update adjacent edges and vertices; Scatter Phase
- 25. 22.06.2015 DIMA – TU Berlin 25 ■ Vertex-programs that are written using GAS decomposition will automatically scale to several machines; How does it work? GAS Decomposition
- 26. 22.06.2015 DIMA – TU Berlin 26 GAS in a Distributed Environment
- 27. 22.06.2015 DIMA – TU Berlin 27 ■ Case with 2 machines; GAS in a Distributed Environment
- 28. 22.06.2015 DIMA – TU Berlin 28 ■ Compute partial sums on each machine; Gather Phase
- 29. 22.06.2015 DIMA – TU Berlin 29 ■ Send partial sum to the master machine; ■ Master machine computes the total sum; Gather Phase
- 30. 22.06.2015 DIMA – TU Berlin 30 ■ Apply accumulated value to center vertex; ■ Replicate value to the mirrors; Apply Phase
- 31. 22.06.2015 DIMA – TU Berlin 31 ■ Update adjacent edges and vertices; ■ Initiate neighboring vertex-programs if necessary; Scatter Phase
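The distributed flow sketched on slides 26–31 (partial sums per machine, master combines, mirrors receive the replicated value) can be simulated in-process. The machine layout and function names below are invented for illustration; the real system distributes this over the network:

```python
def distributed_gas_step(local_edge_values, g_sum, apply_fn):
    # Gather: each machine reduces its local edge values to one partial sum.
    partials = []
    for values in local_edge_values:
        acc = values[0]
        for x in values[1:]:
            acc = g_sum(acc, x)
        partials.append(acc)
    # The partial sums are sent to the master, which computes the total.
    total = partials[0]
    for p in partials[1:]:
        total = g_sum(total, p)
    # Apply on the master, then replicate the new value to every mirror.
    new_value = apply_fn(total)
    mirrors = [new_value] * len(local_edge_values)
    return new_value, mirrors
```

With two machines holding edge values [1, 2] and [3, 4] and plain addition, the master receives partial sums 3 and 7 and replicates the total 10 to both mirrors.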
- 32. 22.06.2015 DIMA – TU Berlin 32 ■ During the Gather Phase the partial results are combined using commutative and associative user-defined SUM operation; ■ Examples: sum(a, b): return a + b sum(a, b): return union(a, b) sum(a, b): return min(a, b) ■ Also a requirement for Pregel combiners; ■ What if not commutative and associative? SUM Operation
- 33. 22.06.2015 DIMA – TU Berlin 33 Gather Phase: no partial sums ■ If the sum is not commutative and associative; ■ Send each edge's data to the master machine; ■ This increases the communication amount in the Gather Phase:
- 34. 22.06.2015 DIMA – TU Berlin 34 Vertex Cut Partitioning The strong point of the paper
- 35. 22.06.2015 DIMA – TU Berlin 35 Vertex Cut Partitioning Three distributed approaches for Vertex Cut: ■ Random Edge Placement; ■ Coordinated Greedy Edge Placement; ■ Oblivious Greedy Edge Placement; Minimize machines spanned by each vertex = Minimize communication and storage overhead
- 36. 22.06.2015 DIMA – TU Berlin 36 ■ Randomly assign edges to machines; Random Edge Placement
- 37. 22.06.2015 DIMA – TU Berlin 37 Random Edge Placement ■ Randomly assign edges to machines;
- 38. 22.06.2015 DIMA – TU Berlin 38 Random Edge Placement ■ Randomly assign edges to machines; ■ Edge data is uniquely assigned to one machine
- 39. 22.06.2015 DIMA – TU Berlin 39 ■ Only 3 network communication channels; ■ Can predict network communication usage; ■ Significantly less communication compared to the Edge Cut graph placement; ■ Can improve upon random placement! Communication Overhead
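The communication and storage cost of a vertex cut is usually measured by the replication factor: the average number of machines each vertex spans. A small simulation of random edge placement shows why high-degree vertices stay bounded (function and variable names are illustrative):

```python
import random

def replication_factor(edges, num_machines, seed=0):
    # Assign each edge to a uniformly random machine, then count how many
    # machines each vertex ends up spanning (its master plus mirrors).
    rng = random.Random(seed)
    spans = {}
    for u, v in edges:
        m = rng.randrange(num_machines)
        spans.setdefault(u, set()).add(m)
        spans.setdefault(v, set()).add(m)
    return sum(len(s) for s in spans.values()) / len(spans)

# A star graph: one high-degree center vertex with 100 leaves.
star = [(0, i) for i in range(1, 101)]
print(replication_factor(star, 10))
```

Even for the degree-100 center the span is capped at 10 machines, while each leaf spans exactly one; this cap on per-vertex cost is what an edge cut cannot provide for high-degree vertices.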
- 40. 22.06.2015 DIMA – TU Berlin 40 ■ Place edges on machines that already hold the vertices of that edge; Greedy Edge Placement
- 41. 22.06.2015 DIMA – TU Berlin 41 ■ If several choices are possible, assign to the least loaded machine; Greedy Edge Placement
- 42. 22.06.2015 DIMA – TU Berlin 42 ■ Greedy Edge Placement is a de-randomization of the random placement; ■ Minimizes the number of machines spanned; Coordinated Greedy Edge Placement: ■ Requires coordination to place each edge; ■ Maintains a global distributed placement table; ■ Slower, but produces higher quality cuts; Oblivious Greedy Edge Placement: ■ Approximates the greedy objective without coordination; ■ Faster, but produces lower quality cuts; Greedy Edge Placement
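The two placement rules on slides 40–41 can be condensed into a simplified sketch. The paper's full heuristic also considers which endpoint has more unassigned edges; everything below is an illustrative approximation, not the system's implementation:

```python
def greedy_place(edges, num_machines):
    loads = [0] * num_machines
    span = {}         # vertex -> machines already holding a replica of it
    assignment = []
    for u, v in edges:
        a, b = span.get(u, set()), span.get(v, set())
        # Prefer machines that already hold both endpoints, then either
        # endpoint, then any machine; break ties by the least loaded one.
        candidates = (a & b) or (a | b) or set(range(num_machines))
        m = min(candidates, key=lambda i: (loads[i], i))
        loads[m] += 1
        span.setdefault(u, set()).add(m)
        span.setdefault(v, set()).add(m)
        assignment.append(m)
    return assignment
```

Edges sharing a vertex tend to land on the same machine, which is exactly what minimizes the number of machines a vertex spans. Maintaining `span` globally corresponds to the coordinated variant; the oblivious variant only approximates it from each machine's local view.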
- 43. 22.06.2015 DIMA – TU Berlin 43 ■ Twitter Follower Graph: 41M vertices, 1.4B edges; ■ Oblivious Greedy Edge Placement balances cost (replication factor) and construction time; Vertex Cut Partitioning: Comparison
- 44. 22.06.2015 DIMA – TU Berlin 44 ■ Greedy Edge Placement improves computation performance; Vertex Cut Partitioning: Comparison
- 45. 22.06.2015 DIMA – TU Berlin 45 Delta Caching Execution Modes
- 46. 22.06.2015 DIMA – TU Berlin 46 ■ A vertex-program can be triggered in response to a change in only a few of its neighbors; ■ Nevertheless, the Gather Phase will accumulate information over the entire neighborhood; Delta Caching The strong point of the paper
- 47. 22.06.2015 DIMA – TU Berlin 47 ■ Accelerate the process by caching neighborhood accumulators from previous gather phase; Delta Caching The strong point of the paper
- 48. 22.06.2015 DIMA – TU Berlin 48 Delta Caching can speed up: ■ Gather Phase; ■ Scatter Phase; Requires Abelian Group; ■ sum (+) ■ inverse (−) Examples: ■ Page Rank – applicable; ■ Graph Coloring – not applicable; Delta Caching Commutative and associative The strong point of the paper
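A minimal sketch of the cached-accumulator idea; the class and names are invented for illustration. The abelian-group requirement from the slide shows up as the `sum_fn`/`inverse_fn` pair:

```python
class DeltaCache:
    # Keeps a vertex's gathered accumulator between invocations so that a
    # change in one neighbor can be patched in without a full re-gather.
    def __init__(self, sum_fn, inverse_fn, zero):
        self.sum, self.inv = sum_fn, inverse_fn
        self.acc = zero

    def full_gather(self, contributions):
        # First run (or fallback): accumulate over the whole neighborhood.
        for c in contributions:
            self.acc = self.sum(self.acc, c)
        return self.acc

    def apply_delta(self, old_contrib, new_contrib):
        # One neighbor changed: remove its old contribution via the group
        # inverse, then add the new one. This is why PageRank qualifies
        # (+ and -) but graph coloring does not (min has no inverse).
        self.acc = self.sum(self.acc, self.inv(old_contrib))
        self.acc = self.sum(self.acc, new_contrib)
        return self.acc
```

For example, with ordinary addition, a full gather over contributions [1, 2, 3] yields 6, and a neighbor's change from 2 to 5 updates the cache to 9 without touching the other neighbors.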
- 49. 22.06.2015 DIMA – TU Berlin 49 Supports three execution modes: ■ Synchronous: Bulk-Synchronous GAS Phases; ■ Asynchronous: Interleave GAS Phases; ■ Asynchronous Serializable: Prevent neighboring vertices from running simultaneously; Different tradeoffs: ■ Algorithm performance; ■ System performance; ■ Determinism; Execution Modes
- 50. 22.06.2015 DIMA – TU Berlin 50 Evaluation
- 51. 22.06.2015 DIMA – TU Berlin 51 PowerGraph on natural graphs shows, on many examples: ■ Reduced network communication; ■ Reduced runtime; ■ Reduced storage; Evaluation PageRank on the Twitter Follower Graph (40M Users, 1.4 Billion Links)
- 52. 22.06.2015 DIMA – TU Berlin 52 ■ Collaborative Filtering Alternating Least Squares Stochastic Gradient Descent SVD Non-negative MF ■ Statistical Inference Loopy Belief Propagation Max-Product Linear Programs Gibbs Sampling Applicability ■ Graph Analytics PageRank Triangle Counting Shortest Path Graph Coloring K-core Decomposition ■ Computer Vision Image stitching ■ Language Modeling LDA
- 53. 22.06.2015 DIMA – TU Berlin 53 ■ Vertex Cut; ■ GAS Decomposition; ■ Delta Caching; ■ Three modes of execution; Synchronous; Asynchronous; Asynchronous + Serializable; Strong Points of the Paper
- 54. 22.06.2015 DIMA – TU Berlin 54 ■ “In all cases the system is entirely symmetric with no single coordinating instance or scheduler”; How do they deal with Synchronous execution? Evaluation mess: ■ Evaluated Synchronous execution using PageRank; ■ Evaluated Asynchronous execution using Graph Coloring; ■ Evaluated Asynchronous+Serializable execution using Graph Coloring; ■ Compared PowerGraph with published results again using PageRank and Triangle Counting, but not Graph Coloring; ■ Oblivious Greedy Edge Placement is poorly explained; Weak Points of the Paper
- 55. 22.06.2015 DIMA – TU Berlin 55 References ■ Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, Carlos Guestrin. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs. 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2012); ■ Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J., Horn, I., Leiser, N., and Czajkowski, G. Pregel: A System for Large-Scale Graph Processing. In SIGMOD (2010); ■ Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., Guestrin, C., and Hellerstein, J. M. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. In PVLDB (2012); ■ http://graphlab.org
- 56. 22.06.2015 DIMA – TU Berlin 56 Questions? 1. Natural Graphs: Properties and Problems; 2. PowerGraph: Vertex Cut and Vertex Programs; 3. GAS Decomposition; 4. Vertex Cut Partitioning; 5. Delta Caching; 6. Applications and Evaluation; Paper: Gonzalez et al. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs.
