
HDRF: Stream-Based Partitioning for Power-Law Graphs


F. Petroni, L. Querzoni, K. Daudjee, S. Kamali and G. Iacoboni:
"HDRF: Stream-Based Partitioning for Power-Law Graphs."
In: Proceedings of the 24th ACM International Conference on Information and Knowledge Management (CIKM), 2015.

Abstract: "Balanced graph partitioning is a fundamental problem that is receiving growing attention with the emergence of distributed graph-computing (DGC) frameworks. In these frameworks, the partitioning strategy plays an important role since it drives the communication cost and the workload balance among computing nodes, thereby affecting system performance.
However, existing solutions only partially exploit a key characteristic of natural graphs commonly found in the
real-world: their highly skewed power-law degree distributions.
In this paper, we propose High-Degree (are) Replicated First (HDRF), a novel streaming vertex-cut graph partitioning algorithm that effectively exploits skewed degree distributions by explicitly taking into account vertex degree in the placement decision. We analytically and experimentally evaluate HDRF on both synthetic and real-world graphs and show that it outperforms all existing algorithms in partitioning quality."


HDRF: Stream-Based Partitioning for Power-Law Graphs

  1. HDRF: Stream-Based Partitioning for Power-Law Graphs. Fabio Petroni, L. Querzoni, K. Daudjee, S. Kamali, G. Iacoboni
  2. Graphs • A large amount of real data can be represented as a graph G(V, E), with vertices V and edges E. • The problem of optimally partitioning a graph while maintaining load balance is important in several contexts.
  3. Distributed Graph Computing Frameworks • They support efficient computation on really large graphs: graph analytics, data mining and machine learning tasks. • Input data partitioning can have a significant impact on the performance of the graph computation: it affects network usage, memory occupation and synchronization.
  4.-5. Balanced Graph Partitioning • Partition G into smaller components of (ideally) equal size: edge-cut yields vertex-disjoint partitions, vertex-cut yields edge-disjoint partitions. • A vertex can be cut in multiple ways and span several partitions, while a cut edge connects only two partitions. (A small worked example of a vertex-cut follows below.)
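To make the vertex-cut notion concrete, here is a tiny illustrative Python example (my own toy graph and edge-to-partition assignment, not taken from the slides): the edges of a 4-vertex graph are assigned to two partitions, and we count how many partitions each vertex spans.

```python
# Toy vertex-cut: every edge lives in exactly one partition; vertices whose
# edges end up in different partitions are "cut" and get replicated.
edges_to_partition = {
    (1, 2): 0, (1, 3): 0, (2, 3): 0,   # partition 0
    (1, 4): 1, (3, 4): 1,              # partition 1
}

replicas = {}                           # A(v): partitions where vertex v appears
for (u, v), p in edges_to_partition.items():
    replicas.setdefault(u, set()).add(p)
    replicas.setdefault(v, set()).add(p)

print(replicas)                         # vertices 1 and 3 span both partitions
rf = sum(len(s) for s in replicas.values()) / len(replicas)
print(f"replication factor = {rf:.2f}") # (2 + 1 + 2 + 1) / 4 = 1.50
```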
  6. Vertex-Cut • Computation steps are associated with edges. • Vertex-cuts perform better on power-law graphs and create less storage and network overhead (Gonzalez et al., 2012). • Modern distributed graph computing frameworks use vertex-cut.
  7. Power-Law Graphs • A characteristic of real graphs: power-law degree distribution, i.e. most vertices have few connections while a few have many. [Log-log plot: number of vertices vs. degree for the Twitter 2010 crawl.] • The probability that a vertex has degree d is P(d) ∝ d^(-α). • α controls the "skewness" of the degree distribution.
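To get a feel for how α shapes the distribution, here is a minimal illustrative Python sketch (not from the slides or the paper; the function name and constants are my own) that draws vertex degrees from P(d) ∝ d^(-α) via inverse-transform sampling and prints summary statistics:

```python
import numpy as np

def sample_powerlaw_degrees(n_vertices: int, alpha: float, d_min: int = 1, seed: int = 0):
    """Draw degrees from a continuous power law P(d) ~ d^(-alpha), d >= d_min, rounded to ints."""
    rng = np.random.default_rng(seed)
    u = rng.random(n_vertices)
    # Inverse CDF of the Pareto-type power law (requires alpha > 1).
    return np.floor(d_min * (1.0 - u) ** (-1.0 / (alpha - 1.0))).astype(np.int64)

for alpha in (1.8, 2.2, 3.0, 4.0):
    d = sample_powerlaw_degrees(1_000_000, alpha)
    print(f"alpha={alpha}: mean degree={d.mean():.1f}, max degree={d.max()}")
```

Smaller α produces a much heavier tail (a few very high-degree hubs), which is exactly the regime HDRF targets.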
  8. Balanced Vertex-Cut Graph Partitioning • v ∈ V vertex; e ∈ E edge; p ∈ P partition. • A(v): set of partitions where vertex v is replicated. • σ ≥ 1: tolerance to load imbalance. • The size |p| of partition p is its edge cardinality. • Minimizing replicas reduces (1) bandwidth, (2) memory usage and (3) synchronization; balancing the load makes efficient use of the available computing resources. • The problem: minimize (1/|V|) Σ_{v∈V} |A(v)| subject to max_{p∈P} |p| < σ·|E|/|P| (written out below). • The objective function is the replication factor (RF), i.e. the average number of replicas per vertex.
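The optimization problem, reconstructed from the slide's own definitions (the imbalance-tolerance symbol was lost in extraction and is written here as σ):

```latex
\min \; \frac{1}{|V|} \sum_{v \in V} |A(v)|
\qquad \text{s.t.} \qquad
\max_{p \in P} |p| \; < \; \sigma \, \frac{|E|}{|P|}, \qquad \sigma \ge 1
```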
  9. Streaming Setting • Input data is a list of edges, consumed in streaming fashion, requiring only a single pass (stream of edges → partitioning algorithm → partitioned graph). + handles graphs that do not fit in main memory. + imposes minimum time overhead. + scalable, easy parallel implementations. − assignment decisions, once taken, cannot be changed later.
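As an illustrative picture of this setting (my own class and method names, not the paper's or VGP's code), a one-pass streaming vertex-cut partitioner can be sketched like this, with hashing as a placeholder placement rule:

```python
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[int, int]  # (u, v) vertex ids

class StreamingVertexCutPartitioner:
    """One-pass streaming vertex-cut partitioner skeleton (illustrative sketch only)."""

    def __init__(self, num_partitions: int):
        self.num_partitions = num_partitions
        self.load: List[int] = [0] * num_partitions        # |p|: edges per partition
        self.replicas: Dict[int, Set[int]] = {}            # A(v): vertex -> partitions

    def choose_partition(self, u: int, v: int) -> int:
        """Placement rule; hash-based here, replaced by Greedy/HDRF-style rules."""
        return hash((min(u, v), max(u, v))) % self.num_partitions

    def partition(self, edge_stream: Iterable[Edge]) -> None:
        for u, v in edge_stream:                           # single pass, decisions are final
            p = self.choose_partition(u, v)
            self.load[p] += 1
            self.replicas.setdefault(u, set()).add(p)
            self.replicas.setdefault(v, set()).add(p)

    def replication_factor(self) -> float:
        """Average number of replicas per vertex."""
        return sum(len(s) for s in self.replicas.values()) / max(1, len(self.replicas))
```

A Greedy- or HDRF-style algorithm only changes choose_partition; the single-pass loop and the bookkeeping of A(v) and |p| stay the same.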
  10.-11. Algorithms Taxonomy • Stream-based vertex-cut graph partitioning splits into history-agnostic, hash-based methods, i.e. Hashing (Gonzalez et al., 2012), DBH (Xie et al., 2014) and the constrained-partitioning variants Grid and PDS (Jain et al., 2013), and history-aware methods, i.e. Greedy (Gonzalez et al., 2012) and HDRF (Petroni et al., 2015).
  12. HDRF: High Degree are Replicated First • Favor the replication of high-degree vertices. • The number of high-degree vertices in power-law graphs is very low. • The result is an overall reduction of the replication factor.
  13. HDRF: High Degree are Replicated First • The intuition comes from robustness to network failures: if a few high-degree vertices are removed from a power-law graph, it is turned into a set of isolated clusters. • HDRF therefore focuses on the locality of low-degree vertices.
  14.-19. The HDRF Algorithm • For each incoming edge e, depending on whether its two vertices already have replicas in some partition: • Case 1: neither vertex has been assigned to a partition yet → place e in the least loaded partition. • Case 2: only one vertex has already been assigned → place e in a partition that already holds that vertex. • Case 3: both vertices are assigned and share at least one partition → place e in the intersection.
  20.-22. Create Replicas • Case 4: both vertices are assigned but the intersection of their partition sets is empty, so a new replica is unavoidable. • Standard Greedy solution: place e in the least loaded partition in the union of the two partition sets. • HDRF: replicate the vertex with the highest degree (if δ(v1) > δ(v2), v1 gets the new replica), so the low-degree vertex stays local. (A sketch of the four placement cases follows below.)
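A minimal Python sketch of these four placement cases (my own function and variable names; tie-breaking details are simplified relative to the paper):

```python
def place_edge(u, v, replicas, load, degree, num_partitions):
    """Pick a partition for edge (u, v).
    replicas: dict vertex -> set of partitions A(v); load: edges per partition;
    degree: (partial) degree counters maintained by the caller. Sketch only."""
    A_u, A_v = replicas.get(u, set()), replicas.get(v, set())

    if not A_u and not A_v:
        # Case 1: neither vertex assigned yet -> least loaded partition.
        return min(range(num_partitions), key=lambda p: load[p])
    if A_u and not A_v:
        # Case 2: only u assigned -> a partition that already holds u.
        return min(A_u, key=lambda p: load[p])
    if A_v and not A_u:
        return min(A_v, key=lambda p: load[p])

    common = A_u & A_v
    if common:
        # Case 3: both assigned with a common partition -> place e in the intersection.
        return min(common, key=lambda p: load[p])

    # Case 4: empty intersection, a new replica is unavoidable.
    # Standard Greedy: least loaded partition in the union A(u) | A(v).
    # HDRF: replicate the highest-degree vertex, i.e. send the edge where the
    # lower-degree vertex already lives, so low-degree vertices stay local.
    low = u if degree.get(u, 0) <= degree.get(v, 0) else v
    return min(replicas[low], key=lambda p: load[p])
```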
  23.-24. Equivalent Formulation: Maximize The Score • Compute a score C(e, p) for all partitions p ∈ P and assign e to the partition that maximizes it: C(e, p) = C_REP(e, p) + λ · C_BAL(p). • The balance term breaks ties in the replication term, but this alone may not be enough to ensure load balance. • λ controls the importance of the balance term, giving balanced partitions even when classical greedy fails. (An illustrative spelling-out of the score follows below.)
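An illustrative spelling-out of the score in Python. The slide only gives the decomposition C(e, p) = C_REP(e, p) + λ·C_BAL(p); the degree weighting and normalization below follow the high-degree-replicated-first idea from the previous slides, so treat this as a sketch rather than the paper's exact definition:

```python
def hdrf_like_score(u, v, p, replicas, load, degree, lam=1.0, eps=1e-3):
    """Score C(e, p) = C_REP(e, p) + lam * C_BAL(p) for placing edge (u, v) in partition p."""
    d_u, d_v = degree.get(u, 1), degree.get(v, 1)      # partial degree counters
    theta_u = d_u / (d_u + d_v)                        # normalized degree of u
    theta_v = 1.0 - theta_u

    def g(w, theta_w):
        # Reward partitions that already hold w; the reward is larger when w has
        # LOW degree, so that high-degree vertices are the ones that get replicated.
        return 1.0 + (1.0 - theta_w) if p in replicas.get(w, set()) else 0.0

    c_rep = g(u, theta_u) + g(v, theta_v)
    c_bal = (max(load) - load[p]) / (eps + max(load) - min(load))   # favors light partitions
    return c_rep + lam * c_bal

# Assignment: choose the partition with the highest score, e.g.
# best_p = max(range(len(load)), key=lambda p: hdrf_like_score(u, v, p, replicas, load, degree))
```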
  25.-30. Ordered Stream Of Edges • Without the balance term (λ = 0), an ordered stream of edges can end up with all the graph in a single partition. • With the balance term, the edges are spread across partitions and the load stays balanced. (A small self-contained demo follows below.)
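A small self-contained Python demo of this effect (a simplified scoring rule of my own, not the full HDRF score): the edges of a path graph arrive in order, so every new edge shares a vertex with an already placed one; without a balance term everything collapses into one partition, while a sufficiently weighted balance term spreads the load.

```python
def stream_partition(edges, num_partitions, lam):
    """Greedy-style single pass: replication reward + lam * balance reward (simplified)."""
    load = [0] * num_partitions
    replicas = {}
    for u, v in edges:
        def score(p):
            rep = (p in replicas.get(u, set())) + (p in replicas.get(v, set()))
            bal = (max(load) - load[p]) / (1 + max(load) - min(load))
            return rep + lam * bal
        p = max(range(num_partitions), key=score)
        load[p] += 1
        replicas.setdefault(u, set()).add(p)
        replicas.setdefault(v, set()).add(p)
    return load

ordered_edges = [(i, i + 1) for i in range(1000)]   # path graph, streamed in order
print(stream_partition(ordered_edges, 4, lam=0))    # [1000, 0, 0, 0]: all in one partition
print(stream_partition(ordered_edges, 4, lam=2))    # load spread across the 4 partitions
```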
  31. Experiments - Settings • Standalone partitioner: VGP, a software package for one-pass vertex-cut balanced graph partitioning, used to measure partitioning performance (replication and balance). • GraphLab: HDRF has been integrated in GraphLab PowerGraph 2.2 to measure the impact on the execution time of graph computations in a distributed graph-computing framework. • The stream of edges is consumed in random order.
  32. Experiments - Datasets • Real-world graphs:
        Dataset          |V|       |E|
        MovieLens 10M    80.6K     10M
        Netflix          497.9K    100.4M
        Tencent Weibo    1.4M      140M
        twitter-2010     41.7M     1.47B
      • Synthetic power-law graphs: 1M vertices, from roughly 60M edges (α = 1.8) down to 3M edges (α = 4.0), with α ∈ {1.8, 2.2, 2.6, 3.0, 3.4, 4.0}.
  33.-37. Results - Synthetic Graphs, Replication Factor • 128 partitions (PDS uses |P| = 133, Grid |P| = 121). [Plot: replication factor vs. α for HDRF, PDS, Grid, Greedy and DBH, with α ranging from 1.8 (skewed) to 4.0 (homogeneous).] • Small α: more edges, more dense, more difficult to partition; large α: fewer edges, less dense, easier to partition. • HDRF exploits the power-law degree distribution, with the largest advantage on the skewed (small-α) graphs.
  38.-39. Results - Real-World Graphs, Replication Factor • 133 partitions. Replication factors per dataset:
                    Tencent Weibo   Netflix   MovieLens 10M   twitter-2010
        PDS              7.9          11.3         11.5            8.0
        greedy           2.8           8.1         10.7            5.9
        DBH              1.5           7.1         12.9            6.8
        HDRF             1.3           4.9          9.1            4.8
      • HDRF achieves a replication factor about 40% smaller than DBH, more than 50% smaller than Greedy, almost 3× smaller than PDS, more than 4× smaller than Grid and almost 14× smaller than Hashing.
  40. Results - Load Relative Standard Deviation • MovieLens 10M. [Plot: load relative standard deviation (%) vs. number of partitions (16, 32, 64, 128, 256) for PDS, Grid, Greedy, DBH, HDRF and Hashing.]
  41. Results - Graph Algorithm Runtime Speedup • SGD algorithm for collaborative filtering on Tencent Weibo. [Plot: speedup (×) with respect to Greedy and PDS for 32, 64 and 128 partitions; PDS runs with |P| = 31, 57, 133.] • The speedup is proportional to both the advantage in replication factor and the actual network usage of the algorithm.
  42. Conclusion • HDRF is a one-pass vertex-cut graph partitioning algorithm, based on a greedy vertex-cut approach that leverages information on vertex degrees. • We provide a theoretical analysis of HDRF with an average-case upper bound on the vertex replication factor. • The experimental study shows that HDRF provides the smallest replication factor with close to optimal load balance, and that it significantly reduces the time needed to perform computation on graphs. • The stand-alone software package for one-pass vertex-cut balanced partitioning is available at https://github.com/fabiopetroni/VGP.
  43. Thank you! Questions? Fabio Petroni, Sapienza University of Rome, Italy. Current position: PhD student in Engineering in Computer Science. Research interests: data mining, machine learning, big data. Contact: petroni@dis.uniroma1.it
