MTAAP12: Scalable Community Detection

Scalable Multi-threaded Community Detection in Social Networks
E. Jason Riedy¹, David A. Bader¹, and Henning Meyerhenke²
¹ School of Computational Science and Engineering, Georgia Institute of Technology
² Institute of Theoretical Informatics, Karlsruhe Institute of Technology (KIT)
25 May 2012
Exascale data analysis
  Health care: Finding outbreaks, population epidemiology
  Social networks: Advertising, searching, grouping
  Intelligence: Decisions at scale, regulating algorithms
  Systems biology: Understanding interactions, drug design
  Power grid: Disruptions, conservation
  Simulation: Discrete events, cracking meshes
• Graph clustering is common in all application areas.
These are not easy graphs.
Yifan Hu's (AT&T) visualization of the in-2004 data set:
http://www2.research.att.com/~yifanhu/gallery.html
But no shortage of structure...
[Images: protein interactions, from Giot et al., "A Protein Interaction Map of Drosophila melanogaster," Science 302, 1722–1736, 2003; Jason's network via LinkedIn Labs]
• Locally, there are clusters or communities.
• First pass over a massive social graph:
  • Find smaller communities of interest.
  • Analyze / visualize top-ranked communities.
• Our part: community detection at massive scale. (Or kinda large, given available data.)
Outline
• Motivation
• Defining community detection and metrics
• Shooting for massive graphs
• Our parallel method
• Implementation and platform details
• Performance
• Conclusions and plans
Community detection
What do we mean?
• Partition a graph's vertices into disjoint communities.
• A community locally maximizes some metric.
  • Modularity, conductance, ...
• Trying to capture that vertices are more similar within one community than between communities.
[Image: Jason's network via LinkedIn Labs]
Community detection
Assumptions
• Disjoint partitioning of vertices.
• There is no one unique answer.
• Many metrics are NP-complete to optimize (Brandes et al. [1]).
• The graph is a lossy representation of the data.
• We want an adaptable detection method.
[Image: Jason's network via LinkedIn Labs]
Common community metric: Modularity
• Modularity: deviation of the connectivity in the community induced by a vertex set S from some expected background model of connectivity.
• We take Newman [2]'s basic uniform model.
• Let m count all edges in graph G, m_S the edges with both endpoints in S, and x_S the edges with at least one endpoint in S. The modularity Q_S is
    Q_S = (m_S − x_S²/4m) / m
• Total modularity: the sum of the modularities of the disjoint subsets.
• A sufficiently positive modularity implies some structure.
• Known issues: resolution limit, NP-complete optimization problem.
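For concreteness, here is a minimal C sketch of that score. It assumes the caller has already tallied m, m_S, and x_S; the helper name and the toy numbers are illustrative, not code from the paper.

```c
#include <stdio.h>

/* Q_S = (m_S - x_S^2/4m) / m: observed internal edges minus the
 * expected count under the uniform background model, normalized by
 * the total edge count m. */
double community_modularity(double m_S, double x_S, double m)
{
    return (m_S - (x_S * x_S) / (4.0 * m)) / m;
}

int main(void)
{
    /* Toy numbers: a 100-edge graph with a community containing 20
     * internal edges and touching 30 edges in total. */
    double q = community_modularity(20.0, 30.0, 100.0);
    printf("Q_S = %g\n", q);   /* (20 - 900/400)/100 = 0.1775 */
    return 0;
}
```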
Can we tackle massive graphs now?
Parallel, of course...
• Massive needs distributed memory, right?
• Well... not really. You can buy a 2 TiB Intel-based Dell server on-line for around $200k USD, a 1.5 TiB server from IBM, etc.
  [Image: dell.com. Not an endorsement, just evidence!]
• Publicly available "real-world" data fits...
• Start with shared memory to see what needs to be done.
• Specialized architectures provide larger shared-memory views over distributed implementations (e.g. Cray XMT).
Multi-threaded algorithm design points
A scalable multi-threaded graph analysis algorithm
• ... avoids global locks and frequent global synchronization.
• ... distributes computation over edges rather than only vertices.
• ... works with data as local to an edge as possible.
• ... uses compact data structures that agglomerate memory references.
Sequential agglomerative method
[Figure: example graph with vertices A through G]
• A common method (e.g. Clauset et al. [3]) agglomerates vertices into communities.
• Each vertex begins in its own community.
• An edge is chosen to contract.
  • Merging maximally increases modularity.
  • Maintained with a priority queue.
• Known often to fall into an O(n²) performance trap with modularity (Wakita & Tsurumi [4]).
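To make the greedy loop concrete, here is a toy C sketch of this agglomerative flavor, under loud assumptions: it rescans the whole edge list each round instead of maintaining the priority queue, approximates the inter-community weight by a single edge, and uses the standard merge gain dQ = e_ab/m − d_a·d_b/(2m²) rather than the paper's machinery. Union-find tracks the communities.

```c
#include <stdio.h>

#define NV 7
#define NE 8

static int parent[NV];
static int find(int v) { return parent[v] == v ? v : (parent[v] = find(parent[v])); }

int main(void)
{
    /* A small unweighted graph over vertices 0..6. */
    int src[NE] = {0, 0, 1, 1, 2, 3, 4, 5};
    int dst[NE] = {1, 2, 2, 3, 4, 5, 6, 6};
    double deg[NV] = {0}, m = NE;

    for (int e = 0; e < NE; e++) { deg[src[e]] += 1; deg[dst[e]] += 1; }
    for (int v = 0; v < NV; v++) parent[v] = v;

    for (;;) {
        int best = -1;
        double bestdq = 0.0;
        for (int e = 0; e < NE; e++) {      /* scan stands in for the queue */
            int a = find(src[e]), b = find(dst[e]);
            if (a == b) continue;           /* endpoints already merged */
            double dq = 1.0 / m - deg[a] * deg[b] / (2.0 * m * m);
            if (dq > bestdq) { bestdq = dq; best = e; }
        }
        if (best < 0) break;                /* no modularity-increasing merge */
        int a = find(src[best]), b = find(dst[best]);
        parent[b] = a;                      /* contract b's community into a's */
        deg[a] += deg[b];                   /* roots carry community degree sums */
        printf("merged community %d into %d (dQ = %g)\n", b, a, bestdq);
    }
    return 0;
}
```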
Parallel agglomerative method
[Figure: example graph with vertices A through G]
• We use a matching to avoid the queue.
• Compute a heavy-weight, large matching.
  • Simple greedy algorithm.
  • Maximal matching.
  • Within a factor of 2 of the maximum weight.
• Merge all matched communities at once.
  • Maintains some balance.
  • Produces different results.
• Agnostic to weighting, matching...
  • Can maximize modularity, minimize conductance.
  • Modifying the matching permits easy exploration.
Platform: Cray XMT2
Tolerates latency by massive multithreading.
• Hardware: 128 threads per processor
  • Context switch on every cycle (500 MHz)
  • Many outstanding memory requests (180/proc)
  • "No" caches...
• Flexibly supports dynamic load balancing
• Globally hashed address space, no data cache
• Support for fine-grained, word-level synchronization
  • Full/empty bit on every memory word
• 64-processor XMT2 at CSCS, the Swiss National Supercomputing Centre
  • 500 MHz processors, 8192 threads, 2 TiB of shared memory
[Image: cray.com]
Platform: Intel® E7-8870-based server
Tolerates some latency by hyperthreading.
• "Westmere": 2 threads/core, 10 cores/socket, four sockets.
• Fast cores (2.4 GHz), fast memory (1066 MHz).
• Not so many outstanding memory requests (60/socket), but large caches (30 MiB L3 per socket).
• Good system support
  • Transparent hugepages reduce TLB costs.
  • Fast, user-level locking. (HLE would be better...)
  • OpenMP, although I didn't tune it...
• mirasol, #17 on Graph500 (thanks to UCB)
  • Four processors (80 threads), 256 GiB memory
  • gcc 4.6.1, Linux kernel 3.2.0-rc5
[Image: Intel® press kit]
Platform: Other Intel®-based servers
Different design points
• "Nehalem" X5570: 2.93 GHz, 2 threads/core, 4 cores/socket, 2 sockets, 8 MiB cache/socket
• "Westmere" X5650: 2.66 GHz, 2 threads/core, 6 cores/socket, 2 sockets, 12 MiB cache/socket
• All with 1066 MHz memory.
• Does the Westmere E7-8870's scale affect performance?
• Nodes in the Georgia Tech CSE cluster jinx
  • 24–48 GiB memory, small tests
[Image: Intel® press kit]
Implementation: Data structures
Extremely basic for graph G = (V, E)
• An array of (i, j; w) weighted edge pairs, each i, j stored only once and packed, uses 3|E| space.
• An array to store self-edges, d(i) = w, |V|.
• A temporary floating-point array for scores, |E|.
• Additional temporary arrays using 4|V| + 2|E| to store degrees, matching choices, offsets...
• Weights count the number of agglomerated vertices or edges.
• Scoring methods (modularity, conductance) need only vertex-local counts.
• Storing an undirected graph in a symmetric manner reduces memory usage drastically and works with our simple matcher.
Implementation: Data structures (continued)
• The original version ignored order in the edge array, which killed OpenMP performance.
• New: roughly bucket the edge array by first stored index. A non-adjacent, CSR-like structure.
• New: hash i, j to determine order. Scatter among buckets.
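A rough C rendering of how those arrays might be bundled, inferred from the slides rather than taken from the released code; the struct and field names are assumptions.

```c
#include <stdint.h>

/* Packed, symmetric storage: each undirected edge (i, j; w) appears
 * once, roughly bucketed by its first stored index. */
struct community_graph {
    int64_t  nv, ne;     /* current vertex and edge counts              */
    int64_t *el;         /* 3*ne entries of (i, j; w) triples           */
    int64_t *selfw;      /* nv self-edge weights, d(i) = w              */
    double  *score;      /* ne per-edge scores (modularity change, ...) */
    int64_t *match;      /* nv matching choices                         */
    int64_t *off;        /* nv+1 bucket offsets, the CSR-like ordering  */
};
```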
Implementation: Routines
Three primitives: scoring, matching, contracting.
• Scoring: trivial.
• Matching: repeat until there is no ready, unmatched vertex:
  1. For each unmatched vertex in parallel, find the best unmatched neighbor in its bucket.
  2. Try to point the remote match at that edge (lock, check if best, unlock).
  3. If pointing succeeded, try to point the self-match at that edge.
  4. If both succeeded, yeah! If not, and there was some eligible neighbor, re-add self to the ready, unmatched list.
(Possibly too simple, but... A lock-based sketch follows.)
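A hedged OpenMP sketch of that handshake, substituting one lock per vertex for the XMT's full/empty bits. The helpers best_unmatched_neighbor_edge and edge_other_endpoint are hypothetical, and the undo of a half-completed match is one reading of step 4, not the paper's exact code.

```c
#include <stdint.h>
#include <omp.h>

/* Hypothetical helpers assumed to exist elsewhere: */
extern int64_t best_unmatched_neighbor_edge(int64_t v); /* -1 if none */
extern int64_t edge_other_endpoint(int64_t e, int64_t v);

/* match[v] = -1 means unmatched; lock[] must already be initialized
 * with omp_init_lock. */
void greedy_match(int64_t nv, int64_t *match, omp_lock_t *lock)
{
    int again;
    do {
        again = 0;
        #pragma omp parallel for schedule(dynamic, 256) reduction(|:again)
        for (int64_t v = 0; v < nv; v++) {
            if (match[v] >= 0) continue;           /* already matched */
            int64_t e = best_unmatched_neighbor_edge(v);
            if (e < 0) continue;                   /* no eligible neighbor */
            int64_t u = edge_other_endpoint(e, v);

            omp_set_lock(&lock[u]);                /* step 2: remote end */
            int remote_ok = (match[u] < 0);
            if (remote_ok) match[u] = e;
            omp_unset_lock(&lock[u]);
            if (!remote_ok) { again |= 1; continue; }  /* u taken; retry */

            omp_set_lock(&lock[v]);                /* step 3: self */
            int self_ok = (match[v] < 0);
            if (self_ok) match[v] = e;
            omp_unset_lock(&lock[v]);
            if (!self_ok) {                        /* lost our own slot, */
                omp_set_lock(&lock[u]);            /* so release u again */
                if (match[u] == e) match[u] = -1;
                omp_unset_lock(&lock[u]);
                again |= 1;
            }
        }
    } while (again);
}
```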
Implementation: Routines
• Contracting:
  1. Map each i, j to new vertices, re-order by hashing.
  2. Accumulate counts for the new i bins, prefix-sum for offsets.
  3. Copy into the new bins.
• Only synchronizing in the prefix sum. That could be removed by not re-ordering the new i, j pairs; I haven't timed the difference.
• Actually, the current code copies twice... on the short list for fixing.
• Binning, as opposed to the original list-chasing, enabled Intel/OpenMP support with reasonable performance.
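A sketch of the count, prefix-sum, and scatter pattern behind steps 2 and 3, assuming the edges have already been relabeled to new community ids in ni and nj; illustrative OpenMP, not the actual implementation.

```c
#include <stdint.h>

/* count has nnew+1 entries and arrives zeroed; on return it is
 * consumed as running offsets, so callers keep a copy if needed. */
void bin_edges(int64_t nnew, int64_t ne,
               const int64_t *ni, const int64_t *nj, const int64_t *w,
               int64_t *count, int64_t *out_i, int64_t *out_j, int64_t *out_w)
{
    /* Accumulate counts per new first index. */
    #pragma omp parallel for
    for (int64_t e = 0; e < ne; e++) {
        #pragma omp atomic
        count[ni[e] + 1]++;
    }

    /* Exclusive prefix sum for bin offsets: the one synchronization
     * point noted on the slide (serial here for brevity). */
    for (int64_t k = 0; k < nnew; k++)
        count[k + 1] += count[k];

    /* Scatter each edge into its bin. */
    #pragma omp parallel for
    for (int64_t e = 0; e < ne; e++) {
        int64_t pos;
        #pragma omp atomic capture
        pos = count[ni[e]]++;
        out_i[pos] = ni[e];
        out_j[pos] = nj[e];
        out_w[pos] = w[e];
    }
}
```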
Performance summary
Two moderate-sized graphs, one large.

  Graph             |V|           |E|             Reference
  rmat-24-16         15 580 378     262 482 711   [5, 6]
  soc-LiveJournal1    4 847 571      68 993 773   [7]
  uk-2007-05        105 896 555   3 301 876 564   [8]

Peak processing rates in edges/second:

  Platform   rmat-24-16    soc-LiveJournal1   uk-2007-05
  X5570      1.83 × 10^6   3.89 × 10^6        n/a
  X5650      2.54 × 10^6   4.98 × 10^6        n/a
  E7-8870    5.86 × 10^6   6.90 × 10^6        6.54 × 10^6
  XMT        1.20 × 10^6   0.41 × 10^6        n/a
  XMT2       2.11 × 10^6   1.73 × 10^6        3.11 × 10^6
Performance: Time to solution
[Figure: time to solution (s, log scale) vs. threads (OpenMP) / processors (XMT) for rmat-24-16 and soc-LiveJournal1, on the X5570 (4-core), X5650 (6-core), E7-8870 (10-core), XMT, and XMT2.]
Performance: Rate (edges/second)
[Figure: processed edges per second vs. threads (OpenMP) / processors (XMT) for rmat-24-16 and soc-LiveJournal1, same platforms.]
Performance: Modularity
[Figure: modularity vs. agglomeration step for coAuthorsCiteseer, eu-2005, and uk-2002, under termination metrics Coverage, Max, and Average.]
• Timing results: stop when coverage ≥ 0.5 (communities cover half the edges).
• More work ⇒ higher modularity. The choice is up to the application.
Performance: Small-scale speedup
[Figure: speedup over one thread (OpenMP) or processor (XMT) for rmat-24-16 and soc-LiveJournal1, same platforms.]
Performance: Large-scale time
[Figure: uk-2007-05 time to solution vs. threads (OpenMP) / processors (XMT) on the E7-8870 and XMT2; labeled points include 31568 s, 6917 s, 1063 s, and 504.9 s.]
Performance: Large-scale speedup
[Figure: uk-2007-05 speedup over a single processor/thread on the E7-8870 and XMT2; peaks labeled 29.6× and 13.7×.]
Conclusions and plans
• Code: http://www.cc.gatech.edu/~jriedy/community-detection/
• Some low-hanging fruit remains:
  • Eliminate one unnecessary copy during contraction.
  • Deal with stars.
• Then... practical experiments.
  • How volatile are modularity and conductance to perturbations?
  • What matching schemes work well?
  • How do different metrics compare in applications?
• Extending to streaming graph data!
  • Includes developing parallel refinement...
  • And possibly de-clustering or manipulating the dendrogram...
  • Very much work in progress, trickier than anticipated.
Acknowledgment of support
Bibliography I
[1] U. Brandes, D. Delling, M. Gaertler, R. Görke, M. Hoefer, Z. Nikoloski, and D. Wagner, "On modularity clustering," IEEE Trans. Knowledge and Data Engineering, vol. 20, no. 2, pp. 172–188, 2008.
[2] M. Newman, "Modularity and community structure in networks," Proc. of the National Academy of Sciences, vol. 103, no. 23, pp. 8577–8582, 2006.
[3] A. Clauset, M. Newman, and C. Moore, "Finding community structure in very large networks," Physical Review E, vol. 70, no. 6, p. 66111, 2004.
[4] K. Wakita and T. Tsurumi, "Finding community structure in mega-scale social networks," CoRR, vol. abs/cs/0702048, 2007.
Bibliography II
[5] D. Chakrabarti, Y. Zhan, and C. Faloutsos, "R-MAT: A recursive model for graph mining," in Proc. 4th SIAM Intl. Conf. on Data Mining (SDM), Orlando, FL: SIAM, Apr. 2004.
[6] D. Bader, J. Gilbert, J. Kepner, D. Koester, E. Loh, K. Madduri, W. Mann, and T. Meuse, HPCS SSCA#2 Graph Analysis Benchmark Specifications v1.1, Jul. 2005.
[7] J. Leskovec, "Stanford large network dataset collection," at http://snap.stanford.edu/data/, Oct. 2011.
[8] P. Boldi, B. Codenotti, M. Santini, and S. Vigna, "UbiCrawler: A scalable fully distributed web crawler," Software: Practice & Experience, vol. 34, no. 8, pp. 711–726, 2004.
