DIMACS10: Parallel Community Detection for Massive Graphs
Presentation at the 10th DIMACS Implementation Challenge on our improved parallel community detection algorithm.

Presentation Transcript

  • Parallel Community Detection for Massive Graphs. E. Jason Riedy, Henning Meyerhenke, David Ediger, and David A. Bader. 14 February 2012
  • Exascale data analysis. Health care: finding outbreaks, population epidemiology. Social networks: advertising, searching, grouping. Intelligence: decisions at scale, regulating algorithms. Systems biology: understanding interactions, drug design. Power grid: disruptions, conservation. Simulation: discrete events, cracking meshes. Graph clustering is common in all of these application areas.
  • These are not easy graphs. Yifan Hu's (AT&T) visualization of the in-2004 data set: http://www2.research.att.com/~yifanhu/gallery.html
  • But no shortage of structure... Figures: protein interactions (Giot et al., "A Protein Interaction Map of Drosophila melanogaster", Science 302, 1722-1736, 2003) and Jason's network via LinkedIn Labs. • Locally, there are clusters or communities. • First pass over a massive social graph: find smaller communities of interest, then analyze / visualize the top-ranked communities. • Our part: community detection at massive scale. (Or kinda large, given available data.)
  • Outline: motivation; shooting for massive graphs; our parallel method; implementation and platform details; performance; conclusions and plans.
  • Can we tackle massive graphs now? Parallel, of course... • Massive needs distributed memory, right? • Well... not really. Can buy a 2 TiB Intel-based Dell server on-line for around $200k USD, a 1.5 TiB from IBM, etc. (Image: dell.com. Not an endorsement, just evidence!) • Publicly available "real-world" data fits... • Start with shared memory to see what needs to be done. • Specialized architectures provide larger shared-memory views over distributed implementations (e.g. Cray XMT).
  • Designing for parallel algorithms. What should we avoid in algorithms? Rules of thumb: • "We order the vertices (or edges) by..." unless followed by bisecting searches. • "We look at a region of size more than two steps..." Many target massive graphs have diameter of ≈ 20; more than two steps swallows much of the graph. • "Our algorithm requires more than Õ(|E|/#)..." Massive means you hit asymptotic bounds, and |E| is plenty of work. • "For each vertex, we do something sequential..." The few high-degree vertices will be large bottlenecks. Remember: rules of thumb can be broken with reason.
  • Designing for parallel implementations. What should we avoid in implementations? Rules of thumb: • Scattered memory accesses through traditional sparse matrix representations like CSR; use your cache lines. [Figure: separate 32-bit index and 64-bit value arrays vs. packed (idx1, idx2, val1, val2) records.] • Using too much memory, which is a painful trade-off with parallelism; think Fortran and workspace... • Synchronizing too often. There will be work imbalance; try to use the imbalance to reduce "hot-spotting" on locks or cache lines. Remember: rules of thumb can be broken with reason. Some of these help when extending to PGAS / message-passing.
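
To make the layout contrast concrete, here is a minimal C sketch (illustrative only; packed_edge and csr_graph are hypothetical names, not the authors' structures). A packed record keeps both endpoints and the weight adjacent in memory, so scoring one edge is a contiguous read rather than a gather across separate index and value arrays as in CSR:

    #include <stdint.h>

    /* Packed edge record: both endpoints and the accumulated weight sit
       together, so examining an edge touches one small block of memory. */
    struct packed_edge {
        int32_t i, j;   /* endpoints, 32-bit indices  */
        int64_t w;      /* accumulated weight, 64-bit */
    };

    /* Traditional CSR keeps the same information in parallel arrays;
       per-edge work then scatters across offsets, indices, and values. */
    struct csr_graph {
        int64_t *row_off;  /* |V|+1 row offsets   */
        int32_t *col_idx;  /* |E| column indices  */
        int64_t *val;      /* |E| edge weights    */
    };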
  • Sequential agglomerative method. • A common method (e.g. Clauset, Newman, and Moore) agglomerates vertices into communities. • Each vertex begins in its own community. • An edge is chosen to contract; merging maximally increases modularity. • Selection uses a priority queue. • Known often to fall into an O(n²) performance trap with modularity (Wakita & Tsurumi). [Figure: example graph with vertices A-G, contracted edge by edge.]
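
For reference, the modularity score that these greedy merges increase is the standard Newman definition (the slides do not restate it); here A_{ij} is the weight of edge (i, j), k_i the weighted degree of i, 2m the total edge weight, and delta(c_i, c_j) = 1 when i and j share a community:

    Q \;=\; \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)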
  • Parallel agglomerative method. • We use a matching to avoid the queue: compute a heavy-weight, large matching with a simple greedy algorithm, giving a maximal matching within a factor of 2 in weight. • Merge all matched communities at once. Maintains some balance; produces different results. • Agnostic to weighting and matching: can maximize modularity or minimize conductance, and modifying the matching permits easy exploration. [Figure: example graph with vertices A-G; matched edges are merged simultaneously.]
  • Platform: Cray XMT2. Tolerates latency by massive multithreading. • Hardware: 128 threads per processor; context switch on every cycle (500 MHz); many outstanding memory requests (180/proc); "no" caches... • Flexibly supports dynamic load balancing: globally hashed address space, no data cache. • Support for fine-grained, word-level synchronization: full/empty bit with every memory word. • 64-processor XMT2 at CSCS, the Swiss National Supercomputing Centre: 500 MHz processors, 8192 threads, 2 TiB of shared memory. (Image: cray.com)
  • Platform: Intel® E7-8870-based server. Tolerates some latency by hyperthreading. • Hardware: 2 threads/core, 10 cores/socket, four sockets. Fast cores (2.4 GHz), fast memory (1066 MHz). Not so many outstanding memory requests (60/socket), but large caches (30 MiB L3 per socket). • Good system support: transparent hugepages reduce TLB costs; fast, user-level locking (HLE would be better...); OpenMP, although I didn't tune it... • mirasol, #17 on Graph500 (thanks to UCB): four processors (80 threads), 256 GiB memory, gcc 4.6.1, Linux kernel 3.2.0-rc5. (Image: Intel® press kit)
  • Implementation: Data structures. Extremely basic for a graph G = (V, E). • An array of (i, j; w) weighted edge pairs, each i, j stored only once and packed, uses 3|E| space. • An array storing self-edge weights d(i) = w: |V|. • A temporary floating-point array for scores: |E|. • Additional temporary arrays, 4|V| + 2|E|, to store degrees, matching choices, offsets... • Weights count the number of agglomerated vertices or edges. • Scoring methods (modularity, conductance) need only vertex-local counts. • Storing an undirected graph in a symmetric manner reduces memory usage drastically and works with our simple matcher.
  • Implementation: Data structures (sizes for uk-2007-05). • The (i, j; w) edge array uses 3|E| 32-bit words. • The self-edge array d(i) = w: |V|. • The temporary floating-point score array: |E|. • Additional temporary arrays: 2|V| + |E| 64-bit and 2|V| 32-bit entries for degrees, matching choices, offsets... • Need to fit uk-2007-05 into 256 GiB. • Cheat: use 32-bit integers for indices; we know we won't contract so far as to need 64-bit weights. • Could cheat further and use 32-bit floats for scores. • (Note: the code didn't bother optimizing workspace size.)
  • Implementation: Data structures (edge ordering). • The original code ignored order in the edge array, which killed OpenMP performance. • New: roughly bucket the edge array by first stored index, giving a non-adjacent, CSR-like structure. • New: hash i, j to determine storage order, then scatter among buckets. • (New = MTAAP 2012.)
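
A minimal C sketch of the hashed-ordering and bucketing idea (my own illustration under assumptions; mix(), order_endpoints(), and bucket_edges() are hypothetical names, and the released code differs):

    #include <stdint.h>

    static uint32_t mix(uint32_t x) {          /* simple integer hash */
        x ^= x >> 16; x *= 0x7feb352dU;
        x ^= x >> 15; x *= 0x846ca68bU;
        return x ^ (x >> 16);
    }

    /* Decide which endpoint is stored first by comparing hashes, so the
       choice is pseudo-random rather than biased toward low vertex IDs. */
    static void order_endpoints(int32_t *i, int32_t *j) {
        if (mix((uint32_t)*i) > mix((uint32_t)*j)) {
            int32_t t = *i; *i = *j; *j = t;
        }
    }

    /* Bucket edges by first stored endpoint: count, prefix-sum, scatter.
       bucket_off has nv+1 entries; perm receives a permutation of edges. */
    void bucket_edges(int32_t *ei, int32_t *ej, int64_t nedges,
                      int64_t nv, int64_t *bucket_off, int64_t *perm) {
        for (int64_t v = 0; v <= nv; ++v) bucket_off[v] = 0;
        for (int64_t k = 0; k < nedges; ++k) {
            order_endpoints(&ei[k], &ej[k]);
            bucket_off[ei[k] + 1]++;           /* count per first endpoint */
        }
        for (int64_t v = 0; v < nv; ++v)       /* exclusive prefix sum     */
            bucket_off[v + 1] += bucket_off[v];
        for (int64_t k = 0; k < nedges; ++k)   /* scatter edge positions   */
            perm[bucket_off[ei[k]]++] = k;
        /* bucket_off[v] now points one past bucket v; shift back if the
           per-bucket starts are needed afterwards. */
    }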
  • Implementation: Routines. Three primitives: scoring, matching, contracting. • Scoring: trivial. • Matching: repeat until there is no ready, unmatched vertex: (1) for each unmatched vertex in parallel, find the best unmatched neighbor in its bucket; (2) try to point the remote match at that edge (lock, check if best, unlock); (3) if pointing succeeded, try to point the self-match at that edge; (4) if both succeeded, yeah! If not and there was some eligible neighbor, re-add self to the ready, unmatched list. (Possibly too simple, but...)
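
A rough OpenMP/C sketch of one round of the matching loop above (a sketch under assumptions, not the released implementation; best_unmatched_neighbor() and edge_other_end() are hypothetical helpers, match[v] holds the claimed edge index or -1, and lock[v] protects match[v]):

    #include <omp.h>
    #include <stdint.h>

    extern int64_t best_unmatched_neighbor(int64_t v);   /* hypothetical */
    extern int64_t edge_other_end(int64_t e, int64_t v); /* hypothetical */

    /* One round over the ready list; vertices that lost a race and still
       had an eligible neighbor are collected into retry[] for the next
       round.  The real code repeats rounds until no ready vertex remains. */
    void match_round(const int64_t *ready, int64_t nready,
                     int64_t *match, omp_lock_t *lock,
                     int64_t *retry, int64_t *nretry) {
        *nretry = 0;
        #pragma omp parallel for
        for (int64_t k = 0; k < nready; ++k) {
            int64_t v = ready[k];
            int64_t e = best_unmatched_neighbor(v);
            if (e < 0) continue;                    /* no eligible neighbor */
            int64_t u = edge_other_end(e, v);

            omp_set_lock(&lock[u]);                 /* point remote match   */
            int remote_ok = (match[u] < 0);
            if (remote_ok) match[u] = e;
            omp_unset_lock(&lock[u]);

            if (remote_ok) {
                omp_set_lock(&lock[v]);             /* then point self      */
                int self_ok = (match[v] < 0);
                if (self_ok) match[v] = e;
                omp_unset_lock(&lock[v]);
                if (!self_ok) {                     /* undo the remote claim */
                    omp_set_lock(&lock[u]);
                    if (match[u] == e) match[u] = -1;
                    omp_unset_lock(&lock[u]);
                }
            } else {                                /* lost the race: retry  */
                int64_t idx;
                #pragma omp atomic capture
                idx = (*nretry)++;
                retry[idx] = v;
            }
        }
    }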
  • Implementation: Routines (contracting). (1) Map each i, j to new vertices, re-ordering by hashing; (2) accumulate counts for the new i bins and prefix-sum for offsets; (3) copy into the new bins. • Only synchronizing in the prefix sum; that could be removed if I don't re-order the i, j pair (haven't timed the difference). • Actually, the current code copies twice... on the short list for fixing. • Binning, as opposed to the original list-chasing, enabled Intel/OpenMP support with reasonable performance.
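
Step 3's duplicate handling might look like the following C sketch (illustrative, not the released code; it assumes endpoints have already been relabeled through the community map and grouped by their new first endpoint, and it keeps self-loops instead of routing them to the d(i) array):

    #include <stdint.h>
    #include <stdlib.h>

    struct edge { int32_t i, j; int64_t w; };

    static int cmp_j(const void *a, const void *b) {
        const struct edge *x = a, *y = b;
        return (x->j > y->j) - (x->j < y->j);
    }

    /* Merge one group of ecount edges that share the same new i:
       sort by second endpoint, then sum parallel edges into one.
       Returns the compacted count. */
    int64_t merge_group(struct edge *e, int64_t ecount) {
        qsort(e, (size_t)ecount, sizeof *e, cmp_j);
        int64_t out = 0;
        for (int64_t k = 0; k < ecount; ++k) {
            if (out > 0 && e[out - 1].j == e[k].j)
                e[out - 1].w += e[k].w;      /* parallel edge: sum weights */
            else
                e[out++] = e[k];
        }
        return out;
    }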
  • Implementation: Routines (time breakdown by primitive). [Figure: fraction of execution time spent in the score, match, and contract primitives for each test graph, on the Intel E7-8870 and the Cray XMT2.]
  • Performance: Time by platform. [Figure: total time (s), log scale from 10^-2 to 10^3, per graph on the Intel E7-8870 and the Cray XMT2; point size/color indicates the number of communities found, 10^0 through 10^6.]
  • Performance: Rate by platform. [Figure: edges per second, roughly 10^4 through 10^7, per graph on the Intel E7-8870 and the Cray XMT2.]
  • Performance: Rate by metric (on Intel). [Figure: edges per second per graph for the cnm (modularity) and cond (conductance) scoring metrics; point size indicates the number of communities.]
  • Performance: Scaling. [Figure: time (s) vs. number of threads/processors (2^0 through 2^6) on the Intel E7-8870 and the Cray XMT2 for uk-2002 and kron_g500-simple-logn20; annotated times range from 1188.9 s down to 6.6 s.]
  • Performance: Modularity at coverage ≈ 0.5. [Figure: modularity (0.0 to 0.8) per graph for the cnm, mb, and cond scoring options.]
  • Performance: Average conductance (AIXC) at coverage ≈ 0.5. [Figure: AIXC (0.0 to 0.8) per graph for the cnm, mb, and cond scoring options.]
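
For context, a standard per-community conductance and the coverage measure behind the ≈ 0.5 cutoff are usually defined as below; treating AIXC as an average of φ(S) over the communities is an assumption here, since the slides do not spell out the averaging convention:

    % Conductance of a community S, with vol(S) the total degree inside S:
    \phi(S) \;=\; \frac{\operatorname{cut}(S, \bar S)}
                       {\min\bigl(\operatorname{vol}(S),\, \operatorname{vol}(\bar S)\bigr)},
    \qquad \operatorname{vol}(S) \;=\; \sum_{v \in S} \deg(v)

    % Coverage of a clustering C: fraction of edge weight falling inside communities.
    \operatorname{coverage}(C) \;=\; \frac{\sum_{S \in C} w(E(S))}{w(E)}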
  • Performance: Modularity by step. [Figure: modularity vs. agglomeration step (0 through 35) for coAuthorsCiteseer, eu-2005, and uk-2002.]
  • Performance: Coverage by step. [Figure: coverage vs. step for the same three graphs.]
  • Performance: Number of communities by step. [Figure: number of communities vs. step, log scale (about 10^4 through 10^7), for the same three graphs.]
  • Performance: AIXC by step. [Figure: average inter-cluster conductance vs. step for the same three graphs.]
  • Performance: Communication volume by step. [Figure: communication volume, with cut weight dashed, vs. step for the same three graphs.]
  • Conclusions and plans. • Code: http://www.cc.gatech.edu/~jriedy/community-detection/ • First: fix the low-hanging fruit. Eliminate a copy during contraction; deal with stars (next presentation). • Then... practical experiments. How volatile are modularity and conductance to perturbations? What matching schemes work well? How do different metrics compare in applications? • Extending to streaming graph data! Includes developing parallel refinement (distance-2 matching) and possibly de-clustering or manipulating the dendrogram.
  • Acknowledgment of support.