MTAAP12: Scalable Community Detection

Scalable Multi-threaded Community Detection
in Social Networks
E. Jason Riedy (1), David A. Bader (1), and Henning Meyerhenke (2)
(1) School of Comp. Science and Engineering, Georgia Inst. of Technology
(2) Inst. of Theoretical Informatics, Karlsruhe Inst. of Technology (KIT)

25 May 2012

Exascale data analysis

Health care: Finding outbreaks, population epidemiology
Social networks: Advertising, searching, grouping
Intelligence: Decisions at scale, regulating algorithms
Systems biology: Understanding interactions, drug design
Power grid: Disruptions, conservation
Simulation: Discrete events, cracking meshes

• Graph clustering is common in all application areas.

MTAAP 2012—Scalable Community Detection—Jason Riedy 2/35

These are not easy graphs.
Yifan Hu’s (AT&T) visualization of the in-2004 data set
http://www2.research.att.com/~yifanhu/gallery.html


But no shortage of structure...

Image: Protein interactions, Giot et al., “A Protein Interaction Map of Drosophila melanogaster”, Science 302, 1722-1736, 2003.
Image: Jason’s network via LinkedIn Labs

• Locally, there are clusters or communities.
• First pass over a massive social graph:
• Find smaller communities of interest.
• Analyze / visualize top-ranked communities.
• Our part: Community detection at massive scale. (Or kinda
large, given available data.)

Outline

Motivation

Defining community detection and metrics

Shooting for massive graphs

Our parallel method

Implementation and platform details

Performance

Conclusions and plans


Community detection

What do we mean?
• Partition a graph’s vertices into disjoint communities.
• A community locally maximizes some metric.
• Modularity, conductance, ...
• Trying to capture that vertices are more similar within one community than between communities.

Image: Jason’s network via LinkedIn Labs


Community detection

Assumptions
• Disjoint partitioning of vertices.
• There is no one unique answer.
• Many metrics are NP-complete to optimize (Brandes, et al. [1]).
• Graph is lossy representation.
• Want an adaptable detection method.

Image: Jason’s network via LinkedIn Labs


Common community metric: Modularity

• Modularity: Deviation of connectivity in the community
induced by a vertex set S from some expected background
model of connectivity.
• We take Newman’s [2] basic uniform model.
• Let $m$ count all edges in graph $G$, $m_S$ count the edges with
both endpoints in $S$, and $x_S$ count the edges with any
endpoint in $S$. Modularity $Q_S$:

$$Q_S = \frac{m_S - x_S^2 / (4m)}{m}$$

• Total modularity: sum of modularities of disjoint subsets.
• A sufficiently positive modularity implies some structure.
• Known issues: Resolution limit, NP-complete optimization problem.
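
A minimal sketch of the modularity formula above, for an unweighted, undirected edge list. The function name and the edge-list layout are illustrative only. One assumption is made loudly: x_S is taken as the sum of degrees over S, so an edge inside S counts from both endpoints. That is the reading under which the formula agrees with Newman's [2] per-community modularity, m_S/m - (x_S/2m)^2.

```python
def community_modularity(edges, community):
    """Q_S = (m_S - x_S^2 / (4 m)) / m for one vertex set S.

    edges: list of (i, j) pairs, each undirected edge listed once.
    community: iterable of the vertices in S.
    """
    members = set(community)
    m = len(edges)
    # m_S: edges with both endpoints in S.
    m_s = sum(1 for i, j in edges if i in members and j in members)
    # x_S (assumption): degree sum over S, one count per endpoint in S.
    x_s = sum((i in members) + (j in members) for i, j in edges)
    return (m_s - x_s * x_s / (4.0 * m)) / m


# Two triangles joined by a single bridge edge.
two_triangles = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
total = (community_modularity(two_triangles, [0, 1, 2])
         + community_modularity(two_triangles, [3, 4, 5]))
```

Summing over the two triangles gives the total modularity of that partition, matching the "sum of modularities of disjoint subsets" bullet.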


Can we tackle massive graphs now?
Parallel, of course...
• Massive needs distributed memory, right?
• Well... Not really. Can buy a 2 TiB Intel-based Dell server
on-line for around $200k USD, a 1.5 TiB from IBM, etc.

Image: dell.com.

Not an endorsement, just evidence!
• Publicly available “real-world” data fits...
• Start with shared memory to see what needs to be done.
• Specialized architectures provide larger shared-memory views
over distributed implementations (e.g. Cray XMT).

Multi-threaded algorithm design points

A scalable multi-threaded graph analysis algorithm
• ... avoids global locks and frequent global synchronization.
• ... distributes computation over edges rather than only vertices.
• ... works with data as local to an edge as possible.
• ... uses compact data structures that agglomerate memory
references.


Sequential agglomerative method

• A common method (e.g. Clauset, et al. [3]) agglomerates vertices into communities.
• Each vertex begins in its own community.
• An edge is chosen to contract.
• Merging maximally increases modularity.
• Priority queue.
• Known often to fall into an O(n^2) performance trap with modularity (Wakita & Tsurumi [4]).

[Figure: example graph on vertices A through G.]





Parallel agglomerative method

• We use a matching to avoid the queue.
• Compute a heavy weight, large matching.
• Simple greedy algorithm.
• Maximal matching.
• Within factor of 2 in weight.
• Merge all matched communities at once.
• Maintains some balance.
• Produces different results.
• Agnostic to weighting, matching...
• Can maximize modularity, minimize conductance.
• Modifying matching permits easy exploration.

[Figure: example graph on vertices A through G.]
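
A minimal sequential sketch of the "simple greedy algorithm" bullet, assuming scored edges arrive as (score, i, j) tuples. The function name and tuple layout are hypothetical, not taken from the authors' code. Visiting edges by decreasing score and keeping each edge whose endpoints are both still free gives a maximal matching with weight at least half the optimum, which is the "within factor of 2" property.

```python
def greedy_matching(scored_edges):
    """Return {vertex: partner} for a greedy heavy matching.

    scored_edges: iterable of (score, i, j) tuples for undirected edges.
    """
    partner = {}
    # Highest score first; ties fall back to comparing the vertex ids.
    for score, i, j in sorted(scored_edges, reverse=True):
        if i != j and i not in partner and j not in partner:
            partner[i] = j
            partner[j] = i
    return partner


# Path 0-1-2-3: greedy takes the middle edge (weight 3).
# The best possible matching takes both outer edges (weight 4).
path = [(2.0, 0, 1), (3.0, 1, 2), (2.0, 2, 3)]
matching = greedy_matching(path)
```

Swapping the score used to rank edges changes what the matching optimizes without touching the rest of the loop.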





Platform: Cray XMT2
Tolerates latency by massive multithreading.
• Hardware: 128 threads per processor
• Context switch on every cycle (500 MHz)
• Many outstanding memory requests (180/proc)
• “No” caches...
• Flexibly supports dynamic load balancing
• Globally hashed address space, no data cache
• Support for fine-grained, word-level synchronization
• Full/empty bit on every memory word

• 64-processor XMT2 at CSCS, the Swiss National Supercomputing Centre.
• 500 MHz processors, 8192 threads, 2 TiB of shared memory

Image: cray.com


Platform: Intel® E7-8870-based server
Tolerates some latency by hyperthreading.
• “Westmere”: 2 threads / core, 10 cores / socket, four sockets.
• Fast cores (2.4 GHz), fast memory (1 066 MHz).
• Not so many outstanding memory requests (60/socket), but
large caches (30 MiB L3 per socket).
• Good system support
• Transparent hugepages reduce TLB costs.
• Fast, user-level locking. (HLE would be better...)
• OpenMP, although I didn’t tune it...

• mirasol, #17 on Graph500 (thanks to UCB)
• Four processors (80 threads), 256 GiB memory
• gcc 4.6.1, Linux kernel 3.2.0-rc5

Image: Intel® press kit

Platform: Other Intel®-based servers

Different design points
• “Nehalem” X5570: 2.93 GHz, 2 threads/core, 4 cores/socket,
2 sockets, 8 MiB cache/socket
• “Westmere” X5650: 2.66 GHz, 2 threads/core, 6 cores/socket,
2 sockets, 12 MiB cache/socket
• All with 1 066 MHz memory.
• Does the Westmere E7-8870’s scale affect performance?

• Nodes in Georgia Tech CSE cluster jinx
• 24-48 GiB memory, small tests

Image: Intel® press kit


Implementation: Data structures

Extremely basic for graph G = (V, E)
• An array of (i, j; w) weighted edge pairs, each i, j stored only
once and packed, uses 3|E| space
• An array to store self-edges, d(i) = w, uses |V| space
• A temporary floating-point array for scores, uses |E| space
• Additional temporary arrays using 4|V| + 2|E| space to store
degrees, matching choices, offsets...

• Weights count number of agglomerated vertices or edges.
• Scoring methods (modularity, conductance) need only
vertex-local counts.
• Storing an undirected graph in a symmetric manner reduces
memory usage drastically and works with our simple matcher.
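
A minimal sketch of the layout above with plain Python lists standing in for the packed arrays. All names are hypothetical. Two assumptions are labeled in the code: repeated {i, j} pairs are combined by adding weights, and the score is the modularity change from merging the two endpoints, derived from the modularity formula with each vertex's count taken as its weighted degree. The point is that scoring an edge needs only the edge weight and two vertex-local counts.

```python
def build_graph(nv, raw_edges):
    """Return (el, self_w): packed [i, j, w, ...] triples and self-edge weights."""
    self_w = [0] * nv                      # d(i) = w, |V| entries
    merged = {}
    for i, j, w in raw_edges:
        if i == j:
            self_w[i] += w
        else:
            key = (min(i, j), max(i, j))   # store each {i, j} only once
            merged[key] = merged.get(key, 0) + w  # assumption: sum repeats
    el = []                                # 3|E| entries
    for (i, j), w in merged.items():
        el.extend((i, j, w))
    return el, self_w


def modularity_scores(el, self_w):
    """One score per stored edge: modularity change if its endpoints merge."""
    ne = len(el) // 3
    m = sum(self_w) + sum(el[3 * k + 2] for k in range(ne))
    # Vertex-local count (assumption): weighted degree, self-edges twice.
    vol = [2 * w for w in self_w]
    for k in range(ne):
        i, j, w = el[3 * k:3 * k + 3]
        vol[i] += w
        vol[j] += w
    return [(el[3 * k + 2] - vol[el[3 * k]] * vol[el[3 * k + 1]] / (2.0 * m)) / m
            for k in range(ne)]


el, self_w = build_graph(3, [(0, 1, 1), (1, 0, 1), (2, 2, 5), (1, 2, 1)])
scores = modularity_scores(el, self_w)
```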



• Original ignored order in the edge array, which killed OpenMP performance.
• New: Roughly bucket edge array by first stored index.
Non-adjacent CSR-like structure.
• New: Hash i, j to determine order. Scatter among buckets.


Implementation: Routines

Three primitives: Scoring, matching, contracting
Scoring: Trivial.
Matching: Repeat until no ready, unmatched vertex:
1 For each unmatched vertex in parallel, find the best unmatched neighbor in its bucket.
2 Try to point remote match at that edge (lock, check if best, unlock).
3 If pointing succeeded, try to point self-match at that edge.
4 If both succeeded, yeah! If not and there was some eligible neighbor, re-add self to ready, unmatched list.
(Possibly too simple, but...)
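
A sequential, round-based stand-in for the matching loop above, not the lock-based code itself. Each round, every ready vertex picks its best unmatched neighbor, and an edge is kept only when both endpoints picked each other. That mutual check replaces the lock / check-if-best / unlock steps. Names and the (score, i, j) tuple layout are hypothetical. Ties are broken by vertex ids so both endpoints rank a shared edge identically, which guarantees at least one match per round while any eligible edge remains.

```python
def match_in_rounds(nv, scored_edges):
    """Return (partner, rounds); partner[v] is v's match or None."""
    adj = [[] for _ in range(nv)]
    for score, i, j in scored_edges:
        if i != j:
            key = (score, min(i, j), max(i, j))  # same rank from both ends
            adj[i].append((key, j))
            adj[j].append((key, i))

    partner = [None] * nv
    ready = [v for v in range(nv) if adj[v]]
    rounds = 0
    while ready:
        rounds += 1
        # Step 1: each ready vertex finds its best unmatched neighbor.
        best = {}
        for v in ready:
            candidates = [(key, u) for key, u in adj[v] if partner[u] is None]
            if candidates:
                best[v] = max(candidates)[1]
        # Steps 2-4: keep an edge only if both endpoints chose it.
        for v, u in best.items():
            if best.get(u) == v:
                partner[v] = u
        # Vertices that had an eligible neighbor but lost go back on the list.
        ready = [v for v in best if partner[v] is None]
    return partner, rounds


partner, rounds = match_in_rounds(4, [(2.0, 0, 1), (3.0, 1, 2), (2.0, 2, 3)])
```

On this path graph the middle edge wins the first round; vertices 0 and 3 retry once, find no unmatched neighbor, and drop out.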


Implementation: Routines

Contracting
1 Map each i, j to new vertices, re-order by hashing.
2 Accumulate counts for new i bins, prefix-sum for offset.
3 Copy into new bins.

• Only synchronizing in the prefix-sum. That could be removed if
I don’t re-order the i, j pair; haven’t timed the difference.
• Actually, the current code copies twice... On short list for
fixing.
• Binning as opposed to original list-chasing enabled
Intel/OpenMP support with reasonable performance.
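
The three contraction steps can be sketched sequentially as below. This is an illustration under stated assumptions, not the authors' code: names are hypothetical, min/max ordering replaces the hashed ordering of each i, j pair, and repeated (i, j) entries inside a bin are left unmerged for brevity.

```python
def contract(new_label, nv_new, edges, self_w):
    """Relabel (i, j, w) edges by community and bin them by first index.

    new_label[v] is the community of old vertex v, in range(nv_new).
    Returns (binned_edges, new_self_w, offsets).
    """
    new_self = [0] * nv_new
    for v, w in enumerate(self_w):
        new_self[new_label[v]] += w

    # Step 1: map each i, j to the new vertices. Edges that collapse
    # inside one community turn into self-edge weight.
    relabeled = []
    for i, j, w in edges:
        a, b = new_label[i], new_label[j]
        if a == b:
            new_self[a] += w
        else:
            relabeled.append((min(a, b), max(a, b), w))

    # Step 2: count edges per bin, then prefix-sum counts into offsets.
    offsets = [0] * (nv_new + 1)
    for a, _, _ in relabeled:
        offsets[a + 1] += 1
    for k in range(nv_new):
        offsets[k + 1] += offsets[k]

    # Step 3: copy each edge into the next free slot of its bin.
    binned = [None] * len(relabeled)
    cursor = offsets[:-1]  # slice copy, so offsets stays intact
    for edge in relabeled:
        binned[cursor[edge[0]]] = edge
        cursor[edge[0]] += 1
    return binned, new_self, offsets


# Merge 0 with 1 and 2 with 3 in a 4-vertex graph.
binned, new_self, offsets = contract(
    [0, 0, 1, 1], 2, [(0, 1, 1), (1, 2, 1), (2, 3, 1), (0, 2, 1)], [0, 0, 0, 0])
```

Bin k occupies binned[offsets[k]:offsets[k + 1]], which is the CSR-like view the bucketed edge array provides.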


Performance summary
Two moderate-sized graphs, one large
Graph              |V|            |E|              Reference
rmat-24-16         15 580 378     262 482 711      [5, 6]
soc-LiveJournal1   4 847 571      68 993 773       [7]
uk-2007-05         105 896 555    3 301 876 564    [8]

Peak processing rates in edges/second
Platform   rmat-24-16    soc-LiveJournal1   uk-2007-05
X5570      1.83 × 10^6   3.89 × 10^6        -
X5650      2.54 × 10^6   4.98 × 10^6        -
E7-8870    5.86 × 10^6   6.90 × 10^6        6.54 × 10^6
XMT        1.20 × 10^6   0.41 × 10^6        -
XMT2       2.11 × 10^6   1.73 × 10^6        3.11 × 10^6


Performance: Time to solution

[Figure: Time (s), log scale from 3 to 3162, versus Threads (OpenMP) / Processors (XMT), 1 to 128. Columns: rmat-24-16, soc-LiveJournal1. Rows: Intel, Cray XMT. Platforms: X5570 (4-core), X5650 (6-core), E7-8870 (10-core), XMT, XMT2.]


Performance: Rate (edges/second)


[Figure: Edges per second, log scale from 10^5 to 10^7.5, versus thread or processor count, 1 to 128. Rows: Intel, Cray XMT.]


Performance: Modularity
[Figure: Modularity (0.0 to 0.8) versus step (0 to 30). Graphs: coAuthorsCiteseer, eu-2005, uk-2002. Termination metrics: Coverage, Max, Average.]

• Timing results: Stop when coverage ≥ 0.5 (communities cover at least half of the edges).
• More work ⇒ higher modularity. Choice up to application.
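
A small sketch of the coverage test, assuming the contracted-graph layout from the data-structure slides: self-edge weights count edges already inside communities, and the remaining edge weights count edges between communities. Function names and the threshold parameter are hypothetical.

```python
def coverage(between_weights, self_weights):
    """Fraction of total edge weight that lies inside communities."""
    inside = sum(self_weights)
    total = inside + sum(between_weights)
    return inside / total if total else 0.0


def should_stop(between_weights, self_weights, threshold=0.5):
    """Termination rule used for the timing results: coverage >= 0.5."""
    return coverage(between_weights, self_weights) >= threshold


# Two communities holding one internal edge each, joined by two edges.
done = should_stop([1, 1], [1, 1])
```

Raising the threshold lets the agglomeration run longer, which is the "more work" trade-off noted above.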


Performance: Small-scale speedup

[Figure: Speed-up over one thread (OpenMP) or processor (XMT), log scale from 1 to 32, versus thread or processor count, 1 to 128. Rows: Intel, Cray XMT.]


Performance: Large-scale time

[Figure: Time (s), log scale, versus thread or processor count, 1 to 64. Platforms: E7-8870 (10-core), XMT2. Annotated times: 31568 s, 6917 s, 1063 s, 504.9 s.]
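
A quick arithmetic check that ties the annotated times to the peak-rate table. It assumes the large-scale runs are on uk-2007-05 (the one large graph in the summary) and that the annotated times pair up as 6917 s to 504.9 s and 31568 s to 1063 s. Both pairings are inferences from the numbers, not statements from the slides.

```python
UK_2007_05_EDGES = 3_301_876_564  # |E| from the performance summary table


def edges_per_second(num_edges, seconds):
    """Processing rate for one run."""
    return num_edges / seconds


def speedup(baseline_seconds, best_seconds):
    """Speed-up of the best run over the single thread/processor run."""
    return baseline_seconds / best_seconds


rate_fast = edges_per_second(UK_2007_05_EDGES, 504.9)   # about 6.54e6
rate_slow = edges_per_second(UK_2007_05_EDGES, 1063.0)  # about 3.11e6
gain_small = speedup(6917.0, 504.9)                     # about 13.7
gain_large = speedup(31568.0, 1063.0)                   # about 29.7
```

The two rates agree with the E7-8870 and XMT2 entries in the uk-2007-05 column. The ratios come out near the 13.7x and 29.6x speed-ups annotated on the large-scale speedup plot, with the small gap on the second plausibly due to rounding of the plotted times.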


Performance: Large-scale speedup

[Figure: Speed up over single processor/thread, log scale, versus thread or processor count, 1 to 64. Platforms: E7-8870 (10-core), XMT2. Annotated speed-ups: 29.6x, 13.7x.]


Conclusions and plans

• Code:
http://www.cc.gatech.edu/~jriedy/community-detection/
• Some low-hanging fruit remains:
• Eliminate one unnecessary copy during contraction.
• Deal with stars.
• Then... Practical experiments.
• How sensitive are modularity and conductance to perturbations?
• What matching schemes work well?
• How do different metrics compare in applications?
• Extending to streaming graph data!
• Includes developing parallel refinement...
• And possibly de-clustering or manipulating the dendrogram....
• Very much WIP, trickier than anticipated.


Acknowledgment of support


Bibliography I

[1] U. Brandes, D. Delling, M. Gaertler, R. Görke, M. Hoefer, Z. Nikoloski, and D. Wagner, “On modularity clustering,” IEEE Trans. Knowledge and Data Engineering, vol. 20, no. 2, pp. 172–188, 2008.
[2] M. Newman, “Modularity and community structure in networks,” Proc. of the National Academy of Sciences, vol. 103, no. 23, pp. 8577–8582, 2006.
[3] A. Clauset, M. Newman, and C. Moore, “Finding community structure in very large networks,” Physical Review E, vol. 70, no. 6, p. 66111, 2004.
[4] K. Wakita and T. Tsurumi, “Finding community structure in mega-scale social networks,” CoRR, vol. abs/cs/0702048, 2007.


Bibliography II

[5] D. Chakrabarti, Y. Zhan, and C. Faloutsos, “R-MAT: A recursive model for graph mining,” in Proc. 4th SIAM Intl. Conf. on Data Mining (SDM). Orlando, FL: SIAM, Apr. 2004.
[6] D. Bader, J. Gilbert, J. Kepner, D. Koester, E. Loh, K. Madduri, W. Mann, and T. Meuse, HPCS SSCA#2 Graph Analysis Benchmark Specifications v1.1, Jul. 2005.
[7] J. Leskovec, “Stanford large network dataset collection,” At http://snap.stanford.edu/data/, Oct. 2011.
[8] P. Boldi, B. Codenotti, M. Santini, and S. Vigna, “Ubicrawler: A scalable fully distributed web crawler,” Software: Practice & Experience, vol. 34, no. 8, pp. 711–726, 2004.

