Overlapping clusters for distributed computation

David F. Gleich, Purdue University, Computer Science Department
Reid Andersen, Microsoft Corp.
Vahab Mirrokni, Google Research, NYC

David Gleich · Purdue
WSDM2012

Problem
Find a good way to distribute a big graph
for solving things like linear systems and simulating random walks

Contributions
Theoretical demonstration that overlap helps
Proof of concept procedure to find overlapping
partitions to reduce communication (~20%)

All code available
http://www.cs.purdue.edu/~dgleich/codes/overlapping


The problem
WHAT OUR NETWORKS LOOK LIKE
WHAT OUR OTHER NETWORKS LOOK LIKE


The problem
COMBINING NETWORKS AND GRAPHS IS A MESS


“Good” data distributions are a fundamental problem in distributed computation.

How to divide the communication graph
Balance work
Balance communication
Balance data
Balance programming complexity too


Current solutions

                            Work       Comm.              Data         Programming
Disjoint vertex partitions  Excellent  Okay to Good       Excellent    “Think like a vertex”
2d or Edge Partitions       Excellent  Excellent          Good         “Impossible”

Where we fit!

Overlapping partitions      Okay       Good to Excellent  “Let’s see”  “Think like a cached vertex”


Goals
Find a set of overlapping clusters where

random walks stay in a cluster for a long time

solving diffusion-like problems requires little communication
(think PageRank, Katz, hitting times, semi-supervised learning)


Related work
Domain decomposition, Schwarz methods
How to solve a linear system with overlap. Szyld et al.
Communication avoiding algorithms
k-step matrix-vector products (Demmel et al.) and growing overlap around partitions (Fritzsche, Frommer, Szyld)
Overlapping communities and link partitioning
algorithms for social network analysis
Link communities (Ahn et al.); surveys by Fortunato and Satu
P2P based PageRank algorithms
Parreira, Castillo, Donato et al.


Overlapping clusters
Each vertex
is in at least one cluster
has one home cluster

Formally, an overlapping cover is $(\mathcal{C}, \tau)$

$\mathcal{C} = \{C_1, C_2, C_3\}$ = set of clusters

$\tau : V \mapsto \mathcal{C}$ = map to homes
$\tau$ is a partition!
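
The definition above can be sketched as a small data structure. This is only an illustration: the names `check_cover` and `home_partition` are hypothetical, clusters are stored as Python sets, the home map stores a cluster index per vertex, and the check assumes that a vertex's home cluster must contain that vertex.

```python
from typing import Dict, Hashable, List, Set


def check_cover(vertices: Set[Hashable],
                clusters: List[Set[Hashable]],
                home: Dict[Hashable, int]) -> bool:
    """Return True if (clusters, home) looks like a valid overlapping cover."""
    for v in vertices:
        # Every vertex needs exactly one home, given as a cluster index.
        if v not in home or not (0 <= home[v] < len(clusters)):
            return False
        # Assumption: the home cluster contains the vertex, which also
        # guarantees that the vertex is in at least one cluster.
        if v not in clusters[home[v]]:
            return False
    return True


def home_partition(clusters, home):
    """Group vertices by home cluster; the groups are disjoint by construction."""
    parts = [set() for _ in clusters]
    for v, c in home.items():
        parts[c].add(v)
    return parts


# Hypothetical 6-vertex example: two clusters that overlap on {2, 3}.
vertices = {0, 1, 2, 3, 4, 5}
clusters = [{0, 1, 2, 3}, {2, 3, 4, 5}]
home = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
```

Here vertices 2 and 3 are stored in both clusters, but each has a single home, so the home map splits the vertex set into {0, 1, 2} and {3, 4, 5}.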


Random walks in overlapping clusters
Each vertex is in at least one cluster
Random walks go to the home cluster after leaving

[Figure labels: "red cluster keeps the walk"; "red cluster sends the walk to gray cluster"]


An evaluation metric: swapping probability
Is $(\mathcal{C}, \tau)$ a good overlapping cover?
Does a random walk swap clusters often?

$\rho_\infty$ = probability that a walk changes clusters on each step
(computable expression in the paper)

[Figure labels: "red cluster keeps the walk"; "red cluster sends the walk to gray cluster"]
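
The paper gives a computable expression for the swapping probability. As a rough illustration of what the quantity measures, the sketch below estimates it by simulation instead. The function name `estimate_swap_probability`, the 12-vertex cycle, and the choice to grow each group by two vertices per side are all assumptions made for this example.

```python
import random


def estimate_swap_probability(adj, clusters, home, steps=100_000, seed=0):
    """Monte Carlo estimate of how often a random walk changes clusters."""
    rng = random.Random(seed)
    v = next(iter(adj))          # start at the first vertex
    current = home[v]            # inside its home cluster
    swaps = 0
    for _ in range(steps):
        v = rng.choice(adj[v])   # one step of a simple random walk
        if v not in clusters[current]:
            # The walk left the current cluster: hand it to the home
            # cluster of the vertex it landed on.
            current = home[v]
            swaps += 1
    return swaps / steps


# Hypothetical test graph: a cycle on 12 vertices, split into 3 groups of 4.
n, group = 12, 4
adj = {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}
home = {v: v // group for v in range(n)}

# Disjoint partition: each cluster is exactly its group.
partition = [{v for v in range(n) if home[v] == g} for g in range(n // group)]
# Overlapping cover: each group grown by two vertices on either side.
overlap = [{(g * group + k) % n for k in range(-2, group + 2)}
           for g in range(n // group)]

rho_partition = estimate_swap_probability(adj, partition, home)
rho_overlap = estimate_swap_probability(adj, overlap, home)
```

On this toy graph the overlapping cover should swap clusters noticeably less often than the partition, because a walk that enters a cluster starts well inside it rather than at its boundary.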


Overlapping clusters
Each vertex
is in at least one cluster

Vol($C$) = sum of degrees of vertices in cluster $C$
MaxVol = upper bound on Vol($C$)
TotalVol($\mathcal{C}$) = sum of Vol($C$) for all clusters $C \in \mathcal{C}$
VolRatio = TotalVol($\mathcal{C}$) / Vol($G$) = how much extra data!
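
These volume quantities are simple to compute from an adjacency list. The sketch below is illustrative only; the helper names `vol` and `vol_ratio` and the 12-vertex cycle are assumptions for the example.

```python
def vol(adj, cluster):
    """Vol(C): sum of the degrees of the vertices in cluster C."""
    return sum(len(adj[v]) for v in cluster)


def vol_ratio(adj, clusters):
    """VolRatio: TotalVol over all clusters divided by Vol(G)."""
    total_vol = sum(vol(adj, c) for c in clusters)
    graph_vol = sum(len(nbrs) for nbrs in adj.values())
    return total_vol / graph_vol


# Hypothetical example: a cycle on 12 vertices (every degree is 2).
n = 12
adj = {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}
partition = [set(range(0, 4)), set(range(4, 8)), set(range(8, 12))]
overlap = [{(s + k) % n for k in range(-2, 6)} for s in (0, 4, 8)]

print(vol_ratio(adj, partition))  # 1.0: a partition stores each vertex once
print(vol_ratio(adj, overlap))    # 2.0: each vertex sits in two clusters
```

A volume ratio of 1 means no extra storage; the overlapping cover in this example stores the graph twice.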


Swapping probability & partitioning
No overlap in this figure!

$\mathcal{P}$ is a partition

$$\rho_\infty(\mathcal{P}) = \frac{1}{\mathrm{Vol}(G)} \sum_{P \in \mathcal{P}} \mathrm{Cut}(P)$$

Much like a classical graph partitioning metric
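
For a partition, the swapping probability is the sum of the cuts divided by the graph volume, which takes only a few lines to evaluate. The sketch below assumes an undirected graph stored as an adjacency list; the helper names and the two-triangle example graph are hypothetical.

```python
def cut(adj, part):
    """Cut(P): number of edges with exactly one endpoint in P."""
    return sum(1 for u in part for w in adj[u] if w not in part)


def swap_probability(adj, parts):
    """Sum of Cut(P) over the parts, divided by Vol(G)."""
    graph_vol = sum(len(nbrs) for nbrs in adj.values())
    return sum(cut(adj, p) for p in parts) / graph_vol


# Hypothetical example: two triangles joined by the single edge 2-3.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
adj = {v: [] for v in range(6)}
for u, w in edges:
    adj[u].append(w)
    adj[w].append(u)

rho = swap_probability(adj, [{0, 1, 2}, {3, 4, 5}])  # (1 + 1) / 14
```

Each triangle has one cut edge and the graph has volume 14, so a stationary walk crosses between the two parts on one step in seven.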


Overlapping clusters vs. partitioning in theory
Take a cycle graph
M groups of $\ell$ vertices
MaxVol = $2\ell$

for partitioning
$$\rho_\infty = \frac{1}{\ell} \quad \text{(Optimal!)}$$

for overlapping
$$\rho_\infty = \frac{4}{\Omega(\ell^2)}$$
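
The partitioning value can be checked by hand. The short derivation below assumes the partition cuts the cycle into $M$ paths of $\ell$ consecutive vertices, so each part has volume $2\ell$ = MaxVol, and uses the partition formula $\rho_\infty(\mathcal{P}) = \frac{1}{\mathrm{Vol}(G)} \sum_{P \in \mathcal{P}} \mathrm{Cut}(P)$.

```latex
\begin{align*}
\mathrm{Vol}(G) &= 2 M \ell
  && \text{(every vertex of the cycle has degree 2)} \\
\mathrm{Cut}(P) &= 2
  && \text{(a path of $\ell$ consecutive vertices has two boundary edges)} \\
\rho_\infty(\mathcal{P}) &= \frac{1}{\mathrm{Vol}(G)} \sum_{P \in \mathcal{P}} \mathrm{Cut}(P)
  = \frac{2M}{2M\ell} = \frac{1}{\ell}
\end{align*}
```

The quadratic denominator in the overlapping case is consistent with a standard random walk fact: a simple walk that starts $a$ steps from one end of a path segment and $b$ steps from the other takes $a \cdot b$ steps in expectation to exit, so a walk handed to a cluster where it sits on the order of $\ell$ steps from both ends stays for on the order of $\ell^2$ steps. The exact construction behind the slide's bound is in the paper.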


Heuristics for finding a good solution
(NP-hard for optimal solution)

Our multi-stage heuristic!
1.  Find a large set of good clusters
    Use personalized PageRank clusters
2.  Find “well contained” nodes (cores)
    Compute expected “leave time”
3.  Cover the graph with core vertices
    Approximately solve a min set-cover problem
4.  Combine clusters up to MaxVol
    The swapping probability is sub-modular
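
Step 3 asks for an approximate minimum set cover. One standard way to get one is the greedy rule sketched below, which repeatedly picks the candidate that covers the most still-uncovered vertices. Whether the authors used exactly this rule is an assumption; the function name and the candidate sets are hypothetical.

```python
def greedy_set_cover(universe, candidates):
    """Greedy approximation to min set cover.

    Returns the indices of the chosen candidate sets, in the order picked.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the candidate that covers the most uncovered elements.
        best = max(range(len(candidates)),
                   key=lambda i: len(candidates[i] & uncovered))
        gained = candidates[best] & uncovered
        if not gained:
            raise ValueError("candidates do not cover the universe")
        chosen.append(best)
        uncovered -= gained
    return chosen


# Hypothetical core sets for four candidate clusters on vertices 0..7.
cores = [{0, 1, 2, 3}, {2, 3, 4}, {4, 5, 6, 7}, {0, 7}]
picked = greedy_set_cover(range(8), cores)  # [0, 2]
```

In this example the first and third candidates already cover every vertex, so the other two are never selected.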


Heuristics for finding a good solution
(NP-hard for optimal solution)

Our multi-stage heuristic!
1.  Find a large set of good clusters
    Use personalized PageRank clusters, or metis
    Each cluster takes “< MaxVol” work
2.  Find “well contained” nodes (cores)
    Compute expected “leave time”
    Takes O(Vol) work per cluster
3.  Cover the graph with core vertices
    Approximately solve a min set-cover problem
    Fast enough
4.  Combine clusters up to MaxVol
    The swapping probability is sub-modular
    Fast enough
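
Step 2 relies on the expected “leave time” of a random walk inside a cluster. A minimal sketch of one way to compute it follows, assuming a simple random walk on an undirected graph and a cluster with at least one edge leaving it. It solves t(v) = 1 + (sum of t(w) over neighbors w inside the cluster) / deg(v) by repeated sweeps, each of which touches every edge endpoint in the cluster once. The function name, the sweep count, and the core threshold are hypothetical.

```python
def expected_leave_time(adj, cluster, sweeps=200):
    """Expected number of steps before a walk started at each vertex of
    `cluster` first steps outside it (Gauss-Seidel sweeps)."""
    t = {v: 0.0 for v in cluster}
    for _ in range(sweeps):
        for v in cluster:
            inside = sum(t[w] for w in adj[v] if w in cluster)
            t[v] = 1.0 + inside / len(adj[v])
    return t


# Hypothetical example: five consecutive vertices on a 12-cycle.
n = 12
adj = {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}
cluster = {0, 1, 2, 3, 4}
t = expected_leave_time(adj, cluster)

threshold = 8.0  # hypothetical cutoff for calling a vertex "well contained"
cores = {v for v in cluster if t[v] >= threshold - 1e-9}
```

On this path the leave times converge to 5, 8, 9, 8, 5, so the three interior vertices would be the cores under the chosen cutoff.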


Demo!


Solving linear systems
Like PageRank, Katz, and
semi-supervised learning


All nodes solve locally using the coordinate descent method.



A core vertex for the
gray cluster.



Red sends residuals to white.
White sends residuals to red.


White then uses the coordinate
descent method to adjust its solution.
This will cause communication to red/blue.


That algorithm is called restricted additive Schwarz.

We look at
PageRank
Katz scores
semi-supervised learning
any SPD or M-matrix linear system
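
To make the iteration concrete, here is a minimal single-process sketch of a restricted additive Schwarz sweep for the PageRank system (I - alpha P^T) x = (1 - alpha) v with uniform v. It is a simplification, not the authors' distributed code: all clusters update synchronously, the graph is assumed undirected with no self-loops, each local solve runs a fixed number of coordinate (Gauss-Seidel) sweeps, and the function name, parameters, and test graph are hypothetical.

```python
def ras_pagerank(adj, clusters, home, alpha=0.85,
                 outer_iters=200, local_sweeps=20):
    """Restricted additive Schwarz sketch for (I - alpha P^T) x = (1 - alpha) v."""
    n = len(adj)
    b = {u: (1.0 - alpha) / n for u in adj}
    x = {u: 0.0 for u in adj}
    for _ in range(outer_iters):
        # Global residual r = b - A x (stands in for exchanging residuals).
        r = {u: b[u] - x[u] + alpha * sum(x[w] / len(adj[w]) for w in adj[u])
             for u in adj}
        update = {}
        for i, cluster in enumerate(clusters):
            # Local solve A[C, C] d = r[C] by coordinate sweeps; the
            # diagonal of A is 1 because there are no self-loops.
            d = {u: 0.0 for u in cluster}
            for _ in range(local_sweeps):
                for u in cluster:
                    d[u] = r[u] + alpha * sum(d[w] / len(adj[w])
                                              for w in adj[u] if w in cluster)
            # Restriction: keep the correction only where cluster i is home.
            for u in cluster:
                if home[u] == i:
                    update[u] = d[u]
        for u, du in update.items():
            x[u] += du
    return x


# Hypothetical example: 12-cycle, three overlapping clusters of 8 vertices,
# each the home of its middle 4 vertices.
n = 12
adj = {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}
home = {v: v // 4 for v in range(n)}
clusters = [{(s + k) % n for k in range(-2, 6)} for s in (0, 4, 8)]
x = ras_pagerank(adj, clusters, home)
```

On a cycle every vertex is equivalent, so the computed PageRank vector should be uniform, which gives a simple check that the sketch converges.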


It works!

[Figure: relative work (0 to 2) versus volume ratio (1 to 1.7). Series: Swapping Probability (usroads), PageRank Communication (usroads), Swapping Probability (web-Google), PageRank Communication (web-Google). Reference: Metis Partitioner (partitioning baseline). Volume ratio is how much more of the graph we need to store.]


Edges are counted twice and some graphs have self-loops. The first group are geometric networks and the second are information networks.

Graph         Vertices |V|   Edges |E|   Max deg   Density |E|/|V|
onera                85567      419201         5               4.9
usroads             126146      323900         7               2.6
annulus             500000     2999258        19               6.0

email-Enron          33696      361622      1383              10.7
soc-Slashdot         77360     1015667      2540              13.1
dico                111982     2750576     68191              24.6
lcsh                144791      394186      1025               2.7
web-Google          855802     8582704      6332              10.0
as-skitter         1694616    22188418     35455              13.1
cit-Patents        3764117    33023481       793               8.8

[Figure: three panels, each with conductance (0 to 1) on the vertical axis.]

The communication ratio of our best result for the PageRank communication volume compared to METIS or GRACLUS shows that the method works for 6 of them (perf. ratio < 1). The communication result is not a bug.

Graph          Comm. of Partition   Comm. of Overlap   Perf. Ratio   Vol. Ratio
onera                       18654                 48         0.003         2.82
usroads                      3256                  0         0.000         1.49
annulus                     12074                  2         0.000         0.01
email-Enron               194536*             235316         1.210         1.7
soc-Slashdot              875435*         1.3 × 10^6         1.480         1.78
dico                  1.5 × 10^6*         2.0 × 10^6         1.320         1.53
lcsh                       73000*              48777         0.668         2.17
web-Google                201159*             167609         0.833         1.57
as-skitter             2.4 × 10^6         3.9 × 10^6         1.645         1.93
cit-Patents            8.7 × 10^6         7.3 × 10^6         0.845         1.34

* means Graclus gave a better partition than Metis.

Finally, we evaluate our heuristic. At left, the cluster combine procedure reduces 10^6 clusters to around 10^2. Middle, combining clusters can decrease the volume
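
The Perf. Ratio column in the communication table appears to be the overlap communication divided by the partition communication. That reading is inferred from the numbers rather than stated on the slide; the sketch below, with the hypothetical helper `perf_ratio`, checks it on the four rows whose counts are given exactly.

```python
def perf_ratio(comm_partition: float, comm_overlap: float) -> float:
    """Communication with overlap relative to the partitioning baseline."""
    return comm_overlap / comm_partition


# (Comm. of Partition, Comm. of Overlap) copied from rows with exact counts.
rows = {
    "onera": (18654, 48),
    "email-Enron": (194536, 235316),
    "lcsh": (73000, 48777),
    "web-Google": (201159, 167609),
}
ratios = {name: perf_ratio(part, over) for name, (part, over) in rows.items()}
for name, ratio in ratios.items():
    print(f"{name:12s} {ratio:.3f}")
```

A ratio below 1 means the overlapping cover communicated less than the best disjoint partition for that graph.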

Summary
Overlap helps reduce communication in a distributed process!
Proof of concept procedure to find overlapping partitions to reduce communication

Future work
Truly distributed implementation and evaluation
Can we exploit data redundancy to solve problems on large graphs faster?
[Figure: two copies (Copy 1, Copy 2) of an edge list of src -> dst pairs.]

All code available
http://www.cs.purdue.edu/~dgleich/codes/overlapping
