Overlapping Clusters for Distributed Computation
DAVID F. GLEICH, PURDUE UNIVERSITY, COMPUTER SCIENCE DEPARTMENT
REID ANDERSEN, MICROSOFT CORP.
VAHAB MIRROKNI, GOOGLE RESEARCH, NYC




David Gleich · Purdue · WSDM2012
Problem 
Find a good way to distribute a big graph 
    for solving things like linear systems and simulating random walks

Contributions
Theoretical demonstration that overlap helps
Proof of concept procedure to find overlapping
partitions to reduce communication (~20%)

All code available
http://www.cs.purdue.edu/~dgleich/codes/overlapping





The problem
     WHAT OUR NETWORKS LOOK LIKE
     WHAT OUR OTHER NETWORKS LOOK LIKE




The problem
     COMBINING NETWORKS AND GRAPHS IS A MESS




“Good” data distributions are a fundamental problem in distributed computation.

How to divide the communication graph
Balance work
Balance communication
Balance data
Balance programming complexity too




Current solutions
                              Work        Comm.               Data          Programming
Disjoint vertex partitions    Excellent   Okay to Good        Excellent     “Think like a vertex”
2d or Edge partitions         Excellent   Excellent           Good          “Impossible”

Where we fit
Overlapping partitions        Okay        Good to Excellent   “Let’s see”   “Think like a cached vertex”




Goals
Find a set of overlapping clusters where

random walks stay in a cluster for a long time

solving diffusion-like problems requires little communication
 (think PageRank, Katz, hitting times, semi-supervised learning)




Related work
Domain decomposition, Schwarz methods
 How to solve a linear system with overlap. Szyld et al.
Communication avoiding algorithms
 k-step matrix-vector products (Demmel et al.) and
 growing overlap around partitions (Fritzsche, Frommer, Szyld)
Overlapping communities and link partitioning
algorithms for social network analysis
 Link communities (Ahn et al.); surveys by Fortunato and Satu
P2P based PageRank algorithms
 Parreira, Castillo, Donato et al. 




Overlapping clusters
Each vertex
   is in at least one cluster
   has one home cluster

Formally, an overlapping cover is $(\mathcal{C}, \tau)$
   $\mathcal{C}$ = set of clusters
   $\tau : V \mapsto \mathcal{C}$ = map to homes
   $\tau$ is a partition




Random walks in overlapping clusters
Each vertex
   is in at least one cluster
   has one home cluster

Random walks go to the home cluster after leaving

[Figure labels: red cluster keeps the walk; red cluster sends the walk to gray cluster]
                                      
                                      




An evaluation metric: swapping probability
Is $(\mathcal{C}, \tau)$ a good overlapping cover?
Does a random walk swap clusters often?

$\rho_\infty$ = probability that a walk changes clusters on each step
   (computable expression in the paper)

[Figure labels: red cluster keeps the walk; red cluster sends the walk to gray cluster]
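
The swapping probability can also be estimated by simulation. The sketch below is not the computable expression from the paper; it is a Monte Carlo estimate under the walk rule stated on the slides (a cluster keeps the walk while the next vertex belongs to it, otherwise the walk moves to the home cluster of the new vertex). The function name, the adjacency-dict graph format, and the parameters `steps` and `seed` are assumptions made for illustration.

```python
import random


def estimate_swapping_probability(adj, clusters, home, steps=100_000, seed=0):
    """Estimate the fraction of random-walk steps that change clusters.

    adj:      dict vertex -> list of neighbours (undirected graph, no isolated vertices)
    clusters: list of sets of vertices (the cover C)
    home:     dict vertex -> index into `clusters` (the home map tau)
    """
    rng = random.Random(seed)
    v = rng.choice(list(adj))
    c = home[v]                      # start in the home cluster of the start vertex
    swaps = 0
    for _ in range(steps):
        v = rng.choice(adj[v])       # one step of the simple random walk
        if v not in clusters[c]:     # the walk left its current cluster ...
            c = home[v]              # ... so it is sent to the new vertex's home
            swaps += 1
    return swaps / steps


# Hypothetical example: a path 0-1-2-3-4-5 covered by two overlapping clusters.
path = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 5] for i in range(6)}
cover = [{0, 1, 2, 3}, {2, 3, 4, 5}]
homes = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
estimate = estimate_swapping_probability(path, cover, homes)
```

A long walk is used instead of many short ones so that the estimate reflects the long-run behaviour of the walk rather than the choice of start vertex.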




Overlapping clusters
Each vertex
   is in at least one cluster
   has one home cluster

$\mathrm{Vol}(C)$ = sum of degrees of vertices in cluster $C$
MaxVol = upper bound on $\mathrm{Vol}(C)$
$\mathrm{TotalVol}(\mathcal{C}) = \sum_{C \in \mathcal{C}} \mathrm{Vol}(C)$ = sum of $\mathrm{Vol}(C)$ for all clusters
VolRatio = $\mathrm{TotalVol}(\mathcal{C}) / \mathrm{Vol}(G)$ = how much extra data
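
As a small illustration of these volume definitions, the sketch below computes Vol, TotalVol, and VolRatio for a toy graph. The function names and the adjacency-dict format are assumptions for the example, not part of the released code.

```python
def vol(adj, vertices):
    """Sum of degrees of the given vertices."""
    return sum(len(adj[v]) for v in vertices)


def volume_ratio(adj, clusters):
    """TotalVol(C) / Vol(G): how much extra data the cover stores."""
    total_vol = sum(vol(adj, c) for c in clusters)
    return total_vol / vol(adj, adj)   # iterating the dict visits every vertex


# Path graph 0-1-2-3: degrees are 1, 2, 2, 1, so Vol(G) = 6.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
overlapping = [{0, 1, 2}, {1, 2, 3}]   # Vol = 5 each, TotalVol = 10
disjoint = [{0, 1}, {2, 3}]            # Vol = 3 each, TotalVol = 6
ratio_overlapping = volume_ratio(path, overlapping)
ratio_disjoint = volume_ratio(path, disjoint)
```

A disjoint partition always has VolRatio equal to 1, so any value above 1 measures the storage paid for overlap.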




Swapping probability & partitioning
(No overlap in this figure)

$\mathcal{P}$ is a partition

$\rho_\infty(\mathcal{P}) = \frac{1}{\mathrm{Vol}(G)} \sum_{P \in \mathcal{P}} \mathrm{Cut}(P)$

Much like a classical graph partitioning metric
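
For a partition, the formula on this slide is easy to evaluate directly. The sketch below assumes Cut(P) counts the edges leaving part P and that the graph is an adjacency dict; the function name is hypothetical.

```python
def partition_swapping_probability(adj, parts):
    """(1 / Vol(G)) * sum over parts P of Cut(P), for a disjoint partition."""
    vol_g = sum(len(nbrs) for nbrs in adj.values())
    cut_total = 0
    for part in parts:
        # Cut(P): edges with one endpoint inside the part and one outside.
        cut_total += sum(1 for v in part for u in adj[v] if u not in part)
    return cut_total / vol_g


# Two triangles joined by the single edge 2-3: Vol(G) = 14, each part has Cut = 1.
two_triangles = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
rho = partition_swapping_probability(two_triangles, [{0, 1, 2}, {3, 4, 5}])
```

Each cut edge is counted once from each side, which matches a walk that can cross it in either direction.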




Overlapping clusters vs. partitioning in theory
Take a cycle graph
   $M$ groups of $\ell$ vertices
   MaxVol = $2\ell$

for partitioning
   $\rho_\infty = \frac{1}{\ell}$   (Optimal!)
for overlapping
   $\rho_\infty = \frac{4}{\Omega(\ell^2)}$
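
The partitioning value can be checked by hand from the cut formula for a partition. The short derivation below assumes the $M$ groups are contiguous arcs of the cycle, so each group has exactly two boundary edges.

```latex
\[
\mathrm{Vol}(G) = 2M\ell, \qquad \mathrm{Cut}(P) = 2 \ \text{for each of the } M \text{ groups},
\]
\[
\rho_\infty(\mathcal{P}) = \frac{1}{\mathrm{Vol}(G)} \sum_{P \in \mathcal{P}} \mathrm{Cut}(P)
 = \frac{2M}{2M\ell} = \frac{1}{\ell}.
\]
```

Compared with the overlapping bound of $4/\Omega(\ell^2)$, the advantage of overlap on this graph grows linearly in $\ell$.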




Heuristics for finding good overlapping clusters
(NP-hard for optimal solution)

      Our multi-stage heuristic
      1.  Find a large set of good clusters
          Use personalized PageRank clusters
      2.  Find “well contained” nodes (cores)
          Compute expected “leave time”
      3.  Cover the graph with core vertices
          Approximately solve a min set-cover problem
      4.  Combine clusters up to MaxVol
          The swapping probability is sub-modular
      




Heuristics for finding good overlapping clusters
(NP-hard for optimal solution)

      Our multi-stage heuristic
      1.  Find a large set of good clusters
          Use personalized PageRank clusters, or Metis
          (Each cluster takes “< MaxVol” work)
      2.  Find “well contained” nodes (cores)
          Compute expected “leave time”
          (Takes O(Vol) work per cluster)
      3.  Cover the graph with core vertices
          Approximately solve a min set-cover problem
          (Fast enough)
      4.  Combine clusters up to MaxVol
          The swapping probability is sub-modular
          (Fast enough)
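
Step 3 can be illustrated with the classical greedy rule for set cover. The slides do not say which approximation the authors use, so the greedy choice here is an assumption; the function name and the toy core sets are hypothetical.

```python
def greedy_set_cover(universe, candidate_sets):
    """Pick sets greedily, always taking the one that covers the most
    still-uncovered elements. Returns the indices of the chosen sets."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(range(len(candidate_sets)),
                   key=lambda i: len(candidate_sets[i] & uncovered))
        gained = candidate_sets[best] & uncovered
        if not gained:
            raise ValueError("candidate sets do not cover the universe")
        chosen.append(best)
        uncovered -= gained
    return chosen


# Hypothetical core sets of four candidate clusters on six vertices.
cores = [{0, 1, 2}, {2, 3}, {3, 4, 5}, {1, 4}]
picked = greedy_set_cover(range(6), cores)   # indices into `cores`
```

Greedy selection is a natural fit for a "fast enough" stage because each round only needs the size of an intersection per candidate cluster.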

      




Demo!




Solving linear systems
 Like PageRank, Katz, and semi-supervised learning




All nodes solve locally using the coordinate descent method.




All nodes solve locally using the coordinate descent method.

A core vertex for the gray cluster.




All nodes solve locally using the coordinate descent method.

Red sends residuals to white.
White sends residuals to red.




White then uses the coordinate descent method to adjust its solution.
This will cause communication to red/blue.




That algorithm is called restricted additive Schwarz.

It applies to
  PageRank (we look at PageRank!)
  Katz scores
  semi-supervised learning
  any SPD or M-matrix linear system
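
The local solve can be pictured with a residual-driven coordinate update for the PageRank system $(I - \alpha P)x = (1 - \alpha)v$. The sketch below runs on a single machine with a uniform $v$ and always relaxes the vertex with the largest residual; that the authors use exactly this selection rule is an assumption, and the function name and parameters are hypothetical. It is not an implementation of restricted additive Schwarz, only of the per-cluster building block.

```python
def pagerank_coordinate_relaxation(adj, alpha=0.85, tol=1e-10, max_updates=1_000_000):
    """Solve (I - alpha * P) x = (1 - alpha) / n with P = A D^{-1}.

    adj: dict vertex -> list of neighbours; every vertex needs degree >= 1.
    Returns the approximate solution x and the final residual r.
    """
    n = len(adj)
    x = {v: 0.0 for v in adj}
    r = {v: (1.0 - alpha) / n for v in adj}   # residual of the zero solution
    for _ in range(max_updates):
        u = max(r, key=r.get)                 # largest residual (linear scan, for clarity)
        if r[u] < tol:
            break
        delta = r[u]
        x[u] += delta                         # fix the equation for vertex u
        r[u] = 0.0
        share = alpha * delta / len(adj[u])
        for w in adj[u]:
            r[w] += share                     # residual pushed to neighbours; a neighbour
                                              # stored elsewhere would need a message
    return x, r


# On a triangle every vertex is equivalent, so each score should be near 1/3.
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
scores, residual = pagerank_coordinate_relaxation(triangle)
```

The residual updates on neighbours are where communication arises: only residual mass that lands on a vertex outside the local cluster has to be sent.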




It works!
[Figure: relative work versus volume ratio (how much more of the graph we need to store). Series: Swapping Probability (usroads), PageRank Communication (usroads), Swapping Probability (web-Google), PageRank Communication (web-Google). Reference line: Metis partitioner (partitioning baseline).]




Edges are counted twice and some graphs have self-loops. The first group are geometric networks and the second are information networks.

Graph           Vertices |V|    Edges |E|    Max deg    Density |E|/|V|
onera           85567           419201       5          4.9
usroads         126146          323900       7          2.6
annulus         500000          2999258      19         6.0

email-Enron     33696           361622       1383       10.7
soc-Slashdot    77360           1015667      2540       13.1
dico            111982          2750576      68191      24.6
lcsh            144791          394186       1025       2.7
web-Google      855802          8582704      6332       10.0
as-skitter      1694616         22188418     35455      13.1
cit-Patents     3764117         33023481     793        8.8

[Figure: three conductance plots; vertical axis: Conductance.]
The communication ratio of our best result for the PageRank communication volume compared to METIS or GRACLUS shows that the method works for 6 of them (perf. ratio < 1). The communication result is not a bug.

Graph           Comm. of Partition    Comm. of Overlap    Perf. Ratio    Vol. Ratio
onera           18654                 48                  0.003          2.82
usroads         3256                  0                   0.000          1.49
annulus         12074                 2                   0.000          0.01
email-Enron     194536*               235316              1.210          1.7
soc-Slashdot    875435*               1.3 × 10^6          1.480          1.78
dico            1.5 × 10^6 *          2.0 × 10^6          1.320          1.53
lcsh            73000*                48777               0.668          2.17
web-Google      201159*               167609              0.833          1.57
as-skitter      2.4 × 10^6            3.9 × 10^6          1.645          1.93
cit-Patents     8.7 × 10^6            7.3 × 10^6          0.845          1.34

* means Graclus gave a better partition than Metis

Finally, we evaluate our heuristic. At left, the cluster combine procedure reduces 10^6 clusters to around 10^2. Middle, combining clusters can decrease the volume
Summary
Overlap helps reduce communication in a distributed process
Proof of concept procedure to find overlapping partitions to reduce communication

Future work
Truly distributed implementation and evaluation
Can we exploit data redundancy to solve problems on large graphs faster?
   Copy 1: src -> dst, src -> dst, src -> dst
   Copy 2: src -> dst, src -> dst, src -> dst

All code available
http://www.cs.purdue.edu/~dgleich/codes/overlapping



