Overlapping Clusters for Distributed Computation
DAVID F. GLEICH, PURDUE UNIVERSITY, COMPUTER SCIENCE DEPARTMENT
REID ANDERSEN, MICROSOFT CORP.
VAHAB MIRROKNI, GOOGLE RESEARCH, NYC




                                                                      1
                                 David Gleich · Purdue
   WSDM2012
Problem 
Find a good way to distribute a big graph 
    for solving things like linear systems and simulating random walks

Contributions
Theoretical demonstration that overlap helps
Proof of concept procedure to find overlapping
partitions to reduce communication (~20%)

All code available
http://www.cs.purdue.edu/~dgleich/codes/overlapping





The problem
     WHAT OUR NETWORKS LOOK LIKE
     WHAT OUR OTHER NETWORKS LOOK LIKE




The problem
     COMBINING NETWORKS AND GRAPHS IS A MESS




“Good” data distributions are a fundamental problem in distributed computation.

How to divide the communication graph!
Balance work
Balance communication
Balance data
Balance programming complexity too




Current solutions
                             Work        Comm.               Data          Programming

Disjoint vertex partitions   Excellent   Okay to Good        Excellent     “Think like a vertex”
2d or Edge Partitions        Excellent   Excellent           Good          “Impossible”

Where we fit!

Overlapping partitions       Okay        Good to Excellent   “Let’s see”   “Think like a cached vertex”




Goals
Find a set of overlapping clusters where

random walks stay in a cluster for a long time

solving diffusion-like problems requires little communication
 (think PageRank, Katz, hitting times, semi-supervised learning)




Related work
Domain decomposition, Schwarz methods
 How to solve a linear system with overlap. Szyld et al.
Communication avoiding algorithms
 k-step matrix-vector products (Demmel et al.) and
 growing overlap around partitions (Fritzsche, Frommer, Szyld)
Overlapping communities and link partitioning
algorithms for social network analysis
 Link communities (Ahn et al.); surveys by Fortunato and Satu
P2P based PageRank algorithms
 Parreira, Castillo, Donato et al. 




Overlapping clusters
Each vertex
  is in at least one cluster
  has one home cluster

Formally, an overlapping cover is $(\mathcal{C}, \tau)$
  $\mathcal{C}$ = set of clusters
  $\tau : V \mapsto \mathcal{C}$ = map to homes
  $\tau$ is a partition!
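
A minimal sketch of this definition, assuming clusters are stored as Python sets and the home map τ as a dictionary from vertex to cluster index. The names `is_valid_cover` and `home_partition` are hypothetical, and the requirement that a vertex's home cluster actually contains it is an assumption read off the slide, not a statement from the paper:

```python
def is_valid_cover(vertices, clusters, home):
    """Check that every vertex has a home cluster and belongs to it.

    Belonging to the home cluster also implies the vertex is in at
    least one cluster, which is the other condition on the slide.
    """
    for v in vertices:
        if v not in home:
            return False
        h = home[v]
        if not (0 <= h < len(clusters)) or v not in clusters[h]:
            return False
    return True


def home_partition(vertices, clusters, home):
    """Group vertices by home cluster; the groups are disjoint and cover V."""
    groups = [set() for _ in clusters]
    for v in vertices:
        groups[home[v]].add(v)
    return groups


# Toy example: two clusters that overlap on vertices 2 and 3.
vertices = list(range(6))
clusters = [{0, 1, 2, 3}, {2, 3, 4, 5}]
home = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
```

Here vertices 2 and 3 are stored in both clusters, but each has exactly one home, so grouping by home yields a disjoint partition of the vertex set.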




Random walks in overlapping clusters
Each vertex
  is in at least one cluster
  has one home cluster

Random walks go to the home cluster after leaving

[Figure: the red cluster keeps the walk; the red cluster sends the walk to the gray cluster]
                                      
                                      




An evaluation metric: Swapping probability
Is $(\mathcal{C}, \tau)$ a good overlapping cover?
Does a random walk swap clusters often?

$\rho_\infty$ = probability that a walk changes clusters on each step
  (computable expression in the paper)

[Figure: the red cluster keeps the walk; the red cluster sends the walk to the gray cluster]
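
The swapping probability can also be estimated empirically. The sketch below is a Monte Carlo illustration, not the computable expression from the paper: it simulates a simple random walk, keeps the walk in its current cluster while it can, and moves it to the home cluster of the new vertex when it leaves. The function names, the cycle test graph, and the overlap width are all assumptions made for the example:

```python
import random


def cycle_graph(n):
    """Adjacency lists of an n-cycle."""
    return {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}


def estimate_swap_probability(adj, clusters, home, steps=100_000, seed=0):
    """Fraction of steps on which the walk changes clusters."""
    rng = random.Random(seed)
    v = next(iter(adj))
    current = home[v]
    swaps = 0
    for _ in range(steps):
        v = rng.choice(adj[v])
        if v not in clusters[current]:
            # The walk left its cluster: send it to the home of the new vertex.
            current = home[v]
            swaps += 1
    return swaps / steps


n, ell = 12, 4
adj = cycle_graph(n)
home = {v: v // ell for v in range(n)}
parts = [{v for v in range(n) if v // ell == i} for i in range(n // ell)]
# Grow each part by two vertices on both sides to create overlap.
overlapping = [{(v + d) % n for v in part for d in (-2, -1, 0, 1, 2)}
               for part in parts]

rho_partition = estimate_swap_probability(adj, parts, home)
rho_overlap = estimate_swap_probability(adj, overlapping, home)
```

On this toy cycle the partition estimate should land near 1/4, and the overlapping estimate should come out well below it, at the cost of storing each vertex more than once.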




Overlapping clusters
Each vertex
  is in at least one cluster
  has one home cluster

Vol(C) = sum of degrees of vertices in cluster C
MaxVol = upper bound on Vol(C)
TotalVol($\mathcal{C}$) = sum of Vol(C) for all clusters
VolRatio = TotalVol($\mathcal{C}$) / Vol(G)
  how much extra data!
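
These quantities are direct to compute from adjacency lists. A small sketch, with hypothetical function names and a toy 12-cycle chosen only for illustration:

```python
def vol(adj, cluster):
    """Vol(C): sum of degrees of the vertices in the cluster."""
    return sum(len(adj[v]) for v in cluster)


def total_vol(adj, clusters):
    """TotalVol: sum of Vol(C) over all clusters in the cover."""
    return sum(vol(adj, c) for c in clusters)


def vol_ratio(adj, clusters):
    """VolRatio: total stored volume relative to the volume of the graph."""
    return total_vol(adj, clusters) / vol(adj, adj.keys())


n = 12
adj = {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}
parts = [set(range(0, 4)), set(range(4, 8)), set(range(8, 12))]
# Each part grown by two vertices on both sides.
overlapping = [{(v + d) % n for v in part for d in (-2, -1, 0, 1, 2)}
               for part in parts]

ratio_partition = vol_ratio(adj, parts)
ratio_overlap = vol_ratio(adj, overlapping)
max_vol_needed = max(vol(adj, c) for c in overlapping)
```

A disjoint partition always has VolRatio 1. In this example each overlapping cluster holds 8 of the 12 vertices, so the cover stores the graph twice over.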




Swapping probability & partitioning
$\mathcal{P}$ is a partition

$\rho_\infty(\mathcal{P}) = \frac{1}{\mathrm{Vol}(G)} \sum_{P \in \mathcal{P}} \mathrm{Cut}(P)$

Much like a classical graph partitioning metric

No overlap in this figure!
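
For a partition, this expression can be evaluated by counting cut edges. The sketch below assumes an unweighted, undirected graph given as adjacency lists; `partition_swap_probability` is a hypothetical name:

```python
def partition_swap_probability(adj, parts):
    """Evaluate (1 / Vol(G)) * sum over parts P of Cut(P)."""
    part_of = {v: i for i, part in enumerate(parts) for v in part}
    vol_g = sum(len(nbrs) for nbrs in adj.values())
    # Each cut edge is seen once from each endpoint, which matches
    # summing Cut(P) over both parts that the edge separates.
    cut_sum = sum(1 for u, nbrs in adj.items()
                  for w in nbrs if part_of[u] != part_of[w])
    return cut_sum / vol_g


# Two triangles joined by the single edge 2-3.
adj = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
parts = [{0, 1, 2}, {3, 4, 5}]
rho = partition_swap_probability(adj, parts)
```

Here Vol(G) = 14 and each triangle has one cut edge, so the value is 2/14 = 1/7.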




Overlapping clusters vs. Partitioning in theory
Take a cycle graph
  M groups of $\ell$ vertices
  MaxVol = $2\ell$

for partitioning
  $\rho_\infty = \frac{1}{\ell}$ (Optimal!)
for overlapping
  $\rho_\infty = \frac{4}{\Omega(\ell^2)}$
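
The partitioning value follows from the cut formula for a partition, assuming the M groups are contiguous arcs of the cycle. Every vertex has degree 2, and each arc is separated from the rest of the cycle by exactly two edges:

```latex
\[
\mathrm{Vol}(G) = 2M\ell, \qquad
\sum_{P \in \mathcal{P}} \mathrm{Cut}(P) = 2M, \qquad
\rho_\infty(\mathcal{P}) = \frac{1}{\mathrm{Vol}(G)} \sum_{P \in \mathcal{P}} \mathrm{Cut}(P)
  = \frac{2M}{2M\ell} = \frac{1}{\ell}.
\]
```

Each arc also has volume exactly $2\ell$, so this partition meets the MaxVol bound with equality.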




Heuristics for finding good overlapping clusters
(NP-hard for optimal solution ☹)

      Our multi-stage heuristic!
      1.  Find a large set of good clusters
          Use personalized PageRank clusters
      2.  Find “well contained” nodes (cores)
          Compute expected “leave time”
      3.  Cover the graph with core vertices
          Approximately solve a min set-cover problem
      4.  Combine clusters up to MaxVol
          The swapping probability is sub-modular
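
One way to read the “leave time” in step 2 is as the expected number of steps before a random walk started at a vertex first exits its cluster. That reading is an assumption, as are the function name and the use of a dense NumPy solve. Under it, the leave times t satisfy the linear system (I - P_CC) t = 1, where P_CC is the walk's transition matrix restricted to the cluster:

```python
import numpy as np


def expected_leave_times(adj, cluster):
    """Expected steps until a simple random walk first exits the cluster.

    Assumes at least one edge leaves the cluster; otherwise the walk
    never exits and the linear system is singular.
    """
    nodes = sorted(cluster)
    index = {v: i for i, v in enumerate(nodes)}
    size = len(nodes)
    P = np.zeros((size, size))
    for v in nodes:
        for w in adj[v]:
            if w in index:
                P[index[v], index[w]] += 1.0 / len(adj[v])
    t = np.linalg.solve(np.eye(size) - P, np.ones(size))
    return dict(zip(nodes, t))


n = 8
adj = {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}
leave = expected_leave_times(adj, {0, 1, 2})
```

On this 8-cycle, the middle vertex of the three-vertex cluster has the longest leave time, which matches the intuition that well-contained vertices sit away from the cluster boundary.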
      




Heuristics for finding good overlapping clusters
(NP-hard for optimal solution ☹)

      Our multi-stage heuristic!
      1.  Find a large set of good clusters
          Use personalized PageRank clusters, or metis
          (Each cluster takes “< MaxVol” work)
      2.  Find “well contained” nodes (cores)
          Compute expected “leave time”
          (Takes O(Vol) work per cluster)
      3.  Cover the graph with core vertices
          Approximately solve a min set-cover problem
          (Fast enough)
      4.  Combine clusters up to MaxVol
          The swapping probability is sub-modular
          (Fast enough)
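
Step 3 can be illustrated with the standard greedy approximation for set cover. The slides do not say which approximation the authors use, so the greedy rule, the function name, and the toy core sets below are assumptions:

```python
def greedy_set_cover(universe, candidate_sets):
    """Repeatedly pick the set that covers the most uncovered elements.

    Returns the indices of the chosen sets in the order they were picked.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(range(len(candidate_sets)),
                   key=lambda i: len(uncovered & candidate_sets[i]))
        gain = uncovered & candidate_sets[best]
        if not gain:
            raise ValueError("candidate sets do not cover the universe")
        chosen.append(best)
        uncovered -= gain
    return chosen


# Hypothetical core sets for four candidate clusters on six vertices.
cores = [{0, 1, 2}, {2, 3}, {3, 4, 5}, {1, 4}]
picked = greedy_set_cover(range(6), cores)
```

In this example two of the four candidate core sets are enough to cover every vertex, so the other two clusters would be dropped before the combine step.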

      




Demo!




Solving linear systems
 Like PageRank, Katz, and semi-supervised learning




All nodes solve locally using the coordinate descent method.




All nodes solve locally using the coordinate descent method.




A core vertex for the
gray cluster.




All nodes solve locally using the coordinate descent method.

Red sends residuals to white.
White sends residuals to red.




White then uses the coordinate descent method to adjust its solution.
This will cause communication to red/blue.




That algorithm is called restricted additive Schwarz.

  PageRank (we look at PageRank!)
  Katz scores
  semi-supervised learning
  any SPD or M-matrix linear system
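
A rough sketch of a restricted additive Schwarz iteration for a PageRank system on a small cycle. Each cluster solves its local block against the current residual but only updates the vertices it is home to. The slides describe local solves by coordinate descent with residual exchange between machines; this single-process sketch swaps in a dense direct solve per cluster for brevity, and all names, the graph, and the parameter choices are assumptions:

```python
import numpy as np


def cycle_pagerank_system(n, alpha):
    """A = I - alpha * P^T and b = (1 - alpha) / n for an n-cycle."""
    P = np.zeros((n, n))
    for i in range(n):
        P[i, (i - 1) % n] = 0.5
        P[i, (i + 1) % n] = 0.5
    A = np.eye(n) - alpha * P.T
    b = np.full(n, (1.0 - alpha) / n)
    return A, b


def restricted_additive_schwarz(A, b, clusters, homes, iters=200):
    """Solve on overlapping clusters, keep updates on home vertices only."""
    x = np.zeros_like(b)
    for _ in range(iters):
        r = b - A @ x                       # residual shared by all clusters
        x_new = x.copy()
        for S, H in zip(clusters, homes):
            S = np.asarray(S)
            d = np.linalg.solve(A[np.ix_(S, S)], r[S])   # local correction
            keep = np.isin(S, H)            # restrict to the home vertices
            x_new[S[keep]] += d[keep]
        x = x_new
    return x


n, alpha = 12, 0.85
A, b = cycle_pagerank_system(n, alpha)
homes = [list(range(s, s + 4)) for s in (0, 4, 8)]
# Each cluster is its home block plus one extra vertex on each side.
clusters = [[i % n for i in range(s - 1, s + 5)] for s in (0, 4, 8)]
x = restricted_additive_schwarz(A, b, clusters, homes)
```

Because the cycle is regular, the exact PageRank vector here is uniform, which makes it easy to check that the iteration has converged.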




It works!
[Figure: Relative Work (communication) plotted against Volume Ratio (how much more of the graph we need to store), for volume ratios from 1 to 1.7. Series: Swapping Probability (usroads), PageRank Communication (usroads), Swapping Probability (web-Google), PageRank Communication (web-Google). Reference: Metis Partitioner (partitioning baseline).]




Edges are counted twice and some graphs have self-loops. The first group are geometric networks and the second are information networks.

Graph          Vertices |V|   Edges |E|   Max deg   Density |E|/|V|
onera          85567          419201      5         4.9
usroads        126146         323900      7         2.6
annulus        500000         2999258     19        6.0

email-Enron    33696          361622      1383      10.7
soc-Slashdot   77360          1015667     2540      13.1
dico           111982         2750576     68191     24.6
lcsh           144791         394186      1025      2.7
web-Google     855802         8582704     6332      10.0
as-skitter     1694616        22188418    35455     13.1
cit-Patents    3764117        33023481    793       8.8

[Figure: three panels, each with Conductance on the vertical axis (0 to 1).]
The communication ratio of our best result for the PageRank communication volume compared to METIS or GRACLUS shows that the method works for 6 of them (perf. ratio < 1). The 0 communication result is not a bug.
Graph          Comm. of Partition   Comm. of Overlap   Perf. Ratio   Vol. Ratio
onera          18654                48                 0.003         2.82
usroads        3256                 0                  0.000         1.49
annulus        12074                2                  0.000         0.01
email-Enron    194536*              235316             1.210         1.7
soc-Slashdot   875435*              1.3 × 10^6         1.480         1.78
dico           1.5 × 10^6 *         2.0 × 10^6         1.320         1.53
lcsh           73000*               48777              0.668         2.17
web-Google     201159*              167609             0.833         1.57
as-skitter     2.4 × 10^6           3.9 × 10^6         1.645         1.93
cit-Patents    8.7 × 10^6           7.3 × 10^6         0.845         1.34

* means Graclus gave a better partition than Metis

Finally, we evaluate our heuristic. At left, the cluster combine procedure reduces 10^6 clusters to around 10^2. Middle, combining clusters can decrease the volume
Summary
Overlap helps reduce communication in a distributed process!
Proof of concept procedure to find overlapping partitions to reduce communication

Future work
Truly distributed implementation and evaluation
Can we exploit data redundancy to solve problems on large graphs faster?

[Figure: Copy 1 and Copy 2, each a list of src -> dst edges]

All code available
http://www.cs.purdue.edu/~dgleich/codes/overlapping





More Related Content

What's hot

GAN in medical imaging
GAN in medical imagingGAN in medical imaging
GAN in medical imaging
Cheng-Bin Jin
 
7. logistics regression using spss
7. logistics regression using spss7. logistics regression using spss
7. logistics regression using spss
Dr Nisha Arora
 
Visualizing Data Using t-SNE
Visualizing Data Using t-SNEVisualizing Data Using t-SNE
Visualizing Data Using t-SNE
David Khosid
 
Community Detection in Social Networks: A Brief Overview
Community Detection in Social Networks: A Brief OverviewCommunity Detection in Social Networks: A Brief Overview
Community Detection in Social Networks: A Brief Overview
Satyaki Sikdar
 
Dimensionality Reduction | Machine Learning | CloudxLab
Dimensionality Reduction | Machine Learning | CloudxLabDimensionality Reduction | Machine Learning | CloudxLab
Dimensionality Reduction | Machine Learning | CloudxLab
CloudxLab
 
Linear regression
Linear regression Linear regression
Linear regression
mohamed Naas
 
DataEngConf: Feature Extraction: Modern Questions and Challenges at Google
DataEngConf: Feature Extraction: Modern Questions and Challenges at GoogleDataEngConf: Feature Extraction: Modern Questions and Challenges at Google
DataEngConf: Feature Extraction: Modern Questions and Challenges at Google
Hakka Labs
 
Kdd 2014 Tutorial - the recommender problem revisited
Kdd 2014 Tutorial -  the recommender problem revisitedKdd 2014 Tutorial -  the recommender problem revisited
Kdd 2014 Tutorial - the recommender problem revisited
Xavier Amatriain
 
Lec15 sfm
Lec15 sfmLec15 sfm
Lec15 sfm
BaliThorat1
 
The world of loss function
The world of loss functionThe world of loss function
The world of loss function
홍배 김
 
DMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluationDMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluation
Pier Luca Lanzi
 
Linear Regression and Logistic Regression in ML
Linear Regression and Logistic Regression in MLLinear Regression and Logistic Regression in ML
Linear Regression and Logistic Regression in ML
Kumud Arora
 
Outlier analysis and anomaly detection
Outlier analysis and anomaly detectionOutlier analysis and anomaly detection
Outlier analysis and anomaly detection
ShantanuDeosthale
 
Csc446: Pattern Recognition
Csc446: Pattern Recognition Csc446: Pattern Recognition
Csc446: Pattern Recognition
Mostafa G. M. Mostafa
 
[PR12] categorical reparameterization with gumbel softmax
[PR12] categorical reparameterization with gumbel softmax[PR12] categorical reparameterization with gumbel softmax
[PR12] categorical reparameterization with gumbel softmax
JaeJun Yoo
 
Logistic regression
Logistic regressionLogistic regression
Logistic regression
YashwantGahlot1
 
Neural ODE
Neural ODENeural ODE
Neural ODE
Natan Katz
 
12. Random Forest
12. Random Forest12. Random Forest
12. Random Forest
FAO
 
1 Supervised learning
1 Supervised learning1 Supervised learning
1 Supervised learning
Dmytro Fishman
 
Spatial filtering using image processing
Spatial filtering using image processingSpatial filtering using image processing
Spatial filtering using image processing
Anuj Arora
 

What's hot (20)

GAN in medical imaging
GAN in medical imagingGAN in medical imaging
GAN in medical imaging
 
7. logistics regression using spss
7. logistics regression using spss7. logistics regression using spss
7. logistics regression using spss
 
Visualizing Data Using t-SNE
Visualizing Data Using t-SNEVisualizing Data Using t-SNE
Visualizing Data Using t-SNE
 
Community Detection in Social Networks: A Brief Overview
Community Detection in Social Networks: A Brief OverviewCommunity Detection in Social Networks: A Brief Overview
Community Detection in Social Networks: A Brief Overview
 
Dimensionality Reduction | Machine Learning | CloudxLab
Dimensionality Reduction | Machine Learning | CloudxLabDimensionality Reduction | Machine Learning | CloudxLab
Dimensionality Reduction | Machine Learning | CloudxLab
 
Linear regression
Linear regression Linear regression
Linear regression
 
DataEngConf: Feature Extraction: Modern Questions and Challenges at Google
DataEngConf: Feature Extraction: Modern Questions and Challenges at GoogleDataEngConf: Feature Extraction: Modern Questions and Challenges at Google
DataEngConf: Feature Extraction: Modern Questions and Challenges at Google
 
Kdd 2014 Tutorial - the recommender problem revisited
Kdd 2014 Tutorial -  the recommender problem revisitedKdd 2014 Tutorial -  the recommender problem revisited
Kdd 2014 Tutorial - the recommender problem revisited
 
Lec15 sfm
Lec15 sfmLec15 sfm
Lec15 sfm
 
The world of loss function
The world of loss functionThe world of loss function
The world of loss function
 
DMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluationDMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluation
 
Linear Regression and Logistic Regression in ML
Linear Regression and Logistic Regression in MLLinear Regression and Logistic Regression in ML
Linear Regression and Logistic Regression in ML
 
Outlier analysis and anomaly detection
Outlier analysis and anomaly detectionOutlier analysis and anomaly detection
Outlier analysis and anomaly detection
 
Csc446: Pattern Recognition
Csc446: Pattern Recognition Csc446: Pattern Recognition
Csc446: Pattern Recognition
 
[PR12] categorical reparameterization with gumbel softmax
[PR12] categorical reparameterization with gumbel softmax[PR12] categorical reparameterization with gumbel softmax
[PR12] categorical reparameterization with gumbel softmax
 
Logistic regression
Logistic regressionLogistic regression
Logistic regression
 
Neural ODE
Neural ODENeural ODE
Neural ODE
 
12. Random Forest
12. Random Forest12. Random Forest
12. Random Forest
 
1 Supervised learning
1 Supervised learning1 Supervised learning
1 Supervised learning
 
Spatial filtering using image processing
Spatial filtering using image processingSpatial filtering using image processing
Spatial filtering using image processing
 

Viewers also liked

Graph libraries in Matlab: MatlabBGL and gaimc
Graph libraries in Matlab: MatlabBGL and gaimcGraph libraries in Matlab: MatlabBGL and gaimc
Graph libraries in Matlab: MatlabBGL and gaimc
David Gleich
 
PageRank Centrality of dynamic graph structures
PageRank Centrality of dynamic graph structuresPageRank Centrality of dynamic graph structures
PageRank Centrality of dynamic graph structures
David Gleich
 
Localized methods in graph mining
Localized methods in graph miningLocalized methods in graph mining
Localized methods in graph mining
David Gleich
 
Fast matrix primitives for ranking, link-prediction and more
Fast matrix primitives for ranking, link-prediction and moreFast matrix primitives for ranking, link-prediction and more
Fast matrix primitives for ranking, link-prediction and more
David Gleich
 
Spacey random walks and higher order Markov chains
Spacey random walks and higher order Markov chainsSpacey random walks and higher order Markov chains
Spacey random walks and higher order Markov chains
David Gleich
 
Using Local Spectral Methods to Robustify Graph-Based Learning
Using Local Spectral Methods to Robustify Graph-Based LearningUsing Local Spectral Methods to Robustify Graph-Based Learning
Using Local Spectral Methods to Robustify Graph-Based Learning
David Gleich
 
Iterative methods with special structures
Iterative methods with special structuresIterative methods with special structures
Iterative methods with special structures
David Gleich
 
Anti-differentiating Approximation Algorithms: PageRank and MinCut
Anti-differentiating Approximation Algorithms: PageRank and MinCutAnti-differentiating Approximation Algorithms: PageRank and MinCut
Anti-differentiating Approximation Algorithms: PageRank and MinCut
David Gleich
 
Gaps between the theory and practice of large-scale matrix-based network comp...
Gaps between the theory and practice of large-scale matrix-based network comp...Gaps between the theory and practice of large-scale matrix-based network comp...
Gaps between the theory and practice of large-scale matrix-based network comp...
David Gleich
 
The power and Arnoldi methods in an algebra of circulants
The power and Arnoldi methods in an algebra of circulantsThe power and Arnoldi methods in an algebra of circulants
The power and Arnoldi methods in an algebra of circulants
David Gleich
 
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
David Gleich
 
Iterative methods for network alignment
Iterative methods for network alignmentIterative methods for network alignment
Iterative methods for network alignmentDavid Gleich
 
MapReduce Tall-and-skinny QR and applications
MapReduce Tall-and-skinny QR and applicationsMapReduce Tall-and-skinny QR and applications
MapReduce Tall-and-skinny QR and applications
David Gleich
 
What you can do with a tall-and-skinny QR factorization in Hadoop: Principal ...
What you can do with a tall-and-skinny QR factorization in Hadoop: Principal ...What you can do with a tall-and-skinny QR factorization in Hadoop: Principal ...
What you can do with a tall-and-skinny QR factorization in Hadoop: Principal ...
David Gleich
 
Tall and Skinny QRs in MapReduce
Tall and Skinny QRs in MapReduceTall and Skinny QRs in MapReduce
Tall and Skinny QRs in MapReduce
David Gleich
 
Direct tall-and-skinny QR factorizations in MapReduce architectures
Direct tall-and-skinny QR factorizations in MapReduce architecturesDirect tall-and-skinny QR factorizations in MapReduce architectures
Direct tall-and-skinny QR factorizations in MapReduce architectures
David Gleich
 
A multithreaded method for network alignment
A multithreaded method for network alignmentA multithreaded method for network alignment
A multithreaded method for network alignment
David Gleich
 
A history of PageRank from the numerical computing perspective
A history of PageRank from the numerical computing perspectiveA history of PageRank from the numerical computing perspective
A history of PageRank from the numerical computing perspective
David Gleich
 
How does Google Google: A journey into the wondrous mathematics behind your f...
How does Google Google: A journey into the wondrous mathematics behind your f...How does Google Google: A journey into the wondrous mathematics behind your f...
How does Google Google: A journey into the wondrous mathematics behind your f...
David Gleich
 
Tall-and-skinny QR factorizations in MapReduce architectures
Tall-and-skinny QR factorizations in MapReduce architecturesTall-and-skinny QR factorizations in MapReduce architectures
Tall-and-skinny QR factorizations in MapReduce architectures
David Gleich
 

Viewers also liked (20)

Graph libraries in Matlab: MatlabBGL and gaimc
Graph libraries in Matlab: MatlabBGL and gaimcGraph libraries in Matlab: MatlabBGL and gaimc
Graph libraries in Matlab: MatlabBGL and gaimc
 
PageRank Centrality of dynamic graph structures
PageRank Centrality of dynamic graph structuresPageRank Centrality of dynamic graph structures
PageRank Centrality of dynamic graph structures
 
Localized methods in graph mining
Localized methods in graph miningLocalized methods in graph mining
Localized methods in graph mining
 
Fast matrix primitives for ranking, link-prediction and more
Fast matrix primitives for ranking, link-prediction and moreFast matrix primitives for ranking, link-prediction and more
Fast matrix primitives for ranking, link-prediction and more
 
Spacey random walks and higher order Markov chains
Spacey random walks and higher order Markov chainsSpacey random walks and higher order Markov chains
Spacey random walks and higher order Markov chains
 
Using Local Spectral Methods to Robustify Graph-Based Learning
Using Local Spectral Methods to Robustify Graph-Based LearningUsing Local Spectral Methods to Robustify Graph-Based Learning
Using Local Spectral Methods to Robustify Graph-Based Learning
 
Iterative methods with special structures
Iterative methods with special structuresIterative methods with special structures
Iterative methods with special structures
 
Anti-differentiating Approximation Algorithms: PageRank and MinCut
Anti-differentiating Approximation Algorithms: PageRank and MinCutAnti-differentiating Approximation Algorithms: PageRank and MinCut
Anti-differentiating Approximation Algorithms: PageRank and MinCut
 
Gaps between the theory and practice of large-scale matrix-based network comp...
Overlapping clusters for distributed computation

  • 1. Overlapping Clusters for Distributed Computation. David F. Gleich (Purdue University, Computer Science Department), Reid Andersen (Microsoft Corp.), Vahab Mirrokni (Google Research, NYC). David Gleich · Purdue · WSDM 2012
  • 2. Problem: Find a good way to distribute a big graph for solving things like linear systems and simulating random walks. Contributions: a theoretical demonstration that overlap helps; a proof-of-concept procedure to find overlapping partitions that reduce communication (~20%). All code available: http://www.cs.purdue.edu/~dgleich/codes/overlapping
  • 3. The problem. What our networks look like. What our other networks look like.
  • 4. The problem. Combining networks and graphs is a mess.
  • 5. “Good” data distributions are a fundamental problem in distributed computation. How to divide the communication graph: balance work, balance communication, balance data, and balance programming complexity too.
  • 6. Current solutions, rated on Work / Comm. / Data / Programming:
      Disjoint vertex partitions: Excellent / Okay to Good / Excellent / “Think like a vertex”
      2d or edge partitions: Excellent / Excellent / Good / “Impossible”
      Where we fit: overlapping partitions: Okay / Good to Excellent / “Let’s see” / “Think like a cached vertex”
  • 7. Goals. Find a set of overlapping clusters where random walks stay in a cluster for a long time and solving diffusion-like problems requires little communication (think PageRank, Katz, hitting times, semi-supervised learning).
  • 8. Related work. Domain decomposition, Schwarz methods: how to solve a linear system with overlap (Szyld et al.). Communication-avoiding algorithms: k-step matrix-vector products (Demmel et al.) and growing overlap around partitions (Fritzsche, Frommer, Szyld). Overlapping communities and link partitioning algorithms for social network analysis: link communities (Ahn et al.); surveys by Fortunato and Satu. P2P-based PageRank algorithms: Parreira, Castillo, Donato et al.
  • 9. Overlapping clusters. Each vertex is in at least one cluster and has one home cluster. Formally, an overlapping cover is $(\mathcal{C}, \tau)$, where $\mathcal{C}$ is the set of clusters and $\tau : V \mapsto \mathcal{C}$ maps each vertex to its home. $\tau$ is a partition!
  • 10. Random walks in overlapping clusters. Each vertex is in at least one cluster and has one home cluster. Random walks go to the home cluster after leaving. Figure labels: the red cluster keeps the walk; the red cluster sends the walk to the gray cluster.
  • 11. An evaluation metric: swapping probability. Is $(\mathcal{C}, \tau)$ a good overlapping cover? Does a random walk swap clusters often? $\rho_\infty$ = probability that a walk changes clusters on each step (computable expression in the paper). Figure labels: the red cluster keeps the walk; the red cluster sends the walk to the gray cluster.
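The swapping probability can also be estimated by simulation. The sketch below is an illustration only, not the authors' code: it follows the rule from the slides (the walk stays with its current cluster while it remains inside it, and otherwise moves to the home cluster of the vertex it lands on) and counts how often the cluster changes. The function name, the dictionary-based graph, and the 40-vertex cycle are assumptions made for this example; the paper gives a computable expression rather than a simulation.

```python
import random


def estimate_swapping_probability(adj, clusters, home, steps, seed=0):
    """Estimate the fraction of random-walk steps that change clusters."""
    rng = random.Random(seed)
    cluster_sets = [set(c) for c in clusters]
    v = next(iter(adj))          # arbitrary start vertex
    current = home[v]            # the walk begins in the start vertex's home cluster
    swaps = 0
    for _ in range(steps):
        v = rng.choice(adj[v])   # one step of a uniform random walk
        if v not in cluster_sets[current]:
            current = home[v]    # left the cluster: go to the new vertex's home
            swaps += 1
    return swaps / steps


# Hypothetical test case: a cycle with n vertices and groups of ell vertices.
n, ell = 40, 10
adj = {u: [(u - 1) % n, (u + 1) % n] for u in range(n)}

# Disjoint partition: blocks of ell consecutive vertices.
blocks = [list(range(s, s + ell)) for s in range(0, n, ell)]
block_home = {u: u // ell for u in range(n)}

# Overlapping cover: windows of ell vertices that start every ell/2 vertices.
# Each vertex's home is the window in which it sits near the middle.
half = ell // 2
windows = [[(s + i) % n for i in range(ell)] for s in range(0, n, half)]
window_home = {u: ((u - 2) % n) // half for u in range(n)}

p_part = estimate_swapping_probability(adj, blocks, block_home, 200_000)
p_over = estimate_swapping_probability(adj, windows, window_home, 200_000)
print(p_part, p_over)
```

On this cycle the partition estimate should land near 1/ℓ = 0.1, and the overlapping estimate should be noticeably smaller, at the price of storing every vertex twice.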
  • 12. Overlapping clusters. Each vertex is in at least one cluster and has one home cluster. Vol(C) = sum of degrees of vertices in cluster C. MaxVol = upper bound on Vol(C). TotalVol($\mathcal{C}$) = sum of Vol(C) for all clusters C in $\mathcal{C}$. VolRatio = TotalVol($\mathcal{C}$) / Vol(G): how much extra data!
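As a small worked example of these volume definitions, the following sketch computes Vol(C), the largest cluster volume, and VolRatio for a graph stored as an adjacency dictionary. The function names and the 12-vertex cycle are assumptions chosen for illustration.

```python
def vol(adj, vertices):
    """Vol(C): sum of the degrees of the vertices in C."""
    return sum(len(adj[u]) for u in vertices)


def volume_ratio(adj, clusters):
    """VolRatio: TotalVol of the cover divided by Vol(G)."""
    total_vol = sum(vol(adj, c) for c in clusters)
    return total_vol / vol(adj, adj)


# Hypothetical example: a 12-cycle covered by three clusters of six vertices.
n = 12
adj = {u: [(u - 1) % n, (u + 1) % n] for u in range(n)}
clusters = [[11, 0, 1, 2, 3, 4], [3, 4, 5, 6, 7, 8], [7, 8, 9, 10, 11, 0]]

largest = max(vol(adj, c) for c in clusters)   # must not exceed MaxVol
ratio = volume_ratio(adj, clusters)
print(largest, ratio)
```

Each cluster here has volume 12 and Vol(G) = 24, so the cover stores 1.5 times the data of a disjoint partition.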
  • 13. Swapping probability & partitioning. No overlap in this figure: $\mathcal{P}$ is a partition. $\rho_\infty(\mathcal{P}) = \frac{1}{\mathrm{Vol}(G)} \sum_{P \in \mathcal{P}} \mathrm{Cut}(P)$. Much like a classical graph partitioning metric.
  • 14. Overlapping clusters vs. partitioning in theory. Take a cycle graph with M groups of $\ell$ vertices and MaxVol = $2\ell$. For partitioning, $\rho_\infty = 1/\ell$ (optimal!). For overlapping, $\rho_\infty = 4/\Omega(\ell^2)$.
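The partition formula, the sum of Cut(P) over all parts divided by Vol(G), can be checked directly on the cycle example. The sketch below is illustrative; the helper names and the choice of M = 3 groups of ℓ = 4 vertices are assumptions. Each group cuts 2 edges and Vol(G) = 2Mℓ, so the result should equal 2M / (2Mℓ) = 1/ℓ.

```python
def cut(adj, part):
    """Cut(P): number of edges with exactly one endpoint in P."""
    inside = set(part)
    return sum(1 for u in part for w in adj[u] if w not in inside)


def partition_swapping_probability(adj, parts):
    """Sum of Cut(P) over all parts, divided by Vol(G)."""
    vol_g = sum(len(nbrs) for nbrs in adj.values())
    return sum(cut(adj, p) for p in parts) / vol_g


# Hypothetical example: a cycle split into M groups of ell consecutive vertices.
M, ell = 3, 4
n = M * ell
adj = {u: [(u - 1) % n, (u + 1) % n] for u in range(n)}
parts = [list(range(s, s + ell)) for s in range(0, n, ell)]

rho = partition_swapping_probability(adj, parts)
print(rho)
```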
  • 15. Heuristics for finding good overlapping clusters (NP-hard for optimal solution). Our multi-stage heuristic: 1. Find a large set of good clusters: use personalized PageRank clusters. 2. Find “well contained” nodes (cores): compute expected “leave time”. 3. Cover the graph with core vertices: approximately solve a min set-cover problem. 4. Combine clusters up to MaxVol: the swapping probability is sub-modular.
  • 16. Heuristics for finding good overlapping clusters (NP-hard for optimal solution). Our multi-stage heuristic: 1. Find a large set of good clusters: use personalized PageRank clusters, or metis (each cluster takes “< MaxVol” work). 2. Find “well contained” nodes (cores): compute expected “leave time” (takes O(Vol) work per cluster). 3. Cover the graph with core vertices: approximately solve a min set-cover problem (fast enough). 4. Combine clusters up to MaxVol: the swapping probability is sub-modular (fast enough).
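Step 3 of the heuristic asks for an approximate min set cover. The slides do not say which approximation is used, so the sketch below shows the textbook greedy rule (repeatedly pick the candidate that covers the most still-uncovered vertices) purely as an assumption. The function name and the toy core sets are hypothetical.

```python
def greedy_cover(universe, candidates):
    """Return indices of candidate sets chosen by the greedy set-cover rule."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the candidate that covers the most uncovered vertices.
        best = max(range(len(candidates)),
                   key=lambda i: len(uncovered & candidates[i]))
        gained = uncovered & candidates[best]
        if not gained:
            raise ValueError("candidates do not cover the universe")
        chosen.append(best)
        uncovered -= gained
    return chosen


# Hypothetical core sets of four candidate clusters on six vertices.
cores = [{0, 1, 2}, {2, 3}, {3, 4, 5}, {0, 5}]
picked = greedy_cover(range(6), cores)
print(picked)
```

Here the first and third candidates together cover all six vertices, so only two clusters are kept.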
  • 17. Demo!
  • 18. Solving linear systems, like PageRank, Katz, and semi-supervised learning.
  • 19. All nodes solve locally using the coordinate descent method.
  • 20. All nodes solve locally using the coordinate descent method. A core vertex for the gray cluster.
  • 21. All nodes solve locally using the coordinate descent method. Red sends residuals to white. White sends residuals to red.
  • 22. White then uses the coordinate descent method to adjust its solution. This will cause communication to red/blue.
  • 23. That algorithm is called restricted additive Schwarz. Problems: PageRank (we look at PageRank!), Katz scores, semi-supervised learning, any spd or M-matrix linear system.
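To make the idea concrete, here is a simplified sketch in the style of restricted additive Schwarz for the PageRank system (I - αP)x = (1 - α)v with uniform v. It is not the authors' implementation, and it makes several assumptions: the graph is undirected with no self-loops, local solves are a few coordinate (Gauss-Seidel) sweeps rather than exact solves, and clusters exchange solution values once per round instead of streaming residuals. All names are hypothetical.

```python
def local_solve(x, adj, members, b, alpha, sweeps):
    """Run coordinate sweeps over one cluster; outside values stay frozen."""
    y = dict(x)
    for _ in range(sweeps):
        for u in members:
            # Row u of (I - alpha * P) y = b solved for y[u],
            # where P[u][w] = 1 / deg(w) for each neighbour w of u.
            y[u] = b[u] + alpha * sum(y[w] / len(adj[w]) for w in adj[u])
    return y


def overlapping_pagerank(adj, clusters, home, alpha=0.85, rounds=200, sweeps=5):
    n = len(adj)
    b = {u: (1.0 - alpha) / n for u in adj}
    x = {u: 0.0 for u in adj}
    for _ in range(rounds):
        # Every cluster works independently from the same global iterate.
        local = [local_solve(x, adj, c, b, alpha, sweeps) for c in clusters]
        # Restricted update: a vertex accepts only its home cluster's value.
        x = {u: local[home[u]][u] for u in adj}
    return x


# Hypothetical example: a 12-cycle with three overlapping clusters.
n = 12
adj = {u: [(u - 1) % n, (u + 1) % n] for u in range(n)}
clusters = [[11, 0, 1, 2, 3, 4], [3, 4, 5, 6, 7, 8], [7, 8, 9, 10, 11, 0]]
home = {u: u // 4 for u in range(n)}

x = overlapping_pagerank(adj, clusters, home)
print(sum(x.values()))
```

On a cycle the PageRank vector with uniform teleportation is uniform, which gives an easy check: every entry should approach 1/12.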
  • 24. It works! [Figure: relative work and relative communication (vertical axis) against volume ratio (horizontal axis). Series: Swapping Probability (usroads), PageRank Communication (usroads), Swapping Probability (web-Google), PageRank Communication (web-Google). Partitioning baseline: Metis partitioner.] Volume ratio: how much more of the graph we need to store.
  • 25. Edges are counted twice and some graphs have self-loops. The first group are geometric networks and the second are information networks.
      Graph          Vertices |V|   Edges |E|   Max deg   Density |E|/|V|
      onera                 85567      419201         5    4.9
      usroads              126146      323900         7    2.6
      annulus              500000     2999258        19    6.0

      email-Enron           33696      361622      1383   10.7
      soc-Slashdot          77360     1015667      2540   13.1
      dico                 111982     2750576     68191   24.6
      lcsh                 144791      394186      1025    2.7
      web-Google           855802     8582704      6332   10.0
      as-skitter          1694616    22188418     35455   13.1
      cit-Patents         3764117    33023481       793    8.8
      [Figure: three conductance plots; axis values not recoverable.]
  • 26. The communication ratio of our best result for the PageRank communication volume compared to METIS or GRACLUS shows that the method works for 6 of them (perf. ratio < 1). The … communication result is not a bug.
      Graph          Comm. of Partition   Comm. of Overlap   Perf. Ratio   Vol. Ratio
      onera                       18654                 48         0.003         2.82
      usroads                      3256                  0         0.000         1.49
      annulus                     12074                  2         0.000         0.01
      email-Enron               194536*             235316         1.210         1.7
      soc-Slashdot              875435*          1.3 × 10^6         1.480         1.78
      dico                  1.5 × 10^6*          2.0 × 10^6         1.320         1.53
      lcsh                       73000*              48777         0.668         2.17
      web-Google                201159*             167609         0.833         1.57
      as-skitter             2.4 × 10^6          3.9 × 10^6         1.645         1.93
      cit-Patents            8.7 × 10^6          7.3 × 10^6         0.845         1.34
      * means Graclus gave a better partition than Metis.
      Finally, we evaluate our heuristic. At left, the cluster combine procedure reduces 10^6 clusters to around 10^2. Middle, combining clusters can decrease the volume …
  • 27. Summary: Overlap helps reduce communication in a distributed process! Proof of concept procedure to find overlapping partitions to reduce communication. Future work: Truly distributed implementation and evaluation. Can we exploit data redundancy to solve problems on large graphs faster? [Figure: Copy 1 and Copy 2 of a "src -> dst" edge list.] All code available: http://www.cs.purdue.edu/~dgleich/codes/overlapping