Overlapping clusters for distributed computation

My talk from WSDM2012. See the paper on my webpage: http://www.cs.purdue.edu/homes/dgleich/publications/Andersen%202012%20-%20overlapping.pdf

And the code: http://www.cs.purdue.edu/homes/dgleich/codes/overlapping/


Transcript

  • 1. Overlapping Clusters for Distributed Computation
    David F. Gleich (Purdue University, Computer Science Department), Reid Andersen (Microsoft Corp.), Vahab Mirrokni (Google Research, NYC)
    David Gleich · Purdue · WSDM2012
  • 2. Problem
    Find a good way to distribute a big graph for solving things like linear systems and simulating random walks.
    Contributions
    Theoretical demonstration that overlap helps.
    Proof-of-concept procedure to find overlapping partitions to reduce communication (~20%).
    All code available: http://www.cs.purdue.edu/~dgleich/codes/overlapping
  • 3. The problem: what our networks look like, and what our other networks look like.
  • 4. The problem: combining networks and graphs is a mess.
  • 5. “Good” data distributions are a fundamental problem in distributed computation.
    How to divide the communication graph:
    Balance work
    Balance communication
    Balance data
    Balance programming complexity too
  • 6. Current solutions
    Approach                      Work        Comm.               Data          Programming
    Disjoint vertex partitions    Excellent   Okay to Good        Excellent     “Think like a vertex”
    2d or edge partitions         Excellent   Excellent           Good          “Impossible”
    Overlapping partitions        Okay        Good to Excellent   “Let’s see”   “Think like a cached vertex”
    Where we fit: overlapping partitions.
  • 7. Goals
    Find a set of overlapping clusters where random walks stay in a cluster for a long time and solving diffusion-like problems requires little communication (think PageRank, Katz, hitting times, semi-supervised learning).
  • 8. Related work
    Domain decomposition, Schwarz methods: how to solve a linear system with overlap (Szyld et al.).
    Communication-avoiding algorithms: k-step matrix-vector products (Demmel et al.) and growing overlap around partitions (Fritzsche, Frommer, Szyld).
    Overlapping communities and link partitioning algorithms for social network analysis: link communities (Ahn et al.); surveys by Fortunato and Satu.
    P2P-based PageRank algorithms: Parreira, Castillo, Donato et al.
  • 9. Overlapping clusters
    Each vertex is in at least one cluster and has one home cluster.
    Formally, an overlapping cover is $(\mathcal{C}, \tau)$:
    $\mathcal{C}$ = set of clusters
    $\tau : V \mapsto \mathcal{C}$ = map to homes
    $\tau$ is a partition!
  • 10. Random walks in overlapping clusters
    Each vertex is in at least one cluster and has one home cluster.
    Random walks go to the home cluster after leaving.
    [Figure labels: red cluster keeps the walk; red cluster sends the walk to gray cluster.]
  • 11. An evaluation metric: swapping probability
    Is $(\mathcal{C}, \tau)$ a good overlapping cover? Does a random walk swap clusters often?
    $\rho_\infty$ = probability that a walk changes clusters on each step (computable expression in the paper).
    [Figure labels: red cluster keeps the walk; red cluster sends the walk to gray cluster.]
  • 12. Overlapping clusters
    Each vertex is in at least one cluster and has one home cluster.
    Vol(C) = sum of degrees of vertices in cluster C
    MaxVol = upper bound on Vol(C)
    TotalVol($\mathcal{C}$) = sum of Vol(C) for all clusters C in $\mathcal{C}$
    VolRatio = TotalVol($\mathcal{C}$) / Vol(G): how much extra data!
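The volume quantities on this slide are simple to compute from adjacency lists. The following is a minimal sketch on a toy cycle graph; the helper names (`cycle_graph`, `vol`, `vol_ratio`) and the two example covers are assumptions made for illustration and do not come from the talk or its code.

```python
def cycle_graph(n):
    """Adjacency lists of an undirected cycle on vertices 0..n-1 (toy example)."""
    return {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}

def vol(graph, vertices):
    """Vol(C): sum of the degrees of the vertices in C."""
    return sum(len(graph[v]) for v in vertices)

def vol_ratio(graph, clusters):
    """VolRatio: total volume of all clusters divided by Vol(G)."""
    total_vol = sum(vol(graph, c) for c in clusters)
    return total_vol / vol(graph, graph.keys())

graph = cycle_graph(8)
partition = [{0, 1, 2, 3}, {4, 5, 6, 7}]          # no overlap
overlapping = [{0, 1, 2, 3, 4}, {4, 5, 6, 7, 0}]  # vertices 0 and 4 are stored twice

max_vol_used = max(vol(graph, c) for c in overlapping)
ratio_partition = vol_ratio(graph, partition)
ratio_overlapping = vol_ratio(graph, overlapping)
```

A partition always has VolRatio equal to 1; any overlap pushes it above 1, which is the "extra data" cost the slide refers to.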
  • 13. Swapping probability & partitioning
    No overlap in this figure! $\mathcal{P}$ is a partition.
    $\rho_\infty(\mathcal{P}) = \frac{1}{\mathrm{Vol}(G)} \sum_{P \in \mathcal{P}} \mathrm{Cut}(P)$
    Much like a classical graph partitioning metric.
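For a partition, the formula on this slide can be evaluated directly. The sketch below assumes Cut(P) counts the edges with exactly one endpoint in P and uses a toy cycle graph; the function names are hypothetical.

```python
def cut(graph, part):
    """Number of edges with exactly one endpoint in `part`."""
    return sum(1 for v in part for u in graph[v] if u not in part)

def swapping_probability_partition(graph, parts):
    """rho_inf(P) = (1 / Vol(G)) * sum of Cut(P) over the parts P."""
    vol_g = sum(len(neighbors) for neighbors in graph.values())
    return sum(cut(graph, set(p)) for p in parts) / vol_g

# Toy example: a cycle on 20 vertices split into 4 blocks of 5 consecutive vertices.
n, block = 20, 5
graph = {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}
parts = [set(range(k, k + block)) for k in range(0, n, block)]
rho = swapping_probability_partition(graph, parts)
```

Each block has two cut edges and Vol(G) = 40, so the value is 8/40 = 1/5, i.e. one over the block length.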
  • 14. Overlapping clusters vs. partitioning in theory
    Take a cycle graph: $M$ groups of $\ell$ vertices, MaxVol = $2\ell$.
    For partitioning: $\rho_\infty = \frac{1}{\ell}$ (optimal!)
    For overlapping: $\rho_\infty = \frac{4}{\Omega(\ell^2)}$
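The gap between the two cases on the cycle can be checked numerically with a short Monte Carlo sketch. The overlapping layout used here (blocks of ℓ consecutive vertices starting every ℓ/2 vertices, with each vertex's home being the block in which it sits nearest the middle) is an assumption chosen for illustration, not necessarily the construction analyzed in the paper, and all function names are hypothetical.

```python
import random

def make_partition(n, ell):
    """Disjoint blocks of ell consecutive vertices; home = the only block."""
    clusters = [frozenset(range(k * ell, (k + 1) * ell)) for k in range(n // ell)]
    home = {v: v // ell for v in range(n)}
    return clusters, home

def make_overlapping(n, ell):
    """Blocks of ell consecutive vertices that start every ell/2 vertices."""
    h = ell // 2
    m = n // h
    clusters = [frozenset((k * h + i) % n for i in range(ell)) for k in range(m)]
    home = {}
    for v in range(n):
        k, offset = divmod(v, h)
        # v lies in blocks k-1 and k; choose the one where v is nearer the middle.
        home[v] = (k - 1) % m if offset < (h - 1) / 2 else k
    return clusters, home

def estimate_swapping_probability(n, clusters, home, steps, seed=0):
    """Fraction of random-walk steps on the n-cycle that change the cluster."""
    rng = random.Random(seed)
    v = 0
    current = home[v]
    swaps = 0
    for _ in range(steps):
        v = rng.choice(((v - 1) % n, (v + 1) % n))
        if v not in clusters[current]:
            current = home[v]  # the walk leaves, so it moves to the home cluster
            swaps += 1
    return swaps / steps

n, ell, steps = 60, 10, 200_000
rho_partition = estimate_swapping_probability(n, *make_partition(n, ell), steps)
rho_overlap = estimate_swapping_probability(n, *make_overlapping(n, ell), steps)
```

With ℓ = 10 the partition estimate should land near 1/ℓ = 0.1. In this particular layout the walk re-enters a block at its middle, so the overlapping estimate should land near 4/(ℓ(ℓ+2)) = 1/30, at the price of a volume ratio of 2.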
  • 15. Heuristics for finding good overlapping clusters (NP-hard for optimal solution)
    Our multi-stage heuristic:
    1. Find a large set of good clusters: use personalized PageRank clusters.
    2. Find “well contained” nodes (cores): compute expected “leave time”.
    3. Cover the graph with core vertices: approximately solve a min set-cover problem.
    4. Combine clusters up to MaxVol: the swapping probability is sub-modular.
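Step 2 can be illustrated with a small sketch. It assumes that the "leave time" of a vertex is the expected number of steps before a random walk started there first steps outside the cluster; that reading, the plain Gauss-Seidel iteration, and the function name are assumptions, not the procedure from the paper.

```python
def expected_leave_times(graph, cluster, max_sweeps=10_000, tol=1e-12):
    """Solve t[v] = 1 + (sum of t[u] over neighbors u inside the cluster) / deg(v)
    by Gauss-Seidel sweeps. Assumes the walk can actually leave the cluster."""
    t = {v: 0.0 for v in cluster}
    for _ in range(max_sweeps):
        change = 0.0
        for v in cluster:
            inside = sum(t[u] for u in graph[v] if u in cluster)
            new_value = 1.0 + inside / len(graph[v])
            change = max(change, abs(new_value - t[v]))
            t[v] = new_value
        if change < tol:
            break
    return t

# Toy example: five consecutive vertices on a 20-cycle.
n = 20
graph = {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}
cluster = {0, 1, 2, 3, 4}
leave = expected_leave_times(graph, cluster)
core = max(leave, key=leave.get)  # the most "well contained" vertex
```

On this path-like cluster the leave time at position i (counting from 1) is i(6 - i), so the middle vertex is the core and the boundary vertices leave soonest.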
  • 16. Heuristics for finding good overlapping clusters (NP-hard for optimal solution)
    Our multi-stage heuristic:
    1. Find a large set of good clusters: use personalized PageRank clusters, or metis. Each cluster takes “< MaxVol” work.
    2. Find “well contained” nodes (cores): compute expected “leave time”. Takes O(Vol) work per cluster.
    3. Cover the graph with core vertices: approximately solve a min set-cover problem. Fast enough.
    4. Combine clusters up to MaxVol: the swapping probability is sub-modular. Fast enough.
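For step 3, one standard way to approximately solve a min set-cover problem is the greedy rule: repeatedly take the candidate that covers the most still-uncovered vertices. The sketch below shows that generic rule; whether the talk's code uses exactly this variant is an assumption, and the function name and toy core sets are hypothetical.

```python
def greedy_set_cover(universe, candidate_sets):
    """Return indices of chosen sets; each pick covers the most uncovered items."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(range(len(candidate_sets)),
                   key=lambda i: len(candidate_sets[i] & uncovered))
        gain = candidate_sets[best] & uncovered
        if not gain:
            raise ValueError("candidate sets do not cover the universe")
        chosen.append(best)
        uncovered -= gain
    return chosen

# Toy example: each candidate is the set of core vertices of one cluster.
vertices = range(6)
core_sets = [{0, 1, 2, 3}, {3, 4}, {4, 5}, {0, 5}]
picked = greedy_set_cover(vertices, core_sets)
```

Here the first pick covers vertices 0 to 3 and the second covers 4 and 5, so two of the four candidate clusters are enough.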
  • 17. Demo!
  • 18. Solving linear systems, like PageRank, Katz, and semi-supervised learning.
  • 19. All nodes solve locally using the coordinate descent method.
  • 20. All nodes solve locally using the coordinate descent method. [Figure label: a core vertex for the gray cluster.]
  • 21. All nodes solve locally using the coordinate descent method. Red sends residuals to white. White sends residuals to red.
  • 22. White then uses the coordinate descent method to adjust its solution. This will cause communication to red/blue.
  • 23. That algorithm is called restricted additive Schwarz. It covers PageRank, Katz scores, semi-supervised learning, and any SPD or M-matrix linear system. We look at PageRank!
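The local solve can be pictured with a single-machine sketch of coordinate relaxation on the PageRank system (I - αP)x = (1 - α)v, where P is the column-stochastic random-walk matrix and v is uniform. This "push"-style update stands in for the coordinate descent step named on the slides; it shows only how solution and residual are updated, not the cluster-to-cluster exchange of restricted additive Schwarz. The function name, α = 0.85, and the tolerance are assumptions.

```python
from collections import deque

def pagerank_coordinate_relaxation(graph, alpha=0.85, tol=1e-10):
    """Relax one coordinate at a time until every residual entry is <= tol."""
    n = len(graph)
    x = {v: 0.0 for v in graph}               # current solution
    r = {v: (1.0 - alpha) / n for v in graph}  # residual of the linear system
    queue = deque(graph)
    queued = set(graph)
    while queue:
        i = queue.popleft()
        queued.discard(i)
        ri = r[i]
        if ri <= tol:
            continue
        x[i] += ri   # move the residual at i into the solution
        r[i] = 0.0
        share = alpha * ri / len(graph[i])
        for j in graph[i]:
            r[j] += share  # neighbors receive the propagated residual
            if r[j] > tol and j not in queued:
                queue.append(j)
                queued.add(j)
    return x, r

n = 8
graph = {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}
x, r = pagerank_coordinate_relaxation(graph)
```

In a distributed setting, the residual pushed onto a vertex whose home is another cluster is what would have to be communicated, which is why keeping walks inside a cluster matters.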
  • 24. It works!
    [Figure: relative work and relative communication (y-axis, 0 to 2) against volume ratio (x-axis, 1 to 1.7). Series: Swapping Probability (usroads), PageRank Communication (usroads), Swapping Probability (web-Google), PageRank Communication (web-Google). The Metis partitioner is the partitioning baseline.]
    Volume ratio: how much more of the graph we need to store.
  • 25. Edges are counted twice and some graphs have self-loops. The first group are geometric networks and the second are information networks.
    Graph          Vertices |V|   Edges |E|   Max deg   Density |E|/|V|
    onera          85567          419201      5         4.9
    usroads        126146         323900      7         2.6
    annulus        500000         2999258     19        6.0
    email-Enron    33696          361622      1383      10.7
    soc-Slashdot   77360          1015667     2540      13.1
    dico           111982         2750576     68191     24.6
    lcsh           144791         394186      1025      2.7
    web-Google     855802         8582704     6332      10.0
    as-skitter     1694616        22188418    35455     13.1
    cit-Patents    3764117        33023481    793       8.8
    [Figure: three conductance plots (y-axis: Conductance, 0 to 1).]
  • 26. The communication ratio of our best result for the PageRank communication volume compared to METIS or GRACLUS shows that the method works for 6 of them (perf. ratio < 1). The zero communication result is not a bug.
    Graph          Comm. of Partition   Comm. of Overlap   Perf. Ratio   Vol. Ratio
    onera          18654                48                 0.003         2.82
    usroads        3256                 0                  0.000         1.49
    annulus        12074                2                  0.000         0.01
    email-Enron    194536*              235316             1.210         1.7
    soc-Slashdot   875435*              1.3 × 10^6         1.480         1.78
    dico           1.5 × 10^6*          2.0 × 10^6         1.320         1.53
    lcsh           73000*               48777              0.668         2.17
    web-Google     201159*              167609             0.833         1.57
    as-skitter     2.4 × 10^6           3.9 × 10^6         1.645         1.93
    cit-Patents    8.7 × 10^6           7.3 × 10^6         0.845         1.34
    * means Graclus gave a better partition than Metis.
    Finally, we evaluate our heuristic. At left, the cluster combine procedure reduces $10^6$ clusters to around $10^2$. Middle, combining clusters can decrease the volume …
  • 27. Summary
    Overlap helps reduce communication in a distributed process!
    Proof-of-concept procedure to find overlapping partitions to reduce communication.
    All code available: http://www.cs.purdue.edu/~dgleich/codes/overlapping
    Future work
    Truly distributed implementation and evaluation.
    Can we exploit data redundancy to solve problems on large graphs faster?
    [Figure: two copies (Copy 1, Copy 2) of a src -> dst edge list.]