Overlapping clusters for distributed computation

My talk from WSDM2012. See the paper on my webpage: http://www.cs.purdue.edu/homes/dgleich/publications/Andersen%202012%20-%20overlapping.pdf

The code is at http://www.cs.purdue.edu/homes/dgleich/codes/overlapping/

Presentation Transcript

  • Overlapping Clusters for Distributed Computation
    David F. Gleich, Purdue University, Computer Science Department
    Reid Andersen, Microsoft Corp.
    Vahab Mirrokni, Google Research, NYC
    David Gleich · Purdue WSDM2012
  • Problem: Find a good way to distribute a big graph for solving things like linear systems and simulating random walks.
    Contributions:
    Theoretical demonstration that overlap helps.
    Proof-of-concept procedure to find overlapping partitions to reduce communication (~20%).
    All code available: http://www.cs.purdue.edu/~dgleich/codes/overlapping
  • The problem: What our networks look like. What our other networks look like.
  • The problem: Combining networks and graphs is a mess.
  • “Good” data distributions are a fundamental problem in distributed computation. How to divide the communication graph!
    Balance work.
    Balance communication.
    Balance data.
    Balance programming complexity too.
  • Current solutions, rated on Work, Comm., Data, and Programming (ratings listed as extracted from the slide):
    Disjoint vertex partitions: Okay to Good; “Think like a vertex”; Excellent; Excellent
    2d or edge partitions: Excellent; Excellent; Good; “Impossible”
    Where we fit! Overlapping partitions: Good to Excellent; “Think like a cached vertex”; Okay; “Let’s see”
  • Goals: Find a set of overlapping clusters where
    random walks stay in a cluster for a long time;
    solving diffusion-like problems requires little communication (think PageRank, Katz, hitting times, semi-supervised learning).
  • Related work
    Domain decomposition, Schwarz methods: how to solve a linear system with overlap (Szyld et al.).
    Communication-avoiding algorithms: k-step matrix-vector products (Demmel et al.) and growing overlap around partitions (Fritzsche, Frommer, Szyld).
    Overlapping communities and link partitioning algorithms for social network analysis: link communities (Ahn et al.); surveys by Fortunato and Satu.
    P2P-based PageRank algorithms: Parreira, Castillo, Donato et al.
  • Overlapping clusters: Each vertex is in at least one cluster and has one home cluster. Formally, an overlapping cover is $(\mathcal{C}, \tau)$, where $\mathcal{C}$ is the set of clusters and $\tau : V \mapsto \mathcal{C}$ maps each vertex to its home cluster. $\tau$ is a partition!
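The definition above can be sketched as a small data structure. This is a minimal illustration, not the authors' code: the names `is_valid_cover`, `clusters`, and `home` are hypothetical, clusters are plain Python sets, and the home map stores a cluster index per vertex. It also assumes that a vertex's home cluster must contain that vertex.

```python
def is_valid_cover(vertices, clusters, home):
    """Return True if (clusters, home) is an overlapping cover of `vertices`.

    Assumption for this sketch: home[v] is the index of a cluster
    that contains v, so every vertex is covered at least once.
    """
    for v in vertices:
        if v not in home:
            return False  # vertex has no home cluster
        idx = home[v]
        if not (0 <= idx < len(clusters)):
            return False  # home points at a cluster that does not exist
        if v not in clusters[idx]:
            return False  # home cluster does not contain the vertex
    return True


# Hypothetical 6-vertex example: two clusters that overlap on vertices 2 and 3.
clusters = [{0, 1, 2, 3}, {2, 3, 4, 5}]
home = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
valid = is_valid_cover(range(6), clusters, home)
```

Because each vertex has exactly one home, grouping vertices by `home[v]` splits the vertex set into disjoint pieces, which is the sense in which the home map is a partition.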
  • Random walks in overlapping clusters: Each vertex is in at least one cluster and has one home cluster. Random walks go to the home cluster after leaving. [Figure labels: the red cluster keeps the walk; the red cluster sends the walk to the gray cluster.]
  • An evaluation metric: swapping probability. Is $(\mathcal{C}, \tau)$ a good overlapping cover? Does a random walk swap clusters often? $\rho_\infty$ = probability that a walk changes clusters on each step (computable expression in the paper). [Figure labels: the red cluster keeps the walk; the red cluster sends the walk to the gray cluster.]
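The exact expression for the swapping probability is in the paper and is not reproduced here. As a rough stand-in, the sketch below estimates how often a walk swaps clusters by simulation, using the rule from the slides: the walk stays in its current cluster while it can, and moves to the home cluster of the new vertex once it leaves. The function name, the 8-vertex cycle, and both covers are hypothetical examples.

```python
import random


def estimate_swap_frequency(adj, clusters, home, steps=20000, seed=0):
    """Monte Carlo estimate of the fraction of steps on which a walk swaps clusters."""
    rng = random.Random(seed)
    v = rng.choice(sorted(adj))
    current = home[v]  # start in the home cluster of the start vertex
    swaps = 0
    for _ in range(steps):
        v = rng.choice(adj[v])  # uniform random neighbor
        if v not in clusters[current]:
            current = home[v]  # left the cluster: go to the new vertex's home
            swaps += 1
    return swaps / steps


n = 8
cycle = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
home = {v: 0 if v < 4 else 1 for v in range(n)}

halves = [{0, 1, 2, 3}, {4, 5, 6, 7}]               # disjoint partition
overlap = [{7, 0, 1, 2, 3, 4}, {3, 4, 5, 6, 7, 0}]  # same homes, extra overlap
single = [set(range(n))]                            # everything in one cluster

partition_rate = estimate_swap_frequency(cycle, halves, home)
overlap_rate = estimate_swap_frequency(cycle, overlap, home)
single_rate = estimate_swap_frequency(cycle, single, {v: 0 for v in range(n)})
```

On this toy graph the partition estimate should land near 1/4, the overlapping cover should swap noticeably less often, and the single cluster never swaps.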
  • Overlapping clusters: Each vertex is in at least one cluster and has one home cluster.
    Vol(C) = sum of degrees of vertices in cluster C
    MaxVol = upper bound on Vol(C)
    TotalVol($\mathcal{C}$) = sum of Vol(C) for all clusters C in $\mathcal{C}$
    VolRatio = TotalVol($\mathcal{C}$) / Vol(G): how much extra data!
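These volume quantities are simple to compute from an adjacency list. The sketch below is illustrative only; the function names and the 8-vertex cycle are assumptions, and the graph is taken to be undirected so that a vertex's degree is the length of its neighbor list.

```python
def vol(adj, cluster):
    """Vol(C): sum of the degrees of the vertices in the cluster."""
    return sum(len(adj[v]) for v in cluster)


def total_vol(adj, clusters):
    """TotalVol: sum of Vol(C) over all clusters."""
    return sum(vol(adj, c) for c in clusters)


def vol_ratio(adj, clusters):
    """VolRatio: TotalVol divided by Vol(G), i.e. how much extra data is stored."""
    return total_vol(adj, clusters) / vol(adj, adj.keys())


n = 8
cycle = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
halves = [{0, 1, 2, 3}, {4, 5, 6, 7}]
overlap = [{7, 0, 1, 2, 3, 4}, {3, 4, 5, 6, 7, 0}]

ratio_partition = vol_ratio(cycle, halves)   # 16 / 16
ratio_overlap = vol_ratio(cycle, overlap)    # 24 / 16
```

A disjoint partition always has VolRatio 1; the overlapping cover here stores half again as much of the graph.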
  • Swapping probability & partitioning: No overlap in this figure! $\mathcal{P}$ is a partition. $\rho_\infty(\mathcal{P}) = \frac{1}{\mathrm{Vol}(G)} \sum_{P \in \mathcal{P}} \mathrm{Cut}(P)$. Much like a classical graph partitioning metric.
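For a partition, the swapping probability is the total cut divided by the graph volume, which is easy to check on a small graph. In this sketch the function names and the example cycles are hypothetical, and Cut(P) is taken to be the number of edges with exactly one endpoint in P.

```python
def cut(adj, part):
    """Cut(P): number of edges with exactly one endpoint in `part`."""
    part = set(part)
    return sum(1 for v in part for u in adj[v] if u not in part)


def partition_swap_probability(adj, parts):
    """Sum of Cut(P) over all parts, divided by Vol(G)."""
    vol_g = sum(len(nbrs) for nbrs in adj.values())
    return sum(cut(adj, p) for p in parts) / vol_g


def cycle_graph(n):
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


# 8-cycle split into two halves: total cut 4, volume 16.
rho_halves = partition_swap_probability(
    cycle_graph(8), [{0, 1, 2, 3}, {4, 5, 6, 7}]
)

# 20-cycle split into 4 consecutive groups of 5 vertices: total cut 8, volume 40.
groups = [set(range(5 * k, 5 * k + 5)) for k in range(4)]
rho_groups = partition_swap_probability(cycle_graph(20), groups)
```

The second example gives 1/5, one over the group size.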
  • Overlapping clusters vs. partitioning in theory: Take a cycle graph with $M$ groups of $\ell$ vertices, MaxVol = $2\ell$. For partitioning, $\rho_\infty = \frac{1}{\ell}$ (optimal!). For overlapping, $\rho_\infty = \frac{4}{\Omega(\ell^2)}$.
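The partitioning value can be checked by hand from the cut formula for a partition. The worked step below assumes the cycle has $M\ell$ vertices, each of degree 2, and that every group is a run of $\ell$ consecutive vertices, so each group has volume $2\ell$ (exactly MaxVol) and is cut by 2 edges.

```latex
\[
\rho_\infty(\mathcal{P})
  = \frac{1}{\mathrm{Vol}(G)} \sum_{P \in \mathcal{P}} \mathrm{Cut}(P)
  = \frac{M \cdot 2}{2 M \ell}
  = \frac{1}{\ell}.
\]
```

The overlapping bound is plausible by a standard random-walk fact: a simple walk started near the middle of a path of about $\ell$ vertices needs on the order of $\ell^2/4$ steps to reach either end, so a walk that enters each cluster near its center swaps roughly once every $\ell^2/4$ steps. The precise construction and proof are in the paper; this is only intuition.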
  • Heuristics for finding good overlapping clusters (NP-hard for optimal solution). Our multi-stage heuristic!
    1. Find a large set of good clusters: use personalized PageRank clusters.
    2. Find “well contained” nodes (cores): compute expected “leave time”.
    3. Cover the graph with core vertices: approximately solve a min set-cover problem.
    4. Combine clusters up to MaxVol: the swapping probability is sub-modular.
  • Heuristics for finding good overlapping clusters (NP-hard for optimal solution). Our multi-stage heuristic!
    1. Find a large set of good clusters: use personalized PageRank clusters, or Metis. Each cluster takes “< MaxVol” work.
    2. Find “well contained” nodes (cores): compute expected “leave time”. Takes O(Vol) work per cluster.
    3. Cover the graph with core vertices: approximately solve a min set-cover problem. Fast enough.
    4. Combine clusters up to MaxVol: the swapping probability is sub-modular. Fast enough.
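Step 3 only says the set-cover problem is solved approximately. One common way to do that is the greedy heuristic, sketched below; whether the authors use exactly this rule is an assumption, and `greedy_set_cover` and the example core sets are hypothetical.

```python
def greedy_set_cover(universe, candidate_sets):
    """Greedy approximation to min set cover.

    Repeatedly picks the candidate that covers the most still-uncovered
    elements. Returns the indices of the chosen candidates in pick order.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Candidate with the largest gain; ties go to the lowest index.
        best = max(
            range(len(candidate_sets)),
            key=lambda i: len(candidate_sets[i] & uncovered),
        )
        gain = candidate_sets[best] & uncovered
        if not gain:
            raise ValueError("candidate sets do not cover the universe")
        chosen.append(best)
        uncovered -= gain
    return chosen


# Hypothetical core sets of four clusters on a 6-vertex graph.
cores = [{0, 1, 2}, {2, 3}, {3, 4, 5}, {1, 4}]
chosen = greedy_set_cover(range(6), cores)
```

Here the first and third core sets already cover every vertex, so the greedy rule stops after two picks.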
  • Demo!
  • Solving linear systems, like PageRank, Katz, and semi-supervised learning.
  • All nodes solve locally using the coordinate descent method.
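The slides do not spell out the local update. As a rough single-machine illustration, the sketch below applies a coordinate-wise relaxation to the PageRank linear system $(I - \alpha P)x = (1 - \alpha)v$. Several things here are assumptions: the graph is undirected without self-loops, $v$ is uniform, the coordinate with the largest residual is relaxed first (a Gauss-Southwell rule), and the function name is hypothetical.

```python
def pagerank_coordinate_relaxation(adj, alpha=0.85, tol=1e-10, max_updates=1_000_000):
    """Solve (I - alpha * P) x = (1 - alpha) * v one coordinate at a time.

    P is the random-walk matrix of the undirected graph `adj` and v is uniform.
    Returns the solution x and the final residual r = b - A x.
    """
    n = len(adj)
    x = {v: 0.0 for v in adj}
    r = {v: (1.0 - alpha) / n for v in adj}  # residual for x = 0
    for _ in range(max_updates):
        j = max(r, key=r.get)  # largest residual (residuals stay nonnegative)
        if r[j] < tol:
            break
        delta = r[j]
        x[j] += delta  # make equation j exact
        r[j] = 0.0
        share = alpha * delta / len(adj[j])
        for u in adj[j]:
            r[u] += share  # the change in x[j] shifts the neighbors' residuals
    return x, r


# Hypothetical star graph: center 0 with leaves 1, 2, 3.
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
x, r = pagerank_coordinate_relaxation(star)
```

In the distributed setting described in the talk, the residual updates that land on vertices owned by another cluster are the values that would have to be communicated.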
  • All nodes solve locally using the coordinate descent method. A core vertex for the gray cluster.
  • All nodes solve locally using the coordinate descent method. Red sends residuals to white. White sends residuals to red.
  • White then uses the coordinate descent method to adjust its solution. This will cause communication to red/blue.
  • That algorithm is called restricted additive Schwarz. PageRank (we look at PageRank!), Katz scores, semi-supervised learning, any SPD or M-matrix linear system.
  • It works! [Figure: Relative Work versus Volume Ratio (how much more of the graph we need to store). Series: Swapping Probability (usroads), PageRank Communication (usroads), Swapping Probability (web-Google), PageRank Communication (web-Google). Reference: Metis Partitioner, the partitioning baseline.]
  • Edges are counted twice and some graphs have self-loops. The first group are geometric networks and the second are information networks.
    Graph          Vertices |V|   Edges |E|   MaxDeg   Density |E|/|V|
    onera                 85567      419201        5               4.9
    usroads              126146      323900        7               2.6
    annulus              500000     2999258       19               6.0
    email-Enron           33696      361622     1383              10.7
    soc-Slashdot          77360     1015667     2540              13.1
    dico                 111982     2750576    68191              24.6
    lcsh                 144791      394186     1025               2.7
    web-Google           855802     8582704     6332              10.0
    as-skitter          1694616    22188418    35455              13.1
    cit-Patents         3764117    33023481      793               8.8
    [Figure: three panels with Conductance on the vertical axis]
  • The communication ratio of our best result for the PageRank communication volume compared to METIS or GRACLUS shows that the method works for 6 of them (perf. ratio < 1). The 0 communication result is not a bug.
    Graph          Comm. of Partition   Comm. of Overlap   Perf. Ratio   Vol. Ratio
    onera                       18654                 48         0.003         2.82
    usroads                      3256                  0         0.000         1.49
    annulus                     12074                  2         0.000         0.01
    email-Enron               194536*             235316         1.210         1.7
    soc-Slashdot              875435*         1.3 × 10^6         1.480         1.78
    dico                  1.5 × 10^6*         2.0 × 10^6         1.320         1.53
    lcsh                       73000*              48777         0.668         2.17
    web-Google                201159*             167609         0.833         1.57
    as-skitter             2.4 × 10^6         3.9 × 10^6         1.645         1.93
    cit-Patents            8.7 × 10^6         7.3 × 10^6         0.845         1.34
    * means Graclus gave a better partition than Metis.
    Finally, we evaluate our heuristic. At left, the cluster combine procedure reduces 10^6 clusters to around 10^2. Middle, combining clusters can decrease the volume
  • Summary
    Overlap helps reduce communication in a distributed process!
    Proof-of-concept procedure to find overlapping partitions to reduce communication.
    All code available: http://www.cs.purdue.edu/~dgleich/codes/overlapping
    Future work!
    Truly distributed implementation and evaluation.
    Can we exploit data redundancy to solve problems on large graphs faster?
    [Figure: Copy 1 and Copy 2, each a list of src -> dst edges]