My talk from WSDM2012. See the paper on my webpage: http://www.cs.purdue.edu/homes/dgleich/publications/Andersen%202012%20-%20overlapping.pdf

The code is at http://www.cs.purdue.edu/homes/dgleich/codes/overlapping/


- 1. Overlapping Clusters for Distributed Computation. David F. Gleich (Purdue University, Computer Science Department), Reid Andersen (Microsoft Corp.), Vahab Mirrokni (Google Research, NYC). David Gleich · Purdue WSDM2012
- 2. Problem: Find a good way to distribute a big graph for solving things like linear systems and simulating random walks.
  Contributions:
  - Theoretical demonstration that overlap helps.
  - Proof-of-concept procedure to find overlapping partitions to reduce communication (~20%).
  - All code available: http://www.cs.purdue.edu/~dgleich/codes/overlapping
- 3. The problem. What our networks look like; what our other networks look like.
- 4. The problem: combining networks and graphs is a mess.
- 5. “Good” data distributions are a fundamental problem in distributed computation. How to divide the communication graph! Balance work, balance communication, balance data, and balance programming complexity too.
- 6. Current solutions
                                  Work       Comm.              Data         Programming
      Disjoint vertex partitions  Excellent  Okay to Good       Excellent    “Think like a vertex”
      2d or edge partitions       Excellent  Excellent          Good         “Impossible”
      Overlapping partitions      Okay       Good to Excellent  “Let’s see”  “Think like a cached vertex”
      (where we fit!)
- 7. Goals: Find a set of overlapping clusters where
  - random walks stay in a cluster for a long time;
  - solving diffusion-like problems requires little communication (think PageRank, Katz, hitting times, semi-supervised learning).
- 8. Related work
  - Domain decomposition, Schwarz methods: how to solve a linear system with overlap. Szyld et al.
  - Communication-avoiding algorithms: k-step matrix-vector products (Demmel et al.) and growing overlap around partitions (Fritzsche, Frommer, Szyld).
  - Overlapping communities and link partitioning algorithms for social network analysis: link communities (Ahn et al.); surveys by Fortunato and Satu.
  - P2P-based PageRank algorithms: Parreira, Castillo, Donato et al.
- 9. Overlapping clusters. Each vertex is in at least one cluster and has one home cluster. Formally, an overlapping cover is $(\mathcal{C}, \tau)$, where $\mathcal{C}$ is the set of clusters and $\tau : V \mapsto \mathcal{C}$ is the map to homes. $\tau$ is a partition!
- 10. Random walks in overlapping clusters. Each vertex is in at least one cluster and has one home cluster. Random walks go to the home cluster after leaving. [Figure labels: the red cluster keeps the walk; the red cluster sends the walk to the gray cluster.]
- 11. An evaluation metric: swapping probability. Is $(\mathcal{C}, \tau)$ a good overlapping cover? Does a random walk swap clusters often? $\rho_\infty$ = probability that a walk changes clusters on each step (computable expression in the paper). [Figure labels: the red cluster keeps the walk; the red cluster sends the walk to the gray cluster.]
- 12. Overlapping clusters. Each vertex is in at least one cluster and has one home cluster.
  - Vol(C) = sum of degrees of vertices in cluster C
  - MaxVol = upper bound on Vol(C)
  - TotalVol($\mathcal{C}$) = sum of Vol(C) for all clusters
  - VolRatio = TotalVol($\mathcal{C}$) / Vol(G): how much extra data!
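The volume definitions on this slide can be checked on a small example. The following is a minimal sketch, assuming an undirected graph stored as a Python adjacency dict; the function names `vol` and `vol_ratio` and the 6-cycle example are illustrative and are not taken from the talk's code.

```python
def vol(cluster, adj):
    """Vol(C): sum of the degrees of the vertices in cluster C."""
    return sum(len(adj[v]) for v in cluster)


def vol_ratio(clusters, adj):
    """VolRatio: TotalVol over all clusters divided by Vol(G)."""
    total_vol = sum(vol(c, adj) for c in clusters)
    graph_vol = sum(len(nbrs) for nbrs in adj.values())
    return total_vol / graph_vol


# Hypothetical example: a cycle on 6 vertices covered by two overlapping clusters.
n = 6
adj = {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}
clusters = [{0, 1, 2, 3}, {3, 4, 5, 0}]

max_vol = max(vol(c, adj) for c in clusters)  # largest cluster volume in this cover
ratio = vol_ratio(clusters, adj)              # a value above 1 means extra data is stored
```

Here vertices 0 and 3 are stored twice, so the ratio exceeds 1; a disjoint partition would give exactly 1.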
- 13. Swapping probability & partitioning. No overlap in this figure! If $\mathcal{P}$ is a partition, then $\rho_\infty(\mathcal{P}) = \frac{1}{\mathrm{Vol}(G)} \sum_{P \in \mathcal{P}} \mathrm{Cut}(P)$. Much like a classical graph partitioning metric.
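For a partition, the swapping probability is the total cut divided by the graph volume. A minimal sketch of that formula follows, assuming an undirected adjacency dict; the helper names and the two-triangle example graph are made up for illustration.

```python
def cut(part, adj):
    """Cut(P): number of edges with exactly one endpoint in P."""
    return sum(1 for v in part for u in adj[v] if u not in part)


def swapping_probability(partition, adj):
    """Sum of Cut(P) over the parts, divided by Vol(G)."""
    graph_vol = sum(len(nbrs) for nbrs in adj.values())
    return sum(cut(p, adj) for p in partition) / graph_vol


# Hypothetical example: two triangles joined by the single edge 2-3.
adj = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
partition = [{0, 1, 2}, {3, 4, 5}]
rho = swapping_probability(partition, adj)  # 2 cut endpoints out of Vol(G) = 14
```

Each part sees one cut edge and Vol(G) is 14, so a stationary random walk crosses between the two triangles on one step in seven.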
- 14. Overlapping clusters vs. partitioning in theory. Take a cycle graph with $M$ groups of $\ell$ vertices and MaxVol $= 2\ell$. For partitioning, $\rho_\infty = \frac{1}{\ell}$ (optimal!). For overlapping, $\rho_\infty = \frac{4}{\Omega(\ell^2)}$.
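The gap between the two rates on the cycle can be observed with a small Monte Carlo experiment. This is a sketch under stated assumptions: the walk follows the rule from the earlier slides (stay in the current cluster until the walk steps outside it, then jump to the home cluster of the new vertex), and the overlapping cover used here (arcs of $\ell$ vertices shifted by $\ell/2$, with each vertex homed in the arc where it lies in the middle half) is one natural choice, not necessarily the construction from the paper. All names are illustrative.

```python
import random


def estimate_swap_rate(adj, clusters, home, steps=200_000, seed=0):
    """Fraction of random-walk steps on which the walk changes clusters."""
    rng = random.Random(seed)
    v = rng.choice(sorted(adj))
    current = home[v]
    swaps = 0
    for _ in range(steps):
        v = rng.choice(adj[v])
        if v not in clusters[current]:  # the walk left its current cluster
            current = home[v]           # so it moves to the home cluster of v
            swaps += 1
    return swaps / steps


M, ell = 5, 8
n = M * ell
adj = {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}

# Partition: M disjoint arcs of ell vertices; every vertex is homed in its own arc.
parts = [set(range(g * ell, (g + 1) * ell)) for g in range(M)]
part_home = {v: v // ell for v in range(n)}

# Overlapping cover (assumed construction): arcs of ell vertices shifted by ell/2,
# each with volume 2*ell; a vertex is homed in the arc where it sits in the middle half.
half = ell // 2
arcs = [{(half * j + i) % n for i in range(ell)} for j in range(n // half)]
arc_home = {v: ((v - half // 2) % n) // half for v in range(n)}

rho_partition = estimate_swap_rate(adj, parts, part_home)
rho_overlap = estimate_swap_rate(adj, arcs, arc_home)
```

With these settings the partition estimate should land near $1/\ell$, while the overlapping cover swaps noticeably less often because a walk that leaves a cluster re-enters near the middle of its new one. The price is that every vertex is stored twice.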
- 15. Heuristics for finding good overlapping clusters. NP-hard for optimal solution. Our multi-stage heuristic!
  1. Find a large set of good clusters: use personalized PageRank clusters.
  2. Find “well contained” nodes (cores): compute expected “leave time”.
  3. Cover the graph with core vertices: approximately solve a min set-cover problem.
  4. Combine clusters up to MaxVol: the swapping probability is sub-modular.
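Step 2 depends on an expected leave time. The sketch below assumes that this means the expected number of steps before a simple random walk started at a vertex first steps outside the cluster; that reading, the threshold used to pick core vertices, and all names are assumptions made for illustration rather than the authors' procedure. The leave times satisfy $t(v) = 1 + \frac{1}{\deg(v)} \sum_{u \in N(v) \cap C} t(u)$, which the code solves by repeated sweeps.

```python
def expected_leave_times(cluster, adj, tol=1e-12, max_sweeps=100_000):
    """Expected number of steps before a walk started at v first leaves `cluster`."""
    t = {v: 0.0 for v in cluster}
    for _ in range(max_sweeps):
        change = 0.0
        for v in cluster:
            # One step is always taken; neighbors outside the cluster contribute 0.
            new = 1.0 + sum(t[u] for u in adj[v] if u in cluster) / len(adj[v])
            change = max(change, abs(new - t[v]))
            t[v] = new
        if change < tol:
            break
    return t


# Hypothetical example: an arc of 8 consecutive vertices on a 40-cycle.
n = 40
adj = {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}
cluster = set(range(8))
times = expected_leave_times(cluster, adj)

# Assumed rule: call a vertex "well contained" if its leave time exceeds a threshold.
core = {v for v in cluster if times[v] > 16}
```

On this arc the leave time grows toward the middle, so the core consists of the interior vertices and the vertices next to the boundary are excluded.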
- 16. Heuristics for finding good overlapping clusters. NP-hard for optimal solution. Our multi-stage heuristic!
  1. Find a large set of good clusters: use personalized PageRank clusters, or metis. Each cluster takes “MaxVol” work.
  2. Find “well contained” nodes (cores): compute expected “leave time”. Takes O(Vol) work per cluster.
  3. Cover the graph with core vertices: approximately solve a min set-cover problem. Fast enough.
  4. Combine clusters up to MaxVol: the swapping probability is sub-modular. Fast enough.
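Step 3 asks for an approximate minimum set cover. One standard way to do that is the greedy rule: repeatedly pick the candidate that covers the most still-uncovered vertices. The sketch below shows that generic rule only; whether the authors' code uses exactly this variant is not stated on the slide, and the function name and toy core sets are hypothetical.

```python
def greedy_set_cover(universe, candidate_sets):
    """Return indices of candidate sets chosen greedily until the universe is covered."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the candidate covering the most uncovered elements (first one on ties).
        best = max(range(len(candidate_sets)),
                   key=lambda i: len(candidate_sets[i] & uncovered))
        gain = candidate_sets[best] & uncovered
        if not gain:
            raise ValueError("candidate sets do not cover the universe")
        chosen.append(best)
        uncovered -= gain
    return chosen


# Hypothetical core sets of four candidate clusters on a 6-vertex graph.
cores = [{0, 1, 2}, {2, 3}, {3, 4, 5}, {1, 4}]
chosen = greedy_set_cover(range(6), cores)
```

In this toy instance two of the four candidates already cover every vertex, so the other two are never selected.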
- 17. Demo!
- 18. Solving linear systems, like PageRank, Katz, and semi-supervised learning.
- 19. All nodes solve locally using the coordinate descent method.
- 20. All nodes solve locally using the coordinate descent method. A core vertex for the gray cluster.
- 21. All nodes solve locally using the coordinate descent method. Red sends residuals to white. White sends residuals to red.
- 22. White then uses the coordinate descent method to adjust its solution. This will cause communication to red/blue.
- 23. That algorithm is called restricted additive Schwarz. It handles PageRank (we look at PageRank!), Katz scores, semi-supervised learning, and any spd or M-matrix linear system.
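To make the iteration concrete, here is a small sketch of restricted additive Schwarz applied to the PageRank system $(I - \alpha P)x = (1 - \alpha)v$ with uniform $v$. It is a textbook dense variant under stated assumptions: each cluster solves its local block exactly by Gaussian elimination (the slides describe coordinate descent for the local solves), everything runs in one process, and vertices are numbered 0 to n-1. The function names and the example cover are illustrative.

```python
def solve_dense(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x


def pagerank_system(adj, alpha):
    """Dense A = I - alpha * P, where column j of P spreads 1/deg(j) to j's neighbors."""
    n = len(adj)
    A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for j, nbrs in adj.items():
        for i in nbrs:
            A[i][j] -= alpha / len(nbrs)
    return A, [(1 - alpha) / n] * n


def ras_pagerank(adj, clusters, home, alpha=0.85, iters=30):
    A, b = pagerank_system(adj, alpha)
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        # Global residual; in a distributed run, clusters would exchange these values.
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        new_x = x[:]
        for c, cluster in enumerate(clusters):
            idx = sorted(cluster)
            A_loc = [[A[i][j] for j in idx] for i in idx]
            d = solve_dense(A_loc, [r[i] for i in idx])
            for i, di in zip(idx, d):
                if home[i] == c:  # "restricted": only home vertices take the update
                    new_x[i] += di
        x = new_x
    return x


# Hypothetical cover of a 16-cycle: arcs of 8 vertices shifted by 4.
n = 16
adj = {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}
clusters = [{(4 * j + i) % n for i in range(8)} for j in range(4)]
home = {v: ((v - 2) % n) // 4 for v in range(n)}
x = ras_pagerank(adj, clusters, home)
```

On a cycle with uniform teleportation the exact PageRank vector is uniform, which gives a simple check that the iteration has converged.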
- 24. It works! [Figure: relative work (relative communication) plotted against volume ratio, that is, how much more of the graph we need to store. Curves: swapping probability and PageRank communication for usroads and for web-Google. The Metis partitioner is the partitioning baseline.]
- 25. Edges are counted twice and some graphs have self-loops. The first group are geometric networks and the second are information networks.
      Graph          |V|       |E|        max deg   |E|/|V|
      onera          85567     419201     5         4.9
      usroads        126146    323900     7         2.6
      annulus        500000    2999258    19        6.0

      email-Enron    33696     361622     1383      10.7
      soc-Slashdot   77360     1015667    2540      13.1
      dico           111982    2750576    68191     24.6
      lcsh           144791    394186     1025      2.7
      web-Google     855802    8582704    6332      10.0
      as-skitter     1694616   22188418   35455     13.1
      cit-Patents    3764117   33023481   793       8.8
  [Figure: three conductance plots.]
- 26. The communication ratio of our best result for the PageRank communication volume compared to METIS or GRACLUS shows that the method works for 6 of them (perf. ratio < 1). The communication result is not a bug.
      Graph          Comm. of Partition   Comm. of Overlap   Perf. Ratio   Vol. Ratio
      onera          18654                48                 0.003         2.82
      usroads        3256                 0                  0.000         1.49
      annulus        12074                2                  0.000         0.01
      email-Enron    194536*              235316             1.210         1.7
      soc-Slashdot   875435*              1.3 × 10^6         1.480         1.78
      dico           1.5 × 10^6*          2.0 × 10^6         1.320         1.53
      lcsh           73000*               48777              0.668         2.17
      web-Google     201159*              167609             0.833         1.57
      as-skitter     2.4 × 10^6           3.9 × 10^6         1.645         1.93
      cit-Patents    8.7 × 10^6           7.3 × 10^6         0.845         1.34
  * means Graclus gave a better partition than Metis.
  Finally, we evaluate our heuristic. At left, the cluster combine procedure reduces $10^6$ clusters to around $10^2$. Middle, combining clusters can decrease the volume
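The Perf. Ratio column appears to be the overlap communication divided by the partition communication. That reading is inferred from the numbers and is not stated on the slide. A short check, using only the rows whose counts are given as exact integers:

```python
# (graph, comm. of partition, comm. of overlap), copied from the table above.
rows = [
    ("onera", 18654, 48),
    ("usroads", 3256, 0),
    ("annulus", 12074, 2),
    ("email-Enron", 194536, 235316),
    ("lcsh", 73000, 48777),
    ("web-Google", 201159, 167609),
]

# Assumed definition: ratio below 1 means the overlapping cover communicated less.
perf_ratio = {name: overlap / part for name, part, overlap in rows}
helped = [name for name, ratio in perf_ratio.items() if ratio < 1]
```

Rounded to three decimals, these quotients agree with the tabulated ratios for the six rows checked here.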
- 27. Summary
  - Overlap helps reduce communication in a distributed process!
  - Proof-of-concept procedure to find overlapping partitions to reduce communication.
  - All code available: http://www.cs.purdue.edu/~dgleich/codes/overlapping
  Future work
  - Truly distributed implementation and evaluation!
  - Can we exploit data redundancy to solve problems on large graphs faster? [Figure: two copies (Copy 1, Copy 2) of a src - dst edge list.]
