My talk from WSDM2012. See the paper on my webpage: http://www.cs.purdue.edu/homes/dgleich/publications/Andersen%202012%20-%20overlapping.pdf

And the codes http://www.cs.purdue.edu/homes/dgleich/codes/overlapping/

- 1. Overlapping Clusters for Distributed Computation. David F. Gleich (Purdue University, Computer Science Department), Reid Andersen (Microsoft Corp.), Vahab Mirrokni (Google Research, NYC). David Gleich · Purdue · WSDM2012
- 2. Problem: find a good way to distribute a big graph for solving things like linear systems and simulating random walks. Contributions: a theoretical demonstration that overlap helps; a proof-of-concept procedure to find overlapping partitions to reduce communication (~20%). All code available: http://www.cs.purdue.edu/~dgleich/codes/overlapping
- 3. The problem. What our networks look like. What our other networks look like.
- 4. The problem. Combining networks and graphs is a mess.
- 5. “Good” data distributions are a fundamental problem in distributed computation. How to divide the communication graph: balance work, balance communication, balance data, and balance programming complexity too.
- 6. Current solutions, rated on Work, Comm., Data, and Programming. Disjoint vertex partitions: Okay to Good; “Think like a vertex”; Excellent; Excellent. 2d or edge partitions: Excellent; Excellent; Good; “Impossible”. Where we fit! Overlapping partitions: Good to Excellent; “Think like a cached vertex”; Okay; “Let’s see”.
- 7. Goals. Find a set of overlapping clusters where random walks stay in a cluster for a long time and solving diffusion-like problems requires little communication (think PageRank, Katz, hitting times, semi-supervised learning).
- 8. Related work. Domain decomposition, Schwarz methods: how to solve a linear system with overlap (Szyld et al.). Communication-avoiding algorithms: k-step matrix-vector products (Demmel et al.) and growing overlap around partitions (Fritzsche, Frommer, Szyld). Overlapping communities and link partitioning algorithms for social network analysis: link communities (Ahn et al.); surveys by Fortunato and Satu. P2P-based PageRank algorithms: Parreira, Castillo, Donato et al.
- 9. Overlapping clusters. Each vertex is in at least one cluster and has one home cluster. Formally, an overlapping cover is $(\mathcal{C}, \tau)$, where $\mathcal{C}$ is the set of clusters and $\tau : V \mapsto \mathcal{C}$ is the map to homes. $\tau$ is a partition!
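A cover of this kind can be held in two plain containers. The sketch below is a minimal illustration, not the authors' code: `check_cover` and `home_partition` are hypothetical helper names, clusters are Python sets, the home map is a dict from vertex to cluster index, and it assumes that a vertex's home cluster must contain that vertex.

```python
from typing import Dict, Hashable, Iterable, List, Set


def check_cover(vertices: Iterable[Hashable],
                clusters: List[Set[Hashable]],
                home: Dict[Hashable, int]) -> bool:
    """Return True if (clusters, home) is a valid overlapping cover.

    Assumption (not stated on the slide): the home cluster of a vertex
    must be one of the clusters that contains it.
    """
    for v in vertices:
        if not any(v in c for c in clusters):
            raise ValueError(f"vertex {v!r} is in no cluster")
        if v not in home:
            raise ValueError(f"vertex {v!r} has no home cluster")
        if v not in clusters[home[v]]:
            raise ValueError(f"home cluster of {v!r} does not contain it")
    return True


def home_partition(vertices: Iterable[Hashable],
                   clusters: List[Set[Hashable]],
                   home: Dict[Hashable, int]) -> List[Set[Hashable]]:
    """Group vertices by home cluster; the groups are disjoint by construction."""
    parts: List[Set[Hashable]] = [set() for _ in clusters]
    for v in vertices:
        parts[home[v]].add(v)
    return parts
```

Because every vertex has exactly one home, grouping by home always yields a partition of the vertex set, which is the sense in which the home map is a partition.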
- 10. Random walks in overlapping clusters. Each vertex is in at least one cluster and has one home cluster. Random walks go to the home cluster after leaving. Figure labels: red cluster keeps the walk; red cluster sends the walk to gray cluster.
- 11. An evaluation metric: swapping probability. Is $(\mathcal{C}, \tau)$ a good overlapping cover? Does a random walk swap clusters often? $\rho_\infty$ = probability that a walk changes clusters on each step (computable expression in the paper). Figure labels: red cluster keeps the walk; red cluster sends the walk to gray cluster.
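The swapping probability can also be estimated by simulating the walk directly. The sketch below is a Monte Carlo estimate under assumptions, not the closed-form expression from the paper: the function name, the adjacency-list dict `adj`, and the uniform random start are all illustrative choices. The walk stays in its current cluster while each new vertex is still inside it, and otherwise jumps to the home cluster of the new vertex.

```python
import random
from typing import Dict, Hashable, List, Set


def estimate_swapping_probability(adj: Dict[Hashable, List[Hashable]],
                                  clusters: List[Set[Hashable]],
                                  home: Dict[Hashable, int],
                                  steps: int = 100_000,
                                  seed: int = 0) -> float:
    """Estimate the fraction of random-walk steps that change clusters."""
    rng = random.Random(seed)
    u = rng.choice(sorted(adj))   # assumed start: a uniformly random vertex
    current = home[u]             # the walk begins in that vertex's home cluster
    swaps = 0
    for _ in range(steps):
        u = rng.choice(adj[u])    # one step of a simple random walk
        if u not in clusters[current]:
            current = home[u]     # left the cluster: move to the new vertex's home
            swaps += 1
    return swaps / steps
```

Passing a disjoint partition as `clusters` gives the non-overlapping case, so the same function can compare a partition against a cover built around it.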
- 12. Overlapping clusters. Each vertex is in at least one cluster and has one home cluster. Vol(C) = sum of degrees of vertices in cluster C. MaxVol = upper bound on Vol(C). TotalVol($\mathcal{C}$) = sum of Vol(C) for all clusters. VolRatio = TotalVol($\mathcal{C}$) / Vol(G) = how much extra data!
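These volume quantities translate directly into code. The following is a minimal sketch, assuming an undirected graph stored as an adjacency-list dict `adj` (so a vertex's degree is the length of its neighbor list); the function names are illustrative.

```python
from typing import Dict, Hashable, Iterable, List, Set


def vol(adj: Dict[Hashable, List[Hashable]], cluster: Iterable[Hashable]) -> int:
    """Vol(C): sum of the degrees of the vertices in the cluster."""
    return sum(len(adj[v]) for v in cluster)


def total_vol(adj: Dict[Hashable, List[Hashable]],
              clusters: List[Set[Hashable]]) -> int:
    """TotalVol: sum of Vol(C) over all clusters (overlap is counted repeatedly)."""
    return sum(vol(adj, c) for c in clusters)


def vol_ratio(adj: Dict[Hashable, List[Hashable]],
              clusters: List[Set[Hashable]]) -> float:
    """VolRatio: TotalVol divided by Vol(G), i.e. how much extra data is stored."""
    return total_vol(adj, clusters) / vol(adj, adj.keys())
```

A disjoint partition has a volume ratio of exactly 1; anything above 1 is the storage cost of the overlap.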
- 13. Swapping probability & partitioning. No overlap in this figure! $\mathcal{P}$ is a partition: $\rho_\infty(\mathcal{P}) = \frac{1}{\mathrm{Vol}(G)} \sum_{P \in \mathcal{P}} \mathrm{Cut}(P)$. Much like a classical graph partitioning metric.
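For a partition, the swapping probability is the sum of the cuts divided by the volume of the graph, which is short to compute. The sketch below assumes that Cut(P) counts the edges leaving P and that the graph is an undirected adjacency-list dict; the function names are illustrative.

```python
from typing import Dict, Hashable, List, Set


def cut(adj: Dict[Hashable, List[Hashable]], part: Set[Hashable]) -> int:
    """Cut(P): number of edges with exactly one endpoint in P (assumed definition)."""
    return sum(1 for u in part for v in adj[u] if v not in part)


def partition_swapping_probability(adj: Dict[Hashable, List[Hashable]],
                                   parts: List[Set[Hashable]]) -> float:
    """Sum of Cut(P) over the parts, divided by Vol(G)."""
    vol_g = sum(len(neighbors) for neighbors in adj.values())
    return sum(cut(adj, p) for p in parts) / vol_g
```

Each cut edge is counted once from each side, matching the two directions in which a walk can cross it.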
- 14. Overlapping clusters vs. partitioning in theory. Take a cycle graph: M groups of $\ell$ vertices, MaxVol = $2\ell$. For partitioning, $\rho_\infty = \frac{1}{\ell}$ (optimal!). For overlapping, $\rho_\infty = \frac{4}{\Omega(\ell^2)}$.
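The partitioning value can be checked by hand from the cut-over-volume formula for a partition. The derivation below assumes that the M groups are contiguous arcs of the cycle, that $M \ge 2$, and that Cut(P) counts the edges leaving P:

```latex
\begin{align*}
\mathrm{Vol}(G) &= 2M\ell, \qquad \mathrm{Cut}(P) = 2 \ \text{for each arc } P, \\
\rho_\infty(\mathcal{P})
  &= \frac{1}{\mathrm{Vol}(G)} \sum_{P \in \mathcal{P}} \mathrm{Cut}(P)
   = \frac{2M}{2M\ell}
   = \frac{1}{\ell}.
\end{align*}
```

Each arc of $\ell$ vertices has volume $2\ell$, so this partition meets the MaxVol bound exactly.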
- 15. Heuristics for finding good overlapping clusters (NP-hard for optimal solution). Our multi-stage heuristic! (1) Find a large set of good clusters: use personalized PageRank clusters. (2) Find “well contained” nodes (cores): compute expected “leave time”. (3) Cover the graph with core vertices: approximately solve a min set-cover problem. (4) Combine clusters up to MaxVol: the swapping probability is sub-modular.
- 16. Heuristics for finding good overlapping clusters (NP-hard for optimal solution). Our multi-stage heuristic! (1) Find a large set of good clusters: use personalized PageRank clusters, or metis; each cluster takes “MaxVol” work. (2) Find “well contained” nodes (cores): compute expected “leave time”; takes O(Vol) work per cluster. (3) Cover the graph with core vertices: approximately solve a min set-cover problem; fast enough. (4) Combine clusters up to MaxVol: the swapping probability is sub-modular; fast enough.
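Step 3 asks for an approximate min set-cover. One standard way to do that is the greedy rule sketched below, which repeatedly picks the candidate covering the most still-uncovered vertices. The slides do not say which approximation the authors' code uses, so treat this as an assumption; `greedy_set_cover` and its arguments are illustrative names, with each candidate set standing for the core vertices of one cluster.

```python
from typing import Hashable, Iterable, List, Set


def greedy_set_cover(universe: Iterable[Hashable],
                     candidate_sets: List[Set[Hashable]]) -> List[int]:
    """Return indices of candidate sets that together cover the universe.

    Greedy rule: always take the set that covers the most uncovered elements.
    """
    uncovered = set(universe)
    chosen: List[int] = []
    while uncovered:
        # Index of the candidate with the largest overlap with what is left.
        best = max(range(len(candidate_sets)),
                   key=lambda i: len(candidate_sets[i] & uncovered))
        gain = candidate_sets[best] & uncovered
        if not gain:
            raise ValueError("candidate sets do not cover the universe")
        chosen.append(best)
        uncovered -= gain
    return chosen
```

The greedy rule does not guarantee a minimum cover, only an approximate one, which fits the slide's wording.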
- 17. Demo!
- 18. Solving linear systems, like PageRank, Katz, and semi-supervised learning.
- 19. All nodes solve locally using the coordinate descent method.
- 20. All nodes solve locally using the coordinate descent method. A core vertex for the gray cluster.
- 21. All nodes solve locally using the coordinate descent method. Red sends residuals to white. White sends residuals to red.
- 22. White then uses the coordinate descent method to adjust its solution. Will cause communication to red/blue.
- 23. That algorithm is called restricted additive Schwarz. PageRank (we look at PageRank!), Katz scores, semi-supervised learning, any spd or M-matrix linear system.
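The local solves are described only as "coordinate descent". One common form of that idea for PageRank-type systems is a residual-push relaxation, sketched below on a single machine. It is an illustration under assumptions, not restricted additive Schwarz itself and not the authors' implementation: there is no cluster-to-cluster communication, the graph is an undirected adjacency-list dict with no isolated vertices, and `pagerank_push`, `alpha`, and `tol` are hypothetical names and defaults.

```python
from collections import deque
from typing import Dict, Hashable, List, Optional, Tuple


def pagerank_push(adj: Dict[Hashable, List[Hashable]],
                  alpha: float = 0.85,
                  tol: float = 1e-10,
                  v: Optional[Dict[Hashable, float]] = None
                  ) -> Tuple[Dict[Hashable, float], Dict[Hashable, float]]:
    """Solve (I - alpha * A * D^-1) x = (1 - alpha) v one coordinate at a time.

    Returns the solution x and the final residual r. Assumes every vertex
    has at least one neighbor.
    """
    nodes = list(adj)
    if v is None:
        v = {u: 1.0 / len(nodes) for u in nodes}   # uniform teleportation
    x = {u: 0.0 for u in nodes}
    r = {u: (1.0 - alpha) * v[u] for u in nodes}   # residual for x = 0
    queue = deque(u for u in nodes if r[u] > tol)
    queued = set(queue)
    while queue:
        u = queue.popleft()
        queued.discard(u)
        ru = r[u]
        if ru <= tol:
            continue
        x[u] += ru                      # relax coordinate u
        r[u] = 0.0
        share = alpha * ru / len(adj[u])
        for w in adj[u]:                # push the residual to the neighbors
            r[w] += share
            if r[w] > tol and w not in queued:
                queue.append(w)
                queued.add(w)
    return x, r
```

In a distributed setting, the residual pushed to vertices owned by another cluster is what would have to be communicated, which is why clusters that keep the walk inside them need fewer messages.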
- 24. It works! [Figure: relative work versus volume ratio (how much more of the graph we need to store). Series: swapping probability (usroads), PageRank communication (usroads), swapping probability (web-Google), PageRank communication (web-Google). Reference line: Metis partitioner, the partitioning baseline.]
- 25. Edges are counted twice and some graphs have self-loops. The first group (onera, usroads, annulus) are geometric networks and the second are information networks. Density is |E|/|V|.

  | Graph | Vertices | Edges | Max deg | Density |
  |---|---|---|---|---|
  | onera | 85567 | 419201 | 5 | 4.9 |
  | usroads | 126146 | 323900 | 7 | 2.6 |
  | annulus | 500000 | 2999258 | 19 | 6.0 |
  | email-Enron | 33696 | 361622 | 1383 | 10.7 |
  | soc-Slashdot | 77360 | 1015667 | 2540 | 13.1 |
  | dico | 111982 | 2750576 | 68191 | 24.6 |
  | lcsh | 144791 | 394186 | 1025 | 2.7 |
  | web-Google | 855802 | 8582704 | 6332 | 10.0 |
  | as-skitter | 1694616 | 22188418 | 35455 | 13.1 |
  | cit-Patents | 3764117 | 33023481 | 793 | 8.8 |

  [Figure: three conductance plots.]
- 26. The communication ratio of our best result for the PageRank communication volume compared to METIS or GRACLUS shows that the method works for 6 of them (perf. ratio < 1). The communication result is not a bug.

  | Graph | Comm. of Partition | Comm. of Overlap | Perf. Ratio | Vol. Ratio |
  |---|---|---|---|---|
  | onera | 18654 | 48 | 0.003 | 2.82 |
  | usroads | 3256 | 0 | 0.000 | 1.49 |
  | annulus | 12074 | 2 | 0.000 | 0.01 |
  | email-Enron | 194536* | 235316 | 1.210 | 1.7 |
  | soc-Slashdot | 875435* | 1.3 × 10^6 | 1.480 | 1.78 |
  | dico | 1.5 × 10^6* | 2.0 × 10^6 | 1.320 | 1.53 |
  | lcsh | 73000* | 48777 | 0.668 | 2.17 |
  | web-Google | 201159* | 167609 | 0.833 | 1.57 |
  | as-skitter | 2.4 × 10^6 | 3.9 × 10^6 | 1.645 | 1.93 |
  | cit-Patents | 8.7 × 10^6 | 7.3 × 10^6 | 0.845 | 1.34 |

  \* means Graclus gave a better partition than Metis.

  Finally, we evaluate our heuristic. At left, the cluster combine procedure reduces 10^6 clusters to around 10^2. Middle, combining clusters can decrease the volume
- 27. Summary: overlap helps reduce communication in a distributed process! Proof-of-concept procedure to find overlapping partitions to reduce communication. All code available: http://www.cs.purdue.edu/~dgleich/codes/overlapping. Future work: truly distributed implementation and evaluation! Can we exploit data redundancy to solve problems on large graphs faster? [Figure: Copy 1 and Copy 2, each a list of src - dst edges.]
