Upcoming SlideShare
×

# Personalized PageRank based community detection

3,518 views
3,379 views

Published on

My talk at MLG 2013 on using Personalized PageRank to find communities.

Published in: Education, Technology
10 Likes
Statistics
Notes
• Full Name
Comment goes here.

Are you sure you want to Yes No
• Be the first to comment

Views
Total views
3,518
On SlideShare
0
From Embeds
0
Number of Embeds
4
Actions
Shares
0
201
0
Likes
10
Embeds 0
No embeds

No notes for slide

### Personalized PageRank based community detection

1. 1. Personalized ! PageRank based Community Detection David F. Gleich! Purdue University! Joint work with C. Seshadhri, Joyce Jiyoung Whang, and Inderjit S. Dhillon, supported by NSF CAREER 1149756-CCF Code bit.ly/dgleich-codes!
2. 2. Today’s talk 1.  Personalized PageRank" based community detection 2.  Conductance, Egonets, and " Network Community Proﬁles 3.  Egonet seeding 4.  Improved seeding David Gleich · Purdue 2 MLG2013
3. 3. A community is a set of vertices that is denser inside than out. David Gleich · Purdue 3 MLG2013
4. 4. 250 node GEOP network in 2 dimensions 4
5. 5. 250 node GEOP network in 2 dimensions 5
6. 6. We can ﬁnd communities using Personalized PageRank (PPR) [Andersen et al. 2006] PPR is a Markov chain on nodes 1.  with probability 𝛼, ", " follow a random edge 2.  with probability 1-𝛼, ", " restart at a seed aka random surfer aka random walk with restart unique stationary distribution David Gleich · Purdue 6 MLG2013
7. 7. Personalized PageRank community detection 1.  Given a seed, approximate the stationary distribution. 2.  Extract the community. Both are local operations. David Gleich · Purdue 7 MLG2013
8. 8. Demo! David Gleich · Purdue 8 MLG2013
9. 9. Conductance communities Conductance is one of the most important community scores [Schaeffer07] The conductance of a set of vertices is the ratio of edges leaving to total edges: Equivalently, it’s the probability that a random edge leaves the set. Small conductance ó Good community (S) = cut(S) min vol(S), vol( ¯S) (edges leaving the set) (total edges in the set) David Gleich · Purdue cut(S) = 7 vol(S) = 33 vol( ¯S) = 11 (S) = 7/11 9 MLG2013
10. 10. Andersen- Chung-Lang! personalized PageRank community theorem! [Andersen et al. 2006]" Informally Suppose the seeds are in a set of good conductance, then the personalized PageRank method will ﬁnd a set with conductance that’s nearly as good. … also, it’s really fast. David Gleich · Purdue 10 MLG2013
11. 11. # G is graph as dictionary-of-sets! alpha=0.99! tol=1e-4! ! x = {} # Store x, r as dictionaries! r = {} # initialize residual! Q = collections.deque() # initialize queue! for s in seed: ! r(s) = 1/len(seed)! Q.append(s)! while len(Q) > 0:! v = Q.popleft() # v has r[v] > tol*deg(v)! if v not in x: x[v] = 0.! x[v] += (1-alpha)*r[v]! mass = alpha*r[v]/(2*len(G[v])) ! for u in G[v]: # for neighbors of u! if u not in r: r[u] = 0.! if r[u] < len(G[u])*tol and ! r[u] + mass >= len(G[u])*tol:! Q.append(u) # add u to queue if large! r[u] = r[u] + mass! r[v] = mass*len(G[v]) ! David Gleich · Purdue 11 MLG2013
12. 12. Demo 2! David Gleich · Purdue 12 MLG2013
13. 13. Problem 1, which seeds? David Gleich · Purdue 13 MLG2013
14. 14. Problem 2, not fast enough. David Gleich · Purdue 14 MLG2013
15. 15. Gleich-Seshadhri, KDD 2012! David Gleich · Purdue Neighborhoods are good communities 15 MLG2013
16. 16. Gleich-Seshadhri, KDD 2012! Egonets and Conductance David Gleich · Purdue Neighborhoods are good communities ^ conductance ^ Vertex Egonets? … in graphs that look like social and information networks 16 MLG2013
17. 17. Vertex neighborhoods or! Egonets The induced subgraph of" set a vertex its neighbors Prior research on egonets of social networks from the “structural holes” perspective [Burt95,Kleinberg08]. Used for anomaly detection [Akoglu10], " community seeds [Huang11,Schaeffer11], " overlapping communities [Schaeffer07,Rees10]. David Gleich · Purdue 17 MLG2013
18. 18. Simple version of theorem If global clustering coefﬁcient = 1, then " the graph is a disjoint union of cliques. Vertex neighborhoods are optimal communities! David Gleich · Purdue 18 MLG2013
19. 19. Theorem Condition Let graph G have clustering coefﬁcient 𝜅 and " have vertex degrees bounded " by a power-law function with exponent 𝛾 less than 3. Theorem Then there exists a vertex neighborhood with conductance log degree logprobability ↵1n/d ↵2n/d  4(1 )/(3 2) David Gleich · Purdue 19 MLG2013
20. 20. Confession! The theory is weak (S)  4(1 )/(3 2) Collaboration networks " 𝜅 ~ [0.1 – 0.5] Social networks " 𝜅 ~ [0.05 – 0.1] Graph Verts Edges Avg. Deg. Max Deg.  ¯C ca-AstroPh 17903 196972 22.0 504 0.318 0.633 email-Enron 33696 180811 10.7 1383 0.085 0.509 cond-mat-2005 36458 171735 9.4 278 0.243 0.657 arxiv 86376 517563 12.0 1253 0.560 0.678 dblp 226413 716460 6.3 238 0.383 0.635 hollywood-2009 1069126 56306653 105.3 11467 0.310 0.766 fb-Penn94 41536 1362220 65.6 4410 0.098 0.212 fb-A-oneyear 1138557 4404989 7.7 695 0.038 0.060 fb-A 3097165 23667394 15.3 4915 0.048 0.097 soc-LiveJournal1 4843953 42845684 17.7 20333 0.118 0.274 oregon2-010526 11461 32730 5.7 2432 0.037 0.352 p2p-Gnutella25 22663 54693 4.8 66 0.005 0.005 as-22july06 22963 48436 4.2 2390 0.011 0.230 itdk0304 190914 607610 6.4 1071 0.061 0.158 Verts Edges Avg. Deg. Max Deg.  ¯C 17903 196972 22.0 504 0.318 0.633 n 33696 180811 10.7 1383 0.085 0.509 2005 36458 171735 9.4 278 0.243 0.657 86376 517563 12.0 1253 0.560 0.678 226413 716460 6.3 238 0.383 0.635 2009 1069126 56306653 105.3 11467 0.310 0.766 41536 1362220 65.6 4410 0.098 0.212 ar 1138557 4404989 7.7 695 0.038 0.060 3097165 23667394 15.3 4915 0.048 0.097 urnal1 4843953 42845684 17.7 20333 0.118 0.274 10526 11461 32730 5.7 2432 0.037 0.352 la25 22663 54693 4.8 66 0.005 0.005 6 22963 48436 4.2 2390 0.011 0.230 190914 607610 6.4 1071 0.061 0.158 Tech. networks " 𝜅 ~ [0.005 – 0.05] This bound is useless unless 𝜅 ≥ 1/2 David Gleich · Purdue 20 MLG2013
21. 21. We view this theory as " “intuition for the truth” David Gleich · Purdue 21 MLG2013
22. 22. Empirical Evaluation using Network Community Proﬁles la25 [27]) or as 304 [10]). The the nodes. and the edges ]. OD ure to compute he conductance ost of the work erformed when We can express 10 0 10 1 10 2 10 3 10 4 10 5 10 −4 10 max deg 10 0 10 1 10 2 10 3 10 4 10 5 10 −4 10 max deg 10 0 10 1 1 10 −4 10 −3 10 0 10 1 1 10 −4 10 −3 fb-A-oneyear 10 0 10 1 10 2 10 3 10 4 10 5 10 −4 10 −3 10 −2 10 −1 10 0 max deg 10 0 10 1 10 2 10 3 10 4 10 5 10 −4 10 −3 10 −2 10 −1 10 0 max deg 10 0 10 1 10 −4 10 −3 10 −2 10 −1 10 0 10 0 10 1 10 −4 10 −3 10 −2 10 −1 10 0 soc-LiveJournal1 ca 10 0 10 0 10 0 10 0 Community Size Minimum conductance for any community of the given size Approximate canonical shape found by Leskovec, Lang, Dasgupta, and Mahoney Holds for a variety of approximations to conductance. David Gleich · Purdue 22 MLG2013
23. 23. Empirical Evaluation using Network Community Proﬁles la25 [27]) or as 304 [10]). The the nodes. and the edges ]. OD ure to compute he conductance ost of the work erformed when We can express 10 0 10 1 10 2 10 3 10 4 10 5 10 −4 10 max deg 10 0 10 1 10 2 10 3 10 4 10 5 10 −4 10 max deg 10 0 10 1 1 10 −4 10 −3 10 0 10 1 1 10 −4 10 −3 fb-A-oneyear 10 0 10 1 10 2 10 3 10 4 10 5 10 −4 10 −3 10 −2 10 −1 10 0 max deg 10 0 10 1 10 2 10 3 10 4 10 5 10 −4 10 −3 10 −2 10 −1 10 0 max deg 10 0 10 1 10 −4 10 −3 10 −2 10 −1 10 0 10 0 10 1 10 −4 10 −3 10 −2 10 −1 10 0 soc-LiveJournal1 ca 10 0 10 0 10 0 10 0 Community Size" (Degree + 1) Minimum conductance for any community neighborhood of the given size “Egonet community proﬁle” shows the same shape, 3 secs to compute. 1.1M verts, 4M edges The Fiedler community computed from the normalized Laplacian is a neighborhood! David Gleich · Purdue Facebook data from Wilson et al. 2009 23 MLG2013
24. 24. Not just one graph 10 5 10 5 10 0 10 1 10 2 10 3 10 4 10 5 10 −4 10 −3 max deg ver t s 2 10 0 10 1 10 2 10 3 10 4 10 5 10 −4 10 −3 max deg ver t s 2 arxiv 10 5 10 5 10 0 10 1 10 2 10 3 10 4 10 5 10 −4 10 −3 10 −2 10 −1 10 0 max deg ver t s 2 10 0 10 1 10 2 10 3 10 4 10 5 10 −4 10 −3 10 −2 10 −1 10 0 max deg ver t s 2 ca-AstroPh 10 0 10 0 t any procedure to compute o compute the conductance he graph. Most of the work e cient is performed when the vertex. We can express s: {v})/2 ighbors produces a triangle double-counts). Note also dges(N1(v))/2 dv. Then ges(N1(v)). And so, given compute the cut given the well. This is easy to do with tly stores the degrees. hood communities network community plot to 10 0 10 1 10 2 10 3 10 4 10 5 10 −4 10 −3 max deg 10 0 10 1 10 2 10 3 10 4 10 5 10 −4 10 −3 max deg 1 10 −4 10 −3 1 10 −4 10 −3 soc-LiveJournal1 10 0 10 1 10 2 10 3 10 4 10 5 10 −4 10 −3 10 −2 10 −1 10 0 max deg 10 0 10 1 10 2 10 3 10 4 10 5 10 −4 10 −3 10 −2 10 −1 10 0 max deg 1 10 −2 10 −1 10 0 1 10 −2 10 −1 10 0 Number of vertices in cluster N Figure 2: The best neighbo ductance at each size (black arXiv – 86k verts, 500k edges soc-LiveJournal – 5M verts, 42M edges 15 more graphs available www.cs.purdue.edu/~dgleich/codes/neighborhoods David Gleich · Purdue 24 MLG2013
25. 25. Filling in the ! Network Community Proﬁle la25 [27]) or as 304 [10]). The the nodes. and the edges ]. OD ure to compute he conductance ost of the work erformed when We can express 10 0 10 1 10 2 10 3 10 4 10 5 10 −4 10 max deg 10 0 10 1 10 2 10 3 10 4 10 5 10 −4 10 max deg 10 0 10 1 1 10 −4 10 −3 10 0 10 1 1 10 −4 10 −3 fb-A-oneyear 10 0 10 1 10 2 10 3 10 4 10 5 10 −4 10 −3 10 −2 10 −1 10 0 max deg 10 0 10 1 10 2 10 3 10 4 10 5 10 −4 10 −3 10 −2 10 −1 10 0 max deg 10 0 10 1 10 −4 10 −3 10 −2 10 −1 10 0 10 0 10 1 10 −4 10 −3 10 −2 10 −1 10 0 soc-LiveJournal1 ca 10 0 10 0 10 0 10 0 Minimum conductance for any community neighborhood of the given size We are missing a region of the NCP when we just look at neighborhoods David Gleich · Purdue Community Size" (Degree + 1) 25 MLG2013 Facebook Sample - 1.1M verts, 4M edges
26. 26. 10 0 10 1 10 2 10 3 10 4 10 5 10 −4 10 −3 10 −2 10 −1 10 0 max deg 10 0 10 1 10 2 10 3 10 4 10 5 10 −4 10 −3 10 −2 10 −1 10 0 max deg Filling in the ! Network Community Proﬁle la25 [27]) or as 304 [10]). The the nodes. and the edges ]. OD ure to compute he conductance ost of the work erformed when We can express 10 0 10 1 10 2 10 3 10 4 10 5 10 −4 10 max deg 10 0 10 1 10 2 10 3 10 4 10 5 10 −4 10 max deg 10 0 10 1 1 10 −4 10 −3 10 0 10 1 1 10 −4 10 −3 fb-A-oneyear 10 0 10 1 10 2 10 3 10 4 10 5 10 −4 10 −3 10 −2 10 −1 10 0 max deg 10 0 10 1 10 2 10 3 10 4 10 5 10 −4 10 −3 10 −2 10 −1 10 0 max deg 10 0 10 1 10 −4 10 −3 10 −2 10 −1 10 0 10 0 10 1 10 −4 10 −3 10 −2 10 −1 10 0 soc-LiveJournal1 ca 10 0 10 0 10 0 10 0 Minimum conductance for any community of the given size 7807 seconds This region ﬁlls when using the PPR method (like now!) David Gleich · Purdue Community Size" 26 MLG2013 Facebook Sample - 1.1M verts, 4M edges
27. 27. Am I a good seed?! Locally Minimal Communities “My conductance is the best locally.” (N(v))  (N(w)) for all w adjacent to v In Zachary’s Karate Club network, there are four locally minimal communities, the two leaders and two peripheral nodes. David Gleich · Purdue 27 MLG2013
28. 28. 10 0 10 1 10 2 10 3 10 4 10 5 10 −4 10 −3 10 −2 10 −1 10 0 max deg 10 0 10 1 10 2 10 3 10 4 10 5 10 −4 10 −3 10 −2 10 −1 10 0 max deg Locally minimal communities capture extremal neighborhoods la25 [27]) or as 304 [10]). The the nodes. and the edges ]. OD ure to compute he conductance ost of the work erformed when We can express 10 0 10 1 10 2 10 3 10 4 10 5 10 −4 10 max deg 10 0 10 1 10 2 10 3 10 4 10 5 10 −4 10 max deg 10 0 10 1 1 10 −4 10 −3 10 0 10 1 1 10 −4 10 −3 fb-A-oneyear 10 0 10 1 10 2 10 3 10 4 10 5 10 −4 10 −3 10 −2 10 −1 10 0 max deg 10 0 10 1 10 2 10 3 10 4 10 5 10 −4 10 −3 10 −2 10 −1 10 0 max deg 10 0 10 1 10 −4 10 −3 10 −2 10 −1 10 0 10 0 10 1 10 −4 10 −3 10 −2 10 −1 10 0 soc-LiveJournal1 ca 10 0 10 0 10 0 10 0 Red dots are conductance " and size of a " locally minimal community Usually about 1% of # of vertices. The red circles – the best local mins – ﬁnd the extremes in the egonet proﬁle. David Gleich · Purdue Community Size" 28 MLG2013 Facebook Sample - 1.1M verts, 4M edges
29. 29. 10 0 10 1 10 2 10 3 10 4 10 5 10 −4 10 −3 10 −2 10 −1 10 0 max deg 10 0 10 1 10 2 10 3 10 4 10 5 10 −4 10 −3 10 −2 10 −1 10 0 max deg Filling in the NCP! Growing locally minimal comm. la25 [27]) or as 304 [10]). The the nodes. and the edges ]. OD ure to compute he conductance ost of the work erformed when We can express 10 0 10 1 10 2 10 3 10 4 10 5 10 −4 10 max deg 10 0 10 1 10 2 10 3 10 4 10 5 10 −4 10 max deg 10 0 10 1 1 10 −4 10 −3 10 0 10 1 1 10 −4 10 −3 fb-A-oneyear 10 0 10 1 10 2 10 3 10 4 10 5 10 −4 10 −3 10 −2 10 −1 10 0 max deg 10 0 10 1 10 2 10 3 10 4 10 5 10 −4 10 −3 10 −2 10 −1 10 0 max deg 10 0 10 1 10 −4 10 −3 10 −2 10 −1 10 0 10 0 10 1 10 −4 10 −3 10 −2 10 −1 10 0 soc-LiveJournal1 ca 10 0 10 0 10 0 10 0 PPR growing only locally min communities, seeded from entire egonet 3 seconds 283 seconds 7807 seconds Full NCP Locally min NCP Original Egonet David Gleich · Purdue Community Size" 29 MLG2013
30. 30. But there’s a small problem. Most people want to cover a network with communities! We just looked at the best. David Gleich · Purdue 30 MLG2013
31. 31. The coverage of egonet-grown communities is really bad. David Gleich · Purdue 10 −3 10 −2 10 −1 10 0 10 −2 10 −1 10 0 0.7% =0.10 2.2% =0.25 29.8% =0.88 10 −3 10 −2 10 −1 10 0 10 −2 10 −1 10 0 0.7% =0.10 2.2% =0.25 29.8% =0.88 Coverage MaxConductance Facebook network With a conductance of 0.1 (not so good) we only cover 1% of the vertices in the network. Log-Log scale! 31 MLG2013
32. 32. Whang-Gleich-Dhillon, CIKM2013 [upcoming…] 1.  Extract part of the graph that might have overlapping communities. 2.  Compute a partitioning of the network into many pieces (think sqrt(n)) using Graclus. 3.  Find the center of these partitions. 4.  Use PPR to grow egonets of these centers. David Gleich · Purdue 32 MLG2013
33. 33. Student Version of MATLAB (a) AstroPh 0 10 20 30 40 50 60 70 80 90 100 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Coverage (percentage) MaximumConductance egonet graclus centers spread hubs random bigclam (d) Flickr Flickr social network 2M vertices" 22M edges We can cover 95% of network with communities of cond. ~0.15. David Gleich · Purdue A good partitioning helps! 33 MLG2013 ﬂickr sample - 2M verts, 22M edges
34. 34. F1 F2 0.1 0.12 0.14 0.16 0.18 0.2 0.22 0.24 DBLP demon bigclam graclus centers spread hubs random egonet 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.55 0.6 Figure 3: F1 and F2 measures comparing our algorithmic co indicates better communities. Run time Our seed Using datasets from " Yang and Leskovec (WDSM 2013) with known overlapping community structure Our method outperform current state of the art overlapping community detection methods. " Even randomly seeded! David Gleich · Purdue And helps to ﬁnd real-world overlapping communities too. 34 MLG2013
35. 35. Conclusion & Discussion & PPR community detection is fast " [Andersen et al. FOCS06] PPR communities look real " [Abrahao et al. KDD2012; Zhu et al. ICML2013] Egonet analysis reveals basis of NCP Partitioning for seeding yields " high coverage & real communities. “Caveman” communities?! ! ! ! MLG2013 David Gleich · Purdue 35 Gleich & Seshadhri KDD2012 Whang, Gleich & Dhillon CIKM2013 PPR Sample ! bit.ly/18khzO5! ! Egonet seeding bit.ly/dgleich-code! References Best conductance cut at intersection of communities?
36. 36. Proof Sketch 1) Large clustering coefﬁcient " ⇒ many wedges are closed 2) Heavy tailed degree dist " ⇒ a few vertices have a very large degree 3) Large degree ⇒ O(d 2) wedges ⇒ “most” of wedges Thus, there must exist a vertex with a high edge density ⇒ “good” conductance Use the probabilistic method to formalize 10 0 10 1 10 2 10 3 10 4 0 0.2 0.4 0.6 0.8 1 CDFofNumberofWedges Degree David Gleich · Purdue 36 MLG2013