Personalized PageRank based community detection
 

My talk at MLG 2013 on using Personalized PageRank to find communities.


    Presentation Transcript

    • Personalized PageRank based Community Detection. David F. Gleich, Purdue University. Joint work with C. Seshadhri, Joyce Jiyoung Whang, and Inderjit S. Dhillon; supported by NSF CAREER 1149756-CCF. Code: bit.ly/dgleich-codes
    • Today’s talk: 1. Personalized PageRank based community detection 2. Conductance, Egonets, and Network Community Profiles 3. Egonet seeding 4. Improved seeding. David Gleich · Purdue 2 MLG2013
    • A community is a set of vertices that is denser inside than out.
    • (Figure: 250-node GEOP network in 2 dimensions.)
    • We can find communities using Personalized PageRank (PPR) [Andersen et al. 2006]. PPR is a Markov chain on nodes: 1. with probability 𝛼, follow a random edge; 2. with probability 1-𝛼, restart at a seed. Also known as the random surfer or random walk with restart. It has a unique stationary distribution.
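The Markov chain on this slide can be sketched with plain power iteration. This is a minimal illustration, not the local algorithm shown later in the talk: the function name `personalized_pagerank`, the value `alpha=0.85`, the iteration count, and the toy graph are all assumptions made for the example, and the graph is assumed to be undirected with no isolated vertices.

```python
def personalized_pagerank(G, seeds, alpha=0.85, iters=200):
    """Power iteration for PPR on an undirected graph stored as a dict of sets."""
    # Restart distribution: uniform over the seed vertices.
    restart = {v: 0.0 for v in G}
    for s in seeds:
        restart[s] = 1.0 / len(seeds)
    x = dict(restart)
    for _ in range(iters):
        # With probability 1 - alpha, restart at a seed.
        nxt = {v: (1.0 - alpha) * restart[v] for v in G}
        # With probability alpha, follow a random edge out of v.
        for v in G:
            share = alpha * x[v] / len(G[v])
            for u in G[v]:
                nxt[u] += share
        x = nxt
    return x

# Toy graph (hypothetical): two triangles joined by the edge 2-3.
G = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
ppr = personalized_pagerank(G, seeds=[0])
```

Seeding at vertex 0 concentrates probability on the triangle that contains the seed, which is what makes the vector useful for finding a community around it.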
    • Personalized PageRank community detection: 1. Given a seed, approximate the stationary distribution. 2. Extract the community. Both are local operations.
    • Demo!
    • Conductance communities. Conductance is one of the most important community scores [Schaeffer07]. The conductance of a set of vertices is the ratio of edges leaving to total edges: $\phi(S) = \frac{\mathrm{cut}(S)}{\min(\mathrm{vol}(S), \mathrm{vol}(\bar{S}))}$, that is, (edges leaving the set) / (total edges in the set). Equivalently, it’s the probability that a random edge leaves the set. Small conductance ⇔ good community. Example from the figure: cut(S) = 7, vol(S) = 33, vol(S̄) = 11, so $\phi(S) = 7/11$.
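As a small sketch of the definition above, conductance can be computed directly from a dictionary-of-sets graph. The function name `conductance` and the toy graph are assumptions for illustration; the graph is not the one drawn on the slide.

```python
def conductance(G, S):
    """Conductance of vertex set S in an undirected graph stored as a dict of sets."""
    S = set(S)
    vol_S = sum(len(G[v]) for v in S)               # sum of degrees inside S
    vol_rest = sum(len(G[v]) for v in G) - vol_S    # sum of degrees outside S
    cut = sum(1 for v in S for u in G[v] if u not in S)  # edges leaving S
    denom = min(vol_S, vol_rest)
    # Convention chosen here: an empty side gives the worst score, 1.0.
    return cut / denom if denom > 0 else 1.0

# Toy graph (hypothetical): two triangles joined by the edge 2-3.
G = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
phi_triangle = conductance(G, {0, 1, 2})   # cut = 1, vol = 7
phi_single = conductance(G, {0})           # cut = 2, vol = 2
```

One triangle has a single edge leaving it out of a volume of 7, so it scores far better than a single vertex, whose edges all leave the set.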
    • Andersen-Chung-Lang personalized PageRank community theorem [Andersen et al. 2006]. Informally: suppose the seeds are in a set of good conductance; then the personalized PageRank method will find a set with conductance that’s nearly as good. … also, it’s really fast.
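The slides do not spell out how the set is extracted from the PageRank vector. A common choice, and the one analyzed by Andersen et al., is a sweep cut: order vertices by degree-normalized score and keep the prefix with the smallest conductance. The sketch below assumes that approach; `sweep_cut`, the toy graph, and the score vector `x` are made up for illustration.

```python
def sweep_cut(G, x):
    """Return (best prefix set, its conductance) over the degree-normalized order of x."""
    total_vol = sum(len(G[v]) for v in G)
    order = sorted((v for v in x if x[v] > 0),
                   key=lambda v: x[v] / len(G[v]), reverse=True)
    in_set = set()
    vol = 0
    cut = 0
    best_phi, best_k = float("inf"), 0
    for k, v in enumerate(order, start=1):
        inside = sum(1 for u in G[v] if u in in_set)
        vol += len(G[v])
        # Edges to vertices already inside stop being cut; the rest become cut.
        cut += len(G[v]) - 2 * inside
        in_set.add(v)
        denom = min(vol, total_vol - vol)
        if denom > 0 and cut / denom < best_phi:
            best_phi, best_k = cut / denom, k
    return set(order[:best_k]), best_phi

# Toy graph (hypothetical): two triangles joined by the edge 2-3.
G = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
# Hypothetical PPR-like scores concentrated near vertex 0.
x = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.06, 4: 0.02, 5: 0.02}
community, phi = sweep_cut(G, x)
```

Updating the cut and volume incrementally keeps the sweep proportional to the volume of the vertices with nonzero score, which is what keeps the extraction step local.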
    • Code from the slide (Python):
      import collections
      # Inputs: G is the graph as a dictionary of sets, seed is a list of seed vertices
      alpha = 0.99
      tol = 1e-4
      x = {}  # store x and r as dictionaries
      r = {}  # initialize residual
      Q = collections.deque()  # initialize queue
      for s in seed:
          r[s] = 1.0 / len(seed)
          Q.append(s)
      while len(Q) > 0:
          v = Q.popleft()  # v has r[v] > tol*deg(v)
          if v not in x: x[v] = 0.
          x[v] += (1 - alpha) * r[v]
          mass = alpha * r[v] / (2 * len(G[v]))
          for u in G[v]:  # for neighbors of v
              if u not in r: r[u] = 0.
              if r[u] < len(G[u]) * tol and r[u] + mass >= len(G[u]) * tol:
                  Q.append(u)  # add u to queue if large
              r[u] = r[u] + mass
          r[v] = mass * len(G[v])
          # Not on the slide: re-queue v if its leftover residual is still large,
          # so the queue invariant noted above keeps holding.
          if r[v] >= len(G[v]) * tol: Q.append(v)
    • Demo 2!
    • Problem 1, which seeds?
    • Problem 2, not fast enough.
    • Neighborhoods are good communities [Gleich-Seshadhri, KDD 2012].
    • Egonets and Conductance [Gleich-Seshadhri, KDD 2012]. Vertex neighborhoods (egonets) are good conductance communities … in graphs that look like social and information networks.
    • Vertex neighborhoods or egonets: the induced subgraph of the set of a vertex and its neighbors. Prior research on egonets of social networks from the “structural holes” perspective [Burt95, Kleinberg08]. Used for anomaly detection [Akoglu10], community seeds [Huang11, Schaeffer11], overlapping communities [Schaeffer07, Rees10].
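The egonet definition translates into a few lines for a dictionary-of-sets graph. This is only a sketch; the helper name `egonet` and the toy graph are assumptions for illustration.

```python
def egonet(G, v):
    """Induced subgraph on v and its neighbors, returned as a dict of sets."""
    nodes = {v} | G[v]
    # Keep only edges whose endpoints both lie in the neighborhood.
    return {u: G[u] & nodes for u in nodes}

# Toy graph (hypothetical): two triangles joined by the edge 2-3.
G = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
ego = egonet(G, 2)
```

Here the egonet of vertex 2 contains vertices 0, 1, 2, and 3; vertex 3 keeps only its edge back to 2 because its other neighbors fall outside the neighborhood.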
    • Simple version of theorem: if the global clustering coefficient = 1, then the graph is a disjoint union of cliques. Vertex neighborhoods are optimal communities!
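For reference, here is a sketch of the global clustering coefficient, taken as the fraction of wedges (paths of length two) that are closed into triangles. The function name `global_clustering` and both example graphs are assumptions for illustration.

```python
def global_clustering(G):
    """Fraction of wedges that are closed, for an undirected dict-of-sets graph."""
    wedges = 0
    closed = 0
    for v, nbrs in G.items():
        d = len(nbrs)
        wedges += d * (d - 1) // 2  # wedges centered at v
        # Edges among the neighbors of v; each is seen from both endpoints.
        closed += sum(1 for u in nbrs for w in G[u] if w in nbrs) // 2
    return closed / wedges if wedges else 0.0

# Disjoint union of two triangles (hypothetical): every wedge is closed.
cliques = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4, 5}, 4: {3, 5}, 5: {3, 4}}
# The same triangles joined by the edge 2-3: some wedges are now open.
joined = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
kappa_cliques = global_clustering(cliques)
kappa_joined = global_clustering(joined)
```

The disjoint cliques reach the extreme value of 1 described on the slide, while adding a single bridging edge opens four wedges and lowers the coefficient.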
    • Theorem. Condition: let graph G have clustering coefficient 𝜅 and have vertex degrees bounded by a power-law function with exponent 𝛾 less than 3. Then there exists a vertex neighborhood with conductance $\phi \le 4(1-\kappa)/(3-2\kappa)$. (Figure: log degree vs. log probability, showing the power-law bounds on the degree distribution.)
    • Confession! The theory is weak: $\phi(S) \le 4(1-\kappa)/(3-2\kappa)$. This bound is useless unless 𝜅 ≥ 1/2. Collaboration networks: 𝜅 ~ [0.1 – 0.5]. Social networks: 𝜅 ~ [0.05 – 0.1]. Tech. networks: 𝜅 ~ [0.005 – 0.05].
      Graph             Verts    Edges     Avg. Deg.  Max Deg.  κ      C̄
      ca-AstroPh        17903    196972    22.0       504       0.318  0.633
      email-Enron       33696    180811    10.7       1383      0.085  0.509
      cond-mat-2005     36458    171735    9.4        278       0.243  0.657
      arxiv             86376    517563    12.0       1253      0.560  0.678
      dblp              226413   716460    6.3        238       0.383  0.635
      hollywood-2009    1069126  56306653  105.3      11467     0.310  0.766
      fb-Penn94         41536    1362220   65.6       4410      0.098  0.212
      fb-A-oneyear      1138557  4404989   7.7        695       0.038  0.060
      fb-A              3097165  23667394  15.3       4915      0.048  0.097
      soc-LiveJournal1  4843953  42845684  17.7       20333     0.118  0.274
      oregon2-010526    11461    32730     5.7        2432      0.037  0.352
      p2p-Gnutella25    22663    54693     4.8        66        0.005  0.005
      as-22july06       22963    48436     4.2        2390      0.011  0.230
      itdk0304          190914   607610    6.4        1071      0.061  0.158
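A short check of where the 1/2 threshold comes from, assuming 0 ≤ κ ≤ 1 so that the denominator 3 − 2κ is positive. A conductance bound only says something when it is below 1:

```latex
\[
\frac{4(1-\kappa)}{3-2\kappa} < 1
\;\iff\; 4 - 4\kappa < 3 - 2\kappa
\;\iff\; \kappa > \tfrac{1}{2}.
\]
```

For example, with κ = 0.318 (the ca-AstroPh value) the bound evaluates to 2.728/2.364 ≈ 1.15, which is above 1 and therefore uninformative. At κ = 1 it evaluates to 0, which agrees with the disjoint-cliques case.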
    • We view this theory as “intuition for the truth”.
    • Empirical Evaluation using Network Community Profiles. (Figure: network community profile for fb-A-oneyear, log-log; x-axis: community size; y-axis: minimum conductance for any community of the given size.) Approximate canonical shape found by Leskovec, Lang, Dasgupta, and Mahoney. Holds for a variety of approximations to conductance.
    • Empirical Evaluation using Network Community Profiles. (Figure: egonet community profile for fb-A-oneyear, 1.1M verts, 4M edges, log-log; x-axis: community size (degree + 1); y-axis: minimum conductance for any neighborhood of the given size.) The “egonet community profile” shows the same shape and takes 3 secs to compute. The Fiedler community computed from the normalized Laplacian is a neighborhood! Facebook data from Wilson et al. 2009.
    • Not just one graph. (Figures: neighborhood community profiles, conductance vs. number of vertices in cluster, log-log, for arxiv, ca-AstroPh, and soc-LiveJournal1.) arXiv: 86k verts, 500k edges. soc-LiveJournal: 5M verts, 42M edges. 15 more graphs available at www.cs.purdue.edu/~dgleich/codes/neighborhoods
    • Filling in the Network Community Profile. (Figure: egonet community profile for the Facebook sample, 1.1M verts, 4M edges; x-axis: community size (degree + 1); y-axis: minimum conductance for any neighborhood of the given size.) We are missing a region of the NCP when we just look at neighborhoods.
    • Filling in the Network Community Profile. (Figure: full NCP for the Facebook sample, 1.1M verts, 4M edges; x-axis: community size; y-axis: minimum conductance for any community of the given size; 7807 seconds to compute.) This region fills when using the PPR method (like now!).
    • Am I a good seed? Locally minimal communities: “My conductance is the best locally.” $\phi(N(v)) \le \phi(N(w))$ for all w adjacent to v. In Zachary’s Karate Club network, there are four locally minimal communities: the two leaders and two peripheral nodes.
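The locally minimal test can be sketched as follows: score every closed neighborhood N(v) by conductance and keep the vertices whose score is no worse than any neighbor's. The names `neighborhood_conductance` and `locally_minimal_seeds` and the toy graph are assumptions for illustration, and this direct version recomputes each cut from scratch rather than using any faster counting scheme.

```python
def neighborhood_conductance(G, v):
    """Conductance of the closed neighborhood N(v) = {v} plus its neighbors."""
    S = {v} | G[v]
    vol_S = sum(len(G[u]) for u in S)
    vol_rest = sum(len(G[u]) for u in G) - vol_S
    cut = sum(1 for u in S for w in G[u] if w not in S)
    denom = min(vol_S, vol_rest)
    # Convention chosen here: a neighborhood covering the whole graph scores 1.0.
    return cut / denom if denom > 0 else 1.0

def locally_minimal_seeds(G):
    """Vertices whose neighborhood conductance is <= that of every adjacent vertex."""
    phi = {v: neighborhood_conductance(G, v) for v in G}
    return [v for v in G if all(phi[v] <= phi[w] for w in G[v])]

# Toy graph (hypothetical): two triangles joined by the edge 2-3.
G = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
seeds = locally_minimal_seeds(G)
```

In this toy graph the two bridge endpoints have worse neighborhoods than their triangle mates, so only the four non-bridge vertices qualify as seeds.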
    • Locally minimal communities capture extremal neighborhoods. (Figure: egonet community profile for the Facebook sample, 1.1M verts, 4M edges; x-axis: community size; red dots mark locally minimal communities.) Red dots are the conductance and size of a locally minimal community. Usually about 1% of the vertices are locally minimal. The red circles (the best local mins) find the extremes in the egonet profile.
    • Filling in the NCP: growing locally minimal communities. PPR growing only locally min communities, seeded from entire egonet. (Figure: the three profiles compared; x-axis: community size.) Run times: Original Egonet, 3 seconds; Locally min NCP, 283 seconds; Full NCP, 7807 seconds.
    • But there’s a small problem. Most people want to cover a network with communities! We just looked at the best.
    • The coverage of egonet-grown communities is really bad. (Figure: coverage vs. max conductance for the Facebook network, log-log scale; marked points: 0.7% coverage at conductance 0.10, 2.2% at 0.25, 29.8% at 0.88.) With a conductance of 0.1 (not so good) we only cover 1% of the vertices in the network.
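A coverage-versus-conductance curve of this kind can be sketched by sorting communities from best to worst conductance and tracking the fraction of vertices covered so far. The exact procedure behind the plot is not given in the slides, so `coverage_curve` and the sample communities below are assumptions for illustration.

```python
def coverage_curve(n_vertices, communities):
    """communities: list of (conductance, set_of_vertices) pairs.

    Returns (max conductance allowed, fraction of vertices covered) points.
    """
    covered = set()
    curve = []
    for phi, members in sorted(communities, key=lambda c: c[0]):
        covered |= members
        curve.append((phi, len(covered) / n_vertices))
    return curve

# Hypothetical communities on a 10-vertex graph.
communities = [(0.5, {0, 1, 2, 3}), (0.1, {0, 1}), (0.25, {1, 2})]
curve = coverage_curve(10, communities)
```

Each point answers the question the slide poses: if only communities up to a given conductance are accepted, how much of the network do they cover?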
    • Whang-Gleich-Dhillon, CIKM2013 [upcoming…]: 1. Extract part of the graph that might have overlapping communities. 2. Compute a partitioning of the network into many pieces (think sqrt(n)) using Graclus. 3. Find the center of these partitions. 4. Use PPR to grow egonets of these centers.
    • A good partitioning helps! (Figure: coverage (percentage) vs. maximum conductance on the Flickr social network, 2M vertices, 22M edges, comparing egonet, graclus centers, spread hubs, random, and bigclam.) We can cover 95% of the network with communities of cond. ~0.15.
    • And helps to find real-world overlapping communities too. (Figure: F1 and F2 measures on DBLP comparing demon, bigclam, graclus centers, spread hubs, random, and egonet.) Using datasets from Yang and Leskovec (WSDM 2013) with known overlapping community structure. Our method outperforms current state-of-the-art overlapping community detection methods. Even randomly seeded!
    • Conclusion & Discussion. PPR community detection is fast [Andersen et al. FOCS06]. PPR communities look real [Abrahao et al. KDD2012; Zhu et al. ICML2013]. Egonet analysis reveals basis of NCP. Partitioning for seeding yields high coverage & real communities. “Caveman” communities? Best conductance cut at intersection of communities? References: Gleich & Seshadhri KDD2012; Whang, Gleich & Dhillon CIKM2013. Code: PPR sample, bit.ly/18khzO5; egonet seeding, bit.ly/dgleich-code
    • Proof Sketch. 1) Large clustering coefficient ⇒ many wedges are closed. 2) Heavy-tailed degree dist ⇒ a few vertices have a very large degree. 3) Large degree ⇒ O(d²) wedges ⇒ “most” of wedges. Thus, there must exist a vertex with a high edge density ⇒ “good” conductance. Use the probabilistic method to formalize. (Figure: CDF of number of wedges vs. degree.)