Higher-order graph clustering at AMS Spring Western Sectional

Published on April 22, 2017

Slides from my talk in the special session on "Clustering of Graphs: Theory and Practice".


  1. 1. Higher-order graph clustering. Austin Benson, Stanford University. AMS Spring Western Sectional, Pullman, WA, April 22, 2017. Joint work with David Gleich (Purdue) and Jure Leskovec (Stanford).
  2. 2. Unifying two themes of network analysis. 1. Real-world networks have modular organization [Newman04, Newman06]; edge-based clustering and community detection sometimes expose this structure (illustrated on a co-author network). 2. Higher-order connectivity patterns, or network motifs, drive complex systems [Milo+02, Yaveroğlu+14]: signed feed-forward loops in genetic transcription [Mangan+03, Alon07], triangles in social networks [Rapoport53, Granovetter73], and bi-directed length-2 paths in brain networks [Sporns-Kötter04, Sporns+07].
  3. 3. Motifs are the fundamental units of complex networks. We should look for structure (clusters) in terms of them.
  4. 4. Different motifs give different clusters. Higher-order graph clustering is our technique for finding clusters based on motifs. (Figure panels: Network, Motif.)
  5. 5. Higher-order graph clustering: main points and overview. § We will generalize spectral clustering, a classical technique for finding clusters or communities in a graph, to use motifs as the fundamental unit to partition. § The method is based on a higher-order (motif-based) conductance metric that generalizes traditional conductance. § It comes with theoretical guarantees. § We'll first briefly review how spectral clustering works. § Then we'll see how to adapt it to work with network motifs. § Then we'll see the impact of this approach on various real-world data.
  6. 6. Background: spectral clustering is a classic technique for partitioning graphs by looking at eigenvectors [Fiedler73]. (Figure pipeline: Graph → Laplacian → Eigenvector → Cluster.)
  7. 7. Background: spectral clustering works based on conductance. Conductance is one of the most important cluster quality scores [Schaeffer07], used in Markov chain theory, spectral clustering, bioinformatics, vision, etc. The conductance of a set of vertices S is the ratio of edges leaving S to the edge end points in S: φ(S) = cut(S) / min(vol(S), vol(S̄)), where cut(S) = #(edges leaving S) and vol(S) = #(edge end points in S). Small conductance ⇔ good cluster. (Figure example: cut(S) = 7, vol(S) = 85, vol(S̄) = 151, so φ(S) = 7/85.) A small sketch of this computation follows below.
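To make the definition concrete, here is a minimal Julia sketch, not from the talk's materials; the function name conductance and the toy two-triangle graph are illustrative choices of mine.

    # Conductance of a vertex set S, given a 0/1 symmetric adjacency matrix A.
    function conductance(A::AbstractMatrix, S::AbstractVector{<:Integer})
        n = size(A, 1)
        inS = falses(n); inS[S] .= true
        cut  = sum(A[inS, .!inS])          # edges leaving S
        degs = vec(sum(A, dims=2))         # vol counts edge end points (degrees)
        volS = sum(degs[inS])
        return cut / min(volS, sum(degs) - volS)
    end

    # Two triangles joined by one edge; S = {1,2,3} is one of the triangles.
    A = [0 1 1 0 0 0; 1 0 1 0 0 0; 1 1 0 1 0 0;
         0 0 1 0 1 1; 0 0 0 1 0 1; 0 0 0 1 1 0]
    conductance(A, [1, 2, 3])              # 1/7: one edge crosses, vol(S) = vol(S̄) = 7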
  8. 8. Background: spectral clustering has theoretical guarantees [Cheeger70, Alon-Milman85]. Finding the best conductance set is NP-hard. § Cheeger realized that the eigenvalues of the Laplacian provide surface-area-to-volume bounds in manifolds. § Alon and Milman independently realized the same thing for graphs (conductance)! Normalized Laplacian: D = diag(Ae), L = D^{-1/2}(D − A)D^{-1/2}, with eigenvalues 0 = λ1 ≤ λ2 ≤ ... ≤ λn ≤ 2. Cheeger inequality: φ(S*)²/2 ≤ λ2 ≤ 2φ(S*), where S* is the set of smallest conductance. (A sketch of the Laplacian construction follows below.)
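A minimal Julia sketch of the normalized Laplacian above, not from the talk; it assumes the graph has no isolated nodes (so D is invertible), and the example matrix A is the toy graph from the previous sketch.

    using LinearAlgebra

    # D = diag(A e); L = D^{-1/2} (D - A) D^{-1/2}
    function normalized_laplacian(A::AbstractMatrix)
        d = vec(sum(A, dims=2))
        Dis = Diagonal(1 ./ sqrt.(d))
        return Symmetric(Dis * (Diagonal(d) - A) * Dis)
    end

    A = [0 1 1 0 0 0; 1 0 1 0 0 0; 1 1 0 1 0 0;
         0 0 1 0 1 1; 0 0 0 1 0 1; 0 0 0 1 1 0]
    eigvals(normalized_laplacian(A))       # 0 = λ1 ≤ λ2 ≤ ... ≤ λn ≤ 2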
  9. 9. Background: the sweep cut algorithm realizes the guarantee [Mihail89, Chung92]. We can find a set S that achieves the Cheeger bound. 1. Compute the eigenvector z associated with λ2 and scale it to f = D^{-1/2} z. 2. Sort the vertices by their values in f: σ1, σ2, ..., σn. 3. Let Sr = {σ1, ..., σr} and compute the conductance φ(Sr) of each Sr. 4. Pick the set Sm with minimum conductance. (Figure: the sweep profile φ(Sr) plotted against r.) A sketch of the whole procedure follows below.
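Here is a minimal, dense-matrix Julia sketch of the four steps, again not from the talk's code; the function name sweep_cut is mine, and a real implementation would use sparse matrices and an iterative eigensolver.

    using LinearAlgebra

    # Sweep cut: order vertices by f = D^{-1/2} z (z the eigenvector for λ2),
    # then return the prefix set Sr with the smallest conductance.
    function sweep_cut(A::AbstractMatrix)
        n = size(A, 1)
        d = vec(sum(A, dims=2))
        Dis = Diagonal(1 ./ sqrt.(d))
        L = Symmetric(Dis * (Diagonal(d) - A) * Dis)
        z = eigen(L).vectors[:, 2]         # eigenvalues come back in ascending order
        order = sortperm(Dis * z)          # σ1, σ2, ..., σn
        bestS, bestphi = Int[], Inf
        for r in 1:n-1
            inS = falses(n); inS[order[1:r]] .= true
            phi = sum(A[inS, .!inS]) / min(sum(d[inS]), sum(d[.!inS]))
            if phi < bestphi
                bestphi, bestS = phi, order[1:r]
            end
        end
        return bestS, bestphi
    end

On the two-triangle graph from the earlier sketch, this returns one of the two triangles with conductance 1/7.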
  10. 10. Background: the sweep cut visualized [Mihail89, Chung92]. (Figure: the sweep profile of φ(Sr) = cut(Sr) / min(vol(Sr), vol(S̄r)) over the prefix sets Sr.)
  11. 11. Spectral clustering is theoretically justified for finding edge-based clusters in undirected, simple graphs. We want to cluster with richer data: motifs that may be directed, signed, colored, feature-valued, etc. Example: signed feed-forward loops in genetic transcription [Mangan+03]. Gene X activates transcription of gene Y; gene X suppresses transcription of gene Z; gene Y suppresses transcription of gene Z.
  12. 12. Our contributions (Benson-Gleich-Leskovec, Science, 2016). § A generalized conductance metric for motifs. § A new spectral clustering algorithm to minimize the generalized conductance. § AND an associated motif Cheeger inequality guarantee. § Naturally handles directed, signed, colored, weighted, and combinations of motifs. § Scales to networks with billions of edges. § Applications in ecology, biology, and transportation.
  13. 13. How do we find clusters based on motifs?
  14. 14. Motif-based conductance: we need new notions of cut and volume. Edge-based: cut(S) = #(edges cut), vol(S) = #(edge end points in S), and φ(S) = cut(S) / min(vol(S), vol(S̄)). Motif-based: cut_M(S) = #(motifs cut), vol_M(S) = #(motif end points in S), and φ_M(S) = cut_M(S) / min(vol_M(S), vol_M(S̄)). (Figure: M = triangle motif.) A sketch for the triangle motif follows below.
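A minimal Julia sketch of motif cut, volume, and conductance for the triangle motif, computed by brute-force enumeration; the function name triangle_conductance is mine, and the code is only meant to make the definitions concrete, not to be efficient.

    # Motif conductance for the triangle motif, by enumerating all triangles.
    function triangle_conductance(A::AbstractMatrix, S::AbstractVector{<:Integer})
        n = size(A, 1)
        inS = falses(n); inS[S] .= true
        cut = 0
        vol = zeros(Int, n)                    # vol[i] = # triangles containing i
        for i in 1:n, j in i+1:n, k in j+1:n
            (A[i, j] != 0 && A[j, k] != 0 && A[i, k] != 0) || continue
            vol[i] += 1; vol[j] += 1; vol[k] += 1
            if !(inS[i] == inS[j] == inS[k])   # triangle has nodes on both sides
                cut += 1
            end
        end
        volS = sum(vol[inS])
        return cut / min(volS, sum(vol) - volS)
    end

On the two-triangle graph used earlier, the set S = {1, 2, 3} has motif conductance 0, since no triangle is cut, even though its edge conductance is 1/7.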
  15. 15. Motif-based conductance, example. For the set S shown in the figure, φ_M(S) = (motifs cut) / (motif volume) = 1/8.
  16. 16. Higher-order clustering. Problem: given a motif M and a graph G, we want to find a set of nodes S that minimizes motif conductance. This is NP-hard [Wagner-Wagner93]. Our solution: generalize spectral clustering for motifs. 1. Form a new weighted, undirected graph W(M) based on M and G. 2. Compute the Fiedler vector of the Laplacian matrix of W(M) [Fiedler73, Alon-Milman85]. 3. Use the "sweep cut" procedure to output clusters [Mihail89, Chung92]. Theorem (motif Cheeger inequality): the resulting clusters obtain near-optimal motif conductance.
  17. 17. Motif-based spectral clustering, Step 1. Given a directed graph G and a motif M, form a weighted graph W(M) with W(M)_ij = #{instances of motif M that contain nodes i and j}. (Figure: the graph G, the motif M, and the resulting weighted graph W(M).) A sketch for the triangle motif follows below.
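As one concrete case, here is a Julia one-liner for Step 1 with the undirected triangle motif; this is my own sketch, and directed or signed motifs need the motif's own indicator (the paper builds these from the reciprocated and one-way parts of the adjacency matrix), which this does not cover.

    # For the undirected triangle motif: (A*A)_ij counts common neighbors of i
    # and j, so masking with A counts the triangles containing both i and j.
    motif_weights(A::AbstractMatrix) = A .* (A * A)

    A = [0 1 1 0; 1 0 1 1; 1 1 0 1; 0 1 1 0]
    W = motif_weights(A)                   # W[2, 3] == 2: edge {2, 3} is in two triangles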
  18. 18. Motif-based spectral clustering, Step 1 (continued). Key insight: classical spectral clustering on the weighted graph W(M), with W(M)_ij = #{instances of motif M that contain nodes i and j}, finds clusters of low motif conductance. (Figure example: φ_M(S) = (motifs cut) / (motif volume) = 1/10.)
  19. 19. Motif-based spectral clustering, Step 2. Compute the eigenvector f(M) associated with λ2 of the normalized Laplacian matrix of W(M): D = diag(W(M) e), L(M) = D^{-1/2}(D − W(M))D^{-1/2}, L(M) z = λ2 z, f(M) = D^{-1/2} z. Takes roughly O(# edges) time. (Figure: the entries of f(M) on the example graph.) A sketch follows below.
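A minimal Julia sketch of Step 2, not from the talk; it uses a dense eigensolver for clarity (the O(# edges) running time on the slide refers to sparse, iterative solvers) and assumes every node participates in at least one motif so that D is invertible.

    using LinearAlgebra

    # Step 2: D = diag(W e), L(M) = D^{-1/2} (D - W) D^{-1/2}, and
    # f(M) = D^{-1/2} z, where z is the eigenvector for λ2.
    function motif_embedding(W::AbstractMatrix)
        d = vec(sum(W, dims=2))
        Dis = Diagonal(1 ./ sqrt.(d))
        L = Symmetric(Dis * (Diagonal(d) - W) * Dis)
        z = eigen(L).vectors[:, 2]
        return Dis * z                     # f(M)
    end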
  20. 20. Motif-based spectral clustering, Step 3 (motif sweep cut) [Mihail89, Chung92]. § Sort nodes by their values in f(M): σ1, σ2, ..., σn. § Pick the set Sr = {σ1, ..., σr} with the smallest motif conductance. A sketch follows below.
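A minimal Julia sketch of Step 3, again my own; it scores each prefix by its conductance in W, which for a three-node motif coincides with the motif conductance (each cut instance crosses exactly two of its three node pairs, and each instance containing node i adds 2 to i's weighted degree, so the factors of 2 cancel).

    # Step 3: sweep the ordering given by f(M); keep the prefix of smallest
    # conductance measured in the motif weighted graph W.
    function motif_sweep(W::AbstractMatrix, f::AbstractVector)
        n = length(f)
        order = sortperm(f)                # σ1, σ2, ..., σn
        d = vec(sum(W, dims=2))
        bestS, bestphi = Int[], Inf
        for r in 1:n-1
            inS = falses(n); inS[order[1:r]] .= true
            phi = sum(W[inS, .!inS]) / min(sum(d[inS]), sum(d[.!inS]))
            if phi < bestphi
                bestphi, bestS = phi, order[1:r]
            end
        end
        return bestS, bestphi
    end

Chaining the three sketches, motif_sweep(W, motif_embedding(W)) with W = motif_weights(A) is the whole pipeline for the triangle motif.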
  21. 21. Motif Cheeger inequality. Theorem: if the motif has three nodes, then the sweep procedure on the weighted graph finds a set S of nodes whose motif conductance φ_M(S) is within a Cheeger-type square-root factor of the smallest achievable motif conductance, i.e., near-optimal. For motifs on 4+ nodes, a slightly different notion of conductance is needed. Key proof step: let M(G) = {instances of M in G} and encode S by x ∈ {−1, +1}^n with x_i = +1 exactly when i ∈ S. Then cut_M(S, G) = Σ_{{i,j,k} ∈ M(G)} Indicator[x_i, x_j, x_k not all the same] = Σ_{{i,j,k} ∈ M(G)} (1/4)(x_i² + x_j² + x_k² − x_i x_j − x_j x_k − x_i x_k), a quadratic in x. (A check of this identity follows below.)
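A quick check of the identity above (my own verification, assuming the ±1 encoding of S):

    \[
      \mathbf{1}\bigl[x_i, x_j, x_k \text{ not all equal}\bigr]
      = \tfrac{1}{4}\bigl(x_i^2 + x_j^2 + x_k^2 - x_i x_j - x_j x_k - x_i x_k\bigr),
      \qquad x_i \in \{-1, +1\}.
    \]

If the three values agree, every pairwise product equals 1 and the right-hand side is (1/4)(3 − 3) = 0. If they split two against one, exactly one pairwise product equals 1 and the other two equal −1, giving (1/4)(3 − (1 − 1 − 1)) = 1. Summing this quadratic over all {i, j, k} ∈ M(G) therefore counts exactly the motifs cut by S, which is what makes the cut amenable to spectral methods.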
  22. 22. Applications. 1. We do not know the motif of interest: food webs and new applications. 2. We know the motif of interest from domain knowledge: yeast transcription regulation networks, the connectome, social networks. 3. We seek richer information from our data: transportation networks and new applications.
  23. 23. Application 1: We do not know the motif of interest.
  24. 24. Application 1: Food webs. The Florida Bay food web. § Nodes are species. § Edges represent carbon exchange: i → j if j eats i. § Motifs represent energy flow patterns. (Candidate motifs M5 and M6; image credit: http://marinebio.org/oceans/marine-zones/)
  25. 25. Application 1: Food webs. Which motif clusters the food web? Our approach: § Run motif spectral clustering for all 3-node motifs as well as for just edges. § Examine the sweep profiles to see which motif gives the best clusters.
  26. 26. Application 1: Food webs. Our finding: motif M6 organizes the food web into good clusters. (Figure: sweep profiles for M5, M6, and edges.)
  27. 27. Application 1: Food webs. Motif M6 reveals aquatic layers: micronutrient sources, pelagic fishes and benthic prey, benthic macroinvertebrates, and benthic fishes. 61% accuracy vs. 48% with edge-based methods.
  28. 28. Application 2: We know the motif of interest from domain knowledge.
  29. 29. Application 2: Yeast transcription regulation networks. § Nodes are groups of genes. § An edge i → j means i regulates transcription of j. § A sign +/− denotes activation/suppression. § Coherent feedforward loops encode biological function [Mangan+03, Alon07].
  30. 30. Application 2: Yeast transcription regulation networks. Clustering based on coherent feedforward loops identifies functions studied individually by biologists [Mangan+03]: 97% accuracy vs. 68–82% with edge-based methods.
  31. 31. Application 2: Yeast transcription regulation networks. Structure of the found modules (all edge signs are positive).
  32. 32. Application 3: We seek richer information from our data.
  33. 33. Application 3: Transportation networks. § North American air transport network. § Nodes are cities. § i → j if you can travel from i to j in < 8 hours. [Frey-Dueck07]
  34. 34. Application 3: Transportation networks. Important motifs from the literature [Rosvall+2014]; here W(M)_ij = #{bi-directional length-2 paths from i to j}. The weighted adjacency matrix already reveals hub-like structure. A sketch of this construction follows below.
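A minimal Julia sketch of this weighting for a directed 0/1 adjacency matrix, not from the talk; the function name is mine, and zeroing the diagonal (paths that return to their start node) is my own choice.

    using LinearAlgebra

    # B keeps only reciprocated edges; (B*B)_ij then counts bi-directional
    # length-2 paths i <-> k <-> j.
    function bidirectional_path_weights(A::AbstractMatrix)
        B = A .* A'                 # 1 exactly where both i -> j and j -> i exist
        W = B * B
        W[diagind(W)] .= 0          # drop paths that return to the start node
        return W
    end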
  35. 35. Application 3: Transportation networks. (Figure: the motif spectral embedding vs. the edge spectral embedding along the primary spectral coordinate.) The motif embedding separates the top 8 U.S. hubs from East coast and West coast non-hubs, whereas in the edge embedding Atlanta, the top hub, is next to Salina, a non-hub.
  36. 36. Recap: Higher-order graph clustering. § Generalization of graph clustering to higher-order structures (motifs) through a new objective (motif conductance). § Generalizing old ideas from spectral graph theory admits a new algorithm and a motif Cheeger inequality. § Applications in ecology, biology, and transportation.
  37. 37. Higher-order graph clustering. Paper: Benson-Gleich-Leskovec, Higher-order organization of complex networks, Science, 2016. Code + data: http://snap.stanford.edu/higher-order. For fun, reproduce some results from this talk! Jupyter notebooks at bit.ly/higher-order-julia, for example:

    # form W0 … W4
    sc0 = spectral_cut(W0)    # spectral sweep cut on each weighted matrix
    sc1 = spectral_cut(W1)
    sc2 = spectral_cut(W2)
    sc3 = spectral_cut(W3)
    sc4 = spectral_cut(W4)
    # plot each sweep profile's conductance on a log x-axis
    plt = x -> semilogx(x.sweepcut_profile.conductance)
    plt(sc0); plt(sc1); plt(sc2); plt(sc3); plt(sc4)

Find me at http://stanford.edu/~arbenson, @austinbenson, arbenson@stanford.edu. Thanks! Austin Benson, Stanford University.
