Higher-order spectral graph clustering uses motifs to find clusters in networks. It forms a weighted matrix based on a motif (e.g., triangles) and runs spectral clustering on that matrix. This captures clusters defined by the motif better than standard edge-based clustering. The technique has theoretical guarantees via a motif Cheeger inequality. Applications show it outperforms edge methods on food webs, gene networks, and transportation networks. Experiments on stochastic block models suggest the motif weighting improves cluster detection, potentially by making eigenvectors more localized to clusters. Open questions remain around eigenvalue distributions and convergence properties.
Tamara G. Kolda, Distinguished Member of Technical Staff, Sandia National Lab... (MLconf)
Tensor Decomposition: A Mathematical Tool for Data Analysis:
Tensors are multiway arrays, and tensor decompositions are powerful tools for data analysis. In this talk, we demonstrate the wide-ranging utility of the canonical polyadic (CP) tensor decomposition with examples in neuroscience and chemical detection. The CP model is extremely useful for interpretation, as we show with an example in neuroscience. However, it can be difficult to fit to real data for a variety of reasons. We present a novel randomized method for fitting the CP decomposition to dense data that is more scalable and robust than the standard techniques. We further consider the modeling assumptions for fitting tensor decompositions to data and explain alternative strategies for different statistical scenarios, resulting in a _generalized_ CP tensor decomposition.
Bio: Tamara G. Kolda is a member of the Data Science and Cyber Analytics Department at Sandia National Laboratories in Livermore, CA. Her research is generally in the area of computational science and data analysis, with specialties in multilinear algebra and tensor decompositions, graph models and algorithms, data mining, optimization, nonlinear solvers, parallel computing and the design of scientific software. She has received a Presidential Early Career Award for Scientists and Engineers (PECASE), been named a Distinguished Scientist of the Association for Computing Machinery (ACM) and a Fellow of the Society for Industrial and Applied Mathematics (SIAM). She was the winner of an R&D100 award and three best paper prizes at international conferences. She is currently a member of the SIAM Board of Trustees and serves as associate editor for both the SIAM J. Scientific Computing and the SIAM J. Matrix Analysis and Applications.
Recent Advances in Machine Learning: Bringing a New Level of Intelligence to ... (Brocade)
Presentation by Brocade Chief Scientist and Fellow, David Meyer, given at Orange Gardens July 2016. What is Machine Learning and what is all the excitement about?
An associated blog is available here: http://community.brocade.com/t5/CTO-Corner/Networking-Meets-Artificial-Intelligence-A-Glimpse-into-the-Very/ba-p/88196
Spectral clustering with motifs and higher-order structures (David Gleich)
I presented these slides at the #strathna meeting in Glasgow in June 2017. They are an updated and enhanced version of the earlier talks on the subject.
Paper Explained: Understanding the wiring evolution in differentiable neural ... (Devansh16)
Read my Explanation of the Paper here: https://medium.com/@devanshverma425/why-and-how-is-neural-architecture-search-is-biased-778763d03f38?sk=e16a3e54d6c26090a6b28f7420d3f6f7
Abstract: Controversy exists on whether differentiable neural architecture search methods discover wiring topology effectively. To understand how wiring topology evolves, we study the underlying mechanism of several existing differentiable NAS frameworks. Our investigation is motivated by three observed searching patterns of differentiable NAS: 1) they search by growing instead of pruning; 2) wider networks are more preferred than deeper ones; 3) no edges are selected in bi-level optimization. To anatomize these phenomena, we propose a unified view on searching algorithms of existing frameworks, transferring the global optimization to local cost minimization. Based on this reformulation, we conduct empirical and theoretical analyses, revealing implicit inductive biases in the cost's assignment mechanism and evolution dynamics that cause the observed phenomena. These biases indicate strong discrimination towards certain topologies. To this end, we pose questions that future differentiable methods for neural wiring discovery need to confront, hoping to evoke a discussion and rethinking on how much bias has been enforced implicitly in existing NAS methods.
Network Biology: A paradigm for modeling biological complex systems (Ganesh Bagler)
These slides are part of two lectures delivered as part of the 'National Workshop on Network Modelling and Graph Theory' (Dec 14-16, 2017) at the Department of Mathematics, Dibrugarh University, Assam, India.
(1) Network Biology: A paradigm for integrative modeling of biological complex systems -- 14 Dec 2017, 3:30pm
(2) Applications of network modeling in biomedicine -- 15 Dec 2017, 9:00pm
Sponsored by UGC under SAP DRS (II)
(1) Workshop link: https://www.dibru.ac.in/upcoming-events/2981-national-workshop-on-network-modelling-and-graph-theory
(2) The Workshop Flyer: https://www.dibru.ac.in/images/uploaded_files/2017/Nov/National_Workshop_on_Network_Modelling_and_Graph_Theory.pdf
On the Distribution of the Maximal Clique Size for the Vertices in ... (csandit)
The high-level contributions of this paper are as follows: We modify an existing branch-and-bound based exact algorithm (for the maximum clique size of an entire graph) to determine the maximal clique size that the individual vertices in the graph are part of. We then run this algorithm on six real-world network graphs (ranging from random networks to scale-free networks) and analyze the distribution of the maximal clique size of the vertices in these graphs. We observe five of the six real-world network graphs to exhibit a Poisson-style distribution for the maximal clique size of the vertices. We analyze the correlation between the maximal clique size and the clustering coefficient of the vertices, and find these two metrics to be poorly correlated for the real-world network graphs. Finally, we analyze the assortativity index of the vertices of the real-world network graphs and observe the graphs to exhibit positive assortativity with respect to maximal clique size and negative assortativity with respect to node degree; nevertheless, we observe the assortativity index of the real-world network graphs with respect to both the maximal clique size and node degree to increase with decrease in the spectral radius ratio for node degree, indicating a positive correlation between the maximal clique size and node degree.
Networks in Space: Granular Force Networks and Beyond (Mason Porter)
This is my talk for the Network Geometry Workshop (http://ginestra-bianconi-6flt.squarespace.com) at QMUL on 16 July 2015.
(A few of the slides are adapted from slides by my coauthors Dani Bassett and Karen Daniels.)
Use of eigenvalues and eigenvectors to analyze bipartivity of network graphs (csandit)
This paper presents the applications of eigenvalues and eigenvectors (as part of spectral decomposition) to analyze the bipartivity index of graphs as well as to predict the set of vertices that will constitute the two partitions of graphs that are truly bipartite and those that are close to being bipartite. Though the largest eigenvalue and the corresponding eigenvector (called the principal eigenvalue and principal eigenvector) are typically used in the spectral analysis of network graphs, we show that the smallest eigenvalue and the corresponding eigenvector (called the bipartite eigenvalue and the bipartite eigenvector) can be used to predict the bipartite partitions of network graphs. For each of the predictions, we hypothesize an expected partition for the input graph and compare that with the predicted partitions. We also analyze the impact of the number of frustrated edges (edges connecting vertices within a partition) and their location across the two partitions on the bipartivity index. We observe that for a given number of frustrated edges, if the frustrated edges are located in the larger of the two partitions of the bipartite graph (rather than in the smaller partition or equally distributed across the two partitions), the bipartivity index is likely to be relatively larger.
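As a minimal sketch of the core idea in that abstract, assuming a small undirected graph given as a symmetric NumPy adjacency matrix: the sign pattern of the eigenvector of the smallest adjacency eigenvalue predicts the two sides, and the bipartivity index shown is the common spectral one (Estrada-style), which may differ from the exact variant used in the paper. Function names are illustrative.

```python
import numpy as np

def bipartite_partition(A):
    w, V = np.linalg.eigh(A)        # eigenvalues in ascending order
    v = V[:, 0]                     # eigenvector of the smallest eigenvalue
    side1 = np.where(v >= 0)[0]     # sign split predicts the two partitions
    side2 = np.where(v < 0)[0]
    # Estrada-style spectral bipartivity index: equals 1.0 for a truly
    # bipartite graph and decreases as frustrated edges are added.
    b = np.cosh(w).sum() / np.exp(w).sum()
    return side1, side2, b
```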
Some engineering and scientific computer models with high-dimensional input spaces are in fact driven by only a few essential input variables. Identifying these active variables reduces the computation needed to estimate the Gaussian process (GP) model and helps researchers understand the system modeled by the computer simulation. More importantly, reducing the input dimension also increases prediction accuracy, since it alleviates the "curse of dimensionality" problem.
In this talk, we propose a new approach to reduce the input dimension of the Gaussian process model. Specifically, we develop an optimization method that identifies a convex combination of a subset of lower-dimensional kernels, drawn from a large candidate set, to serve as the correlation function of the GP model. To ensure a sparse subset is selected, we add a penalty on the kernel weights. Several numerical examples demonstrate the advantages of the method. The proposed method has many connections with existing methods in the Uncertainty Quantification literature, including active subspace, additive GP, and composite GP models.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ... (Subhajit Sahu)
Abstract: Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
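As a concrete illustration of the decomposition step that abstract describes (not the author's code; names are illustrative), here is a sketch that condenses a directed graph into its strongly connected components and assigns topological levels, which is the structure Levelwise PageRank processes one level at a time:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import connected_components

def levelwise_decomposition(A):
    """SCC condensation + topological levels, for a directed graph given as
    a SciPy sparse adjacency matrix (nonzero A[i, j] means edge i -> j)."""
    n_comp, labels = connected_components(A, directed=True, connection='strong')
    level = np.zeros(n_comp, dtype=int)
    src, dst = A.nonzero()
    # A component's level is one more than the max level of its predecessors.
    # Relax to a fixed point; at most n_comp rounds since the condensation
    # block-graph is acyclic.
    for _ in range(n_comp):
        changed = False
        for u, v in zip(labels[src], labels[dst]):
            if u != v and level[v] < level[u] + 1:
                level[v] = level[u] + 1
                changed = True
        if not changed:
            break
    return labels, level  # process components level by level, in order
```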
Learn SQL from basic queries to advanced queries (manishkhaire30)
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
Higher-order spectral graph clustering with motifs
1. Higher-order spectral graph clustering with motifs
Austin R. Benson
Cornell SCAN seminar, September 18, 2017
Joint work with David Gleich (Purdue), Jure Leskovec (Stanford), & Hao Yin (Stanford)
2. Background. Networks are sets of nodes and edges (graphs) that model real-world systems.
• Brains: nodes are neurons, edges are synapses.
• Social networks: nodes are people, edges are friendships.
• Electrical grid: nodes are power plants, edges are transmission lines. [Image credit: Tim Meko, Washington Post]
• Currency: nodes are accounts, edges are transactions.
3. Just like most other things in life, I often think about networks as matrices. Given a graph G = (V, E) with adjacency matrix A, we often use matrix computations to study relationships in the network: centrality, reachability, clustering, and more.
4. Given an undirected graph G = (V, E), instead of using its symmetric adjacency matrix A, consider using the weighted matrix W = A² ⊙ A, where ⊙ is the Hadamard (element-wise) product.
5. Given a directed graph G = (V, E), instead of using its non-symmetric adjacency matrix A = B + U (B holds the bidirectional edges, U the unidirectional ones), consider using a weighted matrix built from B and U for the motif of interest.
6. What? The matrix counts the number of triangles that edges participate in.
Why? The matrix comes up in our higher-order network clustering framework when using triangles as the motif.
Where? We often get better empirical results in real-world networks and better numerical properties in model partitioning problems (stochastic block model).
7. Background. Networks are sets of nodes and edges (graphs) that model real-world systems.
Key insight [Flake 2000; Newman 2004, 2006; many others…]. Networks for real-world systems have modules, clusters, communities.
• We want algorithms to uncover the clusters automatically.
• The main idea has been to optimize metrics involving the number of nodes and edges in a cluster: conductance, modularity, density, ratio cut, …
[Figures: a co-author network; a brain network, de Reus et al., RSTB, 2014.]
8. Background. Networks are sets of nodes and edges (graphs) that model real-world systems.
Key insight [Milo+ 2002]. Networks modelling real-world systems contain certain small subgraph patterns way more frequently than expected. We call these small subgraph patterns motifs.
• Signed feed-forward loops in genetic transcription [Mangan+ 2003; Alon 2007].
• Triangles in social relationships [Simmel 1908; Rapoport 1953; Granovetter 1973].
• Bi-directed length-2 paths in brain networks [Sporns-Kötter 2004; Sporns+ 2007; Honey+ 2007].
9. Motifs are the fundamental units of complex networks. We should design our clustering algorithms around motifs.
10. Higher-order graph clustering is our technique for finding coherent groups of nodes based on motifs. Different motifs give different clusters. [Figure: the same network clustered with two different motifs.]
11. Background. Graph clustering and conductance.
Conductance is one of the most important cluster quality scores [Schaeffer 2007], used in Markov chain theory, spectral clustering, bioinformatics, vision, etc. The conductance of a set of vertices S is the ratio of edges leaving the set to total edge end points in S:
φ(S) = (# edges leaving S) / (# edge end points in S).
Small conductance ⇒ good cluster.
12. Background. Spectral clustering has theoretical guarantees.
Cheeger inequality [Cheeger 1970; Alon-Milman 1985]. Finding the best-conductance set is NP-hard.
• Cheeger realized that the eigenvalues of the Laplacian provide surface-area-to-volume bounds in manifolds.
• Alon and Milman independently realized the same thing for a graph (conductance)!
For the second-smallest eigenvalue λ₂ of the normalized Laplacian L = I − D^(-1/2) A D^(-1/2), the guarantee takes the standard form λ₂/2 ≤ φ(G) ≤ √(2 λ₂).
13. Background. The sweep cut realizes the guarantee [Mihail 1989; Chung 1992].
We can find a set S that achieves the Cheeger bound.
1. Compute the eigenvector z associated with λ₂ and scale to f = D^(-1/2) z.
2. Sort the vertices by their values in f: σ1, σ2, …, σn.
3. Let Sr = {σ1, …, σr} and compute the conductance φ(Sr) of each Sr.
4. Pick the set Sm with minimum conductance.
[Plot: the sorted sweep-vector values fi against the sweep sets Si.]
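A minimal sketch of this sweep procedure, assuming an undirected graph stored as a symmetric NumPy adjacency matrix (function names like sweep_cut are illustrative, not from any of the referenced codes):

```python
import numpy as np

def conductance(A, S):
    """phi(S) = (# cut edges) / min(vol(S), vol(complement))."""
    cut = A[S][:, ~S].sum()
    return cut / min(A[S].sum(), A[~S].sum())

def sweep_cut(A):
    d = A.sum(axis=1)
    Dinv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(d)) - Dinv_sqrt @ A @ Dinv_sqrt   # normalized Laplacian
    vals, vecs = np.linalg.eigh(L)                   # eigenvalues ascending
    f = Dinv_sqrt @ vecs[:, 1]                       # step 1: f = D^{-1/2} z
    order = np.argsort(f)                            # step 2: sort by f
    best_phi, best_r = np.inf, 1
    S = np.zeros(len(d), dtype=bool)
    for r in range(len(d) - 1):                      # step 3: sweep sets S_r
        S[order[r]] = True
        phi = conductance(A, S)
        if phi < best_phi:
            best_phi, best_r = phi, r + 1
    return order[:best_r], best_phi                  # step 4: best prefix
```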
15. Modern datasets are much richer than the simple networks for which spectral clustering is justified.
Signed feed-forward loops in genetic transcription [Mangan+ 2003]:
• Gene X activates transcription in gene Y.
• Gene X suppresses transcription in gene Z.
• Gene Y suppresses transcription in gene Z.
Spectral clustering is theoretically justified for undirected graphs.
• Various extensions to multiple clusters [Ng-Jordan-Weiss 2002; Dhillon-Guan-Kulis 2004; Lee-Gharan-Trevisan 2014].
• Weighted graphs are okay (weighted cut, weighted volume).
• Approximate eigenvectors are okay [Mihail 1989; Fairbanks+ 2017].
Current network models are more richly annotated: directed, signed, colored, layered, multiplex, higher-order, etc.
16. Our contributions [Benson-Gleich-Leskovec 2016; Yin-Benson-Leskovec-Gleich 2017].
• A generalized conductance metric for motifs.
• A “new” spectral clustering algorithm to minimize the generalized conductance.
• AND an associated motif Cheeger inequality guarantee.
• Several applications: aquatic layers in food webs, functional groups in genetic regulation, hub structure in transportation.
• Extensions to localized clustering.
• New and preliminary! Studies on stochastic block models.
17. Motif conductance. We need new notions of cut and volume. Example: M = triangle motif.
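The definition (reconstructed from Benson-Gleich-Leskovec 2016, since the formula on this slide is an image) has the form

\[
\phi_M(S) \;=\; \frac{\mathrm{cut}_M(S)}{\min\{\mathrm{vol}_M(S),\ \mathrm{vol}_M(\bar{S})\}},
\]

where cut_M(S) counts instances of the motif M with nodes on both sides of the cut, and vol_M(S) counts the motif-instance end points that fall inside S.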
21. Higher-order spectral clustering.
1. Pick your favorite motif.
2. Form the weighted matrix W(M).
3. Compute the Fiedler eigenvector f(M) associated with λ₂ of the normalized Laplacian matrix of W(M).
4. Run a sweep on f(M). [Plot: sweep on f(M).]
(A code sketch of steps 2 and 3 follows the next slide.)
22. There are nice matrix computations for 3-node motifs. (There are faster algorithms to compute these matrices, but this is a useful jumping-off point.)
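For the triangle motif the computation is the A² ⊙ A weighting from earlier in the deck. A minimal sketch of steps 2 and 3, assuming a simple undirected graph as a symmetric 0/1 SciPy sparse matrix (function names are illustrative; step 4 is the same sweep as in the earlier sketch, run on f(M) with the weights W):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def motif_fiedler(A):
    W = (A @ A).multiply(A)          # triangle-count weights, W = A^2 ⊙ A
    d = np.asarray(W.sum(axis=1)).ravel()
    idx = np.where(d > 0)[0]         # drop nodes in no triangle
    W = W[idx][:, idx].tocsr()
    d = d[idx]
    Dinv_sqrt = sp.diags(1.0 / np.sqrt(d))
    L = sp.eye(W.shape[0]) - Dinv_sqrt @ W @ Dinv_sqrt
    vals, vecs = eigsh(L, k=2, which="SM")          # two smallest eigenpairs
    f = Dinv_sqrt @ vecs[:, np.argsort(vals)[1]]    # Fiedler vector f(M)
    return f, idx
```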
24. The three-node motif Cheeger inequality.
Theorem. If the motif has three nodes, then the sweep procedure on the weighted graph finds a set S of nodes whose motif conductance is within a quadratic factor of optimal: φ_M(S) ≤ 4 √(φ_M*). [Key proof step shown on the slide.]
Implication. Just run spectral clustering with the weighted matrix coming from your favorite motif.
25. Awesome advantages!
• Works for arbitrary non-negative combinations of motifs, and for weighted motifs too.
• We inherit 40+ years of research!
• Fast algorithms (ARPACK, etc.)
• Local methods! [Yin-Benson-Leskovec-Gleich 2017]
• Overlapping! via [Whang-Dhillon-Gleich 2015]
• Easy to implement.
• Scalable (graphs with 1.4B edges are not a problem).
bit.ly/motif-spectral-julia
26. Lots of fun applications.
1. We do not know the motif of interest: food webs and new applications.
2. We know the motif of interest from domain knowledge: yeast transcription regulation networks, connectome, social networks.
3. We seek richer information from our data: transportation networks and new applications.
27. Application 1. Food webs.
Florida Bay food web [image: http://marinebio.org/oceans/marine-zones/]
• Nodes are species.
• Edges represent carbon exchange: i → j if j eats i.
• Motifs represent energy flow patterns.
28. Application 1. Food webs.
Which motif clusters the food web? Our approach:
• Run motif spectral clustering for all 3-node motifs as well as for just edges.
• Look at sweep cuts to see which motif gives the best clusters.
31. Application 2. Yeast transcription regulation networks.
• Nodes are groups of genes.
• Edge i → j means i regulates the transcription of j.
• Sign + / − denotes activation / suppression.
• Coherent feedforward loops encode biological function [Mangan-Zaslaver-Alon 2003; Mangan-Alon 2003; Alon 2007].
32. Application 2. Yeast transcription regulation networks.
Clustering based on coherent feedforward loops identifies functions studied individually by biologists [Mangan+ 2003]: 97% accuracy vs. 68–82% with edge-based methods.
33. Application 3. Transportation networks.
• North American air transport network [Frey-Dueck 2007].
• Nodes are cities.
• i → j if you can travel from i to j in < 8 hours.
34. Application 3. Transportation networks.
Important motifs from the literature [Rosvall+ 2014]. The weighted adjacency matrix already reveals hub-like structure.
35. Application 3. Transportation networks.
[Figure: edge spectral embedding vs. motif spectral embedding along the primary spectral coordinate. In the edge embedding, Atlanta, the top hub, is next to Salina, a non-hub; the motif embedding separates the top 8 U.S. hubs from the East coast and West coast non-hubs.]
36. A major benefit of our theory is that it easily generalizes to related problems.
In localized clustering, we aim to find a small set of nodes of low conductance containing a particular seed node, without looking at the entire graph. [Figure: a seed node and its local cluster.]
37. Background. Localized clustering works in theory and practice with the (fast) personalized PageRank method.
Random walk procedure:
• With probability a, hop to a random neighbor.
• With probability 1 − a, jump back to the seed node.
• The stationary distribution tells us what is “close” to the seed.
If the seed is in a good cluster, the stationary distribution is localized, and we can approximate it in sublinear time using results of [Andersen-Chung-Lang 2006]. A sweep cut on the approximate vector x then extracts the cluster.
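A minimal sketch of this walk's stationary distribution via plain power iteration, assuming a symmetric NumPy adjacency matrix (this is not the sublinear push method of Andersen-Chung-Lang 2006; names are illustrative):

```python
import numpy as np

def personalized_pagerank(A, seed, a=0.85, tol=1e-10):
    n = A.shape[0]
    P = A / A.sum(axis=0, keepdims=True)    # column-stochastic transition
    e = np.zeros(n)
    e[seed] = 1.0
    x = e.copy()
    while True:
        x_new = a * (P @ x) + (1 - a) * e   # hop w.p. a, restart w.p. 1 - a
        if np.abs(x_new - x).sum() < tol:
            return x_new
        x = x_new
# A sweep cut over vertices sorted by x[i] / degree(i) extracts the local cluster.
```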
38. Higher-order localized clustering [Yin-Benson-Leskovec-Gleich 2017].
We use our same trick and generalize the theory to triangles (and other motifs). Now the random walks look like this:
• With probability a, hop to a random adjacent triangle and then to a random endpoint of that triangle (not the one we came from).
• With probability 1 − a, go back to the seed node.
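Since this triangle walk amounts to an ordinary random walk on the triangle-weighted graph (an assumption consistent with the weighting used earlier in the deck), a sketch only needs to swap the matrix fed into the PPR iteration, reusing personalized_pagerank from the previous sketch:

```python
def motif_personalized_pagerank(A, seed, a=0.85):
    W = (A @ A) * A   # W[i, j] = number of triangles through edge (i, j)
    return personalized_pagerank(W, seed, a=a)
```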
39. Using the triangle motif in higher-order localized clustering better captures community structure in email data.
Nodes are people at a research institute; each person is in a department. Edges are email correspondence. Using each person as a seed, can we recover their department? Average F1: 0.40 with edges vs. 0.50 with triangles.
40. What? The matrix counts the number of triangles that edges participate in.
Why? The matrix comes up in our higher-order network clustering framework when using triangles as the motif.
Where? We often get better empirical results in real-world networks and better numerical properties in model partitioning problems (stochastic block model).
41. The stochastic block model is a standard model used to study recovery of planted cluster structure.
The symmetric stochastic block model (SSBM):
• k blocks, each with m nodes (m-by-m diagonal blocks of the adjacency matrix).
• Within-block edges exist with probability p.
• Between-block edges exist with probability q.
[Figure: SSBM adjacency matrix with m = 200, k = 5, p = 0.3, q = 0.13.]
The task. Given a graph that is an SSBM, and given m, k, p, q, find the k blocks.
Theory (lots!). See the in-preparation survey “Community Detection and the Stochastic Block Model” by E. Abbe.
• Exact recovery (get all blocks correct).
• Detectability (find a non-trivial portion).
• Uses non-backtracking random walks.
42. We are currently looking at the SSBM using our motif weighting based on triangles, based also on Tsourakakis, Pachocki, & Mitzenmacher, WWW, 2017. The upshot: the triangle conductance of a block is less than the edge conductance of the block.
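A small empirical check of this claim, under stated assumptions: the sampling code is illustrative (not the paper's), and the "triangle conductance" below is conductance on the triangle-weighted graph W = A² ⊙ A, the quantity the spectral method controls, which can differ slightly from the exact motif conductance.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_ssbm(m, k, p, q):
    n = m * k
    blocks = np.repeat(np.arange(k), m)
    probs = np.where(blocks[:, None] == blocks[None, :], p, q)
    A = (rng.random((n, n)) < probs).astype(float)
    A = np.triu(A, 1)
    A = A + A.T                       # symmetric, no self-loops
    return A, blocks

def weighted_conductance(W, S):
    cut = W[S][:, ~S].sum()
    return cut / min(W[S].sum(), W[~S].sum())

A, blocks = sample_ssbm(m=50, k=10, p=0.2, q=0.05)
S = blocks == 0                       # one planted block
W = (A @ A) * A                       # triangle-weighted graph
print(weighted_conductance(A, S),     # edge conductance of the block
      weighted_conductance(W, S))     # triangle conductance: expected smaller
```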
43. From the eyeball test, just using the motif weighting highlights the blocks better for a range of parameters.
We use a mixing parameter, thought of as the expected fraction of neighbors outside of a block.
Experiment details. Randomly sample SSBM(50, 10, p=0.2, q) for varying q. Plot the 2D histogram (density).
44. The power method identifies a cluster better using the motif weighting than the adjacency.
[Plots: accuracy vs. iteration, with the detectability and exact-recovery thresholds marked.]
Experiment details. We take the normalized Laplacian and shift to reverse the spectrum. Then we deflate, given knowledge of the leading eigenvector. Accuracy is the most accurate block in the extremal m (size of block) entries. Note: the threshold is for all-block recovery.
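A minimal sketch of the experiment as described, under stated assumptions: M = 2I − L reverses the spectrum of the normalized Laplacian L (whose eigenvalues lie in [0, 2]), and the known leading eigenvector is D^(1/2)·1. The same routine can be fed either the adjacency A or the triangle weighting W = A² ⊙ A.

```python
import numpy as np

def deflated_power_method(A, iters=200):
    d = A.sum(axis=1)
    Dinv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(d)) - Dinv_sqrt @ A @ Dinv_sqrt
    M = 2.0 * np.eye(len(d)) - L              # reverse the spectrum of L
    v1 = np.sqrt(d)
    v1 /= np.linalg.norm(v1)                  # known leading eigenvector D^{1/2} 1
    x = np.random.default_rng(1).standard_normal(len(d))
    for _ in range(iters):
        x = M @ x
        x -= (v1 @ x) * v1                    # deflation: project out v1
        x /= np.linalg.norm(x)
    return Dinv_sqrt @ x                      # scaled like the sweep vector f
```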
45. The power method identifies a cluster better using the motif weighting than the adjacency.
[Figure legend: the most accurate block; nodes whose value is greater than the smallest green node value; nodes whose value is smaller than the smallest green node value.]
46. We don’t converge faster for the usual reasons that we would think of the power method converging faster.
47. There is a gap deeper in the spectrum that could explain what is going on.
48. The motif weighting shifts all eigenvalues down, but it pushes the lowest ones down the most. The bulk spectrum follows a semi-circle law (not a Marchenko-Pastur law).
49. We would like a numerical explanation for why we get better results with motifs.
• The eigenvalues show that we’ll converge to the “cluster subspace” faster.
• Conjecture. Higher accuracy for motifs because the eigenvectors we converge to are more localized, or sharper, around the clusters.
• But we don’t know why they are sharper!
[Annotation: top non-trivial eigenvectors of the normalized Laplacian fighting less with the motif weighting?]
50. The iterative algorithm for personalized PageRank tends to localize more.
[Figure legend: nodes whose value is greater than the smallest node value in the seed’s block; the rest of the nodes.]
51. Thanks!
Austin Benson, http://cs.cornell.edu/~arb, @austinbenson, arb@cs.cornell.edu
Papers
• “Higher-order organization of complex networks.” Benson, Gleich, and Leskovec. Science, 2016.
• “Local higher-order graph clustering.” Yin, Benson, Leskovec, and Gleich. KDD, 2017.
1. A generalized conductance metric for motifs.
2. A “new” spectral clustering algorithm to minimize the generalized conductance.
3. AND an associated motif Cheeger inequality guarantee.
4. Extensions to localized clustering.
5. Applications with food webs, genetic regulation networks, transportation systems, social networks.
6. Eigenvalues with motifs in SSBMs.
Open questions
• What is the distribution law for the eigenvalues of the Laplacian of A² ⊙ A?
• What is the relationship between convergence of the power method and localization?
• How can we work with element-wise products like A² ⊙ A using only matvecs?
Code & Data
snap.stanford.edu/higher-order
github.com/arbenson/higher-order-organization-julia
bit.ly/motif-spectral-julia
github.com/dgleich/motif-ssbm
Editor's Notes
What is this weighted matrix?
Why should I look at it?
How do you do this?