Simplicial closure and higher-order link prediction
1. Simplicial closure and higher-order link prediction
Austin R. Benson · Cornell
LA/OPT Seminar · April 26, 2018
LA/OPT Seminar 1
Code: bit.ly/sc-holp-code · Data: bit.ly/sc-holp-data
Slides: bit.ly/arb-laopt-2018 · Paper: arXiv:1802.06916
Joint work with
Rediet Abebe & Jon Kleinberg
(Cornell)
Michael Schaub & Ali Jadbabaie (MIT)
2. Networks are sets of nodes and edges (graphs)
that model real-world systems.
Collaboration: nodes are people/groups; edges link entities working together.
Communications: nodes are people/accounts; edges show information exchange.
Physical proximity: nodes are people/animals; edges link those that interact in close proximity.
Drug compounds: nodes are substances; an edge joins substances that appear in the same drug.
3. Real-world systems are often composed from
“higher-order” interactions that we reduce to
pairwise ones.
Collaboration: nodes are people/groups; teams are made up of small groups.
Communications: nodes are people/accounts; emails often have several recipients, not just one.
Physical proximity: nodes are people/animals; people often gather in small groups.
Drug compounds: nodes are substances; drugs are made up of several substances.
4. There are many ways to mathematically represent
the higher-order structure present in relational
data.
• Hypergraphs [Berge 89]
• Set systems [Frankl 95]
• Affiliation networks [Feld 81, Newman-Watts-Strogatz 02]
• Abstract simplicial complexes [Lim 15, Osting-Palande-Wang 17]
• Multilayer networks [Kivelä+ 14, Boccaletti+ 14, many others…]
• Meta-paths [Sun-Han 12]
• Motif-based representations [Benson-Gleich-Leskovec 15, 17]
• …
5.
However, there are problems with higher-order data representations…
1. Researchers and practitioners often don’t use it.
Graphs are easier and we already have lots of graph analysis tools.
2. We don’t have good frameworks for evaluating them
(especially for what they can do beyond graphs).
A goal with this research is to provide a framework through
which higher-order models can be evaluated.
6. Link prediction is a classical machine learning
problem in network science, which is used to
evaluate models.
We observe data consisting of a list of edges in a graph up to some time t.
We want to predict which new edges will form in the future.
Shows up in a variety of applications
• Predicting new social relationships and friend recommendation.
[Backstrom-Leskovec 11; Wang+ 15]
• Inferring new links between genes and diseases.
[Wang-Gulbahce-Yu 11; Moreau-Tranchevent 12]
• Suggesting novel connections in the scientific community.
[Liben-Nowell-Kleinberg 07; Tang-Wu-Sun-Su 12]
Also useful as a framework to evaluate new methods!
[Liben-Nowell-Kleinberg 07; Lü-Zhou 11]
7. We propose “higher-order link prediction” as a
similar framework for evaluation of higher-order
models.
Data.
Observe simplices up to some
time t. Using this data, want to
predict what groups of > 2
nodes will appear in a simplex
in the future.
[Figure: timeline of simplices appearing among nodes 1–9 up to time t.]
We predict structure that classical link
prediction would not even consider!
Possible applications
• Novel combinations of drugs for treatments.
• Group chat recommendation in social networks.
• Team formation.
8. Thinking of higher-order data as a weighted
projected graph with “filled in” structures is a
convenient viewpoint.
Data. Pictures to have in mind. [Figure: simplices on nodes 1–9 and the corresponding weighted projected graph.]
9. Generalized means of edge weights are often
good predictors of new 3-node simplices
appearing.
Good performance from this local information is a deviation from classical link prediction,
where methods that use long paths (e.g., PageRank) perform well [Liben-Nowell & Kleinberg
07].
For structures on k nodes, the subsets of size k-1 contain rich information only when k > 2.
[Figure: triangle on nodes i, j, k with edge weights Wij, Wjk, Wik; will the open triangle close?]
10.
Three parts of this talk.
1. Basic study of datasets.
How diverse are the types of datasets that we encounter?
2. Simplicial closure.
How do we characterize the ways in which new simplices appear?
3. Higher-order link prediction.
How do we use insights to predict which new simplices appear?
11. Our datasets are timestamped simplices, where
each simplex is a subset of nodes.
1. Co-authorship in different domains.
2. Emails with multiple recipients.
3. Tags on Q&A forums.
4. Threads on Q&A forums.
5. Contact/proximity measurements.
6. Musical artist collaboration.
7. Substance makeup and classification
codes applied to drugs the FDA
examines.
8. U.S. Congress committee
memberships and bill sponsorship.
9. Combinations of drugs seen in
patients in ER visits.
12. Thinking of higher-order data as a weighted
projected graph with “filled in” structures is a
convenient viewpoint.
13.
Warm-up. What’s more common — an open triangle or a closed triangle?
“Open triangle”: each pair has been in a simplex together, but all 3 nodes have never been in the same simplex.
“Closed triangle”: there is some simplex that contains all 3 nodes.
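The open/closed distinction above can be computed directly from a list of simplices. A minimal sketch (brute force over node triples, so only suitable for small examples; real datasets would need proper clique enumeration):

```python
from itertools import combinations

def triangle_census(simplices):
    """Count (open, closed) triangles given a list of simplices.

    A triangle {i, j, k} is *closed* if some simplex contains all three
    nodes, and *open* if each pair has co-appeared in a simplex but no
    simplex contains all three.
    """
    edges = set()    # pairs that co-appear in some simplex
    closed = set()   # triples contained in some simplex
    for s in simplices:
        s = sorted(set(s))
        edges.update(combinations(s, 2))
        closed.update(combinations(s, 3))
    nodes = sorted({v for e in edges for v in e})
    open_tris = {
        tri for tri in combinations(nodes, 3)
        if tri not in closed
        and all(pair in edges for pair in combinations(tri, 2))
    }
    return len(open_tris), len(closed)
```

For example, three pairwise simplices `[(1, 2), (2, 3), (1, 3)]` yield one open triangle, while a single simplex `[(1, 2, 3)]` yields one closed triangle.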
14. There is lots of variation in the fraction of triangles
that are open, but datasets from the same domain
are similar.
15. Most open triangles do not come from
asynchronous temporal behavior.
In 61.1% to 97.4% of open triangles, all three pairs of edges have pairwise-overlapping periods of activity ⟶ by Helly’s theorem, there is a single period of activity shared by all 3 edges.
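The Helly argument is easy to check in code: in one dimension, if every pair of closed intervals overlaps, then all of them share a common point. A small illustration (the interval endpoints are made-up activity periods):

```python
def pairwise_overlap(intervals):
    """True if every pair of closed intervals [a, b] intersects."""
    return all(a0 <= b1 and b0 <= a1
               for n, (a0, a1) in enumerate(intervals)
               for (b0, b1) in intervals[n + 1:])

def common_overlap(intervals):
    """True if all intervals share a common point (Helly's theorem in 1-D)."""
    return max(a for a, _ in intervals) <= min(b for _, b in intervals)
```

With activity periods `[(0, 5), (3, 9), (4, 6)]`, every pair overlaps, so a common window exists (here, [4, 5]).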
16. A simple model can account for open triangle
variation.
Consider a dataset with n nodes.
Form a 3-node simplex between nodes i, j, k with probability p = 1/n^b, i.i.d.
Thus, we always get Θ(p·n³) = Θ(n^(3−b)) closed triangles in expectation.
Proposition (rough).
If b < 1, then we get Θ(n³) open triangles in expectation for large n.
If b > 1, then we get Θ(n^(3(2−b))) open triangles in expectation for large n.
⟶ The number of open triangles grows faster than the number of closed triangles when b < 3/2.
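The model is simple enough to simulate for small n and watch the open/closed counts shift with b (a Monte Carlo sketch; n, b, and the seed are illustrative):

```python
import random
from itertools import combinations

def simulate_triangles(n, b, seed=0):
    """Each 3-node set becomes a simplex i.i.d. with p = 1/n^b.
    Returns (open, closed) triangle counts in the projected graph."""
    rng = random.Random(seed)
    p = n ** (-b)
    closed, edges = set(), set()
    for tri in combinations(range(n), 3):
        if rng.random() < p:
            closed.add(tri)                       # closed triangle
            edges.update(combinations(tri, 2))    # projected edges
    # open triangles: all three projected edges exist, but no simplex
    open_count = sum(
        1 for tri in combinations(range(n), 3)
        if tri not in closed
        and all(pair in edges for pair in combinations(tri, 2))
    )
    return open_count, len(closed)
```

Sweeping b over a grid (as on the next slide) and plotting the two counts reproduces the regimes in the proposition.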
17. A simple model can account for open triangle
variation.
Consider a dataset with n nodes.
Form a 3-node simplex between nodes i, j, k with probability p = 1/n^b, i.i.d.
b = 0.8, 0.82, 0.84, …, 1.8; larger b is a darker marker.
18. Summary of structure of datasets.
1. Datasets with higher-order structure are common.
2. It’s useful to think of them as “projected graphs” with additional filled
in structure (like a simplicial complex).
3. Perhaps surprisingly, several datasets have many open triangles.
Not due to asynchronous temporal behavior.
A simple model can give a range of open triangles.
19.
Three parts of this talk.
1. Basic study of datasets.
How diverse are the types of datasets that we encounter?
2. Simplicial closure.
How do we characterize the ways in which new simplices appear?
3. Higher-order link prediction.
How do we use insights to predict which new simplices appear?
21. Simplicial closure is the first appearance of a group of nodes together in a simplex.
[Figure: closure trajectories of substances in marketed drugs recorded in the National Drug Code directory. Example: HIV protease inhibitors, UGT1A1 inhibitors, and breast cancer resistance protein inhibitors appearing together across Reyataz (RedPharm, 2003; Squibb & Sons, 2003), Kaletra (Physicians Total Care, 2006; DOH Central Pharmacy, 2009), Promacta (GSK 25mg and 50mg, 2008), and Evotaz (Squibb & Sons, 2015).]
[Figure: closure trajectories of tags on askubuntu.com forum questions. Example: the tags “icons”, “colors”, and “16.04” appearing together across questions from 2011 to 2016.]
22. Simplicial closure probability on 3 nodes largely
depends on the weights in the projected network.
Take first 80% of the data (in time), record the configuration of every triple of
nodes, and compute the fraction that simplicially close in the final 20% of the data.
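The 80/20 measurement can be sketched as follows (the timestamped-simplex format and the brute-force triple enumeration are illustrative assumptions, not the paper's actual pipeline):

```python
from itertools import combinations

def closure_fraction(timed_simplices, split=0.8):
    """Fraction of triangles that are open at the split time and
    simplicially close in the remaining data.
    timed_simplices: list of (time, node-tuple) pairs."""
    timed = sorted(timed_simplices)
    cut = int(split * len(timed))
    train, test = timed[:cut], timed[cut:]
    edges, closed = set(), set()
    for _, s in train:
        s = sorted(set(s))
        edges.update(combinations(s, 2))
        closed.update(combinations(s, 3))
    nodes = sorted({v for e in edges for v in e})
    open_tris = {
        t for t in combinations(nodes, 3)
        if t not in closed
        and all(p in edges for p in combinations(t, 2))
    }
    later = set()
    for _, s in test:
        later.update(combinations(sorted(set(s)), 3))
    return len(open_tris & later) / len(open_tris) if open_tris else 0.0
```

Conditioning on finer configurations (edge weights, which pairs are strong ties) gives the per-configuration closure probabilities described below.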
Increased edge density increases closure probability.
Increased tie strength increases closure probability.
Tension between edge density and tie strength.
The left and middle observations are consistent with theory and empirical studies of social networks. [Granovetter 73; Leskovec+ 08; Backstrom+ 06; Kossinets-Watts 06]
23. Simplicial closure probability on 4 nodes behaves like the 3-node case, just “up one dimension”.
Take the first 80% of the data (in time), record the configuration of every 4 nodes, and compute the fraction that simplicially close in the final 20% of the data.
Increased edge density increases closure probability.
Increased simplicial tie strength increases closure probability.
Tension between simplicial density and simplicial tie strength.
24. Computing simplicial closure probabilities requires some clever counting algorithms.
How many sets of 4 nodes induce a given structure?
Induced count = non-induced count minus appropriately weighted counts of the denser configurations that contain it (an inclusion–exclusion over configurations).
Assume we can enumerate 3- and 4-cliques and quickly determine their “simplicial structure”.
25. Summary of simplicial closure.
1. Useful to think of “trajectories” of sets of nodes in the projected graph.
2. More edges between a group of nodes ⟶ more likely to close in the future.
True for both 3-node and 4-node groups!
3. Stronger ties between groups of nodes ⟶ also more likely to close.
True for both 3-node and 4-node groups!
26.
Three parts of this talk.
1. Basic study of datasets.
How diverse are the types of datasets that we encounter?
2. Simplicial closure.
How do we characterize the ways in which new simplices appear?
3. Higher-order link prediction.
How do we use insights to predict which new simplices appear?
27. We propose “higher-order link prediction” as a
similar framework for evaluation of higher-order
models.
Data.
Observe simplices up to some
time t. Using this data, want to
predict what groups of > 2
nodes will appear in a simplex
in the future.
We predict structure that classical link
prediction would not even consider!
28.
Our structural analysis tells us what we should be
looking at for prediction.
1. Edge density is a positive indicator.
⟶ focus our attention on predicting “open triangles”
2. Tie strength is a positive indicator.
⟶ various ways of incorporating this information
29.
For every open triangle, we assign a score
function on first 80% of data based on structural
properties.
Four broad classes of score functions for
an open triangle. Score is s(i, j, k).
s(i, j, k)…
1. is a function of Wij, Wjk, Wik
2. is a function of A[:, [i, j, k]] and the
simplices containing these nodes
3. is built from “whole-network”
similarity scores on edges
4. is “learned” from data
After computing scores, predict that open triangles with highest
scores will simplicially close in final 20% of data.
30.
1. s(i, j, k) is a function of Wij, Wjk, and Wik
1. Arithmetic mean
2. Geometric mean
3. Harmonic mean
4. Generalized mean
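All four of these are instances of the power (generalized) mean, where p = 1 gives the arithmetic mean, p → 0 the geometric mean, and p = −1 the harmonic mean. A sketch:

```python
import math

def generalized_mean(weights, p):
    """Power mean of positive edge weights; p = 1 arithmetic,
    p -> 0 geometric, p = -1 harmonic."""
    if p == 0:  # limit case: geometric mean
        return math.exp(sum(math.log(w) for w in weights) / len(weights))
    return (sum(w ** p for w in weights) / len(weights)) ** (1 / p)

def score(w_ij, w_jk, w_ik, p=0):
    """s(i, j, k) as a generalized mean of the three edge weights."""
    return generalized_mean([w_ij, w_jk, w_ik], p)
```

Slide 36 reports that a mean between harmonic and geometric (p between −1 and 0) is consistently the best performer.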
31.
2. s(i, j, k) is a function of A[:, [i, j, k]] and the related simplicial information
1. Common 4-th neighbors
2. Generalized Jaccard coefficient
3. Preferential attachment (graph)
4. Preferential attachment (simplices)
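Two of these scores, sketched with an adjacency-set representation of the projected graph (the dict-of-sets format is an assumption for illustration):

```python
def common_fourth_neighbors(adj, i, j, k):
    """Number of nodes adjacent to all of i, j, k in the projected graph."""
    return len((adj[i] & adj[j] & adj[k]) - {i, j, k})

def generalized_jaccard(adj, i, j, k):
    """Common neighbors of i, j, k normalized by their neighborhood union."""
    inter = (adj[i] & adj[j] & adj[k]) - {i, j, k}
    union = (adj[i] | adj[j] | adj[k]) - {i, j, k}
    return len(inter) / len(union) if union else 0.0
```

Both generalize familiar pairwise link-prediction scores (common neighbors, Jaccard) from an edge to an open triangle.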
32.
3. s(i, j, k) is built from “whole-network” similarity scores on edges: s(i, j, k) = Sij + Sji + Sjk + Skj + Sik + Ski
1. PageRank (unweighted or
weighted)
2. Katz (unweighted or weighted)
3. Fancier topological methods based
on lifted random walks
(work in progress…)
33.
Hopefully Michael chimed in on matrix inverses…
Want to reduce computational cost (we only need scores on open triangles).
For each node i that participates in an open triangle:
1. Solve the corresponding linear system for column i (for Katz, (I − αW)x = eᵢ) using GMRES with a low tolerance.
2. Store the ith column of the solution.
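A sketch of the per-column idea: approximate column i of (I − αA)⁻¹ with a truncated Neumann series (used here in place of the GMRES solve on the slide; α and the iteration count are illustrative, and convergence requires α below the reciprocal of the largest eigenvalue of A):

```python
def katz_column(adj, n, i, alpha=0.1, iters=50):
    """Approximate the ith column of (I - alpha*A)^{-1} via the
    Neumann series sum_k alpha^k A^k e_i.  adj: dict node -> neighbors."""
    x = [0.0] * n
    x[i] = 1.0
    col = x[:]          # accumulates the series, starting at e_i
    for _ in range(iters):
        # one step: x <- alpha * (A x)
        x = [alpha * sum(x[u] for u in adj.get(v, ())) for v in range(n)]
        col = [c + t for c, t in zip(col, x)]
    return col
```

The Katz score of an open triangle then sums the six ordered entries Sij + Sji + Sjk + Skj + Sik + Ski from the computed columns, so only columns for nodes in open triangles are ever needed.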
34.
4. s(i, j, k) is learned from data
1. Split data into training and validation sets.
2. Compute features of (i, j, k) from previous ideas using training data.
3. Throw features + validation labels into machine learning blender
→ learn model.
4. Re-compute features on combined training + validation
→ apply model on the data.
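The “blender” step can be any supervised learner; a tiny stand-in with a single feature and hand-rolled logistic regression (the feature choice, learning rate, and step count are all illustrative):

```python
import math

def fit_logistic(xs, ys, lr=0.5, steps=500):
    """One-feature logistic regression fit by batch gradient descent.
    xs: feature values (e.g., a mean of edge weights); ys: 0/1 closure labels."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        gw = gb = 0.0
        for x, t in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            gw += (p - t) * x
            gb += (p - t)
        w -= lr * gw / len(xs)
        b -= lr * gb / len(xs)
    return w, b

def predict(w, b, x):
    """Probability that an open triangle with feature x closes."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))
```

In practice one would stack many of the earlier score functions as features and use an off-the-shelf learner, exactly as the four steps above describe.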
36.
A few lessons learned from applying all of these
ideas.
1. We can predict pretty well on all datasets using some method.
→ 4x to 107x better than random w.r.t. mean average precision, depending on the dataset/method
2. On some classes of datasets, we can consistently predict well.
→ thread co-participation and co-tagging on stack exchange
3. Simply averaging Wij, Wjk, and Wik consistently performs well.
→ something between harmonic and geometric mean is
consistently best
37. Generalized means of edge weights are often
good predictors of new 3-node simplices
appearing.
Good performance from this local information is a deviation from classical link prediction,
where methods that use long paths (e.g., PageRank) perform well [Liben-Nowell & Kleinberg
07].
For structures on k nodes, the subsets of size k-1 contain rich information only when k > 2.
38.
Shameless plug for SIAM ALA 2018!
David Gleich and I are organizing a minitutorial.
Tensor Eigenvectors and Stochastic Processes
May 6, 10:45am – 12:45pm
Learn about how stochastics help us understand
problems in numerical multilinear algebra!
It’s a minitutorial—little background is assumed.
Lots of open problems for ICME students to solve.
39. Simplicial closure and higher-order link prediction.
Paper: arXiv:1802.06916 · Code: bit.ly/sc-holp-code
Data: bit.ly/sc-holp-data · Slides: bit.ly/arb-laopt-2018
THANKS!
http://cs.cornell.edu/~arb
@austinbenson
arb@cs.cornell.edu
1. We want higher-order models to
be pervasive.
2. Higher-order link prediction is a
way to compare models.
3. Have a better model or
algorithm? Great! Show us on
the data.
4. Lots of interesting structural
information in these datasets.