-
-
-
-
-
Social network
Human Disease Network
[Barabasi 2007]
Food Web [2007]
Terrorist Network
[Krebs 2002]Internet (AS) [2005]
Gene Regulatory Network
[Decourty 2008]
Protein Interactions
[breast cancer]
Political blogs
Power grid
(1) Motif Counting
Theory and Algorithms for Large Graphs
(2) Machine Learning Applications
for Motif Counting
§ Network motifs (graphlets) are small induced subgraphs
§ Motifs represent patterns of inter-connections, basic
elements in complex networks
CliqueTriangle CycleEdge
A study of motif frequencies in real-world complex networks
Applied to food web, genetic, neural, web, and other network
- Found distinct motifs in each case
Red dashed lines indicate edges that participate
in a feed-forward loop
Motifs are over-expressed
in real complex networks
Motifs occur in real networks
with frequencies
significantly higher than
randomly generated networks
Real networks often have
modular structure
Cj#
Ci#
Ck#
Concentration of Motifs
in real vs random networks
Two Conclusions
Nodes/edges are not the fundamental units of real networks
Motifs are the building blocks of these networks
We should analyze/model graphs in terms of motifs
Network Motifs: Simple Building Blocks of Complex Networks – [Milo et. al – Science 2002]
The Structure and Function of Complex Networks – [Newman – Siam Review 2003]
2-node
Graphlets
3-node
Graphlets
4-node
Graphlets
Connected
Disconnected
Motifs/graphlets are Small k-vertex induced subgraphs
Ex: Given an input graph G
- How many triangles in G?
- How many cliques of size 4-nodes in G?
- How many cycles of size 4-nodes in G?
Ex: Given an input graph G
- How many triangles in G?
- How many cliques of size 4-nodes in G?
- How many cycles of size 4-nodes in G?
à In practice, we would like to count all k-vertex graphlets
§ Enumerate all possible graphlets
à Exhaustive enumeration is too expensive
§ Enumerate all possible graphlets
à Exhaustive enumeration is too expensive
§ Count graphlets for each node – and combine all node counts
[Shervashidze et. al – AISTAT 2009]
§ Enumerate all possible graphlets
à Exhaustive enumeration is too expensive
§ Count graphlets for each node – and combine all node counts
à Still expensive for relatively large k [Shervashidze et. al – AISTAT 2009]
§ Enumerate all possible graphlets
à Exhaustive enumeration is too expensive
§ Count graphlets for each node – and combine all node counts
à Still expensive for relatively large k [Shervashidze et. al – AISTAT 2009]
§ Other recent work counts only connected graphlets of size k=4
[Marcus & Shavitt – Computer Networks 2012]
Not practical – scales only for small graphs with few
hundred/thousand nodes/edges
- taking hours for a graph with 30K nodes
Most work focused on graphlets of k=3 nodes
In this work, we focus on graphlets of k=3,4 nodes
Efficient Graphlet Counting for Large Networks
[Ahmed et al., ICDM 2015]
Graphlet Decomposition: Framework, Algorithms, and Applications
[Ahmed et al., KAIS Journal 2016]
Edge-centric, Parallel, Fast, Space-efficient Framework
u v
v2 v3v1 v4
v6 v7
edge
Edge Neighborhood
Jointly count motifs
Searching Edge
Neighborhoods
① For each edge do
• Count All 3-node graphlets
② Merge counts from all edges
u v
v2 v3v1 v4
v6 v7
edge
Triangle 2-star 1-edge Independent
Searching Edge
Neighborhoods
① For each edge do
• Count All 3-node graphlets
② Merge counts from all edges
u v
v2 v3v1 V4
v6 v7
edge
Triangle 2-star 1-edge Independent
q We only need to
find/count triangles
q Use equations to get
counts of others in o(1)
Triangle
How to count all 4-node graphlets?
4-Clique 4-Cycle4-Chrodal-Cycle Tailed-triangle 4-Path 3-Star
4-node-triangle 4-node-2star 4-node-2edge 4-node-1edge Independent
Step 1 Step 2 Step 3
Searching Edge
Neighborhoods
For each edge
Find the triangles
Count 4-node graphlets
For each edge
Count 4-node cliques
and 4-node cycles only
Count 4-node graphlets
For each edge
Use combinatorial
relationships to compute
counts of other graphlets
in constant time
Step 4 Merge counts from all edges
± 1 edge
4-Node Motif Transition Diagram
Count a few patterns
Use relationships & transitions
to count all other graphlets in constant time
✓ Fast
✓ Space-Efficient
4-Node Motif Transition Diagram
± 1 edge
4-Cliques
4-Cycles
Maximum no. triangles
Incident to an edge
Maximum no. stars
Incident to an edge
Count a few patterns
Use relationships & transitions
to count all other graphlets in constant time
T T
Relationship between 4-cliques & 4-ChordalCycles
4-Cliques 4-ChordalCycle
e
T T
e
No. 4-ChordalCycles No. 4-Cliques
Proof in Lemma 1 - Ahmed et al., ICDM 2015
T T
Relationship between 4-cliques & 4-ChordalCycles
T T
No. 4-ChordalCycles No. 4-Cliques
4-Cliques 4-ChordalCycle
e e
Proof in Lemma 1 - Ahmed et al., ICDM 2015
0 1 2 4 8 16
0
5
10
15
Number of Processing Units
Speedup
socfb−Texas
socfb−OR
socfb−UCLA
socfb−Berkeley13
socfb−MIT
socfb−Penn94
0 1 2 4 8 16
0
5
10
15
Number of Processing Units
Speedup
0 1 2 4 8 16
0
2
4
6
8
10
12
14
Number of Processing Units
Speedup
tech−internet−as
tech−WHOIS
web−it−2004
web−spam
0 1 2 4 8 16
0
2
4
6
8
10
12
14
Number of Processing Units
Speedup
Strong scaling results
Intel Xeon 3.10 Ghz E5-2687W server v3, 16 cores
Comparison to RAGE [Marcus & Shavitt – J. Computer Networks 2011]
Facebook100 Networks from US Schools
Ours RAGE
Time in Seconds
|V| |E| Ours RAGE
Time in Seconds
Baseline (RAGE)
did not finish for
most graphs
We take ~4.5 secs for web-google (4.3M edges)
We take ~4 secs for inf-road-usa (29M edges)
Most motif counts in orders of 106 – 1015
§ Node-level and edge-level motif counting
Role discovery, Relational Learning, Multi-label Classification
0 2 4 6 8 10
x 10
4
0
0.005
0.01
0.015
0.02
0.025
0.03
0.035
0.04
0.045
Edges
Timeinseconds
0 2 4 6 8 10
x 10
4
0
0.005
0.01
0.015
0.02
0.025
0.03
0.035
0.04
0.045
Edges
Timeinseconds
Key Observations
The distribution of
graphlet runtimes for
edge neighborhoods
obey a power-law.
0 2 4 6 8 10
x 10
4
0
0.005
0.01
0.015
0.02
0.025
0.03
0.035
0.04
0.045
Edges
Timeinseconds
0 2 4 6 8 10
x 10
4
0
0.005
0.01
0.015
0.02
0.025
0.03
0.035
0.04
0.045
Edges
Timeinseconds
Most edge neighborhoods are fast
with runtimes that are approximately
equal.
Key Observations
The distribution of
graphlet runtimes for
edge neighborhoods
obey a power-law.
0 2 4 6 8 10
x 10
4
0
0.005
0.01
0.015
0.02
0.025
0.03
0.035
0.04
0.045
Edges
Timeinseconds
0 2 4 6 8 10
x 10
4
0
0.005
0.01
0.015
0.02
0.025
0.03
0.035
0.04
0.045
Edges
Timeinseconds
HOWEVER, a handful of
neighborhoods are hard and take
significantly longer.
Most edge neighborhoods are fast
with runtimes that are approximately
equal.
Key Observations
The distribution of
graphlet runtimes for
edge neighborhoods
obey a power-law.
Neighborhood runtimes
are power-lawed
0 2 4 6 8 10
x 10
4
0
0.005
0.01
0.015
0.02
0.025
0.03
0.035
0.04
0.045
Edges
Timeinseconds
0 2 4 6 8 10
x 10
4
0
0.005
0.01
0.015
0.02
0.025
0.03
0.035
0.04
0.045
Edges
Timeinseconds
HOWEVER, a handful of
neighborhoods are hard and take
significantly longer.
Most edge neighborhoods are fast
with runtimes that are approximately
equal.
Key Observations
The distribution of
graphlet runtimes for
edge neighborhoods
obey a power-law.
Neighborhood runtimes
are power-lawed
QUESTION:
How to reduce runtime?
à Sampling
§ [Horvitz-Thompson Estimation 1952]
§ Using HT estimator we scale the sampled count by probability p
Count of graphlet i
for edge j
Estimated Count
of graphlet i for edge j
More details in the paper
See Lemma 1
Average relative error for top 1000 (highest degree) edges
Sampling with replacement p = 0.01
GFD: Graphlet Frequency Distribution
Kolmogorov distance (KS)
L1 Normalized distance
1 2 4 8 12 16
0
2
4
6
8
10
12
14
16
Number of processing units
Speedup
socfb−MIT
bio−dmela
soc−gowalla
tech−RL−caida
web−wikipedia09
1 2 4 8 12 16
0
2
4
6
8
10
12
14
16
Number of processing units
Speedup
Strong scaling results
Using Intel Xeon E5-2687W v3 server, 16 cores
Counting a few motifs
+
Storing a few motif counts
+
Obtain other counts using
combinatorial relationships
(1) Fast and parallel
(2) Space-efficient
Two Conclusions
Joint Motif Counting
for Large Graphs
✓ (1) Motif Counting
Theory and Algorithms for Large Graphs
(2) Machine Learning Applications
for Motif Counting
§ Biological Networks
• network alignment, protein function prediction
[Pržulj 2007][Milenković-Pržulj 2008] [Hulovatyy-Solava-Milenković 2014]
[Shervashidze et al. 2009][Vishwanathan et al. 2010]
§ Social Networks
• Triad analysis, role discovery, community detection
[Granovetter 1983][Holland-Leinhardt 1976][Rossi-Ahmed 2015]
[Ahmed et al. 2015][Xie-Kelley-Szymanski 2013]
§ Internet AS [Feldman et al. 2008]
§ Spam Detection
[Becchetti et al. 2008][Ahmed et al. 2016]
-
-
-
-
-
Useful for various machine learning tasks
e.g., Anomaly detection, Role Discovery, Relational Learning, Clustering etc.
Higher-order network structures
§ Visualization – “spotting anomalies” [Ahmed et al. ICDM 2015]
§ Finding large cliques, stars, and other larger network
structures [Ahmed et al. KAIS 2015]
§ Spectral clustering [Benson et al. Science 2016]
§ Role discovery [Ahmed et al. 2016]
...
Ranking by graphlet counts
Nodes are colored/weighted
by triangle counts
Links are colored/weighted
by stars of size 4 nodes
Leukemia
Colon
cancer
Deafness
Label 1
Label 0
Enzyme
Non-Enzyme
Collection of Graphs
(e.g. Protein Graphs)
.
.
.
Graphs
Each Protein is represented by a graph
Binary label represents the function of the protein
§ D&D – 1178 protein graphs. Binary labeled as Enzymes vs.
Non. Enzymes
§ MUTAG – 188 mutagenic compounds. Binary labeled
(whether or not they have a mutagenic effect on the Gram-
negative bacterium)
§ 10-fold validation, Support Vector Machine
§ Used 3,4 node motif counts as features
§ Goal: Learn representation (features) for a set of graph
elements (nodes, edges, etc.)
§ Key intuition: Map the graph elements (e.g., nodes) to the
d-dimension space, while preserving structural similarity
§ Use the features for any downstream prediction task
Given a graph G=(V,E) and a set of T network motifs,
We form the weighted motif adjacency matrices
# of instances of motif Ht that contain nodes i and j
To generalize HONE for any motif-based matrix formulation, we
can define a function
Motif Transition Matrix
Laplacian Matrix
Normalized Laplacian Matrix
Motif Matrix
Formulations
The number of paths weighted by
motif counts from node i to node j in k-
steps is
The probability of transitioning from
node i to node j in k-steps is given by
We derive all k-step motif-based matrices for all T motifs and K steps
,""""for"k=1,…,K""and""t=1,…,T
K-step Motif Matrix
Local Embeddings
Concatenate all k-step node embedding for all T motifs and K steps
and find a ”global” higher-order node embeddings by solving
H
D"×"TKD&
Z
N"×"D
Each row of Z is a D-dimensional embedding of a node
Global Embeddings
Experimental setup
§ 10-fold cross-validation, repeated for 10 random trials
§ Used all 2-4 node connected orbits
§ D=128, Dl = 16 for the local motif embeddings
§ Edge embedding derived via mean function
§ Predict link existence via logistic regression (LR)
§ # steps K selected via grid search over
Main findings:
§ Mean Gain in AUC of 19.24% (& up
to 75.21%)
✓ (1) Motif Counting
Theory and Algorithms for Large Graphs
✓ (2) Machine Learning Applications
for Motif Counting
§ Framework & Algorithms
• One of the first parallel approaches for graphlet counting
• On average 460x faster than current methods
• Edge-centric computations (only requires access to edge
neighborhood)
• Time and space-efficient
• Sampling/estimation methods
• Local/global counting
§ Applications
• Large-scale graph comparison, classification, and anomaly detection
• Visual analytics and real-time graphlet mining
• Higher-order network embeddings
§ Efficient Graphlet Counting for Large Networks. ICDM 2015, [Ahmed et al.]
§ Graphlet Decomposition: Framework, Algorithms, and Applications. J. Know. & Info. 2016 [Ahmed et al.]
§ Estimation of Graphlet Counts in Massive Networks. IEEE TNNLS 2018 [Rossi-Zhou-Ahmed]
§ Higher-order Network Representation Learning. WWW 2018 [Rossi-Ahmed-Koh]
§ Network Motifs: Simple Building Blocks of Complex Networks. Science 2002, [Milo et al.]
§ Uncovering Biological Network Function via Graphlet Degree Signatures. Cancer Informatics 2008 [Milenković-Pržulj]
§ Graph Kernels. JMLR 2010, [Vishwanathan et al.]
§ Graph Sample and Hold: A Framework for Big Graph Analytics. KDD 2014 [Ahmed-Duffield-Neville-Kompella]
§ Role Discovery in Networks. IEEE TKDE 2015 [Rossi-Ahmed]
§ The Structure and Function of Complex Networks. SIAM Review 2003, [Newman]
§ Biological network comparison using graphlet degree distribution. Bioinformatics 2007 [Pržulj]
§ Efficient Graphlet Kernels for Large Graph Comparison. AISTAT 2009 [Shervashidze et al.]
§ Revealing Missing Parts of the Interactome via Link Prediction. PLoS ONE 2014 [Hulovatyy-Solava-Milenković]
§ Graft: An Efficient Graphlet Counting Method for Large Graph Analysis. TKDE 2014 [Rahman-Bhuiyan-Hasan]
§ Graph-Based Anomaly Detection, KDD 2003 [Noble-Cook]
§ Local structure in social networks. Sociological methodology 1976, [Holland-Leinhardt]
§ The strength of weak ties: A network theory revisited. Sociological theory 1983 [Granovetter]
§ Overlapping community detection in networks. ACM Computing Surveys 2013 [Xie-Kelley-Szymanski]
§ Automatic large scale generation of internet pop level maps. GLOBECOM 2008 [Feldman et al.]
§ Efficient semi-streaming algorithms for local triangle counting in massive graphs. KDD 2008 [Becchetti et al.]