How can we apply machine learning techniques to graphs to obtain predictions in a variety of domains? Learn more from Sami Abu-El-Haija, an AI scientist with experience in both industry (Google Research) and academia (University of Southern California).
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Ishiguro - Preferred Networks
This presentation explains basic ideas of graph neural networks (GNNs) and their common applications. Primary target audiences are students, engineers and researchers who are new to GNNs but interested in using GNNs for their projects. This is a modified version of the course material for a special lecture on Data Science at Nara Institute of Science and Technology (NAIST), given by Preferred Networks researcher Katsuhiko Ishiguro, PhD.
We present Graph Convolutional Networks that, unlike classic DL models, allow supervised learning by exploiting both individual node features and their relationships with other nodes in the network.
The document discusses subspace indexing on Grassmannian manifolds for large scale visual identification. It proposes using local subspace models built on neighborhoods defined by queries, but notes issues with computational complexity and lack of optimality. It then introduces Grassmannian and Stiefel manifolds to characterize subspace similarity and define distances. A model hierarchical tree is proposed to index subspaces through iterative merging based on distances on the Grassmannian manifold.
Lucas Theis - Compressing Images with Neural Networks - Creative AI meetup - Luba Elliott
This talk by Lucas Theis from Twitter/Magic Pony on "Compressing Images with Neural Networks" was presented at the Learning Image Representations event on 30th August at Twitter as part of the Creative AI meetup.
Weisfeiler and Leman Go Neural: Higher-order Graph Neural Networks - Christopher Morris
This document proposes methods to incorporate higher-order graph properties into graph neural networks (GNNs). It shows that GNNs are as powerful as the 1-dimensional Weisfeiler-Lehman graph isomorphism test for distinguishing graphs, but cannot capture higher-order properties like triangle counts. The document introduces k-dimensional GNNs and hierarchical k-GNNs to learn representations of subgraphs. Experimental results show these methods improve over 1-GNN baselines on graph classification and regression tasks.
Introduction to Graph neural networks @ Vienna Deep Learning meetup - Liad Magen
Graphs are useful data structures that can be used to model various sorts of data: from molecular protein structures to social networks, pandemic spreading models, and visually rich content such as websites and invoices. In recent years, graph neural networks have made a huge leap forward. They are a powerful tool that every data scientist should know. In this talk, we will review their basic structure, show some example usages, and explore the existing (Python) tools.
Glocalized Weisfeiler-Lehman Graph Kernels: Global-Local Feature Maps of Graphs - Christopher Morris
This document proposes a new graph kernel called the glocalized Weisfeiler-Lehman graph kernel. It extends the classic Weisfeiler-Lehman graph kernel to consider both local and global graph properties. The kernel maps graphs to feature vectors based on the k-dimensional Weisfeiler-Lehman algorithm. Approximation algorithms using adaptive sampling are introduced to make the kernel scalable to large graphs. Experimental results on graph classification benchmarks demonstrate the kernel achieves high accuracy while having fast running times.
We review our recent progress in the development of graph kernels. We discuss the hash graph kernel framework, which makes the computation of kernels for graphs with vertices and edges annotated with real-valued information feasible for large data sets. Moreover, we summarize our general investigation of the benefits of explicit graph feature maps in comparison to using the kernel trick. Our experimental studies on real-world data sets suggest that explicit feature maps often provide sufficient classification accuracy while being computed more efficiently. Finally, we describe how to construct valid kernels from optimal assignments to obtain new expressive graph kernels. These make use of the kernel trick to establish one-to-one correspondences. We conclude by a discussion of our results and their implication for the future development of graph kernels.
This document summarizes and compares two popular Python libraries for graph neural networks - Spektral and PyTorch Geometric. It begins by providing an overview of the basic functionality and architecture of each library. It then discusses how each library handles data loading and mini-batching of graph data. The document reviews several common message passing layer types implemented in both libraries. It provides an example comparison of using each library for a node classification task on the Cora dataset. Finally, it discusses a graph classification comparison in PyTorch Geometric using different message passing and pooling layers on the IMDB-binary dataset.
(DL Reading Group) Matching Networks for One Shot Learning - Masahiro Suzuki
1. Matching Networks is a neural network architecture proposed by DeepMind for one-shot learning.
2. The network learns to classify novel examples by comparing them to a small support set of examples, using an attention mechanism to focus on the most relevant support examples.
3. The network is trained using a meta-learning approach, where it learns to learn from small support sets to classify novel examples from classes not seen during training.
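The attention-based classification step above can be sketched in a few lines of numpy; the embeddings, labels, and function name here are illustrative, not the paper's actual implementation:

```python
import numpy as np

def attention_classify(query, support_x, support_y, n_classes):
    """Classify a query embedding as an attention-weighted sum of support labels.

    Weights are a softmax over cosine similarities, as described above.
    All embeddings are assumed to be precomputed feature vectors.
    """
    # Cosine similarity between the query and each support example
    q = query / np.linalg.norm(query)
    s = support_x / np.linalg.norm(support_x, axis=1, keepdims=True)
    sims = s @ q
    # Softmax attention over the support set
    weights = np.exp(sims - sims.max())
    weights /= weights.sum()
    # Predicted label distribution: attention-weighted sum of one-hot labels
    one_hot = np.eye(n_classes)[support_y]
    return weights @ one_hot

# Two class-0 supports and one class-1 support; query resembles class 0
support_x = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
support_y = np.array([0, 0, 1])
probs = attention_classify(np.array([0.95, 0.05]), support_x, support_y, 2)
```

Note that no gradient step is needed at test time: adapting to a new support set is just a forward pass, which is what makes the approach one-shot.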
Joint Word and Entity Embeddings for Entity Retrieval from Knowledge Graph - Fedor Nikolaev
The document proposes a method called KEWER that learns distributed representations of words, entities, and categories from a knowledge graph in the same embedding space. KEWER first generates random walks from entities, replaces some elements with surface forms, and then learns embeddings by maximizing the likelihood of contexts. These embeddings improve entity retrieval over term-based and existing joint embedding models, especially when combined with entity linking.
Lec-07: Feature Aggregation and Image Retrieval System [notes]
Image retrieval system performance metrics, precision, recall, true positive rate, false positive rate; Bag of Words (BoW) and VLAD aggregation.
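As a quick illustration of these retrieval metrics, a small Python sketch (item ids and counts are made up):

```python
def retrieval_metrics(retrieved, relevant, n_total):
    """Precision, recall (= true positive rate) and false positive rate
    for one retrieval result.

    retrieved: set of returned item ids; relevant: set of ground-truth ids;
    n_total: total number of items in the database.
    """
    tp = len(retrieved & relevant)
    fp = len(retrieved - relevant)
    fn = len(relevant - retrieved)
    tn = n_total - tp - fp - fn
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return precision, recall, fpr

# 4 items returned, 3 truly relevant, database of 10 items
p, r, fpr = retrieval_metrics({1, 2, 3, 4}, {2, 3, 5}, n_total=10)
```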
Lec-16: Subspace/Transform Optimization
Address the non-linearity issues in appearance manifolds by having a piece-wise linear solution. Query driven local model learning, subspace indexing on Grassmann manifold, direct Newtonian method of subspace optimization on Grassmann manifold.
Convolutional networks and graph networks through kernels - tuxette
This presentation discusses how convolutional kernel networks (CKNs) can be used to model sequential and graph-structured data through kernels defined over sequences and graphs. CKNs define feature maps from substructures like n-mers in sequences and paths in graphs into high-dimensional spaces, which are then approximated to obtain low-dimensional representations that can be used for prediction tasks like classification. This approach is analogous to convolutional neural networks and can be extended to multiple layers. The presentation provides examples showing CKNs achieve good performance on problems involving protein sequences and social networks.
https://github.com/telecombcn-dl/dlmm-2017-dcu
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of big annotated data and affordable GPU hardware has allowed the training of neural networks for data analysis tasks that had previously been addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or text captioning.
Lec-17: Sparse Signal Processing & Applications [notes]
Sparse signal processing, recovery of sparse signal via L1 minimization. Applications including face recognition, coupled dictionary learning for image super-resolution.
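A minimal sketch of L1-based sparse recovery via iterative soft-thresholding (ISTA); the dictionary, signal, and parameters below are illustrative toy values, not a production solver:

```python
import numpy as np

def ista(A, y, lam=0.05, n_iter=1000):
    """Iterative soft-thresholding for min ||Ax - y||^2 / 2 + lam * ||x||_1,
    a basic solver for the L1 sparse-recovery problem described above."""
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        g = x - A.T @ (A @ x - y) / L        # gradient step on the data term
        x = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)  # soft-threshold
    return x

# Recover a 3-sparse signal in R^100 from 30 random linear measurements
rng = np.random.default_rng(0)
A = rng.normal(size=(30, 100)) / np.sqrt(30)
x_true = np.zeros(100)
x_true[[5, 40, 77]] = [1.5, -2.0, 1.0]
x_hat = ista(A, A @ x_true)
```

The soft-threshold step is what drives most coefficients exactly to zero, yielding the sparse solution the L1 penalty promises.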
Since the advent of the horseshoe priors for regularization, global-local shrinkage methods have proved to be a fertile ground for the development of Bayesian theory and methodology in machine learning. They have achieved remarkable success in computation, and enjoy strong theoretical support. Much of the existing literature has focused on the linear Gaussian case. The purpose of the current talk is to demonstrate that the horseshoe priors are useful more broadly, by reviewing both methodological and computational developments in complex models that are more relevant to machine learning applications. Specifically, we focus on methodological challenges in horseshoe regularization in nonlinear and non-Gaussian models; multivariate models; and deep neural networks. We also outline the recent computational developments in horseshoe shrinkage for complex models along with a list of available software implementations that allows one to venture out beyond the comfort zone of the canonical linear regression problems.
https://telecombcn-dl.github.io/2018-dlai/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both algorithmic and computational perspectives.
This document introduces graph neural networks and discusses a claim that they are essentially low-pass filters. It provides an overview of graph neural network operations, including combining node features, aggregating information from neighbors, and updating node representations over multiple layers. The document notes that while graph neural networks may be less powerful than other deep learning methods, they are interesting for problems involving graphs, such as drug discovery and web analytics. It questions how graph neural network classifications operate and whether the low-pass filter behavior is caused by the graph Laplacian matrix.
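The combine/aggregate/update operations mentioned above can be sketched as a single graph-convolution layer in numpy; the graph, features, and weights are toy values, and the normalized operator S is exactly the low-pass smoothing step under discussion:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution layer:
    H' = relu(D^{-1/2} (A + I) D^{-1/2} H W).

    The symmetrically normalized operator averages each node's features
    with its neighbors' (a low-pass filter on the graph), then applies a
    learned linear map and a nonlinearity.
    """
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    S = D_inv_sqrt @ A_hat @ D_inv_sqrt       # smoothing (low-pass) operator
    return np.maximum(S @ H @ W, 0.0)         # aggregate, transform, ReLU

# Tiny 3-node path graph with 2-d node features and identity weights
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
H = np.array([[1., 0.], [0., 1.], [1., 0.]])
H1 = gcn_layer(A, H, np.eye(2))
```

Stacking several such layers repeatedly smooths features over larger neighborhoods, which is the mechanism behind the "GNNs are low-pass filters" claim.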
A short presentation on the emerging research on normalizing flows. The presentation follows two recent survey papers on the topic:
[1] Kobyzev, Ivan, Simon Prince, and Marcus Brubaker. Normalizing flows: An introduction and review of current methods, T-PAMI 2020.
[2] Papamakarios, George, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference, arXiv preprint arXiv:1912.02762 (2019).
AILABS Lecture Series - Is AI The New Electricity? Topic - Deep Learning - Evolution - AILABS Academy
Artificial Intelligence is transforming the world. Deep Learning, an integral part of this new Artificial Intelligence paradigm, is becoming one of the most sought after skills. Learn more about Deep Learning and its Evolution.
Multimodal Residual Networks for Visual QA - Jin-Hwa Kim
Deep neural networks continue to advance the state-of-the-art of image recognition tasks with various methods. However, applications of these methods to multimodality remain limited. We present Multimodal Residual Networks (MRN) for the multimodal residual learning of visual question-answering, which extends the idea of the deep residual learning. Unlike the deep residual learning, MRN effectively learns the joint representation from vision and language information. The main idea is to use element-wise multiplication for the joint residual mappings exploiting the residual learning of the attentional models in recent studies. Various alternative models introduced by multimodality are explored based on our study. We achieve the state-of-the-art results on the Visual QA dataset for both Open-Ended and Multiple-Choice tasks. Moreover, we introduce a novel method to visualize the attention effect of the joint representations for each learning block using back-propagation algorithm, even though the visual features are collapsed without spatial information.
This document provides an overview of elliptic curve cryptography (ECC). It discusses how ECC provides stronger security than RSA with smaller key sizes. The document describes the mathematical foundations of elliptic curves over finite fields. It explains scalar multiplication, which involves adding a point on the elliptic curve to itself multiple times, as the core operation in ECC. Finally, it discusses implementations of ECC and applications for encryption and digital signatures.
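As an illustration of scalar multiplication via double-and-add, here is a sketch on a small textbook curve (far too small to be secure; the parameters are the standard toy example y² = x³ + 2x + 2 over F₁₇):

```python
# Toy curve y^2 = x^3 + 2x + 2 over F_17 (textbook example; not secure).
P_MOD, A_COEF = 17, 2

def ec_add(P, Q):
    """Add two points on the curve; None represents the point at infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None                       # P + (-P) = O
    if P == Q:                            # tangent slope for doubling
        lam = (3 * x1 * x1 + A_COEF) * pow(2 * y1, -1, P_MOD)
    else:                                 # chord slope for distinct points
        lam = (y2 - y1) * pow(x2 - x1, -1, P_MOD)
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)

def scalar_mult(k, P):
    """Double-and-add: compute k*P in O(log k) curve additions."""
    result, addend = None, P
    while k:
        if k & 1:
            result = ec_add(result, addend)
        addend = ec_add(addend, addend)
        k >>= 1
    return result

G = (5, 1)  # generator of a subgroup of order 19 on this curve
```

Double-and-add is the elliptic-curve analogue of square-and-multiply: it is what makes computing k·G fast while recovering k from k·G (the discrete logarithm) remains hard at real key sizes.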
Neo4j MeetUp - Graph Exploration with MetaExp - Adrian Ziegler
This document discusses graph exploration using Neo4j and describes:
1. Computing meta-paths from graph schemas to efficiently represent knowledge in graphs.
2. Embedding meta-paths to learn vector representations for active learning and preference prediction.
3. An active learning strategy to label informative meta-paths and explore the space of all meta-paths.
This document outlines a tutorial on visual search and understanding. It discusses various techniques for visual representations and indexing, including recent neural network architectures that aim to reduce parameters, memory usage, and spatial redundancy. Specific techniques covered include Multi-Fiber Networks, Double Attention Networks, and Global Reasoning Networks. Global Reasoning Networks are discussed in detail, including how they project features from coordinate space to interaction space, reason over feature interactions using graph convolutions, and project back to coordinate space.
This document discusses the combination of graph neural networks (GNNs) and reinforcement learning (RL). It provides background on GNNs, including how they can handle non-Euclidean data with graph structures. It also describes common GNN models like graph convolutional networks (GCN) and GraphSAGE. The document then reviews previous works that combine GNNs and RL for multi-agent systems and autonomous driving. It presents the graph convolutional reinforcement learning (DGN) framework that uses self-attention and relation kernels to model agent interactions.
Software toolkits for machine learning and graphical models - butest
This document summarizes machine learning software for graphical models. It discusses discriminative models for independent data, conditional random fields for dependent data, generative models for unsupervised learning, and Bayesian models. It provides an overview of software for inference, learning, and Bayesian inference in graphical models.
This document provides an overview of a tutorial on graph representation learning for recommender systems. The tutorial covers embedding nodes in homogeneous graphs using random walk-based approaches like DeepWalk and node2vec. It also discusses higher-order embedding methods like LINE, which directly model graph properties, and GraRep, which represents the probability of k-step random walks. The graph embeddings can be used for tasks like entity retrieval and classification and as inputs to recommender system models.
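The random-walk corpus generation behind DeepWalk-style methods can be sketched as follows (the adjacency list and parameters are illustrative; the walks would then be fed to a skip-gram model to learn node embeddings):

```python
import random

def random_walks(adj, walk_len, walks_per_node, seed=0):
    """Generate uniform random walks over an adjacency-list graph: the
    corpus-building step used by DeepWalk-style embedding methods."""
    rng = random.Random(seed)
    walks = []
    for start in adj:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_len:
                nbrs = adj[walk[-1]]
                if not nbrs:              # dead end: stop the walk early
                    break
                walk.append(rng.choice(nbrs))
            walks.append(walk)
    return walks

# Toy undirected triangle plus one isolated node
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: []}
walks = random_walks(adj, walk_len=5, walks_per_node=2)
```

node2vec differs from this uniform sketch only in biasing the neighbor choice (return and in-out parameters), trading off breadth-first and depth-first exploration.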
A simple framework for contrastive learning of visual representations - Devansh16
Link: https://machine-learning-made-simple.medium.com/learnings-from-simclr-a-framework-contrastive-learning-for-visual-representations-6c145a5d8e99
This paper presents SimCLR: a simple framework for contrastive learning of visual representations. We simplify recently proposed contrastive self-supervised learning algorithms without requiring specialized architectures or a memory bank. In order to understand what enables the contrastive prediction tasks to learn useful representations, we systematically study the major components of our framework. We show that (1) composition of data augmentations plays a critical role in defining effective predictive tasks, (2) introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and (3) contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning. By combining these findings, we are able to considerably outperform previous methods for self-supervised and semi-supervised learning on ImageNet. A linear classifier trained on self-supervised representations learned by SimCLR achieves 76.5% top-1 accuracy, which is a 7% relative improvement over previous state-of-the-art, matching the performance of a supervised ResNet-50. When fine-tuned on only 1% of the labels, we achieve 85.8% top-5 accuracy, outperforming AlexNet with 100X fewer labels.
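The contrastive (NT-Xent) loss at the heart of SimCLR can be sketched in numpy; this is a simplified, unbatched illustration rather than the paper's actual implementation:

```python
import numpy as np

def nt_xent(z1, z2, temperature=0.5):
    """NT-Xent contrastive loss on two augmented views (numpy sketch).

    z1[i] and z2[i] are embeddings of two augmentations of example i;
    each such pair is positive, and every other sample in the batch
    serves as a negative.
    """
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity
    sim = z @ z.T / temperature
    n = len(z1)
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    # The positive of sample i is sample i + n (and vice versa)
    pos = np.concatenate([np.arange(n) + n, np.arange(n)])
    log_prob = sim[np.arange(2 * n), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()

# Two nearly identical views of a batch of 4 examples -> low loss
rng = np.random.default_rng(0)
z1 = rng.normal(size=(4, 8))
loss = nt_xent(z1, z1 + 0.01 * rng.normal(size=(4, 8)))
```

The loss pulls the two views of each example together while pushing them away from the rest of the batch, which is why large batch sizes (more negatives) help.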
Comments: ICML'2020. Code and pretrained models at this https URL
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Cite as: arXiv:2002.05709 [cs.LG]
(or arXiv:2002.05709v3 [cs.LG] for this version)
Submission history
From: Ting Chen [view email]
[v1] Thu, 13 Feb 2020 18:50:45 UTC (5,093 KB)
[v2] Mon, 30 Mar 2020 15:32:51 UTC (5,047 KB)
[v3] Wed, 1 Jul 2020 00:09:08 UTC (5,829 KB)
1) The document discusses using data in deep learning models, including understanding the limitations of data and how it is acquired.
2) It describes techniques for image matching using multi-view geometry, including finding corresponding points across images and triangulating them to determine camera pose.
3) Recent works aim to improve localization of objects in images using multiple instance learning approaches that can learn without full supervision or through more stable optimization methods like linearizing sampling operations.
Social networks are not new, even though websites like Facebook and Twitter might make you want to believe they are; and trust me- I’m not talking about Myspace! Social networks are extremely interesting models for human behavior, whose study dates back to the early twentieth century. However, because of those websites, data scientists have access to much more data than the anthropologists who studied the networks of tribes!
Because networks take a relationship-centered view of the world, the data structures that we will analyze model real world behaviors and community. Through a suite of algorithms derived from mathematical Graph theory we are able to compute and predict behavior of individuals and communities through these types of analyses. Clearly this has a number of practical applications from recommendation to law enforcement to election prediction, and more.
This document discusses knowledge discovery and machine learning on graph data. It makes three main observations:
1) Graphs are typically constructed from input data rather than given directly, as relationships must be inferred.
2) Graph data management is challenging due to issues like large size, dynamic nature, heterogeneity and attribution.
3) Useful insights and accurate modeling depend on the representation of the data as a graph, such as through decomposition, feature learning or other techniques.
The document discusses various methods for 3D object modeling and representation, including:
- Polygonal meshes which approximate surfaces and solids using polygons and can represent a broad class of objects.
- Solid modeling using polygonal meshes where directional information is added to faces using normal vectors.
- Sweep representations that form shapes by extruding or sweeping 2D profiles through space.
- Surface modeling using explicit functions of two variables or surfaces of revolution obtained by rotating curves around axes.
Tutorial on Theory and Application of Generative Adversarial NetworksMLReview
Description
Generative adversarial network (GAN) has recently emerged as a promising generative modeling approach. It consists of a generative network and a discriminative network. Through the competition between the two networks, it learns to model the data distribution. In addition to modeling the image/video distribution in computer vision problems, the framework finds use in defining visual concept using examples. To a large extent, it eliminates the need of hand-crafting objective functions for various computer vision problems. In this tutorial, we will present an overview of generative adversarial network research. We will cover several recent theoretical studies as well as training techniques and will also cover several vision applications of generative adversarial networks.
https://github.com/telecombcn-dl/dlmm-2017-dcu
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of big annotated data and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which had been addressed until now with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or text captioning.
Deep learning for molecules, introduction to chainer chemistryKenta Oono
1) The document introduces machine learning and deep learning techniques for predicting chemical properties, including rule-based approaches versus learning-based approaches using neural message passing algorithms.
2) It discusses several graph neural network models like NFP, GGNN, WeaveNet and SchNet that can be applied to molecular graphs to predict characteristics. These models update atom representations through message passing and graph convolution operations.
3) Chainer Chemistry is introduced as a deep learning framework that can be used with these graph neural network models for chemical property prediction tasks. Examples of tasks include drug discovery and molecular generation.
https://telecombcn-dl.github.io/2018-dlai/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both an algorithmic and computational perspectives.
The document discusses machine learning techniques for graphs and graph-parallel computing. It describes how graphs can model real-world data with entities as vertices and relationships as edges. Common machine learning tasks on graphs include identifying influential entities, finding communities, modeling dependencies, and predicting user behavior. The document introduces the concept of graph-parallel programming models that allow algorithms to be expressed by having each vertex perform computations based on its local neighborhood. It presents examples of graph algorithms like PageRank, product recommendations, and identifying leaders that can be implemented in a graph-parallel manner. Finally, it discusses challenges of analyzing large real-world graphs and how systems like GraphLab address these challenges through techniques like vertex-cuts and asynchronous execution.
Similar to JOSA TechTalks - Machine Learning on Graph-Structured Data (20)
Much of the software world has moved to microservices and a Service-Oriented Architecture (SOA) when designing large-scale systems. An alternate, less-known philosophy of building large systems is called Data-Oriented Architecture (DOA), which trades off the scale and maintainability challenges of SOA with a rigorous focus on schema representation in a monolithic data access layer. For microservices-based systems where scale across verticals is a challenge, DOA might be an appropriate alternative.
Speaker: Eyas Sharaiha - Senior Software Engineer at Google
Open source is an important part of the engineering infrastructure at OpenSooq, and mobile is no exception. OpenSooq team presents their development toolbox, modules they use, and how OpenSooq tests its apps at scale and speed.
Presentation about digital transformation driven by big data, and how to navigate from data to insight to action. Presentation given by Hamzah Amin, a Senior Data Scientist & Analytics Consultant at Jordan Business Systems. He is also a Master's student in Data Science at Princess Sumaya University for Technology.
By Hamzah Amin at the JOSA Data Science Meetup on 14/9/2019.
Applications of Data Science in Drug Discovery, Financial Services, Project Management, Human Resources and Marketing.
By Dr. Laila Alabidi at the JOSA Data Science Meetup on 17/8/2019.
Slides for 'JOSA TechTalks: Arabic NLP in Practice' with an introduction to Natural Language Understanding (NLU) with focus on Arabic, covers main issues in (Arabic) NLU and used tools and resources.
By Dr. Samir Tartir - Senior Research Scientist of AI at Mawdoo3
Docker is an open platform for developers and system administrators to build, ship and run distributed applications. Using Docker, companies in Jordan have been able to build powerful system architectures that allow speeding up delivery, easing deployment processes and at the same time cutting major hosting costs.
Osama Jaber shares his experience at ArabiaWeather in how they moved away from AWS to a highly-redundant, high-performance and low-cost solution using docker and other open-source technologies.
Docker is an open platform for developers and system administrators to build, ship and run distributed applications. Using Docker, companies in Jordan have been able to build powerful system architectures that allow speeding up delivery, easing deployment processes and at the same time cutting major hosting costs.
George Khoury shares his experience at Salalem in building flexible and cost effective architectures using Docker and other tools for infrastructure orchestration. The result allows them to easily and quickly move between different cloud providers.
With the exponential growth of the Internet, inundating text is being generated through articles, blogs, comments, and tweets. For computing systems, text is merely a series of bytes with no meaning. This renders the text obsolete for any further processing.
Multiple methods have been applied to make computers infer word meanings, but none of them was accurate and practical as much as Word2Vec. Word2Vec opened a new horizon for industries that allowed them to apply various Natural Language Processing (NLP) tasks such as machine translation, sentiment analysis, word similarity, text summarization, chat bots ...etc.
This was part of JOSA TechTalks project within Jordan Open Source Association, presented by Wael Farhan.
As applications grow in complexity, web developers and front-end developers all suffer the hassle of building and maintaining complex web applications; managing and maintaining consistency of application state. This presentation goes through what's special about React and Redux.
This was part of JOSA TechTalks project within Jordan Open Source Association, presented by Ali Sa'o and Omar Abdelhafith.
Everyday we create services for our systems. A lot of people create RESTful APIs but much more can be accomplished by following best practices and treating your APIs as a product to be consumed by fellow team members, systems and 3rd party consumers.
We will discuss what makes a great RESTful API and share some of our experiences building some that power real systems.
This was part of JOSA TechTalks project within Jordan Open Source Association, presented by Yazan Quteishat and Tambi Jalouqa.
The document discusses the evolution of modern web application architecture from static websites to more complex dynamic and database-driven applications. It begins by describing static websites and then covers good-looking static sites using HTML, CSS and JS. It moves onto simple dynamic sites that can change based on user input, then stateful websites that remember changes across requests like blogs. Database performance led to caching being introduced. Frameworks like MVC were created and applications became more data-centric. Key lessons are using the right tool, separation of concerns, replacing components while maintaining interfaces, and embracing change.
Any non-trivial solution will be spread across multiple servers in different data centers. A lot of system administrators use external tools for monitoring, health-checking, and alerts. In this session, we will discuss ways to build your own monitoring and alert system to keep an eye on all kinds of running systems.
RDBMS gave us table schemas. A table schema, which is an essential metadata component, gave us the power to validate data types, and enforce constraints. In the age of varying data and schema-less data stores, how can we enforce these rules and how can we leverage metadata (even in RDBMS) to empower data validity, code checks, and automation.
This is a brief background into Big data (data lake) to put in context the importance of metadata from a governance perspective and more especially in todays heterogeneous big data platforms.
In this session, Sinan introduces Supervised Learning, help you build an intuition about it, and walk you through an example with Python using scikit-learn. You'll see it is pretty straightforward, and you might find room to apply Supervised Learning in your current or next project!
Presented by: Sinan Taifour, CTO at CashBasha.com
Session video recording is on: https://youtu.be/nfbBrPk3EiQ
Docker and Containers are proven solutions, but are they ready to replace your current deployment? And more importantly, are you aware of the changes you'll have to make to accommodate them? Are there any risks involved? This talk will answer these questions and talk about how to plan, automate, build, deploy, and orchestrate the whole process.
Introducing containers and docker, answering questions like: What are software containers? What is Docker? Who and why should I use Docker?
Slides also discuss the role of dev-ops and Docker and walk you through some examples.
By Aram Yegenian — System Administrator
A look at a modern programming language and how it can enhance the productivity and workflow of developers while promising higher execution performance. We will be talking about metaprogramming, functional programming and type systems in a pragmatic way mixed with some examples from `vibe.d` web framework.
By Yazan Dabain - Senior System Engineer, Arabia Weather
Scala, Haskell and LISP are examples of programming languages using the functional programming paradigm. Join us in this TechTalk to know why functional programming is so important, how to implement some of its core concepts in your existing programming languages, and how functional programming inspired Google's Map Reduce, Twitter's Algebird, and many other technologies.
By Mohammad Ghabboun - Senior Software Engineer, SOUQ.com
Digital Marketing Trends in 2024 | Guide for Staying AheadWask
https://www.wask.co/ebooks/digital-marketing-trends-in-2024
Feeling lost in the digital marketing whirlwind of 2024? Technology is changing, consumer habits are evolving, and staying ahead of the curve feels like a never-ending pursuit. This e-book is your compass. Dive into actionable insights to handle the complexities of modern marketing. From hyper-personalization to the power of user-generated content, learn how to build long-term relationships with your audience and unlock the secrets to success in the ever-shifting digital landscape.
A Comprehensive Guide to DeFi Development Services in 2024Intelisync
DeFi represents a paradigm shift in the financial industry. Instead of relying on traditional, centralized institutions like banks, DeFi leverages blockchain technology to create a decentralized network of financial services. This means that financial transactions can occur directly between parties, without intermediaries, using smart contracts on platforms like Ethereum.
In 2024, we are witnessing an explosion of new DeFi projects and protocols, each pushing the boundaries of what’s possible in finance.
In summary, DeFi in 2024 is not just a trend; it’s a revolution that democratizes finance, enhances security and transparency, and fosters continuous innovation. As we proceed through this presentation, we'll explore the various components and services of DeFi in detail, shedding light on how they are transforming the financial landscape.
At Intelisync, we specialize in providing comprehensive DeFi development services tailored to meet the unique needs of our clients. From smart contract development to dApp creation and security audits, we ensure that your DeFi project is built with innovation, security, and scalability in mind. Trust Intelisync to guide you through the intricate landscape of decentralized finance and unlock the full potential of blockchain technology.
Ready to take your DeFi project to the next level? Partner with Intelisync for expert DeFi development services today!
Programming Foundation Models with DSPy - Meetup SlidesZilliz
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
Taking AI to the Next Level in Manufacturing.pdfssuserfac0301
Read Taking AI to the Next Level in Manufacturing to gain insights on AI adoption in the manufacturing industry, such as:
1. How quickly AI is being implemented in manufacturing.
2. Which barriers stand in the way of AI adoption.
3. How data quality and governance form the backbone of AI.
4. Organizational processes and structures that may inhibit effective AI adoption.
6. Ideas and approaches to help build your organization's AI strategy.
Generating privacy-protected synthetic data using Secludy and MilvusZilliz
During this demo, the founders of Secludy will demonstrate how their system utilizes Milvus to store and manipulate embeddings for generating privacy-protected synthetic data. Their approach not only maintains the confidentiality of the original data but also enhances the utility and scalability of LLMs under privacy constraints. Attendees, including machine learning engineers, data scientists, and data managers, will witness first-hand how Secludy's integration with Milvus empowers organizations to harness the power of LLMs securely and efficiently.
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdfflufftailshop
When it comes to unit testing in the .NET ecosystem, developers have a wide range of options available. Among the most popular choices are NUnit, XUnit, and MSTest. These unit testing frameworks provide essential tools and features to help ensure the quality and reliability of code. However, understanding the differences between these frameworks is crucial for selecting the most suitable one for your projects.
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...alexjohnson7307
Predictive maintenance is a proactive approach that anticipates equipment failures before they happen. At the forefront of this innovative strategy is Artificial Intelligence (AI), which brings unprecedented precision and efficiency. AI in predictive maintenance is transforming industries by reducing downtime, minimizing costs, and enhancing productivity.
5th LF Energy Power Grid Model Meet-up SlidesDanBrown980551
5th Power Grid Model Meet-up
It is with great pleasure that we extend to you an invitation to the 5th Power Grid Model Meet-up, scheduled for 6th June 2024. This event will adopt a hybrid format, allowing participants to join us either through an online Mircosoft Teams session or in person at TU/e located at Den Dolech 2, Eindhoven, Netherlands. The meet-up will be hosted by Eindhoven University of Technology (TU/e), a research university specializing in engineering science & technology.
Power Grid Model
The global energy transition is placing new and unprecedented demands on Distribution System Operators (DSOs). Alongside upgrades to grid capacity, processes such as digitization, capacity optimization, and congestion management are becoming vital for delivering reliable services.
Power Grid Model is an open source project from Linux Foundation Energy and provides a calculation engine that is increasingly essential for DSOs. It offers a standards-based foundation enabling real-time power systems analysis, simulations of electrical power grids, and sophisticated what-if analysis. In addition, it enables in-depth studies and analysis of the electrical power grid’s behavior and performance. This comprehensive model incorporates essential factors such as power generation capacity, electrical losses, voltage levels, power flows, and system stability.
Power Grid Model is currently being applied in a wide variety of use cases, including grid planning, expansion, reliability, and congestion studies. It can also help in analyzing the impact of renewable energy integration, assessing the effects of disturbances or faults, and developing strategies for grid control and optimization.
What to expect
For the upcoming meetup we are organizing, we have an exciting lineup of activities planned:
-Insightful presentations covering two practical applications of the Power Grid Model.
-An update on the latest advancements in Power Grid -Model technology during the first and second quarters of 2024.
-An interactive brainstorming session to discuss and propose new feature requests.
-An opportunity to connect with fellow Power Grid Model enthusiasts and users.
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on automated letter generation for Bonterra Impact Management using Google Workspace or Microsoft 365.
Interested in deploying letter generation automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Skybuffer SAM4U tool for SAP license adoptionTatiana Kojar
Manage and optimize your license adoption and consumption with SAM4U, an SAP free customer software asset management tool.
SAM4U, an SAP complimentary software asset management tool for customers, delivers a detailed and well-structured overview of license inventory and usage with a user-friendly interface. We offer a hosted, cost-effective, and performance-optimized SAM4U setup in the Skybuffer Cloud environment. You retain ownership of the system and data, while we manage the ABAP 7.58 infrastructure, ensuring fixed Total Cost of Ownership (TCO) and exceptional services through the SAP Fiori interface.
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfChart Kalyan
A Mix Chart displays historical data of numbers in a graphical or tabular form. The Kalyan Rajdhani Mix Chart specifically shows the results of a sequence of numbers over different periods.
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slackshyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on integration of Salesforce with Bonterra Impact Management.
Interested in deploying an integration with Salesforce for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfMalak Abu Hammad
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...Tatiana Kojar
Skybuffer AI, built on the robust SAP Business Technology Platform (SAP BTP), is the latest and most advanced version of our AI development, reaffirming our commitment to delivering top-tier AI solutions. Skybuffer AI harnesses all the innovative capabilities of the SAP BTP in the AI domain, from Conversational AI to cutting-edge Generative AI and Retrieval-Augmented Generation (RAG). It also helps SAP customers safeguard their investments into SAP Conversational AI and ensure a seamless, one-click transition to SAP Business AI.
With Skybuffer AI, various AI models can be integrated into a single communication channel such as Microsoft Teams. This integration empowers business users with insights drawn from SAP backend systems, enterprise documents, and the expansive knowledge of Generative AI. And the best part of it is that it is all managed through our intuitive no-code Action Server interface, requiring no extensive coding knowledge and making the advanced AI accessible to more users.
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
2. Agenda
● Background & Motivation
● [Breadth] ML Models on Graphs
● [Depth] Recent ML Models on Graphs
○ MixHop (ICML’19)
○ Watch Your Step (NeurIPS’18)
● Fast Training
○ GTTF (ICLR’21)
○ Fast GRL with unique optimal solutions (ICLR’21 Workshop GTRL)
5. What is a graph?
Nodes: entities
Edges: relationships between entities
6. What is a graph?
Nodes: entities
Edges: relationships between entities
[Figure: an example graph in which every node carries features x, and some nodes also carry labels y]
x: features
y: labels
7. What is a graph?
nodes = people
edges = friendship
Social Network
y= engaging ads
x = [age, gender, …]
8. What is a graph?
News Articles
nodes = articles
edges = citations
y=article type
x=article text
9. What is a graph?
Chemical compounds can be viewed as a graph:
y = molecule properties (per graph)
x = [H, F, C, O, N, …] (per atom)
10. Why ML on Graphs?
Motivation
Across domains, practitioners benefit from predictions on graphs.
Some Popular Tasks:
● Predict node labels (node classification)
○ E.g., predict users’ engagement with ads (in a social network).
● Predict missing edges (link prediction / edge classification)
○ E.g., predict which proteins interact with each other.
● Classify an entire graph
○ E.g., predict physical properties of a chemical molecule represented as a graph
● Generate Graphs [e.g. with certain properties]
○ E.g., can answer “Give me a chemical molecule with the following properties”
11. High Level of Various Graph Algorithms
Fine! You have a graph.
You want to predict information on the graph. How to proceed?
Next: Identify the modeling technique!
● Option (Graph Embeddings): Place nodes onto an embedding space → throw the graph away but keep embeddings.
● Option (Graph Regularization): Use graph as a regularizer. No graph is needed after model training.
● Option (Graph Convolution ⊂ Message Passing): Representation of a node is a function of its neighbors. Graph is needed for training and inference.
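As a concrete illustration of the third option, here is a minimal one-layer graph-convolution sketch in NumPy. The toy graph, features, and weights are invented for illustration (this is a sketch of the general idea, not the talk's exact model): each node's new representation is a transformed average of its neighborhood's representations.

```python
import numpy as np

# Toy 3-node path graph: 0 - 1 - 2 (illustrative, not from the slides).
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)   # adjacency matrix
A_hat = A + np.eye(3)                    # add self-loops so a node keeps its own signal
D_hat = np.diag(A_hat.sum(axis=1))       # degree matrix of A_hat

X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])               # 2 input features per node
W = np.array([[1.0, -1.0],
              [0.5,  0.5]])              # "learnable" weights, fixed here

# One graph-convolution step: average over each node's neighborhood
# (including itself), then apply a linear transform and a ReLU.
H = np.maximum(0.0, np.linalg.inv(D_hat) @ A_hat @ X @ W)
print(H)   # one new representation per node, shape (3, 2)
```

Because the averaging uses the adjacency matrix, the graph is needed at both training and inference time, unlike the embedding and regularization options.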
13. (undirected) Graph
Adjacency Matrix
Degree Matrix
Feature Matrix
Transition Matrix
Laplacian Matrix
Quiz: What does TX encode?
L gives relaxed estimates for NP-hard problems, e.g., graph partitioning.
Its eigenbasis provides continuous axes on which the nodes live.
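The matrices above, and the quiz, can be made concrete with a small NumPy sketch (the toy graph and features are invented for illustration):

```python
import numpy as np

# Toy undirected graph on 4 nodes: edges 0-1, 0-2, 1-2, 2-3.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)   # adjacency matrix

D = np.diag(A.sum(axis=1))                  # degree matrix
T = np.linalg.inv(D) @ A                    # transition matrix  T = D^-1 A
L = D - A                                   # (unnormalized) graph Laplacian

# A single-column feature matrix X, one value per node.
X = np.array([[1.0], [2.0], [3.0], [4.0]])

# Quiz answer: TX replaces each node's features by the *mean* of its
# neighbors' features -- one step of a random walk applied to X.
print(T @ X)   # node 0 -> mean(2, 3) = 2.5, node 3 -> 3.0, etc.
```

Each row of T sums to 1, and each row of L sums to 0, which is what makes T a random-walk operator and L a difference (smoothness) operator.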
15. High Level of Various Graph Algorithms
● Option (Graph Embeddings)
● Option (Graph Regularization)
● Option (Graph Convolution ⊂ Message Passing)
16. Overview: Graph Embedding
[Figure: a graph with nodes v1 … v11, mapped to points .v1 … .v11 in an embedding space]
Embed in Rd:
Factorize A or L [1]
Auto-encode A [2]
Skipgram on E[walk] [3, B, D]
[1] Belkin & Niyogi, Laplacian Eigenmaps for Dimensionality Reduction and Data Representation, Neural Computation 2003
[2] Wang et al, Structural Deep Network Embedding, KDD’2016
[3] Perozzi et al, DeepWalk, KDD’2014
[B] Abu-El-Haija et al, Watch Your Step: Learning Node Embeddings via Graph Attention, NeurIPS’2018
[D] Abu-El-Haija et al, Learning Edge Representations via Low-Rank Asymmetric Projections, CIKM’2017
[E] Lee, Abu-El-Haija, Varadarajan, Natsev, Collaborative Deep Metric Learning for Video Understanding, KDD’2018
17. Overview: Graph Embedding
[Figure: the same graph, with a random walk highlighted]
Random Walk Sequences, e.g.:
v3 → v5 → v9 → v11 → v5 → …
The sequences are fed to the word2vec algorithm.
18. Review: Embedding via Random Walks
● Word2vec learns node embeddings by stochastically moving the embedding of an anchor node closer to the embedding of a neighboring context node.
v3 → v5 → v9 → v11 → v5 → ...
[Figure: random-walk sequences (left) and the embedding space Y over axes x, y (right); a stochastic update pulls the anchor node's embedding toward the context node's]
Mikolov et al. Distributed Representations of Words and Phrases and their Compositionality. NIPS 2013
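The walk-generation half of this pipeline can be sketched directly; a minimal Python illustration on a made-up toy graph (the skip-gram training itself is assumed to come from an off-the-shelf word2vec implementation and is not shown):

```python
import random

def random_walks(adj, walks_per_node=10, walk_len=5, seed=0):
    """Generate DeepWalk-style random-walk sequences from an
    adjacency list {node: [neighbors]}."""
    rng = random.Random(seed)
    walks = []
    for start in adj:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_len:
                nbrs = adj[walk[-1]]
                if not nbrs:
                    break
                walk.append(rng.choice(nbrs))
            walks.append(walk)
    return walks

# Toy undirected graph echoing the slide's example.
adj = {'v3': ['v5'], 'v5': ['v3', 'v9', 'v11'],
       'v9': ['v5', 'v11'], 'v11': ['v5', 'v9']}
walks = random_walks(adj)
# Each walk is a sequence like ['v3', 'v5', 'v9', ...], ready to be
# fed to word2vec as a "sentence" of node tokens.
```

Many random walks give many (long) sequences, which word2vec then treats exactly like sentences of words.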
22. High-Level View of Various Graph Algorithms
● Option (Graph Embeddings)
● Option (Graph Regularization)
● Option (Graph Convolution ⊂ Message Passing)
23. Overview: Graph Regularization
[Figure: graph with nodes v1–v11; node features x6, x11 are fed through a shared model fΘ : X → Y]
Objective for one labeled, connected pair (v6, v11):
minΘ λ ‖fΘ(x6) − fΘ(x11)‖²ℓ2 − y6 log fΘ(x6) − y11 log fΘ(x11)
Overall objective:
minΘ Σi,j [ λ Ai,j ‖f(xi) − f(xj)‖²ℓ2 − yi log f(xi) − yj log f(xj) ]
[4] Belkin et al, Manifold regularization: A geometric framework for learning from labeled and unlabeled examples, JMLR’2006
[5] Bui et al, Neural Graph Machines, Arxiv’17
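The overall objective above is easy to sketch in numpy. This is a minimal, made-up illustration (toy graph, toy probabilities, hypothetical λ), not the code of [4] or [5]: cross-entropy on the labeled nodes plus λ times the edge-weighted smoothness term.

```python
import numpy as np

def graph_reg_loss(A, probs, labels, lam=0.1):
    """Graph-regularized loss: cross-entropy on labeled nodes plus
    lam * sum_ij A_ij * ||f(x_i) - f(x_j)||^2.
    probs: (n, c) predicted class probabilities f(x_i);
    labels: dict {node index: one-hot label vector}."""
    ce = 0.0
    for i, y in labels.items():
        ce -= float(y @ np.log(probs[i]))
    diffs = probs[:, None, :] - probs[None, :, :]     # (n, n, c) pairwise f(x_i)-f(x_j)
    smooth = np.sum(A * np.sum(diffs ** 2, axis=-1))  # sum_ij A_ij ||.||^2
    return ce + lam * smooth

# Toy example: 3 nodes on a path, 2 classes, one labeled node.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.5, 0.5]])
labels = {0: np.array([1.0, 0.0])}
loss = graph_reg_loss(A, probs, labels)
```

The smoothness term pushes predictions of connected nodes together even when only a few nodes carry labels, which is the point of graph regularization.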
26. High-Level View of Various Graph Algorithms
● Option (Graph Embeddings)
● Option (Graph Regularization)
● Option (Graph Convolution ⊂ Message Passing)
27. Overview: Message Passing
The first neural network on graph data (that I am aware of):
[6] Scarselli et al. The graph neural network model. IEEE Transactions on Neural Networks’2009
30. Watch Your Step (Node Embedding Method)
● Watch Your Step learns the context distribution while learning the embeddings.
● Shortcoming of DeepWalk / node2vec: they use a fixed context distribution, controlled by the context-window-size hyperparameter C, yet different graphs prefer different values of C:
[B] Abu-El-Haija et al, Watch Your Step: Learning Node Embeddings via Graph Attention, NeurIPS’2018
31. WatchYourStep (WYS): Derivation
● DeepWalk / node2vec implicitly factorize fixed random-walk co-occurrence statistics into a low-rank L × Rᵀ.
● WYS instead factorizes the expected walk co-occurrences, additionally training Q, the “context distribution”.
[B] Abu-El-Haija et al, Watch Your Step: Learning Node Embeddings via Graph Attention, NeurIPS’2018
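The matrix WYS factorizes can be illustrated with a simplified sketch, based on my reading of [B] (not the authors' code): an expectation over walk co-occurrences of the form Σ_k q_k P^k, where P is the row-normalized transition matrix and q is the context distribution. In WYS, q is trainable; here it is a fixed, made-up vector, and a fixed q recovers DeepWalk-style hard-coded context weights.

```python
import numpy as np

def expected_walk_matrix(A, q):
    """Approximate expected co-occurrences: sum_k q_k * P^k, with P
    the row-normalized transition matrix of adjacency A. In WYS the
    vector q (the context distribution) is learned; here it is fixed
    for illustration."""
    P = A / A.sum(axis=1, keepdims=True)
    E = np.zeros_like(P)
    Pk = np.eye(len(A))
    for qk in q:                 # k = 1 .. len(q)
        Pk = Pk @ P
        E += qk * Pk
    return E

# Toy triangle graph; q puts most mass on short walk distances.
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
q = np.array([0.5, 0.3, 0.2])    # sums to 1; trainable in WYS
E = expected_walk_matrix(A, q)
```

Factorizing E (rather than a matrix built with fixed walk-length weights) is what lets WYS adapt the effective context window to each graph.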
32. WYS Results: Link Prediction
[B] Abu-El-Haija et al, Watch Your Step: Learning Node Embeddings via Graph Attention, NeurIPS’2018
33. WYS Results: Node Classification & t-SNE plot
[B] Abu-El-Haija et al, Watch Your Step: Learning Node Embeddings via Graph Attention, NeurIPS’2018
34. WYS Experiments: What does Q learn?
● Q learns a different distribution for each graph!
● The learned distributions correspond to a manual sweep of node2vec:
[B] Abu-El-Haija et al, Watch Your Step: Learning Node Embeddings via Graph Attention, NeurIPS’2018
37. Recall: Image Convolution
● State-of-the-art on image / video / speech.
○ (segmentation, detection, classification, etc).
[Figure: a 2D (spatial) convolutional layer represents the image as a regular grid; a 4D trainable filter maps the input to output vectors, which is itself a form of message passing]
38. What are Graph Convolutions?
● There are multiple definitions, which we survey in [H].
● For now, we stick to the most popular one [7] (=[61] above).
[H] Chami, Abu-El-Haija, Perozzi, Re, Murphy, Machine Learning on Graphs: A Model and Comprehensive Taxonomy, arxiv’2020
[7] Kipf & Welling, Semi-supervised classification with graph convolutional networks, ICLR’2017
39. GCN [7] for semi-supervised node classification
[Figure: graph whose nodes carry input features x1…x6; some nodes are labeled (y2, y4)]
● Some nodes are labeled.
● Task: can we guess the labels of the unlabeled nodes?
● The input features are fed through GC Layer 1 (the first graph-convolution layer).
[7] Kipf & Welling, Semi-supervised classification with graph convolutional networks, ICLR’2017
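A single GC layer of [7] can be sketched in numpy; this is a minimal illustration on a made-up toy graph, not the reference implementation. The propagation rule is H' = ReLU(Â H W) with Â the symmetrically normalized adjacency with self-loops.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer from [7]: H' = ReLU(A_hat @ H @ W), where
    A_hat = D^{-1/2} (A + I) D^{-1/2} adds self-loops and
    symmetrically normalizes the adjacency."""
    A_tilde = A + np.eye(len(A))
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt
    return np.maximum(A_hat @ H @ W, 0.0)      # ReLU

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # path graph
H = rng.normal(size=(3, 4))   # input features x_i
W = rng.normal(size=(4, 2))   # trainable weights
H1 = gcn_layer(A, H, W)       # per-node hidden representations
```

Stacking such layers and applying a softmax cross-entropy loss on the labeled nodes only gives the semi-supervised classifier of [7].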
59. MixHop
MixHop GC Layer:
😀 Can incorporate distant nodes
😀 Can mix neighbors across distances, i.e. can learn Gabor Filters!
[C] Abu-El-Haija et al, MixHop, ICML 2019
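The "mix neighbors across distances" idea can be sketched as follows; a simplified numpy illustration of a MixHop-style layer (toy sizes and a pre-normalized adjacency are made up; see [C] for the actual architecture): concatenate σ(Â^j H W_j) over adjacency powers j.

```python
import numpy as np

def mixhop_layer(A_hat, H, Ws):
    """MixHop-style GC layer [C]: concatenate ReLU(A_hat^j @ H @ W_j)
    over powers j = 0 .. len(Ws)-1, so one layer mixes neighbors at
    different hop distances."""
    outs, Aj = [], np.eye(len(A_hat))
    for W in Ws:
        outs.append(np.maximum(Aj @ H @ W, 0.0))  # ReLU on this power
        Aj = Aj @ A_hat                           # next adjacency power
    return np.concatenate(outs, axis=1)

rng = np.random.default_rng(0)
A_hat = np.array([[0.5, 0.5, 0.0],
                  [0.5, 0.0, 0.5],
                  [0.0, 0.5, 0.5]])               # normalized adjacency
H = rng.normal(size=(3, 4))
Ws = [rng.normal(size=(4, 2)) for _ in range(3)]  # powers 0, 1, 2
out = mixhop_layer(A_hat, H, Ws)                  # concatenated features
```

Because different output columns weight different powers, the layer can learn difference operators over neighborhoods, which is what enables Gabor-like filters on graphs.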
62. [G] Markowitz* et al, Graph traversal with tensor functionals: a meta-algorithm for scalable learning, ICLR’2021
63. Goal of GTTF
● Take any Graph Learning Algorithm.
● Re-write it using “GTTF” functions (AccumulateFn and BiasFn)
● This makes the algorithm scalable to arbitrarily large graphs!
64. GTTF
[G] Markowitz* et al, Graph traversal with tensor functionals: a meta-algorithm for scalable learning, ICLR’2021
65. [G] Markowitz* et al, Graph traversal with tensor functionals: a meta-algorithm for scalable learning, ICLR’2021
66. Graph Convolution on top of GTTF
Define the GTTF functions:
Run model on sampled (rooted) Adjacency:
[G] Markowitz* et al, Graph traversal with tensor functionals: a meta-algorithm for scalable learning, ICLR’2021
67. Node Embeddings on top of GTTF
Define the accumulation function (No Bias Fn)
[G] Markowitz* et al, Graph traversal with tensor functionals: a meta-algorithm for scalable learning, ICLR’2021
69. Algorithms on top of GTTF are scalable
[G] Markowitz* et al, Graph traversal with tensor functionals: a meta-algorithm for scalable learning, ICLR’2021
70. GTTF: Scale Performance Experiments [G]
[G] Markowitz* et al, Graph traversal with tensor functionals: a meta-algorithm for scalable learning, ICLR’2021
71. GTTF: Test Metrics Experiments
[G] Markowitz* et al, Graph traversal with tensor functionals: a meta-algorithm for scalable learning, ICLR’2021
72. [J] Abu-El-Haija et al, Fast Graph Learning with Unique Optimal Solutions, ICLR’21 GTRL
73. What is SVD?
[J] Abu-El-Haija et al, Fast Graph Learning with Unique Optimal Solutions, arxiv 2021
74. We open-source a Functional SVD for TensorFlow
https://github.com/samihaija/tf-fsvd. Useful if:
● You want to run SVD on a sparse matrix in TensorFlow (our code, out of the box, provides a specialization of tf.linalg.svd to sparse matrices).
● You want to run SVD on a dense matrix M that is expensive to compute explicitly. If M is structured (e.g. a geometric sum of sparse matrices), multiplying M by vectors is much cheaper than constructing M.
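The idea behind a "functional" SVD can be illustrated without reproducing the tf-fsvd API: compute singular triplets of M while touching M only through matrix-vector products. Below is a numpy power-iteration sketch for the top triplet (the structured M, sizes, and seeds are made up; the library computes many triplets and differs in method):

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.normal(size=(40, 40)) * (rng.random((40, 40)) < 0.1)

# Structured matrix M = S + 0.5 * S @ S (a truncated geometric sum):
# multiplying M by a vector is cheap, so we never materialize M.
def matvec(v):       # computes M @ v
    return S @ v + 0.5 * (S @ (S @ v))

def rmatvec(v):      # computes M.T @ v
    return S.T @ v + 0.5 * (S.T @ (S.T @ v))

def top_singular_triplet(matvec, rmatvec, n, iters=1000, seed=1):
    """Power iteration on M^T M, using only the matvec closures."""
    v = np.random.default_rng(seed).normal(size=n)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = rmatvec(matvec(v))
        v /= np.linalg.norm(v)
    sigma = np.linalg.norm(matvec(v))
    u = matvec(v) / sigma
    return u, sigma, v

u, sigma, v = top_singular_triplet(matvec, rmatvec, 40)
```

The key design choice is that the "matrix" is just a pair of closures, so any structured M (e.g. a geometric sum of sparse matrices) plugs in without being constructed.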
75. SVD for Graph Learning
● SVD can be used as an ML technique for graphs.
○ Steps:
■ Linearize the model.
■ Make the objective function convex.
● We show this next, for two popular techniques.
81. References
[A] Abu-El-Haija et al, YouTube-8M: A Large-Scale Video Classification Benchmark, Arxiv’2016
[B] Abu-El-Haija et al, Watch Your Step: Learning Node Embeddings via Graph Attention, NeurIPS’2018
[C] Abu-El-Haija, …, Ver Steeg, Galstyan, MixHop: Higher-Order Graph Convolution, ICML’2019
[D] Abu-El-Haija et al, Learning Edge Representations via Low-Rank Asymmetric Projections, CIKM’2017
[E] Lee, Abu-El-Haija, Varadarajan, Natsev, Collaborative Deep Metric Learning for Video Understanding, KDD’2018
[F] Ge, Abu-El-Haija, Xin, Itti, Zero-shot Synthesis with Group-Supervised Learning, ICLR’2021
[G] Markowitz* et al, Graph Traversal with Tensor Functionals: A Meta-Algorithm for Scalable Learning, ICLR’2021
[H] Chami, Abu-El-Haija, Perozzi, Re, Murphy, Machine Learning on Graphs: A Model and Comprehensive Taxonomy, Arxiv’2020
[I] Abu-El-Haija et al, N-GCN: Multi-scale Graph Convolution for Semi-supervised Node Classification, UAI’2019
[J] Abu-El-Haija et al, Fast Graph Learning with Unique Optimal Solutions, Arxiv’2021
[1] Belkin & Niyogi, Laplacian Eigenmaps for Dimensionality Reduction and Data Representation, Neural Computation’2003
[2] Wang et al, Structural Deep Network Embedding, KDD’2016
[3] Perozzi et al, DeepWalk, KDD’2014
[4] Belkin et al, Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples, JMLR’2006
[5] Bui et al, Neural Graph Machines, Arxiv’2017
[6] Scarselli et al, The Graph Neural Network Model, IEEE Transactions on Neural Networks’2009
[7] Kipf & Welling, Semi-supervised Classification with Graph Convolutional Networks, ICLR’2017
[8] Daugman, Two-dimensional Spectral Analysis of Cortical Receptive Field Profiles, Vision Research’1980
[9] Daugman, Uncertainty Relation for Resolution in Space, Spatial Frequency, and Orientation Optimized by ..., JOSA’1985
[10] Lee et al, Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations, ICML’2009
[11] Krizhevsky et al, ImageNet Classification with Deep Convolutional Neural Networks, NeurIPS’2012
[12] Gordon et al, MorphNet: Fast & Simple Resource-Constrained Structure Learning of Deep Networks, CVPR’2018
83. MixHop Sparsification
● We add a group-Lasso (L2) regularization to drop out columns of the feature matrices, similar to [12].
● The 2nd layer on Cora drops out the zeroth power completely.
[images are rotated for space]
[12] Gordon et al, MorphNet: Fast & Simple Resource-Constrained Structure Learning of Deep Networks, CVPR’2018
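A column-wise group-Lasso penalty is simple to sketch; this is a minimal numpy illustration (the matrix, λ, and function name are made up, and the actual MixHop/MorphNet implementations differ): sum the L2 norms of the columns, so whole columns are driven to exactly zero and effectively dropped.

```python
import numpy as np

def group_lasso_penalty(W, lam=0.01):
    """Group-Lasso regularizer over columns of a weight/feature
    matrix: lam * sum_c ||W[:, c]||_2. Columns whose L2 norm is
    driven to zero are dropped entirely, sparsifying the layer
    (similar in spirit to MorphNet [12])."""
    return lam * np.sum(np.linalg.norm(W, axis=0))

W = np.array([[1.0, 0.0, 3.0],
              [2.0, 0.0, 4.0]])
penalty = group_lasso_penalty(W, lam=1.0)
# Column norms: sqrt(5), 0, 5 -> penalty = sqrt(5) + 5
```

Unlike an elementwise L1 penalty, the group structure zeroes coordinated blocks, which is what lets a whole adjacency power (e.g. the zeroth power on Cora) be pruned.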
87. MixHop Results on (Synthetic) Homophily Datasets
● With less homophily, our performance gap increases.
● With less homophily, our method learns more feature differences (i.e. Gabor-like filters).
90. Ad: Message Passing for Zero-Shot Synthesis
[F] Ge, Abu-El-Haija, Xin, Itti, Zero-shot Synthesis with Group-Supervised Learning, ICLR’2021
● Graph of semantic similarity between training samples.
● We can develop an auto-encoder with a disentangled feature space.
● If two samples share an attribute value (per graph edge), they must prove it:
Editor's Notes
Data structure that can represent entities and their relationships.
Many random walks == Many (long) Sequences
Current embedding
Sample context node, within distance from anchor node.