CLUSTER-GCN: AN EFFICIENT ALGORITHM FOR TRAINING DEEP AND LARGE GRAPH CONVOLUTIONAL NETWORKS
KDD'2019
VJAI Paper Reading festival #3
2019/8/18
Presented by: Dat Nguyen
1
OUTLINE
1. Background
What is GCN?
Applications of GCN
2. Introduction
3. Problems of current methods
4. Cluster-GCN
Vanilla Cluster-GCN
Stochastic Multiple Partitions
Training deeper GCNs
5. Experiments & Results
6. Conclusion
2
VANILLA NEURAL NETWORK RECAP
Consider a neural network with $L$ layers (no bias):
Number of neurons at the $l$-th layer: $F_l$
Features at the $l$-th layer: $X^{(l)} \in \mathbb{R}^{1 \times F_l}$
Parameters at the $l$-th layer: $W^{(l)} \in \mathbb{R}^{F_l \times F_{l+1}}$
Feature vectors are transformed layer by layer:
$$Z^{(l+1)} = X^{(l)} W^{(l)}, \qquad X^{(l+1)} = \sigma(Z^{(l+1)})$$
where $\sigma$ is an activation function.
Training step: update the parameters $W^{(l)}$ to build an appropriate transformation.
The network transforms each input feature vector into an output feature vector independently of the others.
Question: what if we build a network that transforms multiple feature vectors at once?
4
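To make the layer-wise transformation concrete, here is a minimal NumPy sketch of the forward pass $X^{(l+1)} = \sigma(X^{(l)} W^{(l)})$; the layer sizes, random weights, and the ReLU activation are illustrative choices, not taken from the slides.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def mlp_forward(x, weights):
    """Forward pass of an L-layer network without bias.

    x       : (1, F_0) input feature vector
    weights : list of (F_l, F_{l+1}) matrices W^(0), ..., W^(L-1)
    """
    for W in weights:
        z = x @ W          # Z^(l+1) = X^(l) W^(l)
        x = relu(z)        # X^(l+1) = sigma(Z^(l+1))
    return x

# Illustrative dimensions: one input vector, layers of size 8 -> 16 -> 4
rng = np.random.default_rng(0)
weights = [rng.normal(size=(8, 16)), rng.normal(size=(16, 4))]
x0 = rng.normal(size=(1, 8))
print(mlp_forward(x0, weights).shape)  # (1, 4)
```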
THE IDEA OF GRAPH CONVOLUTIONAL NETWORKS (GCN)
First, change the feature vector $X^{(l)} \in \mathbb{R}^{1 \times F_l}$ into a list of $N$ feature vectors (a feature matrix) $X^{(l)} \in \mathbb{R}^{N \times F_l}$.
So, at each layer $l$, we have $N$ nodes, and the $i$-th node is represented by the feature vector $X^{(l)}[i]$.
One more idea: take the relationships between nodes into account.
In a graph, one way to represent the relationships between nodes is the adjacency matrix $A \in \mathbb{R}^{N \times N}$.
The transformation at layer $l$ becomes $Z^{(l+1)} = f(A, X^{(l)}, W^{(l)})$.
One way to define $f$: $f(A, X^{(l)}, W^{(l)}) = A X^{(l)} W^{(l)}$.
So we have:
$$Z^{(l+1)} = A X^{(l)} W^{(l)}, \qquad X^{(l+1)} = \sigma(Z^{(l+1)})$$
Intuition: accumulate features from the neighbors before applying the transformation.
In practice, $A$ is normalized in some way before use, e.g., augmented with self-loops and divided by the degree matrix.
5
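A minimal sketch of one graph-convolution layer $X^{(l+1)} = \sigma(A' X^{(l)} W^{(l)})$. It uses the common renormalization $A' = \tilde{D}^{-1/2}(A + I)\tilde{D}^{-1/2}$ (with $\tilde{D}$ the degree matrix of $A + I$) as one example of "normalizing $A$ before use"; the tiny 4-node graph and the dimensions are made up for illustration.

```python
import numpy as np

def normalize_adjacency(A):
    """A' = D~^{-1/2} (A + I) D~^{-1/2}: add self-loops, then normalize by the resulting degrees."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def gcn_layer(A_norm, X, W):
    """Z^(l+1) = A' X^(l) W^(l),  X^(l+1) = ReLU(Z^(l+1))."""
    return np.maximum(A_norm @ X @ W, 0.0)

# Toy 4-node path graph: 0-1, 1-2, 2-3
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))   # N=4 nodes, F_l=3 features
W = rng.normal(size=(3, 2))   # F_l=3 -> F_{l+1}=2
print(gcn_layer(normalize_adjacency(A), X, W).shape)  # (4, 2)
```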
CONNECTION BETWEEN GCN AND CONVOLUTIONAL NETWORK
Convolution on a 2D image (left):
The convolution operator takes a weighted sum of neighboring values
Fixed neighbors, determined by the kernel size
Graph convolution with GCN (right):
The feature of a node is an accumulation of its neighbors' features
Variable neighbors, determined by the graph
Credit: A Comprehensive Survey on Graph Neural Networks
6
OTHER VARIANTS OF GRAPH-BASED NEURAL NETWORKS
GCN is only one variant of Graph Neural Networks (GNN).
Other variants of GNN:
Recurrent Graph Neural Networks
Graph Autoencoders
Graph Attention Network (GAT)
7
APPLICATIONS OF GNN
GNN can be applied to many graph-based applications:
Recommender systems
Social network analysis
Chemistry
Traffic
Computer Vision
NLP...
8
INTRODUCTION
The computational cost of current SGD-based algorithms grows exponentially with the number of layers.
Large space requirement for keeping the entire graph and the embedding of every node in memory.
Propose Cluster-GCN [1], which exploits the graph clustering structure:
Samples a block of nodes associated with a dense subgraph identified by a graph clustering algorithm
Restricts the neighborhood search within this subgraph
Cluster-GCN significantly improves memory and computational efficiency, which allows training much deeper GCNs without much time and memory overhead.
A 5-layer Cluster-GCN achieves a state-of-the-art test F1 score of 99.36 on the PPI dataset (the previous best result was 98.71).
[Paper], [Code]
10
DEFINITION
Given a graph $G = (\mathcal{V}, \mathcal{E}, A)$ with $N = |\mathcal{V}|$ vertices and $|\mathcal{E}|$ edges.
Adjacency matrix $A \in \mathbb{R}^{N \times N}$, where entry $(i, j)$ is 1 if there is an edge between $i$ and $j$, and 0 otherwise.
Feature matrix of the $N$ nodes: $X \in \mathbb{R}^{N \times F}$.
An $L$-layer GCN is defined by:
$$Z^{(l+1)} = A' X^{(l)} W^{(l)}, \qquad X^{(l+1)} = \sigma(Z^{(l+1)})$$
where $X^{(l)} \in \mathbb{R}^{N \times F_l}$ ($X^{(0)} = X$) and $A'$ is the normalized and regularized version of $A$.
Feature transformation matrix: $W^{(l)} \in \mathbb{R}^{F_l \times F_{l+1}}$.
For simplicity, assume that all layers have the same feature dimension: $F_1 = \cdots = F_L = F$.
In the semi-supervised node classification problem, learn the weight matrices by minimizing the loss over the labeled nodes $\mathcal{Y}_L$:
$$\mathcal{L} = \frac{1}{|\mathcal{Y}_L|} \sum_{i \in \mathcal{Y}_L} \mathrm{loss}(y_i, z_i^{(L)})$$
In practice, a cross-entropy loss is commonly used for node classification in multi-class or multi-label problems.
11
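Putting the definition together, the sketch below stacks $L$ such layers and trains with a cross-entropy loss over the labeled nodes. This is a generic full-batch PyTorch formulation, not the authors' implementation; the sizes, the identity stand-in for $A'$, and the labeled-node mask are all illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCN(nn.Module):
    """L-layer GCN: X^(l+1) = sigma(A' X^(l) W^(l)), no bias for simplicity."""
    def __init__(self, in_dim, hidden_dim, num_classes, num_layers):
        super().__init__()
        dims = [in_dim] + [hidden_dim] * (num_layers - 1) + [num_classes]
        self.weights = nn.ParameterList(
            [nn.Parameter(torch.randn(dims[l], dims[l + 1]) * 0.1)
             for l in range(num_layers)]
        )

    def forward(self, A_norm, X):
        H = X
        for l, W in enumerate(self.weights):
            H = A_norm @ H @ W           # Z^(l+1) = A' X^(l) W^(l)
            if l < len(self.weights) - 1:
                H = F.relu(H)            # X^(l+1) = sigma(Z^(l+1))
        return H                         # Z^(L): one logit vector per node

# Illustrative data: N=4 nodes, F=3 features, 2 classes, 2 labeled nodes
A_norm = torch.eye(4)            # stand-in for the normalized adjacency A'
X = torch.randn(4, 3)
labels = torch.tensor([0, 1, 0, 1])
labeled = torch.tensor([0, 2])   # indices of the labeled nodes

model = GCN(in_dim=3, hidden_dim=8, num_classes=2, num_layers=2)
logits = model(A_norm, X)
loss = F.cross_entropy(logits[labeled], labels[labeled])  # mean loss over labeled nodes
loss.backward()
```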
FULL-BATCH GRADIENT DESCENT (FROM [2])
Stores all the embedding matrices $\{Z^{(l)}\}_{l=1}^{L}$
→ memory problem
Updates the model only once per epoch
→ requires more epochs to converge
13
MINI-BATCH SGD (FROM [3])
Updates the model for each batch of nodes
Significant computational overhead due to the neighborhood expansion problem
Converges faster in terms of epochs, but with a much slower per-epoch training time
Embedding utilization:
If node $i$'s embedding $z_i^{(l)}$ at the $l$-th layer is computed and is reused $u$ times for the embedding computations at layer $l+1$, then the embedding utilization of $z_i^{(l)}$ is $u$.
Embedding utilization $u$ is small because the graph is usually large and sparse.
14
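The neighborhood expansion problem can be seen by counting how many node embeddings an $L$-layer GCN must compute to obtain the final embeddings of a small batch. A rough sketch on a synthetic sparse random graph (the graph size, average degree, and batch size are made-up parameters):

```python
import numpy as np
import scipy.sparse as sp

def required_nodes_per_layer(A, batch, num_layers):
    """Count how many nodes' embeddings are needed at each hop to compute
    the final embeddings of `batch` (the neighborhood expansion problem)."""
    A = A.tocsr()
    frontier = set(int(i) for i in batch)
    sizes = [len(frontier)]
    for _ in range(num_layers):
        expanded = set(frontier)
        for i in frontier:
            expanded.update(A.indices[A.indptr[i]:A.indptr[i + 1]].tolist())
        frontier = expanded
        sizes.append(len(frontier))
    return sizes

# Synthetic sparse random graph: 10,000 nodes, roughly 20 neighbors per node
rng = np.random.default_rng(0)
N, m = 10_000, 100_000
A = sp.coo_matrix((np.ones(m), (rng.integers(0, N, m), rng.integers(0, N, m))), shape=(N, N))
A = ((A + A.T) > 0).astype(np.int8)

batch = rng.choice(N, size=32, replace=False)
print(required_nodes_per_layer(A, batch, num_layers=3))
# The count grows rapidly with each extra layer, so per-batch work explodes with depth.
```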
VR-GCN (FROM [4]) (SOTA)
Reduces the number of sampled neighbors
Requires storing all the intermediate embeddings
15
KEY IDEA
Design a batch and the corresponding computation subgraph to maximize the embedding utilization.
Use the same set of nodes $\mathcal{B}$ for all layers $1$ to $L$, with the subgraph $A_{\mathcal{B},\mathcal{B}}$ (the links within $\mathcal{B}$).
The embedding utilization is then $\|A_{\mathcal{B},\mathcal{B}}\|_0$, the number of edges within this batch.
Maximize the embedding utilization by maximizing the number of within-batch edges.
The efficiency of SGD updates now relates to a graph clustering algorithm.
17
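To make the within-batch-edge criterion concrete, the sketch below counts $\|A_{\mathcal{B},\mathcal{B}}\|_0$ for a batch formed from one cluster versus a random batch of the same size; the toy two-community graph and its densities are fabricated for illustration.

```python
import numpy as np
import scipy.sparse as sp

def within_batch_edges(A, batch):
    """Embedding utilization proxy: number of nonzeros in A[B, B]."""
    return A[batch][:, batch].nnz

# Toy graph: two dense communities of 50 nodes each, with few cross-community links
rng = np.random.default_rng(0)
dense1 = sp.random(50, 50, density=0.2, random_state=1)
dense2 = sp.random(50, 50, density=0.2, random_state=2)
cross = sp.random(50, 50, density=0.01, random_state=3)
A = sp.bmat([[dense1, cross], [cross.T, dense2]]).tocsr()
A = ((A + A.T) > 0).astype(np.int8)

cluster_batch = np.arange(50)                          # all nodes of community 0
random_batch = rng.choice(100, size=50, replace=False)
print(within_batch_edges(A, cluster_batch))            # many within-batch edges
print(within_batch_edges(A, random_batch))             # fewer within-batch edges
```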
GRAPH PARTITIONING
Partition the graph $G$ into $c$ groups of nodes: $\mathcal{V} = [\mathcal{V}_1, \cdots, \mathcal{V}_c]$, where $\mathcal{V}_t$ consists of the nodes in the $t$-th partition.
This gives $c$ subgraphs $\bar{G} = [G_1, \cdots, G_c] = [\{\mathcal{V}_1, \mathcal{E}_1\}, \cdots, \{\mathcal{V}_c, \mathcal{E}_c\}]$, where $\mathcal{E}_t$ only consists of the links between nodes in $\mathcal{V}_t$.
The adjacency matrix $A$ is partitioned into $c^2$ submatrices:
$$A = \bar{A} + \Delta = \begin{bmatrix} A_{11} & \cdots & A_{1c} \\ \vdots & \ddots & \vdots \\ A_{c1} & \cdots & A_{cc} \end{bmatrix}$$
where
$$\bar{A} = \begin{bmatrix} A_{11} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & A_{cc} \end{bmatrix}, \qquad \Delta = \begin{bmatrix} 0 & \cdots & A_{1c} \\ \vdots & \ddots & \vdots \\ A_{c1} & \cdots & 0 \end{bmatrix}$$
Also partition the features $X$ and the training labels $Y$ according to $[\mathcal{V}_1, \cdots, \mathcal{V}_c]$ as $[X_1, \cdots, X_c]$ and $[Y_1, \cdots, Y_c]$.
18
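A minimal sketch of the block decomposition $A = \bar{A} + \Delta$ given a cluster assignment. The paper uses METIS for the actual clustering; here a hand-made assignment array stands in for it.

```python
import numpy as np
import scipy.sparse as sp

def split_adjacency(A, clusters):
    """Split A into the block-diagonal part A_bar (within-cluster links)
    and Delta (between-cluster links), given a cluster id per node."""
    A = A.tocoo()
    same = clusters[A.row] == clusters[A.col]
    A_bar = sp.coo_matrix((A.data[same], (A.row[same], A.col[same])), shape=A.shape)
    Delta = sp.coo_matrix((A.data[~same], (A.row[~same], A.col[~same])), shape=A.shape)
    return A_bar.tocsr(), Delta.tocsr()

# Toy example: 6 nodes in 2 clusters (in practice the assignment comes from METIS)
A = sp.csr_matrix(np.array([[0, 1, 1, 0, 0, 0],
                            [1, 0, 1, 0, 1, 0],
                            [1, 1, 0, 0, 0, 0],
                            [0, 0, 0, 0, 1, 1],
                            [0, 1, 0, 1, 0, 1],
                            [0, 0, 0, 1, 1, 0]]))
clusters = np.array([0, 0, 0, 1, 1, 1])
A_bar, Delta = split_adjacency(A, clusters)
print(A_bar.nnz, Delta.nnz)  # within-cluster links kept vs. between-cluster links dropped
# Features and labels are partitioned the same way, e.g. X_t = X[clusters == t]
```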
UPDATE CLUSTER-GCN
The final embedding matrix becomes:
$$Z^{(L)} = \bar{A}' \sigma\big(\bar{A}' \sigma(\cdots \sigma(\bar{A}' X W^{(0)}) W^{(1)} \cdots)\big) W^{(L-1)} = \begin{bmatrix} \bar{A}'_{11} \sigma\big(\bar{A}'_{11} \sigma(\cdots \sigma(\bar{A}'_{11} X_1 W^{(0)}) W^{(1)} \cdots)\big) W^{(L-1)} \\ \vdots \\ \bar{A}'_{cc} \sigma\big(\bar{A}'_{cc} \sigma(\cdots \sigma(\bar{A}'_{cc} X_c W^{(0)}) W^{(1)} \cdots)\big) W^{(L-1)} \end{bmatrix}$$
The loss function can also be decomposed into:
$$\mathcal{L}_{\bar{A}'} = \sum_t \frac{|\mathcal{V}_t|}{N} \mathcal{L}_{\bar{A}'_{tt}} \quad \text{and} \quad \mathcal{L}_{\bar{A}'_{tt}} = \frac{1}{|\mathcal{V}_t|} \sum_{i \in \mathcal{V}_t} \mathrm{loss}(y_i, z_i^{(L)})$$
At each step, sample a cluster $\mathcal{V}_t$ and update $\{W^{(l)}\}_{l=1}^{L}$ based on the gradient of $\mathcal{L}_{\bar{A}'_{tt}}$.
19
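The per-step update then amounts to a forward/backward pass on a single cluster's subgraph. Below is a minimal sketch of that training loop, assuming a model like the hypothetical GCN module sketched earlier and precomputed per-cluster blocks $\bar{A}'_{tt}$, $X_t$, $y_t$; the optimizer settings are illustrative, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def train_cluster_gcn(model, A_norm_blocks, X_blocks, y_blocks, epochs=10, lr=0.01):
    """Vanilla Cluster-GCN training loop: each SGD step uses one cluster's
    within-cluster adjacency block A'_tt, its features X_t, and its labels y_t."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    num_clusters = len(X_blocks)
    for _ in range(epochs):
        for t in torch.randperm(num_clusters).tolist():
            optimizer.zero_grad()
            logits = model(A_norm_blocks[t], X_blocks[t])  # forward pass on cluster t only
            loss = F.cross_entropy(logits, y_blocks[t])    # loss over the nodes in V_t
            loss.backward()                                # gradient of L_{A'_tt}
            optimizer.step()
    return model
```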
EFFICIENCY OF CLUSTER-GCN
Figure 1: The neighborhood expansion difference between
traditional graph convolution and our proposed cluster approach
Table 1: Time and space complexity of GCN training algorithms
20
STOCHASTIC MULTIPLE PARTITIONS (1)
Although Cluster-GCN achieves good computational and memory complexity, there are still two problems:
Some links (the $\Delta$ part) are removed
Graph clustering algorithms (such as METIS and Graclus) tend to bring similar nodes together → biased estimation of the full gradient
Figure 2: Histograms of label entropy within each batch using random partition vs. clustering partition. Most clustering-partitioned batches have low label entropy, while random partition gives larger label entropy although it is less efficient. (Partitioned on the Reddit dataset with 300 clusters.)
Clusters are biased towards some specific labels, which increases the variance across different batches.
21
Figure 3: The proposed stochastic multiple partitions scheme. Blocks of the same color are in the same batch.
Figure 4: Comparison of choosing one cluster (300 partitions) vs. multiple clusters (1500 partitions, q=5). (x-axis: epoch, y-axis: F1 score)
STOCHASTIC MULTIPLE PARTITIONS (2)
To build a batch $\mathcal{B}$, randomly choose $q$ clusters $\mathcal{V}_{t_1}, \mathcal{V}_{t_2}, \ldots, \mathcal{V}_{t_q}$
The nodes of the batch are $\{\mathcal{V}_{t_1} \cup \cdots \cup \mathcal{V}_{t_q}\}$
Also include the between-cluster links $\{A_{ij} \mid i, j \in \mathcal{V}_{t_1} \cup \cdots \cup \mathcal{V}_{t_q}\}$ in the batch to reduce the variance across batches
22
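A minimal sketch of forming one batch under the stochastic multiple partitions scheme: sample $q$ clusters, take the union of their nodes, and keep all links among those nodes, so the between-cluster links across the chosen clusters are retained. The random placeholder data, cluster assignment, and $q$ are illustrative.

```python
import numpy as np
import scipy.sparse as sp

def sample_batch(A, X, y, clusters, q, rng):
    """Sample q clusters and build one batch from the union of their nodes,
    keeping both within- and between-cluster links among the chosen nodes."""
    chosen = rng.choice(np.unique(clusters), size=q, replace=False)
    nodes = np.where(np.isin(clusters, chosen))[0]
    A_batch = A[nodes][:, nodes]   # subgraph induced by the sampled clusters
    return A_batch, X[nodes], y[nodes], nodes

# Illustrative usage with random placeholders (10 clusters, q = 2)
rng = np.random.default_rng(0)
N = 100
A = sp.random(N, N, density=0.05, format="csr", random_state=0)
A = ((A + A.T) > 0).astype(np.float32)
X = rng.normal(size=(N, 16)).astype(np.float32)
y = rng.integers(0, 3, size=N)
clusters = rng.integers(0, 10, size=N)
A_batch, X_batch, y_batch, nodes = sample_batch(A, X, y, clusters, q=2, rng=rng)
print(A_batch.shape, X_batch.shape)
```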
TRAINING DEEPER GCNS
Deeper models impede information from the first few layers from being passed through.
Propose
$$\tilde{A} = (D + I)^{-1}(A + I)$$
and apply the diagonal enhancement technique:
$$X^{(l+1)} = \sigma\big((\tilde{A} + \lambda\,\mathrm{diag}(\tilde{A})) X^{(l)} W^{(l)}\big)$$
23
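A minimal sketch of one layer with the diagonal-enhancement propagation $X^{(l+1)} = \sigma\big((\tilde{A} + \lambda\,\mathrm{diag}(\tilde{A})) X^{(l)} W^{(l)}\big)$, using dense NumPy for readability; $\lambda$ and the toy graph are illustrative choices.

```python
import numpy as np

def normalize_tilde(A):
    """A_tilde = (D + I)^{-1} (A + I): add self-loops, then row-normalize."""
    A_hat = A + np.eye(A.shape[0])
    return A_hat / A_hat.sum(axis=1, keepdims=True)

def diag_enhanced_layer(A_tilde, X, W, lam=1.0):
    """X^(l+1) = ReLU((A_tilde + lam * diag(A_tilde)) X^(l) W^(l))."""
    prop = A_tilde + lam * np.diag(np.diag(A_tilde))  # amplify each node's own features
    return np.maximum(prop @ X @ W, 0.0)

# Toy example: 3-node path graph, F_l = 4 -> F_{l+1} = 2
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
W = rng.normal(size=(4, 2))
print(diag_enhanced_layer(normalize_tilde(A), X, W, lam=1.0).shape)  # (3, 2)
```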
CLUSTER-GCN ALGORITHM
24
Table 3: Data statistics
Table 4: The parameters used in the experiments
EXPERIMENTS
Evaluate multi-label and multi-class classification on 4 public datasets
Compare with 2 SoTA methods:
VR-GCN (from [4]): maintains historical embeddings & expands to only a few neighbors
GraphSAGE (from [3]): samples a fixed number of neighbors per node
Cluster-GCN:
Implemented with PyTorch (Google Research, why PyTorch?)
Adam optimizer, learning rate = 0.1, dropout rate = 20%, zero weight decay
Number of partitions and clusters per batch are stated in Table 4.
All the experiments are run on 1 machine with NVIDIA Tesla V100 GPU (16GB
mem), 20-core Intel Xeon CPU (2.20 GHz), and 192 GB of RAM.
26
Figure 6: Comparison of different GCN training methods. We present the relation between training time in seconds (x-axis) and the validation F1 score (y-axis).
Table 6: Benchmarking of the sparse tensor operations in PyTorch and TensorFlow. A network with two linear layers is used, and the timing includes the forward and backward operations. Numbers in the brackets indicate the size of hidden units in the first layer. Amazon data is used.
RESULTS ON MEDIUM-SIZE DATASETS - TRAINING TIME VS ACCURACY
Cluster-GCN is the fastest on both the PPI and Reddit datasets
On the Amazon data, Cluster-GCN is faster than VRGCN for the 3-layer case, but slower for the 2-layer and 4-layer cases
Defense: Table 6 (VRGCN was implemented with TensorFlow)
27
RESULTS ON MEDIUM-SIZE DATASETS - MEMORY USAGE
VRGCN needs to save historical embeddings during training → consumes more memory than Cluster-GCN
GraphSAGE also has a higher memory requirement than Cluster-GCN due to the exponential neighborhood growth
When increasing the number of layers, Cluster-GCN's memory usage does not increase much (the only extra variable introduced is the weight matrix $W^{(L)}$)
Table 5: Comparison of memory usage on different datasets. Numbers in the brackets indicate the size of hidden units used in the model.
28
AMAZON-2M
By far the largest public dataset for testing GCNs is Reddit (232,965 nodes, 11,606,919 edges)
Build a much larger dataset, Amazon2M, to test the scalability of Cluster-GCN:
2 million nodes, 61 million edges
Raw co-purchase data from Amazon-3M
Node: product; link: whether two products are purchased together
Node features: bag-of-words features from the product descriptions, reduced to 100 dimensions by PCA
Use the top-level categories as the labels for the products/nodes (Table 7)
Table 7: The most common categories in Amazon2M
29
Figure 6
Table 8: Comparison of running time, memory, and testing accuracy (F1 score) for Amazon2M
RESULTS ON AMAZON-2M
VRGCN is faster than Cluster-GCN for a 2-layer GCN but slower when one more layer is added, while achieving similar accuracy.
VRGCN uses much more memory than Cluster-GCN (5 times more in the 3-layer case), and it runs out of memory when training a 4-layer GCN.
Cluster-GCN:
Does not need much additional memory when the number of layers increases
Achieves the best accuracy with a 4-layer GCN.
30
Table 9: Comparisons of running time when
using different # of layers on PPI, 200 epochs
Table 11: Comparisons of using different diagonal enhancement
techniques on PPI. Red numbers indicate poor convergence.
TRAINING DEEPER GCNS (2)
Test GCNs with more layers
The running time of VRGCN grows exponentially with depth, while the running time of Cluster-GCN only grows linearly (Table 9).
Evaluate the diagonal enhancement techniques on the PPI dataset (Table 11):
For 2 to 5 layers, the more layers, the higher the accuracy → deeper GCNs may be useful.
When 7 or 8 layers are used, the first three methods fail to converge within 200 epochs and suffer a dramatic loss of accuracy.
31
TRAINING DEEPER GCNS WITH DIAGONAL ENHANCEMENT
Detailed convergence of an 8-layer GCN is shown in Figure 5.
All methods except the one using diagonal enhancement fail to converge.
Figure 5: Convergence of an 8-layer GCN.
(x-axis: # of epochs; y-axis: validation accuracy)
32
STATE-OF-THE-ART RESULT
For PPI, Cluster-GCN achieves the state-of-the-art result by training a 5-layer GCN
with 2048 hidden units.
For Reddit, a 4-layer GCN with 128 hidden units.
Table 10: State-of-the-art performance of testing accuracy reported in recent papers
33
CONCLUSION
Cluster-GCN is fast and memory efficient.
The method can train very deep GCNs on large-scale graphs:
With 2 million nodes, the training time is less than an hour
Uses around 2 GB of memory
Achieves an accuracy of 90.41 (F1 score)
Successfully trains much deeper GCNs, which achieve state-of-the-art test F1 scores on the PPI and Reddit datasets.
35
REFERENCES
[1] Wei-Lin Chiang et al. KDD 2019. Cluster-GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks.
[2] Thomas N. Kipf and Max Welling. ICLR 2017. Semi-Supervised Classification with Graph Convolutional Networks.
[3] William L. Hamilton, Rex Ying, and Jure Leskovec. NIPS 2017. Inductive Representation Learning on Large Graphs.
[4] Jianfei Chen, Jun Zhu, and Le Song. ICML 2018. Stochastic Training of Graph Convolutional Networks with Variance Reduction.
36
THANK YOU!
37
Q&A
38
