CLUSTER-GCN: AN EFFICIENT ALGORITHM FOR TRAINING DEEP AND LARGE GRAPH CONVOLUTIONAL NETWORKS
KDD'2019
VJAI Paper Reading festival #3
2019/8/18
Presented by: Dat Nguyen
1
OUTLINE
1. Background
What is GCN?
Applications of GCN
2. Introduction
3. Problems of current methods
4. Cluster-GCN
Vanilla Cluster-GCN
Stochastic Multiple Partitions
Training deeper GCNs
5. Experiments & Results
6. Conclusion
2
VANILLA NEURAL NETWORK RECAP
Consider a neural network with $L$ layers (no bias):
Number of neurons at the $l$-th layer: $F_l$
Features at the $l$-th layer: $X^{(l)} \in \mathbb{R}^{1 \times F_l}$
Parameters at the $l$-th layer: $W^{(l)} \in \mathbb{R}^{F_l \times F_{l+1}}$
Feature vectors are transformed layer by layer:
$$Z^{(l+1)} = X^{(l)} W^{(l)}, \qquad X^{(l+1)} = \sigma(Z^{(l+1)})$$
where $\sigma$ is an activation function.
Training step: update the parameters $W^{(l)}$ to build an appropriate transformation.
The network transforms each input feature vector into an output feature vector independently of the others.
Question: what if we build a network that transforms multiple feature vectors at once?
4
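To make the layer-wise transformation concrete, here is a minimal NumPy sketch of the forward pass $X^{(l+1)} = \sigma(X^{(l)} W^{(l)})$; the layer sizes, random weights, and the ReLU activation are illustrative choices, not taken from the slides.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def mlp_forward(x, weights):
    """Forward pass of an L-layer network without bias.

    x       : (1, F_0) input feature vector
    weights : list of (F_l, F_{l+1}) matrices W^(0), ..., W^(L-1)
    """
    for W in weights:
        z = x @ W          # Z^(l+1) = X^(l) W^(l)
        x = relu(z)        # X^(l+1) = sigma(Z^(l+1))
    return x

# Illustrative dimensions: one input vector, layers of size 8 -> 16 -> 4
rng = np.random.default_rng(0)
weights = [rng.normal(size=(8, 16)), rng.normal(size=(16, 4))]
x0 = rng.normal(size=(1, 8))
print(mlp_forward(x0, weights).shape)  # (1, 4)
```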
THE IDEA OF GRAPH CONVOLUTIONAL NETWORKS (GCN)
First, change the feature vector $X^{(l)} \in \mathbb{R}^{1 \times F_l}$ into a list of $N$ feature vectors (a feature matrix) $X^{(l)} \in \mathbb{R}^{N \times F_l}$.
So, at each layer $l$, we have $N$ nodes, and the $i$-th node is represented by the feature vector $X^{(l)}[i]$.
One more idea: take the relationships between nodes into account.
In a graph, one way to represent the relationships between nodes is the adjacency matrix $A \in \mathbb{R}^{N \times N}$.
The transformation at layer $l$ becomes $Z^{(l+1)} = f(A, X^{(l)}, W^{(l)})$.
One way to define $f$: $f(A, X^{(l)}, W^{(l)}) = A X^{(l)} W^{(l)}$.
So we have:
$$Z^{(l+1)} = A X^{(l)} W^{(l)}, \qquad X^{(l+1)} = \sigma(Z^{(l+1)})$$
Intuition: accumulate features from the neighbors before applying the transformation.
In practice, $A$ is normalized in some way before use, e.g., augmented with self-loops and divided by the degree matrix.
5
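A minimal sketch of one graph-convolution layer $X^{(l+1)} = \sigma(A' X^{(l)} W^{(l)})$. It uses the common renormalization $A' = \tilde{D}^{-1/2}(A + I)\tilde{D}^{-1/2}$ (with $\tilde{D}$ the degree matrix of $A + I$) as one example of "normalizing $A$ before use"; the tiny 4-node graph and the dimensions are made up for illustration.

```python
import numpy as np

def normalize_adjacency(A):
    """A' = D~^{-1/2} (A + I) D~^{-1/2}: add self-loops, then normalize by the resulting degrees."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def gcn_layer(A_norm, X, W):
    """Z^(l+1) = A' X^(l) W^(l),  X^(l+1) = ReLU(Z^(l+1))."""
    return np.maximum(A_norm @ X @ W, 0.0)

# Toy 4-node path graph: 0-1, 1-2, 2-3
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))   # N=4 nodes, F_l=3 features
W = rng.normal(size=(3, 2))   # F_l=3 -> F_{l+1}=2
print(gcn_layer(normalize_adjacency(A), X, W).shape)  # (4, 2)
```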
CONNECTION BETWEEN GCN AND CONVOLUTIONAL NETWORK
Convolution on a 2D image (left):
The convolution operator takes a weighted sum of neighboring values
Fixed neighbors, determined by the kernel size
Graph convolution with GCN (right):
The feature of a node is an accumulation of its neighbors' features
Variable neighbors, determined by the graph
Credit: A Comprehensive Survey on Graph Neural Networks
6
OTHER VARIANTS OF GRAPH-BASED NEURAL NETWORKS
GCN is only one variant of Graph Neural Networks (GNN).
Other variants of GNN:
Recurrent Graph Neural Networks
Graph Autoencoders
Graph Attention Network (GAT)
7
APPLICATIONS OF GNN
GNN can be applied to many graph-based applications:
Recommender systems
Social network analysis
Chemistry
Traffic
Computer Vision
NLP...
8
INTRODUCTION
The computational cost of current SGD-based algorithms grows exponentially with the number of layers.
Large space requirement for keeping the entire graph and the embedding of every node in memory.
Propose Cluster-GCN [1], which exploits the graph clustering structure:
Samples a block of nodes associated with a dense subgraph identified by a graph clustering algorithm
Restricts the neighborhood search within this subgraph
Cluster-GCN significantly improves memory and computational efficiency, which allows training much deeper GCNs without much time and memory overhead.
A 5-layer Cluster-GCN achieves a state-of-the-art test F1 score of 99.36 on the PPI dataset (the previous best result was 98.71).
[Paper], [Code]
10
DEFINITION
Given a graph $G = (\mathcal{V}, \mathcal{E}, A)$ with $N = |\mathcal{V}|$ vertices and $|\mathcal{E}|$ edges.
Adjacency matrix $A \in \mathbb{R}^{N \times N}$, where entry $(i, j)$ is 1 if there is an edge between $i$ and $j$, and 0 otherwise.
Feature matrix of the $N$ nodes: $X \in \mathbb{R}^{N \times F}$.
An $L$-layer GCN is defined by:
$$Z^{(l+1)} = A' X^{(l)} W^{(l)}, \qquad X^{(l+1)} = \sigma(Z^{(l+1)})$$
where $X^{(l)} \in \mathbb{R}^{N \times F_l}$ ($X^{(0)} = X$) and $A'$ is the normalized and regularized version of $A$.
Feature transformation matrix: $W^{(l)} \in \mathbb{R}^{F_l \times F_{l+1}}$.
For simplicity, assume that all layers have the same feature dimension: $F_1 = \cdots = F_L = F$.
In the semi-supervised node classification problem, learn the weight matrices by minimizing the loss over the labeled nodes $\mathcal{Y}_L$:
$$\mathcal{L} = \frac{1}{|\mathcal{Y}_L|} \sum_{i \in \mathcal{Y}_L} \mathrm{loss}(y_i, z_i^{(L)})$$
In practice, a cross-entropy loss is commonly used for node classification in multi-class or multi-label problems.
11
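Putting the definition together, the sketch below stacks $L$ such layers and trains with a cross-entropy loss over the labeled nodes. This is a generic full-batch PyTorch formulation, not the authors' implementation; the sizes, the identity stand-in for $A'$, and the labeled-node mask are all illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCN(nn.Module):
    """L-layer GCN: X^(l+1) = sigma(A' X^(l) W^(l)), no bias for simplicity."""
    def __init__(self, in_dim, hidden_dim, num_classes, num_layers):
        super().__init__()
        dims = [in_dim] + [hidden_dim] * (num_layers - 1) + [num_classes]
        self.weights = nn.ParameterList(
            [nn.Parameter(torch.randn(dims[l], dims[l + 1]) * 0.1)
             for l in range(num_layers)]
        )

    def forward(self, A_norm, X):
        H = X
        for l, W in enumerate(self.weights):
            H = A_norm @ H @ W           # Z^(l+1) = A' X^(l) W^(l)
            if l < len(self.weights) - 1:
                H = F.relu(H)            # X^(l+1) = sigma(Z^(l+1))
        return H                         # Z^(L): one logit vector per node

# Illustrative data: N=4 nodes, F=3 features, 2 classes, 2 labeled nodes
A_norm = torch.eye(4)            # stand-in for the normalized adjacency A'
X = torch.randn(4, 3)
labels = torch.tensor([0, 1, 0, 1])
labeled = torch.tensor([0, 2])   # indices of the labeled nodes

model = GCN(in_dim=3, hidden_dim=8, num_classes=2, num_layers=2)
logits = model(A_norm, X)
loss = F.cross_entropy(logits[labeled], labels[labeled])  # mean loss over labeled nodes
loss.backward()
```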
FULL-BATCH GRADIENT DESCENT (FROM [2])
Stores all the embedding matrices $\{Z^{(l)}\}_{l=1}^{L}$
→ memory problem
Updates the model only once per epoch
→ requires more epochs to converge
13
MINI-BATCH SGD (FROM [3])
Updates the model for each batch of nodes
Significant computational overhead due to the neighborhood expansion problem
Converges faster in terms of epochs, but with a much slower per-epoch training time
Embedding utilization:
If node $i$'s embedding $z_i^{(l)}$ at the $l$-th layer is computed and is reused $u$ times for the embedding computations at layer $l+1$, then the embedding utilization of $z_i^{(l)}$ is $u$.
Embedding utilization $u$ is small because the graph is usually large and sparse.
14
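The neighborhood expansion problem can be seen by counting how many node embeddings an $L$-layer GCN must compute to obtain the final embeddings of a small batch. A rough sketch on a synthetic sparse random graph (the graph size, average degree, and batch size are made-up parameters):

```python
import numpy as np
import scipy.sparse as sp

def required_nodes_per_layer(A, batch, num_layers):
    """Count how many nodes' embeddings are needed at each hop to compute
    the final embeddings of `batch` (the neighborhood expansion problem)."""
    A = A.tocsr()
    frontier = set(int(i) for i in batch)
    sizes = [len(frontier)]
    for _ in range(num_layers):
        expanded = set(frontier)
        for i in frontier:
            expanded.update(A.indices[A.indptr[i]:A.indptr[i + 1]].tolist())
        frontier = expanded
        sizes.append(len(frontier))
    return sizes

# Synthetic sparse random graph: 10,000 nodes, roughly 20 neighbors per node
rng = np.random.default_rng(0)
N, m = 10_000, 100_000
A = sp.coo_matrix((np.ones(m), (rng.integers(0, N, m), rng.integers(0, N, m))), shape=(N, N))
A = ((A + A.T) > 0).astype(np.int8)

batch = rng.choice(N, size=32, replace=False)
print(required_nodes_per_layer(A, batch, num_layers=3))
# The count grows rapidly with each extra layer, so per-batch work explodes with depth.
```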
VR-GCN (FROM [4]) (SOTA)
Reduces the number of sampled neighbors
Requires storing all the intermediate embeddings
15
KEY IDEA
Design a batch and the corresponding computation subgraph to maximize the embedding utilization.
Use the same set of nodes $\mathcal{B}$ for all layers $1$ to $L$, with the subgraph $A_{\mathcal{B},\mathcal{B}}$ (the links within $\mathcal{B}$).
The embedding utilization is then $\|A_{\mathcal{B},\mathcal{B}}\|_0$, the number of edges within this batch.
Maximize the embedding utilization by maximizing the number of within-batch edges.
The efficiency of SGD updates now relates to a graph clustering algorithm.
17
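To make the within-batch-edge criterion concrete, the sketch below counts $\|A_{\mathcal{B},\mathcal{B}}\|_0$ for a batch formed from one cluster versus a random batch of the same size; the toy two-community graph and its densities are fabricated for illustration.

```python
import numpy as np
import scipy.sparse as sp

def within_batch_edges(A, batch):
    """Embedding utilization proxy: number of nonzeros in A[B, B]."""
    return A[batch][:, batch].nnz

# Toy graph: two dense communities of 50 nodes each, with few cross-community links
rng = np.random.default_rng(0)
dense1 = sp.random(50, 50, density=0.2, random_state=1)
dense2 = sp.random(50, 50, density=0.2, random_state=2)
cross = sp.random(50, 50, density=0.01, random_state=3)
A = sp.bmat([[dense1, cross], [cross.T, dense2]]).tocsr()
A = ((A + A.T) > 0).astype(np.int8)

cluster_batch = np.arange(50)                          # all nodes of community 0
random_batch = rng.choice(100, size=50, replace=False)
print(within_batch_edges(A, cluster_batch))            # many within-batch edges
print(within_batch_edges(A, random_batch))             # fewer within-batch edges
```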
GRAPH PARTITIONING
Partition the graph $G$ into $c$ groups of nodes: $\mathcal{V} = [\mathcal{V}_1, \cdots, \mathcal{V}_c]$, where $\mathcal{V}_t$ consists of the nodes in the $t$-th partition.
This gives $c$ subgraphs $\bar{G} = [G_1, \cdots, G_c] = [\{\mathcal{V}_1, \mathcal{E}_1\}, \cdots, \{\mathcal{V}_c, \mathcal{E}_c\}]$, where $\mathcal{E}_t$ only consists of the links between nodes in $\mathcal{V}_t$.
The adjacency matrix $A$ is partitioned into $c^2$ submatrices:
$$A = \bar{A} + \Delta = \begin{bmatrix} A_{11} & \cdots & A_{1c} \\ \vdots & \ddots & \vdots \\ A_{c1} & \cdots & A_{cc} \end{bmatrix}$$
where
$$\bar{A} = \begin{bmatrix} A_{11} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & A_{cc} \end{bmatrix}, \qquad \Delta = \begin{bmatrix} 0 & \cdots & A_{1c} \\ \vdots & \ddots & \vdots \\ A_{c1} & \cdots & 0 \end{bmatrix}$$
Also partition the features $X$ and the training labels $Y$ according to $[\mathcal{V}_1, \cdots, \mathcal{V}_c]$ as $[X_1, \cdots, X_c]$ and $[Y_1, \cdots, Y_c]$.
18
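A minimal sketch of the block decomposition $A = \bar{A} + \Delta$ given a cluster assignment. The paper uses METIS for the actual clustering; here a hand-made assignment array stands in for it.

```python
import numpy as np
import scipy.sparse as sp

def split_adjacency(A, clusters):
    """Split A into the block-diagonal part A_bar (within-cluster links)
    and Delta (between-cluster links), given a cluster id per node."""
    A = A.tocoo()
    same = clusters[A.row] == clusters[A.col]
    A_bar = sp.coo_matrix((A.data[same], (A.row[same], A.col[same])), shape=A.shape)
    Delta = sp.coo_matrix((A.data[~same], (A.row[~same], A.col[~same])), shape=A.shape)
    return A_bar.tocsr(), Delta.tocsr()

# Toy example: 6 nodes in 2 clusters (in practice the assignment comes from METIS)
A = sp.csr_matrix(np.array([[0, 1, 1, 0, 0, 0],
                            [1, 0, 1, 0, 1, 0],
                            [1, 1, 0, 0, 0, 0],
                            [0, 0, 0, 0, 1, 1],
                            [0, 1, 0, 1, 0, 1],
                            [0, 0, 0, 1, 1, 0]]))
clusters = np.array([0, 0, 0, 1, 1, 1])
A_bar, Delta = split_adjacency(A, clusters)
print(A_bar.nnz, Delta.nnz)  # within-cluster links kept vs. between-cluster links dropped
# Features and labels are partitioned the same way, e.g. X_t = X[clusters == t]
```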
UPDATE CLUSTER-GCN
The final embedding matrix becomes:
$$Z^{(L)} = \bar{A}' \sigma\big(\bar{A}' \sigma(\cdots \sigma(\bar{A}' X W^{(0)}) W^{(1)} \cdots)\big) W^{(L-1)} = \begin{bmatrix} \bar{A}'_{11} \sigma\big(\bar{A}'_{11} \sigma(\cdots \sigma(\bar{A}'_{11} X_1 W^{(0)}) W^{(1)} \cdots)\big) W^{(L-1)} \\ \vdots \\ \bar{A}'_{cc} \sigma\big(\bar{A}'_{cc} \sigma(\cdots \sigma(\bar{A}'_{cc} X_c W^{(0)}) W^{(1)} \cdots)\big) W^{(L-1)} \end{bmatrix}$$
The loss function can also be decomposed into:
$$\mathcal{L}_{\bar{A}'} = \sum_t \frac{|\mathcal{V}_t|}{N} \mathcal{L}_{\bar{A}'_{tt}} \quad \text{and} \quad \mathcal{L}_{\bar{A}'_{tt}} = \frac{1}{|\mathcal{V}_t|} \sum_{i \in \mathcal{V}_t} \mathrm{loss}(y_i, z_i^{(L)})$$
At each step, sample a cluster $\mathcal{V}_t$ and update $\{W^{(l)}\}_{l=1}^{L}$ based on the gradient of $\mathcal{L}_{\bar{A}'_{tt}}$.
19
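The per-step update then amounts to a forward/backward pass on a single cluster's subgraph. Below is a minimal sketch of that training loop, assuming a model like the hypothetical GCN module sketched earlier and precomputed per-cluster blocks $\bar{A}'_{tt}$, $X_t$, $y_t$; the optimizer settings are illustrative, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def train_cluster_gcn(model, A_norm_blocks, X_blocks, y_blocks, epochs=10, lr=0.01):
    """Vanilla Cluster-GCN training loop: each SGD step uses one cluster's
    within-cluster adjacency block A'_tt, its features X_t, and its labels y_t."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    num_clusters = len(X_blocks)
    for _ in range(epochs):
        for t in torch.randperm(num_clusters).tolist():
            optimizer.zero_grad()
            logits = model(A_norm_blocks[t], X_blocks[t])  # forward pass on cluster t only
            loss = F.cross_entropy(logits, y_blocks[t])    # loss over the nodes in V_t
            loss.backward()                                # gradient of L_{A'_tt}
            optimizer.step()
    return model
```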
EFFICIENCY OF CLUSTER-GCN
Figure 1: The neighborhood expansion difference between
traditional graph convolution and our proposed cluster approach
Table 1: Time and space complexity of GCN training algorithms
20
STOCHASTIC MULTIPLE PARTITIONS (1)
Although Cluster-GCN achieves good computational and memory complexity, there are still two problems:
Some links (the $\Delta$ part) are removed
Graph clustering algorithms (such as METIS and Graclus) tend to bring similar nodes together → biased estimation of the full gradient
Figure 2: Histograms of label entropy within each batch using random partition vs. clustering partition. Most clustering-partitioned batches have low label entropy, while random partition gives larger label entropy although it is less efficient. (Partitioned on the Reddit dataset with 300 clusters.)
Clusters are biased towards some specific labels, which increases the variance across different batches.
21
Figure 3: The proposed stochastic multiple partitions scheme. Blocks of the same color are in the same batch.
Figure 4: Comparison of choosing one cluster (300 partitions) vs. multiple clusters (1500 partitions, q=5). (x-axis: epoch, y-axis: F1 score)
STOCHASTIC MULTIPLE PARTITIONS (2)
To build a batch $\mathcal{B}$, randomly choose $q$ clusters $\mathcal{V}_{t_1}, \mathcal{V}_{t_2}, \ldots, \mathcal{V}_{t_q}$
The nodes of the batch are $\{\mathcal{V}_{t_1} \cup \cdots \cup \mathcal{V}_{t_q}\}$
Also include the between-cluster links $\{A_{ij} \mid i, j \in \mathcal{V}_{t_1} \cup \cdots \cup \mathcal{V}_{t_q}\}$ in the batch to reduce the variance across batches
22
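A minimal sketch of forming one batch under the stochastic multiple partitions scheme: sample $q$ clusters, take the union of their nodes, and keep all links among those nodes, so the between-cluster links across the chosen clusters are retained. The random placeholder data, cluster assignment, and $q$ are illustrative.

```python
import numpy as np
import scipy.sparse as sp

def sample_batch(A, X, y, clusters, q, rng):
    """Sample q clusters and build one batch from the union of their nodes,
    keeping both within- and between-cluster links among the chosen nodes."""
    chosen = rng.choice(np.unique(clusters), size=q, replace=False)
    nodes = np.where(np.isin(clusters, chosen))[0]
    A_batch = A[nodes][:, nodes]   # subgraph induced by the sampled clusters
    return A_batch, X[nodes], y[nodes], nodes

# Illustrative usage with random placeholders (10 clusters, q = 2)
rng = np.random.default_rng(0)
N = 100
A = sp.random(N, N, density=0.05, format="csr", random_state=0)
A = ((A + A.T) > 0).astype(np.float32)
X = rng.normal(size=(N, 16)).astype(np.float32)
y = rng.integers(0, 3, size=N)
clusters = rng.integers(0, 10, size=N)
A_batch, X_batch, y_batch, nodes = sample_batch(A, X, y, clusters, q=2, rng=rng)
print(A_batch.shape, X_batch.shape)
```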
TRAINING DEEPER GCNS
Deeper models impede information from the first few layers from being passed through.
Propose
$$\tilde{A} = (D + I)^{-1}(A + I)$$
and apply the diagonal enhancement technique:
$$X^{(l+1)} = \sigma\big((\tilde{A} + \lambda\,\mathrm{diag}(\tilde{A})) X^{(l)} W^{(l)}\big)$$
23
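A minimal sketch of one layer with the diagonal-enhancement propagation $X^{(l+1)} = \sigma\big((\tilde{A} + \lambda\,\mathrm{diag}(\tilde{A})) X^{(l)} W^{(l)}\big)$, using dense NumPy for readability; $\lambda$ and the toy graph are illustrative choices.

```python
import numpy as np

def normalize_tilde(A):
    """A_tilde = (D + I)^{-1} (A + I): add self-loops, then row-normalize."""
    A_hat = A + np.eye(A.shape[0])
    return A_hat / A_hat.sum(axis=1, keepdims=True)

def diag_enhanced_layer(A_tilde, X, W, lam=1.0):
    """X^(l+1) = ReLU((A_tilde + lam * diag(A_tilde)) X^(l) W^(l))."""
    prop = A_tilde + lam * np.diag(np.diag(A_tilde))  # amplify each node's own features
    return np.maximum(prop @ X @ W, 0.0)

# Toy example: 3-node path graph, F_l = 4 -> F_{l+1} = 2
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
W = rng.normal(size=(4, 2))
print(diag_enhanced_layer(normalize_tilde(A), X, W, lam=1.0).shape)  # (3, 2)
```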
CLUSTER-GCN ALGORITHM
24
Table 3: Data statistics
Table 4: The parameters used in the experiments
EXPERIMENTS
Evaluate multi-label and multi-class classification on 4 public datasets
Compare with 2 SoTA methods:
VR-GCN (from [4]): maintains historical embeddings & expands to only a few neighbors
GraphSAGE (from [3]): samples a fixed number of neighbors per node
Cluster-GCN:
Implemented with PyTorch (Google Research, why PyTorch?)
Adam optimizer, learning rate = 0.1, dropout rate = 20%, zero weight decay
Number of partitions and clusters per batch are stated in Table 4.
All the experiments are run on 1 machine with NVIDIA Tesla V100 GPU (16GB
mem), 20-core Intel Xeon CPU (2.20 GHz), and 192 GB of RAM.
26
Figure 6: Comparison of different GCN training methods. We present the relation between training time in seconds (x-axis) and the validation F1 score (y-axis).
Table 6: Benchmarking of the sparse tensor operations in PyTorch and TensorFlow. A network with two linear layers is used, and the timing includes the forward and backward operations. Numbers in the brackets indicate the size of hidden units in the first layer. Amazon data is used.
RESULTS ON MEDIUM-SIZE DATASETS - TRAINING TIME VS ACCURACY
Cluster-GCN is the fastest on both the PPI and Reddit datasets
On the Amazon data, Cluster-GCN is faster than VRGCN for the 3-layer case, but slower for the 2-layer and 4-layer cases
Defense: Table 6 (VRGCN was implemented with TensorFlow)
27
RESULTS ON MEDIUM-SIZE DATASETS - MEMORY USAGE
VRGCN needs to save historical embeddings during training → consumes more memory than Cluster-GCN
GraphSAGE also has a higher memory requirement than Cluster-GCN due to the exponential neighborhood growth
When increasing the number of layers, Cluster-GCN's memory usage does not increase much (the only extra variable introduced is the weight matrix $W^{(L)}$)
Table 5: Comparison of memory usage on different datasets. Numbers in the brackets indicate the size of hidden units used in the model.
28
AMAZON-2M
By far the largest public dataset for testing GCNs is Reddit (232,965 nodes, 11,606,919 edges)
Build a much larger dataset, Amazon2M, to test the scalability of Cluster-GCN:
2 million nodes, 61 million edges
Raw co-purchase data from Amazon-3M
Node: product; link: whether two products are purchased together
Node features: bag-of-words features from the product descriptions, reduced to 100 dimensions by PCA
Use the top-level categories as the labels for the products/nodes (Table 7)
Table 7: The most common categories in Amazon2M
29
Figure 6
Table 8: Comparison of running time, memory, and testing accuracy (F1 score) for Amazon2M
RESULTS ON AMAZON-2M
VRGCN is faster than Cluster-GCN for a 2-layer GCN but slower when one more layer is added, while achieving similar accuracy.
VRGCN uses much more memory than Cluster-GCN (5 times more in the 3-layer case), and it runs out of memory when training a 4-layer GCN.
Cluster-GCN:
Does not need much additional memory when the number of layers increases
Achieves the best accuracy with a 4-layer GCN.
30
Table 9: Comparisons of running time when
using different # of layers on PPI, 200 epochs
Table 11: Comparisons of using different diagonal enhancement
techniques on PPI. Red numbers indicate poor convergence.
TRAINING DEEPER GCNS (2)
Test GCNs with more layers
The running time of VRGCN grows exponentially with depth, while the running time of Cluster-GCN only grows linearly (Table 9).
Evaluate the diagonal enhancement techniques on the PPI dataset (Table 11):
For 2 to 5 layers, the more layers, the higher the accuracy → deeper GCNs may be useful.
When 7 or 8 layers are used, the first three methods fail to converge within 200 epochs and suffer a dramatic loss of accuracy.
31
TRAINING DEEPER GCNS WITH DIAGONAL ENHANCEMENT
Detailed convergence of an 8-layer GCN is shown in Figure 5.
All methods except the one using diagonal enhancement fail to converge.
Figure 5: Convergence of an 8-layer GCN.
(x-axis: # of epochs; y-axis: validation accuracy)
32
STATE-OF-THE-ART RESULT
For PPI, Cluster-GCN achieves the state-of-the-art result by training a 5-layer GCN
with 2048 hidden units.
For Reddit, a 4-layer GCN with 128 hidden units.
Table 10: State-of-the-art performance of testing accuracy reported in recent papers
33
CONCLUSION
Cluster-GCN is fast and memory efficient.
The method can train very deep GCNs on large-scale graphs:
With 2 million nodes, the training time is less than an hour
Uses around 2 GB of memory
Achieves an accuracy of 90.41 (F1 score)
Successfully trains much deeper GCNs, which achieve state-of-the-art test F1 scores on the PPI and Reddit datasets.
35
REFERENCES
[1] Wei-Lin Chiang et al. KDD 2019. Cluster-GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks.
[2] Thomas N. Kipf and Max Welling. ICLR 2017. Semi-Supervised Classification with Graph Convolutional Networks.
[3] William L. Hamilton, Rex Ying, and Jure Leskovec. NIPS 2017. Inductive Representation Learning on Large Graphs.
[4] Jianfei Chen, Jun Zhu, and Le Song. ICML 2018. Stochastic Training of Graph Convolutional Networks with Variance Reduction.
36
THANK YOU!
37
Q&A
38
