A Generalization of Transformer Networks to Graphs
1. Ho-Beom Kim
Network Science Lab
Dept. of Mathematics
The Catholic University of Korea
E-mail: hobeom2001@catholic.ac.kr
2023 / 10 / 30
Dwivedi, Vijay Prakash; Bresson, Xavier
AAAI 2021
2.
Introduction
Problem Statements
• How should a model deal with the sparsity characteristic of graphs?
• Which positional encoding technique should be used to represent a node's position in the graph?
3.
Introduction
Contribution
1. They find that the most fruitful ideas from the transformer literature in NLP can be applied in a more efficient way, and posit that sparsity and positional encodings are two key aspects in the development of a Graph Transformer.
2. They put forward a generalization of transformer networks to homogeneous graphs of arbitrary structure, namely the Graph Transformer, and an extended version of the Graph Transformer with edge features that allows the usage of explicit domain information as edge features.
3. Their method includes an elegant way to fuse node positional features using Laplacian eigenvectors for graph datasets, inspired by the heavy usage of positional encodings in NLP transformer models and recent research on node positional features in GNNs.
4. Their experiments demonstrate that the proposed model surpasses baseline isotropic and anisotropic GNNs.
4.
Methodology
Graph Sparsity
• Sparsity can be a strong inductive bias when learning on graphs.
• A sparse graph is one with relatively few edges between its nodes.
• In the original transformer, attention connects every token to every other token, so a sentence becomes a 'fully connected graph of words'.
• This is because meaningful sparse interactions or connections between the words of a sentence are difficult to identify in advance, so attending to all pairs is a reasonable default.
• The sequences handled by the NLP transformer stay within tens or hundreds of nodes when expressed as graphs, so computation is feasible even when fully connected.
• Graph datasets, in contrast, come with arbitrary connectivity structures (i.e. edges), and the number of nodes can reach millions or billions.
• Using the full attention of existing transformers on such graphs would therefore be impossible or very inefficient, so the Graph Transformer restricts attention to local neighborhoods (see the worked count below).
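To make the cost concrete, a back-of-the-envelope count (the node and degree figures are illustrative, not from the paper): full attention over $N = 10^6$ nodes requires $N^2 = 10^{12}$ pairwise attention scores, whereas attention restricted to neighborhoods with average degree $\bar{d} = 10$ requires only $N\bar{d} = 10^7$.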
6.
Methodology
Positional Encodings
• The transformer in NLP uses positional encodings, which convey where each word token is located in the overall sequence and how far apart words are from one another.
• Until now, GNNs in graph learning have learned structural information about nodes that is invariant to a node's position in the graph.
• Each node attends to its local node neighbors.
• Laplacian eigenvectors are used for positional encoding.
The symmetric normalized Laplacian is factorized as

$$\Delta = I - D^{-1/2} A D^{-1/2} = U^T \Lambda U,$$

where $\Lambda$ are the eigenvalues and $U$ the eigenvectors; the $k$ smallest non-trivial eigenvectors of each node form its positional encoding $\lambda_i$.
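A minimal NumPy sketch of how these encodings can be computed (not the authors' code; the function name laplacian_pe is ours):

```python
import numpy as np

def laplacian_pe(A: np.ndarray, k: int) -> np.ndarray:
    """k-dim positional encodings: eigenvectors of the symmetric
    normalized Laplacian with the smallest non-trivial eigenvalues."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))        # assumes no isolated nodes
    lap = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    eigval, eigvec = np.linalg.eigh(lap)             # eigenvalues in ascending order
    return eigvec[:, 1:k + 1]                        # drop the trivial 0-eigenvector

# Toy usage: a 4-cycle graph with 2-dimensional encodings.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
pe = laplacian_pe(A, k=2)   # shape (4, 2)
```

Note that eigenvector signs are arbitrary; the paper randomly flips them during training so the model becomes robust to this ambiguity.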
7.
Methodology
Graph Transformer Architecture - Input
node feature α_i, edge feature β_ij
• The node and edge features are embedded into d dimensions via linear projections, $h_i^0 = A^0 \alpha_i + a^0$ and $e_{ij}^0 = B^0 \beta_{ij} + b^0$; the positional encoding $\lambda_i$ is also embedded in d dimensions through a linear projection.
• It is then added to the node feature. Positional encodings are added only to the node features of the input layer.
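A hedged PyTorch sketch of this input embedding (class and parameter names are illustrative, not from the reference code):

```python
import torch.nn as nn

class GraphTransformerInput(nn.Module):
    def __init__(self, node_dim: int, edge_dim: int, pe_dim: int, d: int):
        super().__init__()
        self.node_proj = nn.Linear(node_dim, d)   # h_i^0 = A^0 alpha_i + a^0
        self.edge_proj = nn.Linear(edge_dim, d)   # e_ij^0 = B^0 beta_ij + b^0
        self.pe_proj = nn.Linear(pe_dim, d)       # lambda_i^0 = C^0 lambda_i + c^0

    def forward(self, alpha, beta, lam):
        h0 = self.node_proj(alpha) + self.pe_proj(lam)  # PE added only at the input layer
        e0 = self.edge_proj(beta)
        return h0, e0
```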
8.
Methodology
Graph Transformer Architecture – Graph Transformer Layer
The attention output for node $i$ at layer $\ell$ is

$$\hat{h}_i^{\ell+1} = O^{\ell} \,\big\Vert_{k=1}^{H} \Big( \sum_{j \in \mathcal{N}_i} w_{ij}^{k,\ell} \, V^{k,\ell} h_j^{\ell} \Big), \qquad w_{ij}^{k,\ell} = \operatorname{softmax}_j \Big( \frac{Q^{k,\ell} h_i^{\ell} \cdot K^{k,\ell} h_j^{\ell}}{\sqrt{d_k}} \Big),$$

where $\Vert$ denotes concatenation over the $H$ attention heads and $\mathcal{N}_i$ is the neighborhood of node $i$.
The attention output then passes through a Feed Forward Network, surrounded by residual connections and normalization layers; normalization uses either LayerNorm or BatchNorm.
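A simplified single-head PyTorch sketch of such a layer (dense 0/1 adjacency mask, LayerNorm variant; illustrative, not the reference implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphTransformerLayer(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.q, self.k, self.v, self.o = (nn.Linear(d, d) for _ in range(4))
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)  # paper also evaluates BatchNorm
        self.ffn = nn.Sequential(nn.Linear(d, 2 * d), nn.ReLU(), nn.Linear(2 * d, d))

    def forward(self, h: torch.Tensor, adj: torch.Tensor):
        # h: (N, d) node features; adj: (N, N) 0/1 mask; every node is
        # assumed to have at least one neighbor (e.g. via self-loops).
        scores = self.q(h) @ self.k(h).T / h.shape[-1] ** 0.5
        scores = scores.masked_fill(adj == 0, float('-inf'))   # attend only to j in N(i)
        h = self.norm1(h + self.o(F.softmax(scores, dim=-1) @ self.v(h)))  # attention + residual + norm
        return self.norm2(h + self.ffn(h))                     # FFN + residual + norm
```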
10.
Methodology
Graph Transformer Architecture – Graph Transformer Layer with edge features
The outputs $\hat{h}_i^{\ell+1}$ and $\hat{e}_{ij}^{\ell+1}$ are then passed to separate Feed Forward Networks, preceded and succeeded by residual connections and normalization layers.
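A hedged sketch of how edge features can enter the attention scores in this variant (function and argument names are illustrative): the pre-softmax scores are modulated elementwise by a projected edge feature, and those modulated scores also serve as the updated edge representation $\hat{e}_{ij}$.

```python
import torch
import torch.nn.functional as F

def edge_attention(q, k, e_proj, adj):
    """q, k: (N, d) projected node features; e_proj: (N, N, d) projected
    edge features; adj: (N, N) 0/1 mask. Returns attention weights and
    the pre-softmax scores reused as updated edge features."""
    d = q.shape[-1]
    scores = (q.unsqueeze(1) * k.unsqueeze(0)) / d ** 0.5  # (N, N, d) per-dimension scores
    scores = scores * e_proj                               # elementwise edge modulation
    e_hat = scores                                         # becomes \hat{e}_{ij}^{l+1}
    logits = scores.sum(-1).masked_fill(adj == 0, float('-inf'))
    w = F.softmax(logits, dim=-1)                          # attention weights over neighbors
    return w, e_hat
```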
11.
Experiments
Datasets
• ZINC – Graph Regression
• ZINC is a molecular dataset where the task is graph property regression for constrained solubility.
• PATTERN – Node Classification
• PATTERN is a node classification dataset generated using Stochastic Block Models (SBM).
• CLUSTER – Node Classification
• CLUSTER is also a synthetic dataset generated with the SBM.
14.
Conclusion
Conclusion / Future works
• Their experiments consistently showed that using Laplacian eigenvectors as node positional encodings, and batch normalization in place of layer normalization around the transformer feed-forward layers, enhanced the transformer universally across all experiments.
• In future work, they are interested in building upon the Graph Transformer along aspects such as efficient training on single large graphs and applicability to heterogeneous domains, and in performing efficient graph representation learning that takes recent innovations in graph inductive biases into account.