Structure-Aware Transformer for Graph Representation Learning
Tien-Bach-Thanh Do
Network Science Lab
Dept. of Artificial Intelligence
The Catholic University of Korea
E-mail: osfa19730@catholic.ac.kr
2024/02/26
Dexiong Chen et al.
International Conference on Machine Learning, 2022
Introduction
• The Structure-Aware Transformer is a class of simple and flexible graph Transformers built upon a new
self-attention mechanism.
• This new self-attention incorporates structural information into the original self-attention by extracting a
subgraph representation rooted at each node before computing the attention
Problem with Traditional Transformers
• Traditional Transformers with positional encoding do not necessarily capture structural similarity between
nodes
• This can be a limitation when it comes to graph representation learning
• Message-passing GNNs, on the other hand, suffer from over-smoothing and over-squashing problems
Background
Transformers on Graphs
● Graph G = (V, E, X), where x_u denotes the node attributes of node u and the attributes of all nodes are stored in the matrix X
● Transformer composed of two main blocks: a self-attention module followed by a feed-forward neural
network (FFN)
● X is first projected to query (Q), key (K), and value (V) matrices through linear projections
● Self-attention: Attn(X) = softmax(QKᵀ / √d_out) V
● The output of the self-attention is followed by a skip-connection and an FFN, which together compose a Transformer layer (a minimal sketch follows below)
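As a concrete illustration of this block structure, here is a minimal single-head PyTorch sketch of one Transformer layer over node features; the dimensions and class names are illustrative, not the paper's implementation:

```python
import math
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    """One Transformer block: self-attention, skip-connection, then an FFN
    (single attention head, normalization details simplified)."""
    def __init__(self, d_model):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 2 * d_model), nn.ReLU(),
                                 nn.Linear(2 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, X):                                   # X: (n_nodes, d_model)
        Q, K, V = self.q_proj(X), self.k_proj(X), self.v_proj(X)
        attn = torch.softmax(Q @ K.T / math.sqrt(X.size(-1)), dim=-1) @ V
        X = self.norm1(X + attn)                            # skip-connection
        return self.norm2(X + self.ffn(X))                  # FFN with second skip-connection

X = torch.randn(6, 32)                                      # 6 nodes with 32-dim features
out = TransformerLayer(32)(X)
```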
Background
Absolute encoding
● Absolute encoding refers to adding or concatenating the positional or structural representations of the
graph to the input node features before the main Transformer model
● Example: Laplacian positional encoding, random walk positional encoding (RWPE)
● Absolute encodings do not provide a measure of the structural similarity between nodes and their neighborhoods
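RWPE, for example, can be computed from the adjacency matrix alone. Below is a small NumPy sketch of the common formulation, in which the i-th entry of a node's encoding is its return probability after i random-walk steps; it assumes an unweighted graph and is illustrative only:

```python
import numpy as np

def rwpe(adj: np.ndarray, k: int) -> np.ndarray:
    """Random walk positional encoding: entry (u, i) is the probability that an
    (i+1)-step random walk started at node u returns to u."""
    deg = adj.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0                      # guard against isolated nodes
    rw = adj / deg                           # row-stochastic transition matrix D^-1 A
    power = np.eye(adj.shape[0])
    pe = []
    for _ in range(k):
        power = power @ rw
        pe.append(np.diag(power))            # return probabilities after this many steps
    return np.stack(pe, axis=1)              # shape (n_nodes, k); concatenated to X

# Example: a triangle with one pendant node attached
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(rwpe(A, k=3))
```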
Background
Self-attention as kernel smoothing
● Self-attention can be rewritten as a kernel smoother:
Attn(x_v) = Σ_{u∈V} [κ_exp(x_v, x_u) / Σ_{w∈V} κ_exp(x_v, x_w)] f(x_u), with κ_exp(x, x') = exp(⟨W_Q x, W_K x'⟩ / √d_out),
where f(x_u) = W_V x_u is the linear value function
● Mialon et al. (2021) propose a relative positional encoding strategy via the product of this kernel and a
diffusion kernel on the graph, which captures the positional similarity between nodes; however, this
method is only position-aware
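To make the kernel-smoother view concrete, here is a small NumPy sketch (the weight matrices are random placeholders); it computes exactly what row-wise softmax attention computes:

```python
import math
import numpy as np

def exp_kernel(X, W_q, W_k):
    """kappa_exp(x_v, x_u) = exp(<W_Q x_v, W_K x_u> / sqrt(d_out))."""
    Q, K = X @ W_q, X @ W_k
    return np.exp(Q @ K.T / math.sqrt(Q.shape[1]))

def attention_as_kernel_smoother(X, W_q, W_k, W_v):
    """Attn(x_v) = sum_u kappa(x_v, x_u) / sum_w kappa(x_v, x_w) * f(x_u),
    with the linear value function f(x_u) = W_V x_u; equivalent to softmax attention."""
    kappa = exp_kernel(X, W_q, W_k)                      # (n, n) kernel matrix
    weights = kappa / kappa.sum(axis=1, keepdims=True)   # normalize per query node
    return weights @ (X @ W_v)

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
W_q, W_k, W_v = (rng.standard_normal((8, 8)) for _ in range(3))
out = attention_as_kernel_smoother(X, W_q, W_k, W_v)
```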
The Structure-Aware Transformer
Structure-Aware Self-Attention
• To address this issue, the Structure-Aware Transformer was proposed.
• It incorporates structural information into the original self-attention by extracting a subgraph representation
rooted at each node before computing the attention
• The problem with the kernel smoother is that it cannot filter out nodes that are structurally different from the node of interest when they have the same or similar node features
• To incorporate the structural similarity between nodes => more generalized kernel that additionally
accounts for the local substructures around each node => A set of subgraphs centered at each node
• The structure-aware attention then reads:
SA-attn(v) = Σ_{u∈V} [κ_graph(S_G(v), S_G(u)) / Σ_{w∈V} κ_graph(S_G(v), S_G(w))] f(x_u),
where S_G(v) denotes the subgraph of G centered at node v, associated with node features X, and κ_graph is a kernel comparing a pair of such subgraphs
• This takes both the attribute similarity and the structural similarity between subgraphs into account (a rough sketch follows below)
• Generates more expressive node representations than the original self-attention
• No longer equivariant to arbitrary permutations of nodes, but only to permutations of nodes whose features and subgraphs coincide
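A rough sketch of this generalized kernel, assuming the subgraph representations S (one vector per node, e.g. produced by a structure extractor as described next) are already available; queries and keys are built from S, while values still come from the raw node features:

```python
import math
import numpy as np

def structure_aware_attention(X, S, W_q, W_k, W_v):
    """Structure-aware self-attention sketch.
    X: (n, d) raw node features; S: (n, d_s) subgraph representations from a
    structure extractor. The attention kernel compares subgraph representations,
    so nodes with similar features but different local structure get low weight."""
    Q, K = S @ W_q, S @ W_k                              # structure-aware queries/keys
    kappa = np.exp(Q @ K.T / math.sqrt(Q.shape[1]))      # exponential kernel on subgraph reprs
    weights = kappa / kappa.sum(axis=1, keepdims=True)   # normalize per query node
    return weights @ (X @ W_v)                           # values remain feature-based
```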
● The subgraph kernel can be instantiated as κ_graph(S_G(u), S_G(v)) = κ_exp(φ_{G,X}(u), φ_{G,X}(v)), where φ_{G,X}(u) is a structure extractor that extracts a vector representation of a subgraph centered at u with node features X
● k-subtree GNN extractor: apply a k-layer GNN to the input graph with node features X and take the output node representation at u as the subgraph representation at u, capturing local structural information around u
● A small value of k already leads to good performance, while not suffering from over-smoothing and over-squashing
● k-subgraph GNN extractor: a more expressive extractor uses a GNN to directly compute the representation of the entire k-hop subgraph centered at u, rather than just the node representation at u
● Use subgraphs rather than subtrees around the node => more powerful than the 1-WL test
● The updated node representations of all nodes within the k-hop neighborhood are aggregated using a pooling function such as summation (see the sketch below)
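A simplified stand-in for the k-subtree extractor (the paper allows any GNN here): k rounds of mean-aggregation message passing, with the representation at node u read out as the subgraph representation. The k-subgraph variant is approximated by sum-pooling over every node in u's k-hop neighborhood. All names and the aggregation scheme are illustrative assumptions:

```python
import numpy as np

def k_subtree_extractor(adj, X, weights, k=2):
    """k rounds of simple message passing; H[u] then summarizes the k-hop
    subtree rooted at u (a stand-in for the k-subtree GNN extractor)."""
    deg = adj.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0
    H = X
    for layer in range(k):
        msgs = (adj @ H) / deg                    # mean over neighbors
        H = np.tanh((H + msgs) @ weights[layer])  # combine self and neighbor messages
    return H                                      # row u = subgraph representation of u

def k_subgraph_readout(k_hop_mask, H):
    """k-subgraph variant: sum-pool the updated representations of every node
    falling inside each node's k-hop neighborhood."""
    return k_hop_mask @ H

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X = rng.standard_normal((3, 4))
Ws = [rng.standard_normal((4, 4)) for _ in range(2)]
S = k_subtree_extractor(A, X, Ws, k=2)
mask = ((A + np.eye(3) + A @ A) > 0).astype(float)  # nodes within 2 hops (incl. self)
S_sub = k_subgraph_readout(mask, S)
```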
The Structure-Aware Transformer
Structure-Aware Transformer
● The structure-aware attention is followed by a skip-connection, an FFN, and two normalization layers before and after the FFN
● A degree factor is added in the skip-connection to reduce the overwhelming influence of highly connected graph components, where d_v denotes the degree of node v used in this scaling
● Each layer yields a new graph with the same structure but updated node features, G' = (V, E, X'), where X' is the output of the Transformer layer
● For graph property prediction, node-level representations need to be aggregated into a graph representation, by taking the average or sum, or by taking the embedding of a virtual CLS node (which has no connectivity to other nodes)
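A short sketch of the graph-level readout choices listed above (mean, sum, or a virtual CLS token appended to the node sequence before the Transformer layers); the CLS handling here is only schematic, not the paper's implementation:

```python
import torch

def graph_readout(node_repr: torch.Tensor, mode: str = "mean") -> torch.Tensor:
    """Aggregate node-level representations (n_nodes, d) into one graph vector."""
    if mode == "mean":
        return node_repr.mean(dim=0)
    if mode == "sum":
        return node_repr.sum(dim=0)
    raise ValueError(f"unknown readout mode: {mode}")

node_repr = torch.randn(10, 64)                  # output of the last Transformer layer
graph_vec = graph_readout(node_repr, "mean")

# CLS-style alternative: append a learnable virtual node (with no graph connectivity)
# to the inputs; after the Transformer layers, its row is used as the graph vector.
cls_token = torch.nn.Parameter(torch.zeros(1, 64))
tokens_in = torch.cat([torch.randn(10, 64), cls_token], dim=0)  # fed to the model
```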
The Structure-Aware Transformer
Combination with Absolute Encoding
● Most absolute encoding techniques are only position-aware
● They chose RWPE over other absolute positional representations
Conclusion
• The Structure-Aware Transformer successfully combines the advantages of GNNs and Transformers
• It offers a new way to incorporate structural information into graph representation learning, leading to
improved performance on various benchmarks
• Limitations: k-subgraph SAT has higher memory requirements than k-subtree SAT
• Future work: Focus on reducing the high memory cost and time complexity of the self-attention
computation