Structure-Aware Transformer for Graph Representation Learning
Tien-Bach-Thanh Do
Network Science Lab
Dept. of Artificial Intelligence
The Catholic University of Korea
E-mail: osfa19730@catholic.ac.kr
2024/02/26
Dexiong Chen et al.
International Conference on Machine Learning, 2022
Introduction
• The Structure-Aware Transformer is a class of simple and flexible graph Transformers built upon a new
self-attention mechanism.
• This new self-attention incorporates structural information into the original self-attention by extracting a
subgraph representation rooted at each node before computing the attention
Problem with Traditional Transformers
• Traditional Transformers with positional encoding do not necessarily capture structural similarity between
nodes
• This can be a limitation when it comes to graph representation learning
• Message-passing GNNs, on the other hand, suffer from over-smoothing and over-squashing problems
Background
Transformers on Graphs
● Graph G = (V, E, X), where x_u denotes the node attributes of node u and the attributes of all nodes are stored in the matrix X
● Transformer composed of two main blocks: a self-attention module followed by a feed-forward neural
network (FFN)
● X is first projected to query (Q), key (K), and value (V) matrices through linear projections
● Self-attention: Attn(X) = softmax(QKᵀ / √d_out) V
● The output of the self-attention is followed by a skip-connection and an FFN, which together compose a Transformer layer (a minimal sketch follows below)
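As a concrete illustration of this block structure, here is a minimal single-head PyTorch sketch of one Transformer layer over node features; the dimensions and class names are illustrative, not the paper's implementation:

```python
import math
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    """One Transformer block: self-attention, skip-connection, then an FFN
    (single attention head, normalization details simplified)."""
    def __init__(self, d_model):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 2 * d_model), nn.ReLU(),
                                 nn.Linear(2 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, X):                                   # X: (n_nodes, d_model)
        Q, K, V = self.q_proj(X), self.k_proj(X), self.v_proj(X)
        attn = torch.softmax(Q @ K.T / math.sqrt(X.size(-1)), dim=-1) @ V
        X = self.norm1(X + attn)                            # skip-connection
        return self.norm2(X + self.ffn(X))                  # FFN with second skip-connection

X = torch.randn(6, 32)                                      # 6 nodes with 32-dim features
out = TransformerLayer(32)(X)
```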
Background
Absolute encoding
● Absolute encoding refers to adding or concatenating the positional or structural representations of the
graph to the input node features before the main Transformer model
● Example: Laplacian positional encoding, random walk positional encoding (RWPE)
● Absolute encodings do not provide a measure of the structural similarity between nodes and their neighborhoods
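RWPE, for example, can be computed from the adjacency matrix alone. Below is a small NumPy sketch of the common formulation, in which the i-th entry of a node's encoding is its return probability after i random-walk steps; it assumes an unweighted graph and is illustrative only:

```python
import numpy as np

def rwpe(adj: np.ndarray, k: int) -> np.ndarray:
    """Random walk positional encoding: entry (u, i) is the probability that an
    (i+1)-step random walk started at node u returns to u."""
    deg = adj.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0                      # guard against isolated nodes
    rw = adj / deg                           # row-stochastic transition matrix D^-1 A
    power = np.eye(adj.shape[0])
    pe = []
    for _ in range(k):
        power = power @ rw
        pe.append(np.diag(power))            # return probabilities after this many steps
    return np.stack(pe, axis=1)              # shape (n_nodes, k); concatenated to X

# Example: a triangle with one pendant node attached
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(rwpe(A, k=3))
```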
Background
Self-attention as kernel smoothing
● Self-attention can be rewritten as a kernel smoother:
Attn(x_v) = Σ_{u∈V} [κ_exp(x_v, x_u) / Σ_{w∈V} κ_exp(x_v, x_w)] f(x_u), with κ_exp(x, x') = exp(⟨W_Q x, W_K x'⟩ / √d_out),
where f(x_u) = W_V x_u is the linear value function
● Mialon et al. (2021) propose a relative positional encoding strategy via the product of this kernel and a
diffusion kernel on the graph, which captures the positional similarity between nodes; however, this
method is only position-aware
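To make the kernel-smoother view concrete, here is a small NumPy sketch (the weight matrices are random placeholders); it computes exactly what row-wise softmax attention computes:

```python
import math
import numpy as np

def exp_kernel(X, W_q, W_k):
    """kappa_exp(x_v, x_u) = exp(<W_Q x_v, W_K x_u> / sqrt(d_out))."""
    Q, K = X @ W_q, X @ W_k
    return np.exp(Q @ K.T / math.sqrt(Q.shape[1]))

def attention_as_kernel_smoother(X, W_q, W_k, W_v):
    """Attn(x_v) = sum_u kappa(x_v, x_u) / sum_w kappa(x_v, x_w) * f(x_u),
    with the linear value function f(x_u) = W_V x_u; equivalent to softmax attention."""
    kappa = exp_kernel(X, W_q, W_k)                      # (n, n) kernel matrix
    weights = kappa / kappa.sum(axis=1, keepdims=True)   # normalize per query node
    return weights @ (X @ W_v)

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
W_q, W_k, W_v = (rng.standard_normal((8, 8)) for _ in range(3))
out = attention_as_kernel_smoother(X, W_q, W_k, W_v)
```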
The Structure-Aware Transformer
Structure-Aware Self-Attention
• To address this issue, the Structure-Aware Transformer was proposed.
• It incorporates structural information into the original self-attention by extracting a subgraph representation
rooted at each node before computing the attention
• The problem with the kernel smoother is that it cannot filter out nodes that are structurally different from the node of interest when they have the same or similar node features
• To incorporate the structural similarity between nodes => more generalized kernel that additionally
accounts for the local substructures around each node => A set of subgraphs centered at each node
• The structure-aware attention then reads:
SA-attn(v) = Σ_{u∈V} [κ_graph(S_G(v), S_G(u)) / Σ_{w∈V} κ_graph(S_G(v), S_G(w))] f(x_u),
where S_G(v) denotes the subgraph of G centered at node v, associated with node features X, and κ_graph is a kernel comparing a pair of such subgraphs
• This takes both the attribute similarity and the structural similarity between subgraphs into account (a rough sketch follows below)
• Generates more expressive node representations than the original self-attention
• No longer equivariant to arbitrary permutations of nodes, but only to permutations of nodes whose features and subgraphs coincide
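A rough sketch of this generalized kernel, assuming the subgraph representations S (one vector per node, e.g. produced by a structure extractor as described next) are already available; queries and keys are built from S, while values still come from the raw node features:

```python
import math
import numpy as np

def structure_aware_attention(X, S, W_q, W_k, W_v):
    """Structure-aware self-attention sketch.
    X: (n, d) raw node features; S: (n, d_s) subgraph representations from a
    structure extractor. The attention kernel compares subgraph representations,
    so nodes with similar features but different local structure get low weight."""
    Q, K = S @ W_q, S @ W_k                              # structure-aware queries/keys
    kappa = np.exp(Q @ K.T / math.sqrt(Q.shape[1]))      # exponential kernel on subgraph reprs
    weights = kappa / kappa.sum(axis=1, keepdims=True)   # normalize per query node
    return weights @ (X @ W_v)                           # values remain feature-based
```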
● The subgraph kernel can be instantiated as κ_graph(S_G(u), S_G(v)) = κ_exp(φ_{G,X}(u), φ_{G,X}(v)), where φ_{G,X}(u) is a structure extractor that extracts a vector representation of a subgraph centered at u with node features X
● k-subtree GNN extractor: apply a k-layer GNN to the input graph with node features X and take the output node representation at u as the subgraph representation at u, capturing local structural information around u
● A small value of k already leads to good performance, while not suffering from over-smoothing and over-squashing
● k-subgraph GNN extractor: a more expressive extractor uses a GNN to directly compute the representation of the entire k-hop subgraph centered at u, rather than just the node representation at u
● Use subgraphs rather than subtrees around the node => more powerful than the 1-WL test
● The updated node representations of all nodes within the k-hop neighborhood are aggregated using a pooling function such as summation (see the sketch below)
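A simplified stand-in for the k-subtree extractor (the paper allows any GNN here): k rounds of mean-aggregation message passing, with the representation at node u read out as the subgraph representation. The k-subgraph variant is approximated by sum-pooling over every node in u's k-hop neighborhood. All names and the aggregation scheme are illustrative assumptions:

```python
import numpy as np

def k_subtree_extractor(adj, X, weights, k=2):
    """k rounds of simple message passing; H[u] then summarizes the k-hop
    subtree rooted at u (a stand-in for the k-subtree GNN extractor)."""
    deg = adj.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0
    H = X
    for layer in range(k):
        msgs = (adj @ H) / deg                    # mean over neighbors
        H = np.tanh((H + msgs) @ weights[layer])  # combine self and neighbor messages
    return H                                      # row u = subgraph representation of u

def k_subgraph_readout(k_hop_mask, H):
    """k-subgraph variant: sum-pool the updated representations of every node
    falling inside each node's k-hop neighborhood."""
    return k_hop_mask @ H

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X = rng.standard_normal((3, 4))
Ws = [rng.standard_normal((4, 4)) for _ in range(2)]
S = k_subtree_extractor(A, X, Ws, k=2)
mask = ((A + np.eye(3) + A @ A) > 0).astype(float)  # nodes within 2 hops (incl. self)
S_sub = k_subgraph_readout(mask, S)
```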
The Structure-Aware Transformer
Structure-Aware Transformer
● The structure-aware attention is followed by a skip-connection, an FFN, and two normalization layers before and after the FFN
● A degree factor is added in the skip-connection to reduce the overwhelming influence of highly connected graph components, where d_v denotes the degree of node v used in this scaling
● Each layer yields a new graph with the same structure but updated node features, G' = (V, E, X'), where X' is the output of the Transformer layer
● For graph property prediction, node-level representations need to be aggregated into a graph representation, by taking the average or sum, or by taking the embedding of a virtual CLS node (which has no connectivity to other nodes)
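A short sketch of the graph-level readout choices listed above (mean, sum, or a virtual CLS token appended to the node sequence before the Transformer layers); the CLS handling here is only schematic, not the paper's implementation:

```python
import torch

def graph_readout(node_repr: torch.Tensor, mode: str = "mean") -> torch.Tensor:
    """Aggregate node-level representations (n_nodes, d) into one graph vector."""
    if mode == "mean":
        return node_repr.mean(dim=0)
    if mode == "sum":
        return node_repr.sum(dim=0)
    raise ValueError(f"unknown readout mode: {mode}")

node_repr = torch.randn(10, 64)                  # output of the last Transformer layer
graph_vec = graph_readout(node_repr, "mean")

# CLS-style alternative: append a learnable virtual node (with no graph connectivity)
# to the inputs; after the Transformer layers, its row is used as the graph vector.
cls_token = torch.nn.Parameter(torch.zeros(1, 64))
tokens_in = torch.cat([torch.randn(10, 64), cls_token], dim=0)  # fed to the model
```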
The Structure-Aware Transformer
Combination with Absolute Encoding
● Most absolute encoding techniques are only position-aware
● They chose RWPE over other absolute positional representations
Conclusion
• The Structure-Aware Transformer successfully combines the advantages of GNNs and Transformers
• It offers a new way to incorporate structural information into graph representation learning, leading to
improved performance on various benchmarks
• Limitations: k-subgraph SAT has higher memory requirements than k-subtree SAT
• Future work: Focus on reducing the high memory cost and time complexity of the self-attention
computation