240304_Thanh_LabSeminar[Pure Transformers are Powerful Graph Learners].pptx
1. Pure Transformers are
Powerful Graph Learners
Tien-Bach-Thanh Do
Network Science Lab
Dept. of Artificial Intelligence
The Catholic University of Korea
E-mail: osfa19730@catholic.ac.kr
2024/03/04
Jinwoo Kim et al.
Advances in Neural Information Processing Systems, 2022
2. 2
Introduction
• Standard Transformers without graph-specific modifications can achieve promising results in graph learning
• Treat all nodes and edges as independent tokens, augment them with token embeddings, and feed
them to a Transformer
• This approach is at least as expressive as a second-order invariant graph network (2-IGN) composed of equivariant
linear layers, which is already more expressive than all message-passing GNNs
3. 3
Related Works
• Multiple works tried combining self-attention into GNN architectures, where message passing was previously
dominant [50]
• Global self-attention across nodes cannot reflect the graph structure
○ Restrict self-attention to local neighborhoods [69, 51, 19]
○ Use global self-attention in conjunction with message-passing GNN [58, 43, 34]
○ Inject edge information into global self-attention via attention bias [72, 78, 29, 54]
• These modifications retain issues of message passing such as over-smoothing [40, 8, 52]
• They are also incompatible with useful engineering techniques such as linear attention [65]
• [50] E. Min, R. Chen, Y. Bian, T. Xu, K. Zhao, W. Huang, P. Zhao, J. Huang, S. Ananiadou, and Y. Rong. Transformer for graphs: An overview from architecture perspective. arXiv, 2022
• [69] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio. Graph attention networks. In ICLR, 2018
• [51] D. Q. Nguyen, T. D. Nguyen, and D. Phung. Universal graph transformer self-attention networks. In WWW, 2022
• [19] V. P. Dwivedi and X. Bresson. A generalization of transformer networks to graphs. arXiv, 2020
• [58] Y. Rong, Y. Bian, T. Xu, W. Xie, Y. Wei, W. Huang, and J. Huang. Self-supervised graph transformer on large-scale molecular data. In NeurIPS, 2020
• [43] K. Lin, L. Wang, and Z. Liu. Mesh graphormer. In ICCV, 2021
• [34] J. Kim, S. Oh, and S. Hong. Transformers generalize deepsets and can be extended to graphs and hypergraphs. In NeurIPS, 2021
• [72] X. Wang, Z. Tu, L. Wang, and S. Shi. Self-attention with structural position representations. In EMNLP-IJCNLP, 2019
• [78] C. Ying, T. Cai, S. Luo, S. Zheng, G. Ke, D. He, Y. Shen, and T. Liu. Do transformers really perform bad for graph representation? In NeurIPS, 2021
• [29] M. S. Hussain, M. J. Zaki, and D. Subramanian. Edge-augmented graph transformers: Global self-attention is enough for graphs. arXiv, 2021
• [54] W. Park, W. Chang, D. Lee, J. Kim, and S.-w. Hwang. GRPE: Relative positional encoding for graph transformer. arXiv, 2022
• [40] Q. Li, Z. Han, and X. Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI, 2018
• [8] C. Cai and Y. Wang. A note on over-smoothing for graph neural networks. arXiv, 2020
• [52] K. Oono and T. Suzuki. Graph neural networks exponentially lose expressive power for node classification. In ICLR, 2020
• [65] Y. Tay, M. Dehghani, D. Bahri, and D. Metzler. Efficient transformers: A survey. arXiv, 2020
4. 4
Pure Transformers for Graph Learning
Tokenized Graph Transformer (TokenGT)
• Takes the opposite direction from prior graph Transformers: apply a standard Transformer directly to graphs
• A pure Transformer architecture for graphs with token-wise embeddings composed of node identifiers
and type identifiers
• Node identifiers: the first component of the token-wise embedding is an orthonormal node identifier that
represents the connectivity structure of the input graph
○ Given an input graph G = (V, E) with n nodes, assign n node-wise orthonormal vectors P; the token of node v is
augmented with [P_v, P_v] and the token of edge (u, v) with [P_u, P_v]
• Type identifiers: a trainable type identifier encodes whether a token is a node or an edge
5. 5
Pure Transformers for Graph Learning
Node identifiers
• The node identifiers are only required to be orthonormal
○ Orthogonal random features (ORF), obtained by QR-decomposing a random Gaussian matrix
○ Laplacian eigenvectors (Lap), obtained by eigendecomposing the graph Laplacian matrix
• Laplacian eigenvectors are already widely used as graph positional encodings (PE)
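Both choices can be computed in a few lines. Below is a minimal NumPy sketch (function and variable names are illustrative, not the authors' code): ORF takes the orthogonal factor of a QR decomposition of a random Gaussian matrix, and Lap takes the rows of the Laplacian eigenvector matrix, which are orthonormal because the eigenvector matrix of a symmetric matrix is orthogonal.

```python
import numpy as np

def orf_node_identifiers(n, rng=None):
    """ORF: QR-decompose an n x n random Gaussian matrix; rows of Q are orthonormal."""
    rng = rng or np.random.default_rng()
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q  # row i = identifier of node i (pad or project to a fixed width in practice)

def lap_node_identifiers(adj):
    """Lap: eigendecompose the symmetric normalized Laplacian of the adjacency matrix."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.clip(deg, 1e-12, None)))
    lap = np.eye(adj.shape[0]) - d_inv_sqrt @ adj @ d_inv_sqrt
    _, eigvec = np.linalg.eigh(lap)
    return eigvec  # row i = identifier of node i; columns ordered by eigenvalue
```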
6. 6
Pure Transformers for Graph Learning
Type identifiers
• The type identifiers are only required to be trainable
• Use two trainable vectors, one for all nodes and one for all edges
12. 12
Pure Transformers for Graph Learning
Tokenized Graph Transformer (TokenGT)
• Treat n nodes and m edges as (n+m) independent tokens
• Concatenate simple token-wise embeddings
○ Trainable type identifiers (node?/edge?) + orthonormal node identifiers
• Feed the (n+m) tokens to a standard Transformer encoder
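A minimal PyTorch sketch of this tokenization (the class, argument names, and widths such as d_p are assumptions for illustration, not the paper's exact implementation): each node token is its feature concatenated with [P_v, P_v] and the node type identifier, each edge token with [P_u, P_v] and the edge type identifier, and the resulting (n+m) tokens go through an unmodified Transformer encoder.

```python
import torch
import torch.nn as nn

class TokenGTSketch(nn.Module):
    """Sketch: treat n nodes and m edges as (n+m) tokens for a vanilla Transformer encoder."""
    def __init__(self, feat_dim, d_p, d_model, nhead=8, num_layers=4):
        super().__init__()
        # trainable type identifiers: row 0 for node tokens, row 1 for edge tokens
        self.type_id = nn.Parameter(torch.randn(2, d_p))
        # project [features | two node identifiers | type identifier] to the model width
        self.proj = nn.Linear(feat_dim + 3 * d_p, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, x_node, x_edge, edge_index, node_ids):
        # x_node: (n, feat_dim), x_edge: (m, feat_dim)
        # edge_index: (2, m) with rows (u, v); node_ids: (n, d_p) orthonormal identifiers
        n, m = x_node.size(0), x_edge.size(0)
        u, v = edge_index
        # node token v:     [x_v  | P_v | P_v | type_node]
        node_tok = torch.cat(
            [x_node, node_ids, node_ids, self.type_id[0].expand(n, -1)], dim=-1)
        # edge token (u,v): [x_uv | P_u | P_v | type_edge]
        edge_tok = torch.cat(
            [x_edge, node_ids[u], node_ids[v], self.type_id[1].expand(m, -1)], dim=-1)
        tokens = self.proj(torch.cat([node_tok, edge_tok], dim=0)).unsqueeze(0)
        return self.encoder(tokens)  # (1, n + m, d_model)
```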
13. 13
Pure Transformers for Graph Learning
How does this work?
• Comparing the node identifiers of a pair of tokens reveals incidence information
• This allows self-attention to identify and exploit the graph structure
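As a sketch of the argument (notation follows the node-identifier slides, with P_u the identifier of node u): orthonormality makes the inner product between the identifier parts of two tokens act as an incidence indicator, which a self-attention head can compute from its queries and keys.

```latex
P_u^\top P_v = \delta_{uv}
\;\;\Longrightarrow\;\;
[P_u, P_v]^\top [P_w, P_w] = P_u^\top P_w + P_v^\top P_w =
\begin{cases}
1 & \text{if } w \in \{u, v\}, \text{ i.e., edge } (u,v) \text{ is incident to node } w,\\
0 & \text{otherwise.}
\end{cases}
```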
19. 20
Conclusion
Pure Transformers (TokenGT) are powerful graph learners
• Minimal modification to Transformer architecture, theory, and codebase
• Theoretically more expressive than all MPNNs
• Empirically learns well from large-scale data
• Can adopt Transformer-specific techniques such as kernelized (linear) attention