240429_Thuy_Labseminar[Simplifying and Empowering Transformers for Large-Graph Representations].pptx
Van Thuy Hoang
Network Science Lab
Dept. of Artificial Intelligence
The Catholic University of Korea
E-mail: hoangvanthuy90@gmail.com
2024-04-08
BACKGROUND: Message Passing GNNs vs Graph Transformers
• Message passing GNNs generate node embeddings based on local network neighborhoods.
• Each node has an embedding at every layer, repeatedly combining messages from its neighbors using neural networks (a minimal sketch follows this list).
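A minimal sketch of one such message passing layer, assuming mean aggregation over neighbors and a dense adjacency matrix; the class name `MessagePassingLayer` and its structure are illustrative, not taken from the slides:

```python
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    """One round of neighborhood aggregation: each node combines messages
    from its neighbors through learnable transformations (illustrative sketch)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin_self = nn.Linear(in_dim, out_dim)
        self.lin_neigh = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # x: [N, in_dim] node embeddings; adj: [N, N] dense adjacency matrix
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)  # node degrees
        neigh_mean = (adj @ x) / deg                     # mean over each node's neighbors
        return torch.relu(self.lin_self(x) + self.lin_neigh(neigh_mean))
```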
Message Passing GNNs vs Graph Transformers
• In message passing GNNs, a node’s update is a function of its neighbors; in GTs, a node’s update is a function of all nodes in the graph (thanks to the self-attention mechanism in the Transformer layer), as sketched below.
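For contrast, a hedged sketch of a Graph Transformer-style update, where every node attends to every other node via standard softmax self-attention (note the N-by-N score matrix; all names here are illustrative):

```python
import torch
import torch.nn as nn

class GlobalSelfAttention(nn.Module):
    """Each node's update depends on ALL nodes in the graph; the O(N^2)
    score matrix is the scalability bottleneck discussed later."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x):
        # x: [N, dim] embeddings of every node in the graph
        Q, K, V = self.q(x), self.k(x), self.v(x)
        scores = Q @ K.T / (x.shape[-1] ** 0.5)  # [N, N] pairwise attention scores
        attn = torch.softmax(scores, dim=-1)     # each node attends to every node
        return attn @ V
```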
Graph Transformers: Challenges
• How to build GTs for large-graph representations:
• The quadratic cost of global attention hinders scalability on large graphs
• The over-fitting problem
Deep attention layers
• Do we need many attention layers?
• Transformers in other domains often require multiple attention layers to reach the desired capacity
The power of 1-layer attention
• Mini-batch sampling randomly partitions the input graph into mini-batches of smaller size (a sketch follows this list).
• Each mini-batch is fed into the SGFormer model, which is implemented with a one-layer global attention and a GNN network.
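A minimal sketch of this random mini-batch partitioning, assuming nodes are simply shuffled and split by index; `random_node_batches` is an illustrative helper, not an SGFormer API:

```python
import torch

def random_node_batches(num_nodes, batch_size):
    """Randomly partition node indices into mini-batches of at most batch_size nodes."""
    perm = torch.randperm(num_nodes)
    return [perm[i:i + batch_size] for i in range(0, num_nodes, batch_size)]

# Each batch of node indices (together with its features and induced subgraph)
# would then be fed to the one-layer global attention plus GNN network.
batches = random_node_batches(num_nodes=100_000, batch_size=10_000)
```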
Simple Global Attention
• A single layer of global attention is sufficient.
• This is because, through one-layer propagation over a densely connected attention graph, the information of each node can be adaptively propagated to arbitrary nodes within the batch.
• The computation of Eq. (3) can be achieved in O(N) time complexity, which is much more efficient than the softmax attention in the original Transformer (a generic linearization sketch follows this list).
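A hedged sketch of the standard linearization trick behind O(N) attention: dropping the softmax (replaced here with a positive feature map) lets the matrix products be reordered so that the d-by-d matrix K^T V is formed first and the N-by-N score matrix is never materialized. This is a generic linear attention example, not the exact Eq. (3) of SGFormer:

```python
import torch
import torch.nn.functional as F

def linear_attention(Q, K, V, eps=1e-6):
    """O(N) attention via reordered matrix products: K^T V (a d-by-d matrix)
    is computed first, so the N-by-N score matrix never appears."""
    # Q, K, V: [N, d]
    Q, K = F.elu(Q) + 1, F.elu(K) + 1                # positive feature map
    kv = K.T @ V                                     # [d, d], cost O(N d^2)
    normalizer = Q @ K.sum(dim=0, keepdim=True).T    # [N, 1] per-node normalization
    return (Q @ kv) / (normalizer + eps)             # [N, d]
```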
Incorporation of Structural Information
• A simple yet effective scheme combines Z with the embeddings propagated by GNNs at the output layer, as sketched below.
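An illustrative sketch of one way such a combination could look: a weighted sum of the attention output and the GNN output, followed by a classifier head. The mixing weight alpha and the class `OutputCombination` are assumptions for illustration, not the paper's exact formula:

```python
import torch
import torch.nn as nn

class OutputCombination(nn.Module):
    """Fuse the global-attention output Z with GNN-propagated embeddings at the
    output layer via a weighted sum; alpha is an assumed mixing hyperparameter."""
    def __init__(self, dim, num_classes, alpha=0.5):
        super().__init__()
        self.alpha = alpha
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, z_attn, z_gnn):
        # z_attn: [N, dim] from the one-layer global attention
        # z_gnn:  [N, dim] from the GNN branch (e.g., a vanilla GCN)
        z = (1 - self.alpha) * z_attn + self.alpha * z_gnn
        return self.classifier(z)
```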
Empirical Evaluation
• Scalability test of training time per epoch
• The Amazon2M dataset is used, randomly sampling subsets of nodes with node counts ranging from 10K to 100K.
SUMMARY
• The potential of simple Transformer-style architectures for learning large-graph representations, where scalability is the bottleneck
• A one-layer attention model combined with a vanilla GCN can surprisingly produce highly competitive performance.
• A remaining challenge: out-of-distribution learning