1. Presenter: Jinduk Park
Reading group meeting material 1
Transformers in various domains
School of Mathematics and Computing (Computational Science and Engineering)
Yonsei University, Seoul, Korea
2. Contents
- Introduction
- Recap of Transformer
- Transformer in various domains
  - On vision transformer
  - On graph transformer
- Some discussion
4. A Preliminary: Inductive Bias
Introduction
Inductive biases are the characteristics of learning algorithms that influence
their generalization behaviour, independent of data.
[Figure: (a) has a weaker inductive bias than (b) or (c)]
Abnar et al., "Transferring inductive biases through knowledge distillation." (2020).
5. A Preliminary: Inductive Bias
Introduction
[Figure: training paths vs. epochs on an MNIST task]
- Proper choice of inductive bias -> good convergence with limited training resources.
- No inductive bias -> stuck in local minima.
- Wrong choice of inductive bias (wrong assumption) -> wrong results.
Abnar et al., "Transferring inductive biases through knowledge distillation." (2020).
6. A Preliminary: Inductive Bias
Introduction
How can we inject inductive bias? For example (but not limited to):
1. Appropriate architecture
2. Appropriate objective function
3. Appropriate optimization method
+ α ...
https://www.researchgate.net/figure/This-figure-Shows-multi-SGD-optimizer_fig3_327135988
https://www.ibm.com/cloud/learn/convolutional-neural-networks
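As a toy illustration of option 1 (my own example, not from the slides): choosing a convolutional layer over a fully connected one hard-codes locality and weight sharing, and the assumption shows up directly in the parameter count.

```python
# Toy illustration: architecture as inductive bias.
# Map a 32x32 single-channel input to a same-sized output.
H = W = 32
dense_params = (H * W) ** 2   # fully connected: 1,048,576 weights, no locality assumption
conv_params = 3 * 3           # 3x3 convolution: 9 shared weights, assumes locality
                              # and translation equivariance (filter reused everywhere)
print(dense_params, conv_params)
```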
7. Concept of Attention
Introduction
First derived from human intuition, the attention mechanism is a method for
encoding data based on the importance score assigned to each element.
Hu, Dichao. "An Introductory Survey on Attention Mechanisms in NLP Problems." SAI Intelligent Systems Conference (2019).
http://projects.i-ctm.eu/en/project/visual-attention
8. Why Transformer?
Introduction
Inspired by the major success of transformer architectures in the field of NLP,
many researchers have applied them to other domains (vision, graph, ...).
Han, Kai, et al. "A survey on visual transformer." arXiv preprint (2020).
10. Applying Transformer to various domains
Introduction
Input per domain:
- NLP: token sequence
- Vision: image
- Graph: graph (nodes + edges)
Transformer network components:
- Positional encoding
- Self-attention
- Layer normalization
- ...
Focus on which components should be revised for each domain.
11. Why Transformer?
Recap of Transformer
Motivation for designing the Transformer:
1) Parallel computation is available
2) Long-range dependencies (global perspective)
12. Two Major Components of the Transformer
Recap of Transformer
1) Positional encoding
2) Multi-head attention (with self-attention)
Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems 30 (2017).
13. 1) Positional encoding (PE)
Recap of Transformer
Each position is encoded with sin/cos functions, then summed with the token embedding.
[Figure: embeddings E of "I love you" added element-wise to positional encodings]
*Summation is memory efficient, though less expressive than concatenation.
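For reference, a minimal NumPy sketch of the sinusoidal encoding from Vaswani et al. (2017); the function name and shapes are my own choices.

```python
import numpy as np

def sinusoidal_pe(seq_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(...)."""
    assert d_model % 2 == 0                            # assume an even model dimension
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]              # (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)      # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# x = token_embeddings + sinusoidal_pe(seq_len, d_model)  # summed, not concatenated
```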
14. 2) Self-Attention
Recap of Transformer
Simple matrix multiplications using the concept of Query, Key, and Value vectors.
For the embedded input matrix $X$ (rows = tokens "I", "love", "you"):
$$Q = X W_q, \qquad K = X W_k, \qquad V = X W_v$$
where $W_q$, $W_k$, $W_v$ are parameters to be learned.
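A NumPy sketch of these projections (random matrices stand in for the learned parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 3, 8                    # three tokens: "I", "love", "you"
X = rng.normal(size=(seq_len, d_model))    # embedded input

W_q = rng.normal(size=(d_model, d_model))  # parameters to be learned
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q, K, V = X @ W_q, X @ W_k, X @ W_v        # simple matrix multiplications
```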
15. 2) Self-Attention
Recap of Transformer
Now, based on the Q, K, V, Scaled Dot-Product Attention is calculated as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$
[Figure: for "I love you", the query "I" is scored against the keys "I", "love", "you"
(e.g., scores 130, 50, 10); softmax turns the scores into weights (0.92, 0.06, 0.02),
and the attention layer output is the weighted sum of the value vectors.]
There is no dependency between the queries at all: parallel computation is available!
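Continuing the sketch with the Q, K, V from the previous block; note that every query is handled in a single matrix product.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # query-key similarity scores
    scores -= scores.max(axis=-1, keepdims=True)      # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # each row sums to 1
    return weights @ V                                # weighted sum of value vectors

# Every row of Q is processed independently in one matmul:
# no dependency between queries -> parallel computation.
```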
17. TF in Vision Domain
On vision transformer
How can we apply the transformer (TF) to an image?
18. TF in Vision Domain
On vision transformer
How can we apply the transformer (TF) to an image?
Two major components of TF:
1) Self-attention
2) Positional encoding (PE)
19. Vision Transformer (ViT)
On vision transformer
1) Self-attention
2) Positional encoding (PE)
Defining the unit of encoding (the token in NLP).
Natural unit of an image: the pixel.
However, pixel-wise self-attention is too inefficient
(self-attention cost grows quadratically with the number of units).
20. Vision Transformer (ViT)
On vision transformer
1) Self-attention
2) Positional encoding (PE)
Defining the unit of encoding (the token in NLP).
Proposed method: flattened 2D patches.
Each patch is flattened and linearly projected with a learned matrix W,
then fed to self-attention as Q, K, V.
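A minimal NumPy sketch of this patch tokenization (16x16 patches of a 224x224 image give 196 tokens; the zero arrays are placeholders for real data and learned weights):

```python
import numpy as np

def image_to_patches(img: np.ndarray, p: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened p x p patches -> (N, p*p*C)."""
    H, W, C = img.shape
    assert H % p == 0 and W % p == 0
    patches = img.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, p * p * C)

img = np.zeros((224, 224, 3))
tokens = image_to_patches(img, 16)   # (196, 768): 196 "words" of dimension 768
W = np.zeros((768, 512))             # linear projection, learned in practice
embedded = tokens @ W                # patch embeddings fed to the transformer
```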
21. Vision Transformer (ViT)
On vision transformer
1) Self-attention
2) Positional encoding (PE)
Since an image lives in a spatial domain, encode position either in 1D order or by 2D coordinates:
1D) the i-th patch in raster order, or
2D) the (i,j)-th patch.
For 2D, the x and y components are encoded separately and concatenated.
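A sketch of the 2D variant, reusing sinusoidal_pe from the earlier block: rows and columns are encoded separately with half the dimensions each, then concatenated.

```python
import numpy as np

def pe_2d(n_rows: int, n_cols: int, d_model: int) -> np.ndarray:
    """Concatenate separate row (y) and column (x) encodings for each patch."""
    row_pe = sinusoidal_pe(n_rows, d_model // 2)       # (n_rows, d/2)
    col_pe = sinusoidal_pe(n_cols, d_model // 2)       # (n_cols, d/2)
    grid = np.concatenate(
        [np.repeat(row_pe[:, None, :], n_cols, axis=1),
         np.repeat(col_pe[None, :, :], n_rows, axis=0)],
        axis=-1,                                       # (n_rows, n_cols, d_model)
    )
    return grid.reshape(n_rows * n_cols, d_model)      # flattened in raster (1D) order
```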
23. Vision Transformer (ViT): Overview
On vision transformer
Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." (2020).
24. Inductive Biases in CNN
On vision transformer
Inductive biases in CNN: locality + translation equivariance.
However, a strong inductive bias can be harmful for some tasks.
https://anhreynolds.com/blogs/cnn.html
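A quick check of the translation-equivariance bias (my own example; with circular shifts and wrap-around boundaries the property holds exactly):

```python
import numpy as np
from scipy.signal import correlate2d

# Translation equivariance: conv(shift(x)) == shift(conv(x)).
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))                      # toy "image"
k = rng.normal(size=(3, 3))                      # toy filter

shift = lambda a: np.roll(a, 1, axis=1)          # shift one pixel to the right
y1 = correlate2d(shift(x), k, mode="same", boundary="wrap")
y2 = shift(correlate2d(x, k, mode="same", boundary="wrap"))
assert np.allclose(y1, y2)                       # shifting commutes with filtering
```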
25. Vision Transformer (ViT): Overview
On vision transformer
ViT uses the Transformer:
1. Parallel computation
2. Global perspective
26. When to Choose ViT?
On vision transformer
ViT uses the Transformer:
1. Parallel computation: already available in CNNs
2. Global perspective: less inductive bias than CNNs
ViT is useful for tasks where the generalization property is important
(few-shot learning, large-dataset training, ...).
28. Attention in Graph: GAT
On graph transformer
GAT (Graph Attention Network)
Attention in graph: the attention is a function of the neighborhood connectivity.
Veličković, Petar, et al. "Graph attention networks." (2017).
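For reference, the attention coefficient from Veličković et al. (2017); the softmax normalizes only over the neighborhood $\mathcal{N}(i)$, which is exactly why GAT attention is a function of local connectivity:
$$\alpha_{ij} = \frac{\exp\!\left(\mathrm{LeakyReLU}\!\left(\mathbf{a}^{\top}\left[\mathbf{W}\mathbf{h}_i \,\Vert\, \mathbf{W}\mathbf{h}_j\right]\right)\right)}{\sum_{k \in \mathcal{N}(i)} \exp\!\left(\mathrm{LeakyReLU}\!\left(\mathbf{a}^{\top}\left[\mathbf{W}\mathbf{h}_i \,\Vert\, \mathbf{W}\mathbf{h}_k\right]\right)\right)}$$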
29. Attention in Graph: GAT
On graph transformer
GAT (Graph Attention Network)
Limitation: GAT attention has nothing to do with global connectivity.
30. GT (Graph Transformer)
On graph transformer
1) Self-attention
2) Positional encoding (PE)
The unit of GT is each node, like each token in NLP.
e.g.) A sentence in NLP can be viewed as a discrete line graph:
"My future is bright"
31. GT (Graph Transformer)
On graph transformer
1) Self-attention
2) Positional encoding (PE)
The difference is that layer-wise global Q, K, V matrices are constructed,
instead of constructing node-wise Q, K, V.
Dwivedi, Vijay Prakash, and Xavier Bresson. "A generalization of transformer networks to graphs." arXiv preprint arXiv:2012.09699 (2020).
32. GT (Graph Transformer)
On graph transformer
1) Self-attention
2) Positional encoding (PE)
The difference is that layer-wise global Q, K, V matrices are constructed,
instead of constructing node-wise Q, K, V.
This seems unable to utilize the full power of self-attention.
Dwivedi, Vijay Prakash, and Xavier Bresson. "A generalization of transformer networks to graphs." arXiv preprint arXiv:2012.09699 (2020).
33. GT (Graph Transformer)
On graph transformer
1) Self-attention
2) Positional encoding (PE)
Limitation of GAT: no global connectivity.
How can we encode the position of a node considering the global graph structure?
35. GT (Graph Transformer)
On graph transformer
1) Self-attention
2) Positional encoding (PE)
Laplacian eigenvectors:
$$L = I - D^{-1/2} A D^{-1/2} = U^{T} \Lambda U$$
where the columns of $U$ are the eigenvectors and $\Lambda$ holds the eigenvalues.
Why the graph Laplacian?
1) Distance-aware node features
(i.e., nearby nodes have similar positional features
and farther nodes have dissimilar positional features)
2) An NLP graph's Laplacian eigenvectors are naturally cosine and sine functions.
37. GT (Graph Transformer)
On graph transformer
2) An NLP graph's Laplacian eigenvectors are naturally cosine and sine functions.
Let's try to derive it. The normalized Laplacian of a line graph is (up to the two boundary rows) tridiagonal and Toeplitz:
$$L = \begin{pmatrix}
1 & -1 & & & \\
-0.5 & 1 & -0.5 & & \\
 & \ddots & \ddots & \ddots & \\
 & & -0.5 & 1 & -0.5 \\
 & & & -1 & 1
\end{pmatrix}$$
If a matrix is tridiagonal and is also Toeplitz, its eigenvalues are known to be [Noschese et al., 2013]:
$$\lambda_k = \delta + 2\sqrt{\sigma\tau}\,\cos\!\left(\frac{k\pi}{n+1}\right), \quad k = 1, \dots, n,$$
where $\delta$ is the diagonal entry and $\sigma$, $\tau$ are the off-diagonal entries.
This is a function of cosine, and the corresponding eigenvector components are proportional to $\sin\!\left(\frac{jk\pi}{n+1}\right)$, a function of sine.
Noschese, S., Pasquini, L., and Reichel, L. (2013). "Tridiagonal Toeplitz matrices: Properties and novel applications." Numerical Linear Algebra with Applications.
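A quick numerical check of the cosine formula for the symmetric case (my own sketch; for $\sigma = \tau$ the sorted spectra from the formula and from direct eigendecomposition coincide):

```python
import numpy as np

# Eigenvalues of a symmetric tridiagonal Toeplitz matrix
# (diagonal delta, off-diagonals tau) vs. delta + 2*tau*cos(k*pi/(n+1)).
n, delta, tau = 8, 1.0, -0.5
M = delta * np.eye(n) + tau * (np.eye(n, k=1) + np.eye(n, k=-1))
computed = np.sort(np.linalg.eigvalsh(M))
k = np.arange(1, n + 1)
formula = np.sort(delta + 2 * tau * np.cos(k * np.pi / (n + 1)))
assert np.allclose(computed, formula)
```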
38. GT (Graph Transformer)
On graph transformer
1) Self-attention
2) Positional encoding (PE)
Laplacian eigenvectors: $L = I - D^{-1/2} A D^{-1/2} = U^{T} \Lambda U$ (eigenvectors $U$, eigenvalues $\Lambda$).
Then, select the k smallest non-trivial eigenvectors as the PE for each node (see the sketch below):
1) k for dimension matching
2) smallest to provide a smooth encoding
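A minimal sketch of this PE (function name is mine; note the paper also randomly flips eigenvector signs during training, since each eigenvector's sign is arbitrary):

```python
import numpy as np

def laplacian_pe(A: np.ndarray, k: int) -> np.ndarray:
    """k smallest non-trivial eigenvectors of the normalized graph Laplacian."""
    deg = A.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    L = np.eye(len(A)) - d_inv_sqrt @ A @ d_inv_sqrt  # L = I - D^{-1/2} A D^{-1/2}
    _, eigvecs = np.linalg.eigh(L)                    # columns sorted by eigenvalue
    return eigvecs[:, 1:k + 1]                        # drop the trivial first one

# Line graph of the sentence "My future is bright" (4 nodes in a path):
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
pe = laplacian_pe(A, k=2)                             # (4, 2) positional features
```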
43. Conclusions
Transformer in various domains
1. Injecting the proper inductive bias for a given task is important.
2. For the graph domain, the justification for using the TF seems weak.
3. The task that benefits the most is NLP, followed by vision, then graph
(NLP > Vision > Graph).