1. Presenter: Jinduk Park
Reading group meeting material 1
Transformers in various domains
School of Mathematics and Computing (Computational Science and Engineering)
Yonsei University, Seoul, Korea
2. Contents
- Introduction
- Recap of Transformer
- Transformer in various domains
  - On vision transformer
  - On graph transformer
- Some discussion
4. A Preliminary: Inductive Bias
Introduction
Inductive biases are the characteristics of learning algorithms that influence
their generalization behaviour, independent of data.
[Figure: (a) has a weaker inductive bias than (b) or (c)]
Abnar et al., "Transferring inductive biases through knowledge distillation." (2020).
5. A Preliminary: Inductive Bias
Introduction
[Figure: training paths vs. epochs on an MNIST task]
- Proper choice of inductive bias -> good convergence with limited training resources.
- No inductive bias -> stuck in local minima.
- Wrong choice of inductive bias (wrong assumption) -> wrong results.
Abnar et al., "Transferring inductive biases through knowledge distillation." (2020).
6. A Preliminary: Inductive Bias
Introduction
How can we inject inductive bias? For example (but not limited to):
1. Appropriate architecture
2. Appropriate objective function
3. Appropriate optimization method
+ α ...
https://www.researchgate.net/figure/This-figure-Shows-multi-SGD-optimizer_fig3_327135988
https://www.ibm.com/cloud/learn/convolutional-neural-networks
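As a toy illustration of option 1 (my own example, not from the slides): choosing a convolutional layer over a fully connected one hard-codes locality and weight sharing, and the assumption shows up directly in the parameter count.

```python
# Toy illustration: architecture as inductive bias.
# Map a 32x32 single-channel input to a same-sized output.
H = W = 32
dense_params = (H * W) ** 2   # fully connected: 1,048,576 weights, no locality assumption
conv_params = 3 * 3           # 3x3 convolution: 9 shared weights, assumes locality
                              # and translation equivariance (filter reused everywhere)
print(dense_params, conv_params)
```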
7. Concept of Attention
Introduction
First derived from human intuition, the attention mechanism is a method for
encoding data based on the importance score assigned to each element.
Hu, Dichao. "An Introductory Survey on Attention Mechanisms in NLP Problems." SAI Intelligent Systems Conference (2019).
http://projects.i-ctm.eu/en/project/visual-attention
8. Why Transformer?
Introduction
Inspired by the major success of transformer architectures in the field of NLP,
many researchers have applied them to other domains (vision, graph, ...).
Han, Kai, et al. "A survey on visual transformer." arXiv preprint (2020).
10. Applying Transformer to various domains
Introduction
Input per domain:
- NLP: token sequence
- Vision: image
- Graph: graph (nodes + edges)
Transformer network components:
- Positional encoding
- Self-attention
- Layer normalization
- ...
Focus on which components should be revised for each domain.
11. Why Transformer?
Recap of Transformer
Motivation for designing the Transformer:
1) Parallel computation is available
2) Long-range dependencies (global perspective)
12. Two Major Components of the Transformer
Recap of Transformer
1) Positional encoding
2) Multi-head attention (with self-attention)
Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems 30 (2017).
13. 1) Positional encoding (PE)
Recap of Transformer
Each position is encoded with sin/cos functions, then summed with the token embedding.
[Figure: embeddings E of "I love you" added element-wise to positional encodings]
*Summation is memory efficient, though less expressive than concatenation.
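For reference, a minimal NumPy sketch of the sinusoidal encoding from Vaswani et al. (2017); the function name and shapes are my own choices.

```python
import numpy as np

def sinusoidal_pe(seq_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(...)."""
    assert d_model % 2 == 0                            # assume an even model dimension
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]              # (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)      # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# x = token_embeddings + sinusoidal_pe(seq_len, d_model)  # summed, not concatenated
```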
14. 2) Self-Attention
Recap of Transformer
Simple matrix multiplications using the concept of Query, Key, and Value vectors.
For the embedded input matrix $X$ (rows = tokens "I", "love", "you"):
$$Q = X W_q, \qquad K = X W_k, \qquad V = X W_v$$
where $W_q$, $W_k$, $W_v$ are parameters to be learned.
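A NumPy sketch of these projections (random matrices stand in for the learned parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 3, 8                    # three tokens: "I", "love", "you"
X = rng.normal(size=(seq_len, d_model))    # embedded input

W_q = rng.normal(size=(d_model, d_model))  # parameters to be learned
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q, K, V = X @ W_q, X @ W_k, X @ W_v        # simple matrix multiplications
```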
15. 2) Self-Attention
Recap of Transformer
Now, based on the Q, K, V, Scaled Dot-Product Attention is calculated as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$
[Figure: for "I love you", the query "I" is scored against the keys "I", "love", "you"
(e.g., scores 130, 50, 10); softmax turns the scores into weights (0.92, 0.06, 0.02),
and the attention layer output is the weighted sum of the value vectors.]
There is no dependency between the queries at all: parallel computation is available!
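Continuing the sketch with the Q, K, V from the previous block; note that every query is handled in a single matrix product.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # query-key similarity scores
    scores -= scores.max(axis=-1, keepdims=True)      # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # each row sums to 1
    return weights @ V                                # weighted sum of value vectors

# Every row of Q is processed independently in one matmul:
# no dependency between queries -> parallel computation.
```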
17. TF in Vision Domain
On vision transformer
How can we apply the transformer (TF) to an image?
18. TF in Vision Domain
On vision transformer
How can we apply the transformer (TF) to an image?
Two major components of TF:
1) Self-attention
2) Positional encoding (PE)
19. Vision Transformer (ViT)
On vision transformer
1) Self-attention
2) Positional encoding (PE)
Defining the unit of encoding (the token in NLP).
Natural unit of an image: the pixel.
However, pixel-wise self-attention is too inefficient
(self-attention cost grows quadratically with the number of units).
20. Vision Transformer (ViT)
On vision transformer
1) Self-attention
2) Positional encoding (PE)
Defining the unit of encoding (the token in NLP).
Proposed method: flattened 2D patches.
Each patch is flattened and linearly projected with a learned matrix W,
then fed to self-attention as Q, K, V.
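A minimal NumPy sketch of this patch tokenization (16x16 patches of a 224x224 image give 196 tokens; the zero arrays are placeholders for real data and learned weights):

```python
import numpy as np

def image_to_patches(img: np.ndarray, p: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened p x p patches -> (N, p*p*C)."""
    H, W, C = img.shape
    assert H % p == 0 and W % p == 0
    patches = img.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, p * p * C)

img = np.zeros((224, 224, 3))
tokens = image_to_patches(img, 16)   # (196, 768): 196 "words" of dimension 768
W = np.zeros((768, 512))             # linear projection, learned in practice
embedded = tokens @ W                # patch embeddings fed to the transformer
```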
21. Vision Transformer (ViT)
On vision transformer
1) Self-attention
2) Positional encoding (PE)
Since an image lives in a spatial domain, encode position either in 1D order or by 2D coordinates:
1D) the i-th patch in raster order, or
2D) the (i,j)-th patch.
For 2D, the x and y components are encoded separately and concatenated.
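A sketch of the 2D variant, reusing sinusoidal_pe from the earlier block: rows and columns are encoded separately with half the dimensions each, then concatenated.

```python
import numpy as np

def pe_2d(n_rows: int, n_cols: int, d_model: int) -> np.ndarray:
    """Concatenate separate row (y) and column (x) encodings for each patch."""
    row_pe = sinusoidal_pe(n_rows, d_model // 2)       # (n_rows, d/2)
    col_pe = sinusoidal_pe(n_cols, d_model // 2)       # (n_cols, d/2)
    grid = np.concatenate(
        [np.repeat(row_pe[:, None, :], n_cols, axis=1),
         np.repeat(col_pe[None, :, :], n_rows, axis=0)],
        axis=-1,                                       # (n_rows, n_cols, d_model)
    )
    return grid.reshape(n_rows * n_cols, d_model)      # flattened in raster (1D) order
```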
23. Vision Transformer (ViT): Overview
On vision transformer
Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." (2020).
24. Inductive Biases in CNN
On vision transformer
Inductive biases in CNN: locality + translation equivariance.
However, a strong inductive bias can be harmful for some tasks.
https://anhreynolds.com/blogs/cnn.html
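A quick check of the translation-equivariance bias (my own example; with circular shifts and wrap-around boundaries the property holds exactly):

```python
import numpy as np
from scipy.signal import correlate2d

# Translation equivariance: conv(shift(x)) == shift(conv(x)).
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))                      # toy "image"
k = rng.normal(size=(3, 3))                      # toy filter

shift = lambda a: np.roll(a, 1, axis=1)          # shift one pixel to the right
y1 = correlate2d(shift(x), k, mode="same", boundary="wrap")
y2 = shift(correlate2d(x, k, mode="same", boundary="wrap"))
assert np.allclose(y1, y2)                       # shifting commutes with filtering
```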
25. Vision Transformer (ViT): Overview
On vision transformer
ViT uses the Transformer:
1. Parallel computation
2. Global perspective
26. When to Choose ViT?
On vision transformer
ViT uses the Transformer:
1. Parallel computation: already available in CNNs
2. Global perspective: less inductive bias than CNNs
ViT is useful for tasks where the generalization property is important
(few-shot learning, large-dataset training, ...).
28. Attention in Graph: GAT
On graph transformer
GAT (Graph Attention Network)
Attention in graph: the attention is a function of the neighborhood connectivity.
Veličković, Petar, et al. "Graph attention networks." (2017).
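For reference, the attention coefficient from Veličković et al. (2017); the softmax normalizes only over the neighborhood $\mathcal{N}(i)$, which is exactly why GAT attention is a function of local connectivity:
$$\alpha_{ij} = \frac{\exp\!\left(\mathrm{LeakyReLU}\!\left(\mathbf{a}^{\top}\left[\mathbf{W}\mathbf{h}_i \,\Vert\, \mathbf{W}\mathbf{h}_j\right]\right)\right)}{\sum_{k \in \mathcal{N}(i)} \exp\!\left(\mathrm{LeakyReLU}\!\left(\mathbf{a}^{\top}\left[\mathbf{W}\mathbf{h}_i \,\Vert\, \mathbf{W}\mathbf{h}_k\right]\right)\right)}$$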
29. Attention in Graph: GAT
On graph transformer
GAT (Graph Attention Network)
Limitation: GAT attention has nothing to do with global connectivity.
30. GT (Graph Transformer)
On graph transformer
1) Self-attention
2) Positional encoding (PE)
The unit of GT is each node, like each token in NLP.
e.g.) A sentence in NLP can be viewed as a discrete line graph:
"My future is bright"
31. GT (Graph Transformer)
On graph transformer
1) Self-attention
2) Positional encoding (PE)
The difference is that layer-wise global Q, K, V matrices are constructed,
instead of constructing node-wise Q, K, V.
Dwivedi, Vijay Prakash, and Xavier Bresson. "A generalization of transformer networks to graphs." arXiv preprint arXiv:2012.09699 (2020).
32. GT (Graph Transformer)
On graph transformer
1) Self-attention
2) Positional encoding (PE)
The difference is that layer-wise global Q, K, V matrices are constructed,
instead of constructing node-wise Q, K, V.
This seems unable to utilize the full power of self-attention.
Dwivedi, Vijay Prakash, and Xavier Bresson. "A generalization of transformer networks to graphs." arXiv preprint arXiv:2012.09699 (2020).
33. GT (Graph Transformer)
On graph transformer
1) Self-attention
2) Positional encoding (PE)
Limitation of GAT: no global connectivity.
How can we encode the position of a node considering the global graph structure?
35. GT (Graph Transformer)
On graph transformer
1) Self-attention
2) Positional encoding (PE)
Laplacian eigenvectors:
$$L = I - D^{-1/2} A D^{-1/2} = U^{T} \Lambda U$$
where the columns of $U$ are the eigenvectors and $\Lambda$ holds the eigenvalues.
Why the graph Laplacian?
1) Distance-aware node features
(i.e., nearby nodes have similar positional features
and farther nodes have dissimilar positional features)
2) An NLP graph's Laplacian eigenvectors are naturally cosine and sine functions.
37. GT (Graph Transformer)
On graph transformer
2) An NLP graph's Laplacian eigenvectors are naturally cosine and sine functions.
Let's try to derive it. The normalized Laplacian of a line graph is (up to the two boundary rows) tridiagonal and Toeplitz:
$$L = \begin{pmatrix}
1 & -1 & & & \\
-0.5 & 1 & -0.5 & & \\
 & \ddots & \ddots & \ddots & \\
 & & -0.5 & 1 & -0.5 \\
 & & & -1 & 1
\end{pmatrix}$$
If a matrix is tridiagonal and is also Toeplitz, its eigenvalues are known to be [Noschese et al., 2013]:
$$\lambda_k = \delta + 2\sqrt{\sigma\tau}\,\cos\!\left(\frac{k\pi}{n+1}\right), \quad k = 1, \dots, n,$$
where $\delta$ is the diagonal entry and $\sigma$, $\tau$ are the off-diagonal entries.
This is a function of cosine, and the corresponding eigenvector components are proportional to $\sin\!\left(\frac{jk\pi}{n+1}\right)$, a function of sine.
Noschese, S., Pasquini, L., and Reichel, L. (2013). "Tridiagonal Toeplitz matrices: Properties and novel applications." Numerical Linear Algebra with Applications.
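A quick numerical check of the cosine formula for the symmetric case (my own sketch; for $\sigma = \tau$ the sorted spectra from the formula and from direct eigendecomposition coincide):

```python
import numpy as np

# Eigenvalues of a symmetric tridiagonal Toeplitz matrix
# (diagonal delta, off-diagonals tau) vs. delta + 2*tau*cos(k*pi/(n+1)).
n, delta, tau = 8, 1.0, -0.5
M = delta * np.eye(n) + tau * (np.eye(n, k=1) + np.eye(n, k=-1))
computed = np.sort(np.linalg.eigvalsh(M))
k = np.arange(1, n + 1)
formula = np.sort(delta + 2 * tau * np.cos(k * np.pi / (n + 1)))
assert np.allclose(computed, formula)
```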
38. GT (Graph Transformer)
On graph transformer
1) Self-attention
2) Positional encoding (PE)
Laplacian eigenvectors: $L = I - D^{-1/2} A D^{-1/2} = U^{T} \Lambda U$ (eigenvectors $U$, eigenvalues $\Lambda$).
Then, select the k smallest non-trivial eigenvectors as the PE for each node (see the sketch below):
1) k for dimension matching
2) smallest to provide a smooth encoding
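A minimal sketch of this PE (function name is mine; note the paper also randomly flips eigenvector signs during training, since each eigenvector's sign is arbitrary):

```python
import numpy as np

def laplacian_pe(A: np.ndarray, k: int) -> np.ndarray:
    """k smallest non-trivial eigenvectors of the normalized graph Laplacian."""
    deg = A.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    L = np.eye(len(A)) - d_inv_sqrt @ A @ d_inv_sqrt  # L = I - D^{-1/2} A D^{-1/2}
    _, eigvecs = np.linalg.eigh(L)                    # columns sorted by eigenvalue
    return eigvecs[:, 1:k + 1]                        # drop the trivial first one

# Line graph of the sentence "My future is bright" (4 nodes in a path):
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
pe = laplacian_pe(A, k=2)                             # (4, 2) positional features
```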
43. Conclusions
Transformer in various domains
1. Injecting the proper inductive bias for a given task is important.
2. For the graph domain, the justification for using the TF seems weak.
3. The task that benefits the most is NLP, followed by vision, then graph
(NLP > Vision > Graph).