Presenter: Jinduk Park
reading group meeting material 1
Transformers in various domains
School of Mathematics and Computing (Computational Science and Engineering)
Yonsei Univ, Seoul, Korea
Contents
- On vision transformer
Recap of Transformer
Transformer in various domains
2
- On graph transformer
- Introduction
- Recap of Transformer
- Some discussion
Recap of Transformer
3
A Preliminary: Inductive Bias
Introduction
4
Abnar et al, "Transferring inductive biases through knowledge distillation." (2020).
Inductive biases are the characteristics of learning algorithms that influence
their generalization behaviour, independent of data.
(a) has a weaker inductive bias than (b) or (c)
A Preliminary: Inductive Bias
Introduction
5
Abnar et al, "Transferring inductive biases through knowledge distillation." (2020).
[Training paths vs epochs on MNIST task]
- Proper choice of inductive bias
-> good convergence
with limited training resources.
- No inductive bias -> local minima
- Wrong choice of inductive bias
(wrong assumption) -> wrong results
A Preliminary: Inductive Bias
Introduction
6
How can we inject Inductive bias?
1. Appropriate architecture 2. Appropriate objective function
3. Appropriate optimization method
+ 𝛼... (but not limited to these)
https://www.researchgate.net/figure/This-figure-Shows-multi-SGD-optimizer_fig3_327135988
https://www.ibm.com/cloud/learn/convolutional-neural-networks
Concept of Attention
7
An Introductory Survey on Attention Mechanisms in NLP Problems, 2019 SAIISC
http://projects.i-ctm.eu/en/project/visual-attention
First derived from human intuition,
the attention mechanism is a method used for encoding data
based on the importance score assigned to each element.
Introduction
Why Transformer?
8
Introduction
Inspired by the major success of transformer architectures in the field of NLP,
many researchers have applied it to other domains (vision, graph, ...)
Han, Kai, et al. "A survey on visual transformer." arXiv e-prints (2020): arXiv-2012.
Why Transformer?
9
Introduction
Transformer is designed for NLP
: it cannot be directly applied to other tasks;
a proper inductive bias for the specific data structure is needed.
Applying Transformer to various domains
10
NLP
token sequence
Vision
Image
Graph
Transformer Network
Components
- Positional encoding
- Self-attention
- Layer normalization
...
Focus on which components
should be revised
Introduction
11
Motivation of designing Transformer
1) Parallel computation is available
2) Long-range dependencies (global perspective)
Why Transformer?
Recap of Transformer
2 Major Components of Transformer
Recap of Transformer
12
1) Positional encoding
2) Multi-head attention
(with self-attention)
Vaswani, Ashish, et al. "Attention is all you need." Advances in neural
information processing systems 30 (2017).
1) Positional encoding (PE)
13
Each position is encoded with sine/cosine functions,
then summed with the token embedding.
[Figure: embeddings (E) of “I love you” summed (+)
element-wise with their positional encodings]
Recap of Transformer
*Summation is memory-efficient, though less expressive than concatenation.
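The sine/cosine encoding and the summation above can be sketched in a few lines of NumPy (shapes and variable names are illustrative):

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    """Sinusoidal PE from "Attention Is All You Need":
    PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]        # even dimension indices
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# The PE is simply added to the token embeddings ("I love you" -> 3 tokens):
emb = np.random.randn(3, 8)      # toy embeddings, d_model = 8
x = emb + sinusoidal_pe(3, 8)    # input to the first Transformer layer
```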
14
Simple matrix multiplications using the concept of
Query, Key, Value vectors:
each token embedding (“I”, “love”, “you”) is multiplied by the
learned parameter matrices 𝑊𝑞, 𝑊𝑘, 𝑊𝑣 to produce its
Query, Key, and Value vectors.
2) Self Attention
Recap of Transformer
15
Now, based on the Q, K, V,
Scaled Dot-Product Attention is calculated as:
2) Self Attention
Recap of Transformer
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V

Worked example for the query of “I” against the keys of
“I”, “love”, “you”:
raw scores (I·I, I·love, I·you) = (130, 50, 10)
softmax → (0.92, 0.06, 0.02)
output = ∑ (softmax weight × value vector)
There is no dependency between queries at all
: parallel computation is available!
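The whole computation above fits in a short NumPy sketch (toy shapes; 𝑊𝑞, 𝑊𝑘, 𝑊𝑣 are random stand-ins for learned weights):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
    All queries are handled in one matrix product, hence the parallelism."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])              # (n_q, n_k)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)                # row-wise softmax
    return w @ V, w

# Toy "I love you" example: 3 tokens, projections W_q, W_k, W_v
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))                              # token embeddings
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, w = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
```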
Transformer in various domains
16
TF in Vision Domain
On vision transformer
17
How can we apply transformer (TF) to an image ?
TF in Vision Domain
18
Two major components of TF :
1) Self-attention
2) Positional encoding (PE)
On vision transformer
How can we apply transformer (TF) to an image ?
Vision Transformer (ViT)
19
Defining the unit of encoding (a token in NLP)
However, pixel-wise self-attention is
computationally prohibitive.
On vision transformer
Unit of image: pixel
1) Self-attention
2) Positional encoding (PE)
Vision Transformer (ViT)
20
Defining the unit of encoding (a token in NLP)
Proposed method:
split the image into 2D patches; each patch is
flattened and multiplied by a learned projection matrix W.
1) Self-attention
2) Positional encoding (PE)
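The patch-flattening step can be sketched as follows (ViT-Base/16-like sizes for illustration; W is a random stand-in for the learned projection):

```python
import numpy as np

def image_to_patches(img, p):
    """Split an (H, W, C) image into non-overlapping p x p patches and
    flatten each into a vector, as in ViT's patch embedding step."""
    H, W, C = img.shape
    x = img.reshape(H // p, p, W // p, p, C)
    x = x.transpose(0, 2, 1, 3, 4)            # group the two patch-grid axes
    return x.reshape(-1, p * p * C)           # (num_patches, p*p*C)

img = np.random.rand(224, 224, 3)
tokens = image_to_patches(img, 16)            # 14*14 = 196 patch "tokens"
W_proj = np.random.randn(16 * 16 * 3, 768)    # random stand-in for learned W
embedded = tokens @ W_proj                    # (196, 768) patch embeddings
```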
Vision Transformer (ViT)
21
On vision transformer
Since an image lies in a spatial domain,
position can be encoded in a 1D order or with 2D coordinates:
1D) the i-th patch in raster order, or
2D) the (i,j)-th patch.
For 2D, the x and y components are encoded separately
and concatenated.
1) Self-attention
2) Positional encoding (PE)
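A sketch of the separate-encode-then-concatenate idea for the 2D case. Sinusoidal encodings are used here purely for illustration; ViT's positional embeddings are actually learned, and `pe_1d`/`pe_2d` are illustrative names:

```python
import numpy as np

def pe_1d(pos, d):
    """Sinusoidal encoding of a single scalar position into d dimensions."""
    i = np.arange(0, d, 2)
    ang = pos / (10000 ** (i / d))
    out = np.zeros(d)
    out[0::2], out[1::2] = np.sin(ang), np.cos(ang)
    return out

def pe_2d(row, col, d):
    """2D variant: encode the row and column indices separately,
    each with half the dimensions, then concatenate."""
    return np.concatenate([pe_1d(row, d // 2), pe_1d(col, d // 2)])

pe = pe_2d(3, 5, 64)   # PE for the (3, 5)-th patch, shape (64,)
```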
Vision Transformer (ViT)
22
On vision transformer
The 1D encoding is chosen, empirically.
1) Self-attention
2) Positional encoding (PE)
[Ablation table: accuracy (Acc.) of each PE variant]
23
Vision Transformer (ViT): Overview
On vision transformer
Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers
for image recognition at scale." (2020).
24
Inductive biases in CNN
: locality + translation equivariance
On vision transformer
https://anhreynolds.com/blogs/cnn.html
However, a strong inductive bias can be harmful for some tasks
25
ViT uses Transformer
Vision Transformer (ViT): Overview
On vision transformer
1. Parallel computation
2. Global perspective
26
When to choose ViT?
On vision transformer
1. Parallel computation
: Already available in CNN
2. Global perspective
: less inductive bias than CNN
ViT is useful for tasks where generalization is important
(few-shot learning, large dataset training, ...)
ViT uses Transformer
27
*ViT-model size(L,B)/patch size(16,32)
As the number of pretraining samples increases,
ViT’s performance eventually exceeds that of the CNN model.
Experimental Validation
On vision transformer
28
Attention in Graph: GAT
On graph transformer
Veličković, Petar, et al. "Graph attention networks." (2017).
Attention in graph:
GAT (Graph Attention Network)
Attention is a function of the neighborhood connectivity.
29
Attention in Graph: GAT
On graph transformer
GAT (Graph Attention Network)
Limitation:
GAT attention has nothing to do with
global connectivity.
30
GT (Graph Transformer)
On graph transformer
1) Self-attention
2) Positional encoding (PE)
A sentence in NLP can be viewed as a
discrete line graph
(e.g., “My future is bright”).
The unit of GT is each node, just as the unit in NLP is each token.
31
GT (Graph Transformer)
On graph transformer
1) Self-attention
2) Positional encoding (PE)
The difference is that layer-wise global Q, K, V are
constructed,
instead of node-wise Q, K, V.
Dwivedi, Vijay Prakash, and Xavier Bresson. "A generalization of
transformer networks to graphs." arXiv preprint arXiv:2012.09699 (2020).
32
GT (Graph Transformer)
On graph transformer
1) Self-attention
2) Positional encoding (PE)
The difference is that layer-wise global Q, K, V are
constructed,
instead of node-wise Q, K, V.
This seems unable to utilize the full power of self-attention.
Dwivedi, Vijay Prakash, and Xavier Bresson. "A generalization of
transformer networks to graphs." arXiv preprint arXiv:2012.09699 (2020).
33
GT (Graph Transformer)
On graph transformer
Limitation of GAT: no global connectivity
How can we encode the position of a node
considering the global graph structure?
1) Self-attention
2) Positional encoding (PE)
34
GT (Graph Transformer)
On graph transformer
Laplacian eigenvectors:
Δ = I − D^(−1/2) A D^(−1/2) = Uᵀ Λ U,
with eigenvectors U and eigenvalues Λ
1) Self-attention
2) Positional encoding (PE)
35
GT (Graph Transformer)
On graph transformer
Laplacian eigenvectors:
Δ = I − D^(−1/2) A D^(−1/2) = Uᵀ Λ U,
with eigenvectors U and eigenvalues Λ
1) Self-attention
2) Positional encoding (PE)
Why the graph Laplacian?
1) Distance-aware node features
(i.e., nearby nodes have similar positional features
and farther nodes have dissimilar positional features)
2) The Laplacian eigenvectors of an NLP (line) graph are
naturally cosine and sine functions
36
GT (Graph Transformer)
On graph transformer
2) NLP graph’s Laplacian eigenvectors are naturally cosine and sine function
Let’s try to derive it.
For a line graph:
- Adjacency matrix A: 1 on the two off-diagonals
- Normalized adjacency D^(−1/2) A D^(−1/2): 0.5 on the
off-diagonals (up to boundary rows)
- Normalized Laplacian Δ = I − D^(−1/2) A D^(−1/2): 1 on the
diagonal, −0.5 on the off-diagonals (up to boundary rows)
37
GT (Graph Transformer)
On graph transformer
2) NLP graph’s Laplacian eigenvectors are naturally cosine and sine function
Let’s try to derive it.
The normalized Laplacian of the line graph is, up to its
boundary rows, a tridiagonal Toeplitz matrix: 1 on the diagonal,
−0.5 on the off-diagonals.
If a matrix is symmetric tridiagonal and also Toeplitz, with
diagonal entry a and off-diagonal entry b, its eigenvalues are
known to be [ref]:
λₖ = a + 2b cos(kπ / (n+1)), k = 1, …, n,
which is a function of cosine
(here a = 1, b = −0.5, so λₖ = 1 − cos(kπ / (n+1))).
Noschese, S.; Pasquini, L.; Reichel, L. (2013). "Tridiagonal Toeplitz matrices:
Properties and novel applications". Numerical Linear Algebra with Applications.
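A quick NumPy check of this closed form, with a = 1 and b = −0.5 as in the Laplacian above (the matrix size n is arbitrary):

```python
import numpy as np

# Symmetric tridiagonal Toeplitz matrix T = tridiag(b, a, b): here the
# interior of the line graph's normalized Laplacian, a = 1, b = -0.5.
n, a, b = 8, 1.0, -0.5
off = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
T = a * np.eye(n) + b * off

# Closed form: lambda_k = a + 2*b*cos(k*pi/(n+1)), k = 1..n -> a cosine.
k = np.arange(1, n + 1)
analytic = np.sort(a + 2 * b * np.cos(k * np.pi / (n + 1)))
numeric = np.sort(np.linalg.eigvalsh(T))
assert np.allclose(numeric, analytic)
# The eigenvector entries are sines: v_k[j] proportional to sin(j*k*pi/(n+1)).
```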
38
GT (Graph Transformer)
On graph transformer
Laplacian eigenvectors:
Δ = I − D^(−1/2) A D^(−1/2) = Uᵀ Λ U,
with eigenvectors U and eigenvalues Λ
1) Self-attention
2) Positional encoding (PE)
Then, select the eigenvectors of the k smallest
non-trivial eigenvalues as the PE of each node:
1) k for dimension matching
2) smallest eigenvalues to provide a smooth encoding
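Under these choices, the PE computation is only a few lines of NumPy (a sketch; `laplacian_pe` is an illustrative name):

```python
import numpy as np

def laplacian_pe(A, k):
    """Laplacian PE sketch: the eigenvectors of the k smallest non-trivial
    eigenvalues of the normalized Laplacian serve as node PEs.
    (Each eigenvector's sign is arbitrary; the GT paper flips signs
    randomly during training.)"""
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    L = np.eye(len(A)) - d_inv_sqrt @ A @ d_inv_sqrt
    _, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    return vecs[:, 1:k + 1]              # drop the trivial first eigenvector

# 4-node line graph, e.g. the sentence "My future is bright"
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
pe = laplacian_pe(A, 2)                  # 2-dimensional PE per node
```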
PE ablation study
Transformer in various domains
39
[Ablation table: PE variants compared by MAE and accuracy]
40
GT (Graph Transformer): Overview
On graph transformer
1) Self-attention (over neighbors) 2) PE
41
GT (Graph Transformer): Overview
On graph transformer
* Edge feature-aided version
Conclusions
Transformer in various domains
42
[Summary table: what TF brings to each domain’s default model]
- Parallel computation: new for RNN (NLP); already available
in CNN (Vision) and GNN (Graph)
- Long-range dependency: beneficial in all three domains,
though for GNN (Graph) the benefit seems insufficient
Conclusions
Transformer in various domains
43
1. Injecting a proper inductive bias for a given task is important.
2. For the graph domain, the justification for using TF seems weak.
3. The task that benefits the most is NLP, followed by vision,
then graph (NLP > Vision > Graph).
Thank you for listening.
More Related Content

Similar to Reading_0413_var_Transformers.pptx

Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)Gaurav Mittal
 
Image Fusion Ehancement using DT-CWT Technique
Image Fusion Ehancement using DT-CWT TechniqueImage Fusion Ehancement using DT-CWT Technique
Image Fusion Ehancement using DT-CWT TechniqueIRJET Journal
 
Deep learning for molecules, introduction to chainer chemistry
Deep learning for molecules, introduction to chainer chemistryDeep learning for molecules, introduction to chainer chemistry
Deep learning for molecules, introduction to chainer chemistryKenta Oono
 
Attention is all you need (UPC Reading Group 2018, by Santi Pascual)
Attention is all you need (UPC Reading Group 2018, by Santi Pascual)Attention is all you need (UPC Reading Group 2018, by Santi Pascual)
Attention is all you need (UPC Reading Group 2018, by Santi Pascual)Universitat Politècnica de Catalunya
 
最近の研究情勢についていくために - Deep Learningを中心に -
最近の研究情勢についていくために - Deep Learningを中心に - 最近の研究情勢についていくために - Deep Learningを中心に -
最近の研究情勢についていくために - Deep Learningを中心に - Hiroshi Fukui
 
Batch normalization presentation
Batch normalization presentationBatch normalization presentation
Batch normalization presentationOwin Will
 
Bio-inspired Algorithms for Evolving the Architecture of Convolutional Neural...
Bio-inspired Algorithms for Evolving the Architecture of Convolutional Neural...Bio-inspired Algorithms for Evolving the Architecture of Convolutional Neural...
Bio-inspired Algorithms for Evolving the Architecture of Convolutional Neural...Ashray Bhandare
 
Efficient analytical and hybrid simulations using OpenSees
Efficient analytical and hybrid simulations using OpenSeesEfficient analytical and hybrid simulations using OpenSees
Efficient analytical and hybrid simulations using OpenSeesopenseesdays
 
Comparison of Various RCNN techniques for Classification of Object from Image
Comparison of Various RCNN techniques for Classification of Object from ImageComparison of Various RCNN techniques for Classification of Object from Image
Comparison of Various RCNN techniques for Classification of Object from ImageIRJET Journal
 
Paper review: Learned Optimizers that Scale and Generalize.
Paper review: Learned Optimizers that Scale and Generalize.Paper review: Learned Optimizers that Scale and Generalize.
Paper review: Learned Optimizers that Scale and Generalize.Wuhyun Rico Shin
 
Point Cloud Processing: Estimating Normal Vectors and Curvature Indicators us...
Point Cloud Processing: Estimating Normal Vectors and Curvature Indicators us...Point Cloud Processing: Estimating Normal Vectors and Curvature Indicators us...
Point Cloud Processing: Estimating Normal Vectors and Curvature Indicators us...Pirouz Nourian
 
Deep learning from a novice perspective
Deep learning from a novice perspectiveDeep learning from a novice perspective
Deep learning from a novice perspectiveAnirban Santara
 
第13回 配信講義 計算科学技術特論A(2021)
第13回 配信講義 計算科学技術特論A(2021)第13回 配信講義 計算科学技術特論A(2021)
第13回 配信講義 計算科学技術特論A(2021)RCCSRENKEI
 
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTION
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTIONMEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTION
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTIONcscpconf
 
Median based parallel steering kernel regression for image reconstruction
Median based parallel steering kernel regression for image reconstructionMedian based parallel steering kernel regression for image reconstruction
Median based parallel steering kernel regression for image reconstructioncsandit
 
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTION
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTIONMEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTION
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTIONcsandit
 

Similar to Reading_0413_var_Transformers.pptx (20)

Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)
 
Image Fusion Ehancement using DT-CWT Technique
Image Fusion Ehancement using DT-CWT TechniqueImage Fusion Ehancement using DT-CWT Technique
Image Fusion Ehancement using DT-CWT Technique
 
Deep learning for molecules, introduction to chainer chemistry
Deep learning for molecules, introduction to chainer chemistryDeep learning for molecules, introduction to chainer chemistry
Deep learning for molecules, introduction to chainer chemistry
 
Attention is all you need (UPC Reading Group 2018, by Santi Pascual)
Attention is all you need (UPC Reading Group 2018, by Santi Pascual)Attention is all you need (UPC Reading Group 2018, by Santi Pascual)
Attention is all you need (UPC Reading Group 2018, by Santi Pascual)
 
最近の研究情勢についていくために - Deep Learningを中心に -
最近の研究情勢についていくために - Deep Learningを中心に - 最近の研究情勢についていくために - Deep Learningを中心に -
最近の研究情勢についていくために - Deep Learningを中心に -
 
IPT.pdf
IPT.pdfIPT.pdf
IPT.pdf
 
Batch normalization presentation
Batch normalization presentationBatch normalization presentation
Batch normalization presentation
 
Bio-inspired Algorithms for Evolving the Architecture of Convolutional Neural...
Bio-inspired Algorithms for Evolving the Architecture of Convolutional Neural...Bio-inspired Algorithms for Evolving the Architecture of Convolutional Neural...
Bio-inspired Algorithms for Evolving the Architecture of Convolutional Neural...
 
Visual Transformers
Visual TransformersVisual Transformers
Visual Transformers
 
Scene understanding
Scene understandingScene understanding
Scene understanding
 
Efficient analytical and hybrid simulations using OpenSees
Efficient analytical and hybrid simulations using OpenSeesEfficient analytical and hybrid simulations using OpenSees
Efficient analytical and hybrid simulations using OpenSees
 
Comparison of Various RCNN techniques for Classification of Object from Image
Comparison of Various RCNN techniques for Classification of Object from ImageComparison of Various RCNN techniques for Classification of Object from Image
Comparison of Various RCNN techniques for Classification of Object from Image
 
Smart Room Gesture Control
Smart Room Gesture ControlSmart Room Gesture Control
Smart Room Gesture Control
 
Paper review: Learned Optimizers that Scale and Generalize.
Paper review: Learned Optimizers that Scale and Generalize.Paper review: Learned Optimizers that Scale and Generalize.
Paper review: Learned Optimizers that Scale and Generalize.
 
Point Cloud Processing: Estimating Normal Vectors and Curvature Indicators us...
Point Cloud Processing: Estimating Normal Vectors and Curvature Indicators us...Point Cloud Processing: Estimating Normal Vectors and Curvature Indicators us...
Point Cloud Processing: Estimating Normal Vectors and Curvature Indicators us...
 
Deep learning from a novice perspective
Deep learning from a novice perspectiveDeep learning from a novice perspective
Deep learning from a novice perspective
 
第13回 配信講義 計算科学技術特論A(2021)
第13回 配信講義 計算科学技術特論A(2021)第13回 配信講義 計算科学技術特論A(2021)
第13回 配信講義 計算科学技術特論A(2021)
 
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTION
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTIONMEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTION
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTION
 
Median based parallel steering kernel regression for image reconstruction
Median based parallel steering kernel regression for image reconstructionMedian based parallel steering kernel regression for image reconstruction
Median based parallel steering kernel regression for image reconstruction
 
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTION
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTIONMEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTION
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTION
 

More from congtran88

Logic menh de nhap mon tri tue nhan tao ptit
Logic menh de nhap mon tri tue nhan tao ptitLogic menh de nhap mon tri tue nhan tao ptit
Logic menh de nhap mon tri tue nhan tao ptitcongtran88
 
InceptionV3 model deep learning image processing
InceptionV3 model deep learning image processingInceptionV3 model deep learning image processing
InceptionV3 model deep learning image processingcongtran88
 
2. Chuyen de 11 Hang 3.pdf
2. Chuyen de 11 Hang 3.pdf2. Chuyen de 11 Hang 3.pdf
2. Chuyen de 11 Hang 3.pdfcongtran88
 
GENERATIVE GRAPH CONVOLUTIONAL NETWORK FOR GROWING GRAPHS.pptx
GENERATIVE GRAPH CONVOLUTIONAL NETWORK FOR GROWING GRAPHS.pptxGENERATIVE GRAPH CONVOLUTIONAL NETWORK FOR GROWING GRAPHS.pptx
GENERATIVE GRAPH CONVOLUTIONAL NETWORK FOR GROWING GRAPHS.pptxcongtran88
 
Chapter 01 slides.pptx
Chapter 01 slides.pptxChapter 01 slides.pptx
Chapter 01 slides.pptxcongtran88
 
S2-5.-KardiaChain-1-Ông-Nguyễn-Ngọc-Hưng-Chủ-tịch-Công-ty-Cổ-phần-Công-nghệ-K...
S2-5.-KardiaChain-1-Ông-Nguyễn-Ngọc-Hưng-Chủ-tịch-Công-ty-Cổ-phần-Công-nghệ-K...S2-5.-KardiaChain-1-Ông-Nguyễn-Ngọc-Hưng-Chủ-tịch-Công-ty-Cổ-phần-Công-nghệ-K...
S2-5.-KardiaChain-1-Ông-Nguyễn-Ngọc-Hưng-Chủ-tịch-Công-ty-Cổ-phần-Công-nghệ-K...congtran88
 
Knowledge Based Systems.ppt
Knowledge Based Systems.pptKnowledge Based Systems.ppt
Knowledge Based Systems.pptcongtran88
 
Preprocessing.ppt
Preprocessing.pptPreprocessing.ppt
Preprocessing.pptcongtran88
 

More from congtran88 (9)

Logic menh de nhap mon tri tue nhan tao ptit
Logic menh de nhap mon tri tue nhan tao ptitLogic menh de nhap mon tri tue nhan tao ptit
Logic menh de nhap mon tri tue nhan tao ptit
 
InceptionV3 model deep learning image processing
InceptionV3 model deep learning image processingInceptionV3 model deep learning image processing
InceptionV3 model deep learning image processing
 
2. Chuyen de 11 Hang 3.pdf
2. Chuyen de 11 Hang 3.pdf2. Chuyen de 11 Hang 3.pdf
2. Chuyen de 11 Hang 3.pdf
 
GENERATIVE GRAPH CONVOLUTIONAL NETWORK FOR GROWING GRAPHS.pptx
GENERATIVE GRAPH CONVOLUTIONAL NETWORK FOR GROWING GRAPHS.pptxGENERATIVE GRAPH CONVOLUTIONAL NETWORK FOR GROWING GRAPHS.pptx
GENERATIVE GRAPH CONVOLUTIONAL NETWORK FOR GROWING GRAPHS.pptx
 
CSDLPT
CSDLPTCSDLPT
CSDLPT
 
Chapter 01 slides.pptx
Chapter 01 slides.pptxChapter 01 slides.pptx
Chapter 01 slides.pptx
 
S2-5.-KardiaChain-1-Ông-Nguyễn-Ngọc-Hưng-Chủ-tịch-Công-ty-Cổ-phần-Công-nghệ-K...
S2-5.-KardiaChain-1-Ông-Nguyễn-Ngọc-Hưng-Chủ-tịch-Công-ty-Cổ-phần-Công-nghệ-K...S2-5.-KardiaChain-1-Ông-Nguyễn-Ngọc-Hưng-Chủ-tịch-Công-ty-Cổ-phần-Công-nghệ-K...
S2-5.-KardiaChain-1-Ông-Nguyễn-Ngọc-Hưng-Chủ-tịch-Công-ty-Cổ-phần-Công-nghệ-K...
 
Knowledge Based Systems.ppt
Knowledge Based Systems.pptKnowledge Based Systems.ppt
Knowledge Based Systems.ppt
 
Preprocessing.ppt
Preprocessing.pptPreprocessing.ppt
Preprocessing.ppt
 

Recently uploaded

Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxHimangsuNath
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxHaritikaChhatwal1
 
Rithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdfRithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdfrahulyadav957181
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...Jack Cole
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelBoston Institute of Analytics
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectBoston Institute of Analytics
 
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdfWorld Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdfsimulationsindia
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataTecnoIncentive
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data VisualizationKianJazayeri1
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxSimranPal17
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 

Recently uploaded (20)

Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptx
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptx
 
Rithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdfRithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdf
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis Project
 
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdfWorld Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded data
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptx
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 

Reading_0413_var_Transformers.pptx

  • 1. Presentor: Jinduk park reading group meeting material 1 Transformers in various domains School of Mathematics and Computing (Computational Science and Engineering) Yonsei Univ, Seoul, Korea
  • 2. Contents - On vision transformer Recap of Transformer Transformer in various domains 2 - On graph transformer - Introduction - Recap of Transformer - Some discussion
  • 4. A Preliminary: Inductive Bias Introduction 4 Abnar et al, "Transferring inductive biases through knowledge distillation." (2020). Inductive biases are the characteristics of learning algorithms that influence their generalization behaviour, independent of data. (a) has a weak inductive bias than (b) or (c)
  • 5. A Preliminary: Inductive Bias Introduction 5 Abnar et al, "Transferring inductive biases through knowledge distillation." (2020). [Training paths vs epochs on MNIST task] - Proper choose of inductive bias -> good convergence with limited training resources. - Without inductive bias -> local minima - Wrong choose of inductive bias (wrong assumption) -> wrong results
  • 6. A Preliminary: Inductive Bias Introduction 6 How can we inject Inductive bias? 1. Appropriate architecture 2. Appropriate objective function 3. Appropriate optimization method + 𝛼... But not limited to https://www.researchgate.net/figure/This-figure-Shows-multi-SGD-optimizer_fig3_327135988 https://www.ibm.com/cloud/learn/convolutional-neural-networks
  • 7. Concept of Attention 7 An Introductory Survey on Attention Mechanisms in NLP Problems, 2019 SAIISC http://projects.i-ctm.eu/en/project/visual-attention The attention mechanism is a method for encoding data based on the importance score assigned to each element. It was first derived from human intuition. Introduction
  • 8. Why Transformer? 8 Introduction Inspired by the major success of transformer architectures in the field of NLP, many researchers have applied them to other domains (vision, graph, ...) Han, Kai, et al. "A survey on visual transformer." arXiv e-prints (2020): arXiv-2012.
  • 9. Why Transformer? 9 Introduction The Transformer is designed for NLP: it cannot be applied directly to other tasks; a proper inductive bias for the specific data structure is needed.
  • 10. Applying the Transformer to various domains 10 NLP: token sequence / Vision: image / Graph. Transformer network components: positional encoding, self-attention, batch normalization, ... Focus on which components should be revised. Introduction
  • 11. 11 Motivation of designing Transformer 1) Parallel computation is available 2) Long-range dependencies (global perspective) Why Transformer? Recap of Transformer
  • 12. 2 Major Components of Transformer Recap of Transformer 12 1) Positional encoding 2) Multi-head attention (with self-attention) Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).
  • 13. 1) Positional encoding (PE) 13 Each position is encoded with sin/cos functions, then summed with the token embedding. E E E I love you E E E I love you + + + Recap of Transformer *Summation is memory efficient, though less expressive than concatenation.
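The sinusoidal encoding on this slide can be sketched as follows (a minimal NumPy sketch of the Vaswani et al. formulation; the 3-token, 8-dimensional sizes are illustrative, not from the slide):

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d_model // 2)
    angle = pos / (10000.0 ** (2 * i / d_model))
    pe = np.empty((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                    # even dims: sine
    pe[:, 1::2] = np.cos(angle)                    # odd dims: cosine
    return pe

emb = np.random.randn(3, 8)                        # embeddings for "I love you"
x = emb + sinusoidal_pe(3, 8)                      # PE is summed, not concatenated
```

Summing keeps the model dimension fixed, which is the memory advantage the slide notes.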
  • 14. 14 Simple matrix multiplications using the concept of Query, Keys, Value vectors matrix * * * 𝑊 𝑞 𝑊𝑘 𝑊 𝑣 Parameters to be learned Query Key Value I love you I love you I love you I love you 2) Self Attention Recap of Transformer
  • 15. 15 Now, based on Q, K, V, Scaled Dot-Product Attention is calculated as: scores = QK^T (e.g., for the query "I": I · I = 130, I · love = 50, I · you = 10), Softmax(scores) = (0.92, 0.06, 0.02), and the attention layer output is Softmax(QK^T) · V, a weighted sum over the value vectors. 2) Self Attention Recap of Transformer. There is no dependency between the queries at all: parallel computation is available!
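The slide's computation can be written out end to end (a minimal sketch; the 3-token, 4-dimensional shapes and random weights are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, computed for all queries at once."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # pairwise query-key similarity
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                   # row-wise softmax
    return w @ V, w                                      # weighted sum of value vectors

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4))                          # 3 tokens: "I", "love", "you"
Wq, Wk, Wv = (rng.standard_normal((4, 4)) for _ in range(3))
out, attn = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
```

All queries are processed in one matrix product, which is exactly the parallelism the slide points out.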
  • 17. TF in Vision Domain On vision transformer 17 How can we apply transformer (TF) to an image ?
  • 18. TF in Vision Domain 18 Two major components of TF : 1) Self-attention 2) Positional encoding (PE) On vision transformer How can we apply transformer (TF) to an image ?
  • 19. Vision Transformer (ViT) 19 Defining the unit (token in NLP) of encoding. However, pixel-wise self-attention is too inefficient: its cost grows quadratically with the number of pixels. Q K V On vision transformer Unit of image: pixel 1) Self-attention 2) Positional encoding (PE)
  • 20. Vision Transformer (ViT) 20 Defining the unit (token in NLP) of encoding Q K V On vision transformer Proposed method: flattened 2D patches. Flattened * W 1) Self-attention 2) Positional encoding (PE)
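The patch-token construction on this slide can be sketched as follows (a minimal NumPy sketch; the 32x32x3 image, 16x16 patch size, and 64-dim projection are illustrative):

```python
import numpy as np

def image_to_patches(img, p):
    """Split an (H, W, C) image into non-overlapping p x p patches, each flattened row-major."""
    H, W, C = img.shape
    grid = img.reshape(H // p, p, W // p, p, C).swapaxes(1, 2)   # (H/p, W/p, p, p, C)
    return grid.reshape(-1, p * p * C)                           # one row per patch

img = np.arange(32 * 32 * 3, dtype=float).reshape(32, 32, 3)
tokens = image_to_patches(img, 16)             # 4 tokens of dim 16 * 16 * 3 = 768
W = np.random.randn(768, 64)
embeddings = tokens @ W                        # linear projection, the "* W" on the slide
```

Each flattened patch then plays exactly the role a word embedding plays in NLP.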
  • 21. Vision Transformer (ViT) 21 On vision transformer Since an image is in a spatial domain, encode position in a 1D order or as a 2D coordinate: 1D) i-th patch in the raster order, or 2D) (i,j)-th patch. For 2D, the x and y components are encoded separately and concatenated. 1) Self-attention 2) Positional encoding (PE)
  • 22. Vision Transformer (ViT) 22 On vision transformer The 1-D encoding is chosen empirically (based on accuracy comparisons). 1) Self-attention 2) Positional encoding (PE)
  • 23. 23 Vision Transformer (ViT): Overview On vision transformer Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." (2020).
  • 24. 24 Inductive biases in CNN : locality + translation equivariance Inductive biases in CNN On vision transformer https://anhreynolds.com/blogs/cnn.html However, strong inductive bias can be harmful for some tasks
  • 25. 25 ViT uses Transformer Vision Transformer (ViT): Overview On vision transformer 1. Parallel computation 2. Global perspective
  • 26. 26 When to choose ViT? On vision transformer 1. Parallel computation: already available in CNN. 2. Global perspective: less inductive bias than CNN. ViT is useful for tasks where generalization is important (few-shot learning, large-dataset training, ...) ViT uses Transformer
  • 27. 27 *ViT-model size (L, B) / patch size (16, 32). As the number of pretraining samples increases, performance finally exceeds that of the CNN model. Experimental Validation On vision transformer
  • 28. 28 Attention in Graph: GAT On graph transformer Veličković, Petar, et al. "Graph attention networks." (2017). GAT (Graph Attention Network) Attention is the function of the neighborhood connectivity Attention in graph:
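The neighborhood-restricted attention of GAT can be sketched as follows (a minimal single-head NumPy sketch of the Veličković et al. formulation; the 3-node graph, feature sizes, and random weights are illustrative):

```python
import numpy as np

def gat_attention(H, A, W, a, alpha=0.2):
    """Single-head GAT layer: softmax over neighbors of LeakyReLU(a^T [W h_i || W h_j])."""
    Wh = H @ W                                           # (n, f') transformed node features
    f = Wh.shape[1]
    # a^T [Wh_i || Wh_j] decomposes into a left part on i and a right part on j
    e = (Wh @ a[:f])[:, None] + (Wh @ a[f:])[None, :]
    e = np.where(e > 0, e, alpha * e)                    # LeakyReLU
    e = np.where(A > 0, e, -np.inf)                      # attention only over neighbors
    w = np.exp(e - e.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                    # softmax over each node's neighborhood
    return w @ Wh, w

rng = np.random.default_rng(1)
A = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]], float)   # 3-node path + self-loops
out, attn = gat_attention(rng.standard_normal((3, 4)), A,
                          rng.standard_normal((4, 2)), rng.standard_normal(4))
```

Masking with the adjacency matrix is what makes the attention a function of neighborhood connectivity, and also why it cannot see beyond the neighborhood, the limitation the next slide raises.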
  • 29. 29 Attention in Graph: GAT On graph transformer GAT (Graph Attention Network) Limitation: GAT attention has nothing to do with global connectivity.
  • 30. 30 GT (Graph Transformer) On graph transformer 1) Self-attention 2) Positional encoding (PE) Sentence in NLP can be viewed as discrete line graph “My future is bright” Unit of GT is each node, like each token in NLP e.g.)
  • 31. 31 GT (Graph Transformer) On graph transformer 1) Self-attention 2) Positional encoding (PE) The difference: layer-wise global Q, K, V matrices are constructed, instead of constructing node-wise Q, K, V. Dwivedi, Vijay Prakash, and Xavier Bresson. "A generalization of transformer networks to graphs." arXiv preprint arXiv:2012.09699 (2020).
  • 32. 32 GT (Graph Transformer) On graph transformer 1) Self-attention 2) Positional encoding (PE) The difference: layer-wise global Q, K, V matrices are constructed, instead of constructing node-wise Q, K, V. This seems unable to utilize the full power of self-attention. Dwivedi, Vijay Prakash, and Xavier Bresson. "A generalization of transformer networks to graphs." arXiv preprint arXiv:2012.09699 (2020).
  • 33. 33 GT (Graph Transformer) On graph transformer Limitation of GAT: no global connectivity How to encode position of a node considering global graph structure? 1) Self-attention 2) Positional encoding (PE)
  • 34. 34 GT (Graph Transformer) On graph transformer Laplacian eigenvectors eigenvectors eigenvalues 1) Self-attention 2) Positional encoding (PE)
  • 35. 35 GT (Graph Transformer) On graph transformer Laplacian eigenvectors, eigenvalues 1) Self-attention 2) Positional encoding (PE) Why the graph Laplacian? 1) Distance-aware node features (i.e., nearby nodes have similar positional features and farther nodes have dissimilar positional features) 2) An NLP graph's Laplacian eigenvectors are naturally cosine and sine functions
  • 36. 36 GT (Graph Transformer) On graph transformer 2) An NLP graph's Laplacian eigenvectors are naturally cosine and sine functions. Let's try to derive it. [Slide: the normalized Laplacian of a sentence's line graph is (up to its endpoints) a tridiagonal matrix with 1 on the diagonal and -0.5 on the off-diagonals]
  • 37. 37 GT (Graph Transformer) On graph transformer 2) An NLP graph's Laplacian eigenvectors are naturally cosine and sine functions. Let's try to derive it. If a matrix is tridiagonal and also Toeplitz, with diagonal 𝛿 and off-diagonals 𝜎, 𝜏, its eigenvalues are known to be 𝜆_k = 𝛿 + 2√(𝜎𝜏) cos(k𝜋/(n+1)), k = 1, ..., n [ref], which is a function of cosine. Noschese, S.; Pasquini, L.; Reichel, L. (2013). "Tridiagonal Toeplitz matrices: Properties and novel applications". Numerical Linear Algebra with Applications.
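The cosine claim can be checked numerically. Below is a sketch using the unnormalized Laplacian of an n-node path graph (a slight simplification of the slide, which uses the normalized Laplacian): its eigenvalues follow the closed form 2 - 2cos(k𝜋/n), and its eigenvectors are sampled cosines (the DCT-II basis):

```python
import numpy as np

n = 8
A = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)   # path-graph adjacency
L = np.diag(A.sum(1)) - A                                      # unnormalized Laplacian D - A
vals, vecs = np.linalg.eigh(L)                                 # eigenvalues in ascending order

k = np.arange(n)
expected_vals = 2 - 2 * np.cos(np.pi * k / n)                  # closed-form eigenvalues
# closed-form eigenvector for k = 1: a sampled cosine
v1 = np.cos(np.pi * 1 * (np.arange(n) + 0.5) / n)
v1 /= np.linalg.norm(v1)
```

Up to sign, `vecs[:, 1]` matches the sampled cosine `v1`, which is exactly why the Laplacian PE generalizes the sinusoidal PE of NLP.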
  • 38. 38 GT (Graph Transformer) On graph transformer Laplacian eigenvectors, eigenvalues 1) Self-attention 2) Positional encoding (PE) Then, select the k smallest non-trivial eigenvectors as the PE for each node: 1) k for dimension matching, 2) smallest to provide a smooth encoding.
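The PE construction on this slide can be sketched as follows (a minimal NumPy sketch of the Laplacian PE in Dwivedi & Bresson; the 4-node cycle graph and k = 2 are illustrative). Note the sign of each eigenvector is arbitrary, which is why the GT paper randomly flips eigenvector signs during training:

```python
import numpy as np

def laplacian_pe(A, k):
    """k-dim positional feature per node: the k smallest non-trivial Laplacian eigenvectors."""
    d = A.sum(1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt   # symmetric normalized Laplacian
    vals, vecs = np.linalg.eigh(L)                     # eigenvalues ascending
    return vecs[:, 1:k + 1]                            # drop the trivial (constant) eigenvector

A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], float)                    # 4-node cycle graph
pe = laplacian_pe(A, 2)                                # added to each node's input feature
```

As with the 1-D sinusoidal PE, these features are smooth over the graph: adjacent nodes get similar rows of `pe`.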
  • 39. PE ablation study Transformer in various domains 39 [Table: PE ablation results across datasets; metrics: MAE, Acc, Acc]
  • 40. 40 GT (Graph Transformer): Overview On graph transformer 1) Self-attention 2) PE (neighbors)
  • 41. 41 GT (Graph Transformer): Overview On graph transformer * Edge feature-aided version
  • 42. Conclusions Transformer in various domains 42 What TF adds per architecture (long-range dependency / parallel computation): RNN (NLP) — both; CNN (Vision) — long-range dependency (parallel computation already available); GNN (Graph) — parallel computation already available, long-range benefit insufficient
  • 43. Conclusions Transformer in various domains 43 1. Injecting the proper inductive bias for a given task is important. 2. For the graph domain, the justification for using TF seems weak. 3. The task that benefits most is NLP, followed by vision, then graph (NLP > Vision > Graph).
  • 44. reading group meeting material 44 Thank you for listening.