Pay Attention to MLPs
seolhokim
Contents
● Introduction
● Preliminary
● Model
● Experiments
○ Image classification
○ Masked Language Modeling with BERT
● Conclusion
● References
Introduction
● The Transformer has had a great influence on
almost all areas of deep learning
○ Many experiments have been conducted on each
element of the Transformer
Introduction
● There are two important properties of the
Transformer:
○ A recurrent-layer-free architecture
○ The self-attention block aggregates spatial information
across tokens
■ Dynamically parameterized by the attention
mechanism and positional encoding
● -> Inductive bias!
Preliminary
● The inductive bias (also known as learning bias) of a learning algorithm is the
set of assumptions that the learner uses to predict outputs of given inputs that
it has not encountered.
○ Examples : priors, locality, relations
Table 1: examples of Inductive bias
Preliminary
● Do we really need that inductive bias?
○ Let's create an architecture that can replace self-attention, maintaining spatial
information without such inductive bias!
■ -> gMLPs, without positional encoding
● Static parameterization
Preliminary
● Self-attention
○ Example : The animal didn’t cross the
street, because it was too tired.
Figure 2 : Self-attention example
Preliminary
● Positional Encoding
○ Needed because self-attention by itself is permutation-invariant!
○ It should output a unique encoding for each
time-step (the word's position in the sentence)
○ The distance between any two time-steps
should be consistent across sentences of
different lengths
○ The model should generalize to longer
sentences without any effort; the encoding's
values should be bounded
○ It must be deterministic (see the sketch below)
Figure 3 : Transformer positional encoding function
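For reference, a minimal NumPy sketch of the sinusoidal encoding in Figure 3; the function name sinusoidal_pe is ours, and the result satisfies all of the properties listed above:

```python
import numpy as np

def sinusoidal_pe(seq_len: int, d_model: int) -> np.ndarray:
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)."""
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even channel indices
    angles = pos / np.power(10000.0, i / d_model)  # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even channels get sine
    pe[:, 1::2] = np.cos(angles)                   # odd channels get cosine
    return pe  # deterministic, bounded in [-1, 1], unique per time-step
```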
Model
● gMLP consists of a stack of L identical blocks,
playing the same role as the multi-head attention blocks of a Transformer
Figure 4 : simple gMLP block architecture (components labeled 1-3)
Model
● Channel projection (linear projection)
○ Components 1 and 3
● Spatial Gating Unit
○ Component 2
Model
● Channel projection
○ The same as those in the FFNs of Transformers (fully connected layers)
○ U and V are linear projections along the channel dimension
○ The activation function is GELU
○ The block's input and output channel dimensions are set to be the same, so blocks can be stacked flexibly (see the sketch below)
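A minimal PyTorch sketch of how these two channel projections wrap the SGU; the dimensions d_model, d_ffn, and seq_len are illustrative assumptions, not the paper's exact settings:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, d_ffn, seq_len = 512, 2048, 128

U = nn.Linear(d_model, d_ffn)   # channel projection 1
V = nn.Linear(d_ffn, d_model)   # channel projection 3: restores the input width

X = torch.randn(2, seq_len, d_model)
Z = F.gelu(U(X))                # GELU activation, as in a Transformer FFN
# component 2, the Spatial Gating Unit, transforms Z here (see the next slides)
Y = V(Z)                        # output channels == input channels, so blocks stack
```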
Model
● Spatial Gating Unit
○ s(Z) = Z ⊙ f_{W,b}(Z), where f_{W,b}(Z) = WZ + b is a projection over the spatial (token) dimension
○ f differs from a channel projection only in the order of matrix multiplication:
■ a channel projection ZU takes a weighted sum over the elements (channels) within each token
■ f(Z) = WZ takes, at each spatial location, a weighted sum of the corresponding elements across all tokens
Model
● Spatial Gating Unit
○ For a stable start, W is initialized near zero and
b as an all-ones vector, so that initially
f_{W,b}(Z) ≈ 1 and the SGU acts as an identity mapping
○ Better performance was obtained by computing
the gate as follows:
■ Split Z into two independent halves (Z1, Z2)
along the channel dimension and compute
Z1 ⊙ f_{W,b}(Z2) (U and V will then have different sizes)
● Normalization is applied to Z2 before the spatial projection (see the code sketch after the figure caption)
Figure 5 : entire gMLP block architecture
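A hedged PyTorch sketch of the SGU as described on this slide (channel split, normalization of Z2, near-zero W, all-ones b); the layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class SpatialGatingUnit(nn.Module):
    """Sketch: s(Z) = Z1 * (W @ norm(Z2) + b), with Z split along channels."""

    def __init__(self, d_ffn: int, seq_len: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_ffn // 2)
        # the n x n spatial projection acts across tokens, not channels
        self.spatial_proj = nn.Linear(seq_len, seq_len)
        nn.init.zeros_(self.spatial_proj.weight)  # W = 0 for a stable start
        nn.init.ones_(self.spatial_proj.bias)     # b = 1 -> SGU starts near identity

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        z1, z2 = z.chunk(2, dim=-1)               # split along the channel dimension
        z2 = self.norm(z2)
        # apply the spatial projection over the sequence (token) dimension
        z2 = self.spatial_proj(z2.transpose(-1, -2)).transpose(-1, -2)
        return z1 * z2                            # elementwise gating
```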
Model
● Spatial Gating Unit
○ The SGU shows 2nd-order interactions in its output:
■ Z1 ⊙ (W Z2 + b) contains products of pairs of input elements, z_i z_j
○ Self-attention shows 3rd-order interactions:
■ softmax(QKᵀ)V contains products of triples of input elements, x_i x_j x_k,
since the attention weights themselves depend on pairs of inputs
Model
● Related works
○ The gating of gMLP is computed from a projection over the spatial dimension rather than the
channel dimension (compare the Highway Network)
○ The Squeeze-and-Excite block likewise applies its multiplicative gating only along the channel
dimension (a toy contrast is sketched after the figure caption)
Figure 6 : SENet architecture
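To make the two gating axes concrete, a toy PyTorch comparison; the tensor shapes and the pooled sigmoid gate are illustrative stand-ins, not the exact SENet recipe:

```python
import torch

x = torch.randn(2, 16, 64)        # (batch, n tokens, d channels)

# SE / Highway style: the gate is computed and applied per channel
channel_gate = torch.sigmoid(x.mean(dim=1, keepdim=True))   # (batch, 1, d)
y_channel = x * channel_gate

# gMLP style: the gate comes from a projection over the token dimension
W = torch.randn(16, 16)           # n x n spatial weights
spatial_gate = torch.einsum("nm,bmd->bnd", W, x)            # mixes across tokens
y_spatial = x * spatial_gate
```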
Experiments
Image Classification
● Image classification task on ImageNet
○ Input and output protocols follow ViT-B/16
○ gMLP shows overfitting like the Transformer -> the DeiT regularization recipe was used
○ Figure 7 is evidence that, when gMLP is moderately regularized, performance depends on the
capacity of the network rather than on the presence of self-attention
Table 2 : Architecture specifications of gMLP models for vision
Figure 7 : ImageNet accuracy vs model capacity
Experiments
Image Classification
● Image classification task on ImageNet
○ Each row shows the spatial projection filters for a
selected set of tokens in the same layer
■ -> locality and spatial invariance
Figure 9 : Spatial projection weights in gMLP-B
Experiments
Masked Language Modeling with BERT
● Masked language modeling (MLM) task
○ Input and output protocols follow BERT
○ They didn't use positional encoding
○ They didn't mask out <pad>
○ gMLP can learn a shift-invariance property, since any
offset of the input sequence does not affect the
outcome -> the spatial matrices become
Toeplitz-like -> equivalent to a 1-d convolution (a numerical check follows the figure caption)
Figure 10: Spatial projection matrices learned on the MLM pretraining task without the
shift invariance prior
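A small numerical check of the last point: when a spatial matrix is Toeplitz (constant along each diagonal), applying it is exactly a 1-d convolution. The filter values and the sequence length here are hypothetical:

```python
import numpy as np
from scipy.linalg import toeplitz

n = 6
w = np.array([0.1, 0.5, 0.2])       # hypothetical filter at offsets (-1, 0, +1)

# Toeplitz spatial matrix: W[i, j] depends only on the offset i - j
col = np.zeros(n); col[0], col[1] = w[1], w[2]   # first column: offsets 0, +1
row = np.zeros(n); row[0], row[1] = w[1], w[0]   # first row:    offsets 0, -1
W = toeplitz(col, row)

x = np.random.randn(n)
y_matmul = W @ x                                 # spatial projection, as in gMLP
y_conv = np.convolve(x, w, mode="same")          # the equivalent 1-d convolution
assert np.allclose(y_matmul, y_conv)
```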
Experiments
Masked Language Modeling with BERT
● Masked language modeling (MLM) task
○ Table 3 : ablation study
○ Figure 11 : the row of W associated with the token in the middle of the sequence
Table 3 : MLM validation perplexities of Transformer baselines and four versions of gMLPs
Figure 11 : Visualization of the spatial filters in gMLP learned on the MLM task
Experiments
Masked Language Modeling with BERT
● Masked language modeling (MLM) task
○ Comparison between gMLP and self-attention + FFN
○ Even at the same perplexity, factors such as inductive bias affect finetuning performance
○ Judging from the slope of performance with respect to capacity, the gap appears to be one that
added capacity can properly overcome
Table 4 : Pretraining and dev-set finetuning results over increased model capacity
Figure 12 : Scaling properties with respect to perplexity and finetuning accuracies
Experiments
Masked Language Modeling with BERT
● Masked language modeling (MLM) task
○ aMLP
■ gMLP with a tiny self-attention module added to the SGU
Figure 13: Hybrid spatial gating unit with a tiny self-attention module
Experiments
Masked Language Modeling with BERT
● Masked language modeling (MLM) task
○ aMLP shows slightly better performance than the Transformer on MNLI-m
Figure 14 : Transferability from MLM pretraining perplexity to finetuning accuracies on GLUE
Experiments
Masked Language Modeling with BERT
● Masked language modeling (MLM) task
Figure 15 : Comparing the scaling properties of Transformers, gMLPs and aMLP
Experiments
Masked Language Modeling with BERT
● Masked language modeling (MLM) task
Table 5 : Model specifications in the full BERT setup
Table 6 : Pretraining perplexities and dev-set results for finetuning
Conclusion
● Experiments show that better performance can be achieved by mitigating the conventional
inductive bias (though this remains slightly ambiguous).
● aMLP shows that the SGU can replace positional encoding
○ From the viewpoint of capturing spatial interactions, the operation of the SGU seems
reasonable.
○ I think an ablation study is needed to compare the 2nd-order and 3rd-order
interactions.
■ However, achieving this would require appropriate measures to reduce
the network size.
References
1. Liu, H., Dai, Z., So, D. R., & Le, Q. V. (2021). Pay Attention to MLPs. arXiv preprint arXiv:2105.08050.
2. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017).
Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008).
3. Mitchell, T. M. (1980). The need for biases in learning generalizations (pp. 184-191). Piscataway, NJ, USA:
Department of Computer Science, Laboratory for Computer Science Research, Rutgers Univ.
4. Kim, H. (n.d.). [NLP Paper Implementation] Implementing the Transformer in PyTorch (Attention is All You Need). Hansu Kim’s
Blog. https://cpm0722.github.io/pytorch-implementation/transformer
5. Kazemnejad, A. (n.d.). Transformer Architecture: The Positional Encoding - Amirhossein Kazemnejad’s Blog.
https://kazemnejad.com/blog/transformer_architecture_positional_encoding/
6. Srivastava, R. K., Greff, K., & Schmidhuber, J. (2015). Highway networks. arXiv preprint arXiv:1505.00387.
7. Hu, J., Shen, L., & Sun, G. (2018). Squeeze-and-excitation networks. In Proceedings of the IEEE
conference on computer vision and pattern recognition (pp. 7132-7141).
