3. Introduction
● The Transformer has had a great influence on almost all areas of deep learning
○ Many experiments have been conducted on each element of the Transformer
4. Introduction
● The Transformer has two important properties
○ Recurrence-free architecture
○ The self-attention block aggregates spatial information across tokens
■ Dynamically parameterized by the attention mechanism and positional encoding
● -> Inductive bias!
5. Preliminary
● The inductive bias (also known as learning bias) of a learning algorithm is the
set of assumptions that the learner uses to predict outputs of given inputs that
it has not encountered.
○ Examples: prior, locality, relation
Table 1: Examples of inductive bias
6. Preliminary
● Do we really need that inductive bias?
○ Let's create an architecture that can replace self-attention: one that maintains spatial information without such inductive bias!
■ -> gMLP without positional encoding
● Static parameterization of the spatial interaction (instead of attention's input-dependent weights)
8. Preliminary
● Positional Encoding
○ Self-attention by itself is permutation-invariant, so positional information has to be injected explicitly
○ It should output a unique encoding for each time-step (the word's position in a sentence)
○ The distance between any two time-steps should be consistent across sentences with different lengths
○ The model should generalize to longer sentences without any effort, and its values should be bounded
○ It must be deterministic
Figure 3 : Transformer positional encoding function
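For reference, the sinusoidal encoding from Attention Is All You Need (what Figure 3 shows) is PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); it is bounded, deterministic, and gives consistent relative offsets between positions.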
9. Model
● gMLP consists of a stack of L identical blocks, analogous to the stack of multi-head attention (Transformer) blocks
Figure 4 : simple gMLP block architecture
11. Model
● Channel projection
○ Same as the FFNs of Transformers (fully connected layers)
○ U, V are linear projections along the channel dimension
○ The activation function is GELU
○ The block's input and output channel dimensions are set to be the same, so blocks can be stacked flexibly
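Written out, one gMLP block computes Z = σ(XU), Z̃ = s(Z), Y = Z̃V (shortcuts and normalizations omitted), where σ is GELU, U and V are the channel projections above, and s(·) is the Spatial Gating Unit described on the next slides.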
12. Model
● Spatial Gating Unit
○ s(Z) = Z ⊙ f_{W,b}(Z), where f_{W,b}(Z) = WZ + b is a linear projection along the spatial (token) dimension and ⊙ is element-wise multiplication
○ f differs from the channel projections in the order of matrix multiplication
■ A channel projection mixes the elements within each token
■ f instead takes, at a specific output location, a weighted sum of the corresponding channel element across all tokens
13. Model
● Spatial Gating Unit
○ For a stable start, W is initialized to near-zero values and b to ones, so the SGU begins close to an identity mapping
○ Better performance was obtained by computing the gate as follows
■ Split Z into two independent parts (Z1, Z2) along the channel dimension and compute s(Z) = Z1 ⊙ f_{W,b}(Z2) (so U and V have different sizes)
■ Normalize the input to f_{W,b} (i.e. normalize Z2); see the code sketch below Figure 5
Figure 5: Entire gMLP block architecture
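A minimal PyTorch-style sketch of the block in Figure 5, assuming a pre-norm residual block; the class and argument names (SpatialGatingUnit, gMLPBlock, d_ffn, seq_len) and the exact initialization constants are my own choices for illustration, not the authors' code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialGatingUnit(nn.Module):
    """s(Z) = Z1 * (W @ norm(Z2) + b), acting along the token (spatial) dimension."""
    def __init__(self, d_ffn, seq_len):
        super().__init__()
        self.norm = nn.LayerNorm(d_ffn // 2)
        # Spatial projection: an n x n weight over the sequence dimension.
        self.spatial_proj = nn.Linear(seq_len, seq_len)
        nn.init.normal_(self.spatial_proj.weight, std=1e-6)  # W near zero
        nn.init.ones_(self.spatial_proj.bias)                 # b = 1 -> gate starts near identity

    def forward(self, z):                    # z: (batch, seq_len, d_ffn)
        z1, z2 = z.chunk(2, dim=-1)          # split along the channel dimension
        z2 = self.norm(z2)
        z2 = self.spatial_proj(z2.transpose(1, 2)).transpose(1, 2)  # mix across tokens
        return z1 * z2                       # element-wise gating

class gMLPBlock(nn.Module):
    def __init__(self, d_model, d_ffn, seq_len):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.proj_in = nn.Linear(d_model, d_ffn)        # U (channel projection)
        self.sgu = SpatialGatingUnit(d_ffn, seq_len)
        self.proj_out = nn.Linear(d_ffn // 2, d_model)  # V (channel projection)

    def forward(self, x):                    # x: (batch, seq_len, d_model)
        shortcut = x
        z = F.gelu(self.proj_in(self.norm(x)))
        z = self.sgu(z)
        return self.proj_out(z) + shortcut   # residual connection
```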
14. Model
● Spatial Gating Unit
○ The SGU output contains 2nd-order interactions of the input
■ each output element is a sum of products z_i z_j, since the gate f_{W,b}(Z) is linear in Z and is multiplied element-wise with Z
○ Self-attention contains 3rd-order interactions
■ its output is a sum of terms of the form (q_i · k_j) v_j, because the attention weights themselves depend on the input
15. Model
● Related Works
○ The gating in gMLP is computed from a projection over the spatial dimension rather than the channel dimension (in contrast to Highway Networks)
● The Squeeze-and-Excitation block likewise multiplies only along the channel dimension (see the sketch below Figure 6)
Figure 6 : SENet architecture
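A rough contrast in code (my own illustration, adapted to a sequence layout; not either paper's implementation): an SE-style block computes one gate per channel from globally pooled features, whereas the SGU computes a per-token gate via a projection across the sequence dimension (see the SpatialGatingUnit sketch above).

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """SE-style gating: one scale per channel, computed from globally pooled features."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):               # x: (batch, seq_len, channels)
        scale = self.fc(x.mean(dim=1))  # squeeze: pool over the spatial/token dimension
        return x * scale.unsqueeze(1)   # excite: rescale each channel, same scale for every token
```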
16. Experiments
Image Classification
● Image classification task on ImageNet
○ Input and output protocols follow ViT-B/16
○ gMLP overfits like the Transformer -> the DeiT regularization recipe was used
○ Figure 7 suggests that, once gMLP is moderately regularized, accuracy depends on the capacity of the network rather than on the presence of self-attention
Table 2 : Architecture specifications of gMLP models for vision
Figure 7 : ImageNet accuracy vs model capacity
17. Experiments
Image Classification
● Image classification task on ImageNet
○ Each row shows the spatial filters for a selected set of tokens in the same layer
■ -> the learned filters exhibit locality and spatial invariance
Figure 9 : Spatial projection weights in gMLP-B
18. Experiments
Masked Language Modeling with BERT
● Masked language modeling (MLM) task
○ Input and output protocols follow BERT
○ No positional encoding is used
○ <pad> tokens are not masked out
○ The MLM task is shift-invariant: any offset of the input sequence does not affect the outcome. gMLP learns this property on its own -> the spatial matrices become Toeplitz-like -> the spatial projection acts like a 1-d convolution (see the check below Figure 10)
Figure 10: Spatial projection matrices learned on the MLM pretraining task without the
shift invariance prior
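To make the Toeplitz point concrete, a small numerical check (my own illustration, not from the paper): multiplying a sequence by a Toeplitz spatial matrix is exactly a 1-d convolution whose kernel is the shared diagonal values.

```python
import torch
import torch.nn.functional as F

n = 8
kernel = torch.tensor([0.2, 0.5, 0.3])           # weights shared across positions
# Build a Toeplitz spatial matrix W: W[i, j] depends only on the offset i - j.
W = torch.zeros(n, n)
for i in range(n):
    for j in range(n):
        if 0 <= i - j + 1 < len(kernel):
            W[i, j] = kernel[i - j + 1]

x = torch.randn(n)                                # one channel of a token sequence
toeplitz_out = W @ x                              # spatial projection

# Same result as a 1-d convolution with the flipped kernel (conv1d cross-correlates).
conv_out = F.conv1d(x.view(1, 1, n), kernel.flip(0).view(1, 1, -1), padding=1).view(n)
print(torch.allclose(toeplitz_out, conv_out, atol=1e-6))  # True
```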
19. Experiments
Masked Language Modeling with BERT
● Masked language modeling (MLM) task
○ Table 3: ablation study
○ Figure 11: the rows in W associated with the token in the middle of the sequence
Table 3: MLM validation perplexities of Transformer baselines and four versions of gMLP
Figure 11: Visualization of the spatial filters in gMLP learned on the MLM task
20. Experiments
Masked Language Modeling with BERT
● Masked language modeling (MLM) task
○ Comparison between gMLP and self-attention + FFN
○ Even at the same pretraining perplexity, factors such as inductive bias affect finetuning accuracy
○ Judging from the slope of performance with respect to capacity, this is a gap that can be overcome by scaling up the model
Table 4: Pretraining and dev-set finetuning results over increased model capacity
Figure 12: Scaling properties with respect to perplexity and finetuning accuracies
21. Experiments
Masked Language Modeling with BERT
● Masked language modeling (MLM) task
○ aMLP
■ adds a tiny self-attention module to the spatial gating unit (see the sketch below Figure 13)
Figure 13: Hybrid spatial gating unit with a tiny self-attention module
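A minimal sketch of the hybrid gate in Figure 13, assuming the tiny attention is computed from the block input and its output is added to the spatial projection before the element-wise gate; the names, the single head, and the attention size of 64 are illustrative assumptions, not the authors' exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyAttention(nn.Module):
    """Single-head attention with a small internal size (assumed 64 here)."""
    def __init__(self, d_in, d_out, d_attn=64):
        super().__init__()
        self.qkv = nn.Linear(d_in, 3 * d_attn)
        self.out = nn.Linear(d_attn, d_out)
        self.d_attn = d_attn

    def forward(self, x):                          # x: (batch, seq_len, d_in)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        w = torch.einsum("bnd,bmd->bnm", q, k) / (self.d_attn ** 0.5)
        a = F.softmax(w, dim=-1)
        return self.out(torch.einsum("bnm,bmd->bnd", a, v))

class HybridSpatialGatingUnit(nn.Module):
    """aMLP-style gate: spatial projection plus a tiny attention branch."""
    def __init__(self, d_ffn, seq_len, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_ffn // 2)
        self.spatial_proj = nn.Linear(seq_len, seq_len)
        self.tiny_attn = TinyAttention(d_model, d_ffn // 2)

    def forward(self, z, x):                       # z: block hidden state, x: block input
        z1, z2 = z.chunk(2, dim=-1)
        z2 = self.norm(z2)
        gate = self.spatial_proj(z2.transpose(1, 2)).transpose(1, 2)
        gate = gate + self.tiny_attn(x)            # attention output joins the gate
        return z1 * gate
```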
22. Experiments
Masked Language Modeling with BERT
● Masked language modeling (MLM) task
○ aMLP performs slightly better than the Transformer on MNLI-m
Figure 14: Transferability from MLM pretraining perplexity to finetuning accuracies on GLUE
23. Experiments
Masked Language Modeling with BERT
● Masked language modeling (MLM) task
Figure 15 : Comparing the scaling properties of Transformers, gMLPs and aMLP
24. Experiments
Masked Language Modeling with BERT
● Masked language modeling (MLM) task
Table 5 : Model specifications in the full BERT setup
Table 6 : Pretraining perplexities and dev-set result for finetuning
25. Conclusion
● Experiments show that better performance can be achieved while mitigating the conventional inductive bias (though this claim is still slightly ambiguous).
● aMLP shows that the SGU can replace positional encoding.
○ From the viewpoint of capturing spatial interactions, the operation of the SGU seems reasonable.
○ I think an ablation study directly comparing 2nd-order and 3rd-order interactions is needed.
■ However, in order to achieve this, appropriate measures are needed to control the network size.
26. References
1. Liu, H., Dai, Z., So, D. R., & Le, Q. V. (2021). Pay attention to MLPs. arXiv preprint arXiv:2105.08050.
2. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (pp. 5998-6008).
3. Mitchell, T. M. (1980). The need for biases in learning generalizations (pp. 184-191). Piscataway, NJ, USA: Department of Computer Science, Laboratory for Computer Science Research, Rutgers Univ.
4. Kim, H. (n.d.). [NLP paper implementation] Implementing the Transformer in PyTorch (Attention Is All You Need). Hansu Kim's Blog. https://cpm0722.github.io/pytorch-implementation/transformer
5. Kazemnejad, A. (n.d.). Transformer Architecture: The Positional Encoding. Amirhossein Kazemnejad's Blog. https://kazemnejad.com/blog/transformer_architecture_positional_encoding/
6. Srivastava, R. K., Greff, K., & Schmidhuber, J. (2015). Highway networks. arXiv preprint arXiv:1505.00387.
7. Hu, J., Shen, L., & Sun, G. (2018). Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7132-7141).