3. Introduction
● The Transformer has had a great influence on almost all areas of deep learning
○ Many experiments have been conducted on each element of the Transformer
4. Introduction
● The Transformer has two important properties
○ Recurrence-free architecture
○ The self-attention block aggregates spatial information across tokens
■ Dynamically parameterized by the attention mechanism and positional encoding
● -> Inductive bias!
5. Preliminary
● The inductive bias (also known as learning bias) of a learning algorithm is the
set of assumptions that the learner uses to predict outputs of given inputs that
it has not encountered.
○ Examples: prior, locality, relation
Table 1: Examples of inductive bias
6. Preliminary
● Do we really need that inductive bias?
○ Let's create an architecture that can replace self-attention: one that maintains spatial information without such inductive bias!
■ -> gMLP without positional encoding
● Static parameterization of the spatial interaction (instead of attention's input-dependent weights)
8. Preliminary
● Positional Encoding
○ Self-attention by itself is permutation-invariant, so positional information has to be injected explicitly
○ It should output a unique encoding for each time-step (the word's position in a sentence)
○ The distance between any two time-steps should be consistent across sentences with different lengths
○ The model should generalize to longer sentences without any effort, and its values should be bounded
○ It must be deterministic
Figure 3 : Transformer positional encoding function
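For reference, the sinusoidal encoding from Attention Is All You Need (what Figure 3 shows) is PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); it is bounded, deterministic, and gives consistent relative offsets between positions.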
9. Model
● gMLP consists of a stack of L identical blocks, analogous to the stack of multi-head attention (Transformer) blocks
Figure 4 : simple gMLP block architecture
11. Model
● Channel projection
○ Same as the FFNs of Transformers (fully connected layers)
○ U, V are linear projections along the channel dimension
○ The activation function is GELU
○ The block's input and output channel dimensions are set to be the same, so blocks can be stacked flexibly
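Written out, one gMLP block computes Z = σ(XU), Z̃ = s(Z), Y = Z̃V (shortcuts and normalizations omitted), where σ is GELU, U and V are the channel projections above, and s(·) is the Spatial Gating Unit described on the next slides.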
12. Model
● Spatial Gating Unit
○ s(Z) = Z ⊙ f_{W,b}(Z), where f_{W,b}(Z) = WZ + b is a linear projection along the spatial (token) dimension and ⊙ is element-wise multiplication
○ f differs from the channel projections in the order of matrix multiplication
■ A channel projection mixes the elements within each token
■ f instead takes, at a specific output location, a weighted sum of the corresponding channel element across all tokens
13. Model
● Spatial Gating Unit
○ For a stable start, W is initialized to near-zero values and b to ones, so the SGU begins close to an identity mapping
○ Better performance was obtained by computing the gate as follows
■ Split Z into two independent parts (Z1, Z2) along the channel dimension and compute s(Z) = Z1 ⊙ f_{W,b}(Z2) (so U and V have different sizes)
■ Normalize the input to f_{W,b} (i.e. normalize Z2); see the code sketch below Figure 5
Figure 5: Entire gMLP block architecture
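A minimal PyTorch-style sketch of the block in Figure 5, assuming a pre-norm residual block; the class and argument names (SpatialGatingUnit, gMLPBlock, d_ffn, seq_len) and the exact initialization constants are my own choices for illustration, not the authors' code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialGatingUnit(nn.Module):
    """s(Z) = Z1 * (W @ norm(Z2) + b), acting along the token (spatial) dimension."""
    def __init__(self, d_ffn, seq_len):
        super().__init__()
        self.norm = nn.LayerNorm(d_ffn // 2)
        # Spatial projection: an n x n weight over the sequence dimension.
        self.spatial_proj = nn.Linear(seq_len, seq_len)
        nn.init.normal_(self.spatial_proj.weight, std=1e-6)  # W near zero
        nn.init.ones_(self.spatial_proj.bias)                 # b = 1 -> gate starts near identity

    def forward(self, z):                    # z: (batch, seq_len, d_ffn)
        z1, z2 = z.chunk(2, dim=-1)          # split along the channel dimension
        z2 = self.norm(z2)
        z2 = self.spatial_proj(z2.transpose(1, 2)).transpose(1, 2)  # mix across tokens
        return z1 * z2                       # element-wise gating

class gMLPBlock(nn.Module):
    def __init__(self, d_model, d_ffn, seq_len):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.proj_in = nn.Linear(d_model, d_ffn)        # U (channel projection)
        self.sgu = SpatialGatingUnit(d_ffn, seq_len)
        self.proj_out = nn.Linear(d_ffn // 2, d_model)  # V (channel projection)

    def forward(self, x):                    # x: (batch, seq_len, d_model)
        shortcut = x
        z = F.gelu(self.proj_in(self.norm(x)))
        z = self.sgu(z)
        return self.proj_out(z) + shortcut   # residual connection
```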
14. Model
● Spatial Gating Unit
○ The SGU output contains 2nd-order interactions of the input
■ each output element is a sum of products z_i z_j, since the gate f_{W,b}(Z) is linear in Z and is multiplied element-wise with Z
○ Self-attention contains 3rd-order interactions
■ its output is a sum of terms of the form (q_i · k_j) v_j, because the attention weights themselves depend on the input
15. Model
● Related Works
○ The gating in gMLP is computed from a projection over the spatial dimension rather than the channel dimension (in contrast to Highway Networks)
● The Squeeze-and-Excitation block likewise multiplies only along the channel dimension (see the sketch below Figure 6)
Figure 6 : SENet architecture
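A rough contrast in code (my own illustration, adapted to a sequence layout; not either paper's implementation): an SE-style block computes one gate per channel from globally pooled features, whereas the SGU computes a per-token gate via a projection across the sequence dimension (see the SpatialGatingUnit sketch above).

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """SE-style gating: one scale per channel, computed from globally pooled features."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):               # x: (batch, seq_len, channels)
        scale = self.fc(x.mean(dim=1))  # squeeze: pool over the spatial/token dimension
        return x * scale.unsqueeze(1)   # excite: rescale each channel, same scale for every token
```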
16. Experiments
Image Classification
● Image classification task on ImageNet
○ Input and output protocols follow ViT-B/16
○ gMLP overfits like the Transformer -> the DeiT regularization recipe was used
○ Figure 7 suggests that, once gMLP is moderately regularized, accuracy depends on the capacity of the network rather than on the presence of self-attention
Table 2 : Architecture specifications of gMLP models for vision
Figure 7 : ImageNet accuracy vs model capacity
17. Experiments
Image Classification
● Image classification task on ImageNet
○ Each row shows the spatial filters for a selected set of tokens in the same layer
■ -> the learned filters exhibit locality and spatial invariance
Figure 9 : Spatial projection weights in gMLP-B
18. Experiments
Masked Language Modeling with BERT
● Masked language modeling (MLM) task
○ Input and output protocols follow BERT
○ No positional encoding is used
○ <pad> tokens are not masked out
○ The MLM task is shift-invariant: any offset of the input sequence does not affect the outcome. gMLP learns this property on its own -> the spatial matrices become Toeplitz-like -> the spatial projection acts like a 1-d convolution (see the check below Figure 10)
Figure 10: Spatial projection matrices learned on the MLM pretraining task without the
shift invariance prior
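To make the Toeplitz point concrete, a small numerical check (my own illustration, not from the paper): multiplying a sequence by a Toeplitz spatial matrix is exactly a 1-d convolution whose kernel is the shared diagonal values.

```python
import torch
import torch.nn.functional as F

n = 8
kernel = torch.tensor([0.2, 0.5, 0.3])           # weights shared across positions
# Build a Toeplitz spatial matrix W: W[i, j] depends only on the offset i - j.
W = torch.zeros(n, n)
for i in range(n):
    for j in range(n):
        if 0 <= i - j + 1 < len(kernel):
            W[i, j] = kernel[i - j + 1]

x = torch.randn(n)                                # one channel of a token sequence
toeplitz_out = W @ x                              # spatial projection

# Same result as a 1-d convolution with the flipped kernel (conv1d cross-correlates).
conv_out = F.conv1d(x.view(1, 1, n), kernel.flip(0).view(1, 1, -1), padding=1).view(n)
print(torch.allclose(toeplitz_out, conv_out, atol=1e-6))  # True
```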
19. Experiments
Masked Language Modeling with BERT
● Masked language modeling (MLM) task
○ Table 3: ablation study
○ Figure 11: the rows in W associated with the token in the middle of the sequence
Table 3: MLM validation perplexities of Transformer baselines and four versions of gMLP
Figure 11: Visualization of the spatial filters in gMLP learned on the MLM task
20. Experiments
Masked Language Modeling with BERT
● Masked language modeling (MLM) task
○ Comparison between gMLP and self-attention + FFN
○ Even at the same pretraining perplexity, factors such as inductive bias affect finetuning accuracy
○ Judging from the slope of performance with respect to capacity, this is a gap that can be overcome by scaling up the model
Table 4: Pretraining and dev-set finetuning results over increased model capacity
Figure 12: Scaling properties with respect to perplexity and finetuning accuracies
21. Experiments
Masked Language Modeling with BERT
● Masked language modeling (MLM) task
○ aMLP
■ adds a tiny self-attention module to the spatial gating unit (see the sketch below Figure 13)
Figure 13: Hybrid spatial gating unit with a tiny self-attention module
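A minimal sketch of the hybrid gate in Figure 13, assuming the tiny attention is computed from the block input and its output is added to the spatial projection before the element-wise gate; the names, the single head, and the attention size of 64 are illustrative assumptions, not the authors' exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyAttention(nn.Module):
    """Single-head attention with a small internal size (assumed 64 here)."""
    def __init__(self, d_in, d_out, d_attn=64):
        super().__init__()
        self.qkv = nn.Linear(d_in, 3 * d_attn)
        self.out = nn.Linear(d_attn, d_out)
        self.d_attn = d_attn

    def forward(self, x):                          # x: (batch, seq_len, d_in)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        w = torch.einsum("bnd,bmd->bnm", q, k) / (self.d_attn ** 0.5)
        a = F.softmax(w, dim=-1)
        return self.out(torch.einsum("bnm,bmd->bnd", a, v))

class HybridSpatialGatingUnit(nn.Module):
    """aMLP-style gate: spatial projection plus a tiny attention branch."""
    def __init__(self, d_ffn, seq_len, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_ffn // 2)
        self.spatial_proj = nn.Linear(seq_len, seq_len)
        self.tiny_attn = TinyAttention(d_model, d_ffn // 2)

    def forward(self, z, x):                       # z: block hidden state, x: block input
        z1, z2 = z.chunk(2, dim=-1)
        z2 = self.norm(z2)
        gate = self.spatial_proj(z2.transpose(1, 2)).transpose(1, 2)
        gate = gate + self.tiny_attn(x)            # attention output joins the gate
        return z1 * gate
```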
22. Experiments
Masked Language Modeling with BERT
● Masked language modeling (MLM) task
○ aMLP performs slightly better than the Transformer on MNLI-m
Figure 14: Transferability from MLM pretraining perplexity to finetuning accuracies on GLUE
23. Experiments
Masked Language Modeling with BERT
● Masked language modeling (MLM) task
Figure 15 : Comparing the scaling properties of Transformers, gMLPs and aMLP
24. Experiments
Masked Language Modeling with BERT
● Masked language modeling (MLM) task
Table 5 : Model specifications in the full BERT setup
Table 6 : Pretraining perplexities and dev-set result for finetuning
25. Conclusion
● Experiments show that better performance can be achieved while mitigating the conventional inductive bias (though this claim is still slightly ambiguous).
● aMLP shows that the SGU can replace positional encoding.
○ From the viewpoint of capturing spatial interactions, the operation of the SGU seems reasonable.
○ I think an ablation study directly comparing 2nd-order and 3rd-order interactions is needed.
■ However, in order to achieve this, appropriate measures are needed to control the network size.
26. References
1. Liu, H., Dai, Z., So, D. R., & Le, Q. V. (2021). Pay attention to MLPs. arXiv preprint arXiv:2105.08050.
2. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (pp. 5998-6008).
3. Mitchell, T. M. (1980). The need for biases in learning generalizations (pp. 184-191). Piscataway, NJ, USA: Department of Computer Science, Laboratory for Computer Science Research, Rutgers Univ.
4. Kim, H. (n.d.). [NLP paper implementation] Implementing the Transformer in PyTorch (Attention Is All You Need). Hansu Kim's Blog. https://cpm0722.github.io/pytorch-implementation/transformer
5. Kazemnejad, A. (n.d.). Transformer Architecture: The Positional Encoding. Amirhossein Kazemnejad's Blog. https://kazemnejad.com/blog/transformer_architecture_positional_encoding/
6. Srivastava, R. K., Greff, K., & Schmidhuber, J. (2015). Highway networks. arXiv preprint arXiv:1505.00387.
7. Hu, J., Shen, L., & Sun, G. (2018). Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7132-7141).