#PR12 #PR344
Hello, this is the 344th paper review from PR-12, the TensorFlow Korea paper reading group.
Today's paper comes from USTC and MSRA and carries the striking title A Battle of Network Structures.
As the subtitle makes clear, this paper compares CNN, Transformer, and MLP for computer vision under the same conditions to see what characteristics each structure has.
To experiment under identical conditions, the authors first build a unified framework called SPACH and plug CNN, Transformer, and MLP modules into it.
One finding of the experiments is that all three reach similar performance when conditions are right, but MLP starts to overfit as the model size grows,
CNN has the best generalization capability, performing well even with less data than the Transformer,
and the Transformer has the largest model capacity, so it does best when data is plentiful and a large computational budget is allowed.
Another finding is that even the Transformer and MLP, which have a global receptive field, perform better when combined with a local model that performs local operations.
Based on these insights, the paper proposes a hybrid model combining CNN and Transformer and shows that it can achieve SOTA performance.
Personally, I did not find any startling new insight, but I would describe it as a paper that neatly organizes the characteristics, strengths, and weaknesses of the three network structures.
Please see the video for the details. Thank you!
Video link: https://youtu.be/NVLMZZglx14
Paper link: https://arxiv.org/abs/2108.13002
PR-344: A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP
1. A Battle of Network Structures:
An Empirical Study of CNN, Transformer, and MLP
Yucheng Zhao et al., “A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP”
5th September, 2021
PR12 Paper Review
JinWon Lee
Samsung Electronics
2. Introduction
• Recently, Transformer and multi-layer perceptron (MLP)-based
models, such as Vision Transformer (ViT) and MLP-Mixer, started to
lead new trends as they showed promising results in the ImageNet
classification task.
• However, many new networks are still being proposed, and no conclusion
can be made as to which structure among CNN, Transformer, and MLP
performs best or is most suitable for vision tasks.
3. Introduction
• This is partly due to the pursuit of high scores that leads to
multifarious tricks and exhaustive parameter tuning, and structure-
free components play an important role in the final performance.
• As a result, network structures cannot be fairly compared in a
systematic way.
• The work presented in this paper fills this gap by conducting a series
of controlled experiments over CNN, Transformer, and MLP in a
unified framework.
4. Backgrounds
• Convolution-based vision models
▪ CNNs have long been the de facto standard for computer vision models.
▪ A standard convolution layer learns filters in a 3D space (2 spatial, 1 channel),
so the learning of spatial correlations and channel correlations is coupled
inside a single kernel.
▪ Differently, a depthwise convolution layer only learns spatial correlations by
moving the learning process of channel correlations to an additional 1x1
convolution.
▪ This idea of decoupling spatial and channel correlations is also adopted in the
Vision Transformer.
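As a quick illustration of this decoupling (a minimal sketch, not code from the paper; the channel count and tensor sizes are arbitrary), a depthwise convolution that mixes only spatially can be paired with a 1x1 convolution that mixes only channels:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)  # (batch, channels, height, width)

# Standard convolution: spatial and channel correlations are learned jointly
# inside a single 3x3 kernel.
standard = nn.Conv2d(64, 64, kernel_size=3, padding=1)

# Decoupled form: a depthwise convolution (groups == channels) learns only
# spatial correlations, and a 1x1 convolution handles channel correlations.
depthwise = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64)
pointwise = nn.Conv2d(64, 64, kernel_size=1)

print(standard(x).shape, pointwise(depthwise(x)).shape)  # both (1, 64, 56, 56)
```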
6. Backgrounds
• Transformer-based vision models
▪ ViT reveals that Transformer models trained on large-scale datasets could
attain SOTA image recognition performance.
▪ However, when the training data is insufficient, ViT does not perform well due
to the lack of inductive biases.
▪ Swin and Twins adopt locally-grouped self-attention; this local mechanism not
only leads to performance improvements but also brings significant gains in
memory and computational efficiency.
8. Three Desired Properties in Different Network Structures
Convolution layer
- Locally connected
- Same weights for all inputs
Dense layer
- Fully connected
- Same weights for all inputs
Attention layer
- Fully connected
- Different weights for different inputs
[Figure: connectivity patterns of the three layers over inputs x1–x4; the attention layer's mixing weights 𝜓(q2〮k1) … 𝜓(q2〮k4) are computed from the inputs themselves]
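To make the three connectivity patterns above concrete, here is a minimal PyTorch sketch (my own illustration, not from the paper; the token count and embedding size are arbitrary) showing local static weights, global static weights, and global input-dependent weights:

```python
import torch
import torch.nn as nn

tokens = torch.randn(1, 4, 8)  # (batch, 4 tokens x1..x4, embedding dim 8)

# Convolution layer: locally connected, same learned weights at every position.
conv = nn.Conv1d(8, 8, kernel_size=3, padding=1)
y_conv = conv(tokens.transpose(1, 2)).transpose(1, 2)

# Dense layer over the token axis: fully connected, static weights.
dense = nn.Linear(4, 4)
y_dense = dense(tokens.transpose(1, 2)).transpose(1, 2)

# Attention layer: fully connected, but the mixing weights psi(q*k) are
# computed from the inputs, so they differ for every input.
attn = nn.MultiheadAttention(embed_dim=8, num_heads=1, batch_first=True)
y_attn, mixing_weights = attn(tokens, tokens, tokens)
print(mixing_weights.shape)  # (1, 4, 4) input-dependent weights over x1..x4
```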
10. SPACH – the serial structure of SPAtial and Channel processing
• The architecture is very simple and consists mainly of a cascade of
mixing blocks, plus some necessary auxiliary modules.
• SPACH-MS (Multi-Stage) is designed to keep a high resolution in the initial
stage of the framework and to progressively perform down-sampling.
• Each stage contains Ns mixing blocks, where s is the stage index.
• Due to the extremely high computational cost of Transformer and
MLP on high-resolution feature maps, the mixing blocks in the first
stage are implemented with convolutions only.
• The patch size is 16 in the single-stage implementation and 4 in the
multi-stage implementation.
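A rough skeleton of how such a multi-stage framework could be wired up is sketched below (my own interpretation of the description above; the block factories, stage depths, and channel widths are placeholders, and each block is assumed to map a (B, C, H, W) feature map to the same shape):

```python
import torch.nn as nn

class SPACHMS(nn.Module):
    """Multi-stage skeleton: patch embedding with patch size 4, then several
    stages of mixing blocks with down-sampling in between. The first stage
    uses convolution-only blocks because of the high feature resolution."""
    def __init__(self, make_block, make_conv_block,
                 depths=(2, 2, 6, 2), dims=(64, 128, 320, 512), num_classes=1000):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dims[0], kernel_size=4, stride=4)  # patch size 4
        stages = []
        for s, (depth, dim) in enumerate(zip(depths, dims)):
            make = make_conv_block if s == 0 else make_block      # conv-only first stage
            stages.append(nn.Sequential(*[make(dim) for _ in range(depth)]))
            if s < len(depths) - 1:                               # down-sample between stages
                stages.append(nn.Conv2d(dim, dims[s + 1], kernel_size=2, stride=2))
        self.stages = nn.Sequential(*stages)
        self.head = nn.Linear(dims[-1], num_classes)

    def forward(self, x):                                         # x: (B, 3, 224, 224)
        x = self.stages(self.patch_embed(x))                      # (B, dims[-1], 7, 7)
        return self.head(x.mean(dim=(2, 3)))                      # global average pooling
```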
11. Mixing Block Design
• Mixing blocks are key components in the SPACH framework.
• Input is first processed by a spatial mixing function ℱ𝑠 and then by a
channel mixing function ℱ𝑐. ℱ𝑠 focuses on aggregating context
information from different spatial locations while ℱ𝑐 focuses on
channel information fusion.
• ℱ𝑐 can be viewed as a 1x1 convolution and only performs channel
fusion and does not explore any spatial context.
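A minimal sketch of such a mixing block is shown below (my own reading of the description; normalization and residual placement follow common Transformer-style practice and are assumptions, not the paper's exact code):

```python
import torch.nn as nn

class MixingBlock(nn.Module):
    """Spatial mixing F_s followed by channel mixing F_c on token features."""
    def __init__(self, dim, spatial_mixing):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.spatial_mixing = spatial_mixing                 # F_s: conv / attention / MLP
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mixing = nn.Sequential(                 # F_c: per-location channel fusion,
            nn.Linear(dim, 4 * dim),                         # equivalent to a 1x1 convolution
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, x):                                    # x: (batch, tokens, dim)
        x = x + self.spatial_mixing(self.norm1(x))
        x = x + self.channel_mixing(self.norm2(x))
        return x
```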
12. Mixing Block Design
• The spatial mixing function ℱ𝑠 is the key to implement different
architectures.
• Specifically, the convolution structure is implemented by a 3x3 depth-wise
convolution.
• For the Transformer structure, there is a positional embedding module in
the original design, but recent research suggests that absolute positional
embedding breaks translation invariance, which is not suitable for images.
• Inspired by recent vision transformer design, the authors introduce a
convolutional positional encoding (CPE) as a bypass in each spatial mixing
module.
• MLP-Mixer does not use any positional embedding, but empirically a very
lightweight CPE significantly improves the model performance.
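A possible implementation of such a CPE bypass is sketched below (an assumption based on the description above: a lightweight depthwise 3x3 convolution applied to the token map and added back as a residual):

```python
import torch
import torch.nn as nn

class ConvPositionalEncoding(nn.Module):
    """CPE bypass: reshape tokens to a feature map, apply a depthwise 3x3
    convolution, and add the result back to the tokens."""
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x, height, width):                     # x: (batch, tokens, dim)
        b, n, c = x.shape
        feat = x.transpose(1, 2).reshape(b, c, height, width)
        return x + self.dwconv(feat).flatten(2).transpose(1, 2)

cpe = ConvPositionalEncoding(dim=96)
tokens = torch.randn(2, 14 * 14, 96)
print(cpe(tokens, 14, 14).shape)  # torch.Size([2, 196, 96])
```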
13. Empirical Studies on Mixing Blocks
• Datasets and Training Pipelines
▪ ImageNet-1K (IN-1K)
▪ Input resolution: 224x224
▪ Most of the training settings are inherited from DeiT
▪ AdamW optimizer for 300 epochs with a cosine decay and 20 epochs of linear
warm-up
▪ Weight decay: 0.05
▪ Learning rate: 0.005 x batch size / 512
▪ 8 GPUs with a minibatch of 128 per GPU are used for training
▪ Same data augmentation and regularization configurations as DeiT
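Put together, the optimization setup above could look roughly like this in PyTorch (the model is a placeholder, and composing the linear warm-up with cosine decay via SequentialLR is my assumption about how the schedule is realized):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)                      # placeholder for a SPACH model
epochs, warmup_epochs = 300, 20
batch_size = 8 * 128                           # 8 GPUs x 128 images per GPU = 1024
lr = 0.005 * batch_size / 512                  # = 0.01 peak learning rate

optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[
        torch.optim.lr_scheduler.LinearLR(
            optimizer, start_factor=1e-3, total_iters=warmup_epochs),   # 20-epoch warm-up
        torch.optim.lr_scheduler.CosineAnnealingLR(
            optimizer, T_max=epochs - warmup_epochs),                   # cosine decay
    ],
    milestones=[warmup_epochs],
)
```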
14. Empirical Studies on Mixing Blocks
• Multi-Stage is Superior to Single-Stage
▪ First finding is that multi-stage design should always be adopted in vision
models no matter which of the three network structures is chosen.
15. Empirical Studies on Mixing Blocks
• Local Modeling is Crucial
▪ The spatial mixing block of the convolution structure is implemented by a 3x3
depth-wise convolution, which is a typical local modeling operation.
▪ It is so light-weight that it accounts for only 0.3% of the model parameters
and 0.5% of the FLOPs.
▪ The convolution bypass only slightly decreases the throughput, but brings a
notable accuracy increase to both models.
▪ This experiment confirms the importance of local modeling and suggests
using a 3x3 depth-wise convolution as a bypass for any network structure.
16. Empirical Studies on Mixing Blocks
• A Detailed Analysis of MLP
▪ Due to the excessive number of parameters, MLP models suffer severely from
over-fitting.
▪ The strong generalization capability of the multi-stage framework also brings gains.
▪ Parameter reduction through weight sharing can help too.
➢In the single-stage model, all N mixing blocks share the same ℱ𝑠.
➢In the multi-stage model, each stage shares the same ℱ𝑠 across its 𝑁𝑠 mixing blocks.
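A small sketch of this weight-sharing trick (the helper and factory names are invented for illustration): instead of constructing a new spatial mixing module per block, one instance is created and reused by every block in the stage.

```python
import torch.nn as nn

def build_stage(num_blocks, make_spatial_mixing, make_block, share_spatial=True):
    """Build one stage of mixing blocks; with share_spatial=True, all blocks
    in the stage reuse a single F_s module, reducing the parameter count."""
    shared_fs = make_spatial_mixing() if share_spatial else None
    blocks = []
    for _ in range(num_blocks):
        fs = shared_fs if share_spatial else make_spatial_mixing()
        blocks.append(make_block(fs))          # every block receives the same F_s instance
    return nn.Sequential(*blocks)
```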
17. Empirical Studies on Mixing Blocks
• Convolution and Transformer are Complementary
▪ Convolution structure has the best generalization capability.
▪ Transformer structure has the largest model capacity among the three structures.
▪ Before the performance of Conv-MS saturates, it achieves a higher test accuracy
than Trans-MS.
▪ This suggests that convolution is still the best choice in designing lightweight
vision models.
▪ Transformer models achieve higher accuracy than the other structures when
the model size is increased and a higher computational cost is allowed.
▪ It is now clear that sparse connectivity helps to increase generalization
capability, while dynamic weights and a global receptive field help to
increase model capacity.
18. Hybrid Models
• The authors construct hybrid models at the XS and S scales based on
these two structures.
• Hybrid-MS-XS: It is based on Conv-MS-XS. The last ten layers in Stage
3 and the last two layers in Stage 4 are replaced by Transformer layers.
Stage 1 and 2 remain unchanged.
• Hybrid-MS-S: It is based on Conv-MS-S. The last two layers in Stage 2,
the last ten layers in Stage 3, and the last two layers in Stage 4 are
replaced by Transformer layers. Stage 1 remains unchanged.
• The authors further adopt the deep patch embedding layer (PEL)
implementation suggested in LV-ViT. Different from the default PEL, the
deep PEL uses four convolution kernels with kernel sizes {7,3,3,2},
strides {2,1,1,2}, and channel numbers {64,64,64,C}, as sketched below.
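A sketch of that deep PEL, following the kernel/stride/channel numbers quoted above (the normalization and activation between convolutions are assumptions), could look like this:

```python
import torch
import torch.nn as nn

def deep_patch_embedding(out_channels):
    """Four-convolution patch embedding with kernel sizes {7,3,3,2},
    strides {2,1,1,2}, and channel numbers {64,64,64,C}; overall stride 4."""
    return nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), nn.BatchNorm2d(64), nn.ReLU(),
        nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
        nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
        nn.Conv2d(64, out_channels, kernel_size=2, stride=2),
    )

x = torch.randn(1, 3, 224, 224)
print(deep_patch_embedding(96)(x).shape)  # torch.Size([1, 96, 56, 56])
```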
19. Hybrid Models
• The Hybrid-MS-XS achieves 82.4% top-1 accuracy with 28M parameters,
which is higher than Conv-MS-S with 44M parameters and only a little
lower than Trans-MS-S with 40M parameters.
• In addition, the Hybrid-MS-S achieves 83.7% top-1 accuracy with 63M
parameters, a 0.8-point gain over Trans-MS-S.
• The Hybrid-MS-S+ model achieves 83.9% top-1
accuracy with 63M parameters. This number is
higher than the accuracy achieved by SOTA models
Swin-B and CaiT-S36.
20. Conclusion
• The objective of this work is to understand how the emerging
Transformer and MLP structures compare with CNNs in the computer
vision domain.
• The authors first built a simple and unified framework, called SPACH,
that could use CNN, Transformer or MLP as plug-and-play
components.
• All 3 network structures are similarly competitive in terms of the
accuracy-complexity trade-off, although they show distinctive
properties when the network scales up.
• Multi-stage framework and local modeling are very important design
choices.
• The authors propose two hybrid models which achieve SOTA
performance on ImageNet-1k classification.