A Battle of Network Structures:
An Empirical Study of CNN, Transformer, and MLP
Yucheng Zhao et al., “A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP”
5th September, 2021
PR12 Paper Review
JinWon Lee
Samsung Electronics
Introduction
• Recently, Transformer and multi-layer perceptron (MLP)-based
models, such as Vision Transformer(ViT) and MLP-Mixer, started to
lead new trends as they showed promising results in the ImageNet
classification task.
• However, many new networks are still being proposed, and no conclusion
can be drawn as to which structure among CNN, Transformer, and MLP
performs best or is most suitable for vision tasks.
Introduction
• This is partly because the pursuit of high scores leads to
multifarious tricks and exhaustive parameter tuning, so structure-
free components play an important role in the final performance.
• As a result, network structures cannot be compared fairly in a
systematic way.
• The work presented in this paper fills this blank by conducting a series
of controlled experiments over CNN, Transformer, and MLP in a
unified framework.
Backgrounds
• Convolution-based vision models
▪ CNNs were the de facto standard for computer vision models.
▪ A standard convolution layer learns filters in a 3D space (2 spatial, 1 channel),
so the learning of spatial correlations and channel correlations is coupled
inside a single kernel.
▪ Differently, a depthwise convolution layer only learns spatial correlations by
moving the learning process of channel correlations to an additional 1x1
convolution.
▪ The idea of decoupling spatial and channel correlations is adopted in the
Vision Transformer.
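The decoupling shows up directly in the parameter counts: a standard layer learns one k×k×C_in kernel per output channel, while the depthwise + 1x1 split learns spatial-only and channel-only weights separately. A quick sketch (the channel sizes 64/128 are illustrative, not from the paper):

```python
def standard_conv_params(c_in, c_out, k):
    # one 3D kernel per output channel: spatial (k*k) and channel (c_in)
    # correlations are learned together in the same weights
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k):
    # depthwise: one k*k spatial filter per input channel (spatial only)
    # pointwise: a 1x1 convolution that mixes channels (channel only)
    return k * k * c_in + c_in * c_out

print(standard_conv_params(64, 128, 3))         # 73728
print(depthwise_separable_params(64, 128, 3))   # 8768
```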
Backgrounds
• Transformer-based vision models
▪ ViT reveals that Transformer models trained on large-scale datasets could
attain SOTA image recognition performance.
▪ However, when the training data is insufficient, ViT does not perform well due
to the lack of inductive biases.
▪ Swin and Twins adopt locally-grouped self-attention; the local mechanism
not only improves performance, but also brings substantial gains in
memory and computational efficiency.
Backgrounds
• MLP-based vision models
Three Desired Properties in Different Network
Structures
Convolution layer
- Locally connected
- Same weights for all inputs
Dense layer
- Fully connected
- Same weights for all inputs
Attention layer
- Fully connected
- Different weights for all inputs
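The three properties on this slide can be made concrete with a toy token sequence: the convolution is locally connected with static shared weights, the dense layer is fully connected with static weights, and attention is fully connected with weights 𝜓(q〮k) computed from the input itself. A minimal NumPy sketch (all shapes and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))          # 4 tokens, 8 channels

# Convolution layer: locally connected, same weights for all inputs
w_conv = rng.standard_normal((3, 8, 8))  # kernel size 3
def conv_mix(x):
    xp = np.pad(x, ((1, 1), (0, 0)))     # zero-pad the token axis
    return np.stack([sum(xp[i + k] @ w_conv[k] for k in range(3))
                     for i in range(len(x))])

# Dense layer (MLP token mixing): fully connected, same weights for all inputs
w_dense = rng.standard_normal((4, 4))
def dense_mix(x):
    return w_dense @ x

# Attention layer: fully connected, weights depend on the input
w_q, w_k = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
def attn_mix(x):
    scores = (x @ w_q) @ (x @ w_k).T / np.sqrt(8)
    a = np.exp(scores - scores.max(axis=-1, keepdims=True))
    a /= a.sum(axis=-1, keepdims=True)   # row-wise softmax: dynamic weights
    return a @ x

print(np.allclose(dense_mix(2 * x), 2 * dense_mix(x)))  # True: static weights
print(np.allclose(attn_mix(2 * x), 2 * attn_mix(x)))    # False: weights change with input
```

The last two checks show the defining difference: the dense layer is linear in its input because its mixing weights are fixed, while the attention output is not, because the mixing weights themselves are recomputed from the input.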
A Unified Experimental Framework – SPACH
SPACH – the serial structure of SPAtial and
Channel processing
• The architecture is very simple and consists mainly of a cascade of
mixing blocks, plus some necessary auxiliary modules.
• SPACH-MS (Multi-Stage) is designed to keep high resolution in the
initial stage of the framework and progressively perform down-sampling.
• Each stage contains Ns mixing blocks, where s is the stage index.
• Due to the extremely high computational cost of Transformer and
MLP on high-resolution feature maps, the mixing blocks in the first
stage are implemented with convolutions only.
• The patch size is 16 in the single-stage implementation and 4 in the
multi-stage implementation.
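These patch sizes fix the token counts at each stage and can be sanity-checked in a few lines. The per-stage ×2 downsampling below follows the multi-stage description, and a 224×224 input with four stages is assumed:

```python
def spach_ms_resolutions(img=224, patch=4, stages=4):
    res = img // patch           # patch embedding: tokens per side in stage 1
    sizes = [res]
    for _ in range(stages - 1):  # 2x downsampling between stages
        res //= 2
        sizes.append(res)
    return sizes

print(spach_ms_resolutions())    # [56, 28, 14, 7]
print(224 // 16)                 # single-stage: 14x14 tokens throughout
```

The 56×56 feature map in the first stage is why that stage uses convolutions only: Transformer and MLP mixing would be far too expensive at that resolution.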
Mixing Block Design
• Mixing blocks are key components in the SPACH framework.
• Input is first processed by a spatial mixing function ℱ𝑠 and then by a
channel mixing function ℱ𝑐. ℱ𝑠 focuses on aggregating context
information from different spatial locations while ℱ𝑐 focuses on
channel information fusion.
• ℱ𝑐 can be viewed as a 1x1 convolution: it only performs channel
fusion and does not explore any spatial context.
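The serial composition can be sketched in a few lines of NumPy. The spatial-mixing weights here are a stand-in for whichever structure is plugged in, and the residual connections are an assumption about the block layout (shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
N, C = 16, 32                            # tokens, channels (illustrative)
x = rng.standard_normal((N, C))

w_s = rng.standard_normal((N, N)) * 0.1  # stand-in for F_s (structure-dependent)
w_c = rng.standard_normal((C, C)) * 0.1  # F_c: one matrix shared by all tokens

def mixing_block(x):
    x = x + w_s @ x   # F_s: aggregates context across spatial locations
    x = x + x @ w_c   # F_c: 1x1-conv-style channel fusion at each location
    return x

print(mixing_block(x).shape)  # (16, 32)
```

Note that `x @ w_c` applies the same channel-mixing weights to every token, which is exactly what makes ℱ𝑐 equivalent to a 1x1 convolution.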
Mixing Block Design
• The spatial mixing function ℱ𝑠 is the key to implement different
architectures.
• Specifically, the convolution structure is implemented by a 3x3 depth-wise
convolution.
• For the Transformer structure, there is a positional embedding module in
the original design. But recent research suggests that absolute positional
embedding breaks translation invariance, which is not suitable for images.
• Inspired by recent vision transformer design, the authors introduce a
convolutional positional encoding (CPE) as a bypass in each spatial mixing
module.
• MLP-Mixer does not use any positional embedding, but empirically a very
lightweight CPE significantly improves model performance.
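A minimal sketch of such a bypass, assuming the common formulation of CPE as a 3x3 depthwise convolution whose output is added alongside the spatial-mixing branch (the exact placement inside the paper's block may differ):

```python
import numpy as np

def depthwise_conv3x3(x, kernels):
    # x: (C, H, W); kernels: (C, 3, 3) -- one spatial filter per channel
    C, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))   # zero-pad spatial dims
    out = np.zeros_like(x)
    for c in range(C):
        for i in range(H):
            for j in range(W):
                out[c, i, j] = np.sum(xp[c, i:i+3, j:j+3] * kernels[c])
    return out

def with_cpe(x, kernels, spatial_mix):
    # CPE as a bypass: relative-position information is injected additively,
    # with no absolute positional embedding anywhere
    return spatial_mix(x) + depthwise_conv3x3(x, kernels)

# sanity check: identity kernels make the bypass pass x through unchanged
x = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
ident = np.zeros((2, 3, 3)); ident[:, 1, 1] = 1.0
y = with_cpe(x, ident, spatial_mix=lambda t: t)
print(np.allclose(y, 2 * x))  # True
```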
Empirical Studies on Mixing Blocks
• Datasets and Training Pipelines
▪ ImageNet-1K (IN-1K)
▪ Input resolution: 224x224
▪ Most of the training settings are inherited from DeiT
▪ AdamW optimizer for 300 epochs with a cosine decay and 20 epochs of linear
warm-up
▪ Weight decay: 0.05
▪ Learning rate: 0.005 x batchsize / 512
▪ 8 GPUs with a minibatch of 128 per GPU are used in training
▪ Same data augmentation and regularization configurations as DeiT
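The schedule above can be written out directly: with 8 GPUs × 128 images the effective batch size is 1024, giving a peak rate of 0.005 × 1024 / 512 = 0.01, with 20 epochs of linear warm-up followed by cosine decay. A sketch (per-epoch granularity is a simplification; actual implementations may step per iteration):

```python
import math

def deit_style_lr(epoch, base=0.005, batch=8 * 128, total=300, warmup=20):
    peak = base * batch / 512             # linear lr scaling rule
    if epoch < warmup:
        return peak * epoch / warmup      # linear warm-up
    t = (epoch - warmup) / (total - warmup)
    return peak * 0.5 * (1 + math.cos(math.pi * t))  # cosine decay

print(deit_style_lr(20))              # 0.01 -- peak at the end of warm-up
print(round(deit_style_lr(300), 6))   # 0.0  -- decayed to zero at the final epoch
```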
Empirical Studies on Mixing Blocks
• Multi-Stage is Superior to Single-Stage
▪ The first finding is that a multi-stage design should always be adopted in
vision models, no matter which of the three network structures is chosen.
Empirical Studies on Mixing Blocks
• Local Modeling is Crucial
▪ The spatial mixing block of the convolution structure is implemented by a 3x3
depth-wise convolution, which is a typical local modeling operation.
▪ It is so lightweight that it contributes only 0.3% of the model parameters
and 0.5% of the FLOPs.
▪ The convolution bypass only slightly decreases the throughput, but brings a
notable accuracy increase to both models.
▪ This experiment confirms the importance of local modeling and suggests
using a 3x3 depth-wise convolution as a bypass for any designed network
structure.
Empirical Studies on Mixing Blocks
• A Detailed Analysis of MLP
▪ Due to the excessive number of parameters, MLP models suffer severely from
over-fitting.
▪ The strong generalization capability of the multi-stage framework brings gains.
▪ Parameter reduction through weight sharing can help too.
➢In the single-stage model, all N mixing blocks use the same ℱ𝑠.
➢In the multi-stage model, each stage uses the same ℱ𝑠 for its 𝑁𝑠 mixing blocks.
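The parameter saving from this sharing is easy to quantify: sharing collapses the N (or 𝑁𝑠) copies of ℱ𝑠 into one per model (or per stage). A toy count with placeholder numbers (neither the block counts nor the per-ℱ𝑠 size are the paper's figures):

```python
def shared_fs_params(blocks_per_stage, params_per_fs, shared=False):
    # shared=False: every mixing block owns its own F_s
    # shared=True: one F_s per stage, reused by all N_s blocks in that stage
    if shared:
        return len(blocks_per_stage) * params_per_fs
    return sum(blocks_per_stage) * params_per_fs

stages = [2, 2, 6, 2]                            # placeholder N_s values
print(shared_fs_params(stages, 10_000))          # 120000 -- no sharing
print(shared_fs_params(stages, 10_000, True))    # 40000  -- per-stage sharing
print(shared_fs_params([12], 10_000, True))      # 10000  -- single-stage sharing
```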
Empirical Studies on Mixing Blocks
• Convolution and Transformer are Complementary
▪ Convolution structure has the best generalization capability.
▪ Transformer structure has the largest model capacity among the three structures.
▪ Before the performance of Conv-MS saturates, it achieves a higher test accuracy
than Trans-MS.
▪ This suggests that convolution is still the best choice in designing lightweight
vision models.
▪ Transformer models achieve higher accuracy than the other structures when
the model size and computational cost are increased.
▪ It is now clear that sparse connectivity helps to increase generalization
capability, while dynamic weights and a global receptive field help to
increase model capacity.
Hybrid Models
• The authors construct hybrid models at the XS and S scales based on
these two structures.
• Hybrid-MS-XS: It is based on Conv-MS-XS. The last ten layers in Stage
3 and the last two layers in Stage 4 are replaced by Transformer layers.
Stage 1 and 2 remain unchanged.
• Hybrid-MS-S: It is based on Conv-MS-S. The last two layers in Stage 2,
the last ten layers in Stage 3, and the last two layers in Stage 4 are
replaced by Transformer layers. Stage 1 remains unchanged.
• The authors further adopt the deep patch embedding layer (PEL)
implementation suggested in LV-ViT. Different from the default PEL, the
deep PEL uses four convolution kernels with kernel sizes {7,3,3,2},
strides {2,1,1,2}, and channel numbers {64,64,64,C}.
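Those kernel/stride settings can be checked with the standard convolution output-size formula; the two stride-2 convolutions reproduce the ×4 reduction of the multi-stage patch size 4. The paddings {3,1,1,0} below are an assumption (the paper lists only kernels, strides, and channels):

```python
def conv_out(size, k, s, p):
    # standard convolution output-size formula
    return (size + 2 * p - k) // s + 1

def deep_pel_out(size=224):
    # deep PEL: kernels {7,3,3,2}, strides {2,1,1,2}; paddings assumed {3,1,1,0}
    for k, s, p in [(7, 2, 3), (3, 1, 1), (3, 1, 1), (2, 2, 0)]:
        size = conv_out(size, k, s, p)
    return size

print(deep_pel_out())   # 56 -- same 56x56 token grid as a patch size of 4
```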
Hybrid Models
• The Hybrid-MS-XS achieves 82.4% top-1 accuracy with 28M parameters,
which is higher than Conv-MS-S with 44M parameters and only slightly
lower than Trans-MS-S with 40M parameters.
• In addition, the Hybrid-MS-S achieves 83.7% top-1 accuracy with 63M
parameters, a 0.8-point gain compared with Trans-MS-S.
• The Hybrid-MS-S+ model achieves 83.9% top-1 accuracy with 63M
parameters. This number is higher than the accuracy achieved by the
SOTA models Swin-B and CaiT-S36.
Conclusion
• The objective of this work is to understand how the emerging
Transformer and MLP structures compare with CNNs in the computer
vision domain.
• The authors first built a simple and unified framework, called SPACH,
that can use CNN, Transformer, or MLP as plug-and-play components.
• All 3 network structures are similarly competitive in terms of the
accuracy-complexity trade-off, although they show distinctive
properties when the network scales up.
• Multi-stage framework and local modeling are very important design
choices.
• The authors propose two hybrid models which achieve SOTA
performance on ImageNet-1k classification.
Thank you
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 

Recently uploaded (20)

Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 

PR-344: A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP

• 1. A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP
Yucheng Zhao et al., “A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP”
5th September, 2021
PR12 Paper Review
JinWon Lee
Samsung Electronics
• 2. Introduction
• Recently, Transformer and multi-layer perceptron (MLP)-based models, such as Vision Transformer (ViT) and MLP-Mixer, have started to lead new trends as they showed promising results on the ImageNet classification task.
• However, many new networks are still being proposed, and no conclusion can be drawn as to which structure among CNN, Transformer, and MLP performs the best or is most suitable for vision tasks.
• 3. Introduction
• This is partly due to the pursuit of high scores, which leads to multifarious tricks and exhaustive parameter tuning, so that structure-free components play an important role in the final performance.
• As a result, network structures cannot be fairly compared in a systematic way.
• The work presented in this paper fills this blank by conducting a series of controlled experiments over CNN, Transformer, and MLP in a unified framework.
• 4. Backgrounds
• Convolution-based vision models
▪ CNN has long been the de facto standard in computer vision models.
▪ A standard convolution layer learns filters in a 3D space (2 spatial dimensions, 1 channel dimension), so the learning of spatial correlations and channel correlations is coupled inside a single kernel.
▪ Differently, a depthwise convolution layer only learns spatial correlations, moving the learning of channel correlations to an additional 1x1 convolution.
▪ The idea of decoupling spatial and channel correlations is adopted in the Vision Transformer.
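The parameter saving from this decoupling is easy to see with simple counting. A minimal sketch in plain Python (illustrative counts only; biases and any particular model's layer configuration are omitted):

```python
def standard_conv_params(k, c_in, c_out):
    # A standard k x k convolution learns spatial and channel
    # correlations jointly: one 3D kernel per output channel.
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # Depthwise: one k x k spatial filter per input channel.
    # Pointwise: an additional 1x1 convolution handles channel mixing.
    return k * k * c_in + c_in * c_out

# Example: a 3x3 convolution mapping 64 -> 128 channels.
print(standard_conv_params(3, 64, 128))        # 73728
print(depthwise_separable_params(3, 64, 128))  # 576 + 8192 = 8768
```

The decoupled form needs roughly `1/c_out + 1/k^2` of the standard layer's parameters, which is why depthwise convolutions dominate lightweight designs.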
• 6. Backgrounds
• Transformer-based vision models
▪ ViT reveals that Transformer models trained on large-scale datasets can attain SOTA image recognition performance.
▪ However, when the training data is insufficient, ViT does not perform well due to the lack of inductive biases.
▪ Swin and Twins adopt locally-grouped self-attention; the local mechanism not only leads to performance improvement, but also brings substantial gains in memory and computational efficiency.
• 8. Three Desired Properties in Different Network Structures
• Convolution layer: locally connected; same weights for all inputs.
• Dense layer: fully connected; same weights for all inputs.
• Attention layer: fully connected; different weights for all inputs, computed as ψ(q·k) between the query and each key.
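The three connectivity patterns above can be contrasted in a toy 1D example (a NumPy sketch; the shapes and random projections are illustrative, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))                    # 4 tokens, 8 channels

# Convolution: locally connected, same weights at every position.
w_conv = rng.standard_normal((3, 8))               # one shared 3-tap filter
pad = np.pad(x, ((1, 1), (0, 0)))
conv_out = np.array([(pad[i:i + 3] * w_conv).sum() for i in range(4)])

# Dense layer: fully connected across positions, weights fixed per input.
w_dense = rng.standard_normal((4, 4))
dense_out = w_dense @ x

# Attention: fully connected, but the mixing weights depend on the input.
q = x @ rng.standard_normal((8, 8))
k = x @ rng.standard_normal((8, 8))
scores = q @ k.T
a = np.exp(scores - scores.max(-1, keepdims=True)) # softmax over positions
a /= a.sum(-1, keepdims=True)
attn_out = a @ x                                   # dynamic, per-input weights

print(conv_out.shape, dense_out.shape, attn_out.shape)
```

Feeding a different `x` leaves `w_conv` and `w_dense` unchanged but produces a different attention matrix `a`, which is exactly the "different weights for all inputs" property.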
  • 9. A Unified Experimental Framework – SPACH
• 10. SPACH – the serial structure of SPAtial and CHannel processing
• The architecture is very simple and consists mainly of a cascade of mixing blocks, plus some necessary auxiliary modules.
• SPACH-MS (Multi-Stage) is designed to keep a high resolution in the initial stage of the framework and progressively performs down-sampling.
• Each stage contains Ns mixing blocks, where s is the stage index.
• Due to the extremely high computational cost of Transformer and MLP on high-resolution feature maps, the mixing blocks in the first stage are implemented with convolutions only.
• The patch size is 16 in the single-stage implementation and 4 in the multi-stage implementation.
• 11. Mixing Block Design
• Mixing blocks are the key components in the SPACH framework.
• The input is first processed by a spatial mixing function ℱ𝑠 and then by a channel mixing function ℱ𝑐. ℱ𝑠 focuses on aggregating context information from different spatial locations, while ℱ𝑐 focuses on channel information fusion.
• ℱ𝑐 can be viewed as a 1x1 convolution: it only performs channel fusion and does not explore any spatial context.
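The two-step structure can be sketched in a few lines (NumPy; tokens are flattened to a (positions, channels) matrix, and residual connections and normalization are omitted for brevity):

```python
import numpy as np

def channel_mixing(x, w):
    # F_c: equivalent to a 1x1 convolution -- every spatial position is
    # mixed across channels with the same weight matrix, so no spatial
    # context is explored.
    return x @ w                                  # (N, C) @ (C, C) -> (N, C)

def mixing_block(x, spatial_mixing, w_c):
    # Spatial mixing F_s first (context across positions), then channel
    # mixing F_c (fusion across channels).
    return channel_mixing(spatial_mixing(x), w_c)

rng = np.random.default_rng(0)
x = rng.standard_normal((49, 64))                 # 7x7 positions, 64 channels
identity_fs = lambda t: t                         # placeholder F_s
y = mixing_block(x, identity_fs, rng.standard_normal((64, 64)))
print(y.shape)                                    # (49, 64)
```

Swapping `identity_fs` for a depthwise convolution, self-attention, or a token-mixing MLP is what turns this skeleton into the Conv, Trans, and MLP variants.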
• 12. Mixing Block Design
• The spatial mixing function ℱ𝑠 is the key to implementing the different architectures.
• Specifically, the convolution structure is implemented by a 3x3 depth-wise convolution.
• For the Transformer structure, there is a positional embedding module in the original design. But recent research suggests that absolute positional embedding breaks translation invariance, which is not suitable for images.
• Inspired by recent vision transformer designs, the authors introduce a convolutional positional encoding (CPE) as a bypass in each spatial mixing module.
• MLP-Mixer does not use any positional embedding, but empirically a very lightweight CPE significantly improves the model performance.
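A CPE bypass of this kind can be sketched as a 3x3 depthwise convolution added residually (NumPy, zero padding; the exact padding and initialization used in the paper may differ):

```python
import numpy as np

def depthwise_conv3x3(x, w):
    # x: (H, W, C); w: (3, 3, C) -- one 3x3 filter per channel,
    # zero-padded so the spatial size is preserved.
    H, W, C = x.shape
    p = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += p[i:i + H, j:j + W] * w[i, j]
    return out

def with_cpe(x, w):
    # CPE as a bypass: the depthwise convolution injects local position
    # information while the residual keeps the original features intact.
    return x + depthwise_conv3x3(x, w)

rng = np.random.default_rng(0)
x = rng.standard_normal((14, 14, 96))
y = with_cpe(x, rng.standard_normal((3, 3, 96)))
print(y.shape)   # (14, 14, 96)
```

Because positions are encoded relative to their 3x3 neighborhood rather than by absolute index, the encoding shifts with the image content instead of breaking translation invariance.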
• 13. Empirical Studies on Mixing Blocks
• Datasets and Training Pipelines
▪ ImageNet-1K (IN-1K)
▪ Input resolution: 224x224
▪ Most of the training settings are inherited from DeiT
▪ AdamW optimizer for 300 epochs with a cosine decay and 20 epochs of linear warm-up
▪ Weight decay: 0.05
▪ Learning rate: 0.005 x batchsize / 512
▪ 8 GPUs with a minibatch of 128 per GPU are used in training
▪ Same data augmentation and regularization configurations as DeiT
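The learning-rate rule and schedule above can be reproduced as a quick check (the warm-up start value and the minimum learning rate are assumptions; DeiT-style recipes usually decay to a small floor rather than exactly zero):

```python
import math

def learning_rate(epoch, batch_size, base_lr=0.005,
                  warmup_epochs=20, total_epochs=300):
    # Linear scaling rule from the slide: lr = 0.005 x batchsize / 512.
    peak = base_lr * batch_size / 512
    if epoch < warmup_epochs:
        # Linear warm-up over the first 20 epochs.
        return peak * (epoch + 1) / warmup_epochs
    # Cosine decay over the remaining epochs.
    t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * peak * (1 + math.cos(math.pi * t))

# 8 GPUs x 128 per GPU = 1024 total batch size -> peak lr = 0.01.
print(round(learning_rate(20, 1024), 4))   # 0.01 (start of the cosine phase)
```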
• 14. Empirical Studies on Mixing Blocks
• Multi-Stage is Superior to Single-Stage
▪ The first finding is that the multi-stage design should always be adopted in vision models, no matter which of the three network structures is chosen.
• 15. Empirical Studies on Mixing Blocks
• Local Modeling is Crucial
▪ The spatial mixing block of the convolution structure is implemented by a 3x3 depth-wise convolution, which is a typical local modeling operation.
▪ It is so lightweight that it contributes only 0.3% of the model parameters and 0.5% of the FLOPs.
▪ The convolution bypass only slightly decreases the throughput, but brings a notable accuracy increase to both models.
▪ This experiment confirms the importance of local modeling and suggests the use of a 3x3 depth-wise convolution as a bypass for any designed network structure.
• 16. Empirical Studies on Mixing Blocks
• A Detailed Analysis of MLP
▪ Due to the excessive number of parameters, MLP models suffer severely from over-fitting.
▪ The strong generalization capability of the multi-stage framework brings gains.
▪ Parameter reduction through weight sharing can help too.
➢ In the single-stage model, all N mixing blocks use the same ℱ𝑠.
➢ In the multi-stage model, each stage uses the same ℱ𝑠 for its 𝑁𝑠 mixing blocks.
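The effect of weight sharing on the spatial MLP is easy to quantify. A sketch with illustrative token and block counts (not the paper's exact model sizes):

```python
def mlp_spatial_params(n_tokens, n_blocks, shared=False):
    # The spatial MLP F_s mixes across tokens with an
    # n_tokens x n_tokens weight matrix per block; with weight
    # sharing, all blocks reuse a single matrix.
    per_block = n_tokens * n_tokens
    return per_block if shared else per_block * n_blocks

# E.g. a single-stage model with 196 tokens (14x14) and 12 mixing blocks.
print(mlp_spatial_params(196, 12))               # 460992
print(mlp_spatial_params(196, 12, shared=True))  # 38416
```

Sharing cuts the spatial-mixing parameters by the number of blocks (12x here), which is why it mitigates the over-fitting noted above.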
• 17. Empirical Studies on Mixing Blocks
• Convolution and Transformer are Complementary
▪ The convolution structure has the best generalization capability.
▪ The Transformer structure has the largest model capacity among the three structures.
▪ Before the performance of Conv-MS saturates, it achieves a higher test accuracy than Trans-MS.
▪ This suggests that convolution is still the best choice in designing lightweight vision models.
▪ Transformer models achieve higher accuracy than the other structures when the model size and computational budget increase.
▪ It is now clear that sparse connectivity helps to increase generalization capability, while dynamic weights and a global receptive field help to increase model capacity.
• 18. Hybrid Models
• The authors construct hybrid models at the XS and S scales based on these two structures.
• Hybrid-MS-XS: It is based on Conv-MS-XS. The last ten layers in Stage 3 and the last two layers in Stage 4 are replaced by Transformer layers. Stages 1 and 2 remain unchanged.
• Hybrid-MS-S: It is based on Conv-MS-S. The last two layers in Stage 2, the last ten layers in Stage 3, and the last two layers in Stage 4 are replaced by Transformer layers. Stage 1 remains unchanged.
• The authors further adopt the deep patch embedding layer (PEL) implementation suggested in LV-ViT. Different from the default PEL, the deep PEL uses four convolution kernels with kernel sizes {7,3,3,2}, strides {2,1,1,2}, and channel numbers {64,64,64,C}.
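A back-of-the-envelope check on the deep PEL's output resolution (assuming "same"-style padding, so only the strides change the spatial size):

```python
def pel_output_size(in_size, strides=(2, 1, 1, 2)):
    # The deep PEL stacks four convolutions (kernel sizes {7,3,3,2});
    # with padding that preserves size, only the strides {2,1,1,2}
    # reduce the spatial resolution.
    size = in_size
    for s in strides:
        size //= s
    return size

print(pel_output_size(224))   # 56: overall stride 4, matching patch size 4
```

The overall stride of 4 is consistent with the patch size 4 used by the multi-stage models, so the deep PEL is a drop-in replacement for the default one.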
• 19. Hybrid Models
• Hybrid-MS-XS achieves 82.4% top-1 accuracy with 28M parameters, which is higher than Conv-MS-S with 44M parameters and only a little lower than Trans-MS-S with 40M parameters.
• In addition, Hybrid-MS-S achieves 83.7% top-1 accuracy with 63M parameters, a 0.8-point gain over Trans-MS-S.
• The Hybrid-MS-S+ model achieves 83.9% top-1 accuracy with 63M parameters. This number is higher than the accuracy achieved by the SOTA models Swin-B and CaiT-S36.
• 20. Conclusion
• The objective of this work is to understand how the emerging Transformer and MLP structures compare with CNNs in the computer vision domain.
• The authors first built a simple and unified framework, called SPACH, that can use CNN, Transformer, or MLP as plug-and-play components.
• All three network structures are similarly competitive in terms of the accuracy-complexity trade-off, although they show distinctive properties when the network scales up.
• The multi-stage framework and local modeling are very important design choices.
• The authors propose two hybrid models which achieve SOTA performance on ImageNet-1K classification.