A ConvNet for the 2020s
Zhuang Liu et al.
Sushant Gautam
MSIISE
Department of Electronics and
Computer Engineering
Institute of Engineering,
Thapathali Campus
13 March 2022
Introduction: ConvNets
• LeNet / AlexNet / ResNets
• Have several built-in inductive biases
• well-suited to a wide variety of computer vision applications.
• Translation equivariance (sketched below)
• Shared computations:
• inherently efficient in a sliding-window manner
• Fundamental building block in visual recognition systems
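Translation equivariance can be checked directly. A minimal PyTorch sketch (assuming a stride-1 conv with circular padding, so shifts wrap around exactly):

import torch
import torch.nn as nn

conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, padding_mode="circular", bias=False)
x = torch.randn(1, 1, 8, 8)
shift = lambda t: torch.roll(t, shifts=(2, 3), dims=(2, 3))  # translate the image

# Convolving then shifting equals shifting then convolving:
print(torch.allclose(shift(conv(x)), conv(shift(x)), atol=1e-6))  # True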
Transformers in NLP: BERT
• NN design for NLP:
• Transformers replaced RNNs
• became the dominant backbone architecture
Imagine. . .
• But how do we tokenize an image?
Introduction: Vision Transformers
• Vision Transformers (ViT) in 2020:
• completely altered the landscape of network architecture design.
• two streams surprisingly converged
• Transformers outperform ResNets by a significant margin
• on image classification
• but only on image classification
Vision Transformer: Patchify
(Figures: patchify illustrations, annotated “For 3x3 image” and “8x8 image”.)
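A minimal sketch of the patchify step (assumptions: 16x16 patches and an embedding width of 768, as in ViT-Base):

import torch
import torch.nn as nn

patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)  # non-overlapping patches
img = torch.randn(1, 3, 224, 224)
tokens = patch_embed(img)                   # (1, 768, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)  # (1, 196, 768): one token per patch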
Vision Transformer: ViT
(Slides 7-10: ViT architecture figures; annotations: “Length: C=4”; “Segmentation: each token is a pixel.”)
ViTs were challenging
• ViT’s global attention design:
• quadratic complexity with respect to the input size.
• Hierarchical Transformers
• took a hybrid approach to bridge this gap
• for example, the sliding-window strategy
(e.g., attention within local windows)
• made them behave more similarly to ConvNets (cost comparison below)
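A back-of-the-envelope comparison of the two attention costs (assuming cost scales with tokens² per attention op):

H = W = 56                                # stride-4 feature map for a 224x224 input
tokens = H * W                            # 3136 tokens
global_cost = tokens ** 2                 # ViT-style global attention
win = 7
windows = (H // win) * (W // win)         # 64 non-overlapping windows
local_cost = windows * (win * win) ** 2   # Swin-style window attention
print(global_cost / local_cost)           # 64.0: windowing cuts pairwise work 64x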
Swin Transformer
• a milestone work in this direction
• demonstrated Transformers as a generic vision backbone
• achieved SOTA results across a range of computer vision tasks
• beyond image classification.
• its successful and rapid adoption revealed:
• the essence of convolution is not becoming irrelevant; rather,
• it remains much desired and has never faded.
Swin Transformer
• Still relies on patches
• But instead of choosing one fixed patch size
• Starts with smaller patches and merges them in the deeper Transformer layers
• Similar to U-Net and convolution (patch-merging sketch below)
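A hedged sketch of patch merging (assumption: channels-last layout; each 2x2 neighborhood of tokens is concatenated, then linearly projected so the width doubles):

import torch
import torch.nn as nn

def patch_merge(x, proj):                       # x: (B, H, W, C)
    tl, tr = x[:, 0::2, 0::2], x[:, 0::2, 1::2]
    bl, br = x[:, 1::2, 0::2], x[:, 1::2, 1::2]
    x = torch.cat([tl, tr, bl, br], dim=-1)     # (B, H/2, W/2, 4C)
    return proj(x)                              # (B, H/2, W/2, 2C)

C = 96
proj = nn.Linear(4 * C, 2 * C, bias=False)
out = patch_merge(torch.randn(1, 56, 56, C), proj)  # (1, 28, 28, 192)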
Swin Transformer
• Shifted-window Transformer
• Self-attention in non-overlapping windows (shift sketch below)
• Hierarchical feature maps
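The "shifted" part is often implemented with a cyclic roll; a minimal sketch (assuming window size 7 and shift 3):

import torch

x = torch.randn(1, 56, 56, 96)                          # (B, H, W, C) features
shifted = torch.roll(x, shifts=(-3, -3), dims=(1, 2))   # shift before partitioning
# ...partition `shifted` into 7x7 windows and attend within each window...
restored = torch.roll(shifted, shifts=(3, 3), dims=(1, 2))  # roll back afterwards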
Back to the topic: ConvNeXt
Objectives:
• bridge the gap between the pre-ViT and post-ViT eras for ConvNets
• test the limits of what a pure ConvNet can achieve.
• Exploration: How do design decisions in Transformers impact
ConvNets’ performance?
• Result: a family of pure ConvNets: ConvNeXt.
Modernizing a ConvNet: a Roadmap
• Recipe: two ResNet model sizes in terms of GFLOPs
 ResNet-50 / Swin-T: ~4.5×10⁹ FLOPs
 ResNet-200 / Swin-B: ~15.0×10⁹ FLOPs
• Network Modernization
1. ViT training techniques for ResNet
2. ResNeXt
3. Inverted bottleneck
4. Large kernel size
5. Various layer-wise micro designs
ViT Training Techniques
• Training epochs: 300 (up from the original 90 for ResNets)
• Optimizer: AdamW (Adam with decoupled weight decay)
• Data augmentation
 Mixup
 Cutmix
 RandAugment
 Random Erasing
• Regularization
 Stochastic Depth
 Label Smoothing
• This increased the performance of the ResNet-50 model from 76.1% to 78.8% (+2.7%)
Note: label smoothing keeps the model from becoming overconfident. A sketch of these training ingredients follows.
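A hedged sketch of these ingredients in PyTorch (assumptions: `model`, `images`, and `labels` come from elsewhere; hyperparameters are illustrative, not the paper's exact schedule):

import torch
import torch.nn as nn

def build_training(model):
    # AdamW: Adam with decoupled weight decay
    optimizer = torch.optim.AdamW(model.parameters(), lr=4e-3, weight_decay=0.05)
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # label smoothing
    return optimizer, criterion

def mixup(x, y, alpha=0.8):
    """Mixup: blend random image pairs and train against both labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[perm], y, y[perm], lam

# usage: xm, ya, yb, lam = mixup(images, labels); out = model(xm)
#        loss = lam * criterion(out, ya) + (1 - lam) * criterion(out, yb)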
Changing Stage Compute Ratio
• Swin-T’s stage compute ratio is 1:1:3:1; for larger Swin Transformers (e.g., Swin-B), the ratio is 1:1:9:1.
• #Conv blocks in ResNet-50: (3, 4, 6, 3)
• ConvNeXt adjusts this to (3, 3, 9, 3),
• which also aligns the FLOPs with Swin-T.
• This improves the model accuracy from 78.8% to 79.4%.
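In code this is just a change of per-stage block counts:

resnet50_depths = (3, 4, 6, 3)   # original ResNet-50
convnext_depths = (3, 3, 9, 3)   # 1:1:3:1-style ratio; most compute in stage 3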
Changing Stem to “Patchify”
• Stem cell of a standard ResNet: a 7x7, stride-2 convolution layer
• followed by a max pool, which results in a 4x downsampling.
• Stem cell of ViT: a more aggressive “patchify” strategy
• corresponds to a large kernel size (14 or 16) + non-overlapping convolution.
• Swin Transformer uses a similar “patchify” layer
• but with a smaller patch size of 4 to accommodate the architecture’s multi-stage design.
• ConvNeXt replaces the ResNet-style stem cell
• with a patchify layer implemented as a 4x4, stride-4 convolution (sketched below).
• The accuracy changes from 79.4% to 79.5%.
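The patchify stem in PyTorch (the 4x4/stride-4 conv matches the paper's description; a normalization layer typically follows, per the public implementation):

import torch.nn as nn

stem = nn.Conv2d(3, 96, kernel_size=4, stride=4)  # 224x224 -> 56x56, non-overlapping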
ResNeXt-ify
• ResNeXt has a better FLOPs/accuracy trade-off than ResNet.
• The core component is grouped convolution.
• Guiding principle: “use more groups, expand width”.
• ConvNeXt uses depthwise convolution, a special case of grouped
convolution where the number of groups equals the number of
channels.
• ConvNeXt increases the network width to the same number of
channels as Swin-T’s (from 64 to 96)
• This brings the network performance to 80.5% with increased
FLOPs (5.3G)
(Figure: depthwise convolution.)
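A depthwise convolution in PyTorch is a grouped conv with one group per channel, so each channel is filtered independently (spatial mixing only, no channel mixing):

import torch.nn as nn

dim = 96  # ConvNeXt-T width, matching Swin-T
dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)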
Inverted Bottleneck
• An important design in every Transformer block:
• the hidden dimension of the MLP block is 4 times wider than the input.
• Connected to the inverted-bottleneck design in ConvNets, with an expansion ratio of 4,
• popularized by MobileNetV2 and since adopted in several advanced ConvNets.
• Costly: increased FLOPs for the depthwise convolution layer,
• but a significant FLOPs reduction in the downsampling residual blocks.
• Reduces the whole-network FLOPs to 4.6G.
• Interestingly, performance improves (80.5% to 80.6%). A block sketch follows.
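A sketch of the inverted bottleneck at this step (activations and norms omitted for brevity; note the depthwise conv still sits in the wide middle, which is what raises its FLOPs):

import torch.nn as nn

dim = 96
inverted = nn.Sequential(
    nn.Conv2d(dim, 4 * dim, kernel_size=1),                                 # expand
    nn.Conv2d(4 * dim, 4 * dim, kernel_size=3, padding=1, groups=4 * dim),  # depthwise, wide
    nn.Conv2d(4 * dim, dim, kernel_size=1),                                 # project back
)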
Moving Up Depthwise Conv Layer
• Move up the position of the depthwise conv layer, as in Transformers:
• the MSA block is placed prior to the MLP layers.
• As ConvNeXt has an inverted bottleneck block, this is a natural design choice.
 The complex/inefficient modules (MSA, large-kernel conv) will have fewer
channels, while the efficient, dense 1x1 layers will do the heavy lifting.
• This intermediate step reduces the FLOPs to 4.1G, resulting in a
temporary performance degradation to 79.9%.
Note: a self-attention operation can be viewed as a dynamic lightweight convolution, a form of depthwise separable convolution.
Increasing the Kernel Size
• Swin Transformers reintroduced the local window to the self-attention block; the window size is at least 7x7, significantly larger than the ResNe(X)t kernel size of 3x3.
• The authors experimented with several kernel sizes, including 3, 5, 7,
9, and 11. The network’s performance increases from 79.9% (3x3) to
80.6% (7x7), while the network’s FLOPs stay roughly the same.
• A ResNet-200 regime model does not exhibit further gain when
increasing the kernel size beyond 7x7.
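Combining the two steps, the depthwise conv moves to the front and grows to 7x7; the expensive spatial op now runs on few channels while the dense 1x1 layers do the heavy lifting (activation omitted here for brevity):

import torch.nn as nn

dim = 96
reordered = nn.Sequential(
    nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim),  # spatial mixing first
    nn.Conv2d(dim, 4 * dim, kernel_size=1),                     # expand
    nn.Conv2d(4 * dim, dim, kernel_size=1),                     # project back
)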
Replacing ReLU with GELU
• One discrepancy between NLP and vision architectures is the
specifics of which activation functions to use.
• Gaussian Error Linear Unit, or GELU
• a smoother variant of ReLU,
• utilized in the most advanced Transformers
• Google’s BERT and OpenAI’s GPT-2, ViTs.
• ReLU can be substituted with
GELU in ConvNeXt too.
• But the accuracy stays
unchanged (80.6%).
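For reference, GELU(x) = x · Φ(x), where Φ is the Gaussian CDF; a quick numeric comparison with ReLU:

import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)
print(F.gelu(x))      # smooth: small negative outputs near zero
print(torch.relu(x))  # hard cutoff at zero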
Fewer Activation Functions
• Transformers have fewer activation functions than ResNets
• There is only one activation function present in the MLP block.
• In comparison, it is common practice to append an activation function
to each convolutional layer, including the 1x1 convs.
• Replicate the Transformer block:
• eliminate all GELU layers from the residual block except for one between the
two 1x1 layers
• This process improves the result by 0.7% to 81.3%, practically
matching the performance of Swin-T.
Fewer Normalization Layers
• Transformer blocks usually have fewer normalization layers as well.
• ConvNeXt removes two BatchNorm (BN) layers, leaving only one BN
layer before the conv 1x1 layers.
• This further boosts the performance to 81.4%, already surpassing
Swin-T’s result.
Substituting BN with LN
• BatchNorm is an essential component in ConvNets as it improves the
convergence and reduces overfitting.
• Simpler Layer Normalization (LN) has been used in Transformers,
• resulting in good performance across different application scenarios.
• Directly substituting LN for BN in the original ResNet will result in
suboptimal performance.
• With all modifications in network architecture and training techniques,
the authors revisit the impact of using LN in place of BN.
• The ConvNeXt model does not have any difficulty training with LN; in fact,
the performance is slightly better, obtaining an accuracy of 81.5%.
(A channels-first LN sketch follows.)
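BatchNorm normalizes over the batch, while LN normalizes each sample over its channels; for (B, C, H, W) tensors this needs a channels-first variant. A hedged sketch, similar in spirit to the public ConvNeXt implementation:

import torch
import torch.nn as nn

class ChannelsFirstLayerNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.bias = nn.Parameter(torch.zeros(dim))
        self.eps = eps

    def forward(self, x):                           # x: (B, C, H, W)
        u = x.mean(1, keepdim=True)                 # per-position channel mean
        s = (x - u).pow(2).mean(1, keepdim=True)    # per-position channel variance
        x = (x - u) / torch.sqrt(s + self.eps)
        return self.weight[:, None, None] * x + self.bias[:, None, None]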
Separate Downsampling Layers
• In ResNet, the spatial downsampling is achieved by the residual block
at the start of each stage, using 3x3 conv with stride 2 (and 1x1 conv
with stride 2 at the shortcut connection).
• In Swin Transformers, a separate downsampling layer is added
between stages.
• ConvNeXt uses 2x2 conv layers with stride 2 for spatial downsampling; adding
normalization layers wherever the spatial resolution changes helps stabilize
training (sketched below).
• This improves the accuracy to 82.0%, significantly exceeding Swin-T’s 81.3%.
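A sketch of the separate downsampling layer (reusing the channels-first LayerNorm sketched earlier):

import torch.nn as nn

def downsample(dim_in, dim_out):
    return nn.Sequential(
        ChannelsFirstLayerNorm(dim_in),                      # norm where resolution changes
        nn.Conv2d(dim_in, dim_out, kernel_size=2, stride=2)  # 2x2, stride-2 downsampling
    )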
ConvNeXt Block Design
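The final block design, assembled from the steps above. A hedged sketch: the paper's implementation also uses LayerScale and stochastic depth, omitted here, and applies the 1x1 convs as Linear layers on a permuted tensor, which is equivalent.

import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = ChannelsFirstLayerNorm(dim)                 # single norm layer
        self.pwconv1 = nn.Conv2d(dim, 4 * dim, kernel_size=1)   # inverted bottleneck
        self.act = nn.GELU()                                    # single activation
        self.pwconv2 = nn.Conv2d(4 * dim, dim, kernel_size=1)

    def forward(self, x):  # residual connection around the whole block
        return x + self.pwconv2(self.act(self.pwconv1(self.norm(self.dwconv(x)))))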
Modernizing a Standard ConvNet Results
Closing Remarks
• ConvNeXt, a pure ConvNet, can outperform the Swin Transformer for
ImageNet-1K classification in this compute regime.
• It is also worth noting that none of the design options discussed thus
far is novel—they have all been researched separately, but not
collectively, over the last decade.
• ConvNeXt model has approximately the same FLOPs, #params.,
throughput, and memory use as the Swin Transformer, but does not
require specialized modules such as shifted window attention or
relative position biases.
Results – ImageNet-1K
Results – ImageNet-22K
• A widely held view is that vision Transformers have fewer inductive biases
and thus can perform better than ConvNets when pre-trained at a larger scale.
• Properly designed ConvNets are not inferior to vision Transformers when
pre-trained on a large dataset: ConvNeXts still perform on par with or better
than similarly-sized Swin Transformers, with slightly higher throughput.
Comments on Twitter about the Results
Isotropic ConvNeXt vs. ViT
• Examines whether the ConvNeXt block design generalizes to ViT-style
isotropic architectures, which have no downsampling layers and keep the same
feature resolution (e.g., 14x14) at all depths.
• ConvNeXt performs generally on par with ViT, showing that the ConvNeXt
block design is competitive when used in non-hierarchical models.
Results – Object Detection and Segmentation on COCO; Semantic Segmentation on ADE20K
Results – Robustness Evaluation of ConvNeXt
Remarks on Model Efficiency [1]
• Under similar FLOPs, models with depthwise convolutions are known to be slower
and consume more memory than ConvNets with only dense convolutions.
• Natural to ask whether the design of ConvNeXt will render it practically inefficient.
• As demonstrated throughout the paper, the inference throughput of ConvNeXts
is comparable to or exceeds that of Swin Transformers. This is true both for
classification and for other tasks requiring higher-resolution inputs.
Remarks on Model Efficiency [2]
• Furthermore, training ConvNeXts requires less memory than training Swin
Transformers. Training Cascade Mask R-CNN with a ConvNeXt-B backbone consumes
17.4 GB of peak memory at a per-GPU batch size of 2, while the reference
number for Swin-B is 18.5 GB.
• It is worth noting that this improved efficiency is a result of the ConvNet
inductive bias, and is not directly related to the self-attention mechanism in
vision Transformers.
Conclusions [1]
• In the 2020s, vision Transformers, particularly hierarchical ones such as Swin
Transformers, began to overtake ConvNets as the favored choice for generic
vision backbones.
• The widely held belief is that vision Transformers are more accurate,
efficient, and scalable than ConvNets.
• ConvNeXt, a pure ConvNet model, can compete favorably with state-of-the-art
hierarchical vision Transformers across multiple computer vision benchmarks,
while retaining the simplicity and efficiency of standard ConvNets.
Conclusions [2]
• In some ways, these observations are surprising, even though the ConvNeXt
model itself is not completely new: many design choices have been examined
separately over the last decade, but not collectively.
• The authors hope that the new results reported in this study will
challenge several widely held views and prompt people to rethink the
importance of convolution in computer vision.
Thank you