SlideShare a Scribd company logo
1 of 42
Learning with Purpose
ConvNeXt:
A ConvNet for the 2020s
Learning with Purpose
INTRODUCTION: CONVNETS
• LeNet/ AlexNet / ResNet
• Have several built-in inductive biases
– well-suited to a wide variety of computer vision applications.
• Translation equivariance
• Shared Computations:
– inherently efficient in a sliding-window
• Fundamental building block in visual recognition system
Learning with Purpose
TRANSFORMERS IN NLP: BERT
• NN design for NLP:
– Transformers replaced
RNNs
– become the dominant
backbone architecture
Learning with Purpose
IMAGINE. . .
• But how do we tokenize image?
Learning with Purpose
INTRODUCTION: VISION TRANSFORMERS
• Vision Transformers (ViT) in 2020:
– completely altered the landscape of network architecture design.
– two streams surprisingly converged
• • Transformers outperform ResNets by a significant margin.
– on image classification
– only on image classification
Learning with Purpose
VISION TRANSFORMER: PATCHIFY
For 3*3 image:
Learning with Purpose
VISION TRANSFORMER: VIT
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021
Learning with Purpose
Learning with Purpose
Learning with Purpose
Learning with Purpose
CHALLENGES OF VIT
• ViT’s global attention design:
– quadratic complexity with respect to the input size.
• Hierarchical Transformers
– hybrid approach to bridge this gap.
– for example, the sliding window strategy
(e.g., attention within local windows)
– behave more similarly to ConvNets.
Learning with Purpose
SWIN TRANSFORMER
• a milestone work in this direction
• demonstrated Transformers as a generic vision backbone
• achieved SOTA results across a range of computer vision tasks
– beyond image classification.
• successful and rapid adoption revealed:
– the essence of convolution is not becoming irrelevant; rather,
– it remains much desired and has never faded.
Learning with Purpose
SWIN TRANSFORMER
• Still relies on Patches
• But instead of choosing one
• Start with smaller patch and merge in the deeper transformer layers
• Similar to Convolution
Learning with Purpose
SWIN TRANSFORMER
• Shifted windows Transformer
• Self-attention in non-overlapped windows
• Hierarchical feature maps
Learning with Purpose
BACK TO THE TOPIC: CONVNEXT
Objectives:
• bridge the gap between the pre-ViT and post-ViT ConvNets
• test the limits of what a pure ConvNet can achieve.
• Exploration: How do design decisions in Transformers impact
ConvNets’ performance?
• Result: a family of pure ConvNets: ConvNeXt.
Learning with Purpose
MODERNIZING A CONVNET: A ROADMAP
• Recipe: Two ResNet model sizes in terms of G-FLOPs
– ResNet-50 / Swin-T ~4.5x109 FLOPs
– ResNet-200 / Swin-B ~15.0x109 FLOPs
• Network Modernization
1. ViT training techniques for ResNet
2. ResNeXt
3. Inverted bottleneck
4. Large kernel size
5. Various layer-wise micro designs
Learning with Purpose
VIT TRAINING TECHNIQUES
• Training epochs: 300 epochs (from the original 90 epochs for ResNets)
• Optimizer: AdamW (Weight Decay)
• Data augmentation
– Mixup
– Cutmix
– RandAugment
– Random Erasing
• Regularization
– Stochastic Depth
– Label Smoothing
• This increased the performance of the ResNet-50 model from 76.1% to
78.8%(+2.7%)
Learning with Purpose
CHANGING STAGE COMPUTE RATIO
• Swin-T’s computation ratio of each stage is 1:1:3:1, and for larger
• Swin-B Transformers, the ratio is 1:1:9:1.
• #Conv blocks in ResNet-50: (3, 4, 6, 3)
• ConvNeXt adopts to (3, 3, 9, 3), for ResNet-50
– which also aligns the FLOPs with Swin-T.
• This improves the model accuracy from 78.8% to 79.4%.
Learning with Purpose
CHANGING STEM TO “PATCHIFY”
• Stem cell of Standard ResNet: 7x7 convolution layer
– followed by a max pool, which result in a 4x downsampling.
• Stem cell of ViT: a more aggressive “patchify” strategy
– corresponds to a large kernel size(14 or 16) + non-overlapping convolution.
• Swin Transformer uses a similar “patchfiy” layer
– smaller patch size of 4: accommodate the architecture’s multi-stage design.
• ConvNeXt replaces the ResNet-style CNN stem cell
– with a patchify layer implemented using a 4x4, stride 4 convolution layer.
• The accuracy has changed from 79.4% to 79.5%
Learning with Purpose
RESNEXT-IFY
• ResNeXt has a better FLOPs/accuracy trade-off than ResNet.
– The core component is grouped convolution.
• Guiding principle: “use more groups, expand width”.
• ConvNeXt uses depthwise convolution, a special case of grouped
convolution where the number of groups equals the number of
channels.
• ConvNeXt increases the network width to the same number of
channels as Swin-T’s (from 64 to 96)
• This brings the network performance to 80.5% with increased FLOPs
(5.3G)
Learning with Purpose
Learning with Purpose
INVERTED BOTTLENECK
• Important design in every Transformer
• the hidden dimension of the MLP block is 4 times wider than the
input
• Connected to the design in ConvNets with an expansion ratio of
– popularized by MobileNetV2 and gained traction in advanced ConvNets.
– Costly: increased FLOPs for the depth wise convolution layer
• Significant FLOPs reduction in the down sampling residual blocks.
– Reduces the whole network FLOPs to 4.6G
• Interestingly, improved performance
(80.5% to 80.6%).
Learning with Purpose
MOVING UP DEPTHWISE CONV LAYER
• Moving up position of depth wise conv layer as in Transformers.
• The MSA block is placed prior to the MLP layers.
• As ConvNeXt has an inverted bottleneck block, this is a natural
design choice.
– The complex/inefficien modules (MSA, large-kernel conv) will have fewer
channels, while the efficient, dense 1x1 layers will do the heavy lifting.
• This intermediate step reduces the FLOPs to 4.1G, resulting in a
temporary performance degradation to 79.9%.
Learning with Purpose
INCREASING THE KERNEL SIZE
• Swin Transformers reintroduced the local window to the selfattention
block, the window size is at least 7x7, significantly larger than the
ResNeXt kernel size of 3x3.
• The authors experimented with several kernel sizes, including 3, 5, 7,
9, and 11. The network’s performance increases from 79.9% (3x3) to
80.6% (7x7), while the network’s FLOPs stay roughly the same.
• A ResNet-200 regime model does not exhibit further gain when
increasing the kernel size beyond 7x7.
Learning with Purpose
REPLACING RELU WITH GELU
• One discrepancy between NLP and vision architectures is the
specifics of which activation functions to use.
• Gaussian Error Linear Unit, or GELU
– a smoother variant of ReLU,
– utilized in the most advanced Transformers
• Google’s BERT and OpenAI’s GPT-2, ViTs.
• ReLU can be substituted with
GELU in ConvNeXt too.
• But the accuracy stays
unchanged (80.6%).
GAUSSIAN ERROR LINEAR UNITS (GELUS) 2020
Learning with Purpose
FEWER ACTIVATION FUNCTIONS
• Transformers have fewer activation functions than ResNet
• There is only one activation function present in the MLP block.
• In comparison, it is common practice to append an activation function
to each convolutional layer, including the 1x1 convs.
• Replicate the Transformer block:
– eliminates all GELU layers from the residual block except for one between two
1x1 layers
• This process improves the result by 0.7% to 81.3%, practically
matching the performance of Swin-T
Learning with Purpose
FEWER NORMALIZATION LAYERS
• Transformer blocks usually have fewer normalization layers as well.
• ConvNeXt removes two BatchNorm (BN) layers, leaving only one BN
layer before the conv 1x1 layers.
• This further boosts the performance to 81.4%, already surpassing
Swin-T’s result.
Learning with Purpose
SUBSTITUTING BN WITH LN
• BatchNorm is an essential component in ConvNets as it improves
the convergence and reduces overfitting.
• Simpler Layer Normalization(LN) has been used in Transformers,
– resulting in good performance across different application scenarios.
• Directly substituting LN for BN in the original ResNet will result in
suboptimal performance.
• With all modifications in network architecture and training techniques,
the authors revisit the impact of using LN in place of BN.
• ConvNeXt model does not have any difficulties training with LN; in
fact, the performance is slightly better, obtaining an accuracy of
81.5%.
Learning with Purpose
SEPARATE DOWNSAMPLING LAYERS
• In ResNet, the spatial downsampling is achieved by the residual
block at the start of each stage, using 3x3 conv with stride 2 (and 1x1
conv with stride 2 at the shortcut connection).
• In Swin Transformers, a separate downsampling layer is added
between stages.
• ConvNeXt uses 2x2 conv layers with stride 2 for spatial
downsampling and adding normalization layers wherever spatial
resolution is changed can help stablize training.
• This can improve the accuracy to 82.0%, significantly exceeding
Swin-T’s 81.3%.
Learning with Purpose
CONVNEXT BLOCK DESIGN
Learning with Purpose
MODERNIZING A STANDARD CONVNET RESULTS
Learning with Purpose
CLOSING REMARKS
• ConvNeXt, a pure ConvNet, that can outperform the Swin
Transformer for ImageNet-1K classification in this compute regime.
• It is also worth noting that none of the design options discussed thus
far is novel—they have all been researched separately, but not
collectively, over the last decade.
• ConvNeXt model has approximately the same FLOPs, #params.,
throughput, and memory use as the Swin Transformer, but does not
require specialized modules such as shifted window attention or
relative position biases.
Learning with Purpose
RESULTS – IMAGENET-1K
Learning with Purpose
RESULTS – IMAGENET-22K
• A widely held view is that vision
Transformers have fewer inductive
biases thus can per form better
than ConvNets when pre-trained
on a larger scale.
• Properly designed ConvNets are
not inferior to vision Transformers
when pre-trained with large dataset
ConvNeXts still perform on par or
better than similarly-sized Swin
Transformers, with slightly higher
throughput.
Learning with Purpose
• Examining if ConvNeXt block design is generalizable to ViT-style
isotropic architectures which have no downsampling layers and keep
the same feature resolutions (e.g. 14x14) at all depths.
• ConvNeXt can perform generally on par with ViT, showing that
ConvNeXt block design is competitive when used in non-hierarchical
models.
Isotropic ConvNeXt vs. ViT
Learning with Purpose
RESULTS – OBJECT DETECTION AND SEGMENTATION ON
COCO, SEMANTIC SEGMENTATION ON ADE20K
Learning with Purpose
RESULTS – ROBUSTNESS EVALUATION OF CONVNEXT
Learning with Purpose
REMARKS ON MODEL EFFICIENCY[1]
• Under similar FLOPs, models with depthwise convolutions are known
to be slower and consume more memory than ConvNets with only
dense convolutions.
• Natural to ask whether the design of ConvNeXt will render it
practically inefficient.
• As demonstrated throughout the paper, the inference throughputs of
ConvNeXts are comparable to or exceed that of Swin Transformers.
This is true for both classification and other tasks requiring higher-
resolution inputs.
Learning with Purpose
REMARKS ON MODEL EFFICIENCY[2]
• Furthermore, training ConvNeXts requires less memory than training
Swin Transformers. Training Cascade Mask-RCNN using ConvNeXt-
B backbone consumes 17.4GB of peak memory with a per-GPU
batch size of 2, while the reference number for Swin-B is 18.5GB.
• It is worth noting that this improved efficiency is a result of the
ConvNet inductive bias, and is not directly related to the self-
attention mechanism in vision Transformers
Learning with Purpose
CONCLUSIONS[1]
• In the 2020s, vision Transformers, particularly hierarchical ones such
as Swin Transformers, began to overtake ConvNets as the favored
choice for generic vision backbones.
• The widely held belief is that vision Transformers are more accurate,
efficient, and scalable than ConvNets.
• ConvNeXts, a pure ConvNet model can compete favorably with
state-of- the-art hierarchical vision Transformers across multiple
computer vision benchmarks, while retaining the simplicity and
efficiency of standard ConvNets.
Learning with Purpose
CONCLUSIONS[2]
• In some ways, these observations are surprising while ConvNeXt
model itself is not completely new—many design choices have all
been examined separately over the last decade, but not collectively.
• The authors hope that the new results reported in this study will
challenge several widely held views and prompt people to rethink the
importance of convolution in computer vision.
Learning with Purpose

More Related Content

What's hot

PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...
PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...
PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...Jinwon Lee
 
“Introduction to DNN Model Compression Techniques,” a Presentation from Xailient
“Introduction to DNN Model Compression Techniques,” a Presentation from Xailient“Introduction to DNN Model Compression Techniques,” a Presentation from Xailient
“Introduction to DNN Model Compression Techniques,” a Presentation from XailientEdge AI and Vision Alliance
 
ShuffleNet - PR054
ShuffleNet - PR054ShuffleNet - PR054
ShuffleNet - PR054Jinwon Lee
 
Convolutional neural networks deepa
Convolutional neural networks deepaConvolutional neural networks deepa
Convolutional neural networks deepadeepa4466
 
carrier aggregation for LTE
carrier aggregation for LTEcarrier aggregation for LTE
carrier aggregation for LTEJenil. Jacob
 
Convolutional Neural Networks
Convolutional Neural NetworksConvolutional Neural Networks
Convolutional Neural NetworksAshray Bhandare
 
Convolutional Neural Networks : Popular Architectures
Convolutional Neural Networks : Popular ArchitecturesConvolutional Neural Networks : Popular Architectures
Convolutional Neural Networks : Popular Architecturesananth
 
Automotive network and gateway simulation
Automotive network and gateway simulationAutomotive network and gateway simulation
Automotive network and gateway simulationDeepak Shankar
 
adversarial robustness through local linearization
 adversarial robustness through local linearization adversarial robustness through local linearization
adversarial robustness through local linearizationtaeseon ryu
 
PR-108: MobileNetV2: Inverted Residuals and Linear Bottlenecks
PR-108: MobileNetV2: Inverted Residuals and Linear BottlenecksPR-108: MobileNetV2: Inverted Residuals and Linear Bottlenecks
PR-108: MobileNetV2: Inverted Residuals and Linear BottlenecksJinwon Lee
 
PR-393: ResLT: Residual Learning for Long-tailed Recognition
PR-393: ResLT: Residual Learning for Long-tailed RecognitionPR-393: ResLT: Residual Learning for Long-tailed Recognition
PR-393: ResLT: Residual Learning for Long-tailed RecognitionSunghoon Joo
 
PR-169: EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
PR-169: EfficientNet: Rethinking Model Scaling for Convolutional Neural NetworksPR-169: EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
PR-169: EfficientNet: Rethinking Model Scaling for Convolutional Neural NetworksJinwon Lee
 
Introduction to CNN
Introduction to CNNIntroduction to CNN
Introduction to CNNShuai Zhang
 
Basic Learning Algorithms of ANN
Basic Learning Algorithms of ANNBasic Learning Algorithms of ANN
Basic Learning Algorithms of ANNwaseem khan
 
Counterpropagation NETWORK
Counterpropagation NETWORKCounterpropagation NETWORK
Counterpropagation NETWORKESCOM
 

What's hot (20)

PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...
PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...
PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...
 
“Introduction to DNN Model Compression Techniques,” a Presentation from Xailient
“Introduction to DNN Model Compression Techniques,” a Presentation from Xailient“Introduction to DNN Model Compression Techniques,” a Presentation from Xailient
“Introduction to DNN Model Compression Techniques,” a Presentation from Xailient
 
EfficientNet
EfficientNetEfficientNet
EfficientNet
 
Resnet
ResnetResnet
Resnet
 
ShuffleNet - PR054
ShuffleNet - PR054ShuffleNet - PR054
ShuffleNet - PR054
 
Convolutional neural networks deepa
Convolutional neural networks deepaConvolutional neural networks deepa
Convolutional neural networks deepa
 
carrier aggregation for LTE
carrier aggregation for LTEcarrier aggregation for LTE
carrier aggregation for LTE
 
Convolutional Neural Networks
Convolutional Neural NetworksConvolutional Neural Networks
Convolutional Neural Networks
 
Convolutional Neural Networks : Popular Architectures
Convolutional Neural Networks : Popular ArchitecturesConvolutional Neural Networks : Popular Architectures
Convolutional Neural Networks : Popular Architectures
 
Automotive network and gateway simulation
Automotive network and gateway simulationAutomotive network and gateway simulation
Automotive network and gateway simulation
 
Lte physical layer overview
Lte physical layer overviewLte physical layer overview
Lte physical layer overview
 
adversarial robustness through local linearization
 adversarial robustness through local linearization adversarial robustness through local linearization
adversarial robustness through local linearization
 
PR-108: MobileNetV2: Inverted Residuals and Linear Bottlenecks
PR-108: MobileNetV2: Inverted Residuals and Linear BottlenecksPR-108: MobileNetV2: Inverted Residuals and Linear Bottlenecks
PR-108: MobileNetV2: Inverted Residuals and Linear Bottlenecks
 
PR-393: ResLT: Residual Learning for Long-tailed Recognition
PR-393: ResLT: Residual Learning for Long-tailed RecognitionPR-393: ResLT: Residual Learning for Long-tailed Recognition
PR-393: ResLT: Residual Learning for Long-tailed Recognition
 
PR-169: EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
PR-169: EfficientNet: Rethinking Model Scaling for Convolutional Neural NetworksPR-169: EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
PR-169: EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
 
Introduction to CNN
Introduction to CNNIntroduction to CNN
Introduction to CNN
 
Cnn
CnnCnn
Cnn
 
Intro to-matv
Intro to-matvIntro to-matv
Intro to-matv
 
Basic Learning Algorithms of ANN
Basic Learning Algorithms of ANNBasic Learning Algorithms of ANN
Basic Learning Algorithms of ANN
 
Counterpropagation NETWORK
Counterpropagation NETWORKCounterpropagation NETWORK
Counterpropagation NETWORK
 

Similar to ConvNeXt: A ConvNet for the 2020s

EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks.pptx
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks.pptxEfficientNet: Rethinking Model Scaling for Convolutional Neural Networks.pptx
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks.pptxssuser2624f71
 
Mix Conv: Mixed Depthwise Convolutional Kernels
Mix Conv: Mixed Depthwise Convolutional KernelsMix Conv: Mixed Depthwise Convolutional Kernels
Mix Conv: Mixed Depthwise Convolutional KernelsSeunghyun Hwang
 
PR-344: A Battle of Network Structures: An Empirical Study of CNN, Transforme...
PR-344: A Battle of Network Structures: An Empirical Study of CNN, Transforme...PR-344: A Battle of Network Structures: An Empirical Study of CNN, Transforme...
PR-344: A Battle of Network Structures: An Empirical Study of CNN, Transforme...Jinwon Lee
 
PR243: Designing Network Design Spaces
PR243: Designing Network Design SpacesPR243: Designing Network Design Spaces
PR243: Designing Network Design SpacesJinwon Lee
 
PR-183: MixNet: Mixed Depthwise Convolutional Kernels
PR-183: MixNet: Mixed Depthwise Convolutional KernelsPR-183: MixNet: Mixed Depthwise Convolutional Kernels
PR-183: MixNet: Mixed Depthwise Convolutional KernelsJinwon Lee
 
04 Deep CNN (Ch_01 to Ch_3).pptx
04 Deep CNN (Ch_01 to Ch_3).pptx04 Deep CNN (Ch_01 to Ch_3).pptx
04 Deep CNN (Ch_01 to Ch_3).pptxZainULABIDIN496386
 
Rethinking Attention with Performers
Rethinking Attention with PerformersRethinking Attention with Performers
Rethinking Attention with PerformersJoonhyung Lee
 
Convolutional Neural Network and RNN for OCR problem.
Convolutional Neural Network and RNN for OCR problem.Convolutional Neural Network and RNN for OCR problem.
Convolutional Neural Network and RNN for OCR problem.Vishal Mishra
 
PR-317: MLP-Mixer: An all-MLP Architecture for Vision
PR-317: MLP-Mixer: An all-MLP Architecture for VisionPR-317: MLP-Mixer: An all-MLP Architecture for Vision
PR-317: MLP-Mixer: An all-MLP Architecture for VisionJinwon Lee
 
Высокопроизводительный инференс глубоких сетей на GPU с помощью TensorRT / Ма...
Высокопроизводительный инференс глубоких сетей на GPU с помощью TensorRT / Ма...Высокопроизводительный инференс глубоких сетей на GPU с помощью TensorRT / Ма...
Высокопроизводительный инференс глубоких сетей на GPU с помощью TensorRT / Ма...Ontico
 
[Paper] Multiscale Vision Transformers(MVit)
[Paper] Multiscale Vision Transformers(MVit)[Paper] Multiscale Vision Transformers(MVit)
[Paper] Multiscale Vision Transformers(MVit)Susang Kim
 
convolutional_neural_networks.pptx
convolutional_neural_networks.pptxconvolutional_neural_networks.pptx
convolutional_neural_networks.pptxMsKiranSingh
 
Hands on machine learning with scikit-learn and tensor flow by ahmed yousry
Hands on machine learning with scikit-learn and tensor flow by ahmed yousryHands on machine learning with scikit-learn and tensor flow by ahmed yousry
Hands on machine learning with scikit-learn and tensor flow by ahmed yousryAhmed Yousry
 
Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...
Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...
Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...changedaeoh
 
Deep learning for image video processing
Deep learning for image video processingDeep learning for image video processing
Deep learning for image video processingYu Huang
 
Convolutional Neural Networks for Natural Language Processing / Stanford cs22...
Convolutional Neural Networks for Natural Language Processing / Stanford cs22...Convolutional Neural Networks for Natural Language Processing / Stanford cs22...
Convolutional Neural Networks for Natural Language Processing / Stanford cs22...changedaeoh
 
Convolutional neural networks 이론과 응용
Convolutional neural networks 이론과 응용Convolutional neural networks 이론과 응용
Convolutional neural networks 이론과 응용홍배 김
 
Deformable ConvNets V2, DCNV2
Deformable ConvNets V2, DCNV2Deformable ConvNets V2, DCNV2
Deformable ConvNets V2, DCNV2HaiyanWang16
 

Similar to ConvNeXt: A ConvNet for the 2020s (20)

EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks.pptx
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks.pptxEfficientNet: Rethinking Model Scaling for Convolutional Neural Networks.pptx
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks.pptx
 
Mix Conv: Mixed Depthwise Convolutional Kernels
Mix Conv: Mixed Depthwise Convolutional KernelsMix Conv: Mixed Depthwise Convolutional Kernels
Mix Conv: Mixed Depthwise Convolutional Kernels
 
PR-344: A Battle of Network Structures: An Empirical Study of CNN, Transforme...
PR-344: A Battle of Network Structures: An Empirical Study of CNN, Transforme...PR-344: A Battle of Network Structures: An Empirical Study of CNN, Transforme...
PR-344: A Battle of Network Structures: An Empirical Study of CNN, Transforme...
 
PR243: Designing Network Design Spaces
PR243: Designing Network Design SpacesPR243: Designing Network Design Spaces
PR243: Designing Network Design Spaces
 
PR-183: MixNet: Mixed Depthwise Convolutional Kernels
PR-183: MixNet: Mixed Depthwise Convolutional KernelsPR-183: MixNet: Mixed Depthwise Convolutional Kernels
PR-183: MixNet: Mixed Depthwise Convolutional Kernels
 
04 Deep CNN (Ch_01 to Ch_3).pptx
04 Deep CNN (Ch_01 to Ch_3).pptx04 Deep CNN (Ch_01 to Ch_3).pptx
04 Deep CNN (Ch_01 to Ch_3).pptx
 
Dl
DlDl
Dl
 
Rethinking Attention with Performers
Rethinking Attention with PerformersRethinking Attention with Performers
Rethinking Attention with Performers
 
Convolutional Neural Network and RNN for OCR problem.
Convolutional Neural Network and RNN for OCR problem.Convolutional Neural Network and RNN for OCR problem.
Convolutional Neural Network and RNN for OCR problem.
 
PR-317: MLP-Mixer: An all-MLP Architecture for Vision
PR-317: MLP-Mixer: An all-MLP Architecture for VisionPR-317: MLP-Mixer: An all-MLP Architecture for Vision
PR-317: MLP-Mixer: An all-MLP Architecture for Vision
 
Высокопроизводительный инференс глубоких сетей на GPU с помощью TensorRT / Ма...
Высокопроизводительный инференс глубоких сетей на GPU с помощью TensorRT / Ма...Высокопроизводительный инференс глубоких сетей на GPU с помощью TensorRT / Ма...
Высокопроизводительный инференс глубоких сетей на GPU с помощью TensorRT / Ма...
 
[Paper] Multiscale Vision Transformers(MVit)
[Paper] Multiscale Vision Transformers(MVit)[Paper] Multiscale Vision Transformers(MVit)
[Paper] Multiscale Vision Transformers(MVit)
 
convolutional_neural_networks.pptx
convolutional_neural_networks.pptxconvolutional_neural_networks.pptx
convolutional_neural_networks.pptx
 
Hands on machine learning with scikit-learn and tensor flow by ahmed yousry
Hands on machine learning with scikit-learn and tensor flow by ahmed yousryHands on machine learning with scikit-learn and tensor flow by ahmed yousry
Hands on machine learning with scikit-learn and tensor flow by ahmed yousry
 
Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...
Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...
Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...
 
Deep learning for image video processing
Deep learning for image video processingDeep learning for image video processing
Deep learning for image video processing
 
MobileNet V3
MobileNet V3MobileNet V3
MobileNet V3
 
Convolutional Neural Networks for Natural Language Processing / Stanford cs22...
Convolutional Neural Networks for Natural Language Processing / Stanford cs22...Convolutional Neural Networks for Natural Language Processing / Stanford cs22...
Convolutional Neural Networks for Natural Language Processing / Stanford cs22...
 
Convolutional neural networks 이론과 응용
Convolutional neural networks 이론과 응용Convolutional neural networks 이론과 응용
Convolutional neural networks 이론과 응용
 
Deformable ConvNets V2, DCNV2
Deformable ConvNets V2, DCNV2Deformable ConvNets V2, DCNV2
Deformable ConvNets V2, DCNV2
 

Recently uploaded

BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 

Recently uploaded (20)

BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 

ConvNeXt: A ConvNet for the 2020s

  • 1. Learning with Purpose ConvNeXt: A ConvNet for the 2020s
  • 2. Learning with Purpose INTRODUCTION: CONVNETS • LeNet/ AlexNet / ResNet • Have several built-in inductive biases – well-suited to a wide variety of computer vision applications. • Translation equivariance • Shared Computations: – inherently efficient in a sliding-window • Fundamental building block in visual recognition system
  • 3. Learning with Purpose TRANSFORMERS IN NLP: BERT • NN design for NLP: – Transformers replaced RNNs – become the dominant backbone architecture
  • 4. Learning with Purpose IMAGINE. . . • But how do we tokenize image?
  • 5. Learning with Purpose INTRODUCTION: VISION TRANSFORMERS • Vision Transformers (ViT) in 2020: – completely altered the landscape of network architecture design. – two streams surprisingly converged • • Transformers outperform ResNets by a significant margin. – on image classification – only on image classification
  • 6. Learning with Purpose VISION TRANSFORMER: PATCHIFY For 3*3 image:
  • 7. Learning with Purpose VISION TRANSFORMER: VIT An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021
  • 11. Learning with Purpose CHALLENGES OF VIT • ViT’s global attention design: – quadratic complexity with respect to the input size. • Hierarchical Transformers – hybrid approach to bridge this gap. – for example, the sliding window strategy (e.g., attention within local windows) – behave more similarly to ConvNets.
  • 12. Learning with Purpose SWIN TRANSFORMER • a milestone work in this direction • demonstrated Transformers as a generic vision backbone • achieved SOTA results across a range of computer vision tasks – beyond image classification. • successful and rapid adoption revealed: – the essence of convolution is not becoming irrelevant; rather, – it remains much desired and has never faded.
  • 13. Learning with Purpose SWIN TRANSFORMER • Still relies on Patches • But instead of choosing one • Start with smaller patch and merge in the deeper transformer layers • Similar to Convolution
  • 14. Learning with Purpose SWIN TRANSFORMER • Shifted windows Transformer • Self-attention in non-overlapped windows • Hierarchical feature maps
  • 15. Learning with Purpose BACK TO THE TOPIC: CONVNEXT Objectives: • bridge the gap between the pre-ViT and post-ViT ConvNets • test the limits of what a pure ConvNet can achieve. • Exploration: How do design decisions in Transformers impact ConvNets’ performance? • Result: a family of pure ConvNets: ConvNeXt.
  • 16. Learning with Purpose MODERNIZING A CONVNET: A ROADMAP • Recipe: Two ResNet model sizes in terms of G-FLOPs – ResNet-50 / Swin-T ~4.5x109 FLOPs – ResNet-200 / Swin-B ~15.0x109 FLOPs • Network Modernization 1. ViT training techniques for ResNet 2. ResNeXt 3. Inverted bottleneck 4. Large kernel size 5. Various layer-wise micro designs
  • 17. Learning with Purpose VIT TRAINING TECHNIQUES • Training epochs: 300 epochs (from the original 90 epochs for ResNets) • Optimizer: AdamW (Weight Decay) • Data augmentation – Mixup – Cutmix – RandAugment – Random Erasing • Regularization – Stochastic Depth – Label Smoothing • This increased the performance of the ResNet-50 model from 76.1% to 78.8%(+2.7%)
  • 18. Learning with Purpose CHANGING STAGE COMPUTE RATIO • Swin-T’s computation ratio of each stage is 1:1:3:1, and for larger • Swin-B Transformers, the ratio is 1:1:9:1. • #Conv blocks in ResNet-50: (3, 4, 6, 3) • ConvNeXt adopts to (3, 3, 9, 3), for ResNet-50 – which also aligns the FLOPs with Swin-T. • This improves the model accuracy from 78.8% to 79.4%.
  • 19. Learning with Purpose CHANGING STEM TO “PATCHIFY” • Stem cell of Standard ResNet: 7x7 convolution layer – followed by a max pool, which result in a 4x downsampling. • Stem cell of ViT: a more aggressive “patchify” strategy – corresponds to a large kernel size(14 or 16) + non-overlapping convolution. • Swin Transformer uses a similar “patchfiy” layer – smaller patch size of 4: accommodate the architecture’s multi-stage design. • ConvNeXt replaces the ResNet-style CNN stem cell – with a patchify layer implemented using a 4x4, stride 4 convolution layer. • The accuracy has changed from 79.4% to 79.5%
  • 20. Learning with Purpose RESNEXT-IFY • ResNeXt has a better FLOPs/accuracy trade-off than ResNet. – The core component is grouped convolution. • Guiding principle: “use more groups, expand width”. • ConvNeXt uses depthwise convolution, a special case of grouped convolution where the number of groups equals the number of channels. • ConvNeXt increases the network width to the same number of channels as Swin-T’s (from 64 to 96) • This brings the network performance to 80.5% with increased FLOPs (5.3G)
  • 22. Learning with Purpose INVERTED BOTTLENECK • Important design in every Transformer • the hidden dimension of the MLP block is 4 times wider than the input • Connected to the design in ConvNets with an expansion ratio of – popularized by MobileNetV2 and gained traction in advanced ConvNets. – Costly: increased FLOPs for the depth wise convolution layer • Significant FLOPs reduction in the down sampling residual blocks. – Reduces the whole network FLOPs to 4.6G • Interestingly, improved performance (80.5% to 80.6%).
  • 23. Learning with Purpose MOVING UP DEPTHWISE CONV LAYER • Moving up position of depth wise conv layer as in Transformers. • The MSA block is placed prior to the MLP layers. • As ConvNeXt has an inverted bottleneck block, this is a natural design choice. – The complex/inefficien modules (MSA, large-kernel conv) will have fewer channels, while the efficient, dense 1x1 layers will do the heavy lifting. • This intermediate step reduces the FLOPs to 4.1G, resulting in a temporary performance degradation to 79.9%.
  • 24. Learning with Purpose INCREASING THE KERNEL SIZE • Swin Transformers reintroduced the local window to the selfattention block, the window size is at least 7x7, significantly larger than the ResNeXt kernel size of 3x3. • The authors experimented with several kernel sizes, including 3, 5, 7, 9, and 11. The network’s performance increases from 79.9% (3x3) to 80.6% (7x7), while the network’s FLOPs stay roughly the same. • A ResNet-200 regime model does not exhibit further gain when increasing the kernel size beyond 7x7.
  • 25. Learning with Purpose REPLACING RELU WITH GELU • One discrepancy between NLP and vision architectures is the specifics of which activation functions to use. • Gaussian Error Linear Unit, or GELU – a smoother variant of ReLU, – utilized in the most advanced Transformers • Google’s BERT and OpenAI’s GPT-2, ViTs. • ReLU can be substituted with GELU in ConvNeXt too. • But the accuracy stays unchanged (80.6%). GAUSSIAN ERROR LINEAR UNITS (GELUS) 2020
  • 26. Learning with Purpose FEWER ACTIVATION FUNCTIONS • Transformers have fewer activation functions than ResNet • There is only one activation function present in the MLP block. • In comparison, it is common practice to append an activation function to each convolutional layer, including the 1x1 convs. • Replicate the Transformer block: – eliminates all GELU layers from the residual block except for one between two 1x1 layers • This process improves the result by 0.7% to 81.3%, practically matching the performance of Swin-T
  • 27. Learning with Purpose FEWER NORMALIZATION LAYERS • Transformer blocks usually have fewer normalization layers as well. • ConvNeXt removes two BatchNorm (BN) layers, leaving only one BN layer before the conv 1x1 layers. • This further boosts the performance to 81.4%, already surpassing Swin-T’s result.
  • 28. Learning with Purpose SUBSTITUTING BN WITH LN • BatchNorm is an essential component in ConvNets as it improves the convergence and reduces overfitting. • Simpler Layer Normalization(LN) has been used in Transformers, – resulting in good performance across different application scenarios. • Directly substituting LN for BN in the original ResNet will result in suboptimal performance. • With all modifications in network architecture and training techniques, the authors revisit the impact of using LN in place of BN. • ConvNeXt model does not have any difficulties training with LN; in fact, the performance is slightly better, obtaining an accuracy of 81.5%.
  • 29. Learning with Purpose SEPARATE DOWNSAMPLING LAYERS • In ResNet, the spatial downsampling is achieved by the residual block at the start of each stage, using 3x3 conv with stride 2 (and 1x1 conv with stride 2 at the shortcut connection). • In Swin Transformers, a separate downsampling layer is added between stages. • ConvNeXt uses 2x2 conv layers with stride 2 for spatial downsampling and adding normalization layers wherever spatial resolution is changed can help stablize training. • This can improve the accuracy to 82.0%, significantly exceeding Swin-T’s 81.3%.
  • 31. Learning with Purpose MODERNIZING A STANDARD CONVNET RESULTS
  • 32. Learning with Purpose CLOSING REMARKS • ConvNeXt, a pure ConvNet, that can outperform the Swin Transformer for ImageNet-1K classification in this compute regime. • It is also worth noting that none of the design options discussed thus far is novel—they have all been researched separately, but not collectively, over the last decade. • ConvNeXt model has approximately the same FLOPs, #params., throughput, and memory use as the Swin Transformer, but does not require specialized modules such as shifted window attention or relative position biases.
  • 33. Learning with Purpose RESULTS – IMAGENET-1K
  • 34. Learning with Purpose RESULTS – IMAGENET-22K • A widely held view is that vision Transformers have fewer inductive biases thus can per form better than ConvNets when pre-trained on a larger scale. • Properly designed ConvNets are not inferior to vision Transformers when pre-trained with large dataset ConvNeXts still perform on par or better than similarly-sized Swin Transformers, with slightly higher throughput.
  • 35. Learning with Purpose • Examining if ConvNeXt block design is generalizable to ViT-style isotropic architectures which have no downsampling layers and keep the same feature resolutions (e.g. 14x14) at all depths. • ConvNeXt can perform generally on par with ViT, showing that ConvNeXt block design is competitive when used in non-hierarchical models. Isotropic ConvNeXt vs. ViT
  • 36. Learning with Purpose RESULTS – OBJECT DETECTION AND SEGMENTATION ON COCO, SEMANTIC SEGMENTATION ON ADE20K
  • 37. Learning with Purpose RESULTS – ROBUSTNESS EVALUATION OF CONVNEXT
  • 38. Learning with Purpose REMARKS ON MODEL EFFICIENCY[1] • Under similar FLOPs, models with depthwise convolutions are known to be slower and consume more memory than ConvNets with only dense convolutions. • Natural to ask whether the design of ConvNeXt will render it practically inefficient. • As demonstrated throughout the paper, the inference throughputs of ConvNeXts are comparable to or exceed that of Swin Transformers. This is true for both classification and other tasks requiring higher- resolution inputs.
  • 39. Learning with Purpose REMARKS ON MODEL EFFICIENCY[2] • Furthermore, training ConvNeXts requires less memory than training Swin Transformers. Training Cascade Mask-RCNN using ConvNeXt- B backbone consumes 17.4GB of peak memory with a per-GPU batch size of 2, while the reference number for Swin-B is 18.5GB. • It is worth noting that this improved efficiency is a result of the ConvNet inductive bias, and is not directly related to the self- attention mechanism in vision Transformers
  • 40. Learning with Purpose CONCLUSIONS[1] • In the 2020s, vision Transformers, particularly hierarchical ones such as Swin Transformers, began to overtake ConvNets as the favored choice for generic vision backbones. • The widely held belief is that vision Transformers are more accurate, efficient, and scalable than ConvNets. • ConvNeXts, a pure ConvNet model can compete favorably with state-of- the-art hierarchical vision Transformers across multiple computer vision benchmarks, while retaining the simplicity and efficiency of standard ConvNets.
  • 41. Learning with Purpose CONCLUSIONS[2] • In some ways, these observations are surprising while ConvNeXt model itself is not completely new—many design choices have all been examined separately over the last decade, but not collectively. • The authors hope that the new results reported in this study will challenge several widely held views and prompt people to rethink the importance of convolution in computer vision.