How to train your ViT?
Data, Augmentation, and Regularization in Vision Transformers
Andreas Steiner et al., “How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers”
11th July, 2021
PR12 Paper Review
JinWon Lee
Samsung Electronics
Introduction
• ViT has recently emerged as a competitive alternative to
convolutional neural networks.
• Without the translational equivariance of CNNs, ViT models are
generally found to perform best in settings with large amounts of
training data [ViT], or to require strong AugReg (Augmentation and
Regularization) schemes [DeiT] to avoid overfitting.
• So far, there has been no comprehensive study of the trade-offs between
model regularization, data augmentation, training data size, and compute
budget in Vision Transformers.
Introduction
• The authors pre-train a large collection of ViT models on datasets of
different sizes, while at the same time performing comparisons across
different amounts of regularization and data augmentation.
• The homogeneity of the performed study constitutes one of the key
contributions of this paper.
• The insights from this study constitute another important
contribution of this paper.
More than 50,000 ViT Models
• https://github.com/google-research/vision_transformer
• https://github.com/rwightman/pytorch-image-models
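The released checkpoints can be loaded directly through the repositories above. A minimal sketch using timm (assuming `pip install timm torch`; exactly which checkpoint names are available depends on the installed timm version):

```python
# Minimal sketch: load a pre-trained ViT through timm and run a forward pass.
import timm
import torch

model = timm.create_model("vit_base_patch16_224", pretrained=True)
model.eval()

x = torch.randn(1, 3, 224, 224)  # dummy batch at the pre-training resolution
with torch.no_grad():
    logits = model(x)
print(logits.shape)  # torch.Size([1, 1000])
```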
Scope of the Study
• Pre-training models on large datasets once and re-using their
parameters, either as initialization or as feature extractors, in models
trained on a broad variety of other tasks has become common practice in
computer vision.
• In this setup, there are multiple ways to characterize computational
and sample efficiency.
▪ One approach is to look at the overall computational and sample cost of both
pre-training and fine-tuning. Normally, pre-training cost will dominate overall
costs. This interpretation is valid in specific scenarios, especially when
pre-training needs to be done repeatedly or reproduced for academic/industrial
purposes.
Scope of the Study
▪ However, in the majority of cases the pre-trained model can be downloaded
or, in the worst case, trained once in a while. In these cases, by contrast,
the budget required for adapting this model may become the main bottleneck.
▪ A more extreme viewpoint is that the training cost is not crucial: all that
matters is the eventual inference cost of the trained model, i.e. the
deployment cost, which will amortize all other costs.
• Overall, there are three major viewpoints on what is considered to be
the central cost of training a vision model. This study touches on
all three of them, but mostly concentrates on the "practitioner's" and
"deployment" costs.
Experimental Setup
• Datasets and metrics
▪ For pre-training
➢ImageNet-21k – approximately 14M images with about 21,000 categories.
➢ImageNet-1k – a subset of ImageNet-21k consisting of about 1.3M training images and
1000 categories.
➢Images in ImageNet-21k are de-duplicated with respect to the test sets of the
downstream tasks.
➢ImageNet-V2 is used for evaluation purposes.
▪ For transfer learning
➢4 popular computer vision datasets from the VTAB benchmark
• CIFAR-100, Oxford-IIIT Pets (or Pets37 for short), Resisc45, and KITTI-distance
▪ Top-1 classification accuracy is used as the main metric.
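Top-1 accuracy is simply the fraction of examples whose highest-scoring class matches the label; a minimal sketch on dummy logits:

```python
# Top-1 classification accuracy on a dummy batch of logits.
import torch

logits = torch.randn(8, 1000)          # model outputs for 8 examples
labels = torch.randint(0, 1000, (8,))  # ground-truth class indices
top1 = (logits.argmax(dim=-1) == labels).float().mean()
print(f"top-1 accuracy: {top1.item():.3f}")
```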
Experimental Setup
• Models
▪ 4 different configurations: ViT-Ti, ViT-S, ViT-B and ViT-L.
▪ Patch size 16 for all models, and additionally patch size 32 for ViT-S and ViT-B.
▪ The hidden layer in the head of ViT is dropped, as empirically it does not lead to
more accurate models and often results in optimization instabilities.
▪ Hybrid models are also used: these first process images with a ResNet and then
feed the spatial output to a ViT as the initial patch embeddings.
▪ Hybrids are denoted Rn+{Ti,S,L}/p, where n counts the number of convolutions
and p denotes the patch size in the input image.
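For plain (non-hybrid) ViTs, the patch embedding itself is just a strided convolution that turns the image into a token sequence. A minimal sketch below; the hidden size 192 (ViT-Ti's width) is used as an illustrative value:

```python
# Sketch of ViT's patch embedding: a convolution with kernel = stride = patch
# size maps an image to a sequence of patch tokens.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch_size=16, embed_dim=192):
        super().__init__()
        self.proj = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.num_tokens = (img_size // patch_size) ** 2

    def forward(self, x):
        x = self.proj(x)                     # (B, D, H/p, W/p)
        return x.flatten(2).transpose(1, 2)  # (B, N, D) token sequence

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 192]) -- 14x14 patches of size 16
```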
Experimental Setup
• Regularization and data augmentations
▪ Dropout is applied to intermediate activations of ViT, together with the
stochastic depth regularization technique.
▪ For data augmentation, Mixup and RandAugment are applied. 𝛼 is the Mixup
parameter, and 𝑙 and 𝑚 are the number of augmentation layers and the
magnitude, respectively, in RandAugment.
▪ Weight decay is used too.
▪ The sweep contains 28 configurations, the cross-product of the following (one sweep point is sketched in code after this list):
➢No dropout and no stochastic depth, or dropout with prob. 0.1 and stochastic depth with
maximal layer-dropping prob. 0.1
➢7 data augmentation setups for (𝑙, 𝑚, 𝛼): none (0,0,0), light1 (2,0,0), light2 (2,10,0.2),
medium1 (2,15,0.2), medium2 (2,15,0.5), strong1 (2,20,0.5), strong2 (2,20,0.8)
➢Weight decay: 0.1 or 0.03
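As an illustration, one sweep point ("medium1": 𝑙=2, 𝑚=15, 𝛼=0.2, plus dropout/stochastic depth 0.1) could be approximated with timm's utilities. This is a hedged sketch: the paper's JAX implementation differs in details.

```python
# Approximate one AugReg sweep point with timm's RandAugment/Mixup utilities.
import timm
from timm.data import create_transform, Mixup

train_tfm = create_transform(
    input_size=224,
    is_training=True,
    auto_augment="rand-m15-n2",  # RandAugment: magnitude m=15, l=2 layers
    hflip=0.5,
)
mixup_fn = Mixup(mixup_alpha=0.2, num_classes=1000)

# Dropout 0.1 and stochastic depth 0.1 map to timm's drop_rate / drop_path_rate.
model = timm.create_model(
    "vit_small_patch16_224",
    drop_rate=0.1,
    drop_path_rate=0.1,
    num_classes=1000,
)
```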
Experimental Setup
• Pre-training
▪ Models were pre-trained with Adam, with a batch size of 4096 and a cosine
learning rate schedule with a linear warmup.
▪ To stabilize training, gradients were clipped at global norm 1.
▪ The images are pre-processed by Inception-style cropping and random
horizontal flipping.
▪ ImageNet-1k was trained for 300 epochs, and ImageNet-21k was trained for
30 and 300 epochs. This allows examining the effects of the increased
dataset size while keeping the total pre-training compute roughly constant.
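A minimal PyTorch sketch of this recipe (linear warmup + cosine decay, gradient clipping at global norm 1). The learning rate, warmup length, and the tiny stand-in model are placeholders, not the paper's values; AdamW is used as the common PyTorch stand-in for Adam with weight decay.

```python
# Sketch: warmup + cosine schedule with gradient clipping, on a stand-in model.
import math
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for a ViT
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)

warmup_steps, total_steps = 100, 1_000

def lr_scale(step):
    if step < warmup_steps:                      # linear warmup
        return step / max(1, warmup_steps)
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * t))   # cosine decay to zero

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_scale)

for step in range(total_steps):
    images, labels = torch.randn(32, 10), torch.randint(0, 2, (32,))
    loss = criterion(model(images), labels)
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # global norm 1
    optimizer.step()
    scheduler.step()
```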
Experimental Setup
• Fine-tuning
▪ Models were fine-tuned with SGD with a momentum of 0.9, sweeping over
2–3 learning rates and 1–2 training durations per dataset.
▪ A fixed batch size of 512 was used, gradients were clipped at global norm 1
and a cosine learning rate schedule with linear warmup was also used.
▪ Fine-tuning was done both at the original resolution (224), as well as at a
higher resolution (384).
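Fine-tuning at 384 means a longer token sequence than the position embeddings were trained for; the standard remedy, used by the original ViT code, is 2D interpolation of the learned position embeddings. A minimal sketch, ignoring the class token for brevity:

```python
# Resize learned position embeddings from a 14x14 grid (224/16) to a
# 24x24 grid (384/16) via 2D bicubic interpolation.
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid=14, new_grid=24):
    # pos_embed: (1, old_grid**2, D) -> (1, new_grid**2, D)
    d = pos_embed.shape[-1]
    grid = pos_embed.reshape(1, old_grid, old_grid, d).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_grid, new_grid),
                         mode="bicubic", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, d)

pe = torch.randn(1, 14 * 14, 768)   # ViT-B width, 224/16 = 14 tokens per side
print(resize_pos_embed(pe).shape)   # torch.Size([1, 576, 768])
```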
Findings – Scaling datasets with AugReg and compute
• Best models trained on AugReg ImageNet-1k perform about equal to the
same models pre-trained on the 10x larger plain ImageNet-21k dataset.
Similarly, best models trained on AugReg ImageNet-21k, when compute is
also increased, match or outperform the same models trained on the plain
JFT-300M dataset with 25x more images.
Findings – Scaling datasets with AugReg and compute
• It is possible to match these private results with a publicly available
dataset, and it is imaginable that training longer and with AugReg on
JFT-300M might further increase performance.
• These results cannot hold for arbitrarily small datasets: training a
ResNet50 on only 10% of ImageNet-1k with heavy data augmentation
improves results, but does not recover the accuracy of training on the
full dataset (Table 5 from "Unsupervised Data Augmentation for Consistency Training").
Findings – Transfer is the better option
• For most practical purposes, transferring a pre-trained model is both
more cost-efficient and leads to better results.
• The most striking finding is that, no matter how much training time is
spent, for the tiny Pet37 dataset, it does not seem possible to train
ViT models from scratch to reach accuracy anywhere near that of
transferred models.
Findings – Transfer is the better option
• For the larger Resisc45 dataset, this result still holds, although spending
two orders of magnitude more compute and performing a heavy search
may come close to (but not reach) the accuracy of pre-trained models.
• Notably, this does not account for the exploration cost which is difficult
to quantify.
Findings – More data yields more generic models
• Interestingly, the model pre-trained on ImageNet-21k (30 ep) is
significantly better than the ImageNet-1k (300 ep) one, across all
three VTAB categories.
• As the compute budget keeps growing, we observe consistent
improvements on the ImageNet-21k dataset with a 10x longer schedule.
• Overall, we conclude that more data yields more generic models; this
trend holds across very diverse tasks.
Findings – Prefer augmentation to regularization
• The authors aim to discover general patterns for data augmentation and
regularization that can be used as rules of thumb when applying Vision
Transformers to a new task.
• In the paper's heatmap figure, the colour of a cell encodes its improvement
or deterioration in score when compared to the unregularized, unaugmented setting.
Findings – Prefer augmentation to regularization
• The first observation is that, for the mid-sized ImageNet-1k dataset,
any kind of AugReg helps.
• However, when using the 10x larger ImageNet-21k dataset and
keeping compute fixed, i.e. running for 30 epochs, any kind of AugReg
hurts performance for all but the largest models.
Findings – Prefer augmentation to regularization
• It is only when also increasing the computation budget to 300 epochs
that AugReg helps more models, although even then, it continues
hurting the smaller ones.
• Generally speaking, there are significantly more cases where adding
augmentation helps than where adding regularization helps.
Findings – Prefer augmentation to regularization
• The corresponding figure (omitted here) shows that when pre-training on
ImageNet-21k, regularization hurts almost across the board.
Findings – Choosing which pre-trained model to transfer
• When pre-training ViT models, various regularization and data
augmentation settings result in models with drastically different
performance.
• From the practitioner's point of view, a natural question then
emerges: how does one select a model for further adaptation to an end
application?
• One way is to run the adaptation for all available pre-trained models and
then select the best-performing one, based on the validation score on the
downstream task of interest. This could be quite expensive in practice.
• Alternatively, one can select a single pre-trained model based on the
upstream validation accuracy and then only use this model for
adaptation, which is much cheaper.
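A toy illustration of the two strategies; the checkpoint names and scores below are made up purely for illustration:

```python
# "Cheap" strategy: pick one checkpoint by upstream validation accuracy.
# "Expensive" strategy: fine-tune every checkpoint, pick by downstream score.
checkpoints = {                      # name: (upstream_val_acc, downstream_acc)
    "augreg_medium1": (0.812, 0.934),
    "augreg_light2":  (0.805, 0.938),
    "no_augreg":      (0.780, 0.921),
}

cheap = max(checkpoints, key=lambda k: checkpoints[k][0])
expensive = max(checkpoints, key=lambda k: checkpoints[k][1])
print(cheap, expensive)  # the paper finds the gap between the two is usually small
```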
Findings – Choosing which pre-trained model to transfer
• The corresponding figure shows the performance difference between the cheaper
strategy and the more expensive strategy.
• The results are mixed, but generally show that the cheaper strategy works
about as well as the more expensive strategy in the majority of scenarios.
• Selecting a single pre-trained model based on the upstream score is a
cost-effective practical strategy.
Findings – Choosing which pre-trained model to transfer
• For every architecture and upstream dataset, the best model is selected
by upstream validation accuracy.
• Bold numbers indicate results that are on par with or surpass the
published JFT-300M results without AugReg for the same models.
Findings – Prefer increasing patch-size to shrinking model-size
• The "Tiny" variants perform significantly worse than the similarly fast
larger models with "/32" patch size.
• For a given resolution, the patch size determines the number of tokens
on which self-attention is performed and is thus a contributor to model
capacity that is not reflected by parameter count.
• Parameter count is reflective neither of speed nor of capacity.
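A back-of-the-envelope comparison makes this concrete: at a fixed resolution, patch size sets the token count, and self-attention cost grows quadratically in it, while parameter count barely moves (approximate parameter counts in the comments):

```python
# Token count (and hence quadratic attention cost) as a function of patch size.
def num_tokens(resolution=224, patch=16):
    return (resolution // patch) ** 2

for name, patch in [("ViT-Ti/16 (~5.7M params)", 16),
                    ("ViT-S/32  (~22M params) ", 32)]:
    n = num_tokens(patch=patch)
    print(f"{name}: {n} tokens, {n * n:,} pairwise attention scores")
# ViT-Ti/16: 196 tokens -> ~38k attention scores per head and layer;
# ViT-S/32:   49 tokens -> ~2.4k, despite having ~4x more parameters.
```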
Conclusion
• This paper conducts the first systematic, large-scale study of the
interplay between regularization, data augmentation, model size, and
training data size when pre-training ViTs.
• These experiments yield a number of surprising insights about the
impact of various techniques, and about when augmentation and
regularization are beneficial and when they are not.
• Across a wide range of datasets, even if the downstream data of
interest appears to be only weakly related to the data used for
pre-training, transfer learning remains the best available option.
• Among similarly performing pre-trained models, a model trained with
more data should likely be preferred for transfer learning over one
trained with more data augmentation.
Thank you