Visions of Scaling: An Empirical Study of Training and Regularization Methods for Vision Models
PR-373
NeurIPS 2021
주성훈, VUNO Inc.
2022. 2. 27.
1. Research Background
The performance of a vision model
• The paper points out that new architectures trained with the latest training methodology are often compared against older architectures trained with outdated methodology.
• Three factors jointly determine performance: Architecture, Training methods, Scaling strategy.
Objective
• Provide new perspectives and practical advice on scaling vision architectures.
• Apply modern training and regularization techniques to ResNet, together with longer training (more epochs).
• Re-scale ResNet along depth, width, and resolution: ResNet-RS.
Previous works
• Architecture
  • VGG, ResNet, Inception, ResNeXt
  • Architectures found via automated architecture search [67 (NASNet), 41 (AmoebaNet: 83.9%), 55 (EfficientNet-B7: 84.4%, 2019)]
  • Adapting self-attention to the visual domain: AA-ResNet-152 (79.1%, 2019), ViT-L/16 (87.76±0.03%, 2020), LambdaResNet200 (84.3%, 2021)
Previous works
• Architecture - EfficientNet (compound scaling)
  • Shortcomings this paper points out: few training epochs, and a baseline network (EfficientNet-B0) designed by optimizing FLOPs
  • EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks, ICML 2019 (PR-169)
Previous works
• Training/regularization methodology
  • Innovations in training (e.g. improved learning rate schedules)
  • Regularization methods: dropout, label smoothing, stochastic depth, DropBlock, data augmentation
  • Regularization methods have become especially useful to prevent overfitting when training increasingly large models on limited data (e.g. the 1.2M ImageNet images).
  • (Figure: label smoothing, from https://arxiv.org/pdf/2011.12562.pdf)
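To make one of these regularizers concrete, here is a minimal label smoothing sketch in PyTorch. The smoothing value eps=0.1 matches the setting used later in the deck, but the function itself is our own illustration (recent PyTorch also exposes this via the label_smoothing argument of nn.CrossEntropyLoss).

```python
import torch
import torch.nn.functional as F

def label_smoothing_ce(logits, targets, eps=0.1):
    """Cross-entropy with label smoothing: the one-hot target is mixed
    with a uniform distribution over all classes."""
    log_probs = F.log_softmax(logits, dim=-1)
    # (1 - eps) weight on the true class...
    nll = -log_probs.gather(dim=-1, index=targets.unsqueeze(-1)).squeeze(-1)
    # ...and eps spread uniformly over the K classes.
    smooth = -log_probs.mean(dim=-1)
    return ((1.0 - eps) * nll + eps * smooth).mean()

# Usage: logits from a classifier, integer class targets.
logits = torch.randn(8, 1000)
targets = torch.randint(0, 1000, (8,))
loss = label_smoothing_ce(logits, targets, eps=0.1)
```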
Previous works
• Additional training data: pre-training on large-scale datasets
  • NFNets: 89.2% (Brock, A. et al. High-Performance Large-Scale Image Recognition Without Normalization. 2021)
    • ResNet-based, pre-trained on JFT-300M
    • Replaces batch normalization with Adaptive Gradient Clipping (AGC)
  • ViT: 88.6% (An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR 2021)
    • Transformer, pre-trained on ImageNet-21k (14M images, roughly 10x ImageNet) and JFT-300M
  • 76.4 -> 79.2 with large-scale pre-training (Revisiting Unreasonable Effectiveness of Data in Deep Learning Era, ICCV 2017)
Previous works
• Additional training data: semi-supervised learning
  • Noisy Student (EfficientNet-L2): 88.4% (Self-training with Noisy Student improves ImageNet classification, CVPR 2020)
  • Meta Pseudo Labels (EfficientNet-L2): 90.2% (PR-336)
2. Experimental Results & Methodology
Additive Study of Improvements
• Top-1 accuracy of 82.2% (+3.2%) achieved without any architecture change.
• Baseline ResNet-200: 256x256 input, 90 epochs, stepwise learning rate decay.
• Hyperparameter selection: a held-out validation set comprising 2% of the ImageNet training set (minival-set).
• Increasing training epochs (90 -> 350) is only useful once regularization is applied.
• Top-1 accuracy: ImageNet validation-set, averaged over 2 runs. (Table groups improvements into Training / Regularization / Architecture.)
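A sketch of how such a 2% minival split could be carved out of the ImageNet training set; the deck specifies only the fraction, so the shuffling scheme and function name here are our own.

```python
import numpy as np

def minival_split(num_train: int = 1_281_167, frac: float = 0.02, seed: int = 0):
    """Hold out `frac` of the ImageNet training indices as a minival set
    for hyperparameter selection."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(num_train)
    n_val = int(num_train * frac)
    return idx[n_val:], idx[:n_val]  # train indices, minival indices

train_idx, minival_idx = minival_split()
print(len(minival_idx))  # 25623 images held out
```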
Additive Study of Improvements
• Baseline ResNet-200: 256x256, 90 epochs, stepwise learning rate decay.
Regularization and Data Augmentation
• Dropout: applied after the global average pooling layer.
• Results differ depending on whether decreased weight decay is applied.
• Stochastic depth (G. Huang et al., ECCV 2016).
• RandAugment: random image transformations (e.g. translate, shear, color distortions).
• Momentum optimizer with a cosine learning rate schedule (see the sketch below).
• Top-1 accuracy: ImageNet validation-set, averaged over 2 runs.
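A minimal sketch of that optimizer setup in PyTorch; the learning rate and momentum values are illustrative assumptions, while the weight decay of 4e-5 matches the value reported later in the deck.

```python
import torch

model = torch.nn.Linear(512, 1000)  # stand-in for ResNet-200
epochs = 350

# SGD with momentum; weight decay is decreased when regularization is stacked.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=4e-5)
# Cosine learning rate schedule over the full training run.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    # ... one training epoch over ImageNet here ...
    scheduler.step()  # decay the LR along a cosine curve each epoch
```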
Additive Study of Improvements
• Adding simple architecture changes (SE, ResNet-D) on top of the improved training recipe raises top-1 accuracy from 82.2% (+3.2%) to 83.4% (+4.4%).
• Baseline ResNet-200: 256x256, 90 epochs, stepwise learning rate decay.
Architecture changes
• ResNet-D (PR-201)
• Squeeze-and-Excitation (Hu, J. et al., 2020), sketched below
• Top-1 accuracy: ImageNet validation-set, averaged over 2 runs.
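For reference, a minimal Squeeze-and-Excitation block; the reduction ratio of 4 (i.e. SE ratio 0.25) is a common choice and an assumption here, not a value stated in this deck.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: global-average-pool the feature map
    ('squeeze'), then gate each channel with a learned sigmoid weight
    ('excite')."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))           # squeeze: (B, C)
        w = self.fc(w).view(b, c, 1, 1)  # excite: per-channel gates
        return x * w                     # re-weight the channels

# Usage inside a residual block: out = SEBlock(256)(features)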
Importance of decreasing weight decay when combining regularization methods
(RA: RandAugment, LS: Label Smoothing, DO: Dropout on FC, SD: Stochastic Depth)
• When only RA or LS is applied, the default weight decay (1e-4) does not need to change.
• When more regularization methods are added, performance does not improve unless the weight decay is lowered.
  -> When combining several regularization methods, they must be balanced so that the model is not over-regularized.
• Top-1 accuracy: ImageNet validation-set, averaged over 2 runs.
Extensive search on ImageNet over scaling strategies
• Search over width multipliers in [0.25, 0.5, 1.0, 1.5, 2.0], depths in [26, 50, 101, 200, 300, 350, 400], and resolutions in [128, 160, 224, 320, 448].
• 350 epochs; regularization is increased with model size in an effort to limit overfitting.
• Smaller models follow a power-law trend; at equal FLOPs, accuracy depends on the image resolution -> this motivates slow image resolution scaling.
• RandAugment: magnitude 10 for filter multipliers in [0.25, 0.5] or image resolutions in [64, 160]; 15 for image resolutions in [224, 320]; 20 otherwise.
• Stochastic depth: drop rate of 0.2 for image resolutions 224 and above; not applied for filter multiplier 0.25 (or images smaller than 224).
• All models use a label smoothing of 0.1 and a weight decay of 4e-5. (These rules are collected in the sketch below.)
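The per-configuration rules above can be written down directly; this helper is our own packaging of them (function name and dict keys are assumptions, and the listed resolution sets are approximated with thresholds).

```python
def regularization_config(width_mult: float, resolution: int) -> dict:
    """Regularization settings used in the scaling search, per the rules above."""
    if width_mult in (0.25, 0.5) or resolution <= 160:
        ra_magnitude = 10
    elif resolution <= 320:
        ra_magnitude = 15
    else:
        ra_magnitude = 20
    # Stochastic depth only for large images and non-tiny widths.
    sd_rate = 0.2 if (resolution >= 224 and width_mult > 0.25) else 0.0
    return {
        "randaugment_magnitude": ra_magnitude,
        "stochastic_depth_rate": sd_rate,
        "label_smoothing": 0.1,
        "weight_decay": 4e-5,
    }

print(regularization_config(1.0, 224))  # {'randaugment_magnitude': 15, ...}
```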
Depth Scaling in Regimes Where Overfitting Can Occur
• Depth scaling improves performance in longer-epoch regimes; width scaling helps in short-epoch regimes.
• Width scaling is prone to overfitting: widening the model grows the parameter count more sharply.
• The paper points out that prior work chose width scaling even in ~40-epoch training regimes:
  • BiT (2019) scales the width of ResNet-152 with a 4x filter multiplier
  • NFNet (2021) scales the width with a ~1.5x filter multiplier
• Search space: image resolution [128, 160, 224, 320]; depth [101, 200, 300, 400]; width multiplier [1.0, 1.5, 2.0]
Comparing EfficientNets against ResNet-RS (re-scaled) on a speed-accuracy Pareto curve
• FLOPs do not reflect actual training and inference latency.
• On GPUs/TPUs, memory access cost has a larger effect on latency, and FLOPs fail to capture it.
• EfficientNet's depthwise convolutions operate on larger activations, so it pays a latency penalty on GPUs/TPUs relative to ResNet: more memory consumed -> more memory accesses -> higher latency.
• ResNet-RS-350 has 1.8x more FLOPs, yet 2.7x lower TPU-v3 latency.
• ResNet-RS-350 has 3.8x more parameters, yet uses 2.3x less memory.
Comparing EfficientNets against ResNet-RS (re-scaled) on a speed-accuracy Pareto curve
• The key to ResNet-RS: scale image resolution more slowly than width and depth, reducing the number of activations and, as a result, memory usage.
Improving scaling of EfficientNet
• Improving the performance of EfficientNets (EfficientNet-RS).
• The paper argues that EfficientNet's strategy of jointly increasing model depth, width and resolution at a constant rate is suboptimal.
• The slow image resolution scaling strategy is applied to EfficientNets, with no changes to width or depth (contrast sketched below).
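To make the contrast concrete, a toy comparison of EfficientNet's compound scaling rule with a slowed resolution schedule; the coefficients α=1.2, β=1.1, γ=1.15 come from the EfficientNet paper, while the 0.5 slowdown factor is purely illustrative, not the paper's schedule.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # EfficientNet depth/width/resolution coefficients

def compound_scaling(phi: int):
    """EfficientNet: depth, width and resolution all grow at constant rates."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

def slow_resolution_scaling(phi: int, slowdown: float = 0.5):
    """Same depth/width growth, but resolution is scaled up more slowly
    (the idea behind slow image resolution scaling)."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** (phi * slowdown)

for phi in (1, 3, 6):
    d, w, r = slow_resolution_scaling(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
```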
ResNet-RS performance in a large scale semi-supervised learning setup
• In a semi-supervised learning setup, ResNet-RS reaches high accuracy while offering 4.7x - 5.5x faster inference than EfficientNet.
• ResNet-RS is trained on 1.3M labeled ImageNet images plus 130M pseudo-labeled images (as in Noisy Student); see the sketch below.
• The 130M pseudo-labeled images are generated with the Noisy Student paper's EfficientNet-L2 (88.4%).
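A minimal sketch of the pseudo-labeling step in this setup; the loop structure and names are our own, and the use of hard labels is an assumption rather than a detail stated in the deck.

```python
import torch

@torch.no_grad()
def generate_pseudo_labels(teacher, unlabeled_loader, device="cuda"):
    """Label unlabeled images with a trained teacher model
    (EfficientNet-L2 at 88.4% in the paper); the student ResNet-RS is then
    trained on labeled ImageNet plus these pseudo-labeled images."""
    teacher.eval().to(device)
    for images in unlabeled_loader:
        logits = teacher(images.to(device))
        yield images, logits.argmax(dim=-1)  # hard pseudo-labels (assumption)
```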
Transfer Learning to Downstream Tasks with ResNet-RS
• Supervised learning with the re-scaled recipe outperforms SimCLR/SimCLRv2 (self-supervised learning).
• This contradicts the notion that self-supervised learning learns more universal representations than supervised learning.
• Setup for a fair comparison with SimCLR:
  • ResNet-RS: 400 epochs with cosine learning rate decay, data augmentation (RandAugment), label smoothing, dropout and decreased weight decay, but no stochastic depth or exponential moving average (EMA) of the weights.
  • SimCLR/SimCLRv2: longer training (800 epochs) with cosine learning rate decay, a tailored data augmentation strategy, a tuned temperature parameter in the contrastive loss and a tuned weight decay.
  • Vanilla ResNet architecture pre-trained on ImageNet.
Revised 3D ResNet for Video Classification
• Kinetics-400 video classification task.
• Results mirror image classification: large gains even without architecture changes.
3. Conclusions
• Scaling and training strategies alone can deliver performance gains on par with those obtained from complex architectural changes.
• An empirical study of regularization techniques and training strategies that achieve strong performance (e.g. +3.2% top-1 ImageNet accuracy, +4.0% top-1 Kinetics-400 accuracy) without changing the model architecture.
• A new network scaling strategy:
  • (1) Scale depth when overfitting can occur (scaling width can be preferable otherwise).
  • (2) Scale the image resolution more slowly.
• FLOPs do not reflect actual training and inference latency.
• From the paper: "Our work suggests that the field has myopically overemphasized architectural innovations at the expense of experimental diligence, and we hope it encourages further scrutiny in maintaining consistent methodology for both proposed innovations and baselines alike."
