Rethinking Pre-training and Self-training
Google Research, Brain Team
Yonsei University Severance Hospital CCIDS
Choi Dongmin
Introduction
He et al. Rethinking ImageNet Pre-training. ICCV 2019
• Pre-training

- a dominant paradigm in computer vision (ex. ImageNet pre-training)

- However, ImageNet pre-training does not improve accuracy on COCO 

[Kaiming He, ICCV 2019]

• Self-training

- Steps (ex. use ImageNet to help COCO object detection; see the toy sketch after the steps)

1) Discard the labels on ImageNet

2) Train an object detector on COCO and use it to generate pseudo labels on ImageNet

3) A new model is trained on the combined pseudo-labeled ImageNet and labeled COCO data
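A toy sketch of these three steps, using scikit-learn classifiers on synthetic data as stand-ins for the detector and for COCO/ImageNet (everything below is illustrative, not the paper's actual pipeline):

```python
# Toy self-training sketch: "coco"/"imagenet" are just stand-ins for a small
# labeled target set and a large set whose labels we discard.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Step 1: pretend "imagenet" is unlabeled by discarding its labels
X_coco, y_coco = make_classification(n_samples=200, n_features=20, random_state=0)
X_imagenet, _ = make_classification(n_samples=2000, n_features=20, random_state=1)

# Step 2: train a teacher on the labeled target data, pseudo-label the unlabeled set
teacher = LogisticRegression(max_iter=1000).fit(X_coco, y_coco)
pseudo_labels = teacher.predict(X_imagenet)

# Step 3: train a new model on the combined pseudo-labeled + labeled data
X_all = np.vstack([X_coco, X_imagenet])
y_all = np.concatenate([y_coco, pseudo_labels])
student = LogisticRegression(max_iter=1000).fit(X_all, y_all)
```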
Introduction
• Generality and Flexibility of Self-training: three insights



1) Stronger data augmentation & More labeled data

→ diminish the value of pre-training



2) Unlike pre-training, self-training is always helpful



3) Self-training improves upon pre-training
Methodology
• Methods and Control Factors

1. Data Augmentation

2. Pre-training

3. Self-training
Methodology
• Methods and Control Factors

1. Data Augmentation
- AutoAugment : automatically searches for improved data augmentation policies (separate search phase on a proxy task)

- RandAugment : removes the separate search phase on a proxy task; the stronger of the two
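A minimal sketch of the RandAugment idea (sample N ops at one global magnitude M, with no search phase); the op list and magnitude scaling below are simplified assumptions, not the official implementation:

```python
# Simplified RandAugment-style augmentation: pick n ops at random and apply
# each at a shared magnitude m (0-10 scale). Real RandAugment uses ~14 ops.
import random
from PIL import Image, ImageOps, ImageEnhance

def _rotate(img, m):    return img.rotate(m * 3)                                  # up to ~30 degrees
def _posterize(img, m): return ImageOps.posterize(img, max(1, 8 - int(m * 0.7)))  # fewer bits = stronger
def _contrast(img, m):  return ImageEnhance.Contrast(img).enhance(1 + m * 0.09)
def _solarize(img, m):  return ImageOps.solarize(img, 256 - int(m * 25.6))

OPS = [_rotate, _posterize, _contrast, _solarize]

def rand_augment(img, n=2, m=9):
    """Apply n randomly chosen ops, each at magnitude m."""
    for op in random.choices(OPS, k=n):
        img = op(img, m)
    return img

# Usage on a dummy RGB image
augmented = rand_augment(Image.new("RGB", (224, 224), "gray"), n=2, m=9)
```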
Methodology
• Methods and Control Factors

2. Pre-training (EfficientNet-B7 baseline)
ImageNet++ Init : EfficientNet-B7 + Noisy Student Method
M Tan et al. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. ICML 2019

Qizhe Xie et al. Self-training with Noisy Student improves ImageNet classification. arXiv:1911.04252
- A semi-supervised learning method

- Self-training + Distillation
Methodology
• Methods and Control Factors

3. Self-training (based on the Noisy Student method)
Qizhe Xie et al. Self-training with Noisy Student improves ImageNet classification. arXiv:1911.04252
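As a rough illustration of how human-labeled and pseudo-labeled batches can be combined during self-training, here is a hedged PyTorch sketch of a simple weighted loss; real implementations also need to balance and normalize the two losses, which is omitted here, and alpha and the function name are illustrative:

```python
# Weighted combination of the supervised loss on human labels and the loss
# on teacher-generated pseudo labels (a baseline sketch only).
import torch
import torch.nn.functional as F

def self_training_loss(logits_human, targets_human,
                       logits_pseudo, targets_pseudo, alpha=0.5):
    loss_h = F.cross_entropy(logits_human, targets_human)
    loss_p = F.cross_entropy(logits_pseudo, targets_pseudo)
    return loss_h + alpha * loss_p

# Dummy usage: 4 human-labeled and 8 pseudo-labeled examples, 10 classes
loss = self_training_loss(torch.randn(4, 10), torch.randint(0, 10, (4,)),
                          torch.randn(8, 10), torch.randint(0, 10, (8,)))
```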
Experiments
1. The effects of augmentation and labeled dataset size on pre-training
- Task : COCO object detection

- Network : RetinaNet with the EfficientNet-B7 backbone



- Left : results under various ImageNet pre-trained checkpoints and data augmentation strengths
TY Lin et al. Focal Loss for Dense Object Detection. ICCV 2017
Finding 1. Pre-training hurts performance when stronger data augmentation is used
Experiments
1. The effects of augmentation and labeled dataset size on pre-training
- Task : COCO object detection

- Network : RetinaNet with the EfficientNet-B7 backbone



- Right : results under various COCO dataset sizes and ImageNet pre-trained checkpoints
TY Lin et al. Focal Loss for Dense Object Detection. ICCV 2017
Finding 2. More labeled data diminishes the value of pre-training
Experiments
2. The effects of augmentation and labeled dataset size on self-training
- Task : COCO object detection (self-training only treats ImageNet as unlabeled data)

- Network : RetinaNet with the EfficientNet-B7 backbone



Finding 1. Self-training helps in high data/strong augmentation regimes,

even when pre-training hurts
Experiments
2. The effects of augmentation and labeled dataset size on self-training
- Task : COCO object detection (self-training only treats ImageNet as unlabeled data)

- Network : RetinaNet with the EfficientNet-B7 backbone



Finding 2. Self-training works across dataset sizes and

is additive to pre-training.
Experiments
3. Self-supervised pre-training also hurts when self-training helps in high data/strong augmentation regimes
- Task : COCO object detection

- Network : RetinaNet with the ResNet-50 backbone

- All models use Augment-S4

T Chen et al. A Simple Framework for Contrastive Learning of Visual Representations. arXiv:2002.05709
https://amitness.com/2020/03/illustrated-simclr/
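For reference, the self-supervised checkpoint in this comparison comes from SimCLR (cited above), which pre-trains with a contrastive NT-Xent objective; below is a hedged, simplified PyTorch sketch of that loss, with batch size, temperature, and embedding dimension chosen arbitrarily:

```python
# Simplified NT-Xent (SimCLR-style) contrastive loss between two augmented
# views of the same batch; z1[i] and z2[i] embed the same image.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    n = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)    # (2N, D), unit norm
    sim = z @ z.t() / temperature                          # cosine similarities
    sim = sim.masked_fill(torch.eye(2 * n, dtype=torch.bool), float("-inf"))  # drop self-pairs
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])  # positive index per row
    return F.cross_entropy(sim, targets)

loss = nt_xent(torch.randn(8, 128), torch.randn(8, 128))
```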
Experiments
4. Exploring the limits of self-training and pre-training
- Task : COCO object detection

- Network : SpineNet (closer to SOTA)

- Self-training dataset : Open Images Dataset
X Du et al. SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization. CVPR 2020
SpineNet with Self-training

achieves the best performance
Experiments
4. Exploring the limits of self-training and pre-training
- Task : PASCAL VOC Semantic Segmentation

- Network : NAS-FPN (EfficientNet backbone)

- Pre-training + Self-training + Augment-S4

- Pre-training dataset : ImageNet

- Self-training dataset : aug set of PASCAL
G Ghiasi et al. NAS-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection. CVPR 2019
Improves SOTA by +1.5% mIoU with far fewer human labels
Experiments
4. Exploring the limits of self-training and pre-training
- Task : PASCAL VOC Semantic Segmentation

- Network : NAS-FPN (EfficientNet backbone)

- Pre-training + Self-training + Augment-S4

- Pre-training dataset : ImageNet

- Self-training dataset : aug set of PASCAL
Pre-training with a good checkpoint is crucial due to PASCAL's small dataset size (see Appendix C)
Discussion
1. Rethinking pre-training and universal feature representations
- Requirements of universal feature representations that can solve many tasks



- Weak performance of pre-training

: Pre-training is not aware of the task of interest and can fail to adapt

(ex. good features for ImageNet may discard positional information which is needed for COCO)



- Self-training is more adaptive to the task of interest (generally more beneficial)
Discussion
2. The benefit of joint-training
- Joint-training : jointly train ImageNet classification with COCO object detection



- Random Initialization + Self-training + Joint Training : +4.4 AP improvement



- Joint Training (+2.9 AP) and Pre-training (+2.6 AP) give similar improvements, but Joint Training uses only 19 epochs of ImageNet training while Pre-training requires 350 epochs.
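A hedged sketch of this joint-training setup: one shared backbone feeds both an ImageNet classification head and a placeholder standing in for the COCO detection head, and the two losses are summed at each step (the architecture and names below are simplified illustrations, not the paper's model):

```python
# Minimal joint-training skeleton: shared backbone + two task heads, losses summed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointModel(nn.Module):
    def __init__(self, num_cls=1000, num_det=80):
        super().__init__()
        self.backbone = nn.Sequential(                 # stand-in for EfficientNet/ResNet
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.cls_head = nn.Linear(32, num_cls)         # ImageNet classification head
        self.det_head = nn.Linear(32, num_det)         # placeholder for a detection head

    def forward(self, x):
        feat = self.backbone(x)
        return self.cls_head(feat), self.det_head(feat)

model = JointModel()
opt = torch.optim.SGD(model.parameters(), lr=0.01)

# One joint step on dummy batches from the two tasks
x_cls, y_cls = torch.randn(4, 3, 224, 224), torch.randint(0, 1000, (4,))
x_det, y_det = torch.randn(4, 3, 224, 224), torch.randint(0, 80, (4,))
cls_logits, _ = model(x_cls)
_, det_logits = model(x_det)
loss = F.cross_entropy(cls_logits, y_cls) + F.cross_entropy(det_logits, y_det)
loss.backward()
opt.step()
```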
Discussion
3. The importance of the task alignment
- aug : additional PASCAL VOC training data with much noisier labels

- Training with the aug dataset hurts performance when strong augmentation is used

- Self-training (pseudo-labeling the aug dataset) improves accuracy
Noisy (PASCAL) or un-targeted (ImageNet)
labeling is worse than targeted pseudo labeling
Shao et al : Pre-training on Open Images hurts performance on COCO, despite
both of them being annotated with bounding boxes
Shao et al. Objects365: A Large-scale, High-quality Dataset for Object Detection. ICCV 2019
Not only the task but also the annotations need to be the same for pre-training to be beneficial (while self-training is very general)
Discussion
4. Limitations
- Self-training requires more compute than pre-training



- Good pre-trained models are also needed for low-data applications

(ex. PASCAL segmentation)
5. The scalability, generality and flexibility of self-training
- Scalability : works well as we have more labeled data

- Generality : works well not only when pre-training fails but also when pre-training succeeds

- Flexibility : works well in every setup (low or high data / weak or strong aug)

and with different architectures, data sources, and tasks
Most methods fail when we have more labeled data, more compute, or better supervised training recipes,

but that does not seem to be the case for self-training
Thank you
