Review : Rethinking Pre-training and Self-training
1. Rethinking Pre-training and Self-training
Google Research, Brain Team
Yonsei University Severance Hospital CCIDS
Choi Dongmin
2. Introduction
He et al. Rethinking ImageNet Pre-training. ICCV 2019
• Pre-training
- a dominant paradigm in computer vision (e.g., ImageNet pre-training)
- However, ImageNet pre-training does not necessarily improve final accuracy on COCO
[He et al., ICCV 2019]
• Self-training
- Steps (e.g., using ImageNet to help COCO object detection)
1) Discard the labels on ImageNet
2) Train an object detection model on labeled COCO data, and use it to generate pseudo labels on ImageNet
3) Train a new model on the combined pseudo-labeled ImageNet and labeled COCO data
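The three steps above can be sketched on toy data. This is only an illustration, not the paper's pipeline: the "model" is a hypothetical nearest-centroid classifier on 2-D points rather than an object detector, and the names `fit_centroids`, `predict`, and `self_train` are made up for this example.

```python
import numpy as np

def fit_centroids(x, y, num_classes):
    """Toy 'model': one mean vector per class (assumes every class has samples)."""
    return np.stack([x[y == c].mean(axis=0) for c in range(num_classes)])

def predict(centroids, x):
    """Assign each sample to its nearest class centroid."""
    dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def self_train(x_labeled, y_labeled, x_unlabeled, num_classes):
    # Step 1 is implicit: x_unlabeled carries no labels.
    # Step 2: train a teacher on the labeled target data only,
    # then let it generate pseudo labels for the unlabeled data.
    teacher = fit_centroids(x_labeled, y_labeled, num_classes)
    pseudo = predict(teacher, x_unlabeled)
    # Step 3: train a new (student) model on labeled + pseudo-labeled data.
    x_all = np.concatenate([x_labeled, x_unlabeled])
    y_all = np.concatenate([y_labeled, pseudo])
    student = fit_centroids(x_all, y_all, num_classes)
    return teacher, student, pseudo

# Toy data: two well-separated clusters, few labels, many unlabeled points.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
x_lab = np.concatenate([centers[0] + rng.normal(size=(5, 2)),
                        centers[1] + rng.normal(size=(5, 2))])
y_lab = np.array([0] * 5 + [1] * 5)
x_unl = np.concatenate([centers[0] + rng.normal(size=(100, 2)),
                        centers[1] + rng.normal(size=(100, 2))])
y_unl_true = np.array([0] * 100 + [1] * 100)  # kept only to check the pseudo labels

teacher, student, pseudo = self_train(x_lab, y_lab, x_unl, num_classes=2)
```

In the toy setup the student sees 20x more points than the teacher, which is the role the pseudo-labeled ImageNet images play for the COCO detector.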
4. Introduction
• Generality and Flexibility of Self-training with three insights
1) Stronger data augmentation and more labeled data
→ diminish the value of pre-training
2) Unlike pre-training, self-training is always helpful with stronger data augmentation, in both low-data and high-data regimes
3) When pre-training is helpful, self-training improves upon it
7. Methodology
• Methods and Control Factors
1. Data Augmentation
- AutoAugment : automatically searches for improved data augmentation policies
- RandAugment : removes the separate search phase on a proxy task
→ stronger augmentation
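The RandAugment idea mentioned above (no learned policy, just a number of randomly chosen operations applied at one shared magnitude) can be sketched as follows. This is a minimal sketch under assumptions: the three operations are stand-ins for the much larger real operation set, the magnitude is treated as a float in [0, 1], and the names `OPS` and `rand_augment` are hypothetical.

```python
import numpy as np

# Illustrative ops on a float image in [0, 1] with shape (H, W, C).
# These are stand-ins; the real RandAugment op set is larger.
def flip_lr(img, magnitude):
    return img[:, ::-1, :]  # magnitude is unused for a flip

def brightness(img, magnitude):
    return np.clip(img * (1.0 + magnitude), 0.0, 1.0)

def shift_x(img, magnitude):
    # Circular horizontal shift by a fraction of the image width.
    return np.roll(img, int(round(magnitude * img.shape[1])), axis=1)

OPS = [flip_lr, brightness, shift_x]

def rand_augment(img, num_ops, magnitude, rng):
    """Apply `num_ops` uniformly sampled ops, all at the same magnitude."""
    for idx in rng.integers(0, len(OPS), size=num_ops):
        img = OPS[idx](img, magnitude)
    return img

rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))
augmented = rand_augment(image, num_ops=2, magnitude=0.3, rng=rng)
```

With only two knobs (`num_ops` and `magnitude`), augmentation strength can be raised by a simple grid search instead of a separate policy search.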
9. Methodology
• Methods and Control Factors
2. Pre-training (EfficientNet-B7 baseline)
ImageNet++ Init : EfficientNet-B7 + Noisy Student method
- Noisy Student : a semi-supervised learning method (self-training + distillation)
M Tan et al. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. ICML 2019
Qizhe Xie et al. Self-training with Noisy Student improves ImageNet classification. arXiv:1911.04252
10. Methodology
• Methods and Control Factors
3. Self-training (based on the Noisy Student method)
Qizhe Xie et al. Self-training with Noisy Student improves ImageNet classification. arXiv:1911.04252
11. Experiments
1. The effects of augmentation and labeled dataset size on pre-training
- Task : COCO object detection
- Network : RetinaNet with the EfficientNet-B7 backbone
- Left : under various ImageNet pre-trained checkpoints and data augmentation strengths
TY Lin et al. Focal Loss for Dense Object Detection. ICCV 2017
Finding 1. Pre-training hurts performance when stronger data augmentation is used
12. Experiments
1. The effects of augmentation and labeled dataset size on pre-training
- Task : COCO object detection
- Network : RetinaNet with the EfficientNet-B7 backbone
- Right : under various COCO dataset sizes and ImageNet pre-trained checkpoints
TY Lin et al. Focal Loss for Dense Object Detection. ICCV 2017
Finding 2. More labeled data diminishes the value of pre-training
13. Experiments
2. The effects of augmentation and labeled dataset size on self-training
- Task : COCO object detection (self-training only treats ImageNet as unlabeled data)
- Network : RetinaNet with the EfficientNet-B7 backbone
Finding 1. Self-training helps in high data/strong augmentation regimes,
even when pre-training hurts
14. Experiments
2. The effects of augmentation and labeled dataset size on self-training
- Task : COCO object detection (self-training only treats ImageNet as unlabeled data)
- Network : RetinaNet with the EfficientNet-B7 backbone
Finding 2. Self-training works across dataset sizes and
is additive to pre-training.
15. Experiments
3. Self-supervised pre-training also hurts when self-training helps in high
data/strong augmentation regimes
- Task : COCO object detection
- Network : RetinaNet with the ResNet-50 backbone
- All models use Augment-S4
T Chen et al. A Simple Framework for Contrastive Learning of Visual Representations. arXiv:2002.05709
https://amitness.com/2020/03/illustrated-simclr/
16. Experiments
4. Exploring the limits of self-training and pre-training
- Task : COCO object detection
- Network : SpineNet (closer to SOTA)
- Self-training dataset : Open Images Dataset
X Du et al. SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization. CVPR 2020
SpineNet with Self-training
achieves the best performance
17. Experiments
4. Exploring the limits of self-training and pre-training
- Task : PASCAL VOC Semantic Segmentation
- Network : NAS-FPN (EfficientNet backbone)
- Pre-training + Self-training + Augment-S4
- Pre-training dataset : ImageNet
- Self-training dataset : aug set of PASCAL
G Ghiasi et al. NAS-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection. CVPR 2019
Improves on the previous SOTA by 1.5% mIoU
with far fewer human labels
Pre-training with a good checkpoint is crucial
due to PASCAL’s small dataset size
< Appendix C >
19. Discussion
1. Rethinking pre-training and universal feature representations
- Goal : universal feature representations that can solve many tasks
- Pre-training performs weakly in this respect
: Pre-training is not aware of the task of interest and can fail to adapt
(e.g., good features for ImageNet may discard positional information that is needed for COCO)
- Self-training is more adaptive to the task of interest (generally more beneficial)
21. Discussion
2. The benefit of joint-training
- Joint-training : jointly train ImageNet classification with COCO object detection
- Random initialization + self-training + joint-training : +4.4 AP improvement
- Joint-training (+2.9 AP) and pre-training (+2.6 AP) give similar improvements,
but joint-training reaches this in 19 epochs, while pre-training requires 350 epochs.
22. Discussion
3. The importance of the task alignment
- aug : additional PASCAL VOC dataset with much noisier labels
- Training with the aug dataset hurts performance when strong augmentation is used
- Self-training (pseudo-label on aug dataset) improves accuracy
Noisy (PASCAL) or un-targeted (ImageNet)
labeling is worse than targeted pseudo labeling
Shao et al : Pre-training on Open Images hurts performance on COCO, despite
both of them being annotated with bounding boxes
Shao et al. Objects365: A Large-scale, High-quality Dataset for Object Detection. ICCV 2019
Not only the task but also the annotations need to be the same for
pre-training to be beneficial (whereas self-training is very general)
24. Discussion
4. Limitations
- Self-training requires more compute than pre-training
- Good pre-trained models are still needed for low-data applications
(e.g., PASCAL segmentation)
5. The scalability, generality and flexibility of self-training
- Scalability : continues to work well as more labeled data becomes available
- Generality : works well both when pre-training fails and when it succeeds
- Flexibility : works well in every setup (low or high data / weak or strong aug)
and with different architectures, data sources, and tasks
Most methods fail when we have more labeled data,
more compute, or better supervised training recipes,
but that does not seem to apply to self-training