Rethinking Pre-training and Self-training
Google Research, Brain Team
Yonsei University Severance Hospital CCIDS
Choi Dongmin
Introduction
He et al. Rethinking ImageNet Pre-training. ICCV 2019
• Pre-training

- a dominant paradigm in computer vision (ex. ImageNet pre-training)

- However, ImageNet pre-training does not improve accuracy on COCO 

[Kaiming He, ICCV 2019]

• Self-training

- Steps (ex. use ImageNet to help COCO object detection; see the toy sketch after the steps)

1) Discard the labels on ImageNet

2) Train an object detector on COCO and use it to generate pseudo labels on ImageNet

3) A new model is trained on the combined pseudo-labeled ImageNet and labeled COCO data
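A toy sketch of these three steps, using scikit-learn classifiers on synthetic data as stand-ins for the detector and for COCO/ImageNet (everything below is illustrative, not the paper's actual pipeline):

```python
# Toy self-training sketch: "coco"/"imagenet" are just stand-ins for a small
# labeled target set and a large set whose labels we discard.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Step 1: pretend "imagenet" is unlabeled by discarding its labels
X_coco, y_coco = make_classification(n_samples=200, n_features=20, random_state=0)
X_imagenet, _ = make_classification(n_samples=2000, n_features=20, random_state=1)

# Step 2: train a teacher on the labeled target data, pseudo-label the unlabeled set
teacher = LogisticRegression(max_iter=1000).fit(X_coco, y_coco)
pseudo_labels = teacher.predict(X_imagenet)

# Step 3: train a new model on the combined pseudo-labeled + labeled data
X_all = np.vstack([X_coco, X_imagenet])
y_all = np.concatenate([y_coco, pseudo_labels])
student = LogisticRegression(max_iter=1000).fit(X_all, y_all)
```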
Introduction
• Generality and Flexibility of Self-training: three insights



1) Stronger data augmentation & More labeled data

→ diminish the value of pre-training



2) Unlike pre-training, self-training is always helpful



3) Self-training improves upon pre-training
Methodology
• Methods and Control Factors

1. Data Augmentation

2. Pre-training

3. Self-training
Methodology
• Methods and Control Factors

1. Data Augmentation
- AutoAugment : automatically searches for improved data augmentation policies (separate search phase on a proxy task)

- RandAugment : removes the separate search phase on a proxy task; the stronger of the two
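A minimal sketch of the RandAugment idea (sample N ops at one global magnitude M, with no search phase); the op list and magnitude scaling below are simplified assumptions, not the official implementation:

```python
# Simplified RandAugment-style augmentation: pick n ops at random and apply
# each at a shared magnitude m (0-10 scale). Real RandAugment uses ~14 ops.
import random
from PIL import Image, ImageOps, ImageEnhance

def _rotate(img, m):    return img.rotate(m * 3)                                  # up to ~30 degrees
def _posterize(img, m): return ImageOps.posterize(img, max(1, 8 - int(m * 0.7)))  # fewer bits = stronger
def _contrast(img, m):  return ImageEnhance.Contrast(img).enhance(1 + m * 0.09)
def _solarize(img, m):  return ImageOps.solarize(img, 256 - int(m * 25.6))

OPS = [_rotate, _posterize, _contrast, _solarize]

def rand_augment(img, n=2, m=9):
    """Apply n randomly chosen ops, each at magnitude m."""
    for op in random.choices(OPS, k=n):
        img = op(img, m)
    return img

# Usage on a dummy RGB image
augmented = rand_augment(Image.new("RGB", (224, 224), "gray"), n=2, m=9)
```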
Methodology
• Methods and Control Factors

2. Pre-training (EfficientNet-B7 baseline)
ImageNet++ Init : EfficientNet-B7 + Noisy Student Method
M Tan et al. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. ICML 2019

Qizhe Xie et al. Self-training with Noisy Student improves ImageNet classification. arXiv:1911.04252
- A semi-supervised learning method

- Self-training + Distillation
Methodology
• Methods and Control Factors

3. Self-training (based on the Noisy Student method)
Qizhe Xie et al. Self-training with Noisy Student improves ImageNet classification. arXiv:1911.04252
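As a rough illustration of how human-labeled and pseudo-labeled batches can be combined during self-training, here is a hedged PyTorch sketch of a simple weighted loss; real implementations also need to balance and normalize the two losses, which is omitted here, and alpha and the function name are illustrative:

```python
# Weighted combination of the supervised loss on human labels and the loss
# on teacher-generated pseudo labels (a baseline sketch only).
import torch
import torch.nn.functional as F

def self_training_loss(logits_human, targets_human,
                       logits_pseudo, targets_pseudo, alpha=0.5):
    loss_h = F.cross_entropy(logits_human, targets_human)
    loss_p = F.cross_entropy(logits_pseudo, targets_pseudo)
    return loss_h + alpha * loss_p

# Dummy usage: 4 human-labeled and 8 pseudo-labeled examples, 10 classes
loss = self_training_loss(torch.randn(4, 10), torch.randint(0, 10, (4,)),
                          torch.randn(8, 10), torch.randint(0, 10, (8,)))
```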
Experiments
1. The effects of augmentation and labeled dataset size on pre-training
- Task : COCO object detection

- Network : RetinaNet with the EfficientNet-B7 backbone



- Left : results under various ImageNet pre-trained checkpoints and data augmentation strengths
TY Lin et al. Focal Loss for Dense Object Detection. ICCV 2017
Finding 1. Pre-training hurts performance when stronger data augmentation is used
Experiments
1. The effects of augmentation and labeled dataset size on pre-training
- Task : COCO object detection

- Network : RetinaNet with the EfficientNet-B7 backbone



- Right : results under various COCO dataset sizes and ImageNet pre-trained checkpoints
TY Lin et al. Focal Loss for Dense Object Detection. ICCV 2017
Finding 2. More labeled data diminishes the value of pre-training
Experiments
2. The effects of augmentation and labeled dataset size on self-training
- Task : COCO object detection (self-training only treats ImageNet as unlabeled data)

- Network : RetinaNet with the EfficientNet-B7 backbone



Finding 1. Self-training helps in high data/strong augmentation regimes,

even when pre-training hurts
Experiments
2. The effects of augmentation and labeled dataset size on self-training
- Task : COCO object detection (self-training only treats ImageNet as unlabeled data)

- Network : RetinaNet with the EfficientNet-B7 backbone



Finding 2. Self-training works across dataset sizes and

is additive to pre-training.
Experiments
3. Self-supervised pre-training also hurts when self-training helps in high data/strong augmentation regimes
- Task : COCO object detection

- Network : RetinaNet with the ResNet-50 backbone

- All models use Augment-S4

T Chen et al. A Simple Framework for Contrastive Learning of Visual Representations. arXiv:2002.05709
https://amitness.com/2020/03/illustrated-simclr/
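For reference, the self-supervised checkpoint in this comparison comes from SimCLR (cited above), which pre-trains with a contrastive NT-Xent objective; below is a hedged, simplified PyTorch sketch of that loss, with batch size, temperature, and embedding dimension chosen arbitrarily:

```python
# Simplified NT-Xent (SimCLR-style) contrastive loss between two augmented
# views of the same batch; z1[i] and z2[i] embed the same image.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    n = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)    # (2N, D), unit norm
    sim = z @ z.t() / temperature                          # cosine similarities
    sim = sim.masked_fill(torch.eye(2 * n, dtype=torch.bool), float("-inf"))  # drop self-pairs
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])  # positive index per row
    return F.cross_entropy(sim, targets)

loss = nt_xent(torch.randn(8, 128), torch.randn(8, 128))
```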
Experiments
4. Exploring the limits of self-training and pre-training
- Task : COCO object detection

- Network : SpineNet (closer to SOTA)

- Self-training dataset : Open Images Dataset
X Du et al. SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization. CVPR 2020
SpineNet with Self-training

achieves the best performance
Experiments
4. Exploring the limits of self-training and pre-training
- Task : PASCAL VOC Semantic Segmentation

- Network : NAS-FPN (EfficientNet backbone)

- Pre-training + Self-training + Augment-S4

- Pre-training dataset : ImageNet

- Self-training dataset : aug set of PASCAL
G Ghiasi et al. NAS-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection. CVPR 2019
Improves SOTA by +1.5% mIoU with far fewer human labels
Experiments
4. Exploring the limits of self-training and pre-training
- Task : PASCAL VOC Semantic Segmentation

- Network : NAS-FPN (EfficientNet backbone)

- Pre-training + Self-training + Augment-S4

- Pre-training dataset : ImageNet

- Self-training dataset : aug set of PASCAL
Pre-training with a good checkpoint is crucial due to PASCAL's small dataset size (see Appendix C)
Discussion
1. Rethinking pre-training and universal feature representations
- Requirements of universal feature representations that can solve many tasks



- Weak performance of pre-training

: Pre-training is not aware of the task of interest and can fail to adapt

(ex. good features for ImageNet may discard positional information which is needed for COCO)



- Self-training is more adaptive to the task of interest (generally more beneficial)
Discussion
2. The benefit of joint-training
- Joint-training : jointly train ImageNet classification with COCO object detection



- Random Initialization + Self-training + Joint Training : +4.4 AP improvement



- Joint Training (+2.9 AP) and Pre-training (+2.6 AP) give similar improvements, but Joint Training uses only 19 epochs of ImageNet training while Pre-training requires 350 epochs.
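A hedged sketch of this joint-training setup: one shared backbone feeds both an ImageNet classification head and a placeholder standing in for the COCO detection head, and the two losses are summed at each step (the architecture and names below are simplified illustrations, not the paper's model):

```python
# Minimal joint-training skeleton: shared backbone + two task heads, losses summed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointModel(nn.Module):
    def __init__(self, num_cls=1000, num_det=80):
        super().__init__()
        self.backbone = nn.Sequential(                 # stand-in for EfficientNet/ResNet
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.cls_head = nn.Linear(32, num_cls)         # ImageNet classification head
        self.det_head = nn.Linear(32, num_det)         # placeholder for a detection head

    def forward(self, x):
        feat = self.backbone(x)
        return self.cls_head(feat), self.det_head(feat)

model = JointModel()
opt = torch.optim.SGD(model.parameters(), lr=0.01)

# One joint step on dummy batches from the two tasks
x_cls, y_cls = torch.randn(4, 3, 224, 224), torch.randint(0, 1000, (4,))
x_det, y_det = torch.randn(4, 3, 224, 224), torch.randint(0, 80, (4,))
cls_logits, _ = model(x_cls)
_, det_logits = model(x_det)
loss = F.cross_entropy(cls_logits, y_cls) + F.cross_entropy(det_logits, y_det)
loss.backward()
opt.step()
```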
Discussion
3. The importance of the task alignment
- aug : additional PASCAL VOC training data with much noisier labels

- Training with the aug dataset hurts performance when strong augmentation is used

- Self-training (pseudo-labeling the aug dataset) improves accuracy
Noisy (PASCAL) or un-targeted (ImageNet)
labeling is worse than targeted pseudo labeling
Shao et al : Pre-training on Open Images hurts performance on COCO, despite
both of them being annotated with bounding boxes
Shao et al. Objects365: A Large-scale, High-quality Dataset for Object Detection. ICCV 2019
Not only the task but also the annotations need to be the same for pre-training to be beneficial (while self-training is very general)
Discussion
4. Limitations
- Self-training requires more compute than pre-training



- Good pre-trained models are also needed for low-data applications

(ex. PASCAL segmentation)
5. The scalability, generality and flexibility of self-training
- Scalability : works well as we have more labeled data

- Generality : works well not only when pre-training fails but also when pre-training succeeds

- Flexibility : works well in every setup (low or high data / weak or strong aug)

and with different architectures, data sources, and tasks
Most methods fail when we have more labeled data, more compute, or better supervised training recipes,

but that does not seem to be the case for self-training
Thank you
