
GNorm and Rethinking Pre-training (Ruijie)



Ruijie Quan



  1. GROUP NORMALIZATION & RETHINKING IMAGENET PRE-TRAINING (Ruijie Quan, 2018/11/25)
  2. Outline: I. GROUP NORMALIZATION (methodology, experiments); II. RETHINKING IMAGENET PRE-TRAINING (methodology, experiments)
  3. I. GROUP NORMALIZATION — BN’s error increases rapidly when the batch size becomes smaller, caused by inaccurate batch-statistics estimation. (Figure: ImageNet classification error vs. batch size.)
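Why small batches hurt BN can be illustrated with a toy NumPy experiment (not from the slides): model a channel's activations as i.i.d. standard normal and measure how noisy the per-batch mean estimate becomes as the batch shrinks.

```python
import numpy as np

def batch_mean_noise(batch_size, n_trials=2000, seed=0):
    """Std. deviation of the per-batch mean estimate for one channel.

    Toy model: activations are i.i.d. standard normal, so the error of
    the batch mean shrinks roughly as 1/sqrt(batch_size). BN relies on
    these per-batch statistics, hence its degradation at small batches.
    """
    rng = np.random.default_rng(seed)
    batches = rng.standard_normal((n_trials, batch_size))
    return batches.mean(axis=1).std()
```

With a batch of 2 the mean estimate is about four times noisier than with a batch of 32, mirroring the error curve on the slide.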
  4. A general formulation of feature normalization:
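The formulation on this slide (reconstructed here from the Group Normalization paper, since the equation was an image) normalizes each feature by a mean and variance computed over a set S_i that differs per method:

```latex
\hat{x}_i = \frac{1}{\sigma_i}\,(x_i - \mu_i), \qquad
\mu_i = \frac{1}{m}\sum_{k \in S_i} x_k, \qquad
\sigma_i = \sqrt{\frac{1}{m}\sum_{k \in S_i} (x_k - \mu_i)^2 + \epsilon}
\]
followed by a learned per-channel affine transform:
\[
y_i = \gamma\,\hat{x}_i + \beta
```

BN, LN, IN, and GN differ only in how S_i is defined; for GN, S_i spans the pixels of the same sample whose channels fall in the same group of C/G channels.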
  5. GROUP NORMALIZATION (figure slide)
  6. GROUP NORMALIZATION — each group, of shape (C//G, H, W), is normalized together.
  7. IMPLEMENTATION — one only needs to specify how the mean and variance (“moments”) are computed, along the appropriate axes as defined by the normalization method.
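The paper gives a few-line TensorFlow implementation; a NumPy sketch of the same idea (reshape into groups, take moments over each group's channels and spatial positions, then apply the per-channel affine) might look like:

```python
import numpy as np

def group_norm(x, G, gamma, beta, eps=1e-5):
    """Group Normalization over an NCHW tensor (NumPy sketch)."""
    N, C, H, W = x.shape
    # split the channel axis into G groups of C//G channels
    x = x.reshape(N, G, C // G, H, W)
    # moments over each group: its channels and all spatial positions
    mean = x.mean(axis=(2, 3, 4), keepdims=True)
    var = x.var(axis=(2, 3, 4), keepdims=True)
    x = (x - mean) / np.sqrt(var + eps)
    x = x.reshape(N, C, H, W)
    # learned per-channel scale and shift
    return x * gamma.reshape(1, C, 1, 1) + beta.reshape(1, C, 1, 1)
```

Changing only the moment axes recovers the other methods: (0, 2, 3) per channel gives BN, all of (1, 2, 3) per sample gives LN, and (2, 3) per sample-channel gives IN.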
  8. EXPERIMENTS: IMAGE CLASSIFICATION IN IMAGENET — comparison of error curves with a batch size of 32 images/GPU (model: ResNet-50).
  9. EXPERIMENTS: IMAGE CLASSIFICATION IN IMAGENET (results table)
  10. Evolution of feature distributions of conv5_3’s output (before normalization and ReLU) from VGG-16, shown as the {1, 20, 80, 99} percentiles of responses. The table on the right shows ImageNet validation error (%); models are trained with 32 images/GPU. VGG models: for VGG-16, GN is better than BN by 0.4%, which possibly implies that VGG-16 benefits less from BN’s regularization effect.
  11. GN performs reasonably well for all values of the group number G we studied. Alternatively, one can fix the number of channels per group; because layers can have different channel counts, the group number G then changes across layers. Deeper models (ResNet-101): at batch size 32, BN baseline 22.0% error vs. GN 22.4%; at batch size 2, BN baseline 31.9% vs. GN 23.0%.
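The channels-per-group setting can be made concrete with a small sketch (the layer widths below are the standard ResNet stage widths, used here only for illustration): with the per-group channel count held fixed, G grows with each layer's width.

```python
def groups_for_layers(channel_counts, channels_per_group=16):
    """Group number G per layer when channels-per-group is fixed.

    Unlike fixing G globally, this lets wider layers use more groups.
    """
    return {c: c // channels_per_group for c in channel_counts}

# e.g. widths of the four ResNet stages (illustrative)
groups_for_layers([64, 128, 256, 512])
```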
  12. OBJECT DETECTION AND SEGMENTATION IN COCO — GN is not fully trained with the default schedule, so we also tried increasing the iterations from 180k to 270k (BN* does not benefit from longer training).
  13. VIDEO CLASSIFICATION IN KINETICS — error curves in Kinetics with an input length of 32 frames: ResNet-50 I3D validation error of BN (left) and GN (right) using batch sizes of 8 and 4 clips/GPU.
  14. Video classification results in Kinetics: ResNet-50 I3D baseline top-1 / top-5 accuracy (%). Also: detection and segmentation results trained from scratch in COCO using Mask R-CNN and FPN; here BN is synced across GPUs and is not frozen.
  15. II. RETHINKING IMAGENET PRE-TRAINING
  16. RETHINKING IMAGENET PRE-TRAINING — we get competitive results on object detection and instance segmentation on the COCO dataset using standard models trained from random initialization, NO worse than their ImageNet pre-training counterparts. The ONLY change is increasing the number of training iterations so the randomly initialized models can converge. This holds EVEN WHEN (i) using only 10% of the training data, (ii) for deeper and wider models, and (iii) for multiple tasks and metrics.
  17. We train Mask R-CNN with a ResNet-50 FPN backbone and GroupNorm on the COCO train2017 set and evaluate bounding-box AP on val2017. Observations: (i) ImageNet pre-training speeds up convergence; (ii) ImageNet pre-training does not automatically give better regularization; (iii) ImageNet pre-training shows no benefit when the target tasks/metrics are more sensitive to spatially well-localized predictions.
  18. METHODOLOGY — 1. Normalization: (i) Group Normalization (GN); (ii) Synchronized Batch Normalization (SyncBN). Small batch sizes severely degrade the accuracy of BN. This issue can be circumvented if pre-training is used, because fine-tuning can adopt the pre-training batch statistics as fixed parameters; however, freezing BN is invalid when training from scratch. 2. Convergence: trained for longer than typical fine-tuning...
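Why frozen BN needs pre-training can be seen from what "freezing" means: with the running mean and variance fixed, BN collapses into a per-channel affine transform. A minimal sketch, assuming NCHW tensors, shows there is nothing to freeze when no pre-computed statistics exist:

```python
import numpy as np

def frozen_bn(x, running_mean, running_var, gamma, beta, eps=1e-5):
    """'Frozen' BN as used in fine-tuning: the pre-training statistics
    act as fixed constants, so the layer is a per-channel affine
    transform and no batch statistics are computed at all. Training
    from scratch has no such statistics, hence GN or SyncBN instead."""
    scale = gamma / np.sqrt(running_var + eps)
    shift = beta - running_mean * scale
    return x * scale.reshape(1, -1, 1, 1) + shift.reshape(1, -1, 1, 1)
```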
  19. METHODOLOGY — 2. Convergence: trained for longer than typical fine-tuning. This suggests that a sufficiently large number of total samples (arguably in terms of pixels) is required for models trained from random initialization to converge well.
  20. TRAINING FROM SCRATCH TO MATCH ACCURACY — our first surprising discovery is that when only using the COCO data, models trained from scratch can catch up in accuracy with ones that are fine-tuned.
  21. (i) Typical fine-tuning schedules (2×) work well for models with pre-training to converge to near optimum, but these schedules are not enough for models trained from scratch. (ii) Models trained from scratch can catch up with their fine-tuned counterparts: their detection AP is no worse, and they catch up not just by chance on a single metric.
  22. X152: large models trained from scratch.
  23. ImageNet pre-training, which carries little explicit localization information, does not help keypoint detection.
  24. TRAINING FROM SCRATCH WITH LESS DATA
  25. BREAKDOWN REGIME I: 1k COCO training images. Training with 1k COCO images (shown as the loss on the training set): the randomly initialized model can catch up on training loss, but has lower validation accuracy (3.4 AP) than the pre-trained counterpart (9.9 AP), a sign of strong overfitting due to the severe lack of data. The breakdown point in the COCO dataset is somewhere between 3.5k and 10k training images.
  26. BREAKDOWN REGIME II: PASCAL VOC. There are 15k VOC images used for training, but these images have on average 2.3 instances per image (vs. COCO’s ~7) and 20 categories (vs. COCO’s 80). We suspect the fewer instances (and categories) have a similar negative impact to insufficient training data, which can explain why training from scratch on VOC is not able to catch up as observed on COCO. With ImageNet pre-training: 82.7 mAP at 18k iterations; trained from scratch: 77.6 mAP at 144k iterations.
  27. MAIN OBSERVATIONS — • Training from scratch on target tasks is possible without architectural changes. • Training from scratch requires more iterations to sufficiently converge. • Training from scratch can be no worse than its ImageNet pre-training counterparts under many circumstances, down to as few as 10k COCO images. • ImageNet pre-training speeds up convergence on the target task. • ImageNet pre-training does not necessarily help reduce overfitting unless we enter a very small data regime. • ImageNet pre-training helps less if the target task is more sensitive to localization than classification.
  28. A FEW IMPORTANT QUESTIONS — Is ImageNet pre-training necessary? No. Is ImageNet helpful? Yes. Do we need big data? Yes. Shall we pursue universal representations? Yes.
  29. Thank you for your attention.
