The Deep Bootstrap: Good Online Learners are Good Offline Generalizers (paper review)
Accepted at ICLR 2021
Joint work by Harvard and Google
https://arxiv.org/abs/2010.08127
The Deep Bootstrap: paper review
2. Summary
• Reasoning about generalization in deep learning
• Couple the Real World, where the optimizer takes stochastic gradient steps on the empirical loss, to the Ideal World, where the optimizer takes stochastic gradient steps on the population loss (see the sketch after this list)
• Decomposition of test error into: (1) the Ideal World test error plus (2) the gap between the two worlds
• Evidence that this gap can be small in realistic deep learning settings
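A minimal formalization of the coupling, in my own notation (the symbols f_t, \tilde{f}_t, S, \mathcal{D} are shorthand, not copied from the slides): both worlds run the same optimizer for the same number of steps t; only the source of gradients differs.

```latex
\begin{align*}
\text{Real World:}\quad & f_{t+1} = f_t - \eta\, g_t, &
  g_t &\approx \nabla \hat{L}_S(f_t), \quad
  \hat{L}_S(f) = \tfrac{1}{n}\sum_{(x,y)\in S}\ell(f(x),y)
  \quad \text{($S$ of size $n$, reused)} \\
\text{Ideal World:}\quad & \tilde{f}_{t+1} = \tilde{f}_t - \eta\, \tilde{g}_t, &
  \tilde{g}_t &\approx \nabla L_{\mathcal{D}}(\tilde{f}_t), \quad
  L_{\mathcal{D}}(f) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\,\ell(f(x),y)
  \quad \text{(fresh batch every step)}
\end{align*}
```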
3. Generalization Gap
• The goal of a generalization theory in supervised learning is
to understand when and why trained models have small test
error
[Figure: the conventional view (reusing the n training samples) vs. our interpretation (fresh samples at every step), written out below]
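Written out, the figure's contrast between the conventional decomposition and this paper's decomposition looks like this (a reconstruction of the slide's message, not formulas copied from it):

```latex
\begin{align*}
\text{Conventional:}\quad & \mathrm{TestErr}(f_t)
  = \mathrm{TrainErr}(f_t)
  + \underbrace{\bigl[\mathrm{TestErr}(f_t)-\mathrm{TrainErr}(f_t)\bigr]}_{\text{classical generalization gap}} \\
\text{This paper:}\quad & \mathrm{TestErr}(f_t)
  = \underbrace{\mathrm{TestErr}(\tilde{f}_t)}_{\text{Ideal World error}}
  + \underbrace{\bigl[\mathrm{TestErr}(f_t)-\mathrm{TestErr}(\tilde{f}_t)\bigr]}_{\text{bootstrap error}}
\end{align*}
```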
4. Experimental Validation
• How to construct the Ideal World (effectively infinite population data); see the training-loop sketch after this list
• CIFAR-5m
• 6 million synthetic CIFAR-10-like images from a GAN
• Samples labeled by a 98.5%-accurate classification model
• ImageNet-DogBird
• More complex, real images
• ImageNet classes collapsed into two superclasses (dogs vs. birds) => 155K images
• Image-based data augmentation
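A minimal sketch of how the two worlds can be trained on CIFAR-5m, assuming an in-memory pool of images/labels and a PyTorch-style model and optimizer (the names `images`, `labels`, `model`, `opt` are placeholders, not from the paper's code):

```python
import numpy as np
import torch
import torch.nn.functional as F

def sgd_steps(model, opt, images, labels, num_steps, batch_size=128, reuse=True):
    """Run `num_steps` SGD steps on (images, labels).

    reuse=True  : Real World  - minibatches are drawn (with reuse) from a small
                  training set, e.g. n = 50K CIFAR-5m images.
    reuse=False : Ideal World - every minibatch consists of fresh, never-seen
                  samples, e.g. read in order from the ~5M-image pool.
    """
    n = len(labels)
    cursor = 0
    for _ in range(num_steps):
        if reuse:
            idx = np.random.randint(0, n, size=batch_size)        # samples are reused
        else:
            idx = np.arange(cursor, cursor + batch_size) % n       # fresh samples each step
            cursor += batch_size
        x = torch.as_tensor(images[idx], dtype=torch.float32)
        y = torch.as_tensor(labels[idx], dtype=torch.long)
        opt.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        opt.step()
```

The point of the coupling is that both calls use the same architecture, optimizer, and number of steps; only the data-reuse pattern differs.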
5. Experimental setup
Ex) for CIFAR-5m: n = 50K (Real World), ~5M (Ideal World)
for ImageNet-DogBird: n = 10K (Real World), 155K (Ideal World)
Soft error is used instead of hard error (a sketch of both measures follows below)
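A small sketch of the two error measures, assuming soft error means one minus the softmax probability of the correct class (my reading of the paper's metric; check the paper for the exact definition):

```python
import torch
import torch.nn.functional as F

def hard_error(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Fraction of examples whose top-1 prediction is wrong (0/1 loss)."""
    return (logits.argmax(dim=1) != labels).float().mean()

def soft_error(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """1 - softmax probability of the correct class, averaged over examples.
    Smoother than hard error, so per-step curves are less noisy."""
    probs = F.softmax(logits, dim=1)
    p_correct = probs.gather(1, labels.unsqueeze(1)).squeeze(1)
    return (1.0 - p_correct).mean()
```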
6. Claim: the bootstrap error is not big!
A naive interpretation:
1. We may not need that much data.
2. Algorithm and architecture choices may be what matter.
10. Effect of Data augmentation
• Data augmentation does typically reduce the bootstrap gap
• Good data augmentations should (1) not hurt optimization in the Ideal World (i.e., not destroy true samples much), and (2) obstruct optimization in the Real World (so the Real World can keep improving for longer before converging); a typical augmentation pipeline is sketched after this list
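As a concrete, hypothetical example (not taken from the paper), the standard CIFAR-style augmentation pipeline below keeps images recognizable, so Ideal World optimization is barely affected, while making the 50K fixed images harder to memorize, which slows Real World convergence:

```python
from torchvision import transforms

# Label-preserving perturbations that stay close to the true distribution
# but prevent the network from simply memorizing the finite train set.
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),      # small translations
    transforms.RandomHorizontalFlip(),         # mirror images
    transforms.ToTensor(),
    # commonly quoted CIFAR-10 channel statistics (approximate)
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2470, 0.2435, 0.2616)),
])
```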
11. Effect of pretrained models
Pretraining only changes the stopping point; it has no effect on the generalization (bootstrap) gap.
12. Implicit Bias vs. Explicit Optimization
• Recent work by Behnam Neyshabur suggests that:
• ConvNets generalize better than fully-connected networks
• there is an implicit bias of SGD toward ConvNet-like solutions in the Real World setting (n = 50K)
• Instead of studying the implicit bias of optimization on the empirical loss, we can study explicit properties of optimization on the population loss.
• We show that, in fact, this generalization behavior is captured by the fact that ConvNets optimize much faster on the population loss than fully-connected networks (a toy comparison is sketched after the reference below).
https://youtu.be/xu6fz0Z5RiU
Behnam Neyshabur. Towards learning convolutions from scratch. arXiv preprint arXiv:2007.13657, 2020.
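A toy, hypothetical comparison in the spirit of this claim (not the paper's experiment): train a small ConvNet and an MLP with fresh samples at every step and compare how quickly their loss falls. The architectures, step count, and learning rate below are arbitrary placeholders, and `sample_batch` is assumed to return a new (x, y) batch on each call.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fresh_sample_speed(model, sample_batch, num_steps=500, lr=0.05):
    """Approximate Ideal World optimization: every step sees a fresh batch.
    Returns the final training loss as a crude measure of optimization speed."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(num_steps):
        x, y = sample_batch()
        opt.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()

convnet = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10),
)
mlp = nn.Sequential(
    nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10),
)
```

With `sample_batch` drawing fresh CIFAR-5m-style batches, the claim predicts that the ConvNet's loss falls faster than the MLP's; by the bootstrap framework, this optimization advantage on the population loss is what shows up as a generalization advantage at n = 50K.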
13. Model Selection
in the Over- and Under-parameterized Regimes
• The same techniques (architectures and training methods) are
used in practice in both over- and under-parameterized
regimes.
• ResNet-101 is competitive both on 1 billion Instagram images (where it is under-parameterized) and on 50K CIFAR-10 images (where it is over-parameterized)
14. Model Selection
in the Over- and Under-parameterized Regimes
• very different considerations in each regime
• In the overparameterized regime, architecture matters for
generalization reasons: there are many ways to fit the train set,
and some architectures lead SGD to minima that generalize better
• In the underparameterized regime, architecture matters for purely
optimization reasons: all models will have small generalization gap
with 1 billion+ samples, but we seek models which are capable of
reaching low values of test loss, and which do so quickly (with few
optimization steps)
15. Our unified framework
• Our work suggests that these phenomena are closely related: if the bootstrap error is small, then we should expect that architectures which optimize well in the infinite-data (under-parameterized) regime also generalize well in the finite-data (over-parameterized) regime.
• This unifies the two a priori different principles guiding model selection in the over- and under-parameterized regimes, and helps explain why the same architectures are used in both regimes.
16. Conclusion
• Deep Bootstrap framework for understanding generalization
in deep learning
• Comparison of two worlds; the gap between them is small in realistic deep learning settings
• Real World: finite data, samples reused
• Ideal World: infinite data, fresh samples
• A first step towards characterizing the bootstrap error
• Further study is needed