Deep Double Descent
kevin
Modern Learning Theory
● Bigger models tend to overfit
○ Bias-Variance trade-off
○ Weight Regularization
○ Augmentation
○ Dropout
○ BatchNorm
○ Early stopping
○ Data-dependent regularization (mixup, etc.)
○ ...
● Bigger models are always better
Reconciling modern machine learning practice and the bias-variance trade-off
● Bigger models are not good in some regimes
● Even more data can hurt!
https://mltheory.org/deep.pdf
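As a concrete illustration of a few regularizers from the list above, here is a minimal PyTorch sketch; the layer sizes and hyperparameters are placeholder choices, not anything from the talk:

```python
import torch
import torch.nn as nn

# Minimal sketch combining a few classical regularizers
# (placeholder sizes and hyperparameters).
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),   # BatchNorm
    nn.ReLU(),
    nn.Dropout(p=0.5),     # Dropout
    nn.Linear(256, 10),
)
# Weight regularization: an L2 penalty applied via weight_decay.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=5e-4)
# Early stopping and augmentation would live in the training loop
# and the data pipeline, respectively.
```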
TL;DR
- Model-wise double descent
- There is a regime where bigger models are worse
- Sample-wise non-monotonicity
- There is a regime where more samples hurt
- Epoch-wise double descent
- There is a regime where training longer reverses overfitting
Generalization in the Deep Learning Era
- Networks can fit `anything`, even random noise
- Far larger capacity than people imagined before
UNDERSTANDING DEEP LEARNING REQUIRES RETHINKING GENERALIZATION
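A minimal sketch of the random-label experiment behind this claim (a toy MLP on Gaussian inputs, not the paper's CIFAR-10 setup; all sizes are illustrative):

```python
import torch
import torch.nn as nn

# Toy version of the Zhang et al. observation: a small network can
# drive train error to ~0 even when labels carry no signal at all.
torch.manual_seed(0)
n, d, classes = 1024, 128, 10
x = torch.randn(n, d)
y = torch.randint(0, classes, (n,))  # purely random labels

model = nn.Sequential(nn.Linear(d, 512), nn.ReLU(), nn.Linear(512, classes))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(2000):  # full-batch training
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()

with torch.no_grad():
    acc = (model(x).argmax(1) == y).float().mean().item()
print(f"train accuracy on random labels: {acc:.3f}")  # tends toward 1.0
```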
Generalization in the Deep Learning Era
- Over-parameterized networks perform better (implicit regularization)
IN SEARCH OF THE REAL INDUCTIVE BIAS : ON THE ROLE OF IMPLICIT REGULARIZATION IN DEEP LEARNING
Generalization in the Deep Learning Era
- Deep networks regularize themselves (better-behaved loss landscapes)
Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscapes
Generalization in the Deep Learning Era
SENSITIVITY AND GENERALIZATION IN NEURAL NETWORKS: AN EMPIRICAL STUDY
Model-wise double descent
Architecture
- ResNet18, CNN, Transformers
Label noise
● Makes the distribution `hard`
● Label noise is sampled only once, not per epoch
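A sketch of that fixed-noise setup, assuming CIFAR-10 via torchvision and a 20% noise rate (both illustrative); the key point is that labels are corrupted once at dataset construction and then reused every epoch:

```python
import torch
from torchvision import datasets, transforms

def corrupt_labels_once(dataset, noise_rate=0.2, num_classes=10, seed=0):
    """Replace a fixed fraction of labels with uniformly random classes.

    Done once, at construction time, so the same corrupted labels are
    seen on every epoch (they are NOT re-sampled per epoch).
    """
    g = torch.Generator().manual_seed(seed)
    targets = torch.tensor(dataset.targets)
    mask = torch.rand(len(targets), generator=g) < noise_rate
    targets[mask] = torch.randint(0, num_classes, (int(mask.sum()),), generator=g)
    dataset.targets = targets.tolist()
    return dataset

train_set = datasets.CIFAR10("data", train=True, download=True,
                             transform=transforms.ToTensor())
train_set = corrupt_labels_once(train_set, noise_rate=0.2)
```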
Model-wise double descent
- Model-wise double descent occurs across different architectures,
datasets, optimizers, and training procedures
- Also observed in adversarial training
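A toy sketch of the model-wise sweep: identical (noisily labeled) data, models of growing width, final test error recorded per width. Everything here (data generator, widths, step counts) is an illustrative assumption, not the paper's protocol:

```python
import torch
import torch.nn as nn

# Model-wise sweep on a noisy toy problem: as width grows past the
# point where the train set is first interpolated, test error can
# peak and then descend again.
torch.manual_seed(0)
d, classes, n = 32, 4, 512
w_true = torch.randn(d, classes)
x_tr = torch.randn(n, d)
y_tr = (x_tr @ w_true).argmax(1)
flip = torch.rand(n) < 0.2                       # 20% label noise, fixed once
y_tr[flip] = torch.randint(0, classes, (int(flip.sum()),))
x_te = torch.randn(4096, d)
y_te = (x_te @ w_true).argmax(1)                 # clean test labels

for width in [4, 16, 64, 256, 1024]:
    model = nn.Sequential(nn.Linear(d, width), nn.ReLU(),
                          nn.Linear(width, classes))
    opt = torch.optim.SGD(model.parameters(), lr=0.05)
    for _ in range(3000):                        # full-batch steps
        opt.zero_grad()
        nn.functional.cross_entropy(model(x_tr), y_tr).backward()
        opt.step()
    with torch.no_grad():
        err = (model(x_te).argmax(1) != y_te).float().mean().item()
    print(f"width={width:5d}  test error={err:.3f}")
```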
Model-wise & Epoch-wise double descent
[Figure: test error as a function of model size and training epochs]
Epoch-wise double descent
Sufficiently large models can undergo a "double descent" behavior where test error first decreases, then
increases near the interpolation threshold, and then decreases again.
Increasing the training time increases the EMC; thus a sufficiently large model transitions from under-
to over-parameterized over the course of training.
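For reference, the paper defines Effective Model Complexity roughly as follows (rendered here from memory of Nakkiran et al., so treat the exact notation as approximate): the EMC of a training procedure T is the largest sample size at which T still reaches approximately zero train error on average.

```latex
% Effective Model Complexity of training procedure T w.r.t.
% distribution D and tolerance epsilon (approximate rendering).
\mathrm{EMC}_{\mathcal{D},\varepsilon}(\mathcal{T}) :=
  \max\Big\{\, n \;\Big|\;
    \mathbb{E}_{S \sim \mathcal{D}^{n}}
      \big[\operatorname{Error}_{S}(\mathcal{T}(S))\big] \le \varepsilon
  \,\Big\}
```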
Epoch-wise double descent
Conventional training is split into two phases:
1. In the first phase, the network learns a function with a small generalization gap
2. In the second phase, the network starts to overfit the data, leading to an increase in test error
Not the complete picture
- In some regimes, the test error decreases again and may end training below the first minimum (see the sketch below)
Reminiscent of:
- Information bottleneck
- Lottery ticket hypothesis
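A self-contained toy sketch of how one might observe epoch-wise double descent: train far past zero train error on noisily labeled data and log test error every epoch. All sizes, noise rates, and step counts are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Track test error across a very long training run on a noisy toy
# problem; in the double-descent regime the curve dips, rises near
# the interpolation threshold, then dips again.
torch.manual_seed(0)
d, classes, n_train, n_test = 32, 4, 512, 2048
w_true = torch.randn(d, classes)

def make_split(n, noise_rate):
    x = torch.randn(n, d)
    y = (x @ w_true).argmax(1)
    flip = torch.rand(n) < noise_rate
    y[flip] = torch.randint(0, classes, (int(flip.sum()),))
    return x, y

x_tr, y_tr = make_split(n_train, noise_rate=0.2)  # noisy train labels
x_te, y_te = make_split(n_test, noise_rate=0.0)   # clean test labels

model = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, classes))
opt = torch.optim.SGD(model.parameters(), lr=0.05)

test_err = []
for epoch in range(2000):                         # far past interpolation
    opt.zero_grad()
    nn.functional.cross_entropy(model(x_tr), y_tr).backward()
    opt.step()
    with torch.no_grad():
        test_err.append((model(x_te).argmax(1) != y_te).float().mean().item())
# Plot test_err vs. epoch to inspect the shape of the curve.
```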
Epoch-wise double descent
[Figure: epoch-wise double descent on CIFAR-10 and CIFAR-100]
Sample-wise non-monotonicity
More data doesn't always improve performance
For both models, more data hurts performance
Sample-wise non-monotonicity
Transformers
- Language-translation task with no added label noise
Two effects combined
- More samples
- Larger models
4.5x more samples hurts performance for intermediate-size models
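And a matching sketch for the sample-wise axis: fix the model size, grow the (noisy) training set, and record final test error. The toy generator and sizes are illustrative assumptions, not the paper's translation setup:

```python
import torch
import torch.nn as nn

# Sample-wise sweep: identical model, growing noisy train sets.
# Final test error need not decrease monotonically with n.
torch.manual_seed(0)
d, classes = 32, 4
w_true = torch.randn(d, classes)

def make_split(n, noise_rate):
    x = torch.randn(n, d)
    y = (x @ w_true).argmax(1)
    flip = torch.rand(n) < noise_rate
    y[flip] = torch.randint(0, classes, (int(flip.sum()),))
    return x, y

x_te, y_te = make_split(4096, noise_rate=0.0)     # clean test labels

for n in [128, 256, 512, 1024, 2048]:
    x_tr, y_tr = make_split(n, noise_rate=0.2)    # noisy train labels
    model = nn.Sequential(nn.Linear(d, 128), nn.ReLU(),
                          nn.Linear(128, classes))
    opt = torch.optim.SGD(model.parameters(), lr=0.05)
    for _ in range(2000):
        opt.zero_grad()
        nn.functional.cross_entropy(model(x_tr), y_tr).backward()
        opt.step()
    with torch.no_grad():
        err = (model(x_te).argmax(1) != y_te).float().mean().item()
    print(f"n={n:5d}  test error={err:.3f}")
```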
Conclusion
Take home message :
Models behave unexpectedly in the transition regime
- Training longer reverses overfitting
- Doubling the number of training epochs is a known technique in some tasks
(e.g., object detection)
- Bigger models are worse
- Whether the model can fit the training set is a useful indicator
- Formalized as Effective Model Complexity (EMC)
- More data hurts
- a sticky situation :(
- Generalization is still the Holy Grail of deep learning
- remains an open question (both experimentally & theoretically)
- Connecting data complexity with model complexity is still difficult
- NAS in some sense systematically tackles this problem
Know your data & model
- noise level (problem difficulty)
- model capacity (fitting power)
