Deep Double Descent
kevin
Modern Learning Theory
● Bigger models tend to overfit
○ Bias-Variance trade-off
○ Weight Regularization
○ Augmentation
○ Dropout
○ BatchNorm
○ Early stopping
○ Data-dependent regularization (mixup, etc.)
○ ...
● Bigger models are always better
Reconciling modern machine learning practice and the bias-variance trade-off
● Bigger models are not good in some regimes
● Even more data can hurt!
https://mltheory.org/deep.pdf
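As a concrete illustration of a few regularizers from the list above, here is a minimal PyTorch sketch; the layer sizes and hyperparameters are placeholder choices, not anything from the talk:

```python
import torch
import torch.nn as nn

# Minimal sketch combining a few classical regularizers
# (placeholder sizes and hyperparameters).
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),   # BatchNorm
    nn.ReLU(),
    nn.Dropout(p=0.5),     # Dropout
    nn.Linear(256, 10),
)
# Weight regularization: an L2 penalty applied via weight_decay.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=5e-4)
# Early stopping and augmentation would live in the training loop
# and the data pipeline, respectively.
```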
TL;DR
- Model-wise double descent
- There is a regime where bigger models are worse
- Sample-wise non-monotonicity
- There is a regime where more samples hurt
- Epoch-wise double descent
- There is a regime where training longer reverses overfitting
Generalization in the Deep Learning Era
- Networks can fit `anything`, even random noise
- Far larger capacity than people imagined before
UNDERSTANDING DEEP LEARNING REQUIRES RETHINKING GENERALIZATION
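A minimal sketch of the random-label experiment behind this claim (a toy MLP on Gaussian inputs, not the paper's CIFAR-10 setup; all sizes are illustrative):

```python
import torch
import torch.nn as nn

# Toy version of the Zhang et al. observation: a small network can
# drive train error to ~0 even when labels carry no signal at all.
torch.manual_seed(0)
n, d, classes = 1024, 128, 10
x = torch.randn(n, d)
y = torch.randint(0, classes, (n,))  # purely random labels

model = nn.Sequential(nn.Linear(d, 512), nn.ReLU(), nn.Linear(512, classes))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(2000):  # full-batch training
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()

with torch.no_grad():
    acc = (model(x).argmax(1) == y).float().mean().item()
print(f"train accuracy on random labels: {acc:.3f}")  # tends toward 1.0
```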
Generalization in the Deep Learning Era
- Over-parameterized networks perform better (implicit regularization)
IN SEARCH OF THE REAL INDUCTIVE BIAS : ON THE ROLE OF IMPLICIT REGULARIZATION IN DEEP LEARNING
Generalization in the Deep Learning Era
- Deep networks regularize themselves (better-behaved loss landscapes)
Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscapes
Generalization in the Deep Learning Era
SENSITIVITY AND GENERALIZATION IN NEURAL NETWORKS: AN EMPIRICAL STUDY
Model-wise double descent
Architecture
- ResNet18, CNN, Transformers
Label noise
● Makes the distribution `hard`
● Label noise is sampled only once, not per epoch
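A sketch of that fixed-noise setup, assuming CIFAR-10 via torchvision and a 20% noise rate (both illustrative); the key point is that labels are corrupted once at dataset construction and then reused every epoch:

```python
import torch
from torchvision import datasets, transforms

def corrupt_labels_once(dataset, noise_rate=0.2, num_classes=10, seed=0):
    """Replace a fixed fraction of labels with uniformly random classes.

    Done once, at construction time, so the same corrupted labels are
    seen on every epoch (they are NOT re-sampled per epoch).
    """
    g = torch.Generator().manual_seed(seed)
    targets = torch.tensor(dataset.targets)
    mask = torch.rand(len(targets), generator=g) < noise_rate
    targets[mask] = torch.randint(0, num_classes, (int(mask.sum()),), generator=g)
    dataset.targets = targets.tolist()
    return dataset

train_set = datasets.CIFAR10("data", train=True, download=True,
                             transform=transforms.ToTensor())
train_set = corrupt_labels_once(train_set, noise_rate=0.2)
```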
Model-wise double descent
- Model-wise double descent occurs across different architectures,
datasets, optimizers, and training procedures
- Also observed in adversarial training
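A toy sketch of the model-wise sweep: identical (noisily labeled) data, models of growing width, final test error recorded per width. Everything here (data generator, widths, step counts) is an illustrative assumption, not the paper's protocol:

```python
import torch
import torch.nn as nn

# Model-wise sweep on a noisy toy problem: as width grows past the
# point where the train set is first interpolated, test error can
# peak and then descend again.
torch.manual_seed(0)
d, classes, n = 32, 4, 512
w_true = torch.randn(d, classes)
x_tr = torch.randn(n, d)
y_tr = (x_tr @ w_true).argmax(1)
flip = torch.rand(n) < 0.2                       # 20% label noise, fixed once
y_tr[flip] = torch.randint(0, classes, (int(flip.sum()),))
x_te = torch.randn(4096, d)
y_te = (x_te @ w_true).argmax(1)                 # clean test labels

for width in [4, 16, 64, 256, 1024]:
    model = nn.Sequential(nn.Linear(d, width), nn.ReLU(),
                          nn.Linear(width, classes))
    opt = torch.optim.SGD(model.parameters(), lr=0.05)
    for _ in range(3000):                        # full-batch steps
        opt.zero_grad()
        nn.functional.cross_entropy(model(x_tr), y_tr).backward()
        opt.step()
    with torch.no_grad():
        err = (model(x_te).argmax(1) != y_te).float().mean().item()
    print(f"width={width:5d}  test error={err:.3f}")
```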
Model-wise & Epoch-wise double descent
[Figure: test error as a function of model size and training epochs]
Epoch-wise double descent
Sufficiently large models can undergo a "double descent" behavior where test error first decreases, then
increases near the interpolation threshold, and then decreases again.
Increasing the training time increases the EMC; thus a sufficiently large model transitions from under-
to over-parameterized over the course of training.
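For reference, the paper defines Effective Model Complexity roughly as follows (rendered here from memory of Nakkiran et al., so treat the exact notation as approximate): the EMC of a training procedure T is the largest sample size at which T still reaches approximately zero train error on average.

```latex
% Effective Model Complexity of training procedure T w.r.t.
% distribution D and tolerance epsilon (approximate rendering).
\mathrm{EMC}_{\mathcal{D},\varepsilon}(\mathcal{T}) :=
  \max\Big\{\, n \;\Big|\;
    \mathbb{E}_{S \sim \mathcal{D}^{n}}
      \big[\operatorname{Error}_{S}(\mathcal{T}(S))\big] \le \varepsilon
  \,\Big\}
```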
Epoch-wise double descent
Conventional training is split into two phases:
1. In the first phase, the network learns a function with a small generalization gap
2. In the second phase, the network starts to overfit the data, leading to an increase in test error
Not the complete picture
- In some regimes, the test error decreases again and may end training below the first minimum (see the sketch below)
Reminiscent of:
- Information bottleneck
- Lottery ticket hypothesis
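A self-contained toy sketch of how one might observe epoch-wise double descent: train far past zero train error on noisily labeled data and log test error every epoch. All sizes, noise rates, and step counts are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Track test error across a very long training run on a noisy toy
# problem; in the double-descent regime the curve dips, rises near
# the interpolation threshold, then dips again.
torch.manual_seed(0)
d, classes, n_train, n_test = 32, 4, 512, 2048
w_true = torch.randn(d, classes)

def make_split(n, noise_rate):
    x = torch.randn(n, d)
    y = (x @ w_true).argmax(1)
    flip = torch.rand(n) < noise_rate
    y[flip] = torch.randint(0, classes, (int(flip.sum()),))
    return x, y

x_tr, y_tr = make_split(n_train, noise_rate=0.2)  # noisy train labels
x_te, y_te = make_split(n_test, noise_rate=0.0)   # clean test labels

model = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, classes))
opt = torch.optim.SGD(model.parameters(), lr=0.05)

test_err = []
for epoch in range(2000):                         # far past interpolation
    opt.zero_grad()
    nn.functional.cross_entropy(model(x_tr), y_tr).backward()
    opt.step()
    with torch.no_grad():
        test_err.append((model(x_te).argmax(1) != y_te).float().mean().item())
# Plot test_err vs. epoch to inspect the shape of the curve.
```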
Epoch-wise double descent
[Figure: epoch-wise double descent on CIFAR-10 and CIFAR-100]
Sample-wise non-monotonicity
More data doesn't always improve performance
For both models, more data hurts performance
Sample-wise non-monotonicity
Transformers
- Language-translation task with no added label noise
Two effects combined
- More samples
- Larger models
4.5x more samples hurts performance for intermediate-size models
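And a matching sketch for the sample-wise axis: fix the model size, grow the (noisy) training set, and record final test error. The toy generator and sizes are illustrative assumptions, not the paper's translation setup:

```python
import torch
import torch.nn as nn

# Sample-wise sweep: identical model, growing noisy train sets.
# Final test error need not decrease monotonically with n.
torch.manual_seed(0)
d, classes = 32, 4
w_true = torch.randn(d, classes)

def make_split(n, noise_rate):
    x = torch.randn(n, d)
    y = (x @ w_true).argmax(1)
    flip = torch.rand(n) < noise_rate
    y[flip] = torch.randint(0, classes, (int(flip.sum()),))
    return x, y

x_te, y_te = make_split(4096, noise_rate=0.0)     # clean test labels

for n in [128, 256, 512, 1024, 2048]:
    x_tr, y_tr = make_split(n, noise_rate=0.2)    # noisy train labels
    model = nn.Sequential(nn.Linear(d, 128), nn.ReLU(),
                          nn.Linear(128, classes))
    opt = torch.optim.SGD(model.parameters(), lr=0.05)
    for _ in range(2000):
        opt.zero_grad()
        nn.functional.cross_entropy(model(x_tr), y_tr).backward()
        opt.step()
    with torch.no_grad():
        err = (model(x_te).argmax(1) != y_te).float().mean().item()
    print(f"n={n:5d}  test error={err:.3f}")
```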
Conclusion
Take home message :
Models behave unexpectedly in the transition regime
- Training longer reverses overfitting
- Doubling the number of training epochs is a known technique in some tasks
(e.g., object detection)
- Bigger models are worse
- Whether the model can fit the training set is a useful indicator
- Formalized as Effective Model Complexity (EMC)
- More data hurts
- a sticky situation :(
- Generalization is still the Holy Grail of deep learning
- remains an open question (both experimentally & theoretically)
- Connecting data complexity with model complexity is still difficult
- NAS in some sense systematically tackles this problem
Know your data & model
- noise level (problem difficulty)
- model capacity (fitting power)
