IDS Lab
Understanding deep learning requires
rethinking generalization
Does deep learning really do any generalization?

presented by Jamie Seol
Motivation
• Normally, we measure generalization by:

• generalization error = |training error - test error|

• if we overfit, the training error stays low while the test error becomes large, i.e., a high generalization error!

• However, a complex neural network is prone to overfitting!

• for example, imagine training a human on a randomly labeled CIFAR-10 dataset

• then show them a sample from the training set again (a 2nd+ epoch)

• they will say "what the…" to any question

• because it’s impossible to generalize any abstract concept from random labels!

• what about a neural network?
CIFAR-10
• This is the CIFAR-10 dataset

• The goal of this task is to classify a given image into one of 10 classes

• CNNs that we know well solve this rather easily
Randomized CIFAR-10
• When we randomize the labels of CIFAR-10’s training set, the network still reaches (near-)perfect training accuracy while test accuracy drops to chance; the learning curves look like this:
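• As an illustration only (a minimal PyTorch-style sketch, not the paper’s setup, which used small Inception / AlexNet / MLP models), "randomizing the labels" amounts to something like this:

    # Hedged sketch: replace CIFAR-10 training labels with uniform random classes,
    # then train as usual and watch training accuracy climb anyway.
    import numpy as np
    import torch
    import torch.nn as nn
    import torchvision
    import torchvision.transforms as T

    train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                              download=True, transform=T.ToTensor())
    rng = np.random.RandomState(0)
    train_set.targets = rng.randint(0, 10, size=len(train_set.targets)).tolist()
    loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

    model = nn.Sequential(                       # small stand-in for the paper's models
        nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(), nn.Linear(128 * 8 * 8, 10),
    )
    opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(100):                     # with enough capacity/epochs, train acc -> 100%
        correct = 0
        for x, y in loader:
            opt.zero_grad()
            out = model(x)
            loss_fn(out, y).backward()
            opt.step()
            correct += (out.argmax(1) == y).sum().item()
        print(epoch, correct / len(train_set))   # test accuracy stays near chance (~10%)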
Randomized CIFAR-10
• This is nothing more than extreme overfitting!

• So what’s the problem then?

• the neural network memorized the dataset

• even though the labels carry no meaning!

• they’re random! raaaaandddddddommm!!!

• aaaaarrrrrrrr!!!

• it did not generalize any concept

• it just memorized!!!!
Randomized CIFAR-10
• Even if you didn’t intend it, neural nets can simply memorize things rather than generalize!

• According to the experiment,

• the effective capacity of neural networks is sufficient for memorizing the entire data set

• randomizing (corrupting) the data set makes training harder by only a small constant factor compared to the original task!

• Again, even if you didn’t want it to, a neural network easily overfits in this natural sense!!

• "You don’t have to explain the meanings. I’ll just memorize it" - Chatur,
from the movie "3 Idiots"
Regularization
• However, we do know that there are a lot of regularization techniques, which are supposed to support generalization!

• dropout, batch norm, early stopping, weight decay…

• They do seem to help, but wait…

• can anyone prove that regularization fundamentally improves generalization?

• does it really work that well? really???
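• For orientation, a hedged sketch of where these knobs live in a typical PyTorch pipeline (the architecture and hyperparameters below are invented for illustration, not taken from the paper):

    import torch
    import torch.nn as nn
    import torchvision.transforms as T

    # Explicit regularizers: dropout inside the model, weight decay in the optimizer.
    model = nn.Sequential(
        nn.Flatten(),
        nn.Linear(32 * 32 * 3, 512), nn.ReLU(),
        nn.BatchNorm1d(512),     # batch norm, often treated as an implicit regularizer
        nn.Dropout(p=0.5),       # dropout
        nn.Linear(512, 10),
    )
    opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9,
                          weight_decay=5e-4)     # weight decay = l2 penalty on the weights

    # Data augmentation lives in the input pipeline.
    augment = T.Compose([
        T.RandomCrop(32, padding=4),
        T.RandomHorizontalFlip(),
        T.ToTensor(),
    ])
    # The paper's observation: turn all of these off and large nets still generalize
    # reasonably well, so none of them is *the* explanation for generalization.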
Regularization
• Isn’t data augmentation significantly more important than weight
decay?

• Even with regularization, neural networks are good memorizers

• Just changing the model increased test accuracy
Regularization
• Early stopping helps

• but not necessarily…
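• A minimal, self-contained early-stopping sketch (an illustration on toy random data, not the paper’s experiment); with random labels the "best" validation epoch is essentially noise, which is exactly the regime where early stopping does not necessarily help:

    import copy
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    x_train, y_train = torch.randn(256, 20), torch.randint(0, 2, (256,))  # random labels
    x_val, y_val = torch.randn(128, 20), torch.randint(0, 2, (128,))

    model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()

    best_val, best_state, patience, bad = 0.0, None, 10, 0
    for epoch in range(200):
        opt.zero_grad()
        loss_fn(model(x_train), y_train).backward()
        opt.step()
        with torch.no_grad():
            val_acc = (model(x_val).argmax(1) == y_val).float().mean().item()
        if val_acc > best_val:
            best_val, best_state, bad = val_acc, copy.deepcopy(model.state_dict()), 0
        else:
            bad += 1
            if bad >= patience:              # stop once validation stops improving
                break
    model.load_state_dict(best_state)        # roll back to the best checkpoint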
Regularization
• Well… these techniques do seem helpful, but suspicion remains…
Rademacher complexity
• By the way, what’s the big deal about memorizing everything?

• The following quantity is called the Rademacher complexity (its definition is sketched at the end of this slide)

• the detailed math is omitted here

• The thing is, if some model can memorize everything (precisely, if the hypothesis class is rich enough to fit a randomized dataset), then the theoretical upper bound on the generalization error becomes trivial (roughly 1)

• which is useless!!!!

• in fact, a regularization scheme can lower this bound, but that does not hold for ReLU networks, and we’ll show later that there are situations where regularization helps nothing
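• For reference, a sketch of the definition and the bound it yields (hedged; see the paper for the precise statement): the empirical Rademacher complexity of a hypothesis class H on samples x1, …, xn is

    R̂n(H) = E_σ [ sup_{h ∈ H} (1/n) Σi σi h(xi) ],   with σ1, …, σn i.i.d. uniform over {−1, +1}

• a standard result then bounds the generalization error, with high probability, by roughly 2 R̂n(H) + O(√(log(1/δ) / n)); if H can fit arbitrary random ±1 labelings of the sample (which is exactly what the randomization experiment demonstrates), then R̂n(H) ≈ 1 and the bound is vacuous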
Finite-sample expressivity
• Remember Universal Approximation Theorem?

• the finite-sample expressivity theorem is a more practical, finite-sample version of it

• note that this also shows that the UAT does not guarantee generalization!

• Theorem 1: there exists a 2-layer neural network with ReLU activations and 2n+d weights that can represent any function on a sample of size n in d dimensions

• This is not a hard theorem to prove, so let’s do it
Lemma 1
• Lemma 1: for b1 < x1 < b2 < … < bn < xn, matrix A = [ReLU(xi - bj)]ij has
full rank
• Proof: the ordering makes A lower triangular (xi − bj ≤ 0 whenever j > i) with strictly positive diagonal entries xi − bi > 0, so det(A) = Πi (xi − bi) > 0 and A has full rank
Theorem 1
• Theorem 1: there exists a 2-layer neural network with ReLU activations and 2n+d weights that can represent any function on a sample of size n in d dimensions
• Proof: note that a 2-layer neural network with ReLU can be expressed as

    NN2(x) = Σj wj · ReLU(⟨a, x⟩ − bj)

• where w, b ∈ ℝⁿ and a ∈ ℝᵈ
• for data S = {z1, …, zn} with labels y ∈ ℝⁿ, where zi ∈ ℝᵈ, we want to show that yi = NN2(zi) for all i from 1 to n
• choose a and b so that the projections xi = ⟨a, zi⟩ satisfy the ordering condition of Lemma 1
• then the system becomes y = Aw, and Lemma 1 says that A is invertible
• so w = A⁻¹y does the job; done
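• A minimal numpy sketch of this construction (an illustration, not the paper’s code): pick a random a, sort the projections xi = ⟨a, zi⟩, interleave the biases bj, and solve the triangular system for w:

    import numpy as np

    rng = np.random.RandomState(0)
    n, d = 50, 10
    Z = rng.randn(n, d)                   # n samples in d dimensions
    y = rng.randn(n)                      # arbitrary real labels to memorize

    a = rng.randn(d)                      # random projection; the <a, z_i> are distinct a.s.
    x = Z @ a
    order = np.argsort(x)
    Z, y, x = Z[order], y[order], x[order]

    b = np.empty(n)                       # interleave: b_1 < x_1 < b_2 < x_2 < ... (Lemma 1)
    b[0] = x[0] - 1.0
    b[1:] = (x[:-1] + x[1:]) / 2.0

    A = np.maximum(x[:, None] - b[None, :], 0.0)   # A_ij = ReLU(x_i - b_j), lower triangular
    w = np.linalg.solve(A, y)                      # invertible by Lemma 1

    def nn2(zs):                          # NN2(z) = sum_j w_j * ReLU(<a, z> - b_j)
        hidden = np.maximum((zs @ a)[:, None] - b[None, :], 0.0)
        return hidden @ w

    print(np.allclose(nn2(Z), y))         # True: 2n + d weights (w, b, a) memorize all n labels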
Finite-sample expressivity
• What does it mean?

• It means that once you have more than about 2n + d parameters, your model already has enough capacity to super-overfit and simply remember everything instead of generalizing any concept; the generalization-error bound therefore becomes trivial, and nothing rules out the model being no more than a memorizer

• long story short: we can’t speak formally about generalization in
deep learning yet

• an aside: for deeper networks, one can use the intermediate layers to select the split intervals rather than the targets directly, resulting in a similar O(n + k) parameter requirement
Stochastic Gradient Descent
• Let’s think about linear optimization

• If we have d ≥ n (an underdetermined problem), then there are many global minima

• But hey, can we tell which optimum gives the best generalization?

• in non-linear problems, peeking at the curvature (flat vs. sharp minima) helped

• but in a linear problem the curvature is the same at every solution, so it cannot distinguish them!
Stochastic Gradient Descent
• A funny thing about SGD: on an underdetermined linear system it converges to the minimum-l2-norm solution of the l2 loss (when started from w = 0, every update is a linear combination of the data points), so SGD is known to act as an implicit regularizer by itself
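• A tiny numpy illustration of this claim (a sketch with toy dimensions, not the paper’s experiment): starting from w = 0, plain SGD on an underdetermined least-squares problem lands on the minimum-l2-norm interpolant:

    import numpy as np

    rng = np.random.RandomState(0)
    n, d = 5, 20                          # underdetermined: fewer equations than unknowns
    X, y = rng.randn(n, d), rng.randn(n)

    w = np.zeros(d)                       # start at 0, i.e., inside the row span of X
    lr = 0.01
    for _ in range(20000):                # SGD on the squared loss, one sample at a time
        i = rng.randint(n)
        w -= lr * (X[i] @ w - y[i]) * X[i]

    w_min_norm = X.T @ np.linalg.solve(X @ X.T, y)   # closed-form minimum-norm solution
    print(np.allclose(w, w_min_norm, atol=1e-6))     # True: SGD found the min-norm interpolant

• the reason: every update adds a multiple of a data point, so w never leaves the row span of X, and the only interpolant inside that span is the minimum-norm one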
Stochastic Gradient Descent
• However… the results show that the minimum-l2-norm solution is not always the best optimum in the sense of generalization

• furthermore, it is possible to construct a dataset for which the minimum-l2-norm solution is not the best generalizer: a constructive counterexample!

• adding l2 regularization to the parameters didn’t help at all (not shown in the table)
(figure omitted: the compared solutions had l2 norms of roughly 220 and 390)
Conclusion
• "Be careful whenever you speak 'generalization' in deep learning"

• Contributions of this paper:

• an experimental framework for investigating the suspicious activities of generalization techniques

• a proof that theoretical bounds on the generalization error become vacuous for deep learning (since even modestly sized networks have enough effective capacity to simply memorize everything)

• optimization does not necessarily mean generalization

• "beware of the light" - Caliban, from the movie "Logan"
References
• Zhang, Chiyuan, et al. "Understanding deep learning requires rethinking generalization." arXiv preprint arXiv:1611.03530 (2016).
• https://www.slideshare.net/JungHoonSeo2/understanding-deep-learning-requires-rethinking-generalization-2017-12
• https://www.slideshare.net/JungHoonSeo2/understanding-deep-learning-requires-rethinking-generalization-2017-2-22
