IDS Lab
Understanding deep learning requires
rethinking generalization
Does deep learning really do any generalization?

presented by Jamie Seol
Motivation
• Normally, we measure generalization by:

• generalization error = |training error - test error|

• if we overfit, the training error stays low while the test error becomes large, i.e., a high generalization error!

• However, a complex neural network is prone to overfitting!

• for example, imagine training a human on a randomly labeled CIFAR-10 dataset

• then show them a sample from the training set again (a 2nd+ epoch)

• they will say "what the…" to any question

• because it’s impossible to generalize any abstract concept from random labels!

• what about a neural network?
CIFAR-10
• This is the CIFAR-10 dataset

• The goal of this task is to classify a given image into one of 10 classes

• CNNs that we know well solve this rather easily
Randomized CIFAR-10
• When we randomize the labels of CIFAR-10’s training set, the network still reaches (near-)perfect training accuracy while test accuracy drops to chance; the learning curves look like this:
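• As an illustration only (a minimal PyTorch-style sketch, not the paper’s setup, which used small Inception / AlexNet / MLP models), "randomizing the labels" amounts to something like this:

    # Hedged sketch: replace CIFAR-10 training labels with uniform random classes,
    # then train as usual and watch training accuracy climb anyway.
    import numpy as np
    import torch
    import torch.nn as nn
    import torchvision
    import torchvision.transforms as T

    train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                              download=True, transform=T.ToTensor())
    rng = np.random.RandomState(0)
    train_set.targets = rng.randint(0, 10, size=len(train_set.targets)).tolist()
    loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

    model = nn.Sequential(                       # small stand-in for the paper's models
        nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(), nn.Linear(128 * 8 * 8, 10),
    )
    opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(100):                     # with enough capacity/epochs, train acc -> 100%
        correct = 0
        for x, y in loader:
            opt.zero_grad()
            out = model(x)
            loss_fn(out, y).backward()
            opt.step()
            correct += (out.argmax(1) == y).sum().item()
        print(epoch, correct / len(train_set))   # test accuracy stays near chance (~10%)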
Randomized CIFAR-10
• This is nothing more than extreme overfitting!

• So what’s the problem then?

• the neural network memorized the dataset

• even though the labels carry no meaning!

• they’re random! raaaaandddddddommm!!!

• aaaaarrrrrrrr!!!

• it did not generalize any concept

• it just memorized!!!!
Randomized CIFAR-10
• Even if you didn’t intend it, neural nets can simply memorize things rather than generalize!

• According to the experiment,

• the effective capacity of neural networks is sufficient for memorizing the entire data set

• randomizing (corrupting) the data set makes training harder by only a small constant factor compared to the original task!

• Again, even if you didn’t want it to, a neural network easily overfits in this natural sense!!

• "You don’t have to explain the meanings. I’ll just memorize it" - Chatur,
from the movie "3 Idiots"
Regularization
• However, we do know that there are a lot of regularization techniques, which are supposed to support generalization!

• dropout, batch norm, early stopping, weight decay…

• They do seem to help, but wait…

• can anyone prove that regularization fundamentally improves generalization?

• does it really work that well? really???
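• For orientation, a hedged sketch of where these knobs live in a typical PyTorch pipeline (the architecture and hyperparameters below are invented for illustration, not taken from the paper):

    import torch
    import torch.nn as nn
    import torchvision.transforms as T

    # Explicit regularizers: dropout inside the model, weight decay in the optimizer.
    model = nn.Sequential(
        nn.Flatten(),
        nn.Linear(32 * 32 * 3, 512), nn.ReLU(),
        nn.BatchNorm1d(512),     # batch norm, often treated as an implicit regularizer
        nn.Dropout(p=0.5),       # dropout
        nn.Linear(512, 10),
    )
    opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9,
                          weight_decay=5e-4)     # weight decay = l2 penalty on the weights

    # Data augmentation lives in the input pipeline.
    augment = T.Compose([
        T.RandomCrop(32, padding=4),
        T.RandomHorizontalFlip(),
        T.ToTensor(),
    ])
    # The paper's observation: turn all of these off and large nets still generalize
    # reasonably well, so none of them is *the* explanation for generalization.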
Regularization
• Isn’t data augmentation significantly more important than weight
decay?

• Even with regularization, neural networks are good memorizers

• Just changing the model increased test accuracy
Regularization
• Early stopping helps

• but not necessarily…
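• A minimal, self-contained early-stopping sketch (an illustration on toy random data, not the paper’s experiment); with random labels the "best" validation epoch is essentially noise, which is exactly the regime where early stopping does not necessarily help:

    import copy
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    x_train, y_train = torch.randn(256, 20), torch.randint(0, 2, (256,))  # random labels
    x_val, y_val = torch.randn(128, 20), torch.randint(0, 2, (128,))

    model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()

    best_val, best_state, patience, bad = 0.0, None, 10, 0
    for epoch in range(200):
        opt.zero_grad()
        loss_fn(model(x_train), y_train).backward()
        opt.step()
        with torch.no_grad():
            val_acc = (model(x_val).argmax(1) == y_val).float().mean().item()
        if val_acc > best_val:
            best_val, best_state, bad = val_acc, copy.deepcopy(model.state_dict()), 0
        else:
            bad += 1
            if bad >= patience:              # stop once validation stops improving
                break
    model.load_state_dict(best_state)        # roll back to the best checkpoint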
Regularization
• Well… these techniques do seem helpful, but suspicion remains…
Rademacher complexity
• By the way, what’s the big deal about memorizing everything?

• The following quantity is called the Rademacher complexity (its definition is sketched at the end of this slide)

• the detailed math is omitted here

• The thing is, if some model can memorize everything (precisely, if the hypothesis class is rich enough to fit a randomized dataset), then the theoretical upper bound on the generalization error becomes trivial (roughly 1)

• which is useless!!!!

• in fact, a regularization scheme can lower this bound, but that does not hold for ReLU networks, and we’ll show later that there are situations where regularization helps nothing
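• For reference, a sketch of the definition and the bound it yields (hedged; see the paper for the precise statement): the empirical Rademacher complexity of a hypothesis class H on samples x1, …, xn is

    R̂n(H) = E_σ [ sup_{h ∈ H} (1/n) Σi σi h(xi) ],   with σ1, …, σn i.i.d. uniform over {−1, +1}

• a standard result then bounds the generalization error, with high probability, by roughly 2 R̂n(H) + O(√(log(1/δ) / n)); if H can fit arbitrary random ±1 labelings of the sample (which is exactly what the randomization experiment demonstrates), then R̂n(H) ≈ 1 and the bound is vacuous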
Finite-sample expressivity
• Remember Universal Approximation Theorem?

• the finite-sample expressivity theorem is a more practical, finite-sample version of it

• note that this also shows that the UAT does not guarantee generalization!

• Theorem 1: there exists a 2-layer neural network with ReLU activations and 2n+d weights that can represent any function on a sample of size n in d dimensions

• This is not a hard theorem to prove, so let’s do it
Lemma 1
• Lemma 1: for b1 < x1 < b2 < … < bn < xn, matrix A = [ReLU(xi - bj)]ij has
full rank
• Proof: the ordering makes A lower triangular (xi − bj ≤ 0 whenever j > i) with strictly positive diagonal entries xi − bi > 0, so det(A) = Πi (xi − bi) > 0 and A has full rank
Theorem 1
• Theorem 1: there exists a 2-layer neural network with ReLU activations and 2n+d weights that can represent any function on a sample of size n in d dimensions
• Proof: note that a 2-layer neural network with ReLU can be expressed as

    NN2(x) = Σj wj · ReLU(⟨a, x⟩ − bj)

• where w, b ∈ ℝⁿ and a ∈ ℝᵈ
• for data S = {z1, …, zn} with labels y ∈ ℝⁿ, where zi ∈ ℝᵈ, we want to show that yi = NN2(zi) for all i from 1 to n
• choose a and b so that the projections xi = ⟨a, zi⟩ satisfy the ordering condition of Lemma 1
• then the system becomes y = Aw, and Lemma 1 says that A is invertible
• so w = A⁻¹y does the job; done
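• A minimal numpy sketch of this construction (an illustration, not the paper’s code): pick a random a, sort the projections xi = ⟨a, zi⟩, interleave the biases bj, and solve the triangular system for w:

    import numpy as np

    rng = np.random.RandomState(0)
    n, d = 50, 10
    Z = rng.randn(n, d)                   # n samples in d dimensions
    y = rng.randn(n)                      # arbitrary real labels to memorize

    a = rng.randn(d)                      # random projection; the <a, z_i> are distinct a.s.
    x = Z @ a
    order = np.argsort(x)
    Z, y, x = Z[order], y[order], x[order]

    b = np.empty(n)                       # interleave: b_1 < x_1 < b_2 < x_2 < ... (Lemma 1)
    b[0] = x[0] - 1.0
    b[1:] = (x[:-1] + x[1:]) / 2.0

    A = np.maximum(x[:, None] - b[None, :], 0.0)   # A_ij = ReLU(x_i - b_j), lower triangular
    w = np.linalg.solve(A, y)                      # invertible by Lemma 1

    def nn2(zs):                          # NN2(z) = sum_j w_j * ReLU(<a, z> - b_j)
        hidden = np.maximum((zs @ a)[:, None] - b[None, :], 0.0)
        return hidden @ w

    print(np.allclose(nn2(Z), y))         # True: 2n + d weights (w, b, a) memorize all n labels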
Finite-sample expressivity
• What does it mean?

• It means that once you have more than about 2n + d parameters, your model already has enough capacity to super-overfit and simply remember everything instead of generalizing any concept; the generalization-error bound therefore becomes trivial, and nothing rules out the model being no more than a memorizer

• long story short: we can’t speak formally about generalization in
deep learning yet

• an aside: for deeper networks, one can use the intermediate layers to select the split intervals rather than the targets directly, resulting in a similar O(n + k) parameter requirement
Stochastic Gradient Descent
• Let’s think about linear optimization

• If we have d ≥ n (an underdetermined problem), then there are many global minima

• But hey, can we tell which optimum gives the best generalization?

• in non-linear problems, peeking at the curvature (flat vs. sharp minima) helped

• but in a linear problem the curvature is the same at every solution, so it cannot distinguish them!
Stochastic Gradient Descent
• A funny thing about SGD: on an underdetermined linear system it converges to the minimum-l2-norm solution of the l2 loss (when started from w = 0, every update is a linear combination of the data points), so SGD is known to act as an implicit regularizer by itself
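• A tiny numpy illustration of this claim (a sketch with toy dimensions, not the paper’s experiment): starting from w = 0, plain SGD on an underdetermined least-squares problem lands on the minimum-l2-norm interpolant:

    import numpy as np

    rng = np.random.RandomState(0)
    n, d = 5, 20                          # underdetermined: fewer equations than unknowns
    X, y = rng.randn(n, d), rng.randn(n)

    w = np.zeros(d)                       # start at 0, i.e., inside the row span of X
    lr = 0.01
    for _ in range(20000):                # SGD on the squared loss, one sample at a time
        i = rng.randint(n)
        w -= lr * (X[i] @ w - y[i]) * X[i]

    w_min_norm = X.T @ np.linalg.solve(X @ X.T, y)   # closed-form minimum-norm solution
    print(np.allclose(w, w_min_norm, atol=1e-6))     # True: SGD found the min-norm interpolant

• the reason: every update adds a multiple of a data point, so w never leaves the row span of X, and the only interpolant inside that span is the minimum-norm one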
Stochastic Gradient Descent
• However… the results show that the minimum-l2-norm solution is not always the best optimum in the sense of generalization

• furthermore, it is possible to construct a dataset for which the minimum-l2-norm solution is not the best generalizer: a constructive counterexample!

• adding l2 regularization to the parameters didn’t help at all (not shown in the table)
(figure omitted: the compared solutions had l2 norms of roughly 220 and 390)
Conclusion
• "Be careful whenever you speak 'generalization' in deep learning"

• Contributions of this paper:

• an experimental framework for investigating the suspicious activities of generalization techniques

• a proof that theoretical bounds on the generalization error become vacuous for deep learning (since even modestly sized networks have enough effective capacity to simply memorize everything)

• optimization does not necessarily mean generalization

• "beware of the light" - Caliban, from the movie "Logan"
References
• Zhang, Chiyuan, et al. "Understanding deep learning requires rethinking generalization." arXiv preprint arXiv:1611.03530 (2016).
• https://www.slideshare.net/JungHoonSeo2/understanding-deep-learning-requires-rethinking-generalization-2017-12
• https://www.slideshare.net/JungHoonSeo2/understanding-deep-learning-requires-rethinking-generalization-2017-2-22
