Understanding deep learning requires rethinking generalization

Lab seminar about the paper: Zhang, Chiyuan, et al. "Understanding deep learning requires rethinking generalization." arXiv preprint arXiv:1611.03530 (2016).

Published in: Engineering

  1. IDS Lab
     Understanding deep learning requires rethinking generalization
     Does deep learning really generalize?
     Presented by Jamie Seol
  2. Motivation
     • Normally, we measure generalization by:
       • generalization error = |training error - test error|
       • if we overfit, the training error is low while the test error becomes large = high generalization error!
     • However, a complex neural network is prone to overfitting!
       • for example, suppose we train a human baby on a randomly labeled CIFAR-10 dataset
       • then show them some samples from the training set (2nd+ epoch)
       • they will say "what the…" to any question
       • because it’s impossible to generalize any abstract concept from random labels!
       • what about a neural network?
  3. CIFAR-10
     • This is the CIFAR-10 dataset
     • The goal of this task is to classify a given image into one of 10 classes
     • The CNNs we know well solve this rather easily
  4. Randomized CIFAR-10
     • When we randomize the information in CIFAR-10’s training set, the resulting accuracy becomes: (figure on slide)
  5. Randomized CIFAR-10
     • This is nothing more than over-overfitting!
     • What’s the problem, then?
       • the neural networks memorized the dataset
       • even though it should have no meaning!
       • it’s random! raaaaandddddddommm!!!
       • aaaaarrrrrrrr!!!
       • they did not generalize any concept
       • they just memorized!!!!
  6. Randomized CIFAR-10
     • Even if you didn’t intend it, neural nets can simply memorize things rather than generalize!
     • According to the experiment,
       • the effective capacity of a neural network is sufficient to memorize the entire dataset
       • randomizing (corrupting) the dataset makes the task harder only by a small constant factor compared to the original task!
     • Again, even if you didn’t want it to, a neural network is naturally prone to overfitting!
     • "You don’t have to explain the meanings. I’ll just memorize it" - Chatur, from the movie "3 Idiots"
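The memorization point and the generalization-error formula from the Motivation slide can be illustrated with a tiny sketch. This is not the paper's experiment: a lookup table stands in for the neural network, integer ids stand in for images, and all names (`table`, `predict`, `error_rate`) are hypothetical. It only shows that a pure memorizer reaches zero training error on random labels while its test error stays near chance.

```python
import random

NUM_CLASSES = 10  # CIFAR-10 has 10 classes

def error_rate(predict, xs, ys):
    """Fraction of samples where the prediction differs from the label."""
    return sum(predict(x) != y for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)

# Assumption: integer ids stand in for images; labels are purely random.
train_x = list(range(1000))
test_x = list(range(1000, 2000))
train_y = [rng.randrange(NUM_CLASSES) for _ in train_x]
test_y = [rng.randrange(NUM_CLASSES) for _ in test_x]

# "Training" here is plain memorization of (input, label) pairs.
table = dict(zip(train_x, train_y))

def predict(x):
    # Unseen inputs fall back to class 0, since nothing was learned about them.
    return table.get(x, 0)

train_error = error_rate(predict, train_x, train_y)
test_error = error_rate(predict, test_x, test_y)
generalization_error = abs(train_error - test_error)

print(train_error, test_error, generalization_error)
```

The training error is exactly zero by construction, so the generalization error equals the test error, which sits near the 90% chance level for 10 classes.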
  7. Regularization
     • However, we do know that there are many regularization techniques that support generalization!
       • dropout, batch norm, early stopping, weight decay…
     • They do seem to help, but wait…
       • can someone prove that regularization fundamentally improves generalization?
       • does it work really, really well? really???
  8. Regularization
     • Isn’t data augmentation significantly more important than weight decay?
     • Even with regularization, neural networks are good memorizers
     • Just changing the model increased test accuracy
  9. Regularization
     • Early stopping helps
       • but not necessarily…
  10. Regularization
      • Well… these techniques do seem helpful, but suspicion remains…
  11. Rademacher complexity
      • By the way, what’s the big deal about memorizing everything?
      • The following measure is called Rademacher complexity (formula shown on the slide)
        • detailed math is omitted here
      • The thing is, if a model can memorize everything (more precisely, if the hypothesis class has the power to fit a randomized dataset), then the theoretical upper bound on the generalization error is just 1
        • which is useless!!!!
      • Actually, using a regularization scheme lowers the bound, but this is not true for ReLU, and we’ll show that there are situations where regularization does not help at all
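For reference, here is the standard definition of empirical Rademacher complexity, which is also the one used in the cited paper (Zhang et al.). The slide's own formula did not survive in this transcript, so the notation below is an assumption rather than a copy of the slide.

```latex
% Empirical Rademacher complexity of a hypothesis class H on samples x_1, ..., x_n,
% where sigma_1, ..., sigma_n are i.i.d. uniform random signs in {-1, +1}.
\hat{\mathfrak{R}}_n(\mathcal{H})
  = \mathbb{E}_{\sigma}\left[
      \sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i \, h(x_i)
    \right]
% If H can fit any random sign pattern on the sample, the supremum is
% (about) 1 for every sigma, so the complexity is about 1 and the
% resulting generalization bound is trivial.
```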
  12. Finite-sample expressivity
      • Remember the Universal Approximation Theorem (UAT)?
      • The finite-sample expressivity theorem is a more practical version of it
        • note that this statement shows that the UAT does not guarantee generalization!
      • Theorem 1: there exists a 2-layer ReLU NN with 2n+d weights that can represent any function on a sample of size n in d dimensions
      • This is not a hard theorem to prove, so let’s do it
  13. Lemma 1
      • Lemma 1: for b1 < x1 < b2 < … < bn < xn, the matrix A = [ReLU(xi - bj)]ij has full rank
      • Proof: obvious (A is lower triangular with a positive diagonal, since xi - bj > 0 exactly when j ≤ i)
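A small numeric check of Lemma 1 on one hand-picked interleaved sequence. This is an illustrative sketch, not a proof; the helper names and the concrete values of `bs` and `xs` are assumptions chosen for the example.

```python
def relu(t):
    return max(t, 0.0)

def lemma1_matrix(xs, bs):
    """A[i][j] = ReLU(x_i - b_j), assuming b_1 < x_1 < b_2 < ... < b_n < x_n."""
    return [[relu(x - b) for b in bs] for x in xs]

def is_lower_triangular_with_positive_diagonal(A):
    n = len(A)
    zeros_above = all(A[i][j] == 0.0 for i in range(n) for j in range(i + 1, n))
    positive_diagonal = all(A[i][i] > 0.0 for i in range(n))
    return zeros_above and positive_diagonal

# Hypothetical interleaved values: 0.0 < 0.5 < 1.0 < 1.5 < 2.0 < 2.5 < 3.0 < 3.5
bs = [0.0, 1.0, 2.0, 3.0]
xs = [0.5, 1.5, 2.5, 3.5]
A = lemma1_matrix(xs, bs)

# For a triangular matrix the determinant is the product of the diagonal,
# so a positive diagonal implies a nonzero determinant, i.e. full rank.
determinant = 1.0
for i in range(len(A)):
    determinant *= A[i][i]

for row in A:
    print(row)
print(determinant)
```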
  14. Theorem 1
      • Theorem 1: there exists a 2-layer ReLU NN with 2n+d weights that can represent any function on a sample of size n in d dimensions
      • Proof: note that a 2-layer neural network with ReLU can be expressed as NN2(z) = Σj wj ReLU(⟨a, z⟩ - bj)
        • where w, b ∈ ℝn and a ∈ ℝd
      • For data S = {z1, …, zn} with zi ∈ ℝd and labels y ∈ ℝn, we want to show (WTS) yi = NN2(zi) for all i from 1 to n
      • Choose a, b so that xi = ⟨a, zi⟩ meets the condition of Lemma 1 (this needs the zi to be distinct)
      • Then this becomes y = Aw, and Lemma 1 says that A is invertible
      • done
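The construction in the proof can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the projection `a` is drawn at random (so distinct points get distinct projections almost surely), each `b_j` is placed at a midpoint between consecutive sorted projections, and `w` is found by forward substitution on the lower triangular matrix from Lemma 1. The function names are hypothetical.

```python
import random

def relu(t):
    return max(t, 0.0)

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def fit_two_layer_relu(zs, ys, rng):
    """Return (a, b, w) with sum_j w[j] * relu(<a, z> - b[j]) == y on every sample."""
    n, d = len(zs), len(zs[0])
    a = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    # Sort samples by their projection x_i = <a, z_i>.
    pairs = sorted((dot(a, z), y) for z, y in zip(zs, ys))
    xs = [x for x, _ in pairs]
    targets = [y for _, y in pairs]
    # Interleave so that b_1 < x_1 < b_2 < x_2 < ... < b_n < x_n.
    bs = [xs[0] - 1.0] + [(xs[i - 1] + xs[i]) / 2 for i in range(1, n)]
    # Solve A w = y by forward substitution, A[i][j] = relu(x_i - b_j).
    w = []
    for i in range(n):
        partial = sum(w[j] * relu(xs[i] - bs[j]) for j in range(i))
        w.append((targets[i] - partial) / relu(xs[i] - bs[i]))
    return a, bs, w

def nn2(z, a, bs, w):
    x = dot(a, z)
    return sum(wj * relu(x - bj) for wj, bj in zip(w, bs))

rng = random.Random(0)
n, d = 8, 3
zs = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n)]
ys = [rng.randrange(10) for _ in range(n)]  # random labels, as in the experiment

a, bs, w = fit_two_layer_relu(zs, ys, rng)
max_abs_error = max(abs(nn2(z, a, bs, w) - y) for z, y in zip(zs, ys))
num_weights = len(a) + len(bs) + len(w)  # d + n + n = 2n + d

print(max_abs_error, num_weights)
```

The network reproduces the random labels up to floating-point error while using exactly 2n + d weights.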
  15. Finite-sample expressivity
      • What does it mean?
      • It means that once you have more than about 2n + d parameters, your model already has more than enough power to super-overfit and simply remember everything instead of generalizing any concept. It therefore gets only a trivial bound on the generalization error and is exposed to the sudden-death danger of being nothing more than a memorizer
      • Long story short: we can’t speak formally about generalization in deep learning yet
      • A snake’s leg (a superfluous addition): for a deeper network, use the intermediate layers to choose split intervals rather than the target, resulting in a similar requirement of O(n + k) parameters
  16. Stochastic Gradient Descent
      • Let’s think about linear optimization
      • If d is large, so that the problem is underdetermined, then we can have multiple global minima
      • But hey, can we determine which optimum gives the best generalization?
        • in non-linear systems, looking at the curvature helped
        • but in a linear system the curvature is the same at every global minimum, so it tells us nothing!
  17. Stochastic Gradient Descent
      • The funny thing about SGD is that, for an underdetermined system, it converges to the minimum l2-norm solution (when started from zero), and so it is known to act as a regularizer itself
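A toy check of the minimum-norm claim, as a sketch only. The 2-equation, 3-unknown system, the step size, and the step count are all assumptions picked for the example. SGD on the squared loss is started from zero, so every update is a multiple of a data row and the iterate never leaves the row space of X. The closed-form minimum-norm solution w = Xᵀ(XXᵀ)⁻¹y is computed by hand for comparison.

```python
import random

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# Hypothetical underdetermined system: n = 2 equations, d = 3 unknowns.
X = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
y = [2.0, 3.0]

def sgd_from_zero(X, y, lr=0.1, steps=5000, seed=0):
    """Plain SGD on 0.5 * (<x_i, w> - y_i)^2, one random sample per step."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])  # zero init keeps w in the span of the rows of X
    for _ in range(steps):
        i = rng.randrange(len(X))
        residual = dot(X[i], w) - y[i]
        w = [wk - lr * residual * xk for wk, xk in zip(w, X[i])]
    return w

def min_norm_solution_2rows(X, y):
    """w = X^T (X X^T)^{-1} y, written out for the 2-row case."""
    g00, g01, g11 = dot(X[0], X[0]), dot(X[0], X[1]), dot(X[1], X[1])
    det = g00 * g11 - g01 * g01
    alpha0 = (g11 * y[0] - g01 * y[1]) / det
    alpha1 = (g00 * y[1] - g01 * y[0]) / det
    return [alpha0 * x0 + alpha1 * x1 for x0, x1 in zip(X[0], X[1])]

w_sgd = sgd_from_zero(X, y)
w_min = min_norm_solution_2rows(X, y)
w_other = [2.0, 3.0, 0.0]  # another exact solution, but with a larger norm

print(w_sgd, w_min)
print(dot(w_sgd, w_sgd), dot(w_other, w_other))
```

Both `w_min` and `w_other` solve the system exactly; SGD lands on the one with the smaller norm.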
  18. Stochastic Gradient Descent
      • However… the results show that the minimum l2-norm solution wasn’t always the best one in terms of generalization
        • furthermore, it is possible to construct a dataset for which the minimum l2-norm solution is not optimal! a constructive counterexample!
        • adding l2 regularization to the parameters didn’t help a bit (not shown in the table)
      • Table annotations: norm = 220, norm = 390
  19. Conclusion
      • "Be careful whenever you speak of 'generalization' in deep learning"
      • Contributions of this paper:
        • an experimental framework for scrutinizing the suspicious behavior of generalization techniques
        • an argument that deep learning lacks a useful theoretical bound on generalization error (since a network can just memorize everything with a small effective capacity)
        • optimization does not necessarily mean generalization
      • "beware of the light" - Caliban, from the movie "Logan"
  20. References
      • Zhang, Chiyuan, et al. "Understanding deep learning requires rethinking generalization." arXiv preprint arXiv:1611.03530 (2016).
      • https://www.slideshare.net/JungHoonSeo2/understanding-deep-learning-requires-rethinking-generalization-2017-12
      • https://www.slideshare.net/JungHoonSeo2/understanding-deep-learning-requires-rethinking-generalization-2017-2-22
