Published in:
Engineering


- 1. IDS Lab Understanding deep learning requires rethinking generalization: Is deep learning really doing any generalization? Presented by Jamie Seol
- 2. Motivation • Normally, we measure generalization by: • generalization error = |training error - test error| • if we overfit, the training error should be low while the test error becomes large = high generalization error! • However, a complex neural network is prone to overfitting! • for example, let’s train a human baby on a randomly labeled CIFAR-10 dataset • then show them some samples from the training set (2nd+ epoch) • they will say "what the…" to any question • because it’s impossible to generalize any abstract concept from random labels! • what about a neural network?
- 3. CIFAR-10 • This is the CIFAR-10 dataset • The goal of this task is to classify a given image into one of 10 classes • The CNNs that we know well solve this rather easily
- 4. Randomized CIFAR-10 • When we randomize the information in CIFAR-10’s training set, the resulting accuracy becomes:
- 5. Randomized CIFAR-10 • This is nothing more than over-overfitting! • What’s the problem, then? • the neural network memorized the dataset • even though it should have no meaning! • it’s random! raaaaandddddddommm!!! • aaaaarrrrrrrr!!! • it did not generalize any concept • it just memorized!!!!
- 6. Randomized CIFAR-10 • Even if you didn’t intend it, neural nets can just memorize things rather than generalize! • According to the experiment, • the effective capacity of a neural network is sufficient for memorizing the entire dataset • randomizing (corrupting) the dataset makes the training task harder (slower to fit) by only a small constant factor compared to the original task! • Again, even if you didn’t want it to, a neural network is naturally prone to overfitting! • "You don’t have to explain the meanings. I’ll just memorize it" - Chatur, from the movie "3 Idiots"
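The memorization effect on this slide can be reproduced at toy scale. The sketch below is not the paper's CIFAR-10 experiment: it assumes synthetic Gaussian inputs, random ±1 labels, and a hypothetical random-feature ReLU model whose output layer is fitted by least squares. All sizes and names (`width`, `features`, `accuracy`) are illustrative choices, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, width = 100, 20, 400  # hypothetical sizes; width > n gives enough capacity

# Inputs and labels are independent noise, so there is nothing to generalize.
x_train = rng.normal(size=(n, d))
y_train = rng.choice([-1.0, 1.0], size=n)
x_test = rng.normal(size=(n, d))
y_test = rng.choice([-1.0, 1.0], size=n)

# Fixed random ReLU hidden layer; only the output weights are fitted.
w_hidden = rng.normal(size=(d, width))
b_hidden = rng.normal(size=width)


def features(x):
    """Hidden-layer activations ReLU(x W + b)."""
    return np.maximum(x @ w_hidden + b_hidden, 0.0)


# Least squares on the output layer; with width > n it can interpolate the labels.
w_out, *_ = np.linalg.lstsq(features(x_train), y_train, rcond=None)


def accuracy(x, y):
    """Fraction of points whose predicted sign matches the label."""
    return float(np.mean(np.sign(features(x) @ w_out) == y))


train_acc = accuracy(x_train, y_train)
test_acc = accuracy(x_test, y_test)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

Under these assumptions the model should fit the random training labels exactly while staying near chance on fresh random labels, which is the gap between memorizing and generalizing that the slide describes.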
- 7. Regularization • However, we do know that there are a lot of regularization techniques that support generalization! • dropout, batch norm, early stopping, weight decay… • They do seem to help, but wait… • can someone prove that regularization fundamentally improves generalization? • does this really, really work well? really???
- 8. Regularization • Isn’t data augmentation significantly more important than weight decay? • Even with regularization, neural networks are good memorizers • Just changing the model increased test accuracy
- 9. Regularization • Early stopping helps • but not necessarily…
- 10. Regularization • Well… these techniques do seem helpful, but suspicion remains…
- 11. Rademacher complexity • By the way, what’s the big deal about memorizing everything? • The following measure is called Rademacher complexity • Detailed math is omitted here • The thing is, if a model can memorize everything (more precisely, if the hypothesis class has the power to fit a randomized dataset), then the theoretical upper bound on the generalization error is just 1 • which is useless!!!! • actually, using a regularization scheme lowers the bound, but this is not true for ReLU, and we’ll show that there are situations where regularization does not help at all
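For reference, the standard definition of the empirical Rademacher complexity of a hypothesis class on a sample x_1, …, x_n can be written as follows. The symbols (H for the hypothesis class, σ_i for the random signs) follow common textbook notation and are an assumption about what the omitted slide formula showed.

```latex
\hat{\mathfrak{R}}_n(\mathcal{H})
  = \mathbb{E}_{\sigma}\left[\, \sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i \, h(x_i) \right],
  \qquad \sigma_1, \dots, \sigma_n \in \{\pm 1\} \ \text{i.i.d. uniform}
```

If the class can fit every random sign pattern σ on the sample, the supremum equals 1 for each σ, so the complexity is about 1 and the resulting generalization bound is trivial.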
- 12. Finite-sample expressivity • Remember the Universal Approximation Theorem (UAT)? • the finite-sample expressivity theorem is a more practical version of it • note that this statement shows that UAT does not guarantee generalization! • Theorem 1: there exists a 2-layer NN with ReLU activations and 2n+d weights that can represent any function on a sample of size n in d dimensions • This is not a hard theorem to prove, so let’s do it
- 13. Lemma 1 • Lemma 1: for b1 < x1 < b2 < … < bn < xn, the matrix A = [ReLU(xi - bj)]ij has full rank • Proof: obvious (ReLU(xi - bj) = 0 whenever j > i, so A is lower triangular, and its diagonal entries xi - bi are positive)
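Lemma 1 is easy to check numerically. The sketch below builds A for one hypothetical interleaved sequence (the values of `b` and `x` are made up for illustration) and confirms that it is lower triangular with a positive diagonal, hence full rank.

```python
import numpy as np

n = 6
# Interleaved values b1 < x1 < b2 < ... < bn < xn (hypothetical example data).
b = np.arange(n, dtype=float)  # 0, 1, 2, ...
x = b + 0.5                    # 0.5, 1.5, 2.5, ...

# A[i, j] = ReLU(x_i - b_j)
A = np.maximum(x[:, None] - b[None, :], 0.0)

is_lower_triangular = bool(np.allclose(A, np.tril(A)))
diagonal_positive = bool(np.all(np.diag(A) > 0))
rank = int(np.linalg.matrix_rank(A))
print(is_lower_triangular, diagonal_positive, rank)
```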
- 14. Theorem 1 • Theorem 1: there exists a 2-layer NN with ReLU activations and 2n+d weights that can represent any function on a sample of size n in d dimensions • Proof: Note that a 2-layer neural network with ReLU can be expressed as NN2(z) = Σj wj ReLU(⟨a, z⟩ - bj) • where w, b ∈ ℝn and a ∈ ℝd • for data S = {z1, …, zn} with zi ∈ ℝd and labels y ∈ ℝn, we want to show (WTS) yi = NN2(zi) for all i from 1 to n • choose a, b so that xi = ⟨a, zi⟩ meets the condition of Lemma 1 • Then this becomes y = Aw, and Lemma 1 says that A is invertible • done
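The construction in the proof can be carried out directly. The following is a minimal sketch, assuming random Gaussian data and a random projection vector `a` (which makes the projections distinct almost surely); the function name `nn2` and the particular way `b` is interleaved (midpoints between sorted projections) are illustrative choices, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 3
z = rng.normal(size=(n, d))  # n samples in d dimensions (hypothetical data)
y = rng.normal(size=n)       # arbitrary real-valued targets

# Choose a so that the projections x_i = <a, z_i> are distinct.
a = rng.normal(size=d)
x = z @ a
order = np.argsort(x)
xs = x[order]

# Choose b interleaved with the sorted projections: b1 < x1 < b2 < ... < bn < xn.
b = np.empty(n)
b[0] = xs[0] - 1.0
b[1:] = (xs[:-1] + xs[1:]) / 2.0

# Solve y = A w with A[i, j] = ReLU(x_i - b_j); Lemma 1 says A is invertible.
A = np.maximum(xs[:, None] - b[None, :], 0.0)
w = np.linalg.solve(A, y[order])


def nn2(points):
    """Two-layer ReLU network: sum_j w_j * ReLU(<a, z> - b_j)."""
    return np.maximum((points @ a)[:, None] - b[None, :], 0.0) @ w


max_error = float(np.max(np.abs(nn2(z) - y)))
num_weights = a.size + b.size + w.size  # d + n + n = 2n + d
print(max_error, num_weights)
```

The network reproduces every target up to floating-point error using exactly 2n + d weights, regardless of how the targets were chosen.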
- 15. Finite-sample expressivity • What does it mean? • It means that once you have more than about 2n + d parameters, your model already has enough power to super-overfit and simply remember everything instead of generalizing any concept; therefore it only gets a trivial bound on the generalization error and is in danger of being nothing more than a memorizer • long story short: we can’t speak formally about generalization in deep learning yet • side note: for a deeper network, use the intermediate layers to choose the split intervals rather than the target, resulting in a similar O(n + k) parameters required
- 16. Stochastic Gradient Descent • Let’s think about a linear model • If d is large, this is an underdetermined problem, so we can have multiple global minima • But hey, can we determine which optimum gives the best generalization? • in non-linear systems, looking at the curvature helped • but there’s no such thing as curvature in a linear system!
- 17. Stochastic Gradient Descent • The funny thing about SGD is that, for an underdetermined system, it converges to the minimum l2 norm solution, so it is known to be a regularizer itself
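The implicit regularization claim can be illustrated on a small underdetermined least-squares problem. For simplicity, the sketch below uses plain full-batch gradient descent rather than SGD, starts from w = 0, and assumes random Gaussian data. Every update is a linear combination of the rows of X, so the iterate never leaves the row space, and the only interpolating solution in that space is the minimum l2 norm one.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100  # d > n: underdetermined, many global minima
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Full-batch gradient descent on 0.5 * ||X w - y||^2, starting from w = 0.
w = np.zeros(d)
step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1 / (largest singular value)^2
for _ in range(5000):
    w -= step * X.T @ (X @ w - y)

# Minimum-l2-norm interpolating solution: w* = X^T (X X^T)^{-1} y
w_min_norm = X.T @ np.linalg.solve(X @ X.T, y)

gap = float(np.linalg.norm(w - w_min_norm))
residual = float(np.linalg.norm(X @ w - y))
print(gap, residual)
```

Starting from a nonzero initial point would leave a component outside the row space untouched, so the zero initialization is part of the assumption here.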
- 18. Stochastic Gradient Descent • However… the results show that the minimum l2 norm solution wasn’t always the optimum in the sense of generalization • furthermore, it is possible to construct a dataset for which the minimum l2 norm solution is not the optimum! a constructive counterexample! • adding l2 regularization to the parameters didn’t help a bit (not shown in the table) • table labels: norm = 220, norm = 390
- 19. Conclusion • "Be careful whenever you say 'generalization' in deep learning" • Contributions of this paper: • an experimental framework for scrutinizing suspicious claims about generalization techniques • a proof that deep learning lacks a non-trivial theoretical bound on generalization error (since a network of modest size already has enough effective capacity to memorize everything) • optimization does not necessarily mean generalization • "beware of the light" - Caliban, from the movie "Logan"
- 20. References • Zhang, Chiyuan, et al. "Understanding deep learning requires rethinking generalization." arXiv preprint arXiv:1611.03530 (2016). • https://www.slideshare.net/JungHoonSeo2/understanding-deep-learning-requires-rethinking-generalization-2017-12 • https://www.slideshare.net/JungHoonSeo2/understanding-deep-learning-requires-rethinking-generalization-2017-2-22
