Methodology (DLAI D6L2 2017 UPC Deep Learning for Artificial Intelligence)

Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course covers the basic principles of deep learning from both algorithmic and computational perspectives.

  1. [course site] Javier Ruiz Hidalgo (javier.ruiz@upc.edu), Associate Professor, Universitat Politecnica de Catalunya (Technical University of Catalonia). Methodology, Day 6 Lecture 2. #DLUPC
  2. Outline ● Data ○ training, validation, test partitions ○ augmentation ● Capacity of the network ○ underfitting ○ overfitting ● Preventing overfitting ○ dropout, regularization ● Strategy
  3. Outline
  4. It’s all about the data... Figure extracted from Kevin Zakka's blog, “Nuts and Bolts of Applying Deep Learning”: https://kevinzakka.github.io/2016/09/26/applying-deep-learning/
  5. Well, not only data... ● Computing power: GPUs. Source: NVIDIA, 2017
  6. Well, not only data... ● Computing power: GPUs ● New learning architectures ○ CNN, RNN, LSTM, DBN, GNN, ...
  7. End-to-end learning: speech recognition. Slide extracted from Andrew Ng's NIPS 2016 tutorial
  8. End-to-end learning: autonomous driving. Slide extracted from Andrew Ng's NIPS 2016 tutorial
  9. Network capacity ● Space of representable functions that a network can potentially learn: ○ number of layers / parameters. Figure extracted from Stanford CS class CS231n: Convolutional Neural Networks for Visual Recognition
  10. Generalization: the network needs to generalize beyond the training data to work on new data that it has not seen yet
  11. Underfitting vs overfitting ● Overfitting: the network fits the training data too well ○ performs badly on test data ○ excessively complicated model ● Underfitting: the network does not fit the data well enough ○ excessively simple model ● Both underfitting and overfitting lead to poor predictions on new data; neither generalizes well
  12. Underfitting vs overfitting (graph with under/over). Figure extracted from Deep Learning by Adam Gibson and Josh Patterson, O'Reilly Media, 2017
  13. Data partition: split your data into two sets, training and test. How do we measure generalization instead of how well the network does on the memorized data? TRAINING 60% / TEST 20%
  14. Underfitting vs overfitting: training error and test error curves. Figure extracted from Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, MIT Press, 2016
  15. Data partition revisited ● The test set should not be used to tune your network ○ network architecture ○ number of layers ○ hyper-parameters ● Using it that way will overfit the network to your test set! ○ https://www.kaggle.com/c/higgs-boson/leaderboard
  16. Data partition revisited (2) ● Add a validation set! ● Lock away your test set and use it only as a last validation step. TRAINING 60% / VALIDATION 20% / TEST 20%
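The three-way partition above can be sketched in plain Python; the function name `split_dataset` and the exact fractions are illustrative conventions, not from the slides:

```python
import random

def split_dataset(samples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle the samples and partition them into training, validation,
    and test sets; whatever remains after train+val goes to test."""
    data = list(samples)
    random.Random(seed).shuffle(data)
    n_train = int(len(data) * train_frac)
    n_val = int(len(data) * val_frac)
    train = data[:n_train]
    val = data[n_train:n_train + n_val]
    test = data[n_train + n_val:]
    return train, val, test
```

"Locking away" the test set then simply means never passing `test` to any tuning loop until the final evaluation.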
  17. Data set distribution ● Take the distribution of the training and test sets into account. TRAINING 60% / VALIDATION 20% / TEST 20%, or TRAINING 60% / VAL. TRAIN 20% / TEST 10% / VAL. TEST 10%
  18. The bigger the better? ● Large networks ○ more capacity / more data ○ prone to overfit ● Smaller networks ○ lower capacity / less data ○ prone to underfit
  19. The bigger the better? ● In large networks, most local minima are equivalent and yield similar performance. ● The probability of finding a “bad” (high-value) local minimum is non-zero for small networks and decreases quickly with network size. ● Struggling to find the global minimum on the training set (as opposed to one of the many good local ones) is not useful in practice and may lead to overfitting. Better to use large-capacity networks and prevent overfitting. Anna Choromanska et al., “The Loss Surfaces of Multilayer Networks”: https://arxiv.org/pdf/1412.0233.pdf
  20. Preventing overfitting ● Early stopping ● Loss regularization ● Data augmentation ● Dropout ● Parameter sharing ● Adversarial training. Figure extracted from Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, MIT Press, 2016
  21. Early stopping (plot of training error and validation error over training steps; the early-stopping point is where validation error stops improving)
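A minimal sketch of the early-stopping rule, assuming we already have the validation-error history; the `patience` parameter (how many non-improving steps to tolerate) is a common convention, not from the slide:

```python
def early_stopping_step(val_errors, patience=3):
    """Return the step with the best validation error, stopping the scan
    as soon as no improvement has been seen for `patience` steps."""
    best_err, best_step = float("inf"), 0
    for step, err in enumerate(val_errors):
        if err < best_err:
            best_err, best_step = err, step
        elif step - best_step >= patience:
            break  # validation error has stopped improving: stop here
    return best_step
```

In practice one would checkpoint the model at `best_step` and restore it after training halts.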
  22. Loss regularization ● Limit the values of the parameters in the network ○ L2 or L1 regularization. Figure extracted from Cristina Scheau, Regularization in Deep Learning, 2016
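The two penalties can be written down directly; `lam` (the regularization strength) is a hypothetical hyper-parameter name for this sketch:

```python
def l2_regularized_loss(data_loss, weights, lam=1e-3):
    """Total loss = data loss + lam * sum of squared weights.
    The L2 penalty pushes all parameters toward small values."""
    return data_loss + lam * sum(w * w for w in weights)

def l1_regularized_loss(data_loss, weights, lam=1e-3):
    """L1 penalty: lam * sum of absolute weights, which additionally
    encourages sparse solutions (many weights driven to exactly zero)."""
    return data_loss + lam * sum(abs(w) for w in weights)
```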
  23. Data augmentation (1) ● Alter input samples artificially to increase the data size ● On the fly while training ○ inject noise ○ transformations ○ ... Alex Krizhevsky, ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012
  24. Data augmentation (2) ● Image ○ random crops ○ translations ○ flips ○ color changes ● Audio ○ tempo perturbation, speed ● Video ○ temporal displacement
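Two of the image augmentations listed above, flips and random crops, sketched on images represented as plain lists of pixel rows; the helper names are illustrative:

```python
import random

def horizontal_flip(img):
    """Flip an image (list of pixel rows) left to right."""
    return [row[::-1] for row in img]

def random_crop(img, size, rng):
    """Cut a random size x size patch out of the image."""
    top = rng.randrange(len(img) - size + 1)
    left = rng.randrange(len(img[0]) - size + 1)
    return [row[left:left + size] for row in img[top:top + size]]
```

Applied on the fly, each training epoch sees a slightly different version of every image, which effectively enlarges the dataset.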
  25. Data augmentation (3) ● Synthetic data: generate new input samples. A. Palazzi, Learning to Map Vehicles into Bird’s Eye View, ICIAP 2017. DeepGTAV plugin: https://github.com/ai-tor/DeepGTAV
  26. Data augmentation (4) ● GANs (Generative Adversarial Networks). P. Ferreira et al., Towards Data Set Augmentation with GANs, 2017. L. Sixt et al., RenderGAN: Generating Realistic Labeled Data, ICLR 2017
  27. Dropout (1) ● At each training iteration, randomly remove some nodes in the network along with all of their incoming and outgoing connections (N. Srivastava, 2014). Figure extracted from Cristina Scheau, Regularization in Deep Learning, 2016
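A sketch of dropout on a single activation vector, using the common "inverted dropout" convention (survivors are rescaled by 1/(1-p) at training time so test-time inference needs no change); this convention is an assumption, not stated on the slide:

```python
import random

def dropout(activations, p=0.5, train=True, rng=None):
    """Inverted dropout: during training, zero each activation with
    probability p and scale survivors by 1/(1-p) so the expected value
    is unchanged; at test time, return the activations untouched."""
    if not train or p == 0.0:
        return list(activations)
    rng = rng or random.Random()
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]
```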
  28. Dropout (2) ● Why does dropout work? ○ Nodes become less sensitive to the weights of the other nodes → more robust. ○ Averaging multiple models → ensemble. ○ Training a collection of 2^n thinned networks with parameter sharing. Figure extracted from Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, MIT Press, 2016
  29. Dropout (3) ● Dense-sparse-dense training (S. Han, 2016) a. Initial regular training. b. Drop connections whose weights are under a particular threshold. c. Retrain the sparse network to learn the weights of the important connections. d. Make the network dense again and retrain with a small learning rate, a step which adds back capacity
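Step (b), the pruning phase of dense-sparse-dense training, reduces to a magnitude threshold; a minimal sketch on a flat weight list (the function name is illustrative):

```python
def prune_weights(weights, threshold):
    """Sparse phase of dense-sparse-dense training: zero out every
    connection whose weight magnitude falls below the threshold."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]
```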
  30. Parameter sharing: multi-task learning with CNNs. Figure extracted from Leonardo Araujo, Artificial Intelligence, 2017. Figure extracted from Sebastian Ruder, An Overview of Multi-Task Learning in Deep Neural Networks, 2017
  31. Adversarial training ● Search for adversarial examples that the network misclassifies ○ a human observer cannot tell the difference ○ yet the network can make highly different predictions. I. Goodfellow et al., Explaining and Harnessing Adversarial Examples, ICLR 2015
  32. Strategy for machine learning (1): human-level performance can serve as a very reliable proxy which can be leveraged to determine your next move when training your model
  33. Strategy for machine learning (2). Human-level error: 1%. Training error: 9%. Validation error: 10%. Test error: 11%. The human-to-training gap signals avoidable bias
  34. Strategy for machine learning (3). Human-level error: 1%. Training error: 1.1%. Validation error: 10%. Test error: 11%. The training-to-validation gap signals overfitting of the training set
  35. Strategy for machine learning (4). Human-level error: 1%. Training error: 1.1%. Validation error: 2%. Test error: 11%. The validation-to-test gap signals overfitting of the validation set
  36. Strategy for machine learning (5). Human-level error: 1%. Training error: 1.1%. Validation error: 1.2%. Test error: 1.2%. All gaps are small: the model generalizes well
  37. Strategy for machine learning (6): a decision flowchart. High training error? If yes: bigger network, train longer (hyper-parameters), or new architecture. If no: high validation error? If yes: regularization, more data (augmentation), or new architecture. If no: high test error? If yes: more validation data. If no: done!
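The flowchart can be expressed as a chain of gap checks against human-level error, matching the four error ladders on the previous slides; the `slack` threshold and function name are illustrative:

```python
def diagnose(human, train, val, test, slack=0.02):
    """Walk the decision flowchart: the first large gap in the error
    ladder (human -> train -> val -> test) points at the remedy."""
    if train - human > slack:
        return "avoidable bias: bigger network, train longer, new architecture"
    if val - train > slack:
        return "overfitting training: regularization, more data (augmentation)"
    if test - val > slack:
        return "overfitting validation: more validation data"
    return "done"
```

Feeding in the numbers from slides 33 to 36 reproduces each diagnosis in turn.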
  38. References: Nuts and Bolts of Applying Deep Learning by Andrew Ng, https://www.youtube.com/watch?v=F1ka6a13S9I
  39. Thanks! Questions? https://imatge.upc.edu/web/people/javier-ruiz-hidalgo
