Vietnam Japan AI Community 2019-05-26
Kien Le
Regularization In
Deep Learning
Model Fitting Introduction
Model (Function) Fitting
• How well a model performs on the training and evaluation datasets defines its fitting characteristics:

                       Underfit     Overfit      Good Fit
Training Dataset       Poor         Very Good    Good
Evaluation Dataset     Very Poor    Poor         Good
Model Fitting – Visualization
Variations of model fitting [1]
Bias Variance
• Prediction errors [2]
Error(x) = (E[f̂(x)] − f(x))² + E[(f̂(x) − E[f̂(x)])²]
         = (Bias)² + Variance
Error = (Avg(Predicted) − True)² + Avg((Predicted − Avg(Predicted))²)
Bias Variance
• Bias
• Represents the extent to which the average prediction over all data sets differs
from the desired regression function
• Variance
• Represents the extent to which the model is sensitive to the particular choice
of data set
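A minimal NumPy sketch (not from the slides) that estimates the two terms in the decomposition above numerically: refit a high-degree polynomial on many freshly sampled datasets and measure (Bias)² and Variance of its prediction at one point. The true function, polynomial degree, and noise level are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(2 * np.pi * x)      # desired regression function
x_test = 0.3                                  # evaluate bias/variance at one point
degree, n_datasets, n_points, noise = 9, 200, 20, 0.3

preds = []
for _ in range(n_datasets):
    x = rng.uniform(0, 1, n_points)
    y = true_f(x) + rng.normal(0, noise, n_points)   # a fresh noisy dataset
    coefs = np.polyfit(x, y, degree)                 # fit one model per dataset
    preds.append(np.polyval(coefs, x_test))

preds = np.array(preds)
bias_sq = (preds.mean() - true_f(x_test)) ** 2       # (E[f_hat(x)] - f(x))^2
variance = preds.var()                               # E[(f_hat(x) - E[f_hat(x)])^2]
print(f"bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```

Lowering the polynomial degree typically raises (Bias)² and lowers Variance, which is exactly the trade-off probed in the quiz below.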
Quiz
• Model Fitting and Bias-Variance Relationship
Underfit Overfit Good Fit
Bias ? ? ?
Variance ? ? ?
Quiz - Answer
• Underfit: high bias, low variance
• Overfit: low bias, high variance
• Good Fit: low bias, low variance
Fit a function to a dataset
Regularization Introduction
Counter Underfit
• What causes underfit?
• Model capacity is too small to fit the training dataset or to generalize to a
new dataset.
• High bias, low variance
• Solution
• Increase the capacity of the model
• Examples:
• Increase the number of layers, the number of neurons per layer, etc.
• Result:
• Lower Bias
• Underfit → Good Fit?
Counter Underfit
• It’s so simple: just turn it into an overfit model!
Counter Overfit
• What causes overfit?
• Model capacity is so big that it adapts too well to the training samples → unable
to generalize well to new, unseen samples
• Low bias, high variance
• Solution
• Regularization
• But How?
Regularization Definition
• Regularization is any modification we make to a learning algorithm
that is intended to reduce its generalization error but not its training
error. [4]
Regularization
Techniques
Early Stopping, L1/L2,
Batch Norm, Dropout
Regularization Techniques
• Early Stopping
• L1/L2
• Batch Norm
• Dropout
• Data Augmentation
• Layer Norm
• Weight Norm
Early Stopping
• There is a point during the training of a large neural net when the model stops
generalizing and only learns the statistical noise in the training dataset.
• Solution
• Stop whenever the generalization error starts to increase
Early Stopping
Early Stopping
• Pros
• Very simple
• Highly recommended for all training runs, in combination with other techniques
• The Keras implementation can keep the best weights seen during training (see the sketch below)
• https://keras.io/callbacks/
• Applied as a callback during training
• Cons
• May not work well on its own
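A minimal tf.keras sketch of the callback mentioned above, run on toy data; the model, patience value, and data shapes are illustrative assumptions.

```python
import numpy as np
import tensorflow as tf

# Toy data and model, just to make the snippet runnable end to end.
x, y = np.random.rand(1000, 20), np.random.randint(0, 2, 1000)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Stop when the validation loss has not improved for `patience` epochs
# and roll back to the best weights seen so far.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

model.fit(x, y, validation_split=0.2, epochs=100, callbacks=[early_stop])
```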
L1/L2 Regularization
• L2 adds the squared magnitude of the coefficients as a penalty term to the
loss function:
Loss = Loss + λ Σᵢ βᵢ²
• L1 adds the absolute magnitude of the coefficients as a penalty term to the
loss function:
Loss = Loss + λ Σᵢ |βᵢ|
• Weight Penalties → Smaller Weights → Simpler Model → Less Overfit (Keras sketch below)
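A hedged sketch of how these penalty terms are typically added in Keras via kernel_regularizer; the λ value (1e-4) and layer sizes are illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

lam = 1e-4  # illustrative λ

# Built-in weight penalties: the λ·Σβ² (L2) or λ·Σ|β| (L1) term is added to the loss.
model = tf.keras.Sequential([
    layers.Dense(128, activation="relu", input_shape=(784,),
                 kernel_regularizer=regularizers.l2(lam)),   # L2 penalty on this layer's weights
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l1(lam)),   # L1 penalty on this layer's weights
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```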
L1/L2 Regularization
• Regularization works on the assumption that smaller weights yield a simpler
model, which helps avoid overfitting. [5]
• Why?
L1/L2 Comparison
• Robustness
• Sparsity
Robustness (Against Outliers)
• L1 > L2
• The loss contributed by outliers grows
• Quadratically with L2
• Linearly with L1
• L2 spends more effort fitting outliers → less robust
Sparsity
• L1>L2
• L1 zeros out coefficients, which leads to a sparse model
• L1 can be used for feature (coefficients) selection
• Unimportant ones have zero coefficients
• L2 will produce small values for almost all coefficients
• E.g.: when applying L1/L2 to a layer with 4 weights, the results might
look like
• L1: 0.8, 0, 1, 0
• L2: 0.3, 0.1, 0.3, 0.2
Sparsity ([3])
• L1 penalty λ|w|: w1 = w1 − 0.5 ∗ gradient, where the gradient is constant (+1 or −1)
  → w1: 5 → 0 in 10 steps, and stays at exactly 0
• L2 penalty λw²: w2 = w2 − 0.5 ∗ gradient, where the gradient (proportional to w2) shrinks as w2 shrinks
  → w2: 5 → 0 only after a very large number of steps (never exactly 0)
(simulated in the sketch below)
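A small Python simulation of the two update rules above (step size 0.5), assuming λ = 1 and writing the L2 penalty as (λ/2)·w² so its gradient is λ·w; it shows why L1 reaches exactly zero while L2 only decays toward it.

```python
lr, lam = 0.5, 1.0

# L1 penalty λ|w|: the gradient is a constant ±λ, so the weight hits exactly 0.
w1, steps1 = 5.0, 0
while w1 > 0:
    w1 = max(w1 - lr * lam, 0.0)   # sign(w1) = +1 here
    steps1 += 1

# L2 penalty (λ/2)·w²: the gradient λ·w shrinks with w, so the weight only decays toward 0.
w2, steps2 = 5.0, 0
while w2 > 1e-6:
    w2 = w2 - lr * lam * w2
    steps2 += 1

print(f"L1: reached 0 in {steps1} steps")          # 10 steps, matching the slide
print(f"L2: still {w2:.2e} after {steps2} steps")  # small but never exactly zero
```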
L1/L2 Regularization
• Fun Fact:
• What does “L” in L1/L2 stand for?
Batch Norm
• Original Paper Title:
• Batch Normalization: Accelerating Deep Network Training by Reducing
Internal Covariate Shift [6]
• Internal Covariate Shift:
• The change in the distribution of network activations due to the change in
network parameters during training.
Internal Covariate Shift (More)
• The distribution of each layer’s inputs changes during training as the
parameters of the previous layers change.
• The layers need to continuously adapt to the new distribution!
• Problems:
• Slower training
• Hard to use large learning rates
Batch Norm Algorithm
• Batch Norm tries to fix the means and variances of each layer’s inputs
• Reduces Internal Covariate Shift
• Statistics are computed over the batch axis (see the sketch below)
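A minimal NumPy sketch of the training-time forward pass (statistics over the batch axis, then a learnable scale γ and shift β); at test time, running averages of the batch statistics are used instead. Shapes and values are illustrative.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """x: (batch, features). Normalize over the batch axis, then scale and shift."""
    mu = x.mean(axis=0)                      # per-feature mean over the batch
    var = x.var(axis=0)                      # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)    # zero mean, unit variance
    return gamma * x_hat + beta              # learnable scale and shift

x = np.random.randn(32, 4) * 3 + 7           # a mini-batch with shifted statistics
out = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))   # ≈ 0 and ≈ 1
```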
Batch Norm Regularization Effect
• Each hidden unit is multiplied by a random value (it depends on the other samples in the mini-batch) at each training step
→ Adds noise to the training process
→ Forces layers to learn to be robust to a lot of variation in their inputs
→ A form of data augmentation
Batch Norm Recap
• Pros
• Networks train faster
• Allow higher learning rates
• Make weights easier to initialize
• Make more activation functions viable
• Regularization by forcing layers to be more robust to noise (may replace Dropout)
• Cons
• Not good for online learning
• Not well suited to RNNs/LSTMs
• Different calculations at train time and test time
• Related techniques
• Layer norm
• Weight norm
Dropout
• How it works
• Randomly selected neurons are ignored during each training step (see the Keras sketch below).
• Dropped neurons have no effect on the following layers.
• Dropped neurons are not updated in the backward pass.
• Questions:
• What is the idea behind it?
• Why does dropout help reduce overfitting?
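A minimal Keras sketch of how dropout is usually inserted between layers; the rate of 0.5 and the layer sizes are illustrative. The layer drops units only while training.

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Dense(256, activation="relu", input_shape=(784,)),
    layers.Dropout(0.5),     # each unit is dropped with probability 0.5 at train time
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),     # at inference time no units are dropped
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```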
Ensemble Models - Bagging
• How it works
• Train multiple models on different subsets of data
• Combine those models into a final model
• Characteristics
• Each sub-model is trained separately
• Each sub-model typically overfits
• The combination of those overfit models produces a less overfit model overall (see the sketch below)
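A compact scikit-learn sketch of bagging (not from the slides): one unpruned decision tree overfits, while averaging many trees trained on bootstrap subsets usually generalizes better. The dataset and number of estimators are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A single unpruned tree: low bias, high variance (overfits the training subset).
single = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Bagging: 50 trees (the default base estimator), each on a bootstrap subset, averaged.
bag = BaggingClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

print("single tree:", single.score(X_te, y_te))
print("bagged trees:", bag.score(X_te, y_te))
```

Each bagged tree sees a different bootstrap sample, so their individual errors partly cancel when the predictions are averaged.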
Ensemble Models
• Averaging multiple models to create a final model with low variance
Dropout - Ensemble Models for DNN
• Can we apply bagging to neural networks?
• It’s computationally prohibitive
• Dropout aims to solve this problem by providing a method to
combine multiple models at a practical computational cost.
Dropout
• Removing units from the base model effectively creates a subnetwork.
• All those subnetworks are trained implicitly together, with all
parameters shared (different from bagging)
• At prediction time, all learned units are active, which approximately averages all
trained subnetworks (see the sketch below)
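A NumPy sketch of “inverted dropout”, a common way to implement this averaging: surviving activations are scaled by 1/keep_prob during training, so at prediction time all units can stay active without any rescaling. The drop probability is illustrative.

```python
import numpy as np

def dropout(x, drop_prob=0.5, training=True):
    if not training:
        return x                                    # predict mode: all units active, no rescaling
    keep_prob = 1.0 - drop_prob
    mask = np.random.rand(*x.shape) < keep_prob     # randomly select the surviving units
    return x * mask / keep_prob                     # scale up so the expected activation is unchanged

h = np.ones((1, 8))
print(dropout(h, training=True))    # roughly half the units zeroed, the rest scaled to 2.0
print(dropout(h, training=False))   # all ones: the implicit average over subnetworks
```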
Dropout – Regularization Effect
• Each hidden unit is multiplied by a random value (zero or a scaling factor) at each training step
→ Adds noise to the training process
→ Similar to Batch Norm
Regularization Summary
• Two types of regularization
• Model optimization: Reduce the model complexity
• Data augmentation: Increase the size of training data
• Categorize techniques we have learned
• Model optimization: ?
• Data augmentation: ?
Demo Batch Norm, Dropout
Notes
• MNIST Dataset
• To create an overfit scenario:
• Reduce the dataset size (60K → 1K)
• Create a complex (but not particularly good) model
• Techniques to try
• Early stopping
• Dropout
• Batch Norm
• Link:
• https://drive.google.com/drive/u/0/folders/14A6n8bdrJHmgUcaopv66g8p0y0ot6hSr
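A minimal sketch of the demo setup described above (the actual notebook is at the link): MNIST cut down to 1K training samples and an oversized dense model, with BatchNormalization and Dropout layers that can be toggled to compare runs. Layer sizes and rates are illustrative; the test split is used as validation data for simplicity.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Shrink MNIST from 60K to 1K training samples to provoke overfitting.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, y_train = x_train[:1000] / 255.0, y_train[:1000]
x_test = x_test / 255.0

model = tf.keras.Sequential([
    layers.Flatten(input_shape=(28, 28)),
    layers.Dense(512, activation="relu"),
    layers.BatchNormalization(),     # toggle on/off to compare runs
    layers.Dropout(0.5),             # toggle on/off to compare runs
    layers.Dense(512, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
model.fit(x_train, y_train, epochs=50, validation_data=(x_test, y_test),
          callbacks=[early_stop])
```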
Key Takeaways
• Keywords: Overfit, Underfit, Bias, Variance
• Regularization Techniques: Dropout, Batch-Norm, Early Stopping
References
• [1] https://medium.com/greyatom/what-is-underfitting-and-overfitting-in-machine-learning-and-how-to-deal-with-it-6803a989c76
• [2] Pattern Recognition and Machine Learning, C. M. Bishop
• [3] https://stats.stackexchange.com/questions/45643/why-l1-norm-for-sparse-models
• [4] Deep Learning, Goodfellow et al.
• [5] https://medium.com/datadriveninvestor/l1-l2-regularization-7f1b4fe948f2
• [6] Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, Sergey Ioffe et al.
• [7] https://towardsdatascience.com/batch-normalization-8a2e585775c9
• [8] Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Srivastava et al.
• [9] https://machinelearningmastery.com/train-neural-networks-with-noise-to-reduce-overfitting/
• [10] Popular Ensemble Methods: An Empirical Study, Opitz et al.
Editor's Notes
1. There are two common sources of variance in a final model: the noise in the training data, and the use of randomness in the machine learning algorithm.
2. Lebesgue: the “L” in L1/L2 refers to the Lp (Lebesgue) norm.