Vietnam Japan AI Community 2019-05-26
Kien Le
Regularization in Deep Learning
Model Fitting Introduction
Model (Function) Fitting
• How well a model performs on the training and evaluation datasets defines its fitting characteristics:

                     Underfit     Overfit      Good Fit
Training Dataset     Poor         Very Good    Good
Evaluation Dataset   Very Poor    Poor         Good
Model Fitting – Visualization
Variations of model fitting [1]
Bias Variance
• Prediction errors [2]
$$\mathrm{Error}(x) = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{(\mathrm{Bias})^2} + \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\mathrm{Variance}}$$

$$\mathrm{Error} = \big(\mathrm{Avg}(\mathrm{Predicted}) - \mathrm{True}\big)^2 + \mathrm{Avg}\big(\big(\mathrm{Predicted} - \mathrm{Avg}(\mathrm{Predicted})\big)^2\big)$$
Bias Variance
• Bias
• Represents the extent to which average prediction over all data sets differs
from the desired regression function
• Variance
• Represents the extent to which the model is sensitive to the particular choice
of data set
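As a rough, hedged illustration (not in the original slides), bias² and variance can be estimated empirically by fitting the same model family to many resampled datasets, then comparing the average prediction to the true function (bias²) and the spread of predictions around their average (variance). A minimal NumPy sketch, where the sine target, the polynomial degrees, the noise level, and the helper name fit_predict are all my own choices:

```python
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(2 * np.pi * x)   # the "desired regression function"
x_test = np.linspace(0, 1, 50)

def fit_predict(degree, n_points=20, noise=0.3):
    """Fit a polynomial of the given degree to one noisy dataset, predict on x_test."""
    x = rng.uniform(0, 1, n_points)
    y = true_f(x) + rng.normal(0, noise, n_points)
    return np.polyval(np.polyfit(x, y, degree), x_test)

for degree in (1, 3, 9):                   # roughly: underfit, good fit, overfit
    preds = np.stack([fit_predict(degree) for _ in range(200)])  # 200 resampled datasets
    avg_pred = preds.mean(axis=0)
    bias2 = np.mean((avg_pred - true_f(x_test)) ** 2)    # (Avg(Predicted) - True)^2
    variance = np.mean((preds - avg_pred) ** 2)          # Avg((Predicted - Avg)^2)
    print(f"degree={degree}: bias^2={bias2:.3f}  variance={variance:.3f}")
```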
Quiz
• Model Fitting and Bias-Variance Relationship
             Underfit    Overfit    Good Fit
Bias         ?           ?          ?
Variance     ?           ?          ?
Quiz - Answer
(Illustrated in the slides by fitting a function to a dataset)

             Underfit    Overfit    Good Fit
Bias         High        Low        Low
Variance     Low         High       Low
Regularization Introduction
Counter Underfit
• What causes underfit?
• Model capacity is too small to fit the training dataset well, let alone generalize to
new data
• High bias, low variance
• Solution
• Increase the capacity of the model
• Examples:
• Increase number of layers, neurons in each layer, etc.
• Result:
• Lower Bias
• Underfit → Good Fit? (see the sketch below)
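A hedged Keras sketch of "increase the capacity of the model": the helper build_model and the layer sizes below are arbitrary choices for illustration, not the presenter's configuration.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(hidden_sizes):
    """An MLP classifier whose capacity grows with the number/width of hidden layers."""
    model = keras.Sequential([keras.Input(shape=(784,))])
    for units in hidden_sizes:
        model.add(layers.Dense(units, activation="relu"))
    model.add(layers.Dense(10, activation="softmax"))
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

small_model = build_model([16])               # low capacity: likely to underfit
bigger_model = build_model([256, 256, 128])   # more layers / more neurons per layer
```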
Counter Underfit
• It’s so simple: just turn it into an overfit model!
Counter Overfit
• What causes overfit?
• Model capacity is so large that it adapts too well to the training samples → unable
to generalize to new, unseen samples
• Low bias, high variance
• Solution
• Regularization
• But How?
Regularization Definition
• Regularization is any modification we make to a learning algorithm
that is intended to reduce its generalization error but not its training
error. [4]
Regularization
Techniques
Early Stopping, L1/L2,
Batch Norm, Dropout
Regularization Techniques
• Early Stopping
• L1/L2
• Batch Norm
• Dropout
• Data Augmentation
• Layer Norm
• Weight Norm
Early Stopping
• There is a point during training of a large neural network when the model stops
generalizing and only learns the statistical noise in the training dataset.
• Solution
• Stop training whenever the generalization (validation) error starts to increase
Early Stopping
• Pros
• Very simple
• Highly recommended for all training runs, alongside other techniques
• The Keras implementation has an option to keep the best weights seen during training
• https://keras.io/callbacks/
• Callback invoked during training
• Cons
• May not work well enough on its own
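For reference, a minimal sketch of early stopping with Keras's EarlyStopping callback (restore_best_weights corresponds to keeping the best weights, as mentioned above); the toy data, the small model, and the patience value are placeholders of my own:

```python
import numpy as np
from tensorflow import keras

# Toy stand-ins for a real dataset and model, just to keep the sketch self-contained.
x_train = np.random.rand(1000, 20).astype("float32")
y_train = (x_train.sum(axis=1) > 10).astype("int32")

model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss",          # watch the generalization (validation) error
    patience=5,                  # tolerate 5 epochs without improvement
    restore_best_weights=True,   # roll back to the best weights seen during training
)

model.fit(x_train, y_train, validation_split=0.2,
          epochs=200,            # upper bound; training usually stops much earlier
          callbacks=[early_stop], verbose=0)
```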
L1/L2 Regularization
• L2 adds the “squared magnitude” of the coefficients as a penalty term to the
loss function:
$$\mathrm{Loss} = \mathrm{Loss} + \lambda \sum_j \beta_j^2$$
• L1 adds the “absolute value of the magnitude” of the coefficients as a penalty term
to the loss function:
$$\mathrm{Loss} = \mathrm{Loss} + \lambda \sum_j |\beta_j|$$
• Weight penalties → smaller weights → simpler model → less overfitting
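A minimal Keras sketch of attaching the L1/L2 penalty terms to a layer's weights via kernel_regularizer; the penalty strength 0.01 and the layer sizes are arbitrary choices:

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

# Loss = Loss + lambda * sum(w^2)  (L2)      Loss = Loss + lambda * sum(|w|)  (L1)
model = keras.Sequential([
    keras.Input(shape=(784,)),
    layers.Dense(128, activation="relu",
                 kernel_regularizer=regularizers.l2(0.01)),  # L2 penalty on this layer's weights
    layers.Dense(128, activation="relu",
                 kernel_regularizer=regularizers.l1(0.01)),  # L1 penalty on this layer's weights
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```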
L1/L2 Regularization
• Regularization works on the assumption that smaller weights produce a simpler
model and thus help avoid overfitting. [5]
• Why?
L1/L2 Comparison
• Robustness
• Sparsity
Robustness (Against Outliers)
• L1>L2
• The loss from outliers increases
• Quadratically in L2
• Linearly in L1
• L2 spends more effort fitting outliers → less robust
Sparsity
• L1>L2
• L1 zeros out coefficients, which leads to a sparse model
• L1 can be used for feature (coefficients) selection
• Unimportant ones have zero coefficients
• L2 will produce small values for almost all coefficients
• E.g., when applying L1/L2 to a layer with 4 weights, the results might look like:
• L1: 0.8, 0, 1, 0
• L2: 0.3, 0.1, 0.3, 0.2
Sparsity ([3])
• Update rule (learning rate 0.5): $w_1 \leftarrow w_1 - 0.5 \cdot \mathrm{gradient}$, and likewise for $w_2$
• With the L1 penalty, the gradient is constant (+1 or −1), so $w_1$ goes from 5 to 0 in 10 steps
• With the L2 penalty, the gradient is proportional to the weight, so it gets smaller over time and $w_2$ goes from 5 towards 0 only after a very large number of steps (never exactly reaching it)
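A small Python sketch (my own, following the argument in [3]) that runs the two update rules; for the L2 case the gradient of ½w² (= w) is used so the shrinking-gradient effect is visible:

```python
# Gradient descent on the penalty term alone, learning rate 0.5, starting at w = 5.
lr = 0.5
w1 = 5.0   # L1 penalty |w|: gradient is constant (+1 or -1)
w2 = 5.0   # L2 penalty (1/2)w^2: gradient is w, so it shrinks together with w

for step in range(1, 41):
    w1 = max(w1 - lr * 1.0, 0.0)   # constant-size steps: reaches exactly 0 after 10 steps
    w2 = w2 - lr * w2              # halves each step: approaches 0 but never reaches it
    if step in (1, 5, 10, 20, 40):
        print(f"step {step:2d}:  w1 = {w1:.4f}   w2 = {w2:.6f}")
```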
L1/L2 Regularization
• Fun Fact:
• What does “L” in L1/L2 stand for?
Batch Norm
• Original Paper Title:
• Batch Normalization: Accelerating Deep Network Training by Reducing
Internal Covariate Shift [6]
• Internal Covariate Shift:
• The change in the distribution of network activations due to the change in
network parameters during training.
Internal Covariate Shift (More)
• Distribution of each layer’s inputs changes during training as the
parameters of the previous layers change.
• The layers need to continuously adapt to the new distribution!
• Problems:
• Slower training
• Hard to use a large learning rate
Batch Norm Algorithm
• Batch Norm fixes the means and variances of each layer’s inputs
• Reduces internal covariate shift
• Statistics are computed over the batch axis
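A NumPy sketch of the Batch Norm training-time transform: normalize each feature over the batch axis, then apply a learnable scale (gamma) and shift (beta), as in [6]. Function and variable names here are my own; at test time, running averages of the statistics are used instead.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """x has shape (batch_size, features); normalize each feature over the batch axis."""
    mean = x.mean(axis=0)                   # per-feature mean of the mini-batch
    var = x.var(axis=0)                     # per-feature variance of the mini-batch
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta             # learnable scale and shift

x = np.random.randn(32, 4) * 3.0 + 7.0      # a mini-batch with shifted/scaled features
gamma, beta = np.ones(4), np.zeros(4)
y = batch_norm_train(x, gamma, beta)
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # roughly 0 mean, 1 std per feature
# At test time, running averages of mean/var collected during training are used instead.
```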
Batch Norm Regularization Effect
• Each hidden unit is effectively multiplied by a random value at each training step
(its normalization depends on which other samples happen to be in the mini-batch)
→ Adds noise to the training process
→ Forces layers to learn representations that are robust to variation in their inputs
→ A form of data augmentation
Batch Norm Recap
• Pros
• Networks train faster
• Allow higher learning rates
• Make weights easier to initialize
• Make more activation functions viable
• Regularization by forcing layers to be more robust to noise (may replace Dropout)
• Cons
• Not good for online learning
• Not good for RNN, LSTM
• Different computation between train and test (batch statistics vs. running averages)
• Related techniques
• Layer norm
• Weight norm
Dropout
• How it works
• Randomly selected neurons are ignored (dropped) at each training step.
• Dropped neurons have no effect on the following layers in the forward pass.
• Dropped neurons are not updated in the backward pass.
• Questions:
• What is the idea behind it?
• Why does dropout help reduce overfitting?
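A minimal Keras sketch of dropout between dense layers; the rate 0.5 and the layer sizes are illustrative choices (0.5 is the hidden-unit value suggested in [8]):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(784,)),
    layers.Dense(512, activation="relu"),
    layers.Dropout(0.5),   # each unit's output is zeroed with probability 0.5 at train time
    layers.Dense(512, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```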
Ensemble Models - Bagging
• How it works
• Train multiple models on different subsets of data
• Combine those models into a final model
• Characteristics
• Each sub-model is trained separately
• Each sub-model is typically overfit
• The combination of those overfit models produces a less overfit model overall
Ensemble Models
• Averaging multiple models to create a final model with low variance
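A small NumPy sketch of the bagging idea (my own illustration, not from the slides): each sub-model is an overfit-prone polynomial fit on a bootstrap resample; averaging the sub-models' predictions yields a lower-variance final model.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 30))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 30)   # noisy training data
x_test = np.linspace(0, 1, 100)

def fit_submodel():
    """Fit an overfit-prone polynomial on a bootstrap resample of the data."""
    idx = rng.integers(0, len(x), len(x))             # sample with replacement
    return np.polyval(np.polyfit(x[idx], y[idx], deg=7), x_test)

sub_preds = np.stack([fit_submodel() for _ in range(50)])  # 50 separately trained sub-models
bagged = sub_preds.mean(axis=0)                            # the averaged (final) model
print("average spread of the sub-models:", sub_preds.std(axis=0).mean().round(3))
```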
Dropout - Ensemble Models for DNN
• Can we apply bagging to neural networks?
• Training many large networks separately is computationally prohibitive
• Dropout aims to solve this problem by providing a way to combine multiple models
at a practical computational cost.
Dropout
• Removing units from the base model effectively creates a subnetwork.
• All those subnetworks are trained implicitly together, with all parameters shared
(different from bagging).
• At prediction time, all learned units are active, which approximately averages all
the trained subnetworks.
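A NumPy sketch of the mechanics, using the "inverted dropout" formulation common in practice (kept activations are rescaled at training time so that prediction time needs no extra scaling); names are my own:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, rate=0.5, training=True):
    """Inverted dropout: mask units at train time and rescale so the expectation is unchanged."""
    if not training or rate == 0.0:
        return activations                      # prediction time: all units are active
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)    # each sampled mask defines one subnetwork

h = np.ones((4, 8))                             # fake hidden activations
print(dropout(h, training=True))                # random zeros, surviving units scaled to 2.0
print(dropout(h, training=False))               # unchanged at prediction time
```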
Dropout – Regularization Effect
• Each hidden unit is multiplied by a random value at each training step
→ Adds noise to the training process
→ Similar to Batch Norm
Regularization Summary
• Two types of regularization
• Model optimization: Reduce the model complexity
• Data augmentation: Increase the size of training data
• Categorize techniques we have learned
• Model optimization: ?
• Data augmentation: ?
Demo: Batch Norm, Dropout
Notes
• MNIST Dataset
• To create overfit scenario
• Reduce dataset size (60K->1K)
• Create a complex (but not so good) model
• Techniques to try
• Early stopping
• Dropout
• Batch Norm
• Link:
• https://drive.google.com/drive/u/0/folders/14A6n8bdrJHmgUcaopv66g8p0y0ot6hSr
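A hedged sketch of how such a demo setup might look in Keras; the actual notebook is in the link above, and the layer sizes, dropout rate, and patience value here are assumptions of my own:

```python
from tensorflow import keras
from tensorflow.keras import layers

# MNIST reduced to 1K training samples to provoke overfitting.
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train, y_train = x_train[:1000] / 255.0, y_train[:1000]
x_test = x_test / 255.0

model = keras.Sequential([
    keras.Input(shape=(28, 28)),
    layers.Flatten(),
    layers.Dense(512, activation="relu"),
    layers.BatchNormalization(),   # Batch Norm
    layers.Dropout(0.5),           # Dropout
    layers.Dense(512, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                           restore_best_weights=True)   # Early Stopping
model.fit(x_train, y_train, validation_data=(x_test, y_test),
          epochs=100, callbacks=[early_stop], verbose=0)
```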
Key Takeaways
• Keywords: Overfit, Underfit, Bias, Variance
• Regularization Techniques: Dropout, Batch-Norm, Early Stopping
References
• [1] https://medium.com/greyatom/what-is-underfitting-and-overfitting-in-machine-learning-and-how-to-deal-with-it-6803a989c76
• [2] Pattern Recognition and Machine Learning, C. M. Bishop
• [3] https://stats.stackexchange.com/questions/45643/why-l1-norm-for-sparse-models
• [4] Deep Learning, Goodfellow et al.
• [5] https://medium.com/datadriveninvestor/l1-l2-regularization-7f1b4fe948f2
• [6] Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, Sergey Ioffe et al.
• [7] https://towardsdatascience.com/batch-normalization-8a2e585775c9
• [8] Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Srivastava et al.
• [9] https://machinelearningmastery.com/train-neural-networks-with-noise-to-reduce-overfitting/
• [10] Popular Ensemble Methods: An Empirical Study, Opitz et al.

Editor's Notes

• On bias and variance: there are two common sources of variance in a final model: the noise in the training data, and the use of randomness in the machine learning algorithm.
• Answer to the L1/L2 fun fact: the "L" stands for Lebesgue (the Lp norms are named after the Lebesgue spaces).