Vietnam Japan AI Community 2019-05-26
Kien Le
Regularization In Deep Learning
Model Fitting Introduction
Model (Function) Fitting
• How well a model performs on the training and evaluation datasets defines its characteristics:

                      Underfit     Overfit      Good Fit
Training Dataset      Poor         Very Good    Good
Evaluation Dataset    Very Poor    Poor         Good
Model Fitting – Visualization
Variations of model fitting [1]
Bias Variance
• Prediction errors [2]
$\mathrm{Error}(x) = \underbrace{\big(E[\hat f(x)] - f(x)\big)^2}_{(\mathrm{Bias})^2} + \underbrace{E\big[(\hat f(x) - E[\hat f(x)])^2\big]}_{\mathrm{Variance}}$

$\mathrm{Error} = (\mathrm{Avg}(\mathrm{Predicted}) - \mathrm{True})^2 + \mathrm{Avg}\big((\mathrm{Predicted} - \mathrm{Avg}(\mathrm{Predicted}))^2\big)$
Bias Variance
• Bias
• Represents the extent to which average prediction over all data sets differs
from the desired regression function
• Variance
• Represents the extent to which the model is sensitive to the particular choice of data set
Quiz
• Model Fitting and Bias-Variance Relationship
            Underfit    Overfit    Good Fit
Bias        ?           ?          ?
Variance    ?           ?          ?
Quiz - Answer
Fit a function to a dataset
• Underfit: high bias, low variance. Overfit: low bias, high variance. Good fit: low bias, low variance.
Regularization Introduction
Counter Underfit
• What causes underfit?
• Model capacity is too small to fit the training dataset or to generalize to new data.
• High bias, low variance
• Solution
• Increase the capacity of the model
• Examples:
• Increase number of layers, neurons in each layer, etc.
• Result:
• Lower Bias
• Underfit → Good Fit?
Counter Underfit
• It's so simple: just turn it into an overfit model!
Counter Overfit
• What causes overfit?
• Model capacity is so big that the model adapts too well to the training samples → unable to generalize well to new, unseen samples
• Low bias, high variance
• Solution
• Regularization
• But How?
Regularization Definition
• Regularization is any modification we make to a learning algorithm
that is intended to reduce its generalization error but not its training
error. [4]
Regularization
Techniques
Early Stopping, L1/L2,
Batch Norm, Dropout
Regularization Techniques
• Early Stopping
• L1/L2
• Batch Norm
• Dropout
• Data Augmentation
• Layer Norm
• Weight Norm
Early Stopping
• There is a point during the training of a large neural net when the model stops generalizing and only focuses on learning the statistical noise in the training dataset.
• Solution
• Stop training whenever the generalization error starts to increase
Early Stopping
Early Stopping
• Pros
• Very simple
• Highly recommended for all training runs, along with other techniques
• The Keras implementation has an option to restore the BEST_WEIGHT (see the sketch below)
• https://keras.io/callbacks/
• Callback during training
• Cons
• May not work well on its own
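A minimal sketch of wiring up early stopping with the Keras callback mentioned above; the model, the dummy data, and the patience value are illustrative placeholders, not part of the original demo.

```python
import numpy as np
import tensorflow as tf

# Dummy stand-ins for a real dataset (shapes only); the demo later uses MNIST.
x_train, y_train = np.random.rand(1000, 784), np.random.randint(0, 10, 1000)
x_val, y_val = np.random.rand(200, 784), np.random.randint(0, 10, 200)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Stop when the validation loss stops improving, and roll back to the best weights seen.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',           # watch the generalization error
    patience=5,                   # tolerate a few bad epochs before stopping
    restore_best_weights=True,    # the "save BEST_WEIGHT" option
)

model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=100,
          callbacks=[early_stop])
```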
L1/L2 Regularization
• L2 adds “squared magnitude” of coefficient as penalty term to the
loss function.
$\mathrm{Loss} = \mathrm{Loss} + \lambda \sum_i \beta_i^2$
• L1 adds “absolute value of magnitude” of coefficient as penalty term
to the loss function.
$\mathrm{Loss} = \mathrm{Loss} + \lambda \sum_i |\beta_i|$
• Weight penalties → smaller weights → simpler model → less overfitting
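As a concrete illustration of attaching the penalty term to the loss, here is a minimal Keras sketch; the layer sizes and the λ value of 0.01 are arbitrary choices for illustration, not taken from the slides.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

model = tf.keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(784,),
                 kernel_regularizer=regularizers.l2(0.01)),  # adds lambda * sum(beta^2) to the loss
    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l1(0.01)),  # adds lambda * sum(|beta|) to the loss
    layers.Dense(10, activation='softmax'),
])
# The penalty terms show up in model.losses and are added to the training loss automatically.
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
```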
L1/L2 Regularization
• Regularization works on the assumption that smaller weights produce a simpler model and thus help avoid overfitting. [5]
• Why?
L1/L2 Comparison
• Robustness
• Sparsity
Robustness (Against Outliers)
• L1 > L2
• The loss from outliers increases
• Quadratically with L2
• Linearly with L1
• L2 spends more effort fitting outliers → less robust
Sparsity
• L1>L2
• L1 zeros out coefficients, which leads to a sparse model
• L1 can be used for feature (coefficients) selection
• Unimportant ones have zero coefficients
• L2 will produce small values for almost all coefficients
• E.g: When applying L1/L2 to a layer with 4 weights, the results might
look like
• L1: 0.8, 0, 1, 0
• L2: 0.3,0.1,0.3, 0.2
Sparsity ([3])
With a learning rate of 0.5 and a weight starting at 5:
• L1 penalty: w1 = w1 − 0.5 × gradient, where the gradient of |w1| is constant (+1 or −1), so w1 goes from 5 to 0 in 10 steps.
• L2 penalty: w2 = w2 − 0.5 × gradient, where the gradient is proportional to w2 and shrinks over time, so w2 decays towards 0 but needs a very large number of steps (and never reaches exactly 0).
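A small numerical sketch of the two update rules above, assuming a learning rate of 0.5, a starting weight of 5.0, and the ½·w² convention for the L2 penalty (so its gradient is simply w):

```python
# Gradient descent on the penalty term alone, to see why L1 drives weights to exactly zero.
w1, w2 = 5.0, 5.0   # w1 under an L1 penalty, w2 under an L2 penalty
lr = 0.5
for step in range(1, 11):
    w1 -= lr * (1.0 if w1 > 0 else -1.0)  # d|w|/dw = sign(w): constant-size steps
    w2 -= lr * w2                         # d(0.5*w^2)/dw = w: steps shrink as w shrinks
    print(step, round(w1, 4), round(w2, 4))
# w1 reaches exactly 0 after 10 steps; w2 only halves each step (2.5, 1.25, ...) and never hits 0.
```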
L1/L2 Regularization
• Fun Fact:
• What does “L” in L1/L2 stand for?
Batch Norm
• Original Paper Title:
• Batch Normalization: Accelerating Deep Network Training by Reducing
Internal Covariate Shift [6]
• Internal Covariate Shift:
• The change in the distribution of network activations due to the change in
network parameters during training.
Internal Covariate Shift (More)
• Distribution of each layer’s inputs changes during training as the
parameters of the previous layers change.
• The layers need to continuously adapt to the new distribution!
• Problems:
• Slower training
• Hard to use large learning rates
Batch Norm Algorithm
• Batch Norm tries to fix the means and variances of layer inputs
• Reduces Internal Covariate Shift
• Statistics are computed over the batch axis
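A rough numpy sketch of the transform (not the exact demo code): statistics are taken over the batch axis, followed by the learned scale γ and shift β from the paper; ε is a small constant for numerical stability.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """x has shape (batch, features); statistics are computed over the batch axis."""
    mu = x.mean(axis=0)                    # per-feature mean over the mini-batch
    var = x.var(axis=0)                    # per-feature variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize to roughly zero mean, unit variance
    return gamma * x_hat + beta            # learned scale and shift restore expressiveness

# Example: a mini-batch of 4 samples with 3 features.
x = np.random.randn(4, 3) * 10 + 7
y = batch_norm_train(x, gamma=np.ones(3), beta=np.zeros(3))
print(y.mean(axis=0), y.var(axis=0))       # roughly 0 and 1 per feature
```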
Batch Norm Regularization Effect
• Each hidden unit is scaled by a value that depends on the randomly sampled mini-batch at each training step
→ Adds noise to the training process
→ Forces layers to become robust to more variation in their inputs
→ A form of data augmentation
Batch Norm Recap
• Pros
• Networks train faster
• Allow higher learning rates
• Make weights easier to initialize
• Make more activation functions viable
• Regularization by forcing layers to be more robust to noise (may replace Dropout)
• Cons
• Not good for online learning
• Not good for RNN, LSTM
• Different calculation between train and test
• Related techniques
• Layer norm
• Weight norm
Dropout
• How it works
• Randomly selected neurons are ignored during each training step.
• Dropped neurons have no effect on the following layers.
• Dropped neurons are not updated during the backward pass.
• Questions:
• What are the ideas behind it?
• Why does dropout help reduce overfitting?
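A minimal Keras sketch of where dropout sits in a network; the 0.5 drop rate and the layer sizes are illustrative placeholders.

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Dense(256, activation='relu', input_shape=(784,)),
    layers.Dropout(0.5),   # each unit's output is zeroed with probability 0.5 during training
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.5),   # at prediction time the layer passes activations through unchanged
    layers.Dense(10, activation='softmax'),
])
```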
Ensemble Models - Bagging
• How it works
• Train multiple models on different subsets of data
• Combine those models into a final model
• Characteristics
• Each sub-model is trained separately
• Each sub-model is normally overfit
• The combination of those overfit models produces a less overfit model overall
Ensemble Models
• Averaging multiple models to create a final model with low variance
Dropout - Ensemble Models for DNN
• Can we apply bagging to neural networks?
• It's computationally prohibitive
• Dropout aims to solve this problem by providing a way to combine many models at a practical computational cost.
Dropout
• Removing units from the base model effectively creates a subnetwork.
• All those subnetworks are trained implicitly, with all parameters shared (different from bagging)
• At prediction time, all units are active, which approximately averages all the trained subnetworks
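To make the prediction-time averaging concrete, here is a small numpy sketch of inverted dropout (the variant Keras uses); the drop probability p and the toy activations are illustrative only.

```python
import numpy as np

def dropout(x, p, training):
    """Inverted dropout: scale kept activations by 1/(1-p) during training, so the expected
    activation matches what the full (un-dropped) network produces at prediction time."""
    if not training:
        return x                             # prediction: every unit stays active
    mask = np.random.rand(*x.shape) >= p     # keep each unit with probability 1 - p
    return x * mask / (1.0 - p)

h = np.ones((1, 8))
print(dropout(h, p=0.5, training=True))   # about half the units zeroed, the rest scaled to 2.0
print(dropout(h, p=0.5, training=False))  # unchanged: approximates averaging the subnetworks
```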
Dropout – Regularization Effect
• Each hidden unit is multiplied by a random value at each training step
→ Adds noise to the training process
→ Similar to Batch Norm
Regularization Summary
• Two types of regularization
• Model optimization: Reduce the model complexity
• Data augmentation: Increase the size of training data
• Categorize techniques we have learned
• Model optimization: ?
• Data augmentation: ?
Demo Batch Norm, Dropout
Notes
• MNIST Dataset
• To create overfit scenario
• Reduce dataset size (60K->1K)
• Create a complex (but not so good) model
• Techniques to try
• Early stopping
• Dropout
• Batch Norm
• Link:
• https://drive.google.com/drive/u/0/folders/14A6n8bdrJHmgUcaopv66g8p0y0ot6hSr
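A hedged sketch of how the demo setup could look (reduced MNIST, an over-sized model, then Dropout, Batch Norm, and Early Stopping); the exact architecture and hyperparameters in the linked notebook may differ.

```python
import tensorflow as tf
from tensorflow.keras import layers

# MNIST, reduced from 60K to 1K training samples to provoke overfitting (as described above).
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, y_train = x_train[:1000] / 255.0, y_train[:1000]
x_test = x_test / 255.0

model = tf.keras.Sequential([
    layers.Flatten(input_shape=(28, 28)),
    layers.Dense(512, activation='relu'),   # intentionally over-sized for 1K samples
    layers.BatchNormalization(),
    layers.Dropout(0.5),
    layers.Dense(512, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.5),
    layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5,
                                              restore_best_weights=True)
model.fit(x_train, y_train, validation_data=(x_test, y_test),
          epochs=50, callbacks=[early_stop])
```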
Key Takeaways
• Keywords: Overfit, Underfit, Bias, Variance
• Regularization Techniques: Dropout, Batch-Norm, Early Stopping
References
• [1] https://medium.com/greyatom/what-is-underfitting-and-overfitting-in-machine-
learning-and-how-to-deal-with-it-6803a989c76
• [2] Pattern Recognition and Machine Learning, M. Bishop
• [3] https://stats.stackexchange.com/questions/45643/why-l1-norm-for-sparse-models
• [4] Deep Learning, Goodfellow et. al
• [5] https://medium.com/datadriveninvestor/l1-l2-regularization-7f1b4fe948f2
• [6] Batch Normalization: Accelerating Deep Network Training by Reducing Internal
Covariate Shift, Sergey Ioffe et al
• [7] https://towardsdatascience.com/batch-normalization-8a2e585775c9
• [8] Dropout: A Simple Way to Prevent Neural Networks from Overfitting Srivastava et al
• [9] https://machinelearningmastery.com/train-neural-networks-with-noise-to-reduce-
overfitting/
• [10] Popular Ensemble Methods: An Empirical Study, Optiz et. al

Editor's Notes
1. There are two common sources of variance in a final model: the noise in the training data, and the use of randomness in the machine learning algorithm.
2. Lebesgue (the "L" in L1/L2 refers to the L¹/L² norms, named after Henri Lebesgue).