Vietnam Japan AI Community 2019-05-26
Kien Le
Regularization In
Deep Learning
Model Fitting Introduction
Model (Function) Fitting
• How well a model performs on the training and evaluation datasets defines its fitting characteristics:

                       Underfit     Overfit      Good Fit
Training Dataset       Poor         Very Good    Good
Evaluation Dataset     Very Poor    Poor         Good
Model Fitting – Visualization
Variations of model fitting [1]
Bias Variance
• Prediction errors [2]
Error(x) = (E[f̂(x)] − f(x))² + E[(f̂(x) − E[f̂(x)])²]
         = (Bias)² + Variance
Error = (Avg(Predicted) − True)² + Avg((Predicted − Avg(Predicted))²)
Bias Variance
• Bias
• Represents the extent to which the average prediction over all data sets differs
from the desired regression function
• Variance
• Represents the extent to which the model is sensitive to the particular choice
of data set
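A minimal NumPy sketch (not from the slides) that estimates the two terms in the decomposition above numerically: refit a high-degree polynomial on many freshly sampled datasets and measure (Bias)² and Variance of its prediction at one point. The true function, polynomial degree, and noise level are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(2 * np.pi * x)      # desired regression function
x_test = 0.3                                  # evaluate bias/variance at one point
degree, n_datasets, n_points, noise = 9, 200, 20, 0.3

preds = []
for _ in range(n_datasets):
    x = rng.uniform(0, 1, n_points)
    y = true_f(x) + rng.normal(0, noise, n_points)   # a fresh noisy dataset
    coefs = np.polyfit(x, y, degree)                 # fit one model per dataset
    preds.append(np.polyval(coefs, x_test))

preds = np.array(preds)
bias_sq = (preds.mean() - true_f(x_test)) ** 2       # (E[f_hat(x)] - f(x))^2
variance = preds.var()                               # E[(f_hat(x) - E[f_hat(x)])^2]
print(f"bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```

Lowering the polynomial degree typically raises (Bias)² and lowers Variance, which is exactly the trade-off probed in the quiz below.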
Quiz
• Model Fitting and Bias-Variance Relationship
Underfit Overfit Good Fit
Bias ? ? ?
Variance ? ? ?
Quiz - Answer
• Underfit: high bias, low variance
• Overfit: low bias, high variance
• Good Fit: low bias, low variance
Fit a function to a dataset
Regularization Introduction
Counter Underfit
• What causes underfit?
• Model capacity is too small to fit the training dataset or to generalize to a
new dataset.
• High bias, low variance
• Solution
• Increase the capacity of the model
• Examples:
• Increase the number of layers, the number of neurons per layer, etc.
• Result:
• Lower Bias
• Underfit → Good Fit?
Counter Underfit
• It’s so simple: just turn it into an overfit model!
Counter Overfit
• What causes overfit?
• Model capacity is so big that it adapts too well to the training samples → unable
to generalize well to new, unseen samples
• Low bias, high variance
• Solution
• Regularization
• But How?
Regularization Definition
• Regularization is any modification we make to a learning algorithm
that is intended to reduce its generalization error but not its training
error. [4]
Regularization
Techniques
Early Stopping, L1/L2,
Batch Norm, Dropout
Regularization Techniques
• Early Stopping
• L1/L2
• Batch Norm
• Dropout
• Data Augmentation
• Layer Norm
• Weight Norm
Early Stopping
• There is a point during the training of a large neural net when the model stops
generalizing and only learns the statistical noise in the training dataset.
• Solution
• Stop whenever the generalization error starts to increase
Early Stopping
Early Stopping
• Pros
• Very simple
• Highly recommended for all training runs, in combination with other techniques
• The Keras implementation can keep the best weights seen during training (see the sketch below)
• https://keras.io/callbacks/
• Applied as a callback during training
• Cons
• May not work well on its own
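A minimal tf.keras sketch of the callback mentioned above, run on toy data; the model, patience value, and data shapes are illustrative assumptions.

```python
import numpy as np
import tensorflow as tf

# Toy data and model, just to make the snippet runnable end to end.
x, y = np.random.rand(1000, 20), np.random.randint(0, 2, 1000)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Stop when the validation loss has not improved for `patience` epochs
# and roll back to the best weights seen so far.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

model.fit(x, y, validation_split=0.2, epochs=100, callbacks=[early_stop])
```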
L1/L2 Regularization
• L2 adds the squared magnitude of the coefficients as a penalty term to the
loss function:
Loss = Loss + λ Σᵢ βᵢ²
• L1 adds the absolute magnitude of the coefficients as a penalty term to the
loss function:
Loss = Loss + λ Σᵢ |βᵢ|
• Weight Penalties → Smaller Weights → Simpler Model → Less Overfit (Keras sketch below)
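A hedged sketch of how these penalty terms are typically added in Keras via kernel_regularizer; the λ value (1e-4) and layer sizes are illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

lam = 1e-4  # illustrative λ

# Built-in weight penalties: the λ·Σβ² (L2) or λ·Σ|β| (L1) term is added to the loss.
model = tf.keras.Sequential([
    layers.Dense(128, activation="relu", input_shape=(784,),
                 kernel_regularizer=regularizers.l2(lam)),   # L2 penalty on this layer's weights
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l1(lam)),   # L1 penalty on this layer's weights
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```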
L1/L2 Regularization
• Regularization works on the assumption that smaller weights yield a simpler
model, which helps avoid overfitting. [5]
• Why?
L1/L2 Comparison
• Robustness
• Sparsity
Robustness (Against Outliers)
• L1 > L2
• The loss contributed by outliers grows
• Quadratically with L2
• Linearly with L1
• L2 spends more effort fitting outliers → less robust
Sparsity
• L1>L2
• L1 zeros out coefficients, which leads to a sparse model
• L1 can be used for feature (coefficients) selection
• Unimportant ones have zero coefficients
• L2 will produce small values for almost all coefficients
• E.g.: when applying L1/L2 to a layer with 4 weights, the results might
look like
• L1: 0.8, 0, 1, 0
• L2: 0.3, 0.1, 0.3, 0.2
Sparsity ([3])
• L1 penalty λ|w|: w1 = w1 − 0.5 ∗ gradient, where the gradient is constant (+1 or −1)
  → w1: 5 → 0 in 10 steps, and stays at exactly 0
• L2 penalty λw²: w2 = w2 − 0.5 ∗ gradient, where the gradient (proportional to w2) shrinks as w2 shrinks
  → w2: 5 → 0 only after a very large number of steps (never exactly 0)
(simulated in the sketch below)
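A small Python simulation of the two update rules above (step size 0.5), assuming λ = 1 and writing the L2 penalty as (λ/2)·w² so its gradient is λ·w; it shows why L1 reaches exactly zero while L2 only decays toward it.

```python
lr, lam = 0.5, 1.0

# L1 penalty λ|w|: the gradient is a constant ±λ, so the weight hits exactly 0.
w1, steps1 = 5.0, 0
while w1 > 0:
    w1 = max(w1 - lr * lam, 0.0)   # sign(w1) = +1 here
    steps1 += 1

# L2 penalty (λ/2)·w²: the gradient λ·w shrinks with w, so the weight only decays toward 0.
w2, steps2 = 5.0, 0
while w2 > 1e-6:
    w2 = w2 - lr * lam * w2
    steps2 += 1

print(f"L1: reached 0 in {steps1} steps")          # 10 steps, matching the slide
print(f"L2: still {w2:.2e} after {steps2} steps")  # small but never exactly zero
```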
L1/L2 Regularization
• Fun Fact:
• What does “L” in L1/L2 stand for?
Batch Norm
• Original Paper Title:
• Batch Normalization: Accelerating Deep Network Training by Reducing
Internal Covariate Shift [6]
• Internal Covariate Shift:
• The change in the distribution of network activations due to the change in
network parameters during training.
Internal Covariate Shift (More)
• The distribution of each layer’s inputs changes during training as the
parameters of the previous layers change.
• The layers need to continuously adapt to the new distribution!
• Problems:
• Slower training
• Hard to use large learning rates
Batch Norm Algorithm
• Batch Norm tries to fix the means and variances of each layer’s inputs
• Reduces Internal Covariate Shift
• Statistics are computed over the batch axis (see the sketch below)
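A minimal NumPy sketch of the training-time forward pass (statistics over the batch axis, then a learnable scale γ and shift β); at test time, running averages of the batch statistics are used instead. Shapes and values are illustrative.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """x: (batch, features). Normalize over the batch axis, then scale and shift."""
    mu = x.mean(axis=0)                      # per-feature mean over the batch
    var = x.var(axis=0)                      # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)    # zero mean, unit variance
    return gamma * x_hat + beta              # learnable scale and shift

x = np.random.randn(32, 4) * 3 + 7           # a mini-batch with shifted statistics
out = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))   # ≈ 0 and ≈ 1
```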
Batch Norm Regularization Effect
• Each hidden unit is multiplied by a random value (it depends on the other samples in the mini-batch) at each training step
→ Adds noise to the training process
→ Forces layers to learn to be robust to a lot of variation in their inputs
→ A form of data augmentation
Batch Norm Recap
• Pros
• Networks train faster
• Allow higher learning rates
• Make weights easier to initialize
• Make more activation functions viable
• Regularization by forcing layers to be more robust to noise (may replace Dropout)
• Cons
• Not good for online learning
• Not well suited to RNNs/LSTMs
• Different calculations at train time and test time
• Related techniques
• Layer norm
• Weight norm
Dropout
• How it works
• Randomly selected neurons are ignored during each training step (see the Keras sketch below).
• Dropped neurons have no effect on the following layers.
• Dropped neurons are not updated in the backward pass.
• Questions:
• What is the idea behind it?
• Why does dropout help reduce overfitting?
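A minimal Keras sketch of how dropout is usually inserted between layers; the rate of 0.5 and the layer sizes are illustrative. The layer drops units only while training.

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Dense(256, activation="relu", input_shape=(784,)),
    layers.Dropout(0.5),     # each unit is dropped with probability 0.5 at train time
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),     # at inference time no units are dropped
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```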
Ensemble Models - Bagging
• How it works
• Train multiple models on different subsets of data
• Combine those models into a final model
• Characteristics
• Each sub-model is trained separately
• Each sub-model typically overfits
• The combination of those overfit models produces a less overfit model overall (see the sketch below)
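A compact scikit-learn sketch of bagging (not from the slides): one unpruned decision tree overfits, while averaging many trees trained on bootstrap subsets usually generalizes better. The dataset and number of estimators are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A single unpruned tree: low bias, high variance (overfits the training subset).
single = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Bagging: 50 trees (the default base estimator), each on a bootstrap subset, averaged.
bag = BaggingClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

print("single tree:", single.score(X_te, y_te))
print("bagged trees:", bag.score(X_te, y_te))
```

Each bagged tree sees a different bootstrap sample, so their individual errors partly cancel when the predictions are averaged.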
Ensemble Models
• Averaging multiple models to create a final model with low variance
Dropout - Ensemble Models for DNN
• Can we apply bagging to neural networks?
• It’s computationally prohibitive
• Dropout aims to solve this problem by providing a method to
combine multiple models at a practical computational cost.
Dropout
• Removing units from the base model effectively creates a subnetwork.
• All those subnetworks are trained implicitly together, with all
parameters shared (different from bagging)
• At prediction time, all learned units are active, which approximately averages all
trained subnetworks (see the sketch below)
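A NumPy sketch of “inverted dropout”, a common way to implement this averaging: surviving activations are scaled by 1/keep_prob during training, so at prediction time all units can stay active without any rescaling. The drop probability is illustrative.

```python
import numpy as np

def dropout(x, drop_prob=0.5, training=True):
    if not training:
        return x                                    # predict mode: all units active, no rescaling
    keep_prob = 1.0 - drop_prob
    mask = np.random.rand(*x.shape) < keep_prob     # randomly select the surviving units
    return x * mask / keep_prob                     # scale up so the expected activation is unchanged

h = np.ones((1, 8))
print(dropout(h, training=True))    # roughly half the units zeroed, the rest scaled to 2.0
print(dropout(h, training=False))   # all ones: the implicit average over subnetworks
```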
Dropout – Regularization Effect
• Each hidden unit is multiplied by a random value (zero or a scaling factor) at each training step
→ Adds noise to the training process
→ Similar to Batch Norm
Regularization Summary
• Two types of regularization
• Model optimization: Reduce the model complexity
• Data augmentation: Increase the size of training data
• Categorize techniques we have learned
• Model optimization: ?
• Data augmentation: ?
Demo Batch Norm, Dropout
Notes
• MNIST Dataset
• To create an overfit scenario:
• Reduce the dataset size (60K → 1K)
• Create a complex (but not particularly good) model
• Techniques to try
• Early stopping
• Dropout
• Batch Norm
• Link:
• https://drive.google.com/drive/u/0/folders/14A6n8bdrJHmgUcaopv66g8p0y0ot6hSr
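A minimal sketch of the demo setup described above (the actual notebook is at the link): MNIST cut down to 1K training samples and an oversized dense model, with BatchNormalization and Dropout layers that can be toggled to compare runs. Layer sizes and rates are illustrative; the test split is used as validation data for simplicity.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Shrink MNIST from 60K to 1K training samples to provoke overfitting.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, y_train = x_train[:1000] / 255.0, y_train[:1000]
x_test = x_test / 255.0

model = tf.keras.Sequential([
    layers.Flatten(input_shape=(28, 28)),
    layers.Dense(512, activation="relu"),
    layers.BatchNormalization(),     # toggle on/off to compare runs
    layers.Dropout(0.5),             # toggle on/off to compare runs
    layers.Dense(512, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
model.fit(x_train, y_train, epochs=50, validation_data=(x_test, y_test),
          callbacks=[early_stop])
```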
Key Takeaways
• Keywords: Overfit, Underfit, Bias, Variance
• Regularization Techniques: Dropout, Batch-Norm, Early Stopping
References
• [1] https://medium.com/greyatom/what-is-underfitting-and-overfitting-in-machine-learning-and-how-to-deal-with-it-6803a989c76
• [2] Pattern Recognition and Machine Learning, C. M. Bishop
• [3] https://stats.stackexchange.com/questions/45643/why-l1-norm-for-sparse-models
• [4] Deep Learning, Goodfellow et al.
• [5] https://medium.com/datadriveninvestor/l1-l2-regularization-7f1b4fe948f2
• [6] Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, Sergey Ioffe et al.
• [7] https://towardsdatascience.com/batch-normalization-8a2e585775c9
• [8] Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Srivastava et al.
• [9] https://machinelearningmastery.com/train-neural-networks-with-noise-to-reduce-overfitting/
• [10] Popular Ensemble Methods: An Empirical Study, Opitz et al.
Editor's Notes
1. There are two common sources of variance in a final model: the noise in the training data, and the use of randomness in the machine learning algorithm.
2. Lebesgue: the “L” in L1/L2 refers to the Lp (Lebesgue) norm.