CS 230 – Deep Learning https://stanford.edu/~shervine
VIP Cheatsheet: Tips and Tricks
Afshine Amidi and Shervine Amidi
November 26, 2018
Data processing
❒ Data augmentation – Deep learning models usually need a lot of data to be properly trained. It is often useful to generate more data from the existing examples using data augmentation techniques. The main ones are summed up below; given an input image, here are the techniques that we can apply (a minimal pipeline sketch follows the list):
- Original: the image without any modification.
- Flip: flipped with respect to an axis for which the meaning of the image is preserved.
- Rotation: rotation by a slight angle; simulates incorrect horizon calibration.
- Random crop: random focus on one part of the image; several random crops can be done in a row.
- Color shift: the nuances of RGB are slightly changed; captures noise that can occur with light exposure.
- Noise addition: addition of noise; gives more tolerance to quality variation of the inputs.
- Information loss: parts of the image are ignored; mimics potential loss of parts of the image.
- Contrast change: luminosity changes; controls differences in exposure due to the time of day.
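As an illustration, here is a minimal augmentation pipeline sketched with torchvision.transforms (an assumption of this example; any augmentation library or hand-rolled NumPy code would do). The parameter values are illustrative, not prescriptive.

```python
from torchvision import transforms

# Randomized pipeline applied independently to each training image.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),        # flip
    transforms.RandomRotation(degrees=10),         # slight rotation
    transforms.RandomResizedCrop(size=224),        # random crop
    transforms.ColorJitter(brightness=0.2,         # contrast change
                           contrast=0.2,
                           saturation=0.2,         # color shift
                           hue=0.02),
    transforms.ToTensor(),
])
# augmented = augment(pil_image)   # pil_image: any PIL.Image training example
```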
❒ Batch normalization – It is a step of hyperparameters γ, β that normalizes the batch {x_i}. By noting μ_B and σ_B^2 the mean and variance of the batch that we want to correct, the normalization is done as follows:

\[ x_i \longleftarrow \gamma \, \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta \]
It is usually done after a fully connected/convolutional layer and before a non-linearity layer and
aims at allowing higher learning rates and reducing the strong dependence on initialization.
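A minimal NumPy sketch of the formula above, assuming inputs of shape (batch, features); ε, γ and β are as defined in the text, with γ and β learned during training.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize the batch x (shape: batch x features) and apply the learned scale/shift."""
    mu = x.mean(axis=0)                      # per-feature batch mean mu_B
    var = x.var(axis=0)                      # per-feature batch variance sigma_B^2
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalized activations
    return gamma * x_hat + beta              # scale by gamma, shift by beta
```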
Training a neural network
❒ Epoch – In the context of training a model, an epoch refers to one iteration during which the model sees the whole training set and updates its weights.
❒ Mini-batch gradient descent – During the training phase, updating weights is usually based neither on the whole training set at once, because of its computational cost, nor on a single data point, because of noise issues. Instead, the update step is done on mini-batches, where the number of data points in a batch is a hyperparameter that we can tune.
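A minimal NumPy sketch of how a training set could be split into mini-batches; one full pass over all the batches corresponds to one epoch. The batch size of 32 is an illustrative default.

```python
import numpy as np

def minibatches(X, y, batch_size=32, shuffle=True):
    """Yield (X_batch, y_batch) pairs; iterating once over this generator is one epoch."""
    idx = np.random.permutation(len(X)) if shuffle else np.arange(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]
```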
❒ Loss function – In order to quantify how a given model performs, the loss function L is
usually used to evaluate to what extent the actual outputs y are correctly predicted by the
model outputs z.
❒ Cross-entropy loss – In the context of binary classification in neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:

\[ L(z,y) = -\Big[ y \log(z) + (1 - y) \log(1 - z) \Big] \]
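A direct NumPy translation of the loss above, averaged over a batch; the clipping constant is an assumption added only to avoid log(0).

```python
import numpy as np

def binary_cross_entropy(z, y, eps=1e-12):
    """L(z, y) = -[y log z + (1 - y) log(1 - z)], averaged over the batch."""
    z = np.clip(z, eps, 1 - eps)   # keep predictions strictly inside (0, 1)
    return -np.mean(y * np.log(z) + (1 - y) * np.log(1 - z))
```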
❒ Backpropagation – Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to each weight w is computed using the chain rule. Using this method, each weight is updated with the rule:

\[ w \longleftarrow w - \alpha \, \frac{\partial L(z,y)}{\partial w} \]
❒ Updating weights – In a neural network, weights are updated as follows (a one-step sketch is given after the list):
• Step 1: Take a batch of training data and perform forward propagation to compute the
loss.
• Step 2: Backpropagate the loss to get the gradient of the loss with respect to each weight.
• Step 3: Use the gradients to update the weights of the network.
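A minimal sketch of one such cycle for a single logistic unit (sigmoid output with the cross-entropy loss above), where the chain rule collapses to dL/dz_pre = z − y; this is an illustrative toy model, not the general backpropagation algorithm.

```python
import numpy as np

def train_step(w, b, X, y, alpha=0.1):
    """One forward/backward/update cycle for a logistic unit."""
    z = 1.0 / (1.0 + np.exp(-(X @ w + b)))    # Step 1: forward pass to get predictions
    dz = (z - y) / len(y)                      # Step 2: gradient of the mean loss w.r.t. pre-activation
    dw, db = X.T @ dz, dz.sum()                # ...propagated to each weight via the chain rule
    return w - alpha * dw, b - alpha * db      # Step 3: gradient descent update
```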
Parameter tuning
❒ Xavier initialization – Instead of initializing the weights in a purely random manner, Xavier initialization chooses initial weights whose scale takes into account characteristics that are unique to the architecture, such as the number of input and output units of each layer.
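A minimal sketch of the uniform variant of Xavier/Glorot initialization, where the scale depends on the layer's fan-in and fan-out; the factor of 6 corresponds to the usual uniform formulation.

```python
import numpy as np

def xavier_init(n_in, n_out):
    """Xavier (Glorot) uniform initialization for a weight matrix of shape (n_in, n_out)."""
    limit = np.sqrt(6.0 / (n_in + n_out))               # scale set by fan-in + fan-out
    return np.random.uniform(-limit, limit, size=(n_in, n_out))
```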
❒ Transfer learning – Training a deep learning model requires a lot of data and, more importantly, a lot of time. It is often useful to take advantage of pre-trained weights on huge datasets that took days or weeks to train, and leverage them for our use case. Depending on how much data we have at hand, here are the different ways to proceed:
Training size | Explanation
Small         | Freezes all layers, trains weights on the softmax
Medium        | Freezes most layers, trains weights on the last layers and the softmax
Large         | Trains weights on all layers and the softmax, initializing the weights with the pre-trained ones
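A minimal PyTorch/torchvision sketch of the "small training size" case (the resnet18 backbone and the 10-class head are illustrative assumptions): all pre-trained layers are frozen and only a new final layer feeding the softmax is trained.

```python
import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=True)          # pre-trained backbone (illustrative choice)
for param in model.parameters():
    param.requires_grad = False                   # freeze all pre-trained layers
model.fc = nn.Linear(model.fc.in_features, 10)    # new head, trained from scratch
# For the "medium" case, one would also unfreeze the last few blocks instead of only replacing the head.
```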
❒ Learning rate – The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. It can be fixed or adaptively changed. The currently most popular method is Adam, which adapts the learning rate during training.
❒ Adaptive learning rates – Letting the learning rate vary when training a model can reduce the training time and improve the quality of the numerical solution. While the Adam optimizer is the most commonly used technique, others can also be useful. They are summed up in the table below:
Method   | Explanation | Update of w | Update of b
Momentum | Dampens oscillations; improvement to SGD; 2 parameters to tune | w ← w − α v_dw | b ← b − α v_db
RMSprop  | Root Mean Square propagation; speeds up the learning algorithm by controlling oscillations | w ← w − α dw / √s_dw | b ← b − α db / √s_db
Adam     | Adaptive Moment estimation; most popular method; 4 parameters to tune | w ← w − α v_dw / (√s_dw + ε) | b ← b − α v_db / (√s_db + ε)
Remark: other methods include Adadelta, Adagrad and SGD.
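A minimal NumPy sketch of the Adam rule from the table, with the usual bias-correction step added; v and s are the running moment estimates (v_dw, s_dw), and t is the step counter starting at 1.

```python
import numpy as np

def adam_update(w, dw, v, s, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for a parameter tensor w with gradient dw."""
    v = beta1 * v + (1 - beta1) * dw            # momentum term v_dw
    s = beta2 * s + (1 - beta2) * dw ** 2       # RMS term s_dw
    v_hat = v / (1 - beta1 ** t)                # bias correction
    s_hat = s / (1 - beta2 ** t)
    w = w - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s
```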
Regularization
❒ Dropout – Dropout is a technique used in neural networks to prevent overfitting the training data by dropping out neurons with probability p > 0. It forces the model to avoid relying too much on particular sets of features.
Remark: most deep learning frameworks parametrize dropout through the ’keep’ parameter 1−p.
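A minimal NumPy sketch of inverted dropout, which rescales by the keep probability 1 − p at training time so that no rescaling is needed at test time.

```python
import numpy as np

def dropout(a, p=0.5, train=True):
    """Zero out activations with probability p during training; identity at test time."""
    if not train or p == 0:
        return a
    mask = (np.random.rand(*a.shape) > p) / (1 - p)   # inverted dropout mask
    return a * mask
```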
❒ Weight regularization – In order to make sure that the weights are not too large and that
the model is not overfitting the training set, regularization techniques are usually performed on
the model weights. The main ones are summed up in the table below:
LASSO | Ridge | Elastic Net
Shrinks coefficients to 0; good for variable selection | Makes coefficients smaller | Tradeoff between variable selection and small coefficients
... + λ||θ||_1  (λ ∈ R) | ... + λ||θ||_2^2  (λ ∈ R) | ... + λ[(1 − α)||θ||_1 + α||θ||_2^2]  (λ ∈ R, α ∈ [0,1])
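A minimal NumPy sketch of the penalty term added to the loss; with the parametrization of the table, α = 0 recovers LASSO and α = 1 recovers Ridge.

```python
import numpy as np

def regularization_penalty(theta, lam=0.01, alpha=0.5):
    """Elastic net penalty lam * [(1 - alpha) * ||theta||_1 + alpha * ||theta||_2^2]."""
    l1 = np.sum(np.abs(theta))       # L1 term
    l2 = np.sum(theta ** 2)          # squared L2 term
    return lam * ((1 - alpha) * l1 + alpha * l2)
```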
❒ Early stopping – This regularization technique stops the training process as soon as the
validation loss reaches a plateau or starts to increase.
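A minimal sketch of one common stopping criterion (the "patience" formulation is an assumption; the text only specifies stopping once the validation loss plateaus or increases).

```python
def should_stop(val_losses, patience=3):
    """True once the validation loss has not improved for `patience` consecutive epochs."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before
```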
Good practices
❒ Overfitting small batch – When debugging a model, it is often useful to make quick tests to see if there is any major issue with the architecture of the model itself. In particular, in order to make sure that the model can be properly trained, a mini-batch is passed through the network to see if the model can overfit on it. If it cannot, it means that the model is either too complex or not complex enough to even overfit on a small batch, let alone a normal-sized training set.
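A minimal PyTorch sketch of this check: repeatedly fit one fixed mini-batch and verify that the loss can be driven close to zero. The optimizer, step count and threshold are illustrative choices.

```python
import torch

def can_overfit_batch(model, loss_fn, x, y, steps=200, lr=1e-3):
    """Try to memorize a single mini-batch; failure hints at a bug or an inadequate model."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return loss.item() < 1e-3
```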
❒ Gradient checking – Gradient checking is a method used during the implementation of
the backward pass of a neural network. It compares the value of the analytical gradient to the
numerical gradient at given points and plays the role of a sanity-check for correctness.
Numerical gradient:
- Formula: df/dx (x) ≈ [f(x + h) − f(x − h)] / (2h)
- Expensive; the loss has to be computed two times per dimension
- Used to verify the correctness of the analytical implementation
- Trade-off in choosing h: not too small (numerical instability) nor too large (poor gradient approximation)

Analytical gradient:
- Formula: df/dx (x) = f'(x)
- 'Exact' result
- Direct computation
- Used in the final implementation
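A minimal NumPy sketch of the centered-difference check above, for a scalar-valued function f of an array x; in practice the two gradients are compared with a relative error such as |g_num − g_ana| / max(|g_num|, |g_ana|).

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Centered-difference estimate of df/dx, one coordinate at a time (expensive)."""
    grad = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        e = np.zeros_like(x, dtype=float)
        e.flat[i] = h                               # perturb a single coordinate
        grad.flat[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad
```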
⋆ ⋆ ⋆