- 1 -
Tips for Training Deep Neural Networks
by
Dr. Vikas Kumar
Department of Data Science and Analytics
Central University of Rajasthan, India
Email: vikas@curaj.ac.in
- 2 -
Outline
 Neural Network Parameters
 Parameters vs Hyperparameters
 How to set network parameters
 Bias / Variance Trade-off
 Regularization Strategies
 Batch normalization
 Vanishing / Exploding gradients
 Gradient Descent
 Mini-batch Gradient Descent
Deep Learning
- 3 -
Neural Network Parameters
[Figure: a fully connected network for handwritten-digit recognition. Each 16 × 16 image gives 256 inputs (ink → 1, no ink → 0); the outputs y1, y2, … score the digit classes, and the class with the maximum value is the prediction.]
Set the network parameters θ = {W1, b1, W2, b2, ⋯, WL, bL} such that:
– Input "1" → y1 has the maximum value
– Input "2" → y2 has the maximum value
How do we let the neural network achieve this?
- 4 -
Parameters vs Hyperparameters
 A model parameter is a variable of the selected
model which can be estimated by fitting the given
data to the model.
 A hyperparameter is a parameter from a prior
distribution; it captures the prior belief before data
is observed.
– These are the parameters that control the model
parameters
– In any machine learning algorithm, these parameters
need to be initialized before training a model.
Deep Learning
Image Source: https://www.slideshare.net/AliceZheng3/evaluating-machine-learning-models-a-beginners-guide
- 5 -
Deep Neural Network: Parameters vs
Hyperparameters
 Parameters:
– 𝑊1, 𝑏1, 𝑊2, 𝑏2, ⋯ 𝑊𝐿, 𝑏𝐿
 Hyperparameters:
– Learning rate 𝜶 in gradient descent
– Number of iterations in gradient descent
– Number of layers in a Neural Network
– Number of neurons per layer in a Neural Network
– Activation functions
– Mini-batch size
– Regularization parameters
Deep Learning
Image Source: https://www.slideshare.net/AliceZheng3/evaluating-machine-learning-models-a-beginners-guide
- 6 -
Train / Dev / Test sets
 Hyperparameter tuning is a highly iterative process, where you
– start with an idea, i.e. start with a certain number of hidden layers,
a certain learning rate, etc.
– try the idea by implementing it
– evaluate how well the idea has worked
– refine the idea and iterate this process
 Now how do we identify whether the idea is working? This is
where the train / dev / test sets come into play.
Deep Learning
[Figure: the data is split into a Training Set, a Dev Set and a Test Set.]
We train the model on the training data. After training the model, we check how well it performs on the dev set. When we have a final model, we evaluate it on the test set in order to get an unbiased estimate of how well our algorithm is doing.
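As a concrete illustration (not part of the original slides), a minimal NumPy sketch of such a split; the function name and the split fractions are assumptions.

```python
import numpy as np

def train_dev_test_split(X, y, dev_frac=0.01, test_frac=0.01, seed=0):
    """Shuffle once, then slice into disjoint dev / test / train index sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_dev = int(len(X) * dev_frac)
    n_test = int(len(X) * test_frac)
    dev, test, train = idx[:n_dev], idx[n_dev:n_dev + n_test], idx[n_dev + n_test:]
    return (X[train], y[train]), (X[dev], y[dev]), (X[test], y[test])
```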
- 7 -
Train / Dev / Test sets
Deep Learning
Previously, when we had small datasets, the distribution across the different sets was most often something like 60% training / 20% dev / 20% test, or 70% training / 20% test.
As the availability of data has increased in recent years, we can use a huge slice of it for training the model, e.g. 98% training / 1% dev / 1% test.
- 8 -
Bias / Variance Trade-off
 Make sure the distribution of dev/test set is
same as training set
– Divide the training, dev and test sets in such a
way that their distribution is similar
– Skip the test set and validate the model using
the dev set only
Deep Learning
Image Source: https://www.analyticsvidhya.com/blog/2018/11/neural-networks-hyperparameter-tuning-regularization-deeplearning/
 We want our model to be just right, which
means having low bias and low variance.
 Overfitting: If the dev set error is much
more than the train set error, the model is
overfitting and has a high variance
 Underfitting: When both train and dev set
errors are high, the model is underfitting
and has a high bias
- 9 -
Overfitting in Deep Neural Nets
 Deep neural networks contain multiple non-linear
hidden layers
– This makes them very expressive models that can learn
very complicated relationships between their inputs and
outputs.
– In other words, the model learns even the tiniest details
present in the data.
 But with limited training data, many of these
complicated relationships will be the result of sampling
noise
– So they will exist in the training set but not in real test
data even if it is drawn from the same distribution.
– So after learning all the possible patterns it can find, the
model tends to perform extremely well on the training set
but fails to produce good results on the dev and test sets.
Deep Learning
- 10 -
Regularization
 Regularization is:
– “any modification to a learning algorithm to
reduce its generalization error but not its training
error”
– Reduce generalization error even at the expense
of increasing training error
 E.g., Limiting model capacity is a regularization
method
Deep Learning
Source: https://cedar.buffalo.edu/~srihari/CSE574/Chap5/Chap5.5-Regularization.pdf
- 11 -
Regularization Strategies
Deep Learning
- 12 -
Parameter Norm Penalties
 The most traditional form of regularization applicable
to deep learning is the concept of parameter norm
penalties.
 This approach limits the capacity of the model by
adding the penalty Ω(θ) to the objective function,
resulting in:
min_θ J(θ) = ℓ(θ) + λΩ(θ)
 λ ∈ [0, ∞) is a hyperparameter that weights the
relative contribution of the norm penalty to the value
of the objective function.
Deep Learning
- 13 -
L2 Norm Parameter Regularization
 Using the L2 norm, we add a constraint to the original
loss function so that the weights of the network don't
grow too large:
Ω(θ) = ||θ||₂²
 Assuming there are no bias parameters, only weights:
Ω(w) = ||w||₂² = w₁₁² + w₁₂² + ⋯
 By adding the regularization term, we keep the model from
driving the training error all the way to zero, which in turn
reduces the complexity of the model.
Deep Learning
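As an illustration (not from the slides), a minimal NumPy sketch of the penalty and its gradient contribution; the function name is an assumption.

```python
import numpy as np

def l2_penalty(weights, lam):
    """Return lam * sum_j ||W_j||_2^2 and the corresponding gradient terms."""
    penalty = lam * sum(np.sum(W ** 2) for W in weights)
    grads = [2.0 * lam * W for W in weights]   # d/dW of lam * ||W||^2
    return penalty, grads
```

Each gradient term is simply added to the data-loss gradient before the parameter update, which is why L2 regularization is often implemented as "weight decay".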
- 14 -
L1 Norm Parameter Regularization
 L1 norm is another option that can be used to penalize the size
of model parameters.
 L1 regularization on the model parameters w is:
Ω(w) = ||w||₁ = Σᵢ |wᵢ|
 The L2 Norm penalty decays the components of the vector w
that do not contribute much to reducing the objective function.
 On the other hand, the L1 norm penalty provides solutions that
are sparse.
 This sparsity property can be thought of as a feature selection
mechanism.
Deep Learning
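As an illustration of the sparsity effect (an addition beyond the slide), the L1 penalty's proximal (soft-thresholding) step shrinks small weights exactly to zero:

```python
import numpy as np

def l1_soft_threshold(w, lam):
    """Proximal step for the L1 penalty: shrink every weight toward zero by lam."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

w = np.array([0.8, -0.05, 0.02, -1.3])
print(l1_soft_threshold(w, lam=0.1))   # small entries become exactly 0
```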
- 15 -
Early Stopping
 When training models with sufficient representational
capacity to overfit the task, we often observe that the training
error decreases steadily over time, while the error on the
validation set eventually begins to rise again or stays the same
for many iterations; at that point there is no benefit in training
the model further.
 This means we can obtain a model with better validation set
error (and thus, hopefully, better test set error) by returning
to the parameter setting at the point in time with the lowest
validation set error.
Deep Learning
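A minimal early-stopping sketch (not from the slides); train_one_epoch and dev_error are hypothetical helpers supplied by the caller, and the patience value is an assumption.

```python
import copy

def fit_with_early_stopping(model, train_one_epoch, dev_error, patience=10, max_epochs=500):
    """Stop when the dev error has not improved for `patience` consecutive epochs."""
    best_err, best_model, stale_epochs = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        err = dev_error(model)
        if err < best_err:
            best_err, stale_epochs = err, 0
            best_model = copy.deepcopy(model)   # remember the best parameters so far
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break                           # dev error stopped improving
    return best_model
```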
- 16 -
Parameter Tying
 Sometimes, we might not know which region the
parameters should lie in, but we do know that there are
some dependencies between them.
 Parameter Tying refers to explicitly forcing the parameters
of two models to be close to each other, through the norm
penalty.
||𝑾(𝑨) − 𝑾(𝑩)||
 Here, 𝑾(𝑨) refers to the weights of the first model while
𝑾(𝑩) refers to those of the second one.
Deep Learning
- 17 -
Dropout
 Dropout can be viewed as a bagging method applied to neural networks
– Bagging is a method of averaging over several
models to improve generalization
– Training many separate neural networks is impractical,
since it is expensive in time and memory
 Dropout is an inexpensive but powerful method
of regularizing a broad family of models
 Specifically, dropout trains the ensemble
consisting of all sub-networks that can be
formed by removing non-output units from an
underlying base network.
Deep Learning
- 18 -
Dropout - Intuitive Reason
 When people team up, if everyone expects their partner
to do the work, nothing gets done in the end.
 However, if you know your partner may drop out,
you will work harder yourself.
 At test time no one actually drops out, so the whole
team is strong and good results are obtained.
- 19 -
Dropout
Training:
 Each time, before computing the gradients:
 each neuron has a p% chance of being dropped out
- 20 -
Dropout
Training:
 Each time, before computing the gradients:
 each neuron has a p% chance of being dropped out
 Using the new network for training
The structure of the network is
changed.
Thinner!
- 21 -
Dropout
Testing:
 No dropout at test time
 If the dropout rate at training is p%,
multiply all the weights by (1 − p)%
 For example, if the dropout rate is 50% and
a weight is w = 1 after training, set w = 0.5 for testing.
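To make the train/test asymmetry concrete, a minimal NumPy sketch (an addition, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(a, p):
    """Zero each activation with probability p during training."""
    mask = rng.random(a.shape) >= p        # keep a unit with probability (1 - p)
    return a * mask

def dropout_test(a, p):
    """No units are dropped at test time; scale by (1 - p) instead."""
    return a * (1.0 - p)
```

In practice many implementations use "inverted dropout", dividing by (1 − p) during training so that no scaling is needed at test time.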
- 22 -
Why should the weights be multiplied by (1 − p)% (p being the dropout rate) at test time?
[Figure: a unit z = w1x1 + w2x2 with dropout rate 50% on its two inputs x1, x2. The four possible sub-networks compute z = w1x1 + w2x2, z = w2x2, z = w1x1 and z = 0; their average is
z = ½ w1x1 + ½ w2x2,
which is exactly the full network with each weight multiplied by (1 − p) = ½.]
- 23 -
Dropout is a kind of ensemble.
Ensemble
[Figure: the training set is split into subsets (Set 1 … Set 4), and a bunch of networks with different structures (Network 1 … Network 4) is trained, one per subset.]
- 24 -
Dropout is a kind of ensemble.
Ensemble
[Figure: at test time, the same input x is fed to Network 1 … Network 4, producing outputs y1 … y4, which are then averaged.]
- 25 -
Setting up your Optimization Problem
Deep Learning
- 26 -
Normalizing Inputs
 The range of values of raw training data often varies widely
– Example: a "has kids" feature in {0, 1}
– Value of a car: $500 to hundreds of thousands of dollars
 If one of the features has a broad range of values, the
distance will be governed by this particular feature.
– After normalization, each feature contributes approximately
proportionately to the final distance.
 In general, gradient descent converges much faster with
feature scaling than without it.
 Good practice for numerical stability in numerical
calculations, and to avoid ill-conditioning when solving
systems of equations.
Deep Learning
- 27 -
Feature Scaling
[Figure: the training examples x¹, x², x³, …, xʳ, …, xᵐ arranged as columns of a data matrix, one feature per row.]
For each dimension i, compute the mean mᵢ and the standard deviation σᵢ over the m examples, then standardize:
xᵢʳ ← (xᵢʳ − mᵢ) / σᵢ
After scaling, the means of all dimensions are 0 and the variances are all 1.
In general, gradient descent converges much faster with feature scaling than without it.
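A minimal NumPy sketch of this standardization (an addition, not from the slides):

```python
import numpy as np

def standardize(X, eps=1e-8):
    """X has shape (num_examples, num_features); returns zero-mean, unit-variance features."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / (std + eps), mean, std   # keep mean/std to reuse on dev/test data
```

The mean and standard deviation computed on the training set should be reused to scale the dev and test data.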
- 28 -
Internal Covariate Shift
• The first guy tells the second guy, "go water
the plants", the second guy tells the third
guy, "got water in your pants", and so on,
until the last guy hears "kite bang eat face
monkey" or something totally wrong.
• Let's say that the problems are entirely
systemic and due entirely to faulty red cups.
Then the situation is analogous to forward
propagation.
• If we have to fix the problem by getting new
cups through trial and error, it would help to
have a consistent way of passing messages in a
more controlled and standardized ("normalized")
way, e.g. same volume, same language, etc.
Deep Learning
“First layer parameters change and
so the distribution of the input to
your second layer changes”
- 29 -
Batch
[Figure: a batch of inputs x¹, x², x³ is processed in parallel: each is multiplied by the same W¹ to give z¹, z², z³, passed through the sigmoid to give a¹, a², a³, then multiplied by W², and so on. Stacking the batch as columns, each layer computes a single matrix product, e.g. [z¹ z² z³] = W¹ [x¹ x² x³].]
- 30 -
Batch normalization
[Figure: a batch x¹, x², x³ is multiplied by W¹ to give z¹, z², z³, from which the batch statistics are computed:]
μ = (1/3) Σᵢ₌₁³ zⁱ
σ = √( (1/3) Σᵢ₌₁³ (zⁱ − μ)² )
Note that μ and σ depend on z¹, z², z³.
- 31 -
Batch normalization
[Figure: the batch statistics μ and σ are used to normalize each zⁱ before the sigmoid:]
ẑⁱ = (zⁱ − μ) / (σ + ε)
μ and σ depend on z¹, z², z³; the normalized ẑ¹, ẑ², ẑ³ are then passed through the sigmoid to give a¹, a², a³.
Batch norm happens between computing Z and computing A: the intuition is that, instead of using the un-normalized value Z, you use the normalized value Ẑ.
- 32 -
Batch normalization
 Setting the mean to μ = 0 and σ = 1 works for many
applications, but in an actual implementation we don't
always want the hidden units to have mean 0 and
variance 1.
 So, after computing ẑⁱ = (zⁱ − μ) / (σ + ε), we replace it with
z̃ⁱ = γ ẑⁱ + β
where γ and β are learnable parameters.
 The original zⁱ is the special case of z̃ⁱ = γ ẑⁱ + β at γ = σ + ε and β = μ.
Deep Learning
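As an illustration (not from the slides), a minimal NumPy sketch of the transform above; note the slide writes σ + ε in the denominator, while many implementations use √(σ² + ε).

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-5):
    """Z: (batch_size, num_units); gamma, beta: (num_units,)."""
    mu = Z.mean(axis=0)
    sigma = Z.std(axis=0)
    Z_hat = (Z - mu) / (sigma + eps)   # normalized: mean 0, variance ~1 per unit
    Z_tilde = gamma * Z_hat + beta     # learnable re-scale and re-shift
    return Z_tilde, mu, sigma
```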
- 33 -
Batch normalization at testing time
[Figure: at a layer, x → W¹ → z → ẑ → z̃, with ẑ = (z − μ) / σ and z̃ = γ ⊙ ẑ + β; μ and σ come from the batch, while γ and β are network parameters.]
We do not have a batch at the testing stage.
Ideal solution:
compute μ and σ using the whole training dataset.
Practical solution:
compute a moving average of the μ and σ of the
batches seen during training.
[Figure: the accumulated estimate of μ is updated as training proceeds, e.g. μ after 1, 100 and 300 updates.]
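A sketch of the practical solution (the momentum value 0.9 is an assumption, not from the slides):

```python
def update_running_stats(run_mu, run_sigma, batch_mu, batch_sigma, momentum=0.9):
    """Exponential moving averages of the batch statistics, used in place of mu, sigma at test time."""
    run_mu = momentum * run_mu + (1.0 - momentum) * batch_mu
    run_sigma = momentum * run_sigma + (1.0 - momentum) * batch_sigma
    return run_mu, run_sigma
```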
- 34 -
Why does normalizing the data make the algorithm faster?
 In the case of unnormalized data, the scale of
features will vary, and hence there will be a
variation in the parameters learnt for each
feature. This will make the cost function
asymmetric.
Deep Learning
[Figure: contour plots of the cost J over parameters w and b — elongated, asymmetric contours for unnormalized inputs vs. roughly circular, symmetric contours for normalized inputs.]
 Whereas, in the case of normalized data, the
scale will be the same and the cost function
will also be symmetric.
 This makes it easier for the gradient
descent algorithm to find the global minimum
more quickly. And this, in turn, makes the
algorithm run much faster.
Image Source: https://www.analyticsvidhya.com/blog/2018/11/neural-networks-hyperparameter-tuning-regularization-deeplearning/
- 35 -
Vanishing / Exploding gradients
 When you're training a very deep network,
sometimes the derivatives can get either very, very
big or very, very small, and this makes training difficult.
[Figure: a deep network with inputs x1, x2, weight matrices W¹, W², …, W^{L−1}, W^L and output y.]
 For simplicity, we assume the bias is b = 0 at
every layer and that the activation function is
linear:
Z¹ = W¹x,  Z² = W²Z¹,  …,  Z^{L−1} = W^{L−1}Z^{L−2},  y = W^{L}Z^{L−1}
so that  y = W^{L}W^{L−1}W^{L−2} ⋯ W²W¹x
 Assuming the entries of the weight matrices are
of the form
W^{L−1} = W^{L−2} = ⋯ = W² = W¹ = [ p 0 ; 0 p ]
then  y = W^{L} × [ p 0 ; 0 p ]^{L−1} × x
Source: https://www.coursera.org/learn/deep-neural-network/lecture/C9iQO/vanishing-exploding-gradients
- 36 -
Vanishing / Exploding gradients
 When you're training a very deep network,
sometimes the derivatives can get either very, very
big or very, very small, and this makes training difficult.
[Figure: the same deep linear network as on the previous slide, with
y = W^{L}W^{L−1}W^{L−2} ⋯ W²W¹x = W^{L} × [ p 0 ; 0 p ]^{L−1} × x.]
Source: https://www.coursera.org/learn/deep-neural-network/lecture/C9iQO/vanishing-exploding-gradients
 If p > 1 and the number of layers in the
network is large, the value of y will explode.
 Similarly, if p < 1, the value of y will be very
small. Hence, gradient descent will take
very tiny steps.
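A tiny numeric illustration of the p^{L−1} factor (the layer count L = 50 is an assumed value, not from the slides):

```python
L = 50                       # number of layers (an assumed value)
for p in (1.5, 0.5):
    print(p, p ** (L - 1))   # ~4.3e8 for p = 1.5, ~1.8e-15 for p = 0.5
```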
- 37 -
Solutions: Vanishing / Exploding gradients
 Use a good initialization
– Random Initialization
 The primary reason behind initializing the weights
randomly is to break symmetry.
 We want to make sure that different hidden units
learn different patterns.
 Do not use sigmoid for deep networks
– Problem: saturation
Deep Learning
Image Source: Pattern Recognition and Machine Learning, Bishop
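The slide recommends random initialization to break symmetry; below is a minimal NumPy sketch. The √(2 / fan_in) scale is the common He initialization for ReLU layers and is an addition beyond what the slide states.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(fan_in, fan_out):
    """Random weights break symmetry; std sqrt(2 / fan_in) is the usual 'He' scale for ReLU."""
    W = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))
    b = np.zeros(fan_out)    # biases can start at zero; symmetry is already broken by W
    return W, b
```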
- 38 -
ReLU
 Rectified Linear Unit (ReLU): a = z for z > 0, and a = 0 for z ≤ 0
[Figure: the ReLU activation a(z) compared with the sigmoid σ(z).]
Reasons for using it:
1. Fast to compute
2. Alleviates the vanishing gradient
problem
- 39 -
ReLU
[Figure: a ReLU network with inputs x1, x2 and outputs y1, y2 — units whose input z is negative output 0.]
- 40 -
ReLU
[Figure: removing the units that output 0 leaves a thinner, linear network, whose remaining paths do not have smaller (attenuated) gradients.]
- 41 -
ReLU - variant
Leaky ReLU: a = z for z > 0, a = 0.01z for z ≤ 0
Parametric ReLU: a = z for z > 0, a = αz for z ≤ 0, where α is also learned by gradient descent
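Minimal NumPy sketches of these activations (an addition, not from the slides):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def leaky_relu(z, slope=0.01):
    # Parametric ReLU uses the same formula with `slope` treated as a learnable parameter.
    return np.where(z > 0, z, slope * z)
```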
- 42 -
Deep Learning
Optimization Algorithms
- 43 -
Gradient Descent
[Figure: the error surface over two parameters w1 and w2; the colors represent the value of C, and θ* marks the minimum.]
Assume there are only two parameters w1 and w2 in the network, θ = (w1, w2).
 Randomly pick a starting point θ⁰
 Compute the negative gradient at θ⁰: −∇C(θ⁰), where ∇C(θ⁰) = [ ∂C(θ⁰)/∂w1 , ∂C(θ⁰)/∂w2 ]
 Multiply by the learning rate η and move by −η∇C(θ⁰)
- 44 -
Gradient Descent
[Figure: repeating the update on the error surface.]
 Randomly pick a starting point θ⁰
 Compute the negative gradient, multiply by the learning rate η and step: θ¹ = θ⁰ − η∇C(θ⁰), θ² = θ¹ − η∇C(θ¹), …
 Eventually, we reach a minimum.
- 45 -
Gradient Descent
 Gradient descent
– Pros
 Guaranteed to converge to the global minimum for a convex error surface
 Converges to a local minimum for a non-convex error surface
– Cons
 Very slow
 Intractable for datasets that do not fit in memory
[Figure: a non-convex error surface C over w1 and w2 — different initial points θ⁰ reach different minima, and hence give different results.]
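A minimal sketch of the update rule θ^{t+1} = θ^t − η∇C(θ^t) (not from the slides; the example cost function is an assumption):

```python
import numpy as np

def gradient_descent(theta0, grad_C, eta=0.1, num_steps=100):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(num_steps):
        theta = theta - eta * grad_C(theta)   # theta_{t+1} = theta_t - eta * grad C(theta_t)
    return theta

# Example: C(w1, w2) = w1**2 + 2 * w2**2 has gradient (2*w1, 4*w2).
print(gradient_descent([3.0, -2.0], lambda th: np.array([2 * th[0], 4 * th[1]])))
```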
- 46 -
Gradient Descent: Practical Issues
Deep Learning
- 47 -
Mini-batch
[Figure: examples are grouped into mini-batches — e.g. the first mini-batch contains x¹, x³¹, … with losses L¹, L³¹, …, and the second contains x², x¹⁶, … with losses L², L¹⁶, ….]
 Randomly initialize θ⁰
 Pick the 1st mini-batch: C = L¹ + L³¹ + ⋯, update θ¹ ← θ⁰ − η∇C(θ⁰)
 Pick the 2nd mini-batch: C = L² + L¹⁶ + ⋯, update θ² ← θ¹ − η∇C(θ¹)
 …
C is different each time we update the parameters!
- 48 -
Mini-batch
[Figure: the same mini-batch updates as on the previous slide, with the per-example costs written as C¹, C³¹, … and C², C¹⁶, ….]
 Randomly initialize θ⁰
 Pick the 1st mini-batch: C = C¹ + C³¹ + ⋯, update θ¹ ← θ⁰ − η∇C(θ⁰)
 Pick the 2nd mini-batch: C = C² + C¹⁶ + ⋯, update θ² ← θ¹ − η∇C(θ¹)
 …
 Continue until all mini-batches have been picked: that is one epoch. Repeat the above process.
Mini-batch training is faster, and in practice often better!
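A minimal sketch of one such training loop (not from the slides; the helper grad_C_batch, which returns the gradient of the mini-batch cost, is a hypothetical placeholder):

```python
import numpy as np

def minibatch_sgd(theta0, X, y, grad_C_batch, eta=0.01, batch_size=16, epochs=10, seed=0):
    """One pass over all mini-batches = one epoch; reshuffle the data every epoch."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            theta = theta - eta * grad_C_batch(theta, X[batch], y[batch])
    return theta
```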
- 49 -
How can we choose a mini-batch size?
 If the mini-batch size = m
– It is a batch gradient descent where all the
training examples are used in each iteration. It
takes too much time per iteration.
 If the mini-batch size = 1
– It is called stochastic gradient descent, where
each training example is its own mini-batch.
– Since in every iteration we take just a
single example, the updates can become extremely noisy
and it takes much more time to reach the
minimum.
 If the mini-batch size is between 1 and m
– It is mini-batch gradient descent. The size of the
mini-batch should not be too large or too small.
Deep Learning
Source: https://www.coursera.org/learn/deep-neural-network/lecture/qcogH/mini-batch-gradient-descent
- 50 -
Acknowledgement
 http://wavelab.uwaterloo.ca/wp-content/uploads/2017/04/Lecture_3.pdf
 https://heartbeat.fritz.ai/deep-learning-best-practices-regularization-techniques-
for-better-performance-of-neural-network-94f978a4e518
 https://cedar.buffalo.edu/~srihari/CSE676/7.12%20Dropout.pdf
 http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2017/Lecture/DNN%20tip.pptx
 Accelerating Deep Network Training by Reducing Internal Covariate Shift, Jude W.
Shavlik
 http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/ForDeep.pptx
 Deep Learning Tutorial. Prof. Hung-yi Lee, NTU.
 On Predictive and Generative Deep Neural Architectures, Prof. Swagatam Das,
ISICAL
Deep Learning