Optimization Techniques
1. Deep Learning | Machine Learning
Optimization Technique
Rakshith
2. Table of Contents
• Basic mathematics
• Introduction to simple Linear Regression
• Why optimization
• Calculation of gradient descent
• Variations in gradient descent
• Miscellaneous topics
1. Batch normalization
2. Memoization
3. Weight initialization
3. Basic Mathematics
Gradient: an incline or a slope
Tangent: A straight line or plane that touches a curve or curved surface at a point, but if
extended does not cross it at that point.
4. Basics of Gradient Descent …
Trigonometric values (here tan is evaluated with the angle taken in radians):
Angle  tan
0      0
30     -6.405
45     1.619
60     0.320
90     -1.995
Sin = opp/hyp
Cos = adj/hyp
Tan = opp/adj
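As a quick check (an aside, not from the original slides): Python's math.tan also expects radians, so passing the raw angle values reproduces the numbers in the table above, while converting from degrees first gives the familiar values.

import math

# tan() expects radians: passing the raw angle value directly
# reproduces the table above (tan(30) ≈ -6.405, tan(60) ≈ 0.320, ...)
for angle in [0, 30, 45, 60, 90]:
    print(angle, round(math.tan(angle), 3))

# For the usual degree-based value, convert first:
print(round(math.tan(math.radians(45)), 3))  # 1.0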
5. Basics of Gradient Descent …
• Tan becomes 0 when the slope is zero
• Tan = 0 does not always correspond to the global minimum
6. Introduction to simple linear regression:
X    Y
-1   -1
1    2
2    3
4    3
6    5
7    8
∑x = 19   ∑y = 20
Assumption of linear regression: the data is linearly distributed, i.e. the relationship between X and Y is (approximately) linear.
[Scatter plot of the data, titled "Simple Linear Regression", with Y plotted against X]
7. Introduction to simple linear regression:
Equation of straight line is y = Ax + B
X is given; A (the slope) and B (the intercept) need to be found.
A = ((∑x)(∑y) - n(∑xy)) / ((∑x)^2 - n(∑x^2))
B = ((∑x)(∑xy) - (∑y)(∑x^2)) / ((∑x)^2 - n(∑x^2))
X    Y
-1   -1
1    2
2    3
4    3
6    5
7    8
∑x = 19   ∑y = 20
X^2   Y^2   XY
1     1     1
1     4     2
4     9     6
16    9     12
36    25    30
49    64    56
∑x^2 = 107   ∑y^2 = 112   ∑xy = 107
A = 0.932384
B = 0.380783
SSE = 3.4
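A minimal NumPy sketch of this closed-form calculation (using the six data points above; the ∑ row is excluded):

import numpy as np

x = np.array([-1, 1, 2, 4, 6, 7], dtype=float)
y = np.array([-1, 2, 3, 3, 5, 8], dtype=float)
n = len(x)

# Closed-form least-squares slope (A) and intercept (B), as in the formulas above
denom = x.sum() ** 2 - n * (x ** 2).sum()
A = (x.sum() * y.sum() - n * (x * y).sum()) / denom
B = (x.sum() * (x * y).sum() - y.sum() * (x ** 2).sum()) / denom

print(A, B)  # A ≈ 0.932384, B ≈ 0.380783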
8. Why optimization
[Plot of Y vs. X showing three candidate lines:]
• Y = 0.932384*X + 0.380783, SSE = 3.4
• Y = 0.9226*X + 0.3567, SSE = 2.8
• Y = 0.9157*X + 0.2777, SSE = 2.2
The way to find the best-fit line that minimizes the error is an optimization technique; optimization uses gradient descent to find the minimum error.
9. Calculation of gradient descent
Let's consider how to calculate gradient descent.
10. Step 1: To fit a line Y_pred = a + b*X, start off with random values of a and b and calculate the prediction error (SSE).
11. Step 2: Calculate the error gradient w.r.t. the weights:
∂SSE/∂a
∂SSE/∂b
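With SSE defined as (1/2) Σ (Y - Y_pred)^2 (a common convention; if SSE is taken as a plain Σ (Y - Y_pred)^2, the expressions simply gain a factor of 2), these gradients work out to:
∂SSE/∂a = -Σ (Y - Y_pred)
∂SSE/∂b = -Σ (Y - Y_pred) * X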
12. Step 3: Adjust the weights with the gradients to reach the optimal values where SSE is minimized. The update rules are:
1. New a = a - r * ∂SSE/∂a = 0.45 - 0.01*3.300 = 0.42
2. New b = b - r * ∂SSE/∂b = 0.75 - 0.01*1.545 = 0.73
Here, r is the learning rate (0.01), which controls the pace of adjustment to the weights.
13. Step 4: Use the new a and b for prediction and to calculate the new total SSE.
You can see that with the new prediction the total SSE has gone down (from 0.677 to 0.553), which means the prediction accuracy has improved.
Step 5: Repeat steps 3 and 4 until further adjustments to a and b no longer significantly reduce the error. At that point we have arrived at the optimal a and b with the highest prediction accuracy.
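A minimal NumPy sketch of steps 1–5, assuming SSE = (1/2) Σ (Y - Y_pred)^2 and the learning rate r = 0.01 from the slide (the data and the starting values of a and b here are placeholders, not the slide's example):

import numpy as np

x = np.array([-1, 1, 2, 4, 6, 7], dtype=float)
y = np.array([-1, 2, 3, 3, 5, 8], dtype=float)

a, b = 0.45, 0.75   # Step 1: start from (random) initial values
r = 0.01            # learning rate

for step in range(1000):
    y_pred = a + b * x                       # prediction with current a, b
    sse = 0.5 * ((y - y_pred) ** 2).sum()    # current error

    grad_a = -(y - y_pred).sum()             # Step 2: ∂SSE/∂a
    grad_b = -((y - y_pred) * x).sum()       #         ∂SSE/∂b

    a -= r * grad_a                          # Steps 3/4: update and re-predict
    b -= r * grad_b

    if abs(r * grad_a) < 1e-6 and abs(r * grad_b) < 1e-6:
        break                                # Step 5: adjustments no longer help

print(a, b)   # approaches the closed-form A, B found earlier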
14. Extending the idea of Gradient Descent to Neural Networks
Forward Propagation
Initialize Weight (One Time)
Feed data
Compute Y
Compute loss
Backpropagation
Compute Partial differentials
Update weights
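A minimal sketch of that loop for a single sigmoid neuron in NumPy (the toy data and mean-squared-error loss are assumptions for illustration, not part of the slides):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))                  # toy input batch
y = rng.integers(0, 2, size=(8, 1)).astype(float)

W = rng.normal(scale=0.1, size=(3, 1))       # initialize weights (one time)
b = np.zeros(1)
lr = 0.1

for epoch in range(100):
    # Forward propagation: feed data, compute Y, compute loss
    z = X @ W + b
    y_hat = 1.0 / (1.0 + np.exp(-z))         # sigmoid activation
    loss = ((y_hat - y) ** 2).mean()

    # Backpropagation: compute partial differentials via the chain rule
    d_yhat = 2 * (y_hat - y) / len(X)
    d_z = d_yhat * y_hat * (1 - y_hat)       # derivative of the sigmoid
    d_W = X.T @ d_z
    d_b = d_z.sum(axis=0)

    # Update weights
    W -= lr * d_W
    b -= lr * d_b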
18. Batch Normalization
[Diagram: fully connected network, I/P → L1 → L2 → L3 → L4 → L5 → L6 → O/P]
Before we feed data into the network we normalize it to bring the values onto the same scale; this is mean centering / variance scaling / normalization.
In a deep network a small change in the input can cause a large change in the output, because of the many multiplications along the way.
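A minimal sketch of that input normalization (mean centering plus variance scaling) in NumPy, assuming a feature matrix X with one sample per row:

import numpy as np

def normalize(X, eps=1e-8):
    # Mean-center and variance-scale each feature (column) of X
    return (X - X.mean(axis=0)) / (X.std(axis=0) + eps)

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])
X_norm = normalize(X)   # both features are now on the same scale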
20. Why Batch normalization ?
Internal covariate shift
• From L1 to L2 there is not much change, but by the time the data reaches L5 there is a huge shift in its distribution.
• Where to introduce normalization? This is a heuristic.
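One common heuristic is to insert a BatchNormalization layer between the hidden layers, so the same re-normalization also happens inside the network. A minimal Keras sketch (the layer sizes and input shape are placeholders):

from keras.models import Sequential
from keras.layers import Dense, BatchNormalization, Activation

model = Sequential([
    Dense(64, input_shape=(20,)),
    BatchNormalization(),   # re-normalize activations to reduce internal covariate shift
    Activation('relu'),
    Dense(64),
    BatchNormalization(),
    Activation('relu'),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')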
22. Weight initializations
Don'ts
1. Never initialize your weights to zero.
2. Never initialize the same weight across all neurons; this causes the problem of symmetry.
[Diagram: inputs 5, 2 and 4, with every weight set to 3; each of the three neurons computes 3*5 + 3*2 + 3*4 = 15 + 6 + 12 = 33, so all neurons produce the same output.]
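A small NumPy sketch of this symmetry problem: with every weight equal to 3 and inputs 5, 2 and 4, all three neurons compute the identical value 33, whereas random weights give each neuron a different output (the 3×3 layer shape is just for illustration):

import numpy as np

x = np.array([5.0, 2.0, 4.0])      # inputs from the diagram

W_same = np.full((3, 3), 3.0)      # every weight = 3 (symmetric initialization)
print(W_same @ x)                  # [33. 33. 33.] -> all neurons are identical

rng = np.random.default_rng(0)
W_rand = rng.normal(scale=0.1, size=(3, 3))
print(W_rand @ x)                  # each neuron now computes something different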
23. 3. Avoid large negative values
• ReLU = dead activations (refer to the activation functions presentation)
• Sigmoid = vanishing gradients
Dos
Random initialization: with random initialization each neuron learns different aspects.
Imagine each neuron as a base model and combine multiple base models, e.g. in a Random Forest each model is built on different attributes, so it sees a lot of variation and learns very well.
24. • Zeros: Initializer that generates tensors initialized to 0.
• Ones: Initializer that generates tensors initialized to 1.
• Constant: Initializer that generates tensors initialized to a constant value.
• RandomNormal: Initializer that generates tensors with a normal distribution.
• RandomUniform: Initializer that generates tensors with a uniform distribution.
• TruncatedNormal: Initializer that generates a truncated normal distribution.
• VarianceScaling: Initializer capable of adapting its scale to the shape of
weights.
• Orthogonal: Initializer that generates a random orthogonal matrix.
• Identity: Initializer that generates the identity matrix.
• lecun_uniform: LeCun uniform initializer.
• glorot_normal: Glorot normal initializer, also called Xavier normal initializer.
• glorot_uniform: Glorot uniform initializer, also called Xavier uniform initializer.
• he_normal: He normal initializer.
• lecun_normal: LeCun normal initializer.
• he_uniform: He uniform variance scaling initializer.
from keras import initializers
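A minimal sketch of passing one of these initializers to a layer (the layer sizes are placeholders; kernel_initializer and bias_initializer are the standard Keras layer arguments):

from keras import initializers
from keras.models import Sequential
from keras.layers import Dense

model = Sequential([
    Dense(64, input_shape=(20,), activation='relu',
          kernel_initializer=initializers.he_normal(seed=42),   # weights
          bias_initializer=initializers.Zeros()),               # biases
    Dense(1, activation='sigmoid',
          kernel_initializer=initializers.glorot_uniform(seed=42)),
])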
26. • glorot_normal
keras.initializers.glorot_normal(seed=None)
It draws samples from a truncated normal distribution centered on 0 with stddev = sqrt(2 / (fan_in + fan_out))
• glorot_uniform
keras.initializers.glorot_uniform(seed=None)
It draws samples from a uniform distribution within [-limit, limit] where limit is sqrt(6 / (fan_in + fan_out))
• he_uniform
keras.initializers.he_uniform(seed=None)
It draws samples from a uniform distribution within [-limit, limit] where limit is sqrt(6 / fan_in).
• he_normal
keras.initializers.he_normal(seed=None)
It draws samples from a truncated normal distribution centered on 0 with stddev = sqrt(2 / fan_in), where fan_in is the number of input units in the weight tensor.
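A NumPy sketch of those formulas, just to make fan_in / fan_out concrete for a dense weight matrix of shape (fan_in, fan_out); note that Keras draws the *_normal variants from a truncated normal, while plain normals are used here for brevity:

import numpy as np

rng = np.random.default_rng(0)

def glorot_normal(fan_in, fan_out):
    stddev = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, stddev, size=(fan_in, fan_out))

def glorot_uniform(fan_in, fan_out):
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out):
    stddev = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, stddev, size=(fan_in, fan_out))

def he_uniform(fan_in, fan_out):
    limit = np.sqrt(6.0 / fan_in)
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W = he_normal(fan_in=128, fan_out=64)   # e.g. weights for a 128 -> 64 dense layer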