Optimization Techniques
1. Deep Learning | Machine Learning
Optimization Technique
Rakshith
2. Table of Contents
• Basic mathematics
• Introduction to simple Linear Regression
• Why optimization
• Calculation of gradient descent
• Variations in gradient descent
• Miscellaneous topics
1. Batch normalization
2. Memoization
3. Weight initialization
3. Basic Mathematics
Gradient: an incline or a slope
Tangent: A straight line or plane that touches a curve or curved surface at a point, but if
extended does not cross it at that point.
4. Basics of Gradient Descent …
Trigonometric values (here tan is evaluated with the angle taken in radians):
Angle  tan
0      0
30     -6.405
45     1.619
60     0.320
90     -1.995
Sin = opp/hyp
Cos = adj/hyp
Tan = opp/adj
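As a quick check (an aside, not from the original slides): Python's math.tan also expects radians, so passing the raw angle values reproduces the numbers in the table above, while converting from degrees first gives the familiar values.

import math

# tan() expects radians: passing the raw angle value directly
# reproduces the table above (tan(30) ≈ -6.405, tan(60) ≈ 0.320, ...)
for angle in [0, 30, 45, 60, 90]:
    print(angle, round(math.tan(angle), 3))

# For the usual degree-based value, convert first:
print(round(math.tan(math.radians(45)), 3))  # 1.0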
5. Basics of Gradient Descent …
• Tan becomes 0 when the slope is zero
• Tan = 0 does not always correspond to the global minimum
6. Introduction to simple linear regression:
X    Y
-1   -1
1    2
2    3
4    3
6    5
7    8
∑x = 19   ∑y = 20
Assumption of linear regression: the data is linearly distributed, i.e. the relationship between X and Y is (approximately) linear.
[Scatter plot of the data, titled "Simple Linear Regression", with Y plotted against X]
7. Introduction to simple linear regression:
Equation of straight line is y = Ax + B
X is given; A (the slope) and B (the intercept) need to be found.
A = ((∑x)(∑y) - n(∑xy)) / ((∑x)^2 - n(∑x^2))
B = ((∑x)(∑xy) - (∑y)(∑x^2)) / ((∑x)^2 - n(∑x^2))
X    Y
-1   -1
1    2
2    3
4    3
6    5
7    8
∑x = 19   ∑y = 20
X^2   Y^2   XY
1     1     1
1     4     2
4     9     6
16    9     12
36    25    30
49    64    56
∑x^2 = 107   ∑y^2 = 112   ∑xy = 107
A = 0.932384
B = 0.380783
SSE = 3.4
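A minimal NumPy sketch of this closed-form calculation (using the six data points above; the ∑ row is excluded):

import numpy as np

x = np.array([-1, 1, 2, 4, 6, 7], dtype=float)
y = np.array([-1, 2, 3, 3, 5, 8], dtype=float)
n = len(x)

# Closed-form least-squares slope (A) and intercept (B), as in the formulas above
denom = x.sum() ** 2 - n * (x ** 2).sum()
A = (x.sum() * y.sum() - n * (x * y).sum()) / denom
B = (x.sum() * (x * y).sum() - y.sum() * (x ** 2).sum()) / denom

print(A, B)  # A ≈ 0.932384, B ≈ 0.380783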
8. Why optimization
[Plot of Y vs. X showing three candidate lines:]
• Y = 0.932384*X + 0.380783, SSE = 3.4
• Y = 0.9226*X + 0.3567, SSE = 2.8
• Y = 0.9157*X + 0.2777, SSE = 2.2
The way to find the best-fit line that minimizes the error is an optimization technique; optimization uses gradient descent to find the minimum error.
9. Calculation of gradient descent
Let's consider how to calculate gradient descent.
10. Step 1: To fit a line Y_pred = a + b*X, start off with random values of a and b and calculate the prediction error (SSE).
11. Step 2: Calculate the error gradient w.r.t. the weights:
∂SSE/∂a
∂SSE/∂b
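With SSE defined as (1/2) Σ (Y - Y_pred)^2 (a common convention; if SSE is taken as a plain Σ (Y - Y_pred)^2, the expressions simply gain a factor of 2), these gradients work out to:
∂SSE/∂a = -Σ (Y - Y_pred)
∂SSE/∂b = -Σ (Y - Y_pred) * X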
12. Step 3: Adjust the weights with the gradients to reach the optimal values where SSE is minimized. The update rules are:
1. New a = a - r * ∂SSE/∂a = 0.45 - 0.01*3.300 = 0.42
2. New b = b - r * ∂SSE/∂b = 0.75 - 0.01*1.545 = 0.73
Here, r is the learning rate (0.01), which controls the pace of adjustment to the weights.
13. Step 4: Use the new a and b for prediction and to calculate the new total SSE.
You can see that with the new prediction the total SSE has gone down (from 0.677 to 0.553), which means the prediction accuracy has improved.
Step 5: Repeat steps 3 and 4 until further adjustments to a and b no longer significantly reduce the error. At that point we have arrived at the optimal a and b with the highest prediction accuracy.
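A minimal NumPy sketch of steps 1–5, assuming SSE = (1/2) Σ (Y - Y_pred)^2 and the learning rate r = 0.01 from the slide (the data and the starting values of a and b here are placeholders, not the slide's example):

import numpy as np

x = np.array([-1, 1, 2, 4, 6, 7], dtype=float)
y = np.array([-1, 2, 3, 3, 5, 8], dtype=float)

a, b = 0.45, 0.75   # Step 1: start from (random) initial values
r = 0.01            # learning rate

for step in range(1000):
    y_pred = a + b * x                       # prediction with current a, b
    sse = 0.5 * ((y - y_pred) ** 2).sum()    # current error

    grad_a = -(y - y_pred).sum()             # Step 2: ∂SSE/∂a
    grad_b = -((y - y_pred) * x).sum()       #         ∂SSE/∂b

    a -= r * grad_a                          # Steps 3/4: update and re-predict
    b -= r * grad_b

    if abs(r * grad_a) < 1e-6 and abs(r * grad_b) < 1e-6:
        break                                # Step 5: adjustments no longer help

print(a, b)   # approaches the closed-form A, B found earlier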
14. Extending the idea of Gradient Descent to Neural Networks
Forward Propagation
Initialize Weight (One Time)
Feed data
Compute Y
Compute loss
Backpropagation
Compute Partial differentials
Update weights
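A minimal sketch of that loop for a single sigmoid neuron in NumPy (the toy data and mean-squared-error loss are assumptions for illustration, not part of the slides):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))                  # toy input batch
y = rng.integers(0, 2, size=(8, 1)).astype(float)

W = rng.normal(scale=0.1, size=(3, 1))       # initialize weights (one time)
b = np.zeros(1)
lr = 0.1

for epoch in range(100):
    # Forward propagation: feed data, compute Y, compute loss
    z = X @ W + b
    y_hat = 1.0 / (1.0 + np.exp(-z))         # sigmoid activation
    loss = ((y_hat - y) ** 2).mean()

    # Backpropagation: compute partial differentials via the chain rule
    d_yhat = 2 * (y_hat - y) / len(X)
    d_z = d_yhat * y_hat * (1 - y_hat)       # derivative of the sigmoid
    d_W = X.T @ d_z
    d_b = d_z.sum(axis=0)

    # Update weights
    W -= lr * d_W
    b -= lr * d_b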
18. Batch Normalization
[Diagram: fully connected network, I/P → L1 → L2 → L3 → L4 → L5 → L6 → O/P]
Before we feed data into the network we normalize it to bring the values onto the same scale; this is mean centering / variance scaling / normalization.
In a deep network a small change in the input can cause a large change in the output, because of the many multiplications along the way.
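A minimal sketch of that input normalization (mean centering plus variance scaling) in NumPy, assuming a feature matrix X with one sample per row:

import numpy as np

def normalize(X, eps=1e-8):
    # Mean-center and variance-scale each feature (column) of X
    return (X - X.mean(axis=0)) / (X.std(axis=0) + eps)

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])
X_norm = normalize(X)   # both features are now on the same scale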
20. Why Batch normalization ?
Internal covariate shift
• From L1 to L2 there is not much change, but by the time the data reaches L5 there is a huge shift in its distribution.
• Where to introduce normalization? This is a heuristic.
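One common heuristic is to insert a BatchNormalization layer between the hidden layers, so the same re-normalization also happens inside the network. A minimal Keras sketch (the layer sizes and input shape are placeholders):

from keras.models import Sequential
from keras.layers import Dense, BatchNormalization, Activation

model = Sequential([
    Dense(64, input_shape=(20,)),
    BatchNormalization(),   # re-normalize activations to reduce internal covariate shift
    Activation('relu'),
    Dense(64),
    BatchNormalization(),
    Activation('relu'),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')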
22. Weight initializations
Don'ts
1. Never initialize your weights to zero.
2. Never initialize the same weight across all neurons; this causes the problem of symmetry.
[Diagram: inputs 5, 2 and 4, with every weight set to 3; each of the three neurons computes 3*5 + 3*2 + 3*4 = 15 + 6 + 12 = 33, so all neurons produce the same output.]
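A small NumPy sketch of this symmetry problem: with every weight equal to 3 and inputs 5, 2 and 4, all three neurons compute the identical value 33, whereas random weights give each neuron a different output (the 3×3 layer shape is just for illustration):

import numpy as np

x = np.array([5.0, 2.0, 4.0])      # inputs from the diagram

W_same = np.full((3, 3), 3.0)      # every weight = 3 (symmetric initialization)
print(W_same @ x)                  # [33. 33. 33.] -> all neurons are identical

rng = np.random.default_rng(0)
W_rand = rng.normal(scale=0.1, size=(3, 3))
print(W_rand @ x)                  # each neuron now computes something different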
23. 3. Avoid large negative values
• ReLU = dead activations (refer to the activation functions presentation)
• Sigmoid = vanishing gradients
Dos
Random initialization: with random initialization each neuron learns different aspects.
Imagine each neuron as a base model and combine multiple base models, e.g. in a Random Forest each model is built on different attributes, so it sees a lot of variation and learns very well.
24. • Zeros: Initializer that generates tensors initialized to 0.
• Ones: Initializer that generates tensors initialized to 1.
• Constant: Initializer that generates tensors initialized to a constant value.
• RandomNormal: Initializer that generates tensors with a normal distribution.
• RandomUniform: Initializer that generates tensors with a uniform distribution.
• TruncatedNormal: Initializer that generates a truncated normal distribution.
• VarianceScaling: Initializer capable of adapting its scale to the shape of
weights.
• Orthogonal: Initializer that generates a random orthogonal matrix.
• Identity: Initializer that generates the identity matrix.
• lecun_uniform: LeCun uniform initializer.
• glorot_normal: Glorot normal initializer, also called Xavier normal initializer.
• glorot_uniform: Glorot uniform initializer, also called Xavier uniform initializer.
• he_normal: He normal initializer.
• lecun_normal: LeCun normal initializer.
• he_uniform: He uniform variance scaling initializer.
from keras import initializers
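A minimal sketch of passing one of these initializers to a layer (the layer sizes are placeholders; kernel_initializer and bias_initializer are the standard Keras layer arguments):

from keras import initializers
from keras.models import Sequential
from keras.layers import Dense

model = Sequential([
    Dense(64, input_shape=(20,), activation='relu',
          kernel_initializer=initializers.he_normal(seed=42),   # weights
          bias_initializer=initializers.Zeros()),               # biases
    Dense(1, activation='sigmoid',
          kernel_initializer=initializers.glorot_uniform(seed=42)),
])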
26. • glorot_normal
keras.initializers.glorot_normal(seed=None)
It draws samples from a truncated normal distribution centered on 0 with stddev = sqrt(2 / (fan_in + fan_out))
• glorot_uniform
keras.initializers.glorot_uniform(seed=None)
It draws samples from a uniform distribution within [-limit, limit] where limit is sqrt(6 / (fan_in + fan_out))
• he_uniform
keras.initializers.he_uniform(seed=None)
It draws samples from a uniform distribution within [-limit, limit] where limit is sqrt(6 / fan_in).
• he_normal
keras.initializers.he_normal(seed=None)
It draws samples from a truncated normal distribution centered on 0 with stddev = sqrt(2 / fan_in), where fan_in is the number of input units in the weight tensor.
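A NumPy sketch of those formulas, just to make fan_in / fan_out concrete for a dense weight matrix of shape (fan_in, fan_out); note that Keras draws the *_normal variants from a truncated normal, while plain normals are used here for brevity:

import numpy as np

rng = np.random.default_rng(0)

def glorot_normal(fan_in, fan_out):
    stddev = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, stddev, size=(fan_in, fan_out))

def glorot_uniform(fan_in, fan_out):
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out):
    stddev = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, stddev, size=(fan_in, fan_out))

def he_uniform(fan_in, fan_out):
    limit = np.sqrt(6.0 / fan_in)
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W = he_normal(fan_in=128, fan_out=64)   # e.g. weights for a 128 -> 64 dense layer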