Vanishing & Exploding Gradients issue in deep learning
1. Vanishing Gradients – What?
1. "Vanishing" means disappearing. Vanishing gradients means the error gradients become so small that we can barely see any update to the weights (refer to the gradient descent update equation). Hence, convergence is not achieved.
2. Before going further, note what happens when we multiply numbers that lie between 0 and 1: the product is smaller than either of the inputs.
3. Assume the network shown on the next page, with the sigmoid activation used across all layers. The sigmoid squashes its output into the range 0 to 1, and its derivative lies between 0 and 0.25 (for tanh, the derivative lies between 0 and 1). Multiplying any number by these derivatives therefore reduces it in absolute terms, as seen in step 2.
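The claim above — that the sigmoid derivative never exceeds 0.25, so a chain of such factors shrinks toward zero — can be checked with a small sketch (the 10-layer depth here is an illustrative assumption, not from the slides):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z)), maximized at z = 0
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative peaks at z = 0 with value exactly 0.25
print(sigmoid_derivative(0.0))  # 0.25

# Multiplying ten such factors together (as backprop does through ten
# sigmoid layers, in the best case) collapses the gradient toward zero
grad = 1.0
for _ in range(10):
    grad *= sigmoid_derivative(0.0)
print(grad)  # 0.25**10, roughly 9.5e-07
```

Even at the sigmoid's most favorable point, ten layers leave less than a millionth of the original gradient signal — away from z = 0 the shrinkage is worse.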
4. Vanishing Gradients – How to Avoid?
5. Vanishing Gradients – How to Avoid?
2. Reason → The first problem we discussed was the use of activations whose derivatives are small. The second problem is low values of the initialized weights. We can see this from the simple network on the previous page: the equation for the error gradient w.r.t. w1 includes the value of w5 as well. Hence, if w5 is initialized to a very small value, it also plays a role in making the gradient w.r.t. w1 smaller, i.e., a vanishing gradient. Vanishing gradients are therefore more prominent in deep networks, because the number of multiplicative terms needed to compute the gradient of the initial layers grows with depth.
Resolution → As the equations show, both the activation-function derivatives and the weights appear in the computation of the error gradient, so both play a role in causing vanishing gradients. We need to initialize the weights properly to avoid the problem; this is discussed further in the weight initialization strategy section.
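The multiplicative role of the weights can be sketched numerically. This is a simplified model with hypothetical values (5 layers, each weight 0.1), not the network from the slides:

```python
# The error gradient for the first layer's weight is a product of one
# (weight x activation-derivative) factor per downstream layer, so small
# initial weights shrink it multiplicatively.
layers = 5
weights = [0.1] * layers   # small initial weights, e.g. w5 = 0.1
act_deriv = 0.25           # sigmoid derivative at its maximum (z = 0)

grad = 1.0
for w in weights:
    grad *= w * act_deriv  # each layer contributes one multiplicative term
print(grad)  # (0.1 * 0.25)**5, roughly 9.8e-09 -- effectively vanished
```

With larger (properly initialized) weights each factor is closer to 1, which is exactly why weight initialization matters for keeping early-layer gradients alive.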

Editor's Notes

• Why is BN not applied in batch or stochastic mode?
When using ReLU, you can encounter the dying ReLU problem; then use leaky ReLU with the He initialization strategy – cover in the activation function video
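The note's remedy — leaky ReLU plus He initialization — can be sketched as follows (the slope 0.01, fan-in 256, and function names are illustrative assumptions):

```python
import random

def leaky_relu(z, alpha=0.01):
    # Unlike plain ReLU, a small slope alpha is kept for z < 0, so the
    # unit never outputs a constant 0 and its gradient never fully dies
    return z if z > 0 else alpha * z

def he_init(fan_in):
    # He initialization: draw weights from N(0, sqrt(2 / fan_in)), which
    # keeps activation variance roughly stable in ReLU-family networks
    std = (2.0 / fan_in) ** 0.5
    return [random.gauss(0.0, std) for _ in range(fan_in)]

random.seed(42)
w = he_init(256)
print(leaky_relu(-3.0))  # -0.03: a small signal still flows for negative inputs
print(leaky_relu(2.0))   # 2.0: positive inputs pass through unchanged
```

A dead ReLU outputs 0 (and gradient 0) for every input; the leaky variant guarantees a nonzero gradient everywhere, and He initialization makes large negative pre-activations less likely in the first place.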