What are activation functions and why do we need them?
Activation functions are functions used in artificial neural networks to capture the
complexities in the data. A neural network without an activation function is just a
linear regression model: however many layers are stacked, the output remains a linear
function of the input. The activation function applies a non-linear transformation to
its input, making the network capable of learning and performing more complex tasks.
We introduce non-linearity in each layer through activation functions.
Let us assume there are 3 hidden layers, 1 input layer and 1 output layer.
W1 - Weight matrix between the input layer and the first hidden layer
W2 - Weight matrix between the first hidden layer and the second hidden layer
W3 - Weight matrix between the second hidden layer and the third hidden layer
W4 - Weight matrix between the third hidden layer and the output layer
In a feedforward neural network, each layer computes its output from the output of the
previous layer. If we stack multiple layers, the output layer can be seen as a nested
function of the input.
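For the network with W1 to W4 described above, the forward pass can be written out as follows. This is only a sketch under assumed notation: x is the input, h_1 to h_3 are the hidden-layer outputs, f is the activation function (taken to be the same at every layer for simplicity), and bias terms are omitted. None of these symbols come from the original text.

```latex
\begin{aligned}
h_1 &= f(W_1 x) \\
h_2 &= f(W_2 h_1) \\
h_3 &= f(W_3 h_2) \\
\hat{y} &= f(W_4 h_3) = f\big(W_4\, f(W_3\, f(W_2\, f(W_1 x)))\big) \\[6pt]
\text{with } f(z) = z:\quad \hat{y} &= W_4 W_3 W_2 W_1 x = W x
\end{aligned}
```

The last line shows why the activation matters: if f is the identity, the four weight matrices multiply into a single matrix W, and the whole network is equivalent to one linear layer.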
What are the ideal qualities of an activation function?
1. Non-linearity:
The activation function introduces non-linearity into the network to capture the
complex relations between the input features and the output variable/class.
2. Continuously differentiable:
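The activation function needs to be differentiable, since neural networks are generally
trained with gradient descent or other gradient-based optimization methods. In
practice, being differentiable almost everywhere is enough: ReLU is not differentiable
at 0, yet it trains well.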
3. Zero centered:
Zero-centered activation functions make sure that the mean activation value is around
0. This is important because convergence is usually faster on normalized data.
I have explained many of the commonly used activation functions below; some are
zero-centered and some are not. When an activation function is not zero-centered, we
usually add normalization layers such as batch normalization to mitigate the issue.
4. Computational expense should be low:
Activation functions are used in every layer of the network and are computed a very
large number of times, so they should be cheap to compute.
5. Should not kill gradients:
Activation functions like sigmoid have a saturation problem: the output barely
changes for large negative and large positive inputs.
The derivative of the sigmoid function becomes very small there, which in turn prevents
the weights in the initial layers from being updated during backpropagation, so the
network doesn’t learn effectively. An ideal activation function should not suffer from
this issue.
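The saturation effect can be checked numerically. The following is a minimal sketch in plain Python, not taken from the original post; the names `sigmoid` and `sigmoid_grad` are illustrative. It ignores the weight terms that also enter the backpropagated product, so it only shows the contribution of the activation derivatives.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid: s(x) * (1 - s(x)), which is at most 0.25."""
    s = sigmoid(x)
    return s * (1.0 - s)


# The gradient peaks at x = 0 and shrinks quickly as |x| grows.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x = {x:5.1f}  sigmoid'(x) = {sigmoid_grad(x):.6f}")

# Backpropagation multiplies one such factor per layer, so even in the
# best case (0.25 at every layer) the product decays geometrically.
best_case_after_10_layers = 0.25 ** 10
print(best_case_after_10_layers)
```

Even with every neuron sitting at the most favourable point x = 0, ten sigmoid layers scale the gradient by less than one millionth, which is why early layers learn slowly.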
Most commonly used activation functions:
In this section we will go over different activation functions.
1. Sigmoid function -
The sigmoid function is defined as:
$\sigma(x) = \frac{1}{1 + e^{-x}}$
The sigmoid function is an activation function with a characteristic “S”-shaped
curve; its domain is all real numbers and its output lies between 0 and 1. An
undesirable property of the sigmoid function is that the activation of the neuron
saturates at 0 or 1 when its input is large negative or large positive, respectively.
It is also not zero-centered, which makes neural network learning more difficult. In
most cases, it is better to use the Tanh activation function instead of the sigmoid
activation function in hidden layers.
2. Tanh function -
Figure: tanh curve
Tanh has one main advantage over the sigmoid function: it is zero-centered, with its
value bounded between -1 and 1. Like sigmoid, it still saturates for large inputs.
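To make the zero-centering difference concrete, here is a small sketch in plain Python that compares the mean output of sigmoid and tanh. The input values are hypothetical and chosen to be symmetric around zero; they are not from the original post.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid (fine for the small inputs used here)."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical pre-activations, symmetric around zero.
inputs = [-3.0, -1.0, 0.0, 1.0, 3.0]

sigmoid_out = [sigmoid(x) for x in inputs]
tanh_out = [math.tanh(x) for x in inputs]

mean_sigmoid = sum(sigmoid_out) / len(sigmoid_out)
mean_tanh = sum(tanh_out) / len(tanh_out)

print(mean_sigmoid)  # close to 0.5: every sigmoid output is positive
print(mean_tanh)     # close to 0.0: tanh outputs are symmetric around zero

# tanh is a rescaled and shifted sigmoid: tanh(x) = 2 * sigmoid(2x) - 1.
identity_holds = all(
    abs(math.tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0)) < 1e-12 for x in inputs
)
print(identity_holds)
```

Because tanh is just a rescaled sigmoid, it inherits the same saturation behaviour; only the output range changes.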
3. ReLU (Rectified Linear Unit) -
Figure: ReLU plot
ReLU is one of the many activation functions that are not zero-centered, but despite
this disadvantage it is still widely used because of its advantages. It is
computationally very inexpensive and does not saturate for positive inputs, which
greatly reduces the vanishing gradient problem. The ReLU function has no upper limit,
so it can suffer from exploding activations. On the other hand, its activation is 0 for
negative inputs, so it completely ignores nodes with negative values. Hence it suffers
from the “dying ReLU” problem.
Dying ReLU problem: During backpropagation, the weights and biases of some neurons
are not updated, because both the activation and its gradient are zero for negative
inputs. This can create dead neurons that never get activated.
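The dying ReLU behaviour can be sketched in a few lines of plain Python. The neuron below, with its weight `w`, bias `b` and inputs, is hypothetical and chosen only so that every pre-activation is negative.

```python
def relu(x: float) -> float:
    """ReLU: pass positive inputs through, clamp negatives to zero."""
    return max(0.0, x)


def relu_grad(x: float) -> float:
    """Gradient of ReLU; the value at x == 0 is taken as 0 by convention here."""
    return 1.0 if x > 0 else 0.0


# Hypothetical single neuron z = w * x + b with a large negative bias.
w, b = 0.5, -10.0
inputs = [0.5, 1.0, 2.0, 4.0]

for x in inputs:
    z = w * x + b
    # dL/dw = upstream_gradient * relu'(z) * x, so relu'(z) == 0 means
    # the weight receives no update, whatever the upstream gradient is.
    print(f"x={x}: z={z}, relu(z)={relu(z)}, relu'(z)={relu_grad(z)}")

# The neuron is "dead" on this data if its gradient is zero for every input.
neuron_is_dead = all(relu_grad(w * x + b) == 0.0 for x in inputs)
print(neuron_is_dead)
```

Once a neuron is in this state for all training inputs, gradient descent alone cannot move it back, since every update to `w` and `b` is zero.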
4. Leaky ReLU -
Leaky ReLU is an activation function based on the ReLU function, with a small
slope for negative values instead of zero.
Leaky ReLU function: $f(x) = x$ for $x > 0$, and $f(x) = \alpha x$ otherwise.
Here, alpha is generally set to 0.01. The non-zero slope solves the “dying ReLU”
problem. Alpha is kept small and is not set near 1, since the function would then be
almost linear.
If alpha is learned during training as a parameter of the network, instead of being
fixed as a hyperparameter, the function becomes PReLU, or Parametric ReLU.
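A minimal plain-Python sketch of Leaky ReLU follows, assuming the usual default of alpha = 0.01. The helper `prelu_alpha_grad` is an illustrative name, included only to show the extra gradient that PReLU needs in order to learn alpha.

```python
def leaky_relu(x: float, alpha: float = 0.01) -> float:
    """Leaky ReLU: identity for x > 0, slope `alpha` for x <= 0."""
    return x if x > 0 else alpha * x


def leaky_relu_grad(x: float, alpha: float = 0.01) -> float:
    """Gradient w.r.t. the input: never exactly zero when alpha > 0."""
    return 1.0 if x > 0 else alpha


def prelu_alpha_grad(x: float) -> float:
    """Gradient of the output w.r.t. alpha, used by PReLU to learn alpha."""
    return 0.0 if x > 0 else x


print(leaky_relu(-5.0))             # small negative output instead of 0
print(leaky_relu_grad(-5.0))        # gradient is alpha, not 0
print(leaky_relu(-5.0, alpha=1.0))  # alpha = 1 degenerates to the identity
print(prelu_alpha_grad(-5.0))       # non-zero only for negative inputs
```

With alpha = 1 the function returns its input unchanged, which illustrates why alpha is kept well below 1: the non-linearity would otherwise disappear.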
5. ReLU6 -
This version of the ReLU function is basically a ReLU function capped at 6 on the
positive side.
Figure: ReLU6 plot (image credit: PyTorch)
This contains the activation for large positive inputs, so it cannot grow without
bound.
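As a sketch in plain Python, ReLU6 is an ordinary ReLU followed by a cap at 6. The function names are illustrative.

```python
def relu6(x: float) -> float:
    """ReLU6: ReLU with the output capped at 6."""
    return min(max(0.0, x), 6.0)


def relu6_grad(x: float) -> float:
    """Gradient is 1 only inside (0, 6); it is 0 on both flat regions."""
    return 1.0 if 0.0 < x < 6.0 else 0.0


for x in (-2.0, 3.0, 6.0, 100.0):
    print(x, relu6(x), relu6_grad(x))
```

Note the trade-off: the cap bounds the activation, but inputs above 6 now fall on a flat region where the gradient is zero, just like negative inputs.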
6. Exponential Linear Units (ELUs) Function -
Exponential Linear Unit is also a variant of ReLU that modifies the negative part of
the function, replacing the zero output with a smooth exponential curve.
This activation function also avoids the dead ReLU problem, but because there is no
constraint on the activations for large positive values, it shares ReLU's problem of
unbounded activations.
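A minimal plain-Python sketch of ELU, assuming the commonly used form with alpha = 1.0 as the default (the original post does not state the formula or a value for alpha):

```python
import math


def elu(x: float, alpha: float = 1.0) -> float:
    """ELU: identity for x > 0, alpha * (exp(x) - 1) for x <= 0."""
    # math.expm1(x) computes exp(x) - 1 accurately for x close to 0.
    return x if x > 0 else alpha * math.expm1(x)


def elu_grad(x: float, alpha: float = 1.0) -> float:
    """Gradient: 1 for x > 0, alpha * exp(x) for x <= 0 (small but non-zero)."""
    return 1.0 if x > 0 else alpha * math.exp(x)


for x in (-5.0, -1.0, 0.0, 2.0, 50.0):
    print(x, elu(x), elu_grad(x))
```

For negative inputs the output approaches -alpha and the gradient stays positive, so neurons do not die outright; for positive inputs the output is unbounded, exactly as with ReLU.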
7. Softmax activation function -
It is often used as the last activation layer of a neural network to normalize the
network's outputs into a probability distribution over the classes, which gives the
probability of the output belonging to each class for the given inputs. It is
popularly used for multi-class classification problems.
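The normalization step can be sketched in plain Python as follows. Softmax maps each raw score z_i to exp(z_i) / sum_j exp(z_j); subtracting the maximum score first is a standard trick to avoid overflow and does not change the result. The logits below are hypothetical values for a 3-class problem.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Softmax with the maximum subtracted first for numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores (logits) for a 3-class problem.
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)

# The predicted class is the index with the highest probability.
predicted_class = max(range(len(probs)), key=probs.__getitem__)

print(probs)
print(sum(probs))
print(predicted_class)
```

The outputs are all positive and sum to 1, so they can be read as class probabilities, and the ordering of the logits is preserved.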
I hope you enjoyed reading this. I have tried to cover many of the activation functions
which are commonly used in Neural Networks.
To know more visit our remaining pages:-
Website:- https://coffeebeans.io/
Blogs:- https://coffeebeans.io/blogs