2. Introduction
The depth of a neural network is crucial to its success. However, training becomes more difficult as depth increases.
The highway network is a new architecture designed to ease gradient-based training of very deep networks.
Highway networks with hundreds of layers can be trained directly using stochastic gradient descent.
They use skip connections modulated by learned gating mechanisms to regulate information flow, inspired by the Long Short-Term Memory (LSTM) recurrent neural network.
Highway networks have been used in text sequence labelling and speech recognition tasks.
3. Gradient Descent
Gradient descent is a commonly used iterative optimization algorithm for training machine learning and deep learning models. It helps find a local minimum of a function.
The main objective of gradient descent is to minimize the cost function iteratively.
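A minimal sketch of the idea in Python (the function, starting point, and learning rate below are illustrative assumptions, not from the slides):

```python
# Gradient descent sketch: minimize f(x) = (x - 3)^2 by repeatedly
# stepping against the gradient. All values here are illustrative.

def f(x):
    return (x - 3) ** 2

def grad_f(x):
    return 2 * (x - 3)  # derivative of f

x = 0.0              # arbitrary starting point
learning_rate = 0.1  # step size
for _ in range(100):
    x -= learning_rate * grad_f(x)

print(x)  # approaches the minimum at x = 3
```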
4. Loss/Cost Function
A function that compares the target and predicted output values; it measures how well the neural network models the training data.
During training, we aim to minimize this loss between the predicted and target outputs.
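For example, mean squared error (MSE) is a common choice of loss; a small Python sketch (the target and predicted values are made up for illustration):

```python
import numpy as np

# Mean squared error: average squared difference between the
# target and predicted outputs. Values are illustrative only.
target = np.array([1.0, 0.0, 1.0])
predicted = np.array([0.9, 0.2, 0.8])

mse = np.mean((target - predicted) ** 2)
print(mse)  # 0.03 -- training adjusts weights to drive this down
```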
8. ➔ SGD is great when we have tons of data and a lot of parameters.
➔ In these situations, regular GD may not be computationally feasible (see the sketch below).
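A rough sketch of why mini-batches help, fitting a one-parameter linear model in Python (the data, batch size, and learning rate are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for illustration: y = 2x plus a little noise.
X = rng.normal(size=1000)
y = 2.0 * X + rng.normal(scale=0.1, size=1000)

w = 0.0              # single weight to learn
learning_rate = 0.1
for _ in range(200):
    # SGD: estimate the gradient from a small random batch
    # instead of the full dataset, so each step stays cheap.
    idx = rng.integers(0, len(X), size=32)
    xb, yb = X[idx], y[idx]
    grad = np.mean(2 * (w * xb - yb) * xb)  # d/dw of batch MSE
    w -= learning_rate * grad

print(w)  # close to the true slope of 2
```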
9. LSTM Recurrent Neural Network
Standard Recurrent Neural Networks (RNNs) suffer from short-term memory due to the vanishing gradient problem that emerges when working with longer data sequences.
Luckily, we have more advanced versions of RNNs that can preserve important information from earlier parts of the sequence and carry it forward.
The two best-known versions are Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU).
10. LSTM vs GRU
➔ GRU has two gates (reset and update), while LSTM has three (input, output, and forget).
➔ GRU is less complex than LSTM because it has fewer gates; GRU is often preferred for small datasets and LSTM for larger ones, as the parameter-count sketch below illustrates.
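The complexity difference shows up directly in parameter counts; a quick check with PyTorch (the layer sizes are arbitrary):

```python
import torch.nn as nn

# Arbitrary illustrative sizes.
lstm = nn.LSTM(input_size=16, hidden_size=32)
gru = nn.GRU(input_size=16, hidden_size=32)

def n_params(module):
    return sum(p.numel() for p in module.parameters())

# LSTM stores 4 weight blocks per layer (3 gates + cell candidate),
# GRU only 3 (2 gates + candidate), so GRU ends up smaller.
print(n_params(lstm))  # 6400
print(n_params(gru))   # 4800
```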
11. Highway network vs plain networks
➔ Highway network optimization is virtually independent of depth, while plain networks suffer significantly as depth increases.
➔ SGD stalls at the beginning of training in plain networks unless a specific weight initialization is used.
12. Model
In addition to the plain transformation y = H(W_H, x), the model has two gates:
The transform gate T(W_T, x)
The carry gate C(W_C, x)
Both gates use a non-linear transfer function (the sigmoid function), while H(W_H, x) can be any desired transfer function.
The layer output combines the transformed and carried signals:
y = H(W_H, x) · T(W_T, x) + x · C(W_C, x)
The carry gate is defined as:
C(W_C, x) = 1 - T(W_T, x)
while the transform gate is simply a gate with a sigmoid transfer function.
14. Structure cont.
Depending on the output of the transform gate, a highway layer can smoothly vary its behavior between that of a plain layer and a layer that simply passes its input through: when T = 1 the output is H(W_H, x), as in a plain layer, and when T = 0 the output is just x.
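A minimal NumPy sketch of a single highway layer under these definitions (the weight shapes, the choice of tanh for H, and the negative transform-gate bias are illustrative assumptions; the slides leave H's transfer function open):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_H, b_H, W_T, b_T):
    """y = H(W_H, x) * T(W_T, x) + x * (1 - T(W_T, x))."""
    H = np.tanh(W_H @ x + b_H)    # plain transformation (any nonlinearity)
    T = sigmoid(W_T @ x + b_T)    # transform gate, values in (0, 1)
    return H * T + x * (1.0 - T)  # carry gate C = 1 - T

# Illustrative usage with random weights; input and output must have
# the same dimension so the carried input x can be added directly.
d = 4
rng = np.random.default_rng(0)
x = rng.normal(size=d)
W_H = rng.normal(size=(d, d))
W_T = rng.normal(size=(d, d))
b_H = np.zeros(d)
b_T = -2.0 * np.ones(d)  # negative bias initially favors carrying x
print(highway_layer(x, W_H, b_H, W_T, b_T))
```

With T near 0 the layer copies its input, and with T near 1 it applies the full transformation H, which is exactly the smooth interpolation described above.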
15. Conclusion
Training very deep networks is difficult without increasing total network size.
Highway networks are a novel neural network architecture that enables the training of extremely deep networks using simple SGD.
Optimization of highway networks is not hampered even as network depth increases to a hundred layers.
16. Conclusion Cont.
The ability to train extremely deep networks opens up the possibility of studying the impact of depth on complex problems without restrictions.
Various activation functions can be used in deep highway networks.