An introductory but precise set of slides on the mathematics of RNN/LSTM algorithms. They should give you a clearer understanding of forward and back propagation in RNNs.
*These slides are not finished yet. If you like them, please give me some feedback to motivate me.
I made these slides as an intern at DATANOMIQ GmbH.
URL: https://www.datanomiq.de/
2. Brief Review on Other Types of Neural Network
Densely Connected Layers: universal classifier / regressor
Convolutional Neural Network: effective for image processing
3. Universal Approximation
It is proved that, with enough units and layers, densely connected layers are capable of universal approximation. Universal approximation means that you can approximate any mapping. In this case, the densely connected layers are a function mapping a D-dim vector to a K-dim vector.
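As a minimal formal sketch of this statement (the symbols f_θ, g, and ε below are my own notation, not from the slides): the network realizes a function f_θ : R^D → R^K, and universal approximation means that for any continuous target mapping g : R^D → R^K on a bounded domain and any tolerance ε > 0, there exist parameters θ such that ||f_θ(x) - g(x)|| < ε for every x in that domain.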
4. An Example of Mapping : “Hello World!” of Machine Learning
[Figure: a handwritten “5” is flattened into a 784-d vector of pixel values, passed through a 16-d hidden vector, and mapped to a 10-d vector of class probabilities (e.g. 3%, 83%, 5%).]
Classifying the MNIST handwritten digit dataset is a mapping from 784-d vectors to 10-d vectors. This is based on the hypothesis that a function exists which maps vectors of 784 pixel values to 10-d vectors whose elements are probabilities. In fact, this neural net is capable of classifying handwritten digits with over 90% accuracy.
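A minimal Keras sketch of such a dense classifier (the layer sizes follow the figure; the activations, optimizer, and number of epochs are my own assumptions, not from the slides):

# A dense MNIST classifier: 784-d input -> 16-d hidden -> 10-d softmax output.
from tensorflow import keras

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0   # flattening the 28x28 images
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

model = keras.Sequential([
    keras.Input(shape=(784,)),
    keras.layers.Dense(16, activation="sigmoid"),
    keras.layers.Dense(10, activation="softmax"),               # 10-d vector of probabilities
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=128, validation_data=(x_test, y_test))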
5. Universal Approximation
[Figure: AlexNet takes (224, 224, 3) input tensors; a feature extraction part produces a 4096-d vector, and a classifying part maps it to a 1000-d vector.]
In the case of image classification with AlexNet, you flatten the last activation maps and get a 4096-d vector. The 4096-d space is more related to the “meaning” of the input images, and we can assume that the last three layers are a mapping from these “meaning,” or “semantic,” vectors to a 1000-d vector.
6. Why RNN?
Question: what is a critical constraint of densely connected layers and convolutional neural networks?
The input size of those networks is fixed, and you cannot properly deal with inputs whose order matters. These constraints are especially problematic when you work on “sequence data.”
7. Sequence Data
Sequence data are sequences of vectors or matrices over several time steps. In some tasks we want to map sequence data to other sequence data, and RNNs are mainly used for this type of task. Assume that the input is a sequence of vectors and the target is another sequence of vectors, indexed by time steps; it is known that an RNN can approximate any mapping from such an input sequence to such an output sequence.
*The “time steps” don’t necessarily refer to real time. They are just the indexes of the elements.
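In symbols (this notation is my own shorthand for concreteness): the network realizes a mapping (x^(1), ..., x^(τ)) ↦ (y^(1), ..., y^(τ)) with x^(t) ∈ R^D and y^(t) ∈ R^K, where the superscript t runs over the time steps.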
8. Sequence Data
Sequence data means data whose element order matters, for example:
Voice/music
Video
Text
9. Sequence Data
In natural language processing, including machine translation, usually a word or a part of a word is encoded as a vector (for example, with a word embedding).
Let’s take text data as an example and look at the three sentences below in German, English, and Japanese. Assume that each sentence is encoded as a sequence of such vectors.
“Ich bin der Einzige in dem Büro, der kein Deutsch sprechen kann.”
“I’m the only one who cannot speak German in the office.”
「そのオフィスでドイツ語が話せないのは私だけです。」
10. Variety of Architectures of RNNs
Confusion in studying RNNs comes from the variety of their architectures. You will probably see mainly these types of charts when you start learning about RNNs.
[Figure: a bidirectional RNN and an Elman RNN.]
11. Architecture of Simple RNN
A simple RNN is basically just a small densely connected network. Within one time step, it propagates forward just like normal densely connected layers. In the forward propagation of an RNN, however, the neurons in the middle layer also propagate to the middle layer itself at the next time step.
12. Architecture of a Simple RNN
It is normal to display every time step of the forward propagation; the inputs and the outputs depend on the time step. When you display an RNN, it is common to show the inner structure as a black box. We share the same parameters at every time step.
13. Architecture of a Simple RNN
This is the idea of “unfolding.” Most RNNs more or less share this same structure.
15. Forward Propagation of Simple RNN
If you unfold the forward propagation of an RNN from the first time step to the last one, it looks like this. You share the same parameters at every time step. Usually this type of chart is simplified in this way.
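As a concrete sketch of one forward step (the weight names W_x, W_h, W_y, the biases b_h, b_y, and the tanh/softmax activations are my own assumptions for concreteness):

h^(t) = tanh( W_x x^(t) + W_h h^(t-1) + b_h )
ŷ^(t) = softmax( W_y h^(t) + b_y )

The same W_x, W_h, W_y, b_h, and b_y are reused at every time step t.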
16. Back Propagation of Simple RNN
There are several ways to calculate gradients with respect to the parameters in order to optimize an RNN. In these slides, we are going to look at how BPTT (back propagation through time) works. In BPTT we use an unfolded RNN chart, and the error at every step propagates back to the first step.
17. Back Propagation of Simple RNN
To be honest, this part is not worth spending a lot of time on, because in the end simple RNNs are not very useful in practice. But if you take a look at how BPTT of simple RNNs works, it will give you a clear understanding of how other, fancier RNNs work. Please be patient over the next few pages if you are interested. I made the equations as concrete as possible so that they are easy to understand.
18. BPTT for Simple RNN : Outline
You get an error at each time step, and the errors propagate back to the first time step. You need to calculate gradients with respect to the parameters at each time step. Assume that you have run an RNN forward from the first step to the last step.
19. BPTT for Simple RNN : Outline
You have to be careful that, in order to calculate all those gradients, you use the same parameters regardless of the time step. In order to calculate a gradient at one time step, you need the errors which derive from later steps. Let’s take one gradient as an example, and then you repeat the same procedure for all the other gradients.
20. BPTT for Simple RNN : Outline
We don’t calculate such things
as .
As I have just mentioned now, you use
the same parameters to calculate
gradients at every time step.
21. BPTT for Simple RNN : Outline
And you update the parameters with the summations of those gradients over the time steps. The gradients depend on the time steps, so, in my opinion, a notation without a time-step index is not a proper notation; it should be denoted with an explicit time step, as below.
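A sketch of that summation with the symbols assumed in the forward-step sketch a few slides back:

∂E/∂W = Σ_{t=1}^{τ} (∂E/∂W)^(t)   for each parameter W ∈ { W_x, W_h, W_y, b_h, b_y },

where (∂E/∂W)^(t) is the gradient contributed through time step t, and a gradient-descent update is W ← W - η ∂E/∂W with a learning rate η.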
22. BPTT for Simple RNN : Brief Review on Chain Rules
Just as in the backprop of normal densely connected layers, you use a lot of chain rules in the backprop of an RNN.
Let z be a function of two variables x and y, and let the variables x and y be functions of s and t; then ∂z/∂s = (∂z/∂x)(∂x/∂s) + (∂z/∂y)(∂y/∂s), and similarly for ∂z/∂t.
Let z be a function of a single variable y, and let y be a function of x; then dz/dx = (dz/dy)(dy/dx).
23. BPTT for Simple RNN : Brief Review on Chain Rules
Let z be a function of n variables x_1, ..., x_n, and let the variables x_1, ..., x_n be functions of m variables t_1, ..., t_m; then ∂z/∂t_j = Σ_{i=1}^{n} (∂z/∂x_i)(∂x_i/∂t_j) for j = 1, ..., m.
24. BPTT for Simple RNN : Brief Review on Chain Rules
This generalized chain rule is super important for back propagation. For simplicity, let’s denote the function as z = z( x_1(t_1, ..., t_m), ..., x_n(t_1, ..., t_m) ). Again, the partial differentiation of z with respect to t_j is ∂z/∂t_j = Σ_{i=1}^{n} (∂z/∂x_i)(∂x_i/∂t_j).
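A tiny worked example of this rule (my own example, not from the slides): take z = x_1 x_2 with x_1 = s + t and x_2 = s t. Then

∂z/∂s = (∂z/∂x_1)(∂x_1/∂s) + (∂z/∂x_2)(∂x_2/∂s) = x_2 · 1 + x_1 · t = st + (s + t)t = 2st + t²,

which matches differentiating z = (s + t)st = s²t + st² directly.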
25. BPTT for Simple RNN : Calculating the Gradient at the Last Time Step
First, let’s calculate the gradient at the last time step. This is the gradient at the last step, so you don’t have to consider any error backpropagated from future steps. Let’s look at each element of it and apply the chain rule; hence we obtain the result below.
26. BPTT for Simple RNN : Calculating the Gradient at the Last Time Step
Therefore calculating this gradient is equal to calculating the error term. Keep in mind that, in that expression, only one term is a function of the variable we differentiate with respect to. Just as well, when you calculate the gradient at the last step, the error backpropagates only from the error function directly above it.
27. BPTT for Simple RNN : Calculating the Gradients at Earlier Time Steps
To calculate the gradients at the earlier time steps, you need to consider the errors backpropagated via the recurrent connections. You don’t need to consider recurrent connections to calculate the gradient at the last step: just as before, you get it directly. The chart distinguishes the case with recurrent connections from the case with NO recurrent connections.
28. BPTT for Simple RNN : Calculating the Gradients at Earlier Time Steps
Next, we use chain rules over several steps to calculate the error backpropagated to the middle layer.
29. BPTT for Simple RNN : Calculating the Gradients at Earlier Time Steps
We got an equation on the former slide. You don’t need to consider recurrent connections to calculate the error at the last step; just as before, you get it directly. Combining this with the equation from the former slide, we hence obtain the backpropagated error at every time step.
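A sketch of what this recursion looks like with the symbols assumed earlier (a standard BPTT recursion written in my own notation, assuming a softmax output with cross-entropy loss):

δ^(t) := ∂E/∂h^(t)
δ^(τ) = W_y^T ( ŷ^(τ) - y^(τ) )                                              (last step: no recurrent term)
δ^(t) = W_y^T ( ŷ^(t) - y^(t) ) + W_h^T [ δ^(t+1) ⊙ ( 1 - h^(t+1) ⊙ h^(t+1) ) ]   for t < τ

Here ⊙ is the elementwise product, and the factor ( 1 - h ⊙ h ) is the derivative of tanh.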
31. BPTT for Simple RNN
From now on especially, you have to be super careful about the notation of gradients. One kind of delta is a gradient with respect to neurons; the other is a gradient with respect to parameters. We don’t calculate separate, time-step-specific parameters; in other words, we share the same parameters at every time step. But those parameters receive different errors (gradients) at EVERY time step.
32. BPTT for Simple RNN
But some study materials use a per-time-step notation in RNN backprop (for example, a textbook by MIT, which is also recommended by Stanford University). When you see a gradient with a time-step index t, that means you differentiate the neurons at time step t with respect to the parameters. A time-step index on the parameters themselves might not be a proper notation, because the parameters do not depend on the time step. Therefore the per-step gradients are usually summed up.
33. BPTT for Simple RNN : Calculating the Parameter Gradients
We now start concretely calculating the gradients with respect to each parameter, using the chain rule.
34. BPTT for Simple RNN : Calculating the Parameter Gradients
Using what we calculated on the last slide, we hence obtain the first parameter gradient.
36. BPTT for Simple RNN : Calculating the Parameter Gradients
From the result on the last slide, we hence obtain the next gradient.
37. BPTT for Simple RNN : Calculating the Parameter Gradients
First we calculate an intermediate derivative; then, with the chain rule, we hence obtain the gradient.
38. BPTT for Simple RNN : Calculating the Parameter Gradients
Just as on the former slides, with the chain rule, we hence obtain the next gradient.
39. BPTT for Simple RNN : Calculating the Parameter Gradients
Applying the chain rule in the same way, we hence obtain the remaining gradients.
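A sketch of the resulting parameter gradients with the symbols assumed in the earlier sketches (standard BPTT formulas in my own notation, not the slides’ exact one; the products with h^(t-1)T and x^(t)T are outer products):

∂E/∂W_y = Σ_{t=1}^{τ} ( ŷ^(t) - y^(t) ) h^(t)T
∂E/∂W_h = Σ_{t=1}^{τ} [ δ^(t) ⊙ ( 1 - h^(t) ⊙ h^(t) ) ] h^(t-1)T
∂E/∂W_x = Σ_{t=1}^{τ} [ δ^(t) ⊙ ( 1 - h^(t) ⊙ h^(t) ) ] x^(t)T
∂E/∂b_y = Σ_{t=1}^{τ} ( ŷ^(t) - y^(t) ),   ∂E/∂b_h = Σ_{t=1}^{τ} δ^(t) ⊙ ( 1 - h^(t) ⊙ h^(t) )

with δ^(t) given by the recursion from the earlier sketch and h^(0) the initial hidden state.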
40. Difficulty of Training a “Deep” Neural Network
I mentioned that the simple RNN is not very useful. Training an RNN on a sequence input over several time steps is virtually the same as training densely connected layers with many layers, and that is hard because of the vanishing/exploding gradient problem.
41. Vanishing / Exploding Gradient Problem
This vanishing/exploding problem is not only a problem of RNNs. In fact, if you simply stack more than three layers of a plain neural network, it becomes very hard to train.
42. Vanishing / Exploding Gradient Problem
Forward propagation is a repetition of taking a linear summation and “squashing” the sum into a certain range: linear summation, squashing, linear summation, squashing, and so on.
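A sketch of why this repetition causes trouble (a standard argument written with the symbols assumed earlier): backpropagating an error from time step τ to time step t multiplies it by one Jacobian per step,

∂h^(τ)/∂h^(t) = Π_{k=t+1}^{τ} ∂h^(k)/∂h^(k-1) = Π_{k=t+1}^{τ} diag( 1 - h^(k) ⊙ h^(k) ) W_h,

so if these factors are typically smaller than 1 in magnitude, the product shrinks exponentially (a vanishing gradient), and if they are typically larger than 1, it grows exponentially (an exploding gradient).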
43. A Super Brief History of Neural Nets
1940 ~ 1970: The 1st AI Boom
• 1958: Perceptron
• 1968: “2001 : A Space Odyssey” (the advent of the idea of AI)
1st AI Winter
• Hardships in linearly inseparable data problems
• Lack of computation algorithms
• Limitation of hardware
1980 ~ 1990: The 2nd Boom
• 1980: Neocognitron
• 1986: Backpropagation
• 1989: LeNet
2nd AI Winter
• Vanishing / exploding gradient problem
• Limitation of hardware
• Shortage of data sources
• Lack of theories for hyper parameters
2006 ~: The 3rd Boom
• 1995: Commercialization of the Internet
• 2001: Release of Xbox
• Increase of data transmission speed
• 2006: Pretraining with deep belief nets
44. Deep Belief Network and Pretraining
The vanishing/exploding gradient problem had been one of the bottlenecks of deep neural net research. In 2006, a group under Geoffrey Hinton discovered a way to tackle this problem by pretraining neural nets with deep belief networks. This is counted as a breakthrough of deep learning research.
45. Tackling the Vanishing Gradient Problem with LSTM
(The timeline is the same as on the former slide, with one addition.)
1997: LSTM (Long Short-Term Memory)
Using LSTM is one way to deal with the vanishing gradient problem of RNNs. The idea of LSTM was already published in 1997, which, by the way, is the year I was born.
46. For example, Francois Chollet, the developer of Keras, says….
In short, he suggests we should not think too much about the internal structure of LSTM. With Keras, you can implement an LSTM in one or two lines of code. But anyway, these slides are going to explain everything, contrary to his advice. The structure of LSTM is much more complicated than that of a simple RNN, and understanding its backprop can also be pure torture.
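For instance, a minimal Keras sketch of the kind of one- or two-line usage he has in mind (the layer sizes and the sentiment-style task are my own assumptions for illustration, not from the slides):

# A sequence classifier whose whole recurrent part is a single LSTM line.
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(None,), dtype="int32"),                 # a sequence of token ids
    keras.layers.Embedding(input_dim=10000, output_dim=32),    # token ids -> 32-d vectors
    keras.layers.LSTM(64),                                      # the LSTM itself: one line
    keras.layers.Dense(1, activation="sigmoid"),                # e.g. a binary sentiment output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])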
47. LSTM: Architecture
Learning about LSTM can also be confusing because you can find various charts which show the same idea. You can roughly classify the charts into two types. The chart on this slide is closer to the one shown in the original paper.
48. LSTM: Forward Propagation
In conclusion, I don’t recommend trying to understand LSTM with this chart. But many papers use this type of chart, so let’s BRIEFLY take a look at it first. This is one block of LSTM. Just as with the normal RNN, each block gets an input and gives out an output, and the output goes back to the block through recurrent connections.
49. LSTM: Forward Propagation
The black dot at the center is called the cell, and it contains some information. The output is calculated from the value of the cell. Basically an LSTM repeats updating the cell and giving out an output at every time step. To be exact, this type of LSTM is called a “peephole LSTM.” For simplicity, however, we are going to think about an LSTM without peephole connections in the next several slides.
50. LSTM: Forward Propagation
An LSTM block has sections named “gates”: the forget gate, the input gate, and the output gate. Each gate gets the input at the current time step and the output of the last time step. There is no consensus on what to call the last section; let’s call it the “block input” in these slides. In each gate, the linear summation of those inputs and the recurrent connections is activated with a sigmoid function, but the block input is activated with a hyperbolic tangent.
52. LSTM: Forward Propagation
Let’s see how each gate in an LSTM block behaves at time step t. First, keep in mind that the dotted lines show recurrent connections, the flow of information from one time step before, while the solid lines show connections within the current time step. The outputs of the block input and the input gate are multiplied elementwise. The forget gate “forgets” part of the cell state from the last time step by multiplying it elementwise by values between 0 and 1. With these two results you update the cell state. The output gate also restricts the output by an elementwise multiplication.
54. LSTM: Architecture
Instead of the chart in the “Space Odyssey” paper, I recommend using an LSTM chart drawn like an electronic circuit, as on this slide.
56. LSTM : Forward Propagation
If you write down every equation in the former chart, it looks like this.
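A sketch of those forward equations without peephole connections (the symbols follow the conventions of the paper “LSTM: A Search Space Odyssey” cited later in these slides; treat this as my reconstruction rather than the slides’ exact notation):

z^(t) = tanh( W_z x^(t) + R_z y^(t-1) + b_z )        (block input)
i^(t) = σ( W_i x^(t) + R_i y^(t-1) + b_i )           (input gate)
f^(t) = σ( W_f x^(t) + R_f y^(t-1) + b_f )           (forget gate)
c^(t) = z^(t) ⊙ i^(t) + c^(t-1) ⊙ f^(t)              (cell state update)
o^(t) = σ( W_o x^(t) + R_o y^(t-1) + b_o )           (output gate)
y^(t) = tanh( c^(t) ) ⊙ o^(t)                         (block output)

Here x^(t) is the input, y^(t) the block output, c^(t) the cell state, σ the sigmoid function, and ⊙ the elementwise product; the W_* are input weights and the R_* recurrent weights.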
57. BPTT for LSTM
This part is the climax of these slides. It is going to be more complicated than before. However, you only have to be careful about which variables affect which functions, just as in the other types of backprop. The backprop formulas in these slides are based on the paper “LSTM: A Search Space Odyssey.” These slides are going to show concretely how to derive the equations in that paper.
58. BPTT for LSTM
From now on we are going to denote gradients in the way below. Just as in BPTT of a simple RNN, to calculate the gradients at time step t, you need the errors at time steps s ≥ t + 1.
59. BPTT for LSTM: Calculating the Error of the Block Output
At time step t of BPTT, first of all you can calculate the error of the block output as below. *At this point we already know the errors from the later time steps. Calculating the error of the cell state is relatively complicated, so for now let’s skip that part.
60. BPTT for LSTM: Calculating the Gate Errors
First you need to calculate the error of the block output. The relations of the variables are shown in the chart: the flow of purple arrows shows how a variable directly affects other values.
72. BPTT for LSTM: Calculating the Parameter Gradients
In order to calculate a parameter gradient, you first need the error of the corresponding gate (or of the block input). Let’s calculate one case concretely with the chain rule; hence the general formula follows just as well.
73. BPTT for LSTM: Calculating the Parameter Gradients
The same goes for the next set of parameters: first one concrete case with the chain rule, and then the general formula.
74. BPTT for LSTM: Calculating the Parameter Gradients
Again, we first calculate one concrete case with the chain rule, and hence obtain the general formula.
75. BPTT for LSTM: Calculating the Parameter Gradients
Finally, let’s calculate one gradient as a representative; the remaining ones follow just as well.
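A sketch of what those parameter gradients look like in the “Search Space Odyssey” notation, without peepholes (my reconstruction; ⊗ denotes the outer product):

δW_* = Σ_{t=0}^{T} δ*^(t) ⊗ x^(t)
δR_* = Σ_{t=0}^{T-1} δ*^(t+1) ⊗ y^(t)
δb_* = Σ_{t=0}^{T} δ*^(t)

for each part * ∈ { z, i, f, o }, where δ*^(t) is the error at the pre-activation value of that gate (or of the block input) at time step t, and x^(t), y^(t) are the block input and block output from the forward-pass sketch.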
76. Variety of Architectures
We have finally looked over how LSTM works, and you should be proud of that. But in practice the architecture of an RNN can be much more complicated. First, you can stack RNN layers on top of each other for a more complicated mapping (a stacked RNN). You can also propagate information backward in time with a bidirectional RNN.
77. Bidirectional Connections
Machine translation is a typical case where you need bidirectional connections.
“Ich bin der Einzige in dem Büro, der kein Deutsch sprechen kann.”
“I am the only one who cannot speak German in the office.”
「そのオフィスでドイツ語が話せないのは私だけです。」
Later time steps in the original language can affect earlier time steps in the target language.
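A minimal Keras sketch combining these two ideas, a stacked encoder whose first layer is bidirectional (the sizes and the final per-time-step output layer are my own assumptions for illustration):

# Two stacked recurrent layers; the first one reads the sequence in both directions.
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(None, 128)),                                     # a sequence of 128-d vectors
    keras.layers.Bidirectional(keras.layers.LSTM(64, return_sequences=True)),
    keras.layers.LSTM(64, return_sequences=True),                       # the stacked second layer
    keras.layers.TimeDistributed(keras.layers.Dense(10, activation="softmax")),
])
model.summary()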
78. Further Problems
The RNN models we have seen so far have a fundamental problem. If you naively use the RNN we have seen, it basically gets one input and gives out one output at every time step. Or it gets an input at every time step and gives out one output at the final time step. Question: what is the problem with this kind of model?
79. Sequence to Sequence Model
It is easy to imagine that this simple model of RNN is not useful for machine translation: usually the input sequence and the output sequence have different lengths.
80. Architecture of Vanilla RNN
This model is not appropriate for problems like machine translation or voice recognition. Let’s briefly take a look at how we can apply RNNs to such tasks.
81. Sequence to Sequence Model: Machine Translation
[Figure: an encoder RNN reads the source sentence from the earlier example token by token; a decoder RNN emits the translated sentence token by token, starting from [BOS] and ending with [EOS].]
First you put the input (the original sentence) into the encoder. At the last time step of the encoder, you get a hidden state. You initialize the RNN of the decoder part with that hidden state and give [BOS] as the first input. You then use each output as the input of the next time step, and you repeat this until the decoder RNN gives out [EOS].
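A minimal sketch of this decoding loop in Python (every name here, encoder.encode, decoder.step, and so on, is a hypothetical placeholder, not an actual library API):

# Greedy seq2seq decoding: encode the source, then feed each predicted token back in.
def translate(encoder, decoder, source_tokens, bos_id, eos_id, max_len=50):
    hidden = encoder.encode(source_tokens)           # hidden state at the encoder's last step
    token = bos_id                                   # decoding starts from [BOS]
    output = []
    for _ in range(max_len):
        probs, hidden = decoder.step(token, hidden)  # one decoder time step
        token = int(probs.argmax())                  # greedy choice of the next word
        if token == eos_id:                          # stop once [EOS] is produced
            break
        output.append(token)
    return output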
82. Sequence to Sequence Model: How to Train
[Figure: the same encoder-decoder chart as before, but the decoder’s inputs are fixed to the correct target words instead of its own outputs.]
When you train this model, you fix the input sequence of the decoder to the correct target sentence (so-called teacher forcing) instead of feeding the decoder its own predictions.
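A minimal sketch of how the decoder’s training data can be arranged under this scheme (the token ids below are made up for illustration):

# Teacher forcing: the decoder's input is the target sentence shifted right by one token.
BOS, EOS = 1, 2
target = [BOS, 17, 5, 42, 9, EOS]      # the correct translated sentence as token ids

decoder_input = target[:-1]            # [BOS, 17, 5, 42, 9]   fed into the decoder
decoder_target = target[1:]            # [17, 5, 42, 9, EOS]   what the decoder should predict

for x_t, y_t in zip(decoder_input, decoder_target):
    # at every time step the decoder sees the correct previous word x_t
    # and is trained to predict the next word y_t
    pass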
83. Universal Approximation
[Figure: AlexNet again, with (224, 224, 3) input tensors, a feature extraction part producing a 4096-d vector, and a classifying part producing a 1000-d vector.]
I’d like you to remember that I mentioned that a CNN can be divided into a “feature extraction part” and a “classifying part.” The 4096-d space is more related to the “meaning” of the input images, and we can assume that the last three layers are a mapping from these “meaning” vectors to a 1000-d vector.
84. Sequence to Sequence Model: Image Captioning
[Figure: the same decoder as before, started from [BOS] and stopped at [EOS], but with the RNN encoder replaced by a CNN.]
Surprisingly, if you use the feature extraction part of a CNN as the encoder and initialize the decoder with the resulting “meaning” vector, the same sequence to sequence idea produces captions for images.
85. Sequence to Sequence Model
However, these are very simplified versions of the sequence to sequence model. In practice you have to consider many more things: for example, stacking RNNs, bidirectional connections, attention, beam search, and building a language model. You might need at least one semester for this topic, and on top of that, natural language processing is a fast-changing, hot field.
86. Sequence to Sequence Model
For example, the RNN model behind Google Translate stacks as many as 8 LSTM layers. On my company laptop, it took me almost one week to train a 2-layer stacked LSTM on much smaller training data.