An introductory but precise set of slides on the mathematics of RNN/LSTM algorithms. They should give you a clearer understanding of forward and back propagation in RNNs.
*These slides are not finished yet. If you like them, please give me some feedback to motivate me.
I made these slides as an intern at DATANOMIQ GmbH.
URL: https://www.datanomiq.de/
2. Brief Review on Other Types of Neural Network
Densely Connected Layers: universal classifier / regressor
Convolutional Neural Network: effective for image processing
3. Universal Approximation
It is proved that, with enough units and layers, densely connected layers are capable of universal approximation. Universal approximation means that you can approximate any mapping. In this case, the densely connected layers are a function mapping a D-dim vector to a K-dim vector.
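As a minimal formal sketch of this statement (the symbols f_θ, g, and ε below are my own notation, not from the slides): the network realizes a function f_θ : R^D → R^K, and universal approximation means that for any continuous target mapping g : R^D → R^K on a bounded domain and any tolerance ε > 0, there exist parameters θ such that ||f_θ(x) - g(x)|| < ε for every x in that domain.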
4. An Example of Mapping : “Hello World!” of Machine Learning
[Figure: a handwritten “5” is flattened into a 784-d vector of pixel values, passed through a 16-d hidden vector, and mapped to a 10-d vector of class probabilities (e.g. 3%, 83%, 5%).]
Classifying the MNIST handwritten digit dataset is a mapping from 784-d vectors to 10-d vectors. This is based on the hypothesis that a function exists which maps vectors of 784 pixel values to 10-d vectors whose elements are probabilities. In fact, this neural net is capable of classifying handwritten digits with over 90% accuracy.
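A minimal Keras sketch of such a dense classifier (the layer sizes follow the figure; the activations, optimizer, and number of epochs are my own assumptions, not from the slides):

# A dense MNIST classifier: 784-d input -> 16-d hidden -> 10-d softmax output.
from tensorflow import keras

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0   # flattening the 28x28 images
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

model = keras.Sequential([
    keras.Input(shape=(784,)),
    keras.layers.Dense(16, activation="sigmoid"),
    keras.layers.Dense(10, activation="softmax"),               # 10-d vector of probabilities
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=128, validation_data=(x_test, y_test))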
5. Universal Approximation
[Figure: AlexNet takes (224, 224, 3) input tensors; a feature extraction part produces a 4096-d vector, and a classifying part maps it to a 1000-d vector.]
In the case of image classification with AlexNet, you flatten the last activation maps and get a 4096-d vector. The 4096-d space is more related to the “meaning” of the input images, and we can assume that the last three layers are a mapping from these “meaning,” or “semantic,” vectors to a 1000-d vector.
6. Why RNN?
Question: what is a critical constraint of densely connected layers and convolutional neural networks?
The input size of those networks is fixed, and you cannot properly deal with inputs whose order matters. These constraints are especially problematic when you work on “sequence data.”
7. Sequence Data
Sequence data are sequences of vectors or matrices over several time steps. In some tasks we want to map sequence data to other sequence data, and RNNs are mainly used for this type of task. Assume that the input is a sequence of vectors and the target is another sequence of vectors, indexed by time steps; it is known that an RNN can approximate any mapping from such an input sequence to such an output sequence.
*The “time steps” don’t necessarily refer to real time. They are just the indexes of the elements.
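In symbols (this notation is my own shorthand for concreteness): the network realizes a mapping (x^(1), ..., x^(τ)) ↦ (y^(1), ..., y^(τ)) with x^(t) ∈ R^D and y^(t) ∈ R^K, where the superscript t runs over the time steps.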
8. Sequence Data
Sequence data means data whose element order matters, for example:
Voice/music
Video
Text
9. Sequence Data
In natural language processing, including machine translation, usually a word or a part of a word is encoded as a vector (for example, with a word embedding).
Let’s take text data as an example and look at the three sentences below in German, English, and Japanese. Assume that each sentence is encoded as a sequence of such vectors.
“Ich bin der Einzige in dem Büro, der kein Deutsch sprechen kann.”
“I’m the only one who cannot speak German in the office.”
「そのオフィスでドイツ語が話せないのは私だけです。」
10. Variety of Architectures of RNNs
Confusion in studying RNNs comes from the variety of their architectures. You will probably see mainly these types of charts when you start learning about RNNs.
[Figure: a bidirectional RNN and an Elman RNN.]
11. Architecture of Simple RNN
A simple RNN is basically just a small densely connected network. Within one time step, it propagates forward just like normal densely connected layers. In the forward propagation of an RNN, however, the neurons in the middle layer also propagate to the middle layer itself at the next time step.
12. Architecture of a Simple RNN
It is normal to display every time step of the forward propagation; the inputs and the outputs depend on the time step. When you display an RNN, it is common to show the inner structure as a black box. We share the same parameters at every time step.
13. Architecture of a Simple RNN
This is the idea of “unfolding.” Most RNNs more or less share this same structure.
15. Forward Propagation of Simple RNN
If you unfold the forward propagation of an RNN from the first time step to the last one, it looks like this. You share the same parameters at every time step. Usually this type of chart is simplified in this way.
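As a concrete sketch of one forward step (the weight names W_x, W_h, W_y, the biases b_h, b_y, and the tanh/softmax activations are my own assumptions for concreteness):

h^(t) = tanh( W_x x^(t) + W_h h^(t-1) + b_h )
ŷ^(t) = softmax( W_y h^(t) + b_y )

The same W_x, W_h, W_y, b_h, and b_y are reused at every time step t.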
16. Back Propagation of Simple RNN
There are several ways to calculate gradients with respect to the parameters in order to optimize an RNN. In these slides, we are going to look at how BPTT (back propagation through time) works. In BPTT we use an unfolded RNN chart, and the error at every step propagates back to the first step.
17. Back Propagation of Simple RNN
To be honest, this part is not worth spending a lot of time on, because in the end simple RNNs are not very useful in practice. But if you take a look at how BPTT of simple RNNs works, it will give you a clear understanding of how other, fancier RNNs work. Please be patient over the next few pages if you are interested. I made the equations as concrete as possible so that they are easy to understand.
18. BPTT for Simple RNN : Outline
You get an error at each time step, and the errors propagate back to the first time step. You need to calculate gradients with respect to the parameters at each time step. Assume that you have run an RNN forward from the first step to the last step.
19. BPTT for Simple RNN : Outline
You have to be careful that, in order to calculate all those gradients, you use the same parameters regardless of the time step. In order to calculate a gradient at one time step, you need the errors which derive from later steps. Let’s take one gradient as an example, and then you repeat the same procedure for all the other gradients.
20. BPTT for Simple RNN : Outline
We don’t calculate such things
as .
As I have just mentioned now, you use
the same parameters to calculate
gradients at every time step.
21. BPTT for Simple RNN : Outline
And you update the parameters with the summations of those gradients over the time steps. The gradients depend on the time steps, so, in my opinion, a notation without a time-step index is not a proper notation; it should be denoted with an explicit time step, as below.
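A sketch of that summation with the symbols assumed in the forward-step sketch a few slides back:

∂E/∂W = Σ_{t=1}^{τ} (∂E/∂W)^(t)   for each parameter W ∈ { W_x, W_h, W_y, b_h, b_y },

where (∂E/∂W)^(t) is the gradient contributed through time step t, and a gradient-descent update is W ← W - η ∂E/∂W with a learning rate η.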
22. BPTT for Simple RNN : Brief Review on Chain Rules
Just as in the backprop of normal densely connected layers, you use a lot of chain rules in the backprop of an RNN.
Let z be a function of two variables x and y, and let the variables x and y be functions of s and t; then ∂z/∂s = (∂z/∂x)(∂x/∂s) + (∂z/∂y)(∂y/∂s), and similarly for ∂z/∂t.
Let z be a function of a single variable y, and let y be a function of x; then dz/dx = (dz/dy)(dy/dx).
23. BPTT for Simple RNN : Brief Review on Chain Rules
Let z be a function of n variables x_1, ..., x_n, and let the variables x_1, ..., x_n be functions of m variables t_1, ..., t_m; then ∂z/∂t_j = Σ_{i=1}^{n} (∂z/∂x_i)(∂x_i/∂t_j) for j = 1, ..., m.
24. BPTT for Simple RNN : Brief Review on Chain Rules
This generalized chain rule is super important for back propagation. For simplicity, let’s denote the function as z = z( x_1(t_1, ..., t_m), ..., x_n(t_1, ..., t_m) ). Again, the partial differentiation of z with respect to t_j is ∂z/∂t_j = Σ_{i=1}^{n} (∂z/∂x_i)(∂x_i/∂t_j).
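A tiny worked example of this rule (my own example, not from the slides): take z = x_1 x_2 with x_1 = s + t and x_2 = s t. Then

∂z/∂s = (∂z/∂x_1)(∂x_1/∂s) + (∂z/∂x_2)(∂x_2/∂s) = x_2 · 1 + x_1 · t = st + (s + t)t = 2st + t²,

which matches differentiating z = (s + t)st = s²t + st² directly.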
25. BPTT for Simple RNN : Calculating the Gradient at the Last Time Step
First, let’s calculate the gradient at the last time step. This is the gradient at the last step, so you don’t have to consider any error backpropagated from future steps. Let’s look at each element of it and apply the chain rule; hence we obtain the result below.
26. BPTT for Simple RNN : Calculating the Gradient at the Last Time Step
Therefore calculating this gradient is equal to calculating the error term. Keep in mind that, in that expression, only one term is a function of the variable we differentiate with respect to. Just as well, when you calculate the gradient at the last step, the error backpropagates only from the error function directly above it.
27. BPTT for Simple RNN : Calculating the Gradients at Earlier Time Steps
To calculate the gradients at the earlier time steps, you need to consider the errors backpropagated via the recurrent connections. You don’t need to consider recurrent connections to calculate the gradient at the last step: just as before, you get it directly. The chart distinguishes the case with recurrent connections from the case with NO recurrent connections.
28. BPTT for Simple RNN : Calculating the Gradients at Earlier Time Steps
Next, we use chain rules over several steps to calculate the error backpropagated to the middle layer.
29. BPTT for Simple RNN : Calculating the Gradients at Earlier Time Steps
We got an equation on the former slide. You don’t need to consider recurrent connections to calculate the error at the last step; just as before, you get it directly. Combining this with the equation from the former slide, we hence obtain the backpropagated error at every time step.
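A sketch of what this recursion looks like with the symbols assumed earlier (a standard BPTT recursion written in my own notation, assuming a softmax output with cross-entropy loss):

δ^(t) := ∂E/∂h^(t)
δ^(τ) = W_y^T ( ŷ^(τ) - y^(τ) )                                              (last step: no recurrent term)
δ^(t) = W_y^T ( ŷ^(t) - y^(t) ) + W_h^T [ δ^(t+1) ⊙ ( 1 - h^(t+1) ⊙ h^(t+1) ) ]   for t < τ

Here ⊙ is the elementwise product, and the factor ( 1 - h ⊙ h ) is the derivative of tanh.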
31. BPTT for Simple RNN
From now on especially, you have to be super careful about the notation of gradients. One kind of delta is a gradient with respect to neurons; the other is a gradient with respect to parameters. We don’t calculate separate, time-step-specific parameters; in other words, we share the same parameters at every time step. But those parameters receive different errors (gradients) at EVERY time step.
32. BPTT for Simple RNN
But some study materials use a per-time-step notation in RNN backprop (for example, a textbook by MIT, which is also recommended by Stanford University). When you see a gradient with a time-step index t, that means you differentiate the neurons at time step t with respect to the parameters. A time-step index on the parameters themselves might not be a proper notation, because the parameters do not depend on the time step. Therefore the per-step gradients are usually summed up.
33. BPTT for Simple RNN : Calculating the Parameter Gradients
We now start concretely calculating the gradients with respect to each parameter, using the chain rule.
34. BPTT for Simple RNN : Calculating the Parameter Gradients
Using what we calculated on the last slide, we hence obtain the first parameter gradient.
36. BPTT for Simple RNN : Calculating the Parameter Gradients
From the result on the last slide, we hence obtain the next gradient.
37. BPTT for Simple RNN : Calculating the Parameter Gradients
First we calculate an intermediate derivative; then, with the chain rule, we hence obtain the gradient.
38. BPTT for Simple RNN : Calculating the Parameter Gradients
Just as on the former slides, with the chain rule, we hence obtain the next gradient.
39. BPTT for Simple RNN : Calculating the Parameter Gradients
Applying the chain rule in the same way, we hence obtain the remaining gradients.
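A sketch of the resulting parameter gradients with the symbols assumed in the earlier sketches (standard BPTT formulas in my own notation, not the slides’ exact one; the products with h^(t-1)T and x^(t)T are outer products):

∂E/∂W_y = Σ_{t=1}^{τ} ( ŷ^(t) - y^(t) ) h^(t)T
∂E/∂W_h = Σ_{t=1}^{τ} [ δ^(t) ⊙ ( 1 - h^(t) ⊙ h^(t) ) ] h^(t-1)T
∂E/∂W_x = Σ_{t=1}^{τ} [ δ^(t) ⊙ ( 1 - h^(t) ⊙ h^(t) ) ] x^(t)T
∂E/∂b_y = Σ_{t=1}^{τ} ( ŷ^(t) - y^(t) ),   ∂E/∂b_h = Σ_{t=1}^{τ} δ^(t) ⊙ ( 1 - h^(t) ⊙ h^(t) )

with δ^(t) given by the recursion from the earlier sketch and h^(0) the initial hidden state.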
40. Difficulty of Training a “Deep” Neural Network
I mentioned that the simple RNN is not very useful. Training an RNN on a sequence input over several time steps is virtually the same as training densely connected layers with many layers, and that is hard because of the vanishing/exploding gradient problem.
41. Vanishing / Exploding Gradient Problem
This vanishing/exploding problem is not only a problem of RNNs. In fact, if you simply stack more than three layers of a plain neural network, it becomes very hard to train.
42. Vanishing / Exploding Gradient Problem
Forward propagation is a repetition of taking a linear summation and “squashing” the sum into a certain range: linear summation, squashing, linear summation, squashing, and so on.
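A sketch of why this repetition causes trouble (a standard argument written with the symbols assumed earlier): backpropagating an error from time step τ to time step t multiplies it by one Jacobian per step,

∂h^(τ)/∂h^(t) = Π_{k=t+1}^{τ} ∂h^(k)/∂h^(k-1) = Π_{k=t+1}^{τ} diag( 1 - h^(k) ⊙ h^(k) ) W_h,

so if these factors are typically smaller than 1 in magnitude, the product shrinks exponentially (a vanishing gradient), and if they are typically larger than 1, it grows exponentially (an exploding gradient).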
43. A Super Brief History of Neural Nets
1940 ~ 1970: The 1st AI Boom
• 1958: Perceptron
• 1968: “2001 : A Space Odyssey” (the advent of the idea of AI)
1st AI Winter
• Hardships in linearly inseparable data problems
• Lack of computation algorithms
• Limitation of hardware
1980 ~ 1990: The 2nd Boom
• 1980: Neocognitron
• 1986: Backpropagation
• 1989: LeNet
2nd AI Winter
• Vanishing / exploding gradient problem
• Limitation of hardware
• Shortage of data sources
• Lack of theories for hyper parameters
2006 ~: The 3rd Boom
• 1995: Commercialization of the Internet
• 2001: Release of Xbox
• Increase of data transmission speed
• 2006: Pretraining with deep belief nets
44. Deep Belief Network and Pretraining
The vanishing/exploding gradient problem had been one of the bottlenecks of deep neural net research. In 2006, a group under Geoffrey Hinton discovered a way to tackle this problem by pretraining neural nets with deep belief networks. This is counted as a breakthrough of deep learning research.
45. Tackling the Vanishing Gradient Problem with LSTM
(The timeline is the same as on the former slide, with one addition.)
1997: LSTM (Long Short-Term Memory)
Using LSTM is one way to deal with the vanishing gradient problem of RNNs. The idea of LSTM was already published in 1997, which, by the way, is the year I was born.
46. For example, Francois Chollet, the developer of Keras, says….
In short, he suggests we should not think too much about the internal structure of LSTM. With Keras, you can implement an LSTM in one or two lines of code. But anyway, these slides are going to explain everything, contrary to his advice. The structure of LSTM is much more complicated than that of a simple RNN, and understanding its backprop can also be pure torture.
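For instance, a minimal Keras sketch of the kind of one- or two-line usage he has in mind (the layer sizes and the sentiment-style task are my own assumptions for illustration, not from the slides):

# A sequence classifier whose whole recurrent part is a single LSTM line.
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(None,), dtype="int32"),                 # a sequence of token ids
    keras.layers.Embedding(input_dim=10000, output_dim=32),    # token ids -> 32-d vectors
    keras.layers.LSTM(64),                                      # the LSTM itself: one line
    keras.layers.Dense(1, activation="sigmoid"),                # e.g. a binary sentiment output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])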
47. LSTM: Architecture
Learning about LSTM can also be confusing because you can find various charts which show the same idea. You can roughly classify the charts into two types. The chart on this slide is closer to the one shown in the original paper.
48. LSTM: Forward Propagation
In conclusion, I don’t recommend trying to understand LSTM with this chart. But many papers use this type of chart, so let’s BRIEFLY take a look at it first. This is one block of LSTM. Just as with the normal RNN, each block gets an input and gives out an output, and the output goes back to the block through recurrent connections.
49. LSTM: Forward Propagation
The black dot at the center is called the cell, and it contains some information. The output is calculated from the value of the cell. Basically an LSTM repeats updating the cell and giving out an output at every time step. To be exact, this type of LSTM is called a “peephole LSTM.” For simplicity, however, we are going to think about an LSTM without peephole connections in the next several slides.
50. LSTM: Forward Propagation
An LSTM block has sections named “gates”: the forget gate, the input gate, and the output gate. Each gate gets the input at the current time step and the output of the last time step. There is no consensus on what to call the last section; let’s call it the “block input” in these slides. In each gate, the linear summation of those inputs and the recurrent connections is activated with a sigmoid function, but the block input is activated with a hyperbolic tangent.
52. LSTM: Forward Propagation
Let’s see how each gate in an LSTM block behaves at time step t. First, keep in mind that the dotted lines show recurrent connections, the flow of information from one time step before, while the solid lines show connections within the current time step. The outputs of the block input and the input gate are multiplied elementwise. The forget gate “forgets” part of the cell state from the last time step by multiplying it elementwise by values between 0 and 1. With these two results you update the cell state. The output gate also restricts the output by an elementwise multiplication.
54. LSTM: Architecture
Instead of the chart in the “Space Odyssey” paper, I recommend using an LSTM chart drawn like an electronic circuit, as on this slide.
56. LSTM : Forward Propagation
If you write down every equation in the former chart, it looks like this.
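A sketch of those forward equations without peephole connections (the symbols follow the conventions of the paper “LSTM: A Search Space Odyssey” cited later in these slides; treat this as my reconstruction rather than the slides’ exact notation):

z^(t) = tanh( W_z x^(t) + R_z y^(t-1) + b_z )        (block input)
i^(t) = σ( W_i x^(t) + R_i y^(t-1) + b_i )           (input gate)
f^(t) = σ( W_f x^(t) + R_f y^(t-1) + b_f )           (forget gate)
c^(t) = z^(t) ⊙ i^(t) + c^(t-1) ⊙ f^(t)              (cell state update)
o^(t) = σ( W_o x^(t) + R_o y^(t-1) + b_o )           (output gate)
y^(t) = tanh( c^(t) ) ⊙ o^(t)                         (block output)

Here x^(t) is the input, y^(t) the block output, c^(t) the cell state, σ the sigmoid function, and ⊙ the elementwise product; the W_* are input weights and the R_* recurrent weights.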
57. BPTT for LSTM
This part is the climax of these slides. It is going to be more complicated than before. However, you only have to be careful about which variables affect which functions, just as in the other types of backprop. The backprop formulas in these slides are based on the paper “LSTM: A Search Space Odyssey.” These slides are going to show concretely how to derive the equations in that paper.
58. BPTT for LSTM
From now on we are going to denote gradients in the way below. Just as in BPTT of a simple RNN, to calculate the gradients at time step t, you need the errors at time steps s ≥ t + 1.
59. BPTT for LSTM: Calculating the Error of the Block Output
At time step t of BPTT, first of all you can calculate the error of the block output as below. *At this point we already know the errors from the later time steps. Calculating the error of the cell state is relatively complicated, so for now let’s skip that part.
60. BPTT for LSTM: Calculating the Gate Errors
First you need to calculate the error of the block output. The relations of the variables are shown in the chart: the flow of purple arrows shows how a variable directly affects other values.
72. BPTT for LSTM: Calculating the Parameter Gradients
In order to calculate a parameter gradient, you first need the error of the corresponding gate (or of the block input). Let’s calculate one case concretely with the chain rule; hence the general formula follows just as well.
73. BPTT for LSTM: Calculating the Parameter Gradients
The same goes for the next set of parameters: first one concrete case with the chain rule, and then the general formula.
74. BPTT for LSTM: Calculating the Parameter Gradients
Again, we first calculate one concrete case with the chain rule, and hence obtain the general formula.
75. BPTT for LSTM: Calculating the Parameter Gradients
Finally, let’s calculate one gradient as a representative; the remaining ones follow just as well.
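A sketch of what those parameter gradients look like in the “Search Space Odyssey” notation, without peepholes (my reconstruction; ⊗ denotes the outer product):

δW_* = Σ_{t=0}^{T} δ*^(t) ⊗ x^(t)
δR_* = Σ_{t=0}^{T-1} δ*^(t+1) ⊗ y^(t)
δb_* = Σ_{t=0}^{T} δ*^(t)

for each part * ∈ { z, i, f, o }, where δ*^(t) is the error at the pre-activation value of that gate (or of the block input) at time step t, and x^(t), y^(t) are the block input and block output from the forward-pass sketch.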
76. Variety of Architectures
We have finally looked over how LSTM works, and you should be proud of that. But in practice the architecture of an RNN can be much more complicated. First, you can stack RNN layers on top of each other for a more complicated mapping (a stacked RNN). You can also propagate information backward in time with a bidirectional RNN.
77. Bidirectional Connections
Machine translation is a typical case where you need bidirectional connections.
“Ich bin der Einzige in dem Büro, der kein Deutsch sprechen kann.”
“I am the only one who cannot speak German in the office.”
「そのオフィスでドイツ語が話せないのは私だけです。」
Later time steps in the original language can affect earlier time steps in the target language.
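A minimal Keras sketch combining these two ideas, a stacked encoder whose first layer is bidirectional (the sizes and the final per-time-step output layer are my own assumptions for illustration):

# Two stacked recurrent layers; the first one reads the sequence in both directions.
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(None, 128)),                                     # a sequence of 128-d vectors
    keras.layers.Bidirectional(keras.layers.LSTM(64, return_sequences=True)),
    keras.layers.LSTM(64, return_sequences=True),                       # the stacked second layer
    keras.layers.TimeDistributed(keras.layers.Dense(10, activation="softmax")),
])
model.summary()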
78. Further Problems
The RNN models we have seen so far have a fundamental problem. If you naively use the RNN we have seen, it basically gets one input and gives out one output at every time step. Or it gets an input at every time step and gives out one output at the final time step. Question: what is the problem with this kind of model?
79. Sequence to Sequence Model
It is easy to imagine that this simple model of RNN is not useful for machine translation: usually the input sequence and the output sequence have different lengths.
80. Architecture of Vanilla RNN
This model is not appropriate for problems like machine translation or voice recognition. Let’s briefly take a look at how we can apply RNNs to such tasks.
81. Sequence to Sequence Model: Machine Translation
[Figure: an encoder RNN reads the source sentence from the earlier example token by token; a decoder RNN emits the translated sentence token by token, starting from [BOS] and ending with [EOS].]
First you put the input (the original sentence) into the encoder. At the last time step of the encoder, you get a hidden state. You initialize the RNN of the decoder part with that hidden state and give [BOS] as the first input. You then use each output as the input of the next time step, and you repeat this until the decoder RNN gives out [EOS].
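A minimal sketch of this decoding loop in Python (every name here, encoder.encode, decoder.step, and so on, is a hypothetical placeholder, not an actual library API):

# Greedy seq2seq decoding: encode the source, then feed each predicted token back in.
def translate(encoder, decoder, source_tokens, bos_id, eos_id, max_len=50):
    hidden = encoder.encode(source_tokens)           # hidden state at the encoder's last step
    token = bos_id                                   # decoding starts from [BOS]
    output = []
    for _ in range(max_len):
        probs, hidden = decoder.step(token, hidden)  # one decoder time step
        token = int(probs.argmax())                  # greedy choice of the next word
        if token == eos_id:                          # stop once [EOS] is produced
            break
        output.append(token)
    return output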
82. Sequence to Sequence Model: How to Train
[Figure: the same encoder-decoder chart as before, but the decoder’s inputs are fixed to the correct target words instead of its own outputs.]
When you train this model, you fix the input sequence of the decoder to the correct target sentence (so-called teacher forcing) instead of feeding the decoder its own predictions.
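A minimal sketch of how the decoder’s training data can be arranged under this scheme (the token ids below are made up for illustration):

# Teacher forcing: the decoder's input is the target sentence shifted right by one token.
BOS, EOS = 1, 2
target = [BOS, 17, 5, 42, 9, EOS]      # the correct translated sentence as token ids

decoder_input = target[:-1]            # [BOS, 17, 5, 42, 9]   fed into the decoder
decoder_target = target[1:]            # [17, 5, 42, 9, EOS]   what the decoder should predict

for x_t, y_t in zip(decoder_input, decoder_target):
    # at every time step the decoder sees the correct previous word x_t
    # and is trained to predict the next word y_t
    pass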
83. Universal Approximation
[Figure: AlexNet again, with (224, 224, 3) input tensors, a feature extraction part producing a 4096-d vector, and a classifying part producing a 1000-d vector.]
I’d like you to remember that I mentioned that a CNN can be divided into a “feature extraction part” and a “classifying part.” The 4096-d space is more related to the “meaning” of the input images, and we can assume that the last three layers are a mapping from these “meaning” vectors to a 1000-d vector.
84. Sequence to Sequence Model: Image Captioning
[Figure: the same decoder as before, started from [BOS] and stopped at [EOS], but with the RNN encoder replaced by a CNN.]
Surprisingly, if you use the feature extraction part of a CNN as the encoder and initialize the decoder with the resulting “meaning” vector, the same sequence to sequence idea produces captions for images.
85. Sequence to Sequence Model
However, these are very simplified versions of the sequence to sequence model. In practice you have to consider many more things: for example, stacking RNNs, bidirectional connections, attention, beam search, and building a language model. You might need at least one semester for this topic, and on top of that, natural language processing is a fast-changing, hot field.
86. Sequence to Sequence Model
For example, the RNN model behind Google Translate stacks as many as 8 LSTM layers. On my company laptop, it took me almost one week to train a 2-layer stacked LSTM on much smaller training data.