DATANOMIQ GmbH | Franklinstr. 11 | 10587 Berlin
Recurrent Neural Network
Brief Review on Other Types of Neural Network
Densely Connected Layers  Universal Classifier / Regressor
Convolutional Neural Network  Effective for image processing
Universal Approximation
[Figure: densely connected layers mapping a D-dim input vector to a K-dim output vector]
 It is proved that with enough units and layers, densely connected layers are capable of universal approximation.
 Universal approximation means that you can approximate any mapping. In this case, the densely connected layers are a function mapping a D-dim vector to a K-dim vector.
An Example of Mapping : “Hello World!” of Machine Learning
[Figure: a handwritten '5' is flattened into a 784-d vector of pixel values, passed through a 16-d hidden layer, and mapped to a 10-d vector of class probabilities (e.g. 83% for the most likely class, the digit '5').]
 Classifying the MNIST handwritten digit dataset is a mapping from 784-d vectors to 10-d vectors.
 This is based on the hypothesis that a function exists which maps vectors of 784 pixel values to 10-d vectors whose elements are probabilities.
 Actually this neural net is capable of classifying handwritten digits with over 90% accuracy.
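 As a concrete sketch of this 784-d to 16-d to 10-d mapping, here is a minimal Keras version. The 16-unit hidden layer follows the figure; the activation functions, optimizer, and epoch count are my own assumptions, not taken from the slides.

```python
# Minimal sketch of the 784 -> 16 -> 10 mapping from the figure,
# using Keras. Hyperparameters (activations, optimizer, epochs) are assumptions.
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0  # flattening
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="sigmoid", input_shape=(784,)),  # 16-d hidden vector
    tf.keras.layers.Dense(10, activation="softmax"),  # 10-d vector of class probabilities
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))
```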
Universal Approximation
[Figure: AlexNet takes (224, 224, 3) tensors; a feature extraction part produces a 4096-d vector, and a classifying part maps it to a 1000-d vector.]
 In the case of image classification with AlexNet, you flatten the last activation maps and get a 4096-d vector.
 The 4096-d space is more closely related to the “meaning” of the input images.
 And we can assume that the last three layers are a mapping from these “meaning,” or “semantic,” vectors to a 1000-d vector.
Why RNN?
 Question : What is a critical constraint of densely connected layers and convolutional neural networks?
 The input size of those networks is fixed.
 You cannot properly deal with inputs whose order matters.
These constraints are especially problematic when you work on “sequence data.”
Sequence Data
 Sequence data is a sequence of vectors or matrices over several time steps.
 In some tasks we want to map sequence data to sequence data. RNNs are mainly used for this type of task.
 It is known that an RNN can approximate any mapping from an input sequence to an output sequence.
*The “time steps” don’t necessarily refer to real time. They are just indexes of the elements.
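 As a tiny illustration of what sequence data looks like as arrays, here is a sketch in NumPy; the sizes T, D, and K below are arbitrary example values, not from the slides.

```python
# A sequence of T vectors: one D-dim input vector and one K-dim target
# vector per time step. T, D, K are arbitrary example sizes.
import numpy as np

T, D, K = 5, 3, 2
x_seq = np.random.randn(T, D)    # input sequence
y_seq = np.random.randn(T, K)    # output sequence
print(x_seq.shape, y_seq.shape)  # (5, 3) (5, 2)
```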
Sequence Data
 Sequence data means data in which the order of the elements matters.
 Voice/music
 Video
 Text
Sequence Data
 Let’s take text data as an example.
 In natural language processing, including machine translation, a word or a part of a word is usually encoded as a vector (for example with word embeddings).
 Let’s take a look at the three sentences below in German, English, and Japanese. Assume that each sentence is encoded as a sequence of such vectors.
“Ich bin der Einzige in dem Büro, der kein Deutsch sprechen kann.”
“I’m the only one who cannot speak German in the office.”
「そのオフィスでドイツ語が話せないのは私だけです。」
Variety of Architectures of RNN Networks
 Confusion in studying RNNs comes from the variety of their architectures.
 You will probably see mainly these types of charts when you start learning about RNNs.
[Figure: an Elman RNN and a bidirectional RNN]
Architecture of a Simple RNN
 A simple RNN is basically just a densely connected network with a few layers.
 This network propagates forward within one time step just like normal densely connected layers.
 In the forward propagation of an RNN, however, the neurons in the middle layer also propagate to the middle layer itself.
Architecture of a Simple RNN
 It is normal to display every time step of forward propagation.
 The inputs and the outputs depend on the time step.
 When you display an RNN, it is common to show the inner structure as a black box.
 We share the same parameters at every time step.
Architecture of a Simple RNN
 This is the idea of
“unfolding.”
 Most RNNs more or less
share the same structure.
Forward Propagation of Simple RNN
 Let’s take a look at the forward propagation of an RNN from the first time step to the last.
Forward Propagation of Simple RNN
 If you unfold the forward propagation of an RNN from the first time step to the last, it looks like this.
 You share the same parameters at every time step.
 Usually this type of chart is simplified this way.
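 To make the unfolded picture concrete, here is a hedged NumPy sketch of the forward propagation of a simple RNN. The tanh activation and the array sizes are my own assumptions, since the slide’s own equations are only in the figure; the point is that the same parameters are reused at every time step.

```python
# Forward propagation of a simple RNN (a sketch; activation and sizes are
# assumptions). The same Wx, Wh, b, Wy, c are reused at every time step.
import numpy as np

rng = np.random.default_rng(0)
T, D, H, K = 4, 3, 5, 2              # time steps, input/hidden/output sizes
Wx = rng.normal(size=(H, D))         # input -> hidden
Wh = rng.normal(size=(H, H))         # hidden -> hidden (recurrent connection)
b = np.zeros(H)
Wy = rng.normal(size=(K, H))         # hidden -> output
c = np.zeros(K)

xs = rng.normal(size=(T, D))         # one input vector per time step
h = np.zeros(H)                      # initial hidden state
for t in range(T):
    h = np.tanh(Wx @ xs[t] + Wh @ h + b)  # middle layer also feeds itself
    y = Wy @ h + c                        # output at this time step
    print(t, y)
```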
Back Propagation of Simple RNN
 There are several ways to calculate the gradients with respect to the parameters in order to optimize an RNN.
 In this slide, we are going to look at how BPTT (back propagation through time) works.
 In BPTT we use an unfolded RNN chart, and the error at every step propagates back to the first step.
Back Propagation of Simple RNN
 To be honest, this part is not worth spending a lot of time on, because in the end simple RNNs are not very useful.
 But if you take a look at how BPTT of simple RNNs works, that will give you a clear understanding of how other, fancier RNNs work.
 Please be patient over the next few pages if you are interested.
 I made the equations as concrete as possible so that they are easy to understand.
BPTT for Simple RNN : Outline
 You get an error at each time step, and the errors propagate back to the first time step.
 You need to calculate gradients with respect to the parameters at each time step.
 Assume that you have run an RNN forward from the first step to the last step.
BPTT for Simple RNN : Outline
 You have to be careful that in order to calculate all those gradients, you use the same parameters regardless of the time step.
 In order to calculate the gradient at a time step, you need the errors which derive from later steps.
 Let’s take an example.
 You repeat that for all the gradients.
BPTT for Simple RNN : Outline
 We don’t calculate such things as .
 As just mentioned, you use the same parameters to calculate the gradients at every time step.
BPTT for Simple RNN : Outline
 And you update the parameters with the summation of the gradients over all time steps.
 In my opinion, the usual notation is not quite proper. It should be denoted as below.
 The gradients depend on the time steps.
BPTT for Simple RNN : Brief Review on Chain Rules
 Just as in backprop for normal densely connected layers, you use the chain rule a lot in backprop for an RNN.
 Let a function depend on two variables, and let those two variables in turn be functions of two other variables.
 Let a function depend on one variable, and let that variable be a function of another variable.
BPTT for Simple RNN : Brief Review on Chain Rules
 Let a function depend on n variables, and let those variables be functions of m variables.
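 Since the slide’s own formulas are only in the figure, here is the standard multivariable chain rule written out with generic symbols of my own choosing, not the slide’s notation.

```latex
% Generalized chain rule with generic symbols (my own notation, not the slide's).
% Let z = f(u_1, \dots, u_n) and u_i = g_i(x_1, \dots, x_m). Then
\[
\frac{\partial z}{\partial x_j}
  = \sum_{i=1}^{n} \frac{\partial z}{\partial u_i}\,\frac{\partial u_i}{\partial x_j},
  \qquad j = 1, \dots, m.
\]
```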
BPTT for Simple RNN : Brief Review on Chain Rules
 This generalized chain rule is super important for back propagation.
 For simplicity, let’s denote the function in the way below.
 Again, the partial differentiation of with respect to is
 First, let’s calculate .
 This is a gradient at the last step, so you don’t have to consider the error
backpropagated from the future steps.
 Let’s look at each element of
.
BPTT for Simple RNN : Calculating
*chain rule
 Hence
BPTT for Simple RNN : Calculating
 Therefore calculating is equal to
calculating .
 You should keep in mind that in , only is a function of .
 Just as well, when you calculate , the error backpropagates only from the error function above.
BPTT for Simple RNN : Calculating
 To calculate gradients , you need to consider
errors backpropagated via recurrent connections.
 You don’t need to consider recurrent connections to
calculate.
 Just as , you get .
With recurrent
connections
With NO recurrent
connections
BPTT for Simple RNN : Calculating
 Next, we use the chain rule in several steps to calculate
*chain rule
*chain rule
BPTT for Simple RNN : Calculating
 We got this equation in the former slide :
 You don’t need to consider recurrent connections to calculate
 Just as , you get .
 We got this equation in the former slide :
 Hence
BPTT for Simple RNN : Calculating
 Hence
BPTT for Simple RNN
 From now on, especially, you have to be super
careful about notations of gradients.
 is a gradient with respect to neurons.
 We don’t calculate such things as .
 is a gradient with respect to parameters.
 In other words, we share the same parameters at every time step.
 But the parameters have different errors (gradients) at EVERY time step.
BPTT for Simple RNN
 But some study materials use this notation in RNN backprop (for example a textbook from MIT, which is also recommended by Stanford University).
 When you see , that means you differentiate the neurons at time step t with respect to .
 might not be a proper notation, because does not depend on the time steps.
 Therefore usually
BPTT for Simple RNN : Calculating
*chain rule
*chain rule
 We start concretely calculating gradients with respect to each
parameter.
 We need to calculate .
BPTT for Simple RNN : Calculating
 As we calculated in the last slide
 Hence
*chain rule
*chain rule
BPTT for Simple RNN : Calculating
 We calculate
 Hence
 In the last slide we got
BPTT for Simple RNN : Calculating
 First we calculate
BPTT for Simple RNN : Calculating
 Hence
 Then *chain rule
 Just as in the former slides
BPTT for Simple RNN : Calculating
 Hence Then
*chain rule
BPTT for Simple RNN : Calculating
 Hence Then
 Also
*chain rule
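 To make the whole calculation concrete, here is a hedged NumPy sketch of BPTT for a small simple RNN. The tanh activation, the squared-error loss, and all array sizes are my own choices, since the slides’ equations are only in the figures; the point is that the per-time-step gradients of the shared parameters are accumulated into one gradient per parameter.

```python
# BPTT for a simple RNN, as a sketch: h_t = tanh(Wx x_t + Wh h_{t-1} + b),
# y_t = Wy h_t + c, squared-error loss summed over time steps.
# Activation, loss, and sizes are assumptions, not taken from the slides.
import numpy as np

rng = np.random.default_rng(0)
T, D, H, K = 4, 3, 5, 2                      # time steps, input/hidden/output sizes
Wx, Wh, b = rng.normal(size=(H, D)), rng.normal(size=(H, H)), np.zeros(H)
Wy, c = rng.normal(size=(K, H)), np.zeros(K)
xs = rng.normal(size=(T, D))
ts = rng.normal(size=(T, K))                 # targets

# ---- forward pass, storing hidden states for reuse in backprop ----
hs = np.zeros((T + 1, H))                    # hs[0] is the initial hidden state
ys = np.zeros((T, K))
for t in range(T):
    hs[t + 1] = np.tanh(Wx @ xs[t] + Wh @ hs[t] + b)
    ys[t] = Wy @ hs[t + 1] + c

# ---- backward pass: accumulate gradients of the shared parameters ----
dWx, dWh, db = np.zeros_like(Wx), np.zeros_like(Wh), np.zeros_like(b)
dWy, dc = np.zeros_like(Wy), np.zeros_like(c)
dh_next = np.zeros(H)                        # error arriving via the recurrent connection
for t in reversed(range(T)):
    dy = ys[t] - ts[t]                       # dL/dy_t for the squared-error loss
    dWy += np.outer(dy, hs[t + 1]); dc += dy
    dh = Wy.T @ dy + dh_next                 # error from the output AND from later steps
    da = (1.0 - hs[t + 1] ** 2) * dh         # through the tanh activation
    dWx += np.outer(da, xs[t]); dWh += np.outer(da, hs[t]); db += da
    dh_next = Wh.T @ da                      # pass the error one step further back
```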
Difficulty of Training “Deep” Neural Networks
 I mentioned that simple RNNs are not very useful.
 Training an RNN with sequence input over several time steps is virtually the same as training densely connected layers with many layers.
 That is because of the vanishing/exploding gradient problem.
Vanishing / Exploding Gradient Problem
 This vanishing/exploding problem is not only a problem of RNNs.
 In fact, if you simply stack more than three layers of a neural network, it is very hard to train the network.
Vanishing / Exploding Gradient Problem
 Forward propagation is a repetition of linear summation and “squashing” the sum into a certain range.
[Figure: layers alternating “linear summation” and “squashing the sum”]
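 A tiny numeric illustration of the point, with toy numbers of my own: the derivative of the sigmoid is at most 0.25, so a product of many such factors shrinks toward zero, while repeated factors larger than 1 blow up.

```python
# Toy illustration of vanishing/exploding gradients: the backpropagated
# error is (roughly) a product of one factor per layer / time step.
sigmoid_deriv_max = 0.25          # maximum of d(sigmoid)/dx
for n_layers in (5, 20, 50):
    vanishing = sigmoid_deriv_max ** n_layers
    exploding = 1.5 ** n_layers   # a weight factor slightly larger than 1
    print(f"{n_layers:2d} layers: vanishing ~ {vanishing:.3e}, exploding ~ {exploding:.3e}")
```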
A Super Brief History of Neural Nets
1940 ~ 1970: The 1st AI Boom (the advent of the idea of AI)
• 1958: Perceptron
• 1968: “2001 : A Space Odyssey”
1st AI Winter
• Hardships in linearly inseparable data problems
• Lack of computation algorithms
• Limitation of hardware
1980 ~ 1990: The 2nd Boom
• 1980: Neocognitron
• 1986: Backpropagation
• 1989: LeNet
2nd AI Winter
• Vanishing / exploding gradient problem
• Limitation of hardware
• Shortage of data sources
• Lack of theories for hyper parameters
(1995: Commercialization of the Internet; 2001: Release of Xbox; increase of data transmission speed)
2006 ~: The 3rd Boom
• 2006: Pretraining with deep belief net
Deep Belief Network and Pretraining
 The vanishing/exploding gradient problem had been one of the bottlenecks of deep neural net research.
 In 2006, a group under Geoffrey Hinton discovered a way to tackle this problem by pretraining neural nets with deep belief networks.
 This is counted as a breakthrough in deep learning research.
Tackling Vanishing Gradient Problem with LSTM
[Timeline repeated from the previous slide, with one addition between the 2nd AI Winter and the 3rd Boom: 1997: LSTM (Long Short-Term Memory).]
 Using LSTM is one way to deal with the vanishing gradient problem of RNNs.
 The idea of LSTM already appeared in 1997 (incidentally, the year I was born).
LSTM
 For example Francois Chollet, the developer of Keras, says….
 In short, he suggests we should not think too much about the structure of LSTM. With Keras, you can implement an LSTM in one or two lines of code.
 But anyway, this slide is going to explain everything, contrary to his advice.
 The structure of LSTM is much more complicated than a simple RNN. Understanding its backprop can also be pure torture.
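 As a concrete illustration of the “one or two lines” point, here is a minimal sketch using the Keras API; the vocabulary size, embedding size, and unit counts are my own placeholder values, not from the slides.

```python
# With Keras, the whole LSTM layer is a single line: layers.LSTM(64).
# Vocabulary size, embedding dim, and unit counts are placeholder values.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=32),  # token ids -> vectors
    tf.keras.layers.LSTM(64),                                   # the LSTM itself
    tf.keras.layers.Dense(1, activation="sigmoid"),             # e.g. binary classification
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```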
LSTM: Architecture
 Learning LSTM can also be confusing because you can find various charts which show the same idea.
 You can roughly classify the charts into two types.
 This chart is closer to the one shown in the original paper.
 In conclusion, I don’t recommend trying to understand LSTM with this chart.
LSTM: Forward Propagation
 But many papers use this type of chart, so let’s BRIEFLY take a look at it first.
 This is one block of an LSTM.
 Just like a normal RNN, each block gets an input and gives out an output.
 The output goes back to the block through recurrent connections.
[Figure: an LSTM block]
LSTM: Forward Propagation
 The black dot at the center is called the cell, and it contains some information.
 The output is calculated from the value of the cell.
 Basically an LSTM repeats renewing the cell and giving out an output at every time step.
 To be exact, this type of LSTM is called a “peephole LSTM.” For simplicity, however, we are going to think about LSTM without peephole connections in the next few slides.
LSTM: Forward Propagation
 An LSTM block has sections named “gates”: the forget gate, the input gate, and the output gate.
 Each gate gets an input at the current time step and the output of the last time step.
 There is no consensus on what to call the last section. Let’s call it the “block input” in this slide.
 In each gate, the linear summation of these inputs and recurrent connections is activated with a sigmoid function.
 But the block input is activated with a hyperbolic tangent.
LSTM: Forward Propagation
 The output gate also restricts the output by elementwise multiplication.
LSTM: Forward Propagation
 Let’s see how each gate in an LSTM block behaves at time step t.
 First, keep in mind that the dotted lines show recurrent connections, the flow of information from one time step before; the other lines show connections within the time step.
 The outputs of the block input and the input gate are multiplied elementwise.
 The forget gate “forgets” the cell state of the last time step by multiplying it elementwise with values between 0 and 1.
 And you renew the cell state.
 The output gate also restricts the output by elementwise multiplication.
LSTM: Architecture
 Instead of the chart in the “Space Odyssey” paper, I recommend using this LSTM chart, drawn like an electronic circuit.
LSTM: Forward Propagation
 If you write down every equation in the former chart, it looks like this.
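 Since the equations themselves appear only in the slide figure, here is a hedged NumPy sketch of one forward step of a standard non-peephole LSTM block, with my own symbol names and sizes rather than the slide’s notation.

```python
# One forward step of a standard non-peephole LSTM block (a sketch;
# symbol names and sizes are my own, not the slide's notation).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, R, b):
    """W: input weights, R: recurrent weights, b: biases, one set per gate."""
    f = sigmoid(W["f"] @ x_t + R["f"] @ h_prev + b["f"])   # forget gate (0..1)
    i = sigmoid(W["i"] @ x_t + R["i"] @ h_prev + b["i"])   # input gate  (0..1)
    o = sigmoid(W["o"] @ x_t + R["o"] @ h_prev + b["o"])   # output gate (0..1)
    z = np.tanh(W["z"] @ x_t + R["z"] @ h_prev + b["z"])   # block input (-1..1)
    c_t = f * c_prev + i * z                               # renew the cell state
    h_t = o * np.tanh(c_t)                                 # restricted output
    return h_t, c_t

# toy usage with assumed sizes
D, H = 3, 4
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(H, D)) for k in "fioz"}
R = {k: rng.normal(size=(H, H)) for k in "fioz"}
b = {k: np.zeros(H) for k in "fioz"}
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.normal(size=D), h, c, W, R, b)
```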
BPTT for LSTM
 This part is the climax of this slide.
 It is going to be more complicated than before.
 However, you only have to be careful about which variables are affecting which functions, just as in the other types of backprop.
 The backprop formulas in this slide are based on the paper “LSTM: A Search Space Odyssey.”
 This slide is going to concretely show how to get the equations in that paper.
BPTT for LSTM
 From now on we are going to denote gradients in the way below.
 Just as in BPTT for a simple RNN, to calculate the gradient at time step t, you need the errors at time steps s >= t+1.
BPTT for LSTM: Calculating
 At time step of BPTT, first of all you can calculate as below.
*At this point, we already know .
 Calculating is relatively complicated. For now let’s skip this part.
BPTT for LSTM: Calculating
 First you need to calculate .
 The relations of the variables are:
 The flow of purple arrows in the chart shows how a variable directly affects other values.
 Hence
BPTT for LSTM: Calculating
BPTT for LSTM: Calculating
 Next we calculate :
BPTT for LSTM: Calculating
 If you continue calculating
 Hence
BPTT for LSTM: Calculating
 The relations of the variables are:
 Hence
BPTT for LSTM: Calculating
 The relations of the variables are:
 Hence
BPTT for LSTM: Calculating
 The relations of the variables are:
 Hence
BPTT for LSTM: Calculating
 In a former slide, we saw is calculated as below.
 Let’s see how we can get this equation.
BPTT for LSTM: Calculating
 Just as in backprop for a simple RNN, the errors come from time step
 and from time step
BPTT for LSTM: Calculating
BPTT for LSTM: Calculating
 Let’s see how to calculate , especially the part
 Then
BPTT for LSTM: Calculating
 Hence
BPTT for LSTM: Calculating
 Hence
 In order to calculate , you need to calculate , therefore .
 And in general
 Let’s calculate , and then you can calculate in general just as well.
*chain rule
 Hence  And in general
BPTT for LSTM: Calculating
 In order to calculate , you need to calculate , therefore .
*chain rule
 At first let’s calculate , and then you can calculate in general just as well.
BPTT for LSTM: Calculating
 Hence
 At first let’s calculate , and then you can calculate in general just as well.
 In order to calculate , you need to calculate , therefore .
*chain rule
 And in general
BPTT for LSTM: Calculating
 Hence
 Let’s calculate , as a representative.
 In order to calculate , you need to calculate , therefore .
*chain rule
 Just as well
Variety of Architectures
 We have finally looked over how LSTM works, and you should be proud of that.
 But in practice the architecture of an RNN can be much more complicated.
 First, you can stack RNN layers on top of each other for a more complicated mapping (a stacked RNN).
 You can also propagate information backward in time with a bidirectional RNN.
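 Here is a sketch of how those two ideas look in Keras; all sizes are placeholder values of my own. Stacking is just putting recurrent layers on top of each other with return_sequences=True, and the Bidirectional wrapper makes a layer read the sequence backward as well.

```python
# Stacked + bidirectional LSTM in Keras (sizes are placeholder values).
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Embedding(input_dim=10000, output_dim=32),
    # stacked: the first layer must return the full sequence for the next one
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.Bidirectional(layers.LSTM(32)),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```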
Bidirectional Connection
 Machine translation is a typical case where you need bidirectional connections.
“Ich bin der Einzige in dem Büro, der kein Deutsch sprechen kann.”
“I am the only one who cannot speak German in the office.”
「そのオフィスでドイツ語が話せないのは私だけです。」
 Later time steps in the original language can affect earlier time steps in the target language.
Further Problems
 The RNN models we have seen so far have a fundamental problem.
 Question : What is the problem with this model?
 If you naively use the RNN we have seen, it basically gets one input and gives out one output at every time step.
 Or it gets an input at every time step and gives out one output at the final time step.
Sequence to Sequence Model
 It is easy to imagine that this simple model of an RNN is not useful for machine translation.
[Figure: architecture of a vanilla RNN]
 This model is not suitable for problems like machine translation or voice recognition.
 Let’s briefly take a look at how we can apply RNNs to such tasks.
Sequence to Sequence Model: Machine Translation
[Figure: an encoder-decoder RNN translating the German sentence into the Japanese one, token by token, between [BOS] and [EOS] markers]
 First you put the input (the original sentence) into the encoder.
 At the last time step of the encoder, you get a hidden state.
 You initialize the RNN of the decoder part with that hidden state and give [BOS] as the first input.
 You use the output as the input of the next time step.
 You repeat this until the decoder RNN gives out [EOS].
Sequence to Sequence Model: How to Train
[Figure: the same encoder-decoder, with the correct target sequence fed into the decoder during training]
 When you train this model, you fix the input sequence fed into the decoder, that is, you give the correct target tokens as the decoder inputs instead of its own outputs (so-called teacher forcing).
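 A hedged sketch of this encoder-decoder setup with teacher forcing, following the standard Keras functional-API pattern; the vocabulary sizes and dimensions are my own placeholders, and stacking, bidirectional connections, attention, and beam search are all omitted.

```python
# Minimal encoder-decoder (seq2seq) training graph with teacher forcing,
# in the style of the standard Keras functional API. Sizes are placeholders.
import tensorflow as tf
from tensorflow.keras import layers

src_vocab, tgt_vocab, emb, units = 8000, 8000, 64, 128

# --- encoder: read the source sentence, keep only the final states ---
enc_in = layers.Input(shape=(None,), dtype="int32")
enc_emb = layers.Embedding(src_vocab, emb)(enc_in)
_, state_h, state_c = layers.LSTM(units, return_state=True)(enc_emb)

# --- decoder: initialized with the encoder states; during training it is
# fed the correct target tokens shifted by one, starting with [BOS] ---
dec_in = layers.Input(shape=(None,), dtype="int32")
dec_emb = layers.Embedding(tgt_vocab, emb)(dec_in)
dec_out, _, _ = layers.LSTM(units, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c])
logits = layers.Dense(tgt_vocab, activation="softmax")(dec_out)

model = tf.keras.Model([enc_in, dec_in], logits)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit([src_ids, tgt_ids_in], tgt_ids_out, ...)  # tgt_ids_out = tgt_ids_in shifted left
model.summary()
```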
Universal Approximation
[Figure: AlexNet takes (224, 224, 3) tensors; a feature extraction part produces a 4096-d vector, and a classifying part maps it to a 1000-d vector.]
 I’d like you to remember that I mentioned that a CNN can be divided into a “feature extraction part” and a “classifying part.”
 The 4096-d space is more closely related to the “meaning” of the input images.
 And we can assume that the last three layers are a mapping from these “meaning” vectors to a 1000-d vector.
Sequence to Sequence Model: Image Captioning
[Figure: a CNN feature vector is fed to the decoder RNN, which generates an output sequence between [BOS] and [EOS]]
 Surprisingly, if you use the CNN’s “semantic” vector to initialize the decoder, instead of the hidden state of an encoder RNN, a very similar model can generate captions of images.
Sequence to Sequence Model
 However, these are very simplified versions of the sequence to sequence model.
 In practice you have to consider several additional techniques.
 For example: stacking RNNs, bidirectional connections, attention, beam search, and building a language model.
 You might need at least one semester for this topic, and on top of that natural language processing is a fast-changing, hot field.
Sequence to Sequence Model
 For example, the RNN model behind Google Translate stacks as many as 8 LSTM layers.
 On my company laptop, it took me almost one week to train a 2-layer stacked LSTM on much smaller training data.

More Related Content

What's hot

Introduction to CNN
Introduction to CNNIntroduction to CNN
Introduction to CNNShuai Zhang
 
INTRODUCTION TO NLP, RNN, LSTM, GRU
INTRODUCTION TO NLP, RNN, LSTM, GRUINTRODUCTION TO NLP, RNN, LSTM, GRU
INTRODUCTION TO NLP, RNN, LSTM, GRUSri Geetha
 
Long Short Term Memory
Long Short Term MemoryLong Short Term Memory
Long Short Term MemoryYan Xu
 
Recurrent Neural Networks
Recurrent Neural NetworksRecurrent Neural Networks
Recurrent Neural NetworksCloudxLab
 
Recurrent Neural Networks. Part 1: Theory
Recurrent Neural Networks. Part 1: TheoryRecurrent Neural Networks. Part 1: Theory
Recurrent Neural Networks. Part 1: TheoryAndrii Gakhov
 
Sequence Modelling with Deep Learning
Sequence Modelling with Deep LearningSequence Modelling with Deep Learning
Sequence Modelling with Deep LearningNatasha Latysheva
 
Deep Learning: Recurrent Neural Network (Chapter 10)
Deep Learning: Recurrent Neural Network (Chapter 10) Deep Learning: Recurrent Neural Network (Chapter 10)
Deep Learning: Recurrent Neural Network (Chapter 10) Larry Guo
 
Autoencoders
AutoencodersAutoencoders
AutoencodersCloudxLab
 
Artificial Neural Network
Artificial Neural NetworkArtificial Neural Network
Artificial Neural NetworkKnoldus Inc.
 
simple_rnn_forward_back_propagation
simple_rnn_forward_back_propagationsimple_rnn_forward_back_propagation
simple_rnn_forward_back_propagationYasutoTamura1
 
Reinforcement learning, Q-Learning
Reinforcement learning, Q-LearningReinforcement learning, Q-Learning
Reinforcement learning, Q-LearningKuppusamy P
 
Attention mechanism 소개 자료
Attention mechanism 소개 자료Attention mechanism 소개 자료
Attention mechanism 소개 자료Whi Kwon
 
Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)Yuta Niki
 
Recurrent Neural Network (RNN) | RNN LSTM Tutorial | Deep Learning Course | S...
Recurrent Neural Network (RNN) | RNN LSTM Tutorial | Deep Learning Course | S...Recurrent Neural Network (RNN) | RNN LSTM Tutorial | Deep Learning Course | S...
Recurrent Neural Network (RNN) | RNN LSTM Tutorial | Deep Learning Course | S...Simplilearn
 

What's hot (20)

Recurrent neural network
Recurrent neural networkRecurrent neural network
Recurrent neural network
 
Rnn & Lstm
Rnn & LstmRnn & Lstm
Rnn & Lstm
 
Introduction to CNN
Introduction to CNNIntroduction to CNN
Introduction to CNN
 
INTRODUCTION TO NLP, RNN, LSTM, GRU
INTRODUCTION TO NLP, RNN, LSTM, GRUINTRODUCTION TO NLP, RNN, LSTM, GRU
INTRODUCTION TO NLP, RNN, LSTM, GRU
 
LSTM
LSTMLSTM
LSTM
 
Long Short Term Memory
Long Short Term MemoryLong Short Term Memory
Long Short Term Memory
 
Lstm
LstmLstm
Lstm
 
Recurrent Neural Networks
Recurrent Neural NetworksRecurrent Neural Networks
Recurrent Neural Networks
 
Recurrent Neural Networks. Part 1: Theory
Recurrent Neural Networks. Part 1: TheoryRecurrent Neural Networks. Part 1: Theory
Recurrent Neural Networks. Part 1: Theory
 
Sequence Modelling with Deep Learning
Sequence Modelling with Deep LearningSequence Modelling with Deep Learning
Sequence Modelling with Deep Learning
 
Deep Learning: Recurrent Neural Network (Chapter 10)
Deep Learning: Recurrent Neural Network (Chapter 10) Deep Learning: Recurrent Neural Network (Chapter 10)
Deep Learning: Recurrent Neural Network (Chapter 10)
 
Autoencoders
AutoencodersAutoencoders
Autoencoders
 
Artificial Neural Network
Artificial Neural NetworkArtificial Neural Network
Artificial Neural Network
 
BERT introduction
BERT introductionBERT introduction
BERT introduction
 
simple_rnn_forward_back_propagation
simple_rnn_forward_back_propagationsimple_rnn_forward_back_propagation
simple_rnn_forward_back_propagation
 
Reinforcement learning, Q-Learning
Reinforcement learning, Q-LearningReinforcement learning, Q-Learning
Reinforcement learning, Q-Learning
 
Attention mechanism 소개 자료
Attention mechanism 소개 자료Attention mechanism 소개 자료
Attention mechanism 소개 자료
 
Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)
 
Recurrent Neural Network (RNN) | RNN LSTM Tutorial | Deep Learning Course | S...
Recurrent Neural Network (RNN) | RNN LSTM Tutorial | Deep Learning Course | S...Recurrent Neural Network (RNN) | RNN LSTM Tutorial | Deep Learning Course | S...
Recurrent Neural Network (RNN) | RNN LSTM Tutorial | Deep Learning Course | S...
 
RNN-LSTM.pptx
RNN-LSTM.pptxRNN-LSTM.pptx
RNN-LSTM.pptx
 

Similar to Precise LSTM Algorithm

Illustrative Introductory Neural Networks
Illustrative Introductory Neural NetworksIllustrative Introductory Neural Networks
Illustrative Introductory Neural NetworksYasutoTamura1
 
Concepts of Temporal CNN, Recurrent Neural Network, Attention
Concepts of Temporal CNN, Recurrent Neural Network, AttentionConcepts of Temporal CNN, Recurrent Neural Network, Attention
Concepts of Temporal CNN, Recurrent Neural Network, AttentionSaumyaMundra3
 
lepibwp74jd2rz.pdf
lepibwp74jd2rz.pdflepibwp74jd2rz.pdf
lepibwp74jd2rz.pdfSajalTyagi6
 
Introduction to deep learning
Introduction to deep learningIntroduction to deep learning
Introduction to deep learningJunaid Bhat
 
5.2 Least Squares Linear Regression.pptx
5.2  Least Squares Linear Regression.pptx5.2  Least Squares Linear Regression.pptx
5.2 Least Squares Linear Regression.pptxMaiEllahham1
 
Tensor Field Network (and other ConvNet Generalisations)
Tensor Field Network (and other ConvNet Generalisations)Tensor Field Network (and other ConvNet Generalisations)
Tensor Field Network (and other ConvNet Generalisations)Peng Cheng
 
Complete solution for Recurrent neural network.pptx
Complete solution for Recurrent neural network.pptxComplete solution for Recurrent neural network.pptx
Complete solution for Recurrent neural network.pptxArunKumar674066
 
Convolutional Neural Network and RNN for OCR problem.
Convolutional Neural Network and RNN for OCR problem.Convolutional Neural Network and RNN for OCR problem.
Convolutional Neural Network and RNN for OCR problem.Vishal Mishra
 
Manifold Blurring Mean Shift algorithms for manifold denoising, report, 2012
Manifold Blurring Mean Shift algorithms for manifold denoising, report, 2012Manifold Blurring Mean Shift algorithms for manifold denoising, report, 2012
Manifold Blurring Mean Shift algorithms for manifold denoising, report, 2012Florent Renucci
 
PRML Chapter 12
PRML Chapter 12PRML Chapter 12
PRML Chapter 12Sunwoo Kim
 
lpSolve - R Library
lpSolve - R LibrarylpSolve - R Library
lpSolve - R LibraryDavid Faris
 
VauhkonenVohraMadaan-ProjectDeepLearningBenchMarks
VauhkonenVohraMadaan-ProjectDeepLearningBenchMarksVauhkonenVohraMadaan-ProjectDeepLearningBenchMarks
VauhkonenVohraMadaan-ProjectDeepLearningBenchMarksMumtaz Hannah Vauhkonen
 
TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...
TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...
TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...Simplilearn
 
WVKULAK13_submission_14
WVKULAK13_submission_14WVKULAK13_submission_14
WVKULAK13_submission_14Max De Koninck
 
A TUTORIAL ON POINTERS AND ARRAYS IN C
A TUTORIAL ON POINTERS AND ARRAYS IN CA TUTORIAL ON POINTERS AND ARRAYS IN C
A TUTORIAL ON POINTERS AND ARRAYS IN CJoshua Gorinson
 
Arrays and pointers
Arrays and pointersArrays and pointers
Arrays and pointersKevin Nguyen
 
SVM & KNN Presentation.pptx
SVM & KNN Presentation.pptxSVM & KNN Presentation.pptx
SVM & KNN Presentation.pptxMohamedMonir33
 

Similar to Precise LSTM Algorithm (20)

Illustrative Introductory Neural Networks
Illustrative Introductory Neural NetworksIllustrative Introductory Neural Networks
Illustrative Introductory Neural Networks
 
Concepts of Temporal CNN, Recurrent Neural Network, Attention
Concepts of Temporal CNN, Recurrent Neural Network, AttentionConcepts of Temporal CNN, Recurrent Neural Network, Attention
Concepts of Temporal CNN, Recurrent Neural Network, Attention
 
lepibwp74jd2rz.pdf
lepibwp74jd2rz.pdflepibwp74jd2rz.pdf
lepibwp74jd2rz.pdf
 
Introduction to deep learning
Introduction to deep learningIntroduction to deep learning
Introduction to deep learning
 
5.2 Least Squares Linear Regression.pptx
5.2  Least Squares Linear Regression.pptx5.2  Least Squares Linear Regression.pptx
5.2 Least Squares Linear Regression.pptx
 
Tensor Field Network (and other ConvNet Generalisations)
Tensor Field Network (and other ConvNet Generalisations)Tensor Field Network (and other ConvNet Generalisations)
Tensor Field Network (and other ConvNet Generalisations)
 
Complete solution for Recurrent neural network.pptx
Complete solution for Recurrent neural network.pptxComplete solution for Recurrent neural network.pptx
Complete solution for Recurrent neural network.pptx
 
Convolutional Neural Network and RNN for OCR problem.
Convolutional Neural Network and RNN for OCR problem.Convolutional Neural Network and RNN for OCR problem.
Convolutional Neural Network and RNN for OCR problem.
 
Manifold Blurring Mean Shift algorithms for manifold denoising, report, 2012
Manifold Blurring Mean Shift algorithms for manifold denoising, report, 2012Manifold Blurring Mean Shift algorithms for manifold denoising, report, 2012
Manifold Blurring Mean Shift algorithms for manifold denoising, report, 2012
 
Recurrent Neural Network
Recurrent Neural NetworkRecurrent Neural Network
Recurrent Neural Network
 
PRML Chapter 12
PRML Chapter 12PRML Chapter 12
PRML Chapter 12
 
lpSolve - R Library
lpSolve - R LibrarylpSolve - R Library
lpSolve - R Library
 
VauhkonenVohraMadaan-ProjectDeepLearningBenchMarks
VauhkonenVohraMadaan-ProjectDeepLearningBenchMarksVauhkonenVohraMadaan-ProjectDeepLearningBenchMarks
VauhkonenVohraMadaan-ProjectDeepLearningBenchMarks
 
Real time signal processing
Real time signal processingReal time signal processing
Real time signal processing
 
TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...
TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...
TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...
 
WVKULAK13_submission_14
WVKULAK13_submission_14WVKULAK13_submission_14
WVKULAK13_submission_14
 
Generative models
Generative modelsGenerative models
Generative models
 
A TUTORIAL ON POINTERS AND ARRAYS IN C
A TUTORIAL ON POINTERS AND ARRAYS IN CA TUTORIAL ON POINTERS AND ARRAYS IN C
A TUTORIAL ON POINTERS AND ARRAYS IN C
 
Arrays and pointers
Arrays and pointersArrays and pointers
Arrays and pointers
 
SVM & KNN Presentation.pptx
SVM & KNN Presentation.pptxSVM & KNN Presentation.pptx
SVM & KNN Presentation.pptx
 

More from YasutoTamura1

How to formulate reinforcement learning in illustrative ways
How to formulate reinforcement learning in illustrative waysHow to formulate reinforcement learning in illustrative ways
How to formulate reinforcement learning in illustrative waysYasutoTamura1
 
Reinforcement course material samples: lecture 1
Reinforcement course material samples: lecture 1Reinforcement course material samples: lecture 1
Reinforcement course material samples: lecture 1YasutoTamura1
 
NLP_deep_learning_intro.pptx
NLP_deep_learning_intro.pptxNLP_deep_learning_intro.pptx
NLP_deep_learning_intro.pptxYasutoTamura1
 
Brief instruction on backprop
Brief instruction on backpropBrief instruction on backprop
Brief instruction on backpropYasutoTamura1
 
Illustrative Introductory CNN
Illustrative Introductory CNNIllustrative Introductory CNN
Illustrative Introductory CNNYasutoTamura1
 

More from YasutoTamura1 (6)

How to formulate reinforcement learning in illustrative ways
How to formulate reinforcement learning in illustrative waysHow to formulate reinforcement learning in illustrative ways
How to formulate reinforcement learning in illustrative ways
 
Reinforcement course material samples: lecture 1
Reinforcement course material samples: lecture 1Reinforcement course material samples: lecture 1
Reinforcement course material samples: lecture 1
 
NLP_deep_learning_intro.pptx
NLP_deep_learning_intro.pptxNLP_deep_learning_intro.pptx
NLP_deep_learning_intro.pptx
 
RL_in_10_min.pptx
RL_in_10_min.pptxRL_in_10_min.pptx
RL_in_10_min.pptx
 
Brief instruction on backprop
Brief instruction on backpropBrief instruction on backprop
Brief instruction on backprop
 
Illustrative Introductory CNN
Illustrative Introductory CNNIllustrative Introductory CNN
Illustrative Introductory CNN
 

Recently uploaded

UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performancesivaprakash250
 
result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college projectTonystark477637
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...ranjana rawat
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...ranjana rawat
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINESIVASHANKAR N
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduitsrknatarajan
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdfKamal Acharya
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Call Girls in Nagpur High Profile
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdfankushspencer015
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...roncy bisnoi
 
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGMANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGSIVASHANKAR N
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSISrknatarajan
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...Call Girls in Nagpur High Profile
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations120cr0395
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Dr.Costas Sachpazis
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Bookingdharasingh5698
 

Recently uploaded (20)

UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college project
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduits
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdf
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGMANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 

Precise LSTM Algorithm

  • 1. DATANOMIQ GmbH | Franklinstr. 11 | 10587 Berlin Recurrent Neural Network
  • 2. Brief Review on Other Types of Neural Network Convolutional Neural NetworkDensely Connected Layers  Universal Classifier / Regressor  Effective for image processing
  • 3. Universal Approximation ⋮ ⋮ ⋮ ⋮ ⋮ ⋯ ⋯ ⋯ ⋮  It is proved that with enough units and layers, densely connected layers is capable of universal approximation.  Universal approximation means that you can approximate any mapping In this case, the densely connected layers is a function mapping a D-dim vector to a K-dim vector.
  • 4. An Example of Mapping : “Hello World!” of Machine Learning ⋮ ⋮ ⋮ ⋮ 1.0 1.0 1.0 1.0 ⋮ ⋮ ⋮ 0.2 0.3 ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ 1.0 1.0 Flattening 784-d vector 16-d vector 10-d vector ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ 3% ⋮ ⋮ ⋮ ⋮ 83% ⋮ ⋮ ⋮ ⋮ ⋮ 5% ‘5’784 pixel values  Classifying MNIST handwritten digit dataset is a mapping from 784-d vectors to 10-d vectors.  This is based on a hypothesis that a function exist such that maps vectors of 784 pixel values to 10-d vectors whose elements are probabilities.  Actually this neural net is capable of classifying handwritten digits with over 90% accuracy.
  • 5. Universal Approximation AlexNet (224, 224, 3) tensors Feature extraction part Classifying part 4096-d vector 1000-d vector  In a case of image classification with AlexNet, you flatten the last activation maps and get 4096-d vector.  The 4096-d space is more related to the “meaning” of input images.  And we can assume that the last three layers is a mapping from the “meaning,” “semantic” vectors to 1000-d vector.
  • 6. Why RNN?  Question : What is a critical constraint of densely connected layers and convolutional neural network?  Input size of those networks are fixed.  You cannot properly deal with inputs whose order matters. These are especially problematic when you work on “sequence data.”
  • 7. Sequence Data  Sequence data is sequences of vectors or matrices for several timesteps.  In some tasks we want to change sequence data to sequence data. RNN is mainly used for this type of task.  Assume that , .  It is know that RNN can approximate any mapping . Indexes of time steps *The ”time steps” don’t necessarily mean refer to real time. They are just the number of elements.
  • 8. Sequence Data  Sequence data mean types of data the orders of whose elements matter.  Voice/music  Video  Text
  • 9. Sequence Data  In natural language processing, including machine translation, usually a word or a part of a word is encoded as a vector(for example word embedding). 「そのオフィスでドイツ語が話せないのは私だけです。」  Let’s take a look the three sentences below in German, English, and Japanese. Assume that each sentence is encoded as , , . “Ich bin der Einzige in dem Büro, der kein Deutsch sprechen kann.” “I’m the only one who cannot speak German in the office.”  Let’s take sequence data as an example.
  • 10. Variety of Architectures of RNN Networks  Confusion in studying RNNs comes from their variety of architectures.  Maybe you will see mainly these types of charts when you start learning RNN. Bidirectional RNN Elman RNN
  • 11. Architecture of Simple RNN  Simple RNN is basically just a single densely connected layers with some layers.  This network propagates forward in one time step just as normal densely connected layers.  In the forward propagation of RNN, neurons in the middle layer also propagate to the middle layer itself.
  • 12. Architecture of a Simple RNN ⋯  It is normal to display every time step of forward propagation.  The inputs and the outputs depend on the time step.  When you display RNN, it is common to show the inner structure as a blackbox.  We share the same parameters at every time step.
  • 13. Architecture of a Simple RNN  This is the idea of “unfolding.”  Most RNNs more or less share the same structure.
  • 14. Forward Propagation of Simple RNN  Let’s take a look at RNN forward propagation from time step to .
  • 15. Forward Propagation of Simple RNN  If you unfold forward propagation of an RNN from time step to , it looks like this.  You share the same parameters at every time step.  Usually this type of chart simplified this way.
  • 16. Back Propagation of Simple RNN  There are some ways to calculate gradients with respect to parameters to optimize RNN.  In this slide, we are going to look at how BPTT (back propagation through time) works.  In BPTT we use an unfolded RNN chart, and error at every step propagates to the first step.
  • 17. Back Propagation of Simple RNN  To be honest, this part is not worth spending a lot of time because in conclusion simple RNNs are not useful.  But if you take a look at how BPTT of simple RNNs works, that would give you clear understandings of how other fancier RNNs work.  Please be patient over next some pages if you are interested.  I made the equations as concrete as possible so that it is easy to understand.
  • 18. BPTT for Simple RNN : Outline  You get an error at each time step, and they propagate to the first time step.  You need to calculate gradients with respect to parameters at each time step.  Assume that you have activated an RNN from step to step .
  • 19. BPTT for Simple RNN : Outline  You have to be careful that in order to calculate all those gradients, you use the same parameters irrelevant to time step.  In order to calculate you need errors which derive from later steps.  Let’s take an example of  You repeat that for all the gradients
  • 20. BPTT for Simple RNN : Outline  We don’t calculate such things as .  As I have just mentioned now, you use the same parameters to calculate gradients at every time step.
  • 21. BPTT for Simple RNN : Outline  And you renew parameters with summations of .  In my opinion, is not a proper notation. It should be denoted as below.  Gradients depend on time steps.
  • 22. BPTT for Simple RNN : Brief Review on Chain Rules  Just as backprop of normal densely connected layers, you use a lot of chain rules in backprop of RNN.  Let be a function of two variances , and the variance , be functions of , .  Let be a function of a variance and the variance be a function of
  • 23. BPTT for Simple RNN : Brief Review on Chain Rules  Let be a function of n variances and the variances be functions of m variances
  • 24. BPTT for Simple RNN : Brief Review on Chain Rules  This generalized chain rule is super important for back propagation.  For simplicity, let’s denote the function in the way below.  Again, the partial differentiation of with respect to is
  • 25.  First, let’s calculate .  This is a gradient at the last step, so you don’t have to consider the error backpropagated from the future steps.  Let’s look at each element of . BPTT for Simple RNN : Calculating *chain rule  Hence
  • 26. BPTT for Simple RNN : Calculating  Therefore calculating is equal to calculating .  You should keep it in mind that in , only is the function of .  Just as well, when you calculate , the error back propagates only from the error function above.
  • 27. BPTT for Simple RNN : Calculating  To calculate gradients , you need to consider errors backpropagated via recurrent connections.  You don’t need to consider recurrent connections to calculate.  Just as , you get . With recurrent connections With NO recurrent connections
  • 28. BPTT for Simple RNN : Calculating  Next, we use chain rules in some steps to calculate *chain rule *chain rule
  • 29. BPTT for Simple RNN : Calculating  We got this equation in the former slide :  You don’t need to consider recurrent connections to calculate  Just as , you get .  We got this equation in the former slide :  Hence
  • 30. BPTT for Simple RNN : Calculating  Hence
  • 31. BPTT for Simple RNN  From now on, especially, you have to be super careful about notations of gradients.  is a gradient with respect to neurons.  We don’t calculate such things as .  is a gradient with respect to parameters.  In other words, we share the same parameters every time step.  But parameters have different errors(gradients) , at EVERY time step.
  • 32. BPTT for Simple RNN  But some study materials use this notation in RNN backprop (for example a textbook by MIT, which is also recommended by Stanford University.).  When you see , that means you differentiate neurons at time step t, with respect to .  might not be a proper notation, because does not depend on time steps.  Therefore usually
  • 33. BPTT for Simple RNN : Calculating *chain rule *chain rule  We start concretely calculating gradients with respect to each parameter.  We need to calculate .
  • 34. BPTT for Simple RNN : Calculating  As we calculated in the last slide  Hence
  • 35. *chain rule *chain rule BPTT for Simple RNN : Calculating  We calculate
  • 36.  Hence  In the last slide we got BPTT for Simple RNN : Calculating
  • 37.  First we calculate BPTT for Simple RNN : Calculating  Hence  Then *chain rule
  • 38.  Just as formers slides BPTT for Simple RNN : Calculating  Hence Then *chain rule
  • 39. BPTT for Simple RNN : Calculating  Hence Then  Also *chain rule
  • 40. Difficulty of Training “Deep” Neural Network ⋯ ⋯  I mentioned that simple RNN is not useful.  Training RNN with sequence input over several time steps is virtually the same as training densely connected layers with several layers.  That is because of vanishing/exploding gradient problem.
  • 41. Vanishing / Exploding Gradient Problem ⋯  This vanishing/exploding problem is not only a problem of RNN.  In fact, if you just simply stack more than three layers of neural networks, it is very hard to train the network. ⋮ ⋮ ⋮ ⋮ ⋮ ⋯ ⋯ ⋯ ⋮
  • 42. Vanishing / Exploding Gradient Problem ⋮ ⋮ ⋮ ⋮ ⋮ ⋯ ⋯ ⋯ ⋮  Forward propagation is a repeating of linear summation and “squashing” the sum to a certain range. Linear summation Squashing the sum Linear summation Squashing the sum
  • 43. • Limitation of hardware A Super Brief History of Neural Nets 1940 ~ 1970: The 1st AI Boom 1980 ~ 1990: The 2nd Boom 2006 ~: The 3rd Boom 1980: Neocognitron 1986: Backpropagation 1989: LeNet 2006: Pretraining with deep belief net 1968: “2001 : A Space Odyssey” 1958: Perceptron 1st AI Winter • Vanishing / Exploding gradient problem • Hardships in linearly inseparable data problems • Lack of computation algorithms • Limitation of hardware • Shortage of data sources • Lack of theories for hyper parameters 2001: Release of Xbox Increase of data transmission speed The advent of the idea of AI 1995: Commercia- lization of the Internet 2nd AI Winter
  • 44. Deep Belief Network and Pretraining  Vanishing/exploding gradient problem had been one of the bottlenecks of deep neural net research.  In 2006, a group under Geoffrey Hinton discovered a way to tackle overfitting by pretraining of neural nets with deep belief networks.  This is counted as a breakthrough of deep learning research.
  • 45. Tackling Vanishing Gradient Problem with LSTM 1940 ~ 1970: The 1st AI Boom 1980 ~ 1990: The 2nd Boom 2006 ~: The 3rd Boom 1980: Neocognitron 1986: Backpropagation 1989: LeNet 2006: Pretraining with deep belief net 1968: “2001 : A Space Odyssey” 1958: Perceptron 1st AI Winter 2001: Release of Xbox The advent of the idea of AI 1995: Commercia- lization of the Internet 2nd AI Winter 1997: LSTM(Long-Short- Term-Memory)  Using LSTM is one way to deal with vanishing gradient problem of RNN.  The idea of LSTM was already discovered in 1997, I mean when I was born.
  • 46.  For example Francois Chollet, the developer of Keras, says…. LSTM  In short, he suggests we should not think too much about the structure of LSTM. With Keras, you can implement LSTM with one or two line codes.  But anyway, this slide is going to explain everything, contrary to his advice.  The structure of LSTM is much more complicated than simple RNN. Understanding its backprop also can be a pure torture.
  • 47.  Learning LSTM also can be confusing because you can find various charts which show the same idea. LSTM: Architecture  You can roughly classify the charts into two types.  This chart is closer to the one shown in the original paper.
  • 48.  In conclusion, I don’t recommend you to understand LSTM with this chart. LSTM: Forward Propagation  But many papers use this type of chart, so let’s BRIEFLY take a look at this first.  This is one block of LSTM.  Just as well as the normal RNN, each block gets an input, and gives out an output.  The output goes back to the block with recurrent connections. LSTM block
  • 49. LSTM: Forward Propagation  The black dot at the center is called cell, and it contains some information.  The output is calculated with the value of the cell.  Basically LSTM repeats renewing the cell and giving out an output every time.  To be exact, this type of LSTM is called “peephole LSTM.” For simplicity, however, we are going to think about LSTM without peephole connections in next some slides.
  • 50. LSTM: Forward Propagation  An LSTM block has sections named “gates.” Forget gate Input gate Output gate  Each gate gets an input at the time step and get the output of last time step.  There is no consensus on how to call the last gate. Let’s call it a “block input” in this slide. Block input  In each gate, the linear summations of those inputs and recurrent connections are activated with sigmoid functions.  But the block input is activated with a hyperbolic tangent.
  • 51. LSTM: Forward Propagation  The output gate also restrict the output by elementwise multiplication.
  • 52. LSTM: Forward Propagation  Let’s see how each gate in an LSTM block behaves at time step t. Forget gate Input gate Output gate  The outputs of the block input and the input gate are multiplied elementwise.  First, keep it in mind that dotted lines show recurrent connection, flow of information from one time step before. Block input connection in the time step connection with time-lag  The forget gate “forgets” the cell state at the last time step, by multiplying values from -1 to 1, elementwise.  And you renew the cell state.  The output gate also restrict the output by elementwise multiplication.
  • 54. LSTM: Architecture  Instead of the chart from the "Space Odyssey" paper, I recommend using this LSTM chart, drawn like an electronic circuit.
  • 56. LSTM: Forward Propagation  If you write down every equation on the former chart, it looks like this.
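As a hedged stand-in for the chart, here is a NumPy sketch of one forward step of a non-peephole LSTM block; the function name, weight names, and dict layout are my own choices for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, y_prev, c_prev, W, R, b):
    """One forward step of a non-peephole LSTM block.
    W, R, b are dicts keyed by 'z', 'i', 'f', 'o' (block input and the three gates)."""
    z = np.tanh(W['z'] @ x_t + R['z'] @ y_prev + b['z'])   # block input
    i = sigmoid(W['i'] @ x_t + R['i'] @ y_prev + b['i'])   # input gate
    f = sigmoid(W['f'] @ x_t + R['f'] @ y_prev + b['f'])   # forget gate
    o = sigmoid(W['o'] @ x_t + R['o'] @ y_prev + b['o'])   # output gate
    c = f * c_prev + i * z                                  # renew the cell state
    y = o * np.tanh(c)                                      # block output
    return y, c
```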
  • 57. BPTT for LSTM  This part is the climax of this slide deck.  It is going to be more complicated than before.  However, you only have to be careful about which variables affect which functions, just as in other types of backprop.  The backprop formulas in this slide are based on the paper "LSTM: A Search Space Odyssey."  This slide is going to show concretely how to get the equations in the paper.
  • 58. BPTT for LSTM  From now on we are going to denote gradients in the way below.  Just as in BPTT for a simple RNN, to calculate the error at time step t you need the errors at time steps s >= t+1.
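For reference, here is my own reconstruction of the paper's delta equations for the non-peephole case, which the following slides derive step by step; Δ^t is the error coming from the layer above, ⊙ is elementwise multiplication, and σ is the logistic sigmoid.

```latex
\begin{aligned}
\delta y^t &= \Delta^t + R_z^{\top}\delta\bar{z}^{\,t+1} + R_i^{\top}\delta\bar{i}^{\,t+1}
            + R_f^{\top}\delta\bar{f}^{\,t+1} + R_o^{\top}\delta\bar{o}^{\,t+1}\\
\delta\bar{o}^{\,t} &= \delta y^t \odot \tanh(c^t) \odot \sigma'(\bar{o}^{\,t})\\
\delta c^t &= \delta y^t \odot o^t \odot \bigl(1-\tanh^2(c^t)\bigr) + \delta c^{t+1} \odot f^{t+1}\\
\delta\bar{f}^{\,t} &= \delta c^t \odot c^{t-1} \odot \sigma'(\bar{f}^{\,t})\\
\delta\bar{i}^{\,t} &= \delta c^t \odot z^t \odot \sigma'(\bar{i}^{\,t})\\
\delta\bar{z}^{\,t} &= \delta c^t \odot i^t \odot \bigl(1-\tanh^2(\bar{z}^{\,t})\bigr)\\
\delta W_\star &= \sum_t \delta\bar{\star}^{\,t} \otimes x^t, \quad
\delta R_\star = \sum_t \delta\bar{\star}^{\,t+1} \otimes y^t, \quad
\delta b_\star = \sum_t \delta\bar{\star}^{\,t}
\end{aligned}
```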
  • 59. BPTT for LSTM: Calculating  At time step t of BPTT, first of all you can calculate as below. *At this point, we already know .  Calculating is relatively complicated; for now, let's skip that part.
  • 60. BPTT for LSTM: Calculating  First you need to calculate .  The relations of the variables are:  The flow of purple arrows in the chart shows how a variable directly affects other values.
  • 61. BPTT for LSTM: Calculating  Hence
  • 62. BPTT for Simple LSTM: Calculating  Next we calculate :
  • 63. BPTT for Simple LSTM: Calculating  If you continue calculating  Hence
  • 64. BPTT for Simple LSTM: Calculating  The relations of the variables are:  Hence
  • 65. BPTT for Simple LSTM: Calculating  The relations of the variables are:  Hence
  • 66. BPTT for Simple LSTM: Calculating  The relations of the variables are:  Hence
  • 67. BPTT for LSTM: Calculating  In a former slide, we saw is calculated as below.  Let’s see how we can get this equation.
  • 68. BPTT for LSTM: Calculating  Just as with backprop of a simple RNN, the error comes from time step  And from time step
  • 69. BPTT for LSTM: Calculating
  • 70. BPTT for LSTM: Calculating  Let's see how to calculate , especially the part
  • 71. BPTT for LSTM: Calculating  Then  Hence
  • 72. BPTT for LSTM: Calculating  Hence  In order to calculate , you need to calculate , and therefore .  And in general  Let's calculate , and then you can calculate the general case in just the same way. *chain rule
  • 73. BPTT for LSTM: Calculating  Hence  And in general  In order to calculate , you need to calculate , and therefore . *chain rule  At first let's calculate , and then you can calculate the general case in just the same way.
  • 74. BPTT for LSTM: Calculating  Hence  At first let's calculate , and then you can calculate the general case in just the same way.  In order to calculate , you need to calculate , and therefore . *chain rule  And in general
  • 75. BPTT for LSTM: Calculating  Hence  Let's calculate , as a representative.  In order to calculate , you need to calculate , and therefore . *chain rule  And likewise
  • 76. Variety of Architectures  We have finally looked over how LSTM works, and you should be proud of that.  But in practice the architecture of an RNN can be much more complicated. Stacked RNN  First, you can stack RNN layers on top of each other for more complicated mappings. Bidirectional RNN  You can also propagate information backward in time with a bidirectional RNN.
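As a hedged Keras sketch of those two ideas (all layer sizes are arbitrary placeholders): stacking works by letting each recurrent layer return its full output sequence to the next, and the Bidirectional wrapper adds the backward pass over the sequence.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Bidirectional, Dense

model = Sequential([
    # First recurrent layer: bidirectional, returns the whole output sequence
    Bidirectional(LSTM(64, return_sequences=True), input_shape=(None, 32)),
    # Second recurrent layer stacked on top of the first
    LSTM(64),
    Dense(10, activation="softmax"),
])
```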
  • 77. Bidirectional Connection  Machine translation is a typical case where you need bidirectional connections. “Ich bin der Einzige in dem Büro, der kein Deutsch sprechen kann.” “I am the only one who cannot speak German in the office.” 「そのオフィスでドイツ語が話せないのは私だけです。」  Later time steps in the original language can affect the earlier time steps in the target language.
  • 78. Further Problems  Question: What is the problem with this model?  If you naively use the RNNs we have seen, they basically get one input and give out one output at every time step.  Or they get an input at every time step and give out one output at the final time step.  The RNN models we have seen have a fundamental problem.
  • 79. Sequence to Sequence Model  It is easy to imagine that this simple model of RNN is not useful for machine translation.  Usually
  • 80. Architecture of Vanilla RNN  This model is not suitable for problems like machine translation or voice recognition.  Let's briefly take a look at how we can apply RNNs to such tasks.
  • 81. Sequence to Sequence Model: Machine Translation  [Diagram: an encoder RNN reads the German sentence "Ich bin der Einzige in dem Büro, der kein Deutsch sprechen kann." token by token, and a decoder RNN emits the Japanese sentence 「そのオフィスでドイツ語が話せないのは私だけです。」 token by token between [BOS] and [EOS].]  First you put the input (the original sentence) into the encoder.  At the last time step of the encoder, you get a hidden state.  You initialize the RNN of the decoder part with that hidden state and give [BOS] as an input.  You use each output as the input of the next time step.  You repeat this until the decoder RNN gives out [EOS].
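To make the loop concrete, here is a hedged Python sketch of the decoding procedure above; `encoder`, `decoder_step`, `BOS`, and `EOS` are hypothetical placeholders, not a real library API.

```python
def translate(source_tokens, encoder, decoder_step, BOS, EOS, max_len=50):
    """Greedy sequence-to-sequence decoding, following the steps on the slide."""
    state = encoder(source_tokens)                 # hidden state at the encoder's last time step
    token, output = BOS, []
    for _ in range(max_len):
        token, state = decoder_step(token, state)  # one decoder time step
        if token == EOS:                           # stop when the decoder emits [EOS]
            break
        output.append(token)                       # feed the output back in as the next input
    return output
```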
  • 82. Sequence to Sequence Model: How to Train  [Diagram: the same encoder–decoder pair, with the decoder inputs fixed to the Japanese target sentence.]  When you train this model, you fix the input sequence of the decoder to the target sentence instead of feeding back the decoder's own outputs (teacher forcing).  For example, if …
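A minimal Keras sketch of the training-time graph under this scheme; the vocabulary sizes, dimensions, and variable names are placeholders I chose for the example.

```python
from tensorflow.keras import Model, Input
from tensorflow.keras.layers import LSTM, Dense, Embedding

src_vocab, tgt_vocab, dim = 8000, 8000, 256  # placeholder sizes

# Encoder: keep only the final hidden and cell states
enc_in = Input(shape=(None,))
enc_emb = Embedding(src_vocab, dim)(enc_in)
_, h, c = LSTM(dim, return_state=True)(enc_emb)

# Decoder: initialized with the encoder states, fed the shifted target sequence
dec_in = Input(shape=(None,))
dec_emb = Embedding(tgt_vocab, dim)(dec_in)
dec_out, _, _ = LSTM(dim, return_sequences=True, return_state=True)(dec_emb, initial_state=[h, c])
probs = Dense(tgt_vocab, activation="softmax")(dec_out)

model = Model([enc_in, dec_in], probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit([source_ids, target_ids[:, :-1]], target_ids[:, 1:], ...)  # teacher forcing
```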
  • 83. Universal Approximation AlexNet (224, 224, 3) tensors Feature extraction part Classifying part 4096-d vector 1000-d vector  I'd like you to remember that I mentioned that a CNN can be divided into a "feature extraction part" and a "classifying part."  The 4096-d space is more related to the "meaning" of input images.  And we can assume that the last three layers are a mapping from the "meaning" vectors to a 1000-d vector.
  • 84. Sequence to Sequence Model: Image Captioning  [Diagram: a CNN encodes an image, and a decoder RNN emits a caption token by token between [BOS] and [EOS].]  Surprisingly, if you use the "semantic" vector from the CNN in place of the encoder's hidden state, essentially the same model can generate captions.  First you put the input (an image, rather than a sentence) into the CNN encoder.
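A hedged Keras sketch of that idea: the CNN feature vector (assumed here to be 4096-d, as in the AlexNet slide) is projected to the decoder's state size and used as its initial state; the vocabulary size and dimensions are placeholders.

```python
from tensorflow.keras import Model, Input
from tensorflow.keras.layers import Dense, Embedding, LSTM

feat_in = Input(shape=(4096,))                       # CNN "semantic" vector of the image
init_h = Dense(256, activation="tanh")(feat_in)      # project to the decoder's state size
init_c = Dense(256, activation="tanh")(feat_in)

cap_in = Input(shape=(None,))                        # shifted caption tokens (teacher forcing)
cap_emb = Embedding(10000, 256)(cap_in)
dec_out = LSTM(256, return_sequences=True)(cap_emb, initial_state=[init_h, init_c])
word_probs = Dense(10000, activation="softmax")(dec_out)

captioner = Model([feat_in, cap_in], word_probs)
```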
  • 85. Sequence to Sequence Model  However, these are very simplified versions of the sequence-to-sequence model.  In practice you have to consider several extensions: for example, stacking RNNs, bidirectional connections, attention, beam search, and building a language model.  You might need at least one semester for this topic; besides, natural language processing is a fast-changing, hot field.
  • 86. Sequence to Sequence Model  For example, the RNN model behind Google Translate stacks as many as 8 LSTM layers.  On my company laptop, it took me almost one week to train a 2-layer stacked LSTM on a much smaller training set.