2. RNNs are called recurrent because they perform the same task for every element of a sequence, with the output depending on the previous values.
RNNs have a "memory" which captures information about what has been calculated so far.
In theory, RNNs can make use of information in arbitrarily long sequences.
4. The hidden state st represents the memory of the network: st captures information about what happened in all the previous time steps.
The output ot is calculated solely based on the memory at time t.
In practice, st typically can't capture information from too many time steps ago.
Unlike a traditional deep neural network, which uses different parameters at each layer, an RNN shares the same parameters (U, V, W above) across all steps. This reflects the fact that we are performing the same task at each step, just with different inputs, and it greatly reduces the total number of parameters we need to learn.
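As a minimal sketch of this parameter sharing (the toy dimensions, tanh and softmax choices are assumptions, not taken from the slide), a vanilla RNN forward pass in NumPy reuses one (U, V, W) triple at every step:

import numpy as np

def rnn_forward(xs, U, V, W, s0):
    # Vanilla RNN with one shared parameter set:
    #   s_t = tanh(U x_t + W s_{t-1})   hidden state ("memory")
    #   o_t = softmax(V s_t)            output at time t
    s, outputs = s0, []
    for x in xs:
        s = np.tanh(U @ x + W @ s)
        logits = V @ s
        outputs.append(np.exp(logits) / np.exp(logits).sum())
    return outputs, s

# toy dimensions: input 3, hidden 4, output 2
rng = np.random.default_rng(0)
U, W, V = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(2, 4))
outs, s_last = rnn_forward([rng.normal(size=3) for _ in range(5)], U, V, W, np.zeros(4))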
5. RNN Advantages:
• Can process any length input
• Model size doesn't increase for longer input
• Computation for step t can (in theory) use information from many steps back
• Weights are shared across timesteps → representations are shared
RNN Disadvantages:
• Recurrent computation is slow
• In practice, difficult to access information from many steps back
Slide by Richard Socher
6. How do RNNs reduce complexity?
Given a function f: h', y = f(h, x)
[Figure: the same function f unrolled over time: (h1, y1) = f(h0, x1), (h2, y2) = f(h1, x2), (h3, y3) = f(h2, x3), ...]
No matter how long the input/output sequence is, we only need one function f. If the f's were different at each step, it would become a feedforward NN.
h and h' are vectors with the same dimension.
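A tiny Python sketch of this point, assuming only that f has the signature h', y = f(h, x) as above: the recurrent case reuses one f for any sequence length, while a different f per step would fix the number of steps (i.e., a feedforward NN).

def run_recurrent(f, h0, xs):
    # the same f is reused at every step, so any sequence length works
    h, ys = h0, []
    for x in xs:
        h, y = f(h, x)
        ys.append(y)
    return ys

def run_feedforward(fs, h0, xs):
    # a different f_t per step: the number of steps is fixed by len(fs)
    h, ys = h0, []
    for f_t, x in zip(fs, xs):
        h, y = f_t(h, x)
        ys.append(y)
    return ys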
7. Deep RNN
RNNs with multiple hidden layers
[Figure: a deep RNN with stacked hidden layers, unrolled over inputs x1 ... x6 and outputs y1 ... y6]
Slide by Arun Mallya
8. Bidirectional RNN
RNNs can process the input sequence in both the forward and the reverse direction.
[Figure: a bidirectional RNN unrolled over inputs x1 ... x6 and outputs y1 ... y6, with one layer running left-to-right and one running right-to-left]
Popular in speech recognition.
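A minimal NumPy sketch of the idea (the weight names Wf, Uf, Wb, Ub and the tanh update are assumptions): run one pass left-to-right, one right-to-left, and concatenate the hidden states per time step.

import numpy as np

def rnn_states(xs, W, U, h0):
    # one directional pass: h_t = tanh(W x_t + U h_{t-1})
    h, states = h0, []
    for x in xs:
        h = np.tanh(W @ x + U @ h)
        states.append(h)
    return states

def birnn_states(xs, Wf, Uf, Wb, Ub, h0):
    fwd = rnn_states(xs, Wf, Uf, h0)                # left-to-right pass
    bwd = rnn_states(xs[::-1], Wb, Ub, h0)[::-1]    # right-to-left pass, re-aligned to time order
    # each time step now sees the whole sequence: the past via fwd, the future via bwd
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]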
9. Pyramid RNN
Reducing the number of time steps: each bidirectional RNN layer combines adjacent time steps before passing them up, which significantly speeds up training.
W. Chan, N. Jaitly, Q. Le and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," ICASSP, 2016.
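A small sketch of the pyramid step (the helper name and the drop-last-frame handling of odd lengths are my own choices): concatenating adjacent hidden states halves the number of time steps seen by the next layer.

import numpy as np

def pyramid_reduce(states):
    # concatenate each pair of adjacent time steps so the layer above
    # sees half as many steps (as in the pyramidal encoder of Listen, Attend and Spell)
    if len(states) % 2 == 1:
        states = states[:-1]   # drop a trailing frame if the length is odd
    return [np.concatenate([states[i], states[i + 1]]) for i in range(0, len(states), 2)]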
13. RNN Training
Target: obtain the network parameters that optimize the cost function.
Cost functions: log loss, root mean squared error, etc.
Tasks:
• For each timestamp of the input sequence x, predict an output y (synchronously)
• For the input sequence x, predict a scalar value y (e.g., at the end of the sequence)
• For the input sequence x of length Lx, generate an output sequence y of a different length Ly
Methods:
• Backpropagation: reliable and controlled convergence; supported by most ML frameworks
• Research: evolutionary methods, expectation maximization, non-parametric methods, particle swarm optimization
14. Back-propagation Through Time
1. Unfold the network.
2. Repeat for the training data:
   1. Given some input sequence x
   2. For t in 0, ..., N-1:
      1. Forward-propagate
      2. Initialize the hidden state to the past value ht−1
      3. Obtain the output sequence ŷ
      4. Calculate the error E(y, ŷ)
      5. Back-propagate the error across the unfolded network
      6. Average the weights
      7. Compute the next hidden state value ht
Forward equations:
ht = σ(Wih xt + U ht−1 + b)
yt = σ(Wo ht)
Cross-entropy loss: E(y, ŷ) = − Σt yt log ŷt
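A hedged PyTorch sketch of this training loop (toy sizes; torch's RNNCell uses tanh rather than the σ written above, and cross_entropy applies the softmax internally): unrolling the cell over the sequence and calling backward() performs back-propagation through time.

import torch
import torch.nn as nn

# toy sizes (assumptions): input dim 8, hidden dim 16, 10 output classes
cell, readout = nn.RNNCell(8, 16), nn.Linear(16, 10)
opt = torch.optim.SGD(list(cell.parameters()) + list(readout.parameters()), lr=0.1)

xs = torch.randn(5, 3, 8)                # sequence of length 5, batch of 3
targets = torch.randint(0, 10, (5, 3))   # one target class per time step

h = torch.zeros(3, 16)
loss = torch.tensor(0.0)
for t in range(xs.shape[0]):
    h = cell(xs[t], h)                                                   # ht from xt and ht-1
    loss = loss + nn.functional.cross_entropy(readout(h), targets[t])    # adds -log ŷt for the target class

opt.zero_grad()
loss.backward()   # back-propagates the error through every unrolled time step
opt.step()        # one parameter update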
17. Problem: vanishing gradients
Saturation: the gradient of a saturated neuron is close to 0, which drives the gradients of previous layers toward 0 (especially for far time stamps).
This is a known problem for deep feed-forward networks; for recurrent networks (even shallow ones) it makes it impossible to learn long-term dependencies!
• Smaller weight parameters lead to faster gradient vanishing.
• Very big initial parameters make gradient descent diverge fast (explode).
∂ht/∂h0 = ∂ht/∂ht−1 ∙ ⋯ ∙ ∂h3/∂h2 ∙ ∂h2/∂h1 ∙ ∂h1/∂h0
• This product decays exponentially.
• The network stops learning and can't update.
• It becomes impossible to learn correlations between temporally distant events.
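A toy numerical illustration of this decay (not from the slides; the 8-dimensional size and the choice to drop the diag(1 − h²) factor, whose entries are at most 1, are simplifications): multiplying Jacobian factors built from a small recurrent matrix drives the gradient norm toward 0.

import numpy as np

rng = np.random.default_rng(0)
U = 0.1 * rng.normal(size=(8, 8))   # deliberately small recurrent weights
grad = np.eye(8)                    # ∂h_t / ∂h_t
norms = []
for _ in range(50):
    # each factor ∂h_k/∂h_{k-1} is diag(1 - h_k**2) @ U for a tanh RNN;
    # multiplying by U alone already shows the exponential trend
    grad = grad @ U
    norms.append(np.linalg.norm(grad))
print(norms[0], norms[10], norms[-1])   # the norm shrinks toward 0 (vanishing gradient)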
18. Problem: exploding gradients
The network cannot converge and the weight parameters do not stabilize.
Diagnostics: a large increase in the norm of the gradient during training; NaNs; large fluctuations of the cost function.
Pascanu R. et al., "On the difficulty of training recurrent neural networks," arXiv (2012).
Solutions:
• Use gradient clipping (sketched below)
• Try reducing the learning rate
• Change the loss function by setting constraints on the weights (L1/L2 norms)
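A short PyTorch sketch of the gradient-clipping solution (the model, sizes and placeholder loss are arbitrary): clip after backward() and before the optimizer step.

import torch
import torch.nn as nn

model = nn.RNN(8, 16)                      # any recurrent model (sizes are placeholders)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(30, 4, 8)                  # a long sequence, batch of 4
out, _ = model(x)
loss = out.pow(2).mean()                   # placeholder loss just to get gradients
opt.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)   # rescale gradients if their norm exceeds 1
opt.step()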
19. Problems with the Vanilla RNN
• When dealing with a time series, it tends to forget old information. When there is a distant relationship of unknown length, we wish to have a "memory" for it.
• The vanishing gradient problem.
21. The sigmoid layer outputs numbers between 0 and 1 that determine how much of each component should be let through. The pink × gate is a point-wise multiplication.
22. LSTM
The core idea is the cell state Ct: it is changed slowly, with only minor linear interactions, so it is very easy for information to flow along it unchanged.
[Figure: the LSTM cell receives ht-1 and Ct-1 from the previous time step]
• Forget gate: this sigmoid gate determines how much information goes through.
• Input gate: decides what information is to be added to the cell state.
• Output gate: controls what goes into the output.
Why sigmoid or tanh?
• Sigmoid: its (0, 1) range acts as a gating switch.
• The vanishing gradient problem is already handled in the LSTM, so replacing tanh with ReLU is OK.
23. Updating the cell state
• The input gate it decides which components are to be updated.
• C't provides the change contents (the candidate values).
• Finally, we decide what part of the cell state to output.
25. LSTM – Forward/Backward
Explore the equations for LSTM forward and backward
computations:
Illustrated LSTM Forward and Backward Pass
26. Information Flow in an LSTM
At each time step, xt and ht-1 are combined into four vectors:
z  = tanh(W [xt, ht-1])    updating information (candidate content)
zi = σ(Wi [xt, ht-1])      controls the input gate
zf = σ(Wf [xt, ht-1])      controls the forget gate
zo = σ(Wo [xt, ht-1])      controls the output gate
These four matrix computations can be done concurrently.
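A minimal NumPy sketch of one LSTM step using these four quantities (biases are omitted here, as in the equations above):

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x_t, h_prev, c_prev, W, Wi, Wf, Wo):
    v = np.concatenate([x_t, h_prev])   # [xt, ht-1] stacked into one vector
    z  = np.tanh(W @ v)                 # candidate content (updating information)
    zi = sigmoid(Wi @ v)                # input gate
    zf = sigmoid(Wf @ v)                # forget gate
    zo = sigmoid(Wo @ v)                # output gate
    c_t = zf * c_prev + zi * z          # cell state: keep part of the old, add part of the new
    h_t = zo * np.tanh(c_t)             # exposed hidden state
    return h_t, c_t

# "done concurrently" means the four products can be one stacked matrix multiply:
# big = np.vstack([W, Wi, Wf, Wo]); z, zi, zf, zo = np.split(big @ v, 4)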
29. GRU – Gated Recurrent Unit
It combines the forget and input gates into a single update gate, and it also merges the cell state and hidden state. This is simpler than the LSTM. There are many other variants too.
[Figure: GRU cell with a reset gate and an update gate; ×/* denotes element-wise multiplication]
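A minimal NumPy sketch of one GRU step (the exact equations are the standard formulation and are not spelled out on the slide):

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x_t, h_prev, Wz, Wr, Wh):
    v = np.concatenate([x_t, h_prev])
    z = sigmoid(Wz @ v)                                       # update gate (merged forget/input gate)
    r = sigmoid(Wr @ v)                                       # reset gate
    h_cand = np.tanh(Wh @ np.concatenate([x_t, r * h_prev]))  # candidate state
    return (1.0 - z) * h_prev + z * h_cand                    # no separate cell state is carried along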
30. Both take xt and ht-1 as inputs and compute ht to pass along. GRUs don't need the cell state to pass values along. The calculations within each iteration ensure that the ht values being passed along either retain a high amount of old information or are jump-started with a high amount of new information.
31. Feed-forward vs Recurrent Network
[Figure: a feed-forward network x → f1 → f2 → f3 → f4 → y, where t indexes the layer, next to a recurrent network unrolled over x1 ... x4 with a shared f producing y4, where t indexes the time step]
1. A feed-forward network does not have an input at each step.
2. A feed-forward network has different parameters for each layer.
Feed-forward: at = ft(at-1) = σ(Wt at-1 + bt)
Recurrent: at = f(at-1, xt) = σ(Wh at-1 + Wi xt + bi)
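A small NumPy sketch of these two equations side by side (weight names follow the slide; sizes are up to the caller):

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def feedforward(a0, Ws, bs):
    # a_t = f_t(a_{t-1}) = σ(W_t a_{t-1} + b_t): different parameters per layer, no per-step input
    a = a0
    for W_t, b_t in zip(Ws, bs):
        a = sigmoid(W_t @ a + b_t)
    return a

def recurrent(a0, xs, Wh, Wi, b):
    # a_t = f(a_{t-1}, x_t) = σ(Wh a_{t-1} + Wi x_t + b): one shared parameter set, one input per step
    a = a0
    for x_t in xs:
        a = sigmoid(Wh @ a + Wi @ x_t + b)
    return a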
35. Chat with context
[Figure: example exchanges between a user (U) and the machine (M), e.g. U: Hi → M: Hi / M: Hello]
Serban, Iulian V., Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau, 2015, "Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models."
38. Image caption generation using attention (from CY Lee's lecture)
[Figure: a CNN produces a vector for each image region; an initial vector z0 is matched against each region vector, giving a match score (e.g., 0.7)]
z0 is an initial parameter; it is also learned.
39. Image Caption Generation
[Figure: attention weights over the region vectors (e.g., 0.7, 0.1, 0.1 / 0.1, 0.0, 0.0) give a weighted sum of the CNN region vectors; together with z0 this produces Word 1 and the next state z1]
40. Image Caption Generation
[Figure: at the next step, new attention weights (e.g., 0.0, 0.8, 0.2 / 0.0, 0.0, 0.0) give another weighted sum; together with z1 this produces Word 2 and the next state z2]
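A minimal NumPy sketch of this attention step (the dot-product "match" score is an assumption; the actual scoring function in the paper is learned): score each region vector against the current state z, softmax into weights, and take the weighted sum.

import numpy as np

def attend(region_vectors, z):
    # score each region against the current state z, normalize with a softmax,
    # and return the attention weights plus the weighted sum of region vectors
    scores = np.array([r @ z for r in region_vectors])
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()             # e.g. 0.7, 0.1, 0.1, ... over the regions
    context = sum(w * r for w, r in zip(weights, region_vectors))
    return weights, context                       # the context vector feeds the RNN that emits the next word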
41–42. Image Caption Generation
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio, "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", ICML, 2015.
43. Image Caption Generation
Deep Visual-Semantic Alignments for Generating Image Descriptions.
Source: http://cs.stanford.edu/people/karpathy/deepimagesent/
44. Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, Aaron Courville, "Describing Videos by Exploiting Temporal Structure", ICCV, 2015.