Long short-term memory
Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term
memory." Neural computation 9.8 (1997): 1735-1780.
01
Long Short-Term Memory (LSTM)
Olivia Ni
• Recurrent Neural Networks (RNN)
• The Problem of Long-Term Dependencies
• LSTM Networks
• The Core Idea Behind LSTMs
• Step-by-Step LSTM Walk Through
• Variants on LSTMs
• Conclusions & References
• Appendix (BPTT & Gradient Exploding/Vanishing)
02
Outline
• Idea:
• condition the neural network on all previous information and tie the weights
at each time step
• Assumption: temporal information matters (i.e. time series data)
03
Recurrent Neural Networks (RNN)
[Figure: an RNN unrolled over time steps t-1, t, t+1; each cell takes Input_t and the previous short-term memory STM_(t-1), and produces Output_t and STM_t]
• STM = Short-term memory
• RNN Definition:
• Model Training:
• All model parameters 𝜃 = 𝑈, 𝑉, 𝑊 can be updated by gradient descent
04
Recurrent Neural Networks (RNN)
$s_t = \sigma(U x_t + W s_{t-1})$
$o_t = \mathrm{softmax}(V s_t)$
$\theta^{(i+1)} \leftarrow \theta^{(i)} - \eta \nabla_\theta C(\theta^{(i)})$
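The recurrence above fits in a few lines of code. Below is a minimal NumPy sketch (not from the original slides) of one RNN step with tied weights, assuming tanh as the activation $\sigma$; all sizes are illustrative.

```python
# Minimal RNN step: s_t = tanh(U x_t + W s_{t-1}), o_t = softmax(V s_t)
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def rnn_step(x_t, s_prev, U, W, V):
    s_t = np.tanh(U @ x_t + W @ s_prev)   # new short-term memory
    o_t = softmax(V @ s_t)                # output distribution
    return s_t, o_t

# Toy usage: input dim 4, hidden dim 3, output dim 5, same U, W, V at every step.
rng = np.random.default_rng(0)
U, W, V = rng.normal(size=(3, 4)), rng.normal(size=(3, 3)), rng.normal(size=(5, 3))
s = np.zeros(3)
for x in rng.normal(size=(6, 4)):        # a length-6 input sequence
    s, o = rnn_step(x, s, U, W, V)
```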
• Example: (Consider trying to predict the last word in the text)
• Issue: in theory, RNNs can handle such “long-term dependencies,” but
they cannot in practice!!
“The clouds are in the sky.”
“I grew up in France… I speak fluent French.”
05
The Problem of Long-Term Dependencies
• RNN Training Issue:
(1) The gradient is a product of Jacobian matrices, each associated with a step
in the forward computation
(2) Multiply the same matrix at each time step during BPTT
• The gradient becomes very small or very large quickly
• Vanishing or Exploding gradient
• The error surface is either very flat or very steep
06
The Problem of Long-Term Dependencies
• Possible Solutions:
• Gradient Exploding:
• Clipping (https://arxiv.org/abs/1211.5063?context=cs); a sketch follows this slide
• Gradient Vanishing:
• Better Initialization (https://arxiv.org/abs/1504.00941)
• Gating Mechanism (LSTM, GRU, …, etc.)
• Attention Mechanism (https://arxiv.org/pdf/1706.03762.pdf)
07
The Problem of Long-Term Dependencies
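For the exploding-gradient remedy listed above, here is a minimal sketch of norm-based gradient clipping; the function name and the threshold value are illustrative and not taken from the linked paper.

```python
# Rescale a list of gradient arrays so their joint L2 norm does not exceed max_norm.
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = max_norm / (total_norm + 1e-12)
    if scale < 1.0:                      # only rescale when the norm is too large
        grads = [g * scale for g in grads]
    return grads

# Toy usage: clip two dummy gradients before the parameter update.
grads = [np.full((3, 3), 10.0), np.full(3, 10.0)]
clipped = clip_by_global_norm(grads, max_norm=5.0)
```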
08
LSTM Networks – The Core Idea Behind LSTMs
[Figure: the RNN chain from above compared with the same chain of LSTM cells; in addition to the short-term memory STM_t passed between steps, each LSTM cell also passes a long-term memory LSTM_t]
• STM = Short-term memory
• LSTM = Long Short-term memory
09
LSTM Networks – The Core Idea Behind LSTMs
• STM = Short-term memory
• LSTM = Long Short-term memory
[Figure: a single LSTM cell carries two recurrent states: a long-term memory (the cell state) and a short-term memory (the hidden state)]
10
LSTM Networks – Step-by-Step LSTM Walk Through (0/4)
• The cell state runs straight down
the entire chain, with only some
minor linear interactions.
⇒ Easy for information to flow
along it unchanged
• The LSTM does have the ability
to remove or add information to
the cell state, carefully regulated
by structures called gates.
11
LSTM Networks – Step-by-Step LSTM Walk Through (1/4)
• Forget gate (sigmoid + pointwise
multiplication operation):
decides what information we’re
going to throw away from the
cell state
• 1: "completely keep this"
• 0: "completely get rid of this"
12
LSTM Networks – Step-by-Step LSTM Walk Through (2/4)
• Input gate (sigmoid + pointwise
multiplication operation):
decides what new information
we’re going to store in the cell
state
• (The new candidate values are produced by a tanh layer, the same computation a vanilla RNN uses for its state.)
13
LSTM Networks – Step-by-Step LSTM Walk Through (3/4)
• Cell state update: forgets the things we decided to forget earlier and adds the new candidate values, scaled by how much we decided to update each value
• 𝑓𝑡: decides what to forget
• 𝑖𝑡: decides what to update
⟹ 𝐶𝑡 is updated at timestamp 𝑡, and it changes slowly!
14
LSTM Networks – Step-by-Step LSTM Walk Through (4/4)
• Output gate (sigmoid + pointwise multiplication operation): decides what information we're going to output
⟹ ℎ𝑡 is updated at timestamp 𝑡, and it changes faster!
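The four walk-through steps above map directly onto code. Below is a minimal NumPy sketch (weight and bias names are illustrative, not from the slides), with the cell state and hidden state updated exactly as described.

```python
# One LSTM step: forget gate, input gate, cell update, output gate.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wf, Wi, Wc, Wo, bf, bi, bc, bo):
    z = np.concatenate([h_prev, x_t])       # gates all read [h_{t-1}, x_t]
    f_t = sigmoid(Wf @ z + bf)               # (1/4) forget gate
    i_t = sigmoid(Wi @ z + bi)               # (2/4) input gate
    c_tilde = np.tanh(Wc @ z + bc)           #       new candidate values
    c_t = f_t * c_prev + i_t * c_tilde       # (3/4) cell state: changes slowly
    o_t = sigmoid(Wo @ z + bo)               # (4/4) output gate
    h_t = o_t * np.tanh(c_t)                 #       hidden state: changes faster
    return h_t, c_t

# Toy usage: input dim 4, hidden/cell dim 3.
rng = np.random.default_rng(0)
Wf, Wi, Wc, Wo = (rng.normal(size=(3, 7)) for _ in range(4))
bf = bi = bc = bo = np.zeros(3)
h, c = np.zeros(3), np.zeros(3)
h, c = lstm_step(rng.normal(size=4), h, c, Wf, Wi, Wc, Wo, bf, bi, bc, bo)
```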
15
LSTM Networks – Variants on LSTMs (1/3)
• LSTM with Peephole Connections
• Idea: allow gate layers to look at
the cell state
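A minimal sketch of how the gate computations change with peepholes, reusing the illustrative names from the lstm_step sketch above: the gates additionally read the cell state.

```python
# Peephole change only: the forget and input gates also "look at" C_{t-1}.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def peephole_gates(x_t, h_prev, c_prev, Wf, Wi, bf, bi):
    z = np.concatenate([c_prev, h_prev, x_t])   # cell state added to the gate input
    f_t = sigmoid(Wf @ z + bf)                   # forget gate with peephole
    i_t = sigmoid(Wi @ z + bi)                   # input gate with peephole
    return f_t, i_t
```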
16
LSTM Networks – Variants on LSTMs (2/3)
• LSTM with Coupled Forget/ Input
Gate
• Idea: we only forget when we’re
going to input something in its
place, and vice versa.
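The coupled variant changes only the cell-state update from the earlier sketch; a small illustration (names reused from the lstm_step sketch above):

```python
def coupled_cell_update(f_t, c_prev, c_tilde):
    """Coupled forget/input gate: the fraction we forget (1 - f_t) is exactly
    the fraction replaced by the new candidate values."""
    return f_t * c_prev + (1.0 - f_t) * c_tilde   # replaces f_t*c_prev + i_t*c_tilde

# e.g. f_t = 0.9 keeps 90% of the old cell state and admits 10% of the candidate
print(coupled_cell_update(0.9, 1.0, -2.0))        # 0.9 * 1.0 + 0.1 * (-2.0) = 0.7
```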
17
LSTM Networks – Variants on LSTMs (3/3)
• Gated Recurrent Unit (GRU)
• Idea:
• combine the forget and input gates
into a single “update gate”
• merge the cell state and hidden state
Update gate: $z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$
Reset gate: $r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$
State candidate: $\tilde{h}_t = \tanh(W \cdot [r_t \ast h_{t-1}, x_t])$
Current state: $h_t = (1 - z_t) \ast h_{t-1} + z_t \ast \tilde{h}_t$
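A minimal NumPy sketch of one GRU step following the equations above (biases omitted for brevity; weight names are illustrative):

```python
# One GRU step: update gate, reset gate, state candidate, current state.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wz, Wr, Wh):
    z_t = sigmoid(Wz @ np.concatenate([h_prev, x_t]))             # update gate
    r_t = sigmoid(Wr @ np.concatenate([h_prev, x_t]))             # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r_t * h_prev, x_t]))   # state candidate
    h_t = (1.0 - z_t) * h_prev + z_t * h_tilde                    # current state
    return h_t

# Toy usage: input dim 4, hidden dim 3.
rng = np.random.default_rng(0)
h = gru_step(rng.normal(size=4), np.zeros(3),
             rng.normal(size=(3, 7)), rng.normal(size=(3, 7)), rng.normal(size=(3, 7)))
```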
• Review: Backpropagation (BP)
• Explain by: Backpropagation Through Time (BPTT)
• RNN Training Issue: Gradient Vanishing / Gradient Exploding
18
Appendix – The Problem of Long-Term Dependencies
• Gradient Descent for Neural Networks:
• The gradient $\frac{\partial C(\theta)}{\partial w_{ij}^{(l)}}$ involves millions of parameters.
• To compute it efficiently, we use backpropagation.
• Each gradient entry is the product of two pre-computed terms, one from the forward pass and one from the backward pass.
19
Appendix – The Problem of Long-Term Dependencies
• WLOG, we use $w_{ij}^{(l)}$ to demonstrate: $\frac{\partial C(\theta)}{\partial w_{ij}^{(l)}} = \frac{\partial C(\theta)}{\partial z_i^{(l)}}\,\frac{\partial z_i^{(l)}}{\partial w_{ij}^{(l)}}$
• Forward pass: $\frac{\partial z_i^{(l)}}{\partial w_{ij}^{(l)}} = \begin{cases} x_j, & \text{if } l = 1 \\ a_j^{(l-1)}, & \text{if } l > 1 \end{cases}$
20
Appendix – The Problem of Long-Term Dependencies
• WLOG, we use $w_{ij}^{(l)}$ to demonstrate: $\frac{\partial C(\theta)}{\partial w_{ij}^{(l)}} = \frac{\partial C(\theta)}{\partial z_i^{(l)}}\,\frac{\partial z_i^{(l)}}{\partial w_{ij}^{(l)}}$
• Backward pass:
• Output layer ($l = L$): $\delta_i^{(L)} = \frac{\partial C(\theta)}{\partial z_i^{(L)}} = \frac{\partial C(\theta)}{\partial y_i}\,\frac{\partial y_i}{\partial z_i^{(L)}} = \frac{\partial C(\theta)}{\partial y_i}\,\frac{\partial \sigma(z_i^{(L)})}{\partial z_i^{(L)}} = \frac{\partial C(\theta)}{\partial y_i}\,\sigma'(z_i^{(L)})$
• Hidden layers ($l < L$): $\delta_i^{(l)} = \frac{\partial C(\theta)}{\partial z_i^{(l)}} = \sum_k \frac{\partial C(\theta)}{\partial z_k^{(l+1)}}\,\frac{\partial z_k^{(l+1)}}{\partial a_i^{(l)}}\,\frac{\partial a_i^{(l)}}{\partial z_i^{(l)}} = \sigma'(z_i^{(l)}) \sum_k \delta_k^{(l+1)} w_{ki}^{(l+1)}$
• Combining both cases:
$\delta_i^{(l)} \triangleq \frac{\partial C(\theta)}{\partial z_i^{(l)}} = \begin{cases} \sigma'(z_i^{(L)})\,\frac{\partial C(\theta)}{\partial y_i}, & \text{if } l = L \\ \sigma'(z_i^{(l)})\,\sum_k \delta_k^{(l+1)} w_{ki}^{(l+1)}, & \text{if } l < L \end{cases}$
21
Appendix – The Problem of Long-Term Dependencies
• WLOG, we use $w_{ij}^{(l)}$ to demonstrate: $\frac{\partial C(\theta)}{\partial w_{ij}^{(l)}} = \frac{\partial C(\theta)}{\partial z_i^{(l)}}\,\frac{\partial z_i^{(l)}}{\partial w_{ij}^{(l)}}$
• Backward pass (summary):
$\delta_i^{(l)} \triangleq \frac{\partial C(\theta)}{\partial z_i^{(l)}} = \begin{cases} \sigma'(z_i^{(L)})\,\frac{\partial C(\theta)}{\partial y_i}, & \text{if } l = L \\ \sigma'(z_i^{(l)})\,\sum_k \delta_k^{(l+1)} w_{ki}^{(l+1)}, & \text{if } l < L \end{cases}$
22
Appendix – The Problem of Long-Term Dependencies
• Concluding Remarks for Backpropagation (BP):
$\frac{\partial C(\theta)}{\partial w_{ij}^{(l)}} = \underbrace{\frac{\partial C(\theta)}{\partial z_i^{(l)}}}_{\text{backward pass: } \delta_i^{(l)}}\;\underbrace{\frac{\partial z_i^{(l)}}{\partial w_{ij}^{(l)}}}_{\text{forward pass}}$
23
Appendix – The Problem of Long-Term Dependencies
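As a concrete instance of the BP decomposition above, here is a minimal NumPy sketch for a tiny two-layer sigmoid network with squared-error cost; all sizes and values are illustrative.

```python
# Forward/backward pass for a 2-layer sigmoid network, following the BP formulas above.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x, target = rng.normal(size=3), rng.normal(size=2)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))   # layers l=1 and l=2 (=L)

# Forward pass: keep z^(l) and a^(l), since dz^(l)/dw^(l) is a^(l-1) (or x for l=1)
z1 = W1 @ x;  a1 = sigmoid(z1)
z2 = W2 @ a1; y  = sigmoid(z2)
C = 0.5 * np.sum((y - target) ** 2)

# Backward pass: delta^(L) = sigma'(z^(L)) * dC/dy, then propagate through W^T
delta2 = sigmoid(z2) * (1 - sigmoid(z2)) * (y - target)      # l = L
delta1 = sigmoid(z1) * (1 - sigmoid(z1)) * (W2.T @ delta2)   # l < L
dC_dW2 = np.outer(delta2, a1)                                # dC/dw_ij^(2)
dC_dW1 = np.outer(delta1, x)                                 # dC/dw_ij^(1)
```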
• Recap: Recurrent Neural Network (RNN) Architecture
• Model Training:
• All model parameters $\theta = \{U, V, W\}$ can be updated by gradient descent
$s_t = \sigma(U x_t + W s_{t-1})$
$o_t = \mathrm{softmax}(V s_t)$
$\theta^{(i+1)} \leftarrow \theta^{(i)} - \eta \nabla_\theta C(\theta^{(i)})$
24
Appendix – The Problem of Long-Term Dependencies
25
Appendix – The Problem of Long-Term Dependencies
$s_t = \sigma(U x_t + W s_{t-1})$, $o_t = \mathrm{softmax}(V s_t)$
For $\theta = \{U, V, W\}$, update by $\theta^{(i+1)} \leftarrow \theta^{(i)} - \eta \nabla_\theta C(\theta^{(i)})$, where $C \triangleq \sum_t C^{(t)}$ (WLOG, using $C^{(3)}$ to explain)
• If $\theta$ were NOT tied across time steps, each unrolled copy would get its own update:
$W^{(k)} \leftarrow W^{(k)} - \frac{\partial C^{(3)}(\theta)}{\partial W^{(k)}}$ and $U^{(k)} \leftarrow U^{(k)} - \frac{\partial C^{(3)}(\theta)}{\partial U^{(k)}}$ for $k = 1, 2, 3$; $\quad V^{(3)} \leftarrow V^{(3)} - \frac{\partial C^{(3)}(\theta)}{\partial V^{(3)}}$
• Since $\theta$ IS tied (the RNN case), the per-step contributions are summed into a single update:
$W \leftarrow W - \sum_{k=1}^{3}\frac{\partial C^{(3)}(\theta)}{\partial W^{(k)}}$, $\quad U \leftarrow U - \sum_{k=1}^{3}\frac{\partial C^{(3)}(\theta)}{\partial U^{(k)}}$, $\quad V \leftarrow V - \frac{\partial C^{(3)}(\theta)}{\partial V^{(3)}}$
26
Appendix – The Problem of Long-Term Dependencies
$s_t = \sigma(U x_t + W s_{t-1})$, $o_t = \mathrm{softmax}(V s_t)$; $C \triangleq \sum_t C^{(t)}$ (WLOG, using $C^{(3)}$ to explain)
• With tied $\theta$, BPTT expands each gradient as a sum over all earlier time steps:
$\frac{\partial C^{(3)}(\theta)}{\partial W} = \frac{\partial C^{(3)}(\theta)}{\partial o_3}\,\frac{\partial o_3}{\partial s_3}\,\frac{\partial s_3}{\partial W} = \sum_{k=0}^{3}\frac{\partial C^{(3)}(\theta)}{\partial o_3}\,\frac{\partial o_3}{\partial s_3}\,\frac{\partial s_3}{\partial s_k}\,\frac{\partial s_k}{\partial W} = \sum_{k=0}^{3}\frac{\partial C^{(3)}(\theta)}{\partial o_3}\,\frac{\partial o_3}{\partial s_3}\left(\prod_{j=k+1}^{3}\frac{\partial s_j}{\partial s_{j-1}}\right)\frac{\partial s_k}{\partial W}$
$\frac{\partial C^{(3)}(\theta)}{\partial U} = \frac{\partial C^{(3)}(\theta)}{\partial o_3}\,\frac{\partial o_3}{\partial s_3}\,\frac{\partial s_3}{\partial U} = \sum_{k=1}^{3}\frac{\partial C^{(3)}(\theta)}{\partial o_3}\,\frac{\partial o_3}{\partial s_3}\left(\prod_{j=k+1}^{3}\frac{\partial s_j}{\partial s_{j-1}}\right)\frac{\partial s_k}{\partial U}$
$\frac{\partial C^{(3)}(\theta)}{\partial V} = \frac{\partial C^{(3)}(\theta)}{\partial o_3}\,\frac{\partial o_3}{\partial V}$
27
Appendix – The Problem of Long-Term Dependencies
• The key factor in the BPTT gradients above is a product of Jacobian matrices:
$\frac{\partial s_3}{\partial s_k} = \prod_{j=k+1}^{3}\frac{\partial s_j}{\partial s_{j-1}} = \prod_{j=k+1}^{3}\underbrace{W^{T}\,\mathrm{diag}\!\left(\sigma'(s_{j-1})\right)}_{\text{Jacobian matrix}}$
• Since the same matrix $W$ enters the product at every time step, the gradient shrinks or grows exponentially with the number of steps: gradient vanishing or exploding.
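A small numerical illustration (not from the slides) of why this product vanishes or explodes: to keep the demo simple, the sigmoid derivative is frozen at its maximum value 0.25, so every factor is $0.25\,W^{T}$ and the product norm is roughly governed by how $0.25\,\|W\|$ compares to 1.

```python
# Repeatedly multiply the same (simplified) Jacobian factor and watch the norm.
import numpy as np

def product_norm(W, steps=50):
    J = np.eye(W.shape[0])
    for _ in range(steps):
        J = (W.T * 0.25) @ J          # one factor W^T diag(sigma') per time step
    return np.linalg.norm(J)

rng = np.random.default_rng(0)
base = rng.normal(size=(8, 8)) / np.sqrt(8)   # spectral norm roughly 2

print(product_norm(1.0 * base))    # 0.25 * ||W|| < 1  -> norm near zero (vanishing)
print(product_norm(10.0 * base))   # 0.25 * ||W|| > 1  -> enormous norm  (exploding)
```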
• Understand the difficulty of training recurrent neural networks
• Gradient Exploding
• Gradient Vanishing
• One possible solution to the gradient vanishing problem is the gating
mechanism, which is the key concept behind LSTM
• LSTM can be “deep” if we stack multiple LSTM cells
• Extensions (a PyTorch sketch follows this slide):
• Uni-directional vs. Bi-directional
• One-to-one, One-to-many, Many-to-one, Many-to-Many (w/ or w/o Encoder-Decoder)
28
Conclusions
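For the extensions above, a short PyTorch sketch of a stacked, bi-directional LSTM (sizes are illustrative; PyTorch is the framework used in the UDACITY reference on the next slide):

```python
# Stacked ("deep") and bi-directional LSTM over a batch of sequences.
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=16, hidden_size=32,
               num_layers=2,          # stack two LSTM layers
               bidirectional=True,    # read the sequence forwards and backwards
               batch_first=True)

x = torch.randn(4, 10, 16)            # (batch, time, features)
output, (h_n, c_n) = lstm(x)
print(output.shape)                   # torch.Size([4, 10, 64]) = 2 directions * 32
```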
• Understanding LSTM Networks
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
• Prof. Hung-yi Lee Courses
https://www.youtube.com/watch?v=xCGidAeyS4M
https://www.youtube.com/watch?v=rTqmWlnwz_0
• On the difficulty of training recurrent neural networks
https://arxiv.org/abs/1211.5063
• UDACITY Courses: Intro to Deep Learning with PyTorch
https://classroom.udacity.com/courses/ud188
29
References
30
Thank you for listening.