2. Table of Contents
1. History
2. Architecture
3. Variants
4. Vanishing Gradient Problem
5. Training in Supervised Fashion
6. CTC Score Function
4. History
● 1995-1997: LSTM was proposed by Sepp Hochreiter and Jürgen Schmidhuber, who showed that LSTM solves the vanishing gradient problem.
● 1999-2000: Felix Gers introduced the forget gate; Gers, Schmidhuber & Cummins added peephole connections.
● 2009: An LSTM-based model won the ICDAR connected handwriting recognition competition.
● 2013: LSTM networks set a record 17.7% phoneme error rate on the classic TIMIT natural speech dataset.
● 2014-2016: Kyunghyun Cho et al. created the gated recurrent unit (GRU), and Google started using an LSTM for speech recognition in Google Voice and the Allo conversation app.
● 2017: Facebook performed some 4.5 billion automatic translations daily with LSTM, and Microsoft reported reaching 94.9% recognition accuracy.
● 2019: A new RNN derived using Legendre polynomials outperformed the LSTM on some memory-related benchmarks, and an LSTM model climbed to third place on the Large Text Compression Benchmark.
6. Common Architecture of an LSTM Model
The Long Short-Term Memory (LSTM) model architecture is based on the Recurrent Neural Network (RNN) architecture. One thing that sets the RNN/LSTM model apart from typical neural nets is that LSTMs have feedback connections instead of the common feedforward connections.
An LSTM unit is commonly composed of four parts: a cell, an input gate, an output gate, and a forget gate (a minimal sketch of one LSTM step follows the list below).
● The cell is the memory part of the LSTM unit and is responsible for keeping track of the dependencies between the elements in the input sequence.
● The input gate controls the extent to which a new value flows into the cell.
● The output gate controls the extent to which the value in the cell is used to compute the output activation of the LSTM unit.
● The forget gate controls the extent to which a value remains in the cell.
● The weights of the connections, both into and out of the gates, determine how each gate operates.
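To make the four parts concrete, here is a minimal NumPy sketch of one LSTM time step; the weight matrices W, U, biases b, and gate keys are illustrative names, not from the slides.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step; W, U, b are dicts of per-gate weights/biases (illustrative)."""
    i = sigmoid(W['i'] @ x + U['i'] @ h_prev + b['i'])   # input gate
    f = sigmoid(W['f'] @ x + U['f'] @ h_prev + b['f'])   # forget gate
    o = sigmoid(W['o'] @ x + U['o'] @ h_prev + b['o'])   # output gate
    g = np.tanh(W['g'] @ x + U['g'] @ h_prev + b['g'])   # candidate cell value
    c = f * c_prev + i * g      # forget gate keeps/discards old memory, input gate admits the new value
    h = o * np.tanh(c)          # output gate controls what the cell exposes as the output activation
    return h, c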
10. Variations on the LSTM Model
There are quite a few variations on the LSTM model; here are a few examples:
● Peephole LSTM:
○ A peephole LSTM has peephole connections that allow the gates to access the constant error carousel (CEC), whose activation is the cell state.
● Peephole Convolutional LSTM:
○ Similar to the peephole LSTM, this model adds convolution operations, making it better suited to processing image and video data.
● Gated Recurrent Units (GRUs):
○ This model follows the common architecture of an LSTM model, but does not have an output gate (a short sketch follows this list).
● Multiplicative LSTM (mLSTM):
○ This is a complex model that has achieved state-of-the-art results in natural language processing.
○ OpenAI’s unsupervised sentiment neuron is based on mLSTMs.
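To make the GRU comparison concrete, here is a minimal NumPy sketch of one GRU step, using the same illustrative naming as the LSTM sketch above: the cell state and hidden state are merged, and there is no output gate.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, W, U, b):
    """One GRU time step: a single hidden state, no separate cell and no output gate."""
    z = sigmoid(W['z'] @ x + U['z'] @ h_prev + b['z'])               # update gate
    r = sigmoid(W['r'] @ x + U['r'] @ h_prev + b['r'])               # reset gate
    h_cand = np.tanh(W['h'] @ x + U['h'] @ (r * h_prev) + b['h'])    # candidate state
    return (1.0 - z) * h_prev + z * h_cand                           # blend old and candidate state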
11. Vanishing Gradient Issue in Training RNNs
Backpropagation through time applies the chain rule across every time step, so over long sequences the gradient shrinks as it is propagated backward and the network becomes biased toward capturing shorter-term dependencies (a small numerical sketch follows this list).
● Solution 1: choose the right activation function.
● Solution 2: initialize the weights differently.
● Solution 3: use gated cells, making each node a more complex unit with gates controlling what information is passed through → LSTM.
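A minimal numerical sketch of why long sequences cause vanishing gradients: when the per-step derivative has magnitude below 1, the chain-rule product over many time steps collapses toward zero. The 0.9 factor and 100 steps are illustrative values, not from the slides.

import numpy as np

per_step_grad = 0.9   # illustrative |dh_t / dh_{t-1}| below 1
steps = 100           # illustrative sequence length

# Backpropagation through time multiplies one such factor per time step.
grad = np.prod(np.full(steps, per_step_grad))
print(f"gradient after {steps} steps: {grad:.2e}")   # ~2.66e-05, effectively vanished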
12. Training in Supervised Fashion
(LSTM cell from figure 4.2 in “Supervised Sequence Labelling with Recurrent Neural Networks” by Alex Graves)
● Optimization algorithm: gradient descent combined with backpropagation (Adam, RMSprop, etc.); an illustrative training-loop sketch follows this list.
● LSTM is a differentiable function approximator that is typically trained with gradient descent. Recently, non-gradient-based training methods of LSTM have also been considered.
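As a sketch of what supervised training with gradient descent can look like in practice, here is a minimal PyTorch example fitting an LSTM sequence classifier with Adam. The dimensions, random data, and model are made up for illustration, not taken from the slides.

import torch
import torch.nn as nn

input_size, hidden_size, num_classes, seq_len, batch = 16, 64, 5, 20, 8   # illustrative sizes

class LSTMClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        out, _ = self.lstm(x)          # out: (batch, seq_len, hidden_size)
        return self.head(out[:, -1])   # classify from the last hidden state

model = LSTMClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # Adam; RMSprop also works
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(batch, seq_len, input_size)       # toy data standing in for a real dataset
y = torch.randint(0, num_classes, (batch,))

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()        # backpropagation through time
    optimizer.step()       # gradient-descent update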
13. Connectionist Temporal Classification (CTC)
● Widely used in speech recognition due to its simplicity in training and efficiency in decoding.
● Rule: remove blank (null) frames and merge repeated frames, e.g. “ex ex ex” → “ex” (a decoding sketch follows this list).
● Issues: each output frame is decided independently, and the input frames must be aligned with the corresponding output labels.
● Setup: paired training data (input frames X, label sequence W); an encoder LSTM maps input frames x1, x2, … to hidden states h1, h2, …; a classifier turns each hi into a token distribution over a vocabulary of size V via Softmax(hi), trained with cross-entropy.
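A minimal sketch of the CTC collapsing rule described above: greedy decoding takes the most likely token per frame, merges repeated frames, then removes blanks. The frame probabilities below are made-up values for illustration.

import numpy as np

BLANK = 0  # index of the CTC blank (null) token

def ctc_greedy_decode(frame_probs):
    """Collapse per-frame token distributions (time_steps x vocab_size) into an output sequence."""
    best = frame_probs.argmax(axis=1)                 # most likely token per frame
    collapsed = [t for i, t in enumerate(best)        # merge repeated frames
                 if i == 0 or t != best[i - 1]]
    return [int(t) for t in collapsed if t != BLANK]  # remove blank frames

# Toy example: frames predicting blank, e, e, x, blank, x  ->  tokens [e, x, x]
probs = np.array([
    [0.9, 0.05, 0.05],   # blank
    [0.1, 0.8, 0.1],     # 'e'
    [0.1, 0.8, 0.1],     # 'e' (repeat, merged)
    [0.1, 0.1, 0.8],     # 'x'
    [0.9, 0.05, 0.05],   # blank
    [0.1, 0.1, 0.8],     # 'x'
])
print(ctc_greedy_decode(probs))   # -> [1, 2, 2]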
14. Successful Examples in Reinforcement Learning
Dota 2 (OpenAI Five):
● Each of OpenAI Five’s networks contains a single-layer, 1024-unit LSTM that sees the current game state (extracted from Valve’s Bot API) and emits actions through several possible action heads.
Learning Dexterity:
● The policy is represented as a recurrent neural network with memory, namely an LSTM with an additional hidden layer with ReLU activations inserted between the inputs and the LSTM.
● Demo of Learning Dexterity
15. More Applications of LSTM
Beyond the CTC score function and its alternatives, LSTM has many applications:
● Time series prediction
● Sign language translation
● Speech recognition
● Handwriting recognition
● Drug design
● Robot control
● Rhythm learning
● Music composition
● Grammar learning
● Human action recognition
● Protein homology detection
● Predicting subcellular localization of proteins
● …
There is much more to do with LSTM!
16. Resources
Main article: Long short-term memory
Supplemental resources:
● Learning Dexterous In-Hand Manipulation
● OpenAI Five vs. Dota 2
● End-to-End Speech Recognition Using a High Rank LSTM-CTC Based Model
● Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks
● Supervised Sequence Labelling with Recurrent Neural Networks