RNN Explore
Yan Kang
RNN, LSTM, GRU Comparison on Time Series Data.
1.
RNN Explore RNN, LSTM, GRU, Hyperparameters By Yan Kang
2.
CONTENT: 1. Three Recurrent Cells 2. Hyperparameters 3. Experiments and Results 4. Conclusion
3.
RNN Cells
4.
Why RNN? Standard Neural Network: only accepts fixed-size vectors as input and output. Images from: https://en.wikipedia.org/wiki/Artificial_neural_network and http://agustis-place.blogspot.com/2010/01/4th-eso-msc-computer-assisted-task-unit.html?_sm_au_=iVVJSQ4WZH27rJM0
10.
Vanilla RNN Image from: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
12.
Achieve it in 1 min: h_t = tanh(x_t · U + h_{t-1} · W + b) Image from: http://colah.github.io/posts/2015-08-Understanding-LSTMs/ Vanilla RNN
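This one-line update can be sketched in NumPy (a toy illustration with made-up sizes and random weights, not the code used in the experiments):

```python
import numpy as np

def rnn_step(x_t, h_prev, U, W, b):
    """One vanilla RNN step: h_t = tanh(x_t @ U + h_prev @ W + b)."""
    return np.tanh(x_t @ U + h_prev @ W + b)

# Toy dimensions: input size 3, hidden size 4 (illustrative only).
rng = np.random.default_rng(0)
U = rng.normal(size=(3, 4)) * 0.1   # input-to-hidden weights
W = rng.normal(size=(4, 4)) * 0.1   # hidden-to-hidden weights
b = np.zeros(4)

h = np.zeros(4)                      # initial hidden state
for x_t in rng.normal(size=(5, 3)):  # unroll over a length-5 sequence
    h = rnn_step(x_t, h, U, W, b)
print(h.shape)  # (4,)
```

The same hidden-to-hidden weights W are reused at every time step, which is what makes the unit "recurrent".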
13.
LSTM Image from: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
20.
LSTM Limitation? Redundant gates/parameters: "The output gate was the least important for the performance of the LSTM. When removed, h_t simply becomes tanh(C_t), which was sufficient for retaining most of the LSTM's performance." -- Google, "An Empirical Exploration of Recurrent Network Architectures" Image from: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
22.
LSTM Limitation? Redundant gates/parameters: The LSTM unit computes the new memory content without any separate control of the amount of information flowing from the previous time step. -- “Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling” Image from: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
23.
GRU (LSTM vs GRU) Image from: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
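For comparison, the GRU's two gates and candidate state can be sketched the same way (biases omitted for brevity; the weight names Wz/Uz/Wr/Ur/Wh/Uh and all sizes are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step:
    z = sigmoid(x W_z + h U_z)         update gate
    r = sigmoid(x W_r + h U_r)         reset gate
    h~ = tanh(x W_h + (r * h) U_h)     candidate state
    h_t = (1 - z) * h + z * h~         interpolate old and new state
    """
    z = sigmoid(x_t @ Wz + h_prev @ Uz)
    r = sigmoid(x_t @ Wr + h_prev @ Ur)
    h_tilde = np.tanh(x_t @ Wh + (r * h_prev) @ Uh)
    return (1.0 - z) * h_prev + z * h_tilde

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(3, 4)) * 0.1 for _ in range(3)]  # input weights
Us = [rng.normal(size=(4, 4)) * 0.1 for _ in range(3)]  # recurrent weights

h = np.zeros(4)
for x_t in rng.normal(size=(5, 3)):
    h = gru_step(x_t, h, Ws[0], Us[0], Ws[1], Us[1], Ws[2], Us[2])
print(h.shape)  # (4,)
```

Note the single state vector h and two gates, versus the LSTM's separate cell state C_t and three gates, which is where the parameter saving comes from.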
25.
Hyperparameters
26.
Number of Layers Other than using only one recurrent cell, there is another very common way to construct the recurrent units. Stacked RNN: Image from: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
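Stacking can be sketched by feeding each layer's hidden state to the next layer as its input (toy NumPy sketch with illustrative sizes, not the experimental code):

```python
import numpy as np

def step(x_t, h_prev, U, W):
    """Vanilla recurrent step, reused for every layer."""
    return np.tanh(x_t @ U + h_prev @ W)

rng = np.random.default_rng(3)
# Layer 1 maps input size 3 -> hidden 4; layer 2 maps hidden 4 -> hidden 4.
U1, W1 = rng.normal(size=(3, 4)) * 0.1, rng.normal(size=(4, 4)) * 0.1
U2, W2 = rng.normal(size=(4, 4)) * 0.1, rng.normal(size=(4, 4)) * 0.1

h1, h2 = np.zeros(4), np.zeros(4)
for x_t in rng.normal(size=(6, 3)):
    h1 = step(x_t, h1, U1, W1)  # layer 1 consumes the raw input
    h2 = step(h1, h2, U2, W2)   # layer 2 consumes layer 1's hidden state
print(h2.shape)  # (4,)
```

Each layer keeps its own hidden state and weights; the top layer's state is what the classifier would read.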
28.
Hidden Size RNN LSTM GRU Hidden size: hidden state size in RNN; cell state and hidden state sizes in LSTM; hidden state size in GRU. The larger it is, the more complicated a model the recurrent unit can memorize and represent. Image from: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
30.
Batch Size Optimization function: B = |X| gives gradient descent; 1 <= B < |X| gives (mini-batch) stochastic gradient descent. Batch size B is the number of instances used to update the weights once. Image from: https://www.quora.com/Whats-the-difference-between-gradient-descent-and-stochastic-gradient-descent
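Splitting a dataset into batches of B instances can be sketched with a hypothetical helper (not from the slides):

```python
def minibatches(instances, batch_size):
    """Yield successive batches of `batch_size` instances;
    the last batch may be smaller if the dataset doesn't divide evenly."""
    for start in range(0, len(instances), batch_size):
        yield instances[start:start + batch_size]

data = list(range(10))                         # 10 toy instances
print([len(b) for b in minibatches(data, 4)])  # [4, 4, 2]
```

With batch_size = len(data) this degenerates to full-batch gradient descent: one update per epoch.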
33.
Learning Rate Optimization function: the learning rate ε_t controls how much the weights are changed in each update; decrease it when getting close to the target. Two learning rate updating methods were used in the experiments: first, after each epoch the learning rate decays by 1/2; second, after every 5 epochs the learning rate decays by 1/2. Image from: https://www.quora.com/Whats-the-difference-between-gradient-descent-and-stochastic-gradient-descent
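Both decay schedules can be expressed with one hypothetical helper that halves the rate every `decay_every` epochs (the two experimental schedules correspond to decay_every = 1 and decay_every = 5):

```python
def decayed_lr(base_lr, epoch, decay_every):
    """Halve the learning rate once every `decay_every` epochs."""
    return base_lr * 0.5 ** (epoch // decay_every)

print(decayed_lr(0.1, 0, 5))  # 0.1
print(decayed_lr(0.1, 5, 5))  # 0.05
```

The base learning rate of 0.1 here is a placeholder; the slides do not state the starting value.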
36.
Experiments & Results
37.
Variable Length: sequences s_0, s_1, s_2, s_3 (with labels l_0, l_1, l_2, l_3) are zero-padded at the end to the length of the longest, giving padded sequences s_0', s_1', s_2', s_3' that form Batch 0. Variable Length vs Sliding Window
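The zero-padding step can be sketched with a hypothetical pad_batch helper:

```python
def pad_batch(sequences, pad_value=0.0):
    """Zero-pad variable-length sequences to the longest one in the batch."""
    max_len = max(len(s) for s in sequences)
    return [s + [pad_value] * (max_len - len(s)) for s in sequences]

batch = pad_batch([[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]])
print([len(s) for s in batch])  # [3, 3, 3]
```

After padding, every sequence in the batch has equal length, so they stack into one rectangular tensor for the recurrent network.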
38.
Sliding window: each sequence s_i is sliced into overlapping windows w_i0, w_i1, w_i2, ...; the windows are grouped into Batch 0, Batch 1, and so on. Variable Length vs Sliding Window
39.
Sliding Window: Advantages: Each sequence might generate tens or even hundreds of subsequences. With the same batch size as the variable length method, this means more batches per epoch and more weight updates per epoch, i.e. a faster convergence rate per epoch. Disadvantages: 1) Time consuming, with a longer time per epoch; 2) Assigning the same label to all subsequences may be biased and may prevent the network from converging. Variable Length vs Sliding Window
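Generating the subsequences can be sketched with a hypothetical helper (window and step sizes here are illustrative, not the values used in the experiments):

```python
def sliding_windows(sequence, window, step=1):
    """Slice a sequence into fixed-size, possibly overlapping subsequences.
    Every subsequence inherits the label of the full sequence."""
    return [sequence[i:i + window]
            for i in range(0, len(sequence) - window + 1, step)]

s = list(range(8))                   # one toy sequence of length 8
ws = sliding_windows(s, 5)
print(len(ws))                       # 4 subsequences: 8 - 5 + 1
```

One length-8 sequence already yields 4 training instances, which is why this method multiplies the number of batches per epoch.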
40.
Variable Length vs Sliding Window Variable Length: AUSLAN Dataset 2565 instances
41.
Sliding Window: AUSLAN Dataset 2565 instances Variable Length vs Sliding Window
42.
Variable Length: Character Trajectories Dataset 2858 instances Variable Length vs Sliding Window
43.
Sliding Window: Character Trajectories Dataset 2858 instances Variable Length vs Sliding Window
44.
Variable Length: Japanese Vowels Dataset 640 instances Variable Length vs Sliding Window
45.
Sliding Window: Japanese Vowels Dataset 640 instances Variable Length vs Sliding Window
46.
RNN vs LSTM vs GRU GRU is a simpler variant of LSTM that shares many of the same properties; both of them can prevent vanishing gradients and "remember" long-term dependencies. Both of them outperform vanilla RNN on almost all the datasets, whether using Sliding Window or Variable Length. But GRU has fewer parameters than LSTM, and thus may train a bit faster or need fewer iterations to generalize. As shown in the plots, GRU does converge slightly faster.
47.
RNN vs LSTM vs GRU (convergence plots)
51.
Hyperparameter Comparisons • Learning Rate • Batch Size • Number of Layers • Hidden Size
52.
Learning Rate Two learning rate updating methods were used in the experiments: • First, after each epoch the learning rate decays by 1/2; 24 epochs in total. • Second, after every 5 epochs the learning rate decays by 1/2; 120 epochs in total. The left side of the following plots uses 24 epochs, and the right side uses 120 epochs. Because of the change in the learning rate updating mechanism, some configurations that do not converge on the left (24 epochs) work pretty well on the right (120 epochs).
53.
Learning Rate Japanese Vowels, Sliding Window, LSTM 24 epochs 120 epochs
54.
Learning Rate Japanese Vowels, Sliding Window, GRU 24 epochs 120 epochs
55.
Learning Rate Japanese Vowels, Variable Length, LSTM 24 epochs 120 epochs
56.
Learning Rate Japanese Vowels, Variable Length, GRU 24 epochs 120 epochs
57.
Batch Size A larger batch size means that each weight update uses more instances, so it has lower bias but also a slower convergence rate. On the contrary, a small batch size updates the weights more frequently, so it converges faster but has higher bias. What we ought to do is find the balance between the convergence rate and the risk.
58.
Batch Size Japanese Vowels Sliding Window
59.
Batch Size Japanese Vowels Variable Length
60.
Batch Size UWave Full Length Sliding Window
61.
Number of layers Multi-layer RNNs are more difficult to converge: as the number of layers increases, convergence gets slower. And even when they do converge, we don't gain much from the larger number of hidden units, at least on the Japanese Vowels dataset; the final accuracy doesn't seem better than that of one-layer recurrent networks. This matches some papers' results that stacked RNNs can be replaced by one layer with a larger hidden size.
62.
Number of layers Japanese Vowels Sliding Window
63.
Number of layers Japanese Vowels Variable Length
64.
Number of layers UWave Full length Sliding Window
65.
Hidden Size On both Japanese Vowels and UWave, the larger the hidden size on LSTM and GRU, the better the final accuracy. Different hidden sizes share a similar convergence rate on LSTM and GRU. But the trade-off of a larger hidden size is that it takes a longer time per epoch to train the network. There is some abnormal behavior on vanilla RNN, which might be caused by vanishing gradients.
66.
Hidden Size Japanese Vowels Sliding Window
67.
Hidden Size Japanese Vowels Variable Length
68.
Hidden Size UWave Full Length Sliding Window
69.
Conclusion
70.
Conclusion In this presentation, we first discussed: • What RNN, LSTM and GRU are, and why we use them. • The definitions of the four hyperparameters. And through roughly 800 experiments, we analyzed: • The difference between Sliding Window and Variable Length. • The differences among RNN, LSTM and GRU. • The influence of the number of layers. • The influence of hidden size. • The influence of batch size. • The influence of learning rate. Generally speaking, GRU works better than LSTM, and, because it suffers from vanishing gradients, vanilla RNN works worst. Sliding window is good for datasets with limited instances, where 1) the sequences may have repetitive features or 2) a subsequence can capture the key features of the full sequence. All four hyperparameters play an important role in tuning the network.
72.
Limitations However, there are still some limitations: 1. Variable length: • The sequence length is too long (~100-300 for most datasets, some even larger than 1000). 2. Sliding window: • It ignores the continuity between the sliced subsequences. • Biased labeling may cause similar subsequences to be labeled differently. Luckily, these two limitations can be solved simultaneously -- by Truncated Gradient.
76.
What's next? Truncated gradient: • Slice the sequences in a special order so that, between neighboring batches, each instance of the batch is continuous. • Instead of initializing the states in each batch randomly around zero, as Sliding Window does, the states from the last batch are used to initialize the next batch's state. • So even though the recurrent units are unrolled over a short range (e.g. 20 steps), the states can be passed through and the former "memory" is preserved. (Diagram: sequences s_0, s_1, s_2, s_3 sliced into windows w_00, w_01, ...; Batch 1 is initialized with the states from Batch 0.)
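The state hand-off can be demonstrated with a toy NumPy forward pass: carrying the final state of one chunk into the next reproduces the full unroll exactly (illustrative weights and sizes, not the experimental setup):

```python
import numpy as np

def rnn_chunk(xs, h0, W, U):
    """Unroll a short chunk (a truncated-BPTT window) and return the final
    state, which initializes the next chunk instead of a fresh random state."""
    h = h0
    for x_t in xs:
        h = np.tanh(x_t @ U + h @ W)
    return h

rng = np.random.default_rng(1)
U = rng.normal(size=(3, 4)) * 0.1
W = rng.normal(size=(4, 4)) * 0.1
seq = rng.normal(size=(40, 3))      # one long sequence

h = np.zeros(4)
for chunk in np.split(seq, 2):      # two contiguous 20-step chunks
    h = rnn_chunk(chunk, h, W, U)   # state flows across the chunk boundary

# Carrying the state gives the same final state as one full unroll:
print(np.allclose(h, rnn_chunk(seq, np.zeros(4), W, U)))  # True
```

The gradient, unlike the forward state, is still truncated at the chunk boundary; only the "memory" survives, which is exactly the trade-off truncated gradient makes.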
77.
What's next? Averaged outputs to do classification: • Right now, we use the last time step's output for the softmax and then use Cross Entropy to estimate each class's probability. • Using the averaged outputs of all time steps, or weighted averaged outputs, might be a good choice to try. Prediction (sequence modeling): • We already did the sequence-to-sequence model with an l2-norm loss function. • What needs to be done is finding a proper way to analyze the predicted sequence.
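The proposed change is a one-line difference over the per-step outputs (random placeholder logits here; the softmax helper and shapes are assumptions for illustration):

```python
import numpy as np

# Per-time-step output vectors from an unrolled recurrent network:
# shape = (time_steps, num_classes), random placeholders for this sketch.
rng = np.random.default_rng(2)
outputs = rng.normal(size=(10, 3))

last_step = outputs[-1]           # current approach: softmax on the last step
averaged = outputs.mean(axis=0)   # proposal: average outputs over all steps

def softmax(z):
    """Numerically stable softmax over one logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

print(round(softmax(averaged).sum(), 6))  # 1.0
```

A weighted average (e.g. weighting later steps more heavily) would slot in by replacing `mean` with a weighted sum over the time axis.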
79.
THANK YOU Thanks to Dmitriy for his instructions, and to Feipeng and Xi for the discussions.
80.
Questions?