This document summarizes and compares several techniques for improving RNN performance for speech recognition:
1) FastGRNN combines a minimal gated cell with low-rank matrix factorization and quantization to make recurrent networks faster and smaller.
2) LightGRU removes the reset gate from GRUs and replaces tanh with ReLU for improved speech recognition performance.
3) AWD-LSTM incorporates techniques such as weight dropping (DropConnect), averaged SGD, and activation regularization to prevent overfitting in LSTMs.
Overall, the document evaluates different approaches for making RNNs more efficient and effective for speech tasks.
2. Old Good RNNs
Can't train RNNs!!
Gradients go crazy!!
Fish are better at remembering!!!
I watched Schmidhuber and liked him!!
I don't care about the baseline, I use what the cool kids use!!
Why so big? Occam will cry!!
My GPU has 4GB!!
I can't wait months to train!!
X et al. said GRUs are better!!
3. What else?
I need an RNN-sized model with LSTM performance!!
I need a smaller model or a better smartphone!!
FastGRNN
http://manikvarma.org/pubs/kusupati18.pdf
This reset gate makes no sense!!
May the ReLU be with you!!
I do speech recognition!!
I watched Bengio and liked him!!
LightGRU
https://arxiv.org/abs/1803.10225
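As a reference for what LightGRU (Li-GRU) changes relative to a standard GRU, here is a minimal sketch of a Li-GRU cell in PyTorch. This is not the authors' code; the class and variable names, and the batch-norm placement on the feed-forward connections, follow my reading of the paper.

```python
import torch
import torch.nn as nn

class LiGRUCell(nn.Module):
    """Li-GRU: a GRU with no reset gate, ReLU instead of tanh, and
    batch norm on the feed-forward (input-to-hidden) connections."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.Wz = nn.Linear(input_size, hidden_size, bias=False)
        self.Uz = nn.Linear(hidden_size, hidden_size, bias=False)
        self.Wh = nn.Linear(input_size, hidden_size, bias=False)
        self.Uh = nn.Linear(hidden_size, hidden_size, bias=False)
        self.bn_z = nn.BatchNorm1d(hidden_size)
        self.bn_h = nn.BatchNorm1d(hidden_size)

    def forward(self, x, h):
        # Update gate only; the reset gate is gone.
        z = torch.sigmoid(self.bn_z(self.Wz(x)) + self.Uz(h))
        # Candidate state uses ReLU instead of tanh.
        h_tilde = torch.relu(self.bn_h(self.Wh(x)) + self.Uh(h))
        return z * h + (1 - z) * h_tilde
```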
I need Regularization!!!
Naive dropout doesn't work well for RNNs!!!
AWD-LSTM
https://arxiv.org/abs/1708.02182
4. FastGRNN
● 2 trainable matrices (a single W and U shared by the gate and the candidate state) vs. 6 trainable matrices in a GRU layer.
● Low-rank approximation of matrices: W = W1(W2)^T.
● Integer quantization of parameters.
● Piecewise-linear approximation of the non-linearities.
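A minimal sketch of a FastGRNN cell, assuming the formulation in the paper: one W and one U shared by the gate and the candidate state, each stored as a low-rank product, plus trainable scalars zeta and nu. Quantization and the piecewise-linear non-linearities are omitted, and the initialization constants are illustrative.

```python
import torch
import torch.nn as nn

class FastGRNNCell(nn.Module):
    def __init__(self, input_size, hidden_size, rank):
        super().__init__()
        # Low-rank factors: W = W1 @ W2.T, U = U1 @ U2.T
        self.W1 = nn.Parameter(torch.randn(hidden_size, rank) * 0.1)
        self.W2 = nn.Parameter(torch.randn(input_size, rank) * 0.1)
        self.U1 = nn.Parameter(torch.randn(hidden_size, rank) * 0.1)
        self.U2 = nn.Parameter(torch.randn(hidden_size, rank) * 0.1)
        self.b_z = nn.Parameter(torch.zeros(hidden_size))
        self.b_h = nn.Parameter(torch.zeros(hidden_size))
        # Trainable scalar gates (zeta, nu in the paper), kept in (0, 1)
        # via sigmoid; initial values here are illustrative.
        self.zeta = nn.Parameter(torch.tensor(1.0))
        self.nu = nn.Parameter(torch.tensor(-4.0))

    def forward(self, x, h):
        W = self.W1 @ self.W2.t()          # (hidden, input)
        U = self.U1 @ self.U2.t()          # (hidden, hidden)
        pre = x @ W.t() + h @ U.t()        # shared pre-activation
        z = torch.sigmoid(pre + self.b_z)  # gate
        h_tilde = torch.tanh(pre + self.b_h)
        gain = torch.sigmoid(self.zeta) * (1 - z) + torch.sigmoid(self.nu)
        return gain * h_tilde + z * h
```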
12. Weight Dropping
● Apply DropConnect to the hidden-to-hidden connections (all U matrices).
● Prevents overfitting on the recurrent connections.
● Requires no modification of the optimized RNN implementations in DL frameworks.
● The same dropout mask is applied across the whole sequence.
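A minimal sketch of the idea. The actual AWD-LSTM trick patches the `weight_hh_l0` parameter of a cuDNN-backed `nn.LSTM` before each forward pass, which is why no framework internals need modifying; the plain tanh RNN below just makes the mechanics visible: one DropConnect mask is sampled for U per call and reused at every timestep.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightDropRNN(nn.Module):
    """Plain tanh RNN with DropConnect on the recurrent matrix U."""
    def __init__(self, input_size, hidden_size, weight_p=0.5):
        super().__init__()
        self.W = nn.Linear(input_size, hidden_size)
        self.U = nn.Parameter(torch.randn(hidden_size, hidden_size) * 0.1)
        self.weight_p = weight_p

    def forward(self, x):  # x: (batch, time, input_size)
        # One DropConnect mask on U per call -> same mask for the whole sequence.
        U = F.dropout(self.U, p=self.weight_p, training=self.training)
        h = x.new_zeros(x.size(0), self.U.size(0))
        outputs = []
        for t in range(x.size(1)):
            h = torch.tanh(self.W(x[:, t]) + h @ U.t())
            outputs.append(h)
        return torch.stack(outputs, dim=1), h
```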
13. Averaged SGD (ASGD) and NT-ASGD
[Figure: the ASGD update, annotated with the number of steps before averaging starts, the weights optimized per iteration, and the averaged weights used as the final model.]
PyTorch implementation:
https://github.com/pytorch/pytorch/blob/cd9b27231b51633e76e28b6a34002ab83b0660fc/torch/optim/asgd.py
NT-ASGD: only switch to ASGD once the validation metric fails to improve for several consecutive checks
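A minimal sketch of the NT-ASGD trigger, assuming hypothetical `train_one_epoch` and `evaluate` helpers standing in for the usual training and validation loops: train with plain SGD and switch to `torch.optim.ASGD` once the validation loss has gone `n` consecutive checks without improving.

```python
import torch

def fit(model, lr=30.0, n=5, max_epochs=100):
    """Train with SGD, switching to ASGD via the non-monotonic trigger."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    best_val = float('inf')
    bad_checks = 0
    for epoch in range(max_epochs):
        train_one_epoch(model, optimizer)  # hypothetical training loop
        val_loss = evaluate(model)         # hypothetical validation pass
        if val_loss < best_val:
            best_val, bad_checks = val_loss, 0
        else:
            bad_checks += 1
        # Non-monotonic trigger: after n checks without improvement,
        # switch to ASGD (t0=0 starts the weight averaging immediately).
        if bad_checks >= n and isinstance(optimizer, torch.optim.SGD):
            optimizer = torch.optim.ASGD(model.parameters(), lr=lr, t0=0)
```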
14. Embedding Dropout
● Apply dropout at the word level: entire randomly selected word vectors are zeroed out, so every occurrence of a dropped word vanishes for that forward pass.
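A minimal sketch, modeled on the AWD-LSTM idea (the function name is mine): one Bernoulli keep/drop decision per vocabulary row rather than per token, with the usual 1/(1-p) inverted-dropout rescaling of the kept rows.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def embedding_dropout(embed: nn.Embedding, words, p=0.1, training=True):
    """Zero out whole rows of the embedding matrix (whole words)."""
    if not training or p == 0:
        return embed(words)
    # One keep/drop decision per vocabulary entry, not per token.
    mask = embed.weight.new_empty((embed.weight.size(0), 1)).bernoulli_(1 - p) / (1 - p)
    return F.embedding(words, mask * embed.weight, padding_idx=embed.padding_idx)

# Usage: emb = nn.Embedding(10000, 300)
#        vectors = embedding_dropout(emb, token_ids, p=0.1, training=True)
```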
15. Activation Regularization
● Penalize the network for producing large hidden activations (AR) and large changes between consecutive hidden states (TAR), both of which lead to overfitting.
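A minimal sketch of the two penalties as I read them from the paper; as a simplification it applies both terms to the raw hidden states `h` of shape (batch, time, hidden), whereas the paper applies AR to the dropped-out activations.

```python
import torch

def activation_regularization(h, alpha=2.0, beta=1.0):
    ar = alpha * h.pow(2).mean()                       # AR: keep activations small
    tar = beta * (h[:, 1:] - h[:, :-1]).pow(2).mean()  # TAR: keep them smooth in time
    return ar + tar

# Usage: loss = criterion(logits, targets) + activation_regularization(h)
```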