Natural Language Processing Labs. By Daanv
2019.11.18
Can Recurrent Neural
Networks Warp Time?
Abstract
Gating Mechanisms
> LSTMs (Long Short-Term Memory networks)
> GRUs (Gated Recurrent Units)
: improve the learning of medium- to long-term temporal dependencies
and help mitigate vanishing gradients
Proposal
> Prove that the learnable gates in a recurrent model formally provide quasi-invariance
to general time transformations in the input data
Result
> Leads to a new way of initializing gate biases in LSTMs and GRUs (chrono initialization)
> Improves learning of long-term dependencies, with minimal implementation effort
Introduction
RNN (Recurrent Neural Network)
> Vanishing gradients
> Long-term dependencies
Introduction
LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit)
Introduction
[Section 1]
- Invariance to time transformations in the data naturally leads
to a gating mechanism in recurrent models
- Gate values appear as time contraction or time dilation coefficients
(similar in spirit to the notion of a time constant)
[Section 2]
- How to initialize gate biases depending on the range of time dependencies
- Related: (Gers & Schmidhuber, 2000) set the bias of the LSTM forget gate to 1 or 2
[Section 3]
- Test the empirical benefits of the new initialization (synthetic and real-world data)
From Time Warping Invariance to Gating
In sequential learning problems, being resilient to a change in time scale is crucial
RNNs are not resilient to time rescaling
> That is, the class of functions representable by an RNN is not invariant to time rescaling
- Input data x(t) -> time-warped input data x(c(t)) (assuming the time warping c(t) is not overly complex)
- The change of time c covers time rescaling as well as accelerations or decelerations
of the phenomena in the input data
Invariance to time warping
: the class contains another (or the same) model that works on x(c(t)) in the same way that the original
model works on x(t)
> This requirement leads to a gating mechanism
From Time Warping Invariance to Gating
Invariance to time rescaling
> Linear transformation of time: c(t) = at (a > 0) [e.g., data 10 times slower -> a = 0.1]
(the continuous-time setting makes time transformations easy to handle)
> RNN hidden state h(t)
discrete-time equation: h_{t+1} = tanh(Wx x_{t+1} + Wh h_t + b)
continuous-time equation: dh(t)/dt = tanh(Wx x(t) + Wh h(t) + b) - h(t)
Under rescaling:
- t : c(t) = at
- x(t) : x(at)
- h(t) : h(at)
From Time Warping Invariance to Gating
Invariance to time rescaling
Translating back to a discrete-time model gives the leaky RNN (a > 0):
h_{t+1} = a tanh(Wx x_{t+1} + Wh h_t + b) + (1 - a) h_t
: the class contains another (or the same) model that works on x(c(t)) in the same way that the original
model works on x(t) > the leaky RNN
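The leaky update above can be sketched in a few lines of NumPy; this is a minimal illustration, not the paper's implementation, and the function name, shapes, and parameter values are chosen for the example.

```python
import numpy as np

def leaky_rnn_step(h, x, Wx, Wh, b, alpha):
    """One leaky-RNN update, discretizing the time-rescaled
    continuous-time equation; alpha in (0, 1] is the leak rate."""
    return alpha * np.tanh(Wx @ x + Wh @ h + b) + (1 - alpha) * h

# Tiny usage example: 3 hidden units, 2 inputs (shapes are illustrative).
rng = np.random.default_rng(0)
Wx, Wh, b = rng.normal(size=(3, 2)), rng.normal(size=(3, 3)), np.zeros(3)
h = leaky_rnn_step(np.zeros(3), rng.normal(size=2), Wx, Wh, b, alpha=0.1)
```

With alpha = 1 this reduces to the plain RNN update; smaller alpha corresponds to slower time.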
From Time Warping Invariance to Gating
Invariance to time warpings
For general (non-linear) time warpings, the constant a must become a learnable function g of the data
> quasi-invariance to time warpings
> Input gate: g_t
> Forget gate: (1 - g_t)
> Gated RNN: h_{t+1} = g_t * tanh(Wx x_{t+1} + Wh h_t + b) + (1 - g_t) * h_t
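The gated update can be sketched as follows, replacing the fixed leak with a sigmoid gate computed from the input and hidden state. This is a minimal sketch under assumed shapes and names (Wgx, Wgh, bg are the hypothetical gate parameters), not the paper's code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_rnn_step(h, x, Wx, Wh, b, Wgx, Wgh, bg):
    """Gated RNN update: the learnable gate g replaces the fixed leak
    alpha; g acts as an input gate and (1 - g) as a forget gate."""
    g = sigmoid(Wgx @ x + Wgh @ h + bg)   # per-unit time-warping rate
    return g * np.tanh(Wx @ x + Wh @ h + b) + (1 - g) * h

# Usage with 3 hidden units and 2 inputs.
rng = np.random.default_rng(1)
Wx, Wh = rng.normal(size=(3, 2)), rng.normal(size=(3, 3))
Wgx, Wgh = rng.normal(size=(3, 2)), rng.normal(size=(3, 3))
h = gated_rnn_step(np.zeros(3), rng.normal(size=2),
                   Wx, Wh, np.zeros(3), Wgx, Wgh, np.zeros(3))
```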
Time Warpings and Gate Initialization
If the sequential data have temporal dependencies in an approximate range [T_min, T_max]
> use a model whose memory (forgetting time) falls in that range
This amounts to having values of g in the range [1/T_max, 1/T_min]
If the values of both inputs and hidden layers are centered over time,
g(t) will typically take values centered around σ(b_g)
Values of σ(b_g) in the desired range [1/T_max, 1/T_min] are obtained by choosing the biases b_g
between -log(T_max - 1) and -log(T_min - 1)
> This controls the order of magnitude of the memory range of the network
> Initialize the biases of g as (𝒰: the uniform distribution)
b_g = -log(𝒰([T_min, T_max]) - 1)
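For LSTMs, the paper's chrono initialization draws the forget-gate biases as log(U([1, T_max - 1])) and sets the input-gate biases to their negation. A minimal sketch, with an assumed helper name and using NumPy's default generator:

```python
import numpy as np

def chrono_biases(hidden_size, t_max, rng=None):
    """Chrono initialization for LSTM gate biases: forget biases are
    drawn as log(U([1, t_max - 1])) and input biases as their negation,
    so initial forgetting times spread over roughly [1, t_max]."""
    rng = rng or np.random.default_rng()
    b_f = np.log(rng.uniform(1.0, t_max - 1.0, size=hidden_size))
    b_i = -b_f
    return b_f, b_i

# E.g., a 128-unit LSTM expected to face dependencies up to ~750 steps.
b_f, b_i = chrono_biases(128, t_max=750)
```

A positive forget bias pushes the forget gate toward 1 at initialization, so memory persists from the first updates.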
Experiments
> Test robustness by introducing random time warpings into some data,
comparing gated and ungated architectures
> Chrono LSTM initialization vs. standard initialization
on synthetic and real-world tasks
- Synthetic tasks: chrono LSTM > standard LSTM
- Real-world tasks: chrono LSTM >= standard LSTM
> Real-world benchmarks
- Text8 (Mahoney, 2011): next-character prediction
- Penn Treebank (Mikolov et al., 2012): next-word prediction
Experiments
Pure warpings and paddings
> Uniform warping vs. variable warping
The unwarped task (without time warping or padding):
- remember the previous character of a character sequence (too easy on its own)
so the only difficulty comes from the warping
> This tests the robustness of various architectures to time warping
A more difficult variant uses padded sequences
> Each element of the input sequence is followed by a fixed or variable number of 0's
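The padding transform described above can be sketched as a small helper; the function name and zero-token convention are assumptions for illustration.

```python
import random

def pad_sequence(seq, max_pad, variable=False, seed=None):
    """Insert zeros after each element: a fixed number (uniform padding)
    or a per-element random count in [0, max_pad] (variable padding)."""
    rng = random.Random(seed)
    out = []
    for tok in seq:
        out.append(tok)
        n = rng.randint(0, max_pad) if variable else max_pad
        out.extend([0] * n)
    return out

pad_sequence([1, 2, 3], max_pad=2)   # uniform: [1, 0, 0, 2, 0, 0, 3, 0, 0]
```

Uniform padding stretches time by a constant factor; variable padding corresponds to a random, non-linear warping.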
Experiments
Compare three recurrent architectures (all with 64 recurrent units):
- RNN (ungated recurrent network)
- Leaky RNN (each unit has a constant learnable "gate" between 0 and 1)
- Gated RNN (one learnable gate per unit)
Experiments
[Figure: results under uniform time warping, uniform padding, variable time warping, and variable padding]
Experiments
> Copy task
: checks whether a model can remember information for arbitrarily long durations
(Hochreiter & Schmidhuber, 1997)
- For a given T, input sequences consist of T+20 characters
- The first 10 characters are drawn uniformly at random from the first 8 letters
- They are followed by T-1 dummy characters, a signal character, and 10 more dummy characters
- Target sequences consist of T+10 dummy characters followed by the first 10 input characters
A memoryless model that predicts at random among the possible characters
gives the baseline loss (Arjovsky et al., 2016)
LSTM (128 hidden units)
> Standard initialization (baseline): forget biases = 1
> Chrono initialization: T_max = T
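A generator for the copy-task data described above can be sketched as follows; the symbol encoding (0 = blank, 1..8 = letters, 9 = "go" marker) and the function name are assumptions for illustration.

```python
import numpy as np

def copy_task_sample(T, n_letters=8, n_memorize=10, rng=None):
    """One copy-task pair: the input holds 10 random letters, T-1 blanks,
    a 'go' marker, and 10 blanks; the target holds T+10 blanks followed
    by the 10 letters. Both sequences have length T+20."""
    rng = rng or np.random.default_rng()
    blank, go = 0, n_letters + 1          # symbols 1..n_letters are letters
    letters = rng.integers(1, n_letters + 1, size=n_memorize)
    x = np.concatenate([letters, np.full(T - 1, blank), [go],
                        np.full(n_memorize, blank)])
    y = np.concatenate([np.full(T + n_memorize, blank), letters])
    return x, y

x, y = copy_task_sample(T=100)   # sequences of length 120
```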
Experiments
Chrono initialization (red) vs. standard initialization (blue)
Standard initialization
> forget gate biases = 1
Chrono initialization
> forget and input gate biases are chosen
according to the chrono initialization
Experiments
> Adding task
- Training input: two sequences of length T;
the first contains numbers drawn from 𝒰([0,1]), the second is zero everywhere except at two marked positions
- Target: the sum of the numbers in the first sequence
at the positions marked in the second sequence
A model that always predicts the mean of the sum of two 𝒰([0,1]) draws, i.e. 1,
achieves MSE ≈ 0.167 (Arjovsky et al., 2016)
LSTM (128 hidden units)
> Standard initialization (baseline): forget biases = 1
> Chrono initialization: T_max = T
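A generator for the adding-task data can be sketched in the same spirit; the function name and the two-row encoding are assumptions for illustration.

```python
import numpy as np

def adding_task_sample(T, rng=None):
    """One adding-task pair: row 0 holds T numbers from U([0,1]); row 1
    is zero except for two marker 1's; the target is the sum of the two
    marked numbers (always predicting their mean, 1, gives MSE ~ 0.167)."""
    rng = rng or np.random.default_rng()
    values = rng.uniform(0.0, 1.0, size=T)
    marks = np.zeros(T)
    i, j = rng.choice(T, size=2, replace=False)
    marks[i] = marks[j] = 1.0
    return np.stack([values, marks]), values[i] + values[j]

x, y = adding_task_sample(T=50)
```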
Conclusion
> Gated connections adjust the local time constant of the recurrent model.
> Chrono initialization, a simple way of initializing gate biases in LSTMs and GRUs, is introduced.
> Chrono initialization provides notable benefits
when faced with long-term dependencies.
THANK YOU.

Advanced Flow Concepts Every Developer Should Know
 

Can recurrent neural networks warp time

  • 1. Natural Language Processing Labs. By Daanv 2019.11.18 Can Recurrent Neural Networks Warp Time?
  • 2. Abstract
Gating mechanisms
> LSTMs (Long Short-Term Memory)
> GRUs (Gated Recurrent Units)
: improve the learning of medium- to long-term temporal dependencies and help with vanishing gradient issues
Proposal
> Prove that learnable gates in a recurrent model formally provide quasi-invariance to general time transformations in the input data
Result
> Leads to a new way of initializing gate biases in LSTMs and GRUs (chrono initialization)
> Improves learning of long-term dependencies, with minimal implementation effort
  • 3. Introduction
RNN (Recurrent Neural Network)
> Vanishing gradients
> Long-term dependencies
  • 4. Introduction
LSTM (Long Short-Term Memory) / GRU (Gated Recurrent Unit)
  • 5. Introduction
[Section 1]
- Invariance to time transformations in the data leads to a gating mechanism in recurrent models
- Gate values appear as time contraction or time dilation coefficients (similar in spirit to the notion of a time constant)
[Section 2]
- How to initialize gate biases depending on the range of time dependencies
- cf. (Gers & Schmidhuber, 2000): setting the bias of the forget gate of LSTMs to 1 or 2
[Section 3]
- Test the empirical benefits of the new initialization (synthetic and real-world data)
  • 6. From Time Warping Invariance to Gating
Sequential learning problems > being resilient to a change in time scale is crucial
RNN: not resilient to time rescaling
> That is, the class of functions representable by an RNN is not invariant under time rescaling
- Input data x(t) -> time-warped input data x(c(t)) (assuming the time warping c(t) is not overly complex)
- The change of time c covers time rescaling as well as accelerations or decelerations of the phenomena in the input data
Invariant to time warping
: the class contains another (or the same) model that works on x(c(t)) in the same way that the original model works on x(t)
> Gating mechanism
  • 7. From Time Warping Invariance to Gating
Invariance to time rescaling
> Linear transformation of time, c(t) = at (a > 0) [e.g., 10 steps -> a = 0.1]
(the continuous-time setting makes time transformations easy to express)
> RNN hidden state h(t)
discrete-time equation: h_{t+1} = tanh(W x_{t+1} + U h_t + b)
continuous-time equation: dh(t)/dt = tanh(W x(t) + U h(t) + b) - h(t)
Substituting: t -> c(t) = at, x(t) -> x(at), h(t) -> h(at)
  • 8. From Time Warping Invariance to Gating
Invariance to time rescaling
Translating back to a discrete-time model yields the leaky RNN (0 < a <= 1):
h_{t+1} = a tanh(W x_{t+1} + U h_t + b) + (1 - a) h_t
: the class contains another (or the same) model that works on x(c(t)) in the same way that the original model works on x(t)
> leaky RNN
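The leaky-RNN update above can be sketched in a few lines of NumPy; the names (`leaky_rnn_step`, `W`, `U`, `b`) and the dimensions are illustrative, not from the slides.

```python
import numpy as np

def leaky_rnn_step(x_t, h_t, W, U, b, alpha):
    """One leaky-RNN update: a fixed leak rate alpha in (0, 1] blends the
    new candidate state with the previous state. alpha = 1 recovers a
    plain RNN; a small alpha slows the unit's effective time scale by
    roughly a factor of 1/alpha."""
    candidate = np.tanh(W @ x_t + U @ h_t + b)
    return alpha * candidate + (1.0 - alpha) * h_t

rng = np.random.default_rng(0)
d_in, d_h = 3, 4
W = rng.standard_normal((d_h, d_in)) * 0.1
U = rng.standard_normal((d_h, d_h)) * 0.1
b = np.zeros(d_h)
x = rng.standard_normal(d_in)
h = np.zeros(d_h)
h_next = leaky_rnn_step(x, h, W, U, b, alpha=0.1)
```

With alpha = 1 the update collapses to the ordinary tanh RNN, which makes the correspondence between the two model classes explicit.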
  • 9. From Time Warping Invariance to Gating
Invariance to time warpings
Invariance to general (non-uniform) time warpings requires making the leak rate a learnable function g_t of the input and state
> quasi-invariant to time warpings
> Input gate: g_t
> Forget gate: (1 - g_t)
> Gated RNN
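A minimal sketch of the gated RNN described here, where a learnable per-unit gate g_t replaces the fixed leak rate; all parameter names (`Wg`, `Ug`, `bg`, ...) are my own, not from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_rnn_step(x_t, h_t, W, U, b, Wg, Ug, bg):
    """One step of a gated RNN: g_t acts as the input gate and (1 - g_t)
    as the forget gate, so g_t is a learned, per-unit, per-time-step
    time contraction/dilation coefficient."""
    g_t = sigmoid(Wg @ x_t + Ug @ h_t + bg)        # per-unit gate in (0, 1)
    candidate = np.tanh(W @ x_t + U @ h_t + b)
    return g_t * candidate + (1.0 - g_t) * h_t

rng = np.random.default_rng(0)
d_in, d_h = 3, 4
W  = rng.standard_normal((d_h, d_in)) * 0.1
U  = rng.standard_normal((d_h, d_h)) * 0.1
b  = np.zeros(d_h)
Wg = rng.standard_normal((d_h, d_in)) * 0.1
Ug = rng.standard_normal((d_h, d_h)) * 0.1
bg = np.zeros(d_h)
h_next = gated_rnn_step(rng.standard_normal(d_in), np.zeros(d_h), W, U, b, Wg, Ug, bg)
```

The only change from the leaky RNN is that the scalar leak rate becomes a sigmoid of the current input and state, which is what makes quasi-invariance to non-uniform warpings possible.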
  • 10. Time Warpings and Gate Initialization
Suppose the sequential data have temporal dependencies in an approximate range [T_min, T_max]
> Use a model with memory (forgetting time) in that range
This amounts to having values of g in the range [1/T_max, 1/T_min]
If the values of both inputs and hidden layers are centered over time, g_t will typically take values centered around 𝜎(𝑏_𝑔)
Values of 𝜎(𝑏_𝑔) in the desired range [1/T_max, 1/T_min] are obtained by choosing the biases 𝑏_𝑔 between −log(T_max − 1) and −log(T_min − 1)
> To control the order of magnitude of the memory range of the network
> Initialize the biases of g as −log(𝒰([T_min, T_max]) − 1) (𝒰: the uniform distribution)
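The chrono initialization above can be sketched directly in NumPy (function names are my own). The key identity is 𝜎(−log(T − 1)) = 1/T, so drawing T ~ 𝒰([T_min, T_max]) puts each unit's characteristic forgetting time in the desired range.

```python
import numpy as np

def chrono_gate_bias(t_min, t_max, n_units, rng):
    """Draw a memory horizon T ~ U([t_min, t_max]) per unit and set
    b_g = -log(T - 1), so that sigmoid(b_g) = 1/T lies in
    [1/t_max, 1/t_min]."""
    T = rng.uniform(t_min, t_max, size=n_units)
    return -np.log(T - 1.0)

def chrono_lstm_biases(t_max, n_units, rng):
    """For an LSTM, the same recipe sets the forget bias to
    b_f = log(U([1, t_max - 1])) and the input bias to b_i = -b_f,
    so forget gates start close to 1 (long memory)."""
    b_f = np.log(rng.uniform(1.0, t_max - 1.0, size=n_units))
    return b_f, -b_f

rng = np.random.default_rng(0)
b_g = chrono_gate_bias(10, 100, 128, rng)       # gated-RNN variant
b_f, b_i = chrono_lstm_biases(1000, 128, rng)   # LSTM variant
```

Only the bias vectors change at initialization; the rest of the model and training loop are untouched, which is why the slides call the implementation effort minimal.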
  • 11. Experiments
> Test by introducing random time warpings in some data and comparing the robustness of gated and ungated architectures
> Chrono LSTM initialization vs standard initialization on synthetic and real tasks
- Synthetic tasks: chrono LSTM > standard LSTM
- Real tasks: chrono LSTM >= standard LSTM
> Additional tests
- Text8 (Mahoney, 2011): next-character prediction
- Penn Treebank (Mikolov et al., 2012): next-word prediction
  • 12. Experiments
Pure warpings and paddings
> Uniform warping vs variable warping
The unwarped task (without time warping or padding): remember the previous character of a character sequence (too easy)
The only difficulty comes from the warping > we test the robustness of various architectures to time warping
A more difficult task uses padded sequences
> Each element of the input sequence is followed by a fixed or variable number of 0s
  • 13. Experiments
  • 14. Experiments
Compare three recurrent architectures (all networks contain 64 recurrent units):
- RNN (ungated recurrent network)
- Leaky RNN (each unit has a constant learnable "gate" between 0 and 1)
- Gated RNN (with one gate per unit)
  • 15. Experiments
Results under uniform time warping, uniform padding, variable time warping, and variable padding
  • 16. Experiments
> Copy task: checks whether a model can remember information for arbitrarily long durations (Hochreiter & Schmidhuber, 1997)
- For a given T, input sequences consist of T+20 characters
- The first 10 characters are drawn uniformly at random from the first 8 letters of the alphabet
- They are followed by T−1 dummy characters, a trigger character, and 10 final dummy characters
- Target sequences consist of T+10 dummy characters followed by the first 10 input characters
Baseline: predicting at random from among the possible characters (Arjovsky et al., 2016)
LSTM (128 hidden units)
> Standard initialization (baseline): forget biases = 1
> Chrono initialization: T_max = T
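A data generator matching this description can be sketched as follows (my own sketch; the integer symbol encoding is arbitrary):

```python
import numpy as np

# Symbols: 0..7 = the 8 letters, 8 = dummy, 9 = trigger marker.
DUMMY, TRIGGER = 8, 9

def copy_task_sample(T, rng):
    """One copy-task example: the model must reproduce the first
    10 letters after a delay of roughly T steps.
    Input : 10 letters, T-1 dummies, trigger, 10 dummies  (T+20 symbols)
    Target: T+10 dummies, then the 10 letters             (T+20 symbols)"""
    letters = rng.integers(0, 8, size=10)
    x = np.concatenate([letters, np.full(T - 1, DUMMY), [TRIGGER], np.full(10, DUMMY)])
    y = np.concatenate([np.full(T + 10, DUMMY), letters])
    return x, y

x, y = copy_task_sample(100, np.random.default_rng(0))
```

The trigger character tells the model when to start reproducing the letters, so the required memory span grows linearly with T.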
  • 17. Experiments
Chrono initialization (red) vs standard initialization (blue)
Standard initialization > forget gate biases = 1
New initialization > forget gate and input gate biases chosen according to the chrono initialization
  • 18. Experiments
> Adding task
- Training: input sequences of length T, consisting of a first sequence of numbers drawn from 𝒰([0,1]) and a second sequence that is zero everywhere except at two marked positions
- Target: the sum of the numbers in the first sequence at the positions marked in the second sequence
Baseline: always predicting the mean of the sum of two 𝒰([0,1]) draws, i.e. 1 > MSE = Var = 1/6 ≈ 0.167 (Arjovsky et al., 2016)
LSTM (128 hidden units)
> Standard initialization (baseline): forget biases = 1
> Chrono initialization: T_max = T
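The adding task and its memoryless baseline can be reproduced in a few lines (my own sketch; names are illustrative). The baseline MSE of 0.167 is just the variance of the sum of two independent 𝒰([0,1]) draws, 2 × 1/12 = 1/6.

```python
import numpy as np

def adding_task_sample(T, rng):
    """One adding-task example: a row of values ~ U([0,1]) and a 0/1
    marker row selecting two positions; the target is the sum of the
    two marked values."""
    values = rng.uniform(0.0, 1.0, size=T)
    markers = np.zeros(T)
    i, j = rng.choice(T, size=2, replace=False)
    markers[i] = markers[j] = 1.0
    return np.stack([values, markers]), values[i] + values[j]

# A memoryless model can do no better than always predicting the mean
# of the target, E[target] = 1; its MSE is then Var(target) = 2/12.
rng = np.random.default_rng(0)
targets = np.array([adding_task_sample(50, rng)[1] for _ in range(20000)])
baseline_mse = np.mean((targets - 1.0) ** 2)   # close to 1/6
```

Beating 0.167 by a clear margin is therefore the signal that a model has actually learned to use the markers rather than ignore the sequence.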
  • 19. Experiments
  • 20. Conclusion
> Gated connections appear to adjust the local time constant of the recurrent model
> Chrono initialization is introduced, a principled way of initializing gate biases in LSTMs and GRUs
> Chrono initialization has been shown to provide notable benefits when faced with long-term dependencies
  • 21. THANK YOU.