1. Zoneout: Regularizing RNNs by Randomly
Preserving Hidden Activations
Krueger et al. In CoRR 2016
Federico Raue
Reading Group at DFKI
27-September-2016
2. Content
Dropout in Feed-forward Networks
Related Work
Dropout in RNN
Stochastic Depth
Zoneout
Experiments
Sequential Permuted MNIST
Character level – Penn Treebank
Word level – Penn Treebank
Conclusions
4. Dropout in Feed-forward Networks
1
N. Srivastava et al. (2014). “Dropout: A Simple Way to Prevent Neural
Networks from Overfitting”. In: Journal of Machine Learning Research 15.
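A minimal NumPy sketch (my own illustration, not from the slides) of the inverted-dropout mask described by Srivastava et al.: during training each unit is kept with probability keep_prob and the surviving activations are rescaled, while at test time the layer is left unchanged.

import numpy as np

def dropout(h, keep_prob=0.5, train=True, rng=np.random.default_rng(0)):
    # inverted dropout: zero units with prob (1 - keep_prob), rescale the rest
    if not train:
        return h                               # no noise at test time
    mask = rng.random(h.shape) < keep_prob     # Bernoulli(keep_prob) node mask
    return h * mask / keep_prob                # keeps the expected activation unchanged

h = np.array([0.2, -1.3, 0.7, 2.1])
print(dropout(h))                              # some entries zeroed, the rest scaled by 2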
7. Dropout in RNN
Train a pseudo-ensemble model2
the source network is the parent model
each sampled model is a child model
noise process → sample node masks → extract subnetworks
2
P. Bachman et al. (2014). “Learning with pseudo-ensembles”. In:
Advances in Neural Information Processing Systems.
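To make the pseudo-ensemble view concrete, a hedged NumPy illustration (sizes and names are made up, not from Bachman et al.): the parent model owns the shared weights, each sampled node mask defines a child subnetwork, and averaging the children's outputs approximates an ensemble.

import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 4))                       # parent model: one shared weight matrix

def child_forward(x, keep_prob=0.5):
    # one child subnetwork: sample a node mask, drop those units, reuse the parent weights
    mask = rng.random(4) < keep_prob
    return np.tanh(W @ x) * mask / keep_prob

x = rng.normal(size=4)
children = [child_forward(x) for _ in range(8)]   # eight sampled child models
print(np.mean(children, axis=0))                  # pseudo-ensemble average of their outputs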
8. Dropout in RNN
Figure: First attempts at Dropout in RNNs3,4
Dropout is applied only to the feed-forward connections (up the stack), not to the recurrent connections (forward through time)
3
V. Pham et al. (2014). “Dropout improves recurrent neural networks for
handwriting recognition”. In: Frontiers in Handwriting Recognition (ICFHR),
2014 14th International Conference on. IEEE.
4
W. Zaremba et al. (2014). “Recurrent neural network regularization”. In:
arXiv preprint arXiv:1409.2329.
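A small sketch of that scheme, assuming a two-layer vanilla tanh RNN in NumPy (the cited papers use LSTMs, so this is only illustrative): dropout masks are applied where activations move up the stack, while the hidden-to-hidden transitions are left untouched.

import numpy as np

rng = np.random.default_rng(0)

def drop(h, keep=0.5):
    # per-step dropout mask on a feed-forward (vertical) connection
    return h * (rng.random(h.shape) < keep) / keep

n = 8
Wx1, Wh1 = rng.normal(size=(n, n)), rng.normal(size=(n, n))   # layer 1
Wx2, Wh2 = rng.normal(size=(n, n)), rng.normal(size=(n, n))   # layer 2

h1, h2 = np.zeros(n), np.zeros(n)
for x in rng.normal(size=(20, n)):                 # a toy sequence of 20 input vectors
    h1 = np.tanh(Wx1 @ drop(x)  + Wh1 @ h1)        # dropout on the input connection only
    h2 = np.tanh(Wx2 @ drop(h1) + Wh2 @ h2)        # dropout between layers; Wh @ h untouched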
30. Sequential Permuted MNIST (1/3)
Sequential MNIST: the pixels of an image of a digit are presented to an RNN one at a time, in lexicographic order (left to right, top to bottom)
Permuted Sequential MNIST: the pixels are presented in a (fixed) random order
Metric: classification error
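For concreteness, a short NumPy sketch of how the two variants can be built from one 28x28 image (MNIST loading omitted; the random array is a stand-in): sequential MNIST flattens the image row by row, and permuted sequential MNIST additionally applies a single fixed permutation shared by the whole dataset.

import numpy as np

image = np.random.rand(28, 28)                       # stand-in for one 28x28 MNIST digit

seq = image.reshape(-1)                              # sequential MNIST: 784 pixels, row by row
perm = np.random.default_rng(0).permutation(784)     # one fixed permutation, shared by all images
permuted_seq = seq[perm]                             # permuted sequential MNIST

# the RNN then receives one pixel per time step, e.g. permuted_seq[t] at step t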
39. Conclusions
Instead of dropping neurons out, zone them out: preserved units keep their previous hidden activation
More robust to changes in the hidden state
The identity connections introduced by zoneout improve the flow of information through the network
Future Work: adapt the probability of updating each unit based on the input sequence
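As a summary, a minimal NumPy sketch of the zoneout update for one hidden-state step (the rate 0.15 is only an illustrative value): during training each unit keeps its previous activation with probability z, otherwise it takes the ordinary update; at test time the expected value of this stochastic update is used.

import numpy as np

def zoneout_step(h_prev, h_new, z=0.15, train=True, rng=np.random.default_rng(0)):
    # with probability z a unit keeps its previous value instead of taking the new update
    if train:
        keep_old = rng.random(h_prev.shape) < z
        return np.where(keep_old, h_prev, h_new)
    return z * h_prev + (1 - z) * h_new              # test time: expected value of the update

h_prev = np.zeros(4)
h_new = np.tanh(np.random.default_rng(1).normal(size=4))
print(zoneout_step(h_prev, h_new))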
41. References I
Bachman, P. et al. (2014). “Learning with pseudo-ensembles”. In:
Advances in Neural Information Processing Systems,
pp. 3365–3373.
Gal, Y. (2015). “A theoretically grounded application of dropout in
recurrent neural networks”. In: arXiv preprint arXiv:1512.05287.
Huang, G. et al. (2016). “Deep networks with stochastic depth”.
In: arXiv preprint arXiv:1603.09382.
Krueger, D. et al. (2016). “Zoneout: Regularizing RNNs by
Randomly Preserving Hidden Activations”. In: arXiv preprint
arXiv:1606.01305.
Moon, T. et al. (2015). “RNNDrop: A novel dropout for RNNs in
ASR”. In: 2015 IEEE Workshop on Automatic Speech
Recognition and Understanding (ASRU). IEEE, pp. 65–70.
42. References II
Pham, V. et al. (2014). “Dropout improves recurrent neural
networks for handwriting recognition”. In: Frontiers in
Handwriting Recognition (ICFHR), 2014 14th International
Conference on. IEEE, pp. 285–290.
Semeniuta, S. et al. (2016). “Recurrent Dropout without Memory
Loss”. In: arXiv preprint arXiv:1603.05118.
Srivastava, N. et al. (2014). “Dropout: A Simple Way to Prevent
Neural Networks from Overfitting”. In: Journal of Machine
Learning Research 15, pp. 1929–1958.
Zaremba, W. et al. (2014). “Recurrent neural network
regularization”. In: arXiv preprint arXiv:1409.2329.