Skip, residual and densely connected RNN architectures

Fréderic Godin - Ph.D. Researcher
Department of Electronics and Information Systems
IDLab

Who is Fréderic?
Ph.D. Researcher Deep Learning @ IDLab
Main interests:
- Sequence models
- Hybrid RNN/CNN models
Major application domain: Natural Language Processing
- Noisy data (e.g., Twitter data)
- Parsing tasks (e.g., Named Entity Recognition)
Minor application domain: Computer Vision
- Lung cancer detection (Kaggle competition, ranked 7th of 1972)
(http://blog.kaggle.com/2017/05/16/data-science-bowl-2017-predicting-lung-cancer-solution-write-up-team-deep-breath/)

Agenda
1. Recurrent neural networks
2. Skip, residual and dense connections
3. Dense connections in practice

Recurrent neural networks
- Neural network with a cyclic connection
- Has memory
- Models variable-length sequences
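As a minimal sketch of these three properties, the plain-Python code below runs a simple (Elman-style) recurrent layer over sequences of different lengths. The function names `rnn_step` and `run_rnn`, the tanh activation and the toy weights are illustrative assumptions, not taken from the talk.

```python
import math

def rnn_step(x, h_prev, W_x, W_h, b):
    """One recurrent step: h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""
    h = []
    for i in range(len(b)):
        total = b[i]
        total += sum(W_x[i][j] * x[j] for j in range(len(x)))
        total += sum(W_h[i][j] * h_prev[j] for j in range(len(h_prev)))
        h.append(math.tanh(total))
    return h

def run_rnn(inputs, W_x, W_h, b):
    """Apply the same weights at every time step, for any sequence length."""
    h = [0.0] * len(b)      # initial (empty) memory
    states = []
    for x in inputs:        # the loop is the cyclic connection, unfolded in time
        h = rnn_step(x, h, W_x, W_h, b)
        states.append(h)
    return states

# Toy weights (hypothetical): 2 input features, 3 hidden units.
W_x = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
W_h = [[0.1, 0.0, 0.2], [0.0, 0.1, -0.1], [0.3, -0.2, 0.1]]
b = [0.0, 0.1, -0.1]

short = run_rnn([[1.0, 0.0], [0.0, 1.0]], W_x, W_h, b)
longer = run_rnn([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]], W_x, W_h, b)
print(len(short), len(longer))
```

The hidden state `h` is the memory: the state at a given step depends on all earlier inputs, not only on the current one.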

Unfolded recurrent neural network
[Figure: recurrent neural network unfolded over time steps t=1 to t=4, e.g., with inputs word1 to word4]

Stacking recurrent neural networks
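[Figure: stacked recurrent layers unfolded over time steps t=1 to t=4 with inputs word1 to word4; the network is deep in time (across time steps) and deep in height (across stacked layers)]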

Vanishing gradients
- When updating the weights using backpropagation, the gradient tends to shrink with every layer it passes through
- Often caused by saturating activation functions (e.g., sigmoid, tanh)
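A small numerical sketch of this effect, assuming a chain of sigmoid units with weight 1 and pre-activation 0 (toy values chosen for illustration, not from the talk): by the chain rule, each unit multiplies the gradient by the derivative of its activation, which for the sigmoid is at most 0.25.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)    # never larger than 0.25

def gradient_through_chain(depth, w=1.0, z=0.0):
    """Gradient magnitude after backpropagating through `depth` sigmoid units.

    Each unit scales the incoming gradient by w * sigmoid'(z) (chain rule).
    """
    grad = 1.0
    for _ in range(depth):
        grad *= w * sigmoid_grad(z)
    return grad

for depth in (1, 5, 10, 20):
    print(depth, gradient_through_chain(depth))
```

With these values the gradient shrinks by a factor of 4 per unit, so after ten units it is already below one millionth of its original size.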

Backpropagating through stacked RNNs
[Figure: stacked recurrent layers unfolded over time steps t=1 to t=4; gradients flow backward in time (across time steps) and backward in height (across stacked layers)]

Mitigating the vanishing gradient problem
In time: Long Short-Term Memory (LSTM)
In height:
- Many techniques exist for convolutional neural networks
- This talk: can we apply them to RNNs?
Key equation to model depth in time
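For reference, in the standard LSTM formulation the update usually credited with carrying gradients across time steps is the additive cell-state update. It is given here as an assumption about which equation is meant; the symbols ($c_t$ for the cell state, $f_t$ and $i_t$ for the forget and input gates, $\tilde{c}_t$ for the candidate state) follow common convention and are not taken from the slides:

```latex
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
```

Because $c_{t-1}$ enters additively and is scaled only by the forget gate, the direct path from $c_t$ back to $c_{t-1}$ does not pass through a squashing activation function at every time step.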

Skip, residual and dense connections

Skip connection
A direct connection between two non-consecutive layers
- No vanishing gradient
- Two main flavors:
  - Concatenative skip connections
  - Additive skip connections
[Figure: Layer 1 → Layer 2 → Layer 3; Out 1 skips Layer 2 and is merged with the output of Layer 2 ("Merge 1,2") as input to Layer 3]

(Concatenative) skip connection
Concatenate the output of the previous layer and the skip connection
Advantage:
Provides the output of the first layer to the third layer without altering it
Disadvantage:
Doubles the input size (for equally sized layers)
[Figure: Layer 1 → Layer 2 → Layer 3; the input to Layer 3 is the concatenation of Out 2 and Out 1]
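The trade-off of a concatenative skip connection can be sketched in plain Python. The helper `make_layer` (a fully connected tanh layer with random weights) and the sizes are hypothetical stand-ins for the three layers in the figure.

```python
import math
import random

def make_layer(n_in, n_out, seed):
    """Hypothetical fully connected tanh layer with small random weights."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.05, 0.05) for _ in range(n_in)]
               for _ in range(n_out)]

    def forward(x):
        assert len(x) == n_in, "input size mismatch"
        return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]

    return forward

hidden = 4
x = [1.0, -0.5, 0.25, 0.0]

layer1 = make_layer(len(x), hidden, seed=1)
layer2 = make_layer(hidden, hidden, seed=2)
layer3 = make_layer(2 * hidden, hidden, seed=3)   # input is twice as wide

out1 = layer1(x)
out2 = layer2(out1)
merged = out2 + out1      # list concatenation, i.e. [Out 2, Out 1]
out3 = layer3(merged)
print(len(merged), len(out3))
```

Layer 3 sees `out1` unchanged in the second half of `merged`, at the cost of a weight matrix with twice as many input columns.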

Additive skip connection (Residual connection)
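Originates from the image classification domain
A residual connection is defined as:
$$\mathrm{Out}_{1+2} = \underbrace{\mathrm{Layer}_2(\mathrm{Out}_1)}_{\text{“Residue”}} + \mathrm{Out}_1$$
[Figure: Layer 1 → Layer 2 → Layer 3; Out 1 is added to the output of Layer 2, and the sum (Out 1+2) is the input to Layer 3]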

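Additive skip connection (Residual connection) in RNN
Residual connections do not make sense in RNNs: Layer 2 also depends on h(t-1)
Additive skip connection:
$$\underbrace{\mathrm{Out}_{1+2}}_{y} = \underbrace{\mathrm{Layer}_2(\mathrm{Out}_1, h_{t-1})}_{h_t} + \underbrace{\mathrm{Out}_1}_{x}$$
[Figure: Layer 1 → Layer 2 → Layer 3; Out 1 is added to the output of Layer 2, and the sum (Out 1+2) is the input to Layer 3]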

Additive skip connection
Sum the output of the previous layer and the skip connection
Advantage:
The input size of the next layer does not increase
Disadvantage:
Can create noisy input to the next layer
[Figure: Layer 1 → Layer 2 → Layer 3; Out 1 is added to the output of Layer 2, and the sum (Out 1+2) is the input to Layer 3]
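A minimal sketch of an additive skip connection around a recurrent layer. The element-wise layer `rnn_layer_step` and its weights are hypothetical, and the choice to carry the unsummed state to the next time step is an assumption made for this example.

```python
import math

def rnn_layer_step(x, h_prev, w_x=0.5, w_h=0.3):
    """Hypothetical element-wise recurrent layer:
    h_t = tanh(w_x * x + w_h * h_{t-1})."""
    return [math.tanh(w_x * xi + w_h * hi) for xi, hi in zip(x, h_prev)]

def additive_skip_step(out1, h_prev):
    """Layer 2 with an additive skip connection around it.

    Returns (y, h_t): y = h_t + out1 is passed up to Layer 3, while h_t
    is carried to the next time step (assumption of this sketch).
    """
    h_t = rnn_layer_step(out1, h_prev)
    y = [h + o for h, o in zip(h_t, out1)]   # element-wise sum, same size as out1
    return y, h_t

out1 = [0.2, -0.4, 0.9]
h_prev = [0.0, 0.1, -0.1]
y, h_t = additive_skip_step(out1, h_prev)
print(len(out1), len(y))
```

The output `y` has the same size as `out1`, so Layer 3 needs no wider input. In exchange, Layer 3 can no longer separate the two summed signals, which is one way to read the "noisy input" disadvantage.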

Densely connecting layers
Add a skip connection from the output of every layer to the input of every later layer
Advantage:
- Direct paths between every pair of layers
- Hierarchy of features as input to every layer
Disadvantage: (L-1)*L connections for L layers
[Figure: Layer 1 to Layer 4; Layer 2 receives Out 1, Layer 3 receives Out 1 and Out 2, Layer 4 receives Out 1, Out 2 and Out 3]
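To make the cost of dense connectivity concrete, the sketch below tracks how wide the input of each layer becomes when all earlier outputs are concatenated. The function name `dense_input_sizes` and the sizes (embedding and hidden size of 200, four layers) are assumptions chosen for illustration.

```python
def dense_input_sizes(embedding_size, hidden_size, num_layers):
    """Input width of each stacked layer when every layer sees all earlier outputs.

    Layer k (1-based) receives the embedding output plus the outputs of
    layers 1..k-1, all concatenated.
    """
    sizes = []
    width = embedding_size
    for _ in range(num_layers):
        sizes.append(width)
        width += hidden_size    # this layer's output is appended for later layers
    return sizes, width         # `width` is what a final classifier would see

sizes, final_width = dense_input_sizes(embedding_size=200, hidden_size=200,
                                       num_layers=4)
print(sizes, final_width)       # [200, 400, 600, 800] 1000
```

The input width grows linearly with depth, so each additional layer is more expensive than the previous one.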

Densely connected layers in practice

Language modeling
Building a model that captures the statistical characteristics of a language
In practice: predicting the next word in a sentence
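Next-word prediction is commonly scored with perplexity, the metric used in the result tables of this talk. A minimal sketch, assuming the model's probabilities for the words that actually came next are already available as a list (the function name and the toy probabilities are illustrative):

```python
import math

def perplexity(next_word_probs):
    """Perplexity: exp of the average negative log-probability that the
    model assigned to each word that actually came next."""
    nll = -sum(math.log(p) for p in next_word_probs) / len(next_word_probs)
    return math.exp(nll)

# A model that guesses uniformly over a 10,000-word vocabulary.
uniform_10k = perplexity([1.0 / 10000] * 5)

# A model that is fairly confident about each next word.
confident = perplexity([0.5, 0.25, 0.5, 0.125])
print(uniform_10k, confident)
```

A uniform guess over V words gives a perplexity of V, so a perplexity of about 80 can be read as the model being, on average, as uncertain as a uniform choice among roughly 80 words.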

Example architecture
[Figure: Embedding layer → LSTM → LSTM → ... → Classification layer]

Training details
Stochastic gradient descent with a learning rate schedule
Uniform weight initialization in [-0.05, 0.05]
Dropout with probability 0.6

Experimental results
Model                                Hidden state size  # Layers  # Params  Perplexity
Stacked LSTM (Zaremba et al., 2014)  650                2         20M       82.7
Stacked LSTM (Zaremba et al., 2014)  1500               2         66M       78.4
Stacked LSTM                         200                2         5M        100.9
Stacked LSTM                         200                3         5M        108.8
Stacked LSTM                         350                2         9M        87.9
Densely Connected LSTM               200                2         9M        80.4
Densely Connected LSTM               200                3         11M       78.5
Densely Connected LSTM               200                4         14M       76.9
Lower perplexity is better
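The parameter counts in the table can be approximated with a back-of-the-envelope calculation. The sketch below rests on assumptions that are not stated on the slides: a vocabulary of 10,000 words, an embedding size equal to the hidden size, untied input and output embeddings, and the standard LSTM parameterization with four gates. In the densely connected variant, every LSTM layer and the classification layer receive the concatenation of all earlier outputs.

```python
def lstm_params(n_in, n_hidden):
    """Standard LSTM layer: 4 gates, each with input, recurrent and bias weights."""
    return 4 * (n_in + n_hidden + 1) * n_hidden

def model_params(hidden, layers, dense, vocab=10_000):
    """Rough count for embedding + LSTM stack + softmax classifier.

    Assumptions: vocabulary of 10,000 words, embedding size equal to the
    hidden size, untied input/output embeddings.
    """
    total = vocab * hidden                  # embedding layer
    n_in = hidden
    for _ in range(layers):
        total += lstm_params(n_in, hidden)
        # Dense: the next layer also sees everything this layer saw.
        n_in = n_in + hidden if dense else hidden
    total += (n_in + 1) * vocab             # classification layer with bias
    return total

for name, hidden, layers, dense in [
    ("Stacked LSTM", 1500, 2, False),
    ("Stacked LSTM", 200, 2, False),
    ("Densely Connected LSTM", 200, 2, True),
    ("Densely Connected LSTM", 200, 3, True),
    ("Densely Connected LSTM", 200, 4, True),
]:
    millions = model_params(hidden, layers, dense) / 1e6
    print(f"{name:24s} {hidden:5d} {layers} {millions:6.2f}M")
```

Under these assumptions the counts round to the figures in the table. They also suggest that most of the extra parameters of the densely connected models sit in the classification layer, whose input grows with every added layer.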

Character-to-word language modeling
[Figure: Embedding layer → ConvNet → Highway layer → LSTM → LSTM → ... → Classification layer]

Experimental results
Model                                Hidden state size  # Layers  # Params  Perplexity
Stacked LSTM (Zaremba et al., 2014)  650                2         20M       82.7
Stacked LSTM (Zaremba et al., 2014)  1500               2         66M       78.4
CharCNN (Kim et al., 2016)           650                2         19M       78.9
Densely Connected LSTM               200                3         11M       78.5
Densely Connected LSTM               200                4         14M       76.9
Densely Connected CharCNN*           200                4         20M       74.6
*Not published
Lower perplexity is better

Conclusion
Densely connecting all layers improves language modeling performance
- Avoids vanishing gradients
- Creates a hierarchy of features, available to each layer
We use six times fewer parameters (11M vs. 66M) to obtain a similar perplexity (78.5 vs. 78.4) as a stacked LSTM

Q&A
Also more details in our publication:
Fréderic Godin, Joni Dambre & Wesley De Neve
“Improving Language Modeling using Densely Connected
Recurrent Neural Networks”
https://arxiv.org/abs/1707.06130

Fréderic Godin
Ph.D. Researcher Deep Learning
IDLab
E frederic.godin@ugent.be
@frederic_godin
www.fredericgodin.com
idlab.technology / idlab.ugent.be
