
Skip, residual and densely connected RNN architectures


Slides presented during the Datascience Meetup @Sentiance. Based on the following paper:
"Improving Language Modeling using Densely Connected Recurrent Neural Networks".

See http://www.fredericgodin.com/publications/ for more info.


Slide transcript

  1. Skip, residual and densely connected RNN architectures
     Frederic Godin - Ph.D. Researcher, Department of Electronics and Information Systems, IDLab
  2. Who is Fréderic?
     Ph.D. Researcher Deep Learning @ IDLab
     Main interests:
     - Sequence models
     - Hybrid RNN/CNN models
     Major application domain: Natural Language Processing
     - Noisy data (e.g., Twitter data)
     - Parsing tasks (e.g., Named Entity Recognition)
     Minor application domain: Computer Vision
     - Lung cancer detection (Kaggle competition, 7th/1972) (http://blog.kaggle.com/2017/05/16/data-science-bowl-2017-predicting-lung-cancer-solution-write-up-team-deep-breath/)
  3. Agenda
     1. Recurrent neural networks
     2. Skip, residual and dense connections
     3. Dense connections in practice
  4. Recurrent neural networks
  5. Recurrent neural networks
     - Neural network with a cyclic connection
     - Has memory
     - Models variable-length sequences
  6. E.g.: Unfolded recurrent neural network
     [Figure: an RNN unfolded over time steps t=1..t=4, reading word1..word4]
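A minimal sketch (editor's illustration, not from the slides) of the recurrence behind the unfolded picture, written as PyTorch-style Python; the weight sizes and names are assumptions:

    # Vanilla RNN cell unrolled over a 4-word sequence (t=1..t=4 as on the slide).
    import torch

    emb_dim, hidden_dim = 32, 64
    W_x = torch.randn(emb_dim, hidden_dim) * 0.05     # input-to-hidden weights
    W_h = torch.randn(hidden_dim, hidden_dim) * 0.05  # hidden-to-hidden weights: the cyclic connection
    b = torch.zeros(hidden_dim)

    words = torch.randn(4, emb_dim)   # word1..word4 as (made-up) embedding vectors
    h = torch.zeros(hidden_dim)       # the "memory" carried from one time step to the next
    for x_t in words:                 # the same weights are reused at every time step
        h = torch.tanh(x_t @ W_x + h @ W_h + b)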
  7. Stacking recurrent neural networks
     [Figure: two stacked RNN layers unfolded over t=1..t=4 (word1..word4); the unrolled time axis is "deep in time", the stack of layers is "deep in height"]
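A sketch of "deep in height" versus "deep in time", assuming PyTorch and illustrative layer sizes (neither is specified on the slide):

    import torch
    import torch.nn as nn

    # Three stacked recurrent layers: each layer reads the hidden states of the layer below.
    layers = nn.ModuleList([nn.LSTM(input_size=200, hidden_size=200, batch_first=True)
                            for _ in range(3)])

    x = torch.randn(8, 4, 200)   # batch of 8 sequences of 4 time steps (word1..word4)
    out = x
    for lstm in layers:          # depth in height: go up the stack
        out, _ = lstm(out)       # depth in time: each LSTM unrolls over the 4 time steps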
  8. Vanishing gradients
     - When updating the weights using backpropagation, the gradient tends to vanish with every neuron it crosses
     - Often caused by the activation function
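A toy illustration of the claim (editor's example, not from the slides): a sigmoid activation has a local derivative of at most 0.25, so a gradient that crosses ten such units shrinks by roughly a factor of a million:

    print(0.25 ** 10)   # ~9.5e-07: the update reaching the earliest layers all but disappears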
  9. Backpropagating through stacked RNNs
     [Figure: the stacked RNN of slide 7; gradients are backpropagated both in time (across t=4..t=1) and in height (down the stack of layers)]
  10. Mitigating the vanishing gradient problem
      In time: Long Short-Term Memory (LSTM) - key equation to model depth in time (equation shown on the slide; see the note below)
      In height:
      - Many techniques exist in convolutional neural networks
      - This talk: can we apply them in RNNs?
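The equation itself is not captured in the transcript. Assuming it is the standard LSTM cell-state update (an assumption based on the slide title, not copied from the slide):

    c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \quad \tilde{c}_t = \tanh(W_x x_t + W_h h_{t-1} + b)

The additive path through c_t is what lets gradients flow across many time steps without being repeatedly squashed by an activation function.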
  11. Skip, residual and dense connections
  12. Skip connection
      A direct connection between 2 non-consecutive layers
      - No vanishing gradient
      - 2 main flavors:
        - Concatenative skip connections
        - Additive skip connections
      [Figure: Layer 1 -> Layer 2; Out 1 is also routed around Layer 2 and merged with its output before Layer 3]
  13. (Concatenative) skip connection
      Concatenate the output of the previous layer and the skip connection
      Advantage: provides the output of the first layer to the third layer without altering it
      Disadvantage: doubles the input size
      [Figure: Layer 3 receives the concatenation of Out 2 and Out 1]
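A sketch of the concatenative variant (illustrative sizes, PyTorch assumed):

    import torch

    out1 = torch.randn(8, 200)             # output of layer 1
    out2 = torch.randn(8, 200)             # output of layer 2
    in3 = torch.cat([out2, out1], dim=-1)  # input to layer 3: out1 arrives unaltered,
                                           # but the input width doubles to 400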
  14. Additive skip connection (residual connection)
      Originates from the image classification domain
      A residual connection is defined as: Out 1+2 = Out 1 + Layer 2(Out 1), where Layer 2(Out 1) is the "residue"
      [Figure: Layer 3 receives Out 1+2, the sum of Out 1 and Layer 2's output]
  15. Additive skip connection (residual connection) in RNNs
      Residual connections do not make sense in RNNs: Layer 2 also depends on h(t-1), so Out 1+2 is not simply Out 1 plus a residue computed from Out 1
      [Figure: the additive skip connection of slide 14, with the recurrent Layer 2 also receiving its previous hidden state h(t-1)]
  16. Additive skip connection
      Sum the output of the previous layer and the skip connection
      Advantage: the input size to the next layer does not increase
      Disadvantage: can create a noisy input to the next layer
      [Figure: Layer 3 receives Out 1+2]
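The additive variant from the last three slides as a sketch (illustrative sizes, PyTorch assumed); in an RNN, out2 already depends on Layer 2's previous hidden state, so the sum is not a clean "input plus residue" decomposition:

    import torch

    out1 = torch.randn(8, 200)   # output of layer 1
    out2 = torch.randn(8, 200)   # output of layer 2 (in an RNN this also depends on h(t-1))
    in3 = out1 + out2            # input to layer 3: width stays 200, but the signals are mixed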
  17. Densely connecting layers
      Add a skip connection between every output and every input of every layer
      Advantages:
      - Direct paths between every layer
      - Hierarchy of features as input to every layer
      Disadvantage: (L-1)*L connections
      [Figure: four layers; each layer receives the concatenated outputs of all previous layers]
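A sketch of the dense wiring described here, assuming PyTorch and DenseNet-style concatenation of the input and all previous layer outputs; sizes are illustrative, not the paper's exact configuration:

    import torch
    import torch.nn as nn

    emb, hidden, steps, batch = 200, 200, 4, 8
    x = torch.randn(batch, steps, emb)            # embedded input sequence

    lstms, inputs = nn.ModuleList(), [x]
    for l in range(3):                            # 3 densely connected recurrent layers
        in_size = sum(t.size(-1) for t in inputs) # input width grows with every layer
        lstm = nn.LSTM(in_size, hidden, batch_first=True)
        lstms.append(lstm)
        out, _ = lstm(torch.cat(inputs, dim=-1))  # direct path from every earlier output
        inputs.append(out)

    features = torch.cat(inputs, dim=-1)          # hierarchy of features for the classifier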
  18. Densely connected layers in practice
  19. Language modeling
      Building a model which captures the statistical characteristics of a language
      In practice: predicting the next word in a sentence
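The usual definitions behind this slide and the perplexity numbers reported later (these formulas are not in the transcript; they are the standard language-modeling ones):

    P(w_1, \dots, w_N) = \prod_{t=1}^{N} P(w_t \mid w_1, \dots, w_{t-1})

    \text{perplexity} = \exp\Big( -\tfrac{1}{N} \sum_{t=1}^{N} \ln P(w_t \mid w_{<t}) \Big)

A model that predicts the next word well assigns it a high probability, which drives perplexity down; hence "lower is better" in the result tables.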
  20. Example architecture
      [Figure: word1..word4 -> Embedding layer -> LSTM -> LSTM -> Classification layer -> predicted next words word2..word5]
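A minimal sketch of this architecture (embedding, two LSTM layers, classification layer over the vocabulary), assuming PyTorch; the vocabulary and layer sizes are placeholders, not the paper's:

    import torch
    import torch.nn as nn

    class WordLM(nn.Module):
        def __init__(self, vocab=10000, emb=200, hidden=200):
            super().__init__()
            self.embedding = nn.Embedding(vocab, emb)
            self.lstm = nn.LSTM(emb, hidden, num_layers=2, batch_first=True)
            self.classifier = nn.Linear(hidden, vocab)   # predicts the next word at every step

        def forward(self, word_ids):                     # word_ids: (batch, time)
            h, _ = self.lstm(self.embedding(word_ids))
            return self.classifier(h)                    # logits for word t+1 given words 1..t

    model = WordLM()
    logits = model(torch.randint(0, 10000, (8, 4)))      # word1..word4 in, word2..word5 predicted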
  21. Training details
      - Stochastic Gradient Descent with a learning-rate scheme
      - Uniform initialization in [-0.05, 0.05]
      - Dropout with probability 0.6
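How those three settings could look in code, reusing the WordLM sketch above (PyTorch assumed; the exact learning-rate scheme is not in the transcript, so the scheduler values are placeholders):

    import torch
    import torch.nn as nn

    for p in model.parameters():
        nn.init.uniform_(p, -0.05, 0.05)       # uniform initialization in [-0.05, 0.05]

    dropout = nn.Dropout(p=0.6)                # dropout with probability 0.6, applied between layers
    optimizer = torch.optim.SGD(model.parameters(), lr=1.0)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)  # placeholder scheme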
  22. Experimental results (lower perplexity is better)
      Model                                | Hidden states | # Layers | # Params | Perplexity
      Stacked LSTM (Zaremba et al., 2014)  | 650           | 2        | 20M      | 82.7
      Stacked LSTM (Zaremba et al., 2014)  | 1500          | 2        | 66M      | 78.4
      Stacked LSTM                         | 200           | 2        | 5M       | 100.9
      Stacked LSTM                         | 200           | 3        | 5M       | 108.8
      Stacked LSTM                         | 350           | 2        | 9M       | 87.9
      Densely Connected LSTM               | 200           | 2        | 9M       | 80.4
      Densely Connected LSTM               | 200           | 3        | 11M      | 78.5
      Densely Connected LSTM               | 200           | 4        | 14M      | 76.9
  23. Character-to-word language modeling
      [Figure: characters of word1..word4 -> Embedding layer -> ConvNet -> Highway layer -> LSTM -> LSTM -> Classification layer -> predicted next words word2..word5]
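The highway layer sitting between the character ConvNet and the LSTMs is a standard component; a sketch of it in its usual formulation (dimensions assumed, PyTorch assumed, not the paper's code):

    import torch
    import torch.nn as nn

    class Highway(nn.Module):
        def __init__(self, dim=200):
            super().__init__()
            self.transform = nn.Linear(dim, dim)
            self.gate = nn.Linear(dim, dim)

        def forward(self, x):
            t = torch.sigmoid(self.gate(x))    # how much of the transformed signal to let through
            return t * torch.relu(self.transform(x)) + (1 - t) * x   # the rest is carried unchanged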
  24. Experimental results (lower perplexity is better)
      Model                                | Hidden states | # Layers | # Params | Perplexity
      Stacked LSTM (Zaremba et al., 2014)  | 650           | 2        | 20M      | 82.7
      Stacked LSTM (Zaremba et al., 2014)  | 1500          | 2        | 66M      | 78.4
      CharCNN (Kim et al., 2016)           | 650           | 2        | 19M      | 78.9
      Densely Connected LSTM               | 200           | 3        | 11M      | 78.5
      Densely Connected LSTM               | 200           | 4        | 14M      | 76.9
      Densely Connected CharCNN*           | 200           | 4        | 20M      | 74.6
      *Not published
  25. Conclusion
  26. Conclusion
      - Densely connecting all layers improves language modeling performance
      - Avoids vanishing gradients
      - Creates a hierarchy of features, available to each layer
      - We use six times fewer parameters to obtain the same result as a stacked LSTM
  27. Q&A
      More details in our publication:
      Fréderic Godin, Joni Dambre & Wesley De Neve, "Improving Language Modeling using Densely Connected Recurrent Neural Networks", https://arxiv.org/abs/1707.06130
  28. Fréderic Godin - Ph.D. Researcher Deep Learning, IDLab
      E frederic.godin@ugent.be | @frederic_godin | www.fredericgodin.com | idlab.technology / idlab.ugent.be
