
PR-043: HyperNetworks


Paper review: "HyperNetworks" by David Ha, Andrew Dai, Quoc V. Le (ICLR 2017)
Presented at Tensorflow-KR paper review forum (#PR12) by Taesu Kim
Paper link: https://arxiv.org/abs/1609.09106
Video link: https://www.youtube.com/watch?v=-tUQXSdEsMk (in Korean)

http://www.neosapience.com


  1. 1. HyperNetworks. Presented by Taesu Kim, Oct 29, 2017. David Ha, Andrew Dai, Quoc V. Le, Google Brain. Published at ICLR 2017
  2. 2. HyperNetworks overview › An approach that uses one network to generate the weights for another network › Motivated by HyperNEAT (Stanley et al., 2009); tries to resemble the relationship between genotype and phenotype in nature › A HyperNetwork can be viewed as a relaxed form of weight sharing across layers › It generates non-shared weights for an LSTM and achieves near state-of-the-art results › It generates shared weights for a CNN and achieves respectable results with fewer learnable parameters (a minimal weight-generation sketch follows the slide list)
  3. 3. Conventional Networks Feedforward Networks Recurrent Networks
  4. 4. Static HyperNetworks
  5. 5. HyperCNN
  6. 6. Dynamic HyperNetworks
  7. 7. HyperRNN
  8. 8. Modified HyperRNN › The basic HyperRNN requires Nz times more memory than a plain RNN › Goal: make it more scalable and memory-efficient › Use an intermediate hidden vector d(z), a linear projection of z, to parameterize the weight matrix: each row of the static weight matrix is scaled by d(z) (a scaling sketch follows the slide list)
  9. 9. HyperLSTM › LSTM implementation: https://github.com/hardmaru/supercell/ (a rough single-step sketch follows the slide list)
  10. 10. MNIST and CIFAR-10 › 40-1: N=6, k=1 › 40-2: N=6, k=2
  11. 11. Character-level Penn Treebank Language Model › 1000-unit main LSTM & two versions of HyperLSTM – 128-unit HyperLSTM cell & embedding size 4 – 128-unit HyperLSTM cell & embedding size 16 → dropout keep probability of 85% › HyperLSTM outperforms the standard LSTM › HyperLSTM also achieves improvements similar to Layer Normalization → the combination of Layer Normalization and HyperLSTM achieves the best test perplexity
  12. 12. Hutter Prize Wikipedia Language Model › 1800-unit main LSTM & 256-unit HyperLSTM cell with embedding size 64 & max sequence length 250 › 2048-unit main LSTM & 256-unit HyperLSTM cell with embedding size 64 & max sequence length 300 › HyperLSTM also achieves improvements similar to Layer Normalization → the combination of Layer Normalization and HyperLSTM achieves the best test perplexity › HyperLSTM converges more quickly than LSTM and Layer Norm LSTM
  13. 13. Hutter Prize Wikipedia Language Model › Visualizes how the weight-scaling vectors of the main LSTM change during the character sampling process › In regions of low intensity, where the weights of the main LSTM are relatively static, the generated phrases seem more deterministic – For example, the weights do not change much during the words Europeans, possessions and reservation › Regions of high intensity are where the HyperLSTM cell is making relatively large changes to the weights of the main LSTM
  14. 14. Hutter Prize Wikipedia Language Model › Normalized histogram plots of φ(c_t) for different models during sampling – φ(c_t) is the hidden state of the LSTM before applying the output gate › Layer Norm reduces the saturation effects compared to the vanilla LSTM › In HyperLSTM, the cell is saturated most of the time – The HyperLSTM cell's dynamic weight adjustment policy appears to be doing something very different from statistical normalization – Although the policy it came up with ends up providing similar performance to Layer Norm
  15. 15. Handwriting sequence generation › 12,179 handwritten lines from 221 writers › The LSTM input is the (x, y) coordinate of the pen location and a binary pen-up/pen-down indicator › One can see that many of the weight changes occur at the boundaries between words and between characters › Dynamically generating the generative model is one of the key advantages of HyperLSTM over a normal LSTM
  16. 16. Machine translation › WMT'14 En→Fr, using the same test/validation split described in the GNMT paper – The GNMT network has 8 layers in each of the encoder and decoder › The HyperLSTM cell improves the performance of the existing GNMT model, achieving state-of-the-art single-model results on this dataset › This demonstrates the applicability of HyperNetworks to large-scale models used in production systems
  17. 17. Follow us: Contact us: contact@neosapience.com For more information: http://www.neosapience.com
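
A minimal sketch of the static-hypernetwork idea from slides 2, 4 and 5: a small two-layer network maps a learned, per-layer embedding z to the convolution kernel of the main CNN. All shapes and names here (generate_kernel, W_i, B_i, W_out, B_out) are illustrative assumptions in plain NumPy, not the paper's exact implementation.

import numpy as np

N_z, d = 4, 16               # layer-embedding size and hypernetwork hidden size (assumed)
f, N_in, N_out = 3, 16, 16   # kernel size and channel counts of the main conv layer (assumed)

rng = np.random.default_rng(0)
# Hypernetwork parameters: shared across every main-network layer it generates
W_i   = rng.standard_normal((N_in, d, N_z)) * 0.01
B_i   = np.zeros((N_in, d))
W_out = rng.standard_normal((N_out * f * f, d)) * 0.01
B_out = np.zeros(N_out * f * f)

def generate_kernel(z):
    """Map a layer embedding z of shape (N_z,) to a conv kernel (N_in, N_out, f, f)."""
    a = W_i @ z + B_i              # (N_in, d): one hidden vector per input-channel slice
    K = a @ W_out.T + B_out        # (N_in, N_out * f * f)
    return K.reshape(N_in, N_out, f, f)

z_layer = rng.standard_normal(N_z)   # in training, each layer's z is learned jointly
K = generate_kernel(z_layer)
print(K.shape)                       # (16, 16, 3, 3)

Because each layer's embedding z is far smaller than the kernel it generates, the main CNN's weights are effectively compressed into the shared hypernetwork plus a few numbers per layer, which is the "relaxed weight sharing" view from slide 2.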
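A minimal sketch of the memory-efficient dynamic weights from slide 8: instead of generating a full weight matrix from z (which costs roughly Nz times the memory of the main RNN), z is linearly projected to d(z) and each row of the static weight matrix W is scaled by d(z). The names dynamic_matvec and W_z are assumptions for illustration.

import numpy as np

N_h, N_z = 8, 4
rng = np.random.default_rng(1)
W   = rng.standard_normal((N_h, N_h)) * 0.1   # static recurrent weights of the main RNN
W_z = rng.standard_normal((N_h, N_z)) * 0.1   # projection that produces d(z)

def dynamic_matvec(z, h):
    """Compute W(z) @ h with W(z) = diag(d(z)) @ W, never materializing W(z)."""
    d = W_z @ z              # d(z): one scale per output unit
    return d * (W @ h)       # row-wise scaling of W

z_t    = rng.standard_normal(N_z)   # would come from the small hyper-RNN at step t
h_prev = rng.standard_normal(N_h)
out = dynamic_matvec(z_t, h_prev)

# Equivalent to explicitly building the dynamically scaled matrix:
assert np.allclose(out, np.diag(W_z @ z_t) @ W @ h_prev)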
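Building on the same scaling scheme, a rough sketch of one main-LSTM step as in the HyperLSTM of slide 9, with layer normalization omitted and the smaller hyper-LSTM cell that produces z_x, z_h, z_b left out; the actual code is in https://github.com/hardmaru/supercell/. All names here (hyper_lstm_step, W_zx, W_zh, W_zb) are assumptions.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

N_x, N_h, N_z = 4, 8, 3
rng = np.random.default_rng(2)
p = lambda *shape: rng.standard_normal(shape) * 0.1

# Static main-LSTM weights for the four gates (i, g, f, o), stacked row-wise
W_x, W_h, b0 = p(4 * N_h, N_x), p(4 * N_h, N_h), np.zeros(4 * N_h)
# Projections turning the hyper-LSTM embeddings into per-row scales and dynamic biases
W_zx, W_zh, W_zb = p(4 * N_h, N_z), p(4 * N_h, N_z), p(4 * N_h, N_z)

def hyper_lstm_step(x_t, h_prev, c_prev, z_x, z_h, z_b):
    """One main-LSTM step whose gate pre-activations are scaled/shifted by z_x, z_h, z_b."""
    d_x, d_h, d_b = W_zx @ z_x, W_zh @ z_h, W_zb @ z_b
    pre = d_x * (W_x @ x_t) + d_h * (W_h @ h_prev) + b0 + d_b
    i, g, f, o = np.split(pre, 4)
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)     # np.tanh(c) is the phi(c_t) discussed on slide 14
    return h, c

# z_x, z_h, z_b would be produced at each step by a smaller LSTM reading [x_t, h_prev]
h, c = hyper_lstm_step(rng.standard_normal(N_x), np.zeros(N_h), np.zeros(N_h),
                       *(rng.standard_normal(N_z) for _ in range(3)))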
