
Daniel Shank, Data Scientist, Talla at MLconf SF 2016


Neural Turing Machines: Perils and Promise: Daniel Shank is a Senior Data Scientist at Talla, a company developing a platform for intelligent information discovery and delivery. His focus is on developing machine learning techniques to handle various business automation tasks, such as scheduling, polls, and expert identification, as well as work on NLP. Before joining Talla as the company's first employee in 2015, Daniel worked with TechStars Boston and did consulting work for ThriveHive, a Boston marketing company focused on small businesses. He studied economics at the University of Chicago.



  1. Neural Turing Machines: Perils and Promise (Daniel Shank)
  2. Overview: 1. Neural Turing Machines, 2. Applications and Performance, 3. Challenges and Recommendations, 4. Dynamic Neural Computers
  3. Neural Turing Machines
  4. What's a Turing Machine? A model of a computer: a memory tape plus read and write heads.
  5. What's a Neural Turing Machine? A neural network "controller" attached to an external memory; it learns from sequences. (Graves et al. 2014, arXiv:1410.5401v2)
  6. Neural Turing Machines are differentiable Turing machines: the 'sharp' functions (discrete memory reads, writes, and addressing) are made smooth, so the whole machine can be trained with backpropagation. A minimal sketch of the idea follows below.
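To make the "smooth instead of sharp" point concrete, here is a minimal numpy sketch of a soft, content-based memory read: a softmax over cosine similarities replaces a hard slot lookup, so the read is differentiable in both the key and the memory. This mirrors the content-addressing component described in Graves et al. 2014, but the sizes and the key-strength value below are illustrative, not the paper's settings.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def content_addressed_read(memory, key, beta=1.0):
    """Soft read from memory.

    memory: (N, M) array of N slots with M features each.
    key:    (M,) query vector emitted by the controller.
    beta:   key strength; larger values sharpen the addressing.

    Instead of picking one slot (a 'sharp', non-differentiable choice),
    every slot is weighted by a softmax over cosine similarities and the
    weighted average is returned, which is smooth in all inputs.
    """
    sims = memory @ key / (
        np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    weights = softmax(beta * sims)       # attention over memory slots
    return weights @ memory, weights     # differentiable read vector

# Example: 8 memory slots, 4 features each.
memory = np.random.randn(8, 4)
read_vec, w = content_addressed_read(memory, key=np.random.randn(4), beta=5.0)
```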
  7. Applications and Performance
  8. Neural Turing Machines can learn simple algorithms (copy, repeat, recognizing simple formal languages...), generalize (for example, to sequences longer than those seen in training), do well at language modeling, and do well at bAbI. A copy-task data sketch follows below.
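Since the copy task comes up repeatedly in these results, here is a hedged sketch of how such training data can be generated. The overall shape (random binary sequence, delimiter, then the model reproduces the sequence) follows Graves et al. 2014, but the exact tensor layout, the extra delimiter channel, and the function name are assumptions of this sketch.

```python
import numpy as np

def make_copy_example(seq_len, num_bits=8):
    """Generate one copy-task example.

    The input is a random binary sequence followed by a delimiter flag;
    the target is the same sequence, which the model must reproduce from
    memory after the delimiter has been seen.
    """
    seq = np.random.randint(0, 2, size=(seq_len, num_bits))
    # Input: the sequence, then a delimiter step on an extra channel,
    # then blank steps while the model emits its copy.
    inputs = np.zeros((2 * seq_len + 1, num_bits + 1))
    inputs[:seq_len, :num_bits] = seq
    inputs[seq_len, num_bits] = 1          # delimiter flag
    targets = np.zeros((2 * seq_len + 1, num_bits))
    targets[seq_len + 1:, :] = seq         # copy expected after delimiter
    return inputs, targets

x, y = make_copy_example(seq_len=5)
```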
  9. Generalization on the copy/repeat task (figure from Graves et al. 2014)
  10. Neural Turing Machines outperform LSTMs (Graves et al. 2014)
  11. Balanced parentheses (Tristan Deleu, https://medium.com/snips-ai/)
  12. bAbI dataset (https://research.facebook.com/research/babi/): a small-vocabulary question-answering corpus of short stories, where each question is answered from the story's context and annotated with the line number of its supporting fact. Example story: "1 Mary moved to the bathroom. 2 John went to the hallway. 3 Where is Mary? bathroom 1. 4 Daniel went back to the hallway. 5 Sandra moved to the garden. 6 Where is Daniel? hallway 4. 7 John moved to the office. 8 Sandra journeyed to the bathroom. 9 Where is Daniel? hallway 4. 10 Mary moved to the hallway. 11 Daniel travelled to the office. 12 Where is Daniel? office 11. 13 John went back to the garden. 14 John moved to the bedroom. 15 Where is Sandra? bathroom 8." A second, shorter story: "1 Sandra travelled to the office. 2 Sandra went to the bathroom. 3 Where is Sandra? bathroom 2." A parsing sketch follows below.
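For readers unfamiliar with the file layout: bAbI stories are numbered lines, and question lines carry the question, the answer, and the supporting-fact indices separated by tabs, with the line numbering restarting at 1 for each new story. A small parsing sketch (the function name and tuple layout are mine):

```python
def parse_babi(lines):
    """Parse bAbI-format lines into (story, question, answer, support) tuples."""
    stories, context = [], []
    for line in lines:
        idx, text = line.split(' ', 1)
        if int(idx) == 1:                  # a new story begins
            context = []
        if '\t' in text:                   # question \t answer \t supporting facts
            question, answer, support = text.split('\t')
            stories.append((list(context), question.strip(), answer.strip(),
                            [int(s) for s in support.split()]))
        else:
            context.append((int(idx), text.strip()))
    return stories

example = [
    "1 Mary moved to the bathroom.",
    "2 John went to the hallway.",
    "3 Where is Mary?\tbathroom\t1",
]
print(parse_babi(example))
```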
  13. bAbI results (from Yu et al. 2015, Empirical Study on Deep Learning Models for Question Answering)
  14. Challenges and Recommendations
  15. Problems: the approach is architecture dependent, has a large number of parameters, doesn't benefit much from GPU acceleration, and is hard to train.
  16. Hard to train: numerical instability, learning to use the memory is hard, and smart optimization is needed, all of which makes NTMs difficult to use in practice.
  17. Combating numerical instability with gradient clipping: it limits how fast any parameter can move on a single update and is particularly helpful for learning long-range dependencies (combined clipping sketch after the next slide).
  18. Loss clipping: cap the total response to a given training batch; helpful in addition to gradient clipping. See the sketch below.
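A combined sketch of both clipping tricks, assuming gradients arrive as a list of numpy arrays. The thresholds are illustrative, and the simple min() rule for loss clipping is my reading of "cap the total response to a given training batch", not a prescription from the talk.

```python
import numpy as np

def clip_gradients(grads, max_norm=10.0):
    """Rescale all gradients so their global L2 norm is at most max_norm.

    This caps how far any single (possibly exploding) batch can move the
    parameters, which helps when learning long-range dependencies.
    """
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-8))
    return [g * scale for g in grads]

def clip_loss(batch_loss, max_loss=100.0):
    """Cap the total loss for a batch before backpropagating through it."""
    return min(batch_loss, max_loss)
```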
  19. Graves' RMSprop: the gradient-update rule used to train the network in many of Graves' RNN papers. With ε_i the gradient of the loss with respect to weight w_i:
      n_i = α·n_i + (1 − α)·ε_i²
      g_i = α·g_i + (1 − α)·ε_i
      Δ_i = β·Δ_i − γ·ε_i / √(n_i − g_i² + δ)
      w_i = w_i + Δ_i
      This is similar to normalizing gradient updates by a running estimate of their variance, which is important given the NTM's highly variable changes in loss. A numpy version follows below.
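The same update written out in numpy; the constants below are illustrative defaults of this sketch, not the talk's settings.

```python
import numpy as np

def graves_rmsprop_step(w, grad, state, alpha=0.95, beta=0.9,
                        gamma=1e-4, delta=1e-4):
    """One Graves-style RMSprop update, following the slide's equations.

    state holds the running statistics n (mean of squared gradients),
    g (mean of gradients) and the momentum term d, all shaped like w.
    """
    n, g, d = state['n'], state['g'], state['d']
    n = alpha * n + (1 - alpha) * grad ** 2        # running mean of grad^2
    g = alpha * g + (1 - alpha) * grad             # running mean of grad
    d = beta * d - gamma * grad / np.sqrt(n - g ** 2 + delta)
    w = w + d
    state.update(n=n, g=g, d=d)
    return w, state

w = np.zeros(3)
state = {'n': np.zeros(3), 'g': np.zeros(3), 'd': np.zeros(3)}
w, state = graves_rmsprop_step(w, grad=np.array([0.5, -1.0, 2.0]), state=state)
```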
  20. Adam optimizer: works well for many tasks, comes pre-loaded in most ML frameworks, and, like Graves' RMSprop, smooths gradients (example below).
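For comparison, switching to Adam is typically a one-liner in a modern framework; for example in PyTorch, where the Linear module below is just a stand-in for a real controller:

```python
import torch

model = torch.nn.Linear(10, 10)  # placeholder for an actual NTM controller
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```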
  21. Pay attention to initialization: memory initialization is extremely important, and a poor choice can prevent convergence; pay particularly close attention to the starting value of the memory (sketch below).
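The slides do not prescribe a specific scheme, so the following is only one common choice, stated as an assumption: start the memory at a small constant value and concentrate the initial attention on a known slot so the first reads are well defined and near zero.

```python
import numpy as np

def init_ntm_state(num_slots=128, slot_size=20, init_value=1e-6):
    """Initialize NTM memory and addressing state before each sequence.

    A small constant memory (rather than random noise) keeps early reads
    close to zero and avoids large, noisy gradients at the start of
    training; the exact values here are assumptions, not the talk's recipe.
    """
    memory = np.full((num_slots, slot_size), init_value)
    # Put all attention on the first slot so addressing starts well defined.
    read_weights = np.zeros(num_slots)
    read_weights[0] = 1.0
    write_weights = read_weights.copy()
    return memory, read_weights, write_weights
```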
  22. Short sequences first ("curriculum learning"): 1) feed in short training data; 2) when the loss hits a target, increase the size of the input; 3) repeat. A sketch of the loop follows below.
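A sketch of that loop; train_step (returning a loss) and make_batch (building a batch for a given sequence length) are hypothetical hooks into your own model and data pipeline, and the thresholds are illustrative.

```python
def curriculum_train(train_step, make_batch, start_len=2, max_len=20,
                     loss_target=0.05, steps_per_check=100, max_checks=1000):
    """Curriculum loop: train on short sequences, lengthen them when loss is low."""
    seq_len = start_len
    for _ in range(max_checks):
        losses = [train_step(make_batch(seq_len)) for _ in range(steps_per_check)]
        if sum(losses) / len(losses) < loss_target:   # loss target hit...
            seq_len += 1                              # ...so make the inputs longer
            if seq_len > max_len:
                break
    return seq_len
```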
  23. Dynamic Neural Computers
  24. Neural Turing Machines "v2" (the differentiable neural computer of Graves et al. 2016): similar to NTMs, except there is no index-shift-based addressing, memory can be 'allocated' and 'deallocated', and recent memory use is remembered.
  25. Architecture updates (1) (figure from Graves et al. 2016)
  26. Architecture updates (2) (figure from Graves et al. 2016)
  27. Dynamic Neural Computer performance on inference tasks (Graves et al. 2016)
  28. Dynamic Neural Computer bAbI results (Graves et al. 2016)
  29. References
      Implementations:
        TensorFlow: https://github.com/carpedm20/NTM-tensorflow
        Go: https://github.com/fumin/ntm
        Torch: https://github.com/kaishengtai/torch-ntm
        Node.js: https://github.com/gcgibson/NTM
        Lasagne: https://github.com/snipsco/ntm-lasagne
        Theano: https://github.com/shawntan/neural-turing-machines
      Papers:
        Graves et al. 2016 – Hybrid computing using a neural network with dynamic external memory
        Graves et al. 2014 – Neural Turing Machines
        Yu et al. 2015 – Empirical Study on Deep Learning Models for Question Answering
        Rae et al. 2016 – Scaling Memory-Augmented Neural Networks with Sparse Reads and Writes
  30. NTM operations: the convolutional shift parameter has proven to be one of the most problematic parts of the architecture, if not the most problematic. A sketch of the shift operation follows below.
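To see why this parameter is delicate, here is a numpy sketch of the convolutional (rotational) shift used in NTM addressing: the attention weights are circularly convolved with a distribution over allowed shifts, so a diffuse shift distribution blurs the focus across neighboring slots. The three-way shift range and example values are illustrative.

```python
import numpy as np

def convolutional_shift(weights, shift_dist):
    """Circularly convolve attention weights with a shift distribution.

    weights:    (N,) attention over N memory slots.
    shift_dist: distribution over shifts, e.g. (-1, 0, +1), as a (3,) array
                summing to 1.
    """
    n = len(weights)
    shifted = np.zeros(n)
    offsets = range(-(len(shift_dist) // 2), len(shift_dist) // 2 + 1)
    for s, p in zip(offsets, shift_dist):
        shifted += p * np.roll(weights, s)   # move mass s slots along the tape
    return shifted

# Sharp focus on slot 1 gets smeared onto slots 0 and 2 by a soft shift.
w = np.array([0.0, 1.0, 0.0, 0.0])
print(convolutional_shift(w, shift_dist=np.array([0.1, 0.8, 0.1])))
```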
