This document discusses neural Turing machines (NTMs): neural networks coupled to an external memory that can be trained end-to-end with backpropagation. NTMs can learn simple algorithms and generalize beyond their training examples, with applications in tasks such as language modeling and question answering. However, they are difficult to train, owing to numerical instability and the difficulty of learning to use memory efficiently. The document recommends techniques such as gradient clipping, loss clipping, and curriculum learning to stabilize training. It also covers later developments such as differentiable neural computers (DNCs), which can allocate and deallocate memory.
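Of the stabilization techniques mentioned, gradient clipping is the most common. A minimal sketch of global-norm gradient clipping in plain NumPy (the function name and threshold are illustrative, not from the original document):

```python
import numpy as np

def clip_gradients(grads, max_norm):
    """Rescale a list of gradient arrays so that their combined
    L2 norm does not exceed max_norm; gradients under the
    threshold are returned unchanged."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

# Example: a gradient of norm 5.0 clipped to norm 1.0
clipped = clip_gradients([np.array([3.0, 4.0])], max_norm=1.0)
```

Loss clipping works analogously on the per-step loss value, and curriculum learning orders training examples from short, easy sequences to longer ones.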