#PR12 #Paper6 #Neural_Turing_Machine
The paper I presented is Neural Turing Machine. It is an architecture in which the neural network and the memory are separated, and which learns algorithms.
video in Korean: https://www.youtube.com/watch?v=2wbDiZCWQtY&t=1071s
2. Kiho Suh - PR12
About the Paper
• Published in October 2014 (v1)
• Updated in December 2014 (v2)
• https://arxiv.org/abs/1410.5401
3. Kiho Suh - PR12
What is a Turing Machine?
• A model of a computer
• Memory tape with read and write heads
• The controller (program) attends to a specific element
• Discrete, so it cannot be trained with backpropagation
4. Kiho Suh - PR12
What is a Neural Turing Machine (NTM)?
• Inspired by the Turing Machine, to perform tasks that computers can do very well but machine learning cannot.
• Narrows the gap between neural networks and algorithms
• A ‘differentiable’ Turing Machine
• ‘Sharp’ functions are made smooth, so the machine can be trained with backpropagation.
Modified from Daniel Shank
5. Kiho Suh - PR12
What is a Neural Turing Machine (NTM)?
Neural Network (CPU) + External Memory (RAM)
A neural net that separates computation from memory.
A computer that learns programs or algorithms from input and output examples (copy, sort, …):
[7,9,3,2] → [2,3,7,9]
[4,3,0,5] → [0,3,4,5]
[6,9,1,2] → [1,2,6,9]
[7,2,8,3] → [2,3,7,8]
…
Modified from Alex Graves’ slides
6. Kiho Suh - PR12
NTM Architecture
[Figure: a controller (feedforward or recurrent) with read and write heads that select portions of the memory, an external real-valued matrix]
7. Kiho Suh - PR12
Why not an RNN or LSTM?
• Memory is tied up in the activations of the latent state of the network.
• High computation cost: more memory means a bigger network.
• Content fragility: the state is constantly overwritten with new information.
8. Kiho Suh - PR12
Why NTM?
Rather than artificially increasing the size of the hidden state in the RNN or LSTM, we would like to arbitrarily increase the amount of knowledge we add to the model while making minimal changes to the model itself.
9. Kiho Suh - PR12
Innovations
1. Memory-augmented networks
2. Attention mechanism: a novel idea in 2014 - check out Neural Machine Translation by Jointly Learning to Align and Translate (Bahdanau, Cho, Bengio 2014)
3. Writing mechanism, unlike other memory-augmented networks such as Memory Networks (Weston et al. 2014) and End-to-End Memory Networks (Sukhbaatar et al. 2015).
10. Kiho Suh - PR12
NTM Architecture in more detail
[Figure: the controller takes the input and the previous read vector r_{t-1} and emits the output, the addressing parameters (k_t, β_t, g_t, s_t, γ_t) and the erase/add vectors (e_t, a_t); the addressing module turns w_{t-1} and the memory into the new weighting w_t, which the write head uses to update M_{t-1} into M_t and the read head uses to produce r_t]
Attention
• We want to focus on the specific parts of the memory that the network wants to read from and write to.
• When this paper came out in 2014, attention was a novel idea. Now it is standard.
• The controller’s outputs parameterize a distribution (weighting) over the rows (locations) of the memory matrix.
• This weighting is the secret sauce.
12. Kiho Suh - PR12
Data Structure and Accessors
• Content only: memory is accessed like an associative map. Inspired by the brain.
• Content and location: a key finds an array, and a shift indexes into it.
• Location only: a shift iterates from the last focus. Inspired by the computer.
One-shot Learning with Memory-Augmented Neural Networks (Santoro et al. 2016)
3 × 5 = 15 (O), 3 × 5 = 14.95 (X)
13. Kiho Suh - PR12
Addressing
[Figure: the addressing pipeline. The previous weighting w_{t-1}, the memory M_t, and the head parameters k_t, β_t, g_t, s_t, γ_t pass through four stages, Content Addressing → Interpolation → Convolutional Shift → Sharpening, producing w_t^c, w_t^g, w̃_t, and finally w_t]
14. Kiho Suh - PR12
Selective Memory
[Figure: the addressing pipeline, as on slide 13]
• Key design question: how the network interacts with memory.
• Make sure it does not interact with the whole memory at once.
• We do not want to lose the nice property of independence between memory and computation.
Right image from Tristan Deleu’s slide
15. Kiho Suh - PR12
Content Addressing
w_t^c(i) ← softmax_i( β_t · K[k_t, M_t(i)] ),  where K[u, v] = (u · v) / (‖u‖ ‖v‖) is cosine similarity
[Figure: the addressing pipeline, Content Addressing stage highlighted]
Content addressing retrieves a complete pattern, or a specific stored version of it, from an approximate guess.
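Below is a minimal NumPy sketch of what this stage computes; the function name content_addressing and the shapes are my own assumptions, not from the paper or the slides.

```python
import numpy as np

def content_addressing(M, k, beta):
    """Content-based addressing: softmax over the cosine similarity
    between the key k and every memory row, scaled by the key strength beta.

    M    : (N, W) memory matrix, one row per location
    k    : (W,)   key emitted by the controller
    beta : scalar key strength; larger beta -> sharper focus
    """
    eps = 1e-8  # avoid division by zero for all-zero rows
    sim = (M @ k) / (np.linalg.norm(M, axis=1) * np.linalg.norm(k) + eps)
    z = beta * sim
    z = z - z.max()                      # numerical stability
    return np.exp(z) / np.exp(z).sum()   # softmax over locations
```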
17. Kiho Suh - PR12
Interpolation (Location Addressing)
w_t^g ← g_t · w_t^c + (1 - g_t) · w_{t-1},  with 0 ≤ g_t ≤ 1
[Figure: the addressing pipeline, Interpolation stage highlighted]
The gate g_t blends the new content-based weighting with the previous time step’s final weighting.
18. Kiho Suh - PR12
Interpolation (Location Addressing)
w_t^c (content weight)          = [0  0  1  0 0 0]
w_{t-1} (previous final weight) = [.9 .1 0  0 0 0]
g_t (interpolation gate), 0 ≤ g_t ≤ 1:
  when g_t = 1:   [0 0 1 0 0 0]          (the w_{t-1} term becomes 0: content only)
  when g_t = 0.5: [.45 .05 .50 0 0 0]
  when g_t = 0:   [.9 .1 0 0 0 0]        (the w_t^c term becomes 0: location/previous weighting only)
Modified from Mark Chang’s slide
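A minimal NumPy sketch of the interpolation gate, reproducing the numbers on this slide (the function name interpolate is my own):

```python
import numpy as np

def interpolate(w_c, w_prev, g):
    """Interpolation gate: blend the content weighting with the previous
    weighting. g=1 keeps only w_c (content only), g=0 keeps only w_prev."""
    return g * w_c + (1.0 - g) * w_prev

w_c    = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.0])   # content weight from the slide
w_prev = np.array([0.9, 0.1, 0.0, 0.0, 0.0, 0.0])   # previous final weight
print(interpolate(w_c, w_prev, 0.5))                # [0.45 0.05 0.5  0.   0.   0. ]
```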
19. Kiho Suh - PR12
Convolutional Shift (Location Addressing)
[Figure: the addressing pipeline, Convolutional Shift stage highlighted]
The controller says how wide an area should be affected by the action of the head.
20. Kiho Suh - PR12
Convolutional Shift (Location Addressing)
w(i) <- w(i-1)*s(1) + w(i)s(0) + w(i+1)s(-1)
s = [1 0 0] s = [0 0 1] s = [.5 0 .5]
-1 0 1 -1 0 1 -1 0 1
[.45 .05 .50 0 0 0]
[.05 .50 0 0 0 .45]
[.45 .05 .50 0 0 0]
[0 .45 .05 .50 0 0]
[.45 .05 .50 0 0 0]
[.25 .475 .025 .25 0 .225]
~
All the numbers shift to left. All the numbers shift to right.
All the numbers give half of
itself to left and right.
[wi-1
g wi
g wi+1
g]
[s-1 s0 s1]
wi
~
wtg (interpolated weight)
st (shift weight)
Modified from Mark Chang’s slide
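A minimal NumPy sketch of the circular convolutional shift, reproducing the shift examples above (the function name conv_shift and the offset convention are my own assumptions):

```python
import numpy as np

def conv_shift(w_g, s):
    """Circular convolutional shift: w_tilde(i) = sum_j w_g(i - j) * s(j),
    where s is indexed over the offsets [-1, 0, +1] (for a length-3 s)."""
    N = len(w_g)
    offsets = np.arange(len(s)) - len(s) // 2         # e.g. [-1, 0, 1]
    w_tilde = np.zeros(N)
    for i in range(N):
        for off, s_val in zip(offsets, s):
            w_tilde[i] += w_g[(i - off) % N] * s_val  # wrap around the ends
    return w_tilde

w_g = np.array([0.45, 0.05, 0.50, 0.0, 0.0, 0.0])
print(conv_shift(w_g, np.array([1.0, 0.0, 0.0])))     # shift left:  [0.05 0.5 0 0 0 0.45]
print(conv_shift(w_g, np.array([0.0, 0.0, 1.0])))     # shift right: [0 0.45 0.05 0.5 0 0]
```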
21. Kiho Suh - PR12
Sharpening (Location Addressing)
[Figure: the addressing pipeline, Sharpening stage highlighted]
The convolution in the previous step can blur the weighting, so it is sharpened:
w_t(i) ← w̃_t(i)^γ_t / Σ_j w̃_t(j)^γ_t,  with γ_t ≥ 1
We finally obtain the address (a weight value for each memory location) of the memory that the controller thinks we need.
22. Kiho Suh - PR12
Sharpening (Location Addressing)
w̃_t (shifted weight) = [0 .45 .05 .50 0 0],  γ_t ≥ 1:
  γ_t = 50: [0 0 0 1 0 0]              (γ is much bigger than 5 or 0, so the array is sharpened)
  γ_t = 5:  [0 .37 0 .62 0 0]
  γ_t = 0:  [.16 .16 .16 .16 .16 .16]  (a γ smaller than 1 would make w_t even more blurred, which is why γ_t ≥ 1 is required)
Modified from Mark Chang’s slide
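A minimal NumPy sketch of sharpening, reproducing the γ_t = 5 and γ_t = 50 cases above (the function name sharpen is my own):

```python
import numpy as np

def sharpen(w_tilde, gamma):
    """Sharpening: raise every weight to the power gamma (>= 1) and renormalize,
    undoing the blur introduced by the convolutional shift."""
    w = w_tilde ** gamma
    return w / w.sum()

w_tilde = np.array([0.0, 0.45, 0.05, 0.50, 0.0, 0.0])
print(sharpen(w_tilde, 5))    # approximately [0, 0.37, 0, 0.63, 0, 0]
print(sharpen(w_tilde, 50))   # approximately [0, 0,    0, 1,    0, 0]
```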
23. From Tristan Deleu’s blog
Addressing is “soft” and distributed across the entire memory. However, it is quantitatively focused on very few cells.
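Putting the four stages together, here is a minimal sketch of one full addressing step, reusing the hypothetical helper functions from the earlier sketches (content_addressing, interpolate, conv_shift, sharpen):

```python
def address(M, w_prev, k, beta, g, s, gamma):
    """One full addressing step, chaining the four stages from the slides:
    content addressing -> interpolation -> convolutional shift -> sharpening."""
    w_c = content_addressing(M, k, beta)   # focus by similarity to the key
    w_g = interpolate(w_c, w_prev, g)      # blend with the previous weighting
    w_s = conv_shift(w_g, s)               # rotate the focus
    return sharpen(w_s, gamma)             # re-sharpen the blurred weighting
```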
24. Kiho Suh - PR12
Writing
[Figure: writing is split into an Erase step and an Add step. The write head uses the weighting w_t to erase from M_{t-1} with the erase vector e_t, then add with the add vector a_t, producing M_t]
26. Erase Operation
Erase Operation:  M̃_t(i) ← M_{t-1}(i) ⊙ [1 - w_t(i)·e_t]
Memory (N locations × M components, showing the first three locations):
  location 0: [1, 1, 2]   location 1: [1, 2, 4]   location 2: [2, 3, 1]
Head location: w_t = [0.9, 0.1, 0, …, 0]   (length N)
Erase vector:  e_t = [0, 1, 1]             (length M)
Memory after erasing:
  location 0: [1, 0.1, 0.2]   location 1: [1, 1.8, 3.6]   location 2: [2, 3, 1]
From Mark Chang’s slide
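A minimal NumPy sketch of the erase operation, reproducing the numbers above (the function name erase is my own; memory rows are locations):

```python
import numpy as np

def erase(M, w, e):
    """Erase operation: M_tilde(i) = M(i) * (1 - w(i) * e), done for all
    locations at once with an outer product.  M: (N, W), w: (N,), e: (W,)."""
    return M * (1.0 - np.outer(w, e))

M = np.array([[1.0, 1.0, 2.0],    # location 0
              [1.0, 2.0, 4.0],    # location 1
              [2.0, 3.0, 1.0]])   # location 2
w = np.array([0.9, 0.1, 0.0])     # head location (weighting)
e = np.array([0.0, 1.0, 1.0])     # erase vector
print(erase(M, w, e))             # [[1, 0.1, 0.2], [1, 1.8, 3.6], [2, 3, 1]]
```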
27. Add Operation
Add Operation:  M_t(i) ← M̃_t(i) + w_t(i)·a_t
Memory (after the erase step):
  location 0: [1, 0.1, 0.2]   location 1: [1, 1.8, 3.6]   location 2: [2, 3, 1]
Head location: w_t = [0.9, 0.1, 0, …, 0]   (length N)
Add vector:    a_t = [1, 1, 0]             (length M)
Memory after adding:
  location 0: [1.9, 1.0, 0.2]   location 1: [1.1, 1.9, 3.6]   location 2: [2, 3, 1]
From Mark Chang’s slide
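A minimal NumPy sketch of the add operation, continuing the erase example above (the function name add is my own):

```python
import numpy as np

def add(M_erased, w, a):
    """Add operation: M_t(i) = M_tilde(i) + w(i) * a for every location."""
    return M_erased + np.outer(w, a)

M_erased = np.array([[1.0, 0.1, 0.2],    # memory after the erase step above
                     [1.0, 1.8, 3.6],
                     [2.0, 3.0, 1.0]])
w = np.array([0.9, 0.1, 0.0])            # head location (weighting)
a = np.array([1.0, 1.0, 0.0])            # add vector
print(add(M_erased, w, a))               # [[1.9, 1.0, 0.2], [1.1, 1.9, 3.6], [2, 3, 1]]
```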
28. Kiho Suh - PR12
Reading
[Figure: the read head uses the weighting w_t to read from M_t and returns the read vector r_t to the controller]
29. Read Operation
Read Operation:  r_t ← Σ_i w_t(i) · M_t(i)
Memory:
  location 0: [1, 1, 2]   location 1: [1, 2, 4]   location 2: [2, 3, 1]
Head location: w_t = [0.9, 0.1, 0, …, 0]
Read vector:   r_t = 0.9·[1, 1, 2] + 0.1·[1, 2, 4] = [1.0, 1.1, 2.2]
From Mark Chang’s slide
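A minimal NumPy sketch of the read operation, reproducing the numbers above (the function name read is my own):

```python
import numpy as np

def read(M, w):
    """Read operation: the read vector is the w-weighted sum of memory rows."""
    return w @ M

M = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 4.0],
              [2.0, 3.0, 1.0]])
w = np.array([0.9, 0.1, 0.0])
print(read(M, w))   # [1.0, 1.1, 2.2]
```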
30. Kiho Suh - PR12
NTM Architecture in more detail
[Architecture diagram from slide 10, stepped through element by element in the original slides: at each time step the controller consumes one input element and the previous read vector r_{t-1}, addresses the memory, writes and reads, and emits one output element]
for example: input - ([7,9,3,2], [2,3,7,9])   (unsorted, sorted)
40. Kiho Suh - PR12
NTM Architecture in more detail
[Architecture diagram, as on slide 30]
for example: input - ([7,9,3,2], [2,3,7,9])   (unsorted, sorted)
loss between the model’s output [2,9,3,7] and the target [2,3,7,9]
41. Kiho Suh - PR12
NTM Architecture in more detail
[Architecture diagram, as on slide 30]
for example: input - ([4,3,0,5], [0,3,4,5])   (unsorted, sorted)
51. Kiho Suh - PR12
NTM Architecture in more detail
[Architecture diagram, as on slide 30]
for example: input - ([4,3,0,5], [0,3,4,5])   (unsorted, sorted)
loss between the model’s output [0,3,4,5] and the target [0,3,4,5]
52. Kiho Suh - PR12
Experiments
• NTM with a feedforward controller
• NTM with an LSTM controller
• LSTM network
53. Kiho Suh - PR12
Copy
From top to bottom: External Inputs/Outputs,
Adds/Reads Vectors to Memory,
Write/Read Weightings
54. Kiho Suh - PR12
Copy
NTM copy-task generalization (trained on lengths ≤ 20, tested on length 120)
NTM does not only copy, it also generalizes! So the NTM learns a program.
LSTM copy-task generalization: shift errors appear.
56. Kiho Suh - PR12
Repeat Copy
From top to bottom: External Inputs/Outputs,
Adds/Reads Vectors to Memory,
Write/Read Weightings
NTM learns its first
for-loop, using
content to jump,
iteration to step,
and a variable to
count to N.
59. Kiho Suh - PR12
Associative Recall
From top to bottom: External Inputs/Outputs,
Adds/Reads Vectors to Memory,
Write/Read Weightings
The NTM correctly produces the red-boxed item after it sees the green-boxed query item, similar to a dictionary lookup.
(Figure labels: matching item, query item, next-to-matching item)
61. Kiho Suh - PR12
Associative Recall (Generalization)
(Figure: number of incorrect bits)
62. Kiho Suh - PR12
Dynamic N-Gram
The goal of the dynamic n-gram task was to test whether the NTM could rapidly adapt to new predictive distributions.
(Figure label: mismatching)
64. Kiho Suh - PR12
Priority Sort
The write head writes to locations according to a linear function of the priority, and the read head reads from locations in increasing order.
66. Kiho Suh - PR12
Innovations
1. Memory-augmented networks
2. Attention mechanism: a novel idea in 2014 - check out Neural Machine Translation by Jointly Learning to Align and Translate (Bahdanau, Cho, Bengio 2014)
3. Writing mechanism, unlike other memory-augmented networks such as Memory Networks (Weston et al. 2014) and End-to-End Memory Networks (Sukhbaatar et al. 2015).
67. Kiho Suh - PR12
NTM Architecture
[Architecture diagram, as on slide 10]
68. Kiho Suh - PR12
What to improve?
• Memory management problem -> dynamic allocation
• Retrieving memories in the order they were written -> temporal link matrix
• Graph algorithms for a wider range of tasks
• Reinforcement learning
Differentiable Neural Computer!!!
Hybrid Computing using a neural network with dynamic external memory (Graves et al. 2016)
69. Kiho Suh - PR12
Discussion
• Why are some results better with NTM + feedforward (e.g. associative recall) while others are better with NTM + LSTM (e.g. copy)?
• The paper applies “content addressing” first and then “interpolation.” However, wouldn’t it make more sense to do “interpolation” first and then “content addressing”?
• Differentiability might not be the best way to learn programs, because it is inherently fragile. Programs are discrete in nature, and every bit really counts, so gradient descent might not be desirable. Is NTM’s approach still the right way to go?
• Any questions?