#PR12 #Paper6 #Neural_Turing_Machine
The paper I presented is Neural Turing Machine. It is an architecture in which the neural network and the memory are separated, and which learns algorithms.
video in Korean: https://www.youtube.com/watch?v=2wbDiZCWQtY&t=1071s
2. Kiho Suh - PR12
About the Paper
• Published in October 2014 (v1)
• Updated in December 2014 (v2)
• https://arxiv.org/abs/1410.5401
3. Kiho Suh - PR12
What is a Turing Machine?
• A model of a computer
• Memory tape with read and write heads
• The controller (program) attends to a specific element
• Discrete, so it cannot be trained with backpropagation
4. Kiho Suh - PR12
What is a Neural Turing Machine (NTM)?
• Inspired by the Turing Machine, to perform tasks that computers can do very well but machine learning cannot.
• Narrows the gap between neural networks and algorithms
• A ‘differentiable’ Turing Machine
• ‘Sharp’ functions are made smooth, so the machine can be trained with backpropagation.
Modified from Daniel Shank
5. Kiho Suh - PR12
What is a Neural Turing Machine (NTM)?
Neural Network (CPU) + External Memory (RAM)
A neural net that separates computation from memory.
A computer that learns programs or algorithms from input and output examples (copy, sort, …):
[7,9,3,2] → [2,3,7,9]
[4,3,0,5] → [0,3,4,5]
[6,9,1,2] → [1,2,6,9]
[7,2,8,3] → [2,3,7,8]
…
Modified from Alex Graves’ slides
6. Kiho Suh - PR12
NTM Architecture
[Figure: a controller (feedforward or recurrent) with read and write heads that select portions of the memory, an external real-valued matrix]
7. Kiho Suh - PR12
Why not an RNN or LSTM?
• Memory is tied up in the activations of the latent state of the network.
• High computation cost: more memory means a bigger network.
• Content fragility: the state is constantly overwritten with new information.
8. Kiho Suh - PR12
Why NTM?
Rather than artificially increasing the size of the hidden state in the RNN or LSTM, we would like to arbitrarily increase the amount of knowledge we add to the model while making minimal changes to the model itself.
9. Kiho Suh - PR12
Innovations
1. Memory-augmented networks
2. Attention mechanism: a novel idea in 2014 - check out Neural Machine Translation by Jointly Learning to Align and Translate (Bahdanau, Cho, Bengio 2014)
3. Writing mechanism, unlike other memory-augmented networks such as Memory Networks (Weston et al. 2014) and End-to-End Memory Networks (Sukhbaatar et al. 2015).
10. Kiho Suh - PR12
NTM Architecture in more detail
[Figure: the controller takes the input and the previous read vector r_{t-1} and emits the output, the addressing parameters (k_t, β_t, g_t, s_t, γ_t) and the erase/add vectors (e_t, a_t); the addressing module turns w_{t-1} and the memory into the new weighting w_t, which the write head uses to update M_{t-1} into M_t and the read head uses to produce r_t]
Attention
• We want to focus on the specific parts of the memory that the network wants to read from and write to.
• When this paper came out in 2014, attention was a novel idea. Now it is standard.
• The controller’s outputs parameterize a distribution (weighting) over the rows (locations) of the memory matrix.
• This weighting is the secret sauce.
12. Kiho Suh - PR12
Data Structure and Accessors
• Content only: memory is accessed like an associative map. Inspired by the brain.
• Content and location: a key finds an array, and a shift indexes into it.
• Location only: a shift iterates from the last focus. Inspired by the computer.
One-shot Learning with Memory-Augmented Neural Networks (Santoro et al. 2016)
3 × 5 = 15 (O), 3 × 5 = 14.95 (X)
13. Kiho Suh - PR12
Addressing
[Figure: the addressing pipeline. The previous weighting w_{t-1}, the memory M_t, and the head parameters k_t, β_t, g_t, s_t, γ_t pass through four stages, Content Addressing → Interpolation → Convolutional Shift → Sharpening, producing w_t^c, w_t^g, w̃_t, and finally w_t]
14. Kiho Suh - PR12
Selective Memory
[Figure: the addressing pipeline, as on slide 13]
• Key design question: how the network interacts with memory.
• Make sure it does not interact with the whole memory at once.
• We do not want to lose the nice property of independence between memory and computation.
Right image from Tristan Deleu’s slide
15. Kiho Suh - PR12
Content Addressing
w_t^c(i) ← softmax_i( β_t · K[k_t, M_t(i)] ),  where K[u, v] = (u · v) / (‖u‖ ‖v‖) is cosine similarity
[Figure: the addressing pipeline, Content Addressing stage highlighted]
Content addressing retrieves a complete pattern, or a specific stored version of it, from an approximate guess.
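Below is a minimal NumPy sketch of what this stage computes; the function name content_addressing and the shapes are my own assumptions, not from the paper or the slides.

```python
import numpy as np

def content_addressing(M, k, beta):
    """Content-based addressing: softmax over the cosine similarity
    between the key k and every memory row, scaled by the key strength beta.

    M    : (N, W) memory matrix, one row per location
    k    : (W,)   key emitted by the controller
    beta : scalar key strength; larger beta -> sharper focus
    """
    eps = 1e-8  # avoid division by zero for all-zero rows
    sim = (M @ k) / (np.linalg.norm(M, axis=1) * np.linalg.norm(k) + eps)
    z = beta * sim
    z = z - z.max()                      # numerical stability
    return np.exp(z) / np.exp(z).sum()   # softmax over locations
```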
17. Kiho Suh - PR12
Interpolation (Location Addressing)
w_t^g ← g_t · w_t^c + (1 - g_t) · w_{t-1},  with 0 ≤ g_t ≤ 1
[Figure: the addressing pipeline, Interpolation stage highlighted]
The gate g_t blends the new content-based weighting with the previous time step’s final weighting.
18. Kiho Suh - PR12
Interpolation (Location Addressing)
w_t^c (content weight)          = [0  0  1  0 0 0]
w_{t-1} (previous final weight) = [.9 .1 0  0 0 0]
g_t (interpolation gate), 0 ≤ g_t ≤ 1:
  when g_t = 1:   [0 0 1 0 0 0]          (the w_{t-1} term becomes 0: content only)
  when g_t = 0.5: [.45 .05 .50 0 0 0]
  when g_t = 0:   [.9 .1 0 0 0 0]        (the w_t^c term becomes 0: location/previous weighting only)
Modified from Mark Chang’s slide
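A minimal NumPy sketch of the interpolation gate, reproducing the numbers on this slide (the function name interpolate is my own):

```python
import numpy as np

def interpolate(w_c, w_prev, g):
    """Interpolation gate: blend the content weighting with the previous
    weighting. g=1 keeps only w_c (content only), g=0 keeps only w_prev."""
    return g * w_c + (1.0 - g) * w_prev

w_c    = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.0])   # content weight from the slide
w_prev = np.array([0.9, 0.1, 0.0, 0.0, 0.0, 0.0])   # previous final weight
print(interpolate(w_c, w_prev, 0.5))                # [0.45 0.05 0.5  0.   0.   0. ]
```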
19. Kiho Suh - PR12
Convolutional Shift (Location Addressing)
[Figure: the addressing pipeline, Convolutional Shift stage highlighted]
The controller says how wide an area should be affected by the action of the head.
20. Kiho Suh - PR12
Convolutional Shift (Location Addressing)
w(i) <- w(i-1)*s(1) + w(i)s(0) + w(i+1)s(-1)
s = [1 0 0] s = [0 0 1] s = [.5 0 .5]
-1 0 1 -1 0 1 -1 0 1
[.45 .05 .50 0 0 0]
[.05 .50 0 0 0 .45]
[.45 .05 .50 0 0 0]
[0 .45 .05 .50 0 0]
[.45 .05 .50 0 0 0]
[.25 .475 .025 .25 0 .225]
~
All the numbers shift to left. All the numbers shift to right.
All the numbers give half of
itself to left and right.
[wi-1
g wi
g wi+1
g]
[s-1 s0 s1]
wi
~
wtg (interpolated weight)
st (shift weight)
Modified from Mark Chang’s slide
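A minimal NumPy sketch of the circular convolutional shift, reproducing the shift examples above (the function name conv_shift and the offset convention are my own assumptions):

```python
import numpy as np

def conv_shift(w_g, s):
    """Circular convolutional shift: w_tilde(i) = sum_j w_g(i - j) * s(j),
    where s is indexed over the offsets [-1, 0, +1] (for a length-3 s)."""
    N = len(w_g)
    offsets = np.arange(len(s)) - len(s) // 2         # e.g. [-1, 0, 1]
    w_tilde = np.zeros(N)
    for i in range(N):
        for off, s_val in zip(offsets, s):
            w_tilde[i] += w_g[(i - off) % N] * s_val  # wrap around the ends
    return w_tilde

w_g = np.array([0.45, 0.05, 0.50, 0.0, 0.0, 0.0])
print(conv_shift(w_g, np.array([1.0, 0.0, 0.0])))     # shift left:  [0.05 0.5 0 0 0 0.45]
print(conv_shift(w_g, np.array([0.0, 0.0, 1.0])))     # shift right: [0 0.45 0.05 0.5 0 0]
```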
21. Kiho Suh - PR12
Sharpening (Location Addressing)
[Figure: the addressing pipeline, Sharpening stage highlighted]
The convolution in the previous step can blur the weighting, so it is sharpened:
w_t(i) ← w̃_t(i)^γ_t / Σ_j w̃_t(j)^γ_t,  with γ_t ≥ 1
We finally obtain the address (a weight value for each memory location) of the memory that the controller thinks we need.
22. Kiho Suh - PR12
Sharpening (Location Addressing)
w̃_t (shifted weight) = [0 .45 .05 .50 0 0],  γ_t ≥ 1:
  γ_t = 50: [0 0 0 1 0 0]              (γ is much bigger than 5 or 0, so the array is sharpened)
  γ_t = 5:  [0 .37 0 .62 0 0]
  γ_t = 0:  [.16 .16 .16 .16 .16 .16]  (a γ smaller than 1 would make w_t even more blurred, which is why γ_t ≥ 1 is required)
Modified from Mark Chang’s slide
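A minimal NumPy sketch of sharpening, reproducing the γ_t = 5 and γ_t = 50 cases above (the function name sharpen is my own):

```python
import numpy as np

def sharpen(w_tilde, gamma):
    """Sharpening: raise every weight to the power gamma (>= 1) and renormalize,
    undoing the blur introduced by the convolutional shift."""
    w = w_tilde ** gamma
    return w / w.sum()

w_tilde = np.array([0.0, 0.45, 0.05, 0.50, 0.0, 0.0])
print(sharpen(w_tilde, 5))    # approximately [0, 0.37, 0, 0.63, 0, 0]
print(sharpen(w_tilde, 50))   # approximately [0, 0,    0, 1,    0, 0]
```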
23. From Tristan Deleu’s blog
Addressing is “soft” and distributed across the entire memory. However, it is quantitatively focused on very few cells.
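Putting the four stages together, here is a minimal sketch of one full addressing step, reusing the hypothetical helper functions from the earlier sketches (content_addressing, interpolate, conv_shift, sharpen):

```python
def address(M, w_prev, k, beta, g, s, gamma):
    """One full addressing step, chaining the four stages from the slides:
    content addressing -> interpolation -> convolutional shift -> sharpening."""
    w_c = content_addressing(M, k, beta)   # focus by similarity to the key
    w_g = interpolate(w_c, w_prev, g)      # blend with the previous weighting
    w_s = conv_shift(w_g, s)               # rotate the focus
    return sharpen(w_s, gamma)             # re-sharpen the blurred weighting
```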
24. Kiho Suh - PR12
Writing
[Figure: writing is split into an Erase step and an Add step. The write head uses the weighting w_t to erase from M_{t-1} with the erase vector e_t, then add with the add vector a_t, producing M_t]
26. Erase Operation
Erase Operation:  M̃_t(i) ← M_{t-1}(i) ⊙ [1 - w_t(i)·e_t]
Memory (N locations × M components, showing the first three locations):
  location 0: [1, 1, 2]   location 1: [1, 2, 4]   location 2: [2, 3, 1]
Head location: w_t = [0.9, 0.1, 0, …, 0]   (length N)
Erase vector:  e_t = [0, 1, 1]             (length M)
Memory after erasing:
  location 0: [1, 0.1, 0.2]   location 1: [1, 1.8, 3.6]   location 2: [2, 3, 1]
From Mark Chang’s slide
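A minimal NumPy sketch of the erase operation, reproducing the numbers above (the function name erase is my own; memory rows are locations):

```python
import numpy as np

def erase(M, w, e):
    """Erase operation: M_tilde(i) = M(i) * (1 - w(i) * e), done for all
    locations at once with an outer product.  M: (N, W), w: (N,), e: (W,)."""
    return M * (1.0 - np.outer(w, e))

M = np.array([[1.0, 1.0, 2.0],    # location 0
              [1.0, 2.0, 4.0],    # location 1
              [2.0, 3.0, 1.0]])   # location 2
w = np.array([0.9, 0.1, 0.0])     # head location (weighting)
e = np.array([0.0, 1.0, 1.0])     # erase vector
print(erase(M, w, e))             # [[1, 0.1, 0.2], [1, 1.8, 3.6], [2, 3, 1]]
```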
27. Add Operation
Add Operation:  M_t(i) ← M̃_t(i) + w_t(i)·a_t
Memory (after the erase step):
  location 0: [1, 0.1, 0.2]   location 1: [1, 1.8, 3.6]   location 2: [2, 3, 1]
Head location: w_t = [0.9, 0.1, 0, …, 0]   (length N)
Add vector:    a_t = [1, 1, 0]             (length M)
Memory after adding:
  location 0: [1.9, 1.0, 0.2]   location 1: [1.1, 1.9, 3.6]   location 2: [2, 3, 1]
From Mark Chang’s slide
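A minimal NumPy sketch of the add operation, continuing the erase example above (the function name add is my own):

```python
import numpy as np

def add(M_erased, w, a):
    """Add operation: M_t(i) = M_tilde(i) + w(i) * a for every location."""
    return M_erased + np.outer(w, a)

M_erased = np.array([[1.0, 0.1, 0.2],    # memory after the erase step above
                     [1.0, 1.8, 3.6],
                     [2.0, 3.0, 1.0]])
w = np.array([0.9, 0.1, 0.0])            # head location (weighting)
a = np.array([1.0, 1.0, 0.0])            # add vector
print(add(M_erased, w, a))               # [[1.9, 1.0, 0.2], [1.1, 1.9, 3.6], [2, 3, 1]]
```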
28. Kiho Suh - PR12
Reading
[Figure: the read head uses the weighting w_t to read from M_t and returns the read vector r_t to the controller]
29. Read Operation
Read Operation:  r_t ← Σ_i w_t(i) · M_t(i)
Memory:
  location 0: [1, 1, 2]   location 1: [1, 2, 4]   location 2: [2, 3, 1]
Head location: w_t = [0.9, 0.1, 0, …, 0]
Read vector:   r_t = 0.9·[1, 1, 2] + 0.1·[1, 2, 4] = [1.0, 1.1, 2.2]
From Mark Chang’s slide
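A minimal NumPy sketch of the read operation, reproducing the numbers above (the function name read is my own):

```python
import numpy as np

def read(M, w):
    """Read operation: the read vector is the w-weighted sum of memory rows."""
    return w @ M

M = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 4.0],
              [2.0, 3.0, 1.0]])
w = np.array([0.9, 0.1, 0.0])
print(read(M, w))   # [1.0, 1.1, 2.2]
```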
30. Kiho Suh - PR12
NTM Architecture in more detail
[Architecture diagram from slide 10, stepped through element by element in the original slides: at each time step the controller consumes one input element and the previous read vector r_{t-1}, addresses the memory, writes and reads, and emits one output element]
for example: input - ([7,9,3,2], [2,3,7,9])   (unsorted, sorted)
40. Kiho Suh - PR12
NTM Architecture in more detail
[Architecture diagram, as on slide 30]
for example: input - ([7,9,3,2], [2,3,7,9])   (unsorted, sorted)
loss between the model’s output [2,9,3,7] and the target [2,3,7,9]
41. Kiho Suh - PR12
NTM Architecture in more detail
[Architecture diagram, as on slide 30]
for example: input - ([4,3,0,5], [0,3,4,5])   (unsorted, sorted)
51. Kiho Suh - PR12
NTM Architecture in more detail
[Architecture diagram, as on slide 30]
for example: input - ([4,3,0,5], [0,3,4,5])   (unsorted, sorted)
loss between the model’s output [0,3,4,5] and the target [0,3,4,5]
52. Kiho Suh - PR12
Experiments
• NTM with a feedforward controller
• NTM with an LSTM controller
• LSTM network
53. Kiho Suh - PR12
Copy
From top to bottom: External Inputs/Outputs,
Adds/Reads Vectors to Memory,
Write/Read Weightings
54. Kiho Suh - PR12
Copy
NTM copy-task generalization (trained on lengths ≤ 20, tested on length 120)
NTM does not only copy, it also generalizes! So the NTM learns a program.
LSTM copy-task generalization: shift errors appear.
56. Kiho Suh - PR12
Repeat Copy
From top to bottom: External Inputs/Outputs,
Adds/Reads Vectors to Memory,
Write/Read Weightings
NTM learns its first
for-loop, using
content to jump,
iteration to step,
and a variable to
count to N.
59. Kiho Suh - PR12
Associative Recall
From top to bottom: External Inputs/Outputs,
Adds/Reads Vectors to Memory,
Write/Read Weightings
The NTM correctly produces the red-boxed item after it sees the green-boxed query item, similar to a dictionary lookup.
(Figure labels: matching item, query item, next-to-matching item)
61. Kiho Suh - PR12
Associative Recall (Generalization)
(Figure: number of incorrect bits)
62. Kiho Suh - PR12
Dynamic N-Gram
The goal of the dynamic n-gram task was to test whether the NTM could rapidly adapt to new predictive distributions.
(Figure label: mismatching)
64. Kiho Suh - PR12
Priority Sort
The write head writes to locations according to a linear function of the priority, and the read head reads from locations in increasing order.
66. Kiho Suh - PR12
Innovations
1. Memory-augmented networks
2. Attention mechanism: a novel idea in 2014 - check out Neural Machine Translation by Jointly Learning to Align and Translate (Bahdanau, Cho, Bengio 2014)
3. Writing mechanism, unlike other memory-augmented networks such as Memory Networks (Weston et al. 2014) and End-to-End Memory Networks (Sukhbaatar et al. 2015).
67. Kiho Suh - PR12
NTM Architecture
[Architecture diagram, as on slide 10]
68. Kiho Suh - PR12
What to improve?
• Memory management problem -> dynamic allocation
• Retrieving memories in the order they were written -> temporal link matrix
• Graph algorithms for a wider range of tasks
• Reinforcement learning
Differentiable Neural Computer!!!
Hybrid Computing using a neural network with dynamic external memory (Graves et al. 2016)
69. Kiho Suh - PR12
Discussion
• Why are some results better with NTM + feedforward (e.g. associative recall) while others are better with NTM + LSTM (e.g. copy)?
• The paper applies “content addressing” first and then “interpolation.” However, wouldn’t it make more sense to do “interpolation” first and then “content addressing”?
• Differentiability might not be the best way to learn programs, because it is inherently fragile. Programs are discrete in nature, and every bit really counts, so gradient descent might not be desirable. Is NTM’s approach still the right way to go?
• Any questions?