Deep Learning seminar presentation, Max Planck Institute for Informatics.
Based on the papers "Memory Networks" (Weston et al., ICLR 2015), "End-To-End Memory Networks" (Sukhbaatar et al., 2015), and "Neural Turing Machines" (Graves et al., 2014).
Memory Networks, Neural Turing Machines, and Question Answering
1. Memory Networks, Neural Turing Machines, and Question Answering
Akram El-Korashy
Max Planck Institute for Informatics
November 30, 2015
Deep Learning Seminar. Papers by Weston et al. (ICLR 2015), Graves et al. (2014), and Sukhbaatar et al. (2015).
2. Outline
1 Introduction
  Intuition and resemblance to human cognition
  What does it look like?
2 QA Experiments, End-to-End
  Architecture - MemN2N
  Training
  Baselines and Results
3 QA Experiments, Strongly Supervised
  Architecture - MemNN
  Training
  Results
4 NTM code induction experiments
3. Why memory?
Human working memory is a capacity for short-term storage of information and its rule-based manipulation...
Therefore, an NTM¹ resembles a working memory system, as it is designed to solve tasks that require applying approximate rules to "rapidly-created variables".
¹ Neural Turing Machine. I will use the term interchangeably with Memory Networks, depending on which paper I am citing.
4. Why memory? Why not RNNs and LSTM?
The memory in these models is the state of the network, which is latent (i.e., hidden; no explicit access) and inherently unstable over long timescales. [Sukhbaatar2015]
Unlike a standard network, an NTM interacts with a memory matrix using selective read and write operations that can focus on (almost) a single memory location. [Graves2014]
5-6. Why memory networks? How about attention models with RNN encoders/decoders?
The memory model is indeed analogous to the attention mechanisms introduced for machine translation.
Main differences:
In a memory network model, the query can be made over multiple sentences, unlike in machine translation.
The memory model makes several hops on the memory before producing an output.
The network architecture of the memory scoring is a simple linear layer, as opposed to the sophisticated gated architecture of previous work.
7. Why memory? What's the main usage?
Memory as non-compact storage: explicitly update memory slots m_i at test time by making use of a "generalization" component that determines "what" is to be stored from input x, and "where" to store it (choosing among the memory slots).
Storing stories for Question Answering: given a story (i.e., a sequence of sentences), training the output component of the memory network can learn scoring functions (i.e., similarity) between query sentences and the memory slots holding earlier sentences.
8-17. Overview of a memory model
A memory model that is trained only end-to-end.
The trained model takes a set of inputs x_1, ..., x_n to be stored in the memory, a query q, and outputs an answer a.
Each of the x_i, q, a contains symbols coming from a dictionary with V words.
All x are written to memory up to a fixed buffer size; the model then finds a continuous representation for the x and q.
The continuous representation is processed via multiple hops to output a. This allows back-propagation of the error signal through multiple memory accesses back to the input during training.
A, B, C are embedding matrices (of size d × V) used to convert the inputs to the d-dimensional vectors m_i (and c_i, and the query state u).
A match is computed between u and each memory m_i by taking the inner product followed by a softmax: p_i = Softmax(u^T m_i).
The response vector o from the memory is the weighted sum o = Σ_i p_i c_i.
The final prediction (answer to the query) is computed with the help of a weight matrix W as â = Softmax(W(o + u)) (sketched in code below).
Figure: A single layer, and a three-layer memory model. [Sukhbaatar2015]
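To make the single-layer flow concrete, here is a minimal NumPy sketch of the forward pass just described. It assumes bag-of-words sentence vectors; the sizes, random parameters and data are illustrative stand-ins, not trained values from the paper.

```python
# Minimal single-layer MemN2N forward pass (illustrative sketch).
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
V, d, n = 177, 20, 10                      # V = 177 as on the slides; d, n are choices

A = rng.normal(scale=0.1, size=(d, V))     # input memory embedding
B = rng.normal(scale=0.1, size=(d, V))     # question embedding
C = rng.normal(scale=0.1, size=(d, V))     # output memory embedding
W = rng.normal(scale=0.1, size=(V, d))     # final prediction matrix

x = rng.integers(0, 2, size=(n, V)).astype(float)  # BoW vectors of the n story sentences
q = rng.integers(0, 2, size=V).astype(float)       # BoW vector of the question

m = x @ A.T                    # memory vectors m_i = A x_i
c = x @ C.T                    # output vectors  c_i = C x_i
u = B @ q                      # internal state  u = B q

p = softmax(m @ u)             # p_i = Softmax(u^T m_i)
o = p @ c                      # o = sum_i p_i c_i
a_hat = softmax(W @ (o + u))   # distribution over the V candidate answer words
```

Stacking such layers, with the weight tying described below, yields the multi-hop model in the right half of the figure.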
18. Plan
1 Introduction
2 QA Experiments, End-to-End
3 QA Experiments, Strongly Supervised
4 NTM code induction experiments
19-24. Synthetic QA tasks, supporting subset
There are a total of 20 different types of tasks that test different forms of reasoning and deduction.
Note that for each question, only some subset of the statements contains information needed for the answer; the others are essentially irrelevant distractors (e.g., the first sentence in the first example).
In the Memory Networks of Weston et al., this supporting subset was explicitly indicated to the model during training. In what is called end-to-end training of memory networks, this information is no longer provided.
A task is a set of example problems. A problem is a set of I sentences x_i where I ≤ 320, a question q and an answer a.
The vocabulary is of size V = 177! Two versions of the data are used: one with 1,000 training problems per task, and one with 10,000 per task.
Figure: A given QA task consists of a set of statements, followed by a question whose answer is typically a single word. [Sukhbaatar2015]
25. Architecture - MemN2N: Model Architecture
K = 3 hops were used.
Adjacent weight sharing was used to ease training and reduce the number of parameters.
Adjacent weight tying (a code sketch follows the list):
1 The output embedding of a layer is the input embedding of the layer above: A^(k+1) = C^k.
2 The answer prediction matrix is the final output embedding: W^T = C^K.
3 The question embedding is the input embedding of the first layer: B = A^1.
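A follow-up sketch of how adjacent tying wires a K = 3 hop model: with A^(k+1) = C^k, the K hops need only K + 1 distinct embedding matrices, and W falls out of the last one. As before, sizes and random data are illustrative.

```python
# K-hop MemN2N with adjacent weight tying (illustrative sketch).
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
V, d, K = 177, 20, 3
x = rng.integers(0, 2, size=(10, V)).astype(float)   # BoW story sentences
q = rng.integers(0, 2, size=V).astype(float)         # BoW question

# emb[k] plays the role of A^(k+1) for k < K, and emb[K] is C^K (= W^T).
emb = [rng.normal(scale=0.1, size=(d, V)) for _ in range(K + 1)]

u = emb[0] @ q                       # B = A^1, so the question reuses A^1
for k in range(K):
    m = x @ emb[k].T                 # memory vectors via A^k
    c = x @ emb[k + 1].T             # output vectors via C^k = A^(k+1)
    p = softmax(m @ u)               # attention over memories
    u = u + p @ c                    # u^(k+1) = u^k + o^k
a_hat = softmax(emb[K].T @ u)        # W = (C^K)^T closes the stack
```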
26-27. Architecture - MemN2N: Sentence Representation, Temporal Encoding
Two different sentence representations: bag-of-words (BoW) and Position Encoding (PE).
BoW embeds each word and sums the resulting vectors: m_i = Σ_j A x_ij.
PE encodes the position of each word using a column vector l_j, where l_kj = (1 − j/J) − (k/d)(1 − 2j/J) and J is the number of words in the sentence (sketched below).
Temporal Encoding: modify the memory vector with a special matrix that encodes temporal information:² m_i = Σ_j A x_ij + T_A(i), where T_A(i) is the i-th row of a special temporal matrix T_A.
All the T matrices are learned during training. They are subject to the same sharing constraints as those between A and C.
² There isn't enough detail on what constraints this matrix should be subject to, if any.
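A small sketch of the PE weights given by this formula. The final comment, where word embeddings are weighted element-wise by l_j before summing, is how [Sukhbaatar2015] applies PE; the slide leaves that step implicit, so treat it as an assumption here.

```python
# Position Encoding weights l_kj = (1 - j/J) - (k/d)(1 - 2j/J).
import numpy as np

def position_encoding(J, d):
    """Return a (J, d) matrix whose row j-1 holds the vector l_j."""
    j = np.arange(1, J + 1)[:, None]   # word positions 1..J
    k = np.arange(1, d + 1)[None, :]   # embedding dimensions 1..d
    return (1 - j / J) - (k / d) * (1 - 2 * j / J)

L = position_encoding(J=6, d=20)       # weights for a 6-word sentence
# With PE, a sentence embedding becomes m_i = sum_j l_j * (A x_ij)
# (element-wise product per word); temporal encoding would then add
# the learned row T_A(i) on top, as on the slide.
```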
28. Training: Loss function and learning parameters
The embedding matrices A, B and C, as well as W, are jointly learned.
The loss function is a standard cross-entropy between â and the true label a.
Stochastic gradient descent is used with a learning rate of η = 0.01, with annealing.
29. Training: Parameters and Techniques
RN: learn time invariance by injecting random noise to regularize T_A.
LS (linear start): remove all softmaxes except for the answer prediction layer; re-apply them when the validation loss stops decreasing. (LS learning rate of η = 0.005 instead of 0.01 for normal training.)
LW: layer-wise, RNN-like weight tying, as opposed to adjacent weight tying.
BoW or PE: sentence representation.
joint: training on all 20 tasks jointly vs. independently. [Sukhbaatar2015]
30-35. Baselines and Results
Take-home message: more memory hops give improved performance.
Take-home message: joint training on the various tasks sometimes helps.
Figure: All variants of the end-to-end trained memory model comfortably beat the weakly supervised baseline methods. [Sukhbaatar2015]
36. Baselines and Results: Set of Supporting Facts
Figure: Instances of successful prediction of the supporting sentences.
37. Plan
1 Introduction
2 QA Experiments, End-to-End
3 QA Experiments, Strongly Supervised
4 NTM code induction experiments
38. Architecture - MemNN: IGOR
The memory network consists of a memory m and four learned components (a skeleton sketch follows the list):
1 I (input feature map): converts the incoming input to the internal feature representation.
2 G (generalization): updates old memories given the new input.
3 O (output feature map): produces a new output, given the new input and the current memory state.
4 R (response): converts the output into the desired response format.
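A structural skeleton of these four components, as a hypothetical sketch: the memory is a plain list, and the I, G, O, R bodies are placeholder implementations (a pass-through featurizer, append-only storage, dot-product scoring with k = 1, additive response features), not the learned modules of the paper.

```python
# Skeleton of the MemNN I/G/O/R decomposition (placeholder bodies).
import numpy as np

class MemNN:
    def __init__(self, d):
        self.d = d
        self.memory = []                       # the memory m (a list of slots)

    def I(self, raw_input):                    # input feature map
        return np.asarray(raw_input, dtype=float)

    def G(self, features):                     # generalization: here, append a new slot
        self.memory.append(features)

    def O(self, features):                     # output: best supporting memory (k = 1)
        scores = [float(features @ m_i) for m_i in self.memory]
        return self.memory[int(np.argmax(scores))]

    def R(self, features, supporting):         # response from input + supporting memory
        return features + supporting

net = MemNN(d=4)
net.G(net.I([1, 0, 0, 0]))                     # store two "sentences"
net.G(net.I([0, 1, 0, 0]))
x = net.I([1, 0, 0, 1])                        # a "question"
print(net.R(x, net.O(x)))
```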
39. Architecture - MemNN: Model Flow
The core of inference lies in the O and R modules. The O module produces output features by finding k supporting memories given x (toy sketch below).
For k = 1, the highest scoring supporting memory is retrieved: o_1 = O_1(x, m) = argmax_{i=1,...,N} s_O(x, m_i).
For k = 2, a second supporting memory is additionally computed: o_2 = O_2(x, m) = argmax_{i=1,...,N} s_O([x, m_{o_1}], m_i).
In the single-word response setting, where W is the set of all words in the dictionary, r = argmax_{w ∈ W} s_R([x, m_{o_1}, m_{o_2}], w).
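A toy sketch of this O/R flow with k = 2. It assumes dot-product stand-ins for s_O and s_R, and approximates the feature combination [x, m_{o_1}] by vector addition; all names and data are illustrative.

```python
# MemNN inference with k = 2 supporting memories (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
d, N = 16, 8
vocab = ["kitchen", "garden", "milk", "football"]

memories = rng.normal(size=(N, d))          # m_1..m_N, already featurized
word_vecs = rng.normal(size=(len(vocab), d))
x = rng.normal(size=d)                      # featurized question I(x)

def s(a, b):                                # stand-in scoring function s_O / s_R
    return float(a @ b)

# O module: first argmax over memories, second argmax conditioned on the first
# ([x, m_o1] approximated here as x + m_o1).
o1 = max(range(N), key=lambda i: s(x, memories[i]))
o2 = max(range(N), key=lambda i: s(x + memories[o1], memories[i]))

# R module: single-word response, scored against every dictionary word.
r = max(range(len(vocab)), key=lambda w: s(x + memories[o1] + memories[o2], word_vecs[w]))
print(vocab[r])
```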
40. Training: Max-margin, SGD
Supporting-sentence annotations are available as part of the training data. Thus, the scoring functions are trained by minimizing a margin ranking loss over the model parameters U_O and U_R using SGD.
Figure: For a given question x with true response r and supporting sentences m_{o_1}, m_{o_2} (i.e., k = 2), this expression is minimized over parameters U_O and U_R, where f̄, f̄′ and r̄ range over all choices other than the correct labels, and γ is the margin.
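The expression itself is only an image in the original slides; reconstructed from the Memory Networks paper (Weston et al., 2015), the margin ranking loss being minimized is:

```latex
\sum_{\bar{f} \neq m_{o_1}} \max\!\left(0,\; \gamma - s_O(x, m_{o_1}) + s_O(x, \bar{f})\right)
+ \sum_{\bar{f}' \neq m_{o_2}} \max\!\left(0,\; \gamma - s_O([x, m_{o_1}], m_{o_2}) + s_O([x, m_{o_1}], \bar{f}')\right)
+ \sum_{\bar{r} \neq r} \max\!\left(0,\; \gamma - s_R([x, m_{o_1}, m_{o_2}], r) + s_R([x, m_{o_1}, m_{o_2}], \bar{r})\right)
```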
41-43. Results: large-scale QA
Figure: Results on a QA dataset with 14M statements.
Hashing techniques for efficient memory scoring (a toy sketch of the word hash follows the list):
Idea: hash the inputs I(x) into buckets, and score only the memories m_i lying in the same buckets.
Word hash: one bucket per dictionary word, containing all sentences that contain this word.
Cluster hash: run K-means to cluster the word vectors (U_O)_i, giving K buckets; hash a sentence to all buckets to which its words belong.
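A toy sketch of the word-hash idea on raw strings; in the real system the buckets would index featurized memories, and only the candidate set would be scored with s_O.

```python
# Word hashing: bucket sentences by word, so a query only scores
# memories that share at least one word with it (illustrative sketch).
from collections import defaultdict

sentences = ["joe went to the kitchen",
             "fred picked up the milk",
             "joe travelled to the office"]

buckets = defaultdict(set)                  # word -> indices of sentences containing it
for i, s in enumerate(sentences):
    for w in s.split():
        buckets[w].add(i)

query = "where is the milk"
candidates = set().union(*(buckets[w] for w in query.split() if w in buckets))
# Only `candidates` would be scored with s_O, instead of all N memories.
print(sorted(candidates))
```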
44. Results: simulation QA
Figure: The task is a simple simulation of 4 characters, 3 objects and 5 rooms, with characters moving around, picking up and dropping objects. (Similar to the 10k dataset of MemN2N.)
45. Results: simulation QA - sample test results
Figure: Sample test-set predictions (in red) for the simulation, in the setting where inputs are word-based, answers are sentences, and an LSTM is used as the R component of the MemNN.
46. Plan
1 Introduction
2 QA Experiments, End-to-End
3 QA Experiments, Strongly Supervised
4 NTM code induction experiments
47. Architecture
A more sophisticated memory "controller".
Figure: Content-addressing is implemented by learning similarity measures, analogously to MemNN. Additionally, the controller simulates location-based addressing by implementing a rotational shift of a weighting.
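A small sketch of that rotational shift, implemented as the circular convolution of the weighting with a shift kernel as in Graves et al. (2014); the particular distributions are illustrative.

```python
# Location-based addressing: rotate a weighting by convolving it with
# a shift distribution over offsets {-1, 0, +1} (illustrative sketch).
import numpy as np

w = np.array([0.1, 0.7, 0.1, 0.05, 0.05])   # current weighting over 5 memory slots
s = np.array([0.0, 0.9, 0.1])                # shift kernel: P(shift = -1, 0, +1)

shifted = np.zeros_like(w)
for i in range(len(w)):
    for k, off in enumerate((-1, 0, 1)):
        shifted[i] += w[(i - off) % len(w)] * s[k]   # circular convolution

print(shifted, shifted.sum())                # still a distribution (sums to 1)
```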
48-49. NTM learns a Copy task ... on which LSTM fails
Figure: The networks were trained to copy sequences of eight-bit random vectors, where the sequence lengths were randomized between 1 and 20. An NTM with an LSTM controller was used; a plain LSTM fails on the same task.
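A sketch of how such copy-task training pairs can be generated, following the caption's description (eight-bit random vectors, lengths 1 to 20) and assuming the paper's extra delimiter channel; the exact input/target layout here is an illustrative choice.

```python
# Copy-task data generation (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)

def copy_task_example(max_len=20, bits=8):
    T = int(rng.integers(1, max_len + 1))
    seq = rng.integers(0, 2, size=(T, bits)).astype(float)
    # Inputs carry an extra channel whose single 1 marks end of presentation.
    inputs = np.zeros((2 * T + 1, bits + 1))
    inputs[:T, :bits] = seq
    inputs[T, bits] = 1.0                    # delimiter flag
    # Targets: reproduce the sequence after the delimiter, from blank inputs.
    targets = np.zeros((2 * T + 1, bits))
    targets[T + 1:, :] = seq
    return inputs, targets

x, y = copy_task_example()
```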
50-51. Summary
Intuition of memory networks vs. standard neural network models.
MemNN is successful through strongly supervised learning on QA tasks.
MemN2N uses more realistic end-to-end training, and remains competitive on the same tasks.
NTMs can learn simple memory copy and recall tasks from input-memory, output-memory training data.
Thank you!
52. References
Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, Rob Fergus. End-To-End Memory Networks. 2015.
Jason Weston, Sumit Chopra, Antoine Bordes. Memory Networks. ICLR 2015.
Alex Graves, Greg Wayne, Ivo Danihelka. Neural Turing Machines. 2014.
Nando de Freitas. Deep Learning at Oxford 2015.