Episodic Memory Reader:
Learning What to Remember
for Question Answering from Streaming Data
Moonsu Han¹*, Minki Kang¹*, Hyunwoo Jung¹,³ and Sung Ju Hwang¹,²
¹KAIST, Daejeon, South Korea
²AITRICS, Seoul, South Korea
³NAVER Clova, Seongnam, South Korea
Motivation
When interacting with users, an agent should remember information about the user.
The agent does not know in advance when, and what, information and questions will be given.
[Figure: example dialogue between a user and an agent (Clova WAVE): "It was a hard day. I do not like noisy places." / "How about a rest at home?" / "I sometimes read books when I have a break." / … / "Today is a holiday." / "How about taking a break and reading a book at home?"]
Memory Scalability Problem
The best way to preserve the information is to store all of it in external memory.
However, the agent cannot retain and learn everything due to limits on memory capacity.
[Figure: the agent (Clova WAVE) writes incoming user information into an external memory with entries such as POS: Library, NEG: Noisy, POS: Baseball, NEUT: Senior, POS: Potato, NEUT: John, and reads from it to answer questions such as "What is my name?"]
Learning What to Remember from Streaming Data
We cast this problem as a novel question answering (QA) task in which the machine does not know when the questions will be given from the streaming data.
Motivated by this, we define a new problem that arises in the real world and that must be addressed to build a conversational agent.
[Figure: a QA model reads data instances and supporting facts from the stream, writes them into the external memory, and later uses the stored supporting facts to produce Answer T for Query T.]
Learning What to Remember from Streaming Data
The model can answer unknown queries by storing incoming data until the external memory becomes full.
[Figure: while the memory has free slots, every incoming data instance and supporting fact is written; the QA model then reads the memory to produce Answer 1 for Query 1.]
Learning What to Remember from Streaming Data
When the memory is full, the model must decide which memory entry, or which incoming data instance, to discard.
In this situation the decision is easy, since almost all of the memory entries are useless.
[Figure: the memory holds mostly non-supporting data instances (Data 1–4) and one supporting fact, so the incoming supporting fact can simply replace one of the useless entries.]
Learning What to Remember from Streaming Data
When the memory is already full of supporting facts, however, it is difficult to decide which entry to delete.
[Figure: the external memory is filled with supporting facts (1, 2, 7, 8, 9, T, …); which memory entry is no longer needed?]
Therefore, we need a model that learns both the general importance of a data instance and which data is important at what time.
Problem Definition
Given a data stream 𝑋 = {𝑥(1), …, 𝑥(𝑇)}, a model should learn a function 𝐹: 𝑋 → 𝑀 that maps it to a set of memory entries 𝑀 = {𝑚(1), …, 𝑚(𝑁)}, where 𝑇 ≫ 𝑁.
How can it learn such a function that maximizes performance on an unseen future task, without knowing at storage time which problems it will be given?
[Figure: the function 𝐹: 𝑋 → 𝑀 maps the streaming data 𝑥(1), 𝑥(2), …, 𝑥(𝑇) into memory entries, e.g. m(1): 𝑥(1), m(2): 𝑥(8), m(3): 𝑥(𝑇−9), …, m(N): 𝑥(𝑇−3); the retained entries determine the result and performance on the future query 𝑄.]
Difference from Existing Methods
Our question answering task requires a model that can handle streaming data sequentially without knowing the query in advance.
This problem is difficult to solve with existing QA methods because they lack scalability.
[Figure: a conventional QA model must process the entire stream 𝑥(1), …, 𝑥(𝑇) together with the query 𝑄 to produce the answer.]
Episodic Memory Reader (EMR)
To solve this novel QA task, we propose the Episodic Memory Reader (EMR), which sequentially reads the streaming data and stores its information in the external memory.
When the external memory is full, it replaces the memories that are less important for answering unseen questions.
[Figure: at each time step 𝑡, EMR reads the current input 𝑥(𝑡) and the memory entries, writes an updated memory, and at time 𝑇+1 the QA model answers the query from the retained entries.]
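Below is a minimal sketch of this read/write loop, assuming PyTorch-style components; the names (stream_into_memory, data_encoder, memory_encoder) are illustrative placeholders rather than the released implementation.

```python
import torch

def stream_into_memory(stream, data_encoder, memory_encoder, capacity):
    """Sequentially read a data stream, keeping at most `capacity` memory entries."""
    memory = []                                  # list of encoded entries, each of shape (hidden_dim,)
    for x in stream:                             # x: one data instance (sentence chunk, frame, ...)
        e = data_encoder(x)                      # encode the new instance
        if len(memory) < capacity:
            memory.append(e)                     # memory not full yet: just write
            continue
        # Memory is full: score the entries and drop the one chosen by the policy.
        logits = memory_encoder(torch.stack(memory), e)          # (capacity,) deletion logits
        idx = torch.distributions.Categorical(logits=logits).sample().item()
        del memory[idx]                          # delete the chosen entry; the rest shift up
        memory.append(e)                         # write the new instance into the last slot
    return memory
```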
Learning What to Remember Using RL
We use reinforcement learning to learn which information is important.
The idea is that if the agent takes a good action, the QA model returns a positive reward, and that action is reinforced.
[Figure: RL formulation. State: the streaming data, the current input, and the external memory. Action: a memory replacement by the Episodic Memory Reader (agent). Reward: the evaluation of the pre-trained QA model (environment).]
The Components of Proposed Model
EMR consists of a data encoder, a memory encoder, and a value network, and outputs the relative importance of the memory entries.
It learns how to retain important information in order to maximize its QA accuracy at a future time point.
[Figure: the data encoder and memory encoder write to and read from the external memory; a policy network (actor, a multi-layer perceptron producing the policy π(m|s)) decides replacements, a value network (critic, a GRU cell and MLP producing the value V) evaluates them, and the QA model provides the reward (e.g., F1 score or accuracy) for the query.]
Data Encoder
The data encoder encodes the input data into a memory vector representation.
The encoder architecture varies with the type of the input.
[Figure: for text input, an embedding layer followed by a biGRU encodes the words; the final representation e(t) is written as the newest memory entry (here m6(t)).]
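A rough sketch of such a text data encoder, assuming an embedding layer followed by a biGRU whose final states become the new memory entry; dimensions and module names are illustrative.

```python
import torch
import torch.nn as nn

class TextDataEncoder(nn.Module):
    """Embed a sequence of word ids and summarize it with a bidirectional GRU."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bigru = nn.GRU(embed_dim, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, word_ids):                     # word_ids: (batch, seq_len)
        emb = self.embed(word_ids)                   # (batch, seq_len, embed_dim)
        _, h_n = self.bigru(emb)                     # h_n: (2, batch, hidden_dim)
        e_t = torch.cat([h_n[0], h_n[1]], dim=-1)    # concatenate both directions
        return e_t                                   # e(t): written as the newest memory entry
```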
Data Encoder
The data encoder encodes the input data into a memory vector representation.
The encoder architecture varies with the type of the input.
[Figure: for image input, a CNN encodes the frame; the resulting e(t) is written as the newest memory entry (here m6(t)).]
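For image input, a CNN plays the same role. The sketch below uses a torchvision ResNet-18 as a stand-in feature extractor; the actual backbone is not specified on this slide, so treat it as an assumption.

```python
import torch.nn as nn
from torchvision import models

class ImageDataEncoder(nn.Module):
    """Encode a video frame into a fixed-size memory vector with a CNN."""
    def __init__(self, out_dim=256):
        super().__init__()
        backbone = models.resnet18(weights=None)                         # any CNN feature extractor works
        self.features = nn.Sequential(*list(backbone.children())[:-1])   # drop the classifier head
        self.proj = nn.Linear(512, out_dim)                              # project to the memory dimension

    def forward(self, frame):                                            # frame: (batch, 3, H, W)
        f = self.features(frame).flatten(1)                              # (batch, 512) pooled feature
        return self.proj(f)                                              # e(t): the new memory entry
```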
Memory Encoder
The memory encoder computes the replacement probability by considering the importance of the memory entries.
We devise three different types of memory encoder:
1. EMR-Independent
2. EMR-biGRU
3. EMR-Transformer
[Figure: the memory encoder reads the memory entries m1(t)…m6(t) and feeds the policy network, which outputs the replacement policy π(i | M(t), e(t); θ) over the entries.]
Memory Encoder (EMR-Independent)
Similar to [Gülçehre18], EMR-Independent captures the importance of each memory entry relative to the new data instance, treating the entries independently of each other.
[Gülçehre18] Ç. Gülçehre, S. Chandar, K. Cho, Y. Bengio, Dynamic Neural Turing Machine with Continuous and Discrete Addressing Schemes. NC 2018
[Figure: EMR-Independent computes per-entry quantities α(t), γ(t), v(t−1), and g(t), and a multi-layer perceptron policy network outputs π(i | M(t), e(t); θ).]
Its major drawback is that each memory entry is evaluated only against the current input, not against the other entries.
Memory Encoder (EMR-biGRU)
EMR-biGRU computes the importance of each memory entry by considering the relative relationships between memory entries, using a bidirectional GRU.
[Figure: the memory entries m1(t)…m6(t) are passed through a biGRU to produce hidden states h1(t)…h6(t), and a multi-layer perceptron policy network outputs π(i | M(t), e(t); θ).]
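A sketch of the biGRU memory encoder: the memory entries (with the newly encoded instance appended as the last entry) are run through a bidirectional GRU, and a small MLP turns each hidden state into a deletion logit. The interface is our assumption, not the released code.

```python
import torch
import torch.nn as nn

class BiGRUMemoryEncoder(nn.Module):
    """Score each memory entry in the context of its neighbouring entries."""
    def __init__(self, mem_dim, hidden_dim=128):
        super().__init__()
        self.bigru = nn.GRU(mem_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.policy_mlp = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1))

    def forward(self, memory, e_t):
        # memory: (batch, N, mem_dim); e_t: (batch, mem_dim) is the newly encoded instance.
        entries = torch.cat([memory, e_t.unsqueeze(1)], dim=1)   # (batch, N+1, mem_dim)
        h, _ = self.bigru(entries)                               # (batch, N+1, 2*hidden_dim)
        logits = self.policy_mlp(h).squeeze(-1)                  # (batch, N+1) deletion logits
        return logits, h      # logits feed the policy pi(i | M(t), e(t)); h also feeds the critic
```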
Memory Encoder (EMR-biGRU)
EMR-biGRU computes the importance of each memory entry by considering the relative relationships between memory entries, using a bidirectional GRU.
Unlike EMR-Independent, the importance of each memory entry is therefore learned in relation to its neighbors rather than independently.
Memory Encoder (EMR-Transformer)
EMR-Transformer computes the relative importance of each memory entry using the self-attention mechanism of [Vaswani17].
[Vaswani17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is All you Need. NIPS 2017
[Figure: the memory entries m1(t)…m6(t) attend to each other via self-attention to produce h1(t)…h6(t), and a multi-layer perceptron policy network outputs π(i | M(t), e(t); θ).]
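A corresponding sketch of the Transformer-style memory encoder, using multi-head self-attention over the memory entries (mem_dim must be divisible by the number of heads); again, the names and the exact wiring are assumptions.

```python
import torch
import torch.nn as nn

class SelfAttentionMemoryEncoder(nn.Module):
    """Score memory entries via self-attention, so every entry attends to every other entry."""
    def __init__(self, mem_dim, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(mem_dim, n_heads, batch_first=True)
        self.policy_mlp = nn.Sequential(
            nn.Linear(mem_dim, mem_dim), nn.ReLU(), nn.Linear(mem_dim, 1))

    def forward(self, memory, e_t):
        # memory: (batch, N, mem_dim); e_t: (batch, mem_dim) is appended as the newest entry.
        entries = torch.cat([memory, e_t.unsqueeze(1)], dim=1)   # (batch, N+1, mem_dim)
        h, _ = self.attn(entries, entries, entries)              # self-attention over all entries
        logits = self.policy_mlp(h).squeeze(-1)                  # (batch, N+1) deletion logits
        return logits, h
```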
Value Network
We use one of two reinforcement learning algorithms: A3C [Mnih16] or REINFORCE [Williams92].
For the critic, we adopt Deep Sets [Zaheer17] to build a set representation h(t) of the memory.
[Zaheer17] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Póczos, R. R. Salakhutdinov, A. J. Smola, Deep Sets. NIPS 2017
[Williams92] R. J. Williams, Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. ML 1992
[Mnih16] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, K. Kavukcuoglu, Asynchronous Methods for Deep Reinforcement Learning. ICML 2016
[Figure: the memory encodings h1(t)…h6(t) are pooled into h(t), passed through a GRU cell together with h(t−1), and a multi-layer perceptron outputs the value V(t).]
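A sketch of the critic: the per-entry encodings are sum-pooled in the spirit of Deep Sets into a permutation-invariant summary, a GRU cell carries the history from h(t−1) to h(t), and an MLP outputs the value V(t). Module names and sizes are illustrative.

```python
import torch.nn as nn

class ValueNetwork(nn.Module):
    """Deep-Sets-style pooling over memory encodings + a GRU cell over time -> scalar value."""
    def __init__(self, enc_dim, hidden_dim=128):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(enc_dim, hidden_dim), nn.ReLU())   # per-entry map
        self.gru_cell = nn.GRUCell(hidden_dim, hidden_dim)                    # temporal state
        self.value_mlp = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                                       nn.Linear(hidden_dim, 1))

    def forward(self, entry_encodings, h_prev):
        # entry_encodings: (batch, N, enc_dim) from the memory encoder; h_prev: (batch, hidden_dim).
        pooled = self.phi(entry_encodings).sum(dim=1)    # sum-pool: entry order does not matter
        h_t = self.gru_cell(pooled, h_prev)              # carry information across time steps
        value = self.value_mlp(h_t).squeeze(-1)          # V(t)
        return value, h_t
```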
Example of EMR in Action
To aid understanding, we summarize how our model works on a TriviaQA example.
We assume EMR-Transformer with an external memory that can hold 100 words.
Q: Which US state lends its name to a baked pudding,
made with ice cream, sponge and meringue?
Meringue is type of dessert often associated with Italian
Swiss and French cuisine made from whipped egg whites
and sugar
Text is streamed from the Wikipedia document at a constant rate of 20 words per time step.
Example of EMR in Action
The streamed text is encoded and stored in the external memory.
The model writes the data representations to the memory until it becomes full.
[Figure: the chunk "Meringue is type of dessert often associated with Italian Swiss and French cuisine made from whipped egg whites and sugar" passes through the embedding layer and biGRU of the data encoder and is written into the external memory as m1(1).]
Example of EMR in Action
If the memory is full, the model needs to decide which memory entry to delete.
To this end, the Episodic Memory Reader (EMR) 'reads' the memory entries and the current input data using the memory encoder.
[Figure: the data encoder encodes the current chunk while the memory encoder reads the full memory m1(6)…m6(6).]
Example of EMR in Action
The memory encoder then outputs the policy, i.e., the action probability for deleting each entry; the least important entry receives the highest probability.
[Figure: the policy π(m|s) over the entries m1(6)…m6(6); here m3(6) receives the highest deletion probability.]
Example of EMR in Action
The entry with the highest probability is deleted and the remaining entries are shifted up.
The new data instance is then "written" into the last slot of the external memory.
[Figure: after m3(6) is deleted, the remaining entries shift up to form the memory at step 7.]
Example of EMR in Action
The entry with the highest probability is deleted and the remaining entries are shifted up.
The new data instance is then "written" into the last slot of the external memory.
[Figure: the new chunk is written into the last slot of the updated memory at step 7.]
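The delete-and-shift step above, as a small sketch. Sampling the entry during training (rather than taking the argmax) follows the later training slides; the helper itself is illustrative.

```python
import torch

def replace_entry(memory, new_entry, logits, training=True):
    """Delete one entry according to the policy, shift the rest up, and append the new data."""
    dist = torch.distributions.Categorical(logits=logits)
    idx = dist.sample() if training else logits.argmax()          # stochastic during training
    log_prob = dist.log_prob(idx)                                 # kept for the RL loss later
    kept = [m for i, m in enumerate(memory) if i != idx.item()]   # remaining entries shift up
    kept.append(new_entry)                                        # new instance goes into the last slot
    return kept, log_prob
```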
Example of EMR in Action
When it encounters the question, the QA model answers it using the sentences retained in the external memory.
[Figure: the sentences retained in the external memory (m1(T)…m5(T)) are fed to the QA model (BERT).]
Sentences in external memory: "Meringue is type of dessert often associated with French swiss and Italian cuisine made from whipped egg whites and sugar or aquafaba and sugar and into egg whites used for decoration on pie or spread on sheet or baked Alaska base and baked swiss meringue is hydrates from refined sugar …"
Q: Which US state lends its name to a baked pudding, made with ice cream, sponge and meringue?
A: Alaska
Training the Episodic Memory Reader
The model is trained after each data stream ends.
During training, the entry to delete is selected stochastically according to the policy.
[Figure: EMR reads the streaming chunks x1, …, xT, writes to the external memory according to the policy π(m|s), and keeps a history; the pre-trained QA model (BERT) receives the question.]
Training the Episodic Memory Reader
After each memory update, the task performance is evaluated.
To evaluate the policy, we provide the future query at every time step, but only at training time.
On TriviaQA the measure is the F1 score, which serves as the reward for reinforcement learning.
[Figure: at every step, the pre-trained QA model (BERT) answers the future query from the current memory, and the resulting F1 score is used as the reward.]
Training the Episodic Memory Reader
The reward and action probabilities are then stored in the history for later training steps.
This process is repeated until the data stream ends.
[Figure: the F1 score (reward) and the policy π(m|s) are stored in the history at each step.]
Training the Episodic Memory Reader
At the end of the stream, the QA loss is computed from the QA model and the RL loss is computed by the reinforcement learning algorithm from the stored history.
The QA model and the Episodic Memory Reader are then trained jointly.
[Figure: at the end of the stream, the QA loss from the pre-trained QA model (BERT) and the RL loss from the stored history are used to train the QA model and the Episodic Memory Reader jointly.]
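A condensed sketch of this training procedure: stream the document, act stochastically, query the pre-trained QA model at every step for a reward, accumulate a history, and compute both losses at the end. The helpers emr.step, qa_model.f1, qa_model.loss, and compute_rl_loss (sketched under the next slide) are assumed interfaces, not APIs of the released code.

```python
def train_on_stream(stream, query, answer, emr, qa_model, optimizer):
    """One training episode over a single data stream (hedged sketch with assumed interfaces)."""
    history, memory, h_prev = [], [], None
    for x in stream:
        # Act stochastically: encode x, sample an entry to replace, update the memory.
        memory, log_prob, value, entropy, h_prev = emr.step(memory, x, h_prev)
        reward = qa_model.f1(memory, query, answer)   # future query is given only during training
        history.append((log_prob, value, reward, entropy))
    qa_loss = qa_model.loss(memory, query, answer)    # QA loss at the end of the stream
    rl_loss = compute_rl_loss(history)                # actor-critic loss from the stored history
    optimizer.zero_grad()
    (qa_loss + rl_loss).backward()                    # QA model and EMR are trained jointly
    optimizer.step()
```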
Reinforcement Learning Loss
The RL loss is a standard actor-critic loss.
We also add an entropy term to encourage exploration of different replacement actions.
For the i-th entry of the stored history, with action probability 𝑝𝑖, value 𝑉𝑖, reward 𝑟𝑖, and entropy term 𝑒𝑖 = Σ𝑗 𝑝𝑖𝑗 log 𝑝𝑖𝑗:

𝑅𝑖 = 0.99 · 𝑅𝑖−1 + 𝑟𝑖
𝐿value = Σ𝑖 ½ (𝑅𝑖 − 𝑉𝑖)²
𝐿policy = Σ𝑖 −(𝑟𝑖 + 0.99 · 𝑉𝑖+1 − 𝑉𝑖) · log 𝑝𝑖 − 0.01 · 𝑒𝑖
RL Loss = 𝐿value + 𝐿policy
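A hedged translation of these formulas into code, with γ = 0.99 and entropy weight 0.01 as on the slide; detaching the advantage and using a zero value after the last step are our own conventional choices, not stated on the slide.

```python
import torch

def compute_rl_loss(history, gamma=0.99, entropy_coef=0.01):
    """history[i] = (log p_i, V_i, r_i, e_i) with e_i = sum_j p_ij * log p_ij, as on the slide."""
    log_probs, values, rewards, entropies = zip(*history)
    # Accumulated return as written on the slide: R_i = gamma * R_{i-1} + r_i
    returns, R = [], 0.0
    for r in rewards:
        R = gamma * R + r
        returns.append(R)
    value_loss, policy_loss = 0.0, 0.0
    for i in range(len(history)):
        value_loss = value_loss + 0.5 * (returns[i] - values[i]) ** 2
        next_value = values[i + 1] if i + 1 < len(values) else torch.zeros_like(values[i])
        advantage = rewards[i] + gamma * next_value - values[i]        # one-step TD advantage
        policy_loss = policy_loss - advantage.detach() * log_probs[i] - entropy_coef * entropies[i]
    return value_loss + policy_loss
```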
Experiment - Baselines
We evaluate our EMR-biGRU and EMR-Transformer against several baselines:
• FIFO (First-In First-Out) • Uniform • LIFO (Last-In First-Out) • EMR-Independent
EMR-Independent makes its replacement decision independently for each memory entry, following the Dynamic Least Recently Used addressing of [Gülçehre18].
[Figure: the deletion policy π(m|s) over the memory entries m1(6)…m5(6) under each scheduling policy.]
[Gülçehre18] Ç. Gülçehre, S. Chandar, K. Cho, Y. Bengio, Dynamic Neural Turing Machine with Continuous and Discrete Addressing Schemes. NC 2018
Dataset (bAbI Dataset)
We evaluate our models and baselines on three question answering datasets.
[Weston15] S. Sukhbaatar, A. Szlam, J. Weston, R. Fergus: End-To-End Memory Networks. NIPS 2015
• bAbI [Weston15]: a synthetic dataset for episodic question answering, consisting of 20 tasks with a small vocabulary.
Original Task 2
Index  Context
1      Mary journeyed to the bathroom
2      Sandra went to the garden
6      Sandra put down the milk there
…
Where is the milk?  Garden [2, 6]
8      Daniel went to the garden
17     Daniel dropped the football
…
Where is the football?  Bedroom [12, 17]

Noisy Task 2
Index  Context
1      Sandra moved to the kitchen
2      Wolves are afraid of cats
6      Mary is green
…
Where is the milk?  Garden [1, 4]
38     Mice are afraid of wolves
42     Mary journeyed to the kitchen
…
Where is the apple?  Kitchen [34, 42]
→ It can be solved by remembering a person or an object.
Dataset (TriviaQA Dataset)
We evaluate our models and baselines on three question answering datasets.
[Joshi17] M. Joshi, E. Choi, D. S. Weld, L. Zettlemoyer, TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. ACL 2017
[Context]
001 World War I (WWI or WW1), also known as the First World War, or the Great War, was a global war originating in …
002 More than 70 million military personnel, including 60 million Europeans, were mobilised in one of the largest wars in …
550 Some war memorials date the end of the war as being when the Versailles Treaty was signed in 1919, …
770 Britain, rationing was finally imposed in early 1918, limited to meat, sugar, and fats (butter and margarine), but not bread.
……
[Question]
Where was the peace treaty signed that brought World War I to an end?
[Answer]
Versailles castle
• TriviaQA [Joshi17]: a realistic text-based question answering dataset,
including 95K question-answer pairs from 662K documents.
→ It requires high-level reasoning and the ability to read a large number of sentences per document.
Dataset (TVQA Dataset)
We evaluate our models and baselines on three question answering datasets.
[Lei18] J. Lei, L. Yu, M. Bansal, T. L. Berg, TVQA: Localized, Compositional Video Question Answering. EMNLP 2018
[Video Clip]
[Question] What is Kutner writing on when talking to Lawrence?
[Answer 1] Kutner is writing on a clipboard.
[Answer 2] Kutner is writing on a laptop.
[Answer 3] Kutner is writing on a notepad.
[Answer 4] Kutner is writing on an index card.
[Answer 5] Kutner is writing on his hand.
… …
• TVQA [Lei18]: a localized, compositional video question answering dataset
containing 153K question-answer pairs from 22K clips in 6 TV series.
→ It requires a model that is able to understand multi-modal information.
Experiment on bAbI Dataset
We combine our model with MemN2N [Weston15] on the bAbI dataset.
[Weston15] S. Sukhbaatar, A. Szlam, J. Weston, R. Fergus: End-To-End Memory Networks. NIPS 2015
In this experiment, we vary the memory size to evaluate the efficiency of our model.
[Figure: sentences from the stream are written into the external memory by EMR, and the QA model (MemN2N) reads the memory to answer each query.]
Result (Accuracy)
Both of our models (EMR-biGRU and EMR-Transformer) outperform the baselines.
Our methods retain the supporting facts even with a small number of memory entries.
[Figure: accuracy on the Original and Noisy bAbI tasks for varying memory sizes.]
Result (Solvable)
We also report how many supporting facts each model retains in the external memory.
Both EMR variants significantly outperform EMR-Independent as well as the rule-based memory scheduling policies.
[Figure: fraction of questions whose supporting facts are retained, on the Original and Noisy bAbI tasks.]
Experiment on TriviaQA Dataset
We combine our model with BERT [Devlin18] on the TriviaQA dataset.
[Devlin18] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL-HLT 2019
Since TriviaQA does not provide answer-span indices, we extract them ourselves for span prediction.
Each memory entry holds 20 words, and the memory holds at most 400 words.
[Figure: the Wikipedia document is streamed as 20-word chunks; EMR writes selected chunks into the external memory, and the QA model (BERT) reads the memory and the query to produce the answer.]
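A small sketch of the 20-word chunking used to stream a Wikipedia document; the chunk size and the 400-word memory limit come from this slide, while the whitespace tokenization is a simplification.

```python
def document_to_chunks(document, words_per_chunk=20):
    """Split a document into fixed-size word chunks, streamed one chunk per time step."""
    words = document.split()          # simplistic whitespace tokenization
    return [" ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)]

# With a 400-word memory and 20-word chunks, the external memory holds at most 20 entries.
```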
Result (Accuracy)
Both of our models (EMR-biGRU and EMR-Transformer) outperform the baselines.
Model            ExactMatch  F1 score
FIFO             24.53       27.22
Uniform          28.30       34.39
LIFO             46.23       50.10
EMR-Independent  38.05       41.15
EMR-biGRU        52.20       57.57
EMR-Transformer  48.43       53.81
LIFO performs quite well, unlike the other rule-based scheduling policies, because most answer spans appear in the earlier part of the documents.
[Table: results on TriviaQA. Figure: indices of answers relative to document length.]
Experiment on TVQA Dataset
We use the multi-stream model from [Lei18] as the QA model.
[Lei18] J. Lei, L. Yu, M. Bansal, T. L. Berg, TVQA: Localized, Compositional Video Question Answering. EMNLP 2018
Each subtitle is attached to the frame at which the subtitle starts.
One frame and the corresponding subtitle are jointly embedded into each memory entry.
[Figure: the video clip is streamed as frame–subtitle pairs; EMR writes selected pairs into the external memory, and the QA model (multi-stream) reads the memory and the query to produce the answer.]
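A sketch of how one frame and its subtitle could be fused into a single memory entry: a CNN frame feature and a biGRU subtitle encoding are concatenated and projected. The slide only states that they are jointly embedded, so this particular fusion is an assumption.

```python
import torch
import torch.nn as nn

class FrameSubtitleEncoder(nn.Module):
    """Fuse a video frame feature and its subtitle encoding into one memory entry."""
    def __init__(self, frame_dim, subtitle_dim, mem_dim=256):
        super().__init__()
        self.proj = nn.Linear(frame_dim + subtitle_dim, mem_dim)

    def forward(self, frame_feat, subtitle_feat):
        # frame_feat: (batch, frame_dim) from a CNN; subtitle_feat: (batch, subtitle_dim) from a biGRU.
        # Frames without a subtitle can pass a zero vector as subtitle_feat.
        return self.proj(torch.cat([frame_feat, subtitle_feat], dim=-1))   # one memory entry
```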
Result (Accuracy)
Both of our models (EMR-biGRU and EMR-Transformer) outperform the baselines.
Our methods retain the relevant frames and subtitles even with a small number of memory entries.
Example of TVQA Result
[Question] < 00:55.00 ~ 01:06.33 >
Who enters the coffee shop after Ross shows everyone the paper?
[Answer]
1) Joey 2) Rachel 3) Monica 4) Chandler 5) Phoebe
[Video Clip]
(frames from 00:01 to 01:33)
[Subtitle]
00:03  UNKNAME: Hey. I got some bad news. What?
00:05  UNKNAME: That’s no way to sell newspapers ...
(Ellipsis)
01:31  UNKNAME: Your food is abysmal!
[Memory Information after Reading Streaming Data]
𝑚0  UNKNAME: No. Monica’s restaurant got a horrible review ...
𝑚1  UNKNAME: I didn’t want her to see it, so I ran around and ...
𝑚2  Joey: This is bad. And I’ve had bad reviews.
𝑚3  Monica: Oh, my God! Look at all the newspapers.
𝑚4  UNKNAME: They say there’s no such thing as …
Visualization of Memory Entries
EMR learns general importance: it keeps in the external memory the information needed to solve queries that are unknown at storage time.
Conclusion
• We propose a novel task of learning to remember important instances from streaming data, and demonstrate it on question answering.
Code available at https://github.com/h19920918/emr
• The Episodic Memory Reader (EMR) learns general importance by considering the relative importance between memory entries, without knowing the queries, in order to maximize performance on the QA task.
• Results show that our models retain the information needed for answering even with a small number of memory entries relative to the length of the stream.
• We believe this work can be an essential step toward building real-world conversational agents.
Thank you
Q&A
