Episodic Memory Reader:
Learning What to Remember
for Question Answering from Streaming Data
Moonsu Han¹*, Minki Kang¹*, Hyunwoo Jung¹,³ and Sung Ju Hwang¹,²
¹KAIST, Daejeon, South Korea
²AITRICS, Seoul, South Korea
³NAVER Clova, Seongnam, South Korea
Motivation
When interacting with users, an agent should remember information about the user.
The agent does not know in advance when, and what, information and questions will be given.
[Figure: example dialogue between a user and an agent (Clova WAVE): "It was a hard day. I do not like noisy places." / "How about a rest at home?" / "I sometimes read books when I have a break." / … / "Today is a holiday." / "How about taking a break and reading a book at home?"]
Memory Scalability Problem
The best way to preserve the information is to store all of it in external memory.
However, the agent cannot retain and learn everything due to limits on memory capacity.
[Figure: the agent (Clova WAVE) writes incoming user information into an external memory with entries such as POS: Library, NEG: Noisy, POS: Baseball, NEUT: Senior, POS: Potato, NEUT: John, and reads from it to answer questions such as "What is my name?"]
Learning What to Remember from Streaming Data
We cast this problem as a novel question answering (QA) task in which the machine does not know when the questions will be given from the streaming data.
Motivated by this, we define a new problem that arises in the real world and that must be addressed to build a conversational agent.
[Figure: a QA model reads data instances and supporting facts from the stream, writes them into the external memory, and later uses the stored supporting facts to produce Answer T for Query T.]
Learning What to Remember from Streaming Data
The model can answer unknown queries by storing incoming data until the external memory becomes full.
[Figure: while the memory has free slots, every incoming data instance and supporting fact is written; the QA model then reads the memory to produce Answer 1 for Query 1.]
Learning What to Remember from Streaming Data
When the memory is full, the model must decide which memory entry, or which incoming data instance, to discard.
In this situation the decision is easy, since almost all of the memory entries are useless.
[Figure: the memory holds mostly non-supporting data instances (Data 1–4) and one supporting fact, so the incoming supporting fact can simply replace one of the useless entries.]
Learning What to Remember from Streaming Data
When the memory is already full of supporting facts, however, it is difficult to decide which entry to delete.
[Figure: the external memory is filled with supporting facts (1, 2, 7, 8, 9, T, …); which memory entry is no longer needed?]
Therefore, we need a model that learns both the general importance of a data instance and which data is important at what time.
Problem Definition
Given a data stream 𝑋 = {𝑥(1), …, 𝑥(𝑇)}, a model should learn a function 𝐹: 𝑋 → 𝑀 that maps it to a set of memory entries 𝑀 = {𝑚(1), …, 𝑚(𝑁)}, where 𝑇 ≫ 𝑁.
How can it learn such a function that maximizes performance on an unseen future task, without knowing at storage time which problems it will be given?
[Figure: the function 𝐹: 𝑋 → 𝑀 maps the streaming data 𝑥(1), 𝑥(2), …, 𝑥(𝑇) into memory entries, e.g. m(1): 𝑥(1), m(2): 𝑥(8), m(3): 𝑥(𝑇−9), …, m(N): 𝑥(𝑇−3); the retained entries determine the result and performance on the future query 𝑄.]
Difference from Existing Methods
Our question answering task requires a model that can handle streaming data sequentially without knowing the query in advance.
This problem is difficult to solve with existing QA methods because they lack scalability.
[Figure: a conventional QA model must process the entire stream 𝑥(1), …, 𝑥(𝑇) together with the query 𝑄 to produce the answer.]
Episodic Memory Reader (EMR)
To solve this novel QA task, we propose the Episodic Memory Reader (EMR), which sequentially reads the streaming data and stores its information in the external memory.
When the external memory is full, it replaces the memories that are less important for answering unseen questions.
[Figure: at each time step 𝑡, EMR reads the current input 𝑥(𝑡) and the memory entries, writes an updated memory, and at time 𝑇+1 the QA model answers the query from the retained entries.]
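Below is a minimal sketch of this read/write loop, assuming PyTorch-style components; the names (stream_into_memory, data_encoder, memory_encoder) are illustrative placeholders rather than the released implementation.

```python
import torch

def stream_into_memory(stream, data_encoder, memory_encoder, capacity):
    """Sequentially read a data stream, keeping at most `capacity` memory entries."""
    memory = []                                  # list of encoded entries, each of shape (hidden_dim,)
    for x in stream:                             # x: one data instance (sentence chunk, frame, ...)
        e = data_encoder(x)                      # encode the new instance
        if len(memory) < capacity:
            memory.append(e)                     # memory not full yet: just write
            continue
        # Memory is full: score the entries and drop the one chosen by the policy.
        logits = memory_encoder(torch.stack(memory), e)          # (capacity,) deletion logits
        idx = torch.distributions.Categorical(logits=logits).sample().item()
        del memory[idx]                          # delete the chosen entry; the rest shift up
        memory.append(e)                         # write the new instance into the last slot
    return memory
```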
Learning What to Remember Using RL
We use reinforcement learning to learn which information is important.
The idea is that if the agent takes a good action, the QA model returns a positive reward, and that action is reinforced.
[Figure: RL formulation. State: the streaming data, the current input, and the external memory. Action: a memory replacement by the Episodic Memory Reader (agent). Reward: the evaluation of the pre-trained QA model (environment).]
The Components of Proposed Model
EMR consists of a data encoder, a memory encoder, and a value network, and outputs the relative importance of the memory entries.
It learns how to retain important information in order to maximize its QA accuracy at a future time point.
[Figure: the data encoder and memory encoder write to and read from the external memory; a policy network (actor, a multi-layer perceptron producing the policy π(m|s)) decides replacements, a value network (critic, a GRU cell and MLP producing the value V) evaluates them, and the QA model provides the reward (e.g., F1 score or accuracy) for the query.]
Data Encoder
The data encoder encodes the input data into a memory vector representation.
The encoder architecture varies with the type of the input.
[Figure: for text input, an embedding layer followed by a biGRU encodes the words; the final representation e(t) is written as the newest memory entry (here m6(t)).]
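A rough sketch of such a text data encoder, assuming an embedding layer followed by a biGRU whose final states become the new memory entry; dimensions and module names are illustrative.

```python
import torch
import torch.nn as nn

class TextDataEncoder(nn.Module):
    """Embed a sequence of word ids and summarize it with a bidirectional GRU."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bigru = nn.GRU(embed_dim, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, word_ids):                     # word_ids: (batch, seq_len)
        emb = self.embed(word_ids)                   # (batch, seq_len, embed_dim)
        _, h_n = self.bigru(emb)                     # h_n: (2, batch, hidden_dim)
        e_t = torch.cat([h_n[0], h_n[1]], dim=-1)    # concatenate both directions
        return e_t                                   # e(t): written as the newest memory entry
```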
Data Encoder
The data encoder encodes the input data into a memory vector representation.
The encoder architecture varies with the type of the input.
[Figure: for image input, a CNN encodes the frame; the resulting e(t) is written as the newest memory entry (here m6(t)).]
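For image input, a CNN plays the same role. The sketch below uses a torchvision ResNet-18 as a stand-in feature extractor; the actual backbone is not specified on this slide, so treat it as an assumption.

```python
import torch.nn as nn
from torchvision import models

class ImageDataEncoder(nn.Module):
    """Encode a video frame into a fixed-size memory vector with a CNN."""
    def __init__(self, out_dim=256):
        super().__init__()
        backbone = models.resnet18(weights=None)                         # any CNN feature extractor works
        self.features = nn.Sequential(*list(backbone.children())[:-1])   # drop the classifier head
        self.proj = nn.Linear(512, out_dim)                              # project to the memory dimension

    def forward(self, frame):                                            # frame: (batch, 3, H, W)
        f = self.features(frame).flatten(1)                              # (batch, 512) pooled feature
        return self.proj(f)                                              # e(t): the new memory entry
```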
Memory Encoder
The memory encoder computes the replacement probability by considering the importance of the memory entries.
We devise three different types of memory encoder:
1. EMR-Independent
2. EMR-biGRU
3. EMR-Transformer
[Figure: the memory encoder reads the memory entries m1(t)…m6(t) and feeds the policy network, which outputs the replacement policy π(i | M(t), e(t); θ) over the entries.]
Memory Encoder (EMR-Independent)
Similar to [Gülçehre18], EMR-Independent captures the importance of each memory entry relative to the new data instance, treating the entries independently of each other.
[Gülçehre18] Ç. Gülçehre, S. Chandar, K. Cho, Y. Bengio, Dynamic Neural Turing Machine with Continuous and Discrete Addressing Schemes. NC 2018
[Figure: EMR-Independent computes per-entry quantities α(t), γ(t), v(t−1), and g(t), and a multi-layer perceptron policy network outputs π(i | M(t), e(t); θ).]
Its major drawback is that each memory entry is evaluated only against the current input, not against the other entries.
Memory Encoder (EMR-biGRU)
EMR-biGRU computes the importance of each memory entry by considering the relative relationships between memory entries, using a bidirectional GRU.
[Figure: the memory entries m1(t)…m6(t) are passed through a biGRU to produce hidden states h1(t)…h6(t), and a multi-layer perceptron policy network outputs π(i | M(t), e(t); θ).]
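A sketch of the biGRU memory encoder: the memory entries (with the newly encoded instance appended as the last entry) are run through a bidirectional GRU, and a small MLP turns each hidden state into a deletion logit. The interface is our assumption, not the released code.

```python
import torch
import torch.nn as nn

class BiGRUMemoryEncoder(nn.Module):
    """Score each memory entry in the context of its neighbouring entries."""
    def __init__(self, mem_dim, hidden_dim=128):
        super().__init__()
        self.bigru = nn.GRU(mem_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.policy_mlp = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1))

    def forward(self, memory, e_t):
        # memory: (batch, N, mem_dim); e_t: (batch, mem_dim) is the newly encoded instance.
        entries = torch.cat([memory, e_t.unsqueeze(1)], dim=1)   # (batch, N+1, mem_dim)
        h, _ = self.bigru(entries)                               # (batch, N+1, 2*hidden_dim)
        logits = self.policy_mlp(h).squeeze(-1)                  # (batch, N+1) deletion logits
        return logits, h      # logits feed the policy pi(i | M(t), e(t)); h also feeds the critic
```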
Memory Encoder (EMR-biGRU)
EMR-biGRU computes the importance of each memory entry by considering the relative relationships between memory entries, using a bidirectional GRU.
Unlike EMR-Independent, the importance of each memory entry is therefore learned in relation to its neighbors rather than independently.
Memory Encoder (EMR-Transformer)
EMR-Transformer computes the relative importance of each memory entry using the self-attention mechanism of [Vaswani17].
[Vaswani17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is All you Need. NIPS 2017
[Figure: the memory entries m1(t)…m6(t) attend to each other via self-attention to produce h1(t)…h6(t), and a multi-layer perceptron policy network outputs π(i | M(t), e(t); θ).]
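A corresponding sketch of the Transformer-style memory encoder, using multi-head self-attention over the memory entries (mem_dim must be divisible by the number of heads); again, the names and the exact wiring are assumptions.

```python
import torch
import torch.nn as nn

class SelfAttentionMemoryEncoder(nn.Module):
    """Score memory entries via self-attention, so every entry attends to every other entry."""
    def __init__(self, mem_dim, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(mem_dim, n_heads, batch_first=True)
        self.policy_mlp = nn.Sequential(
            nn.Linear(mem_dim, mem_dim), nn.ReLU(), nn.Linear(mem_dim, 1))

    def forward(self, memory, e_t):
        # memory: (batch, N, mem_dim); e_t: (batch, mem_dim) is appended as the newest entry.
        entries = torch.cat([memory, e_t.unsqueeze(1)], dim=1)   # (batch, N+1, mem_dim)
        h, _ = self.attn(entries, entries, entries)              # self-attention over all entries
        logits = self.policy_mlp(h).squeeze(-1)                  # (batch, N+1) deletion logits
        return logits, h
```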
Value Network
We use one of two reinforcement learning algorithms: A3C [Mnih16] or REINFORCE [Williams92].
For the critic, we adopt Deep Sets [Zaheer17] to build a set representation h(t) of the memory.
[Zaheer17] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Póczos, R. R. Salakhutdinov, A. J. Smola, Deep Sets. NIPS 2017
[Williams92] R. J. Williams, Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. ML 1992
[Mnih16] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, K. Kavukcuoglu, Asynchronous Methods for Deep Reinforcement Learning. ICML 2016
[Figure: the memory encodings h1(t)…h6(t) are pooled into h(t), passed through a GRU cell together with h(t−1), and a multi-layer perceptron outputs the value V(t).]
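A sketch of the critic: the per-entry encodings are sum-pooled in the spirit of Deep Sets into a permutation-invariant summary, a GRU cell carries the history from h(t−1) to h(t), and an MLP outputs the value V(t). Module names and sizes are illustrative.

```python
import torch.nn as nn

class ValueNetwork(nn.Module):
    """Deep-Sets-style pooling over memory encodings + a GRU cell over time -> scalar value."""
    def __init__(self, enc_dim, hidden_dim=128):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(enc_dim, hidden_dim), nn.ReLU())   # per-entry map
        self.gru_cell = nn.GRUCell(hidden_dim, hidden_dim)                    # temporal state
        self.value_mlp = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                                       nn.Linear(hidden_dim, 1))

    def forward(self, entry_encodings, h_prev):
        # entry_encodings: (batch, N, enc_dim) from the memory encoder; h_prev: (batch, hidden_dim).
        pooled = self.phi(entry_encodings).sum(dim=1)    # sum-pool: entry order does not matter
        h_t = self.gru_cell(pooled, h_prev)              # carry information across time steps
        value = self.value_mlp(h_t).squeeze(-1)          # V(t)
        return value, h_t
```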
Example of EMR in Action
To aid understanding, we summarize how our model works on a TriviaQA example.
We assume EMR-Transformer with an external memory that can hold 100 words.
Q: Which US state lends its name to a baked pudding,
made with ice cream, sponge and meringue?
Meringue is type of dessert often associated with Italian
Swiss and French cuisine made from whipped egg whites
and sugar
Text is streamed from the Wikipedia document at a constant rate of 20 words per time step.
Example of EMR in Action
The streamed text is encoded and stored in the external memory.
The model writes the data representations to the memory until it becomes full.
[Figure: the chunk "Meringue is type of dessert often associated with Italian Swiss and French cuisine made from whipped egg whites and sugar" passes through the embedding layer and biGRU of the data encoder and is written into the external memory as m1(1).]
Example of EMR in Action
If the memory is full, the model needs to decide which memory entry to delete.
To this end, the Episodic Memory Reader (EMR) 'reads' the memory entries and the current input data using the memory encoder.
[Figure: the data encoder encodes the current chunk while the memory encoder reads the full memory m1(6)…m6(6).]
Example of EMR in Action
The memory encoder then outputs the policy, i.e., the action probability for deleting each entry; the least important entry receives the highest probability.
[Figure: the policy π(m|s) over the entries m1(6)…m6(6); here m3(6) receives the highest deletion probability.]
Example of EMR in Action
The entry with the highest probability is deleted and the remaining entries are shifted up.
The new data instance is then "written" into the last slot of the external memory.
[Figure: after m3(6) is deleted, the remaining entries shift up to form the memory at step 7.]
Example of EMR in Action
The entry with the highest probability is deleted and the remaining entries are shifted up.
The new data instance is then "written" into the last slot of the external memory.
[Figure: the new chunk is written into the last slot of the updated memory at step 7.]
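The delete-and-shift step above, as a small sketch. Sampling the entry during training (rather than taking the argmax) follows the later training slides; the helper itself is illustrative.

```python
import torch

def replace_entry(memory, new_entry, logits, training=True):
    """Delete one entry according to the policy, shift the rest up, and append the new data."""
    dist = torch.distributions.Categorical(logits=logits)
    idx = dist.sample() if training else logits.argmax()          # stochastic during training
    log_prob = dist.log_prob(idx)                                 # kept for the RL loss later
    kept = [m for i, m in enumerate(memory) if i != idx.item()]   # remaining entries shift up
    kept.append(new_entry)                                        # new instance goes into the last slot
    return kept, log_prob
```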
Example of EMR in Action
When it encounters the question, the QA model answers it using the sentences retained in the external memory.
[Figure: the sentences retained in the external memory (m1(T)…m5(T)) are fed to the QA model (BERT).]
Sentences in external memory: "Meringue is type of dessert often associated with French swiss and Italian cuisine made from whipped egg whites and sugar or aquafaba and sugar and into egg whites used for decoration on pie or spread on sheet or baked Alaska base and baked swiss meringue is hydrates from refined sugar …"
Q: Which US state lends its name to a baked pudding, made with ice cream, sponge and meringue?
A: Alaska
Training the Episodic Memory Reader
The model is trained after each data stream ends.
During training, the entry to delete is selected stochastically according to the policy.
[Figure: EMR reads the streaming chunks x1, …, xT, writes to the external memory according to the policy π(m|s), and keeps a history; the pre-trained QA model (BERT) receives the question.]
Training the Episodic Memory Reader
After each memory update, the task performance is evaluated.
To evaluate the policy, we provide the future query at every time step, but only at training time.
On TriviaQA the measure is the F1 score, which serves as the reward for reinforcement learning.
[Figure: at every step, the pre-trained QA model (BERT) answers the future query from the current memory, and the resulting F1 score is used as the reward.]
Training the Episodic Memory Reader
The reward and action probabilities are then stored in the history for later training steps.
This process is repeated until the data stream ends.
[Figure: the F1 score (reward) and the policy π(m|s) are stored in the history at each step.]
Training the Episodic Memory Reader
At the end of the stream, the QA loss is computed from the QA model and the RL loss is computed by the reinforcement learning algorithm from the stored history.
The QA model and the Episodic Memory Reader are then trained jointly.
[Figure: at the end of the stream, the QA loss from the pre-trained QA model (BERT) and the RL loss from the stored history are used to train the QA model and the Episodic Memory Reader jointly.]
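A condensed sketch of this training procedure: stream the document, act stochastically, query the pre-trained QA model at every step for a reward, accumulate a history, and compute both losses at the end. The helpers emr.step, qa_model.f1, qa_model.loss, and compute_rl_loss (sketched under the next slide) are assumed interfaces, not APIs of the released code.

```python
def train_on_stream(stream, query, answer, emr, qa_model, optimizer):
    """One training episode over a single data stream (hedged sketch with assumed interfaces)."""
    history, memory, h_prev = [], [], None
    for x in stream:
        # Act stochastically: encode x, sample an entry to replace, update the memory.
        memory, log_prob, value, entropy, h_prev = emr.step(memory, x, h_prev)
        reward = qa_model.f1(memory, query, answer)   # future query is given only during training
        history.append((log_prob, value, reward, entropy))
    qa_loss = qa_model.loss(memory, query, answer)    # QA loss at the end of the stream
    rl_loss = compute_rl_loss(history)                # actor-critic loss from the stored history
    optimizer.zero_grad()
    (qa_loss + rl_loss).backward()                    # QA model and EMR are trained jointly
    optimizer.step()
```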
Reinforcement Learning Loss
The RL loss is a standard actor-critic loss.
We also add an entropy term to encourage exploration of different replacement actions.
For the i-th entry of the stored history, with action probability 𝑝𝑖, value 𝑉𝑖, reward 𝑟𝑖, and entropy term 𝑒𝑖 = Σ𝑗 𝑝𝑖𝑗 log 𝑝𝑖𝑗:

𝑅𝑖 = 0.99 · 𝑅𝑖−1 + 𝑟𝑖
𝐿value = Σ𝑖 ½ (𝑅𝑖 − 𝑉𝑖)²
𝐿policy = Σ𝑖 −(𝑟𝑖 + 0.99 · 𝑉𝑖+1 − 𝑉𝑖) · log 𝑝𝑖 − 0.01 · 𝑒𝑖
RL Loss = 𝐿value + 𝐿policy
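A hedged translation of these formulas into code, with γ = 0.99 and entropy weight 0.01 as on the slide; detaching the advantage and using a zero value after the last step are our own conventional choices, not stated on the slide.

```python
import torch

def compute_rl_loss(history, gamma=0.99, entropy_coef=0.01):
    """history[i] = (log p_i, V_i, r_i, e_i) with e_i = sum_j p_ij * log p_ij, as on the slide."""
    log_probs, values, rewards, entropies = zip(*history)
    # Accumulated return as written on the slide: R_i = gamma * R_{i-1} + r_i
    returns, R = [], 0.0
    for r in rewards:
        R = gamma * R + r
        returns.append(R)
    value_loss, policy_loss = 0.0, 0.0
    for i in range(len(history)):
        value_loss = value_loss + 0.5 * (returns[i] - values[i]) ** 2
        next_value = values[i + 1] if i + 1 < len(values) else torch.zeros_like(values[i])
        advantage = rewards[i] + gamma * next_value - values[i]        # one-step TD advantage
        policy_loss = policy_loss - advantage.detach() * log_probs[i] - entropy_coef * entropies[i]
    return value_loss + policy_loss
```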
Experiment - Baselines
We evaluate our EMR-biGRU and EMR-Transformer against several baselines:
• FIFO (First-In First-Out) • Uniform • LIFO (Last-In First-Out) • EMR-Independent
EMR-Independent makes its replacement decision independently for each memory entry, following the Dynamic Least Recently Used addressing of [Gülçehre18].
[Figure: the deletion policy π(m|s) over the memory entries m1(6)…m5(6) under each scheduling policy.]
[Gülçehre18] Ç. Gülçehre, S. Chandar, K. Cho, Y. Bengio, Dynamic Neural Turing Machine with Continuous and Discrete Addressing Schemes. NC 2018
Dataset (bAbI Dataset)
We evaluate our models and baselines on three question answering datasets.
[Weston15] S. Sukhbaatar, A. Szlam, J. Weston, R. Fergus: End-To-End Memory Networks. NIPS 2015
• bAbI [Weston15]: a synthetic dataset for episodic question answering, consisting of 20 tasks with a small vocabulary.
Original Task 2
Index  Context
1      Mary journeyed to the bathroom
2      Sandra went to the garden
6      Sandra put down the milk there
…
Where is the milk?  Garden [2, 6]
8      Daniel went to the garden
17     Daniel dropped the football
…
Where is the football?  Bedroom [12, 17]

Noisy Task 2
Index  Context
1      Sandra moved to the kitchen
2      Wolves are afraid of cats
6      Mary is green
…
Where is the milk?  Garden [1, 4]
38     Mice are afraid of wolves
42     Mary journeyed to the kitchen
…
Where is the apple?  Kitchen [34, 42]
→ It can be solved by remembering a person or an object.
Dataset (TriviaQA Dataset)
We evaluate our models and baselines on three question answering datasets.
[Joshi17] M. Joshi, E. Choi, D. S. Weld, L. Zettlemoyer, TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. ACL 2017
[Context]
001 World War I (WWI or WW1), also known as the First World War, or the Great War, was a global war originating in …
002 More than 70 million military personnel, including 60 million Europeans, were mobilised in one of the largest wars in …
550 Some war memorials date the end of the war as being when the Versailles Treaty was signed in 1919, …
770 Britain, rationing was finally imposed in early 1918, limited to meat, sugar, and fats (butter and margarine), but not bread.
……
[Question]
Where was the peace treaty signed that brought World War I to an end?
[Answer]
Versailles castle
• TriviaQA [Joshi17]: a realistic text-based question answering dataset,
including 95K question-answer pairs from 662K documents.
→ It requires high-level reasoning and the ability to read a large number of sentences per document.
Dataset (TVQA Dataset)
We evaluate our models and baselines on three question answering datasets.
[Lei18] J. Lei, L. Yu, M. Bansal, T. L. Berg, TVQA: Localized, Compositional Video Question Answering. EMNLP 2018
[Video Clip]
[Question] What is Kutner writing on when talking to Lawrence?
[Answer 1] Kutner is writing on a clipboard.
[Answer 2] Kutner is writing on a laptop.
[Answer 3] Kutner is writing on a notepad.
[Answer 4] Kutner is writing on an index card.
[Answer 5] Kutner is writing on his hand.
… …
• TVQA [Lei18]: a localized, compositional video question answering dataset
containing 153K question-answer pairs from 22K clips in 6 TV series.
→ It requires a model that is able to understand multi-modal information.
Experiment on bAbI Dataset
We combine our model with MemN2N [Weston15] on the bAbI dataset.
[Weston15] S. Sukhbaatar, A. Szlam, J. Weston, R. Fergus: End-To-End Memory Networks. NIPS 2015
In this experiment, we vary the memory size to evaluate the efficiency of our model.
[Figure: sentences from the stream are written into the external memory by EMR, and the QA model (MemN2N) reads the memory to answer each query.]
Result (Accuracy)
Both of our models (EMR-biGRU and EMR-Transformer) outperform the baselines.
Our methods retain the supporting facts even with a small number of memory entries.
[Figure: accuracy on the Original and Noisy bAbI tasks for varying memory sizes.]
Result (Solvable)
We also report how many supporting facts each model retains in the external memory.
Both EMR variants significantly outperform EMR-Independent as well as the rule-based memory scheduling policies.
[Figure: fraction of questions whose supporting facts are retained, on the Original and Noisy bAbI tasks.]
Experiment on TriviaQA Dataset
We combine our model with BERT [Devlin18] on the TriviaQA dataset.
[Devlin18] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL-HLT 2019
Since TriviaQA does not provide answer-span indices, we extract them ourselves for span prediction.
Each memory entry holds 20 words, and the memory holds at most 400 words.
[Figure: the Wikipedia document is streamed as 20-word chunks; EMR writes selected chunks into the external memory, and the QA model (BERT) reads the memory and the query to produce the answer.]
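A small sketch of the 20-word chunking used to stream a Wikipedia document; the chunk size and the 400-word memory limit come from this slide, while the whitespace tokenization is a simplification.

```python
def document_to_chunks(document, words_per_chunk=20):
    """Split a document into fixed-size word chunks, streamed one chunk per time step."""
    words = document.split()          # simplistic whitespace tokenization
    return [" ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)]

# With a 400-word memory and 20-word chunks, the external memory holds at most 20 entries.
```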
Result (Accuracy)
Both of our models (EMR-biGRU and EMR-Transformer) outperform the baselines.
Model            ExactMatch  F1 score
FIFO             24.53       27.22
Uniform          28.30       34.39
LIFO             46.23       50.10
EMR-Independent  38.05       41.15
EMR-biGRU        52.20       57.57
EMR-Transformer  48.43       53.81
LIFO performs quite well, unlike the other rule-based scheduling policies, because most answer spans appear in the earlier part of the documents.
[Table: results on TriviaQA. Figure: indices of answers relative to document length.]
Experiment on TVQA Dataset
We use the multi-stream model from [Lei18] as the QA model.
[Lei18] J. Lei, L. Yu, M. Bansal, T. L. Berg, TVQA: Localized, Compositional Video Question Answering. EMNLP 2018
Each subtitle is attached to the frame at which the subtitle starts.
One frame and the corresponding subtitle are jointly embedded into each memory entry.
[Figure: the video clip is streamed as frame–subtitle pairs; EMR writes selected pairs into the external memory, and the QA model (multi-stream) reads the memory and the query to produce the answer.]
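A sketch of how one frame and its subtitle could be fused into a single memory entry: a CNN frame feature and a biGRU subtitle encoding are concatenated and projected. The slide only states that they are jointly embedded, so this particular fusion is an assumption.

```python
import torch
import torch.nn as nn

class FrameSubtitleEncoder(nn.Module):
    """Fuse a video frame feature and its subtitle encoding into one memory entry."""
    def __init__(self, frame_dim, subtitle_dim, mem_dim=256):
        super().__init__()
        self.proj = nn.Linear(frame_dim + subtitle_dim, mem_dim)

    def forward(self, frame_feat, subtitle_feat):
        # frame_feat: (batch, frame_dim) from a CNN; subtitle_feat: (batch, subtitle_dim) from a biGRU.
        # Frames without a subtitle can pass a zero vector as subtitle_feat.
        return self.proj(torch.cat([frame_feat, subtitle_feat], dim=-1))   # one memory entry
```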
Result (Accuracy)
Both of our models (EMR-biGRU and EMR-Transformer) outperform the baselines.
Our methods retain the relevant frames and subtitles even with a small number of memory entries.
Example of TVQA Result
[Question] < 00:55.00 ~ 01:06.33 >
Who enters the coffee shop after Ross shows everyone the paper?
[Answer]
1) Joey 2) Rachel 3) Monica 4) Chandler 5) Phoebe
[Video Clip]
(frames from 00:01 to 01:33)
[Subtitle]
00:03  UNKNAME: Hey. I got some bad news. What?
00:05  UNKNAME: That’s no way to sell newspapers ...
(Ellipsis)
01:31  UNKNAME: Your food is abysmal!
[Memory Information after Reading Streaming Data]
𝑚0  UNKNAME: No. Monica’s restaurant got a horrible review ...
𝑚1  UNKNAME: I didn’t want her to see it, so I ran around and ...
𝑚2  Joey: This is bad. And I’ve had bad reviews.
𝑚3  Monica: Oh, my God! Look at all the newspapers.
𝑚4  UNKNAME: They say there’s no such thing as …
Visualization of Memory Entries
EMR learns general importance: it keeps in the external memory the information needed to solve queries that are unknown at storage time.
Conclusion
• We propose a novel task of learning to remember important instances from streaming data, and demonstrate it on question answering.
Code available at https://github.com/h19920918/emr
• The Episodic Memory Reader (EMR) learns general importance by considering the relative importance between memory entries, without knowing the queries, in order to maximize performance on the QA task.
• Results show that our models retain the information needed for answering even with a small number of memory entries relative to the length of the stream.
• We believe this work can be an essential step toward building real-world conversational agents.
Thank you
Q&A
