Rethinking action spaces for reinforcement learning in end-to-end dialog agents with latent variable models (LaRL)

Latent Action Reinforcement Learning
in End-to-End Dialogue Systems
Tiancheng Zhao, Kaige Xie, Maxine Eskenazi: Rethinking Action Spaces for Reinforcement Learning in End-to-end Dialog Agents with Latent Variable Models. NAACL-HLT 2019
2019. 07. 23.
presented by Jeong-Gwan Lee

Table of contents
¨ Multi-turn goal-oriented Dialog System
• Components of a dialog system
• Types of action space in a dialog system
¨ Baseline model
• RNN Encoder-Decoder model
• Word-level Reinforcement Learning
¨ Latent Action Reinforcement Learning
• Supervised pre-training & RL step
• Gaussian Latent Actions
• Categorical Latent Actions (with summation fusion)
• Attention Fusion
• Optimization Approaches (Full ELBO vs. Lite ELBO)
¨ Experiments (MultiWOZ)
• Setting
• Results
¨ Summary

Multi-turn goal-oriented dialog (MultiWOZ)
User: "I am looking for a place to to stay that has cheap price range it should be in a type of hotel"
System: "okay , do you have a specific area you want to stay in ?"
User: "no , i just need to make sure it s cheap. oh , and i need parking"
System: "i found 1 cheap hotel for you that include -s parking . do you like me to book it ?"
User: "yes , please . 6 people 3 nights starting on thursday ."
System: "i am sorry but i was not able to book that for you for 3 days. is there another day you would like to stay or perhaps a shorter stay ?"
User: "how about only 2 nights ."
System: "booking was successful . reference number is [hotel_reference]. anything else i can do for you ?"
User: "no , that will be all . goodbye ."
System: "thank you for using our services."
Slide highlighting: red = inform, sky-blue = request (or book)

Components of Dialog system
Pipeline: NLU → DST → Policy (Action) → NLG
User input: "I am looking for a place to to stay that has cheap price range it should be in a type of hotel"
[NLU] "I am looking for a place to to stay that has cheap price range it should be in a type of hotel"
[DST] ["type": hotel, "price_range": cheap]
[Policy] What is the system's next action?

Types of action space in dialog system
Pipeline: NLU → DST → Policy (RL) → NLG
[Policy] ["Hotel parking?", "Hotel internet?", …]
¨ The action space is defined by hand-crafted semantic representations such as dialog acts and slot values.
• Limitation: this approach can only handle simple domains whose entire action space can be captured by hand-crafted representations.

Types of action space in dialog system
Pipeline: NLU → DST → Word-level RL
[Word-level RL] ["parking", "you", "need", "internet", …]
¨ To apply RL to E2E dialog systems, the action space is defined as the entire vocabulary (word-level RL).
• Every output word of a response is treated as one action-selection step.
• Limitations:
• Direct application of word-level RL leads to degenerate behavior: the response decoder deviates from human language and generates incomprehensible utterances.
• It suffers from a long horizon (T·U: T dialog turns times up to U words per response), leading to slow and suboptimal convergence.

RNN Encoder-Decoder model
[Figure: RNN encoder-decoder. An encoder RNN reads the input sequence; a decoder RNN, started with <GO>, produces the output sequence.]
Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, Yoshua Bengio. "Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation." EMNLP 2014.

Baseline approach (Word-level RL)
¨ E2E response generation can be treated as a conditional language generation task.
¨ Training with RL usually has 2 steps: supervised pre-training and policy gradient reinforcement learning.
• The supervised learning step maximizes the log likelihood on the training dialogs.
• The RL step uses policy gradients, e.g., the REINFORCE [0] algorithm.
• RL:SL = A:B → A policy gradient updates for every B supervised learning updates.
[0] Williams, Ronald J. "Simple statistical gradient-following algorithms for connectionist reinforcement learning." Machine Learning 8.3-4 (1992): 229-256.
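The REINFORCE update mentioned on this slide can be sketched for a single softmax decision, such as picking one word. This is a minimal illustration using only the standard library, not the presenter's or the paper's implementation; the function names, the learning rate, and the use of raw logits as the parameters are all assumptions made for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_logit_grad(logits, action, ret):
    """Gradient of ret * log softmax(logits)[action] with respect to the logits.

    For a softmax policy this is ret * (one_hot(action) - probs).
    """
    probs = softmax(logits)
    return [ret * ((1.0 if i == action else 0.0) - p) for i, p in enumerate(probs)]


def reinforce_step(logits, action, ret, lr=0.1):
    """One gradient-ascent step (hypothetical learning rate lr)."""
    grad = reinforce_logit_grad(logits, action, ret)
    return [v + lr * g for v, g in zip(logits, grad)]
```

With a positive return, the step raises the probability of the sampled action; in word-level RL this update is applied once per generated word.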

Baseline model (Supervised Learning)
[Figure: baseline encoder-decoder. Components shown: a Bi-RNN encoder over the input sequence, the encoder summary, the belief state label, the DB label, a linear layer, attention, and an RNN decoder started with <GO> that produces the output sequence.]
Budzianowski, Paweł, et al. "MultiWOZ - A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling." arXiv preprint arXiv:1810.00278 (2018).

Baseline model (RL)
[Figure: baseline decoder during RL. Starting from <GO>, each decoder RNN step feeds a softmax, and each word of the output sequence is drawn by categorical sampling.]

Latent Action Reinforcement Learning
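¨ Define a latent variable $z$ (the latent action).
¨ The conditional distribution is factorized into $p(x|c) = p(x|z)\,p(z|c)$:
(1) given a dialog context $c$, we first sample a latent action $z$ from $p_{\theta_e}(z|c)$, where $p_{\theta_e}$ is the dialog encoder network;
(2) generate the response by sampling $x$ based on $z$ via $p_{\theta_d}(x|z)$, where $p_{\theta_d}$ is the response decoder network.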

Latent Action Reinforcement Learning
¨ Compared to the word-level policy gradient (Eq 2), the latent-action policy gradient:
• Shortens the horizon from T·U to T.
• Uses a latent action space designed to be low-dimensional, much smaller than the vocabulary size V.
• Updates only the encoder; the decoder stays intact.
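The two gradients being compared did not survive in this text. As a hedged reconstruction in the paper's notation as best recalled (the exact indices and the reward symbols $R_{tj}$ and $R_t$ are assumptions), the word-level gradient sums over every word $w_{tj}$ of every turn, while the latent-action gradient has one term per turn:

```latex
% Word-level policy gradient (referred to as Eq 2 on the slides):
\nabla_\theta J(\theta) = \mathbb{E}_\theta\Big[\sum_{t=0}^{T}\sum_{j=0}^{U_t}
    R_{tj}\,\nabla_\theta \log p_\theta(w_{tj} \mid w_{<tj}, c_t)\Big]

% Latent-action policy gradient (referred to as Eq 3 on the slides):
\nabla_{\theta_e} J(\theta_e) = \mathbb{E}_{\theta_e}\Big[\sum_{t=0}^{T}
    R_t\,\nabla_{\theta_e} \log p_{\theta_e}(z \mid c_t)\Big]
```

The second form contains only the encoder parameters $\theta_e$, which is why the decoder is left untouched during the RL step.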

Gaussian Latent Actions
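[Figure: Gaussian latent actions. The encoder summary, belief state label, and DB label pass through a linear layer to give the Gaussian latent action; a second linear layer maps the sampled latent action to the decoder initial state; the RNN decoder, started with <GO>, produces the output sequence.]
To compute the policy gradient in Eq 3, $\log p_{\theta_e}(z|c)$ is the Gaussian log-density of the sampled $z$.
Use the reparametrization trick [1] to backpropagate through the sampling step.
[1] Kingma, Diederik P., and Max Welling. "Auto-encoding variational bayes." arXiv preprint arXiv:1312.6114 (2013).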
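A minimal sketch of the two Gaussian operations this slide relies on: reparameterized sampling and the log-density needed by the policy gradient. It uses only the standard library, with plain lists standing in for tensors. The diagonal covariance, the log-variance parameterization, and the function names are assumptions for illustration, not details confirmed by the slides.

```python
import math
import random


def sample_gaussian_latent(mu, log_var, rng):
    """Reparameterized sample: z = mu + sigma * eps, with eps ~ N(0, I).

    The randomness lives in eps, so z is a deterministic, differentiable
    function of mu and log_var.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def gaussian_log_prob(z, mu, log_var):
    """Log-density of z under a diagonal Gaussian N(mu, diag(exp(log_var)))."""
    total = 0.0
    for zi, m, lv in zip(z, mu, log_var):
        total += -0.5 * (math.log(2.0 * math.pi) + lv + (zi - m) ** 2 / math.exp(lv))
    return total
```

In an autodiff framework the same two expressions would be written on tensors, so gradients flow through `mu` and `log_var` back to the encoder.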

Categorical Latent Actions
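[Figure: categorical latent actions with summation fusion. The encoder summary, belief state label, and DB label pass through a linear layer to give M categorical distributions, each over K classes; the sampled latent actions, shape (M, K), are embedded to shape (M, D) and summed into one vector of shape (D) for the RNN decoder, which starts from <GO>.]
To compute the policy gradient in Eq 3, the M latent variables are treated as independent, so $\log p_{\theta_e}(z|c) = \sum_{m=1}^{M} \log p_{\theta_e}(z_m|c)$.
Use the Gumbel-softmax trick [2] to backpropagate through the sample; the discrete sample itself is drawn by Gumbel-max sampling.
[2] Jang, Eric, Shixiang Gu, and Ben Poole. "Categorical reparameterization with gumbel-softmax." arXiv preprint arXiv:1611.01144 (2016).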
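The sampling and fusion steps in this figure can be sketched with the standard library alone. This is an illustration under assumptions, not the paper's code: `embeddings[m][k]` is a hypothetical lookup table holding one D-dimensional vector per class of each latent variable. Gumbel-max sampling alone is not differentiable, so training relies on a relaxation such as Gumbel-softmax, which this sketch does not show.

```python
import math
import random


def gumbel_max_sample(logits, rng):
    """Draw a class index by adding Gumbel(0, 1) noise and taking the argmax."""
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)  # keep u away from 0 so the logs are finite
        noisy.append(logit - math.log(-math.log(u)))
    return max(range(len(noisy)), key=noisy.__getitem__)


def sample_latent_action(logits_per_var, rng):
    """Sample one class for each of the M categorical latent variables."""
    return [gumbel_max_sample(logits, rng) for logits in logits_per_var]


def summation_fusion(action, embeddings):
    """Sum the selected embeddings: (M, D) -> (D)."""
    dim = len(embeddings[0][0])
    fused = [0.0] * dim
    for m, k in enumerate(action):
        for d in range(dim):
            fused[d] += embeddings[m][k][d]
    return fused
```

The fused vector plays the role of the (D) output in the figure, which is passed to the decoder.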

Attention Fusion
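[Figure: attention fusion. As in the categorical model, the encoder summary, belief state label, and DB label pass through a linear layer to give M categorical distributions over K classes, and the sampled latent actions, shape (M, K), are embedded to shape (M, D). The RNN decoder, started with <GO>, attends over these embeddings to obtain a (D) vector while producing the output sequence.]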
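Attention fusion can be sketched as follows: the decoder state scores each of the M latent embeddings, and the softmax-weighted sum becomes the context vector for that step. This sketch uses a plain dot-product score for brevity; the paper's scoring function may include a learned projection, so the scoring choice and the function name here are assumptions.

```python
import math


def attention_fusion(hidden, latent_embeddings):
    """Attend from one decoder state over M latent embeddings.

    hidden: list of D floats (decoder state at the current step).
    latent_embeddings: list of M lists, each with D floats.
    Returns (weights, context), where the weights sum to 1 and
    context is the weighted sum of the embeddings, shape (D).
    """
    scores = [sum(h * e for h, e in zip(hidden, emb)) for emb in latent_embeddings]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden)
    context = [sum(w * emb[d] for w, emb in zip(weights, latent_embeddings))
               for d in range(dim)]
    return weights, context
```

Unlike summation fusion, the weights are recomputed at every decoder step, so different words of the response can draw on different latent variables.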

Optimization Approaches
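¨ Full ELBO: the encoder and decoder are trained by maximizing
$\mathcal{L}_{full}(\theta) = \mathbb{E}_{q_\gamma(z|x,c)}[\log p_{\theta_d}(x|z)] - D_{KL}[q_\gamma(z|x,c) \,\|\, p_{\theta_e}(z|c)]$
• $q_\gamma(z|x,c)$: a neural network that approximates the posterior distribution.
• Full ELBO can suffer from exposure bias in the latent space: the decoder only sees z sampled from q at training time and never experiences z sampled from p, which is always used at test time.
¨ Lite ELBO:
$\mathcal{L}_{lite}(\theta) = \mathbb{E}_{p_{\theta_e}(z|c)}[\log p_{\theta_d}(x|z)] - \beta\, D_{KL}[p_{\theta_e}(z|c) \,\|\, p(z)]$
• It sets the posterior network to be the same as the encoder, i.e. $q(z|x,c) = p_{\theta_e}(z|c)$.
• It adds a regularization term, $\beta\, D_{KL}[p_{\theta_e}(z|c) \,\|\, p(z)]$, that encourages the posterior to be similar to a chosen prior distribution $p(z)$.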
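The lite ELBO regularizer can be sketched for categorical latents. The sketch assumes a uniform prior over the K classes and M independent latent variables, and it takes the reconstruction log-likelihood as a given number; the names `kl_to_uniform`, `lite_elbo`, and the weight `beta` are illustrative choices, not taken from the slides.

```python
import math


def kl_to_uniform(probs):
    """KL[p || uniform] for one categorical distribution over K classes."""
    k = len(probs)
    # Terms with p = 0 contribute nothing, by the convention 0 * log 0 = 0.
    return sum(p * math.log(p * k) for p in probs if p > 0.0)


def lite_elbo(log_likelihood, probs_per_var, beta):
    """Reconstruction log-likelihood minus the beta-weighted KL to the prior.

    probs_per_var holds one categorical distribution per latent variable;
    the KL terms add up because the variables are assumed independent.
    """
    kl = sum(kl_to_uniform(p) for p in probs_per_var)
    return log_likelihood - beta * kl
```

A distribution that is already uniform incurs no penalty, while a one-hot distribution over K classes incurs the maximum penalty of log K.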

Experiment Settings
¨ MultiWOZ dataset
• 10438 dialogs on 6 different domains.
• This paper focuses on the Dialog-Context-to-Text Generation task.
• It assumes that the model has access to the ground-truth belief state and is asked to generate the next response given the user utterance.
Budzianowski, Paweł, et al. "MultiWOZ - A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling." arXiv preprint arXiv:1810.00278 (2018).

Multi-turn goal-oriented dialog (MultiWOZ)
"usr": [
  "am looking for a place to to stay that has [value_pricerange] price range it should be in a type of hotel",
  "no , i just need to make sure it s [value_pricerange] . oh , and i need parking",
  "yes , please . [value_count] people [value_count] nights starting on [value_day] .",
  "how about only [value_count] nights .",
  "no , that will be all . goodbye ."
],
"sys": [
  "okay , do you have a specific area you want to stay in ?",
  "i found [value_count] [value_pricerange] hotel for you that include -s parking . do you like me to book it ?",
  "i am sorry but i was not able to book that for you for [value_day] . is there another day you would like to stay or perhaps a shorter stay ?",
  "booking was successful . reference number is [hotel_reference] . anything else i can do for you ?",
  "thank you for using our services ."
],
"bs": belief state label (94 dimensions)
"db": database label (30 dimensions)

Language Constrained Reward (LCR) curve
¨ An ROC-style curve that visualizes the trade-off between high task success and being faithful to human language.
• It records two measures for each model:
(1) the perplexity (PPL) of the model on the test data
(2) the model's average cumulative task reward
• It creates a 2D plot where the x-axis is the maximum PPL allowed and the y-axis is the best achievable reward within that PPL budget.
Note from the plot: Gaussian is under "without RL".
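The LCR curve construction described above can be sketched as follows: for each PPL budget, keep the best reward among the models whose perplexity fits the budget. The function name and the `(perplexity, reward)` pair format are assumptions for illustration, and any numbers passed in are the caller's own measurements.

```python
def lcr_curve(models, ppl_budgets):
    """Compute Language Constrained Reward curve points.

    models: list of (perplexity, reward) pairs, one per trained model or checkpoint.
    ppl_budgets: iterable of maximum allowed perplexities (x-axis values).
    Returns a list of (budget, best_reward) pairs; best_reward is None
    when no model fits within the budget.
    """
    curve = []
    for budget in ppl_budgets:
        feasible = [reward for ppl, reward in models if ppl <= budget]
        curve.append((budget, max(feasible) if feasible else None))
    return curve
```

Because a larger budget can only admit more models, the resulting curve is non-decreasing in the budget.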

Summary
¨ End-to-end models whose latent actions are expressive enough to capture response semantics in complex domains, decoupling the discourse-level decision-making process from natural language generation.
¨ A novel training objective (lite ELBO) that outperforms the typical evidence lower bound.
¨ An attention mechanism for integrating discrete latent variables (LiteAttnCat) into the decoder to better model long responses.
