Latent action Reinforcement learning
in End-to-end Dialogue System
Tiancheng Zhao, Kaige Xie, Maxine Eskenazi: Rethinking Action Spaces for Reinforcement Learning in End-to-end Dialog Agents with Latent Variable Models. NAACL-HLT 2019
2019. 07. 23.
presented by Jeong-Gwan Lee
Table of contents
¨ Multi-turn goal-oriented Dialog System
• Components of Dialog system
• Types of action space in dialog system
¨ Baseline model
• RNN Encoder-Decoder model
• Word-level Reinforcement Learning
¨ Latent Action Reinforcement Learning
• Supervised pre-training & RL step
• Gaussian Latent Actions
• Categorical Latent Actions (with summation fusion)
• Attention Fusion
• Optimization Approaches (Full ELBO vs. Lite ELBO)
¨ Experiments (MultiWoz)
• Setting
• Results
¨ Summary
Multi-turn goal-oriented dialog (MultiWoz)
User: "I am looking for a place to to stay that has cheap price range it should be in a type of hotel"
System: "okay , do you have a specific area you want to stay in ?"
User: "no , i just need to make sure it s cheap. oh , and i need parking"
System: "i found 1 cheap hotel for you that include -s parking . do you like me to book it ?"
User: "yes , please . 6 people 3 nights starting on thursday ."
System: "i am sorry but i was not able to book that for you for 3 days. is there another day you would like to stay or perhaps a shorter stay ?"
User: "how about only 2 nights ."
System: "booking was successful . reference number is [hotel_reference]. anything else i can do for you ?"
User: "no , that will be all . goodbye ."
System: "thank you for using our services."
Color legend: red = inform, sky-blue = request (or book)
Components of Dialog system
NLU → DST → Policy (Action) → NLG
User: "I am looking for a place to to stay that has cheap price range it should be in a type of hotel"
[NLU] "I am looking for a place to to stay that has cheap price range it should be in a type of hotel"
[DST] ["type": Hotel, "price_range": cheap]
[Policy] What is the system's next action?
Types of action space in dialog system
NLU → DST → Policy (RL) → NLG
[DST] ["type": Hotel, "price_range": cheap]
[Policy] ["Hotel parking?", "Hotel internet?", …]
¨ The action space is defined by hand-crafted semantic representations such as dialog acts and slot values.
• Limit: this can only handle simple domains whose entire action space can be captured by hand-crafted representations.
Types of action space in dialog system
NLU → DST → Word-level RL
[DST] ["type": Hotel, "price_range": cheap]
¨ To apply RL to E2E dialog systems, the action space is defined as the entire vocabulary (word-level RL).
• Every output word of the response is treated as an action selection step.
• Limits
  • Direct application of word-level RL leads to degenerate behavior: the response decoder deviates from human language and generates utterances that are incomprehensible.
  • It suffers from a long horizon (TU), leading to slow and suboptimal convergence.
[Word-level RL] ["parking", "you", "need", "internet", …]
RNN Encoder-Decoder model
Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, Yoshua Bengio. "Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation." EMNLP 2014
[Figure: RNN encoder-decoder. The encoder RNN reads the input sequence; the decoder RNN starts from <GO> and emits the output sequence.]
Baseline approach(Word-level RL)
¨ E2E response generation can be treated as a conditional language
generation task.
¨ Training with RL usually has 2 steps:
supervised pre-training and policy gradient reinforcement learning.
• The supervised learning step maximizes the log likelihood on the training dialogs.
• The RL step uses policy gradients, e.g., the REINFORCE [0] algorithm.
[0] Williams, Ronald J. "Simple statistical gradient-following algorithms for connectionist reinforcement learning." Machine Learning 8.3-4 (1992): 229-256.
RL:SL = A:B → A policy gradient updates for every B supervised learning updates
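The RL:SL = A:B ratio can be read as an interleaving schedule. The following is a minimal sketch, assuming the A policy gradient updates come first within each cycle (the slide does not state the order); `interleaved_schedule` is a hypothetical helper, not something taken from the paper's code.

```python
from itertools import cycle, islice


def interleaved_schedule(a, b):
    """Repeat a block of `a` RL updates followed by `b` SL updates forever."""
    if a < 0 or b < 0 or a + b == 0:
        raise ValueError("a and b must be non-negative and not both zero")
    # Assumption: RL updates come first in each cycle.
    return cycle(["RL"] * a + ["SL"] * b)


# RL:SL = 3:1 -> three policy gradient updates per supervised update.
steps = list(islice(interleaved_schedule(3, 1), 8))
print(steps)
```

In a real training loop each "RL" entry would trigger one policy gradient step and each "SL" entry one maximum-likelihood step.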
Baseline model (Supervised Learning)
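[Figure: baseline encoder-decoder. A Bi-RNN encoder reads the input sequence and produces a summary; the belief state label and the DB label are combined with the summary through a linear layer; an RNN decoder with attention starts from <GO> and emits the output sequence.]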
Budzianowski, Paweł, et al. "Multiwoz-a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue
modelling." arXiv preprint arXiv:1810.00278 (2018).
Baseline model (RL)
[Figure: RL decoding. Starting from <GO>, each decoder RNN step applies a softmax and draws the next word of the output sequence by categorical sampling.]
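Word-level RL treats every sampled word as one action. The sketch below illustrates that idea in plain Python under simplifying assumptions: the per-step logits are fixed toy numbers rather than RNN outputs, a single scalar reward is applied to the whole response, and the function names are hypothetical.

```python
import math
import random


def softmax(logits):
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_response(step_logits, rng):
    """Sample one word id per decoding step; return the ids and their total log-prob."""
    words, log_prob = [], 0.0
    for logits in step_logits:
        probs = softmax(logits)
        w = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        words.append(w)
        log_prob += math.log(probs[w])
    return words, log_prob


def reinforce_logit_grads(step_logits, words, reward):
    """REINFORCE gradient of reward * log p(words) w.r.t. each step's logits.

    For a softmax, d log p(w) / d logits = onehot(w) - probs.
    """
    grads = []
    for logits, w in zip(step_logits, words):
        probs = softmax(logits)
        grads.append([reward * ((1.0 if i == w else 0.0) - p) for i, p in enumerate(probs)])
    return grads


rng = random.Random(0)
step_logits = [[0.1, 0.5, -0.3, 0.0] for _ in range(5)]  # 5 decoding steps, toy vocabulary of 4
words, log_prob = sample_response(step_logits, rng)
grads = reinforce_logit_grads(step_logits, words, reward=1.0)
```

Every decoding step contributes its own gradient term, which is why the horizon grows with the response length.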
Latent Action Reinforcement Learning
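¨ Define a latent variable $z$.
¨ The conditional distribution is factorized into $p(x \mid c) = p(x \mid z)\, p(z \mid c)$:
(1) given a dialog context $c$, we first sample a latent action $z$ from $p_{\theta_e}(z \mid c)$, where $p_{\theta_e}$ is the dialog encoder network;
(2) we then generate the response by sampling $x$ based on $z$ via $p_{\theta_d}(x \mid z)$, where $p_{\theta_d}$ is the response decoder network.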
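The two-step generative story can be illustrated with a toy discrete example. This is only a sketch: the probability tables are made-up numbers standing in for the encoder and decoder networks, and the marginal over a discrete latent action is computed with the usual sum over its values.

```python
import random


def sample_response(p_z_given_c, p_x_given_z, rng):
    """Ancestral sampling: latent action from the encoder table, then a response from the decoder table."""
    z = rng.choices(range(len(p_z_given_c)), weights=p_z_given_c, k=1)[0]
    responses = list(p_x_given_z[z])
    weights = [p_x_given_z[z][x] for x in responses]
    x = rng.choices(responses, weights=weights, k=1)[0]
    return z, x


def marginal_response_prob(p_z_given_c, p_x_given_z, x):
    """Probability of response x given the context, summed over the discrete latent action."""
    return sum(pz * p_x_given_z[z].get(x, 0.0) for z, pz in enumerate(p_z_given_c))


# Toy numbers (hypothetical): two latent actions, two candidate responses.
p_z_given_c = [0.7, 0.3]
p_x_given_z = [
    {"ask_area": 0.9, "offer_booking": 0.1},
    {"ask_area": 0.2, "offer_booking": 0.8},
]
z, x = sample_response(p_z_given_c, p_x_given_z, random.Random(0))
p_offer = marginal_response_prob(p_z_given_c, p_x_given_z, "offer_booking")
```

Here `p_offer` is 0.7 * 0.1 + 0.3 * 0.8 = 0.31.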
Latent Action Reinforcement Learning
¨ Compared to Eq 2,
• Shortens the horizon from TU to T.
• The latent action space is designed to be low-dimensional, much smaller than the vocabulary size V.
• The policy gradient only updates the encoder; the decoder stays intact.
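To make the contrast concrete, the sketch below computes a REINFORCE gradient for a single categorical latent action per turn. It is an illustration under assumptions: the encoder is reduced to a vector of logits, the turn count and response length are toy values, and no decoder parameters appear because the latent-level gradient does not touch them.

```python
import math


def softmax(logits):
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def latent_reinforce_grad(encoder_logits, z, ret):
    """Gradient of ret * log p(z | c) w.r.t. the encoder logits for one turn."""
    probs = softmax(encoder_logits)
    return [ret * ((1.0 if k == z else 0.0) - p) for k, p in enumerate(probs)]


turns, words_per_response = 4, 20  # toy values
word_level_decisions = turns * words_per_response  # one action per word
latent_level_decisions = turns  # one action per turn

grad = latent_reinforce_grad([0.0, 0.0, 0.0], z=1, ret=2.0)
```

With uniform logits over three actions, the gradient is 2 * (onehot - 1/3), so the chosen action's logit is pushed up by 4/3 and the others down by 2/3.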
Gaussian Latent Actions
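[Figure: Gaussian latent actions. The encoder summary, the belief state label, and the DB label pass through a linear layer that parameterizes the Gaussian latent action; the sampled action passes through another linear layer to form the decoder initial state; the RNN decoder starts from <GO> and emits the output sequence.]
To compute the policy gradient in Eq 3, the policy is the Gaussian density $p_\theta(z \mid c) = \mathcal{N}(z; \mu, \sigma^2 I)$.
Use the reparametrization trick [1] to backpropagate through the sampling step.
[1] Kingma, Diederik P., and Max Welling. "Auto-encoding variational Bayes." arXiv preprint arXiv:1312.6114 (2013).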
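The reparametrization trick rewrites a Gaussian sample as a deterministic function of the mean, the variance, and independent noise. A minimal sketch in plain Python, assuming a diagonal Gaussian parameterized by `mu` and `log_var` (both names are illustrative, not taken from the paper):

```python
import math
import random


def reparameterized_sample(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I); return (z, eps)."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    z = [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, log_var, eps)]
    return z, eps


def gaussian_log_prob(z, mu, log_var):
    """Log density of a diagonal Gaussian, the quantity a policy gradient would differentiate."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi) + lv + (zi - m) ** 2 / math.exp(lv))
        for zi, m, lv in zip(z, mu, log_var)
    )


rng = random.Random(0)
mu, log_var = [0.5, -1.0], [0.0, -2.0]
z, eps = reparameterized_sample(mu, log_var, rng)
log_p = gaussian_log_prob(z, mu, log_var)
```

Because the randomness lives entirely in `eps`, gradients with respect to `mu` and `log_var` can flow through `z` in an autodiff framework.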
Categorical Latent Actions
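[Figure: categorical latent actions with summation fusion. The encoder summary, the belief state label, and the DB label pass through a linear layer that outputs M categorical distributions over K classes each, shape (M, K); the M sampled actions are embedded to shape (M, D) and summed into a D-dimensional vector for the RNN decoder, which starts from <GO>.]
To compute the policy gradient in Eq 3, the policy factorizes over the M latent variables: $p_\theta(z \mid c) = \prod_{m=1}^{M} p(z_m \mid c)$.
Use the Gumbel-Softmax trick [2] to backpropagate through the discrete samples (the figure labels the sampling step "gumbel-max sampling").
[2] Jang, Eric, Shixiang Gu, and Ben Poole. "Categorical reparameterization with Gumbel-Softmax." arXiv preprint arXiv:1611.01144 (2016).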
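The Gumbel-based sampling step can be sketched as follows. This is a plain-Python illustration with assumed toy sizes and an assumed temperature of 0.5: adding Gumbel noise to the logits and taking the argmax gives a Gumbel-max sample, while a temperature softmax over the same noisy logits gives the relaxed Gumbel-Softmax sample that is differentiable.

```python
import math
import random


def sample_gumbel(rng, eps=1e-20):
    """Sample Gumbel(0, 1) noise as -log(-log(u)) with u ~ Uniform(0, 1)."""
    u = rng.random()
    return -math.log(-math.log(u + eps) + eps)


def gumbel_softmax(logits, temperature, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / temperature)."""
    noisy = [(l + sample_gumbel(rng)) / temperature for l in logits]
    top = max(noisy)
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
M, K = 3, 4  # M latent variables with K classes each (toy sizes)
logits = [[0.0, 1.0, -1.0, 0.5] for _ in range(M)]
soft = [gumbel_softmax(row, temperature=0.5, rng=rng) for row in logits]
# The argmax of each relaxed sample is the corresponding Gumbel-max (hard) sample.
hard = [row.index(max(row)) for row in soft]
```

Lower temperatures push each row of `soft` closer to a one-hot vector.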
Attention Fusion
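[Figure: attention fusion. The encoder summary, the belief state label, and the DB label pass through a linear layer that outputs M categorical distributions over K classes each, shape (M, K); the sampled actions are embedded to shape (M, D); the RNN decoder, starting from <GO>, attends over the M embeddings to obtain a D-dimensional vector while emitting the output sequence.]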
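The difference between summation fusion and attention fusion can be sketched on the (M, D) embedding matrix. The scoring function is an assumption here: plain dot-product attention is used for simplicity, which may differ from the scoring used in the paper.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def summation_fusion(embeds):
    """Sum the M embeddings (each of size D) into one D-dimensional vector."""
    return [sum(column) for column in zip(*embeds)]


def attention_fusion(decoder_state, embeds):
    """Dot-product attention of one decoder state over the M embeddings (assumed scoring)."""
    scores = [sum(h * e for h, e in zip(decoder_state, emb)) for emb in embeds]
    weights = softmax(scores)
    dim = len(decoder_state)
    context = [sum(w * emb[d] for w, emb in zip(weights, embeds)) for d in range(dim)]
    return context, weights


embeds = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # M = 3 latent actions, D = 2
summed = summation_fusion(embeds)
context, weights = attention_fusion([0.0, 0.0], embeds)
```

Summation produces one fixed vector for the whole response, whereas attention recomputes `context` at every decoding step from the current decoder state.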
Optimization Approaches
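¨ Full ELBO
$\mathcal{L}_{full}(\theta) = \mathbb{E}_{q_\gamma(z \mid x, c)}[\log p(x \mid z)] - D_{KL}[q_\gamma(z \mid x, c) \,\|\, p(z \mid c)]$
• $q_\gamma$: a neural network that approximates the posterior distribution $q(z \mid x, c)$; $p(z \mid c)$ and $p(x \mid z)$ are realized by the encoder and the decoder.
• Full ELBO can suffer from exposure bias in the latent space, i.e. the decoder only sees z sampled from q at training time and never experiences z sampled from p, which is always used at testing time.
¨ Lite ELBO
$\mathcal{L}_{lite}(\theta) = \mathbb{E}_{p(z \mid c)}[\log p(x \mid z)] - \beta D_{KL}[p(z \mid c) \,\|\, p(z)]$
• It sets the posterior network to be the same as our encoder, i.e. $q_\gamma(z \mid x, c) = p(z \mid c)$.
• It adds the regularization term $\beta D_{KL}[p(z \mid c) \,\|\, p(z)]$, which encourages the posterior to be similar to a certain prior distribution $p(z)$.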
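For categorical latent actions, the lite ELBO's regularization term has a closed form. The sketch below assumes a uniform prior over the K classes and a weight `beta` on the KL term; both are assumptions made for illustration, and `log_likelihood` stands in for the decoder's log-likelihood of the reference response.

```python
import math


def kl_to_uniform(probs):
    """KL divergence from a categorical distribution to the uniform prior over K classes."""
    k = len(probs)
    return sum(p * math.log(p * k) for p in probs if p > 0.0)


def lite_elbo(log_likelihood, latent_probs, beta):
    """Log-likelihood minus beta times the summed KL of each latent variable to the prior."""
    kl = sum(kl_to_uniform(probs) for probs in latent_probs)
    return log_likelihood - beta * kl


uniform_kl = kl_to_uniform([0.25, 0.25, 0.25, 0.25])
peaked_kl = kl_to_uniform([1.0, 0.0])
objective = lite_elbo(-3.0, [[0.5, 0.5], [1.0, 0.0]], beta=0.01)
```

A uniform distribution incurs no penalty, while a fully peaked two-class distribution incurs log 2, so the term discourages the encoder from collapsing onto a single action too early.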
Experiment Settings
¨ MultiWoz dataset
• 10438 dialogs on 6 different domains.
• This paper focuses on the Dialog-Context-to-Text Generation task.
• It assumes that the model has access to the ground-truth belief state and is asked to generate the next response given the user utterance.
Budzianowski, Paweł, et al. "Multiwoz-a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue
modelling." arXiv preprint arXiv:1810.00278 (2018).
Multi-turn goal-oriented dialog (MultiWoz)
"usr": [
  "am looking for a place to to stay that has [value_pricerange] price range it should be in a type of hotel",
  "no , i just need to make sure it s [value_pricerange] . oh , and i need parking",
  "yes , please . [value_count] people [value_count] nights starting on [value_day] .",
  "how about only [value_count] nights .",
  "no , that will be all . goodbye ."
],
"sys": [
  "okay , do you have a specific area you want to stay in ?",
  "i found [value_count] [value_pricerange] hotel for you that include -s parking . do you like me to book it ?",
  "i am sorry but i was not able to book that for you for [value_day] . is there another day you would like to stay or perhaps a shorter stay ?",
  "booking was successful . reference number is [hotel_reference] . anything else i can do for you ?",
  "thank you for using our services ."
],
"bs": belief state label (94 dimensions)
"db": database label (30 dimensions)
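The placeholders such as [value_pricerange] come from delexicalization, where slot values are replaced by slot tokens. A minimal sketch, assuming whitespace-tokenized text and a hand-written value-to-placeholder mapping; the `delexicalize` helper and the mapping are hypothetical and are not the dataset's actual preprocessing.

```python
def delexicalize(utterance, value_to_placeholder):
    """Replace each whitespace-separated token that is a known slot value with its placeholder."""
    tokens = utterance.split()
    return " ".join(value_to_placeholder.get(tok, tok) for tok in tokens)


# Hypothetical mapping for one turn of the example dialog.
mapping = {"cheap": "[value_pricerange]", "1": "[value_count]"}
delexed = delexicalize("i found 1 cheap hotel for you", mapping)
print(delexed)
```

A token-level lookup like this only handles single-word values; multi-word values would need span matching.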
Results
Language Constrained Reward(LCR) curve
¨ ROC-style curve to visualize the trade-off between achieving high task success and being faithful to human language.
• It records two measures:
(1) the perplexity (PPL) of a given model on the test data
(2) this model's average cumulative task reward
• It creates a 2D plot where the x-axis is the maximum PPL allowed, and the y-axis is the best achievable reward within that PPL budget.
Gaussian is under "without RL"
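The LCR curve can be computed from a list of (perplexity, reward) measurements. The sketch below assumes each measurement comes from one model checkpoint; the numbers are invented for illustration and `lcr_curve` is a hypothetical helper.

```python
def lcr_curve(points, budgets):
    """For each PPL budget, return the best reward among points with perplexity <= budget.

    points: list of (perplexity, reward) pairs; budgets: list of PPL limits.
    A budget that no point satisfies maps to None.
    """
    curve = []
    for budget in budgets:
        feasible = [reward for ppl, reward in points if ppl <= budget]
        curve.append((budget, max(feasible) if feasible else None))
    return curve


# Invented measurements: (test perplexity, average task reward).
points = [(4.0, 0.6), (6.0, 0.8), (20.0, 0.9)]
curve = lcr_curve(points, budgets=[3, 5, 10, 25])
print(curve)
```

The resulting curve is non-decreasing in the budget, and a model that gains reward only by drifting far from human language shows its gains only at large PPL budgets.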
Summary
¨ End-to-end models whose latent actions are expressive enough to capture response semantics in complex domains, decoupling the discourse-level decision-making process from natural language generation.
¨ A novel training objective (lite ELBO) that outperforms the typical evidence lower bound.
¨ An attention mechanism for integrating discrete latent variables (LiteAttnCat) in the decoder to better model long responses.
