On the Effectiveness of Offline RL for Dialogue Response Generation
On the Effectiveness of Offline RL for Dialogue
Response Generation
ICML, 2023
Paloma Sodhi, Felix Wu, Ethan R. Elenberg et al.
Speaker: Po-Chuan Chen
Dec 12, 2023
1 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Table of contents
1 Abstract
2 Introduction
3 Problem Formulation
4 Approach
5 Experiments
6 Discussion
2 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Abstract
Table of contents
1 Abstract
2 Introduction
3 Problem Formulation
4 Approach
5 Experiments
6 Discussion
3 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Abstract
Abstract
Language models are commonly trained with teacher forcing (TF), which
attempts to match human language exactly, even though identical
meanings can be expressed in different ways.
This paper shows that offline RL yields a clear performance improvement
over teacher forcing while not inducing training instability or
sacrificing practical training budgets¹.
¹ https://github.com/asappresearch/dialogue-offline-rl
4 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Introduction
Table of contents
1 Abstract
2 Introduction
3 Problem Formulation
4 Approach
5 Experiments
6 Discussion
5 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Introduction
Introduction
Historically, text generation models have typically been trained with
teacher forcing (TF) [6], which involves predicting the next token in a
sequence to exactly match the human utterance in a ground truth
dataset.
But this is a challenging objective, and addressing it by designing a loss
that incorporates human-in-the-loop feedback can be expensive.
6 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Introduction
Contribution
In this paper, they present a comprehensive evaluation of offline RL
methods for dialogue text generation and investigate best practices.
They implement three complementary approaches: TF Top, Decision
Transformers (DT) [1], and ILQL [2].
They also find that offline RL methods show a clear performance
improvement over teacher forcing and achieve a favorable trade-off,
generating text close enough in meaning to the human response.
7 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Problem Formulation
Table of contents I
1 Abstract
2 Introduction
3 Problem Formulation
Dialogue Response Generation as an MDP
Rewards for Dialogue Response Generation
Why Offline Reinforcement Learning?
4 Approach
5 Experiments
8 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Problem Formulation
Table of contents II
6 Discussion
9 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Problem Formulation
Dialogue Response Generation as an MDP
Dialogue Response Generation as an MDP
There is a supervised dataset of context-response pairs {xi, yi},
i = 1, . . . , N, where the context x is the conversation history, and the
response y = {y1, . . . , yT } is a target sequence of tokens.
Figure 1: Dialogue generation as a Markov Decision Process (MDP)
10 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Problem Formulation
Dialogue Response Generation as an MDP
Dialogue Response Generation as an MDP (Cont.)
The goal is to learn a policy 𝜋 : st → at maximizing return.
States, st ∈ S is the context x and the partially generated
sequence of tokens up to and including time step t,
ŷ≤t := {ŷ1, . . . ŷt}.
Actions, at ∈ A are the set of next tokens ŷt+1 available from the
vocabulary V.
Transition function, T (st+1 | st, at) is deterministic since every
state-action pair (ŷ≤t, ŷt+1) leads to a unique state ŷ≤t+1 for the
next step.
Rewards, rt : S × A → [0, 1], which compute the similarity between the
generated response ŷ and the target response y.
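To make the MDP concrete, here is a minimal sketch (illustrative Python, not the paper's code) of dialogue generation as a token-level decision process; similarity_fn stands in for an automated metric such as BERTScore.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class DialogueState:
    context: List[int]                                   # tokenized conversation history x
    generated: List[int] = field(default_factory=list)   # partially generated response y_hat

def step(state: DialogueState,
         action: int,
         eos_id: int,
         target: List[int],
         similarity_fn: Callable[[List[int], List[int]], float],
         ) -> Tuple[DialogueState, float, bool]:
    """Append one token (the action); the transition is deterministic.

    The reward is terminal: 0 for intermediate tokens, and a similarity score
    in [0, 1] between the generated and target responses once EOS is emitted.
    """
    next_state = DialogueState(state.context, state.generated + [action])
    done = (action == eos_id)
    reward = similarity_fn(next_state.generated, target) if done else 0.0
    return next_state, reward, done
```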
11 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Problem Formulation
Rewards for Dialogue Response Generation
Rewards for Dialogue Response Generation
The metric should capture both what the speaker is trying to
communicate and the relevance to the conversation.
Definition for reward
Collecting human-in-the-loop annotations
Automated metrics
BERTScore [7]
BLEURT [5]
They use a terminal reward; the return over an episode is
E𝜋 [Σ 𝛾^t rt], summed over t = 0, . . . , T, and is assumed to be an
undiscounted cumulative reward, i.e. 𝛾 = 1.
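A small numeric sketch of this reward structure (illustrative code, not from the paper): with a single terminal reward and 𝛾 = 1, the return-to-go from every step equals the final reward.

```python
def returns_to_go(seq_len: int, terminal_reward: float, gamma: float = 1.0) -> list:
    """Per-step returns G_t = sum_{k >= t} gamma^(k - t) * r_k for a terminal-only reward."""
    rewards = [0.0] * (seq_len - 1) + [terminal_reward]
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

# With gamma = 1, every token's return equals the terminal reward:
assert returns_to_go(5, 0.8) == [0.8, 0.8, 0.8, 0.8, 0.8]
```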
12 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Problem Formulation
Why Offline Reinforcement Learning?
Why Offline Reinforcement Learning?
For text generation, if we use online reinforcement learning, the agent
must balance trying out new actions to learn about the environment
(exploration) against exploiting actions it already knows to be good.
This can be particularly challenging in text generation, as the action
space (i.e. the vocabulary size) is often large.
Another problem is that the reward landscape is sparse, hence
policies during training can get stuck in local minima where reward is
persistently zero.
13 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Problem Formulation
Why Offline Reinforcement Learning?
Why Offline Reinforcement Learning? (Cont.)
Offline RL provides a learning paradigm that combines
Supervised learning’s ability to leverage existing data
General utility optimization power of online reinforcement
learning methods
They collect an offline dataset of N state transitions
D = {(st, at, rt, st+1)} using a behavior policy 𝜋𝛽.
The goal is to learn a policy 𝜋 that maximizes performance on the
dataset while staying close to the behavior policy:
max𝜋 JD (𝜋) − 𝛼D(𝜋, 𝜋𝛽)
14 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Approach
Table of contents I
1 Abstract
2 Introduction
3 Problem Formulation
4 Approach
Fine Tune on Top Returns
Decision Transformers: Condition on Return
Off-Policy Q-Learning
On-Policy RL: PPO
Comparison between Approaches
15 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Approach
Table of contents II
5 Experiments
6 Discussion
16 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Approach
Fine Tune on Top Returns
Fine Tune on Top Returns
The simplest approach is to fine-tune a model on “top”
demonstrations, i.e. teacher forcing on top returns (TF-Top).
The gradient update is simply the log-likelihood gradient on the data
subset Dtop,
Est,at∼Dtop [∇𝜃 log 𝜋𝜃 (at | st)]
where Dtop = {(st, at) ∈ D | Q̂(st, at) ≥ 1 − 𝛿}
Here 𝛿 is computed by taking the top percentile of all returns
Q̂(st, at); since the reward is terminal, the return for any token along the
sequence is the same as the final reward received at the end of the sequence.
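Below is a minimal sketch of TF Top under these definitions. The data layout and the Hugging-Face-style causal-LM interface (a model returning .loss when labels are passed) are assumptions, not the authors' released implementation.

```python
import numpy as np

def select_top(dataset, top_fraction=0.2):
    """Keep sequences whose return Q_hat lies in the top fraction of all returns.

    Each item is assumed to be a dict with keys "input_ids" (context + response
    tokens) and "q_hat" (terminal return). The quantile cutoff plays the role of
    the 1 - delta threshold above.
    """
    q = np.array([d["q_hat"] for d in dataset])
    cutoff = np.quantile(q, 1.0 - top_fraction)
    return [d for d in dataset if d["q_hat"] >= cutoff]

def tf_top_loss(policy, batch_input_ids):
    """Ordinary teacher-forcing (next-token log-likelihood) loss, computed on D_top only."""
    return policy(input_ids=batch_input_ids, labels=batch_input_ids).loss
```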
17 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Approach
Decision Transformers: Condition on Return
Decision Transformers: Condition on Return
The Decision Transformer (DT) learns the return-conditioned
distribution of actions in each state, and then defines a policy by
sampling from the distribution of actions that receive high returns.
Given a data point (st, at), they take its return Q̂(st, at), tokenize it, and
then fine-tune a model conditioned on this return token.
The gradient update is simply the log-likelihood,
Est,at∼D [∇𝜃 log 𝜋𝜃 (at | st, Q̂(st, at))]
At test time, they condition the model on the highest return Q̂top.
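A minimal sketch of return conditioning for text generation; the special return tokens below (e.g. "<ret_high>") are illustrative placeholders, not the tokens used in the paper.

```python
def return_token(q_hat: float, threshold: float = 0.6) -> str:
    """Tokenize a (binarized) return into a special control token."""
    return "<ret_high>" if q_hat >= threshold else "<ret_low>"

def dt_training_text(context: str, response: str, q_hat: float) -> str:
    # Training: the response is conditioned on its own return token.
    return f"{context} {return_token(q_hat)} {response}"

def dt_inference_prompt(context: str) -> str:
    # Test time: always condition on the highest-return token.
    return f"{context} <ret_high>"
```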
18 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Approach
Decision Transformers: Condition on Return
Decision Transformers: Condition on Return (Cont.)
Figure 2: Decision Transformer architecture
19 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Approach
Decision Transformers: Condition on Return
Decision Transformers: Condition on Return (Cont.)
One advantage of the Decision Transformer over fine-tuning on top returns
is that the model is trained to explicitly learn a decision boundary
between different returns.
However, both approaches have the theoretical drawback of requiring
“trajectory coverage”.
Trajectory coverage
The training dataset must contain trajectories starting from the initial
state s0 that see high return. As a result, the number of data points
needed increases exponentially with the length of the trajectory.
20 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Approach
Off-Policy Q-Learning
Off-Policy Q-Learning
Here, they use an offline variant of Q-learning, Implicit Q-Learning
(ILQL).
ILQL adds two extra heads to the pre-trained model: the action-value
head Q𝜃 (st, at), which denotes the utility of a token at given a
sequence st, and the state-value head V𝜓 (st), which denotes the value
of the sequence st.
The implicit policy is defined as
𝜋𝜃 (at | st) = 𝜋𝛽 (at | st) exp(𝜂(Q𝜃 (st, at) − V𝜓 (st)))
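A minimal sketch of how this implicit policy can be realized at decoding time: shift the behavior policy's token logits by the advantage Q − V (assumed tensor shapes; not the paper's code).

```python
import torch

def ilql_decoding_logits(behavior_logits: torch.Tensor,  # [batch, vocab], logits of pi_beta
                         q_values: torch.Tensor,         # [batch, vocab], Q_theta(s_t, a)
                         v_values: torch.Tensor,         # [batch, 1],     V_psi(s_t)
                         eta: float = 1.0) -> torch.Tensor:
    # pi_theta(a | s) is proportional to pi_beta(a | s) * exp(eta * (Q(s, a) - V(s))),
    # so in log space we simply add eta * (Q - V) to the behavior logits.
    return behavior_logits + eta * (q_values - v_values)

# Tokens are then sampled from softmax(ilql_decoding_logits(...)) as usual.
```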
21 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Approach
Off-Policy Q-Learning
Off-Policy Q-Learning (Cont.)
The gradient update is
Est,at,st+1∼D [∇𝜃Q𝜃 (st, at) (r(st, at) + V𝜓 (st+1) − Q𝜃 (st, at))]
− 𝛼 Est∼D [∇𝜃 KL(𝜋𝛽 (· | st) ∥ 𝜋𝜃 (· | st))]
where the term in parentheses is the temporal difference error.
This paper improves upon the original ILQL by regularizing against the
logits of the pre-trained TF policy 𝜋𝛽 instead of the demonstrated data D,
which is better suited for settings where we may not have much
demonstrated data.
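A rough sketch of these two terms as a training loss, assuming per-token tensors: the squared TD loss with a stopped-gradient target has the ∇𝜃Q · (TD error) gradient shown above (up to a constant factor); the full ILQL method additionally trains the value head (e.g. with expectile regression), which is omitted here.

```python
import torch
import torch.nn.functional as F

def ilql_loss(q: torch.Tensor,             # Q_theta(s_t, a_t) for the taken tokens
              v_next: torch.Tensor,        # V_psi(s_{t+1})
              r: torch.Tensor,             # reward (zero except at the terminal step)
              policy_logits: torch.Tensor, # token logits of the learned policy pi_theta
              tf_logits: torch.Tensor,     # token logits of the frozen TF policy pi_beta
              alpha: float = 0.1) -> torch.Tensor:
    # Temporal-difference term: move Q_theta toward the target r + V_psi(s_{t+1}).
    td_loss = F.mse_loss(q, (r + v_next).detach())
    # KL(pi_beta || pi_theta): regularize the learned policy toward the TF policy's logits.
    kl_loss = F.kl_div(F.log_softmax(policy_logits, dim=-1),
                       F.softmax(tf_logits, dim=-1),
                       reduction="batchmean")
    return td_loss + alpha * kl_loss
```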
22 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Approach
On-Policy RL: PPO
On-Policy RL: PPO
In this paper, they also compare against an online RL algorithm:
Proximal Policy Optimization [4].
The gradient update is
Est,at∼𝜋𝜃 [∇𝜃 (𝜋𝜃 (at | st) / 𝜋𝜃old (at | st)) A(st, at)]
23 / 39
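A minimal sketch of this importance-weighted objective; in practice PPO also clips the probability ratio, which the expression above omits. Names and shapes are assumptions, not the paper's code.

```python
import torch

def ppo_loss(logp_new: torch.Tensor,    # log pi_theta(a_t | s_t), requires grad
             logp_old: torch.Tensor,    # log pi_theta_old(a_t | s_t), fixed
             advantages: torch.Tensor,  # A(s_t, a_t)
             clip_eps: float = 0.2) -> torch.Tensor:
    ratio = torch.exp(logp_new - logp_old.detach())       # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Maximize the (clipped) surrogate objective, i.e. minimize its negative mean.
    return -torch.min(unclipped, clipped).mean()
```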
On the Effectiveness of Offline RL for Dialogue Response Generation
Approach
Comparison between Approaches
Comparison between Approaches
When are DT and Q-learning comparable?
Q-learning can in principle outperform DT by stitching together
high-return segments from different trajectories. For MDPs where such
stitching is not possible, e.g. a tree, DT and ILQL are comparable in
performance. They hypothesize that dialogue text generation belongs to
this class of MDPs.
When are DT and TF Top comparable?
DT should be expected to do better than TF Top only when the data TF Top
throws away provides valuable information.
If that information is already captured by the base TF model, then both
DT and TF Top are likely to perform similarly.
24 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Experiments
Table of contents
1 Abstract
2 Introduction
3 Problem Formulation
4 Approach
5 Experiments
Experimental Setup
Results and Analysis
6 Discussion
25 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Experiments
Experimental Setup
Experimental Setup
They evaluate offline RL methods using three task-oriented dialogue
datasets.
MultiWoz 2.2, which is a widely used dataset created to evaluate
performance of dialogue systems in multi-domain settings.
Action Based Conversations Dataset, which contains
customer-agent conversations where the agent’s goal is to solve a
customer problem.
TaskMaster-3, which contains conversations between users and
a system on movie ticketing.
26 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Experiments
Experimental Setup
Baseline and Metrics
They choose a terminal binary reward BERTCLICK, which is a
thresholded BERTSCORE with threshold value 0.6.
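A hedged sketch of the BERTCLICK reward using the public bert-score package; the exact BERTScore model and configuration used in the paper may differ.

```python
from bert_score import score  # pip install bert-score

def bert_click(candidates, references, threshold=0.6):
    """Binary terminal reward: 1 if BERTScore F1 >= threshold, else 0."""
    _, _, f1 = score(candidates, references, lang="en", verbose=False)
    return (f1 >= threshold).float()

# Example (illustrative strings):
# bert_click(["Sure, I can book that for you."],
#            ["Yes, I will book the ticket for you."])
```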
They evaluate on a range of automated similarity metrics shown to
have a high correlation with human judgements like BERTSCORE,
BLEURT, METEOR and BLEU.
Baselines: TF, TF All, TF Top, DT, ILQL, and PPO.
For base models, they study GPT2-Medium² and DistilGPT2³, which
have 355M and 82M parameters, respectively.
² https://huggingface.co/gpt2-medium
³ https://huggingface.co/distilgpt2
27 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Experiments
Experimental Setup
Training Process
1 They train the TF model on all the training data.
2 Then, they use this trained TF model to generate an offline RL
dataset (sketched below).
3 Finally, they fine-tune different RL models on varying percentages of
the generated offline RL data.
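A high-level sketch of step 2, generating the offline RL dataset by rolling out the trained TF model; the function and attribute names here are illustrative assumptions, not the released code.

```python
def build_offline_rl_dataset(tf_model, contexts, targets, reward_fn, num_samples=5):
    """Roll out the trained TF model to collect (context, response, return) tuples."""
    dataset = []
    for x, y in zip(contexts, targets):
        for _ in range(num_samples):
            y_hat = tf_model.generate(x)                    # sample a response from the TF policy
            dataset.append({"context": x,
                            "response": y_hat,
                            "q_hat": reward_fn(y_hat, y)})  # terminal reward, e.g. BERTCLICK
    return dataset

# TF Top, DT, and ILQL are then fine-tuned on slices (e.g. 20% or 80%) of this dataset.
```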
28 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Experiments
Results and Analysis
Results and Analysis
Table 1: Comparison across different methods on average metrics and dataset
size with DistilGPT2. 20% and 80% refer to the percentage of the data used for
fine-tuning offline RL methods.
29 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Experiments
Results and Analysis
How does performance vary across multiple responses?
TF optimizes for recall, so with multiple responses, it should be able
to reach the performance of offline RL methods.
Figure 3: Average BERTCLICK over top-k responses
30 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Experiments
Results and Analysis
How do improvements look qualitatively to human
evaluators?
Figure 4: Human evaluation (similarity and relevance) of TF, TF Top, DT on
100 examples with 2 representative examples presented.
31 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Experiments
Results and Analysis
How does offline RL compare with PPO?
Table 2: Comparison of offline RL (DT) against online RL (PPO).
32 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Experiments
Results and Analysis
How does ILQL critic perform as a ranker?
Table 3: Comparison when ranking responses generated by the base TF
model.
33 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Experiments
Results and Analysis
Can online data collection help DT?
They compare with Quark [3], which can be viewed as an online
counterpart to DT. The performance depending on how good a
coverage sampling from the base TF model has.
Figure 5: Average BERTCLICK for DT vs Quark
34 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Discussion
Table of contents
1 Abstract
2 Introduction
3 Problem Formulation
4 Approach
5 Experiments
6 Discussion
35 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Discussion
Discussion
In this paper, they examine the effectiveness of offline RL methods for
generating dialogue text.
This paper found that
1 Offline RL models learn to produce text that is good enough, i.e.
close in meaning to the human response.
2 Decision Transformer is a practical choice.
3 Future directions include learning reward functions from human
feedback and extending to dialogues with multiple turns.
36 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Discussion
Limitations
This paper did not consider large language models, so it is possible that
the findings do not generalize to large-scale models with billions of
parameters.
37 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Discussion
References I
[1] Lili Chen et al. “Decision Transformer: Reinforcement Learning
via Sequence Modeling”. In: Advances in Neural Information
Processing Systems. Ed. by M. Ranzato et al. Vol. 34. Curran
Associates, Inc., 2021, pp. 15084–15097. url:
https://proceedings.neurips.cc/paper_files/paper/2021/file/7f489f642a0ddb10272b5c31057f0663-Paper.pdf.
[2] Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline
Reinforcement Learning with Implicit Q-Learning. 2021. arXiv:
2110.06169 [cs.LG].
[3] Ximing Lu et al. Quark: Controllable Text Generation with
Reinforced Unlearning. 2022. arXiv: 2205.13636 [cs.CL].
38 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Discussion
References II
[4] John Schulman et al. Proximal Policy Optimization Algorithms.
2017. arXiv: 1707.06347 [cs.LG].
[5] Thibault Sellam, Dipanjan Das, and Ankur P Parikh. “BLEURT:
Learning Robust Metrics for Text Generation”. In: Proceedings
of ACL. 2020.
[6] Ronald J. Williams and David Zipser. “A Learning Algorithm
for Continually Running Fully Recurrent Neural Networks”. In:
Neural Comput. 1.2 (1989), pp. 270–280. issn: 0899-7667. doi:
10.1162/neco.1989.1.2.270. url:
https://doi.org/10.1162/neco.1989.1.2.270.
[7] Tianyi Zhang et al. BERTScore: Evaluating Text Generation with
BERT. 2020. arXiv: 1904.09675 [cs.CL].
39 / 39