SlideShare a Scribd company logo
On the Effectiveness of Offline RL for Dialogue Response Generation
On the Effectiveness of Offline RL for Dialogue
Response Generation
ICML, 2023
Paloma Sodhi, Felix Wu, Ethan R. Elenberg et al.
Speaker: Po-Chuan Chen
Dec 12, 2023
1 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Table of contents
1 Abstract
2 Introduction
3 Problem Formulation
4 Approach
5 Experiments
6 Discussion
2 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Table of contents
1 Abstract
2 Introduction
3 Problem Formulation
4 Approach
5 Experiments
6 Discussion
3 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
For language models, many methods using teacher forcing (TF) to
train. It attempts to match human language exactly, even though
identical meanings can be expressed in different ways.
But with offline RL, which shows a clear performance improvement
over teacher forcing while not inducing training instability or
sacrificing practical training budgets1.
4 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Table of contents
1 Abstract
2 Introduction
3 Problem Formulation
4 Approach
5 Experiments
6 Discussion
5 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Historically, text generation models have typically been trained with
teacher forcing (TF) [6], which involves predicting the next token in a
sequence to exactly match the human utterance in a ground truth
But it’s a challenging objective, and solving it with designing a loss
that incorporating human-in-the-loop feedback can be expensive.
6 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
In this paper, they present a comprehensive evaluation of offline RL
methods for dialogue text generation and investigate best practices.
They implement three complementary approaches, TF Top, Decision
Transformers (DT) [1], ILQL [2].
Also, they find that offline RL methods show a clear performance
improvement over teacher forcing and achieve a trade-off where they
generate text close enough in meaning to human.
7 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Problem Formulation
Table of contents I
1 Abstract
2 Introduction
3 Problem Formulation
Dialogue Response Generation as an MDP
Rewards for Dialogue Response Generation
Why Offline Reinforcement Learning?
4 Approach
5 Experiments
8 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Problem Formulation
Table of contents II
6 Discussion
9 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Problem Formulation
Dialogue Response Generation as an MDP
Dialogue Response Generation as an MDP
There has a supervised dataset of context response pairs {xi, yi}N
where context x is the conversation history, and response
y = {y1, . . . yT } is a target sequence of tokens.
Figure 1: Dialogue generation as a Markov Decision Process (MDP)
10 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Problem Formulation
Dialogue Response Generation as an MDP
Dialogue Response Generation as an MDP (Cont.)
The goal is to learn a policy 𝜋 : st → at maximizing return.
States, st ∈ S is the context x and the partially generated
sequence of tokens up to and including time step t,
ŷ≤t := {ŷ1, . . . ŷt}.
Actions, at ∈ A are the set of next tokens ŷt+1 available from the
vocabulary V.
Transition function, T (st+1 | st, at) is deterministic since every
state-action pair (ŷ≤t, ŷt+1) leads to a unique state ŷ≤t+1 for the
next step.
Rewards, rt : S × A → [0, 1] that computes similarity
generated response ŷ and target response y.
11 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Problem Formulation
Rewards for Dialogue Response Generation
Rewards for Dialogue Response Generation
The metric should capture both what the speaker is trying to
communicate and the relevance to the conversation.
Definition for reward
Collecting human-in-the-loop annotations
Automated metrics
BERTScore [7]
They use a terminal reward, which is cumulative over an episode
t=0 𝛾trt. And it assumes to be undiscounted cumulative rewards,
which 𝛾 = 1.
12 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Problem Formulation
Why Offline Reinforcement Learning?
Why Offline Reinforcement Learning?
For text generation, if we use online reinforcement learning, the agent
must balance the need to try out new actions to learn about the
This can be particularly challenging in text generation, as action
space (i.e. vocabulary size) is often large.
Another problem is that the reward landscape is sparse, hence
policies during training can get stuck in local minima where reward is
persistently zero.
13 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Problem Formulation
Why Offline Reinforcement Learning?
Why Offline Reinforcement Learning? (Cont.)
Offline RL provides a learning paradigm that combines
Supervised learning’s ability to leverage existing data
General utility optimization power of online reinforcement
learning methods
They collect an offline dataset of state transitions
D = {(si
t, ai
t, ri
t, si
i=1 using a behavior policy 𝜋𝛽.
The goal is to learn a policy 𝜋 that maximizes performance on the
dataset while staying close to the behavior policy:
JD (𝜋) − 𝛼D(𝜋, 𝜋𝛽)
14 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Table of contents I
1 Abstract
2 Introduction
3 Problem Formulation
4 Approach
Fine Tune on Top Returns
Decision Transformers: Condition on Return
Off-Policy Q-Learning
On-Policy RL: PPO
Comparison between Approaches
15 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Table of contents II
5 Experiments
6 Discussion
16 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Fine Tune on Top Returns
Fine Tune on Top Returns
The simplest approach is to fine-tune a model on “top”
demonstrations, i.e. teacher forcing on top returns (TF-Top).
The gradient update is simply the log-likelihood gradient on the data
subset Dtop,
Est,at∼Dtop [∇𝜃 log 𝜋𝜃 (at | st)]
where Dtop = {(st, at) ∈ D | Q̂(st, at) ≥ 1 − 𝛿}
Here 𝛿 can be computed by taking the top percentile of all returns
Q̂(st, at), the return for any token along the sequence is the same as
the final reward received at the end of the sequence.
17 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Decision Transformers: Condition on Return
Decision Transformers: Condition on Return
Decision Transformer (DT) wants to learn the return conditional
distribution of actions in each state, and then define a policy by
sampling from the distribution of actions that receive high returns.
Given a data point (st, at), they take its return Q̂(st, at) tokenize it, and
then fine tune a model by conditioning on this return token.
The gradient update is simply the log-likelihood,
Est,at∼D [∇𝜃 log 𝜋𝜃 (at | st, Q̂(st, at))]
At test time, they condition the model on the highest return Q̂top.
18 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Decision Transformers: Condition on Return
Decision Transformers: Condition on Return (Cont.)
Figure 2: Decision Transformer architecture
19 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Decision Transformers: Condition on Return
Decision Transformers: Condition on Return (Cont.)
One advantage of decision transformer over fine-tuning on top returns
is that the model is trained to explicitly learn a decision boundary
between different returns.
However, both approaches have the theoretical drawback of requiring
”trajectory coverage”.
Trajectory coverage
The training dataset must contain trajectories starting from the initial
state s0 that sees high return. It makes the number of data points
needed increases exponentially with the length of the trajectory.
20 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Off-Policy Q-Learning
Off-Policy Q-Learning
Here, they use offline variant for Q-learning, Implicit Q-learning
ILQL adds two extra heads to the pre-trained model, the action value
head Q𝜃 (st, at), which denotes the utility of a token at given a
sequence st, and the state value head V𝜓 (st), which denotes the value
of the sequence st.
The implicit policy set as
𝜋𝜃 (at | st) = 𝜋𝛽 (at | st) exp(𝜂(Q𝜃 (st, at) − V𝜓 (st)))
21 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Off-Policy Q-Learning
Off-Policy Q-Learning (Cont.)
The gradient update set as
[∇𝜃Q𝜃 (st, at) r (st, at) + V𝜓 (st+1) − Q𝜃 (st, at)

| {z }
Temporal Difference Error
-𝛼Est∼D∇𝜃 KL 𝜋𝛽 (· | st) ∥𝜋𝜃 (· | st)

This paper improve upon original ILQL by regularizing against logits
of the pre-trained TF policy 𝜋𝛽 instead of the demonstrated data D,
which more suited for settings where we may not have a lot of
demonstrated data.
22 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
On-Policy RL: PPO
On-Policy RL: PPO
In this paper, they also compare against an online RL algorithm:
Proximal Policy Optimization [4].
The gradient update is,

∇𝜃 𝜋𝜃 (at | st)
𝜋𝜃old (at | st)
A (st, at)

23 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Comparison between Approaches
Comparison between Approaches
When is DT and Q-learning comparable?
For MDPs where such stitching is not possible, e.g. a tree, DT and
ILQL are comparable in performance. They hypothesize that dialogue
text generation belongs to this class of MDPs.
When is DT and TF Top comparable?
DT should expect to do better than TF Top only when the data TF Top
throws away provides valuable information.
If that information is already captured by base TF model, then both
DT and TF Top are likely to be similar.
24 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Table of contents
1 Abstract
2 Introduction
3 Problem Formulation
4 Approach
5 Experiments
Experimental Setup
Results and Analysis
6 Discussion
25 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Experimental Setup
Experimental Setup
They evaluate offline RL methods using three task-oriented dialogue
MultiWoz 2.2, which is a widely used dataset created to evaluate
performance of dialogue systems in multi-domain settings.
Action Based Conversations Dataset, which contains
customer-agent conversations where the agent’s goal is to solve a
customer problem.
TaskMaster-3, which contains conversations between users and
a system on movie ticketing.
26 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Experimental Setup
Baseline and Metrics
They choose a terminal binary reward BERTCLICK, which is a
thresholded BERTSCORE with threshold value 0.6.
They evaluate on a range of automated similarity metrics shown to
have a high correlation with human judgements like BERTSCORE,
Baselines: TF, TF All, TF Top, DT, ILQL, and PPO.
For base models they study GPT2Medium2 and DistilGPT3 which
have 355M and 82M parameters, respectively.
27 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Experimental Setup
Training Process
1 They train the TF model on all the training data.
2 Then, they use this trained TF model to generate an offline RL
3 Finally fine tune different RL models on varying percentages of
generated offline RL data.
28 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Results and Analysis
Results and Analysis
Table 1: Comparison across different methods on average metrics and dataset
size with distilGPT2. 20%, 80% refer to percentage of the data used for
fine-tuning offline RL methods.
29 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Results and Analysis
How does performance vary across multiple responses?
TF optimizes for recall, so with multiple responses, it should be able
to reach the performance of offline RL methods.
Figure 3: Average BERTCLICK over top-k responses
30 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Results and Analysis
How do improvements look qualitatively to human
Figure 4: Human evaluation (similarity and relevance) of TF, TF Top, DT on
100 examples with 2 representative examples presented.
31 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Results and Analysis
How do offline RL compare with PPO?
Table 2: Comparison of offline RL (DT) against online RL (PPO).
32 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Results and Analysis
How does ILQL critic perform as a ranker?
Table 3: Comparison when ranking responses generated by the base TF
33 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Results and Analysis
Can online data collection help DT?
They compare with Quark [3], which can be viewed as an online
counterpart to DT. The performance depending on how good a
coverage sampling from the base TF model has.
Figure 5: Average BERTCLICK for DT vs Quark
34 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
Table of contents
1 Abstract
2 Introduction
3 Problem Formulation
4 Approach
5 Experiments
6 Discussion
35 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
In this paper, they examine the effectiveness of offline RL methods for
generating dialogue text.
This paper found that
1 Offline RL models learn to produce good enough text that are
similar to human.
2 Decision Transformer is a practical choice.
3 Some future directions like learn reward functions from human
feedback and a dialogue has multiple turns.
36 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
This paper didn’t consider large language models, so it’s possible that
their findings do not generalize to large scale models with billions of
37 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
References I
[1] Lili Chen et al. “Decision Transformer: Reinforcement Learning
via Sequence Modeling”. In: Advances in Neural Information
Processing Systems. Ed. by M. Ranzato et al. Vol. 34. Curran
Associates, Inc., 2021, pp. 15084–15097. url: https:
[2] Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline
Reinforcement Learning with Implicit Q-Learning. 2021. arXiv:
2110.06169 [cs.LG].
[3] Ximing Lu et al. Quark: Controllable Text Generation with
Reinforced Unlearning. 2022. arXiv: 2205.13636 [cs.CL].
38 / 39
On the Effectiveness of Offline RL for Dialogue Response Generation
References II
[4] John Schulman et al. Proximal Policy Optimization Algorithms.
2017. arXiv: 1707.06347 [cs.LG].
[5] Thibault Sellam, Dipanjan Das, and Ankur P Parikh. “BLEURT:
Learning Robust Metrics for Text Generation”. In: Proceedings
of ACL. 2020.
[6] Ronald J. Williams and David Zipser. “A Learning Algorithm
for Continually Running Fully Recurrent Neural Networks”. In:
Neural Comput. 1.2 (1989), pp. 270–280. issn: 0899-7667. doi:
10.1162/neco.1989.1.2.270. url:
[7] Tianyi Zhang et al. BERTScore: Evaluating Text Generation with
BERT. 2020. arXiv: 1904.09675 [cs.CL].
39 / 39

More Related Content

Similar to On the Effectiveness of Offline RL for Dialogue Response Generation.pdf

Financial Trading as a Game: A Deep Reinforcement Learning Approach
Financial Trading as a Game: A Deep Reinforcement Learning ApproachFinancial Trading as a Game: A Deep Reinforcement Learning Approach
Financial Trading as a Game: A Deep Reinforcement Learning Approach
謙益 黃
Deep Reinforcement Learning with Distributional Semantic Rewards for Abstract...
Deep Reinforcement Learning with Distributional Semantic Rewards for Abstract...Deep Reinforcement Learning with Distributional Semantic Rewards for Abstract...
Deep Reinforcement Learning with Distributional Semantic Rewards for Abstract...
Deren Lei
Comparing human solving time with SAT-solving for Sudoku problems
Comparing human solving time with SAT-solving for Sudoku problemsComparing human solving time with SAT-solving for Sudoku problems
Comparing human solving time with SAT-solving for Sudoku problems
Adaptive Training of Radial Basis Function Networks Based on Cooperative
Adaptive Training of Radial Basis Function Networks Based on CooperativeAdaptive Training of Radial Basis Function Networks Based on Cooperative
Adaptive Training of Radial Basis Function Networks Based on Cooperative
Strategies for Cooperation Emergence in Distributed Service Discovery
Strategies for Cooperation Emergence in Distributed Service DiscoveryStrategies for Cooperation Emergence in Distributed Service Discovery
Strategies for Cooperation Emergence in Distributed Service Discovery
Miguel Rebollo
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learning
DongHyun Kwak
GDRR Opening Workshop - Deep Reinforcement Learning for Asset Based Modeling ...
GDRR Opening Workshop - Deep Reinforcement Learning for Asset Based Modeling ...GDRR Opening Workshop - Deep Reinforcement Learning for Asset Based Modeling ...
GDRR Opening Workshop - Deep Reinforcement Learning for Asset Based Modeling ...
The Statistical and Applied Mathematical Sciences Institute
Recent Advances in Flower Pollination Algorithm
Recent Advances in Flower Pollination AlgorithmRecent Advances in Flower Pollination Algorithm
Recent Advances in Flower Pollination Algorithm
Cuckoo Search: Recent Advances and Applications
Cuckoo Search: Recent Advances and ApplicationsCuckoo Search: Recent Advances and Applications
Cuckoo Search: Recent Advances and Applications
Xin-She Yang
Reinforcement Learning for Test Case Prioritization
Reinforcement Learning for Test Case PrioritizationReinforcement Learning for Test Case Prioritization
Reinforcement Learning for Test Case Prioritization
Lionel Briand
An improved teaching learning
An improved teaching learningAn improved teaching learning
An improved teaching learning
Po-Chuan Chen
Naver learning to rank question answer pairs using hrde-ltc
Naver learning to rank question answer pairs using hrde-ltcNaver learning to rank question answer pairs using hrde-ltc
Naver learning to rank question answer pairs using hrde-ltc
NAVER Engineering
Is Reinforcement Learning (Not) for Natural Language Processing.pdf
Is Reinforcement Learning (Not) for Natural
Language Processing.pdfIs Reinforcement Learning (Not) for Natural
Language Processing.pdf
Is Reinforcement Learning (Not) for Natural Language Processing.pdf
Po-Chuan Chen
Intro rl
Intro rlIntro rl
Intro rl
Ronald Teo

Similar to On the Effectiveness of Offline RL for Dialogue Response Generation.pdf (20)

Financial Trading as a Game: A Deep Reinforcement Learning Approach
Financial Trading as a Game: A Deep Reinforcement Learning ApproachFinancial Trading as a Game: A Deep Reinforcement Learning Approach
Financial Trading as a Game: A Deep Reinforcement Learning Approach
Deep Reinforcement Learning with Distributional Semantic Rewards for Abstract...
Deep Reinforcement Learning with Distributional Semantic Rewards for Abstract...Deep Reinforcement Learning with Distributional Semantic Rewards for Abstract...
Deep Reinforcement Learning with Distributional Semantic Rewards for Abstract...
Comparing human solving time with SAT-solving for Sudoku problems
Comparing human solving time with SAT-solving for Sudoku problemsComparing human solving time with SAT-solving for Sudoku problems
Comparing human solving time with SAT-solving for Sudoku problems
Adaptive Training of Radial Basis Function Networks Based on Cooperative
Adaptive Training of Radial Basis Function Networks Based on CooperativeAdaptive Training of Radial Basis Function Networks Based on Cooperative
Adaptive Training of Radial Basis Function Networks Based on Cooperative
Strategies for Cooperation Emergence in Distributed Service Discovery
Strategies for Cooperation Emergence in Distributed Service DiscoveryStrategies for Cooperation Emergence in Distributed Service Discovery
Strategies for Cooperation Emergence in Distributed Service Discovery
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learning
GDRR Opening Workshop - Deep Reinforcement Learning for Asset Based Modeling ...
GDRR Opening Workshop - Deep Reinforcement Learning for Asset Based Modeling ...GDRR Opening Workshop - Deep Reinforcement Learning for Asset Based Modeling ...
GDRR Opening Workshop - Deep Reinforcement Learning for Asset Based Modeling ...
Recent Advances in Flower Pollination Algorithm
Recent Advances in Flower Pollination AlgorithmRecent Advances in Flower Pollination Algorithm
Recent Advances in Flower Pollination Algorithm
Cuckoo Search: Recent Advances and Applications
Cuckoo Search: Recent Advances and ApplicationsCuckoo Search: Recent Advances and Applications
Cuckoo Search: Recent Advances and Applications
Reinforcement Learning for Test Case Prioritization
Reinforcement Learning for Test Case PrioritizationReinforcement Learning for Test Case Prioritization
Reinforcement Learning for Test Case Prioritization
An improved teaching learning
An improved teaching learningAn improved teaching learning
An improved teaching learning
Naver learning to rank question answer pairs using hrde-ltc
Naver learning to rank question answer pairs using hrde-ltcNaver learning to rank question answer pairs using hrde-ltc
Naver learning to rank question answer pairs using hrde-ltc
Is Reinforcement Learning (Not) for Natural Language Processing.pdf
Is Reinforcement Learning (Not) for Natural
Language Processing.pdfIs Reinforcement Learning (Not) for Natural
Language Processing.pdf
Is Reinforcement Learning (Not) for Natural Language Processing.pdf
Intro rl
Intro rlIntro rl
Intro rl

More from Po-Chuan Chen

Graph Neural Prompting with Large Language Models.pdf
Graph Neural Prompting with Large Language Models.pdfGraph Neural Prompting with Large Language Models.pdf
Graph Neural Prompting with Large Language Models.pdf
Po-Chuan Chen
Adversarial Attention Modeling for Multi-dimensional Emotion Regression.pdf
Adversarial Attention Modeling for Multi-dimensional Emotion Regression.pdfAdversarial Attention Modeling for Multi-dimensional Emotion Regression.pdf
Adversarial Attention Modeling for Multi-dimensional Emotion Regression.pdf
Po-Chuan Chen
E-CORE: Emotion Correlation Enhanced Empathetic Dialogue Generation.pdf
E-CORE: Emotion Correlation Enhanced Empathetic Dialogue Generation.pdfE-CORE: Emotion Correlation Enhanced Empathetic Dialogue Generation.pdf
E-CORE: Emotion Correlation Enhanced Empathetic Dialogue Generation.pdf
Po-Chuan Chen
Effective Structured Prompting by Meta-Learning and Representative Verbalizer...
Effective Structured Prompting by Meta-Learning and Representative Verbalizer...Effective Structured Prompting by Meta-Learning and Representative Verbalizer...
Effective Structured Prompting by Meta-Learning and Representative Verbalizer...
Po-Chuan Chen
Quark: Controllable Text Generation with Reinforced [Un]learning.pdf
Quark: Controllable Text Generation with Reinforced [Un]learning.pdfQuark: Controllable Text Generation with Reinforced [Un]learning.pdf
Quark: Controllable Text Generation with Reinforced [Un]learning.pdf
Po-Chuan Chen
Empathetic Dialogue Generation via Sensitive Emotion Recognition and Sensible...
Empathetic Dialogue Generation via Sensitive Emotion Recognition and Sensible...Empathetic Dialogue Generation via Sensitive Emotion Recognition and Sensible...
Empathetic Dialogue Generation via Sensitive Emotion Recognition and Sensible...
Po-Chuan Chen
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transfor...
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transfor...GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transfor...
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transfor...
Po-Chuan Chen
A Statistical Perspective on Retrieval-Based Models.pdf
A Statistical Perspective on Retrieval-Based Models.pdfA Statistical Perspective on Retrieval-Based Models.pdf
A Statistical Perspective on Retrieval-Based Models.pdf
Po-Chuan Chen
A Neural Corpus Indexer for Document Retrieval.pdf
A Neural Corpus Indexer for Document Retrieval.pdfA Neural Corpus Indexer for Document Retrieval.pdf
A Neural Corpus Indexer for Document Retrieval.pdf
Po-Chuan Chen
AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning.pdf
AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning.pdfAdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning.pdf
AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning.pdf
Po-Chuan Chen
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attent...
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attent...LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attent...
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attent...
Po-Chuan Chen
Active Retrieval Augmented Generation.pdf
Active Retrieval Augmented Generation.pdfActive Retrieval Augmented Generation.pdf
Active Retrieval Augmented Generation.pdf
Po-Chuan Chen
Po-Chuan Chen
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdfRetrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf
Po-Chuan Chen
Evaluating Parameter Efficient Learning for Generation.pdf
Evaluating Parameter Efficient Learning for Generation.pdfEvaluating Parameter Efficient Learning for Generation.pdf
Evaluating Parameter Efficient Learning for Generation.pdf
Po-Chuan Chen
Off-Policy Deep Reinforcement Learning without Exploration.pdf
Off-Policy Deep Reinforcement Learning without Exploration.pdfOff-Policy Deep Reinforcement Learning without Exploration.pdf
Off-Policy Deep Reinforcement Learning without Exploration.pdf
Po-Chuan Chen
A Mixture-of-Expert Approach to RL-based Dialogue Management.pdf
A Mixture-of-Expert Approach to RL-based Dialogue Management.pdfA Mixture-of-Expert Approach to RL-based Dialogue Management.pdf
A Mixture-of-Expert Approach to RL-based Dialogue Management.pdf
Po-Chuan Chen
HyperPrompt:Prompt-based Task-Conditioning of Transformerspdf
HyperPrompt:Prompt-based Task-Conditioning of TransformerspdfHyperPrompt:Prompt-based Task-Conditioning of Transformerspdf
HyperPrompt:Prompt-based Task-Conditioning of Transformerspdf
Po-Chuan Chen
Training language models to follow instructions with human feedback.pdf
Training language models to follow instructions
with human feedback.pdfTraining language models to follow instructions
with human feedback.pdf
Training language models to follow instructions with human feedback.pdf
Po-Chuan Chen
Leveling to the Last Mile: Near-zero-cost Bit Level Wear Leveling for PCM-bas...
Leveling to the Last Mile: Near-zero-cost Bit Level Wear Leveling for PCM-bas...Leveling to the Last Mile: Near-zero-cost Bit Level Wear Leveling for PCM-bas...
Leveling to the Last Mile: Near-zero-cost Bit Level Wear Leveling for PCM-bas...
Po-Chuan Chen

More from Po-Chuan Chen (20)

Graph Neural Prompting with Large Language Models.pdf
Graph Neural Prompting with Large Language Models.pdfGraph Neural Prompting with Large Language Models.pdf
Graph Neural Prompting with Large Language Models.pdf
Adversarial Attention Modeling for Multi-dimensional Emotion Regression.pdf
Adversarial Attention Modeling for Multi-dimensional Emotion Regression.pdfAdversarial Attention Modeling for Multi-dimensional Emotion Regression.pdf
Adversarial Attention Modeling for Multi-dimensional Emotion Regression.pdf
E-CORE: Emotion Correlation Enhanced Empathetic Dialogue Generation.pdf
E-CORE: Emotion Correlation Enhanced Empathetic Dialogue Generation.pdfE-CORE: Emotion Correlation Enhanced Empathetic Dialogue Generation.pdf
E-CORE: Emotion Correlation Enhanced Empathetic Dialogue Generation.pdf
Effective Structured Prompting by Meta-Learning and Representative Verbalizer...
Effective Structured Prompting by Meta-Learning and Representative Verbalizer...Effective Structured Prompting by Meta-Learning and Representative Verbalizer...
Effective Structured Prompting by Meta-Learning and Representative Verbalizer...
Quark: Controllable Text Generation with Reinforced [Un]learning.pdf
Quark: Controllable Text Generation with Reinforced [Un]learning.pdfQuark: Controllable Text Generation with Reinforced [Un]learning.pdf
Quark: Controllable Text Generation with Reinforced [Un]learning.pdf
Empathetic Dialogue Generation via Sensitive Emotion Recognition and Sensible...
Empathetic Dialogue Generation via Sensitive Emotion Recognition and Sensible...Empathetic Dialogue Generation via Sensitive Emotion Recognition and Sensible...
Empathetic Dialogue Generation via Sensitive Emotion Recognition and Sensible...
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transfor...
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transfor...GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transfor...
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transfor...
A Statistical Perspective on Retrieval-Based Models.pdf
A Statistical Perspective on Retrieval-Based Models.pdfA Statistical Perspective on Retrieval-Based Models.pdf
A Statistical Perspective on Retrieval-Based Models.pdf
A Neural Corpus Indexer for Document Retrieval.pdf
A Neural Corpus Indexer for Document Retrieval.pdfA Neural Corpus Indexer for Document Retrieval.pdf
A Neural Corpus Indexer for Document Retrieval.pdf
AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning.pdf
AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning.pdfAdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning.pdf
AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning.pdf
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attent...
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attent...LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attent...
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attent...
Active Retrieval Augmented Generation.pdf
Active Retrieval Augmented Generation.pdfActive Retrieval Augmented Generation.pdf
Active Retrieval Augmented Generation.pdf
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdfRetrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf
Evaluating Parameter Efficient Learning for Generation.pdf
Evaluating Parameter Efficient Learning for Generation.pdfEvaluating Parameter Efficient Learning for Generation.pdf
Evaluating Parameter Efficient Learning for Generation.pdf
Off-Policy Deep Reinforcement Learning without Exploration.pdf
Off-Policy Deep Reinforcement Learning without Exploration.pdfOff-Policy Deep Reinforcement Learning without Exploration.pdf
Off-Policy Deep Reinforcement Learning without Exploration.pdf
A Mixture-of-Expert Approach to RL-based Dialogue Management.pdf
A Mixture-of-Expert Approach to RL-based Dialogue Management.pdfA Mixture-of-Expert Approach to RL-based Dialogue Management.pdf
A Mixture-of-Expert Approach to RL-based Dialogue Management.pdf
HyperPrompt:Prompt-based Task-Conditioning of Transformerspdf
HyperPrompt:Prompt-based Task-Conditioning of TransformerspdfHyperPrompt:Prompt-based Task-Conditioning of Transformerspdf
HyperPrompt:Prompt-based Task-Conditioning of Transformerspdf
Training language models to follow instructions with human feedback.pdf
Training language models to follow instructions
with human feedback.pdfTraining language models to follow instructions
with human feedback.pdf
Training language models to follow instructions with human feedback.pdf
Leveling to the Last Mile: Near-zero-cost Bit Level Wear Leveling for PCM-bas...
Leveling to the Last Mile: Near-zero-cost Bit Level Wear Leveling for PCM-bas...Leveling to the Last Mile: Near-zero-cost Bit Level Wear Leveling for PCM-bas...
Leveling to the Last Mile: Near-zero-cost Bit Level Wear Leveling for PCM-bas...

Recently uploaded

Computational Engineering IITH Presentation
Computational Engineering IITH PresentationComputational Engineering IITH Presentation
Computational Engineering IITH Presentation
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming PipelinesHarnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Christina Lin
A review on techniques and modelling methodologies used for checking electrom...
A review on techniques and modelling methodologies used for checking electrom...A review on techniques and modelling methodologies used for checking electrom...
A review on techniques and modelling methodologies used for checking electrom...
Textile Chemical Processing and Dyeing.pdf
Textile Chemical Processing and Dyeing.pdfTextile Chemical Processing and Dyeing.pdf
Textile Chemical Processing and Dyeing.pdf
Engine Lubrication performance System.pdf
Engine Lubrication performance System.pdfEngine Lubrication performance System.pdf
Engine Lubrication performance System.pdf
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Engineering Drawings Lecture Detail Drawings 2014.pdf
Engineering Drawings Lecture Detail Drawings 2014.pdfEngineering Drawings Lecture Detail Drawings 2014.pdf
Engineering Drawings Lecture Detail Drawings 2014.pdf
Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
22CYT12-Unit-V-E Waste and its Management.ppt
22CYT12-Unit-V-E Waste and its Management.ppt22CYT12-Unit-V-E Waste and its Management.ppt
22CYT12-Unit-V-E Waste and its Management.ppt
132/33KV substation case study Presentation
132/33KV substation case study Presentation132/33KV substation case study Presentation
132/33KV substation case study Presentation
Casting-Defect-inSlab continuous casting.pdf
Casting-Defect-inSlab continuous casting.pdfCasting-Defect-inSlab continuous casting.pdf
Casting-Defect-inSlab continuous casting.pdf
CSM Cloud Service Management Presentarion
CSM Cloud Service Management PresentarionCSM Cloud Service Management Presentarion
CSM Cloud Service Management Presentarion
Generative AI leverages algorithms to create various forms of content
Generative AI leverages algorithms to create various forms of contentGenerative AI leverages algorithms to create various forms of content
Generative AI leverages algorithms to create various forms of content
Hitesh Mohapatra
New techniques for characterising damage in rock slopes.pdf
New techniques for characterising damage in rock slopes.pdfNew techniques for characterising damage in rock slopes.pdf
New techniques for characterising damage in rock slopes.pdf
Modelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdfModelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdf

Recently uploaded (20)

Computational Engineering IITH Presentation
Computational Engineering IITH PresentationComputational Engineering IITH Presentation
Computational Engineering IITH Presentation
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming PipelinesHarnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
A review on techniques and modelling methodologies used for checking electrom...
A review on techniques and modelling methodologies used for checking electrom...A review on techniques and modelling methodologies used for checking electrom...
A review on techniques and modelling methodologies used for checking electrom...
Textile Chemical Processing and Dyeing.pdf
Textile Chemical Processing and Dyeing.pdfTextile Chemical Processing and Dyeing.pdf
Textile Chemical Processing and Dyeing.pdf
Engine Lubrication performance System.pdf
Engine Lubrication performance System.pdfEngine Lubrication performance System.pdf
Engine Lubrication performance System.pdf
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Engineering Drawings Lecture Detail Drawings 2014.pdf
Engineering Drawings Lecture Detail Drawings 2014.pdfEngineering Drawings Lecture Detail Drawings 2014.pdf
Engineering Drawings Lecture Detail Drawings 2014.pdf
Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
22CYT12-Unit-V-E Waste and its Management.ppt
22CYT12-Unit-V-E Waste and its Management.ppt22CYT12-Unit-V-E Waste and its Management.ppt
22CYT12-Unit-V-E Waste and its Management.ppt
132/33KV substation case study Presentation
132/33KV substation case study Presentation132/33KV substation case study Presentation
132/33KV substation case study Presentation
Casting-Defect-inSlab continuous casting.pdf
Casting-Defect-inSlab continuous casting.pdfCasting-Defect-inSlab continuous casting.pdf
Casting-Defect-inSlab continuous casting.pdf
CSM Cloud Service Management Presentarion
CSM Cloud Service Management PresentarionCSM Cloud Service Management Presentarion
CSM Cloud Service Management Presentarion
Generative AI leverages algorithms to create various forms of content
Generative AI leverages algorithms to create various forms of contentGenerative AI leverages algorithms to create various forms of content
Generative AI leverages algorithms to create various forms of content
New techniques for characterising damage in rock slopes.pdf
New techniques for characterising damage in rock slopes.pdfNew techniques for characterising damage in rock slopes.pdf
New techniques for characterising damage in rock slopes.pdf
Modelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdfModelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdf

On the Effectiveness of Offline RL for Dialogue Response Generation.pdf

  • 1. On the Effectiveness of Offline RL for Dialogue Response Generation On the Effectiveness of Offline RL for Dialogue Response Generation ICML, 2023 Paloma Sodhi, Felix Wu, Ethan R. Elenberg et al. Speaker: Po-Chuan Chen Dec 12, 2023 1 / 39
  • 2. On the Effectiveness of Offline RL for Dialogue Response Generation Table of contents 1 Abstract 2 Introduction 3 Problem Formulation 4 Approach 5 Experiments 6 Discussion 2 / 39
  • 3. On the Effectiveness of Offline RL for Dialogue Response Generation Abstract Table of contents 1 Abstract 2 Introduction 3 Problem Formulation 4 Approach 5 Experiments 6 Discussion 3 / 39
  • 4. On the Effectiveness of Offline RL for Dialogue Response Generation Abstract Abstract For language models, many methods using teacher forcing (TF) to train. It attempts to match human language exactly, even though identical meanings can be expressed in different ways. But with offline RL, which shows a clear performance improvement over teacher forcing while not inducing training instability or sacrificing practical training budgets1. 1 4 / 39
  • 5. On the Effectiveness of Offline RL for Dialogue Response Generation Introduction Table of contents 1 Abstract 2 Introduction 3 Problem Formulation 4 Approach 5 Experiments 6 Discussion 5 / 39
  • 6. On the Effectiveness of Offline RL for Dialogue Response Generation Introduction Introduction Historically, text generation models have typically been trained with teacher forcing (TF) [6], which involves predicting the next token in a sequence to exactly match the human utterance in a ground truth dataset. But it’s a challenging objective, and solving it with designing a loss that incorporating human-in-the-loop feedback can be expensive. 6 / 39
  • 7. On the Effectiveness of Offline RL for Dialogue Response Generation Introduction Contribution In this paper, they present a comprehensive evaluation of offline RL methods for dialogue text generation and investigate best practices. They implement three complementary approaches, TF Top, Decision Transformers (DT) [1], ILQL [2]. Also, they find that offline RL methods show a clear performance improvement over teacher forcing and achieve a trade-off where they generate text close enough in meaning to human. 7 / 39
  • 8. On the Effectiveness of Offline RL for Dialogue Response Generation Problem Formulation Table of contents I 1 Abstract 2 Introduction 3 Problem Formulation Dialogue Response Generation as an MDP Rewards for Dialogue Response Generation Why Offline Reinforcement Learning? 4 Approach 5 Experiments 8 / 39
  • 9. On the Effectiveness of Offline RL for Dialogue Response Generation Problem Formulation Table of contents II 6 Discussion 9 / 39
  • 10. On the Effectiveness of Offline RL for Dialogue Response Generation Problem Formulation Dialogue Response Generation as an MDP Dialogue Response Generation as an MDP There has a supervised dataset of context response pairs {xi, yi}N i=1, where context x is the conversation history, and response y = {y1, . . . yT } is a target sequence of tokens. Figure 1: Dialogue generation as a Markov Decision Process (MDP) 10 / 39
  • 11. On the Effectiveness of Offline RL for Dialogue Response Generation Problem Formulation Dialogue Response Generation as an MDP Dialogue Response Generation as an MDP (Cont.) The goal is to learn a policy 𝜋 : st → at maximizing return. States, st ∈ S is the context x and the partially generated sequence of tokens up to and including time step t, ŷ≤t := {ŷ1, . . . ŷt}. Actions, at ∈ A are the set of next tokens ŷt+1 available from the vocabulary V. Transition function, T (st+1 | st, at) is deterministic since every state-action pair (ŷ≤t, ŷt+1) leads to a unique state ŷ≤t+1 for the next step. Rewards, rt : S × A → [0, 1] that computes similarity generated response ŷ and target response y. 11 / 39
  • 12. On the Effectiveness of Offline RL for Dialogue Response Generation Problem Formulation Rewards for Dialogue Response Generation Rewards for Dialogue Response Generation The metric should capture both what the speaker is trying to communicate and the relevance to the conversation. Definition for reward Collecting human-in-the-loop annotations Automated metrics BERTScore [7] BLEURT [5] They use a terminal reward, which is cumulative over an episode E𝜋 ÍT t=0 𝛾trt. And it assumes to be undiscounted cumulative rewards, which 𝛾 = 1. 12 / 39
  • 13. On the Effectiveness of Offline RL for Dialogue Response Generation Problem Formulation Why Offline Reinforcement Learning? Why Offline Reinforcement Learning? For text generation, if we use online reinforcement learning, the agent must balance the need to try out new actions to learn about the environment. This can be particularly challenging in text generation, as action space (i.e. vocabulary size) is often large. Another problem is that the reward landscape is sparse, hence policies during training can get stuck in local minima where reward is persistently zero. 13 / 39
  • 14. On the Effectiveness of Offline RL for Dialogue Response Generation Problem Formulation Why Offline Reinforcement Learning? Why Offline Reinforcement Learning? (Cont.) Offline RL provides a learning paradigm that combines Supervised learning’s ability to leverage existing data General utility optimization power of online reinforcement learning methods They collect an offline dataset of state transitions D = {(si t, ai t, ri t, si t+1)}N i=1 using a behavior policy 𝜋𝛽. The goal is to learn a policy 𝜋 that maximizes performance on the dataset while staying close to the behavior policy: max 𝜋 JD (𝜋) − 𝛼D(𝜋, 𝜋𝛽) 14 / 39
  • 15. On the Effectiveness of Offline RL for Dialogue Response Generation Approach Table of contents I 1 Abstract 2 Introduction 3 Problem Formulation 4 Approach Fine Tune on Top Returns Decision Transformers: Condition on Return Off-Policy Q-Learning On-Policy RL: PPO Comparison between Approaches 15 / 39
  • 16. On the Effectiveness of Offline RL for Dialogue Response Generation Approach Table of contents II 5 Experiments 6 Discussion 16 / 39
  • 17. On the Effectiveness of Offline RL for Dialogue Response Generation Approach Fine Tune on Top Returns Fine Tune on Top Returns The simplest approach is to fine-tune a model on “top” demonstrations, i.e. teacher forcing on top returns (TF-Top). The gradient update is simply the log-likelihood gradient on the data subset Dtop, Est,at∼Dtop [∇𝜃 log 𝜋𝜃 (at | st)] where Dtop = {(st, at) ∈ D | Q̂(st, at) ≥ 1 − 𝛿} Here 𝛿 can be computed by taking the top percentile of all returns Q̂(st, at), the return for any token along the sequence is the same as the final reward received at the end of the sequence. 17 / 39
  • 18. On the Effectiveness of Offline RL for Dialogue Response Generation Approach Decision Transformers: Condition on Return Decision Transformers: Condition on Return Decision Transformer (DT) wants to learn the return conditional distribution of actions in each state, and then define a policy by sampling from the distribution of actions that receive high returns. Given a data point (st, at), they take its return Q̂(st, at) tokenize it, and then fine tune a model by conditioning on this return token. The gradient update is simply the log-likelihood, Est,at∼D [∇𝜃 log 𝜋𝜃 (at | st, Q̂(st, at))] At test time, they condition the model on the highest return Q̂top. 18 / 39
  • 19. On the Effectiveness of Offline RL for Dialogue Response Generation Approach Decision Transformers: Condition on Return Decision Transformers: Condition on Return (Cont.) Figure 2: Decision Transformer architecture 19 / 39
  • 20. On the Effectiveness of Offline RL for Dialogue Response Generation Approach Decision Transformers: Condition on Return Decision Transformers: Condition on Return (Cont.) One advantage of decision transformer over fine-tuning on top returns is that the model is trained to explicitly learn a decision boundary between different returns. However, both approaches have the theoretical drawback of requiring ”trajectory coverage”. Trajectory coverage The training dataset must contain trajectories starting from the initial state s0 that sees high return. It makes the number of data points needed increases exponentially with the length of the trajectory. 20 / 39
  • 21. On the Effectiveness of Offline RL for Dialogue Response Generation Approach Off-Policy Q-Learning Off-Policy Q-Learning Here, they use offline variant for Q-learning, Implicit Q-learning (ILQL). ILQL adds two extra heads to the pre-trained model, the action value head Q𝜃 (st, at), which denotes the utility of a token at given a sequence st, and the state value head V𝜓 (st), which denotes the value of the sequence st. The implicit policy set as 𝜋𝜃 (at | st) = 𝜋𝛽 (at | st) exp(𝜂(Q𝜃 (st, at) − V𝜓 (st))) 21 / 39
  • 22. On the Effectiveness of Offline RL for Dialogue Response Generation Approach Off-Policy Q-Learning Off-Policy Q-Learning (Cont.) The gradient update set as E st,at,st+1∼D [∇𝜃Q𝜃 (st, at) r (st, at) + V𝜓 (st+1) − Q𝜃 (st, at) | {z } Temporal Difference Error ] -𝛼Est∼D∇𝜃 KL 𝜋𝛽 (· | st) ∥𝜋𝜃 (· | st) This paper improve upon original ILQL by regularizing against logits of the pre-trained TF policy 𝜋𝛽 instead of the demonstrated data D, which more suited for settings where we may not have a lot of demonstrated data. 22 / 39
  • 23. On the Effectiveness of Offline RL for Dialogue Response Generation Approach On-Policy RL: PPO On-Policy RL: PPO In this paper, they also compare against an online RL algorithm: Proximal Policy Optimization [4]. The gradient update is, E st,at∼𝜋𝜃 ∇𝜃 𝜋𝜃 (at | st) 𝜋𝜃old (at | st) A (st, at) 23 / 39
  • 24. On the Effectiveness of Offline RL for Dialogue Response Generation Approach Comparison between Approaches Comparison between Approaches When is DT and Q-learning comparable? For MDPs where such stitching is not possible, e.g. a tree, DT and ILQL are comparable in performance. They hypothesize that dialogue text generation belongs to this class of MDPs. When is DT and TF Top comparable? DT should expect to do better than TF Top only when the data TF Top throws away provides valuable information. If that information is already captured by base TF model, then both DT and TF Top are likely to be similar. 24 / 39
  • 25. On the Effectiveness of Offline RL for Dialogue Response Generation Experiments Table of contents 1 Abstract 2 Introduction 3 Problem Formulation 4 Approach 5 Experiments Experimental Setup Results and Analysis 6 Discussion 25 / 39
  • 26. On the Effectiveness of Offline RL for Dialogue Response Generation Experiments Experimental Setup Experimental Setup They evaluate offline RL methods using three task-oriented dialogue datasets. MultiWoz 2.2, which is a widely used dataset created to evaluate performance of dialogue systems in multi-domain settings. Action Based Conversations Dataset, which contains customer-agent conversations where the agent’s goal is to solve a customer problem. TaskMaster-3, which contains conversations between users and a system on movie ticketing. 26 / 39
  • 27. On the Effectiveness of Offline RL for Dialogue Response Generation Experiments Experimental Setup Baseline and Metrics They choose a terminal binary reward BERTCLICK, which is a thresholded BERTSCORE with threshold value 0.6. They evaluate on a range of automated similarity metrics shown to have a high correlation with human judgements like BERTSCORE, BLEURT, METEOR and BLEU. Baselines: TF, TF All, TF Top, DT, ILQL, and PPO. For base models they study GPT2Medium2 and DistilGPT3 which have 355M and 82M parameters, respectively. 2 3 27 / 39
  • 28. On the Effectiveness of Offline RL for Dialogue Response Generation Experiments Experimental Setup Training Process 1 They train the TF model on all the training data. 2 Then, they use this trained TF model to generate an offline RL dataset. 3 Finally fine tune different RL models on varying percentages of generated offline RL data. 28 / 39
  • 29. On the Effectiveness of Offline RL for Dialogue Response Generation Experiments Results and Analysis Results and Analysis Table 1: Comparison across different methods on average metrics and dataset size with distilGPT2. 20%, 80% refer to percentage of the data used for fine-tuning offline RL methods. 29 / 39
  • 30. On the Effectiveness of Offline RL for Dialogue Response Generation Experiments Results and Analysis How does performance vary across multiple responses? TF optimizes for recall, so with multiple responses, it should be able to reach the performance of offline RL methods. Figure 3: Average BERTCLICK over top-k responses 30 / 39
  • 31. On the Effectiveness of Offline RL for Dialogue Response Generation Experiments Results and Analysis How do improvements look qualitatively to human evaluators? Figure 4: Human evaluation (similarity and relevance) of TF, TF Top, DT on 100 examples with 2 representative examples presented. 31 / 39
  • 32. On the Effectiveness of Offline RL for Dialogue Response Generation Experiments Results and Analysis How do offline RL compare with PPO? Table 2: Comparison of offline RL (DT) against online RL (PPO). 32 / 39
  • 33. On the Effectiveness of Offline RL for Dialogue Response Generation Experiments Results and Analysis How does ILQL critic perform as a ranker? Table 3: Comparison when ranking responses generated by the base TF model. 33 / 39
  • 34. On the Effectiveness of Offline RL for Dialogue Response Generation Experiments Results and Analysis Can online data collection help DT? They compare with Quark [3], which can be viewed as an online counterpart to DT. The performance depending on how good a coverage sampling from the base TF model has. Figure 5: Average BERTCLICK for DT vs Quark 34 / 39
  • 35. On the Effectiveness of Offline RL for Dialogue Response Generation Discussion Table of contents 1 Abstract 2 Introduction 3 Problem Formulation 4 Approach 5 Experiments 6 Discussion 35 / 39
  • 36. On the Effectiveness of Offline RL for Dialogue Response Generation Discussion Discussion In this paper, they examine the effectiveness of offline RL methods for generating dialogue text. This paper found that 1 Offline RL models learn to produce good enough text that are similar to human. 2 Decision Transformer is a practical choice. 3 Some future directions like learn reward functions from human feedback and a dialogue has multiple turns. 36 / 39
  • 37. On the Effectiveness of Offline RL for Dialogue Response Generation Discussion Limitations This paper didn’t consider large language models, so it’s possible that their findings do not generalize to large scale models with billions of parameters. 37 / 39
  • 38. On the Effectiveness of Offline RL for Dialogue Response Generation Discussion References I [1] Lili Chen et al. “Decision Transformer: Reinforcement Learning via Sequence Modeling”. In: Advances in Neural Information Processing Systems. Ed. by M. Ranzato et al. Vol. 34. Curran Associates, Inc., 2021, pp. 15084–15097. url: https: // file/7f489f642a0ddb10272b5c31057f0663-Paper.pdf. [2] Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline Reinforcement Learning with Implicit Q-Learning. 2021. arXiv: 2110.06169 [cs.LG]. [3] Ximing Lu et al. Quark: Controllable Text Generation with Reinforced Unlearning. 2022. arXiv: 2205.13636 [cs.CL]. 38 / 39
  • 39. On the Effectiveness of Offline RL for Dialogue Response Generation Discussion References II [4] John Schulman et al. Proximal Policy Optimization Algorithms. 2017. arXiv: 1707.06347 [cs.LG]. [5] Thibault Sellam, Dipanjan Das, and Ankur P Parikh. “BLEURT: Learning Robust Metrics for Text Generation”. In: Proceedings of ACL. 2020. [6] Ronald J. Williams and David Zipser. “A Learning Algorithm for Continually Running Fully Recurrent Neural Networks”. In: Neural Comput. 1.2 (1989), pp. 270–280. issn: 0899-7667. doi: 10.1162/neco.1989.1.2.270. url: [7] Tianyi Zhang et al. BERTScore: Evaluating Text Generation with BERT. 2020. arXiv: 1904.09675 [cs.CL]. 39 / 39