On the Effectiveness of Offline RL for Dialogue Response Generation
ICML, 2023
Paloma Sodhi, Felix Wu, Ethan R. Elenberg et al.
Speaker: Po-Chuan Chen
Dec 12, 2023
Table of contents
1 Abstract
2 Introduction
3 Problem Formulation
4 Approach
5 Experiments
6 Discussion
Abstract
Language models are commonly trained with teacher forcing (TF), which attempts to match human language exactly, even though identical meanings can be expressed in different ways.
Offline RL, by contrast, shows a clear performance improvement over teacher forcing while not inducing training instability or sacrificing practical training budgets.¹
¹ https://github.com/asappresearch/dialogue-offline-rl
Introduction
Historically, text generation models have typically been trained with
teacher forcing (TF) [6], which involves predicting the next token in a
sequence to exactly match the human utterance in a ground truth
dataset.
But this is a challenging objective, and addressing it by designing a loss that incorporates human-in-the-loop feedback can be expensive.
Contribution
In this paper, they present a comprehensive evaluation of offline RL methods for dialogue text generation and investigate best practices.
They implement three complementary approaches: fine-tuning on top returns (TF Top), Decision Transformers (DT) [1], and ILQL [2].
They also find that offline RL methods show a clear performance improvement over teacher forcing and achieve a trade-off where they generate text that is close enough in meaning to the human response.
Problem Formulation
Dialogue Response Generation as an MDP
They assume a supervised dataset of context-response pairs $\{(x_i, y_i)\}_{i=1}^{N}$, where the context x is the conversation history, and the response $y = \{y_1, \ldots, y_T\}$ is a target sequence of tokens.
Figure 1: Dialogue generation as a Markov Decision Process (MDP)
Dialogue Response Generation as an MDP (Cont.)
The goal is to learn a policy 𝜋 : st → at maximizing return.
States: st ∈ S is the context x together with the partially generated sequence of tokens up to and including time step t, ŷ≤t := {ŷ1, . . . , ŷt}.
Actions: at ∈ A is the set of next tokens ŷt+1 available from the vocabulary V.
Transition function: T(st+1 | st, at) is deterministic, since every state-action pair (ŷ≤t, ŷt+1) leads to a unique next state ŷ≤t+1.
Rewards: rt : S × A → [0, 1] computes the similarity between the generated response ŷ and the target response y.
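A minimal sketch of this token-level MDP, with a placeholder similarity function standing in for the reward metric (names are illustrative, not the paper's code):

```python
from dataclasses import dataclass

# Sketch of the token-level MDP described above.
# State = conversation context plus the partially generated response; action = next token.

@dataclass
class State:
    context: str          # conversation history x
    prefix: list[str]     # generated tokens y-hat up to and including step t

def transition(state: State, token: str) -> State:
    # Deterministic transition: appending a token yields a unique next state.
    return State(state.context, state.prefix + [token])

def similarity(generated: str, target: str) -> float:
    # Placeholder; a real implementation would call an automated metric such as BERTScore.
    return float(generated.strip() == target.strip())

def reward(state: State, token: str, target: str, eos: str = "<eos>") -> float:
    # Terminal reward only: similarity between the finished response and the target y.
    if token != eos:
        return 0.0
    return similarity(" ".join(state.prefix), target)
```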
Rewards for Dialogue Response Generation
The metric should capture both what the speaker is trying to communicate and the relevance to the conversation.
Definitions for the reward:
Collecting human-in-the-loop annotations
Automated metrics: BERTScore [7], BLEURT [5]
They use a terminal reward; the return is the cumulative reward over an episode, $\mathbb{E}_{\pi}\left[\sum_{t=0}^{T} \gamma^{t} r_{t}\right]$, and it is assumed to be undiscounted, i.e. 𝛾 = 1.
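A small sketch of how such undiscounted returns-to-go can be computed; with a terminal-only reward, every token in the response inherits the same return:

```python
def returns_from_terminal_reward(rewards: list[float], gamma: float = 1.0) -> list[float]:
    # Undiscounted (gamma = 1) return-to-go for each step. With a terminal-only reward,
    # every token in the sequence gets the same return: the reward received at the end.
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return list(reversed(returns))

# Example: a 5-token response with terminal reward 1.0 gives return 1.0 at every step.
print(returns_from_terminal_reward([0.0, 0.0, 0.0, 0.0, 1.0]))  # [1.0, 1.0, 1.0, 1.0, 1.0]
```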
Why Offline Reinforcement Learning?
For text generation, if we use online reinforcement learning, the agent must balance trying out new actions to learn about the environment (exploration) against relying on what it already knows (exploitation).
This can be particularly challenging in text generation, as the action space (i.e. the vocabulary size) is often large.
Another problem is that the reward landscape is sparse, so policies during training can get stuck in local minima where the reward is persistently zero.
Why Offline Reinforcement Learning? (Cont.)
Offline RL provides a learning paradigm that combines
supervised learning's ability to leverage existing data, and
the general utility-optimization power of online reinforcement learning methods.
They collect an offline dataset of state transitions $\mathcal{D} = \{(s_t^{i}, a_t^{i}, r_t^{i}, s_{t+1}^{i})\}_{i=1}^{N}$ using a behavior policy 𝜋𝛽.
The goal is to learn a policy 𝜋 that maximizes performance on the dataset while staying close to the behavior policy:
$$\max_{\pi} \; J_{\mathcal{D}}(\pi) - \alpha\, D(\pi, \pi_{\beta})$$
Approach
Fine Tune on Top Returns
The simplest approach is to fine-tune a model on “top”
demonstrations, i.e. teacher forcing on top returns (TF-Top).
The gradient update is simply the log-likelihood gradient on the data subset $\mathcal{D}_{\mathrm{top}}$:
$$\mathbb{E}_{s_t, a_t \sim \mathcal{D}_{\mathrm{top}}}\left[\nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\right], \qquad \mathcal{D}_{\mathrm{top}} = \{(s_t, a_t) \in \mathcal{D} \mid \hat{Q}(s_t, a_t) \geq 1 - \delta\}$$
Here 𝛿 can be computed by taking the top percentile of all returns Q̂(st, at); since only a terminal reward is given, the return for any token along the sequence is the same as the final reward received at the end of the sequence.
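A minimal sketch of the TF Top data selection, assuming the offline dataset is a list of (context, response, return) triples and a percentile cut-off plays the role of 1 − 𝛿:

```python
import numpy as np

def select_top(dataset, percentile=80):
    # dataset: list of (context, response, return) triples.
    # Keep only the highest-return responses, i.e. return >= the percentile threshold.
    returns = np.array([r for _, _, r in dataset])
    threshold = np.percentile(returns, percentile)   # plays the role of 1 - delta
    return [(x, y) for x, y, r in dataset if r >= threshold]

# The kept (x, y) pairs are then fine-tuned with ordinary teacher forcing
# (log-likelihood of y given x), exactly as in the TF baseline.
```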
Decision Transformers: Condition on Return
Decision Transformer (DT) aims to learn the return-conditional distribution of actions in each state, and then defines a policy by sampling from the distribution of actions that receive high returns.
Given a data point (st, at), they take its return Q̂(st, at), tokenize it, and then fine-tune a model conditioned on this return token.
The gradient update is simply the log-likelihood gradient:
$$\mathbb{E}_{s_t, a_t \sim \mathcal{D}}\left[\nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t, \hat{Q}(s_t, a_t))\right]$$
At test time, they condition the model on the highest return Q̂top.
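A minimal sketch of the return-conditioning idea, assuming returns are binned into control tokens such as <rtg_1>; the token names and binning are illustrative, not the paper's:

```python
def add_return_token(context: str, response: str, ret: float, n_bins: int = 2) -> tuple[str, str]:
    # Discretize the return and prepend it as a control token, so the model learns
    # p(response | context, return-bin). The binning here is an illustrative choice.
    bin_id = min(int(ret * n_bins), n_bins - 1)
    return f"<rtg_{bin_id}> {context}", response

# Training: teacher forcing on the (augmented context, response) pairs.
# Inference: prepend the highest-return token and decode as usual.
def test_time_prompt(context: str, n_bins: int = 2) -> str:
    return f"<rtg_{n_bins - 1}> {context}"
```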
Decision Transformers: Condition on Return (Cont.)
Figure 2: Decision Transformer architecture
Decision Transformers: Condition on Return (Cont.)
One advantage of Decision Transformer over fine-tuning on top returns is that the model is trained to explicitly learn a decision boundary between different returns.
However, both approaches have the theoretical drawback of requiring "trajectory coverage".
Trajectory coverage
The training dataset must contain trajectories starting from the initial state s0 that achieve high return. As a result, the number of data points needed increases exponentially with the length of the trajectory.
Off-Policy Q-Learning
Here, they use an offline variant of Q-learning, Implicit Q-Learning (ILQL) [2].
ILQL adds two extra heads to the pre-trained model, the action value
head Q𝜃 (st, at), which denotes the utility of a token at given a
sequence st, and the state value head V𝜓 (st), which denotes the value
of the sequence st.
The implicit policy is set as
$$\pi_{\theta}(a_t \mid s_t) = \pi_{\beta}(a_t \mid s_t)\, \exp\!\big(\eta\,(Q_{\theta}(s_t, a_t) - V_{\psi}(s_t))\big)$$
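A minimal decoding-time sketch of the implicit policy, assuming per-token logits from the base TF model and the two learned value heads (the shapes and helper name are assumptions, not the paper's code):

```python
import torch

def ilql_logits(tf_logits: torch.Tensor, q_values: torch.Tensor,
                v_value: torch.Tensor, eta: float = 1.0) -> torch.Tensor:
    # Reweight the behavior policy by exp(eta * (Q - V)); in log space this is just
    # adding eta * (Q - V) to the base model's logits before softmax / sampling.
    return tf_logits + eta * (q_values - v_value)

# tf_logits, q_values: [vocab_size]; v_value: scalar. Sampling or argmax over the
# perturbed logits realizes pi_theta(a_t | s_t) proportional to pi_beta * exp(eta (Q - V)).
```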
Off-Policy Q-Learning (Cont.)
The gradient update is set as
$$\mathbb{E}_{s_t, a_t, s_{t+1} \sim \mathcal{D}}\Big[\nabla_{\theta} Q_{\theta}(s_t, a_t)\,\underbrace{\big(r(s_t, a_t) + V_{\psi}(s_{t+1}) - Q_{\theta}(s_t, a_t)\big)}_{\text{Temporal Difference Error}}\Big] \;-\; \alpha\, \mathbb{E}_{s_t \sim \mathcal{D}}\Big[\nabla_{\theta}\, \mathrm{KL}\big(\pi_{\beta}(\cdot \mid s_t)\, \|\, \pi_{\theta}(\cdot \mid s_t)\big)\Big]$$
This paper improves upon the original ILQL by regularizing against the logits of the pre-trained TF policy 𝜋𝛽 instead of the demonstrated data D, which is better suited for settings where we may not have a lot of demonstrated data.
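A rough sketch of this training signal, assuming batched value estimates and logits; it uses a plain squared TD error in place of ILQL's expectile objectives, so it is an approximation of the update above, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def ilql_losses(q_values, v_next, rewards, student_logits, tf_logits, alpha=0.1):
    # q_values, v_next, rewards: [batch]; student_logits, tf_logits: [batch, vocab].
    # (1) Temporal-difference regression for the Q head (target is detached).
    td_target = rewards + v_next.detach()
    td_loss = F.mse_loss(q_values, td_target)
    # (2) Keep the learned policy close to the pre-trained TF policy pi_beta by
    #     penalizing KL(pi_beta || pi_theta), computed from the two sets of logits.
    kl = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  F.log_softmax(tf_logits, dim=-1),
                  log_target=True, reduction="batchmean")
    return td_loss + alpha * kl
```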
On-Policy RL: PPO
In this paper, they also compare against an online RL algorithm:
Proximal Policy Optimization [4].
The gradient update is
$$\mathbb{E}_{s_t, a_t \sim \pi_{\theta}}\left[\nabla_{\theta}\, \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}\, A(s_t, a_t)\right]$$
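For reference, a minimal sketch of the standard clipped PPO surrogate that implements this ratio-times-advantage update (generic, not the paper's training code):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    # Unclipped term: importance ratio times advantage, matching the gradient above.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    # Clipping is PPO's trust-region approximation that keeps updates conservative.
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```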
Comparison between Approaches
When are DT and Q-learning comparable?
Q-learning can stitch together parts of different suboptimal trajectories. For MDPs where such stitching is not possible, e.g. a tree, DT and ILQL are comparable in performance. They hypothesize that dialogue text generation belongs to this class of MDPs.
When are DT and TF Top comparable?
DT should be expected to do better than TF Top only when the data TF Top throws away provides valuable information.
If that information is already captured by the base TF model, then both DT and TF Top are likely to perform similarly.
Experiments
Experimental Setup
They evaluate offline RL methods using three task-oriented dialogue datasets.
MultiWOZ 2.2, a widely used dataset created to evaluate the performance of dialogue systems in multi-domain settings.
Action Based Conversations Dataset, which contains customer-agent conversations where the agent's goal is to solve a customer's problem.
TaskMaster-3, which contains conversations between users and a system about movie ticketing.
Baseline and Metrics
They choose a terminal binary reward, BERTCLICK, which is a thresholded BERTScore with a threshold value of 0.6.
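A hedged sketch of such a thresholded reward, assuming the open-source bert-score package; the exact scoring settings (model, baseline rescaling) are assumptions, not the paper's configuration:

```python
from bert_score import score

def bert_click(candidates: list[str], references: list[str], threshold: float = 0.6) -> list[float]:
    # Compute BERTScore F1 for each candidate/reference pair.
    _, _, f1 = score(candidates, references, lang="en", rescale_with_baseline=True)
    # Binary terminal reward: 1 if the similarity clears the threshold, else 0.
    return [1.0 if s >= threshold else 0.0 for s in f1.tolist()]
```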
They evaluate on a range of automated similarity metrics shown to have a high correlation with human judgments, such as BERTScore, BLEURT, METEOR, and BLEU.
Baselines: TF, TF All, TF Top, DT, ILQL, and PPO.
For base models they study GPT2-Medium² and DistilGPT2³, which have 355M and 82M parameters, respectively.
² https://huggingface.co/gpt2-medium
³ https://huggingface.co/distilgpt2
Training Process
1 They train the TF model on all the training data.
2 Then, they use this trained TF model to generate an offline RL dataset (a sketch of this step is given below).
3 Finally, they fine-tune different RL models on varying percentages of the generated offline RL data.
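A hypothetical sketch of step 2, where the trained TF model is sampled and each sample is scored with the terminal reward; the model and function names are illustrative, not the paper's API:

```python
def build_offline_dataset(tf_model, contexts, references, reward_fn, num_samples=5):
    # Sample several candidate responses per context from the trained TF model and
    # attach a terminal return computed against the reference response.
    dataset = []
    for x, y_ref in zip(contexts, references):
        for _ in range(num_samples):
            y_hat = tf_model.generate(x)                          # sample a candidate response
            dataset.append((x, y_hat, reward_fn(y_hat, y_ref)))   # (state, action seq, return)
    return dataset
```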
Results and Analysis
Table 1: Comparison across different methods on average metrics and dataset size with DistilGPT2. 20% and 80% refer to the percentage of the data used for fine-tuning the offline RL methods.
How does performance vary across multiple responses?
TF optimizes for recall, so with multiple responses, it should be able
to reach the performance of offline RL methods.
Figure 3: Average BERTCLICK over top-k responses
How do improvements look qualitatively to human
evaluators?
Figure 4: Human evaluation (similarity and relevance) of TF, TF Top, DT on
100 examples with 2 representative examples presented.
How does offline RL compare with PPO?
Table 2: Comparison of offline RL (DT) against online RL (PPO).
How does ILQL critic perform as a ranker?
Table 3: Comparison when ranking responses generated by the base TF
model.
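A hypothetical sketch of using the learned critic to rerank candidates sampled from the base TF model; the helper names are assumptions, not the paper's API:

```python
def rank_with_critic(critic_value_fn, context, candidates):
    # Score each full candidate response with the critic's value estimate and
    # return the candidates sorted from highest to lowest estimated return.
    scored = [(critic_value_fn(context, y), y) for y in candidates]
    return [y for _, y in sorted(scored, key=lambda t: t[0], reverse=True)]
```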
Can online data collection help DT?
They compare with Quark [3], which can be viewed as an online counterpart to DT. Performance depends on how good the coverage obtained by sampling from the base TF model is.
Figure 5: Average BERTCLICK for DT vs Quark
Discussion
In this paper, they examine the effectiveness of offline RL methods for generating dialogue text.
This paper found that
1 Offline RL models learn to produce text that is good enough, i.e. similar in meaning to human responses.
2 Decision Transformer is a practical choice.
3 Future directions include learning reward functions from human feedback and handling dialogues with multiple turns.
Limitations
This paper didn’t consider large language models, so it’s possible that
their findings do not generalize to large scale models with billions of
parameters.
References
[1] Lili Chen et al. "Decision Transformer: Reinforcement Learning via Sequence Modeling". In: Advances in Neural Information Processing Systems. Ed. by M. Ranzato et al. Vol. 34. Curran Associates, Inc., 2021, pp. 15084–15097. URL: https://proceedings.neurips.cc/paper_files/paper/2021/file/7f489f642a0ddb10272b5c31057f0663-Paper.pdf.
[2] Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline Reinforcement Learning with Implicit Q-Learning. 2021. arXiv: 2110.06169 [cs.LG].
[3] Ximing Lu et al. Quark: Controllable Text Generation with Reinforced Unlearning. 2022. arXiv: 2205.13636 [cs.CL].
[4] John Schulman et al. Proximal Policy Optimization Algorithms. 2017. arXiv: 1707.06347 [cs.LG].
[5] Thibault Sellam, Dipanjan Das, and Ankur P. Parikh. "BLEURT: Learning Robust Metrics for Text Generation". In: Proceedings of ACL. 2020.
[6] Ronald J. Williams and David Zipser. "A Learning Algorithm for Continually Running Fully Recurrent Neural Networks". In: Neural Computation 1.2 (1989), pp. 270–280. ISSN: 0899-7667. DOI: 10.1162/neco.1989.1.2.270. URL: https://doi.org/10.1162/neco.1989.1.2.270.
[7] Tianyi Zhang et al. BERTScore: Evaluating Text Generation with BERT. 2020. arXiv: 1904.09675 [cs.CL].