On the Effectiveness of Offline RL for Dialogue Response Generation
ICML, 2023
Paloma Sodhi, Felix Wu, Ethan R. Elenberg et al.
Speaker: Po-Chuan Chen
Dec 12, 2023
Table of contents
1 Abstract
2 Introduction
3 Problem Formulation
4 Approach
5 Experiments
6 Discussion
Abstract
Language models are commonly trained with teacher forcing (TF), which attempts to match human language exactly, even though identical meanings can be expressed in different ways.
Offline RL, by contrast, shows a clear performance improvement over teacher forcing while not inducing training instability or sacrificing practical training budgets.¹
¹ https://github.com/asappresearch/dialogue-offline-rl
Introduction
Historically, text generation models have typically been trained with
teacher forcing (TF) [6], which involves predicting the next token in a
sequence to exactly match the human utterance in a ground truth
dataset.
But this is a challenging objective, and addressing it by designing a loss that incorporates human-in-the-loop feedback can be expensive.
Contribution
In this paper, they present a comprehensive evaluation of offline RL methods for dialogue text generation and investigate best practices.
They implement three complementary approaches: fine-tuning on top returns (TF Top), Decision Transformers (DT) [1], and ILQL [2].
They also find that offline RL methods show a clear performance improvement over teacher forcing and achieve a trade-off where they generate text that is close enough in meaning to the human response.
Problem Formulation
Dialogue Response Generation as an MDP
They assume a supervised dataset of context-response pairs $\{(x_i, y_i)\}_{i=1}^{N}$, where the context x is the conversation history, and the response $y = \{y_1, \ldots, y_T\}$ is a target sequence of tokens.
Figure 1: Dialogue generation as a Markov Decision Process (MDP)
Dialogue Response Generation as an MDP (Cont.)
The goal is to learn a policy 𝜋 : st → at maximizing return.
States: st ∈ S is the context x together with the partially generated sequence of tokens up to and including time step t, ŷ≤t := {ŷ1, . . . , ŷt}.
Actions: at ∈ A is the set of next tokens ŷt+1 available from the vocabulary V.
Transition function: T(st+1 | st, at) is deterministic, since every state-action pair (ŷ≤t, ŷt+1) leads to a unique next state ŷ≤t+1.
Rewards: rt : S × A → [0, 1] computes the similarity between the generated response ŷ and the target response y.
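A minimal sketch of this token-level MDP, with a placeholder similarity function standing in for the reward metric (names are illustrative, not the paper's code):

```python
from dataclasses import dataclass

# Sketch of the token-level MDP described above.
# State = conversation context plus the partially generated response; action = next token.

@dataclass
class State:
    context: str          # conversation history x
    prefix: list[str]     # generated tokens y-hat up to and including step t

def transition(state: State, token: str) -> State:
    # Deterministic transition: appending a token yields a unique next state.
    return State(state.context, state.prefix + [token])

def similarity(generated: str, target: str) -> float:
    # Placeholder; a real implementation would call an automated metric such as BERTScore.
    return float(generated.strip() == target.strip())

def reward(state: State, token: str, target: str, eos: str = "<eos>") -> float:
    # Terminal reward only: similarity between the finished response and the target y.
    if token != eos:
        return 0.0
    return similarity(" ".join(state.prefix), target)
```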
Rewards for Dialogue Response Generation
The metric should capture both what the speaker is trying to communicate and the relevance to the conversation.
Definitions for the reward:
Collecting human-in-the-loop annotations
Automated metrics: BERTScore [7], BLEURT [5]
They use a terminal reward; the return is the cumulative reward over an episode, $\mathbb{E}_{\pi}\left[\sum_{t=0}^{T} \gamma^{t} r_{t}\right]$, and it is assumed to be undiscounted, i.e. 𝛾 = 1.
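A small sketch of how such undiscounted returns-to-go can be computed; with a terminal-only reward, every token in the response inherits the same return:

```python
def returns_from_terminal_reward(rewards: list[float], gamma: float = 1.0) -> list[float]:
    # Undiscounted (gamma = 1) return-to-go for each step. With a terminal-only reward,
    # every token in the sequence gets the same return: the reward received at the end.
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return list(reversed(returns))

# Example: a 5-token response with terminal reward 1.0 gives return 1.0 at every step.
print(returns_from_terminal_reward([0.0, 0.0, 0.0, 0.0, 1.0]))  # [1.0, 1.0, 1.0, 1.0, 1.0]
```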
Why Offline Reinforcement Learning?
For text generation, if we use online reinforcement learning, the agent must balance trying out new actions to learn about the environment (exploration) against relying on what it already knows (exploitation).
This can be particularly challenging in text generation, as the action space (i.e. the vocabulary size) is often large.
Another problem is that the reward landscape is sparse, so policies during training can get stuck in local minima where the reward is persistently zero.
Why Offline Reinforcement Learning? (Cont.)
Offline RL provides a learning paradigm that combines
supervised learning's ability to leverage existing data, and
the general utility-optimization power of online reinforcement learning methods.
They collect an offline dataset of state transitions $\mathcal{D} = \{(s_t^{i}, a_t^{i}, r_t^{i}, s_{t+1}^{i})\}_{i=1}^{N}$ using a behavior policy 𝜋𝛽.
The goal is to learn a policy 𝜋 that maximizes performance on the dataset while staying close to the behavior policy:
$$\max_{\pi} \; J_{\mathcal{D}}(\pi) - \alpha\, D(\pi, \pi_{\beta})$$
Approach
Fine Tune on Top Returns
The simplest approach is to fine-tune a model on “top”
demonstrations, i.e. teacher forcing on top returns (TF-Top).
The gradient update is simply the log-likelihood gradient on the data subset $\mathcal{D}_{\mathrm{top}}$:
$$\mathbb{E}_{s_t, a_t \sim \mathcal{D}_{\mathrm{top}}}\left[\nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\right], \qquad \mathcal{D}_{\mathrm{top}} = \{(s_t, a_t) \in \mathcal{D} \mid \hat{Q}(s_t, a_t) \geq 1 - \delta\}$$
Here 𝛿 can be computed by taking the top percentile of all returns Q̂(st, at); since only a terminal reward is given, the return for any token along the sequence is the same as the final reward received at the end of the sequence.
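A minimal sketch of the TF Top data selection, assuming the offline dataset is a list of (context, response, return) triples and a percentile cut-off plays the role of 1 − 𝛿:

```python
import numpy as np

def select_top(dataset, percentile=80):
    # dataset: list of (context, response, return) triples.
    # Keep only the highest-return responses, i.e. return >= the percentile threshold.
    returns = np.array([r for _, _, r in dataset])
    threshold = np.percentile(returns, percentile)   # plays the role of 1 - delta
    return [(x, y) for x, y, r in dataset if r >= threshold]

# The kept (x, y) pairs are then fine-tuned with ordinary teacher forcing
# (log-likelihood of y given x), exactly as in the TF baseline.
```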
Decision Transformers: Condition on Return
Decision Transformer (DT) aims to learn the return-conditional distribution of actions in each state, and then defines a policy by sampling from the distribution of actions that receive high returns.
Given a data point (st, at), they take its return Q̂(st, at), tokenize it, and then fine-tune a model conditioned on this return token.
The gradient update is simply the log-likelihood gradient:
$$\mathbb{E}_{s_t, a_t \sim \mathcal{D}}\left[\nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t, \hat{Q}(s_t, a_t))\right]$$
At test time, they condition the model on the highest return Q̂top.
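A minimal sketch of the return-conditioning idea, assuming returns are binned into control tokens such as <rtg_1>; the token names and binning are illustrative, not the paper's:

```python
def add_return_token(context: str, response: str, ret: float, n_bins: int = 2) -> tuple[str, str]:
    # Discretize the return and prepend it as a control token, so the model learns
    # p(response | context, return-bin). The binning here is an illustrative choice.
    bin_id = min(int(ret * n_bins), n_bins - 1)
    return f"<rtg_{bin_id}> {context}", response

# Training: teacher forcing on the (augmented context, response) pairs.
# Inference: prepend the highest-return token and decode as usual.
def test_time_prompt(context: str, n_bins: int = 2) -> str:
    return f"<rtg_{n_bins - 1}> {context}"
```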
Decision Transformers: Condition on Return (Cont.)
Figure 2: Decision Transformer architecture
Decision Transformers: Condition on Return (Cont.)
One advantage of Decision Transformer over fine-tuning on top returns is that the model is trained to explicitly learn a decision boundary between different returns.
However, both approaches have the theoretical drawback of requiring "trajectory coverage".
Trajectory coverage
The training dataset must contain trajectories starting from the initial state s0 that achieve high return. As a result, the number of data points needed increases exponentially with the length of the trajectory.
Off-Policy Q-Learning
Here, they use an offline variant of Q-learning, Implicit Q-Learning (ILQL) [2].
ILQL adds two extra heads to the pre-trained model, the action value
head Q𝜃 (st, at), which denotes the utility of a token at given a
sequence st, and the state value head V𝜓 (st), which denotes the value
of the sequence st.
The implicit policy is set as
$$\pi_{\theta}(a_t \mid s_t) = \pi_{\beta}(a_t \mid s_t)\, \exp\!\big(\eta\,(Q_{\theta}(s_t, a_t) - V_{\psi}(s_t))\big)$$
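A minimal decoding-time sketch of the implicit policy, assuming per-token logits from the base TF model and the two learned value heads (the shapes and helper name are assumptions, not the paper's code):

```python
import torch

def ilql_logits(tf_logits: torch.Tensor, q_values: torch.Tensor,
                v_value: torch.Tensor, eta: float = 1.0) -> torch.Tensor:
    # Reweight the behavior policy by exp(eta * (Q - V)); in log space this is just
    # adding eta * (Q - V) to the base model's logits before softmax / sampling.
    return tf_logits + eta * (q_values - v_value)

# tf_logits, q_values: [vocab_size]; v_value: scalar. Sampling or argmax over the
# perturbed logits realizes pi_theta(a_t | s_t) proportional to pi_beta * exp(eta (Q - V)).
```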
Off-Policy Q-Learning (Cont.)
The gradient update is set as
$$\mathbb{E}_{s_t, a_t, s_{t+1} \sim \mathcal{D}}\Big[\nabla_{\theta} Q_{\theta}(s_t, a_t)\,\underbrace{\big(r(s_t, a_t) + V_{\psi}(s_{t+1}) - Q_{\theta}(s_t, a_t)\big)}_{\text{Temporal Difference Error}}\Big] \;-\; \alpha\, \mathbb{E}_{s_t \sim \mathcal{D}}\Big[\nabla_{\theta}\, \mathrm{KL}\big(\pi_{\beta}(\cdot \mid s_t)\, \|\, \pi_{\theta}(\cdot \mid s_t)\big)\Big]$$
This paper improves upon the original ILQL by regularizing against the logits of the pre-trained TF policy 𝜋𝛽 instead of the demonstrated data D, which is better suited for settings where we may not have a lot of demonstrated data.
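A rough sketch of this training signal, assuming batched value estimates and logits; it uses a plain squared TD error in place of ILQL's expectile objectives, so it is an approximation of the update above, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def ilql_losses(q_values, v_next, rewards, student_logits, tf_logits, alpha=0.1):
    # q_values, v_next, rewards: [batch]; student_logits, tf_logits: [batch, vocab].
    # (1) Temporal-difference regression for the Q head (target is detached).
    td_target = rewards + v_next.detach()
    td_loss = F.mse_loss(q_values, td_target)
    # (2) Keep the learned policy close to the pre-trained TF policy pi_beta by
    #     penalizing KL(pi_beta || pi_theta), computed from the two sets of logits.
    kl = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  F.log_softmax(tf_logits, dim=-1),
                  log_target=True, reduction="batchmean")
    return td_loss + alpha * kl
```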
On-Policy RL: PPO
In this paper, they also compare against an online RL algorithm:
Proximal Policy Optimization [4].
The gradient update is
$$\mathbb{E}_{s_t, a_t \sim \pi_{\theta}}\left[\nabla_{\theta}\, \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}\, A(s_t, a_t)\right]$$
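For reference, a minimal sketch of the standard clipped PPO surrogate that implements this ratio-times-advantage update (generic, not the paper's training code):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    # Unclipped term: importance ratio times advantage, matching the gradient above.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    # Clipping is PPO's trust-region approximation that keeps updates conservative.
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```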
Comparison between Approaches
When are DT and Q-learning comparable?
Q-learning can stitch together parts of different suboptimal trajectories. For MDPs where such stitching is not possible, e.g. a tree, DT and ILQL are comparable in performance. They hypothesize that dialogue text generation belongs to this class of MDPs.
When are DT and TF Top comparable?
DT should be expected to do better than TF Top only when the data TF Top throws away provides valuable information.
If that information is already captured by the base TF model, then both DT and TF Top are likely to perform similarly.
Experiments
Experimental Setup
They evaluate offline RL methods using three task-oriented dialogue datasets.
MultiWOZ 2.2, a widely used dataset created to evaluate the performance of dialogue systems in multi-domain settings.
Action Based Conversations Dataset, which contains customer-agent conversations where the agent's goal is to solve a customer's problem.
TaskMaster-3, which contains conversations between users and a system about movie ticketing.
Baseline and Metrics
They choose a terminal binary reward, BERTCLICK, which is a thresholded BERTScore with a threshold value of 0.6.
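A hedged sketch of such a thresholded reward, assuming the open-source bert-score package; the exact scoring settings (model, baseline rescaling) are assumptions, not the paper's configuration:

```python
from bert_score import score

def bert_click(candidates: list[str], references: list[str], threshold: float = 0.6) -> list[float]:
    # Compute BERTScore F1 for each candidate/reference pair.
    _, _, f1 = score(candidates, references, lang="en", rescale_with_baseline=True)
    # Binary terminal reward: 1 if the similarity clears the threshold, else 0.
    return [1.0 if s >= threshold else 0.0 for s in f1.tolist()]
```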
They evaluate on a range of automated similarity metrics shown to have a high correlation with human judgments, such as BERTScore, BLEURT, METEOR, and BLEU.
Baselines: TF, TF All, TF Top, DT, ILQL, and PPO.
For base models they study GPT2-Medium² and DistilGPT2³, which have 355M and 82M parameters, respectively.
² https://huggingface.co/gpt2-medium
³ https://huggingface.co/distilgpt2
Training Process
1 They train the TF model on all the training data.
2 Then, they use this trained TF model to generate an offline RL dataset (a sketch of this step is given below).
3 Finally, they fine-tune different RL models on varying percentages of the generated offline RL data.
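A hypothetical sketch of step 2, where the trained TF model is sampled and each sample is scored with the terminal reward; the model and function names are illustrative, not the paper's API:

```python
def build_offline_dataset(tf_model, contexts, references, reward_fn, num_samples=5):
    # Sample several candidate responses per context from the trained TF model and
    # attach a terminal return computed against the reference response.
    dataset = []
    for x, y_ref in zip(contexts, references):
        for _ in range(num_samples):
            y_hat = tf_model.generate(x)                          # sample a candidate response
            dataset.append((x, y_hat, reward_fn(y_hat, y_ref)))   # (state, action seq, return)
    return dataset
```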
Results and Analysis
Table 1: Comparison across different methods on average metrics and dataset size with DistilGPT2. 20% and 80% refer to the percentage of the data used for fine-tuning the offline RL methods.
How does performance vary across multiple responses?
TF optimizes for recall, so with multiple responses, it should be able
to reach the performance of offline RL methods.
Figure 3: Average BERTCLICK over top-k responses
How do improvements look qualitatively to human
evaluators?
Figure 4: Human evaluation (similarity and relevance) of TF, TF Top, DT on
100 examples with 2 representative examples presented.
How does offline RL compare with PPO?
Table 2: Comparison of offline RL (DT) against online RL (PPO).
How does ILQL critic perform as a ranker?
Table 3: Comparison when ranking responses generated by the base TF
model.
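A hypothetical sketch of using the learned critic to rerank candidates sampled from the base TF model; the helper names are assumptions, not the paper's API:

```python
def rank_with_critic(critic_value_fn, context, candidates):
    # Score each full candidate response with the critic's value estimate and
    # return the candidates sorted from highest to lowest estimated return.
    scored = [(critic_value_fn(context, y), y) for y in candidates]
    return [y for _, y in sorted(scored, key=lambda t: t[0], reverse=True)]
```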
Can online data collection help DT?
They compare with Quark [3], which can be viewed as an online counterpart to DT. Performance depends on how good the coverage obtained by sampling from the base TF model is.
Figure 5: Average BERTCLICK for DT vs Quark
Discussion
In this paper, they examine the effectiveness of offline RL methods for generating dialogue text.
This paper found that
1 Offline RL models learn to produce text that is good enough, i.e. similar in meaning to human responses.
2 Decision Transformer is a practical choice.
3 Future directions include learning reward functions from human feedback and handling dialogues with multiple turns.
Limitations
This paper didn’t consider large language models, so it’s possible that
their findings do not generalize to large scale models with billions of
parameters.
References
[1] Lili Chen et al. "Decision Transformer: Reinforcement Learning via Sequence Modeling". In: Advances in Neural Information Processing Systems. Ed. by M. Ranzato et al. Vol. 34. Curran Associates, Inc., 2021, pp. 15084–15097. URL: https://proceedings.neurips.cc/paper_files/paper/2021/file/7f489f642a0ddb10272b5c31057f0663-Paper.pdf.
[2] Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline Reinforcement Learning with Implicit Q-Learning. 2021. arXiv: 2110.06169 [cs.LG].
[3] Ximing Lu et al. Quark: Controllable Text Generation with Reinforced Unlearning. 2022. arXiv: 2205.13636 [cs.CL].
[4] John Schulman et al. Proximal Policy Optimization Algorithms. 2017. arXiv: 1707.06347 [cs.LG].
[5] Thibault Sellam, Dipanjan Das, and Ankur P. Parikh. "BLEURT: Learning Robust Metrics for Text Generation". In: Proceedings of ACL. 2020.
[6] Ronald J. Williams and David Zipser. "A Learning Algorithm for Continually Running Fully Recurrent Neural Networks". In: Neural Computation 1.2 (1989), pp. 270–280. ISSN: 0899-7667. DOI: 10.1162/neco.1989.1.2.270. URL: https://doi.org/10.1162/neco.1989.1.2.270.
[7] Tianyi Zhang et al. BERTScore: Evaluating Text Generation with BERT. 2020. arXiv: 1904.09675 [cs.CL].