RL4LMs
Is Reinforcement Learning (Not) for Natural
Language Processing
Rajkumar Ramamurthy, Prithviraj Ammanabrolu et al.
National Yang Ming Chiao Tung University, Hsinchu
Speaker: Po-Chuan Chen
April 18, 2023
1 / 41
RL4LMs
Table of contents
1 Abstract
2 Introduction
3 RL4LMs: A Library for Training LMs with RL
4 NLPO: Natural Language Policy Optimization
5 GRUE (General Reinforced-Language Understanding Eval)
6 Conclusion
2 / 41
RL4LMs
Abstract
Table of contents
1 Abstract
2 Introduction
3 RL4LMs: A Library for Training LMs with RL
4 NLPO: Natural Language Policy Optimization
5 GRUE (General Reinforced-Language Understanding Eval)
6 Conclusion
3 / 41
RL4LMs
Abstract
Abstract
Text generation can be viewed as a sequential decision-making problem, so
reinforcement learning (RL) appears to be a natural conceptual framework.
But there are several challenges:
Training instability due to the combinatorial action space
Lack of open-source libraries and benchmarks customized for
LM alignment
4 / 41
RL4LMs
Abstract
Contribution
In this paper, they first introduce an open-source modular library,
RL4LMs, for optimizing language generators with RL.
Also, they present the GRUE (General Reinforced-language
Understanding Evaluation) benchmark, which uses reward functions
that capture automated measures of human preference.
After that, they introduce an easy-to-use, performant RL algorithm,
NLPO (Natural Language Policy Optimization), which learns to
effectively reduce the combinatorial action space in language
generation.
5 / 41
RL4LMs
Introduction
Table of contents
1 Abstract
2 Introduction
3 RL4LMs: A Library for Training LMs with RL
4 NLPO: Natural Language Policy Optimization
5 GRUE (General Reinforced-Language Understanding Eval)
6 Conclusion
6 / 41
RL4LMs
Introduction
Introduction
The ultimate aim of language technology is to interact with humans.
One option to incorporate user feedback is via human-in-the-loop,
but this degree of dense supervision is often prohibitive and inefficient.
Automated metrics (BERTScore, BLEURT) offer a promising
compromise: they have significantly improved correlation with human
judgment compared to earlier metrics (BLEU, METEOR, etc.) and are
cheap to evaluate. However, these functions are usually not per-token
differentiable.
7 / 41
RL4LMs
Introduction
Introduction
Reinforcement Learning (RL) offers a natural path forward for
optimizing non-differentiable, scalar objectives for LM-based
generation when it is cast as a sequential decision-making problem.
However, it is easy to find nonsense samples that achieve high quality
estimates under such metrics.
Recent works have shown promising results in aligning LMs to human
preferences via RL by constraining preference-based rewards to
incorporate notions of fluency, but progress in this line of work is
heavily hindered by a lack of open-source benchmarks and
algorithmic implementations.
8 / 41
RL4LMs
Introduction
Introduction
To facilitate research in building RL algorithms to better align LMs,
they introduce a library, a benchmark, and an algorithm.
RL4LMs library, which enables generative HuggingFace models
(e.g., GPT-2 or T5) to be trained using a variety of existing RL
methods like PPO/A2C/etc.
GRUE benchmark: GRUE is a collection of 7 contemporary NLP
tasks; in contrast to other benchmarks, each task is paired with reward
function(s) rather than supervised training targets.
A novel on-policy RL algorithm called NLPO, that dynamically
learns task-specific constraints over the distribution of language
at a token level.
9 / 41
RL4LMs
Introduction
[Figure-only slide]
10 / 41
RL4LMs
Introduction
Related Work
Imitation learning for NLP
RL for Large Action Spaces
RL for NLP
11 / 41
RL4LMs
RL4LMs: A Library for Training LMs with RL
Table of contents
1 Abstract
2 Introduction
3 RL4LMs: A Library for Training LMs with RL
4 NLPO: Natural Language Policy Optimization
5 GRUE (General Reinforced-Language Understanding Eval)
6 Conclusion
12 / 41
RL4LMs
RL4LMs: A Library for Training LMs with RL
RL4LMs: A Library for Training LMs with RL
RL4LMs, an open-source library with building blocks for fine-tuning
and evaluating RL algorithms on LM-based generation.
The library is modular, which enables users to plug in customized
environments, reward functions, metrics, and algorithms. In the initial
release, they provide support for 6 different NLP tasks, 16 evaluation
metrics and rewards, and 4 RL algorithms.
13 / 41
RL4LMs
RL4LMs: A Library for Training LMs with RL
RL4LMs: A Library for Training LMs with RL
1 Environments: Generation as a token-level MDP
2 Reward functions and evaluation metrics
3 On-policy actor-critic algorithms
14 / 41
RL4LMs
RL4LMs: A Library for Training LMs with RL
Environments: Generation as a token-level MDP
Each environment is an NLP task: they are given a supervised dataset
D = {(x_i, y_i)}_{i=1}^{N} of N examples, where x ∈ X is a language input
and y ∈ Y is the target string.
Generation can be viewed as a Markov Decision Process (MDP)
⟨S, A, R, P, 𝛾, T⟩ using a finite vocabulary V.
Each episode in the MDP begins by sampling a datapoint (x, y) from
their dataset and ends when the current time step t exceeds the horizon
T or an end of sentence (EOS) token is generated.
15 / 41
RL4LMs
RL4LMs: A Library for Training LMs with RL
Environments: Generation as a token-level MDP
The input x = (x_0, . . . , x_m) is a task-specific prompt that is used as
the initial state s_0 = (x_0, . . . , x_m), where s_0 ∈ S.
For the state, action, and transition function:
S: the state space, whose states are sequences of tokens from V (so x_m ∈ V)
An action a_t ∈ A is a token from the vocabulary V
The transition function P : S × A → Δ(S) appends the chosen token to the state
At the end of each episode, a reward R : S × A × Y → ℝ that depends on (s_T, y) is emitted
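To make the token-level MDP concrete, here is a minimal sketch of how one episode could be stepped; it is illustrative only (not the RL4LMs environment API), and `reward_fn`, `eos_token_id`, and the horizon value are hypothetical placeholders.

```python
# Minimal sketch of generation as a token-level MDP (illustrative only,
# not the RL4LMs environment API).
from dataclasses import dataclass, field
from typing import List

@dataclass
class GenerationEpisode:
    prompt_tokens: List[int]          # x = (x_0, ..., x_m), the initial state s_0
    target: str                       # reference y, used only by the reward
    horizon: int = 64                 # T, maximum number of generated tokens
    eos_token_id: int = 2             # hypothetical EOS id
    generated: List[int] = field(default_factory=list)

    @property
    def state(self) -> List[int]:
        # s_t is the prompt followed by all tokens generated so far
        return self.prompt_tokens + self.generated

    def step(self, action: int, reward_fn) -> tuple:
        """Append one vocabulary token (the action) to the state."""
        self.generated.append(action)
        done = action == self.eos_token_id or len(self.generated) >= self.horizon
        # Rewards are sequence-level and sparse: non-zero only at episode end
        reward = reward_fn(self.state, self.target) if done else 0.0
        return self.state, reward, done
```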
16 / 41
RL4LMs
RL4LMs: A Library for Training LMs with RL
Reward functions and evaluation metrics
They provide interfaces to
1 n-gram overlap metrics
2 model-based semantic metrics
3 task-specific metrics
4 diversity/fluency/naturalness metrics
5 task-specific, model-based human preference metrics
17 / 41
RL4LMs
RL4LMs: A Library for Training LMs with RL
On-policy actor-critic algorithms
RL4LMs supports fine-tuning and training LMs from scratch via
on-policy actor-critic algorithms on language environments.
Their benchmark experiments focus on fine-tuning a pre-trained LM,
denoted 𝜋0, from which they initialize their agent’s policy 𝜋𝜃 = 𝜋0.
The value network V𝜙 used to estimate the value function is also
initialized from 𝜋0 except for the final layer which is randomly
initialized to output a single scalar value.
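A rough sketch of this actor-critic setup, assuming a HuggingFace causal LM; the wiring of the value head here is an illustration, not the exact RL4LMs implementation.

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM

class PolicyWithValueHead(nn.Module):
    """Illustrative actor-critic wrapper: pi_theta is initialized from pi_0,
    and V_phi reuses the same backbone plus a randomly initialized scalar head."""

    def __init__(self, model_name: str = "gpt2"):
        super().__init__()
        self.lm = AutoModelForCausalLM.from_pretrained(model_name)  # pi_theta = pi_0
        hidden_size = self.lm.config.hidden_size
        self.value_head = nn.Linear(hidden_size, 1)  # only this layer is random

    def forward(self, input_ids, attention_mask=None):
        out = self.lm(input_ids,
                      attention_mask=attention_mask,
                      output_hidden_states=True)
        logits = out.logits                      # token distribution for pi_theta
        last_hidden = out.hidden_states[-1]      # features shared with the critic
        values = self.value_head(last_hidden).squeeze(-1)  # V_phi(s_t) per position
        return logits, values
```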
18 / 41
RL4LMs
RL4LMs: A Library for Training LMs with RL
On-policy actor-critic algorithms
Value function:
V^𝜋_t (s_t) = E_{a_t ∼ 𝜋} [ ∑_{𝜏=t}^{T} 𝛾^{𝜏−t} R(s_𝜏, a_𝜏, y) ]
Q-value function:
Q^𝜋_t (s_t, a_t) = R(s_t, a_t, y) + 𝛾 E_{s_{t+1} ∼ P} [ V^𝜋_{t+1}(s_{t+1}) ]
Advantage function:
A^𝜋_t (s_t, a_t) = Q^𝜋_t (s_t, a_t) − V^𝜋_t (s_t)
To increase training stability, the advantage is approximated using
Generalized Advantage Estimation.
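As a small illustration of these quantities, the sketch below computes discounted returns and simple Monte Carlo advantages from one episode's rewards and value estimates; the example numbers are made up.

```python
from typing import List

def discounted_returns(rewards: List[float], gamma: float = 0.99) -> List[float]:
    """Return G_t = sum_{tau >= t} gamma^(tau - t) * r_tau for each step t."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))

def advantages(rewards: List[float], values: List[float], gamma: float = 0.99) -> List[float]:
    """A_t ~= Q_t - V_t, with Q_t estimated by the Monte Carlo return G_t."""
    returns = discounted_returns(rewards, gamma)
    return [g - v for g, v in zip(returns, values)]

# Example: a sparse sequence-level reward arriving only at the final token
rewards = [0.0, 0.0, 0.0, 1.0]
values = [0.4, 0.5, 0.6, 0.7]
print(advantages(rewards, values))
```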
19 / 41
RL4LMs
RL4LMs: A Library for Training LMs with RL
On-policy actor-critic algorithms
Given an input-output pair (x, y) and the agent’s generated predictions, the
environment rewards are sequence-level and sparse. Therefore, they regularize
the reward function with a token-level KL penalty for all on-policy
algorithms, to prevent the model from deviating too far from the initialized
LM:
R̂ (s_t, a_t, y) = R (s_t, a_t, y) − 𝛽 KL (𝜋𝜃 (a_t | s_t) ∥ 𝜋0 (a_t | s_t))
where the per-token KL is estimated as
KL(𝜋𝜃 (a_t | s_t) ∥ 𝜋0(a_t | s_t)) ≈ log 𝜋𝜃 (a_t | s_t) − log 𝜋0(a_t | s_t).
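A minimal sketch of the token-level KL shaping above, assuming per-token log-probabilities from the current policy and the frozen initial LM are available; this is not the exact RL4LMs code.

```python
def kl_shaped_reward(task_reward: float,
                     logprob_policy: float,
                     logprob_init: float,
                     beta: float = 0.1,
                     is_last_token: bool = False) -> float:
    """R_hat(s_t, a_t, y) = R(s_t, a_t, y) - beta * KL(pi_theta || pi_0),
    with the per-token KL estimated as log pi_theta(a_t|s_t) - log pi_0(a_t|s_t)."""
    kl_estimate = logprob_policy - logprob_init
    # The task reward is sequence-level and sparse: only added at the last token
    r = task_reward if is_last_token else 0.0
    return r - beta * kl_estimate
```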
20 / 41
RL4LMs
NLPO: Natural Language Policy Optimization
Table of contents
1 Abstract
2 Introduction
3 RL4LMs: A Library for Training LMs with RL
4 NLPO: Natural Language Policy Optimization
5 GRUE (General Reinforced-Language Understanding Eval)
6 Conclusion
21 / 41
RL4LMs
NLPO: Natural Language Policy Optimization
NLPO: Natural Language Policy Optimization
Language generation action spaces are orders of magnitude larger than
what most discrete action space RL algorithms are designed for.
They hypothesize that the size of the action space is a core cause of
instability when training LMs with existing RL methods.
NLPO, a parameterized-masked extension of PPO, learns to mask out
less relevant tokens in-context as it trains. It accomplishes this via
top-p sampling, which restricts tokens to the smallest possible set
whose cumulative probability is greater than p.
22 / 41
RL4LMs
NLPO: Natural Language Policy Optimization
NLPO: Natural Language Policy Optimization
NLPO maintains a masking policy 𝜋𝜓: the masking policy is a copy
of the current policy (𝜋𝜃) updated every 𝜇 steps.
A parameterized-invalid-mask is created from 𝜋𝜓 by first selecting the
top-p tokens from the vocabulary, and then applying an invalid-mask
to the remaining tokens.
This provides an additional constraint that balances the benefit of
containing more task-relevant information than the KL penalty
derived from 𝜋0 against the risk of reward hacking.
23 / 41
RL4LMs
NLPO: Natural Language Policy Optimization
[Figure-only slide]
24 / 41
RL4LMs
NLPO: Natural Language Policy Optimization
NLPO Details
NLPO learns to mask irrelevant language by maintaining a masking
policy 𝜋𝜓 : the masking policy is a copy of the current policy (𝜋𝜃),
but is updated only every 𝜇 steps.
Given Z(𝜋𝜃) = ∑_{a ∈ V} 𝜋𝜃 (a | s), the normalization value given by the
sum of the probabilities of all actions a ∈ A for a particular state s ∈ S,
let the parameterized top-p vocabulary V^p_{𝜋𝜃} ⊂ V be the subset of the
vocabulary consisting of the top-p highest-probability vocabulary tokens
with respect to 𝜋𝜃.
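A tiny sketch of the periodic masking-policy update, assuming both policies are PyTorch modules; `mu` and the function name are my own.

```python
import torch.nn as nn

def maybe_update_masking_policy(pi_theta: nn.Module,
                                pi_psi: nn.Module,
                                step: int,
                                mu: int = 10) -> None:
    """Every mu steps, copy the current policy's weights into the masking policy."""
    if step % mu == 0:
        pi_psi.load_state_dict(pi_theta.state_dict())
```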
25 / 41
RL4LMs
NLPO: Natural Language Policy Optimization
NLPO Details
Formally, let Z_p(𝜋𝜃) be the normalization value for the parameterized
top-p vocabulary, i.e. the sum over the subset of tokens that maximizes
Z_p(𝜋𝜃) = ∑_{a ∈ V^p_{𝜋𝜃}} 𝜋𝜃 (a | s).
Then the masking policy restricted to the parameterized top-p
vocabulary is defined as:
𝜋𝜓 (a | s, 𝜋𝜃) = 𝜋𝜃 (a | s) / Z_p(𝜋𝜃)   if a ∈ V^p_{𝜋𝜃}
𝜋𝜓 (a | s, 𝜋𝜃) = 0 otherwise
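A sketch of the parameterized top-p masking above (illustrative, not the RL4LMs implementation): select the top-p set under the masking distribution, zero out the remaining tokens of 𝜋𝜃, and renormalize by Z_p.

```python
import torch

def top_p_masked_policy(policy_probs: torch.Tensor,
                        mask_probs: torch.Tensor,
                        p: float = 0.9) -> torch.Tensor:
    """Restrict policy_probs (pi_theta) to the top-p vocabulary chosen by
    mask_probs (the masking policy pi_psi), then renormalize by Z_p.
    Both inputs are 1-D probability vectors over the vocabulary V."""
    sorted_probs, sorted_idx = torch.sort(mask_probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Smallest prefix of tokens whose cumulative probability reaches p
    keep = cumulative - sorted_probs < p
    allowed = torch.zeros_like(mask_probs, dtype=torch.bool)
    allowed[sorted_idx[keep]] = True

    masked = torch.where(allowed, policy_probs, torch.zeros_like(policy_probs))
    z_p = masked.sum()                      # Z_p(pi_theta)
    return masked / z_p                     # pi_psi(. | s, pi_theta)

# Example: a toy 5-token vocabulary
probs = torch.tensor([0.40, 0.30, 0.15, 0.10, 0.05])
print(top_p_masked_policy(probs, probs, p=0.8))
```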
26 / 41
RL4LMs
GRUE (General Reinforced-Language Understanding Eval)
Table of contents
1 Abstract
2 Introduction
3 RL4LMs: A Library for Training LMs with RL
4 NLPO: Natural Language Policy Optimization
5 GRUE (General Reinforced-Language Understanding Eval)
6 Conclusion
27 / 41
RL4LMs
GRUE (General Reinforced-Language Understanding Eval)
GRUE (General Reinforced-Language Understanding Eval)
GRUE is a collection of 7 generative NLP tasks. The metrics span
two categories.
Task preference metrics capture how well the models produce
generations that satisfy the desiderata of the specific generation task.
Naturalness metrics capture fluency, readability, etc., and provide
perspective on factors beyond semantics.
28 / 41
RL4LMs
GRUE (General Reinforced-Language Understanding Eval)
[Figure-only slide]
29 / 41
RL4LMs
GRUE (General Reinforced-Language Understanding Eval)
GRUE (General Reinforced-Language Understanding Eval)
1 Results on GRUE: Which algorithm should be used to learn preferences?
2 Preference reward learning, selection, and hacking
3 Data budget: Improve the reward or gather more demonstrations?
4 Practical considerations: Which implementation details matter most?
30 / 41
RL4LMs
GRUE (General Reinforced-Language Understanding Eval)
[Figure-only slide]
31 / 41
RL4LMs
GRUE (General Reinforced-Language Understanding Eval)
Using RL and supervised learning in conjunction works best:
NLPO+supervised and PPO+supervised almost always outperform
NLPO/PPO (or supervised learning alone) across both task metrics and
naturalness metrics.
32 / 41
RL4LMs
GRUE (General Reinforced-Language Understanding Eval)
Preference reward learning, selection, and hacking
KL constraints can prevent models from reward hacking, but when the
initial policy has low performance on the task, the KL penalty
pushes the policy towards nonsense.
NLPO’s improved performance and stability stem from the masking
policy providing an additional constraint on the current policy,
and p is an important parameter to consider during RL training.
33 / 41
RL4LMs
GRUE (General Reinforced-Language Understanding Eval)
Human Preference Reward Learning
To this point, their experiments have largely focused on optimizing
evaluation metrics that correlate with human judgments, e.g.,
METEOR.
34 / 41
RL4LMs
GRUE (General Reinforced-Language Understanding Eval)
Human Preference Reward Learning
Comparing the METEOR-only to the preference model, the
generations produced by the human feedback model are preferred in
682 cases, compared to the METEOR-only model which is preferred
in 587 cases.
This implies that this pipeline of collecting preferences, training a
reward, and further tuning the policy improves alignment to
human preferences.
35 / 41
RL4LMs
GRUE (General Reinforced-Language Understanding Eval)
Improve the reward or gather more demonstrations?
In the IMDB task, a model is given a partial movie review as a
prompt, and is asked to continue it as positively as possible.
The trade-off is between gathering more:
1 Sentiment labels (improving the reward)
2 Positive sentiment reviews (improving supervised training)
They find that a learned reward function, used as a signal for an RL
method, enables greater performance than a supervised method
trained with 5 times more data.
36 / 41
RL4LMs
GRUE (General Reinforced-Language Understanding Eval)
Which implementation details matter most?
In recent works, the environment is simulated within the RL formulation by
setting the discount factor 𝛾 = 1 (a bandit-style setup); they find this
causes instability in training with respect to naturalness in both PPO and
NLPO for IMDB.
This indicates that discounting rewards with 𝛾 < 1 via a
token-level MDP formulation is at least sometimes more effective for
language generation.
Dropout and sampling settings are also critical for the stability of RL
training.
37 / 41
RL4LMs
Conclusion
Table of contents
1 Abstract
2 Introduction
3 RL4LMs: A Library for Training LMs with RL
4 NLPO: Natural Language Policy Optimization
5 GRUE (General Reinforced-Language Understanding Eval)
6 Conclusion
38 / 41
RL4LMs
Conclusion
Conclusion
This paper provides the GRUE benchmark and the RL4LMs library, which
can push progress in aligning language models to human preferences
via RL methods by providing the community with a standard means of
comparing methods.
39 / 41
RL4LMs
Conclusion
PPO Details
Given the discussion and equations in Section 3.3, they further note that
they follow Ziegler et al. (2019) and dynamically adapt the KL
coefficient 𝛽 during training, where
e_t = clip( (KL(𝜋 (a_t | s_t) ∥ 𝜋0 (a_t | s_t)) − KL_target) / KL_target , −0.2, 0.2 )
𝛽_{t+1} = 𝛽_t (1 + K_𝛽 e_t)
where KL_target is the user-specified target KL divergence between the initial
model 𝜋0 and the current policy 𝜋, and K_𝛽 is the rate of update, which they
generally set to 0.2 in their experiments.
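A compact sketch of this adaptive KL controller, following the update rule above; the function and variable names are my own.

```python
def update_kl_coefficient(beta: float,
                          observed_kl: float,
                          kl_target: float,
                          k_beta: float = 0.2) -> float:
    """Adapt beta as in Ziegler et al. (2019):
    e_t = clip((KL - KL_target) / KL_target, -0.2, 0.2); beta_{t+1} = beta_t * (1 + k_beta * e_t)."""
    error = (observed_kl - kl_target) / kl_target
    error = max(-0.2, min(0.2, error))          # clip to [-0.2, 0.2]
    return beta * (1.0 + k_beta * error)

# Example: observed KL above target -> beta increases, strengthening the penalty
print(update_kl_coefficient(beta=0.1, observed_kl=0.08, kl_target=0.05))
```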
40 / 41
RL4LMs
Conclusion
PPO Details
To increase stability during training, they further use Generalized
Advantage Estimation (GAE) and define the advantage estimator
Â(s_n, a_n) based on the temporal-difference residual as:
𝛿_t = r(s_t, a_t) + 𝛾 V𝜙(s_{t+1}) − V𝜙(s_t)
Â(s_n, a_n) = ∑_{t=0}^{∞} (𝛾𝜆)^t 𝛿_{n+t}
where 𝜆 provides the trade-off between bias and variance.
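A short sketch of GAE computed from the TD residuals above, treating the final state as terminal (V(s_{T+1}) = 0, an assumption made for the illustration).

```python
from typing import List

def gae_advantages(rewards: List[float],
                   values: List[float],
                   gamma: float = 0.99,
                   lam: float = 0.95) -> List[float]:
    """Compute A_hat_t = sum_l (gamma * lam)^l * delta_{t+l}, with
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) and V(s_{T+1}) = 0."""
    advantages, running = [], 0.0
    next_value = 0.0
    for r, v in zip(reversed(rewards), reversed(values)):
        delta = r + gamma * next_value - v
        running = delta + gamma * lam * running
        advantages.append(running)
        next_value = v
    return list(reversed(advantages))

# Example with a sparse terminal reward
print(gae_advantages([0.0, 0.0, 1.0], [0.2, 0.4, 0.6]))
```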
41 / 41
