RL4LMs
Is Reinforcement Learning (Not) for Natural
Language Processing
Rajkumar Ramamurthy, Prithviraj Ammanabrolu et al.
National Yang Ming Chiao Tung University, Hsinchu
Speaker: Po-Chuan Chen
April 18, 2023
1 / 41
RL4LMs
Table of contents
1 Abstract
2 Introduction
3 RL4LMs: A Library for Training LMs with RL
4 NLPO: Natural Language Policy Optimization
5 GRUE (General Reinforced-Language Understanding Eval)
6 Conclusion
2 / 41
RL4LMs
Abstract
Table of contents
1 Abstract
2 Introduction
3 RL4LMs: A Library for Training LMs with RL
4 NLPO: Natural Language Policy Optimization
5 GRUE (General Reinforced-Language Understanding Eval)
6 Conclusion
3 / 41
RL4LMs
Abstract
Abstract
Text generation can be viewed as a sequential decision-making problem, so
reinforcement learning (RL) appears to be a natural conceptual framework.
But there are several challenges:
Training instability due to the combinatorial action space
Lack of open-source libraries and benchmarks customized for
LM alignment
4 / 41
RL4LMs
Abstract
Contribution
In this paper, they first introduce an open-source modular library,
RL4LMs, for optimizing language generators with RL.
Also, they present the GRUE (General Reinforced-language
Understanding Evaluation) benchmark, which uses reward functions
that capture automated measures of human preference.
After that, they introduce an easy-to-use, performant RL algorithm,
NLPO (Natural Language Policy Optimization), which learns to
effectively reduce the combinatorial action space in language
generation.
5 / 41
RL4LMs
Introduction
Table of contents
1 Abstract
2 Introduction
3 RL4LMs: A Library for Training LMs with RL
4 NLPO: Natural Language Policy Optimization
5 GRUE (General Reinforced-Language Understanding Eval)
6 Conclusion
6 / 41
RL4LMs
Introduction
Introduction
The ultimate aim of language technology is to interact with humans.
One option to incorporate user feedback is via human-in-the-loop,
but this degree of dense supervision is often prohibitive and inefficient.
Automated metrics (BERTScore, BLEURT) offer a promising
compromise: they have significantly improved correlation with human
judgment compared to earlier metrics (BLEU, METEOR, etc.) and are
cheap to evaluate. However, these functions are usually not per-token
differentiable.
7 / 41
RL4LMs
Introduction
Introduction
Reinforcement Learning (RL) offers a natural path forward for
optimizing non-differentiable, scalar objectives for LM-based
generation when it is cast as a sequential decision-making problem.
However, it is easy to find nonsense samples that achieve high quality
estimates under such metrics.
Recent works have shown promising results in aligning LMs to human
preferences via RL by constraining preference-based rewards to
incorporate notions of fluency, but progress in this line of work is
heavily hindered by a lack of open-source benchmarks and
algorithmic implementations.
8 / 41
RL4LMs
Introduction
Introduction
To facilitate research in building RL algorithms to better align LMs,
they introduce a library, a benchmark, and an algorithm.
RL4LMs library, which enables generative HuggingFace models
(e.g., GPT-2 or T5) to be trained using a variety of existing RL
methods like PPO/A2C/etc.
GRUE benchmark: GRUE is a collection of 7 contemporary NLP
tasks; in contrast to other benchmarks, each task is paired with reward
function(s) rather than supervised training targets.
A novel on-policy RL algorithm called NLPO, that dynamically
learns task-specific constraints over the distribution of language
at a token level.
9 / 41
RL4LMs
Introduction
[Figure-only slide]
10 / 41
RL4LMs
Introduction
Related Work
Imitation learning for NLP
RL for Large Action Spaces
RL for NLP
11 / 41
RL4LMs
RL4LMs: A Library for Training LMs with RL
Table of contents
1 Abstract
2 Introduction
3 RL4LMs: A Library for Training LMs with RL
4 NLPO: Natural Language Policy Optimization
5 GRUE (General Reinforced-Language Understanding Eval)
6 Conclusion
12 / 41
RL4LMs
RL4LMs: A Library for Training LMs with RL
RL4LMs: A Library for Training LMs with RL
RL4LMs, an open-source library with building blocks for fine-tuning
and evaluating RL algorithms on LM-based generation.
The library is modular, which enables users to plug in customized
environments, reward functions, metrics, and algorithms. In the initial
release, they provide support for 6 different NLP tasks, 16 evaluation
metrics and rewards, and 4 RL algorithms.
13 / 41
RL4LMs
RL4LMs: A Library for Training LMs with RL
RL4LMs: A Library for Training LMs with RL
1 Environments: Generation as a token-level MDP
2 Reward functions and evaluation metrics
3 On-policy actor-critic algorithms
14 / 41
RL4LMs
RL4LMs: A Library for Training LMs with RL
Environments: Generation as a token-level MDP
Each environment is an NLP task: they are given a supervised dataset
D = {(x_i, y_i)}_{i=1}^{N} of N examples, where x ∈ X is a language input
and y ∈ Y is the target string.
Generation can be viewed as a Markov Decision Process (MDP)
⟨S, A, R, P, 𝛾, T⟩ using a finite vocabulary V.
Each episode in the MDP begins by sampling a datapoint (x, y) from
their dataset and ends when the current time step t exceeds the horizon
T or an end of sentence (EOS) token is generated.
15 / 41
RL4LMs
RL4LMs: A Library for Training LMs with RL
Environments: Generation as a token-level MDP
The input x = (x_0, . . . , x_m) is a task-specific prompt that is used as
the initial state s_0 = (x_0, . . . , x_m), where s_0 ∈ S.
For the state, action, and transition function:
S: the state space, whose states are sequences of tokens from V (so x_m ∈ V)
An action a_t ∈ A is a token from the vocabulary V
The transition function P : S × A → Δ(S) appends the chosen token to the state
At the end of each episode, a reward R : S × A × Y → ℝ that depends on (s_T, y) is emitted
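To make the token-level MDP concrete, here is a minimal sketch of how one episode could be stepped; it is illustrative only (not the RL4LMs environment API), and `reward_fn`, `eos_token_id`, and the horizon value are hypothetical placeholders.

```python
# Minimal sketch of generation as a token-level MDP (illustrative only,
# not the RL4LMs environment API).
from dataclasses import dataclass, field
from typing import List

@dataclass
class GenerationEpisode:
    prompt_tokens: List[int]          # x = (x_0, ..., x_m), the initial state s_0
    target: str                       # reference y, used only by the reward
    horizon: int = 64                 # T, maximum number of generated tokens
    eos_token_id: int = 2             # hypothetical EOS id
    generated: List[int] = field(default_factory=list)

    @property
    def state(self) -> List[int]:
        # s_t is the prompt followed by all tokens generated so far
        return self.prompt_tokens + self.generated

    def step(self, action: int, reward_fn) -> tuple:
        """Append one vocabulary token (the action) to the state."""
        self.generated.append(action)
        done = action == self.eos_token_id or len(self.generated) >= self.horizon
        # Rewards are sequence-level and sparse: non-zero only at episode end
        reward = reward_fn(self.state, self.target) if done else 0.0
        return self.state, reward, done
```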
16 / 41
RL4LMs
RL4LMs: A Library for Training LMs with RL
Reward functions and evaluation metrics
They provide interfaces to
1 n-gram overlap metrics
2 model-based semantic metrics
3 task-specific metrics
4 diversity/fluency/naturalness metrics
5 task-specific, model-based human preference metrics
17 / 41
RL4LMs
RL4LMs: A Library for Training LMs with RL
On-policy actor-critic algorithms
RL4LMs supports fine-tuning and training LMs from scratch via
on-policy actor-critic algorithms on language environments.
Their benchmark experiments focus on fine-tuning a pre-trained LM,
denoted 𝜋0, from which they initialize their agent’s policy 𝜋𝜃 = 𝜋0.
The value network V𝜙 used to estimate the value function is also
initialized from 𝜋0 except for the final layer which is randomly
initialized to output a single scalar value.
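A rough sketch of this actor-critic setup, assuming a HuggingFace causal LM; the wiring of the value head here is an illustration, not the exact RL4LMs implementation.

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM

class PolicyWithValueHead(nn.Module):
    """Illustrative actor-critic wrapper: pi_theta is initialized from pi_0,
    and V_phi reuses the same backbone plus a randomly initialized scalar head."""

    def __init__(self, model_name: str = "gpt2"):
        super().__init__()
        self.lm = AutoModelForCausalLM.from_pretrained(model_name)  # pi_theta = pi_0
        hidden_size = self.lm.config.hidden_size
        self.value_head = nn.Linear(hidden_size, 1)  # only this layer is random

    def forward(self, input_ids, attention_mask=None):
        out = self.lm(input_ids,
                      attention_mask=attention_mask,
                      output_hidden_states=True)
        logits = out.logits                      # token distribution for pi_theta
        last_hidden = out.hidden_states[-1]      # features shared with the critic
        values = self.value_head(last_hidden).squeeze(-1)  # V_phi(s_t) per position
        return logits, values
```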
18 / 41
RL4LMs
RL4LMs: A Library for Training LMs with RL
On-policy actor-critic algorithms
Value function:
V^𝜋_t (s_t) = E_{a_t ∼ 𝜋} [ ∑_{𝜏=t}^{T} 𝛾^{𝜏−t} R(s_𝜏, a_𝜏, y) ]
Q-value function:
Q^𝜋_t (s_t, a_t) = R(s_t, a_t, y) + 𝛾 E_{s_{t+1} ∼ P} [ V^𝜋_{t+1}(s_{t+1}) ]
Advantage function:
A^𝜋_t (s_t, a_t) = Q^𝜋_t (s_t, a_t) − V^𝜋_t (s_t)
To increase training stability, the advantage is approximated using
Generalized Advantage Estimation.
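As a small illustration of these quantities, the sketch below computes discounted returns and simple Monte Carlo advantages from one episode's rewards and value estimates; the example numbers are made up.

```python
from typing import List

def discounted_returns(rewards: List[float], gamma: float = 0.99) -> List[float]:
    """Return G_t = sum_{tau >= t} gamma^(tau - t) * r_tau for each step t."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))

def advantages(rewards: List[float], values: List[float], gamma: float = 0.99) -> List[float]:
    """A_t ~= Q_t - V_t, with Q_t estimated by the Monte Carlo return G_t."""
    returns = discounted_returns(rewards, gamma)
    return [g - v for g, v in zip(returns, values)]

# Example: a sparse sequence-level reward arriving only at the final token
rewards = [0.0, 0.0, 0.0, 1.0]
values = [0.4, 0.5, 0.6, 0.7]
print(advantages(rewards, values))
```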
19 / 41
RL4LMs
RL4LMs: A Library for Training LMs with RL
On-policy actor-critic algorithms
Given an input-output pair (x, y) and the agent’s generated predictions, the
environment rewards are sequence-level and sparse. Therefore, they regularize
the reward function with a token-level KL penalty for all on-policy
algorithms, to prevent the model from deviating too far from the initialized
LM:
R̂ (s_t, a_t, y) = R (s_t, a_t, y) − 𝛽 KL (𝜋𝜃 (a_t | s_t) ∥ 𝜋0 (a_t | s_t))
where the per-token KL is estimated as
KL(𝜋𝜃 (a_t | s_t) ∥ 𝜋0(a_t | s_t)) ≈ log 𝜋𝜃 (a_t | s_t) − log 𝜋0(a_t | s_t).
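A minimal sketch of the token-level KL shaping above, assuming per-token log-probabilities from the current policy and the frozen initial LM are available; this is not the exact RL4LMs code.

```python
def kl_shaped_reward(task_reward: float,
                     logprob_policy: float,
                     logprob_init: float,
                     beta: float = 0.1,
                     is_last_token: bool = False) -> float:
    """R_hat(s_t, a_t, y) = R(s_t, a_t, y) - beta * KL(pi_theta || pi_0),
    with the per-token KL estimated as log pi_theta(a_t|s_t) - log pi_0(a_t|s_t)."""
    kl_estimate = logprob_policy - logprob_init
    # The task reward is sequence-level and sparse: only added at the last token
    r = task_reward if is_last_token else 0.0
    return r - beta * kl_estimate
```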
20 / 41
RL4LMs
NLPO: Natural Language Policy Optimization
Table of contents
1 Abstract
2 Introduction
3 RL4LMs: A Library for Training LMs with RL
4 NLPO: Natural Language Policy Optimization
5 GRUE (General Reinforced-Language Understanding Eval)
6 Conclusion
21 / 41
RL4LMs
NLPO: Natural Language Policy Optimization
NLPO: Natural Language Policy Optimization
Language generation action spaces are orders of magnitude larger than
what most discrete action space RL algorithms are designed for.
They hypothesize that the size of the action space is a core cause of
instability when training LMs with existing RL methods.
NLPO, a parameterized-masked extension of PPO, learns to mask out
less relevant tokens in-context as it trains. It accomplishes this via
top-p sampling, which restricts tokens to the smallest possible set
whose cumulative probability is greater than p.
22 / 41
RL4LMs
NLPO: Natural Language Policy Optimization
NLPO: Natural Language Policy Optimization
NLPO maintains a masking policy 𝜋𝜓: the masking policy is a copy
of the current policy (𝜋𝜃) updated every 𝜇 steps.
A parameterized-invalid-mask is created from 𝜋𝜓 by first selecting the
top-p tokens from the vocabulary, and then applying an invalid-mask
to the remaining tokens.
This provides an additional constraint that balances the benefit of
containing more task-relevant information than the KL penalty
derived from 𝜋0 against the risk of reward hacking.
23 / 41
RL4LMs
NLPO: Natural Language Policy Optimization
[Figure-only slide]
24 / 41
RL4LMs
NLPO: Natural Language Policy Optimization
NLPO Details
NLPO learns to mask irrelevant language by maintaining a masking
policy 𝜋𝜓 : the masking policy is a copy of the current policy (𝜋𝜃),
but is updated only every 𝜇 steps.
Given Z(𝜋𝜃) = ∑_{a ∈ V} 𝜋𝜃 (a | s), the normalization value given by the
sum of the probabilities of all actions a ∈ A for a particular state s ∈ S,
let the parameterized top-p vocabulary V^p_{𝜋𝜃} ⊂ V be the subset of the
vocabulary consisting of the top-p highest-probability vocabulary tokens
with respect to 𝜋𝜃.
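A tiny sketch of the periodic masking-policy update, assuming both policies are PyTorch modules; `mu` and the function name are my own.

```python
import torch.nn as nn

def maybe_update_masking_policy(pi_theta: nn.Module,
                                pi_psi: nn.Module,
                                step: int,
                                mu: int = 10) -> None:
    """Every mu steps, copy the current policy's weights into the masking policy."""
    if step % mu == 0:
        pi_psi.load_state_dict(pi_theta.state_dict())
```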
25 / 41
RL4LMs
NLPO: Natural Language Policy Optimization
NLPO Details
Formally, let Z_p(𝜋𝜃) be the normalization value for the parameterized
top-p vocabulary, i.e. the sum over the subset of tokens that maximizes
Z_p(𝜋𝜃) = ∑_{a ∈ V^p_{𝜋𝜃}} 𝜋𝜃 (a | s).
Then the masking policy restricted to the parameterized top-p
vocabulary is defined as:
𝜋𝜓 (a | s, 𝜋𝜃) = 𝜋𝜃 (a | s) / Z_p(𝜋𝜃)   if a ∈ V^p_{𝜋𝜃}
𝜋𝜓 (a | s, 𝜋𝜃) = 0 otherwise
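A sketch of the parameterized top-p masking above (illustrative, not the RL4LMs implementation): select the top-p set under the masking distribution, zero out the remaining tokens of 𝜋𝜃, and renormalize by Z_p.

```python
import torch

def top_p_masked_policy(policy_probs: torch.Tensor,
                        mask_probs: torch.Tensor,
                        p: float = 0.9) -> torch.Tensor:
    """Restrict policy_probs (pi_theta) to the top-p vocabulary chosen by
    mask_probs (the masking policy pi_psi), then renormalize by Z_p.
    Both inputs are 1-D probability vectors over the vocabulary V."""
    sorted_probs, sorted_idx = torch.sort(mask_probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Smallest prefix of tokens whose cumulative probability reaches p
    keep = cumulative - sorted_probs < p
    allowed = torch.zeros_like(mask_probs, dtype=torch.bool)
    allowed[sorted_idx[keep]] = True

    masked = torch.where(allowed, policy_probs, torch.zeros_like(policy_probs))
    z_p = masked.sum()                      # Z_p(pi_theta)
    return masked / z_p                     # pi_psi(. | s, pi_theta)

# Example: a toy 5-token vocabulary
probs = torch.tensor([0.40, 0.30, 0.15, 0.10, 0.05])
print(top_p_masked_policy(probs, probs, p=0.8))
```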
26 / 41
RL4LMs
GRUE (General Reinforced-Language Understanding Eval)
Table of contents
1 Abstract
2 Introduction
3 RL4LMs: A Library for Training LMs with RL
4 NLPO: Natural Language Policy Optimization
5 GRUE (General Reinforced-Language Understanding Eval)
6 Conclusion
27 / 41
RL4LMs
GRUE (General Reinforced-Language Understanding Eval)
GRUE (General Reinforced-Language Understanding Eval)
GRUE is a collection of 7 generative NLP tasks. The metrics span
two categories.
Task preference metrics capture how well the models produce
generations that satisfy the desiderata of the specific generation task.
Naturalness metrics capture fluency, readability, etc., and provide
perspective on factors beyond semantics.
28 / 41
RL4LMs
GRUE (General Reinforced-Language Understanding Eval)
[Figure-only slide]
29 / 41
RL4LMs
GRUE (General Reinforced-Language Understanding Eval)
GRUE (General Reinforced-Language Understanding Eval)
1 Results on GRUE: Which algorithm should be used to learn preferences?
2 Preference reward learning, selection, and hacking
3 Data budget: Improve the reward or gather more demonstrations?
4 Practical considerations: Which implementation details matter most?
30 / 41
RL4LMs
GRUE (General Reinforced-Language Understanding Eval)
[Figure-only slide]
31 / 41
RL4LMs
GRUE (General Reinforced-Language Understanding Eval)
Using RL and supervised learning in conjunction works best:
NLPO+supervised and PPO+supervised almost always outperform
NLPO/PPO (or supervised learning alone) across both task metrics and
naturalness metrics.
32 / 41
RL4LMs
GRUE (General Reinforced-Language Understanding Eval)
Preference reward learning, selection, and hacking
KL constraints can prevent models from reward hacking, but when the
initial policy has low performance on the task, the KL penalty
pushes the policy towards nonsense.
NLPO’s improved performance and stability stem from the masking
policy providing an additional constraint on the current policy,
and p is an important parameter to consider during RL training.
33 / 41
RL4LMs
GRUE (General Reinforced-Language Understanding Eval)
Human Preference Reward Learning
To this point, their experiments have largely focused on optimizing
evaluation metrics that correlate with human judgments, e.g.,
METEOR.
34 / 41
RL4LMs
GRUE (General Reinforced-Language Understanding Eval)
Human Preference Reward Learning
Comparing the METEOR-only to the preference model, the
generations produced by the human feedback model are preferred in
682 cases, compared to the METEOR-only model which is preferred
in 587 cases.
This implies that this pipeline of collecting preferences, training a
reward, and further tuning the policy improves alignment to
human preferences.
35 / 41
RL4LMs
GRUE (General Reinforced-Language Understanding Eval)
Improve the reward or gather more demonstrations?
In the IMDB task, a model is given a partial movie review as a
prompt, and is asked to continue it as positively as possible.
The trade-off is between gathering more:
1 Sentiment labels (improving the reward)
2 Positive sentiment reviews (improving supervised training)
They find that a learned reward function, used as a signal for an RL
method, enables greater performance than a supervised method
trained with 5 times more data.
36 / 41
RL4LMs
GRUE (General Reinforced-Language Understanding Eval)
Which implementation details matter most?
In recent works, the environment is simulated within the RL formulation by
setting the discount factor 𝛾 = 1 (a bandit-style setup); they find this
causes instability in training with respect to naturalness in both PPO and
NLPO for IMDB.
This indicates that discounting rewards with 𝛾 < 1 via a
token-level MDP formulation is at least sometimes more effective for
language generation.
Dropout and sampling settings are also critical for the stability of RL
training.
37 / 41
RL4LMs
Conclusion
Table of contents
1 Abstract
2 Introduction
3 RL4LMs: A Library for Training LMs with RL
4 NLPO: Natural Language Policy Optimization
5 GRUE (General Reinforced-Language Understanding Eval)
6 Conclusion
38 / 41
RL4LMs
Conclusion
Conclusion
This paper provides the GRUE benchmark and the RL4LMs library, which
can push progress in aligning language models to human preferences
via RL methods by providing the community with a standard means of
comparing methods.
39 / 41
RL4LMs
Conclusion
PPO Details
Given the discussion and equations in Section 3.3, they further note that
they follow Ziegler et al. (2019) and dynamically adapt the KL
coefficient 𝛽 during training, where
e_t = clip( (KL(𝜋 (a_t | s_t) ∥ 𝜋0 (a_t | s_t)) − KL_target) / KL_target , −0.2, 0.2 )
𝛽_{t+1} = 𝛽_t (1 + K_𝛽 e_t)
where KL_target is the user-specified target KL divergence between the initial
model 𝜋0 and the current policy 𝜋, and K_𝛽 is the rate of update, which they
generally set to 0.2 in their experiments.
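A compact sketch of this adaptive KL controller, following the update rule above; the function and variable names are my own.

```python
def update_kl_coefficient(beta: float,
                          observed_kl: float,
                          kl_target: float,
                          k_beta: float = 0.2) -> float:
    """Adapt beta as in Ziegler et al. (2019):
    e_t = clip((KL - KL_target) / KL_target, -0.2, 0.2); beta_{t+1} = beta_t * (1 + k_beta * e_t)."""
    error = (observed_kl - kl_target) / kl_target
    error = max(-0.2, min(0.2, error))          # clip to [-0.2, 0.2]
    return beta * (1.0 + k_beta * error)

# Example: observed KL above target -> beta increases, strengthening the penalty
print(update_kl_coefficient(beta=0.1, observed_kl=0.08, kl_target=0.05))
```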
40 / 41
RL4LMs
Conclusion
PPO Details
To increase stability during training, they further use Generalized
Advantage Estimation (GAE) and define the advantage estimator
Â(s_n, a_n) based on the temporal-difference residual as:
𝛿_t = r(s_t, a_t) + 𝛾 V𝜙(s_{t+1}) − V𝜙(s_t)
Â(s_n, a_n) = ∑_{t=0}^{∞} (𝛾𝜆)^t 𝛿_{n+t}
where 𝜆 provides the trade-off between bias and variance.
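A short sketch of GAE computed from the TD residuals above, treating the final state as terminal (V(s_{T+1}) = 0, an assumption made for the illustration).

```python
from typing import List

def gae_advantages(rewards: List[float],
                   values: List[float],
                   gamma: float = 0.99,
                   lam: float = 0.95) -> List[float]:
    """Compute A_hat_t = sum_l (gamma * lam)^l * delta_{t+l}, with
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) and V(s_{T+1}) = 0."""
    advantages, running = [], 0.0
    next_value = 0.0
    for r, v in zip(reversed(rewards), reversed(values)):
        delta = r + gamma * next_value - v
        running = delta + gamma * lam * running
        advantages.append(running)
        next_value = v
    return list(reversed(advantages))

# Example with a sparse terminal reward
print(gae_advantages([0.0, 0.0, 1.0], [0.2, 0.4, 0.6]))
```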
41 / 41
