InstructGPT
Training language models to follow instructions
with human feedback
Long Ouyang, Jeff Wu, Xu Jiang et al.
National Yang Ming Chiao Tung University, Hsinchu
Speaker: Po-Chuan, Chen
January 10, 2023
1 / 50
InstructGPT
Table of contents
1 Abstract
2 Introduction
3 Methods and experimental details
Dataset
Tasks and Human data collection
Models
Evaluation
4 Results
5 Discussion
6 Additional details
2 / 50
InstructGPT
Abstract
Table of contents
1 Abstract
2 Introduction
3 Methods and experimental details
Dataset
Tasks and Human data collection
Models
Evaluation
4 Results
5 Discussion
6 Additional details
3 / 50
InstructGPT
Abstract
Abstract
This paper shows how to align language models with user intent
on a wide range of tasks by fine-tuning with human feedback.
Starting from labeler-written prompts and prompts submitted through
the OpenAI API, the authors collect a dataset of labeler demonstrations
of the desired model behavior, which they use to fine-tune GPT-3 with
supervised learning.
They then collect a dataset of rankings of model outputs and use it to
further fine-tune the supervised model with reinforcement learning from
human feedback; the resulting model is called InstructGPT.
4 / 50
InstructGPT
Abstract
Contribution
InstructGPT shows that fine-tuning with human feedback is a
promising direction for aligning language models with human intent.
The model shows improvements in truthfulness and reductions in
toxic output generation while having minimal performance
regressions on public NLP datasets.
Also, outputs from the 1.3B-parameter InstructGPT model are preferred
to outputs from the 175B GPT-3, despite having far fewer parameters.
5 / 50
InstructGPT
Introduction
Table of contents
1 Abstract
2 Introduction
3 Methods and experimental details
Dataset
Tasks and Human data collection
Models
Evaluation
4 Results
5 Discussion
6 Additional details
6 / 50
InstructGPT
Introduction
Introduction
Large language models (LMs) can be prompted to perform a
range of natural language processing (NLP) tasks, given some
examples of the task as input.
This work makes progress on aligning language models by training them
to act in accordance with the user’s intention, evaluating the models
on whether they are helpful, honest, and harmless.
It fine-tunes language models to follow a broad class of written
instructions using reinforcement learning from human feedback (RLHF).
7 / 50
InstructGPT
Introduction
Framework about GPT-3
8 / 50
InstructGPT
Introduction
Three steps for building InstructGPT
9 / 50
InstructGPT
Introduction
Training supervised fine-tuning model
First, they hire a team of 40 contractors to label data, selected based
on their performance on a screening test.
They then collect a dataset of human-written demonstrations of the
desired output behavior on (mostly English) prompts submitted to
the OpenAI API, plus some labeler-written prompts, and use this to
train their supervised learning baselines.
10 / 50
InstructGPT
Introduction
Three steps for building InstructGPT
11 / 50
InstructGPT
Introduction
Training reward model
Second, they collect a dataset of human-labeled comparisons
between outputs from OpenAI’s models on a larger set of API
prompts.
They then train a reward model (RM) on this dataset to predict which
model output their labelers would prefer.
12 / 50
InstructGPT
Introduction
Three steps for building InstructGPT
13 / 50
InstructGPT
Introduction
Optimizing a policy against the reward model with RL
Finally, they use this RM as a reward function and fine-tune the
supervised learning baseline to maximize this reward using the PPO
algorithm.
They mainly evaluate their models by having labelers rate the
quality of model outputs on a test set consisting of prompts from
held-out customers (who are not represented in the training data).
They also conduct automatic evaluations on a range of public NLP
datasets.
14 / 50
InstructGPT
Introduction
Human evaluations on OpenAI API prompt distribution
15 / 50
InstructGPT
Introduction
Main findings
Minimizing performance regressions on public NLP datasets
Mixing PPO updates with updates that increase the log
likelihood of the pretraining distribution (PPO-ptx)
Generalizing to the preferences of “held-out” labelers
InstructGPT models show promising generalization to
instructions outside of the RLHF fine-tuning distribution
16 / 50
InstructGPT
Methods and experimental details
Table of contents
1 Abstract
2 Introduction
3 Methods and experimental details
Dataset
Tasks and Human data collection
Models
Evaluation
4 Results
5 Discussion
6 Additional details
17 / 50
InstructGPT
Methods and experimental details
Dataset
Dataset
To train the very first InstructGPT models, labelers had to write
prompts themselves, since an initial source of instruction-like
prompts was needed to bootstrap the process, and such prompts were
rarely submitted to the regular GPT-3 models.
Plain: labelers come up with an arbitrary task
Few-shot: labelers come up with an instruction, plus multiple
query/response pairs for that instruction
User-based: labelers come up with prompts corresponding to a number
of use cases stated in applications to the OpenAI API
18 / 50
InstructGPT
Methods and experimental details
Dataset
Use-case categories
19 / 50
InstructGPT
Methods and experimental details
Tasks and Human data collection
Tasks and Human data collection
The aim was to select a group of labelers who were sensitive to the
preferences of different demographic groups and who were good at
identifying potentially harmful outputs; candidates were selected
via a screening test.
During training and evaluation, the alignment criteria may come into
conflict:
Training: prioritize helpfulness to the user
Evaluation: prioritize truthfulness and harmlessness
20 / 50
InstructGPT
Methods and experimental details
Tasks and Human data collection
As an initial study to see how well the model generalizes to the
preferences of other labelers, they hire a separate set of labelers who
do not produce any of the training data.
Despite the complexity of the task, they find that inter-annotator
agreement rates are quite high:
Training labelers: agree with each other 72.6 ± 1.5% of the time
Held-out labelers: 77.3 ± 1.3%
21 / 50
InstructGPT
Methods and experimental details
Models
Models
These models are trained on a broad distribution of Internet data and
are adaptable to a wide range of downstream tasks, but have poorly
characterized behavior.
1 Supervised fine-tuning (SFT)
2 Reward modeling (RM)
3 Reinforcement learning (RL)
22 / 50
InstructGPT
Methods and experimental details
Models
Supervised fine-tuning (SFT)
They fine-tune GPT-3 on their labeler demonstrations using
supervised learning.
They then select the final SFT model based on the RM score on the
validation set. Although the SFT models overfit on validation loss
after 1 epoch, training for more epochs still helps both the RM score
and human preference ratings.
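As a rough illustration of this step, the sketch below fine-tunes a small causal LM on toy demonstration data with the standard next-token cross-entropy loss; the model name, data, and hyperparameters are placeholders, not the paper's actual GPT-3 setup.

```python
# Minimal SFT sketch (assumed setup, not the paper's implementation):
# fine-tune a causal LM on labeler demonstrations with the usual LM loss.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")    # stand-in for GPT-3
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# toy (prompt, demonstration) pairs; the real data comes from labelers / the API
demos = [("Explain the moon landing to a 6 year old.",
          "People went to the moon in a rocket and walked on it.")]

def collate(batch):
    texts = [prompt + "\n" + demo for prompt, demo in batch]
    enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    enc["labels"] = enc["input_ids"].clone()   # next-token prediction targets
    return enc

loader = DataLoader(demos, batch_size=1, collate_fn=collate)
optim = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for batch in loader:
    loss = model(**batch).loss    # cross-entropy over the demonstration tokens
    loss.backward()
    optim.step()
    optim.zero_grad()
```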
23 / 50
InstructGPT
Methods and experimental details
Models
Reward modeling (RM)
Starting from the SFT model with the final unembedding layer
removed, they train a model that takes in a prompt and response and
outputs a scalar reward.
They only use 6B RMs, as this saves a lot of compute, and training a
larger RM could be unstable, which would make it less suitable for use
as the value function during RL.
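A minimal sketch of that architecture, assuming a Hugging Face backbone as a stand-in for the 6B SFT model: the language-model (unembedding) head is simply not used, and a linear head maps the final token's hidden state to a scalar reward.

```python
# Hypothetical reward model: transformer backbone without the LM head,
# plus a linear head that outputs one scalar per (prompt, response) pair.
import torch
import torch.nn as nn
from transformers import AutoModel

class RewardModel(nn.Module):
    def __init__(self, backbone_name="gpt2"):       # stand-in for the 6B SFT model
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)  # no unembedding
        self.value_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        last = attention_mask.sum(dim=1) - 1         # index of last real token
        summary = hidden[torch.arange(hidden.size(0)), last]
        return self.value_head(summary).squeeze(-1)  # scalar reward per example
```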
24 / 50
InstructGPT
Methods and experimental details
Models
Loss function for the reward model
In order to speed up comparison collection, they present labelers with
anywhere between K = 4 and K = 9 responses to rank.
loss(\theta) = -\frac{1}{\binom{K}{2}} \, \mathbb{E}_{(x,\, y_w,\, y_l) \sim D} \Big[ \log \big( \sigma \big( r_\theta(x, y_w) - r_\theta(x, y_l) \big) \big) \Big]
where r𝜃 (x, y) is the scalar output of the reward model for prompt x
and completion y with parameters 𝜃, yw is the preferred completion out
of the pair of yw and yl, and D is the dataset of human comparisons.
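Written as code, this loss is just a negative log-sigmoid over every preferred/dispreferred pair from the same prompt, averaged over the C(K, 2) pairs. A sketch under the assumption that the K responses arrive already sorted by labeler preference:

```python
# Pairwise ranking loss sketch: -log sigmoid(r(x, y_w) - r(x, y_l)),
# averaged over all K-choose-2 pairs from one prompt's ranked responses.
import itertools
import torch
import torch.nn.functional as F

def rm_pairwise_loss(rewards_ranked: torch.Tensor) -> torch.Tensor:
    """rewards_ranked: scalar rewards for one prompt's K responses,
    ordered from most preferred (index 0) to least preferred."""
    losses = []
    for w, l in itertools.combinations(range(rewards_ranked.size(0)), 2):
        # response w is ranked above response l by the labeler
        losses.append(-F.logsigmoid(rewards_ranked[w] - rewards_ranked[l]))
    return torch.stack(losses).mean()   # mean implements the 1 / C(K, 2) factor

# e.g. K = 4 responses scored by the reward model
loss = rm_pairwise_loss(torch.tensor([1.2, 0.7, 0.1, -0.5]))
```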
25 / 50
InstructGPT
Methods and experimental details
Models
Reinforcement learning (RL) with PPO
They fine-tune the SFT model on their environment using PPO; the
environment is a bandit environment that presents a random customer
prompt and expects a response to the prompt.
In addition, they add a per-token KL penalty from the SFT model to
mitigate over-optimization of the reward model, and the value function
is initialized from the RM.
objective(\phi) = \mathbb{E}_{(x, y) \sim D_{\pi_\phi^{\mathrm{RL}}}} \Big[ r_\theta(x, y) - \beta \log \big( \pi_\phi^{\mathrm{RL}}(y \mid x) / \pi^{\mathrm{SFT}}(y \mid x) \big) \Big]
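A small sketch of the reward this objective corresponds to in practice, assuming per-token log-probabilities of the sampled response under the current policy and the frozen SFT model; the PPO update itself (advantages, clipping, value-function training) is left out, and the β value is illustrative.

```python
# Sketch of the KL-penalized reward fed to PPO: a per-token penalty of
# beta * (log pi_RL - log pi_SFT), with the RM score added on the final token.
import torch

def kl_penalized_rewards(rm_score, logp_policy, logp_sft, beta=0.02):
    """rm_score: scalar RM score for the full (prompt, response).
    logp_policy / logp_sft: per-token log-probs of the sampled response
    under the current policy and the frozen SFT model (same length).
    beta is an illustrative coefficient, not the paper's tuned value."""
    kl_per_token = logp_policy - logp_sft     # sample estimate of the log-ratio
    rewards = -beta * kl_per_token            # KL penalty applied at every token
    rewards[-1] = rewards[-1] + rm_score      # RM reward granted at the end
    return rewards

rewards = kl_penalized_rewards(
    rm_score=torch.tensor(0.8),
    logp_policy=torch.tensor([-1.1, -0.4, -2.3]),
    logp_sft=torch.tensor([-1.0, -0.6, -2.0]),
)
```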
26 / 50
InstructGPT
Methods and experimental details
Models
PPO-ptx models
In order to fix the performance regressions on public NLP datasets,
they mix pretraining gradients into the PPO gradients.
objective(\phi) = \mathbb{E}_{(x, y) \sim D_{\pi_\phi^{\mathrm{RL}}}} \Big[ r_\theta(x, y) - \beta \log \big( \pi_\phi^{\mathrm{RL}}(y \mid x) / \pi^{\mathrm{SFT}}(y \mid x) \big) \Big] + \gamma \, \mathbb{E}_{x \sim D_{\mathrm{pretrain}}} \Big[ \log \pi_\phi^{\mathrm{RL}}(x) \Big]
where \pi_\phi^{\mathrm{RL}} is the learned RL policy, \pi^{\mathrm{SFT}} is the supervised
fine-tuned model, and D_{\mathrm{pretrain}} is the pretraining distribution; the
coefficient \beta controls the KL reward and \gamma the pretraining loss.
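As a sketch of the extra term, the pretraining part is just a standard language-modeling loss on batches drawn from the pretraining distribution, scaled by γ and added to the PPO loss; the names and the default γ below are placeholders, not the paper's tuned values.

```python
# PPO-ptx sketch: add gamma * LM cross-entropy on a pretraining batch
# to the PPO loss computed on the RLHF prompts.
import torch

def ppo_ptx_loss(ppo_loss, policy_model, pretrain_batch, gamma=1.0):
    """ppo_loss: scalar PPO loss already computed on API prompts.
    pretrain_batch: dict with 'input_ids' / 'attention_mask' sampled from
    the pretraining distribution (padding-token masking omitted for brevity).
    gamma here is illustrative only."""
    lm_out = policy_model(**pretrain_batch, labels=pretrain_batch["input_ids"])
    # maximizing gamma * E[log pi_RL(x)]  ==  minimizing gamma * cross-entropy
    return ppo_loss + gamma * lm_out.loss
```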
27 / 50
InstructGPT
Methods and experimental details
Evaluation
Evaluation
API distribution
The main metric is human preference ratings on a held-out set of
prompts from the same source as the training distribution.
Public NLP datasets
They evaluate on two types of public datasets, FLAN and T0, both
consisting of a variety of NLP tasks. They also conduct human
evaluations of toxicity on the RealToxicityPrompts dataset.
28 / 50
InstructGPT
Methods and experimental details
Evaluation
Labeler-collected metadata on the API distribution
29 / 50
InstructGPT
Results
Table of contents
1 Abstract
2 Introduction
3 Methods and experimental details
Dataset
Tasks and Human data collection
Models
Evaluation
4 Results
5 Discussion
6 Additional details
30 / 50
InstructGPT
Results
Results
API distribution
Labelers significantly prefer InstructGPT outputs over outputs
from GPT-3
Generalizing to the preferences of “held-out” labelers
Public NLP datasets are not reflective of how their language
models are used
Public NLP datasets
Showing improvements in truthfulness over GPT-3
Showing small improvements in toxicity over GPT-3, but not bias
Minimizing performance regressions on public NLP datasets by
modifying their RLHF fine-tuning procedure
31 / 50
InstructGPT
Results
Preference results
32 / 50
InstructGPT
Results
Metadata results on the API distribution
33 / 50
InstructGPT
Results
Comparing with FLAN and T0 in terms of Likert scores
34 / 50
InstructGPT
Results
Results on the TruthfulQA dataset
35 / 50
InstructGPT
Results
36 / 50
InstructGPT
Results
Qualitative results
InstructGPT shows promising generalization to instructions outside the
fine-tuning distribution, such as instructions in other languages,
though they notice that it often produces an output in English even
when the instruction is in another language.
In comparison, they find that GPT-3 can perform these tasks but
requires more careful prompting, and rarely follows instructions in
these domains.
37 / 50
InstructGPT
Results
Simple mistakes
Confused by instructions that assume false premises
Overly hedging, rather than directly answering simple questions
38 / 50
InstructGPT
Discussion
Table of contents
1 Abstract
2 Introduction
3 Methods and experimental details
Dataset
Tasks and Human data collection
Models
Evaluation
4 Results
5 Discussion
6 Additional details
39 / 50
InstructGPT
Discussion
Discussion
Implications for alignment research
1 Increasing investments in alignment of existing language models
is more cost-effective than training larger models
2 More research is needed to study how well this generalization
scales with increased capabilities
The goal of this paper is to demonstrate that this alignment technique
can align models to a specific human reference group for a specific
application.
40 / 50
InstructGPT
Discussion
Future work
They need to design an alignment process that is transparent
They need to prevent the model from generating content that could
cause real-world harm
Filtering the pretraining data mix for toxic content
The model can't represent the full spectrum of people who will use it
41 / 50
InstructGPT
Additional details
Table of contents
1 Abstract
2 Introduction
3 Methods and experimental details
Dataset
Tasks and Human data collection
Models
Evaluation
4 Results
5 Discussion
6 Additional details
42 / 50
InstructGPT
Additional details
Additional details
1 Additional prompt data details
2 Additional human data collection details
3 Additional model details
4 Automatic evaluation details
5 Additional results
43 / 50
InstructGPT
Additional details
1.1 Dataset sizes
44 / 50
InstructGPT
Additional details
1.2 Data diversity
45 / 50
InstructGPT
Additional details
2.1 Web interface
46 / 50
InstructGPT
Additional details
2.1 Web interface
47 / 50
InstructGPT
Additional details
2.2 Labeler demographic data
48 / 50
InstructGPT
Additional details
5 Additional results (Zero-shot)
49 / 50
InstructGPT
Additional details
5 Additional results (Few-shot)
50 / 50
