Training language models to follow instructions with human feedback (InstructGPT)
Long Ouyang, Jeff Wu, Xu Jiang et al. (OpenAI)
Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.
4. Model Owner (Team of authors)
Long Ouyang*: Research Scientist at OpenAI
Jeffrey Wu*: Research Engineer on OpenAI's safety team
OpenAI*: an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity.
* All pictures and information about the authors and the company are from LinkedIn.
5. ● Large language models (LMs) can perform a range of natural language processing (NLP) tasks when prompted with examples.
● However, these models often exhibit unintended behaviors, such as generating biased or toxic text, making up facts, or not following user instructions.
● The authors aim to align LMs by training them to act in accordance with the user's intention, both explicit and implicit, and they evaluate the models on being helpful, honest, and harmless.
● Their approach fine-tunes language models to follow a broad class of written instructions using reinforcement learning from human feedback (RLHF).
6. Pre-training task
The paper focuses on fine-tuning the pre-trained GPT-3 language model with human feedback. GPT-3 is pre-trained on a large corpus of text data using a language-modeling objective: predicting the next word in a sequence of text.
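To make the objective concrete, here is a minimal sketch of next-token prediction as a cross-entropy loss. This is an illustration only, not OpenAI's training code; the random logits stand in for a real model's output.

```python
import torch
import torch.nn.functional as F

# Toy example of the language-modeling objective.
vocab_size = 50257          # GPT-3's BPE vocabulary size
batch, seq_len = 2, 8
tokens = torch.randint(vocab_size, (batch, seq_len))
logits = torch.randn(batch, seq_len, vocab_size)  # stand-in for model output

# Shift by one: the prediction at position t is scored against token t+1.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)
print(loss.item())
```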
8. ● As a first step, they hired a team of 40 contractors, selected based on their screening-test results.
● Then they trained their supervised learning baselines on human-written demonstrations of the desired output behavior, using (mostly English) prompts submitted to the OpenAI API and some labeler-written prompts. (Source of training data)
Supervised fine-tuning (SFT) model
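A minimal sketch of this SFT step, assuming a Hugging Face-style causal language model and tokenizer; the function name is illustrative, and it is simplified in that it computes the loss over the prompt tokens as well as the demonstration.

```python
import torch.nn.functional as F

def sft_loss(model, tokenizer, prompt, demonstration):
    """Supervised fine-tuning loss on one labeler demonstration (sketch)."""
    ids = tokenizer(prompt + demonstration, return_tensors="pt").input_ids
    logits = model(input_ids=ids).logits  # (1, seq_len, vocab_size)
    # Next-token cross-entropy over the concatenated prompt + demonstration.
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        ids[:, 1:].reshape(-1),
    )
```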
10. ● Secondly, they gather a dataset of human-labeled comparisons between outputs from OpenAI's models on a larger set of API prompts.
● Then they train a reward model (RM) on this dataset to predict which model output their labelers would prefer.
Reward model (RM) training
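Each labeler ranking of K responses yields K-choose-2 pairwise comparisons for RM training. A small sketch of that expansion (the function name is illustrative):

```python
from itertools import combinations

def ranking_to_pairs(responses_best_first):
    """Expand one ranking of K responses into all K-choose-2
    (preferred, dispreferred) pairs for reward-model training."""
    return list(combinations(responses_best_first, 2))

# Example: a ranking of 4 responses yields 6 comparison pairs.
pairs = ranking_to_pairs(["a", "b", "c", "d"])
assert len(pairs) == 6
```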
12. ● Finally, they use this RM as a reward function and
fine-tune the supervised learning baseline to maximize
this reward using the PPO algorithm.
Optimizing a policy against the reward model using reinforcement learning (RL)
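For reference, the full RL objective from the paper; the γ pretraining term is what distinguishes the PPO-ptx variant:

$$
\text{objective}(\phi) = \mathbb{E}_{(x,y)\sim D_{\pi_\phi^{\text{RL}}}}\!\left[ r_\theta(x,y) - \beta \log\frac{\pi_\phi^{\text{RL}}(y\mid x)}{\pi^{\text{SFT}}(y\mid x)} \right] + \gamma\, \mathbb{E}_{x\sim D_{\text{pretrain}}}\!\left[ \log \pi_\phi^{\text{RL}}(x) \right]
$$

where $\pi_\phi^{\text{RL}}$ is the learned RL policy, $\pi^{\text{SFT}}$ is the supervised fine-tuned model, β controls the strength of the per-token KL penalty, and γ controls the strength of the pretraining-gradient mix.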
13. ● They primarily evaluate their models by having their
labelers rate the quality of model outputs on the test
set, consisting of prompts from held-out customers who
were not included in the training data.
● Additionally, they perform automatic evaluations using
various public NLP datasets.
Evaluation
14. ● SFT: stands for supervised fine-tuning model.
● PPO: stands for proximal policy optimization, the reinforcement learning algorithm used in this paper to fine-tune the language model with human feedback.
● PPO-ptx: a variant of PPO, used to fine-tune the InstructGPT models, that additionally mixes in pretraining gradients.
● GPT: stands for Generative Pre-trained Transformer, the base model used for natural language processing (NLP) tasks.
● GPT (prompted): GPT-3 given a few-shot prefix designed to put it in an instruction-following mode.
● (GPT, GPT prompted): the GPT-3 baselines.
They found that outputs from the 1.3B parameter
InstructGPT model are preferred to outputs from the 175B
GPT-3 model, despite having 100x fewer parameters.
Human evaluations on OpenAI API
prompt distribution
15. They experiment with three sizes of GPT-3 language models (1.3B, 6B, and 175B parameters), training InstructGPT models at each size, and compare them to the original GPT-3 models, as shown on the previous slide.
Variants of the model in terms of size
17. Comparing the performance of GPT-3 and InstructGPT models on various tasks
InstructGPT
● InstructGPT models outperform GPT-3 at generating appropriate outputs, following explicit constraints in the instructions, and generating more truthful and informative answers.
● InstructGPT models make up information not present in the input about half as often as GPT-3 models.
● InstructGPT models show small improvements in toxicity, but not in bias.
● Performance regressions on public NLP datasets are minimized (via the modified RLHF fine-tuning procedure).
● InstructGPT models generalize to the preferences of “held-out” labelers.
● InstructGPT models show promising generalization to instructions outside of the RLHF fine-tuning distribution.
Vs
GPT-3
● GPT-3 models do not follow instructions as well, even when given a few-shot prompt designed to make them better at instruction following.
● GPT-3 models can perform many tasks, but require more careful prompting and do not usually follow instructions.
19. Objective 1: Creating a language model, called InstructGPT, that helpfully and safely follows a broad class of written instructions while avoiding untruthful, toxic, or otherwise harmful outputs.
Objective 2: Showing that fine-tuning language models with human feedback is a promising approach for aligning them with human intent.
Objective 3: Demonstrating that the resulting model, with 1.3 billion parameters, can be preferred over the 175-billion-parameter source model.
22. ● Research on alignment and learning from human
feedback.
● Training language models to follow instructions.
● Evaluating the harms of language models.
● Modifying the behavior of language models to mitigate
harms.
Related work
23. High-level methodology
1. Collect demonstration data and train a supervised
policy
2. Collect comparison data and train a reward model
3. Optimize a policy against the reward model using PPO.
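A skeleton of how these three stages chain together; all function names are hypothetical stand-ins with trivial stubs so the sketch runs, not the paper's code.

```python
def sft_finetune(base_model, demonstrations):
    """Stage 1 (stub): train a supervised policy on labeler demonstrations."""
    return base_model

def train_rm(policy, ranked_comparisons):
    """Stage 2 (stub): train a reward model on ranked output comparisons."""
    return lambda prompt, response: 0.0

def ppo_optimize(policy, reward_model, prompts):
    """Stage 3 (stub): optimize the policy against the reward model with PPO."""
    return policy

policy = sft_finetune("gpt-3", demonstrations=[])
reward_model = train_rm(policy, ranked_comparisons=[])
policy = ppo_optimize(policy, reward_model, prompts=[])
```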
Methods and experimental details
24. Dataset
To train the first InstructGPT models, labelers needed to write prompts themselves, since an initial source of instruction-like prompts was needed to bootstrap the process.
Three kinds of prompts were used:
Plain: labelers write an arbitrary task.
Few-shot: an instruction with multiple query/response pairs.
User-based: prompts based on use cases from OpenAI API waitlist applications.
Methods and experimental details
25. Data cleaning
They heuristically de-duplicate prompts by checking for
prompts that share a long common prefix, and they limit
the number of prompts to 200 per user ID.
They also create their train, validation, and test splits
based on user ID, so that the validation and test sets
contain no data from users whose data is in the training
set.
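A minimal sketch of this cleanup under assumed parameters: the 40-character prefix length and the split fractions are illustrative; the slide specifies only the 200-per-user cap and the split-by-user rule.

```python
import hashlib

def clean_prompts(records, prefix_len=40, max_per_user=200):
    """records: list of dicts {"user_id": str, "prompt": str} (sketch)."""
    seen_prefixes, per_user, cleaned = set(), {}, []
    for r in records:
        prefix = r["prompt"][:prefix_len]
        if prefix in seen_prefixes:
            continue  # heuristic de-duplication by shared long prefix
        if per_user.get(r["user_id"], 0) >= max_per_user:
            continue  # cap of 200 prompts per user ID
        seen_prefixes.add(prefix)
        per_user[r["user_id"]] = per_user.get(r["user_id"], 0) + 1
        cleaned.append(r)
    return cleaned

def split_by_user(records, val_frac=0.1, test_frac=0.1):
    """Assign each user ID to exactly one split, so validation and test
    contain no data from users whose data is in the training set."""
    splits = {"train": [], "valid": [], "test": []}
    for r in records:
        h = int(hashlib.md5(r["user_id"].encode()).hexdigest(), 16) % 100
        if h < test_frac * 100:
            splits["test"].append(r)
        elif h < (test_frac + val_frac) * 100:
            splits["valid"].append(r)
        else:
            splits["train"].append(r)
    return splits
```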
Methods and experimental details
26. From these prompts, they produced three different datasets used in their fine-tuning procedure: the SFT dataset of labeler demonstrations (about 13k training prompts), the RM dataset of labeler rankings (about 33k prompts), and the PPO dataset used as input for RLHF fine-tuning (about 31k prompts, with no human labels).
Methods and experimental details
27. Tasks
Their training tasks are from two sources:
1. Dataset of prompts written by their labelers
2. Dataset of prompts submitted to early InstructGPT
models on their API
These prompts are very diverse and include generation,
question answering, dialog, summarization, extractions,
and other natural language tasks
Methods and experimental details
28. Human data collection
The aim was to use screening tests to select a group of labelers who were sensitive to the preferences of different demographic groups and able to identify potentially harmful outputs.
During training and evaluation, the alignment criteria may come into conflict:
● Training: prioritize helpfulness to the user
● Evaluation: prioritize truthfulness and harmlessness
Methods and experimental details
29. Human data collection
To test whether the model generalizes to other labelers, a separate set of labelers, who do not produce any training data, are hired (held-out labelers).
Despite the complexity of the task, inter-annotator agreement rates are quite high:
● Training labelers agree with each other 72.6 ± 1.5% of the time
● For held-out labelers, the number is 77.3 ± 1.3%
Methods and experimental details
30. Models
Supervised fine-tuning (SFT)
• Fine-tune GPT-3 on labeler demonstrations using supervised learning
• Train for 16 epochs using cosine learning-rate decay and residual dropout of 0.2
• SFT models overfit on validation loss after 1 epoch
Reward modeling (RM)
• Starting with the SFT model with the final unembedding layer removed, train a model to take in a prompt and response and output a scalar reward
• Present labelers with K = 4 to K = 9 responses to rank
• Train on all $\binom{K}{2}$ comparisons from each prompt as a single batch element
Reinforcement learning (RL)
• Fine-tune the SFT model with PPO
• Given a prompt and response, the environment produces a reward determined by the reward model and ends the episode
• Add a per-token KL penalty from the SFT model at each token to mitigate over-optimization of the reward model
Methods and experimental details
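A minimal sketch of the per-token KL shaping, assuming per-token log-probabilities are available for the sampled response; the function name and β = 0.02 are illustrative assumptions, not values from the paper.

```python
import torch

def kl_shaped_rewards(policy_logprobs, sft_logprobs, rm_score, beta=0.02):
    """Per-token PPO rewards with a KL penalty from the SFT model (sketch).

    policy_logprobs, sft_logprobs: (T,) log-probs of the sampled response
    tokens under the current policy and the frozen SFT model.
    rm_score: scalar reward-model score for the full (prompt, response).
    """
    # Penalize drift from the SFT model at every token.
    rewards = -beta * (policy_logprobs - sft_logprobs)
    # The reward model's score arrives only at the final token,
    # where the episode ends.
    rewards[-1] += rm_score
    return rewards

# Example with dummy values for a 5-token response:
r = kl_shaped_rewards(torch.randn(5), torch.randn(5), rm_score=1.3)
```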
31. Loss function for the reward model
$$
\operatorname{loss}(\theta) = -\frac{1}{\binom{K}{2}}\, \mathbb{E}_{(x,\, y_w,\, y_l) \sim D}\!\left[ \log\!\left( \sigma\!\left( r_\theta(x, y_w) - r_\theta(x, y_l) \right) \right) \right]
$$
where $r_\theta(x, y)$ is the scalar output of the reward model for prompt x and completion y with parameters θ, $y_w$ is the preferred completion out of the pair $(y_w, y_l)$, and D is the dataset of human comparisons.
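A minimal PyTorch sketch of this loss; the function name is illustrative, and the batching is simplified relative to the paper's treatment of each prompt's comparisons as a single batch element.

```python
import torch
import torch.nn.functional as F

def rm_pairwise_loss(r_w, r_l):
    """Pairwise ranking loss for the reward model (sketch).

    r_w: r_theta(x, y_w) scores for the preferred completions, shape (N,)
    r_l: r_theta(x, y_l) scores for the dispreferred completions, shape (N,)
    N would be the K-choose-2 comparisons drawn from one prompt.
    """
    # -log(sigmoid(r_w - r_l)), averaged over the pairs.
    return -F.logsigmoid(r_w - r_l).mean()

# Example with dummy scores for 6 pairs (K = 4):
loss = rm_pairwise_loss(torch.randn(6), torch.randn(6))
```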
Methods and experimental details
32. Evaluation
Evaluations on API distribution
• The main metric is human preference ratings on a held-out set of prompts from the same source as the training distribution
• Prompts are chosen equally between GPT and InstructGPT in order not to bias between the models
Evaluations on public NLP datasets
• Datasets that capture aspects of language-model safety: truthfulness, toxicity, and bias
• They evaluate on two types of public datasets, FLAN and T0, both consisting of a variety of NLP tasks; they also conduct human evaluations of toxicity on the RealToxicityPrompts dataset
Methods and experimental details
33. API distribution
● Labelers significantly prefer InstructGPT outputs over outputs from GPT-3
● InstructGPT models generalize to the preferences of ”held-out” labelers
Public NLP datasets
● InstructGPT shows improvements in truthfulness over GPT-3
● InstructGPT shows small improvements in toxicity over GPT-3, but not bias
● Performance regressions on public NLP datasets are minimized by modifying the RLHF fine-tuning procedure
Results
34. Qualitative results
● They notice that InstructGPT often produces an output in English even when the instruction is in another language.
● In comparison, they find that GPT-3 can perform these tasks but requires more careful prompting, and rarely follows instructions in these domains.
Results
35. Preference results
Results
Left: results on prompts submitted to
GPT models on the API.
Right: results on prompts submitted to
InstructGPT models on the API.
Top: results from held-out labelers.
Bottom: results from training labelers.
36. Metadata results on the API
distribution
Results
Compared to GPT-3, the PPO
models are more appropriate
in the context of a customer
assistant, are better at
following explicit
constraints in the
instruction and attempting
the correct instruction, and
are less likely to make up
information on closed domain
tasks.
37. Comparing with FLAN and T0
in terms of Likert scores
Results
On a 1-7 scale, on the
InstructGPT prompt distribution.
FLAN and T0 perform better than
default GPT-3, and comparably
with a few-shot GPT-3 model
placed into ‘instruction-
following’ mode.
38. Results on the TruthfulQA
dataset
Results
Gray bars indicate ratings of
truthfulness.
Colored bars indicate ratings of
truthfulness and informativeness.
39. Results comparisons on
mainstream NLP tasks
Results
They compare the performance of their
InstructGPT models to the original GPT-3 model
on several mainstream NLP tasks, including
sentiment analysis, question answering, text
classification, and other natural language tasks
(see Table 1).
They found that the InstructGPT models performed
similarly to or slightly worse than the original
GPT-3 model on these tasks, but with
improvements in truthfulness and reductions in
toxic output generation.
40. Implications for alignment research
1. The cost of increasing model alignment is modest relative to pretraining
○ Training the 175B PPO-ptx model requires about 60 petaflop/s-days
○ versus about 3,640 petaflop/s-days for pretraining GPT-3
2. Evidence that InstructGPT generalizes “following instructions” to settings without supervision
3. Most of the performance degradations introduced by fine-tuning are mitigated
4. Alignment techniques from research are validated in the real world
Discussion
41. Limitations
01 InstructGPT models' behavior is influenced by human feedback from contractors
02 Labeling tasks may be impacted by contractors' beliefs, cultural backgrounds, and personal history
03 The team of contractors is not representative of the full spectrum of people who will use the models
04 Labelers are primarily English-speaking, and the data consists almost entirely of English instructions
Discussion
42. Limitations (continued)
05 The models are not fully aligned or fully safe; they can generate toxic or biased outputs, make up facts, and generate sexual and violent content without explicit prompting (examples of these mistakes come up on the next slide)
06 The models follow the user's instruction even if it could lead to harm in the real world
Discussion
43. Model mistakes
● Confused by instructions that assume false premises
● Hedges excessively, rather than directly answering simple questions
44. Discussion: Open questions
OQ 01: Several methods could be tried to further decrease the models' propensity to generate toxic, biased, or otherwise harmful outputs.
OQ 02: Training models to be harmless despite user instructions is important, but difficult, since whether an output is harmful depends on the context in which it's deployed.
OQ 03: There is a vast space of options for designing interfaces for labelers to provide feedback to language models; this is an interesting human-computer interaction problem.
OQ 04: How do we design an alignment process that is transparent?
45. Additional details
Additional prompt data details
Additional human data collection details
Additional model details
Automatic evaluation details
Additional results
Model samples
46. Data diversity
Additional prompt
data details
A subset of their labeled prompt metadata.
Note that their annotation fields changed
over the course of the project, so not
every prompt was annotated for every field.
47. Web interface
Additional human data collection details
For each output, labelers give a Likert score for overall quality on a 1-7 scale and also provide various metadata labels.
48. Web interface
Additional human
data collection
details
Labelers rank all the outputs for a
given prompt. Ties are encouraged in
cases where two outputs seem to be of
similar quality.
53. • Developing better ways to detect and remove
biased or harmful content
• Incorporating ethical considerations into
the design of their models
• Implementing safeguards to prevent the
generation of harmful outputs
55. • The paper concludes that fine-tuning language models with
human feedback is a promising direction for aligning these
models with user intent.
• The authors demonstrate this approach using the GPT-3
language model and show that their method, called
InstructGPT, can improve the truthfulness and reduce the
toxicity of model outputs while maintaining performance on
public NLP datasets.
• They also found that the 1.3B parameter InstructGPT model
is preferred to the 175B GPT-3 model in human evaluations,
despite having fewer parameters.
• The authors suggest that their method could be applied to a
wide range of NLP tasks and could help address concerns
about the ethical implications of large language models.