InstructGPT
Training language models to follow instructions
with human feedback
Long Ouyang, Jeff Wu, Xu Jiang et al.
National Yang Ming Chiao Tung University, Hsinchu
Speaker: Po-Chuan, Chen
January 10, 2023
1 / 50
InstructGPT
Table of contents
1 Abstract
2 Introduction
3 Methods and experimental details
Dataset
Tasks and Human data collection
Models
Evaluation
4 Results
5 Discussion
6 Additional details
2 / 50
InstructGPT
Abstract
Table of contents
1 Abstract
2 Introduction
3 Methods and experimental details
Dataset
Tasks and Human data collection
Models
Evaluation
4 Results
5 Discussion
6 Additional details
3 / 50
InstructGPT
Abstract
Abstract
This paper shows how to align language models with user intent
on a wide range of tasks by fine-tuning with human feedback.
Starting from labeler-written prompts and prompts submitted through
the OpenAI API, the authors collect a dataset of labeler demonstrations
of the desired model behavior, which they use to fine-tune GPT-3 with
supervised learning.
They then collect a dataset of rankings of model outputs and use it to
further fine-tune the supervised model with reinforcement learning from
human feedback; the resulting model is called InstructGPT.
4 / 50
InstructGPT
Abstract
Contribution
InstructGPT shows that fine-tuning with human feedback is a
promising direction for aligning language models with human intent.
The model shows improvements in truthfulness and reductions in
toxic output generation while having minimal performance
regressions on public NLP datasets.
Also, outputs from the 1.3B-parameter InstructGPT model are preferred
to outputs from the 175B GPT-3, despite having far fewer parameters.
5 / 50
InstructGPT
Introduction
Table of contents
1 Abstract
2 Introduction
3 Methods and experimental details
Dataset
Tasks and Human data collection
Models
Evaluation
4 Results
5 Discussion
6 Additional details
6 / 50
InstructGPT
Introduction
Introduction
Large language models (LMs) can be prompted to perform a
range of natural language processing (NLP) tasks, given some
examples of the task as input.
This work makes progress on aligning language models by training them
to act in accordance with the user’s intention, evaluating the models
on whether they are helpful, honest, and harmless.
It fine-tunes language models to follow a broad class of written
instructions using reinforcement learning from human feedback (RLHF).
7 / 50
InstructGPT
Introduction
Framework about GPT-3
8 / 50
InstructGPT
Introduction
Three steps for building InstructGPT
9 / 50
InstructGPT
Introduction
Training supervised fine-tuning model
First, they hire a team of 40 contractors to label data, selected based
on their performance on a screening test.
They then collect a dataset of human-written demonstrations of the
desired output behavior on (mostly English) prompts submitted to
the OpenAI API, plus some labeler-written prompts, and use this to
train their supervised learning baselines.
10 / 50
InstructGPT
Introduction
Three steps for building InstructGPT
11 / 50
InstructGPT
Introduction
Training reward model
Second, they collect a dataset of human-labeled comparisons
between outputs from OpenAI’s models on a larger set of API
prompts.
They then train a reward model (RM) on this dataset to predict which
model output their labelers would prefer.
12 / 50
InstructGPT
Introduction
Three steps for building InstructGPT
13 / 50
InstructGPT
Introduction
Optimizing a policy against the reward model with RL
Finally, they use this RM as a reward function and fine-tune the
supervised learning baseline to maximize this reward using the PPO
algorithm.
They mainly evaluate their models by having labelers rate the
quality of model outputs on a test set consisting of prompts from
held-out customers (who are not represented in the training data).
They also conduct automatic evaluations on a range of public NLP
datasets.
14 / 50
InstructGPT
Introduction
Human evaluations on OpenAI API prompt distribution
15 / 50
InstructGPT
Introduction
Main findings
Minimizing performance regressions on public NLP datasets
Mixing PPO updates with updates that increase the log
likelihood of the pretraining distribution (PPO-ptx)
Generalizing to the preferences of “held-out” labelers
InstructGPT models show promising generalization to
instructions outside of the RLHF fine-tuning distribution
16 / 50
InstructGPT
Methods and experimental details
Table of contents
1 Abstract
2 Introduction
3 Methods and experimental details
Dataset
Tasks and Human data collection
Models
Evaluation
4 Results
5 Discussion
6 Additional details
17 / 50
InstructGPT
Methods and experimental details
Dataset
Dataset
To train the very first InstructGPT models, labelers had to write
prompts themselves, since an initial source of instruction-like
prompts was needed to bootstrap the process, and such prompts were
rarely submitted to the regular GPT-3 models.
Plain: labelers come up with an arbitrary task
Few-shot: labelers come up with an instruction, plus multiple
query/response pairs for that instruction
User-based: labelers come up with prompts corresponding to a number
of use cases stated in applications to the OpenAI API
18 / 50
InstructGPT
Methods and experimental details
Dataset
Use-case categories
19 / 50
InstructGPT
Methods and experimental details
Tasks and Human data collection
Tasks and Human data collection
The aim was to select a group of labelers who were sensitive to the
preferences of different demographic groups and who were good at
identifying potentially harmful outputs; candidates were selected
via a screening test.
During training and evaluation, the alignment criteria may come into
conflict:
Training: prioritize helpfulness to the user
Evaluation: prioritize truthfulness and harmlessness
20 / 50
InstructGPT
Methods and experimental details
Tasks and Human data collection
As an initial study to see how well the model generalizes to the
preferences of other labelers, they hire a separate set of labelers who
do not produce any of the training data.
Despite the complexity of the task, they find that inter-annotator
agreement rates are quite high:
Training labelers: agree with each other 72.6 ± 1.5% of the time
Held-out labelers: 77.3 ± 1.3%
21 / 50
InstructGPT
Methods and experimental details
Models
Models
These models are trained on a broad distribution of Internet data and
are adaptable to a wide range of downstream tasks, but have poorly
characterized behavior.
1 Supervised fine-tuning (SFT)
2 Reward modeling (RM)
3 Reinforcement learning (RL)
22 / 50
InstructGPT
Methods and experimental details
Models
Supervised fine-tuning (SFT)
They fine-tune GPT-3 on their labeler demonstrations using
supervised learning.
They then select the final SFT model based on the RM score on the
validation set. Although the SFT models overfit on validation loss
after 1 epoch, training for more epochs still helps both the RM score
and human preference ratings.
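As a rough illustration of this step, the sketch below fine-tunes a small causal LM on toy demonstration data with the standard next-token cross-entropy loss; the model name, data, and hyperparameters are placeholders, not the paper's actual GPT-3 setup.

```python
# Minimal SFT sketch (assumed setup, not the paper's implementation):
# fine-tune a causal LM on labeler demonstrations with the usual LM loss.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")    # stand-in for GPT-3
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# toy (prompt, demonstration) pairs; the real data comes from labelers / the API
demos = [("Explain the moon landing to a 6 year old.",
          "People went to the moon in a rocket and walked on it.")]

def collate(batch):
    texts = [prompt + "\n" + demo for prompt, demo in batch]
    enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    enc["labels"] = enc["input_ids"].clone()   # next-token prediction targets
    return enc

loader = DataLoader(demos, batch_size=1, collate_fn=collate)
optim = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for batch in loader:
    loss = model(**batch).loss    # cross-entropy over the demonstration tokens
    loss.backward()
    optim.step()
    optim.zero_grad()
```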
23 / 50
InstructGPT
Methods and experimental details
Models
Reward modeling (RM)
Starting from the SFT model with the final unembedding layer
removed, they train a model that takes in a prompt and response and
outputs a scalar reward.
They only use 6B RMs, as this saves a lot of compute, and training a
larger RM could be unstable, which would make it less suitable for use
as the value function during RL.
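A minimal sketch of that architecture, assuming a Hugging Face backbone as a stand-in for the 6B SFT model: the language-model (unembedding) head is simply not used, and a linear head maps the final token's hidden state to a scalar reward.

```python
# Hypothetical reward model: transformer backbone without the LM head,
# plus a linear head that outputs one scalar per (prompt, response) pair.
import torch
import torch.nn as nn
from transformers import AutoModel

class RewardModel(nn.Module):
    def __init__(self, backbone_name="gpt2"):       # stand-in for the 6B SFT model
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)  # no unembedding
        self.value_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        last = attention_mask.sum(dim=1) - 1         # index of last real token
        summary = hidden[torch.arange(hidden.size(0)), last]
        return self.value_head(summary).squeeze(-1)  # scalar reward per example
```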
24 / 50
InstructGPT
Methods and experimental details
Models
Loss function for the reward model
In order to speed up comparison collection, they present labelers with
anywhere between K = 4 and K = 9 responses to rank.
loss(\theta) = -\frac{1}{\binom{K}{2}} \, \mathbb{E}_{(x,\, y_w,\, y_l) \sim D} \Big[ \log \big( \sigma \big( r_\theta(x, y_w) - r_\theta(x, y_l) \big) \big) \Big]
where r𝜃 (x, y) is the scalar output of the reward model for prompt x
and completion y with parameters 𝜃, yw is the preferred completion out
of the pair of yw and yl, and D is the dataset of human comparisons.
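Written as code, this loss is just a negative log-sigmoid over every preferred/dispreferred pair from the same prompt, averaged over the C(K, 2) pairs. A sketch under the assumption that the K responses arrive already sorted by labeler preference:

```python
# Pairwise ranking loss sketch: -log sigmoid(r(x, y_w) - r(x, y_l)),
# averaged over all K-choose-2 pairs from one prompt's ranked responses.
import itertools
import torch
import torch.nn.functional as F

def rm_pairwise_loss(rewards_ranked: torch.Tensor) -> torch.Tensor:
    """rewards_ranked: scalar rewards for one prompt's K responses,
    ordered from most preferred (index 0) to least preferred."""
    losses = []
    for w, l in itertools.combinations(range(rewards_ranked.size(0)), 2):
        # response w is ranked above response l by the labeler
        losses.append(-F.logsigmoid(rewards_ranked[w] - rewards_ranked[l]))
    return torch.stack(losses).mean()   # mean implements the 1 / C(K, 2) factor

# e.g. K = 4 responses scored by the reward model
loss = rm_pairwise_loss(torch.tensor([1.2, 0.7, 0.1, -0.5]))
```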
25 / 50
InstructGPT
Methods and experimental details
Models
Reinforcement learning (RL) with PPO
They fine-tune the SFT model on their environment using PPO; the
environment is a bandit environment that presents a random customer
prompt and expects a response to the prompt.
In addition, they add a per-token KL penalty from the SFT model to
mitigate over-optimization of the reward model, and the value function
is initialized from the RM.
objective(\phi) = \mathbb{E}_{(x, y) \sim D_{\pi_\phi^{\mathrm{RL}}}} \Big[ r_\theta(x, y) - \beta \log \big( \pi_\phi^{\mathrm{RL}}(y \mid x) / \pi^{\mathrm{SFT}}(y \mid x) \big) \Big]
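A small sketch of the reward this objective corresponds to in practice, assuming per-token log-probabilities of the sampled response under the current policy and the frozen SFT model; the PPO update itself (advantages, clipping, value-function training) is left out, and the β value is illustrative.

```python
# Sketch of the KL-penalized reward fed to PPO: a per-token penalty of
# beta * (log pi_RL - log pi_SFT), with the RM score added on the final token.
import torch

def kl_penalized_rewards(rm_score, logp_policy, logp_sft, beta=0.02):
    """rm_score: scalar RM score for the full (prompt, response).
    logp_policy / logp_sft: per-token log-probs of the sampled response
    under the current policy and the frozen SFT model (same length).
    beta is an illustrative coefficient, not the paper's tuned value."""
    kl_per_token = logp_policy - logp_sft     # sample estimate of the log-ratio
    rewards = -beta * kl_per_token            # KL penalty applied at every token
    rewards[-1] = rewards[-1] + rm_score      # RM reward granted at the end
    return rewards

rewards = kl_penalized_rewards(
    rm_score=torch.tensor(0.8),
    logp_policy=torch.tensor([-1.1, -0.4, -2.3]),
    logp_sft=torch.tensor([-1.0, -0.6, -2.0]),
)
```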
26 / 50
InstructGPT
Methods and experimental details
Models
PPO-ptx models
In order to fix the performance regressions on public NLP datasets,
they mix pretraining gradients into the PPO gradients.
objective(\phi) = \mathbb{E}_{(x, y) \sim D_{\pi_\phi^{\mathrm{RL}}}} \Big[ r_\theta(x, y) - \beta \log \big( \pi_\phi^{\mathrm{RL}}(y \mid x) / \pi^{\mathrm{SFT}}(y \mid x) \big) \Big] + \gamma \, \mathbb{E}_{x \sim D_{\mathrm{pretrain}}} \Big[ \log \pi_\phi^{\mathrm{RL}}(x) \Big]
where \pi_\phi^{\mathrm{RL}} is the learned RL policy, \pi^{\mathrm{SFT}} is the supervised
fine-tuned model, and D_{\mathrm{pretrain}} is the pretraining distribution; the
coefficient \beta controls the KL reward and \gamma the pretraining loss.
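As a sketch of the extra term, the pretraining part is just a standard language-modeling loss on batches drawn from the pretraining distribution, scaled by γ and added to the PPO loss; the names and the default γ below are placeholders, not the paper's tuned values.

```python
# PPO-ptx sketch: add gamma * LM cross-entropy on a pretraining batch
# to the PPO loss computed on the RLHF prompts.
import torch

def ppo_ptx_loss(ppo_loss, policy_model, pretrain_batch, gamma=1.0):
    """ppo_loss: scalar PPO loss already computed on API prompts.
    pretrain_batch: dict with 'input_ids' / 'attention_mask' sampled from
    the pretraining distribution (padding-token masking omitted for brevity).
    gamma here is illustrative only."""
    lm_out = policy_model(**pretrain_batch, labels=pretrain_batch["input_ids"])
    # maximizing gamma * E[log pi_RL(x)]  ==  minimizing gamma * cross-entropy
    return ppo_loss + gamma * lm_out.loss
```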
27 / 50
InstructGPT
Methods and experimental details
Evaluation
Evaluation
API distribution
The main metric is human preference ratings on a held-out set of
prompts from the same source as the training distribution.
Public NLP datasets
They evaluate on two types of public datasets, FLAN and T0, both
consisting of a variety of NLP tasks. They also conduct human
evaluations of toxicity on the RealToxicityPrompts dataset.
28 / 50
InstructGPT
Methods and experimental details
Evaluation
Labeler-collected metadata on the API distribution
29 / 50
InstructGPT
Results
Table of contents
1 Abstract
2 Introduction
3 Methods and experimental details
Dataset
Tasks and Human data collection
Models
Evaluation
4 Results
5 Discussion
6 Additional details
30 / 50
InstructGPT
Results
Results
API distribution
Labelers significantly prefer InstructGPT outputs over outputs
from GPT-3
Generalizing to the preferences of “held-out” labelers
Public NLP datasets are not reflective of how their language
models are used
Public NLP datasets
Showing improvements in truthfulness over GPT-3
Showing small improvements in toxicity over GPT-3, but not bias
Minimizing performance regressions on public NLP datasets by
modifying their RLHF fine-tuning procedure
31 / 50
InstructGPT
Results
Preference results
32 / 50
InstructGPT
Results
Metadata results on the API distribution
33 / 50
InstructGPT
Results
Comparing with FLAN and T0 in terms of Likert scores
34 / 50
InstructGPT
Results
Results on the TruthfulQA dataset
35 / 50
InstructGPT
Results
36 / 50
InstructGPT
Results
Qualitative results
InstructGPT shows promising generalization to instructions outside the
fine-tuning distribution, such as instructions in other languages,
though they notice that it often produces an output in English even
when the instruction is in another language.
In comparison, they find that GPT-3 can perform these tasks but
requires more careful prompting, and rarely follows instructions in
these domains.
37 / 50
InstructGPT
Results
Simple mistakes
Confused by instructions that assume false premises
Overly hedging, rather than directly answering simple questions
38 / 50
InstructGPT
Discussion
Table of contents
1 Abstract
2 Introduction
3 Methods and experimental details
Dataset
Tasks and Human data collection
Models
Evaluation
4 Results
5 Discussion
6 Additional details
39 / 50
InstructGPT
Discussion
Discussion
Implications for alignment research
1 Increasing investments in alignment of existing language models
is more cost-effective than training larger models
2 More research is needed to study how well this generalization
scales with increased capabilities
The goal of this paper is to demonstrate that this alignment technique
can align models to a specific human reference group for a specific
application.
40 / 50
InstructGPT
Discussion
Future work
They need to design an alignment process that is transparent
They need to prevent the model from generating content that could
cause real-world harm
Filtering the pretraining data mix for toxic content
The model can't represent the full spectrum of people who will use it
41 / 50
InstructGPT
Additional details
Table of contents
1 Abstract
2 Introduction
3 Methods and experimental details
Dataset
Tasks and Human data collection
Models
Evaluation
4 Results
5 Discussion
6 Additional details
42 / 50
InstructGPT
Additional details
Additional details
1 Additional prompt data details
2 Additional human data collection details
3 Additional model details
4 Automatic evaluation details
5 Additional results
43 / 50
InstructGPT
Additional details
1.1 Dataset sizes
44 / 50
InstructGPT
Additional details
1.2 Data diversity
45 / 50
InstructGPT
Additional details
2.1 Web interface
46 / 50
InstructGPT
Additional details
2.1 Web interface
47 / 50
InstructGPT
Additional details
2.2 Labeler demographic data
48 / 50
InstructGPT
Additional details
5 Additional results (Zero-shot)
49 / 50
InstructGPT
Additional details
5 Additional results (Few-shot)
50 / 50
