Training Language Models to Follow
Instructions with Human Feedback
Austin Wang, Howard Chen
COS 597G
1
Nov 14 2022
Motivation: Alignment
3
The three Hʼs of Model Desiderata
Askell et al. 2021 A general language assistant as a laboratory for alignment 4
The three Hʼs of Model Desiderata
Helpful:
● The AI should help the user solve their task (e.g. answer
their questions)
Askell et al. 2021 A general language assistant as a laboratory for alignment 5
The three Hʼs of Model Desiderata
Helpful:
● The AI should help the user solve their task (e.g. answer
their questions)
Honest:
● The AI should give accurate information
● The AI should express uncertainty when the model doesnʼt
know the answer, instead of hallucinating a wrong answer
Askell et al. 2021 A general language assistant as a laboratory for alignment 6
The three Hʼs of Model Desiderata
Helpful:
● The AI should help the user solve their task (e.g. answer
their questions)
Honest:
● The AI should give accurate information
● The AI should express uncertainty when the model doesnʼt
know the answer, instead of hallucinating a wrong answer
Harmless:
● The AI should not cause physical, psychological, or social
harm to people or the environment
Askell et al. 2021 A general language assistant as a laboratory for alignment 7
The Misalignment of Models
Misalignment: When the training objective does not capture
the desiderata we want from models
8
The Misalignment of Models
Training: Predict the next token
The three Hʼs of Model Desiderata
Misalignment: When the training objective does not capture
the desiderata we want from models
9
Prior Works
10
Addressing Misalignment: Instruction Following
11
- The three Hʼs are one possible set of desiderata
- One more concrete desideratum is getting models to follow instructions
Addressing Misalignment: Instruction Following
Train: Next-token prediction -> Eval: Follow instructions (e.g. summarize this)
12
- The three Hʼs are one possible set of desiderata
- One more concrete desideratum is getting models to follow instructions
Addressing Misalignment: FLAN (Decoder models)
Wei et al. 2022 Finetuned Language Models are Zero shot Learners
Train: Next-token prediction -> Eval: Follow instructions (e.g. answer this question)
13
Addressing Misalignment: FLAN (Decoder models)
Wei et al. 2022 Finetuned Language Models are Zero shot Learners
Train: Next-token prediction -> Eval: Follow instructions (e.g. answer this question)
Instruction Tuning: Fine-tune models to follow instructions
14
Addressing Misalignment: FLAN (Decoder models)
Wei et al. 2022 Finetuned Language Models are Zero shot Learners
Train: Next-token prediction -> Eval: Follow instructions (e.g. answer this question)
Instruction Tuning: Fine-tune models to follow instructions
1. Aggregate Datasets (62): Collect wide
variety of public datasets
15
Addressing Misalignment: FLAN (Decoder models)
Wei et al. 2022 Finetuned Language Models are Zero shot Learners
Train: Next-token prediction -> Eval: Follow instructions (e.g. answer this question)
Instruction Tuning: Fine-tune models to follow instructions
1. Aggregate Datasets (62): Collect wide
variety of public datasets
2. Instruction Templates: Manually write 10
templates / dataset that captures task
16
Addressing Misalignment: FLAN (Decoder models)
Wei et al. 2022 Finetuned Language Models are Zero shot Learners
Train: Next-token prediction -> Eval: Follow instructions (e.g. answer this question)
Instruction Tuning: Fine-tune models to follow instructions
1. Aggregate Datasets (62): Collect wide
variety of public datasets
2. Instruction Templates: Manually write 10
templates / dataset that captures task
3. Fine-tune: Use the instruction templates
and datasets to fine-tune model
17
Addressing Misalignment: FLAN (Decoder models)
Wei et al. 2022 Finetuned Language Models are Zero shot Learners
Train: Next-token prediction -> Eval: Follow instructions (e.g. answer this question)
Instruction Tuning: Fine-tune models to follow instructions
1. Aggregate Datasets (62): Collect wide
variety of public datasets
2. Instruction Templates: Manually write 10
templates / dataset that captures task
3. Fine-tune: Use the instruction templates
and datasets to fine-tune model
4. Evaluate on held-out task
18
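To make step 2 concrete, here is a toy sketch of applying one instruction template to a dataset example; the template wording and field names are made up for illustration, not taken from FLAN:

```python
# Toy illustration of instruction templating (hypothetical template, not from FLAN).
NLI_TEMPLATE = (
    "Premise: {premise}\n"
    "Hypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis? Answer yes, no, or maybe."
)

def to_instruction_example(example: dict) -> dict:
    """Turn a raw NLI example into an (instruction, target) pair for fine-tuning."""
    prompt = NLI_TEMPLATE.format(premise=example["premise"],
                                 hypothesis=example["hypothesis"])
    return {"input": prompt, "target": example["label"]}

print(to_instruction_example({
    "premise": "A dog is running in the park.",
    "hypothesis": "An animal is outside.",
    "label": "yes",
}))
```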
Addressing Misalignment: T0 (Encoder-Decoder models)
Sanh et al. 2022 Multitask Prompted Training Enables Zero-Shot Task Generalization
Train: Span prediction -> Eval: Follow instructions (e.g. answer this question)
Basically the same idea as FLAN, except fine-tune an encoder-decoder model (T5)
20
Addressing Misalignment: LaMDA
Thoppilan et al. 2022 LaMDA: Language Models for Dialog Applications
Train: Next-token prediction -> Eval: Dialogue with human users
21
Addressing Misalignment: LaMDA
Thoppilan et al. 2022 LaMDA: Language Models for Dialog Applications
Train: Next-token prediction -> Eval: Dialogue with human users
Solution: Add a bunch of dialogue text to your pretraining data
- 2.97B Documents
- 1.12B Dialogues and 13.39B Dialogue Utterances
- 1.56T words total
22
Addressing Misalignment: LaMDA
Thoppilan et al. 2022 LaMDA: Language Models for Dialog Applications
Train: Next-token prediction -> Eval: Dialogue with human users
Solution: Add a bunch of dialogue text to your pretraining data
- 2.97B Documents
- 1.12B Dialogues and 13.39B Dialogue Utterances
- 1.56T words total
23
Addressing Misalignment: LaMDA
Thoppilan et al. 2022 LaMDA: Language Models for Dialog Applications
Train: Next-token prediction -> Eval: Dialogue with human users
Solution: Add a bunch of dialogue text to your pretraining data
- 2.97B Documents
- 1.12B Dialogues and 13.39B Dialogue Utterances
- 1.56T words total
24
Learning from Human Feedback
25
Method: Human Annotators
40 annotators from Upwork/ScaleAI
- Screened, onboarded, and demographically diverse
Annotate the training set
26
Method: Human Annotators
40 annotators from Upwork/ScaleAI
- Screened, onboarded, and demographically diverse
Different annotators from Upwork/ScaleAI
- Not screened, to better mirror the real world
(First group annotates the training set; second group annotates the evaluation set)
27
Method: The SFT Model
28
Method: The SFT Model
29
Method: The SFT Model
A large collection of prompts:
- From OpenAI GPT3 Playground
32
Method: The SFT Model
A large collection of prompts:
- From OpenAI GPT3 Playground
- Filtered to remove PII
- Filtered to remove duplicates
- Limited to 200 examples / user
- Annotators are also tasked with writing prompts
33
Method: The SFT Model
34
(Table: number of prompts)
Method: The SFT Model
35
Method: The SFT Model
Fine-tune the model; call this the SFT model
- Initialized with the pretrained GPT-3 175B model, and trained
for 16 epochs on the demonstration data
36
Method: The SFT Model
Fine-tune the model; call this the SFT model
- Initialized with the pretrained GPT-3 175B model, and trained
for 16 epochs on the demonstration data
- In notation, also referred to as:
37
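A minimal sketch of what this supervised fine-tuning step computes: standard next-token cross-entropy on the labeler demonstrations. This assumes a Hugging Face-style causal LM and tokenizer; the names are illustrative, not from the paper's code.

```python
import torch.nn.functional as F

def sft_loss(model, tokenizer, prompt: str, demonstration: str):
    """Cross-entropy of the model on a human-written demonstration for a prompt."""
    ids = tokenizer(prompt + demonstration, return_tensors="pt").input_ids
    logits = model(ids).logits                     # shape: (1, T, vocab)
    # Causal LM objective: predict token t+1 from tokens <= t.
    # (In practice one may also mask the prompt tokens out of the loss.)
    pred, target = logits[:, :-1, :], ids[:, 1:]
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))
```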
Method
38
Method
The outputs are sampled from the SFT model
39
(Table: number of prompts)
Method
To increase data collection throughput, each labeler is given K = 4 to
9 outputs to rank for each prompt
40
Method
r_θ: the reward model we are trying to optimize
x: the prompt
y_w: the better completion
y_l: the worse completion
41
Method
r_θ: the reward model we are trying to optimize
x: the prompt
y_w: the better completion
y_l: the worse completion
(Loss annotations: r_θ(x, y_w) is the reward on the better completion; r_θ(x, y_l) is the reward on the worse completion)
42
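For reference, this is the pairwise ranking loss the annotations refer to, in the form used in the paper, where σ is the sigmoid, D is the dataset of human comparisons, and the 1/(K choose 2) factor is the normalization discussed on the next slides:

```latex
\mathrm{loss}(\theta) \;=\; -\frac{1}{\binom{K}{2}}\,
\mathbb{E}_{(x,\,y_w,\,y_l)\sim D}
\Big[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\Big]
```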
Method
r_θ: the reward model we are trying to optimize
x: the prompt
y_w: the better completion
y_l: the worse completion
Small but important detail:
- Each prompt has K completions -> K choose 2 pairs to compare
43
Method
r_θ: the reward model we are trying to optimize
x: the prompt
y_w: the better completion
y_l: the worse completion
Small but important detail:
- Each prompt has K completions -> K choose 2 pairs to compare
- If for every batch we sample uniformly over every pair (from any prompt):
- Each completion can appear in K - 1 gradient updates
- This can lead to overfitting
44
Method
r_θ: the reward model we are trying to optimize
x: the prompt
y_w: the better completion
y_l: the worse completion
Small but important detail:
- Each prompt has K completions -> K choose 2 pairs to compare
- If for every batch we sample uniformly over every pair (from any prompt):
- Each completion can appear in K - 1 gradient updates
- This can lead to overfitting
- Solution: sample the prompt, and then put all K choose 2 pairs
from the prompt into the same batch
45
Method
r_θ: the reward model we are trying to optimize
x: the prompt
y_w: the better completion
y_l: the worse completion
Small but important detail:
- Each prompt has K completions -> K choose 2 pairs to compare
- If for every batch we sample uniformly over every pair (from any prompt):
- Each completion can appear in K - 1 gradient updates
- This can lead to overfitting
- Solution: sample the prompt, and then put all K choose 2 pairs
from the prompt into the same batch
- Corollary: computationally more efficient, since this only
requires K forward passes through r_θ for each prompt
- This is why there is the -1/(K choose 2) normalization in loss
46
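A minimal sketch of this batching scheme in PyTorch-style code (names are illustrative; `reward_model` is a hypothetical module returning a scalar reward for a prompt-completion pair): each prompt costs K forward passes, and all C(K, 2) comparisons reuse those scores inside one loss term.

```python
import itertools
import torch
import torch.nn.functional as F

def rm_loss_for_prompt(reward_model, prompt, ranked_completions):
    """Pairwise ranking loss for one prompt whose K completions are ordered best-to-worst."""
    K = len(ranked_completions)
    # K forward passes total; every C(K, 2) comparison reuses these scores.
    rewards = torch.stack([reward_model(prompt, c) for c in ranked_completions])

    pair_losses = []
    for i, j in itertools.combinations(range(K), 2):  # completion i is ranked above j
        pair_losses.append(-F.logsigmoid(rewards[i] - rewards[j]))

    # Averaging over the C(K, 2) pairs is the -1/(K choose 2) normalization in the loss.
    return torch.stack(pair_losses).mean()
```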
Method
47
Method
48
Method
Use the RM to update the SFT model from step 1; call this model PPO
49
Method
Use the RM to update the SFT model from step 1; call this model PPO
50
(Table: number of prompts)
Method
Use the RM to update the SFT model from step 1; call this model PPO
Two problems:
1. As the policy is updated during RL, its outputs become very different from
what the RM was trained on -> worse reward estimates
51
Method
Use the RM to update the SFT model from step 1; call this model PPO
Two problems:
1. As the policy is updated during RL, its outputs become very different from
what the RM was trained on -> worse reward estimates
Solution: add a KL penalty that keeps the PPO
model output from deviating too far from SFT
52
Method
Use the RM to update the SFT model from step 1; call this model PPO
Two problems:
1. As the policy is updated during RL, its outputs become very different from
what the RM was trained on -> worse reward estimates
Solution: add a KL penalty that keeps the PPO
model output from deviating too far from SFT
2. Using just the RL objective leads to performance degradation
on many NLP tasks
53
Method
Use the RM to update the SFT model from step 1; call this model PPO
Two problems:
1. As the policy is updated during RL, its outputs become very different from
what the RM was trained on -> worse reward estimates
Solution: add a KL penalty that keeps the PPO
model output from deviating too far from SFT
2. Using just the RL objective leads to performance degradation
on many NLP tasks
Solution: Add an auxiliary LM objective on the
pretraining data. Call this variant PPO-ptx
54
Method
Use the RM to update the SFT model from step 1; call this model PPO
Two problems:
1. As the policy is updated during RL, its outputs become very different from
what the RM was trained on -> worse reward estimates
Solution: add a KL penalty that keeps the PPO
model output from deviating too far from SFT
2. Using just the RL objective leads to performance degradation
on many NLP tasks
Solution: Add an auxiliary LM objective on the
pretraining data. Call this variant PPO-ptx (the combined objective is written out below)
55
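Putting the two fixes together, the objective being maximized in this step can be written as below (notation as in the paper: π_φ^RL is the policy being trained, π^SFT the SFT model, β the KL coefficient, and γ the pretraining coefficient; PPO is the case γ = 0, PPO-ptx the case γ > 0):

```latex
\mathrm{objective}(\phi) \;=\;
\mathbb{E}_{(x,y)\sim D_{\pi_\phi^{\mathrm{RL}}}}
\Big[\, r_\theta(x, y) \;-\; \beta \log \frac{\pi_\phi^{\mathrm{RL}}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)} \,\Big]
\;+\; \gamma\, \mathbb{E}_{x\sim D_{\mathrm{pretrain}}}\big[\log \pi_\phi^{\mathrm{RL}}(x)\big]
```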
Method
56
Method: Model Summary
57
Method: Model Summary
1. SFT: Supervised Fine-Tuning
a. GPT-3 fine-tuned on human demonstrations of prompt completions
58
Method: Model Summary
1. SFT: Supervised Fine-Tuning
a. GPT-3 fine-tuned on human demonstrations of prompt completions
2. RM: Reward Model
a. Not actually used to generate anything, but used to train the PPO and PPO-ptx
models
59
Method: Model Summary
1. SFT: Supervised Fine-Tuning
a. GPT-3 fine-tuned on human demonstrations of prompt completions
2. RM: Reward Model
a. Not actually used to generate anything, but used to train the PPO and PPO-ptx
models
3. PPO
a. SFT model further fine-tuned using RL with the RM providing the reward signal
b. A KL loss is added to prevent the PPO model from deviating too far from SFT
60
Method: Model Summary
1. SFT: Supervised Fine-Tuning
a. GPT-3 fine-tuned on human demonstrations of prompt completions
2. RM: Reward Model
a. Not actually used to generate anything, but used to train the PPO and PPO-ptx
models
3. PPO
a. SFT model further fine-tuned using RL with the RM providing the reward signal
b. A KL loss is added to prevent the PPO model from deviating too far from SFT
4. PPO-ptx
a. Identical to PPO, except with an additional auxiliary LM objective on the
pretraining data
61
Pre-lecture Q1
62
Describe the three datasets the authors collected: SFT, RM, PPO. What are the format of these datasets and
how are they used in the pipeline?
Pre-lecture Q1
63
Describe the three datasets the authors collected: SFT, RM, PPO. What are the format of these datasets and
how are they used in the pipeline?
1. SFT: Set of ~13k prompts (from labellers and API) and their corresponding labeller completions. Used
to train the SFT model.
Pre-lecture Q1
64
Describe the three datasets the authors collected: SFT, RM, PPO. What are the format of these datasets and
how are they used in the pipeline?
1. SFT: Set of ~13k prompts (from labellers and API) and their corresponding labeller completions. Used
to train the SFT model.
2. RM: Set of ~33k training prompts (from labellers and API), each with K corresponding SFT model
completions ranked by labellers. This is used to train the RM.
Pre-lecture Q1
65
Describe the three datasets the authors collected: SFT, RM, PPO. What are the format of these datasets and
how are they used in the pipeline?
1. SFT: Set of ~13k prompts (from labellers and API) and their corresponding labeller completions. Used
to train the SFT model.
2. RM: Set of ~33k training prompts (from labellers and API), each with K corresponding SFT model
completions ranked by labellers. This is used to train the RM.
3. PPO: Set of ~31k training prompts (from API only), used as input to the PPO and PPO-ptx model for
the policy optimization step.
Pre-lecture Q1
66
Describe the three datasets the authors collected: SFT, RM, PPO. What are the format of these datasets and
how are they used in the pipeline?
1. SFT: Set of ~13k prompts (from labellers and API) and their corresponding labeller completions. Used
to train the SFT model.
2. RM: Set of ~33k training prompts (from labellers and API), each with K corresponding SFT model
completions ranked by labellers. This is used to train the RM.
3. PPO: Set of ~31k training prompts (from API only), used as input to the PPO and PPO-ptx model for
the policy optimization step.
Note: None of these datasets are available publicly :(
Method: Why is RL using Human Feedback (RLHF) good?
The SFT approach also uses data to align with human desiderata, so why do RLHF?
67
Method: Why is RL using Human Feedback (RLHF) good?
The SFT approach also uses data to align with human desiderata, so why do RLHF?
1. Reward is a more nuanced training signal than autoregressive loss
a. If the correct next token is “great”, the AR loss penalizes the prediction “amazing” the same as
“sandwiches”. The RM assigns similar rewards to sequences with similar quality.
68
Method: Why is RL using Human Feedback (RLHF) good?
The SFT approach also uses data to align with human desiderata, so why do RLHF?
1. Reward is a more nuanced training signal than autoregressive loss
a. If the correct next token is “great”, the AR loss penalizes the prediction “amazing” the same as
“sandwiches”. The RM assigns similar rewards to sequences with similar quality.
2. The RM “critiques” actual completions generated from the model itself, whereas SFT training does not
use model generations, since it is completely offline.
a. This means the RM may provide more “tailored” feedback to the model
69
Method: Why is RL using Human Feedback (RLHF) good?
The SFT approach also uses data to align with human desiderata, so why do RLHF?
1. Reward is a more nuanced training signal than autoregressive loss
a. If the correct next token is “great”, the AR loss penalizes the prediction “amazing” the same as
“sandwiches”. The RM assigns similar rewards to sequences with similar quality.
2. The RM “critiques” actual completions generated from the model itself, whereas SFT training does not
use model generations, since it is completely offline.
a. This means the RM may provide more “tailored” feedback to the model
3. The RM more directly captures the notion of “preference”.
a. Preferences induce rankings, and rankings can be used to infer preferences
b. Ranking is very naturally captured by the reward signal, better sequences = higher reward
c. In SFT, preference is not explicitly captured, since we only train to regurgitate “the best” example
70
Method: Why is RL using Human Feedback (RLHF) good?
The SFT approach also uses data to align with human desiderata, so why do RLHF?
1. Reward is a more nuanced training signal than autoregressive loss
a. If the correct next token is “great”, the AR loss penalizes the prediction “amazing” the same as
“sandwiches”. The RM assigns similar rewards to sequences with similar quality (see the toy example after this list).
2. The RM “critiques” actual completions generated from the model itself, whereas SFT training does not
use model generations, since it is completely offline.
a. This means the RM may provide more “tailored” feedback to the model
3. The RM more directly captures the notion of “preference”.
a. Preferences induce rankings, and rankings can be used to infer preferences
b. Ranking is very naturally captured by the reward signal, better sequences = higher reward
c. In SFT, preference is not explicitly captured, since we only train to regurgitate “the best” example
4. The RM is more data efficient
a. There is a reason step 1 uses 13k prompts, but step 3 can use 31k prompts.
b. For SFT, we need humans to generate the targets. Once we train the RM, it can be used to score any
output
71
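A tiny numerical illustration of point 1 (toy probabilities, made up): the autoregressive cross-entropy only looks at the probability assigned to the reference token, so a semantically close wrong guess and a nonsensical one are penalized identically.

```python
import math

# Toy next-token distributions over a 3-word vocabulary; the reference token is "great".
close    = {"great": 0.2, "amazing": 0.7, "sandwiches": 0.1}   # close in meaning
nonsense = {"great": 0.2, "amazing": 0.1, "sandwiches": 0.7}   # nonsensical guess

# The AR loss is -log p(reference token); both distributions get exactly the same loss.
print(-math.log(close["great"]), -math.log(nonsense["great"]))  # identical values
```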
Evaluation
72
Original Goal: 3H
● Helpful: need to infer intention from the user (labelersʼ preference rating)
73
Original Goal: 3H
● Helpful: need to infer intention from the user (labelersʼ preference rating)
● Honest (truthfulness):
○ Hallucination (labelerʼs rating)
○ TruthfulQA dataset
74
Original Goal: 3H
● Helpful: need to infer intention from the user (labelersʼ preference rating)
● Honest (truthfulness):
○ Hallucination (labelerʼs rating)
○ TruthfulQA dataset
● Harmless:
○ RealToxicityPrompts (toxicity)
○ Winogender & CrowS-Pairs (social bias)
75
Evaluation: Testing Distributions
● API distribution
○ Prompts submitted to the original GPT-3 model (generally not instruction following)
76
Evaluation: Testing Distributions
● API distribution
○ Prompts submitted to the original GPT-3 model (generally not instruction following)
○ Prompts submitted to the InstructGPT model
77
Evaluation: Testing Distributions
● API distribution
○ Prompts submitted to the original GPT-3 model (generally not instruction following)
○ Prompts submitted to the InstructGPT model
● Public NLP tasks
○ SQuAD
○ DROP
○ HellaSwag
○ WMT 2015 French to English
78
Helpfulness: Preferences of the Labelers
79
Baseline: 50-50 win rate against SFT
80
Helpfulness: Preferences of the Labelers
● GPT vs. Instruct distribution
81
Helpfulness: Preferences of the Labelers
● GPT vs. Instruct distribution
● Labelers who provide training
data vs. new labelers
(preference overfitting)
82
Helpfulness: Preferences of the Labelers
● Researchers try to find prompts
that can successfully instruct a
vanilla GPT (they donʼt include
examples in the paper)
83
Helpfulness: Preferences of the Labelers
● PPO models win across the board
84
Preferences of the Labelers: Breakdown
X-axis aggregated across model sizes
85
Preferences of the Labelers: Breakdown
X-axis aggregated across model sizes
86
Preferences of the Labelers: Breakdown
X-axis aggregated across model sizes
87
● Models trained with feedback data are less likely to hallucinate
● Interesting that SFT has lower hallucinations
Breakdown across Model Sizes
88
Preferences of the Labelers: Breakdown
X-axis aggregated across model sizes
89
Comparing w/ Fine-Tuned Models
● Public NLP datasets do not reflect how the API is used
○ Public datasets mostly capture things that are easy to evaluate automatically
○ The API is more often used for open-ended generation
(Results on the Instruct prompt distribution)
90
Truthfulness
91
● “Instruction+QA”: instruct the model to respond with “I have no comment” when it is not
certain of the correct answer
● Models do not have to be specifically instructed to “tell the truth” to be more truthful
Truthfulness
Gray bars: truthfulness; colored bars: truthfulness + informativeness
92
Truthfulness
● PPO/PPO-ptx choose truthful + uninformative over confident falsehood
Gray bars: truthfulness; colored bars: truthfulness + informativeness
93
Toxicity & Bias
94
Toxicity: RealToxicityPrompts
95
● When instructed to be respectful, InstructGPT reduces toxicity more than GPT-3
● When instructed to be rude, InstructGPT amplifies toxicity more than GPT-3 (in paper)
Toxicity: RealToxicityPrompts
96
(Quark: Controllable Text Generation with Reinforced [Un]learning, Lu et al., 2022)
PPO-style training, not the exact InstructGPT model
Bias: Winogender & CrowS-Pairs
97
● Metric: entropy of the multiple-choice completions as the measure of bias
● Higher entropy -> less biased (see the sketch below)
(Figure panels: Winogender, CrowS-Pairs)
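A quick sketch of the entropy metric with made-up numbers: the higher the entropy over the answer choices, the less the model is committed to either completion.

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) over the probabilities of the multiple-choice completions."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Toy probabilities for the two readings of a Winogender-style sentence (made up).
print(entropy([0.9, 0.1]))  # ~0.47 bits: low entropy, strongly prefers one reading (more biased)
print(entropy([0.5, 0.5]))  # 1.0 bit: maximum entropy, indifferent between readings (less biased)
```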
Bias: Winogender & CrowS-Pairs
98
Pre-Lecture Q2
99
Summarize the evaluation results of InstructGPT vs GPT-3 in toxicity and bias. Why do
you think it is the case?
Pre-Lecture Q2
100
Summarize the evaluation results of InstructGPT vs GPT-3 in toxicity and bias. Why do
you think it is the case?
Answer:
Toxicity: InstructGPT can reduce it.
Bias: The authors say in the paper that they donʼt find clear patterns
● But a reasonable hypothesis might be that itʼs not easy to get this type of feedback
● Social biases can be subtle and hard to detect
● Labelers are not very directly instructed to catch bias
Pre-Lecture Q2
101
Instruction to the labelers
Summarize the evaluation results of InstructGPT vs GPT-3 in toxicity and bias. Why do
you think it is the case?
Qualitative Examples
● Generalizing to distributions outside of the fine-tuning data
Different Language
102
Qualitative Examples
● Generalizing to distributions outside of the fine-tuning data
Code
103
InstructGPT Still Makes Simple Mistakes
1. Incorrectly assumes the premise is true when itʼs not
104
InstructGPT Still Makes Simple Mistakes
1. Incorrectly assumes the premise is true when itʼs not
2. Overly hedging: the model might answer that there is “no one answer to the question” even when
the answer is clear from the context
105
InstructGPT Still Makes Simple Mistakes
106
Too much unnecessary hedging
InstructGPT Still Makes Simple Mistakes
107
Too much unnecessary hedging
InstructGPT Still Makes Simple Mistakes
1. Incorrectly assumes the premise is true when itʼs not
2. Overly hedging: the model might answer that there is “no one answer to the question” even when
the answer is clear from the context
3. Performance degrades when instructions contain multiple explicit constraints
(e.g. “list 10 movies made in the 1930s set in France”)
108
Implications
● Alignment research (more in the last lecture)
○ What are we aligning to? The Labelers? The researchers?
109
Implications
● Alignment research (more in the last lecture)
○ What are we aligning to? The Labelers? The researchers?
● How to do research when the model is constantly changing
110
Implications
● Alignment research (more in the last lecture)
○ What are we aligning to? The Labelers? The researchers?
● How to do research when the model is constantly changing
○ text-davinci-001 is the InstructGPT described in the paper
○ text-davinci-002 is the current model behind the API. This model is
crazily powerful, but we donʼt know what data itʼs trained on or what
changes were made to the training procedure
111
Implications
● Alignment research (more in the last lecture)
○ What are we aligning to? The Labelers? The researchers?
● How to do research when the model is constantly changing
○ text-davinci-001 is the InstructGPT described in the paper
○ text-davinci-002 is the current model behind the API. This model is
crazily powerful, but we donʼt know what data itʼs trained on or what
changes were made to the training procedure
○ How do we do model versioning when we start to iterate on the models
and train them with model-dependent data?
112
Summary
Performance
● Labelers preference: InstructGPT > GPT-3
● Truthfulness: InstructGPT > GPT-3
● Toxicity: InstructGPT > GPT-3 (but not bias)
Findings
● InstructGPT can generalize to “held-out” labelersʼ preferences
● Public NLP datasets do not reflect real-world LM use
● InstructGPT can generalize outside of the RLHF instruction distribution
● InstructGPT still makes simple mistakes
113
Pre-Lecture Q3
● Is preference ranking/comparison the only way to provide human feedback?
● What are other options and how to convert them into reward to train the models?
● What other types of human data do you think would be helpful?
114
Unused Slides
115
Addressing Misalignment: GPT-3
Brown et al. 2020 Language Models are Few-Shot Learners
Train: Next-token prediction -> Eval: Follow instructions (e.g. answer this question)
Prompting: Make the eval text more similar to the training corpora using prompts
116
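For example, a few-shot prompt frames the task as ordinary text continuation, so plain next-token prediction produces the answer without any fine-tuning (the prompt below is made up):

```python
# A made-up few-shot prompt in the GPT-3 style: task demonstrations followed by a new input.
prompt = """Translate French to English.

French: Où est la bibliothèque ?
English: Where is the library?

French: J'aime apprendre les langues.
English:"""
# Feeding this to the pretrained LM and sampling a continuation yields the translation.
```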
