Training Language Models to Follow
Instructions with Human Feedback
Austin Wang, Howard Chen
COS 597G
1
Nov 14 2022
Motivation: Alignment
3
The three Hʼs of Model Desiderata
Askell et al. 2021 A general language assistant as a laboratory for alignment 4
The three Hʼs of Model Desiderata
Helpful:
● The AI should help the user solve their task (e.g. answer
their questions)
Askell et al. 2021 A general language assistant as a laboratory for alignment 5
The three Hʼs of Model Desiderata
Helpful:
● The AI should help the user solve their task (e.g. answer
their questions)
Honest:
● The AI should give accurate information
● The AI should express uncertainty when the model doesnʼt
know the answer, instead of hallucinating a wrong answer
Askell et al. 2021 A general language assistant as a laboratory for alignment 6
The three Hʼs of Model Desiderata
Helpful:
● The AI should help the user solve their task (e.g. answer
their questions)
Honest:
● The AI should give accurate information
● The AI should express uncertainty when the model doesnʼt
know the answer, instead of hallucinating a wrong answer
Harmless:
● The AI should not cause physical, psychological, or social
harm to people or the environment
Askell et al. 2021 A general language assistant as a laboratory for alignment 7
The Misalignment of Models
Misalignment: When the training objective does not capture
the desiderata we want from models
8
The Misalignment of Models
Training: Predict the next token
The three Hʼs of Model Desiderata
Misalignment: When the training objective does not capture
the desiderata we want from models
9
Prior Works
10
Addressing Misalignment: Instruction Following
11
- The three Hʼs are one possible set of desiderata
- One more concrete desideratum is getting models to follow instructions
Addressing Misalignment: Instruction Following
Train: Next-token prediction -> Eval: Follow instructions (e.g. summarize this)
12
- The three Hʼs are one possible set of desiderata
- One more concrete desideratum is getting models to follow instructions
Addressing Misalignment: FLAN (Decoder models)
Wei et al. 2022 Finetuned Language Models are Zero shot Learners
Train: Next-token prediction -> Eval: Follow instructions (e.g. answer this question)
13
Addressing Misalignment: FLAN (Decoder models)
Wei et al. 2022 Finetuned Language Models are Zero shot Learners
Train: Next-token prediction -> Eval: Follow instructions (e.g. answer this question)
Instruction Tuning: Fine-tune models to follow instructions
14
Addressing Misalignment: FLAN (Decoder models)
Wei et al. 2022 Finetuned Language Models are Zero shot Learners
Train: Next-token prediction -> Eval: Follow instructions (e.g. answer this question)
Instruction Tuning: Fine-tune models to follow instructions
1. Aggregate Datasets (62): Collect wide
variety of public datasets
15
Addressing Misalignment: FLAN (Decoder models)
Wei et al. 2022 Finetuned Language Models are Zero shot Learners
Train: Next-token prediction -> Eval: Follow instructions (e.g. answer this question)
Instruction Tuning: Fine-tune models to follow instructions
1. Aggregate Datasets (62): Collect wide
variety of public datasets
2. Instruction Templates: Manually write 10
templates / dataset that captures task
16
Addressing Misalignment: FLAN (Decoder models)
Wei et al. 2022 Finetuned Language Models are Zero shot Learners
Train: Next-token prediction -> Eval: Follow instructions (e.g. answer this question)
Instruction Tuning: Fine-tune models to follow instructions
1. Aggregate Datasets (62): Collect wide
variety of public datasets
2. Instruction Templates: Manually write 10
templates / dataset that captures task
3. Fine-tune: Use the instruction templates
and datasets to fine-tune model
17
Addressing Misalignment: FLAN (Decoder models)
Wei et al. 2022 Finetuned Language Models are Zero shot Learners
Train: Next-token prediction -> Eval: Follow instructions (e.g. answer this question)
Instruction Tuning: Fine-tune models to follow instructions
1. Aggregate Datasets (62): Collect wide
variety of public datasets
2. Instruction Templates: Manually write 10
templates / dataset that captures task
3. Fine-tune: Use the instruction templates
and datasets to fine-tune model
4. Evaluate on held-out task
18
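To make step 2 concrete, here is a toy sketch of applying one instruction template to a dataset example; the template wording and field names are made up for illustration, not taken from FLAN:

```python
# Toy illustration of instruction templating (hypothetical template, not from FLAN).
NLI_TEMPLATE = (
    "Premise: {premise}\n"
    "Hypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis? Answer yes, no, or maybe."
)

def to_instruction_example(example: dict) -> dict:
    """Turn a raw NLI example into an (instruction, target) pair for fine-tuning."""
    prompt = NLI_TEMPLATE.format(premise=example["premise"],
                                 hypothesis=example["hypothesis"])
    return {"input": prompt, "target": example["label"]}

print(to_instruction_example({
    "premise": "A dog is running in the park.",
    "hypothesis": "An animal is outside.",
    "label": "yes",
}))
```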
Addressing Misalignment: T0 (Encoder-Decoder models)
Sanh et al. 2022 Multitask Prompted Training Enables Zero-Shot Task Generalization
Train: Span prediction -> Eval: Follow instructions (e.g. answer this question)
Basically the same idea as FLAN, except fine-tune an encoder-decoder model (T5)
20
Addressing Misalignment: LaMDA
Thoppilan et al. 2022 LaMDA: Language Models for Dialog Applications
Train: Next-token prediction -> Eval: Dialogue with human users
21
Addressing Misalignment: LaMDA
Thoppilan et al. 2022 LaMDA: Language Models for Dialog Applications
Train: Next-token prediction -> Eval: Dialogue with human users
Solution: Add a bunch of dialogue text to your pretraining data
- 2.97B Documents
- 1.12B Dialogues and 13.39B Dialogue Utterances
- 1.56T words total
22
Addressing Misalignment: LaMDA
Thoppilan et al. 2022 LaMDA: Language Models for Dialog Applications
Train: Next-token prediction -> Eval: Dialogue with human users
Solution: Add a bunch of dialogue text to your pretraining data
- 2.97B Documents
- 1.12B Dialogues and 13.39B Dialogue Utterances
- 1.56T words total
23
Addressing Misalignment: LaMDA
Thoppilan et al. 2022 LaMDA: Language Models for Dialog Applications
Train: Next-token prediction -> Eval: Dialogue with human users
Solution: Add a bunch of dialogue text to your pretraining data
- 2.97B Documents
- 1.12B Dialogues and 13.39B Dialogue Utterances
- 1.56T words total
24
Learning from Human Feedback
25
Method: Human Annotators
40 annotators from Upwork/ScaleAI
- Screened, onboarded, and demographically diverse
Annotate the training set
26
Method: Human Annotators
40 annotators from Upwork/ScaleAI
- Screened, onboarded, and demographically diverse
Different annotators from Upwork/ScaleAI
- Not screened, to better mirror the real world
(First group annotates the training set; second group annotates the evaluation set)
27
Method: The SFT Model
28
Method: The SFT Model
29
Method: The SFT Model
A large collection of prompts:
- From OpenAI GPT3 Playground
32
Method: The SFT Model
A large collection of prompts:
- From OpenAI GPT3 Playground
- Filtered to remove PII
- Filtered to remove duplicates
- Limited to 200 examples / user
- Annotators are also tasked with writing prompts
33
Method: The SFT Model
34
(Table: number of prompts)
Method: The SFT Model
35
Method: The SFT Model
Fine-tune the model; call this the SFT model
- Initialized with the pretrained GPT-3 175B model, and trained
for 16 epochs on the demonstration data
36
Method: The SFT Model
Fine-tune the model; call this the SFT model
- Initialized with the pretrained GPT-3 175B model, and trained
for 16 epochs on the demonstration data
- In notation, also referred to as:
37
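A minimal sketch of what this supervised fine-tuning step computes: standard next-token cross-entropy on the labeler demonstrations. This assumes a Hugging Face-style causal LM and tokenizer; the names are illustrative, not from the paper's code.

```python
import torch.nn.functional as F

def sft_loss(model, tokenizer, prompt: str, demonstration: str):
    """Cross-entropy of the model on a human-written demonstration for a prompt."""
    ids = tokenizer(prompt + demonstration, return_tensors="pt").input_ids
    logits = model(ids).logits                     # shape: (1, T, vocab)
    # Causal LM objective: predict token t+1 from tokens <= t.
    # (In practice one may also mask the prompt tokens out of the loss.)
    pred, target = logits[:, :-1, :], ids[:, 1:]
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))
```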
Method
38
Method
The outputs are sampled from the SFT model
39
(Table: number of prompts)
Method
To increase data collection throughput, each labeler is given K = 4 to
9 outputs to rank for each prompt
40
Method
r_θ: the reward model we are trying to optimize
x: the prompt
y_w: the better completion
y_l: the worse completion
41
Method
r_θ: the reward model we are trying to optimize
x: the prompt
y_w: the better completion
y_l: the worse completion
(Loss annotations: r_θ(x, y_w) is the reward on the better completion; r_θ(x, y_l) is the reward on the worse completion)
42
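For reference, this is the pairwise ranking loss the annotations refer to, in the form used in the paper, where σ is the sigmoid, D is the dataset of human comparisons, and the 1/(K choose 2) factor is the normalization discussed on the next slides:

```latex
\mathrm{loss}(\theta) \;=\; -\frac{1}{\binom{K}{2}}\,
\mathbb{E}_{(x,\,y_w,\,y_l)\sim D}
\Big[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\Big]
```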
Method
r_θ: the reward model we are trying to optimize
x: the prompt
y_w: the better completion
y_l: the worse completion
Small but important detail:
- Each prompt has K completions -> K choose 2 pairs to compare
43
Method
r_θ: the reward model we are trying to optimize
x: the prompt
y_w: the better completion
y_l: the worse completion
Small but important detail:
- Each prompt has K completions -> K choose 2 pairs to compare
- If for every batch we sample uniformly over every pair (from any prompt):
- Each completion can appear in K - 1 gradient updates
- This can lead to overfitting
44
Method
r_θ: the reward model we are trying to optimize
x: the prompt
y_w: the better completion
y_l: the worse completion
Small but important detail:
- Each prompt has K completions -> K choose 2 pairs to compare
- If for every batch we sample uniformly over every pair (from any prompt):
- Each completion can appear in K - 1 gradient updates
- This can lead to overfitting
- Solution: sample the prompt, and then put all K choose 2 pairs
from the prompt into the same batch
45
Method
r_θ: the reward model we are trying to optimize
x: the prompt
y_w: the better completion
y_l: the worse completion
Small but important detail:
- Each prompt has K completions -> K choose 2 pairs to compare
- If for every batch we sample uniformly over every pair (from any prompt):
- Each completion can appear in K - 1 gradient updates
- This can lead to overfitting
- Solution: sample the prompt, and then put all K choose 2 pairs
from the prompt into the same batch
- Corollary: computationally more efficient, since this only
requires K forward passes through r_θ for each prompt
- This is why there is the -1/(K choose 2) normalization in loss
46
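A minimal sketch of this batching scheme in PyTorch-style code (names are illustrative; `reward_model` is a hypothetical module returning a scalar reward for a prompt-completion pair): each prompt costs K forward passes, and all C(K, 2) comparisons reuse those scores inside one loss term.

```python
import itertools
import torch
import torch.nn.functional as F

def rm_loss_for_prompt(reward_model, prompt, ranked_completions):
    """Pairwise ranking loss for one prompt whose K completions are ordered best-to-worst."""
    K = len(ranked_completions)
    # K forward passes total; every C(K, 2) comparison reuses these scores.
    rewards = torch.stack([reward_model(prompt, c) for c in ranked_completions])

    pair_losses = []
    for i, j in itertools.combinations(range(K), 2):  # completion i is ranked above j
        pair_losses.append(-F.logsigmoid(rewards[i] - rewards[j]))

    # Averaging over the C(K, 2) pairs is the -1/(K choose 2) normalization in the loss.
    return torch.stack(pair_losses).mean()
```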
Method
47
Method
48
Method
Use the RM to update the SFT model from step 1; call this model PPO
49
Method
Use the RM to update the SFT model from step 1; call this model PPO
50
(Table: number of prompts)
Method
Use the RM to update the SFT model from step 1; call this model PPO
Two problems:
1. As the policy is updated during RL, its outputs become very different from
what the RM was trained on -> worse reward estimates
51
Method
Use the RM to update the SFT model from step 1; call this model PPO
Two problems:
1. As the policy is updated during RL, its outputs become very different from
what the RM was trained on -> worse reward estimates
Solution: add a KL penalty that keeps the PPO
model output from deviating too far from SFT
52
Method
Use the RM to update the SFT model from step 1; call this model PPO
Two problems:
1. As the policy is updated during RL, its outputs become very different from
what the RM was trained on -> worse reward estimates
Solution: add a KL penalty that keeps the PPO
model output from deviating too far from SFT
2. Using just the RL objective leads to performance degradation
on many NLP tasks
53
Method
Use the RM to update the SFT model from step 1; call this model PPO
Two problems:
1. As the policy is updated during RL, its outputs become very different from
what the RM was trained on -> worse reward estimates
Solution: add a KL penalty that keeps the PPO
model output from deviating too far from SFT
2. Using just the RL objective leads to performance degradation
on many NLP tasks
Solution: Add an auxiliary LM objective on the
pretraining data. Call this variant PPO-ptx
54
Method
Use the RM to update the SFT model from step 1; call this model PPO
Two problems:
1. As the policy is updated during RL, its outputs become very different from
what the RM was trained on -> worse reward estimates
Solution: add a KL penalty that keeps the PPO
model output from deviating too far from SFT
2. Using just the RL objective leads to performance degradation
on many NLP tasks
Solution: Add an auxiliary LM objective on the
pretraining data. Call this variant PPO-ptx (the combined objective is written out below)
55
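Putting the two fixes together, the objective being maximized in this step can be written as below (notation as in the paper: π_φ^RL is the policy being trained, π^SFT the SFT model, β the KL coefficient, and γ the pretraining coefficient; PPO is the case γ = 0, PPO-ptx the case γ > 0):

```latex
\mathrm{objective}(\phi) \;=\;
\mathbb{E}_{(x,y)\sim D_{\pi_\phi^{\mathrm{RL}}}}
\Big[\, r_\theta(x, y) \;-\; \beta \log \frac{\pi_\phi^{\mathrm{RL}}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)} \,\Big]
\;+\; \gamma\, \mathbb{E}_{x\sim D_{\mathrm{pretrain}}}\big[\log \pi_\phi^{\mathrm{RL}}(x)\big]
```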
Method
56
Method: Model Summary
57
Method: Model Summary
1. SFT: Supervised Fine-Tuning
a. GPT-3 fine-tuned on human demonstrations of prompt completions
58
Method: Model Summary
1. SFT: Supervised Fine-Tuning
a. GPT-3 fine-tuned on human demonstrations of prompt completions
2. RM: Reward Model
a. Not actually used to generate anything, but used to train the PPO and PPO-ptx
models
59
Method: Model Summary
1. SFT: Supervised Fine-Tuning
a. GPT-3 fine-tuned on human demonstrations of prompt completions
2. RM: Reward Model
a. Not actually used to generate anything, but used to train the PPO and PPO-ptx
models
3. PPO
a. SFT model further fine-tuned using RL with the RM providing the reward signal
b. A KL loss is added to prevent the PPO model from deviating too far from SFT
60
Method: Model Summary
1. SFT: Supervised Fine-Tuning
a. GPT-3 fine-tuned on human demonstrations of prompt completions
2. RM: Reward Model
a. Not actually used to generate anything, but used to train the PPO and PPO-ptx
models
3. PPO
a. SFT model further fine-tuned using RL with the RM providing the reward signal
b. A KL loss is added to prevent the PPO model from deviating too far from SFT
4. PPO-ptx
a. Identical to PPO, except with an additional auxiliary LM objective on the
pretraining data
61
Pre-lecture Q1
62
Describe the three datasets the authors collected: SFT, RM, PPO. What are the format of these datasets and
how are they used in the pipeline?
Pre-lecture Q1
63
Describe the three datasets the authors collected: SFT, RM, PPO. What are the format of these datasets and
how are they used in the pipeline?
1. SFT: Set of ~13k prompts (from labellers and API) and their corresponding labeller completions. Used
to train the SFT model.
Pre-lecture Q1
64
Describe the three datasets the authors collected: SFT, RM, PPO. What are the format of these datasets and
how are they used in the pipeline?
1. SFT: Set of ~13k prompts (from labellers and API) and their corresponding labeller completions. Used
to train the SFT model.
2. RM: Set of ~33k training prompts (from labellers and API), each with K corresponding SFT model
completions ranked by labellers. This is used to train the RM.
Pre-lecture Q1
65
Describe the three datasets the authors collected: SFT, RM, PPO. What are the format of these datasets and
how are they used in the pipeline?
1. SFT: Set of ~13k prompts (from labellers and API) and their corresponding labeller completions. Used
to train the SFT model.
2. RM: Set of ~33k training prompts (from labellers and API), each with K corresponding SFT model
completions ranked by labellers. This is used to train the RM.
3. PPO: Set of ~31k training prompts (from API only), used as input to the PPO and PPO-ptx model for
the policy optimization step.
Pre-lecture Q1
66
Describe the three datasets the authors collected: SFT, RM, PPO. What are the format of these datasets and
how are they used in the pipeline?
1. SFT: Set of ~13k prompts (from labellers and API) and their corresponding labeller completions. Used
to train the SFT model.
2. RM: Set of ~33k training prompts (from labellers and API), each with K corresponding SFT model
completions ranked by labellers. This is used to train the RM.
3. PPO: Set of ~31k training prompts (from API only), used as input to the PPO and PPO-ptx model for
the policy optimization step.
Note: None of these datasets are available publicly :(
Method: Why is RL using Human Feedback (RLHF) good?
The SFT approach also uses data to align with human desiderata, so why do RLHF?
67
Method: Why is RL using Human Feedback (RLHF) good?
The SFT approach also uses data to align with human desiderata, so why do RLHF?
1. Reward is a more nuanced training signal than autoregressive loss
a. If the correct next token is “great”, the AR loss penalizes the prediction “amazing” the same as
“sandwiches”. The RM assigns similar rewards to sequences with similar quality.
68
Method: Why is RL using Human Feedback (RLHF) good?
The SFT approach also uses data to align with human desiderata, so why do RLHF?
1. Reward is a more nuanced training signal than autoregressive loss
a. If the correct next token is “great”, the AR loss penalizes the prediction “amazing” the same as
“sandwiches”. The RM assigns similar rewards to sequences with similar quality.
2. The RM “critiques” actual completions generated from the model itself, whereas SFT training does not
use model generations, since it is completely offline.
a. This means the RM may provide more “tailored” feedback to the model
69
Method: Why is RL using Human Feedback (RLHF) good?
The SFT approach also uses data to align with human desiderata, so why do RLHF?
1. Reward is a more nuanced training signal than autoregressive loss
a. If the correct next token is “great”, the AR loss penalizes the prediction “amazing” the same as
“sandwiches”. The RM assigns similar rewards to sequences with similar quality.
2. The RM “critiques” actual completions generated from the model itself, whereas SFT training does not
use model generations, since it is completely offline.
a. This means the RM may provide more “tailored” feedback to the model
3. The RM more directly captures the notion of “preference”.
a. Preferences induce rankings, and rankings can be used to infer preferences
b. Ranking is very naturally captured by the reward signal, better sequences = higher reward
c. In SFT, preference is not explicitly captured, since we only train to regurgitate “the best” example
70
Method: Why is RL using Human Feedback (RLHF) good?
The SFT approach also uses data to align with human desiderata, so why do RLHF?
1. Reward is a more nuanced training signal than autoregressive loss
a. If the correct next token is “great”, the AR loss penalizes the prediction “amazing” the same as
“sandwiches”. The RM assigns similar rewards to sequences with similar quality (see the toy example after this list).
2. The RM “critiques” actual completions generated from the model itself, whereas SFT training does not
use model generations, since it is completely offline.
a. This means the RM may provide more “tailored” feedback to the model
3. The RM more directly captures the notion of “preference”.
a. Preferences induce rankings, and rankings can be used to infer preferences
b. Ranking is very naturally captured by the reward signal, better sequences = higher reward
c. In SFT, preference is not explicitly captured, since we only train to regurgitate “the best” example
4. The RM is more data efficient
a. There is a reason step 1 uses 13k prompts, but step 3 can use 31k prompts.
b. For SFT, we need humans to generate the targets. Once we train the RM, it can be used to score any
output
71
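A tiny numerical illustration of point 1 (toy probabilities, made up): the autoregressive cross-entropy only looks at the probability assigned to the reference token, so a semantically close wrong guess and a nonsensical one are penalized identically.

```python
import math

# Toy next-token distributions over a 3-word vocabulary; the reference token is "great".
close    = {"great": 0.2, "amazing": 0.7, "sandwiches": 0.1}   # close in meaning
nonsense = {"great": 0.2, "amazing": 0.1, "sandwiches": 0.7}   # nonsensical guess

# The AR loss is -log p(reference token); both distributions get exactly the same loss.
print(-math.log(close["great"]), -math.log(nonsense["great"]))  # identical values
```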
Evaluation
72
Original Goal: 3H
● Helpful: need to infer intention from the user (labelersʼ preference rating)
73
Original Goal: 3H
● Helpful: need to infer intention from the user (labelersʼ preference rating)
● Honest (truthfulness):
○ Hallucination (labelerʼs rating)
○ TruthfulQA dataset
74
Original Goal: 3H
● Helpful: need to infer intention from the user (labelersʼ preference rating)
● Honest (truthfulness):
○ Hallucination (labelerʼs rating)
○ TruthfulQA dataset
● Harmless:
○ RealToxicityPrompts (toxicity)
○ Winogender & CrowS-Pairs (social bias)
75
Evaluation: Testing Distributions
● API distribution
○ Prompts submitted to the original GPT-3 model (generally not instruction following)
76
Evaluation: Testing Distributions
● API distribution
○ Prompts submitted to the original GPT-3 model (generally not instruction following)
○ Prompts submitted to the InstructGPT model
77
Evaluation: Testing Distributions
● API distribution
○ Prompts submitted to the original GPT-3 model (generally not instruction following)
○ Prompts submitted to the InstructGPT model
● Public NLP tasks
○ SQuAD
○ DROP
○ HellaSwag
○ WMT 2015 French to English
78
Helpfulness: Preferences of the Labelers
79
Baseline: 50-50 win rate against SFT
80
Helpfulness: Preferences of the Labelers
● GPT vs. Instruct distribution
81
Helpfulness: Preferences of the Labelers
● GPT vs. Instruct distribution
● Labelers who provide training
data vs. new labelers
(preference overfitting)
82
Helpfulness: Preferences of the Labelers
● Researchers try to find prompts
that can successfully instruct a
vanilla GPT (they donʼt include
examples in the paper)
83
Helpfulness: Preferences of the Labelers
● PPO models win across the board
84
Preferences of the Labelers: Breakdown
X-axis aggregated across model sizes
85
Preferences of the Labelers: Breakdown
X-axis aggregated across model sizes
86
Preferences of the Labelers: Breakdown
X-axis aggregated across model sizes
87
● Models trained with feedback data are less likely to hallucinate
● Interesting that SFT has lower hallucinations
Breakdown across Model Sizes
88
Preferences of the Labelers: Breakdown
X-axis aggregated across model sizes
89
Comparing w/ Fine-Tuned Models
● Public NLP datasets do not reflect how the API is used
○ Public datasets mostly capture things that are easy to evaluate automatically
○ The API is more often used for open-ended generation
(Results on the Instruct prompt distribution)
90
Truthfulness
91
● “Instruction+QA”: instruct the model to respond with “I have no comment” when it is not
certain of the correct answer
● Models do not have to be specifically instructed to “tell the truth” to be more truthful
Truthfulness
Gray bars: truthfulness; colored bars: truthfulness + informativeness
92
Truthfulness
● PPO/PPO-ptx choose truthful + uninformative over confident falsehood
Gray bars: truthfulness; colored bars: truthfulness + informativeness
93
Toxicity & Bias
94
Toxicity: RealToxicityPrompts
95
● When instructed to be respectful, InstructGPT reduces toxicity more than GPT-3
● When instructed to be rude, InstructGPT amplifies toxicity more than GPT-3 (in paper)
Toxicity: RealToxicityPrompts
96
(Quark: Controllable Text Generation with Reinforced [Un]learning, Lu et al., 2022)
PPO-style training, not the exact InstructGPT model
Bias: Winogender & CrowS-Pairs
97
● Metric: entropy of the multiple-choice completions as the measure of bias
● Higher entropy -> less biased (see the sketch below)
(Figure panels: Winogender, CrowS-Pairs)
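A quick sketch of the entropy metric with made-up numbers: the higher the entropy over the answer choices, the less the model is committed to either completion.

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) over the probabilities of the multiple-choice completions."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Toy probabilities for the two readings of a Winogender-style sentence (made up).
print(entropy([0.9, 0.1]))  # ~0.47 bits: low entropy, strongly prefers one reading (more biased)
print(entropy([0.5, 0.5]))  # 1.0 bit: maximum entropy, indifferent between readings (less biased)
```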
Bias: Winogender & CrowS-Pairs
98
Pre-Lecture Q2
99
Summarize the evaluation results of InstructGPT vs GPT-3 in toxicity and bias. Why do
you think it is the case?
Pre-Lecture Q2
100
Summarize the evaluation results of InstructGPT vs GPT-3 in toxicity and bias. Why do
you think it is the case?
Answer:
Toxicity: InstructGPT can reduce it.
Bias: The authors say in the paper that they donʼt find clear patterns
● But a reasonable hypothesis might be that itʼs not easy to get this type of feedback
● Social biases can be subtle and hard to detect
● Labelers are not very directly instructed to catch bias
Pre-Lecture Q2
101
Instruction to the labelers
Summarize the evaluation results of InstructGPT vs GPT-3 in toxicity and bias. Why do
you think it is the case?
Qualitative Examples
● Generalizing to distributions outside of the fine-tuning data
Different Language
102
Qualitative Examples
● Generalizing to distributions outside of the fine-tuning data
Code
103
InstructGPT Still Makes Simple Mistakes
1. Incorrectly assumes the premise is true when itʼs not
104
InstructGPT Still Makes Simple Mistakes
1. Incorrectly assumes the premise is true when itʼs not
2. Overly hedging: the model might answer that there is “no one answer to the question” even when
the answer is clear from the context
105
InstructGPT Still Makes Simple Mistakes
106
Too much unnecessary hedging
InstructGPT Still Makes Simple Mistakes
107
Too much unnecessary hedging
InstructGPT Still Makes Simple Mistakes
1. Incorrectly assumes the premise is true when itʼs not
2. Overly hedging: the model might answer that there is “no one answer to the question” even when
the answer is clear from the context
3. Performance degrades when instructions contain multiple explicit constraints
(e.g. “list 10 movies made in the 1930s set in France”)
108
Implications
● Alignment research (more in the last lecture)
○ What are we aligning to? The Labelers? The researchers?
109
Implications
● Alignment research (more in the last lecture)
○ What are we aligning to? The Labelers? The researchers?
● How to do research when the model is constantly changing
110
Implications
● Alignment research (more in the last lecture)
○ What are we aligning to? The Labelers? The researchers?
● How to do research when the model is constantly changing
○ text-davinci-001 is the InstructGPT described in the paper
○ text-davinci-002 is the current model behind the API. This model is
crazily powerful, but we donʼt know what data itʼs trained on or what
changes were made to the training procedure
111
Implications
● Alignment research (more in the last lecture)
○ What are we aligning to? The Labelers? The researchers?
● How to do research when the model is constantly changing
○ text-davinci-001 is the InstructGPT described in the paper
○ text-davinci-002 is the current model behind the API. This model is
crazily powerful, but we donʼt know what data itʼs trained on or what
changes were made to the training procedure
○ How do we do model versioning when we start to iterate on the models
and train them with model-dependent data?
112
Summary
Performance
● Labelers preference: InstructGPT > GPT-3
● Truthfulness: InstructGPT > GPT-3
● Toxicity: InstructGPT > GPT-3 (but not bias)
Findings
● InstructGPT can generalize to “held-out” labelersʼ preferences
● Public NLP datasets do not reflect real-world LM use
● InstructGPT can generalize outside of the RLHF instruction distribution
● InstructGPT still makes simple mistakes
113
Pre-Lecture Q3
● Is preference ranking/comparison the only way to provide human feedback?
● What are other options and how to convert them into reward to train the models?
● What other types of human data do you think would be helpful?
114
Unused Slides
115
Addressing Misalignment: GPT-3
Brown et al. 2020 Language Models are Few-Shot Learners
Train: Next-token prediction -> Eval: Follow instructions (e.g. answer this question)
Prompting: Make the eval text more similar to the training corpora using prompts
116
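For example, a few-shot prompt frames the task as ordinary text continuation, so plain next-token prediction produces the answer without any fine-tuning (the prompt below is made up):

```python
# A made-up few-shot prompt in the GPT-3 style: task demonstrations followed by a new input.
prompt = """Translate French to English.

French: Où est la bibliothèque ?
English: Where is the library?

French: J'aime apprendre les langues.
English:"""
# Feeding this to the pretrained LM and sampling a continuation yields the translation.
```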
