3. The three H's of Model Desiderata
Helpful:
● The AI should help the user solve their task (e.g. answer their questions)
Honest:
● The AI should give accurate information
● The AI should express uncertainty when the model doesn't know the answer, instead of hallucinating a wrong answer
Harmless:
● The AI should not cause physical, psychological, or social harm to people or the environment
Askell et al. 2021, A General Language Assistant as a Laboratory for Alignment
7. The Misalignment of Models
Training objective: Predict the next token
Desiderata: The three H's
Misalignment: When the training objective does not capture the desiderata we want from models
10. Addressing Misalignment: Instruction Following
- The three H's are one possible set of desiderata
- A more concrete desideratum is getting models to follow instructions
Train: Next-token prediction -> Eval: Follow instructions (e.g. summarize this)
12. Addressing Misalignment: FLAN (Decoder models)
Wei et al. 2022, Finetuned Language Models are Zero-Shot Learners
Train: Next-token prediction -> Eval: Follow instructions (e.g. answer this question)
Instruction Tuning: Fine-tune models to follow instructions
1. Aggregate Datasets (62): Collect a wide variety of public datasets
2. Instruction Templates: Manually write 10 templates per dataset that capture the task
3. Fine-tune: Use the instruction templates and datasets to fine-tune the model
4. Evaluate on held-out tasks
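To make the template step concrete, here is a minimal sketch of how instruction templates might be applied to a labeled dataset to produce (input, target) pairs for fine-tuning. The template strings, field names, and example are illustrative assumptions, not the actual FLAN templates.

```python
import random

# Hypothetical templates for a natural language inference (NLI) dataset;
# FLAN manually wrote ~10 such templates per dataset. These are illustrative only.
NLI_TEMPLATES = [
    "Premise: {premise}\nHypothesis: {hypothesis}\nDoes the premise entail the hypothesis?",
    "{premise}\nBased on the paragraph above, can we conclude that '{hypothesis}'?",
    "Read the premise and decide whether the hypothesis follows.\nPremise: {premise}\nHypothesis: {hypothesis}",
]

def to_instruction_example(example):
    """Turn one raw NLI example into an (input, target) pair by filling a random template."""
    template = random.choice(NLI_TEMPLATES)
    return {
        "input": template.format(premise=example["premise"], hypothesis=example["hypothesis"]),
        "target": example["label"],  # e.g. "yes" / "no" / "it is not possible to tell"
    }

raw = {"premise": "A dog is running in the park.",
       "hypothesis": "An animal is outside.",
       "label": "yes"}
print(to_instruction_example(raw))
```

The same templated examples are then used as ordinary supervised fine-tuning data, and evaluation is done on tasks whose datasets were held out of training.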
18. Addressing Misalignment: T0 (Encoder-Decoder models)
Sanh et al. 2022, Multitask Prompted Training Enables Zero-Shot Task Generalization
Train: Span prediction -> Eval: Follow instructions (e.g. answer this question)
Basically the same idea as FLAN, except fine-tune an encoder-decoder model (T5)
19. Addressing Misalignment: LaMDA
Thoppilan et al. 2022, LaMDA: Language Models for Dialog Applications
Train: Next-token prediction -> Eval: Dialogue with human users
Solution: Add a large amount of dialogue text to the pretraining data
- 2.97B documents
- 1.12B dialogues and 13.39B dialogue utterances
- 1.56T words total
24. Method: Human Annotators
40 annotators from Upwork/ScaleAI annotate the training set
- Screened, onboarded, selected for diversity, etc.
A different set of annotators from Upwork/ScaleAI annotates the eval set
- Not screened, to better mirror real-world users
28. Method: The SFT Model
A large collection of prompts:
- From the OpenAI GPT-3 Playground
- Filtered to remove PII
- Filtered to remove duplicates
- Limited to 200 examples / user
- Annotators are also tasked with writing prompts
Fine-tune the model on labeler demonstrations for these prompts; call this the SFT model
- Initialized with the pretrained GPT-3 175B model, and trained for 16 epochs on the demonstration data
- In notation, also referred to as the SFT policy π^SFT
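As a rough illustration of this step, here is a minimal sketch of supervised fine-tuning with the standard next-token loss on (prompt, demonstration) pairs. The "gpt2" checkpoint, the toy demonstration, and the single-example loop are placeholders, not the paper's setup (which fine-tuned GPT-3 175B).

```python
# Minimal sketch of the SFT step: cross-entropy (next-token) training on
# human-written demonstrations, using a small stand-in model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")      # stand-in for GPT-3 175B
tokenizer = AutoTokenizer.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

demonstrations = [
    {"prompt": "Summarize: The cat sat on the mat.", "completion": " A cat sat on a mat."},
]

model.train()
for ex in demonstrations:
    # Concatenate prompt and human demonstration; train with the usual LM loss.
    # (Some setups mask the prompt tokens out of the loss; shown here in the simplest form.)
    ids = tokenizer(ex["prompt"] + ex["completion"], return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss   # labels are shifted internally by the model
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```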
36. Method
To increase data collection throughput, each labeler is given K = 4 to 9 outputs to rank for each prompt
37. Method: The Reward Model (RM)
Notation:
- r_θ: the reward model we are trying to optimize
- x: the prompt
- y_w: the better completion
- y_l: the worse completion
Pairwise ranking loss over labeler comparisons:
loss(θ) = − 1/(K choose 2) · E_{(x, y_w, y_l) ∼ D} [ log σ( r_θ(x, y_w) − r_θ(x, y_l) ) ]
where r_θ(x, y_w) is the reward on the better completion and r_θ(x, y_l) is the reward on the worse completion.
Small but important detail:
- Each prompt has K completions -> K choose 2 pairs to compare
- If, for every batch, we sample uniformly over all pairs (from any prompt):
  - Each completion can appear in K - 1 gradient updates
  - This can lead to overfitting
- Solution: sample the prompt, and then put all K choose 2 pairs from that prompt into the same batch
  - Corollary: computationally more efficient, since this only requires K forward passes through r_θ for each prompt
  - This is why there is the -1/(K choose 2) normalization in the loss
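A minimal PyTorch-style sketch of this pairwise loss, scoring all K choose 2 pairs from one prompt in a single batch so each completion is passed through r_θ only once. The reward_model interface and the "best-to-worst" ordering of completions are assumptions for illustration, not the paper's implementation.

```python
import itertools
import torch
import torch.nn.functional as F

def rm_loss_for_prompt(reward_model, prompt, completions):
    """Pairwise ranking loss for one prompt.

    Assumes `completions` are ordered best -> worst by the labeler, and that
    `reward_model(prompt, completion)` returns a scalar reward tensor.
    """
    K = len(completions)
    # One forward pass per completion (K passes total, not K choose 2).
    rewards = torch.stack([reward_model(prompt, c) for c in completions])

    losses = []
    for i, j in itertools.combinations(range(K), 2):   # i is ranked above j
        # -log sigma(reward_better - reward_worse)
        losses.append(-F.logsigmoid(rewards[i] - rewards[j]))

    # Averaging over the K choose 2 pairs gives the -1/(K choose 2) normalization,
    # so prompts with more completions do not dominate the gradient.
    return torch.stack(losses).mean()
```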
45. Method: RL Fine-Tuning (PPO)
Use the RM to update the SFT model from step 1; call this model PPO
Two problems:
1. As the PPO model is updated, its outputs become very different from what the RM was trained on -> worse reward estimates
   Solution: add a KL penalty that makes sure the PPO model's output does not deviate too far from the SFT model
2. Using only the RL objective leads to performance degradation on many public NLP tasks
   Solution: add an auxiliary LM objective on the pretraining data; call this variant PPO-ptx
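A rough sketch of the quantity optimized in this step: the RM score minus a KL penalty against the frozen SFT policy, plus (for PPO-ptx) an auxiliary pretraining LM term. The function names, interfaces, and the beta/gamma values are assumptions for illustration; the actual training optimizes this with PPO over token-level rollouts.

```python
def rlhf_objective(reward_model, policy_logprob, sft_logprob, lm_loss_on_pretraining,
                   prompt, completion, pretraining_batch, beta=0.1, gamma=1.0):
    """Illustrative (non-PPO) view of the objective optimized in this step.

    Assumed interfaces (not the paper's code):
      reward_model(prompt, completion)   -> scalar reward from the trained RM
      policy_logprob(prompt, completion) -> log pi_RL(completion | prompt) under the current policy
      sft_logprob(prompt, completion)    -> log pi_SFT(completion | prompt), SFT model frozen
      lm_loss_on_pretraining(batch)      -> next-token loss of pi_RL on pretraining data
    beta and gamma are placeholder coefficients.
    """
    # Problem 1 fix: KL-shaped reward keeps the policy close to the SFT model.
    kl = policy_logprob(prompt, completion) - sft_logprob(prompt, completion)
    shaped_reward = reward_model(prompt, completion) - beta * kl

    # Problem 2 fix (PPO-ptx): keep doing language modeling on pretraining data.
    ptx_loss = gamma * lm_loss_on_pretraining(pretraining_batch)

    # Minimize this quantity: maximize the shaped reward, keep the pretraining loss low.
    return -shaped_reward + ptx_loss
```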
54. Method: Model Summary
1. SFT: Supervised Fine-Tuning
   a. GPT-3 fine-tuned on human demonstrations of prompt completions
2. RM: Reward Model
   a. Not actually used to generate anything, but used to train the PPO and PPO-ptx models
3. PPO
   a. SFT model further fine-tuned using RL, with the RM providing the reward signal
   b. A KL loss is added to prevent the PPO model from deviating far from SFT
4. PPO-ptx
   a. Identical to PPO, except with an additional auxiliary LM objective on the pretraining data
58. Pre-lecture Q1
Describe the three datasets the authors collected: SFT, RM, PPO. What is the format of these datasets and how are they used in the pipeline?
1. SFT: A set of ~13k prompts (from labelers and the API) and their corresponding labeler-written completions. Used to train the SFT model.
2. RM: A set of ~33k training prompts (from labelers and the API), each with K corresponding SFT-model completions ranked by labelers. Used to train the RM.
3. PPO: A set of ~31k training prompts (from the API only), used as input to the PPO and PPO-ptx models for the policy optimization step.
Note: None of these datasets are available publicly :(
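To make the three formats concrete, here is a sketch of what one record in each dataset might look like. The field names and the ranking encoding are illustrative assumptions; the actual data is not public.

```python
# Illustrative record formats for the three datasets (field names are made up).

sft_example = {                      # ~13k examples: prompt + human demonstration
    "prompt": "Explain the moon landing to a 6 year old.",
    "demonstration": "Some people went to the moon ...",
}

rm_example = {                       # ~33k prompts: K model completions, ranked by a labeler
    "prompt": "Explain the moon landing to a 6 year old.",
    "completions": ["...", "...", "...", "..."],   # K = 4 to 9 SFT-model samples
    "ranking": [2, 0, 3, 1],         # labeler's preference order over completion indices, best first
}

ppo_example = {                      # ~31k prompts: prompt only; the reward comes from the RM
    "prompt": "Explain the moon landing to a 6 year old.",
}
```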
63. Method: Why is RL from Human Feedback (RLHF) good?
The SFT approach also uses data to align with human desiderata, so why do RLHF?
1. Reward is a more nuanced training signal than the autoregressive loss
   a. If the correct next token is "great", the AR loss penalizes the prediction "amazing" the same as "sandwiches". The RM assigns similar rewards to sequences of similar quality.
2. The RM "critiques" actual completions generated by the model itself, whereas SFT training does not use model generations, since it is completely offline.
   a. This means the RM may provide more "tailored" feedback to the model
3. The RM more directly captures the notion of "preference".
   a. Preferences induce rankings, and rankings can be used to infer preferences
   b. Ranking is very naturally captured by the reward signal: better sequences = higher reward
   c. In SFT, preference is not explicitly captured, since we only train to regurgitate "the best" example
4. The RM is more data efficient
   a. There is a reason step 1 uses ~13k prompts, but step 3 can use ~31k prompts
   b. For SFT, we need humans to generate the target. Once we train the RM, it can be used to score any output
69. Original Goal: 3H
● Helpful: need to infer intention from the user (labelers' preference rating)
● Honest (truthfulness):
  ○ Hallucination (labelers' rating)
  ○ TruthfulQA dataset
● Harmless:
  ○ RealToxicityPrompts (toxicity)
  ○ Winogender & CrowS-Pairs (social bias)
72. Evaluation: Testing Distributions
● API distribution
  ○ Prompts submitted to the original GPT-3 model (generally not instruction following)
  ○ Prompts submitted to the InstructGPT model
● Public NLP tasks
  ○ SQuAD
  ○ DROP
  ○ HellaSwag
  ○ WMT 2015 French to English
76. Helpfulness: Preferences of the Labelers
● Baseline: 50-50 win rate against SFT
● Evaluated on both the GPT and Instruct prompt distributions
● Labelers who provide training data vs. new labelers (to check for preference overfitting)
● Researchers try to find prompts that can successfully instruct a vanilla GPT (they don't include examples in the paper)
● PPO models win across the board
81. Preferences of the Labelers: Breakdown
X-axis aggregated across model sizes
● Models trained with feedback data are less likely to hallucinate
● Interesting that SFT has lower hallucinations
86. Comparing w/ Fine-Tuned Models
● Public NLP datasets do not reflect how the API is used
  ○ Public datasets mostly capture tasks that are easy to evaluate automatically
  ○ The API is more often used for open-ended generation
(Figure: results on the Instruct prompt distribution)
87. Truthfulness
● "Instruction+QA": instruct the model to respond with "I have no comment" when it is not certain of the correct answer
● Models do not have to be specifically instructed to "tell the truth" to be more truthful
91. Toxicity: RealToxicityPrompts
● When instructed to be respectful, InstructGPT reduces toxicity more than GPT-3
● When instructed to be rude, InstructGPT amplifies toxicity more than GPT-3 (in paper)
93. Bias: Winogender & CrowS-Pairs
● Metric: entropy of the multiple-choice completion probabilities as the measure of bias
● Higher entropy -> less biased
(Figure panels: Winogender, CrowS-Pairs)
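A small sketch of the entropy metric described above: given the model's probabilities for the candidate completions, higher entropy means the model spreads its belief more evenly rather than committing to one stereotyped completion. The probability values below are made up for illustration.

```python
import math

def completion_entropy(probs):
    """Entropy (in nats) of the normalized distribution over candidate
    multiple-choice completions; higher entropy = less biased toward one option."""
    total = sum(probs)
    normalized = [p / total for p in probs]
    return -sum(p * math.log(p) for p in normalized if p > 0)

# e.g. probabilities the model assigns to two pronoun completions (made up)
print(completion_entropy([0.9, 0.1]))   # low entropy  -> more biased
print(completion_entropy([0.5, 0.5]))   # max entropy  -> unbiased
```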
95. Pre-Lecture Q2
Summarize the evaluation results of InstructGPT vs GPT-3 in toxicity and bias. Why do you think this is the case?
Answer:
Toxicity: InstructGPT can reduce it.
Bias: The authors say in the paper that they don't find clear patterns
● But a reasonable hypothesis might be that it's not easy to get this type of feedback
● Social biases can be subtle and hard to detect
● Labelers are not very directly instructed to catch bias
(Figure: instructions given to the labelers)
100. InstructGPT Still Makes Simple Mistakes
1. Incorrectly assumes the premise is true when it's not
2. Overly hedging: the model might answer "there is no one answer to the question" when the one answer is clear from the context
3. Performance degrades when instructions contain multiple explicit constraints (e.g. "list 10 movies made in the 1930s set in France")
106. Implications
● Alignment research (more in the last lecture)
  ○ What are we aligning to? The labelers? The researchers?
● How to do research when the model is constantly changing
  ○ text-davinci-001 is the InstructGPT described in the paper
  ○ text-davinci-002 is the current model behind the API; it is remarkably powerful, but we don't know what data it was trained on or what changes were made to the training procedure
  ○ How do we do model versioning when we start to iterate on models and train them with model-dependent data?
109. Summary
Performance
● Labeler preference: InstructGPT > GPT-3
● Truthfulness: InstructGPT > GPT-3
● Toxicity: InstructGPT > GPT-3 (but not bias)
Findings
● InstructGPT can generalize to "held-out" labelers' preferences
● Public NLP datasets do not reflect real-world LM use
● InstructGPT can generalize outside of the RLHF instruction distribution
● InstructGPT still makes simple mistakes
110. Pre-Lecture Q3
● Is preference ranking/comparison the only way to provide human feedback?
● What are other options, and how can they be converted into rewards to train the models?
● What other types of human data do you think would be helpful?
112. Addressing Misalignment: GPT-3
Brown et al. 2020, Language Models are Few-Shot Learners
Train: Next-token prediction -> Eval: Follow instructions (e.g. answer this question)
Prompting: Make the eval text more similar to the training corpora using prompts
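As a small illustration of this prompting idea, here is a sketch of rephrasing an instruction-style eval so it looks like text a next-token predictor has seen during pretraining; the prompt wording and in-context example are placeholders.

```python
# Rephrase an instruction-style task so it reads like naturally occurring text,
# which a pretrained next-token predictor is more likely to continue correctly.
question = "What is the capital of France?"

instruction_style = f"Answer this question: {question}"
prompted_style = (
    "Q: What is the capital of Germany?\nA: Berlin\n"   # in-context example
    f"Q: {question}\nA:"                                 # the model continues with the answer
)
print(prompted_style)
```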