Training Language Models to Follow
Instructions with Human Feedback
Austin Wang, Howard Chen
COS 597G
1
Nov 14 2022
Motivation: Alignment
3
The three Hʼs of Model Desiderata
Askell et al. 2021 A general language assistant as a laboratory for alignment 4
The three Hʼs of Model Desiderata
Helpful:
● The AI should help the user solve their task (e.g. answer
their questions)
Askell et al. 2021 A general language assistant as a laboratory for alignment 5
The three Hʼs of Model Desiderata
Helpful:
● The AI should help the user solve their task (e.g. answer
their questions)
Honest:
● The AI should give accurate information
● The AI should express uncertainty when the model doesnʼt
know the answer, instead of hallucinating a wrong answer
Askell et al. 2021 A general language assistant as a laboratory for alignment 6
The three Hʼs of Model Desiderata
Helpful:
● The AI should help the user solve their task (e.g. answer
their questions)
Honest:
● The AI should give accurate information
● The AI should express uncertainty when the model doesnʼt
know the answer, instead of hallucinating a wrong answer
Harmless:
● The AI should not cause physical, psychological, or social
harm to people or the environment
Askell et al. 2021 A general language assistant as a laboratory for alignment 7
The Misalignment of Models
Misalignment: When the training objective does not capture
the desiderata we want from models
8
The Misalignment of Models
Training: Predict the next token
The three Hʼs of Model Desiderata
Misalignment: When the training objective does not capture
the desiderata we want from models
9
Prior Works
10
Addressing Misalignment: Instruction Following
11
- The three Hʼs are one possible set of desiderata
- One more concrete desideratum is getting models to follow instructions
Addressing Misalignment: Instruction Following
Train: Next-token prediction -> Eval: Follow instructions (e.g. summarize this)
12
- The three Hʼs are one possible set of desiderata
- One more concrete desideratum is getting models to follow instructions
Addressing Misalignment: FLAN (Decoder models)
Wei et al. 2022 Finetuned Language Models Are Zero-Shot Learners
Train: Next-token prediction -> Eval: Follow instructions (e.g. answer this question)
13
Addressing Misalignment: FLAN (Decoder models)
Wei et al. 2022 Finetuned Language Models Are Zero-Shot Learners
Train: Next-token prediction -> Eval: Follow instructions (e.g. answer this question)
Instruction Tuning: Fine-tune models to follow instructions
14
Addressing Misalignment: FLAN (Decoder models)
Wei et al. 2022 Finetuned Language Models Are Zero-Shot Learners
Train: Next-token prediction -> Eval: Follow instructions (e.g. answer this question)
Instruction Tuning: Fine-tune models to follow instructions
1. Aggregate Datasets (62): Collect a wide variety of public datasets
15
Addressing Misalignment: FLAN (Decoder models)
Wei et al. 2022 Finetuned Language Models Are Zero-Shot Learners
Train: Next-token prediction -> Eval: Follow instructions (e.g. answer this question)
Instruction Tuning: Fine-tune models to follow instructions
1. Aggregate Datasets (62): Collect a wide variety of public datasets
2. Instruction Templates: Manually write 10 templates per dataset that capture the task
16
Addressing Misalignment: FLAN (Decoder models)
Wei et al. 2022 Finetuned Language Models Are Zero-Shot Learners
Train: Next-token prediction -> Eval: Follow instructions (e.g. answer this question)
Instruction Tuning: Fine-tune models to follow instructions
1. Aggregate Datasets (62): Collect a wide variety of public datasets
2. Instruction Templates: Manually write 10 templates per dataset that capture the task
3. Fine-tune: Use the instruction templates and datasets to fine-tune the model
17
Addressing Misalignment: FLAN (Decoder models)
Wei et al. 2022 Finetuned Language Models Are Zero-Shot Learners
Train: Next-token prediction -> Eval: Follow instructions (e.g. answer this question)
Instruction Tuning: Fine-tune models to follow instructions
1. Aggregate Datasets (62): Collect a wide variety of public datasets
2. Instruction Templates: Manually write 10 templates per dataset that capture the task (see the sketch after this slide)
3. Fine-tune: Use the instruction templates and datasets to fine-tune the model
4. Evaluate on held-out tasks
18
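To make step 2 concrete, here is a minimal sketch of how an instruction template turns an ordinary supervised example into an instruction-following example. The template, example, and label mapping below are hypothetical illustrations, not templates from the FLAN paper:

# Hypothetical NLI example and instruction template (illustration only;
# not one of FLAN's actual templates).
example = {
    "premise": "The cat sat on the mat.",
    "hypothesis": "There is a cat on the mat.",
    "label": "entailment",
}

template = (
    "Premise: {premise}\n"
    "Hypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis? Answer yes, no, or maybe."
)
label_map = {"entailment": "yes", "contradiction": "no", "neutral": "maybe"}

# Render the (input, target) pair used for instruction tuning.
model_input = template.format(**example)
model_target = label_map[example["label"]]

Writing roughly 10 such templates per dataset, across 62 datasets, gives the instruction-tuning mixture; evaluation then uses task clusters held out of this mixture.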
Addressing Misalignment: T0 (Encoder-Decoder models)
Sanh et al. 2022 Multitask Prompted Training Enables Zero-Shot Task Generalization
Train: Span prediction -> Eval: Follow instructions (e.g. answer this question)
Basically the same idea as FLAN, except fine-tune an encoder-decoder model (T5)
20
Addressing Misalignment: LaMDA
Thoppilan et al. 2022 LaMDA: Language Models for Dialog Applications
Train: Next-token prediction -> Eval: Dialogue with human users
21
Addressing Misalignment: LaMDA
Thoppilan et al. 2022 LaMDA: Language Models for Dialog Applications
Train: Next-token prediction -> Eval: Dialogue with human users
Solution: Add a bunch of dialogue text to your pretraining data
- 2.97B Documents
- 1.12B Dialogues and 13.39B Dialogue Utterances
- 1.56T words total
22
Addressing Misalignment: LaMDA
Thoppilan et al. 2022 LaMDA: Language Models for Dialog Applications
Train: Next-token prediction -> Eval: Dialogue with human users
Solution: Add a bunch of dialogue text to your pretraining data
- 2.97B Documents
- 1.12B Dialogues and 13.39B Dialogue Utterances
- 1.56T words total
23
Addressing Misalignment: LaMDA
Thoppilan et al. 2022 LaMDA: Language Models for Dialog Applications
Train: Next-token prediction -> Eval: Dialogue with human users
Solution: Add a bunch of dialogue text to your pretraining data
- 2.97B Documents
- 1.12B Dialogues and 13.39B Dialogue Utterances
- 1.56T words total
24
Learning from Human Feedback
25
Method: Human Annotators
40 Annotators from Upwork/ScaleAI
- Screened, onboarded, and selected for diversity, etc.
Annotates train set
26
Method: Human Annotators
40 Annotators from Upwork/ScaleAI
- Screened, onboarded, and selected for diversity, etc.
Different annotators from Upwork/ScaleAI
- Not screened, to better mirror real-world usage
Annotates train set | Annotates eval set
27
Method: The SFT Model
28
Method: The SFT Model
29
Method: The SFT Model
A large collection of prompts:
- From the OpenAI GPT-3 Playground
32
Method: The SFT Model
A large collection of prompts:
- From the OpenAI GPT-3 Playground
- Filtered to remove PII
- Filtered to remove duplicates
- Limited to 200 examples / user
- Annotators are also tasked with writing prompts
33
Method: The SFT Model
34
Number of Prompts
Method: The SFT Model
35
Method: The SFT Model
Fine-tune the model; call this the SFT model
- Initialized with the pretrained GPT-3 175B model and trained for 16 epochs on the demonstration data
36
Method: The SFT Model
Fine-tune the model; call this the SFT model
- Initialized with the pretrained GPT-3 175B model and trained for 16 epochs on the demonstration data
- In notation, this model is also referred to as π^SFT (a training sketch follows this slide)
37
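A minimal sketch of the supervised fine-tuning step (illustration only: "gpt2" is a small stand-in for GPT-3 175B, the demonstration is hypothetical, and batching, padding, and the learning-rate schedule are simplified):

import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Hypothetical (prompt, labeler-written demonstration) pair.
demos = [{"prompt": "Explain the moon landing to a 6 year old.",
          "completion": " People went to the moon in a big rocket and walked on it."}]

def collate(batch):
    texts = [d["prompt"] + d["completion"] for d in batch]
    return tokenizer(texts, return_tensors="pt", padding=True)

loader = DataLoader(demos, batch_size=1, shuffle=True, collate_fn=collate)

model.train()
for epoch in range(16):                      # 16 epochs, as on the slide
    for batch in loader:
        # Standard next-token cross-entropy: labels are the input ids themselves.
        out = model(**batch, labels=batch["input_ids"])
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()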
Method
38
Method
The outputs are sampled from the SFT model
39
Number of Prompts
Method
To increase data collection throughput, each labeler is given K = 4 to 9 outputs to rank for each prompt
40
Method
r_θ: the reward model we are trying to optimize
x: the prompt; y_w: the better completion; y_l: the worse completion
41
Method
r_θ: the reward model we are trying to optimize
x: the prompt; y_w: the better completion; y_l: the worse completion
[Loss annotations: reward on the better completion vs. reward on the worse completion]
42
Method
r_θ: the reward model we are trying to optimize
x: the prompt; y_w: the better completion; y_l: the worse completion
Small but important detail:
- Each prompt has K completions -> K choose 2 pairs to compare
43
Method
r_θ: the reward model we are trying to optimize
x: the prompt; y_w: the better completion; y_l: the worse completion
Small but important detail:
- Each prompt has K completions -> K choose 2 pairs to compare
- If, for each batch, we sample uniformly over all pairs (from any prompt):
- Each completion can appear in K - 1 gradient updates
- This can lead to overfitting
44
Method
r_θ: the reward model we are trying to optimize
x: the prompt; y_w: the better completion; y_l: the worse completion
Small but important detail:
- Each prompt has K completions -> K choose 2 pairs to compare
- If, for each batch, we sample uniformly over all pairs (from any prompt):
- Each completion can appear in K - 1 gradient updates
- This can lead to overfitting
- Solution: sample the prompt, and then put all K choose 2 pairs
from the prompt into the same batch
45
Method
r_θ: the reward model we are trying to optimize
x: the prompt; y_w: the better completion; y_l: the worse completion
Small but important detail:
- Each prompt has K completions -> K choose 2 pairs to compare
- If, for each batch, we sample uniformly over all pairs (from any prompt):
- Each completion can appear in K - 1 gradient updates
- This can lead to overfitting
- Solution: sample the prompt, and then put all K choose 2 pairs
from the prompt into the same batch
- Corollary: computationally more efficient, since this only requires K forward passes through r_θ for each prompt
- This is why there is the -1/(K choose 2) normalization in the loss (written out, with a sketch, after this slide)
46
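Putting the pieces on these slides together, the pairwise ranking loss (in the paper's notation) is

\text{loss}(\theta) = -\frac{1}{\binom{K}{2}} \, \mathbb{E}_{(x,\, y_w,\, y_l) \sim D}\!\left[\log \sigma\!\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)\right]

and a minimal PyTorch sketch of this loss with the per-prompt batching trick looks roughly like the following (reward_model and the data layout are assumptions for illustration, not the paper's actual code):

import itertools
import torch
import torch.nn.functional as F

def rm_loss_for_prompt(reward_model, prompt, ranked_completions):
    """ranked_completions: the K completions for one prompt, best first."""
    # K forward passes through r_theta, one per completion; the scores are then
    # reused for all K-choose-2 pairs (the efficiency point on the slide).
    rewards = torch.stack([reward_model(prompt, y) for y in ranked_completions])

    pair_losses = []
    for i, j in itertools.combinations(range(len(ranked_completions)), 2):
        # i < j, so completion i is the preferred one (y_w) and j the worse one (y_l).
        pair_losses.append(-F.logsigmoid(rewards[i] - rewards[j]))

    # Average over the K-choose-2 pairs: the -1/(K choose 2) normalization,
    # so prompts with more completions do not dominate the gradient.
    return torch.stack(pair_losses).mean()

Because all pairs from one prompt land in the same batch, each completion contributes to a single gradient update instead of up to K - 1 of them.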
Method
47
Method
48
Method
Use RM to update the SFT model from step 1. Call model PPO
49
Method
Use RM to update the SFT model from step 1. Call model PPO
50
Number of Prompts
Method
Use RM to update the SFT model from step 1. Call model PPO
Two problems:
1. As the policy is updated with RLHF, its outputs drift away from what the RM was trained on -> worse reward estimates
51
Method
Use RM to update the SFT model from step 1. Call model PPO
Two problems:
1. As the policy is updated with RLHF, its outputs drift away from what the RM was trained on -> worse reward estimates
Solution: add a KL penalty that keeps the PPO model's output from deviating too far from the SFT model
52
Method
Use RM to update the SFT model from step 1. Call model PPO
Two problems:
1. As the policy is updated with RLHF, its outputs drift away from what the RM was trained on -> worse reward estimates
Solution: add a KL penalty that keeps the PPO model's output from deviating too far from the SFT model
2. Using just the RL objective leads to performance degradation on many NLP tasks
53
Method
Use RM to update the SFT model from step 1. Call model PPO
Two problems:
1. As the policy is updated with RLHF, its outputs drift away from what the RM was trained on -> worse reward estimates
Solution: add a KL penalty that keeps the PPO model's output from deviating too far from the SFT model
2. Using just the RL objective leads to performance degradation on many NLP tasks
Solution: add an auxiliary LM objective on the pretraining data. Call this variant PPO-ptx
54
Method
Use RM to update the SFT model from step 1. Call model PPO
Two problems:
1. As the policy is updated with RLHF, its outputs drift away from what the RM was trained on -> worse reward estimates
Solution: add a KL penalty that keeps the PPO model's output from deviating too far from the SFT model
2. Using just the RL objective leads to performance degradation on many NLP tasks
Solution: add an auxiliary LM objective on the pretraining data. Call this variant PPO-ptx (the full objective is written out after this slide)
55
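Written out in the paper's notation, the objective these slides describe (RM reward, a KL penalty against the SFT model, and the PPO-ptx pretraining term) is

\text{objective}(\phi) = \mathbb{E}_{(x, y) \sim D_{\pi_\phi^{RL}}}\!\left[r_\theta(x, y) - \beta \log\frac{\pi_\phi^{RL}(y \mid x)}{\pi^{SFT}(y \mid x)}\right] + \gamma \, \mathbb{E}_{x \sim D_{\text{pretrain}}}\!\left[\log \pi_\phi^{RL}(x)\right]

where β scales the KL penalty, γ scales the pretraining term (γ = 0 recovers plain PPO), and π_φ^RL is initialized from π^SFT. A small sketch of how the KL term shapes the reward PPO actually optimizes (function and argument names are assumptions, not the paper's code):

import torch

def shaped_reward(rm_score, logprobs_policy, logprobs_sft, beta=0.02):
    """rm_score: scalar reward from r_theta for the sampled completion.
    logprobs_policy / logprobs_sft: per-token log-probs of that completion
    under the current policy and the frozen SFT model."""
    # Sampled estimate of the KL between the policy and the SFT model.
    kl = (logprobs_policy - logprobs_sft).sum()
    return rm_score - beta * kl   # beta is a hyperparameter controlling the penalty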
Method
56
Method: Model Summary
57
Method: Model Summary
1. SFT: Supervised Fine-Tuning
a. GPT-3 fine-tuned on human demonstrations of prompt completions
58
Method: Model Summary
1. SFT: Supervised Fine-Tuning
a. GPT-3 fine-tuned on human demonstrations of prompt completions
2. RM: Reward Model
a. Not actually used to generate anything, but used to train the PPO and PPO-ptx
models
59
Method: Model Summary
1. SFT: Supervised Fine-Tuning
a. GPT-3 fine-tuned on human demonstrations of prompt completions
2. RM: Reward Model
a. Not actually used to generate anything, but used to train the PPO and PPO-ptx
models
3. PPO
a. SFT model further fine-tuned using RL with the RM providing the reward signal
b. A KL-loss is provided to prevent the PPO model from deviating far from SFT
60
Method: Model Summary
1. SFT: Supervised Fine-Tuning
a. GPT-3 fine-tuned on human demonstrations of prompt completions
2. RM: Reward Model
a. Not actually used to generate anything, but used to train the PPO and PPO-ptx
models
3. PPO
a. SFT model further fine-tuned using RL with the RM providing the reward signal
b. A KL-loss is provided to prevent the PPO model from deviating far from SFT
4. PPO-ptx
a. Identical to PPO, except with an additional auxiliary LM objective on the
pretraining data
61
Pre-lecture Q1
62
Describe the three datasets the authors collected: SFT, RM, PPO. What are the formats of these datasets, and how are they used in the pipeline?
Pre-lecture Q1
63
Describe the three datasets the authors collected: SFT, RM, PPO. What are the formats of these datasets, and how are they used in the pipeline?
1. SFT: Set of ~13k prompts (from labellers and API) and their corresponding labeller completions. Used
to train the SFT model.
Pre-lecture Q1
64
Describe the three datasets the authors collected: SFT, RM, PPO. What are the formats of these datasets, and how are they used in the pipeline?
1. SFT: Set of ~13k prompts (from labellers and API) and their corresponding labeller completions. Used
to train the SFT model.
2. RM: Set of ~33k training prompts (from labellers and API), each with K corresponding SFT model
completions ranked by labellers. This is used to train the RM.
Pre-lecture Q1
65
Describe the three datasets the authors collected: SFT, RM, PPO. What are the formats of these datasets, and how are they used in the pipeline?
1. SFT: Set of ~13k prompts (from labellers and API) and their corresponding labeller completions. Used
to train the SFT model.
2. RM: Set of ~33k training prompts (from labellers and API), each with K corresponding SFT model
completions ranked by labellers. This is used to train the RM.
3. PPO: Set of ~31k training prompts (from API only), used as input to the PPO and PPO-ptx model for
the policy optimization step.
Pre-lecture Q1
66
Describe the three datasets the authors collected: SFT, RM, PPO. What are the formats of these datasets, and how are they used in the pipeline?
1. SFT: Set of ~13k prompts (from labellers and API) and their corresponding labeller completions. Used
to train the SFT model.
2. RM: Set of ~33k training prompts (from labellers and API), each with K corresponding SFT model
completions ranked by labellers. This is used to train the RM.
3. PPO: Set of ~31k training prompts (from API only), used as input to the PPO and PPO-ptx model for
the policy optimization step.
Note: None of these datasets are available publicly :(
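Since none of the datasets were released, the record layouts below are only an illustrative reconstruction of what each split plausibly looks like, based on the descriptions above (field names and the example prompt are assumptions):

# Illustrative record schemas for the three splits (not the released format).
sft_record = {
    "prompt": "Explain the moon landing to a 6 year old.",
    "demonstration": "People went to the moon in a big rocket ...",  # written by a labeler
}

rm_record = {
    "prompt": "Explain the moon landing to a 6 year old.",
    # K completions sampled from the SFT model, ordered best-to-worst by a labeler.
    "ranked_completions": ["completion A", "completion B", "completion C", "completion D"],
}

ppo_record = {
    # Prompt only: the policy generates a completion online and the RM scores it.
    "prompt": "Explain the moon landing to a 6 year old.",
}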
Method: Why is RL using Human Feedback (RLHF) good?
The SFT approach also uses data to align with human desiderata, why do RLHF?
67
Method: Why is RL using Human Feedback (RLHF) good?
The SFT approach also uses data to align with human desiderata, why do RLHF?
1. Reward is a more nuanced training signal than autoregressive loss
a. If the correct next token is “great”, the AR loss penalizes the prediction “amazing” the same as
“sandwiches”. The RM assigns similar rewards to sequences with similar quality.
68
Method: Why is RL using Human Feedback (RLHF) good?
The SFT approach also uses data to align with human desiderata, why do RLHF?
1. Reward is a more nuanced training signal than autoregressive loss
a. If the correct next token is “great”, the AR loss penalizes the prediction “amazing” the same as
“sandwiches”. The RM assigns similar rewards to sequences with similar quality.
2. The RM “critiques” actual completions generated from the model itself, whereas SFT training does not
use model generations, since it is completely offline.
a. This means the RM may provide more “tailored” feedback to the model
69
Method: Why is RL using Human Feedback (RLHF) good?
The SFT approach also uses data to align with human desiderata, why do RLHF?
1. Reward is a more nuanced training signal than autoregressive loss
a. If the correct next token is “great”, the AR loss penalizes the prediction “amazing” the same as
“sandwiches”. The RM assigns similar rewards to sequences with similar quality.
2. The RM “critiques” actual completions generated from the model itself, whereas SFT training does not
use model generations, since it is completely offline.
a. This means the RM may provide more “tailored” feedback to the model
3. The RM more directly captures the notion of “preference”.
a. Preferences induce rankings, and rankings can be used to infer preferences
b. Ranking is very naturally captured by the reward signal, better sequences = higher reward
c. In SFT, preference is not explicitly captured, since we only train to regurgitate “the best” example
70
Method: Why is RL using Human Feedback (RLHF) good?
The SFT approach also uses data to align with human desiderata, why do RLHF?
1. Reward is a more nuanced training signal than autoregressive loss
a. If the correct next token is “great”, the AR loss penalizes the prediction “amazing” the same as
“sandwiches”. The RM assigns similar rewards to sequences with similar quality (see the small numeric example after this slide).
2. The RM “critiques” actual completions generated from the model itself, whereas SFT training does not
use model generations, since it is completely offline.
a. This means the RM may provide more “tailored” feedback to the model
3. The RM more directly captures the notion of “preference”.
a. Preferences induce rankings, and rankings can be used to infer preferences
b. Ranking is very naturally captured by the reward signal, better sequences = higher reward
c. In SFT, preference is not explicitly captured, since we only train to regurgitate “the best” example
4. The RM is more data efficient
a. There is a reason step 1 uses 13k prompts, but step 3 can use 31k prompts.
b. For SFT, we need humans to generate the target. Once the RM is trained, it can be used to score any output
71
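A tiny numeric illustration of point 1, with hypothetical token probabilities: the autoregressive loss only looks at the probability assigned to the reference token, so it cannot tell a model that puts the remaining mass on a reasonable paraphrase apart from one that puts it on nonsense, whereas a reward model scores the whole completions.

import math

# Hypothetical next-token distributions from two models; the reference token is "great".
model_a = {"great": 0.4, "amazing": 0.5, "sandwiches": 0.1}
model_b = {"great": 0.4, "amazing": 0.1, "sandwiches": 0.5}

# Cross-entropy = -log p(reference token): identical for both models.
print(-math.log(model_a["great"]))   # ~0.916
print(-math.log(model_b["great"]))   # ~0.916

# A reward model instead scores full sequences, so "The movie was amazing" can be
# rewarded almost as highly as "The movie was great", while
# "The movie was sandwiches" gets a much lower reward.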
Evaluation
72
Original Goal: 3H
● Helpful: need to infer intention from the user (labelersʼ preference rating)
73
Original Goal: 3H
● Helpful: need to infer intention from the user (labelersʼ preference rating)
● Honest (truthfulness):
○ Hallucination (labelerʼs rating)
○ TruthfulQA dataset
74
Original Goal: 3H
● Helpful: need to infer intention from the user (labelersʼ preference rating)
● Honest (truthfulness):
○ Hallucination (labelerʼs rating)
○ TruthfulQA dataset
● Harmless:
○ RealToxicityPrompts (toxicity)
○ Winogender & CrowS-Pairs (social bias)
75
Evaluation: Testing Distributions
● API distribution
○ Prompts submitted to the original GPT-3 model (generally not instruction following)
76
Evaluation: Testing Distributions
● API distribution
○ Prompts submitted to the original GPT-3 model (generally not instruction following)
○ Prompts submitted to the InstructGPT model
77
Evaluation: Testing Distributions
● API distribution
○ Prompts submitted to the original GPT-3 model (generally not instruction following)
○ Prompts submitted to the InstructGPT model
● Public NLP tasks
○ SQuAD
○ DROP
○ HellaSwag
○ WMT 2015 French to English
78
Helpfulness: Preferences of the Labelers
79
Baseline: 50-50 win rate against SFT
80
Helpfulness: Preferences of the Labelers
● GPT vs. Instruct distribution
81
Helpfulness: Preferences of the Labelers
● GPT vs. Instruct distribution
● Labelers who provide training
data vs. new labelers
(preference overfitting)
82
Helpfulness: Preferences of the Labelers
● Researchers try to find prompts that can successfully instruct vanilla GPT-3 (they donʼt include examples in the paper)
83
Helpfulness: Preferences of the Labelers
● PPO models win across the board
84
Helpfulness: Preferences of the Labelers
Preferences of the Labelers: Breakdown
X-axis aggregated across model sizes
85
Preferences of the Labelers: Breakdown
X-axis aggregated across model sizes
86
Preferences of the Labelers: Breakdown
X-axis aggregated across model sizes
87
● Models trained with feedback data are less likely to hallucinate
● Interesting that SFT has lower hallucinations
Breakdown across Model Sizes
88
Preferences of the Labelers: Breakdown
X-axis aggregated across model sizes
89
Comparing w/ Fine-Tuned Models
● Public NLP datasets do not reflect how the API is used
○ Public datasets mostly capture tasks that are easy to evaluate automatically
○ API is more often used for open-ended generation
Instruct prompt distribution
90
Truthfulness
91
● “Instruction+QA”: instruct the model to respond with “I have no comment” when it is not
certain of the correct answer
● Models do not have to be specifically instructed to “tell the truth” to be more truthful
Truthfulness
Gray: truthfulness
Color:
truthfulness +
informativeness
92
Truthfulness
● PPO/PPO-ptx prefer truthful but uninformative answers over confident falsehoods
Gray: truthfulness
Color:
truthfulness +
informativeness
93
Toxicity & Bias
94
Toxicity: RealToxicityPrompts
95
● When instructed to be respectful, InstructGPT reduces toxicity more than GPT-3
● When instructed to be rude, InstructGPT amplifies toxicity more than GPT-3 (in paper)
Toxicity: RealToxicityPrompts
96
(Quark: Controllable Text Generation with Reinforced [Un]learning, Lu et al., 2022)
PPO-style training, not the exact InstructGPT model
Bias: Winogender & CrowS-Pairs
97
● Metric: entropy of the modelʼs probabilities over the multiple-choice completions, as a measure of bias (a small sketch of this metric follows these slides)
● Higher entropy -> less biased
Winogender
CrowS-Pairs
Bias: Winogender & CrowS-Pairs
98
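A small sketch of the entropy metric described above (illustration only; option_logprobs stands in for the model's log-probabilities of each multiple-choice completion):

import math

def choice_entropy(option_logprobs):
    """Entropy (in nats) of the model's normalized probabilities over the
    multiple-choice completions; higher entropy means less bias toward one option."""
    m = max(option_logprobs)
    exps = [math.exp(lp - m) for lp in option_logprobs]          # stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

# A Winogender-style pair where the model strongly prefers one pronoun completion:
print(choice_entropy([-1.0, -4.0]))   # ~0.19 nats: low entropy, more biased
print(choice_entropy([-2.0, -2.0]))   # ~0.69 nats (ln 2): maximal for 2 options, less biased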
Pre-Lecture Q2
99
Summarize the evaluation results of InstructGPT vs. GPT-3 on toxicity and bias. Why do you think this is the case?
Pre-Lecture Q2
100
Summarize the evaluation results of InstructGPT vs. GPT-3 on toxicity and bias. Why do you think this is the case?
Answer:
Toxicity: InstructGPT can reduce it.
Bias: The authors say in the paper that they donʼt find clear patterns
● But a reasonable hypothesis might be that itʼs not easy to get this type of feedback
● Social biases can be subtle and hard to detect
● Labelers are not explicitly instructed to catch bias
Pre-Lecture Q2
101
Instruction to the labelers
Summarize the evaluation results of InstructGPT vs. GPT-3 on toxicity and bias. Why do you think this is the case?
Qualitative Examples
● Generalizing to distributions outside of the fine-tuning data
Different Language
102
Qualitative Examples
● Generalizing to distributions outside of the fine-tuning data
Code
103
InstructGPT Still Makes Simple Mistakes
1. Incorrectly assumes the premise is true when itʼs not
104
InstructGPT Still Makes Simple Mistakes
1. Incorrectly assumes the premise is true when itʼs not
2. Overly hedging: the model might answer that there is “no one answer to the question” even when one answer is clear from the context
105
InstructGPT Still Makes Simple Mistakes
106
Too much unnecessary hedging
InstructGPT Still Makes Simple Mistakes
107
Too much unnecessary hedging
InstructGPT Still Makes Simple Mistakes
1. Incorrectly assumes the premise is true when itʼs not
2. Overly hedging: the model might answer that there is “no one answer to the question” even when one answer is clear from the context
3. Performance degrades when instructions contain multiple explicit constraints
(e.g. “list 10 movies made in the 1930s set in France”)
108
Implications
● Alignment research (more in the last lecture)
○ What are we aligning to? The Labelers? The researchers?
109
Implications
● Alignment research (more in the last lecture)
○ What are we aligning to? The Labelers? The researchers?
● How to do research when the model is constantly changing
110
Implications
● Alignment research (more in the last lecture)
○ What are we aligning to? The Labelers? The researchers?
● How to do research when the model is constantly changing
○ text-davinci-001 is the InstructGPT described in the paper
○ text-davinci-002 is the current model behind the API; it is remarkably powerful, but we donʼt know what data it was trained on or how the training procedure has been updated
111
Implications
● Alignment research (more in the last lecture)
○ What are we aligning to? The Labelers? The researchers?
● How to do research when the model is constantly changing
○ text-davinci-001 is the InstructGPT described in the paper
○ text-davinci-002 is the current model behind the API; it is remarkably powerful, but we donʼt know what data it was trained on or how the training procedure has been updated
○ How do we do model versioning when we start to iterate on the models and train them with model-dependent data?
112
Summary
Performance
● Labelers preference: InstructGPT > GPT-3
● Truthfulness: InstructGPT > GPT-3
● Toxicity: InstructGPT > GPT-3 (but not bias)
Findings
● InstructGPT can generalize to “held-out” labelersʼ preferences
● Public NLP datasets do not reflect how LMs are used in the real world
● InstructGPT can generalize outside of the RLHF fine-tuning distribution
● InstructGPT still makes simple mistakes
113
Pre-Lecture Q3
● Is preference ranking/comparison the only way to provide human feedback?
● What are other options and how to convert them into reward to train the models?
● What other types of human data do you think would be helpful?
114
Unused Slides
115
Addressing Misalignment: GPT-3
Brown et al. 2020 Language Models are Few-Shot Learners
Train: Next-token prediction -> Eval: Follow instructions (e.g. answer this question)
Prompting: Make the eval text more similar to the training corpora using prompts
116
More Related Content

Similar to rlhf.pdf

Training language models to follow instructions with human feedback (Instruct...
Training language models to follow instructions with human feedback (Instruct...Training language models to follow instructions with human feedback (Instruct...
Training language models to follow instructions with human feedback (Instruct...Rama Irsheidat
 
Model-Based Testing: Theory and Practice. Keynote @ MoTiP (ISSRE) 2012.
Model-Based Testing: Theory and Practice. Keynote @ MoTiP (ISSRE) 2012.Model-Based Testing: Theory and Practice. Keynote @ MoTiP (ISSRE) 2012.
Model-Based Testing: Theory and Practice. Keynote @ MoTiP (ISSRE) 2012.Wolfgang Grieskamp
 
Kaggle review Planet: Understanding the Amazon from Space
Kaggle reviewPlanet: Understanding the Amazon from SpaceKaggle reviewPlanet: Understanding the Amazon from Space
Kaggle review Planet: Understanding the Amazon from SpaceEduard Tyantov
 
Mb0048 operations research
Mb0048 operations researchMb0048 operations research
Mb0048 operations researchsmumbahelp
 
Mb0048 operations research
Mb0048  operations researchMb0048  operations research
Mb0048 operations researchsmumbahelp
 
Mb0048 operations research
Mb0048  operations researchMb0048  operations research
Mb0048 operations researchsmumbahelp
 
Mb0048 operations research
Mb0048  operations researchMb0048  operations research
Mb0048 operations researchsmumbahelp
 
Machine Learning vs Decision Optimization comparison
Machine Learning vs Decision Optimization comparisonMachine Learning vs Decision Optimization comparison
Machine Learning vs Decision Optimization comparisonAlain Chabrier
 
Genetic Algorithms-1.ppt
Genetic Algorithms-1.pptGenetic Algorithms-1.ppt
Genetic Algorithms-1.pptDrSanjeevPunia
 
Certification Study Group - Professional ML Engineer Session 3 (Machine Learn...
Certification Study Group - Professional ML Engineer Session 3 (Machine Learn...Certification Study Group - Professional ML Engineer Session 3 (Machine Learn...
Certification Study Group - Professional ML Engineer Session 3 (Machine Learn...gdgsurrey
 
Heuristic design of experiments w meta gradient search
Heuristic design of experiments w meta gradient searchHeuristic design of experiments w meta gradient search
Heuristic design of experiments w meta gradient searchGreg Makowski
 
SE2018_Lec 20_ Test-Driven Development (TDD)
SE2018_Lec 20_ Test-Driven Development (TDD)SE2018_Lec 20_ Test-Driven Development (TDD)
SE2018_Lec 20_ Test-Driven Development (TDD)Amr E. Mohamed
 
Java programming lab assignments
Java programming lab assignments Java programming lab assignments
Java programming lab assignments rajni kaushal
 
Programs_Problem_Solving_Algorithms.ppt
Programs_Problem_Solving_Algorithms.pptPrograms_Problem_Solving_Algorithms.ppt
Programs_Problem_Solving_Algorithms.pptmalik681299
 
Big Data Science - hype?
Big Data Science - hype?Big Data Science - hype?
Big Data Science - hype?BalaBit
 

Similar to rlhf.pdf (20)

Portfolio Planning
Portfolio PlanningPortfolio Planning
Portfolio Planning
 
Training language models to follow instructions with human feedback (Instruct...
Training language models to follow instructions with human feedback (Instruct...Training language models to follow instructions with human feedback (Instruct...
Training language models to follow instructions with human feedback (Instruct...
 
Model-Based Testing: Theory and Practice. Keynote @ MoTiP (ISSRE) 2012.
Model-Based Testing: Theory and Practice. Keynote @ MoTiP (ISSRE) 2012.Model-Based Testing: Theory and Practice. Keynote @ MoTiP (ISSRE) 2012.
Model-Based Testing: Theory and Practice. Keynote @ MoTiP (ISSRE) 2012.
 
Kaggle review Planet: Understanding the Amazon from Space
Kaggle reviewPlanet: Understanding the Amazon from SpaceKaggle reviewPlanet: Understanding the Amazon from Space
Kaggle review Planet: Understanding the Amazon from Space
 
Mb0048 operations research
Mb0048 operations researchMb0048 operations research
Mb0048 operations research
 
Mb0048 operations research
Mb0048  operations researchMb0048  operations research
Mb0048 operations research
 
Mb0048 operations research
Mb0048  operations researchMb0048  operations research
Mb0048 operations research
 
Mb0048 operations research
Mb0048  operations researchMb0048  operations research
Mb0048 operations research
 
geekgap.io webinar #1
geekgap.io webinar #1geekgap.io webinar #1
geekgap.io webinar #1
 
Machine Learning vs Decision Optimization comparison
Machine Learning vs Decision Optimization comparisonMachine Learning vs Decision Optimization comparison
Machine Learning vs Decision Optimization comparison
 
Genetic Algorithms-1.ppt
Genetic Algorithms-1.pptGenetic Algorithms-1.ppt
Genetic Algorithms-1.ppt
 
midterm_fa07.pdf
midterm_fa07.pdfmidterm_fa07.pdf
midterm_fa07.pdf
 
Certification Study Group - Professional ML Engineer Session 3 (Machine Learn...
Certification Study Group - Professional ML Engineer Session 3 (Machine Learn...Certification Study Group - Professional ML Engineer Session 3 (Machine Learn...
Certification Study Group - Professional ML Engineer Session 3 (Machine Learn...
 
Heuristic design of experiments w meta gradient search
Heuristic design of experiments w meta gradient searchHeuristic design of experiments w meta gradient search
Heuristic design of experiments w meta gradient search
 
SE2018_Lec 20_ Test-Driven Development (TDD)
SE2018_Lec 20_ Test-Driven Development (TDD)SE2018_Lec 20_ Test-Driven Development (TDD)
SE2018_Lec 20_ Test-Driven Development (TDD)
 
Day 4
Day 4Day 4
Day 4
 
Java programming lab assignments
Java programming lab assignments Java programming lab assignments
Java programming lab assignments
 
UNIT-2-PPTS-DAA.ppt
UNIT-2-PPTS-DAA.pptUNIT-2-PPTS-DAA.ppt
UNIT-2-PPTS-DAA.ppt
 
Programs_Problem_Solving_Algorithms.ppt
Programs_Problem_Solving_Algorithms.pptPrograms_Problem_Solving_Algorithms.ppt
Programs_Problem_Solving_Algorithms.ppt
 
Big Data Science - hype?
Big Data Science - hype?Big Data Science - hype?
Big Data Science - hype?
 

Recently uploaded

2024 Numerator Consumer Study of Cannabis Usage
2024 Numerator Consumer Study of Cannabis Usage2024 Numerator Consumer Study of Cannabis Usage
2024 Numerator Consumer Study of Cannabis UsageNeil Kimberley
 
Lean: From Theory to Practice — One City’s (and Library’s) Lean Story… Abridged
Lean: From Theory to Practice — One City’s (and Library’s) Lean Story… AbridgedLean: From Theory to Practice — One City’s (and Library’s) Lean Story… Abridged
Lean: From Theory to Practice — One City’s (and Library’s) Lean Story… AbridgedKaiNexus
 
Banana Powder Manufacturing Plant Project Report 2024 Edition.pptx
Banana Powder Manufacturing Plant Project Report 2024 Edition.pptxBanana Powder Manufacturing Plant Project Report 2024 Edition.pptx
Banana Powder Manufacturing Plant Project Report 2024 Edition.pptxgeorgebrinton95
 
The CMO Survey - Highlights and Insights Report - Spring 2024
The CMO Survey - Highlights and Insights Report - Spring 2024The CMO Survey - Highlights and Insights Report - Spring 2024
The CMO Survey - Highlights and Insights Report - Spring 2024christinemoorman
 
Case study on tata clothing brand zudio in detail
Case study on tata clothing brand zudio in detailCase study on tata clothing brand zudio in detail
Case study on tata clothing brand zudio in detailAriel592675
 
Lowrate Call Girls In Laxmi Nagar Delhi ❤️8860477959 Escorts 100% Genuine Ser...
Lowrate Call Girls In Laxmi Nagar Delhi ❤️8860477959 Escorts 100% Genuine Ser...Lowrate Call Girls In Laxmi Nagar Delhi ❤️8860477959 Escorts 100% Genuine Ser...
Lowrate Call Girls In Laxmi Nagar Delhi ❤️8860477959 Escorts 100% Genuine Ser...lizamodels9
 
/:Call Girls In Indirapuram Ghaziabad ➥9990211544 Independent Best Escorts In...
/:Call Girls In Indirapuram Ghaziabad ➥9990211544 Independent Best Escorts In.../:Call Girls In Indirapuram Ghaziabad ➥9990211544 Independent Best Escorts In...
/:Call Girls In Indirapuram Ghaziabad ➥9990211544 Independent Best Escorts In...lizamodels9
 
Call Girls In Connaught Place Delhi ❤️88604**77959_Russian 100% Genuine Escor...
Call Girls In Connaught Place Delhi ❤️88604**77959_Russian 100% Genuine Escor...Call Girls In Connaught Place Delhi ❤️88604**77959_Russian 100% Genuine Escor...
Call Girls In Connaught Place Delhi ❤️88604**77959_Russian 100% Genuine Escor...lizamodels9
 
BEST Call Girls In BELLMONT HOTEL ✨ 9773824855 ✨ Escorts Service In Delhi Ncr,
BEST Call Girls In BELLMONT HOTEL ✨ 9773824855 ✨ Escorts Service In Delhi Ncr,BEST Call Girls In BELLMONT HOTEL ✨ 9773824855 ✨ Escorts Service In Delhi Ncr,
BEST Call Girls In BELLMONT HOTEL ✨ 9773824855 ✨ Escorts Service In Delhi Ncr,noida100girls
 
Non Text Magic Studio Magic Design for Presentations L&P.pptx
Non Text Magic Studio Magic Design for Presentations L&P.pptxNon Text Magic Studio Magic Design for Presentations L&P.pptx
Non Text Magic Studio Magic Design for Presentations L&P.pptxAbhayThakur200703
 
Keppel Ltd. 1Q 2024 Business Update Presentation Slides
Keppel Ltd. 1Q 2024 Business Update  Presentation SlidesKeppel Ltd. 1Q 2024 Business Update  Presentation Slides
Keppel Ltd. 1Q 2024 Business Update Presentation SlidesKeppelCorporation
 
NewBase 22 April 2024 Energy News issue - 1718 by Khaled Al Awadi (AutoRe...
NewBase  22 April  2024  Energy News issue - 1718 by Khaled Al Awadi  (AutoRe...NewBase  22 April  2024  Energy News issue - 1718 by Khaled Al Awadi  (AutoRe...
NewBase 22 April 2024 Energy News issue - 1718 by Khaled Al Awadi (AutoRe...Khaled Al Awadi
 
Pitch Deck Teardown: NOQX's $200k Pre-seed deck
Pitch Deck Teardown: NOQX's $200k Pre-seed deckPitch Deck Teardown: NOQX's $200k Pre-seed deck
Pitch Deck Teardown: NOQX's $200k Pre-seed deckHajeJanKamps
 
(8264348440) 🔝 Call Girls In Hauz Khas 🔝 Delhi NCR
(8264348440) 🔝 Call Girls In Hauz Khas 🔝 Delhi NCR(8264348440) 🔝 Call Girls In Hauz Khas 🔝 Delhi NCR
(8264348440) 🔝 Call Girls In Hauz Khas 🔝 Delhi NCRsoniya singh
 
Tech Startup Growth Hacking 101 - Basics on Growth Marketing
Tech Startup Growth Hacking 101  - Basics on Growth MarketingTech Startup Growth Hacking 101  - Basics on Growth Marketing
Tech Startup Growth Hacking 101 - Basics on Growth MarketingShawn Pang
 
Call Girls In Sikandarpur Gurgaon ❤️8860477959_Russian 100% Genuine Escorts I...
Call Girls In Sikandarpur Gurgaon ❤️8860477959_Russian 100% Genuine Escorts I...Call Girls In Sikandarpur Gurgaon ❤️8860477959_Russian 100% Genuine Escorts I...
Call Girls In Sikandarpur Gurgaon ❤️8860477959_Russian 100% Genuine Escorts I...lizamodels9
 
(8264348440) 🔝 Call Girls In Keshav Puram 🔝 Delhi NCR
(8264348440) 🔝 Call Girls In Keshav Puram 🔝 Delhi NCR(8264348440) 🔝 Call Girls In Keshav Puram 🔝 Delhi NCR
(8264348440) 🔝 Call Girls In Keshav Puram 🔝 Delhi NCRsoniya singh
 
Call Girls In Kishangarh Delhi ❤️8860477959 Good Looking Escorts In 24/7 Delh...
Call Girls In Kishangarh Delhi ❤️8860477959 Good Looking Escorts In 24/7 Delh...Call Girls In Kishangarh Delhi ❤️8860477959 Good Looking Escorts In 24/7 Delh...
Call Girls In Kishangarh Delhi ❤️8860477959 Good Looking Escorts In 24/7 Delh...lizamodels9
 
Intro to BCG's Carbon Emissions Benchmark_vF.pdf
Intro to BCG's Carbon Emissions Benchmark_vF.pdfIntro to BCG's Carbon Emissions Benchmark_vF.pdf
Intro to BCG's Carbon Emissions Benchmark_vF.pdfpollardmorgan
 

Recently uploaded (20)

2024 Numerator Consumer Study of Cannabis Usage
2024 Numerator Consumer Study of Cannabis Usage2024 Numerator Consumer Study of Cannabis Usage
2024 Numerator Consumer Study of Cannabis Usage
 
Lean: From Theory to Practice — One City’s (and Library’s) Lean Story… Abridged
Lean: From Theory to Practice — One City’s (and Library’s) Lean Story… AbridgedLean: From Theory to Practice — One City’s (and Library’s) Lean Story… Abridged
Lean: From Theory to Practice — One City’s (and Library’s) Lean Story… Abridged
 
Enjoy ➥8448380779▻ Call Girls In Sector 18 Noida Escorts Delhi NCR
Enjoy ➥8448380779▻ Call Girls In Sector 18 Noida Escorts Delhi NCREnjoy ➥8448380779▻ Call Girls In Sector 18 Noida Escorts Delhi NCR
Enjoy ➥8448380779▻ Call Girls In Sector 18 Noida Escorts Delhi NCR
 
Banana Powder Manufacturing Plant Project Report 2024 Edition.pptx
Banana Powder Manufacturing Plant Project Report 2024 Edition.pptxBanana Powder Manufacturing Plant Project Report 2024 Edition.pptx
Banana Powder Manufacturing Plant Project Report 2024 Edition.pptx
 
The CMO Survey - Highlights and Insights Report - Spring 2024
The CMO Survey - Highlights and Insights Report - Spring 2024The CMO Survey - Highlights and Insights Report - Spring 2024
The CMO Survey - Highlights and Insights Report - Spring 2024
 
Case study on tata clothing brand zudio in detail
Case study on tata clothing brand zudio in detailCase study on tata clothing brand zudio in detail
Case study on tata clothing brand zudio in detail
 
Lowrate Call Girls In Laxmi Nagar Delhi ❤️8860477959 Escorts 100% Genuine Ser...
Lowrate Call Girls In Laxmi Nagar Delhi ❤️8860477959 Escorts 100% Genuine Ser...Lowrate Call Girls In Laxmi Nagar Delhi ❤️8860477959 Escorts 100% Genuine Ser...
Lowrate Call Girls In Laxmi Nagar Delhi ❤️8860477959 Escorts 100% Genuine Ser...
 
/:Call Girls In Indirapuram Ghaziabad ➥9990211544 Independent Best Escorts In...
/:Call Girls In Indirapuram Ghaziabad ➥9990211544 Independent Best Escorts In.../:Call Girls In Indirapuram Ghaziabad ➥9990211544 Independent Best Escorts In...
/:Call Girls In Indirapuram Ghaziabad ➥9990211544 Independent Best Escorts In...
 
Call Girls In Connaught Place Delhi ❤️88604**77959_Russian 100% Genuine Escor...
Call Girls In Connaught Place Delhi ❤️88604**77959_Russian 100% Genuine Escor...Call Girls In Connaught Place Delhi ❤️88604**77959_Russian 100% Genuine Escor...
Call Girls In Connaught Place Delhi ❤️88604**77959_Russian 100% Genuine Escor...
 
BEST Call Girls In BELLMONT HOTEL ✨ 9773824855 ✨ Escorts Service In Delhi Ncr,
BEST Call Girls In BELLMONT HOTEL ✨ 9773824855 ✨ Escorts Service In Delhi Ncr,BEST Call Girls In BELLMONT HOTEL ✨ 9773824855 ✨ Escorts Service In Delhi Ncr,
BEST Call Girls In BELLMONT HOTEL ✨ 9773824855 ✨ Escorts Service In Delhi Ncr,
 
Non Text Magic Studio Magic Design for Presentations L&P.pptx
Non Text Magic Studio Magic Design for Presentations L&P.pptxNon Text Magic Studio Magic Design for Presentations L&P.pptx
Non Text Magic Studio Magic Design for Presentations L&P.pptx
 
Keppel Ltd. 1Q 2024 Business Update Presentation Slides
Keppel Ltd. 1Q 2024 Business Update  Presentation SlidesKeppel Ltd. 1Q 2024 Business Update  Presentation Slides
Keppel Ltd. 1Q 2024 Business Update Presentation Slides
 
NewBase 22 April 2024 Energy News issue - 1718 by Khaled Al Awadi (AutoRe...
NewBase  22 April  2024  Energy News issue - 1718 by Khaled Al Awadi  (AutoRe...NewBase  22 April  2024  Energy News issue - 1718 by Khaled Al Awadi  (AutoRe...
NewBase 22 April 2024 Energy News issue - 1718 by Khaled Al Awadi (AutoRe...
 
Pitch Deck Teardown: NOQX's $200k Pre-seed deck
Pitch Deck Teardown: NOQX's $200k Pre-seed deckPitch Deck Teardown: NOQX's $200k Pre-seed deck
Pitch Deck Teardown: NOQX's $200k Pre-seed deck
 
(8264348440) 🔝 Call Girls In Hauz Khas 🔝 Delhi NCR
(8264348440) 🔝 Call Girls In Hauz Khas 🔝 Delhi NCR(8264348440) 🔝 Call Girls In Hauz Khas 🔝 Delhi NCR
(8264348440) 🔝 Call Girls In Hauz Khas 🔝 Delhi NCR
 
Tech Startup Growth Hacking 101 - Basics on Growth Marketing
Tech Startup Growth Hacking 101  - Basics on Growth MarketingTech Startup Growth Hacking 101  - Basics on Growth Marketing
Tech Startup Growth Hacking 101 - Basics on Growth Marketing
 
Call Girls In Sikandarpur Gurgaon ❤️8860477959_Russian 100% Genuine Escorts I...
Call Girls In Sikandarpur Gurgaon ❤️8860477959_Russian 100% Genuine Escorts I...Call Girls In Sikandarpur Gurgaon ❤️8860477959_Russian 100% Genuine Escorts I...
Call Girls In Sikandarpur Gurgaon ❤️8860477959_Russian 100% Genuine Escorts I...
 
(8264348440) 🔝 Call Girls In Keshav Puram 🔝 Delhi NCR
(8264348440) 🔝 Call Girls In Keshav Puram 🔝 Delhi NCR(8264348440) 🔝 Call Girls In Keshav Puram 🔝 Delhi NCR
(8264348440) 🔝 Call Girls In Keshav Puram 🔝 Delhi NCR
 
Call Girls In Kishangarh Delhi ❤️8860477959 Good Looking Escorts In 24/7 Delh...
Call Girls In Kishangarh Delhi ❤️8860477959 Good Looking Escorts In 24/7 Delh...Call Girls In Kishangarh Delhi ❤️8860477959 Good Looking Escorts In 24/7 Delh...
Call Girls In Kishangarh Delhi ❤️8860477959 Good Looking Escorts In 24/7 Delh...
 
Intro to BCG's Carbon Emissions Benchmark_vF.pdf
Intro to BCG's Carbon Emissions Benchmark_vF.pdfIntro to BCG's Carbon Emissions Benchmark_vF.pdf
Intro to BCG's Carbon Emissions Benchmark_vF.pdf
 

rlhf.pdf

  • 1. Training Language Models to Follow Instructions with Human Feedback Austin Wang, Howard Chen COS 597G 1 Nov 14 2022
  • 3. The three Hʼs of Model Desiderata Askell et al. 2021 A general language assistant as a laboratory for alignment 4
  • 4. The three Hʼs of Model Desiderata Helpful: ● The AI should help the user solve their task (e.g. answer their questions) Askell et al. 2021 A general language assistant as a laboratory for alignment 5
  • 5. The three Hʼs of Model Desiderata Helpful: ● The AI should help the user solve their task (e.g. answer their questions) Honest: ● The AI should give accurate information ● The AI should express uncertainty when the model doesnʼt know the answer, instead of hallucinating a wrong answer Askell et al. 2021 A general language assistant as a laboratory for alignment 6
  • 6. The three Hʼs of Model Desiderata Helpful: ● The AI should help the user solve their task (e.g. answer their questions) Honest: ● The AI should give accurate information ● The AI should express uncertainty when the model doesnʼt know the answer, instead of hallucinating a wrong answer Harmless: ● The AI should not cause physical, psychological, or social harm to people or the environment Askell et al. 2021 A general language assistant as a laboratory for alignment 7
  • 7. The Misalignment of Models Misalignment: When the training objective does not capture the desiderata we want from models 8
  • 8. The Misalignment of Models Training: Predict the next token The three Hʼs of Model Desiderata Misalignment: When the training objective does not capture the desiderata we want from models 9
  • 10. Addressing Misalignment: Instruction Following 11 - The three Hʼs are one possible set of desiderata - One more concrete desiderata is getting models to follow instructions
  • 11. Addressing Misalignment: Instruction Following Train: Next-token prediction -> Eval: Follow instructions (e.g. summarize this) 12 - The three Hʼs are one possible set of desiderata - One more concrete desiderata is getting models to follow instructions
  • 12. Addressing Misalignment: FLAN (Decoder models) Wei et al. 2022 Finetuned Language Models are Zero shot Learners Train: Next-token prediction -> Eval: Follow instructions (e.g. answer this question) 13
  • 13. Addressing Misalignment: FLAN (Decoder models) Wei et al. 2022 Finetuned Language Models are Zero shot Learners Train: Next-token prediction -> Eval: Follow instructions (e.g. answer this question) Instruction Tuning: Fine-tune models to follow instructions 14
  • 14. Addressing Misalignment: FLAN (Decoder models) Wei et al. 2022 Finetuned Language Models are Zero shot Learners Train: Next-token prediction -> Eval: Follow instructions (e.g. answer this question) Instruction Tuning: Fine-tune models to follow instructions 1. Aggregate Datasets (62): Collect wide variety of public datasets 15
  • 15. Addressing Misalignment: FLAN (Decoder models) Wei et al. 2022 Finetuned Language Models are Zero shot Learners Train: Next-token prediction -> Eval: Follow instructions (e.g. answer this question) Instruction Tuning: Fine-tune models to follow instructions 1. Aggregate Datasets (62): Collect wide variety of public datasets 2. Instruction Templates: Manually write 10 templates / dataset that captures task 16
  • 16. Addressing Misalignment: FLAN (Decoder models) Wei et al. 2022 Finetuned Language Models are Zero shot Learners Train: Next-token prediction -> Eval: Follow instructions (e.g. answer this question) Instruction Tuning: Fine-tune models to follow instructions 1. Aggregate Datasets (62): Collect wide variety of public datasets 2. Instruction Templates: Manually write 10 templates / dataset that captures task 3. Fine-tune: Use the instruction templates and datasets to fine-tune model 17
  • 17. Addressing Misalignment: FLAN (Decoder models) Wei et al. 2022 Finetuned Language Models are Zero shot Learners Train: Next-token prediction -> Eval: Follow instructions (e.g. answer this question) Instruction Tuning: Fine-tune models to follow instructions 1. Aggregate Datasets (62): Collect wide variety of public datasets 2. Instruction Templates: Manually write 10 templates / dataset that captures task 3. Fine-tune: Use the instruction templates and datasets to fine-tune model 4. Evaluate on held-out task 18
  • 18. Addressing Misalignment: T0 (Encoder-Decoder models) Sanh et al. 2022 Multitask Prompted Training Enables Zero Shot Generalization Train: Span prediction -> Eval: Follow instructions (e.g. answer this question) Basically the same idea as FLAN, except fine-tune an encoder-decoder model (T5) 20
  • 19. Addressing Misalignment: LaMDA Thoppilan et al. 2022 LaMDA Langague Models for Dialogue Applications Train: Next-token prediction -> Eval: Dialogue with human users 21
  • 20. Addressing Misalignment: LaMDA Thoppilan et al. 2022 LaMDA Langague Models for Dialogue Applications Train: Next-token prediction -> Eval: Dialogue with human users Solution: Add a bunch of dialogue text to your pretraining data - 2.97B Documents - 1.12B Dialogues and 13.39B Dialogue Utterances - 1.56T words total 22
  • 21. Addressing Misalignment: LaMDA Thoppilan et al. 2022 LaMDA Langague Models for Dialogue Applications Train: Next-token prediction -> Eval: Dialogue with human users Solution: Add a bunch of dialogue text to your pretraining data - 2.97B Documents - 1.12B Dialogues and 13.39B Dialogue Utterances - 1.56T words total 23
  • 22. Addressing Misalignment: LaMDA Thoppilan et al. 2022 LaMDA Langague Models for Dialogue Applications Train: Next-token prediction -> Eval: Dialogue with human users Solution: Add a bunch of dialogue text to your pretraining data - 2.97B Documents - 1.12B Dialogues and 13.39B Dialogue Utterances - 1.56T words total 24
  • 23. Learning from Human Feedback 25
  • 24. Method: Human Annotators 40 Annotators from Upwork/ScaleAI - Screened/Onboarded/Diverse etc etc etc Annotates train set 26
  • 25. Method: Human Annotators 40 Annotators from Upwork/ScaleAI - Screened/Onboarded/Diverse etc etc etc Different annotators from Upwork/ScaleAI - Not screened, to better mirror real-world Annotates train set Annotates eval 27
  • 26. Method: The SFT Model 28
  • 27. Method: The SFT Model 29
  • 28. Method: The SFT Model A large collections of prompts: - From OpenAI GPT3 Playground 32
  • 29. Method: The SFT Model A large collections of prompts: - From OpenAI GPT3 Playground - Filtered to remove PII - Filtered to remove duplicates - Limited to 200 examples / user - Annotators are also tasked with writing prompts 33
  • 30. Method: The SFT Model 34 Number of Prompts
  • 31. Method: The SFT Model 35
  • 32. Method: The SFT Model Finetune the model, call this model SFT Model - Initialized with pretrained GPT-3 175B model, and trained for 16 Epochs on demonstration data 36
  • 33. Method: The SFT Model Finetune the model, call this model SFT Model - Initialized with pretrained GPT-3 175B model, and trained for 16 Epochs on demonstration data - In notation also refer to as: 37
  • 35. Method The outputs are sampled from the SFT model 39 Number of Prompts
  • 36. Method To increase data collection throughput, each user is given K = 4 to 9 outputs to rank for each prompt 40
  • 37. Method rθ : The reward model we are trying to optimize x: the prompt yw : the better completion yl : the worse completion 41
  • 38. Method rθ : The reward model we are trying to optimize x: the prompt yw : the better completion yl : the worse completion Reward on better completion Reward on worse completion 42
  • 39. Method rθ : The reward model we are trying to optimize x: the prompt yw : the better completion yl : the worse completion Small but important detail: - Each prompt has K completions -> K choose 2 pairs to compare 43
  • 40. Method rθ : The reward model we are trying to optimize x: the prompt yw : the better completion yl : the worse completion Small but important detail: - Each prompt has K completions -> K choose 2 pairs to compare - If ∀ batch we sample uniform over every pair (from any prompt): - Each completion can appear in K - 1 gradient updates - This can lead to overfitting 44
  • 41. Method rθ : The reward model we are trying to optimize x: the prompt yw : the better completion yl : the worse completion Small but important detail: - Each prompt has K completions -> K choose 2 pairs to compare - If ∀ batch we sample uniform over every pair (from any prompt): - Each completion can appear in K - 1 gradient updates - This can lead to overfitting - Solution: sample the prompt, and then put all K choose 2 pairs from the prompt into the same batch 45
  • 42. Method rθ : The reward model we are trying to optimize x: the prompt yw : the better completion yl : the worse completion Small but important detail: - Each prompt has K completions -> K choose 2 pairs to compare - If ∀ batch we sample uniform over every pair (from any prompt): - Each completion can appear in K - 1 gradient updates - This can lead to overfitting - Solution: sample the prompt, and then put all K choose 2 pairs from the prompt into the same batch - Corollary: computationally more efficient, since this only requires K forward passes through rθ for each prompt - This is why there is the -1/(K choose 2) normalization in loss 46
  • 45. Method Use RM to update the SFT model from step 1. Call model PPO 49
  • 46. Method Use RM to update the SFT model from step 1. Call model PPO 50 Number of Prompts
  • 47. Method Use RM to update the SFT model from step 1. Call model PPO Two problems: 1. As RLHF is updated, its outputs become very different from what the RM was trained on -> worse reward estimates 51
  • 48. Method Use RM to update the SFT model from step 1. Call model PPO Two problems: 1. As RLHF is updated, its outputs become very different from what the RM was trained on -> worse reward estimates Solution: add a KL penalty that makes sure PPO model output does not deviate too far from SFT 52
  • 49. Method Use RM to update the SFT model from step 1. Call model PPO Two problems: 1. As RLHF is updated, its outputs become very different from what the RM was trained on -> worse reward estimates Solution: add a KL penalty that makes sure PPO model output does not deviate too far from SFT 2. Just using RL objective leads to performance degradation on many NLP tasks 53
  • 50. Method Use RM to update the SFT model from step 1. Call model PPO Two problems: 1. As RLHF is updated, its outputs become very different from what the RM was trained on -> worse reward estimates Solution: add a KL penalty that makes sure PPO model output does not deviate too far from SFT 2. Just using RL objective leads to performance degradation on many NLP tasks Solution: Add a auxiliary LM objective on the pretraining data. Call this variant PPO-ptx 54
  • 51. Method Use RM to update the SFT model from step 1. Call model PPO Two problems: 1. As RLHF is updated, its outputs become very different from what the RM was trained on -> worse reward estimates Solution: add a KL penalty that makes sure PPO model output does not deviate too far from SFT 2. Just using RL objective leads to performance degradation on many NLP tasks Solution: Add a auxiliary LM objective on the pretraining data. Call this variant PPO-ptx 55
  • 54. Method: Model Summary 1. SFT: Supervised Fine-Tuning a. GPT-3 fine-tuned on human demonstrations of prompt completions 58
  • 55. Method: Model Summary 1. SFT: Supervised Fine-Tuning a. GPT-3 fine-tuned on human demonstrations of prompt completions 2. RM: Reward Model a. Not actually used to generate anything, but used to train the PPO and PPO-ptx models 59
  • 56. Method: Model Summary 1. SFT: Supervised Fine-Tuning a. GPT-3 fine-tuned on human demonstrations of prompt completions 2. RM: Reward Model a. Not actually used to generate anything, but used to train the PPO and PPO-ptx models 3. PPO a. SFT model further fine-tuned using RL with the RM providing the reward signal b. A KL-loss is provided to prevent the PPO model from deviating far from SFT 60
  • 57. Method: Model Summary 1. SFT: Supervised Fine-Tuning a. GPT-3 fine-tuned on human demonstrations of prompt completions 2. RM: Reward Model a. Not actually used to generate anything, but used to train the PPO and PPO-ptx models 3. PPO a. SFT model further fine-tuned using RL with the RM providing the reward signal b. A KL-loss is provided to prevent the PPO model from deviating far from SFT 4. PPO-ptx a. Identical to PPO, except with an additional auxiliary LM objective on the pretraining data 61
  • 58. Pre-lecture Q1 62 Describe the three datasets the authors collected: SFT, RM, PPO. What are the format of these datasets and how are they used in the pipeline?
  • 59. Pre-lecture Q1 63 Describe the three datasets the authors collected: SFT, RM, PPO. What are the format of these datasets and how are they used in the pipeline? 1. SFT: Set of ~13k prompts (from labellers and API) and their corresponding labeller completions. Used to train the SFT model.
  • 60. Pre-lecture Q1 64 Describe the three datasets the authors collected: SFT, RM, PPO. What are the format of these datasets and how are they used in the pipeline? 1. SFT: Set of ~13k prompts (from labellers and API) and their corresponding labeller completions. Used to train the SFT model. 2. RM: Set of ~33k training prompts (from labellers and API), each with K corresponding SFT model completions ranked by labellers. This is used to train the RM.
  • 61. Pre-lecture Q1 65 Describe the three datasets the authors collected: SFT, RM, PPO. What are the format of these datasets and how are they used in the pipeline? 1. SFT: Set of ~13k prompts (from labellers and API) and their corresponding labeller completions. Used to train the SFT model. 2. RM: Set of ~33k training prompts (from labellers and API), each with K corresponding SFT model completions ranked by labellers. This is used to train the RM. 3. PPO: Set of ~31k training prompts (from API only), used as input to the PPO and PPO-ptx model for the policy optimization step.
  • 62. Pre-lecture Q1 66 Describe the three datasets the authors collected: SFT, RM, PPO. What are the format of these datasets and how are they used in the pipeline? 1. SFT: Set of ~13k prompts (from labellers and API) and their corresponding labeller completions. Used to train the SFT model. 2. RM: Set of ~33k training prompts (from labellers and API), each with K corresponding SFT model completions ranked by labellers. This is used to train the RM. 3. PPO: Set of ~31k training prompts (from API only), used as input to the PPO and PPO-ptx model for the policy optimization step. Note: None of these datasets are available publically :(
  • 63. Method: Why is RL using Human Feedback (RLHF) good? The SFT approach also uses data to align with human desiderata, why do RLHF? 67
  • 64. Method: Why is RL using Human Feedback (RLHF) good? The SFT approach also uses data to align with human desiderata, why do RLHF? 1. Reward is a more nuanced training signal than autoregressive loss a. If the correct next token is “great”, the AR loss penalizes the prediction “amazing” the same as “sandwiches”. The RM assigns similar rewards to sequences with similar quality. 68
  • 65. Method: Why is RL using Human Feedback (RLHF) good? The SFT approach also uses data to align with human desiderata, why do RLHF? 1. Reward is a more nuanced training signal than autoregressive loss a. If the correct next token is “great”, the AR loss penalizes the prediction “amazing” the same as “sandwiches”. The RM assigns similar rewards to sequences with similar quality. 2. The RM “critiques” actual completions generated from the model itself, whereas SFT training does not use model generations, since it is completely offline. a. This means the RM may provide more “tailored” feedback to the model 69
• 66. Method: Why is RL using Human Feedback (RLHF) good? The SFT approach also uses data to align with human desiderata, why do RLHF? 1. Reward is a more nuanced training signal than autoregressive loss a. If the correct next token is “great”, the AR loss penalizes the prediction “amazing” the same as “sandwiches”. The RM assigns similar rewards to sequences with similar quality. 2. The RM “critiques” actual completions generated from the model itself, whereas SFT training does not use model generations, since it is completely offline. a. This means the RM may provide more “tailored” feedback to the model 3. The RM more directly captures the notion of “preference”. a. Preferences induce rankings, and rankings can be used to infer preferences b. Ranking is very naturally captured by the reward signal, better sequences = higher reward c. In SFT, preference is not explicitly captured, since we only train to regurgitate “the best” example 70
• 67. Method: Why is RL using Human Feedback (RLHF) good? The SFT approach also uses data to align with human desiderata, why do RLHF? 1. Reward is a more nuanced training signal than autoregressive loss a. If the correct next token is “great”, the AR loss penalizes the prediction “amazing” the same as “sandwiches”. The RM assigns similar rewards to sequences with similar quality. 2. The RM “critiques” actual completions generated from the model itself, whereas SFT training does not use model generations, since it is completely offline. a. This means the RM may provide more “tailored” feedback to the model 3. The RM more directly captures the notion of “preference”. a. Preferences induce rankings, and rankings can be used to infer preferences b. Ranking is very naturally captured by the reward signal, better sequences = higher reward c. In SFT, preference is not explicitly captured, since we only train to regurgitate “the best” example 4. The RM is more data efficient a. There is a reason step 1 uses 13k prompts, but step 3 can use 31k prompts. b. For SFT, we need humans to generate the target. Once we train the RM, it can be used to score any output 71
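Point 3 is exactly what the reward-model loss in the paper encodes: each labeler ranking of K completions is expanded into pairwise comparisons, and the RM is trained so the preferred completion y_w scores higher than the less preferred y_l:

```latex
\mathrm{loss}(\theta) =
  -\frac{1}{\binom{K}{2}}\,
  \mathbb{E}_{(x,\,y_w,\,y_l)\sim D}
  \left[ \log \sigma\!\left( r_{\theta}(x, y_w) - r_{\theta}(x, y_l) \right) \right]
```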
  • 69. Original Goal: 3H ● Helpful: need to infer intention from the user (labelersʼ preference rating) 73
  • 70. Original Goal: 3H ● Helpful: need to infer intention from the user (labelersʼ preference rating) ● Honest (truthfulness): ○ Hallucination (labelerʼs rating) ○ TruthfulQA dataset 74
  • 71. Original Goal: 3H ● Helpful: need to infer intention from the user (labelersʼ preference rating) ● Honest (truthfulness): ○ Hallucination (labelerʼs rating) ○ TruthfulQA dataset ● Harmless: ○ RealToxicityPrompts (toxicity) ○ Winogender & CrowS-Pairs (social bias) 75
  • 72. Evaluation: Testing Distributions ● API distribution ○ Prompts submitted to the original GPT-3 model (generally not instruction following) 76
  • 73. Evaluation: Testing Distributions ● API distribution ○ Prompts submitted to the original GPT-3 model (generally not instruction following) ○ Prompts submitted to the InstructGPT model 77
  • 74. Evaluation: Testing Distributions ● API distribution ○ Prompts submitted to the original GPT-3 model (generally not instruction following) ○ Prompts submitted to the InstructGPT model ● Public NLP tasks ○ SQuAD ○ DROP ○ HellaSwag ○ WMT 2015 French to English 78
  • 75. Helpfulness: Preferences of the Labelers 79
  • 76. Baseline: 50-50 win rate against SFT 80 Helpfulness: Preferences of the Labelers
  • 77. ● GPT vs. Instruct distribution 81 Helpfulness: Preferences of the Labelers
  • 78. ● GPT vs. Instruct distribution ● Labelers who provide training data vs. new labelers (preference overfitting) 82 Helpfulness: Preferences of the Labelers
• 79. ● Researchers try to find prompts that can successfully instruct vanilla GPT-3 (they donʼt include examples in the paper) 83 Helpfulness: Preferences of the Labelers
  • 80. ● PPO models win across the board 84 Helpfulness: Preferences of the Labelers
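For reading these plots: the win rate is simply the fraction of head-to-head labeler comparisons in which a model's output is preferred over the 175B SFT baseline. A minimal sketch follows; the tie-handling convention here is an assumption, not taken from the paper:

```python
def win_rate(judgments):
    """judgments: list of labeler verdicts, each "model", "baseline", or "tie".
    Ties count as half a win (an assumption; the paper's exact convention may differ)."""
    score = sum(1.0 if j == "model" else 0.5 if j == "tie" else 0.0 for j in judgments)
    return score / len(judgments)

# A model indistinguishable from the SFT baseline hovers around 0.5.
print(win_rate(["model", "model", "tie", "baseline"]))  # 0.625
```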
  • 81. Preferences of the Labelers: Breakdown X-axis aggregated across model sizes 85
  • 82. Preferences of the Labelers: Breakdown X-axis aggregated across model sizes 86
• 83. Preferences of the Labelers: Breakdown X-axis aggregated across model sizes 87 ● Models trained with feedback data are less likely to hallucinate ● Interesting that SFT has a lower hallucination rate
  • 85. Preferences of the Labelers: Breakdown X-axis aggregated across model sizes 89
• 86. Comparing w/ Fine-Tuned Models ● Public NLP datasets do not reflect how the API is used ○ Public datasets mostly capture tasks that are easy to evaluate automatically ○ The API is more often used for open-ended generation Instruct prompt distribution 90
• 87. Truthfulness 91 ● “Instruction+QA”: instruct the model to respond with “I have no comment” when it is not certain of the correct answer ● Models do not have to be specifically instructed to “tell the truth” to be more truthful
  • 89. Truthfulness ● PPO/PPO-ptx choose truthful + uninformative > confident falsehood Gray: truthfulness Color: truthfulness + informativeness 93
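A hypothetical sketch of what the “Instruction+QA” prompt above could look like; the exact instruction wording used in the paper is not reproduced here, and the question is simply a TruthfulQA-style example:

```python
# Hypothetical "Instruction+QA" prompt: the instruction licenses the model to
# abstain instead of producing a confident falsehood.
instruction_qa_prompt = (
    'Answer the question. If you are not certain of the correct answer, '
    'respond with "I have no comment".\n\n'
    "Q: What happens to you if you eat watermelon seeds?\n"
    "A:"
)
```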
• 91. Toxicity: RealToxicityPrompts 95 ● When instructed to be respectful, InstructGPT reduces toxicity > GPT-3 ● When instructed to be rude, InstructGPT amplifies toxicity > GPT-3 (in paper)
  • 92. Toxicity: RealToxicityPrompts 96 (Quark: Controllable Text Generation with Reinforced [Un]learning, Lu et al., 2022) PPO-style training, not the exact InstructGPT model
  • 93. Bias: Winogender & CrowS-Pairs 97 ● Metric: entropy of the multi-choice completion as the measure of bias ● Higher entropy -> less biased Winogender CrowS-Pairs
  • 94. Bias: Winogender & CrowS-Pairs 98
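A minimal sketch of the entropy-based bias metric described above: given the model's probability for each candidate completion of a multiple-choice item, a near-uniform (high-entropy) distribution means the model is not systematically committing to one, possibly stereotyped, option. The details of normalization here are assumptions:

```python
import math

def completion_entropy(probs):
    """Entropy (in bits) of the model's distribution over candidate completions.
    Higher entropy -> less committed to any single (possibly biased) completion."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# e.g. two Winogender-style pronoun resolutions for the same sentence
print(completion_entropy([0.5, 0.5]))    # 1.0  -> maximally "unbiased" by this metric
print(completion_entropy([0.95, 0.05]))  # ~0.29 -> strong preference for one reading
```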
  • 95. Pre-Lecture Q2 99 Summarize the evaluation results of InstructGPT vs GPT-3 in toxicity and bias. Why do you think it is the case?
• 96. Pre-Lecture Q2 100 Summarize the evaluation results of InstructGPT vs GPT-3 in toxicity and bias. Why do you think it is the case? Answer: Toxicity: InstructGPT can reduce it. Bias: The authors say in the paper that they donʼt find clear patterns ● But a reasonable hypothesis might be that itʼs not easy to get this type of feedback ● Social biases can be subtle and hard to detect ● Labelers are not very directly instructed to catch bias
  • 97. Pre-Lecture Q2 101 Instruction to the labelers Summarize the evaluation results of InstructGPT vs GPT-3 in toxicity and bias. Why do you think it is the case?
• 98. Qualitative Examples ● Generalizing to distributions outside of the fine-tuning data Different Language 102
• 99. Qualitative Examples ● Generalizing to distributions outside of the fine-tuning data Code 103
  • 100. InstructGPT Still Makes Simple Mistakes 1. Incorrectly assumes the premise is true when itʼs not 104
• 101. InstructGPT Still Makes Simple Mistakes 1. Incorrectly assumes the premise is true when itʼs not 2. Overly hedging: the model might answer “no one answer to the question” even when one answer is clear from the context 105
  • 102. InstructGPT Still Makes Simple Mistakes 106 Too much unnecessary hedging
  • 103. InstructGPT Still Makes Simple Mistakes 107 Too much unnecessary hedging
• 104. InstructGPT Still Makes Simple Mistakes 1. Incorrectly assumes the premise is true when itʼs not 2. Overly hedging: the model might answer “no one answer to the question” even when one answer is clear from the context 3. Performance degrades when instructions contain multiple explicit constraints (e.g. “list 10 movies made in the 1930ʼs set in France”) 108
  • 105. Implications ● Alignment research (more in the last lecture) ○ What are we aligning to? The Labelers? The researchers? 109
  • 106. Implications ● Alignment research (more in the last lecture) ○ What are we aligning to? The Labelers? The researchers? ● How to do research when the model is constantly changing 110
• 107. Implications ● Alignment research (more in the last lecture) ○ What are we aligning to? The Labelers? The researchers? ● How to do research when the model is constantly changing ○ text-davinci-001 is the InstructGPT described in the paper ○ text-davinci-002 is the current model behind the API—this model is crazily powerful, but we donʼt know what data itʼs trained on or what changes have been made to the training procedure 111
• 108. Implications ● Alignment research (more in the last lecture) ○ What are we aligning to? The Labelers? The researchers? ● How to do research when the model is constantly changing ○ text-davinci-001 is the InstructGPT described in the paper ○ text-davinci-002 is the current model behind the API—this model is crazily powerful, but we donʼt know what data itʼs trained on or what changes have been made to the training procedure ○ How do we do model versioning when we start to iterate on the models and train them with model-dependent data? 112
• 109. Summary Performance ● Labelersʼ preference: InstructGPT > GPT-3 ● Truthfulness: InstructGPT > GPT-3 ● Toxicity: InstructGPT > GPT-3 (but no clear improvement on bias) Findings ● InstructGPT can generalize to “held-out” labelersʼ preferences ● Public NLP datasets do not reflect how LMs are used in the real world ● InstructGPT can generalize outside of the RLHF instruction distribution ● InstructGPT still makes simple mistakes 113
• 110. Pre-Lecture Q3 ● Is preference ranking/comparison the only way to provide human feedback? ● What are other options, and how could they be converted into a reward to train the models? ● What other types of human data do you think would be helpful? 114
• 112. Addressing Misalignment: GPT-3 Brown et al. 2020 Language Models are Few-Shot Learners Train: Next-token prediction -> Eval: Follow instructions (e.g. answer this question) Prompting: Make the eval text more similar to the training corpora using prompts 116
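For contrast with instruction tuning and RLHF, a hypothetical sketch of the GPT-3-style prompting approach: cast the task as text that resembles the pretraining corpus, optionally with a few in-context examples, and read off the continuation:

```python
# Hypothetical few-shot prompt in the GPT-3 style: no fine-tuning, just make the
# eval input look like plausible pretraining text.
prompt = (
    "Translate French to English.\n\n"
    "French: Où est la bibliothèque ?\n"
    "English: Where is the library?\n\n"
    "French: J'aime apprendre les langues.\n"
    "English:"
)
# Whatever the model generates after the final "English:" is taken as its answer.
```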