Training language
models to follow
instructions with human
feedback (InstructGPT)
Long Ouyang, Jeff Wu, Xu Jiang et al. (OpenAI)
Rama Irsheidat
02 Objectives
03 Methodology
Future works
Model Owner (Team of authors)
Long Ouyang*
Research Scientist at OpenAI
Jeffrey Wu*
Research engineer on OpenAI's safety team
OpenAI team*
AI research and deployment company dedicated to
ensuring that general-purpose artificial
intelligence benefits all of humanity.
* All pictures and information about the authors and the company are from LinkedIn
● Large language models (LMs) can perform a range of
natural language processing (NLP) tasks when prompted
with examples.
● However, these models often exhibit unintended
behaviors, such as generating biased or toxic text,
making up facts, or not following user instructions.
● They aim to align LMs by training them to act in
accordance with the user's intention, both explicit
and implicit, and they evaluate whether the resulting
models are helpful, honest, and harmless.
● They use fine-tuning with reinforcement learning from
human feedback (RLHF) to align language models to
follow a broad class of written instructions.
Pre-training task
They focus on fine-tuning the pre-trained GPT-3
language model with human feedback. GPT-3
is pre-trained on a large corpus of text data
using a language modeling objective, which
involves predicting the next token in a sequence
of text.
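As a rough illustration of the language modeling objective mentioned above, the sketch below computes the average negative log-likelihood of the observed next tokens. The function name `next_token_loss` and the probability values are hypothetical and only serve the example; they are not taken from the paper.

```python
import math


def next_token_loss(token_probs):
    """Average negative log-likelihood (in nats) of the observed next tokens.

    token_probs[i] is the probability the model assigned to the token that
    actually appeared at position i. The values used below are made up.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


# Hypothetical probabilities assigned to a 4-token continuation.
probs = [0.5, 0.25, 0.5, 0.25]
loss = next_token_loss(probs)
print(f"mean NLL: {loss:.4f} nats, perplexity: {math.exp(loss):.4f}")
```

Pre-training lowers this quantity over a large text corpus; a lower loss means the model assigns higher probability to the text that actually follows.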
Three steps for building InstructGPT
● As a first step, they hired a team of 40 contractors
based on their screening test results.
● Then, they trained their supervised learning baselines
based on human-written demonstrations of the desired
output behavior using (mostly English) prompts
submitted to the OpenAI API and some labeler-written
prompts. (Source of training data)
Supervised fine-tuning model
Three steps for building InstructGPT
● Secondly, they gather a dataset of human-labeled
comparisons between outputs from OpenAI's models on a
larger set of API prompts.
● Then, they train a reward model (RM) on this dataset to
predict which model output their labelers would prefer.
Reward model (RM) training
Three steps for building InstructGPT
● Finally, they use this RM as a reward function and
fine-tune the supervised learning baseline to maximize
this reward using the PPO algorithm.
Optimizing a policy against the reward
model using Reinforcement learning (RL)
● They primarily evaluate their models by having their
labelers rate the quality of model outputs on the test
set, consisting of prompts from held-out customers who
were not included in the training data.
● Additionally, they perform automatic evaluations using
various public NLP datasets.
● SFT: Supervised fine-tuning; the model fine-tuned on labeler demonstrations.
● PPO: Proximal policy optimization, the reinforcement learning algorithm
used in the paper to fine-tune the language model with human feedback.
● PPO-ptx: A variant of PPO used to fine-tune the InstructGPT models; it
mixes pretraining gradients into the PPO updates.
● GPT: Generative Pre-trained Transformer; here, the GPT-3 model without
additional fine-tuning.
● GPT (prompted): GPT-3 given a few-shot prompt to make it better at
following instructions.
● (GPT, GPT prompted): GPT-3 baselines.
They found that outputs from the 1.3B parameter
InstructGPT model are preferred to outputs from the 175B
GPT-3 model, despite having 100x fewer parameters.
Human evaluations on OpenAI API
prompt distribution
They experiment with different sizes
of the GPT-3 language models (1.3B,
6B, and 175B parameters). They also
compare the performance of their
InstructGPT models, including the
1.3B-parameter one, to the original
GPT-3 models, as shown on the
previous slide.
Variants of the model
in terms of size
model architecture
InstructGPT is fine-tuned from
GPT-3, so it uses the same
decoder-only Transformer
architecture.
Comparing the performance of GPT-3 and InstructGPT models
on various tasks
● InstructGPT models outperform GPT-3 in terms of
generating appropriate outputs, following
explicit constraints in the instructions, and
generating more truthful and informative
answers.
● InstructGPT models generate information not
present in the input about half as often as
GPT-3 models.
● InstructGPT models show small improvements in
toxicity.
● Minimizing performance regressions on public
NLP datasets
● Generalizing to the preferences of “held-out”
labelers
● InstructGPT models show promising
generalization to instructions outside of the
RLHF fine-tuning distribution
● GPT-3 models do not match InstructGPT in
terms of generating appropriate
outputs, even when given a few-shot
prompt to make them better at
following instructions.
● InstructGPT shows no significant
improvement over GPT-3 in bias.
● GPT-3 is the reference point on public
NLP datasets; RLHF fine-tuning introduces
performance regressions relative to it.
● GPT-3 models can perform many tasks
but require more careful prompting and
do not usually follow instructions.
Objective 1
Creating a language model, called InstructGPT, that can
follow a broad class of written instructions
helpfully and safely, while avoiding
generating untruthful, toxic, or otherwise
harmful outputs.
Objective 2
Showing that using human feedback to fine-tune language
models is a promising approach for aligning language models
with human intent.
Objective 3
Showing that a model with 1.3 billion
parameters can be preferred over the source
model with 175 billion parameters.
Related work
Methods and experimental details
Additional details
Related work
● Research on alignment and learning from human
feedback.
● Training language models to follow instructions.
● Evaluating the harms of language models.
● Modifying the behavior of language models to mitigate
harms.
High-level methodology
1. Collect demonstration data and train a supervised policy.
2. Collect comparison data and train a reward model.
3. Optimize a policy against the reward model using PPO.
Methods and experimental details
To train the first InstructGPT models, labelers needed to
write prompts themselves, since an initial
source of instruction-like prompts was required to bootstrap the
process.
Three kinds of prompts are used:
Plain - an arbitrary task
Few-shot - an instruction with multiple query/response pairs
User-based - use cases stated in waitlist applications to the OpenAI API
Methods and experimental details
Data cleaning
They heuristically de-duplicate prompts by checking for
prompts that share a long common prefix, and they limit
the number of prompts to 200 per user ID.
They also create their train, validation, and test splits
based on user ID, so that the validation and test sets
contain no data from users whose data is in the training
set.
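The data cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the prefix length of 40 characters, the global (rather than per-user) de-duplication, the hash-based split assignment, and the names `clean` and `split_of` are all assumptions made for the example. Only the cap of 200 prompts per user ID and the user-level split come from the slide.

```python
import hashlib
from collections import defaultdict

PREFIX_LEN = 40      # assumed length of a "long" common prefix
MAX_PER_USER = 200   # cap per user ID, as stated on the slide


def clean(records, prefix_len=PREFIX_LEN, max_per_user=MAX_PER_USER):
    """Filter (user_id, prompt) pairs: drop prompts whose prefix was already
    seen, and keep at most max_per_user prompts per user."""
    seen_prefixes = set()
    per_user = defaultdict(int)
    kept = []
    for user_id, prompt in records:
        prefix = prompt[:prefix_len]
        if prefix in seen_prefixes or per_user[user_id] >= max_per_user:
            continue
        seen_prefixes.add(prefix)
        per_user[user_id] += 1
        kept.append((user_id, prompt))
    return kept


def split_of(user_id, val_frac=0.1, test_frac=0.1):
    """Assign a user, and therefore all of their prompts, to one split.
    Hashing the user ID keeps the assignment deterministic."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + val_frac:
        return "valid"
    return "train"


records = [
    ("u1", "Summarize the following article in two sentences: A"),
    ("u1", "Summarize the following article in two sentences: B"),
    ("u2", "Translate to French: hello"),
]
for user_id, prompt in clean(records):
    print(split_of(user_id), user_id, prompt)
```

Splitting on the user ID rather than on individual prompts is what prevents prompts from the same user from appearing in both the training and evaluation sets.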
Methods and experimental details
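From the prompts used, they
produced three different
datasets used in their
fine-tuning procedure: the SFT
dataset (labeler demonstrations),
the RM dataset (labeler rankings of
model outputs), and the PPO dataset
(prompts without human labels, used
as inputs for RLHF fine-tuning).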
Methods and experimental details
Their training tasks are from two sources:
1. Dataset of prompts written by their labelers
2. Dataset of prompts submitted to early InstructGPT
models on their API
These prompts are very diverse and include generation,
question answering, dialog, summarization, extraction,
and other natural language tasks.
Methods and experimental details
Human data collection
Through screening tests, the aim was to select a group of
labelers who were sensitive to the preferences of different
demographic groups and able to identify potentially harmful
outputs.
During training and evaluation, their alignment criteria
may come into conflict:
● Training: prioritize helpfulness to the user
● Evaluating: prioritize truthfulness and harmlessness
Methods and experimental details
Human data collection
To test whether the model generalizes to other labelers, a
separate set of labelers is hired who do not produce any
training data (held-out labelers).
Despite the complexity of the task, they find that inter-
annotator agreement rates are quite high:
● Training labelers: agree with each other 72.6 ± 1.5%
● Held-out labelers: the number is 77.3 ± 1.3%
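One common way to compute an agreement rate like the ones above is the fraction of labeler pairs that made the same choice on the same comparison. The sketch below uses that definition as an assumption; the slide does not say exactly how the paper computes its figures, and the toy labels are made up.

```python
from itertools import combinations


def pairwise_agreement(labels_by_item):
    """Fraction of labeler pairs that agree, pooled over all items.

    labels_by_item is a list of lists; each inner list holds the choices
    that different labelers made for the same comparison (e.g. "A" or "B").
    """
    agree = 0
    total = 0
    for labels in labels_by_item:
        for a, b in combinations(labels, 2):
            agree += int(a == b)
            total += 1
    return agree / total if total else float("nan")


# Made-up labels: three comparisons judged by two or three labelers each.
toy = [["A", "A", "B"], ["B", "B", "B"], ["A", "B"]]
print(f"agreement: {pairwise_agreement(toy):.3f}")
```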
Methods and experimental details
Supervised fine-tuning (SFT)
• Fine-tune GPT-3 on labeler
demonstrations using
supervised learning
• Train for 16 epochs using
cosine learning rate decay
and residual dropout of 0.2
• SFT models overfit on
validation loss after 1 epoch
Reward modeling (RM)
• Starting with the SFT model
with the final unembedding
layer removed, train a
model to take in a prompt
and response and output a
scalar reward
• Present labelers with K = 4
to K = 9 responses to rank
• Train on all K-choose-2 comparisons
from each prompt as a
single batch element
Reinforcement learning (RL)
• Fine-tune the SFT model with PPO
• Given a prompt and response, the
environment produces a reward determined
by the reward model and ends the episode
• Add a per-token KL penalty from the
SFT model at each token to mitigate
over-optimization of the reward model
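The per-token KL penalty used in the RL step can be sketched as reward shaping: each sampled token is penalized by the difference between its log-probability under the current policy and under the frozen SFT model, and the reward model's scalar score is added at the final token. The function name `shaped_rewards`, the default `beta`, and the numbers below are illustrative assumptions, not values confirmed by the slides.

```python
def shaped_rewards(rm_score, logp_policy, logp_sft, beta=0.02):
    """Per-token rewards for one sampled response.

    rm_score:    scalar reward-model score for the full response
    logp_policy: log-probs of the sampled tokens under the current policy
    logp_sft:    log-probs of the same tokens under the frozen SFT model
    beta:        KL penalty coefficient (illustrative default)
    """
    if len(logp_policy) != len(logp_sft):
        raise ValueError("log-prob sequences must have the same length")
    # The penalty is a single-sample estimate of the per-token KL divergence.
    rewards = [-beta * (lp - ls) for lp, ls in zip(logp_policy, logp_sft)]
    if rewards:
        rewards[-1] += rm_score  # the episode ends with the RM reward
    return rewards


# Made-up log-probabilities for a 3-token response.
print(shaped_rewards(1.2, [-0.9, -1.5, -0.3], [-1.1, -1.5, -0.8]))
```

A policy that drifts far from the SFT model pays a growing penalty, which limits how much it can exploit weaknesses in the reward model.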
Methods and experimental details
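Loss function for the reward model
$$\text{loss}(\theta) = -\frac{1}{\binom{K}{2}} \, \mathbb{E}_{(x, y_w, y_l) \sim D} \left[ \log \sigma\left( r_\theta(x, y_w) - r_\theta(x, y_l) \right) \right]$$
where $r_\theta(x, y)$ is the scalar output of the reward model
for prompt $x$ and completion $y$ with parameters $\theta$, $y_w$ is the
preferred completion out of the pair of $y_w$ and $y_l$, $\sigma$ is the
logistic sigmoid, and $D$ is the dataset of human comparisons.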
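As a minimal sketch of this pairwise ranking loss, the code below takes reward-model scores for the K responses to one prompt, ordered from most to least preferred, and averages the negative log-sigmoid of the score difference over all K-choose-2 pairs. The function names and scores are hypothetical, and ties between responses are ignored for simplicity.

```python
import math
from itertools import combinations


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def rm_loss_for_prompt(rewards_ranked):
    """Mean pairwise loss over all C(K, 2) comparisons for one prompt.

    rewards_ranked holds scores r(x, y) for K responses, ordered from the
    most preferred to the least preferred by the labeler.
    """
    # combinations() preserves order, so each pair is (preferred, less preferred).
    pairs = list(combinations(rewards_ranked, 2))
    if not pairs:
        raise ValueError("need at least two ranked responses")
    return -sum(log_sigmoid(r_w - r_l) for r_w, r_l in pairs) / len(pairs)


# Made-up scores for K = 4 ranked responses: 6 comparisons in one batch element.
print(f"loss: {rm_loss_for_prompt([2.0, 1.0, 0.5, -1.0]):.4f}")
```

The loss shrinks when preferred responses receive higher scores than less-preferred ones, and it equals log 2 when the model scores every response identically.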
Methods and experimental details
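Evaluations on API distribution
• The main metric is human
preference ratings on a held-out
set of prompts from the
same source as the training
distribution
• Prompts are also drawn from those
submitted to GPT-3 models on the API,
not only from those submitted to
InstructGPT models, so that the
comparison is not biased toward
InstructGPT
Evaluations on public
NLP datasets
• They evaluate on two types of
public datasets: those that capture
language model safety (truthfulness,
toxicity, and bias) and those that
capture zero-shot performance on
traditional NLP tasks
• They also compare against models
fine-tuned on the FLAN and T0
datasets, which both consist of a
variety of NLP tasks
• They also conduct human
evaluations of toxicity on the
RealToxicityPrompts dataset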
Methods and experimental details
API distribution
● Labelers significantly prefer InstructGPT outputs over outputs from
GPT-3
● InstructGPT models generalize to the preferences of “held-out” labelers
Public NLP datasets
● Showing improvements in truthfulness over GPT-3
● Showing small improvements in toxicity over GPT-3, but not bias
● Minimizing performance regressions on public NLP datasets by modifying
their RLHF fine-tuning procedure
Qualitative results
● They notice that InstructGPT often produces an output in English
even when the instruction is in another language.
● In comparison, they find that GPT-3 can perform tasks such as
following non-English instructions and answering questions about code,
but requires more careful prompting and rarely follows
instructions in these domains.
Preference results
Left: results on prompts submitted to
GPT models on the API.
Right: results on prompts submitted to
InstructGPT models on the API.
Top: results from held-out labelers.
Bottom: results from training labelers.
Metadata results on the API
Compared to GPT-3, the PPO
models are more appropriate
in the context of a customer
assistant, are better at
following explicit
constraints in the
instruction and attempting
the correct instruction, and
are less likely to make up
information on closed-domain
tasks.
Comparing with FLAN and T0
in terms of Likert scores
On a 1-7 scale, on the
InstructGPT prompt distribution.
FLAN and T0 perform better than
default GPT-3, and comparably
with a few-shot GPT-3 model
placed into ‘instruction-
following’ mode.
Results on the TruthfulQA dataset
Gray bars indicate ratings of
truthfulness.
Colored bars indicate ratings of
truthfulness and informativeness.
Results comparisons on
mainstream NLP tasks
They compare the performance of their
InstructGPT models to the original GPT-3 model
on several mainstream NLP tasks, including
sentiment analysis, question answering, text
classification, and other natural language tasks
(see Table 1).
They found that the InstructGPT models performed
similarly to or slightly worse than the original
GPT-3 model on these tasks, but with
improvements in truthfulness and reductions in
toxic output generation.
Implications for alignment research
1. The cost of increasing model alignment is modest relative
to pretraining
○ Training the 175B PPO-ptx model requires 60 petaflop/s-days
○ Pretraining GPT-3 requires 3,640 petaflop/s-days
2. Evidence that InstructGPT generalizes “following
instructions” to settings without supervision
3. Mitigate most of the performance degradations introduced
by fine-tuning
4. Validated alignment techniques with real-world research
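To make the first point concrete, the short calculation below compares the fine-tuning compute with the pretraining compute, using the paper's reported figures of 60 petaflop/s-days for the 175B PPO-ptx model and 3,640 petaflop/s-days for GPT-3 pretraining. The variable and function names are illustrative.

```python
# Compute figures in petaflop/s-days, as reported in the paper.
PPO_PTX_175B_PFS_DAYS = 60.0
GPT3_PRETRAIN_PFS_DAYS = 3640.0


def relative_cost(finetune, pretrain):
    """Fine-tuning compute as a fraction of pretraining compute."""
    return finetune / pretrain


ratio = relative_cost(PPO_PTX_175B_PFS_DAYS, GPT3_PRETRAIN_PFS_DAYS)
print(f"RLHF fine-tuning uses about {ratio:.1%} of the pretraining compute")
```

Under these figures, the alignment step costs less than 2% of what pretraining cost.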
InstructGPT models' behavior
is influenced by human
feedback from contractors
Labeling tasks may be impacted by
contractors' beliefs, cultural
backgrounds, and personal history
Their team of contractors is not
representative of the full
spectrum of people who will use
their models
Labelers are primarily English-
speaking and their data
consists almost entirely of
English instructions
Their models are neither fully aligned
nor fully safe: they can generate
toxic or biased outputs, make up
facts, and generate sexual and
violent content without explicit
prompting. Examples of these
mistakes appear on the next slide.
Their models follow the user's
instruction, even if it could lead
to harm in the real world
Model mistakes
● Confused by instructions
that assume
false premises
● Overly hedge,
rather than directly
answering simple
questions
Discussion: Open Questions
● Several methods could be tried
to further decrease the models’
propensity to generate toxic,
biased, or otherwise harmful
outputs.
● Training models to be harmless despite
user instructions is important, but
difficult since whether an output is
harmful depends on the context in
which it’s deployed.
● There is a vast space of options for
designing interfaces for labelers to
provide feedback to language models;
this is an interesting
human-computer interaction problem.
● How to design an alignment
process that is transparent
Additional details
Additional prompt data details
Additional human data collection details
Additional model details
Automatic evaluation details
Additional results
Model samples
Data diversity
Additional prompt
data details
A subset of their labeled prompt metadata.
Note that their annotation fields changed
over the course of the project, so not
every prompt was annotated for every field.
Web interface
Additional human
data collection
For each output, labelers give a
Likert score for overall quality on a
1-7 scale and also provide various
metadata labels.
Web interface
Additional human
data collection
Labelers rank all the outputs for a
given prompt. Ties are encouraged in
cases where two outputs seem to be of
similar quality.
Labeler demographic data
Additional human
data collection
Additional results
Performance on public NLP datasets
performance of their
models on various
public NLP datasets
Performance on public NLP datasets
Additional results
performance of
their models on
various public NLP
datasets
Future work
• Developing better ways to detect and remove
biased or harmful content
• Incorporating ethical considerations into
the design of their models
• Implementing safeguards to prevent the
generation of harmful outputs
• The paper concludes that fine-tuning language models with
human feedback is a promising direction for aligning these
models with user intent.
• The authors demonstrate this approach using the GPT-3
language model and show that their method, called
InstructGPT, can improve the truthfulness and reduce the
toxicity of model outputs while maintaining performance on
public NLP datasets.
• They also found that the 1.3B parameter InstructGPT model
is preferred to the 175B GPT-3 model in human evaluations,
despite having fewer parameters.
• The authors suggest that their method could be applied to a
wide range of NLP tasks and could help address concerns
about the ethical implications of large language models.

Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany

Training language models to follow instructions with human feedback (Instruct GPT).pptx

  • 1. Training language models to follow instructions with human feedback (InstructGPT) Long Ouyang, Jeff Wu, Xu Jiang et al. (OpenAI) Rama Irsheidat
  • 2. TABLE OF CONTENTS 01 Introduction 02 Objectives 03 Methodology 04 Future works 05 Conclusion
  • 4. Model Owner (Team of authors) Long Ouyang*: Research Scientist at OpenAI. Jeffrey Wu*: Research engineer on OpenAI's safety team. OpenAI team*: AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. * All pictures and information about the authors and the company are from LinkedIn
  • 5. ● Large language models (LMs) can perform a range of natural language processing (NLP) tasks when prompted with examples. ● However, these models often exhibit unintended behaviors, such as generating biased or toxic text, making up facts, or not following user instructions. ● They aim to align LMs by training them to act in accordance with the user's intention, both explicit and implicit, and they evaluate whether the models are helpful, honest, and harmless. ● Their approach fine-tunes language models to follow a broad class of written instructions using reinforcement learning from human feedback (RLHF).
  • 6. Pre-training task: The work focuses on fine-tuning the pre-trained GPT-3 language model with human feedback. GPT-3 is pre-trained on a large corpus of text data using a language modeling objective, which involves predicting the next token in a sequence of text.
  • 7. Three steps for building InstructGPT
  • 8. ● As a first step, they hired a team of 40 contractors based on their screening test results. ● Then, they trained their supervised learning baselines based on human-written demonstrations of the desired output behavior using (mostly English) prompts submitted to the OpenAI API and some labeler-written prompts. (Source of training data) Supervised fine-tuning model
  • 9. Three steps for building InstructGPT
  • 10. ● Second, they gather a dataset of human-labeled comparisons between outputs from OpenAI's models on a larger set of API prompts. ● Then, they train a reward model (RM) on this dataset to predict which model output their labelers would prefer. Reward model (RM) training
  • 11. Three steps for building InstructGPT
  • 12. ● Finally, they use this RM as a reward function and fine-tune the supervised learning baseline to maximize this reward using the PPO algorithm. Optimizing a policy against the reward model using Reinforcement learning (RL)
  • 13. ● They primarily evaluate their models by having their labelers rate the quality of model outputs on the test set, consisting of prompts from held-out customers who were not included in the training data. ● Additionally, they perform automatic evaluations using various public NLP datasets. Evaluation
  • 14. ● SFT: The supervised fine-tuning model. ● PPO: Proximal policy optimization, the reinforcement learning algorithm used in the paper to fine-tune the language model against the reward model learned from human feedback. ● PPO-ptx: A variant of PPO that mixes pretraining gradients into the PPO updates; it is used to train the InstructGPT models. ● GPT: Generative Pre-trained Transformer; here, the GPT-3 model without further fine-tuning. ● GPT (prompted): GPT-3 given a few-shot prompt designed to make it better at following instructions. ● GPT and GPT (prompted) are the GPT-3 baselines. They found that outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3 model, despite having over 100x fewer parameters. Human evaluations on OpenAI API prompt distribution
  • 15. They train and compare models at three sizes (1.3B, 6B, and 175B parameters), all using the GPT-3 architecture. They also compare the performance of their InstructGPT models to the original GPT-3 models, as shown in the previous slide. Variants of the model in terms of size
  • 16. InstructGPT (GPT-3) model architecture: InstructGPT uses the GPT-3 architecture, so it is a decoder-only Transformer.
  • 17. Comparing the performance of GPT-3 and InstructGPT models on various tasks. InstructGPT: ● InstructGPT models outperform GPT-3 at generating appropriate outputs, following explicit constraints in the instructions, and giving more truthful and informative answers. ● InstructGPT models generate information not present in the input about half as often as GPT-3 models. ● InstructGPT models show small improvements in toxicity over GPT-3, but not in bias. ● Performance regressions on public NLP datasets can be minimized. ● InstructGPT models generalize to the preferences of “held-out” labelers. ● InstructGPT models show promising generalization to instructions outside of the RLHF fine-tuning distribution. Vs GPT-3: ● GPT-3 outputs are less preferred even when GPT-3 is given a few-shot prompt to make it better at following instructions. ● GPT-3 serves as the baseline against which toxicity, bias, and performance regressions on public NLP datasets are measured. ● GPT-3 models can perform many tasks but require more careful prompting and do not usually follow instructions.
  • 19. Objective 1: Creating a language model, called InstructGPT, that can follow a broad class of written instructions helpfully and safely while avoiding untruthful, toxic, or otherwise harmful outputs. Objective 2: Showing that using human feedback to fine-tune language models is a promising approach for aligning language models with human intent. Objective 3: Showing that outputs from the 1.3 billion parameter InstructGPT model are preferred to outputs from the 175 billion parameter GPT-3 model.
  • 22. ● Research on alignment and learning from human feedback. ● Training language models to follow instructions. ● Evaluating the harms of language models. ● Modifying the behavior of language models to mitigate harms. Related work
  • 23. High-level methodology 1. Collect demonstration data and train a supervised policy 2. Collect comparison data and train a reward model 3. Optimize a policy against the reward model using PPO. Methods and experimental details
  • 24. Dataset To train the first InstructGPT models, labelers wrote prompts themselves, because an initial source of instruction-like prompts was needed to bootstrap the process. Three kinds of prompts are used: Plain: an arbitrary task. Few-shot: an instruction with multiple query/response pairs. User-based: prompts corresponding to use cases stated in waitlist applications to the OpenAI API. Methods and experimental details
  • 25. Data cleaning They heuristically de-duplicate prompts by checking for prompts that share a long common prefix, and they limit the number of prompts to 200 per user ID. They also create their train, validation, and test splits based on user ID, so that the validation and test sets contain no data from users whose data is in the training set. Methods and experimental details
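The cleaning steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names, the 50-character prefix length, and the split fractions are all assumptions, and exact matching on a fixed-length prefix is only a stand-in for "share a long common prefix". The one figure taken from the slide is the cap of 200 prompts per user ID.

```python
import hashlib
from collections import defaultdict


def dedupe_by_prefix(prompts, prefix_len=50):
    """Keep the first prompt seen for each leading `prefix_len` characters.

    `prompts` is a list of (user_id, text) pairs. The prefix length is an
    illustrative assumption; the slide only says "a long common prefix".
    """
    seen = set()
    kept = []
    for user_id, text in prompts:
        key = text[:prefix_len]
        if key not in seen:
            seen.add(key)
            kept.append((user_id, text))
    return kept


def cap_per_user(prompts, max_per_user=200):
    """Keep at most `max_per_user` prompts for each user ID."""
    counts = defaultdict(int)
    kept = []
    for user_id, text in prompts:
        if counts[user_id] < max_per_user:
            counts[user_id] += 1
            kept.append((user_id, text))
    return kept


def split_for_user(user_id, valid_frac=0.1, test_frac=0.1):
    """Assign a whole user to one split via a stable hash of the user ID."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + valid_frac:
        return "valid"
    return "train"


def clean_and_split(prompts):
    """De-duplicate, cap per user, then split so no user spans two splits."""
    splits = {"train": [], "valid": [], "test": []}
    for user_id, text in cap_per_user(dedupe_by_prefix(prompts)):
        splits[split_for_user(user_id)].append((user_id, text))
    return splits
```

Because the split is a function of the user ID alone, every prompt from a given user lands in the same split, which is the property the slide describes.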
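  • 26. From the prompts used, they produced three different datasets used in their fine-tuning procedure: (1) the SFT dataset, with labeler demonstrations used to train the SFT models; (2) the RM dataset, with labeler rankings of model outputs used to train the reward models; (3) the PPO dataset, without any human labels, used as inputs for RLHF fine-tuning. Methods and experimental details
  • 27. Tasks Their training tasks are from two sources: 1. A dataset of prompts written by their labelers 2. A dataset of prompts submitted to early InstructGPT models on their API. These prompts are very diverse and include generation, question answering, dialog, summarization, extraction, and other natural language tasks. Methods and experimental details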
  • 28. Human data collection The aim was to select a group of labelers who were sensitive to the preferences of different demographic groups and able to identify potentially harmful outputs through screening tests. During training and evaluation, their alignment criteria may come into conflict: ● Training: prioritize helpfulness to the user ● Evaluating: prioritize truthfulness and harmlessness Methods and experimental details
  • 29. Human data collection To test whether the model generalizes to other labelers, a separate set of labelers who do not produce any training data is hired (held-out labelers). Despite the complexity of the task, they find that inter-annotator agreement rates are quite high: ● Training labelers: agree with each other 72.6 ± 1.5% of the time ● Held-out labelers: the number is 77.3 ± 1.3% Methods and experimental details
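One simple way to read an inter-annotator agreement rate is as the fraction of labeler pairs that gave the same judgment on the same item. The sketch below computes that quantity; it is an illustrative definition, and the paper's exact procedure may differ. The function name and the input layout are hypothetical.

```python
from itertools import combinations


def pairwise_agreement(labels_by_item):
    """Fraction of labeler pairs that agree, pooled over all items.

    `labels_by_item` is a list of lists: one inner list per item, holding
    the label each labeler gave to that item (for example, which of two
    outputs they preferred).
    """
    agree = 0
    total = 0
    for labels in labels_by_item:
        for a, b in combinations(labels, 2):
            total += 1
            if a == b:
                agree += 1
    if total == 0:
        raise ValueError("need at least one item with two or more labels")
    return agree / total
```

For example, `pairwise_agreement([["A", "A", "B"], ["B", "B"]])` pools four labeler pairs, two of which agree.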
  • 30. Models. Supervised fine-tuning (SFT): • Fine-tune GPT-3 on labeler demonstrations using supervised learning • Train for 16 epochs using cosine learning rate decay and residual dropout of 0.2 • SFT models overfit on validation loss after 1 epoch. Reward modeling (RM): • Starting with the SFT model with the final unembedding layer removed, train a model to take in a prompt and response and output a scalar reward • Present labelers with between K = 4 and K = 9 responses to rank • Train on all (K choose 2) comparisons from each prompt as a single batch element. Reinforcement learning (RL): • Fine-tune the SFT model with PPO • Given a prompt and response, the environment produces a reward determined by the reward model and ends the episode • Add a per-token KL penalty from the SFT model at each token to mitigate over-optimization of the reward model. Methods and experimental details
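Two mechanics from this slide are easy to make concrete: turning one ranking of K responses into K-choose-2 pairwise comparisons, and shaping the RL reward with a per-token KL penalty. The sketch below is illustrative only. The function names are hypothetical, ties in the ranking are ignored, the `beta` default is a placeholder, and adding the reward-model score on the final token is a common implementation choice assumed here rather than something the slide states.

```python
from itertools import combinations


def comparisons_from_ranking(ranked_responses):
    """Expand a ranking (best first) into (preferred, rejected) pairs.

    A ranking of K responses yields K * (K - 1) / 2 pairs. Ties are not
    handled in this sketch.
    """
    return list(combinations(ranked_responses, 2))


def kl_shaped_rewards(rm_score, policy_logprobs, sft_logprobs, beta=0.02):
    """Per-token rewards with a KL penalty against the SFT model.

    Each token receives -beta * (log pi_RL - log pi_SFT); the scalar
    reward-model score is added on the last token, where the episode ends.
    """
    if not policy_logprobs or len(policy_logprobs) != len(sft_logprobs):
        raise ValueError("log-prob lists must be non-empty and equal length")
    rewards = [-beta * (p - s) for p, s in zip(policy_logprobs, sft_logprobs)]
    rewards[-1] += rm_score
    return rewards
```

With K = 4 ranked responses the first function returns 6 pairs, and with K = 9 it returns 36. Keeping all pairs from one prompt in the same batch element is the grouping the slide refers to.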
  • 31. Loss function for the reward model. In the loss, r_θ(x, y) is the scalar output of the reward model with parameters θ for prompt x and completion y, y_w is the preferred completion out of the pair y_w and y_l, and D is the dataset of human comparisons. Methods and experimental details
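The equation itself did not survive in the transcript. Using the symbols defined on this slide, the pairwise loss reported in the InstructGPT paper has the following form, where σ is the logistic sigmoid and K is the number of responses ranked per prompt (both taken from the paper rather than from this slide):

```latex
\operatorname{loss}(\theta) =
  -\frac{1}{\binom{K}{2}}\,
  \mathbb{E}_{(x,\, y_w,\, y_l) \sim D}
  \left[ \log \sigma\!\left( r_\theta(x, y_w) - r_\theta(x, y_l) \right) \right]
```

The loss is small when the reward model scores the preferred completion y_w well above the rejected completion y_l, and large when the order is reversed.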
  • 32. Evaluation. Evaluations on API distribution: • The main metric is human preference ratings on a held-out set of prompts from the same source as the training distribution • Because prompts written for InstructGPT models could disadvantage the GPT-3 baselines, they also evaluate on prompts submitted to GPT-3 models on the API. Evaluations on public NLP datasets: • They evaluate on two types of public datasets: those that capture an aspect of language model safety (truthfulness, toxicity, and bias) and those that capture zero-shot performance on traditional NLP tasks • They also compare against models fine-tuned on the FLAN and T0 datasets, both of which consist of a variety of NLP tasks • They also conduct human evaluations of toxicity on the RealToxicityPrompts dataset. Methods and experimental details
  • 33. API distribution ● Labelers significantly prefer InstructGPT outputs over outputs from GPT-3 ● InstructGPT generalizes to the preferences of “held-out” labelers Public NLP datasets ● Showing improvements in truthfulness over GPT-3 ● Showing small improvements in toxicity over GPT-3, but not bias ● Minimizing performance regressions on public NLP datasets by modifying their RLHF fine-tuning procedure Results
  • 34. Qualitative results ● InstructGPT shows some ability to follow instructions outside its fine-tuning distribution, such as instructions in non-English languages and questions about code. ● They notice that it often produces an output in English even when the instruction is in another language. ● In comparison, they find that GPT-3 can perform these tasks but requires more careful prompting, and rarely follows instructions in these domains. Results
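The modification to the RLHF procedure mentioned on this slide is the PPO-ptx variant, which mixes pretraining gradients into the PPO updates. As a sketch of how the pieces fit together, the combined objective reported in the paper has the form below. The symbols are the paper's rather than this deck's: π_φ^RL is the learned policy, π^SFT is the supervised fine-tuned model, D_pretrain is the pretraining distribution, β is the KL coefficient, and γ is the pretraining-loss coefficient (γ = 0 gives plain PPO).

```latex
\operatorname{objective}(\phi) =
  \mathbb{E}_{(x,\, y) \sim D_{\pi_\phi^{\mathrm{RL}}}}
  \left[ r_\theta(x, y)
    - \beta \log \frac{\pi_\phi^{\mathrm{RL}}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)} \right]
  + \gamma\, \mathbb{E}_{x \sim D_{\mathrm{pretrain}}}
  \left[ \log \pi_\phi^{\mathrm{RL}}(x) \right]
```

The first term rewards outputs the reward model scores highly while penalizing drift from the SFT model. The second term keeps the policy close to the pretraining distribution, which is what limits regressions on public NLP datasets.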
  • 35. Preference results Results Left: results on prompts submitted to GPT models on the API. Right: results on prompts submitted to InstructGPT models on the API. Top: results from held-out labelers. Bottom: results from training labelers.
  • 36. Metadata results on the API distribution Results Compared to GPT-3, the PPO models are more appropriate in the context of a customer assistant, are better at following explicit constraints in the instruction and attempting the correct instruction, and are less likely to make up information on closed domain tasks.
  • 37. Comparing with FLAN and T0 in terms of Likert scores Results On a 1-7 scale, on the InstructGPT prompt distribution. FLAN and T0 perform better than default GPT-3, and comparably with a few-shot GPT-3 model placed into ‘instruction-following’ mode.
  • 38. Results on the TruthfulQA dataset Results Gray bars indicate ratings of truthfulness. Colored bars indicate ratings of truthfulness and informativeness.
  • 39. Results comparisons on mainstream NLP tasks Results They compare the performance of their InstructGPT models to the original GPT-3 model on several mainstream NLP tasks, including sentiment analysis, question answering, text classification, and other natural language tasks (see Table 1). They found that the InstructGPT models performed similarly to or slightly worse than the original GPT-3 model on these tasks, but with improvements in truthfulness and reductions in toxic output generation.
  • 40. Implications for alignment research 1. The cost of increasing model alignment is modest relative to pretraining ○ Training the 175B PPO-ptx model requires 60 petaflop/s-days ○ Pretraining GPT-3 requires 3,640 petaflop/s-days 2. Evidence that InstructGPT generalizes “following instructions” to settings where it was not supervised 3. Most of the performance degradations introduced by fine-tuning can be mitigated 4. Alignment techniques from research are validated in the real world Discussion
  • 41. 01 InstructGPT models' behavior is influenced by human feedback from contractors 02 Labeling tasks may be impacted by contractors' beliefs, cultural backgrounds, and personal history 03 Their team of contractors is not representative of the full spectrum of people who will use their models 04 Labelers are primarily English-speaking and their data consists almost entirely of English instructions Limitations Discussion
  • 42. 05 Their models are not fully aligned or fully safe: they can generate toxic or biased outputs, make up facts, and generate sexual and violent content without explicit prompting. Examples of these mistakes appear on the next slide 06 Their models follow the user's instruction, even if it could lead to harm in the real world Limitations Discussion
  • 43. Model mistakes: Confused by instructions that assume false premises. Overly hedging, rather than directly answering simple questions
  • 44. Open Questions OQ 01 Several methods could be tried to further decrease the models’ propensity to generate toxic, biased, or otherwise harmful outputs. OQ 02 Training models to be harmless despite user instructions is important, but difficult since whether an output is harmful depends on the context in which it’s deployed. OQ 03 There is a vast space of options for designing interfaces for labelers to provide feedback to language models; this is an interesting human-computer interaction problem. OQ 04 How to design an alignment process that is transparent. Discussion
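To make "modest relative to pretraining" concrete, here is the ratio of the two compute figures as reported in the paper (60 petaflop/s-days for the 175B PPO-ptx model and 3,640 petaflop/s-days for GPT-3 pretraining):

```latex
\frac{60\ \text{petaflop/s-days}}{3640\ \text{petaflop/s-days}} \approx 0.0165 \approx 1.6\%
```

Under those figures, the RLHF fine-tuning of the largest model costs under 2% of the compute spent on pretraining.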
  • 45. Additional details Additional prompt data details Additional human data collection details Additional model details Automatic evaluation details Additional results Model samples
  • 46. Data diversity Additional prompt data details A subset of their labeled prompt metadata. Note that their annotation fields changed over the course of the project, so not every prompt was annotated for every field.
  • 47. Web interface Additional human data collection details For each output, labelers give a Likert score for overall quality on a 1-7 scale and also provide various metadata labels.
  • 48. Web interface Additional human data collection details Labelers rank all the outputs for a given prompt. Ties are encouraged in cases where two outputs seem to be of similar quality.
  • 49. Labeler demographic data Additional human data collection details
  • 50. Additional results Performance on public NLP datasets Zero-shot performance of their models on various public NLP datasets
  • 51. Performance on public NLP datasets Additional results Few-shot performance of their models on various public NLP datasets
  • 53. • Developing better ways to detect and remove biased or harmful content • Incorporating ethical considerations into the design of their models • Implementing safeguards to prevent the generation of harmful outputs
  • 55. • The paper concludes that fine-tuning language models with human feedback is a promising direction for aligning these models with user intent. • The authors demonstrate this approach using the GPT-3 language model and show that their method, called InstructGPT, can improve the truthfulness and reduce the toxicity of model outputs while maintaining performance on public NLP datasets. • They also found that the 1.3B parameter InstructGPT model is preferred to the 175B GPT-3 model in human evaluations, despite having fewer parameters. • The authors suggest that their method could be applied to a wide range of NLP tasks and could help address concerns about the ethical implications of large language models.