InstructGPT
Training language models to follow instructions
with human feedback
Long Ouyang, Jeff Wu, Xu Jiang et al.
National Yang Ming Chiao Tung University, Hsinchu
Speaker: Po-Chuan Chen
January 10, 2023
Table of contents
1 Abstract
2 Introduction
3 Methods and experimental details
   Dataset
   Tasks and Human data collection
   Models
   Evaluation
4 Results
5 Discussion
6 Additional details
Abstract
This paper shows a method for aligning language models with user intent
on a wide range of tasks by fine-tuning with human feedback.
Starting from labeler-written prompts and prompts submitted through
the OpenAI API, they collect a dataset of labeler demonstrations of the
desired model behavior and use it to fine-tune GPT-3 with supervised
learning.
They then collect a dataset of rankings of model outputs, which they
use to further fine-tune this supervised model with reinforcement
learning from human feedback; the resulting model is called InstructGPT.
Contribution
InstructGPT shows that fine-tuning with human feedback is a
promising direction for aligning language models with human intent.
The model shows improvements in truthfulness and reductions in
toxic output generation while having minimal performance
regressions on public NLP datasets.
Also, outputs from the 1.3B-parameter InstructGPT model are preferred
to outputs from the 175B-parameter GPT-3, despite InstructGPT having
over 100x fewer parameters.
Introduction
Large language models (LMs) can be prompted to perform a range of
natural language processing (NLP) tasks, given some examples of the
task as input.
This work makes progress on aligning language models by training them
to act in accordance with the user’s intention, evaluating the models
on being helpful, honest, and harmless.
Concretely, it fine-tunes language models to follow a broad class of
written instructions using reinforcement learning from human feedback
(RLHF).
Framework about GPT-3
Three steps for building InstructGPT
Training the supervised fine-tuning model
First, they hire a team of 40 contractors to label data, selected based
on their performance on a screening test.
They then collect a dataset of human-written demonstrations of the
desired output behavior on (mostly English) prompts submitted to
the OpenAI API, along with some labeler-written prompts, and use this
to train their supervised learning baselines.
Training the reward model
Second, they collect a dataset of human-labeled comparisons
between outputs from OpenAI’s models on a larger set of API
prompts.
They then train a reward model (RM) on this dataset to predict which
model output their labelers would prefer.
Optimizing a policy against the reward model with RL
Finally, they use this RM as a reward function and fine-tune the
supervised learning baseline to maximize this reward using the PPO
algorithm.
They mainly evaluate their models by having their labelers rate the
quality of model outputs on a test set consisting of prompts from
held-out customers (who are not represented in the training data).
They also conduct automatic evaluations on a range of public NLP
datasets.
Human evaluations on OpenAI API prompt distribution
Main findings
Performance regressions on public NLP datasets are minimized by
mixing PPO updates with updates that increase the log likelihood of
the pretraining distribution (PPO-ptx)
InstructGPT generalizes to the preferences of “held-out” labelers, and
shows promising generalization to instructions outside of the RLHF
fine-tuning distribution
Methods and experimental details
Dataset
To train the very first InstructGPT models, labelers needed to write
prompts themselves, since the process required an initial source of
instruction-like prompts to bootstrap it, and such prompts were rarely
submitted to the regular GPT-3 models. Labelers wrote three kinds of
prompts:
Plain: come up with an arbitrary task
Few-shot: come up with an instruction, and multiple
query/response pairs for that instruction
User-based: come up with prompts corresponding to a number
of use cases stated in applications to the OpenAI API
Use-case categories
Tasks and Human data collection
The aim was to select, via a screening test, a group of labelers who
were sensitive to the preferences of different demographic groups, and
who were good at identifying outputs that were potentially harmful.
During training and evaluation, their alignment criteria may come into
conflict:
Training: prioritize helpfulness to the user
Evaluation: prioritize truthfulness and harmlessness
As an initial study of how well the model generalizes to the
preferences of other labelers, they hire a separate set of labelers who
do not produce any of the training data.
Despite the complexity of the task, they find that inter-annotator
agreement rates are quite high:
Training labelers: agree with each other 72.6 ± 1.5% of the time
Held-out labelers: the number is 77.3 ± 1.3%
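As a concrete reading of those numbers, pairwise agreement can be computed as the fraction of (labeler pair, comparison) cases where both labelers pick the same preferred output. A hypothetical sketch; the paper’s exact protocol (e.g., which comparisons are shared across labelers) may differ:

```python
from itertools import combinations

def pairwise_agreement(choices_by_labeler):
    """Fraction of (labeler pair, item) cases with the same choice.

    choices_by_labeler[i][j] is which output labeler i preferred on
    item j. Illustrative only; the paper's protocol may differ.
    """
    agree, total = 0, 0
    for a, b in combinations(choices_by_labeler, 2):
        for x, y in zip(a, b):
            agree += int(x == y)
            total += 1
    return agree / total
```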
Models
They start from the GPT-3 pretrained models, which are trained on a
broad distribution of Internet data and are adaptable to a wide range
of downstream tasks, but have poorly characterized behavior. From
these, they train models with three techniques:
1 Supervised fine-tuning (SFT)
2 Reward modeling (RM)
3 Reinforcement learning (RL)
Supervised fine-tuning (SFT)
They fine-tune GPT-3 on their labeler demonstrations using
supervised learning, and select the final SFT model based on the RM
score on the validation set. Although their SFT models overfit on
validation loss after 1 epoch, training for more epochs still helps both
the RM score and human preference ratings.
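A minimal sketch of one SFT update, assuming a Hugging Face-style causal LM whose forward pass computes the token-level cross-entropy loss; the prompt-masking convention with -100 labels is a common assumption, not a detail stated on the slide:

```python
def sft_step(model, optimizer, batch):
    """One supervised fine-tuning step on (prompt + demonstration).

    batch["input_ids"]: prompt tokens concatenated with the labeler's
    demonstration tokens.
    batch["labels"]: the same tokens, with prompt positions set to -100
    so the cross-entropy loss covers only the demonstration (an
    assumed, common convention).
    """
    out = model(input_ids=batch["input_ids"], labels=batch["labels"])
    out.loss.backward()       # token-level cross-entropy from the model
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```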
Reward modeling (RM)
Starting from the SFT model with the final unembedding layer
removed, they train a model that takes in a prompt and response and
outputs a scalar reward.
They use only 6B RMs, as this saves a lot of compute, and because
larger RM training could be unstable, making it less suitable for use
as the value function during RL.
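A PyTorch sketch of that architecture, with the unembedding (LM head) dropped and a scalar head added; the layer names and the choice of the final token’s hidden state are illustrative assumptions:

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """SFT transformer body, unembedding removed, plus a scalar head.

    A sketch: `transformer_body` is assumed to map token ids to hidden
    states of shape (batch, seq_len, hidden_size).
    """

    def __init__(self, transformer_body: nn.Module, hidden_size: int):
        super().__init__()
        self.body = transformer_body          # SFT model minus LM head
        self.reward_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.body(input_ids)         # (batch, seq_len, hidden)
        # Score the whole (prompt, response) pair from the last token.
        return self.reward_head(hidden[:, -1, :]).squeeze(-1)
```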
Loss function for the reward model
In order to speed up comparison collection, they present labelers with
anywhere between K = 4 and K = 9 responses to rank.
$$\operatorname{loss}(\theta) = -\frac{1}{\binom{K}{2}}\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim D}\bigl[\log\bigl(\sigma\bigl(r_\theta(x, y_w) - r_\theta(x, y_l)\bigr)\bigr)\bigr]$$
where $r_\theta(x, y)$ is the scalar output of the reward model for prompt $x$
and completion $y$ with parameters $\theta$, $y_w$ is the preferred completion out
of the pair $(y_w, y_l)$, and $D$ is the dataset of human comparisons.
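A direct PyTorch transcription of this loss for one prompt’s K ranked responses, assuming rewards holds the scalar RM outputs $r_\theta(x, y)$; a sketch, not the paper’s code:

```python
import itertools
import torch
import torch.nn.functional as F

def reward_ranking_loss(rewards, ranking):
    """Pairwise ranking loss over the K ranked completions of one prompt.

    rewards: tensor of shape (K,), scalar RM outputs r_theta(x, y).
    ranking: completion indices ordered from most to least preferred.
    """
    pair_losses = []
    # Every (y_w, y_l) pair with y_w ranked above y_l: K(K-1)/2 pairs.
    for w, l in itertools.combinations(ranking, 2):
        # -log(sigmoid(r_w - r_l)) == softplus(r_l - r_w), numerically stable.
        pair_losses.append(F.softplus(rewards[l] - rewards[w]))
    # Averaging over the pairs supplies the 1 / C(K, 2) factor.
    return torch.stack(pair_losses).mean()
```

The paper trains on all $\binom{K}{2}$ comparisons from a prompt together as a single batch element, since treating them as independent examples causes the RM to overfit.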
Reinforcement learning (RL) with PPO
They fine-tune the SFT model in their environment using PPO; the
environment is a bandit environment that presents a random
customer prompt and expects a response to the prompt.
In addition, they add a per-token KL penalty from the SFT model at
each token to mitigate over-optimization of the reward model, and the
value function is initialized from the RM.
$$\operatorname{objective}(\phi) = \mathbb{E}_{(x,y)\sim D_{\pi_\phi^{\mathrm{RL}}}}\Bigl[r_\theta(x, y) - \beta \log\bigl(\pi_\phi^{\mathrm{RL}}(y \mid x)\,/\,\pi^{\mathrm{SFT}}(y \mid x)\bigr)\Bigr]$$
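A sketch of this KL-shaped reward for one sampled response, assuming per-token log-probabilities under the RL policy and the frozen SFT model are available; the $\beta$ value shown is illustrative:

```python
def kl_shaped_reward(rm_score, logp_rl, logp_sft, beta=0.02):
    """Reward handed to PPO for one (prompt, response) pair.

    rm_score: scalar r_theta(x, y) from the reward model.
    logp_rl, logp_sft: per-token log-probs of the response under the
    RL policy and the frozen SFT model, both of shape (T,).
    beta: KL reward coefficient (value here is illustrative).
    """
    # beta * log(pi_RL(y|x) / pi_SFT(y|x)), accumulated over tokens.
    kl_penalty = beta * (logp_rl - logp_sft).sum()
    return rm_score - kl_penalty
```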
PPO-ptx models
In order to fix the performance regressions on public NLP datasets,
they mix pretraining gradients into the PPO gradients.
$$\operatorname{objective}(\phi) = \mathbb{E}_{(x,y)\sim D_{\pi_\phi^{\mathrm{RL}}}}\Bigl[r_\theta(x, y) - \beta \log\bigl(\pi_\phi^{\mathrm{RL}}(y \mid x)\,/\,\pi^{\mathrm{SFT}}(y \mid x)\bigr)\Bigr] + \gamma\,\mathbb{E}_{x\sim D_{\mathrm{pretrain}}}\bigl[\log\bigl(\pi_\phi^{\mathrm{RL}}(x)\bigr)\bigr]$$
where $\pi_\phi^{\mathrm{RL}}$ is the learned RL policy, $\pi^{\mathrm{SFT}}$ is the supervised trained
model, and $D_{\mathrm{pretrain}}$ is the pretraining distribution; the coefficient $\beta$
controls the KL reward and $\gamma$ controls the pretraining loss.
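The pretraining term amounts to adding a weighted language-modeling objective on pretraining data to the PPO update. A sketch in loss form (the maximized log-likelihood term enters the minimized loss with a minus sign; the $\gamma$ value shown is illustrative):

```python
def ppo_ptx_loss(ppo_loss, pretrain_logp, gamma=27.8):
    """Combined loss for one PPO-ptx update (a sketch).

    ppo_loss: the usual (minimized) PPO loss on API prompts.
    pretrain_logp: per-token log-probs log pi_RL(x) on a pretraining
    batch.
    gamma: pretraining loss coefficient (value here is illustrative).
    """
    # Maximizing gamma * E[log pi_RL(x)] equals minimizing its negative.
    return ppo_loss - gamma * pretrain_logp.mean()
```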
Evaluation
API distribution
The main metric is human preference ratings on a held-out set of
prompts from the same source as the training distribution (see the
sketch after this list).
Public NLP datasets
They evaluate on two types of public datasets, FLAN and T0, both of
which consist of a variety of NLP tasks. They also conduct human
evaluations of toxicity on the RealToxicityPrompts dataset.
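The headline API-distribution metric is a win rate from head-to-head human comparisons; a small sketch, where the tie-handling convention is an assumption rather than a detail from the slide:

```python
def win_rate(preferences):
    """Fraction of head-to-head comparisons won by the candidate model.

    preferences: one of "candidate", "baseline", or "tie" per
    comparison. Counting a tie as half a win is an assumed convention.
    """
    score = sum(1.0 if p == "candidate" else 0.5 if p == "tie" else 0.0
                for p in preferences)
    return score / len(preferences)
```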
Labeler-collected metadata on the API distribution
Results
API distribution
Labelers significantly prefer InstructGPT outputs over outputs
from GPT-3
InstructGPT generalizes to the preferences of “held-out” labelers
Public NLP datasets are not reflective of how their language
models are used
Public NLP datasets
InstructGPT shows improvements in truthfulness over GPT-3
InstructGPT shows small improvements in toxicity over GPT-3, but
not bias
Performance regressions on public NLP datasets are minimized by
modifying the RLHF fine-tuning procedure
Preference results
Metadata results on the API distribution
Comparing with FLAN and T0 in terms of Likert scores
Results on the TruthfulQA dataset
Qualitative results
They notice that InstructGPT often produces an output in English
even when the instruction is in another language.
In comparison, they find that GPT-3 can perform these tasks but
requires more careful prompting, and rarely follows instructions in
these domains.
Simple mistakes
Confused by instructions that assume false premises
Overly hedges, rather than directly answering simple questions
Discussion
Implications for alignment research:
1 Increasing investment in the alignment of existing language models
is more cost-effective than training larger models
2 More research is needed to study how well this generalization
scales with increased capabilities
The goal of this paper is to demonstrate that this alignment technique
can align models to a specific human reference group for a specific
application.
Future work
They need to design an alignment process that is transparent
They need to prevent the model from generating content that
would harm the real world
Filtering the pretraining data mix for toxic content
The model cannot represent the full spectrum of people who will
use it
Additional details
1 Additional prompt data details
2 Additional human data collection details
3 Additional model details
4 Automatic evaluation details
5 Additional results
1.1 Dataset sizes
1.2 Data diversity
2.1 Web interface
2.2 Labeler demographic data
5 Additional results (Zero-shot)
5 Additional results (Few-shot)