The document discusses methods for training language models to follow instructions with human feedback (RLHF, as used for InstructGPT). Human annotators first write demonstrations of the desired behavior for a set of prompts, and an initial model is fine-tuned on this demonstration data to create an SFT model. Labelers then rank the SFT model's responses to prompts, and a reward model is trained on these rankings. The reward model is used to update the SFT model via policy gradients, creating the PPO model. A KL penalty prevents the PPO model from deviating too far from the SFT model, and an auxiliary pretraining objective (the PPO-ptx variant) limits regressions on public NLP tasks.
Training language models to follow instructions with human feedback (InstructGPT)
Long Ouyang, Jeff Wu, Xu Jiang et al. (OpenAI)
Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.
3-6. The Three Hʼs of Model Desiderata
Askell et al. 2021, A General Language Assistant as a Laboratory for Alignment
Helpful:
● The AI should help the user solve their task (e.g. answer their questions)
Honest:
● The AI should give accurate information
● The AI should express uncertainty when the model doesnʼt know the answer, instead of hallucinating a wrong answer
Harmless:
● The AI should not cause physical, psychological, or social harm to people or the environment
7-8. The Misalignment of Models
Training objective: predict the next token
Desiderata: the three Hʼs
Misalignment: when the training objective does not capture the desiderata we want from models
10-11. Addressing Misalignment: Instruction Following
- The three Hʼs are one possible set of desiderata
- A more concrete desideratum is getting models to follow instructions
Train: next-token prediction -> Eval: follow instructions (e.g. summarize this)
12-17. Addressing Misalignment: FLAN (Decoder models)
Wei et al. 2022, Finetuned Language Models are Zero-Shot Learners
Train: next-token prediction -> Eval: follow instructions (e.g. answer this question)
Instruction Tuning: fine-tune models to follow instructions
1. Aggregate Datasets (62): collect a wide variety of public datasets
2. Instruction Templates: manually write 10 templates per dataset that capture the task
3. Fine-tune: use the instruction templates and datasets to fine-tune the model
4. Evaluate on held-out tasks
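To make the instruction-templating step concrete, here is a rough sketch of how a raw dataset example can be rendered through a manually written template before fine-tuning. The template strings and the NLI example below are invented for illustration; they are not FLAN's actual templates.

# Illustrative sketch of FLAN-style instruction templating (hypothetical templates,
# not the ones used in the paper). Each dataset gets several manually written
# templates; raw examples are rendered into natural-language (input, target) pairs.

templates_nli = [
    "Premise: {premise}\nHypothesis: {hypothesis}\nDoes the premise entail the hypothesis?\n{options}",
    "{premise}\nBased on the paragraph above, can we conclude that \"{hypothesis}\"?\n{options}",
]

OPTIONS = "OPTIONS:\n- yes\n- it is not possible to tell\n- no"

def render(example: dict, template: str) -> dict:
    """Turn a raw (premise, hypothesis, label) example into an (input, target) pair."""
    prompt = template.format(premise=example["premise"],
                             hypothesis=example["hypothesis"],
                             options=OPTIONS)
    return {"input": prompt, "target": example["label"]}

example = {"premise": "A dog is running in the park.",
           "hypothesis": "An animal is outside.",
           "label": "yes"}
print(render(example, templates_nli[0])["input"])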
18. Addressing Misalignment: T0 (Encoder-Decoder models)
Sanh et al. 2022, Multitask Prompted Training Enables Zero-Shot Task Generalization
Train: span prediction -> Eval: follow instructions (e.g. answer this question)
Basically the same idea as FLAN, except fine-tune an encoder-decoder model (T5)
19-22. Addressing Misalignment: LaMDA
Thoppilan et al. 2022, LaMDA: Language Models for Dialog Applications
Train: next-token prediction -> Eval: dialogue with human users
Solution: add a large amount of dialogue text to the pretraining data
- 2.97B documents
- 1.12B dialogues and 13.39B dialogue utterances
- 1.56T words total
24-25. Method: Human Annotators
Training set: 40 annotators from Upwork/ScaleAI
- Screened, onboarded, and selected for diversity; they annotate the training set
Evaluation set: different annotators from Upwork/ScaleAI
- Not screened, to better mirror real-world users; they annotate the evaluation set
28-33. Method: The SFT Model
A large collection of prompts:
- From the OpenAI GPT-3 Playground
- Filtered to remove PII
- Filtered to remove duplicates
- Limited to 200 examples per user
- Annotators are also tasked with writing prompts
Fine-tune on the labeler demonstrations for these prompts; call the result the SFT model
- Initialized with the pretrained GPT-3 175B model and trained for 16 epochs on the demonstration data
- In the notation below, this model is also referred to as the SFT policy (π^SFT)
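As a rough illustration of this supervised fine-tuning step, the sketch below computes a per-example next-token loss for a causal LM, assuming model(input_ids) returns logits of shape (batch, seq, vocab). Masking the prompt tokens out of the loss is an illustrative choice, not a detail stated in the slides.

import torch
import torch.nn.functional as F

# Minimal sketch of an SFT loss on one (prompt, demonstration) pair.
# Assumes `model(input_ids)` returns next-token logits of shape (batch, seq, vocab).

def sft_loss(model, prompt_ids, demo_ids):
    """prompt_ids, demo_ids: 1-D LongTensors of token ids."""
    input_ids = torch.cat([prompt_ids, demo_ids]).unsqueeze(0)   # (1, T)
    logits = model(input_ids)                                    # (1, T, vocab)
    # Standard next-token prediction: shift targets left by one position.
    targets = input_ids[:, 1:].clone()
    # Ignore loss on the prompt portion; train only on the demonstration tokens
    # (an assumption for illustration).
    targets[:, : prompt_ids.numel() - 1] = -100
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )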
36. Method
To increase data-collection throughput, each labeler is given K = 4 to 9 outputs to rank for each prompt
37-42. Method: The Reward Model (RM)
Notation:
- r_θ(x, y): the reward model we are trying to optimize
- x: the prompt
- y_w: the better completion
- y_l: the worse completion
The RM is trained with a pairwise ranking loss: the reward on the better completion should exceed the reward on the worse completion,
loss(θ) = −1/(K choose 2) · E_{(x, y_w, y_l) ∼ D} [ log σ( r_θ(x, y_w) − r_θ(x, y_l) ) ]
Small but important detail:
- Each prompt has K completions -> K choose 2 pairs to compare
- If for every batch we sample uniformly over all pairs (from any prompt):
  - Each completion can appear in K − 1 gradient updates
  - This can lead to overfitting
- Solution: sample the prompt, and then put all K choose 2 pairs from that prompt into the same batch
  - Corollary: this is also computationally more efficient, since it only requires K forward passes through r_θ per prompt
  - This is why there is the −1/(K choose 2) normalization in the loss
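The sketch below mirrors this loss and batching scheme in PyTorch. Here reward_model is a hypothetical callable that returns a scalar (0-dim) tensor for a (prompt, completion) pair; it stands in for the RM described above and is not the authors' code.

import itertools
import torch
import torch.nn.functional as F

def rm_loss_for_prompt(reward_model, prompt, completions, ranking):
    """completions: list of K strings; ranking: completion indices ordered best to worst."""
    # K forward passes: one scalar reward per completion.
    rewards = torch.stack([reward_model(prompt, c) for c in completions])
    losses = []
    # All K-choose-2 (better, worse) pairs from the labeler ranking go in one batch.
    for better, worse in itertools.combinations(ranking, 2):
        losses.append(-F.logsigmoid(rewards[better] - rewards[worse]))
    # Averaging gives the 1/(K choose 2) normalization, so prompts with more
    # completions do not dominate the gradient.
    return torch.stack(losses).mean()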
45-51. Method: PPO
Use the RM to update the SFT model from step 1; call this model PPO.
Two problems:
1. As the RLHF policy is updated, its outputs become very different from what the RM was trained on -> worse reward estimates
   Solution: add a KL penalty that makes sure the PPO modelʼs output does not deviate too far from SFT
2. Just using the RL objective leads to performance degradation on many public NLP tasks
   Solution: add an auxiliary LM objective on the pretraining data; call this variant PPO-ptx
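Putting the two fixes together, the objective maximized during the RL stage has the following form (a sketch in standard RLHF notation; β is the KL coefficient and γ weights the pretraining term, so γ = 0 recovers plain PPO and γ > 0 gives PPO-ptx):

objective(φ) = E_{(x, y) ∼ D_{π_φ^RL}} [ r_θ(x, y) − β · log( π_φ^RL(y | x) / π^SFT(y | x) ) ] + γ · E_{x ∼ D_pretrain} [ log π_φ^RL(x) ]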
54-57. Method: Model Summary
1. SFT: Supervised Fine-Tuning
a. GPT-3 fine-tuned on human demonstrations of prompt completions
2. RM: Reward Model
a. Not actually used to generate anything, but used to train the PPO and PPO-ptx models
3. PPO
a. SFT model further fine-tuned using RL, with the RM providing the reward signal
b. A KL loss is added to prevent the PPO model from deviating far from SFT
4. PPO-ptx
a. Identical to PPO, except with an additional auxiliary LM objective on the pretraining data
58-62. Pre-lecture Q1
Describe the three datasets the authors collected: SFT, RM, PPO. What is the format of these datasets, and how are they used in the pipeline?
1. SFT: a set of ~13k prompts (from labelers and the API) and their corresponding labeler-written completions. Used to train the SFT model.
2. RM: a set of ~33k training prompts (from labelers and the API), each with K corresponding SFT-model completions ranked by labelers. Used to train the RM.
3. PPO: a set of ~31k training prompts (from the API only), used as input to the PPO and PPO-ptx models for the policy-optimization step.
Note: none of these datasets are publicly available :(
63-67. Method: Why is RL with Human Feedback (RLHF) good?
The SFT approach also uses data to align with human desiderata, so why do RLHF?
1. Reward is a more nuanced training signal than the autoregressive loss
a. If the correct next token is “great”, the AR loss penalizes the prediction “amazing” the same as “sandwiches”. The RM assigns similar rewards to sequences of similar quality.
2. The RM “critiques” actual completions generated by the model itself, whereas SFT training does not use model generations, since it is completely offline.
a. This means the RM may provide more “tailored” feedback to the model
3. The RM more directly captures the notion of “preference”.
a. Preferences induce rankings, and rankings can be used to infer preferences
b. Ranking is very naturally captured by the reward signal: better sequences = higher reward
c. In SFT, preference is not explicitly captured, since we only train to regurgitate “the best” example
4. The RM is more data-efficient
a. There is a reason step 1 uses 13k prompts but step 3 can use 31k prompts.
b. For SFT, we need humans to generate the target. Once we train the RM, it can be used to score any output.
69-71. Original Goal: 3H
● Helpful: need to infer intention from the user (labelersʼ preference rating)
● Honest (truthfulness):
○ Hallucination (labelersʼ rating)
○ TruthfulQA dataset
● Harmless:
○ RealToxicityPrompts (toxicity)
○ Winogender & CrowS-Pairs (social bias)
72-74. Evaluation: Testing Distributions
● API distribution
○ Prompts submitted to the original GPT-3 model (generally not instruction following)
○ Prompts submitted to the InstructGPT model
● Public NLP tasks
○ SQuAD
○ DROP
○ HellaSwag
○ WMT 2015 French to English
76-80. Helpfulness: Preferences of the Labelers
● Baseline: 50-50 win rate against SFT
● GPT vs. Instruct distribution
● Labelers who provide training data vs. new labelers (preference overfitting)
● Researchers try to find prompts that can successfully instruct a vanilla GPT (they donʼt include examples in the paper)
● PPO models win across the board
81-85. Preferences of the Labelers: Breakdown
X-axis aggregated across model sizes
● Models trained with feedback data are less likely to hallucinate
● Interesting that SFT has lower hallucinations
86. Comparing w/ Fine-Tuned Models
● Public NLP datasets do not reflect how the API is used
○ Public datasets capture mostly things that are easy to automatically evaluate
○ The API is more often used for open-ended generation
(Results shown on the Instruct prompt distribution)
87. Truthfulness
● “Instruction+QA”: instruct the model to respond with “I have no comment” when it is not certain of the correct answer
● Models do not have to be specifically instructed to “tell the truth” to be more truthful
91. Toxicity: RealToxicityPrompts
● When instructed to be respectful, InstructGPT reduces toxicity more than GPT-3
● When instructed to be rude, InstructGPT amplifies toxicity more than GPT-3 (in paper)
93. Bias: Winogender & CrowS-Pairs
● Metric: entropy of the multiple-choice completion probabilities as the measure of bias
● Higher entropy -> less biased
(Results shown for Winogender and CrowS-Pairs)
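As a quick illustration of this entropy metric (an illustrative computation, not the paper's evaluation code): given the model's normalized probabilities over the multiple-choice completions, higher entropy means the model is less committed to one (potentially stereotyped) reading.

import math

def completion_entropy(probs):
    """Entropy (in bits) of a normalized probability distribution over the choices."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical Winogender-style example: probability of each coreference reading.
print(completion_entropy([0.5, 0.5]))    # 1.0 bit: no preference, "unbiased"
print(completion_entropy([0.95, 0.05]))  # ~0.29 bits: strong preference, more biased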
95-97. Pre-Lecture Q2
Summarize the evaluation results of InstructGPT vs. GPT-3 on toxicity and bias. Why do you think this is the case?
Answer:
● Toxicity: InstructGPT can reduce it.
● Bias: the authors say in the paper that they donʼt find clear patterns
○ A reasonable hypothesis might be that itʼs not easy to get this type of feedback
○ Social biases can be subtle and hard to detect
○ Labelers are not very directly instructed to catch bias
(Figure: instructions to the labelers)
100-104. InstructGPT Still Makes Simple Mistakes
1. Incorrectly assumes the premise is true when itʼs not
2. Overly hedging: the model might answer “there is no one answer to the question” even when one answer is clear from the context
3. Performance degrades when instructions contain multiple explicit constraints (e.g. “list 10 movies made in the 1930ʼs set in France”)
106-108. Implications
● Alignment research (more in the last lecture)
○ What are we aligning to? The labelers? The researchers?
● How do we do research when the model is constantly changing?
○ text-davinci-001 is the InstructGPT described in the paper
○ text-davinci-002 is the current model behind the API; it is far more capable, but we donʼt know what data it is trained on or what changes were made to the training procedure
○ How do we do model versioning when we start to iterate on the models and train them with model-dependent data?
109. Summary
Performance
● Labelersʼ preference: InstructGPT > GPT-3
● Truthfulness: InstructGPT > GPT-3
● Toxicity: InstructGPT > GPT-3 (but not bias)
Findings
● InstructGPT can generalize to “held-out” labelersʼ preferences
● Public NLP datasets do not reflect how LMs are used in the real world
● InstructGPT can generalize outside of the RLHF instruction distribution
● InstructGPT still makes simple mistakes
110. Pre-Lecture Q3
● Is preference ranking/comparison the only way to provide human feedback?
● What are other options and how to convert them into reward to train the models?
● What other types of human data do you think would be helpful?
112. Addressing Misalignment: GPT-3
Brown et al. 2020, Language Models are Few-Shot Learners
Train: next-token prediction -> Eval: follow instructions (e.g. answer this question)
Prompting: make the eval text more similar to the training corpora using prompts