Webinar on ChatGPT
Abhilash Majumder (Intel SCG)
ChatGPT
• The ChatGPT model is trained using Reinforcement Learning from Human Feedback (RLHF).
• ChatGPT uses the same methods as InstructGPT, but with slight differences in the data collection setup.
• ChatGPT starts from an initial model trained with supervised fine-tuning: human AI trainers provided conversations in which they played both sides, the user and an AI assistant (a minimal sketch of this step follows the list).
• On top of the supervised fine-tuned model, ChatGPT leverages a learned reward function and the on-policy PPO algorithm to achieve SOTA generative sequences.
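A minimal sketch of the supervised fine-tuning step, using GPT-2 from Hugging Face transformers as a small stand-in for the actual base model; the two-line conversation, learning rate, and single gradient step are illustrative assumptions, not OpenAI's setup.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 stands in for the (much larger) base model; the dialogue is a made-up
# example of the two-sided conversations written by human AI trainers.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

conversation = (
    "User: What is RLHF?\n"
    "Assistant: Reinforcement Learning from Human Feedback."
)
inputs = tokenizer(conversation, return_tensors="pt")

# Supervised fine-tuning is a standard causal language-modelling loss
# on the human-written demonstration.
outputs = model(**inputs, labels=inputs["input_ids"])
optimizer.zero_grad()
outputs.loss.backward()
optimizer.step()
print(float(outputs.loss))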
ChatGPT
ChatGPT- GPT3
• GPT-3 is an autoregressive transformer model with 175 billion parameters. It uses the same architecture/model as GPT-2, including the modified initialization, pre-normalization, and reversible tokenization, with the exception that GPT-3 uses alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer (see the mask sketch below).
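A minimal sketch of the two attention patterns that alternate across layers: a dense causal mask and a locally banded (windowed) causal mask. The sequence length, window size, and even/odd alternation are toy assumptions; the exact sparsity layout in GPT-3 follows the Sparse Transformer.

import torch

def dense_causal_mask(n):
    # dense attention: token i attends to every token j <= i
    return torch.tril(torch.ones(n, n, dtype=torch.bool))

def banded_causal_mask(n, window):
    # locally banded sparse attention: token i attends only to itself
    # and the previous (window - 1) tokens
    return torch.triu(dense_causal_mask(n), diagonal=-(window - 1))

n_layers, seq_len, window = 4, 8, 3
for layer in range(n_layers):
    # alternate dense and locally banded patterns layer by layer
    if layer % 2 == 0:
        mask = dense_causal_mask(seq_len)
        kind = "dense"
    else:
        mask = banded_causal_mask(seq_len, window)
        kind = "banded"
    print(f"layer {layer} ({kind}):\n{mask.int()}")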
ChatGPT- PPO(A2C)
• There are two primary variants of PPO: PPO-Penalty and PPO-Clip.
• PPO-Penalty approximately solves a KL-constrained update like TRPO, but penalizes the KL-divergence in the objective function instead of making it a hard constraint, and automatically adjusts the penalty coefficient over the course of training so that it is scaled appropriately.
• PPO-Clip has neither a KL-divergence term in the objective nor a constraint at all. Instead, it relies on specialized clipping in the objective function to remove incentives for the new policy to get far from the old policy (both loss variants are sketched after this list).
• PPO is an on-policy algorithm.
• PPO can be used for environments with either discrete or continuous action spaces.
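A minimal PyTorch sketch of the two loss variants above on toy tensors. The clipping range, the fixed (non-adaptive) penalty coefficient, and the simple KL estimator are illustrative assumptions; in real PPO-Penalty the coefficient is adapted during training.

import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    # probability ratio pi_new(a|s) / pi_old(a|s)
    ratio = torch.exp(logp_new - logp_old)
    # clipping removes the incentive to move the new policy far from the old one
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

def ppo_penalty_loss(logp_new, logp_old, advantages, beta=0.1):
    # keeps a KL penalty in the objective instead of clipping
    ratio = torch.exp(logp_new - logp_old)
    approx_kl = (logp_old - logp_new).mean()  # simple KL(old || new) estimate
    return -(ratio * advantages).mean() + beta * approx_kl

# toy log-probabilities and advantages for three sampled actions
logp_old = torch.log(torch.tensor([0.20, 0.50, 0.30]))
logp_new = torch.log(torch.tensor([0.25, 0.45, 0.30]))
advantages = torch.tensor([1.0, -0.5, 0.2])
print(ppo_clip_loss(logp_new, logp_old, advantages))
print(ppo_penalty_loss(logp_new, logp_old, advantages))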
ChatGPT
• In the case of GPT, the PPO stage is semi-supervised: the reward function is moderated by human supervision of previous results. The initial LLM's (GPT's) generated sequences are ranked by human labelers, those rankings supervise the reward function, and PPO then maximizes the cumulative rewards it assigns (a pairwise ranking-loss sketch follows).
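A minimal sketch of how human rankings supervise the reward function: the InstructGPT-style pairwise (Bradley-Terry) ranking loss, which pushes the reward of the human-preferred sequence above the rejected one. The scores below are made-up toy values.

import torch
import torch.nn.functional as F

def pairwise_ranking_loss(reward_chosen, reward_rejected):
    # maximize the margin between the human-preferred and rejected completions
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# toy reward-model scores for two prompt/completion pairs
reward_chosen = torch.tensor([1.2, 0.4])
reward_rejected = torch.tensor([0.3, 0.9])
print(pairwise_ranking_loss(reward_chosen, reward_rejected))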
ChatGPT
• Both models are given a prompt and generate a response. The tuned LLM's responses are scored with the reward function, and the score is then used to update the parameters of the fine-tuned LLM to maximize the reward function score (PPO rewards); see the toy update loop below.
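A toy sketch of that loop: sample a "response", score it with a stand-in reward model, and update the policy parameters so that high-reward responses become more likely. For brevity it uses a plain policy-gradient (REINFORCE-style) update on a five-token categorical policy rather than full PPO, whose clipped objective was sketched earlier; every object here is a placeholder.

import torch

# toy "policy" over 5 tokens; real RLHF updates the LLM's parameters instead
logits = torch.zeros(5, requires_grad=True)
optimizer = torch.optim.SGD([logits], lr=0.5)

def reward_model(token):
    # stand-in reward model that happens to prefer token 3
    return 1.0 if token == 3 else 0.0

for step in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    token = dist.sample()                  # the tuned model "responds"
    reward = reward_model(token.item())    # the response is scored
    # increase the log-probability of responses with high reward scores
    loss = -dist.log_prob(token) * reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(torch.softmax(logits, dim=-1))       # mass shifts toward token 3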
ChatGPT
• But we also don't want it to deviate too much from the initial model's response, which is what the KL penalty is used for. Otherwise the optimization might result in an LLM that produces gibberish but maximizes the reward model score (a sketch of the KL-shaped reward follows).
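A minimal sketch of shaping the reward with a per-token KL penalty between the tuned and the initial model, in the spirit of the InstructGPT setup; the coefficient beta and the toy log-probabilities are assumptions.

import torch

def kl_shaped_reward(reward_score, logp_tuned, logp_initial, beta=0.02):
    # penalize responses whose tokens the tuned model assigns much higher
    # probability than the initial model does (a log-ratio KL estimate)
    per_token_kl = logp_tuned - logp_initial
    return reward_score - beta * per_token_kl.sum()

# per-token log-probabilities of one sampled response under both models
logp_tuned = torch.log(torch.tensor([0.50, 0.40, 0.60]))
logp_initial = torch.log(torch.tensor([0.40, 0.40, 0.50]))
print(kl_shaped_reward(1.0, logp_tuned, logp_initial))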
ChatGPT
• OpenAI Blog: https://openai.com/blog/chatgpt/
• InstructGPT: https://t.co/2VXhz0kK1o
• Minimalist Repository (in progress): https://github.com/abhilash1910/Minimalist-ChatGPT
• Other Repositories in RL/LLM: https://github.com/abhilash1910/
ChatGPT
• Twitter: https://twitter.com/abhilash1396
• GitHub: https://github.com/abhilash1910/
• LinkedIn: https://www.linkedin.com/in/abhilash-majumder-1aa7b9138/
