1. How Does Generative AI Actually Work?
(a quick semi-technical introduction to Large Language Models)
William H. Press
Department of Computer Science
The University of Texas at Austin
May 12, 2023
2. What word comes after “Happy…”? Is it “aardvark”? Or “birthday”?
Simple probabilities (but trillions of them) are
at the root of all large language models (LLMs)
3. Me: ChatGPT, I want to explain sequential word guessing to someone. If a
sentence starts with "The boy" can you tell me what your next 10
sequential guesses of words would be, one word at a time?
Sure! Here are 10 sequential word guesses, one at a time, based on the sentence
starting with "The boy":
1. went, 2. to, 3. the, 4. park, 5. to, 6. play, 7. with, 8. his, 9. friends, 10. and
Keep in mind that as a language model, my guesses are based on patterns and
probabilities derived from the text corpus I was trained on. The actual guesses may
vary depending on the specific context and the model version being used.
What if the sentence instead begins with "The secret behind GPT-4 is"?
Here are 10 sequential word guesses:
1. "its", 2. "ability", 3. "to", 4. "understand", 5. "and", 6. "generate", 7. "human-
like", 8. "text", 9. "based", 10. "on“
Again, these are just guesses based on common patterns and do not represent an
actual sentence.
The distant ancestor of all chatbots is sequential word-guessing in natural languages. Hence “generative”.
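As a minimal sketch of this guessing loop, assuming the Hugging Face transformers library and using GPT-2 as an open stand-in (GPT-4's weights are not public):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sequential word guessing, greedily: at each step, score every vocabulary
# token and append the single most probable one to the running prompt.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("The boy", return_tensors="pt").input_ids
for _ in range(10):
    with torch.no_grad():
        logits = model(ids).logits           # shape (1, seq_len, vocab_size)
    next_id = logits[0, -1].argmax()         # greedy choice of the next token
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
print(tok.decode(ids[0]))                    # a plausible continuation of "The boy"
```

(Chatbots actually sample from the next-token probability distribution rather than always taking the argmax, which is why their answers vary from run to run.)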
4. LLMs as “lossy text compression”
• Represent training corpus more compactly by finding and
encoding its structures and relationships
• eliminate redundancy at all levels: syntactic, semantic, multiple
sources, etc.
• the result is encoded in ~10^9 to 10^12 matrix “weights”
• “Lossy” because information is irretrievably lost
• prompts are answered by (sort of) decompressing into highly probable responses that could have been in the training data but, in general, weren’t there verbatim
• The decompressed data maintains accuracy when…
• it is “common sense” or “conventional wisdom”
• because there is then huge redundancy in the training data
• But can be wildly inaccurate (like “digital artifacts” in a defective video) if the query is not well represented in the compressed training corpus
• e.g., most probable answer comes from one (wrong) document
• or variants of a widely circulated conspiracy theory
• if decompressing from no germane data at all, it just makes things up (“hallucinates”) to get the most probable response
https://www.newyorker.com/tech/annals-of-technology/chatgpt-is-a-blurry-jpeg-of-the-web
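As a toy illustration of the analogy (a deliberately crude sketch, not how LLMs actually work): a bigram model “compresses” a corpus into next-word counts; frequent patterns regenerate faithfully, while rare ones are lost or confabulated.

```python
from collections import Counter, defaultdict

# "Compress" a tiny corpus into next-word counts; the original text is gone.
corpus = ("the boy went to the park . the boy went to school . "
          "the girl went to the park .").split()
model = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    model[w1][w2] += 1

def most_probable_next(word):
    return model[word].most_common(1)[0][0]   # greedy "decompression"

word, out = "the", ["the"]
for _ in range(5):
    word = most_probable_next(word)
    out.append(word)
print(" ".join(out))   # "the boy went to the boy": plausible, but not verbatim
```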
5. GPT-4: OpenAI’s latest released Large Language Model
• OpenAI isn’t actually open! Can for many purposes be
thought of as practically a Microsoft subsidiary.
• Microsoft is said to provide the unique hardware infrastructure for
OpenAI algorithm development.
• GPT = Generative Pre-trained Transformer
• Thought to have >5×10^11 trainable parameters.
• GPT-3 had 1.75×10^11
• Trained on > several terabytes of language data
• Training cost claimed to be $100 million
• but this might include amortized R&D
• once trained, cost per query is millicents per token
• I will highlight three key elements of “secret sauce”:
• 1. transformer architecture
• 2. huge scale of parameter space and training corpus
• 3. “RLHF” Reinforcement Learning from Human Feedback
• mostly not reported on
6. Key #1: Transformer architecture. It is a distant
descendant of document query concepts
Document retrieval:
• Input text projected onto matrix of
possible queries
• Matrix multiply to cross queries
with keys (e.g., keywords)
• Matrix multiply to map result from
keys to values (e.g., documents)
• The brilliant idea of Vaswani et al. (2017, “Attention Is All You Need”) is to map all of Q, K, V from the same input.
• This is “Self-Attention”
• And have all of Q, K, V learned.
• Many layers, each with multiple attention heads, allow attention to many different levels of structure simultaneously
• This is “Multi-headed”
https://dugas.ch/artificial_curiosity/GPT_architecture.html
[Figure: schematic of the GPT architecture, with input-processing “stuff” (encoder) feeding output-processing “stuff” (decoder); ~10^3 layers?]
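A minimal single-head self-attention sketch in numpy, following the scaled dot-product formulation of Vaswani et al. (2017); the sizes are toy values, not GPT-4's:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.
    X: (seq_len, d_model) embeddings; Wq, Wk, Wv: learned projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv         # Q, K, V all from the same input
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # every token attends to every token
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)       # softmax over the keys
    return w @ V                             # weighted mixture of values

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 16))             # 5 tokens, 16-dim embeddings
Wq, Wk, Wv = (rng.standard_normal((16, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)   # (5, 8)
```

In a multi-headed model, many such heads, each with its own learned Wq, Wk, Wv, run in parallel within every layer.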
7. Key #2: Sheer scale: Only a few dare to call it
emergence, but the gain-of-function is striking
• Transformer parameters:
• trillion parameters =(?) 1000 parallel instances of a billion
• billion parameters in an instance =(?) ~10^4 dimensions each for query space, key space, value space (multiplied two at a time) + “glue”
• could think of this as looking at every token list 10^7 ways in formulating the next response (see the back-of-envelope sketch after this list)
• “stateless”: looks at whole previous dialogue as a new token list, maximum
length 32768 tokens
• Training corpus parameters:
• many terabytes?
• ~1000× the number of words a human hears or reads in a lifetime
• Many proprietary tricks:
• how to propagate gradients through the huge scale?
• how to maximize parallelism in training?
• special hardware?
• rumored that GPT-5 training is on hold because not enough GPUs are
obtainable in the world.
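The parameter arithmetic above, as a back-of-envelope sketch; the figures are the slide's rough guesses, not OpenAI's published specifications, and the "glue" value is assumed just to round out the total:

```python
# Back-of-envelope parameter count using the slide's rough figures.
d = 10**4                    # ~1e4 dims each for query, key, value space
qkv = 3 * d**2               # three d-by-d learned projections: ~3e8
glue = 7 * 10**8             # "glue" (layer norms, MLPs, ...): an assumed value
per_instance = qkv + glue    # ~1e9: "billion parameters in an instance"
total = 1000 * per_instance  # ~1000 parallel instances
print(f"{total:.0e} parameters")   # 1e+12: a trillion-parameter model
```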
8. Key #3: Reinforcement Learning from Human Feedback (RLHF)
• Said to involve 40 contractors with (each?) hundreds of workers.
• 1. Humans generate an initial dataset of “typical” queries and “good” responses.
• 2. Humans rank a separate dataset of responses (some good, some bad).
• 3. The reward model is trained on this data.
• 4. The language model trains itself to satisfy the reward model. (How often?)
• 5. The reward model is retrained with (free!) input from hundreds of millions of
users. (How often?)
[Figure: RLHF loop, in which a user prompt produces many output guesses; human feedback selects the “most-human” output and trains the reward model]
https://aisupremacy.substack.com/p/what-is-reinforcement-learning-with
https://www.bradknox.net/human-reward/tamer/
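Step 3 (training the reward model on human rankings) is typically done with a pairwise ranking loss, as in OpenAI's InstructGPT work; a minimal numpy sketch:

```python
import numpy as np

def ranking_loss(r_preferred, r_rejected):
    """Bradley-Terry-style pairwise loss: push the reward model's score for
    the human-preferred response above the score of the rejected one."""
    return -np.log(1.0 / (1.0 + np.exp(-(r_preferred - r_rejected))))

# Hypothetical reward-model scores for two responses to the same prompt:
print(ranking_loss(2.1, 0.3))   # ~0.15: ranking respected, small loss
print(ranking_loss(0.3, 2.1))   # ~1.95: ranking violated, large loss
```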
9. Me: ChatGPT, When GPT-3 or GPT-4 uses RLHF, is it a separate model, or does RLHF train the GPT weights directly?
In typical usage, RLHF doesn't directly train the base GPT model. The GPT model is first pretrained on a large corpus of text data, learning to predict the next word in a sentence. This pretraining process is unsupervised and uses a variant of the Transformer architecture to learn patterns and structures in the data.
After this pretraining, the model can be fine-tuned using reinforcement learning from human feedback. In this process, human evaluators rank different responses generated by the model, and these rankings are used to create a reward model. The model is then fine-tuned to optimize this reward model, typically using a method like Proximal Policy Optimization.
In this way, RLHF is a separate step that fine-tunes the pretrained GPT model, rather than directly training the initial GPT weights. However, please note that this information is based on the methodology up to September 2021, and OpenAI's approach may have evolved since then.
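A minimal sketch of the Proximal Policy Optimization step mentioned above: the objective is the reward-model score minus a KL penalty that keeps the fine-tuned policy close to the frozen pretrained model (beta here is a made-up coefficient, not OpenAI's):

```python
import numpy as np

def rlhf_objective(reward_score, logp_policy, logp_reference, beta=0.02):
    """Per-response RLHF objective, InstructGPT style: reward minus a KL
    penalty toward the pretrained "reference" model. logp_* are per-token
    log-probabilities of the sampled response under each model."""
    kl_estimate = np.sum(logp_policy - logp_reference)
    return reward_score - beta * kl_estimate

print(rlhf_objective(1.7, np.array([-1.0, -0.5]), np.array([-1.2, -0.9])))  # 1.688
```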
10. “There is no moat?” (Some insiders think that the technology
can be miniaturized and will leak via open source availability to
individuals and thence startups.)
• Meta/Facebook’s entire LLM was leaked online in March, 2023.
• and later released publicly
• A May, 2023, leaked Google document (one engineer’s private
view):
• People are now running foundation models on a Pixel 6 at 5 tokens/s
• You can fine-tune a personalized AI on your laptop in an evening.
• Open-source models are [he says] faster, more customizable, more
private, and pound-for-pound more capable.
• “They are doing things with $100 and 13B params that we struggle with
at $10M and 540B.”
• LoRA (low-rank factorization of large models) saves a factor of ~10^3
• and is composable to train new models, with no need to go back to the full model
• Big companies might be held liable for copyright infringement on
training data, or have to buy it
• Open-source leaked models might (de facto) not be
• Scenario: A bunch of (LoRA?) matrices appear on a Discord server
11. Me: ChatGPT, To what extent can LLMs be miniaturized, for example by low
rank factorization?
As of my last update in September 2021, there's been some promising
research showing that low-rank factorization can be used to compress neural
networks, including Transformer-based models. Weight matrices in the
Transformer's self-attention layers can be approximated well with low-rank
matrices, which can significantly reduce the model's size with minimal loss in
performance.
Low-rank factorization is just one of many techniques for model
compression. Others include knowledge distillation, where a smaller model is
trained to mimic the behavior of a larger model, and pruning, where less
important parts of the model are removed.
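A minimal numpy sketch of that low-rank idea: approximate (or, in LoRA's case, augment) a full d-by-d weight matrix with a rank-r product A @ B, cutting parameters from d^2 to 2dr; the sizes are toy values:

```python
import numpy as np

d, r = 4096, 8                            # toy model width and rank
rng = np.random.default_rng(0)
A = rng.standard_normal((d, r)) * 0.01    # trainable factor, shape (d, r)
B = rng.standard_normal((r, d)) * 0.01    # trainable factor, shape (r, d)
delta_W = A @ B                           # rank-r stand-in for a d-by-d update

full, low_rank = d * d, 2 * d * r
print(f"full: {full:.1e}  low-rank: {low_rank:.1e}  savings: {full // low_rank}x")
# 256x at these sizes; larger d (or smaller r) approaches the ~10^3 factor
# quoted on the previous slide
```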
12. Me: ChatGPT, Is LLM training data subject to copyright and thus needs to be
licensed?
The use of copyrighted material in training LLMs may be considered a
copyright infringement. The models learn by consuming vast amounts of text
data which may include copyrighted works. By using this information to
generate new text, the models are indirectly reproducing and repurposing
parts of the copyrighted works they were trained on, albeit in a non-
verbatim manner. Some argue that this constitutes a derivative work, which
is a right reserved for the copyright holder under U.S. Copyright Law and
international copyright agreements. The copyright holder has exclusive rights
to reproduce and create derivative works from their material.
The counter-argument hinges on the doctrine of fair use. Advocates can
argue that LLMs, which generate new, original text, use copyrighted material
in a transformative manner, which is a crucial consideration in fair use
determinations. The vast amount of training data also means any specific
copyrighted work constitutes an insubstantial portion of the input. Lastly, this
use doesn't harm the market for the original works but may encourage
further knowledge creation. However, as of September 2021, these complex
legal questions lack clear legal precedent and interpretations may vary across
jurisdictions.
13. The LLM Value Chain:
figure courtesy W. Dally
It may turn out that most value lies with the end-service provider who specializes the model and knows the end customer's needs.