# How Does Generative AI Actually Work? (a quick semi-technical introduction to Large Language Models)

May 20, 2023


• 1. How Does Generative AI Actually Work? (a quick semi-technical introduction to Large Language Models). William H. Press, Department of Computer Science, The University of Texas at Austin. May 12, 2023.
• 2. What word comes after "Happy…"? Is it "aardvark"? Or "birthday"? Simple probabilities (but trillions of them) are at the root of all large language models (LLMs).
• 3. Me: ChatGPT, I want to explain sequential word guessing to someone. If a sentence starts with "The boy", can you tell me what your next 10 sequential guesses of words would be, one word at a time?
  ChatGPT: Sure! Here are 10 sequential word guesses, one at a time, based on the sentence starting with "The boy": 1. went, 2. to, 3. the, 4. park, 5. to, 6. play, 7. with, 8. his, 9. friends, 10. and. Keep in mind that as a language model, my guesses are based on patterns and probabilities derived from the text corpus I was trained on. The actual guesses may vary depending on the specific context and the model version being used.
  Me: What if the sentence instead begins with "The secret behind GPT-4 is"?
  ChatGPT: Here are 10 sequential word guesses: 1. "its", 2. "ability", 3. "to", 4. "understand", 5. "and", 6. "generate", 7. "human-like", 8. "text", 9. "based", 10. "on". Again, these are just guesses based on common patterns and do not represent an actual sentence.
  The distant ancestor of all chatbots is sequential word-guessing in natural languages. Hence "generative".
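The sequential word guessing above can be sketched with a toy model. The probability table below is invented purely for illustration; a real LLM learns trillions of such probabilities from its training corpus rather than a hand-written bigram table.

```python
# Toy "language model": P(next word | current word).
# These probabilities are made up for illustration only.
BIGRAMS = {
    "The":  {"boy": 0.6, "secret": 0.4},
    "boy":  {"went": 0.7, "ran": 0.3},
    "went": {"to": 0.9, "home": 0.1},
    "to":   {"the": 0.8, "play": 0.2},
    "the":  {"park": 1.0},
}

def generate(start, n_words):
    """Sequential word guessing: greedily pick the most probable next word."""
    words = [start]
    for _ in range(n_words):
        choices = BIGRAMS.get(words[-1])
        if not choices:
            break  # no known continuation for this word
        words.append(max(choices, key=choices.get))
    return " ".join(words)

print(generate("The", 5))  # "The boy went to the park"
```

Real models sample from the distribution rather than always taking the maximum, which is why the same prompt can yield different continuations.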
• 4. LLMs as "lossy text compression"
  • Represent the training corpus more compactly by finding and encoding its structures and relationships
    • eliminate redundancy at all levels: syntactic, semantic, multiple sources, etc.
    • the result is encoded in ~10^9–10^12 matrix "weights"
  • "Lossy" because information is irretrievably lost
    • prompts are answered by (sort of) decompressing into highly probable responses that could have been in the training data but, in general, weren't exactly there verbatim
  • The decompressed data maintains accuracy when it is "common sense" or "conventional wisdom"
    • because then there is huge redundancy in the training data
  • But it can be wildly inaccurate (like digital artifacts in a defective video) if the query is not well represented in the compressed training corpus
    • e.g., the most probable answer comes from one (wrong) document
    • or from variants of a widely circulated conspiracy theory
  • If decompressing from no germane data at all, it just makes things up ("hallucinates") to produce the most probable response
  https://www.newyorker.com/tech/annals-of-technology/chatgpt-is-a-blurry-jpeg-of-the-web
• 5. GPT-4: OpenAI's latest released Large Language Model
  • OpenAI isn't actually open! It can for many purposes be thought of as practically a Microsoft subsidiary.
    • Microsoft is said to provide the unique hardware infrastructure for OpenAI algorithm development.
  • GPT = Generative Pre-trained Transformer
  • Thought to have >5x10^11 trainable parameters.
    • GPT-3 had 1.75x10^11.
  • Trained on more than several terabytes of language data.
  • Training cost claimed to be \$100 million
    • but this might include amortized R&D
    • once trained, cost per query is millicents per token
  • I will highlight three key elements of the "secret sauce":
    1. transformer architecture
    2. huge scale of parameter space and training corpus
    3. "RLHF": Reinforcement Learning from Human Feedback (mostly not reported on)
• 6. Key #1: Transformer architecture. It is a distant descendant of document-query concepts.
  Document retrieval:
  • Input text is projected onto a matrix of possible queries
  • Matrix multiply to cross queries with keys (e.g., keywords)
  • Matrix multiply to map the result from keys to values (e.g., documents)
  • The brilliant idea of Vaswani et al. (2017, "Attention Is All You Need") is to map all of Q, K, V from the same input.
    • This is "self-attention".
    • And all of Q, K, V are learned.
  • Many layers allow attention to many different levels of structure simultaneously.
    • This is "multi-headed".
  [diagram: input processing stuff (encoder) → output processing stuff (decoder); ~10^3 layers?]
  https://dugas.ch/artificial_curiosity/GPT_architecture.html
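A minimal single-head version of the self-attention step described above can be written out directly. This is a sketch in pure Python with tiny, arbitrary matrix sizes (real implementations use GPU tensor libraries): Q, K, and V are all linear projections of the *same* input X, and the output is softmax(QKᵀ/√d)·V.

```python
import math

def matmul(A, B):
    """Plain matrix multiply on lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def softmax(row):
    """Turn raw scores into a probability distribution."""
    m = max(row)
    e = [math.exp(x - m) for x in row]
    s = sum(e)
    return [x / s for x in e]

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention: Q, K, V all come from the same input X."""
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    d = len(Q[0])
    KT = [list(col) for col in zip(*K)]
    # scores = Q K^T / sqrt(d): how much each token attends to each other token
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(Q, KT)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, V)  # weighted mix of the values
```

With identity weight matrices and a 2-token input, each output row is a probability-weighted mix of the value rows, so its entries sum to 1. "Multi-headed" attention runs many such heads in parallel with different learned Wq, Wk, Wv.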
• 7. Key #2: Sheer scale. Only a few dare to call it emergence, but the gain of function is striking.
  • Transformer parameters:
    • a trillion parameters =(?) 1000 parallel instances of a billion
    • a billion parameters in an instance =(?) 10^4 each for query space, key space, value space (multiplied two at a time) + "glue"
    • could be thought of as looking at every token list 10^7 ways in formulating the next response
    • "stateless": looks at the whole previous dialogue as a new token list, maximum length 32768 tokens
  • Training corpus parameters:
    • many terabytes?
    • ~1000x the number of words a human hears or reads in a lifetime
  • Many proprietary tricks:
    • how to propagate gradients through the huge scale?
    • how to maximize parallelism in training?
    • special hardware?
  • Rumored that GPT-5 training is on hold because not enough GPUs are obtainable in the world.
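The "stateless" point above is easy to show concretely: on every turn, the entire prior dialogue is re-tokenized into one list and clipped to the context window. The function below is a sketch; `str.split` stands in for a real subword tokenizer.

```python
MAX_CONTEXT = 32768  # GPT-4's maximum token-list length, per the slide

def build_prompt(dialogue_turns, tokenize, max_context=MAX_CONTEXT):
    # Stateless model: no memory between queries. Every query re-submits
    # the ENTIRE prior dialogue as one token list, truncated to the window.
    tokens = []
    for turn in dialogue_turns:
        tokens.extend(tokenize(turn))
    return tokens[-max_context:]  # keep only the most recent tokens
```

Anything that falls off the front of the window is simply gone, which is why long chats "forget" their beginnings.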
• 8. Key #3: Reinforcement Learning from Human Feedback (RLHF)
  • Said to involve 40 contractors with (each?) hundreds of workers.
  1. Humans generate an initial dataset of "typical" queries and "good" responses.
  2. Humans rank a separate dataset of responses (some good, some bad).
  3. The reward model is trained on this data.
  4. The language model trains itself to satisfy the reward model. (How often?)
  5. The reward model is retrained with (free!) input from hundreds of millions of users. (How often?)
  [diagram: user prompt → many output guesses → "most-human" output → feedback]
  https://aisupremacy.substack.com/p/what-is-reinforcement-learning-with
  https://www.bradknox.net/human-reward/tamer/
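Steps 2–4 above can be caricatured in a few lines. Everything here is invented for illustration: the real pipeline trains a neural reward model on the rankings and then fine-tunes the LLM against it with policy-gradient methods, not the trivial win-counting and best-of-n selection used below.

```python
# Step 2 (toy data): human rankings as (preferred, rejected) pairs.
rankings = [
    ("helpful answer", "rude answer"),
    ("helpful answer", "evasive answer"),
    ("polite answer", "rude answer"),
]

# Step 3: "train" a trivial reward model: score = net number of ranking wins.
reward = {}
for preferred, rejected in rankings:
    reward[preferred] = reward.get(preferred, 0) + 1
    reward[rejected] = reward.get(rejected, 0) - 1

# Step 4 (caricature): steer generation toward what the reward model likes,
# here by simply choosing the highest-reward candidate ("best-of-n").
def best_response(candidates):
    return max(candidates, key=lambda r: reward.get(r, 0))

print(best_response(["rude answer", "helpful answer", "evasive answer"]))
```

The key structural idea survives the caricature: human preferences train a *separate* scoring model, and the language model is then pushed toward outputs that score well.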
• 9. Me: ChatGPT, when GPT-3 or GPT-4 uses RLHF, is it a separate model, or does RLHF train the GPT weights directly?
  ChatGPT: In typical usage, RLHF doesn't directly train the base GPT model. The GPT model is first pretrained on a large corpus of text data, learning to predict the next word in a sentence. This pretraining process is unsupervised and uses a variant of the Transformer architecture to learn patterns and structures in the data. After this pretraining, the model can be fine-tuned using reinforcement learning from human feedback. In this process, human evaluators rank different responses generated by the model, and these rankings are used to create a reward model. The model is then fine-tuned to optimize this reward model, typically using a method like Proximal Policy Optimization. In this way, RLHF is a separate step that fine-tunes the pretrained GPT model, rather than directly training the initial GPT weights. However, please note that this information is based on the methodology up to September 2021, and OpenAI's approach may have evolved since then.
• 10. "There is no moat?" (Some insiders think the technology can be miniaturized and will leak via open-source availability to individuals and thence to startups.)
  • Meta/Facebook's entire LLM was leaked online in March 2023
    • and later released publicly
  • A May 2023 leaked Google document (one engineer's private view):
    • People are now running foundation models on a Pixel 6 at 5 tokens/s.
    • You can fine-tune a personalized AI on your laptop in an evening.
    • Open-source models are [he says] faster, more customizable, more private, and pound-for-pound more capable.
    • "They are doing things with \$100 and 13B params that we struggle with at \$10M and 540B."
  • LoRA (low-rank factorization of large models) saves a factor of ~10^3
    • and is composable to train new models; no need to go back to the full model
  • Big companies might be held liable for copyright infringement on training data, or have to buy it
    • open-source leaked models might (de facto) not be
  • Scenario: a bunch of (LoRA?) matrices appears on a Discord server
• 11. Me: ChatGPT, to what extent can LLMs be miniaturized, for example by low-rank factorization?
  ChatGPT: As of my last update in September 2021, there's been some promising research showing that low-rank factorization can be used to compress neural networks, including Transformer-based models. Weight matrices in the Transformer's self-attention layers can be approximated well with low-rank matrices, which can significantly reduce the model's size with minimal loss in performance. Low-rank factorization is just one of many techniques for model compression. Others include knowledge distillation, where a smaller model is trained to mimic the behavior of a larger model, and pruning, where less important parts of the model are removed.
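The arithmetic behind low-rank savings is simple to check: replacing an n×m weight matrix W with rank-r factors A (n×r) and B (r×m), so that W ≈ A·B, stores r(n+m) numbers instead of nm. The layer sizes below are illustrative, not GPT-4's actual shapes.

```python
def param_counts(n, m, r):
    """Weights stored by a full n x m matrix vs. its rank-r factors A, B."""
    full = n * m            # W stored directly
    low_rank = r * (n + m)  # W ~= A @ B, with A (n x r) and B (r x m)
    return full, low_rank

full, low = param_counts(10_000, 10_000, 8)
print(full, low, full / low)  # 100000000 160000 625.0
```

For a 10,000×10,000 layer at rank 8, that is a ~625x reduction; modest ranks across many layers are where factors of ~10^3 can come from.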