Deep Dive: step-by-step inference for text generation
Companion video: https://youtu.be/cl3MAhAr9-M
Julien Simon
https://www.linkedin.com/in/juliensimon
https://www.youtube.com/juliensimonfr
The author of this material is Julien Simon https://www.linkedin.com/in/juliensimon unless explicitly mentioned.
This material is shared under the CC BY-NC 4.0 license https://creativecommons.org/licenses/by-nc/4.0/
You are free to share and adapt this material, provided that you give appropriate credit, provide a link to the license, and indicate if changes were made.
You may not use the material for commercial purposes. You may not apply any restriction on what the license permits.
Text generation: decoder-only inference
• GPT-like models are decoder-only models
• No encoder, no encoder-decoder attention
• Input processing (aka prefill): highly parallel
• Inputs (the tokenized prompt) are embedded and encoded
• Attention computes the keys and values (KV) required for attention outputs
• Large matrix multiplication, high usage of the hardware accelerator
• Output generation: sequential
• The answer is generated one token at a time
• Each generated token is appended to the previous input
• The process is repeated until a stopping criterion is met (max. length or EOS), as sketched in the code below
• Low usage of the hardware accelerator
"Attention is all you need"
https://arxiv.org/abs/1706.03762
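To make the prefill-then-decode loop concrete, here is a minimal sketch in Python. The next_token_logits function is a hypothetical stand-in for a full forward pass; a real decoder-only model would compute logits from the whole sequence (and reuse cached state, as discussed below).

```python
import numpy as np

VOCAB_SIZE = 100_000   # hypothetical vocabulary size
EOS_TOKEN = 0          # hypothetical end-of-sequence token id
MAX_NEW_TOKENS = 16

def next_token_logits(token_ids):
    # Stand-in for a real forward pass: a decoder-only model would embed
    # the tokens, run every Transformer layer, and return the logits
    # for the last position only.
    rng = np.random.default_rng(len(token_ids))
    return rng.normal(size=VOCAB_SIZE)

prompt = [101, 1996, 4248, 2829, 4419]   # tokenized prompt, processed in one parallel prefill pass
tokens = list(prompt)

for _ in range(MAX_NEW_TOKENS):          # sequential generation: one token per iteration
    logits = next_token_logits(tokens)
    new_token = int(np.argmax(logits))   # greedy decoding (alternatives are covered on the last slide)
    tokens.append(new_token)             # the new token becomes part of the next input
    if new_token == EOS_TOKEN:           # stopping criterion: EOS or max length
        break

print(tokens)
```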
Attention
• The self-attention mechanism is at the core of Transformer models
• "Attention is All You Need" https://arxiv.org/abs/1706.03762 (06/2017)
Computing Attention outputs
Tensor | Dimensions | Example | Purpose
Input sequence: "the quick brown fox jumps over the lazy dog" | | | The input sequence to run attention on: 9 words
Tokenized input sequence: [101, 1996, 4248, 2829, 4419, 14523, 2058, 1996, 13971, 3899, 102] | N | 11 | The same, in tokenized form, plus start and end tokens
X: input embeddings | N x d_hidden | 11 x 512 | The same, in embedded form; d_hidden is the embedding size
W_Q (query), W_K (key), W_V (value) and W_0 (output projection) weight matrices | d_hidden x d_hidden | 512 x 512 | Model weights learned during training
Q = X W_Q (queries), K = X W_K (keys), V = X W_V (values) | N x d_hidden | 11 x 512 | Q expresses what each token needs to know from others (« search query »), K encodes the information that each token provides (« keywords »), V stores the actual content that each token shares when attended to (« values »)
Q K^T: attention scores | N x N | 11 x 11 | How well each token matches the keys of the other tokens (« similarity scores »)
softmax(scores) x V | N x d_hidden | 11 x 512 | Tokens collect information from each other based on their relevance (« attention weights »)
Attention output: attention weights x W_0 | N x d_hidden | 11 x 512 | W_0 captures additional interactions across the sequence
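The table can be reproduced almost row by row with NumPy. This is a minimal sketch with random weights (a trained model would of course use learned values for W_Q, W_K, W_V and W_0); the shapes match the example column (N = 11, d_hidden = 512):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_hidden = 11, 512

X  = rng.normal(size=(N, d_hidden))              # input embeddings
WQ = rng.normal(size=(d_hidden, d_hidden))       # learned during training
WK = rng.normal(size=(d_hidden, d_hidden))
WV = rng.normal(size=(d_hidden, d_hidden))
W0 = rng.normal(size=(d_hidden, d_hidden))

Q, K, V = X @ WQ, X @ WK, X @ WV                 # N x d_hidden each

scores = Q @ K.T / np.sqrt(d_hidden)             # N x N similarity scores
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax -> attention weights

context = weights @ V                            # N x d_hidden: tokens gather information
output  = context @ W0                           # N x d_hidden: attention output
print(scores.shape, context.shape, output.shape)
```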
The KV cache
• Each round of text generation adds a new token to the input sequence
• Can we avoid recomputing KV values again and again for the previous tokens?
• We only really need to do it for the new token we just generated
• The KV cache stores the KV values for all previous tokens in the accelerator RAM
• Of course, this doesn't help for the first generated token, since the full prompt has to be processed first: that's why the first token takes longer to generate 😃
• Cache size (FP16, in bytes) = 2 (K and V) x 2 (bytes per value) x batch_size x seq_length x num_layers x hidden_size ➡ gigabytes (see the worked example below)
"Generative LLM inference" https://awsdocs-neuron.readthedocs-hosted.com
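As a worked example, plugging hypothetical Llama-7B-like dimensions (32 layers, hidden size 4096) into the formula gives an idea of the scale; the exact numbers depend on the model and serving configuration:

```python
# KV cache size in FP16: 2 (K and V) x 2 bytes per value x batch x seq_len x layers x hidden size
batch_size  = 1
seq_length  = 4096
num_layers  = 32        # hypothetical, Llama-7B-like
hidden_size = 4096      # hypothetical, Llama-7B-like

cache_bytes = 2 * 2 * batch_size * seq_length * num_layers * hidden_size
print(f"{cache_bytes / 1024**3:.1f} GiB")   # ~2 GiB for a single 4K-token sequence
```

The size grows linearly with batch size and sequence length, which is why the KV cache quickly becomes a dominant memory consumer when serving long contexts.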
Multi-head Attention (MHA)
• Attention is split across several parallel « heads »
• All heads see the full input sequence
• Each head only sees a subset of embedding dimensions:
dhidden / number of heads
• Each head has its own set of query, key, and value weight matrices
• All heads compute their attention scores in parallel
• Their outputs are then concatenated (see the sketch below)
• Main benefit: each head can explore and learn different patterns in the
training data
« Multi-head attention allows the model to jointly attend to information from
different representation subspaces at different positions. With a single
attention head, averaging inhibits this. »
MHA in BERT: https://github.com/huggingface/transformers/blob/main/src/transformers/models/bert/modeling_bert.py
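In practice, the per-head projections are usually computed as one large matrix multiplication and then reshaped into heads, which is essentially what the BERT implementation linked above does. A minimal sketch of the split, assuming 8 heads and d_hidden = 512:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_hidden, n_heads = 11, 512, 8
d_head = d_hidden // n_heads                 # 64 dimensions per head

X  = rng.normal(size=(N, d_hidden))          # input embeddings
WQ = rng.normal(size=(d_hidden, d_hidden))   # concatenation of the 8 per-head query matrices

Q = X @ WQ                                   # (11, 512): a single large matmul
Q_heads = Q.reshape(N, n_heads, d_head).transpose(1, 0, 2)
print(Q_heads.shape)                         # (8, 11, 64): one 11 x 64 query matrix per head
```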
Computing Multi-head Attention outputs
Tensor | Dimensions | Example | Purpose
Input sequence: "the quick brown fox jumps over the lazy dog" | | | The input sequence to run attention on
Tokenized input sequence | N | 11 | The same, in tokenized form
X: input embeddings | N x d_hidden | 11 x 512 | The same, in embedded form
W_Qi (query), W_Ki (key), W_Vi (value) weight matrices | d_hidden x d_mha | 512 x 64 | d_mha = d_hidden / number of attention heads (here, 8) = 64. Each head has its own weight matrices, which learn different features
W_0: output projection weight matrix | d_hidden x d_hidden | 512 x 512 | Shared across heads, learned during training
Qi = X W_Qi (queries), Ki = X W_Ki (keys), Vi = X W_Vi (values) | N x d_mha | 11 x 64 | All heads run this in parallel. Ki and Vi are stored in the KV cache
Qi Ki^T: attention scores | N x N | 11 x 11 | All heads run this in parallel
softmax(scores) x Vi | N x d_mha | 11 x 64 | All heads run this in parallel
Concatenate head outputs | N x d_hidden | 11 x 512 | 8 heads x 64 dimensions = 512
Attention output: attention weights x W_0 | N x d_hidden | 11 x 512 | W_0 captures additional interactions across the sequence
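Following the table row by row, here is a per-head NumPy sketch with random weights (N = 11, d_hidden = 512, 8 heads of 64 dimensions). The explicit Python loop is for readability only; a real implementation batches all heads into a single matmul:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_hidden, n_heads = 11, 512, 8
d_mha = d_hidden // n_heads          # 64 dimensions per head

X  = rng.normal(size=(N, d_hidden))
W0 = rng.normal(size=(d_hidden, d_hidden))

heads = []
for i in range(n_heads):             # the heads are independent: they run in parallel
    WQi = rng.normal(size=(d_hidden, d_mha))
    WKi = rng.normal(size=(d_hidden, d_mha))
    WVi = rng.normal(size=(d_hidden, d_mha))
    Qi, Ki, Vi = X @ WQi, X @ WKi, X @ WVi        # 11 x 64 each; Ki and Vi go to the KV cache
    scores = Qi @ Ki.T / np.sqrt(d_mha)           # 11 x 11 attention scores
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                 # softmax -> attention weights
    heads.append(w @ Vi)                          # 11 x 64 per-head output

concat = np.concatenate(heads, axis=-1)           # 11 x 512: concatenate head outputs
output = concat @ W0                              # 11 x 512: attention output
print(output.shape)
```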
The Memory Bottleneck in Multi-head Attention (MHA)
• The KV cache (the keys and values computed for all previous tokens) is stored in High Bandwidth Memory (HBM), which is off-chip
• Quadratic complexity for HBM accesses with respect to sequence length
• Lots of techniques to reduce the amount of KV data transferred between the
GPU and the HBM. See https://youtu.be/2TT384U4vQg
• One of the latest is a new attention variant: Multi-head Latent Attention (MLA)
Multi-head Latent Attention (MLA)
DeepSeek v2 https://arxiv.org/abs/2405.04434 (05/2024)
DeepSeek v3 https://arxiv.org/abs/2412.19437 (12/2024)
• Implemented in DeepSeek v2 and v3
• K and V matrices are not cached
• A low-rank latent representation, produced by a down-projection learned during training, is cached instead
• Much lower KV cache usage
(90%+ savings)
• 5-6x inference speedup
• Accuracy higher than MHA
• See appendix D.1 in v2 paper
Multi-head Latent Attention
Tensor | Dimensions | Example | Purpose
X: input embeddings | N x d_hidden | 11 x 512 | The input sequence, in embedded form
W_Qi (query), W_Ki (key), W_Vi (value) weight matrices | d_hidden x d_mha | 512 x 64 | d_mha = d_hidden / number of attention heads (here, 8) = 64. Each head has its own weight matrices, which learn different features
W_0: output projection weight matrix | d_hidden x d_hidden | 512 x 512 | Shared across heads, learned during training
W_i_down: down-projection matrix; W_up: up-projection matrix | d_mha x d_mha_latent; d_latent x d_hidden | 64 x 4; 32 x 512 | d_latent should be much smaller than d_hidden (here, 32). d_mha_latent = d_latent / number of attention heads (here, 8) = 4
Qi = X W_Qi (queries); Ki = X W_Ki W_i_down (keys); Vi = X W_Vi W_i_down (values) | N x d_mha; N x d_mha_latent; N x d_mha_latent | 11 x 64; 11 x 4; 11 x 4 | All heads run this in parallel. Ki and Vi are stored in the KV cache and are d_hidden / d_latent times smaller (here, 512 / 32 = 16). The trade-off is an extra matmul to compute Ki and Vi. DeepSeek uses different down-projection matrices for K and V
(Qi W_i_down) Ki^T: attention scores | N x N | 11 x 11 | All heads run this in parallel. Qi is down-projected for this calculation only (possibly with its own matrix)
softmax(scores) x Vi | N x d_mha_latent | 11 x 4 | All heads run this in parallel
Concatenate head outputs | N x d_latent | 11 x 32 | 8 heads x 4 latent dimensions = 32
Outputs x W_up | N x d_hidden | 11 x 512 | Bring the output back to the initial dimension
Attention weights x W_0 | N x d_hidden | 11 x 512 | Capture additional interactions across the sequence
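The low-rank trade-off can be sketched in NumPy with random weights and the table's dimensions (d_hidden = 512, d_mha = 64, d_latent = 32, d_mha_latent = 4). This only illustrates the caching idea: DeepSeek's actual formulation differs in several details (separate down-projections for keys and values, decoupled RoPE keys, etc.):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_hidden, n_heads = 11, 512, 8
d_mha = d_hidden // n_heads          # 64 dimensions per head
d_latent = 32
d_mla = d_latent // n_heads          # 4 latent dimensions per head

X   = rng.normal(size=(N, d_hidden))
Wup = rng.normal(size=(d_latent, d_hidden))       # shared up-projection
W0  = rng.normal(size=(d_hidden, d_hidden))

heads = []
for i in range(n_heads):                          # all heads run in parallel
    WQi   = rng.normal(size=(d_hidden, d_mha))
    WKi   = rng.normal(size=(d_hidden, d_mha))
    WVi   = rng.normal(size=(d_hidden, d_mha))
    Wdown = rng.normal(size=(d_mha, d_mla))       # low-rank down-projection
    Qi = X @ WQi                                  # 11 x 64, not cached
    Ki = X @ WKi @ Wdown                          # 11 x 4, cached (16x smaller than 11 x 64)
    Vi = X @ WVi @ Wdown                          # 11 x 4, cached
    scores = (Qi @ Wdown) @ Ki.T / np.sqrt(d_mla) # 11 x 11; Qi down-projected for this step only
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                 # softmax -> attention weights
    heads.append(w @ Vi)                          # 11 x 4 per-head output

concat = np.concatenate(heads, axis=-1)           # 11 x 32: concatenate head outputs
output = concat @ Wup @ W0                        # 11 x 512: back to the hidden dimension
print(output.shape)
```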
From Attention Outputs to Text Generation
Tensor | Dimensions | Example | Purpose
Attention output for the input sequence (aka prefill); the corresponding K and V are stored in the KV cache | N x d_hidden | 11 x 512 | Updated token embeddings after considering the context of all other tokens
Retrieve the attention output for the last token in the input sequence | 1 x d_hidden | 1 x 512 | Only the last position is needed to predict the next token
W_output: linear layer (aka projection layer) | V x d_hidden | 100,000 x 512 | V: vocabulary size (here, 100,000)
Logits = attention output x W_output^T | 1 x V | 1 x 100,000 | Raw scores for all tokens in the vocabulary
softmax(Logits) | 1 x V | 1 x 100,000 | Turn token scores into token probabilities
Decode the token | 1 token | 1 | Greedy decoding: pick the token with the highest probability. Top-k sampling: pick a token from the k most likely tokens. Top-p (nucleus) sampling: pick a token from the smallest subset of tokens whose cumulative probability exceeds the threshold p
Use the new token as the next input | | |
Repeat until a stopping condition is met | | | End-of-sequence (EOS) token, or maximum number of output tokens
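A minimal sketch of these final steps, with a random projection layer standing in for the trained one, showing greedy, top-k, and top-p selection over the same logits:

```python
import numpy as np

rng = np.random.default_rng(0)
d_hidden, vocab_size = 512, 100_000

last_hidden = rng.normal(size=(1, d_hidden))        # attention output of the last input token
W_output    = rng.normal(size=(vocab_size, d_hidden))

logits = last_hidden @ W_output.T                   # 1 x 100,000 raw scores
probs  = np.exp(logits - logits.max())
probs /= probs.sum()                                # softmax -> token probabilities
probs  = probs[0]

# Greedy decoding: pick the highest-probability token
greedy_token = int(np.argmax(probs))

# Top-k sampling: sample among the k most likely tokens
k = 50
topk = np.argsort(probs)[-k:]
topk_token = int(rng.choice(topk, p=probs[topk] / probs[topk].sum()))

# Top-p (nucleus) sampling: smallest set whose cumulative probability exceeds p
p = 0.9
order = np.argsort(probs)[::-1]
cutoff = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
nucleus = order[:cutoff]
topp_token = int(rng.choice(nucleus, p=probs[nucleus] / probs[nucleus].sum()))

print(greedy_token, topk_token, topp_token)
```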
