Deep Dive: step-by-step inference for text generation
Companion video: https://youtu.be/cl3MAhAr9-M
Julien Simon
https://www.linkedin.com/in/juliensimon
https://www.youtube.com/juliensimonfr
The author of this material is Julien Simon https://www.linkedin.com/in/juliensimon unless explicitly mentioned.
This material is shared under the CC BY-NC 4.0 license https://creativecommons.org/licenses/by-nc/4.0/
You are free to share and adapt this material, provided that you give appropriate credit, provide a link to the license, and indicate if changes were made.
You may not use the material for commercial purposes. You may not apply any restriction on what the license permits.
Text generation: decoder-only inference
• GPT-like models are decoder-only models
• No encoder, no encoder-decoder attention
• Input processing (aka prefill): highly parallel
• Inputs (the tokenized prompt) are embedded and encoded
• Attention computes the keys and values (KV) required for attention outputs
• Large matrix multiplication, high usage of the hardware accelerator
• Output generation: sequential
• The answer is generated one token at a time
• Each generated token is appended to the previous input
• The process is repeated until a stopping criterion is met (max. length or EOS), as sketched in the code below
• Low usage of the hardware accelerator
"Attention is all you need"
https://arxiv.org/abs/1706.03762
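To make the prefill-then-decode loop concrete, here is a minimal sketch in Python. The next_token_logits function is a hypothetical stand-in for a full forward pass; a real decoder-only model would compute logits from the whole sequence (and reuse cached state, as discussed below).

```python
import numpy as np

VOCAB_SIZE = 100_000   # hypothetical vocabulary size
EOS_TOKEN = 0          # hypothetical end-of-sequence token id
MAX_NEW_TOKENS = 16

def next_token_logits(token_ids):
    # Stand-in for a real forward pass: a decoder-only model would embed
    # the tokens, run every Transformer layer, and return the logits
    # for the last position only.
    rng = np.random.default_rng(len(token_ids))
    return rng.normal(size=VOCAB_SIZE)

prompt = [101, 1996, 4248, 2829, 4419]   # tokenized prompt, processed in one parallel prefill pass
tokens = list(prompt)

for _ in range(MAX_NEW_TOKENS):          # sequential generation: one token per iteration
    logits = next_token_logits(tokens)
    new_token = int(np.argmax(logits))   # greedy decoding (alternatives are covered on the last slide)
    tokens.append(new_token)             # the new token becomes part of the next input
    if new_token == EOS_TOKEN:           # stopping criterion: EOS or max length
        break

print(tokens)
```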
Attention
• The self-attention mechanism is at the core of Transformer models
• "Attention is All You Need" https://arxiv.org/abs/1706.03762 (06/2017)
Computing Attention outputs
Tensor | Dimensions | Example | Purpose
Input sequence: "the quick brown fox jumps over the lazy dog" | | | The input sequence to run attention on: 9 words
Tokenized input sequence: [101, 1996, 4248, 2829, 4419, 14523, 2058, 1996, 13971, 3899, 102] | N | 11 | The same, in tokenized form, plus start and end tokens
X: input embeddings | N x d_hidden | 11 x 512 | The same, in embedded form; d_hidden is the embedding size
W_Q (query), W_K (key), W_V (value) and W_0 (output projection) weight matrices | d_hidden x d_hidden | 512 x 512 | Model weights learned during training
Q = X W_Q (queries), K = X W_K (keys), V = X W_V (values) | N x d_hidden | 11 x 512 | Q expresses what each token needs to know from others (« search query »), K encodes the information that each token provides (« keywords »), V stores the actual content that each token shares when attended to (« values »)
Q K^T: attention scores | N x N | 11 x 11 | How well each token matches the keys of the other tokens (« similarity scores »)
softmax(scores) x V | N x d_hidden | 11 x 512 | Tokens collect information from each other based on their relevance (« attention weights »)
Attention output: attention weights x W_0 | N x d_hidden | 11 x 512 | W_0 captures additional interactions across the sequence
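The table can be reproduced almost row by row with NumPy. This is a minimal sketch with random weights (a trained model would of course use learned values for W_Q, W_K, W_V and W_0); the shapes match the example column (N = 11, d_hidden = 512):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_hidden = 11, 512

X  = rng.normal(size=(N, d_hidden))              # input embeddings
WQ = rng.normal(size=(d_hidden, d_hidden))       # learned during training
WK = rng.normal(size=(d_hidden, d_hidden))
WV = rng.normal(size=(d_hidden, d_hidden))
W0 = rng.normal(size=(d_hidden, d_hidden))

Q, K, V = X @ WQ, X @ WK, X @ WV                 # N x d_hidden each

scores = Q @ K.T / np.sqrt(d_hidden)             # N x N similarity scores
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax -> attention weights

context = weights @ V                            # N x d_hidden: tokens gather information
output  = context @ W0                           # N x d_hidden: attention output
print(scores.shape, context.shape, output.shape)
```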
The KV cache
• Each round of text generation adds a new token to the input sequence
• Can we avoid recomputing KV values again and again for the previous tokens?
• We only really need to do it for the new token we just generated
• The KV cache stores the KV values for all previous tokens in the accelerator RAM
• Of course, this doesn't help for the first generated token, since the full prompt has to be processed first: that's why the first token takes longer to generate 😃
• Cache size (FP16, in bytes) = 2 (K and V) x 2 (bytes per value) x batch_size x seq_length x num_layers x hidden_size ➡ gigabytes (see the worked example below)
"Generative LLM inference" https://awsdocs-neuron.readthedocs-hosted.com
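As a worked example, plugging hypothetical Llama-7B-like dimensions (32 layers, hidden size 4096) into the formula gives an idea of the scale; the exact numbers depend on the model and serving configuration:

```python
# KV cache size in FP16: 2 (K and V) x 2 bytes per value x batch x seq_len x layers x hidden size
batch_size  = 1
seq_length  = 4096
num_layers  = 32        # hypothetical, Llama-7B-like
hidden_size = 4096      # hypothetical, Llama-7B-like

cache_bytes = 2 * 2 * batch_size * seq_length * num_layers * hidden_size
print(f"{cache_bytes / 1024**3:.1f} GiB")   # ~2 GiB for a single 4K-token sequence
```

The size grows linearly with batch size and sequence length, which is why the KV cache quickly becomes a dominant memory consumer when serving long contexts.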
Multi-head Attention (MHA)
• Attention is split across several parallel « heads »
• All heads see the full input sequence
• Each head only sees a subset of embedding dimensions:
dhidden / number of heads
• Each head has its own set of query, key, and value weight matrices
• All heads compute their attention scores in parallel
• Their outputs are then concatenated (see the sketch below)
• Main benefit: each head can explore and learn different patterns in the
training data
« Multi-head attention allows the model to jointly attend to information from
different representation subspaces at different positions. With a single
attention head, averaging inhibits this. »
MHA in BERT: https://github.com/huggingface/transformers/blob/main/src/transformers/models/bert/modeling_bert.py
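In practice, the per-head projections are usually computed as one large matrix multiplication and then reshaped into heads, which is essentially what the BERT implementation linked above does. A minimal sketch of the split, assuming 8 heads and d_hidden = 512:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_hidden, n_heads = 11, 512, 8
d_head = d_hidden // n_heads                 # 64 dimensions per head

X  = rng.normal(size=(N, d_hidden))          # input embeddings
WQ = rng.normal(size=(d_hidden, d_hidden))   # concatenation of the 8 per-head query matrices

Q = X @ WQ                                   # (11, 512): a single large matmul
Q_heads = Q.reshape(N, n_heads, d_head).transpose(1, 0, 2)
print(Q_heads.shape)                         # (8, 11, 64): one 11 x 64 query matrix per head
```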
Computing Multi-head Attention outputs
Tensor | Dimensions | Example | Purpose
Input sequence: "the quick brown fox jumps over the lazy dog" | | | The input sequence to run attention on
Tokenized input sequence | N | 11 | The same, in tokenized form
X: input embeddings | N x d_hidden | 11 x 512 | The same, in embedded form
W_Qi (query), W_Ki (key), W_Vi (value) weight matrices | d_hidden x d_mha | 512 x 64 | d_mha = d_hidden / number of attention heads (here, 8) = 64. Each head has its own weight matrices, which learn different features
W_0: output projection weight matrix | d_hidden x d_hidden | 512 x 512 | Shared across heads, learned during training
Qi = X W_Qi (queries), Ki = X W_Ki (keys), Vi = X W_Vi (values) | N x d_mha | 11 x 64 | All heads run this in parallel. Ki and Vi are stored in the KV cache
Qi Ki^T: attention scores | N x N | 11 x 11 | All heads run this in parallel
softmax(scores) x Vi | N x d_mha | 11 x 64 | All heads run this in parallel
Concatenate head outputs | N x d_hidden | 11 x 512 | 8 heads x 64 dimensions = 512
Attention output: attention weights x W_0 | N x d_hidden | 11 x 512 | W_0 captures additional interactions across the sequence
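Following the table row by row, here is a per-head NumPy sketch with random weights (N = 11, d_hidden = 512, 8 heads of 64 dimensions). The explicit Python loop is for readability only; a real implementation batches all heads into a single matmul:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_hidden, n_heads = 11, 512, 8
d_mha = d_hidden // n_heads          # 64 dimensions per head

X  = rng.normal(size=(N, d_hidden))
W0 = rng.normal(size=(d_hidden, d_hidden))

heads = []
for i in range(n_heads):             # the heads are independent: they run in parallel
    WQi = rng.normal(size=(d_hidden, d_mha))
    WKi = rng.normal(size=(d_hidden, d_mha))
    WVi = rng.normal(size=(d_hidden, d_mha))
    Qi, Ki, Vi = X @ WQi, X @ WKi, X @ WVi        # 11 x 64 each; Ki and Vi go to the KV cache
    scores = Qi @ Ki.T / np.sqrt(d_mha)           # 11 x 11 attention scores
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                 # softmax -> attention weights
    heads.append(w @ Vi)                          # 11 x 64 per-head output

concat = np.concatenate(heads, axis=-1)           # 11 x 512: concatenate head outputs
output = concat @ W0                              # 11 x 512: attention output
print(output.shape)
```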
The Memory Bottleneck in Multi-head Attention (MHA)
• The KV cache (the keys and values computed for all previous tokens) is stored in High Bandwidth Memory (HBM), which is off-chip
• Quadratic complexity for HBM accesses with respect to sequence length
• Lots of techniques to reduce the amount of KV data transferred between the
GPU and the HBM. See https://youtu.be/2TT384U4vQg
• One of the latest is a new attention variant: Multi-head Latent Attention (MLA)
Multi-head Latent Attention (MLA)
DeepSeek v2 https://arxiv.org/abs/2405.04434 (05/2024)
DeepSeek v3 https://arxiv.org/abs/2412.19437 (12/2024)
• Implemented in DeepSeek v2 and v3
• K and V matrices are not cached
• A low-rank latent representation, produced by a down-projection learned during training, is cached instead
• Much lower KV cache usage
(90%+ savings)
• 5-6x inference speedup
• Accuracy higher than MHA
• See appendix D.1 in v2 paper
Multi-head Latent Attention
Tensor | Dimensions | Example | Purpose
X: input embeddings | N x d_hidden | 11 x 512 | The input sequence, in embedded form
W_Qi (query), W_Ki (key), W_Vi (value) weight matrices | d_hidden x d_mha | 512 x 64 | d_mha = d_hidden / number of attention heads (here, 8) = 64. Each head has its own weight matrices, which learn different features
W_0: output projection weight matrix | d_hidden x d_hidden | 512 x 512 | Shared across heads, learned during training
W_i_down: down-projection matrix; W_up: up-projection matrix | d_mha x d_mha_latent; d_latent x d_hidden | 64 x 4; 32 x 512 | d_latent should be much smaller than d_hidden (here, 32). d_mha_latent = d_latent / number of attention heads (here, 8) = 4
Qi = X W_Qi (queries); Ki = X W_Ki W_i_down (keys); Vi = X W_Vi W_i_down (values) | N x d_mha; N x d_mha_latent; N x d_mha_latent | 11 x 64; 11 x 4; 11 x 4 | All heads run this in parallel. Ki and Vi are stored in the KV cache and are d_hidden / d_latent times smaller (here, 512 / 32 = 16). The trade-off is an extra matmul to compute Ki and Vi. DeepSeek uses different down-projection matrices for K and V
(Qi W_i_down) Ki^T: attention scores | N x N | 11 x 11 | All heads run this in parallel. Qi is down-projected for this calculation only (possibly with its own matrix)
softmax(scores) x Vi | N x d_mha_latent | 11 x 4 | All heads run this in parallel
Concatenate head outputs | N x d_latent | 11 x 32 | 8 heads x 4 latent dimensions = 32
Outputs x W_up | N x d_hidden | 11 x 512 | Bring the output back to the initial dimension
Attention weights x W_0 | N x d_hidden | 11 x 512 | Capture additional interactions across the sequence
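The low-rank trade-off can be sketched in NumPy with random weights and the table's dimensions (d_hidden = 512, d_mha = 64, d_latent = 32, d_mha_latent = 4). This only illustrates the caching idea: DeepSeek's actual formulation differs in several details (separate down-projections for keys and values, decoupled RoPE keys, etc.):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_hidden, n_heads = 11, 512, 8
d_mha = d_hidden // n_heads          # 64 dimensions per head
d_latent = 32
d_mla = d_latent // n_heads          # 4 latent dimensions per head

X   = rng.normal(size=(N, d_hidden))
Wup = rng.normal(size=(d_latent, d_hidden))       # shared up-projection
W0  = rng.normal(size=(d_hidden, d_hidden))

heads = []
for i in range(n_heads):                          # all heads run in parallel
    WQi   = rng.normal(size=(d_hidden, d_mha))
    WKi   = rng.normal(size=(d_hidden, d_mha))
    WVi   = rng.normal(size=(d_hidden, d_mha))
    Wdown = rng.normal(size=(d_mha, d_mla))       # low-rank down-projection
    Qi = X @ WQi                                  # 11 x 64, not cached
    Ki = X @ WKi @ Wdown                          # 11 x 4, cached (16x smaller than 11 x 64)
    Vi = X @ WVi @ Wdown                          # 11 x 4, cached
    scores = (Qi @ Wdown) @ Ki.T / np.sqrt(d_mla) # 11 x 11; Qi down-projected for this step only
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                 # softmax -> attention weights
    heads.append(w @ Vi)                          # 11 x 4 per-head output

concat = np.concatenate(heads, axis=-1)           # 11 x 32: concatenate head outputs
output = concat @ Wup @ W0                        # 11 x 512: back to the hidden dimension
print(output.shape)
```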
From Attention Outputs to Text Generation
Tensor | Dimensions | Example | Purpose
Attention output for the input sequence (aka prefill); the corresponding K and V are stored in the KV cache | N x d_hidden | 11 x 512 | Updated token embeddings after considering the context of all other tokens
Retrieve the attention output for the last token in the input sequence | 1 x d_hidden | 1 x 512 | Only the last position is needed to predict the next token
W_output: linear layer (aka projection layer) | V x d_hidden | 100,000 x 512 | V: vocabulary size (here, 100,000)
Logits = attention output x W_output^T | 1 x V | 1 x 100,000 | Raw scores for all tokens in the vocabulary
softmax(Logits) | 1 x V | 1 x 100,000 | Turn token scores into token probabilities
Decode the token | 1 token | 1 | Greedy decoding: pick the token with the highest probability. Top-k sampling: pick a token from the k most likely tokens. Top-p (nucleus) sampling: pick a token from the smallest subset of tokens whose cumulative probability exceeds the threshold p
Use the new token as the next input | | |
Repeat until a stopping condition is met | | | End-of-sequence (EOS) token, or maximum number of output tokens
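A minimal sketch of these final steps, with a random projection layer standing in for the trained one, showing greedy, top-k, and top-p selection over the same logits:

```python
import numpy as np

rng = np.random.default_rng(0)
d_hidden, vocab_size = 512, 100_000

last_hidden = rng.normal(size=(1, d_hidden))        # attention output of the last input token
W_output    = rng.normal(size=(vocab_size, d_hidden))

logits = last_hidden @ W_output.T                   # 1 x 100,000 raw scores
probs  = np.exp(logits - logits.max())
probs /= probs.sum()                                # softmax -> token probabilities
probs  = probs[0]

# Greedy decoding: pick the highest-probability token
greedy_token = int(np.argmax(probs))

# Top-k sampling: sample among the k most likely tokens
k = 50
topk = np.argsort(probs)[-k:]
topk_token = int(rng.choice(topk, p=probs[topk] / probs[topk].sum()))

# Top-p (nucleus) sampling: smallest set whose cumulative probability exceeds p
p = 0.9
order = np.argsort(probs)[::-1]
cutoff = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
nucleus = order[:cutoff]
topp_token = int(rng.choice(nucleus, p=probs[nucleus] / probs[nucleus].sum()))

print(greedy_token, topk_token, topp_token)
```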
