Reducing Prefill for LLM Serving in RAG (Using Knowledge Sharing)
Junchen Jiang
The Internet age is punctuated with innovations

Each era's applications were enabled by key system innovations:
- Web apps → CDN
- Video → MPEG-DASH, 5G
- Big data → Cloud, MapReduce
- Search engines → Data centers
- Gen AI → ????

OpenAI alone already has 1.8 billion monthly visitors (YouTube has 2.7 billion).
What will be the key system innovations for generative AI?
Do we know how to build the Gen AI system yet?

Internet video:
- 1990s: basics (how to build websites & players)
- 2000s-2020s: building a global distributed system: P2P or CDN, video transcoding, scale-out streaming, streaming quality monitoring, DNS redirection, video caching, … (these took us 20 years)

Gen AI (LLMs):
- 2022-2024: basics (how to build AI apps and servers)
- 2024-????: building a global distributed system: ??? (we are here)

We are still at the very early stage of LLM infrastructure.

This talk: sharing knowledge across LLMs
LLMs are more powerful when paired with "knowledge"

LLMs need to read a large amount of data in real time: a (looooooooooooog) context (news, business docs, chat/shopping history, books) plus a (short) user query, from which the LLM generates the output text.

The prefill delay will only grow (longer contexts, bigger models), while users get less patient.
Yet, it takes time to "learn" (prefill) the context

Today, every LLM instance answering queries about the same book must prefill the context separately.

You Only Learn Once: once one LLM learns something, producing LLM-learned knowledge (the KV cache), other LLMs will immediately know it.
Vision: Knowledge Sharing

With knowledge sharing, one LLM prefills the book once; its LLM-learned knowledge (the KV cache) is shared, and other LLMs answer queries about the book without prefilling.
Feel the speedup!

Demo: Mistral 7B on an A40 GPU, with a 13K-token context.
- Query 1 (w/o KV cache): 6.5 sec
- Query 2 (with efficient KV cache sharing, explained shortly): 0.9 sec (7x faster)
Vision: Knowledge Sharing

Why will the same knowledge (KV cache) be reused? Because 20% of your knowledge is used 80% of the time (the 20-80 rule).

- Faster (shorter time-to-first-token): for a 5,000-token document (context) plus a 100-token question, reusing the document's KV cache makes the time-to-first-token at least 50x faster.
- Higher throughput: without prefill, generation (decoding) is easier to batch. On an A100 GPU, vLLM running Llama2-7B can process 5x requests per second.*
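The "at least 50x" figure follows from prefill cost being at least linear in prompt length: with the document's KV cache, only the question's tokens remain to be prefilled. A quick sanity check (illustrative arithmetic, not a measurement from the talk):

```python
def ttft_speedup_lower_bound(context_tokens: int, query_tokens: int) -> float:
    """Prefill cost grows at least linearly with prompt length, so caching
    the context's KV cache leaves only the query tokens to prefill."""
    full_prefill = context_tokens + query_tokens   # tokens prefilled without cache
    cached_prefill = query_tokens                  # tokens prefilled with cache
    return full_prefill / cached_prefill

print(ttft_speedup_lower_bound(5000, 100))  # 51.0
```

Since attention cost is superlinear in sequence length, the real speedup is at least this ratio.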
Will it be too expensive to store the KV cache? The KV cache is bigger than the text, but storing it on SSD is 4x cheaper than re-computing it on GPUs. And with longer contexts (or bigger models), KV cache size grows more slowly than prefill delay.
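For a rough sense of the storage cost, KV cache size follows directly from the model shape. A minimal sketch assuming a Llama2-7B-like configuration (32 layers, 32 heads, head dimension 128, fp16); the configuration numbers are illustrative assumptions, not from the talk:

```python
def kv_cache_bytes(num_tokens: int, num_layers: int = 32, num_heads: int = 32,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """Size of the KV cache: one K and one V vector per layer, head, and token."""
    return 2 * num_layers * num_heads * head_dim * bytes_per_elem * num_tokens

# A 13K-token context (as in the demo above) costs several GB in fp16,
# orders of magnitude larger than the raw text -- which is why efficient
# storage and loading matter.
print(f"{kv_cache_bytes(13_000) / 1e9:.1f} GB")  # ~6.8 GB
```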
Architecting Efficient Knowledge Sharing

A Knowledge-Sharing System sits between the LLM instances and provides three functions:
- Knowledge synthesis
- Knowledge caching (a perfect fit for storage solutions, like Alluxio)
- Knowledge retrieval
Challenge: the KV cache is 100,000x bigger than the text, so simply loading it remotely is too slow.
Key technique #1: fast KV retrieval via KV encoding (speeds up KV loading by 3-10x).
Challenge: if a text is not at the prefix, its KV cache cannot be reused.
Key technique #2: flexible join of multiple KV caches.
Architecting Efficient Knowledge Sharing: two key techniques
- Key technique #1: fast KV retrieval via KV encoding (speeds up KV loading by 3-10x)
- Key technique #2: flexible join of multiple KV caches
CacheGen: KV Cache Compression and Streaming for Fast Language Model Serving
Yuhan Liu, Hanchen Li, Kuntai Du, Jiayi Yao, Yihua Cheng, Yuyang Huang, Shan Lu, Michael Maire, Henry Hoffmann, Ari Holtzman, Ganesh Ananthanarayanan, Junchen Jiang
ACM SIGCOMM 2024
CacheGen: Compressing KV cache for fast prefill

Without reuse: the LLM prefills the loooooo…oooooong context plus the query, then generates the output text.
With CacheGen: the LLM loads the compressed KV cache and decompresses it, prefills only the query, then generates the output.

Because the compressed KV cache is small, prefill is faster even if the reused KV cache is loaded remotely.
10,000 ft view of CacheGen

Encoding: the KV cache (K tensor and V tensor) is encoded into compact binary representations and stored.
Decoding: the binary representations are decoded back into a (decompressed) KV cache before inference.

In short, CacheGen encodes the KV cache into a compact binary representation.
Several emerging approaches to KV compression:
- Quantizing the KV cache directly
- Dropping less important tokens from the text
- Dropping less important tokens from the KV cache

They all keep the KV tensors' shape → complementary to CacheGen. CacheGen can improve them too!
Can KV cache be encoded efficiently? An analogy with video compression:
- Video codecs encode a video in a small size with small degradation in video quality (size of encoded video vs. video quality).
- Likewise, CacheGen encodes a KV cache in a small size with small degradation in the quality of the generated text (size of encoded KV cache vs. text quality).

We could borrow the 20-year research literature of video compression.
Why can fewer bits represent KV cache?

Key distributional properties of the KV cache:
1. KV cache values are similar between neighboring tokens.
2. Some parts of a KV cache are less sensitive to quantization.
3. A quantized KV cache can be entropy-encoded with fewer bits.
Opportunity 1: Locality of KV cache values

Slicing the K tensor (layers × tokens × channels) at layer j gives a tokens-by-channels matrix. Opportunity: the KV values at nearby tokens have similar values.
Delta values have much smaller variance

For any token i, instead of encoding the original values |K_i| and |V_i|, encode the deltas between neighboring tokens: |K_i - K_{i-1}| and |V_i - V_{i-1}|.

The CDF of absolute values shows that the deltas concentrate much closer to zero than the original values. Delta values have much smaller variance → easier to quantize.
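The delta trick can be sketched in a few lines, here on a toy 1-D channel rather than a real K/V tensor:

```python
def delta_encode(values):
    """Keep the first value; encode each later token as its difference
    from the previous token's value."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

# Neighboring tokens have similar values, so the deltas concentrate
# near zero (the CDF gap shown on this slide).
channel = [1.00, 1.02, 0.98, 1.01, 1.05, 1.03]
deltas = delta_encode(channel)

orig_mag = sum(abs(v) for v in channel) / len(channel)
delta_mag = sum(abs(d) for d in deltas[1:]) / (len(deltas) - 1)
assert delta_mag < orig_mag            # much smaller magnitudes
assert all(abs(a - b) < 1e-9           # and the encoding itself is lossless
           for a, b in zip(delta_decode(deltas), channel))
```

Delta encoding alone loses nothing; the savings come when the small deltas are quantized and entropy-coded in the later stages.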
Opportunity 2: Heterogeneous sensitivity to quantization

The output quality (accuracy) of the LLM is more sensitive to losses in the KV cache values of the shallower layers than to those in the deeper layers.
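One way to exploit this sensitivity gap is to spend more quantization bits on shallow layers and fewer on deep ones. The three-tier schedule below is a hypothetical illustration, not CacheGen's actual configuration:

```python
def bits_for_layer(layer_idx: int, num_layers: int = 32,
                   hi_bits: int = 8, lo_bits: int = 4) -> int:
    """Shallower layers are more sensitive to quantization loss, so give
    them more bits; deeper layers tolerate coarser quantization."""
    third = num_layers // 3
    if layer_idx < third:                 # shallow third: fine-grained
        return hi_bits
    if layer_idx < 2 * third:             # middle third: medium
        return (hi_bits + lo_bits) // 2
    return lo_bits                        # deep third: coarse

print([bits_for_layer(l) for l in (0, 15, 31)])  # [8, 6, 4]
```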
Opportunity 3: Arithmetic coding

Encoding pipeline: KV cache → compute delta → quantize → adaptive arithmetic coding → a more compact binary representation (e.g., 001101001110010…) that can be stored on disk or sent via network.
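The three stages can be chained end to end. A minimal sketch for a single 1-D channel, using zlib as a stand-in for the adaptive arithmetic coder (the real system uses arithmetic coding and operates on whole K/V tensors):

```python
import struct
import zlib

def encode_kv_channel(values, scale=100):
    """delta -> quantize -> entropy-code, for one channel of K or V."""
    deltas = [values[0]] + [b - a for a, b in zip(values, values[1:])]
    quantized = [round(d * scale) for d in deltas]        # small integers
    raw = struct.pack(f"{len(quantized)}h", *quantized)   # int16 payload
    return zlib.compress(raw)                             # entropy stage (stand-in)

def decode_kv_channel(blob, n, scale=100):
    quantized = struct.unpack(f"{n}h", zlib.decompress(blob))
    out = [quantized[0] / scale]
    for q in quantized[1:]:
        out.append(out[-1] + q / scale)
    return out

# A slowly drifting channel compresses far below its fp32 size (4 bytes/value),
# because the quantized deltas are tiny and highly repetitive.
channel = [1.0 + 0.001 * i for i in range(1000)]
blob = encode_kv_channel(channel)
assert len(blob) < 4 * len(channel)
```

Note that the quantization step makes the codec lossy, which is exactly why the layer-sensitivity analysis above matters: coarser quantization is applied only where quality can afford it.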
Reducing decoding overhead?

On the decode path, the binary representations are loaded and then decoded (GPU-based decoding) back into the K and V tensors of the KV cache. Decoding and loading can be pipelined.
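The pipelining idea can be sketched with a producer thread fetching chunks while the consumer decodes; a simplified illustration (in the real system decoding runs on the GPU, and the `fetch`/`decode` callables here are stand-ins):

```python
import queue
import threading

def pipelined_load_decode(chunk_ids, fetch, decode):
    """Overlap loading (producer thread) with decoding (consumer):
    while chunk i is being decoded, chunk i+1 is already in flight."""
    q = queue.Queue(maxsize=4)          # bounded: don't race far ahead of the decoder

    def loader():
        for cid in chunk_ids:
            q.put(fetch(cid))           # e.g., read compressed bytes from network
        q.put(None)                     # sentinel: all chunks loaded

    threading.Thread(target=loader, daemon=True).start()
    decoded = []
    while (chunk := q.get()) is not None:
        decoded.append(decode(chunk))   # e.g., GPU-based decompression
    return decoded

# Toy run with stand-in fetch/decode functions:
out = pipelined_load_decode(range(3), fetch=lambda i: i, decode=lambda c: c * 2)
print(out)  # [0, 2, 4]
```

With the two stages overlapped, end-to-end latency approaches max(load time, decode time) instead of their sum.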
Evaluation setup

- Bandwidth: 3 Gbps (cloud server bandwidth)
- Models: Llama-70B, Llama-34B, Mistral-7B
- Datasets (with various context length distributions): LongChat, TriviaQA, NarrativeQA, WikiText
- Quality metrics: accuracy, F1 score, perplexity
Quality vs. Size & TTFT (time to first token)

Setup: LongChat dataset (200 contexts, ~9.6K tokens each); Llama-70B; 3 Gbps link to load the KV cache (1.6 GB at 8-bit).

Compared with uniform quantization and full prefill (no caching), CacheGen reaches the same accuracy with a 3x smaller KV cache size → 3-6x lower time to first token (TTFT).
Impact of context length

Setup: Llama-70B, context lengths from 3K to 20K tokens.

CacheGen's KV cache size stays well below uniform 8-bit quantization at every context length: the size reduction holds under various context lengths.
Breakdown of Time to First Token (TTFT)

Setup: LongChat (200 contexts, ~9.6K tokens each); Llama-70B; 3 Gbps link to load the KV cache.

- Full recompute: 6.1 sec prefill on input + prefill on query
- Naïve KV cache loading (w/o KV encoding): 7.2 sec load + 0.12 sec prefill on query
- CacheGen (KV encoding): 1.8 sec load + decompress + 0.12 sec prefill on query
Towards Efficient Knowledge Storing & Sharing

- Key technique #1: fast KV cache loading via KV codec (speeds up KV loading by 3-10x)
- Key technique #2: flexible join of multiple KV caches (happy to chat about technique #2 after the talk)
Try it yourself!

Code repo: https://github.com/uchi-jcl/cachegen
Research paper: https://arxiv.org/pdf/2310.07240.pdf
Efficient Knowledge Sharing System

On the trade-off between delay (time to first token) and cost (storage, compute, communication), GPU prefill sits at one extreme, and storing the KV cache in CPU memory, SSD, or S3 trades delay for cost; efficient knowledge sharing aims to be better on both axes.

Contact me if you are a potential user of, or contributor to, our Knowledge-Sharing System!
junchenj@uchicago.edu

Assure Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 

Recently uploaded (20)

All you need to know about Spring Boot and GraalVM
All you need to know about Spring Boot and GraalVMAll you need to know about Spring Boot and GraalVM
All you need to know about Spring Boot and GraalVM
 
如何办理(hull学位证书)英国赫尔大学毕业证硕士文凭原版一模一样
如何办理(hull学位证书)英国赫尔大学毕业证硕士文凭原版一模一样如何办理(hull学位证书)英国赫尔大学毕业证硕士文凭原版一模一样
如何办理(hull学位证书)英国赫尔大学毕业证硕士文凭原版一模一样
 
DECODING JAVA THREAD DUMPS: MASTER THE ART OF ANALYSIS
DECODING JAVA THREAD DUMPS: MASTER THE ART OF ANALYSISDECODING JAVA THREAD DUMPS: MASTER THE ART OF ANALYSIS
DECODING JAVA THREAD DUMPS: MASTER THE ART OF ANALYSIS
 
Transforming Product Development using OnePlan To Boost Efficiency and Innova...
Transforming Product Development using OnePlan To Boost Efficiency and Innova...Transforming Product Development using OnePlan To Boost Efficiency and Innova...
Transforming Product Development using OnePlan To Boost Efficiency and Innova...
 
Superpower Your Apache Kafka Applications Development with Complementary Open...
Superpower Your Apache Kafka Applications Development with Complementary Open...Superpower Your Apache Kafka Applications Development with Complementary Open...
Superpower Your Apache Kafka Applications Development with Complementary Open...
 
Orca: Nocode Graphical Editor for Container Orchestration
Orca: Nocode Graphical Editor for Container OrchestrationOrca: Nocode Graphical Editor for Container Orchestration
Orca: Nocode Graphical Editor for Container Orchestration
 
一比一原版(sdsu毕业证书)圣地亚哥州立大学毕业证如何办理
一比一原版(sdsu毕业证书)圣地亚哥州立大学毕业证如何办理一比一原版(sdsu毕业证书)圣地亚哥州立大学毕业证如何办理
一比一原版(sdsu毕业证书)圣地亚哥州立大学毕业证如何办理
 
Going AOT: Everything you need to know about GraalVM for Java applications
Going AOT: Everything you need to know about GraalVM for Java applicationsGoing AOT: Everything you need to know about GraalVM for Java applications
Going AOT: Everything you need to know about GraalVM for Java applications
 
Migration From CH 1.0 to CH 2.0 and Mule 4.6 & Java 17 Upgrade.pptx
Migration From CH 1.0 to CH 2.0 and  Mule 4.6 & Java 17 Upgrade.pptxMigration From CH 1.0 to CH 2.0 and  Mule 4.6 & Java 17 Upgrade.pptx
Migration From CH 1.0 to CH 2.0 and Mule 4.6 & Java 17 Upgrade.pptx
 
Ensuring Efficiency and Speed with Practical Solutions for Clinical Operations
Ensuring Efficiency and Speed with Practical Solutions for Clinical OperationsEnsuring Efficiency and Speed with Practical Solutions for Clinical Operations
Ensuring Efficiency and Speed with Practical Solutions for Clinical Operations
 
bgiolcb
bgiolcbbgiolcb
bgiolcb
 
Building API data products on top of your real-time data infrastructure
Building API data products on top of your real-time data infrastructureBuilding API data products on top of your real-time data infrastructure
Building API data products on top of your real-time data infrastructure
 
A Comprehensive Guide on Implementing Real-World Mobile Testing Strategies fo...
A Comprehensive Guide on Implementing Real-World Mobile Testing Strategies fo...A Comprehensive Guide on Implementing Real-World Mobile Testing Strategies fo...
A Comprehensive Guide on Implementing Real-World Mobile Testing Strategies fo...
 
Microsoft-Power-Platform-Adoption-Planning.pptx
Microsoft-Power-Platform-Adoption-Planning.pptxMicrosoft-Power-Platform-Adoption-Planning.pptx
Microsoft-Power-Platform-Adoption-Planning.pptx
 
Secure-by-Design Using Hardware and Software Protection for FDA Compliance
Secure-by-Design Using Hardware and Software Protection for FDA ComplianceSecure-by-Design Using Hardware and Software Protection for FDA Compliance
Secure-by-Design Using Hardware and Software Protection for FDA Compliance
 
What is Continuous Testing in DevOps - A Definitive Guide.pdf
What is Continuous Testing in DevOps - A Definitive Guide.pdfWhat is Continuous Testing in DevOps - A Definitive Guide.pdf
What is Continuous Testing in DevOps - A Definitive Guide.pdf
 
Upturn India Technologies - Web development company in Nashik
Upturn India Technologies - Web development company in NashikUpturn India Technologies - Web development company in Nashik
Upturn India Technologies - Web development company in Nashik
 
WMF 2024 - Unlocking the Future of Data Powering Next-Gen AI with Vector Data...
WMF 2024 - Unlocking the Future of Data Powering Next-Gen AI with Vector Data...WMF 2024 - Unlocking the Future of Data Powering Next-Gen AI with Vector Data...
WMF 2024 - Unlocking the Future of Data Powering Next-Gen AI with Vector Data...
 
Streamlining End-to-End Testing Automation
Streamlining End-to-End Testing AutomationStreamlining End-to-End Testing Automation
Streamlining End-to-End Testing Automation
 
Assure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyesAssure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyes
 

AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG

  • 1. Reducing Prefill for LLM Serving in RAG (Using Knowledge Sharing) Junchen Jiang 1
  • 2. The Internet age is punctuated with innovations 2 Web apps CDN Video MPEG- DASH, 5G Big Data Cloud, MapReduce Search engine Data Centers Gen AI ???? OpenAI alone already has 1.8 billion monthly visitors (YouTube has 2.7 billion monthly visitors) What will be the key system innovations for generative AI? time
  • 3. Do we know how to build the Gen AI system yet? 3 Basics How to build websites & players 1990s 2000s – 2020s Building a global distributed system? P2P or CDN, video transcoding, scale out streaming, streaming quality monitoring, DNS redirection, video caching, … Basics How to build AI apps and servers 2022-2024 2024 - ?????? Building a global distributed system ??? We are still at the very early stage of LLM infrastructure These took us 20 years This talk: Sharing knowledge across LLMs Internet video Gen AI (LLMs) We are here
  • 4. LLMs are more powerful when paired with "knowledge" LLMs need to read a large amount of data in real-time (looooooooooooog) contexts output text LLM (short) query News Business docs Chat/ shopping history Book User
  • 5. The prefill delay will only grow (longer contexts, bigger models), while users get less patient. Yet, it takes time to "learn" (prefill) the context LLM LLM LLM Queries about a book Prefilling Prefilling Prefilling LLM-Learned knowledge (KV cache)
  • 6. Knowledge Sharing You Only Learn Once: Once one LLM learns something, other LLMs will immediately know Vision: Knowledge Sharing LLM LLM LLM Queries about a book Prefilling LLM-Learned knowledge (KV cache)
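The "You Only Learn Once" idea above can be sketched as a shared store keyed by the context's hash: only the first request pays the prefill cost, and later requests (from any LLM instance) reuse the stored KV cache. The names below (`KVCacheStore`, `get_or_prefill`) are illustrative, not an actual API; the prefill stand-in returns a placeholder instead of real per-layer K/V tensors.

```python
import hashlib

class KVCacheStore:
    """Minimal sketch of a shared KV-cache store (illustrative only)."""

    def __init__(self):
        self._store = {}
        self.prefill_count = 0  # how many times prefill actually ran

    def _key(self, context: str) -> str:
        return hashlib.sha256(context.encode()).hexdigest()

    def _prefill(self, context: str):
        # Stand-in for the LLM's prefill pass; a real system would
        # return per-layer K/V tensors here.
        self.prefill_count += 1
        return f"kv-cache-for-{len(context)}-chars"

    def get_or_prefill(self, context: str):
        key = self._key(context)
        if key not in self._store:      # only the first request prefills
            self._store[key] = self._prefill(context)
        return self._store[key]         # later requests reuse the KV cache

store = KVCacheStore()
book = "a long book context " * 100
kv1 = store.get_or_prefill(book)  # first LLM instance: pays prefill
kv2 = store.get_or_prefill(book)  # second LLM instance: cache hit
```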
  • 7. Feel the speedup! Context text (13K tokens) 6.5sec Query 2 0.9sec (7x faster) Mistral 7B on A40 Mistral 7B on A40 Query 1 KV Cache Sharing KV cache w/o KV cache With efficient KV cache sharing (explained shortly)
  • 8. Vision: Knowledge Sharing Why will the same knowledge (KV cache) be reused? 20% of your knowledge is used 80% of the time. (20-80 rule) Faster (shorter time-to-first-token) Ex. 5,000-token document (context) + 100-token question With the document's KV cache, time-to-first-token is at least 50x faster Higher throughput Without prefill, generation (decoding) would be easier to batch On an A100 GPU, vLLM running Llama2-7B can process 5x as many requests per second* Will it be too expensive to store KV cache? KV cache is bigger than text, but storing it on SSD is 4x cheaper than re-computing it on GPUs. With longer contexts (or bigger models), KV cache size grows more slowly than prefill delay.
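The "at least 50x" claim on this slide follows from a back-of-envelope calculation, assuming prefill time scales at least linearly with the number of tokens processed:

```python
# With the document's KV cache, only the question needs prefilling.
doc_tokens, query_tokens = 5000, 100

tokens_without_cache = doc_tokens + query_tokens  # prefill everything
tokens_with_cache = query_tokens                  # only the question

speedup = tokens_without_cache / tokens_with_cache
print(speedup)  # 51.0, consistent with "at least 50x"
```

In practice attention cost is superlinear in sequence length, so the real speedup can exceed this linear estimate.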
  • 9. LLM LLM LLM Architecting Efficient Knowledge Sharing Knowledge synthesis Knowledge caching Knowledge retrieval Knowledge-Sharing System
  • 10. LLM LLM LLM Architecting Efficient Knowledge Sharing Knowledge synthesis Knowledge caching Knowledge retrieval Knowledge-Sharing System Perfect fit for storage solutions, like Alluxio
  • 11. Knowledge synthesis Knowledge caching Knowledge retrieval Knowledge-Sharing System LLM LLM LLM Architecting Efficient Knowledge Sharing Challenge: KV cache is 100,000x bigger than text. Simply loading them remotely is too slow Key technique #1: Fast KV retrieval via KV encoding (Speed up KV loading by 3-10x)
  • 12. Knowledge synthesis Knowledge caching Knowledge retrieval Knowledge-Sharing System LLM LLM LLM Architecting Efficient Knowledge Sharing Challenge: If a text is not at the prefix, its KV cache cannot be reused Key technique #2: Flexible join of multiple KV caches
  • 13. Knowledge- Sharing System LLM LLM LLM Architecting Efficient Knowledge Sharing Key technique #1: Fast KV retrieval via KV encoding (Speed up KV loading by 3-10x) Key technique #2: Flexible join of multiple KV caches
  • 14. CacheGen: KV Cache Compression and Streaming for Fast Language Model Serving 14 Yuhan Liu, Hanchen Li, Kuntai Du, Jiayi Yao, Yihua Cheng, Yuyang Huang, Shan Lu, Michael Maire, Henry Hoffmann, Ari Holtzman, Ganesh Ananthanarayanan, Junchen Jiang ACM SIGCOMM 2024
  • 15. CacheGen: Compressing KV cache for fast prefill 15 loooooo…oooooong context + query output text LLM Prefill on query time Generate output Loading KV cache Prefill on query time Generate output Loading compressed KV cache & decompress Compressed KV cache KV cache Faster prefill even if the reused KV cache is loaded remotely
  • 16. 10,000 ft view of CacheGen 16 [Diagram: KV cache (K and V tensors) → Encoding → compact binary representations → Storage → binary representations → Decoding → decompressed KV cache (K and V tensors)] CacheGen: Encode KV cache to compact binary representation
  • 17. Several emerging approaches to KV compression 17 Quantizing KV cache directly? Dropping less important tokens from the text? Dropping less important tokens from the KV cache? They all keep the KV's tensor shape → complementary to CacheGen. CacheGen can improve them too! CacheGen: Encode KV cache to compact binary representation
  • 18. Can KV cache be encoded efficiently? Analogy with video compression: encode a video in a small size with small degradation in video quality. [Plots: size of encoded video vs. video quality; size of encoded KV cache vs. quality of generated text; smaller size at higher quality is better] We could borrow the 20-year research literature of video compression
  • 19. Why can fewer bits represent KV cache? 19 KV cache is similar between neighboring tokens Some parts of a KV cache are less sensitive to quantization Quantized KV cache can be entropy-encoded with fewer bits Key distributional properties of KV cache
  • 20. Opportunity 1: Locality of KV cache values 20 [Diagram: K tensor at layer j, with axes # of layers, # of tokens, # of channels] Opportunity: KV values at nearby tokens are similar
  • 21. Delta values have much smaller variance 21 For any token i: Original |K_i|, |V_i|; Delta |K_i − K_{i−1}|, |V_i − V_{i−1}|. Encode the delta between neighboring tokens, rather than the tokens themselves. Delta values have much smaller variance → easier to quantize. [Plot: CDF of absolute values; deltas concentrate near zero compared to the original values]
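The delta observation can be checked numerically. The sketch below uses a smooth random walk as a stand-in for one channel of a real K tensor (an assumption; real KV values are model outputs, but the slide's point is that neighboring tokens are similar):

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens = 4096
# Simulated K-tensor channel: slowly drifting values plus small noise,
# mimicking the token-to-token locality the slide describes.
k_channel = (np.cumsum(rng.normal(0, 0.01, num_tokens))
             + rng.normal(0, 0.001, num_tokens))

# Encode differences between neighboring tokens instead of raw values.
deltas = np.diff(k_channel)

# Deltas are far more concentrated, so they quantize with fewer bits.
print(k_channel.std(), deltas.std())
```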
  • 22. Opportunity 2: Heterogeneous sensitivity to quantization 22 The output quality of the LLM is more sensitive to losses in the KV cache values of the shallower layers than to those in the deeper layers. [Plot: LLM output quality (accuracy) when quantizing layer groups [0,3], [4,7], …, [20,23]]
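A codec can exploit this by spending more bits on shallow layers. The allocation below is purely illustrative (the thresholds and bit-widths are assumptions, not CacheGen's actual profile):

```python
def bits_for_layer(layer: int, num_layers: int = 24) -> int:
    """Assign more quantization bits to shallower, more sensitive layers."""
    third = num_layers // 3
    if layer < third:
        return 8   # shallow layers: fine quantization
    elif layer < 2 * third:
        return 6   # middle layers
    return 4       # deep layers tolerate coarser quantization

allocation = [bits_for_layer(l) for l in range(24)]
avg_bits = sum(allocation) / len(allocation)
print(avg_bits)  # 6.0, vs. 8.0 for uniform 8-bit quantization
```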
  • 23. Opportunity 3: Arithmetic coding 23 KV cache → compute delta → quantize → adaptive arithmetic coding → more compact binary representation (001101001110010…), stored on disk or sent via network
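The three-stage pipeline (delta, quantize, entropy-code) can be sketched end to end. Here `zlib` stands in for the adaptive arithmetic coder, and a random-walk tensor stands in for a real KV cache; both are assumptions for illustration:

```python
import zlib
import numpy as np

rng = np.random.default_rng(1)
# Stand-in KV tensor: 2048 tokens x 128 channels with token locality.
kv = np.cumsum(rng.normal(0, 0.01, (2048, 128)).astype(np.float32), axis=0)

# Stage 1: delta between neighboring tokens (first row stays as-is via zeros).
deltas = np.diff(kv, axis=0, prepend=kv[:1])

# Stage 2: 8-bit symmetric quantization of the deltas.
scale = np.abs(deltas).max() / 127
quantized = np.round(deltas / scale).astype(np.int8)

# Stage 3: entropy coding (zlib as a stand-in for arithmetic coding).
encoded = zlib.compress(quantized.tobytes())
print(len(kv.tobytes()), len(encoded))  # encoded is much smaller
```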
  • 24. Reducing decoding overhead? 24 [Diagram: binary representations → Loading → GPU-based Decoding → decompressed KV cache (K and V tensors)] Decoding and loading can be pipelined
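The pipelining idea: while chunk i is being decoded, chunk i+1 is already being fetched, so total time approaches max(load, decode) per chunk instead of their sum. A minimal sketch with threads and made-up timings:

```python
import queue
import threading
import time

LOAD, DECODE, CHUNKS = 0.03, 0.03, 8  # per-chunk timings (made up)

def loader(q: queue.Queue) -> None:
    for i in range(CHUNKS):
        time.sleep(LOAD)   # stand-in for fetching chunk i over the network
        q.put(i)
    q.put(None)            # sentinel: no more chunks

def run_pipelined() -> float:
    q = queue.Queue()
    t = threading.Thread(target=loader, args=(q,))
    start = time.perf_counter()
    t.start()
    while (chunk := q.get()) is not None:
        time.sleep(DECODE)  # stand-in for GPU decode, overlapped with loading
    t.join()
    return time.perf_counter() - start

sequential = CHUNKS * (LOAD + DECODE)  # load everything, then decode
print(run_pipelined(), sequential)     # pipelined finishes sooner
```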
  • 25. Evaluation setup Bandwidth: 3 Gbps (cloud server bandwidth). Models: Llama-70B, Llama-34B, Mistral-7B. Datasets: LongChat, TriviaQA, NarrativeQA, WikiText (various context length distributions). Quality metrics: accuracy, F1 score, perplexity
  • 26. Quality vs. Size & TTFT (time to first token) 26 [Plots: accuracy vs. KV cache size (MB), and accuracy vs. time to first token (sec), comparing CacheGen, uniform quantization, and full prefill (no caching)] Setup: Dataset LongChat (200 contexts, ~9.6K tokens each); Model Llama-70B; Link to load KV cache (1.6 GB at 8-bit): 3 Gbps. 3x smaller KV cache size → 3-6x lower time to first token (TTFT)
  • 27. Impact of context length 27 [Plot: KV cache size (MB) vs. context length (3-20K tokens), comparing CacheGen and uniform 8-bit quantization] Setup: Model Llama-70B. The size reduction holds across various context lengths
  • 28. Breakdown of Time to First Token (TTFT) 28 Full recompute: 6.1 sec prefill on input + prefill on query. Naïve KV cache loading (w/o KV encoding): 7.2 sec load + 0.12 sec prefill on query. CacheGen (KV encoding): 1.8 sec load + decompress + 0.12 sec prefill on query. Setup: Dataset LongChat (200 contexts, ~9.6K tokens each); Model Llama-70B; Link to load KV cache: 3 Gbps
  • 29. Knowledge Storing & Sharing LLM LLM LLM Towards Efficient Knowledge Storing & Sharing Key technique #1: Fast KV cache loading via KV codec (Speed up KV loading by 3-10x) Key technique #2: Flexible join of multiple KV caches Happy to chat about technique #2 after the talk
  • 31. Efficient Knowledge Sharing System 31 Delay (time to first token) Cost (storage, compute, communication) Better GPU prefill Storing KV cache in CPU Storing KV cache in SSD Storing KV cache in S3 Efficient Knowledge Sharing Contact me if you are a potential user or contributor to our Knowledge-Sharing System!! junchenj@uchicago.edu