Reducing Prefill for LLM Serving in RAG (Using Knowledge Sharing)
Junchen Jiang
The Internet age is punctuated with innovations

Each era's applications were enabled by key system innovations:
- Web apps → CDN
- Video → MPEG-DASH, 5G
- Big data → Cloud, MapReduce
- Search engines → Data centers
- Gen AI → ????

OpenAI alone already has 1.8 billion monthly visitors (YouTube has 2.7 billion).
What will be the key system innovations for generative AI?
Do we know how to build the Gen AI system yet?

Internet video:
- 1990s: basics (how to build websites & players)
- 2000s-2020s: building a global distributed system: P2P or CDN, video transcoding, scale-out streaming, streaming quality monitoring, DNS redirection, video caching, … (these took us 20 years)

Gen AI (LLMs):
- 2022-2024: basics (how to build AI apps and servers)
- 2024-????: building a global distributed system: ??? (we are here)

We are still at the very early stage of LLM infrastructure.

This talk: sharing knowledge across LLMs
LLMs are more powerful when paired with "knowledge"

LLMs need to read a large amount of data in real time: a (looooooooooooog) context (news, business docs, chat/shopping history, books) plus a (short) user query, from which the LLM generates the output text.

The prefill delay will only grow (longer contexts, bigger models), while users get less patient.
Yet, it takes time to "learn" (prefill) the context

Today, every LLM instance answering queries about the same book must prefill the context separately.

You Only Learn Once: once one LLM learns something, producing LLM-learned knowledge (the KV cache), other LLMs will immediately know it.
Vision: Knowledge Sharing

With knowledge sharing, one LLM prefills the book once; its LLM-learned knowledge (the KV cache) is shared, and other LLMs answer queries about the book without prefilling.
Feel the speedup!

Demo: Mistral 7B on an A40 GPU, with a 13K-token context.
- Query 1 (w/o KV cache): 6.5 sec
- Query 2 (with efficient KV cache sharing, explained shortly): 0.9 sec (7x faster)
Vision: Knowledge Sharing

Why will the same knowledge (KV cache) be reused? Because 20% of your knowledge is used 80% of the time (the 20-80 rule).

- Faster (shorter time-to-first-token): for a 5,000-token document (context) plus a 100-token question, reusing the document's KV cache makes the time-to-first-token at least 50x faster.
- Higher throughput: without prefill, generation (decoding) is easier to batch. On an A100 GPU, vLLM running Llama2-7B can process 5x requests per second.*
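The "at least 50x" figure follows from prefill cost being at least linear in prompt length: with the document's KV cache, only the question's tokens remain to be prefilled. A quick sanity check (illustrative arithmetic, not a measurement from the talk):

```python
def ttft_speedup_lower_bound(context_tokens: int, query_tokens: int) -> float:
    """Prefill cost grows at least linearly with prompt length, so caching
    the context's KV cache leaves only the query tokens to prefill."""
    full_prefill = context_tokens + query_tokens   # tokens prefilled without cache
    cached_prefill = query_tokens                  # tokens prefilled with cache
    return full_prefill / cached_prefill

print(ttft_speedup_lower_bound(5000, 100))  # 51.0
```

Since attention cost is superlinear in sequence length, the real speedup is at least this ratio.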
Will it be too expensive to store the KV cache? The KV cache is bigger than the text, but storing it on SSD is 4x cheaper than re-computing it on GPUs. And with longer contexts (or bigger models), KV cache size grows more slowly than prefill delay.
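For a rough sense of the storage cost, KV cache size follows directly from the model shape. A minimal sketch assuming a Llama2-7B-like configuration (32 layers, 32 heads, head dimension 128, fp16); the configuration numbers are illustrative assumptions, not from the talk:

```python
def kv_cache_bytes(num_tokens: int, num_layers: int = 32, num_heads: int = 32,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """Size of the KV cache: one K and one V vector per layer, head, and token."""
    return 2 * num_layers * num_heads * head_dim * bytes_per_elem * num_tokens

# A 13K-token context (as in the demo above) costs several GB in fp16,
# orders of magnitude larger than the raw text -- which is why efficient
# storage and loading matter.
print(f"{kv_cache_bytes(13_000) / 1e9:.1f} GB")  # ~6.8 GB
```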
Architecting Efficient Knowledge Sharing

A Knowledge-Sharing System sits between the LLM instances and provides three functions:
- Knowledge synthesis
- Knowledge caching (a perfect fit for storage solutions, like Alluxio)
- Knowledge retrieval
Challenge: the KV cache is 100,000x bigger than the text, so simply loading it remotely is too slow.
Key technique #1: fast KV retrieval via KV encoding (speeds up KV loading by 3-10x).
Challenge: if a text is not at the prefix, its KV cache cannot be reused.
Key technique #2: flexible join of multiple KV caches.
Architecting Efficient Knowledge Sharing: two key techniques
- Key technique #1: fast KV retrieval via KV encoding (speeds up KV loading by 3-10x)
- Key technique #2: flexible join of multiple KV caches
CacheGen: KV Cache Compression and Streaming for Fast Language Model Serving
Yuhan Liu, Hanchen Li, Kuntai Du, Jiayi Yao, Yihua Cheng, Yuyang Huang, Shan Lu, Michael Maire, Henry Hoffmann, Ari Holtzman, Ganesh Ananthanarayanan, Junchen Jiang
ACM SIGCOMM 2024
CacheGen: Compressing KV cache for fast prefill

Without reuse: the LLM prefills the loooooo…oooooong context plus the query, then generates the output text.
With CacheGen: the LLM loads the compressed KV cache and decompresses it, prefills only the query, then generates the output.

Because the compressed KV cache is small, prefill is faster even if the reused KV cache is loaded remotely.
10,000 ft view of CacheGen

Encoding: the KV cache (K tensor and V tensor) is encoded into compact binary representations and stored.
Decoding: the binary representations are decoded back into a (decompressed) KV cache before inference.

In short, CacheGen encodes the KV cache into a compact binary representation.
Several emerging approaches to KV compression:
- Quantizing the KV cache directly
- Dropping less important tokens from the text
- Dropping less important tokens from the KV cache

They all keep the KV tensors' shape → complementary to CacheGen. CacheGen can improve them too!
Can KV cache be encoded efficiently? An analogy with video compression:
- Video codecs encode a video in a small size with small degradation in video quality (size of encoded video vs. video quality).
- Likewise, CacheGen encodes a KV cache in a small size with small degradation in the quality of the generated text (size of encoded KV cache vs. text quality).

We could borrow the 20-year research literature of video compression.
Why can fewer bits represent KV cache?

Key distributional properties of the KV cache:
1. KV cache values are similar between neighboring tokens.
2. Some parts of a KV cache are less sensitive to quantization.
3. A quantized KV cache can be entropy-encoded with fewer bits.
Opportunity 1: Locality of KV cache values

Slicing the K tensor (layers × tokens × channels) at layer j gives a tokens-by-channels matrix. Opportunity: the KV values at nearby tokens have similar values.
Delta values have much smaller variance

For any token i, instead of encoding the original values |K_i| and |V_i|, encode the deltas between neighboring tokens: |K_i - K_{i-1}| and |V_i - V_{i-1}|.

The CDF of absolute values shows that the deltas concentrate much closer to zero than the original values. Delta values have much smaller variance → easier to quantize.
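The delta trick can be sketched in a few lines, here on a toy 1-D channel rather than a real K/V tensor:

```python
def delta_encode(values):
    """Keep the first value; encode each later token as its difference
    from the previous token's value."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

# Neighboring tokens have similar values, so the deltas concentrate
# near zero (the CDF gap shown on this slide).
channel = [1.00, 1.02, 0.98, 1.01, 1.05, 1.03]
deltas = delta_encode(channel)

orig_mag = sum(abs(v) for v in channel) / len(channel)
delta_mag = sum(abs(d) for d in deltas[1:]) / (len(deltas) - 1)
assert delta_mag < orig_mag            # much smaller magnitudes
assert all(abs(a - b) < 1e-9           # and the encoding itself is lossless
           for a, b in zip(delta_decode(deltas), channel))
```

Delta encoding alone loses nothing; the savings come when the small deltas are quantized and entropy-coded in the later stages.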
Opportunity 2: Heterogeneous sensitivity to quantization

The output quality (accuracy) of the LLM is more sensitive to losses in the KV cache values of the shallower layers than to those in the deeper layers.
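One way to exploit this sensitivity gap is to spend more quantization bits on shallow layers and fewer on deep ones. The three-tier schedule below is a hypothetical illustration, not CacheGen's actual configuration:

```python
def bits_for_layer(layer_idx: int, num_layers: int = 32,
                   hi_bits: int = 8, lo_bits: int = 4) -> int:
    """Shallower layers are more sensitive to quantization loss, so give
    them more bits; deeper layers tolerate coarser quantization."""
    third = num_layers // 3
    if layer_idx < third:                 # shallow third: fine-grained
        return hi_bits
    if layer_idx < 2 * third:             # middle third: medium
        return (hi_bits + lo_bits) // 2
    return lo_bits                        # deep third: coarse

print([bits_for_layer(l) for l in (0, 15, 31)])  # [8, 6, 4]
```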
Opportunity 3: Arithmetic coding

Encoding pipeline: KV cache → compute delta → quantize → adaptive arithmetic coding → a more compact binary representation (e.g., 001101001110010…) that can be stored on disk or sent via network.
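The three stages can be chained end to end. A minimal sketch for a single 1-D channel, using zlib as a stand-in for the adaptive arithmetic coder (the real system uses arithmetic coding and operates on whole K/V tensors):

```python
import struct
import zlib

def encode_kv_channel(values, scale=100):
    """delta -> quantize -> entropy-code, for one channel of K or V."""
    deltas = [values[0]] + [b - a for a, b in zip(values, values[1:])]
    quantized = [round(d * scale) for d in deltas]        # small integers
    raw = struct.pack(f"{len(quantized)}h", *quantized)   # int16 payload
    return zlib.compress(raw)                             # entropy stage (stand-in)

def decode_kv_channel(blob, n, scale=100):
    quantized = struct.unpack(f"{n}h", zlib.decompress(blob))
    out = [quantized[0] / scale]
    for q in quantized[1:]:
        out.append(out[-1] + q / scale)
    return out

# A slowly drifting channel compresses far below its fp32 size (4 bytes/value),
# because the quantized deltas are tiny and highly repetitive.
channel = [1.0 + 0.001 * i for i in range(1000)]
blob = encode_kv_channel(channel)
assert len(blob) < 4 * len(channel)
```

Note that the quantization step makes the codec lossy, which is exactly why the layer-sensitivity analysis above matters: coarser quantization is applied only where quality can afford it.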
Reducing decoding overhead?

On the decode path, the binary representations are loaded and then decoded (GPU-based decoding) back into the K and V tensors of the KV cache. Decoding and loading can be pipelined.
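The pipelining idea can be sketched with a producer thread fetching chunks while the consumer decodes; a simplified illustration (in the real system decoding runs on the GPU, and the `fetch`/`decode` callables here are stand-ins):

```python
import queue
import threading

def pipelined_load_decode(chunk_ids, fetch, decode):
    """Overlap loading (producer thread) with decoding (consumer):
    while chunk i is being decoded, chunk i+1 is already in flight."""
    q = queue.Queue(maxsize=4)          # bounded: don't race far ahead of the decoder

    def loader():
        for cid in chunk_ids:
            q.put(fetch(cid))           # e.g., read compressed bytes from network
        q.put(None)                     # sentinel: all chunks loaded

    threading.Thread(target=loader, daemon=True).start()
    decoded = []
    while (chunk := q.get()) is not None:
        decoded.append(decode(chunk))   # e.g., GPU-based decompression
    return decoded

# Toy run with stand-in fetch/decode functions:
out = pipelined_load_decode(range(3), fetch=lambda i: i, decode=lambda c: c * 2)
print(out)  # [0, 2, 4]
```

With the two stages overlapped, end-to-end latency approaches max(load time, decode time) instead of their sum.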
Evaluation setup

- Bandwidth: 3 Gbps (cloud server bandwidth)
- Models: Llama-70B, Llama-34B, Mistral-7B
- Datasets (with various context length distributions): LongChat, TriviaQA, NarrativeQA, WikiText
- Quality metrics: accuracy, F1 score, perplexity
Quality vs. Size & TTFT (time to first token)

Setup: LongChat dataset (200 contexts, ~9.6K tokens each); Llama-70B; 3 Gbps link to load the KV cache (1.6 GB at 8-bit).

Compared with uniform quantization and full prefill (no caching), CacheGen reaches the same accuracy with a 3x smaller KV cache size → 3-6x lower time to first token (TTFT).
Impact of context length

Setup: Llama-70B, context lengths from 3K to 20K tokens.

CacheGen's KV cache size stays well below uniform 8-bit quantization at every context length: the size reduction holds under various context lengths.
Breakdown of Time to First Token (TTFT)

Setup: LongChat (200 contexts, ~9.6K tokens each); Llama-70B; 3 Gbps link to load the KV cache.

- Full recompute: 6.1 sec prefill on input + prefill on query
- Naïve KV cache loading (w/o KV encoding): 7.2 sec load + 0.12 sec prefill on query
- CacheGen (KV encoding): 1.8 sec load + decompress + 0.12 sec prefill on query
Towards Efficient Knowledge Storing & Sharing

- Key technique #1: fast KV cache loading via KV codec (speeds up KV loading by 3-10x)
- Key technique #2: flexible join of multiple KV caches (happy to chat about technique #2 after the talk)
Try it yourself!

Code repo: https://github.com/uchi-jcl/cachegen
Research paper: https://arxiv.org/pdf/2310.07240.pdf
Efficient Knowledge Sharing System

On the trade-off between delay (time to first token) and cost (storage, compute, communication), GPU prefill sits at one extreme, and storing the KV cache in CPU memory, SSD, or S3 trades delay for cost; efficient knowledge sharing aims to be better on both axes.

Contact me if you are a potential user of, or contributor to, our Knowledge-Sharing System!
junchenj@uchicago.edu

Assure Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 

Recently uploaded (20)

All you need to know about Spring Boot and GraalVM
All you need to know about Spring Boot and GraalVMAll you need to know about Spring Boot and GraalVM
All you need to know about Spring Boot and GraalVM
 
如何办理(hull学位证书)英国赫尔大学毕业证硕士文凭原版一模一样
如何办理(hull学位证书)英国赫尔大学毕业证硕士文凭原版一模一样如何办理(hull学位证书)英国赫尔大学毕业证硕士文凭原版一模一样
如何办理(hull学位证书)英国赫尔大学毕业证硕士文凭原版一模一样
 
DECODING JAVA THREAD DUMPS: MASTER THE ART OF ANALYSIS
DECODING JAVA THREAD DUMPS: MASTER THE ART OF ANALYSISDECODING JAVA THREAD DUMPS: MASTER THE ART OF ANALYSIS
DECODING JAVA THREAD DUMPS: MASTER THE ART OF ANALYSIS
 
Transforming Product Development using OnePlan To Boost Efficiency and Innova...
Transforming Product Development using OnePlan To Boost Efficiency and Innova...Transforming Product Development using OnePlan To Boost Efficiency and Innova...
Transforming Product Development using OnePlan To Boost Efficiency and Innova...
 
Superpower Your Apache Kafka Applications Development with Complementary Open...
Superpower Your Apache Kafka Applications Development with Complementary Open...Superpower Your Apache Kafka Applications Development with Complementary Open...
Superpower Your Apache Kafka Applications Development with Complementary Open...
 
Orca: Nocode Graphical Editor for Container Orchestration
Orca: Nocode Graphical Editor for Container OrchestrationOrca: Nocode Graphical Editor for Container Orchestration
Orca: Nocode Graphical Editor for Container Orchestration
 
一比一原版(sdsu毕业证书)圣地亚哥州立大学毕业证如何办理
一比一原版(sdsu毕业证书)圣地亚哥州立大学毕业证如何办理一比一原版(sdsu毕业证书)圣地亚哥州立大学毕业证如何办理
一比一原版(sdsu毕业证书)圣地亚哥州立大学毕业证如何办理
 
Going AOT: Everything you need to know about GraalVM for Java applications
Going AOT: Everything you need to know about GraalVM for Java applicationsGoing AOT: Everything you need to know about GraalVM for Java applications
Going AOT: Everything you need to know about GraalVM for Java applications
 
Migration From CH 1.0 to CH 2.0 and Mule 4.6 & Java 17 Upgrade.pptx
Migration From CH 1.0 to CH 2.0 and  Mule 4.6 & Java 17 Upgrade.pptxMigration From CH 1.0 to CH 2.0 and  Mule 4.6 & Java 17 Upgrade.pptx
Migration From CH 1.0 to CH 2.0 and Mule 4.6 & Java 17 Upgrade.pptx
 
Ensuring Efficiency and Speed with Practical Solutions for Clinical Operations
Ensuring Efficiency and Speed with Practical Solutions for Clinical OperationsEnsuring Efficiency and Speed with Practical Solutions for Clinical Operations
Ensuring Efficiency and Speed with Practical Solutions for Clinical Operations
 
bgiolcb
bgiolcbbgiolcb
bgiolcb
 
Building API data products on top of your real-time data infrastructure
Building API data products on top of your real-time data infrastructureBuilding API data products on top of your real-time data infrastructure
Building API data products on top of your real-time data infrastructure
 
A Comprehensive Guide on Implementing Real-World Mobile Testing Strategies fo...
A Comprehensive Guide on Implementing Real-World Mobile Testing Strategies fo...A Comprehensive Guide on Implementing Real-World Mobile Testing Strategies fo...
A Comprehensive Guide on Implementing Real-World Mobile Testing Strategies fo...
 
Microsoft-Power-Platform-Adoption-Planning.pptx
Microsoft-Power-Platform-Adoption-Planning.pptxMicrosoft-Power-Platform-Adoption-Planning.pptx
Microsoft-Power-Platform-Adoption-Planning.pptx
 
Secure-by-Design Using Hardware and Software Protection for FDA Compliance
Secure-by-Design Using Hardware and Software Protection for FDA ComplianceSecure-by-Design Using Hardware and Software Protection for FDA Compliance
Secure-by-Design Using Hardware and Software Protection for FDA Compliance
 
What is Continuous Testing in DevOps - A Definitive Guide.pdf
What is Continuous Testing in DevOps - A Definitive Guide.pdfWhat is Continuous Testing in DevOps - A Definitive Guide.pdf
What is Continuous Testing in DevOps - A Definitive Guide.pdf
 
Upturn India Technologies - Web development company in Nashik
Upturn India Technologies - Web development company in NashikUpturn India Technologies - Web development company in Nashik
Upturn India Technologies - Web development company in Nashik
 
WMF 2024 - Unlocking the Future of Data Powering Next-Gen AI with Vector Data...
WMF 2024 - Unlocking the Future of Data Powering Next-Gen AI with Vector Data...WMF 2024 - Unlocking the Future of Data Powering Next-Gen AI with Vector Data...
WMF 2024 - Unlocking the Future of Data Powering Next-Gen AI with Vector Data...
 
Streamlining End-to-End Testing Automation
Streamlining End-to-End Testing AutomationStreamlining End-to-End Testing Automation
Streamlining End-to-End Testing Automation
 
Assure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyesAssure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyes
 

AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG

  • 1. Reducing Prefill for LLM Serving in RAG (Using Knowledge Sharing) Junchen Jiang 1
  • 2. The Internet age is punctuated with innovations 2 Web apps CDN Video MPEG- DASH, 5G Big Data Cloud, MapReduce Search engine Data Centers Gen AI ???? OpenAI alone already has 1.8 billion monthly visitors (YouTube has 2.7 billion monthly visitors) What will be the key system innovations for generative AI? time
  • 3. Do we know how to build the Gen AI system yet? 3 Basics How to build websites & players 1990s 2000s – 2020s Building a global distributed system? P2P or CDN, video transcoding, scale out streaming, streaming quality monitoring, DNS redirection, video caching, … Basics How to build AI apps and servers 2022-2024 2024 - ?????? Building a global distributed system ??? We are still at the very early stage of LLM infrastructure These took us 20 years This talk: Sharing knowledge across LLMs Internet video Gen AI (LLMs) We are here
  • 4. LLMs are more powerful when paired with "knowledge" LLMs need to read a large amount of data in real-time (looooooooooooog) contexts output text LLM (short) query News Business docs Chat/ shopping history Book User
  • 5. The prefill delay will only grow (longer contexts, bigger models), while users get less patient. Yet, it takes time to "learn" (prefill) the context LLM LLM LLM Queries about a book Prefilling Prefilling Prefilling LLM-Learned knowledge (KV cache)
  • 6. Knowledge Sharing You Only Learn Once: Once one LLM learns something, other LLMs will immediately know Vision: Knowledge Sharing LLM LLM LLM Queries about a book Prefilling LLM-Learned knowledge (KV cache)
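The "You Only Learn Once" idea above can be sketched as a shared store keyed by the context's hash: only the first request pays the prefill cost, and later requests (from any LLM instance) reuse the stored KV cache. The names below (`KVCacheStore`, `get_or_prefill`) are illustrative, not an actual API; the prefill stand-in returns a placeholder instead of real per-layer K/V tensors.

```python
import hashlib

class KVCacheStore:
    """Minimal sketch of a shared KV-cache store (illustrative only)."""

    def __init__(self):
        self._store = {}
        self.prefill_count = 0  # how many times prefill actually ran

    def _key(self, context: str) -> str:
        return hashlib.sha256(context.encode()).hexdigest()

    def _prefill(self, context: str):
        # Stand-in for the LLM's prefill pass; a real system would
        # return per-layer K/V tensors here.
        self.prefill_count += 1
        return f"kv-cache-for-{len(context)}-chars"

    def get_or_prefill(self, context: str):
        key = self._key(context)
        if key not in self._store:      # only the first request prefills
            self._store[key] = self._prefill(context)
        return self._store[key]         # later requests reuse the KV cache

store = KVCacheStore()
book = "a long book context " * 100
kv1 = store.get_or_prefill(book)  # first LLM instance: pays prefill
kv2 = store.get_or_prefill(book)  # second LLM instance: cache hit
```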
  • 7. Feel the speedup! Context text (13K tokens) 6.5sec Query 2 0.9sec (7x faster) Mistral 7B on A40 Mistral 7B on A40 Query 1 KV Cache Sharing KV cache w/o KV cache With efficient KV cache sharing (explained shortly)
  • 8. Vision: Knowledge Sharing Why will the same knowledge (KV cache) be reused? 20% of your knowledge is used 80% of the time. (20-80 rule) Faster (shorter time-to-first-token) Ex. 5,000-token document (context) + 100-token question With the document's KV cache, time-to-first-token is at least 50x faster Higher throughput Without prefill, generation (decoding) would be easier to batch On an A100 GPU, vLLM running Llama2-7B can process 5x as many requests per second* Will it be too expensive to store KV cache? KV cache is bigger than text, but storing it on SSD is 4x cheaper than re-computing it on GPUs. With longer contexts (or bigger models), KV cache size grows more slowly than prefill delay.
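The "at least 50x" claim on this slide follows from a back-of-envelope calculation, assuming prefill time scales at least linearly with the number of tokens processed:

```python
# With the document's KV cache, only the question needs prefilling.
doc_tokens, query_tokens = 5000, 100

tokens_without_cache = doc_tokens + query_tokens  # prefill everything
tokens_with_cache = query_tokens                  # only the question

speedup = tokens_without_cache / tokens_with_cache
print(speedup)  # 51.0, consistent with "at least 50x"
```

In practice attention cost is superlinear in sequence length, so the real speedup can exceed this linear estimate.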
  • 9. LLM LLM LLM Architecting Efficient Knowledge Sharing Knowledge synthesis Knowledge caching Knowledge retrieval Knowledge-Sharing System
  • 10. LLM LLM LLM Architecting Efficient Knowledge Sharing Knowledge synthesis Knowledge caching Knowledge retrieval Knowledge-Sharing System Perfect fit for storage solutions, like Alluxio
  • 11. Knowledge synthesis Knowledge caching Knowledge retrieval Knowledge-Sharing System LLM LLM LLM Architecting Efficient Knowledge Sharing Challenge: KV cache is 100,000x bigger than text. Simply loading them remotely is too slow Key technique #1: Fast KV retrieval via KV encoding (Speed up KV loading by 3-10x)
  • 12. Knowledge synthesis Knowledge caching Knowledge retrieval Knowledge-Sharing System LLM LLM LLM Architecting Efficient Knowledge Sharing Challenge: If a text is not at the prefix, its KV cache cannot be reused Key technique #2: Flexible join of multiple KV caches
  • 13. Knowledge- Sharing System LLM LLM LLM Architecting Efficient Knowledge Sharing Key technique #1: Fast KV retrieval via KV encoding (Speed up KV loading by 3-10x) Key technique #2: Flexible join of multiple KV caches
  • 14. CacheGen: KV Cache Compression and Streaming for Fast Language Model Serving 14 Yuhan Liu, Hanchen Li, Kuntai Du, Jiayi Yao, Yihua Cheng, Yuyang Huang, Shan Lu, Michael Maire, Henry Hoffmann, Ari Holtzman, Ganesh Ananthanarayanan, Junchen Jiang ACM SIGCOMM 2024
  • 15. CacheGen: Compressing KV cache for fast prefill 15 loooooo…oooooong context + query output text LLM Prefill on query time Generate output Loading KV cache Prefill on query time Generate output Loading compressed KV cache & decompress Compressed KV cache KV cache Faster prefill even if the reused KV cache is loaded remotely
  • 16. 10,000 ft view of CacheGen 16 [Diagram: KV cache (K and V tensors) → Encoding → compact binary representations → Storage → binary representations → Decoding → decompressed KV cache (K and V tensors)] CacheGen: Encode KV cache to compact binary representation
  • 17. Several emerging approaches to KV compression 17 Quantizing KV cache directly? Dropping less important tokens from the text? Dropping less important tokens from the KV cache? They all keep the KV's tensor shape → complementary to CacheGen. CacheGen can improve them too! CacheGen: Encode KV cache to compact binary representation
  • 18. Can KV cache be encoded efficiently? Analogy with video compression: encode a video in a small size with small degradation in video quality. [Plots: size of encoded video vs. video quality; size of encoded KV cache vs. quality of generated text; smaller size at higher quality is better] We could borrow the 20-year research literature of video compression
  • 19. Why can fewer bits represent KV cache? 19 KV cache is similar between neighboring tokens Some parts of a KV cache are less sensitive to quantization Quantized KV cache can be entropy-encoded with fewer bits Key distributional properties of KV cache
  • 20. Opportunity 1: Locality of KV cache values 20 [Diagram: K tensor at layer j, with axes # of layers, # of tokens, # of channels] Opportunity: KV values at nearby tokens are similar
  • 21. Delta values have much smaller variance 21 For any token i: Original |K_i|, |V_i|; Delta |K_i − K_{i−1}|, |V_i − V_{i−1}|. Encode the delta between neighboring tokens, rather than the tokens themselves. Delta values have much smaller variance → easier to quantize. [Plot: CDF of absolute values; deltas concentrate near zero compared to the original values]
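The delta observation can be checked numerically. The sketch below uses a smooth random walk as a stand-in for one channel of a real K tensor (an assumption; real KV values are model outputs, but the slide's point is that neighboring tokens are similar):

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens = 4096
# Simulated K-tensor channel: slowly drifting values plus small noise,
# mimicking the token-to-token locality the slide describes.
k_channel = (np.cumsum(rng.normal(0, 0.01, num_tokens))
             + rng.normal(0, 0.001, num_tokens))

# Encode differences between neighboring tokens instead of raw values.
deltas = np.diff(k_channel)

# Deltas are far more concentrated, so they quantize with fewer bits.
print(k_channel.std(), deltas.std())
```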
  • 22. Opportunity 2: Heterogeneous sensitivity to quantization 22 The output quality of the LLM is more sensitive to losses in the KV cache values of the shallower layers than to those in the deeper layers. [Plot: LLM output quality (accuracy) when quantizing layer groups [0,3], [4,7], …, [20,23]]
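A codec can exploit this by spending more bits on shallow layers. The allocation below is purely illustrative (the thresholds and bit-widths are assumptions, not CacheGen's actual profile):

```python
def bits_for_layer(layer: int, num_layers: int = 24) -> int:
    """Assign more quantization bits to shallower, more sensitive layers."""
    third = num_layers // 3
    if layer < third:
        return 8   # shallow layers: fine quantization
    elif layer < 2 * third:
        return 6   # middle layers
    return 4       # deep layers tolerate coarser quantization

allocation = [bits_for_layer(l) for l in range(24)]
avg_bits = sum(allocation) / len(allocation)
print(avg_bits)  # 6.0, vs. 8.0 for uniform 8-bit quantization
```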
  • 23. Opportunity 3: Arithmetic coding 23 KV cache → compute delta → quantize → adaptive arithmetic coding → more compact binary representation (001101001110010…), stored on disk or sent via network
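The three-stage pipeline (delta, quantize, entropy-code) can be sketched end to end. Here `zlib` stands in for the adaptive arithmetic coder, and a random-walk tensor stands in for a real KV cache; both are assumptions for illustration:

```python
import zlib
import numpy as np

rng = np.random.default_rng(1)
# Stand-in KV tensor: 2048 tokens x 128 channels with token locality.
kv = np.cumsum(rng.normal(0, 0.01, (2048, 128)).astype(np.float32), axis=0)

# Stage 1: delta between neighboring tokens (first row stays as-is via zeros).
deltas = np.diff(kv, axis=0, prepend=kv[:1])

# Stage 2: 8-bit symmetric quantization of the deltas.
scale = np.abs(deltas).max() / 127
quantized = np.round(deltas / scale).astype(np.int8)

# Stage 3: entropy coding (zlib as a stand-in for arithmetic coding).
encoded = zlib.compress(quantized.tobytes())
print(len(kv.tobytes()), len(encoded))  # encoded is much smaller
```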
  • 24. Reducing decoding overhead? 24 [Diagram: binary representations → Loading → GPU-based Decoding → decompressed KV cache (K and V tensors)] Decoding and loading can be pipelined
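The pipelining idea: while chunk i is being decoded, chunk i+1 is already being fetched, so total time approaches max(load, decode) per chunk instead of their sum. A minimal sketch with threads and made-up timings:

```python
import queue
import threading
import time

LOAD, DECODE, CHUNKS = 0.03, 0.03, 8  # per-chunk timings (made up)

def loader(q: queue.Queue) -> None:
    for i in range(CHUNKS):
        time.sleep(LOAD)   # stand-in for fetching chunk i over the network
        q.put(i)
    q.put(None)            # sentinel: no more chunks

def run_pipelined() -> float:
    q = queue.Queue()
    t = threading.Thread(target=loader, args=(q,))
    start = time.perf_counter()
    t.start()
    while (chunk := q.get()) is not None:
        time.sleep(DECODE)  # stand-in for GPU decode, overlapped with loading
    t.join()
    return time.perf_counter() - start

sequential = CHUNKS * (LOAD + DECODE)  # load everything, then decode
print(run_pipelined(), sequential)     # pipelined finishes sooner
```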
  • 25. Evaluation setup Bandwidth: 3 Gbps (cloud server bandwidth). Models: Llama-70B, Llama-34B, Mistral-7B. Datasets: LongChat, TriviaQA, NarrativeQA, WikiText (various context length distributions). Quality metrics: accuracy, F1 score, perplexity
  • 26. Quality vs. Size & TTFT (time to first token) 26 [Plots: accuracy vs. KV cache size (MB), and accuracy vs. time to first token (sec), comparing CacheGen, uniform quantization, and full prefill (no caching)] Setup: Dataset LongChat (200 contexts, ~9.6K tokens each); Model Llama-70B; Link to load KV cache (1.6 GB at 8-bit): 3 Gbps. 3x smaller KV cache size → 3-6x lower time to first token (TTFT)
  • 27. Impact of context length 27 [Plot: KV cache size (MB) vs. context length (3-20K tokens), comparing CacheGen and uniform 8-bit quantization] Setup: Model Llama-70B. The size reduction holds across various context lengths
  • 28. Breakdown of Time to First Token (TTFT) 28 Full recompute: 6.1 sec prefill on input + prefill on query. Naïve KV cache loading (w/o KV encoding): 7.2 sec load + 0.12 sec prefill on query. CacheGen (KV encoding): 1.8 sec load + decompress + 0.12 sec prefill on query. Setup: Dataset LongChat (200 contexts, ~9.6K tokens each); Model Llama-70B; Link to load KV cache: 3 Gbps
  • 29. Knowledge Storing & Sharing LLM LLM LLM Towards Efficient Knowledge Storing & Sharing Key technique #1: Fast KV cache loading via KV codec (Speed up KV loading by 3-10x) Key technique #2: Flexible join of multiple KV caches Happy to chat about technique #2 after the talk
  • 31. Efficient Knowledge Sharing System 31 Delay (time to first token) Cost (storage, compute, communication) Better GPU prefill Storing KV cache in CPU Storing KV cache in SSD Storing KV cache in S3 Efficient Knowledge Sharing Contact me if you are a potential user or contributor to our Knowledge-Sharing System!! junchenj@uchicago.edu