AI/ML Infra Meetup | Exploring Distributed Caching for Faster GPU T...

Exploring Distributed Caching
for Faster GPU Training with
NVMe, GDS and RDMA
Bin Fan
Founding Engineer, VP of Technology ...

Agenda
● Why I/O Matters in LLM Training: Challenges & Opportunities
● Three Options for Handling Efficient I/O
● Designing a H...

LLM Performance is Driven by Compute, Data Size, Parameters
=> I/O Becomes Critical
Scaling Law: The performance of Large ...

Number of Tokens Grows Exponentially => Need Faster I/O for Datasets
● LLM training demands a huge amount
of data (billions -> 15T...

Model Size Grows Exponentially => Need Faster Checkpointing
● Large model size, ranging from 7B to more than 1T
● With lar...
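A rough sizing sketch for why checkpoint I/O grows with model size. The bytes-per-parameter value (2, for 16-bit weights only, without optimizer state) and the two write throughputs are assumptions chosen for illustration; they are not figures from the slides.

```python
GIB = 2**30


def checkpoint_gib(num_params: float, bytes_per_param: float) -> float:
    """Approximate checkpoint size in GiB."""
    return num_params * bytes_per_param / GIB


def write_seconds(size_gib: float, throughput_gib_s: float) -> float:
    """Time to persist a checkpoint at a sustained write throughput."""
    return size_gib / throughput_gib_s


# Assumed values: 2 bytes per parameter, and 1 GiB/s vs 8 GiB/s write throughput.
for name, params in [("7B", 7e9), ("1T", 1e12)]:
    size = checkpoint_gib(params, bytes_per_param=2)
    for tput in (1.0, 8.0):
        secs = write_seconds(size, tput)
        print(f"{name}: {size:,.0f} GiB, {secs:,.0f} s at {tput} GiB/s")
```

Including optimizer state would raise bytes per parameter several-fold, which makes the gap between slow and fast storage larger still.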

Emerging Hardware Advancements => New Opportunities
New hardware advancements such as low-latency NVMe storage, RDMA, and ...

Explore Efficient & Scalable I/O for Model Training
Questions:
▪ Possible architectures
▪ How to design an efficient,
scalabl...

Option 1: Connect to Cloud Storage Directly
Pros:
● Easy to manage – single source of truth
Data Lake
Cons:
● Slow or Incons...

Option 2: Add High-performance Storage
Pros:
High and consistent I/O performance
Cons:
● Costly Infrastructure
● Extra ove...

Option 3: Add a Distributed Caching Layer
Pros:
● High and consistent I/O
performance
● Still Keep Single-source of tru...

Designing a High-performance,
Scalable, Distributed Caching for
Faster GPU Training:
Using Alluxio as An Example

Alluxio Data Platform
Accelerate data-intensive AI training workloads

Powered by Alluxio
Zhihu
TELCO & MEDIA
E-COMMERCE
FINANCIAL SERVICES
TECH & INTERNET
OTHERS

Alluxio (Tachyon) was born at UC Berkeley

Requirements for Serving ML Training on GPUs
● Programming interface: Subset of POSIX
● Data format: Structured (parquet) plus...

Diagram labels: Big Data Query, Big Data ETL, Model Training; Alluxio Worker 2, Alluxio Worker n
Basic Architecture: Fully Sharded on Consiste...

A Bonus Feature:
Mapping Storage Address to Logical Address
● Alluxio can be viewed as a logical file system
○ Multiple di...
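One way to picture a logical file system over several storage systems is a mount table resolved by longest matching prefix. The sketch below is a generic illustration, not Alluxio's API: the `MountTable` class and the bucket and host names are hypothetical.

```python
class MountTable:
    """Maps logical path prefixes to storage URIs (illustrative only)."""

    def __init__(self):
        self._mounts = {}

    def mount(self, logical_prefix: str, storage_uri: str) -> None:
        # Normalize so "/s3/" and "/s3" are the same mount point.
        self._mounts[logical_prefix.rstrip("/") or "/"] = storage_uri.rstrip("/")

    @staticmethod
    def _matches(prefix: str, path: str) -> bool:
        return prefix == "/" or path == prefix or path.startswith(prefix + "/")

    def resolve(self, logical_path: str) -> str:
        # Pick the longest mount prefix that contains the path.
        candidates = [p for p in self._mounts if self._matches(p, logical_path)]
        if not candidates:
            raise KeyError(f"no mount point for {logical_path}")
        best = max(candidates, key=len)
        suffix = logical_path[len(best):].lstrip("/")
        base = self._mounts[best]
        return f"{base}/{suffix}" if suffix else base


table = MountTable()
table.mount("/s3", "s3://bucket-a/data")        # hypothetical bucket
table.mount("/s3/hot", "hdfs://nn:8020/hot")    # hypothetical cluster
print(table.resolve("/s3/train/part-0"))
print(table.resolve("/s3/hot/part-1"))
```

Training code sees only the logical paths, so the backing store can change without touching the job.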

Under the hood
● Use consistent hashing to cache both data and metadata on workers.
○ Reduced I/O RPC length, Performance ...
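A minimal sketch of consistent hashing as a way to pick the worker that caches a given file. The hash function (MD5), the virtual-node count, and the worker and path names are assumptions made for the example; the slides do not specify how Alluxio implements its ring.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Stable 64-bit hash of a string."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, workers, vnodes=100):
        # Each worker owns several points on the ring to even out the load.
        self._ring = sorted(
            (_hash(f"{w}#{i}"), w) for w in workers for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def worker_for(self, path: str) -> str:
        # First ring point clockwise from the path's hash, wrapping around.
        idx = bisect.bisect_right(self._keys, _hash(path)) % len(self._keys)
        return self._ring[idx][1]


workers = ["worker-1", "worker-2", "worker-3", "worker-4"]
paths = [f"/datasets/shard-{i:05d}.parquet" for i in range(1000)]

ring = HashRing(workers)
before = {p: ring.worker_for(p) for p in paths}

# Remove one worker: only the paths it owned should be reassigned.
smaller = HashRing(workers[:-1])
moved = sum(1 for p in paths if before[p] != smaller.worker_for(p))
print(f"{moved} of {len(paths)} paths moved after removing worker-4")
```

Because any client can compute the owner locally, a lookup needs no central metadata service, and removing a worker only remaps the keys that worker held.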

Performance Results by the Numbers
● High Performance:
○ Cache serving: Tens of GB/s per worker
○ Cache preloading: fully ...

Accelerates Model Development
● Increase productivity and accelerate model development by providing
seamless access to dat...

Applying the Architecture With NVMe SSDs

Alluxio Worker on EC2 i3en.metal
vCPU: 96
RAM: 768 GiB
Storage: 8 x 7500 GB NVMe SSD
Network Bandwidth: 11.6 GiB/s
FIO Benchm...

~8 GiB/s per Worker for Sequential Reads

~7 GiB/s Throughput for Random Reads

8 → 32 GiB/s Scaling from 1 to 4 Workers
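The arithmetic behind these numbers can be checked with a short sketch. It assumes the 11.6 GiB/s network bandwidth listed for the worker corresponds to a 100 Gbit/s link, and that reads are served over that link; the helper names are illustrative.

```python
GIB = 2**30


def gbit_to_gib_per_s(gbit_per_s: float) -> float:
    """Convert a link speed in Gbit/s (decimal) to GiB/s (binary)."""
    return gbit_per_s * 1e9 / 8 / GIB


def scaling_efficiency(single: float, aggregate: float, workers: int) -> float:
    """Aggregate throughput relative to perfect linear scaling."""
    return aggregate / (single * workers)


nic = gbit_to_gib_per_s(100)  # assumed 100 Gbit/s link
print(f"NIC ceiling: {nic:.1f} GiB/s")
print(f"Sequential reads (8 GiB/s) use {8 / nic:.0%} of the NIC")
print(f"Random reads (7 GiB/s) use {7 / nic:.0%} of the NIC")
print(f"Scaling efficiency, 1 -> 4 workers: {scaling_efficiency(8, 32, 4):.0%}")
```

Under these assumptions a single worker runs at roughly two thirds of its network ceiling, and going from 8 to 32 GiB/s across four workers is linear scaling.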

Network Transportation

Using Netty to Optimize Transportation
Performance
Highly Concurrent Position Reads
Solve up to 150X Read
Amplification issue...

Experimental: IB vs IPoIB
- Experimental IB
- Reimplement data plane using Unified Communication X (UCX)
- Experimental IPo...

Experimental: Faster Checkpointing

Experimental: GPUDirect over RDMA
checkpointing
● GPU memory -> (Remote) Alluxio worker CPU memory -> Alluxio worker NVMe
...

Case Study: Customer Z
Use Case: Accelerate Model Training & Deployment
Painpoint: Complexity of Training/Inferencing Cros...

Storage
Training Data
Checkpoints
Training
Data Checkpoints
Model
Training
Model
Training
Model
Inference
Model
...

Customer Journey: Architecture Before Alluxio
Primary Data Lake
Training Data
Checkpoints
Training
Data
Checkpoints
Model
...

Customer Journey: Architecture After Alluxio
Primary Data Lake
Model
Training
Model
Training
Training - Cloud
Training - O...

Storage
Training
Data
Checkpoints
Model
Training
Checkpoint
Training - On Prem
Primary data lake
Online ML platform - Clou...

Thank You
Any Questions?
Scan the QR code for a
Linktree including great
learning resources,
exciting meetups & a
communit...
