From Data Preparation to Inference:
How Alluxio Speeds Up AI
Jingwen Ouyang
Sr Product Manager, Alluxio
June 2025
AI Has Transformed Daily Life
Traditional Search & Recommendation
Things that people constantly buy
that match the search keywords
What to buy?
Top N items
What to buy?
Top N items personalized
Inference Engine
User Profile
Get User
features
Characteristics of Items
Get items
features
Scores, Ranks
Get model
Model Repository
AI Has Transformed Daily Life
Personalized Search & Recommendation
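The flow on this slide — fetch user features and item features, load a model, then score and rank the top-N items — can be sketched in a few lines. Everything below (the feature vectors and the linear scoring model) is a hypothetical stand-in for illustration, not the API of any real inference engine:

```python
# Minimal sketch of the inference-engine flow: look up user and item
# features, score each item with a model, and return the top-N items.
# All names and numbers here are illustrative placeholders.

def score(user_features, item_features, weights):
    """Toy linear model: dot product over concatenated features."""
    feats = user_features + item_features
    return sum(w * f for w, f in zip(weights, feats))

def top_n(user_features, catalog, weights, n=3):
    """Rank every item in the catalog for this user; keep the N best."""
    ranked = sorted(
        catalog.items(),
        key=lambda kv: score(user_features, kv[1], weights),
        reverse=True,
    )
    return [item_id for item_id, _ in ranked[:n]]

if __name__ == "__main__":
    user = [1.0, 0.5]                 # e.g. activity level, age bucket
    catalog = {
        "item_a": [0.9, 0.1],
        "item_b": [0.2, 0.6],
        "item_c": [0.7, 0.7],
    }
    weights = [0.5, 0.5, 1.0, 1.0]    # "model" fetched from a repository
    print(top_n(user, catalog, weights, n=2))
```

In a production system the feature lookups and the model fetch each hit a different store, which is exactly where data-access latency enters the picture.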
AI Data Life Cycle
Data
Collection
Data
Preprocessing
Model
Training
Model
Verification
Model
Loading
Inference
Data
Archiving
Data is everywhere in every stage of the journey.
Data needs to be accessed fast, friction-free, and at low cost.
Alluxio makes it easy to share and
manage data from
any storage
to any compute engine
in any environment
with high performance and low cost.
Alluxio Data Platform
Distributed caching layer close to the compute engines for low-latency, high-throughput data access
Various API support (S3, POSIX, HDFS, etc.)
Connectors for different UFSs provide a unified global view
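The "unified global view" can be pictured as a mount table that maps logical paths in one namespace to different underlying storage systems. The sketch below is a toy illustration of the idea only, not Alluxio's implementation; the /s3 and /onprem paths mirror the unified-namespace example in the data-federation case study:

```python
# Toy mount table illustrating a unified namespace: logical paths under
# one tree resolve to different underlying file systems (UFSs).
MOUNTS = {
    "/s3": "s3://bucket/data",              # public-cloud object store
    "/onprem": "hdfs://nn:8020/warehouse",  # on-prem HDFS (hypothetical)
}

def resolve(logical_path):
    """Translate a logical path to its UFS location (longest-prefix match)."""
    for mount in sorted(MOUNTS, key=len, reverse=True):
        if logical_path == mount or logical_path.startswith(mount + "/"):
            return MOUNTS[mount] + logical_path[len(mount):]
    raise FileNotFoundError(logical_path)
```

The point of the real system is that compute engines see only the logical tree, so data can move between UFSs without application changes.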
Alluxio Technology Journey
Open source started from UC Berkeley AMPLab in 2014
2014 (Big Data Analytics): Alluxio open source project founded at UC Berkeley AMPLab; Baidu deploys 1000+ node cluster
2019 (Cloud Adoption): Alluxio scales to 1 billion files; 7/10 top internet brands accelerated by Alluxio; Meta accelerates Presto workloads; 1000+ OSS contributors
2023 (Generative AI): AliPay accelerates model training; 9/10 top internet brands accelerated by Alluxio; Zhihu accelerates LLM model training
2024: Alluxio scales to 10+ billion files; leading e-commerce brand accelerates model training; Fortune 5 brand accelerates model training
Case Studies
Zhihu
TELCO & MEDIA
E-COMMERCE
FINANCIAL SERVICES
TECH & INTERNET
OTHERS
Data Preprocessing
DATACENTER 2 DATACENTER 1
UNIFIED
NAMESPACE
/s3/path
/onprem/path
DATACENTER 3 PUBLIC CLOUDS
DATA
ENGINEERING
OR
Data Science
Data Lakehouse
Data Preprocessing CASE STUDY:
Data Federation Across Clouds / IDCs
BUSINESS BENEFIT:
53% reduction in infrastructure and operations
TECH BENEFIT:
● Sped up data
migration to cloud
● Enabled compute
burst to cloud
Hive
Model Training
BUSINESS BENEFIT:
TECH BENEFIT:
GPU utilization increased from 50% to 93%
HDFS
Training
Data
Training
Data
Models
Training
Data
Models
Model
Training
Model
Training
Model
Deployment
Model
Inference
Downstream
Applications
Model
Update
Training Clouds Offline Cloud Online Cloud
Model Training CASE STUDY 1:
High Performance AI Platform for LLM
2 - 4X faster
time-to-market
● Hybrid cloud, cross-region
● UFS throttling, limited throughput
● GPUs stay idle while waiting for data
Model Training CASE STUDY 2:
High Performance AI Platform for Search & Recommendation
Challenges - Manual Bottleneck
⚙ Data Prep Delay
• Copy & validation introduce 1-day delay on 20TB datasets
• Manual prep & staging required in each region
🌐 Cross-Region Cost
• Large datasets incur repeated cross-region egress charges
📦 Storage Ops Pain
• Manual storage management, coordination across teams
Model Training CASE STUDY 2:
High Performance AI Platform for Search & Recommendation
Benefits w/ Alluxio - Clean & Streamlined Process
🚀 Performance
• Outperforms traditional HPC storage (e.g., 40% better than
AWS FSx Lustre)
• No prep delay (instant data from cache)
• Global view enables overflow training with no code change
💰 Cost & Resource Savings
• 50%+ reduction in S3 calls and egress
• No redundant datasets across storage systems or teams
⚙ Ops Simplicity
• No manual cleanup
• K8s-based scaling
• Seamless across clouds
Model Training Case Study 3
Model Checkpoint Write
Active Training
Training Paused
for Checkpoint Creation
Training Complete
Training
Load on S3
Load on Alluxio
Alluxio overcomes the UFS bandwidth limits
Reduced load on S3, sophisticated failure handling
Training
Load on S3
Slow, bandwidth limitation, throttle, retry on failures
Before
Pain Point
● Slow model checkpoint writes (can take hours) stall GPUs and delay business insights
● Bursty writes put stress on the UFS
After
Solution
● Alluxio as a write cache for bursty writes
● Alluxio asynchronously persists to the UFS
Targeted use cases
● Temporary file write caching
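The write-cache pattern above — absorb the bursty checkpoint write locally, persist to the UFS in the background — can be sketched as follows. This is a simplified illustration of the idea, not Alluxio's implementation; in practice the training job just writes to an Alluxio path and the async persistence is handled for it:

```python
# Sketch of asynchronous write-back: the caller's write returns as soon
# as the data is in the local cache; a background worker persists it to
# the (slow) UFS later, so the GPU is not stalled by UFS bandwidth.
import queue
import threading

class WriteBackCache:
    def __init__(self, ufs):
        self.ufs = ufs                 # slow backing store (a dict here)
        self.cache = {}                # fast local cache
        self._pending = queue.Queue()
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def write(self, path, data):
        """Fast path: cache the checkpoint and return immediately."""
        self.cache[path] = data
        self._pending.put(path)

    def _drain(self):
        """Background: persist cached writes to the UFS."""
        while True:
            path = self._pending.get()
            self.ufs[path] = self.cache[path]  # simulated slow upload
            self._pending.task_done()

    def flush(self):
        """Block until all pending writes have reached the UFS."""
        self._pending.join()
```

The training loop only ever waits on `write`, which is why checkpoint stalls shrink from hours to the time of a local copy.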
Model Distribution
INFERENCE SERVERS
Model Distribution is Slow, Complex, & Costly
MODEL REPO
Cloud: AWS (S3)
Region: US-WEST
MODEL TRAINING
INFERENCE SERVERS INFERENCE SERVERS
Model Distribution Challenges
● Distributing model files is slow & manual
● Egress & cloud access costs are high
● Orgs have thousands of globally distributed inference servers
● Model files can reach 100GB+
● Model files copied to each inference server
● Model files copied each time model is updated
without Alluxio
UFS bandwidth
Accelerate & Simplify Model Distribution with Alluxio
Cloud: AWS
Region: US-WEST
MODEL REPO
Cloud: AWS (S3)
Region: US-WEST
MODEL TRAINING
Model Distribution Solved
● Fast and automated model file distribution
● Lightning fast model loading & server starts/restarts
● Reduces egress & cloud access costs
● Each region has Alluxio Distributed Cache
● Model files ‘copied’ from Model Repo to the Alluxio Distributed Cache once per region vs. once per server
● Inference servers get new/updated model files from Alluxio
Distributed Cache
INFERENCE SERVERS
Alluxio Distributed Cache
INFERENCE SERVERS
Alluxio Distributed Cache
INFERENCE SERVERS
Alluxio Distributed Cache
UFS bandwidth
Alluxio can saturate
90% of network bandwidth
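The savings are easy to quantify: without a regional cache the model repo serves one copy per inference server; with one, it serves one copy per region. A back-of-the-envelope calculation — the server and region counts below are hypothetical, only the 100 GB model size comes from the slide above:

```python
# Back-of-the-envelope: bytes leaving the model repo per model update,
# with and without a per-region distributed cache. Server and region
# counts are hypothetical; 100 GB is the model size cited on the slide.
MODEL_GB = 100

def repo_egress_gb(servers, regions, cached):
    """GB of egress from the model repo for one model update."""
    copies = regions if cached else servers
    return copies * MODEL_GB

if __name__ == "__main__":
    servers, regions = 1000, 3
    print(repo_egress_gb(servers, regions, cached=False))  # 100000
    print(repo_egress_gb(servers, regions, cached=True))   # 300
```

Egress charges scale with the same ratio, which is where the cost reduction on this slide comes from.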
Model Inference
Model Inference CASE STUDY
Ultra Low Latency Access for Parquet on S3
Pain Point
● Data‑Sync Ops
○ Sync from offline feature store to online feature store
○ A 10‑min delay shows stale offers to users.
● Economic Burden
○ Redis memory ≈ US $4–6 / GB-month vs. S3 at pennies per GB-month.
○ A 2 TB hot set ⇒ a six-figure annual bill.
● Governance Drift
○ ACLs, lineage, backups managed twice.
○ Security must audit S3 IAM and Redis ACLs.
Store    Primary Purpose                         Latency Target             Typical Backend
Offline  Training dataset, historical features   Minutes–hours per batch    Parquet on S3
Online   Real-time GET(id) during inference      ≤ 10 ms                    Redis, DynamoDB
Today’s Dual‑Store Architecture
Features
Models
Model Inference CASE STUDY
Ultra Low Latency Access for Parquet on S3
Simplified Architecture with Alluxio
For point queries, Alluxio achieves < 1 ms Parquet access on S3 using predicate pushdown, which reduces data retrieval to a single RPC.
This enables Parquet on S3 to be used as an online feature store*!
* With Assumptions
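Predicate pushdown means the filter runs where the data lives, so a point lookup returns one row instead of shipping whole files for client-side filtering. The pure-Python sketch below illustrates only the idea; Alluxio's actual mechanism operates on Parquet on S3 and collapses the lookup into a single RPC:

```python
# Toy illustration of predicate pushdown for a point query GET(id):
# the storage layer applies the predicate and returns only matching
# rows, instead of the client downloading the table and filtering it.
TABLE = [  # stand-in for a Parquet file's rows
    {"id": 1, "feature": 0.3},
    {"id": 2, "feature": 0.7},
    {"id": 3, "feature": 0.9},
]

def scan_without_pushdown(table, key):
    """Client-side filter: every row crosses the wire."""
    rows_transferred = len(table)
    match = [r for r in table if r["id"] == key]
    return match, rows_transferred

def scan_with_pushdown(table, key):
    """Storage-side filter: only the matching row crosses the wire."""
    match = [r for r in table if r["id"] == key]
    return match, len(match)
```

The result is the same either way; what changes is the data transferred, which is what dominates latency and S3 request cost at scale.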
                       S3 Express One Zone   EC2: i3en.metal   S3 Standard
List Price/TB/Month    $110*                 $132**            $23***
Data Set Size (TB)     500                   500               500
% of Data Set Stored   100%                  20%               100%
Total Cost/Month       $55,000               $13,200           $11,500
Latency                < 1 ms                < 1 ms            100+ ms
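The monthly totals in the table follow directly from list price × data set size × share of data stored (only the 20% hot subset lives on i3en.metal). A quick arithmetic check:

```python
# Reproduce the monthly totals in the cost table above:
# total = list price per TB-month x data set size x percent stored.
def monthly_cost(price_per_tb, dataset_tb, percent_stored):
    """Monthly bill given a per-TB list price and the share of data kept."""
    return price_per_tb * dataset_tb * percent_stored // 100

print(monthly_cost(110, 500, 100))  # S3 Express One Zone: 55000
print(monthly_cost(132, 500, 20))   # EC2 i3en.metal (20% hot set): 13200
print(monthly_cost(23, 500, 100))   # S3 Standard: 11500
```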
Call For Collaboration!
Proof of concept with a customized interface in a real customer environment
Next Step
● Integrate with upper layers (more query engines / compute frameworks)
to bring this low latency benefit to more applications!
Features
Models
Thank you!
Data
Collection
Data
Preprocessing
Model Training Model
Verification
Model
Distribution
Inference Data Archiving

AI/ML Infra Meetup | From Data Preparation to Inference: How Alluxio Speeds Up AI
