From Data Preparation to Inference:
How Alluxio Speeds Up AI
Jingwen Ouyang
Sr Product Manager, Alluxio
June 2025
AI Has Transformed Daily Life
Traditional Search & Recommendation
Things that people constantly buy
that match the search keywords
What to buy?
Top N items
What to buy?
Top N items personalized
Inference Engine
User Profile
Get User
features
Characteristics of Items
Get items
features
Scores, Ranks
Get model
Model Repository
AI Has Transformed Daily Life
Personalized Search & Recommendation
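The flow on this slide — fetch user features and item features, load a model, then score and rank the top-N items — can be sketched in a few lines. Everything below (the feature vectors and the linear scoring model) is a hypothetical stand-in for illustration, not the API of any real inference engine:

```python
# Minimal sketch of the inference-engine flow: look up user and item
# features, score each item with a model, and return the top-N items.
# All names and numbers here are illustrative placeholders.

def score(user_features, item_features, weights):
    """Toy linear model: dot product over concatenated features."""
    feats = user_features + item_features
    return sum(w * f for w, f in zip(weights, feats))

def top_n(user_features, catalog, weights, n=3):
    """Rank every item in the catalog for this user; keep the N best."""
    ranked = sorted(
        catalog.items(),
        key=lambda kv: score(user_features, kv[1], weights),
        reverse=True,
    )
    return [item_id for item_id, _ in ranked[:n]]

if __name__ == "__main__":
    user = [1.0, 0.5]                 # e.g. activity level, age bucket
    catalog = {
        "item_a": [0.9, 0.1],
        "item_b": [0.2, 0.6],
        "item_c": [0.7, 0.7],
    }
    weights = [0.5, 0.5, 1.0, 1.0]    # "model" fetched from a repository
    print(top_n(user, catalog, weights, n=2))
```

In a production system the feature lookups and the model fetch each hit a different store, which is exactly where data-access latency enters the picture.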
AI Data Life Cycle
Data
Collection
Data
Preprocessing
Model
Training
Model
Verification
Model
Loading
Inference
Data
Archiving
Data is everywhere in every stage of the journey.
Data needs to be accessed fast, friction-free, and at low cost.
Alluxio makes it easy to share and
manage data from
any storage
to any compute engine
in any environment
with high performance and low cost.
Alluxio Data Platform
Distributed caching layer close to the compute engines for low-latency, high-throughput data access
Various API support (S3, POSIX, HDFS, etc.)
Connectors for different UFSs provide a unified global view
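The "unified global view" can be pictured as a mount table that maps logical paths in one namespace to different underlying storage systems. The sketch below is a toy illustration of the idea only, not Alluxio's implementation; the /s3 and /onprem paths mirror the unified-namespace example in the data-federation case study:

```python
# Toy mount table illustrating a unified namespace: logical paths under
# one tree resolve to different underlying file systems (UFSs).
MOUNTS = {
    "/s3": "s3://bucket/data",              # public-cloud object store
    "/onprem": "hdfs://nn:8020/warehouse",  # on-prem HDFS (hypothetical)
}

def resolve(logical_path):
    """Translate a logical path to its UFS location (longest-prefix match)."""
    for mount in sorted(MOUNTS, key=len, reverse=True):
        if logical_path == mount or logical_path.startswith(mount + "/"):
            return MOUNTS[mount] + logical_path[len(mount):]
    raise FileNotFoundError(logical_path)
```

The point of the real system is that compute engines see only the logical tree, so data can move between UFSs without application changes.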
Alluxio Technology Journey
Open source started from UC Berkeley AMPLab in 2014
2014 (Big Data Analytics): Alluxio open source project founded at UC Berkeley AMPLab; Baidu deploys 1000+ node cluster
2019 (Cloud Adoption): Alluxio scales to 1 billion files; 7/10 top internet brands accelerated by Alluxio; Meta accelerates Presto workloads; 1000+ OSS contributors
2023 (Generative AI): AliPay accelerates model training; 9/10 top internet brands accelerated by Alluxio; Zhihu accelerates LLM model training
2024: Alluxio scales to 10+ billion files; leading e-commerce brand accelerates model training; Fortune 5 brand accelerates model training
Case Studies
Zhihu
TELCO & MEDIA
E-COMMERCE
FINANCIAL SERVICES
TECH & INTERNET
OTHERS
Data Preprocessing
DATACENTER 2 DATACENTER 1
UNIFIED
NAMESPACE
/s3/path
/onprem/path
DATACENTER 3 PUBLIC CLOUDS
DATA
ENGINEERING
OR
Data Science
Data Lakehouse
Data Preprocessing CASE STUDY:
Data Federation Across Clouds / IDCs
BUSINESS BENEFIT:
53% reduction in infrastructure and operations
TECH BENEFIT:
● Sped up data
migration to cloud
● Enabled compute
burst to cloud
Hive
Model Training
BUSINESS BENEFIT:
TECH BENEFIT:
GPU utilization increased from 50% to 93%
HDFS
Training
Data
Training
Data
Models
Training
Data
Models
Model
Training
Model
Training
Model
Deployment
Model
Inference
Downstream
Applications
Model
Update
Training Clouds Offline Cloud Online Cloud
Model Training CASE STUDY 1:
High Performance AI Platform for LLM
2 - 4X faster
time-to-market
● Hybrid cloud, cross-region
● UFS throttling, limited throughput
● GPUs stay idle while waiting for data
Model Training CASE STUDY 2:
High Performance AI Platform for Search & Recommendation
Challenges - Manual Bottleneck
⚙ Data Prep Delay
• Copy & validation introduce 1-day delay on 20TB datasets
• Manual prep & staging required in each region
🌐 Cross-Region Cost
• Large datasets incur repeated cross-region egress charges
📦 Storage Ops Pain
• Manual storage management, coordination across teams
Model Training CASE STUDY 2:
High Performance AI Platform for Search & Recommendation
Benefits w/ Alluxio - Clean & Streamlined Process
🚀 Performance
• Outperforms traditional HPC storage (e.g., 40% better than
AWS FSx Lustre)
• No prep delay (instant data from cache)
• Global view enables overflow training with no code change
💰 Cost & Resource Savings
• 50%+ reduction in S3 calls and egress
• No redundant datasets across storage systems or teams
⚙ Ops Simplicity
• No manual cleanup
• K8s-based scaling
• Seamless across clouds
Model Training Case Study 3
Model Checkpoint Write
Active Training
Training Paused
for Checkpoint Creation
Training Complete
Training
Load on S3
Load on Alluxio
Alluxio overcomes the UFS bandwidth limits
Reduced load on S3, sophisticated failure handling
Training
Load on S3
Slow, bandwidth limitation, throttle, retry on failures
Before
Pain Point
● Slow model checkpoint writes (can take hours) stall GPUs and delay business insights
● Bursty writes put stress on the UFS
After
Solution
● Alluxio as a write cache for bursty writes
● Alluxio asynchronously persists to the UFS
Targeted use cases
● Temporary file write caching
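The write-cache pattern above — absorb the bursty checkpoint write locally, persist to the UFS in the background — can be sketched as follows. This is a simplified illustration of the idea, not Alluxio's implementation; in practice the training job just writes to an Alluxio path and the async persistence is handled for it:

```python
# Sketch of asynchronous write-back: the caller's write returns as soon
# as the data is in the local cache; a background worker persists it to
# the (slow) UFS later, so the GPU is not stalled by UFS bandwidth.
import queue
import threading

class WriteBackCache:
    def __init__(self, ufs):
        self.ufs = ufs                 # slow backing store (a dict here)
        self.cache = {}                # fast local cache
        self._pending = queue.Queue()
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def write(self, path, data):
        """Fast path: cache the checkpoint and return immediately."""
        self.cache[path] = data
        self._pending.put(path)

    def _drain(self):
        """Background: persist cached writes to the UFS."""
        while True:
            path = self._pending.get()
            self.ufs[path] = self.cache[path]  # simulated slow upload
            self._pending.task_done()

    def flush(self):
        """Block until all pending writes have reached the UFS."""
        self._pending.join()
```

The training loop only ever waits on `write`, which is why checkpoint stalls shrink from hours to the time of a local copy.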
Model Distribution
INFERENCE SERVERS
Model Distribution is Slow, Complex, & Costly
MODEL REPO
Cloud: AWS (S3)
Region: US-WEST
MODEL TRAINING
INFERENCE SERVERS INFERENCE SERVERS
Model Distribution Challenges
● Distributing model files is slow & manual
● Egress & cloud access costs are high
● Orgs have thousands of globally distributed inference servers
● Model files can reach 100GB+
● Model files copied to each inference server
● Model files copied each time model is updated
without Alluxio
UFS bandwidth
Accelerate & Simplify Model Distribution with Alluxio
Cloud: AWS
Region: US-WEST
MODEL REPO
Cloud: AWS (S3)
Region: US-WEST
MODEL TRAINING
Model Distribution Solved
● Fast and automated model file distribution
● Lightning fast model loading & server starts/restarts
● Reduces egress & cloud access costs
● Each region has Alluxio Distributed Cache
● Model files ‘copied’ from Model Repo to the Alluxio Distributed Cache once per region vs. once per server
● Inference servers get new/updated model files from Alluxio
Distributed Cache
INFERENCE SERVERS
Alluxio Distributed Cache
INFERENCE SERVERS
Alluxio Distributed Cache
INFERENCE SERVERS
Alluxio Distributed Cache
UFS bandwidth
Alluxio can saturate
90% of network bandwidth
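The savings are easy to quantify: without a regional cache the model repo serves one copy per inference server; with one, it serves one copy per region. A back-of-the-envelope calculation — the server and region counts below are hypothetical, only the 100 GB model size comes from the slide above:

```python
# Back-of-the-envelope: bytes leaving the model repo per model update,
# with and without a per-region distributed cache. Server and region
# counts are hypothetical; 100 GB is the model size cited on the slide.
MODEL_GB = 100

def repo_egress_gb(servers, regions, cached):
    """GB of egress from the model repo for one model update."""
    copies = regions if cached else servers
    return copies * MODEL_GB

if __name__ == "__main__":
    servers, regions = 1000, 3
    print(repo_egress_gb(servers, regions, cached=False))  # 100000
    print(repo_egress_gb(servers, regions, cached=True))   # 300
```

Egress charges scale with the same ratio, which is where the cost reduction on this slide comes from.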
Model Inference
Model Inference CASE STUDY
Ultra Low Latency Access for Parquet on S3
Pain Point
● Data‑Sync Ops
○ Sync from offline feature store to online feature store
○ A 10‑min delay shows stale offers to users.
● Economic Burden
○ Redis memory ≈ US $4–6 / GB-month vs. S3 at pennies per GB-month.
○ A 2 TB hot set ⇒ a six-figure annual bill.
● Governance Drift
○ ACLs, lineage, backups managed twice.
○ Security must audit S3 IAM and Redis ACLs.
Store    Primary Purpose                         Latency Target             Typical Backend
Offline  Training dataset, historical features   Minutes–hours per batch    Parquet on S3
Online   Real-time GET(id) during inference      ≤ 10 ms                    Redis, DynamoDB
Today’s Dual‑Store Architecture
Features
Models
Model Inference CASE STUDY
Ultra Low Latency Access for Parquet on S3
Simplified Architecture with Alluxio
For point queries, Alluxio achieves < 1 ms Parquet access on S3 using predicate pushdown, which reduces data retrieval to a single RPC.
This enables Parquet on S3 to be used as an online feature store*!
* With Assumptions
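Predicate pushdown means the filter runs where the data lives, so a point lookup returns one row instead of shipping whole files for client-side filtering. The pure-Python sketch below illustrates only the idea; Alluxio's actual mechanism operates on Parquet on S3 and collapses the lookup into a single RPC:

```python
# Toy illustration of predicate pushdown for a point query GET(id):
# the storage layer applies the predicate and returns only matching
# rows, instead of the client downloading the table and filtering it.
TABLE = [  # stand-in for a Parquet file's rows
    {"id": 1, "feature": 0.3},
    {"id": 2, "feature": 0.7},
    {"id": 3, "feature": 0.9},
]

def scan_without_pushdown(table, key):
    """Client-side filter: every row crosses the wire."""
    rows_transferred = len(table)
    match = [r for r in table if r["id"] == key]
    return match, rows_transferred

def scan_with_pushdown(table, key):
    """Storage-side filter: only the matching row crosses the wire."""
    match = [r for r in table if r["id"] == key]
    return match, len(match)
```

The result is the same either way; what changes is the data transferred, which is what dominates latency and S3 request cost at scale.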
                       S3 Express One Zone   EC2: i3en.metal   S3 Standard
List Price/TB/Month    $110*                 $132**            $23***
Data Set Size (TB)     500                   500               500
% of Data Set Stored   100%                  20%               100%
Total Cost/Month       $55,000               $13,200           $11,500
Latency                < 1 ms                < 1 ms            100+ ms
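The monthly totals in the table follow directly from list price × data set size × share of data stored (only the 20% hot subset lives on i3en.metal). A quick arithmetic check:

```python
# Reproduce the monthly totals in the cost table above:
# total = list price per TB-month x data set size x percent stored.
def monthly_cost(price_per_tb, dataset_tb, percent_stored):
    """Monthly bill given a per-TB list price and the share of data kept."""
    return price_per_tb * dataset_tb * percent_stored // 100

print(monthly_cost(110, 500, 100))  # S3 Express One Zone: 55000
print(monthly_cost(132, 500, 20))   # EC2 i3en.metal (20% hot set): 13200
print(monthly_cost(23, 500, 100))   # S3 Standard: 11500
```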
Call For Collaboration!
Proof of concept with a customized interface in a real customer environment
Next Step
● Integrate with upper layers (more query engines / compute frameworks)
to bring this low latency benefit to more applications!
Features
Models
Thank you!
Data
Collection
Data
Preprocessing
Model Training Model
Verification
Model
Distribution
Inference Data Archiving

AI/ML Infra Meetup | From Data Preparation to Inference: How Alluxio Speeds Up AI
