Simplify Data Access for AI in
Multi-Cloud
Bin Fan, VP of Technology @ Alluxio - binfan@alluxio.com
ChanChan Mao, Developer Advocate @ Alluxio - chanchan.mao@alluxio.com
Optimizing PyTorch Training and Serving in Practice
Bin Fan, VP of Technology @ Alluxio
ChanChan Mao, Developer Advocate @ Alluxio
Letʼs recap the Multi-Cloud series so far…
01 Why a Multi-Cloud Strategy Matters for Your AI Platform
● A multi-cloud architecture allows organizations to leverage different cloud services to meet diverse workload demands while maximizing efficiency, reducing costs, and avoiding vendor lock-in.
02 Architecting a Data Platform Across Regions & Clouds
● Efficient data access is key to a successful heterogeneous data platform, whether your data is distributed across multiple datacenters, multiple clouds, or both.
03 Cloud-Native Model Training on Distributed Data
● The evolution of the modern data stack created data-locality challenges for the multi-region/multi-cloud ML pipeline, resulting in high latency, expensive cloud costs, and poor reliability.
Data Access Patterns of AI/ML
Cloud Data Access Patterns: Multi-Cloud/Multi-Region Data Access
Hybrid/Multi-Cloud ML Platforms
[Diagram: an offline training platform (training cluster) in DC/Cloud A and an online ML platform (serving cluster) in DC/Cloud B. (1) Training data feeds the training cluster; (2) the training cluster produces models; (3) the serving cluster pulls the models.]
Separation of compute and storage
Challenges of Data Access
1. Performance
● Pulling data from cloud storage hurts training/serving performance.
2. Cost
● Repeatedly requesting data from cloud storage is costly.
3. Reliability
● Availability is key for every service in the cloud.
4. Usability
● Manual data management is a burden on users.
Existing Solutions
Data access:
1. Read data directly from cloud storage
2. Copy data from cloud to local before training
3. Local cache layer for data reuse
4. Distributed cache system
Model access:
1. Pull models directly from cloud storage
Option 1: Read From Cloud Storage
● Easy to set up
● Performance is not ideal
■ Model access: models are repeatedly pulled from cloud storage
■ Data access: reading data can take more time than the actual training
[Chart: 82% of the time is spent by the DataLoader]
Option 2: Copy Data To Local Before Training
● Data is now local
■ Faster access + lower cost
● Management is hard
■ Training data must be deleted manually after use
● Local storage space is limited
■ If the dataset is huge, the benefit is limited
Option 3: Local Cache for Data Reuse
Examples: S3FS built-in local cache, Alluxio FUSE SDK
● Reused data is local
■ Faster access + lower cost
● The cache layer handles data management
■ No manual deletion/supervision
● Cache space is limited
■ If the dataset is huge, the benefit is limited
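The payoff of a local cache layer can be sketched with a minimal in-process cache. This is an illustration only, not Alluxio's implementation: `fetch_from_cloud` is a hypothetical stand-in for an expensive cloud storage read, and the bounded `lru_cache` plays the role of the limited local cache space.

```python
from functools import lru_cache

CLOUD_READS = {"count": 0}  # counts simulated cloud round-trips

def fetch_from_cloud(key: str) -> bytes:
    """Hypothetical stand-in for an expensive cloud storage read."""
    CLOUD_READS["count"] += 1
    return f"object-{key}".encode()

@lru_cache(maxsize=128)  # the "local cache layer": bounded, evicts least-recently-used
def read(key: str) -> bytes:
    return fetch_from_cloud(key)

# The first epoch pays the cloud cost; later epochs hit the cache.
for _ in range(3):              # 3 "epochs"
    for k in ("a", "b", "c"):   # same small dataset each epoch
        read(k)

print(CLOUD_READS["count"])  # 3 cloud reads instead of 9
```

If the dataset is larger than `maxsize`, entries are evicted before they can be reused, which is exactly the "cache space is limited" drawback above.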
Option 4: Distributed Cache System
[Diagram: clients in front of a pool of cache workers]
● Training data and trained models can be kept in a distributed cache.
● Typically comes with data management functionality.
Alluxio as an Example
Consistent Hashing for Caching
[Architecture diagram: clients, masters, and a pool of workers]
● Uses consistent hashing to cache both data and metadata on workers.
● Worker nodes have plenty of cache space; training data and models only need to be pulled once from cloud storage. Cost --
● No more single point of failure. Reliability ++
● No more performance bottleneck on masters. Performance ++
● Includes a data management system.
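The worker-selection idea above can be sketched in plain Python. This is a simplified illustration, not Alluxio's actual code: real deployments use their own hash functions and tuning, and the worker names and vnode count here are made up.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Toy consistent-hash ring: each worker gets several virtual nodes
    so keys spread evenly, and adding/removing a worker only remaps the
    keys adjacent to it on the ring."""

    def __init__(self, workers, vnodes=100):
        self.ring = []  # sorted list of (hash, worker)
        for w in workers:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{w}#{i}"), w))
        self.ring.sort()
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s: str) -> int:
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def worker_for(self, path: str) -> str:
        """Pick the first virtual node clockwise from the key's hash."""
        i = bisect.bisect(self.keys, self._hash(path)) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentHashRing(["worker-1", "worker-2", "worker-3"])
print(ring.worker_for("s3://bucket/train/part-00001.parquet"))
```

Because placement is a pure function of the key and the worker set, every client independently computes the same worker for the same file, with no master in the lookup path.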
By the Numbers
● High Scalability
■ One worker supports 30-50 million files
■ Scales linearly; 10+ billion files are easy to support
● High Availability
■ 99.99% uptime
■ No single point of failure
● High Performance
■ Faster data loading
● Cloud-native K8s Operator and CSI-FUSE for data access management
Alluxio FUSE
● Exposes the Alluxio file system as a local file system.
● Cloud storage can be accessed just like local storage.
○ cat, ls
○ f = open("a.txt", "r")
● Very low friction for end users
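The point of the FUSE mount is that training code keeps using plain file APIs. A minimal sketch, with a temporary directory standing in for an Alluxio FUSE mount point (a path such as `/mnt/alluxio` is an assumption; no Alluxio SDK is involved here):

```python
import os
import tempfile

# Stand-in for an Alluxio FUSE mount (e.g. /mnt/alluxio; path is illustrative):
# once mounted, cloud data is read with ordinary file operations.
mount = tempfile.mkdtemp(prefix="alluxio-fuse-demo-")

# Pretend the dataset already exists under the mount.
with open(os.path.join(mount, "a.txt"), "w") as f:
    f.write("hello from cloud storage")

# Training code needs no special SDK -- just open()/read(),
# or `cat`/`ls` from a shell.
with open(os.path.join(mount, "a.txt"), "r") as f:
    data = f.read()

print(os.listdir(mount))  # the equivalent of `ls` on the mount
print(data)
```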
Alluxio CSI x Alluxio FUSE for Data Access
● FUSE: turns a remote dataset in the cloud into a local folder for training
● CSI: launches the Alluxio FUSE pod only when the dataset is needed
[Diagram: on the host machine, an Alluxio FUSE pod (FUSE container) and an application pod (application container) both mount a persistent volume + claim]
Data Access Management for PyTorch
Integration with PyTorch Training (Alluxio)
[Diagram: a training node runs PyTorch with an Alluxio client (cache client); the cache cluster consists of a service registry and cache workers. (1) Get cluster info from the service registry; (2) get task info; (3) find worker(s) via an affinity block-location policy with client-side load balancing; (4) execute the task on a cache worker; (5) send the result back. On a cache miss, the task falls through to under storage.]
Data Loading Performance
[Charts: data loading benchmarks on an ImageNet subset and the Yelp review dataset]
GPU Utilization Improvement
Training directly from storage (S3-FUSE):
- Over 80% of total time is spent in the DataLoader
- Results in a low GPU utilization rate (<20%)
Training with Alluxio-FUSE:
- Reduced the DataLoader share of time from 82% to 1% (82x)
- Increased the GPU utilization rate from 17% to 93% (~5x)
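Figures like "82% of time in the DataLoader" come from timing data loading separately from compute. A generic sketch of that measurement, where `load_batch` and `train_step` are made-up stubs (sleeps standing in for real I/O and GPU work) rather than PyTorch calls:

```python
import time

def load_batch():
    """Stand-in for fetching the next batch (I/O bound)."""
    time.sleep(0.02)

def train_step():
    """Stand-in for the forward/backward pass on the GPU."""
    time.sleep(0.005)

load_t = compute_t = 0.0
for _ in range(10):  # 10 training steps
    t0 = time.perf_counter(); load_batch();  load_t    += time.perf_counter() - t0
    t0 = time.perf_counter(); train_step();  compute_t += time.perf_counter() - t0

frac = load_t / (load_t + compute_t)
print(f"DataLoader share of step time: {frac:.0%}")
```

When this fraction is high, the GPU idles waiting for data, which is why cutting the DataLoader share directly raises GPU utilization.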
How to Enable Python Applications
Using the Alluxio-Ray integration as an example
[Diagram: the Ray Dataloader uses PyArrow Dataset loading on top of an fsspec Alluxio implementation, which calls the Alluxio Python client; the client talks to the Alluxio Worker REST API servers. Workers register themselves in etcd, and the client gets worker addresses from etcd.]
Alluxio+Ray Benchmark – Small Files
● Dataset
○ 130 GB ImageNet dataset
● Process settings
○ 4 training workers
○ 9 reader processes
● Active object store memory
○ 400-500 MiB
Alluxio+Ray Benchmark – Large Parquet Files
● Dataset
○ 200 MiB files, adding up to 60 GiB
● Process settings
○ 28 training workers
○ 28 reader processes
● Active object store memory
○ 20-30 GiB
Cost Saving – Egress/Data Transfer Fees
Cost Saving – API Calls/S3 Operations (List, Get)
List/Get API calls hit only Alluxio, not the underlying cloud storage.
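A back-of-envelope view of the egress saving: pull the dataset once into the cache instead of on every read. All numbers below (dataset size, read frequency, per-GB rate) are illustrative assumptions, not Alluxio or cloud-provider figures.

```python
# Illustrative numbers only: rates and sizes are assumptions for the sketch.
dataset_gb  = 1_000   # training dataset size in GB
epochs      = 30      # times the dataset is read per month
egress_rate = 0.09    # $/GB cross-region data transfer (assumed)

without_cache = dataset_gb * epochs * egress_rate  # every read crosses regions
with_cache    = dataset_gb * 1 * egress_rate       # pulled once, then served from cache

print(f"without cache: ${without_cache:,.0f}/mo")
print(f"with cache:    ${with_cache:,.0f}/mo")
print(f"savings:       ${without_cache - with_cache:,.0f}/mo")
```

The same shape of arithmetic applies to List/Get request fees: repeated API calls are replaced by cache hits after the first pull.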
Use Cases
Alluxio Benefits
● 30-50%: reduce 30%+ of time compared to consuming from cloud object storage; avoid 50%+ of data copies
● 90%+: stable GPU utilization no matter where the GPU cluster starts
● Manages the ongoing training dataset from cold storage
● Serves data to GPUs with advanced caching capability
● Virtual layer across different storage systems
Use case: autonomous driving
Any Questions?
Scan the QR code for a Linktree including great learning resources, exciting meetups & a community of data & AI infra experts!
Thank you!
Up Next:
Awesome AI Dev Tools - Mon May 20 @ GitHub, SF
https://lu.ma/s2ghbk5i
Speak at an Alluxio event:
https://forms.gle/iJX9GTMaAVQdzKc28

Alluxio Monthly Webinar | Simplify Data Access for AI in Multi-Cloud
