Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI Problems

Webinar: Why NFS/NAS on Object Storage May Not Solve Your AI Problems

Dr. Beinan Wang
Senior Staff Engineer @ Alluxio, Trino Contributor, PrestoDB Committer
Tarik Bennett
Senior Solutions Engineer @ Alluxio

“Global spending on public cloud services is forecast to increase 20.4% in
2024, and similarly to 2023, the source of growth will be combination of cloud
vendor price increases and increased utilization.”
1. Some organizations in production
2. Many organizations in early stages
Source: Gartner 2023

Seeking Scalable, Sustainable Performance
Many organizations are training in the cloud.
And many are expecting data volumes and cloud usage to rise in the next year.
We've seen many teams developing with data sizes that currently fit in memory
while preparing for workloads that will be much larger in 12 months.

Make it run, make it right, make it fast
AI/ML Development Stages
Source: Data Driven Science
1. In production, but reducing inefficiencies
2. Optimizing early architectures
Problem Definition, Data Collection, Data Preparation, Data Visualization, AI/ML Modeling, Feature Engineering, Model Deployment
✓ X X X X X ✓

Scenarios We’ll Address Today
- Some additional hardware purchases
- Greenfield deployments
- Working with existing tech stacks

Critical Components
Compute: GPUs (on-prem, cloud, and remote aggregators)
Networking: Ethernet, InfiniBand, etc.
Storage: S3, object storage, data lakes, data centers
*Data Access: Data serving, backing stores (NFS / NAS), Alluxio
*GPUs are getting faster and datasets for model training are growing larger. We propose that data access,
including throughput and data loading efficiency, is another core component of forward-looking architectures.

Critical Components with Data Access
Compute: GPUs (on-prem, cloud, and remote aggregators)
Networking: Ethernet, InfiniBand, etc.
Storage: S3, object storage, data lakes, data centers
Data Access: Data serving, backing stores (NFS / NAS), Alluxio
Shuttling data effectively from storage to training sets is our topic of discussion today.

Common Issues in Pre-Production Architectures
1. Model training efficiency below expectations
2. Bottlenecks around data synchronization
3. Concurrency and metadata issues
4. Slow data access or low GPU utilization
Teams are managing:
- Slow-I/O storage that serves high-performance GPUs
- Workflows that include manual replication
- Multiple data sources (e.g. hybrid infra, multiple clouds)
There can be many sources of bottlenecks in data pipelines

Storage IOPS vs GPU Memory Bandwidth
Source: Nvidia, MinIO
Storage IOPS = (total read operations + total write operations) / time (in seconds)
● Throughput - number of bits read or written per second
○ MinIO - 16.3 GB/sec avg read throughput on a 24-node cluster
GPU Memory Bandwidth
● “The H100 SXM5 GPU raises the bar considerably… delivering over
3 TB/sec of memory bandwidth, effectively a 2x increase over the
memory bandwidth of A100 that was launched just two years ago.
The PCIe H100 provides 80 GB of fast HBM2e with over 2 TB/sec of
memory bandwidth.”
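To put the two quoted figures side by side, here is a rough back-of-the-envelope sketch. It uses only the numbers on this slide, assumes decimal units (1 TB = 1000 GB), and treats the "over 3 TB/sec" and "over 2 TB/sec" figures as lower bounds. The function names are illustrative, and the result is an order-of-magnitude comparison, not a benchmark.

```python
# Figures quoted on the slide (decimal units assumed: 1 TB = 1000 GB).
MINIO_READ_GB_PER_S = 16.3   # avg read throughput of the 24-node MinIO cluster
MINIO_NODES = 24
H100_SXM5_TB_PER_S = 3.0     # "over 3 TB/sec", used here as a lower bound
H100_PCIE_TB_PER_S = 2.0     # "over 2 TB/sec", used here as a lower bound


def bandwidth_gap(gpu_tb_per_s: float, storage_gb_per_s: float) -> float:
    """Return how many times larger GPU memory bandwidth is than storage throughput."""
    return (gpu_tb_per_s * 1000.0) / storage_gb_per_s


def per_node_throughput(cluster_gb_per_s: float, nodes: int) -> float:
    """Return the average read throughput per storage node, in GB/s."""
    return cluster_gb_per_s / nodes


if __name__ == "__main__":
    sxm5 = bandwidth_gap(H100_SXM5_TB_PER_S, MINIO_READ_GB_PER_S)
    pcie = bandwidth_gap(H100_PCIE_TB_PER_S, MINIO_READ_GB_PER_S)
    node = per_node_throughput(MINIO_READ_GB_PER_S, MINIO_NODES)
    print(f"H100 SXM5 memory bandwidth vs cluster read throughput: ~{sxm5:.0f}x")
    print(f"H100 PCIe memory bandwidth vs cluster read throughput: ~{pcie:.0f}x")
    print(f"Average read throughput per MinIO node: ~{node:.2f} GB/s")
```

Even with a whole storage cluster feeding a single GPU, the gap is roughly two orders of magnitude, which is why the data path between storage and the GPU tends to become the bottleneck.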

How are These Issues Being Addressed?
1. Many are attempting to resolve slow data access with faster storage
a. Cloud vendor options
b. Specialized hardware vendors
2. Adding NAS / NFS as backing stores for S3, MinIO, Ceph, etc.
a. Data sharing and collaboration
b. Scalability
c. Data consistency
d. Simplified management

Problems with Existing Options
1. Faster storage hardware means data migration, even if hidden
a. The data must be moved onto the new storage before access speeds up
b. Data migration into new storage
c. Non-transparent storage
2. NFS/NAS: Maintenance and bottlenecks
a. Stability, reliability, and bottlenecks
b. Manual copies
c. Duplicating data from local storage
d. Data syncing issues from remote storage
3. What if vendor changes are required for business reasons?
a. Potential downtime while migrating from the source of truth
b. GPU scarcity and cloud agreements may increase this likelihood

Drawbacks of Data Migration
1. Data Transfer Bottlenecks
a. Data volume and transfer speed
b. Risk of data corruption or loss
2. Operational Downtime
a. Service interruptions during migration
b. Impact on research and development timelines

Before: Inefficient Loading from S3 to GPU via NAS
[Diagram: S3 Object Store (Cloud), NAS (On-Prem)]
$ aws s3 sync <bucket> <nas>

Consider Data Abstraction and Distributed Caching
High Performance Caching
1. High throughput for model training
2. High concurrency for model serving
3. Automatic data and metadata syncing
4. Automatic fallback to data lake
5. Data abstraction and transparency
6. Reduced hardware dependency
7. Pre-load data
8. Cache on query
[Diagram: Data Sources, High Performance Data Access Layer, Compute]
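The "pre-load data", "cache on query", and "automatic fallback to data lake" items above can be illustrated with a generic read-through cache. This is a minimal single-process sketch, not Alluxio's API or implementation: the `ReadThroughCache` class, the `fetch` function, and the in-memory `remote_objects` dictionary standing in for the data lake are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List


class ReadThroughCache:
    """Toy in-memory cache placed in front of a slower backing store."""

    def __init__(self, fetch_from_source: Callable[[str], bytes]) -> None:
        self._fetch = fetch_from_source   # hypothetical reader for the data lake
        self._cache: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    def preload(self, keys: Iterable[str]) -> None:
        """Pre-load: warm the cache before training starts."""
        for key in keys:
            if key not in self._cache:
                self._cache[key] = self._fetch(key)

    def read(self, key: str) -> bytes:
        """Cache on query: serve from cache, otherwise fall back to the source."""
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        data = self._fetch(key)           # fallback to the data lake
        self._cache[key] = data
        return data


# Stand-in for remote object storage, recording every remote request made.
remote_objects = {"train/0.jpg": b"img0", "train/1.jpg": b"img1"}
remote_requests: List[str] = []


def fetch(key: str) -> bytes:
    remote_requests.append(key)
    return remote_objects[key]


cache = ReadThroughCache(fetch)
cache.preload(["train/0.jpg"])            # one object is warmed ahead of time
for _epoch in range(3):                   # three passes over the same data
    for key in remote_objects:
        cache.read(key)

print(f"hits={cache.hits} misses={cache.misses} remote_requests={len(remote_requests)}")
```

Across three epochs each object is fetched from the remote store only once; every later read is served locally, which is the effect the list describes as fewer requests to remote storage.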

How might Alluxio work with your architecture?
- Co-located w/ NAS
- Standalone: High Performance Data Access Layer; data from multiple sources served to GPU nodes
- Virtual Caching Across Local GPU Storage: data from S3 synced to Virtual Alluxio Storage and shared between GPU nodes

What problems are addressed by Alluxio?
Increasing Capacity
● Serves training data sizes too large to fit on a single node
● Serves only active data from data lake or source of truth
● Supports performance as data volumes increase
Reducing Data Management
● Reduces management of manual copies
● Distributes data efficiently across nodes
● Reduces syncing issues
Improving Performance
● Addresses I/O limits and throttling from storage
● Improves GPU utilization
● Reduces requests from remote data storage
● 50 million objects per node

Benefits
Optimizes data loading for training and model serving
Less maintenance. No manual copies
Support for scaling
Faster switchovers
No hardware
No data migration

Alluxio on AWS - Reference Architecture
[Diagram labels: Model Training, Model Serving, Training cluster, Inference cluster, Alluxio, Training Data, Models, GCP; steps numbered 1-5]

AI Training Test with Alluxio
[Diagram labels: Kubernetes, Interactive Notebook, Alluxio Operator, Visualization Dashboard, GPU Training, Local Folder /dataset, Alluxio, Remote Storage]
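Because the test setup exposes the data as a local folder, training code can read it with ordinary file I/O instead of a storage SDK. The sketch below illustrates that idea using only the Python standard library. The `iter_samples` helper is hypothetical, and a temporary directory stands in for the mounted `/dataset` folder, so nothing here depends on Alluxio being installed.

```python
import os
import tempfile
from typing import Iterator, Tuple


def iter_samples(root: str) -> Iterator[Tuple[str, bytes]]:
    """Yield (relative_path, file_bytes) for every file under root, in sorted order."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()                       # make traversal order deterministic
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            with open(full, "rb") as f:       # plain POSIX read, no storage SDK
                yield os.path.relpath(full, root), f.read()


# Demo: a temporary directory stands in for the mounted /dataset folder.
demo_root = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_root, "cat"))
with open(os.path.join(demo_root, "cat", "0.jpg"), "wb") as f:
    f.write(b"fake-image-bytes")

samples = list(iter_samples(demo_root))
print(samples)
```

In a real run the same loop would point at the mounted folder, and a data loader would decode each file into a tensor; the training code itself would not need to change when the backing storage does.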

Before using Alluxio
GPU Utilization Rate ~17%
DataLoader accounts for ~82% of total time
GPU Summary
Name: Tesla T4
Memory: 14.62 GB
Compute Capability: 7.5
GPU Utilization: 16.96%
Est. SM Efficiency: 16.91%
Est. Achieved Occupancy: 68.75%
Kernel Time using Tensor Cores: 0.0%
Category            Pct (%)   Time Duration (us)
Average Step Time   100       1,763,649,145
Kernel              16.96     299,168,905
Memcpy              0.6       10,521,722
Memset              0         39,459
Runtime             0.17      3,043,169
DataLoader          81.99     1,446,068,956
CPU Exec            0.09      1,570,076
Other               0.18      3,245,858
Resnet-50, 3 epochs, S3 Fuse

After using Alluxio
GPU Utilization Rate Increased from 17% to 93%
DataLoader share of total time reduced to ~1%
GPU Summary
Name: Tesla T4
Memory: 14.62 GB
Compute Capability: 7.5
GPU Utilization: 93.29%
Est. SM Efficiency: 92.98%
Est. Achieved Occupancy: 68.03%
Kernel Time using Tensor Cores: 0.0%
Category            Pct (%)   Time Duration (us)
Average Step Time   100       334,274,946
Kernel              93.29     311,847,023
Memcpy              3.14      10,500,126
Memset              0.01      43,946
Runtime             1.17      3,899,241
DataLoader          1         3,343,301
CPU Exec            0.49      1,648,391
Other               0.9       2,992,918
Resnet-50, 3 epochs, S3 Fuse

Application Interface: Alluxio-FUSE
Alluxio-FUSE: 17 min total training time (3 epochs), 93% GPU utilization (TensorBoard)
S3-FUSE: 85 min total training time (3 epochs), 17% GPU utilization (TensorBoard)
Alluxio is 5 times faster than S3-FUSE
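The headline numbers can be cross-checked against the two profiler tables. This short sketch recomputes the shares and ratios from the reported durations. The values are copied from the slides; the helper names `share` and `speedup` are illustrative.

```python
# Durations in microseconds, copied from the two profiler tables.
BEFORE = {"step": 1_763_649_145, "kernel": 299_168_905, "dataloader": 1_446_068_956}
AFTER = {"step": 334_274_946, "kernel": 311_847_023, "dataloader": 3_343_301}

# Total training time for 3 epochs, in minutes, from the comparison slide.
S3_FUSE_MINUTES = 85
ALLUXIO_FUSE_MINUTES = 17


def share(part: int, total: int) -> float:
    """Return part as a percentage of total."""
    return 100.0 * part / total


def speedup(before: float, after: float) -> float:
    """Return how many times smaller 'after' is than 'before'."""
    return before / after


if __name__ == "__main__":
    print(f"DataLoader share before: {share(BEFORE['dataloader'], BEFORE['step']):.2f}%")
    print(f"DataLoader share after:  {share(AFTER['dataloader'], AFTER['step']):.2f}%")
    print(f"Kernel share after:      {share(AFTER['kernel'], AFTER['step']):.2f}%")
    print(f"Profiled time ratio:     {speedup(BEFORE['step'], AFTER['step']):.2f}x")
    print(f"Training time ratio:     {speedup(S3_FUSE_MINUTES, ALLUXIO_FUSE_MINUTES):.1f}x")
```

Kernel time is nearly identical in both runs (about 299 s versus 312 s), so almost all of the improvement comes from time no longer spent waiting in the DataLoader.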

Q&A
twitter.com/alluxio slackin.alluxio.io
linkedin.com/alluxio
www.alluxio.io
JOIN THE CONVERSATION
ON SLACK
ALLUXIO.IO/SLACK
