Webinar: Why NFS/NAS on Object Storage May Not Solve Your AI Problems
Dr. Beinan Wang, Senior Staff Engineer @ Alluxio, Trino Contributor, PrestoDB Committer
Tarik Bennett, Senior Solutions Engineer @ Alluxio
“Global spending on public cloud services is forecast to increase 20.4% in
2024, and similarly to 2023, the source of growth will be a combination of cloud
vendor price increases and increased utilization.”
1. Some organizations in production
2. Many organizations in early stages
Source: Gartner 2023
Seeking Scalable, Sustainable Performance
Many organizations are training in the cloud.
And many are expecting data volumes and cloud usage to rise in the next year.
We've seen many who are developing with data sizes that currently fit in memory
and preparing for workloads that will be much larger in 12 months.
Make it run, make it right, make it fast
AI/ML Development Stages
Source: Data Driven Science
1. In production, but reducing inefficiencies
2. Optimizing early architectures
Problem Definition
Data Collection
Data Preparation
Data Visualization
AI/ML Modeling
Feature Engineering
Model Deployment
✓ X X X X X ✓
Scenarios We’ll Address Today
Some additional hardware purchases
Greenfield deployments
Working with existing tech stacks
Critical Components
Compute: GPUs (on-prem, cloud, and remote aggregators)
Networking: Ethernet, Infiniband, etc
Storage: S3, object storage, data lakes, data centers
*Data Access: Data serving, backing stores (NFS / NAS), Alluxio
GPUs are getting faster and datasets for model training are growing larger. We propose that data access,
including throughput and data loading efficiency, is another core component of forward-looking architectures.
Critical Components with Data Access
Compute: GPUs (on-prem, cloud, and remote aggregators)
Networking: Ethernet, Infiniband, etc
Storage: S3, object storage, data lakes, data centers
Data Access: Data serving, backing stores (NFS / NAS), Alluxio
Shuttling data effectively from storage to training sets is our topic of discussion today.
Common Issues in Pre-Production Architectures
1. Model training efficiency below expectations
2. Bottlenecks around data synchronization
3. Concurrency and metadata issues
4. Slow data access or low GPU utilization
Teams are managing:
- Slow-I/O storage that serves high-performance GPUs
- Workflows that include manual replication
- Multiple data sources (e.g. hybrid infra, multiple clouds)
There can be many sources of bottlenecks in data pipelines
Storage IOPS vs GPU Memory Bandwidth
Source: Nvidia, MinIO
Storage IOPS = (total read operations + total write operations) / time (in seconds)
● Throughput - number of bits read or written per second
○ MinIO - 16.3 GB/sec avg read throughput on 24 node cluster
GPU Memory Bandwidth
● “The H100 SXM5 GPU raises the bar considerably… delivering over
3 TB/sec of memory bandwidth, effectively a 2x increase over the
memory bandwidth of A100 that was launched just two years ago.
The PCIe H100 provides 80 GB of fast HBM2e with over 2 TB/sec of
memory bandwidth.”
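To put the two quoted figures side by side, here is a rough back-of-the-envelope sketch. It assumes decimal units (1 TB = 1000 GB) and treats "over 3 TB/sec" and "over 2 TB/sec" as lower bounds. The function name is illustrative only, and the comparison shows scale rather than a like-for-like benchmark, since storage throughput and GPU memory bandwidth sit at different layers.

```python
# Figures quoted on the slide; decimal units are an assumption.
MINIO_READ_GB_PER_S = 16.3   # avg read throughput across a 24-node cluster
MINIO_NODES = 24
H100_SXM5_TB_PER_S = 3.0     # "over 3 TB/sec", used as a lower bound
H100_PCIE_TB_PER_S = 2.0     # "over 2 TB/sec", used as a lower bound


def bandwidth_gap(gpu_tb_per_s: float, storage_gb_per_s: float) -> float:
    """Ratio of GPU memory bandwidth to storage read throughput."""
    return gpu_tb_per_s * 1000 / storage_gb_per_s


sxm5_gap = bandwidth_gap(H100_SXM5_TB_PER_S, MINIO_READ_GB_PER_S)
pcie_gap = bandwidth_gap(H100_PCIE_TB_PER_S, MINIO_READ_GB_PER_S)
per_node_gb_per_s = MINIO_READ_GB_PER_S / MINIO_NODES

print(f"H100 SXM5 vs cluster read throughput: ~{sxm5_gap:.0f}x")
print(f"H100 PCIe vs cluster read throughput: ~{pcie_gap:.0f}x")
print(f"Read throughput per storage node: ~{per_node_gb_per_s:.2f} GB/s")
```

Under these assumptions, a single GPU's memory bandwidth is two orders of magnitude above the whole cluster's read throughput.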
How are These Issues Being Addressed?
1. Many are attempting to resolve slow data access with faster storage
a. Cloud vendor options
b. Specialized hardware vendors
2. Adding NAS / NFS as backing stores for S3, MinIO, Ceph, etc.
a. Data sharing and collaboration
b. Scalability
c. Data consistency
d. Simplify management
Problems with Existing Options
1. Faster storage hardware means data migration, even if hidden
a. Data must reside on the new storage to benefit from its speed
b. Data migration into new storage
c. Non-transparent storage
2. NFS/NAS: Maintenance and bottlenecks
a. Stability, reliability, and bottlenecks
b. Manual copies
c. Duplicating data from local storage
d. Data syncing issues from remote storage
3. What if vendor changes are required for business reasons?
a. Potential downtime while migrating from the source of truth
b. GPU scarcity and cloud agreements may increase this likelihood
Drawbacks of Data Migration
1. Data Transfer Bottlenecks
Data volume and transfer speed
Risk of data corruption or loss
2. Operational Downtime
Service interruptions during migration
Impact on research and development timelines
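The "data volume and transfer speed" point can be made concrete with a rough estimate. The sketch below is illustrative only: the dataset size, link speed, and efficiency factor are hypothetical values, not figures from the webinar.

```python
def transfer_hours(dataset_tb: float, link_gbit_per_s: float,
                   efficiency: float = 0.7) -> float:
    """Estimate hours needed to copy a dataset over a network link.

    dataset_tb      -- dataset size in terabytes (decimal, 1 TB = 1e12 bytes)
    link_gbit_per_s -- nominal link speed in gigabits per second
    efficiency      -- assumed fraction of nominal speed actually achieved
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    total_bytes = dataset_tb * 1e12
    bytes_per_second = link_gbit_per_s * 1e9 / 8 * efficiency
    return total_bytes / bytes_per_second / 3600


# Hypothetical scenario: 100 TB over a 10 Gbit/s link at 70% efficiency.
hours = transfer_hours(dataset_tb=100, link_gbit_per_s=10, efficiency=0.7)
print(f"Estimated migration time: {hours:.1f} hours")  # about 31.7 hours
```

Even in this simplified model the copy takes more than a day, and that is before retries, verification, or any re-sync of data that changed during the transfer.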
Before: Inefficient Loading from S3 to GPU via NAS
S3 Object Store (Cloud) -> NAS (On-Prem)
$ aws s3 sync <bucket> <nas>
Consider Data Abstraction and Distributed Caching
High Performance Caching
1. High throughput for model training
2. High concurrency for model serving
3. Automatic data and metadata syncing
4. Automatic fallback to data lake
5. Data abstraction and transparency
6. Reduced hardware dependency
7. Pre-load data
8. Cache on query
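Several items in the list above (pre-load data, cache on query, automatic fallback to the data lake) can be sketched as a small read-through cache. This is a minimal single-process illustration and not Alluxio's implementation or API. The class name, the `fetch` callback, and the LRU eviction policy are all assumptions made for the example.

```python
from collections import OrderedDict
from typing import Callable, Iterable


class ReadThroughCache:
    """Tiny LRU read-through cache in front of a slower backing store."""

    def __init__(self, fetch: Callable[[str], bytes], capacity_bytes: int):
        self._fetch = fetch              # reads from the source of truth
        self._capacity = capacity_bytes
        self._used = 0
        self._items: "OrderedDict[str, bytes]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, key: str) -> bytes:
        """Cache on query: serve from cache, else fall back and admit."""
        if key in self._items:
            self._items.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._items[key]
        self.misses += 1
        data = self._fetch(key)           # fallback to the backing store
        self._admit(key, data)
        return data

    def preload(self, keys: Iterable[str]) -> None:
        """Pre-load data before the first training epoch touches it."""
        for key in keys:
            if key not in self._items:
                self._admit(key, self._fetch(key))

    def _admit(self, key: str, data: bytes) -> None:
        if len(data) > self._capacity:
            return                        # too large to cache, still served
        while self._used + len(data) > self._capacity:
            _, evicted = self._items.popitem(last=False)  # evict LRU entry
            self._used -= len(evicted)
        self._items[key] = data
        self._used += len(data)


# Hypothetical backing store standing in for S3 or a data lake.
backing_store = {"a.jpg": b"x" * 4, "b.jpg": b"y" * 4, "c.jpg": b"z" * 4}
remote_calls = []


def fetch(key: str) -> bytes:
    remote_calls.append(key)
    return backing_store[key]


cache = ReadThroughCache(fetch, capacity_bytes=8)
cache.preload(["a.jpg"])
cache.read("a.jpg")   # hit: already pre-loaded
cache.read("b.jpg")   # miss: fetched once, then cached
cache.read("c.jpg")   # miss: admitting it evicts a.jpg (least recently used)
print(cache.hits, cache.misses, remote_calls)
```

The training code only ever calls `read`, so where the bytes came from stays transparent to it, which is the data abstraction point on the slide.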
[Diagram: Data Sources -> High Performance Data Access Layer -> Compute]
How might Alluxio work with your architecture?
Co-located w/ NAS
Standalone High Performance Data Access Layer: Data from multiple sources served to GPU nodes
Virtual Caching Across Local GPU Storage: Data from S3 synced to Virtual Alluxio Storage and shared between GPU nodes
What problems are addressed by Alluxio?
Increasing Capacity
● Serves training data sizes too large to fit on a single node
● Serves only active data from data lake or source of truth
● Supports performance as data volumes increase
Reducing Data Management
● Reduces management of manual copies
● Distributes data efficiently across nodes
● Reduces syncing issues
Improving Performance
● Addresses I/O limits and throttling from storage
● Improves GPU utilization
● Reduces requests from remote data storage
● 50 million objects per node
Benefits
Optimizes data loading for training and model serving
Less maintenance. No manual copies
Support for scaling
Faster switchovers
No hardware
No data migration
Alluxio on AWS - Reference Architecture
[Diagram: Model Training (training cluster, Alluxio, training data, models) and Model Serving (inference cluster, Alluxio, models), with numbered steps 1-5]
AI Training Test with Alluxio
[Diagram labels: Alluxio, GCP, Kubernetes, Alluxio Operator, Remote Storage, Local Folder /dataset, GPU Training, Interactive Notebook, Visualization Dashboard]
Before using Alluxio
GPU Utilization Rate ~17%
DataLoader Rate accounts for ~80% of total time
GPU Summary
Name: Tesla T4
Memory: 14.62GB
Compute Capability: 7.5
GPU Utilization: 16.96%
Est. SM Efficiency: 16.91%
Est. Achieved Occupancy: 68.75%
Kernel Time using Tensor Cores: 0.0%

Category            Pct (%)   Time Duration (us)
Average Step Time   100       1,763,649,145
Kernel              16.96     299,168,905
Memcpy              0.6       10,521,722
Memset              0         39,459
Runtime             0.17      3,043,169
DataLoader          81.99     1,446,068,956
CPU Exec            0.09      1,570,076
Other               0.18      3,245,858

Resnet-50, 3 epochs, S3 Fuse
After using Alluxio
GPU Utilization Rate Increased from 17% to 93%
DataLoader Rate Reduced to 1%
GPU Summary
Name: Tesla T4
Memory: 14.62GB
Compute Capability: 7.5
GPU Utilization: 93.29%
Est. SM Efficiency: 92.98%
Est. Achieved Occupancy: 68.03%
Kernel Time using Tensor Cores: 0.0%

Category            Pct (%)   Time Duration (us)
Average Step Time   100       334,274,946
Kernel              93.29     311,847,023
Memcpy              3.14      10,500,126
Memset              0.01      43,946
Runtime             1.17      3,899,241
DataLoader          1         3,343,301
CPU Exec            0.49      1,648,391
Other               0.9       2,992,918

Resnet-50, 3 epochs, S3 Fuse
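The percentage columns in the two profiler tables follow directly from the durations. As a quick check, the sketch below recomputes each category's share using the reported step time as the denominator. The durations are copied from the slides, while the variable and function names are illustrative.

```python
# Durations in microseconds, copied from the two profiler tables above.
BEFORE_STEP_US = 1_763_649_145
BEFORE_US = {
    "Kernel": 299_168_905,
    "Memcpy": 10_521_722,
    "Memset": 39_459,
    "Runtime": 3_043_169,
    "DataLoader": 1_446_068_956,
    "CPU Exec": 1_570_076,
    "Other": 3_245_858,
}

AFTER_STEP_US = 334_274_946
AFTER_US = {
    "Kernel": 311_847_023,
    "Memcpy": 10_500_126,
    "Memset": 43_946,
    "Runtime": 3_899_241,
    "DataLoader": 3_343_301,
    "CPU Exec": 1_648_391,
    "Other": 2_992_918,
}


def pct_of_step(durations_us: dict, step_us: int) -> dict:
    """Share of the reported step time spent in each category, in percent."""
    return {name: round(100 * us / step_us, 2)
            for name, us in durations_us.items()}


before_pct = pct_of_step(BEFORE_US, BEFORE_STEP_US)
after_pct = pct_of_step(AFTER_US, AFTER_STEP_US)
step_speedup = BEFORE_STEP_US / AFTER_STEP_US
kernel_ratio = AFTER_US["Kernel"] / BEFORE_US["Kernel"]

print("before:", before_pct)
print("after: ", after_pct)
print(f"step time ratio (before / after): {step_speedup:.2f}")
print(f"kernel time ratio (after / before): {kernel_ratio:.2f}")
```

Kernel time is nearly the same in both runs, so almost all of the reduction in step time comes from the DataLoader category.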
Application Interface: Alluxio-FUSE
Alluxio-FUSE: 17 min total training time (3 epochs), 93% GPU utilization (TensorBoard)
S3-FUSE: 85 min total training time (3 epochs), 17% GPU utilization (TensorBoard)
Alluxio is 5 times faster than S3-FUSE
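One way to read these headline numbers is to split each run into approximate GPU-busy and GPU-idle minutes. The sketch below treats the reported utilization as an average over the whole run, which is a simplifying assumption, and the function name is illustrative.

```python
def gpu_minutes(total_min: float, utilization_pct: float) -> tuple:
    """Split a run into (busy, idle) GPU minutes from average utilization."""
    busy = total_min * utilization_pct / 100
    return busy, total_min - busy


s3_busy, s3_idle = gpu_minutes(total_min=85, utilization_pct=17)
alluxio_busy, alluxio_idle = gpu_minutes(total_min=17, utilization_pct=93)
speedup = 85 / 17

print(f"S3-FUSE:      {s3_busy:.2f} min busy, {s3_idle:.2f} min idle")
print(f"Alluxio-FUSE: {alluxio_busy:.2f} min busy, {alluxio_idle:.2f} min idle")
print(f"Wall-clock speedup: {speedup:.0f}x")
```

Under that assumption the GPU does roughly the same amount of work in both runs (about 14 to 16 minutes), and the difference in wall-clock time is almost entirely idle time spent waiting on data.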
Q&A
twitter.com/alluxio
slackin.alluxio.io
linkedin.com/alluxio
www.alluxio.io
JOIN THE CONVERSATION ON SLACK
ALLUXIO.IO/SLACK

 
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
 
Corporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMSCorporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMS
 
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
 
SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBroker
 
Why React Native as a Strategic Advantage for Startup Innovation.pdf
Why React Native as a Strategic Advantage for Startup Innovation.pdfWhy React Native as a Strategic Advantage for Startup Innovation.pdf
Why React Native as a Strategic Advantage for Startup Innovation.pdf
 
Using IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New ZealandUsing IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New Zealand
 
De mooiste recreatieve routes ontdekken met RouteYou en FME
De mooiste recreatieve routes ontdekken met RouteYou en FMEDe mooiste recreatieve routes ontdekken met RouteYou en FME
De mooiste recreatieve routes ontdekken met RouteYou en FME
 
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume MontevideoVitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume Montevideo
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
 
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
 

Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI Problems

  • 1. Webinar: Why NFS/NAS on Object Storage May not Solve Your AI Problems
  • 2. Speakers: Dr. Beinan Wang (Senior Staff Engineer @ Alluxio, Trino Contributor, PrestoDB Committer) and Tarik Bennett (Senior Solutions Engineer @ Alluxio)
  • 3. “Global spending on public cloud services is forecast to increase 20.4% in 2024, and similarly to 2023, the source of growth will be a combination of cloud vendor price increases and increased utilization.” (Source: Gartner 2023) 1. Some organizations in production 2. Many organizations in early stages
  • 4. Seeking Scalable, Sustainable Performance: Many organizations are training in the cloud, and many expect data volumes and cloud usage to rise in the next year. We've seen many who are developing with data sizes that currently fit in memory and preparing for workloads that will be much larger in 12 months.
  • 5. Make it run, make it right, make it fast. 1. In production, but reducing inefficiencies 2. Optimizing early architectures
      AI/ML Development Stages (Source: Data Driven Science): Problem Definition, Data Collection, Data Preparation, Data Visualization, AI/ML Modeling, Feature Engineering, Model Deployment
      Stage markers as shown on the slide: ✓ X X X X X ✓
  • 6. Scenarios We’ll Address Today: some additional hardware purchases; greenfield deployments; working with existing tech stacks
  • 7. Critical Components
      Compute: GPUs (on-prem, cloud, and remote aggregators)
      Networking: Ethernet, InfiniBand, etc.
      Storage: S3, object storage, data lakes, data centers
      *Data Access: data serving, backing stores (NFS / NAS), Alluxio
      GPUs are growing faster and datasets for model training are growing larger. We propose that data access, including throughput and data loading efficiency, is another core component of forward architectures.
  • 8. Critical Components with Data Access
      Compute: GPUs (on-prem, cloud, and remote aggregators)
      Networking: Ethernet, InfiniBand, etc.
      Storage: S3, object storage, data lakes, data centers
      Data Access: data serving, backing stores (NFS / NAS), Alluxio
      Shuttling data effectively from storage to training sets is our topic of discussion today.
  • 9. Common Issues in Pre-Production Architectures. There can be many sources of bottlenecks in data pipelines.
      1. Model training efficiency below expectations
      2. Bottlenecks around data synchronization
      3. Concurrency and metadata issues
      4. Slow data access or low GPU utilization
      Teams are managing:
      - Slow I/O storage that serves high-performance GPUs
      - Workflows that include manual replication
      - Multiple data sources (e.g. hybrid infra, multiple clouds)
  • 10. Storage IOPS vs GPU Memory Bandwidth (Source: Nvidia, MinIO)
      Storage IOPS: (Total Reads + Write Throughputs) / Time (in seconds)
      ● Throughput: number of bits read or written per second
        ○ MinIO: 16.3 GB/sec avg read throughput on a 24-node cluster
      GPU Memory Bandwidth
      ● “The H100 SXM5 GPU raises the bar considerably… delivering over 3 TB/sec of memory bandwidth, effectively a 2x increase over the memory bandwidth of A100 that was launched just two years ago. The PCIe H100 provides 80 GB of fast HBM2e with over 2 TB/sec of memory bandwidth.”
  • 11. How are These Issues Being Addressed?
      1. Many are attempting to resolve slow data access with faster storage
         a. Cloud vendor options
         b. Specialized hardware vendors
      2. Adding NAS / NFS as backing stores for S3, MinIO, Ceph, etc.
         a. Data sharing and collaboration
         b. Scalability
         c. Data consistency
         d. Simplify management
  • 12. Problems with Existing Options
      1. Faster storage hardware means data migration, even if hidden
         a. The data must be stored in order to increase the speed
         b. Data migration into new storage
         c. Non-transparent storage
      2. NFS/NAS: Maintenance and bottlenecks
         a. Stability, reliability, and bottlenecks
         b. Manual copies
         c. Duplicating data from local storage
         d. Data syncing issues from remote storage
      3. What if vendor changes are required for business reasons?
         a. Potential downtime while migrating from the source of truth
         b. GPU scarcity and cloud agreements may increase this likelihood
  • 13. Drawbacks of Data Migration
      1. Data Transfer Bottlenecks
         - Data volume and transfer speed
         - Risk of data corruption or loss
      2. Operational Downtime
         - Service interruptions during migration
         - Impact on research and development timelines
  • 14. Before: Inefficient Loading from S3 to GPU via NAS
      [Diagram: S3 Object Store (Cloud) copied to NAS (On-Prem) with: $ aws s3 sync <bucket> <nas>]
  • 15. Consider Data Abstraction and Distributed Caching
      High Performance Caching
      1. High throughput for model training
      2. High concurrency for model serving
      3. Automatic data and metadata syncing
      4. Automatic fallback to data lake
      5. Data abstraction and transparency
      6. Reduced hardware dependency
      7. Pre-load data
      8. Cache on query
      [Diagram labels: Data Sources, High Performance Data Access Layer, Compute]
  • 16. How might Alluxio work with your architecture?
      - Co-located w/ NAS
      - Standalone
      - High Performance Data Access Layer: data from multiple sources served to GPU nodes
      - Virtual Caching Across Local GPU Storage: data from S3 synced to Virtual Alluxio Storage and shared between GPU nodes
  • 17. What problems are addressed by Alluxio?
      Increasing Capacity
      ● Serves training data sizes too large to fit on a single node
      ● Serves only active data from data lake or source of truth
      ● Supports performance as data volumes increase
      Reducing Data Management
      ● Reduces management of manual copies
      ● Distributes data efficiently across nodes
      ● Reduces syncing issues
      Improving Performance
      ● Addresses I/O limits and throttling from storage
      ● Improves GPU utilization
      ● Reduces requests from remote data storage
      ● 50 million objects per node
  • 18. Benefits: optimizes data loading for training and model serving; less maintenance, no manual copies; support for scaling; faster switchovers; no hardware; no data migration
  • 19. Alluxio on AWS - Reference Architecture
      [Diagram labels: Model Training, Training cluster, Training Data, Alluxio, Model Serving, Inference cluster, Models, GCP; steps numbered 1 to 5]
  • 20. AI Training Test with Alluxio
      [Diagram labels: Kubernetes, Interactive Notebook, Alluxio Operator, Visualization Dashboard, GPU Training, Local Folder /dataset, Alluxio, Remote Storage]
  • 21. Before using Alluxio: GPU Utilization Rate ~17%; DataLoader Rate accounts for ~80% of total time (Resnet-50, 3 epochs, S3 Fuse)
      GPU Summary
        Name                             Tesla T4
        Memory                           14.62GB
        Compute Capability               7.5
        GPU Utilization                  16.96%
        Est. SM Efficiency               16.91%
        Est. Achieved Occupancy          68.75%
        Kernel Time using Tensor Cores   0.0%
      Category            Pct (%)   Time Duration (us)
        Average Step Time   100       1,763,649,145
        Kernel              16.96     299,168,905
        Memcpy              0.6       10,521,722
        Memset              0         39,459
        Runtime             0.17      3,043,169
        DataLoader          81.99     1,446,068,956
        CPU Exec            0.09      1,570,076
        Other               0.18      3,245,858
  • 22. After using Alluxio: GPU Utilization Rate Increased from 17% to 93%; DataLoader Rate Reduced to 1% (Resnet-50, 3 epochs, S3 Fuse)
      GPU Summary
        Name                             Tesla T4
        Memory                           14.62GB
        Compute Capability               7.5
        GPU Utilization                  93.29%
        Est. SM Efficiency               92.98%
        Est. Achieved Occupancy          68.03%
        Kernel Time using Tensor Cores   0.0%
      Category            Pct (%)   Time Duration (us)
        Average Step Time   100       334,274,946
        Kernel              93.29     311,847,023
        Memcpy              3.14      10,500,126
        Memset              0.01      43,946
        Runtime             1.17      3,899,241
        DataLoader          1         3,343,301
        CPU Exec            0.49      1,648,391
        Other               0.9       2,992,918
  • 23. Application Interface: Alluxio-FUSE
      Alluxio-FUSE: 17 min total training time (3 epochs), 93% GPU utilization (TensorBoard)
      S3-FUSE: 85 min total training time (3 epochs), 17% GPU utilization (TensorBoard)
      Alluxio is 5 times faster than S3-FUSE
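
The percentages and the speedup reported on slides 21 to 23 can be rechecked from the raw durations. The following is a minimal Python sketch and is not part of the original deck: the names `S3_FUSE`, `ALLUXIO_FUSE`, `pct`, and `speedup` are illustrative, and each run's total is taken to be the "Average Step Time" row exactly as printed on the slide.

```python
# Durations in microseconds, copied from the profiler tables on slides 21 and 22.
# All names here are illustrative; they do not come from the deck or from Alluxio.
S3_FUSE = {"total": 1_763_649_145, "Kernel": 299_168_905, "DataLoader": 1_446_068_956}
ALLUXIO_FUSE = {"total": 334_274_946, "Kernel": 311_847_023, "DataLoader": 3_343_301}


def pct(run, category):
    """Share of the run's total profiled time spent in one category, in percent."""
    return 100.0 * run[category] / run["total"]


def speedup(slow, fast):
    """Ratio of the slower duration to the faster one."""
    return slow / fast


for name, run in (("S3-FUSE", S3_FUSE), ("Alluxio-FUSE", ALLUXIO_FUSE)):
    print(f"{name}: kernel {pct(run, 'Kernel'):.2f}%, "
          f"DataLoader {pct(run, 'DataLoader'):.2f}%")

# Ratio of total profiled time, and ratio of the wall-clock minutes on slide 23.
profiled_ratio = speedup(S3_FUSE["total"], ALLUXIO_FUSE["total"])
wall_clock_ratio = speedup(85, 17)
print(f"profiled time ratio: {profiled_ratio:.2f}x, "
      f"wall-clock ratio: {wall_clock_ratio:.1f}x")
```

Kernel time is nearly the same in both runs (about 299 s versus about 312 s), so almost all of the difference comes from DataLoader time, which falls from about 1,446 s to about 3.3 s.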