Alluxio Product School Webinar
Aug. 15, 2023
For more Alluxio Events: https://www.alluxio.io/events/
Speaker: Roland Theron (Senior Solutions Engineer, Alluxio)
Organizations are retooling their enterprise data infrastructure in the race for AI/ML. However, growing datasets, extensive data engineering overhead, high GPU costs, and expensive specialized storage can make it difficult to get fast results from model development.
The data access layer is the key to accelerating your path to AI/ML. In this webinar, Roland Theron, Senior Solutions Engineer at Alluxio, discusses how the data access layer can help you:
- Build AI architecture on your existing data lake without the need for specialized hardware.
- Streamline the time-consuming process of managing data copies in data engineering.
- Speed up training workloads with high GPU utilization.
- Achieve optimal concurrency to deliver models to inference clusters for demanding applications.
4. Always Increasing Expectations…
- High Scalability (ESSENTIAL): training on billions of files
- High Availability (ESSENTIAL): 99.99%
- High Performance (ESSENTIAL): higher GPU utilization
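To make the 99.99% availability figure concrete, here is a small worked calculation of how much downtime that allows per year. It assumes a 365-day year; the function name is hypothetical and chosen only for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes per year a service may be unavailable at the given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# 99.99% availability leaves 0.01% of the year, roughly 52.6 minutes.
print(round(allowed_downtime_minutes(99.99), 1))
```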
5. What Does Managing Data Involve?
- Data Preprocessing: Improving the quality and reliability of the data for model training
- Feature Engineering: Selecting relevant and informative features from raw data
- Model Training: Reading training data, vision (image) or NLP/LLM (text), for DL using GPUs
- Model Deployment: Consumption of trained models for online or offline inference
[Diagram: the compute frameworks used across these stages (Spark | Trino | Presto; Spark; PyTorch | TensorFlow | Spark) and the data handed from stage to stage: curated / processed data, training data, model, result.]
Not discussed today
- Security
- Privacy (PII)
- Data Cleaning
- Data Pipelines
- Data Governance
14. Architecture Overview
[Diagram: an online ML platform containing the inference cluster, and an offline training platform containing Alluxio and the training cluster. Training data and models move between these components in five numbered steps.]
16. AI Training Test with Alluxio
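[Diagram: on Kubernetes, an interactive notebook drives GPU training that reads from a local folder, /dataset, served by Alluxio in front of remote storage. An Alluxio Operator and a visualization dashboard are also part of the setup.]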
17. Test Setup
● Alluxio via Kubernetes - Provides caching for training data
● GPU server - AWS EC2/Kubernetes
● Deep learning algorithm (CV) - ResNet (one of the most popular CV algorithms)
● Deep learning framework - PyTorch
● Dataset - ImageNet (subset of ~35k images, each ~100 kB to 200 kB)
● Dataset storage - S3 (single region)
● Mounting - FUSE
● Visualization - TensorBoard
● Code execution - Jupyter notebook
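Because the dataset is mounted through FUSE, training code can read it with ordinary file I/O, as if it were a local directory. The sketch below illustrates that idea with the standard library only. The class name `MountedImageFolder` is hypothetical, and the `root/<class_name>/<image>` layout is an assumption about how the ImageNet subset is organized, not something stated in the slides. The class exposes `__len__` and `__getitem__`, the protocol that PyTorch map-style datasets follow; image decoding and transforms are left out.

```python
import os
from typing import List, Tuple

IMAGE_EXTENSIONS = (".jpeg", ".jpg", ".png")


class MountedImageFolder:
    """Map-style dataset over a directory laid out as root/<class_name>/<image>.

    Assumption: `root` is a FUSE mount point, so plain os/open calls suffice.
    """

    def __init__(self, root: str) -> None:
        # One label per subdirectory, in sorted order for reproducibility.
        self.classes = sorted(
            d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d))
        )
        self.samples: List[Tuple[str, int]] = []
        for label, name in enumerate(self.classes):
            class_dir = os.path.join(root, name)
            for fname in sorted(os.listdir(class_dir)):
                if fname.lower().endswith(IMAGE_EXTENSIONS):
                    self.samples.append((os.path.join(class_dir, fname), label))

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, index: int) -> Tuple[bytes, int]:
        # Returns the raw file bytes and the integer class label.
        path, label = self.samples[index]
        with open(path, "rb") as f:
            return f.read(), label
```

With a mount such as the /dataset folder from the test diagram, usage would look like `ds = MountedImageFolder("/dataset")`, followed by `ds[0]` to fetch the first sample.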
18. Training Test Steps
1. Loading the dataset into Alluxio
2. Running the training job
3. Reading the dataset from Alluxio through PyTorch DataLoader in each epoch
4. Visualizing the GPU utilization and other metrics
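The metric tracked in step 4, the share of each epoch spent waiting on data, can be approximated by timing the two halves of the training loop separately. This is a minimal sketch, not the TensorBoard-based measurement used in the webinar: `profile_epoch` and `data_wait_share` are hypothetical helpers, and `batches` stands in for any iterable such as a PyTorch DataLoader. On a GPU, kernels launch asynchronously, so accurate compute timing would also require synchronizing the device inside `train_step`.

```python
import time
from typing import Callable, Iterable, Tuple


def profile_epoch(
    batches: Iterable,
    train_step: Callable[[object], None],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float]:
    """Return (seconds waiting for data, seconds in train_step) for one epoch."""
    data_seconds = 0.0
    compute_seconds = 0.0
    iterator = iter(batches)
    while True:
        start = clock()
        try:
            batch = next(iterator)  # time blocked on the data pipeline
        except StopIteration:
            break
        fetched = clock()
        train_step(batch)  # the forward/backward pass would run here
        done = clock()
        data_seconds += fetched - start
        compute_seconds += done - fetched
    return data_seconds, compute_seconds


def data_wait_share(data_seconds: float, compute_seconds: float) -> float:
    """Fraction of measured time spent waiting for data (0.0 if nothing ran)."""
    total = data_seconds + compute_seconds
    return data_seconds / total if total else 0.0
```

A high share points to an input-bound job, where the GPU sits idle while batches are fetched.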
19. Visualization Dashboard Results (Control)
Training Directly from Storage
- More than 80% of total time is spent in DataLoader
- Results in a low GPU utilization rate (<20%)
20. Visualization Dashboard Results (Alluxio)
Training with Alluxio
- Reduced DataLoader share of total time from 82% to 1% (82X)
- Increased GPU utilization rate from 17% to 93% (over 5X)
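The two improvement factors follow directly from the before and after percentages on the dashboards. The short calculation below reproduces them; the variable names are chosen for illustration only.

```python
# Percentages reported on the control and Alluxio dashboards.
dataloader_before, dataloader_after = 82, 1   # % of total time in DataLoader
gpu_before, gpu_after = 17, 93                # % GPU utilization

dataloader_reduction = dataloader_before / dataloader_after
gpu_gain = gpu_after / gpu_before

print(f"DataLoader share reduced {dataloader_reduction:.0f}X")
print(f"GPU utilization increased {gpu_gain:.2f}X")
```

The GPU ratio works out to roughly 5.5, which the slide rounds down to 5X.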