Accelerate Cloud Training with Alluxio
Alluxio Day 15
Lu Qiu @ Alluxio
Lu Qiu
● Machine Learning Engineer @ Alluxio
● Alluxio PMC maintainer
● Master's in Data Science @ GWU
● Responsible for integrating Alluxio with deep learning
● Areas: Alluxio fault tolerance, journal system, metrics system, POSIX API, and Alluxio integration with the cloud
Agenda
● Alluxio and its POSIX API
● Accelerating Cloud Training with Alluxio
○ Round 1: Accelerating Under Storage Data Access
○ Round 2: Data Processing & Training Speed Up
○ Round 3: Data Orchestration Layer
Alluxio & its POSIX API
Data Orchestration for Analytics & AI in the Cloud
DATA ACCESSIBILITY
Convert from client-side interface to native storage interface
DATA LOCALITY
Local performance for remote data with intelligent multi-tiering
[Diagram: hot, warm, and cold data tiered across RAM, SSD, and HDD; read & write buffering transparent to the app; policies for pinning, promotion/demotion, and TTL; serving Model Training, Big Data ETL, and Big Data Query across on-premises and public cloud]
METADATA LOCALITY
Synchronization of changes across clusters
[Diagram: when a mutation replaces the old file at path /file1 with a new file at the same path, the Alluxio Master propagates the change via metadata synchronization; policies for pinning, promotion/demotion, and TTL; serving Model Training, Big Data ETL, and Big Data Query across on-premises and public cloud]
Alluxio POSIX API
Accessing Remote/Distributed Data as Local Directories
[Diagram: a single Alluxio namespace mounting HDFS #1, HDFS #2, an object store, and NFS]
Connecting to:
● HDFS
● Amazon S3
● Azure
● Google Cloud
● Ceph
● NFS
● Many more
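Because the mount behaves like a local directory, applications need no storage-specific SDK. A minimal Python sketch, assuming the FUSE mount point /mnt/alluxio-fuse and a hypothetical file layout:

import os

MOUNT = "/mnt/alluxio-fuse"  # assumed Alluxio FUSE mount point

# List data that actually lives in remote HDFS/S3/NFS as if it were local.
for name in os.listdir(MOUNT):
    print(name)

# Plain file reads are served through Alluxio's cache transparently.
with open(os.path.join(MOUNT, "dataset/sample.bin"), "rb") as f:  # hypothetical path
    head = f.read(1024)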
Accelerating Cloud Training with Alluxio
Round 1: Accelerating Under Storage Data Access
[Diagram: training clusters in a Kubernetes cloud cluster cache under storage data on local SSDs; read buffering transparent to the app; policies for pinning, promotion/demotion, and TTL]
One Click to Mount UFS to Alluxio
All data located in s3://<bucket_name>/ will be cached by Alluxio, providing data locality for training jobs.
$ bin/alluxio fs mount /s3 s3://<bucket_name>/ --option aws.accessKeyId=<access_key> --option aws.secretKey=<secret_key>
One Click to Load All Training Data into Alluxio
$ bin/alluxio fs distributedLoad /s3
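Once mounted and loaded, training jobs read the cached S3 data through the FUSE mount with ordinary file I/O. A minimal PyTorch sketch, assuming the mount appears at /mnt/alluxio-fuse/s3 and leaving sample decoding format-specific:

import os
from torch.utils.data import DataLoader, Dataset

DATA_DIR = "/mnt/alluxio-fuse/s3/train"  # assumed FUSE path for the /s3 mount

class AlluxioBackedDataset(Dataset):
    """Reads raw training samples through the Alluxio FUSE mount."""
    def __init__(self, root):
        self.paths = [os.path.join(root, f) for f in sorted(os.listdir(root))]

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        with open(self.paths[idx], "rb") as f:
            return f.read()  # decode per your data format (image, record, ...)

# collate_fn=list keeps variable-length raw samples batchable.
loader = DataLoader(AlluxioBackedDataset(DATA_DIR), batch_size=32,
                    num_workers=4, collate_fn=list)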
Alluxio @ Alibaba: Improved Throughput
https://www.alluxio.io/blog/efficient-model-training-in-the-cloud-with-kubernetes-tensorflow-and-alluxio/
https://www.alluxio.io/resources/whitepapers/using-alluxio-to-optimize-and-improve-performance-of-kubernetes-based-deep-learning-in-the-cloud/
Alluxio @ Microsoft
Task
● More than 400 tasks need to read data from Azure and write data back to Azure
● The total data size is larger than 1 TB
Previously, data was copied directly from the cloud to the training nodes.
Challenges
● Easy to exceed the request rate limit: Azure blob-fuse requires downloading data from Azure to local storage before starting the tasks, and uploading data back to Azure after finishing them
● Large amounts of data input and output easily cause I/O errors
● GPUs sit idle while waiting for I/O operations
https://www.alluxio.io/resources/videos/speed-up-large-scale-ml-dl-offline-inference-job-with-alluxio/
Alluxio @ Microsoft: Alluxio Speeds Up Training by 18%
Reduce I/O wait time and improve training performance:
● Pre-cache data to improve performance
● Dynamically cache data during training
● Share data across multiple tasks
Streaming reads disperse I/O requests and avoid exceeding cloud storage request limits.
Automatic retries reduce the I/O error rate.
https://www.alluxio.io/resources/videos/speed-up-large-scale-ml-dl-offline-inference-job-with-alluxio/
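Alluxio performs these retries inside its client, but the effect is easy to picture with a client-side sketch. The helper below is purely illustrative, not Alluxio's actual implementation: reads stream in fixed-size chunks, and transient I/O errors are retried with exponential backoff instead of failing the whole task.

import time

def read_streaming(path, chunk_size=4 << 20, max_retries=3):
    """Stream a file in chunks, retrying transient I/O errors with backoff.
    Illustrative only: Alluxio's client handles retries internally."""
    offset = 0
    while True:
        for attempt in range(max_retries):
            try:
                with open(path, "rb") as f:
                    f.seek(offset)
                    chunk = f.read(chunk_size)
                break
            except OSError:
                if attempt == max_retries - 1:
                    raise
                time.sleep(2 ** attempt)  # exponential backoff
        if not chunk:
            return
        yield chunk
        offset += len(chunk)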
Round 2: Data Processing & Training Speed Up
[Diagram: the Big Data ETL cluster and the training clusters share data through Alluxio; read buffering transparent to the app; policies for pinning, promotion/demotion, and TTL]
Alluxio @ Boss Zhipin
Task
● Use Spark/Flink to process data
● Train models on top of the processed data
Previous solution
● Spark/Flink + Ceph + model training
Problems
● Writing temporary files into Ceph puts high pressure on Ceph
● Ceph read/write pressure cannot be controlled, making the cluster unstable
Solution with Alluxio
Spark/Flink + Alluxio + Ceph + Alluxio + model training
● Alluxio supports multiple data sources and multiple compute/training frameworks
● Multiple independent Alluxio clusters support multi-tenancy, customized configuration, and access control
Alluxio in BOSSZP
[Diagram: Big Data ETL accesses Alluxio through the HDFS interface; Model Training accesses it through the POSIX interface]
2. Data processing to training speed up
● Improves under storage stability
● Speeds up the whole data preprocessing to training pipeline
● More Alluxio clusters can be launched to meet burst ETL/training requirements
[Diagram: Data Preprocessing and Model Training both access Alluxio through the POSIX interface]
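To make the pipeline concrete, here is a minimal PySpark sketch of the Boss Zhipin-style flow under assumed paths and hostnames: the ETL job writes preprocessed output to Alluxio via the alluxio:// scheme (the HDFS-compatible interface), and the training job then reads the same files through the POSIX mount.

from pyspark.sql import SparkSession

# ETL side: write preprocessed data through Alluxio's HDFS-compatible API.
# alluxio://<master>:19998/ is the standard scheme; the hostname and paths
# below are assumptions for illustration.
spark = SparkSession.builder.appName("etl-to-alluxio").getOrCreate()
df = spark.read.json("alluxio://alluxio-master:19998/raw/events")
df.select("user_id", "features").write.parquet(
    "alluxio://alluxio-master:19998/processed/events")

# Training side (separate process): the same data then appears as local
# files under the FUSE mount, e.g. /mnt/alluxio-fuse/processed/events/.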
Round 3: Data Orchestration Layer
[Diagram: Alluxio sits between the Big Data ETL cluster (data preprocessing) and the training clusters, with read buffering transparent to the app and policies for pinning, promotion/demotion, and TTL, on top of the under storage system]
[Diagram: the Big Data ETL cluster (data preprocessing) writes through Alluxio with write buffering and policies for pinning, promotion/demotion, and TTL into under storage, from which the training clusters read]
Alluxio @ Momo
Momo runs multiple Alluxio clusters comprising thousands of Alluxio nodes and storing more than 100 TB of data. Alluxio serves Momo's search and training tasks, and Momo continues to develop new use cases for Alluxio.
● Alluxio supports multiple under storages and multiple compute/training frameworks
● Accelerates compute/training tasks
● Reduces the metadata and data overhead on under storage
Alluxio @ Momo
Training on billions of images
- 2 billion small files
- PyTorch + Alluxio + Ceph
- Reduces the metadata and data interactions with Ceph to improve performance
Alluxio @ Momo
Speed up recommendation system model loading (see the sketch below)
● Upload the recommendation system model to HDFS
● Distributed-load the model from HDFS into Alluxio
● Recommendation system nodes load the model from Alluxio concurrently
Speed up loading indexes for the ANN system
● Create indexes
● Upload the indexes to HDFS (or an object store)
● Nodes load the indexes from Alluxio
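A minimal sketch of the concurrent model-loading flow, assuming hypothetical paths and an Alluxio FUSE mount: the model is loaded into Alluxio once (e.g. with bin/alluxio fs distributedLoad), and every serving node then reads it like a local file, so concurrent loads hit Alluxio workers instead of HDFS.

import torch

# Hypothetical model path under the Alluxio FUSE mount; /hdfs mirrors an
# HDFS mount in Alluxio, analogous to the /s3 mount shown earlier.
MODEL_PATH = "/mnt/alluxio-fuse/hdfs/models/recsys/latest.pt"

def load_model():
    """Called concurrently on each node; reads are served from the
    Alluxio workers' cache rather than hammering HDFS directly."""
    with open(MODEL_PATH, "rb") as f:
        return torch.load(f, map_location="cpu")

model = load_model()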
Alluxio may help you if:
● You run distributed training
● You have a large amount of data (>= TB) or a large number of small files/images
● Network I/O cannot satisfy GPU requirements
● You have multiple data sources and multiple training/compute frameworks
● You need to keep under storage stable and avoid exceeding request rate limits
● You share data between multiple training tasks
Community Driven Project
● Community-driven cooperation: special thanks to the excellent engineers from Microsoft, Shopee, Tencent, AntFinance, Alibaba, Bilibili, and Nanjing University
● In production at Microsoft, Shopee, Bilibili, MOMO, Boss Zhipin, and others
Deployment & Usage
● Alluxio on Kubernetes talk at Alluxio Day XII 2022: https://www.alluxio.io/alluxio-day/
● Alluxio POSIX API documentation: https://docs.alluxio.io/os/user/stable/en/api/POSIX-API.html
Social Media
● Website: www.alluxio.io
● Slack: http://slackin.alluxio.io/
● Twitter: Twitter.com/alluxio
● LinkedIn: Linkedin.com/alluxio
