Accelerate Cloud Training with Alluxio
Alluxio Day 15
Lu Qiu @ Alluxio
Lu Qiu
● Machine Learning Engineer @ Alluxio
● Alluxio PMC maintainer
● Master's in Data Science @ GWU
● Responsible for integrating Alluxio with deep learning
● Areas: Alluxio fault tolerance, journal system, metrics system, POSIX API, and Alluxio integration with the cloud
Agenda
● Alluxio and its POSIX API
● Accelerating Cloud Training with Alluxio
○ Round 1: Accelerating Under Storage Data Access
○ Round 2: Data Processing & Training Speed Up
○ Round 3: Data Orchestration Layer
Alluxio & its POSIX API
Data Orchestration for Analytics & AI in the Cloud
DATA ACCESSIBILITY
Convert from client-side interface to native storage interface
DATA LOCALITY
Local performance for remote data with intelligent multi-tiering
[Diagram: hot, warm, and cold data tiered across RAM, SSD, and HDD; read & write buffering transparent to the app; policies for pinning, promotion/demotion, and TTL; serving Model Training, Big Data ETL, and Big Data Query across on-premises and public cloud]
METADATA LOCALITY
Synchronization of changes across clusters
[Diagram: when a mutation replaces the old file at path /file1 with a new file at the same path, the Alluxio Master propagates the change via metadata synchronization; policies for pinning, promotion/demotion, and TTL; serving Model Training, Big Data ETL, and Big Data Query across on-premises and public cloud]
Alluxio POSIX API
Accessing Remote/Distributed Data as Local Directories
[Diagram: a single Alluxio namespace mounting HDFS #1, HDFS #2, an object store, and NFS]
Connecting to:
● HDFS
● Amazon S3
● Azure
● Google Cloud
● Ceph
● NFS
● Many more
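Because the mount behaves like a local directory, applications need no storage-specific SDK. A minimal Python sketch, assuming the FUSE mount point /mnt/alluxio-fuse and a hypothetical file layout:

import os

MOUNT = "/mnt/alluxio-fuse"  # assumed Alluxio FUSE mount point

# List data that actually lives in remote HDFS/S3/NFS as if it were local.
for name in os.listdir(MOUNT):
    print(name)

# Plain file reads are served through Alluxio's cache transparently.
with open(os.path.join(MOUNT, "dataset/sample.bin"), "rb") as f:  # hypothetical path
    head = f.read(1024)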
Accelerating Cloud Training with Alluxio
Round 1: Accelerating Under Storage Data Access
[Diagram: training clusters in a Kubernetes cloud cluster cache under storage data on local SSDs; read buffering transparent to the app; policies for pinning, promotion/demotion, and TTL]
One Click to Mount UFS to Alluxio
All data located in s3://<bucket_name>/ will be cached by Alluxio, providing data locality for training jobs.
$ bin/alluxio fs mount /s3 s3://<bucket_name>/ --option aws.accessKeyId=<access_key> --option aws.secretKey=<secret_key>
One Click to Load All Training Data into Alluxio
$ bin/alluxio fs distributedLoad /s3
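Once mounted and loaded, training jobs read the cached S3 data through the FUSE mount with ordinary file I/O. A minimal PyTorch sketch, assuming the mount appears at /mnt/alluxio-fuse/s3 and leaving sample decoding format-specific:

import os
from torch.utils.data import DataLoader, Dataset

DATA_DIR = "/mnt/alluxio-fuse/s3/train"  # assumed FUSE path for the /s3 mount

class AlluxioBackedDataset(Dataset):
    """Reads raw training samples through the Alluxio FUSE mount."""
    def __init__(self, root):
        self.paths = [os.path.join(root, f) for f in sorted(os.listdir(root))]

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        with open(self.paths[idx], "rb") as f:
            return f.read()  # decode per your data format (image, record, ...)

# collate_fn=list keeps variable-length raw samples batchable.
loader = DataLoader(AlluxioBackedDataset(DATA_DIR), batch_size=32,
                    num_workers=4, collate_fn=list)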
Alluxio @ Alibaba: Improved Throughput
https://www.alluxio.io/blog/efficient-model-training-in-the-cloud-with-kubernetes-tensorflow-and-alluxio/
https://www.alluxio.io/resources/whitepapers/using-alluxio-to-optimize-and-improve-performance-of-kubernetes-based-deep-learning-in-the-cloud/
Alluxio @ Microsoft
Task
● More than 400 tasks need to read data from Azure and write data back to Azure
● The total data size is larger than 1 TB
Previously, data was copied directly from the cloud to the training nodes.
Challenges
● Easy to exceed the request rate limit: Azure blob-fuse requires downloading data from Azure to local storage before starting the tasks, and uploading data back to Azure after finishing them
● Large amounts of data input and output easily cause I/O errors
● GPUs sit idle while waiting for I/O operations
https://www.alluxio.io/resources/videos/speed-up-large-scale-ml-dl-offline-inference-job-with-alluxio/
Alluxio @ Microsoft: Alluxio Speeds Up Training by 18%
Reduce I/O wait time and improve training performance:
● Pre-cache data to improve performance
● Dynamically cache data during training
● Share data across multiple tasks
Streaming reads disperse I/O requests and avoid exceeding cloud storage request limits.
Automatic retries reduce the I/O error rate.
https://www.alluxio.io/resources/videos/speed-up-large-scale-ml-dl-offline-inference-job-with-alluxio/
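Alluxio performs these retries inside its client, but the effect is easy to picture with a client-side sketch. The helper below is purely illustrative, not Alluxio's actual implementation: reads stream in fixed-size chunks, and transient I/O errors are retried with exponential backoff instead of failing the whole task.

import time

def read_streaming(path, chunk_size=4 << 20, max_retries=3):
    """Stream a file in chunks, retrying transient I/O errors with backoff.
    Illustrative only: Alluxio's client handles retries internally."""
    offset = 0
    while True:
        for attempt in range(max_retries):
            try:
                with open(path, "rb") as f:
                    f.seek(offset)
                    chunk = f.read(chunk_size)
                break
            except OSError:
                if attempt == max_retries - 1:
                    raise
                time.sleep(2 ** attempt)  # exponential backoff
        if not chunk:
            return
        yield chunk
        offset += len(chunk)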
Round 2: Data Processing & Training Speed Up
[Diagram: the Big Data ETL cluster and the training clusters share data through Alluxio; read buffering transparent to the app; policies for pinning, promotion/demotion, and TTL]
Alluxio @ Boss Zhipin
Task
● Use Spark/Flink to process data
● Train models on top of the processed data
Previous solution
● Spark/Flink + Ceph + model training
Problems
● Writing temporary files into Ceph puts high pressure on Ceph
● Ceph read/write pressure cannot be controlled, making the cluster unstable
Solution with Alluxio
Spark/Flink + Alluxio + Ceph + Alluxio + model training
● Alluxio supports multiple data sources and multiple compute/training frameworks
● Multiple independent Alluxio clusters support multi-tenancy, customized configuration, and access control
Alluxio in BOSSZP
[Diagram: Big Data ETL accesses Alluxio through the HDFS interface; Model Training accesses it through the POSIX interface]
2. Data processing to training speed up
● Improves under storage stability
● Speeds up the whole data preprocessing to training pipeline
● More Alluxio clusters can be launched to meet burst ETL/training requirements
[Diagram: Data Preprocessing and Model Training both access Alluxio through the POSIX interface]
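To make the pipeline concrete, here is a minimal PySpark sketch of the Boss Zhipin-style flow under assumed paths and hostnames: the ETL job writes preprocessed output to Alluxio via the alluxio:// scheme (the HDFS-compatible interface), and the training job then reads the same files through the POSIX mount.

from pyspark.sql import SparkSession

# ETL side: write preprocessed data through Alluxio's HDFS-compatible API.
# alluxio://<master>:19998/ is the standard scheme; the hostname and paths
# below are assumptions for illustration.
spark = SparkSession.builder.appName("etl-to-alluxio").getOrCreate()
df = spark.read.json("alluxio://alluxio-master:19998/raw/events")
df.select("user_id", "features").write.parquet(
    "alluxio://alluxio-master:19998/processed/events")

# Training side (separate process): the same data then appears as local
# files under the FUSE mount, e.g. /mnt/alluxio-fuse/processed/events/.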
Round 3: Data Orchestration Layer
[Diagram: Alluxio sits between the Big Data ETL cluster (data preprocessing) and the training clusters, with read buffering transparent to the app and policies for pinning, promotion/demotion, and TTL, on top of the under storage system]
[Diagram: the Big Data ETL cluster (data preprocessing) writes through Alluxio with write buffering and policies for pinning, promotion/demotion, and TTL into under storage, from which the training clusters read]
Alluxio @ Momo
Momo runs multiple Alluxio clusters comprising thousands of Alluxio nodes and storing more than 100 TB of data. Alluxio serves Momo's search and training tasks, and Momo continues to develop new use cases for Alluxio.
● Alluxio supports multiple under storages and multiple compute/training frameworks
● Accelerates compute/training tasks
● Reduces the metadata and data overhead on under storage
Alluxio @ Momo
Training on billions of images
- 2 billion small files
- PyTorch + Alluxio + Ceph
- Reduces the metadata and data interactions with Ceph to improve performance
Alluxio @ Momo
Speed up recommendation system model loading (see the sketch below)
● Upload the recommendation system model to HDFS
● Distributed-load the model from HDFS into Alluxio
● Recommendation system nodes load the model from Alluxio concurrently
Speed up loading indexes for the ANN system
● Create indexes
● Upload the indexes to HDFS (or an object store)
● Nodes load the indexes from Alluxio
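A minimal sketch of the concurrent model-loading flow, assuming hypothetical paths and an Alluxio FUSE mount: the model is loaded into Alluxio once (e.g. with bin/alluxio fs distributedLoad), and every serving node then reads it like a local file, so concurrent loads hit Alluxio workers instead of HDFS.

import torch

# Hypothetical model path under the Alluxio FUSE mount; /hdfs mirrors an
# HDFS mount in Alluxio, analogous to the /s3 mount shown earlier.
MODEL_PATH = "/mnt/alluxio-fuse/hdfs/models/recsys/latest.pt"

def load_model():
    """Called concurrently on each node; reads are served from the
    Alluxio workers' cache rather than hammering HDFS directly."""
    with open(MODEL_PATH, "rb") as f:
        return torch.load(f, map_location="cpu")

model = load_model()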
Alluxio may help you if:
● You run distributed training
● You have a large amount of data (>= TB) or a large number of small files/images
● Network I/O cannot satisfy GPU requirements
● You have multiple data sources and multiple training/compute frameworks
● You need to keep under storage stable and avoid exceeding request rate limits
● You share data between multiple training tasks
Community Driven Project
● Community-driven cooperation: special thanks to the excellent engineers from Microsoft, Shopee, Tencent, AntFinance, Alibaba, Bilibili, and Nanjing University
● In production at Microsoft, Shopee, Bilibili, MOMO, Boss Zhipin, and others
Deployment & Usage
● Alluxio on Kubernetes talk at Alluxio Day XII 2022: https://www.alluxio.io/alluxio-day/
● Alluxio POSIX API documentation: https://docs.alluxio.io/os/user/stable/en/api/POSIX-API.html
Social Media
● Website: www.alluxio.io
● Slack: http://slackin.alluxio.io/
● Twitter: Twitter.com/alluxio
● LinkedIn: Linkedin.com/alluxio
