Speeding Up the Atlas Deep Learning Platform
with Alluxio + Fluid
Yuandong Xie
xieydd@gmail.com
Platform Researcher, Unisound AI Labs
Agenda
• Introduction to Unisound & Atlas AI Platform
• Challenges & Solutions
• Test scenarios & Results
• Analysis of the speedup
• Contribution to Fluid
• Conclusion & Next steps
Unisound
• Product lines: AI service, AI UI, and AI chips, spanning chip, cloud, and edge
• Cloud AI service provider (applications: medical, education)
• AI chip provider (applications: home)
• Interactive system provider (applications: car)
• Business domains: education, home, medical, car, security, financial
• Atlas supports all of these applications and enables new business
Atlas AI Platform
• Application layer: speech processing, image processing, semantic processing, recommendation system, data mining, text analysis
• Controller/Manager layer: machine learning models, deep learning models, data preprocessing, feature extraction, statistical analysis; resource management: job scheduler, model registry, user management, logging and monitoring, image registry
• Infrastructure layer: GPU cluster, CPU cluster, and distributed storage cluster, connected by a 100G InfiniBand high-performance network
Platform Storage - Lustre
• Distributed file storage system providing a continuous, large namespace (PB scale)
• Multiple Lustre clusters on different networks (InfiniBand and Ethernet)
• Data verification and replication ensure data safety
• Supports online expansion and version updates
Site: https://documen.site/download/03-session-intel-manager-for-lustre_pdf
Challenges with the current architecture
• Storage I/O pressure
• Small files (<1 MB) increase pressure on the MDT and OSTs
• Single-node, multi-task I/O preemption
• Data duplication
Impacts of the challenges
• I/O pressure -> hardware damage
• Small files -> low QoS
• I/O preemption -> wasted resources
• Duplicate data -> wasted storage
Current solutions for the challenges
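# Lustre NRS Token Bucket Filter (TBF): throttle I/O RPCs per client NID
# The rule below limits clients on the o2ib (InfiniBand) fabric to a rate of 80 RPCs/s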
$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf nid"
$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start other_clients nid={client@o2ib} rate=80"
• Limit client I/O
• Pack small files
• Priority scheduling to idle nodes
• Check for duplicate data
Not effective enough.
New Architecture of Atlas
[Architecture diagram] The new stack has three layers:
• Client (atlasctl): compute jobs mount datasets under /mnt/$group/$user, with access control (UGO + RWX)
• Cache (Alluxio + Fluid): each Fluid Dataset/Runtime runs an Alluxio master (and job master) pod plus, on every node, an Alluxio worker (and job worker) pod backed by tiered RAM/SSD/HDD storage; compute jobs read through an Alluxio FUSE pod on their node, with short-circuit access when data is cached locally, and Dataload pre-warms the cache
• Storage (Lustre): the underlying distributed file system
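The cache layer is declared through Fluid CRDs. Below is a minimal sketch, assuming the Lustre client is mounted on every node at /mnt/lustre and using a hypothetical dataset named demo; the replica count and memory quota are illustrative, not the production values:

$ cat << EOF | kubectl apply -f -
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: demo
spec:
  mounts:
    # Expose the node-local Lustre directory as the under file system (UFS)
    - mountPoint: local:///mnt/lustre/demo
      name: demo
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: demo
spec:
  replicas: 2              # number of Alluxio worker pods
  tieredstore:
    levels:
      - mediumtype: MEM    # cache in memory, as in the experiments below
        path: /dev/shm
        quota: 100Gi
EOF

Once the Dataset is bound, Fluid creates a PV/PVC with the same name that jobs can mount directly.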
Alluxio is all we need & Why
[Diagram: three storage options for Atlas, compared]
• First option: poor performance, low QoS
• Customized storage: good performance, but limited scalability and expensive
• Alluxio: good performance, good scalability, and cheap!
Fluid is all we need & Why
Applications (natural language processing, speech processing, computer vision, ...) bring different data types (massive small files, medium size files, large size files), raising four questions:
• How to manage different users' datasets?
• How to speed up different types of datasets?
• How to auto-scale the Alluxio engine?
• How to schedule jobs with cache locality?
Fluid answers them through its components: Fluid-controller-manager, Fluid-Scheduler, Fluid-Function, ...
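Pre-warming (the "warm" configuration in the experiments below) is driven by Fluid's DataLoad CRD. A minimal sketch, reusing the hypothetical demo dataset from the earlier example:

$ cat << EOF | kubectl apply -f -
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: demo-warmup
spec:
  dataset:
    name: demo
    namespace: default
  target:
    - path: /        # load the whole dataset into the cache
      replicas: 1    # number of cached replicas per file
EOF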
Scenarios Test
• Datasets:
  • Massive small files (<10 MB)
  • Medium size files (>100 MB, <1 GB)
  • Large size files (>100 GB)
• Different scenarios:
  • Noise reduction
  • Image classification
  • Optical character recognition (OCR)
• Comparative test:
  • Read directly from Lustre
  • Read from cache (Alluxio engine)
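In the cached configuration, a job mounts the PVC that Fluid creates for the dataset instead of the Lustre host path. A minimal sketch, again with the hypothetical demo dataset; the container image and training command are placeholders:

$ cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: train-demo
spec:
  containers:
    - name: train
      image: tensorflow/tensorflow:latest-gpu            # placeholder image
      command: ["python", "train.py", "--data", "/data"] # placeholder command
      volumeMounts:
        - name: data
          mountPath: /data   # the dataset appears here, served from the Alluxio cache
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: demo      # PVC created by Fluid for the Dataset
EOF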
Experiment #1 --- Massive small files
• Test details:
  • Noise reduction application
  • 500,000 files, total size 183 GB
  • Cache in memory
• Three experiments:
  • Load raw WAV files from Lustre
  • Load clean data from Alluxio (cold), noisy data from Lustre
  • Load all data from Alluxio (warm)
Massive small files --- Speed comparison
• Test results: 10x speed up
[Chart: massive small files test results]
Massive small files --- Bandwidth and GPU usage comparison
• Test results: Lustre I/O bandwidth drops from 230 MB/s to ~0; GPU usage up 10%
[Charts: I/O bandwidth and GPU usage]
Experiment #2 --- Medium size files
• Test details:
  • Image classification application
  • ImageNet TFRecord files (average size: 138 MB)
  • Cache in memory
• Comparative test:
  • Run a 10-GPU job (exclusive) on a single node
  • Run a 7-GPU job (preemption) on the same node
Medium size files
• Test summary:

                                              Lustre    Alluxio (warm)   Memory    Theoretical peak
  (Preemption) 2500 steps, images/s per GPU   236.9     601.8            N/A       768.9
  (Exclusive) 4000 steps, images/s per GPU    247.2     702.6            699.1     765.9
  E2E time                                    50 min    20 min           15 min    N/A

• Notes:
  • Lustre: load data directly from the distributed storage system
  • Alluxio (warm): data pre-warmed into the cache
  • Memory: load data directly from memory
  • Theoretical peak: load synthetic data
Speed up: 2.5x
Experiment #3 --- Large size files
• Test details:
  • Optical character recognition (OCR) application
  • LMDB size: 125 GB
  • Cache in memory
• Three experiments:
  • Read LMDB from Lustre
  • Read LMDB from Alluxio, without pre-warming
  • Read LMDB from Alluxio, with pre-warming
Large size files --- Speed comparison
• Test result: 30x speed up once all data is in memory
[Chart: large size files test results]
Large size files --- Bandwidth and GPU usage comparison
• Test results: I/O bandwidth drops from 1300 MB/s to ~0 and from 15.6 MB/s to ~0; GPU usage up 31.5%
[Charts: I/O bandwidth and GPU usage]
Results Summary
• Fluid with the Alluxio engine greatly speeds up different types of datasets
• Fluid with the Alluxio engine increases the GPU utilization of tasks
• Fluid with the Alluxio engine significantly reduces the pressure on the Lustre distributed file system
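One way to verify that reads are served from the cache rather than Lustre is to inspect the dataset's cache state. The commands below reuse the hypothetical demo dataset; the master pod and container names assume Fluid's default naming and may differ in practice:

$ # Fluid reports UFS total size, cached bytes, and cached percentage per Dataset
$ kubectl get dataset demo
$ # Ask the Alluxio master for cluster-wide cache capacity and usage
$ kubectl exec -it demo-master-0 -c alluxio-master -- alluxio fsadmin report capacity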
Contribution to Fluid
Conclusion
• Alluxio: high-performance architecture
• Fluid customizes and optimizes Alluxio parameters for the datasets of different scenarios
• Why Alluxio integrates perfectly:
  • Data is immutable throughout the deep learning life cycle
  • Deep learning needs deterministic jobs
• Why Fluid integrates perfectly:
  • Locality-based scheduling is needed on top of Lustre
  • Cache sharing across different tasks is needed
Next Steps
• More Scenarios for Fluid with Alluxio Engine
• More Best Practice Sharing
• Continuous Content Contributions & Community Activities
• ……
Just for Smart Life
Thanks
