Speeding Up the Atlas Deep Learning Platform
with Alluxio + Fluid
Yuandong Xie
xieydd@gmail.com
Platform Researcher, Unisound AI Labs
Agenda
• Introduction to Unisound & Atlas AI Platform
• Challenges & Solutions
• Test scenarios & Results
• Analysis of the speedup
• Contribution to Fluid
• Conclusion & Next steps
Unisound
• Product lines: AI service, AI UI, and AI chips, spanning chip, cloud, and edge
• Cloud AI service provider (applications: medical, education)
• AI chip provider (applications: home)
• Interactive system provider (applications: car)
• Business domains: education, home, medical, car, security, financial
• Atlas supports all of these applications and enables new business
Atlas AI Platform
• Application layer: speech processing, image processing, semantic processing, recommendation system, data mining, text analysis
• Controller/Manager layer: machine learning models, deep learning models, data preprocessing, feature extraction, statistical analysis; resource management: job scheduler, model registry, user management, logging and monitoring, image registry
• Infrastructure layer: GPU cluster, CPU cluster, and distributed storage cluster, connected by a 100G InfiniBand high-performance network
Platform Storage - Lustre
• Distributed file storage system providing a continuous, large namespace (PB scale)
• Multiple Lustre clusters on different networks (InfiniBand and Ethernet)
• Data verification and replication ensure data safety
• Supports online expansion and version updates
Site: https://documen.site/download/03-session-intel-manager-for-lustre_pdf
Challenges with the current architecture
• Storage I/O pressure
• Small files (<1 MB) increase pressure on the MDT and OSTs
• Single-node, multi-task I/O preemption
• Data duplication
Impacts of the challenges
• I/O pressure -> hardware damage
• Small files -> low QoS
• I/O preemption -> wasted resources
• Duplicate data -> wasted storage
Current solutions for the challenges
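# Lustre NRS Token Bucket Filter (TBF): throttle I/O RPCs per client NID
# The rule below limits clients on the o2ib (InfiniBand) fabric to a rate of 80 RPCs/s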
$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf nid"
$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start other_clients nid={client@o2ib} rate=80"
• Limit client I/O
• Pack small files
• Priority scheduling to idle nodes
• Check for duplicate data
Not effective enough.
New Architecture of Atlas
[Architecture diagram] The new stack has three layers:
• Client (atlasctl): compute jobs mount datasets under /mnt/$group/$user, with access control (UGO + RWX)
• Cache (Alluxio + Fluid): each Fluid Dataset/Runtime runs an Alluxio master (and job master) pod plus, on every node, an Alluxio worker (and job worker) pod backed by tiered RAM/SSD/HDD storage; compute jobs read through an Alluxio FUSE pod on their node, with short-circuit access when data is cached locally, and Dataload pre-warms the cache
• Storage (Lustre): the underlying distributed file system
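The cache layer is declared through Fluid CRDs. Below is a minimal sketch, assuming the Lustre client is mounted on every node at /mnt/lustre and using a hypothetical dataset named demo; the replica count and memory quota are illustrative, not the production values:

$ cat << EOF | kubectl apply -f -
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: demo
spec:
  mounts:
    # Expose the node-local Lustre directory as the under file system (UFS)
    - mountPoint: local:///mnt/lustre/demo
      name: demo
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: demo
spec:
  replicas: 2              # number of Alluxio worker pods
  tieredstore:
    levels:
      - mediumtype: MEM    # cache in memory, as in the experiments below
        path: /dev/shm
        quota: 100Gi
EOF

Once the Dataset is bound, Fluid creates a PV/PVC with the same name that jobs can mount directly.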
Alluxio is all we need & Why
[Diagram: three storage options for Atlas, compared]
• First option: poor performance, low QoS
• Customized storage: good performance, but limited scalability and expensive
• Alluxio: good performance, good scalability, and cheap!
Fluid is all we need & Why
Applications (natural language processing, speech processing, computer vision, ...) bring different data types (massive small files, medium size files, large size files), raising four questions:
• How to manage different users' datasets?
• How to speed up different types of datasets?
• How to auto-scale the Alluxio engine?
• How to schedule jobs with cache locality?
Fluid answers them through its components: Fluid-controller-manager, Fluid-Scheduler, Fluid-Function, ...
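Pre-warming (the "warm" configuration in the experiments below) is driven by Fluid's DataLoad CRD. A minimal sketch, reusing the hypothetical demo dataset from the earlier example:

$ cat << EOF | kubectl apply -f -
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: demo-warmup
spec:
  dataset:
    name: demo
    namespace: default
  target:
    - path: /        # load the whole dataset into the cache
      replicas: 1    # number of cached replicas per file
EOF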
Scenarios Test
• Datasets:
  • Massive small files (<10 MB)
  • Medium size files (>100 MB, <1 GB)
  • Large size files (>100 GB)
• Different scenarios:
  • Noise reduction
  • Image classification
  • Optical character recognition (OCR)
• Comparative test:
  • Read directly from Lustre
  • Read from cache (Alluxio engine)
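In the cached configuration, a job mounts the PVC that Fluid creates for the dataset instead of the Lustre host path. A minimal sketch, again with the hypothetical demo dataset; the container image and training command are placeholders:

$ cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: train-demo
spec:
  containers:
    - name: train
      image: tensorflow/tensorflow:latest-gpu            # placeholder image
      command: ["python", "train.py", "--data", "/data"] # placeholder command
      volumeMounts:
        - name: data
          mountPath: /data   # the dataset appears here, served from the Alluxio cache
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: demo      # PVC created by Fluid for the Dataset
EOF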
Experiment #1 --- Massive small files
• Test details:
  • Noise reduction application
  • 500,000 files, total size 183 GB
  • Cache in memory
• Three experiments:
  • Load raw WAV files from Lustre
  • Load clean data from Alluxio (cold), noisy data from Lustre
  • Load all data from Alluxio (warm)
Massive small files --- Speed comparison
• Test results: 10x speed up
[Chart: massive small files test results]
Massive small files --- Bandwidth and GPU usage comparison
• Test results: Lustre I/O bandwidth drops from 230 MB/s to ~0; GPU usage up 10%
[Charts: I/O bandwidth and GPU usage]
Experiment #2 --- Medium size files
• Test details:
  • Image classification application
  • ImageNet TFRecord files (average size: 138 MB)
  • Cache in memory
• Comparative test:
  • Run a 10-GPU job (exclusive) on a single node
  • Run a 7-GPU job (preemption) on the same node
Medium size files
• Test summary:

                                              Lustre    Alluxio (warm)   Memory    Theoretical peak
  (Preemption) 2500 steps, images/s per GPU   236.9     601.8            N/A       768.9
  (Exclusive) 4000 steps, images/s per GPU    247.2     702.6            699.1     765.9
  E2E time                                    50 min    20 min           15 min    N/A

• Notes:
  • Lustre: load data directly from the distributed storage system
  • Alluxio (warm): data pre-warmed into the cache
  • Memory: load data directly from memory
  • Theoretical peak: load synthetic data
Speed up: 2.5x
Experiment #3 --- Large size files
• Test details:
  • Optical character recognition (OCR) application
  • LMDB size: 125 GB
  • Cache in memory
• Three experiments:
  • Read LMDB from Lustre
  • Read LMDB from Alluxio, without pre-warming
  • Read LMDB from Alluxio, with pre-warming
Large size files --- Speed comparison
• Test result: 30x speed up once all data is in memory
[Chart: large size files test results]
Large size files --- Bandwidth and GPU usage comparison
• Test results: I/O bandwidth drops from 1300 MB/s to ~0 and from 15.6 MB/s to ~0; GPU usage up 31.5%
[Charts: I/O bandwidth and GPU usage]
Results Summary
• Fluid with the Alluxio engine greatly speeds up different types of datasets
• Fluid with the Alluxio engine increases the GPU utilization of tasks
• Fluid with the Alluxio engine significantly reduces the pressure on the Lustre distributed file system
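One way to verify that reads are served from the cache rather than Lustre is to inspect the dataset's cache state. The commands below reuse the hypothetical demo dataset; the master pod and container names assume Fluid's default naming and may differ in practice:

$ # Fluid reports UFS total size, cached bytes, and cached percentage per Dataset
$ kubectl get dataset demo
$ # Ask the Alluxio master for cluster-wide cache capacity and usage
$ kubectl exec -it demo-master-0 -c alluxio-master -- alluxio fsadmin report capacity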
Contribution to Fluid
Conclusion
• Alluxio: high-performance architecture
• Fluid customizes and optimizes Alluxio parameters for the datasets of different scenarios
• Why Alluxio integrates perfectly:
  • Data is immutable throughout the deep learning life cycle
  • Deep learning needs deterministic jobs
• Why Fluid integrates perfectly:
  • Locality-based scheduling is needed on top of Lustre
  • Cache sharing across different tasks is needed
Next Steps
• More Scenarios for Fluid with Alluxio Engine
• More Best Practice Sharing
• Continuous Content Contributions & Community Activities
• ……
Just for Smart Life
Thanks
