Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Deep Learning in the Cloud at Scale: A Data Orchestration Story
Mickey Zhang, Software Engineer (Microsoft)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
1. Deep Learning in the Cloud at Scale: A Data Orchestration Story
Chao Wang, Mickey Zhang, Qianjun Xu
2. Overview
Machine Learning Requirements
End-to-End lifecycle and processes
Data Scientist Workflow
Deep Learning on Azure Machine Learning
Deep Learning: Additional Requirements
Distributed Training with Azure ML Compute
Kubernetes + Alluxio
4. Machine Learning
Typical E2E Process
Prepare Data
Register and Manage Model
Train & Test Model
Build Image
…
Build Model (your favorite IDE)
Deploy Service
Monitor Model
Phases: Prepare, Experiment, Deploy
Orchestrate
5. DevOps loop for data science
Prepare Data
Prepare
Register and Manage Model
Build Image
…
Build Model (your favorite IDE)
Deploy Service
Monitor Model
Train & Test Model
7. Characteristics of Deep Learning
Massive amounts of training data
Excels with raw, unstructured data
Automatic feature extraction
Computationally expensive
8. Distributed training mode: Data parallelism
Job manager: Dataset, CNN model
Worker 1: Subset 1, CNN model
Worker 2: Subset 2, CNN model
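The data-parallel layout on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Azure ML implementation: a one-parameter linear model stands in for the CNN, the workers run sequentially rather than on separate machines, and the names `split_dataset`, `worker_gradient`, and `train_data_parallel` are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (input x, target y)


def split_dataset(dataset: List[Sample], num_workers: int) -> List[List[Sample]]:
    """Job manager step: give each worker a disjoint subset of the dataset."""
    return [dataset[i::num_workers] for i in range(num_workers)]


def worker_gradient(weight: float, subset: List[Sample]) -> float:
    """Worker step: mean-squared-error gradient for y = weight * x on one subset."""
    return sum(2 * (weight * x - y) * x for x, y in subset) / len(subset)


def train_data_parallel(
    dataset: List[Sample], num_workers: int = 2, lr: float = 0.01, steps: int = 200
) -> float:
    subsets = split_dataset(dataset, num_workers)
    weight = 0.0  # every worker holds a full copy of this toy model
    for _ in range(steps):
        # In a real cluster these calls would run concurrently, one per worker.
        grads = [worker_gradient(weight, subset) for subset in subsets]
        # Average the per-worker gradients and apply one shared update.
        weight -= lr * sum(grads) / len(grads)
    return weight


# Toy dataset generated from y = 3x; the learned weight should approach 3.
dataset = [(float(x), 3.0 * x) for x in range(1, 9)]
learned_weight = train_data_parallel(dataset)
```

Each worker sees only its own subset but the same model, so the only thing that has to be exchanged per step is the gradient.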
9. Distributed training mode: Model parallelism
Job manager: Dataset, CNN model
Worker 1: CNN model Subset 1
Worker 2: CNN model Subset 2
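For contrast with data parallelism, the model-parallel layout can be sketched as follows. This is an illustrative toy under assumptions, not the presenters' code: the "model" is a list of simple element-wise layers instead of a CNN, and `Worker`, `scale`, `add_bias`, and `forward_model_parallel` are hypothetical names. The point is that each worker holds only part of the model while the same batch flows through both.

```python
from typing import Callable, List

Layer = Callable[[List[float]], List[float]]


def scale(factor: float) -> Layer:
    """Layer that multiplies every activation by a constant."""
    return lambda xs: [factor * x for x in xs]


def add_bias(bias: float) -> Layer:
    """Layer that adds a constant to every activation."""
    return lambda xs: [x + bias for x in xs]


def relu(xs: List[float]) -> List[float]:
    return [max(0.0, x) for x in xs]


class Worker:
    """Holds one subset of the model's layers (hypothetical stand-in for a GPU node)."""

    def __init__(self, name: str, layers: List[Layer]) -> None:
        self.name = name
        self.layers = layers

    def forward(self, activations: List[float]) -> List[float]:
        for layer in self.layers:
            activations = layer(activations)
        return activations


# The full model is split: the first half lives on worker 1, the rest on worker 2.
model: List[Layer] = [scale(2.0), relu, scale(3.0), add_bias(1.0)]
worker1 = Worker("worker-1", model[:2])
worker2 = Worker("worker-2", model[2:])


def forward_model_parallel(batch: List[float]) -> List[float]:
    # Worker 1's output activations are sent to worker 2 as its input.
    return worker2.forward(worker1.forward(batch))


output = forward_model_parallel([-1.0, 0.5])
```

Here the traffic between workers is activations rather than gradients, which is why this mode is usually chosen when the model itself does not fit on one device.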
10. Challenges of distributed training
Dependencies and Containers
Provision clusters of VMs
Schedule jobs
Distribute data
Gather results
Handle failures
Scale resources
Secure Access
13. Typical Data Consumption Model
Storage/NFS
Deep Learning Training Platform
Azure Kubernetes Service
Node resources (shown for each of three nodes): RAM, SSD, GPU, CPU
14. Why Alluxio?
Scalable: Performance scales with the cluster size
Lower Cost: Leverage idle resources in the cluster
Performance: Improve data access throughput by distributing data across nodes
Flexibility: Manage multiple data sources in a unified namespace
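The "unified namespace" point can be illustrated with a small mount table that maps one logical path tree onto several backing stores. This is a conceptual sketch only, not Alluxio's API: the class `UnifiedNamespace`, its methods, and the example storage URIs are all made up for illustration.

```python
from typing import Dict


class UnifiedNamespace:
    """Hypothetical mount table: logical path prefix -> backing storage URI."""

    def __init__(self) -> None:
        self._mounts: Dict[str, str] = {}

    def mount(self, mount_point: str, backing_uri: str) -> None:
        self._mounts[mount_point.rstrip("/")] = backing_uri.rstrip("/")

    def resolve(self, path: str) -> str:
        """Translate a logical path to its backing URI using the longest matching mount."""
        best = None
        for mount_point in self._mounts:
            if path == mount_point or path.startswith(mount_point + "/"):
                if best is None or len(mount_point) > len(best):
                    best = mount_point
        if best is None:
            raise KeyError(f"no mount covers {path}")
        return self._mounts[best] + path[len(best):]


namespace = UnifiedNamespace()
# Example URIs are placeholders, not real endpoints.
namespace.mount("/datasets/images", "s3://example-bucket/images")
namespace.mount("/datasets/logs", "hdfs://example-namenode/logs")

resolved = namespace.resolve("/datasets/images/train/cat.jpg")
```

Training code then only ever sees paths under `/datasets`, regardless of which storage system actually holds the bytes.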
15. Side-Car Model With Alluxio
Storage
Deep Learning Training Platform
Azure Kubernetes Service
Node resources (shown for each of three nodes): RAM, SSD, GPU, CPU
Data Preloaded in Cluster
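The benefit of preloading data into the cluster can be sketched with a node-local cache placed in front of remote storage. This is a simplified model under assumptions, not how Alluxio is implemented: `NodeLocalCache` is a hypothetical class, a dictionary stands in for remote storage, and another dictionary stands in for the node's RAM/SSD tiers.

```python
from typing import Dict, Iterable


class NodeLocalCache:
    """Hypothetical node-local cache in front of slower remote storage."""

    def __init__(self, remote_store: Dict[str, bytes]) -> None:
        self.remote_store = remote_store
        self.local: Dict[str, bytes] = {}  # stands in for RAM/SSD on the node
        self.remote_reads = 0  # counts trips to remote storage

    def _fetch_remote(self, path: str) -> bytes:
        self.remote_reads += 1
        return self.remote_store[path]

    def preload(self, paths: Iterable[str]) -> None:
        """Pull data into the cluster before training starts."""
        for path in paths:
            if path not in self.local:
                self.local[path] = self._fetch_remote(path)

    def read(self, path: str) -> bytes:
        """Serve from local storage when possible; fall back to remote on a miss."""
        if path not in self.local:
            self.local[path] = self._fetch_remote(path)
        return self.local[path]


remote = {"/train/part-0": b"aaa", "/train/part-1": b"bbb"}
cache = NodeLocalCache(remote)
cache.preload(remote.keys())

# Three training epochs re-read every file, but none of them touch remote storage.
for _epoch in range(3):
    for path in remote:
        cache.read(path)
```

With the data preloaded, the number of remote reads stays at one per file no matter how many epochs the training job runs.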