2. Agenda
• Distributed DL Challenges
• DL Infrastructure @ Scale
• Parallelize your models
• Techniques for Optimization
• A look into the future
• References
3. Rise of Deep Learning
• Major advances in AI: Computer Vision, Language Translation, Speech Recognition, Question & Answering, …
• Challenging to build & deploy for large-scale applications:
• Latency, cost, and power consumption issues
• Complexity & size outpacing commodity “general purpose compute”
• Hyper-parameter tuning
• Exascale, 15 Watts
4. Shift towards Specialized Compute
Special-purpose cloud: Google TPU, Microsoft Brainwave, IBM PowerAI, Nvidia V100, Intel Nervana
Spectrum: CPU, GPU, FPGA, custom ASICs
Edge compute: hardware accelerators, AI SoCs
Intel Neural Compute Stick, Nvidia Jetson, Nvidia Drive PX (self-driving cars)
Architectures: cluster compute, HPC, neuromorphic, quantum compute
Complexity in software
Model tuning/optimizations specific to the hardware
Growing need for compilers that optimize for the deployment hardware
Workload-specific compute: model training vs. inference
5. CPU Optimizations
Leverage high-performance compute tools
Intel Python, Intel Math Kernel Library (MKL), MKL-DNN
Compile TensorFlow from source for CPU optimizations
Proper data format
NCHW for CPUs vs. the TensorFlow default NHWC
Proper batch size, using all cores & memory
Use queues for reading data (see the sketch below)
Source: Intel Research Blog
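A minimal sketch of applying these CPU settings in TensorFlow 1.x (the era of the APIs this deck uses). The thread counts, the 8-core assumption, and the tensor shapes are illustrative assumptions, not values from the talk.

import os
import numpy as np
import tensorflow as tf

# Assume 8 physical cores; tune these for your machine.
os.environ["OMP_NUM_THREADS"] = "8"   # MKL/OpenMP worker threads
os.environ["KMP_BLOCKTIME"] = "0"     # don't let MKL threads spin-wait

config = tf.ConfigProto(
    intra_op_parallelism_threads=8,   # threads within a single op
    inter_op_parallelism_threads=2)   # independent ops run concurrently

# NCHW layout for the MKL path; stock CPU builds expect the NHWC default,
# so this requires the MKL build described on the next slide.
images = tf.placeholder(tf.float32, [64, 3, 224, 224])    # N, C, H, W
kernel = tf.get_variable("k", [3, 3, 3, 32])              # H, W, in, out
conv = tf.nn.conv2d(images, kernel, strides=[1, 1, 1, 1],
                    padding="SAME", data_format="NCHW")

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    batch = np.random.rand(64, 3, 224, 224).astype(np.float32)
    out = sess.run(conv, feed_dict={images: batch})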
5
6. TensorFlow CPU Optimizations
Compile from source:
git clone https://github.com/tensorflow/tensorflow.git
Run ./configure from the TensorFlow source directory
Select the MKL (CPU) optimization option
Build the pip package for install:
bazel build --config=mkl --copt=-DEIGEN_USE_VML -c opt //tensorflow/tools/pip_package:build_pip_package
Install the optimized TensorFlow wheel:
bazel-bin/tensorflow/tools/pip_package/build_pip_package ~/path_to_save_wheel
pip install --upgrade --user ~/path_to_save_wheel/wheel_name.whl
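As a quick sanity check (an aside, not from the original deck), confirm the freshly built wheel is the one Python actually imports:

import tensorflow as tf
print(tf.__version__)   # should match the version of the wheel you just built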
7. Parallelize your models
Data Parallelism
TensorFlow Estimator + Experiments
Parameter server, worker cluster (see the sketch below)
Intel BigDL on a Spark cluster
Baidu’s Ring AllReduce
HyperTune on Google Cloud ML
Model Parallelism
Graph too large to fit on one machine
TensorFlow model towers
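A minimal sketch of the parameter-server / worker pattern in TensorFlow 1.x; the host names, ports, and task index are placeholder assumptions, and in practice they come from your cluster manager.

import tensorflow as tf

# Placeholder addresses for one PS and two workers.
cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# Each process starts a server for its own role in the cluster.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# replica_device_setter pins variables to the PS job and ops to this worker.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:0", cluster=cluster)):
    x = tf.placeholder(tf.float32, [None, 784])
    logits = tf.layers.dense(x, 10)   # variables land on /job:ps
    # ...build the loss and optimizer, then train in a
    # tf.train.MonitoredTrainingSession so workers coordinate.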
9. Workload Partitioning
Source: Amazon MXNet
Minimize communication time
Place neighboring layers on the same GPU (see the sketch below)
Balance the workload between GPUs
Different layers have different memory and compute properties
The model on the left of the source figure is more balanced
LSTM unrolling: lower memory, higher compute time
Encode/decode: higher memory
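A minimal sketch of manual layer placement with TensorFlow 1.x device scopes; the layer sizes and the two-GPU split are illustrative assumptions.

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 1024])

# Neighboring layers share a GPU so activations cross devices only once.
with tf.device("/gpu:0"):
    h1 = tf.layers.dense(x, 4096, activation=tf.nn.relu)
    h2 = tf.layers.dense(h1, 4096, activation=tf.nn.relu)

with tf.device("/gpu:1"):
    h3 = tf.layers.dense(h2, 4096, activation=tf.nn.relu)
    logits = tf.layers.dense(h3, 10)

# Only h2 is copied between GPUs each step; balancing the layer sizes
# across the two scopes balances memory and compute per device.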
11. Cluster Optimizations
Define your ML container locally
Evaluate with different parameters in the cloud
Use EFS/GFS for data storage and sharing across nodes
Create a separate data-processing container
Mount the EFS/GFS drive on all pods for shared storage
Avoid GPU fragmentation problems by bundling jobs
Placement optimizations: Kubernetes bundling as pods, Mesos placement constraints (see the sketch below)
Bundling GPU drivers in the container is a problem
Mount them as a read-only volume, or use nvidia-docker
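A minimal sketch of the bundling idea with the Kubernetes Python client: request a whole GPU for the pod and mount shared storage read-only. The image name, claim name, namespace, and mount path are placeholder assumptions.

from kubernetes import client, config

config.load_kube_config()   # assumes a configured kubeconfig

container = client.V1Container(
    name="trainer",
    image="example.com/ml-trainer:latest",        # placeholder image
    resources=client.V1ResourceRequirements(
        limits={"nvidia.com/gpu": "1"}),          # whole-GPU request
    volume_mounts=[client.V1VolumeMount(
        name="shared-data", mount_path="/data", read_only=True)],
)

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="train-job-0"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[container],
        volumes=[client.V1Volume(
            name="shared-data",
            # Placeholder: an EFS/NFS-backed PersistentVolumeClaim.
            persistent_volume_claim=client.V1PersistentVolumeClaimVolumeSource(
                claim_name="efs-claim"))],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)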