2. Agenda
• Distributed DL Challenges
• DL Infrastructure @ Scale
• Parallelize your models
• Techniques for Optimization
• A look into the future
• References
3. Rise of Deep Learning
• Major advances in AI: Computer Vision, Language Translation, Speech Recognition, Question & Answering, …
• Challenging to build & deploy for large-scale applications:
• Latency, cost, and power consumption issues
• Complexity & size outpacing commodity “general purpose compute”
• Hyper-parameter tuning
• Exascale, 15 Watts
4. Shift towards Specialized Compute
Special-purpose cloud: Google TPU, Microsoft Brainwave, IBM PowerAI, Nvidia V100, Intel Nervana
Spectrum: CPU, GPU, FPGA, custom ASICs
Edge compute: hardware accelerators, AI SoCs
Intel Neural Compute Stick, Nvidia Jetson, Nvidia Drive PX (self-driving cars)
Architectures: cluster compute, HPC, neuromorphic, quantum compute
Complexity in software
Model tuning/optimizations specific to the hardware
Growing need for compilers that optimize for the deployment hardware
Workload-specific compute: model training vs. inference
5. CPU Optimizations
Leverage high-performance compute tools
Intel Python, Intel Math Kernel Library (MKL), MKL-DNN
Compile TensorFlow from source for CPU optimizations
Proper data format
NCHW for CPUs vs. the TensorFlow default NHWC
Proper batch size, using all cores & memory
Use queues for reading data (see the sketch below)
Source: Intel Research Blog
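A minimal sketch of applying these CPU settings in TensorFlow 1.x (the era of the APIs this deck uses). The thread counts, the 8-core assumption, and the tensor shapes are illustrative assumptions, not values from the talk.

import os
import numpy as np
import tensorflow as tf

# Assume 8 physical cores; tune these for your machine.
os.environ["OMP_NUM_THREADS"] = "8"   # MKL/OpenMP worker threads
os.environ["KMP_BLOCKTIME"] = "0"     # don't let MKL threads spin-wait

config = tf.ConfigProto(
    intra_op_parallelism_threads=8,   # threads within a single op
    inter_op_parallelism_threads=2)   # independent ops run concurrently

# NCHW layout for the MKL path; stock CPU builds expect the NHWC default,
# so this requires the MKL build described on the next slide.
images = tf.placeholder(tf.float32, [64, 3, 224, 224])    # N, C, H, W
kernel = tf.get_variable("k", [3, 3, 3, 32])              # H, W, in, out
conv = tf.nn.conv2d(images, kernel, strides=[1, 1, 1, 1],
                    padding="SAME", data_format="NCHW")

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    batch = np.random.rand(64, 3, 224, 224).astype(np.float32)
    out = sess.run(conv, feed_dict={images: batch})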
5
6. TensorFlow CPU Optimizations
Compile from source:
git clone https://github.com/tensorflow/tensorflow.git
Run ./configure from the TensorFlow source directory
Select the MKL (CPU) optimization option
Build the pip package for install:
bazel build --config=mkl --copt=-DEIGEN_USE_VML -c opt //tensorflow/tools/pip_package:build_pip_package
Install the optimized TensorFlow wheel:
bazel-bin/tensorflow/tools/pip_package/build_pip_package ~/path_to_save_wheel
pip install --upgrade --user ~/path_to_save_wheel/wheel_name.whl
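As a quick sanity check (an aside, not from the original deck), confirm the freshly built wheel is the one Python actually imports:

import tensorflow as tf
print(tf.__version__)   # should match the version of the wheel you just built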
7. Parallelize your models
Data Parallelism
TensorFlow Estimator + Experiments
Parameter server, worker cluster (see the sketch below)
Intel BigDL on a Spark cluster
Baidu’s Ring AllReduce
HyperTune on Google Cloud ML
Model Parallelism
Graph too large to fit on one machine
TensorFlow model towers
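A minimal sketch of the parameter-server / worker pattern in TensorFlow 1.x; the host names, ports, and task index are placeholder assumptions, and in practice they come from your cluster manager.

import tensorflow as tf

# Placeholder addresses for one PS and two workers.
cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# Each process starts a server for its own role in the cluster.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# replica_device_setter pins variables to the PS job and ops to this worker.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:0", cluster=cluster)):
    x = tf.placeholder(tf.float32, [None, 784])
    logits = tf.layers.dense(x, 10)   # variables land on /job:ps
    # ...build the loss and optimizer, then train in a
    # tf.train.MonitoredTrainingSession so workers coordinate.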
9. Workload Partitioning
Source: Amazon MXNet
Minimize communication time
Place neighboring layers on the same GPU (see the sketch below)
Balance the workload between GPUs
Different layers have different memory and compute properties
The model on the left of the source figure is more balanced
LSTM unrolling: lower memory, higher compute time
Encode/decode: higher memory
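A minimal sketch of manual layer placement with TensorFlow 1.x device scopes; the layer sizes and the two-GPU split are illustrative assumptions.

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 1024])

# Neighboring layers share a GPU so activations cross devices only once.
with tf.device("/gpu:0"):
    h1 = tf.layers.dense(x, 4096, activation=tf.nn.relu)
    h2 = tf.layers.dense(h1, 4096, activation=tf.nn.relu)

with tf.device("/gpu:1"):
    h3 = tf.layers.dense(h2, 4096, activation=tf.nn.relu)
    logits = tf.layers.dense(h3, 10)

# Only h2 is copied between GPUs each step; balancing the layer sizes
# across the two scopes balances memory and compute per device.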
11. Cluster Optimizations
Define your ML container locally
Evaluate with different parameters in the cloud
Use EFS/GFS for data storage and sharing across nodes
Create a separate data-processing container
Mount the EFS/GFS drive on all pods for shared storage
Avoid GPU fragmentation problems by bundling jobs
Placement optimizations: Kubernetes bundling as pods, Mesos placement constraints (see the sketch below)
Bundling GPU drivers in the container is a problem
Mount them as a read-only volume, or use nvidia-docker
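A minimal sketch of the bundling idea with the Kubernetes Python client: request a whole GPU for the pod and mount shared storage read-only. The image name, claim name, namespace, and mount path are placeholder assumptions.

from kubernetes import client, config

config.load_kube_config()   # assumes a configured kubeconfig

container = client.V1Container(
    name="trainer",
    image="example.com/ml-trainer:latest",        # placeholder image
    resources=client.V1ResourceRequirements(
        limits={"nvidia.com/gpu": "1"}),          # whole-GPU request
    volume_mounts=[client.V1VolumeMount(
        name="shared-data", mount_path="/data", read_only=True)],
)

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="train-job-0"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[container],
        volumes=[client.V1Volume(
            name="shared-data",
            # Placeholder: an EFS/NFS-backed PersistentVolumeClaim.
            persistent_volume_claim=client.V1PersistentVolumeClaimVolumeSource(
                claim_name="efs-claim"))],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)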