1. Umang Sharma
Deep Learning Data Scientist
Author
@umang_sha
Distributed Deep Learning
Training over Clusters:
Parallelizing Everything
2. Talk Agenda
• Hardware Requirements for Deep Learning
• CPUs vs GPUs for Deep Learning
• The Challenges in using GPUs
• Scaling it up: Using a Cluster of GPUs
• How Does TensorFlow Do It?
• Parameter Sharing
• The Solution: Introducing Horovod
• How Does It Work?
• Questions
3. Hardware Requirements for Deep Learning
• Neural Networks primarily require intense matrix multiplications.
• On top of that, deep learning requires a huge amount of training data.
• Huge data means more memory is required for the computations.
• With each layer of a neural network, the number of parameters increases many fold.
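To put rough numbers on that growth, a small back-of-the-envelope sketch in Python (the layer sizes are assumptions chosen only for illustration):

# Rough parameter count for a stack of fully connected layers.
# Layer sizes are made-up values used only to illustrate the growth.
layer_sizes = [784, 1024, 1024, 512, 10]

total = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    params = n_in * n_out + n_out   # weight matrix plus biases
    total += params
    print(f"{n_in:>4} -> {n_out:<4}: {params:,} parameters")
print(f"Total: {total:,} parameters")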
4. CPUs vs GPUs for Deep Learning
• Deep learning tasks require a good amount of memory to read data, hence memory bandwidth becomes an important factor.
• Wait, what's memory bandwidth, though?
• Memory bandwidth is the rate at which data can be read from or stored into a semiconductor memory by a processor.
• A CPU shares its system memory with everything else running on the machine. The standalone GPU, on the other hand, comes with dedicated VRAM, so the CPU's memory is left free for other tasks.
• GPUs do much better than CPUs here, let's see!
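As a hedged back-of-the-envelope example of why bandwidth matters (the bandwidth figures below are ballpark assumptions, not measurements):

# Time to move one batch of images through memory at a given bandwidth.
# All numbers here are illustrative assumptions, not benchmarks.
batch_images = 256
bytes_per_image = 224 * 224 * 3 * 4            # one float32 RGB image
batch_bytes = batch_images * bytes_per_image   # ~154 MB

cpu_bandwidth = 50e9     # ~50 GB/s, ballpark for DDR4 system memory
gpu_bandwidth = 900e9    # ~900 GB/s, ballpark for HBM2 VRAM on a modern GPU

print(f"CPU memory: {batch_bytes / cpu_bandwidth * 1e3:.2f} ms per batch")
print(f"GPU memory: {batch_bytes / gpu_bandwidth * 1e3:.2f} ms per batch")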
6. Another Advantage of GPUs over CPUs
• GPUs consist of more cores than CPUs, and hence are able to perform these memory-intensive calculations faster and in a more optimised way.
• GPUs are able to parallelise these operations thanks to the large number of small cores present in them.
• To paint a picture, imagine the CPU is a Ferrari and the GPU is a huge truck: the Ferrari, though fast, can only transport two people at a time, while the truck can carry a large number of people, which makes it better for our purpose.
8. The Challenge in using GPUs
• Utilising GPUs is a tricky process; the reason is that one needs to write low-level code to access them.
• CUDA is NVIDIA's platform for programming its GPUs, which deep learning libraries build on.
• It is not just your code specifically that matters; it is the entire code path between your concept and the CUDA cores that execute it on the GPU.
• But worry not, things have improved for good.
• Deep learning frameworks such as TensorFlow and PyTorch take care of this: they turn your Python code into a computation graph and translate it into CUDA code that runs on the GPU.
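For instance, a minimal TensorFlow 2-style sketch: the Python below never touches CUDA directly, yet the matrix multiplication is dispatched to GPU kernels whenever a GPU is available (the matrix sizes are arbitrary):

import tensorflow as tf

# Two large random matrices; the framework, not our code, decides how the
# multiplication is lowered to CUDA kernels when a GPU is present.
a = tf.random.normal((4096, 4096))
b = tf.random.normal((4096, 4096))

device = "/GPU:0" if tf.config.list_physical_devices("GPU") else "/CPU:0"
with tf.device(device):
    c = tf.matmul(a, b)

print("Ran matmul on", c.device)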
9. Scaling it up: Using a Cluster of GPUs
• So far what we discussed applies to simple models.
• As we create more and more complex models, a single GPU is no longer enough; one needs multiple GPUs, namely a cluster of GPUs.
• Unfortunately, parallelising tasks across GPUs isn't as simple as it is on CPUs.
• Fortunately, TensorFlow provides a way to distribute training amongst the GPUs; it's called tf.distributed().
10. How Does TensorFlow Do It?
• There are two types of parallelism possible in deep learning training: data parallelism and model parallelism.
• The most widely used is data parallelism, which is more suitable for deep learning with huge amounts of data.
• The data gets divided across multiple GPUs; each GPU runs its own copy of the model training, and the training parameters are shared.
• This parameter-sharing approach is called the centralised approach.
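A toy, single-process NumPy sketch of that centralised data-parallel pattern (the linear model, the data, and the worker count are all made up; one loop stands in for the parameter server and the workers):

import numpy as np

# Toy linear model y = X @ w trained with data parallelism: each "GPU" gets a
# shard of the batch, computes gradients on its own copy of the parameters,
# and a central parameter server averages the gradients and updates w.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(256, 10)), rng.normal(size=(256,))
w = np.zeros(10)                       # parameters held by the parameter server
num_workers, lr = 4, 0.1

for step in range(100):
    shards = zip(np.array_split(X, num_workers), np.array_split(y, num_workers))
    grads = []
    for Xs, ys in shards:              # each worker: forward + backward on its shard
        err = Xs @ w - ys
        grads.append(2 * Xs.T @ err / len(ys))
    w -= lr * np.mean(grads, axis=0)   # parameter server averages and updates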
12. Challenges in this approach
• But that too comes with its own challenges 😕
13. Challenges in Centralised Approach
• It becomes a challenge to decide the right ratio of parameter servers to workers.
• If multiple parameter servers are used, the communication pattern becomes "all-to-all", which may saturate the network interconnects.
• Another key challenge: as we try to implement this, the TF code becomes more and more complex.
• The data scientist needs to add more parameter-server and worker-level code (see the sketch after this list).
• Let's look at the performance now.
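Before the numbers, a rough sketch of the extra parameter-server and worker boilerplate mentioned above, in the old TF1-style distributed API (the host names, the job assignment, and the model body are placeholders; this is illustrative, not a complete training script):

import tensorflow as tf   # TF1-style API, shown only to illustrate the boilerplate

job_name, task_index = "worker", 0     # every process must be told its role

# The cluster has to be described explicitly: which hosts act as parameter
# servers and which as workers (addresses below are placeholders).
cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()                      # parameter servers just host the variables
else:
    # Variables get pinned to the parameter servers, compute ops to this worker.
    with tf.device(tf.train.replica_device_setter(cluster=cluster)):
        ...                            # model, loss, optimizer and session go here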
14. Scaling on tf.distributed(). Source: Uber Engineering Blog
Note: the ideal throughput is computed by multiplying the single-GPU rate by the number of GPUs.
16. How Does Horovod Do It?
• Horovod uses a different algorithm, called ring-allreduce, which it adopted from Baidu.
• The algorithm works in a totally different way from the centralised approach.
• It is rather a decentralised approach.
• This approach works faster and uses less bandwidth than parameter sharing.
• But how does it work? Let's see.
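As a rough illustration of the idea (a single-process NumPy simulation of ring-allreduce, not Horovod's actual NCCL/MPI implementation): the gradient on each worker is split into as many chunks as there are workers, the chunks are passed around a ring and summed, and then the reduced chunks travel around the ring once more so every worker ends up with the full result.

import numpy as np

def ring_allreduce(tensors):
    """Simulate ring-allreduce: every worker ends up with the sum of all tensors."""
    n = len(tensors)                                   # number of workers in the ring
    chunks = [np.array_split(t.astype(float), n) for t in tensors]

    # Phase 1: scatter-reduce. After n-1 steps, worker i holds the fully
    # summed chunk (i + 1) % n.
    for step in range(n - 1):
        for i in range(n):
            send = (i - step) % n                      # chunk worker i passes on
            chunks[(i + 1) % n][send] = chunks[(i + 1) % n][send] + chunks[i][send]

    # Phase 2: allgather. The reduced chunks travel once more around the ring.
    for step in range(n - 1):
        for i in range(n):
            send = (i + 1 - step) % n
            chunks[(i + 1) % n][send] = chunks[i][send]

    return [np.concatenate(c) for c in chunks]

workers = [np.arange(8) * (rank + 1) for rank in range(4)]   # toy "gradients"
for reduced in ring_allreduce(workers):
    assert np.allclose(reduced, sum(workers))          # everyone has the same sum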
18. Why Does It Work?
• Removing the need for a parameter server leads to lower communication overhead.
• The algorithm is bandwidth-optimal, meaning that if the buffer is large enough,
it will optimally utilize the available network.
• The allreduce approach is much easier to understand and adopt.
• All the user needs to do is modify their program to average gradients using
an allreduce() operation.
19. Implementing Horovod in your code
• Implementation is pretty simple, since Horovod is packaged as a Python package.
• Step 1
hvd.init() initializes Horovod.
• Step 2
config.gpu_options.visible_device_list = str(hvd.local_rank()) assigns a GPU to each of the TensorFlow processes.
20. • Step 3
opt = hvd.DistributedOptimizer(opt) wraps any regular TensorFlow optimizer with the Horovod optimizer, which takes care of averaging gradients using ring-allreduce.
• Step 4
hvd.BroadcastGlobalVariablesHook(0) broadcasts variables from the first process to all other processes to ensure consistent initialization. If the program does not use tf.train.MonitoredTrainingSession, the hvd.broadcast_global_variables op can be run explicitly instead.
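Putting the four steps together, here is a rough TF1-style sketch along the lines of the Horovod documentation (the tiny regression model, the data, and the hyperparameters are placeholders invented for illustration):

import numpy as np
import tensorflow as tf               # TF1-style API, matching the snippets above
import horovod.tensorflow as hvd

hvd.init()                            # Step 1: initialize Horovod

config = tf.ConfigProto()             # Step 2: pin one GPU per process
config.gpu_options.visible_device_list = str(hvd.local_rank())

# A toy regression model, made up purely for illustration.
x = tf.placeholder(tf.float32, [None, 10])
y = tf.placeholder(tf.float32, [None, 1])
loss = tf.reduce_mean(tf.square(tf.layers.dense(x, 1) - y))

# Step 3: wrap the optimizer so gradients are averaged with ring-allreduce.
opt = hvd.DistributedOptimizer(tf.train.GradientDescentOptimizer(0.01))
train_op = opt.minimize(loss)

# Step 4: broadcast initial variables from rank 0 for consistent initialization.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

with tf.train.MonitoredTrainingSession(config=config, hooks=hooks) as sess:
    for _ in range(100):
        xs = np.random.rand(32, 10).astype(np.float32)
        ys = np.random.rand(32, 1).astype(np.float32)
        sess.run(train_op, feed_dict={x: xs, y: ys})

Launched with something like horovodrun -np 4 python train.py (or an equivalent mpirun command), one copy of this script runs per GPU and Horovod keeps their gradients in sync.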