1. Umang Sharma
Deep Learning Data Scientist
Author
@umang_sha
Distributed Deep Learning
Training over Clusters:
Parallelizing Everything
2. Talk Agenda
• Hardware Requirements for Deep Learning
• CPUs vs GPUs for Deep Learning
• The Challenges in using GPUs
• Scaling it up: Using a Cluster of GPUs
• How Does TensorFlow Do It?
• Parameter Sharing
• The Solution: Introducing Horovod
• How Does It Work?
• Questions
3. Hardware Requirements for Deep Learning
• Neural Networks primarily require intense matrix multiplications.
• On top of that, deep learning requires a huge amount of training data.
• Huge data means more memory is required for the computations.
• With each layer of a neural network, the number of parameters increases many fold.
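To put rough numbers on that growth, a small back-of-the-envelope sketch in Python (the layer sizes are assumptions chosen only for illustration):

# Rough parameter count for a stack of fully connected layers.
# Layer sizes are made-up values used only to illustrate the growth.
layer_sizes = [784, 1024, 1024, 512, 10]

total = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    params = n_in * n_out + n_out   # weight matrix plus biases
    total += params
    print(f"{n_in:>4} -> {n_out:<4}: {params:,} parameters")
print(f"Total: {total:,} parameters")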
4. CPUs vs GPUs for Deep Learning
• Deep learning tasks require a good amount of memory to read data, hence memory bandwidth becomes an important factor.
• Wait, what's memory bandwidth, though?
• Memory bandwidth is the rate at which data can be read from or stored into a semiconductor memory by a processor.
• A CPU shares its system memory with everything else running on the machine. The standalone GPU, on the other hand, comes with dedicated VRAM, so the CPU's memory is left free for other tasks.
• GPUs do much better than CPUs here, let's see!
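As a hedged back-of-the-envelope example of why bandwidth matters (the bandwidth figures below are ballpark assumptions, not measurements):

# Time to move one batch of images through memory at a given bandwidth.
# All numbers here are illustrative assumptions, not benchmarks.
batch_images = 256
bytes_per_image = 224 * 224 * 3 * 4            # one float32 RGB image
batch_bytes = batch_images * bytes_per_image   # ~154 MB

cpu_bandwidth = 50e9     # ~50 GB/s, ballpark for DDR4 system memory
gpu_bandwidth = 900e9    # ~900 GB/s, ballpark for HBM2 VRAM on a modern GPU

print(f"CPU memory: {batch_bytes / cpu_bandwidth * 1e3:.2f} ms per batch")
print(f"GPU memory: {batch_bytes / gpu_bandwidth * 1e3:.2f} ms per batch")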
6. Another Advantage of GPUs over CPUs
• GPUs consist of more cores than CPUs, and hence are able to perform these memory-intensive calculations faster and in a more optimised way.
• GPUs are able to parallelise these operations thanks to the large number of small cores present in them.
• To paint a picture, imagine the CPU is a Ferrari and the GPU is a huge truck: the Ferrari, though fast, can only transport two people at a time, while the truck can carry a large number of people, which makes it better for our purpose.
8. The Challenge in using GPUs
• Utilising GPUs is a tricky process; the reason is that one needs to write low-level code to access them.
• CUDA is NVIDIA's platform for programming its GPUs, which deep learning libraries build on.
• It is not just your code specifically that matters; it is the entire code path between your concept and the CUDA cores that execute it on the GPU.
• But worry not, things have improved for good.
• Deep learning frameworks such as TensorFlow and PyTorch take care of this: they turn your Python code into a computation graph and translate it into CUDA code that runs on the GPU.
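For instance, a minimal TensorFlow 2-style sketch: the Python below never touches CUDA directly, yet the matrix multiplication is dispatched to GPU kernels whenever a GPU is available (the matrix sizes are arbitrary):

import tensorflow as tf

# Two large random matrices; the framework, not our code, decides how the
# multiplication is lowered to CUDA kernels when a GPU is present.
a = tf.random.normal((4096, 4096))
b = tf.random.normal((4096, 4096))

device = "/GPU:0" if tf.config.list_physical_devices("GPU") else "/CPU:0"
with tf.device(device):
    c = tf.matmul(a, b)

print("Ran matmul on", c.device)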
9. Scaling it up: Using a Cluster of GPUs
• So far what we discussed applies to simple models.
• As we create more and more complex models, a single GPU is no longer enough; one needs multiple GPUs, namely a cluster of GPUs.
• Unfortunately, parallelising tasks across GPUs isn't as simple as it is on CPUs.
• Fortunately, TensorFlow provides a way to distribute training amongst the GPUs; it's called tf.distributed().
10. How Does TensorFlow Do It?
• There are two types of parallelism possible in deep learning training: data parallelism and model parallelism.
• The most widely used is data parallelism, which is more suitable for deep learning with huge amounts of data.
• The data gets divided across multiple GPUs; each GPU runs its own copy of the model training, and the training parameters are shared.
• This parameter-sharing approach is called the centralised approach.
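A toy, single-process NumPy sketch of that centralised data-parallel pattern (the linear model, the data, and the worker count are all made up; one loop stands in for the parameter server and the workers):

import numpy as np

# Toy linear model y = X @ w trained with data parallelism: each "GPU" gets a
# shard of the batch, computes gradients on its own copy of the parameters,
# and a central parameter server averages the gradients and updates w.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(256, 10)), rng.normal(size=(256,))
w = np.zeros(10)                       # parameters held by the parameter server
num_workers, lr = 4, 0.1

for step in range(100):
    shards = zip(np.array_split(X, num_workers), np.array_split(y, num_workers))
    grads = []
    for Xs, ys in shards:              # each worker: forward + backward on its shard
        err = Xs @ w - ys
        grads.append(2 * Xs.T @ err / len(ys))
    w -= lr * np.mean(grads, axis=0)   # parameter server averages and updates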
12. Challenges in this approach
• But that too comes with its own challenges 😕
13. Challenges in Centralised Approach
• It becomes a challenge to decide the right ratio of parameter servers to workers.
• If multiple parameter servers are used, the communication pattern becomes "all-to-all", which may saturate the network interconnects.
• Another key challenge: as we try to implement this, the TF code becomes more and more complex.
• The data scientist needs to add more parameter-server and worker-level code (see the sketch after this list).
• Let's look at the performance now.
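Before the numbers, a rough sketch of the extra parameter-server and worker boilerplate mentioned above, in the old TF1-style distributed API (the host names, the job assignment, and the model body are placeholders; this is illustrative, not a complete training script):

import tensorflow as tf   # TF1-style API, shown only to illustrate the boilerplate

job_name, task_index = "worker", 0     # every process must be told its role

# The cluster has to be described explicitly: which hosts act as parameter
# servers and which as workers (addresses below are placeholders).
cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()                      # parameter servers just host the variables
else:
    # Variables get pinned to the parameter servers, compute ops to this worker.
    with tf.device(tf.train.replica_device_setter(cluster=cluster)):
        ...                            # model, loss, optimizer and session go here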
14. Scaling on tf.distributed(). Source: Uber Engineering Blog
Note: the ideal throughput is computed by multiplying the single-GPU rate by the number of GPUs.
16. How Does Horovod Do It?
• Horovod uses a different algorithm, called ring-allreduce, which it adopted from Baidu.
• The algorithm works in a totally different way from the centralised approach.
• It is rather a decentralised approach.
• This approach works faster and uses less bandwidth than parameter sharing.
• But how does it work? Let's see.
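As a rough illustration of the idea (a single-process NumPy simulation of ring-allreduce, not Horovod's actual NCCL/MPI implementation): the gradient on each worker is split into as many chunks as there are workers, the chunks are passed around a ring and summed, and then the reduced chunks travel around the ring once more so every worker ends up with the full result.

import numpy as np

def ring_allreduce(tensors):
    """Simulate ring-allreduce: every worker ends up with the sum of all tensors."""
    n = len(tensors)                                   # number of workers in the ring
    chunks = [np.array_split(t.astype(float), n) for t in tensors]

    # Phase 1: scatter-reduce. After n-1 steps, worker i holds the fully
    # summed chunk (i + 1) % n.
    for step in range(n - 1):
        for i in range(n):
            send = (i - step) % n                      # chunk worker i passes on
            chunks[(i + 1) % n][send] = chunks[(i + 1) % n][send] + chunks[i][send]

    # Phase 2: allgather. The reduced chunks travel once more around the ring.
    for step in range(n - 1):
        for i in range(n):
            send = (i + 1 - step) % n
            chunks[(i + 1) % n][send] = chunks[i][send]

    return [np.concatenate(c) for c in chunks]

workers = [np.arange(8) * (rank + 1) for rank in range(4)]   # toy "gradients"
for reduced in ring_allreduce(workers):
    assert np.allclose(reduced, sum(workers))          # everyone has the same sum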
18. Why Does It Work?
• Removing the need for a parameter server leads to lower communication overhead.
• The algorithm is bandwidth-optimal, meaning that if the buffer is large enough,
it will optimally utilize the available network.
• The allreduce approach is much easier to understand and adopt.
• All the user needs to do is modify their program to average gradients using
an allreduce() operation.
19. Implementing Horovod in your code
• Implementation is pretty simple, since Horovod is packaged as a Python package.
• Step 1
hvd.init() initializes Horovod.
• Step 2
config.gpu_options.visible_device_list = str(hvd.local_rank()) assigns a GPU to each of the TensorFlow processes.
20. • Step 3
opt = hvd.DistributedOptimizer(opt) wraps any regular TensorFlow optimizer with the Horovod optimizer, which takes care of averaging gradients using ring-allreduce.
• Step 4
hvd.BroadcastGlobalVariablesHook(0) broadcasts variables from the first process to all other processes to ensure consistent initialization. If the program does not use tf.train.MonitoredTrainingSession, the hvd.broadcast_global_variables op can be run explicitly instead.
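Putting the four steps together, here is a rough TF1-style sketch along the lines of the Horovod documentation (the tiny regression model, the data, and the hyperparameters are placeholders invented for illustration):

import numpy as np
import tensorflow as tf               # TF1-style API, matching the snippets above
import horovod.tensorflow as hvd

hvd.init()                            # Step 1: initialize Horovod

config = tf.ConfigProto()             # Step 2: pin one GPU per process
config.gpu_options.visible_device_list = str(hvd.local_rank())

# A toy regression model, made up purely for illustration.
x = tf.placeholder(tf.float32, [None, 10])
y = tf.placeholder(tf.float32, [None, 1])
loss = tf.reduce_mean(tf.square(tf.layers.dense(x, 1) - y))

# Step 3: wrap the optimizer so gradients are averaged with ring-allreduce.
opt = hvd.DistributedOptimizer(tf.train.GradientDescentOptimizer(0.01))
train_op = opt.minimize(loss)

# Step 4: broadcast initial variables from rank 0 for consistent initialization.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

with tf.train.MonitoredTrainingSession(config=config, hooks=hooks) as sess:
    for _ in range(100):
        xs = np.random.rand(32, 10).astype(np.float32)
        ys = np.random.rand(32, 1).astype(np.float32)
        sess.run(train_op, feed_dict={x: xs, y: ys})

Launched with something like horovodrun -np 4 python train.py (or an equivalent mpirun command), one copy of this script runs per GPU and Horovod keeps their gradients in sync.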