Umang Sharma
Deep Learning Data Scientist
Author
@umang_sha
Distributed Deep Learning
Trainings over clusters:
Parallelizing Everything
Talk Agenda
• Hardware Requirements for Deep Learning
• CPUs vs GPUs for Deep Learning
• The Challenges in using GPUs
• Scaling it up: Using a Cluster of GPUs
• How Does TensorFlow Do It?
• Parameter Sharing
• The Solution: Introducing Horovod
• How Does It Work?
• Questions
Hardware Requirements for Deep Learning
• Neural networks primarily require intense matrix multiplications.
• On top of that, deep learning requires a huge amount of training data.
• Huge data means more memory is required for computation.
• With each layer of a neural network, the number of parameters grows many fold (a toy example follows below).
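
A toy illustration of both points (the layer sizes are made up): each dense layer is one big matrix multiplication, and the parameter count grows with the product of the layer widths.

import numpy as np

batch, d_in, d_hidden, d_out = 32, 784, 4096, 10
x = np.random.rand(batch, d_in).astype(np.float32)

W1, b1 = np.random.rand(d_in, d_hidden).astype(np.float32), np.zeros(d_hidden, dtype=np.float32)
W2, b2 = np.random.rand(d_hidden, d_out).astype(np.float32), np.zeros(d_out, dtype=np.float32)

h = np.maximum(x @ W1 + b1, 0)   # one matrix multiplication per layer (plus ReLU)
y = h @ W2 + b2

params = W1.size + b1.size + W2.size + b2.size
print(f"{params:,} parameters in just two dense layers")   # about 3.3 million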
CPUs vs GPUs for Deep Learning
• Deep learning tasks need to read large amounts of data, so memory bandwidth becomes an important factor
• Wait, what’s memory bandwidth, though?
• Memory bandwidth is the rate at which data can be read from or stored into a semiconductor memory by a processor.
• A standalone GPU, on the other hand, comes with dedicated VRAM, so the CPU’s memory stays free for other tasks
• GPUs do much better than CPUs here, let’s see!
Memory Bandwidth Comparison: CPUs vs GPUs over Time
Another Advantage of GPUs over CPUs
• GPUs have many more cores than CPUs, so they can perform these memory-intensive calculations faster and in a more optimised way.
• GPUs are able to parallelise these operations because of the large number of small cores they contain
• To paint a picture, imagine the CPU is a Ferrari and the GPU is a huge truck: the Ferrari, though fast, can only transport two people at a time, while the truck can move a large number of people at once, which is what matters here.
CPUs vs GPUs Source: NVIDIA YouTube
The Challenge in using GPUs
• Utilising GPUs is a tricky process, because traditionally one needs to write low-level code to access them.
• CUDA is NVIDIA’s platform for programming its GPUs; deep-learning libraries such as cuDNN are built on top of it
• It is not just your code specifically that matters, it is actually the entire code path between your concept and the CUDA cores that execute it on the GPU
• But worry not! Things have improved for good.
• Deep learning frameworks such as TensorFlow and PyTorch take care of this: they take your Python code, build a computation graph, and translate it into CUDA code running on the GPU (see the sketch below).
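
As a small illustration of that last point (a sketch assuming TensorFlow 2.x in eager mode; the sizes are arbitrary), the user only writes a matmul in Python and the framework decides how it reaches the CUDA cores:

import tensorflow as tf

tf.config.set_soft_device_placement(True)   # fall back to CPU if no GPU is visible
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))

a = tf.random.normal((1024, 1024))
b = tf.random.normal((1024, 1024))

with tf.device("/GPU:0"):   # explicit placement; TF would also pick the GPU by default
    c = tf.matmul(a, b)     # executed as CUDA kernels when a GPU is present

print("Placed on:", c.device)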
Scaling it up: Using a Cluster of GPUs
• So far, what we discussed applies to simple models.
• As we create more and more complex models, a single GPU is no longer enough; one needs multiple GPUs, namely a cluster of GPUs.
• Unfortunately, parallelising tasks across GPUs isn’t as simple as on CPUs.
• Fortunately, TensorFlow provides a way to distribute training across GPUs: the tf.distribute API (a minimal sketch follows below)
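
The slides call it tf.distributed( ); the module in current TensorFlow is tf.distribute. A minimal sketch, assuming TensorFlow 2.x with the Keras API and made-up layer sizes, of single-machine multi-GPU training with MirroredStrategy (one of several strategies in that module; the parameter-server style discussed on the next slides is a different one):

import numpy as np
import tensorflow as tf

# MirroredStrategy replicates the model on every visible GPU on this machine
# and keeps the replicas in sync with an all-reduce on each step.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Variables (and hence the model) must be created inside the strategy's scope.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Keras splits each batch across the replicas automatically.
x = np.random.rand(1024, 20).astype("float32")
y = np.random.rand(1024, 1).astype("float32")
model.fit(x, y, batch_size=256, epochs=2)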
How Does TensorFlow Do It?
• There are two types of parallelism possible in deep learning training: data parallelism and model parallelism
• The most widely used is data parallelism, which suits deep learning with huge amounts of data
• The data is divided across multiple GPUs, each GPU runs its own copy of the model training, and the training parameters are shared.
• This is called the centralised (parameter server) approach; a toy sketch follows below
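
A toy, single-process sketch of the centralised idea (all names and numbers are illustrative; real parameter servers and workers are separate processes exchanging gradients over the network): the "server" holds the parameters, each "worker" computes a gradient on its own data shard, and the server applies the averaged gradient.

import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(5)                                         # parameters on the server
shards = [rng.normal(size=(100, 5)) for _ in range(4)]  # one data shard per worker
targets = [X @ np.arange(5) for X in shards]            # true weights are [0, 1, 2, 3, 4]

def worker_gradient(w, X, y):
    # Each worker computes the gradient of a least-squares loss on its shard.
    return 2 * X.T @ (X @ w - y) / len(X)

for step in range(200):
    # 1. Server broadcasts the current parameters to every worker.
    # 2. Workers compute gradients on their own shards (in parallel, in reality).
    grads = [worker_gradient(w, X, y) for X, y in zip(shards, targets)]
    # 3. Server averages the gradients and updates the shared parameters.
    w -= 0.05 * np.mean(grads, axis=0)

print(np.round(w, 2))   # approaches [0. 1. 2. 3. 4.]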
Parameter Sharing
Centralised Approach
Challenges in this approach
• But this too comes with its own challenges 😕
Challenges in Centralised Approach
• It becomes a challenge to decide the right ratio of parameter servers to workers.
• If multiple parameter servers are used, the communication pattern becomes “all-to-all”, which may saturate network interconnects.
• Another key challenge: as we try to implement this, the TensorFlow code becomes more and more complex.
• The data scientist needs to write additional parameter-server and worker-level code.
• Let’s look at the performance now
Scaling with distributed TensorFlow, Source: Uber Engineering Blog
Note: Ideal is computed by multiplying the single-GPU rate by the number of GPUs
The Solution?
• Presenting Uber’s Horovod
How Does Horovod Do It?
• Horovod uses a different algorithm, ring-allreduce, which comes from Baidu
• The algorithm works in a totally different way from the centralised approach
• It is rather a de-centralised approach
• The approach is faster and uses less bandwidth than parameter sharing
• But how does it work? Let’s see.
De-Centralised Approach
Ring all-reduce algorithm, Source: Baidu
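
A toy, single-process simulation of the ring-allreduce idea (using the usual reduce-scatter followed by allgather formulation; the function and variable names are mine, not Horovod's or Baidu's). Each rank splits its tensor into N chunks and only ever talks to its ring neighbours, which is why the bandwidth needed per rank stays constant as N grows.

import numpy as np

def ring_allreduce_sim(tensors):
    """Simulate ring-allreduce across len(tensors) ranks; every rank ends
    up with the element-wise sum of all tensors."""
    n = len(tensors)
    chunks = [np.array_split(t.astype(float), n) for t in tensors]

    # Phase 1: reduce-scatter. After n-1 steps, rank r owns the fully
    # summed chunk (r + 1) % n.
    for step in range(n - 1):
        sends = [chunks[r][(r - step) % n].copy() for r in range(n)]
        for r in range(n):
            chunks[r][(r - step - 1) % n] += sends[(r - 1) % n]

    # Phase 2: allgather. The fully summed chunks travel around the ring
    # until every rank holds every reduced chunk.
    for step in range(n - 1):
        sends = [chunks[r][(r - step + 1) % n].copy() for r in range(n)]
        for r in range(n):
            chunks[r][(r - step) % n] = sends[(r - 1) % n]

    return [np.concatenate(c) for c in chunks]

# Example: 4 "workers", each holding its own gradient vector.
grads = [np.arange(8) * (rank + 1) for rank in range(4)]
for rank, result in enumerate(ring_allreduce_sim(grads)):
    print(rank, result)   # every rank prints the same summed gradient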
Why it works?
• Having no parameter server leads to lower communication overhead
• The algorithm is bandwidth-optimal, meaning that if the buffer is large enough,
it will optimally utilize the available network.
• The allreduce approach is much easier to understand and adopt.
• All the user needs to do is modify their program to average gradients using
an allreduce() operation.
Implementing Horovod in your code
• Implementation is pretty simple, since Horovod ships as a Python package.
• Step 1
hvd.init() initializes Horovod.
• Step 2
config.gpu_options.visible_device_list = str(hvd.local_rank()) assigns a GPU to each of the TensorFlow processes.
• Step 3
opt = hvd.DistributedOptimizer(opt) wraps any regular TensorFlow optimizer with the Horovod optimizer, which takes care of averaging gradients using ring-allreduce.
• Step 4
hvd.BroadcastGlobalVariablesHook(0) broadcasts variables from the first process to all other processes to ensure consistent initialization. If the program does not use MonitoredTrainingSession, the hvd.broadcast_global_variables op can be run instead once the variables are initialized.
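
Putting the four steps together, a minimal sketch assuming the TensorFlow 1.x graph-mode API and horovod.tensorflow; the toy model, batch size, and step count are stand-ins for a real training script.

import numpy as np
import tensorflow as tf            # assumes TF 1.x
import horovod.tensorflow as hvd

# Step 1: initialize Horovod.
hvd.init()

# Step 2: pin one GPU per process.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# A toy regression model, just to have something to optimize.
x = tf.placeholder(tf.float32, [None, 10])
y = tf.placeholder(tf.float32, [None, 1])
loss = tf.reduce_mean(tf.square(tf.layers.dense(x, 1) - y))

# Step 3: wrap the optimizer so gradients are averaged with ring-allreduce.
# Scaling the learning rate by the number of workers is a common practice.
opt = hvd.DistributedOptimizer(tf.train.GradientDescentOptimizer(0.01 * hvd.size()))
global_step = tf.train.get_or_create_global_step()
train_op = opt.minimize(loss, global_step=global_step)

# Step 4: broadcast initial variables from rank 0 to all other processes.
hooks = [hvd.BroadcastGlobalVariablesHook(0),
         tf.train.StopAtStepHook(last_step=100)]

with tf.train.MonitoredTrainingSession(config=config, hooks=hooks) as sess:
    while not sess.should_stop():
        batch_x = np.random.rand(32, 10).astype(np.float32)
        batch_y = np.random.rand(32, 1).astype(np.float32)
        sess.run(train_op, feed_dict={x: batch_x, y: batch_y})

The script itself is unchanged from single-GPU training apart from these four touches; the parallelism comes from launching one copy per GPU, e.g. with horovodrun -np 4 python train.py (or mpirun in older setups; the file name here is illustrative).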
But hey! Does all this work?
Questions?
My Contact information
Feel free to contact me for any questions
• Twitter: @umang_sha
• LinkedIn: umangsharma-datascience
