2. Introduction & Motivation
• Motivation
• State-of-the-art DL models keep growing, as do the datasets on which they are trained
• GPT-3 (state-of-the-art NLP model)
• model size: 175 billion parameters
• trained on 500 billion tokens
• Training time therefore keeps growing …
• A single training run is usually not sufficient
• Hyperparameter tuning is usually required
• Adapting learning rate, momentum, …
• One might want to experiment with the network architecture
• Neural architecture search (NAS)
3. Definition
• In distributed training, the workload of training a model is split up
and shared among multiple processors (workers / nodes)
• Can be a "cluster" of a few workers or up to several hundred
• Usually, each worker is equipped with 2-8 GPUs
• Optimal case is linear scaling
• Training time is inversely proportional to the number of workers
• Usually not achieved, due to adverse effects in larger clusters
• Serial parts (which cannot be parallelized) become more prominent,
as quantified by Amdahl's law (see the sketch after this list)
• Communication cost (between workers) may rise disproportionately
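To make the limits of linear scaling concrete, here is a minimal sketch of Amdahl's law; the 5% serial fraction is an illustrative assumption, not a measured value:

```python
# Amdahl's law: the speedup is bounded by the fraction of the
# training pipeline that cannot be parallelized ("serial").
def amdahl_speedup(n_workers: int, serial: float) -> float:
    """Ideal speedup with n_workers, given a serial fraction in [0, 1]."""
    return 1.0 / (serial + (1.0 - serial) / n_workers)

# Even with only 5% serial work, 512 workers yield ~19x, not 512x:
print(amdahl_speedup(512, 0.05))  # ~19.3
```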
• Distributed training allows training "from scratch" on a huge dataset in minutes
• E.g., an image classification model can be trained in 1.5 minutes
on the ImageNet dataset, employing 512 GPUs
4. Data parallelism versus Model parallelism
• Data parallelism
• Training data is split into chunks
• Each worker processes a chunk
and updates the model (see the sketch after this list)
• Advantages
• Can be applied to any model
• Disadvantages
• Each worker must have enough (GPU)
memory to hold the whole model
• The updated model must be communicated
regularly to all workers
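A minimal data-parallel sketch using PyTorch's DistributedDataParallel; the toy model and the launch via torchrun (one process per GPU) are illustrative assumptions:

```python
# Data parallelism with PyTorch DDP. Assumes launch via
# `torchrun --nproc_per_node=<gpus> train.py`, which sets the env vars
# (RANK, LOCAL_RANK, WORLD_SIZE) read by init_process_group.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")          # one process per GPU
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1000, 10).cuda(local_rank)  # toy model (assumption)
ddp_model = DDP(model, device_ids=[local_rank])     # full replica per worker
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

# Each worker trains on its own chunk of the data; DDP averages the
# gradients across all workers via allreduce during backward().
for _ in range(100):
    x = torch.randn(32, 1000, device=f"cuda:{local_rank}")
    y = torch.randint(0, 10, (32,), device=f"cuda:{local_rank}")
    loss = torch.nn.functional.cross_entropy(ddp_model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```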
5. Data parallelism versus Model parallelism
• Model parallelism
• Model is split into several parts
• Each worker processes
its respective model part (see the sketch after this list)
• Advantages
• Supports large models which
do not fit into GPU memory (e.g. NLP models)
• Disadvantages
• One has to find an efficient split of the model,
which depends on the model structure and the number of workers
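A minimal model-parallel sketch: a two-layer toy model split across two GPUs. The split point and layer sizes are illustrative assumptions; as noted above, a good split depends on the actual model structure:

```python
# Model parallelism: each part of the model lives on a different GPU.
# Assumes at least two GPUs are visible.
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Linear(1000, 4096).to("cuda:0")  # worker / GPU 0
        self.part2 = nn.Linear(4096, 10).to("cuda:1")    # worker / GPU 1

    def forward(self, x):
        h = torch.relu(self.part1(x.to("cuda:0")))
        return self.part2(h.to("cuda:1"))  # activations cross devices

model = TwoGPUModel()
out = model(torch.randn(32, 1000))  # output tensor lives on cuda:1
```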
6. System architecture
• The system architecture describes how the model parameter
updates of the different workers are exchanged and applied
• Centralized system architecture
• Workers periodically report their model
updates to one (or more) parameter servers
• Decentralized system architecture
• Workers exchange the model updates
directly via an allreduce operation (see the sketch after this list)
• The topology of the allreduce operation is critical
• Fully connected => communication cost O(n^2)!
• Usually high-performance topologies like
ring, tree, butterfly etc. are used
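A minimal sketch of the decentralized pattern with torch.distributed; the launch via torchrun is an assumption. NCCL implements the allreduce with efficient ring/tree topologies internally, avoiding the fully connected O(n^2) exchange:

```python
# Decentralized gradient averaging via allreduce.
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")          # torchrun-style launch
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

grad = torch.randn(10, device="cuda")            # stand-in local gradient
dist.all_reduce(grad, op=dist.ReduceOp.SUM)      # sum across all workers
grad /= dist.get_world_size()                    # SUM + divide = average
```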
7. Synchronization strategies
• Different strategies to synchronize the model parameters between all workers
• Synchronous
• Model parameters are synchronized after each iteration (mini-batch)
• Prone to the straggler problem (the slowest worker delays all others)
• Bounded asynchronous
• Workers may train on model parameters which are "a few iterations" old
• Asynchronous (e.g. the Hogwild algorithm, sketched after this list)
• Workers update the model completely independently of the others
• Difficult to reason about model convergence
• Lost-update problem: new parameters written by
worker A can be overwritten by worker B
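A minimal Hogwild-style sketch in PyTorch (the toy model is an assumption): several processes update a model held in shared memory without any locking, which is exactly why updates from one process can overwrite another's:

```python
# Asynchronous, lock-free (Hogwild-style) training on shared parameters.
import torch
import torch.multiprocessing as mp

def train(model):
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    for _ in range(100):
        x = torch.randn(32, 1000)
        y = torch.randint(0, 10, (32,))
        loss = torch.nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()  # writes shared parameters without locking

if __name__ == "__main__":
    model = torch.nn.Linear(1000, 10)
    model.share_memory()  # parameters live in shared memory
    workers = [mp.Process(target=train, args=(model,)) for _ in range(4)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()
```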
8. Distributed training frameworks
• Main DL frameworks (PyTorch, TensorFlow, MXNet)
• Mainly provide support for a single node (but with multiple GPUs)
• Horovod (Uber), usage sketched after this list
• PyTorch, TensorFlow, Keras, MXNet
• Data parallelism and limited model parallelism
• FairScale (Facebook)
• PyTorch
• Data parallelism and limited model/pipeline parallelism
• DeepSpeed (Microsoft)
• PyTorch
• Data parallelism and model/pipeline parallelism
• Gradient compression (1-bit Adam / 1-bit LAMB), …
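A minimal Horovod sketch on top of PyTorch; the toy model and the launch via e.g. `horovodrun -np 4 python train.py` are illustrative assumptions:

```python
# Data parallelism with Horovod: one process per GPU.
import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())

model = torch.nn.Linear(1000, 10).cuda()  # toy model (assumption)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Wrap the optimizer so gradients are averaged via allreduce, and make
# sure all workers start from identical parameters and optimizer state.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
```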
GPT-3 info: https://lambdalabs.com/blog/demystifying-gpt-3/ and https://scilogs.spektrum.de/hlf/an-ai-walks-into-a-bar-and-it-writes-an-awesome-story/
AlexNet in 1.5 minutes: see https://arxiv.org/pdf/1902.06855.pdf
Info and figures from https://arxiv.org/pdf/1903.11314.pdf