I discuss optimization algorithms for data pipelining and scalable (multi-node) deep learning training on GPUs. As an application, I present a clustering analysis of single-cell expression data using a variational autoencoder. This model learns a latent representation of gene expression levels in single cells and captures most of the variability across different cell populations.
Talk presented at DeiC Conference 2019, Fredericia, Denmark.
1. Scalable Deep Learning on Distributed GPUs
Emiliano Molinaro, PhD
eScience Center & Institut for Matematik og Datalogi
Syddansk Universitet, Odense, Denmark
2. Outline
• Introduction
• Deep Neural Networks:
- Model training
- Scalability on multiple GPUs
• Distributed architectures:
- Async Parameter Server
- Sync AllReduce
• Data pipeline
• Application: Variational Autoencoder for clustering of single cell gene expression levels
• Summary
3. Introduction
Deep Neural Networks (DNNs) and Deep Learning (DL) are increasingly integral to public and private research and to industrial applications
MAIN REASONS:
- rise in computational power
- advances in data science
- high availability of IoT and big data
APPLICATIONS:
- speech recognition
- image classification
- computer vision
- anomaly detection
- recommender systems
- …
The increase in data volume and model complexity requires computing power and memory on the scale of High Performance Computing (HPC) resources
Efficient parallel and distributed algorithms/frameworks for scaling DL are crucial to speed up the training process and to handle big data processing and analysis
14. Data parallelism
[Diagram: training data → input pipeline → workers 1-3, each holding a copy of the DNN; parameters sync: average gradients]
DNN copied on each worker and trained with a subset of the input data
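As a minimal illustration of this pattern, the following TensorFlow sketch (not from the talk; the layer sizes and random input data are placeholder assumptions) uses tf.distribute.MirroredStrategy, which copies the model to each local GPU, splits every batch among the replicas, and averages the gradients before the update:

```python
# Sketch of single-node data parallelism with TensorFlow 2.x.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # one replica per visible GPU
print("replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():  # variables created here are mirrored on all replicas
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# placeholder data: each global batch of 256 is split among the replicas,
# and the per-replica gradients are averaged before the weight update
x = tf.random.normal((1024, 784))
y = tf.random.uniform((1024,), maxval=10, dtype=tf.int32)
model.fit(x, y, batch_size=256, epochs=1)
```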
15. Model parallelism
DNN is divided across the different workers
[Diagram: training data fed to a single DNN whose layers are divided across workers 1-3]
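A hedged sketch of the same idea in TensorFlow (the device names, layer sizes, and two-GPU split are illustrative assumptions): each block of layers is pinned to a different GPU with tf.device, so a network too large for a single accelerator can still be trained, at the cost of transferring activations between devices.

```python
# Sketch of model parallelism: a toy MLP split across two GPUs.
# Requires two visible GPUs; device names are assumptions.
import tensorflow as tf

class TwoDeviceMLP(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.block1 = tf.keras.layers.Dense(512, activation="relu")
        self.block2 = tf.keras.layers.Dense(10)

    def call(self, x):
        with tf.device("/GPU:0"):    # first block (its variables are created
            h = self.block1(x)       # on the first call) lives on GPU 0
        with tf.device("/GPU:1"):    # second block lives on GPU 1; the
            return self.block2(h)    # activations h are copied across

model = TwoDeviceMLP()
logits = model(tf.random.normal((32, 784)))  # forward pass spans both GPUs
```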
18. Data parallelism architectures
Async Parameter Server
[Diagram: parameter servers PS1 and PS2 serving workers 1-3]
- parameter servers store the variables
- all workers are independent
- workers do the bulk of computation
- architecture easy to scale
- downside: workers can get out of sync and delay convergence
Sync AllReduce
[Diagram: workers 1-4 exchanging gradients directly with each other]
- approach more common on systems with fast accelerators
- no parameter servers: each worker has its own copy of the model parameters
- all workers are synchronized
- workers communicate among themselves to propagate the gradients and update the model parameters
- AllReduce algorithms used to combine the gradients, depending on the type of communication: Ring AllReduce, NVIDIA’s NCCL, … (see the sketch below)
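For orientation, a sketch of how the two architectures map onto tf.distribute (TF 2.x experimental API as of the talk; the exact class locations have moved between releases, and the multi-machine cluster setup via TF_CONFIG is omitted):

```python
# Sketch of the two data-parallel architectures in tf.distribute.
import tensorflow as tf

# Async Parameter Server: variables live on dedicated parameter-server
# tasks and workers run independently (requires a TF_CONFIG cluster spec):
# strategy = tf.distribute.experimental.ParameterServerStrategy()

# Sync AllReduce: every worker holds a full copy of the parameters and
# gradients are combined with a collective AllReduce.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

# each task discovers the rest of the cluster from its TF_CONFIG
# environment variable; without one, this runs as a single worker
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(784,))])
    model.compile(optimizer="adam", loss="mse")
```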
20. Pipelining
The input pipeline preprocesses the data and makes it available for training on the GPUs
However, GPUs consume data and perform calculations much faster than a CPU can feed them
Avoid the input data bottleneck …
… by building an efficient ETL data processing pipeline:
1. Extract phase: read data from persistent storage
2. Transform phase: apply different transformations to the input data (shuffle, repeat, map, batch, …)
3. Load phase: provide the processed data to the accelerator for training
parallelize the ETL phases (see the sketch below):
- read multiple files in parallel
- distribute data transformation operations over multiple CPU cores
- prefetch data for the next step during backward propagation
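A minimal tf.data sketch of this parallelized ETL pattern (the file pattern and parse_fn are hypothetical placeholders, not the talk's pipeline):

```python
# Sketch of a parallelized ETL input pipeline with tf.data (TF 2.x).
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

def parse_fn(record):
    # hypothetical transform; a real pipeline would decode the record here
    return tf.io.parse_tensor(record, out_type=tf.float32)

files = tf.data.Dataset.list_files("data/train-*.tfrecord")   # Extract
dataset = files.interleave(              # read several files in parallel
    tf.data.TFRecordDataset,
    cycle_length=4,
    num_parallel_calls=AUTOTUNE)
dataset = (dataset
           .shuffle(buffer_size=10_000)                 # Transform
           .map(parse_fn, num_parallel_calls=AUTOTUNE)  # many CPU cores
           .batch(256)
           .prefetch(AUTOTUNE))   # Load: prepare the next batch while the
                                  # accelerator is busy with the current step
```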
21. Single cell RNA sequencing data
[Figure: clustering of cell populations in the VAE latent space; legend: CD14+ Monocytes, FCGR3A+ Monocytes, Double negative T cells, Mature B cell, Immature B cell, CD8 Effector, CD8 Naive, NK cells, Plasma cell, Megakaryocytes, Dendritic cells, pDC]
Variational Autoencoder (VAE)
research done in collaboration with Institut for Biokemi og Molekylær Biologi, SDU
~10k peripheral blood mononuclear cells
Dataset from: https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_10k_v3
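For concreteness, a minimal Keras sketch of a VAE of this kind (the 2000-gene input, layer sizes, 2-dimensional latent space, and Gaussian reconstruction loss are illustrative assumptions, not the model presented in the talk):

```python
# Minimal VAE sketch for expression-like data (TF 2.x).
import tensorflow as tf

class VAE(tf.keras.Model):
    def __init__(self, n_genes=2000, latent_dim=2):
        super().__init__()
        self.encoder = tf.keras.Sequential([
            tf.keras.layers.Dense(256, activation="relu", input_shape=(n_genes,)),
            tf.keras.layers.Dense(2 * latent_dim),  # latent mean and log-variance
        ])
        self.decoder = tf.keras.Sequential([
            tf.keras.layers.Dense(256, activation="relu", input_shape=(latent_dim,)),
            tf.keras.layers.Dense(n_genes),
        ])

    def call(self, x):
        z_mean, z_log_var = tf.split(self.encoder(x), 2, axis=-1)
        eps = tf.random.normal(tf.shape(z_mean))    # reparameterization trick
        z = z_mean + tf.exp(0.5 * z_log_var) * eps
        # KL divergence between q(z|x) and the standard normal prior
        kl = -0.5 * tf.reduce_mean(tf.reduce_sum(
            1.0 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=-1))
        self.add_loss(kl)
        return self.decoder(z)

vae = VAE()
vae.compile(optimizer="adam", loss="mse")  # reconstruction term; KL via add_loss
```

After training (vae.fit(X, X, ...)), the cells can be clustered on their latent means, where similar expression profiles end up close together.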
22. Scaling of VAE training
simulation done on GPU nodes of the Puhti supercomputer, CSC, Finland
Node specs: 2 x 20 cores (Xeon Gold 6230) @ 2.1 GHz, 384 GiB RAM; 4 GPUs connected with NVLink
GPU specs: NVIDIA V100, 4 x 32 GB
Collective ops:
- TensorFlow ring-based gRPC
- NVIDIA’s NCCL
scaling performance depends on the device topology and the AllReduce algorithm
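The collective implementation being compared can be chosen explicitly when the strategy is built; a hedged sketch with the TF 2.0-era experimental API (newer releases have renamed these options):

```python
# Sketch of selecting the AllReduce implementation benchmarked above.
import tensorflow as tf

# ring-based AllReduce with gRPC communication
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    communication=tf.distribute.experimental.CollectiveCommunication.RING)

# NVIDIA's NCCL for GPU-to-GPU collectives (e.g. over NVLink):
# strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
#     communication=tf.distribute.experimental.CollectiveCommunication.NCCL)
```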
23. Summary
❖ Scaling DNNs on multiple GPUs is crucial to speed up model training and make DL suitable for big data processing and analysis
❖ The choice of distributed architecture and parallel programming model depends on the model complexity and the device topology
❖ Construction of efficient input data pipelines improves training performance
❖ Further applications:
- development of algorithms for parameter optimization (hyper-parameter search, architecture search)
- integration of distributed training with big data frameworks (Spark, Hadoop, etc.)
- integration with cloud computing services (SDUCloud)