I discuss optimization algorithms for data pipelining and scalable (multi-node) deep learning training on GPUs. As an application, I present a clustering analysis of single-cell expression data using a variational autoencoder. This model learns a latent representation of gene expression levels in single cells and captures most of the variability across different cell populations.
Talk presented at DeiC Conference 2019, Fredericia, Denmark.
1. Scalable Deep Learning on Distributed GPUs
Emiliano Molinaro, PhD
eScience Center & Institut for Matematik og Datalogi
Syddansk Universitet, Odense, Denmark
2. Outline
• Introduction
• Deep Neural Networks:
- Model training
- Scalability on multiple GPUs
• Distributed architectures:
- Async Parameter Server
- Sync AllReduce
• Data pipeline
• Application: Variational Autoencoder for clustering of single cell gene expression levels
• Summary
3. Introduction
Deep Neural Networks (DNNs) and Deep Learning (DL) are increasingly integral to public and private research and to industrial applications
MAIN REASONS:
- rise in computational power
- advances in data science
- high availability of IoT and big data
APPLICATIONS:
- speech recognition
- image classification
- computer vision
- anomaly detection
- recommender systems
- …
The increase in data volume and model complexity requires computing power and memory on the scale of High Performance Computing (HPC) resources
Efficient parallel and distributed algorithms/frameworks for scaling DL are crucial to speed up the training process and to handle big data processing and analysis
14. Data parallelism
[Diagram: training data → input pipeline → workers 1-3, each holding a copy of the DNN; parameters sync: average gradients]
DNN copied on each worker and trained with a subset of the input data
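As a minimal illustration of this pattern, the following TensorFlow sketch (not from the talk; the layer sizes and random input data are placeholder assumptions) uses tf.distribute.MirroredStrategy, which copies the model to each local GPU, splits every batch among the replicas, and averages the gradients before the update:

```python
# Sketch of single-node data parallelism with TensorFlow 2.x.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # one replica per visible GPU
print("replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():  # variables created here are mirrored on all replicas
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# placeholder data: each global batch of 256 is split among the replicas,
# and the per-replica gradients are averaged before the weight update
x = tf.random.normal((1024, 784))
y = tf.random.uniform((1024,), maxval=10, dtype=tf.int32)
model.fit(x, y, batch_size=256, epochs=1)
```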
15. Model parallelism
DNN is divided across the different workers
[Diagram: training data fed to a single DNN whose layers are divided across workers 1-3]
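A hedged sketch of the same idea in TensorFlow (the device names, layer sizes, and two-GPU split are illustrative assumptions): each block of layers is pinned to a different GPU with tf.device, so a network too large for a single accelerator can still be trained, at the cost of transferring activations between devices.

```python
# Sketch of model parallelism: a toy MLP split across two GPUs.
# Requires two visible GPUs; device names are assumptions.
import tensorflow as tf

class TwoDeviceMLP(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.block1 = tf.keras.layers.Dense(512, activation="relu")
        self.block2 = tf.keras.layers.Dense(10)

    def call(self, x):
        with tf.device("/GPU:0"):    # first block (its variables are created
            h = self.block1(x)       # on the first call) lives on GPU 0
        with tf.device("/GPU:1"):    # second block lives on GPU 1; the
            return self.block2(h)    # activations h are copied across

model = TwoDeviceMLP()
logits = model(tf.random.normal((32, 784)))  # forward pass spans both GPUs
```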
18. Data parallelism architectures
Async Parameter Server
[Diagram: parameter servers PS1 and PS2 serving workers 1-3]
- parameter servers store the variables
- all workers are independent
- workers do the bulk of computation
- architecture easy to scale
- downside: workers can get out of sync and delay convergence
Sync AllReduce
[Diagram: workers 1-4 exchanging gradients directly with each other]
- approach more common on systems with fast accelerators
- no parameter servers: each worker has its own copy of the model parameters
- all workers are synchronized
- workers communicate among themselves to propagate the gradients and update the model parameters
- AllReduce algorithms used to combine the gradients, depending on the type of communication: Ring AllReduce, NVIDIA’s NCCL, … (see the sketch below)
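For orientation, a sketch of how the two architectures map onto tf.distribute (TF 2.x experimental API as of the talk; the exact class locations have moved between releases, and the multi-machine cluster setup via TF_CONFIG is omitted):

```python
# Sketch of the two data-parallel architectures in tf.distribute.
import tensorflow as tf

# Async Parameter Server: variables live on dedicated parameter-server
# tasks and workers run independently (requires a TF_CONFIG cluster spec):
# strategy = tf.distribute.experimental.ParameterServerStrategy()

# Sync AllReduce: every worker holds a full copy of the parameters and
# gradients are combined with a collective AllReduce.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

# each task discovers the rest of the cluster from its TF_CONFIG
# environment variable; without one, this runs as a single worker
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(784,))])
    model.compile(optimizer="adam", loss="mse")
```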
20. Pipelining
The input pipeline preprocesses the data and makes it available for training on the GPUs
However, GPUs consume data and perform calculations much faster than a CPU can feed them
Avoid the input data bottleneck …
… by building an efficient ETL data processing pipeline:
1. Extract phase: read data from persistent storage
2. Transform phase: apply different transformations to the input data (shuffle, repeat, map, batch, …)
3. Load phase: provide the processed data to the accelerator for training
parallelize the ETL phases (see the sketch below):
- read multiple files in parallel
- distribute data transformation operations over multiple CPU cores
- prefetch data for the next step during backward propagation
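A minimal tf.data sketch of this parallelized ETL pattern (the file pattern and parse_fn are hypothetical placeholders, not the talk's pipeline):

```python
# Sketch of a parallelized ETL input pipeline with tf.data (TF 2.x).
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

def parse_fn(record):
    # hypothetical transform; a real pipeline would decode the record here
    return tf.io.parse_tensor(record, out_type=tf.float32)

files = tf.data.Dataset.list_files("data/train-*.tfrecord")   # Extract
dataset = files.interleave(              # read several files in parallel
    tf.data.TFRecordDataset,
    cycle_length=4,
    num_parallel_calls=AUTOTUNE)
dataset = (dataset
           .shuffle(buffer_size=10_000)                 # Transform
           .map(parse_fn, num_parallel_calls=AUTOTUNE)  # many CPU cores
           .batch(256)
           .prefetch(AUTOTUNE))   # Load: prepare the next batch while the
                                  # accelerator is busy with the current step
```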
21. Single cell RNA sequencing data
[Figure: clustering of cell populations in the VAE latent space; legend: CD14+ Monocytes, FCGR3A+ Monocytes, Double negative T cells, Mature B cell, Immature B cell, CD8 Effector, CD8 Naive, NK cells, Plasma cell, Megakaryocytes, Dendritic cells, pDC]
Variational Autoencoder (VAE)
research done in collaboration with Institut for Biokemi og Molekylær Biologi, SDU
~10k peripheral blood mononuclear cells
Dataset from: https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_10k_v3
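For concreteness, a minimal Keras sketch of a VAE of this kind (the 2000-gene input, layer sizes, 2-dimensional latent space, and Gaussian reconstruction loss are illustrative assumptions, not the model presented in the talk):

```python
# Minimal VAE sketch for expression-like data (TF 2.x).
import tensorflow as tf

class VAE(tf.keras.Model):
    def __init__(self, n_genes=2000, latent_dim=2):
        super().__init__()
        self.encoder = tf.keras.Sequential([
            tf.keras.layers.Dense(256, activation="relu", input_shape=(n_genes,)),
            tf.keras.layers.Dense(2 * latent_dim),  # latent mean and log-variance
        ])
        self.decoder = tf.keras.Sequential([
            tf.keras.layers.Dense(256, activation="relu", input_shape=(latent_dim,)),
            tf.keras.layers.Dense(n_genes),
        ])

    def call(self, x):
        z_mean, z_log_var = tf.split(self.encoder(x), 2, axis=-1)
        eps = tf.random.normal(tf.shape(z_mean))    # reparameterization trick
        z = z_mean + tf.exp(0.5 * z_log_var) * eps
        # KL divergence between q(z|x) and the standard normal prior
        kl = -0.5 * tf.reduce_mean(tf.reduce_sum(
            1.0 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=-1))
        self.add_loss(kl)
        return self.decoder(z)

vae = VAE()
vae.compile(optimizer="adam", loss="mse")  # reconstruction term; KL via add_loss
```

After training (vae.fit(X, X, ...)), the cells can be clustered on their latent means, where similar expression profiles end up close together.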
22. Scaling of VAE training
simulation done on GPU nodes of the Puhti supercomputer, CSC, Finland
Node specs: 2 x 20 cores (Xeon Gold 6230) @ 2.1 GHz, 384 GiB RAM; 4 GPUs connected with NVLink
GPU specs: NVIDIA V100, 4 x 32 GB
Collective ops:
- TensorFlow ring-based gRPC
- NVIDIA’s NCCL
scaling performance depends on the device topology and the AllReduce algorithm
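The collective implementation being compared can be chosen explicitly when the strategy is built; a hedged sketch with the TF 2.0-era experimental API (newer releases have renamed these options):

```python
# Sketch of selecting the AllReduce implementation benchmarked above.
import tensorflow as tf

# ring-based AllReduce with gRPC communication
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    communication=tf.distribute.experimental.CollectiveCommunication.RING)

# NVIDIA's NCCL for GPU-to-GPU collectives (e.g. over NVLink):
# strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
#     communication=tf.distribute.experimental.CollectiveCommunication.NCCL)
```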
23. Summary
❖ Scaling DNNs on multiple GPUs is crucial to speed up model training and make DL suitable for big data processing and analysis
❖ The choice of distributed architecture and parallel programming model depends on the model complexity and the device topology
❖ Construction of efficient input data pipelines improves training performance
❖ Further applications:
- development of algorithms for parameter optimization (hyper-parameter search, architecture search)
- integration of distributed training with big data frameworks (Spark, Hadoop, etc.)
- integration with cloud computing services (SDUCloud)