Scalable Deep Learning
on Distributed GPUs
Emiliano Molinaro, PhD
eScience Center & Institut for Matematik og Datalogi
Syddansk Universitet, Odense, Denmark
Outline
• Introduction
• Deep Neural Networks:
  - Model training
  - Scalability on multiple GPUs
• Distributed architectures:
  - Async Parameter Server
  - Sync AllReduce
• Data pipeline
• Application: Variational Autoencoder for clustering of single-cell gene expression levels
• Summary
Introduction
Deep Neural Networks (DNNs) and Deep Learning (DL) are increasingly integral to public and private research and to industrial applications

MAIN REASONS:
- rise in computational power
- advances in data science
- high availability of IoT and big data

APPLICATIONS:
- speech recognition
- image classification
- computer vision
- anomaly detection
- recommender systems
- …

Growing data volumes and model complexity demand computing power and memory, i.e. High Performance Computing (HPC) resources

Efficient parallel and distributed algorithms/frameworks for scaling DL are crucial to speed up the training process and to handle big data processing and analysis
Deep Neural Network
[Figure: fully connected deep neural network, with an input layer, several hidden layers, and an output layer]
Training loop

[Figure: one iteration of the training loop, from the input pipeline through the network and back]

- the input pipeline feeds training data to the model
- forward propagation: compute the predictions and the loss function
- backward propagation: compute the gradients
- update the model parameters, then repeat with the next batch
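
In TensorFlow 2 this loop fits in a few lines; a minimal sketch, in which the model, loss function, and optimizer are illustrative placeholders rather than the model used later in this talk:

```python
import tensorflow as tf

# Illustrative model, loss and optimizer; any differentiable model works.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam()

@tf.function
def train_step(x, y):
    # Forward propagation: predictions and loss.
    with tf.GradientTape() as tape:
        predictions = model(x, training=True)
        loss = loss_fn(y, predictions)
    # Backward propagation: gradients of the loss w.r.t. the weights.
    gradients = tape.gradient(loss, model.trainable_variables)
    # Update the model parameters.
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss
```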
Training on a single device
[Figure: a single CPU/GPU pair: compute the gradients and the loss function, update the model parameters]
Scaling on multiple GPUs
7
CPU
GPU
GPU GPU
GPU
Scaling on multiple devices
Data parallelism

The DNN is copied onto each worker and trained with a subset of the input data (a minimal sketch follows)

[Figure: the input pipeline shards the training data across worker 1, worker 2, and worker 3; a parameter sync step averages the gradients across workers]
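
A hedged sketch of this pattern with TensorFlow's MirroredStrategy; the model is a placeholder and the training dataset is assumed, not shown:

```python
import tensorflow as tf

# Each local GPU gets a replica of the model; every batch is split across
# the replicas, and the gradients are averaged with an AllReduce before
# the shared variables are updated.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Variables created inside the scope are mirrored on every device.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# `train_dataset` is a placeholder tf.data.Dataset of (features, labels):
# model.fit(train_dataset, epochs=10)
```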
Model parallelism

The DNN is divided across the different workers

[Figure: the network split across worker 1, worker 2, and worker 3, all trained on the same training data]
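
A minimal sketch of the idea in TensorFlow, assuming two local GPUs; the layer sizes and the two-way split are illustrative:

```python
import tensorflow as tf

class SplitModel(tf.keras.Model):
    """Toy model-parallel network: one block per GPU."""

    def __init__(self):
        super().__init__()
        self.block1 = tf.keras.layers.Dense(1024, activation="relu")
        self.block2 = tf.keras.layers.Dense(10)

    def call(self, x):
        # Variables are created on first call, under these device scopes;
        # the activations of block1 are copied from GPU 0 to GPU 1.
        with tf.device("/gpu:0"):
            h = self.block1(x)
        with tf.device("/gpu:1"):
            return self.block2(h)
```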
Data parallelism architectures

Async Parameter Server

[Figure: two parameter servers (PS1, PS2) exchanging variables with worker 1, worker 2, and worker 3]

- parameter servers store the variables
- all workers are independent
- workers do the bulk of the computation
- the architecture is easy to scale
- downside: workers can get out of sync, which can delay convergence
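
A hedged sketch of how such a cluster might be wired up with TensorFlow's era-appropriate experimental ParameterServerStrategy; the host names, ports, and task assignment in TF_CONFIG are illustrative only:

```python
import json
import os

import tensorflow as tf

# Every process in the cluster declares the same cluster spec plus its own
# role: parameter servers ("ps") hold the variables, workers compute
# gradients on their data shards and push updates asynchronously.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "ps": ["ps1.example.com:2222", "ps2.example.com:2222"],
        "worker": ["worker1.example.com:2222",
                   "worker2.example.com:2222",
                   "worker3.example.com:2222"],
    },
    "task": {"type": "worker", "index": 0},  # this process is worker 1
})

strategy = tf.distribute.experimental.ParameterServerStrategy()
```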
Sync AllReduce

[Figure: worker 1, worker 2, worker 3, and worker 4 exchanging gradients directly with each other]

- approach more common on systems with fast accelerators
- no parameter servers: each worker has its own copy of the model parameters
- all workers are synchronized
- workers communicate among themselves to propagate the gradients and update the model parameters
- AllReduce algorithms combine the gradients; which one is used depends on the type of communication: Ring AllReduce, NVIDIA’s NCCL, …
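
On a single node, TensorFlow lets you pick the AllReduce implementation explicitly; a small sketch (the commented-out alternative is one of several built-in options):

```python
import tensorflow as tf

# NCCL-based AllReduce between the local GPUs ...
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.NcclAllReduce())

# ... or a hierarchical copy, which can win on other device topologies:
# strategy = tf.distribute.MirroredStrategy(
#     cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())
```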
Pipelining

The input pipeline preprocesses the data and makes it available for training on the GPUs

However, GPUs process data and perform calculations much faster than a CPU

To avoid starving the GPUs of input data, build an efficient ETL data pipeline:
1. Extract phase: read data from persistent storage
2. Transform phase: apply different transformations to the input data (shuffle, repeat, map, batch, …)
3. Load phase: provide the processed data to the accelerator for training

Parallelize the ETL phases (see the sketch below):
- read multiple files in parallel
- distribute the data transformation operations across multiple CPU cores
- prefetch the data for the next step during backward propagation
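
With tf.data each ETL phase has a parallel counterpart; a sketch in which the file pattern, feature schema, and batch size are placeholder assumptions:

```python
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

def parse_example(record):
    # Hypothetical schema: a flat feature vector and an integer label.
    spec = {
        "x": tf.io.FixedLenFeature([784], tf.float32),
        "y": tf.io.FixedLenFeature([], tf.int64),
    }
    parsed = tf.io.parse_single_example(record, spec)
    return parsed["x"], parsed["y"]

dataset = (
    tf.data.Dataset.list_files("data/train-*.tfrecord")  # Extract ...
    .interleave(tf.data.TFRecordDataset,                 # ... reading files in parallel
                num_parallel_calls=AUTOTUNE)
    .shuffle(10_000)                                     # Transform on multiple CPU cores
    .map(parse_example, num_parallel_calls=AUTOTUNE)
    .batch(256)
    .prefetch(AUTOTUNE)  # Load: prepare the next batch while the GPU trains
)
```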
Single cell RNA sequencing data

Variational Autoencoder (VAE) applied to ~10k peripheral blood mononuclear cells

[Figure: clustering of the cells into types such as CD14+ Monocytes, Double negative T cells, Mature B cells, CD8 Effector, CD8 Naive, NK cells, Plasma cells, FCGR3A+ Monocytes, Megakaryocytes, Immature B cells, Dendritic cells, and pDC]

Research done in collaboration with the Institut for Biokemi og Molekylær Biologi, SDU

Dataset from: https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_10k_v3
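
For orientation, a hedged sketch of the generic VAE objective such an application builds on (not the exact model or likelihood used in this study): a reconstruction term plus a KL penalty pulling the latent posterior toward a standard-normal prior.

```python
import tensorflow as tf

def reparameterize(z_mean, z_logvar):
    # Sample z = mean + sigma * eps with eps ~ N(0, I), so gradients
    # can flow through the sampling step.
    eps = tf.random.normal(tf.shape(z_mean))
    return z_mean + tf.exp(0.5 * z_logvar) * eps

def vae_loss(x, x_recon, z_mean, z_logvar):
    # Reconstruction term (squared error here; a real single-cell model
    # may use a likelihood suited to count data instead).
    recon = tf.reduce_sum(tf.square(x - x_recon), axis=-1)
    # Closed-form KL divergence between N(mean, sigma^2) and N(0, I).
    kl = -0.5 * tf.reduce_sum(
        1.0 + z_logvar - tf.square(z_mean) - tf.exp(z_logvar), axis=-1)
    return tf.reduce_mean(recon + kl)
```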
Scaling of VAE training

Simulation done on GPU nodes of the Puhti supercomputer, CSC, Finland

Node specs:
- CPU: 2 x 20 cores (Intel Xeon Gold 6230) @ 2.1 GHz, 384 GiB memory
- GPU: 4 x NVIDIA V100, 4 x 32 GB, connected with NVLink

Collective ops:
- TensorFlow ring-based gRPC
- NVIDIA’s NCCL

Scaling performance depends on the device topology and the AllReduce algorithm (see the sketch below)
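
The two collective ops compared here map onto a one-argument choice in TensorFlow's multi-worker strategy; a sketch using the experimental API of that era:

```python
import tensorflow as tf

# Ring-based collectives over gRPC ...
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    tf.distribute.experimental.CollectiveCommunication.RING)

# ... versus NVIDIA's NCCL for the GPU collectives:
# strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
#     tf.distribute.experimental.CollectiveCommunication.NCCL)
```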
Summary

❖ Scaling DNNs on multiple GPUs is crucial to speed up model training and to make DL suitable for big data processing and analysis

❖ The choice of distributed architecture/parallel programming model depends on the model complexity and the device topology

❖ Constructing efficient input data pipelines improves training performance

❖ Further applications:
- development of algorithms for parameter optimization (hyper-parameter search, architecture search)
- integration of distributed training with big data frameworks (Spark, Hadoop, etc.)
- integration with cloud computing services (SDUCloud)