4. Distributed training
Data parallel vs model parallel
Faster training or larger models?
Distributed TensorFlow training
https://www.youtube.com/watch?v=bRMGoPqsn20
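To make the contrast concrete, here is a minimal sketch of the two ideas (my own illustration, not from the talk) using TF1-style device placement; layer sizes and device names are arbitrary:
import tensorflow as tf

inputs = tf.placeholder(tf.float32, [None, 784])

# Model parallelism: one model is split across devices -> fits larger models.
with tf.device('/gpu:0'):
    hidden = tf.layers.dense(inputs, 4096, activation=tf.nn.relu)
with tf.device('/gpu:1'):
    logits = tf.layers.dense(hidden, 10)

# Data parallelism: the full model is replicated on every device, each replica
# sees a different slice of the batch, and gradients are averaged -> faster training.
towers = []
for i, shard in enumerate(tf.split(inputs, 2, axis=0)):
    with tf.device('/gpu:%d' % i):
        towers.append(tf.layers.dense(shard, 10, name='replica', reuse=(i > 0)))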
5. Distributed training
Asynchronous vs Synchronous
Fast (asynchronous, possibly stale gradients) or precise (synchronous, consistent updates)?
Keras - multi-GPU training is not automatic :(
https://keras.io/utils/#multi_gpu_model
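The utility the slide links to can be used roughly as follows (a hedged sketch based on the keras.io page above; the model architecture and GPU count are placeholders):
import keras
from keras.utils import multi_gpu_model

# Plain single-device model (placeholder architecture).
model = keras.models.Sequential([
    keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    keras.layers.Dense(10, activation='softmax'),
])

# Replicate the model on 4 GPUs: each replica gets a slice of every batch and
# the results are merged back on the CPU. Requires 4 visible GPUs at run time.
parallel_model = multi_gpu_model(model, gpus=4)
parallel_model.compile(loss='categorical_crossentropy', optimizer='adam')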
10. Horovod Stack
● Plugs into TensorFlow via the custom-op mechanism (sketched below)
● Uses MPI for worker discovery and for coordinating the reductions
● Uses NVIDIA NCCL for the actual reduction, both within a server and across servers
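To the user, the whole stack surfaces as an ordinary TensorFlow op. A minimal sketch (TF1-era horovod.tensorflow API; the tensor values are arbitrary):
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # MPI discovers the workers and assigns each one a rank

# hvd.allreduce is the custom op: it hands the tensor to the NCCL/MPI ring
# reduction and returns the average across all workers.
local_grad = tf.constant([1.0, 2.0, 3.0]) * (hvd.rank() + 1)
averaged_grad = hvd.allreduce(local_grad)

with tf.Session() as sess:
    print(sess.run(averaged_grad))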
11. Horovod Example - Keras
import keras
from keras import backend as K
import tensorflow as tf
import horovod.keras as hvd
# Initialize Horovod.
hvd.init()
# Pin GPU to be used to process local rank (one GPU per process)
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.visible_device_list = str(hvd.local_rank())
K.set_session(tf.Session(config=config))
# Build model…
model = ...
opt = keras.optimizers.Adadelta(1.0)  # the Horovod examples often scale the learning rate by hvd.size()
# Add Horovod Distributed Optimizer.
opt = hvd.DistributedOptimizer(opt)
model.compile(loss=keras.losses.categorical_crossentropy, optimizer=opt, metrics=['accuracy'])
# Broadcast initial variable states from rank 0 to all other processes.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
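# (Not on the original slide, but common in the Horovod examples: save checkpoints
#  only on rank 0 so workers don't overwrite each other's files. The filename
#  pattern below is just an illustration.)
if hvd.rank() == 0:
    callbacks.append(keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5'))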
model.fit(x_train, y_train, callbacks=callbacks, epochs=10, validation_data=(x_test, y_test))
12. Running Horovod
To run on a machine with 4 GPUs:
$ mpirun -np 4 \
    -H localhost:4 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    -mca pml ob1 -mca btl ^openib \
    python train.py
To run on 4 machines with 4 GPUs each:
$ mpirun -np 16 \
    -H server1:4,server2:4,server3:4,server4:4 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    -mca pml ob1 -mca btl ^openib \
    python train.py