- Title: Horovod: fast and easy distributed deep learning in TensorFlow
- Paper: https://arxiv.org/abs/1802.05799
- Youtube: https://youtu.be/8zQECRiONAo
Taekmin Kim, http://github.com/tantara
1. Horovod: fast and easy distributed deep learning in TensorFlow
PR-129
Taekmin Kim
Dec 23, 2018
2. On a single machine
● Training ResNet-50 on TPU
○ Batch Size: 1024
○ Accuracy: 76% (Top-1)
○ Training Time: 17 hours
● Training Faster R-CNN on 8 GPUs
○ Batch Size: 8~16
■ 1~2 per GPU
3. Why large-scale training?
● Better accuracy
○ E.g. object detection
○ Group Normalization (ECCV 2018)
● Faster training
○ ResNet-50
■ 6.6 minutes, 75.8% (Top-1)
■ 64k mini-batch size, 2048 GPUs
Group Normalization
21. Related Work
● Training
○ Communication Cost
■ Deep Gradient Compression
■ Training ImageNet in Four Minutes
● using mixed precision
○ RNN, RL
■ Dynamic Control Flow
● Inference
○ Low Latency RNN Inference with Cellular Batching
Editor's Notes
In the ring-allreduce algorithm, shown in Figure 4, each of N nodes communicates with two of its peers 2 ∗ (N − 1) times. During this communication, a node sends and receives chunks of the data buffer. In the first N − 1 iterations, received values are added to the values in the node’s buffer. In the second N − 1 iterations, received values replace the values held in the node’s buffer. Patarasuk and Yuan in [9] suggest that this algorithm is bandwidth-optimal, meaning that if the buffer is large enough, it will optimally utilize the available network.
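The two phases above can be sketched as a toy in-process simulation — a scatter-reduce phase where received chunks are added, then an allgather phase where fully reduced chunks replace stale ones. This is an illustrative sketch of the algorithm's data movement, not Horovod's actual MPI/NCCL implementation; the chunk schedule is one standard choice.

```python
def ring_allreduce(buffers):
    """Sum identical-length buffers across N simulated nodes, in place.

    Each buffer is split into N chunks. Phase 1 (scatter-reduce): in N-1
    steps, node i sends one chunk to node i+1, which ADDS it to its own.
    Phase 2 (allgather): in N-1 more steps, fully reduced chunks REPLACE
    stale ones. Total: 2*(N-1) communication steps per node.
    """
    n = len(buffers)
    size = len(buffers[0])
    assert size % n == 0, "buffer must split evenly into N chunks"
    chunk = size // n
    span = lambda c: range((c % n) * chunk, (c % n) * chunk + chunk)

    # Phase 1: scatter-reduce -- received values are added.
    for step in range(n - 1):
        for i in range(n):
            c = i - step                      # chunk node i forwards this step
            for k in span(c):
                buffers[(i + 1) % n][k] += buffers[i][k]

    # Phase 2: allgather -- received values replace.
    # After phase 1, node i holds the fully reduced chunk (i+1) mod N.
    for step in range(n - 1):
        for i in range(n):
            c = i + 1 - step                  # fully reduced chunk to forward
            for k in span(c):
                buffers[(i + 1) % n][k] = buffers[i][k]
    return buffers

bufs = ring_allreduce([[1, 2], [3, 4]])
# both simulated nodes end with the elementwise sum [4, 6]
```

Because each node only ever sends one chunk (1/N of the buffer) per step to a single neighbor, total bytes sent per node are independent of N up to a factor of 2, which is the bandwidth-optimality the paper cites.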
In addition to being network-optimal, the allreduce approach is much easier to understand and adopt. Users utilize a Message Passing Interface (MPI) [10] implementation such as Open MPI [11] to launch all copies of the TensorFlow program. MPI then transparently sets up the distributed infrastructure necessary for workers to communicate with each other. All the user needs to do is modify their program to average gradients using an allreduce() operation.
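The user-facing change the paper describes — averaging gradients with a single allreduce() call after backprop — can be sketched as follows. The `allreduce` helper here is a hypothetical in-process stand-in for illustration; in Horovod the collective runs across MPI-launched worker processes.

```python
def allreduce(tensors):
    """Elementwise sum across workers (toy in-process stand-in,
    not Horovod's actual collective)."""
    return [sum(vals) for vals in zip(*tensors)]

def averaged_gradients(local_grads):
    """What each worker's training step does after computing its gradient."""
    n = len(local_grads)             # number of workers
    summed = allreduce(local_grads)  # one collective call
    # Every worker ends up applying the same averaged gradient,
    # keeping model replicas in sync.
    return [g / n for g in summed]

grads = averaged_gradients([[1.0, 2.0], [3.0, 4.0]])  # two workers
# -> [2.0, 3.0] on every worker
```

In real Horovod usage this averaging is hidden behind the optimizer wrapper, so the only per-program change is a few lines at startup plus wrapping the optimizer.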
We found that RDMA did not significantly improve our performance, achieving only a three to four percent increase over TCP networking. RDMA, however, did help Horovod exceed 90 percent scaling efficiency on both models.
The VGG-16 model experienced a significant 30 percent speedup when we leveraged RDMA networking.