© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark
Distributed Training Using Apache MXNet with Horovod
Lin Yuan, Yuxi (Darren) Hu
Feb 11, 2019
Outline
• What is distributed model training
• Introduction to Apache MXNet and Horovod
• Integrating MXNet with Horovod
• Performance results
• Demo
Model Training 101
[Diagram: the training loop. Data feeds the model, the model produces gradients, the optimizer applies them, and the cycle repeats until the model converges (done).]
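To make the loop concrete, here is a minimal, self-contained sketch in plain NumPy; the linear-regression model and every name in it are illustrative and not from the deck.

    import numpy as np

    # Tiny end-to-end version of the loop above: data -> model -> gradients
    # -> optimizer -> converge? All values here are illustrative.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(256, 4))                # data
    true_w = np.array([1.0, -2.0, 3.0, 0.5])
    y = X @ true_w + 0.01 * rng.normal(size=256)

    w = np.zeros(4)                              # model parameters
    lr = 0.1                                     # a plain SGD optimizer
    for step in range(1000):
        grad = 2 * X.T @ (X @ w - y) / len(y)    # gradients of the MSE loss
        w -= lr * grad                           # optimizer update
        if np.linalg.norm(grad) < 1e-6:          # converged? then done
            break
    print(step, w)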
The Growing Pains of DNNs
• Increasing model complexity
• ResNet50 network has over 25 million parameters [1]
• Huge training data
• ImageNet has 14,197,122 images [2]
• How to leverage computing resources effectively
• Cost/energy efficiency
Model Training Going Distributed
Data Parallelism vs Model Parallelism
[Diagram: in data parallelism, machine1 … machinen each hold a full copy of the model, train on their own data shard (data1 … datan), and synchronize through a global state of parameters; in model parallelism, a single model is partitioned across machine1 … machine4.]
Data Parallel: Parameter Server based approach
[Diagram: worker1 … workern each hold a model replica and an optimizer and train on their own data shard (data1 … datan); they push gradients to, and pull updated parameters from, parameter servers server1, server2, ….]
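In MXNet, this parameter-server pattern is exposed through the KVStore API. Below is a minimal single-machine sketch of the push/pull cycle; a real distributed job would create the store with 'dist_sync' and launch scheduler and server processes, and the key and shapes here are illustrative.

    import mxnet as mx

    # 'local' aggregates across devices on one machine; 'dist_sync' would use
    # remote parameter servers (requires launching scheduler/server processes).
    kv = mx.kv.create('local')

    weight = mx.nd.zeros((4, 4))
    kv.init(0, weight)                 # the store holds the initial value of key 0

    grad = mx.nd.ones((4, 4))
    kv.push(0, grad)                   # a worker pushes its gradient to the store
    kv.pull(0, out=weight)             # and pulls back the aggregated parameters
    print(weight)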
Data Parallel: Ring-Allreduce based approach
[Diagram: worker1 … worker4, each with a model replica and its own data shard (data1 … data4), are arranged in a ring and exchange gradient chunks with their neighbors; there is no central server.]
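To show what the ring actually exchanges, here is a small single-process simulation of ring-allreduce (a reduce-scatter phase followed by an allgather phase) in plain Python/NumPy; it mimics the communication pattern only and is not Horovod's implementation.

    import numpy as np

    def ring_allreduce(grads):
        """Simulate ring-allreduce over a list of per-worker gradient vectors."""
        n = len(grads)
        # Each worker splits its gradient into n chunks.
        chunks = [list(np.array_split(g.astype(float), n)) for g in grads]
        # Reduce-scatter: after n-1 steps, worker r holds the full sum of chunk (r+1) % n.
        for step in range(n - 1):
            sends = [(r, (r - step) % n, chunks[r][(r - step) % n].copy()) for r in range(n)]
            for r, idx, payload in sends:
                chunks[(r + 1) % n][idx] += payload
        # Allgather: circulate each completed chunk once around the ring.
        for step in range(n - 1):
            sends = [(r, (r + 1 - step) % n, chunks[r][(r + 1 - step) % n].copy()) for r in range(n)]
            for r, idx, payload in sends:
                chunks[(r + 1) % n][idx] = payload
        return [np.concatenate(c) for c in chunks]

    workers = [np.arange(8.0) * (i + 1) for i in range(4)]
    result = ring_allreduce(workers)
    assert all(np.allclose(r, sum(workers)) for r in result)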
Apache MXNet
• Apache (incubating) open source project
• Framework for DNNs
• Created by academia (CMU and UW)
• Adopted by AWS as DNN framework of choice, Nov 2016
• Widely used within Amazon
http://mxnet.io
Asynchronous Engine
Operations appear to return immediately, but are actually just pushed onto an engine queue on the backend. This allows for much greater parallelism.
Serial or parallel? “The execution of any two functions when one of them modifies at least one common variable is serialized in their push order.”
You must call wait_to_read() or similar to retrieve a value, and that call blocks.
[Diagram: in the synchronous view the frontend blocks until the backend finishes; in the asynchronous view the frontend call returns immediately and wait_to_read() blocks only when the value is needed.]
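A small NDArray example of this behavior; the shape is arbitrary, and mx.cpu() can be swapped for mx.gpu(0) on a GPU machine.

    import mxnet as mx

    a = mx.nd.ones((1000, 1000), ctx=mx.cpu())
    b = mx.nd.dot(a, a)   # returns immediately; the op is queued on the backend engine
    b.wait_to_read()      # blocks until the result has actually been computed
    print(b[0, 0])        # reading a value also synchronizes implicitly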
Horovod
• An open source framework (under the Linux Foundation) for distributed model training
• Supports TensorFlow, Keras, MXNet, and PyTorch
• Developed at Uber since Oct 2017
• Implements the ring-allreduce approach using MPI and NCCL
• MPI: a message passing interface for communication between worker nodes
• NCCL: NVIDIA's library of efficient collective communication methods between GPUs
https://eng.uber.com/horovod/
Integrating MXNet with Horovod
[Diagram: Horovod broadcasts the initial parameters to all workers and wraps the MXNet optimizer in a distributed optimizer that allreduces the gradients before MXNet updates the model.]
Leverage the power of the asynchronous engine in MXNet
• The MXNet engine starts executing the operation asynchronously
• Task dependencies are taken care of automatically
• Improves the training throughput
[Diagram: horovod.broadcast and horovod.allreduce are submitted to the MXNet engine through PushAsync, so the engine schedules them like any other operation.]
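In code, the Horovod collectives behave like ordinary MXNet operations. A short sketch assuming the horovod.mxnet module; the tensor is a placeholder:

    import mxnet as mx
    import horovod.mxnet as hvd

    hvd.init()
    tensor = mx.nd.ones((2, 2)) * hvd.rank()
    # The allreduce is pushed to the MXNet engine asynchronously and returns
    # immediately; the engine serializes it against other ops on the same array.
    averaged = hvd.allreduce(tensor, average=True)
    averaged.wait_to_read()   # block only when the value is actually needed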
Performance Optimization
• Mixed precision: use float32 for training and float16 for communicating gradients
• Tensor fusion: combine all tensors that are ready to be reduced into one reduction operation
• Hierarchical allreduce (only supported with NCCL)
• Aggregated SGD*: aggregate multiple weight updates into a single call to the optimizer to reduce synchronization overhead
* contributed by NVIDIA (https://devblogs.nvidia.com/new-optimizations-accelerate-deep-learning-training-gpu/)
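Some of these optimizations are toggled through Horovod environment variables rather than code. A sketch of how a training script might set them before hvd.init(); the variable names are assumptions to verify against the Horovod version in use:

    import os

    # Assumed Horovod tuning knobs; check them against your Horovod release.
    os.environ['HOROVOD_FUSION_THRESHOLD'] = str(64 * 1024 * 1024)  # tensor-fusion buffer size, bytes
    os.environ['HOROVOD_HIERARCHICAL_ALLREDUCE'] = '1'              # hierarchical allreduce on NCCL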
Next Steps
• Fused operators such as BatchNorm-ReLU and BatchNorm-Add-ReLU to reduce unnecessary data transfer between CPU and GPU memory*
• Provide a different layout (NHWC) to improve convolution operators on GPU*
* contributed by NVIDIA (https://devblogs.nvidia.com/new-optimizations-accelerate-deep-learning-training-gpu/)
Benchmark Setup
• Model and data
• ResNet50-v1b (~25 million parameters) [1]
• ImageNet (~14 million images) [2]
• Training setup
• batch size (per device): 256
• learning rate: 0.1 (scaled linearly with number of GPUs) [3]
• number of epochs: 90
• Software
• CUDA 9.2
• Ubuntu 16.04
• cuDNN 7.2.1
• NCCL 2.2.13
• OpenMPI 3.1.1
• Hardware
• GPU instance: p3.16xlarge (8 NVIDIA Tesla V100 GPUs, each with 5,120 CUDA cores and 640 Tensor Cores)
• CPU instance: c5.18xlarge (72 vCPU and 144GiB memory)
• Network bandwidth: 25Gbps
[1] He et al, “Deep Residual Learning for Image Recognition”, CVPR 2016
[2] http://image-net.org/challenges/LSVRC/2015/
[3] Goyal et al., “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour”, arXiv:1706.02677, 2017
Scaling Efficiency
[Chart: training throughput (images/sec, 0 to 60,000) for ResNet50 on ImageNet across 1, 8, 16, 32, and 64 NVIDIA Tesla V100 GPUs, comparing Parameter Server, Horovod, and ideal linear scaling. At 64 GPUs, Horovod reaches 82.6% scaling efficiency versus 48.7% for the Parameter Server approach.]
Cost Comparison
• Adding extra machines as parameter servers can help increase throughput, at the cost of extra compute resources ($$)
Setup | Time to train (min) | Throughput (images/sec) | Top-1 Validation Accuracy | Cost ($) *
Horovod on 8 p3.16xlarge | 43.5 | 44900 | 75.69% | 142
Parameter Server on 8 p3.16xlarge and 16 c5.18xlarge | 44.1 | 43482 | 74.81% | 190
Parameter Server (collocated) | 76 | 26500 | 74.72% | 248
* cost is calculated from the AWS on-demand EC2 hourly rates for p3.16xlarge and c5.18xlarge
MLPerf Benchmark of ResNet-50v1.5 on ImageNet*
Submitter | Hardware | Software | Time (mins) | Speedup
Reference | Pascal P100 | Unoptimized reference | 8831.3 | 1.0x
Google | TPUv2.512 + TPUv2.8 (260 cores) | TensorFlow 1.12 | 11.3 | 781.5x
Intel | 8x2S SKX8180 (16 processors) | Intel Caffe 1.1.2a | 1312.8 | 6.7x
NVIDIA | 80x DGX-1 (640 Volta GPUs) | MXNet ngc18.11, cuDNN 7.4 | 6.2 | 1424.4x
*MLPerf: https://mlperf.org/results/
How to run distributed training using MXNet with Horovod
• Install MXNet
• We currently recommend building MXNet from source if you are running on machines with GCC 5.x or later: https://github.com/apache/incubator-mxnet
• If you are running on machines with GCC 4.x, you can install MXNet using pip:
• pip install mxnet-cu92
• Install Horovod
• Currently, MXNet support in Horovod requires building Horovod from source: https://github.com/uber/horovod
• Horovod 0.16.0 will include MXNet support in the PyPI package:
• pip install horovod
• Run MPI on the cluster
• Specify the hosts and the number of slots on each (on the command line or in a host file)
• mpirun -np <num of gpu/cpu devices> -H <host1:slots,host2:slots,…> -bind-to none -map-by slot python <training script> (see the example below)
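For example, a launch across two hypothetical hosts with 8 GPUs each might look like the following; the host names and the training script are placeholders.

    # 16 processes total, 8 slots on each of two hosts (names are placeholders)
    mpirun -np 16 -H host1:8,host2:8 \
        -bind-to none -map-by slot \
        python train_imagenet.py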
Changes needed in training script
Single GPU training:

    import mxnet as mx
    # Set context to GPU
    context = mx.gpu(0)
    # Build model
    model = …
    # Define hyperparameters
    optimizer_params = …
    # Create optimizer
    opt = mx.optimizer.create(…)
    # Initialize parameters
    initializer = …
    model.bind(data=…, label=…)
    model.init_params(initializer)
    # Train model
    model.fit(train_data, optimizer=opt, num_epoch=…)

Distributed training in Horovod:

    import mxnet as mx
    import horovod.mxnet as hvd
    # Initialize Horovod
    hvd.init()
    # Set context to GPU by local rank
    context = mx.gpu(hvd.local_rank())
    # Build model
    model = …
    # Define hyperparameters
    optimizer_params = …
    # Create distributed optimizer
    opt = mx.optimizer.create(…)
    opt = hvd.DistributedOptimizer(opt)
    # Initialize parameters
    initializer = …
    model.bind(data=…, label=…)
    model.init_params(initializer)
    # Fetch and broadcast parameters
    hvd.broadcast_parameters(model.get_params())
    # Train model
    model.fit(train_data, optimizer=opt, num_epoch=…)
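The Module API shown above has a Gluon counterpart (the demo links both). Here is a minimal Gluon sketch under the same horovod.mxnet API; net, train_data, and the hyperparameters are placeholders, and the learning-rate scaling follows the linear rule used in the benchmarks:

    import mxnet as mx
    from mxnet import autograd, gluon
    import horovod.mxnet as hvd

    hvd.init()
    context = mx.gpu(hvd.local_rank())

    net = gluon.nn.Dense(10)                       # placeholder model
    net.initialize(ctx=context)
    params = net.collect_params()
    hvd.broadcast_parameters(params, root_rank=0)  # start every worker from the same weights

    # Scale the learning rate linearly with the number of workers (Goyal et al.).
    opt = mx.optimizer.create('sgd', learning_rate=0.1 * hvd.size())
    opt = hvd.DistributedOptimizer(opt)            # allreduce gradients before each update
    trainer = gluon.Trainer(params, opt)
    loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()

    for data, label in train_data:                 # placeholder data iterator
        data, label = data.as_in_context(context), label.as_in_context(context)
        with autograd.record():
            loss = loss_fn(net(data), label)
        loss.backward()
        trainer.step(data.shape[0])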
Demo
• MXNet + Horovod MNIST example Jupyter Notebook
• MXNet + Horovod MNIST example full scripts: Gluon, Module
How to Get Started with Apache MXNet on AWS
• Get started with Apache MXNet on AWS: https://aws.amazon.com/mxnet/get-started/
• Using Apache MXNet with Amazon SageMaker: https://docs.aws.amazon.com/sagemaker/latest/dg/mxnet.html
• Contact: mxnet-info@amazon.com
Using Apache MXNet with AWS ML Services
• Amazon SageMaker: https://aws.amazon.com/sagemaker/
• Amazon SageMaker Neo: https://aws.amazon.com/sagemaker/neo/
• Amazon Elastic Inference: https://aws.amazon.com/machine-learning/elastic-inference/
• Amazon Reinforcement Learning: https://aws.amazon.com/about-aws/whats-new/2018/11/amazon-sagemaker-announces-support-for-reinforcement-learning/
• AWS IoT Greengrass ML Inference: https://aws.amazon.com/greengrass/ml/
• Dynamic Training with Apache MXNet on AWS: https://aws.amazon.com/about-aws/whats-new/2018/11/introducing-dynamic-training-with-apache-mxnet/
Thank you for coming!
Q&A
Editor's Notes
1. Thanks for coming to our meetup today. My colleague Darren and I will present training deep neural network models on multiple GPU instances using Apache MXNet with Horovod.
2. First, I will give an overview of distributed model training. Next, I will briefly introduce MXNet, a deep learning library, and Horovod, a framework for distributed training. After that, I will describe how we support running MXNet on Horovod and show you some performance results we achieved. Finally, we will give you a short demo of running MXNet with Horovod on multiple hosts.
3. This is the typical flow of today's model training, especially for deep neural networks.
4. As DNNs have become popular models for machine learning applications, model training has become a challenging task.
5. There are two trends in today's model training tasks. First, the GPU has become the dominant hardware architecture for training due to its massively parallel computing capability for matrix operations. Second, more training jobs are running on multiple nodes than on a single node.
6. Ring-allreduce utilizes the network in an optimal way if the tensors are large enough, but does not work as efficiently or quickly if they are very small. Tensor fusion using a fusion buffer yields up to a 65% improvement; hierarchical allreduce can further boost performance by 10% to 30%.