From Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNet
Haibin Lin, Applied Scientist, AWS AI
Lin Yuan, Software Design Engineer, AWS AI
Dataset and Model Size Keep Growing
[Figure: two charts, dataset size for training (GB) and model parameter size (million)]
Large Scale Distributed Training for Deep Neural Networks
[Figure: data parallelism vs. model parallelism]
Optimization for Large Scale Distributed Training
• System-Level Optimization
  • Accelerate training on a single GPU
    • fused operators, data prefetching, vectorization, cache utilization, tensor core
  • Distributed training with multiple GPUs
    • large batch size, NCCL-allreduce, Elastic Fabric Adapter
• Algorithm-Level Optimization
  • Large-batch optimization algorithm
  • Model architecture
  • Accuracy/runtime trade-off
Performance Optimization on AWS Cloud
• Leverage the Amazon EC2 P3dn.24xlarge GPU instances
  • 8 NVIDIA V100 Tensor Core GPUs with 32 GB of memory each
  • 96 Intel Xeon Scalable vCPUs
  • 1.8 TB local NVMe SSD
  • 100 Gbps network throughput
  • Elastic Fabric Adapter support
• Software
  • Apache MXNet
  • GluonNLP and GluonCV toolkits
  • Horovod distributed training library
Case Study: Mask R-CNN
Deep learning nowadays - Mask R-CNN
• Widely used in object detection and instance segmentation
• Target accuracy
  • bounding box AP: 37.7
  • mask AP: 33.9
GluonCV: a Deep Learning Toolkit for Computer Vision
• Training scripts that reproduce SOTA results reported in latest papers
• A large set of pre-trained models
• Carefully designed APIs and easy-to-understand implementations
• Community support
• Built on top of Apache MXNet framework
Supported tasks: image classification, object detection, semantic segmentation, pose estimation, video action recognition
GPU Profiling
• Analyze runtime using Nvidia Visual Profiler
• Identify large kernels to optimize
[Figure: profiler timeline annotated with a slow operator, NHWC layout conversion, and small kernels]
GPU Optimization
Runtime Improvements
• optimize ROIAlign: +10%
• optimize NMS: +10%
• fuse RCNN target generator: +5%
• NHWC layout conversion: +10%
• pointwise operator fusion: +3%
Automatic Mixed Precision
• Automatic casting of the model
  • Convolution, FullyConnected -> FP16
  • Norm, Mean, SoftMax, etc. -> FP32
  • Add, Mul, etc. -> cast to the widest input type
• AMP boosted throughput by 5~10%
• Casting the gradients to FP16 improved throughput by another 1~2% without compromising accuracy
• Utilities for dynamic loss scaling
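The dynamic loss scaling mentioned above can be sketched in plain Python. This is a minimal illustration of the general technique, not MXNet's AMP implementation; the class name `DynamicLossScaler` and its parameters (`init_scale`, `growth_interval`, `factor`) are hypothetical and chosen only for this example.

```python
import math


class DynamicLossScaler:
    """Hypothetical helper: grows the loss scale after a run of clean steps
    and shrinks it whenever a gradient overflows (inf or NaN)."""

    def __init__(self, init_scale=2.0 ** 16, growth_interval=2000, factor=2.0):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.factor = factor
        self._clean_steps = 0

    def has_overflow(self, grads):
        # An FP16 overflow shows up as inf/NaN in the scaled gradients.
        return any(math.isinf(g) or math.isnan(g) for g in grads)

    def update(self, grads):
        """Return True if the optimizer step should be applied."""
        if self.has_overflow(grads):
            # Shrink the scale and skip this step.
            self.scale = max(self.scale / self.factor, 1.0)
            self._clean_steps = 0
            return False
        self._clean_steps += 1
        if self._clean_steps >= self.growth_interval:
            # Enough clean steps in a row: try a larger scale again.
            self.scale *= self.factor
            self._clean_steps = 0
        return True

    def unscale(self, grads):
        # Undo the loss scaling before the weight update.
        return [g / self.scale for g in grads]


scaler = DynamicLossScaler(init_scale=1024.0, growth_interval=2)
steps = ([0.5, 1.0], [float("inf"), 1.0], [0.1, 0.2], [0.3, 0.4])
applied = [scaler.update(g) for g in steps]
```

The loss is multiplied by `scaler.scale` before the backward pass so that small FP16 gradients do not underflow, and the gradients are divided by the same factor before the weight update.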
Model Hybridization
• MXNet provides APIs to construct and debug the model using imperative programming
• Users can call the hybridize API to get performance equivalent to symbolic programming
• Hybridizing the model gave a 5% runtime improvement
• Hybridizing the model with static_alloc gave another 1~2% throughput improvement
Performance Tuning in AWS cluster
• Binding each GPU to 12 vCPUs (6 from each CPU) on an Amazon EC2 P3dn.24xlarge instance improved throughput by 8%
• Autotuning Horovod hyperparameters such as tensor fusion threshold, cycle time, cache capacity, and hierarchical allreduce: +9% throughput
• Increasing the number of data workers from 4 to 8 also helped accelerate data loading. Note, however, that more data workers do not necessarily mean better performance, because of context-switching overhead.
• Accelerate the dataloader through Cython
• Distributed validation significantly reduced validation compute time: 13 secs/epoch on 24 P3dn instances vs. several minutes for non-distributed validation.
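The CPU binding in the first bullet above (12 vCPUs per GPU, 6 from each CPU) can be sketched as follows. The vCPU numbering here is an assumption made for illustration: socket 0 is taken to own ids 0-47 and socket 1 ids 48-95. The real topology should be read from the instance (for example with `lscpu`), and the function name `vcpus_for_gpu` is hypothetical.

```python
def vcpus_for_gpu(gpu_index, num_gpus=8, vcpus_per_socket=48, num_sockets=2):
    """Return the vCPU ids assigned to one GPU: an equal slice from each socket.

    Assumes socket s owns the contiguous ids
    [s * vcpus_per_socket, (s + 1) * vcpus_per_socket), which is a
    hypothetical layout used only for this sketch.
    """
    per_socket = vcpus_per_socket // num_gpus  # 48 // 8 = 6 vCPUs per socket
    cpus = []
    for socket in range(num_sockets):
        start = socket * vcpus_per_socket + gpu_index * per_socket
        cpus.extend(range(start, start + per_socket))
    return cpus


# One disjoint set of 12 vCPUs for each of the 8 GPUs.
assignment = {gpu: vcpus_for_gpu(gpu) for gpu in range(8)}
```

On Linux, a worker process could then pin itself with `os.sched_setaffinity(0, assignment[local_rank])`, where `local_rank` stands for the index of the GPU that the process drives.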
Case Study: BERT
Deep learning nowadays - BERT
[Figure: General Language Understanding Evaluation (GLUE) benchmark scores (axis from 55 to 95), with BERT and the human baseline marked]
Transfer learning with BERT for NLP
• Pre-training for NLP
  • learn text representation on a large-scale corpus
• Fine-tuning for downstream tasks
  • Named Entity Recognition
  • Question Answering
  • Search
  • Chatbot
  • Text Summarization
  • Text Classification
• Models available in GluonNLP toolkit
[Figure: a pre-trained feature extractor reused for transfer learning in NLP and CV; the NLP example classifies "GTC is awesome!" as positive. Image credit to: d2l.ai]
GluonNLP: a deep learning natural language toolkit
• Open source, available on SageMaker and deep learning container
• State-of-the-art NLP models
• Easy prototyping
• Fast deployment
• Multiple built-in NLP tasks
BERT model architecture
[Figure: BERT as N stacked multi-head attention blocks (Vaswani et al., 2017). Image credit to: d2l.ai]
Pre-training objectives
1. Masked language modeling
  • Estimate
  • Randomly mask 15% of all tokens and predict them
  • Example:
    I went to the bank to deposit some money.
    I went to the <mask> to deposit some money.
2. Next sentence prediction
  • 50% of the time, replace the second sentence with a random sentence
  • Learn logical coherence
  • Example:
    <CLS> Haibin is obnoxious <SEP> I don’t like his shirt
    <CLS> Haibin is obnoxious <SEP> Hello world! .
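The masked language modeling objective on this slide can be illustrated with a small sketch. It is a simplification under stated assumptions: it always substitutes the mask token, whereas the BERT paper also keeps or randomly replaces some of the selected tokens. The function name `mask_tokens` and its parameters are hypothetical.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="<mask>", rng=None):
    """Return (masked_tokens, labels); labels hold the original token at
    masked positions and None elsewhere."""
    if not tokens:
        return [], []
    rng = rng or random.Random()
    num_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), num_to_mask))
    masked, labels = [], []
    for i, tok in enumerate(tokens):
        if i in positions:
            masked.append(mask_token)
            labels.append(tok)  # the model is trained to predict this token
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


sentence = "I went to the bank to deposit some money .".split()
# Calling this again for every epoch draws a fresh mask for the same sentence.
masked, labels = mask_tokens(sentence, rng=random.Random(0))
```

Because the mask is drawn at call time rather than stored with the dataset, the same sentence can be masked differently each time it is visited.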
Data loading
• Mini-batches are generated on the fly for dynamic masking[1]
• Multi-process DatasetLoader with pre-fetching in the background
• Amazon FSx for Lustre: file system for compute-intensive workloads
• Profiling result visualization
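The background pre-fetching described above can be sketched with the standard library. This is an illustration only: it uses a single background thread rather than the multi-process loader named on the slide, and `prefetch` and `make_batches` are hypothetical names.

```python
import queue
import threading


def prefetch(batch_iter, buffer_size=2):
    """Yield batches from batch_iter while a background thread keeps up to
    buffer_size upcoming batches ready."""
    sentinel = object()
    buf = queue.Queue(maxsize=buffer_size)

    def producer():
        for batch in batch_iter:
            buf.put(batch)  # blocks while the buffer is full
        buf.put(sentinel)   # signals that no more batches are coming

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = buf.get()
        if batch is sentinel:
            return
        yield batch


def make_batches(num_batches, batch_size):
    # Stand-in for on-the-fly batch generation (e.g. tokenizing and masking).
    for b in range(num_batches):
        yield [b * batch_size + i for i in range(batch_size)]


batches = list(prefetch(make_batches(3, 4)))
```

While the training loop consumes one batch, the producer is already preparing the next ones, which is what closes the data loading gap between consecutive batches.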
[Figure: profiler timeline showing a data loading gap between the previous batch and the current batch. Image credit to: d2l.ai]
Fast Multi-head Self-Attention
credit to: Clement Fuji Tsang

Baseline:
For each layer:
  Separate projections:
    Qproj = QWq, Kproj = QWk, Vproj = QWv
  Transpose Qproj, Kproj, Vproj:
    From (N, T, H, C) to (N, H, T, C)
  Compute attention:
    score = batch_gemm(Qproj, Kproj)
    result = batch_gemm(score, Vproj)
  Transpose result:
    From (N, H, T, C) to (N, T, H, C)

Optimized (higher cache utilization, 1.58x faster end to end):
Transpose Q:
  From (N, T, HC) to (T, N, HC)
For each layer:
  Joint projections:
    Wqkv = concat(Wq, Wk, Wv)
    Q_K_Vproj = QWqkv
  Compute attention:
    score = strided_batch_gemm(Qproj, Kproj)
    result = strided_batch_gemm(score, Vproj)
Transpose final result:
  From (T, N, HC) to (N, T, HC)
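The joint projection step can be checked on a toy example: multiplying by the concatenated weight matrix once gives the same values as three separate projections. The sketch below uses plain Python lists and arbitrary small matrices; the helper names are hypothetical, and the actual kernels may lay out the concatenated weights differently (for example interleaved per head).

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def concat_columns(*mats):
    """Concatenate matrices along the column axis."""
    return [sum((list(m[i]) for m in mats), []) for i in range(len(mats[0]))]


# Toy sizes: 2 tokens, hidden size 3. The values are arbitrary.
Q = [[1, 2, 3], [4, 5, 6]]
Wq = [[1, 0, 2], [0, 1, 0], [1, 1, 1]]
Wk = [[2, 1, 0], [0, 0, 1], [1, 0, 1]]
Wv = [[0, 1, 1], [1, 0, 0], [0, 2, 1]]

# Baseline: three separate projections (three GEMM calls).
q_proj, k_proj, v_proj = matmul(Q, Wq), matmul(Q, Wk), matmul(Q, Wv)

# Joint projection: one larger GEMM, then slice the result into three parts.
Wqkv = concat_columns(Wq, Wk, Wv)
qkv = matmul(Q, Wqkv)
hidden = len(Wq[0])
q_joint = [row[:hidden] for row in qkv]
k_joint = [row[hidden:2 * hidden] for row in qkv]
v_joint = [row[2 * hidden:] for row in qkv]
```

One larger matrix multiplication in place of three smaller ones reads the input once, which is consistent with the cache utilization benefit noted on the slide.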
GPU memory is precious
- For each mini-batch, the gradient is synchronized across GPUs
- Gradient allreduce can overlap with backward computation
- A larger batch size leaves more time to hide communication latency
- A 1-bit dropout mask reduces memory by 20%, enabling larger batch sizes
[Figure: timeline in which Allreduce1, Allreduce2, and Allreduce3 run concurrently with the forward and backward passes, so computation and communication overlap. Image credit to: d2l.ai]
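The benefit of overlapping allreduce with backward computation can be estimated with a simple timing model. This is a sketch under simplifying assumptions: one allreduce per layer, a single network channel, and made-up durations. The function name `step_time` is hypothetical.

```python
def step_time(backward_times, allreduce_times, overlap=True):
    """Time to finish backward plus gradient allreduce for one step.

    backward_times[i] and allreduce_times[i] belong to the i-th layer in
    backward order. With overlap, a layer's allreduce starts as soon as its
    gradient is ready and the network is free.
    """
    if not overlap:
        return sum(backward_times) + sum(allreduce_times)
    compute_done = 0.0
    network_free = 0.0
    for bwd, comm in zip(backward_times, allreduce_times):
        compute_done += bwd
        network_free = max(network_free, compute_done) + comm
    return max(compute_done, network_free)


# Made-up durations in arbitrary time units.
backward = [3.0, 3.0, 3.0]
allreduce = [2.0, 2.0, 2.0]
sequential = step_time(backward, allreduce, overlap=False)
overlapped = step_time(backward, allreduce, overlap=True)
```

In this model only the last layer's allreduce stays exposed. A larger batch size lengthens the backward times while the gradient size stays the same, so the exposed communication becomes a smaller fraction of each step.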
NCCL + Elastic Fabric Adapter
[Figure: two software stacks.
 Traditional HPC software stack in EC2: HPC application, MPI implementation (user space); TCP/IP stack, ENA network driver (kernel); ENA device.
 HPC software stack in EC2 with EFA: HPC application, MPI implementation, Libfabric (user space); EFA kernel driver (kernel); ENA device.]
- Elastic Fabric Adapter (EFA)
  - For HPC and distributed ML
  - Bypass OS kernel
  - Integrated with MPI, NCCL
- BERT training
  - 32 p3dn.24xlarge instances
  - V100 GPUs x 256
  - 100 Gb/s networking
  - BERT-large with GluonNLP
  - Batch size 64K, phase 1
  - 90% strong scaling efficiency, with EFA enabled
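For reference, strong scaling efficiency is commonly computed as the achieved speedup divided by the ideal speedup for a fixed total workload. The sketch below uses hypothetical timings, not measurements from the talk, and the function name is an assumption.

```python
def strong_scaling_efficiency(base_gpus, base_time, gpus, time):
    """Achieved speedup divided by ideal speedup for a fixed total workload."""
    speedup = base_time / time
    ideal = gpus / base_gpus
    return speedup / ideal


# Hypothetical timings: 320 time units on 8 GPUs, 12.5 on 256 GPUs.
eff = strong_scaling_efficiency(base_gpus=8, base_time=320.0, gpus=256, time=12.5)
```

An efficiency of 1.0 would mean that 32 times more GPUs made the same job exactly 32 times faster.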
Distributed Stochastic Optimization
credit to: Shuai Zheng
Gradient update: $x_{t+1} = x_t - \eta_t g_t$
Normalized gradient update: $x_{t+1} = x_t - \eta_t \frac{g_t}{\|g_t\|_2}$
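The two update rules on this slide differ only in whether the gradient is divided by its L2 norm. A minimal sketch on plain Python lists follows; the function names are hypothetical and the `eps` guard against a zero gradient is an addition for this example, not part of the slide.

```python
import math


def gradient_step(x, g, lr):
    """x_{t+1} = x_t - lr * g_t"""
    return [xi - lr * gi for xi, gi in zip(x, g)]


def normalized_gradient_step(x, g, lr, eps=1e-12):
    """x_{t+1} = x_t - lr * g_t / ||g_t||_2 (eps guards a zero gradient)."""
    norm = math.sqrt(sum(gi * gi for gi in g))
    return [xi - lr * gi / (norm + eps) for xi, gi in zip(x, g)]


x = [1.0, 2.0]
g = [3.0, 4.0]  # L2 norm is 5
plain = gradient_step(x, g, lr=0.1)
normalized = normalized_gradient_step(x, g, lr=0.1)
```

With normalization, the step length is set by the learning rate alone and no longer depends on the gradient magnitude.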
Framework   Batch size  #XPUs     #steps     Optimizer  F1 score  Training time
TensorFlow  64K/32K     1K TPUs   8599       LAMB [2]   90.58%    76.19m
MXNet       32K/32K     512 GPUs  7038/1563  LAMB + NG  90.60%    141.5m
References
[1] Liu, Yinhan, et al. "RoBERTa: A Robustly Optimized BERT Pretraining Approach." arXiv preprint arXiv:1907.11692 (2019).
[2] You, Yang, et al. "Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes." International Conference on Learning Representations. 2019.
Thank you
Haibin Lin
haibilin@amazon.com
Lin Yuan
lnyuan@amazon.com

Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 

From Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNet

  • 1. Haibin Lin Applied Scientist AWS AI From Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNet Lin Yuan Software Design Engineer AWS AI
  • 2. Dataset and Model Size Keep Growing Dataset size for training (GB) Model parameter size (million)
  • 3. Large Scale Distributed Training for Deep Neural Networks Data parallelism Model parallelism
  • 4. Optimization for Large Scale Distributed Training • System-Level Optimization • Accelerate training on a single GPU • fused operators, data prefetching, vectorization, cache utilization, tensor core • Distributed training with multiple GPUs • large batch size, NCCL-allreduce, Elastic Fabric Adapter • Algorithm-Level Optimization • Large-batch optimization algorithm • Model architecture • Accuracy/runtime trade-off
  • 5. Performance Optimization on AWS Cloud • Leverage the Amazon EC2 P3dn.24xlarge GPU instances • 8 Nvidia V100 Tensor Core GPUs with 32 GB of memory each • 96 Intel Xeon Scalable vCPUs • 1.8 TB local NVMe SSD • 100 Gbps network throughput • support Elastic Fabric Adapter • Software • Apache MXNet • GluonNLP and GluonCV toolkits • Horovod distributed training library
  • 7. Deep learning nowadays - Mask-RCNN • Widely used in object detection and instance segmentation • Target accuracy • bounding box AP: 37.7 • mask AP: 33.9
  • 8. GluonCV: a Deep Learning Toolkit for Computer Vision • Training scripts that reproduce SOTA results reported in the latest papers • A large set of pre-trained models • Carefully designed APIs and easy-to-understand implementations • Community support • Built on top of the Apache MXNet framework • Tasks covered: image classification, object detection, semantic segmentation, pose estimation, video action recognition
  • 9. GPU Profiling • Analyze runtime using Nvidia Visual Profiler • Identify large kernels to optimize [Profiler screenshot annotations: slow operator, NHWC layout conversion, small kernels]
  • 10. GPU Optimization Runtime Improvements • optimize ROIAlign: +10% • optimize NMS: +10% • fuse RCNN target generator: +5% • NHWC layout conversion: +10% • pointwise operator fusion: +3%
  • 11. Automatic Mixed Precision • Automatic casting of the model • Convolution, FullyConnected -> FP16 • Norm, Mean, SoftMax, etc. -> FP32 • Add, Mul, etc. -> cast to widest type • AMP boosted the throughput by 5~10% • Casting the gradients to FP16 gives another throughput improvement of 1~2% without compromising accuracy • Utilities for dynamic loss scaling
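Slide 11 mentions utilities for dynamic loss scaling without showing how they behave. The following is a minimal, framework-free sketch of the general idea, not MXNet's AMP implementation: the class name `DynamicLossScaler` and its parameters (`init_scale`, `growth_interval`, `factor`) are hypothetical and chosen only for illustration. It assumes the loss was multiplied by `scale` before the backward pass, so gradients arrive scaled.

```python
import math


class DynamicLossScaler:
    """Toy dynamic loss scaler: back off on overflow, grow after clean steps."""

    def __init__(self, init_scale=2.0 ** 16, growth_interval=2000, factor=2.0):
        self.scale = init_scale                 # current loss scale
        self.growth_interval = growth_interval  # clean steps before growing
        self.factor = factor                    # multiplicative grow/shrink factor
        self._clean_steps = 0

    def unscale(self, scaled_grads):
        """Return unscaled gradients, or None if this step should be skipped."""
        if any(not math.isfinite(g) for g in scaled_grads):
            # An FP16 overflow shows up as inf/nan: skip the update, shrink the scale.
            self.scale /= self.factor
            self._clean_steps = 0
            return None
        grads = [g / self.scale for g in scaled_grads]
        self._clean_steps += 1
        if self._clean_steps >= self.growth_interval:
            # A long run without overflow suggests the scale can be raised again.
            self.scale *= self.factor
            self._clean_steps = 0
        return grads
```

A training loop would call `unscale` once per step and apply the optimizer update only when the result is not `None`.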
  • 12. Model Hybridization • MXNet provides APIs to construct and debug the model using imperative programming • Users can invoke the hybridize API to boost model performance to a level equivalent to symbolic programming • We applied hybridization to the model and achieved a 5% runtime improvement • Hybridizing the model with static_alloc gave another 1~2% throughput improvement
  • 13. Performance Tuning in AWS cluster • Binding each GPU to 12 vCPUs (6 from each CPU) on an Amazon EC2 P3dn.24xlarge instance gave an 8% improvement in throughput • Autotuning Horovod hyperparameters such as tensor fusion threshold, cycle time, cache capacity, and hierarchical allreduce gave +9% throughput • Increasing the number of data workers from 4 to 8 also helps accelerate data loading. Note, however, that more data workers do not necessarily mean better performance, due to the overhead of context switching. • Accelerate the dataloader through Cython • Distributed validation significantly reduced validation compute time: 13 secs/epoch on 24 P3dn vs. several minutes with non-distributed validation.
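The CPU binding on slide 13 (12 vCPUs per GPU, 6 from each CPU) can be sketched as a small mapping from local GPU rank to vCPU ids. This is an illustration under stated assumptions, not the authors' script: the function name `cpu_affinity_for_gpu` is hypothetical, and it assumes vCPUs 0-47 sit on the first socket and 48-95 on the second. The real numbering on an instance should be checked (for example with `lscpu`) before relying on it.

```python
def cpu_affinity_for_gpu(local_rank, num_gpus=8, num_sockets=2, vcpus_per_socket=48):
    """Return the vCPU ids to bind to one GPU worker.

    Assumption (hypothetical layout): vCPU ids are contiguous per socket,
    i.e. socket s owns ids [s * vcpus_per_socket, (s + 1) * vcpus_per_socket).
    """
    per_gpu = (num_sockets * vcpus_per_socket) // num_gpus  # 96 / 8 = 12
    per_socket = per_gpu // num_sockets                     # 12 / 2 = 6
    cpus = []
    for socket in range(num_sockets):
        start = socket * vcpus_per_socket + local_rank * per_socket
        cpus.extend(range(start, start + per_socket))
    return cpus
```

On Linux, a worker process could then pin itself with `os.sched_setaffinity(0, cpu_affinity_for_gpu(local_rank))`, where `local_rank` would come from the launcher.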
  • 15. Deep learning nowadays - BERT [Chart: General Language Understanding Evaluation (GLUE) Benchmark scores on a 55-95 axis, comparing BERT with the human baseline]
  • 16. Transfer learning with BERT for NLP • Pre-training for NLP • learn text representation on large-scale corpus • Fine-tuning for downstream tasks • Named Entity Recognition • Question Answering • Search • Chatbot • Text Summarization • Text Classification • Models available in GluonNLP toolkit [Figure: a pre-trained feature extractor reused for CV and NLP tasks; the NLP example classifies "GTC is awesome!" as positive. Image credit to: d2l.ai]
  • 17. GluonNLP: a deep learning natural language toolkit • Open source, available on SageMaker and deep learning container • State-of-the-art NLP models • Easy prototyping • Fast deployment • Multiple built-in NLP tasks
  • 18. BERT model architecture • BERT • Multi-head attention (Vaswani et al., 17) x N (Image credit to: d2l.ai)
  • 19. Pre-training objectives 1. Masked language modeling • Estimate • Randomly mask 15% of all tokens and predict them • Example: "I went to the bank to deposit some money." becomes "I went to the <mask> to deposit some money." 2. Next sentence prediction • 50% of the time, replace the second sentence with a random sentence • Learn logical coherence • Examples: "<CLS> Haibin is obnoxious <SEP> I don’t like his shirt" vs. "<CLS> Haibin is obnoxious <SEP> Hello world!"
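The masking step of the masked language modeling objective on slide 19 can be sketched in a few lines. This is a simplified illustration rather than the GluonNLP implementation: the function name `mask_tokens` and its parameters are hypothetical, every selected position is replaced by the mask token, and special tokens such as `<CLS>` or `<SEP>` are not excluded from selection.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="<mask>", seed=None):
    """Mask roughly mask_prob of the tokens; return (masked tokens, labels).

    labels maps each masked position to the original token the model
    would be asked to predict.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    num_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = rng.sample(range(len(tokens)), num_to_mask)
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = masked[pos]
        masked[pos] = mask_token
    return masked, labels
```

Calling the function again with a different seed on the same sentence produces a different mask, which is the property that on-the-fly (dynamic) masking relies on.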
  • 20. Data loading • Mini-batches are generated on the fly for dynamic masking [1] • Multi-process DatasetLoader with pre-fetching in the background • Amazon FSx for Lustre: file system for compute-intensive workloads • Profiling result visualization [Figure: profiler timeline showing the data loading gap between the previous batch and the current batch. Image credit to: d2l.ai]
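The background pre-fetching idea on slide 20 can be illustrated with a small generator wrapper. This sketch uses a thread and a bounded queue from the standard library as a stand-in; the slide describes a multi-process loader, and the name `prefetch` and the `buffer_size` parameter are hypothetical. Error propagation from the worker is deliberately left out.

```python
import queue
import threading


def prefetch(batch_iter, buffer_size=2):
    """Yield batches from batch_iter while a background thread prepares the next ones."""
    q = queue.Queue(maxsize=buffer_size)
    sentinel = object()  # marks the end of the stream

    def worker():
        try:
            for batch in batch_iter:
                q.put(batch)  # blocks when the buffer is full
        finally:
            q.put(sentinel)   # always unblock the consumer

    threading.Thread(target=worker, daemon=True).start()
    while True:
        batch = q.get()
        if batch is sentinel:
            return
        yield batch
```

With this wrapper, the consumer processes the current batch while the worker is already building the next one, which is what closes the data loading gap shown in the profiler view.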
  • 21. Fast Multi-head Self-Attention (credit to: Clement Fuji Tsang) • Baseline, for each layer: • Separate projections: Qproj = QWq, Kproj = QWk, Vproj = QWv • Transpose Qproj, Kproj, Vproj: from (N, T, H, C) to (N, H, T, C) • Compute attention: score = batch_gemm(Qproj, Kproj), result = batch_gemm(score, Vproj) • Transpose result: from (N, H, T, C) to (N, T, H, C) • Optimized: transpose Q from (N, T, HC) to (T, N, HC), then for each layer: • Joint projections: Wqkv = concat(Wq, Wk, Wv), Q_K_Vproj = QWqkv • Compute attention: score = strided_batch_gemm(Qproj, Kproj), result = strided_batch_gemm(score, Vproj) • Transpose final result: from (T, N, HC) to (N, T, HC) • Higher cache utilization, 1.58x faster (end to end)
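The joint projection on slide 21 relies on a simple identity: multiplying the input by the column-wise concatenation of Wq, Wk and Wv gives the same three projections as three separate multiplications. The sketch below checks that identity with plain Python lists. It illustrates the algebra only: the helper names are hypothetical, the three weight matrices are assumed to have the same number of columns, and the memory layout and strided batched GEMM used by the optimized kernel are not reproduced.

```python
def matmul(a, b):
    """Naive matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def concat_columns(*mats):
    """Concatenate matrices side by side (all must have the same row count)."""
    return [sum((list(m[i]) for m in mats), []) for i in range(len(mats[0]))]


def joint_projection(x, wq, wk, wv):
    """Project x once with concat(Wq, Wk, Wv), then split into Q, K, V parts."""
    w_qkv = concat_columns(wq, wk, wv)
    out = matmul(x, w_qkv)  # a single larger GEMM instead of three smaller ones
    d = len(wq[0])          # assumes Wq, Wk, Wv share the same output width
    q = [row[:d] for row in out]
    k = [row[d:2 * d] for row in out]
    v = [row[2 * d:] for row in out]
    return q, k, v
```

One larger matrix product tends to use the hardware better than three small ones, which is consistent with the cache utilization benefit the slide reports.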
  • 22. GPU memory is precious - For each mini-batch, the gradient is synchronized across GPUs - Gradient allreduce can overlap with backward computation - A larger batch size leaves more time to hide communication latency - A 1-bit dropout mask leads to a 20% memory reduction, enabling larger batch sizes - We can overlap computation and communication [Timeline figure: Forward1, Forward2, Forward3, Backward1, Backward2, Backward3 with Allreduce1, Allreduce2, Allreduce3 running alongside over time. Image credit to: d2l.ai]
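The 1-bit dropout mask on slide 22 stores one bit per element instead of a full byte or wider value. The sketch below shows the packing idea in plain Python; the function names `pack_mask` and `unpack_mask` are hypothetical and this is not the kernel used in the talk. It only demonstrates that the mask itself shrinks by 8x relative to one byte per element; the 20% overall figure on the slide refers to total training memory.

```python
def pack_mask(mask):
    """Pack a list of booleans into bytes, one bit per element (LSB first)."""
    packed = bytearray((len(mask) + 7) // 8)
    for i, keep in enumerate(mask):
        if keep:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)


def unpack_mask(packed, length):
    """Recover the first `length` booleans from a packed mask."""
    return [bool((packed[i // 8] >> (i % 8)) & 1) for i in range(length)]
```

The backward pass would unpack (or directly index) the bits to decide which activations were kept, so no information is lost by the compact representation.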
  • 23. NCCL + Elastic Fabric Adapter • Traditional HPC software stack in EC2: HPC application and MPI implementation (user space), TCP/IP stack and ENA network driver (kernel), ENA device • HPC software stack in EC2 with EFA: HPC application, MPI implementation and Libfabric (user space), EFA kernel driver (kernel), ENA device - Elastic Fabric Adapter (EFA) - For HPC and distributed ML - Bypass OS kernel - Integrated with MPI, NCCL - BERT training - 32 p3dn.24xlarge instances - 256 V100 GPUs - 100 Gb/s networking - BERT-large with GluonNLP - Batch size 64K, phase 1 - 90% strong scaling efficiency, with EFA enabled
  • 24. Distributed Stochastic Optimization (credit to: Shuai Zheng) • Update rules: $x_{t+1} = x_t - \eta_t g_t$ and $x_{t+1} = x_t - \eta_t \frac{g_t}{\|g_t\|_2}$ • Results (framework: batch size, #XPUs, #steps, optimizer, F1 score, training time) • TensorFlow: 64K/32K, 1K TPUs, 8599, LAMB [2], 90.58%, 76.19m • MXNet: 32K/32K, 512 GPUs, 7038/1563, LAMB + NG, 90.60%, 141.5m
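The two update rules on slide 24 differ only in whether the gradient is divided by its L2 norm before the step. The sketch below writes both out for a flat list of parameters. It is an illustration under assumptions, not the optimizer used in the talk: the function names are hypothetical, the zero-gradient handling is my own choice, and the layer-wise trust ratio of LAMB is not included.

```python
import math


def sgd_step(x, grad, lr):
    """Plain gradient step: x_{t+1} = x_t - lr * g_t."""
    return [xi - lr * gi for xi, gi in zip(x, grad)]


def normalized_gradient_step(x, grad, lr):
    """Normalized step: x_{t+1} = x_t - lr * g_t / ||g_t||_2."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return list(x)  # assumption: leave parameters unchanged on a zero gradient
    return [xi - lr * gi / norm for xi, gi in zip(x, grad)]
```

Because the gradient is rescaled to unit length, the step size of the normalized rule depends only on the learning rate and the gradient direction, not on the gradient magnitude.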
  • 25. References [1] Liu, Yinhan, et al. "RoBERTa: A robustly optimized BERT pretraining approach." arXiv preprint arXiv:1907.11692 (2019). [2] You, Yang, et al. "Large batch optimization for deep learning: Training BERT in 76 minutes." International Conference on Learning Representations. 2019.

Editor's Notes

  1. First call deck
  2. A
  3. What is special about this toolkit? Previously, each model had its own repo. Now all the SOTA models are in one place, which makes development smoother.
  4. Today we are launching Amazon FSx for Lustre, designed to meet the needs of these applications and others that you will undoubtedly dream up. Based on the mature and popular Lustre open source project, Amazon FSx for Lustre is a highly parallel file system that supports sub-millisecond access to petabyte-scale file systems. Thousands of simultaneous clients (EC2 instances and on-premises servers) can drive millions of IOPS (Input/Output Operations per Second) and transfer hundreds of gibibytes of data per second.