From Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNet

Training large deep learning models like Mask R-CNN and BERT takes lots of time and compute resources. Using MXNet, the Amazon Web Services deep learning framework team has been working with NVIDIA to optimize many different areas to cut the training time from hours to minutes.

  1. From Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNet. Haibin Lin, Applied Scientist, AWS AI; Lin Yuan, Software Design Engineer, AWS AI
  2. Dataset and Model Size Keep Growing [Charts: dataset size for training (GB); model parameter size (million)]
  3. Large Scale Distributed Training for Deep Neural Networks • Data parallelism • Model parallelism
  4. Optimization for Large Scale Distributed Training • System-level optimization • Accelerate training on a single GPU: fused operators, data prefetching, vectorization, cache utilization, tensor cores • Distributed training with multiple GPUs: large batch size, NCCL allreduce, Elastic Fabric Adapter • Algorithm-level optimization • Large-batch optimization algorithms • Model architecture • Accuracy/runtime trade-off
  5. Performance Optimization on AWS Cloud • Leverage Amazon EC2 P3dn.24xlarge GPU instances • 8 NVIDIA V100 Tensor Core GPUs with 32 GB of memory each • 96 Intel Xeon Scalable vCPUs • 1.8 TB local NVMe SSD • 100 Gbps network throughput • Support for Elastic Fabric Adapter • Software • Apache MXNet • GluonNLP and GluonCV toolkits • Horovod distributed training library
  6. Case Study: Mask R-CNN
  7. Deep learning nowadays: Mask R-CNN • Widely used for object detection and instance segmentation • Target accuracy • Bounding box AP: 37.7 • Mask AP: 33.9
  8. GluonCV: a Deep Learning Toolkit for Computer Vision • Training scripts that reproduce SOTA results reported in the latest papers • A large set of pre-trained models • Carefully designed APIs and easy-to-understand implementations • Community support • Built on top of the Apache MXNet framework • Tasks: image classification, object detection, semantic segmentation, pose estimation, video action recognition
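As a rough illustration of how GluonCV is typically used, a minimal sketch follows; it assumes a GluonCV 0.x release, and the model name comes from GluonCV's public model zoo rather than from this deck:

    # Sketch: load a pre-trained Mask R-CNN from the GluonCV model zoo and run it
    # on one image. 'demo.jpg' is a placeholder path.
    import mxnet as mx
    from gluoncv import model_zoo, data

    net = model_zoo.get_model('mask_rcnn_resnet50_v1b_coco', pretrained=True)
    x, orig_img = data.transforms.presets.rcnn.load_test('demo.jpg')
    ids, scores, bboxes, masks = [out[0] for out in net(x)]   # drop the batch dimension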
  9. GPU Profiling • Analyze runtime using the NVIDIA Visual Profiler • Identify large kernels to optimize • Hotspots in the profile: a slow operator, NHWC layout conversions, many small kernels
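Alongside the NVIDIA Visual Profiler, MXNet's own operator-level profiler can help find the small kernels mentioned above. A minimal sketch, assuming MXNet 1.x (the filename is arbitrary):

    # Sketch: dump per-operator timings to a JSON trace viewable in chrome://tracing.
    import mxnet as mx

    mx.profiler.set_config(profile_all=True, filename='mask_rcnn_profile.json')
    mx.profiler.set_state('run')
    # ... run a few training iterations here ...
    mx.nd.waitall()                 # make sure all asynchronous GPU work has finished
    mx.profiler.set_state('stop')
    mx.profiler.dump()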
  10. GPU Optimization: Runtime Improvements • Optimize ROIAlign: +10% • Optimize NMS: +10% • Fuse the RCNN target generator: +5% • NHWC layout conversion: +10% • Pointwise operator fusion: +3%
  11. Automatic Mixed Precision • Automatic casting of the model • Convolution, FullyConnected -> FP16 • Norm, Mean, SoftMax, etc. -> FP32 • Add, Mul, etc. -> cast to the widest type • AMP boosted throughput by 5~10% • Casting the gradients to FP16 gives another 1~2% throughput improvement without compromising accuracy • Utilities for dynamic loss scaling
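A minimal sketch of enabling AMP with dynamic loss scaling in MXNet (mxnet.contrib.amp, available from MXNet 1.5); build_model, train_loader, loss_fn and batch_size are placeholders, not the actual Mask R-CNN training code:

    from mxnet import autograd, gluon
    from mxnet.contrib import amp

    amp.init()                                   # patch operators for FP16/FP32 casting; call before building the model
    net = build_model()                          # placeholder for the model under training
    trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.01})
    amp.init_trainer(trainer)                    # enable dynamic loss scaling

    for data, label in train_loader:             # placeholder data loader
        with autograd.record():
            loss = loss_fn(net(data), label)
            with amp.scale_loss(loss, trainer) as scaled_loss:
                autograd.backward(scaled_loss)   # gradients are unscaled before the update
        trainer.step(batch_size)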
  12. Model Hybridization • MXNet provides APIs to construct and debug models using imperative programming • Users can invoke the hybridize API to boost model performance to the level of symbolic programming • Applying hybridization to the model achieved a 5% runtime improvement • Hybridizing with static_alloc gave another 1~2% throughput improvement
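A small sketch of the hybridize call; the two-layer network below is a stand-in, not the Mask R-CNN implementation:

    # Hybridizing a Gluon HybridBlock switches it from imperative execution to a
    # cached symbolic graph; static_alloc reuses buffers across iterations.
    from mxnet import nd
    from mxnet.gluon import nn

    net = nn.HybridSequential()
    net.add(nn.Dense(256, activation='relu'), nn.Dense(10))
    net.initialize()

    net.hybridize(static_alloc=True)                  # static_shape=True can help when shapes are fixed
    out = net(nd.random.uniform(shape=(32, 128)))     # first call builds and caches the graph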
  13. Performance Tuning in an AWS Cluster • Binding each GPU to 12 vCPUs (6 from each CPU socket) on the Amazon EC2 P3dn.24xlarge instance gives an 8% throughput improvement • Autotuning Horovod hyperparameters such as the tensor fusion threshold, cycle time, cache capacity, and hierarchical allreduce: +9% throughput • Increasing the number of data workers from 4 to 8 also accelerates data loading; note, however, that more data workers do not necessarily mean better performance, because of context-switching overhead • Accelerate the dataloader with Cython • Distributed validation significantly reduced validation time: 13 seconds/epoch on 24 P3dn instances vs. several minutes with non-distributed validation
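Two of these knobs sketched in code. Enabling Horovod autotuning through the HOROVOD_AUTOTUNE environment variable is an assumption about how it was configured here, and train_dataset/batch_size are placeholders:

    import os
    os.environ['HOROVOD_AUTOTUNE'] = '1'      # must be set before horovod.mxnet is imported

    from mxnet.gluon.data import DataLoader

    train_loader = DataLoader(
        train_dataset,          # placeholder dataset
        batch_size=batch_size,
        shuffle=True,
        num_workers=8,          # 8 worker processes; more is not always faster
        pin_memory=True,
    )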
  14. Case Study: BERT
  15. Deep learning nowadays: BERT [Chart: General Language Understanding Evaluation (GLUE) Benchmark scores, BERT vs. the human baseline]
  16. Transfer learning with BERT for NLP • Pre-training for NLP: learn text representations on a large-scale corpus • Fine-tuning for downstream tasks: named entity recognition, question answering, search, chatbots, text summarization, text classification • Models available in the GluonNLP toolkit • [Diagram: BERT as a feature extractor, e.g. classifying "GTC is awesome!" as positive] (Image credit: d2l.ai)
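A sketch of loading a pre-trained BERT from GluonNLP for fine-tuning, assuming the GluonNLP 0.x API; the model and dataset names come from GluonNLP's public catalog, not from this deck:

    import gluonnlp as nlp

    bert, vocab = nlp.model.get_model(
        'bert_12_768_12',
        dataset_name='book_corpus_wiki_en_uncased',
        pretrained=True,
        use_pooler=True,
        use_decoder=False,      # drop the masked-LM decoder head for fine-tuning
        use_classifier=False,   # drop the next-sentence-prediction head
    )
    # A task-specific head (e.g. a Dense layer on the pooled output) is then added on top.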
  17. GluonNLP: a deep learning natural language toolkit • Open source, available on SageMaker and in AWS Deep Learning Containers • State-of-the-art NLP models • Easy prototyping • Fast deployment • Multiple built-in NLP tasks
  18. BERT model architecture • Multi-head attention (Vaswani et al., 2017) × N (Image credit: d2l.ai)
  19. Pre-training objectives • 1. Masked language modeling • Estimate the probability of the masked tokens given the corrupted sentence • Randomly mask 15% of all tokens and predict them • Example: "I went to the bank to deposit some money." -> "I went to the <mask> to deposit some money." • 2. Next sentence prediction • 50% of the time, replace the next sentence with a random sentence • Learn logical coherence • Example: "<CLS> Haibin is obnoxious <SEP> I don't like his shirt" vs. "<CLS> Haibin is obnoxious <SEP> Hello world!"
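A minimal sketch of the token-masking step; this is an illustration only, not the GluonNLP pre-training code (real BERT masking also leaves 10% of the selected tokens unchanged and replaces 10% with random tokens):

    import random

    def mask_tokens(tokens, mask_token='<mask>', mask_prob=0.15):
        """Mask mask_prob of the positions and return the targets for the MLM loss."""
        tokens = list(tokens)
        num_to_mask = max(1, int(round(len(tokens) * mask_prob)))
        positions = random.sample(range(len(tokens)), num_to_mask)
        labels = {pos: tokens[pos] for pos in positions}   # original tokens to predict
        for pos in positions:
            tokens[pos] = mask_token
        return tokens, labels

    masked, labels = mask_tokens("I went to the bank to deposit some money .".split())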
  20. Data loading • Mini-batches are generated on the fly for dynamic masking [1] • Multi-process DatasetLoader with pre-fetching in the background • AWS FSx for Lustre: a file system for compute-intensive workloads • [Profiling visualization: the data-loading gap between the previous batch and the current batch] (Image credit: d2l.ai)
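A simplified stand-in for the multi-process loader with background prefetching described above; the real pipeline uses GluonNLP utilities and dynamic masking, pretraining_dataset is a placeholder, and the prefetch argument assumes MXNet 1.5+:

    from mxnet.gluon.data import DataLoader

    loader = DataLoader(
        pretraining_dataset,    # placeholder: dataset that applies dynamic masking per sample
        batch_size=32,
        shuffle=True,
        num_workers=4,          # worker processes prepare batches in the background
        prefetch=8,             # number of batches kept queued ahead of the GPU
        pin_memory=True,
    )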
  21. Fast Multi-head Self-Attention (credit: Clement Fuji Tsang) • Baseline, for each layer: • Separate projections: Qproj = Q·Wq, Kproj = Q·Wk, Vproj = Q·Wv • Transpose Qproj, Kproj, Vproj from (N, T, H, C) to (N, H, T, C) • Compute attention: score = batch_gemm(Qproj, Kproj); result = batch_gemm(score, Vproj) • Transpose result from (N, H, T, C) to (N, T, H, C) • Optimized: transpose Q from (N, T, HC) to (T, N, HC) once, then for each layer: • Joint projection: Wqkv = concat(Wq, Wk, Wv); Q_K_Vproj = Q·Wqkv • Compute attention: score = strided_batch_gemm(Qproj, Kproj); result = strided_batch_gemm(score, Vproj) • Transpose the final result from (T, N, HC) to (N, T, HC) • Higher cache utilization, 1.58x faster end to end
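A sketch of the joint-projection idea in plain Gluon: one wide Dense layer replaces the three separate Q/K/V projections, so three GEMMs collapse into one larger, better-utilized GEMM. Shapes follow the slide's (T, N, H*C) layout; this illustrates the layout and fusion, not the optimized strided-batch-GEMM kernel:

    from mxnet import nd
    from mxnet.gluon import nn

    T, N, H, C = 128, 8, 16, 64                      # seq length, batch, heads, per-head dim

    qkv_proj = nn.Dense(3 * H * C, flatten=False)    # joint W_qkv = concat(W_q, W_k, W_v)
    qkv_proj.initialize()

    x = nd.random.uniform(shape=(T, N, H * C))       # input already in (T, N, HC) layout
    qkv = qkv_proj(x)                                # (T, N, 3*H*C) produced by a single GEMM
    q, k, v = nd.split(qkv, num_outputs=3, axis=-1)  # each (T, N, H*C)

    def to_heads(t):
        # (T, N, H*C) -> (N*H, T, C) so attention scores can use batched GEMM per head
        return t.reshape(T, N * H, C).transpose((1, 0, 2))

    scores = nd.batch_dot(to_heads(q), to_heads(k), transpose_b=True) / (C ** 0.5)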
  22. GPU memory is precious • For each mini-batch, the gradient is synchronized across GPUs • Gradient allreduce can overlap with backward computation • A larger batch size leaves more time to hide communication latency • A 1-bit dropout mask leads to a 20% memory reduction, enabling larger batch sizes • [Timeline: the allreduce of each layer's gradients overlaps with the backward pass of the remaining layers] • We can overlap computation and communication (Image credit: d2l.ai)
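A sketch of data-parallel training with Horovod's MXNet bindings: DistributedTrainer launches allreduce as gradients become available during the backward pass, which is what enables the overlap above. build_model, train_loader and loss_fn are placeholders:

    import horovod.mxnet as hvd
    import mxnet as mx
    from mxnet import autograd

    hvd.init()
    ctx = mx.gpu(hvd.local_rank())                   # one GPU per process

    net = build_model()                              # placeholder model factory
    net.initialize(ctx=ctx)
    hvd.broadcast_parameters(net.collect_params(), root_rank=0)

    trainer = hvd.DistributedTrainer(net.collect_params(), 'sgd',
                                     {'learning_rate': 0.01 * hvd.size()})

    for data, label in train_loader:                 # placeholder loader
        with autograd.record():
            loss = loss_fn(net(data), label)
        loss.backward()                              # allreduce overlaps with this backward pass
        trainer.step(data.shape[0])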
  23. NCCL + Elastic Fabric Adapter • Elastic Fabric Adapter (EFA) • For HPC and distributed ML • Bypasses the OS kernel • Integrated with MPI and NCCL • [Diagram: traditional HPC software stack in EC2 (MPI over the TCP/IP stack and ENA network driver) vs. the EFA stack (MPI over Libfabric and the EFA kernel driver, bypassing the kernel network path)] • BERT training • 32 p3dn.24xlarge instances, 256 V100 GPUs • 100 Gb/s networking • BERT-large with GluonNLP • Batch size 64K, phase 1 • 90% strong scaling efficiency with EFA enabled
  24. Distributed Stochastic Optimization (credit: Shuai Zheng) • SGD update: x_{t+1} = x_t - η_t g_t • Normalized gradient (NG) update: x_{t+1} = x_t - η_t g_t / ‖g_t‖_2

    Framework  | batch size | #XPUs    | #steps    | optimizer | F1 score | training time
    TensorFlow | 64K/32K    | 1K TPUs  | 8599      | LAMB [2]  | 90.58%   | 76.19 min
    MXNet      | 32K/32K    | 512 GPUs | 7038/1563 | LAMB + NG | 90.60%   | 141.5 min
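The two updates written as a minimal sketch, with g taken as the already-allreduced gradient of one parameter tensor; the small eps term is an added safeguard, not part of the slide's formula:

    from mxnet import nd

    def sgd_update(x, g, lr):
        return x - lr * g

    def normalized_gradient_update(x, g, lr, eps=1e-6):
        return x - lr * g / (nd.norm(g) + eps)     # eps guards against a zero-norm gradient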
  25. References
  [1] Liu, Yinhan, et al. "RoBERTa: A Robustly Optimized BERT Pretraining Approach." arXiv preprint arXiv:1907.11692 (2019).
  [2] You, Yang, et al. "Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes." International Conference on Learning Representations, 2019.
  26. Thank you • Haibin Lin: haibilin@amazon.com • Lin Yuan: lnyuan@amazon.com

Editor's Notes

  • First call deck

  • What is special about this toolkit? Previously, each model had its own repo; now all the SOTA models are in one place. Smooth to develop with.

  • Today we are launching Amazon FSx for Lustre, designed to meet the needs of these applications and others that you will undoubtedly dream up. Based on the mature and popular Lustre open source project, Amazon FSx for Lustre is a highly parallel file system that supports sub-millisecond access to petabyte-scale file systems. Thousands of simultaneous clients (EC2 instances and on-premises servers) can drive millions of IOPS (Input/Output Operations per Second) and transfer hundreds of gibibytes of data per second.
