
High-Performance Input Pipelines for Scalable Deep Learning


A production AI system is more than a trained deep learning model. It also includes 1) ingesting and running inference on new data, 2) transforming, processing, and cleaning new data to incorporate it into the training set, 3) continuously retraining to update the model and keep learning, and 4) an experimental pipeline for testing improvements to the AI models. This presentation focuses on the high-performance, highly scalable storage needed to take advantage of ever-larger datasets in model training.

We describe the common stages in an input pipeline for deep learning training and their resource requirements. We then present a benchmark-based approach for identifying bottlenecks in the pipeline, using the ImageNet dataset to show linear scaling of training performance from 1 GPU to 32 GPUs. The AI-ready infrastructure presented here achieves the goal of providing scalable training performance with simplicity, eliminating the need for complex configuration and tuning of infrastructure components.

Speaker:
Joshua Robinson, Founding Engineer
Pure Storage

Published in: Technology


  1. HIGH-PERFORMANCE INPUT PIPELINES FOR SCALABLE DEEP LEARNING (Joshua Robinson, Pure Storage)
  2. © 2019 PURE STORAGE INC. QUESTION ON EVERYONE'S MIND: WHY IS A STORAGE COMPANY HERE?
  3. "We don't have better algorithms, we just have more data." (Peter Norvig, Engineering Director, Google)
  4. The AI "Hierarchy of Needs" (credit: Monica Rogati): data acquisition (ingest, transformation, and representation of data for analysis); data preparation (cleaning, feature identification, exploration, etc.); ML algorithms (linear & logistic regression, k-means clustering, decision trees, etc.); validation (A/B testing, detecting model drift over time).
  5. THIS IS NOT THE FIRST AI HYPE WAVE. Timeline, 1950-2020: birth of AI, AI winter I, re-birth I, AI winter II, re-birth II. Common theme of the winters: compute and data couldn't match the needs of the problems being hyped. Common theme of today's re-birth: focus on specific problems where available compute & data are sufficient.
  6. DEEP LEARNING = MASSIVE DATA & COMPUTE. Deep learning accuracy keeps improving as data & compute grow, while previous methods plateau; state-of-the-art results across vision, speech, language, and more. Sources: https://arxiv.org/abs/1506.01497; https://arxiv.org/abs/1703.06870; https://shubhangdesai.github.io/blog/Neural-Style.html; https://cs.stanford.edu/people/karpathy/cnnembed/
  7. THE INTUITION BEHIND DEEP LEARNING: a deep neural net transforms a "dog" image through successive feature layers (primitives, rough shapes, macro features) into class probabilities, e.g. Pr{dog} = 0.903, Pr{cat} = 0.072, ...
  8. TRAINING A DEEP NEURAL NETWORK: repeat three steps: evaluate the network on a batch (e.g. Pr{dog} = 0.903, Pr{cat} = 0.072, ...), compute gradients, and apply gradients to the weights.
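The evaluate / compute-gradients / apply-gradients loop above can be sketched in a few lines of plain Python; the quadratic `loss` here is a hypothetical stand-in for a real network's forward pass, not the model used in the talk.

```python
# Minimal sketch of the training loop, assuming a toy 1-D "model":
# a single weight w with loss (w - 3)^2, minimized at w = 3.0.

def loss(w):
    # "evaluate": forward pass on a batch
    return (w - 3.0) ** 2

def grad(w):
    # "compute gradients": d/dw (w - 3)^2 = 2 * (w - 3)
    return 2.0 * (w - 3.0)

w = 0.0    # initial weight
lr = 0.1   # learning rate (arbitrary choice for the sketch)
for step in range(100):
    g = grad(w)       # compute gradients
    w = w - lr * g    # apply gradients

print(round(w, 3))    # converges toward 3.0
```

A real framework repeats exactly this cycle per batch, just with millions of weights and automatic differentiation.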
  9. DISTRIBUTED TRAINING: each GPU evaluates the model and computes gradients on its own data; gradients are merged across GPUs, and every GPU applies the same merged update. Throughput scales with # GPUs.
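The data-parallel scheme on this slide can be sketched without any framework: each "worker" computes a gradient on its own shard, the gradients are averaged (the merge / all-reduce step), and every worker applies the identical update. The 1-D model and the `y = 2x` data are hypothetical.

```python
# Sketch of data-parallel training with a merged (averaged) gradient,
# assuming a toy model y = w * x and 4 workers, each with its own shard.

def grad(w, batch):
    # gradient of mean squared error for y_hat = w * x on one batch
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

# shards of a dataset generated by y = 2x, split across 4 "workers"
shards = [[(x, 2.0 * x) for x in range(i, 8, 4)] for i in range(4)]

w, lr = 0.0, 0.02
for step in range(200):
    grads = [grad(w, shard) for shard in shards]  # per-worker gradients
    merged = sum(grads) / len(grads)              # merge: all-reduce average
    w -= lr * merged                              # same update on every worker

print(round(w, 3))  # converges toward the true slope 2.0
```

Because every worker applies the same averaged gradient, all replicas stay in sync; this is the pattern Horovod implements with an efficient ring all-reduce.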
  10. MORE, FASTER GPUs + MORE DATA
  11. CAN WE KEEP GPUs FED WITH DATA? INPUT PIPELINE = POTENTIAL BOTTLENECK
  12. INPUT PIPELINES: CAN IT BE THAT SIMPLE? Source: K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition, CVPR 2016.
  13. REAL INPUT PIPELINES: CAN YOU SPOT THE BOTTLENECK? Source: K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition, CVPR 2016.
  14. FROM IMAGES TO TENSORS: 1. Enumerate the labeled images (plane, dog, boat, cat, ...).
  15. FROM IMAGES TO TENSORS: 1. Enumerate; 2. Associate labels.
  16. FROM IMAGES TO TENSORS: 1. Enumerate; 2. Associate labels; 3. Shuffle.
  17. FROM IMAGES TO TENSORS: 1. Enumerate; 2. Associate labels; 3. Shuffle; 4. Read, crop, distort.
  18. FROM IMAGES TO TENSORS: 1. Enumerate; 2. Associate labels; 3. Shuffle; 4. Read, crop, distort; 5. Copy to GPU.
  19. FROM IMAGES TO TENSORS: 1. Enumerate; 2. Associate labels; 3. Shuffle; 4. Read, crop, distort; 5. Copy to GPU. ANY OF THESE STEPS CAN BE A POTENTIAL BOTTLENECK. Other domains (NLP, speech, etc.) follow a similar(ish) flow.
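The five steps above can be sketched as a plain Python generator; the file names and label scheme are hypothetical, and the decode and device-transfer steps are placeholder strings standing in for real image ops.

```python
import random

# Sketch of the 5-step input pipeline: enumerate, label, shuffle,
# read/crop/distort, copy to GPU. All file names here are made up.

def build_pipeline(seed=0):
    # 1. Enumerate: list the input files
    files = ["dog_%d.jpg" % i for i in range(4)] + \
            ["cat_%d.jpg" % i for i in range(4)]
    # 2. Associate labels: derive each label from its file name
    examples = [(f, f.split("_")[0]) for f in files]
    # 3. Shuffle: randomize example order (per epoch)
    random.Random(seed).shuffle(examples)
    for fname, label in examples:
        # 4. Read, crop, distort: placeholder for decode + augmentation
        tensor = "decoded(%s)" % fname
        # 5. Copy to GPU: in a real pipeline, a host-to-device transfer
        yield tensor, label

batch = list(build_pipeline())
print(len(batch))  # 8 shuffled (tensor, label) pairs
```

In the talk's actual stack this structure maps onto the TensorFlow Datasets API, where each step becomes a `list_files` / `map` / `shuffle` / `prefetch` stage, and any one stage can become the bottleneck.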
  20. EVALUATION METHODOLOGY: ImageNet, 1.3M images, 1000 categories.
  21. HARDWARE STACK (AIRI):
      • Compute: 4x NVIDIA DGX-1, each with 8x Tesla V100 GPUs (SXM2), 2x Intel E5-2698 v4 @ 2.20GHz, 512GB DDR4-2400, 4x Mellanox MT27700 100Gb/s VPI adapters
      • Storage: Pure Storage FlashBlade, 15x 17TB blades, 179TB usable before data reduction
      • Network: Arista DCS-7060CX2-32S (32x 100Gb/s QSFP100 ports); 100Gb Ethernet w/ RDMA (RoCE); 40Gb Ethernet
  22. SOFTWARE STACK: nvcr.io/nvidia/tensorflow:17.12 container on DGX-OS (Ubuntu 16.04); CUDA 9.0, cuDNN v7, NCCL 2.1.2, OpenMPI 3.0; TensorFlow 1.4.0+ with Horovod; benchmark: alsrgv/tf_cnn_benchmarks; input pipelines use the TensorFlow "Datasets" API.
  23. TRAINING WITH 1 GPU. How do we know what good looks like? Replace the input pipeline with synthetic data. Images per second when training Inception3 (batch size = 64): "default" training pipeline (forward / input pipeline / backward): 216 i/s; synthetic data (no input pipeline): 228 i/s.
  24. TRAINING WITH 1 GPU. Adding a prefetch queue improves scheduler behavior. Images per second when training Inception3 (batch size = 64): defaults: 216 i/s; defaults + prefetch: 225 i/s; synthetic: 228 i/s. SHOULD WE CARE ABOUT 5%?
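The prefetch-queue idea can be sketched with a background producer thread and a bounded queue, so the consumer (the training step) pulls batches that are already prepared instead of waiting on preprocessing. The queue depth, item count, and simulated preprocessing delay below are arbitrary assumptions for illustration.

```python
import queue
import threading
import time

# Sketch of a prefetch queue: a background thread ("input pipeline")
# fills a bounded queue while the consumer ("GPU") drains it.

def producer(q, n_items):
    for i in range(n_items):
        time.sleep(0.001)  # simulate preprocessing cost per batch
        q.put(i)           # blocks when the queue is full (backpressure)
    q.put(None)            # sentinel: no more data

prefetch = queue.Queue(maxsize=8)  # bounded prefetch depth
threading.Thread(target=producer, args=(prefetch, 32), daemon=True).start()

consumed = []
while True:
    item = prefetch.get()  # the training step takes the next ready batch
    if item is None:
        break
    consumed.append(item)

print(len(consumed))  # all 32 batches arrive, in order
```

The bounded `maxsize` matters: it decouples producer and consumer so their work overlaps, while capping host memory used for buffered batches.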
  25. SCALING TO 32 GPUs (4x DGX-1). Images per second when training Inception3 (batch size = 64/GPU): defaults: 4143 i/s (a 42% gap vs. linear!); + prefetch: 5335 i/s; + thread pool limit: 5527 i/s; - distortions: 6440 i/s; synthetic: 6580 i/s; linear scaling target: 7200 i/s. Thread pool limit: avoid over-subscribing the CPU with too many threads (inter_op_parallelism_threads). No distortions: skip the preprocessing step of the input pipeline; an unrealistic configuration, but it exposes the bottleneck. EXCELLENT SCALABILITY, BUT STILL MORE WORK TO BE DONE.
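The gaps quoted on this slide follow directly from its throughput numbers; this small check recomputes each configuration's distance from the 7200 i/s linear-scaling target.

```python
# Recomputing each configuration's gap vs. the linear-scaling target
# (7200 images/s), using only the numbers given on the slide.

linear = 7200.0
configs = [
    ("Defaults",            4143),
    ("+ Prefetch",          5335),
    ("+ Thread pool limit", 5527),
    ("- Distortions",       6440),
    ("Synthetic",           6580),
]
gaps = {}
for name, rate in configs:
    gaps[name] = (linear - rate) / linear * 100  # percent below linear
    print("%-20s %5d i/s  gap %4.1f%%" % (name, rate, gaps[name]))
```

Defaults sit about 42% below linear, matching the slide; prefetching and thread-pool limits close much of the gap, and the remaining distance to synthetic shows preprocessing is the dominant residual cost.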
  26. 2.5X Performance Improvement
  27. SCALE OF REAL-WORLD DATA: ImageNet: 143 GB; Zenuity: 20 PB.
  28. SINGLE-GPU TRAINING: evaluate, compute gradients, apply gradients (e.g. Pr{dog} = 0.903, Pr{cat} = 0.072, ...).
  29. DISTRIBUTED TRAINING: each GPU evaluates and computes gradients; gradients are merged across GPUs before each GPU applies the same update. Throughput scales with # GPUs.
  30. LINEAR SCALING FOR CONVNETS. Images per second:
          Model        1 DGX-1    2 DGX-1    4 DGX-1
          ResNet-50    2540 i/s   4870 i/s   10244 i/s
          Inception3   1600 i/s   3160 i/s   6440 i/s
          VGG16        1640 i/s   3110 i/s   6300 i/s
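The "linear" claim can be quantified from the slide's own numbers: scaling efficiency is the measured 4-node rate divided by four times the single-node rate.

```python
# Scaling efficiency at 4 DGX-1s, computed from the slide's throughput
# numbers: efficiency = measured 4-node rate / (4 x single-node rate).

rates = {  # images/s at 1, 2, and 4 DGX-1s
    "ResNet-50":  (2540, 4870, 10244),
    "Inception3": (1600, 3160, 6440),
    "VGG16":      (1640, 3110, 6300),
}
efficiency = {}
for model, (one, two, four) in rates.items():
    efficiency[model] = four / (4.0 * one)
    print("%-10s 4-node scaling efficiency: %.0f%%"
          % (model, efficiency[model] * 100))
```

All three models land at or near 100% efficiency, which is what the slide's "linear scaling" headline asserts.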
  31. RDMA OVER ETHERNET: RDMA is essential for peak performance.
  32. KEEPING GPUs FED WITH DATA: the input queue is full; need more/faster GPUs?
  33. FROM IMAGES TO TENSORS: 1. Enumerate; 2. Associate labels; 3. Crop and distort.
