
Jeff Johnson, Research Engineer, Facebook at MLconf NYC


Hacking GPUs for Deep Learning: GPUs have revolutionized machine learning in recent years, and have made both massive and deep multi-layer neural networks feasible. However, misunderstandings on why they seem to be winning persist. Many of deep learning’s workloads are in fact “too small” for GPUs, and require significantly different approaches to take full advantage of their power. There are many differences between traditional high-performance computing workloads, long the domain of GPUs, and those used in deep learning. This talk will cover these issues by looking into various quirks of GPUs, how they are exploited (or not) in current model architectures, and how Facebook AI Research is approaching deep learning programming through our recent work.

Published in: Technology

  1. 1. Hacking GPUs for Deep Learning MLConf New York Jeff Johnson Facebook AI Research
  2. 2. Deep (convolutional) Neural Networks Revolution in machine learning Convolution: since 1980s. Deep: flops since 2000s Avoid feature engineering ▪ With enough data, let network discover feature representations ▪ Can work even for NLP. No word segmentation, use raw character data.
  3. 3. 2D Convolutional Nets (images) LeCun, Bottou, Bengio and Haffner, 1998 Krizhevsky, Sutskever and Hinton, 2012
  4. 4. 2D Convolutional Nets Progress towards smaller kernels and deeper nets
      Network architecture | ImageNet 1000 class top-5 error
      AlexNet | ~15%
      OverFeat | ~13%
      ZeilerNet | ~11%
      Oxford-VGG | ~7%
      GoogLeNet | ~6%, ~4.5%
      PReLU (MSR) | ~4.9%
      Human performance | 3-5%
  5. 5. 3D Convolutional Nets (videos) C3D (Tran et al., 2014) DeepVideo (Karpathy et al., 2014)
  6. 6. 1D Convolutional Nets (text, sequences) Collobert et al., 2011 Zhang and LeCun, 2015
  7. 7. RNNs and LSTMs (text, sequences) Graves, Mohamed and Hinton, 2013 Mikolov, 2014
  8. 8. Deep Neural Networks Supervised learning. Unsupervised ??? Train with back-propagation/SGD variants Strong scaling is unsolved ▪ Distributed parameter space exploration (e.g., Hogwild!; Niu et al. 2011) ▪ Distributed hyperparameter space exploration (e.g., Bayesian optimization; Snoek et al. 2012)
  9. 9. Characteristics
  10. 10. Deep nets are flop eaters Convolutions are expensive Pointwise calculations (log/exp, ReLU, */+, ...) Neighborhood reductions (pooling, convolution) Scaling network parameters => increased learning capacity; overfitting => more training data (real or synthetic), regularization required
  11. 11. Deep nets are bandwidth eaters More parameters = more memory, data to exchange Barrier to cross-machine parallelism ▪ periodic exchanges, compression, quantization Increase reuse of memory while local? ▪ interspersed reductions are resistant to fusion of computations ▪ generalized programming language problem
  12. 12. Deep nets are latency sensitive Serial dependency of training fprop => bprop => fprop => ... Serial dependency of multi-layer networks layer 1 => layer 2 => layer 3 => ... Multiple path dependent networks (RNNs, multi-layer LSTMs)
  13. 13. Deep nets are also small? Deeper = smaller feature planes, more of them input R^m => expand to R^n => non-lin => reduce to R^k Problems are tiny in HPC terms 4096×4096 FFT, FE/PDE on massive grids, ... NLP tasks can be sparse Setup/kernel launch latency on GPU can dominate compute
  14. 14. The tools
  15. 15. Vector processors SIMD: Single Instruction, Multiple Data Serial processor with ability to operate on more than one piece of data concurrently Cray-1 (1976)
  16. 16. Vector processors Hard to use: instructions only operate on 4, 8, 16, ... pieces of data at a time. Boundary/alignment effects. Great if your vectors are large, but...
      float* a = ...; // is this aligned (a % 16 == 0)?
      float* b = ...; // is this aligned (b % 16 == 0)?
      for (i = 0; i < 18; ++i) { // how to handle [16, 17]?
        b[i] += a[i]; // SIMD this?!? masking/loop epilogue
      }
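One way to handle the leftover elements [16, 17] is a fixed-width main loop followed by a scalar epilogue. The sketch below is plain C rather than real SIMD intrinsics; `LANES` and `add_into` are hypothetical names, and the width of 16 is an assumption chosen to match the slide's 18-element loop. It does not address the alignment question.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 16  /* assumed vector width in elements (18 = 16 + 2 on the slide) */

/* b[i] += a[i] for i in [0, n). Returns how many elements the epilogue handled. */
size_t add_into(float *b, const float *a, size_t n) {
    assert(a != NULL && b != NULL);
    size_t main_end = n - (n % LANES);  /* largest multiple of LANES <= n */
    size_t i = 0;

    /* Main body: fixed-width chunks that a compiler can map to vector ops. */
    for (; i < main_end; i += LANES) {
        for (size_t lane = 0; lane < LANES; ++lane) {
            b[i + lane] += a[i + lane];
        }
    }

    /* Epilogue: the remaining n % LANES elements, one at a time. */
    size_t tail = n - main_end;
    for (; i < n; ++i) {
        b[i] += a[i];
    }
    return tail;
}
```

With n = 18 the main loop covers indices 0 to 15 and the epilogue covers 16 and 17. Masked vector instructions are the other option the slide mentions; they are not shown here.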
  17. 17. “Vector cores”? SIMD variant: NVIDIA calls “SIMT” Lots of simple cores (CM) Hide latency through many threads + switching (Tera) “Pixel/vertex shaders” in 2000s GPUs => GPGPU CM-1 (1983) Tera MTA (1995)
  18. 18. GPU versus CPU GPUs represent a different form of vector programming (“vector cores”) ▪ 32-wide vector of threads (“warp”) Sufficiently optimized CPU code can be on par with GPU perf (Tflop range with AVX2/512, exploit multi-level caches, deep pipelines, prefetch, ...) Vector programming: easier with GPUs than CPUs Sweetspot is different from GPU codes
  19. 19. Parallelization + vectorization Serial nature of commonly used CPU programming languages sometimes hides opportunities Auto-vectorizing/parallelizing compilers + DSLs can’t yet compete with expert hand-rolled ▪ DSLs like Halide (Ragan-Kelley et al. 2013) show promise but need a few more generations Sprinkle in (OpenMP) doesn’t cut it
  20. 20. Who wins
      Criterion | CPU | GPU
      flops | ✔ (vectorize: AVX2/512 gives Tflop range) | ✔ Tesla K40: 2880 fp32 ALU pipelines
      main memory b/w | ✖ (Xeon Phi improves) | ✔
      latency | ✔ (high clock, reordering; caches are large and work if you obey them) | ✖ (threads slow, non-smem caches irrelevant, CPU -> GPU control overhead)
      boundary effects, small/irregular sizes | ✔✖ (branches easy, vectorization hard) | ✖ (warp divergence, load imbalance)
      parallel programming model | ✖ (vectorization hard, perf black box) | ✔✖ (CUDA is very different, domain knowledge)
  21. 21. Tool + problem = solution?
  22. 22. Dive into 2D Convolutional Nets Somewhat computationally expensive O(b × f × f’ × n^2 × k^2) 1st layer AlexNet: ▪ 13.493 Gflop (1 flop here = fp32 multiply-add) ▪ 77.2 Mbyte in, 63.7 Mbyte out (fp32) ▪ Perfect caching + reuse, 175 flop/byte in ▪ No caching + reuse, 0.125 flop/byte in
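The flop and flop/byte figures on this slide can be reproduced with a few lines of arithmetic. The sketch below makes two assumptions that the slide does not state: the first-layer output is 55 × 55, and "bytes in" counts the fp32 input minibatch plus the fp32 filter weights. Both were chosen because they reproduce the slide's 13.493 Gflop and 77.2 Mbyte. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply-adds for one 2D convolution layer.
   `out` is the output width/height (assumed 55 for AlexNet layer 1). */
uint64_t conv_multiply_adds(uint64_t b, uint64_t f, uint64_t f_out,
                            uint64_t out, uint64_t k) {
    assert(b > 0 && f > 0 && f_out > 0 && out > 0 && k > 0);
    return b * f * f_out * out * out * k * k;
}

/* Bytes read, assuming fp32 (4 bytes) for the n x n inputs and k x k filters. */
uint64_t conv_input_bytes(uint64_t b, uint64_t f, uint64_t f_out,
                          uint64_t n, uint64_t k) {
    return 4 * (b * f * n * n + f_out * f * k * k);
}

/* Arithmetic intensity if every input byte is loaded exactly once. */
double flop_per_byte(uint64_t flops, uint64_t bytes) {
    assert(bytes > 0);
    return (double)flops / (double)bytes;
}
```

With b = 128, f = 3, f’ = 96, n = 224, k = 11 and an assumed output size of 55, this gives 13,493,145,600 multiply-adds over 77,209,728 input bytes, or about 175 flop/byte. The no-reuse figure follows from one multiply-add per two freshly loaded fp32 operands: 1 flop / 8 bytes = 0.125 flop/byte.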
  23. 23. The problem Programmable caches (shared memory, registers, ...) not large enough for perfect reuse Space of all possible square 2D convolution problems is 5/6-dimensional
      Parameter | Size
      minibatch size (b) | 128
      input feature maps (f) | 3
      output feature maps (f’) | 96
      input feature size (n x n) | 224
      convolution kernel size (k x k) | 11
      convolution kernel stride (S x S) (optional) | 4
  24. 24. Converting Space of all possible matrix multiplications = 3 dimensional (A_{N×M} B_{M×P} = C_{N×P}) NVIDIA, Intel, others have put lots of effort into optimizing many parts of this space ▪ Rephrase convolution as a matrix multiplication! ▪ NVIDIA’s cuDNN
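A common way to rephrase convolution as a matrix multiplication is to unroll each k × k input patch into a row (often called im2col) and multiply by the flattened filter. The sketch below is a minimal single-channel, stride-1, unpadded version with a naive `matmul` standing in for an Sgemm call. It illustrates the general idea only and is not a description of how cuDNN does it; all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Unroll every k x k patch of an n x n image into one row of `cols`.
   `cols` holds out*out rows of k*k floats, where out = n - k + 1. */
void im2col(const float *img, size_t n, size_t k, float *cols) {
    assert(k >= 1 && k <= n);
    size_t out = n - k + 1;
    for (size_t y = 0; y < out; ++y) {
        for (size_t x = 0; x < out; ++x) {
            float *row = cols + (y * out + x) * k * k;
            for (size_t dy = 0; dy < k; ++dy)
                for (size_t dx = 0; dx < k; ++dx)
                    row[dy * k + dx] = img[(y + dy) * n + (x + dx)];
        }
    }
}

/* C (rows x p) = A (rows x m) * B (m x p), all row-major. */
void matmul(const float *A, const float *B, float *C,
            size_t rows, size_t m, size_t p) {
    for (size_t i = 0; i < rows; ++i) {
        for (size_t j = 0; j < p; ++j) {
            float acc = 0.0f;
            for (size_t t = 0; t < m; ++t) acc += A[i * m + t] * B[t * p + j];
            C[i * p + j] = acc;
        }
    }
}

/* Direct sliding-window version (cross-correlation, as used in conv nets). */
void conv_direct(const float *img, size_t n, const float *w, size_t k,
                 float *out_map) {
    size_t out = n - k + 1;
    for (size_t y = 0; y < out; ++y) {
        for (size_t x = 0; x < out; ++x) {
            float acc = 0.0f;
            for (size_t dy = 0; dy < k; ++dy)
                for (size_t dx = 0; dx < k; ++dx)
                    acc += img[(y + dy) * n + (x + dx)] * w[dy * k + dx];
            out_map[y * out + x] = acc;
        }
    }
}
```

Calling `im2col` and then `matmul` with the filter as a (k·k) × 1 matrix gives the same output map as `conv_direct`. With f’ filters the right-hand matrix gains f’ columns, which is what makes the product large enough for a tuned gemm. The cost is the extra memory for `cols`, which duplicates overlapping patches.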
  25. 25. But: Sgemm originally optimized for large problems 13x13 * 3x3 is a small convolution. Unrolling it 192 times might be enough to feed the GPU Large convolutions are intractable? Small feature maps/convolutions = boundary effects, bad for GPUs
  26. 26. Facebook AI Research work 2D convolution via FFT Fast convolutional nets with fbfft: A GPU Performance Evaluation (Vasilache, Johnson et al., 2015 ICLR conference track oral) Convolution => pointwise × in Fourier basis Choice of basis is wide open! 2^i is great perf O(b f f’ n^2 k^2) => O(b f f’ n^2 + (b f + f f’ + b f’) n^2 log n) ▪ >= 5x5 kernels, faster than cuDNN
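The complexity claim on this slide rests on the convolution theorem. The block below restates it and annotates the two terms of the FFT cost. It is a sketch of the standard argument, not a derivation taken from the fbfft paper; the symbol \odot for the pointwise product is a notational assumption, and the identity as written holds for circular convolution, so inputs are zero-padded to a common size in practice.

```latex
% Convolution theorem: convolution in the spatial domain becomes a
% pointwise product in the Fourier domain.
\[
  y = x * w
  \quad\Longleftrightarrow\quad
  \mathcal{F}(y) = \mathcal{F}(x) \odot \mathcal{F}(w)
\]

% Cost comparison from the slide, with each term labelled.
\[
  \underbrace{O\!\left(b\,f\,f'\,n^2 k^2\right)}_{\text{direct convolution}}
  \;\Longrightarrow\;
  O\Big(
    \underbrace{b\,f\,f'\,n^2}_{\text{pointwise products}}
    \;+\;
    \underbrace{(b\,f + f\,f' + b\,f')\,n^2 \log n}_{\text{transforms of inputs, filters, outputs}}
  \Big)
\]
```

The k^2 factor disappears from the FFT cost, which is consistent with the slide's observation that the approach pays off for larger kernels (>= 5x5).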
  27. 27. fbfft cuFFT optimized for large FFT sizes fbfft: smaller data, fit in registers, focus on warp
  28. 28. Data layout Different problem sizes => different data layout ▪ cudaconv: DHWB (optimal for large b) ▪ deeper layers: HWBD/BHWD (many feature maps) ▪ b=1 faster convergence? ▪ b=128 better compute utilization Smaller problems, exploit different layout/batching ▪ fbcunn 1D convolution
  29. 29. Latency hiding: what holds you back? ▪ Compute bound? (math) ▪ Memory b/w bound? (streaming) ▪ Memory latency bound? (sparse) Almost all “deep learning” algorithms are b/w bound on GPU. Low math intensity! cuBLAS: Sgemm b/w bound. Dgemm compute bound
  30. 30. Kernel fusion: CPU vs GPU Reduces memory b/w pressure Exploits cache locality and register reuse CPU: fusion not necessary Kernel tiling + interleaving works due to caches GPU: fusion necessary Tiling + interleaving doesn’t work: smem not persistent, caches too small/irrelevant
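The bandwidth argument for fusion can be seen in a toy pointwise pipeline. The sketch below uses a scale followed by ReLU as a stand-in; the operations and function names are illustrative assumptions, and it is plain C rather than CUDA. On a GPU each unfused loop would correspond to a separate kernel launch with the intermediate living in global memory.

```c
#include <assert.h>
#include <stddef.h>

/* Unfused: two passes. The intermediate `tmp` is written out and read back,
   so roughly 4n floats move through memory. */
void scale_then_relu_unfused(const float *x, float *tmp, float *y,
                             size_t n, float s) {
    assert(x != NULL && tmp != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i) {
        tmp[i] = s * x[i];                      /* pass 1: scale */
    }
    for (size_t i = 0; i < n; ++i) {
        y[i] = tmp[i] > 0.0f ? tmp[i] : 0.0f;   /* pass 2: ReLU */
    }
}

/* Fused: one pass. The intermediate stays in a register,
   so roughly 2n floats move through memory. */
void scale_then_relu_fused(const float *x, float *y, size_t n, float s) {
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i) {
        float t = s * x[i];
        y[i] = t > 0.0f ? t : 0.0f;
    }
}
```

Both versions compute the same result; the fused one halves the memory traffic and removes the temporary buffer. For bandwidth-bound workloads that traffic, not the arithmetic, is what sets the runtime.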
  31. 31. Kernel fusion CUDA kernel = hard optimization boundary on GPU Loop interchange, lifting, better fusion on CPU CUDA: parallelization layer not visible to optimizer. Auto-tuning desired. HW specific non-linear tradeoffs Scripting languages are further barrier to fusion on both CPU and GPU (Torch)
  32. 32. Kernel fusion Torch: transposition is common operation ▪ size (80, 40) stride (40, 1) => size (40, 80) stride (1, 40) ▪ Old approach: transpose in memory, perform work, copy back ▪ New approach: rewrite kernel to handle transpositions. Optimize if non-transposed Runtime fusion (CUDA 7.0, Theano)
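The size/stride pair on this slide describes a transpose that moves no data: swapping the two sizes and the two strides gives a view of the same buffer. The sketch below uses a hypothetical `View2D` struct to show the idea. It is not Torch's tensor type, and the contiguity check is only one possible way to select an optimized non-transposed path.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 2D strided view over a float buffer (strides in elements). */
typedef struct {
    const float *data;
    size_t size[2];
    size_t stride[2];
} View2D;

/* Transpose by swapping sizes and strides; the buffer itself is untouched. */
View2D view_transpose(View2D v) {
    View2D t = v;
    t.size[0] = v.size[1];
    t.size[1] = v.size[0];
    t.stride[0] = v.stride[1];
    t.stride[1] = v.stride[0];
    return t;
}

/* Element (i, j) of the view, whatever its layout. */
float view_at(View2D v, size_t i, size_t j) {
    assert(i < v.size[0] && j < v.size[1]);
    return v.data[i * v.stride[0] + j * v.stride[1]];
}

/* True for a plain row-major layout, where a faster kernel could be used. */
int view_is_contiguous(View2D v) {
    return v.stride[1] == 1 && v.stride[0] == v.size[1];
}
```

Starting from size (80, 40) and stride (40, 1), `view_transpose` yields size (40, 80) and stride (1, 40), matching the slide. A kernel written against `view_at` works on either view, so the "transpose in memory, perform work, copy back" round trip is avoided.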
  33. 33. Exploiting parallelism
  34. 34. end