- 1. Hacking GPUs for Deep Learning. MLConf New York. Jeff Johnson, Facebook AI Research, jhj@fb.com
- 2. Deep (convolutional) Neural Networks. Revolution in machine learning. Convolution: since the 1980s. Deep: flops since the 2000s. Avoid feature engineering: ▪ With enough data, let the network discover feature representations ▪ Can work even for NLP: no word segmentation, use raw character data.
- 3. 2D Convolutional Nets (images). LeCun, Bottou, Bengio and Haffner, 1998; Krizhevsky, Sutskever and Hinton, 2012.
- 4. 2D Convolutional Nets. Progress towards smaller kernels and deeper nets. ImageNet 1000-class top-5 error by network architecture:
  ▪ AlexNet: ~15%
  ▪ OverFeat: ~13%
  ▪ ZeilerNet: ~11%
  ▪ Oxford-VGG: ~7%
  ▪ GoogLeNet: ~6%, ~4.5%
  ▪ PReLU (MSR): ~4.9%
  ▪ Human performance: 3-5%
- 5. 3D Convolutional Nets (videos). C3D (Tran et al., 2014); DeepVideo (Karpathy et al., 2014).
- 6. 1D Convolutional Nets (text, sequences). Collobert et al., 2011; Zhang and LeCun, 2015.
- 7. RNNs and LSTMs (text, sequences). Graves, Mohamed and Hinton, 2013; Mikolov, 2014.
- 8. Deep Neural Networks. Supervised learning. Unsupervised ??? Train with back-propagation/SGD variants. Strong scaling is unsolved: ▪ Distributed parameter space exploration (e.g., Hogwild!; Niu et al. 2011) ▪ Distributed hyperparameter space exploration (e.g., Bayesian optimization; Snoek et al. 2012)
- 9. Characteristics
- 10. Deep nets are flop eaters. Convolutions are expensive. Pointwise calculations (log/exp, ReLU, */+, ...). Neighborhood reductions (pooling, convolution). Scaling network parameters: increased learning capacity, but also overfitting; more training data (real or synthetic) and regularization required.
- 11. Deep nets are bandwidth eaters. More parameters = more memory, more data to exchange. Barrier to cross-machine parallelism: ▪ periodic exchanges, compression, quantization. Increase reuse of memory while local? ▪ interspersed reductions are resistant to fusion of computations ▪ generalized programming language problem
- 12. Deep nets are latency sensitive. Serial dependency of training: fprop => bprop => fprop => ... Serial dependency of multi-layer networks: layer 1 => layer 2 => layer 3 => ... Multiple path-dependent networks (RNNs, multi-layer LSTMs).
- 13. Deep nets are also small? Deeper = smaller feature planes, more of them: input $\mathbb{R}^m$ => expand to $\mathbb{R}^n$ => non-lin => reduce to $\mathbb{R}^k$. Problems are tiny in HPC terms (HPC: 4096×4096 FFT, FE/PDE on massive grids, ...). NLP tasks can be sparse. Setup/kernel launch latency on GPU can dominate compute.
- 14. The tools
- 15. Vector processors. SIMD: Single Instruction, Multiple Data. Serial processor with the ability to operate on more than one piece of data concurrently. Cray-1 (1976).
- 16. Vector processors. Hard to use: instructions only operate on 4, 8, 16, ... pieces of data at a time. Boundary/alignment effects. Great if your vectors are large, but...

      float* a = ...; // is this aligned ((uintptr_t) a % 16 == 0)?
      float* b = ...; // is this aligned ((uintptr_t) b % 16 == 0)?
      for (int i = 0; i < 18; ++i) { // how to handle [16, 17]?
        b[i] += a[i]; // SIMD this?!? masking/loop epilogue
      }

- 17. “Vector cores”? SIMD variant: NVIDIA calls it “SIMT”. Lots of simple cores (CM). Hide latency through many threads + switching (Tera). “Pixel/vertex shaders” in 2000s GPUs => GPGPU. CM-1 (1983), Tera MTA (1995).
- 18. GPU versus CPU. GPUs represent a different form of vector programming (“vector cores”): ▪ 32-wide vector of threads (“warp”). Sufficiently optimized CPU code can be on par with GPU perf (Tflop range with AVX2/512; exploit multi-level caches, deep pipelines, prefetch, ...). Vector programming: easier with GPUs than CPUs. Sweet spot is different from GPU codes.
- 19. Parallelization + vectorization. Serial nature of commonly used CPU programming languages sometimes hides opportunities. Auto-vectorizing/parallelizing compilers + DSLs can’t yet compete with expert hand-rolled code: ▪ DSLs like Halide (Ragan-Kelley et al. 2013) show promise but need a few more generations. Sprinkling in OpenMP doesn’t cut it.
- 20. Who wins: CPU vs. GPU
  ▪ flops. CPU: ✔ (vectorize: AVX2/512 gives Tflop range). GPU: ✔ (Tesla K40: 2880 fp32 ALU pipelines)
  ▪ main memory b/w. CPU: ✖ (Xeon Phi improves). GPU: ✔
  ▪ latency. CPU: ✔ (high clock, reordering; caches are large and work if you obey them). GPU: ✖ (threads slow, non-smem caches irrelevant, CPU -> GPU control overhead)
  ▪ boundary effects, small/irregular sizes. CPU: ✔✖ (branches easy, vectorization hard). GPU: ✖ (warp divergence, load imbalance)
  ▪ parallel programming model. CPU: ✖ (vectorization hard, perf black box). GPU: ✔✖ (CUDA is very different, domain knowledge)
- 21. Tool + problem = solution?
- 22. Dive into 2D Convolutional Nets. Somewhat computationally expensive: $O(b \times f \times f' \times n^2 \times k^2)$. 1st layer AlexNet: ▪ 13.493 Gflop (1 flop here = fp32 multiply-add) ▪ 77.2 Mbyte in, 63.7 Mbyte out (fp32) ▪ Perfect caching + reuse: 175 flop/byte in ▪ No caching + reuse: 0.125 flop/byte in
- 23. The problem. Programmable caches (shared memory, registers, ...) not large enough for perfect reuse. Space of all possible square 2D convolution problems is 5/6-dimensional. Parameters and their sizes:
  ▪ minibatch size (b): 128
  ▪ input feature maps (f): 3
  ▪ output feature maps (f’): 96
  ▪ input feature size (n x n): 224
  ▪ convolution kernel size (k x k): 11
  ▪ convolution kernel stride (S x S) (optional): 4
- 24. Converting. Space of all possible matrix multiplications = 3-dimensional ($A_{N \times M} B_{M \times P} = C_{N \times P}$). NVIDIA, Intel, others have put lots of effort into optimizing many parts of this space: ▪ Rephrase convolution as a matrix multiplication! ▪ NVIDIA’s cuDNN
- 25. But: Sgemm originally optimized for large problems. 13x13 * 3x3 is a small convolution. Unrolling it 192 times might be enough to feed the GPU. Large convolutions are intractable? Small feature maps/convolutions = boundary effects, bad for GPUs.
- 26. Facebook AI Research work: 2D convolution via FFT. Fast convolutional nets with fbfft: A GPU Performance Evaluation (Vasilache, Johnson et al., 2015, ICLR conference track oral). Convolution => pointwise × in Fourier basis. Choice of basis is wide open! $2^i$ is great perf. $O(b f f' n^2 k^2)$ => $O(b f f' n^2 + (b f + f f' + b f') n^2 \log n)$ ▪ >= 5x5 kernels: faster than cuDNN
- 27. fbfft. cuFFT optimized for large FFT sizes. fbfft: smaller data, fit in registers, focus on warp.
- 28. Data layout. Different problem sizes => different data layout: ▪ cudaconv: DHWB (optimal for large b) ▪ deeper layers: HWBD/BHWD (many feature maps) ▪ b=1: faster convergence? ▪ b=128: better compute utilization. Smaller problems: exploit different layout/batching: ▪ fbcunn 1D convolution
- 29. Latency hiding: what holds you back? ▪ Compute bound? (math) ▪ Memory b/w bound? (streaming) ▪ Memory latency bound? (sparse). Almost all “deep learning” algorithms are b/w bound on GPU. Low math intensity! cuBLAS: Sgemm b/w bound, Dgemm compute bound.
- 30. Kernel fusion: CPU vs GPU. Reduces memory b/w pressure. Exploits cache locality and register reuse. CPU: fusion not necessary; kernel tiling + interleaving works due to caches. GPU: fusion necessary; tiling + interleaving doesn’t work: smem not persistent, caches too small/irrelevant.
- 31. Kernel fusion. CUDA kernel = hard optimization boundary on GPU. Loop interchange, lifting, better fusion on CPU. CUDA: parallelization layer not visible to optimizer. Auto-tuning desired. HW-specific non-linear tradeoffs. Scripting languages are a further barrier to fusion on both CPU and GPU (Torch).
- 32. Kernel fusion. Torch: transposition is a common operation: ▪ size (80, 40) stride (40, 1) => size (40, 80) stride (1, 40) ▪ Old approach: transpose in memory, perform work, copy back ▪ New approach: rewrite kernel to handle transpositions; optimize if non-transposed. Runtime fusion (CUDA 7.0, Theano).
- 33. Exploiting parallelism
- 34. end
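
Slide 16 asks how to vectorize an 18-element loop when the hardware works on 4, 8, or 16 elements at a time. One common answer, the loop epilogue, could be sketched as below in plain C. This is only an illustration under stated assumptions: `VEC_WIDTH` and `add_in_place` are hypothetical names, and the fixed-width inner loop merely stands in for a single SIMD instruction that a compiler may or may not emit. Alignment is not handled here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical vector width: 4 floats, e.g. one 128-bit register. */
#define VEC_WIDTH 4

/* b[i] += a[i] for i in [0, n): blocked main loop plus scalar epilogue. */
void add_in_place(float *b, const float *a, size_t n) {
    assert(a != NULL && b != NULL);
    /* Largest multiple of VEC_WIDTH that is <= n. */
    size_t main_end = n - (n % VEC_WIDTH);

    /* Main loop: each iteration covers exactly VEC_WIDTH elements, so the
       inner loop has a fixed trip count and can map onto one vector add. */
    for (size_t i = 0; i < main_end; i += VEC_WIDTH) {
        for (size_t lane = 0; lane < VEC_WIDTH; ++lane) {
            b[i + lane] += a[i + lane];
        }
    }

    /* Epilogue: the leftover n % VEC_WIDTH elements, one at a time.
       For n = 18 these are elements 16 and 17. */
    for (size_t i = main_end; i < n; ++i) {
        b[i] += a[i];
    }
}
```

Masking, the other option the slide names, would instead run one more full-width iteration with the out-of-range lanes disabled.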
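
The AlexNet first-layer figures on slide 22 can be approximately reproduced from the parameters on slide 23. The sketch below is a back-of-the-envelope check, not the author's calculation, and rests on two assumptions: the output feature map is 55 x 55 (the width that yields 13.493 G multiply-adds), and "bytes in" counts the filter weights along with the input images. All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply-adds for a direct 2D convolution: every output pixel of every
   output map sums a k x k window over all f input maps, for b images. */
uint64_t conv_multiply_adds(uint64_t b, uint64_t f, uint64_t f_out,
                            uint64_t out_n, uint64_t k) {
    return b * f * f_out * out_n * out_n * k * k;
}

/* Bytes read as fp32 (4 bytes each): input images plus filter weights.
   Counting the weights as part of "in" is an assumption. */
uint64_t conv_bytes_in(uint64_t b, uint64_t f, uint64_t f_out,
                       uint64_t n, uint64_t k) {
    return 4 * (b * f * n * n + f_out * f * k * k);
}

/* Arithmetic intensity with perfect reuse: each input byte is fetched once. */
double flop_per_byte_perfect_reuse(uint64_t macs, uint64_t bytes_in) {
    assert(bytes_in > 0);
    return (double)macs / (double)bytes_in;
}
```

With b = 128, f = 3, f' = 96, n = 224, k = 11 and a 55 x 55 output, this gives 13,493,145,600 multiply-adds over 77,209,728 input bytes, or roughly 175 flop/byte. With no reuse at all, each multiply-add fetches two fp32 operands (8 bytes), which is consistent with the slide's 0.125 flop/byte.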
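
Slide 24 proposes rephrasing convolution as a matrix multiplication. A minimal sketch of one well-known way to do this, often called im2col, follows. It assumes a single image, stride 1 and no padding, and it is not taken from cuDNN or from the talk; the function names are hypothetical and the naive `matmul` only marks where a tuned Sgemm would go.

```c
#include <assert.h>
#include <stddef.h>

/* Direct "valid" convolution (cross-correlation, as used in conv nets).
   in: [f][n][n], w: [f_out][f][k][k], out: [f_out][m][m], m = n - k + 1. */
void conv_direct(const float *in, const float *w, float *out,
                 size_t f, size_t f_out, size_t n, size_t k) {
    assert(n >= k);
    size_t m = n - k + 1;
    for (size_t o = 0; o < f_out; ++o)
        for (size_t y = 0; y < m; ++y)
            for (size_t x = 0; x < m; ++x) {
                float acc = 0.0f;
                for (size_t c = 0; c < f; ++c)
                    for (size_t dy = 0; dy < k; ++dy)
                        for (size_t dx = 0; dx < k; ++dx)
                            acc += w[((o * f + c) * k + dy) * k + dx] *
                                   in[(c * n + y + dy) * n + x + dx];
                out[(o * m + y) * m + x] = acc;
            }
}

/* Unroll every k x k x f input patch into one column of a
   (f*k*k) x (m*m) matrix, so the filters apply as a matrix product. */
void im2col(const float *in, float *cols, size_t f, size_t n, size_t k) {
    assert(n >= k);
    size_t m = n - k + 1;
    for (size_t c = 0; c < f; ++c)
        for (size_t dy = 0; dy < k; ++dy)
            for (size_t dx = 0; dx < k; ++dx) {
                size_t row = (c * k + dy) * k + dx;
                for (size_t y = 0; y < m; ++y)
                    for (size_t x = 0; x < m; ++x)
                        cols[row * m * m + y * m + x] =
                            in[(c * n + y + dy) * n + x + dx];
            }
}

/* Naive C = A * B with A: N x M, B: M x P, C: N x P (row-major). */
void matmul(const float *a, const float *b, float *c,
            size_t n_rows, size_t m_inner, size_t p_cols) {
    for (size_t i = 0; i < n_rows; ++i)
        for (size_t j = 0; j < p_cols; ++j) {
            float acc = 0.0f;
            for (size_t l = 0; l < m_inner; ++l)
                acc += a[i * m_inner + l] * b[l * p_cols + j];
            c[i * p_cols + j] = acc;
        }
}
```

Calling `im2col` and then `matmul(w, cols, out, f_out, f * k * k, m * m)` gives the same output as `conv_direct`, because the weight tensor is already an f' x (f k k) row-major matrix. The price is that the unrolled matrix stores each input value up to k x k times, which ties in with the bandwidth concerns on slide 11.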
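
Slide 30 says kernel fusion reduces memory bandwidth pressure. The idea can be illustrated on the CPU with two pointwise operations, a scale followed by a ReLU. The function names below are hypothetical, and the example is only an analogy: on a GPU each loop of the unfused version would typically be its own kernel, with the intermediate buffer making a round trip through global memory.

```c
#include <assert.h>
#include <stddef.h>

/* Unfused: two separate passes. The intermediate result in tmp is written
   to memory by the first loop and read back by the second. */
void scale_then_relu_unfused(const float *x, float *tmp, float *y,
                             float scale, size_t n) {
    assert(x != NULL && tmp != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i) tmp[i] = scale * x[i];
    for (size_t i = 0; i < n; ++i) y[i] = tmp[i] > 0.0f ? tmp[i] : 0.0f;
}

/* Fused: one pass. The intermediate value stays in a local variable, so
   each element is read once and written once. */
void scale_then_relu_fused(const float *x, float *y, float scale, size_t n) {
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i) {
        float v = scale * x[i];
        y[i] = v > 0.0f ? v : 0.0f;
    }
}
```

Both versions compute the same result; the fused one moves about half as many bytes (one read and one write per element instead of two of each).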
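
The size/stride example on slide 32 (size (80, 40) stride (40, 1) => size (40, 80) stride (1, 40)) describes a transposition that moves no data. A minimal sketch of such a strided view follows; `View2D`, `view_at` and `view_transpose` are hypothetical names for illustration and are not Torch's actual tensor type or API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 2D strided view over a flat buffer. */
typedef struct {
    float *data;
    size_t size[2];
    size_t stride[2];
} View2D;

/* Element (i, j) lives at data[i * stride[0] + j * stride[1]]. */
float *view_at(View2D v, size_t i, size_t j) {
    assert(i < v.size[0] && j < v.size[1]);
    return &v.data[i * v.stride[0] + j * v.stride[1]];
}

/* Transpose by swapping sizes and strides; no element is moved. */
View2D view_transpose(View2D v) {
    View2D t = { v.data,
                 { v.size[1], v.size[0] },
                 { v.stride[1], v.stride[0] } };
    return t;
}
```

A kernel that indexes through `view_at` works on either view unchanged, which is the "new approach" on the slide. The "optimize if non-transposed" remark corresponds to taking a faster path when the innermost stride is 1.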
