
High-Performance GPU Programming for Deep Learning

This session covers many of the techniques we use at Nervana in GPU programming to achieve state-of-the-art performance for deep learning networks. The main focus is the customization of dense linear algebra kernels: Winograd 3x3 convolution, direct convolution, and small-tile GEMM (matrix multiply). In particular, we'll look at how we achieve high utilization at very small minibatches, which is important for multi-GPU scaling and inference. In addition, we'll talk about where and how you can effectively leverage lower and mixed precision to further increase performance without loss in accuracy.



  1. High-Performance GPU Programming for Deep Learning
     7 April 2016. Scott Gray, Nervana Systems. MAKING MACHINES SMARTER.™
  2. High-Performance GPU kernels for deep learning
     • Fast matrix multiply for small minibatches
     • Direct convolution leveraging GEMM advances
     • Even faster convolution with Winograd
  3. GEMM: Basics
     C = AB
  4. GEMM: Memory Load
     [Diagram: outer-product memory loads, contiguous vs. strided; per-thread load patterns for a single tile and for batched GEMM]
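     The outer-product loads refer to GEMM computed as a sum of rank-1 updates rather than as per-element dot products: each column of A and row of B brought in from memory is reused across an entire tile of C, which is why the contiguous vs. strided load pattern matters. In LaTeX:

     C = \sum_{k=0}^{K-1} A_{:,k} \, B_{k,:}
     \qquad \text{i.e.} \qquad
     C_{ij} = \sum_{k} A_{ik} B_{kj}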
  5. GEMM: Tile sizes
     [Diagram: 32x32 and 32x64 GEMM tiles for batched GEMM; per-thread shared-memory load patterns]
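     The kernels discussed here are hand-scheduled Maxwell SASS; as a rough illustration of the shared-memory tiling they build on, here is a minimal CUDA sketch of a 32x32-tiled SGEMM (the kernel name and the assumption that N is a multiple of TILE are ours, not Nervana's):

     #define TILE 32

     // C = A * B for row-major square N x N matrices, N % TILE == 0.
     // Launch with dim3 block(TILE, TILE), grid(N / TILE, N / TILE).
     __global__ void sgemm_tiled(const float* A, const float* B, float* C, int N)
     {
         __shared__ float As[TILE][TILE];
         __shared__ float Bs[TILE][TILE];

         int row = blockIdx.y * TILE + threadIdx.y;
         int col = blockIdx.x * TILE + threadIdx.x;
         float acc = 0.0f;

         for (int k0 = 0; k0 < N; k0 += TILE) {
             // Cooperative load: one A tile and one B tile into shared memory.
             As[threadIdx.y][threadIdx.x] = A[row * N + k0 + threadIdx.x];
             Bs[threadIdx.y][threadIdx.x] = B[(k0 + threadIdx.y) * N + col];
             __syncthreads();

             // Accumulate this K-slice's contribution to the output element.
             #pragma unroll
             for (int k = 0; k < TILE; ++k)
                 acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
             __syncthreads();
         }

         C[row * N + col] = acc;
     }

     A real small-tile kernel assigns many output elements to each thread and keeps them in registers; one thread per element as above is only the simplest way to show the data movement.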
  6. hGEMM Results - NN
     [Chart: GFLOPS vs. batch size (N = 32-128) for an Nx3072x3072 NN op; Nervana 32x32 tile vs. cuBLAS 128x64 tile]
  7. hGEMM Results - TN
     [Chart: GFLOPS vs. batch size (N = 32-128) for an Nx3072x3072 TN op; Nervana 32x32 tile vs. cuBLAS 128x64 tile]
  8. Direct convolution is still relevant
     • Striding
     • Odd-size filters
     • Placeholder until a faster algorithm can be implemented
     • Often faster for a single image or the first small-C layer
  9. Direct convolution: implementation details
     • Batched GEMM for efficient transpose and higher occupancy
     • Compound outer-product block remapping
     • Square-wave pattern for P,Q block mapping
     • Slicing: shared-memory lookup + integer division
     • N vs. C contiguous
     • Single P,Q vs. tiled P,Q
     • Bprop as upside-down fprop (see the sketch below)
     • Update-specific optimizations
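     "Bprop as upside-down fprop" is the standard identity that the data gradient of a convolution is itself a convolution: rotate each filter 180 degrees in the spatial dimensions, swap the C and K axes, and run the output gradients through the same fprop machinery (with padding and striding adjusted). For a stride-1, correlation-style fprop:

     O_{k,p,q} = \sum_{c,r,s} W_{k,c,r,s} \, I_{c,\,p+r,\,q+s}
     \quad \Longrightarrow \quad
     \frac{\partial L}{\partial I_{c,x,y}} = \sum_{k,r,s} W_{k,c,r,s} \, \frac{\partial L}{\partial O_{k,\,x-r,\,y-s}}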
  10. Winograd: input transform
      [Diagram: input feature map tiled into 4x4 tiles with stride 2]
      • 2D Winograd is a nested product of 1D transforms
      • Transforms can be simplified to remove zeros
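      The 4x4-tile, stride-2 picture corresponds to the F(2x2, 3x3) algorithm of Lavin & Gray, "Fast Algorithms for Convolutional Neural Networks". Each 4x4 input tile d is transformed by nesting the 1D transform over rows and then columns:

      V = B^T d \, B,
      \qquad
      B^T = \begin{bmatrix} 1 & 0 & -1 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & -1 & 1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix}

      Since every entry is 0 or ±1, the transform reduces to pure additions and subtractions once the zero terms are dropped.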
  11. Winograd: filter transform
      • Same as the input transform, but with different coefficients
      • Transform each feature map independently
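      For the same F(2x2, 3x3) algorithm, each 3x3 filter g is expanded to a 4x4 tile with its own coefficient matrix:

      U = G \, g \, G^T,
      \qquad
      G = \begin{bmatrix} 1 & 0 & 0 \\ \tfrac{1}{2} & \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{1}{2} & -\tfrac{1}{2} & \tfrac{1}{2} \\ 0 & 0 & 1 \end{bmatrix}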
  12. Winograd: batched GEMM
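      With inputs and filters in transform space, the reduction over input channels factors into 16 independent GEMMs, one per coordinate (i, j) of the 4x4 tile, which is exactly where the batched-GEMM machinery from the start of the talk pays off. Following the Lavin & Gray formulation, with U^{(i,j)} of shape K x C and V^{(i,j)} of shape C x (tiles * N):

      M^{(i,j)} = U^{(i,j)} \, V^{(i,j)}, \qquad i, j \in \{0, 1, 2, 3\}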
  13. Winograd: output transform
      [Diagram: output feature map]
      • Same structure as the input and filter transforms
      • Transform back to pixel space to obtain the 2x2 output tile
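      The inverse transform takes each accumulated 4x4 tile M back to pixel space, yielding the 2x2 output tile:

      Y = A^T M A,
      \qquad
      A^T = \begin{bmatrix} 1 & 1 & 1 & 0 \\ 0 & 1 & -1 & -1 \end{bmatrix}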
  14. Performance: VGG
      [Chart: VGG fp32, totals by operation; algorithmic speedup vs. batch size (1-64) for Winograd fp32 fprop/bprop/update against cuDNN fp32 fprop/bprop/update]
  15. Performance: Alexnet convolutional layers
      [Chart: Alexnet totals; algorithmic speedup vs. batch size (4-128) for Nervana fp16/fp32 against cuBLAS fp16/fp32]
  16. Compounding
      Compounding inside of GEMM and conv comes for free:
      • alpha / beta
      • bias
      • relu, prelu, tanh, …
      • bprop relu, …
      • bprop bias
      • batchnorm mean
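      These compound for free because the GEMM or conv epilogue already has each output element in a register: applying alpha/beta, a bias, and an activation adds a few arithmetic instructions, versus launching a separate memory-bound elementwise kernel. A sketch against the tiled SGEMM from earlier (alpha, beta, and bias are illustrative names, not Nervana's API):

      // Hypothetical compound epilogue:
      //   C = relu(alpha * (A*B) + beta * C + bias)
      // 'acc', 'row', 'col', and 'N' come from the tiled GEMM sketch above.
      float out = alpha * acc + beta * C[row * N + col] + bias[col];
      C[row * N + col] = fmaxf(out, 0.0f);  // compounded ReLU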
  17. Summary
      • Nervana has the fastest tools for deep learning
      • neon with state-of-the-art Maxwell kernels
      • Nervana Cloud with multi-GPU training
      • Watch for the Nervana Engine, our deep learning processor
