"Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM

For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/arm/embedded-vision-training/videos/pages/may-2016-embedded-vision-summit-iodice

For more information about embedded vision, please visit:
http://www.embedded-vision.com

Gian Marco Iodice, Software Engineer at ARM, presents the "Using SGEMM and FFTs to Accelerate Deep Learning" tutorial at the May 2016 Embedded Vision Summit.

Matrix Multiplication and the Fast Fourier Transform are numerical foundation stones for a wide range of scientific algorithms. With the emergence of deep learning, they are becoming even more important, particularly as use cases extend into mobile and embedded devices. In this presentation, Iodice discusses and analyzes how these two key, computationally intensive algorithms can be used to gain significant performance improvements for convolutional neural network (CNN) implementations.

After a brief introduction to the nature of CNN computations, Iodice explores the use of GEMM (General Matrix Multiplication) and mixed-radix FFTs to accelerate 3D convolution. He shows examples of OpenCL implementations of these functions and highlights their advantages, limitations and trade-offs. Central to the techniques explored is an emphasis on cache-efficient memory accesses and the crucial role of reduced-precision data types.

"Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM

  1. Using SGEMM and FFTs to Accelerate Deep Learning – Gian Marco Iodice, SW Engineer, ARM – May 3, 2016
  2. Contents
     • About ARM
     • Convolutional Neural Networks (CNN)
     • Architecture and building blocks
     • Convolutional layer
     • SGEMM-based convolution
     • FFT-based convolution
     • SGEMM vs FFT
     • Limited numerical precision for CNN
     • Lessons learned
  3. ARM Ltd
     • ARM Holdings plc is a British multinational semiconductor and software design company (www.arm.com)
     • Headquarters in Cambridge, England
  4. Architecture and Building Blocks of CNN
     • Convolutional layer (core block of CNN)
       • Number of convolution kernels (filter bank)
       • Filter shape (width, height and depth)
     • Pooling layer (typical values 2x2)
     • Non-linear gating (ReLU)
     • Classifier: fully connected neural network
     (Diagram: a learned, non-linear, trainable feature hierarchy)
  5. Why Are We Going to Study the Convolutional Layer?
     Compute load for AlexNet inference*: conv1 16.9%, relu 0.7%, pool 1.0%, conv2 21.9%, pool2 0.7%, norm2 0.5%, conv3 17.8%, relu3 0.2%, conv4 17.8%, conv5 17.7%, fc6 1.8%, fc7 0.8%
     The convolutional layers account for roughly 90% of the inference time.
     *Learning Semantic Image Representations at a Large Scale, Yangqing Jia
  6. From 2D Convolution to 3D Batched Convolution
     • Most of the time, convolution layers have:
       • Multiple input images
       • Multiple convolution kernels (various dimensions and shapes)
       • Multiple channels per image/kernel (not necessarily 3!)
     (Diagram: input image, kernels, output images)
     Why don't we use a sliding window approach?
  7. SGEMM-based Convolution
     SGEMM (Single Precision General Matrix Multiply): C = α∙A∙B + β∙C
  8. Im2col
     • im2col stores in each row the pixels needed for one application of the kernel
     • It costs in terms of memory requirements: pixels are duplicated
     • col2im restores the output image structure
     (Diagram: input image, matrices A/B/C, output images, stride x)
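     To make this data layout concrete, here is a minimal im2col sketch written as an OpenCL kernel. It is not taken from the slides: it handles a single channel with no padding, and the kernel name and parameters are illustrative assumptions.

       __kernel void im2col_single_channel(
           __global const float *src,  /* input image, height x width, row-major */
           __global float       *dst,  /* im2col matrix: one row per kernel application */
           int width, int height,
           int kernel_w, int kernel_h,
           int stride_x, int stride_y)
       {
           /* One work-item per output position of the convolution */
           const int out_x = get_global_id(0);
           const int out_y = get_global_id(1);
           const int out_w = (width - kernel_w) / stride_x + 1;

           /* Row of the im2col matrix written by this work-item */
           __global float *row = dst + (out_y * out_w + out_x) * (kernel_w * kernel_h);

           /* Copy the kernel_w x kernel_h patch into a contiguous row
              (this is where the pixel duplication mentioned above happens) */
           for (int ky = 0; ky < kernel_h; ++ky)
               for (int kx = 0; kx < kernel_w; ++kx)
                   row[ky * kernel_w + kx] =
                       src[(out_y * stride_y + ky) * width + (out_x * stride_x + kx)];
       }

     With this layout, a single SGEMM between the reshaped kernels and the im2col matrix produces all output pixels of the layer at once.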
  9. SGEMM: Naïve Implementation
     • Each thread computes a single element of the output matrix
     • Not cache friendly! Matrix B is read with a stride of N elements between accumulations
     /* First accumulation */
     ai = load(addr_a);
     bi = load(addr_b);
     c0 += ai * bi;
     /* Second accumulation */
     ai = load(addr_a + 1);
     bi = load(addr_b + 1 * N);
     c0 += ai * bi;
     ...
     store(c0, addr_c);
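     Expanding the pseudocode above into a complete but naive OpenCL kernel might look roughly as follows; this is a sketch, not ARM's implementation, and the kernel name and argument layout are assumptions.

       __kernel void sgemm_naive(
           __global const float *A,  /* M x K, row-major */
           __global const float *B,  /* K x N, row-major */
           __global float       *C,  /* M x N, row-major */
           int M, int N, int K,
           float alpha, float beta)
       {
           const int col = get_global_id(0);  /* column of C computed by this thread */
           const int row = get_global_id(1);  /* row of C computed by this thread */

           float acc = 0.0f;
           /* B is read with a stride of N floats between accumulations:
              this is the cache-unfriendly access pattern of the naive version */
           for (int k = 0; k < K; ++k)
               acc += A[row * K + k] * B[k * N + col];

           C[row * N + col] = alpha * acc + beta * C[row * N + col];
       }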
  10. Transpose Matrix B
     • With matrix B transposed, both operands are read with unit stride
     /* First accumulation */
     ai = load(addr_a);
     bi = load(addr_b);
     c00 += ai * bi;
     /* Second accumulation */
     ai = load(addr_a + 1);
     bi = load(addr_b + 1);
     c00 += ai * bi;
     ...
     store(c00, addr_c);
     Speed-up achievable? ~1.1x
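     A corresponding full-kernel sketch with B stored transposed (again illustrative, not the original code): both operands are now walked with unit stride, which is what buys the modest ~1.1x above.

       __kernel void sgemm_b_transposed(
           __global const float *A,   /* M x K, row-major */
           __global const float *Bt,  /* N x K, row-major: Bt[n][k] = B[k][n] */
           __global float       *C,   /* M x N, row-major */
           int M, int N, int K,
           float alpha, float beta)
       {
           const int col = get_global_id(0);
           const int row = get_global_id(1);

           float acc = 0.0f;
           /* Both A and Bt are now read contiguously along k */
           for (int k = 0; k < K; ++k)
               acc += A[row * K + k] * Bt[col * K + k];

           C[row * N + col] = alpha * acc + beta * C[row * N + col];
       }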
  11. Transpose Matrix B in Chunks of 1x4 (I)
     • Each thread computes 1x4 elements of the output matrix
     • Not cache friendly! The vector loads from matrix B still stride by N elements
     float4 out = 0.0f;
     /* First accumulation */
     ai = load(addr_a);
     bi = vload4(addr_b);
     out += (float4)ai * bi;
     /* Second accumulation */
     ai = load(addr_a + 1);
     bi = vload4(addr_b + 1 * N);
     out += (float4)ai * bi;
     ...
     store4(out, addr_c);
  12. Transpose Matrix B in Chunks of 1x4 (II)
     • Matrix B is transposed and stored in chunks of 1x4, so the vector loads become contiguous (stride of 4 elements)
     float4 out = 0.0f;
     /* First accumulation */
     ai = load(addr_a);
     bi = vload4(addr_b);
     out += (float4)ai * bi;
     /* Second accumulation */
     ai = load(addr_a + 1);
     bi = vload4(addr_b + 4);
     out += (float4)ai * bi;
     ...
     store4(out, addr_c);
     Speed-up achievable? ~3.5x (SGEMM speed-up for N = 512, 1024, 2048, 4096; A, B and C are NxN)
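     A possible full kernel for this 1x4 scheme, assuming B has been repacked so that each block of 4 columns is stored contiguously with one float4 per accumulation step; the repacked layout, kernel name and arguments are assumptions made for illustration, not the exact layout used on the slide.

       __kernel void sgemm_1x4(
           __global const float *A,     /* M x K, row-major */
           __global const float *B1x4,  /* repacked B: for column block cb, the float4 for
                                           step k starts at B1x4 + (cb * K + k) * 4 */
           __global float       *C,     /* M x N, row-major, N assumed a multiple of 4 */
           int M, int N, int K,
           float alpha, float beta)
       {
           const int col4 = get_global_id(0);  /* which block of 4 output columns */
           const int row  = get_global_id(1);

           __global const float *b = B1x4 + col4 * K * 4;  /* contiguous, stride of 4 floats */
           float4 acc = (float4)(0.0f);

           for (int k = 0; k < K; ++k)
               acc += (float4)(A[row * K + k]) * vload4(0, b + k * 4);

           /* C = alpha*A*B + beta*C on a 1x4 block of the output */
           const int offset = row * N + col4 * 4;
           vstore4(alpha * acc + beta * vload4(0, C + offset), 0, C + offset);
       }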
  13. Reshaping Matrix A (I)
     • We can do more… we can compute a block of 4x4 elements per thread in order to re-use the values loaded from matrix A
  14. Reshaping Matrix A (II)
     • Matrix A is reshaped in chunks, where a chunk is a block of 4 rows: e.g. an 8x8 matrix A becomes a 2x32 matrix AI (Chunk 0, Chunk 1)
     Speed-up achievable? > 8.0x (SGEMM speed-up for N = 512, 1024, 2048, 4096; A, B and C are NxN)
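     One way to produce the interleaved AI layout, as a sketch under the stated 4-row chunking (the kernel name and arguments are hypothetical): each group of 4 rows is collapsed into a single row in which the 4 values needed together by a 4x4-block thread sit next to each other.

       __kernel void interleave_a_4rows(
           __global const float *A,   /* M x K, row-major, M assumed a multiple of 4 */
           __global float       *AI,  /* (M/4) x (4*K): e.g. 8x8 -> 2x32 as on the slide */
           int K)
       {
           const int k  = get_global_id(0);  /* column of A */
           const int rb = get_global_id(1);  /* block of 4 rows (the "chunk") */

           /* Destination: 4 consecutive floats holding A[4*rb+0..3][k] */
           __global float *dst = AI + rb * (4 * K) + k * 4;
           for (int i = 0; i < 4; ++i)
               dst[i] = A[(rb * 4 + i) * K + k];
       }

     With this layout, a thread computing a 4x4 output block can fetch the four A values it needs with a single contiguous vector load instead of four strided scalar loads.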
  15. FFT-based Convolution
     • Convolution in the spatial domain is equivalent to a pointwise (element-by-element) multiplication in the frequency domain
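     The frequency-domain step itself is just a per-element complex product. A minimal illustrative OpenCL kernel, assuming an interleaved real/imaginary layout (names and layout are assumptions, not the presenter's code):

       __kernel void complex_pointwise_mul(
           __global const float2 *x,  /* FFT of the input image, (re, im) pairs */
           __global const float2 *h,  /* FFT of the convolution kernel          */
           __global float2       *y)  /* product, to be inverse-transformed     */
       {
           const int i = get_global_id(0);
           const float2 a = x[i];
           const float2 b = h[i];
           /* (a.x + i*a.y) * (b.x + i*b.y) */
           y[i] = (float2)(a.x * b.x - a.y * b.y,
                           a.x * b.y + a.y * b.x);
       }

     The forward and inverse transforms wrapped around this step are where the mixed-radix FFT discussed on the next slide comes in.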
  16. From Radix-2 to Mixed-Radix
     • The most famous FFT is the radix-2 Cooley–Tukey algorithm (only for N a power of 2: N = 2 x 2 x 2 …)
     • In general, any factorization of N is possible (N = N1 x N2 x N3 x …)
     • Mixed-radix is the generalization of the basic radix-2 FFT
     • Over 1.5x better performance than radix-2
  17. FFT Implementation
     • Recursive in-place FFT computation*
     • Each thread computes a single radix-N (floating point computation)
     • Block-wise 2x2 in-place transposition
       • ~2x better performance than 2x2 out-of-place transposition
     • Out-of-place batched convolution
       • High memory requirements, as we have to keep the frequency representation of:
         1. The input image
         2. The convolution kernels
         3. The results of the convolutions
     * https://community.arm.com/groups/arm-mali-graphics/blog/2016/02/01/speeding-up-fast-fourier-transform-mixed-radix-on-mobile-arm-mali-gpu-by-means-of-opencl-part-2
  18. SGEMM vs FFT (I)
     SGEMM-based convolution:
     • High memory requirements due to im2col when:
       • stride < kernel dimension
       • large convolution kernel
       • large input image
     FFT-based convolution:
     • No efficient way to handle stride != 1
     • High memory requirements for batched convolutions
     • It could require considerable effort to optimize well
  19. SGEMM vs FFT (II)
     Study limited to the inference problem, with stride x = 1, stride y = 1, number of channels = 1, and pre-computed FFTs for the convolution kernels.
     • Case 1: 1 input image, 64/128/256 convolution kernels
     • Case 2: 64 input images, 32 convolution kernels
     (Charts: image size vs. kernel size / number of convolutions, SGEMM vs FFT. And what about using stride x = 2?)
  20. Limited Numerical Precision for CNN (I)
     • Some papers ([1], [2]) have demonstrated the feasibility of using limited numerical precision for CNNs
     • This opens an interesting computational scenario if, for instance, the HW has accelerators for 16-bit half-precision:
       • Performance boost
         • Reduced memory traffic to/from external memory
         • Possible to dispatch fewer threads
       • Energy saving
         • Essentially due to the reduced memory traffic to/from the external memory
     [1] Training Deep Neural Networks with Low Precision Multiplications, Matthieu Courbariaux, Jean-Pierre David
     [2] Deep Learning with Limited Numerical Precision, Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, Pritish Narayanan
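     As a rough illustration of why fewer threads can be dispatched (not from the presentation; the extension use, kernel name and arguments are assumptions): with native FP16 support, a work-item can process twice as many elements per vector register.

       #pragma OPENCL EXTENSION cl_khr_fp16 : enable

       /* Half-precision "axpy"-style update: each work-item handles 8 values,
          where a float4-based version would handle 4, so half as many work-items
          are dispatched and memory traffic per element is halved. */
       __kernel void axpy_fp16(
           __global const half *x,
           __global const half *y,
           __global half       *out,
           float alpha)
       {
           const int i = get_global_id(0);
           half8 xv = vload8(0, x + i * 8);
           half8 yv = vload8(0, y + i * 8);
           vstore8((half)alpha * xv + yv, 0, out + i * 8);
       }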
  21. Limited Numerical Precision for CNN (II)
     • SGEMM speed-up with FP16: > 2.0x (it is possible to dispatch fewer threads, i.e. 8x4 elements per thread)
     • FFT speed-up with FP16: > 1.5x (we cannot dispatch fewer threads: each thread computes a single radix-N)
     (Charts measured for N = 512, 1024, 2048, 4096; A, B and C are NxN)
  22. Lessons Learned
     1. A cache-efficient data layout has a huge impact on the performance of our algorithms, also for GPU computing
     2. Simple changes in data layout make it possible to:
       • dispatch fewer threads
       • better exploit vector instructions
     3. Limited numerical precision plays a crucial role IF it is HW accelerated
     4. Convolution is an embarrassingly parallel task which can be easily and efficiently accelerated on mobile GPUs by means of OpenCL
  23. Question Time
  24. Thank you!
     The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited (or its subsidiaries) in the EU and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners. Copyright © 2016 ARM Limited
