Copyright © 2016 ARM Ltd
Gian Marco Iodice, SW Engineer – ARM
May 3, 2016
Using SGEMM and FFTs to Accelerate Deep Learning
Contents
• About ARM
• Convolutional Neural Networks (CNN)
• Architecture and building blocks
• Convolutional Layer
• SGEMM-based convolution
• FFT-based convolution
• SGEMM vs FFT
• Limited Numerical Precision for CNN
• Lessons Learned
ARM Ltd
• ARM Holdings plc is a British multinational semiconductor and software
design company (www.arm.com)
• Headquarters in Cambridge, England
Architecture and Building Blocks of CNN
• Convolutional layer (core block of CNN)
• Number of convolution kernels (filter bank)
• Filter shape (width, height and depth)
• Pooling layer (typical values 2x2)
• Non-linear gating (ReLU)
• Classifier: Fully Connected Neural Network
[Figure labels: Learned, Non-Linear Trainable Feature]
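The two simplest building blocks listed above can be sketched in a few lines. This is a minimal illustration in plain Python, assuming a single 2D feature map stored as a list of lists with even height and width; the function names and the order in which the two steps are applied are illustrative assumptions, not taken from the slides.

```python
def relu(feature_map):
    """Non-linear gating: element-wise max(0, x) over a 2D feature map."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool_2x2(feature_map):
    """Non-overlapping 2x2 max pooling; assumes even height and width."""
    height, width = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[y][x], feature_map[y][x + 1],
                feature_map[y + 1][x], feature_map[y + 1][x + 1])
            for x in range(0, width, 2)
        ]
        for y in range(0, height, 2)
    ]


fm = [[-1.0, 2.0, 0.5, -3.0],
      [4.0, -2.0, 1.0, 0.0],
      [0.0, 0.0, -1.0, -1.0],
      [-5.0, 3.0, -2.0, -4.0]]
pooled = max_pool_2x2(relu(fm))  # 4x4 feature map reduced to 2x2
```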
Why Are We Going to Study the Convolutional Layer?
Compute Load for AlexNet Inference*
• conv1: 16.9%
• relu: 0.7%
• pool: 1.0%
• conv2: 21.9%
• pool2: 0.7%
• norm2: 0.5%
• conv3: 17.8%
• relu3: 0.2%
• conv4: 17.8%
• conv5: 17.7%
• fc6: 1.8%
• fc7: 0.8%
* Learning Semantic Image Representations at a Large Scale, Yangqing Jia
From 2D Convolution to 3D Batched Convolution
• Most of the time, for the convolution layers we have:
• Multiple input images
• Multiple convolution kernels (various dimensions and shapes)
• Multiple channels per image/kernel (not necessarily 3!)
[Figure: Input image, Kernels, Output images]
Why don’t we use a sliding-window approach?
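For reference, the sliding-window approach the question refers to can be written as a direct loop nest. The sketch below is a plain-Python illustration under assumed conventions (images indexed as [image][channel][y][x], kernels as [kernel][channel][y][x], no padding, no kernel flip as is usual in CNNs); the function name `conv_direct` is hypothetical.

```python
def conv_direct(images, kernels, stride=1):
    """Direct batched, multi-channel convolution with a sliding window.

    images:  [n][c][h][w], kernels: [k][c][kh][kw]
    returns: [n][k][oh][ow]
    """
    channels = len(images[0])
    h, w = len(images[0][0]), len(images[0][0][0])
    kh, kw = len(kernels[0][0]), len(kernels[0][0][0])
    oh = (h - kh) // stride + 1
    ow = (w - kw) // stride + 1
    out = []
    for img in images:
        per_image = []
        for ker in kernels:
            plane = [[0.0] * ow for _ in range(oh)]
            for oy in range(oh):
                for ox in range(ow):
                    acc = 0.0
                    # Accumulate over every channel and every kernel tap.
                    for c in range(channels):
                        for ky in range(kh):
                            for kx in range(kw):
                                acc += (img[c][oy * stride + ky][ox * stride + kx]
                                        * ker[c][ky][kx])
                    plane[oy][ox] = acc
            per_image.append(plane)
        out.append(per_image)
    return out


images = [[[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]]   # 1 image, 1 channel, 3x3
kernels = [[[[1, 0], [0, 1]]]]                     # 1 kernel, 1 channel, 2x2
result = conv_direct(images, kernels)
```

The seven nested loops make the amount of work per output element explicit, which is the cost the SGEMM-based and FFT-based formulations reorganize.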
SGEMM-based Convolution
SGEMM: Single-precision GEneral Matrix Multiply
C = α∙AB + β∙C
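As a reference for the formula, here is a minimal, unoptimized sketch in plain Python. It assumes row-major lists of lists and uses Python floats (double precision), so it illustrates the arithmetic of SGEMM rather than its single-precision storage; the function name `sgemm` is just a label for this sketch.

```python
def sgemm(alpha, a, b, beta, c):
    """Return alpha * (A x B) + beta * C for A (MxK), B (KxN), C (MxN)."""
    m, k, n = len(a), len(b), len(b[0])
    return [
        [
            alpha * sum(a[i][p] * b[p][j] for p in range(k)) + beta * c[i][j]
            for j in range(n)
        ]
        for i in range(m)
    ]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = [[1, 1], [1, 1]]
result = sgemm(2, a, b, 3, c)
```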
Im2col
• im2col stores in each row the pixels needed for each kernel application
• This has a cost in terms of memory requirements!
• pixel duplication
• col2im restores the output image structure
[Figure: Input image, matrices A, B and C, Output images; stride X]
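The im2col, matrix multiply and col2im steps can be sketched as follows. This is a plain-Python illustration for a single-channel image without padding; the function names and the choice of one flattened kernel per column of B are assumptions made for the example.

```python
def im2col(image, kh, kw, stride=1):
    """Build matrix A: one row per kernel application, holding its pixels."""
    h, w = len(image), len(image[0])
    oh = (h - kh) // stride + 1
    ow = (w - kw) // stride + 1
    rows = [
        [image[oy * stride + ky][ox * stride + kx]
         for ky in range(kh) for kx in range(kw)]
        for oy in range(oh) for ox in range(ow)
    ]
    return rows, oh, ow


def conv_via_im2col(image, kernels, stride=1):
    """Convolve one image with several kernels through a matrix multiply."""
    kh, kw = len(kernels[0]), len(kernels[0][0])
    a, oh, ow = im2col(image, kh, kw, stride)
    # Matrix B: each kernel flattened into one column.
    b_cols = [[k[ky][kx] for ky in range(kh) for kx in range(kw)] for k in kernels]
    # C = A x B (the product SGEMM computes with alpha = 1, beta = 0).
    c = [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]
    # col2im: column j of C is reshaped into output image j.
    return [
        [[c[oy * ow + ox][j] for ox in range(ow)] for oy in range(oh)]
        for j in range(len(kernels))
    ]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernels = [[[1, 0], [0, 1]], [[1, 1], [1, 1]]]
matrix_a, out_h, out_w = im2col(image, 2, 2)
outputs = conv_via_im2col(image, kernels)
```

In this example the 9-pixel image becomes a 4x4 matrix A (16 values), which is the pixel duplication mentioned above.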
SGEMM: Naïve implementation
• Each thread computes a single element of the output matrix
Not cache friendly!
/* First accumulation */
ai = load(addr_a);
bi = load(addr_b);
c0 += ai * bi;
/* Second accumulation */
ai = load(addr_a + 1);
bi = load(addr_b + 1 * N);
c0 += ai * bi;
…
store(c0, addr_c);
[Figure: Matrix A, Matrix B, Matrix C]
Transpose Matrix B
Matrix B Transposition
/* First accumulation */
ai = load(addr_a);
bi = load(addr_b);
c00 += ai * bi;
/* Second accumulation */
ai = load(addr_a + 1);
bi = load(addr_b + 1);
c00 += ai * bi;
...
store(c00, addr_c);
[Figure: Matrix A, Matrix B, Matrix C]
Speed-up achievable? 1.1x…
Transpose Matrix B in Chunk of 1x4 (I)
• Each thread computes 1x4 elements of the output matrix
Not cache friendly!
float4 out = 0.0f;
/* First accumulation */
ai = load(addr_a);
bi = vload4(addr_b);
out += (float4)ai * bi;
/* Second accumulation */
ai = load(addr_a + 1);
bi = vload4(addr_b + 1 * N);
out += (float4)ai * bi;
...
store4(out, addr_c);
[Figure: Matrix A, Matrix B, Matrix C]
Transpose Matrix B in Chunk of 1x4 (II)
float4 out = 0.0f;
/* First accumulation */
ai = load(addr_a);
bi = vload4(addr_b);
out += (float4)ai * bi;
/* Second accumulation */
ai = load(addr_a + 1);
bi = vload4(addr_b + 4);
out += (float4)ai * bi;
...
store4(out, addr_c);
[Figure: Matrix B reshaped into Matrix BT1x4]
[Plot: SGEMM Speed-Up vs. N (512, 1024, 2048, 4096), where A=NxN, B=NxN, C=NxN]
Speed-up achievable? 3.5x
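The effect of the 1x4 chunk transposition on the access pattern can be sketched in plain Python. The layout used here is an assumption inferred from the pseudocode above (consecutive accumulations read `addr_b`, `addr_b + 4`, ...): for each block of 4 columns of B, the 1x4 chunks of all rows are stored one after another. Matrices are flat row-major lists and the function names are hypothetical.

```python
def transpose_1x4(b, k, n):
    """Reshape B (k x n, flat row-major, n a multiple of 4) into 1x4 chunks."""
    out = []
    for jb in range(0, n, 4):          # one block of 4 columns at a time
        for row in range(k):           # chunks of consecutive rows become adjacent
            out.extend(b[row * n + jb: row * n + jb + 4])
    return out


def gemm_1x4(a, bt, m, k, n):
    """C = A x B where each 'thread' (i, jb) computes 1x4 output elements."""
    c = [0.0] * (m * n)
    for i in range(m):
        for jb in range(0, n, 4):
            acc = [0.0] * 4                    # plays the role of 'float4 out'
            addr_b = (jb // 4) * k * 4         # start of this column block
            for p in range(k):
                ai = a[i * k + p]
                for lane in range(4):
                    # Chunk for step p sits at addr_b + 4 * p: contiguous reads.
                    acc[lane] += ai * bt[addr_b + 4 * p + lane]
            c[i * n + jb: i * n + jb + 4] = acc
    return c


m, k, n = 2, 2, 8
a = [1.0, 2.0, 3.0, 4.0]
b = [float(v) for v in range(k * n)]
bt = transpose_1x4(b, k, n)
c = gemm_1x4(a, bt, m, k, n)
```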
Reshaping Matrix A (I)
• We can do more… we can compute a block of 4x4 elements per thread in order to re-use the values loaded from Matrix A
[Figure: Matrix A, Matrix BT1x4, Matrix C]
Reshaping Matrix A (II)
[Figure: Matrix A – 8x8 reshaped into Matrix AI – 2x32; Chunk = Block of 4 rows (Chunk 0, Chunk 1)]
[Plot: SGEMM Speed-Up vs. N (512, 1024, 2048, 4096), where A=NxN, B=NxN, C=NxN]
Speed-up achievable? > 8.0x
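One way to obtain the 8x8 to 2x32 reshaping is sketched below in plain Python. The slide only states that a chunk is a block of 4 rows; interleaving the 4 rows column by column, so that the 4 values a thread needs at each accumulation step are adjacent, is an assumption of this sketch, and the function name is hypothetical.

```python
def interleave_4x4(a):
    """Reshape A (rows x cols, rows a multiple of 4) into rows/4 x (4 * cols)."""
    rows, cols = len(a), len(a[0])
    out = []
    for chunk in range(0, rows, 4):        # Chunk = block of 4 rows
        line = []
        for col in range(cols):
            for r in range(4):
                # The 4 values of one column of the chunk become contiguous.
                line.append(a[chunk + r][col])
        out.append(line)
    return out


matrix_a = [[r * 8 + c for c in range(8)] for r in range(8)]   # 8x8
matrix_ai = interleave_4x4(matrix_a)                           # 2x32
```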
FFT-based Convolution
• Convolution in the spatial domain is equivalent to an element-wise multiplication in the frequency domain
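This equivalence can be checked numerically on a small 1D example. The sketch below uses a naive O(N²) DFT from the Python standard library rather than an FFT, and circular convolution; a linear convolution would additionally need zero-padding of both inputs. All function names are illustrative.

```python
import cmath


def dft(x, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a sequence."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[t] * cmath.exp(sign * 2j * cmath.pi * f * t / n) for t in range(n))
        for f in range(n)
    ]
    return [v / n for v in out] if inverse else out


def circular_convolution(x, h):
    """Circular convolution computed directly in the spatial domain."""
    n = len(x)
    return [sum(x[m] * h[(i - m) % n] for m in range(n)) for i in range(n)]


def convolution_via_dft(x, h):
    """Same result through an element-wise product in the frequency domain."""
    product = [a * b for a, b in zip(dft(x), dft(h))]
    return [v.real for v in dft(product, inverse=True)]


x = [1, 2, 3, 4]
h = [1, 0, -1, 0]
direct = circular_convolution(x, h)
via_frequency = convolution_via_dft(x, h)
```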
From Radix-2 to Mixed-Radix
• The best-known FFT is the radix-2 Cooley–Tukey algorithm (only for N a power of 2: N = 2 x 2 x 2…)
• In general, any factorization of N can be used (N = N1 x N2 x N3 x…)
• Mixed-Radix is the generalization of the basic radix-2 FFT
Over 1.5x better performance than Radix-2
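The idea of factoring N into arbitrary radices can be sketched with a recursive decimation-in-time Cooley–Tukey transform. This plain-Python version always splits by the smallest prime factor of N and falls back to a direct DFT when N is prime; it is a textbook illustration, not the in-place GPU implementation discussed in this deck, and the function names are hypothetical.

```python
import cmath


def smallest_factor(n):
    """Smallest prime factor of n (n itself when n is prime)."""
    f = 2
    while f * f <= n:
        if n % f == 0:
            return f
        f += 1
    return n


def mixed_radix_fft(x):
    """Recursive Cooley-Tukey FFT for any length N = N1 x N2 x N3 x ..."""
    n = len(x)
    if n == 1:
        return list(x)
    radix = smallest_factor(n)
    m = n // radix
    # Split into 'radix' interleaved sub-sequences and transform each one.
    subs = [mixed_radix_fft(x[j::radix]) for j in range(radix)]
    # Combine with twiddle factors: X[k] = sum_j W_N^(j*k) * F_j[k mod m].
    return [
        sum(cmath.exp(-2j * cmath.pi * j * k / n) * subs[j][k % m]
            for j in range(radix))
        for k in range(n)
    ]


signal = [float(i * i % 7) for i in range(12)]   # N = 12 = 2 x 2 x 3
spectrum = mixed_radix_fft(signal)
```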
FFT Implementation
• Recursive FFT in-place computation*
• Each thread computes a single radix-N (floating point computation)
• Block-wise 2x2 in-place transposition
• ~2x better performance than 2x2 out-of-place transposition
• Out-of-place batched convolution
• High memory requirements as we have to keep the frequency representation for:
1. Input image
2. Convolution kernels
3. Result of convolutions
* https://community.arm.com/groups/arm-mali-graphics/blog/2016/02/01/speeding-up-fast-fourier-transform-mixed-radix-on-mobile-arm-mali-gpu-by-means-of-opencl-part-2
SGEMM vs FFT (I)
SGEMM-based convolution
• High memory requirements due to im2col:
• stride < kernel dimension
• large convolution kernel
• large input image
FFT-based convolution
• No efficient way to handle stride != 1
• High memory requirements for batched convolutions
• It could require considerable effort to optimize well
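The im2col memory cost can be estimated with simple arithmetic. The sketch below assumes no padding, 4-byte (single-precision) elements, and an im2col matrix with one row per output position and one column per kernel tap and channel; the function names and the example sizes are illustrative only.

```python
def im2col_bytes(h, w, channels, kh, kw, stride, bytes_per_element=4):
    """Size of the im2col matrix for an h x w x channels image."""
    oh = (h - kh) // stride + 1
    ow = (w - kw) // stride + 1
    return oh * ow * kh * kw * channels * bytes_per_element


def image_bytes(h, w, channels, bytes_per_element=4):
    """Size of the original image."""
    return h * w * channels * bytes_per_element


original = image_bytes(8, 8, 1)                   # 8x8 image, 1 channel
stride_1 = im2col_bytes(8, 8, 1, 3, 3, stride=1)  # stride < kernel dimension
stride_3 = im2col_bytes(8, 8, 1, 3, 3, stride=3)  # stride == kernel dimension
```

With a 3x3 kernel and stride 1 the im2col matrix is about five times larger than the image in this example, while with stride 3 no pixel is duplicated.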
SGEMM vs FFT (II)
• Study limited to the inference problem
• Stride x = 1 and stride y = 1
• Number of channels = 1
• Pre-computed FFT for convolution kernels
Case 1: 1 input image, 64/128/256 convolution kernels
Case 2: 64 input images, 32 convolution kernels
[Plots: SGEMM vs. FFT; axis labels: Image size, Kernel size / Number of convolutions, Kernel size]
And using stride x = 2?
Limited Numerical Precision for CNN (I)
• Some papers ([1], [2]) have demonstrated the feasibility of using limited numerical precision for CNN
• This opens an interesting computational scenario if, for instance, the HW has accelerators for 16-bit half precision:
• Performance boosting
• Reduced memory traffic to/from external memory
• Possible to dispatch fewer threads
• Energy saving
• Essentially due to the reduced memory traffic to/from the external memory
[1] Training Deep Neural Networks with Low Precision Multiplications, Matthieu Courbariaux, Jean-Pierre David
[2] Deep Learning with Limited Numerical Precision, Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, Pritish Narayanan
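The trade-off between storage and accuracy can be illustrated with the Python standard library, whose `struct` module supports the IEEE 754 half-precision format (`'e'`). The sketch below rounds a few made-up weight values to half precision and compares the storage size with single precision; it says nothing about how a trained network behaves, which is the subject of [1] and [2].

```python
import struct


def to_half(value):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack('<e', struct.pack('<e', value))[0]


weights = [0.1, -0.5, 0.333333, 1.0]              # hypothetical weight values
half_weights = [to_half(w) for w in weights]

# Largest rounding error introduced by storing the weights in 16 bits.
max_error = max(abs(a - b) for a, b in zip(weights, half_weights))

# Bytes needed to store the weights in each format.
half_bytes = struct.calcsize('<e') * len(weights)
single_bytes = struct.calcsize('<f') * len(weights)
```

Halving the bytes per element is what reduces the memory traffic to and from external memory.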
Limited Numerical Precision for CNN (II)
[Plot: SGEMM Speed-Up vs. N (512, 1024, 2048, 4096), where A=NxN, B=NxN, C=NxN]
• > 2.0x
• It is possible to dispatch fewer threads, i.e. 8x4 elements per thread
[Plot: FFT Speed-Up vs. N (512, 1024, 2048, 4096)]
• > 1.5x
• We cannot dispatch fewer threads: each thread computes a single radix-N
Lessons Learned
1. A cache-efficient data layout has a huge impact on the performance of our algorithms, also for GPU computing
2. Simple changes in data layout can make it possible to:
• dispatch fewer threads
• better exploit vector instructions
3. Limited numerical precision plays a crucial role IF it is HW accelerated
4. Convolution is an embarrassingly parallel task which can be easily and efficiently accelerated on mobile GPU by means of OpenCL
Question Time
Thank you!
The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited (or its subsidiaries) in the EU and/or
elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners.
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Hyundai Motor Group
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 

Recently uploaded (20)

Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 

"Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM

  • 1. Using SGEMM and FFTs to Accelerate Deep Learning
      Gian Marco Iodice, SW Engineer – ARM
      May 3, 2016
      Copyright © 2016 ARM Ltd
  • 2. Contents
      • About ARM
      • Convolutional Neural Networks (CNN)
      • Architecture and building blocks
      • Convolutional Layer
      • SGEMM-based convolution
      • FFT-based convolution
      • SGEMM vs FFT
      • Limited Numerical Precision for CNN
      • Lessons Learned
  • 3. ARM Ltd
      • ARM Holdings plc is a British multinational semiconductor and software design company (www.arm.com)
      • Headquarters in Cambridge, England
  • 4. Architecture and Building Blocks of CNN
      • Convolutional layer (core block of CNN)
      • Number of convolution kernels (filter bank)
      • Filter shape (width, height and depth)
      • Pooling layer (typical values 2x2)
      • Non-linear gating (ReLU)
      • Classifier: Fully Connected Neural Network
      [Figure labels: Learned, Non-Linear, Trainable Feature]
  • 5. Why Are We Going to Study the Convolutional Layer?
      Compute Load for AlexNet Inference*
      conv1 16.9%, relu 0.7%, pool 1.0%, conv2 21.9%, pool2 0.7%, norm2 0.5%, conv3 17.8%, relu3 0.2%, conv4 17.8%, conv5 17.7%, fc6 1.8%, fc7 0.8%
      * Learning Semantic Image Representations at a Large Scale, Yangqing Jia
  • 6. From 2D Convolution to 3D Batched Convolution
      • Most of the time for the convolution layers we have:
      • Multiple input images
      • Multiple convolution kernels (various dimensions and shapes)
      • Multiple channels per image/kernel (not necessarily 3!)
      [Figure: Input image, Kernels, Output images]
      Why don’t we use a sliding-window approach?
  • 7. SGEMM-based Convolution
      SGEMM: Single-precision GEneral Matrix Multiply
      C = α∙AB + β∙C
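As a reference point for the optimized variants on the following slides, the SGEMM formula can be written out directly. This is only a minimal sketch: the function name `sgemm` and the list-of-lists layout are illustrative choices, and Python floats are double precision, so it shows the arithmetic rather than true single-precision behavior.

```python
def sgemm(alpha, a, b, beta, c):
    """Return alpha * (a x b) + beta * c for row-major lists of lists."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]  # dot product of row i of A and column j of B
            out[i][j] = alpha * acc + beta * c[i][j]
    return out


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = [[1.0, 1.0], [1.0, 1.0]]
result = sgemm(2.0, a, b, 0.5, c)
```

With α = 1 and β = 0 this reduces to a plain matrix product, which is the case the convolution mapping needs.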
  • 8. Im2col
      • im2col stores in each row the pixels needed for each kernel application
      • Costs in terms of memory requirements!
      • Pixel duplication
      • col2im restores the output image structure
      [Figure: Input image (stride X) rearranged into matrix A; A x B = C; C rearranged into Output images]
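The im2col idea can be sketched for a single-channel image with no padding. The helper names (`im2col`, `conv_gemm`, `conv_sliding`) are hypothetical, and the sketch assumes the CNN convention of applying the kernel without flipping it. It is not the deck's implementation, only a small check that the matrix product matches the sliding-window result.

```python
def im2col(image, kh, kw, stride=1):
    """Return one row per kernel application, holding the kh*kw pixels it covers."""
    h, w = len(image), len(image[0])
    return [
        [image[y + dy][x + dx] for dy in range(kh) for dx in range(kw)]
        for y in range(0, h - kh + 1, stride)
        for x in range(0, w - kw + 1, stride)
    ]


def conv_gemm(image, kernels, stride=1):
    """Convolve with several equally sized kernels through one matrix product."""
    kh, kw = len(kernels[0]), len(kernels[0][0])
    a = im2col(image, kh, kw, stride)
    # Matrix B holds one flattened kernel per column.
    b = [[k[dy][dx] for k in kernels] for dy in range(kh) for dx in range(kw)]
    return [
        [sum(row[p] * b[p][j] for p in range(kh * kw)) for j in range(len(kernels))]
        for row in a
    ]


def conv_sliding(image, kernel, stride=1):
    """Direct sliding-window result, flattened row by row, for comparison."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    return [
        sum(image[y + dy][x + dx] * kernel[dy][dx] for dy in range(kh) for dx in range(kw))
        for y in range(0, h - kh + 1, stride)
        for x in range(0, w - kw + 1, stride)
    ]


image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
kernels = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]]
cols = im2col(image, 2, 2)
out = conv_gemm(image, kernels)
```

Here a 4x4 image (16 pixels) becomes a 9x4 matrix (36 values), which is the pixel duplication the slide warns about. Each column of `out` is one output image in flattened form; reshaping it to 3x3 plays the role of col2im.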
  • 9. SGEMM: Naïve implementation
      • Each thread computes a single element of the output matrix
      • Not cache friendly!
        /* First accumulation */
        ai = load(addr_a);
        bi = load(addr_b);
        c0 += ai * bi;
        /* Second accumulation */
        ai = load(addr_a + 1);
        bi = load(addr_b + 1 * N);
        c0 += ai * bi;
        …
        store(c0, addr_c);
      [Figure: Matrix A, Matrix B, Matrix C]
  • 10. Transpose Matrix B
      • With Matrix B transposed, consecutive accumulations read consecutive addresses of B
        /* First accumulation */
        ai = load(addr_a);
        bi = load(addr_b);
        c0 += ai * bi;
        /* Second accumulation */
        ai = load(addr_a + 1);
        bi = load(addr_b + 1);
        c0 += ai * bi;
        ...
        store(c0, addr_c);
      [Figure: Matrix A, Matrix B (transposed), Matrix C]
      Speed-up achievable? 1.1x…
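The difference between the naïve kernel and the transposed-B kernel is only in how B is addressed. The sketch below mirrors the two access patterns on flat row-major lists; the function names are illustrative and the loops stand in for what each GPU thread would do for one output element.

```python
def transpose(b, n):
    """Transpose an n x n matrix stored as a flat row-major list."""
    return [b[r * n + c] for c in range(n) for r in range(n)]


def gemm_naive(a, b, n):
    c = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for p in range(n):
                # B is walked down a column: the address jumps by n each step.
                acc += a[i * n + p] * b[p * n + j]
            c[i * n + j] = acc
    return c


def gemm_transposed_b(a, bt, n):
    c = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for p in range(n):
                # B^T is walked along a row: the address advances by 1 each step.
                acc += a[i * n + p] * bt[j * n + p]
            c[i * n + j] = acc
    return c


n = 3
a = [float(v) for v in range(9)]
b = [float(v) for v in range(9, 18)]
c_naive = gemm_naive(a, b, n)
c_bt = gemm_transposed_b(a, transpose(b, n), n)
```

Both functions produce the same matrix; only the stride of the reads from B changes.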
  • 11. Transpose Matrix B in Chunks of 1x4 (I)
      • Each thread computes 1x4 elements of the output matrix
      • Not cache friendly!
        float4 out = 0.0f;
        /* First accumulation */
        ai = load(addr_a);
        bi = vload4(addr_b);
        out += (float4)ai * bi;
        /* Second accumulation */
        ai = load(addr_a + 1);
        bi = vload4(addr_b + 1 * N);
        out += (float4)ai * bi;
        ...
        store4(out, addr_c);
      [Figure: Matrix A, Matrix B, Matrix C]
  • 12. Transpose Matrix B in Chunks of 1x4 (II)
        float4 out = 0.0f;
        /* First accumulation */
        ai = load(addr_a);
        bi = vload4(addr_b);
        out += (float4)ai * bi;
        /* Second accumulation */
        ai = load(addr_a + 1);
        bi = vload4(addr_b + 4);
        out += (float4)ai * bi;
        ...
        store4(out, addr_c);
      [Figure: Matrix B reshaped into Matrix BT1x4]
      [Plot: SGEMM speed-up vs N (512 to 4096), with A, B and C all NxN]
      Speed-up achievable? 3.5x
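One way to read the 1x4 layout is that, for every group of four columns of B, the four values of each row are stored back to back, so successive `vload4` calls step by 4 elements. The sketch below follows that reading; the exact memory layout used in the talk is an assumption here, and `reshape_b_1x4` and `gemm_1x4` are hypothetical names.

```python
def reshape_b_1x4(b, k, n):
    """Store B (k x n, flat row-major) so each group of 4 columns is contiguous."""
    assert n % 4 == 0
    out = []
    for g in range(n // 4):          # one group of 4 output columns
        for p in range(k):           # rows of B, visited in accumulation order
            out.extend(b[p * n + 4 * g: p * n + 4 * g + 4])
    return out


def gemm_1x4(a, b1x4, m, k, n):
    """Each (i, g) iteration plays the role of one thread producing 1x4 outputs."""
    c = [0.0] * (m * n)
    for i in range(m):
        for g in range(n // 4):
            acc = [0.0] * 4
            base = g * k * 4
            for p in range(k):
                ai = a[i * k + p]
                bi = b1x4[base + 4 * p: base + 4 * p + 4]  # like vload4(addr_b + 4 * p)
                acc = [acc[t] + ai * bi[t] for t in range(4)]
            c[i * n + 4 * g: i * n + 4 * g + 4] = acc
    return c


def gemm_ref(a, b, m, k, n):
    return [
        sum(a[i * k + p] * b[p * n + j] for p in range(k))
        for i in range(m)
        for j in range(n)
    ]


m, k, n = 2, 3, 8
a = [float(v) for v in range(m * k)]
b = [float(v) for v in range(k * n)]
c_1x4 = gemm_1x4(a, reshape_b_1x4(b, k, n), m, k, n)
```

The reshaped matrix holds exactly the same values as B, just reordered so that every vector load in the inner loop is contiguous.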
  • 13. Reshaping Matrix A (I)
      • We can do more: we can compute a block of 4x4 elements per thread in order to reuse the values loaded from Matrix A
      [Figure: Matrix A, Matrix BT1x4, Matrix C]
  • 14. Reshaping Matrix A (II)
      • Chunk = block of 4 rows
      [Figure: Matrix A (8x8) split into Chunk 0 and Chunk 1 and reshaped into Matrix AI (2x32)]
      [Plot: SGEMM speed-up vs N (512 to 4096), with A, B and C all NxN]
      Speed-up achievable? > 8.0x
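The slide gives only the shapes (8x8 reshaped to 2x32, one output row per chunk of 4 rows), so the element order inside a chunk is an assumption in the sketch below: the four rows are interleaved column by column, so one vector load yields the four A values needed for a 4x4 block. The names `interleave_a_4` and `gemm_4x4` are hypothetical.

```python
def interleave_a_4(a, m, k):
    """Reshape A (m x k, flat row-major) into m/4 rows of 4*k interleaved values."""
    assert m % 4 == 0
    out = []
    for chunk in range(m // 4):
        for p in range(k):
            for r in range(4):
                out.append(a[(4 * chunk + r) * k + p])
    return out


def gemm_4x4(ai, b, m, k, n):
    """Each (chunk, g) iteration plays the role of one thread producing a 4x4 block."""
    c = [0.0] * (m * n)
    for chunk in range(m // 4):
        for g in range(n // 4):
            acc = [[0.0] * 4 for _ in range(4)]
            for p in range(k):
                a4 = ai[chunk * k * 4 + 4 * p: chunk * k * 4 + 4 * p + 4]
                b4 = b[p * n + 4 * g: p * n + 4 * g + 4]
                for r in range(4):
                    for t in range(4):
                        acc[r][t] += a4[r] * b4[t]  # every loaded value is reused 4 times
            for r in range(4):
                row = (4 * chunk + r) * n + 4 * g
                c[row: row + 4] = acc[r]
    return c


def gemm_ref(a, b, m, k, n):
    return [
        sum(a[i * k + p] * b[p * n + j] for p in range(k))
        for i in range(m)
        for j in range(n)
    ]


m = k = n = 8
a = [float(v) for v in range(64)]
b = [float(v % 7) for v in range(64)]
a_interleaved = interleave_a_4(a, m, k)
c_blocked = gemm_4x4(a_interleaved, b, m, k, n)
```

For the 8x8 case, `a_interleaved` has 64 values that can be viewed as 2 rows of 32, matching the shapes on the slide.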
  • 15. FFT-based Convolution
      • Convolution in the spatial domain is equivalent to an element-wise (pointwise) multiplication in the frequency domain
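The equivalence can be checked numerically in one dimension. The sketch below uses a small recursive radix-2 FFT written for illustration (not the in-place GPU implementation from the talk), zero-pads both inputs to a power of two so that the circular convolution equals the linear one, and compares against a direct convolution. The 2D case used for images works the same way along each axis.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two. No 1/n scaling."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    sign = 1 if inverse else -1
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out


def fft_convolve(signal, kernel):
    """Linear convolution via pointwise multiplication in the frequency domain."""
    size = len(signal) + len(kernel) - 1
    n = 1
    while n < size:
        n *= 2
    fs = fft(list(signal) + [0.0] * (n - len(signal)))
    fk = fft(list(kernel) + [0.0] * (n - len(kernel)))
    product = [s * k for s, k in zip(fs, fk)]
    full = fft(product, inverse=True)
    return [v.real / n for v in full[:size]]


def direct_convolve(signal, kernel):
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


signal = [1.0, 2.0, 3.0, 4.0]
kernel = [1.0, 0.0, -1.0]
via_fft = fft_convolve(signal, kernel)
via_direct = direct_convolve(signal, kernel)
```

Note that this is true convolution (the kernel is flipped); the cross-correlation typically used in CNN layers is obtained by reversing the kernel before the transform.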
  • 16. From Radix-2 to Mixed-Radix
      • The best-known FFT is the radix-2 Cooley–Tukey algorithm (only for N a power of 2: N = 2 x 2 x 2…)
      • In general, any factorization of N can be used (N = N1 x N2 x N3 x…)
      • Mixed-radix is the generalization of the basic radix-2 FFT
      Over 1.5x better performance than Radix-2
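The generalization from radix-2 to an arbitrary factorization can be sketched with a recursive decimation-in-time FFT that peels off the smallest prime factor of N at each level. This is an illustrative, out-of-place version with hypothetical function names, not the OpenCL implementation the talk measures.

```python
import cmath


def smallest_factor(n):
    """Return the smallest prime factor of n (n itself if n is prime)."""
    f = 2
    while f * f <= n:
        if n % f == 0:
            return f
        f += 1
    return n


def mixed_radix_fft(x):
    """Cooley-Tukey FFT for any length, splitting N = r * m with r the smallest factor."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    r = smallest_factor(n)
    m = n // r
    # r interleaved sub-sequences of length m, each transformed recursively.
    subs = [mixed_radix_fft(x[q::r]) for q in range(r)]
    out = []
    for k in range(n):
        acc = 0j
        for q in range(r):
            # Twiddle factor W_N^(q*k) combines the r sub-transforms (a radix-r butterfly).
            acc += cmath.exp(-2j * cmath.pi * q * k / n) * subs[q][k % m]
        out.append(acc)
    return out


def naive_dft(x):
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


x12 = [float(v * v % 5) for v in range(12)]  # N = 2 x 2 x 3
x15 = [float(v % 4) for v in range(15)]      # N = 3 x 5
y12 = mixed_radix_fft(x12)
y15 = mixed_radix_fft(x15)
```

When N is a power of two, every level uses r = 2 and the routine reduces to the radix-2 algorithm; other lengths mix radices such as 3 and 5 without padding.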
  • 17. FFT Implementation
      • Recursive FFT in-place computation*
      • Each thread computes a single radix-N (floating point computation)
      • Block-wise 2x2 in-place transposition
      • ~2x better performance than 2x2 out-of-place transposition
      • Out-of-place batched convolution
      • High memory requirements as we have to keep the frequency representation for:
        1. Input image
        2. Convolution kernels
        3. Result of convolutions
      * https://community.arm.com/groups/arm-mali-graphics/blog/2016/02/01/speeding-up-fast-fourier-transform-mixed-radix-on-mobile-arm-mali-gpu-by-means-of-opencl-part-2
  • 18. SGEMM vs FFT (I)
      SGEMM-based convolution
      • High memory requirements due to im2col:
      • stride < kernel dimension
      • large convolution kernel
      • large input image
      FFT-based convolution
      • No efficient way to handle stride != 1
      • High memory requirements for batched convolutions
      • It could require considerable effort to optimize well
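The memory trade-off can be made concrete with a rough count. The sketch below is an estimate under stated assumptions, not a measurement from the talk: no padding for im2col, and for the FFT path one frequency-domain plane per input image, per kernel and per image/kernel result, with a single channel. Both function names are hypothetical.

```python
def im2col_elements(height, width, kh, kw, stride=1, channels=1):
    """Number of values in the im2col matrix for one image (no padding assumed)."""
    out_h = (height - kh) // stride + 1
    out_w = (width - kw) // stride + 1
    return out_h * out_w * kh * kw * channels


def fft_planes(n_images, n_kernels):
    """Frequency-domain planes kept for an out-of-place batched convolution.

    Assumes one plane per input image, one per kernel and one per
    image/kernel result, with a single channel.
    """
    return n_images + n_kernels + n_images * n_kernels


# A 4x4 image with a 2x2 kernel: 16 input pixels.
dense = im2col_elements(4, 4, 2, 2, stride=1)    # stride < kernel size: patches overlap
sparse = im2col_elements(4, 4, 2, 2, stride=2)   # stride == kernel size: no overlap

# The batched case from the comparison: 64 input images, 32 kernels.
planes = fft_planes(64, 32)
```

With stride 1 the 16 input pixels expand to 36 stored values, while with stride equal to the kernel size nothing is duplicated, which is why the slide singles out stride < kernel dimension.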
  • 19. SGEMM vs FFT (II)
      • Study limited to the inference problem
      • Stride x = 1 and stride y = 1
      • Number of channels = 1
      • Pre-computed FFT for convolution kernels
      Case 1: 1 input image, 64/128/256 convolution kernels
      Case 2: 64 input images, 32 convolution kernels
      [Plots for the two cases: axes are image size and kernel size / number of convolutions; legend: SGEMM, FFT]
      And using stride x = 2?
  • 20. Limited Numerical Precision for CNN (I)
      • Some papers ([1], [2]) have demonstrated the feasibility of using limited numerical precision for CNN
      • This opens an interesting computational scenario if, for instance, the HW has accelerators for 16-bit half precision:
      • Performance boost
      • Reduced memory traffic to/from external memory
      • Possible to dispatch fewer threads
      • Energy saving
      • Essentially due to the reduced memory traffic to/from the external memory
      [1] Training Deep Neural Networks with Low Precision Multiplications, Matthieu Courbariaux, Jean-Pierre David
      [2] Deep Learning with Limited Numerical Precision, Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, Pritish Narayanan
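The two effects of moving to 16-bit half precision, halved storage and a bounded rounding error, can be seen with Python's `struct` module, whose `e` format is IEEE 754 binary16. The weight values below are made up for illustration, and the sketch says nothing about accuracy of a real network, which the cited papers study.

```python
import struct


def to_half_and_back(value):
    """Round a Python float to IEEE 754 half precision and return it as a float."""
    return struct.unpack('<e', struct.pack('<e', value))[0]


weights = [0.1, -1.5, 3.14159, 1024.25]  # hypothetical values, not from the talk

fp32_bytes = len(struct.pack('<%df' % len(weights), *weights))
fp16_bytes = len(struct.pack('<%de' % len(weights), *weights))

rounded = [to_half_and_back(w) for w in weights]
max_rel_err = max(abs(r - w) / abs(w) for r, w in zip(rounded, weights))
```

The same buffer takes half the bytes in half precision, so half the memory traffic, while each value in the normal range keeps a relative error of at most 2^-11.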
  • 21. Limited Numerical Precision for CNN (II)
      [Plot: SGEMM speed-up vs N (512 to 4096), with A, B and C all NxN] Speed-up: > 2.0x
      • It is possible to dispatch fewer threads, i.e. 8x4 elements per thread
      [Plot: FFT speed-up vs N (512 to 4096)] Speed-up: > 1.5x
      • We cannot dispatch fewer threads: each thread computes a single radix-N
  • 22. Lessons Learned
      1. A cache-efficient data layout has a huge impact on the performance of our algorithm, also for GPU computing
      2. Simple changes in data layout can make it possible to:
      • dispatch fewer threads
      • better exploit vector instructions
      3. Limited numerical precision plays a crucial role IF it is HW accelerated
      4. Convolution is an embarrassingly parallel task which can be easily and efficiently accelerated on a mobile GPU by means of OpenCL
  • 23. Question Time
  • 24. Thank you!
      The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited (or its subsidiaries) in the EU and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners. Copyright © 2016 ARM Limited