
"Efficient Convolutional Neural Network Inference on Mobile GPUs," a Presentation from Imagination Technologies


For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/imagination-technologies/embedded-vision-training/videos/pages/may-2016-embedded-vision-summit

For more information about embedded vision, please visit:
http://www.embedded-vision.com

Paul Brasnett, Principal Research Engineer at Imagination Technologies, presents the "Efficient Convolutional Neural Network Inference on Mobile GPUs" tutorial at the May 2016 Embedded Vision Summit.

GPUs have become established as a key tool for training deep learning algorithms. Deploying those algorithms on end devices is a key enabler of their commercial success, and mobile GPUs are proving to be an efficient target processor that is readily available in end devices today. This talk looks at how to approach the task of deploying convolutional neural networks (CNNs) on mobile GPUs today. Brasnett explores the key primitives for CNN inference and the strategies available for implementing them. He works through alternative options and trade-offs, and provides reference performance analysis on mobile GPUs, using the PowerVR architecture as a case study.


  1. Efficient Convolutional Neural Network Inference on Mobile GPUs. Paul Brasnett, May 3, 2016
  2. Overview
     • About Imagination Technologies
     • PowerVR GPUs
     • Case study: Implementing Convolutions
     • Performance Analysis
     • Conclusions
     • Resources
  3. About Imagination Technologies
     • Imagination Technologies is a leading IP supplier for multimedia, processors and communications
     • More than 8bn units containing Imagination IP shipped
     [Diagram: SoC fabric connecting PowerVR Graphics & GPU Compute Processors, Ensigma Communications Processors, PowerVR Vision Processors, MIPS Processors and PowerVR Video Processors]
  4. What is a Mobile GPU?
     Mobile GPU: optimised for high performance at low power
  5. What is a Mobile GPU?
     Mobile GPU: optimised for high performance at low power
     Target markets: mobile devices, automotive, consumer multimedia, wearables, Internet of Things, augmented reality
  6. Why Mobile GPUs for Vision Processing?
     • CPUs can deliver high peak/burst performance
       • But generate large amounts of heat
     • PowerVR mobile GPUs provide:
       • Lowest-power FP16 and integer pipelines
       • Local memory for highly efficient data access for compute operations
       • Power-saving features such as gating of non-compute parts of the GPU for efficient compute operation
  7. Why Mobile GPUs for Vision Processing?
     Performance relative to CPU (CPU = 100%), PowerVR Series6:
     • Provence (raytracing): 265%
     • Particle Simulation – 32k: 407%
     • Particle Simulation – 4k: 517%
     • Julia Set: 963%
     • Ambient Occlusion: 1126%
     • Denoise: 482%
     • Gaussian Blur: 383%
  8. Moving the CNN Workload to the GPU
     [Block diagram: the CPU (CPU0/CPU1, few threads, large cache) enqueues a compute kernel to the PowerVR GPU (graphics and compute), which comprises multiprocessors (Unified Shading Clusters) with schedulers, residency slots, common/compute stores and texture processing units, a coarse grain scheduler, host interface, L2/system level cache and a system memory interface, all sharing unified system memory]
  9. Evolution of Mobile GPU
     PowerVR Series 6 GPU → PowerVR Series 7 GPU → PowerVR Series 8 GPU → …
  10. Evolution of Mobile GPU
     New APIs: OpenCL 1.2, OpenCV, OpenVX, Vulkan, OpenCL 2.0
  11. GPU Dominates Compute in Modern SoCs
     • Mobile GPU increasingly dominating compute performance in SoCs
     [Illustrative diagram only, to show relative CPU/GPU size]
  12. Why CNNs?
     • State-of-the-art performance
     • Rapid development cycles
     • Range of vision tasks
       • Classification
       • Localisation
       • Other applications…
     Example: camera localisation. PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization, Kendall, A., Grimes, M., Cipolla, R., ICCV 2015
  13. What is a CNN?
     CNN architecture building blocks: Convolution, Activation, Normalization, Pooling, Fully Connected, Soft Max
     [Diagram: example network chaining these blocks from an input image, with repeated convolution, activation, normalization and pooling stages followed by a fully connected layer and soft max]
  14. CNN Object Classification
     • Training (offline): architecture + data → CNN library + compute + time → model coefficients
  15. CNN Object Classification
     • Training (offline): architecture + data → CNN library + compute + time → model coefficients
     • Inference (online): uses the trained architecture and model coefficients
  16. CNN Object Classification
     • Training (offline): architecture + data → CNN library + compute + time → model coefficients
     • Inference (online): architecture + model coefficients + image → CNN library + compute on the mobile GPU → classification
  17. Where is the Cost in CNN Inference?
     [Chart: flops by layer type for AlexNet, split across Convolution, Normalisation, Pooling and Fully Connected layers, with convolution accounting for the large majority]
  18. Matrix Multiply: Naïve
     • Create as many work-items as there are elements in the output matrix
     • Each work-item reads its row of A and column of B and produces a dot product
     • Requires a large number of accesses to global memory
     A x B = C
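     As a minimal sketch of the naïve scheme above (the kernel name, argument names and the
     row-major layout are illustrative assumptions, not taken from the deck), one work-item is
     launched per element of C and computes its full dot product straight from global memory,
     which is where the large number of memory accesses comes from:

         // Naïve matrix multiply: one work-item per element of C = A x B.
         // A is M x K, B is K x N, C is M x N, all row-major (assumed layout).
         __kernel void matmul_naive(__global const float *A,
                                    __global const float *B,
                                    __global       float *C,
                                    const int M, const int N, const int K)
         {
             const int col = get_global_id(0);   // column of C handled by this work-item
             const int row = get_global_id(1);   // row of C handled by this work-item
             if (row >= M || col >= N)
                 return;

             float acc = 0.0f;
             for (int k = 0; k < K; ++k)         // K global reads of A and K of B per work-item
                 acc += A[row * K + k] * B[k * N + col];

             C[row * N + col] = acc;
         }

     The kernel would be enqueued with a 2D NDRange of (N, M), so the number of work-items
     matches the number of output elements.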
  19. OpenCL Memory Model
     • The OpenCL memory model closely maps to the GPU architecture
     • Private memory: per work-item
     • Local memory: shared within a work-group
     • Global memory / constant memory: visible to all work-groups
     • Host memory: typically shared between CPU and GPU on a mobile SoC
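     As a small illustration (assumed kernel and argument names, not from the deck), the
     device-side address spaces above appear directly as qualifiers in OpenCL C; host memory
     only appears on the host side, through buffers that on a mobile SoC typically live in
     memory shared by the CPU and GPU:

         __kernel void memory_spaces(__global   const float *in,      // global: visible to all work-groups
                                     __constant       float *coeffs,  // constant: read-only, visible to all
                                     __local          float *scratch, // local: shared within one work-group
                                     __global         float *out)
         {
             const int gid = get_global_id(0);
             const int lid = get_local_id(0);

             float v = in[gid] * coeffs[0];     // 'v' lives in private memory, per work-item
             scratch[lid] = v;                  // stage the value in local memory
             barrier(CLK_LOCAL_MEM_FENCE);      // make it visible to the rest of the work-group
             out[gid] = scratch[lid];
         }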
  20. Matrix Multiply: Tiling Approach
     • Work-items load A data into private memory
     A x B = C
     Tiling approach based on Volkov and Demmel, "Using GPUs to accelerate linear algebra runtime", 2008
  21. Matrix Multiply: Tiling Approach
     • Work-items load A data into private memory
     • Work-groups load B data into local memory
     • Each work-item reads from local memory and produces a dot product
     • Significantly reduces global memory accesses
     A x B = C
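     A minimal sketch of the tiled scheme from the two slides above, under assumed names, tile
     sizes and a row-major layout (none of which come from the deck): each work-item keeps its
     slice of A in private memory, the work-group cooperatively stages a tile of B in local
     memory, and partial dot products accumulate tile by tile, so each staged element of B is
     fetched from global memory once per work-group rather than once per work-item:

         // Tile sizes are illustrative; the work-group is TILE_X x TILE_Y = 32 work-items.
         // Assumes M, N and K are multiples of the tile sizes (no edge handling in this sketch).
         #define TILE_X 8
         #define TILE_Y 4
         #define TILE_K 16

         __kernel void matmul_tiled(__global const float *A,   // M x K, row-major
                                    __global const float *B,   // K x N, row-major
                                    __global       float *C,   // M x N, row-major
                                    const int M, const int N, const int K)
         {
             const int col = get_global_id(0);                 // column of C for this work-item
             const int row = get_global_id(1);                 // row of C for this work-item
             const int lx  = get_local_id(0);
             const int ly  = get_local_id(1);

             __local float Btile[TILE_K][TILE_X];              // tile of B shared by the work-group
             float acc = 0.0f;

             for (int t = 0; t < K; t += TILE_K) {
                 // Each work-item copies its own slice of A into private memory (registers).
                 float Apriv[TILE_K];
                 for (int k = 0; k < TILE_K; ++k)
                     Apriv[k] = A[row * K + t + k];

                 // The work-group cooperatively loads a TILE_K x TILE_X block of B into local memory.
                 for (int k = ly; k < TILE_K; k += TILE_Y)
                     Btile[k][lx] = B[(t + k) * N + col];
                 barrier(CLK_LOCAL_MEM_FENCE);

                 // Dot-product contribution for this tile: private A against local B.
                 for (int k = 0; k < TILE_K; ++k)
                     acc += Apriv[k] * Btile[k][lx];
                 barrier(CLK_LOCAL_MEM_FENCE);                 // finish with the tile before it is overwritten
             }
             C[row * N + col] = acc;
         }

     Each value staged in Btile is reused by the TILE_Y work-items that share its column, which
     is where the reduction in global memory accesses comes from.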
  22. Matrix Multiply: OpenCL Tips
     • Choose a work-group size to fit the GPU; 32 work-items is typically a good choice for PowerVR GPUs
     • Read multiple items (e.g. 4 or 8) into private memory at a time to optimise memory transfers
     • Consider the use of the half data type in place of float
       • Most PowerVR platforms provide up to 2x the flops
     • Define the work-group size at compile time:
       __attribute__((reqd_work_group_size(SIZE, 1, 1)))
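     The fragment below is a small, self-contained illustration of these tips rather than code
     from the talk: it fixes the work-group size at compile time, reads four elements per memory
     transaction, and keeps one operand in half precision as a storage format (widened to float
     on load, which does not require the cl_khr_fp16 extension). Names and the element-wise
     operation are assumptions:

         // Element-wise multiply, 4 elements per work-item; the buffer length is assumed to be
         // a multiple of 4 * 32 so no edge handling is needed.
         __attribute__((reqd_work_group_size(32, 1, 1)))      // work-group size fixed at compile time
         __kernel void mul4_half(__global const float *x,
                                 __global const half  *w,     // half storage: half the bandwidth of float
                                 __global       float *y)
         {
             const int i = get_global_id(0);                  // this work-item handles elements 4*i .. 4*i+3
             const float4 xv = vload4(i, x);                  // one 4-wide read instead of 4 scalar reads
             const float4 wv = vload_half4(i, w);             // half values widened to float on load
             vstore4(xv * wv, i, y);
         }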
  23. Matrix Multiply: Tiling Approach
     [Chart: execution time (s, log scale) against matrix size for the naïve and tiled matrix multiply implementations]
  24. CNN Classification: AlexNet & GoogLeNet
                                        AlexNet   GoogLeNet
     Model coefficients (millions)         60         5.5
     Operations (billions)                  1.3        3.1
     Top-5 error rate (%)                  18.2       10.07
     Model coefficients drive bandwidth; operations drive compute
  25. Performance Analysis: CNN Inference
     • Time consumed by layer type
     [Charts: time by layer type (convolutions, pooling, normalisation, fully connected) for GoogLeNet and AlexNet; reference times* of 1.36 and 1.00]
  26. Performance Analysis: GPU v CPU*
     [Chart: relative FPS performance (higher is better) on AlexNet for a PowerVR 2-cluster GPU (480MHz) versus an ARM A15 CPU (1.6GHz)]
     * CPU results based on Caffe (with ATLAS)
  27. Efficiency Analysis: GPU v CPU
     [Chart: relative efficiency (higher is better) on AlexNet for a PowerVR 2-cluster GPU (480MHz) versus an ARM A15 CPU (1.6GHz)]
  28. Conclusions
     • Mobile GPUs are widely available in a range of SoCs across numerous markets today
     • Compared to mobile CPUs, PowerVR mobile GPUs offer:
       • up to 3x higher efficiency and
       • up to 12x higher performance when deploying CNNs
     • Newer CNN architectures with smaller fully connected layers help to make more efficient use of compute resources
     • PowerVR GPUs scale to allow higher levels of performance and lower power for current and future generations of vision-enabled products
     • COME & SEE THE DEMO DURING THE NEXT BREAK
  29. Resources
     • PowerVR GPU Compute: https://imgtec.com/tools/powervr-gpu-compute/
     • Guide to writing OpenCL: http://blog.imgtec.com/powervr/a-quick-guide-to-writing-opencl-kernels-for-rogue
     • PowerVR Imaging Framework: http://blog.imgtec.com/powervr/powervr-imaging-framework-sdk
     • PowerVR CNN Demo: see our stand
     • OpenCL Tutorial: https://handsonopencl.github.io/
