Using Raspberry Pi GPU
for Deep Neural Network
2017/9/3
Noriyuki OHKAWA
DNN prediction on end devices
• Advantage
  • Data transfer cost
• Disadvantage
  • Poor computing resources
How to implement on end devices?
• Reduce parameters (including approximation)
• Binary Net
• Parameter Quantization
• Learning Sparse Matrix
• Software
• Fast Convolution Algorithm / Fusion
• Hardware
• FPGA / ASIC
In this presentation
• Software Approach on Raspberry Pi Series
• Use Raspberry Pi GPU efficiently
• Not covered:
  • Hardware Approach
  • Approximation and Re-training
Raspberry Pi GPU
Raspberry Pi CPU/GPU Spec.
       Pi 3                          Pi Zero/W
CPU    ARM Cortex-A53                ARM1176JZF-S
       Quad Core 1.2 GHz             Single Core 1 GHz
GPU    Broadcom VideoCore IV         Broadcom VideoCore IV
       400 MHz                       250 MHz
Single Precision flops (theoretical)
       Pi 3                          Pi Zero/W
CPU    38.4 Gflops (1.1 Gflops/$)    1 Gflops (0.1-0.2 Gflops/$)
GPU    38.4 Gflops (1.1 Gflops/$)    24 Gflops (2.4-4.8 Gflops/$)
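The GPU figures follow from the VideoCore IV layout (12 QPUs, each with 4-wide add and mul pipes); a small sketch to reproduce them (the function name is mine):

```python
# Theoretical single-precision peak: 12 QPUs, each issuing one add and one
# mul per cycle over 4 physical lanes.
def peak_gflops(clock_hz, qpus=12, lanes=4, ops_per_lane=2):
    return qpus * lanes * ops_per_lane * clock_hz / 1e9

assert peak_gflops(400e6) == 38.4   # Pi 3 GPU at 400 MHz
assert peak_gflops(250e6) == 24.0   # Pi Zero/W GPU at 250 MHz
```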
Architecture Overview
QPU / Quad Processing Unit
• general purpose registers A/B x32 (= 64 registers)
• accumulation registers r0, r1, r2, r3 (= 4 registers)
• see: VideoCore IV 3D Architecture Reference Guide
Architecture Overview
TMU / Texture and Memory Lookup Unit
Architecture Overview
Uniforms Cache
Architecture Overview
VPM / Vertex Pipe Memory
Efficient Data Transfer
• Reading GPU memory from the CPU is SLOW
• So every layer SHOULD be computed by the GPU
• Bad:  Conv2D in GPU → GPU mem read → ReLU in CPU → Conv2D in GPU
• Good: Conv2D in GPU → ReLU in GPU → Conv2D in GPU
Architecture Overview
EXCLUSIVE
SGEMM
References
• "Raspberry PiでGPGPU" (GPGPU on Raspberry Pi)
• "Raspberry PiのGPUで行列乗算(その1)" (Matrix multiplication on the Raspberry Pi GPU, part 1)
• "Raspberry PiのGPUで行列乗算(その2)" (Matrix multiplication on the Raspberry Pi GPU, part 2)
Inner/Direct Product
Direct Product
Code Example
• PyVideoCore
rotate(broadcast, r1, 0)
fmul(r3, r4, r5)        # r5 has the broadcast result
fadd(ra0, ra0, r3)
rotate(broadcast, r1, 1)
fmul(r3, r4, r5)
fadd(ra1, ra1, r3)
rotate(broadcast, r1, 2)
fmul(r3, r4, r5)
fadd(ra2, ra2, r3)
Code Example
• PyVideoCore (dual-issue: fadd and fmul packed into one instruction)
rotate(broadcast, r1, 0)
fmul(r3, r4, r5)
rotate(broadcast, r1, 1)
fadd(ra0, ra0, r3).fmul(r3, r4, r5)
rotate(broadcast, r1, 2)
fadd(ra1, ra1, r3).fmul(r3, r4, r5)
rotate(broadcast, r1, 3)
fadd(ra2, ra2, r3).fmul(r3, r4, r5)
rotate(broadcast, r1, 4)
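The rotate/broadcast loop above computes a 16x16 direct (outer) product, one accumulator row per broadcast lane; a NumPy sketch of the equivalent per-QPU computation (variable names mirror the registers above):

```python
import numpy as np

# Registers of one QPU, modeled as 16-lane vectors
r1 = np.arange(16.0)        # operand whose lanes are broadcast one by one
r4 = np.arange(16.0) + 1.0  # operand reused by every fmul
acc = np.zeros((16, 16))    # ra0..ra15 accumulators

for i in range(16):
    r5 = np.full(16, r1[i])  # rotate(broadcast, r1, i): lane i into all lanes of r5
    r3 = r4 * r5             # fmul(r3, r4, r5)
    acc[i] += r3             # fadd(ra_i, ra_i, r3)

# The accumulated result is the direct (outer) product of the two vectors
assert np.allclose(acc, np.outer(r1, r4))
```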
Practical Limit for DNN
       Pi 3                   Pi Zero/W
CPU    ? Gflops               ? Gflops
GPU    38.4 → 19.2 Gflops     24 → 12 Gflops
Convolution(2D)
Rejected Algorithms
• im2col
  • the TMU enables an equivalent load directly
• Winograd
  • increased data transfer outweighs the reduced operations
  • not enough registers
• Direct (NCHW → NHWC, im2col-equivalent TMU load)
  • NHWC has bad data locality for the next layer
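im2col was rejected because the TMU can gather the same patches during the load, but for reference this is the lowering it would have materialized; a minimal NumPy sketch (stride 1, no padding; function and variable names are mine):

```python
import numpy as np

def im2col(x, kh, kw):
    # Lower (C, H, W) into a (C*kh*kw, out_h*out_w) matrix, stride 1, no padding.
    C, H, W = x.shape
    oh, ow = H - kh + 1, W - kw + 1
    cols = np.empty((C, kh, kw, oh, ow), dtype=x.dtype)
    for i in range(kh):
        for j in range(kw):
            cols[:, i, j] = x[:, i:i + oh, j:j + ow]
    return cols.reshape(C * kh * kw, oh * ow)

rng = np.random.default_rng(0)
x = rng.random((3, 8, 8))        # input feature map (C, H, W)
w = rng.random((4, 3, 3, 3))     # weights (OC, C, kh, kw)

# Convolution becomes a single GEMM over the lowered matrix
out = (w.reshape(4, -1) @ im2col(x, 3, 3)).reshape(4, 6, 6)

# Spot-check one output position against the direct definition
assert np.isclose(out[1, 2, 3], np.sum(w[1] * x[:, 2:5, 3:6]))
```

The cost is visible in `cols`: every input element is duplicated up to kh*kw times, which is exactly the extra memory traffic the TMU load avoids.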
Direct (NCHW → NHWC)
Accepted Algorithms
• Direct (NCHW → NCHW)
  • DMA store with Transpose(C,HW)
  • for small images (HxW < 2048)
• Direct (NCHW → NHCW)
  • DMA store with Transpose(C,W)
Specialize for Kernel Size
• 3x3 with stride = 1
  • Best performance
  • Pi 3: 18.5 Gflops
  • 48% of theoretical limit (96% of practical limit)
• 1x1
• 1xK
• Kx1
• KxK
Specialize for Output Shape
• DMA Transfer Block Size
  • HxWxC = 2x16x16
  • HxWxC = 4x8x16
    • for small images
  • HxWxC = 2x14x16 + 2x16x16
    • Overlap-Add Method
    • 3x3 only
Both 2x16x16 and 4x8x16 use 32 general-purpose registers for accumulation;
the other 32 registers hold the convolution parameters.
Combination
• Using 22 specialized implementations
• NCHW→NCHW / NCHW→NHCW

       2x14x16+2x16x16   2x16x16   4x8x16
3x3    Use               Use       Use
1x1    -                 Use       Use
1xK    -                 Use       Use
Kx1    -                 Use       Use
KxK    -                 Use       Use
Optimization
Convert-Time Optimization
• Fuse affine transforms
  • ex) Convolution + BN + Scale → Convolution
• Inline (Leaky)ReLU
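The fusion works because BN (and Scale) are affine maps per output channel, so they fold into the convolution's weights and bias at model-conversion time; a hedged NumPy sketch (shapes and names are assumptions, not the deck's converter):

```python
import numpy as np

def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    # BN is affine per output channel: y' = gamma*(y - mean)/sqrt(var + eps) + beta,
    # so it folds into the convolution weights and bias.
    scale = gamma / np.sqrt(var + eps)
    return w * scale[:, None, None, None], (b - mean) * scale + beta

rng = np.random.default_rng(0)
oc, ic, k = 4, 3, 3
w, b = rng.random((oc, ic, k, k)), rng.random(oc)
gamma, beta = rng.random(oc), rng.random(oc)
mean, var = rng.random(oc), rng.random(oc) + 0.5

wf, bf = fold_bn(w, b, gamma, beta, mean, var)

# Check on a single receptive field: Conv then BN == folded Conv
x = rng.random((ic, k, k))
y = np.sum(w * x, axis=(1, 2, 3)) + b
y_ref = gamma * (y - mean) / np.sqrt(var + 1e-5) + beta
y_fold = np.sum(wf * x, axis=(1, 2, 3)) + bf
assert np.allclose(y_ref, y_fold)
```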
Instruction Golf
rotate(broadcast, r1, -13)
fadd(rb[14], rb[14], r3).fmul(r3, r4, r5)
rotate(broadcast, r1, -14)
fadd(ra[14], ra[14], r3).fmul(r3, r4, r5)
iadd(null, element_number, -15, set_flags=True).rotate(broadcast, r1, -15)
isub(r2, r2, 1, cond='zs')   # decrement r2
jzc(L.loop)                  # jump if non-0
Instruction Golf
rotate(broadcast, r1, -13)                  # mul op. only
fadd(rb[14], rb[14], r3).fmul(r3, r4, r5)
rotate(broadcast, r1, -14)                  # mul op. only
fadd(ra[14], ra[14], r3).fmul(r3, r4, r5)
iadd(null, element_number, -15, set_flags=True).rotate(broadcast, r1, -15)
isub(r2, r2, 1, cond='zs')                  # add op. only
jzc(L.loop)
Instruction Golf
After:
rotate(broadcast, r1, -13)
fadd(rb[14], rb[14], r3).fmul(r3, r4, r5)
iadd(null, element_number, -14, set_flags=True).rotate(broadcast, r1, -14)
fadd(ra[14], ra[14], r3, set_flags=False).fmul(r3, r4, r5)
iadd(r2, r2, -1, cond='zs').rotate(broadcast, r1, -15)
jzc(L.loop)
Before:
rotate(broadcast, r1, -13)
fadd(rb[14], rb[14], r3).fmul(r3, r4, r5)
rotate(broadcast, r1, -14)
fadd(ra[14], ra[14], r3).fmul(r3, r4, r5)
iadd(null, element_number, -15, set_flags=True).rotate(broadcast, r1, -15)
isub(r2, r2, 1, cond='zs')
jzc(L.loop)
Instruction Golf
• Problem: a dual-issued instruction can't use 2 different immediates
  (e.g. -1 in the iadd and -15 in the rotate of the same instruction)
Instruction Golf
After:
iadd(null, element_number, -13, set_flags=True).rotate(broadcast, r1, -13)
fadd(rb[14], rb[14], r3, set_flags=False).fmul(r3, r4, r5)
isub(r2, r2, -14, cond='zs', set_flags=False).rotate(broadcast, r1, -14)
fadd(ra[14], ra[14], r3, set_flags=False).fmul(r3, r4, r5)
iadd(r2, r2, -15, cond='zs').rotate(broadcast, r1, -15)
jzc(L.loop)
Before:
rotate(broadcast, r1, -13)
fadd(rb[14], rb[14], r3).fmul(r3, r4, r5)
rotate(broadcast, r1, -14)
fadd(ra[14], ra[14], r3).fmul(r3, r4, r5)
iadd(null, element_number, -15, set_flags=True).rotate(broadcast, r1, -15)
isub(r2, r2, 1, cond='zs')
jzc(L.loop)
Instruction Golf
• Each immediate now matches the rotate distance of its pair (-13, -14, -15)
• Net loop-counter update per iteration: -(-14) + (-15) = -1
Experiments
Performance
• GoogLeNet
  • Pi 3: 320 ms
  • Pi Zero: 670 ms
• ResNet50
  • Pi 3: 820 ms
• YOLO tiny
  • Pi 3: 580 ms
Example classification output (top 10):
1 99.58% cicada, cicala
2 0.19% cockroach, roach
3 0.06% cricket
4 0.05% grasshopper, hopper
5 0.04% leafhopper
6 0.02% lacewing, lacewing fly
7 0.01% barn spider, Araneus cavaticus
8 0.00% ground beetle, carabid beetle
9 0.00% isopod
10 0.00% mantis, mantid
Future Work
Remove Wasteful Computation
• ex) ResNet
(diagram: 1/4, 1/4, 1/4)
Improve Data Locality
Remove Wasteful Computation
• ex) ResNet50
  • Pi 3: 820 ms → 720 ms (by hand-tuned optimization)
Thank You !
Appendix 1: 3x3 Convolution
4x8x16 Case
• convolution area: 4x8, required input area: 6x10
• TMU load x 4 (6x10 ≤ 16x4 elements)
2x16x16 Case
• required input area: 4x18
• TMU load x 5 (16x4 < 4x18 ≤ 16x5)
• exceeds the TMU load request queue size (= 4)
2x14x16+2x16x16 Case
• TMU load x 4 (16x4 = 4x16)
• overlap the two blocks (overlap-add)
Appendix 2: Pooling
Max / Average Pooling
• Convolution-like TMU loading
• Specialized for global pooling
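As a reference for what these kernels compute, a NumPy sketch of 2x2 max pooling and global average pooling on a (C, H, W) feature map (the reshape trick is mine, not the QPU implementation):

```python
import numpy as np

def max_pool2x2(x):
    # Split H and W into 2x2 blocks, then take the max of each block.
    C, H, W = x.shape
    return x.reshape(C, H // 2, 2, W // 2, 2).max(axis=(2, 4))

def global_avg_pool(x):
    # One average over the whole spatial extent per channel.
    return x.mean(axis=(1, 2))

x = np.arange(2 * 4 * 4, dtype=np.float64).reshape(2, 4, 4)
assert max_pool2x2(x).shape == (2, 2, 2)
assert max_pool2x2(x)[0, 0, 0] == 5.0               # max of the top-left 2x2 block
assert np.allclose(global_avg_pool(x), [7.5, 23.5])  # per-channel means
```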
