Using Raspberry Pi GPU
for Deep Neural Network
2017/9/3
Noriyuki OHKAWA
DNN prediction on end devices
• Advantage
  • Data transfer cost
• Disadvantage
  • Poor computing resources
How to implement on end devices?
• Reduce parameters (including approximation)
• Binary Net
• Parameter Quantization
• Learning Sparse Matrix
• Software
• Fast Convolution Algorithm / Fusion
• Hardware
• FPGA / ASIC
In this presentation
• Software Approach on Raspberry Pi Series
• Use Raspberry Pi GPU efficiently
• Not covered:
  • Hardware Approach
  • Approximation and Re-training
Raspberry Pi GPU
Raspberry Pi CPU/GPU Spec.
       Pi 3                          Pi Zero/W
CPU    ARM Cortex-A53                ARM1176JZF-S
       Quad Core 1.2 GHz             Single Core 1 GHz
GPU    Broadcom VideoCore IV         Broadcom VideoCore IV
       400 MHz                       250 MHz
Single Precision flops (theoretical)
       Pi 3                          Pi Zero/W
CPU    38.4 Gflops (1.1 Gflops/$)    1 Gflops (0.1-0.2 Gflops/$)
GPU    38.4 Gflops (1.1 Gflops/$)    24 Gflops (2.4-4.8 Gflops/$)
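The GPU figures follow from the VideoCore IV layout (12 QPUs, each with 4-wide add and mul pipes); a small sketch to reproduce them (the function name is mine):

```python
# Theoretical single-precision peak: 12 QPUs, each issuing one add and one
# mul per cycle over 4 physical lanes.
def peak_gflops(clock_hz, qpus=12, lanes=4, ops_per_lane=2):
    return qpus * lanes * ops_per_lane * clock_hz / 1e9

assert peak_gflops(400e6) == 38.4   # Pi 3 GPU at 400 MHz
assert peak_gflops(250e6) == 24.0   # Pi Zero/W GPU at 250 MHz
```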
Architecture Overview
QPU / Quad Processing Unit
• general purpose registers A/B x32 (= 64 registers)
• accumulation registers r0, r1, r2, r3 (= 4 registers)
• see: VideoCore IV 3D Architecture Reference Guide
Architecture Overview
TMU / Texture and Memory Lookup Unit
Architecture Overview
Uniforms Cache
Architecture Overview
VPM / Vertex Pipe Memory
Efficient Data Transfer
• Reading GPU memory from the CPU is SLOW
• So every layer SHOULD be computed by the GPU
• Bad:  Conv2D in GPU → GPU mem read → ReLU in CPU → Conv2D in GPU
• Good: Conv2D in GPU → ReLU in GPU → Conv2D in GPU
Architecture Overview
EXCLUSIVE
SGEMM
References
• "Raspberry PiでGPGPU" (GPGPU on Raspberry Pi)
• "Raspberry PiのGPUで行列乗算(その1)" (Matrix multiplication on the Raspberry Pi GPU, part 1)
• "Raspberry PiのGPUで行列乗算(その2)" (Matrix multiplication on the Raspberry Pi GPU, part 2)
Inner/Direct Product
Direct Product
Code Example
• PyVideoCore
rotate(broadcast, r1, 0)
fmul(r3, r4, r5)        # r5 has the broadcast result
fadd(ra0, ra0, r3)
rotate(broadcast, r1, 1)
fmul(r3, r4, r5)
fadd(ra1, ra1, r3)
rotate(broadcast, r1, 2)
fmul(r3, r4, r5)
fadd(ra2, ra2, r3)
Code Example
• PyVideoCore (dual-issue: fadd and fmul packed into one instruction)
rotate(broadcast, r1, 0)
fmul(r3, r4, r5)
rotate(broadcast, r1, 1)
fadd(ra0, ra0, r3).fmul(r3, r4, r5)
rotate(broadcast, r1, 2)
fadd(ra1, ra1, r3).fmul(r3, r4, r5)
rotate(broadcast, r1, 3)
fadd(ra2, ra2, r3).fmul(r3, r4, r5)
rotate(broadcast, r1, 4)
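The rotate/broadcast loop above computes a 16x16 direct (outer) product, one accumulator row per broadcast lane; a NumPy sketch of the equivalent per-QPU computation (variable names mirror the registers above):

```python
import numpy as np

# Registers of one QPU, modeled as 16-lane vectors
r1 = np.arange(16.0)        # operand whose lanes are broadcast one by one
r4 = np.arange(16.0) + 1.0  # operand reused by every fmul
acc = np.zeros((16, 16))    # ra0..ra15 accumulators

for i in range(16):
    r5 = np.full(16, r1[i])  # rotate(broadcast, r1, i): lane i into all lanes of r5
    r3 = r4 * r5             # fmul(r3, r4, r5)
    acc[i] += r3             # fadd(ra_i, ra_i, r3)

# The accumulated result is the direct (outer) product of the two vectors
assert np.allclose(acc, np.outer(r1, r4))
```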
Practical Limit for DNN
       Pi 3                   Pi Zero/W
CPU    ? Gflops               ? Gflops
GPU    38.4 → 19.2 Gflops     24 → 12 Gflops
Convolution(2D)
Rejected Algorithms
• im2col
  • the TMU enables an equivalent load directly
• Winograd
  • increased data transfer outweighs the reduced operations
  • not enough registers
• Direct (NCHW → NHWC, im2col-equivalent TMU load)
  • NHWC has bad data locality for the next layer
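im2col was rejected because the TMU can gather the same patches during the load, but for reference this is the lowering it would have materialized; a minimal NumPy sketch (stride 1, no padding; function and variable names are mine):

```python
import numpy as np

def im2col(x, kh, kw):
    # Lower (C, H, W) into a (C*kh*kw, out_h*out_w) matrix, stride 1, no padding.
    C, H, W = x.shape
    oh, ow = H - kh + 1, W - kw + 1
    cols = np.empty((C, kh, kw, oh, ow), dtype=x.dtype)
    for i in range(kh):
        for j in range(kw):
            cols[:, i, j] = x[:, i:i + oh, j:j + ow]
    return cols.reshape(C * kh * kw, oh * ow)

rng = np.random.default_rng(0)
x = rng.random((3, 8, 8))        # input feature map (C, H, W)
w = rng.random((4, 3, 3, 3))     # weights (OC, C, kh, kw)

# Convolution becomes a single GEMM over the lowered matrix
out = (w.reshape(4, -1) @ im2col(x, 3, 3)).reshape(4, 6, 6)

# Spot-check one output position against the direct definition
assert np.isclose(out[1, 2, 3], np.sum(w[1] * x[:, 2:5, 3:6]))
```

The cost is visible in `cols`: every input element is duplicated up to kh*kw times, which is exactly the extra memory traffic the TMU load avoids.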
Direct (NCHW → NHWC)
Accepted Algorithms
• Direct (NCHW → NCHW)
  • DMA store with Transpose(C,HW)
  • for small images (HxW < 2048)
• Direct (NCHW → NHCW)
  • DMA store with Transpose(C,W)
Specialize for Kernel Size
• 3x3 with stride = 1
  • Best performance
  • Pi 3: 18.5 Gflops
  • 48% of theoretical limit (96% of practical limit)
• 1x1
• 1xK
• Kx1
• KxK
Specialize for Output Shape
• DMA Transfer Block Size
  • HxWxC = 2x16x16
  • HxWxC = 4x8x16
    • for small images
  • HxWxC = 2x14x16 + 2x16x16
    • Overlap-Add Method
    • 3x3 only
Both 2x16x16 and 4x8x16 use 32 general-purpose registers for accumulation;
the other 32 registers hold the convolution parameters.
Combination
• Using 22 specialized implementations
• NCHW→NCHW / NCHW→NHCW

       2x14x16+2x16x16   2x16x16   4x8x16
3x3    Use               Use       Use
1x1    -                 Use       Use
1xK    -                 Use       Use
Kx1    -                 Use       Use
KxK    -                 Use       Use
Optimization
Convert-Time Optimization
• Fuse affine transforms
  • ex) Convolution + BN + Scale → Convolution
• Inline (Leaky)ReLU
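The fusion works because BN (and Scale) are affine maps per output channel, so they fold into the convolution's weights and bias at model-conversion time; a hedged NumPy sketch (shapes and names are assumptions, not the deck's converter):

```python
import numpy as np

def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    # BN is affine per output channel: y' = gamma*(y - mean)/sqrt(var + eps) + beta,
    # so it folds into the convolution weights and bias.
    scale = gamma / np.sqrt(var + eps)
    return w * scale[:, None, None, None], (b - mean) * scale + beta

rng = np.random.default_rng(0)
oc, ic, k = 4, 3, 3
w, b = rng.random((oc, ic, k, k)), rng.random(oc)
gamma, beta = rng.random(oc), rng.random(oc)
mean, var = rng.random(oc), rng.random(oc) + 0.5

wf, bf = fold_bn(w, b, gamma, beta, mean, var)

# Check on a single receptive field: Conv then BN == folded Conv
x = rng.random((ic, k, k))
y = np.sum(w * x, axis=(1, 2, 3)) + b
y_ref = gamma * (y - mean) / np.sqrt(var + 1e-5) + beta
y_fold = np.sum(wf * x, axis=(1, 2, 3)) + bf
assert np.allclose(y_ref, y_fold)
```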
Instruction Golf
rotate(broadcast, r1, -13)
fadd(rb[14], rb[14], r3).fmul(r3, r4, r5)
rotate(broadcast, r1, -14)
fadd(ra[14], ra[14], r3).fmul(r3, r4, r5)
iadd(null, element_number, -15, set_flags=True).rotate(broadcast, r1, -15)
isub(r2, r2, 1, cond='zs')   # decrement r2
jzc(L.loop)                  # jump if non-0
Instruction Golf
rotate(broadcast, r1, -13)                  # mul op. only
fadd(rb[14], rb[14], r3).fmul(r3, r4, r5)
rotate(broadcast, r1, -14)                  # mul op. only
fadd(ra[14], ra[14], r3).fmul(r3, r4, r5)
iadd(null, element_number, -15, set_flags=True).rotate(broadcast, r1, -15)
isub(r2, r2, 1, cond='zs')                  # add op. only
jzc(L.loop)
Instruction Golf
After:
rotate(broadcast, r1, -13)
fadd(rb[14], rb[14], r3).fmul(r3, r4, r5)
iadd(null, element_number, -14, set_flags=True).rotate(broadcast, r1, -14)
fadd(ra[14], ra[14], r3, set_flags=False).fmul(r3, r4, r5)
iadd(r2, r2, -1, cond='zs').rotate(broadcast, r1, -15)
jzc(L.loop)
Before:
rotate(broadcast, r1, -13)
fadd(rb[14], rb[14], r3).fmul(r3, r4, r5)
rotate(broadcast, r1, -14)
fadd(ra[14], ra[14], r3).fmul(r3, r4, r5)
iadd(null, element_number, -15, set_flags=True).rotate(broadcast, r1, -15)
isub(r2, r2, 1, cond='zs')
jzc(L.loop)
Instruction Golf
• Problem: a dual-issued instruction can't use 2 different immediates
  (e.g. -1 in the iadd and -15 in the rotate of the same instruction)
Instruction Golf
After:
iadd(null, element_number, -13, set_flags=True).rotate(broadcast, r1, -13)
fadd(rb[14], rb[14], r3, set_flags=False).fmul(r3, r4, r5)
isub(r2, r2, -14, cond='zs', set_flags=False).rotate(broadcast, r1, -14)
fadd(ra[14], ra[14], r3, set_flags=False).fmul(r3, r4, r5)
iadd(r2, r2, -15, cond='zs').rotate(broadcast, r1, -15)
jzc(L.loop)
Before:
rotate(broadcast, r1, -13)
fadd(rb[14], rb[14], r3).fmul(r3, r4, r5)
rotate(broadcast, r1, -14)
fadd(ra[14], ra[14], r3).fmul(r3, r4, r5)
iadd(null, element_number, -15, set_flags=True).rotate(broadcast, r1, -15)
isub(r2, r2, 1, cond='zs')
jzc(L.loop)
Instruction Golf
• Each immediate now matches the rotate distance of its pair (-13, -14, -15)
• Net loop-counter update per iteration: -(-14) + (-15) = -1
Experiments
Performance
• GoogLeNet
  • Pi 3: 320 ms
  • Pi Zero: 670 ms
• ResNet50
  • Pi 3: 820 ms
• YOLO tiny
  • Pi 3: 580 ms
Example classification output (top 10):
1 99.58% cicada, cicala
2 0.19% cockroach, roach
3 0.06% cricket
4 0.05% grasshopper, hopper
5 0.04% leafhopper
6 0.02% lacewing, lacewing fly
7 0.01% barn spider, Araneus cavaticus
8 0.00% ground beetle, carabid beetle
9 0.00% isopod
10 0.00% mantis, mantid
Future Work
Remove Wasteful Computation
• ex) ResNet
(diagram: 1/4, 1/4, 1/4)
Improve Data Locality
Remove Wasteful Computation
• ex) ResNet50
  • Pi 3: 820 ms → 720 ms (by hand-tuned optimization)
Thank You !
Appendix 1: 3x3 Convolution
4x8x16 Case
• convolution area: 4x8, required input area: 6x10
• TMU load x 4 (6x10 ≤ 16x4 elements)
2x16x16 Case
• required input area: 4x18
• TMU load x 5 (16x4 < 4x18 ≤ 16x5)
• exceeds the TMU load request queue size (= 4)
2x14x16+2x16x16 Case
• TMU load x 4 (16x4 = 4x16)
• overlap the two blocks (overlap-add)
Appendix 2: Pooling
Max / Average Pooling
• Convolution-like TMU loading
• Specialized for global pooling
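As a reference for what these kernels compute, a NumPy sketch of 2x2 max pooling and global average pooling on a (C, H, W) feature map (the reshape trick is mine, not the QPU implementation):

```python
import numpy as np

def max_pool2x2(x):
    # Split H and W into 2x2 blocks, then take the max of each block.
    C, H, W = x.shape
    return x.reshape(C, H // 2, 2, W // 2, 2).max(axis=(2, 4))

def global_avg_pool(x):
    # One average over the whole spatial extent per channel.
    return x.mean(axis=(1, 2))

x = np.arange(2 * 4 * 4, dtype=np.float64).reshape(2, 4, 4)
assert max_pool2x2(x).shape == (2, 2, 2)
assert max_pool2x2(x)[0, 0, 0] == 5.0               # max of the top-left 2x2 block
assert np.allclose(global_avg_pool(x), [7.5, 23.5])  # per-channel means
```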
