VC4C: Development of OpenCL
Compiler for VideoCore4
An OSS OpenCL compiler for the Raspberry Pi GPU:
current status and challenges of its development
2018/11/10 Compiler Study Group @ fixstars
Who am I
・The dark side of fiber internet
 @no_maddo
・Engineer at Idein
Today's topics
- Introduction to VideoCore IV (VC4 below)
- Architecture
- Memory characteristics
- Introduction to VC4C
- An OSS project by a developer in Germany (not by me)
- It apparently started as his Master's thesis project
- Considerations specific to VC4C
- VC4 is tough
- OpenCL is tough
Idein’s technology
・Executes MobileNet v2 1.0 (224x224): 8.4 FPS ~= 140 ms
Why can we achieve this performance?
・We extract the maximum performance of the VideoCore IV
・Hand-assembled parallel GPU code
・Runs only on the GPU
・Control does not return to the CPU during inference
・CPU usage is very low
・Even the Pi Zero (a $5 computer!!!) achieves this performance
For the details…
See our president's presentation
Why are we “trying” the OSS compiler?
・We don't use VC4C in production today
・Tuning assembly by hand is “hard”
・Diesel, Tensor Comprehensions
・In the near future, we would be happy to write high-performance
mathematical kernels with a compiler…
Architecture Introduction
VC4 overview
QPU / Quad Processing Unit
・general-purpose registers: regfile A and B, 32 each (= 64 registers)
・accumulator registers r0–r3 (= 4 registers)
TMU / Texture and Memory Lookup Unit
Uniform cache
VPM / Vertex Pipe Memory
Efficient data transfer
Assembly Example: Hello World
mov(r0, uniform)          # load from uniform and put it in `r0`
setup_vpm_write()         # prepare for a VPM write
mov(vpm, 1)               # write 1 row (16 elements) to the VPM
setup_dma_store(nrows=1)  # declare that 1 row will be stored via DMA
mov(vpm_st_addr, r0)      # start the DMA store to the address in `r0`
wait_dma_store()          # wait for the DMA store to complete
exit()
See the repository: py-videocore
Ex: A = A * 2 + 1
ldi(ra1, 64)                          # 64 bytes = one row of 16 floats
ldi(rb1, 16)                          # elements processed per iteration
mov(rb0, uniform)                     # uniform 0: output address
mov(ra0, uniform)                     # uniform 1: number of elements
imul24(r1, element_number, 4)         # per-lane byte offset within a row
iadd(r1, uniform, r1)                 # uniform 2: input address, per lane
L.loop
iadd(r1, r1, ra1).mov(tmu0_s, r1)     # issue TMU load, advance input address
mutex_acquire()
setup_vpm_write(nrows=1)
nop(sig='load tmu0')                  # wait for the TMU result in r4
fmul(r0, r4, 2.0)
fadd(vpm, r0, 1.0)                    # r4 * 2 + 1 -> VPM
setup_dma_store(nrows=1)
mov(vpm_st_addr, rb0)                 # DMA-store the row to the output address
wait_dma_store()
mutex_release()
isub(ra0, ra0, rb1, set_flags=True)   # remaining elements -= 16
jzc(L.loop)                           # loop while elements remain
iadd(rb0, rb0, ra1); nop(); nop()     # branch delay slots: advance output address
exit()
Flow of execution
・allocate GPU memory
・build uniforms for each thread
・run the program via the driver
with Driver() as drv:
    n_threads = 12
    r = drv.alloc((n_threads, 128), 'float32')
    a = drv.alloc((n_threads, 128), 'float32')
    ...
    code = drv.program(mul2)
    uniforms = drv.alloc((n_threads, 3), 'uint32')
    uniforms[:, 0] = r.address()[:, 0]
    uniforms[:, 1] = 128
    uniforms[:, 2] = a.address()[0][0]
    drv.execute(n_threads=n_threads,
                program=code, uniforms=uniforms)
performance example: qmkl
$ sudo ./qmkl/test/sgemm 224 224 224
GPU: 6.17614e+09 [flop/s]
CPU: 9.78483e+08 [flop/s]
NEON: 1.06783e+09 [flop/s]
https://github.com/idein/qmkl
・mathematical kernels accelerated with VC4
・caution: measured with no-trans, no-trans
Performance issue
・low memory bandwidth:
・4.48 GB/s vs. 98 GB/s on my desktop machine...
・TMU latency (cycles):
・TMU cache hit: 9
・L2 cache hit: 12
・memory: 20 (if v3d_freq=250 [MHz])
・cache incoherency
Cache incoherency
(figures: cache-incoherency scenario illustrated in steps ① to ⑤)
VC4C: OpenCL compiler for VC4
Recap: OpenCL
・Parallel programming framework for heterogeneous computing (GPU, DSP, FPGA, etc...)
・Supports a data-parallel computing model
kernel void mul2(global float * a)
{
int id = get_global_id(0);
a[id] = a[id] * 2 + 1;
}
Host program:
clCreateContext
clCreateProgramWithSource      (compile at runtime)
clCreateBuffer
clEnqueueWriteBuffer
global_item_size = { 4, 8, 12 };
clEnqueueNDRangeKernel         (enqueue the kernel)
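For reference, the same host-side flow in Python (a minimal sketch using PyOpenCL instead of the C API listed above; the 128-element buffer and work size are illustrative assumptions):

# Minimal host-program sketch with PyOpenCL (not the C API shown above);
# buffer size and work size are illustrative assumptions.
import numpy as np
import pyopencl as cl

src = """
kernel void mul2(global float *a) {
    int id = get_global_id(0);
    a[id] = a[id] * 2 + 1;
}
"""

ctx = cl.create_some_context()                  # clCreateContext
queue = cl.CommandQueue(ctx)
prg = cl.Program(ctx, src).build()              # clCreateProgramWithSource + build

host_a = np.arange(128, dtype=np.float32)
buf_a = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR,
                  hostbuf=host_a)               # clCreateBuffer (+ initial write)

prg.mul2(queue, (128,), None, buf_a)            # clEnqueueNDRangeKernel
cl.enqueue_copy(queue, host_a, buf_a)           # read the results back
print(host_a[:4])                               # -> [1. 3. 5. 7.]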
VC4C Overview
(diagram: OpenCL runtime and offline compiler)
Asm structure
kernel void mul2(global float * a)
{
    int id = get_global_id(0);
    a[id] = a[id] * 2 + 1;
}
・the compiler wraps the kernel body in an implicit loop
・OpenCL kernel parameters are passed via uniforms
・the loop exit condition is also passed via a uniform
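Conceptually, the emitted structure can be modelled like this (a Python sketch of the idea only, not actual VC4C output; mul2_qpu, uniforms and memory are illustrative names):

# Conceptual model only (not real compiler output): the kernel body sits inside
# an implicit loop, and every parameter, including the loop bound, arrives
# through the uniform stream.
def mul2_qpu(uniforms, memory):
    read_uniform = iter(uniforms).__next__
    a_index = read_uniform()              # kernel argument: start index of `a`
    n_items = read_uniform()              # number of work-items for this QPU
    for i in range(n_items):              # implicit loop over work-items
        gid = a_index + i                 # plays the role of get_global_id(0)
        memory[gid] = memory[gid] * 2 + 1 # kernel body of mul2

mem = [float(i) for i in range(16)]
mul2_qpu(uniforms=[0, 16], memory=mem)    # uniforms: [buffer offset, item count]
print(mem[:4])                            # -> [1.0, 3.0, 5.0, 7.0]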
VC4C demo
Let’s check the output….
Current status: works if registers are enough
・Works fine if register allocation succeeds
・Implementation issue: lack of register spilling
・Performance issues:
・better instruction scheduling
・adjusting clang loop optimizations for VC4
・innermost-loop unrolling
・improving DMA transfers
・auto-vectorization
VC4-specific optimization
・Loading a 32-bit constant requires an ldi instruction
・Dealing with constants is costly
・moveConstantload moves constant loads (ldi) out of loops
・But it increases register pressure…
Immediate loading: before and after
Before (ldi inside the loop):
ldi(r0, 0)
ldi(r2, 10)
L.loop
ldi(r1, 256)
iadd(r0, r0, r1)
isub(r2, r2, 1, set_flags=True)
bgt(L.loop)
nop(); nop(); nop()
After (ldi hoisted out of the loop):
ldi(r0, 0)
ldi(r2, 10)
ldi(r1, 256)
L.loop
iadd(r0, r0, r1)
isub(r2, r2, 1, set_flags=True)
bgt(L.loop)
nop(); nop(); nop()
small immediate
・Instead of a regfile-B operand, an instruction can encode one of a limited set of immediates
・−16 to 15, 1.0, 2.0, 4.0, …
・by combining them with an ALU operation, other constants can be produced
Before (ldi of 256 inside the loop):
ldi(r0, 0)
ldi(r2, 10)
L.loop
ldi(r1, 256)
iadd(r0, r0, r1)
isub(r2, r2, 1, set_flags=True)
bgt(L.loop)
nop(); nop(); nop()
After (small immediates only; imul24(-16, -16) materializes 256):
mov(r0, 0)
mov(r2, 10)
imul24(r1, -16, -16)
L.loop
iadd(r0, r0, r1)
isub(r2, r2, 1, set_flags=True)
bgt(L.loop)
nop(); nop(); nop()
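The imul24(-16, -16) trick above generalizes; a hypothetical helper (not part of VC4C) could search the small-immediate range for a pair whose 24-bit product matches a desired constant:

# Hypothetical helper (not part of VC4C): find two small immediates whose
# low-24-bit product equals a constant, so a loop-invariant ldi can be
# replaced by a single imul24 on small immediates.
SMALL_IMMS = range(-16, 16)               # encodable integer small immediates

def imul24_pair(value):
    target = value & 0xFFFFFF             # imul24 keeps only the low 24 bits
    for a in SMALL_IMMS:
        for b in SMALL_IMMS:
            if (a * b) & 0xFFFFFF == target:
                return a, b
    return None

print(imul24_pair(256))                   # -> (-16, -16), as in the example above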
Fusion of VPM writes (WIP)
Before (one VPM row stored per loop iteration):
mov(r0, 0)
L.loop
setup_vpm_write(nrows=1)
mutex_acquire()
fadd(r1, uniform, 1.0)
iadd(r0, r0, 1).mov(vpm, r1)
mov(vpm_st_addr, rb0)
wait_dma_store()
mutex_release()
isub(None, r0, 3, set_flags=True)
bne(L.loop)
nop(); nop(); nop()
After (full unrolling: three rows written in one batch):
setup_vpm_write(nrows=3)
fadd(ra0, uniform, 1.0)
fadd(ra1, uniform, 1.0)
fadd(ra2, uniform, 1.0)
mutex_acquire()
mov(vpm, ra0)
mov(vpm, ra1)
mov(vpm, ra2)
mov(vpm_st_addr, rb0)
wait_dma_store()
mutex_release()
Hardware limitation
・Cache incoherency is a huge problem
・Register spilling
・problematic on other GPUs too
・Effective TMU loads
・if the same region is both read and written, results can be wrong
・using DMA instead discards the parallelism entirely
Insufficient use of DMA
kernel void mul2(global float * a)
{
int id = get_global_id(0);
a[id] = a[id] * 2 + 1;
}
・the region a is both read and written
・but each element of a is read only once
・so loading via the TMU is actually safe
・though that requires complex analysis…???
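As a rough illustration of the kind of check involved (a hypothetical sketch, not VC4C's actual analysis): a TMU load is safe when no address is read after it has already been written, because then a stale TMU cache line can never be observed:

# Hypothetical sketch (not VC4C's analysis): a TMU load is safe if no address
# is read after it has been written within the kernel.
def tmu_load_is_safe(trace):
    written = set()
    for op, addr in trace:                # trace: sequence of ("load"/"store", addr)
        if op == "load" and addr in written:
            return False                  # read-after-write: TMU may return stale data
        if op == "store":
            written.add(addr)
    return True

# mul2 reads each a[id] once and only then writes it back:
trace = [("load", 0), ("store", 0), ("load", 1), ("store", 1)]
print(tmu_load_is_safe(trace))            # -> True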
Complex iteration via OpenCL IDs
・implicit loops (driven by IDs) are hard to convert into natural loops
・global_id + group_id + local_id ……
・we want to resolve such parameters at offline-compilation time
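For reference, with the standard OpenCL definitions (one dimension, global offset omitted), the ID structure the compiler has to recover corresponds to the nested loops below; the function names are illustrative only:

# Sketch of the iteration space behind the OpenCL IDs (standard definitions,
# one dimension, get_global_offset omitted): ideally the compiler turns this
# implicit structure back into ordinary nested loops.
def work_items(num_groups, local_size):
    for group_id in range(num_groups):
        for local_id in range(local_size):
            global_id = group_id * local_size + local_id
            yield global_id, group_id, local_id

# e.g. 3 work-groups of 4 work-items each -> global IDs 0..11
print([g for g, _, _ in work_items(3, 4)])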
Fusion of kernels (WIP)
・Fuse several kernels (GEMM + ReLU + bias, etc…)
・to reduce memory transfers
・Diesel (an NVIDIA compiler project) reported the impact
from “Diesel: DSL for linear algebra and neural net computations on GPUs”
Software pipelining (WIP)?
・Probably not effective…
・due to the instruction-cache size limitation
・we have even re-rolled some kernels…
Conclusion?
・Introduced VC4
・a dual-issue, in-order processor
・you can freely write assembly for it
・Introduced VC4C
・still under heavy development
・compiler lovers: here is an immature compiler!!!!
Reference
・VideoCore® IV 3D Architecture Reference Guide
・Raspberry PiのGPUで行列乗算(その1)
・Raspberry PiのGPUで行列乗算(その2)
・Hacking the Raspberry Pi's VideoCore IV GPU
・GPU_FFT
・blog@ysugi
