
VC4C: Development of OpenCL Compiler for VideoCore4


Published on: https://connpass.com/event/103976/presentation/



  1. 1. VC4C: Development of OpenCL Compiler for VideoCore4. Current status and issues in developing an OSS OpenCL compiler for the Raspberry Pi GPU. 2018/11/10, Compiler Study Meetup @fixstars
  2. 2. Who am I? ・The dark side of the fiber-optic internet: @no_maddo ・Engineer at Idein
  3. 3. Today's topics - Introduction to VideoCore IV (hereafter VC4) - Architecture - Memory characteristics - Introduction to VC4C - An OSS project by a developer in Germany (not me) - It apparently started as a project for his Master's thesis - Considerations specific to VC4C - VC4 is tough - OpenCL is tough
  4. 4. Idein's technology ・Executes MobileNet v2 1.0 224x224: 8.4 FPS ~= 140 ms
  5. 5. Why can we achieve this performance? ・Extracts the maximum performance of VideoCore IV ・Hand-assembled parallel GPU code ・Runs only on the GPU ・No return to the CPU during execution of the inference ・CPU usage is very low ・Even a Pi Zero (a $5 computer!!!) achieves this performance
  6. 6. For the details….. see our president's presentation
  7. 7. Why do we “try” the OSS compiler? ・We don't use VC4C in production now ・Tuning assembly by hand is “hard” ・cf. Diesel, TensorComprehension ・In the near future, we hope to write high-performance mathematical kernels through a compiler…...
  8. 8. Architecture Introduction
  9. 9. VC4 overview
  10. 10. VC4 overview
  11. 11. QPU / Quad Processing Unit
  12. 12. QPU / Quad Processing Unit ・general-purpose registers: regfiles A/B, 32 each (= 64 registers) ・accumulator registers r[0-3] (= 4 registers)
  13. 13. VC4 overview
  14. 14. TMU / Texture and Memory Lookup Unit
  15. 15. TMU / Texture and Memory Lookup Unit
  16. 16. TMU / Texture and Memory Lookup Unit
  17. 17. VC4 overview
  18. 18. Uniform cache
  19. 19. Uniform cache
  20. 20. Uniform cache
  21. 21. VC4 overview
  22. 22. VPM / Vertex Pipe Memory
  23. 23. VPM / Vertex Pipe Memory
  24. 24. Efficient data transfer
  25. 25. Assembly Example: Hello World
      mov(r0, uniform)         # load from uniform & set it to `r0`
      setup_vpm_write()        # prepare for vpm write
      mov(vpm, 1)              # write 1 row (16 elements) to vpm
      setup_dma_store(nrows=1) # declaration to output 1 row
      mov(vpm_st_addr, r0)     # start write to the address of `r0`
      wait_dma_store()         # sync dma_store
      exit()
      See the repository: py-videocore
  26. 26. Ex: A = A * 2 + 1
      ldi(ra1, 64)
      ldi(rb1, 16)
      mov(rb0, uniform)
      mov(ra0, uniform)
      imul24(r1, element_number, 4)
      iadd(r1, uniform, r1)
      L.loop
      iadd(r1, r1, ra1).mov(tmu0_s, r1)
      mutex_acquire()
      setup_vpm_write(nrows=1)
      nop(sig='load tmu0')
      fmul(r0, r4, 2.0)
      fadd(vpm, r0, 1.0)
      setup_dma_store(nrows=1)
      mov(vpm_st_addr, rb0)
      wait_dma_store()
      mutex_release()
      isub(ra0, ra0, rb1, set_flags=True)
      jzc(L.loop)
      iadd(rb0, rb0, ra1); nop(); nop()
      exit()
  27. 27. Flow of execution ・allocate GPU memory ・build uniforms ・for each thread ・run driver
      with Driver() as drv:
          n_threads = 12
          r = drv.alloc((n_threads, 128), 'float32')
          a = drv.alloc((n_threads, 128), 'float32')
          ………
          code = drv.program(mul2)
          uniforms = drv.alloc((n_threads, 3), 'uint32')
          uniforms[:, 0] = r.address()[:, 0]
          uniforms[:, 1] = 128
          uniforms[:, 2] = a.address()[0][0]
          drv.execute(n_threads=n_threads, program=code, uniforms=uniforms)
  28. 28. Performance example: qmkl
      $ sudo ./qmkl/test/sgemm 224 224 224
      GPU: 6.17614e+09 [flop/s]
      CPU: 9.78483e+08 [flop/s]
      NEON: 1.06783e+09 [flop/s]
      https://github.com/idein/qmkl
      ・mathematical kernels using VC4
      caution: no-trans, no-trans
  29. 29. Performance issues ・low memory bandwidth: ・4.48 GB/s vs. 98 GB/s in my computer... ・TMU latency (cycles): ・TMU cache hit: 9 ・L2 cache hit: 12 ・memory: 20 (if v3d_freq=250 [MHz]) ・cache incoherency
  30. 30. Cache incoherency
  31. 31. Cache incoherency
  32. 32. Cache incoherency
  33. 33. VC4C: OpenCL compiler for VC4
  34. 34. Recap: OpenCL
      ・Parallel programming framework for heterogeneous computing (GPU, DSP, FPGA, etc...)
      ・Supports a data-parallel computing model
      Kernel:
      kernel void mul2(global float * a) {
          int id = get_global_id(0);
          a[id] = a[id] * 2 + 1;
      }
      Host program:
      clCreateContext
      clCreateProgramWithSource  (compile at runtime)
      clCreateBuffer
      clEnqueueWriteBuffer
      global_item_size = { 4, 8, 12 };
      clEnqueueNDRangeKernel     (enqueue kernel)
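For concreteness, here is a minimal sketch in C of that host-side call sequence for the mul2 kernel above. This is my illustration, not code from the talk: platform/device query, error handling and result readback are omitted, and `src`, `a` and `n` are assumed to be supplied by the caller.

    #include <CL/cl.h>

    /* Minimal host-side sketch: compile the kernel at runtime, copy the
       input buffer to the device, and enqueue the NDRange kernel. */
    static void run_mul2(cl_device_id device, const char *src, float *a, size_t n) {
        cl_int err;
        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
        cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
        clBuildProgram(prog, 1, &device, NULL, NULL, NULL);       /* compile at runtime */
        cl_kernel k = clCreateKernel(prog, "mul2", &err);
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * sizeof(float), NULL, &err);
        clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, n * sizeof(float), a, 0, NULL, NULL);
        clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
        size_t global_item_size = n;                              /* one work-item per element */
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global_item_size, NULL, 0, NULL, NULL); /* enqueue kernel */
        clFinish(q);
    }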
  35. 35. Recap: OpenCL
  36. 36. VC4C Overview: OpenCL runtime + offline compiler
  37. 37. Asm structure
      kernel void mul2(global float * a) {
          int id = get_global_id(0);
          a[id] = a[id] * 2 + 1;
      }
      ・the implicit work-item loop is made explicit
      ・OpenCL kernel parameters are passed via uniforms
      ・the loop exit condition is passed via uniforms
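A rough C-level picture of that structure (my sketch, not actual VC4C output; the real output is QPU assembly, and the uniform layout here is assumed):

    #include <stdint.h>

    /* Sketch only: the kernel argument and the loop bound arrive through the
       uniform stream, and the work-item iteration becomes an explicit loop. */
    static void mul2_compiled(const uint32_t *uniforms) {
        float   *a = (float *)(uintptr_t)uniforms[0]; /* kernel parameter via uniform */
        uint32_t n = uniforms[1];                     /* loop exit condition via uniform */
        for (uint32_t id = 0; id < n; id++) {         /* implicit loop made explicit */
            a[id] = a[id] * 2.0f + 1.0f;
        }
    }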
  38. 38. VC4C demo Let’s check the output….
  39. 39. Current status: works if registers are enough
      Implementation issues:
      ・works fine if register allocation succeeds
      ・lack of register spilling
      Performance issues:
      ・better instruction scheduling
      ・adjust clang loop optimizations for VC4
      ・innermost loop unrolling
      ・improve DMA transfers
      ・auto-vectorization
  40. 40. VC4-specific optimizations
  41. 41. immediate
      ・To load 32-bit constants, ldi is required
      ・Dealing with constants is costly
      ・moveConstantload hoists ldi out of loops
      ・But it increases register pressure…
      Before:
      ldi(r0, 0)
      ldi(r2, 10)
      L.loop
      ldi(r1, 256)
      iadd(r0, r0, r1)
      isub(r2, r2, 1, set_flag=True)
      bgt(L.loop)
      nop(); nop(); nop()
      After:
      ldi(r0, 0)
      ldi(r2, 10)
      ldi(r1, 256)
      L.loop
      iadd(r0, r0, r1)
      isub(r2, r2, 1, set_flag=True)
      bgt(L.loop)
      nop(); nop(); nop()
  42. 42. small immediate
      ・Instead of the rb regfile field, a limited set of immediates can be encoded
      ・-16~15, 1.0, 2.0, 4.0, …
      ・by combining them, some immediates can be produced with an ALU instruction
      Before:
      ldi(r0, 0)
      ldi(r2, 10)
      L.loop
      ldi(r1, 256)
      iadd(r0, r0, r1)
      isub(r2, r2, 1, set_flag=True)
      bgt(L.loop)
      nop(); nop(); nop()
      After:
      mov(r0, 0)
      mov(r2, 10)
      imul24(r1, -16, -16)
      L.loop
      iadd(r0, r0, r1)
      isub(r2, r2, 1, set_flag=True)
      bgt(L.loop)
      nop(); nop(); nop()
  43. 43. Fusion of VPM writes (WIP)
      Before:
      mov(r0, 0)
      L.loop
      setup_vpm_write(nrows=1)
      mutex_acquire()
      fadd(r1, uniform, 1.0)
      iadd(r0, r0, 1).mov(vpm, r1)
      mov(vpm_st_addr, rb0)
      wait_dma_store()
      mutex_release()
      isub(None, r0, 3, set_flag=True)
      bne(L.loop)
      nop(); nop(); nop()
      After (full unrolling):
      setup_vpm_write(nrows=3)
      fadd(ra0, uniform, 1.0)
      fadd(ra1, uniform, 1.0)
      fadd(ra2, uniform, 1.0)
      mutex_acquire()
      mov(vpm, ra0)
      mov(vpm, ra1)
      mov(vpm, ra2)
      mov(vpm_st_addr, rb0)
      wait_dma_store()
      mutex_release()
  44. 44. Hardware limitations ・Cache incoherency is a huge problem ・Register spilling ・also problematic on other GPUs ・Effective TMU loads ・If the same region is both read and written, results can become wrong ・Using DMA instead discards parallelism entirely
  45. 45. Insufficient use of DMA
      kernel void mul2(global float * a) {
          int id = get_global_id(0);
          a[id] = a[id] * 2 + 1;
      }
      ・region a is both read and written
      ・but each element of a is read only once
      ・actually, loading via the TMU is safe here
      ・proving that requires complex analysis…???
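For contrast, a hypothetical kernel (my illustration, not from the talk) where routing reads through the TMU would be risky: if the compiler naively reloads a[id] for the second statement, the first statement's write leaves via the VPM/DMA path while the reload could be served from a TMU cache that was never invalidated.

    /* Hypothetical kernel: the same element of `a` is written and then read
       again. With an incoherent TMU cache, a naive reload of a[id] in the
       second statement may observe stale data, so the compiler cannot
       blindly route every load through the TMU. */
    kernel void mul2_twice(global float *a) {
        int id = get_global_id(0);
        a[id] = a[id] * 2 + 1;   /* read, compute, write back */
        a[id] = a[id] + 1;       /* re-read of a just-written element */
    }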
  46. 46. Complex iteration via OpenCL IDs ・implicit loops (over the IDs) are hard to convert into natural loops ・global_id + worker_id + local_id …… ・we want to remove such parameters via offline compilation
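For reference, the standard relationship between these IDs (a small OpenCL C example I am adding, assuming a zero global offset). Each quantity on the right is a separate runtime value the compiler has to track, which is what makes the implicit iteration structure awkward to recover as simple loops:

    /* Illustration: how the flat global index decomposes for dimension 0. */
    kernel void show_ids(global uint *out) {
        size_t gid = get_group_id(0) * get_local_size(0) + get_local_id(0);
        out[gid] = (uint)gid;   /* gid == get_global_id(0) when the global offset is 0 */
    }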
  47. 47. Fusion of kernels (WIP) ・Fuse some kernels (GEMM + ReLU + bias, etc…) ・to reduce memory transfers ・Diesel (an NVIDIA compiler project) reported the impact; see Diesel: DSL for linear algebra and neural net computations on GPUs
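As a hedged sketch of what such fusion looks like at the kernel level (my illustration, not VC4C or Diesel output): the bias add and ReLU are applied in the same kernel that computes the GEMM result, so the intermediate matrix never round-trips through memory.

    /* Illustration only: naive fused GEMM + bias + ReLU, launched over a 2D
       NDRange of size (N, M). A is MxK, B is KxN, bias has N entries, C is MxN. */
    kernel void gemm_bias_relu(global const float *A, global const float *B,
                               global const float *bias, global float *C,
                               int M, int N, int K) {
        int row = get_global_id(1);
        int col = get_global_id(0);
        float acc = 0.0f;
        for (int k = 0; k < K; k++)
            acc += A[row * K + k] * B[k * N + col];
        /* fused epilogue: no intermediate store between GEMM, bias and ReLU */
        C[row * N + col] = fmax(acc + bias[col], 0.0f);
    }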
  48. 48. Software pipelining (WIP)? ・Probably it would not be effective… ・due to the instruction cache limitation ・we actually re-rolled some kernels…..
  49. 49. Conclusion? ・Introduced VC4 ・a dual-issue, in-order processor ・you can write its assembly freely ・Introduced VC4C ・heavily under development ・compiler lovers, here is an immature compiler!!!!
  50. 50. Reference ・VideoCore® IV 3D Architecture Reference Guide ・Raspberry PiのGPUで行列乗算(その1) (Matrix multiplication on the Raspberry Pi GPU, part 1) ・Raspberry PiのGPUで行列乗算(その2) (part 2) ・Hacking the Raspberry Pi's VideoCore IV GPU ・GPU_FFT ・blog@ysugi
