VC4C: Development of OpenCL Compiler for VideoCore4
Current status and challenges of developing an OSS OpenCL compiler that uses the Raspberry Pi GPU
2018/11/10 Compiler Study Group @fixstars
Who am I
・The darkness of the internet of light
 @no_maddo
・Engineer at Idein
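Today's topics
- Introduction to VideoCore IV (hereafter VC4)
  - Architecture
  - Memory characteristics
- Introduction to VC4C
  - An OSS project by a developer in Germany, not by me
  - Apparently it started as a project for a master's thesis
- Considerations specific to VC4C
  - VC4 is tough
  - OpenCL is tough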
Idein’s technology
・Execute mobilenet v2 1.0 224x224: 8.4 FPS ~= 140ms
Why can we achieve this performance?
・We draw out the maximum performance of VideoCore IV
・Hand-written parallel GPU assembly code
・Runs only on the GPU
・No return to the CPU during inference
・CPU usage is very low
・Even the Pi Zero (a $5 computer!!!) achieves this performance
For the details…
See our president's presentation
Why do we “try” the OSS compiler?
・We don’t use VC4C in production now
・Tuning assembly is “hard”
・Diesel, Tensor Comprehensions
・In the near future, we would be happy to write high-performance mathematical kernels with a compiler…
Architecture Introduction
VC4 overview
QPU / Quad Processing Unit
・general-purpose register files A/B x 32 (= 64 registers)
・accumulator registers r[0-3] (= 4 registers)
TMU / Texture and Memory Lookup Unit
Uniform cache
VPM / Vertex Pipe Memory
Efficient data transfer
Assembly Example: Hello World
mov(r0, uniform)          # load a uniform (the output address) into r0
setup_vpm_write()         # prepare for a VPM write
mov(vpm, 1)               # write 1 row (16 elements) to the VPM
setup_dma_store(nrows=1)  # declare a DMA store of 1 row
mov(vpm_st_addr, r0)      # start the store to the address held in r0
wait_dma_store()          # wait for the DMA store to complete
exit()
See the repository: py-videocore
Ex: A = A * 2 + 1
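ldi(ra1, 64)                         # byte stride per iteration (16 floats x 4 bytes)
ldi(rb1, 16)                         # elements processed per iteration
mov(rb0, uniform)                    # 1st uniform: output address
mov(ra0, uniform)                    # 2nd uniform: number of elements
imul24(r1, element_number, 4)        # per-lane byte offset (lane index x 4)
iadd(r1, uniform, r1)                # 3rd uniform: input address, plus the lane offset
L.loop
iadd(r1, r1, ra1).mov(tmu0_s, r1)    # request a TMU load at r1 and advance r1 (dual issue)
mutex_acquire()
setup_vpm_write(nrows=1)
nop(sig='load tmu0')                 # receive the TMU result in r4
fmul(r0, r4, 2.0)                    # r0 = a * 2
fadd(vpm, r0, 1.0)                   # write a * 2 + 1 to the VPM
setup_dma_store(nrows=1)
mov(vpm_st_addr, rb0)                # start the DMA store to the output address
wait_dma_store()
mutex_release()
isub(ra0, ra0, rb1, set_flags=True)  # remaining elements -= 16
jzc(L.loop)                          # loop while the result is non-zero
iadd(rb0, rb0, ra1); nop(); nop()    # branch delay slots: advance the output address
exit()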
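As a cross-check for the kernel above, the same computation can be sketched on the host with NumPy. This is only a reference model under assumptions: the name `mul2_reference` and the explicit 16-element block loop are illustrative (they mirror the 16 elements the QPU program handles per iteration) and are not part of py-videocore.

```python
import numpy as np

def mul2_reference(a, block=16):
    """Host-side model of A = A * 2 + 1, processed in blocks of 16 elements."""
    a = np.asarray(a, dtype=np.float32)
    assert a.size % block == 0, "the sketch assumes a multiple of the block size"
    out = np.empty_like(a)
    remaining = a.size   # plays the role of the loop counter in ra0
    offset = 0           # plays the role of the advancing input/output addresses
    while remaining != 0:
        chunk = a[offset:offset + block]                 # one 16-lane load
        out[offset:offset + block] = chunk * 2.0 + 1.0   # fmul, then fadd
        offset += block
        remaining -= block
    return out

result = mul2_reference(np.arange(128, dtype=np.float32))
```

Comparing `result` with the buffer written by the GPU is a simple way to validate a hand-written kernel.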
Flow of execution
・allocate GPU memory
・build uniforms
・for each thread
・run driver
with Driver() as drv:
    n_threads = 12
    r = drv.alloc((n_threads, 128), 'float32')
    a = drv.alloc((n_threads, 128), 'float32')
    # ………
    code = drv.program(mul2)
    uniforms = drv.alloc((n_threads, 3), 'uint32')
    uniforms[:, 0] = r.address()[:, 0]
    uniforms[:, 1] = 128
    uniforms[:, 2] = a.address()[0][0]
    drv.execute(n_threads=n_threads,
                program=code, uniforms=uniforms)
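The "build uniforms" step can be tried without a Raspberry Pi. The sketch below fills the same three columns (output address, element count, input address) with plain NumPy. The function `build_uniforms` and the base addresses are hypothetical and exist only for illustration; on real hardware the addresses come from the driver's allocations.

```python
import numpy as np

def build_uniforms(n_threads, out_base, in_base, n_elems, itemsize=4):
    """Build one row of uniforms per thread: (output address, count, input address)."""
    row_bytes = n_elems * itemsize
    uniforms = np.zeros((n_threads, 3), dtype=np.uint32)
    uniforms[:, 0] = out_base + np.arange(n_threads) * row_bytes  # start of each output row
    uniforms[:, 1] = n_elems                                      # loop count read by the kernel
    uniforms[:, 2] = in_base                                      # same input base for every thread, as on the slide
    return uniforms

# Hypothetical addresses, chosen only for illustration.
u = build_uniforms(n_threads=12, out_base=0x1000, in_base=0x8000, n_elems=128)
```

Each row is consumed in order by the `uniform` reads at the top of the kernel, so the column order has to match the order of those reads.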
performance example: qmkl
$ sudo ./qmkl/test/sgemm 224 224 224
GPU: 6.17614e+09 [flop/s]
CPU: 9.78483e+08 [flop/s]
NEON: 1.06783e+09 [flop/s]
https://github.com/idein/qmkl
・mathematical kernels using VC4
・Caution: no-trans, no-trans
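To put the three throughput figures in relation, the ratios can be computed directly. The snippet only uses the numbers printed above; the dictionary names are illustrative.

```python
# Throughput figures copied from the sgemm output above, in flop/s.
measured = {"GPU": 6.17614e9, "CPU": 9.78483e8, "NEON": 1.06783e9}

# Speedup of the GPU kernel over each CPU-side baseline.
speedup = {name: measured["GPU"] / value
           for name, value in measured.items() if name != "GPU"}

for name, ratio in speedup.items():
    print(f"GPU vs {name}: {ratio:.2f}x")
```

For this run the GPU is roughly 6.3 times faster than the plain CPU version and roughly 5.8 times faster than the NEON version.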
Performance issue
・low memory bandwidth:
・4.48 GBPS vs. 98 GBPS on my computer...
・TMU latency (cycles):
・TMU cache hit: 9
・L2 cache hit: 12
・Memory: 20 (if v3d_freq=250 [MHz])
・cache incoherency
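The cycle counts are easier to compare with other platforms once converted to time. The sketch assumes that all three latencies are counted in V3D clock cycles at the stated 250 MHz; the variable names are illustrative.

```python
V3D_FREQ_HZ = 250e6  # v3d_freq=250 [MHz], as stated above

latency_cycles = {"TMU cache hit": 9, "L2 cache hit": 12, "Memory": 20}

# One cycle at 250 MHz lasts 4 ns, so multiply the cycle count by the cycle time.
cycle_ns = 1e9 / V3D_FREQ_HZ
latency_ns = {name: cycles * cycle_ns for name, cycles in latency_cycles.items()}

# Ratio of the two bandwidth figures quoted on the slide.
bandwidth_ratio = 98 / 4.48
```

Under that assumption a TMU cache hit costs 36 ns, an L2 hit 48 ns and a memory access 80 ns, while the desktop machine has about 22 times the memory bandwidth.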
Cache incoherency
[Figure: diagram with numbered steps ① to ⑤]
VC4C: OpenCL compiler for VC4
Recap: OpenCL
・Parallel programming framework for heterogeneous computing (GPU, DSP, FPGA, etc...)
・Supports the data-parallel computing model
kernel void mul2(global float * a)
{
    int id = get_global_id(0);
    a[id] = a[id] * 2 + 1;
}
Host program
clCreateContext
clCreateProgramWithSource   (compile at runtime)
clCreateBuffer
clEnqueueWriteBuffer
global_item_size = { 4, 8, 12 };
clEnqueueNDRangeKernel      (enqueue kernel)
VC4C Overview
OpenCL runtime
offline compiler
Asm structure
kernel void mul2(global float * a) {
    int id = get_global_id(0);
    a[id] = a[id] * 2 + 1;
}
・The compiler makes an implicit loop
・OpenCL parameters are passed via uniforms
・The loop-exit condition is passed via a uniform
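The three points above can be modelled in a few lines of Python. This is a sketch of the idea only: the uniform layout used here (buffer, first work-item id, work-item count) is a hypothetical one chosen for illustration and is not the layout VC4C actually emits.

```python
import numpy as np

def mul2(a, global_id):
    """Body of the OpenCL kernel for a single work-item."""
    a[global_id] = a[global_id] * 2 + 1

def run_with_implicit_loop(kernel, uniforms):
    """Model of the generated code: read parameters from the uniform stream,
    then repeat the kernel body until the assigned work-items are exhausted."""
    stream = iter(uniforms)
    buffer = next(stream)     # kernel argument, passed as a uniform
    first_id = next(stream)   # first work-item id handled here (assumed layout)
    count = next(stream)      # loop-exit information, also passed as a uniform
    for global_id in range(first_id, first_id + count):
        kernel(buffer, global_id)

data = np.arange(8, dtype=np.float32)
run_with_implicit_loop(mul2, [data, 0, 8])
```

The kernel source contains no loop at all; the loop exists only in the generated code, which is why its bounds have to arrive from outside.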
VC4C demo
Let’s check the output….
Current status: works if registers are enough
・Works fine if register allocation succeeds
・Lack of register spilling
・Performance issues
・better instruction scheduling
・adjust clang loop optimizations for VC4
・innermost loop unrolling
・improve DMA transfers
・auto-vectorization
Implementation issue
VC4 specific optimization
immediate
・To load 32-bit constants, ldi is required
・Dealing with constants is costly
・moveConstantload removes ldi from loops
・But this increases register pressure…
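# Before: the constant is reloaded on every iteration
ldi(r0, 0)
ldi(r2, 10)
L.loop
ldi(r1, 256)
iadd(r0, r0, r1)
isub(r2, r2, 1, set_flags=True)
bgt(L.loop)
nop(); nop(); nop()

# After: ldi is moved out of the loop, but r1 stays live across the whole loop
ldi(r0, 0)
ldi(r2, 10)
ldi(r1, 256)
L.loop
iadd(r0, r0, r1)
isub(r2, r2, 1, set_flags=True)
bgt(L.loop)
nop(); nop(); nop()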
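The transformation above is a form of loop-invariant code motion. A minimal sketch on a toy instruction list follows. The tuple format `(op, dst, *srcs)` and the function `hoist_constant_loads` are made up for this illustration and do not reflect VC4C's internal representation; the loop is assumed to run at least once.

```python
def hoist_constant_loads(preheader, loop_body):
    """Move ('ldi', dst, value) out of the loop when dst has no other write
    in the loop and is not read before the ldi inside the loop body."""
    writes = {}
    for op, dst, *srcs in loop_body:
        writes[dst] = writes.get(dst, 0) + 1

    hoisted, remaining = [], []
    for index, ins in enumerate(loop_body):
        op, dst, *srcs = ins
        read_earlier = any(dst in earlier[2:] for earlier in loop_body[:index])
        if op == "ldi" and writes[dst] == 1 and not read_earlier:
            hoisted.append(ins)
        else:
            remaining.append(ins)
    return preheader + hoisted, remaining

preheader = [("ldi", "r0", 0), ("ldi", "r2", 10)]
loop_body = [
    ("ldi", "r1", 256),
    ("iadd", "r0", "r0", "r1"),
    ("isub", "r2", "r2", 1),
]
new_preheader, new_body = hoist_constant_loads(preheader, loop_body)
```

The loop body shrinks by one instruction, at the price of keeping `r1` occupied for the whole loop, which is the register-pressure trade-off noted on the slide.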
small immediate
・Instead of the rb regfile field, a limited set of immediates can be encoded
・-16~15, 1.0, 2.0, 4.0, …
・By combining them, some constants can be produced by an ALU instruction
# Before: every constant is loaded with ldi
ldi(r0, 0)
ldi(r2, 10)
L.loop
ldi(r1, 256)
iadd(r0, r0, r1)
isub(r2, r2, 1, set_flags=True)
bgt(L.loop)
nop(); nop(); nop()

# After: constants are built from small immediates
mov(r0, 0)
mov(r2, 10)
imul24(r1, -16, -16)
L.loop
iadd(r0, r0, r1)
isub(r2, r2, 1, set_flags=True)
bgt(L.loop)
nop(); nop(); nop()
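Finding such combinations is a small search problem. The sketch below enumerates pairs of integer small immediates and checks which single operation yields a target constant. It is an idealized model: it uses plain integer arithmetic and deliberately ignores the operand-width and encoding rules of the real instructions, and the function name is hypothetical.

```python
import operator

SMALL_INTS = range(-16, 16)  # integer small immediates listed above: -16..15
OPS = {"iadd": operator.add, "isub": operator.sub, "imul24": operator.mul}

def find_small_immediate_forms(target):
    """Return (op, x, y) triples with op(x, y) == target in plain integer arithmetic."""
    return [(name, x, y)
            for name, fn in OPS.items()
            for x in SMALL_INTS
            for y in SMALL_INTS
            if fn(x, y) == target]

forms_256 = find_small_immediate_forms(256)    # the constant used in the example
forms_1000 = find_small_immediate_forms(1000)  # a constant with no such form
```

In this model 256 has exactly one form, the product of -16 and -16 used on the slide, while 1000 has none and would still need an `ldi`.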
Fusion of VPM writes (WIP)
# Before: one VPM write and one DMA store per iteration
mov(r0, 0)
L.loop
setup_vpm_write(nrows=1)
mutex_acquire()
fadd(r1, uniform, 1.0)
iadd(r0, r0, 1).mov(vpm, r1)
mov(vpm_st_addr, rb0)
wait_dma_store()
mutex_release()
isub(None, r0, 3, set_flags=True)
bne(L.loop)
nop(); nop(); nop()

# After full unrolling: three VPM writes under a single setup, mutex section and DMA store
setup_vpm_write(nrows=3)
fadd(ra0, uniform, 1.0)
fadd(ra1, uniform, 1.0)
fadd(ra2, uniform, 1.0)
mutex_acquire()
mov(vpm, ra0)
mov(vpm, ra1)
mov(vpm, ra2)
mov(vpm_st_addr, rb0)
wait_dma_store()
mutex_release()
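The benefit of the fused form can be seen by counting operations. The sketch below builds both schedules as lists of operation names and compares them. The operation names mirror the assembly above, but the model itself is an illustration and says nothing about real cycle counts.

```python
def looped_schedule(n_rows):
    """Per-row schedule: each iteration sets up, locks, writes one row and stores it."""
    ops = []
    for _ in range(n_rows):
        ops += ["setup_vpm_write", "mutex_acquire", "vpm_write",
                "dma_store", "wait_dma_store", "mutex_release"]
    return ops

def fused_schedule(n_rows):
    """Fused schedule: one setup, one critical section and one DMA store for all rows."""
    return (["setup_vpm_write", "mutex_acquire"]
            + ["vpm_write"] * n_rows
            + ["dma_store", "wait_dma_store", "mutex_release"])

def count(ops, name):
    """Number of times an operation appears in a schedule."""
    return sum(1 for op in ops if op == name)

before, after = looped_schedule(3), fused_schedule(3)
```

For three rows the number of DMA stores and mutex sections drops from three to one, while the number of VPM writes stays the same.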
Hardware limitation
・Cache incoherency is a huge problem
・Register spilling
・also problematic on other GPUs
・Effective TMU loads
・If the same region is both read and written, the result can be wrong
・Using DMA instead discards parallelism entirely
Insufficient use of DMA
kernel void mul2(global float * a)
{
    int id = get_global_id(0);
    a[id] = a[id] * 2 + 1;
}
・Region a is both read and written
・Each element of a is read just once
・Actually, loading via the TMU is safe here
・Does this require complex analysis…???
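The safety condition can be stated as: no work-item reads an element that a different work-item writes. The brute-force check below illustrates that condition on explicit access sets. It is a sketch only; a compiler would have to prove the same property statically, and the helper names are hypothetical.

```python
def accesses_mul2(global_id):
    """Access pattern of mul2 for one work-item: (reads, writes) as index sets."""
    return {global_id}, {global_id}

def accesses_shifted(global_id):
    """Counter-example: each work-item reads its neighbour's element and writes its own."""
    return {global_id + 1}, {global_id}

def tmu_load_is_safe(access_fn, n_items):
    """True if no work-item reads an element written by a different work-item."""
    patterns = [access_fn(i) for i in range(n_items)]
    for i, (reads, _) in enumerate(patterns):
        for j, (_, writes) in enumerate(patterns):
            if i != j and reads & writes:
                return False
    return True

mul2_safe = tmu_load_is_safe(accesses_mul2, 16)
shifted_safe = tmu_load_is_safe(accesses_shifted, 16)
```

For `mul2` every work-item touches only its own element, so a cached read can never observe another work-item's write; the shifted pattern fails the check.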
Complex iteration via OpenCL IDs
・implicit loops (by ids) are hard to convert to natural loops
・global_id + worker_id + local_id ……
・We want to remove such parameters through offline compilation
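For one dimension, OpenCL defines the global id as the global offset plus group id times local size plus local id. The sketch below enumerates ids that way and shows that, once the sizes are known, the nested iteration collapses to a natural loop. The function name is illustrative.

```python
def global_ids(num_groups, local_size, offset=0):
    """Enumerate global ids the way OpenCL defines them for one dimension."""
    for group_id in range(num_groups):
        for local_id in range(local_size):
            yield offset + group_id * local_size + local_id

nested = list(global_ids(num_groups=3, local_size=4))
natural = list(range(3 * 4))
```

When `num_groups` and `local_size` are fixed at compile time, `nested` and `natural` are the same sequence, so the id arithmetic can be folded away.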
Fusion of kernels (WIP)
・Fusion of some kernels (GEMM + ReLU + bias, etc…)
・To reduce memory transfers
・Diesel (an NVIDIA compiler project) reported the impact
[Figure from "Diesel: DSL for linear algebra and neural net computations on GPUs"]
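The idea can be shown with NumPy. The unfused version makes three passes over the result matrix, while the fused version applies bias and ReLU to each row before it is written once. This is a host-side sketch of the concept with illustrative function names, not code generated by VC4C or Diesel.

```python
import numpy as np

def unfused(a, b, bias):
    """Three separate kernels: each one reads and writes the full result."""
    c = a @ b                    # GEMM
    c = c + bias                 # bias, second pass over the result
    return np.maximum(c, 0.0)    # ReLU, third pass over the result

def fused(a, b, bias):
    """One kernel: bias and ReLU are applied before the single write of each row."""
    out = np.empty((a.shape[0], b.shape[1]), dtype=np.float32)
    for i in range(a.shape[0]):
        row = a[i] @ b + bias
        out[i] = np.maximum(row, 0.0)
    return out

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 8)).astype(np.float32)
b = rng.standard_normal((8, 5)).astype(np.float32)
bias = rng.standard_normal(5).astype(np.float32)
```

Both functions compute the same values; the difference is how often the intermediate result travels through memory, which matters on a device with low memory bandwidth.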
Software pipelining (WIP)?
・Probably not effective…
・Due to the instruction cache limitation
・We rerolled some kernels…
Conclusion?
・Introduced VC4
・Dual-issue in-order processor
・You can write its assembly freely
・Introduced VC4C
・heavily under development
・Compiler lovers, here is an immature compiler!!!!
Reference
・VideoCore® IV 3D Architecture Reference Guide
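・Raspberry PiのGPUで行列乗算(その1) (Matrix multiplication on the Raspberry Pi GPU, Part 1)
・Raspberry PiのGPUで行列乗算(その2) (Matrix multiplication on the Raspberry Pi GPU, Part 2)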
・Hacking the Raspberry Pi's VideoCore IV GPU
・GPU_FFT
・blog@ysugi
