VC4C: Development of OpenCL Compiler for VideoCore4
Current status and challenges of developing an OSS OpenCL compiler for the Raspberry Pi GPU
2018/11/10 Compiler Study Group @ Fixstars
Who am I
・"The darkness of the internet of light"
 @no_maddo
・Engineer at Idein
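Today's topics
- Introduction to VideoCore IV (hereafter VC4)
- Architecture
- Memory characteristics
- Introduction to VC4C
- An OSS project by a developer in Germany, not by me
- Apparently it began as a project for a Master's thesis
- Considerations specific to VC4C
- VC4 is tough
- OpenCL is tough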
Idein’s technology
・Runs MobileNet V2 1.0 (224x224) at 8.4 FPS ~= 140ms
Why can we achieve this performance?
・Draws out the maximum performance of VideoCore IV
・Hand-assembled parallel GPU code
・Runs only on the GPU
・No round trips to the CPU during inference
・CPU usage is very low
・Even a Pi Zero (a $5 computer!!!) achieves this performance
For details…
See our president's presentation
Why do we “try” the OSS compiler?
・We don’t use VC4C in production now
・Tuning assembly is “hard”
・Diesel, Tensor Comprehensions
・In the near future, we would be happy to write high-performance mathematical kernels with a compiler…
Architecture Introduction
VC4 overview
QPU / Quad Processing Unit
・general-purpose register files A and B, 32 registers each (= 64 registers)
・accumulator registers r[0-3] (= 4 registers)
TMU / Texture and Memory Lookup Unit
Uniform cache
VPM / Vertex Pipe Memory
Efficient data transfer
Assembly Example: Hello World
mov(r0, uniform) # read a uniform and store it in r0
setup_vpm_write() # prepare for a VPM write
mov(vpm, 1) # write 1 row (16 elements) to the VPM
setup_dma_store(nrows=1) # declare that 1 row will be stored
mov(vpm_st_addr, r0) # start the DMA store to the address held in r0
wait_dma_store() # wait for the DMA store to finish
exit()
See the repository: py-videocore
Ex: A = A * 2 + 1
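ldi(ra1, 64) # 64 bytes = 16 floats, the address step per iteration
ldi(rb1, 16) # 16 elements are processed per iteration
mov(rb0, uniform) # 1st uniform: store address
mov(ra0, uniform) # 2nd uniform: number of elements
imul24(r1, element_number, 4) # byte offset of each of the 16 elements
iadd(r1, uniform, r1) # 3rd uniform: load address, plus the per-element offset
L.loop
iadd(r1, r1, ra1).mov(tmu0_s, r1) # request a TMU load and advance the load address
mutex_acquire()
setup_vpm_write(nrows=1)
nop(sig='load tmu0') # receive the loaded data in r4
fmul(r0, r4, 2.0)
fadd(vpm, r0, 1.0) # A * 2 + 1, written to the VPM
setup_dma_store(nrows=1)
mov(vpm_st_addr, rb0) # start the DMA store
wait_dma_store()
mutex_release()
isub(ra0, ra0, rb1, set_flags=True) # remaining elements -= 16
jzc(L.loop) # loop while the result is not zero
iadd(rb0, rb0, ra1); nop(); nop() # branch delay slots: advance the store address
exit()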
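As a plain-Python reference for what one QPU thread computes in the loop above, the following sketch processes 16 elements per iteration until the element count reaches zero. It is only a model: the name `mul2_reference` and the list-based memory are assumptions made for illustration, and no TMU, VPM, or DMA behaviour is simulated.

```python
LANES = 16  # one QPU instruction operates on 16 elements


def mul2_reference(a, count):
    """Model of the loop: r[i] = a[i] * 2 + 1, 16 elements per iteration."""
    assert count > 0 and count % LANES == 0, "the assembly assumes a positive multiple of 16"
    r = [0.0] * count
    offset = 0         # plays the role of the advancing addresses (r1 / rb0)
    remaining = count  # plays the role of ra0
    while True:
        chunk = a[offset:offset + LANES]  # TMU load of 16 floats
        r[offset:offset + LANES] = [x * 2.0 + 1.0 for x in chunk]  # fmul, fadd, store
        offset += LANES
        remaining -= LANES                # isub(..., set_flags=True)
        if remaining == 0:                # jzc: keep looping while the Z flag is clear
            break
    return r


result = mul2_reference([float(i) for i in range(128)], 128)
print(result[:4])
```

Like the assembly, the model is a do-while loop: the body always runs at least once, which is why the element count must be a positive multiple of 16.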
Flow of execution
・allocate GPU memory
・build uniforms for each thread
・run driver
with Driver() as drv:
    n_threads = 12
    r = drv.alloc((n_threads, 128), 'float32')
    a = drv.alloc((n_threads, 128), 'float32')
    ………
    code = drv.program(mul2)
    uniforms = drv.alloc((n_threads, 3), 'uint32')
    uniforms[:, 0] = r.address()[:, 0]
    uniforms[:, 1] = 128
    uniforms[:, 2] = a.address()[0][0]
    drv.execute(n_threads=n_threads, program=code, uniforms=uniforms)
performance example: qmkl
$ sudo ./qmkl/test/sgemm 224 224 224
GPU: 6.17614e+09 [flop/s]
CPU: 9.78483e+08 [flop/s]
NEON: 1.06783e+09 [flop/s]
https://github.com/idein/qmkl
・mathematical kernels using VC4
・caution: no-trans, no-trans
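To put the sgemm figures in perspective, the short calculation below derives the speedups and an approximate time per call. It assumes the usual 2·M·N·K operation count for SGEMM, which the slide does not state; the throughput numbers are the ones printed above.

```python
# Throughput figures copied from the sgemm output above, in flop/s.
gpu, cpu, neon = 6.17614e9, 9.78483e8, 1.06783e9

# Assumption: an M x N x K SGEMM performs 2*M*N*K floating-point operations.
m = n = k = 224
flops = 2 * m * n * k

speedup_cpu = gpu / cpu
speedup_neon = gpu / neon
gpu_time_ms = flops / gpu * 1e3

print(f"GPU vs CPU : {speedup_cpu:.2f}x")
print(f"GPU vs NEON: {speedup_neon:.2f}x")
print(f"GPU time per 224^3 SGEMM: {gpu_time_ms:.2f} ms")
```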
Performance issue
・low memory bandwidth:
・4.48 GB/s vs. 98 GB/s in my computer...
・TMU latency (cycles):
・TMU cache hit: 9
・L2 cache hit: 12
・Memory: 20 (if v3d_freq=250 [MHz])
・cache incoherency
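The cycle counts above can be converted to wall-clock time at the stated clock. The sketch below does that arithmetic and also computes the bandwidth gap; it uses only the numbers on the slide, and the variable names are chosen here for illustration.

```python
V3D_FREQ_HZ = 250e6  # v3d_freq=250 MHz, as stated on the slide

latency_cycles = {"TMU cache hit": 9, "L2 cache hit": 12, "Memory": 20}
latency_ns = {
    name: cycles / V3D_FREQ_HZ * 1e9 for name, cycles in latency_cycles.items()
}

for name, ns in latency_ns.items():
    print(f"{name}: {ns:.0f} ns")

bandwidth_ratio = 98 / 4.48  # desktop bandwidth divided by VC4 bandwidth
print(f"bandwidth gap: {bandwidth_ratio:.1f}x")
```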
Cache incoherency
VC4C: OpenCL compiler for VC4
Recap: OpenCL
・Parallel programming framework for heterogeneous computing (GPU, DSP, FPGA, etc.)
・Supports the data-parallel computing model
kernel void mul2(global float * a)
{
    int id = get_global_id(0);
    a[id] = a[id] * 2 + 1;
}
Host program
clCreateContext
clCreateProgramWithSource (compile at runtime)
clCreateBuffer
clEnqueueWriteBuffer
global_item_size = { 4, 8, 12 };
clEnqueueNDRangeKernel (enqueue kernel)
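The data-parallel model can be emulated in a few lines of Python: the kernel body is written for a single work-item, and the runtime invokes it once per global ID. This is only a sequential sketch; `enqueue_nd_range` is a hypothetical stand-in named after the OpenCL call, not a binding to it, and a one-dimensional range is assumed.

```python
def mul2(a, global_id):
    """Python stand-in for the body of the OpenCL kernel mul2."""
    a[global_id] = a[global_id] * 2 + 1


def enqueue_nd_range(kernel, buffer, global_size):
    """Run the kernel once per work-item; a real device runs these in parallel."""
    for global_id in range(global_size):
        kernel(buffer, global_id)


data = [0.0, 1.0, 2.0, 3.0]
enqueue_nd_range(mul2, data, len(data))
print(data)
```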
VC4C Overview
OpenCL runtime
offline compiler
Asm structure
kernel void mul2(global float * a) {
    int id = get_global_id(0);
    a[id] = a[id] * 2 + 1;
}
・makes an implicit loop
・OpenCL parameters are passed via uniforms
・the loop exit condition is passed via a uniform
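The three points above can be pictured as a QPU program that reads its parameters from a uniform stream and repeats until a flag in that stream tells it to stop. The sketch below models this in Python. The layout of the stream (group ID, buffer base, repeat flag) is an assumption made for illustration and is not taken from VC4C; `run_qpu` is a hypothetical name.

```python
def run_qpu(uniforms, memory, local_size):
    """Consume uniforms in order and loop until the repeat flag is zero."""
    stream = iter(uniforms)
    while True:
        group_id = next(stream)  # which work-group to execute
        base = next(stream)      # kernel argument: start index of `a`
        for local_id in range(local_size):
            gid = group_id * local_size + local_id
            memory[base + gid] = memory[base + gid] * 2 + 1
        if next(stream) == 0:    # loop-exit flag, also read from a uniform
            break


memory = [float(i) for i in range(8)]
# Two work-groups of 4 work-items: (group_id, base, repeat_flag) per group.
run_qpu([0, 0, 1, 1, 0, 0], memory, local_size=4)
print(memory)
```

Because the parameters arrive as a stream, the same compiled code can serve any number of work-groups without being recompiled.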
VC4C demo
Let’s check the output….
Current status: works if registers are enough
・Works fine if register allocation succeeds
・No register spilling yet
・Performance issues
・better instruction scheduling
・adjust clang loop optimizations for VC4
・innermost loop unrolling
・improve DMA transfers
・auto-vectorization
Implementation Issues
VC4-specific optimization
immediate
・Loading a 32-bit constant requires ldi
・Dealing with constants is costly
・moveConstantload moves ldi out of loops
・But this increases register pressure…
Before:
ldi(r0, 0)
ldi(r2, 10)
L.loop
ldi(r1, 256)
iadd(r0, r0, r1)
isub(r2, r2, 1, set_flags=True)
bgt(L.loop)
nop(); nop(); nop()
After:
ldi(r0, 0)
ldi(r2, 10)
ldi(r1, 256)
L.loop
iadd(r0, r0, r1)
isub(r2, r2, 1, set_flags=True)
bgt(L.loop)
nop(); nop(); nop()
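The effect of hoisting the constant load can be checked with a small Python model of the two loops. It simulates only the register values and counts executed ldi instructions; the function name and the flag parameter are assumptions for illustration, and timing, delay slots, and register pressure are not modeled.

```python
def run(ldi_in_loop):
    """Return (final r0, number of ldi instructions executed)."""
    r0, r2 = 0, 10
    ldi_count = 2          # ldi(r0, 0); ldi(r2, 10)
    r1 = None
    if not ldi_in_loop:
        r1 = 256           # hoisted: ldi(r1, 256) runs once before the loop
        ldi_count += 1
    while True:
        if ldi_in_loop:
            r1 = 256       # ldi(r1, 256) repeated in every iteration
            ldi_count += 1
        r0 += r1           # iadd(r0, r0, r1)
        r2 -= 1            # isub(r2, r2, 1) setting flags
        if not r2 > 0:     # bgt(L.loop)
            break
    return r0, ldi_count


before = run(ldi_in_loop=True)
after = run(ldi_in_loop=False)
print(before, after)
```

Both versions compute the same r0. The hoisted version executes fewer ldi instructions but keeps r1 live across the whole loop, which is the register-pressure cost noted above.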
small immediate
・In place of the regfile B read field, a limited set of immediates can be encoded
・-16~15, 1.0, 2.0, 4.0, …
・By combining them, some constants can be produced by an ALU instruction instead of ldi
Before:
ldi(r0, 0)
ldi(r2, 10)
L.loop
ldi(r1, 256)
iadd(r0, r0, r1)
isub(r2, r2, 1, set_flags=True)
bgt(L.loop)
nop(); nop(); nop()
After:
mov(r0, 0)
mov(r2, 10)
imul24(r1, -16, -16)
L.loop
iadd(r0, r0, r1)
isub(r2, r2, 1, set_flags=True)
bgt(L.loop)
nop(); nop(); nop()
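A compiler first has to decide whether a constant fits the small-immediate encoding at all. The sketch below is one way to write that check. The integer range comes from the slide; the float range (powers of two from 1/256 to 128.0) is my reading of the architecture reference guide and should be treated as an assumption. Rotation encodings and the ALU semantics needed to combine immediates are not modeled, and `is_small_immediate` is a hypothetical name.

```python
import math


def is_small_immediate(value):
    """True if `value` belongs to the small-immediate set described above."""
    if isinstance(value, int):
        return -16 <= value <= 15
    if value <= 0:
        return False
    # Floats: accept exact powers of two from 1/256 up to 128.0 (assumed range).
    mantissa, exponent = math.frexp(value)  # value == mantissa * 2**exponent
    return mantissa == 0.5 and -7 <= exponent <= 8


for constant in (0, 10, 256, 2.0, 3.0):
    kind = "small immediate" if is_small_immediate(constant) else "needs ldi or an ALU trick"
    print(constant, "->", kind)
```

Constants such as 0 and 10 pass the check, which is why the rewritten code can use mov for them, while 256 does not and has to be built some other way.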
Fusion of VPM writes (WIP)
Before:
mov(r0, 0)
L.loop
setup_vpm_write(nrows=1)
mutex_acquire()
fadd(r1, uniform, 1.0)
iadd(r0, r0, 1).mov(vpm, r1)
mov(vpm_st_addr, rb0)
wait_dma_store()
mutex_release()
isub(None, r0, 3, set_flags=True)
bne(L.loop)
nop(); nop(); nop()
After full unrolling:
setup_vpm_write(nrows=3)
fadd(ra0, uniform, 1.0)
fadd(ra1, uniform, 1.0)
fadd(ra2, uniform, 1.0)
mutex_acquire()
mov(vpm, ra0)
mov(vpm, ra1)
mov(vpm, ra2)
mov(vpm_st_addr, rb0)
wait_dma_store()
mutex_release()
Hardware limitation
・Cache incoherency is a huge problem
・Register spilling
・also problematic on other GPUs
・Effective TMU loads
・If the same region is both read and written, the result becomes wrong
・Using DMA instead discards parallelism entirely
Insufficient use of DMA
kernel void mul2(global float * a)
{
    int id = get_global_id(0);
    a[id] = a[id] * 2 + 1;
}
・region a is both read and written
・a is read just once
・Actually, loading via TMU is safe
・this requires complex analysis…???
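The safety condition hinted at above can be stated as: no work-item reads an element that a different work-item writes. The sketch below checks this by brute force over explicit access lists. It is a dynamic illustration with hypothetical names, not the static analysis a compiler would need to perform.

```python
def tmu_load_is_safe(n_items, reads, writes):
    """Safe if no work-item reads an index that a different work-item writes."""
    writers = {}
    for item in range(n_items):
        for idx in writes(item):
            writers.setdefault(idx, set()).add(item)
    for item in range(n_items):
        for idx in reads(item):
            if writers.get(idx, set()) - {item}:
                return False
    return True


# mul2: each work-item reads and writes only a[id].
mul2_safe = tmu_load_is_safe(8, lambda i: [i], lambda i: [i])
# A kernel that reads a[id + 1] but writes a[id] touches a neighbour's element.
shifted_safe = tmu_load_is_safe(8, lambda i: [i + 1], lambda i: [i])
print(mul2_safe, shifted_safe)
```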
Complex iteration via OpenCL IDs
・implicit loops (by IDs) are hard to convert to natural loops
・global_id + worker_id + local_id ……
・we want to remove such parameters through offline compilation
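For one dimension, OpenCL defines the global ID as the group ID times the local size, plus the local ID, plus the global offset. The sketch below writes the implicit iteration as an explicit nested loop to show how the IDs relate. The function names are chosen here for illustration, and a zero global offset is assumed in the loop.

```python
def global_id(group_id, local_size, local_id, global_offset=0):
    """OpenCL relation between the IDs for one dimension."""
    return group_id * local_size + local_id + global_offset


def implicit_loop(num_groups, local_size):
    """Yield global IDs in the order an explicit nested loop would visit them."""
    for group_id in range(num_groups):
        for local_id in range(local_size):
            yield global_id(group_id, local_size, local_id)


ids = list(implicit_loop(num_groups=3, local_size=4))
print(ids)
```

When the sizes are known at compile time, the nested loop collapses into a single counted loop, which is the kind of simplification offline compilation could enable.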
Fusion of kernels (WIP)
・Fusing several kernels (GEMM + ReLU + bias, etc.)
・To reduce memory transfers
・Diesel (an NVIDIA compiler project) reported the impact
Figure from: Diesel: DSL for linear algebra and neural net computations on GPUs
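The saving from fusion can be illustrated by counting how many output elements are written to memory. The sketch below compares three separate passes with one fused pass on plain Python lists. It counts only stores of the output matrix, and the function names and the tiny matrices are assumptions for illustration, not measurements.

```python
def relu(x):
    return x if x > 0.0 else 0.0


def unfused(a, b, bias):
    """GEMM, bias add, and ReLU as three passes; returns (result, element stores)."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)] for i in range(n)]
    c = [[c[i][j] + bias[j] for j in range(m)] for i in range(n)]
    c = [[relu(c[i][j]) for j in range(m)] for i in range(n)]
    return c, 3 * n * m  # every pass writes the whole output once


def fused(a, b, bias):
    """One pass: each output element is computed fully, then stored once."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[relu(sum(a[i][p] * b[p][j] for p in range(k)) + bias[j]) for j in range(m)]
         for i in range(n)]
    return c, n * m


a = [[1.0, -2.0], [3.0, 4.0]]
b = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.5, -10.0]
print(unfused(a, b, bias))
print(fused(a, b, bias))
```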
Software pipelining (WIP)?
・Probably not effective…
・Due to the instruction cache limitation
・We rerolled some kernels…
Conclusion?
・Introduced VC4
・Dual-issue in-order processor
・You can write its assembly freely
・Introduced VC4C
・under heavy development
・compiler lovers, here is an immature compiler!!!!
Reference
・VideoCore® IV 3D Architecture Reference Guide
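・Matrix multiplication on the Raspberry Pi GPU, Part 1 (Raspberry PiのGPUで行列乗算(その1))
・Matrix multiplication on the Raspberry Pi GPU, Part 2 (Raspberry PiのGPUで行列乗算(その2))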
・Hacking the Raspberry Pi's VideoCore IV GPU
・GPU_FFT
・blog@ysugi

More Related Content

What's hot

Design and Implementation of GCC Register Allocation
Design and Implementation of GCC Register AllocationDesign and Implementation of GCC Register Allocation
Design and Implementation of GCC Register AllocationKito Cheng
 
Compilation of COSMO for GPU using LLVM
Compilation of COSMO for GPU using LLVMCompilation of COSMO for GPU using LLVM
Compilation of COSMO for GPU using LLVMLinaro
 
A Speculative Technique for Auto-Memoization Processor with Multithreading
A Speculative Technique for Auto-Memoization Processor with MultithreadingA Speculative Technique for Auto-Memoization Processor with Multithreading
A Speculative Technique for Auto-Memoization Processor with MultithreadingMatsuo and Tsumura lab.
 
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...AMD Developer Central
 
助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」
助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」
助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」Shinya Takamaeda-Y
 
深層学習フレームワークにおけるIntel CPU/富岳向け最適化法
深層学習フレームワークにおけるIntel CPU/富岳向け最適化法深層学習フレームワークにおけるIntel CPU/富岳向け最適化法
深層学習フレームワークにおけるIntel CPU/富岳向け最適化法MITSUNARI Shigeo
 
TensorFlow Studying Part II for GPU
TensorFlow Studying Part II for GPUTensorFlow Studying Part II for GPU
TensorFlow Studying Part II for GPUTe-Yen Liu
 
ExperiencesSharingOnEmbeddedSystemDevelopment_20160321
ExperiencesSharingOnEmbeddedSystemDevelopment_20160321ExperiencesSharingOnEmbeddedSystemDevelopment_20160321
ExperiencesSharingOnEmbeddedSystemDevelopment_20160321Teddy Hsiung
 
Introduction to CUDA C: NVIDIA : Notes
Introduction to CUDA C: NVIDIA : NotesIntroduction to CUDA C: NVIDIA : Notes
Introduction to CUDA C: NVIDIA : NotesSubhajit Sahu
 
Comparing On-The-Fly Accelerating Packages: Numba, TensorFlow, Dask, etc
Comparing On-The-Fly Accelerating Packages: Numba, TensorFlow, Dask, etcComparing On-The-Fly Accelerating Packages: Numba, TensorFlow, Dask, etc
Comparing On-The-Fly Accelerating Packages: Numba, TensorFlow, Dask, etcYukio Okuda
 
WebAssembly向け多倍長演算の実装
WebAssembly向け多倍長演算の実装WebAssembly向け多倍長演算の実装
WebAssembly向け多倍長演算の実装MITSUNARI Shigeo
 
Pythonによるカスタム可能な高位設計技術 (Design Solution Forum 2016@新横浜)
Pythonによるカスタム可能な高位設計技術 (Design Solution Forum 2016@新横浜)Pythonによるカスタム可能な高位設計技術 (Design Solution Forum 2016@新横浜)
Pythonによるカスタム可能な高位設計技術 (Design Solution Forum 2016@新横浜)Shinya Takamaeda-Y
 
PVS-Studio team experience: checking various open source projects, or mistake...
PVS-Studio team experience: checking various open source projects, or mistake...PVS-Studio team experience: checking various open source projects, or mistake...
PVS-Studio team experience: checking various open source projects, or mistake...Andrey Karpov
 
Caffe studying 2017
Caffe studying 2017Caffe studying 2017
Caffe studying 2017Te-Yen Liu
 
Let’s talk about microbenchmarking
Let’s talk about microbenchmarkingLet’s talk about microbenchmarking
Let’s talk about microbenchmarkingAndrey Akinshin
 
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil PerssonLow-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil PerssonAMD Developer Central
 
C++ How I learned to stop worrying and love metaprogramming
C++ How I learned to stop worrying and love metaprogrammingC++ How I learned to stop worrying and love metaprogramming
C++ How I learned to stop worrying and love metaprogrammingcppfrug
 
ISCA Final Presentaiton - Compilations
ISCA Final Presentaiton -  CompilationsISCA Final Presentaiton -  Compilations
ISCA Final Presentaiton - CompilationsHSA Foundation
 

What's hot (20)

Design and Implementation of GCC Register Allocation
Design and Implementation of GCC Register AllocationDesign and Implementation of GCC Register Allocation
Design and Implementation of GCC Register Allocation
 
Compilation of COSMO for GPU using LLVM
Compilation of COSMO for GPU using LLVMCompilation of COSMO for GPU using LLVM
Compilation of COSMO for GPU using LLVM
 
A Speculative Technique for Auto-Memoization Processor with Multithreading
A Speculative Technique for Auto-Memoization Processor with MultithreadingA Speculative Technique for Auto-Memoization Processor with Multithreading
A Speculative Technique for Auto-Memoization Processor with Multithreading
 
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...
 
助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」
助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」
助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」
 
深層学習フレームワークにおけるIntel CPU/富岳向け最適化法
深層学習フレームワークにおけるIntel CPU/富岳向け最適化法深層学習フレームワークにおけるIntel CPU/富岳向け最適化法
深層学習フレームワークにおけるIntel CPU/富岳向け最適化法
 
TensorFlow Studying Part II for GPU
TensorFlow Studying Part II for GPUTensorFlow Studying Part II for GPU
TensorFlow Studying Part II for GPU
 
ExperiencesSharingOnEmbeddedSystemDevelopment_20160321
ExperiencesSharingOnEmbeddedSystemDevelopment_20160321ExperiencesSharingOnEmbeddedSystemDevelopment_20160321
ExperiencesSharingOnEmbeddedSystemDevelopment_20160321
 
Introduction to CUDA C: NVIDIA : Notes
Introduction to CUDA C: NVIDIA : NotesIntroduction to CUDA C: NVIDIA : Notes
Introduction to CUDA C: NVIDIA : Notes
 
Comparing On-The-Fly Accelerating Packages: Numba, TensorFlow, Dask, etc
Comparing On-The-Fly Accelerating Packages: Numba, TensorFlow, Dask, etcComparing On-The-Fly Accelerating Packages: Numba, TensorFlow, Dask, etc
Comparing On-The-Fly Accelerating Packages: Numba, TensorFlow, Dask, etc
 
WebAssembly向け多倍長演算の実装
WebAssembly向け多倍長演算の実装WebAssembly向け多倍長演算の実装
WebAssembly向け多倍長演算の実装
 
Pythonによるカスタム可能な高位設計技術 (Design Solution Forum 2016@新横浜)
Pythonによるカスタム可能な高位設計技術 (Design Solution Forum 2016@新横浜)Pythonによるカスタム可能な高位設計技術 (Design Solution Forum 2016@新横浜)
Pythonによるカスタム可能な高位設計技術 (Design Solution Forum 2016@新横浜)
 
PVS-Studio team experience: checking various open source projects, or mistake...
PVS-Studio team experience: checking various open source projects, or mistake...PVS-Studio team experience: checking various open source projects, or mistake...
PVS-Studio team experience: checking various open source projects, or mistake...
 
Caffe studying 2017
Caffe studying 2017Caffe studying 2017
Caffe studying 2017
 
Let’s talk about microbenchmarking
Let’s talk about microbenchmarkingLet’s talk about microbenchmarking
Let’s talk about microbenchmarking
 
OpenMP
OpenMPOpenMP
OpenMP
 
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil PerssonLow-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
 
C++ How I learned to stop worrying and love metaprogramming
C++ How I learned to stop worrying and love metaprogrammingC++ How I learned to stop worrying and love metaprogramming
C++ How I learned to stop worrying and love metaprogramming
 
Introduction to Data Oriented Design
Introduction to Data Oriented DesignIntroduction to Data Oriented Design
Introduction to Data Oriented Design
 
ISCA Final Presentaiton - Compilations
ISCA Final Presentaiton -  CompilationsISCA Final Presentaiton -  Compilations
ISCA Final Presentaiton - Compilations
 

Similar to Vc4c development of opencl compiler for videocore4

Multiplatform JIT Code Generator for NetBSD by Alexander Nasonov
Multiplatform JIT Code Generator for NetBSD by Alexander NasonovMultiplatform JIT Code Generator for NetBSD by Alexander Nasonov
Multiplatform JIT Code Generator for NetBSD by Alexander Nasonoveurobsdcon
 
Challenges in GPU compilers
Challenges in GPU compilersChallenges in GPU compilers
Challenges in GPU compilersAnastasiaStulova
 
SFO15-500: VIXL
SFO15-500: VIXLSFO15-500: VIXL
SFO15-500: VIXLLinaro
 
Linux kernel tracing superpowers in the cloud
Linux kernel tracing superpowers in the cloudLinux kernel tracing superpowers in the cloud
Linux kernel tracing superpowers in the cloudAndrea Righi
 
Share the Experience of Using Embedded Development Board
Share the Experience of Using Embedded Development BoardShare the Experience of Using Embedded Development Board
Share the Experience of Using Embedded Development BoardJian-Hong Pan
 
The n00bs guide to ovs dpdk
The n00bs guide to ovs dpdkThe n00bs guide to ovs dpdk
The n00bs guide to ovs dpdkmarkdgray
 
Moving NEON to 64 bits
Moving NEON to 64 bitsMoving NEON to 64 bits
Moving NEON to 64 bitsChiou-Nan Chen
 
Make ARM Shellcode Great Again - HITB2018PEK
Make ARM Shellcode Great Again - HITB2018PEKMake ARM Shellcode Great Again - HITB2018PEK
Make ARM Shellcode Great Again - HITB2018PEKSaumil Shah
 
Debugging Ruby Systems
Debugging Ruby SystemsDebugging Ruby Systems
Debugging Ruby SystemsEngine Yard
 
Getting Started with Raspberry Pi - DCC 2013.1
Getting Started with Raspberry Pi - DCC 2013.1Getting Started with Raspberry Pi - DCC 2013.1
Getting Started with Raspberry Pi - DCC 2013.1Tom Paulus
 
Library Operating System for Linux #netdev01
Library Operating System for Linux #netdev01Library Operating System for Linux #netdev01
Library Operating System for Linux #netdev01Hajime Tazaki
 
Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014Hajime Tazaki
 
Metasepi team meeting #7: Snatch application on tiny OS
Metasepi team meeting #7: Snatch application on tiny OSMetasepi team meeting #7: Snatch application on tiny OS
Metasepi team meeting #7: Snatch application on tiny OSKiwamu Okabe
 
PVS-Studio, a solution for resource intensive applications development
PVS-Studio, a solution for resource intensive applications developmentPVS-Studio, a solution for resource intensive applications development
PVS-Studio, a solution for resource intensive applications developmentOOO "Program Verification Systems"
 
開放運算&GPU技術研究班
開放運算&GPU技術研究班開放運算&GPU技術研究班
開放運算&GPU技術研究班Paul Chao
 
20081114 Friday Food iLabt Bart Joris
20081114 Friday Food iLabt Bart Joris20081114 Friday Food iLabt Bart Joris
20081114 Friday Food iLabt Bart Jorisimec.archive
 
CA-Lec4-RISCV-Instructions-1aaaaaaaaaa.pptx
CA-Lec4-RISCV-Instructions-1aaaaaaaaaa.pptxCA-Lec4-RISCV-Instructions-1aaaaaaaaaa.pptx
CA-Lec4-RISCV-Instructions-1aaaaaaaaaa.pptxtrupeace
 
1032 cs208 g operation system ip camera case share.v0.2
1032 cs208 g operation system ip camera case share.v0.21032 cs208 g operation system ip camera case share.v0.2
1032 cs208 g operation system ip camera case share.v0.2Stanley Ho
 

Similar to Vc4c development of opencl compiler for videocore4 (20)

Multiplatform JIT Code Generator for NetBSD by Alexander Nasonov
Multiplatform JIT Code Generator for NetBSD by Alexander NasonovMultiplatform JIT Code Generator for NetBSD by Alexander Nasonov
Multiplatform JIT Code Generator for NetBSD by Alexander Nasonov
 
Challenges in GPU compilers
Challenges in GPU compilersChallenges in GPU compilers
Challenges in GPU compilers
 
SFO15-500: VIXL
SFO15-500: VIXLSFO15-500: VIXL
SFO15-500: VIXL
 
Rsltollvm
RsltollvmRsltollvm
Rsltollvm
 
Linux kernel tracing superpowers in the cloud
Linux kernel tracing superpowers in the cloudLinux kernel tracing superpowers in the cloud
Linux kernel tracing superpowers in the cloud
 
Share the Experience of Using Embedded Development Board
Share the Experience of Using Embedded Development BoardShare the Experience of Using Embedded Development Board
Share the Experience of Using Embedded Development Board
 
The n00bs guide to ovs dpdk
The n00bs guide to ovs dpdkThe n00bs guide to ovs dpdk
The n00bs guide to ovs dpdk
 
Php engine
Php enginePhp engine
Php engine
 
Moving NEON to 64 bits
Moving NEON to 64 bitsMoving NEON to 64 bits
Moving NEON to 64 bits
 
Make ARM Shellcode Great Again - HITB2018PEK
Make ARM Shellcode Great Again - HITB2018PEKMake ARM Shellcode Great Again - HITB2018PEK
Make ARM Shellcode Great Again - HITB2018PEK
 
Debugging Ruby Systems
Debugging Ruby SystemsDebugging Ruby Systems
Debugging Ruby Systems
 
Getting Started with Raspberry Pi - DCC 2013.1
Getting Started with Raspberry Pi - DCC 2013.1Getting Started with Raspberry Pi - DCC 2013.1
Getting Started with Raspberry Pi - DCC 2013.1
 
Library Operating System for Linux #netdev01
Library Operating System for Linux #netdev01Library Operating System for Linux #netdev01
Library Operating System for Linux #netdev01
 
Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014
 
Metasepi team meeting #7: Snatch application on tiny OS
Metasepi team meeting #7: Snatch application on tiny OSMetasepi team meeting #7: Snatch application on tiny OS
Metasepi team meeting #7: Snatch application on tiny OS
 
PVS-Studio, a solution for resource intensive applications development
PVS-Studio, a solution for resource intensive applications developmentPVS-Studio, a solution for resource intensive applications development
PVS-Studio, a solution for resource intensive applications development
 
開放運算&GPU技術研究班
開放運算&GPU技術研究班開放運算&GPU技術研究班
開放運算&GPU技術研究班
 
20081114 Friday Food iLabt Bart Joris
20081114 Friday Food iLabt Bart Joris20081114 Friday Food iLabt Bart Joris
20081114 Friday Food iLabt Bart Joris
 
CA-Lec4-RISCV-Instructions-1aaaaaaaaaa.pptx
CA-Lec4-RISCV-Instructions-1aaaaaaaaaa.pptxCA-Lec4-RISCV-Instructions-1aaaaaaaaaa.pptx
CA-Lec4-RISCV-Instructions-1aaaaaaaaaa.pptx
 
1032 cs208 g operation system ip camera case share.v0.2
1032 cs208 g operation system ip camera case share.v0.21032 cs208 g operation system ip camera case share.v0.2
1032 cs208 g operation system ip camera case share.v0.2
 

Recently uploaded

Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 

Recently uploaded (20)

Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 

Vc4c development of opencl compiler for videocore4

  • 1. VC4C: Development of OpenCL Compiler for VideoCore4 RaspberryPiのGPUを使うOSS OpenCL コンパイラ開発の現状と課題 2018/11/10 コンパイラ勉強会@fixstars
  • 3. 本日のトピック - VideoCore IV(以下VC4)の紹介 - Architecture - Memory characteristics - VC4Cの紹介 - ドイツの方のOSSプロジェクト、私じゃないよ - Master論文のためのプロジェクトだったらしい - VC4Cならではの考慮点 - VC4きつい - OpenCLきつい
  • 4. Idein’s technology ・Execute mobilenet v2 1.0 224x224: 8.4 FPS ~= 140ms
  • 5. Why we can archive the performance? ・Performance maximam performance of VideoCore IV ・Hand-assembling parallel GPU code ・Run only on GPU ・No return during execution of the inference ・CPU usage is very low ・Pi Zero ($ 5 computer!!!) archive the performance
  • 6. For the detail….. See our president presentation
  • 7. Why we “try” the OSS compiler? ・We don’t use VC4C in production now ・Tunning assembly is “hard” ・Diesel, TensorComprehension ・In near future, happy to write good performance mathematical kernels in compiler…...
  • 11. QPU / Quad Processing Unit
  • 12. QPU / Quad Processing Unit ・general purpose register A/B x32 (=64 registers) ・accumulator register r[0-3] (= 4 registers)
  • 14. TMU / Texture and Memory Lookup Unit
  • 15. TMU / Texture and Memory Lookup Unit
  • 16. TMU / Texture and Memory Lookup Unit
  • 22. VPM / Vertex Pipe Memory
  • 23. VPM / Vertex Pipe Memory
  • 25. Assembly Example: Hello World mov(r0, uniform) # load from uniform & set it to `r0 setup_vpm_write() # prepare for vpm write mov(vpm, 1) # write 1 row (16 elements) to vpm setup_dma_store(nrows=1)# declaration to output 1 row mov(vpm_st_addr, r0) # start write to the address of `r0 wait_dma_store() # sync dma_store exit() See the repository: py-videocore
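  • 26. Ex: A = A * 2 + 1
      ldi(ra1, 64)                        # byte stride per iteration (16 floats x 4 bytes)
      ldi(rb1, 16)                        # elements processed per iteration
      mov(rb0, uniform)                   # store address
      mov(ra0, uniform)                   # remaining element count
      imul24(r1, element_number, 4)       # per-lane byte offset
      iadd(r1, uniform, r1)               # per-lane load address
      L.loop
      iadd(r1, r1, ra1).mov(tmu0_s, r1)   # request a TMU load and advance the load address
      mutex_acquire()
      setup_vpm_write(nrows=1)
      nop(sig='load tmu0')                # the loaded data arrives in r4
      fmul(r0, r4, 2.0)
      fadd(vpm, r0, 1.0)                  # write A * 2 + 1 to VPM
      setup_dma_store(nrows=1)
      mov(vpm_st_addr, rb0)               # start the DMA store
      wait_dma_store()
      mutex_release()
      isub(ra0, ra0, rb1, set_flags=True)
      jzc(L.loop)                         # loop while the count is non-zero
      iadd(rb0, rb0, ra1); nop(); nop()   # branch delay slots; advance the store address
      exit()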
  • 27. Flow of execution
    ・allocate GPU memory
    ・build uniforms
      ・for each thread
    ・run driver
      with Driver() as drv:
          n_threads = 12
          r = drv.alloc((n_threads, 128), 'float32')
          a = drv.alloc((n_threads, 128), 'float32')
          ………
          code = drv.program(mul2)
          uniforms = drv.alloc((n_threads, 3), 'uint32')
          uniforms[:, 0] = r.address()[:, 0]
          uniforms[:, 1] = 128
          uniforms[:, 2] = a.address()[0][0]
          drv.execute(n_threads=n_threads, program=code, uniforms=uniforms)
  • 28. Performance example: qmkl
    ・mathematical kernels using VC4
    https://github.com/idein/qmkl
      $ sudo ./qmkl/test/sgemm 224 224 224
      GPU: 6.17614e+09 [flop/s]
      CPU: 9.78483e+08 [flop/s]
      NEON: 1.06783e+09 [flop/s]
    caution: no-trans, no-trans
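To put the three throughput figures above side by side, the ratios can be worked out directly. This is a minimal sketch in Python that uses only the numbers printed on the slide. The `2*M*N*K` operation count is the usual SGEMM convention and is an assumption about how qmkl counts flops; the helper names `speedup` and `elapsed_ms` are made up for this example.

```python
# Throughput figures copied from the `sgemm 224 224 224` run on the slide.
flops_per_s = {"GPU": 6.17614e9, "CPU": 9.78483e8, "NEON": 1.06783e9}

# Assumption: one multiply and one add per inner-product term (2*M*N*K).
m = n = k = 224
total_flop = 2 * m * n * k


def speedup(fast, slow):
    """Ratio of the two measured throughputs."""
    return flops_per_s[fast] / flops_per_s[slow]


def elapsed_ms(device):
    """Time for one 224x224x224 SGEMM at the measured throughput."""
    return total_flop / flops_per_s[device] * 1e3


for device in flops_per_s:
    print(f"{device:>4}: {elapsed_ms(device):6.2f} ms per SGEMM")
print(f"GPU vs CPU : {speedup('GPU', 'CPU'):.2f}x")
print(f"GPU vs NEON: {speedup('GPU', 'NEON'):.2f}x")
```

Under these assumptions the GPU comes out at roughly 6.3x the plain CPU figure and 5.8x the NEON figure.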
  • 29. Performance issues
    ・low memory bandwidth:
      ・4.48 GB/s vs. 98 GB/s on my computer...
    ・TMU latency (cycles):
      ・TMU cache hit: 9
      ・L2 cache hit: 12
      ・Memory: 20 (if v3d_freq=250 [MHz])
    ・cache incoherency
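The cycle counts above are easier to compare with other hardware once they are converted to time. A small sketch in Python, assuming the 250 MHz clock quoted on the slide applies to all three latencies; the helper name `cycles_to_ns` is made up for this example.

```python
V3D_FREQ_HZ = 250e6  # v3d_freq=250 [MHz], as stated on the slide

# Load latencies in cycles, copied from the slide.
latency_cycles = {"TMU cache hit": 9, "L2 cache hit": 12, "Memory": 20}


def cycles_to_ns(cycles, freq_hz=V3D_FREQ_HZ):
    """Convert a cycle count to nanoseconds at the given clock."""
    return cycles / freq_hz * 1e9


latency_ns = {name: cycles_to_ns(c) for name, c in latency_cycles.items()}

# Bandwidth gap between VC4 and the presenter's computer (GB/s, from the slide).
bandwidth_ratio = 98 / 4.48

for name, ns in latency_ns.items():
    print(f"{name}: {ns:.0f} ns")
print(f"bandwidth gap: {bandwidth_ratio:.1f}x")
```

At 250 MHz one cycle is 4 ns, so the three latencies are about 36 ns, 48 ns and 80 ns, and the bandwidth gap is a little under 22x.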
  • 34. Recap: OpenCL
    ・Parallel programming framework for heterogeneous computing (GPU, DSP, FPGA, etc.)
    ・Supports the data-parallel computing model
      kernel void mul2(global float * a) {
          int id = get_global_id(0);
          a[id] = a[id] * 2 + 1;
      }
    Host program:
      clCreateContext
      clCreateProgramWithSource    (compile at runtime)
      clCreateBuffer
      clEnqueueWriteBuffer
      global_item_size = { 4, 8, 12 };
      clEnqueueNDRangeKernel       (enqueue kernel)
  • 37. Asm structure
      kernel void mul2(global float * a) {
          int id = get_global_id(0);
          a[id] = a[id] * 2 + 1;
      }
    ・An implicit loop is generated
    ・OpenCL parameters are passed via uniforms
    ・The loop-exit condition is passed via a uniform
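The three points on this slide can be pictured with a small emulation in plain Python. It is a sketch of the idea only, not VC4C's actual code generation: the function name `run_mul2_like_qpu`, the layout of the `uniforms` list and the use of a repeat flag as the loop-exit value are all assumptions made for illustration.

```python
SIMD_WIDTH = 16  # one QPU instruction operates on 16 elements


def mul2_body(a, global_ids):
    """Kernel body: a[id] = a[id] * 2 + 1 for each lane."""
    for gid in global_ids:
        if gid < len(a):  # guard lanes that fall past the end of the buffer
            a[gid] = a[gid] * 2 + 1


def run_mul2_like_qpu(a, uniforms):
    """Consume uniforms in order, the way a QPU reads its uniform stream.

    Hypothetical layout per work-group: (group_id, repeat_flag).
    A repeat_flag of 0 ends the implicit loop.
    """
    stream = iter(uniforms)
    while True:
        group_id = next(stream)     # kernel parameter passed via uniform
        repeat_flag = next(stream)  # loop-exit condition passed via uniform
        base = group_id * SIMD_WIDTH
        mul2_body(a, range(base, base + SIMD_WIDTH))
        if repeat_flag == 0:
            break
    return a


data = [float(i) for i in range(32)]
# Two work-groups of 16 items each; the last one carries repeat_flag = 0.
run_mul2_like_qpu(data, [0, 1, 1, 0])
```

The kernel source contains no loop at all; the loop exists only in the generated code, and both its inputs and its exit condition arrive through the same sequential uniform stream.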
  • 38. VC4C demo Let’s check the output….
  • 39. Current status: works if registers are enough
    Implementation issues
    ・Works fine if register allocation succeeds
    ・Lacks register spilling
    Performance issues
    ・better instruction scheduling
    ・adjust clang loop optimizations for VC4
      ・innermost loop unrolling
    ・improve DMA transfers
    ・auto-vectorization
  • 41. immediate
    ・To load 32-bit constants, ldi is required
    ・Dealing with constants is costly
    ・moveConstantload removes ldi from loops
    ・But it increases register pressure…
    Before:
      ldi(r0, 0)
      ldi(r2, 10)
      L.loop
      ldi(r1, 256)
      iadd(r0, r0, r1)
      isub(r2, r2, 1, set_flags=True)
      bgt(L.loop)
      nop(); nop(); nop()
    After:
      ldi(r0, 0)
      ldi(r2, 10)
      ldi(r1, 256)
      L.loop
      iadd(r0, r0, r1)
      isub(r2, r2, 1, set_flags=True)
      bgt(L.loop)
      nop(); nop(); nop()
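The effect of hoisting constant loads can be shown on a toy instruction list. The sketch below is not VC4C's moveConstantload pass: the tuple-based instruction format and the function `hoist_constant_loads` are hypothetical, and the safety check is deliberately simple (an `ldi` is hoisted only if it is the sole write to its destination in the loop and no earlier instruction in the loop reads that register).

```python
def hoist_constant_loads(preheader, loop_body):
    """Move loop-invariant ldi instructions from loop_body into preheader.

    Instructions are tuples of the form (opcode, dest, *operands).
    Register operands are strings, immediates are numbers.
    """
    # Count how many instructions in the loop write each register.
    writes = {}
    for _, dest, *_ in loop_body:
        writes[dest] = writes.get(dest, 0) + 1

    hoisted, remaining = [], []
    read_so_far = set()
    for insn in loop_body:
        op, dest, *operands = insn
        if op == "ldi" and writes[dest] == 1 and dest not in read_so_far:
            hoisted.append(insn)  # same constant on every iteration
        else:
            remaining.append(insn)
        read_so_far.update(x for x in operands if isinstance(x, str))
    return preheader + hoisted, remaining


preheader = [("ldi", "r0", 0), ("ldi", "r2", 10)]
loop_body = [
    ("ldi", "r1", 256),
    ("iadd", "r0", "r0", "r1"),
    ("isub", "r2", "r2", 1),
]
new_preheader, new_body = hoist_constant_loads(preheader, loop_body)
```

After the transformation `r1` holds 256 for the whole lifetime of the loop, which is the extra register pressure the slide warns about.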
  • 42. small immediate
    ・In place of the rb regfile field, a limited set of immediates can be encoded
    ・-16 to 15, 1.0, 2.0, 4.0, …
    ・by combining them, some constants can be produced by an ALU instruction instead of ldi
    Before:
      ldi(r0, 0)
      ldi(r2, 10)
      L.loop
      ldi(r1, 256)
      iadd(r0, r0, r1)
      isub(r2, r2, 1, set_flags=True)
      bgt(L.loop)
      nop(); nop(); nop()
    After:
      mov(r0, 0)
      mov(r2, 10)
      imul24(r1, -16, -16)
      L.loop
      iadd(r0, r0, r1)
      isub(r2, r2, 1, set_flags=True)
      bgt(L.loop)
      nop(); nop(); nop()
  • 43. Fusion of writing VPM (WIP)
    Before:
      mov(r0, 0)
      L.loop
      setup_vpm_write(nrows=1)
      mutex_acquire()
      fadd(r1, uniform, 1.0)
      iadd(r0, r0, 1).mov(vpm, r1)
      mov(vpm_st_addr, rb0)
      wait_dma_store()
      mutex_release()
      isub(None, r0, 3, set_flags=True)
      bne(L.loop)
      nop(); nop(); nop()
    After (full unrolling):
      setup_vpm_write(nrows=3)
      fadd(ra0, uniform, 1.0)
      fadd(ra1, uniform, 1.0)
      fadd(ra2, uniform, 1.0)
      mutex_acquire()
      mov(vpm, ra0)
      mov(vpm, ra1)
      mov(vpm, ra2)
      mov(vpm_st_addr, rb0)
      wait_dma_store()
      mutex_release()
  • 44. Hardware limitations
    ・Cache incoherency is a huge problem
    ・Register spilling
      ・also problematic on other GPUs
    ・Effective TMU loads
      ・If the same region is both read and written, the result is wrong
      ・Using DMA instead discards parallelism entirely
  • 45. Insufficient use of DMA
      kernel void mul2(global float * a) {
          int id = get_global_id(0);
          a[id] = a[id] * 2 + 1;
      }
    ・region a is both read and written
    ・each element of a is read just once
    ・Actually, loading via the TMU is safe here
    ・does this require complex analysis…???
  • 46. Complex iteration via OpenCL IDs
    ・implicit loops (by IDs) are hard to convert to natural loops
    ・global_id + worker_id + local_id ……
    ・we want to remove such parameters through offline compilation
  • 47. Fusion of kernels (WIP)
    ・Fusion of some kernels (GEMM + ReLU + bias, etc.)
    ・To reduce memory transfers
    ・Diesel (an NVIDIA compiler project) reported the impact
    from Diesel: DSL for linear algebra and neural net computations on GPUs
  • 48. Software pipelining (WIP)?
    ・Probably not effective…
    ・Due to the instruction cache limitation
    ・We re-rolled some kernels…
  • 49. Conclusion?
    ・Introduced VC4
      ・Dual-issue in-order processor
      ・You can write its assembly freely
    ・Introduced VC4C
      ・heavily under development
      ・compiler lovers, here is an immature compiler!!!!
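  • 50. References
    ・VideoCore® IV 3D Architecture Reference Guide
    ・Raspberry PiのGPUで行列乗算(その1) [Matrix multiplication on the Raspberry Pi GPU (Part 1)]
    ・Raspberry PiのGPUで行列乗算(その2) [Matrix multiplication on the Raspberry Pi GPU (Part 2)]
    ・Hacking the Raspberry Pi's VideoCore IV GPU
    ・GPU_FFT
    ・blog@ysugi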