OpenCL for RISC-V
Shao-Chung Wang
RISC-V Summit, Dec. 8, 2020
Agenda
1. OpenCL Introduction
2. OpenCL Extension for RVV Cores
3. OpenCL Framework for RISC-V
4. Status
Open Computing Language
• A popular framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, DSPs, FPGAs, or hardware accelerators, with a host and multiple devices
• Examples of host → device pairs
  – x86 → multiple Andes NX27V (vector processors)
  – Andes AX45MP → multiple Andes NX27V
  – Andes AX45MP → multiple HW accelerators
Open Computing Language
• OpenCL Runtime
  – Platform Layer API
    • Query, select, and initialize compute devices
  – Runtime API
    • Build and dispatch kernel programs
    • Resource management
• OpenCL kernel language
  – OpenCL C – a subset of C99 with language extensions
  – A set of built-in functions
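As a concrete illustration of the platform layer, a minimal host fragment that queries, selects, and initializes a device might look as follows (a sketch only, with error handling omitted; the accelerator device type here is an assumption):

#include <CL/cl.h>

cl_context create_context(void)
{
    cl_platform_id platform;
    cl_device_id device;

    /* Platform Layer API: query and select a compute device */
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, NULL);

    /* Initialize a context holding the selected device */
    return clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
}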
OpenCL Framework Overview
• OpenCL programs include host and kernel code fragments
  – The host program runs on the CPU
  – Kernels can run on both the CPU and hardware accelerators
[Diagram: an application consists of a host program plus OpenCL kernels; the OpenCL runtime (Platform API, Runtime API) and the OpenCL compiler (frontend, backend) map the host program onto the CPU and the kernels onto the CPU and hardware accelerators.]
Example: Vector Addition
• A simple C program uses a loop to add two vectors
• The “hello world” program for demonstrating data-parallel programming

void vadd(float *a, float *b, float *c, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
    return;
}

int main()
{
    /* float arrays to match vadd's parameters */
    float a[100], b[100], c[100];
    vadd(a, b, c, 100);
    return 0;
}
Expressing Data Parallelism in OpenCL
• For vector addition, the following defines the problem domain
  – Two arrays with 100 elements are processed
  – 1 kernel instance executes the addition for one array element
  – 100 kernel instances are executed in total
• Work item – the smallest parallel execution unit
  – One work item executes the kernel at each point of the problem domain
• Work group – a set of work items
  – Work items in the same group can be synchronized (see the sketch below)
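To make work-group synchronization concrete, the following minimal kernel (an illustration, not from the original deck) stages data in __local memory and uses a barrier so work items can safely read each other's elements:

__kernel void reverse_in_group(__global const float *in,
                               __global float *out,
                               __local float *tmp)
{
    int lid = get_local_id(0);    /* id within the work group   */
    int gid = get_global_id(0);   /* id within the whole domain */
    int lsz = get_local_size(0);  /* work items per group       */

    tmp[lid] = in[gid];
    barrier(CLK_LOCAL_MEM_FENCE); /* synchronize the work group */

    /* After the barrier, another work item's element can be read */
    out[gid] = tmp[lsz - 1 - lid];
}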
Vector Addition (Kernel)

__kernel void vadd(__global const float *a,
                   __global const float *b,
                   __global float *c)
{
    int gid = get_global_id(0);
    c[gid] = a[gid] + b[gid];
}

• __kernel – function qualifier identifying the function as a kernel
• __global – address space qualifier (__private, __local, __global, or __constant) annotating where data is located
• get_global_id – built-in function returning the unique id of the work item
Vector Addition (Host Program)
1. Set up the platform and queues
2. Allocate memory buffers
3. Build the kernel (or load a binary)
4. Set up the kernel arguments
5. Execute the kernel
6. Read the result back from the device

int main() {
    ……
    /* 1. Set up the platform and queues */
    clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &cb);
    clGetContextInfo(context, CL_CONTEXT_DEVICES, cb, devices, NULL);
    cmd_queue = clCreateCommandQueue(context, devices[0], 0, NULL);
    ……
    /* 2. Allocate memory buffers */
    memobjs[0] = clCreateBuffer(context, CL_MEM_COPY_HOST_PTR, …);
    ……
    /* 3. Build the kernel (or load a binary) */
    program = clCreateProgramWithSource(context, 1, &program_source, …);
    clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
    kernel = clCreateKernel(program, "vadd", NULL);
    /* 4. Set up the kernel arguments */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memobjs[0]);
    ……
    /* 5. Execute the kernel */
    global_work_size[0] = n;
    clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL, global_work_size, …);
    /* 6. Read the result back from the device */
    clEnqueueReadBuffer(cmd_queue, memobjs[2], CL_TRUE, 0, …);
    ……
}
An Example OpenCL Platform
• Host: x86
• Devices: 32 NX27V cores with RVV support
• Each core runs one or more work items at a time
• The host runtime dispatches work groups to multiple cores for parallel execution
[Diagram: the x86 host, with its cache and host memory, is connected to a device made of NX27V cores, each with a local memory, sharing a common device memory.]
OpenCL C Extension for RVV
• Support RVV intrinsics and new built-in functions

OpenCL kernel with RVV intrinsics:

__kernel
void vadd_rvv_cl(__global float *a,
                 __global float *b,
                 __global float *c,
                 int n)
{
    // get_work_id returns the index of the first element
    // to be executed by a work item
    int wi = get_work_id(sizeof(float), n, 0);
    vfloat32m1_t vb = vle32_v_f32m1(&b[wi]);
    vfloat32m1_t vc = vle32_v_f32m1(&c[wi]);
    vfloat32m1_t va = vfadd_vv_f32m1(vb, vc);
    vse32_v_f32m1(&a[wi], va);
}

C with RVV intrinsics, for comparison:

void vadd_rvv(float *a, float *b, float *c,
              int n)
{
    int tn = n;
    while (tn > 0) {
        size_t vl = vsetvl_e32m1(tn); // set the vector length
        vfloat32m1_t vb = vle32_v_f32m1(b);
        vfloat32m1_t vc = vle32_v_f32m1(c);
        vfloat32m1_t va = vfadd_vv_f32m1(vb, vc);
        vse32_v_f32m1(a, va);
        a += vl; b += vl;
        c += vl; tn -= vl;
    }
}

Note that the OpenCL kernel needs no strip-mining loop: the runtime partitions the problem domain across work items, and get_work_id gives each work item its starting element.
OpenCL Runtime Support
• The runtime is composed of a host layer and a device layer (see the sketch below)
  – The host layer is portable across different targets
  – The device layer is designed to be ported to different platforms
• Device query scheme for the OpenCL platform layer
• Kernel launching scheme for the OpenCL runtime
[Diagram: the host layer on the x86 host reaches the device layer through the device query and kernel launching schemes (over GDB); the device layer drives the AndeSim devices, NX27V cores with local memories and a shared device memory.]
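As a rough sketch of how this split could look in code, the portable host layer might call into the device layer through a small operations table; all names below are hypothetical illustrations, not Andes' actual interface:

/* Hypothetical device-layer interface (illustrative only) */
#include <stddef.h>

struct device_info; /* core count, memory sizes, ... */

typedef struct cl_device_layer {
    /* Device query scheme used by the OpenCL platform layer */
    int (*query_device)(unsigned index, struct device_info *info);

    /* Kernel launching scheme used by the OpenCL runtime:
       run a compiled kernel binary over a range of work groups */
    int (*launch_kernel)(unsigned index, const void *binary,
                         const void *args, size_t num_groups);
} cl_device_layer;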
OpenCL Compilation Flow
• OpenCL Clang translates the OpenCL kernel into SPIR
  – SPIR is an intermediate language for parallel computation defined by Khronos
  – SPIR is based on LLVM IR
• Translate SPIR to LLVM IR
  – The IR must be compatible with the RISC-V ABI
[Flow: kernel function (.cl) → OpenCL C frontend (Clang) → SPIR → work-item grouping → LLVM IR → SPIR-to-LLVM-IR translation (RISC-V ABI) → LLVM IR → RISC-V codegen → target binary]
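To make the work-item grouping step concrete, a conceptual sketch of what grouping could produce for the vadd kernel is shown below; this illustrates the general technique (wrapping the kernel body in a loop over one work group), not Andes' actual compiler output:

/* Conceptual result of work-item grouping (illustrative only) */
#include <stddef.h>

void vadd_group(const float *a, const float *b, float *c,
                size_t group_id, size_t local_size)
{
    /* One call executes a whole work group sequentially */
    for (size_t lid = 0; lid < local_size; lid++) {
        size_t gid = group_id * local_size + lid; /* get_global_id(0) */
        c[gid] = a[gid] + b[gid];
    }
}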
Status
• Platforms
  – QEMU (host and device are both Andes RISC-V cores)
  – x86 + AndeSim (NX27V)
  – Target: RV64GCV
• OpenCL Conformance Tests (CTS)
  – QEMU: most cases pass; remaining issues to be clarified upstream
  – x86 + AndeSim: ongoing
• RVV intrinsic examples for optimization targets
• Next: optimizations on RVV compilation and the host framework