OpenCL for RISC-V
Shao-Chung Wang
RISC-V Summit, Dec. 8, 2020
Agenda
1. OpenCL Introduction
2. OpenCL Extension for RVV Cores
3. OpenCL Framework for RISC-V
4. Status
Open Computing Language
• A popular framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, DSPs, FPGAs, or hardware accelerators, with a host and multiple devices
• Examples of host → device pairs
  – x86 → multiple Andes NX27V (vector processors)
  – Andes AX45MP → multiple Andes NX27V
  – Andes AX45MP → multiple HW accelerators
Open Computing Language
• OpenCL Runtime
  – Platform Layer API
    • Query, select, and initialize compute devices
  – Runtime API
    • Build and dispatch kernel programs
    • Resource management
• OpenCL kernel language
  – OpenCL C – a subset of C99 with language extensions
  – A set of built-in functions
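As a concrete illustration of the platform layer, a minimal host fragment that queries, selects, and initializes a device might look as follows (a sketch only, with error handling omitted; the accelerator device type here is an assumption):

#include <CL/cl.h>

cl_context create_context(void)
{
    cl_platform_id platform;
    cl_device_id device;

    /* Platform Layer API: query and select a compute device */
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, NULL);

    /* Initialize a context holding the selected device */
    return clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
}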
OpenCL Framework Overview
• OpenCL programs include host and kernel code fragments
  – The host program runs on the CPU
  – Kernels can run on both the CPU and hardware accelerators
[Diagram: an application consists of a host program plus OpenCL kernels; the OpenCL runtime (Platform API, Runtime API) and the OpenCL compiler (frontend, backend) map the host program onto the CPU and the kernels onto the CPU and hardware accelerators.]
Example: Vector Addition
• A simple C program uses a loop to add two vectors
• The “hello world” program for demonstrating data-parallel programming

void vadd(float *a, float *b, float *c, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
    return;
}

int main()
{
    /* float arrays to match vadd's parameters */
    float a[100], b[100], c[100];
    vadd(a, b, c, 100);
    return 0;
}
Expressing Data Parallelism in OpenCL
• For vector addition, the following defines the problem domain
  – Two arrays with 100 elements are processed
  – 1 kernel instance executes the addition for one array element
  – 100 kernel instances are executed in total
• Work item – the smallest parallel execution unit
  – One work item executes the kernel at each point of the problem domain
• Work group – a set of work items
  – Work items in the same group can be synchronized (see the sketch below)
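To make work-group synchronization concrete, the following minimal kernel (an illustration, not from the original deck) stages data in __local memory and uses a barrier so work items can safely read each other's elements:

__kernel void reverse_in_group(__global const float *in,
                               __global float *out,
                               __local float *tmp)
{
    int lid = get_local_id(0);    /* id within the work group   */
    int gid = get_global_id(0);   /* id within the whole domain */
    int lsz = get_local_size(0);  /* work items per group       */

    tmp[lid] = in[gid];
    barrier(CLK_LOCAL_MEM_FENCE); /* synchronize the work group */

    /* After the barrier, another work item's element can be read */
    out[gid] = tmp[lsz - 1 - lid];
}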
Vector Addition (Kernel)

__kernel void vadd(__global const float *a,
                   __global const float *b,
                   __global float *c)
{
    int gid = get_global_id(0);
    c[gid] = a[gid] + b[gid];
}

• __kernel – function qualifier identifying the function as a kernel
• __global – address space qualifier (__private, __local, __global, or __constant) annotating where data is located
• get_global_id – built-in function returning the unique id of the work item
Vector Addition (Host Program)
1. Set up the platform and queues
2. Allocate memory buffers
3. Build the kernel (or load a binary)
4. Set up the kernel arguments
5. Execute the kernel
6. Read the result back from the device

int main() {
    ……
    /* 1. Set up the platform and queues */
    clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &cb);
    clGetContextInfo(context, CL_CONTEXT_DEVICES, cb, devices, NULL);
    cmd_queue = clCreateCommandQueue(context, devices[0], 0, NULL);
    ……
    /* 2. Allocate memory buffers */
    memobjs[0] = clCreateBuffer(context, CL_MEM_COPY_HOST_PTR, …);
    ……
    /* 3. Build the kernel (or load a binary) */
    program = clCreateProgramWithSource(context, 1, &program_source, …);
    clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
    kernel = clCreateKernel(program, "vadd", NULL);
    /* 4. Set up the kernel arguments */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memobjs[0]);
    ……
    /* 5. Execute the kernel */
    global_work_size[0] = n;
    clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL, global_work_size, …);
    /* 6. Read the result back from the device */
    clEnqueueReadBuffer(cmd_queue, memobjs[2], CL_TRUE, 0, …);
    ……
}
An Example OpenCL Platform
• Host: x86
• Devices: 32 NX27V cores with RVV support
• Each core runs one or more work items at a time
• The host runtime dispatches work groups to multiple cores for parallel execution
[Diagram: the x86 host, with its cache and host memory, is connected to a device made of NX27V cores, each with a local memory, sharing a common device memory.]
OpenCL C Extension for RVV
• Support RVV intrinsics and new built-in functions

OpenCL kernel with RVV intrinsics:

__kernel
void vadd_rvv_cl(__global float *a,
                 __global float *b,
                 __global float *c,
                 int n)
{
    // get_work_id returns the index of the first element
    // to be executed by a work item
    int wi = get_work_id(sizeof(float), n, 0);
    vfloat32m1_t vb = vle32_v_f32m1(&b[wi]);
    vfloat32m1_t vc = vle32_v_f32m1(&c[wi]);
    vfloat32m1_t va = vfadd_vv_f32m1(vb, vc);
    vse32_v_f32m1(&a[wi], va);
}

C with RVV intrinsics, for comparison:

void vadd_rvv(float *a, float *b, float *c,
              int n)
{
    int tn = n;
    while (tn > 0) {
        size_t vl = vsetvl_e32m1(tn); // set the vector length
        vfloat32m1_t vb = vle32_v_f32m1(b);
        vfloat32m1_t vc = vle32_v_f32m1(c);
        vfloat32m1_t va = vfadd_vv_f32m1(vb, vc);
        vse32_v_f32m1(a, va);
        a += vl; b += vl;
        c += vl; tn -= vl;
    }
}

Note that the OpenCL kernel needs no strip-mining loop: the runtime partitions the problem domain across work items, and get_work_id gives each work item its starting element.
OpenCL Runtime Support
• The runtime is composed of a host layer and a device layer (see the sketch below)
  – The host layer is portable across different targets
  – The device layer is designed to be ported to different platforms
• Device query scheme for the OpenCL platform layer
• Kernel launching scheme for the OpenCL runtime
[Diagram: the host layer on the x86 host reaches the device layer through the device query and kernel launching schemes (over GDB); the device layer drives the AndeSim devices, NX27V cores with local memories and a shared device memory.]
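As a rough sketch of how this split could look in code, the portable host layer might call into the device layer through a small operations table; all names below are hypothetical illustrations, not Andes' actual interface:

/* Hypothetical device-layer interface (illustrative only) */
#include <stddef.h>

struct device_info; /* core count, memory sizes, ... */

typedef struct cl_device_layer {
    /* Device query scheme used by the OpenCL platform layer */
    int (*query_device)(unsigned index, struct device_info *info);

    /* Kernel launching scheme used by the OpenCL runtime:
       run a compiled kernel binary over a range of work groups */
    int (*launch_kernel)(unsigned index, const void *binary,
                         const void *args, size_t num_groups);
} cl_device_layer;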
OpenCL Compilation Flow
• OpenCL Clang translates the OpenCL kernel into SPIR
  – SPIR is an intermediate language for parallel computation defined by Khronos
  – SPIR is based on LLVM IR
• Translate SPIR to LLVM IR
  – The IR must be compatible with the RISC-V ABI
[Flow: kernel function (.cl) → OpenCL C frontend (Clang) → SPIR → work-item grouping → LLVM IR → SPIR-to-LLVM-IR translation (RISC-V ABI) → LLVM IR → RISC-V codegen → target binary]
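To make the work-item grouping step concrete, a conceptual sketch of what grouping could produce for the vadd kernel is shown below; this illustrates the general technique (wrapping the kernel body in a loop over one work group), not Andes' actual compiler output:

/* Conceptual result of work-item grouping (illustrative only) */
#include <stddef.h>

void vadd_group(const float *a, const float *b, float *c,
                size_t group_id, size_t local_size)
{
    /* One call executes a whole work group sequentially */
    for (size_t lid = 0; lid < local_size; lid++) {
        size_t gid = group_id * local_size + lid; /* get_global_id(0) */
        c[gid] = a[gid] + b[gid];
    }
}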
Status
• Platforms
  – QEMU (host and device are both Andes RISC-V cores)
  – x86 + AndeSim (NX27V)
  – Target: RV64GCV
• OpenCL Conformance Tests (CTS)
  – QEMU: most cases pass; remaining issues to be clarified upstream
  – x86 + AndeSim: ongoing
• RVV intrinsic examples for optimization targets
• Next: optimizations on RVV compilation and the host framework