OpenCL for RISC-V
Shao-Chung Wang
RISC-V Summit, Dec. 8, 2020
Agenda
1. OpenCL Introduction
2. OpenCL Extension for RVV Cores
3. OpenCL Framework for RISC-V
4. Status
Taking RISC-V® Mainstream 3
Open Computing Language
• A popular framework for writing programs that execute across
heterogeneous platforms consisting of CPUs, GPUs, DSPs,
FPGAs or hardware accelerators with a host and multiple
devices
• Examples of host + device pairs
– x86 + multiple Andes NX27V (vector processors)
– Andes AX45MP + multiple Andes NX27V
– Andes AX45MP + multiple HW accelerators
Open Computing Language
• OpenCL Runtime
– Platform Layer API
• Query, select, and initialize compute devices
– Runtime API
• Build and dispatch kernel programs
• Resource management
• OpenCL kernel language
– OpenCL C - subset of C99 but with language extensions
– A set of built-in functions
OpenCL Framework Overview
• OpenCL programs include host and kernel code fragments
– The host program runs on the CPU
– Kernels can run on both the CPU and hardware accelerators
[Diagram: an Application (host program plus OpenCL kernels) sits on the OpenCL Runtime (Platform API, Runtime API) and the OpenCL Compiler (frontend, backend), which target the CPU and hardware accelerators]
Example: Vector Addition
• A simple C program uses a loop to add two vectors.
• It is the "hello world" program for demonstrating data-parallel programming.

void vadd(float *a, const float *b, const float *c, int n)
{
    /* Element-wise sum: a[i] = b[i] + c[i] */
    for (int i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}

int main(void)
{
    /* float arrays to match vadd's parameters; inputs are zero-initialized */
    float a[100], b[100] = {0}, c[100] = {0};
    vadd(a, b, c, 100);
    return 0;
}
Expressing Data Parallelism in OpenCL
• Define a problem domain to execute the kernel
– Work Item: the smallest parallel execution unit
– Work Group: a set of work items
• The work items in the same group can be synchronized
• For vector addition, the following defines the problem domain
– To process two arrays with 100 elements
– 1 kernel instance executes the addition for one array element
– 100 total kernel instances are executed
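The relation between the problem domain, work groups, and work items can be sketched in plain C. This is only an illustration that runs serially on the host. The helper names `global_id_of` and `walk_ndrange` are hypothetical, and the sketch assumes no global offset and a work-group size that divides the problem size evenly. It mirrors the standard OpenCL relation global id = group id × local size + local id.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: global id of a work item (assumes no global offset). */
static size_t global_id_of(size_t group_id, size_t local_size, size_t local_id)
{
    assert(local_id < local_size);
    return group_id * local_size + local_id;
}

/*
 * Walks an NDRange of global_size work items split into groups of local_size,
 * incrementing visits[gid] once per work item. Returns the number of groups.
 * Assumes local_size divides global_size evenly.
 */
static size_t walk_ndrange(size_t global_size, size_t local_size, int *visits)
{
    assert(local_size > 0 && global_size % local_size == 0);
    size_t num_groups = global_size / local_size;
    for (size_t g = 0; g < num_groups; g++) {
        for (size_t l = 0; l < local_size; l++) {
            visits[global_id_of(g, local_size, l)]++;
        }
    }
    return num_groups;
}
```

With the 100-element problem from the slide and an assumed work-group size of 25 (the slides do not give one), `walk_ndrange(100, 25, visits)` on a zeroed array reports 4 work groups and touches each element exactly once.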
Vector Addition (Kernel)
__kernel void vadd(__global const float *a,
                   __global const float *b,
                   __global float *c)
{
    int gid = get_global_id(0);
    c[gid] = a[gid] + b[gid];
}

• __kernel: function qualifier that identifies the function as a kernel
• __global: address space qualifier (__private, __local, __global, or __constant) that annotates where the data is located
• get_global_id(): built-in function that returns the unique id of the work item
Vector Addition (Host Program)
int main() {
    ……
    /* Set up the platforms and queues */
    clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &cb);
    clGetContextInfo(context, CL_CONTEXT_DEVICES, cb, devices, NULL);
    cmd_queue = clCreateCommandQueue(context, devices[0], 0, NULL);
    ……
    /* Allocate memory buffers */
    memobjs[0] = clCreateBuffer(context, CL_MEM_COPY_HOST_PTR, …);
    ……
    /* Build the kernel (or load a binary) */
    program = clCreateProgramWithSource(context, 1, &program_source, …);
    clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
    /* Set up the kernel (the argument size precedes the argument value) */
    kernel = clCreateKernel(program, "vadd", NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *) &memobjs[0]);
    ……
    /* Execute the kernel */
    global_work_size[0] = n;
    clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL, global_work_size, …);
    /* Read the result from the device (blocking read on the command queue) */
    clEnqueueReadBuffer(cmd_queue, memobjs[2], CL_TRUE, 0, …);
    ……
}
An Example OpenCL Platform
• Host: x86
• Devices: 32 NX27V cores with RVV support
• Each core runs one or more work items at one time
• Host runtime dispatches work groups to multiple cores for parallel
execution
[Diagram: the x86 host, with cache and host memory, is connected to a device that contains multiple NX27V cores, each with its own local memory, plus device memory]
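The slides state that the host runtime dispatches work groups to multiple cores but do not say which scheduling policy is used. As a rough sketch only, the plain C below assumes a simple round-robin policy over the 32 cores of the example platform. The names `core_for_group` and `dispatch_groups` are hypothetical and do not come from the Andes runtime.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_CORES 32 /* number of NX27V cores in the example platform */

/* Assumed policy: work group g runs on core g % NUM_CORES (round-robin). */
static size_t core_for_group(size_t group_id)
{
    return group_id % NUM_CORES;
}

/*
 * Fills groups_per_core[c] with the number of work groups assigned to core c.
 * groups_per_core must have NUM_CORES entries.
 */
static void dispatch_groups(size_t num_groups, size_t *groups_per_core)
{
    assert(groups_per_core != NULL);
    for (size_t c = 0; c < NUM_CORES; c++) {
        groups_per_core[c] = 0;
    }
    for (size_t g = 0; g < num_groups; g++) {
        groups_per_core[core_for_group(g)]++;
    }
}
```

Under this assumed policy, 100 work groups leave cores 0 to 3 with four groups each and the remaining 28 cores with three each, which shows why the number of work groups relative to the core count matters for load balance.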
OpenCL C Extension for RVV
• Support RVV intrinsic and new built-in functions
C with RVV Intrinsic:

#include <riscv_vector.h>

void vadd_rvv(float *a, float *b, float *c, int n)
{
    int tn = n;
    while (tn > 0) {
        // vl = number of elements handled in this iteration
        size_t vl = vsetvl_e32m1(tn);
        vfloat32m1_t vb = vle32_v_f32m1(b);
        vfloat32m1_t vc = vle32_v_f32m1(c);
        vfloat32m1_t va = vfadd_vv_f32m1(vb, vc);
        vse32_v_f32m1(a, va);
        a += vl; b += vl;
        c += vl; tn -= vl;
    }
}

OpenCL Kernel with RVV Intrinsic:

__kernel
void vadd_rvv_cl(__global float *a,
                 __global float *b,
                 __global float *c,
                 int n)
{
    // return the index of the first element
    // to be executed by a work item
    int wi = get_work_id(sizeof(float), n, 0);
    vfloat32m1_t vb = vle32_v_f32m1(&b[wi]);
    vfloat32m1_t vc = vle32_v_f32m1(&c[wi]);
    vfloat32m1_t va = vfadd_vv_f32m1(vb, vc);
    vse32_v_f32m1(&a[wi], va);
}
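The strip-mined loop in the plain C version can be followed without vector hardware by emulating it in scalar C. In this sketch `VLMAX` is an assumed value of 8 elements per vector register (the real value depends on the hardware vector length), and `emulated_vsetvl` is a hypothetical stand-in that uses a simple min(remaining, VLMAX) rule; real hardware is allowed to pick a different length in some cases.

```c
#include <assert.h>
#include <stddef.h>

#define VLMAX 8 /* assumed elements per vector register; hardware-dependent */

/* Stand-in for vsetvl_e32m1: elements processed in this iteration. */
static size_t emulated_vsetvl(size_t remaining)
{
    return remaining < VLMAX ? remaining : VLMAX;
}

/*
 * Scalar emulation of the strip-mined loop: a[i] = b[i] + c[i] for n elements.
 * Returns the number of loop iterations (strips) executed.
 */
static size_t vadd_stripmined(float *a, const float *b, const float *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    size_t strips = 0;
    while (n > 0) {
        size_t vl = emulated_vsetvl(n);
        for (size_t i = 0; i < vl; i++) { /* one vector add covers vl lanes */
            a[i] = b[i] + c[i];
        }
        a += vl; b += vl; c += vl; /* advance all three pointers by vl */
        n -= vl;
        strips++;
    }
    return strips;
}
```

For 100 elements and the assumed `VLMAX` of 8, the loop runs 13 times: 12 full strips of 8 elements and a final strip of 4, with no separate tail-handling code.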
OpenCL Runtime Support
• Runtime is composed of a host layer and a device layer
– Host layer is portable to different targets
– Device layer is designed to be ported to different platforms
• Device query scheme for OpenCL platform layer
• Kernel launching scheme for OpenCL runtime
[Diagram: Host (x86) with Host Layer and Device Layer; Device Query and Kernel Launching (labeled GDB) link to the Devices (AndeSim): NX27V cores, each with local memory, and device memory]
OpenCL Compilation Flow
• OpenCL Clang translates the OpenCL kernel into SPIR
– SPIR is an intermediate language for parallel computation defined by
Khronos
– SPIR is based on LLVM IR
• Translate SPIR to LLVM IR
– IR must be compatible with the RISC-V ABI
[Diagram, pipeline stages: Kernel Function (.cl); OpenCL C Frontend (Clang); SPIR; Work Item Grouping; SPIR to LLVM IR (RISC-V ABI); RISC-V Codegen; Target Binary. LLVM IR is the form passed between the later stages.]
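The slides do not describe what the "Work Item Grouping" stage does. One technique commonly used when a whole work group runs on a single CPU-like core is to wrap the kernel body in a loop over the local ids, so that one function call executes every work item of the group. The plain C sketch below illustrates that idea for the vadd kernel. It is an assumption for illustration, not a description of the Andes compiler, and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Body of the vadd kernel for one work item, with the global id passed in. */
static void vadd_work_item(const float *a, const float *b, float *c, size_t gid)
{
    c[gid] = a[gid] + b[gid];
}

/*
 * Hypothetical "grouped" form: one call runs every work item of one work
 * group by looping over the local ids (assumes no global offset).
 */
static void vadd_work_group(const float *a, const float *b, float *c,
                            size_t group_id, size_t local_size)
{
    assert(local_size > 0);
    for (size_t lid = 0; lid < local_size; lid++) {
        vadd_work_item(a, b, c, group_id * local_size + lid);
    }
}
```

Calling `vadd_work_group(a, b, c, 2, 25)` processes elements 50 through 74 only, so a runtime could hand each such call to a different core.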
Status
• Platforms
– QEMU (host and device are both Andes RISC-V cores)
– x86 + AndeSim (NX27V)
– Target: RV64GVC
• OpenCL Conformance Test Suite (CTS)
– QEMU: most cases pass; some issues remain to be clarified with upstream
– x86 + AndeSim: ongoing
• RVV intrinsic examples for optimization targets
• Next: optimizations on RVV compilation and host framework
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 

Andes open cl for RISC-V

  • 1. OpenCL for RISC-V. Shao-Chung Wang, RISC-V Summit, Dec. 8, 2020
  • 2. Agenda
       1. OpenCL Introduction
       2. OpenCL Extension for RVV Cores
       3. OpenCL Framework for RISC-V
       4. Status
  • 3. Open Computing Language
       • A popular framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, DSPs, FPGAs, or hardware accelerators, with a host and multiple devices
       • Examples of host + device pairs
         – x86 + multiple Andes NX27V (vector processors)
         – Andes AX45MP + multiple Andes NX27V
         – Andes AX45MP + multiple HW accelerators
  • 4. Open Computing Language
       • OpenCL Runtime
         – Platform Layer API
           • Query, select, and initialize compute devices
         – Runtime API
           • Build and dispatch kernel programs
           • Resource management
       • OpenCL kernel language
         – OpenCL C: a subset of C99 with language extensions
         – A set of built-in functions
  • 5. OpenCL Framework Overview
       • OpenCL programs consist of host and kernel code fragments
         – The host program runs on the CPU
         – Kernels can run on both the CPU and hardware accelerators
       [Figure: the application (host program and OpenCL kernels) sits on top of the OpenCL runtime (platform API, runtime API) and the OpenCL compiler (frontend, backend), which target the CPU and the hardware accelerators.]
  • 6. Example: Vector Addition
       • A simple C program uses a loop to add two vectors.
       • The “hello world” program for demonstrating data-parallel programming

         void vadd(float *a, float *b, float *c, int n)
         {
             int i;
             for (i = 0; i < n; i++) {
                 a[i] = b[i] + c[i];
             }
             return;
         }

         int main()
         {
             /* float arrays, to match the parameter types of vadd() */
             float a[100], b[100] = {0}, c[100] = {0};
             vadd(a, b, c, 100);
             return 0;
         }
  • 7. Expressing Data Parallelism in OpenCL
       • Work item: the smallest parallel execution unit
         – Define a problem domain to execute the kernel
       • Work group: a set of work items
         – Work items in the same group can be synchronized
       • For vector addition, the problem domain is defined as follows
         – Two arrays of 100 elements are processed
           • 1 kernel instance executes the addition for one array element
           • 100 kernel instances are executed in total
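The relationship between the problem domain, work groups, and work items can be sketched in plain C. This is only an illustration that runs on a single CPU thread. The work-group size of 25 and the names `GLOBAL_SIZE`, `LOCAL_SIZE`, `global_id`, and `count_visits` are assumptions made for the example and do not come from the slides.

```c
#include <assert.h>
#include <stddef.h>

#define GLOBAL_SIZE 100  /* problem domain: one work item per array element */
#define LOCAL_SIZE   25  /* assumed work-group size; divides GLOBAL_SIZE */

/* Global id of a work item, from its group id and its id inside the group. */
size_t global_id(size_t group_id, size_t local_id)
{
    return group_id * LOCAL_SIZE + local_id;
}

/* Walk every work group and every work item once. visits[i] counts how
 * often element i was handed to a work item. Returns the group count. */
size_t count_visits(int visits[GLOBAL_SIZE])
{
    assert(GLOBAL_SIZE % LOCAL_SIZE == 0);
    size_t num_groups = GLOBAL_SIZE / LOCAL_SIZE;

    for (size_t i = 0; i < GLOBAL_SIZE; i++)
        visits[i] = 0;

    for (size_t g = 0; g < num_groups; g++)
        for (size_t l = 0; l < LOCAL_SIZE; l++)
            visits[global_id(g, l)]++;

    return num_groups;
}
```

With these sizes the domain splits into four work groups, and every one of the 100 elements is handled by exactly one work item.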
  • 8. Vector Addition (Kernel)

         __kernel void vadd(__global const float *a,
                            __global const float *b,
                            __global float *c)
         {
             int gid = get_global_id(0);
             c[gid] = a[gid] + b[gid];
         }

       • __kernel: function qualifier that identifies the function as a kernel
       • __global: address space qualifier (__private, __local, __global, or __constant) that annotates where the data is located
       • get_global_id(0): built-in function that returns the unique ID of the work item
  • 9. Vector Addition (Host Program)

         int main()
         {
             ……
             /* Set the platforms and queues */
             clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &cb);
             clGetContextInfo(context, CL_CONTEXT_DEVICES, cb, devices, NULL);
             cmd_queue = clCreateCommandQueue(context, devices[0], 0, NULL);
             ……
             /* Allocate memory buffer */
             memobjs[0] = clCreateBuffer(context, CL_MEM_COPY_HOST_PTR, …);
             ……
             /* Build kernel (or load binary) */
             program = clCreateProgramWithSource(context, 1, &program_source, …);
             clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
             /* Set up kernel: the argument size comes before the argument value */
             kernel = clCreateKernel(program, "vadd", NULL);
             clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *) &memobjs[0]);
             ……
             /* Execute the kernel */
             global_work_size[0] = n;
             clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL, global_work_size, …);
             /* Read result from the device: takes the command queue, not the context */
             clEnqueueReadBuffer(cmd_queue, memobjs[2], CL_TRUE, 0, …);
             ……
         }
  • 10. An Example OpenCL Platform
       • Host: x86
       • Devices: 32 NX27V cores with RVV support
       • Each core runs one or more work items at a time
       • The host runtime dispatches work groups to multiple cores for parallel execution
       [Figure: the host (x86 with cache and host memory) connects to the device, which contains device memory and multiple NX27V cores, each with its own local memory.]
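How a host runtime might spread work groups over the 32 cores can be sketched as follows. The slides do not say which dispatch policy the Andes runtime uses; the round-robin rule below and the names `core_for_group` and `dispatch_groups` are assumptions chosen purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_CORES 32  /* number of NX27V cores in the example platform */

/* Assumed policy: work group g is sent to core g % NUM_CORES (round-robin). */
size_t core_for_group(size_t group_id)
{
    return group_id % NUM_CORES;
}

/* Count how many work groups each core receives under that policy. */
void dispatch_groups(size_t num_groups, size_t groups_per_core[NUM_CORES])
{
    assert(groups_per_core != NULL);

    for (size_t c = 0; c < NUM_CORES; c++)
        groups_per_core[c] = 0;

    for (size_t g = 0; g < num_groups; g++)
        groups_per_core[core_for_group(g)]++;
}
```

For 100 work groups, this policy gives four groups to each of the first four cores and three groups to each of the remaining 28. A real runtime could equally dispatch on demand as cores become idle.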
  • 11. OpenCL C Extension for RVV
       • Supports RVV intrinsics and new built-in functions

       C with RVV intrinsics:

         void vadd_rvv(float *a, float *b, float *c, int n)
         {
             int tn = n;
             while (tn > 0) {
                 size_t vl = vsetvl_e32m1(tn);
                 vfloat32m1_t vb = vle32_v_f32m1(b);
                 vfloat32m1_t vc = vle32_v_f32m1(c);
                 /* vfadd: floating-point vector add */
                 vfloat32m1_t va = vfadd_vv_f32m1(vb, vc);
                 vse32_v_f32m1(a, va);
                 a += vl; b += vl; c += vl;
                 tn -= vl;
             }
         }

       OpenCL kernel with RVV intrinsics:

         __kernel void vadd_rvv_cl(__global float *a, __global float *b,
                                   __global float *c, int n)
         {
             // return the index of the first element
             // to be executed by a work item
             int wi = get_work_id(sizeof(float), n, 0);
             vfloat32m1_t vb = vle32_v_f32m1(&b[wi]);
             vfloat32m1_t vc = vle32_v_f32m1(&c[wi]);
             vfloat32m1_t va = vfadd_vv_f32m1(vb, vc);
             vse32_v_f32m1(&a[wi], va);
         }
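The loop structure of the plain C version with RVV intrinsics (often called strip-mining) can be tried without vector hardware by emulating it in portable C. In this sketch, `VLMAX` (assumed to be 8 elements), `emulated_setvl`, and `vadd_stripmined` are hypothetical names: `emulated_setvl` stands in for `vsetvl_e32m1`, and the inner scalar loop stands in for one vector load, add, and store.

```c
#include <assert.h>
#include <stddef.h>

#define VLMAX 8  /* assumed number of 32-bit elements per vector register */

/* Stand-in for vsetvl_e32m1: how many elements the next strip handles. */
size_t emulated_setvl(size_t remaining)
{
    return remaining < VLMAX ? remaining : VLMAX;
}

/* a[i] = b[i] + c[i], processed in strips of at most VLMAX elements.
 * Returns the number of strips, i.e. loop iterations. */
size_t vadd_stripmined(float *a, const float *b, const float *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    size_t strips = 0;

    while (n > 0) {
        size_t vl = emulated_setvl(n);
        for (size_t i = 0; i < vl; i++)  /* emulates one vector load/add/store */
            a[i] = b[i] + c[i];
        a += vl;
        b += vl;
        c += vl;
        n -= vl;
        strips++;
    }
    return strips;
}
```

With 100 elements and the assumed `VLMAX` of 8, the loop runs 13 times: twelve full strips and a final strip of four elements. In the OpenCL version, the outer loop disappears because each work item handles one strip.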
  • 12. OpenCL Runtime Support
       • The runtime is composed of a host layer and a device layer
         – The host layer is portable across targets
         – The device layer is designed to be ported to different platforms
       • Device query scheme for the OpenCL platform layer
       • Kernel launching scheme for the OpenCL runtime
       [Figure: the host (x86) runs the host layer and the device layer; the device layer performs device query and kernel launching through GDB against the devices (AndeSim), which contain device memory and multiple NX27V cores, each with local memory.]
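One way to picture the split between a portable host layer and a per-platform device layer is a table of callbacks that each target fills in. The sketch below is hypothetical throughout: `device_ops`, `mock_device`, `host_enqueue`, and the other names are invented for illustration and do not describe the Andes runtime's actual interface.

```c
#include <assert.h>
#include <stddef.h>

/* Signature of a kernel body executed for one work item. */
typedef void (*work_item_fn)(size_t gid, void *args);

/* Hypothetical device-layer interface: one callback table per target. */
typedef struct {
    const char *name;
    unsigned (*query_compute_units)(void);                       /* device query */
    int (*launch)(work_item_fn kernel, size_t global_size, void *args); /* kernel launching */
} device_ops;

/* Mock target that runs all work items sequentially on the host CPU. */
unsigned mock_query_compute_units(void) { return 1; }

int mock_launch(work_item_fn kernel, size_t global_size, void *args)
{
    if (kernel == NULL)
        return -1;
    for (size_t gid = 0; gid < global_size; gid++)
        kernel(gid, args);
    return 0;
}

const device_ops mock_device = { "mock", mock_query_compute_units, mock_launch };

/* Host-layer entry point: target independent, talks only to device_ops. */
int host_enqueue(const device_ops *dev, work_item_fn kernel,
                 size_t global_size, void *args)
{
    assert(dev != NULL);
    return dev->launch(kernel, global_size, args);
}

/* Example kernel: vector addition for a single work item. */
typedef struct { float *a; const float *b; const float *c; } vadd_args;

void vadd_item(size_t gid, void *p)
{
    vadd_args *v = p;
    v->a[gid] = v->b[gid] + v->c[gid];
}
```

Porting to a new platform would then mean supplying another `device_ops` table, while `host_enqueue` and the code above it stay unchanged.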
  • 13. OpenCL Compilation Flow
       • OpenCL Clang translates the OpenCL kernel into SPIR
         – SPIR is an intermediate language for parallel computation defined by Khronos
         – SPIR is based on LLVM IR
       • SPIR is then translated to LLVM IR
         – The IR must be compatible with the RISC-V ABI
       [Figure: compilation pipeline with the stages Kernel Function (.cl), OpenCL C Frontend (Clang), SPIR, Work Item Grouping, SPIR to LLVM IR (RISC-V ABI), RISC-V Codegen, and Target Binary; LLVM IR is passed between the later stages.]
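The slides do not spell out what the Work Item Grouping stage does. A common technique in CPU-style OpenCL implementations is to wrap the kernel body in a loop over the work items of a group, so that one function call executes a whole work group. The C sketch below shows that idea on the vector addition kernel. It is an assumption about the kind of transformation involved, not a description of the Andes compiler, and the names `vadd_work_item` and `vadd_work_group` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Kernel body for a single work item, with the global id passed in
 * explicitly instead of being read through get_global_id(0). */
void vadd_work_item(size_t gid, const float *a, const float *b, float *c)
{
    c[gid] = a[gid] + b[gid];
}

/* Assumed result of grouping: one call runs every work item of one group. */
void vadd_work_group(size_t group_id, size_t local_size,
                     const float *a, const float *b, float *c)
{
    assert(local_size > 0);
    for (size_t lid = 0; lid < local_size; lid++)
        vadd_work_item(group_id * local_size + lid, a, b, c);
}
```

Calling `vadd_work_group(1, 4, a, b, c)` processes elements 4 to 7 only, which is the unit of work a runtime could hand to a single core.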
  • 14. Status
       • Platforms
         – QEMU (host and device are both Andes RISC-V cores)
         – x86 + AndeSim (NX27V)
         – Target: RV64GVC
       • OpenCL Conformance Tests (CTS)
         – QEMU: most cases pass; some issues remain to be clarified with upstream
         – x86 + AndeSim: ongoing
       • RVV intrinsic examples for optimization targets
       • Next: optimizations on RVV compilation and the host framework