The document provides an overview of OpenCL, including:
- OpenCL allows programs to execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors.
- It defines a programming model for parallel computation along with a framework API for controlling devices and allocating memory.
- The OpenCL framework handles compiling programs for different devices and scheduling work across processors. It provides interfaces for querying platforms and devices, creating contexts, and managing memory and command queues.
- OpenCL aims to standardize parallel programming and overcome the need to learn separate APIs for each type of hardware as processors evolve with increasing core counts.
http://cs264.org
Abstract:
High-level scripting languages are in many ways polar opposites to
GPUs. GPUs are highly parallel, subject to hardware subtleties, and
designed for maximum throughput, and they offer a tremendous advance
in the performance achievable for a significant number of
computational problems. On the other hand, scripting languages such as
Python favor ease of use over computational speed and do not generally
emphasize parallelism. PyOpenCL and PyCUDA are two packages that
attempt to join the two together. By showing concrete examples, both
at the toy and the whole-application level, this talk aims to
demonstrate that by combining these opposites, a programming
environment is created that is greater than just the sum of its two
parts.
Speaker biography:
Andreas Klöckner obtained his PhD degree working with Jan Hesthaven at
the Department of Applied Mathematics at Brown University. He worked
on a variety of topics all aiming to broaden the utility of
discontinuous Galerkin (DG) methods. This included their use in the
simulation of plasma physics and the demonstration of their particular
suitability for computation on throughput-oriented graphics processors
(GPUs). He also worked on multi-rate time stepping methods and shock
capturing schemes for DG.
In the fall of 2010, he joined the Courant Institute of Mathematical
Sciences at New York University as a Courant Instructor. There, he is
working on problems in computational electromagnetics with Leslie
Greengard.
His research interests include:
- Discontinuous Galerkin and integral equation methods for wave
propagation
- Programming tools for parallel architectures
- High-order unstructured particle-in-cell methods for plasma simulation
Highlighted notes of:
Introduction to CUDA C: NVIDIA
Author: Blaise Barney
From: GPU Clusters, Lawrence Livermore National Laboratory
https://computing.llnl.gov/tutorials/linux_clusters/gpu/NVIDIA.Introduction_to_CUDA_C.1.pdf
Blaise Barney is a research scientist at Lawrence Livermore National Laboratory.
For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/embedded-vision-alliance/embedded-vision-training/videos/pages/may-2015-embedded-vision-summit-opencv
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Gary Bradski, President and CEO of the OpenCV Foundation, presents the "OpenCV Open Source Computer Vision Library: Latest Developments" tutorial at the May 2015 Embedded Vision Summit.
OpenCV is an enormously popular open source computer vision library, with over 9 million downloads. Originally used mainly for research and prototyping, in recent years OpenCV has increasingly been used in deployed products on a wide range of platforms from cloud to mobile.
The latest version, OpenCV 3.0, is currently in beta and is a major overhaul, bringing OpenCV up to modern C++ standards and incorporating expanded support for 3D vision. The new release also introduces a modular “contrib” facility that enables independently developed modules to be quickly integrated with OpenCV as needed, providing a flexible mechanism that allows developers to experiment with new techniques before they are officially integrated into the library.
In this talk, Gary Bradski, head of the OpenCV Foundation, provides an insider’s perspective on the new version of OpenCV and how developers can utilize it to maximum advantage for vision research, prototyping, and product development.
PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next? (AMD Developer Central)
Presentation PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by Wu Feng and Mark Gardner at the AMD Developer Summit (APU13) November 11-13, 2013.
CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distributed Platforms (AMD Developer Central)
Presentation CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distributed Platforms, by Max Grossman at the AMD Developer Summit (APU13) November 11-13, 2013.
Presented at the Bossa'10 conference in Manaus, Brazil. The presentation talks about the direction in which the Qt widgets are being developed and introduces the idea of Controls to Qt and QML.
Angelo Impedovo, Linux Day 2016, Programmazione Parallela in OpenCL
Intro to heterogeneous parallel computing on GPUs using OpenCL. The entire presentation was given at Linux Day 2016 @ Polytechnic University of Bari (and is therefore written entirely in Italian).
Advances in the Solution of Navier-Stokes Eqs. in GPGPU Hardware. Modelling F...Storti Mario
In this article we compare the results obtained with a Finite Volume implementation for structured meshes on GPGPUs against experimental results, and also against a Finite Element code with a boundary-fitted strategy. The example is a fully submerged spherical buoy immersed in a cubic water recipient. The recipient undergoes a harmonic linear motion imposed with a shake table. The experiment is recorded with a high-speed camera, and the displacement of the buoy is obtained from the video with a MoCap (Motion Capture) algorithm. The amplitude and phase of the resulting motion make it possible to determine indirectly the added mass and drag of the sphere.
NVIDIA CEO Jen-Hsun Huang introduces NVLink and shares a roadmap of the GPU. Primary topics also include an introduction of the GeForce GTX Titan Z, CUDA for machine learning, and Iray VCA.
Dustin Franklin (GPGPU Applications Engineer, GE Intelligent Platforms) presents:
"GPUDirect support for RDMA provides low-latency interconnectivity between NVIDIA GPUs and various networking, storage, and FPGA devices. Discussion will include how the CUDA 5 technology increases GPU autonomy and promotes multi-GPU topologies with high GPU-to-CPU ratios. In addition to improved bandwidth and latency, the resulting increase in GFLOPS/watt has a significant impact on both HPC and embedded applications. We will dig into scalable PCIe switch hierarchies, as well as software infrastructure to manage device interoperability and GPUDirect streaming. Highlighting emerging architectures composed of Tegra-style SoCs that further decouple GPUs from discrete CPUs to achieve greater computational density."
Learn more at: http://www.gputechconf.com/page/home.html
A graphics processing unit, or GPU (also occasionally called a visual processing unit, or VPU), is a specialized microprocessor that offloads and accelerates graphics rendering from the central processor. Modern GPUs are very efficient at manipulating computer graphics, and their highly parallel structure makes them more effective than general-purpose CPUs for a range of complex algorithms. In a CPU, only a fraction of the chip does computation, whereas a GPU devotes more transistors to data processing.
GPGPU is a programming methodology based on modifying algorithms to run on existing GPU hardware for increased performance. Unfortunately, GPGPU programming is significantly more complex than traditional programming for several reasons.
Implementing OpenCL support in GEGL and GIMP, lgworld
"In this session I'm going to describe some efforts to bring OpenCL acceleration to the General Graphics Library (GEGL) and the GNU Image Manipulation Program (GIMP). I intend to show the current state of the project, some implementations techniques used and performance comparisons among common GPUs."
For the full video of this presentation, please visit:
https://www.embedded-vision.com/platinum-members/intel/embedded-vision-training/videos/pages/may-2017-embedded-vision-summit-pisarevsky
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Vadim Pisarevsky, Software Engineering Manager at Intel, presents the "Making OpenCV Code Run Fast" tutorial at the May 2017 Embedded Vision Summit.
OpenCV is the de facto standard framework for computer vision developers, with a 16+ year history, approximately one million lines of code, thousands of algorithms and tens of thousands of unit tests. While OpenCV delivers decent performance out-of-the-box for some classical algorithms on desktop PCs, it lacks sufficient performance when using some modern algorithms, such as deep neural networks, and when running on embedded platforms. Pisarevsky examines current and forthcoming approaches to performance optimization of OpenCV, including the existing OpenCL-based transparent API, newly added support for OpenVX, and early experimental results using Halide.
He demonstrates the use of the OpenCL-based transparent API on a popular CV problem: pedestrian detection. Because OpenCL does not provide good performance-portability, he explores additional approaches. He discusses how OpenVX support in OpenCV accelerates image processing pipelines and deep neural network execution. He also presents early experimental results using Halide, which provides a higher level of abstraction and ease of use, and is being actively considered for future support in OpenCV.
Devoxx 2015 - Building the Internet of Things with Eclipse IoT, Benjamin Cabé
Eclipse is much more than an IDE. Repeat after me: "Eclipse is much more than just an IDE! Eclipse has a lot of cool projects that can get me started with the Internet of Things!". So whether or not you are using Eclipse as your IDE, this session will give you a crash course on the available technologies to build the Internet of Things on top of Java. You will learn how protocols like MQTT, CoAP or LwM2M and embedded frameworks like Kura help solve classical IoT issues, and you will get useful tips to move from "yay, I blinked an LED!" to more useful industrial IoT scenarios.
This presentation describes the components of GPU ecosystem for compute, provides overview of existing ecosystems, and contains a case study on NVIDIA Nsight
This is a presentation I gave at our last GPGPU workshop, in April 2013.
The usage of GPGPU is expanding, creating a continuum from mobile to HPC. At the same time, the question is whether the GPGPU languages are the right ones (well, no), and whether we aren't wasting resources on re-developing the same SW stack instead of converging.
A basic introduction to GPU architecture, based on Kayvon's "From Shader Code to a Teraflop: How GPU Shader Cores Work".
Updated to include the latest GPUs: AMD Tahiti (HD7970) and NVIDIA Kepler (GTX690).
2. Agenda
• OpenCL intro
– GPGPU in a nutshell
– OpenCL roots
– OpenCL status
• OpenCL 1.0 deep dive
– Programming Model
– Framework API
– Language
– Embedded Profile & Extensions
• Summary
3. GPGPU in a nutshell
Disclaimer:
1. GPGPU is a lot of things to a lot of people.
This is my view & vision on GPGPU…
2. GPGPU is a huge subject, and this is a Nutshell.
I recommend Kayvon’s lecture at http://s08.idav.ucdavis.edu/
4. GPGPU in a nutshell
• On the right there is a very artificial example to explain GPGPU.
• A simple program, with a “for” loop which takes two buffers and adds them into a third buffer (did we mention the word artificial yet???)

#include <stdio.h>
…
void main (int argc, char* argv[])
{
  …
  for (int i = 0; i < iBuffSize; i++)
    C[i] = A[i] + B[i];
  …
}
5. GPGPU in a nutshell
• The example after an expert visit:
– Dual-threaded to support a dual-core CPU
– SSE2 code is doing a vectorized operation

#include <stdio.h>
…
void main (int argc, char* argv[])
{
  …
  _beginthread(vec_add, 0, A, B, C, iBuffSize/2);
  _beginthread(vec_add, 0, &A[iBuffSize/2], &B[iBuffSize/2],
               &C[iBuffSize/2], iBuffSize/2);
  …
}

void vec_add (const float *A, const float *B, float *C, int iBuffSize)
{
  __m128 vA, vB, vC;
  for (int i = 0; i < iBuffSize/4; i++)
  {
    vA = _mm_load_ps(&A[i*4]);
    vB = _mm_load_ps(&B[i*4]);
    vC = _mm_add_ps(vA, vB);
    _mm_store_ps(&C[i*4], vC);
  }
  _endthread();
}
6. Traditional GPGPU…
• Write in graphics language and use the GPU
• Highly effective, but :
– The developer needs to learn another (not intuitive) language
– The developer was limited by the graphics language
6
7. GPGPU reloaded
• CUDA was a disruptive technology
– Write C on the GPU
– Extend to non-traditional usages
– Provide synchronization mechanisms
• OpenCL deepens and extends the revolution
• GPGPU is now used in games to enhance the standard GFX pipe
– Physics
– Advanced Rendering

C on the CPU:
#include <stdio.h>
…
void main (int argc, char* argv[])
{
  …
  for (int i = 0; i < iBuffSize; i++)
    C[i] = A[i] + B[i];
  …
}

OpenCL:
__kernel void dot_product (__global const float4 *a,
                           __global const float4 *b,
                           __global float *c)
{
  int gid = get_global_id(0);
  c[gid] = dot(a[gid], b[gid]);
}
8. GPGPU also in Games…
[Screenshots: a non-interactive point light vs. a dynamic light that affects character & environment; a jagged edge with artifacting vs. clean edge details.]
9. A new type of programming…
“The way the processor industry is going is to add more and more cores, but
nobody knows how to program those things. I mean, two, yeah; four, not
really; eight, forget it.”
Steve Jobs, NY Times interview, June 10 2008
What about GPUs?
NVIDIA G80: 16 Cores, 8 HW threads per core
Larrabee: XX Cores, Y HW threads per core
“Basically it lets you use graphics processors to do computation,” he said. “It’s way beyond what Nvidia or anyone else has, and it’s really simple.”
Steve Jobs on OpenCL, NY Times interview, June 10 2008
http://bits.blogs.nytimes.com/2008/06/10/apple-in-parallel-turning-the-pc-world-upside-down/
10. OpenCL in a nutshell
• OpenCL is:
– An open standard managed by the Khronos Group (cross-IHV, cross-OS)
– Influenced & guided by Apple
– Spec 1.0 approved Dec ’08
– A system for executing short "Enhanced C" routines (kernels) across devices
– All around heterogeneous platforms – Host & Devices
– Devices: CPU, GPU, Accelerator (FPGA)
– Skewed towards GPU HW
– Samplers, Vector Types, etc.
– Offers hybrid execution capability
• OpenCL is not:
– OpenGL, or any other 3D graphics language
– Meant to replace C/C++ (don’t write the entire application in it…)
Khronos WG key contributors:
Apple, NVidia, AMD, Intel, IBM, RapidMind, Electronic Arts (EA), 3DLABS, Activision Blizzard, ARM, Barco, Broadcom, Codeplay, Ericsson, Freescale, HI, Imagination Technologies, Kestrel Institute, Motorola, Movidia, Nokia, QNX, Samsung, Seaweed, Takumi, TI and Umeå University.
11. The Roots of OpenCL
• Apple Has History on GPGPU…
– Developed a Framework called “Core Image”
(& Core Video)
– Based on OpenGL for GPU operation and
Optimized SSE code for CPU
• Feb 15th 2007 – NVIDIA introduces CUDA
– Stream Programming Language – C with
extensions for GPU
– Supported by Any Geforce 8 GPU
– Works on XP & Vista (CUDA 2.0)
– Amazing adoption rate
– 40 university courses worldwide
– 100+ Applications/Articles
• Apple & NVIDIA cooperate to create
OpenCL – Open Compute Language
12. OpenCL Status
• Apple Submitted the OpenCL 1.0 specification draft to Khronos (owner of OpenGL)
• June 16th 2008 - Khronos established the “Compute Working Group”
– Members: AMD, Apple, Ardites, ARM, Blizzard, Broadcom, Codeplay, EA, Ericsson, Freescale, Hi Corp., IBM, Imagination
Technologies, Intel, Kestrel Institute, Movidia, Nokia, Nvidia, Qualcomm, Rapid Mind, Samsung, Takumi and TI.
• Dec. 1st 2008 – OpenCL 1.0 ratification
• Apple is expected to release “Snow Leopard” Mac OS (containing OpenCL 1.0) by end of 2009
• Apple already began porting code to OpenCL
13. Agenda
• OpenCL intro
– GPGPU in a nutshell
– OpenCL roots
– OpenCL status
• OpenCL 1.0 deep dive
– Programming Model
– Framework API
– Language
– Embedded Profile & Extensions
• Summary
14. OpenCL from 10,000 feet…
• The Standard defines two major elements:
– The Framework/Software Stack (spec chapters 2-5)
– The Language (spec chapter 6)
[Diagram: the Application passes kernels written in the OpenCL C language to the OpenCL Framework, e.g.:]

__kernel void dot_product (__global const float4 *a,
                           __global const float4 *b,
                           __global float *c)
{
  int tid = get_global_id(0);
  c[tid] = dot(a[tid], b[tid]);
}
15. OpenCL Platform Model
• The basic platform is composed of a Host and a few Devices
• Each device is made of a few compute units (well, cores…)
• Each compute unit is made of a few processing elements (virtual scalar processors)
Under OpenCL the CPU is also a compute device
16. Compute Device Memory Model
• Compute Device – CPU or GPU
• Compute Unit = Core
• Compute Kernel
– A function written in OpenCL C
– Mapped to Work Item(s)
• Work-item
– A single copy of the compute kernel,
running on one data element
– In Data Parallel mode, kernel execution
contains multiple work-items
– In Task Parallel mode, kernel execution
contains a single work-item
• Four Memory Types:
– Global : default for images/buffers
– Constant : global const variables
– Local : shared between work-items
– Private : kernel internal variables
17. Execution Model
• Host defines a command queue and associates it with a context (devices, kernels, memory,
etc).
• Host enqueues commands to the command queue
• Kernel execution commands launch work-items: i.e. a kernel for each point in an abstract
Index Space
• Work items execute together as a work-group.
[Diagram: a 2D global index space of size Gx × Gy, divided into work-groups of size Sx × Sy. Work-group (wx, wy) contains work-items with local IDs (sx, sy) ranging from (0,0) to (Sx-1, Sy-1); a work-item’s global ID is (wx·Sx + sx, wy·Sy + sy).]
18. Programming Model
• Data Parallel, SPMD
– Work-items in a work-group run the same program
– Update data structures in parallel using the work-item ID to select data
and guide execution.
• Task Parallel
– One work-item per work group … for coarse grained task-level
parallelism.
– Native function interface: trap-door to run arbitrary code from an
OpenCL command-queue.
19. Compilation Model
• OpenCL uses a dynamic (runtime) compilation model (like DirectX and OpenGL)
• Static compilation:
– The code is compiled from source to machine code at a specific point in the past (when the developer compiled it using the IDE)
• Dynamic compilation:
– Also known as runtime compilation
– Step 1: The code is compiled to an Intermediate Representation (IR), which is usually an assembly language for a virtual machine. This step is known as offline compilation, and it’s done by the Front-End compiler
– Step 2: The IR is compiled to machine code for execution. This step is much shorter. It is known as online compilation, and it’s done by the Back-End compiler
• In dynamic compilation, step 1 is usually done only once, and the IR is stored. The App loads the IR and performs step 2 during the App’s runtime (hence the term…)
20. OpenCL Framework overview
• The Application (written in C++, C#, Java, …) supplies kernels to accelerate, written in OpenCL C:

__kernel void dot (__global const float4 *a,
                   __global const float4 *b,
                   __global float *c)
{
  int tid = get_global_id(0);
  c[tid] = dot(a[tid], b[tid]);
}

• OpenCL Framework: allows applications to use a host and one or more OpenCL devices as a single heterogeneous parallel computer system. It exposes the Platform API and the Runtime API.
• OpenCL Runtime: allows the host program to manipulate contexts once they have been created.
• OpenCL Compiler:
– Front-End Compiler: compiles from source into a common binary intermediate (IR) that contains the OpenCL kernels.
– Back-End Compiler: compiles from the general intermediate binary into a device-specific binary, with device-specific optimizations.
• OpenCL Platform: CPU Device and GPU Device.
Note: The CPU is both the Host and a compute device.
21. Agenda
• OpenCL intro
– GPGPU in a nutshell
– OpenCL roots
– OpenCL status
• OpenCL 1.0 deep dive
– Programming Model
– Framework API
– Language
– Embedded Profile & Extensions
• Summary
22. The Platform Layer
• Query the Platform Layer
– clGetPlatformInfo
• Query Devices (by type)
– clGetDeviceIDs (cl_device_type device_type …)
• For each device, query the device configuration
– clGetDeviceInfo (cl_device_id device, cl_device_info param_name, …)
• Create Contexts using the devices found by the “get” functions
– clCreateContext (compute_device[0] … compute_device[3])
• Context is the central element used by the runtime layer to manage:
– Command Queues
– Memory objects
– Programs
– Kernels

cl_device_type values:
– CL_DEVICE_TYPE_CPU: An OpenCL device that is the host processor. The host processor runs the OpenCL implementation and is a single- or multi-core CPU.
– CL_DEVICE_TYPE_GPU: An OpenCL device that is a GPU. By this we mean that the device can also be used to accelerate a 3D API such as OpenGL or DirectX.
– CL_DEVICE_TYPE_ACCELERATOR: Dedicated OpenCL accelerators (for example the IBM CELL Blade). These devices communicate with the host processor using a peripheral interconnect such as PCIe.
– CL_DEVICE_TYPE_DEFAULT: The default OpenCL device in the system.
– CL_DEVICE_TYPE_ALL: All OpenCL devices available in the system.

Selected cl_device_info parameters:
– CL_DEVICE_TYPE: The OpenCL device type. Currently supported values are CL_DEVICE_TYPE_CPU, CL_DEVICE_TYPE_GPU, CL_DEVICE_TYPE_ACCELERATOR, CL_DEVICE_TYPE_DEFAULT, or a combination of the above.
– CL_DEVICE_MAX_COMPUTE_UNITS: The number of parallel compute cores on the OpenCL device. The minimum value is one.
– CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: Maximum dimensions that specify the global and local work-item IDs used by the data-parallel execution model (refer to clEnqueueNDRangeKernel). The minimum value is 3.
– CL_DEVICE_MAX_WORK_GROUP_SIZE: Maximum number of work-items in a work-group executing a kernel using the data-parallel execution model (refer to clEnqueueNDRangeKernel).
– CL_DEVICE_MAX_CLOCK_FREQUENCY: Maximum configured clock frequency of the device in MHz.
– CL_DEVICE_ADDRESS_BITS: Device address space size specified as an unsigned integer value in bits. Currently supported values are 32 or 64 bits.
– CL_DEVICE_MAX_MEM_ALLOC_SIZE: Max size of memory object allocation in bytes. The minimum value is max(1/4th of CL_DEVICE_GLOBAL_MEM_SIZE, 128*1024*1024).
And many more: 40+ parameters.
23. OpenCL Runtime
Everything in the OpenCL Runtime happens within a context, which is created from compute devices (compute_device[0] … compute_device[2]).
Within a context live:
• Programs – source plus compiled kernel images (per device)
• Kernels – a function handle plus its argument list (arg[0], arg[1], …)
• Memory Objects – Images and Buffers
• Command Queues – per device (Device[0] … Device[2]), each In-Order or Out-of-Order

#define g_c __global const

__kernel void dot_prod (g_c float4 *a,
                        g_c float4 *b,
                        __global float *c)
{
  int tid = get_global_id(0);
  c[tid] = dot(a[tid], b[tid]);
}

__kernel void buff_add (g_c float4 *a,
                        g_c float4 *b,
                        __global float4 *c)
{
  int tid = get_global_id(0);
  c[tid] = a[tid] + b[tid];
}

Flow: compile code → create data & arguments → send to execution.
24. OpenCL “boot” process
Platform Layer
Query Platform
Query Devices
Create Context
Runtime
Create Command Queue
Create Memory Object
Compiler
Create Program
Build Program
Create Kernel
Set Kernel Args
Enqueue Kernel
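The "boot" sequence above maps almost one-to-one onto OpenCL 1.0 API calls. The following is a condensed, pseudocode-level sketch, not a compilable program: declarations, error checks and most parameters are elided.

```
// Platform layer
clGetPlatformIDs(1, &platform, NULL);
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);

// Runtime
queue = clCreateCommandQueue(context, device, 0, &err);
bufA  = clCreateBuffer(context, CL_MEM_READ_ONLY,  size, NULL, &err);
bufC  = clCreateBuffer(context, CL_MEM_WRITE_ONLY, size, NULL, &err);

// Compiler
program = clCreateProgramWithSource(context, 1, &src, NULL, &err);
clBuildProgram(program, 1, &device, NULL, NULL, NULL);
kernel  = clCreateKernel(program, "dot_product", &err);

// Launch
clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufA);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &bufC);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                       0, NULL, NULL);
```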
25. OpenCL C Programming Language in a Nutshell
• Derived from ISO C99
• A few restrictions:
– Recursion
– Function pointers
– Functions in C99 standard headers
• New data types
– New scalar types
– Vector types
– Image types
• Address Space Qualifiers
• Synchronization objects
– Barrier
• Built-in functions
• IEEE 754 compliant, with a few exceptions

__global float4 *color;        // An array of float4

typedef struct {
  float a[3];
  int b[2];
} foo_t;

__global image2d_t texture;    // A 2D texture image

__kernel void stam_dugma(
  __global float *output,
  __global float *input,
  __local float *tile)
{
  // private variables
  int in_x, in_y;
  const unsigned int lid = get_local_id(0);

  // declares a pointer p in the __private address space
  // that points to an int object in the __global address space
  __global int *p;
}
26. Agenda
• OpenCL intro
– GPGPU in a nutshell
– OpenCL roots
– OpenCL status
• OpenCL 1.0 deep dive
– Programming Model
– Framework API
– Language
– Embedded Profile & Extensions
• Summary
27. OpenCL Extensions
• As in OpenGL, OpenCL supports specification extensions
• An extension is an optional feature which might be supported by a device but is not part of the “Core features” (Khronos term)
– The application is required to query the device using the CL_DEVICE_EXTENSIONS parameter
• Two types of extensions:
– Extensions approved by the Khronos OpenCL working group
– Use “KHR” in function/enum names, etc.
– Might be promoted to required Core features in future versions of OpenCL
– Vendor-specific extensions
• The specification already provides some KHR extensions
28. OpenCL 1.0 KHR Extensions
• Double Precision Floating Point
– Support Double as data type and extend built-in functions to support it
• Selecting Rounding Mode
– Add to the mandatory “round to nearest” : “round to nearest even”, “round to
zero”, “round to positive infinity”, “round to negative infinity”
• Atomic Functions (for 32-bit integers, for Local memory, for 64-bit)
• Writing to 3D image memory objects
– OpenCL mandates only read.
• Byte addressable stores
– In OpenCL core, Write to Pointers is limited to 32bit granularity
• Half floating point
– Add 16bit Floating point type
29. OpenCL 1.0 Embedded Profile
• A “relaxed” version for embedded devices (as in OpenGL ES)
• No 64bit integers
• No 3D images
• Reduced requirements on Samplers
– No CL_FILTER_LINEAR for Float/Half
– Less addressing modes
• Not IEEE 754 compliant on some functions
– Example: Min Accuracy for atan() >= 5 ULP
• Reduced set of minimal device requirements
– Image height/width : 2048 instead of 8192
– Number of samplers : 8 instead of 16
– Local memory size : 1K instead of 16K
– More…
30. Agenda
• OpenCL intro
– GPGPU in a nutshell
– OpenCL roots
– OpenCL status
• OpenCL 1.0 deep dive
– Programming Model
– Framework API
– Language
– Embedded Profile & Extensions
• Summary
31. OpenCL Unique Features
• As a summary, here are some unique features of OpenCL:
• An open standard for cross-OS, cross-platform, heterogeneous processing
  – Khronos owned
• Creates a unified, flat system model where the GPU, CPU and other
  devices are treated (almost) the same
• Includes data & task parallelism
  – Extends GPGPU beyond the traditional usages
• Supports native functions (C++ interop)
• Derived from ISO C99 with additional types, functions, etc. (and some
  restrictions)
• IEEE 754 compliant
33. Building OpenCL Code
1. Creating OpenCL programs
   – From source (clCreateProgramWithSource): receives an array of source strings
   – From binaries (clCreateProgramWithBinary): receives an array of binaries, each of which can be
     – an Intermediate Representation, or
     – a device-specific executable
2. Building the programs (clBuildProgram)
   – The developer can define a subgroup of devices to build on
3. Creating kernel objects (clCreateKernel)
   – Single kernel: according to the kernel name
   – All kernels in the program
• OpenCL supports a dynamic compilation scheme – the application uses
  “create from source” the first time, and then uses “create from binaries”
  on subsequent runs

cl_program clCreateProgramWithSource (cl_context context,
                                      cl_uint count,
                                      const char **strings,
                                      const size_t *lengths,
                                      cl_int *errcode_ret)

cl_program clCreateProgramWithBinary (cl_context context,
                                      cl_uint num_devices,
                                      const cl_device_id *device_list,
                                      const size_t *lengths,
                                      const unsigned char **binaries,
                                      cl_int *binary_status,
                                      cl_int *errcode_ret)

cl_int clBuildProgram (cl_program program,
                       cl_uint num_devices,
                       const cl_device_id *device_list,
                       const char *options,
                       void (*pfn_notify)(cl_program, void *user_data),
                       void *user_data)

cl_kernel clCreateKernel (cl_program program,
                          const char *kernel_name,
                          cl_int *errcode_ret)
34. Some order here, please…
• OpenCL defines a command queue, which is created on a single device
  – within the scope of a context, of course…
  – Several queues can be created on the same device
• Commands are enqueued to a specific queue
  – Kernel execution
  – Memory operations
• Events
  – Each command can be created with an event associated to it
  – Each command's execution can be made dependent on a list of
    pre-created events
  – Commands can depend on events created on other queues/contexts
• Two types of queues:
  – In-order queue: commands are executed in the order of issuing
  – Out-of-order queue: command execution depends only on the
    completion of its event list
• (Diagram omitted: two contexts over devices D1–D3, holding a mix of
  in-order (IOQ) and out-of-order (OOQ) queues with commands C1–C4)
• In the diagram's example:
  – C3 from Q1,2 depends on C1 & C2 from Q1,2
  – C1 from Q1,4 depends on C2 from Q1,2
  – In Q1,4, C3 depends on C2
35. Memory Objects
• OpenCL defines Memory Objects (Buffers/Images)
  – Reside in Global Memory
  – An object is defined in the scope of a context
  – Memory Objects are the only way to pass large amounts of data between the Host & the devices
• Two mechanisms to sync Memory Objects:
• Transactions – Read/Write
  – Read – takes a “snapshot” of the buffer/image into Host memory
  – Write – overwrites the buffer/image with data from the Host
  – Can be blocking or non-blocking (the app needs to sync on an event)
• Mapping
  – Similar to the DX lock/map commands
  – Passes ownership of the buffer/image to the host, and then back to the device
36. Executing Kernels
• A kernel is executed by enqueueing it to a specific command queue
• The app must set the kernel arguments before enqueueing
  – Setting the arguments is done one by one
  – The kernel's argument list is preserved after enqueueing
  – This enables changing only the required arguments before enqueueing again
• There are two separate enqueueing APIs – Data Parallel & Task Parallel
• In Data Parallel enqueueing, the app specifies the global & local work sizes
  – Global: the overall work-items to be executed, described by an N-dimensional matrix
  – Local: the breakdown of the global size to fit the specific device (can be left NULL)

cl_int clSetKernelArg (cl_kernel kernel,
                       cl_uint arg_index,
                       size_t arg_size,
                       const void *arg_value)

cl_int clEnqueueNDRangeKernel (cl_command_queue command_queue,
                               cl_kernel kernel,
                               cl_uint work_dim,
                               const size_t *global_work_offset,
                               const size_t *global_work_size,
                               const size_t *local_work_size,
                               cl_uint num_events_in_wait_list,
                               const cl_event *event_wait_list,
                               cl_event *event)

cl_int clEnqueueTask (cl_command_queue command_queue,
                      cl_kernel kernel,
                      cl_uint num_events_in_wait_list,
                      const cl_event *event_wait_list,
                      cl_event *event)
37. Reality Check – Apple compilation scheme
• OCL compilation process on Snow Leopard (OS X 10.6):
  – Step 1: Compile the OpenCL compute program to LLVM IR
    (Intermediate Representation) using Apple's OpenCL front-end
  – Step 2: Compile the LLVM IR for the target device
• An NVIDIA GPU device compiles the LLVM IR in two steps:
  – LLVM IR to PTX (CUDA IR)
  – PTX to the target GPU binary (G80 / G92 / G200)
• A CPU device uses the LLVM x86 back-end (LLVM Project) to compile
  directly to x86 binary code
• So what is LLVM? Next slide…
38. The LLVM Project
• LLVM – Low Level Virtual Machine
• An open-source compiler infrastructure that is
  – Multi-language (front-ends such as Clang, GCC, GLSL+)
  – Cross-platform/architecture (back-ends for x86, PPC, ARM, MIPS)
  – Cross-OS
• All front-ends emit a common LLVM IR, which the back-ends lower to
  target code
39. OCL C Data Types
• The OpenCL C programming language supports all ANSI C data types
• In addition, the following scalar types are supported:

  Type       Description
  half       A 16-bit float. The half data type must conform to the IEEE 754-2008 half-precision storage format.
  size_t     The unsigned integer type of the result of the sizeof operator (32-bit or 64-bit).
  ptrdiff_t  A signed integer type that is the result of subtracting two pointers (32-bit or 64-bit).
  intptr_t   A signed integer type with the property that any valid pointer to void can be converted to this type, then converted back to pointer to void, and the result will compare equal to the original pointer.
  uintptr_t  An unsigned integer type with the property that any valid pointer to void can be converted to this type, then converted back to pointer to void, and the result will compare equal to the original pointer.

• And the following vector types are supported, where n (the number of
  elements) can be 2, 4, 8, or 16: charn, ucharn, shortn, ushortn, intn,
  uintn, longn, ulongn, floatn
40. Address Space Qualifiers
• The OpenCL memory model defines 4 memory spaces
• Accordingly, OpenCL C defines 4 qualifiers:
  – __global, __local, __constant, __private
• Best explained on a piece of code:

// Variables outside the scope of kernels must be global
__global float4 *color;        // An array of float4 elements

typedef struct {
    float a[3];
    int b[2];
} foo_t;

__global image2d_t texture;    // A 2D texture image

// Variables passed to the kernel can be of any type
__kernel void stam_dugma(
    __global float *output,
    __global float *input,
    __local float *tile)
{
    // Internal variables are private unless specified otherwise
    int in_x, in_y;
    const unsigned int lid = get_local_id(0);

    // And here's an example of the "specified otherwise":
    // declares a pointer p in the __private address space
    // that points to an int object in address space __global
    __global int *p;
}
41. Built-in functions
• The spec defines over 80 built-in functions which must be supported
• The built-in functions are divided into the following types:
  – Work-item functions: get_work_dim, get_global_id, etc.
  – Math functions: acos, asin, atan, ceil, hypot, ilogb, …
  – Integer functions: abs, add_sat, mad_hi, max, mad24, …
  – Common functions (float only): clamp, min, max, radians, step, …
  – Geometric functions: cross, dot, distance, normalize, …
  – Relational functions: isequal, isgreater, isfinite, …
  – Vector data load & store functions: vloadn, vstoren, …
  – Image read & write functions: read_imagef, read_imagei, …
  – Synchronization functions: barrier
  – Memory fence functions: read_mem_fence, …