An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar

An Introduction to OpenCL™ Using AMD GPUs
Chris Mason Product Manager, Acceleware September 17, 2014

© 2014 Acceleware Ltd. Reproduction or distribution strictly prohibited.
About Acceleware
Programmer Training
–OpenCL, CUDA, OpenMP
–Over 100 courses taught
–http://acceleware.com/training
Consulting Services
–Completed projects for: Oil & Gas, Medical, Finance, Security & Defence, Computer Aided Engineering, Media & Entertainment
–http://acceleware.com/services
GPU Accelerated Software
–Seismic imaging & modeling
–Electromagnetics

Seismic Imaging & Modeling
AxWave
–Seismic forward modeling
–2D, 3D, constant and variable density models
–High fidelity finite-difference modeling
AxRTM
–High performance Reverse Time Migration Application
–Isotropic, VTI and TTI media
HPC Implementation
–Optimized for GPUs
–Efficient multi-GPU scaling

Electromagnetics
AxFDTD™
–Finite-Difference Time-Domain Electromagnetic Solver
–Optimized for GPUs
–Sub-gridding and large feature coverage
–Multi-GPU, GPU clusters, GPU targeting
–Available from:

Consulting Services
Finance: Option Pricing
–Work completed: Debugged and optimized existing code; implemented the Leisen-Reimer version of the binomial model for stock option pricing
–Results: 30-50x performance improvement compared to single-threaded CPU code
Security & Defense: Detection System
–Work completed: Replaced legacy Cell-based infrastructure with GPUs; implemented GPU-accelerated X-ray iterative image reconstruction and explosive detection algorithms
–Results: Surpassed the performance targets; reduced hardware cost by a factor of 10
CAE: SIMULIA Abaqus
–Work completed: Developed a GPU-accelerated version; conducted a finite-element analysis and developed a library to offload the LDLT factorization portion of the multi-frontal solver to GPUs
–Results: Delivered an accelerated (2-3x) solution that supports NVIDIA and AMD GPUs
Medical: CT Reconstruction Software
–Work completed: Developed a GPU-accelerated application for image reconstruction on CT scanners and implemented advanced features including a job batch manager, filtering and bad pixel corrections
–Results: Accelerated back projection by 31x
Oil & Gas: Seismic Application
–Work completed: Converted MATLAB research code into a standalone application and improved performance via algorithmic optimizations
–Results: 20-30x speedup

Programmer Training
OpenCL, CUDA, OpenMP
Teachers with real world experience
Hands-on lab exercises
Progressive lectures
Small class sizes to maximize learning
90 days post training support
“The level of detail is fantastic. The course did not focus on syntax but rather on how to expertly program for the GPU. I loved the course and I hope that we can get more of our team to take it.”
Jason Gauci, Software Engineer
Lockheed Martin

Outline
Introduction to the OpenCL Architecture
–Contexts, Devices, Queues
Memory and Error Management
Data-Parallel Computing
–Kernel Launches
GPU Kernels

Introduction To The OpenCL Architecture

OpenCL Architecture Introduction and Terminology
Four high level models describe the key OpenCL concepts:
–Platform Model – high level host/device interaction
–Execution Model – OpenCL programs execute on host/device
–Memory Model – different memory resources on device
–Programming Model – types of parallel workloads

OpenCL Platform Model
A host connected to one or more devices
–Example: GPUs, DSPs, FPGAs
A program can work with devices from multiple vendors
A platform is a host and a collection of devices that share resources and execute programs
[Figure: a host connected to Device 1 (GPU), Device 2 (CPU), …, Device N (GPU)]

OpenCL Execution Model
The host defines a context to control the device
–The context manages the following resources:
–Devices – hardware to run on
–Kernels – functions to run on the hardware
–Program Objects – device executables
–Memory Objects – memory visible to host and device
A command queue schedules commands for execution on the device

OpenCL API - Platform and Runtime Layer
The OpenCL API is divided into two layers: Platform and Runtime
The platform layer allows the host program to discover devices and capabilities
The runtime layer allows the host program to work with contexts once created

Program Set Up
To set up an OpenCL program, the typical steps are as follows:
1.Query and select the platforms (e.g., AMD)
2.Query the devices
3.Create a context
4.Create a command queue
5.Read/Write to the device
6.Launch the kernel
Steps 1-3 use the platform layer; steps 4-6 use the runtime layer

Sample Platform Layer C Code
//Get the platform ID
cl_platform_id platform;
clGetPlatformIDs(1, &platform, NULL);
// Get the first GPU device associated with the platform
cl_device_id device;
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
//Create an OpenCL context for the GPU device
cl_context context;
context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);

OpenCL Runtime Layer
A command queue operates on contexts, memory, and program objects
Each device can have one or more command queues
Operations in the command queue will execute in order unless the out of order mode is enabled
[Figure: a command queue holding the commands Copy Data, Copy Data, Launch Kernel, Copy Data]

Memory and Error Management

OpenCL Buffers
A buffer stores a one dimensional collection of elements
Buffer objects use the cl_mem type
–cl_mem is an abstract memory container (i.e., a handle)
–The buffer object cannot be dereferenced on the host
•cl_mem a; a[0] = 5; // Not allowed
OpenCL commands interact with buffers

OpenCL Syntax – C Memory Management Example
Example:
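// Create an OpenCL command queue (context and device come from the platform layer code)
cl_int err;
cl_command_queue queue;
queue = clCreateCommandQueue(context, device, 0, &err);
// Host data; an enum constant is used because a const int cannot size an initialized array in C
enum { N = 5 };
cl_int hostarr[N] = {3, 1, 4, 1, 5};
size_t nBytes = N * sizeof(cl_int);
// Allocate memory on the device
cl_mem a = clCreateBuffer(context, CL_MEM_READ_WRITE,
                          nBytes, NULL, &err);
// Transfer memory from host to device (CL_TRUE makes the write blocking)
err = clEnqueueWriteBuffer(queue, a, CL_TRUE, 0,
                           nBytes, hostarr, 0, NULL,
                           NULL);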

OpenCL Syntax – Error Management
Host code manages errors:
–Most host-side OpenCL function calls return a cl_int error code
–“Create” calls return the object that is created
•The error code is returned through a pointer passed as the last argument
–Error codes are negative values defined in cl.h
•CL_SUCCESS == 0
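As a rough illustration of checking those codes, the sketch below maps a few error values to readable names. To stay self-contained it does not include cl.h: the numeric values are copied from a handful of the codes defined there, the list is far from complete, and the helper names cl_error_name and cl_ok are hypothetical rather than part of the OpenCL API.

```c
/* Hypothetical helper: maps a few OpenCL error codes to their names.
   The values mirror those defined in cl.h; the list is not exhaustive. */
const char *cl_error_name(int code)
{
    switch (code) {
    case 0:   return "CL_SUCCESS";
    case -1:  return "CL_DEVICE_NOT_FOUND";
    case -2:  return "CL_DEVICE_NOT_AVAILABLE";
    case -4:  return "CL_MEM_OBJECT_ALLOCATION_FAILURE";
    case -5:  return "CL_OUT_OF_RESOURCES";
    case -6:  return "CL_OUT_OF_HOST_MEMORY";
    case -30: return "CL_INVALID_VALUE";
    case -34: return "CL_INVALID_CONTEXT";
    default:  return "unlisted OpenCL error";
    }
}

/* Hypothetical helper: returns 1 when a call succeeded, 0 otherwise. */
int cl_ok(int code)
{
    return code == 0;
}
```

In real host code the value passed in would be the cl_int returned by a call, or the one written through the last argument of a “Create” call.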

OpenCL Syntax – Clean Up
Objects that are created are released with the matching clRelease function, for example:
–clReleaseContext
–clReleaseCommandQueue
–clReleaseMemObject

Data-Parallel Computing

Data-Parallel Computing
Data-parallelism
1.Performs operations on a data set organized into a common structure (e.g. an array)
2.Tasks work collectively on the same structure with each task operating on its own portion of the structure
3.Tasks perform identical operations on their portions of the structure. Operations on each portion must not be data dependent!

Data Dependence
Data dependence occurs when a program statement refers to the data of a preceding statement.
Data dependence limits parallelism
a = 2 * x;
b = 2 * y;
c = 3 * x;
These 3 statements are independent!
a = 2 * x;
b = 2 * a * a;
c = b * 9;
b depends on a, c depends on b and a!
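One way to see why dependence limits parallelism is to run the same loop body in a different order. The plain C sketch below (all function names are illustrative) does this for an independent loop and for a dependent one: only the independent loop is insensitive to iteration order.

```c
#include <stddef.h>

/* Independent iterations: out[i] depends only on in[i], so any
   iteration order (or parallel execution) gives the same result. */
void scale_forward(const int *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = 2 * in[i];
}

void scale_reverse(const int *in, int *out, size_t n)
{
    for (size_t i = n; i-- > 0; )
        out[i] = 2 * in[i];
}

/* Dependent iterations: data[i] reads data[i - 1], which the previous
   iteration wrote, so the loop must run in order. */
void running_sum_forward(int *data, size_t n)
{
    for (size_t i = 1; i < n; i++)
        data[i] += data[i - 1];
}

/* Same statement visited in reverse order: a different result. */
void running_sum_reverse(int *data, size_t n)
{
    for (size_t i = n; i-- > 1; )
        data[i] += data[i - 1];
}
```

For the input 1 2 3 4, both scale functions produce 2 4 6 8, while the forward running sum gives 1 3 6 10 and the reversed one gives 1 3 5 7.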

Data-Parallel Computing Example
Data set consisting of arrays A,B, and C
Same operations performed on each element - Cx = Ax + Bx
Two tasks operating on a subset of the arrays. Tasks 0 and 1 are independent. Could have more tasks.
[Figure: arrays A0-A7 and B0-B7 are combined element-wise into C0-C7 by the operation Cx = Ax + Bx; Task 0 and Task 1 each handle a subset of the elements]
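The example can be sketched in plain C as follows. This is a serial illustration only: the function names and the choice of contiguous, equally sized chunks are assumptions, and the tasks run one after another here even though nothing prevents them from running concurrently.

```c
#include <stddef.h>

/* One task: adds the elements in [begin, end) of a and b into c. */
void add_range(const float *a, const float *b, float *c,
               size_t begin, size_t end)
{
    for (size_t i = begin; i < end; i++)
        c[i] = a[i] + b[i];
}

/* Splits n elements into num_tasks contiguous chunks and runs the tasks
   one after another. The tasks touch disjoint elements, so they could
   run in any order or at the same time. */
void vector_add_tasks(const float *a, const float *b, float *c,
                      size_t n, size_t num_tasks)
{
    if (num_tasks == 0)
        return;
    size_t chunk = (n + num_tasks - 1) / num_tasks; /* ceil(n / num_tasks) */
    for (size_t t = 0; t < num_tasks; t++) {
        size_t begin = t * chunk;
        size_t end = begin + chunk < n ? begin + chunk : n;
        if (begin < end)
            add_range(a, b, c, begin, end);
    }
}
```

With n = 8 and num_tasks = 2, task 0 handles elements 0-3 and task 1 handles elements 4-7; asking for more tasks only changes how the same work is divided.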

The OpenCL Programming Model
OpenCL is a heterogeneous model, including provisions for both host and device
[Figure: a host (CPU, chipset, DRAM) connected over PCIe to a device (DSP, GPU or FPGA) with its own DRAM]

Data-parallel portions of an algorithm are executed on the device as kernels
–Kernels are C functions with some restrictions, and a few language extensions
With a single in-order command queue, only one kernel is executed at a time
A kernel is executed by many work-items
–Each work-item executes the same kernel

OpenCL Work-Items
OpenCL work-items are conceptually similar to data-parallel tasks or threads
–Each work-item performs the same operations on a subset of a data structure
–Work-items execute independently
OpenCL work-items are not CPU threads
–OpenCL work-items are extremely lightweight
•Little creation overhead
•Instant context-switching
–Work-items must execute the same kernel

OpenCL Work-Item Hierarchy
OpenCL is designed to execute millions of work-items
Work-items are grouped together into work-groups
–Maximum # of work-items per work-group (HW limit)
–Query CL_DEVICE_MAX_WORK_GROUP_SIZE with clGetDeviceInfo
•Typically 256-1024
The entire collection of work-items is called the N-Dimensional Range (NDRange)

OpenCL Work-Item Hierarchy
Work-groups and NDRange can be 1D, 2D, or 3D
Dimensions set at launch time
[Figure: a 2D NDRange of 3 x 2 work-groups, (0,0) through (2,1); work-group (1,1) is expanded to show its 4 x 3 work-items, (0,0) through (3,2)]
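To make the hierarchy concrete, the sketch below splits a global work-item index into a work-group index and an index within that group, one dimension at a time. It is plain host-side C, not OpenCL: the type and function names are hypothetical, and it assumes a zero global offset and a global size that is a multiple of the local size.

```c
#include <stddef.h>

/* Hypothetical model of one work-item's position in one dimension. */
typedef struct {
    size_t group_id;  /* what get_group_id(dim) would report */
    size_t local_id;  /* what get_local_id(dim) would report */
} work_item_pos;

/* Splits a global ID into a work-group index and an index within the
   group, assuming a zero global offset. */
work_item_pos locate_work_item(size_t global_id, size_t local_size)
{
    work_item_pos pos;
    pos.group_id = global_id / local_size;
    pos.local_id = global_id % local_size;
    return pos;
}

/* Number of work-groups in one dimension; assumes global_size is a
   multiple of local_size. */
size_t num_work_groups(size_t global_size, size_t local_size)
{
    return global_size / local_size;
}
```

Using the layout from the figure labels (work-groups of 4 x 3 work-items in a 12 x 6 NDRange), the work-item with global index (9, 4) falls in work-group (2, 1) at local index (1, 1).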

The host launches kernels
The host executes serial code between device kernel launches
–Memory management
–Data exchange to/from device (usually)
–Error handling
[Figure: the host alternates serial code with kernel launches; each launch runs on the device as an NDRange of work-groups, 2 x 3 work-groups for the first launch and 3 x 2 for the second]

Data-Parallel Computing on GPUs
Data-parallel computing maps well to GPUs:
–Identical operations executed on many data elements in parallel
•Simplified flow control allows increased ratio of compute logic (ALUs) to control logic
[Figure: block diagrams of a CPU and a GPU, each attached to DRAM, showing ALUs, control logic and caches; the GPU has a higher ratio of ALUs to control logic]

OpenCL API – Launching a Kernel in C
How to launch a kernel:
// 3D NDRange, let the OpenCL runtime determine
// the local work size.
size_t const globalWorkSize3D[3] = {512, 512, 512};
clEnqueueNDRangeKernel(queue, kernel, 3, NULL, globalWorkSize3D, NULL,
                       0, NULL, NULL);
// 2D NDRange, specify the local work size
// (in OpenCL 1.x each global size must be a multiple of the local size)
size_t const globalWorkSize2D[2] = {512, 512};
size_t const localWorkSize2D[2] = {16, 16};
clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                       globalWorkSize2D, localWorkSize2D,
                       0, NULL, NULL);
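When a local work size is given, OpenCL 1.x expects it to divide the global work size evenly in each dimension. A common approach when the problem size is not such a multiple is to round the global size up on the host and have the kernel skip work-items whose index lies past the end of the data. A minimal sketch of the rounding step, with a hypothetical helper name:

```c
#include <stddef.h>

/* Rounds global_size up to the next multiple of local_size
   (local_size must be greater than zero). */
size_t round_up_global_size(size_t global_size, size_t local_size)
{
    size_t remainder = global_size % local_size;
    return remainder == 0 ? global_size
                          : global_size + (local_size - remainder);
}
```

For 1000 elements and a local size of 16 this gives 1008, so the last 8 work-items would have nothing to do and need a bounds check inside the kernel.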

GPU Kernels

Writing OpenCL Kernels
Denoted by __kernel function qualifier
–e.g. __kernel void myKernel(__global float* a)
Queued from host, executed on device
A few noteworthy restrictions:
–No access to host memory (in general!)
–Must return void
–No function pointers
–No static variables
–No recursion (no stack)

OpenCL Syntax - Kernels
Kernels have built-in functions:
–get_work_dim(): number of dimensions in use
–get_global_id(dim): unique index of a work-item in dimension dim
–get_global_size(dim): number of global work-items in dimension dim
–The argument dim ranges from 0 to 2, depending on the dimension of the kernel launch

OpenCL Syntax – Kernels (Continued)
Built-in function listing (continued):
–get_local_id (dim): unique index of the work-item within the work-group
–get_local_size (dim): number of work-items within the work-group
–get_group_id (dim): index of the work-group
–get_num_groups (dim): number of work-groups
–The work-group size and the number of work-items cannot change during a kernel call

Built-in functions are typically used to determine unique work-item identifiers:
One-dimensional NDRange (get_work_dim() == 1) with get_local_size(0) = 5:
get_group_id(0): 0 | 1 | 2
get_local_id(0): 0 1 2 3 4 | 0 1 2 3 4 | 0 1 2 3 4
get_global_id(0): 0 1 2 3 4 | 5 6 7 8 9 | 10 11 12 13 14
get_global_id(0) == get_group_id(0) * get_local_size(0) + get_local_id(0)

OpenCL Syntax – Thread Identifiers
Result for each kernel launched with the following execution configuration:
Dimension = 1, Global work size = 12, Local work size = 4
__kernel void MyKernel(__global int* a)
{
    int idx = get_global_id(0);
    a[idx] = 7;
}
a: 7 7 7 7 7 7 7 7 7 7 7 7
__kernel void MyKernel(__global int* a)
{
    int idx = get_global_id(0);
    a[idx] = get_group_id(0);
}
a: 0 0 0 0 1 1 1 1 2 2 2 2
__kernel void MyKernel(__global int* a)
{
    int idx = get_global_id(0);
    a[idx] = get_local_id(0);
}
a: 0 1 2 3 0 1 2 3 0 1 2 3
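The three results can be reproduced on the host without an OpenCL runtime by visiting every work-item serially. The sketch below is only an emulation: the struct, the function-pointer type and every function name are hypothetical stand-ins for what the built-ins would report to each work-item in a 1D launch with a zero global offset.

```c
#include <stddef.h>

/* Hypothetical stand-in for the values the OpenCL built-ins report to
   one work-item in a 1D launch. */
typedef struct {
    size_t global_id;
    size_t group_id;
    size_t local_id;
} work_item_ids;

typedef int (*kernel_body)(work_item_ids ids);

int write_seven(work_item_ids ids)    { (void)ids; return 7; }
int write_group_id(work_item_ids ids) { return (int)ids.group_id; }
int write_local_id(work_item_ids ids) { return (int)ids.local_id; }

/* Serially visits every work-item of a 1D NDRange and stores the body's
   result at a[global_id]. global_size must be a multiple of local_size. */
void emulate_launch_1d(kernel_body body, int *a,
                       size_t global_size, size_t local_size)
{
    size_t num_groups = global_size / local_size;
    for (size_t group = 0; group < num_groups; group++) {
        for (size_t local = 0; local < local_size; local++) {
            work_item_ids ids;
            ids.group_id = group;
            ids.local_id = local;
            ids.global_id = group * local_size + local;
            a[ids.global_id] = body(ids);
        }
    }
}
```

Calling emulate_launch_1d with a global size of 12 and a local size of 4 fills a with the same three patterns listed above, one per body function.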

Code Example - Kernel
Kernel is executed by N work-items
–Each work-item has a unique ID between 0 and N-1
CPU program:
void inc(float* a, float b,
         int N)
{
    for(int i = 0; i<N; i++)
        a[i] = a[i] + b;
}
int main()
{
    …
    inc(a,b,N);
}
OpenCL program:
__kernel
void inc(__global float* a,
         float b)
{
    int i = get_global_id(0);
    a[i] = a[i] + b;
}
int main()
{
    …
    clEnqueueNDRangeKernel(…,…);
}
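The step from loop to kernel can be mimicked in plain C by moving the loop body into a per-work-item function and letting a host loop play the role of the launch. This is a sketch with hypothetical names, not OpenCL code. It also adds a bounds check, which the kernel above does not have, so the emulated launch may use more work-items than there are elements.

```c
#include <stddef.h>

/* Serial version: one loop over all elements. */
void inc_serial(float *a, float b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] + b;
}

/* Body of one work-item. i plays the role of get_global_id(0); the
   bounds check lets the launch use more work-items than elements. */
void inc_work_item(float *a, float b, size_t n, size_t i)
{
    if (i < n)
        a[i] = a[i] + b;
}

/* Host-side stand-in for a kernel launch: runs the body once per
   work-item, serially. */
void inc_emulated_launch(float *a, float b, size_t n, size_t global_size)
{
    for (size_t i = 0; i < global_size; i++)
        inc_work_item(a, b, n, i);
}
```

Launching 8 emulated work-items over 5 elements gives the same 5 results as the serial loop and leaves anything stored after them untouched.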

All C operators are supported
–e.g. +, *, /, ^, >, >>
Many functions from the standard math library are available
–e.g. sin(), cos(), ceil(), fabs()
Can write/call your own non-kernel functions
–float myDeviceFunction(__global float *a)
–Non-kernel functions cannot be called by host
Control flow statements too!
–e.g. if(), while(), for()

OpenCL Syntax - Synchronization
Kernel launches are asynchronous
–Control returns to the CPU immediately
–Subsequent commands added to an in-order command queue will wait until the kernel has completed
–If you want to synchronize on the host:
•Implicit synchronization via blocking commands
–e.g. clEnqueueReadBuffer() with the blocking argument set to CL_TRUE
•Explicit synchronization via clFinish(queue)
–Blocks on the host until all outstanding OpenCL commands in the given queue are complete

Questions?
OpenCL training courses and consulting services
Acceleware Ltd.
Twitter: @Acceleware
Web: http://acceleware.com/opencl-training
Email: services@acceleware.com
-------------------
Stay in the know about developer news, tools, SDKs, technical presentations, events and future webinars. Connect with AMD Developer Central here:
AMD Developer Central
Twitter: @AMDDevCentral
Web: http://developer.amd.com/
YouTube: https://www.youtube.com/user/AMDDevCentral
Developer Forums: http://devgurus.amd.com/welcome

An Overview of GPU Hardware

What is the GPU?
The GPU is a graphics processing unit
Historically used to offload graphics computations from the CPU
Can either be a dedicated video card, integrated on the motherboard or on the same die as the CPU
–Highest performance will require a dedicated video card

Why use GPUs? Performance!
| | Intel Xeon E5-2697 v2 (Ivy Bridge) | AMD Opteron 6386SE (Bulldozer) | AMD FirePro W9100 (Volcanic Islands) | AMD FirePro S10000 (Southern Islands) |
|---|---|---|---|---|
| Processing Cores | 12 | 16 | 2816 | 3584 |
| Clock Frequency | 2.7-3.4* GHz | 2.8-3.5* GHz | 930 MHz | 825 MHz |
| Memory Bandwidth | 59.7 GB/s per socket | 59.7 GB/s per socket | 320 GB/s | 480 GB/s |
| Peak Gflops** (single) | 576 @ 3.0 GHz | 410 @ 3.2 GHz | 5240 | 5910 |
| Peak Gflops** (double) | 288 @ 3.0 GHz | 205 @ 3.2 GHz | 2620 | 1480 |
| Gflops/Watt (single) | 4.4 | 2.9 | 19 | 15.76 |
| Total Memory | >>16 GB | >>16 GB | 16 GB | 6 GB |
*Indicates range of clock frequencies supported via Intel Turbo Boost and AMD Turbo CORE Technology
** At maximum frequency when all cores are executing

GPU Potential Advantages
AMD FirePro W9100 vs. Xeon E5-2697 v2:
–9x more single-precision floating-point throughput
–9x more double-precision floating-point throughput
–5x higher memory bandwidth
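These ratios follow from the figures in the comparison table, and the peak figures themselves can be approximated as cores x clock x floating-point operations per core per cycle. The sketch below shows the arithmetic. The per-cycle values used in the usage note (2 for the W9100, 16 for the Xeon) are assumptions chosen because they reproduce the single-precision table entries; they do not come from the slides, and the function names are illustrative.

```c
/* Peak GFLOPS = cores x clock (GHz) x floating-point operations per
   core per cycle. */
double peak_gflops(double cores, double clock_ghz, double flops_per_cycle)
{
    return cores * clock_ghz * flops_per_cycle;
}

/* How many times larger the GPU figure is than the CPU figure. */
double advantage(double gpu_value, double cpu_value)
{
    return gpu_value / cpu_value;
}
```

peak_gflops(2816, 0.93, 2) gives about 5238 and peak_gflops(12, 3.0, 16) gives 576, close to the table's 5240 and 576. advantage(5240, 576) and advantage(2620, 288) are both about 9.1, and advantage(320, 59.7) is about 5.4.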

GPU Disadvantages
Architecture not as flexible as CPU
Must rewrite algorithms and maintain software in GPU languages
Attached to CPU via relatively slow PCIe
–About 16 GB/s in each direction for PCIe 3.0 x16
Limited memory (though 6-16GB is reasonable for many applications)
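A back-of-the-envelope estimate shows why the PCIe link matters. The sketch below computes an idealized transfer time from a data size and a bandwidth, ignoring latency and protocol overhead. The function name is illustrative, and 1 GB is taken as 1e9 bytes.

```c
/* Time in milliseconds to move `bytes` at `gb_per_s` gigabytes per
   second (1 GB taken as 1e9 bytes); ignores latency and overhead. */
double transfer_ms(double bytes, double gb_per_s)
{
    return bytes / (gb_per_s * 1e9) * 1e3;
}
```

Moving 1 GB across a 16 GB/s PCIe link takes about 62.5 ms by this estimate, while reading the same amount from device memory at the 320 GB/s quoted for the FirePro W9100 takes about 3.1 ms, so data is best kept on the device between kernels where possible.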

Software Approaches for Acceleration
Programming Languages: Maximum Flexibility
–OpenCL
OpenACC Directives: Simple programming for heterogeneous systems
–Simple compiler hints/pragmas
–Compiler parallelizes code
–Target a variety of platforms
Libraries: “Drop-in” Acceleration
–In-depth GPU knowledge not required
–Highly optimized by GPU experts
–Provides functions used in a broad range of applications (e.g. FFT, BLAS)
[Figure: the three approaches arranged along an "Effort" axis]

An Introduction to OpenCL

OpenCL Overview
Parallel computing architecture standardized by the Khronos Group
OpenCL:
–Is a royalty free standard
–Provides an API to coordinate parallel computation across heterogeneous processors
–Defines a cross-platform programming language

OpenCL Versions
To date there are four different versions of OpenCL
–OpenCL 1.0
–OpenCL 1.1
–OpenCL 1.2
–OpenCL 2.0 (finalized November 2013)
Different versions support different functionality
| Hardware Vendor | Supported OpenCL Version |
|---|---|
| AMD | OpenCL 1.2 |
| Intel | OpenCL 1.2 |
| NVIDIA | OpenCL 1.1 |

OpenCL Extensions
Optional functionality is exposed through extensions
–Vendors are not required to support extensions to achieve conformance
–However, extensions are expected to be widely available
Some OpenCL extensions are approved by the OpenCL working group
–These extensions are likely to be promoted to core functionality in future versions of the standard
Multi-vendor and vendor specific extensions do not need approval by the working group