INTRODUCTION
Hammad Ghulam Mustafa
Hafiz Muhammad Noman Zahid
Muhammad Abdullah Ijaz
Malik Waqas Bashir Abid
Muhammad Umar Arshad
Umair Javaid
Suleman Khan
Ali Islal
• Introduction
• Programming Basics
• OpenCL Execution Model
• “Hello World”
• Conclusion
• OpenCL: standard for the development of data-parallel applications
• Most often used to develop GPGPU applications
• GPGPU: General-Purpose computing on Graphics Processing Units
• A GPU comprises hundreds of compute cores
• Specialized for massively data-parallel computation
• GPGPU: Take advantage of the GPU's computing power to build massively parallel
applications
• Parallel applications achieve large speedups in molecular dynamics, image
processing, evolutionary computation, …
• All cases based on data parallelism:
each thread processes a subset of the data
For example, a vector addition:
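The vector addition can be sketched in plain C to show how the data is divided. This is only an illustration, not OpenCL code: the names `add_slice` and `vector_add_partitioned` are hypothetical, and the "threads" run one after another here. Because the slices do not overlap, they could equally run at the same time.

```c
#include <stddef.h>

/* Hypothetical helper: adds the slice [begin, end) of a and b into c.
   In a real data-parallel run, each thread would work on its own slice. */
void add_slice(const int *a, const int *b, int *c, size_t begin, size_t end)
{
    for (size_t i = begin; i < end; i++)
        c[i] = a[i] + b[i];
}

/* Splits n elements across num_threads workers (assumed names and layout).
   The workers are executed sequentially in this sketch. */
void vector_add_partitioned(const int *a, const int *b, int *c,
                            size_t n, size_t num_threads)
{
    if (num_threads == 0)
        return;
    size_t chunk = (n + num_threads - 1) / num_threads; /* ceiling division */
    for (size_t t = 0; t < num_threads; t++) {
        size_t begin = t * chunk;
        size_t end = begin + chunk;
        if (begin > n) begin = n; /* surplus workers get an empty slice */
        if (end > n) end = n;     /* last slice may be shorter */
        add_slice(a, b, c, begin, end);
    }
}
```

With 7 elements and 3 workers, the slices are [0,3), [3,6) and [6,7).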
• Furthermore, OpenCL provides portability:
same code can run on different architectures
• OpenCL provides the following abstraction: a compute device is composed of
compute units
• OpenCL platform: Host + Compute Devices
Each manufacturer provides an SDK:
• NVIDIA SDK for GPUs
• AMD APP SDK for CPUs/GPUs
• Intel SDK for CPUs
• IBM SDK for PowerPC and Cell/B.E.
• Kernel: function that defines the behavior of each thread
• For example, kernel for vector addition:
__kernel void sumKernel(
    __global int* a, __global int* b, __global int* c)
{
    int i = get_global_id(0);
    c[i] = a[i] + b[i];
}
Written in OpenCL C: a C99-based language plus a set of built-in kernel functions, e.g.:
• get_global_id: returns the global index of the calling thread
• barrier: synchronizes the threads of a work-group
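What "one kernel invocation per thread" means can be sketched with a plain C emulation. This is not how an OpenCL runtime is implemented: `emulated_get_global_id` and `launch_1d` are hypothetical stand-ins, and the invocations run one after another instead of in parallel.

```c
#include <stddef.h>

/* Stand-in for the index that get_global_id(0) would return in a real
   kernel. Assumption: the runtime supplies a different value per thread. */
static size_t current_global_id;

size_t emulated_get_global_id(void)
{
    return current_global_id;
}

/* Same body as sumKernel, with the OpenCL qualifiers removed. */
void sum_kernel(const int *a, const int *b, int *c)
{
    size_t i = emulated_get_global_id();
    c[i] = a[i] + b[i];
}

/* Emulates a 1-D launch: the kernel is called once per global index. */
void launch_1d(size_t global_size, const int *a, const int *b, int *c)
{
    for (size_t id = 0; id < global_size; id++) {
        current_global_id = id;
        sum_kernel(a, b, c);
    }
}
```

Calling `launch_1d(4, a, b, c)` runs the kernel body four times, with indices 0 to 3, so each invocation writes exactly one element of `c`.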
• An OpenCL application consists of a host program and one or more kernels
• Basic host application flow:
a. Load and compile the kernel
b. Copy data from host to device (e.g. from CPU to GPU)
c. Execute the kernel
d. Copy data from device to host
e. Release kernels and data from device memory
• Commands are executed through a command queue on each device
• Host code: programmed using OpenCL API
• API Calls, such as:
• clCreateProgramWithSource: Load kernel source from char* strings
• clBuildProgram: Compile kernel
• clSetKernelArg: Set one kernel argument
• clEnqueueWriteBuffer / clEnqueueReadBuffer: Copy data to / from the device
• clEnqueueNDRangeKernel: Launch kernel on the device
• API Types, such as:
• cl_mem: Handle to a device memory object
• cl_program: Program object (a collection of kernels)
• cl_float / cl_int / cl_uint: Fixed-size counterparts of C types
• Kernel
• Basic unit of executable code, similar to a C function
• Data-parallel or task-parallel
• A whole H.264 encoder is not a kernel
• A kernel should be a small, separate function, e.g. SAD (sum of absolute differences)
• Program
• Collection of kernels and other functions
• Analogous to a dynamic library
• Applications queue kernel execution instances
• Queued in-order
• Executed in-order or out-of-order
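The in-order case can be sketched as a simple first-in, first-out queue in plain C. This is a toy model, not the OpenCL API: `command_queue`, `queue_enqueue`, `queue_finish` and the two sample commands are hypothetical names, and out-of-order execution is not modeled.

```c
#include <stddef.h>

#define QUEUE_CAPACITY 16

typedef void (*command_fn)(void *arg);

/* Toy command queue: stores commands in submission order. */
typedef struct {
    command_fn fn[QUEUE_CAPACITY];
    void *arg[QUEUE_CAPACITY];
    size_t count;
} command_queue;

void queue_init(command_queue *q)
{
    q->count = 0;
}

/* Records a command without running it. Returns 0, or -1 if full. */
int queue_enqueue(command_queue *q, command_fn fn, void *arg)
{
    if (q->count == QUEUE_CAPACITY)
        return -1;
    q->fn[q->count] = fn;
    q->arg[q->count] = arg;
    q->count++;
    return 0;
}

/* Runs every recorded command in submission order, then empties the
   queue. Returns the number of commands executed. */
size_t queue_finish(command_queue *q)
{
    size_t n = q->count;
    for (size_t i = 0; i < n; i++)
        q->fn[i](q->arg[i]);
    q->count = 0;
    return n;
}

/* Two sample commands whose result depends on execution order. */
void add_one(void *arg)   { *(int *)arg += 1; }
void double_it(void *arg) { *(int *)arg *= 2; }
```

Enqueuing `add_one` and then `double_it` on a value of 3 leaves it untouched until `queue_finish` is called. Afterwards it holds (3 + 1) * 2 = 8, which shows that the commands ran in the order they were queued.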
• Define N-dimensional computation domain (N = 1, 2 or 3)
• Each independent element of execution in N-D domain is called a work-item
• The N-D domain defines the total number of work-items that execute in parallel
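How the domain size determines the number of work-items can be sketched in plain C. The `nd_range` type and both functions are hypothetical, and the linear numbering (dimension 0 varying fastest) is only an assumption made for illustration, since OpenCL does not prescribe an execution order.

```c
#include <stddef.h>

/* Hypothetical description of an N-D domain, with dims = 1, 2 or 3. */
typedef struct {
    unsigned dims;
    size_t size[3];
} nd_range;

/* Total number of work-items: product of the sizes of the used dimensions. */
size_t nd_range_total(const nd_range *r)
{
    size_t total = 1;
    for (unsigned d = 0; d < r->dims; d++)
        total *= r->size[d];
    return total;
}

/* Recovers the per-dimension index of a work-item from an assumed linear
   position, with dimension 0 varying fastest. Unused dimensions get 0. */
void nd_range_ids(const nd_range *r, size_t linear, size_t ids[3])
{
    for (unsigned d = 0; d < 3; d++) {
        if (d < r->dims) {
            ids[d] = linear % r->size[d];
            linear /= r->size[d];
        } else {
            ids[d] = 0;
        }
    }
}
```

For a 2-D domain of 4 x 3, `nd_range_total` gives 12 work-items, and linear position 5 maps to index 1 in dimension 0 and index 1 in dimension 1.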
• Create a program
• Input: String (source code) or precompiled binary
• Analogous to a dynamic library: A collection of kernels
• Compile the program
• Specify the devices for which kernels should be compiled
• Pass in compiler flags
• Check for compilation/build errors
• Create the kernels
• Returns a kernel object used to hold arguments for a given execution
• OpenCL does not provide performance portability: the same code runs on different devices, but usually needs tuning for each one
• Alternative to NVIDIA CUDA:
Programming paradigm for NVIDIA GPU cards
• Combinable with other parallel programming models:
OpenMP for SMPs / MPI for MPPs
• Large surrounding ecosystem, e.g. OpenACC:
Develop GPGPU applications using compiler directives
#pragma acc kernels
for (i = 0; i < N; i++)
    c[i] = b[i] + a[i];
Introduction to OpenCL By Hammad Ghulam Mustafa
