OpenCL - The Open Standard for Heterogeneous Parallel Programming

DLR TechTalk by Dr. Achim Basermann (14 July 2009, Köln-Porz)

Transcript

• 1. OpenCL – The Open Standard for Heterogeneous Parallel Programming
  Dr.-Ing. Achim Basermann
  Deutsches Zentrum für Luft- und Raumfahrt e.V.
  Abteilung Verteilte Systeme und Komponentensoftware, Köln-Porz
• 2. Survey
  Motivation: trends in parallel processing; the role of OpenCL
  Khronos' OpenCL: platform model, execution model, memory model, programming model
  Conclusions
  Material used: OpenCL survey and specification from http://www.khronos.org/opencl/
• 3. Motivation: Trends in Parallel Processing, HW
  Trend #1: Multicore processor chips
  - Maintain (or even reduce) frequency while replicating cores
  Trend #2: Accelerators (e.g. GPGPUs)
  - Previously, processors would "catch up" with accelerator functions in the next generation, so accelerator design expense was not amortized well
  - New accelerator designs are more likely to maintain their performance advantage, and will maintain an enormous power advantage for target workloads
  Trend #2b: Heterogeneous multicore in general (e.g. IBM Cell)
  - Mixes of powerful cores, smaller cores, and accelerators potentially offer the most efficient nodes
  - The challenge is harnessing them efficiently
• 4. Motivation: Programming Issues
  Many cores per node, plus accelerators/heterogeneity
  Future performance gains will come via massive parallelism (not clock speed), an unwelcome situation for HPC apps!
  New programming models are needed to exploit this parallelism
  At the system/cluster level: message passing to connect node-level languages, or global addressing to make communication implicit?
• 5. Motivation: Classic Programming Models
• 6. Motivation: OpenCL
  New open standard that specifically addresses parallel compute accelerators
  Extension to C
  Provides data-parallel and task-parallel models
  Facilitates a natural transition from the growing number of CUDA (Compute Unified Device Architecture, NVIDIA) programs
  Allows porting of Cell applications to a standard model
  Plays well with MPI
  Can interoperate with Fortran and OpenMP
• 7. Motivation: Roles of OpenCL
• 8. Motivation: Before OpenCL
• 9. Motivation: The Promise of OpenCL
• 10. OpenCL
  CPUs: multiple cores driving performance increases; multi-processor programming, e.g. OpenMP
  GPUs: increasingly general-purpose data-parallel computing; improving numerical precision; graphics APIs and shading languages
  The emerging intersection of the two: heterogeneous computing with OpenCL
  OpenCL – Open Computing Language: an open, royalty-free standard for portable, parallel programming of heterogeneous parallel computing CPUs, GPUs, and other processors
• 11. OpenCL Working Group
  Diverse industry participation: processor vendors, system OEMs, middleware vendors, application developers
  Many industry-leading experts involved in OpenCL's design, giving a healthy diversity of industry perspectives
  Apple initially proposed OpenCL, is very active in the working group, and serves as specification editor
  The slide lists some of the other companies in the OpenCL working group
• 12. OpenCL Timeline
  Six months from proposal to released specification, due to a strong initial proposal and a shared commercial incentive to work quickly
  Apple's Mac OS X Snow Leopard will include OpenCL, improving speed and responsiveness for a wide spectrum of applications
  Multiple OpenCL implementations expected in the next 12 months, on diverse platforms
  Timeline:
  - Before Jun08: Apple works with AMD, Intel, NVIDIA and others on a draft proposal
  - Jun08: Apple proposes the OpenCL working group and contributes the draft specification to Khronos
  - Jun08 to Dec08: the working group develops the draft into a cross-vendor specification
  - Dec08: the working group sends the completed draft to the Khronos Board for ratification; Khronos publicly releases OpenCL as a royalty-free specification
  - May09: Khronos releases conformance tests to ensure high-quality implementations
• 13. OpenCL: Part of the Khronos API Ecosystem
  OpenCL (heterogeneous parallel computing) is at the center of an emerging visual computing ecosystem that includes 3D graphics, video and image processing on desktop, embedded and mobile systems
  The surrounding Khronos APIs cover: cross-platform desktop 3D, embedded 3D, a 3D asset interchange format, streaming media and image processing, vector 2D, enhanced audio, surface and synch abstraction, an integrated mixed-media stack, and mobile OS abstraction
  The ecosystem spans the silicon and software communities: parallel computing and visualization in scientific and consumer applications on the desktop, and streamlined APIs for mobile and embedded graphics, media and compute acceleration
  Umbrella specifications define coherent acceleration stacks for mobile application portability
• 14. OpenCL: Platform Model
  One Host plus one or more Compute Devices
  Each Compute Device is composed of one or more Compute Units
  Each Compute Unit is further divided into one or more Processing Elements
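Not part of the talk: a minimal host-side sketch, using only standard OpenCL 1.x calls, of how this hierarchy is visible through the API (error checking omitted for brevity):

    #include <stdio.h>
    #include <CL/cl.h>

    int main(void)
    {
        cl_platform_id platform;   /* the host's view of one OpenCL vendor stack */
        cl_device_id devices[8];   /* compute devices */
        cl_uint numDevices, units;

        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 8, devices, &numDevices);

        for (cl_uint i = 0; i < numDevices; ++i) {
            /* compute units per device; processing elements sit below these */
            clGetDeviceInfo(devices[i], CL_DEVICE_MAX_COMPUTE_UNITS,
                            sizeof(units), &units, NULL);
            printf("device %u: %u compute units\n", i, units);
        }
        return 0;
    }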
• 15. OpenCL: Execution Model
  OpenCL program: kernels plus host program
  - Kernels: basic units of executable code, similar to C functions, CUDA kernels, etc.; data-parallel or task-parallel
  - Host program: a collection of compute kernels and internal functions, analogous to a dynamic library
  Kernel execution:
  - The host program invokes a kernel over an index space called an NDRange ("N-Dimensional Range"), which can be a 1D, 2D, or 3D space
  - A single kernel instance at a point in the index space is called a work-item
  - Work-items have unique global IDs from the index space (CUDA: thread IDs)
  - Work-items are further grouped into work-groups; work-groups have a unique work-group ID, and work-items have a unique local ID within a work-group (CUDA: block IDs)
• 16. OpenCL: Execution Model, Example 2D NDRange
  Total number of work-items = Gx * Gy
  Size of each work-group = Sx * Sy
  The global ID can be computed from the work-group ID and the local ID
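As an illustration of that last point (a sketch, not from the slides): for dimension 0 the relation is global ID = work-group ID * Sx + local ID, which a kernel can check with the built-in work-item functions:

    __kernel void check_ids(__global int* ok)
    {
        size_t gid = get_global_id(0);
        /* recompute the global ID from work-group ID, work-group size and local ID */
        size_t fromGroup = get_group_id(0) * get_local_size(0) + get_local_id(0);
        ok[gid] = (gid == fromGroup);   /* 1 everywhere if the relation holds */
    }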
• 17. OpenCL: Execution Model
  Contexts are used to contain and manage the state of the "world"; kernels are executed in contexts defined and manipulated by the host:
  - Devices
  - Kernels (OpenCL functions)
  - Program objects (kernel source and executables)
  - Memory objects
  Command queues coordinate the execution of kernels:
  - Kernel execution commands
  - Memory commands: transfer or map memory object data
  - Synchronization commands: constrain the order of commands
  Execution order of commands: launched in-order, executed in-order or out-of-order
  Events are used to implement appropriate synchronization of execution instances
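A sketch of event-based synchronization (assumed, not from the talk; cqQueue, ckA, ckB and szGlobal stand for an already created queue, two kernels and a global work size):

    cl_event evA;

    /* enqueue kernel A and obtain an event for it */
    clEnqueueNDRangeKernel(cqQueue, ckA, 1, NULL, szGlobal, NULL, 0, NULL, &evA);

    /* kernel B waits on the event, i.e. it may only start after A has finished */
    clEnqueueNDRangeKernel(cqQueue, ckB, 1, NULL, szGlobal, NULL, 1, &evA, NULL);

    clWaitForEvents(1, &evA);   /* host-side wait on the same event */
    clReleaseEvent(evA);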
• 18. OpenCL: Memory Model
  Shared memory with relaxed consistency (similar to CUDA); multiple distinct address spaces (which can be collapsed):
  - Private memory (per work-item); qualifier __private, e.g. __private char *px; corresponds to local memory in CUDA
  - Local memory (per work-group); qualifier __local; corresponds to shared memory in CUDA
  - Global memory; qualifier __global, e.g. __global float4 *p; corresponds to global memory in CUDA
  - Constant memory; qualifier __constant; corresponds to constant memory in CUDA
  (Slide diagram: processing elements with private memory inside a compute unit with local memory; a global/constant memory data cache on the compute device; global memory in the compute device memory.)
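A hypothetical kernel fragment, added here for illustration only, that touches all four address spaces:

    __kernel void spaces(__global float* g,     /* global: visible to all work-items */
                         __constant float* c,   /* constant: read-only during execution */
                         __local float* l)      /* local: shared within one work-group */
    {
        __private float p = c[0];               /* private: this work-item only */
        l[get_local_id(0)] = g[get_global_id(0)] + p;
    }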
• 19. OpenCL: Memory Consistency
  A relaxed-consistency memory model:
  - Across work-items (CUDA: threads): no consistency
  - Within a work-item (CUDA: thread): load/store consistency
  - Consistency of memory shared between commands is enforced through synchronization
• 20. OpenCL: Programming Model, Data-Parallel
  Must be implemented by all OpenCL compute devices
  Define an N-dimensional computation domain:
  - Each independent element of execution in the N-dimensional domain is called a work-item
  - The N-dimensional domain defines the total number of work-items that execute in parallel, the global work size
  Work-items can be grouped together into work-groups:
  - Work-items in a group can communicate with each other
  - Execution among work-items in a group can be synchronized to coordinate memory access
  Multiple work-groups execute in parallel
  The mapping of the global work size to work-groups can be implicit or explicit (see the sketch below)
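A host-side sketch of the implicit versus explicit mapping (queue and kernel are assumed handles; note that in OpenCL 1.x an explicit local size must divide the global size evenly):

    size_t global[1] = { 10000 };
    size_t local[1]  = { 100 };

    /* explicit mapping: 100 work-items per work-group, 100 work-groups */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global, local, 0, NULL, NULL);

    /* implicit mapping: pass NULL and let the implementation pick the group size */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global, NULL, 0, NULL, NULL);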
• 21. OpenCL: Programming Model, Task-Parallel
  Some compute devices can also execute task-parallel compute kernels, which execute as a single work-item
  Users express parallelism by
  - using vector data types implemented by the device,
  - enqueuing multiple tasks (compute kernels written in OpenCL), and/or
  - enqueuing native kernels developed in a programming model orthogonal to OpenCL (e.g. native C/C++ functions)
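For the second bullet, OpenCL 1.x provides clEnqueueTask, which runs a kernel as a single work-item (queue and taskKernel are assumed handles); native kernels go through clEnqueueNativeKernel instead:

    /* enqueue a task-parallel kernel: equivalent to an NDRange of exactly one work-item */
    cl_int err = clEnqueueTask(queue, taskKernel, 0, NULL, NULL);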
• 22. OpenCL: Programming Model, Synchronization
  Work-items in a single work-group (work-group barrier), similar to __syncthreads() in CUDA; there is no mechanism for synchronization between work-groups (see the barrier sketch below)
  Synchronization points between commands in command queues, similar to multiple kernels in CUDA but more generalized:
  - Command-queue barrier
  - Waiting on an event
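A minimal kernel sketch of a work-group barrier (hypothetical, not from the slides): every work-item stages a value in local memory, and the barrier guarantees that all writes are visible before any work-item reads:

    __kernel void reverse_in_group(__global float* data, __local float* tmp)
    {
        size_t lid = get_local_id(0);
        size_t n   = get_local_size(0);

        tmp[lid] = data[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE);   /* all work-items in the group wait here */

        /* safe: tmp is fully populated, so reverse the group's block */
        data[get_global_id(0)] = tmp[n - 1 - lid];
    }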
• 23. OpenCL C for Compute Kernels
  Derived from ISO C99, with a few restrictions: recursion, function pointers, functions in C99 standard headers, ...
  Preprocessing directives defined by C99 are supported
  Built-in data types:
  - Scalar and vector data types, pointers
  - Data-type conversion functions: convert_type<_sat><_roundingmode>
  - Image types: image2d_t, image3d_t and sampler_t
  Built-in functions, required:
  - Work-item functions, math.h, read and write image
  - Relational, geometric and synchronization functions
  Built-in functions, optional:
  - Double precision (the latest CUDA supports this), atomics to global and local memory
  - Selection of rounding mode, writes to image3d_t surfaces
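A small illustrative kernel (assumed, not from the talk) combining a vector literal with one member of the convert_type<_sat><_roundingmode> family:

    __kernel void vec_demo(__global int4* out)
    {
        float4 f = (float4)(0.5f, 1.5f, -2.5f, 3.0e9f);
        /* saturated conversion with explicit round-to-nearest-even:
           the last component clamps to INT_MAX instead of overflowing */
        out[get_global_id(0)] = convert_int4_sat_rte(f);
    }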
• 24. OpenCL C Language Highlights
  Function qualifiers:
  - The __kernel qualifier declares a function as a kernel
  - Kernels can call other kernel functions
  Address space qualifiers:
  - __global, __local, __constant, __private
  - Pointer kernel arguments must be declared with an address space qualifier
  Work-item functions:
  - Query work-item identifiers: get_work_dim(), get_global_id(), get_local_id(), get_group_id()
  Image functions:
  - Images must be accessed through built-in functions
  - Reads/writes are performed through sampler objects, passed from the host or defined in source
  Synchronization functions:
  - Barriers: all work-items within a work-group must execute the barrier function before any work-item can continue
  - Memory fences: provide ordering between memory operations
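A hedged example of the image rules above, with the sampler defined in the kernel source (all names illustrative):

    __constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                               CLK_ADDRESS_CLAMP_TO_EDGE |
                               CLK_FILTER_NEAREST;

    __kernel void copy_pixels(__read_only image2d_t src,
                              __global float4* dst, int width)
    {
        int2 pos = (int2)(get_global_id(0), get_global_id(1));
        /* images can only be read through built-ins such as read_imagef */
        dst[pos.y * width + pos.x] = read_imagef(src, smp, pos);
    }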
• 25. OpenCL C Language Restrictions
  - Pointers to functions are not allowed
  - Pointers to pointers are allowed within a kernel, but not as an argument
  - Bit-fields are not supported
  - Variable-length arrays and structures are not supported
  - Recursion is not supported
  - Writes to a pointer of a type less than 32 bits wide are not supported
  - Double types are not supported, but reserved
  - 3D image writes are not supported
  Some restrictions are addressed through extensions
• 26. Basic OpenCL Program Structure
• 27. OpenCL: Kernel Code Example
  Simple element-by-element vector addition: for all i, c[i] = a[i] + b[i]

    __kernel void VectorAdd(__global const float* a,
                            __global const float* b,
                            __global float* c)
    {
        int iGID = get_global_id(0);
        c[iGID] = a[iGID] + b[iGID];
    }
• 28. OpenCL VectorAdd: Contexts and Queues

    cl_context cxMainContext;       // OpenCL context
    cl_command_queue cqCommandQue;  // OpenCL command queue
    cl_device_id* cdDevices;        // OpenCL device list
    size_t szParmDataBytes;         // byte length of parameter storage

    // create the OpenCL context on a GPU device
    cxMainContext = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU,
                                            NULL, NULL, NULL);

    // get the list of GPU devices associated with the context
    clGetContextInfo(cxMainContext, CL_CONTEXT_DEVICES, 0, NULL,
                     &szParmDataBytes);
    cdDevices = (cl_device_id*)malloc(szParmDataBytes);
    clGetContextInfo(cxMainContext, CL_CONTEXT_DEVICES, szParmDataBytes,
                     cdDevices, NULL);

    // create a command queue
    cqCommandQue = clCreateCommandQueue(cxMainContext, cdDevices[0], 0, NULL);
• 29. OpenCL VectorAdd: Create Memory Objects, Program and Kernel
  Create memory objects:
  - allocate the first source buffer memory object ... source data, so read only ...
  - allocate the second source buffer memory object ... source data, so read only ...
  - allocate the destination buffer memory object ... result data, so write only ...
  Create program and kernel:
  - create the program ...
  - build the program ...
  - create the kernel ...
  - set the kernel argument values ...
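The slide deliberately elides the actual calls; a minimal sketch of what they might look like for VectorAdd, using the standard OpenCL 1.x API with assumed names (cxMainContext and iTestN from the surrounding slides, sourceString holding the kernel source; error checks omitted):

    // memory objects: two read-only sources, one write-only destination
    cl_mem srcA = clCreateBuffer(cxMainContext, CL_MEM_READ_ONLY,
                                 sizeof(cl_float) * iTestN, NULL, NULL);
    cl_mem srcB = clCreateBuffer(cxMainContext, CL_MEM_READ_ONLY,
                                 sizeof(cl_float) * iTestN, NULL, NULL);
    cl_mem dstC = clCreateBuffer(cxMainContext, CL_MEM_WRITE_ONLY,
                                 sizeof(cl_float) * iTestN, NULL, NULL);

    // program and kernel
    cl_program cpProgram = clCreateProgramWithSource(cxMainContext, 1,
                                                     &sourceString, NULL, NULL);
    clBuildProgram(cpProgram, 0, NULL, NULL, NULL, NULL);
    cl_kernel ckKernel = clCreateKernel(cpProgram, "VectorAdd", NULL);

    // kernel argument values: one per kernel parameter
    clSetKernelArg(ckKernel, 0, sizeof(cl_mem), &srcA);
    clSetKernelArg(ckKernel, 1, sizeof(cl_mem), &srcB);
    clSetKernelArg(ckKernel, 2, sizeof(cl_mem), &dstC);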
• 30. OpenCL VectorAdd: Launch Kernel

    cl_command_queue cqCommandQue;  // OpenCL command queue
    cl_kernel ckKernel;             // OpenCL kernel "VectorAdd"
    cl_int ciErrNum;                // error code
    size_t szGlobalWorkSize[1];     // global # of work-items
    size_t szLocalWorkSize[1];      // # of work-items per work-group
    int iTestN = 10000;             // length of demo test vectors

    // set work-item dimensions
    szGlobalWorkSize[0] = iTestN;
    szLocalWorkSize[0] = 1;

    // execute kernel
    ciErrNum = clEnqueueNDRangeKernel(cqCommandQue, ckKernel, 1, NULL,
                                      szGlobalWorkSize, szLocalWorkSize,
                                      0, NULL, NULL);

    // cleanup: release kernel, program, and memory objects ...
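Not on the slide: the host typically reads the result back afterwards; a blocking clEnqueueReadBuffer also synchronizes with kernel completion (dstC is the destination buffer assumed from the sketch above):

    // blocking read: returns only after the kernel has written dstC
    float* result = (float*)malloc(sizeof(float) * iTestN);
    clEnqueueReadBuffer(cqCommandQue, dstC, CL_TRUE, 0,
                        sizeof(float) * iTestN, result, 0, NULL, NULL);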
• 31. Summary: OpenCL versus CUDA

                          OpenCL                           CUDA
    Execution model       work-groups / work-items         blocks / threads
    Memory model          global/constant/local/private    global/constant/shared/local, plus texture
    Memory consistency    weak consistency                 weak consistency
    Synchronization       work-group barrier               __syncthreads()
                          (between work-items)             (between threads)
• 32. Conclusions
  OpenCL meets the trends in HPC:
  - Supports multicore SMPs
  - Supports accelerators (in particular GPGPUs)
  - Allows programming of heterogeneous processors
  - Supports vectorization (SIMD operations)
  - Explicitly supports parallel image processing
  - Suitable for massive parallelism: OpenCL + MPI
  OpenCL is a low-level parallel language and complicated; if you master it, you master (heterogeneous) parallel programming.
  It might be the basis for new high-level parallel languages.
• 33. Questions?
• 34. Next TechTalk on CUDA: Wednesday, July 22, 2009 (15:00-16:00, Raum-KP-2b-06, Funktional)
  CUDA – The Compute Unified Device Architecture from NVIDIA
  Jens Rühmkorf
• 35. Additional Material: OpenCL Demos
  NVIDIA: http://www.youtube.com/watch?v=PJ1jydg8mLg
  AMD (1): http://www.youtube.com/watch?v=MCaGb40Bz58&feature=related
  AMD (2): http://www.youtube.com/watch?v=mcU89Td53Gg