6. Why GPUs?
• Designed for Parallelism - Supports thousands of threads with minimal thread-management overhead
• High Speed
• Low Cost
• Availability
7. How does it work?
• Host code - Runs on CPU
• Serial code (data pre-processing, sequential algorithms)
• Reads data from input (files, databases, streams)
• Transfers data from host to device (GPU), as sketched after this list
• Calls device code (kernels)
• Copies data back from device to host
• Device code - Runs on GPU
• Independent parallel tasks called kernels
• Same task acts on different pieces of data - SIMD - Data Parallelism
• Different tasks act on different pieces of data - MIMD - Task Parallelism
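A minimal host-side sketch of these steps, assuming a context, command queue, and host arrays hostA, hostB, hostC of N floats already exist (the buffer names anticipate the host code shown later; error handling is omitted):
/* Sketch only: allocate device buffers and move input data to the device */
cl_int err;
cl_mem memobjA = clCreateBuffer(context, CL_MEM_READ_ONLY,  N * sizeof(float), NULL, &err);
cl_mem memobjB = clCreateBuffer(context, CL_MEM_READ_ONLY,  N * sizeof(float), NULL, &err);
cl_mem memobjC = clCreateBuffer(context, CL_MEM_WRITE_ONLY, N * sizeof(float), NULL, &err);
/* Transfer data from host to device */
err = clEnqueueWriteBuffer(command_queue, memobjA, CL_TRUE, 0, N * sizeof(float), hostA, 0, NULL, NULL);
err = clEnqueueWriteBuffer(command_queue, memobjB, CL_TRUE, 0, N * sizeof(float), hostB, 0, NULL, NULL);
/* ... build the program, set kernel arguments, enqueue the kernel ... */
/* Copy results back from device to host */
err = clEnqueueReadBuffer(command_queue, memobjC, CL_TRUE, 0, N * sizeof(float), hostC, 0, NULL, NULL);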
10. Computing Model
• Compute Device = GPU
• Compute Unit = Processor
• Compute/Processing Element = Processor Core
• A GPU can contain from hundreds up to thousands of cores (compute-unit counts can be queried at run time, as sketched below)
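A small sketch of that query, assuming device_id was obtained earlier via clGetDeviceIDs; note that OpenCL reports compute units rather than individual cores:
/* Query the number of compute units on the device */
cl_uint computeUnits;
clGetDeviceInfo(device_id, CL_DEVICE_MAX_COMPUTE_UNITS,
                sizeof(computeUnits), &computeUnits, NULL);
printf("Compute units: %u\n", computeUnits); /* each unit contains many processing elements */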
12. Work-items/Work-groups
• Work-item = Thread
• Work-items are grouped into Work-groups
• Work-items in the same Work-group can:
• Share Data
• Synchronize
• Work-items can be arranged in one, two, or three dimensions to better match the data structure (a sharing/synchronization kernel sketch follows this list)
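A minimal kernel sketch (hypothetical, not from the deck) of work-items in the same work-group sharing data through __local memory and synchronizing with a barrier:
__kernel void shareWithinGroup(__global const float* in,
                               __global float* out,
                               __local float* tile)
{
    int gid = get_global_id(0);   /* index in the whole index space */
    int lid = get_local_id(0);    /* index within the work-group */
    int lsz = get_local_size(0);  /* work-group size */

    tile[lid] = in[gid];              /* share data via local memory */
    barrier(CLK_LOCAL_MEM_FENCE);     /* every work-item in the group waits here */
    out[gid] = tile[(lid + 1) % lsz]; /* safely read a neighbour's value */
}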
16. Matrix Multiplication
• For matrices A[128,128] and B[128,128]
• Matrix C will have 16384 elements
• We can launch 16384 work-items (threads)
• The work-group size can be set to [16,16]
• So we end up with 64 work-groups (8 × 8) of 256 work-items each
17. Kernel Code
__kernel
void matrixMultiplication(__global float* A, __global float* B, __global float* C,
                          int widthA, int widthB)
{
    /* column index in C; ranges from 0 to 127 in the 128x128 example */
    int i = get_global_id(0);
    /* row index in C; ranges from 0 to 127 in the 128x128 example */
    int j = get_global_id(1);

    float value = 0.0f;
    for (int k = 0; k < widthA; k++)
    {
        /* dot product of row j of A and column i of B */
        value += A[j * widthA + k] * B[k * widthB + i];
    }
    /* C has widthB columns, so element (j, i) is stored at j * widthB + i */
    C[j * widthB + i] = value;
}
18. Host Code
/* Create Kernel Program from the source */
program = clCreateProgramWithSource(context, 1, (const char **)&source_str,
                                    (const size_t *)&source_size, &ret);
/* Build Kernel Program */
ret = clBuildProgram(program, 1, &device_id, NULL, NULL, NULL);
/* Create OpenCL Kernel */
kernel = clCreateKernel(program, "matrixMultiplication", &ret);
/* Set OpenCL Kernel Arguments (order matches the kernel parameters) */
ret = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memobjA);
ret = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&memobjB);
ret = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&memobjC);
ret = clSetKernelArg(kernel, 3, sizeof(int), (void *)&widthA);
ret = clSetKernelArg(kernel, 4, sizeof(int), (void *)&widthB);
/* Execute OpenCL Kernel: one work-item per element of C, in 16x16 work-groups */
size_t globalThreads[2] = {widthB, heightC};
size_t localThreads[2] = {16, 16};
clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL, globalThreads, localThreads, 0, NULL, NULL);
/* Copy the result matrix C (heightC x widthB) from device memory back to the host */
ret = clEnqueueReadBuffer(command_queue, memobjC, CL_TRUE, 0,
                          heightC * widthB * sizeof(float), Res, 0, NULL, NULL);
19. Limitations
• Number of work-items (threads)
• Work-group size (max # of work-items per group, local memory size)
• Data transfer bandwidth
• Device memory size (these limits can be queried at run time, as sketched below)
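A sketch of that query, reusing device_id from the host code above; error checks are omitted:
size_t maxWorkGroupSize;  /* max work-items per work-group */
cl_ulong globalMemSize;   /* device global memory, in bytes */
cl_ulong localMemSize;    /* local memory per work-group, in bytes */
clGetDeviceInfo(device_id, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(maxWorkGroupSize), &maxWorkGroupSize, NULL);
clGetDeviceInfo(device_id, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(globalMemSize), &globalMemSize, NULL);
clGetDeviceInfo(device_id, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(localMemSize), &localMemSize, NULL);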
20. Be careful with…
• Uncoalesced memory access (see the access-pattern sketch at the end of this list)
• Branch divergence
• Access to global memory
• Data transfer between host and device
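A hypothetical kernel fragment (identifiers are illustrative, not from the deck) contrasting a coalesced access, where neighbouring work-items read neighbouring addresses, with a strided, uncoalesced one:
__kernel void accessPatterns(__global const float* in,
                             __global float* out,
                             int stride)
{
    int gid = get_global_id(0);
    /* Coalesced: neighbouring work-items touch consecutive addresses,
       so the reads combine into a few memory transactions */
    float coalesced = in[gid];
    /* Uncoalesced: neighbouring work-items read addresses `stride` apart,
       forcing many separate transactions (assumes in[] is large enough) */
    float strided = in[gid * stride];
    out[gid] = coalesced + strided;
}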