SlideShare a Scribd company logo
1 of 22
The Open Standard for Heterogeneous Parallel Programming
The Kronos Group
https://www.khronos.org
Open means…
Many languages…
• C/C++ - https://www.khronos.org/opencl/
• .NET - http://openclnet.codeplex.com/
• Python - http://mathema.tician.de/software/pyopencl/
• Java - http://www.jocl.org/
• Julia - https://github.com/JuliaGPU/OpenCL.jl
Many platforms…
• AMD - CPUs, APUs, GPUs
• NVIDIA - GPUs
• INTEL - CPUs, GPUs
• APPLE - CPUs
• SAMSUMG - ARM processors
• OTHERS -
https://www.khronos.org/conformance/adopters/conforman
t-products#opencl
Why GPUs?
• Designed for Parallelism - Supports thousands of threads
with no thread management cost
• High Speed
• Low Cost
• Availability
How does it work?
• Host code - Runs on CPU
• Serial code (data pre-processing, sequential algorithms)
• Reads data from input (files, databases, streams)
• Transfers data from host to device (gpu)
• Calls device code (kernels)
• Copies data back from device to host
• Device code - Runs on GPU
• Independent parallel tasks called kernels
• Same task acts on different pieces of data - SIMD - Data Parallelism
• Different tasks act on different pieces of data - MIMD - Task Parallelism
Speed up - Amdahl’s Law
Computing Model
Computing Model
• Compute Device = GPU
• Compute Unit = Processor
• Compute/Processing Element = Processor Core
• A GPU can contain from hundreds up to thousands cores
Memory Model
Work-items/Work-groups
• Work-item = Thread
• Work-items are grouped into Work-groups
• Work-items in the same Work-group can:
• Share Data
• Synchronize
• Map work-items to better match the data structure
Work-items 1D Mapping
Work-items 2D Mapping
Matrix Multiplication
• Matrix A[4,2]
• Matrix B[2,3]
• Matrix C[4,3] = A * B
Matrix Multiplication
• For matrices A[128,128] and B[128,128]
• Matrix C will have 16384 elements
• We can launch 16384 work-items (threads)
• The work-group size can be set to [16,16]
• So we end up with 64 groups of 256 elements each
Kernel Code
__kernel
void matrixMultiplication(__global float* A, __global float* B, __global float* C, int
widthA, int widthB )
{
//will range from 0 to 127
int i = get_global_id(0);
//will range from 0 to 127
int j = get_global_id(1);
float value=0;
for ( int k = 0; k < widthA; k++)
{
value = value + A[k + j * widthA] * B[k*widthB + i];
}
C[i + widthA * j] = value;
}
Host Code
/* Create Kernel Program from the source */
program = clCreateProgramWithSource(context, 1, (const char **)&source_str, (const size_t
*)&source_size, &ret);
/* Build Kernel Program */
ret = clBuildProgram(program, 1, &device_id, NULL, NULL, NULL);
/* Create OpenCL Kernel */
kernel = clCreateKernel(program, "matrixMultiplication", &ret);
/* Set OpenCL Kernel Arguments */
ret = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memobjA);
ret = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&memobjB);
ret = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&memobjC);
ret = clSetKernelArg(kernel, 3, sizeof(int), (void *)&row);
ret = clSetKernelArg(kernel, 4, sizeof(int), (void *)&col);
/* Execute OpenCL Kernel */
size_t globalThreads[2] = {widthA, heightB};
size_t localThreads[2] = {16,16};
clEnqueueNDRangeKernel(command_queue, kernel, 2, 0, globalThreads, localThreads, 0, 0, 0);
/* Copy results from the memory buffer */
ret = clEnqueueReadBuffer(command_queue, memobjC, CL_TRUE, 0, widthA * heightC *
sizeof(float),Res, 0, NULL, NULL);
Limitations
• Number of work-items (threads)
• Group size (# of work-items, memory size)
• Data transfer bandwidth
• Device memory size
Be careful with…
• Uncoalesced memory access
• Branch divergence
• Access to global memory
• Data transfer between host and device
Demo
Thanks!

More Related Content

What's hot

“Vitis and Vitis AI: Application Acceleration from Cloud to Edge,” a Presenta...
“Vitis and Vitis AI: Application Acceleration from Cloud to Edge,” a Presenta...“Vitis and Vitis AI: Application Acceleration from Cloud to Edge,” a Presenta...
“Vitis and Vitis AI: Application Acceleration from Cloud to Edge,” a Presenta...Edge AI and Vision Alliance
 
Introduction to CUDA
Introduction to CUDAIntroduction to CUDA
Introduction to CUDARaymond Tay
 
Qt on Real Time Operating Systems
Qt on Real Time Operating SystemsQt on Real Time Operating Systems
Qt on Real Time Operating Systemsaccount inactive
 
Static partitioning virtualization on RISC-V
Static partitioning virtualization on RISC-VStatic partitioning virtualization on RISC-V
Static partitioning virtualization on RISC-VRISC-V International
 
RDMA programming design and case studies – for better performance distributed...
RDMA programming design and case studies – for better performance distributed...RDMA programming design and case studies – for better performance distributed...
RDMA programming design and case studies – for better performance distributed...NTT Software Innovation Center
 
THE STATE OF OPENTELEMETRY, DOTAN HOROVITS, Logz.io
THE STATE OF OPENTELEMETRY, DOTAN HOROVITS, Logz.ioTHE STATE OF OPENTELEMETRY, DOTAN HOROVITS, Logz.io
THE STATE OF OPENTELEMETRY, DOTAN HOROVITS, Logz.ioDevOpsDays Tel Aviv
 
Understanding the Android System Server
Understanding the Android System ServerUnderstanding the Android System Server
Understanding the Android System ServerOpersys inc.
 
Project ACRN hypervisor introduction
Project ACRN hypervisor introduction Project ACRN hypervisor introduction
Project ACRN hypervisor introduction Project ACRN
 
IPMI is dead, Long live Redfish
IPMI is dead, Long live RedfishIPMI is dead, Long live Redfish
IPMI is dead, Long live RedfishBruno Cornec
 
Implementation &amp; Comparison Of Rdma Over Ethernet
Implementation &amp; Comparison Of Rdma Over EthernetImplementation &amp; Comparison Of Rdma Over Ethernet
Implementation &amp; Comparison Of Rdma Over EthernetJames Wernicke
 
"Accelerating Deep Learning Using Altera FPGAs," a Presentation from Intel
"Accelerating Deep Learning Using Altera FPGAs," a Presentation from Intel"Accelerating Deep Learning Using Altera FPGAs," a Presentation from Intel
"Accelerating Deep Learning Using Altera FPGAs," a Presentation from IntelEdge AI and Vision Alliance
 
FD.IO Vector Packet Processing
FD.IO Vector Packet ProcessingFD.IO Vector Packet Processing
FD.IO Vector Packet ProcessingKernel TLV
 

What's hot (20)

Hacking QNX
Hacking QNXHacking QNX
Hacking QNX
 
“Vitis and Vitis AI: Application Acceleration from Cloud to Edge,” a Presenta...
“Vitis and Vitis AI: Application Acceleration from Cloud to Edge,” a Presenta...“Vitis and Vitis AI: Application Acceleration from Cloud to Edge,” a Presenta...
“Vitis and Vitis AI: Application Acceleration from Cloud to Edge,” a Presenta...
 
Introduction to CUDA
Introduction to CUDAIntroduction to CUDA
Introduction to CUDA
 
Openstack 101
Openstack 101Openstack 101
Openstack 101
 
Fluentd vs. Logstash for OpenStack Log Management
Fluentd vs. Logstash for OpenStack Log ManagementFluentd vs. Logstash for OpenStack Log Management
Fluentd vs. Logstash for OpenStack Log Management
 
Qt on Real Time Operating Systems
Qt on Real Time Operating SystemsQt on Real Time Operating Systems
Qt on Real Time Operating Systems
 
Static partitioning virtualization on RISC-V
Static partitioning virtualization on RISC-VStatic partitioning virtualization on RISC-V
Static partitioning virtualization on RISC-V
 
GPU Computing
GPU ComputingGPU Computing
GPU Computing
 
Introduction to GPU Programming
Introduction to GPU ProgrammingIntroduction to GPU Programming
Introduction to GPU Programming
 
RDMA programming design and case studies – for better performance distributed...
RDMA programming design and case studies – for better performance distributed...RDMA programming design and case studies – for better performance distributed...
RDMA programming design and case studies – for better performance distributed...
 
THE STATE OF OPENTELEMETRY, DOTAN HOROVITS, Logz.io
THE STATE OF OPENTELEMETRY, DOTAN HOROVITS, Logz.ioTHE STATE OF OPENTELEMETRY, DOTAN HOROVITS, Logz.io
THE STATE OF OPENTELEMETRY, DOTAN HOROVITS, Logz.io
 
DPDK KNI interface
DPDK KNI interfaceDPDK KNI interface
DPDK KNI interface
 
Understanding the Android System Server
Understanding the Android System ServerUnderstanding the Android System Server
Understanding the Android System Server
 
Project ACRN hypervisor introduction
Project ACRN hypervisor introduction Project ACRN hypervisor introduction
Project ACRN hypervisor introduction
 
IPMI is dead, Long live Redfish
IPMI is dead, Long live RedfishIPMI is dead, Long live Redfish
IPMI is dead, Long live Redfish
 
GPU Programming
GPU ProgrammingGPU Programming
GPU Programming
 
Understanding DPDK
Understanding DPDKUnderstanding DPDK
Understanding DPDK
 
Implementation &amp; Comparison Of Rdma Over Ethernet
Implementation &amp; Comparison Of Rdma Over EthernetImplementation &amp; Comparison Of Rdma Over Ethernet
Implementation &amp; Comparison Of Rdma Over Ethernet
 
"Accelerating Deep Learning Using Altera FPGAs," a Presentation from Intel
"Accelerating Deep Learning Using Altera FPGAs," a Presentation from Intel"Accelerating Deep Learning Using Altera FPGAs," a Presentation from Intel
"Accelerating Deep Learning Using Altera FPGAs," a Presentation from Intel
 
FD.IO Vector Packet Processing
FD.IO Vector Packet ProcessingFD.IO Vector Packet Processing
FD.IO Vector Packet Processing
 

Similar to OpenCL Heterogeneous Parallel Computing

"Making OpenCV Code Run Fast," a Presentation from Intel
"Making OpenCV Code Run Fast," a Presentation from Intel"Making OpenCV Code Run Fast," a Presentation from Intel
"Making OpenCV Code Run Fast," a Presentation from IntelEdge AI and Vision Alliance
 
C # (C Sharp).pptx
C # (C Sharp).pptxC # (C Sharp).pptx
C # (C Sharp).pptxSnapeSever
 
MattsonTutorialSC14.pptx
MattsonTutorialSC14.pptxMattsonTutorialSC14.pptx
MattsonTutorialSC14.pptxgopikahari7
 
Building High Performance Android Applications in Java and C++
Building High Performance Android Applications in Java and C++Building High Performance Android Applications in Java and C++
Building High Performance Android Applications in Java and C++Kenneth Geisshirt
 
introduction to node.js
introduction to node.jsintroduction to node.js
introduction to node.jsorkaplan
 
Secure coding for developers
Secure coding for developersSecure coding for developers
Secure coding for developerssluge
 
Big Data, Data Lake, Fast Data - Dataserialiation-Formats
Big Data, Data Lake, Fast Data - Dataserialiation-FormatsBig Data, Data Lake, Fast Data - Dataserialiation-Formats
Big Data, Data Lake, Fast Data - Dataserialiation-FormatsGuido Schmutz
 
Tips and tricks for building high performance android apps using native code
Tips and tricks for building high performance android apps using native codeTips and tricks for building high performance android apps using native code
Tips and tricks for building high performance android apps using native codeKenneth Geisshirt
 
Exploring the Programming Models for the LUMI Supercomputer
Exploring the Programming Models for the LUMI Supercomputer Exploring the Programming Models for the LUMI Supercomputer
Exploring the Programming Models for the LUMI Supercomputer George Markomanolis
 
C++ amp on linux
C++ amp on linuxC++ amp on linux
C++ amp on linuxMiller Lee
 
270_1_CIntro_Up_To_Functions.ppt
270_1_CIntro_Up_To_Functions.ppt270_1_CIntro_Up_To_Functions.ppt
270_1_CIntro_Up_To_Functions.pptUdhayaKumar175069
 
Survey of programming language getting started in C
Survey of programming language getting started in CSurvey of programming language getting started in C
Survey of programming language getting started in Cummeafruz
 
270 1 c_intro_up_to_functions
270 1 c_intro_up_to_functions270 1 c_intro_up_to_functions
270 1 c_intro_up_to_functionsray143eddie
 
270_1_CIntro_Up_To_Functions.ppt
270_1_CIntro_Up_To_Functions.ppt270_1_CIntro_Up_To_Functions.ppt
270_1_CIntro_Up_To_Functions.pptAlefya1
 

Similar to OpenCL Heterogeneous Parallel Computing (20)

"Making OpenCV Code Run Fast," a Presentation from Intel
"Making OpenCV Code Run Fast," a Presentation from Intel"Making OpenCV Code Run Fast," a Presentation from Intel
"Making OpenCV Code Run Fast," a Presentation from Intel
 
Intro to C++ - language
Intro to C++ - languageIntro to C++ - language
Intro to C++ - language
 
C # (C Sharp).pptx
C # (C Sharp).pptxC # (C Sharp).pptx
C # (C Sharp).pptx
 
MattsonTutorialSC14.pdf
MattsonTutorialSC14.pdfMattsonTutorialSC14.pdf
MattsonTutorialSC14.pdf
 
MattsonTutorialSC14.pptx
MattsonTutorialSC14.pptxMattsonTutorialSC14.pptx
MattsonTutorialSC14.pptx
 
Building High Performance Android Applications in Java and C++
Building High Performance Android Applications in Java and C++Building High Performance Android Applications in Java and C++
Building High Performance Android Applications in Java and C++
 
introduction to node.js
introduction to node.jsintroduction to node.js
introduction to node.js
 
Secure coding for developers
Secure coding for developersSecure coding for developers
Secure coding for developers
 
25-MPI-OpenMP.pptx
25-MPI-OpenMP.pptx25-MPI-OpenMP.pptx
25-MPI-OpenMP.pptx
 
Implement Runtime Environments for HSA using LLVM
Implement Runtime Environments for HSA using LLVMImplement Runtime Environments for HSA using LLVM
Implement Runtime Environments for HSA using LLVM
 
Big Data, Data Lake, Fast Data - Dataserialiation-Formats
Big Data, Data Lake, Fast Data - Dataserialiation-FormatsBig Data, Data Lake, Fast Data - Dataserialiation-Formats
Big Data, Data Lake, Fast Data - Dataserialiation-Formats
 
Golang
GolangGolang
Golang
 
Tips and tricks for building high performance android apps using native code
Tips and tricks for building high performance android apps using native codeTips and tricks for building high performance android apps using native code
Tips and tricks for building high performance android apps using native code
 
Apache Thrift
Apache ThriftApache Thrift
Apache Thrift
 
Exploring the Programming Models for the LUMI Supercomputer
Exploring the Programming Models for the LUMI Supercomputer Exploring the Programming Models for the LUMI Supercomputer
Exploring the Programming Models for the LUMI Supercomputer
 
C++ amp on linux
C++ amp on linuxC++ amp on linux
C++ amp on linux
 
270_1_CIntro_Up_To_Functions.ppt
270_1_CIntro_Up_To_Functions.ppt270_1_CIntro_Up_To_Functions.ppt
270_1_CIntro_Up_To_Functions.ppt
 
Survey of programming language getting started in C
Survey of programming language getting started in CSurvey of programming language getting started in C
Survey of programming language getting started in C
 
270 1 c_intro_up_to_functions
270 1 c_intro_up_to_functions270 1 c_intro_up_to_functions
270 1 c_intro_up_to_functions
 
270_1_CIntro_Up_To_Functions.ppt
270_1_CIntro_Up_To_Functions.ppt270_1_CIntro_Up_To_Functions.ppt
270_1_CIntro_Up_To_Functions.ppt
 

More from João Paulo Leonidas Fernandes Dias da Silva (7)

Apache spark intro
Apache spark introApache spark intro
Apache spark intro
 
Kafka basics
Kafka basicsKafka basics
Kafka basics
 
Query driven development
Query driven developmentQuery driven development
Query driven development
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
 
Apache Storm Basics
Apache Storm BasicsApache Storm Basics
Apache Storm Basics
 
Unit testing basics
Unit testing basicsUnit testing basics
Unit testing basics
 
Qcon Rio 2015 - Data Lakes Workshop
Qcon Rio 2015 - Data Lakes WorkshopQcon Rio 2015 - Data Lakes Workshop
Qcon Rio 2015 - Data Lakes Workshop
 

Recently uploaded

Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfPower Karaoke
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptkotipi9215
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfkalichargn70th171
 
Engage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyEngage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyFrank van der Linden
 
Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...aditisharan08
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfjoe51371421
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...Christina Lin
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataBradBedford3
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...OnePlan Solutions
 
cybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningcybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningVitsRangannavar
 
XpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software SolutionsXpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software SolutionsMehedi Hasan Shohan
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...soniya singh
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationkaushalgiri8080
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - InfographicHr365.us smith
 

Recently uploaded (20)

Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdf
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.ppt
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
 
Engage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyEngage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The Ugly
 
Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
cybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningcybersecurity notes for mca students for learning
cybersecurity notes for mca students for learning
 
XpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software SolutionsXpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software Solutions
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanation
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 

OpenCL Heterogeneous Parallel Computing

  • 1. The Open Standard for Heterogeneous Parallel Programming
  • 4. Many languages… • C/C++ - https://www.khronos.org/opencl/ • .NET - http://openclnet.codeplex.com/ • Python - http://mathema.tician.de/software/pyopencl/ • Java - http://www.jocl.org/ • Julia - https://github.com/JuliaGPU/OpenCL.jl
  • 5. Many platforms… • AMD - CPUs, APUs, GPUs • NVIDIA - GPUs • INTEL - CPUs, GPUs • APPLE - CPUs • SAMSUMG - ARM processors • OTHERS - https://www.khronos.org/conformance/adopters/conforman t-products#opencl
  • 6. Why GPUs? • Designed for Parallelism - Supports thousands of threads with no thread management cost • High Speed • Low Cost • Availability
  • 7. How does it work? • Host code - Runs on CPU • Serial code (data pre-processing, sequential algorithms) • Reads data from input (files, databases, streams) • Transfers data from host to device (gpu) • Calls device code (kernels) • Copies data back from device to host • Device code - Runs on GPU • Independent parallel tasks called kernels • Same task acts on different pieces of data - SIMD - Data Parallelism • Different tasks act on different pieces of data - MIMD - Task Parallelism
  • 8. Speed up - Amdahl’s Law
  • 10. Computing Model • Compute Device = GPU • Compute Unit = Processor • Compute/Processing Element = Processor Core • A GPU can contain from hundreds up to thousands cores
  • 12. Work-items/Work-groups • Work-item = Thread • Work-items are grouped into Work-groups • Work-items in the same Work-group can: • Share Data • Synchronize • Map work-items to better match the data structure
  • 15. Matrix Multiplication • Matrix A[4,2] • Matrix B[2,3] • Matrix C[4,3] = A * B
  • 16. Matrix Multiplication • For matrices A[128,128] and B[128,128] • Matrix C will have 16384 elements • We can launch 16384 work-items (threads) • The work-group size can be set to [16,16] • So we end up with 64 groups of 256 elements each
  • 17. Kernel Code __kernel void matrixMultiplication(__global float* A, __global float* B, __global float* C, int widthA, int widthB ) { //will range from 0 to 127 int i = get_global_id(0); //will range from 0 to 127 int j = get_global_id(1); float value=0; for ( int k = 0; k < widthA; k++) { value = value + A[k + j * widthA] * B[k*widthB + i]; } C[i + widthA * j] = value; }
  • 18. Host Code /* Create Kernel Program from the source */ program = clCreateProgramWithSource(context, 1, (const char **)&source_str, (const size_t *)&source_size, &ret); /* Build Kernel Program */ ret = clBuildProgram(program, 1, &device_id, NULL, NULL, NULL); /* Create OpenCL Kernel */ kernel = clCreateKernel(program, "matrixMultiplication", &ret); /* Set OpenCL Kernel Arguments */ ret = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memobjA); ret = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&memobjB); ret = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&memobjC); ret = clSetKernelArg(kernel, 3, sizeof(int), (void *)&row); ret = clSetKernelArg(kernel, 4, sizeof(int), (void *)&col); /* Execute OpenCL Kernel */ size_t globalThreads[2] = {widthA, heightB}; size_t localThreads[2] = {16,16}; clEnqueueNDRangeKernel(command_queue, kernel, 2, 0, globalThreads, localThreads, 0, 0, 0); /* Copy results from the memory buffer */ ret = clEnqueueReadBuffer(command_queue, memobjC, CL_TRUE, 0, widthA * heightC * sizeof(float),Res, 0, NULL, NULL);
  • 19. Limitations • Number of work-items (threads) • Group size (# of work-items, memory size) • Data transfer bandwidth • Device memory size
  • 20. Be careful with… • Uncoalesced memory access • Branch divergence • Access to global memory • Data transfer between host and device
  • 21. Demo