SlideShare a Scribd company logo
OpenCL Buffers and Complete Examples Perhaad Mistry & Dana Schaa, Northeastern University Computer Architecture Research Lab, with Benedict R. Gaster, AMD © 2011
Instructor Notes This is a brief lecture which goes into some more details on OpenCL memory objects Describes various flags that can be used to change how data is handled between host and device, like page-locked I/O and so on The aim of this lecture is to cover required OpenCL host code for buffer management and provide simple examples  Code for context and buffer management discussed in examples in this lecture serves as templates for more complicated kernels  This allows the next 3 lectures to be focused solely on kernel optimizations like blocking, thread grouping and so on Examples covered Simple image rotation example Simple non-blocking matrix-matrix multiplication 2 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Topics Using OpenCL buffers Declaring buffers Enqueue reading and writing of buffers Simple but complete examples Image Rotation Non-blocking Matrix Multiplication 3 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Creating OpenCL Buffers Data used by OpenCL devices is stored in a “buffer” on the device An OpenCL buffer object is created using the following function Data can implicitly be copied to the device using a host pointer parameter In this case copy to device is invoked when kernel is enqueued cl_mem  bufferobj = clCreateBuffer (	 			cl_context context,		//Context name 		 	cl_mem_flags flags,		//Memory flags 		 	size_t size,			//Memory size allocated in buffer 		 	void *host_ptr,			//Host data   			cl_int *errcode)			//Returned error code 4 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Memory Flags Memory flag field in clCreateBuffer() allows us to define characteristics of the buffer object  5 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Copying Buffers to Device clEnqueueWriteBuffer() is used to write a buffer object to device memory (from the host) Provides more control over copy process than using host pointer functionality of clCreateBuffer() Allows waiting for events and blocking cl_intclEnqueueWriteBuffer (	 cl_command_queuequeue,			//Command queue to device cl_mem buffer,					//OpenCL Buffer Object cl_boolblocking_read,				//Blocking/Non-Blocking Flag size_t offset,						//Offset into buffer to write to size_t cb,							//Size of data void *ptr,							//Host pointer cl_uintnum_in_wait_list,				//Number of events in wait list const cl_event *  event_wait_list,		//Array of events to wait for cl_event *event)					//Event handler for this function 6 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Copying Buffers to Host clEnqueueReadBuffer() is used to read from a buffer object from device to host memory Similar to clEnqueueWriteBuffer() The vector addition example discussed in Lecture 2 and 3 provide simple code snipped for moving data to and from devices cl_int clEnqueueReadBuffer (	 cl_command_queuequeue,			//Command queue to device cl_mem buffer,					//OpenCL Buffer Object cl_boolblocking_read,				//Blocking/Non-Blocking Flag size_t offset,						//Offset to copy from size_t cb,							//Size of data void *ptr,							//Host pointer cl_uintnum_in_wait_list,				//Number of events in wait list const cl_event *  event_wait_list,		//Array of events to wait for cl_event *event)					//Event handler for this function 7 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Example 1 - Image Rotation A common image processing routine  Applications in matching, alignment, etc. New coordinates of point (x1,y1) when rotated  by an angle Θ around (x0,y0) By rotating the image about the origin (0,0) we get  Each coordinate for every point in the image can be calculated independently Original Image Rotated Image (90o) 8 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Image Rotation Input: To copy to device Image (2D Matrix of floats) Rotation parameters Image dimensions Output: From device Rotated Image Main Steps Copy image to device by enqueueing a write to a buffer on the device from the host Run the Image rotation  kernel on input image Copy output image to host by enqueueing a read from a buffer on the device 9 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
The OpenCL Kernel Parallel portion of the algorithm off-loaded to device Most thought provoking part of coding process Steps to be done in Image Rotation kernel Obtain coordinates of work item in work group Read rotation parameters Calculate destination coordinates Read input and write rotated output at calculated coordinates Parallel kernel is not always this obvious.  Profiling of an application is often necessary to find the bottlenecks and locate the data parallelism In this example grid of output image decomposed into work items  Not all parts of the input image copied to the output image after rotation, corners of I/P image could be lost after rotation 10 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
OpenCL Kernel __kernel  void image_rotate( 	__global float * src_data, __global float * dest_data,	//Data in global memory int W,    int H,								//Image Dimensions 	float sinTheta, float cosTheta )					//Rotation Parameters {     	//Thread gets its index within index space 	const int ix = get_global_id(0);  	const intiy = get_global_id(1);     	//Calculate location of data to move into ix and iy– Output decomposition as mentioned 	float xpos = (  ((float) ix)*cosTheta + ((float)iy )*sinTheta);     	float ypos = (  ((float) iy)*cosTheta -  ((float)ix)*sinTheta);     	if ((	 ((int)xpos>=0) && ((int)xpos< W)))		//Bound Checking  		&& (((int)ypos>=0) && ((int)ypos< H)))   	{ 		//Read (xpos,ypos) src_data and store at (ix,iy) in dest_data dest_data[iy*W+ix]=  src_data[(int)(floor(ypos*W+xpos))];  	 } } 11 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Step0: Initialize Device Query Platform Declare context  Choose a device from context Using device and context create a  command queue Platform Layer Query Devices Command Queue cl_context myctx = clCreateContextFromType (			   0, CL_DEVICE_TYPE_GPU,  			   NULL, NULL, &ciErrNum); Create Buffers Compile Program ciErrNum = clGetDeviceIDs(0,  			CL_DEVICE_TYPE_GPU,  			1, &device,  cl_uint *num_devices) Compiler Compile Kernel Set Arguments cl_commandqueuemyqueue ; myqueue = clCreateCommandQueue(  myctx, device, 0, &ciErrNum); Runtime Layer Execute Kernel 12 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Step1: Create Buffers Create buffers on device Input data is read-only Output data is write-only Transfer input data to the device Query Platform Platform Layer Query Devices cl_mem d_ip = clCreateBuffer( myctx, CL_MEM_READ_ONLY, mem_size,  			NULL, &ciErrNum); Command Queue Create Buffers cl_memd_op = clCreateBuffer( myctx, CL_MEM_WRITE_ONLY, mem_size,  			NULL, &ciErrNum); Compile Program Compiler Compile Kernel ciErrNum = clEnqueueWriteBuffer(	 myqueue , d_ip, CL_TRUE, 		0, mem_size, (void *)src_image, 		0, NULL,  NULL) Set Arguments Runtime Layer Execute Kernel 13 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Step2: Build Program, Select Kernel Query Platform  // create the program    cl_programmyprog = clCreateProgramWithSource( myctx,1, (const char **)&source,  				&program_length, &ciErrNum); Platform Layer Query Devices Command Queue // build the program     ciErrNum = clBuildProgram( myprog, 0,  				NULL, NULL, NULL, NULL); Create Buffers Compile Program Compiler //Use the “image_rotate” function as the kernel  cl_kernelmykernel = clCreateKernel ( myprog , “image_rotate” , error_code) Compile Kernel Set Arguments Runtime Layer Execute Kernel 14 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Step3: Set Arguments, Enqueue Kernel // Set Arguments clSetKernelArg(mykernel, 0, sizeof(cl_mem), 		 (void *)&d_ip);  clSetKernelArg(mykernel, 1, sizeof(cl_mem),  		(void *)&d_op); clSetKernelArg(mykernel, 2, sizeof(cl_int), 		 (void *)&W); ... //Set local and global workgroup sizes size_t localws[2] = {16,16} ;  size_t globalws[2] = {W, H};//Assume divisible by 16 // execute kernel clEnqueueNDRangeKernel(  myqueue , myKernel,  		2, 0, globalws, localws,                  0, NULL, NULL); Query Platform Platform Layer Query Devices Command Queue Create Buffers Compile Program Compiler Compile Kernel Set Arguments Runtime Layer Execute Kernel 15 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Step4: Read Back Result Only necessary for data  required on the host Data output from one kernel can be reused for another kernel  Avoid redundant host-device IO Query Platform Platform Layer Query Devices Command Queue Create Buffers Compile Program // copy results from device back to host clEnqueueReadBuffer( myctx, d_op,  		CL_TRUE,		//Blocking Read Back 		0, mem_size,  (void *) op_data,                     NULL, NULL, NULL); Compiler Compile Kernel Set Arguments Runtime Layer Execute Kernel 16 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
OpenCL Timing OpenCL provides “events” which can be used for timing kernels Events will be discussed in detail in Lecture 11 We pass an event to the OpenCL enqueue kernel function to capture timestamps Code snippet provided can be used to time a kernel Add profiling enable flag to create command queue   By taking differences of the start and end timestamps we discount  overheads like time spent in the command queue cl_event event_timer; clEnqueueNDRangeKernel(  myqueue , myKernel,  		2, 0, globalws, localws,                  	0, NULL, &event_timer); unsigned long starttime, endtime; clGetEventProfilingInfo( event_time,  	CL_PROFILING_COMMAND_START,  sizeof(cl_ulong), &starttime, NULL); clGetEventProfilingInfo(event_time, 	CL_PROFILING_COMMAND_END,  sizeof(cl_ulong), &endtime, NULL); unsigned long elapsed =   (unsigned long)(endtime - starttime); 17 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Example 2Matrix Multiplication
Basic Matrix Multiplication Non-blocking matrix multiplication Doesn’t use local memory Each element of matrix reads its own data independently Serial matrix multiplication Reuse code from image rotation Create context, command queues and compile program Only need one more input memory object for 2nd matrix for(inti = 0; i < Ha; i++) for(intj = 0; j < Wb; j++){ c[i][j] = 0; for(intk = 0; k < Wa; k++)	 c[i][j] +=  a[i][k] + b[k][j] 	} 19 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Simple Matrix Multiplication Wb __kernel void simpleMultiply(  		__global float* c, intWa, intWb,  		__global float* a, __global float* b) { 	//Get global position in Y direction int row = get_global_id(1); 	//Get global position in X direction intcol   = get_global_id(0);  	float sum = 0.0f;  	//Calculate result of one element 	for (inti = 0; i < Wa; i++)  { 		sum +=  a[row*Wa+i] * b[i*Wb+col]; 	} c[row*Wb+col] = sum; } B Hb col A C row Ha Wb Wa 20 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Summary We have studied the use of OpenCL buffer objects  A complete program in OpenCL has been written We have understood how an OpenCL work-item can be used to work on a single output element (seen with rotation and matrix multiplication) While the previously discussed examples are correct data parallel programs their performance can be drastically improved Next Lecture Study the GPU memory subsystem to understand how data must be managed to obtain performance for data parallel programs Understand possible optimizations for programs running on data parallel hardware like GPUs 21 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011

More Related Content

What's hot

Real-time applications on IntelXeon/Phi
Real-time applications on IntelXeon/PhiReal-time applications on IntelXeon/Phi
Real-time applications on IntelXeon/Phi
Karel Ha
 
Task and Data Parallelism
Task and Data ParallelismTask and Data Parallelism
Task and Data Parallelism
Sasha Goldshtein
 
Adam Sitnik "State of the .NET Performance"
Adam Sitnik "State of the .NET Performance"Adam Sitnik "State of the .NET Performance"
Adam Sitnik "State of the .NET Performance"
Yulia Tsisyk
 
OpenGL 4.4 - Scene Rendering Techniques
OpenGL 4.4 - Scene Rendering TechniquesOpenGL 4.4 - Scene Rendering Techniques
OpenGL 4.4 - Scene Rendering Techniques
Narann29
 
Making a Process
Making a ProcessMaking a Process
Making a Process
David Evans
 
Putting a Fork in Fork (Linux Process and Memory Management)
Putting a Fork in Fork (Linux Process and Memory Management)Putting a Fork in Fork (Linux Process and Memory Management)
Putting a Fork in Fork (Linux Process and Memory Management)
David Evans
 
Solving Endgames in Large Imperfect-Information Games such as Poker
Solving Endgames in Large Imperfect-Information Games such as PokerSolving Endgames in Large Imperfect-Information Games such as Poker
Solving Endgames in Large Imperfect-Information Games such as Poker
Karel Ha
 
ikh331-06-distributed-programming
ikh331-06-distributed-programmingikh331-06-distributed-programming
ikh331-06-distributed-programming
Anung Ariwibowo
 
Parallel k nn on gpu architecture using opencl
Parallel k nn on gpu architecture using openclParallel k nn on gpu architecture using opencl
Parallel k nn on gpu architecture using opencl
eSAT Publishing House
 
Microkernel Development
Microkernel DevelopmentMicrokernel Development
Microkernel Development
Rodrigo Almeida
 
Salt Identification Challenge
Salt Identification ChallengeSalt Identification Challenge
Salt Identification Challenge
kenluck2001
 
Crossing into Kernel Space
Crossing into Kernel SpaceCrossing into Kernel Space
Crossing into Kernel Space
David Evans
 
Efficient Dynamic Scheduling Algorithm for Real-Time MultiCore Systems
Efficient Dynamic Scheduling Algorithm for Real-Time MultiCore Systems Efficient Dynamic Scheduling Algorithm for Real-Time MultiCore Systems
Efficient Dynamic Scheduling Algorithm for Real-Time MultiCore Systems
iosrjce
 
Scheduling
SchedulingScheduling
Scheduling
David Evans
 
Trelles_QnormBOSC2009
Trelles_QnormBOSC2009Trelles_QnormBOSC2009
Trelles_QnormBOSC2009
bosc
 
Compressed Sensing using Generative Model
Compressed Sensing using Generative ModelCompressed Sensing using Generative Model
Compressed Sensing using Generative Model
kenluck2001
 
Virtual Memory (Making a Process)
Virtual Memory (Making a Process)Virtual Memory (Making a Process)
Virtual Memory (Making a Process)
David Evans
 
Building High-Performance Language Implementations With Low Effort
Building High-Performance Language Implementations With Low EffortBuilding High-Performance Language Implementations With Low Effort
Building High-Performance Language Implementations With Low Effort
Stefan Marr
 
An introduction to thrust CUDA
An introduction to thrust CUDAAn introduction to thrust CUDA
An introduction to thrust CUDA
Adrien Wattez
 

What's hot (19)

Real-time applications on IntelXeon/Phi
Real-time applications on IntelXeon/PhiReal-time applications on IntelXeon/Phi
Real-time applications on IntelXeon/Phi
 
Task and Data Parallelism
Task and Data ParallelismTask and Data Parallelism
Task and Data Parallelism
 
Adam Sitnik "State of the .NET Performance"
Adam Sitnik "State of the .NET Performance"Adam Sitnik "State of the .NET Performance"
Adam Sitnik "State of the .NET Performance"
 
OpenGL 4.4 - Scene Rendering Techniques
OpenGL 4.4 - Scene Rendering TechniquesOpenGL 4.4 - Scene Rendering Techniques
OpenGL 4.4 - Scene Rendering Techniques
 
Making a Process
Making a ProcessMaking a Process
Making a Process
 
Putting a Fork in Fork (Linux Process and Memory Management)
Putting a Fork in Fork (Linux Process and Memory Management)Putting a Fork in Fork (Linux Process and Memory Management)
Putting a Fork in Fork (Linux Process and Memory Management)
 
Solving Endgames in Large Imperfect-Information Games such as Poker
Solving Endgames in Large Imperfect-Information Games such as PokerSolving Endgames in Large Imperfect-Information Games such as Poker
Solving Endgames in Large Imperfect-Information Games such as Poker
 
ikh331-06-distributed-programming
ikh331-06-distributed-programmingikh331-06-distributed-programming
ikh331-06-distributed-programming
 
Parallel k nn on gpu architecture using opencl
Parallel k nn on gpu architecture using openclParallel k nn on gpu architecture using opencl
Parallel k nn on gpu architecture using opencl
 
Microkernel Development
Microkernel DevelopmentMicrokernel Development
Microkernel Development
 
Salt Identification Challenge
Salt Identification ChallengeSalt Identification Challenge
Salt Identification Challenge
 
Crossing into Kernel Space
Crossing into Kernel SpaceCrossing into Kernel Space
Crossing into Kernel Space
 
Efficient Dynamic Scheduling Algorithm for Real-Time MultiCore Systems
Efficient Dynamic Scheduling Algorithm for Real-Time MultiCore Systems Efficient Dynamic Scheduling Algorithm for Real-Time MultiCore Systems
Efficient Dynamic Scheduling Algorithm for Real-Time MultiCore Systems
 
Scheduling
SchedulingScheduling
Scheduling
 
Trelles_QnormBOSC2009
Trelles_QnormBOSC2009Trelles_QnormBOSC2009
Trelles_QnormBOSC2009
 
Compressed Sensing using Generative Model
Compressed Sensing using Generative ModelCompressed Sensing using Generative Model
Compressed Sensing using Generative Model
 
Virtual Memory (Making a Process)
Virtual Memory (Making a Process)Virtual Memory (Making a Process)
Virtual Memory (Making a Process)
 
Building High-Performance Language Implementations With Low Effort
Building High-Performance Language Implementations With Low EffortBuilding High-Performance Language Implementations With Low Effort
Building High-Performance Language Implementations With Low Effort
 
An introduction to thrust CUDA
An introduction to thrust CUDAAn introduction to thrust CUDA
An introduction to thrust CUDA
 

Similar to Lec05 buffers basic_examples

srgoc
srgocsrgoc
Handling Exceptions In C &amp; C++ [Part B] Ver 2
Handling Exceptions In C &amp; C++ [Part B] Ver 2Handling Exceptions In C &amp; C++ [Part B] Ver 2
Handling Exceptions In C &amp; C++ [Part B] Ver 2
ppd1961
 
Track c-High speed transaction-based hw-sw coverification -eve
Track c-High speed transaction-based hw-sw coverification -eveTrack c-High speed transaction-based hw-sw coverification -eve
Track c-High speed transaction-based hw-sw coverification -eve
chiportal
 
Book
BookBook
Book
luis_lmro
 
MattsonTutorialSC14.pptx
MattsonTutorialSC14.pptxMattsonTutorialSC14.pptx
MattsonTutorialSC14.pptx
gopikahari7
 
Tema3_Introduction_to_CUDA_C.pdf
Tema3_Introduction_to_CUDA_C.pdfTema3_Introduction_to_CUDA_C.pdf
Tema3_Introduction_to_CUDA_C.pdf
pepe464163
 
Objective c beginner's guide
Objective c beginner's guideObjective c beginner's guide
Objective c beginner's guide
Tiago Faller
 
Introduction to Data Structures, Data Structures using C.pptx
Introduction to Data Structures, Data Structures using C.pptxIntroduction to Data Structures, Data Structures using C.pptx
Introduction to Data Structures, Data Structures using C.pptx
poongothai11
 
MattsonTutorialSC14.pdf
MattsonTutorialSC14.pdfMattsonTutorialSC14.pdf
MattsonTutorialSC14.pdf
George Papaioannou
 
ISCA Final Presentaiton - Compilations
ISCA Final Presentaiton -  CompilationsISCA Final Presentaiton -  Compilations
ISCA Final Presentaiton - Compilations
HSA Foundation
 
Gpu and The Brick Wall
Gpu and The Brick WallGpu and The Brick Wall
Gpu and The Brick Wall
ugur candan
 
Cuda 2011
Cuda 2011Cuda 2011
Cuda 2011
coolmirza143
 
Data structure week 1
Data structure week 1Data structure week 1
Data structure week 1
karmuhtam
 
Scope Stack Allocation
Scope Stack AllocationScope Stack Allocation
Scope Stack Allocation
Electronic Arts / DICE
 
finalprojtemplatev5finalprojtemplate.gitignore# Ignore the b
finalprojtemplatev5finalprojtemplate.gitignore# Ignore the bfinalprojtemplatev5finalprojtemplate.gitignore# Ignore the b
finalprojtemplatev5finalprojtemplate.gitignore# Ignore the b
ChereCheek752
 
Dynamic Memory Allocation.pptx
Dynamic Memory Allocation.pptxDynamic Memory Allocation.pptx
Dynamic Memory Allocation.pptx
ssuser688516
 
Samsung WebCL Prototype API
Samsung WebCL Prototype APISamsung WebCL Prototype API
Samsung WebCL Prototype API
Ryo Jin
 
開放運算&GPU技術研究班
開放運算&GPU技術研究班開放運算&GPU技術研究班
開放運算&GPU技術研究班
Paul Chao
 
C# Unit 2 notes
C# Unit 2 notesC# Unit 2 notes
C# Unit 2 notes
Sudarshan Dhondaley
 
DevOps Enabling Your Team
DevOps Enabling Your TeamDevOps Enabling Your Team
DevOps Enabling Your Team
GR8Conf
 

Similar to Lec05 buffers basic_examples (20)

srgoc
srgocsrgoc
srgoc
 
Handling Exceptions In C &amp; C++ [Part B] Ver 2
Handling Exceptions In C &amp; C++ [Part B] Ver 2Handling Exceptions In C &amp; C++ [Part B] Ver 2
Handling Exceptions In C &amp; C++ [Part B] Ver 2
 
Track c-High speed transaction-based hw-sw coverification -eve
Track c-High speed transaction-based hw-sw coverification -eveTrack c-High speed transaction-based hw-sw coverification -eve
Track c-High speed transaction-based hw-sw coverification -eve
 
Book
BookBook
Book
 
MattsonTutorialSC14.pptx
MattsonTutorialSC14.pptxMattsonTutorialSC14.pptx
MattsonTutorialSC14.pptx
 
Tema3_Introduction_to_CUDA_C.pdf
Tema3_Introduction_to_CUDA_C.pdfTema3_Introduction_to_CUDA_C.pdf
Tema3_Introduction_to_CUDA_C.pdf
 
Objective c beginner's guide
Objective c beginner's guideObjective c beginner's guide
Objective c beginner's guide
 
Introduction to Data Structures, Data Structures using C.pptx
Introduction to Data Structures, Data Structures using C.pptxIntroduction to Data Structures, Data Structures using C.pptx
Introduction to Data Structures, Data Structures using C.pptx
 
MattsonTutorialSC14.pdf
MattsonTutorialSC14.pdfMattsonTutorialSC14.pdf
MattsonTutorialSC14.pdf
 
ISCA Final Presentaiton - Compilations
ISCA Final Presentaiton -  CompilationsISCA Final Presentaiton -  Compilations
ISCA Final Presentaiton - Compilations
 
Gpu and The Brick Wall
Gpu and The Brick WallGpu and The Brick Wall
Gpu and The Brick Wall
 
Cuda 2011
Cuda 2011Cuda 2011
Cuda 2011
 
Data structure week 1
Data structure week 1Data structure week 1
Data structure week 1
 
Scope Stack Allocation
Scope Stack AllocationScope Stack Allocation
Scope Stack Allocation
 
finalprojtemplatev5finalprojtemplate.gitignore# Ignore the b
finalprojtemplatev5finalprojtemplate.gitignore# Ignore the bfinalprojtemplatev5finalprojtemplate.gitignore# Ignore the b
finalprojtemplatev5finalprojtemplate.gitignore# Ignore the b
 
Dynamic Memory Allocation.pptx
Dynamic Memory Allocation.pptxDynamic Memory Allocation.pptx
Dynamic Memory Allocation.pptx
 
Samsung WebCL Prototype API
Samsung WebCL Prototype APISamsung WebCL Prototype API
Samsung WebCL Prototype API
 
開放運算&GPU技術研究班
開放運算&GPU技術研究班開放運算&GPU技術研究班
開放運算&GPU技術研究班
 
C# Unit 2 notes
C# Unit 2 notesC# Unit 2 notes
C# Unit 2 notes
 
DevOps Enabling Your Team
DevOps Enabling Your TeamDevOps Enabling Your Team
DevOps Enabling Your Team
 

Recently uploaded

Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Tatiana Kojar
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
ssuserfac0301
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
Tomaz Bratanic
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Wask
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Safe Software
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
Brandon Minnick, MBA
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
akankshawande
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
Zilliz
 
A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024
Intelisync
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
Jakub Marek
 
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...Letter and Document Automation for Bonterra Impact Management (fka Social Sol...
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...
Jeffrey Haguewood
 
Trusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process MiningTrusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process Mining
LucaBarbaro3
 
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdfNunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf
flufftailshop
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
Hiroshi SHIBATA
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 
Recommendation System using RAG Architecture
Recommendation System using RAG ArchitectureRecommendation System using RAG Architecture
Recommendation System using RAG Architecture
fredae14
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
saastr
 
Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr
Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStrDeep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr
Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr
saastr
 
Azure API Management to expose backend services securely
Azure API Management to expose backend services securelyAzure API Management to expose backend services securely
Azure API Management to expose backend services securely
Dinusha Kumarasiri
 
Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!
GDSC PJATK
 

Recently uploaded (20)

Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
 
A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
 
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...Letter and Document Automation for Bonterra Impact Management (fka Social Sol...
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...
 
Trusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process MiningTrusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process Mining
 
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdfNunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
Recommendation System using RAG Architecture
Recommendation System using RAG ArchitectureRecommendation System using RAG Architecture
Recommendation System using RAG Architecture
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
 
Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr
Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStrDeep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr
Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr
 
Azure API Management to expose backend services securely
Azure API Management to expose backend services securelyAzure API Management to expose backend services securely
Azure API Management to expose backend services securely
 
Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!
 

Lec05 buffers basic_examples

  • 1. OpenCL Buffers and Complete Examples Perhaad Mistry & Dana Schaa, Northeastern University Computer Architecture Research Lab, with Benedict R. Gaster, AMD © 2011
  • 2. Instructor Notes This is a brief lecture which goes into some more details on OpenCL memory objects Describes various flags that can be used to change how data is handled between host and device, like page-locked I/O and so on The aim of this lecture is to cover required OpenCL host code for buffer management and provide simple examples Code for context and buffer management discussed in examples in this lecture serves as templates for more complicated kernels This allows the next 3 lectures to be focused solely on kernel optimizations like blocking, thread grouping and so on Examples covered Simple image rotation example Simple non-blocking matrix-matrix multiplication 2 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 3. Topics Using OpenCL buffers Declaring buffers Enqueue reading and writing of buffers Simple but complete examples Image Rotation Non-blocking Matrix Multiplication 3 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 4. Creating OpenCL Buffers Data used by OpenCL devices is stored in a “buffer” on the device An OpenCL buffer object is created using the following function Data can implicitly be copied to the device using a host pointer parameter In this case copy to device is invoked when kernel is enqueued cl_mem bufferobj = clCreateBuffer ( cl_context context, //Context name cl_mem_flags flags, //Memory flags size_t size, //Memory size allocated in buffer void *host_ptr, //Host data cl_int *errcode) //Returned error code 4 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 5. Memory Flags Memory flag field in clCreateBuffer() allows us to define characteristics of the buffer object 5 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 6. Copying Buffers to Device clEnqueueWriteBuffer() is used to write a buffer object to device memory (from the host) Provides more control over copy process than using host pointer functionality of clCreateBuffer() Allows waiting for events and blocking cl_intclEnqueueWriteBuffer ( cl_command_queuequeue, //Command queue to device cl_mem buffer, //OpenCL Buffer Object cl_boolblocking_read, //Blocking/Non-Blocking Flag size_t offset, //Offset into buffer to write to size_t cb, //Size of data void *ptr, //Host pointer cl_uintnum_in_wait_list, //Number of events in wait list const cl_event * event_wait_list, //Array of events to wait for cl_event *event) //Event handler for this function 6 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 7. Copying Buffers to Host clEnqueueReadBuffer() is used to read from a buffer object from device to host memory Similar to clEnqueueWriteBuffer() The vector addition example discussed in Lecture 2 and 3 provide simple code snipped for moving data to and from devices cl_int clEnqueueReadBuffer ( cl_command_queuequeue, //Command queue to device cl_mem buffer, //OpenCL Buffer Object cl_boolblocking_read, //Blocking/Non-Blocking Flag size_t offset, //Offset to copy from size_t cb, //Size of data void *ptr, //Host pointer cl_uintnum_in_wait_list, //Number of events in wait list const cl_event * event_wait_list, //Array of events to wait for cl_event *event) //Event handler for this function 7 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 8. Example 1 - Image Rotation A common image processing routine Applications in matching, alignment, etc. New coordinates of point (x1,y1) when rotated by an angle Θ around (x0,y0) By rotating the image about the origin (0,0) we get Each coordinate for every point in the image can be calculated independently Original Image Rotated Image (90o) 8 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 9. Image Rotation Input: To copy to device Image (2D Matrix of floats) Rotation parameters Image dimensions Output: From device Rotated Image Main Steps Copy image to device by enqueueing a write to a buffer on the device from the host Run the Image rotation kernel on input image Copy output image to host by enqueueing a read from a buffer on the device 9 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 10. The OpenCL Kernel Parallel portion of the algorithm off-loaded to device Most thought provoking part of coding process Steps to be done in Image Rotation kernel Obtain coordinates of work item in work group Read rotation parameters Calculate destination coordinates Read input and write rotated output at calculated coordinates Parallel kernel is not always this obvious. Profiling of an application is often necessary to find the bottlenecks and locate the data parallelism In this example grid of output image decomposed into work items Not all parts of the input image copied to the output image after rotation, corners of I/P image could be lost after rotation 10 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 11. OpenCL Kernel __kernel void image_rotate( __global float * src_data, __global float * dest_data, //Data in global memory int W, int H, //Image Dimensions float sinTheta, float cosTheta ) //Rotation Parameters { //Thread gets its index within index space const int ix = get_global_id(0); const intiy = get_global_id(1); //Calculate location of data to move into ix and iy– Output decomposition as mentioned float xpos = ( ((float) ix)*cosTheta + ((float)iy )*sinTheta); float ypos = ( ((float) iy)*cosTheta - ((float)ix)*sinTheta); if (( ((int)xpos>=0) && ((int)xpos< W))) //Bound Checking && (((int)ypos>=0) && ((int)ypos< H))) { //Read (xpos,ypos) src_data and store at (ix,iy) in dest_data dest_data[iy*W+ix]= src_data[(int)(floor(ypos*W+xpos))]; } } 11 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 12. Step0: Initialize Device Query Platform Declare context Choose a device from context Using device and context create a command queue Platform Layer Query Devices Command Queue cl_context myctx = clCreateContextFromType ( 0, CL_DEVICE_TYPE_GPU, NULL, NULL, &ciErrNum); Create Buffers Compile Program ciErrNum = clGetDeviceIDs(0, CL_DEVICE_TYPE_GPU, 1, &device, cl_uint *num_devices) Compiler Compile Kernel Set Arguments cl_commandqueuemyqueue ; myqueue = clCreateCommandQueue( myctx, device, 0, &ciErrNum); Runtime Layer Execute Kernel 12 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 13. Step1: Create Buffers Create buffers on device Input data is read-only Output data is write-only Transfer input data to the device Query Platform Platform Layer Query Devices cl_mem d_ip = clCreateBuffer( myctx, CL_MEM_READ_ONLY, mem_size, NULL, &ciErrNum); Command Queue Create Buffers cl_memd_op = clCreateBuffer( myctx, CL_MEM_WRITE_ONLY, mem_size, NULL, &ciErrNum); Compile Program Compiler Compile Kernel ciErrNum = clEnqueueWriteBuffer( myqueue , d_ip, CL_TRUE, 0, mem_size, (void *)src_image, 0, NULL, NULL) Set Arguments Runtime Layer Execute Kernel 13 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 14. Step2: Build Program, Select Kernel Query Platform // create the program cl_programmyprog = clCreateProgramWithSource( myctx,1, (const char **)&source, &program_length, &ciErrNum); Platform Layer Query Devices Command Queue // build the program ciErrNum = clBuildProgram( myprog, 0, NULL, NULL, NULL, NULL); Create Buffers Compile Program Compiler //Use the “image_rotate” function as the kernel cl_kernelmykernel = clCreateKernel ( myprog , “image_rotate” , error_code) Compile Kernel Set Arguments Runtime Layer Execute Kernel 14 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 15. Step3: Set Arguments, Enqueue Kernel // Set Arguments clSetKernelArg(mykernel, 0, sizeof(cl_mem), (void *)&d_ip); clSetKernelArg(mykernel, 1, sizeof(cl_mem), (void *)&d_op); clSetKernelArg(mykernel, 2, sizeof(cl_int), (void *)&W); ... //Set local and global workgroup sizes size_t localws[2] = {16,16} ; size_t globalws[2] = {W, H};//Assume divisible by 16 // execute kernel clEnqueueNDRangeKernel( myqueue , myKernel, 2, 0, globalws, localws, 0, NULL, NULL); Query Platform Platform Layer Query Devices Command Queue Create Buffers Compile Program Compiler Compile Kernel Set Arguments Runtime Layer Execute Kernel 15 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 16. Step4: Read Back Result Only necessary for data required on the host Data output from one kernel can be reused for another kernel Avoid redundant host-device IO Query Platform Platform Layer Query Devices Command Queue Create Buffers Compile Program // copy results from device back to host clEnqueueReadBuffer( myctx, d_op, CL_TRUE, //Blocking Read Back 0, mem_size, (void *) op_data, NULL, NULL, NULL); Compiler Compile Kernel Set Arguments Runtime Layer Execute Kernel 16 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 17. OpenCL Timing OpenCL provides “events” which can be used for timing kernels Events will be discussed in detail in Lecture 11 We pass an event to the OpenCL enqueue kernel function to capture timestamps Code snippet provided can be used to time a kernel Add profiling enable flag to create command queue By taking differences of the start and end timestamps we discount overheads like time spent in the command queue cl_event event_timer; clEnqueueNDRangeKernel( myqueue , myKernel, 2, 0, globalws, localws, 0, NULL, &event_timer); unsigned long starttime, endtime; clGetEventProfilingInfo( event_time, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &starttime, NULL); clGetEventProfilingInfo(event_time, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &endtime, NULL); unsigned long elapsed = (unsigned long)(endtime - starttime); 17 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 19. Basic Matrix Multiplication Non-blocking matrix multiplication Doesn’t use local memory Each element of matrix reads its own data independently Serial matrix multiplication Reuse code from image rotation Create context, command queues and compile program Only need one more input memory object for 2nd matrix for(inti = 0; i < Ha; i++) for(intj = 0; j < Wb; j++){ c[i][j] = 0; for(intk = 0; k < Wa; k++) c[i][j] += a[i][k] + b[k][j] } 19 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 20. Simple Matrix Multiplication Wb __kernel void simpleMultiply( __global float* c, intWa, intWb, __global float* a, __global float* b) { //Get global position in Y direction int row = get_global_id(1); //Get global position in X direction intcol = get_global_id(0); float sum = 0.0f; //Calculate result of one element for (inti = 0; i < Wa; i++) { sum += a[row*Wa+i] * b[i*Wb+col]; } c[row*Wb+col] = sum; } B Hb col A C row Ha Wb Wa 20 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 21. Summary We have studied the use of OpenCL buffer objects A complete program in OpenCL has been written We have understood how an OpenCL work-item can be used to work on a single output element (seen with rotation and matrix multiplication) While the previously discussed examples are correct data parallel programs their performance can be drastically improved Next Lecture Study the GPU memory subsystem to understand how data must be managed to obtain performance for data parallel programs Understand possible optimizations for programs running on data parallel hardware like GPUs 21 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011

Editor's Notes

  1. OpenCL allows specifying a host pointer when we declare a object. In this case copy to device is invoked when kernel is enqueued.
  2. These memory flags are used to specify how a buffer is allocated
  3. Point to mention here is that OpenCL event handling capabilities are not available when we do copying using the host pointer parameter of clcreatebuffer because the copy is done implicitly.So if profiling is going to be used, this function call is necessary instead of passing host pointers
  4. Similar to clEnqueueWriteBuffer
  5. Simple data parallel algorithm and exampleEach work item handles one pixel.
  6. Pseudo-code for OpenCL kernels
  7. This is a simple example and improvements are possible in the accessing of memory because reads will not be perfectly aligned.However the aim of this example is to show how a data – space like a image can be mapped to a grid of work items and how each work item can process its piece of data independently of other work items.
  8. This is the essential boiler plate code that can be reused for any OpenCL program.This creates the context, command queue and picks a device
  9. IP and OP buffers
  10. These steps can be reused for all OpenCL functions.
  11. Execute kernel on GPU
  12. Read back output
  13. By taking the difference between start and end times, we can check execution duration of kernels without overhead as gettimeofday would have
  14. Simple serial 3-loop matrix multiplication
  15. OpenCL getting started guide example.This example also shows how each work item can use its own section of input data (Matrix A and B ) to calculate the result for its own position in Matrix C