Stevina Dias, Sherrin Benjamin, Mitchell D'silva, Lynette Lopes / International Journal of Engineering Research and Applications (IJERA), ISSN: 2248-9622, Vol. 3, Issue 1, January-February 2013, pp. 225-230

GPU Programming Models

Stevina Dias*, Sherrin Benjamin*, Mitchell D'silva*, Lynette Lopes*
*Assistant Professor, Dwarkadas J Sanghavi College of Engineering, Vile Parle

Abstract
The CPU, the brains of the computer, is being enhanced by another part of the computer: the Graphics Processing Unit (GPU), which is its soul. GPUs are optimized for taking huge batches of data and performing the same operation repeatedly and quickly, which is not the case with PC microprocessors. A CPU combined with a GPU can deliver better system performance, price, and power. Version 2 of Microsoft's Accelerator offers an efficient way for applications to implement array-processing operations. These operations use the parallel processing capabilities of multi-processor computers. CUDA is a parallel computing platform and programming model created by NVIDIA. CUDA provides both lower- and higher-level APIs.

Keywords: GPU, GPU Programming, CUDA, CUDA API, CGMA ratio, CUDA Memory Model, Accelerator, Data Parallel Arrays

I. INTRODUCTION
Computers have chips that facilitate the display of images on monitors. Each chip has different functionality. Intel's integrated graphics controller provides basic graphics that can display applications like Microsoft PowerPoint, low-resolution video and basic games. The GPU is a programmable and powerful computational device that goes far beyond basic graphics controller functions [1]. GPUs are special-purpose processors used for the display of three-dimensional scenes. The capabilities of the GPU are being harnessed to accelerate computational workloads in areas such as modelling, cutting-edge scientific research, and oil and gas exploration. The GPU can now take on many multimedia tasks, such as speeding up Adobe Flash video, translating video to different formats, image recognition, virus pattern matching and many more. But the challenge is to solve those problems that have an inherent parallel nature: video processing, image analysis, signal processing. The CPU is composed of a few cores and a huge cache memory and can handle a few software threads at a time. Conversely, a GPU is composed of hundreds of cores that can handle thousands of threads simultaneously [2]. This ability of a GPU can accelerate some software by 100x over a CPU alone. The GPU achieves this acceleration while being more powerful and cost-efficient as compared to a CPU. This paper describes two different ways of programming a GPU, namely Accelerator and CUDA.

II. ACCELERATOR
Accelerator is a high-level data-parallel library which uses parallel processors such as the GPU or a multi-core CPU to speed up execution. The Accelerator application programming interface (API) helps in implementing a wide variety of array-processing operations by supporting a functional programming model. All the details of parallelizing and running the computation on the selected target processor, including GPUs and multi-core CPUs, are handled by Accelerator. The Accelerator API is almost processor independent, so the same array-processing code runs on any supported processor with only minor changes [3].

The basic steps in Accelerator programming are:
i. Create data arrays.
ii. Load the data arrays into Accelerator data-parallel array objects.
iii. Process the data-parallel array objects using a series of Accelerator operations.
iv. Create a result object.
v. Evaluate the result object on a target.

A. Data Parallel Array Objects
Applications do not have direct access to the array data and cannot manipulate the array by index [3]. Applications implement array-processing schemes by applying Accelerator operations to data-parallel array objects, which represent an entire array as a unit, rather than working with individual array elements.

The five data-parallel array objects supported by Accelerator v2 are:
i. BoolParallelArray
ii. DoubleParallelArray
iii. Float4ParallelArray
iv. FloatParallelArray
v. IntParallelArray

An example for Float4ParallelArray is as follows:

typedef struct
{
    float a;
    float b;
    float c;
    float d;
} float4;
Float4 is an Accelerator structure that contains a quadruplet of float values; it is used primarily in graphics programming, for example: float4 xyz (0.7f, 5.0f, 4.2f, 9.0f). The API includes a large collection of operations that applications can use to manipulate the contents of arrays and to combine the arrays in various ways. Most operations take one or more data-parallel array objects as input and return the processed data as a new data-parallel object. Accelerator operations work with copies of the input objects; they do not modify the original objects. Operation inputs are typically data-parallel objects, but some operations can also take constant values.

Table 1: Types of Accelerator Operations

Operation                 Explanation
Creation and conversion   Create new data-parallel array objects and convert them from one type to another.
Element-wise              Operate on each element of one or more data-parallel array objects and return a new object with the same dimensions as the originals. Examples: Add, Abs, Subtract, Divide, etc.
Reduction                 Reduce the rank of a data-parallel array object by applying a function across one or more dimensions.
Transform                 Transform the organization of the elements in a data-parallel array object. These operations reorganize the data in the object but do not require computation. Examples: Transpose, Pad.
Linear algebra            Perform standard matrix operations on data-parallel array objects, including matrix multiplication, scalar product, and outer product.

B. Sample Code

using System;
using Microsoft.ParallelArrays;
using FloatArray = Microsoft.ParallelArrays.FloatParallelArray;
using PArray = Microsoft.ParallelArrays.ParallelArrays;

namespace SubArrays
{
    class Calculation
    {
        static void Main(string[] args)
        {
            int arraySize = 100;
            Random ranf = new Random();
            float[] Array1 = new float[arraySize];
            float[] Array2 = new float[arraySize];
            float[] Array3 = new float[arraySize];
            DX9Target result = new DX9Target();                          // (1)
            for (int i = 0; i < arraySize; i++)                          // (2)
            {
                Array1[i] = (float)(Math.Sin((double)i / 10.0) + ranf.NextDouble() / 5.0);
                Array2[i] = (float)(Math.Sin((double)i / 10.0) + ranf.NextDouble() / 5.0);
            }
            FloatArray fpInput1 = new FloatArray(Array1);                // (3)
            FloatArray fpInput2 = new FloatArray(Array2);
            FloatArray fpStacked = PArray.Subtract(fpInput1, fpInput2);  // (4)
            FloatArray fpOutput = PArray.Divide(fpStacked, 2);
            Array3 = result.ToArray1D(fpOutput);                         // (5)
            for (int i = 0; i < arraySize; i++)                          // (6)
            {
                Console.WriteLine(Array3[i].ToString());                 // (7)
            }
        }
    }
}

The above code uses two commonly used types: ParallelArrays and FloatParallelArray. ParallelArrays, represented by the alias PArray, contains the Accelerator operation methods [3]. FloatParallelArray, represented by the alias FloatArray, contains floating-point arrays in the Accelerator environment.

1. Create a target object: Each processor that supports Accelerator has one or more target objects. Target objects convert Accelerator operations to processor-specific code and run them on the processor. The sample uses DX9Target, which runs Accelerator operations on any DirectX 9-compatible GPU, using the DirectX 9 API.
2. Create input data arrays: The input arrays for the sample are generated by the application, but they can also be created from stored data.
3. Load each input array: Each input array is loaded into an Accelerator data-parallel array object.
4. Array processing: Use one or more Accelerator operations to process the arrays.
5. Evaluation: Evaluate the results of the series of operations by passing the result object to the target's ToArray1D method. ToArray1D evaluates 1-D arrays and ToArray2D evaluates 2-D arrays.
6. Process the result: The sample displays the elements of the processed array in the console window.
C. Accelerator Architecture

Figure 1: Accelerator Architecture. (Managed and unmanaged applications call the managed or native API of the Accelerator library; targets, such as the DirectX target, other GPU-specific targets and other runtimes, together with their resource managers, map operations onto the processors: GPUs and multi-core CPUs.)

1) The Accelerator Library and APIs
The Accelerator library is used by applications to implement the processor-independent aspects of Accelerator programming. The library exposes two APIs, managed and native. C++ applications use the native API, which is implemented as unmanaged C++ classes. Managed applications use the managed API.

2) Targets
Accelerator needs a set of target objects. Each target translates Accelerator operations and data objects into a suitable form for its associated processor and runs the computation on the processor. A processor can have multiple targets, each of which accesses the processor in a different way. For example, most GPUs will probably have DirectX 9 and DirectX 11 targets, and possibly a vendor-implemented target to support processor-specific technologies. Each target exposes a small API, which applications call to interact with the target. The most commonly used methods are ToArray1D and ToArray2D.

3) Processor Hardware
Version 2 of Accelerator can run operations on different targets. Currently it includes targets for multi-core x64 CPUs and, by using DirectX 9, GPUs.

III. CUDA
CUDA is a high-level language which stands for Compute Unified Device Architecture. It is a parallel computing platform and programming model created by NVIDIA. The CUDA platform is accessible to software developers through CUDA-accelerated libraries, compiler directives and extensions to industry-standard programming languages, including C, C++ and Fortran. Programmers use C/C++ with CUDA extensions to express parallelism, data locality, and thread cooperation; the code is compiled with nvcc, NVIDIA's LLVM-based C/C++ compiler, into algorithms for execution on the GPU. CUDA enables heterogeneous systems (i.e., CPU + GPU), in which the CPU and GPU are separate devices with separate DRAMs, and scales to hundreds of cores and thousands of parallel threads. CUDA is an extension to the C programming language [4].

Figure 2: Compilation Procedure of CUDA. (A generic C/C++ CUDA application is compiled by NVCC into CPU code and PTX code; a specialized PTX-to-target compiler then produces code for the specific GPU, e.g. the G80.)

CUDA programming basics consist of the CUDA API basics, the programming model and the memory model.

A. CUDA API Basics
There are three basic API elements in CUDA: function type qualifiers, variable type qualifiers and built-in variables. Function type qualifiers are used to specify execution on the host or the device. Variable type qualifiers are used to specify the memory location on the device. Four built-in variables are used to specify the grid and block dimensions and the block and thread indices.

1) Function Type Qualifiers

Table 2: Function Type Qualifiers

Qualifier    Executed on   Callable from
__device__   Device        Device
__global__   Device        Host
__host__     Host          Host

2) Variable Type Qualifiers

Table 3: Variable Type Qualifiers

Qualifier                                     Location                Lifetime          Accessible from
__device__                                    Global memory space     The application   All threads within the grid, and the host through the runtime library
__constant__ (optionally with __device__)     Constant memory space   The application   All threads within the grid, and the host through the runtime library
__shared__ (optionally with __device__)       Shared memory space     The block         All threads within the block

3) Built-in Variables

Table 4: Built-in Variables

Variable    Type    Explanation
gridDim     dim3    dimensions of the grid
blockIdx    uint3   block index within the grid
blockDim    dim3    dimensions of the block
threadIdx   uint3   thread index within the block

4) Execution Configuration (EC)
An execution configuration must be specified for any call to a __global__ function. It defines the dimensions of the grid and blocks, and is specified by inserting an expression between the function name and the argument list.

Function declaration:
__global__ void function_name(float* parameter)

Function call:
function_name <<< Dg, Db, Ns >>> (parameter)

where Dg is of type dim3 and gives the dimension and size of the grid (Dg.x * Dg.y = the number of blocks being launched), Db is of type dim3 and gives the dimension and size of each block (Db.x * Db.y * Db.z = the number of threads per block), and Ns is of type size_t and gives the number of bytes of shared memory dynamically allocated per block.

B. Programming Model
The GPU is seen as a computing device used to execute a portion of an application that has to be executed several times, can be isolated as a function, and works independently on different data. Such a function can be compiled to run on the device. The resulting program is called a kernel. The batch of threads that executes a kernel is organized as a grid of thread blocks.

The steps in a CUDA program are:
Step 1: Initialize the device (GPU).
Step 2: Allocate memory on the device.
Step 3: Copy the data from the host array to the device array.
Step 4: Execute the kernel on the device.
Step 5: Copy data from the device (GPU) to the host.
Step 6: Free the allocated memories on the device and host.

Kernel code: A kernel is a function callable from the host and executed on the CUDA device, simultaneously by many threads in parallel. Calling a kernel involves specifying the name of the kernel plus an execution configuration.

C. The Memory Model
Each CUDA device has several memories, as shown in Figure 3. These memories can be used by programmers to achieve a high CGMA (Compute to Global Memory Access) ratio and thus high execution speed in their kernels. The CGMA ratio is the number of floating-point calculations performed for each access to the global memory within a region of a CUDA program. Variables that reside in shared memories and registers can be accessed at very high speed in a parallel manner. Each thread can only access the registers that are allocated to it. A kernel function uses registers to store frequently accessed variables. These variables are private to each thread.
Shared memories are allocated to thread blocks. All threads in a block can access variables in the shared memory locations of the block, so threads can share their results via shared memories. The global memory can be accessed by all the threads at any time of program execution. The constant memory allows faster and more parallel data access paths for CUDA kernel execution than the global memory. Texture memory is read from kernels using device functions called texture fetches. The first parameter of a texture fetch specifies an object called a texture reference. A texture reference defines which part of texture memory is fetched.

Figure 3: CUDA Memory Model. (Each thread has private registers and local memory; each block has shared memory; all threads share the global, constant and texture memories, which the host can also access.)

CUDA defines registers, shared memory, and constant memory that can be accessed at higher speed and in a more parallel manner than the global memory. Using these memories effectively will likely require a re-design of the algorithm. It is important for CUDA programmers to be aware of the limited sizes of these special memories. Their capacities are implementation dependent. Once their capacities are exceeded, they become limiting factors for the number of threads that can be assigned to each SM [5].

Table 5: Different Types of CUDA Memory

Memory     Location   Cached   Access       Who
Local      Off-chip   No       Read/write   One thread
Shared     On-chip    N/A      Read/write   All threads in a block
Global     Off-chip   No       Read/write   All threads + CPU
Constant   Off-chip   Yes      Read         All threads + CPU
Texture    Off-chip   Yes      Read         All threads + CPU

D. An Example of a CUDA Program
The following program calculates and prints the square roots of the first 100 integers.

#include <stdio.h>                                 // (1)
#include <cuda.h>
#include <conio.h>

__global__ void sq_rt(float *a, int N)             // (2)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a[idx] = sqrt(a[idx]);
}

int main(void)                                     // (3)
{
    float *arr_host, *arr_dev;                     // (4)
    const int C = 100;                             // (5)
    size_t length = C * sizeof(float);
    arr_host = (float *)malloc(length);            // (6)
    cudaMalloc((void **)&arr_dev, length);         // (7)
    for (int i = 0; i < C; i++)
        arr_host[i] = (float)i;
    cudaMemcpy(arr_dev, arr_host, length,
               cudaMemcpyHostToDevice);            // (8)
    int blk_size = 4;                              // (9)
    int n_blk = C / blk_size + (C % blk_size != 0);
    sq_rt<<<n_blk, blk_size>>>(arr_dev, C);
    cudaMemcpy(arr_host, arr_dev, length,
               cudaMemcpyDeviceToHost);            // (10)
    for (int i = 0; i < C; i++)                    // (11)
        printf("%d\t%f\n", i, arr_host[i]);
    free(arr_host);                                // (12)
    cudaFree(arr_dev);
    getch();
}

Figure 4: Illustration of the code.

1. Include header files.
2. Kernel function that executes on the CUDA device.
3. The main() routine, which runs on the CPU.
4. Define pointers to the host and device arrays.
5. Define other variables used in the program; size_t is an unsigned integer type of at least 16 bits.
6. Allocate the array on the host.
7. Allocate the array on the device (the DRAM of the GPU).
8. Copy the data from the host array to the device array.
9. Kernel call with its execution configuration.
10. Retrieve the result from the device into host memory.
11. Print the result.
12. Free the allocated memories on the device and host.

CONCLUSION
The stagnation of CPU clock speeds has brought parallel computing, multi-core architectures, and hardware-acceleration techniques back to the forefront of computational science and engineering. A GPU allows single-processor applications to run in a parallel processing environment. In this paper, we introduced GPU computing and the CUDA and Accelerator programming environments by explaining, step by step, how to implement an efficient GPU-enabled application. The above-mentioned programming models help achieve the goal of parallel processing. Due to the presence of these and many more programming models, gaming applications have become more lively and creative.

REFERENCES
[1] (URL truncated in source) .../difference-between-a-cpu-and-a-gpu/
[2] (URL truncated in source) .../files/Cruz_gpuComputing09.pdf
[3] (URL truncated in source) .../accelerator_intro.docx, accessed 11/12/2012, 4:06 PM
[4] (URL truncated in source), accessed 11/12/2012, 3:52 PM
[5] (URL truncated in source) .../cuda_docs/NVIDIA_CUDA_ProgrammingGuide_2.3.pdf
[6] De Donno, D.; Esposito, A.; Tarricone, L.; Catarinucci, L., "Introduction to GPU Computing and CUDA Programming: A Case Study on FDTD [EM Programmer's Notebook]," IEEE Antennas and Propagation Magazine, Vol. 52, Issue 3.