Stevina Dias, Sherrin Benjamin, Mitchell D’silva, Lynette Lopes / International Journal of
Engineering Research and Applications (IJERA) ISSN: 2248-9622 www.ijera.com
Vol. 3, Issue 1, January-February 2013, pp. 225-230

                              GPU Programming Models

        Stevina Dias*, Sherrin Benjamin*, Mitchell D’silva*, Lynette Lopes*
                               *Assistant Professor
               Dwarkadas J Sanghavi College of Engineering, Vile Parle

Abstract
        The CPU, the brains of the computer, is being enhanced by another part of the computer – the Graphics Processing Unit (GPU), which is its soul. GPUs are optimized for taking huge batches of data and performing the same operation on them repeatedly and quickly, which is not the case with PC microprocessors. A CPU combined with a GPU can deliver better system performance, price, and power. Version 2 of Microsoft's Accelerator offers an efficient way for applications to implement array-processing operations; these operations use the parallel processing capabilities of multi-processor computers. CUDA is a parallel computing platform and programming model created by NVIDIA. CUDA provides both lower-level and higher-level APIs.

Keywords – GPU, GPU Programming, CUDA, CUDA API, CGMA ratio, CUDA Memory Model, Accelerator, Data Parallel Arrays

I. INTRODUCTION
        Computers have chips that facilitate the display of images on monitors, and each chip has different functionality. Intel's integrated graphics controller provides basic graphics that can display applications such as Microsoft PowerPoint, low-resolution video and basic games. The GPU is a programmable and powerful computational device that goes far beyond basic graphics controller functions [1]. GPUs are special-purpose processors used for the display of three-dimensional scenes. The capabilities of the GPU are being harnessed to accelerate computational workloads in areas such as financial modelling, cutting-edge scientific research, and oil and gas exploration. The GPU can now take on many multimedia tasks, such as speeding up Adobe Flash video, translating video to different formats, image recognition, virus pattern matching and many more. The challenge, however, is to solve those problems that have an inherently parallel nature – video processing, image analysis, signal processing. The CPU is composed of a few cores and a large cache memory and can handle a few software threads at a time. Conversely, a GPU is composed of hundreds of cores that can handle thousands of threads simultaneously [2]. This ability can accelerate some software by 100x over a CPU alone, and the GPU achieves this acceleration while being more powerful and cost-efficient than a CPU. This paper describes two different ways of programming a GPU, namely Accelerator and CUDA.

II. ACCELERATOR
        Accelerator is a high-level data-parallel library which uses parallel processors such as the GPU or a multi-core CPU to speed up execution. The Accelerator application programming interface (API) helps in implementing a wide variety of array-processing operations by supporting a functional programming model. All the details of parallelizing and running the computation on the selected target processor, including GPUs and multi-core CPUs, are handled by Accelerator. The Accelerator API is almost processor independent, so the same array-processing code runs on any supported processor with only minor changes [3].

The basic steps in Accelerator programming are:
  i. Create data arrays.
 ii. Load the data arrays into Accelerator data-parallel array objects.
iii. Process the data-parallel array objects using a series of Accelerator operations.
 iv. Create a result object.
  v. Evaluate the result object on a target.

A. Data Parallel Array Objects
        Applications do not have direct access to the array data and cannot manipulate the array by index [3]. Applications implement array-processing schemes by applying Accelerator operations to data-parallel array objects—which represent an entire array as a unit—rather than working with individual array elements.

The five data-parallel array objects supported by Accelerator v2 are:
  i. BoolParallelArray
 ii. DoubleParallelArray
iii. Float4ParallelArray
 iv. FloatParallelArray
  v. IntParallelArray

An example for Float4ParallelArray is as follows:

typedef struct
{
    float a;
    float b;
    float c;
    float d;
} float4;


Float4 is an Accelerator structure that contains a quadruplet of float values and is used primarily in graphics programming, for example float4 xyz(0.7f, 5.0f, 4.2f, 9.0f). The API includes a large collection of operations that applications can use to manipulate the contents of arrays and to combine arrays in various ways. Most operations take one or more data-parallel array objects as input and return the processed data as a new data-parallel object. Accelerator operations work with copies of the input objects; they do not modify the original objects. Operation inputs are typically data-parallel objects, but some operations can also take constant values.

Table 1: Types of Accelerator Operations

Operation                  Explanation
Creation and conversion    Create new data-parallel array objects and convert them from one
                           type to another.
Element-wise               Operate on each element of one or more data-parallel array objects
                           and return a new object with the same dimensions as the originals.
                           Examples: Add, Abs, Subtract, Divide, etc.
Reduction                  Reduce the rank of a data-parallel array object by applying a
                           function across one or more dimensions.
Transform                  Transform the organization of the elements in a data-parallel array
                           object. These operations reorganize the data in the object, but do
                           not require computation. Examples: Transpose, Pad.
Linear algebra             Perform standard matrix operations on data-parallel array objects,
                           including matrix multiplication, scalar product, and outer product.

B. Sample Code

using System;
using Microsoft.ParallelArrays;
using FloatArray = Microsoft.ParallelArrays.FloatParallelArray;
using PArray = Microsoft.ParallelArrays.ParallelArrays;

namespace SubArrays
{
    class Calculation
    {
        static void Main(string[] args)
        {
            int arraySize = 100;
            Random ranf = new Random();
            float[] Array1 = new float[arraySize];
            float[] Array2 = new float[arraySize];
            float[] Array3 = new float[arraySize];
            DX9Target result = new DX9Target();                          // (1)
            for (int i = 0; i < arraySize; i++)                          // (2)
            {
                Array1[i] = (float)(Math.Sin((double)i / 10.0) + ranf.NextDouble() / 5.0);
                Array2[i] = (float)(Math.Sin((double)i / 10.0) + ranf.NextDouble() / 5.0);
            }
            FloatArray fpInput1 = new FloatArray(Array1);                // (3)
            FloatArray fpInput2 = new FloatArray(Array2);
            FloatArray fpStacked = PArray.Subtract(fpInput1, fpInput2);  // (4)
            FloatArray fpOutput = PArray.Divide(fpStacked, 2);
            Array3 = result.ToArray1D(fpOutput);                         // (5)
            for (int i = 0; i < arraySize; i++)                          // (6)
            {
                Console.WriteLine(Array3[i].ToString());                 // (7)
            }
        }
    }
}

The above code uses two commonly used types: ParallelArrays and FloatParallelArray. ParallelArrays, aliased here as PArray, contains the Accelerator operation methods [3]. FloatParallelArray, aliased here as FloatArray, holds floating-point arrays in the Accelerator environment. The numbered comments correspond to the following steps:

1. Create a target object: Each processor that supports Accelerator has one or more target objects. Target objects convert Accelerator operations to processor-specific code and run them on the processor. The sample uses DX9Target, which runs Accelerator operations on any DirectX 9-compatible GPU through the DirectX 9 API.

2. Create input data arrays: The input arrays in this sample are generated by the application, but they can also be created from stored data.

3. Load each input array: Each input array is loaded into an Accelerator data-parallel array object.

4. Array processing: Use one or more Accelerator operations to process the arrays.

5. Evaluation: Evaluate the results of the series of operations by passing the result object to the target's ToArray1D method. ToArray1D evaluates 1-D arrays; ToArray2D evaluates 2-D arrays.

6. Process the result: The sample displays the elements of the processed array in the console window.



C. Accelerator Architecture

Figure 1: Accelerator Architecture. (The figure shows the layered design: managed and unmanaged applications call the Accelerator library through its managed and native APIs; the library dispatches work to target runtimes such as the DirectX target, GPU-specific targets, and resource managers for other processors; the targets drive the underlying processors, i.e. GPUs and multi-core CPUs.)

1) The Accelerator Library and APIs
        The Accelerator library is used by applications to implement the processor-independent aspects of Accelerator programming. The library exposes two APIs, Managed and Native. C++ applications use the native API, which is implemented as unmanaged C++ classes, while managed applications use the managed API.

2) Targets
        Accelerator needs a set of target objects. Each target translates Accelerator operations and data objects into a suitable form for its associated processor and runs the computation on that processor. A processor can have multiple targets, each of which accesses the processor in a different way. For example, most GPUs will probably have DirectX 9 and DirectX 11 targets, and possibly a vendor-implemented target to support processor-specific technologies. Each target exposes a small API, which applications call to interact with the target. The most commonly used methods are ToArray1D and ToArray2D.

3) Processor Hardware
        Version 2 of Accelerator can run operations on different targets. Currently it includes targets for multi-core x64 CPUs and for GPUs, using DirectX 9.

III. CUDA
        CUDA, which stands for Compute Unified Device Architecture, is a parallel computing platform and programming model created by NVIDIA. The CUDA platform is accessible to software developers through CUDA-accelerated libraries, compiler directives, and extensions to industry-standard programming languages, including C, C++ and FORTRAN. Programmers use C/C++ with CUDA extensions to express parallelism, data locality, and thread cooperation; this code is compiled with nvcc, NVIDIA's LLVM-based C/C++ compiler, into algorithms that execute on the GPU. CUDA enables heterogeneous systems (i.e., CPU + GPU): the CPU and GPU are separate devices with separate DRAMs, and programs scale to hundreds of cores and thousands of parallel threads. CUDA is an extension to the C programming language [4].

Figure 2: Compilation Procedure of CUDA. (The figure shows a generic C/C++ CUDA application compiled by NVCC into CPU code and PTX code; a PTX-to-target compiler then specializes the PTX for a particular GPU, e.g. the G80.)

CUDA programming basics consist of the CUDA API basics, the programming model, and the memory model.

A. CUDA API Basics
        There are three basic groups of CUDA API extensions: function type qualifiers, variable type qualifiers, and built-in variables. Function type qualifiers are used to specify execution on the host or on the device.
Variable type qualifiers are used to specify the memory location of a variable on the device. Four built-in variables are used to specify the grid and block dimensions and the block and thread indices.

1) Function Type Qualifiers

Table 2: Function Type Qualifiers

Qualifier        Executed on    Callable from
__device__       Device         Device
__global__       Device         Host
__host__         Host           Host
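
As a brief illustration of how the three qualifiers combine in practice (a minimal sketch; the function names are illustrative and not taken from the paper):

// __device__: runs on the GPU and is callable only from GPU code.
__device__ float square(float x)
{
    return x * x;
}

// __global__: a kernel; runs on the GPU and is launched from host code.
__global__ void square_all(float *data, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        data[idx] = square(data[idx]);
}

// __host__: an ordinary CPU function (also the default when no qualifier is given).
__host__ void launch_square_all(float *dev_data, int N)
{
    int threads = 256;
    int blocks  = (N + threads - 1) / threads;
    square_all<<<blocks, threads>>>(dev_data, N);
}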
                                                         __global__ void function_name(float* parameter)
2) Variable Type Qualifiers

Table 3: Variable Type Qualifiers

Qualifier       Location                 Lifetime              Accessible from
__device__      Global memory space      An application        All the threads within the grid, and from the
                                                               host through the runtime library.
__constant__    Constant memory space    An application        All the threads within the grid, and from the
                                                               host through the runtime library (optionally
                                                               used together with __device__).
__shared__      Shared memory space      A block               All the threads within the block (optionally
                                                               used together with __device__).
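
A short sketch of where each qualifier places data (the variable and kernel names are illustrative only, not taken from the paper):

// Constant memory: small read-only table, set from the host with
// cudaMemcpyToSymbol, visible to every thread for the application's lifetime.
__constant__ float coeff[4];

// Global memory: a device variable with application lifetime.
__device__ float scale_factor = 2.0f;

__global__ void apply(float *out, const float *in, int N)
{
    // Shared memory: one copy per thread block, lives for the block's lifetime.
    // (Assumes the kernel is launched with 256 threads per block.)
    __shared__ float tile[256];

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (idx < N) ? in[idx] : 0.0f;
    __syncthreads();                      // every thread in the block reaches this point

    if (idx < N)
        out[idx] = scale_factor * (coeff[0] * tile[threadIdx.x] + coeff[1]);
}

On the host, the constant table would typically be filled before the launch, e.g. with cudaMemcpyToSymbol(coeff, host_coeff, sizeof(coeff)).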
3) Built-in Variables

Table 4: Built-in Variables

Built-in variable    Type     Explanation
gridDim              dim3     Dimensions of the grid
blockIdx             uint3    Block index within the grid
blockDim             dim3     Dimensions of the block
threadIdx            uint3    Thread index within the block
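
For instance (a minimal sketch with an illustrative kernel name), the built-in variables combine to give each thread a unique global coordinate in a two-dimensional grid:

__global__ void index_demo(int *out, int width, int height)
{
    // Global 2-D coordinates of this thread.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x < width && y < height)
    {
        // gridDim gives the number of blocks launched in each dimension;
        // it is read here simply to show that it is available inside the kernel.
        out[y * width + x] = gridDim.x * gridDim.y;
    }
}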
4) Execution Configuration (EC)
        An execution configuration must be specified for any call to a __global__ function. It defines the dimensions of the grid and of the blocks, and is specified by inserting an expression between the function name and the argument list.

Function declaration:
__global__ void function_name(float* parameter)

Function call:
function_name <<< Dg, Db, Ns >>> (parameter)

where Dg is of type dim3 and gives the dimension and size of the grid (Dg.x * Dg.y = number of blocks being launched); Db is of type dim3 and gives the dimension and size of each block (Db.x * Db.y * Db.z = number of threads per block); and Ns is an optional size_t giving the number of bytes of shared memory allocated dynamically per block (it defaults to 0).
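
A concrete launch of the index_demo kernel sketched above might look like this (the sizes and variable names are illustrative only):

dim3 Db(16, 16);                        // 16 * 16 = 256 threads per block
dim3 Dg((width  + Db.x - 1) / Db.x,     // enough blocks to cover the whole data set
        (height + Db.y - 1) / Db.y);
size_t Ns = 0;                          // no dynamically allocated shared memory

index_demo<<<Dg, Db, Ns>>>(dev_out, width, height);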
                                                         variables. These variables are private to each thread.



Shared memories are allocated to thread blocks. All threads in a block can access variables in the shared memory locations of the block, so threads can share their results via shared memory. The global memory can be accessed by all the threads at any time during program execution. The constant memory allows faster and more parallel data access paths for CUDA kernel execution than the global memory. Texture memory is read from kernels using device functions called texture fetches. The first parameter of a texture fetch specifies an object called a texture reference; a texture reference defines which part of texture memory is fetched.
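
As a sketch of the legacy texture-reference API that was current when this paper was written (the names are illustrative and error handling is omitted):

// A 1-D texture reference bound to linear device memory.
texture<float, cudaTextureType1D, cudaReadModeElementType> texRef;

__global__ void copy_from_texture(float *out, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        out[idx] = tex1Dfetch(texRef, idx);   // texture fetch through the reference
}

// Host side:
//   cudaBindTexture(NULL, texRef, dev_in, N * sizeof(float));
//   copy_from_texture<<<blocks, threads>>>(dev_out, N);
//   cudaUnbindTexture(texRef);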
Figure 3: CUDA Memory Model. (The figure shows the per-thread registers and local memory, the per-block shared memory, and the device-wide global, constant and texture memories summarized in Table 5 below.)

CUDA defines registers, shared memory, and constant memory that can be accessed at higher speed and in a more parallel manner than the global memory. Using these memories effectively will likely require re-design of the algorithm. It is important for CUDA programmers to be aware of the limited sizes of these special memories: their capacities are implementation dependent, and once they are exceeded they become limiting factors for the number of threads that can be assigned to each SM [5].

Table 5: Different Types of CUDA Memory

Memory      Location    Cached    Access        Who
Local       Off-chip    No        Read/Write    One thread
Shared      On-chip     N/A       Read/Write    All threads in a block
Global      Off-chip    No        Read/Write    All threads + CPU
Constant    Off-chip    Yes       Read          All threads + CPU
Texture     Off-chip    Yes       Read          All threads + CPU
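
To illustrate how shared and constant memory can raise the CGMA ratio (a minimal sketch with illustrative names, not code from the paper), the following 1-D stencil kernel stages a tile of the input in shared memory and keeps its filter weights in constant memory, so each value loaded from global memory is reused several times:

#define RADIUS  3
#define BLOCK   256

__constant__ float weights[2 * RADIUS + 1];         // filled with cudaMemcpyToSymbol

__global__ void stencil_1d(float *out, const float *in, int N)
{
    // Assumes a launch with BLOCK threads per block.
    __shared__ float tile[BLOCK + 2 * RADIUS];

    int g = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int l = threadIdx.x + RADIUS;                   // index within the tile

    // Each thread loads one interior element; the first RADIUS threads also
    // load the halo elements on either side of the block.
    tile[l] = (g < N) ? in[g] : 0.0f;
    if (threadIdx.x < RADIUS)
    {
        int left  = g - RADIUS;
        int right = g + BLOCK;
        tile[l - RADIUS] = (left  >= 0) ? in[left]  : 0.0f;
        tile[l + BLOCK]  = (right <  N) ? in[right] : 0.0f;
    }
    __syncthreads();

    if (g < N)
    {
        float acc = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k)     // 7 multiplies + 7 adds per thread
            acc += weights[k + RADIUS] * tile[l + k];
        out[g] = acc;
    }
}

Each thread now performs roughly 14 floating-point operations for approximately one global-memory load, instead of about one operation per load, which is the kind of CGMA improvement discussed above.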

D. An Example of a CUDA Program
The following program calculates and prints the square roots of the first 100 integers.

#include <stdio.h>                                 // (1)
#include <cuda.h>
#include <conio.h>

__global__ void sq_rt(float *a, int N)             // (2)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a[idx] = sqrt(a[idx]);
}

int main(void)                                     // (3)
{
    float *arr_host, *arr_dev;                     // (4)

    const int C = 100;                             // (5)
    size_t length = C * sizeof(float);
    arr_host = (float *)malloc(length);            // (6)

    cudaMalloc((void **)&arr_dev, length);         // (7)

    for (int i = 0; i < C; i++)
        arr_host[i] = (float)i;

    cudaMemcpy(arr_dev, arr_host, length,
               cudaMemcpyHostToDevice);            // (8)
    int blk_size = 4;                              // (9)
    int n_blk = C / blk_size + (C % blk_size != 0);
    sq_rt<<<n_blk, blk_size>>>(arr_dev, C);

    cudaMemcpy(arr_host, arr_dev, length,
               cudaMemcpyDeviceToHost);            // (10)

    for (int i = 0; i < C; i++)                    // (11)
        printf("%d\t%f\n", i, arr_host[i]);

    free(arr_host);                                // (12)
    cudaFree(arr_dev);
    getch();
}

The numbered comments in the program correspond to the following steps:

1.  Include the header files.
2.  Kernel function that executes on the CUDA device.
3.  main() routine, executed by the CPU.
4.  Define pointers to the host and device arrays.
5.  Define the other variables used in the program; size_t is an unsigned integer type (of at least 16 bits).
6.  Allocate the array on the host.
7.  Allocate the array on the device (in the GPU's DRAM).
8.  Copy the data from the host array to the device array.
9.  Kernel call with its execution configuration.
10. Retrieve the result from the device into host memory.
11. Print the result.
12. Free the allocated memories on the device and the host.

Figure 4: Illustration of the code.

CONCLUSION
The stagnation of CPU clock speeds has brought parallel computing, multi-core architectures, and hardware-acceleration techniques back to the forefront of computational science and engineering. A GPU allows single-processor applications to run in a parallel processing environment. In this paper, we introduced GPU computing and the CUDA and Accelerator programming environments by explaining, step by step, how to implement an efficient GPU-enabled application. The programming models described above help to achieve the goal of parallel processing. Due to the presence of these and many more programming models, gaming applications have become more lively and creative.

REFERENCES
  [1]  http://blogs.nvidia.com/2009/12/whats-the-difference-between-a-cpu-and-a-gpu/
  [2]  http://barbagroup.bu.edu/gpuatbu/Program_files/Cruz_gpuComputing09.pdf
  [3]  research.microsoft.com/en.../accelerator/accelerator_intro.docx, accessed 11/12/2012, 4:06 PM
  [4]  http://en.wikipedia.org/wiki/CUDA, accessed 11/12/2012, 3:52 PM
  [5]  www.cs.duke.edu/courses/fall09/cps296.3/cuda_docs/NVIDIA_CUDA_ProgrammingGuide_2.3.pdf
  [6]  D. De Donno, A. Esposito, L. Tarricone, and L. Catarinucci, "Introduction to GPU Computing and CUDA Programming: A Case Study on FDTD [EM Programmer's Notebook]," IEEE Antennas and Propagation Magazine, Vol. 52, Issue 3.
                                                                                             230 | P a g e

More Related Content

What's hot

OLAP Reporting In CR v2
OLAP Reporting In CR v2OLAP Reporting In CR v2
OLAP Reporting In CR v2Mickey Wong
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce AlgorithmsAmund Tveit
 
Parallel Linear Regression in Interative Reduce and YARN
Parallel Linear Regression in Interative Reduce and YARNParallel Linear Regression in Interative Reduce and YARN
Parallel Linear Regression in Interative Reduce and YARNDataWorks Summit
 
Time series data monitoring at 99acres.com
Time series data monitoring at 99acres.comTime series data monitoring at 99acres.com
Time series data monitoring at 99acres.comRavi Raj
 
Mapreduce advanced
Mapreduce advancedMapreduce advanced
Mapreduce advancedChirag Ahuja
 
Apache Spark: The Analytics Operating System
Apache Spark: The Analytics Operating SystemApache Spark: The Analytics Operating System
Apache Spark: The Analytics Operating SystemAdarsh Pannu
 
Application of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingApplication of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingMohammad Mustaqeem
 
The Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphXThe Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphXAndrea Iacono
 
Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal P...
Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal P...Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal P...
Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal P...Accumulo Summit
 
Hadoop MapReduce framework - Module 3
Hadoop MapReduce framework - Module 3Hadoop MapReduce framework - Module 3
Hadoop MapReduce framework - Module 3Rohit Agrawal
 
Simple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform ConceptSimple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform ConceptSatish Mohan
 
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarExploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarSpark Summit
 
[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust...
[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust...[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust...
[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust...PingCAP
 
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...Spark Summit
 

What's hot (20)

Map/Reduce intro
Map/Reduce introMap/Reduce intro
Map/Reduce intro
 
Apache Spark & MLlib
Apache Spark & MLlibApache Spark & MLlib
Apache Spark & MLlib
 
OLAP Reporting In CR v2
OLAP Reporting In CR v2OLAP Reporting In CR v2
OLAP Reporting In CR v2
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
 
Parallel Linear Regression in Interative Reduce and YARN
Parallel Linear Regression in Interative Reduce and YARNParallel Linear Regression in Interative Reduce and YARN
Parallel Linear Regression in Interative Reduce and YARN
 
Time series data monitoring at 99acres.com
Time series data monitoring at 99acres.comTime series data monitoring at 99acres.com
Time series data monitoring at 99acres.com
 
Mapreduce advanced
Mapreduce advancedMapreduce advanced
Mapreduce advanced
 
Unit 4-apache pig
Unit 4-apache pigUnit 4-apache pig
Unit 4-apache pig
 
Java performance
Java performanceJava performance
Java performance
 
Apache Spark: The Analytics Operating System
Apache Spark: The Analytics Operating SystemApache Spark: The Analytics Operating System
Apache Spark: The Analytics Operating System
 
Spark
SparkSpark
Spark
 
Application of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingApplication of MapReduce in Cloud Computing
Application of MapReduce in Cloud Computing
 
The Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphXThe Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphX
 
Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal P...
Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal P...Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal P...
Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal P...
 
Hadoop MapReduce framework - Module 3
Hadoop MapReduce framework - Module 3Hadoop MapReduce framework - Module 3
Hadoop MapReduce framework - Module 3
 
Simple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform ConceptSimple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform Concept
 
LectureNotes-06-DSA
LectureNotes-06-DSALectureNotes-06-DSA
LectureNotes-06-DSA
 
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarExploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
 
[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust...
[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust...[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust...
[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust...
 
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
 

Viewers also liked

Viewers also liked (20)

Eo31927930
Eo31927930Eo31927930
Eo31927930
 
Ag31238244
Ag31238244Ag31238244
Ag31238244
 
Dl31758761
Dl31758761Dl31758761
Dl31758761
 
Bx31498502
Bx31498502Bx31498502
Bx31498502
 
Irã
IrãIrã
Irã
 
Quem procura, acha
Quem procura, achaQuem procura, acha
Quem procura, acha
 
Generator
GeneratorGenerator
Generator
 
As tic unha nova achega ó mundo
As tic unha nova achega ó mundoAs tic unha nova achega ó mundo
As tic unha nova achega ó mundo
 
A biodiversidade da fauna e da flora.
A biodiversidade da fauna e da flora. A biodiversidade da fauna e da flora.
A biodiversidade da fauna e da flora.
 
A cruz de cristo, a manifestação da graça e da justiça divina
A cruz de cristo, a manifestação da graça e da justiça divinaA cruz de cristo, a manifestação da graça e da justiça divina
A cruz de cristo, a manifestação da graça e da justiça divina
 
Reparacion de oxidos en zona motor
Reparacion de oxidos en zona motorReparacion de oxidos en zona motor
Reparacion de oxidos en zona motor
 
Marketing in cinemas
Marketing in cinemasMarketing in cinemas
Marketing in cinemas
 
Funcions
FuncionsFuncions
Funcions
 
1000 casas
1000 casas1000 casas
1000 casas
 
Radio Trunking Digital Gota
Radio Trunking Digital GotaRadio Trunking Digital Gota
Radio Trunking Digital Gota
 
Aaradhya signature
Aaradhya signatureAaradhya signature
Aaradhya signature
 
El futbol 1r..key
El futbol 1r..keyEl futbol 1r..key
El futbol 1r..key
 
Libro ped 202
Libro ped 202Libro ped 202
Libro ped 202
 
Como montar nuestro home studio
Como montar nuestro home studioComo montar nuestro home studio
Como montar nuestro home studio
 
Sacudite!!
Sacudite!!Sacudite!!
Sacudite!!
 

Similar to Ae31225230

Parallel and Distributed Computing Chapter 8
Parallel and Distributed Computing Chapter 8Parallel and Distributed Computing Chapter 8
Parallel and Distributed Computing Chapter 8AbdullahMunir32
 
Cascading talk in Etsy (http://www.meetup.com/cascading/events/169390262/)
Cascading talk in Etsy (http://www.meetup.com/cascading/events/169390262/)Cascading talk in Etsy (http://www.meetup.com/cascading/events/169390262/)
Cascading talk in Etsy (http://www.meetup.com/cascading/events/169390262/)Jyotirmoy Sundi
 
Hadoop cluster performance profiler
Hadoop cluster performance profilerHadoop cluster performance profiler
Hadoop cluster performance profilerIhor Bobak
 
Cascading on starfish
Cascading on starfishCascading on starfish
Cascading on starfishFei Dong
 
Accelerating Real Time Applications on Heterogeneous Platforms
Accelerating Real Time Applications on Heterogeneous PlatformsAccelerating Real Time Applications on Heterogeneous Platforms
Accelerating Real Time Applications on Heterogeneous PlatformsIJMER
 
Hybrid Model Based Testing Tool Architecture for Exascale Computing System
Hybrid Model Based Testing Tool Architecture for Exascale Computing SystemHybrid Model Based Testing Tool Architecture for Exascale Computing System
Hybrid Model Based Testing Tool Architecture for Exascale Computing SystemCSCJournals
 
OBJECT ORIENTED PROGRAMMING LANGUAGE - SHORT NOTES
OBJECT ORIENTED PROGRAMMING LANGUAGE - SHORT NOTESOBJECT ORIENTED PROGRAMMING LANGUAGE - SHORT NOTES
OBJECT ORIENTED PROGRAMMING LANGUAGE - SHORT NOTESsuthi
 
Go faster with_native_compilation Part-2
Go faster with_native_compilation Part-2Go faster with_native_compilation Part-2
Go faster with_native_compilation Part-2Rajeev Rastogi (KRR)
 
Go Faster With Native Compilation
Go Faster With Native CompilationGo Faster With Native Compilation
Go Faster With Native CompilationPGConf APAC
 
Hivemall tech talk at Redwood, CA
Hivemall tech talk at Redwood, CAHivemall tech talk at Redwood, CA
Hivemall tech talk at Redwood, CAMakoto Yui
 
Large scale logistic regression and linear support vector machines using spark
Large scale logistic regression and linear support vector machines using sparkLarge scale logistic regression and linear support vector machines using spark
Large scale logistic regression and linear support vector machines using sparkMila, Université de Montréal
 
Eclipse Con Europe 2014 How to use DAWN Science Project
Eclipse Con Europe 2014 How to use DAWN Science ProjectEclipse Con Europe 2014 How to use DAWN Science Project
Eclipse Con Europe 2014 How to use DAWN Science ProjectMatthew Gerring
 
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkRunning Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkDatabricks
 
Newest topic of spider 20131016 in Buenos Aires Argentina
Newest topic of spider 20131016 in Buenos Aires ArgentinaNewest topic of spider 20131016 in Buenos Aires Argentina
Newest topic of spider 20131016 in Buenos Aires ArgentinaKentoku
 

Similar to Ae31225230 (20)

Parallel and Distributed Computing Chapter 8
Parallel and Distributed Computing Chapter 8Parallel and Distributed Computing Chapter 8
Parallel and Distributed Computing Chapter 8
 
Gephi Toolkit Tutorial
Gephi Toolkit TutorialGephi Toolkit Tutorial
Gephi Toolkit Tutorial
 
Cascading talk in Etsy (http://www.meetup.com/cascading/events/169390262/)
Cascading talk in Etsy (http://www.meetup.com/cascading/events/169390262/)Cascading talk in Etsy (http://www.meetup.com/cascading/events/169390262/)
Cascading talk in Etsy (http://www.meetup.com/cascading/events/169390262/)
 
Hadoop cluster performance profiler
Hadoop cluster performance profilerHadoop cluster performance profiler
Hadoop cluster performance profiler
 
Cascading on starfish
Cascading on starfishCascading on starfish
Cascading on starfish
 
Accelerating Real Time Applications on Heterogeneous Platforms
Accelerating Real Time Applications on Heterogeneous PlatformsAccelerating Real Time Applications on Heterogeneous Platforms
Accelerating Real Time Applications on Heterogeneous Platforms
 
Hybrid Model Based Testing Tool Architecture for Exascale Computing System
Hybrid Model Based Testing Tool Architecture for Exascale Computing SystemHybrid Model Based Testing Tool Architecture for Exascale Computing System
Hybrid Model Based Testing Tool Architecture for Exascale Computing System
 
OBJECT ORIENTED PROGRAMMING LANGUAGE - SHORT NOTES
OBJECT ORIENTED PROGRAMMING LANGUAGE - SHORT NOTESOBJECT ORIENTED PROGRAMMING LANGUAGE - SHORT NOTES
OBJECT ORIENTED PROGRAMMING LANGUAGE - SHORT NOTES
 
Z04404159163
Z04404159163Z04404159163
Z04404159163
 
Go faster with_native_compilation Part-2
Go faster with_native_compilation Part-2Go faster with_native_compilation Part-2
Go faster with_native_compilation Part-2
 
Go Faster With Native Compilation
Go Faster With Native CompilationGo Faster With Native Compilation
Go Faster With Native Compilation
 
Hivemall tech talk at Redwood, CA
Hivemall tech talk at Redwood, CAHivemall tech talk at Redwood, CA
Hivemall tech talk at Redwood, CA
 
Go Faster With Native Compilation
Go Faster With Native CompilationGo Faster With Native Compilation
Go Faster With Native Compilation
 
Large scale logistic regression and linear support vector machines using spark
Large scale logistic regression and linear support vector machines using sparkLarge scale logistic regression and linear support vector machines using spark
Large scale logistic regression and linear support vector machines using spark
 
Interpreting the Data:Parallel Analysis with Sawzall
Interpreting the Data:Parallel Analysis with SawzallInterpreting the Data:Parallel Analysis with Sawzall
Interpreting the Data:Parallel Analysis with Sawzall
 
GraphQL & Ratpack
GraphQL & RatpackGraphQL & Ratpack
GraphQL & Ratpack
 
GCF
GCFGCF
GCF
 
Eclipse Con Europe 2014 How to use DAWN Science Project
Eclipse Con Europe 2014 How to use DAWN Science ProjectEclipse Con Europe 2014 How to use DAWN Science Project
Eclipse Con Europe 2014 How to use DAWN Science Project
 
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkRunning Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
 
Newest topic of spider 20131016 in Buenos Aires Argentina
Newest topic of spider 20131016 in Buenos Aires ArgentinaNewest topic of spider 20131016 in Buenos Aires Argentina
Newest topic of spider 20131016 in Buenos Aires Argentina
 

Ae31225230

  • 1. Stevina Dias, Sherrin Benjamin, Mitchell D’silva, Lynette Lopes / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 3, Issue 1, January -February 2013, pp.225-230 GPU Programming Models Stevina Dias* Sherrin Benjamin* Mitchell D’silva* Lynette Lopes* *Assistant Professor Dwarkadas J Sanghavi College of Engineering, Vile Parle Abstract The CPU, the brains of the computer is programming a GPU namely Accelerator and being enhanced by another part of the computer CUDA. – the Graphics Processing Unit (GPU), which is its soul. GPUs are optimized for taking huge II. ACCELERATOR batches of data and performing the same Accelerator is a high-level data parallel operation repeatedly and quickly, which is not library which uses parallel processors such as the the case with PC microprocessors. A CPU GPU or multi core CPU to speed up execution. The combined with GPU, can deliver better system Accelerator application programming interface performance, price, and power. The version 2 of (API) helps in implementing a wide variety of array- Microsoft’s Accelerator offers an efficient way processing operations by supporting a functional for applications to implement array-processing programming model. All the details of parallelizing operations. These operations use the parallel and running the computation on the selected target processing capabilities of multi-processor processor, including GPUs and multi core CPUs are computers. CUDA is a parallel computing handled by the accelerator. The Accelerator API is platform and programming model created by almost processor independent, so the same array- NVIDIA. CUDA provides both lower and a processing code runs on any supported processor higher level APIs. with only minor changes [3]. Keywords– GPU, GPU Programming CUDA, The basic steps in Accelerator programming are: CUDA API, CGMA ratio, CUDA Memory Model, i. Create data arrays. Accelerator, Data Parallel Arrays ii. Load the data arrays into Accelerator data- parallel array objects. I. INTRODUCTION iii. Process the data-parallel array objects using a Computers have chips that facilitate the series of Accelerator operations. display of images to monitors. Each chip has iv. Create a result object. different functionality. Intel‘s integrated graphics v. Evaluate the result object on a target. controller provides basic graphics that can display applications like Microsoft PowerPoint, low- A. Data Parallel Array Objects resolution video and basic games. The GPU is a Applications do not have direct access to programmable and powerful computational device the array data and cannot manipulate the array by that goes far beyond basic graphics controller index [3]. Applications implement array processing functions [1]. GPUs are special-purpose processors schemes by applying Accelerator operations to data- used for the display of three dimensional scenes. The parallel array objects—which represent an entire capabilities of GPU are being harnessed to array as a unit—rather than working with individual accelerate computational workloads in areas such as array elements. financial modelling, cutting-edge scientific research and oil and gas exploration. The GPU can now take 5 data-parallel array objects supported by on many multimedia tasks, such as speed up Adobe Accelerator v2 are: Flash video, translating video to different formats, i. BoolParallelArray image recognition, virus pattern matching and many ii. DoubleParallelArray more. 
But the challenge is to solve those problems iii. Float4ParallelArray that have an inherent parallel nature – video iv. FloatParallelArray processing, image analysis, signal processing. The v. IntParallelArray CPU is composed of few cores and a huge cache memory that can handle a few software threads at a An example for Float4ParallelArray is as follows: time. Conversely, a GPU is composed of hundreds typedef struct of cores that can handle thousands of threads { float a; simultaneously [2]. This ability of a GPU can float b; accelerate some software by 100x over a CPU alone. float c; The GPU achieves this acceleration while being float d; } float4; more powerful and cost-efficient as compared to a CPU. This paper describes 2 different ways of 225 | P a g e
  • 2. Stevina Dias, Sherrin Benjamin, Mitchell D’silva, Lynette Lopes / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 3, Issue 1, January -February 2013, pp.225-230 Float4 is an Accelerator structure that contains a for (int i = 0; i < arraySize; i++) // (2) quadruplet of float values. It is used primarily in { graphics programming. float4 xyz (0.7f, 5.0f, 4.2f, Array1[i] = (float)(Math.Sin((double)i / 10.0) + 9.0f). The API includes a large collection of ranf.NextDouble() / 5.0); operations that applications can use to manipulate Array2[i] = (float)(Math.Sin((double)i / 10.0) + the contents of arrays and to combine the arrays in ranf.NextDouble() / 5.0); various ways. Most operations take one or more } data-parallel array objects as input, and return the FloatArray fpInput1 = new FloatArray (Array1); processed data as a new data-parallel object. // (3) Accelerator operations work with copies of the input FloatArray fpInput2 = new FloatArray (Array2); objects; they do not modify the original objects. FloatArray fpStacked = PArray.Subtract(fpInput1, Operation inputs are typically data-parallel objects, fpInput2); // (4) but some operations can also take constant values. FloatArray fpOutput = PArray.Divide(fpStacked, 2); Array3 = result.ToArray1D(fpOutput); // (5) Table 1 Type of Accelerator Operations for (int i = 0; i < arrayLength; i++) // (6) Operation Explanation { Creation Create new data-parallel array objects and Console.WriteLine(Array3[i].ToString()); //(7) and convert them from one type to another. }}}} conversion Element- Operate on each element of one or more data- The above code uses two commonly used types: wise parallel array objects and return a new object ParallelArrays and FloatParallelArray. with the same dimensions as the originals. ParallelArrays represented by PArray, contains the  Add. Accelerator operation methods [3].  Abs FloatParallelArray represented by FloatArray,  Subtract contains floating point arrays in the Accelerator  Divide , etc environment. Reduction Reduce the rank of a data-parallel array object by applying a function across 1D or more 1. Create a target object: Each processor that dimensions. supports Accelerator has one or more target objects. Target Objects convert Accelerator operations to Transform Transform the organization of the elements in processor-specific code and run them on the a data-parallel array object. These operations processor. StackArrays uses DX9Target, which runs reorganize the data in the object, but do not Accelerator operations on any DirectX 9-compatible require computation. For example: GPU, using the DirectX 9 API.  Transpose  Pad 2. Create input data arrays: the input arrays Linear Perform standard matrix operations on data- for StackArrays are generated by the application, but algebra parallel array objects, including matrix they can also be created from stored data. multiplication, scalar product, and outer product. 3. Load each input array: Each input array is loaded into an Accelerator data parallel array B. Sample Code object. using System; using Microsoft.ParallelArrays; 4. Array Processing: Use one or more Accelerator using FloatArray = operations to process the arrays. Microsoft.ParallelArrays.FloatParallelArray; using PArray = 5. Evaluation: evaluate the results of the series of Microsoft.ParallelArrays.ParallelArrays; operations by passing the result object namespace SubArrays to the target‘s ToArray1D method. 
ToArray1D { evaluates the 1D arrays and ToArray2D, class Calculation evaluates 2-D arrays. { staticvoid Main(string[] args) 6. Process the result: Stack Arrays displays { the elements of the processed array in the int arraySize = 100; console window. Random ranf = newRandom(); float[] Array1 = newfloat[arraySize]; float[] Array2 = newfloat[arraySize]; float[] Array3 = newfloat[arraySize]; DX9Target result = newDX9Target(); // (1) 226 | P a g e
  • 3. Stevina Dias, Sherrin Benjamin, Mitchell D’silva, Lynette Lopes / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 3, Issue 1, January -February 2013, pp.225-230 CUDA is a high level language which stands for Compute Unified Device Architecture. It C. Accelerator Architecture is a parallel computing platform and programming model created by NVIDIA. The CUDA platform is accessible to software developers through CUDA- Applications Unmanaged Applications Managed Applications accelerated libraries, compiler directives and extensions to industry-standard programming languages, including C, C++ and FORTRAN. Managed API Native API Programmers use ‗C/C++‘ with CUDA extensions to Accelerator Library express parallelism, data locality, and thread cooperation, which is compiled with ―nvcc‖, NVIDIA‘s LLVM-based C/C++ compiler, to code algorithms for execution on the GPU.CUDA enables Multi-Core CPU heterogeneous systems(i.e., CPU+GPU). CPU & GPU are separate devices with separate DRAMs Target GPU- Other Scale to 100‘s of cores, 1000‘s of parallel threads. DirectX Accelerator Other Runtimes Specific Processors CUDA is an Extension to the C Programming Resource Resource Language [4]. Manager Managers GENERIC Processors GPUs Multi-core CPU C/C++ CUDA Application Figure 1: Accelerator Architecture 1) The Accelerator Library and APIs The Accelerator library is used by the NVCC CPU Code applications to implement the processor-independent aspects of Accelerator programming. The library exposes two APIs namely Managed and Native. C++ applications use the native API, which is PTX Code implemented as unmanaged C++ classes. Managed applications use the managed API. 2) Targets Accelerator needs a set of target objects. SPECIALIZED Each target translates Accelerator operations and PTX to Target data objects into a suitable form for its associated Compiler processor and runs the computation on the processor. A processor can have multiple targets, each of which accesses the processor in a different way. For example, most GPUs will probably have DirectX 9 and DirectX 11 targets, and possibly a vendor- implemented target to support processor-specific technologies. Each target consists of a small API, G80 …. GPU which are called by applications to interact with the target. The most commonly used methods areToArray1D and ToArray2D. Figure 2: Compilation Procedure of CUDA 3) Processor Hardware The version 2 of Accelerator can run CUDA Programming Basics consists of CUDA API operations on different targets. Currently it includes basics, Programming model and its Memory model targets for: Multicore x64 CPUs, GPUs, by using DirectX 9 A. CUDA API Basics: There are three basic APIs in CUDA namely, Function type qualifiers, Variable type qualifiers and Built-in variables. Function type III. CUDA qualifiers are used to specify execution on host or device. Variable type qualifiers are used to specify 227 | P a g e
CUDA programming basics consist of the CUDA API basics, the programming model and the memory model.

A. CUDA API Basics
There are three basic groups of CUDA language extensions: function type qualifiers, variable type qualifiers and built-in variables. Function type qualifiers are used to specify execution on the host or the device. Variable type qualifiers are used to specify the memory location on the device. Four built-in variables are used to specify the grid and block dimensions and the block and thread indices.

1) Function Type Qualifiers

Table 2 Function Type Qualifiers
Function Type Qualifier | Executed on | Callable from
__device__ | Device | Device
__global__ | Device | Host
__host__ | Host | Host

2) Variable Type Qualifiers

Table 3 Variable Type Qualifiers
Variable Type Qualifier | Location | Lifetime | Accessible from
__device__ | Global memory space | Lifetime of an application | All the threads within the grid and from the host through the runtime library
__constant__ (optionally used together with __device__) | Constant memory space | Lifetime of an application | All the threads within the grid and from the host through the runtime library
__shared__ (optionally used together with __device__) | Shared memory space | Lifetime of a block | All the threads within the block

3) Built-in Variables

Table 4 Built-in Variables
Built-in variable | Type | Explanation
gridDim | dim3 | Dimensions of the grid
blockIdx | uint3 | Block index within the grid
blockDim | dim3 | Dimensions of the block
threadIdx | uint3 | Thread index within the block

4) Execution Configuration (EC)
An EC must be specified for any call to a __global__ function. It defines the dimensions of the grid and blocks and is specified by inserting an expression between the function name and the argument list:

Function declaration: __global__ void function_name(float* parameter)
Function call: function_name <<< Dg, Db, Ns >>> (parameter)

where Dg is of type dim3 and gives the dimension and size of the grid, so that Dg.x * Dg.y is the number of blocks being launched; Db is of type dim3 and gives the dimension and size of each block, so that Db.x * Db.y * Db.z is the number of threads per block; and Ns is of type size_t and gives the number of bytes of shared memory allocated dynamically per block for this call (it defaults to 0 and may be omitted).

B. Programming Model
The GPU is seen as a computing device that executes a portion of an application which has to be executed many times, can be isolated as a function and works independently on different data. Such a function can be compiled to run on the device; the resulting program is called a kernel. The batch of threads that executes a kernel is organized as a grid of thread blocks.

The steps in a CUDA program are:
Step 1: Initialize the device (GPU)
Step 2: Allocate memory on the device
Step 3: Copy the data from the host array to the device array
Step 4: Execute the kernel on the device
Step 5: Copy data from the device (GPU) back to the host
Step 6: Free the allocated memories on the device and host

Kernel code (xx_kernel.cu): A kernel is a function callable from the host and executed on the CUDA device, simultaneously by many threads in parallel. Calling a kernel involves specifying the name of the kernel plus an execution configuration.
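The following sketch ties the qualifiers, the built-in variables and the execution configuration together in one small program; the kernel, variable and constant names are invented for illustration and are not taken from the cited sources.

#include <stdio.h>

#define N   1024
#define TPB 256                                   // threads per block (Db)

__constant__ float coeff;                         // constant memory, written by the host
__device__   float bias = 1.0f;                   // global-memory variable, application lifetime

__device__ float shift(float x)                   // __device__: callable from device code only
{
    return x + bias;
}

__global__ void transform(float *out, const float *in)   // __global__: called from the host
{
    __shared__ float tile[TPB];                   // shared memory, one copy per thread block
    int i = blockIdx.x * blockDim.x + threadIdx.x;        // built-in variables
    tile[threadIdx.x] = in[i];
    __syncthreads();
    out[i] = coeff * shift(tile[threadIdx.x]);
}

int main(void)
{
    float h[N], *d_in, *d_out, two = 2.0f;
    for (int i = 0; i < N; i++) h[i] = (float)i;

    cudaMalloc((void**)&d_in,  N * sizeof(float));
    cudaMalloc((void**)&d_out, N * sizeof(float));
    cudaMemcpyToSymbol(coeff, &two, sizeof(float));       // initialise constant memory
    cudaMemcpy(d_in, h, N * sizeof(float), cudaMemcpyHostToDevice);

    dim3 Dg(N / TPB), Db(TPB);                    // execution configuration <<< Dg, Db >>>
    transform<<<Dg, Db>>>(d_out, d_in);

    cudaMemcpy(h, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h[10] = %f\n", h[10]);                // expect 2 * (10 + 1) = 22.0
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}

Dg and Db are chosen here so that Dg.x * Db.x exactly covers the N elements; for sizes that do not divide evenly, a bounds check such as the one in the square-root example of Section D below is needed.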
C. The Memory Model
Each CUDA device has several memories, as shown in Figure 3. These memories can be used by programmers to achieve a high CGMA (Compute to Global Memory Access) ratio and thus high execution speed in their kernels. The CGMA ratio is the number of floating-point calculations performed for each access to the global memory within a region of a CUDA program. Variables that reside in shared memories and registers can be accessed at very high speed and in a parallel manner. Each thread can only access the registers that are allocated to it; a kernel function uses registers to store frequently accessed variables, which are private to each thread. Shared memories are allocated to thread blocks: all threads in a block can access variables in the shared memory locations of the block, so threads can share their results via shared memory. The global memory can be accessed by all the threads at any time of program execution. The constant memory allows faster and more parallel data access paths for CUDA kernel execution than the global memory. Texture memory is read from kernels using device functions called texture fetches; the first parameter of a texture fetch specifies an object called a texture reference, which defines which part of texture memory is fetched.

Table 5 Different Types of CUDA Memory
Memory | Location | Cached | Access | Who
Local | Off-chip | No | Read/write | One thread
Shared | On-chip | N/A | Read/write | All threads in a block
Global | Off-chip | No | Read/write | All threads + CPU
Constant | Off-chip | Yes | Read | All threads + CPU
Texture | Off-chip | Yes | Read | All threads + CPU

Figure 3: CUDA Memory Model

CUDA defines registers, shared memory, and constant memory that can be accessed at higher speed and in a more parallel manner than the global memory. Using these memories effectively will likely require a re-design of the algorithm. It is important for CUDA programmers to be aware of the limited sizes of these special memories: their capacities are implementation dependent, and once they are exceeded they become limiting factors for the number of threads that can be assigned to each SM [5].
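As a hypothetical illustration of how shared memory raises the CGMA ratio (the kernel names and the three-point averaging operation are invented for this sketch), the second kernel below loads each input element from global memory once per block and lets neighbouring threads reuse it from on-chip shared memory, whereas the first kernel reads every operand directly from global memory.

#define TPB 256                                   // threads per block

// Naive version: about three global-memory loads per output element (low CGMA).
__global__ void avg3_global(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
}

// Shared-memory version: each element is fetched from global memory once per block
// and then reused from fast on-chip shared memory by up to three threads.
// Launch, for example: avg3_shared<<<(n + TPB - 1) / TPB, TPB>>>(d_out, d_in, n);
__global__ void avg3_shared(float *out, const float *in, int n)
{
    __shared__ float tile[TPB + 2];               // block's elements plus a one-cell halo on each side
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x + 1;                      // offset for the left halo cell

    if (i < n) tile[t] = in[i];
    if (threadIdx.x == 0       && i > 0)     tile[0]       = in[i - 1];
    if (threadIdx.x == TPB - 1 && i < n - 1) tile[TPB + 1] = in[i + 1];
    __syncthreads();                              // wait until the whole tile is filled

    if (i > 0 && i < n - 1)
        out[i] = (tile[t - 1] + tile[t] + tile[t + 1]) / 3.0f;
}

In the naive kernel each output performs its arithmetic against three global loads and one global store; in the shared-memory version the same arithmetic is performed against roughly one global load and one store per element, so the CGMA ratio roughly doubles.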
D. An Example of a CUDA Program
The following program calculates and prints the square roots of the first 100 integers.

#include <stdio.h>                                 // (1)
#include <cuda.h>
#include <conio.h>

__global__ void sq_rt(float *a, int N)             // (2)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a[idx] = sqrt(a[idx]);
}

int main(void)                                     // (3)
{
    float *arr_host, *arr_dev;                     // (4)
    const int C = 100;                             // (5)
    size_t length = C * sizeof(float);

    arr_host = (float*)malloc(length);             // (6)
    cudaMalloc((void**)&arr_dev, length);          // (7)

    for (int i = 0; i < C; i++)
        arr_host[i] = (float)i;

    cudaMemcpy(arr_dev, arr_host, length, cudaMemcpyHostToDevice);   // (8)

    int blk_size = 4;                              // (9)
    int n_blk = C / blk_size + (C % blk_size == 0 ? 0 : 1);
    sq_rt<<<n_blk, blk_size>>>(arr_dev, C);

    cudaMemcpy(arr_host, arr_dev, length, cudaMemcpyDeviceToHost);   // (10)

    for (int i = 0; i < C; i++)                    // (11)
        printf("%d\t%f\n", i, arr_host[i]);

    free(arr_host);                                // (12)
    cudaFree(arr_dev);
    getch();
}

1. Include the header files.
2. Kernel function that executes on the CUDA device.
3. The main() routine, which runs on the CPU.
4. Define pointers to the host and device arrays.
5. Define the other variables used in the program; size_t is an unsigned integer type of at least 16 bits.
6. Allocate the array on the host.
7. Allocate the array on the device (the DRAM of the GPU).
8. Copy the data from the host array to the device array.
9. Kernel call with its execution configuration.
10. Retrieve the result from the device into host memory.
11. Print the result.
12. Free the allocated memories on the device and host.

Figure 4: Illustration of code

CONCLUSION
The stagnation of CPU clock speeds has brought parallel computing, multi-core architectures, and hardware-acceleration techniques back to the forefront of computational science and engineering. A GPU allows single-processor applications to be run in a parallel processing environment. In this paper, we introduced GPU computing and the CUDA and Accelerator programming environments by explaining, step by step, how to implement an efficient GPU-enabled application. The programming models described here help to achieve the goal of parallel processing; thanks to these and many other models, gaming applications have become more lively and creative.

REFERENCES
[1] http://blogs.nvidia.com/2009/12/whats-the-difference-between-a-cpu-and-a-gpu/
[2] http://barbagroup.bu.edu/gpuatbu/Program_files/Cruz_gpuComputing09.pdf
[3] research.microsoft.com/en.../accelerator/accelerator_intro.docx (accessed 11/12/2012, 4:06 PM)
[4] http://en.wikipedia.org/wiki/CUDA (accessed 11/12/2012, 3:52 PM)
[5] www.cs.duke.edu/courses/fall09/cps296.3/cuda_docs/NVIDIA_CUDA_ProgrammingGuide_2.3.pdf
[6] De Donno, D.; Esposito, A.; Tarricone, L.; Catarinucci, L., "Introduction to GPU Computing and CUDA Programming: A Case Study on FDTD [EM Programmer's Notebook]," IEEE Antennas and Propagation Magazine, Vol. 52, Issue 3.