
The International Journal of Engineering and Science (The IJES)



The International Journal of Engineering and Science (IJES), Volume 2, Issue 01, Pages 250-253, 2013. ISSN: 2319-1813, ISBN: 2319-1805

Advanced Trends of Heterogeneous Computing with CPU-GPU Integration: Comparative Study

Ishan Rajani¹, G Nanda Gopal²
1. PG student, Department of Computer Engineering, Noble Group of Institutions, Gujarat, India
2. Asst. Prof., Department of Computer Engineering, Noble Group of Institutions, Gujarat, India

Abstract: Over the last decades, parallel-distributed computing has become more popular than traditional centralized computing. In distributed computing, performance gains are achieved by distributing workloads across the participating nodes, and one of the most important factors in improving the performance of such systems is reducing the mean and standard deviation of job response time. Runtime insertion of new tasks of various sizes on different nodes is one of the main causes of load imbalance. Among the recent concepts in parallel-distributed processing, this paper focuses on CPU-GPU utilization: how the idle portion of the CPU can be used for GPU work, and vice versa. The paper also introduces a heterogeneous computing workflow centered on CPU-GPU integration. The proposed system exploits coarse-grained warp-level parallelism.
It is also elaborated here which architectures and frameworks developers are using to advance the field of heterogeneous computing.

Index Terms: Heterogeneous Computing, Coarse-Grained Warp-Level Parallelism, Standard Deviation of Job Response Time

Date of Submission: 28th December, 2012. Date of Publication: 20th January, 2013.

I. Introduction
Proper work utilization through multi-core processing techniques is one of the most important aspects of massively parallel processing (MPP) systems. Emerging parallel-distributed technology has made its greatest progress in processing power, data storage capacity, and circuit integration scale over the last several years, but it still does not satisfy the increasing demands placed on advanced multi-core processors. Because we let idle CPU resources participate in GPU activity, GPU computing has emerged in recent years as a viable execution platform for throughput-oriented applications and codes at multiple granularities, using architectures developed by different GMA manufacturers. These architectures improve performance mostly through the single-instruction multiple-thread (SIMT) model, whose concept is very similar to the single-instruction multiple-data (SIMD) class of Flynn's taxonomy. NVIDIA has described an architecture named Unified Graphics Computing Architecture (UGCA); on the basis of that concept, and of research done in recent decades, we propose an architecture for it with its entire workflow.

II. Related Work
With the advent of general-purpose graphics units, GMA manufacturers such as NVIDIA, Intel, ARM, and AMD have developed specific architectures for this integration.
In general, GPUs began their era as independent units for code execution, but after that initial phase of the graphics unit several developments were made toward CPU-GPU integration, and various manufacturers added thread-level parallelism to their graphics accelerators. To build a powerful computing paradigm, General-Purpose computing on Graphics Processing Units (GPGPU) has been extended to manage hundreds of processing cores [6][13]. For example, AMD/ATI developed Brook+ for general-purpose programming, and NVIDIA developed the Compute Unified Device Architecture (CUDA), which provides a rich programming environment for general-purpose graphics devices.

The GPGPU programming paradigm of CUDA increases performance, but it still does not fully address the scheduling of tasks allocated to the GPU [12]. One way to handle this situation is to exploit a technique that utilizes idle GPU resources, which reduces the overall runtime of processing. To improve resource utilization, other developers have also recently released integrated solutions for such systems; some of the most popular are Intel's Sandy Bridge, AMD's Fusion APUs, and ARM's Mali. This paper proposes a heterogeneous environment for CPU-GPU integration. CPU-GPU chip integration offers several advantages over traditional systems: it reduces communication cost, it makes proper use of the memory resources of both sides because it relies on a shared-memory mechanism rather than separate memory spaces, and it provides warp-level parallelism for faster and more efficient execution [5][7]. Warps can be defined as groups of threads; most warp structures place 32 threads in one warp [7]. Research on warp-based threading has also shown that efficient control flow can be executed on GPUs via dynamic warp formation and large-warp microarchitectures, in which a common program counter is used to execute all the threads of a single warp together [5]. The scheduler selects a warp that is ready to execute and issues the next instruction to the active threads of that warp.

III. Design
As discussed, CUDA is used to establish virtual kernels between the distinct CPU and GPU resources, as shown in the given architecture; these kernels use data preloaded, after decomposition, from memory repositories such as buffer memory or temporary cache memory. The key design problem is caused by the fact that the computation results of the CPU side are stored in main memory, which is separate from device memory. To overcome this, our architecture preserves two different memory addresses in a pointer variable at a time. The design of the proposed UGCA architecture is shown in Figure 1.

Figure 1. Unified Graphics Computing Architecture

UGCA mainly contains two phases in its workflow. The first phase includes the task decomposition module and the mapping of all decomposed tasks; it also manages the distribution ratio in the kernel configuration information. The second phase is designed to translate Parallel Thread Execution (PTX) code into LLVM code. Here, PTX is the pseudo-assembly language used in CUDA. As shown in Figure 1, this second procedure translates PTX code into the LLVM intermediate representation. On the GPU device, the runtime system passes the PTX code through the CUDA device driver, which means that the GPU executes the kernel in the original manner using PTX-JIT compilation. UGCA uses the provided PTX translator to convert PTX instructions into the LLVM intermediate representation, which is used for a kernel context. The proposed system has the advantage that GPU initialization, memory checking, and PTX-to-LLVM compilation are performed in parallel.

IV. Workflow
In the first phase of the UGCA architecture, the nvcc compiler translates code written in CUDA into PTX; the graphics driver then invokes the CUDA compiler, which translates PTX to LLVM code that can run on a processing core. The most important functionality of the workload distribution module is that it generates two additional execution configurations, one each for the CPU and the GPU. To perform further operations, the module delivers the generated execution configurations to the CPU and GPU loaders.

As shown in Figure 1, the top module holds all the configuration information of both CPU and GPU kernels and the basic PTX assembly code, which is further converted into LLVM code as an intermediate representation. The input of the workload distribution module is the kernel configuration information, and its output specifies two different portions of the kernel space; the grid dimension is used by the workload distribution module. The proposed work distribution can split the kernel at the granularity of thread blocks. It also determines the number of thread blocks to be detached from the grid, considering the grid dimension and the workload distribution ratio. We focus on coarse warp-level parallelism, so when the number of threads does not match the preferred block size, bounds tests should be inserted into the kernel [5]; otherwise threads may access memory out of range. Some predefined semantics for declaring block boundaries are available in the CUDA library.

    __global__ void kernel(int n, int *x)
    {
        int idx = threadIdx.x + blockIdx.x * blockDim.x;
        if (idx < n)
        {
            // core programming
        }
    }

The number of threads covered by a block can be determined with built-in identifiers of the CUDA library such as threadIdx and blockIdx. In the code above, threadIdx.x indicates the number associated with each thread in a block. The block size is usually chosen as a multiple of 32 threads, such as 192 or 256. Similarly, block sizes can be declared in two dimensions, three dimensions, and so on; the convention mentioned above is very useful for two-dimensional matrices [9]. Here blockIdx.x refers to the label associated with a block in the grid; it is similar to the thread index except that it refers to the number associated with the block. To elaborate the concept of blockDim.x, suppose we want to load an array of 10 values into a kernel using two blocks, each containing 5 threads. A problem with this kind of operation is that the thread index only runs from 0 to 4 within one block. To overcome this, we can use a third built-in value provided by CUDA, blockDim.x, which holds the size of the block; in our case blockDim.x = 5.

A program written as a CUDA application should allocate memory spaces in the device memory of the graphics hardware; those memory addresses are then used for the input and output data. In this UGCA model, data can be copied between the host memory and the dedicated memory on the device. For this, the host system should preserve pointer addresses pointing to the locations in device memory. The key design problem, as noted above, is caused by the fact that the computation results of the CPU side are stored in main memory, which is separate from device memory.

V. Comparison
As discussed for the above architecture, CUDA provides massive parallelism for GPU as well as non-GPU programming [1]. Let us now compare it with other parallel programming environments according to their performance. The CUDA environment is only available on NVIDIA GPU devices, whereas the OpenCL standard may be implemented to run on any vendor's hardware, such as traditional CPUs and GPUs [2][3]. One of the most significant advantages of OpenCL compared to CUDA is its portability; where performance is the subject of comparison, however, CUDA gives greater speedups through its thread-based warp-level parallelism [4]. The multi-vendor compatibility of OpenCL creates enough differences between the two that it poses significant challenges to a robust CUDA-to-OpenCL source-to-source translation [2]. Figure 2 shows a comparison of CUDA and OpenCL for the execution of various algorithms, such as the fast Fourier transform, double-precision matrix multiplication [9], molecular dynamics, and simple 3D graphics programs.

Figure 2. GFLOPS required to execute different algorithms on CUDA and OpenCL

The performance of the various environments, in terms of required GFLOPS, is analyzed from their results. NVIDIA Tesla C2050 GPU computing processors perform around 515 GFLOPS in double-precision calculations, and the AMD FireStream 9270 peaks at 240 GFLOPS [14][19]. In single-precision performance, the Tesla C2050 performs around 1.03 TFLOPS and the AMD FireStream 9270 cards peak at 1.2 TFLOPS.
VI. Conclusion
This paper has introduced the architecture of a concept that is widely adopted for coarse-grained warp-level parallelism and is set to be adopted as an advanced form of warp-level parallelism. The success of the UGCA framework can be exemplified by the NVIDIA GeForce 9400 GT device; after the drastic success of that graphics device, this architecture will be the focus of further enhancements. We believe that cooperative heterogeneous computing can be utilized in heterogeneous multi-core processors, which are expected to include even more GPU cores as well as CPU cores. As future work, we will first develop a dynamic control scheme for deciding the workload distribution ratio and warp scheduling, and we also plan to design a more effective thread-block distribution technique that considers data access patterns and thread divergence.

References
[1] Ziming Zhong, V. Rychkov, A. Lastovetsky, "Data Partitioning on Heterogeneous Multicore and Multi-GPU Systems Using Functional Performance Models of Data-Parallel Applications", 2012 IEEE International Conference on Cluster Computing (CLUSTER), pp. 191-199, 24-28 Sept. 2012.
[2] P. Sathre, M. Gardner, Wu-chun Feng, "Lost in Translation: Challenges in Automating CUDA-to-OpenCL Translation", 2012 41st International Conference on Parallel Processing Workshops (ICPPW), pp. 89-96, 10-13 Sept. 2012.
[3] T. R. W. Scogland, B. Rountree, Wu-chun Feng, B. R. de Supinski, "Heterogeneous Task Scheduling for Accelerated OpenMP", 2012 IEEE 26th International Parallel & Distributed Processing Symposium (IPDPS), pp. 144-155, 21-25 May 2012.
[4] T. Okuyama, F. Ino, K. Hagihara, "A Task Parallel Algorithm for Computing the Costs of All-Pairs Shortest Paths on the CUDA-Compatible GPU", 2008 International Symposium on Parallel and Distributed Processing with Applications (ISPA 08), pp. 284-291.
[5] M. Goli, M. T. Garba, H. Gonzalez-Velez, "Streaming Dynamic Coarse-Grained CPU/GPU Workloads with Heterogeneous Pipelines in FastFlow", 2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems (HPCC-ICESS), pp. 445-452, 25-27 June 2012.
[6] Manish Arora, "The Architecture and Evolution of CPU-GPU Systems for General Purpose Computing", University of California, San Diego.
[7] Veynu Narasiman, Michael Shebanow, Chang Joo Lee, "Improving GPU Performance via Large Warps and Two-Level Warp Scheduling", The University of Texas at Austin.
[8] Rafiqul Zaman Khan, Md Firoj Ali, "A comparative study on parallel programming in parallel distributed computing system: MPI and PVM", Aligarh Muslim University.
[9] Ali Pinar, Cevdet Aykanat, "Sparse Matrix Decomposition with Optimal Load Balancing", TR06533 Bilkent, Ankara, Turkey.
[10] NVIDIA, NVIDIA CUDA SDKs.
[11] N. Bell and M. Garland, Cusp: "Generic parallel algorithms for sparse matrix and graph computations", 2010.
[12] J. Nickolls and W. Dally, "The GPU computing era", IEEE Micro, 30(2):56-69, April 2010.
[13] S. Huang, S. Xiao, W. Feng, "On the Energy Efficiency of Graphics Processing Units for Scientific Computing".
[14] Andrew J. Page, "Adaptive Scheduling in Heterogeneous Distributed Computing Systems", National University of Ireland, Maynooth.
[15] Getting Started with CUDA.
[16] NVIDIA GeForce Graphics Card.
[17] NVIDIA Tesla GPGPU system.
[18] The Supercomputing Blog.
[19] Articles on advanced computing.