
PERFORMANCE EVALUATION OF PARALLEL COMPUTING SYSTEMS

Dr. Narayan Joshi
Associate Professor, CSE Department, Institute of Technology, Nirma University, Ahmedabad, Gujarat, India

Parjanya Vyas
CSE Department, Institute of Technology, Nirma University, Ahmedabad, Gujarat, India

International Journal of Advanced Research in Engineering and Technology (IJARET), ISSN 0976-6480 (Print), ISSN 0976-6499 (Online), Volume 5, Issue 5, May 2014, pp. 82-90, © IAEME

ABSTRACT

Optimizing a computational problem's algorithm for a single high-performance processor can improve its execution only up to a point. To improve execution further, and thereby system performance, parallel processing is now widely adopted: it yields better system performance and computational efficiency while keeping the clock frequency at a normal level. Pthreads and CUDA are well-known parallel programming techniques for the CPU and the GPU respectively, and each exhibits interesting performance behavior. This paper evaluates their behavior under varying conditions. The results and the accompanying chart identify CUDA as the better approach. Furthermore, we present suggestions for attaining better performance with both Pthreads and CUDA.

Keywords: Parallel Computing, Multi-Core CPU, GPGPU, Pthreads, CUDA.

1. INTRODUCTION

Before the era of multi-core processing units, the only way to make a CPU faster was to increase its clock frequency. More and more transistors were packed onto a chip to increase performance. However, adding transistors to the processor chip demands more power and increases heat emission, which imposed a practical limit on clock frequency.
To overcome this severe limitation and achieve higher processor performance, advances in microelectronics enabled an era of parallel processing, based on the idea of using more than one CPU over a common address space. Processors that support parallel execution are called multi-core processors. Computational problems that involve vast amounts of data, with operations that are largely independent of each other's intermediate results, can be solved much more efficiently with parallel programming than with sequential programming. Ideally, a parallel program applies an operation to all data elements simultaneously and produces the final result in the time a sequential program needs to process a single element. Hence, when the operations to be performed are highly independent, or only partially dependent on intermediate results, parallel computing provides an immense improvement in efficiency over sequential programs.

Section 2 describes the literature survey. Section 3 presents the architecture and working of the GPGPU. The CPU parallel programming approach is covered in Section 4. Section 5 contains the comparative study of the two parallel programming approaches, and Section 6 discusses their behavior. Concluding remarks and the further course of study are described in Section 7.

2. RELATED WORK

Because of its significant benefits, parallel computing has been widely adopted by research and development sectors to solve medium to large scale computational problems, and many researchers have studied parallel programming and analyzed its various techniques. Jakimovska et al. have suggested an optimal method for parallel programming with Pthreads and OpenMPI [3]. Shuai Che et al. have presented a performance study of general-purpose applications on graphics processors using CUDA [10], comparing GPU performance to both single-core and multi-core CPU performance. A unique approach towards gaining optimum performance is presented by Yi Yang et al., who utilize CPU resources to facilitate the execution of GPGPU programs on fused CPU-GPU architectures [11]. Abu Asaduzzaman et al. have presented a detailed power consumption analysis of OpenMPI and POSIX threads [1], summarizing variations in MPI performance. Researchers have also worked on Pthreads optimization: Stevens and Chouliaras describe a parameterizable multiprocessor chip with hardware Pthreads support [4], and a study by Cerin et al. reports an experimental analysis of various thread scheduling libraries [2].

3. GPGPU PARALLEL PROGRAMMING

A typical CUDA program follows the general steps stated below [8] (a minimal code sketch of this flow is given below):

1. The CPU allocates storage on the GPU.
2. The CPU copies input data from the CPU to the GPU.
3. The CPU launches kernel(s) on the GPU to process the input data.
4. The GPU processes the data with multiple threads, as defined by the CPU.
5. The CPU copies the results from the GPU back to the CPU.

GPGPU-based parallel programs follow the master-slave processing model: the CPU acts as the master, controlling the sequence of steps, whereas the GPU is a collection of a large number of slave processors, enabling efficient execution of many threads in parallel [9][13].
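To make the five steps above concrete, the following minimal sketch adds two vectors on the GPU. It illustrates only the host-driven flow; the vector-add kernel, the array size and all identifiers are our own assumptions and are not taken from the paper's experiments.

/* Minimal CUDA sketch of the host-driven (master-slave) flow described above.
 * The vector-add kernel and all names here are illustrative assumptions. */
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void vecadd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                                            /* step 4: each GPU thread handles one element */
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, bytes);                     /* step 1: allocate storage on the GPU */
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);  /* step 2: copy input CPU -> GPU */
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int threads = 256, blocks = (n + threads - 1) / threads;
    vecadd<<<blocks, threads>>>(d_a, d_b, d_c, n);        /* step 3: launch the kernel */

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);  /* step 5: copy results GPU -> CPU */
    printf("c[0] = %f\n", h_c[0]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}

The same pattern of allocate, copy in, launch and copy out reappears in the CUDA matrix multiplication program of Section 5.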
Figure 1: GPGPU memory model

As shown in Figure 1, a GPGPU consists of several Streaming Multiprocessors (SMs), each comprising many co-operating thread processors, each of which can execute an instruction at a time. Every thread processor has its own private local memory; in addition, each SM has a private memory that serves as a memory common to all thread processors belonging to that SM [14]. Furthermore, the GPU also contains its device memory, which is global to all SMs. The memory model of a GPGPU is depicted in Figure 1 and the overall GPGPU architecture in Figure 2 [16][17][18].

Figure 2: GPGPU architecture
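To relate this memory hierarchy to code, the following sketch (an illustration under our own assumptions; the kernel, sizes and names are not from the paper) stages data from the global device memory into the SM-local shared memory, after which each block reduces its tile to a partial sum. The per-thread automatic variables correspond to the private local storage of each thread processor, the __shared__ array lives in the SM's private memory, and d_in and d_out reside in the device memory visible to all SMs.

/* Illustrative CUDA program (not from the paper): each thread block stages a tile
 * of the input from global device memory into SM-local __shared__ memory, then
 * reduces the tile to one partial sum. Launch with blockDim.x == TILE (a power of two). */
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define TILE 256

__global__ void block_sum(const float *d_in, float *d_out, int n)
{
    __shared__ float tile[TILE];                  /* lives in the SM's shared memory          */
    int gid = blockIdx.x * blockDim.x + threadIdx.x;   /* per-thread (private) index variables */
    int tid = threadIdx.x;

    tile[tid] = (gid < n) ? d_in[gid] : 0.0f;     /* global device memory -> shared memory     */
    __syncthreads();                              /* wait for the whole block to finish loading */

    /* tree reduction within the block, entirely in shared memory */
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            tile[tid] += tile[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        d_out[blockIdx.x] = tile[0];              /* one partial sum per block back to global memory */
}

int main(void)
{
    const int n = 4096;
    int blocks = (n + TILE - 1) / TILE;
    float *h_in = (float *)malloc(n * sizeof(float));
    float *h_out = (float *)malloc(blocks * sizeof(float));
    for (int i = 0; i < n; i++) h_in[i] = 1.0f;

    float *d_in, *d_out;
    cudaMalloc((void **)&d_in, n * sizeof(float));
    cudaMalloc((void **)&d_out, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    block_sum<<<blocks, TILE>>>(d_in, d_out, n);

    cudaMemcpy(h_out, d_out, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    float total = 0.0f;
    for (int b = 0; b < blocks; b++) total += h_out[b];
    printf("sum = %f (expected %d)\n", total, n);

    cudaFree(d_in); cudaFree(d_out);
    free(h_in); free(h_out);
    return 0;
}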
4. CPU PARALLEL PROGRAMMING

Pthreads, OpenMP, TBB (Threading Building Blocks), Cilk++ and MPI are some of the well-known CPU parallel programming techniques available [7]; Ensar Ajkunic et al. have presented a comparative study of such models in [5]. As discussed in the introductory section, this paper focuses on Pthreads and CUDA, so this section describes the Pthreads library and its working [12].

The Pthreads library considered in this paper for performance evaluation uses a two-level scheduler implementation, as shown in Figure 3. The library provides the pthread_create() function for user-level thread creation, and the Pthreads library scheduler is responsible for thread scheduling inside a process. When a particular thread is scheduled by the library scheduler, it is associated with a kernel thread from the thread pool, and scheduling of these kernel threads is then done by the OS. The mapping of user-level threads to kernel threads is NOT fixed or unique: the associated kernel thread's ID can change over time, as each user-level thread can be associated with a different kernel thread. Whenever a new thread is scheduled, the mapping to a kernel thread is performed anew, which necessitates mode switching [15].

Figure 3: Pthreads model

Each individual thread has its own copy of the stack, whereas all threads of a process share the same global memory (heap and data section). Many other global resources, such as the process ID, the parent process ID and open file descriptors, are likewise shared among the threads of a single process.
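The sharing rules above can be illustrated with a minimal Pthreads program (our own example, not from the paper): the global counter lives in the shared data section and is visible to every thread, whereas each thread's local variable lives on its private stack.

/* Minimal Pthreads sketch (illustrative): shared_counter is in the shared data
 * section; the local variable `mine` is on each thread's private stack. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

long shared_counter = 0;                            /* shared by all threads of the process */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg)
{
    long mine = (long)arg;                          /* private: lives on this thread's stack */
    pthread_mutex_lock(&lock);                      /* serialize access to the shared data   */
    shared_counter += mine;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)(i + 1));
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);                   /* wait for all user-level threads */
    printf("shared_counter = %ld\n", shared_counter);   /* prints 10 */
    return 0;
}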
5. STUDY

The problem of matrix multiplication is addressed using Pthreads on the CPU and Nvidia CUDA on the GPU. The test bed of our experimental environment is described below:

Hardware:
CPU: Intel® Core™ i7-2670QM CPU @ 2.20 GHz
Main memory: 4 GB
GPU: Nvidia GeForce® GT 520MX
GPU memory: 1 GB

Software:
OS: Ubuntu 13.04, kernel version 3.8.0-35-generic #50-Ubuntu SMP TUE DEC 3 01:24:59 x86_64 GNU/Linux
Driver: NVIDIA-LINUX-x86_64-331.49
CUDA version: Nvidia CUDA 6.0
Pthreads library: libpthread 2.17

Pthreads Program:

int M, N, P;  /* dimensions of the input matrices and the answer matrix are
                 MxN, NxP and MxP respectively */
unsigned long long int mat1[1000][1000];
unsigned long long int mat2[1000][1000];
unsigned long long int ans[1000][1000];

/* Each thread computes one element ans[i][j] of the result matrix. */
void *matmul(void *arg)
{
    int i, *arr;
    arr = (int *)arg;
    for (i = 0; i < N; i++)
        ans[*arr][*(arr+1)] += mat1[*arr][i] * mat2[i][*(arr+1)];
    return NULL;
}

int main(int argc, char *argv[])
{
    /* declare and initialize pthreads, pthread arguments and time variables */
    /* initialize the input matrices */
    ...
    gettimeofday(&start, NULL);
    for (i = 0; i < M; i++)
    {
        for (j = 0; j < P; j++, k++)
        {
            arg[k][0] = i; arg[k][1] = j;   /* pass i and j as arguments in the arg array */
            pthread_create(&p[i][j], NULL, matmul, (void *)(arg[k]));
            ...
        }
    }
    for (i = 0; i < M; i++)
    {
        for (j = 0; j < P; j++)
            pthread_join(p[i][j], NULL);
        ...
    }
    ...
    gettimeofday(&end, NULL);
    /* determine the execution time between end and start */
    return 0;
}
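The listing above elides the declarations of the thread handles, the argument array and the timers. Under the assumption that one thread and one {i, j} index pair are created per output element, they might look as follows; this is our own hypothetical reconstruction, not the authors' code.

/* Hypothetical declarations for the elided part of the Pthreads program above
 * (our assumption: one thread and one {i, j} pair per element of ans; sizes cover
 * the dimensions up to 95 used in Section 5). The paper declares these inside
 * main(); they are placed at file scope here to keep them off the thread stack. */
#include <pthread.h>
#include <sys/time.h>

#define MAXDIM 100

pthread_t p[MAXDIM][MAXDIM];        /* one thread handle per output element            */
int arg[MAXDIM * MAXDIM][2];        /* arg[k] = {row i, column j} for the k-th thread  */
struct timeval start, end;          /* timestamps used with gettimeofday()             */
int i, j, k = 0;

Note that with M = P = 95, the largest case in Table 1, this scheme already creates 95 × 95 = 9,025 user-level threads, which is the source of the scheduling overhead discussed in Section 6.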
CUDA program:

/* Each GPU thread computes one element of the output matrix. */
__global__ void matmul(unsigned long long int *d_outp, unsigned long long int *d_inp1,
                       unsigned long long int *d_inp2, int M, int N, int P)
{
    ...
    /* get the current working row and column number into integer variables r and c */
    if (r * c <= M * P)   /* if this thread is of use, then */
    {
        unsigned long long int temp = 0;
        for (int i = 0; i < N; i++)
            temp += d_inp1[(r)*N + (i)] * d_inp2[(i)*P + (c)];
        d_outp[(r)*P + (c)] = temp;
    }
}

int main(int argc, char *argv[])
{
    /* declare and initialize the constants M, N, P with the dimensions */
    /* determine the total number of threads and blocks and initialize thrd and blk accordingly */
    /* initialize the input matrices */

    const int sz1 = M * N * sizeof(unsigned long long int);
    const int sz2 = N * P * sizeof(unsigned long long int);
    const int anssz = M * P * sizeof(unsigned long long int);
    unsigned long long int *d_inp1, *d_inp2, *d_outp;

    gettimeofday(&start, NULL);
    cudaMalloc((void**) &d_inp1, sz1);
    cudaMalloc((void**) &d_inp2, sz2);
    cudaMalloc((void**) &d_outp, anssz);
    ...
    cudaMemcpy(d_inp1, inputarray1, sz1, cudaMemcpyHostToDevice);
    cudaMemcpy(d_inp2, inputarray2, sz2, cudaMemcpyHostToDevice);

    matmul<<<blk, dim3(thrd, thrd, 1)>>>(d_outp, d_inp1, d_inp2, M, N, P);

    cudaMemcpy(outputarray, d_outp, anssz, cudaMemcpyDeviceToHost);
    ...
    gettimeofday(&end, NULL);
    /* determine the execution time between end and start */

    cudaFree(d_inp1);
    cudaFree(d_inp2);
    cudaFree(d_outp);
    return 0;
}
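The kernel above elides how the row r and column c are derived and how thrd and blk are chosen. A hypothetical completion, under the assumption of a two-dimensional grid of 16 × 16 thread blocks covering the M × P output, is sketched below; the names and launch geometry are ours, not the authors'.

/* Hypothetical completion of the elided indexing and launch setup (our assumption,
 * not the authors' code): each GPU thread derives its output row r and column c
 * from its block and thread indices. */
__global__ void matmul_indexed(unsigned long long int *d_outp,
                               unsigned long long int *d_inp1,
                               unsigned long long int *d_inp2,
                               int M, int N, int P)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;   /* column in the output matrix */
    int r = blockIdx.y * blockDim.y + threadIdx.y;   /* row in the output matrix    */
    if (r < M && c < P)           /* stricter bounds check than r*c <= M*P above    */
    {
        unsigned long long int temp = 0;
        for (int i = 0; i < N; i++)
            temp += d_inp1[r * N + i] * d_inp2[i * P + c];
        d_outp[r * P + c] = temp;
    }
}

/* Host-side launch geometry (hypothetical):                                         */
/*   int thrd = 16;                                  16 x 16 = 256 threads per block */
/*   dim3 blk((P + thrd - 1) / thrd, (M + thrd - 1) / thrd, 1);                      */
/*   matmul_indexed<<<blk, dim3(thrd, thrd, 1)>>>(d_outp, d_inp1, d_inp2, M, N, P);  */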
Implementation and results

In both cases, CPU and GPGPU, the two input matrices have dimensions MxN and NxP respectively, and the resultant matrix therefore has dimensions MxP. Taking equal values of M, N and P, the experiments were carried out on the CPU and the GPGPU. The results, in terms of the time taken by each experiment, are shown in Table 1.

Table 1: Results in terms of time taken

Dimensions (M = N = P)    Time taken in milliseconds
                          Pthreads    CUDA
50                        52          178
55                        63          183
60                        71          184
65                        80          184
70                        94          183
75                        107         182
80                        130         180
85                        156         184
90                        168         177
95                        195         175

6. DISCUSSION

Chart 1: Line chart of the results in Table 1
Chart 1 depicts the behavior of both parallel programming approaches based on the results in Table 1. The chart clearly shows that for small dimensions, and hence small numbers of threads, Pthreads starts out highly efficient, i.e., it takes very little time. However, as the dimensions, and therefore the number of Pthreads, grow, the time required to complete the experiment increases approximately linearly. On the other side, the chart shows that the execution time taken by CUDA for the same set of experiments is nearly identical irrespective of the dimensions.

This observation supports the earlier discussion in Section 3 that the CUDA approach is highly efficient at executing multiple threads in parallel: the GPGPU possesses a very high number of processors that can execute the individual threads concurrently. The CPU approach remains less efficient despite the high-performance processor, owing to its limited number of cores. In the GPGPU, scheduling does not become a constraint, and the device therefore solves the computational problem in nearly constant time irrespective of the number of dimensions and threads. The CPU, being a high-performance processor, can solve the problem in less time than the GPU for a small number of threads; but as the number of threads increases, scheduling and management of these threads must also be done by the CPU, which is an overhead. This overhead keeps growing with the number of threads, so the total time taken by the CPU increases with the number of dimensions and threads.

Chart 1 also depicts an exceptional behavior of the GPGPU: CUDA takes nearly the same time to solve the problem even for small numbers of dimensions and threads, because apart from the thread management operations, a CUDA program involves two-way CPU-GPU I/O operations, which impose a roughly constant overhead regardless of problem size. As stated in Section 4, each time a new thread is scheduled under Pthreads, a mode switch is mandatory in addition to a context switch, since each user-level thread is associated with a kernel-level thread. These mode switches and context switches increase with the number of threads, which explains the rising execution time represented by the Pthreads line in Chart 1.

Based on the discussion above, some performance-centric suggestions are made here. Problems comprising fewer threads than the threshold point should be assigned to the CPU for execution using Pthreads (see the dispatch sketch below); otherwise the problem may be submitted to the GPGPU as a whole, or divided to run jointly on the CPU and the GPGPU simultaneously. Furthermore, it may be desirable to adopt fused CPU-GPU processing units [11] and deploy a computational problem upon them. Another notable suggestion is directed to Pthreads library maintainers: while solving a computational problem, the library may decide to divide the total threads between the CPU and the GPGPU with respect to the threshold point discussed above.
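As a concrete reading of the threshold-based suggestion above, the following sketch shows how a host program might choose between the two back ends. The threshold value, the function names and the calibration are our own assumptions; only the dimension range and the rough crossover point are taken from Table 1.

/* Hypothetical dispatch sketch (our assumption, not the authors' code): pick the
 * back end from the number of threads the problem would create, using a threshold
 * calibrated from measurements such as those in Table 1. On our test bed Pthreads
 * was still faster at dimension 90 (168 ms vs 177 ms) but slower at 95 (195 ms vs
 * 175 ms), so the crossover lies between 90*90 = 8100 and 95*95 = 9025 threads. */
#include <stdio.h>

enum backend { BACKEND_PTHREADS, BACKEND_CUDA };

static enum backend choose_backend(long n_threads, long threshold)
{
    return (n_threads < threshold) ? BACKEND_PTHREADS : BACKEND_CUDA;
}

int main(void)
{
    long threshold = 95L * 95L;          /* assumed calibration, see comment above */
    for (int dim = 50; dim <= 95; dim += 5) {
        long n_threads = (long)dim * dim;    /* one thread per output element      */
        printf("dim %d -> %s\n", dim,
               choose_backend(n_threads, threshold) == BACKEND_PTHREADS
                   ? "Pthreads on CPU" : "CUDA on GPGPU");
    }
    return 0;
}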
Moreover, one more suggestion, directed to CPU manufacturers and operating system developers, is to reserve and designate one specific core as a master core dedicated to scheduling, mode switching and context switching. This would free the other cores, i.e., the slave cores, from scheduling and switching responsibilities, dedicating them solely to solving the computational problems and thereby increasing overall performance.

7. CONCLUDING REMARKS

A novel approach comprising the performance evaluation of the CUDA and Pthreads parallel programming techniques is presented in this paper. The notable performance behavior of CUDA, along with the threshold point, is also highlighted.
Furthermore, significant suggestions pertaining to performance improvement with the CUDA and Pthreads parallel programming approaches are also presented. In future we intend to continue our work in the direction of the betterment of the open source Pthreads technique.

REFERENCES

1. A. Asaduzzaman, F. Sibai, H. El-Sayed (2013), "Performance and power comparisons of MPI vs Pthread implementations on multicore systems", Innovations in Information Technology (IIT), 2013 9th International Conference on, pp. 1-6.
2. C. Cerin, H. Fkaier, M. Jemni (2008), "Experimental Study of Thread Scheduling Libraries on Degraded CPU", Parallel and Distributed Systems, 2008. ICPADS '08. 14th IEEE International Conference on, pp. 697-704.
3. D. Jakimovska, G. Jakimovski, A. Tentov, D. Bojchev (2012), "Performance estimation of parallel processing techniques on various platforms", Telecommunications Forum (TELFOR).
4. D. Stevens, V. Chouliaras (2010), "LE1: A Parameterizable VLIW Chip-Multiprocessor with Hardware PThreads Support", VLSI (ISVLSI), 2010 IEEE Computer Society Annual Symposium on, pp. 122-126.
5. E. Ajkunic, H. Fatkic, E. Omerovic, K. Talic and N. Nosovic (2012), "A Comparison of Five Parallel Programming Models for C++", MIPRO 2012, Opatija, Croatia.
6. F. Mueller (1993), "A Library Implementation of POSIX Threads under UNIX", In Proceedings of the USENIX Conference, Florida State University, San Diego, CA, pp. 29-41.
7. G. Narlikar, G. Blelloch (1998), "Pthreads for dynamic and irregular parallelism", In Proceedings of the 1998 ACM/IEEE Conference on Supercomputing (SC '98), IEEE Computer Society, Washington, DC, USA, pp. 1-16.
8. M. Ujaldon (2012), "High performance computing and simulations on the GPU using CUDA", High Performance Computing and Simulation (HPCS), 2012 International Conference on, pp. 1-7.
9. NVIDIA (2006), "NVIDIA GeForce 8800 GPU Architecture Overview", TB-02787-001_v01.
10. S. Che, M. Boyer, J. Meng, D. Tarjan, J. Sheaffer, K. Skadron (2008), "A Performance Study of General-Purpose Applications on Graphics Processors Using CUDA", Journal of Parallel and Distributed Computing.
11. Y. Yang, P. Xiang, M. Mantor, H. Zhou (2012), "CPU-assisted GPGPU on fused CPU-GPU architectures", High Performance Computer Architecture (HPCA), 2012 IEEE 18th International Symposium on, pp. 1-12.
12. B. Nichols, D. Buttlar, J. Farrell, "Pthreads Programming", O'Reilly Media Inc., USA.
13. https://www.pgroup.com/lit/articles/insider/v2n1a5.htm
14. http://www.yuwang-cg.com/project1.html
15. http://man7.org/linux/man-pages/man7/pthreads.7.html
16. http://www.nvidia.in/object/cuda-parallel-computing-in.html
17. http://www.nvidia.in/object/nvidia-kepler-in.html
18. http://www.nvidia.in/object/gpu-computing-in.html
19. Aakash Shah, Gautami Nadkarni, Namita Rane and Divya Vijan, "Ubiqutous Computing Enabled in Daily Life", International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 5, 2013, pp. 217-223, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
20. Vinod Kumar Yadav, Indrajeet Gupta, Brijesh Pandey and Sandeep Kumar Yadav, "Overlapped Clustering Approach for Maximizing the Service Reliability of Heterogeneous Distributed Computing Systems", International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 4, 2013, pp. 31-44, ISSN Print: 0976-6367, ISSN Online: 0976-6375.