Performance of matrix multiplication on cluster

Matrix multiplication is one of the most important computational kernels in scientific computing. Consider the matrix product C = A×B, where A, B, and C are matrices of size n×n. We propose four parallel matrix multiplication implementations on a cluster of workstations. These parallel implementations are based on the master-worker model using a dynamic block distribution scheme. Experiments are carried out using the Message Passing Interface (MPI) library on a cluster of workstations. Moreover, we propose an analytical prediction model that can be used to predict the performance metrics of the implementations on a cluster of workstations. The performance model has been checked, and it has been shown to predict the parallel performance accurately.

Performance Model Of The Matrix Implementations:

In this section, we develop an analytical performance model to describe the computational behavior of the four parallel matrix multiplication implementations on both kinds of cluster. We consider the matrix product C = A×B, where the three matrices A, B, and C are dense and of size n×n. The number of workstations in the cluster is denoted by p, and we assume that p is a power of 2. The performance modeling of the four implementations is presented in the next subsections.

Procedure:

The program was modified so that each run performs the multiplication 30 times and then reports the average time. This was done four times, with the time measured on 1, 2, 4, and 8 nodes respectively.
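The averaging procedure can be sketched in plain Python. This is only an illustration under stated assumptions: it times a naive sequential product rather than the MPI program, and the names `matmul` and `average_time`, the matrix order used, and the way the matrices are filled are hypothetical and not taken from the original program.

```python
import time


def matmul(a, b):
    """Naive dense product C = A x B for square matrices stored as lists of lists."""
    # Transpose B once so that every entry of C is a dot product of two sequences.
    bt = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def average_time(n, runs=30):
    """Multiply two n x n matrices `runs` times and return the mean time in seconds."""
    # Hypothetical fill pattern: small non-negative integers.
    a = [[(i + j) % 10 for j in range(n)] for i in range(n)]
    b = [[(i * j) % 10 for j in range(n)] for i in range(n)]
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        matmul(a, b)
        total += time.perf_counter() - start
    return total / runs


if __name__ == "__main__":
    # A small order keeps the sketch quick; the order here is an arbitrary choice.
    print(f"mean time over 30 runs: {average_time(64):.6f} s")
```

In an MPI program the same pattern would typically wrap the parallel multiplication, with the timer read on the master process; that part is not shown here.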
Graph Explanation:

In our experiments we implemented matrix multiplication using MPI. To avoid overflow exceptions for large matrix orders, small non-negative matrix elements were used. The experiments were repeated using 1, 2, 4, and 8 hosts for both implementations, with a total of 30 test runs and a matrix size of 1000.

[Figure: Matrix Multiplication with Cluster. Execution time (seconds) versus number of processors.]

Number of processors    Time (s)
1                       11.70022764
2                       7.69745732
4                       6.45429768
8                       5.00715470

Although the algorithm runs faster on a larger number of hosts, the gain in speedup shrinks as hosts are added. For instance, going from 1 to 2 hosts saves about 4.0 s, whereas going from 2 to 4 hosts saves only about 1.2 s and going from 4 to 8 hosts about 1.4 s. This is because the increase in communication cost outweighs the reduction in computation cost.

One processor takes 11.70022764 s: when a single processor handles all parts of the computation, it performs slowly. With 2 processors the time drops to 7.69745732 s, less than the one-processor time, which shows the improvement in performance when more nodes are used. Next, 4 processors take 6.45429768 s and 8 processors take 5.00715470 s, so increasing the number of processors reduces the execution time. However, the 8-processor performance is not as good as expected: it is only about 2.3 times faster than one processor. One reason can be the overhead of passing messages between processors. From these values it can be deduced that if the problem size is kept constant and the number of nodes is gradually increased, the required time may eventually increase as well because of this overhead. If the matrix is small, a single processor will show better performance, because passing the data takes more time than the multiplication itself; on average, however, the performance of 4 processors is better.

Conclusion:

The basic parallel matrix-vector multiplication implementation and a variation were presented and implemented on a cluster platform. These implementations are based on the cluster platform considered in this paper. Further, we presented the experimental results of the proposed implementations in the form of performance graphs. We observed from the results that the performance of the basic implementation degrades. Moreover, from the experimental analysis we identified the communication cost and the cost of reading data from disk as the primary factors affecting the performance of the basic parallel matrix-vector implementation. Finally, we introduced a performance model to analyze the performance of the proposed implementations on a cluster.
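As a supplement to the timings reported under Graph Explanation, the usual speedup and efficiency metrics can be computed directly from the four measured values. The sketch below assumes the standard definitions S(p) = T(1) / T(p) and E(p) = S(p) / p. The function and variable names are hypothetical; only the timing values come from the measurements in the text.

```python
# Mean execution times in seconds, copied from the measurements reported in the text.
measured = {1: 11.70022764, 2: 7.69745732, 4: 6.45429768, 8: 5.00715470}


def speedup(times, p):
    """Speedup on p processors relative to one processor: T(1) / T(p)."""
    return times[1] / times[p]


def efficiency(times, p):
    """Parallel efficiency: speedup divided by the number of processors."""
    return speedup(times, p) / p


if __name__ == "__main__":
    for p in sorted(measured):
        s = speedup(measured, p)
        e = efficiency(measured, p)
        print(f"p={p}: speedup={s:.2f}, efficiency={e:.2f}")
```

Under these definitions the efficiency falls steadily as p grows for the reported timings, which is consistent with the communication overhead discussed in the text.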