PARALLELIZING MATRIX
MULTIPLICATION IN
COMPILER DESIGN
BY
M.SRI NANDHINI,
II- MSC(CS),
NADAR SARASWATHI COLLEGE OF
ARTS & SCIENCE,
VADAPUTHUPATTI, THENI.
MATRIX MULTIPLICATION
• Let’s consider two matrices A (n×m) and B (m×p). Since the matrices are square, n = m = p.
• So each entry of the resultant matrix AB can be obtained like this,
• (AB)ij = ∑k=1..n Aik Bkj.
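• For example, with n = 2, the first entry is (AB)11 = A11 B11 + A12 B21.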
GENERATE RANDOM SQUARE MATRIX
• Let’s get into the implementation by creating random matrices for multiplication. Here we use the malloc function to allocate memory dynamically on the heap, because for testing we have to deal with matrices of different dimensions.
• Here we have defined the data type as double, which can be changed according to the use case. The #pragma omp parallel for directive parallelizes the loop, so we can initialize the matrix more efficiently.
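• A minimal sketch of such a generator (the name generateRandomMatrix and the rand()-based values are assumptions for illustration, not the original listing):

#include <stdlib.h>
#include <omp.h>

#define TYPE double  /* the slides use double; change per use case */

/* Hypothetical helper: allocates an n x n matrix on the heap and fills it
   with random values in parallel. Note: rand() is not guaranteed to be
   thread-safe; rand_r() with per-thread seeds would be safer here. */
TYPE** generateRandomMatrix(int dimension)
{
    TYPE** matrix = malloc(dimension * sizeof(TYPE*));
    for (int i = 0; i < dimension; i++)
        matrix[i] = malloc(dimension * sizeof(TYPE));

    #pragma omp parallel for
    for (int i = 0; i < dimension; i++)
        for (int j = 0; j < dimension; j++)
            matrix[i][j] = rand() / (TYPE)RAND_MAX;

    return matrix;
}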
TRADITIONAL MATRIX
MULTIPLICATION
• Without worrying much about performance, the direct implementation of matrix multiplication is sketched below.
• Operations occur sequentially for each element of the resultant matrix.
• Here matrix A and matrix B are the input matrices and matrix C is the resultant matrix, so we have to pass the resultant matrix into the function as a reference.
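• The original listing did not survive here, so this is a reconstruction of the direct triple loop (the name sequentialMultiply is an assumption; the signature mirrors parallelMultiply from the later slide):

/* Direct O(n^3) multiplication: every element of C is accumulated in order.
   matrixC is assumed to be zero-initialized. */
void sequentialMultiply(TYPE** matrixA, TYPE** matrixB,
                        TYPE** matrixC, int dimension)
{
    for (int i = 0; i < dimension; i++)
        for (int j = 0; j < dimension; j++)
            for (int k = 0; k < dimension; k++)
                matrixC[i][j] += matrixA[i][k] * matrixB[k][j];
}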
MATRIX MULTIPLICATION USING
PARALLEL FOR LOOPS
• When you implement loop parallelization in your algorithm, you can use a library like OpenMP to make the hard work easy, or you can write your own implementation with threads, where you have to handle load balancing, race conditions, etc.
PROGRAM
#include <sys/time.h>
#include <omp.h>

/* Multiplies two dimension x dimension matrices in parallel and returns the
   elapsed wall-clock time in seconds. matrixC is assumed zero-initialized,
   since the inner loop accumulates with +=. */
double parallelMultiply(TYPE** matrixA, TYPE** matrixB,
                        TYPE** matrixC, int dimension)
{
    struct timeval t0, t1;
    gettimeofday(&t0, 0);

    /* Each thread gets its own block of i values, i.e. its own rows of C. */
    #pragma omp parallel for
    for (int i = 0; i < dimension; i++) {
        for (int j = 0; j < dimension; j++) {
            for (int k = 0; k < dimension; k++) {
                matrixC[i][j] += matrixA[i][k] * matrixB[k][j];
            }
        }
    }

    gettimeofday(&t1, 0);
    double elapsed = (t1.tv_sec - t0.tv_sec) * 1.0 +
                     (t1.tv_usec - t0.tv_usec) / 1000000.0;
    return elapsed;
}
OPTIMIZED MATRIX MULTIPLICATION
USING PARALLEL FOR LOOPS
• Since our matrices are stored on the heap, they cannot be accessed as quickly as data stored on the stack. It is better to bring that data from the heap to the stack before starting the multiplication process, so we set up containers for it initially:
• TYPE flatA[MAX_DIM];
• TYPE flatB[MAX_DIM];
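• A sketch of the copy step (the name flatten is an assumption; MAX_DIM must be at least dimension * dimension, and as file-scope arrays these buffers are technically static storage rather than the call stack, though they serve the same purpose of fast, contiguous access):

TYPE flatA[MAX_DIM];
TYPE flatB[MAX_DIM];

/* Copy the heap matrices into contiguous row-major buffers. */
void flatten(TYPE** matrixA, TYPE** matrixB, int dimension)
{
    for (int i = 0; i < dimension; i++)
        for (int j = 0; j < dimension; j++) {
            flatA[i * dimension + j] = matrixA[i][j];
            flatB[i * dimension + j] = matrixB[i][j];
        }
}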
STEPS OF OPTIMIZED MATRIX
MULTIPLICATION IMPLEMENTATION
1. Put common calculations in one place:
Most of the time we tolerate small redundant calculations across a program when clarity matters more than performance; here we hoist them out of the inner loops (for example, the row offset i * dimension).
2. Cache-friendly algorithm implementation:
We all know that memory has a linear arrangement.
Cont…
So every N-dimensional array is stored sequentially in memory (row-major in C), and iterating over it in that same order keeps memory accesses cache friendly.
3. Using stack vs. heap memory efficiently:
• Accessing the stack is faster than accessing heap memory, but the stack has limited space. We have stored the large input matrices in heap memory; for efficient intermediate calculations we use the stack with predefined memory allocations, as in the sketch after this list.
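A sketch combining the three steps (a reconstruction under assumptions, not the original listing; it assumes the includes, TYPE, MAX_DIM, flatA/flatB and flatten() from the earlier slides, and that flatten(matrixA, matrixB, dimension) has already been called):

double optimizedParallelMultiply(TYPE** matrixC, int dimension)
{
    struct timeval t0, t1;
    gettimeofday(&t0, 0);

    #pragma omp parallel for shared(matrixC) num_threads(40)
    for (int i = 0; i < dimension; i++) {
        int iOff = i * dimension;              /* step 1: hoisted offset */
        for (int j = 0; j < dimension; j++) {
            TYPE tot = 0;
            int kOff = 0;
            for (int k = 0; k < dimension; k++) {
                /* step 2: flatA is read sequentially; kOff walks down
                   column j of flatB without recomputing k * dimension */
                tot += flatA[iOff + k] * flatB[kOff + j];
                kOff += dimension;
            }
            /* step 3: tot accumulates in a stack slot/register;
               the heap matrix C is written only once per element */
            matrixC[i][j] = tot;
        }
    }

    gettimeofday(&t1, 0);
    return (t1.tv_sec - t0.tv_sec) * 1.0 +
           (t1.tv_usec - t0.tv_usec) / 1000000.0;
}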
Cont…
• Here we launch 40 threads to do the multiplication. Since we are dealing with dimensions of 200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800 and 2000, all divisible by 40, the workload can be divided equally among the threads.
• In the OpenMP pragma we have explicitly declared matrixC as a shared resource; since each thread writes only its own rows of C, this avoids race conditions.
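• A hypothetical test driver over those dimensions (zeroMatrix is an assumed helper returning a calloc-based, zero-initialized matrix, which parallelMultiply needs since it accumulates with +=):

#include <stdio.h>

int main(void)
{
    for (int dim = 200; dim <= 2000; dim += 200) {
        TYPE** A = generateRandomMatrix(dim);
        TYPE** B = generateRandomMatrix(dim);
        TYPE** C = zeroMatrix(dim);   /* assumed helper, zero-initialized */
        printf("dim=%d time=%f s\n", dim, parallelMultiply(A, B, C, dim));
        /* a full program would free A, B and C here */
    }
    return 0;
}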
THANK YOU!!!