Nadar Saraswathi College of Arts
and Science, Theni
Parallelizing Matrix Multiplication
Compiler Design
S. Subha Thilagam
II MSc (CS)
Matrix Multiplication
• Let’s consider two arbitrary matrices A (n × m) and B (m × p). Since
the matrices are square, n = m = p.
• So, each element of the resultant matrix AB can be obtained like
this: (AB)[i][j] = Σₖ A[i][k] · B[k][j], for k = 0 … n−1.
Generate Random Square Matrix
• Let’s get into the implementation by creating random
matrices for multiplication. Here we use the malloc
function to allocate memory dynamically on the heap,
because testing requires matrices
of different dimensions.
• Here we have defined the data type as double, which
can be changed according to the use case.
The #pragma omp parallel for directive parallelizes the
loop so that we can initialize the matrix
more efficiently.
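• The generator itself is not shown on the slide; a minimal sketch
matching the description above (heap allocation with malloc, a double
TYPE, and a parallelized initialization loop) might look like this.
The name randomSquareMatrix is an assumption, not from the slides.

#include <stdlib.h>

#define TYPE double

TYPE** randomSquareMatrix(int dimension){
    /* Allocate the rows on the heap so the dimension can vary per test run. */
    TYPE** matrix = malloc(dimension * sizeof(TYPE*));
    for(int i = 0; i < dimension; i++){
        matrix[i] = malloc(dimension * sizeof(TYPE));
    }

    /* Parallelize the initialization across threads. Note that rand()
       is not guaranteed to be thread-safe; a per-thread generator such
       as rand_r() would be more robust. */
    #pragma omp parallel for
    for(int i = 0; i < dimension; i++){
        for(int j = 0; j < dimension; j++){
            matrix[i][j] = rand() / (TYPE)RAND_MAX;
        }
    }
    return matrix;
}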
Traditional Matrix Multiplication
• Without worrying much about performance, the
direct implementation of matrix multiplication is
given below.
• Operations occur sequentially for each
element of the resultant matrix.
• Here matrixA and matrixB are the input matrices
and matrixC is the resultant matrix, so we have to
pass the resultant matrix into the function by
reference (as a pointer), as the sketch below shows.
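• A sketch of that direct implementation, mirroring the parallel
version shown on the next slide (the name sequentialMultiply and the
zero-initialization of matrixC by the caller are assumptions):

#include <sys/time.h>

double sequentialMultiply(TYPE** matrixA, TYPE** matrixB, TYPE** matrixC, int dimension){
    struct timeval t0, t1;
    gettimeofday(&t0, 0);

    /* Plain triple loop: every element of matrixC is computed in turn. */
    for(int i = 0; i < dimension; i++){
        for(int j = 0; j < dimension; j++){
            for(int k = 0; k < dimension; k++){
                matrixC[i][j] += matrixA[i][k] * matrixB[k][j];
            }
        }
    }

    gettimeofday(&t1, 0);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1000000.0;
}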
Matrix Multiplication Using Parallel
For Loops
• When you are going to implement loop parallelization in
your algorithm, you can use a library like OpenMP to
make the hard work easy, or you can write your own
implementation with threads, where you have to handle
load balancing, race conditions, etc.
• For this tutorial I am going to stick with the OpenMP
library.
• We just need to add a few lines to make this
parallel.
/* Compile with OpenMP enabled, e.g. gcc -fopenmp matmul.c */
#include <sys/time.h>

double parallelMultiply(TYPE** matrixA, TYPE** matrixB, TYPE** matrixC, int dimension){
    struct timeval t0, t1;
    gettimeofday(&t0, 0);

    /* Split the outer loop over rows among the threads; matrixC is
       assumed to be zero-initialized by the caller. */
    #pragma omp parallel for
    for(int i = 0; i < dimension; i++){
        for(int j = 0; j < dimension; j++){
            for(int k = 0; k < dimension; k++){
                matrixC[i][j] += matrixA[i][k] * matrixB[k][j];
            }
        }
    }

    gettimeofday(&t1, 0);
    double elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1000000.0;
    return elapsed;
}
Optimized Matrix Multiplication
Using Parallel For Loops
• Since our matrices are stored on the heap, they are not as fast
to access as data stored on the stack. It is better to bring the
data from the heap to the stack before starting the multiplication
process, so we need to set up stack containers for it initially:
• TYPE flatA[MAX_DIM];
TYPE flatB[MAX_DIM];
The steps of the optimized matrix multiplication
implementation are given below.
1. Put common calculations in one place
Most of the time we ignore small redundant calculations in code
where performance is not required but clarity is; in this hot
loop, however, hoisting them out pays off, as the sketch below shows.
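• For example, with the flattened buffers introduced below, the offsets
i * dimension and j * dimension would otherwise be recomputed on every
iteration of the innermost loop; a small illustrative sketch (iOff,
jOff and sum are hypothetical names):

for(int i = 0; i < dimension; i++){
    int iOff = i * dimension;          /* computed once per row    */
    for(int j = 0; j < dimension; j++){
        int jOff = j * dimension;      /* computed once per column */
        TYPE sum = 0;                  /* accumulate locally       */
        for(int k = 0; k < dimension; k++){
            sum += flatA[iOff + k] * flatB[jOff + k];
        }
        matrixC[i][j] = sum;
    }
}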
2. Cache-friendly algorithm
implementation
• We all know that memory has a linear arrangement, so
every N-dimensional array is stored sequentially in
memory. In this example we can convert the two-dimensional
input matrices into a row-major and a column-major
one-dimensional array, so that the innermost loop reads
both operands sequentially and stays cache friendly.
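• A sketch of that conversion, filling the flatA and flatB buffers
declared earlier (the function names are assumptions):

/* matrixA is flattened row by row: element (i, j) lands at i * dimension + j. */
void convertToRowMajor(TYPE** matrix, TYPE* flat, int dimension){
    for(int i = 0; i < dimension; i++){
        for(int j = 0; j < dimension; j++){
            flat[i * dimension + j] = matrix[i][j];
        }
    }
}

/* matrixB is flattened column by column: element (i, j) lands at
   j * dimension + i, so the inner product in the multiply loop can
   read both buffers sequentially. */
void convertToColumnMajor(TYPE** matrix, TYPE* flat, int dimension){
    for(int i = 0; i < dimension; i++){
        for(int j = 0; j < dimension; j++){
            flat[j * dimension + i] = matrix[i][j];
        }
    }
}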
3. Using Stack vs. Heap Memory
Efficiently
• It is faster to access the stack than heap memory, but the
stack has limited space. We have stored the large
input matrices in heap memory; for efficient
intermediate calculations we use stack buffers with
predefined sizes.
• Here we launch 40 threads to do the
multiplication. Since we are dealing with
dimensions of 200, 400, 600, 800, 1000, 1200, 1400,
1600, 1800 and 2000, the workload can be divided equally
among the threads.
• In the OpenMP directive we have explicitly declared
matrixC as a shared resource; each thread writes to a
distinct set of rows, so no race conditions occur. A
combined sketch follows.
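• Putting the three steps together, a sketch of the optimized routine
(optimizedParallelMultiply is an assumed name; flatA, flatB and the
conversion helpers are those sketched above):

#include <sys/time.h>

double optimizedParallelMultiply(TYPE** matrixA, TYPE** matrixB, TYPE** matrixC, int dimension){
    struct timeval t0, t1;
    gettimeofday(&t0, 0);

    /* Step 2: flatten A row-major and B column-major for sequential reads. */
    convertToRowMajor(matrixA, flatA, dimension);
    convertToColumnMajor(matrixB, flatB, dimension);

    /* Step 3 timing loop: 40 threads, matrixC explicitly declared shared. */
    #pragma omp parallel for num_threads(40) shared(matrixC)
    for(int i = 0; i < dimension; i++){
        int iOff = i * dimension;      /* Step 1: hoisted offsets */
        for(int j = 0; j < dimension; j++){
            int jOff = j * dimension;
            TYPE sum = 0;
            for(int k = 0; k < dimension; k++){
                sum += flatA[iOff + k] * flatB[jOff + k];
            }
            matrixC[i][j] = sum;
        }
    }

    gettimeofday(&t1, 0);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1000000.0;
}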