2. MATRIX MULTIPLICATION
• Let’s consider two arbitrary matrices A and B.
Since the matrices are square, n = m = p.
• So, each element of the resultant matrix AB can
be obtained like this,
• (AB)ij = ∑k Aik Bkj, where k runs from 1 to n.
3. GENERATE RANDOM SQUARE MATRIX
• Let’s get into the implementation by creating
random matrices for multiplication. Here we use
the malloc function to allocate memory
dynamically on the heap, because for testing we
have to deal with matrices of different
dimensions.
• Here we have defined the data type as double,
which can be changed according to the use case.
The #pragma omp parallel for directive
parallelizes the loop, so we can initialize the
matrix more efficiently.
4. TRADITIONAL MATRIX
MULTIPLICATION
• Without worrying much about performance, the
direct implementation of matrix multiplication
is given below.
• The operations occur sequentially for each
element of the resultant matrix.
• Here matrices A and B are the input matrices
and matrix C is the resultant matrix, so we
have to pass the resultant matrix into the
function by reference.
5. MATRIX MULTIPLICATION USING
PARALLEL FOR LOOPS
• When you implement loop parallelization in
your algorithm, you can use a library like
OpenMP to make the hard work easy, or write
your own implementation with threads, where
you have to handle load balancing, race
conditions, etc. yourself.
8. OPTIMIZED MATRIX MULTIPLICATION
USING PARALLEL FOR LOOPS
• Since our matrices are stored on the heap,
they cannot be accessed as quickly as data
stored on the stack. It is better to bring the
data from the heap into fixed-size containers
before starting the multiplication, so we set
up those containers first:
• TYPE flatA[MAX_DIM];
• TYPE flatB[MAX_DIM];
9. STEPS OF OPTIMIZED MATRIX
MULTIPLICATION IMPLEMENTATION
1. Put common calculations in one place:
Most of the time we ignore small calculations
that are repeated throughout a program, because
clarity matters more than performance there.
Here, performance matters, so we hoist them out
of the hot loops.
2. Cache-friendly algorithm implementation:
We all know that memory has a linear
arrangement, so every N-dimensional array is
stored sequentially in memory. Accessing it in
that order makes the best use of the cache.
10. Cont…
3. Using stack vs. heap memory efficiently:
• Accessing the stack is faster than accessing
heap memory, but the stack has limited
capacity. We have stored the large input
matrices in heap memory; for efficient
intermediate calculations we use the stack
with predefined memory allocations.
11. Cont…
• Here we have launched 40 threads to do the
multiplication. Since we are dealing with
dimensions of 200, 400, 600, 800, 1000, 1200,
1400, 1600, 1800 and 2000, the workload can be
divided equally among the threads.
• In OpenMP we have explicitly declared matrixC
as a shared resource; since each thread writes
to distinct rows of it, this avoids race
conditions.