Report on
Matrix Multiplication using Parallel Processing
Subject
Advanced Computer Architecture
Submitted To
Mr. Asim Munir
Class
MSCS-F14
Prepared by: Sunawar Khan
Reg No: 813-MSCS-F14
Matrix Multiplication using Parallel Processing
Parallel Processing
Introduction to Parallel Processing
There are many applications in day-to-day life that demand real-time solutions. For
example, weather forecasting has to be done in a timely fashion, and if an expert system is used
to aid a physician in surgical procedures, decisions have to be made within seconds. Programs
written for such applications have to perform an enormous amount of computation. Even the
fastest single-processor machine may not be able to come up with solutions within tolerable time
limits. Parallel Random Access Machines (PRAMs) offer the potential of decreasing the solution
times enormously. For example:
Say there are 100 numbers to be added and there are two persons, A and B. Person A can add
the first 50 numbers while, at the same time, B adds the next 50. When they are done, one of
them can add the two partial sums to get the final answer. So two people can add the 100
numbers in almost half the time required by one.
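As a minimal sketch of this idea (in Python, which we use for all illustrative examples below; the two threads stand in for persons A and B):

import concurrent.futures

def partial_sum(chunk):
    # Each "person" adds one half of the numbers.
    return sum(chunk)

numbers = list(range(1, 101))   # the 100 numbers to be added
mid = len(numbers) // 2

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    # A sums the first 50 numbers while B sums the next 50.
    sum_a = pool.submit(partial_sum, numbers[:mid])
    sum_b = pool.submit(partial_sum, numbers[mid:])
    # One final addition combines the two partial sums.
    total = sum_a.result() + sum_b.result()

print(total)   # 5050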
Here is another example: computing a convex hull using parallel processing. Take the set of
points and divide it into two halves. Recursive calls compute the convex hulls of the two halves
in parallel. Conquer stage: take the two convex hulls and merge them to obtain the convex hull
of the entire set.
What is a Computational Model?
“A computational model is a mathematical model in computational science that requires
extensive computational resources to study the behavior of a complex system by computer
simulation.”
Random Access Machine Model
Algorithms can be measured in a machine-independent way using the Random Access Machine
(RAM) model. This model assumes a single processor. In the RAM model, instructions are
executed one after the other, with no concurrent operations. This model of computation is an
abstraction that allows us to compare algorithms on the basis of performance. The assumptions
made in the RAM model to accomplish this are:
• Each simple operation takes one time step.
• Loops and subroutines are not simple operations.
• Each memory access takes one time step, and there is no shortage of memory.
For any given problem, the running time of an algorithm is taken to be the number of time steps,
and the space it uses is taken to be the number of RAM memory cells.
The RAM is thus a model of computation whose memory consists of an unbounded sequence of
registers, each of which may hold an integer. In this model, arithmetic operations are allowed to
compute the address of a memory register. We assume that any of the following operations can
be done in one unit of time: addition, subtraction, multiplication, division, comparison, memory
access, assignment, and so on. This model is widely accepted as a valid sequential model.
An important feature of parallel computing that is absent in sequential computing is the need for
interprocessor communication. For example, given any problem, the processors have to
communicate among themselves and agree on the subproblems each will work on. They also
need to communicate to determine whether every processor has finished its task, and so on.
RAM Model for a Single-Processor Machine
The designer of a sequential algorithm typically formulates the algorithm using an abstract
model of computation called the random access machine (RAM). In this model, the machine
consists of a single processor connected to a memory system, and each arithmetic operation,
logical operation, and memory access requires one step.
• Standard random access machine.
• Each operation (load, store, jump, add, etc.) takes one unit of time.
• A simple, general model; the basic model for sequential algorithms.
• Unbounded number of local memory cells.
• Each memory cell can hold an integer of unbounded size.
• Instruction set includes simple operations, data operations, comparisons, and branches.
• All operations take unit time.
• Time complexity = number of instructions executed.
• Space complexity = number of memory cells used.
Parallel Machine or Multiprocessor Model
We begin this discussion with an ideal parallel machine called the Parallel Random Access
Machine, or PRAM. A multiprocessor model is a generalization of the sequential RAM model in
which there is more than one processor. Multiprocessor models can be classified into three basic
types: the local-memory machine, the modular-memory machine, and the parallel random-access
machine (PRAM).
“The Parallel Random Access Machine (PRAM) is an abstract model for parallel computation
which assumes that all the processors operate synchronously under a single clock and are able to
randomly access a large shared memory.”
• A natural extension of the random access machine (RAM) serial architecture is the
Parallel Random Access Machine, or PRAM.
• A PRAM consists of p processors and a global memory of unbounded size that is uniformly
accessible to all processors.
• Processors share a common clock but may execute different instructions in each cycle.
PRAM Architecture
The parallel version of the RAM constitutes an abstract model of the class of global-memory
parallel processors. The abstraction consists of ignoring the details of the processor-to-memory
interconnection network and taking the view that each processor can access any memory
location in each machine cycle, independent of what other processors are doing.
Processor i can do the following in the three phases of one cycle:
• Fetch a value from address si in shared memory.
• Perform computations on data held in local registers.
• Store a value into address di in shared memory.
The PRAM is an abstract machine for designing algorithms applicable to parallel computers:
• An unbounded collection of RAM processors P0, P1, …
• Processors have no tape; each processor has an unbounded set of registers.
• An unbounded collection of shared memory cells.
• All processors can access all memory cells in unit time.
• All communication is via shared memory.
• Two or more processors may read simultaneously from the same cell.
• A write conflict occurs when two or more processors try to write simultaneously into
the same cell.
Classification of PRAM Model
This classification is reminiscent of Flynn’s classification and offers yet another example of
the quest to invent four-letter abbreviations/acronyms in computer architecture! Note that here,
too, one of the categories is not very useful, because if concurrent writes are allowed, there is no
logical reason for excluding the less problematic concurrent reads. EREW PRAM is the most
realistic of the four submodels (to the extent that thousands of processors concurrently accessing
thousands of memory locations within a shared memory address space of millions or even
billions of locations can be considered realistic!). CRCW PRAM is the least restrictive
submodel, but has the disadvantage of requiring a conflict resolution mechanism to define the
effect of concurrent writes (more on this below). The default submodel, which is assumed when
nothing is said about the submodel, is CREW PRAM. For most computations, it is fairly easy to
organize the algorithm steps so that concurrent writes to the same location are never attempted.
CRCW PRAM is further classified according to how concurrent writes are handled. Here are a
few example submodels based on the semantics of concurrent writes in CRCW PRAM:
The four submodels, classified by whether reads and writes to the same location are exclusive
or concurrent:
• EREW (exclusive read, exclusive write): least “powerful”, most “realistic”.
• CREW (concurrent read, exclusive write): the default submodel.
• ERCW (exclusive read, concurrent write): not useful.
• CRCW (concurrent read, concurrent write): most “powerful”, further subdivided.
CRCW Submodels
• Undefined: In case of multiple writes, the value written is undefined (CRCW-U).
• Detecting: A special code representing “detected collision” is written (CRCW-D).
• Common: Multiple writes are allowed only if all store the same value (CRCW-C). This is
sometimes called the consistent-write submodel.
• Random: The value written is randomly chosen from among those offered (CRCW-R).
This is sometimes called the arbitrary-write submodel.
• Priority: The processor with the lowest index succeeds in writing its value (CRCW-P).
• Max/Min: The largest/smallest of the multiple values is written (CRCW-M).
• Reduction: The arithmetic sum (CRCW-S), logical AND (CRCW-A), logical XOR
(CRCW-X), or some other combination of the multiple values is written.
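To make these semantics concrete, here is a minimal sketch (illustrative Python, our addition; the submodel names map to the variants above) that resolves a set of concurrent writes aimed at one memory cell:

# Each write is a (processor_id, value) pair aimed at the same cell.
writes = [(3, 7), (1, 7), (5, 2)]

def resolve(writes, submodel):
    values = [v for _, v in writes]
    if submodel == "common":            # CRCW-C
        # Allowed only if all processors offer the same value.
        assert len(set(values)) == 1, "inconsistent common write"
        return values[0]
    if submodel == "priority":          # CRCW-P
        # The write from the lowest-indexed processor succeeds.
        return min(writes)[1]
    if submodel == "max":               # CRCW-M
        return max(values)
    if submodel == "sum":               # CRCW-S (a reduction submodel)
        return sum(values)
    raise ValueError(submodel)

print(resolve(writes, "priority"))   # 7  (processor 1 wins)
print(resolve(writes, "max"))        # 7
print(resolve(writes, "sum"))        # 16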
Physical Complexity
• Processors and memories are connected via switches.
• Since these switches must operate in O(1) time at the level of words, for a system of p
processors and m memory words the switch complexity is O(mp).
• Clearly, for meaningful values of p and m, a true PRAM is not realizable.
Relationships between the PRAM Submodels
The following relationships have been established between some of the PRAM submodels:
EREW < CREW < CRCW-D < CRCW-C < CRCW-R < CRCW-P
Even though all CRCW submodels are strictly more powerful than the EREW submodel, the
latter can simulate the most powerful CRCW submodel listed above with at most logarithmic
slowdown.
A p-processor CRCW-P (priority) PRAM can be simulated by a p-processor EREW PRAM with
a slowdown factor of Θ(log p), because an EREW PRAM can sort, or find the smallest of, p
values in Θ(log p) time, as we shall see later. To avoid concurrent writes, each processor writes an ID-address-value
triple into its corresponding element of a scratch list of size p, with the p processors then
cooperating to sort the list by the destination addresses, partition the list into segments
corresponding to common addresses (which are now adjacent in the sorted list), do a reduction
operation within each segment to remove all writes to the same location except the one with the
smallest processor ID, and finally write the surviving address-value pairs into memory. This final
write operation will clearly be of the exclusive variety.
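A minimal sequential sketch of this simulation (illustrative Python, our addition; on a real EREW PRAM the sort and the per-segment reduction would themselves run in parallel in Θ(log p) steps):

# Pending concurrent writes as (processor_id, address, value) triples.
triples = [(2, 9, 'x'), (0, 9, 'y'), (1, 4, 'z')]

# Step 1: sort by destination address, breaking ties by processor ID,
# so that writes to a common address become adjacent.
triples.sort(key=lambda t: (t[1], t[0]))

# Step 2: within each address segment, keep only the write from the
# processor with the smallest ID (the priority semantics).
memory = {}
for pid, addr, val in triples:
    if addr not in memory:      # the first (smallest-ID) writer wins
        memory[addr] = val

print(memory)   # {4: 'z', 9: 'y'}: processor 0 beat processor 2 at address 9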
Matrix Multiplication
Introduction to Matrix Multiplication.
We will show how to implement matrix multiplication C = C + A·B on several communication
networks, and develop performance models to predict how long they take. We will see that the
performance depends on several factors.
We discuss PRAM matrix multiplication algorithms as representative examples of the class of
numerical problems. Matrix multiplication is quite important in its own right and is also used as
a building block in many other parallel algorithms. For example, matrix multiplication is useful
in solving graph problems when the graphs are represented by their adjacency or weight
matrices. Given m × m matrices A and B, with elements aij and bij, their product C is defined by

cij = ∑k=0..m–1 aik bkj, for 0 ≤ i, j < m.
The following O(m³)-step sequential algorithm can be used for multiplying m × m matrices:
Sequential matrix multiplication
for i = 0 to m – 1 do
  for j = 0 to m – 1 do
    t := 0
    for k = 0 to m – 1 do
      t := t + a[i][k] * b[k][j]
    endfor
    c[i][j] := t
  endfor
endfor
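A direct, runnable translation of this pseudocode (a sketch in plain Python, using list-of-lists matrices):

def matmul(A, B):
    m = len(A)
    C = [[0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            t = 0
            for k in range(m):
                t += A[i][k] * B[k][j]   # t := t + a[i][k] * b[k][j]
            C[i][j] = t
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B))   # [[19, 22], [43, 50]]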
Consider n × n matrix multiplication with n³ processors:
• Each cij = ∑k=1..n aik bkj can be computed on the CREW PRAM in parallel, using n
processors per element, in O(log n) time.
• On the EREW PRAM, the exclusive reads of the aik and bkj values can be satisfied by making
n copies of A and B, which takes O(log n) time with n processors.
• The total time is still O(log n).
• The memory requirement is of course much higher for the EREW PRAM.
• Complexity: Θ(n³).
• A better algorithm improves on this slightly: multiplication by blocks (described below).
PRAM Matrix Multiplication Using m² Processors
Proc (i, j), 0 ≤ i, j < m, do
begin
  t := 0
  for k = 0 to m – 1 do
    t := t + a[i][k] * b[k][j]
  endfor
  c[i][j] := t
end
Because multiple processors will be reading the same row of A or the same column of B, the
above naive implementation of the algorithm would require the CREW submodel. However, it is
possible to convert the algorithm to an EREW PRAM algorithm by skewing the memory
accesses (how?).
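Before turning to the skewing question, here is a minimal sketch of the CREW version itself (illustrative Python, our addition; thread tasks stand in for the m² virtual processors, and true lock-step execution is not modeled):

import concurrent.futures

def proc_ij(args):
    # One virtual processor computes the single element c[i][j].
    A, B, i, j = args
    return i, j, sum(A[i][k] * B[k][j] for k in range(len(A)))

def pram_matmul(A, B):
    m = len(A)
    C = [[0] * m for _ in range(m)]
    with concurrent.futures.ThreadPoolExecutor() as pool:
        tasks = [(A, B, i, j) for i in range(m) for j in range(m)]
        for i, j, t in pool.map(proc_ij, tasks):
            C[i][j] = t
    return C

print(pram_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# [[19, 22], [43, 50]]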
With only m processors, matrix multiplication can be done in Θ(m²) time by using processor i
to compute the m elements in row i of the product matrix C in turn.
PRAM Matrix Multiplication with m Processors
for j = 0 to m – 1 Proc i, 0 ≤ i < m, do
  t := 0
  for k = 0 to m – 1 do
    t := t + a[i][k] * b[k][j]
  endfor
  c[i][j] := t
endfor
– The m processors read m different rows of A at once, so no concurrent read of A occurs.
– All m processors read the same column of B at the same time, so concurrent reads must be
allowed (CREW).
– If they are not, the memory accesses can be skewed so that each processor reads a different
element in every step, converting the CREW algorithm into an EREW one, as sketched below.
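A minimal sketch of the skewing idea (illustrative Python, our addition; the point is only the access order, since a real EREW PRAM would run the m processors in lock step):

def skewed_row_times_column(A, B, i, j):
    # Processor i starts its k-loop at k = i and wraps around, so at any
    # synchronous step s the m processors read B[(i + s) mod m][j]:
    # m distinct cells, satisfying the EREW exclusive-read rule.
    m = len(A)
    t = 0
    for s in range(m):
        k = (i + s) % m
        t += A[i][k] * B[k][j]
    return t

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print([[skewed_row_times_column(A, B, i, j) for j in range(2)]
       for i in range(2)])
# [[19, 22], [43, 50]]: same result, conflict-free access order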
More Efficient Matrix Multiplication (for NUMA)
On the Cm* NUMA-type shared-memory multiprocessor, a research prototype machine built at
Carnegie-Mellon University in the 1980s, this block matrix multiplication algorithm exhibited
good, but sublinear, speed-up. With 16 processors, the speed-up was only 5 in multiplying 24 ×
24 matrices. However, the speed-up improved to about 9 when larger 36 × 36 (48 × 48) matrices
were multiplied. It is interesting to note that improved locality of the block matrix multiplication
algorithm can also improve the running time on a uniprocessor, or distributed shared-memory
multiprocessor with caches, in view of higher cache hit rates.
Detail of Block Matrix Multiplication
A multiply-add computation on q × q blocks needs 2q² = 2m²/p memory accesses and 2q³
arithmetic operations, so q arithmetic operations are done per memory access.
[Figure: one multiply-add step in block matrix multiplication. Elements of block (i, k) in matrix
A are multiplied by elements of block (k, j) in matrix B and added into block (i, j) in matrix C;
within a block, row indices run from iq to iq + q - 1 and column indices from jq to jq + q - 1.]
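A minimal sketch of block matrix multiplication (illustrative Python, our addition; it assumes the block size q divides m evenly):

def block_matmul(A, B, q):
    # Multiply m x m matrices in q x q blocks: block (i, j) of C
    # accumulates block (i, k) of A times block (k, j) of B.
    m = len(A)
    assert m % q == 0, "q must divide m in this sketch"
    C = [[0] * m for _ in range(m)]
    nb = m // q
    for bi in range(nb):
        for bj in range(nb):
            for bk in range(nb):
                # One q x q multiply-add step, as in the figure above.
                for a in range(q):
                    for b in range(q):
                        s = 0
                        for c in range(q):
                            s += A[bi*q + a][bk*q + c] * B[bk*q + c][bj*q + b]
                        C[bi*q + a][bj*q + b] += s
    return C

print(block_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], 1))
# [[19, 22], [43, 50]]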
A Simple Parallel Algorithm
Example: adding n numbers in parallel (see the sketch after this list):
1. We start with 4 processors, and each of them adds 2 items in the first step.
2. The number of items is halved at every subsequent step; hence log n steps are required
for adding n numbers.
3. The processor requirement is O(n).
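A minimal sketch of this halving pattern (illustrative Python, our addition; each loop pass stands for one synchronous parallel step in which all the pairwise additions would happen concurrently):

def tree_sum(values):
    # log n halving steps: in each step, item i of the new list is the
    # sum of items 2i and 2i+1 of the old one.
    while len(values) > 1:
        if len(values) % 2:                 # pad odd lengths with a zero
            values = values + [0]
        values = [values[2*i] + values[2*i + 1]
                  for i in range(len(values) // 2)]
    return values[0]

print(tree_sum([1, 2, 3, 4, 5, 6, 7, 8]))   # 36, in log2(8) = 3 steps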
CREW Cost
Let P(n) = O(n²):
• The n² processors read all cells aij at once: O(1).
• The n² processors read all cells bij at once: O(1).
• Each processor multiplies aij · bij: O(1).
• Parallel sum to obtain each cij: O(log n).
• Store the value cij: O(1).
• T(n) = O(log n)
• P(n) = O(n²)
• W(n) = O(n² log n) = total number of operations
EREW Cost
Let P(n) = O(n²):
• The n² processors read all cells aij at once: O(1).
• The n² processors cannot read all cells bij in O(1), because concurrent reading is not
allowed.
• Skew the memory accesses, or replicate the data so that each processor reads its own copy
(see the replication sketch below): O(log n).
• Each processor multiplies aij · bij: O(1).
• Parallel sum to obtain each cij: O(log n).
• Store the value cij: O(1).
• T(n) = O(log n)
• P(n) = O(n²)
• W(n) = O(n² log n) = total number of operations
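A minimal sketch of that replication step (illustrative Python, our addition; each loop pass stands for one parallel doubling step, so n copies exist after about log2 n passes):

def replicate(value, n):
    # Doubling broadcast: after pass p there are up to 2^p copies, so
    # ceil(log2 n) passes give each of n processors its own copy.
    copies = [value]
    while len(copies) < n:
        # Each existing copy is read by exactly one processor, which
        # writes one new copy: exclusive reads and writes throughout.
        copies = copies + copies[:n - len(copies)]
    return copies

print(replicate(42, 5))   # [42, 42, 42, 42, 42] after 3 doubling passes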
Advantages of the PRAM Model
• The PRAM removes algorithmic details concerning synchronization and communication,
allowing the algorithm designer to focus on the properties of the problem.
• A PRAM algorithm includes an explicit understanding of the operations performed at each
time unit and an explicit allocation of processors to jobs at each time unit.
• The PRAM design paradigm has turned out to be robust, and PRAM algorithms have been
mapped efficiently onto many other parallel models and even network models.
Disadvantages of the PRAM Model
• Model inaccuracies:
unbounded local memory (registers);
all operations take unit time;
processors run in lock step.
• Unaccounted costs:
non-local memory access;
latency;
bandwidth;
memory-access contention.
Conclusion
• The PRAM is the source of many of the most fundamental ideas in parallel algorithms.
• It is a source of inspiration for new algorithms.
• The PRAM is simple and easy to understand.
The improved locality of block matrix multiplication can also improve the running time on a
uniprocessor, or on a distributed shared-memory multiprocessor with caches. The reason:
higher cache hit rates.