Distributed Computing
EG 3113 CT Diploma in Computer Engineering
5th Semester
Unit 1: Fundamental concept of Parallel Processing
Lecture by : Er. Ashish K.C(Khatri)
Parallel Computing:
• Parallel computing is the simultaneous use of multiple compute resources to solve a computational problem:
- A problem is broken into discrete parts that can be solved concurrently.
- Each part is further broken down into a series of instructions.
- Instructions from each part execute simultaneously on different processors.
- An overall control/coordination mechanism is employed.
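As a minimal sketch of this decomposition (assuming Python with its standard multiprocessing module; the function and variable names are illustrative, not from the slides), a large sum can be split into parts that are computed concurrently and then combined by a coordinating pool:

```python
from multiprocessing import Pool

def partial_sum(chunk):
    # Each worker process solves one discrete part of the problem.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_parts = 4
    size = len(data) // n_parts
    # Break the problem into discrete parts that can be solved concurrently.
    chunks = [data[i * size:(i + 1) * size] for i in range(n_parts)]

    # The Pool is the overall control/coordination mechanism: it hands the
    # parts to worker processes and collects their results.
    with Pool(n_parts) as pool:
        total = sum(pool.map(partial_sum, chunks))

    print(total == sum(data))  # True
```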
• The computational problem should be able to:
- be broken apart into discrete pieces of work that can be solved simultaneously.
- execute multiple program instructions at any moment in time.
- be solved in less time with multiple compute resources than with a single compute resource.
• The compute resources are typically:
- a single computer with multiple processors/cores
- an arbitrary number of such computers connected by a network.
• Parallel computer architecture is the method of organizing all the resources to maximize performance and programmability within the limits imposed by technology and cost at any given time.
• It adds a new dimension to the development of computer systems by using more and more processors.
• Parallel processors are computer systems consisting of multiple processing units connected via an interconnection network, plus the software needed to make the processing units work together.
Advantages of parallel computing:
• It saves time and money, as many resources working together reduce the time and cut potential costs.
• It can be impractical to solve large problems with serial computing.
• It can take advantage of non-local resources when local resources are finite.
• It makes better use of the hardware.
• It provides concurrency.
• The whole real world runs in a dynamic nature, i.e. many things happen at the same time but in different places, concurrently.
• The resulting data is extremely large and hard to manage.
• Real-world data needs ever more dynamic simulation and modeling, and parallel computing is the key to achieving this.
• It ensures the effective utilization of resources.
• The hardware is guaranteed to be used effectively, whereas in serial computation only part of the hardware is used and the rest stays idle.
• It is also impracticable to implement real-time systems using serial computing.
Limitations of parallel computing:
• It introduces issues such as communication and synchronization between multiple subtasks and processes, which are difficult to achieve.
• The algorithms must be structured so that they can be handled by the parallel mechanism.
• The algorithms or programs must have low coupling and high cohesion, but it is difficult to create such programs.
• Only highly skilled and expert programmers can code parallelism-based programs well.
Architectures of parallel computers:
• General purpose:
a. Synchronous
b. Data flow
c. Pipeline
• Special purpose:
a. Asynchronous
b. Systolic
Synchronous Architecture:
• The control unit (CU) fetches instructions and issues them to all processors.
• The CU gives the same instruction to all processors, each operating on different data.
• For example:
P0 = a + b
P1 = c + d
P2 = e + f
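A hedged sketch of this lockstep behaviour using NumPy (an illustration added here, not part of the slides): a single vectorized add plays the role of the one instruction issued to all processors, each holding different operands:

```python
import numpy as np

# One "instruction" (elementwise add) applied to many data pairs:
# conceptually P0 computes a + b, P1 computes c + d, P2 computes e + f.
left = np.array([1, 3, 5])    # a, c, e
right = np.array([2, 4, 6])   # b, d, f
print(left + right)           # [ 3  7 11]
```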
Data flow architecture:
• It is based on a data-driven technique.
• Any program can be represented as a directed graph.
• For example: a × b + c × d.
• Here the operations become nodes.
• Connections/dependencies are given by the edges, as sketched below.
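A small data-driven evaluation sketch for a × b + c × d (the node names and dictionary layout are illustrative assumptions): each node fires as soon as values are available on all of its input edges.

```python
# Dataflow graph for a*b + c*d: operations are nodes, dependencies are edges.
graph = {
    "mul1": {"op": lambda x, y: x * y, "inputs": ["a", "b"]},
    "mul2": {"op": lambda x, y: x * y, "inputs": ["c", "d"]},
    "add":  {"op": lambda x, y: x + y, "inputs": ["mul1", "mul2"]},
}
values = {"a": 2, "b": 3, "c": 4, "d": 5}  # initial tokens on the input edges

# Data-driven firing: evaluate any node whose inputs are all available.
# Note that mul1 and mul2 are independent and could fire in parallel.
while len(values) < len(graph) + 4:
    for name, node in graph.items():
        if name not in values and all(i in values for i in node["inputs"]):
            args = [values[i] for i in node["inputs"]]
            values[name] = node["op"](*args)

print(values["add"])  # 2*3 + 4*5 = 26
```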
Pipelining:
• Pipelining is an implementation technique in which multiple instructions are overlapped in execution.
• The computer pipeline is divided into stages.
• Each stage completes a part of an instruction in parallel.
• The stages are connected one to the next to form a pipe: instructions enter at one end, progress through the stages, and exit at the other end.
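A toy simulation of this overlap (the stage names and instruction labels are illustrative): in cycle t, instruction i occupies stage t − i, so a new instruction enters the pipe every cycle.

```python
instructions = ["I1", "I2", "I3", "I4"]
stages = ["Fetch", "Decode", "Execute"]

# With overlap, 4 instructions finish in 4 + 3 - 1 = 6 cycles
# instead of the 4 * 3 = 12 cycles a purely serial execution needs.
for cycle in range(len(instructions) + len(stages) - 1):
    active = []
    for i, instr in enumerate(instructions):
        stage_idx = cycle - i  # instruction i is in stage (cycle - i)
        if 0 <= stage_idx < len(stages):
            active.append(f"{instr}:{stages[stage_idx]}")
    print(f"cycle {cycle + 1}: " + ", ".join(active))
```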
Asynchronous architecture:
• There is no synchronization among processors.
• All processors perform different instructions.
• Types of asynchronous architecture:
a. Bus architecture
b. Switch-based architecture
a. Bus architecture:
• Any processor can access any memory module over the shared bus.
• If P1 is using M1, then P2 cannot access M1 until it is released.
• This may cause a deadlock in the operation, as sketched below.
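The deadlock risk can be illustrated with a hedged Python threading sketch (the modules used are standard, but the scenario is an assumed illustration): two "processors" each hold one memory module's lock and wait for the other's. A timeout is used so the demo terminates instead of hanging.

```python
import threading
import time

m1, m2 = threading.Lock(), threading.Lock()  # memory modules M1 and M2

def processor(name, first, second):
    with first:                      # grab one module...
        time.sleep(0.1)              # ...while the other thread grabs the other
        if second.acquire(timeout=0.5):
            second.release()
            print(f"{name}: got both modules")
        else:                        # without the timeout this would block forever
            print(f"{name}: blocked -- potential deadlock")

p1 = threading.Thread(target=processor, args=("P1", m1, m2))
p2 = threading.Thread(target=processor, args=("P2", m2, m1))
p1.start(); p2.start()
p1.join(); p2.join()
```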
b. Switch based architecture:
• If processor P1 wants to access memory M1, then switch S1 is turned ON.
• Now processor P2 cannot access memory M1 because its switch is already ON, but it can access memory M2 or M3.
Systolic architecture:
• Processor P1 fetches data from memory and performs some operation on it.
• After completion, processor P1 passes the data to the next processor, P2, which performs some other operation on the same data.
• Likewise, processor P2 passes the data on to processor P3.
• The same data circulates over the processors, undergoing multiple operations, as sketched below.
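A minimal sketch of this circulation (the stage operations are arbitrary illustrations): each data item flows through a chain of "processors", with a different operation applied at each hop.

```python
# P1, P2, P3 each apply one operation; the same data flows through all three.
stages = [
    lambda x: x + 1,   # P1
    lambda x: x * 2,   # P2
    lambda x: x - 3,   # P3
]

def systolic_pass(data, stages):
    results = []
    for item in data:
        for stage in stages:   # the item is handed from processor to processor
            item = stage(item)
        results.append(item)
    return results

print(systolic_pass([1, 2, 3], stages))  # [1, 3, 5]
```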
Moore’s Law:
• It refers to Moore's observation that “the number of transistors on a microchip doubles every two years, though the cost of computers is halved.”
• Moore's law states that we can expect the speed and capability of our computers to increase every couple of years, and that we will pay less for them.
Grand challenge problem:
• The challenge is: “For general-purpose computing, develop a cost-effective architecture for improving single-task completion time by exploiting parallelism.”
• In other words, the following concerns need to be addressed.
- programmability and algorithmic theory
- applications (for improved productivity)
- performance
- buildability
- power/energy
- cost.
• The grand challenge is nearly identical to the classic parallel computing challenge.
• However, in spite of very extensive effort over at least two decades, general-purpose parallel computing is in a state of limbo.
• On one hand, parallel machines have not been cost-effective when it came to their programming.
• On the other hand, easy-to-program algorithmic models (such as PRAM) that provided great virtual parallel scalability have been developed;
• but unfortunately, it was not possible in the 1990s to build machines that efficiently support these models.
• The overheads for managing the parallelism provided by these models are simply too high.
Types of parallelism:
• Bit level parallelism
• Instruction level parallelism
• Task level parallelism
• Data level parallelism
Bit level parallelism:
• It is the form of parallel computing based on increasing the processor word size.
• It reduces the number of instructions that the system must execute in order to perform a task on large-sized data.
• Example: Consider a scenario where an 8-bit processor must compute the sum of
two 16-bit integers.
• It must first sum up the 8 lower-order bits, then add the 8 higher-order bits, thus
requiring two instructions to perform the operation.
• A 16-bit processor can perform the operation with just one instruction.
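The example can be made concrete with a short sketch (the function name is illustrative) that emulates an 8-bit machine adding two 16-bit integers in two steps: low bytes first, then high bytes plus the carry.

```python
def add16_on_8bit(a, b):
    lo = (a & 0xFF) + (b & 0xFF)          # instruction 1: low-order 8 bits
    carry = lo >> 8
    hi = ((a >> 8) & 0xFF) + ((b >> 8) & 0xFF) + carry  # instruction 2: high bits
    return ((hi & 0xFF) << 8) | (lo & 0xFF)

x, y = 0x1234, 0x0FCD
print(hex(add16_on_8bit(x, y)))  # 0x2201 -- same as hex(x + y)
```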
Instruction level parallelism:
• A processor can issue no more than one instruction in each clock-cycle phase.
• These instructions can be re-ordered and grouped, and then executed concurrently without affecting the result of the program.
• This is called instruction-level parallelism.
• It is a measure of how many of the instructions in a computer program can be
executed simultaneously.
Task level parallelism:
• Task parallelism employs the decomposition of a task into subtasks, then allocates each subtask to a processor for execution.
• The processors execute the subtasks concurrently, as sketched below.
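A hedged sketch of task parallelism with Python's concurrent.futures (the subtask functions are illustrative assumptions): three different subtasks are submitted and run concurrently.

```python
from concurrent.futures import ThreadPoolExecutor

def load_data():
    return "data loaded"

def render_page():
    return "page rendered"

def write_log():
    return "event logged"

# Different subtasks of one larger task execute concurrently.
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(task) for task in (load_data, render_page, write_log)]
    for f in futures:
        print(f.result())
```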
Data level parallelism:
• Instructions from a single stream operate concurrently on several data elements.
• It is limited by non-regular data manipulation patterns and by memory bandwidth.
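By contrast with task parallelism, a data-parallel sketch applies the same operation to different pieces of the data (the chunk split here is an arbitrary illustration):

```python
from multiprocessing import Pool

def square_chunk(chunk):
    # The same instruction stream (squaring) runs on different data.
    return [x * x for x in chunk]

if __name__ == "__main__":
    data = list(range(8))
    chunks = [data[:4], data[4:]]
    with Pool(2) as pool:
        print(pool.map(square_chunk, chunks))
    # [[0, 1, 4, 9], [16, 25, 36, 49]]
```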
Granularity:
• Granularity (or grain size) of a task is a measure of the amount of work (or
computation) which is performed by that task.
• Granularity is usually measured in terms of the number of instructions executed in
a particular task.
• On a parallel computer, user applications are executed as processes, tasks or
threads.
• The traditional definition of process is a program in execution.
• To achieve an improvement in speed through the use of parallelism, it is necessary
to divide the computation into tasks or processes that can be executed
simultaneously.
• The size of a process can be described by its granularity.
Types of granularity:
• Fine-grained parallelism
• Coarse-grained parallelism
• Medium-grained parallelism
Fine-grained parallelism:
• In fine-grained parallelism, a program is broken down into a large number of small tasks.
• These tasks are assigned individually to many processors.
• The amount of work associated with a parallel task is low and the work is evenly
distributed among the processors.
• Hence, fine-grained parallelism facilitates load balancing.
• In fine granularity, a process might consist of a few instructions, or perhaps even one
instruction.
• It is difficult for programmers to detect parallelism in a program; therefore, it is usually the compiler's responsibility to detect fine-grained parallelism.
• Fine-grained parallelism is best exploited in architectures which support fast
communication.
• Shared-memory architecture, which has a low communication overhead, is most suitable for fine-grained parallelism.
Coarse-grained parallelism:
• In coarse-grained parallelism, a program is split into large tasks.
• Due to this, a large amount of computation takes place in processors.
• This might result in load imbalance, wherein certain tasks process the bulk of the
data while others might be idle.
• Further, coarse-grained parallelism fails to exploit the parallelism in the program
as most of the computation is performed sequentially on a processor.
• The advantage of this type of parallelism is low communication and
synchronization overhead.
• Message-passing architecture takes a long time to communicate data among processes, which makes it suitable for coarse-grained parallelism.
Medium-grained parallelism:
• Medium-grained parallelism is defined relative to fine-grained and coarse-grained parallelism.
• It is a compromise between the two: the task size and communication time are greater than in fine-grained parallelism and lower than in coarse-grained parallelism.
• Most general-purpose parallel computers fall in this category.
Performance metrics of parallel processor:
• Speedup
• Efficiency
• Redundancy
Speedup:
• Speedup is a measure of performance.
• It measures the ratio between the sequential execution time and the parallel
execution time.
• The speedup is defined as the ratio of the serial runtime of the best sequential
algorithm for solving a problem to the time taken by the parallel algorithm to
solve the same problem on p processors.
• S(p) = T(1) / T(p)
where, T(1) – execution time with 1 processing unit
T(p) – execution time with p processing units
• Figure: Example of speedup
Efficiency:
• Efficiency is a measure of the usage of the computational capacity.
• It measures the ratio between performance and the number of resources available
to achieve that performance.
• E(p) = S(p) / p = T(1) / (p × T(p))
Redundancy:
• Redundancy measures the increase in the required computation when using more
processing units.
• It measures the ratio between the number of operations performed by the parallel
execution and by the sequential execution.
• R(p) = O(p) / O(1)
where, O(p) – total no. of operations performed by p processors
O(1) – total no. of operations performed by 1 processor
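The three metrics can be computed directly from their definitions; a small helper sketch follows (the sample measurements are assumed values, not from the slides):

```python
def speedup(t1, tp):
    return t1 / tp                      # S(p) = T(1) / T(p)

def efficiency(t1, tp, p):
    return t1 / (p * tp)                # E(p) = S(p) / p

def redundancy(op_p, op_1):
    return op_p / op_1                  # R(p) = O(p) / O(1)

t1, tp, p = 100.0, 30.0, 4              # assumed measured runtimes (seconds)
print(round(speedup(t1, tp), 2))        # 3.33
print(round(efficiency(t1, tp, p), 2))  # 0.83
print(redundancy(450, 400))             # 1.125 (assumed operation counts)
```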
Amdahl’s Law:
• The speedup of a program using multiple processors in parallel computing is
limited by the time needed for the serial fraction of the problem.
• Suppose Rajesh has to attend a function.
• Rajesh's two other friends, Radhe and Shyam, are also invited.
• The conditions are that all three friends must travel there separately, and all of them must be present at the door to get into the hall.
• Now Rajesh is coming by car, Radhe by bus, and Shyam on foot.
• No matter how fast Rajesh and Radhe reach there, they have to wait for Shyam.
• So to speed up the overall process, we need to concentrate on the performance of Shyam rather than that of Rajesh or Radhe.
• Amdahl's law is often used in parallel computing to predict the theoretical speedup
when using multiple processors.
• For example, if a program needs 20 hours to complete using a single thread, but a one-hour portion of the program cannot be parallelized,
• then only the remaining 19 hours (P = 0.95) of execution time can be parallelized,
• and regardless of how many threads are devoted to the parallelized execution of this program, the minimum execution time cannot be less than one hour.
• Hence, the theoretical speedup is limited to at most 20 times the single-thread performance.
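In the original slides the law itself appears only as figures; a standard statement (with P the parallelizable fraction and N the number of processors, consistent with the numericals below) is:

```latex
S(N) = \frac{1}{(1 - P) + \dfrac{P}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - P}
```

For the 20-hour example above, P = 0.95, so the limiting speedup is 1 / (1 − 0.95) = 20.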
Numericals:
• Suppose one wants to determine if it is advantageous to develop a parallel version
of a certain application. Through experimentation, it was verified that 90% of the
execution time is spent in procedures that can be parallelized. What is the
maximum speedup that can be achieved with a parallel version of the application
executing on 8 processing units?
• Solution: proportion that can be made parallel, P = 90% = 0.9
no. of processors, N = 8
then,
speedup, S(p) = 1 / ((1 − P) + P/N)
= 1 / ((1 − 0.9) + 0.9/8)
= 1 / (0.1 + 0.1125)
= 1 / 0.2125
≈ 4.71
• In an enhancement of a CPU design, the speed of the floating-point unit has been increased by 20% and that of the fixed-point unit by 10%. What is the overall speedup achieved if the ratio of the number of floating-point operations to the number of fixed-point operations is 2:3, and a floating-point operation used to take twice the time taken by a fixed-point operation in the original design?
Soln:
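The original working is shown only as a figure; a possible solution, assuming "increased by 20%" means the floating-point unit becomes 1.2× faster (and the fixed-point unit 1.1× faster), with t the original time of one fixed-point operation:

```latex
T_{\text{old}} = 2(2t) + 3(t) = 7t,
\qquad
T_{\text{new}} = \frac{2(2t)}{1.2} + \frac{3t}{1.1} \approx 3.33t + 2.73t = 6.06t
```

Overall speedup S = T_old / T_new = 7t / 6.06t ≈ 1.16.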
Numerical: (two further worked examples appear only as figures in the original slides)
Gustafson’s law:
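The body of this slide is a figure; as a standard statement (with α the serial fraction of the scaled workload and N the number of processors), Gustafson's law gives the scaled speedup:

```latex
S(N) = N - \alpha\,(N - 1)
```

Unlike Amdahl's law, it assumes the problem size grows with the processor count, so the achievable speedup keeps growing with N.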