Parallel Concepts
Dr. C.V. Suresh Babu

The Goal of Parallelization
• Reduction of elapsed time of a program
• Reduction in turnaround time of jobs
[Figure: cpu time on 1 processor versus 4 processors, with communication overhead marked]

• Overhead:
– total increase in cpu time
– communication
– synchronization
– additional work in algorithm
– non-parallel part of the program
• (one processor works, others spin idle)

[Figure: elapsed time from start to finish for 1, 2, 4 and 8 processors, showing the reduction in elapsed time]


Speedup and Efficiency
Both measure the parallelization properties of a program
• Let T(p) be the elapsed time on p processors
• The Speedup S(p) and the Efficiency E(p) are defined as:
S(p) = T(1)/T(p)
E(p) = S(p)/p
• for ideal parallel speedup we get:
T(p) = T(1)/p
S(p) = T(1)/T(p) = p
E(p) = S(p)/p = 1 or 100%

[Figure: Speedup versus number of processors and Efficiency versus number of processors, with curves labelled ideal, super-linear, saturation and disaster]
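The two definitions above can be sketched as small C helpers. The function names `speedup` and `efficiency` are hypothetical, chosen here for illustration, and the inputs are assumed to be measured elapsed times in any consistent unit.

```c
#include <assert.h>

/* S(p) = T(1)/T(p): how many times faster the run on p processors is. */
double speedup(double t1, double tp)
{
    assert(t1 > 0.0 && tp > 0.0);
    return t1 / tp;
}

/* E(p) = S(p)/p: fraction of the ideal speedup actually achieved. */
double efficiency(double t1, double tp, int p)
{
    assert(p >= 1);
    return speedup(t1, tp) / p;
}
```

For example, a program that takes 100 s on one processor and 40 s on four has a speedup of 2.5 and an efficiency of 0.625.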

Amdahl’s Law
This rule states the following for parallel programs:
The non-parallel fraction of the code (i.e. overhead)
imposes the upper limit on the scalability of the code
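• the non-parallel (serial) fraction s of the program includes the communication and synchronization overhead

\begin{align}
1 &= s + f && \text{(program has serial and parallel fractions)} \tag{1} \\
T(1) &= T(\text{parallel}) + T(\text{serial}) \tag{2} \\
     &= T(1)\,(f + s) \notag \\
     &= T(1)\,(f + (1-f)) \notag \\
T(p) &= T(1)\,\left(\frac{f}{p} + (1-f)\right) \tag{3} \\
S(p) &= \frac{T(1)}{T(p)} \tag{4} \\
     &= \frac{1}{f/p + 1 - f} \notag \\
     &< \frac{1}{1-f} && \text{(limit for } p \to \infty\text{)} \notag \\
S(p) &< \frac{1}{1-f} \tag{5}
\end{align}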
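A minimal sketch of the speedup formula and its upper bound in C follows. The names `amdahl_speedup` and `amdahl_limit` are hypothetical; `f` is the parallel fraction and `p` the number of processors, as in the derivation above.

```c
#include <assert.h>

/* S(p) = 1/(f/p + (1-f)) for parallel fraction f on p processors. */
double amdahl_speedup(double f, int p)
{
    assert(f >= 0.0 && f <= 1.0 && p >= 1);
    return 1.0 / (f / p + (1.0 - f));
}

/* Upper bound 1/(1-f), approached as p grows without limit.
   Undefined for f == 1 (a fully parallel program has no bound). */
double amdahl_limit(double f)
{
    assert(f >= 0.0 && f < 1.0);
    return 1.0 / (1.0 - f);
}
```

With f = 0.9 the bound is 10, so even 1000 processors give a speedup below 10.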


Amdahl’s Law: Time to Solution

T(p) = T(1)/S(p)
S(p) = 1/(f/p + (1-f))

[Figure: hypothetical program run time as a function of the number of processors for several parallel fractions f, on a log-log plot]
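The data behind such a run-time plot can be regenerated with a few lines of C. This is a sketch under stated assumptions: `amdahl_time` and `runtime_curve` are hypothetical names, and processor counts are taken as powers of two (1, 2, 4, ...) to match an evenly spaced log axis.

```c
#include <assert.h>

/* T(p) = T(1) * (f/p + (1-f)): predicted elapsed time on p processors. */
double amdahl_time(double t1, double f, int p)
{
    assert(t1 > 0.0 && f >= 0.0 && f <= 1.0 && p >= 1);
    return t1 * (f / p + (1.0 - f));
}

/* Fill out[k] with T(2^k) for k = 0 .. count-1. */
void runtime_curve(double t1, double f, double *out, int count)
{
    int k, p = 1;
    assert(out != 0 && count >= 0 && count <= 30); /* keep 2^k inside int */
    for (k = 0; k < count; k++, p *= 2)
        out[k] = amdahl_time(t1, f, p);
}
```

For T(1) = 100 and f = 0.5 the curve starts 100, 75, 62.5, 56.25 and never drops below the serial floor of 50.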


Fine-Grained Vs Coarse-Grained
• Fine-grain parallelism (typically loop level)
– can be done incrementally, one loop at a time
– does not require deep knowledge of the code
– a lot of loops have to be parallel for decent speedup
– potentially many synchronization points (at the end of each parallel loop)

• Coarse-grain parallelism
– make larger loops parallel at higher call-tree level, potentially enclosing many small loops
– more code is parallel at once
– fewer synchronization points, reducing overhead
– requires deeper knowledge of the code

[Figure: call tree from MAIN through routines A to O down to loops p to t, with regions labelled Coarse-grained and Fine-grained]
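The difference between the two styles can be sketched with OpenMP directives in C. The routine names and the small ROWS x COLS array are hypothetical and chosen only for illustration; without an OpenMP compiler flag the pragmas are ignored and both versions run serially with the same result.

```c
#include <assert.h>

#define ROWS 4
#define COLS 8

/* Fine-grained: the small inner loop is parallel. The caller stays
   serial, so a parallel region starts and joins once per row. */
static void scale_row_fine(double *row, int n, double k)
{
    int j;
    assert(row != 0 && n >= 0);
    #pragma omp parallel for
    for (j = 0; j < n; j++)
        row[j] *= k;
}

void scale_fine(double m[ROWS][COLS], double k)
{
    int i;
    for (i = 0; i < ROWS; i++)
        scale_row_fine(m[i], COLS, k);
}

/* Coarse-grained: the outer loop, one level up the call tree, is
   parallel and encloses the inner loops; there is a single join. */
static void scale_row_serial(double *row, int n, double k)
{
    int j;
    assert(row != 0 && n >= 0);
    for (j = 0; j < n; j++)
        row[j] *= k;
}

void scale_coarse(double m[ROWS][COLS], double k)
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < ROWS; i++)
        scale_row_serial(m[i], COLS, k);
}
```

The fine-grained version could be written without knowing who calls `scale_row_fine`; the coarse-grained one requires knowing that the rows are independent of each other.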


Other Impediments to Scalability
Load imbalance:
• the time to complete a parallel execution of a code segment is determined by the longest running thread
• unequal work load distribution leads to some processors being idle, while others work too much
• with coarse grain parallelization, more opportunities for load imbalance exist

[Figure: threads p0 to p3 plotted against elapsed time, from start to finish]
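The effect of load imbalance can be quantified with a short C sketch. The function names are hypothetical, and the per-thread times are assumed to be measured busy times for one parallel segment.

```c
#include <assert.h>
#include <stddef.h>

/* Elapsed time of a parallel segment: the longest-running thread. */
double segment_elapsed(const double *thread_time, int nthreads)
{
    double longest = 0.0;
    int t;
    assert(thread_time != NULL && nthreads > 0);
    for (t = 0; t < nthreads; t++)
        if (thread_time[t] > longest)
            longest = thread_time[t];
    return longest;
}

/* Useful work divided by the processor time occupied
   (nthreads * elapsed); 1.0 means perfectly balanced. */
double balance_efficiency(const double *thread_time, int nthreads)
{
    double total = 0.0, elapsed;
    int t;
    elapsed = segment_elapsed(thread_time, nthreads);
    for (t = 0; t < nthreads; t++)
        total += thread_time[t];
    return elapsed > 0.0 ? total / (nthreads * elapsed) : 1.0;
}
```

With thread times of 1, 2, 3 and 10 seconds the segment takes 10 seconds and only 40% of the occupied processor time does useful work.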

Too many synchronization points
• compiler will put synchronization points at the start and exit of
each parallel region

Computing π with DPL
\[
\pi = \int_0^1 \frac{4\,dx}{1+x^2} \approx \sum_{0 \le i < N} \frac{4}{N\left(1+\left(\frac{i+0.5}{N}\right)^2\right)}
\]

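PROGRAM PIPROG
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 1000000
  INTEGER :: I
  REAL (KIND=8) :: PI
  REAL (KIND=8), PARAMETER :: W = 1.0D0/N
  ! Midpoint rule: sample x = (I+0.5)*W for I = 0 .. N-1
  PI = SUM( (/ (4.0D0*W/(1.0D0+((I+0.5D0)*W)**2), I=0,N-1) /) )
  PRINT *, PI
END PROGRAM PIPROG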

Notes:
– essentially sequential form
– automatic detection of parallelism
– automatic work sharing
– all variables shared by default
– number of processors specified outside of the code

compile with:


Computing π with Shared Memory
\[
\pi = \int_0^1 \frac{4\,dx}{1+x^2} \approx \sum_{0 \le i < N} \frac{4}{N\left(1+\left(\frac{i+0.5}{N}\right)^2\right)}
\]

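#include <stdio.h>
#define n 1000000
int main(void)
{
    double l, ls = 0.0, w = 1.0/n;
    int i;
    /* each thread has a private l; per-thread partial sums of ls
       are combined by the reduction at the end of the loop */
    #pragma omp parallel for private(i,l) reduction(+:ls)
    for(i=0; i<n; i++) {
        l = (i+0.5)*w;              /* midpoint of interval i */
        ls += 4.0/(1.0+l*l);
    }
    printf("pi is %f\n", ls*w);
    return 0;
}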

Notes:
– essentially sequential form
– automatic work sharing


Computing π with Message Passing
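\[
\pi = \int_0^1 \frac{4\,dx}{1+x^2} \approx \sum_{0 \le i < N} \frac{4}{N\left(1+\left(\frac{i+0.5}{N}\right)^2\right)}
\]

#include <stdio.h>
#include <mpi.h>
#define N 1000000
int main(int argc, char **argv)
{
    double pi, l, ls = 0.0, w = 1.0/N;
    int i, mid, nth;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD,&mid);   /* rank of this process */
    MPI_Comm_size(MPI_COMM_WORLD,&nth);   /* number of processes */

    /* cyclic distribution: rank mid handles i = mid, mid+nth, ... */
    for(i=mid; i<N; i += nth) {
        l = (i+0.5)*w;
        ls += 4.0/(1.0+l*l);
    }
    /* sum the partial results of all ranks onto rank 0 */
    MPI_Reduce(&ls,&pi,1,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD);
    if(mid == 0) printf("pi is %f\n",pi*w);
    MPI_Finalize();
    return 0;
}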
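The work distribution used in the message passing version can be checked without an MPI installation. The sketch below is a serial stand-in, not MPI code: the hypothetical `partial_sum` plays the role of the loop run by one rank, and `pi_from_ranks` adds the partial sums the way the sum reduction would.

```c
#include <assert.h>

/* Partial midpoint-rule sum computed by one rank under a cyclic
   distribution: rank handles i = rank, rank+nranks, rank+2*nranks, ... */
double partial_sum(int rank, int nranks, int n)
{
    double w = 1.0 / n, x, s = 0.0;
    int i;
    assert(n > 0 && nranks > 0 && rank >= 0 && rank < nranks);
    for (i = rank; i < n; i += nranks) {
        x = (i + 0.5) * w;
        s += 4.0 / (1.0 + x * x);
    }
    return s;
}

/* Serial stand-in for the reduction: add the partial sums of all
   ranks, then scale by the interval width 1/n. */
double pi_from_ranks(int nranks, int n)
{
    double total = 0.0;
    int r;
    for (r = 0; r < nranks; r++)
        total += partial_sum(r, nranks, n);
    return total / n;
}
```

Every index i is handled by exactly one rank, so the result agrees with the single-process sum up to floating-point rounding, whatever the number of ranks.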

Notes:


Comparing Parallel Paradigms
• Automatic parallelization combined with explicit Shared Variable programming (compiler directives) is used on machines with global memory
– Symmetric Multi-Processors, CC-NUMA, PVP
– These methods are collectively known as Shared Memory Programming (SMP)
– The SMP programming model works at loop level and at coarse level parallelism:
• the coarse level parallelism has to be specified explicitly
• loop level parallelism can be found by the compiler (implicitly)

• Explicit Message Passing Methods are necessary with machines that have no global memory addressability:
– clusters of all sorts, NOW & COW
– Message Passing Methods require coarse level parallelism to be scalable

• Choosing a programming model is largely a matter of the application, personal preference and the target machine.
• It has nothing to do with scalability.
• Scalability is mainly a function of the hardware and (your) implementation of the parallelism.

Scalability limitations:
– communication overhead
– process synchronization


Summary


• The serial part or the communication overhead of the code limits the scalability of the code (Amdahl's Law)
• programs have to be >99% parallel to use large (>30 processor) machines efficiently
• several Programming Models are in use today:
– Shared Memory programming (SMP) (with Automatic Compiler
parallelization, Data-Parallel and explicit Shared Memory models)
– Message Passing model
• Choosing a Programming Model is largely a matter of the application,
personal choice and target machine. It has nothing to do with scalability.
– Don’t confuse Algorithm and implementation
• machines with a global address space can run applications based on both SMP and Message Passing programming models
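As a rough check of the 99% rule of thumb in this summary, Amdahl's formula S(p) = 1/(f/p + (1-f)) can be evaluated for two parallel fractions. The choice of p = 32 and of f = 0.99 and f = 0.95 is made here only for illustration.

```latex
\begin{align*}
f = 0.99:\quad S(32) &= \frac{1}{0.99/32 + 0.01} = \frac{1}{0.0409375} \approx 24.4,
  & E(32) &= \frac{S(32)}{32} \approx 0.76 \\
f = 0.95:\quad S(32) &= \frac{1}{0.95/32 + 0.05} = \frac{1}{0.0796875} \approx 12.5,
  & E(32) &= \frac{S(32)}{32} \approx 0.39
\end{align*}
```

Even at 99% parallel, about a quarter of the capacity of 32 processors is lost; at 95% parallel, more than half is.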
