2. The Goal of Parallelization
• Reduction of elapsed time of a program
• Reduction in turnaround time of jobs
[Figure: CPU time on 1 processor versus 4 processors; the 4-processor run adds communication overhead]
• Overhead:
– total increase in cpu time
– communication
– synchronization
– additional work in algorithm
– non-parallel part of the program
• (one processor works, others spin idle)
[Figure: elapsed time from start to finish on 1, 2, 4 and 8 processors, showing the reduction in elapsed time]
3. Speedup and Efficiency
Both measure the parallelization properties of a program
• Let T(p) be the elapsed time on p processors
• The Speedup S(p) and the Efficiency E(p) are defined as:
S(p) = T(1)/T(p)
E(p) = S(p)/p
• for ideal parallel speedup we get:
T(p) = T(1)/p
S(p) = T(1)/T(p) = p
E(p) = S(p)/p = 1 or 100%
[Figure: speedup and efficiency versus number of processors, with curves labelled ideal, super-linear, saturation and disaster]
4. Amdahl’s Law
This rule states the following for parallel programs:
The non-parallel fraction of the code (i.e. overhead)
imposes the upper limit on the scalability of the code
• the non-parallel (serial) fraction s of the program includes the
communication and synchronization overhead
(1) 1 = s + f                       ! program has serial and parallel fractions
(2) T(1) = T(parallel) + T(serial)
         = T(1) *(f + s)
         = T(1) *(f + (1-f))
(3) T(p) = T(1) *(f/p + (1-f))
(4) S(p) = T(1)/T(p)
         = 1/(f/p + (1-f))
         < 1/(1-f)                  ! for p-> inf.
(5) S(p) < 1/(1-f)
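As a worked example of the bound in (5), take an assumed parallel fraction f = 0.9 on p = 8 processors. Both values are chosen purely for illustration:

```latex
\begin{align*}
S(8) &= \frac{1}{f/p + (1-f)} = \frac{1}{0.9/8 + 0.1} = \frac{1}{0.2125} \approx 4.71 \\
E(8) &= \frac{S(8)}{8} \approx 0.59 \\
\lim_{p \to \infty} S(p) &= \frac{1}{1-f} = \frac{1}{0.1} = 10
\end{align*}
```

Even with 90% of the work parallel, 8 processors give less than a 5x speedup, and no number of processors can give more than 10x.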
5. Amdahl’s Law: Time to Solution
T(p) = T(1)/S(p)
S(p) = 1/(f/p + (1-f))
[Figure: hypothetical program run time as a function of the number of processors for several
parallel fractions f. Note the log-log plot.]
6. Fine-Grained Vs Coarse-Grained
• Fine-grain parallelism (typically loop level)
– can be done incrementally, one loop at a time
– does not require deep knowledge of the code
– a lot of loops have to be parallel for decent speedup
– potentially many synchronization points
(at the end of each parallel loop)
• Coarse-grain parallelism
– make larger loops parallel at higher call-tree level
potentially enclosing many small loops
– more code is parallel at once
– fewer synchronization points, reducing overhead
– requires deeper knowledge of the code
[Figure: call tree rooted at MAIN with nodes A to O and p to t, with regions labelled Coarse-grained and Fine-grained]
7. Other Impediments to Scalability
Load imbalance:
• the time to complete a parallel
execution of a code segment is
determined by the longest running thread
• unequal work load distribution leads to some processors being
idle, while others work too much
with coarse grain parallelization, more opportunities for load
imbalance exist
[Figure: threads p0 to p3 with unequal run times between start and finish on the elapsed-time axis]
Too many synchronization points
• compiler will put synchronization points at the start and exit of
each parallel region
8. Computing π with DPL
\[ \pi = \int_0^1 \frac{4\,dx}{1+x^2} \approx \sum_{0 \le i < N} \frac{4}{N\left(1+\left(\frac{i+0.5}{N}\right)^2\right)} \]
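PROGRAM PIPROG
IMPLICIT NONE
INTEGER, PARAMETER :: N = 1000000
INTEGER :: I
REAL (KIND=8), PARAMETER :: W = 1.0D0/N
REAL (KIND=8) :: PI
! Midpoint rule: sample the integrand at x = (I+0.5)*W for I = 0..N-1
PI = SUM( (/ (4.0D0*W/(1.0D0+((I+0.5D0)*W)**2), I=0,N-1) /) )
PRINT *, PI
END PROGRAM PIPROG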
Notes:
– essentially sequential form
– automatic detection of parallelism
– automatic work sharing
– all variables shared by default
– number of processors specified outside of the code
compile with:
9. Computing π with Shared Memory
\[ \pi = \int_0^1 \frac{4\,dx}{1+x^2} \approx \sum_{0 \le i < N} \frac{4}{N\left(1+\left(\frac{i+0.5}{N}\right)^2\right)} \]
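#include <stdio.h>
#define n 1000000
int main(void)
{
    double l, ls = 0.0, w = 1.0/n;
    int i;
    /* each thread gets private i and l; partial sums in ls are combined at the end */
    #pragma omp parallel for private(i,l) reduction(+:ls)
    for(i=0; i<n; i++) {
        l = (i+0.5)*w;          /* midpoint of interval i */
        ls += 4.0/(1.0+l*l);
    }
    printf("pi is %f\n",ls*w);
    return 0;
}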
Notes:
– essentially sequential form
– automatic work sharing
10. Computing π with Message Passing
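\[ \pi = \int_0^1 \frac{4\,dx}{1+x^2} \approx \sum_{0 \le i < N} \frac{4}{N\left(1+\left(\frac{i+0.5}{N}\right)^2\right)} \]
#include <stdio.h>
#include <mpi.h>
#define N 1000000
int main(int argc, char **argv)
{
    double pi, l, ls = 0.0, w = 1.0/N;
    int i, mid, nth;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD,&mid);   /* mid: rank of this process */
    MPI_Comm_size(MPI_COMM_WORLD,&nth);   /* nth: number of processes */
    /* cyclic distribution: rank mid handles i = mid, mid+nth, ... */
    for(i=mid; i<N; i += nth) {
        l = (i+0.5)*w;
        ls += 4.0/(1.0+l*l);
    }
    /* sum the partial results on rank 0 */
    MPI_Reduce(&ls,&pi,1,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD);
    if(mid == 0) printf("pi is %f\n",pi*w);
    MPI_Finalize();
    return 0;
}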
Notes:
11. Comparing Parallel Paradigms
• Automatic parallelization combined with explicit Shared Variable
programming (compiler directives) used on machines with global
memory
– Symmetric Multi-Processors, CC-NUMA, PVP
– These methods collectively known as Shared Memory Programming (SMP)
– SMP programming model works at loop level, and coarse level parallelism:
• the coarse level parallelism has to be specified explicitly
• loop level parallelism can be found by the compiler (implicitly)
– Explicit Message Passing Methods are necessary with machines that
have no global memory addressability:
• clusters of all sorts, NOW & COW
– Message Passing Methods require coarse level parallelism to be scalable
• Choosing a programming model is largely a matter of the application,
personal preference and the target machine.
• it has nothing to do with scalability.
Scalability limitations:
– communication overhead
– process synchronization
• scalability is mainly a function of the hardware and (your)
implementation of the parallelism
12. Summary
• The serial part or the communication overhead of the code limits the
scalability of the code (Amdahl’s Law)
• programs have to be >99% parallel to use large (>30 proc) machines
• several Programming Models are in use today:
– Shared Memory programming (SMP) (with Automatic Compiler
parallelization, Data-Parallel and explicit Shared Memory models)
– Message Passing model
• Choosing a Programming Model is largely a matter of the application,
personal choice and target machine. It has nothing to do with scalability.
– Don’t confuse Algorithm and implementation
• machines with a global address space can run applications based on
both SMP and Message Passing programming models