Author: Roland Bruggmann, roland.bruggmann@students.bfh.ch
Date: 30 July 2015
Berner Fachhochschule | Haute école spécialisée bernoise | Bern University of Applied Sciences
Multicore and GPU Programming
Module BTI7407 Parallel Computing
Exercises
Contents
1 Introduction
  1.1 Taxonomy
  1.2 Performance Metrics
    1.2.1 Speedup
    1.2.2 Efficiency
    1.2.3 Scaling Efficiency
    1.2.4 Amdahl's Law
Acronyms
Bibliography
1 Introduction
1.1 Taxonomy
Flynn's taxonomy, proposed by Michael Flynn in 1966, classifies computer architectures as follows (see [Bar15, p. 3]):
Single Instruction, Single Data (SISD): One instruction at a time, operating on a single data item. E.g., each
core of a contemporary multicore-CPU can be considered a SISD machine.
Single Instruction, Multiple Data (SIMD): Each instruction is applied on a collection of items. E.g., vector
processors and GPUs on the level of the Streaming Multiprocessor.
Multiple Instructions, Single Data (MISD): Multiple instructions applied to the same data item. Used when
fault tolerance is required, e.g., in military or aerospace applications.
Multiple Instructions, Multiple Data (MIMD): Multicore machines, including GPUs, follow this paradigm. GPUs
are built from a collection of SIMD units, each of which can execute its own program; collectively they behave
as a MIMD machine.
1.2 Performance Metrics
1.2.1 Speedup
The improvement in execution time by the use of a parallel solution is defined as (see [Bar15, p. 14]):
\[
  \text{speedup} = \frac{t_{seq}}{t_{par}} \tag{1.1}
\]
where $t_{seq}$ is the execution time of the sequential program and $t_{par}$ is the execution time of the parallel program
solving the same instance of the problem. Both are wall-clock times and, as such, not objective: speedup can vary
with the system as well as with the input data. For this reason, it is customary to report average figures, or even
the average, maximum, and minimum observed. Speedup tells us whether it is feasible to accelerate the solution
of a problem at all, i.e., whether speedup > 1.
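As a minimal sketch of Equation 1.1, the following Python snippet computes speedup from two wall-clock times; the timing values are hypothetical placeholders, not measurements.

# Minimal sketch of Equation 1.1 (hypothetical timing values, not measurements).

def speedup(t_seq, t_par):
    """Speedup = t_seq / t_par for the same problem instance."""
    return t_seq / t_par

t_seq = 12.0  # wall-clock seconds of the sequential program (hypothetical)
t_par = 3.2   # wall-clock seconds of the parallel program (hypothetical)

print(f"speedup = {speedup(t_seq, t_par):.2f}")  # > 1 means the parallel solution pays off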
1.2.2 Efficiency
Efficiency tells us whether this acceleration can be achieved with a modest amount of resources, i.e., it measures
resource utilization (see [Bar15, p. 15]):
\[
  \text{efficiency} = \frac{\text{speedup}}{N} = \frac{t_{seq}}{N \cdot t_{par}} \tag{1.2}
\]
where $N$ is the number of CPUs/cores employed for the execution of the parallel program. Normally, speedup is
expected to be less than $N$. When speedup $= N$, the corresponding parallel program exhibits what is called linear
speedup. There are even situations where speedup $> N$ and efficiency $> 1$, in what is known as a superlinear
speedup scenario.
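A minimal sketch of Equation 1.2, assuming hypothetical timings and a core count of 4:

# Efficiency = speedup / N = t_seq / (N * t_par); all inputs are hypothetical.

def efficiency(t_seq, t_par, n):
    return t_seq / (n * t_par)

t_seq, t_par, n = 12.0, 3.2, 4
e = efficiency(t_seq, t_par, n)
print(f"efficiency = {e:.2f}")  # 1.0 corresponds to linear speedup, > 1.0 to superlinear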
1.2.3 Scaling Efficiency
In general, scalability is the ability to handle a growing amount of work efficiently. In the context of a parallel
algorithm and/or platform, scalability translates to being able to
• (a) solve bigger problems (weak scaling efficiency), and/or
• (b) incorporate more computing resources (strong scaling efficiency).
Strong Scaling Efficiency is defined by the same equation as the generic efficiency in Equation 1.2 (see [Bar15,
p. 17]):
\[
  \text{strongScalingEfficiency}(N) = \frac{t_{seq}}{N \cdot t_{par}} \tag{1.3}
\]
Weak Scaling Efficiency is defined as (see [Bar15, p. 18]):
\[
  \text{weakScalingEfficiency}(N) = \frac{t_{seq}}{t_{par}} \tag{1.4}
\]
where $t_{par}$ is the time to solve a problem that is $N$ times bigger than the one the single machine solves in time
$t_{seq}$. There are a number of issues with calculating scaling efficiency when GPU computing resources are involved:
e.g., $t_{seq}$ on a single CPU versus $t_{par}$ on a CPU/GPU hybrid including I/O (cf. [Bar15, p. 18]).
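A minimal sketch of Equations 1.3 and 1.4; the timings are hypothetical, and for weak scaling t_par is assumed to be measured on a problem N times bigger than the sequential one.

# Strong vs. weak scaling efficiency (Equations 1.3 and 1.4); hypothetical inputs.

def strong_scaling_efficiency(t_seq, t_par, n):
    # Same problem size, solved by n workers.
    return t_seq / (n * t_par)

def weak_scaling_efficiency(t_seq, t_par):
    # Problem size grown n-fold together with the number of workers.
    return t_seq / t_par

print(f"{strong_scaling_efficiency(12.0, 3.5, 4):.2f}")  # same problem on 4 cores
print(f"{weak_scaling_efficiency(12.0, 13.1):.2f}")      # 4x problem on 4 cores; ideal is 1.0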
1.2.4 Amdahl’s Law
In 1967, Gene Amdahl made the following assumptions (see [Bar15, p. 21]):
• We have a sequential application that requires time $T$ to execute on a single CPU.
• The application consists of a part $\alpha$, with $0 \le \alpha \le 1$, that can be parallelized.
The remaining $1 - \alpha$ has to be done sequentially.
• Parallel execution incurs no communication overhead, and the parallelizable part can be divided evenly among
any chosen number of CPUs. This assumption suits multicore architectures particularly well, since their cores
have access to the same shared memory.
Then, the speedup obtained by $N$ nodes is bounded from above by:
\[
  \text{speedup} = \frac{t_{seq}}{t_{par}} = \frac{T}{(1-\alpha)T + \frac{\alpha T}{N}} = \frac{1}{1 - \alpha + \frac{\alpha}{N}} \tag{1.5}
\]
and, taking the limit for $N \to \infty$:
\[
  \lim_{N \to \infty} (\text{speedup}) = \frac{1}{1 - \alpha} \tag{1.6}
\]
Amdahl's law answers a difficult question: how much faster can a problem be solved by a parallel program? And it
does so in a completely abstract manner, relying only on a characteristic of the problem, namely $\alpha$.
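As a small illustration of Equations 1.5 and 1.6, the following sketch evaluates the predicted speedup for a few values of α and a fixed core count of 16; both choices are arbitrary and for illustration only.

# Amdahl's law (Equation 1.5) and its limit (Equation 1.6); the alpha values and
# the core count are illustrative only.

def amdahl_speedup(alpha, n):
    """Predicted speedup for a parallel fraction alpha on n CPUs/cores."""
    return 1.0 / ((1.0 - alpha) + alpha / n)

for alpha in (0.5, 0.9, 0.99):
    bound = 1.0 / (1.0 - alpha)  # upper bound as n -> infinity
    print(f"alpha = {alpha}: speedup(16) = {amdahl_speedup(alpha, 16):.2f}, bound = {bound:.1f}")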
Figure 1.1: Speedup curves for different values of α, as predicted by Amdahl’s law
Figure 1.2: Efficiency curves for different values of α, as predicted by Amdahl’s law
Acronyms
CPU Central Processing Unit
CUDA Compute Unified Device Architecture
GPGPU General-Purpose Computing on Graphics Processing Units
GPU Graphics Processing Unit
MIMD Multiple Instructions, Multiple Data
MISD Multiple Instructions, Single Data
MPI Message Passing Interface
OpenCL Open Computing Language
OpenMPI Open Message Passing Interface
PC Program Counter
PCAM Partitioning, Communication, Agglomeration, and Mapping
SIMD Single Instruction, Multiple Data
SIMT Single Instruction, Multiple Threads
SISD Single Instruction, Single Data
Bibliography
[Bar15] Gerassimos Barlas. Multicore and GPU Programming – An Integrated Approach. 1st ed. Waltham: Morgan
Kaufmann, 2015. ISBN: 978-0-12-417137-4. URL: http://booksite.elsevier.com/9780124171374/
(visited on 30/07/2015).