Author: Roland Bruggmann, roland.bruggmann@students.bfh.ch
Date: 30 July 2015
Berner Fachhochschule | Haute école spécialisée bernoise | Bern University of Applied Sciences
Multicore and GPU Programming
Module BTI7407 Parallel Computing
Exercises
Contents
1 Introduction
  1.1 Taxonomy
  1.2 Performance Metrics
    1.2.1 Speedup
    1.2.2 Efficiency
    1.2.3 Scaling Efficiency
    1.2.4 Amdahl's Law
Acronyms
Bibliography
1 Introduction
1.1 Taxonomy
Flynn's taxonomy, proposed by Michael Flynn in 1966, classifies computer architectures as follows (see [Bar15, p. 3]):
Single Instruction, Single Data (SISD): One instruction at a time, operating on a single data item. E.g., each
core of a contemporary multicore-CPU can be considered a SISD machine.
Single Instruction, Multiple Data (SIMD): Each instruction is applied on a collection of items. E.g., vector
processors and GPUs on the level of the Streaming Multiprocessor.
Multiple Instructions, Single Data (MISD): Multiple instructions applied to the same data item. Used when
fault tolerance is required, e.g., in military or aerospace applications.
Multiple Instructions, Multiple Data (MIMD): Multicore machines, including GPUs, follow this paradigm. GPUs
are built from a collection of SIMD units, each of which can execute its own program; collectively they behave
as a MIMD machine.
1.2 Performance Metrics
1.2.1 Speedup
The improvement in execution time by the use of a parallel solution is defined as (see [Bar15, p. 14]):
\[
  \text{speedup} = \frac{t_{seq}}{t_{par}} \tag{1.1}
\]
where $t_{seq}$ is the execution time of the sequential program and $t_{par}$ is the execution time of the parallel program
solving the same instance of the problem. Both are wall-clock times and, as such, not objective: speedup can vary
with the system as well as with the input data. For this reason, it is customary to report average figures, or even
the average, maximum, and minimum observed. Speedup tells us whether it is feasible to accelerate the solution
of a problem at all, i.e., whether speedup > 1.
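As a minimal sketch of Equation 1.1, the following Python snippet computes speedup from two wall-clock times; the timing values are hypothetical placeholders, not measurements.

# Minimal sketch of Equation 1.1 (hypothetical timing values, not measurements).

def speedup(t_seq, t_par):
    """Speedup = t_seq / t_par for the same problem instance."""
    return t_seq / t_par

t_seq = 12.0  # wall-clock seconds of the sequential program (hypothetical)
t_par = 3.2   # wall-clock seconds of the parallel program (hypothetical)

print(f"speedup = {speedup(t_seq, t_par):.2f}")  # > 1 means the parallel solution pays off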
1.2.2 Efficiency
Efficiency tells us whether this acceleration can be achieved with a modest amount of resources, i.e., it measures
resource utilization (see [Bar15, p. 15]):
\[
  \text{efficiency} = \frac{\text{speedup}}{N} = \frac{t_{seq}}{N \cdot t_{par}} \tag{1.2}
\]
where $N$ is the number of CPUs/cores employed for the execution of the parallel program. Normally, speedup is
expected to be less than $N$. When speedup $= N$, the corresponding parallel program exhibits what is called linear
speedup. There are even situations where speedup $> N$ and efficiency $> 1$, in what is known as a superlinear
speedup scenario.
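A minimal sketch of Equation 1.2, assuming hypothetical timings and a core count of 4:

# Efficiency = speedup / N = t_seq / (N * t_par); all inputs are hypothetical.

def efficiency(t_seq, t_par, n):
    return t_seq / (n * t_par)

t_seq, t_par, n = 12.0, 3.2, 4
e = efficiency(t_seq, t_par, n)
print(f"efficiency = {e:.2f}")  # 1.0 corresponds to linear speedup, > 1.0 to superlinear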
1.2.3 Scaling Efficiency
In general, scalability is the ability to handle a growing amount of work efficiently. In the context of a parallel
algorithm and/or platform, scalability translates to being able to
• (a) solve bigger problems (weak scaling efficiency), and/or
• (b) incorporate more computing resources (strong scaling efficiency).
Strong Scaling Efficiency is defined by the same equation as the generic efficiency in Equation 1.2 (see [Bar15,
p. 17]):
\[
  \text{strongScalingEfficiency}(N) = \frac{t_{seq}}{N \cdot t_{par}} \tag{1.3}
\]
Weak Scaling Efficiency is defined as (see [Bar15, p. 18]):
\[
  \text{weakScalingEfficiency}(N) = \frac{t_{seq}}{t_{par}} \tag{1.4}
\]
where $t_{par}$ is the time to solve a problem that is $N$ times bigger than the one the single machine solves in time
$t_{seq}$. There are a number of issues with calculating scaling efficiency when GPU computing resources are involved:
e.g., $t_{seq}$ on a single CPU versus $t_{par}$ on a CPU/GPU hybrid including I/O (cf. [Bar15, p. 18]).
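A minimal sketch of Equations 1.3 and 1.4; the timings are hypothetical, and for weak scaling t_par is assumed to be measured on a problem N times bigger than the sequential one.

# Strong vs. weak scaling efficiency (Equations 1.3 and 1.4); hypothetical inputs.

def strong_scaling_efficiency(t_seq, t_par, n):
    # Same problem size, solved by n workers.
    return t_seq / (n * t_par)

def weak_scaling_efficiency(t_seq, t_par):
    # Problem size grown n-fold together with the number of workers.
    return t_seq / t_par

print(f"{strong_scaling_efficiency(12.0, 3.5, 4):.2f}")  # same problem on 4 cores
print(f"{weak_scaling_efficiency(12.0, 13.1):.2f}")      # 4x problem on 4 cores; ideal is 1.0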
1.2.4 Amdahl’s Law
In 1967, Gene Amdahl made the following assumptions (see [Bar15, p. 21]):
• We have a sequential application that requires time $T$ to execute on a single CPU.
• The application consists of a part $\alpha$, with $0 \le \alpha \le 1$, that can be parallelized.
The remaining $1 - \alpha$ has to be done sequentially.
• Parallel execution incurs no communication overhead, and the parallelizable part can be divided evenly among
any chosen number of CPUs. This assumption suits multicore architectures particularly well, since their cores
have access to the same shared memory.
Then, the speedup obtained by $N$ nodes is bounded from above by:
\[
  \text{speedup} = \frac{t_{seq}}{t_{par}} = \frac{T}{(1-\alpha)T + \frac{\alpha T}{N}} = \frac{1}{1 - \alpha + \frac{\alpha}{N}} \tag{1.5}
\]
and, taking the limit for $N \to \infty$:
\[
  \lim_{N \to \infty} (\text{speedup}) = \frac{1}{1 - \alpha} \tag{1.6}
\]
Amdahl's law answers a difficult question: how much faster can a problem be solved by a parallel program? And it
does so in a completely abstract manner, relying only on a characteristic of the problem, namely $\alpha$.
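As a small illustration of Equations 1.5 and 1.6, the following sketch evaluates the predicted speedup for a few values of α and a fixed core count of 16; both choices are arbitrary and for illustration only.

# Amdahl's law (Equation 1.5) and its limit (Equation 1.6); the alpha values and
# the core count are illustrative only.

def amdahl_speedup(alpha, n):
    """Predicted speedup for a parallel fraction alpha on n CPUs/cores."""
    return 1.0 / ((1.0 - alpha) + alpha / n)

for alpha in (0.5, 0.9, 0.99):
    bound = 1.0 / (1.0 - alpha)  # upper bound as n -> infinity
    print(f"alpha = {alpha}: speedup(16) = {amdahl_speedup(alpha, 16):.2f}, bound = {bound:.1f}")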
Figure 1.1: Speedup curves for different values of α, as predicted by Amdahl’s law
Figure 1.2: Efficiency curves for different values of α, as predicted by Amdahl’s law
Acronyms
CPU Central Processing Unit
CUDA Compute Unified Device Architecture
GPGPU General-Purpose Computing on Graphics Processing Units
GPU Graphics Processing Unit
MIMD Multiple Instructions, Multiple Data
MISD Multiple Instructions, Single Data
MPI Message Passing Interface
OpenCL Open Computing Language
OpenMPI Open Message Passing Interface
PC Program Counter
PCAM Partitioning, Communication, Agglomeration, and Mapping
SIMD Single Instruction, Multiple Data
SIMT Single Instruction, Multiple Threads
SISD Single Instruction, Single Data
Bibliography
[Bar15] Gerassimos Barlas. Multicore and GPU Programming – An Integrated Approach. 1st ed. Waltham: Morgan
Kaufmann, 2015. ISBN: 978-0-12-417137-4. URL: http://booksite.elsevier.com/9780124171374/
(visited on 30/07/2015).