Department of Computing
Imperial College London
PUMA
Abstracting Memory Latency Optimisation In Parallel
Applications
Richard Jones
Supervisor: Tony Field
June 2015
Abstract
Moore’s Law states that every eighteen months to two years, the number of transistors
per square inch on an integrated circuit approximately doubles[1], effectively leading to
a proportional performance gain. However, in the early twenty-first century, transistor
size reduction began to slow down, limiting the growth of complexity in high-performance
applications which was afforded by increasing computing power.
Consequently, there was a push for increased parallelism, enabling several tasks to be car-
ried out simultaneously. For high-performance computing applications, a logical extension
to this was to utilise multiple processors simultaneously in the same system, each with
multiple execution units, in order to increase parallelism with widely available consumer
hardware.
In multi-processor systems, having uniformly shared, globally accessible physical memory
means that memory access times are the same across all processors. These accesses can be
expensive, however, because they all require communication with a remote node, typically
across a bus which is shared among the processors. Since the bus is shared and can only
handle one request at a time, processors may have to wait to use it, causing delays when
attempting to access memory.
The situation can be improved by giving each processor its own section of memory, each
with its own data bus. Each section of memory is called a domain, and accessing a domain
which is assigned to a different processor requires the use of an interconnect which has
a higher latency than accessing local memory. This architecture is called NUMA (Non-
Uniform Memory Access). In order to exploit NUMA architectures efficiently, application
developers need to write code which minimises so-called cross-domain accesses to maximise
the application’s aggregate memory performance.
We present PUMA, which is a smart memory allocator that manages data in a NUMA-
aware way. PUMA exposes an interface to execute a kernel on the data in parallel, auto-
matically ensuring that each core which runs the kernel accesses primarily local memory.
It also provides an optional time-based load balancer which can adapt workloads to cases
where some cores may be less powerful or have more to do per kernel invocation than
others.
Acknowledgements
I would like to thank the following for their contributions to PUMA, both directly and
indirectly:
• My supervisor, Tony Field, who has been a tremendous source of support, both in
the development of PUMA and in my completion of this year.
• Dr Michael Lange, the creator of our LERM case study, who spent hours helping me
to work out just exactly what was wrong with my timing results.
• My tutor, Murray Shanahan, for helping me get through all four years of my degree
relatively intact, and providing help and support throughout.
• Imperial’s High Performance Computing service, especially Simon Burbidge who was
invaluable in helping me find my way around the HPC systems.
• My family and friends (especially those who know nothing about computers) for
continuing to talk to me after being forced to proofread approximately seventeen
thousand different drafts of my project report. Also for providing vague moral sup-
port over the course of the first 21 years of my life.
Contents

1 Introduction
  1.1 Motivation
  1.2 Objectives
  1.3 PUMA (Pseudo-Uniform Memory Access)
  1.4 Contributions

2 Background
  2.1 Hardware Trends
  2.2 Memory Hierarchy
  2.3 Parallel Computing
  2.4 Caching In Multi-Processor Systems
  2.5 Memory In Agent-Based Models
  2.6 Workload Balancing
    2.6.1 Work Stealing
    2.6.2 Data-Based Balancing
  2.7 Existing Approaches
    2.7.1 Manual Parallelisation
    2.7.2 OP2/PyOP2
    2.7.3 Galois
    2.7.4 Intel TBB
    2.7.5 Cilk
  2.8 LERM

3 Design and Implementation
  3.1 Dynamic Memory Allocator
    3.1.1 Memory Pools
    3.1.2 Element Headers
  3.2 Static Data
  3.3 Kernel Application
    3.3.1 Load Balancer
  3.4 Challenges
    3.4.1 Profiling
    3.4.2 Invalid Memory Accesses
    3.4.3 Local vs. Remote Testing
  3.5 Testing and Debugging
  3.6 Compilation
    3.6.1 Dependencies
    3.6.2 Configuration
  3.7 Getting Started

4 LERM Parallelisation
  4.1 Scalability
  4.2 Applying PUMA to LERM

5 Evaluation
  5.1 Correctness
  5.2 Profiling
    5.2.1 Load Balancing
  5.3 Known Issues
    5.3.1 Thread Pooling
    5.3.2 Parallel Balancing

6 Conclusions
  6.1 Future Work

A Methods For Gathering Data
B API Reference
  B.1 PUMA Set Management
  B.2 Memory Allocation
  B.3 Kernel Application
  B.4 Static Data Allocation
C Getting Started: Standard Deviation Hello World!
D Licence
List of Figures

1.1 Calvin ponders the applications of workload parallelisation.
1.2 Parallel efficiency reached by a trivially parallel algorithm with dynamic data using static over-allocation and not taking NUMA effects into account. The red vertical line signifies the point past which we use cores on another NUMA node.
1.3 Percentage of total reads which are remote on average across several runs of the trivially parallel section of the case study.
1.4 Total runtime with dynamic data using PUMA.
2.1 “Transistor counts for integrated circuits plotted against their dates of introduction. The curve shows Moore’s law - the doubling of transistor counts every two years. The y-axis is logarithmic, so the line corresponds to exponential growth.”[2]
2.2 Example of agent allocation with Linux’s built-in malloc() implementation. Agents belonging to each thread are represented by a different colour per thread. Black represents non-agent memory.
2.3 Parallel efficiency reached by particle management with dynamic data using static over-allocation and not taking NUMA effects into account. The red vertical line signifies the point past which we use cores on another NUMA node.
2.4 Parallel efficiency reached by environment update with dynamic data using static over-allocation and not taking NUMA effects into account. The red vertical line signifies the point past which we utilise cores from a second NUMA node.
2.5 Total runtime with dynamic data using static storage in the reference implementation.
3.1 How we lay out our internal lists of allocated memory. The black box represents the user-facing set structure, and each of blue and red represents a different thread’s list of pre-allocated memory blocks. These blocks are the same size for each thread.
3.2 Our first strategy for mapping elements to block descriptors. Darker blue represents the block’s descriptor; light blue represents a page header with a pointer to the block’s descriptor; red represents elements; the vertical lines represent page boundaries; and white represents unallocated space within the block.
3.3 Our second strategy for mapping elements to block descriptors. The blue block represents the block’s descriptor; the red blocks represent elements; and the vertical lines represent page boundaries. In this example, blocks are two pages long.
5.1 Biomass of plankton in the PUMA-based LERM simulation over time. Black are maximum and minimum across all runs, red is average.
5.2 Comparison between the average biomasses across runs in the reference and PUMA implementations of the LERM simulation over time.
5.3 Total runtime with dynamic data using PUMA.
5.4 Total runtime with dynamic data using static over-allocation and not taking NUMA effects into account.
5.5 Parallel efficiency reached by agent updates in the PUMA implementation of LERM vs in the reference implementation. The red vertical line signifies the point past which we utilise cores on a second NUMA node. In this instance, reduction is the necessary consolidation of per-thread environmental changes from the update loop into the global state. Margin of error is calculated by finding the percentage standard deviation in the original times and applying it to the parallel efficiency.
5.6 Parallel efficiency reached by particle management in the PUMA implementation of LERM vs in the reference implementation.
5.7 Parallel efficiency reached by environment update in the PUMA implementation of LERM vs in the reference implementation.
5.8 Parallel efficiency reached by the update step in the PUMA implementation of LERM with and without load balancing.
5.9 Parallel efficiency reached by the trivially parallel section of LERM when using OpenMP (blue) and PUMA’s own thread pool (red).
Listings

2.1 LERM pseudocode
3.1 Calculating the mapping between elements and their indices with a per-page header.
3.2 Calculating the mapping between elements and their indices without using headers.
3.3 Pseudo-random number generator where i is a static seed.
4.1 PUMA-based LERM pseudocode
Chapter 1
Introduction
Figure 1.1: Calvin ponders the applications of workload parallelisation.
In modern computing, the execution time of many applications can be greatly reduced
by the use of large, multi-core systems. These gains are especially prevalent in scientific
simulations operating on very large data sets, such as ocean current simulation[3], weather
simulation[4] and finite element modelling[5].
Memory accesses in multi-processor systems can be subject to delays due to contention
for the memory controller. A common solution to this problem is to provide multiple
distinct controllers, each with its own discrete memory; this architecture is called NUMA,
or Non-Uniform Memory Access[6]. Processors can access memory associated with other
processors’ memory controllers, but this incurs an extra delay because the request must
traverse an interconnect between controllers, which has a higher latency than a local access.
The primary disadvantage of NUMA is that, as each core needs to access more memory
which is not associated with its controller, memory latency becomes a dominating factor
in runtime. This causes parallel efficiency (calculated with formula A.1) to drop off rapidly
unless it is taken into account, even in applications which are trivially or “embarrassingly”
parallel.
In parallel systems, cache coherency can also be a major factor in memory latency. In
order to enable a so-called “classical” programming model based on the Von Neumann
stored-program concept for general purpose computing, processors often automatically
synchronise their caches if one processor writes to a memory location which resides in both
its and another’s cache.
This synchronisation introduces extra latency when reading from or writing to memory
which is within a cache line recently accessed by another processor. Consequently, its
avoidance in high-performance parallel code can be critical.
Figure 1.2 shows the parallel efficiency as we utilise more cores in the trivially parallel sec-
tion of the reference implementation of LERM (Lagrangian Ensemble Recruitment Model)
which we use as our primary case study. Both NUMA latency and cache synchronisation
are responsible for the drop off; on one domain, there is a significant drop in performance
due to multiple processors accessing and updating data within the same cache line. When
we use multiple domains, we can see a change in the rate of drop-off in parallel efficiency,
as illustrated by the two-part trend line.
Figure 1.3 shows the increase in off-domain memory accesses in the reference implementa-
tion of LERM when we use cores associated with a second domain.
1.1 Motivation
In this project we aim to provide a framework with clear abstractions that allows application
developers, with or without an understanding of computer architecture, to write software
which takes advantage of large, NUMA-based machines without dealing with the underlying
management of memory across NUMA domains.
Solutions already exist which provide this level of abstraction; however, each solution is
focused on a specific set of problems. The problem which provided the primary motivation
for our library is a type of agent-based modelling in which no agent interacts with any
other, except indirectly via environmental or global variables.
This kind of modelling can be used in a variety of areas, including economic interactions
of individuals with a market[7] and ecological simulations[8][9]. Some such models are
dynamic, in that the number of agents may change over time.
Figure 1.2 (plot of parallel efficiency against the number of cores in use, with a two-part
trend line): Parallel efficiency reached by a trivially parallel algorithm with dynamic data
using static over-allocation and not taking NUMA effects into account. The red vertical
line signifies the point past which we use cores on another NUMA node.
1.2 Objectives
The primary purpose of our solution is to combine the dynamism of malloc() with the
benefits of static approaches, as well as to provide methods to run kernels on pre-pinned
threads across datasets. It must abstract away as much of the low-level detail as possible,
without sacrificing performance gains or configurability.
We aim to enable the simple parallelisation of applications operating on large, independent
sets of data in such a way that results remain correct within reasonable bounds. We
also aim to prevent NUMA cross-domain performance penalties without the application
developer’s intervention.
1.3 PUMA (Pseudo-Uniform Memory Access)
This project is concerned with the design, implementation and evaluation of PUMA, which
is a NUMA-aware library that provides methods for the allocation of homogeneous blocks
of memory (elements), the iteration over all of these elements and the creation of static,
domain-local data on a per-thread basis. It exposes a relatively small external API which
is easy to use. It does not require developers to have an understanding of the underlying
system topology, allowing them to focus more on the logic behind their kernel. It does,
however, provide advanced configuration options; for example, it automatically pins threads
to cores but can also take an affinity string at initialisation for customisable pinning.
As well as providing NUMA-aware memory management within its set structure, PUMA
also exposes an API for the allocation of per-thread static data which is placed on the
calling thread’s local NUMA domain. Data allocated in this way are guaranteed not to share
a cache line with data allocated by any other thread, preventing the memory latency that
maintaining cache coherency would otherwise introduce.
PUMA fits between existing solutions in that, while it imposes the constraint that data
must not have direct dependencies, the data set it operates on can be dynamic. It therefore
addresses a different class of problems from libraries such as OP2 (section 2.7.2) and Galois
(section 2.7.3).
We have adapted a scientific simulation (LERM) to use PUMA in order to examine its
effects and usability. The simulation is outlined in section 2.8.
The PUMA-based implementation of this simulation has three main sections, each ex-
hibiting a different level of parallelisation: trivially parallel, partially parallel and entirely
serial.
Figure 1.4 shows the total runtime of LERM when implemented with PUMA against the
theoretical minimum as dictated by Amdahl’s Law[10]. This figure illustrates two aspects
of PUMA at work:
1. The execution time across cores on a single domain is close to the minimum due to
cache coherency optimisation;
2. The execution time across cores on multiple domains is also close to the theoretical
minimum as a result of both cache coherency optimisation and the reduction in
cross-domain memory traffic, as shown in Figure 1.3.
Currently, PUMA has been tested on the following Operating Systems:
• Ubuntu Linux 14.10 and 15.04 with kernels 3.16 and 3.19;
• Red Hat Enterprise Linux with kernel 2.6.32;
• Mac OS 10.10.3.

Figure 1.3 (plot of the percentage of remote reads against the number of cores in use,
comparing the reference and PUMA-based implementations): Percentage of total reads
which are remote on average across several runs of the trivially parallel section of the case
study.
PUMA is written to be as backwards compatible as possible across Linux kernel versions;
it was written with the POSIX standard in mind and uses, as far as possible, only
standard-mandated features. For the few features which are not standard-mandated,
alternatives are available as compile-time options.
Internally, we implement time-based thread workload balancing. This allows us to manage
intelligently the time taken to run a kernel on each thread, preventing any one thread from
taking significantly longer than others.
Figure 1.4 (plot of total time in ms against the number of cores in use, comparing the
actual timing with the minimum time given by Amdahl’s Law): Total runtime with dynamic
data using PUMA.
1.4 Contributions
• In chapter 3, we discuss the design and implementation of PUMA, including how we
achieve near-ideal parallel efficiency even across NUMA domains;
• In chapter 4, we examine the scalability and parallelisation of the LERM simulation,
including discussing our necessarily parallel Particle Management implementation;
• In chapter 5, we present a detailed experimental evaluation of PUMA with respect
to LERM, including both timing data and simulation correctness verification.
The PUMA library code is hosted at https://github.com/CatharticMonkey/puma.
Chapter 2
Background
The most common architectural model in modern computers is called the Von Neumann
architecture; it is based on a model described by John Von Neumann in the First Draft of
a Report on the EDVAC[11].
The model describes a system consisting of the following:
• A processing unit containing an Arithmetic/Logic Unit (ALU) and registers;
• A control unit consisting of an instruction register and a program counter;
• Memory in which data and instructions are stored;
• External storage;
• Input/output capabilities.
The main benefit of a stored-program design such as the Von Neumann architecture is
the ability to dynamically change the process which the computer carries out - this is
in contrast to early computers which had hard-wired processes and could not be easily
reprogrammed. The flexibility afforded by the stored-program approach is critical to the
widespread use of computers as general-purpose machines.
2.1 Hardware Trends
The first computer processors were developed to execute a serial stream of instructions.
This was initially sustainable in terms of keeping up with increased requirements for more
complex computations, as single-chip performance was constantly being improved; Moore’s
law is an observation stating that “[t]he complexity for minimum component costs has
increased at a rate of roughly a factor of two per year... Certainly over the short term this
rate can be expected to continue, if not to increase.”[1] This trend is shown in Figure 2.1.
In other words, approximately every two years, the number of components in an integrated
circuit can be expected to increase twofold, leading to a proportional performance gain.
This, along with the significant increase in clock speeds from several MHz to several GHz
in the span of just a few decades, meant that serial execution performance also improved
at a roughly comparable rate.
In the early 21st century, however, these gains began to slow as a result of physical limita-
tions.
The next step for performance scaling was parallelism, and so the number of discrete
processing units in chips produced by most major manufacturers increased.
As a result of this move towards higher levels of hardware parallelism, there has been an
increasingly strong focus in current computing on parallelising software to take advantage
of it.
2.2 Memory Hierarchy
Due to the trend for increased speed of computation in a small space, memory access time
has become a major concern when it comes to execution time. This is especially true as
clock cycles have become shorter: a memory access of a given absolute latency now wastes
more cycles, and therefore more instructions that could have been executed during that
access are not.
As a result, there are various architectural decisions made by processor manufacturers in
order to attempt to minimise this penalty and thus speed up program execution. The
primary method of tackling the problem is to implement a caching system which exploits
the fact that a significant portion of memory accesses exhibit spatial locality (i.e. they
are close to previously accessed memory); when a memory access is performed, the cache
(which is physically close to the processor and is often an on-chip bank of memory) is first
checked to see if it contains the desired data.
If it does, there is no need to look further and the access returns relatively quickly. If it
does not, the data are requested from main memory along with a surrounding block of
data of a predetermined size. These extra data are stored in the cache, enabling future
cache hits.
The simple caching model is often extended with the use of multiple levels of cache. In this
extension, lower levels of cache are physically closer to the memory data register, which is
the location to which memory accesses are initially loaded on the processor. As a result,
lower levels of cache are quicker to access than higher ones. However, lower level caches
have a smaller size limit because of their location. Consequently, it is desirable to have
several levels of cache, each bigger than the last, in order to reduce cache-processor latency
as much as possible without sacrificing size.

Figure 2.1: “Transistor counts for integrated circuits plotted against their dates
of introduction. The curve shows Moore’s law - the doubling of transistor
counts every two years. The y-axis is logarithmic, so the line corresponds to
exponential growth.”[2]
Caching helps to mitigate memory access penalties significantly, but main memory access
time is still important, especially in the case in which there will inevitably be many cache
misses because of the nature of the algorithm in question. If there are many cache misses,
memory access time can quickly become a dominant factor of execution time.
2.3 Parallel Computing
The logical extension of per-processor parallelisation is spreading heavy computational
tasks across a large number of processors. There are two main methods for achieving this:
• Utilising several separate computers networked together, sharing data and achieving
synchronisation through message passing over the network;
• Having two or more sockets on a motherboard, which increases the number of cores
available in one computer without requiring more expensive, advanced processors.
The first method’s primary draw is its scalability; it is useful for constructing large systems
relatively cheaply without specialised hardware. The main benefit of the second is that
it avoids the overhead of the first’s message-passing while still maintaining the ability to
use consumer-grade hardware. It is possible to combine these approaches in order to gain
many of the benefits of both. Instances of such combinations are:
• Edinburgh University’s Archer supercomputer[12]: a 4920-node Cray XC30 MPP
supercomputer with two 12-core Intel Ivy Bridge processors on each node, providing
a total of 118,080 cores;
• The UK Met Office’s supercomputer[4][13]: A Cray XC40, with a large number of
Intel Xeon processors, providing a total of 480,000 cores.
• Southampton University’s Iridis-Pi cluster[14]: a cluster of 64 Raspberry Pi Model B
nodes, providing a low-power, low-cost 64-core computer for educational applications.
Each of these focuses on providing a massively parallel system, in order to carry out certain
types of parallel computations; if they are used for a computation which must be done in
serial or which simply does not take advantage of the topology, all that they gain over a
single-core machine is lost.
Libraries which abstract away hardware details are often used to take advantage of this
kind of architecture; this makes it relatively easy to create software which is scalable across
several sockets (each containing a multi-core CPU), or even several networked computation
nodes, while not requiring in-depth architectural knowledge.
2.4 Caching In Multi-Processor Systems
In NUMA systems, there are two main approaches to cache management: the far simpler
and less expensive method (in terms of hardware-level synchronisation) involves not syn-
chronising caches across processors; this has the major disadvantage that programming in
the common Von Neumann paradigm becomes too complex to be feasible.
The other method involves maintaining cache coherency at the hardware level; this is called
cache-coherent NUMA (ccNUMA). It requires significantly more complexity in the design
of the system and can lead to a substantial synchronisation overhead if two processors are
accessing data in the same cache line; writes from one processor require an update in the
cache of the other, introducing latency.
ccNUMA is the more common of these two because it does not introduce extra complexity
in creating correct programs for multi-processor systems and the synchronisation overhead
can be avoided by not simultaneously accessing data within the same cache line on different
cores.
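To make the false-sharing cost concrete, the sketch below pads per-thread counters out to a full cache line so that no two threads ever write to the same line; the 64-byte line size and the use of OpenMP here are assumptions for illustration, not details taken from PUMA.

#include <omp.h>
#include <stdio.h>

#define CACHE_LINE 64 /* assumed cache-line size in bytes */

/* Each counter occupies its own cache line, so one thread's writes never
 * invalidate the line holding another thread's counter. */
struct paddedCounter
{
	long value;
	char pad[CACHE_LINE - sizeof(long)];
};

int main(void)
{
	struct paddedCounter counters[16] = {{0}};

	#pragma omp parallel num_threads(4)
	{
		int t = omp_get_thread_num();
		for (long i = 0; i < 10000000; ++i)
			counters[t].value++; /* no false sharing between threads */
	}

	long total = 0;
	for (int t = 0; t < 16; ++t)
		total += counters[t].value;
	printf("total = %ld\n", total);
	return 0;
}

Removing the pad member reintroduces the coherency traffic described above and typically slows the loop down noticeably on a multi-socket machine.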
2.5 Memory In Agent-Based Models
Static preallocation of space for agents is a potential solution to the problem of allocating
memory for them in dynamic agent-based models; through first touch policies, it provides
the ability to handle agent placement easily and intelligently in terms of physical NUMA
domains. Memory is commonly partitioned into virtual pages, which, under first touch, are
only assigned physical space when they are first accessed; the physical memory assigned is
on the NUMA domain to which the processor touching the memory belongs[15].
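As a rough sketch of how first touch combines with static over-allocation (assuming OpenMP and a Linux-style first-touch policy; none of this is PUMA code), the example below reserves one large array up front but lets each thread fault in the slice of pages it will later update:

#include <omp.h>
#include <stdlib.h>

#define NUM_AGENTS (1L << 20)

struct agent { double biomass; double depth; };

int main(void)
{
	/* Virtual pages are reserved here; physical pages are only assigned
	 * when they are first written to. */
	struct agent* agents = malloc(NUM_AGENTS * sizeof(struct agent));
	if (!agents) return 1;

	/* Each thread touches its own contiguous slice first, so under first
	 * touch those pages land on that thread's NUMA domain, provided the
	 * threads stay pinned to the same cores afterwards. */
	#pragma omp parallel for schedule(static)
	for (long i = 0; i < NUM_AGENTS; ++i)
	{
		agents[i].biomass = 0.0;
		agents[i].depth = 0.0;
	}

	/* Later loops reuse the same static schedule so each thread revisits
	 * the pages it faulted in. */
	#pragma omp parallel for schedule(static)
	for (long i = 0; i < NUM_AGENTS; ++i)
		agents[i].biomass += 1.0;

	free(agents);
	return 0;
}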
This is, however, not always an option, as either it requires reallocation upon dataset growth
or massive over-allocation and the imposition of an artificial upper bound on dataset size.
Dynamic allocation methods do not require reallocation or size boundaries. The C standard
library’s malloc() is the primary method of allocating memory dynamically from the heap,
but it has several disadvantages when used for agent-based modelling:
• Execution time
– malloc() can make no assumptions about allocation size. This means it has to
handle holes left by free()d memory, so allocating data can require searching
for suitably-sized free blocks of memory.
• Lack of cache effectiveness
– Since allocations for all malloc() calls can come from the same pool, there
is no guarantee that all agents will occupy contiguous memory, meaning that
iteration may cause a significant number of cache misses.
In Figure 2.2, suppose each agent is half the size of the block loaded into core-local
cache on each miss (half a cache line). If the first (red) agent is accessed by thread 1,
the processor will also load the second (yellow) one into cache. The second agent
constitutes wasted cache space because it will not be accessed by that processor.
In the worst case, synchronisation needs to occur between the two processors
running the red and yellow threads in order to ensure coherency.
• Lack of NUMA-awareness
– Many malloc() implementations (for example the Linux default implementation
and tcmalloc) do not necessarily allocate from a pool which is local to the core
requesting the memory.
First-touch page creation means that if a malloc() call returns memory be-
longing to a thus-far untouched page and we initialise the memory on a CPU
belonging to the NUMA domain from which we will then access it, we should
not incur a cross-domain access. However, malloc() implementations which are
not NUMA-aware may allocate from pages which may have already been faulted
into memory on another domain.
We have no way, therefore, to guarantee that accessing agents allocated using
malloc() and similar calls will not incur the penalty of a cross-domain access.
malloc() implementations exist which are NUMA-aware. However, these still exhibit the
other two problems because malloc() can make no assumptions about the context in which
memory will be used and it must support variably-sized allocations. Consequently, even
NUMA-aware implementations are not suitable for this class of applications as a result of
the trade-off between generality and performance.
Figure 2.2: Example of agent allocation with Linux’s built-in malloc() imple-
mentation. Agents belonging to each thread are represented by a different colour
per thread. Black represents non-agent memory.
2.6 Workload Balancing
When operating on data sets in parallel, one issue which needs to be addressed is how to
ensure that each thread will finish its current workload at approximately the same time; if
threads finish in a staggered fashion, this can lead to sub-optimal parallel performance as
some threads that could be working are instead idle.
2.6.1 Work Stealing
In work stealing, balance across threads is achieved by the scheduler; computation is split
into discrete tasks which it then assigns to different processors in the form of a task queue.
If one processor completes its task queue, it “steals” one or more tasks from another’s
queue. This means that, as long as there exist tasks which have not been started and each
task is of a similar length, no processor will be idle for more than the time it takes to
complete one task.
If the tasks are not necessarily of a similar length, balance can still be approximately
achieved by estimating the length of each task and optimising the task queues based on
these estimates.
2.6.2 Data-Based Balancing
Data-based balancing is a method of balancing which consists of assigning blocks of data
to specific threads based on some partitioning strategy; these partitioning strategies can
be based on data size, for example, or the results of profiling several recent runs of com-
putational kernels.
Balancing with this strategy is inherently simpler than with task balancing if we are running
an identical kernel since it simply involves ensuring that each thread has approximately the
same amount of data to operate on. We can expand upon this by using timing data from
previous runs to estimate how long each thread will take to run, allowing us to achieve a
closer to optimal balance without the overhead of balancing at runtime.
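A minimal sketch of the idea, assuming each thread runs an identical kernel; the names and the rate-proportional split are illustrative, not taken from any particular library:

#include <stddef.h>

/* Given how long each thread took on the previous run and how many elements
 * it processed, give each thread a share of the next run proportional to its
 * measured rate (elements per second). */
void balancePartitions(size_t numThreads, size_t totalElements,
                       const double* prevTimeSec, const size_t* prevCount,
                       size_t* newCount)
{
	double totalRate = 0.0;
	for (size_t t = 0; t < numThreads; ++t)
		totalRate += (double)prevCount[t] / prevTimeSec[t];

	size_t assigned = 0;
	for (size_t t = 0; t < numThreads; ++t)
	{
		double rate = (double)prevCount[t] / prevTimeSec[t];
		newCount[t] = (size_t)((rate / totalRate) * (double)totalElements);
		assigned += newCount[t];
	}

	/* Hand any rounding remainder to the last thread. */
	newCount[numThreads - 1] += totalElements - assigned;
}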
2.7 Existing Approaches
2.7.1 Manual Parallelisation
There are two primary types of parallelisation: one involves running code across several
cores on the same motherboard; the other involves running it across several processors on
different computers, using messaging on a local network. The former approach avoids the
overhead of message passing, whereas the latter is more scalable using consumer hardware.
Often, they are mixed, using the MPI (Message Passing Interface) standard for the inter-
computer messaging and OpenMP or the Operating System’s threading interface for the
local parallelism.
Both require manual management of the placement of data. If running on a single NUMA-
enabled system, this involves predicting the cores which will be accessing data and allo-
cating physical memory accordingly. Using multiple networked computers requires manual
usage of MPI functions in order to transfer data among computers; it is not implicit, so
the application must be designed with this in mind.
The primary disadvantages of manual parallelisation are both related to its complexity; it
requires sufficient knowledge of system APIs and system architecture and it can require a
significant amount of programming time. Often, this renders a custom solution infeasible.
2.7.2 OP2/PyOP2
OP2, and its Python analogue, PyOP2, provide “an open-source framework for the ex-
ecution of unstructured grid applications on clusters of GPUs or multi-core CPUs.”[16]
They are focused on MPI-level distribution, with support for OpenCL, OpenMP or CUDA
for local parallelism. These two levels can be combined in one application, enabling the
developer to take advantage of the benefits of both.
The framework operates on static, independent data sets, allowing for data-specific opti-
misation at compile time. Its architecture involves code generation at compile-time using
“source-source translation to generate the appropriate back-end code for the different tar-
get platforms.”[16][17] The static nature of the data results in a low complexity requirement
at runtime in terms of memory management. The independence constraint means that the
order in which the data are iterated over for kernel application must have no significant
impact beyond floating point errors on its result.
The OpenMP local parallelisation code does not encounter the NUMA problem because
all of the data are statically allocated; as long as the data which will be accessed from
CPUs on different domains reside in different virtual pages, the default first touch policy
in most modern Operating Systems will ensure that memory accesses are primarily to the
local NUMA domain.
2.7.3 Galois
Galois is a C++ and Java framework which operates on data sets consisting of dynamic, in-
terdependent data. Consequently, both its memory management and its kernel run schedul-
ing have a significant runtime overhead.
Due to this extra overhead, it is primarily useful for data which are dynamic and have
sufficiently complex dependencies.
Galois is explicitly NUMA aware and contains support for OpenMP and MPI. However, it
only supports Linux.
2.7.4 Intel TBB
Intel provides a C++ parallelisation library called Thread Building Blocks. It contains
algorithms and structures designed to simplify the creation of multithreaded programs.
It implements task-based parallelism, with which it uses a task-based balancer, and provides
a memory allocator which prevents false sharing[18].
TBB is NUMA-aware. However, its generality and its task-based balancing mean that it
cannot ensure as much NUMA locality as a problem-specific, data-balancing library such
as PUMA. Consequently, it is not always an ideal solution for applications where runtime
is a primary consideration.
2.7.5 Cilk
Cilk is a language based on C (with other dialects based on C++) which provides meth-
ods for a programmer to identify parallel sections while leaving the runtime to perform
scheduling. The task scheduler is based on a work stealing strategy, where the tasks are
defined by the programmer.
Cilk is not explicitly NUMA aware, and because tasks are scheduled by the runtime rather
than the programmer, there is limited scope to make use of NUMA systems while minimis-
ing off-domain accesses.
2.8 LERM
Our primary case study for this work is a Lagrangian Ensemble Recruitment metamodel, as
detailed in [19], which simulates phytoplankton populations. Our reference implementation
is the result of prior work[20] that involved parallelising one such metamodel. It was
observed that the reference implementation encountered the NUMA effect, leading to a
significant reduction in parallel efficiency in the trivially parallel section when spread across
domains; this is shown in Figures 1.2 and 1.3.
The simulation (LERM) consists of three primary parts: an agent update loop; particle
management, for the creation and deletion of agents; and the environment update, which
simulates the spread and interaction of agent-caused environmental changes. Listing 2.1
shows the main algorithm implemented in Python-like pseudocode.
The three main sections roughly correspond to three different cases we may encounter:
the update loop (Figure 1.2) is primarily trivially parallel with a reduction of per-thread
data at the end; the particle management step (Figure 2.3) is partially parallelisable but
is implemented in serial in the reference implementation; and the environment update
(Figure 2.4) is mostly parallelisable but implemented in serial in both implementations
due to having a negligible impact on runtime.
We can see that the update loop has an obvious dip in parallel efficiency after it begins to
utilise cores on a different NUMA node, due to its not taking NUMA effects into account
when assigning work to each thread.
Because the particle management and environment sections are both implemented with
serial algorithms, they do not demonstrate a reduction in parallel efficiency as a result of
the NUMA effect. They do, however, begin to dominate as the update loop is distributed
across cores, especially the particle management step.
The update step is, in theory, trivially parallelisable. The particle management and envi-
ronmental update steps are implemented in our reference implementation as serial code,
but the particle management step can be parallelised. Approximately 98% of the simula-
tion is parallelised (calculated with formula 2.1) in the PUMA version; by Amdahl’s Law,
we can therefore achieve a theoretical maximum speedup of approximately 50×. Figure
1.4 shows that we achieve very close to this.
However, in our reference implementation, only 87.5% is parallelised, because the original
particle management is in serial whereas, out of necessity, we have parallelised the particle
management in PUMA. By Amdahl’s Law, the maximum speedup achievable by the ref-
erence implementation is 8×. Figure 2.5 shows that we do not achieve close to our ideal
runtime, because of NUMA latency.
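For reference, Amdahl’s Law for a program with parallel fraction p running on n cores gives

S(n) = 1 / ((1 - p) + p/n),  and as n grows, S(n) approaches 1 / (1 - p).

With p = 0.98 the limiting speedup is 1/0.02 = 50, and with p = 0.875 it is 1/0.125 = 8, which are the bounds quoted above for the PUMA and reference implementations respectively.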
In order to ensure that we observe NUMA effects if they have an impact, we initialise the
LERM simulation with 400,000 13-byte agents. Our dataset size (not taking into account
metadata overhead and other considerations) is therefore approximately 20.8MB. Since the
L3 cache in our test machine is 12MB, at least 40% of the agents must be re-read from
main memory on every timestep.
def splitAgents():
    while len(agents) < minAgents:
        splitIntoTwo(someAgent)

def mergeAgents():
    while len(agents) > maxAgents:
        mergeIntoOne(someAgent, someOtherAgent)

def updateAgents():
    # Trivially parallel loop
    for agent in agents:
        ecologyKernel(agent)

    reducePerThreadData()

def particleManagement():
    splitAgents()
    mergeAgents()

def mixChemistry():
    for layer in layers:
        totalConcentration += layer.concentration

def updateEnvironment():
    reloadPhysicsFromInitialisationFile()
    mixChemistry()

def main():
    initialiseEnvironment()

    while i < max_timestep:
        updateAgents()
        particleManagement()
        updateEnvironment()
Listing 2.1: LERM pseudocode
Figure 2.3 (plot of particle-management parallel efficiency against the number of cores in
use): Parallel efficiency reached by particle management with dynamic data using static
over-allocation and not taking NUMA effects into account. The red vertical line signifies
the point past which we use cores on another NUMA node.
Figure 2.4 (plot of environment-update parallel efficiency against the number of cores in
use): Parallel efficiency reached by environment update with dynamic data using static
over-allocation and not taking NUMA effects into account. The red vertical line signifies
the point past which we utilise cores from a second NUMA node.
Figure 2.5 (plot of total time in ms against the number of cores in use, comparing the
actual timing with the minimum time given by Amdahl’s Law): Total runtime with dynamic
data using static storage in the reference implementation.
P = Tp / Tt    (2.1)

How we calculate the proportion P of an application which is parallelised. Ts is the time
spent in the serial sections, Tp is the time spent in the parallel sections and Tt = Ts + Tp
is the total execution time.
Chapter 3
Design and Implementation
PUMA consists of several parts:
• A NUMA-aware dynamic memory allocator for homogeneous elements;
• An allocator for thread-local static data which cannot be freed individually. This
allocator uses pools which are located on the domain to which the core associated
with the thread belongs;
• A parallel iteration interface which applies a kernel to all elements in a PUMA set;
• A balancer which changes the thread with which each block of data is associated, in
order to balance kernel runtime across cores.
Much of PUMA’s design was needs-driven: it was developed in parallel with its integration
into a case study (see section 2.8) and its design evolved as new requirements became clear.
The reason that PUMA provides a kernel application function rather than direct access to
the underlying memory is to enable it to prevent cross-domain accesses. We achieve this
by pinning each thread in our pool to a specific core and ensuring that when we run the
kernel across our threads, each thread can access only domain-local elements.
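A minimal sketch of the pinning step, using the Linux-specific pthread_setaffinity_np(); PUMA’s real pinning code, including its handling of the affinity string mentioned in section 1.3, is necessarily more involved:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to a single core so that the memory it allocates
 * and touches stays local to that core's NUMA domain for the thread's
 * lifetime. */
static int pinCurrentThreadToCore(int core)
{
	cpu_set_t set;
	CPU_ZERO(&set);
	CPU_SET(core, &set);
	return pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
}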
PUMA works under the assumption that the application involves the manipulation of sets
of homogeneous elements. In our case study, these elements are the agents within the
model, each of which represents a group within the overall population of phytoplankton,
and we use two PUMA sets: one for live agents and one for dead agents.
PUMA implements parallelism by maintaining a list of elements per thread, each of which
can only be accessed by a single thread at a time.
Figure 3.1: How we lay out our internal lists
of allocated memory. The black box represents
the user-facing set structure, and each of blue
and red represents a different thread’s list of
pre-allocated memory blocks. These blocks are
the same size for each thread.
3.1 Dynamic Memory Allocator
Our initial design involved an unordered data structure which could act as a memory
manager for homogeneous elements. It would provide methods to map kernels across all
of its elements and ensure that each element would only be accessed from a processor
belonging to the NUMA domain on which it was allocated.
In order to achieve this, we use one list per thread within the user-facing PUMA set
structure (Figure 3.1). Each of these lists contains one or more blocks of memory at
least one virtual memory page long. We have a 1:1 mapping of threads to cores, enabling a
mostly lock-free design. This allows us to have correct multithreaded code while minimising
time-consuming context switches.
3.1.1 Memory Pools
In order to allocate memory quickly on demand, our dynamic allocator pre-allocates blocks,
each of which is one or more pages long. This has two purposes: only requesting large blocks
from the Operating System allows us to reduce the time spent on system calls; and the
smallest blocks on which system calls for the placement of data on NUMA domains can
operate are one page long and must be page aligned.
These blocks have descriptors at the start which contain information on the memory which
has been allocated from them. The descriptors also contain pointers to the next and
previous elements in their per-thread list to allow for iteration over all elements.
Currently, block size is determined by a preprocessor definition at compile time, because
this size is integral to calculating the location of metadata from an element’s address. It
could also be determined at run-time if set before any PUMA initialisation is performed
by the application.
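The sketch below shows the kind of information such a block descriptor holds; the field names are illustrative rather than PUMA’s actual definitions:

#include <stddef.h>

/* Hypothetical descriptor living at the start of each pre-allocated block.
 * The element storage follows it within the same block. */
struct blockDescriptor
{
	struct blockDescriptor* next; /* next block in this thread's list     */
	struct blockDescriptor* prev; /* previous block in this thread's list */
	size_t elementSize;           /* size of each homogeneous element     */
	size_t numElements;           /* elements currently allocated         */
	size_t capacity;              /* elements the block can hold          */
	int numaDomain;               /* domain the block's pages live on     */
	char* elementArray;           /* start of the element storage         */
};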
3.1.2 Element Headers
In order to free elements without exposing too much internal state to the user, we must have
some way of mapping elements’ addresses to the blocks in which they reside. Originally,
each element had a header containing a pointer to its block’s descriptor. This introduces
a significant memory overhead, however, especially if the size of the elements is small
compared to that of a pointer.
PUMA should use few resources in order to give users as much freedom as possible in its
use. Consequently, we devised two separate strategies for mapping elements to blocks’
descriptors. The first (Figure 3.2) was based on the NUMA allocation system calls which
we were already using to allocate blocks for the thread lists. These calls (specifically
numa_alloc_onnode() and numa_alloc_local()) guarantee that allocated memory will be
page-aligned.
Figure 3.2: Our first strategy for mapping elements to block descriptors. Darker
blue represents the block’s descriptor; light blue represents a page header with
a pointer to the block’s descriptor; red represents elements; the vertical lines
represent page boundaries; and white represents unallocated space within the
block.
If we ensure that each page within a block has a header, we can store a pointer in that
header to the block’s descriptor. Finding the block descriptor for a given element then
simply involves rounding the element’s address down to the next lowest multiple of the
page size.
This has two major disadvantages, however:
• In order to calculate the index of a given element or the address corresponding to an
element’s index, we must perform a relatively complex calculation (between twenty
and fifty arithmetic operations), as shown in listing 3.1, rather than simple pointer
arithmetic (up to five operations). These are common calculations within PUMA, so
minimising their complexity is critical.
• If the usable page size after each header is not a multiple of our element size, we
can have up to sizeof(element) - 1 bytes of wasted space. This is especially
problematic with elements which are larger than our pages.
void* getElement(struct pumaNode* node, size_t i)
{
	size_t pageSize = (size_t)sysconf(_SC_PAGE_SIZE);
	char* arrayStart = node->elementArray;

	size_t firstSkipIndex =
		getIndexOfElementOnNode(node, (char*)node + pageSize + sizeof(struct pumaHeader));
	size_t elemsPerPage =
		getIndexOfElementOnNode(node, (char*)node + 2 * pageSize + sizeof(struct pumaHeader)) - firstSkipIndex;

	size_t pageNum = (i >= firstSkipIndex) * (1 + (i - firstSkipIndex) / elemsPerPage);
	size_t lostSpace =
		(pageNum > 0) * ((pageSize - sizeof(struct pumaNode)) % node->elementSize)
		+ (pageNum > 1) * (pageNum - 1) * ((pageSize - sizeof(struct pumaHeader)) % node->elementSize)
		+ pageNum * sizeof(struct pumaHeader);

	void* element = (i * node->elementSize + lostSpace + arrayStart);
	return element;
}

size_t getIndexOfElement(void* element)
{
	struct pumaNode* node = getNodeForElement(element);
	return getIndexOfElementOnNode(node, element);
}

size_t getIndexOfElementOnNode(struct pumaNode* node, void* element)
{
	size_t pageSize = (size_t)sysconf(_SC_PAGE_SIZE);
	char* arrayStart = node->elementArray;

	size_t pageNum = ((size_t)element - (size_t)node) / pageSize;
	size_t lostSpace =
		(pageNum > 0) * ((pageSize - sizeof(struct pumaNode)) % node->elementSize)
		+ (pageNum > 1) * (pageNum - 1) * ((pageSize - sizeof(struct pumaHeader)) % node->elementSize)
		+ pageNum * sizeof(struct pumaHeader);

	size_t index = (size_t)((char*)element - arrayStart - lostSpace) / node->elementSize;
	return index;
}

Listing 3.1: Calculating the mapping between elements and their indices with a per-page header.
Our second strategy (Figure 3.3) eliminated the need for these complex operations while
reducing memory overhead. POSIX systems provide a function to request a chunk of
memory aligned to a certain size, as long as that size is 2^n pages long for some integer n.
If we ensure that block sizes also follow that restriction, we can allocate blockSize bytes
aligned to blockSize. Listing 3.2 shows how we calculate the mapping between elements
and their indices with this strategy.
size_t getIndexOfElement(void* element)
{
	struct pumaNode* node = getNodeForElement(element);
	return getIndexOfElementOnNode(element, node);
}

size_t getIndexOfElementOnNode(void* element, struct pumaNode* node)
{
	char* arrayStart = node->elementArray;
	size_t index = (size_t)((char*)element - arrayStart) / node->elementSize;
	return index;
}

void* getElement(struct pumaNode* node, size_t i)
{
	char* arrayStart = node->elementArray;
	void* element = (i * node->elementSize + arrayStart);
	return element;
}

struct pumaNode* getNodeForElement(void* element)
{
	struct pumaNode* node =
		(struct pumaNode*)((size_t)element &
			~((pumaPageSize * PUMA_NODEPAGES) - 1));
	return node;
}

Listing 3.2: Calculating the mapping between elements and their indices without using headers.
Figure 3.3: Our second strategy for mapping elements to block descriptors. The
blue block represents the block’s descriptor; the red blocks represent elements;
and the vertical lines represent page boundaries. In this example, blocks are
two pages long.
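A sketch of how such size-aligned blocks can be obtained. posix_memalign() is the POSIX call alluded to above; binding the pages to a NUMA node with numa_tonode_memory() is our assumption here, since this chapter does not name the exact libnuma call PUMA uses:

#include <stdlib.h>
#include <numa.h> /* libnuma; link with -lnuma */

/* Allocate a block of 2^n pages, aligned to its own size, and place its
 * pages on the given NUMA node. Because the block is aligned to its size,
 * masking any element address inside it (as in Listing 3.2) recovers the
 * address of the block descriptor at the start of the block. */
static void* allocAlignedBlock(size_t blockSize, int node)
{
	void* block = NULL;
	if (posix_memalign(&block, blockSize, blockSize) != 0)
		return NULL;

	numa_tonode_memory(block, blockSize, node);
	return block;
}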
3.2 Static Data
After parallelising all of the trivially parallel code in our primary case study, we found
that we were still encountering a major bottleneck. Profiling revealed that this was mostly
caused by an otherwise innocuous line in a pseudo random number generator. It was using
a static variable as the initial seed and then updating the seed each time it was called, as
shown in listing 3.3.
float rnd(float a)
{
	static int i = 79654659;
	float n;

	i = (i * 125) % 2796203;
	n = (i % (int)a) + 1.0;

	return n;
}

Listing 3.3: Pseudo-random number generator where i is a static seed.
As we increased our number of threads, writing to the seed required threads to wait for
cache synchronisation between cores, and using cores belonging to multiple NUMA domains
incurred lengthy cross-domain accesses.
The cache coherency problem could be solved to an extent using thread-local storage such
as that provided by #pragma omp threadprivate(...). However, since there are no
guarantees about the placement of thread-local static storage in relation to other threads’
variables, multiple thread-local seeds can still be located within the same cache line, leading
to synchronisation. This also means that we cannot optimise for NUMA without a more
problem-specific static memory management scheme.
We implemented a simple memory allocator which can allocate blocks of variable sizes
but not free() them individually. The lack of support for free()ing allows us to avoid
having to search for empty space within our available heap space while still allowing for
variable-sized allocations. This then places the responsibility for retaining reusable blocks
on the application developer.
This allocator is primarily for static data which is accessed regularly when running a kernel,
such as return values or seeds.
This allocator returns blocks of data which are located on the NUMA domain local to the
CPU which calls the allocation function. The main differences between it and PUMA’s
primary memory allocator are:
• The user is expected to keep track of allocated memory;
• The allocator enables variable sizes;
• Allocated blocks cannot be individually free()d.
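A minimal sketch of such a domain-local bump allocator follows; it is simplified (fixed pool size, no growth, hypothetical names) but captures the three properties listed above:

#include <stddef.h>
#include <numa.h> /* libnuma; link with -lnuma */

#define POOL_BYTES (1 << 20)

/* One pool per thread, carved from memory on the calling thread's local
 * NUMA domain. Allocation only moves the offset forwards; there is no
 * per-allocation free(), matching the static-data allocator's contract. */
struct staticPool
{
	char* base;
	size_t offset;
	size_t size;
};

static int poolInit(struct staticPool* pool)
{
	pool->base = numa_alloc_local(POOL_BYTES);
	pool->offset = 0;
	pool->size = POOL_BYTES;
	return pool->base != NULL ? 0 : -1;
}

static void* poolAlloc(struct staticPool* pool, size_t bytes)
{
	/* Round up to 16 bytes to keep returned pointers reasonably aligned. */
	bytes = (bytes + 15) & ~(size_t)15;
	if (pool->offset + bytes > pool->size)
		return NULL;

	void* p = pool->base + pool->offset;
	pool->offset += bytes;
	return p;
}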
3.3 Kernel Application
PUMA does not provide any way of retrieving individual elements from its set of allocated
elements. Instead, it exposes an interface for applying kernels to all elements. This interface
also enables the specification of functions used to manipulate extra data which is to be
passed into the kernel. With this, we can manipulate the data in the set as long as our
manipulation can be done in parallel and is not order-dependent.
The extra data which is passed into the kernel is thread-local in order to avoid cache
coherency overhead and expensive thread synchronisation. Consequently, we also allow
the user to specify a reduction function which is executed after all threads have finished
running the kernel and has access to all threads’ extra data.
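To make the shape of this interface concrete, here is a hedged usage sketch. The kernel and reduction signatures below, and the commented-out driver calls, are placeholders standing in for the calls documented in Appendix B rather than PUMA’s exact API:

#include <stddef.h>

struct agent { double biomass; };
struct perThreadExtra { double biomassSum; }; /* thread-local extra data */

/* Kernel applied to every element; extraData is this thread's private copy,
 * so no synchronisation is needed inside the kernel. */
static void updateKernel(void* element, void* extraData)
{
	struct agent* a = element;
	struct perThreadExtra* extra = extraData;

	a->biomass *= 1.01;              /* some per-agent update */
	extra->biomassSum += a->biomass; /* accumulate thread-locally */
}

/* Reduction run once, after every thread has finished the kernel, with
 * access to all threads' extra data. */
static void reduceExtra(void* extraData[], size_t numThreads, void* result)
{
	double* total = result;
	*total = 0.0;
	for (size_t t = 0; t < numThreads; ++t)
		*total += ((struct perThreadExtra*)extraData[t])->biomassSum;
}

/* Hypothetical driver, sketching how the pieces fit together:
 *
 *   set = createPumaSet(sizeof(struct agent), numThreads, affinityString);
 *   runKernelOnSet(set, updateKernel, reduceExtra,
 *                  sizeof(struct perThreadExtra), &totalBiomass);
 */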
3.3.1 Load Balancer
When a kernel is run with PUMA, it first balances all of the per-thread lists at a block
level based on timing data from previous runs. If one thread has recently finished running
kernels significantly faster than other threads on average, we transfer blocks from slower
threads to it in order to increase its workload.
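The sketch below illustrates the general idea with assumed per-thread timing and block-count arrays; PUMA's real balancer works on its internal block lists and prefers intra-domain transfers (see section 5.3.2), so this is not the actual implementation.

/* Illustrative time-based balancing heuristic, not PUMA's implementation:
 * threads that ran faster than average receive one block from each thread
 * that ran slower. Adjusting the counts stands in for moving one block of
 * elements between the threads' lists. */
static void rebalance(double lastKernelTime[], size_t numBlocks[], int nThreads)
{
	double mean = 0.0;
	for(int t = 0; t < nThreads; ++t)
		mean += lastKernelTime[t];
	mean /= nThreads;

	for(int fast = 0; fast < nThreads; ++fast)
	{
		if(lastKernelTime[fast] >= mean)
			continue;

		for(int slow = 0; slow < nThreads; ++slow)
		{
			if(lastKernelTime[slow] <= mean || numBlocks[slow] <= 1)
				continue;

			--numBlocks[slow];
			++numBlocks[fast];
		}
	}
}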
3.4 Challenges
3.4.1 Profiling
One of the most challenging aspects of developing PUMA was identifying the location
and nature of bottlenecks. Most profiling tools we encountered, such as Intel's VTune
Amplifier[21] and GNU gprof[22], are time- or cycle-based. VTune also provides metrics
on how OpenMP is utilised. However, finding hotspots of cross-domain activity was
still a matter of making educated guesses based on abnormal timing results from these profilers.
VTune and a profiler called Likwid[23] also provide access to hardware counters, which can
be useful for profiling cross-domain accesses. However, without superuser access it can
be difficult to obtain usable hardware counter results from these tools; only the counters'
total values for the entire run are shown, meaning that identifying hotspots is still a matter of guesswork.
In section 6.1 we discuss possible approaches to implementing userspace memory access
profiling tools in order to reduce the amount of guesswork required.
3.4.2 Invalid Memory Accesses
Because PUMA includes a memory allocator, we encountered several bugs regarding accessing invalid memory and corrupting header data.
In order to prevent these bugs, we use Valgrind’s[24] error detection interface to make our
allocator compatible with Valgrind’s memcheck utility. This enables Valgrind to alert the
user if they are reading from uninitialised memory or writing to un-allocated or free()d
memory.
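A sketch of the kind of client requests involved is shown below; the macro names come from Valgrind's public headers, while the PUMA_VALGRIND guard is an assumed stand-in for the VALGRIND build option from section 3.6.2, not the exact PUMA code.

/* Sketch of memcheck integration for a custom allocator. With Valgrind
 * support disabled, the macros become no-ops. */
#ifdef PUMA_VALGRIND /* assumed stand-in for the VALGRIND build option */
#include <valgrind/valgrind.h>
#include <valgrind/memcheck.h>
#else
#define VALGRIND_MALLOCLIKE_BLOCK(addr, size, rz, zeroed) ((void)0)
#define VALGRIND_FREELIKE_BLOCK(addr, rz) ((void)0)
#endif

#include <stddef.h>

static void markElementAllocated(void* element, size_t size)
{
	/* Tell memcheck that a malloc-like block with undefined contents now
	 * lives here, so reads before initialisation are reported. */
	VALGRIND_MALLOCLIKE_BLOCK(element, size, 0, 0);
}

static void markElementFreed(void* element)
{
	/* Report the free(); later accesses to the element become errors. */
	VALGRIND_FREELIKE_BLOCK(element, 0);
}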
This Valgrind integration is not fully implemented, however; ideally, we would have Valgrind protect all memory
containing metadata. However, it is possible for multiple threads to read each other’s
metadata at once (without writing to it). Reading another thread’s metadata requires
marking it as valid before reading and marking it as invalid after.
Due to the non-deterministic nature of thread scheduling, this could sometimes lead to
interleaving of validating and invalidating memory in such a way that between a thread
validating memory and reading it, another thread may have read the memory and then
invalidated it.
We decided that since overwriting this per-thread metadata was unlikely compared to other
memory access bugs, it was sensible to avoid protecting these blocks of memory entirely in
order to avoid false positives in Valgrind’s output.
3.4.3 Local vs. Remote Testing
NUMA-based architectures are not particularly prevalent in current consumer computers.
Consequently, the majority of our testing of the NUMA-based sections of PUMA had to
be performed while logged into a remote server.
It is, however, possible to perform some of this NUMA-based testing on a non-NUMA
machine. While it is not particularly useful for gathering timing data, the qemu virtual
machine has a configuration option enabling NUMA simulation even on non-NUMA hosts,
which can be useful for testing the robustness and correctness of NUMA-aware applications[25].
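For example, invoking qemu-system-x86_64 with two -numa node options (along the lines of -smp 4 -m 2048 -numa node,cpus=0-1,mem=1024 -numa node,cpus=2-3,mem=1024) presents the guest with two simulated NUMA domains; the exact option syntax varies between qemu versions, so it should be checked against the documentation for the installed release.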
3.5 Testing and Debugging
We used various methods to test and debug PUMA. For testing, we wrote a short test
suite covering several functions that had caused hard-to-debug errors early in development.
We also used LERM as a more comprehensive testing platform, comparing the biomass in
the PUMA version with that in the reference implementation as a metric of functional correctness.
In terms of debugging, several methods were used. We used gdb and our Valgrind compatibility with both LERM and our unit tests in order to identify bugs within PUMA itself.
We used system timers to assess whether each section's parallel efficiency met our expectations. We also used both VTune and Likwid to collect more granular timing data, allowing
us to identify bottlenecks within both PUMA and LERM. PUMA bottlenecks acted as
indicators for what to optimise and LERM bottlenecks helped with the identification of
useful features for PUMA.
3.6 Compilation
Compilation of PUMA requires a simple make invocation in the PUMA root directory. The
make targets are as follows:
• all: Build PUMA and docs and run unit tests
• doc: Build documentation with doxygen
• no_test: Build PUMA without running unit tests
• clean: Clear the working tree
• docs_clean: Clear all built documentation
3.6.1 Dependencies
PUMA relies on the following:
• libNUMA
• C99 compatible compiler
• Valgrind (optional)
• OpenMP (optional)
• Doxygen (optional, documentation)
3.6.2 Configuration
The following are configuration options for public use. For options which are either enabled
or disabled, 1 enables and 0 disables.
• PUMA_NODEPAGES: Specifies the number of pages to allocate per chunk in the
per-thread chunk list. Default 2
• OPENMP: Enable OpenMP. If disabled, we use PUMA's pthread-based thread pooling solution (experimental). Default enabled
• STATIC_THREADPOOL: If enabled and we are not using OpenMP, we share one
thread pool amongst all instances of PUMASet. Default disabled
• BINDIR: Where we place the built shared library. Default {pumadir}/bin
• VALGRIND: Whether we build with Valgrind support. Default enabled
The following is a configuration option for use during PUMA development. It may severely
hurt performance so should never be used in performance-critical code.
• DEBUG: Enable assertions. Default disabled
3.7 Getting Started
We present a short walkthrough on how to write a simple PUMA-based application in
appendix C. It consists of the generation of a random data set of which we find the standard
deviation by calculating the sums of all of the elements and of their squares.
We also include an API reference in appendix B.
Chapter 4
LERM Parallelisation
The basic LERM model (section 2.8) is concerned primarily with the simulation of agents in
a column of water 500m deep. The column is split into layers, with each layer corresponding
to one metre of the column.
When parallelising LERM, the naïve approach involves domain decomposition; we split
layers equally between processors and each processor operates only on agents within its
assigned layers.
This has the problem, however, of encouraging inter-thread communication; agents may
move between layers, requiring the processor which moves a given agent to notify the newly
responsible thread. Given that any or all agents can move between layers during an update,
this potentially requires communication for every agent, leading to a large amount of time
wasted by processors which are waiting for access to synchronisation constructs.
This is not scalable beyond a certain number of processors (in this case, 500) without
subdividing layers. Also, the distribution of agents between layers is likely not to be fully
uniform, meaning that the workload will be unbalanced between processors.
Since the size of the problem is dictated by the number of agents rather than the number of
layers, and the number of agents is variable, a more scalable solution involves distributing
agents between processors. Since agents do not have to move between the domains managed
by different processors, there is no longer an inter-processor communication overhead.
4.1 Scalability
The isoefficiency function (equation 4.1) is a way of relating parallel efficiency to problem
size as the number of processors in use scales. One of its benefits is that it provides a
way of exploring how problem size must scale with the number of processors in order to
maintain the same parallel efficiency.

E = 1 / (1 + T_o / (W × t_c))   (4.1)

The isoefficiency function. W is the problem size, T_o is the serial overhead, t_c is the cost of execution for each operation and E is the parallel efficiency[26].

Ω(W) = C × T_o   (4.2)

Workload growth for maintaining a fixed efficiency. W is the problem size, T_o is the serial overhead and C is a constant representing the fixed efficiency[27].
Equation 4.2 shows a mapping between serial overhead and workload. If the equation holds
- i.e. the workload can be increased at least as quickly as the serial overhead as we increase
the number of processors in use - we say that an algorithm has perfect scalability. In other
words, we can maintain a constant efficiency as we increase processors.
The serial sections in the PUMA-based LERM implementation are all either O(n) (environmental update) or O(p), where p is the number of processors in use. This means that T_o
and W are not directly related, so satisfying equation 4.1 requires scaling W proportionally
to T_o.
Since this is trivially sustainable as we increase the number of processors, the PUMA-based
LERM implementation can, in theory, maintain a constant efficiency.
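For illustration, with assumed constants rather than measured values: if the serial overhead grows as T_o = c × p for some constant c, then equation 4.2 asks for W = C × c × p, so doubling the number of processors only requires doubling the problem size (the number of agents), which the simulation permits.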
4.2 Applying PUMA to LERM
Listing 4.1 shows Python-like pseudocode for the PUMA-based version of LERM. In order
to adapt LERM to use PUMA, we must first identify all sections which operate on agents
and adapt them to use the PUMA-based abstractions for running kernels, rather than
iterating over all agents and applying the kernel manually. Lines 10, 13 and 17 show
instances of this change when compared with lines 2, 6 and 11 respectively from listing 2.1.
These areas are primarily in the update and particle management steps. We also identify
any reductions performed after iterating over the agents and use PUMA's reduction mechanism to perform these automatically. Line 17 shows where we tell PUMA to perform the
reduction after the update loop.
 1 def splitKernel(agent):
 2     if len(agents) < minAgents:
 3         splitIntoTwo(agent)
 4
 5 def mergeKernel(agent):
 6     if len(agents) > maxAgents:
 7         mergeIntoOne(agent, smallestAgent)
 8
 9 def mergeAgents():
10     runKernel(mergeKernel)
11
12 def splitAgents():
13     runKernel(splitKernel)
14
15 def updateAgents():
16     # Trivially parallel loop
17     runKernel(ecologyKernel, reduction=reducePerThreadData)
18
19
20 # The rest is the same as in the original implementation
21 def particleManagement():
22     splitAgents()
23     mergeAgents()
24
25 def mixChemistry():
26     for layer in layers:
27         totalConcentration += layer.concentration
28
29 def updateEnvironment():
30     reloadPhysicsFromInitialisationFile()
31     mixChemistry()
32
33 def main():
34     initialiseEnvironment()
35
36     while i < max_timestep:
37         updateAgents()
38         particleManagement()
39         updateEnvironment()
Listing 4.1: PUMA-based LERM pseudocode
Chapter 5
Evaluation
Our primary metrics by which we examine the success of PUMA are twofold: first, we
compare our case study as implemented with PUMA to the reference implementation,
specifically in relation to biomass of plankton; second, we compare measured profiling data
to our expectations and to the reference.
5.1 Correctness
Figure 5.1 shows the biomass over time in the PUMA implementation. Even across several
runs with random initial seeds, it does not significantly deviate from the average. We
compare the average biomass in the PUMA implementation with the same metric in the
reference implementation in Figure 5.2, which shows that they follow a similar pattern and
the difference between the two is at most 7.2% of the reference’s biomass.
The differences are due to two factors:
• PUMA manually manages the workload for each thread. Since agents interact with
per-thread environment variables and the order of iteration over the agents is unde-
fined, the exact result of the simulation is non-deterministic.
• PUMA enforces a parallel programming model when interacting with the agents
it manages, because all iteration over agents must be expressed in the form of a
parallelisable kernel. Because of this, we had to reimplement the particle management
step in this form, which led to different behaviour on a microscopic scale while
macroscopically maintaining correctness.
A benefit of having reimplemented particle management is that it prevents serial particle
management from dominating performance results, allowing us to focus on the NUMA
problem. The new method has, however, not been rigorously statistically analysed because
that is beyond the scope of this project (see section 6.1).

Figure 5.1: Biomass of plankton in the PUMA-based LERM simulation over time (timestep, each corresponding to 30 minutes, against plankton biomass). Black lines are the maximum and minimum across all runs; red is the average.
5.2 Profiling
Our profiling data consists of two parts, both of which we compare with the reference
LERM implementation. The first is the proportion of total memory accesses which are
remote for each number of cores. We expect this to be higher when using cores belonging
to a second domain, because reduction requires accesses to data assigned to all cores. It
should, however, be significantly lower than in the reference implementation.
The second is the parallel efficiency (formula A.1) across cores. In the trivially parallel
section, we expect it to remain at approximately 100% even as we use cores belonging to
a second domain.
Figure 1.3 shows a comparison between off-domain accesses in the PUMA-based LERM implementation and the reference implementation. In PUMA, we reduce off-domain accesses
by 75% from the reference implementation.

Figure 5.2: Comparison between the average biomasses across runs in the reference and PUMA implementations of the LERM simulation over time, with the difference between them plotted as a percentage of the reference biomass.
Our timing data are presented in Figures 5.5, 5.6 and 5.7. The most important section
here is the mostly trivially parallel update; unlike in the reference implementation, we have
no noticeable NUMA-based reduction in parallel efficiency.
Both implementations exhibit similar parallel efficiency for the environmental update step
(Figure 5.7), because in both cases it is implemented as a serial algorithm.
Figures 5.3 and 5.4 each show the total time taken for the simulation when run on up
to twelve cores on a log-log plot. Each also shows the theoretical minimum time according
to Amdahl’s Law, calculated with formula A.2. It is important to note that, while we
have significantly reduced the time taken to run LERM on a single core when we compare
the PUMA-based implementation with the reference implementation, this is not a direct
result of PUMA. Instead, it is a result of slightly different scheduling causing changes in
which paths are taken through the primary update kernel. The total timing graphs are
best considered in isolation from each other, to see how they conform to the Amdahl’s
Law-dictated minimum timing in each case.

Figure 5.3: Total runtime with dynamic data using PUMA, on a log-log scale (cores in use against total time in ms), broken down into the ideal time, the parallel update, update balancing, particle management and the environment update.
5.2.1 Load Balancing
In order to assess the usefulness of our load balancer, we tested both with and without
load balancing turned on. Figure 5.8 shows that load balancing has a small but noticeable
effect on runtime in the trivially parallel step of LERM.
5.3 Known Issues
While our implementation provides an effective solution to the NUMA problem, we still
have areas in which it can be improved.
Figure 5.4: Total runtime with dynamic data using static over-allocation and not taking NUMA effects into account, on a log-log scale (cores in use against total time in ms), broken down into the ideal time, the update, particle management and the environment update.
Figure 5.5: Parallel efficiency reached by agent updates in the PUMA implementation of LERM vs in the reference implementation, with and without reduction (cores in use against parallel efficiency of the update). The red vertical line signifies the point past which we utilise cores on a second NUMA node.
In this instance, reduction is the necessary consolidation of per-thread environmental changes from the update loop into the global state.
Margin of error is calculated by finding the percentage standard deviation in the original times and applying it to the parallel efficiency.
5.3.1 Thread Pooling
PUMA relies on the persistence of thread pinning at initialisation. If new threads are
created each time we execute code in parallel, the pinning is no longer persistent. Consequently, threads may be moved to other cores depending on the operating system's
scheduler, leading to bugs which are difficult to reproduce.
Figure 5.6: Parallel efficiency reached by particle management in the PUMA implementation of LERM vs in the reference implementation (cores in use against parallel efficiency of particle management).
The user may specify a CPU affinity string for both the GNU and Intel OpenMP implementations as an environment variable. This has the disadvantage of requiring extra
parameters at program invocation, however, and of removing control from the programmer.
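For example, the GNU runtime reads GOMP_CPU_AFFINITY (e.g. GOMP_CPU_AFFINITY="0-11") and the Intel runtime reads KMP_AFFINITY; the accepted formats are described in each runtime's documentation.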
We provide a custom thread pool implementation because the OpenMP standard does
not specify whether threads are reused between parallel sections. Both the Intel and
GNU implementations of OpenMP currently reuse threads which have been previously
spawned[28][29], but the standard allows for new threads to be spawned each time a parallel
section is encountered.
However, our thread pool implementation does not scale as well as OpenMP (Figure 5.9)
and relies on pthreads, meaning that it is not natively supported on Windows. Consequently, we would like to optimise the custom thread pool, or attempt to find an existing
threading library in which we do not have to rely on non-guaranteed behaviour.
Figure 5.7: Parallel efficiency reached by environment update in the PUMA implementation of LERM vs in the reference implementation (cores in use against parallel efficiency of the environment update).
5.3.2 Parallel Balancing
In order to ensure that no thread has a significantly longer runtime than any other, we
implement workload balancing with heuristics based on previous kernel runtimes. While
this has proven effective, it is non-optimal in that the balancer's own parallelisable sections
have not themselves been parallelised.
Since our balancing algorithm transfers ownership of memory among cores on the same
NUMA domain first before performing inter-domain copies, it could be parallelised through
domain decomposition; each NUMA domain would be internally balanced by a separate
thread with a serial cross-domain reduction at the end.
Currently, by Amdahl’s Law[10], the balancer limits the parallel speedup we can achieve,
and the balancing time increases as we use more cores.
Figure 5.8: Parallel efficiency reached by the update step in the PUMA implementation of LERM with and without load balancing (cores in use against parallel efficiency).
Figure 5.9: Parallel efficiency reached by the trivially parallel section of LERM when using OpenMP (blue) and PUMA's own thread pool (red) (cores in use against parallel efficiency).
Chapter 6
Conclusions
We have presented a framework which allows users with little or no knowledge of the
underlying topology and memory hierarchy of NUMA-based systems to develop software
which takes advantage of the available hardware while automatically preventing cache
coherency overhead and cross-domain accesses within parallel kernels.
Its uniqueness lies primarily in the class of problems which it tackles; as discussed in
section 2.7, solutions exist which are tailored to help with several classes of problems. We
have explored solutions for operating on sets of static, independent data (section 2.7.2)
and graphs of dynamic data with complex dependency hierarchies (section 2.7.3), none of
which are suitable for dynamic, independent data sets such as those used in branches of
agent-based modelling.
The availability of a solution tailored to this sort of problem could help with the rapid
development of scientific applications, leading to easier research and simulation.
6.1 Future Work
PUMA is far from complete; in particular, we would like to address the issues raised in
section 5.3.
We have designed PUMA to abstract away OS-specific interfaces, internally, for simplicity.
While the systems on which PUMA has been tested are POSIX-based, Windows also
provides NUMA libraries. In the future, it would be useful to port PUMA to Windows so
that applications using PUMA are not bound to POSIX systems.
A major feature which would make PUMA better able to take advantage of modern distributed systems is MPI support. This would mainly require changes to the balancing and
reduction sections of kernel application, and would enable further parallelisation.
In order to help PUMA adoption in the scientific community, we would like to create
bindings for languages such as Python and Fortran, both of which are prevalent in scientific
computing. Tools exist for both languages to interface with C functions, so this should
require very little work in exchange for broader applicability of PUMA.
As mentioned in chapter 5, our parallelised version of LERM’s particle management step
has not been rigorously statistically analysed. It would be useful to analyse the changes in
order to assess whether the current PUMA implementation can be adapted to larger simulations. If not, adaptation would require different, possibly more complex parallelisation
methods.
In section 3.4.1, we discuss the potential usefulness of NUMA memory access profiling.
During PUMA’s development, we briefly explored various methods for the creation of a
profiler which would not require superuser privileges and would perform line-by-line profiling of memory accesses, specifically identifying spots where many cross-domain accesses
were performed and where cache synchronisation dominated timing. Unfortunately, it was
too far outside the scope of PUMA to realistically explore in depth.
We examined two possible strategies for the implementation of such a profiler:
• Using some debugging library (such as LLDB’s C++ API[30]) to trap every memory
access and determine the physical location of the accessed address in order to count
off-domain accesses;
• Building on Valgrind, which translates machine code into its own RISC-like language
before executing the translated code, to count off-domain accesses. Valgrind could
also be used to examine cache coherency latency by adapting Cachegrind, a tool
which profiles cache utilisation.
Bibliography
[1] Gordon E. Moore. Cramming more components onto integrated circuits. Electronics
Magazine, 1965.
[2] Wikimedia Commons. Transistor count and moore’s law, 2011.
[3] X. Guo, G. Gorman, M. Lange, L. Mitchell, and M. Weiland. Exploring the
thread-level parallelisms for the next generation geophysical fluid modelling frame-
work fluidity-icom. Procedia Engineering, 61:251–257, 2013.
[4] http://www.metoffice.gov.uk/news/in-depth/supercomputers.
[5] C. Vollaire, L. Nicolas, and A. Nicolas. Parallel computing for the finite element
method. Eur. Phys. J. AP, 1(3):305–314, 1998.
[6] http://cs.nyu.edu/~lerner/spring10/projects/NUMA.pdf.
[7] Sunny Y. Auyang. Foundations of Complex-system Theories: In Economics, Evolu-
tionary Biology, and Statistical Physics. Cambridge University Press, 1999.
[8] U. Berger and H. Hildenbrandt. A new approach to spatially explicit modelling of
forest dynamics: spacing, ageing and neighbourhood competition of mangrove trees.
Ecological Modelling, 132:287–302, 2000.
[9] U. Saint-Paul and H. Schneider. Mangrove Dynamics and Management in North
Brazil. Springer Science & Business Media, 2010.
[10] Gene M. Amdahl. Validity of the single processor approach to achieving large scale
computing capabilities. Proceedings of the April 18-20, 1967, spring joint computer
conference on - AFIPS ’67 (Spring), 1967.
[11] John Von Neumann. First Draft of a Report on the EDVAC - https://web.archive.org/web/20130314123032/http://qss.stanford.edu/~godfrey/vonNeumann/vnedvac.pdf.
[12] http://www.archer.ac.uk/about-archer/.
[13] http://www.cray.com/sites/default/files/resources/cray_xc40_specifications.pdf.
[14] Simon J. Cox, James T. Cox, Richard P. Boardman, Steven J. Johnston, Mark Scott, and Neil S. O'Brien. Iridis-pi: a low-cost, compact demonstration cluster. Cluster Computing, 2013.
[15] http://techpubs.sgi.com/library/dynaweb_docs/0640/SGI_Developer/books/OrOn2_PfTune/sgi_html/ch08.html.
[16] http://www.oerc.ox.ac.uk/projects/op2.
[17] http://www.oerc.ox.ac.uk/sites/default/files/uploads/ProjectFiles/OP2/OP2_Users_Guide.pdf.
[18] https://software.intel.com/en-us/intel-tbb/details.
[19] J.D. Woods. The lagrangian ensemble metamodel for simulating plankton ecosystems.
Progress in Oceanography, 67(1-2):84–159, 2005.
[20] Robert Kruszewski. Accelerating agent-based python models. Master’s thesis, Imperial
College London.
[21] https://software.intel.com/en-us/intel-vtune-amplifier-xe.
[22] https://sourceware.org/binutils/docs/gprof/.
[23] https://code.google.com/p/likwid/.
[24] http://valgrind.org/.
[25] http://linux.die.net/man/1/qemu-kvm.
[26] Ananth Grama, Anshul Gupta, and Vipin Kumar. Isoefficiency function: A scalability
metric for parallel algorithms and architectures, 1993.
[27] Peter Hanuliak and Michal Hanuliak. Analytical modelling in parallel and distributed
computing, pages 101–102. Chartridge Books Oxford, 2014.
[28] https://software.intel.com/en-us/forums/topic/382683.
[29] https://software.intel.com/en-us/forums/topic/382683.
[30] http://lldb.llvm.org/cpp_reference/html/index.html.
[31] http://ark.intel.com/products/47922/Intel-Xeon-Processor-X5650-12M-Cache-2_66-GHz-6_40-GTs-Intel-QPI.
[32] http://opensource.org/licenses/BSD-3-Clause.
Appendix A
Methods For Gathering Data
All performance data are gathered using the Imperial College High Performance Computing
service (unless explicitly stated otherwise). Code was run on the Cx1 general-purpose
cluster using a node with the following hardware:
• Two six-core Intel® Xeon® X5650, 2.66GHz, 12MB last-level cache[31]
– 2.66GHz
– 32KB L1 instruction cache per core
– 32KB L1 data cache per core
– 256KB L2 cache
– 12MB L3 cache
• Two NUMA domains, one for each processor
• Limited to 1GB memory by qsub queuing system
likwid-perfctr was used to gather information from hardware counters; these were primarily related to counting cross-domain accesses using the UNC_QHL_REQUESTS_REMOTE_READS
counter and local accesses with the UNC_QHL_REQUESTS_LOCAL_READS counter.
All data were collected by averaging results over ten runs.
Formulae:
• Parallel efficiency:
100 × T_1 / (T_n × n)   (A.1)
where T_n is the time taken to run on n cores.
• Amdahl’s Law theoretical minimum runtime:
T_1 × P_s + T_1 × P_p / n   (A.2)
where T_1 is the time taken to run on one core, P_s is the proportion of the program
which is serial, P_p is the proportion which is parallelisable and n is the number of
cores.
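For illustration, with assumed rather than measured numbers: if T_1 = 100 s, P_s = 0.1 and P_p = 0.9, then on n = 12 cores the theoretical minimum runtime is 100 × 0.1 + 100 × 0.9 / 12 = 17.5 s, corresponding to a maximum speedup of roughly 5.7×.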
Appendix B
API Reference
B.1 PUMA Set Management
struct pumaSet* createPumaSet(size_t elementSize, size_t numThreads,
		char* threadAffinity);
Creates a new struct pumaSet.
Arguments:
elementSize Size of each element in the set.
numThreads The number of threads we want to run pumaSet on.
threadAffinity An affinity string specifying the CPUs to which to bind threads. Can
contain numbers separated either by commas or dashes. “i-j” means
bind to every cpu from i to j inclusive. “i,j” means bind to i and j.
Formats can be mixed: for example, “0-3, 6, 10, 12, 15, 13” is valid.
If NULL, binds each thread to the CPU whose number matches the
thread (tid 0 == cpu 0 :: tid 1 == cpu 1 :: etc.).
If non-NULL, must specify at least as many CPUs as there are threads.
void destroyPumaSet(struct pumaSet* set);
Destroys and frees memory from the struct pumaSet.
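For example (values chosen purely for illustration), a set of doubles processed by four threads pinned to CPUs 0-3 could be created and destroyed as follows:

/* Illustrative usage of the set management calls above. */
struct pumaSet* set = createPumaSet(sizeof(double), 4, "0-3");
/* ... allocate elements and run kernels ... */
destroyPumaSet(set);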
size_t getNumElements(struct pumaSet* set);
Returns the total number of elements in the struct pumaSet.
typedef size_t (splitterFunc)(void* perElemBalData, size_t numThreads,
		void* extraData);
Signature for a function which, given an element, the total number of threads and, optionally, a void pointer, will specify the thread with which to associate the element.
Arguments:
perElemBalData Per-element data passed into pumallocManualBalancing() which
enables the splitter to choose the placement of the associated element.
numThreads The total number of threads in use.
extraData Optional extra data, set by calling pumaSetBalancer().
void pumaSetBalancer(struct pumaSet* set, bool autoBalance,
		splitterFunc* splitter, void* splitterExtraData);
Sets the balancing strategy for a struct pumaSet.
Arguments:
set Set to set the balancing strategy for.
autoBalance Whether to automatically balance the set across threads prior to
each kernel run.
splitter A pointer to a function which determines the thread with which to
associate new data when pumallocManualBalancing() is called.
splitterExtraData A void pointer to be passed to the splitter function each time it
is called.
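As an illustration of the splitter signature (not part of the PUMA API itself), a round-robin splitter might look like the following, registered with pumaSetBalancer() and fed a counter through splitterExtraData:

/* Illustrative round-robin splitter: ignores the per-element data and cycles
 * through the threads using a counter passed in as the splitter's extra data. */
static size_t roundRobinSplitter(void* perElemBalData, size_t numThreads,
		void* extraData)
{
	size_t* nextThread = (size_t*)extraData;
	return (*nextThread)++ % numThreads;
}

/* Registration:
 *   static size_t counter = 0;
 *   pumaSetBalancer(set, true, &roundRobinSplitter, &counter);
 */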
B.2 Memory Allocation
void* pumalloc(struct pumaSet* set);
Adds an element to the struct pumaSet and returns a pointer to it. The new element is
associated with the CPU on which the current thread is running.
void* pumallocManualBalancing(struct pumaSet* set, void* balData);
Adds an element to the struct pumaSet and returns a pointer to it. Passes balData to
the set’s splitter function to determine the CPU with which to associate the new element.
void* pumallocAutoBalancing(struct pumaSet* set);
Adds an element to the struct pumaSet and returns a pointer to it. Automatically associates the new element with the CPU with the fewest elements.
void pufree(void* element);
Frees the specified element from its set.
B.3 Kernel Application
struct pumaExtraKernelData
{
	void* (*extraDataConstructor)(void* constructorData);
	void* constructorData;
	void (*extraDataDestructor)(void* data);
	void (*extraDataThreadReduce)(void* data);
	void (*extraDataReduce)(void* retValue, void* data[],
			unsigned int nThreads);
	void* retValue;
};
A descriptor of functions which handle extra data for kernels to pass into runKernel().
Members:
extraDataConstructor A per-thread constructor for extra data which is passed into
the kernel.
constructorData A pointer to any extra data which may be required by the
constructor. May be NULL.
extraDataDestructor A destructor for data created with
extraDataConstructor().
extraDataThreadReduce A finalisation function which is run after the kernel on a per-thread basis. Takes the per-thread data as an argument.
extraDataReduce A global finalisation function which is run after all threads
have finished running the kernel. Takes retValue, an array
of the extra data for all threads and the number of threads
in use.
retValue A pointer to a return value for use by extraDataReduce.
May be NULL.
void initKernelData(struct pumaExtraKernelData* kernelData,
		void* (*extraDataConstructor)(void* constructorData),
		void* constructorData,
		void (*extraDataDestructor)(void* data),
		void (*extraDataThreadReduce)(void* data),
		void (*extraDataReduce)(void* retValue, void* data[],
				unsigned int nThreads),
		void* retValue);
Initialises kernelData. Any or all of the arguments after kernelData may be NULL. Any
NULL functions are set to dummy functions which do nothing.
extern struct pumaExtraKernelData emptyKernelData;
A dummy descriptor for extra kernel data. Causes NULL to be passed to the kernel in place
of extra data.
typedef void (*pumaKernel)(void* element, void* extraData);
The type signature for kernels which are to be run on a PUMA list.
Arguments:
element The current element in our iteration.
extraData Extra information specified by our extra data descriptor.
void runKernel(struct pumaSet* set, pumaKernel kernel,
		struct pumaExtraKernelData* extraDataDetails);
Applies the given kernel to all elements in a struct pumaSet.
Arguments:
set The set containing the elements to which we want to apply our
kernel.
kernel A pointer to the kernel to apply.
extraDataDetails A pointer to the structure specifying the extra data to be passed
into the kernel.
void runKernelList(struct pumaSet* set, pumaKernel kernels[],
		size_t numKernels, struct pumaExtraKernelData* extraDataDetails);
Applies the given kernels to all elements in a struct pumaSet. Kernels are applied in the
order in which they are specified in the array.
Arguments:
set The set containing the elements to which we want to apply our
kernels.
kernels An array of kernels to apply.
numKernels The number of kernels to apply.
extraDataDetails A pointer to the structure specifying the extra data to be passed
into the kernels.
B.4 Static Data Allocation
void* pumallocStaticLocal(size_t size);
Allocates thread-local storage which resides on the NUMA domain of the CPU executing
the function.
Arguments:
size The number of bytes we want to allocate.
void* pumaDeleteStaticData(void);
Deletes all static data associated with the current thread.
Appendix C
Getting Started: Standard
Deviation Hello World!
In lieu of the traditional “Hello World” introductory program, we present a PUMA-based
program which generates a large set of random numbers between 0 and 1 and uses the
reduction mechanism of PUMA to calculate the set’s standard deviation.
In order to calculate the standard deviation, we require three things: a kernel, a constructor for the per-thread data and a reduction function. In the constructor, we use the
pumallocStaticLocal() function to allocate a static variable on a per-thread basis which
resides in memory local to the core to which each thread is pinned.
This interface for allocating thread-local data is only intended to be used for static data
whose lifespan extends to the end of the program. It is possible to delete all static data
which is related to a thread, but it is more sensible to simply reuse the allocated memory
each time we need similarly-sized data on a thread. This requires the use of pthread keys
in order to retrieve the allocated pointer each time it is needed.
// puma.h contains all of the PUMA public API declarations we need.
#include "puma.h"

#include <math.h>
#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>   /* for time(), used to seed rand() in main() */
#include <getopt.h>

pthread_key_t extraDataKey;
pthread_once_t initExtraDataOnce = PTHREAD_ONCE_INIT;

static void initialiseKey(void)
{
	pthread_key_create(&extraDataKey, NULL);
}

struct stdDevExtraData
{
	double sum;
	double squareSum;
	size_t numElements;
};

static void* extraDataConstructor(void* constructorData)
{
	(void)pthread_once(&initExtraDataOnce, &initialiseKey);

	void* stdDevExtraData = pthread_getspecific(extraDataKey);
	if(stdDevExtraData == NULL)
	{
		stdDevExtraData = pumallocStaticLocal(sizeof(struct stdDevExtraData));
		pthread_setspecific(extraDataKey, stdDevExtraData);
	}

	return stdDevExtraData;
}

static void extraDataReduce(void* voidRet, void* voidData[],
		unsigned int nThreads)
{
	double* ret = (double*)voidRet;
	double sum = 0;
	double squareSum = 0;
	size_t numElements = 0;
	for(unsigned int i = 0; i < nThreads; ++i)
	{
		struct stdDevExtraData* data = (struct stdDevExtraData*)voidData[i];
		numElements += data->numElements;
		sum += data->sum;
		squareSum += data->squareSum;
	}

	double mean = sum / numElements;
	*ret = squareSum / numElements - (mean * mean);
}

static void stdDevKernel(void* voidNum, void* voidData)
{
	double num = *(double*)voidNum;
	struct stdDevExtraData* data = (struct stdDevExtraData*)voidData;
	data->sum += num;
	data->squareSum += num * num;
	++data->numElements;
}

static void staticDestructor(void* arg)
{
	pumaDeleteStaticData();
}
Prior to running the kernel, we must actually create the struct pumaSet which contains
our data; to do this, we specify the size of our elements, the number of threads we wish to
use and, optionally, a string detailing what cores we want to pin threads to. We must also
seed the random number generator and read the arguments:
static void printHelp(char* invocationName)
{
	printf("Usage: %s -t numThreads -e numElements [-a affinityString]\n"
			"\tnumThreads: The number of threads to use\n"
			"\tnumElements: The number of numbers to allocate\n"
			"\taffinityString: A string which specifies which cores to run on.\n",
			invocationName);
}

int main(int argc, char** argv)
{
	int numThreads = 1;
	int numElements = 1000;
	char* affinityStr = NULL;

	/*
		Get command line input for the affinity string and number of threads.
	*/
	int c;
	while((c = getopt(argc, argv, "e:a:t:h")) != -1)
	{
		switch(c)
		{
			case 't':
				numThreads = atoi(optarg);
				break;
			case 'e':
				numElements = atoi(optarg);
				break;
			case 'a':
				affinityStr = optarg;
				break;
			case 'h':
				printHelp(argv[0]);
				break;
		}
	}

	struct pumaSet* set =
			createPumaSet(sizeof(double), numThreads, affinityStr);

	srand(time(NULL));
From here, we can use the pumalloc call to allocate space within set for each number:
	for(size_t i = 0; i < numElements; ++i)
	{
		double* num = (double*)pumalloc(set);
		*num = (double)rand() / RAND_MAX;
	}
We then use initKernelData() to create the extra data to be passed into our kernel.
From there, we call runKernel() to invoke our kernel and get the standard deviation of
the set.
	struct pumaExtraKernelData kData;
	double stdDev = -1;
	initKernelData(&kData, &extraDataConstructor, NULL, NULL, NULL,
			&extraDataReduce, &stdDev);

	runKernel(set, stdDevKernel, &kData);

	printf("Our set has a standard deviation of %f\n"
			"Also, Hello World!\n", stdDev);
Finally, we clean up after ourselves by destroying our set and all our static data. The static
data destructor destroys data on a per-thread basis, so we must call the destructor from
all threads in our pool. To do this, we use the executeOnThreadPool() function from
pumathreadpool.h.
	executeOnThreadPool(set->threadPool, staticDestructor, NULL);
	destroyPumaSet(set);
}
In order to compile this tutorial, use the following command:
gcc -pthread -std=c99 <file>.c -o stddev -lpuma -L<PUMA bin dir> -I<PUMA inc dir>
61
Appendix D
Licence
PUMA is released under the three-clause BSD licence[32]. We chose this rather than a
copyleft licence like GPL or LGPL in order to allow anyone to use PUMA with absolute
freedom aside from the inclusion of a short copyright notice.
62

More Related Content

What's hot

Neural Network Toolbox MATLAB
Neural Network Toolbox MATLABNeural Network Toolbox MATLAB
Neural Network Toolbox MATLABESCOM
 
95960910 atoll-getting-started-umts-310-en-v1
95960910 atoll-getting-started-umts-310-en-v195960910 atoll-getting-started-umts-310-en-v1
95960910 atoll-getting-started-umts-310-en-v1Oshin Neeh
 
Master Thesis - A Distributed Algorithm for Stateless Load Balancing
Master Thesis - A Distributed Algorithm for Stateless Load BalancingMaster Thesis - A Distributed Algorithm for Stateless Load Balancing
Master Thesis - A Distributed Algorithm for Stateless Load BalancingAndrea Tino
 
A Matlab Implementation Of Nn
A Matlab Implementation Of NnA Matlab Implementation Of Nn
A Matlab Implementation Of NnESCOM
 
95763406 atoll-3-1-0-user-manual-lte
95763406 atoll-3-1-0-user-manual-lte95763406 atoll-3-1-0-user-manual-lte
95763406 atoll-3-1-0-user-manual-ltearif budiman
 
Implementing tws extended agent for tivoli storage manager sg246030
Implementing tws extended agent for tivoli storage manager   sg246030Implementing tws extended agent for tivoli storage manager   sg246030
Implementing tws extended agent for tivoli storage manager sg246030Banking at Ho Chi Minh city
 
Advanced Networking Concepts Applied Using Linux on IBM System z
Advanced Networking  Concepts Applied Using  Linux on IBM System zAdvanced Networking  Concepts Applied Using  Linux on IBM System z
Advanced Networking Concepts Applied Using Linux on IBM System zIBM India Smarter Computing
 
BOOK - IBM zOS V1R10 communications server TCP / IP implementation volume 1 b...
BOOK - IBM zOS V1R10 communications server TCP / IP implementation volume 1 b...BOOK - IBM zOS V1R10 communications server TCP / IP implementation volume 1 b...
BOOK - IBM zOS V1R10 communications server TCP / IP implementation volume 1 b...Satya Harish
 
Linux kernel 2.6 document
Linux kernel 2.6 documentLinux kernel 2.6 document
Linux kernel 2.6 documentStanley Ho
 
Explorations in Parallel Distributed Processing: A Handbook of Models, Progra...
Explorations in Parallel Distributed Processing: A Handbook of Models, Progra...Explorations in Parallel Distributed Processing: A Handbook of Models, Progra...
Explorations in Parallel Distributed Processing: A Handbook of Models, Progra...mustafa sarac
 
iPDC-v1.3.0 - A Complete Technical Report including iPDC, PMU Simulator, and ...
iPDC-v1.3.0 - A Complete Technical Report including iPDC, PMU Simulator, and ...iPDC-v1.3.0 - A Complete Technical Report including iPDC, PMU Simulator, and ...
iPDC-v1.3.0 - A Complete Technical Report including iPDC, PMU Simulator, and ...Nitesh Pandit
 
Slr to tivoli performance reporter for os 390 migration cookbook sg245128
Slr to tivoli performance reporter for os 390 migration cookbook sg245128Slr to tivoli performance reporter for os 390 migration cookbook sg245128
Slr to tivoli performance reporter for os 390 migration cookbook sg245128Banking at Ho Chi Minh city
 

What's hot (19)

Neural Network Toolbox MATLAB
Neural Network Toolbox MATLABNeural Network Toolbox MATLAB
Neural Network Toolbox MATLAB
 
95960910 atoll-getting-started-umts-310-en-v1
95960910 atoll-getting-started-umts-310-en-v195960910 atoll-getting-started-umts-310-en-v1
95960910 atoll-getting-started-umts-310-en-v1
 
Master Thesis - A Distributed Algorithm for Stateless Load Balancing
Master Thesis - A Distributed Algorithm for Stateless Load BalancingMaster Thesis - A Distributed Algorithm for Stateless Load Balancing
Master Thesis - A Distributed Algorithm for Stateless Load Balancing
 
Manual
ManualManual
Manual
 
A Matlab Implementation Of Nn
A Matlab Implementation Of NnA Matlab Implementation Of Nn
A Matlab Implementation Of Nn
 
95763406 atoll-3-1-0-user-manual-lte
95763406 atoll-3-1-0-user-manual-lte95763406 atoll-3-1-0-user-manual-lte
95763406 atoll-3-1-0-user-manual-lte
 
Implementing tws extended agent for tivoli storage manager sg246030
Implementing tws extended agent for tivoli storage manager   sg246030Implementing tws extended agent for tivoli storage manager   sg246030
Implementing tws extended agent for tivoli storage manager sg246030
 
Cube_Quest_Final_Report
Cube_Quest_Final_ReportCube_Quest_Final_Report
Cube_Quest_Final_Report
 
thesis-2005-029
thesis-2005-029thesis-2005-029
thesis-2005-029
 
Advanced Networking Concepts Applied Using Linux on IBM System z
Advanced Networking  Concepts Applied Using  Linux on IBM System zAdvanced Networking  Concepts Applied Using  Linux on IBM System z
Advanced Networking Concepts Applied Using Linux on IBM System z
 
Mining of massive datasets
Mining of massive datasetsMining of massive datasets
Mining of massive datasets
 
Administrator manual-e2
Administrator manual-e2Administrator manual-e2
Administrator manual-e2
 
BOOK - IBM zOS V1R10 communications server TCP / IP implementation volume 1 b...
BOOK - IBM zOS V1R10 communications server TCP / IP implementation volume 1 b...BOOK - IBM zOS V1R10 communications server TCP / IP implementation volume 1 b...
BOOK - IBM zOS V1R10 communications server TCP / IP implementation volume 1 b...
 
Linux kernel 2.6 document
Linux kernel 2.6 documentLinux kernel 2.6 document
Linux kernel 2.6 document
 
Master_Thesis
Master_ThesisMaster_Thesis
Master_Thesis
 
Explorations in Parallel Distributed Processing: A Handbook of Models, Progra...
Explorations in Parallel Distributed Processing: A Handbook of Models, Progra...Explorations in Parallel Distributed Processing: A Handbook of Models, Progra...
Explorations in Parallel Distributed Processing: A Handbook of Models, Progra...
 
iPDC-v1.3.0 - A Complete Technical Report including iPDC, PMU Simulator, and ...
iPDC-v1.3.0 - A Complete Technical Report including iPDC, PMU Simulator, and ...iPDC-v1.3.0 - A Complete Technical Report including iPDC, PMU Simulator, and ...
iPDC-v1.3.0 - A Complete Technical Report including iPDC, PMU Simulator, and ...
 
Db2 partitioning
Db2 partitioningDb2 partitioning
Db2 partitioning
 
Slr to tivoli performance reporter for os 390 migration cookbook sg245128
Slr to tivoli performance reporter for os 390 migration cookbook sg245128Slr to tivoli performance reporter for os 390 migration cookbook sg245128
Slr to tivoli performance reporter for os 390 migration cookbook sg245128
 

Similar to final (1)

matconvnet-manual.pdf
matconvnet-manual.pdfmatconvnet-manual.pdf
matconvnet-manual.pdfKhamis37
 
Gdfs sg246374
Gdfs sg246374Gdfs sg246374
Gdfs sg246374Accenture
 
Memory synthesis using_ai_methods
Memory synthesis using_ai_methodsMemory synthesis using_ai_methods
Memory synthesis using_ai_methodsGabriel Mateescu
 
eclipse.pdf
eclipse.pdfeclipse.pdf
eclipse.pdfPerPerso
 
FYP_enerScope_Final_v4
FYP_enerScope_Final_v4FYP_enerScope_Final_v4
FYP_enerScope_Final_v4Hafiiz Osman
 
Ivo Pavlik - thesis (print version)
Ivo Pavlik - thesis (print version)Ivo Pavlik - thesis (print version)
Ivo Pavlik - thesis (print version)Ivo Pavlik
 
Efficient algorithms for sorting and synchronization
Efficient algorithms for sorting and synchronizationEfficient algorithms for sorting and synchronization
Efficient algorithms for sorting and synchronizationrmvvr143
 
Efficient algorithms for sorting and synchronization
Efficient algorithms for sorting and synchronizationEfficient algorithms for sorting and synchronization
Efficient algorithms for sorting and synchronizationrmvvr143
 
Ali.Kamali-MSc.Thesis-SFU
Ali.Kamali-MSc.Thesis-SFUAli.Kamali-MSc.Thesis-SFU
Ali.Kamali-MSc.Thesis-SFUAli Kamali
 
Team Omni L2 Requirements Revised
Team Omni L2 Requirements RevisedTeam Omni L2 Requirements Revised
Team Omni L2 Requirements RevisedAndrew Daws
 
Operating Systems (printouts)
Operating Systems (printouts)Operating Systems (printouts)
Operating Systems (printouts)wx672
 
Coding interview preparation
Coding interview preparationCoding interview preparation
Coding interview preparationSrinevethaAR
 
452042223-Modern-Fortran-in-practice-pdf.pdf
452042223-Modern-Fortran-in-practice-pdf.pdf452042223-Modern-Fortran-in-practice-pdf.pdf
452042223-Modern-Fortran-in-practice-pdf.pdfkalelboss
 

Similar to final (1) (20)

matconvnet-manual.pdf
matconvnet-manual.pdfmatconvnet-manual.pdf
matconvnet-manual.pdf
 
T401
T401T401
T401
 
jc_thesis_final
jc_thesis_finaljc_thesis_final
jc_thesis_final
 
Gdfs sg246374
Gdfs sg246374Gdfs sg246374
Gdfs sg246374
 
Memory synthesis using_ai_methods
Memory synthesis using_ai_methodsMemory synthesis using_ai_methods
Memory synthesis using_ai_methods
 
eclipse.pdf
eclipse.pdfeclipse.pdf
eclipse.pdf
 
FYP_enerScope_Final_v4
FYP_enerScope_Final_v4FYP_enerScope_Final_v4
FYP_enerScope_Final_v4
 
Ivo Pavlik - thesis (print version)
Ivo Pavlik - thesis (print version)Ivo Pavlik - thesis (print version)
Ivo Pavlik - thesis (print version)
 
Efficient algorithms for sorting and synchronization
Efficient algorithms for sorting and synchronizationEfficient algorithms for sorting and synchronization
Efficient algorithms for sorting and synchronization
 
Efficient algorithms for sorting and synchronization
Efficient algorithms for sorting and synchronizationEfficient algorithms for sorting and synchronization
Efficient algorithms for sorting and synchronization
 
Agathos-PHD-uoi-2016
Agathos-PHD-uoi-2016Agathos-PHD-uoi-2016
Agathos-PHD-uoi-2016
 
Agathos-PHD-uoi-2016
Agathos-PHD-uoi-2016Agathos-PHD-uoi-2016
Agathos-PHD-uoi-2016
 
Ali.Kamali-MSc.Thesis-SFU
Ali.Kamali-MSc.Thesis-SFUAli.Kamali-MSc.Thesis-SFU
Ali.Kamali-MSc.Thesis-SFU
 
Team Omni L2 Requirements Revised
Team Omni L2 Requirements RevisedTeam Omni L2 Requirements Revised
Team Omni L2 Requirements Revised
 
Operating Systems (printouts)
Operating Systems (printouts)Operating Systems (printouts)
Operating Systems (printouts)
 
main
mainmain
main
 
Coding interview preparation
Coding interview preparationCoding interview preparation
Coding interview preparation
 
452042223-Modern-Fortran-in-practice-pdf.pdf
452042223-Modern-Fortran-in-practice-pdf.pdf452042223-Modern-Fortran-in-practice-pdf.pdf
452042223-Modern-Fortran-in-practice-pdf.pdf
 
Milan_thesis.pdf
Milan_thesis.pdfMilan_thesis.pdf
Milan_thesis.pdf
 
Electrónica digital: Logicsim
Electrónica digital: LogicsimElectrónica digital: Logicsim
Electrónica digital: Logicsim
 

final (1)

  • 1. Department of Computing Imperial College London PUMA Abstracting Memory Latency Optimisation In Parallel Applications Richard Jones Supervisor: Tony Field June 2015
  • 2. Abstract Moore’s Law states that every eighteen months to two years, the number of transistors per square inch on an integrated circuit approximately doubles[1], effectively leading to a proportional performance gain. However, in the early twenty-first century, transistor size reduction began to slow down, limiting the growth of complexity in high-performance applications which was afforded by increasing computing power. Consequently, there was a push for increased parallelism, enabling several tasks to be car- ried out simultaneously. For high-performance computing applications, a logical extension to this was to utilise multiple processors simultaneously in the same system, each with multiple execution units, in order to increase parallelism with widely available consumer hardware. In multi-processor systems, having uniformly shared, globally accessible physical memory means that memory access times are the same across all processors. These accesses can be expensive, however, because they all require communication with a remote node, typically across a bus which is shared among the processors. Since the bus is shared and can only handle one request at a time, processors may have to wait to use it, causing delays when attempting to access memory. The situation can be improved by giving each processor its own section of memory, each with its own data bus. Each section of memory is called a domain, and accessing a domain which is assigned to a different processor requires the use of an interconnect which has a higher latency than accessing local memory. This architecture is called NUMA (Non- Uniform Memory Access). In order to exploit NUMA architectures efficiently, application developers need to write code which minimises so-called cross-domain accesses to maximise the application’s aggregate memory performance. We present PUMA, which is a smart memory allocator that manages data in a NUMA- aware way. PUMA exposes an interface to execute a kernel on the data in parallel, auto- matically ensuring that each core which runs the kernel accesses primarily local memory. It also provides an optional time-based load balancer which can adapt workloads to cases where some cores may have be less powerful or have more to do per kernel invocation than others.
  • 3. Acknowledgements I would like to thank the following for their contributions to PUMA, both directly and indirectly: • My supervisor, Tony Field, who has been a tremendous source of support, both in the development of PUMA and in my completion of this year. • Dr Michael Lange, the creator of our LERM case study, who spent hours helping me to work out just exactly what was wrong with my timing results. • My tutor, Murray Shanahan, for helping me get through all four years of my degree relatively intact, and providing help and support throughout. • Imperial’s High Performance Computing service, especially Simon Burbidge who was invaluable in helping me find my way around the HPC systems. • My family and friends (especially those who know nothing about computers) for continuing to talk to me after being forced to proofread approximately seventeen thousand different drafts of my project report. Also for providing vague moral sup- port over the course of the first 21 years of my life. i
1.2 Objectives 3
1.3 PUMA (Pseudo-Uniform Memory Access) 4
1.4 Contributions 6

2 Background 7
2.1 Hardware Trends 7
2.2 Memory Hierarchy 8
2.3 Parallel Computing 10
2.4 Caching In Multi-Processor Systems 11
2.5 Memory In Agent-Based Models 11
2.6 Workload Balancing 13
2.6.1 Work Stealing 13
2.6.2 Data-Based Balancing 13
2.7 Existing Approaches 13
2.7.1 Manual Parallelisation 13
2.7.2 OP2/PyOP2 14
2.7.3 Galois 14
2.7.4 Intel TBB 15
2.7.5 Cilk 15
2.8 LERM 15

3 Design and Implementation 21
3.1 Dynamic Memory Allocator 22
3.1.1 Memory Pools 22
3.1.2 Element Headers 23
3.2 Static Data 26
3.3 Kernel Application 27
3.3.1 Load Balancer 28
3.4 Challenges 28
3.4.1 Profiling 28
3.4.2 Invalid Memory Accesses 28
3.4.3 Local vs. Remote Testing 29
3.5 Testing and Debugging 29
3.6 Compilation 30
3.6.1 Dependencies 30
3.6.2 Configuration 30
3.7 Getting Started 31

4 LERM Parallelisation 32
4.1 Scalability 32
4.2 Applying PUMA to LERM 33

5 Evaluation 35
5.1 Correctness 35
5.2 Profiling 36
5.2.1 Load Balancing 38
5.3 Known Issues 38
5.3.1 Thread Pooling 40
5.3.2 Parallel Balancing 42

6 Conclusions 45
6.1 Future Work 45

A Methods For Gathering Data 50

B API Reference 52
B.1 PUMA Set Management 52
B.2 Memory Allocation 54
B.3 Kernel Application 55
B.4 Static Data Allocation 57

C Getting Started: Standard Deviation Hello World! 58

D Licence 62
List of Figures

1.1 Calvin ponders the applications of workload parallelisation. 1
1.2 Parallel efficiency reached by a trivially parallel algorithm with dynamic data using static over-allocation and not taking NUMA effects into account. The red vertical line signifies the point past which we use cores on another NUMA node. 3
1.3 Percentage of total reads which are remote on average across several runs of the trivially parallel section of the case study. 5
1.4 Total runtime with dynamic data using PUMA. 6
2.1 "Transistor counts for integrated circuits plotted against their dates of introduction. The curve shows Moore's law - the doubling of transistor counts every two years. The y-axis is logarithmic, so the line corresponds to exponential growth."[2] 9
2.2 Example of agent allocation with Linux' built-in malloc() implementation. Agents belonging to each thread are represented by a different colour per thread. Black represents non-agent memory. 12
2.3 Parallel efficiency reached by particle management with dynamic data using static over-allocation and not taking NUMA effects into account. The red vertical line signifies the point past which we use cores on another NUMA node. 18
2.4 Parallel efficiency reached by environment update with dynamic data using static over-allocation and not taking NUMA effects into account. The red vertical line signifies the point past which we utilise cores from a second NUMA node. 19
2.5 Total runtime with dynamic data using static storage in the reference implementation. 20
3.1 How we lay out our internal lists of allocated memory. The black box represents the user-facing set structure, and each of blue and red represents a different thread's list of pre-allocated memory blocks. These blocks are the same size for each thread. 22
3.2 Our first strategy for mapping elements to block descriptors. Darker blue represents the block's descriptor; light blue represents a page header with a pointer to the block's descriptor; red represents elements; the vertical lines represent page boundaries; and white represents unallocated space within the block. 23
3.3 Our second strategy for mapping elements to block descriptors. The blue block represents the block's descriptor; the red blocks represent elements; and the vertical lines represent page boundaries. In this example, blocks are two pages long. 26
5.1 Biomass of plankton in the PUMA-based LERM simulation over time. Black are maximum and minimum across all runs, red is average. 36
5.2 Comparison between the average biomasses across runs in the reference and PUMA implementations of the LERM simulation over time. 37
5.3 Total runtime with dynamic data using PUMA. 38
5.4 Total runtime with dynamic data using static over-allocation and not taking NUMA effects into account. 39
5.5 Parallel efficiency reached by agent updates in the PUMA implementation of LERM vs in the reference implementation. The red vertical line signifies the point past which we utilise cores on a second NUMA node. In this instance, reduction is the necessary consolidation of per-thread environmental changes from the update loop into the global state. Margin of error is calculated by finding the percentage standard deviation in the original times and applying it to the parallel efficiency. 40
5.6 Parallel efficiency reached by particle management in the PUMA implementation of LERM vs in the reference implementation. 41
5.7 Parallel efficiency reached by environment update in the PUMA implementation of LERM vs in the reference implementation. 42
5.8 Parallel efficiency reached by the update step in the PUMA implementation of LERM with and without load balancing. 43
5.9 Parallel efficiency reached by the trivially parallel section of LERM when using OpenMP (blue) and PUMA's own thread pool (red). 44
Listings

2.1 LERM pseudocode 17
3.1 Calculating the mapping between elements and their indices with a per-page header. 24
3.2 Calculating the mapping between elements and their indices without using headers. 25
3.3 Pseudo-random number generator where i is a static seed. 26
4.1 PUMA-based LERM pseudocode 34
Chapter 1

Introduction

Figure 1.1: Calvin ponders the applications of workload parallelisation.

In modern computing, the execution time of many applications can be greatly reduced by the use of large, multi-core systems. These gains are especially prevalent in scientific simulations operating on very large data sets, such as ocean current simulation[3], weather simulation[4] and finite element modelling[5].

Memory accesses in multi-processor systems can be subject to delays due to contention for the memory controller. A common solution to this problem is to provide multiple distinct controllers, each with its own discrete memory; this architecture is called NUMA, or Non-Uniform Memory Access[6]. Processors can access memory associated with other processors' controllers, but doing so incurs an extra delay because the request must traverse a higher-latency interconnect between controllers.

The primary disadvantage of NUMA is that, as each core needs to access more memory which is not associated with its controller, memory latency becomes a dominating factor in runtime. This causes parallel efficiency (calculated with formula A.1) to drop off rapidly
unless it is taken into account, even in applications which are trivially or "embarrassingly" parallel.

In parallel systems, cache coherency can also be a major factor in memory latency. In order to enable a so-called "classical" programming model based on the Von Neumann stored-program concept for general purpose computing, processors often automatically synchronise their caches if one processor writes to a memory location which resides in both its and another's cache. This synchronisation introduces extra latency when reading from or writing to memory which is within a cache line recently accessed by another processor. Consequently, its avoidance in high-performance parallel code can be critical.

Figure 1.2 shows the parallel efficiency as we utilise more cores in the trivially parallel section of the reference implementation of LERM (Lagrangian Ensemble Recruitment Model), which we use as our primary case study. Both NUMA latency and cache synchronisation are responsible for the drop-off; on one domain, there is a significant drop in performance due to multiple processors accessing and updating data within the same cache line. When we use multiple domains, we can see a change in the rate of drop-off in parallel efficiency, as illustrated by the two-part trend line.

Figure 1.3 shows the increase in off-domain memory accesses in the reference implementation of LERM when we use cores associated with a second domain.

1.1 Motivation

In this project we aim to provide a framework with clear abstractions for application developers - both with an understanding of computer architecture and without - to write software which takes advantage of large, NUMA-based machines without dealing with the underlying management of memory across NUMA domains.

Solutions already exist which provide this level of abstraction; however, each solution is focused on a specific set of problems. The problem which provided the primary motivation for our library is a type of agent-based modelling in which no agent interacts with any other, except indirectly via environmental or global variables. This kind of modelling can be used in a variety of areas, including economic interactions of individuals with a market[7] and ecological simulations[8][9]. Some such models are dynamic, in that the number of agents may change over time.
Figure 1.2: Parallel efficiency reached by a trivially parallel algorithm with dynamic data using static over-allocation and not taking NUMA effects into account. The red vertical line signifies the point past which we use cores on another NUMA node.

1.2 Objectives

The primary purpose of our solution is to combine the dynamism of malloc() with the benefits of static approaches, as well as to provide methods to run kernels on pre-pinned threads across datasets. It must abstract away as much of the low-level detail as possible, without sacrificing performance gains or configurability.

We aim to enable the simple parallelisation of applications operating on large, independent sets of data in such a way that results remain correct within reasonable bounds. We also aim to prevent NUMA cross-domain performance penalties without the application developer's intervention.
1.3 PUMA (Pseudo-Uniform Memory Access)

This project is concerned with the design, implementation and evaluation of PUMA, a NUMA-aware library that provides methods for the allocation of homogeneous blocks of memory (elements), the iteration over all of these elements and the creation of static, domain-local data on a per-thread basis. It exposes a relatively small external API which is easy to use. It does not require developers to have an understanding of the underlying system topology, allowing them to focus more on the logic behind their kernel. It does, however, provide advanced configuration options; for example, it automatically pins threads to cores but can also take an affinity string at initialisation for customisable pinning.

As well as providing NUMA-aware memory management within its set structure, PUMA also exposes an API for the allocation of per-thread static data which is placed on the calling thread's local NUMA domain. These allocated data are guaranteed not to share a cache line with anything allocated from another thread, preventing the memory latency caused by maintaining cache coherency.

PUMA fits between existing solutions in that, while it imposes the constraint that data must not have direct dependencies, the data set it operates on can be dynamic. It therefore addresses a different class of problems from libraries such as OP2 (section 2.7.2) and Galois (section 2.7.3).

We have adapted a scientific simulation (LERM) to use PUMA in order to examine its effects and usability. The simulation is outlined in section 2.8.

The PUMA-based implementation of this simulation has three main sections, each exhibiting a different level of parallelisation: trivially parallel, partially parallel and entirely serial.

Figure 1.4 shows the total runtime of LERM when implemented with PUMA against the theoretical minimum as dictated by Amdahl's Law[10]. This figure illustrates two aspects of PUMA at work:

1. The execution time across cores on a single domain is close to the minimum due to cache coherency optimisation;
2. The execution time across cores on multiple domains is also close to the theoretical minimum as a result of both cache coherency optimisation and the reduction in cross-domain memory traffic, as shown in Figure 1.3.

Currently, PUMA has been tested on the following Operating Systems:

• Ubuntu Linux 14.10 and 15.04 with kernels 3.16 and 3.19;
• Red Hat Enterprise Linux with kernel 2.6.32;
• Mac OS 10.10.3

Figure 1.3: Percentage of total reads which are remote on average across several runs of the trivially parallel section of the case study.

PUMA is written to be as backwards compatible as possible within the Linux kernel; it was written with the POSIX standard in mind and uses, as far as possible, only standard-mandated features. For those which are not, alternatives are available as compile-time options.

Internally, we implement time-based thread workload balancing. This allows us to manage intelligently the time taken to run a kernel on each thread, preventing any one thread from taking significantly longer than others.
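A highly simplified sketch of that balancing idea follows. It is not PUMA's actual implementation; the threadTiming structure and the moveBlock() helper are hypothetical names used purely for illustration.

#include <stddef.h>

/* If one thread has recently been finishing its kernel runs much faster than
 * the slowest thread, hand it one block of elements from the slowest thread's
 * list so that the next run is more evenly loaded. */
struct threadTiming { double lastKernelTime; };

static void balance(struct threadTiming* timings, size_t numThreads)
{
	size_t fastest = 0;
	size_t slowest = 0;

	for (size_t t = 1; t < numThreads; ++t)
	{
		if (timings[t].lastKernelTime < timings[fastest].lastKernelTime)
			fastest = t;
		if (timings[t].lastKernelTime > timings[slowest].lastKernelTime)
			slowest = t;
	}

	/* moveBlock() stands in for transferring one block of elements between
	 * the two threads' per-thread lists. */
	if (timings[slowest].lastKernelTime > 1.1 * timings[fastest].lastKernelTime)
		moveBlock(slowest, fastest);
}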
Figure 1.4: Total runtime with dynamic data using PUMA.

1.4 Contributions

• In chapter 3, we discuss the design and implementation of PUMA, including how we achieve near-ideal parallel efficiency even across NUMA domains;
• In chapter 4, we examine the scalability and parallelisation of the LERM simulation, including discussing our necessarily parallel Particle Management implementation;
• In chapter 5, we present a detailed experimental evaluation of PUMA with respect to LERM, including both timing data and simulation correctness verification.

The PUMA library code is hosted at https://github.com/CatharticMonkey/puma.
Chapter 2

Background

The most common architectural model in modern computers is called the Von Neumann architecture; it is based on a model described by John Von Neumann in the First Draft of a Report on the EDVAC[11]. The model describes a system consisting of the following:

• A processing unit containing an Arithmetic/Logic Unit (ALU) and registers;
• A control unit consisting of an instruction register and a program counter;
• Memory in which data and instructions are stored;
• External storage;
• Input/output capabilities.

The main benefit of a stored-program design such as the Von Neumann architecture is the ability to dynamically change the process which the computer carries out - this is in contrast to early computers, which had hard-wired processes and could not be easily reprogrammed. The flexibility afforded by the stored-program approach is critical to the widespread use of computers as general-purpose machines.

2.1 Hardware Trends

The first computer processors were developed to execute a serial stream of instructions. This was initially sustainable in terms of keeping up with increased requirements for more complex computations, as single-chip performance was constantly being improved; Moore's
law is an observation stating that "[t]he complexity for minimum component costs has increased at a rate of roughly a factor of two per year... Certainly over the short term this rate can be expected to continue, if not to increase."[1] This trend is shown in Figure 2.1. In other words, approximately every two years, the number of components in an integrated circuit can be expected to increase twofold, leading to a proportional performance gain. This, along with the significant increase in clock speeds from several MHz to several GHz in the span of just a few decades, meant that serial execution was also subject to approximately linear gains.

In the early 21st century, however, these gains began to slow as a result of physical limitations. The next step for performance scaling was parallelism, and so the number of discrete processing units in chips produced by most major manufacturers increased. As a result of this move towards higher levels of hardware parallelism, there has been an increasingly strong focus in current computing on parallelising software to take advantage of it.

2.2 Memory Hierarchy

Due to the trend for increased speed of computation in a small space, memory access time has become a major concern for execution time. This is especially true as clock cycles have become shorter; a memory access of the same absolute duration now wastes more cycles, and therefore more instructions which could have been executed during that access are not. As a result, there are various architectural decisions made by processor manufacturers in order to attempt to minimise this penalty and thus speed up program execution.

The primary method of tackling the problem is to implement a caching system which exploits the fact that a significant portion of memory accesses exhibit spatial locality (i.e. they are close to previously accessed memory); when a memory access is performed, the cache (which is physically close to the processor and is often an on-chip bank of memory) is first checked to see if it contains the desired data. If it does, there is no need to look further and the access returns relatively quickly. If it does not, the data are requested from main memory along with a surrounding block of data of a predetermined size. These extra data are stored in the cache, enabling future cache hits.

The simple caching model is often extended with the use of multiple levels of cache. In this extension, lower levels of cache are physically closer to the memory data register, which is the location to which memory accesses are initially loaded on the processor. As a result,
lower levels of cache are quicker to access than higher ones. However, lower level caches have a smaller size limit because of their location. Consequently, it is desirable to have several levels of cache, each bigger than the last, in order to reduce cache-processor latency as much as possible without sacrificing size.

Figure 2.1: "Transistor counts for integrated circuits plotted against their dates of introduction. The curve shows Moore's law - the doubling of transistor counts every two years. The y-axis is logarithmic, so the line corresponds to exponential growth."[2]

Caching helps to mitigate memory access penalties significantly, but main memory access time is still important, especially in the case in which there will inevitably be many cache
misses because of the nature of the algorithm in question. If there are many cache misses, memory access time can quickly become a dominant factor of execution time.

2.3 Parallel Computing

The logical extension of per-processor parallelisation is spreading heavy computational tasks across a large number of processors. There are two main methods for achieving this:

• Utilising several separate computers networked together, sharing data and achieving synchronisation through message passing over the network;
• Having two or more sockets on a motherboard, which increases the number of cores available in one computer without requiring more expensive, advanced processors.

The first method's primary draw is its scalability; it is useful for constructing large systems relatively cheaply without specialised hardware. The main benefit of the second is that it avoids the overhead of the first's message-passing while still maintaining the ability to use consumer-grade hardware. It is possible to combine these approaches in order to gain many of the benefits of both. Instances of such combinations are:

• Edinburgh University's Archer supercomputer[12]: a 4920-node Cray XC30 MPP supercomputer with two 12-core Intel Ivy Bridge processors on each node, providing a total of 118,080 cores;
• The UK Met Office's supercomputer[4][13]: a Cray XC40, with a large number of Intel Xeon processors, providing a total of 480,000 cores;
• Southampton University's Iridis-Pi cluster[14]: a cluster of 64 Raspberry Pi Model B nodes, providing a low-power, low-cost 64-core computer for educational applications.

Each of these focuses on providing a massively parallel system, in order to carry out certain types of parallel computations; if they are used for a computation which must be done in serial or which simply does not take advantage of the topology, all that they gain over a single-core machine is lost.

Libraries which abstract away hardware details are often used to take advantage of this kind of architecture; this makes it relatively easy to create software which is scalable across several sockets (each containing a multi-core CPU), or even several networked computation nodes, while not requiring in-depth architectural knowledge.
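As a concrete illustration of combining the two levels of parallelism described above, the sketch below (which is not taken from this project's code) uses MPI between nodes and OpenMP within a node; it assumes an MPI implementation and an OpenMP-capable compiler are available (e.g. built with mpicc -fopenmp).

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char** argv)
{
	int provided;

	/* MPI distributes work between networked nodes... */
	MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

	int rank;
	MPI_Comm_rank(MPI_COMM_WORLD, &rank);

	/* ...while OpenMP parallelises across the cores local to each node. */
	#pragma omp parallel
	{
		printf("rank %d, thread %d\n", rank, omp_get_thread_num());
	}

	MPI_Finalize();
	return 0;
}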
2.4 Caching In Multi-Processor Systems

In NUMA systems, there are two main approaches to cache management: the far simpler and less expensive method (in terms of hardware-level synchronisation) involves not synchronising caches across processors; this has the major disadvantage that programming in the common Von Neumann paradigm becomes too complex to be feasible.

The other method involves maintaining cache coherency at the hardware level; this is called cache-coherent NUMA (ccNUMA). It requires significantly more complexity in the design of the system and can lead to a substantial synchronisation overhead if two processors are accessing data in the same cache line; writes from one processor require an update in the cache of the other, introducing latency.

ccNUMA is the more common of these two because it does not introduce extra complexity in creating correct programs for multi-processor systems and the synchronisation overhead can be avoided by not simultaneously accessing data within the same cache line on different cores.

2.5 Memory In Agent-Based Models

Static preallocation of space for agents is a potential solution to the problem of allocating memory for them in dynamic agent-based models; through first touch policies, it provides the ability to handle agent placement easily and intelligently in terms of physical NUMA domains. Memory is commonly partitioned into virtual pages, which, under first touch, are only assigned physical space when they are first accessed; the physical memory assigned is on the NUMA domain to which the processor touching the memory belongs[15]. This is, however, not always an option, as either it requires reallocation upon dataset growth or massive over-allocation and the imposition of an artificial upper bound on dataset size.

Dynamic allocation methods do not require reallocation or size boundaries. The C standard library's malloc() is the primary method of allocating memory dynamically from the heap, but it has several disadvantages when used for agent-based modelling:

• Execution time – malloc() can make no assumptions about allocation size. This means it has to handle holes left by free()d memory, so allocating data can require searching for suitably-sized free blocks of memory.
• Lack of cache effectiveness – Since allocations for all malloc() calls can come from the same pool, there is no guarantee that all agents will occupy contiguous memory, meaning that iteration may cause a significant number of cache misses.

In Figure 2.2, let each agent be equal in size to half of that loaded into core-local cache on each miss. If the first (red) agent is accessed by thread 1, the processor will also load the second (yellow) one into cache. The second agent constitutes wasted cache space because it will not be accessed by the processor. In the worst case, synchronisation needs to occur between the two processors running the red and yellow threads in order to ensure coherency.

• Lack of NUMA-awareness – Many malloc() implementations (for example the Linux default implementation and tcmalloc) do not necessarily allocate from a pool which is local to the core requesting the memory.

First-touch page creation means that if a malloc() call returns memory belonging to a thus-far untouched page and we initialise the memory on a CPU belonging to the NUMA domain from which we will then access it, we should not incur a cross-domain access. However, malloc() implementations which are not NUMA-aware may allocate from pages which may have already been faulted into memory on another domain. We have no way, therefore, to guarantee that accessing agents allocated using malloc() and similar calls will not incur the penalty of a cross-domain access.

malloc() implementations exist which are NUMA-aware. However, these still exhibit the other two problems because malloc() can make no assumptions about the context in which memory will be used and it must support variably-sized allocations. Consequently, even NUMA-aware implementations are not suitable for this class of applications as a result of the trade-off between generality and performance.

Figure 2.2: Example of agent allocation with Linux' built-in malloc() implementation. Agents belonging to each thread are represented by a different colour per thread. Black represents non-agent memory.
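The first-touch behaviour discussed above can be sketched briefly. The example below is not from PUMA; it simply assumes a Linux system with a first-touch page placement policy and an OpenMP-capable compiler, and initialises each slice of a freshly malloc()d array from the thread that will later use it, so that the backing pages are faulted in on that thread's local domain. (libnuma's numa_alloc_onnode() can be used instead when explicit placement is required.)

#include <stdlib.h>

double* allocateFirstTouch(long nElements)
{
	double* data = malloc((size_t)nElements * sizeof(double));
	if (data == NULL)
		return NULL;

	/* Under first touch, each untouched page is placed on the domain of the
	 * thread that writes to it first, so we initialise in parallel with the
	 * same static schedule that later kernels will use. */
	#pragma omp parallel for schedule(static)
	for (long i = 0; i < nElements; ++i)
		data[i] = 0.0;

	return data;
}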
2.6 Workload Balancing

When operating on data sets in parallel, one issue which needs to be addressed is how to ensure that each thread will finish its current workload at approximately the same time; if threads finish in a staggered fashion, this can lead to sub-optimal parallel performance as some threads that could be working are instead idle.

2.6.1 Work Stealing

In work stealing, balance across threads is achieved by the scheduler; computation is split into discrete tasks which it then assigns to different processors in the form of a task queue. If one processor completes its task queue, it "steals" one or more tasks from another's queue. This means that, as long as there exist tasks which have not been started and each task is of a similar length, no processor will be idle for more than the time it takes to complete one task. If the tasks are not necessarily of a similar length, balance can still be approximately achieved by estimating the length of each task and optimising the task queues based on these estimates.

2.6.2 Data-Based Balancing

Data-based balancing is a method of balancing which consists of assigning blocks of data to specific threads based on some partitioning strategy; these partitioning strategies can be based on data size, for example, or the results of profiling several recent runs of computational kernels. Balancing with this strategy is inherently simpler than with task balancing if we are running an identical kernel since it simply involves ensuring that each thread has approximately the same amount of data to operate on. We can expand upon this by using timing data from previous runs to estimate how long each thread will take to run, allowing us to achieve a closer to optimal balance without the overhead of balancing at runtime.

2.7 Existing Approaches

2.7.1 Manual Parallelisation

There are two primary types of parallelisation: one involves running code across several cores on the same motherboard; the other involves running it across several processors on different computers, using messaging on a local network. The former approach avoids the
overhead of message passing, whereas the latter is more scalable using consumer hardware. Often, they are mixed, using the MPI (Message Passing Interface) standard for the inter-computer messaging and OpenMP or the Operating System's threading interface for the local parallelism.

Both require manual management of the placement of data. If running on a single NUMA-enabled system, this involves predicting the cores which will be accessing data and allocating physical memory accordingly. Using multiple networked computers requires manual usage of MPI functions in order to transfer data among computers; it is not implicit, so the application must be designed with this in mind.

The primary disadvantages of manual parallelisation are both related to its complexity; it requires sufficient knowledge of system APIs and system architecture and it can require a significant amount of programming time. Often, this renders a custom solution infeasible.

2.7.2 OP2/PyOP2

OP2, and its Python analogue, PyOP2, provide "an open-source framework for the execution of unstructured grid applications on clusters of GPUs or multi-core CPUs."[16] They are focused on MPI-level distribution, with support for OpenCL, OpenMP or CUDA for local parallelism. These two levels can be combined in one application, enabling the developer to take advantage of the benefits of both.

The framework operates on static, independent data sets, allowing for data-specific optimisation at compile time. Its architecture involves code generation at compile-time using "source-source translation to generate the appropriate back-end code for the different target platforms."[16][17] The static nature of the data results in a low complexity requirement at runtime in terms of memory management. The independence constraint means that the order in which the data are iterated over for kernel application must have no significant impact beyond floating point errors on its result.

The OpenMP local parallelisation code does not encounter the NUMA problem because all of the data are statically allocated; as long as the data which will be accessed from CPUs on different domains reside in different virtual pages, the default first touch policy in most modern Operating Systems will ensure that memory accesses are primarily to the local NUMA domain.

2.7.3 Galois

Galois is a C++ and Java framework which operates on data sets consisting of dynamic, interdependent data. Consequently, both its memory management and its kernel run scheduling have a significant runtime overhead.
Due to this extra overhead, it is primarily useful for data which are dynamic and have sufficiently complex dependencies. Galois is explicitly NUMA aware and contains support for OpenMP and MPI. However, it only supports Linux.

2.7.4 Intel TBB

Intel provides a C++ parallelisation library called Thread Building Blocks. It contains algorithms and structures designed to simplify the creation of multithreaded programs. It implements task-based parallelism, with which it uses a task-based balancer, and provides a memory allocator which prevents false sharing[18].

TBB is NUMA-aware. Its lack of specificity and task-based balancing do, however, mean that it is not possible to ensure as much NUMA locality as in problem-specific, data balancing libraries such as PUMA. Consequently, it is not necessarily an ideal solution in some applications where runtime is one of the primary considerations.

2.7.5 Cilk

Cilk is a language based on C (with other dialects based on C++) which provides methods for a programmer to identify parallel sections while leaving the runtime to perform scheduling. The task scheduler is based on a work stealing strategy, where the tasks are defined by the programmer.

Cilk is not explicitly NUMA aware, and because tasks are scheduled by the runtime rather than the programmer, there is limited scope to make use of NUMA systems while minimising off-domain accesses.

2.8 LERM

Our primary case study for this work is a Lagrangian Ensemble Recruitment metamodel, as detailed in [19], which simulates phytoplankton populations. Our reference implementation is the result of prior work[20] that involved parallelising one such metamodel. It was observed that the reference implementation encountered the NUMA effect, leading to a significant reduction in parallel efficiency in the trivially parallel section when spread across domains; this is shown in Figures 1.2 and 1.3.

The simulation (LERM) consists of three primary parts: an agent update loop; particle management, for the creation and deletion of agents; and the environment update, which
simulates the spread and interaction of agent-caused environmental changes. Listing 2.1 shows the main algorithm implemented in Python-like pseudocode.

The three main sections roughly correspond to three different cases we may encounter: the update loop (Figure 1.2) is primarily trivially parallel with a reduction of per-thread data at the end; the particle management step (Figure 2.3) is partially parallelisable but is implemented in serial in the reference implementation; and the environment update (Figure 2.4) is mostly parallelisable but implemented in serial in both implementations due to having a negligible impact on runtime.

We can see that the update loop has an obvious dip in parallel efficiency after it begins to utilise cores on a different NUMA node, due to its not taking NUMA effects into account when assigning work to each thread. Because the particle management and environment sections are both implemented with serial algorithms, they do not demonstrate a reduction in parallel efficiency as a result of the NUMA effect. They do, however, begin to dominate as the update loop is distributed across cores, especially the particle management step.

The update step is, in theory, trivially parallelisable. The particle management and environmental update steps are implemented in our reference implementation as serial code, but the particle management step can be parallelised. Approximately 98% of the simulation is parallelised (calculated with formula 2.1) in the PUMA version; by Amdahl's Law, we can therefore achieve a theoretical maximum speedup of approximately 50×. Figure 1.4 shows that we achieve very close to this.

However, in our reference implementation, only 87.5% is parallelised, because the original particle management is in serial whereas, out of necessity, we have parallelised the particle management in PUMA. By Amdahl's Law, the maximum speedup achievable by the reference implementation is 8×. Figure 2.5 shows that we do not achieve close to our ideal runtime, because of NUMA latency.

In order to ensure that we do observe the results of NUMA effects if they have an impact, we initialise the LERM simulation with 400000 13 byte agents. Our dataset size (not taking into account metadata overhead and other considerations) is, therefore, approximately 20.8MB. Since the L3 cache in our test machine is 12MB, we ensure that every timestep requires that at least 40% of the agents have to be re-read from main memory.
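As a worked illustration (not part of the original derivation, but following from the standard statement of Amdahl's Law), the speedup bounds quoted above can be checked directly. If a proportion p of the work is parallelised over N cores,

S(N) = \frac{1}{(1 - p) + \frac{p}{N}}, \qquad S_{\max} = \lim_{N \to \infty} S(N) = \frac{1}{1 - p},

so p = 0.98 gives S_max = 1/0.02 = 50× and p = 0.875 gives S_max = 1/0.125 = 8×.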
 1 def splitAgents():
 2     while len(agents) < minAgents:
 3         splitIntoTwo(someAgent)
 4
 5 def mergeAgents():
 6     while len(agents) > maxAgents:
 7         mergeIntoOne(someAgent, someOtherAgent)
 8
 9 def updateAgents():
10     # Trivially parallel loop
11     for agent in agents:
12         ecologyKernel(agent)
13
14     reducePerThreadData()
15
16 def particleManagement():
17     splitAgents()
18     mergeAgents()
19
20 def mixChemistry():
21     for layer in layers:
22         totalConcentration += layer.concentration
23
24 def updateEnvironment():
25     reloadPhysicsFromInitialisationFile()
26     mixChemistry()
27
28 def main():
29     initialiseEnvironment()
30
31     while i < max_timestep:
32         updateAgents()
33         particleManagement()
34         updateEnvironment()

Listing 2.1: LERM pseudocode
Figure 2.3: Parallel efficiency reached by particle management with dynamic data using static over-allocation and not taking NUMA effects into account. The red vertical line signifies the point past which we use cores on another NUMA node.
Figure 2.4: Parallel efficiency reached by environment update with dynamic data using static over-allocation and not taking NUMA effects into account. The red vertical line signifies the point past which we utilise cores from a second NUMA node.
Figure 2.5: Total runtime with dynamic data using static storage in the reference implementation.

\frac{T_s + T_p}{T_t} (2.1)

How we calculate the proportion of an application which is parallelised. T_s is the time spent in the serial sections, T_p is the time spent in the parallel sections and T_t is the total execution time.
Chapter 3

Design and Implementation

PUMA consists of several parts:

• A NUMA-aware dynamic memory allocator for homogeneous elements;
• An allocator for thread-local static data which cannot be freed individually. This allocator uses pools which are located on the domain to which the core associated with the thread belongs;
• A parallel iteration interface which applies a kernel to all elements in a PUMA set;
• A balancer which changes the thread with which each block of data is associated in order to balance kernel runtime across cores.

Much of PUMA's design was needs-driven: it was developed in parallel with its integration into a case study (see section 2.8) and its design evolved as new requirements became clear.

The reason that PUMA provides a kernel application function rather than direct access to the underlying memory is to enable it to prevent cross-domain accesses. We achieve this by pinning each thread in our pool to a specific core and ensuring that when we run the kernel across our threads, each thread can access only domain-local elements.

PUMA works under the assumption that the application involves the manipulation of sets of homogeneous elements. In our case study, these elements are the agents within the model, each of which represents a group within the overall population of phytoplankton, and we use two PUMA sets, one for each of dead and alive agents. PUMA implements parallelism by maintaining a list of elements per thread, each of which can only be accessed by a single thread at a time.
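To make this structure concrete, the sketch below shows the general shape of a PUMA application. The type and function names used here (struct pumaSet, createPumaSet(), pumaAllocElement() and runKernel()) are illustrative placeholders and not necessarily PUMA's real API, which is documented in appendix B.

/* A hypothetical element type: one agent per element. */
struct agent
{
	float size;
	float carbon;
};

/* A kernel is applied to every element; it must be safe to run on all
 * elements in parallel and in any order. */
static void ecologyKernel(void* element, void* extraData)
{
	struct agent* a = element;
	a->carbon += 0.1f * a->size;   /* arbitrary per-element work */
	(void)extraData;
}

static void example(size_t numThreads)
{
	/* One set of homogeneous elements; the library decides which thread
	 * (and therefore which NUMA domain) each element lives on. */
	struct pumaSet* set = createPumaSet(sizeof(struct agent), numThreads, NULL);

	for (size_t i = 0; i < 100000; ++i)
	{
		struct agent* a = pumaAllocElement(set);
		a->size = 1.0f;
		a->carbon = 0.0f;
	}

	/* Apply the kernel to every element; each worker thread only touches
	 * blocks local to its own domain. */
	runKernel(set, ecologyKernel, NULL);
}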
Figure 3.1: How we lay out our internal lists of allocated memory. The black box represents the user-facing set structure, and each of blue and red represents a different thread's list of pre-allocated memory blocks. These blocks are the same size for each thread.

3.1 Dynamic Memory Allocator

Our initial design involved an unordered data structure which could act as a memory manager for homogeneous elements. It would provide methods to map kernels across all of its elements and ensure that each element would only be accessed from a processor belonging to the NUMA domain on which it was allocated.

In order to achieve this, we use one list per thread within the user-facing PUMA set structure (Figure 3.1). Each of these lists contains one or more blocks of memory at least one virtual memory page long. We have a 1:1 mapping of threads to cores, enabling a mostly lock-free design. This allows us to have correct multithreaded code while minimising time-consuming context switches.

3.1.1 Memory Pools

In order to allocate memory quickly on demand, our dynamic allocator pre-allocates blocks, each of which is one or more pages long. This has two purposes: only requesting large blocks from the Operating System allows us to reduce the time spent on system calls; and the smallest blocks on which system calls for the placement of data on NUMA domains can
operate are one page long and must be page aligned.

These blocks have descriptors at the start which contain information on the memory which has been allocated from them. The descriptors also contain pointers to the next and previous elements in their per-thread list to allow for iteration over all elements.

Currently, block size is determined by a preprocessor definition at compile time, because this size is integral to calculating the location of metadata from an element's address. It could also be determined at run-time if set before any PUMA initialisation is performed by the application.

3.1.2 Element Headers

In order to free elements without exposing too much internal state to the user, we must have some way of mapping elements' addresses to the blocks in which they reside. Originally, each element had a header containing a pointer to its block's descriptor. This introduces a significant memory overhead, however, especially if the size of the elements is small compared to that of a pointer. PUMA should use few resources in order to give users as much freedom as possible in its use. Consequently, we devised two separate strategies for mapping elements to blocks' descriptors.

The first (Figure 3.2) was based on the NUMA allocation system calls which we were already using to allocate blocks for the thread lists. These calls (specifically numa_alloc_onnode() and numa_alloc_local()) guarantee that allocated memory will be page-aligned.

Figure 3.2: Our first strategy for mapping elements to block descriptors. Darker blue represents the block's descriptor; light blue represents a page header with a pointer to the block's descriptor; red represents elements; the vertical lines represent page boundaries; and white represents unallocated space within the block.

If we ensure that each page within a block has a header, we can store a pointer in that header to the block's descriptor. Finding the block descriptor for a given element then simply involves rounding the element's address down to the next lowest multiple of the page size. This has two major disadvantages, however:
• In order to calculate the index of a given element or the address corresponding to an element's index, we must perform a relatively complex calculation (between twenty and fifty arithmetic operations), as shown in listing 3.1, rather than simple pointer arithmetic (up to five operations). These are common calculations within PUMA, so minimising their complexity is critical.
• If the usable page size after each header is not a multiple of our element size, we can have up to sizeof(element) - 1 bytes of wasted space. This is especially problematic with elements which are larger than our pages.

void* getElement(struct pumaNode* node, size_t i)
{
	size_t pageSize = (size_t)sysconf(_SC_PAGE_SIZE);
	char* arrayStart = node->elementArray;

	size_t firstSkipIndex = getIndexOfElementOnNode(node,
			(char*)node + pageSize + sizeof(struct pumaHeader));
	size_t elemsPerPage = getIndexOfElementOnNode(node,
			(char*)node + 2 * pageSize + sizeof(struct pumaHeader)) - firstSkipIndex;

	size_t pageNum = (i >= firstSkipIndex) * (1 + (i - firstSkipIndex) / elemsPerPage);

	size_t lostSpace =
			(pageNum > 0) * ((pageSize - sizeof(struct pumaNode)) % node->elementSize) +
			(pageNum > 1) * (pageNum - 1) *
					((pageSize - sizeof(struct pumaHeader)) % node->elementSize) +
			pageNum * sizeof(struct pumaHeader);

	void* element = (i * node->elementSize + lostSpace + arrayStart);

	return element;
}

size_t getIndexOfElement(void* element)
{
	struct pumaNode* node = getNodeForElement(element);
	return getIndexOfElementOnNode(node, element);
}

size_t getIndexOfElementOnNode(struct pumaNode* node, void* element)
{
	size_t pageSize = (size_t)sysconf(_SC_PAGE_SIZE);
	char* arrayStart = node->elementArray;

	size_t pageNum = ((size_t)element - (size_t)node) / pageSize;

	size_t lostSpace =
			(pageNum > 0) * ((pageSize - sizeof(struct pumaNode)) % node->elementSize) +
			(pageNum > 1) * (pageNum - 1) *
					((pageSize - sizeof(struct pumaHeader)) % node->elementSize) +
			pageNum * sizeof(struct pumaHeader);

	size_t index = (size_t)((char*)element - arrayStart - lostSpace) / node->elementSize;

	return index;
}

Listing 3.1: Calculating the mapping between elements and their indices with a per-page header.

Our second strategy (Figure 3.3) eliminated the need for these complex operations while reducing memory overhead. POSIX systems provide a function to request a chunk of memory aligned to a certain size, as long as that size is 2^n pages long for some integer n. If we ensure that block sizes also follow that restriction, we can allocate blockSize bytes aligned to blockSize. Listing 3.2 shows how we calculate the mapping between elements and their indices with this strategy.

size_t getIndexOfElement(void* element)
{
	struct pumaNode* node = getNodeForElement(element);
	return getIndexOfElementOnNode(element, node);
}

size_t getIndexOfElementOnNode(void* element, struct pumaNode* node)
{
	char* arrayStart = node->elementArray;
	size_t index = (size_t)((char*)element - arrayStart) / node->elementSize;
	return index;
}

void* getElement(struct pumaNode* node, size_t i)
{
	char* arrayStart = node->elementArray;
	void* element = (i * node->elementSize + arrayStart);
	return element;
}

struct pumaNode* getNodeForElement(void* element)
{
	struct pumaNode* node = (struct pumaNode*)
			((size_t)element & ~((pumaPageSize * PUMA_NODEPAGES) - 1));
	return node;
}

Listing 3.2: Calculating the mapping between elements and their indices without using headers.

Figure 3.3: Our second strategy for mapping elements to block descriptors. The blue block represents the block's descriptor; the red blocks represent elements; and the vertical lines represent page boundaries. In this example, blocks are two pages long.

3.2 Static Data

After parallelising all of the trivially parallel code in our primary case study, we found that we were still encountering a major bottleneck. Profiling revealed that this was mostly caused by an otherwise innocuous line in a pseudo-random number generator. It was using a static variable as the initial seed and then updating the seed each time it was called, as shown in listing 3.3.

float rnd(float a)
{
	static int i = 79654659;
	float n;
	i = (i * 125) % 2796203;
	n = (i % (int)a) + 1.0;
	return n;
}

Listing 3.3: Pseudo-random number generator where i is a static seed.
As we increased our number of threads, writing to the seed required threads to wait for cache synchronisation between cores, and using cores belonging to multiple NUMA domains incurred lengthy cross-domain accesses.

The cache coherency problem could be solved to an extent using thread-local storage such as that provided by #pragma omp threadprivate(...). However, since there are no guarantees about the placement of thread-local static storage in relation to other threads' variables, multiple thread-local seeds can still be located within the same cache line, leading to synchronisation. This also means that we cannot optimise for NUMA without a more problem-specific static memory management scheme.

We implemented a simple memory allocator which can allocate blocks of variable sizes but not free() them individually. The lack of support for free()ing allows us to avoid having to search for empty space within our available heap space while still allowing for variable-sized allocations. This then places the responsibility for retaining reusable blocks on the application developer. This allocator is primarily for static data which is accessed regularly when running a kernel, such as return values or seeds.

This allocator returns blocks of data which are located on the NUMA domain local to the CPU which calls the allocation function. The main differences between it and PUMA's primary memory allocator are:

• The user is expected to keep track of allocated memory;
• The allocator enables variable sizes;
• Allocated blocks cannot be individually free()d.

3.3 Kernel Application

PUMA does not provide any way of retrieving individual elements from its set of allocated elements. Instead, it exposes an interface for applying kernels to all elements. This interface also enables the specification of functions used to manipulate extra data which is to be passed into the kernel. With this, we can manipulate the data in the set as long as our manipulation can be done in parallel and is not order-dependent.

The extra data which is passed into the kernel is thread-local in order to avoid cache coherency overhead and expensive thread synchronisation. Consequently, we also allow the user to specify a reduction function which is executed after all threads have finished running the kernel and has access to all threads' extra data.
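The sketch below illustrates this kernel-plus-reduction pattern, continuing the hypothetical names from the earlier sketch in this chapter; the callback signatures shown are assumptions for illustration rather than PUMA's actual interface (see appendix B).

/* Kernel: accumulate into this thread's private extra data. No locking is
 * needed because each thread receives its own accumulator. */
static void sumKernel(void* element, void* extraData)
{
	const struct agent* a = element;
	double* localSum = extraData;
	*localSum += a->carbon;
}

/* Reduction: runs once after every thread has finished its kernel runs and
 * combines the per-thread totals into a single result. */
static void sumReduce(void** perThreadData, size_t numThreads, void* result)
{
	double total = 0.0;
	for (size_t t = 0; t < numThreads; ++t)
		total += *(double*)perThreadData[t];
	*(double*)result = total;
}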
3.3.1 Load Balancer

When a kernel is run with PUMA, it first balances all of the per-thread lists at a block level based on timing data from previous runs. If one thread has recently finished running kernels significantly faster than other threads on average, we transfer blocks from slower threads to it in order to increase its workload.

3.4 Challenges

3.4.1 Profiling

One of the most challenging aspects of developing PUMA was identifying the location and type of bottlenecks. Most profiling tools we encountered, such as Intel's VTune Amplifier[21] and GNU gprof[22], are time- or cycle-based. VTune also provides metrics to do with how OpenMP is utilised. However, finding hotspots of cross-domain activity was still a matter of making educated guesses based on abnormal timing results from profilers.

VTune and a profiler called Likwid[23] also provide access to hardware counters, which can be useful for profiling cross-domain accesses. However, without superuser access, it can be difficult to obtain hardware counter-based results from these tools which can be used for profiling; only the counters' total values from the entire run are shown, meaning that identifying hotspots is still a matter of guesswork.

In section 6.1 we discuss possible approaches to implementing userspace memory access profiling tools in order to reduce the amount of guesswork required.

3.4.2 Invalid Memory Accesses

Because PUMA includes a memory allocator, we encountered several bugs regarding accessing invalid memory and corrupting header data. In order to prevent these bugs, we use Valgrind's[24] error detection interface to make our allocator compatible with Valgrind's memcheck utility. This enables Valgrind to alert the user if they are reading from uninitialised memory or writing to un-allocated or free()d memory.

This is not fully implemented, however; ideally, we would have Valgrind protect all memory containing metadata. However, it is possible for multiple threads to read each other's metadata at once (without writing to it). Reading another thread's metadata requires marking it as valid before reading and marking it as invalid after. Due to the non-deterministic nature of thread scheduling, this could sometimes lead to
interleaving of validating and invalidating memory in such a way that between a thread validating memory and reading it, another thread may have read the memory and then invalidated it. We decided that since overwriting this per-thread metadata was unlikely compared to other memory access bugs, it was sensible to avoid protecting these blocks of memory entirely in order to avoid false positives in Valgrind's output.

3.4.3 Local vs. Remote Testing

NUMA-based architectures are not particularly prevalent in current consumer computers. Consequently, the majority of our testing of the NUMA-based sections of PUMA had to be performed while logged into a remote server.

It is, however, possible to perform some of this NUMA-based testing on a non-NUMA machine. While it is not particularly useful for gathering timing data, the qemu virtual machine has a configuration option enabling NUMA simulation, even on non-NUMA machines. This can be useful for testing the robustness and correctness of NUMA-aware applications[25].

3.5 Testing and Debugging

We used various methods to test and debug PUMA. For testing, we wrote a short test suite covering several functions that had caused hard-to-debug errors early in development. We also used LERM as a more comprehensive testing platform, comparing the biomass in the PUMA version with that in the reference implementation as a metric of functional correctness.

In terms of debugging, several methods were used. We used gdb and our Valgrind compatibility with both LERM and our unit tests in order to identify bugs within PUMA itself. We used system timers to assess whether each section's parallel efficiency met our expectations. We also used both VTune and Likwid to collect more granular timing data, allowing us to identify bottlenecks within both PUMA and LERM. PUMA bottlenecks acted as indicators for what to optimise and LERM bottlenecks helped with the identification of useful features for PUMA.
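The kind of cooperation with memcheck described in section 3.4.2 is expressed through Valgrind's client-request macros. The two wrapper functions below are an illustrative sketch rather than PUMA's actual code, but the macros themselves are the real ones provided by <valgrind/memcheck.h>.

#include <stddef.h>
#include <valgrind/memcheck.h>

/* Mark a freshly allocated element as addressable but uninitialised, so that
 * memcheck reports reads performed before the application writes to it. */
static void markElementAllocated(void* element, size_t size)
{
	VALGRIND_MAKE_MEM_UNDEFINED(element, size);
}

/* Mark a freed element as inaccessible, so that memcheck reports any later
 * reads or writes to it. */
static void markElementFreed(void* element, size_t size)
{
	VALGRIND_MAKE_MEM_NOACCESS(element, size);
}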
3.6 Compilation

Compilation of PUMA requires a simple make invocation in the PUMA root directory. The make targets are as follows:

• all: Build PUMA and docs and run unit tests
• doc: Build documentation with doxygen
• no_test: Build PUMA without running unit tests
• clean: Clear the working tree
• docs_clean: Clear all built documentation

3.6.1 Dependencies

PUMA relies on the following:

• libNUMA
• C99 compatible compiler
• Valgrind (optional)
• OpenMP (optional)
• Doxygen (optional, documentation)

3.6.2 Configuration

The following are configuration options for public use. For options which are either enabled or disabled, 1 enables and 0 disables.

• PUMA_NODEPAGES: Specifies the number of pages to allocate per chunk in the per-thread chunk list. Default 2
• OPENMP: Enable OpenMP. If disabled, we use PUMA's pthread-based thread pooling solution (experimental). Default enabled
• STATIC_THREADPOOL: If enabled and we are not using OpenMP, we share one thread pool amongst all instances of PUMASet. Default disabled
• BINDIR: Where we place the built shared library. Default {pumadir}/bin
• VALGRIND: Whether we build with valgrind support. Default enabled

The following is a configuration option for use during PUMA development. It may severely hurt performance so should never be used in performance-critical code.

• DEBUG: Enable assertions. Default disabled

3.7 Getting Started

We present a short walkthrough on how to write a simple PUMA-based application in appendix C. It consists of the generation of a random data set, of which we find the standard deviation by calculating the sums of all of the elements and of their squares. We also include an API reference in appendix B.
Chapter 4

LERM Parallelisation

The basic LERM model (section 2.8) is concerned primarily with the simulation of agents in a column of water 500m deep. The column is split into layers, with each layer corresponding to one metre of the column.

When parallelising LERM, the naïve approach involves domain decomposition; we split layers equally between processors and each processor operates only on agents within its assigned layers. This has the problem, however, of encouraging inter-thread communication; agents may move between layers, requiring the processor which moves a given agent to notify the newly responsible thread. Given that any or all agents can move between layers during an update, this potentially requires communication for every agent, leading to a large amount of time wasted by processors which are waiting for access to synchronisation constructs. This is not scalable beyond a certain number of processors (in this case, 500) without subdividing layers. Also, the distribution of agents between layers is likely not to be fully uniform, meaning that the workload will be unbalanced between processors.

Since the size of the problem is dictated by the number of agents rather than the number of layers, and the number of agents is variable, a more scalable solution involves distributing agents between processors. Since agents do not have to move between the domains managed by different processors, there is no longer an inter-processor communication overhead.

4.1 Scalability

The isoefficiency function (equation 4.1) is a way of relating parallel efficiency to problem size as the number of processors in use scales. One of its benefits is that it provides a way of exploring how problem size must scale with the number of processors in order to
• 42. Chapter 4. LERM Parallelisation

E = 1 / (1 + To / (W × tc)) (4.1)

The isoefficiency function. W is the problem size, To is the serial overhead, tc is the cost of execution for each operation and E is the parallel efficiency[26].

W = Ω(C × To) (4.2)

Workload growth for maintaining a fixed efficiency. W is the problem size, To is the serial overhead and C is a constant representing fixed efficiency[27].

maintain the same parallel efficiency. Equation 4.2 shows a mapping between serial overhead and workload. If the equation holds - i.e. the workload can be increased at least as quickly as the serial overhead as we increase the number of processors in use - we say that an algorithm has perfect scalability. In other words, we can maintain a constant efficiency as we increase processors. The serial sections in the PUMA-based LERM implementation are all either O(n) (environmental update) or O(p), where p is the number of processors in use. This means that To and W are not directly related, so satisfying equation 4.1 requires scaling W proportionally to To. Since this is trivially sustainable as we increase the number of processors, the PUMA-based LERM implementation can, in theory, maintain a constant efficiency. 4.2 Applying PUMA to LERM Listing 4.1 shows Python-like pseudocode for the PUMA-based version of LERM. In order to adapt LERM to use PUMA, we must first identify all sections which operate on agents and adapt them to use the PUMA-based abstractions for running kernels, rather than iterating over all agents and applying the kernel manually. Lines 10, 13 and 17 show instances of this change when compared with lines 2, 6 and 11 respectively from listing 2.1. These areas are primarily in the update and particle management steps. We also identify any reductions performed after iterating over the agents and use PUMA’s reduction mechanism to perform these automatically. Line 17 shows where we tell PUMA to perform the reduction after the update loop. 33
• 43. Chapter 4. LERM Parallelisation

1  def splitKernel(agent):
2      if len(agents) < minAgents:
3          splitIntoTwo(agent)
4
5  def mergeKernel(agent):
6      if len(agents) > maxAgents:
7          mergeIntoOne(agent, smallestAgent)
8
9  def mergeAgents():
10     runKernel(mergeKernel)
11
12 def splitAgents():
13     runKernel(splitKernel)
14
15 def updateAgents():
16     # Trivially parallel loop
17     runKernel(ecologyKernel, reduction=reducePerThreadData)
18
19
20 # The rest is the same as in the original implementation
21 def particleManagement():
22     splitAgents()
23     mergeAgents()
24
25 def mixChemistry():
26     for layer in layers:
27         totalConcentration += layer.concentration
28
29 def updateEnvironment():
30     reloadPhysicsFromInitialisationFile()
31     mixChemistry()
32
33 def main():
34     initialiseEnvironment()
35
36     while i < max_timestep:
37         updateAgents()
38         particleManagement()
39         updateEnvironment()

Listing 4.1: PUMA-based LERM pseudocode 34
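To make the pseudocode above concrete, the following is a minimal C sketch of how the trivially parallel update might be expressed against the PUMA API documented in appendix B. The agent structure, the per-thread environment type and the ecology update itself are illustrative placeholders rather than the actual LERM code.

#include "puma.h"

/* Illustrative agent and per-thread environment types (not the real LERM ones). */
struct agent
{
	double biomass;
	int layer;
};

struct threadEnv
{
	double chemistryDelta;
};

/* Kernel applied to every agent in the set; extraData is the per-thread
 * environment created by the constructor registered via initKernelData(). */
static void ecologyKernel(void* element, void* extraData)
{
	struct agent* a = (struct agent*)element;
	struct threadEnv* env = (struct threadEnv*)extraData;

	a->biomass *= 1.01;           /* placeholder ecology update */
	env->chemistryDelta += 0.001; /* accumulate a per-thread environmental change */
}

/* The update step: PUMA iterates over the agents in parallel, keeping each
 * thread on its local NUMA domain; kData describes the per-thread environment
 * and the reduction which folds it back into the global state. */
static void updateAgents(struct pumaSet* agents, struct pumaExtraKernelData* kData)
{
	runKernel(agents, ecologyKernel, kData);
}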
• 44. Chapter 5 Evaluation We examine the success of PUMA by two primary metrics: first, we compare our case study as implemented with PUMA to the reference implementation, specifically in relation to the biomass of plankton; second, we compare measured profiling data to our expectations and to the reference. 5.1 Correctness Figure 5.1 shows the biomass over time in the PUMA implementation. Even across several runs with random initial seeds, it does not significantly deviate from the average. We compare the average biomass in the PUMA implementation with the same metric in the reference implementation in Figure 5.2, which shows that they follow a similar pattern and that the difference between the two is at most 7.2% of the reference’s biomass. The differences are due to two factors:
• PUMA manually manages the workload for each thread. Since agents interact with per-thread environment variables and the order of iteration over the agents is undefined, the exact result of the simulation is non-deterministic.
• PUMA enforces a parallel programming model when interacting with the agents it manages, because all iteration over agents must be expressed in the form of a parallelisable kernel. Because of this, we had to reimplement the particle management step in this form, which led to different behaviour on a microscopic scale while macroscopically maintaining correctness.
A benefit of having reimplemented particle management is that it prevents serial particle management from dominating performance results, allowing us to focus on the NUMA 35
• 45. Chapter 5. Evaluation [Figure 5.1: Biomass of plankton in the PUMA-based LERM simulation over time (x-axis: timestep, each corresponding to 30 minutes; y-axis: plankton biomass). Black lines are the maximum and minimum across all runs, red is the average.] problem. The new method has, however, not been rigorously statistically analysed because that is beyond the scope of this project (see section 6.1). 5.2 Profiling Our profiling data consist of two parts, both of which we compare with the reference LERM implementation. The first is the proportion of total memory accesses which are remote for each number of cores. We expect this to be higher when using cores belonging to a second domain, because reduction requires accesses to data assigned to all cores. It should, however, be significantly lower than in the reference implementation. The second is the parallel efficiency (formula A.1) across cores. In the trivially parallel section, we expect it to remain at approximately 100% even as we use cores belonging to a second domain. Figure 1.3 shows a comparison between off-domain accesses in the PUMA-based LERM implementation 36
• 46. Chapter 5. Evaluation [Figure 5.2: Comparison between the average biomasses across runs in the reference and PUMA implementations of the LERM simulation over time (x-axis: timestep, each corresponding to 30 minutes; left y-axis: plankton biomass; right y-axis: difference as a percentage of the reference; series: PUMA, Reference, Difference).] and the reference implementation. In PUMA, we reduce off-domain accesses by 75% relative to the reference implementation. Our timing data are presented in Figures 5.5, 5.6 and 5.7. The most important section here is the mostly trivially parallel update; unlike in the reference implementation, we see no noticeable NUMA-based reduction in parallel efficiency. Both implementations exhibit similar parallel efficiency for the environmental update step (Figure 5.7), because in both cases it is implemented as a serial algorithm. Figures 5.3 and 5.4 each show the total time taken for the simulation when run on up to twelve cores on a log-log plot, along with the theoretical minimum time according to Amdahl’s Law, calculated with formula A.2. It is important to note that, while we have significantly reduced the time taken to run LERM on a single core when we compare the PUMA-based implementation with the reference implementation, this is not a direct result of PUMA. Instead, it is a result of slightly different scheduling causing changes in which paths are taken through the primary update kernel. The total timing graphs are best considered in isolation from each other, to see how they conform to the Amdahl’s 37
• 47. Chapter 5. Evaluation [Figure 5.3: Total runtime with dynamic data using PUMA (x-axis: cores in use; y-axis: total time in ms, log-log; series: Ideal, Parallel Update, Update, Balancing, Particle Management, Environment Update).] Law-dictated minimum timing in each case. 5.2.1 Load Balancing In order to assess the usefulness of our load balancer, we tested both with and without load balancing turned on. Figure 5.8 shows that load balancing has a small but noticeable effect on runtime in the trivially parallel step of LERM. 5.3 Known Issues While our implementation provides an effective solution to the NUMA problem, there are still areas in which it can be improved. 38
• 48. Chapter 5. Evaluation [Figure 5.4: Total runtime with dynamic data using static over-allocation and not taking NUMA effects into account (x-axis: cores in use; y-axis: total time in ms, log-log; series: Ideal, Update, Particle Management, Environment Update).] 39
• 49. Chapter 5. Evaluation [Figure 5.5: Parallel efficiency reached by agent updates in the PUMA implementation of LERM vs in the reference implementation (x-axis: cores in use; y-axis: parallel efficiency of the update; series: Base Update and PUMA Update, each with and without reduction). The red vertical line signifies the point past which we utilise cores on a second NUMA node. In this instance, reduction is the necessary consolidation of per-thread environmental changes from the update loop into the global state. The margin of error is calculated by finding the percentage standard deviation in the original times and applying it to the parallel efficiency.] 5.3.1 Thread Pooling PUMA relies on the persistence of the thread pinning performed at initialisation. If new threads are created each time we execute code in parallel, the pinning is no longer persistent. Consequently, threads may be moved to other cores depending on the operating system’s scheduler, leading to bugs which are difficult to reproduce. 40
• 50. Chapter 5. Evaluation [Figure 5.6: Parallel efficiency reached by particle management in the PUMA implementation of LERM vs in the reference implementation (x-axis: cores in use; y-axis: parallel efficiency of particle management; series: Base PM, PUMA PM).] The user may specify a CPU affinity string for both the GNU and Intel OpenMP implementations as an environment variable. This has the disadvantage, however, of requiring extra parameters at program invocation and of removing control from the programmer. We provide a custom thread pool implementation because the OpenMP standard does not specify whether threads are reused between parallel sections. Both the Intel and GNU implementations of OpenMP currently reuse threads which have been previously spawned[28][29], but the standard allows for new threads to be spawned each time a parallel section is entered. However, our thread pool implementation does not scale as well as OpenMP (Figure 5.9) and relies on pthreads, meaning that it is not natively supported on Windows. Consequently, we would like to optimise the custom thread pool, or to find an existing threading library with which we do not have to rely on non-guaranteed behaviour. 41
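For reference, the affinity environment variables mentioned above look roughly as follows for the two runtimes; the core list and the program name (./lerm) are illustrative, and the exact syntax should be checked against each runtime’s documentation.

# GNU OpenMP (libgomp): pin the OpenMP threads to cores 0-11 in order
GOMP_CPU_AFFINITY="0-11" ./lerm

# Intel OpenMP: explicit pinning to the same cores
KMP_AFFINITY="granularity=fine,proclist=[0-11],explicit" ./lerm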
• 51. Chapter 5. Evaluation [Figure 5.7: Parallel efficiency reached by environment update in the PUMA implementation of LERM vs in the reference implementation (x-axis: cores in use; y-axis: parallel efficiency of the environment update; series: Base Environment, PUMA Environment).] 5.3.2 Parallel Balancing In order to ensure that no thread has a significantly longer runtime than any other, we implement workload balancing with heuristics based on previous kernel runtimes. While this has proven effective, it is non-optimal in that the balancer itself, although parallelisable, has not been parallelised. Since our balancing algorithm transfers ownership of memory among cores on the same NUMA domain before performing inter-domain copies, it could be parallelised through domain decomposition; each NUMA domain would be internally balanced by a separate thread, with a serial cross-domain reduction at the end. Currently, by Amdahl’s Law[10], the balancer limits the parallel speedup we can achieve, and the balancing time increases as we use more cores. 42
• 52. Chapter 5. Evaluation [Figure 5.8: Parallel efficiency reached by the update step in the PUMA implementation of LERM with and without load balancing (x-axis: cores in use; y-axis: parallel efficiency; series: PUMA without load balancing, PUMA with load balancing).] 43
• 53. Chapter 5. Evaluation [Figure 5.9: Parallel efficiency reached by the trivially parallel section of LERM when using OpenMP (blue) and PUMA’s own thread pool (red) (x-axis: cores in use; y-axis: parallel efficiency).] 44
• 54. Chapter 6 Conclusions We have presented a framework which allows users with little or no knowledge of the underlying topology and memory hierarchy of NUMA-based systems to develop software which takes advantage of the available hardware while automatically preventing cache coherency overhead and cross-domain accesses within parallel kernels. Its uniqueness lies primarily in the class of problems which it tackles; as discussed in section 2.7, solutions exist which are tailored to help with several classes of problems. We have explored solutions for operating on sets of static, independent data (section 2.7.2) and graphs of dynamic data with complex dependency hierarchies (section 2.7.3), none of which are suitable for dynamic, independent data sets such as those used in branches of agent-based modelling. The availability of a solution tailored to this sort of problem could help with the rapid development of scientific applications, leading to easier research and simulation. 6.1 Future Work PUMA is far from complete; in particular, we would like to address the issues raised in section 5.3. We have designed PUMA to abstract away OS-specific interfaces internally, for simplicity. While the systems on which PUMA has been tested are POSIX-based, Windows also provides NUMA libraries. In the future, it would be useful to port PUMA to Windows so that applications using PUMA are not bound to POSIX systems. A major feature which would make PUMA more able to take advantage of modern distributed systems is MPI support. This would mainly require changes to the balancing and reduction sections of kernel application, and would enable further parallelisation. 45
• 55. Chapter 6. Conclusions In order to help PUMA adoption in the scientific community, we would like to create bindings for languages such as Python and Fortran, both of which are prevalent in scientific computing. Tools exist for both languages to interface with C functions, so this should require very little work in exchange for broader applicability of PUMA. As mentioned in chapter 5, our parallelised version of LERM’s particle management step has not been rigorously statistically analysed. It would be useful to analyse the changes in order to assess whether the current PUMA implementation can be adapted to larger simulations. If not, adaptation would require different, possibly more complex parallelisation methods. In section 3.4.1, we discuss the potential usefulness of NUMA memory access profiling. During PUMA’s development, we briefly explored various methods for the creation of a profiler which would not require superuser privileges and would perform line-by-line profiling of memory accesses, specifically identifying spots where many cross-domain accesses were performed and where cache synchronisation dominated timing. Unfortunately, it was too far outside the scope of PUMA to realistically explore in depth. We examined two possible strategies for the implementation of such a profiler:
• Using some debugging library (such as LLDB’s C++ API[30]) to trap every memory access and determine the physical location of the accessed address in order to count off-domain accesses;
• Building on Valgrind, which translates machine code into its own RISC-like language before executing the translated code, to count off-domain accesses. Valgrind could also be used to examine cache coherency latency by adapting Cachegrind, a tool which profiles cache utilisation. 46
• 56. Bibliography
[1] Gordon E. Moore. Cramming more components onto integrated circuits. Electronics Magazine, 1965.
[2] Wikimedia Commons. Transistor count and Moore’s law, 2011.
[3] X. Guo, G. Gorman, M. Lange, L. Mitchell, and M. Weiland. Exploring the thread-level parallelisms for the next generation geophysical fluid modelling framework Fluidity-ICOM. Procedia Engineering, 61:251–257, 2013.
[4] http://www.metoffice.gov.uk/news/in-depth/supercomputers.
[5] C. Vollaire, L. Nicolas, and A. Nicolas. Parallel computing for the finite element method. Eur. Phys. J. AP, 1(3):305–314, 1998.
[6] http://cs.nyu.edu/~lerner/spring10/projects/NUMA.pdf.
[7] Sunny Y. Auyang. Foundations of Complex-system Theories: In Economics, Evolutionary Biology, and Statistical Physics. Cambridge University Press, 1999.
[8] U. Berger and H. Hildenbrandt. A new approach to spatially explicit modelling of forest dynamics: spacing, ageing and neighbourhood competition of mangrove trees. Ecological Modelling, 132:287–302, 2000.
[9] U. Saint-Paul and H. Schneider. Mangrove Dynamics and Management in North Brazil. Springer Science & Business Media, 2010.
[10] Gene M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. Proceedings of the April 18-20, 1967, Spring Joint Computer Conference (AFIPS ’67 Spring), 1967.
[11] John von Neumann. First Draft of a Report on the EDVAC, https://web.archive.org/web/20130314123032/http://qss.stanford.edu/~godfrey/vonNeumann/vnedvac.pdf.
47
• 57. Bibliography
[12] http://www.archer.ac.uk/about-archer/.
[13] http://www.cray.com/sites/default/files/resources/cray_xc40_specifications.pdf.
[14] Simon J. Cox, James T. Cox, Richard P. Boardman, Steven J. Johnston, Mark Scott, and Neil S. O’Brien. Iridis-pi: a low-cost, compact demonstration cluster. Cluster Computing, 2013.
[15] http://techpubs.sgi.com/library/dynaweb_docs/0640/SGI_Developer/books/OrOn2_PfTune/sgi_html/ch08.html.
[16] http://www.oerc.ox.ac.uk/projects/op2.
[17] http://www.oerc.ox.ac.uk/sites/default/files/uploads/ProjectFiles/OP2/OP2_Users_Guide.pdf.
[18] https://software.intel.com/en-us/intel-tbb/details.
[19] J.D. Woods. The Lagrangian ensemble metamodel for simulating plankton ecosystems. Progress in Oceanography, 67(1-2):84–159, 2005.
[20] Robert Kruszewski. Accelerating agent-based Python models. Master’s thesis, Imperial College London.
[21] https://software.intel.com/en-us/intel-vtune-amplifier-xe.
[22] https://sourceware.org/binutils/docs/gprof/.
[23] https://code.google.com/p/likwid/.
[24] http://valgrind.org/.
[25] http://linux.die.net/man/1/qemu-kvm.
[26] Ananth Grama, Anshul Gupta, and Vipin Kumar. Isoefficiency function: A scalability metric for parallel algorithms and architectures, 1993.
[27] Peter Hanuliak and Michal Hanuliak. Analytical modelling in parallel and distributed computing, pages 101–102. Chartridge Books Oxford, 2014.
[28] https://software.intel.com/en-us/forums/topic/382683.
[29] https://software.intel.com/en-us/forums/topic/382683.
[30] http://lldb.llvm.org/cpp_reference/html/index.html.
48
• 59. Appendix A Methods For Gathering Data All performance data are gathered using the Imperial College High Performance Computing service (unless explicitly stated otherwise). Code was run on the Cx1 general-purpose cluster using a node with the following hardware:
• Two six-core Intel® Xeon® X5650 processors, 2.66GHz, 12MB last-level cache[31]
– 2.66GHz
– 32KB L1 instruction cache per core
– 32KB L1 data cache per core
– 256KB L2 cache
– 12MB L3 cache
• Two NUMA domains, one for each processor
• Limited to 1GB memory by the qsub queuing system
likwid-perfctr was used to gather information from hardware counters; these were primarily related to counting cross-domain accesses using the UNC_QHL_REQUESTS_REMOTE_READS counter and local accesses with the UNC_QHL_REQUESTS_LOCAL_READS counter. All data were collected by averaging results over ten runs. Formulae:
• Parallel efficiency: 100 × T1 / (Tn × n) (A.1) where Tn is the time taken to run on n cores and T1 is the time taken to run on one core. 50
• 60. Appendix A. Methods For Gathering Data
• Amdahl’s Law theoretical minimum runtime: T1 × Ps + (T1 × Pp) / n (A.2) where T1 is the time taken to run on one core, Ps is the proportion of the program which is serial, Pp is the proportion which is parallelisable and n is the number of cores. 51
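As a worked illustration of these formulae (the numbers here are purely illustrative, not measured): if Ps = 0.05 and Pp = 0.95, and a run takes T1 = 1,000,000 ms on one core, then formula A.2 gives a theoretical minimum runtime on n = 12 cores of

1,000,000 × 0.05 + (1,000,000 × 0.95) / 12 ≈ 129,167 ms,

i.e. a maximum speedup of roughly 7.7×. Feeding these times into formula A.1 gives a parallel efficiency of 100 × 1,000,000 / (129,167 × 12) ≈ 65%.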
• 61. Appendix B API Reference B.1 PUMA Set Management

struct pumaSet* createPumaSet(size_t elementSize, size_t numThreads, char* threadAffinity);

Creates a new struct pumaSet. Arguments: elementSize Size of each element in the set. numThreads The number of threads we want to run pumaSet on. threadAffinity An affinity string specifying the CPUs to which to bind threads. Can contain numbers separated either by commas or dashes. “i-j” means bind to every cpu from i to j inclusive. “i,j” means bind to i and j. Formats can be mixed: for example, “0-3, 6, 10, 12, 15, 13” is valid. If NULL, binds each thread to the CPU whose number matches the thread (tid 0 == cpu 0 :: tid 1 == cpu 1 :: etc.). If non-NULL, must specify at least as many CPUs as there are threads.

void destroyPumaSet(struct pumaSet* set);

Destroys and frees memory from the struct pumaSet. 52
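A minimal usage sketch of these two calls, mainly to illustrate the mixed affinity-string format described above; the element type, thread count and core list are illustrative.

#include "puma.h"

/* Create a set of doubles run by four threads pinned to cores 0, 2, 4 and 5,
 * do some (elided) work, then destroy the set. */
static void setLifetimeSketch(void)
{
	struct pumaSet* set = createPumaSet(sizeof(double), 4, "0, 2, 4-5");

	/* ... allocate elements with pumalloc() and run kernels here ... */

	destroyPumaSet(set);
}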
• 62. Appendix B. API Reference

size_t getNumElements(struct pumaSet* set);

Returns the total number of elements in the struct pumaSet.

typedef size_t (splitterFunc)(void* perElemBalData, size_t numThreads, void* extraData);

Signature for a function which, given an element, the total number of threads and, optionally, a void pointer, will specify the thread with which to associate the element. Arguments: perElemBalData Per-element data passed into pumallocManualBalancing() which enables the splitter to choose the placement of the associated element. numThreads The total number of threads in use. extraData Optional extra data, set by calling pumaSetBalancer().

void pumaSetBalancer(struct pumaSet* set, bool autoBalance, splitterFunc* splitter, void* splitterExtraData);

Sets the balancing strategy for a struct pumaSet. Arguments: set Set to set the balancing strategy for. autoBalance Whether to automatically balance the set across threads prior to each kernel run. splitter A pointer to a function which determines the thread with which to associate new data when pumallocManualBalancing() is called. splitterExtraData A void pointer to be passed to the splitter function each time it is called. 53
• 63. Appendix B. API Reference B.2 Memory Allocation

void* pumalloc(struct pumaSet* set);

Adds an element to the struct pumaSet and returns a pointer to it. The new element is associated with the CPU on which the current thread is running.

void* pumallocManualBalancing(struct pumaSet* set, void* balData);

Adds an element to the struct pumaSet and returns a pointer to it. Passes balData to the set’s splitter function to determine the CPU with which to associate the new element.

void* pumallocAutoBalancing(struct pumaSet* set);

Adds an element to the struct pumaSet and returns a pointer to it. Automatically associates the new element with the CPU with the fewest elements.

void pufree(void* element);

Frees the specified element from its set. 54
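To illustrate how these allocation calls combine with the balancing interface from B.1, here is a minimal sketch; the layer-based placement policy, the variable names and the element type are illustrative rather than part of PUMA.

#include <stddef.h>
#include "puma.h"

/* A splitter which places each new element on a thread chosen from a
 * per-element integer (for example an agent's layer), wrapping around the
 * available threads. */
static size_t layerSplitter(void* perElemBalData, size_t numThreads, void* extraData)
{
	(void)extraData;                         /* unused in this sketch */
	size_t layer = *(size_t*)perElemBalData; /* the data handed to pumallocManualBalancing() */
	return layer % numThreads;               /* thread with which to associate the element */
}

/* Usage, assuming an existing struct pumaSet* set:
 *     pumaSetBalancer(set, false, &layerSplitter, NULL);
 *     size_t layer = 42;
 *     double* elem = (double*)pumallocManualBalancing(set, &layer);
 */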
• 64. Appendix B. API Reference B.3 Kernel Application

struct pumaExtraKernelData
{
	void* (*extraDataConstructor)(void* constructorData);
	void* constructorData;
	void (*extraDataDestructor)(void* data);
	void (*extraDataThreadReduce)(void* data);
	void (*extraDataReduce)(void* retValue, void* data[], unsigned int nThreads);
	void* retValue;
};

A descriptor of functions which handle extra data for kernels to pass into runKernel(). Members: extraDataConstructor A per-thread constructor for extra data which is passed into the kernel. constructorData A pointer to any extra data which may be required by the constructor. May be NULL. extraDataDestructor A destructor for data created with extraDataConstructor(). extraDataThreadReduce A finalisation function which is run after the kernel on a per-thread basis. Takes the per-thread data as an argument. extraDataReduce A global finalisation function which is run after all threads have finished running the kernel. Takes retValue, an array of the extra data for all threads and the number of threads in use. retValue A pointer to a return value for use by extraDataReduce. May be NULL.

void initKernelData(struct pumaExtraKernelData* kernelData,
	void* (*extraDataConstructor)(void* constructorData),
	void* constructorData,
	void (*extraDataDestructor)(void* data),
	void (*extraDataThreadReduce)(void* data),
	void (*extraDataReduce)(void* retValue, void* data[], unsigned int nThreads),
	void* retValue);

Initialises kernelData. Any or all of the arguments after kernelData may be NULL. Any NULL functions are set to dummy functions which do nothing. 55
• 65. Appendix B. API Reference

extern struct pumaExtraKernelData emptyKernelData;

A dummy descriptor for extra kernel data. Causes NULL to be passed to the kernel in place of extra data.

typedef void (*pumaKernel)(void* element, void* extraData);

The type signature for kernels which are to be run on a struct pumaSet. Arguments: element The current element in our iteration. extraData Extra information specified by our extra data descriptor.

void runKernel(struct pumaSet* set, pumaKernel kernel, struct pumaExtraKernelData* extraDataDetails);

Applies the given kernel to all elements in a struct pumaSet. Arguments: set The set containing the elements to which we want to apply our kernel. kernel A pointer to the kernel to apply. extraDataDetails A pointer to the structure specifying the extra data to be passed into the kernel.

void runKernelList(struct pumaSet* set, pumaKernel kernels[], size_t numKernels, struct pumaExtraKernelData* extraDataDetails);

Applies the given kernels to all elements in a struct pumaSet. Kernels are applied in the order in which they are specified in the array. Arguments: set The set containing the elements to which we want to apply our kernels. kernels An array of kernels to apply. numKernels The number of kernels to apply. extraDataDetails A pointer to the structure specifying the extra data to be passed into the kernels. 56
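As a small illustration of runKernelList(), the fragment below applies two kernels in order; the kernel names and the set are placeholders, and emptyKernelData is used because no per-thread extra data is needed.

/* Apply two kernels, in order, to every element of an existing set. */
pumaKernel kernels[] = { ageKernel, moveKernel }; /* placeholder kernels matching the pumaKernel signature */
runKernelList(set, kernels, 2, &emptyKernelData);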
• 66. Appendix B. API Reference B.4 Static Data Allocation

void* pumallocStaticLocal(size_t size);

Allocates thread-local storage which resides on the NUMA domain to which the CPU which executes the function belongs. Arguments: size The number of bytes we want to allocate.

void* pumaDeleteStaticData(void);

Deletes all static data associated with the current thread. 57
• 67. Appendix C Getting Started: Standard Deviation Hello World! In lieu of the traditional “Hello World” introductory program, we present a PUMA-based program which generates a large set of random numbers between 0 and 1 and uses the reduction mechanism of PUMA to calculate the set’s standard deviation. In order to calculate the standard deviation, we require three things: a kernel, a constructor for the per-thread data and a reduction function. In the constructor, we use the pumallocStaticLocal() function to allocate a static variable on a per-thread basis which resides in memory local to the core to which each thread is pinned. This interface for allocating thread-local data is only intended to be used for static data whose lifespan extends to the end of the program. It is possible to delete all static data which is related to a thread, but it is more sensible to simply reuse the allocated memory each time we need similarly-sized data on a thread. This requires the use of pthread keys in order to retrieve the allocated pointer each time it is needed.

// puma.h contains all of the puma public API declarations we need.
#include "puma.h"

#include <math.h>
#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>
#include <getopt.h>
#include <time.h> // for time(), used below to seed rand()

pthread_key_t extraDataKey;
pthread_once_t initExtraDataOnce = PTHREAD_ONCE_INIT;

static void initialiseKey(void)
{
	pthread_key_create(&extraDataKey, NULL);
58
• 68. Appendix C. Getting Started: Standard Deviation Hello World!
}

struct stdDevExtraData
{
	double sum;
	double squareSum;
	size_t numElements;
};

static void* extraDataConstructor(void* constructorData)
{
	(void)pthread_once(&initExtraDataOnce, &initialiseKey);

	void* stdDevExtraData = pthread_getspecific(extraDataKey);

	if(stdDevExtraData == NULL)
	{
		stdDevExtraData = pumallocStaticLocal(sizeof(struct stdDevExtraData));
		pthread_setspecific(extraDataKey, stdDevExtraData);
	}

	return stdDevExtraData;
}

static void extraDataReduce(void* voidRet, void* voidData[], unsigned int nThreads)
{
	double* ret = (double*)voidRet;
	double sum = 0;
	double squareSum = 0;
	size_t numElements = 0;

	for(unsigned int i = 0; i < nThreads; ++i)
	{
		struct stdDevExtraData* data = (struct stdDevExtraData*)voidData[i];
		numElements += data->numElements;
		sum += data->sum;
		squareSum += data->squareSum;
	}

	// The variance is E[x^2] - (E[x])^2; the standard deviation is its square root.
	double mean = sum / numElements;
	*ret = sqrt(squareSum / numElements - (mean * mean));
}

static void stdDevKernel(void* voidNum, void* voidData)
{
	double num = *(double*)voidNum;
	struct stdDevExtraData* data = (struct stdDevExtraData*)voidData;
	data->sum += num;
	data->squareSum += num * num;
59
• 69. Appendix C. Getting Started: Standard Deviation Hello World!
	++data->numElements;
}

static void staticDestructor(void* arg)
{
	pumaDeleteStaticData();
}

Prior to running the kernel, we must actually create the struct pumaSet which contains our data; to do this, we specify the size of our elements, the number of threads we wish to use and, optionally, a string detailing which cores we want to pin threads to. We must also seed the random number generator and read the arguments:

static void printHelp(char* invocationName)
{
	printf("Usage: %s -t numThreads -e numElements [-a affinityString]\n"
	       "\tnumThreads: The number of threads to use\n"
	       "\tnumElements: The number of numbers to allocate\n"
	       "\taffinityString: A string which specifies which cores to run on.\n",
	       invocationName);
}

int main(int argc, char** argv)
{
	int numThreads = 1;
	int numElements = 1000;
	char* affinityStr = NULL;

	/* Get command line input for the affinity string and number of threads. */
	int c;
	while((c = getopt(argc, argv, "e:a:t:h")) != -1)
	{
		switch(c)
		{
			case 't':
				numThreads = atoi(optarg);
				break;
			case 'e':
				numElements = atoi(optarg);
				break;
			case 'a':
				affinityStr = optarg;
				break;
			case 'h':
				printHelp(argv[0]);
				break;
60
• 70. Appendix C. Getting Started: Standard Deviation Hello World!
		}
	}

	struct pumaSet* set = createPumaSet(sizeof(double), numThreads, affinityStr);

	srand(time(NULL));

From here, we can use the pumalloc call to allocate space within set for each number:

	for(size_t i = 0; i < numElements; ++i)
	{
		double* num = (double*)pumalloc(set);
		*num = (double)rand() / RAND_MAX;
	}

We then use initKernelData() to create the extra data to be passed into our kernel. From there, we call runKernel() to invoke our kernel and get the standard deviation of the set.

	struct pumaExtraKernelData kData;
	double stdDev = -1;
	initKernelData(&kData, &extraDataConstructor, NULL, NULL, NULL, &extraDataReduce, &stdDev);

	runKernel(set, stdDevKernel, &kData);

	printf("Our set has a standard deviation of %f\n"
	       "Also, Hello World!\n", stdDev);

Finally, we clean up after ourselves by destroying our set and all our static data. The static data destructor destroys data on a per-thread basis, so we must call the destructor from all threads in our pool. To do this, we use the executeOnThreadPool() function from pumathreadpool.h.

	executeOnThreadPool(set->threadPool, staticDestructor, NULL);
	destroyPumaSet(set);
}

In order to compile this tutorial, use the following command:

gcc -pthread -std=c99 <file>.c -o stddev -lpuma -L<PUMA bin dir> -I<PUMA inc dir>
61
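Once compiled, the program can be run like any other command-line tool; the thread count, element count and affinity string below are arbitrary example values, and, depending on where the PUMA shared library is installed, the dynamic linker may also need to be told where to find it (for example via LD_LIBRARY_PATH).

./stddev -t 4 -e 1000000 -a "0-3"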
  • 71. Appendix D Licence PUMA is released under the three-clause BSD licence[32]. We chose this rather than a copyleft licence like GPL or LGPL in order to allow anyone to use PUMA with absolute freedom aside from the inclusion of a short copyright notice. 62