Department of Computing
Imperial College London
PUMA
Abstracting Memory Latency Optimisation In Parallel
Applications
Richard Jones
Supervisor: Tony Field
June 2015
Abstract
Moore’s Law states that the number of transistors per square inch on an integrated circuit
approximately doubles every eighteen months to two years[1], effectively leading to a
proportional performance gain. However, in the early twenty-first century, transistor
size reduction began to slow, limiting the growth in application complexity that
increasing computing power had previously afforded.
Consequently, there was a push for increased parallelism, enabling several tasks to be car-
ried out simultaneously. For high-performance computing applications, a logical extension
to this was to utilise multiple processors simultaneously in the same system, each with
multiple execution units, in order to increase parallelism with widely available consumer
hardware.
In multi-processor systems, having uniformly shared, globally accessible physical memory
means that memory access times are the same across all processors. These accesses can be
expensive, however, because they all require communication with a remote node, typically
across a bus which is shared among the processors. Since the bus is shared and can only
handle one request at a time, processors may have to wait to use it, causing delays when
attempting to access memory.
The situation can be improved by giving each processor its own section of memory, each
with its own data bus. Each section of memory is called a domain, and accessing a domain
which is assigned to a different processor requires the use of an interconnect which has
a higher latency than accessing local memory. This architecture is called NUMA (Non-
Uniform Memory Access). In order to exploit NUMA architectures efficiently, application
developers need to write code which minimises so-called cross-domain accesses to maximise
the application’s aggregate memory performance.
We present PUMA, which is a smart memory allocator that manages data in a NUMA-
aware way. PUMA exposes an interface to execute a kernel on the data in parallel, auto-
matically ensuring that each core which runs the kernel accesses primarily local memory.
It also provides an optional time-based load balancer which can adapt workloads to cases
where some cores may be less powerful, or have more to do per kernel invocation, than
others.
Acknowledgements
I would like to thank the following for their contributions to PUMA, both directly and
indirectly:
• My supervisor, Tony Field, who has been a tremendous source of support, both in
the development of PUMA and in my completion of this year.
• Dr Michael Lange, the creator of our LERM case study, who spent hours helping me
to work out just exactly what was wrong with my timing results.
• My tutor, Murray Shanahan, for helping me get through all four years of my degree
relatively intact, and providing help and support throughout.
• Imperial’s High Performance Computing service, especially Simon Burbidge who was
invaluable in helping me find my way around the HPC systems.
• My family and friends (especially those who know nothing about computers) for
continuing to talk to me after being forced to proofread approximately seventeen
thousand different drafts of my project report. Also for providing vague moral sup-
port over the course of the first 21 years of my life.
List of Figures
1.1 Calvin ponders the applications of workload parallelisation.
1.2 Parallel efficiency reached by a trivially parallel algorithm with dynamic data using static over-allocation and not taking NUMA effects into account. The red vertical line signifies the point past which we use cores on another NUMA node.
1.3 Percentage of total reads which are remote on average across several runs of the trivially parallel section of the case study.
1.4 Total runtime with dynamic data using PUMA.
2.1 “Transistor counts for integrated circuits plotted against their dates of introduction. The curve shows Moore’s law - the doubling of transistor counts every two years. The y-axis is logarithmic, so the line corresponds to exponential growth.”[2]
2.2 Example of agent allocation with Linux’ built-in malloc() implementation. Agents belonging to each thread are represented by a different colour per thread. Black represents non-agent memory.
2.3 Parallel efficiency reached by particle management with dynamic data using static over-allocation and not taking NUMA effects into account. The red vertical line signifies the point past which we use cores on another NUMA node.
2.4 Parallel efficiency reached by environment update with dynamic data using static over-allocation and not taking NUMA effects into account. The red vertical line signifies the point past which we utilise cores from a second NUMA node.
2.5 Total runtime with dynamic data using static storage in the reference implementation.
3.1 How we lay out our internal lists of allocated memory. The black box represents the user-facing set structure, and each of blue and red represents a different thread’s list of pre-allocated memory blocks. These blocks are the same size for each thread.
3.2 Our first strategy for mapping elements to block descriptors. Darker blue represents the block’s descriptor; light blue represents a page header with a pointer to the block’s descriptor; red represents elements; the vertical lines represent page boundaries; and white represents unallocated space within the block.
3.3 Our second strategy for mapping elements to block descriptors. The blue block represents the block’s descriptor; the red blocks represent elements; and the vertical lines represent page boundaries. In this example, blocks are two pages long.
5.1 Biomass of plankton in the PUMA-based LERM simulation over time. Black are maximum and minimum across all runs, red is average.
5.2 Comparison between the average biomasses across runs in the reference and PUMA implementations of the LERM simulation over time.
5.3 Total runtime with dynamic data using PUMA.
5.4 Total runtime with dynamic data using static over-allocation and not taking NUMA effects into account.
5.5 Parallel efficiency reached by agent updates in the PUMA implementation of LERM vs in the reference implementation. The red vertical line signifies the point past which we utilise cores on a second NUMA node. In this instance, reduction is the necessary consolidation of per-thread environmental changes from the update loop into the global state. Margin of error is calculated by finding the percentage standard deviation in the original times and applying it to the parallel efficiency.
5.6 Parallel efficiency reached by particle management in the PUMA implementation of LERM vs in the reference implementation.
5.7 Parallel efficiency reached by environment update in the PUMA implementation of LERM vs in the reference implementation.
5.8 Parallel efficiency reached by the update step in the PUMA implementation of LERM with and without load balancing.
5.9 Parallel efficiency reached by the trivially parallel section of LERM when using OpenMP (blue) and PUMA’s own thread pool (red).
Listings
2.1 LERM pseudocode
3.1 Calculating the mapping between elements and their indices with a per-page header.
3.2 Calculating the mapping between elements and their indices without using headers.
3.3 Pseudo-random number generator where i is a static seed.
4.1 PUMA-based LERM pseudocode
Chapter 1
Introduction
Figure 1.1: Calvin ponders the applications of workload parallelisation.
In modern computing, the execution time of many applications can be greatly reduced
by the use of large, multi-core systems. These gains are especially prevalent in scientific
simulations operating on very large data sets, such as ocean current simulation[3], weather
simulation[4] and finite element modelling[5].
Memory accesses in multi-processor systems can be subject to delays due to contention
for the memory controller. A common solution to this problem is to provide multiple
distinct controllers, each with its own discrete memory; this architecture is called NUMA,
or Non-Uniform Memory Access[6]. Processors can access memory associated with other
processors’ controllers, but doing so incurs extra delay because requests must traverse a
higher-latency interconnect between controllers.
The primary disadvantage of NUMA is that, as each core needs to access more memory
which is not associated with its controller, memory latency becomes a dominating factor
in runtime. Unless this is taken into account, it causes parallel efficiency (calculated with
formula A.1) to drop off rapidly, even in applications which are trivially or “embarrassingly”
parallel.
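Parallel efficiency is conventionally defined as speedup divided by the number of cores in use; assuming formula A.1 matches this conventional definition, it can be sketched in C as:

```c
/* Parallel efficiency in the conventional sense (assumed to match
 * formula A.1 in the appendix): speedup divided by core count,
 * E(n) = t1 / (n * tn), where t1 is the single-core runtime and tn
 * is the runtime on n cores. A value of 1.0 means ideal linear
 * scaling; values below 1.0 indicate overheads such as NUMA latency. */
double parallel_efficiency(double t1, double tn, int n)
{
    return t1 / ((double)n * tn);
}
```

For example, a job taking 100 s on one core and 25 s on four cores has an efficiency of 1.0 (perfect scaling), while 50 s on four cores gives 0.5.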
In parallel systems, cache coherency can also be a major factor in memory latency. In
order to enable a so-called “classical” programming model based on the Von Neumann
stored-program concept for general purpose computing, processors often automatically
synchronise their caches if one processor writes to a memory location which resides in both
its and another’s cache.
This synchronisation introduces extra latency when reading from or writing to memory
which is within a cache line recently accessed by another processor. Consequently, its
avoidance in high-performance parallel code can be critical.
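A common way to avoid this synchronisation penalty is to pad per-thread data out to cache-line boundaries, so that no two threads ever write within the same line. A minimal sketch, assuming a 64-byte line (common on x86, but architecture-dependent):

```c
/* Sketch of false-sharing avoidance: each thread's counter is padded
 * to a full (assumed 64-byte) cache line, so concurrent updates to
 * adjacent counters never trigger coherency traffic between cores. */
#define CACHE_LINE 64

struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];  /* fill the rest of the line */
};
```

An array of `struct padded_counter`, one element per thread, then places every counter on its own line at the cost of some wasted space.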
Figure 1.2 shows the parallel efficiency as we utilise more cores in the trivially parallel sec-
tion of the reference implementation of LERM (Lagrangian Ensemble Recruitment Model)
which we use as our primary case study. Both NUMA latency and cache synchronisation
are responsible for the drop off; on one domain, there is a significant drop in performance
due to multiple processors accessing and updating data within the same cache line. When
we use multiple domains, we can see a change in the rate of drop-off in parallel efficiency,
as illustrated by the two-part trend line.
Figure 1.3 shows the increase in off-domain memory accesses in the reference implementa-
tion of LERM when we use cores associated with a second domain.
1.1 Motivation
In this project we aim to provide a framework with clear abstractions for application
developers - both those with an understanding of computer architecture and those without -
to write software which takes advantage of large, NUMA-based machines without dealing
with the underlying management of memory across NUMA domains.
Solutions already exist which provide this level of abstraction; however, each solution is
focused on a specific set of problems. The problem which provided the primary motivation
for our library is a type of agent-based modelling in which no agent interacts with any
other, except indirectly via environmental or global variables.
This kind of modelling can be used in a variety of areas, including economic interactions
of individuals with a market[7] and ecological simulations[8][9]. Some such models are
dynamic, in that the number of agents may change over time.
Figure 1.2: Parallel efficiency reached by a trivially parallel algorithm with
dynamic data using static over-allocation and not taking NUMA effects into
account. The red vertical line signifies the point past which we use cores on
another NUMA node. (Plot: parallel efficiency (%) against cores in use, 1-12,
with a two-part trend line.)
1.2 Objectives
The primary purpose of our solution is to combine the dynamism of malloc() with the
benefits of static approaches, as well as to provide methods to run kernels on pre-pinned
threads across datasets. It must abstract away as much of the low-level detail as possible,
without sacrificing performance gains or configurability.
We aim to enable the simple parallelisation of applications operating on large, independent
sets of data in such a way that results remain correct within reasonable bounds. We
also aim to prevent NUMA cross-domain performance penalties without the application
developer’s intervention.
1.3 PUMA (Pseudo-Uniform Memory Access)
This project is concerned with the design, implementation and evaluation of PUMA, which
is a NUMA-aware library that provides methods for the allocation of homogeneous blocks
of memory (elements), the iteration over all of these elements and the creation of static,
domain-local data on a per-thread basis. It exposes a relatively small external API which
is easy to use. It does not require developers to have an understanding of the underlying
system topology, allowing them to focus more on the logic behind their kernel. It does,
however, provide advanced configuration options; for example, it automatically pins threads
to cores but can also take an affinity string at initialisation for customisable pinning.
As well as providing NUMA-aware memory management within its set structure, PUMA
also exposes an API for the allocation of per-thread static data which is placed on the
calling thread’s local NUMA domain. The allocated data are guaranteed not to share a
cache line with anything allocated by another thread, preventing the memory latency
caused by maintaining cache coherency.
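One way such cache-line isolation can be implemented (a sketch under the assumption of 64-byte lines, not PUMA's actual code) is to align every per-thread allocation on a line boundary and round its size up to a whole number of lines:

```c
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>
#include <stdint.h>

/* Assumed line size; the real value is architecture-dependent. */
#define CACHE_LINE 64

/* Allocate `bytes` rounded up to whole cache lines, starting on a
 * line boundary, so the allocation cannot share a line with any
 * other allocation made the same way. */
void *alloc_line_isolated(size_t bytes)
{
    void *p = NULL;
    size_t rounded = (bytes + CACHE_LINE - 1) & ~(size_t)(CACHE_LINE - 1);
    if (posix_memalign(&p, CACHE_LINE, rounded) != 0)
        return NULL;
    return p;
}

/* Self-check: the returned pointer is non-null and line-aligned. */
int isolation_demo(void)
{
    unsigned char *p = alloc_line_isolated(10);
    int ok = p != NULL && ((uintptr_t)p % CACHE_LINE) == 0;
    free(p);
    return ok;
}
```

Rounding the size up matters as much as aligning the start: without it, the tail of one allocation could still share a line with the head of the next.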
PUMA fits between existing solutions in that, while it imposes the constraint that data
must not have direct dependencies, the data set it operates on can be dynamic. It therefore
addresses a different class of problems from libraries such as OP2 (section 2.7.2) and Galois
(section 2.7.3).
We have adapted a scientific simulation (LERM) to use PUMA in order to examine its
effects and usability. The simulation is outlined in section 2.8.
The PUMA-based implementation of this simulation has three main sections, each ex-
hibiting a different level of parallelisation: trivially parallel, partially parallel and entirely
serial.
Figure 1.4 shows the total runtime of LERM when implemented with PUMA against the
theoretical minimum as dictated by Amdahl’s Law[10]. This figure illustrates two aspects
of PUMA at work:
1. The execution time across cores on a single domain is close to the minimum due to
cache coherency optimisation;
2. The execution time across cores on multiple domains is also close to the theoretical
minimum as a result of both cache coherency optimisation and the reduction in
cross-domain memory traffic, as shown in Figure 1.3.
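Amdahl's Law bounds the achievable speedup by the fraction of the program that can be parallelised. A small sketch of the formula such a theoretical-minimum curve is derived from:

```c
/* Amdahl's Law: with parallelisable fraction p (0 <= p <= 1) and n
 * cores, the best possible speedup is 1 / ((1 - p) + p / n). The
 * theoretical minimum runtime is then the serial runtime divided by
 * this speedup. */
double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / (double)n);
}
```

For instance, a fully parallel program (p = 1) on 8 cores can speed up 8x, while a half-serial one (p = 0.5) on 2 cores is capped at about 1.33x, however many more cores are added.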
Currently, PUMA has been tested on the following Operating Systems:
• Ubuntu Linux 14.10 and 15.04 with kernels 3.16 and 3.19;
Figure 1.3: Percentage of total reads which are remote on average across several
runs of the trivially parallel section of the case study. (Plot: remote reads as
a percentage of total reads against cores in use, 1-12, for the reference and
PUMA-based implementations.)
• Red Hat Enterprise Linux with kernel 2.6.32;
• Mac OS 10.10.3.
PUMA is written to be as backwards-compatible as possible across Linux kernel versions;
it was written with the POSIX standard in mind and uses, as far as possible, only
standard-mandated features. For those features which are not standard-mandated,
alternatives are available as compile-time options.
Internally, we implement time-based thread workload balancing. This allows us to manage
intelligently the time taken to run a kernel on each thread, preventing any one thread from
taking significantly longer than others.
Figure 1.4: Total runtime with dynamic data using PUMA. (Plot: total time in
ms, log scale, against cores in use, 1-8, comparing the actual timing with the
minimum time given by Amdahl’s Law.)
1.4 Contributions
• In chapter 3, we discuss the design and implementation of PUMA, including how we
achieve near-ideal parallel efficiency even across NUMA domains;
• In chapter 4, we examine the scalability and parallelisation of the LERM simulation,
including discussing our necessarily parallel Particle Management implementation;
• In chapter 5, we present a detailed experimental evaluation of PUMA with respect
to LERM, including both timing data and simulation correctness verification.
The PUMA library code is hosted at https://github.com/CatharticMonkey/puma.
Chapter 2
Background
The most common architectural model in modern computers is called the Von Neumann
architecture; it is based on a model described by John Von Neumann in the First Draft of
a Report on the EDVAC[11].
The model describes a system consisting of the following:
• A processing unit containing an Arithmetic/Logic Unit (ALU) and registers;
• A control unit consisting of an instruction register and a program counter;
• Memory in which data and instructions are stored;
• External storage;
• Input/output capabilities.
The main benefit of a stored-program design such as the Von Neumann architecture is
the ability to dynamically change the process which the computer carries out - this is
in contrast to early computers which had hard-wired processes and could not be easily
reprogrammed. The flexibility afforded by the stored-program approach is critical to the
widespread use of computers as general-purpose machines.
2.1 Hardware Trends
The first computer processors were developed to execute a serial stream of instructions.
This was initially sustainable in terms of keeping up with increased requirements for more
complex computations, as single-chip performance was constantly being improved; Moore’s
law is an observation stating that “[t]he complexity for minimum component costs has
increased at a rate of roughly a factor of two per year... Certainly over the short term this
rate can be expected to continue, if not to increase.”[1] This trend is shown in Figure 2.1.
In other words, approximately every two years, the number of components in an integrated
circuit can be expected to increase twofold, leading to a proportional performance gain.
This, along with the significant increase in clock speeds from several MHz to several GHz in
the span of just a few decades meant that serial execution was also subject to approximately
linear gains.
In the early 21st century, however, these gains began to slow as a result of physical limita-
tions.
The next step for performance scaling was parallelism, and so the number of discrete
processing units in chips produced by most major manufacturers increased.
As a result of this move towards higher levels of hardware parallelism, there has been an
increasingly strong focus in current computing on parallelising software to take advantage
of it.
2.2 Memory Hierarchy
Due to the trend for increased speed of computation in a small space, memory access time
has become a major factor in execution time. This is especially true as clock cycles have
become shorter: a memory access of fixed duration now spans more cycles, each of which
could have executed an instruction but is instead spent waiting.
As a result, there are various architectural decisions made by processor manufacturers in
order to attempt to minimise this penalty and thus speed up program execution. The
primary method of tackling the problem is to implement a caching system which exploits
the fact that a significant portion of memory accesses exhibit spatial locality (i.e. they
are close to previously accessed memory); when a memory access is performed, the cache
(which is physically close to the processor and is often an on-chip bank of memory) is first
checked to see if it contains the desired data.
If it does, there is no need to look further and the access returns relatively quickly. If it
does not, the data are requested from main memory along with a surrounding block of
data of a predetermined size. These extra data are stored in the cache, enabling future
cache hits.
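The payoff of spatial locality can be made concrete with a traversal-order example (ours, not from the case study). Both loops below compute the same sum over a row-major matrix, but the row-order walk touches consecutive addresses, so each cache line fetched on a miss supplies the next several elements, while the column-order walk strides N elements at a time and wastes most of each fetched line:

```c
#define N 256

/* Row-order walk: consecutive addresses, cache-friendly. */
long sum_row_order(int m[N][N])
{
    long s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

/* Column-order walk: stride of N ints per access, so typically one
 * cache miss per element once the matrix exceeds the cache. */
long sum_column_order(int m[N][N])
{
    long s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];
    return s;
}

/* Self-check: both traversals must produce the same sum; only their
 * memory access patterns differ. */
int traversals_agree(void)
{
    static int m[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            m[i][j] = i + j;
    return sum_row_order(m) == sum_column_order(m);
}
```

The results are identical; only the miss rate, and therefore the runtime, differs.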
The simple caching model is often extended with the use of multiple levels of cache. In this
extension, lower levels of cache are physically closer to the memory data register, which is
the location to which memory accesses are initially loaded on the processor. As a result,
lower levels of cache are quicker to access than higher ones. However, lower level caches
have a smaller size limit because of their location. Consequently, it is desirable to have
several levels of cache, each bigger than the last, in order to reduce cache-processor latency
as much as possible without sacrificing size.
Figure 2.1: “Transistor counts for integrated circuits plotted against their dates
of introduction. The curve shows Moore’s law - the doubling of transistor
counts every two years. The y-axis is logarithmic, so the line corresponds to
exponential growth.”[2]
Caching helps to mitigate memory access penalties significantly, but main memory access
time is still important, especially when the nature of the algorithm in question makes
many cache misses inevitable. If there are many cache misses, memory access time can
quickly become a dominant factor in execution time.
2.3 Parallel Computing
The logical extension of per-processor parallelisation is spreading heavy computational
tasks across a large number of processors. There are two main methods for achieving this:
• Utilising several separate computers networked together, sharing data and achieving
synchronisation through message passing over the network;
• Having two or more sockets on a motherboard, which increases the number of cores
available in one computer without requiring more expensive, advanced processors.
The first method’s primary draw is its scalability; it is useful for constructing large systems
relatively cheaply without specialised hardware. The main benefit of the second is that
it avoids the overhead of the first’s message-passing while still maintaining the ability to
use consumer-grade hardware. It is possible to combine these approaches in order to gain
many of the benefits of both. Instances of such combinations are:
• Edinburgh University’s Archer supercomputer[12]: a 4920-node Cray XC30 MPP
supercomputer with two 12-core Intel Ivy Bridge processors on each node, providing
a total of 118,080 cores;
• The UK Met Office’s supercomputer[4][13]: A Cray XC40, with a large number of
Intel Xeon processors, providing a total of 480,000 cores.
• Southampton University’s Iridis-Pi cluster[14]: a cluster of 64 Raspberry Pi Model B
nodes, providing a low-power, low-cost 64-core computer for educational applications.
Each of these focuses on providing a massively parallel system, in order to carry out certain
types of parallel computations; if they are used for a computation which must be done in
serial or which simply does not take advantage of the topology, all that they gain over a
single-core machine is lost.
Libraries which abstract away hardware details are often used to take advantage of this
kind of architecture; this makes it relatively easy to create software which is scalable across
several sockets (each containing a multi-core CPU), or even several networked computation
nodes, while not requiring in-depth architectural knowledge.
2.4 Caching In Multi-Processor Systems
In NUMA systems, there are two main approaches to cache management. The far simpler
and less expensive method (in terms of hardware-level synchronisation) is not to synchronise
caches across processors; this has the major disadvantage that programming in the common
Von Neumann paradigm becomes too complex to be feasible.
The other method involves maintaining cache coherency at the hardware level; this is called
cache-coherent NUMA (ccNUMA). It requires significantly more complexity in the design
of the system and can lead to a substantial synchronisation overhead if two processors are
accessing data in the same cache line; writes from one processor require an update in the
cache of the other, introducing latency.
ccNUMA is the more common of these two because it does not introduce extra complexity
in creating correct programs for multi-processor systems and the synchronisation overhead
can be avoided by not simultaneously accessing data within the same cache line on different
cores.
2.5 Memory In Agent-Based Models
Static preallocation of space for agents is a potential solution to the problem of allocating
memory for them in dynamic agent-based models; through first touch policies, it provides
the ability to handle agent placement easily and intelligently in terms of physical NUMA
domains. Memory is commonly partitioned into virtual pages, which, under first touch, are
only assigned physical space when they are first accessed; the physical memory assigned is
on the NUMA domain to which the processor touching the memory belongs[15].
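Under first touch, reserving memory and physically placing it are separate events, which an allocator can exploit deliberately. A hypothetical sketch (`reserve_region` and `first_touch` are illustrative names, not PUMA's API; it assumes a first-touch kernel policy and 4 KiB pages):

```c
#include <stdlib.h>

/* Assumed page size; the real value comes from sysconf(_SC_PAGESIZE). */
#define PAGE_SIZE 4096

/* Reserve a zero-filled region. Under a first-touch policy, fresh
 * pages are typically not physically placed until first written. */
unsigned char *reserve_region(size_t bytes)
{
    return calloc(1, bytes);
}

/* Touch one byte per page from the thread that will use the data, so
 * the kernel assigns each page a frame on that thread's local domain. */
void first_touch(unsigned char *region, size_t bytes)
{
    for (size_t off = 0; off < bytes; off += PAGE_SIZE)
        region[off] = 0;
}

/* Self-check: reservation succeeds and the region reads back as zero. */
int first_touch_demo(void)
{
    size_t bytes = 1 << 20;
    unsigned char *p = reserve_region(bytes);
    if (!p)
        return 0;
    first_touch(p, bytes);
    int ok = (p[0] == 0);
    free(p);
    return ok;
}
```

The key design point is that the touching loop runs on the worker thread that owns the region, not on the thread that performed the reservation.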
Static preallocation is, however, not always an option, as it requires either reallocation
upon dataset growth or massive over-allocation and the imposition of an artificial upper
bound on dataset size.
Dynamic allocation methods do not require reallocation or size boundaries. The C standard
library’s malloc() is the primary method of allocating memory dynamically from the heap,
but it has several disadvantages when used for agent-based modelling:
• Execution time
– malloc() can make no assumptions about allocation size. This means it has to
handle holes left by free()d memory, so allocating data can require searching
for suitably-sized free blocks of memory.
• Lack of cache effectiveness
– Since allocations for all malloc() calls can come from the same pool, there
is no guarantee that all agents will occupy contiguous memory, meaning that
iteration may cause a significant number of cache misses.
In Figure 2.2, let each agent be half the size of the block loaded into core-local
cache on each miss (i.e. half a cache line). If the first (red) agent is accessed by
thread 1, the processor will also load the second (yellow) one into cache. The
second agent constitutes wasted cache space because it will not be accessed by
that processor. In the worst case, synchronisation needs to occur between the
two processors running the red and yellow threads in order to ensure coherency.
• Lack of NUMA-awareness
– Many malloc() implementations (for example the Linux default implementation
and tcmalloc) do not necessarily allocate from a pool which is local to the core
requesting the memory.
First-touch page creation means that if a malloc() call returns memory be-
longing to a thus-far untouched page and we initialise the memory on a CPU
belonging to the NUMA domain from which we will then access it, we should
not incur a cross-domain access. However, malloc() implementations which are
not NUMA-aware may allocate from pages which may have already been faulted
into memory on another domain.
We have no way, therefore, to guarantee that accessing agents allocated using
malloc() and similar calls will not incur the penalty of a cross-domain access.
malloc() implementations exist which are NUMA-aware. However, these still exhibit the
other two problems because malloc() can make no assumptions about the context in which
memory will be used and it must support variably-sized allocations. Consequently, even
NUMA-aware implementations are not suitable for this class of applications as a result of
the trade-off between generality and performance.
Figure 2.2: Example of agent allocation with Linux’ built-in malloc() imple-
mentation. Agents belonging to each thread are represented by a different colour
per thread. Black represents non-agent memory.
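The contrast with a fixed-size allocator can be made concrete. The free-list pool below is a minimal illustrative sketch, not PUMA's implementation: because every element has the same size, allocation and deallocation are constant-time list operations with no searching for suitably-sized holes and no fragmentation.

```c
#include <stdlib.h>

/* Fixed-size pool: free slots form a singly linked list threaded
 * through the slots themselves, so no per-slot metadata is needed.
 * elem_size must be at least sizeof(void *). */
struct pool {
    void  *free_list;
    char  *storage;
    size_t elem_size;
};

int pool_init(struct pool *p, size_t elem_size, size_t count)
{
    p->elem_size = elem_size;
    p->storage = malloc(elem_size * count);
    if (!p->storage)
        return -1;
    p->free_list = NULL;
    for (size_t i = 0; i < count; i++) {
        void *slot = p->storage + i * elem_size;
        *(void **)slot = p->free_list;   /* push slot onto the free list */
        p->free_list = slot;
    }
    return 0;
}

void *pool_alloc(struct pool *p)
{
    void *slot = p->free_list;
    if (slot)
        p->free_list = *(void **)slot;   /* O(1) pop, no searching */
    return slot;
}

void pool_free(struct pool *p, void *slot)
{
    *(void **)slot = p->free_list;       /* O(1) push */
    p->free_list = slot;
}

/* Self-check: the most recently freed slot is reused first. */
int pool_demo(void)
{
    struct pool p;
    if (pool_init(&p, 32, 8) != 0)
        return 0;
    void *a = pool_alloc(&p);
    void *b = pool_alloc(&p);
    pool_free(&p, a);
    int ok = a && b && pool_alloc(&p) == a;
    free(p.storage);
    return ok;
}
```

Because the pool hands slots out of one contiguous block, it also tends to keep elements of the same kind adjacent in memory, addressing the cache-effectiveness problem above as well.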
2.6 Workload Balancing
When operating on data sets in parallel, one issue which needs to be addressed is how to
ensure that each thread will finish its current workload at approximately the same time; if
threads finish in a staggered fashion, this can lead to sub-optimal parallel performance as
some threads that could be working are instead idle.
2.6.1 Work Stealing
In work stealing, balance across threads is achieved by the scheduler; computation is split
into discrete tasks which the scheduler assigns to processors in the form of per-processor
task queues. If one processor completes its task queue, it “steals” one or more tasks from
another’s queue. This means that, as long as there exist tasks which have not been started
and each task is of a similar length, no processor will be idle for more than the time it
takes to complete one task.
If the tasks are not necessarily of a similar length, balance can still be approximately
achieved by estimating the length of each task and optimising the task queues based on
these estimates.
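The stealing behaviour can be sketched deterministically. The model below is a single-threaded simplification (real work stealing runs concurrently over synchronised deques); it shows only the balancing logic: an idle worker drains its own queue, then takes tasks from the fullest remaining queue until no work is left anywhere.

```c
#define WORKERS 4

/* Simplified model: tasks are interchangeable units, so each queue
 * only tracks a count. */
struct queue {
    int count;
};

/* Run one worker to completion, returning the number of tasks it
 * executed. When its own queue is empty it steals one task at a
 * time from the fullest other queue. */
int run_worker(struct queue qs[WORKERS], int self)
{
    int done = 0;
    for (;;) {
        if (qs[self].count > 0) {            /* local work first */
            qs[self].count--;
            done++;
            continue;
        }
        int victim = -1;                     /* find the fullest other queue */
        for (int i = 0; i < WORKERS; i++)
            if (i != self && qs[i].count > (victim < 0 ? 0 : qs[victim].count))
                victim = i;
        if (victim < 0)
            return done;                     /* no work left anywhere */
        qs[victim].count--;                  /* "steal" one task */
        qs[self].count++;
    }
}

/* Self-check: an idle worker steals and completes all of worker 0's
 * tasks, leaving every queue empty. */
int stealing_demo(void)
{
    struct queue qs[WORKERS] = { {8}, {0}, {0}, {0} };
    return run_worker(qs, 1) == 8 && qs[0].count == 0;
}
```

In a concurrent implementation each queue is a lock-free deque: the owner pops from one end and thieves steal from the other, minimising contention.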
2.6.2 Data-Based Balancing
Data-based balancing is a method of balancing which consists of assigning blocks of data
to specific threads based on some partitioning strategy; these partitioning strategies can
be based on data size, for example, or the results of profiling several recent runs of com-
putational kernels.
Balancing with this strategy is inherently simpler than with task balancing if we are running
an identical kernel since it simply involves ensuring that each thread has approximately the
same amount of data to operate on. We can expand upon this by using timing data from
previous runs to estimate how long each thread will take to run, allowing us to achieve a
closer to optimal balance without the overhead of balancing at runtime.
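The core arithmetic of such timing-based partitioning can be sketched as follows (an illustrative scheme, not PUMA's exact balancer): each thread receives a share of the elements inversely proportional to its measured per-element cost, so that all threads are predicted to finish together.

```c
#define NTHREADS 4

/* Partition total_elems among NTHREADS threads so that predicted
 * finish times are equal: a thread's share is proportional to its
 * speed (the reciprocal of its measured per-element cost). Integer
 * truncation leftovers are given to thread 0. */
void balance_partitions(const double cost_per_elem[NTHREADS],
                        int total_elems, int out_share[NTHREADS])
{
    double speed[NTHREADS];
    double speed_sum = 0.0;
    for (int i = 0; i < NTHREADS; i++) {
        speed[i] = 1.0 / cost_per_elem[i];   /* elements per unit time */
        speed_sum += speed[i];
    }
    int assigned = 0;
    for (int i = 0; i < NTHREADS; i++) {
        out_share[i] = (int)(total_elems * speed[i] / speed_sum);
        assigned += out_share[i];
    }
    out_share[0] += total_elems - assigned;  /* hand out rounding leftovers */
}

/* Self-check: equal costs yield equal shares summing to the total. */
int balance_demo(void)
{
    double cost[NTHREADS] = { 1.0, 1.0, 1.0, 1.0 };
    int share[NTHREADS];
    int total = 0;
    balance_partitions(cost, 100, share);
    for (int i = 0; i < NTHREADS; i++)
        total += share[i];
    return total == 100 && share[1] == 25;
}
```

Feeding the measured times from each kernel run back into `cost_per_elem` lets the partition track slow drift in per-thread performance without any synchronisation during the run itself.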
2.7 Existing Approaches
2.7.1 Manual Parallelisation
There are two primary types of parallelisation: one involves running code across several
cores on the same motherboard; the other involves running it across several processors on
different computers, using messaging on a local network. The former approach avoids the
overhead of message passing, whereas the latter is more scalable using consumer hardware.
Often, they are mixed, using the MPI (Message Passing Interface) standard for the inter-
computer messaging and OpenMP or the Operating System’s threading interface for the
local parallelism.
Both require manual management of the placement of data. If running on a single NUMA-
enabled system, this involves predicting the cores which will be accessing data and allo-
cating physical memory accordingly. Using multiple networked computers requires manual
usage of MPI functions in order to transfer data among computers; it is not implicit, so
the application must be designed with this in mind.
The primary disadvantages of manual parallelisation are both related to its complexity; it
requires sufficient knowledge of system APIs and system architecture and it can require a
significant amount of programming time. Often, this renders a custom solution infeasible.
2.7.2 OP2/PyOP2
OP2, and its Python analogue, PyOP2, provide “an open-source framework for the ex-
ecution of unstructured grid applications on clusters of GPUs or multi-core CPUs.”[16]
They are focused on MPI-level distribution, with support for OpenCL, OpenMP or CUDA
for local parallelism. These two levels can be combined in one application, enabling the
developer to take advantage of the benefits of both.
The framework operates on static, independent data sets, allowing for data-specific optimisation at compile time. Its architecture involves compile-time code generation using “source-source translation to generate the appropriate back-end code for the different target platforms.”[16][17] Because the data are static, little runtime memory management is required. The independence constraint means that the order in which the data are iterated over during kernel application must have no significant impact on the result beyond floating-point error.
The OpenMP local parallelisation code does not encounter the NUMA problem because
all of the data are statically allocated; as long as the data which will be accessed from
CPUs on different domains reside in different virtual pages, the default first touch policy
in most modern Operating Systems will ensure that memory accesses are primarily to the
local NUMA domain.
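The first-touch behaviour described above can be exploited deliberately. In this sketch (plain pthreads; the helper names are our own, not OP2's), each thread writes its own slice of a shared buffer, so under a first-touch policy the pages backing each slice are placed on the domain of the core that thread happens to run on:

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

struct slice { char* base; size_t len; };

/* Each thread "first-touches" its own slice, so with the default first-touch
 * policy the pages backing that slice are placed on the NUMA domain of the
 * core the thread runs on. Placement is per page, so slices should be
 * page-aligned in practice; this sketch omits that detail. */
static void* touchSlice(void* arg)
{
	struct slice* s = (struct slice*)arg;
	memset(s->base, 1, s->len);
	return NULL;
}

char* allocFirstTouch(size_t nThreads, size_t bytesPerThread)
{
	char* buf = malloc(nThreads * bytesPerThread);
	pthread_t* threads = malloc(nThreads * sizeof *threads);
	struct slice* slices = malloc(nThreads * sizeof *slices);

	for (size_t t = 0; t < nThreads; ++t)
	{
		slices[t].base = buf + t * bytesPerThread;
		slices[t].len = bytesPerThread;
		pthread_create(&threads[t], NULL, touchSlice, &slices[t]);
	}
	for (size_t t = 0; t < nThreads; ++t)
		pthread_join(threads[t], NULL);

	free(threads);
	free(slices);
	return buf;
}
```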
2.7.3 Galois
Galois is a C++ and Java framework which operates on data sets consisting of dynamic, in-
terdependent data. Consequently, both its memory management and its kernel run schedul-
ing have a significant runtime overhead.
Due to this extra overhead, it is primarily useful for data which are dynamic and have
sufficiently complex dependencies.
Galois is explicitly NUMA aware and contains support for OpenMP and MPI. However, it
only supports Linux.
2.7.4 Intel TBB
Intel provides a C++ parallelisation library called Thread Building Blocks. It contains
algorithms and structures designed to simplify the creation of multithreaded programs.
It implements task-based parallelism, balanced with a task-based load balancer, and provides a memory allocator which prevents false sharing[18].
TBB is NUMA-aware. Its lack of problem specificity and its task-based balancing do, however, mean that it cannot ensure as much NUMA locality as problem-specific, data-balancing libraries such as PUMA. Consequently, it is not necessarily an ideal solution for applications where runtime is a primary consideration.
2.7.5 Cilk
Cilk is a language based on C (with other dialects based on C++) which provides meth-
ods for a programmer to identify parallel sections while leaving the runtime to perform
scheduling. The task scheduler is based on a work stealing strategy, where the tasks are
defined by the programmer.
Cilk is not explicitly NUMA aware, and because tasks are scheduled by the runtime rather
than the programmer, there is limited scope to make use of NUMA systems while minimis-
ing off-domain accesses.
2.8 LERM
Our primary case study for this work is a Lagrangian Ensemble Recruitment metamodel, as
detailed in [19], which simulates phytoplankton populations. Our reference implementation
is the result of prior work[20] that involved parallelising one such metamodel. It was
observed that the reference implementation encountered the NUMA effect, leading to a
significant reduction in parallel efficiency in the trivially parallel section when spread across
domains; this is shown in Figures 1.2 and 1.3.
The simulation (LERM) consists of three primary parts: an agent update loop; particle
management, for the creation and deletion of agents; and the environment update, which
simulates the spread and interaction of agent-caused environmental changes. Listing 2.1
shows the main algorithm implemented in Python-like pseudocode.
The three main sections roughly correspond to three different cases we may encounter:
the update loop (Figure 1.2) is primarily trivially parallel with a reduction of per-thread
data at the end; the particle management step (Figure 2.3) is partially parallelisable but
is implemented in serial in the reference implementation; and the environment update
(Figure 2.4) is mostly parallelisable but implemented in serial in both implementations
due to having a negligible impact on runtime.
We can see that the update loop has an obvious dip in parallel efficiency after it begins to
utilise cores on a different NUMA node, due to its not taking NUMA effects into account
when assigning work to each thread.
Because the particle management and environment sections are both implemented with
serial algorithms, they do not demonstrate a reduction in parallel efficiency as a result of
the NUMA effect. They do, however, begin to dominate as the update loop is distributed
across cores, especially the particle management step.
The update step is, in theory, trivially parallelisable. The particle management and envi-
ronmental update steps are implemented in our reference implementation as serial code,
but the particle management step can be parallelised. Approximately 98% of the simula-
tion is parallelised (calculated with formula 2.1) in the PUMA version; by Amdahl’s Law,
we can therefore achieve a theoretical maximum speedup of approximately 50×. Figure
1.4 shows that we achieve very close to this.
In our reference implementation, however, only 87.5% is parallelised, because particle management is serial in the original whereas, out of necessity, we have parallelised it in PUMA. By Amdahl’s Law, the maximum speedup achievable by the reference implementation is 8×. Figure 2.5 shows that we do not achieve close to this ideal runtime, because of NUMA latency.
In order to ensure that we observe the results of NUMA effects if they have an impact, we initialise the LERM simulation with 400,000 13-byte agents. Our dataset size (not taking into account metadata overhead and other considerations) is therefore approximately 20.8MB. Since the L3 cache in our test machine is 12MB, every timestep requires at least 40% of the agents to be re-read from main memory.
 1 def splitAgents():
 2     while len(agents) < minAgents:
 3         splitIntoTwo(someAgent)
 4
 5 def mergeAgents():
 6     while len(agents) > maxAgents:
 7         mergeIntoOne(someAgent, someOtherAgent)
 8
 9 def updateAgents():
10     # Trivially parallel loop
11     for agent in agents:
12         ecologyKernel(agent)
13
14     reducePerThreadData()
15
16 def particleManagement():
17     splitAgents()
18     mergeAgents()
19
20 def mixChemistry():
21     for layer in layers:
22         totalConcentration += layer.concentration
23
24 def updateEnvironment():
25     reloadPhysicsFromInitialisationFile()
26     mixChemistry()
27
28 def main():
29     initialiseEnvironment()
30
31     while i < max_timestep:
32         updateAgents()
33         particleManagement()
34         updateEnvironment()
Listing 2.1: LERM pseudocode
Figure 2.3: Parallel efficiency reached by particle management with dynamic
data using static over-allocation and not taking NUMA effects into account.
The x-axis is cores in use (1 to 12); the y-axis is parallel efficiency (0 to 100%).
The red vertical line signifies the point past which we use cores on another
NUMA node.
Figure 2.4: Parallel efficiency reached by the environment update with dynamic
data using static over-allocation and not taking NUMA effects into account.
The x-axis is cores in use (1 to 12); the y-axis is parallel efficiency (0 to 100%).
The red vertical line signifies the point past which we utilise cores from a second
NUMA node.
Figure 2.5: Total runtime with dynamic data using static storage in the reference
implementation. The x-axis is cores in use (1, 2, 4, 8); the y-axis is total time
in ms; the plot compares the actual timing against the minimum time given by
Amdahl’s Law.
Tp / Tt (2.1)
How we calculate the proportion of an application which is parallelised. Ts is
the time spent in the serial sections, Tp is the time spent in the parallel sections
and Tt = Ts + Tp is the total execution time.
Chapter 3
Design and Implementation
PUMA consists of several parts:
• A NUMA-aware dynamic memory allocator for homogeneous elements;
• An allocator for thread-local static data which cannot be freed individually. This
allocator uses pools which are located on the domain to which the core associated
with the thread belongs;
• A parallel iteration interface which applies a kernel to all elements in a PUMA set;
• A balancer which reassigns blocks of data among threads in order to balance kernel
runtime across cores.
Much of PUMA’s design was needs-driven: it was developed in parallel with its integration
into a case study (see section 2.8) and its design evolved as new requirements became clear.
The reason that PUMA provides a kernel application function rather than direct access to
the underlying memory is to enable it to prevent cross-domain accesses. We achieve this
by pinning each thread in our pool to a specific core and ensuring that when we run the
kernel across our threads, each thread can access only domain-local elements.
PUMA works under the assumption that the application involves the manipulation of sets
of homogeneous elements. In our case study, these elements are the agents within the
model, each of which represents a group within the overall population of phytoplankton,
and we use two PUMA sets, one for each of dead and alive agents.
PUMA implements parallelism by maintaining a list of elements per thread, each of which
can only be accessed by a single thread at a time.
Figure 3.1: How we lay out our internal lists
of allocated memory. The black box represents
the user-facing set structure, and each of blue
and red represents a different thread’s list of
pre-allocated memory blocks. These blocks are
the same size for each thread.
3.1 Dynamic Memory Allocator
Our initial design involved an unordered data structure which could act as a memory
manager for homogeneous elements. It would provide methods to map kernels across all
of its elements and ensure that each element would only be accessed from a processor
belonging to the NUMA domain on which it was allocated.
In order to achieve this, we use one list per thread within the user-facing PUMA set
structure (Figure 3.1). Each of these lists contains one or more blocks of memory at
least one virtual memory page long. We have a 1:1 mapping of threads to cores, enabling a
mostly lock-free design. This allows us to have correct multithreaded code while minimising
time-consuming context switches.
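Thread pinning of this kind might look as follows on Linux. This is a sketch using pthread_setaffinity_np, not PUMA's actual code; with one pinned thread per core and one element list per thread, each list is only ever touched from the core (and hence the NUMA domain) it was allocated for.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to a single core. Linux-specific sketch; returns 0
 * on success, an error number otherwise. */
int pinSelfToCore(int core)
{
	cpu_set_t set;
	CPU_ZERO(&set);
	CPU_SET(core, &set);
	return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```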
3.1.1 Memory Pools
In order to allocate memory quickly on demand, our dynamic allocator pre-allocates blocks,
each of which is one or more pages long. This has two purposes: only requesting large blocks
from the Operating System allows us to reduce the time spent on system calls; and the
smallest blocks on which system calls for the placement of data on NUMA domains can
operate are one page long and must be page aligned.
These blocks have descriptors at the start which contain information on the memory which
has been allocated from them. The descriptors also contain pointers to the next and
previous elements in their per-thread list to allow for iteration over all elements.
Currently, block size is determined by a preprocessor definition at compile time, because
this size is integral to calculating the location of metadata from an element’s address. It
could also be determined at run-time if set before any PUMA initialisation is performed
by the application.
3.1.2 Element Headers
In order to free elements without exposing too much internal state to the user, we must have
some way of mapping elements’ addresses to the blocks in which they reside. Originally,
each element had a header containing a pointer to its block’s descriptor. This introduces
a significant memory overhead, however, especially if the size of the elements is small
compared to that of a pointer.
PUMA should use few resources in order to give users as much freedom as possible in its
use. Consequently, we devised two separate strategies for mapping elements to blocks’
descriptors. The first (Figure 3.2) was based on the NUMA allocation system calls which
we were already using to allocate blocks for the thread lists. These calls (specifically
numa_alloc_onnode() and numa_alloc_local()) guarantee that allocated memory will be
page-aligned.
Figure 3.2: Our first strategy for mapping elements to block descriptors. Darker
blue represents the block’s descriptor; light blue represents a page header with
a pointer to the block’s descriptor; red represents elements; the vertical lines
represent page boundaries; and white represents unallocated space within the
block.
If we ensure that each page within a block has a header, we can store a pointer in that
header to the block’s descriptor. Finding the block descriptor for a given element then
simply involves rounding the element’s address down to the next lowest multiple of the
page size.
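The lookup can be sketched like this (hypothetical structure names; it assumes the page size is a power of two, so rounding down is a single mask):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of the first strategy: every page in a block starts
 * with a small header holding a pointer back to the block's descriptor, so
 * rounding an element's address down to a page boundary finds the header. */
struct blockDescriptor { size_t elementSize; };
struct pageHeader { struct blockDescriptor* block; };

struct blockDescriptor* descriptorForElement(void* element, size_t pageSize)
{
	/* pageSize is assumed to be a power of two, so masking off the low
	 * bits rounds down to the start of the page containing the element. */
	uintptr_t pageStart = (uintptr_t)element & ~((uintptr_t)pageSize - 1);
	return ((struct pageHeader*)pageStart)->block;
}
```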
This has two major disadvantages, however:
• In order to calculate the index of a given element or the address corresponding to an
element’s index, we must perform a relatively complex calculation (between twenty
and fifty arithmetic operations), as shown in listing 3.1, rather than simple pointer
arithmetic (up to five operations). These are common calculations within PUMA, so
minimising their complexity is critical.
• If the usable page size after each header is not a multiple of our element size, we
can have up to sizeof(element) - 1 bytes of wasted space. This is especially
problematic with elements which are larger than our pages.
void* getElement(struct pumaNode* node, size_t i)
{
	size_t pageSize = (size_t)sysconf(_SC_PAGE_SIZE);
	char* arrayStart = node->elementArray;
	size_t firstSkipIndex =
		getIndexOfElementOnNode(node,
			(char*)node + pageSize + sizeof(struct pumaHeader));
	size_t elemsPerPage =
		getIndexOfElementOnNode(node,
			(char*)node + 2 * pageSize + sizeof(struct pumaHeader))
		- firstSkipIndex;
	size_t pageNum = (i >= firstSkipIndex) *
		(1 + (i - firstSkipIndex) / elemsPerPage);
	size_t lostSpace =
		(pageNum > 0) * ((pageSize - sizeof(struct pumaNode)) % node->elementSize)
		+ (pageNum > 1) * (pageNum - 1)
			* ((pageSize - sizeof(struct pumaHeader)) % node->elementSize)
		+ pageNum * sizeof(struct pumaHeader);
	void* element = (i * node->elementSize + lostSpace + arrayStart);
	return element;
}

size_t getIndexOfElement(void* element)
{
	struct pumaNode* node = getNodeForElement(element);
	return getIndexOfElementOnNode(node, element);
}

size_t getIndexOfElementOnNode(struct pumaNode* node, void* element)
{
	size_t pageSize = (size_t)sysconf(_SC_PAGE_SIZE);
	char* arrayStart = node->elementArray;
	size_t pageNum = ((size_t)element - (size_t)node) / pageSize;
	size_t lostSpace =
		(pageNum > 0) * ((pageSize - sizeof(struct pumaNode)) % node->elementSize)
		+ (pageNum > 1) * (pageNum - 1)
			* ((pageSize - sizeof(struct pumaHeader)) % node->elementSize)
		+ pageNum * sizeof(struct pumaHeader);
	size_t index = (size_t)((char*)element - arrayStart - lostSpace)
		/ node->elementSize;
	return index;
}
Listing 3.1: Calculating the mapping between elements and their indices with
a per-page header.
Our second strategy (Figure 3.3) eliminated the need for these complex operations while
reducing memory overhead. POSIX systems provide a function (posix_memalign()) to request
a chunk of memory aligned to a certain size, as long as that size is 2^n pages long for some
integer n. If we ensure that block sizes also follow that restriction, we can allocate blockSize
bytes aligned to blockSize. Listing 3.2 shows how we calculate the mapping between elements
and their indices with this strategy.
size_t getIndexOfElement(void* element)
{
	struct pumaNode* node = getNodeForElement(element);
	return getIndexOfElementOnNode(element, node);
}

size_t getIndexOfElementOnNode(void* element, struct pumaNode* node)
{
	char* arrayStart = node->elementArray;
	size_t index = (size_t)((char*)element - arrayStart) / node->elementSize;
	return index;
}

void* getElement(struct pumaNode* node, size_t i)
{
	char* arrayStart = node->elementArray;
	void* element = (i * node->elementSize + arrayStart);
	return element;
}

struct pumaNode* getNodeForElement(void* element)
{
	struct pumaNode* node =
		(struct pumaNode*)((size_t)element &
			~((pumaPageSize * PUMA_NODEPAGES) - 1));
	return node;
}
Listing 3.2: Calculating the mapping between elements and their indices
without using headers.
Figure 3.3: Our second strategy for mapping elements to block descriptors. The
blue block represents the block’s descriptor; the red blocks represent elements;
and the vertical lines represent page boundaries. In this example, blocks are
two pages long.
3.2 Static Data
After parallelising all of the trivially parallel code in our primary case study, we found
that we were still encountering a major bottleneck. Profiling revealed that this was mostly
caused by an otherwise innocuous line in a pseudo random number generator. It was using
a static variable as the initial seed and then updating the seed each time it was called, as
shown in listing 3.3.
float rnd(float a)
{
	static int i = 79654659;
	float n;

	i = (i * 125) % 2796203;
	n = (i % (int)a) + 1.0;

	return n;
}
Listing 3.3: Pseudo-random number generator where i is a static seed.
As we increased our number of threads, writing to the seed required threads to wait for
cache synchronisation between cores, and using cores belonging to multiple NUMA domains
incurred lengthy cross-domain accesses.
The cache coherency problem could be solved to an extent using thread-local storage such
as that provided by #pragma omp threadprivate(...). However, since there are no
guarantees about the placement of thread-local static storage in relation to other threads’
variables, multiple thread-local seeds can still reside within the same cache line, leading
to false sharing and cache synchronisation. This also means that we cannot optimise for
NUMA without a more problem-specific static memory management scheme.
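For comparison, one common mitigation for the cache-line sharing alone (illustrative only, not the scheme PUMA uses) is to pad and align each thread's seed to a cache line:

```c
#include <stdalign.h>
#include <stddef.h>

/* Illustrative only: padding each thread's seed out to a 64-byte cache line
 * guarantees two threads' seeds never share a line, removing the coherency
 * ping-pong. It says nothing about which NUMA domain the line lives on;
 * that still requires placement-aware allocation. */
struct paddedSeed
{
	alignas(64) int seed;
	char pad[64 - sizeof(int)];
};
```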
We implemented a simple memory allocator which can allocate blocks of variable sizes
but not free() them individually. The lack of support for free()ing allows us to avoid
having to search for empty space within our available heap space while still allowing for
variable-sized allocations. This then places the responsibility for retaining reusable blocks
on the application developer.
This allocator is primarily for static data which is accessed regularly when running a kernel,
such as return values or seeds.
This allocator returns blocks of data which are located on the NUMA domain local to the
CPU which calls the allocation function. The main differences between it and PUMA’s
primary memory allocator are:
• The user is expected to keep track of allocated memory;
• The allocator enables variable sizes;
• Allocated blocks cannot be individually free()d.
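A minimal sketch of such a bump ("arena") allocator is shown below. The names are hypothetical, and the NUMA-local placement (e.g. via numa_alloc_local()) is replaced with plain malloc() to keep the sketch portable:

```c
#include <stdlib.h>
#include <stddef.h>

/* Bump-allocator sketch of the static-data allocator: variable-sized
 * allocations, no individual free(). In PUMA the pool would be bound to the
 * calling CPU's NUMA domain; here we use plain malloc() for portability. */
struct staticPool
{
	char* base;
	size_t used;
	size_t capacity;
};

int staticPoolInit(struct staticPool* pool, size_t capacity)
{
	pool->base = malloc(capacity);
	pool->used = 0;
	pool->capacity = capacity;
	return pool->base == NULL ? -1 : 0;
}

/* Bump the cursor; there is deliberately no per-allocation free(), so no
 * free-list search is ever needed. */
void* staticPoolAlloc(struct staticPool* pool, size_t bytes)
{
	size_t aligned = (bytes + 15) & ~(size_t)15; /* keep 16-byte alignment */
	if (pool->used + aligned > pool->capacity)
		return NULL;
	void* out = pool->base + pool->used;
	pool->used += aligned;
	return out;
}
```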
3.3 Kernel Application
PUMA does not provide any way of retrieving individual elements from its set of allocated
elements. Instead, it exposes an interface for applying kernels to all elements. This interface
also enables the specification of functions used to manipulate extra data which is to be
passed into the kernel. With this, we can manipulate the data in the set as long as our
manipulation can be done in parallel and is not order-dependent.
The extra data which is passed into the kernel is thread-local in order to avoid cache
coherency overhead and expensive thread synchronisation. Consequently, we also allow
the user to specify a reduction function which is executed after all threads have finished
running the kernel and has access to all threads’ extra data.
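The pattern can be sketched as follows (illustrative names; shown serial here, where PUMA would run the per-thread loops concurrently): each thread applies the kernel to its own elements with its own extra-data block, and the reduction combines the per-thread extra data afterwards.

```c
#include <stddef.h>

typedef void (*pumaKernel)(void* element, void* extraData);
typedef void (*pumaReduce)(void** perThreadExtra, size_t nThreads, void* out);

/* Apply `kernel` to every element of every thread's list, giving each
 * thread its own extra-data block, then combine the blocks with `reduce`. */
void applyKernelWithReduction(void** perThreadElems, size_t* counts,
                              size_t elemSize, size_t nThreads,
                              pumaKernel kernel, void** perThreadExtra,
                              pumaReduce reduce, void* reduceOut)
{
	for (size_t t = 0; t < nThreads; ++t)        /* parallel in PUMA */
	{
		char* elems = perThreadElems[t];
		for (size_t i = 0; i < counts[t]; ++i)
			kernel(elems + i * elemSize, perThreadExtra[t]);
	}
	reduce(perThreadExtra, nThreads, reduceOut); /* after all threads */
}

/* Example kernel and reduction: accumulate per-thread sums, then total. */
static void sumKernel(void* element, void* extraData)
{
	*(double*)extraData += *(double*)element;
}

static void sumReduce(void** perThreadExtra, size_t nThreads, void* out)
{
	double total = 0.0;
	for (size_t t = 0; t < nThreads; ++t)
		total += *(double*)perThreadExtra[t];
	*(double*)out = total;
}
```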
3.3.1 Load Balancer
When a kernel is run with PUMA, it first balances all of the per-thread lists at a block
level based on timing data from previous runs. If one thread has recently finished running
kernels significantly faster than other threads on average, we transfer blocks from slower
threads to it in order to increase its workload.
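The decision arithmetic might look something like this (a hypothetical sketch; the real balancer operates on PUMA's per-thread block lists and its timing history):

```c
#include <stddef.h>

/* Hypothetical balancing rule: pick the slowest and fastest threads from the
 * last run's timings and decide how many blocks to move between them.
 * Returns the number of blocks to transfer (0 means already balanced). */
size_t blocksToTransfer(const double* lastRunSecs, const size_t* nBlocks,
                        size_t nThreads, size_t* fromThread, size_t* toThread)
{
	size_t slowest = 0, fastest = 0;
	for (size_t t = 1; t < nThreads; ++t)
	{
		if (lastRunSecs[t] > lastRunSecs[slowest]) slowest = t;
		if (lastRunSecs[t] < lastRunSecs[fastest]) fastest = t;
	}
	*fromThread = slowest;
	*toThread = fastest;
	if (lastRunSecs[slowest] <= 0.0 || nBlocks[slowest] < 2)
		return 0;

	/* Move a fraction of the slow thread's blocks proportional to the
	 * runtime gap, always leaving it at least one block. */
	double imbalance =
		(lastRunSecs[slowest] - lastRunSecs[fastest]) / lastRunSecs[slowest];
	size_t move = (size_t)(imbalance * nBlocks[slowest] / 2);
	if (move >= nBlocks[slowest])
		move = nBlocks[slowest] - 1;
	return move;
}
```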
3.4 Challenges
3.4.1 Profiling
One of the most challenging aspects of developing PUMA was identifying the location
and type of bottlenecks. Most profiling tools we encountered, such as Intel’s VTune
Amplifier[21] and GNU gprof[22] are time- or cycle-based. VTune also provides metrics to
do with how OpenMP is utilised. However, finding hotspots of cross-domain activity was
still a matter of making educated guesses based on abnormal timing results from profilers.
VTune and a profiler called Likwid[23] also provide access to hardware counters, which can
be useful for profiling cross-domain accesses. However, without superuser access, it can
be difficult to obtain hardware counter-based results from these tools which can be used
for profiling; only the counters’ total values from the entire run are shown, meaning that
identifying hotspots is still a matter of guesswork.
In section 6.1 we discuss possible approaches to implementing userspace memory access
profiling tools in order to reduce the amount of guesswork required.
3.4.2 Invalid Memory Accesses
Because PUMA includes a memory allocator, we encountered several bugs involving invalid
memory accesses and corrupted header data.
In order to detect these bugs, we use Valgrind’s[24] error-detection interface to make our
allocator compatible with Valgrind’s memcheck utility. This enables Valgrind to alert the
user if they read from uninitialised memory or write to unallocated or free()d memory.
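Integration of this kind typically uses Valgrind's client-request macros from valgrind/memcheck.h. In the sketch below the pool functions are hypothetical and the PUMA_VALGRIND guard is our assumption (mirroring the optional Valgrind dependency); when support is compiled out, the macros collapse to no-ops:

```c
/* When Valgrind support is compiled out, the client requests become no-ops. */
#ifdef PUMA_VALGRIND
#include <valgrind/memcheck.h>
#else
#define VALGRIND_MAKE_MEM_NOACCESS(addr, len) ((void)0)
#define VALGRIND_MALLOCLIKE_BLOCK(addr, len, rz, zeroed) ((void)0)
#define VALGRIND_FREELIKE_BLOCK(addr, rz) ((void)0)
#endif

#include <stddef.h>

/* A freshly reserved pool: nothing in it may be touched yet. */
void poolInit(char* pool, size_t poolBytes)
{
	VALGRIND_MAKE_MEM_NOACCESS(pool, poolBytes);
}

/* Carve an element out of the pool and tell memcheck that only this element
 * is now usable; the rest of the pool stays invalid to access. */
void* poolAllocElement(char* pool, size_t offset, size_t elemSize)
{
	void* element = pool + offset;
	VALGRIND_MALLOCLIKE_BLOCK(element, elemSize, 0, 0);
	return element;
}

/* Marks the element as freed so later reads and writes are reported. */
void poolFreeElement(void* element)
{
	VALGRIND_FREELIKE_BLOCK(element, 0);
}
```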
This is not fully implemented, however; ideally, we would have Valgrind protect all memory
containing metadata. However, it is possible for multiple threads to read each other’s
metadata at once (without writing to it). Reading another thread’s metadata requires
marking it as valid before reading and marking it as invalid after.
Due to the non-deterministic nature of thread scheduling, this could sometimes lead to
interleaving of validating and invalidating memory in such a way that between a thread
validating memory and reading it, another thread may have read the memory and then
invalidated it.
We decided that since overwriting this per-thread metadata was unlikely compared to other
memory-access bugs, it was sensible to leave these blocks of memory unprotected in order
to avoid false positives in Valgrind’s output.
3.4.3 Local vs. Remote Testing
NUMA-based architectures are not particularly prevalent in current consumer computers.
Consequently, the majority of our testing of the NUMA-based sections of PUMA had to
be performed while logged into a remote server.
It is, however, possible to perform some of this NUMA-based testing on a non-NUMA
machine: the qemu virtual machine has a configuration option which simulates NUMA
topologies even on non-NUMA hosts[25]. While this is not particularly useful for gathering
timing data, it can be used to test the robustness and correctness of NUMA-aware
applications.
3.5 Testing and Debugging
We used various methods to test and debug PUMA. For testing, we wrote a short test
suite covering several functions which had caused difficult-to-debug errors early in
development. We also used LERM as a more comprehensive testing platform, comparing
the biomass in the PUMA version with that in the reference implementation as a metric
of functional correctness.
For debugging, we used gdb and our Valgrind compatibility with both LERM and our
unit tests in order to identify bugs within PUMA itself.
We used system timers to assess whether each section’s parallel efficiency met our expecta-
tions. We also used both VTune and Likwid to collect more granular timing data, allowing
us to identify bottlenecks within both PUMA and LERM. PUMA bottlenecks acted as
indicators for what to optimise and LERM bottlenecks helped with the identification of
useful features for PUMA.
3.6 Compilation
Compilation of PUMA requires a simple make invocation in the PUMA root directory. The
make targets are as follows:
• all: Build PUMA and docs and run unit tests
• doc: Build documentation with doxygen
• no_test: Build PUMA without running unit tests
• clean: Clear the working tree
• docs_clean: Clear all built documentation
3.6.1 Dependencies
PUMA relies on the following:
• libnuma
• A C99-compatible compiler
• Valgrind (optional)
• OpenMP (optional)
• Doxygen (optional, for documentation)
3.6.2 Configuration
The following are configuration options for public use. For options which are either enabled
or disabled, 1 enables and 0 disables.
• PUMA_NODEPAGES: Specifies the number of pages to allocate per chunk in the
per-thread chunk list. Default: 2
• OPENMP: Enable OpenMP. If disabled, we use PUMA’s pthread-based thread-pooling
solution (experimental). Default: enabled
• STATIC_THREADPOOL: If enabled and we are not using OpenMP, we share one
thread pool amongst all instances of PUMASet. Default: disabled
• BINDIR: Where we place the built shared library. Default: {pumadir}/bin
• VALGRIND: Whether we build with Valgrind support. Default: enabled
The following is a configuration option for use during PUMA development. It may severely
hurt performance, so it should never be used in performance-critical builds.
• DEBUG: Enable assertions. Default: disabled
3.7 Getting Started
We present a short walkthrough on how to write a simple PUMA-based application in
appendix C. It generates a random data set and finds its standard deviation by calculating
the sum of all of the elements and the sum of their squares.
We also include an API reference in the appendix.
Chapter 4
LERM Parallelisation
The basic LERM model (section 2.8) is concerned primarily with the simulation of agents in
a column of water 500m deep. The column is split into layers, with each layer corresponding
to one metre of the column.
When parallelising LERM, the naïve approach involves domain decomposition: we split
layers equally between processors and each processor operates only on agents within its
assigned layers.
This has the problem, however, of encouraging inter-thread communication; agents may
move between layers, requiring the processor which moves a given agent to notify the newly
responsible thread. Given that any or all agents can move between layers during an update,
this potentially requires communication for every agent, leading to a large amount of time
wasted by processors which are waiting for access to synchronisation constructs.
This is not scalable beyond a certain number of processors (in this case, 500) without
subdividing layers. Also, the distribution of agents between layers is likely not to be fully
uniform, meaning that the workload will be unbalanced between processors.
Since the size of the problem is dictated by the number of agents rather than the number of
layers, and the number of agents is variable, a more scalable solution involves distributing
agents between processors. Since agents do not have to move between the domains managed
by different processors, there is no longer an inter-processor communication overhead.
4.1 Scalability
The isoefficiency function (equation 4.1) relates parallel efficiency to problem size as the
number of processors in use scales. One of its benefits is that it provides a way of exploring
how problem size must scale with the number of processors in order to maintain the same
parallel efficiency.

E = 1 / (1 + To / (W × tc)) (4.1)

The isoefficiency function. W is the problem size, To is the serial overhead, tc
is the cost of execution for each operation and E is the parallel efficiency[26].

Ω(W) = C × To (4.2)

Workload growth for maintaining a fixed efficiency. W is the problem size, To
is the serial overhead and C is a constant representing fixed efficiency[27].
Equation 4.2 shows a mapping between serial overhead and workload. If the equation holds,
i.e. the workload can be increased at least as quickly as the serial overhead as we increase
the number of processors in use, we say that an algorithm has perfect scalability. In other
words, we can maintain a constant efficiency as we increase processors.
The serial sections in the PUMA-based LERM implementation are all either O(n) (envi-
ronmental update) or O(p) where p is the number of processors in use. This means that To
and W are not directly related, so satisfying equation 4.1 requires scaling W proportionally
to To.
Since this is trivially sustainable as we increase the number of processors, the PUMA-based
LERM implementation can, in theory, maintain a constant efficiency.
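A short worked instance of this argument, under the assumption that the dominant serial overhead is the O(p) per-thread reduction (i.e. To = cp for some constant c):

```latex
E = \frac{1}{1 + \frac{T_o}{W t_c}}
  = \frac{1}{1 + \frac{c\,p}{W t_c}}
\quad\Longrightarrow\quad
W = \frac{c\,p}{t_c} \cdot \frac{E}{1 - E}
```

For any fixed target efficiency E, the required problem size W therefore grows only linearly in p, which for LERM simply means adding agents in proportion to the number of cores.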
4.2 Applying PUMA to LERM
Listing 4.1 shows Python-like pseudocode for the PUMA-based version of LERM. In order
to adapt LERM to use PUMA, we must first identify all sections which operate on agents
and adapt them to use the PUMA-based abstractions for running kernels, rather than
iterating over all agents and applying the kernel manually. Lines 10, 13 and 17 show
instances of this change when compared with lines 2, 6 and 11 respectively from listing 2.1.
These areas are primarily in the update and particle management steps. We also identify
any reductions performed after iterating over the agents and use PUMA’s reduction mech-
anism to perform these automatically. Line 17 shows where we tell PUMA to perform the
reduction after the update loop.
 1 def splitKernel(agent):
 2     if len(agents) < minAgents:
 3         splitIntoTwo(agent)
 4
 5 def mergeKernel(agent):
 6     if len(agents) > maxAgents:
 7         mergeIntoOne(agent, smallestAgent)
 8
 9 def mergeAgents():
10     runKernel(mergeKernel)
11
12 def splitAgents():
13     runKernel(splitKernel)
14
15 def updateAgents():
16     # Trivially parallel loop
17     runKernel(ecologyKernel, reduction=reducePerThreadData)
18
19
20 # The rest is the same as in the original implementation
21 def particleManagement():
22     splitAgents()
23     mergeAgents()
24
25 def mixChemistry():
26     for layer in layers:
27         totalConcentration += layer.concentration
28
29 def updateEnvironment():
30     reloadPhysicsFromInitialisationFile()
31     mixChemistry()
32
33 def main():
34     initialiseEnvironment()
35
36     while i < max_timestep:
37         updateAgents()
38         particleManagement()
39         updateEnvironment()
Listing 4.1: PUMA-based LERM pseudocode
Chapter 5
Evaluation
We examine the success of PUMA by two primary metrics: first, we compare our case
study as implemented with PUMA to the reference implementation, specifically in relation
to plankton biomass; second, we compare measured profiling data to our expectations and
to the reference.
5.1 Correctness
Figure 5.1 shows the biomass over time in the PUMA implementation. Even across several
runs with random initial seeds, it does not significantly deviate from the average. We
compare the average biomass in the PUMA implementation with the same metric in the
reference implementation in Figure 5.2, which shows that they follow a similar pattern and
the difference between the two is at most 7.2% of the reference’s biomass.
The differences are due to two factors:
• PUMA manually manages the workload for each thread. Since agents interact with
per-thread environment variables and the order of iteration over the agents is unde-
fined, the exact result of the simulation is non-deterministic.
• PUMA enforces a parallel programming model when interacting with the agents
it manages, because all iteration over agents must be expressed in the form of a
parallelisable kernel. Because of this, we had to reimplement the particle management
step in this form, which led to different behaviour on a microscopic scale while
macroscopically maintaining correctness.
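The first factor, order-dependent floating-point accumulation, can be demonstrated in a few lines of C. This is an illustrative sketch: the values are contrived to expose the effect and are not taken from the simulation.

```c
#include <assert.h>

/* Summing the same values in a different order can change the result,
 * because floating-point addition is not associative. This is why an
 * undefined iteration order over agents makes the simulation
 * non-deterministic at the bit level while remaining macroscopically
 * correct. */
double sumForward(const double* xs, int n)
{
	double acc = 0.0;
	for(int i = 0; i < n; ++i)
		acc += xs[i];
	return acc;
}

double sumBackward(const double* xs, int n)
{
	double acc = 0.0;
	for(int i = n - 1; i >= 0; --i)
		acc += xs[i];
	return acc;
}
```

Mixing very large and very small magnitudes (e.g. `{1e16, 1.0, -1e16, 1.0}`) makes the two orders disagree, because the small terms are absorbed at different points in the accumulation.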
A benefit of having reimplemented particle management is that it prevents serial particle
management from dominating performance results, allowing us to focus on the NUMA
Figure 5.1: Biomass of plankton in the PUMA-based LERM simulation over time
(x-axis: timestep, each corresponding to 30 minutes; y-axis: plankton biomass).
Black lines are the maximum and minimum across all runs; red is the average.
problem. The new method has, however, not been rigorously statistically analysed because
that is beyond the scope of this project (see section 6.1).
5.2 Profiling
Our profiling data consists of two parts, both of which we compare with the reference
LERM implementation. The first is the proportion of total memory accesses which are
remote for each number of cores. We expect this to be higher when using cores belonging
to a second domain, because reduction requires accesses to data assigned to all cores. It
should, however, be significantly lower than in the reference implementation.
The second is the parallel efficiency (formula A.1) across cores. In the trivially parallel
section, we expect it to remain at approximately 100% even as we use cores belonging to
a second domain.
Figure 5.2: Comparison between the average biomasses across runs in the reference
and PUMA implementations of the LERM simulation over time (x-axis: timestep, each
corresponding to 30 minutes; left y-axis: plankton biomass; right y-axis: difference
as % of reference).

Figure 1.3 shows a comparison between off-domain accesses in the PUMA-based LERM
implementation and the reference implementation. In PUMA, we reduce off-domain
accesses by 75% from the reference implementation.
Our timing data are presented in Figures 5.5, 5.6 and 5.7. The most important section
here is the mostly trivially parallel update; unlike in the reference implementation, we have
no noticeable NUMA-based reduction in parallel efficiency.
Both implementations exhibit similar parallel efficiency for the environmental update step
(Figure 5.7), because in both cases it is implemented as a serial algorithm.
Figures 5.3 and 5.4 each show the total time taken for the simulation when run on up
to twelve cores on a log-log plot, along with the theoretical minimum time according
to Amdahl's Law, calculated with formula A.2. It is important to note that, while we
have significantly reduced the time taken to run LERM on a single core when we compare
the PUMA-based implementation with the reference implementation, this is not a direct
result of PUMA. Instead, it is a result of slightly different scheduling causing changes in
which paths are taken through the primary update kernel. The total timing graphs are
best considered in isolation from each other, to see how they conform to the Amdahl's-
Law-dictated minimum timing in each case.

Figure 5.3: Total runtime with dynamic data using PUMA (log-log; x-axis: cores in
use; y-axis: total time in ms; series: ideal, parallel update, update balancing,
particle management, environment update).
5.2.1 Load Balancing
In order to assess the usefulness of our load balancer, we tested with load balancing
both enabled and disabled. Figure 5.8 shows that load balancing has a small but
noticeable effect on runtime in the trivially parallel step of LERM.
5.3 Known Issues
While our implementation provides an effective solution to the NUMA problem, we still
have areas in which it can be improved.
Figure 5.4: Total runtime with dynamic data using static over-allocation and not
taking NUMA effects into account (log-log; x-axis: cores in use; y-axis: total time
in ms; series: ideal, update, particle management, environment update).
Figure 5.5: Parallel efficiency reached by agent updates in the PUMA implementation
of LERM vs in the reference implementation, with and without reduction (x-axis:
cores in use; y-axis: parallel efficiency of the update, %). The red vertical line
signifies the point past which we utilise cores on a second NUMA node.
In this instance, reduction is the necessary consolidation of per-thread environ-
mental changes from the update loop into the global state.
The margin of error is calculated by finding the percentage standard deviation in
the original times and applying it to the parallel efficiency.
5.3.1 Thread Pooling
PUMA relies on thread pinning persisting beyond initialisation. If new threads are
created each time we execute code in parallel, the pinning is no longer persistent.
Consequently, threads may be moved to other cores at the discretion of the operating
system's scheduler, leading to bugs which are difficult to reproduce.
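Persistent pinning can be sketched with the Linux-specific `pthread_setaffinity_np` call, which a POSIX backend could plausibly use; the helper name `pinSelfToCpu` is ours, not PUMA's.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to a single CPU. If worker threads are created
 * once and pinned at pool initialisation, NUMA-local allocations stay
 * local; if threads are respawned for each parallel section, the OS
 * scheduler may migrate them and silently turn local memory accesses
 * into remote ones. Returns 0 on success. */
int pinSelfToCpu(int cpu)
{
	cpu_set_t set;
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```

After a successful call, `sched_getcpu()` reports the pinned CPU; a thread pool need only do this once per worker at startup.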
Figure 5.6: Parallel efficiency reached by particle management in the PUMA
implementation of LERM vs in the reference implementation (x-axis: cores in use;
y-axis: parallel efficiency, %).
We provide a custom thread pool implementation because the OpenMP standard does
not specify whether threads are reused between parallel sections. Both the Intel and
GNU implementations of OpenMP currently reuse threads which have been previously
spawned[28][29], but the standard allows for new threads to be spawned each time a
parallel section is encountered.
Alternatively, the user may specify a CPU affinity string for both the GNU and Intel
OpenMP implementations as an environment variable. This has the disadvantage,
however, of requiring extra parameters at program invocation and of removing control
from the programmer.
However, our thread pool implementation does not scale as well as OpenMP (Figure 5.9)
and relies on pthreads, meaning that it is not natively supported on Windows. Conse-
quently, we would like to optimise the custom thread pool, or attempt to find an existing
threading library in which we do not have to rely on non-guaranteed behaviour.
Figure 5.7: Parallel efficiency reached by environment update in the PUMA
implementation of LERM vs in the reference implementation (x-axis: cores in use;
y-axis: parallel efficiency, %).
5.3.2 Parallel Balancing
In order to ensure that no thread has a significantly longer runtime than any other,
we implement workload balancing using heuristics based on previous kernel runtimes.
While this has proven effective, it is non-optimal in that the balancer's own
parallelisable sections have not themselves been parallelised.
Since our balancing algorithm transfers ownership of memory among cores on the same
NUMA domain first before performing inter-domain copies, it could be parallelised through
domain decomposition; each NUMA domain would be internally balanced by a separate
thread with a serial cross-domain reduction at the end.
Currently, by Amdahl’s Law[10], the balancer limits the parallel speedup we can achieve,
and the balancing time increases as we use more cores.
Figure 5.8: Parallel efficiency reached by the update step in the PUMA
implementation of LERM with and without load balancing (x-axis: cores in use;
y-axis: parallel efficiency, %).
Figure 5.9: Parallel efficiency reached by the trivially parallel section of LERM
when using OpenMP (blue) and PUMA's own thread pool (red) (x-axis: cores in use;
y-axis: parallel efficiency, %).
Chapter 6
Conclusions
We have presented a framework which allows users with little or no knowledge of the
underlying topology and memory hierarchy of NUMA-based systems to develop software
which takes advantage of the available hardware while automatically preventing cache
coherency overhead and cross-domain accesses within parallel kernels.
Its uniqueness lies primarily in the class of problems which it tackles; as discussed in
section 2.7, existing solutions are tailored to several other classes of problems. We
have explored solutions for operating on sets of static, independent data (section 2.7.2)
and on graphs of dynamic data with complex dependency hierarchies (section 2.7.3), none of
which are suitable for dynamic, independent data sets such as those used in branches of
agent-based modelling.
The availability of a solution tailored to this sort of problem could help with the rapid
development of scientific applications, leading to easier research and simulation.
6.1 Future Work
PUMA is far from complete; in particular, we would like to address the issues raised in
section 5.3.
We have designed PUMA to abstract away OS-specific interfaces, internally, for simplicity.
While the systems on which PUMA has been tested are POSIX-based, Windows also
provides NUMA libraries. In the future, it would be useful to port PUMA to Windows so
that applications using PUMA are not bound to POSIX systems.
A major feature which would make PUMA more able to take advantage of modern dis-
tributed systems is MPI support. This would mainly require changes to the balancing and
reduction sections of kernel application, and would enable further parallelisation.
In order to help PUMA adoption in the scientific community, we would like to create
bindings for languages such as Python and Fortran, both of which are prevalent in scientific
computing. Tools exist for both languages to interface with C functions, so this should
require very little work in exchange for broader applicability of PUMA.
As mentioned in chapter 5, our parallelised version of LERM’s particle management step
has not been rigorously statistically analysed. It would be useful to analyse the changes in
order to assess whether the current PUMA implementation can be adapted to larger sim-
ulations. If not, adaptation would require different, possibly more complex parallelisation
methods.
In section 3.4.1, we discuss the potential usefulness of NUMA memory access profiling.
During PUMA’s development, we briefly explored various methods for the creation of a
profiler which would not require superuser privileges and would perform line-by-line pro-
filing of memory accesses, specifically identifying spots where many cross-domain accesses
were performed and where cache synchronisation dominated timing. Unfortunately, it was
too far outside the scope of PUMA to realistically explore in depth.
We examined two possible strategies for the implementation of such a profiler:
• Using some debugging library (such as LLDB’s C++ API[30]) to trap every memory
access and determine the physical location of the accessed address in order to count
off-domain accesses;
• Building on Valgrind, which translates machine code into its own RISC-like language
before executing the translated code, to count off-domain accesses. Valgrind could
also be used to examine cache coherency latency by adapting Cachegrind, a tool
which profiles cache utilisation.
Bibliography
[1] Gordon E. Moore. Cramming more components onto integrated circuits. Electronics
Magazine, 1965.
[2] Wikimedia Commons. Transistor count and Moore's law, 2011.
[3] X. Guo, G. Gorman, M. Lange, L. Mitchell, and M. Weiland. Exploring the
thread-level parallelisms for the next generation geophysical fluid modelling
framework Fluidity-ICOM. Procedia Engineering, 61:251–257, 2013.
[4] http://www.metoffice.gov.uk/news/in-depth/supercomputers.
[5] C. Vollaire, L. Nicolas, and A. Nicolas. Parallel computing for the finite element
method. Eur. Phys. J. AP, 1(3):305–314, 1998.
[6] http://cs.nyu.edu/~lerner/spring10/projects/NUMA.pdf.
[7] Sunny Y. Auyang. Foundations of Complex-system Theories: In Economics, Evolu-
tionary Biology, and Statistical Physics. Cambridge University Press, 1999.
[8] U. Berger and H. Hildenbrandt. A new approach to spatially explicit modelling of
forest dynamics: spacing, ageing and neighbourhood competition of mangrove trees.
Ecological Modelling, 132:287–302, 2000.
[9] U. Saint-Paul and H. Schneider. Mangrove Dynamics and Management in North
Brazil. Springer Science & Business Media, 2010.
[10] Gene M. Amdahl. Validity of the single processor approach to achieving large scale
computing capabilities. In Proceedings of the April 18–20, 1967, Spring Joint
Computer Conference (AFIPS '67 Spring), 1967.
[11] John Von Neumann. First Draft of a Report on the EDVAC - https://web.archive.org/web/20130314123032/http://qss.stanford.edu/~godfrey/vonNeumann/vnedvac.pdf.
[12] http://www.archer.ac.uk/about-archer/.
[13] http://www.cray.com/sites/default/files/resources/cray_xc40_specifications.pdf.
[14] Simon J. Cox, James T. Cox, Richard P. Boardman, Steven J. Johnston, Mark Scott,
and Neil S. O'Brien. Iridis-pi: a low-cost, compact demonstration cluster. Cluster
Computing, 2013.
[15] http://techpubs.sgi.com/library/dynaweb_docs/0640/SGI_Developer/books/OrOn2_PfTune/sgi_html/ch08.html.
[16] http://www.oerc.ox.ac.uk/projects/op2.
[17] http://www.oerc.ox.ac.uk/sites/default/files/uploads/ProjectFiles/OP2/OP2_Users_Guide.pdf.
[18] https://software.intel.com/en-us/intel-tbb/details.
[19] J.D. Woods. The lagrangian ensemble metamodel for simulating plankton ecosystems.
Progress in Oceanography, 67(1-2):84–159, 2005.
[20] Robert Kruszewski. Accelerating agent-based python models. Master’s thesis, Imperial
College London.
[21] https://software.intel.com/en-us/intel-vtune-amplifier-xe.
[22] https://sourceware.org/binutils/docs/gprof/.
[23] https://code.google.com/p/likwid/.
[24] http://valgrind.org/.
[25] http://linux.die.net/man/1/qemu-kvm.
[26] Ananth Grama, Anshul Gupta, and Vipin Kumar. Isoefficiency function: A scalability
metric for parallel algorithms and architectures, 1993.
[27] Peter Hanuliak and Michal Hanuliak. Analytical modelling in parallel and distributed
computing, pages 101–102. Chartridge Books Oxford, 2014.
[28] https://software.intel.com/en-us/forums/topic/382683.
[29] https://software.intel.com/en-us/forums/topic/382683.
[30] http://lldb.llvm.org/cpp_reference/html/index.html.
Appendix A
Methods For Gathering Data
All performance data are gathered using the Imperial College High Performance Computing
service (unless explicitly stated otherwise). Code was run on the Cx1 general-purpose
cluster using a node with the following hardware:
• Two six-core Intel® Xeon® X5650 processors[31], each with:
– 2.66GHz clock speed
– 32KB L1 instruction cache per core
– 32KB L1 data cache per core
– 256KB L2 cache per core
– 12MB shared L3 (last-level) cache
• Two NUMA domains, one for each processor
• Memory limited to 1GB by the qsub queuing system
likwid-perfctr was used to gather information from hardware counters; these were
primarily used to count cross-domain accesses via the UNC_QHL_REQUESTS_REMOTE_READS
counter and local accesses via the UNC_QHL_REQUESTS_LOCAL_READS counter.
All data were collected by averaging results over ten runs.
Formulae:
• Parallel efficiency:

100 × T1 / (Tn × n)   (A.1)

where Tn is the time taken to run on n cores.
• Amdahl's Law theoretical minimum runtime:

T1 × Ps + T1 × Pp / n   (A.2)

where T1 is the time taken to run on one core, Ps is the proportion of the program
which is serial, Pp is the proportion which is parallelisable, and n is the number of
cores.
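Formulae A.1 and A.2 translate directly into C. The sketch below uses illustrative numbers only; the function names are ours.

```c
#include <assert.h>

/* Formula A.1: parallel efficiency (%) from the single-core time t1 and
 * the time tn measured on n cores. */
double parallelEfficiency(double t1, double tn, int n)
{
	return 100.0 * t1 / (tn * n);
}

/* Formula A.2: Amdahl's-Law minimum runtime on n cores, given the serial
 * proportion ps and the parallelisable proportion pp (ps + pp == 1). */
double amdahlMinTime(double t1, double ps, double pp, int n)
{
	return t1 * ps + t1 * pp / n;
}
```

For example, a program with a 10% serial proportion that takes 100 time units on one core can take no less than 0.1 × 100 + 0.9 × 100 / 4 = 32.5 units on four cores, however well the parallel part scales.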
Appendix B
API Reference
B.1 PUMA Set Management
struct pumaSet* createPumaSet(size_t elementSize, size_t numThreads, char* threadAffinity);
Creates a new struct pumaSet.
Arguments:
elementSize Size of each element in the set.
numThreads The number of threads we want to run pumaSet on.
threadAffinity An affinity string specifying the CPUs to which to bind threads. Can
contain numbers separated either by commas or dashes. “i-j” means
bind to every cpu from i to j inclusive. “i,j” means bind to i and j.
Formats can be mixed: for example, “0-3, 6, 10, 12, 15, 13” is valid.
If NULL, binds each thread to the CPU whose number matches the
thread's ID (thread 0 to CPU 0, thread 1 to CPU 1, and so on).
If non-NULL, must specify at least as many CPUs as there are threads.
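The affinity format described above can be parsed with standard C. The following is an illustrative re-implementation of the described grammar, not PUMA's own parser.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Parse an affinity string such as "0-3, 6, 10" into an array of CPU
 * numbers. "i-j" expands to every CPU from i to j inclusive; "i,j" means
 * the CPUs i and j. Returns the number of CPUs written to out (capped at
 * maxCpus), or -1 on malformed input. */
int parseAffinity(const char* str, int* out, int maxCpus)
{
	int count = 0;
	while(*str != '\0')
	{
		/* Skip separators (commas and whitespace). */
		while(isspace((unsigned char)*str) || *str == ',')
			++str;
		if(*str == '\0')
			break;
		if(!isdigit((unsigned char)*str))
			return -1;

		char* end;
		long lo = strtol(str, &end, 10);
		long hi = lo;
		str = end;
		if(*str == '-')
		{
			/* A range "i-j": expand it below. */
			hi = strtol(str + 1, &end, 10);
			if(end == str + 1)
				return -1;
			str = end;
		}
		for(long cpu = lo; cpu <= hi && count < maxCpus; ++cpu)
			out[count++] = (int)cpu;
	}
	return count;
}
```

With this sketch, the documented example "0-3, 6, 10" yields the six CPUs 0, 1, 2, 3, 6 and 10.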
void destroyPumaSet(struct pumaSet* set);
Destroys and frees memory from the struct pumaSet.
size_t getNumElements(struct pumaSet* set);
Returns the total number of elements in the struct pumaSet.
typedef size_t (splitterFunc)(void* perElemBalData, size_t numThreads, void* extraData);
Signature for a function which, given an element, the total number of threads and, option-
ally, a void pointer, will specify the thread with which to associate the element.
Arguments:
perElemBalData Per-element data passed into pumallocManualBalancing() which
enables the splitter to choose the placement of the associated element.
numThreads The total number of threads in use.
extraData Optional extra data, set by calling pumaSetBalancer().
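As an illustration, a splitter matching this signature might distribute elements round-robin by an integer ID passed as perElemBalData. This splitter is hypothetical and not part of PUMA.

```c
#include <assert.h>
#include <stddef.h>

/* A splitter matching the splitterFunc signature: it reads an integer ID
 * out of the per-element balancing data and assigns the element to
 * thread (id % numThreads). The extraData parameter is unused here. */
size_t roundRobinSplitter(void* perElemBalData, size_t numThreads, void* extraData)
{
	(void)extraData;
	size_t id = *(size_t*)perElemBalData;
	return id % numThreads;
}
```

A pointer to such a function would be passed as the splitter argument of pumaSetBalancer(), and the ID as the balData argument of pumallocManualBalancing().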
void pumaSetBalancer(struct pumaSet* set, bool autoBalance, splitterFunc* splitter, void* splitterExtraData);
Sets the balancing strategy for a struct pumaSet.
Arguments:
set Set to set the balancing strategy for.
autoBalance Whether to automatically balance the set across threads prior to
each kernel run.
splitter A pointer to a function which determines the thread with which to
associate new data when pumallocManualBalancing() is called.
splitterExtraData A void pointer to be passed to the splitter function each time it
is called.
B.2 Memory Allocation
void* pumalloc(struct pumaSet* set);
Adds an element to the struct pumaSet and returns a pointer to it. The new element is
associated with the CPU on which the current thread is running.
void* pumallocManualBalancing(struct pumaSet* set, void* balData);
Adds an element to the struct pumaSet and returns a pointer to it. Passes balData to
the set’s splitter function to determine the CPU with which to associate the new element.
void* pumallocAutoBalancing(struct pumaSet* set);
Adds an element to the struct pumaSet and returns a pointer to it. Automatically asso-
ciates the new element with the CPU with the fewest elements.
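The placement policy described above amounts to an argmin over per-CPU element counts. A sketch of the idea follows; the helper is ours, not PUMA's internal code.

```c
#include <assert.h>
#include <stddef.h>

/* Return the index of the CPU currently holding the fewest elements,
 * mirroring the policy pumallocAutoBalancing() is described as using.
 * Ties resolve to the lowest-numbered CPU. */
size_t cpuWithFewestElements(const size_t* counts, size_t numCpus)
{
	size_t best = 0;
	for(size_t i = 1; i < numCpus; ++i)
		if(counts[i] < counts[best])
			best = i;
	return best;
}
```

Keeping allocations spread this way bounds the per-CPU imbalance introduced at allocation time, before the runtime balancer ever runs.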
void pufree(void* element);
Frees the specified element from its set.
B.3 Kernel Application
struct pumaExtraKernelData
{
	void* (*extraDataConstructor)(void* constructorData);
	void* constructorData;
	void (*extraDataDestructor)(void* data);
	void (*extraDataThreadReduce)(void* data);
	void (*extraDataReduce)(void* retValue, void* data[], unsigned int nThreads);
	void* retValue;
};
A descriptor of functions which handle extra data for kernels to pass into runKernel().
Members:
extraDataConstructor A per-thread constructor for extra data which is passed into
the kernel.
constructorData A pointer to any extra data which may be required by the
constructor. May be NULL.
extraDataDestructor A destructor for data created with
extraDataConstructor().
extraDataThreadReduce A finalisation function which is run after the kernel on a per-
thread basis. Takes the per-thread data as an argument.
extraDataReduce A global finalisation function which is run after all threads
have finished running the kernel. Takes retValue, an array
of the extra data for all threads and the number of threads
in use.
retValue A pointer to a return value for use by extraDataReduce.
May be NULL.
void initKernelData(struct pumaExtraKernelData* kernelData,
	void* (*extraDataConstructor)(void* constructorData),
	void* constructorData,
	void (*extraDataDestructor)(void* data),
	void (*extraDataThreadReduce)(void* data),
	void (*extraDataReduce)(void* retValue, void* data[], unsigned int nThreads),
	void* retValue);
Initialises kernelData. Any or all of the arguments after kernelData may be NULL. Any
NULL functions are set to dummy functions which do nothing.
extern struct pumaExtraKernelData emptyKernelData;
A dummy descriptor for extra kernel data. Causes NULL to be passed to the kernel in place
of extra data.
typedef void (*pumaKernel)(void* element, void* extraData);
The type signature for kernels which are to be run on a struct pumaSet.
Arguments:
element The current element in our iteration.
extraData Extra information specified by our extra data descriptor.
void runKernel(struct pumaSet* set, pumaKernel kernel, struct pumaExtraKernelData* extraDataDetails);
Applies the given kernel to all elements in a struct pumaSet.
Arguments:
set The set containing the elements to which we want to apply our
kernel.
kernel A pointer to the kernel to apply.
extraDataDetails A pointer to the structure specifying the extra data to be passed
into the kernel.
void runKernelList(struct pumaSet* set, pumaKernel kernels[], size_t numKernels, struct pumaExtraKernelData* extraDataDetails);
Applies the given kernels to all elements in a struct pumaSet. Kernels are applied in the
order in which they are specified in the array.
Arguments:
set The set containing the elements to which we want to apply our
kernels.
kernels An array of kernels to apply.
numKernels The number of kernels to apply.
extraDataDetails A pointer to the structure specifying the extra data to be passed
into the kernels.
B.4 Static Data Allocation
void* pumallocStaticLocal(size_t size);
Allocates thread-local storage which resides on the NUMA domain to which the CPU which
executes the function belongs.
Arguments:
size The number of bytes we want to allocate.
void pumaDeleteStaticData(void);
Deletes all static data associated with the current thread.
Appendix C
Getting Started: Standard
Deviation Hello World!
In lieu of the traditional “Hello World” introductory program, we present a PUMA-based
program which generates a large set of random numbers between 0 and 1 and uses the
reduction mechanism of PUMA to calculate the set’s standard deviation.
In order to calculate the standard deviation, we require three things: a kernel, a con-
structor for the per-thread data and a reduction function. In the constructor, we use the
pumallocStaticLocal() function to allocate a static variable on a per-thread basis which
resides in memory local to the core to which each thread is pinned.
This interface for allocating thread-local data is only intended for static data
whose lifespan extends to the end of the program. It is possible to delete all static
data associated with a thread, but it is more sensible simply to reuse the allocated
memory each time we need similarly-sized data on a thread. This requires the use of
pthread keys in order to retrieve the allocated pointer each time it is needed.
// puma.h contains all of the PUMA public API declarations we need.
#include "puma.h"

#include <math.h>
#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <getopt.h>

pthread_key_t extraDataKey;
pthread_once_t initExtraDataOnce = PTHREAD_ONCE_INIT;

static void initialiseKey(void)
{
	pthread_key_create(&extraDataKey, NULL);
}

struct stdDevExtraData
{
	double sum;
	double squareSum;
	size_t numElements;
};

static void* extraDataConstructor(void* constructorData)
{
	(void)pthread_once(&initExtraDataOnce, &initialiseKey);

	void* stdDevExtraData = pthread_getspecific(extraDataKey);
	if(stdDevExtraData == NULL)
	{
		stdDevExtraData = pumallocStaticLocal(sizeof(struct stdDevExtraData));
		pthread_setspecific(extraDataKey, stdDevExtraData);
	}
	return stdDevExtraData;
}

static void extraDataReduce(void* voidRet, void* voidData[],
		unsigned int nThreads)
{
	double* ret = (double*)voidRet;
	double sum = 0;
	double squareSum = 0;
	size_t numElements = 0;
	for(unsigned int i = 0; i < nThreads; ++i)
	{
		struct stdDevExtraData* data = (struct stdDevExtraData*)voidData[i];
		numElements += data->numElements;
		sum += data->sum;
		squareSum += data->squareSum;
	}
	double mean = sum / numElements;
	// The square root of the variance gives the standard deviation.
	*ret = sqrt(squareSum / numElements - mean * mean);
}

static void stdDevKernel(void* voidNum, void* voidData)
{
	double num = *(double*)voidNum;
	struct stdDevExtraData* data = (struct stdDevExtraData*)voidData;
	data->sum += num;
	data->squareSum += num * num;
	++data->numElements;
}

static void staticDestructor(void* arg)
{
	pumaDeleteStaticData();
}
Prior to running the kernel, we must actually create the struct pumaSet which contains
our data; to do this, we specify the size of our elements, the number of threads we wish to
use and, optionally, a string detailing what cores we want to pin threads to. We must also
seed the random number generator and read the arguments:
static void printHelp(char* invocationName)
{
	printf("Usage: %s -t numThreads -e numElements [-a affinityString]\n"
			"\tnumThreads: The number of threads to use\n"
			"\tnumElements: The number of numbers to allocate\n"
			"\taffinityString: A string which specifies which cores to run on.\n",
			invocationName);
}

int main(int argc, char** argv)
{
	int numThreads = 1;
	int numElements = 1000;
	char* affinityStr = NULL;

	/*
	 * Get command line input for the affinity string and number of threads.
	 */
	int c;
	while((c = getopt(argc, argv, "e:a:t:h")) != -1)
	{
		switch(c)
		{
			case 't':
				numThreads = atoi(optarg);
				break;
			case 'e':
				numElements = atoi(optarg);
				break;
			case 'a':
				affinityStr = optarg;
				break;
			case 'h':
				printHelp(argv[0]);
				break;
		}
	}

	struct pumaSet* set =
			createPumaSet(sizeof(double), numThreads, affinityStr);

	srand(time(NULL));
From here, we can use the pumalloc call to allocate space within set for each number:
	for(size_t i = 0; i < (size_t)numElements; ++i)
	{
		double* num = (double*)pumalloc(set);
		*num = (double)rand() / RAND_MAX;
	}
We then use initKernelData() to create the extra data to be passed into our kernel.
From there, we call runKernel() to invoke our kernel and get the standard deviation of
the set.
	struct pumaExtraKernelData kData;
	double stdDev = -1;
	initKernelData(&kData, &extraDataConstructor, NULL, NULL, NULL,
			&extraDataReduce, &stdDev);

	runKernel(set, stdDevKernel, &kData);

	printf("Our set has a standard deviation of %f\n"
			"Also, Hello World!\n", stdDev);
Finally, we clean up after ourselves by destroying our set and all our static data. The static
data destructor destroys data on a per-thread basis, so we must call the destructor from
all threads in our pool. To do this, we use the executeOnThreadPool() function from
pumathreadpool.h.
	executeOnThreadPool(set->threadPool, staticDestructor, NULL);
	destroyPumaSet(set);
}
In order to compile this tutorial, use the following command:
gcc -pthread -std=c99 <file>.c -o stddev -lpuma -L<PUMA bin dir> -I<PUMA inc dir>
Appendix D
Licence
PUMA is released under the three-clause BSD licence[32]. We chose this rather than a
copyleft licence like GPL or LGPL in order to allow anyone to use PUMA with absolute
freedom aside from the inclusion of a short copyright notice.