Computational Science and Engineering
(International Master’s Program)
Technische Universität München
Master’s Thesis
Parasite: Local Scalability Profiling for Parallelization
Author: Nathaniel Knapp
1st examiner: Prof. Dr. Michael Gerndt
2nd examiner: Prof. Dr. Michael Bader
Advisor: M. Sc. Andreas Wilhelm
Thesis handed in on: October 25, 2016
I hereby declare that this thesis is entirely the result of my own work except where otherwise
indicated. I have only used the resources given in the list of references.
October 24, 2016 Nathaniel Knapp
Acknowledgments
I would first like to thank Andreas Wilhelm for advising me over the past year I have worked
on this project. His advice has been invaluable to the completion of this thesis. Second, I thank
Prof. Bader and Prof. Gerndt for agreeing to be examiners. Third, I thank my teachers and
mentors during CSE at TUM, especially Alexander Pöppl for mentoring me on my previous
project. Fourth, I thank Prof. Corey O’Hern, Prof. Rimas Vaisnys, Carl Schreck, and Wendell
Smith for their mentoring at Yale, which inspired me to apply to study CSE at TUM. Fifth, I
thank my classmates in CSE who have made studying at TUM a wonderful experience.
Abstract
In this master’s thesis, Parasite, a local scalability profiling tool, is presented. Parasite mea-
sures the parallelism of function call sites in C and C++ applications parallelized using Pthreads.
The parallelism, the ratio of a program’s work to its critical path, is an upper bound on speedup
for an infinite number of processors, and therefore a useful measure of scalability. The use of
Parasite is demonstrated on sorting algorithms, a molecular dynamics simulation, and other
programs. These tests use Parasite to compare methods of parallelization, elicit the depen-
dence of parallelism on input parameters, and find the factors in program design that limit
parallelism. Future extensions of the tool are also discussed.
Contents
Acknowledgments
Abstract
Outline
I. Introduction and Background
1. Introduction
2. Background
2.1. Shared Memory Parallel Programming
2.2. Parallel Program Performance
2.2.1. Deciding Optimal Scalability
2.2.2. Speedup Bounds
2.2.3. Limitations on Speedup
2.2.4. Using Parasite for Parallel Program Design
2.3. Synchronization
3. Related Work
3.1. Cilk Profiling Tools
3.2. Other Tools
II. The Parasite Scalability Profiler
4. Parceive
4.1. Acceptable Inputs
4.2. Trace Generation
4.3. Trace Analysis
5. Algorithm
5.1. The Parasite Work-Span Algorithm
5.2. Work-Span Data Structures
5.3. Estimation of Mutex Effects
5.4. Graph Validation
III. Results and Conclusion
6. Results
6.1. Fibonacci Sequence
6.2. Vector-Vector Multiplication
6.3. Sorting Algorithms
6.3.1. Bubble Sort
6.3.2. Quicksort
6.3.3. Radix Sort
6.3.4. Merge Sort
6.3.5. Summary
6.4. European Championships Simulation
6.5. Molecular Dynamics
6.6. CPP Check
7. Conclusion
IV. Appendices
8. Molecular Dynamics Code
8.1. Serial Version
8.2. Fine Grained Parallelization
8.3. Coarse Grained Parallelization
Bibliography
Outline
Part I: Introduction and Background
CHAPTER 1: INTRODUCTION
This chapter presents an overview of the thesis and its purpose.
CHAPTER 2: BACKGROUND
This chapter discusses parallel programming theory relevant to local scalability profiling.
CHAPTER 3: RELATED WORK
This chapter discusses research and commercial tools similar to the Parasite tool.
Part II: The Parasite Scalability Profiler
CHAPTER 4: PARCEIVE
This chapter describes how programs are processed by Parceive before analysis by Parasite.
CHAPTER 5: ALGORITHM
This chapter describes the algorithms and data structures of the Parasite tool, which provide
local scalability profiling.
Part III: Results and Conclusion
CHAPTER 6: RESULTS
This chapter describes tests of Parasite on a diverse selection of programs.
CHAPTER 7: CONCLUSION
This chapter summarizes the current capabilities of the Parasite tool and discusses future ex-
tensions.
Part I.
Introduction and Background
1. Introduction
Over the last half-century rapid IT advances “have depended critically on the rapid growth of
single-processor performance,” and much of this growth depended on increasing the number
and speed of transistors on a processor chip by decreasing their size [1]. However, since the
early 21st century, improvements in the speed of single processors have been very slow, as
limits in efficiencies of single-processor architectures have been reached. The size of transistors
continues to be reduced at the same rate, and the hardware industry builds chips that contain
several to hundreds of processors.
The exponential increase in the computing potential of hardware does not translate directly
into the performance the user actually sees, because performance depends not only on the
capabilities of the hardware, but also on how well these capabilities are utilized. To
fully utilize these chips, they must be programmed using parallel programming models, which
present many software challenges. Therefore, research into methods of parallelization is essen-
tial for true improvement of hardware performance, as opposed to just improvements of the
hardware’s potential performance. Additionally, existing legacy code must be parallelized to
fully use the scalability potential of multicore processors. However, parallelization is time-
consuming and error-prone, so most legacy software “still operates sequentially and single-
threaded” [2]. Several challenges of parallel programming explain the gap in development
between hardware and software.
One challenge is successful design of a parallel program that operates concurrently. The
program’s computation may be split into components that run on different threads. Without
proper synchronization, operations that access the same memory locations can easily lead to
nondeterministic behavior. Hence, parallelization of serial programs requires identifying de-
pendencies and refactoring to use multiple threads in a way in which these dependencies do
not create race conditions. Another challenge is load balancing: evenly dividing the work of
the threads so that full scalability potential is realized. This requires some understanding of the
amount of work associated with each task, as well as separating the program into tasks that are
small enough for even balancing to be possible.
The Chair of Computer Architecture at TUM has developed an interactive parallelization
tool, Parceive, which helps programmers overcome these design challenges for shared memory
systems [2]. Figure 1.1 illustrates the high-level components of Parceive.
Figure 1.1.: Steps of the Parceive tool: the input (a binary application with debug symbols) undergoes static analysis (data-flow, control-flow) and runtime analysis (binary instrumentation, event inspection); the resulting trace data feeds the visualization framework and views as well as Parasite's scalability profiling.
The Parceive tool takes an executable as input, and using Intel’s Pin tool, dynamically instru-
ments predefined instructions, including function calls and returns, memory accesses, memory
allocation and release, and Pthread API calls. This instrumentation inserts callbacks that are
used to write trace data into a database at runtime. Then, the Parceive interpreter reads the
trace stored in the database sequentially in chronological order. The interpreter API allows
the user to acquire information from events of interest generated from reading the database.
These events include function calls and returns, thread creation and ends, thread joins with
their parent, and mutex locks and unlocks.
In this thesis, a scalability profiling tool called Parasite is described. This tool analyzes the
events generated by the Parceive interpreter to calculate the parallelism of call sites in Pthread
programs. The parallelism, the ratio of an application’s total work to its critical path, is the
upper bound on speedup possible on any number of processors.
Parasite’s parallelism calculations are useful in two ways. First, they allow the programmer
to quickly identify areas of high and low parallelism. This allows the programmer to focus par-
allelization effort where this effort can result in speedup, and to avoid spending unnecessary
parallelization effort on functions with inherently low parallelism. Low call site parallelism
values might also indicate the need to redesign the program to increase parallelism. Second,
the parallelism calculations allow the programmer to quickly see if the measured speedup of
their program is far from the upper bound on speedup shown by the parallelism. A large
gap between the parallelism and the measured speedup indicates design problems, synchro-
nization problems or operating system problems such as scheduling overhead and memory
bandwidth bottlenecks.
This thesis is structured as follows. Chapter 2 will describe parallel programming theory
relevant to the Parasite tool. Chapter 3 will describe other scalability profiling tools that have
been developed. Chapter 4 will describe how Parceive processes input programs before their
analysis by the Parasite tool. Chapter 5 will describe the algorithms Parasite uses to calculate
the parallelism for function call sites, estimate lock effects on parallelism, and verify the cal-
culations using directed acyclic graphs. Chapter 6 will describe tests of the Parasite tool on a
diverse selection of C and C++ programs parallelized using Pthreads. Finally, chapter 7 will
discuss the impact of the Parasite tool and possible future extensions.
2. Background
In this chapter theory relevant to the Parasite tool will be discussed. Section 2.1 describes
shared memory parallel programming, the Pthread API, types of parallelism, and the directed
acyclic graph model of multithreading. Section 2.2 discusses the scalability and performance
of shared memory parallel programs, and how the Parasite tool can be used to improve the
performance and scalability. Section 2.3 discusses synchronization and how it relates to the
Parasite tool.
2.1. Shared Memory Parallel Programming
One way to classify parallel programming models is the way that they access memory. In
shared memory parallel programs, multiple processors are allowed to share the same location
in memory, without any restrictions [3]. In distributed memory parallel programs, processors
do not share memory, and messages are used instead to transfer data between processors. Para-
site analyzes Pthread programs, which use the shared-memory model of parallel programming.
In this section the Pthread API, types of parallelism, and a way to model shared-
memory programs using graphs are discussed.
Pthreads. Pthreads, short for POSIX threads, is a programming API that can be implemented
using C, C++, or FORTRAN. Using Pthreads does not modify the language - instead Pthread
functions are inserted into the code to dynamically create and destroy parallelism and syn-
chronization [1]. A pthread_create(...) call takes a thread ID, a function pointer, and
an optional pointer as arguments. The call creates a new thread with the thread ID that be-
gins running the function whose pointer it is passed. This function can also use the arguments
passed to it through the pointer in the pthread_create(...) call. A pthread_join(...)
call, always in a parent thread, takes a child thread ID argument and an optional pointer to a
return argument. The pthread_join(...) statement creates an implicit barrier: execution of
the parent thread will not continue until the child thread has completed its execution. The only
Pthread synchronization calls that Parceive can analyze are pthread_mutex_lock(...) and
pthread_mutex_unlock(...). Mutexes are described in section 2.3. A serious limitation
of Pthreads is its lack of locality control. Locality control is the ability for the programmer to
explicitly direct the location of memory in the operating system. Other limitations include the
overhead of the thread creation and deletion, and the limited control over thread scheduling in
the operating system.
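As a minimal illustration of this API (the function and variable names below are invented for the example and do not come from the thesis test programs), the following C++ fragment creates one child thread and then joins it:

#include <pthread.h>
#include <cstdio>

// Worker function executed by the child thread; the void* argument
// carries the data passed through pthread_create(...).
void* worker(void* arg) {
    int* value = static_cast<int*>(arg);
    std::printf("child thread received %d\n", *value);
    return nullptr;
}

int main() {
    pthread_t child;
    int argument = 42;
    // Create the child thread; it begins executing worker(&argument).
    pthread_create(&child, nullptr, worker, &argument);
    // Implicit barrier: the parent blocks here until the child has finished.
    pthread_join(child, nullptr);
    return 0;
}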
Types of Parallelism. There are many ways to classify parallelism. One way is to split
parallelism into the two categories data parallelism and functional decomposition. Data paral-
lelism is parallelism that increases with the amount of data, or the problem size [3]. Programs
analyzed in this thesis that have data parallelism include vector-vector multiplication, whose
available parallelism increases with the size of vectors being multiplied. Another example is
CPPCheck, analyzed in section 6.6, which is a static analysis tool for correctness and style. As
the work of this program grows with the number of files, this program shows data parallelism,
even though the operations the program executes for each file may differ. Functional decom-
position, in contrast, splits a program into tasks that perform different functions. At maximum,
programs with functional decomposition can scale by the number of tasks, but this requires the
tasks to have equal work - perfect load balancing [3].
Parallelism can also be split into regular and irregular parallelism. Programs with regular
parallelism can be split into tasks that have predictable dependencies. Programs with irregular
parallelism can only be split into tasks with unpredictable dependencies. Usually, programs
with regular parallelism can be modeled by a single directed acyclic graph, while
programs with irregular parallelism could be modeled by several different directed acyclic
graphs [4].
The Directed Acyclic Graph Model of Multithreading. To examine the structure of shared-
memory parallel programs, it is useful to abstract the programs as directed acyclic graphs
(DAGs). A directed acyclic graph has an ordering of all its vertices, called a topological or-
dering, in which for all directed edges from u to v, u comes before v in the ordering. This
requires the DAG to be acyclic; it is not possible to follow directed edges from any vertex of
the DAG so that the same vertex is reached again. A shared memory parallel program can be
represented as a DAG in which the vertices are strands, “sequences of serially executed instruc-
tions containing no parallel control,” and where graph edges indicate parallel control, such as
thread creation or thread joining [5].
Pthread applications can be modeled with the DAG model of multithreading. In this model,
strands are vertices, and the ordering of strands is shown by the edges between the strands. A
pthread_create(...) statement creates two edges. The first edge, the continuation edge,
leads to the next strand in the same parent thread. The second edge, the child edge, goes to the
first strand in the spawned thread. A pthread_join(...) statement creates an edge from
the last strand in the spawned thread to the strand following it in the parent thread. Figure 2.1
shows a DAG of a minimal Pthread program where one child thread is created by its parent
thread and then rejoins its parent thread.
Figure 2.1.: A DAG representing a simple Pthread program.
2.2. Parallel Program Performance
In this section the scalability and performance of parallel programs will be discussed, as well as
ways that the Parasite profiling tool can be used to improve performance and scalability. The
definitions introduced in this section come from [5] and [6].
2.2.1. Deciding Optimal Scalability
For measuring performance of sequential applications, developers are interested in the execu-
tion time of the application, and the proportion of this execution time spent in different func-
tions. For multithreaded applications, developers are also interested in how this execution time
depends on the number of processing cores [5]. This is an important question, as developers
must decide how many cores to use. Additional cores should only be used when their marginal
benefit is greater than their marginal cost. The benefit comes from speedup of the application.
The cost comes from two factors. The first factor is complexity of parallelization. Developers
must ensure that parallel code does not suffer from nondeterminism, try to split computational
load as evenly as possible, and decide what size the load on each concurrently running pro-
cessor should be. This size is called the granularity of the parallelization. The second factor is
the added power consumption cost of additional processors or hardware needed for parallel
computation, but this is usually not a concern compared to the cost of developing parallel code.
One way of determining scalability is to measure it directly. This requires parallel programs
that can easily be changed to use a greater or fewer number of threads by changing an input
parameter. Unfortunately, these programs are limited to those which have independent tasks
within for loops, where an equal fraction of the independent iterations can easily be assigned
to each processor. For these programs, runtimes can be measured using different numbers
of threads to observe the corresponding speedup. However, this process does not show the
scalability for separate function call sites. In this thesis, a call site is used to refer to each line
of a program where a function is called. For programs that contain more complex task-level
parallelism, or call sites with varying parallelism, it is not so easy to decide on the optimal
number of threads, as threads may have uneven workloads. For these programs, it is useful
to have Parasite because it shows individual call sites with high potential for parallelization,
without having to measure the scalability directly by profiling program runs with different
numbers of threads.
2.2.2. Speedup Bounds
Upper Bounds. Work and span are two measurements of the computation in a parallel pro-
gram. The work is the time it would take to execute all the strands in the computation sequen-
tially. This is the same as the time it takes to execute the computation on one processor, so it is
denoted as T1. The span is the time it takes to execute the critical path of the computation. This
is the same as the time it takes to execute the computation on an infinite number of processors,
so it is denoted as T∞.
“P processors can execute at most P instructions in unit time” [5], which creates the first
speedup constraint, the work law, where Tp is the parallel execution time:
Tp ≥ T1/P (2.1)
The maximum speedup from parallelization increases linearly with the number of processors
at first, because it is determined by the work law. However, as the number of processors in-
creases, they eventually cannot affect the speedup, because at least one of the processors must
execute all instructions on the critical path. This upper bound for the speedup possible on any
number of processors is called the parallelism. Parallelism “is the ratio of a computation’s work
to its span” [6]. This is stated in equation 2.2, where Sp is the speedup on P processors:
Sp = T1/Tp ≤ T1/T∞ (2.2)
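As a simple numerical example, suppose a computation has work T1 = 100 seconds and span T∞ = 10 seconds. Its parallelism is T1/T∞ = 10, so no number of processors can yield more than a tenfold speedup, and the work law limits the runtime on P = 4 processors to Tp ≥ 100/4 = 25 seconds.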
A Lower Bound. The work and span can also be used to calculate a lower bound on speedup
for an ideal machine. An ideal machine is one in which memory bandwidth does not limit
7
2. Background
performance, the scheduler is greedy, and there is no speculative work [3]. Speculative work
is when the machine performs work that may not be needed, before it would be needed, in
case this is faster than performing the work after it is needed. This lower bound on speedup is
called Brent’s Lemma [3].
Tp ≤ (T1 − T∞)/P + T∞ (2.3)
This formula is explained by the fact that the program always takes at least T∞ time, but the
P processors can split up the remaining work, T1 − T∞, evenly. Therefore the sum in Equation
2.3 describes a lower bound on speedup for an ideal machine. An ideal machine does not have
the limitations on speedup described in the next section, so if this lower bound on speedup is
not met, it indicates that one of these limitations is acting on the program.
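Continuing the numerical example above, with T1 = 100 seconds and T∞ = 10 seconds, Brent's Lemma guarantees that an ideal machine with P = 4 processors finishes in at most Tp ≤ (100 − 10)/4 + 10 = 32.5 seconds, a speedup of at least 100/32.5 ≈ 3.1.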
2.2.3. Limitations on Speedup
The goal of Parasite is to help the programmer identify why programs that are parallelized
do not reach their theoretical upper bound on speedup. There are six types of limitations on
speedup in a parallel program described in [3] and [6]:
• Insufficient parallelism: The program contains serial sections that prevent speedup when
using more processors.
• Contention: A processor is slowed down by competing accesses to synchronization prim-
itives, such as mutexes, or by the true or false sharing of cache lines.
• Insufficient memory bandwidth: The processors access memory at a rate higher than the
bandwidth of the machine's memory network can sustain.
• Strangled scaling occurs when synchronization primitives serialize execution and limit
scalability. This problem is often coupled with attempts to solve deadlocks or race condi-
tions, as synchronization primitives implemented to deal with these can lead to strangled
scaling.
• Load imbalance is when some worker threads have significantly more work than others.
This increases the span unnecessarily, as the threads with less work must wait idly while
the other threads complete. This can be dealt with by overdecomposition: splitting tasks
into many more concurrent portions than there are available threads. It is easier to spread
many small blocks of serial work evenly over threads, than a few large blocks.
• Overhead occurs from the cost of creating threads and destroying threads. This problem
is often coupled with load imbalance, as overdecomposition leads to a greater overhead.
Therefore, an appropriate granularity, size of concurrent workloads, should be chosen to
limit both the overhead and load imbalance.
2.2.4. Using Parasite for Parallel Program Design
This section describes how the programmer can use Parasite to diagnose limitations in their
program design, and in some cases, guide possible improvements to the program’s design.
Figure 2.2 provides a visualization of this process.
Figure 2.2.: Using Parasite to guide parallel program design.
First, the programmer immediately sees, from Parasite’s call site profiles, call sites in their
program where there is insufficient parallelism. A call site is a specific line in the program
where a function is called, and measurements for this call site include all child function calls
of the function. Parasite can be used to compare the number of processors employed for each
call site to the parallelism of each call site. This helps identify call sites where the parallelism
does not greatly outnumber the number of processors in use for the call site, and so the speedup may not
be linear [6]. Equation 2.4, derived from Equation 2.3, shows this mathematically:
Sp = T1/Tp ≈ P if T1/T∞ ≫ P (2.4)
Parallel slack is the ratio of the parallelism to the number of processors. With enough par-
allel slack, the program shows linear speedup. Scheduling overhead occurs when there is not
enough parallel slack for each processor to be given a task when it is free. This requires some
processors to wait for available work, potentially increasing the span of the program. The
amount of parallel slack needed depends on the operating system scheduler. Intel Cilk Plus
and Intel TBB task schedulers work well with high amounts of parallel slack, because they
only use as much parallelism as the hardware is capable of handling [3]. Pthreads, in contrast,
requires its threads to run concurrently, so high parallel slack can decrease the possible
speedup if the operating system has fewer hardware threads available than the program creates. In this case, to
simulate concurrency, the operating system must “time-slice” between the concurrent threads,
adding overhead for context switching and changing the items in the cache [3].
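For example, a call site with parallelism 96 executed on 8 processors has a parallel slack of 12 and comfortably satisfies the condition of Equation 2.4, whereas a call site with parallelism 10 on the same 8 processors has a slack of only 1.25 and cannot be expected to show linear speedup.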
Second, the programmer will be able to use Parasite to investigate the impact of mutex con-
tention on parallelism, using an interactive visualization that allows easy selection of shared
memory locations to lock using mutexes. The tool will then automatically calculate a new up-
per bound on speedup with locks on these shared memory locations, without the programmer
changing the source code.
Finally, the parallelism of a call site can be compared to the speedup measured using a dif-
ferent profiling tool. If there is a gap in the speedup, and use of the Parasite tool has ruled
out insufficient parallelism, scheduling overhead, and contention as possible causes, the pro-
grammer must consider alternative problems with their program design or operating system.
Insufficient memory bandwidth, synchronization primitives other than locks such as barriers,
and speculative work are remaining possibilities preventing the parallelization from achieving
its potential speedup.
2.3. Synchronization
Synchronization is coordination of events; synchronization constraints are specific orders of
events required in a concurrent program. The most common types of synchronization con-
straints are serialization, where one event must happen before another, and mutual exclusion,
where one event must not happen at the same time as another [7]. When the two events in ques-
tion are on the same thread, these constraints are easy to satisfy. The serialization constraint is
met by placing events in the order intended. The mutual exclusion constraint is automatically
met, because only one event can happen at the same time on the same thread. When two events
that need to have a specific order or need to be mutually exclusive occur on different threads,
synchronization constraints are harder to meet. From the programmer’s perspective, the or-
der of events on different threads is non-deterministic, as it depends on the operating system
scheduling.
Problems. There are a number of problems associated with synchronization. One common
example is race conditions, which occur "when concurrent tasks perform operations on the
same memory location without proper synchronization, and one of the memory operations is
a write” [3]. These can have no negative effect in some cases, but are nondeterministic, and
therefore can fail, so are unacceptable in parallel code. Another example is a deadlock, which
“occurs when at least two tasks wait for each other and each cannot resume until the other task
proceeds” [3].
It is both an advantage and disadvantage of Parasite that it cannot detect synchronization
problems, as it operates using a sequential execution of a Pthread program trace. This sequen-
tial execution acts as if the Pthread program was operating using a single thread. The advan-
tage is that programs can be tested for parallelism before synchronization problems are dealt
with. This saves the programmer time when their only goal for using Parasite is to quickly
compare different parallelizations, to assess if a parallelization provides some minimum scal-
ability requirement, and to identify regions of high and low scalability. The disadvantage is
that Parasite’s parallelism calculations are not necessarily accurate for programs that employ
sychronization. A program that deadlocks has no parallelism, as it will never complete. The
parallelism of a program where semaphores, conditional waits, or barriers are employed can-
not be accurately measured using Parasite, as wait times due to these primitives will increase
both the work and span. Even Parasite’s mutex wait time correction, described in section 5.3,
only provides a rough estimate of the additional work and span that waiting for mutexes re-
quires.
Semaphores. A general solution to many synchronization problems is called a semaphore.
A semaphore is defined in [7] by the following three conditions:
• The semaphore can be initialized to any integer value, but after that it can only be incre-
mented or decremented by one.
• When a thread decrements the semaphore to a negative value, the thread blocks (does
not continue) and cannot continue until a different thread increments the semaphore.
• When a thread increments the semaphore and there are waiting threads, then one of the
waiting threads is unblocked.
The application of semaphores to diverse synchronization problems is described in detail
in [7]. The lock wait time estimation algorithm used in Parasite only deals with the case of
mutexes, which are semaphores initialized to values of one. Mutexes are often used to protect
variables that are shared in memory between different threads, to avoid race conditions.
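To make these conditions concrete, the following C++ sketch builds a counting semaphore from a Pthread mutex and condition variable. It is an illustrative implementation only (it uses the common variant in which the counter never becomes negative) and is not part of Parasite; a Semaphore constructed with the value one behaves like the mutexes that Parasite's lock wait time estimation handles.

#include <pthread.h>

// Minimal counting semaphore built from Pthread primitives (illustrative only).
class Semaphore {
public:
    explicit Semaphore(int initial) : value(initial) {
        pthread_mutex_init(&mutex, nullptr);
        pthread_cond_init(&condition, nullptr);
    }
    // Decrement; block while the counter would drop below zero.
    void wait() {
        pthread_mutex_lock(&mutex);
        while (value <= 0)
            pthread_cond_wait(&condition, &mutex);
        --value;
        pthread_mutex_unlock(&mutex);
    }
    // Increment; wake one waiting thread, if any.
    void post() {
        pthread_mutex_lock(&mutex);
        ++value;
        pthread_cond_signal(&condition);
        pthread_mutex_unlock(&mutex);
    }
private:
    int value;
    pthread_mutex_t mutex;
    pthread_cond_t condition;
};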
3. Related Work
In this section five profiling tools similar to Parasite will be described. The first two, Cilkview
and Cilkprof, are designed for programs parallelized using the Cilk and Cilk++ multithreading
APIs. The third, ParaMeter, analyzes programs with irregular parallelism. The fourth, Intel
Advisor, is a commercial tool that can be used for scalability profiling. The fifth, Kismet, profiles
potential parallelism in serial programs.
3.1. Cilk Profiling Tools
In this section two tools that profile programs using Cilk and Cilk++ will be described. Cilk
and Cilk++ are programming languages designed for multithreaded computing, that extend
C or C++ code with three constructs, cilk_spawn(), cilk_sync(), and cilk_for(), that
support writing task-parallel programs. The first two constructs are similar to
pthread_create(...) and pthread_join(...), respectively, in Pthreads. Cilk is dif-
ferent from Pthreads, however, in that it does not allow the developer to explicitly choose if
threads are created; cilk_spawn() only creates a new thread if the Cilk scheduling algo-
rithms decide this will help the performance. Therefore, Pthreads is better for shared-memory
parallel applications in which complete control over thread creation is necessary. Cilk is better
for shared-memory parallel applications that require excellent load balancing, as the backend
of Cilk decides how to balance tasks between threads, unlike Pthreads, where the user is re-
sponsible for load balancing. A user's own load balancing will normally not match Cilk's
backend algorithms, which reflect far more development effort than an individual programmer
can typically devote to the problem.
Cilkview. The Cilkview scalability analyzer is a software tool for profiling multithreaded
Cilk++ applications [5]. Like Parceive, Cilkview uses the Pin dynamic instrumentation frame-
work to instrument threading API calls. By analyzing the instrumented binary, Cilkview mea-
sures work and span during a simulation of a serial execution of parallel Cilk++ code. In this
measurement, parallel control constructs such as cilk_spawn() or cilk_sync() statements
are identified by “metadata embedded by the Cilk++ compiler in the binary executable” [5].
Unlike Parasite, Cilkview can analyze scheduling overhead by using the burdened DAG model
of multithreading, which extends the DAG model described in section 2.1.
In the burdened-DAG model, the work and span of some computations are weighted ac-
cording to their grain size, by including a burden on each edge that continues after a thread
end event, and each edge that continues on the parent thread after a new thread event. The
burdens estimate the cost of migrating tasks, and assume all the tasks that can be migrated are
migrated. Task migration is performed by the underlying Cilk scheduler. The main influence
of using a burdened DAG instead of a DAG is that it increases the work and span values used
in the parallelism calculation, and it decreases the parallelism. The decrease in parallelism is
much higher for programs that have fine-grained parallelism, as these programs have more
edges where burdens are added.
Cilkprof. Cilkprof is a scalability profiler developed for multithreaded Cilk computations [6].
It extends Cilkview to provide work, span and parallelism profiles for individual function call
sites as well as the overall program. It uses compiler instrumentation to create an instrumented
Cilk program, that it then runs serially, to analyze each call site: every location in the code
where a function is either called or spawned. The Cilkprof algorithm measures the work and
span of each call site, in order to get their ratio: the parallelism. It is not described here as
it is used in Parasite and therefore described in detail in section 5.1. Conceptually, Parasite’s
algorithm is the same, with four differences in its implementation:
1. Cilkprof’s algorithm is used for Cilk or Cilk++, while the Parasite algorithm is designed
for Pthreads.
2. The Parasite algorithm includes an estimation of the effects of mutexes on parallelism.
This is useful to programmers, as they may be trying to parallelize code which requires
mutexes. Cilkprof and Cilkview do not consider synchronization.
3. The algorithm in this thesis is implemented in an object-oriented style in C++, unlike
Cilkprof’s algorithms, which are implemented in C. This has the advantage that the code
is more readable, and simpler, as it can use helpful data structures in the C++ standard li-
brary, such as unordered maps, in place of the C data structures programmed specifically
for Cilkprof.
4. The implementation of Parasite is more generalizable to other threading APIs than Cilk,
as it responds to thread and function events instead of Cilk function calls. These events
can be more easily mapped to threading constructs in other APIs than Cilk’s threading
constructs can.
3.2. Other Tools
ParaMeter: Profiling Irregular Parallelism. Kulkarni et al. developed a tool, called ParaMeter,
that “produces parallelism profiles for irregular programs” [4]. Irregular programs are orga-
nized with trees and graphs and many have amorphous data parallelism. This is a type of
parallelism where conflicting computations can be performed in any order, where each chosen
order is a DAG that may have its own parallelism. Parasite cannot easily analyze programs
with amorphous data parallelism for two reasons. First, Parasite can only analyze one of the
possible DAGs that models a program. Second, the structure of graphs representing programs
with amorphous data parallelism may depend on the scheduling decisions of the operating
system, and Parasite cannot take these scheduling decisions into account.
ParaMeter deals with these challenges by making the parallelism profile it generates imple-
mentation independent, and using greedy scheduling and incremental execution. Greedy
scheduling “means that at each step of execution, ParaMeter will try to execute as many el-
ements as possible.” Incremental execution means each step of computation is “scheduled tak-
ing work generated in the previous step in account” [4]. ParaMeter not only measures par-
allelism, like Parasite and CilkProf, but also parallelism intensity, which is the amount of
available parallelism divided by the overall size of the worklist at a given time in the compu-
tation [4]. This metric is useful for deciding on work scheduling policies for tasks: random
policies perform better with high parallelism intensities because it is less likely the policies
create scheduling conflicts, which are situations where tasks must wait idly for other tasks to
complete due to dependencies.
The Intel Advisor: A Commercial Tool. The most similar Intel tool to Parasite is the In-
tel Advisor, which can profile serial programs with annotations that specify parallelism, C
and C++ programs parallelized using Intel Thread Building Blocks or OpenMP, C programs
parallelized using Microsoft TPL, or Fortran programs parallelized using OpenMP [8]. The
Threading Advisor workflow of the Intel Advisor provides similar features to Parasite; both
are designed to assist software developers and architects who are in the process of optimizing
parallelization. However, the Advisor tool is proprietary, which is a disadvantage compared
to Parasite, which is open-source, so Parasite’s algorithms are entirely transparent and open to
inspection by developers.
Parasite has the ability to quickly compare parallelism of different Pthread parallelizations of
the same program, without correct synchronization. Intel Advisor has a similar fast prototyp-
ing feature, that allows developers to look at different parallelizations of a program, conveyed
to the tool using annotations, to compare them before actually implementing their paralleliza-
tion [8]. The Advisor accomplishes this by keeping the code serial when comparing the par-
allelizations, so there can be no bugs related to concurrent execution in any of the potential
parallelizations.
The Intel Advisor provides scalability estimates for the entire program in its suitability anal-
ysis, shown in figure 3.1, but unlike Parasite, it only looks at the entire program for parallelism
estimates and does not provide individual scalability estimates for functions. Also unlike Para-
site, the tool contains features that analyze call sites and loops for their vectorization potential.
Like Parasite, it can be used to examine the proportion of work spent in different functions, to
help the programmer see where execution time is spent in tasks that can be parallelized [9].
Figure 3.1.: Intel Advisor suitability analysis screenshot.
Kismet: Parallel Speedup Estimates for Serial Programs. The Parasite tool, as well as the
tools described in the previous sections, all require the input program to already be parallelized
in some way. In contrast, Jeon et al. developed a tool, Kismet, that creates parallel speedup
estimates for serial programs [10]. Like Parasite, Kismet calculates an upper bound on the pro-
gram’s attainable speedup. Unlike Parasite, it takes into account operating system conditions
including “number of cores, synchronization overhead, cache effects, and expressible paral-
lelism types” [10]. The speedup algorithm uses a parallel execution time model that depends
on these operating system conditions as well as the amount of parallelism available. Kismet
determines the amount of parallelism available using summarizing hierarchical critical path
analysis, which measures the critical path and work, like the Cilkprof work-span algorithm,
but uses a different approach to take these measurements. This involves building a hierarchi-
cal region structure from source code, consisting of different regions that help separate different
levels of parallelism.
The advantage of Kismet’s approach over Parasite and the other tools described in this sec-
tion is that it does not require additional effort by the programmer. The Intel Advisor requires
annotations that show parallel control, while the other tools require a parallelization. However,
unlike the other tools, Kismet cannot be used to compare different parallelizations of the same
serial program.
Part II.
The Parasite Scalability Profiler
4. Parceive
The Parasite tool depends on Parceive, which provides information on call sites, functions, and
threads that Parasite uses for its work-span algorithm. In this chapter the details of Parceive’s
implementation will be described.
Parceive operates using the steps shown in Figure 4.1. Parceive takes an executable as input
that must meet the requirements described in section 4.1. Then, it performs static analysis of
the machine code and dynamically instruments predefined instructions such as function calls
and returns, threading API calls, or memory accesses. The instrumentation inserts callbacks
that are used to write trace data into a database at runtime. This will be described in section
4.2. Based on this data, trace analysis generates a visualization which the user can use to see
the overall structure of the program. Trace analysis also generates events that Parasite uses to
calculate scalability of function call sites. This will be described in section 4.3.
Figure 4.1.: Steps of the Parceive tool: the input (a binary application with debug symbols) undergoes static analysis (data-flow, control-flow) and runtime analysis (binary instrumentation, event inspection); the resulting trace data feeds the visualization framework and views as well as Parasite's scalability profiling.
4.1. Acceptable Inputs
Currently, Parasite can only successfully analyze programs that satisfy the following condi-
tions:
• The program is written in C or C++.
• The program is parallelized or annotated using Pthread API calls.
• The Pthread API calls only include pthread_create(...), pthread_join(...),
pthread_mutex_lock(...), and pthread_mutex_unlock(...).
• The program’s behavior does not depend on collaborative synchronization.
The last condition means that Parasite cannot correctly analyze a program where the ex-
ecution behavior of one thread depends on the execution behavior of any other thread. This
situation can occur when mutexes are used to control the ordering of threads, because the order
of threads in which the mutex is acquired and released may differ between sequential and con-
current execution. For example, Parasite will deadlock if a mutex is acquired in a parent thread,
which then generates a child thread that needs to acquire the mutex. In a concurrent execution,
the parent thread would continue after spawning child threads, and unlock the mutex, so that
the child thread could acquire the mutex. In the sequential simulation of execution by the Par-
ceive interpreter, the parent thread will not continue until the child thread has completed, but
the child thread will not complete, because it is waiting on the parent thread’s acquired mutex.
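The following C++ fragment sketches this problematic pattern (the names are invented for illustration). In a concurrent execution the parent releases the mutex after creating the child, but in the sequential replay the parent still holds it while waiting for the child, so the analysis hangs:

#include <pthread.h>

pthread_mutex_t shared_mutex = PTHREAD_MUTEX_INITIALIZER;

void* child_function(void*) {
    // In the sequential replay the parent still holds shared_mutex here,
    // so this lock can never be acquired.
    pthread_mutex_lock(&shared_mutex);
    pthread_mutex_unlock(&shared_mutex);
    return nullptr;
}

int main() {
    pthread_t child;
    pthread_mutex_lock(&shared_mutex);    // parent acquires the mutex first
    pthread_create(&child, nullptr, child_function, nullptr);
    pthread_mutex_unlock(&shared_mutex);  // concurrently, this would release the child
    pthread_join(child, nullptr);
    return 0;
}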
Even without collaborative synchronization, a successful run of Parasite that includes mu-
texes may not produce accurate estimates of parallelism, because Parasite may not correctly
calculate the addition to the work and span associated with the mutexes. The lock wait time
algorithm described in section 5.3 attempts to estimate these additions, but does not take into
account overhead associated with acquiring and releasing the mutexes. Other synchronization
primitives such as conditional waits and barriers are not handled by the Parasite algorithm, so
Pthread programs that use these primitives should not be used as inputs to Parasite.
4.2. Trace Generation
“Parceive analyzes programs by utilizing dynamic binary instrumentation at the level of ma-
chine code during runtime” [2]. It employs the Pin framework because “it is efficient and
supports high-level, easy-to-use instrumentation” [2, 11]. The pintool “injects analysis calls
to write trace data into an SQLite database” [2]. The following instrumentation is used for
data-gathering:
• Call stack: function entries and exits are tracked to maintain a shadow call stack. For each
call, the call instructions, threads, and spent execution time are captured. Additionally,
function signatures, file descriptions, and loops are extracted from debug information.
• Memory accesses: analysis calls are injected to capture information about each memory
access (e.g., memory type, memory address, access instruction). For stack variables, de-
bug information is utilized to resolve variable names.
• Memory management: to handle heap memory, memory allocation and release function
calls are instrumented. The tracked locations are used during analysis to match data
accesses using pointers.
• Threading: Parceive tracks calls of threading APIs, like Pthread, to capture thread opera-
tions and synchronization.
4.3. Trace Analysis
Some information contained in the SQLite database is context-free and can be found by sim-
ple queries to the database. This includes data dependencies between functions, which are
detected by comparing the memory accesses of each function, which are in turn found by ab-
stracting instances of function calls. Other information depends on the control and data flow.
To extract this information, the trace stored in the database is read sequentially in chronologi-
cal order. This information includes runtime of a function call and its nested functions, counts
of specific function calls, and counts of specific memory accesses. An API allows the user to
acquire information from events of interest generated from reading the database. The Parasite
tool interfaces directly with the following events to acquire information it needs for its work-
span algorithm:
1. A function calls another function.
2. A function returns to its parent function.
3. A thread creates a child thread.
4. A child thread’s execution ends.
5. A thread join: an implicit barrier where a thread must join its parent thread.
6. A function acquires a lock.
7. A function releases a lock.
The actions that occur in Parasite’s algorithm with each event are described in section 5.1.
Shadow locks and threads associated with events three to seven allow the event informa-
tion to be independent from whatever programming language is used in the programs. This
allows future extensions of Parasite to analyze not only Pthread programs, but also programs
parallelized using other threading APIs such as OpenMP.
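The exact interpreter API is not reproduced here, but conceptually Parasite registers callbacks for the seven events above. The following C++ interface is a hypothetical sketch with invented type and method names, not the actual Parceive API:

#include <cstdint>

using CallSiteId = std::uint64_t;  // hypothetical identifier types
using ThreadId = std::uint64_t;
using MutexId = std::uint64_t;
using Time = std::uint64_t;

// Hypothetical listener for the seven interpreter events used by Parasite.
class EventListener {
public:
    virtual ~EventListener() = default;
    virtual void onCall(CallSiteId site, Time now) = 0;       // 1. function call
    virtual void onReturn(CallSiteId site, Time now) = 0;     // 2. function return
    virtual void onNewThread(ThreadId child, Time now) = 0;   // 3. thread creation
    virtual void onThreadEnd(ThreadId child, Time now) = 0;   // 4. child thread ends
    virtual void onJoin(ThreadId child, Time now) = 0;        // 5. join with parent
    virtual void onLockAcquire(MutexId mutex, Time now) = 0;  // 6. lock acquired
    virtual void onLockRelease(MutexId mutex, Time now) = 0;  // 7. lock released
};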
5. Algorithm
In this chapter the elements of the Parasite algorithm will be discussed. Section 5.1 describes
the algorithm that Parasite employs to measure the work and span of call sites. Section 5.2
describes the data structures used in this algorithm. Section 5.3 describes the algorithm that
adjusts the work and span measurements to take effects of mutexes into account. Finally, sec-
tion 5.4 describes the directed acyclic graphs that Parasite constructs to verify its algorithm.
5.1. The Parasite Work-Span Algorithm
Conceptually, the Parasite algorithm is the same as the Cilkprof algorithm in [6], but imple-
mented to respond to the Parceive interpreter's events, described in section 4.3, instead of Cilk
constructs. This requires an explanation of how Cilk constructs translate to these events, in
terms of the algorithm (the actions of the operating system are not equivalent). In the Cilkprof
work-span algorithm, a cilk_spawn() is equivalent to a new thread event. A cilk_sync() is
equivalent to a join event where, after the join event, the thread has no current child threads
that have not already joined their parent thread.
Figure 5.1 shows, with pseudocode, the actions of the Cilkprof algorithm as it responds to
the Cilk constructs.
Figure 5.1.: The Cilkprof Work-Span Algorithm. (w = work, p = prefix, l = longest-child, c =
continuation)
The figure uses the following variable names, which are defined in [6], but here, the defini-
tions are written instead in terms of Parceive interpreter events:
1. F is a thread; “Called G” in the figure is a function called from F; otherwise G in the figure
is a child thread of F.
2. The time u is initially set to the beginning of F. As the execution proceeds, it is set to the
time of the new thread event that created the child thread of F, which realizes the longest
span of any child encountered so far since the last join event.
3. The work F.w is the serial runtime of call site F - its total computation.
4. The continuation F.c stores the span of the trace from the continuation of u through the
most recently executed instruction in F.
5. The longest-child F.l stores the span of the trace from the start of F through the thread end
event of the child thread that F creates at u.
6. The prefix F.p stores the span of the trace starting from the first instruction of F and ending
with u. The path through the DAG representing the program trace that has the length F.p
is guaranteed to be on the critical path of F.
Figure 5.2 illustrates the Cilkprof algorithm as it progresses through the execution of a pro-
gram trace. If the algorithm is still unclear, the reader is encouraged to read section 3 of [6], or
view the documentation and source code of the Parasite tool at [12].
Figure 5.2.: Updates to the span variables as Parasite is executed on a program trace. Each ar-
row represents a different thread. An arrow starting from another arrow is a new
thread event; an arrow intersecting with another is a join event. Colors indicate
different span variables. Before the first join, the continuation and longest child are
compared. The longest child is longer, so the prefix is updated to be the longest
child. Before the second join, the sum of the prefix and the continuation is com-
pared to the new longest child. The longest child is longer, so the prefix is updated
to become this longest child. After the second join, the prefix is now what was the
longest child before the join. After the end of the main thread, the remaining con-
tinuation of the main thread has been added to the prefix. Now the prefix is equal
to the entire span of the program.
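The core update rules of the algorithm, following the description in [6], can be sketched in C++ as follows. The frame layout and function names are simplified for illustration and are not Parasite's actual source:

#include <algorithm>
#include <cstdint>

using Time = std::uint64_t;

// Simplified work-span frame (w = work, p = prefix, l = longest-child, c = continuation).
struct Frame {
    Time w = 0, p = 0, l = 0, c = 0;
};

// A spawned child G ends and joins its parent F.
void onChildJoin(Frame& F, const Frame& G) {
    F.w += G.w;               // the child's work is added to the parent's work
    if (F.c + G.p > F.l) {    // this child realizes the longest span seen so far
        F.p += F.c;           // commit the continuation to the prefix
        F.l = G.p;            // remember this child as the longest child
        F.c = 0;              // restart the continuation after the spawn point
    }                         // otherwise G lies off the critical path and is discarded
}

// A sync: a join event after which F has no outstanding children.
void onSync(Frame& F) {
    F.p += std::max(F.c, F.l);  // the longer of continuation and longest child
    F.l = 0;                    // extends the prefix; both are then reset
    F.c = 0;
}

// A called (not spawned) function G returns to F.
void onCalledReturn(Frame& F, const Frame& G) {
    F.w += G.w;   // called work and span simply extend
    F.c += G.p;   // the parent's continuation
}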
Complexity Analysis. The Parasite algorithm has time complexity O(Ne), where Ne is the
number of events that the Parceive interpreter sends the Parasite tool. The number of these
events depends entirely on the Pthreads program that the algorithm analyzes. Inputs with
large numbers of threads, function calls, or mutex locks and unlocks will take much longer for
Parasite to analyze than inputs with few function calls or threads created.
The space complexity of the algorithm is also highly input dependent. There is a work
hashtable that includes an entry for every call site in the program being profiled. In addi-
tion to this, each thread has three span hashtables that each have an entry for every call site
that is called on the thread. If there are Nt threads, and each thread calls a fixed fraction f of all
of the Ncs call sites, then the complexity would be O(3 ∗ f ∗ Ncs ∗ Nt + Ncs) = O(Ncs ∗ Nt).
5.2. Work-Span Data Structures
In this section, the stack and hashtable data structures used by the work-span algorithm of
section 5.1 are described.
Work and Span Hashtables. Unique call site IDs are generated for each line of a program
where a function is called. Parasite uses hash tables that map these IDs to information about
the work or span of their respective call sites. For every call site, the work hashtable contains
the number of invocations, the total work (measured in time), and the function signature. A
span hashtable contains the longest-child, continuation, or prefix span of each call site on a
thread. It also contains an estimate of the time the thread spends waiting to acquire mutexes.
Function and Thread Stacks. As the Parceive interpreter simulates the execution of a pro-
gram, Parasite updates two stacks, a thread stack and a function stack. These stack data struc-
tures support the traditional stack push and pop operations. For each function call, the function
stack contains a frame with the function signature, the call site ID, and an object that tracks the
lock time intervals. It also contains two integers: the first indicates whether the function call
is the top invocation of its call site on the function stack, and the second indicates whether the
function call is the top invocation of its call site on the current thread. These integers are needed
to avoid the double-counting of work and span in call sites that are called recursively. For each
thread, the thread stack contains a frame with the following information:
1. The unique ID of the thread.
2. A list of interval data structures that stores the times in which mutexes in the thread and
the thread’s children are acquired.
3. Prefix, longest-child, and continuation spans of the thread.
4. A counter that represents the number of child threads spawned from the thread that are
currently on the thread stack.
5. A set that contains call sites which were pushed to the call stack while this thread was
the bottom thread. This set is used to set the integer on each function frame that indicates
whether it is the top invocation of the function’s call site on that thread.
Additionally, a set is used to track all the function call sites currently on the function stack,
in order to correctly set the integer on each function frame that indicates whether it is the top
invocation of the function’s call site in the program.
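A simplified C++ sketch of these structures, using the standard library containers mentioned above, is given below; the type and field names are illustrative rather than Parasite's exact source:

#include <cstdint>
#include <stack>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

using CallSiteId = std::uint64_t;
using ThreadId = std::uint64_t;
using Time = std::uint64_t;

struct WorkEntry {                  // one entry per call site, program-wide
    std::string function_signature;
    std::uint64_t invocations = 0;
    Time work = 0;
};

struct MutexInterval {              // time interval in which one mutex is held
    std::uint64_t mutex_id = 0;
    Time start = 0, span = 0;
};

struct FunctionFrame {              // one frame per function call
    CallSiteId call_site = 0;
    std::string signature;
    std::vector<MutexInterval> lock_intervals;
    bool top_on_stack = false;      // top invocation of this call site on the stack
    bool top_on_thread = false;     // top invocation of this call site on the thread
};

struct ThreadFrame {                // one frame per live thread
    ThreadId id = 0;
    Time prefix = 0, longest_child = 0, continuation = 0;
    int live_children = 0;          // child threads currently on the thread stack
    std::vector<MutexInterval> mutex_intervals;          // thread and its children
    std::unordered_set<CallSiteId> call_sites_seen;      // sites first seen on this thread
    std::unordered_map<CallSiteId, Time> prefix_spans;           // the three per-thread
    std::unordered_map<CallSiteId, Time> longest_child_spans;    // span hashtables,
    std::unordered_map<CallSiteId, Time> continuation_spans;     // keyed by call site ID
};

std::unordered_map<CallSiteId, WorkEntry> work_table;  // the program-wide work hashtable
std::stack<FunctionFrame> function_stack;
std::stack<ThreadFrame> thread_stack;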
5.3. Estimation of Mutex Effects
The effects of mutexes on runtime are non-deterministic, and can only be measured accurately
by running the program under test with concurrent execution. However, the goal of Parasite is
to estimate scalability using its mathematical work-span algorithm, instead of using direct mea-
surement. Therefore, a simple heuristic is used to estimate the impacts of mutex contention on
the span of call sites. This heuristic corrects the span and work of each thread if the time that a
mutex in the thread or its child threads is acquired is greater than the span of the thread without
considering mutexes. The approach is outlined in the following pseudocode, which calculates
an addition to the span and work, called mutex wait time in the source code. The correction is
only applied when the Parasite tool processes a sync - a join event after which the parent thread
has no current children. In the pseudocode, a mutex interval is a data structure storing the
start, span, and mutex ID (a unique ID for each mutex generated by the Parceive interpreter)
that describe a time interval in which a mutex is acquired. The child thread mutex list is
the list of mutexes that any of the child threads in the parent thread have acquired since the last
sync event. This approach does not take into account any overhead associated with acquiring
or releasing mutexes.
# Sum the time each mutex is held over all intervals recorded in the
# parent thread's children since the last sync.
mutex_total_span_list = []
for mutex in child_thread_mutex_list:
    for mutex_interval in mutex.mutex_interval_list:
        mutex.total_span += mutex_interval.span
    mutex_total_span_list.append(mutex.total_span)
# If the most contended mutex is held longer than the longest child span,
# the difference is added to both the span and the work as mutex wait time.
maximum_mutex_span = max(mutex_total_span_list)
correction = max(0, maximum_mutex_span - longest_child_span)
longest_child_span = longest_child_span + correction
parent_thread_work += correction
5.4. Graph Validation
Parasite constructs a directed acyclic graph while it profiles a program. In order to confirm that
the dynamic algorithm described in section 5.1 produces the correct result, this graph is used
to calculate the span of the program being profiled. Figure 5.3 shows such a DAG, for a parallel
program with a master-worker pattern and four worker threads.
Figure 5.3.: Directed acyclic graph of vector-vector multiplication program. In this figure, the
numbers on edges represent the time spent between events. TS = new thread event,
TE = thread end event, and R = return event. The numbers in the new thread and
thread event labels are the IDs of the threads. The numbers in the return event
labels are the call site IDs of the returning functions.
For thread start and thread end events, one vertex and one edge are added to the graph. The
length of the edge is the time elapsed since the last event, and the vertex represents the event
just generated. For a join event, two edges are created: the first edge connects the most recent
event on the parent thread to the join event. The second edge connects the thread end event of
the child thread to the join event. Therefore, a thread join event always has an inward degree of
2, as it joins two threads. A new thread event always has an outward degree of 2, as it creates
a new thread starting from the new thread vertex, and the parent thread continues.
After Parasite has completed its analysis of the program, it calculates the span of the DAG
it has constructed. The graph is built using data structures of the Boost Graph Library [13]
and written to a DOT file when Parasite finishes. This DOT file is then loaded into a Python
script, which calculates the longest path of the graph. The graph does not include estimates of
mutex wait times, so its longest path should equal the span of the main function computed by
Parasite's work-span algorithm (which includes mutex wait times) minus the mutex wait time
of the main function. This check was useful in the initial development of Parasite to confirm
that the algorithm was implemented correctly.
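As an illustration of this validation step, the following C++ sketch builds a tiny event DAG with the Boost Graph Library and writes it to a DOT file. It is only a minimal example of the data structures involved, not the Parasite source; labels and edge weights are made up:
#include <boost/graph/adjacency_list.hpp>
#include <boost/graph/graphviz.hpp>
#include <fstream>
#include <string>

// Vertices carry an event label (e.g. "TS 1", "TE 1", "J 1"); edges carry the
// time elapsed between the two events.
using EventDag = boost::adjacency_list<
    boost::vecS, boost::vecS, boost::directedS,
    boost::property<boost::vertex_name_t, std::string>,
    boost::property<boost::edge_weight_t, double>>;

int main() {
    EventDag g;
    auto name = boost::get(boost::vertex_name, g);
    auto weight = boost::get(boost::edge_weight, g);

    auto ts = boost::add_vertex(g);  name[ts] = "TS 1";  // new thread event
    auto te = boost::add_vertex(g);  name[te] = "TE 1";  // thread end event
    auto j  = boost::add_vertex(g);  name[j]  = "J 1";   // join event (in-degree 2)

    weight[boost::add_edge(ts, te, g).first] = 5.0;  // work on the child thread
    weight[boost::add_edge(ts, j,  g).first] = 3.0;  // continuation of the parent
    weight[boost::add_edge(te, j,  g).first] = 0.0;  // child end joins the parent

    std::ofstream out("dag.dot");
    boost::write_graphviz(out, g, boost::make_label_writer(name));
    return 0;
}
The resulting DOT file can then be read back by the Python validation script described above.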
Longest Path Algorithm. Since a shared-memory parallel program can be represented as a
DAG, one way of finding the span of the program is to employ a longest-path algorithm on
the DAG. To check the correctness of its work-span algorithm, Parasite uses the code in listing
5.1 to calculate the longest path in the graph it constructs to represent its input program. This
algorithm is taken directly from the Python networkx library [14].
Listing 5.1: Longest path algorithm for a DAG [14].
import networkx as nx

def longest_path(G):
    dist = {}  # stores a (distance, predecessor) pair for each node
    for node in nx.topological_sort(G):
        # pairs of (dist, node) for all incoming edges
        pairs = [(dist[v][0] + 1, v) for v in G.pred[node]]
        if pairs:
            dist[node] = max(pairs)
        else:
            dist[node] = (0, node)
    node, (length, _) = max(dist.items(), key=lambda x: x[1])
    path = []
    while length > 0:
        path.append(node)
        length, node = dist[node]
    return list(reversed(path))
Complexity Analysis. It is interesting to compare the complexity of the algorithm in listing
5.1 with Parasite’s algorithm. This algorithm first uses a topological sort, which orders the
vertices of the graph so that for every edge from m to n, m comes before n in the ordering. This
is possible for any directed acyclic graph. The complexity of the topological sort is Θ(V + E),
where V is the number of vertices, and E is the number of edges [15].
After sorting, the algorithm looks at, for each node, the edges from predecessors of this node
to the node itself. Therefore, its complexity is O(V + E). In the DAG generated by Parasite, a
vertex is a thread start, thread end, or thread join event. Every thread except the main thread
has three of these events. Therefore, the number of vertices is O(Nt), where Nt is the number
of threads spawned during the program execution. The number of edges is also O(Nt), so
the complexity of the algorithm is O(Nt). If the same algorithm were applied to each call site
in Parasite, and each call site had the same number of threads, then the complexity would be
O(Nt^2).
The Parasite algorithm has time complexity O(Ne), where Ne is the total number of events. The
number of events may be linearly proportional to the number of threads or may follow a different
relation, so the complexity of the Parasite algorithm can be similar to that of the longest-path
algorithm or much greater; the relative complexity depends entirely on the program being
profiled. The Parasite algorithm, however, provides more information than the longest path
of the entire program: it gives the parallelism of each call site, as well as an estimate of the
effect of mutexes on the parallelism.
Part III.
Results and Conclusion
6. Results
In this section Parasite is applied to a diverse set of programs. For each program, the method
of parallelization is discussed, and Parasite is used to estimate the resulting scalability. First,
a simple program that calculates the Nth Fibonacci number is used to verify the correctness
of Parasite’s algorithm. Second, for vector-vector multiplication and four parallel sorting algo-
rithms, Parasite is applied to the programs multiple times to show the dependence of the paral-
lelism on input parameters. Third, using a simulation of the European football championships,
Parasite is shown to be able to determine the scalability of an application with irregular par-
allelism. Fourth, for a molecular dynamics simulation, the parallelisms of call sites in different
parallelizations of the same program are compared. Finally, CPPCheck is used to show that
Parasite can be used to quickly test a potential parallelization of a sequential program. For
some of the programs, the theoretical parallelism of certain call sites can be calculated and
compared to the parallelism that Parasite measures for these call sites.
6.1. Fibonacci Sequence
The code below is an abbreviated parallelization of calculating the Nth Fibonacci number [16]:
#include <pthread.h>
#include <stddef.h>
#define N 20
void* fibonacci_thread(void* arg) {
  size_t n = (size_t) arg, fib;
  pthread_t thread_1, thread_2;
  void* pvalue;
  if ((n == 0) || (n == 1))
    return (void*) 1;
  pthread_create(&thread_1, 0, fibonacci_thread, (void*)(n - 1));
  pthread_create(&thread_2, 0, fibonacci_thread, (void*)(n - 2));
  pthread_join(thread_1, &pvalue);
  fib = (size_t) pvalue;
  pthread_join(thread_2, &pvalue);
  fib += (size_t) pvalue;
  return (void*) fib;
}
size_t fibonacci(size_t n) {
  return (size_t) fibonacci_thread((void*) n);
}
int main() {
  fibonacci(N);
}
Table 6.1 shows the parallelism of the different function calls in this program, for finding the
20th Fibonacci number.
Call Site          Parallelism   Percentage of Work   Count
fibonacci thread   642.401       99.4487              6764
fibonacci thread   241.458       99.856               1
fibonacci          221.677       99.9273              1
main               205.314       100                  1
Table 6.1.: Call site parallelism for calculating the 20th Fibonacci number.
For this method of calculating the Nth Fibonacci number, the parallelism has been shown to be:
Parallelism(n) = Θ(φ^n / n) (6.1)
where φ is the golden ratio [17]. Figure 6.1 confirms that Parasite’s measurements for the
parallelism of the fibonacci function in this code follow the theoretical prediction. This is a
useful validation that Parasite’s algorithm is implemented correctly.
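As a rough numerical check of this bound (an estimate only, since the Θ-notation hides constant factors): for n = 20, φ^n/n ≈ 15127/20 ≈ 756, while Table 6.1 reports a parallelism of about 222 for the fibonacci call site, a ratio of roughly 0.3. Figure 6.1 shows that this proportionality is maintained as n grows.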
[Plot: parallelism of the fibonacci function versus φ^n/n.]
Figure 6.1.: Dependence of parallelism of Fibonacci function on the number N it calculates.
6.2. Vector-Vector Multiplication
The first test of Parasite was the following parallel vector-vector multiplication program [18]:
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
/* INPUT VARIABLES */
#define NUM_THREADS 5
#define VECTOR_SIZE 1000000000
pthread_mutex_t mutex_sum = PTHREAD_MUTEX_INITIALIZER;
int *VecA, *VecB, sum = 0, dist;
/* Thread callback function */
void * doMyWork(int myId)
{
int counter, mySum = 0;
/*calculating local sum by each thread */
for (counter = ((myId - 1) * dist); counter <= ((myId * dist) - 1);
counter++)
mySum += VecA[counter] * VecB[counter];
/*updating global sum using mutex lock */
pthread_mutex_lock(&mutex_sum);
sum += mySum;
pthread_mutex_unlock(&mutex_sum);
return NULL;
}
/*Main function start */
int main(int argc, char *argv[])
{
/*variable declaration */
int ret_count;
pthread_t * threads;
pthread_attr_t pta;
double time_start, time_end, diff;
struct timeval tv;
struct timezone tz;
int counter, NumThreads, VecSize;
NumThreads = NUM_THREADS;
VecSize = VECTOR_SIZE;
/*Memory allocation for vectors */
VecA = (int *) malloc(sizeof(int) * VecSize);
VecB = (int *) malloc(sizeof(int) * VecSize);
pthread_attr_init(&pta);
threads = (pthread_t *) malloc(sizeof(pthread_t) * NumThreads);
dist = VecSize / NumThreads;
/*Vector A and Vector B initialization */
for (counter = 0; counter < VecSize; counter++) {
VecA[counter] = 2;
VecB[counter] = 3;
}
/*Thread Creation */
for (counter = 0; counter < NumThreads; counter++) {
pthread_create(&threads[counter], &pta, (void *(*) (void *)) doMyWork,
(void *) (counter + 1));
}
/*joining threads */
for (counter = 0; counter < NumThreads; counter++) {
pthread_join(threads[counter], NULL);
}
printf("\n The Sum is: %d.", sum);
pthread_attr_destroy(&pta);
return 0;
}
This is the simplest style of Pthread program, with two functions: a main function and a
worker function. The main function creates several threads which perform the worker function.
In this case, the worker function doMyWork takes sections of the vectors, multiplies these sec-
tions, and adds the result to a global sum variable, which is protected by the mutex mutex_sum
to avoid race conditions.
Figure 6.2 shows the dependence of the parallelism on the vector size, using ten worker
threads. The result approaches a limit of about 9.9. If the threads had equal work, the paral-
lelism would be equal to 10. The parallelism approaches 9.9 instead because the work is not
perfectly balanced between the threads, even though each thread multiplies sections of identical
size: factors such as memory access time and the time to lock and unlock the mutex vary between
the threads.
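As a rough illustration of how little imbalance this requires: if the call-site parallelism is approximated as the sum of the worker times divided by the time of the slowest worker, then a slowest worker that is only about 1% slower than the average already gives 10/1.01 ≈ 9.9.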
Figure 6.2.: Dependence of parallelism of worker function on vector size for parallel vector-
vector multiplication, with a fixed number of 10 worker threads
Figure 6.3 shows the dependence of the parallelism on the number of threads, using a fixed input
size of 10^9. As would be expected, the parallelism increases linearly with the number of threads.
Figure 6.3.: Dependence of parallelism of worker function on number of threads for parallel
vector-vector multiplication, using a fixed input size of 10^9.
6.3. Sorting Algorithms
Most computer scientists are familiar with the sorting algorithms bubble sort, quick sort, radix
sort, and merge sort. In this section Parasite will be applied to programs that implement each of
these sorting algorithms in parallel, to illustrate Parasite’s ability to show the overall and local
parallelism in a program. Specifically, with these sorting algorithms, Parasite shows quantita-
tively how the parallelism depends on input size, recursion depth, and granularity.
6.3.1. Bubble Sort
The following abbreviated code is a simple parallelization of the bubble sort algorithm, which
passes through an array of values, and swaps neighboring values if their order is not cor-
rect [19]. In the sequential version of bubble sort, the first pass starts at index 0 of the
array, and passes continue until all the values are sorted. In the parallel
version, the swaps starting at all even indices are performed in parallel, then the swaps starting
at all odd indices are performed in parallel. This process is repeated until all the elements are
sorted.
#include <pthread.h>
#define DIM 200
int a[DIM], swapped = 0;
pthread_t thread[DIM];
void *bubble(void *arg) {
  int i = (int)(long) arg;
  int tmp;
  if (i != DIM-1) {
    if(a[i] > a[i+1]) {
      tmp = a[i];
      a[i] = a[i+1];
      a[i+1] = tmp;
      swapped = 1;
    }
  }
  return NULL;
}
int main() {
int i;
fill_a_with_random_integers();
do {
swapped = 0;
for(i = 0; i < DIM; i+=2)
pthread_create(&thread[i], NULL, &bubble, (void *)(long) i);
for(i = 0; i < DIM; i+=2)
pthread_join(thread[i], NULL);
swapped = 0;
for(i = 1; i < DIM; i+=2)
pthread_create(&thread[i], NULL, &bubble, (void *)(long) i);
for(i = 1; i < DIM; i+=2)
pthread_join(thread[i], NULL);
} while(swapped == 1);
}
Figure 6.4 shows that this bubble sort implementation is highly parallel and that the parallelism
increases quickly with the input size for small inputs. This is expected, as the number of threads
spawned to perform the swaps grows quadratically with the input size: there are O(N^2) swaps,
and hence O(N^2) threads are created, where N is the input size. However, after the input size
reaches about 200, the parallelism levels off at about 2.69, possibly because some swaps take
more time than others. Larger input sizes were not tested because they make the runtime of
Parasite very high; the complexity of Parasite depends on the number of function and thread
events, which for this code is also O(N^2).
Figure 6.4.: Dependence of parallelism on input size for parallel bubble sort.
Table 6.2 shows that the two call sites of bubble have approximately equal work. Gen-
erating these tables for progressively larger input sizes shows that the parallelism and the work
percentage of the two bubble call sites approach each other as the input size increases, so
that they are eventually equal. This should be expected, because both call sites have the same
number of calls, plus or minus one, and they sort random integers.
Call Site    Parallelism   Percentage of Work   Count
bubble       98.5142       8.38823              9100
bubble       83.046        8.41496              9100
main         2.69487       100                  1
v initiate   1             0.99462              1
Table 6.2.: Call site parallelism for parallel bubble sort on 200 integers.
6.3.2. Quicksort
In the sequential version of Quicksort, in a partition function, elements are sorted around a
randomly chosen pivot element so that all elements less than the pivot are in one array, and
elements greater than the pivot are in another array. The process is repeated recursively on
both of these arrays until each array is sorted, and then the arrays are combined together with
the pivot in order. The following code is a simple parallelization of the quicksort algorithm (for
brevity, the partition function is omitted):
#define RECURSIVE_DEPTH 16
#define INPUT_SIZE 100000
/**
 * Structure containing the arguments to the parallel_quicksort function. Used
 * when starting it in a new thread, because pthread_create() can only pass one
 * (pointer) argument.
 */
struct qsort_starter
{
int *array;
int left;
int right;
int depth;
};
void parallel_quicksort(int *array, int left, int right, int depth);
/**
 * Thread trampoline that extracts the arguments from a qsort_starter structure
 * and calls parallel_quicksort.
 */
void* quicksort_thread(void *init)
{
struct qsort_starter *start = init;
parallel_quicksort(start->array, start->left, start->right,
start->depth);
return NULL;
}
/**
 * Parallel version of the quicksort function. Takes an extra parameter:
 * depth. This indicates the number of recursive calls that should be run in
 * parallel. The total number of threads will be 2^depth. If this is 0, this
 * function is equivalent to the serial quicksort.
 */
void parallel_quicksort(int *array, int left, int right, int depth)
{
if (right > left)
{
int pivotIndex = left + (right - left)/2;
pivotIndex = partition(array, left, right, pivotIndex);
// Either do the parallel or serial quicksort, depending on the depth
// specified.
if (depth-- > 0)
{
// Create the thread for the first recursive call
struct qsort_starter arg = {array, left, pivotIndex-1, depth};
pthread_t thread;
int ret = pthread_create(&thread, NULL, quicksort_thread, &arg);
assert((ret == 0) && "Thread creation failed");
// Perform the second recursive call in this thread
parallel_quicksort(array, pivotIndex+1, right, depth);
// Wait for the first call to finish.
pthread_join(thread, NULL);
}
else
{
quicksort(array, left, pivotIndex-1);
quicksort(array, pivotIndex+1, right);
}
}
}
int main(int argc, char **argv)
{
int depth = RECURSIVE_DEPTH;
// Size of the array to sort. Optionally specified as the second argument
// to the program.
int size = INPUT_SIZE;
// Allocate the array and initialise it with pseudorandom numbers. The
// random number generator is always seeded with the same value, so this
// should give the same sequence of numbers.
int *values = calloc(size, sizeof(int));
assert(values && "Allocation failed");
int i = 0;
for (i=0 ; i<size ; i++)
{
values[i] = i * (size - 1);
}
// Sort the array
parallel_quicksort(values, 0, size-1, depth);
return 0;
}
Here, the recursive calls to quicksort on the arrays smaller and larger than the pivot are made
in parallel, up to a developer-specified depth [20]. After this depth a sequential quicksort is
applied to the array. The parallelism of this parallel Quicksort algorithm can be derived in the
ideal case, where the partition step splits the array evenly around the pivot at every level. The
work and span have the following recurrence relations, presented in [3]:
T1(N) = N + 2T1(N/2) (6.2)
T∞(N) = N + T∞(N/2) (6.3)
The solution of the recurrence relations, derived in [3], is:
T1(N) = Θ(N lg N) (6.4)
T∞(N) = Θ(N) (6.5)
Therefore the theoretical parallelism is:
T1(N)/T∞(N) = Θ(N lg N)/Θ(N) = Θ(lg N) (6.6)
Call Site            Parallelism   Percentage of Work   Count
quicksort            32.8953       39.6884              32
quicksort            32.3863       73.4517              5841
partition            32.3009       12.7183              5841
quicksort            32.2813       73.6963              5841
quicksort            32.2134       39.3518              32
parallel quicksort   31.2151       78.496               31
parallel quicksort   25.8009       81.9838              31
quicksort thread     25.3564       82.0705              31
parallel quicksort   16.7058       86.7864              1
main                 5.48624       100                  1
partition            1.30958       0.761345             63
Table 6.3.: Call site parallelism for parallel quicksort with recursion depth 5 on 10,000 integers.
Table 6.3 shows the parallelism and work percentage of call sites in the parallel quicksort for a
recursion depth of 5, on an input size of 10,000 integers. The four sequential quicksort call sites
all have a parallelism of about 32, as does the partition call site within the quicksort function.
These call sites are all called after the recursion reaches depth 5. The parallel quicksort algorithm
can be viewed as a full binary tree in which each node that is not a leaf is a call to parallel quicksort.
The leaf nodes are calls to the sequential quicksort function, and there are 2^D leaves in a full
binary tree, where D is the recursion depth; so here, where the depth is 5, a parallelism of
32 in the leaf function is expected. Figure 6.5 confirms that the average parallelism of the four
quicksort call sites is approximately equal to 2^D, where D is the recursion depth of the quicksort
program.
[Plot: log2 of the quicksort parallelism versus recursion depth.]
Figure 6.5.: Dependence of parallelism of sequential quicksort calls on recursion depth.
Next, the parallelism of the top call to parallel quicksort observed using Parasite was
compared to the theoretical parallelism of parallel quicksort, in Figure 6.6. To remain as close
as possible to the ideal case, recursion depths of floor(log2(N)) were used. Interestingly, this
parallel quicksort function seems to show a linear dependence of parallelism on input
size, instead of the logarithmic dependence that theory predicts. This is likely because the
algorithm used here applies the sequential quicksort algorithm after the recursion depth is
reached.
Figure 6.6.: Dependence of parallelism of parallel quicksort on input size.
6.3.3. Radix Sort
The following code is a parallelization of the radix sort algorithm [21]:
#define NTHREADS 5
#define INPUT_SIZE 1000000
/* Bits of value to sort on. */
#define BITS 29
/* Thread arguments for radix sort. */
struct rs_args {
int id; /* thread index. */
unsigned *val; /* array. */
unsigned *tmp; /* temporary array. */
int n; /* size of array. */
int *nzeros; /* array of zero counters. */
int *nones; /* array of one counters. */
int t; /* number of threads. */
};
/* Global variables and utilities. */
struct rs_args *args;
pthread_barrier_t barrier;
/* Individual thread part of radix sort. */
void radix_sort_thread (unsigned *val, /* Array of values. */
unsigned *tmp, /* Temp array. */
int start, int n, /* Portion of array. */
int *nzeros, int *nones, /* Counters. */
int thread_index, /* My thread index. */
int t) /* Number of threads. */
{
unsigned *src, *dest;
int bit_pos;
int index0, index1;
int i;
/* Initialize source and destination. */
src = val;
dest = tmp;
/* For each bit... */
for ( bit_pos = 0; bit_pos < BITS; bit_pos++ ) {
/* Count elements with 0 in bit_pos. */
nzeros[thread_index] = 0;
for ( i = start; i < start + n; i++ ) {
if ( ((src[i] >> bit_pos) & 1) == 0 ) {
nzeros[thread_index]++;
}
}
nones[thread_index] = n - nzeros[thread_index];
/* Get starting indices. */
index0 = 0;
index1 = 0;
for ( i = 0; i < thread_index; i++ ) {
index0 += nzeros[i];
index1 += nones[i];
}
index1 += index0;
for ( ; i < t; i++ ) {
index1 += nzeros[i];
}
/* Move values to correct position. */
for ( i = start; i < start + n; i++ ) {
if ( ((src[i] >> bit_pos) & 1) == 0 ) {
dest[index0++] = src[i];
} else {
dest[index1++] = src[i];
}
}
/* Swap arrays. */
tmp = src;
src = dest;
dest = tmp;
}
}
/* Thread main routine. */
void *thread_work (void *arg)
{
int start, n;
int index = (int)(long) arg;
/* Ensure all threads have reached this point, and then let continue. */
pthread_barrier_wait(&barrier);
/* Get portion of array to process. */
n = args[index].n / args[index].t; /* Number of elements this thread is in charge of */
start = args[index].id * n; /* Thread is in charge of [start, start+n] elements */
/* Perform radix sort. */
radix_sort_thread (args[index].val, args[index].tmp, start, n,
args[index].nzeros, args[index].nones, args[index].id,
args[index].t);
return NULL;
}
void radix_sort (unsigned *val, int n, int t)
{
unsigned *tmp;
int *nzeros, *nones;
int r, i;
/* Thread-related variables. */
long thread;
pthread_t* thread_handles;
/* Allocate temporary array. */
tmp = (unsigned *) malloc (n * sizeof(unsigned));
/* Allocate counter arrays. */
nzeros = (int *) malloc (t * sizeof(int));
nones = (int *) malloc (t * sizeof(int));
/* Initialize thread handles and barrier. */
thread_handles = malloc (t * sizeof(pthread_t));
pthread_barrier_init (&barrier, NULL, t);
/* Initialize thread arguments. */
for ( i = 0; i < t; i++ ) {
args[i].id = i;
args[i].val = val;
args[i].tmp = tmp;
args[i].n = n;
args[i].nzeros = nzeros;
args[i].nones = nones;
args[i].t = t;
/* Create a thread. */
pthread_create (&thread_handles[i], NULL, thread_work, (void *)(long) i);
}
/* Wait for threads to join and terminate. */
for ( i = 0; i < t; i++ )
pthread_join (thread_handles[i], NULL);
/* Copy array if necessary. */
if ( BITS % 2 == 1 ) {
copy_array (val, tmp, n);
}
}
int main (int argc, char *argv[])
{
int n, t;
unsigned *val;
time_t start, end;
n = INPUT_SIZE;
t = NTHREADS;
val = (unsigned *) malloc (n * sizeof(unsigned));
random_array (val, n);
args = (struct rs_args *) malloc (t * sizeof(struct rs_args));
radix_sort (val, n, t); /* The main algorithm. */
}
The sequential version of this algorithm first sorts the numbers by their least significant digit,
then by their next significant digit, and so on until the entire sequence of numbers is sorted. The
parallelization splits the array of numbers into equal portions and determines their position in
the overall array using prefix sums on each digit; for a detailed explanation, see [22]. Table 6.4
shows the call site parallelism profiles. Interestingly, the parallelism of these call sites does
not change significantly when the input size is varied or when the number of threads used is
changed. If the developer wished to acquire higher speedup from the program, she could try
modifying the implementation to make it more scalable, by decreasing the amount of time
spent in non-parallelizable sections (radix sort only has about 43% of the work) or possibly
choosing a different parallelization of the radix sort algorithm.
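A rough Amdahl-style estimate from Table 6.4 underlines this: the sequential random array call alone accounts for about 35% of the total work, so even a perfectly parallel radix sort would bound the speedup of the whole program to roughly 1/0.35 ≈ 2.8; reducing or parallelizing this initialization would therefore also be necessary for higher overall scalability.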
Call Site           Parallelism   Percentage of Work   Count
radix sort thread   1.82281       52.99                2
radix sort          1.77026       60.7733              1
thread work         1.70867       57.7119              2
main                1.36292       100                  1
random array        1             35.2695              1
Table 6.4.: Call site parallelism for parallel radix sort on 10^6 integers with 5 threads.
6.3.4. Merge Sort
The following code is a simple parallelization of the merge sort algorithm [23]:
#define TYPE int
#define MIN_LENGTH 2
#define INPUT_SIZE 10000
typedef struct {
TYPE *array;
int left;
int right;
int tid;
} thread_data_t;
int number_of_threads;
pthread_mutex_t lock_number_of_threads;
// The function passed to a pthread_t variable.
void *merge_sort_threaded(void *arg) {
thread_data_t *data = (thread_data_t *) arg;
int l = data->left;
int r = data->right;
int t = data->tid;
if (r - l + 1 <= MIN_LENGTH) {
// Length is too short, let us do a |qsort|.
qsort(data->array + l, r - l + 1, sizeof(TYPE), my_comp);
} else {
// Try to create two threads and assign them work.
int m = l + ((r - l) / 2);
// Data for thread 1
thread_data_t data_0;
data_0.left = l;
data_0.right = m;
data_0.array = data->array;
pthread_mutex_lock(&lock_number_of_threads);
data_0.tid = number_of_threads++;
pthread_mutex_unlock(&lock_number_of_threads);
// Create thread 1
pthread_t thread0;
int rc = pthread_create(&thread0,
NULL,
merge_sort_threaded,
&data_0);
// Data for thread 2
thread_data_t data_1;
data_1.left = m + 1;
data_1.right = r;
data_1.array = data->array;
pthread_mutex_lock(&lock_number_of_threads);
data_1.tid = number_of_threads++;
pthread_mutex_unlock(&lock_number_of_threads);
// Create thread 2
pthread_t thread1;
pthread_create(&thread1,
NULL,
merge_sort_threaded,
&data_1);
int created_thread_1 = 1;
// Wait for the created threads.
pthread_join(thread0, NULL);
pthread_join(thread1, NULL);
// Ok, both done, now merge.
// left - l, right - r
merge(data->array, l, r, t);
}
return NULL;
}
void merge_sort(TYPE *array, int start, int finish) {
thread_data_t data;
data.array = array;
data.left = start;
data.right = finish;
// Initialize the shared data.
number_of_threads = 0;
pthread_mutex_init(&lock_number_of_threads, NULL);
data.tid = 0;
// Create and initialize the thread
pthread_t thread;
pthread_create(&thread,
NULL,
merge_sort_threaded,
&data);
// Wait for thread, i.e. the full merge sort algo.
pthread_join(thread, NULL);
}
int main(int argc, char **argv) {
int n = INPUT_SIZE;
int *p = random_array(n);
merge_sort(p, 0, n - 1);
free(p);
pthread_mutex_destroy(&lock_number_of_threads);
}
The sequential version of this algorithm first divides the unsorted list of numbers into n small
sublists, which are sorted using an algorithm of choice. Then the sublists are merged (combined
into larger sorted lists) until there is a single sorted list remaining. The parallel implementation
gives the two calls to merge sort on each recursion level to two threads, which then carry out
the parallel merge sort on their own subarrays until the size of the subarray is less than a user-set
minimum merge sort size. Then, the sequential C library sort qsort is applied to the subarray.
Table 6.5 shows Parasite's profile for a run of this parallel merge sort with 10,000 integers
and a minimum merge sort size of 10. merge only occupies 3.6% of the work that merge sort
performs, indicating that the calls to the sequential C library function qsort (not measured
by Parasite) are much more expensive. Therefore, decreasing the input size at which this qsort
is performed should increase the parallelism. Further tests confirmed that the parallelism
of the top-level merge sort call in main continues to increase until the minimum merge sort
size is one. However, Parasite cannot show the true operating system overhead
of pthread_create(...) and pthread_join(...) that would occur in concurrent execution;
with this effect, the minimum merge sort size for peak parallelism would likely be
greater than one.
Call Site             Parallelism (P)   P including Mutex Correction   % of Work   Count
merge                 199.301           199.301                        3.58896     1023
merge sort threaded   99.1834           21.7108                        64.2943     1
merge sort            50.4269           18.1671                        65.07       1
main                  16.2865           11.9586                        100         1
random array          1                 1                              4.66903     1
Table 6.5.: Call site parallelism for parallel merge sort on 10,000 integers with a minimum merge
sort size of 10.
Note that the single mutex in this parallel merge sort, which protects the global count of the num-
ber of threads, reduces the parallelism of the call to merge sort by about 60 percent. However,
this global thread count is only used for debugging purposes, so it could be removed. Without
Parasite, the programmer would not necessarily know that this mutex has a signifi-
cant effect on the parallelism. With Parasite, the programmer sees a clear contrast between the
two parallelism columns of the table, and knows that removing the mutex can improve the parallelism
significantly.
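If the counter is kept, the mutex could also be removed by replacing it with an atomic increment. The following minimal sketch is not part of the original program; it is written here in C++ (the original listing is C, where the C11 header stdatomic.h offers the equivalent):
#include <atomic>

// Shared thread counter without a mutex: each thread reserves its ID with a
// single atomic read-modify-write instead of lock / increment / unlock.
static std::atomic<int> number_of_threads{0};

static int next_thread_id() {
    // fetch_add returns the previous value, i.e. a unique, increasing ID,
    // matching the behaviour of the original number_of_threads++.
    return number_of_threads.fetch_add(1);
}
With this change, the two lock/unlock pairs in merge_sort_threaded would collapse to data_0.tid = next_thread_id(); and data_1.tid = next_thread_id();, removing the serialization that Parasite's mutex correction accounts for.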
In [3] it is shown that parallel merge sort has the following work and span:
T1(N) = Θ(N lg N) (6.7)
T∞(N) = Θ(lg^3 N) (6.8)
Therefore the theoretical parallelism is:
T1(N)/T∞(N) = Θ(N lg N)/Θ(lg^3 N) = Θ(N/lg^2 N) (6.9)
Figure 6.7 plots Parasite's measured parallelism of the parallel mergesort function against
N/lg^2 N, where N is the input size. The plot is approximately linear, which indicates that the
implementation has the expected theoretical parallelism.
Figure 6.7.: Dependence of parallelism of parallel mergesort on N, the input vector size.
6.3.5. Summary
Parallel sorting programs act as a more demanding validation of the Parasite tool, by showing
that Parasite produces reasonable parallelism values for recursive algorithms with large depth,
such as the parallel quicksort and parallel merge sort algorithms, and that the parallelism mea-
sured agrees with the theoretical parallelism predicted in the case of mergesort. Furthermore,
the sorting tests demonstrate that Parasite can quickly help the developer find information
useful for setting parameters in their program including the input size, the number of threads
used, and the recursion depth reached before a sequential algorithm is used for the sort. The de-
veloper could gain further information useful to the design of their parallel sorting algorithms
through the Parasite tool by examining the effect of different distributions of input number sets
on the parallelism of each sorting algorithm.
6.4. European Championships Simulation
In this section EMSim, a simulation of the European Football Championships, will be ana-
lyzed [24]. This program has a more complex structure than a simple master-worker pat-
tern. Without understanding football, one does not necessarily know how many matches
can be played concurrently in this simulation, or how the simulation would scale with more
threads. Parasite gives the programmer unambiguous numbers for the parallelism of all the
call sites, which are useful for showing the scalability of individual parts of the simulation, as
well as problems with load balancing.
Program Structure. Figure 6.8 shows Parceive’s visualization of the EMsim program.
Figure 6.8.: The call tree structure of the European championships simulation. For clarity only specific subtrees are expanded.
As the calling context tree view in the center of this figure shows visually, the program fol-
lows the following steps:
1. In the initDB method, read information containing statistics on previous EM matches into
a database.
2. Get the 24 teams who will be playing.
3. Simulate a full execution of the European championships, through the following steps:
a) Initialize the simulation.
b) Make six parallel calls to playGroup (the call site labeled parallel calls to playGroup),
each of which simulates the six matches of one group in the group phase of the EM
simulation. Within a group, the matches are played sequentially.
c) Sort the team scores based on their group results.
d) Call, in parallel, 8 calls to playMatchInPar to simulate the round of 16.
e) Call, in parallel, 4 calls to playMatchInPar to simulate the quarterfinals.
f) Call, in parallel, 2 calls to playMatchInPar to simulate the semifinals.
g) Simulate the final match.
Parceive’s visualization shows some useful information about the program, including that
the work is not evenly balanced between the calls to playMatchGen, the function call at the
second to bottom level of the performance view. However, its visualization only provides a
qualitative view of the parallelism in the EM simulation.
Parallelism. Parasite provides quantitative views: Table 6.6 shows all the function calls in
the simulation with parallelism greater than 1.
Call Site                     Parallelism   Percentage of Work   Count
team1DominatesTeam2           15.5336       0.00218563           72
getNumMatches                 6.84717       0.746724             52
getMatches                    6.4472        0.963656             51
getGoalsPerGame               6.09455       0.00455375           111
getMatchesInternal            5.60061       0.113177             52
getMatchesInternal            5.47997       0.103317             52
fillPlayer                    4.71968       9.90176              2621
fillPlayer                    4.60666       3.80308              222
parallel calls to playGroup   4.33232       69.1592              16
playGroup                     4.33212       69.1557              6
playGroupMatch                4.33129       69.1352              36
playMatchGen                  4.33077       69.1262              36
playEM                        3.75038       97.9785              2
getPlayersOfMatch             3.73775       96.8404              111
getGoalsPerGame               3.63284       0.00139767           111
getNumPlayersOfMatch          3.62938       20.685               111
getNumPlayersOfMatch          3.6288        20.7125              111
main                          3.58549       100                  1
playFinalRound                2.84291       28.7715              4
playMatchInPar                2.84202       28.7286              15
playFinalMatch                2.84182       28.7262              15
playMatchGen                  2.84162       28.7192              30
getTeam                       1.24702       0.000830991          18
Table 6.6.: Parallelism of call sites in EMsim.
Table 6.6 provides the insight that none of the call sites in EMsim has a parallelism greater
than 7, other than team1DominatesTeam2, which accounts for less than 0.01% of the work. There-
fore, with the current design of the simulation, there is probably no benefit to using more than
seven threads at any point in the simulation. This insight is not immediately clear from Figure
6.8, the program description, or the DAG Parasite generates for the program. Furthermore, the
Parceive visualization suggests that parallel calls to playGroup could have a parallelism of at
most 6, as it makes six concurrent calls to playGroup. Its parallelism is instead about 4.3, indi-
cating that the calls are not balanced in terms of span: one of the calls must take significantly
longer than the others. A similar observation can be made for playGroupMatch, which is reached
through the six concurrent calls to playGroup but also only has a parallelism of about 4.3.
The function playGroupMatch is called 6 times from each of the six calls of playGroup.
This function simulates a match, and in the group phase, all matches are independent, so it
could have a parallelism of 36 if all matches involve the same work. However, Parasite only
measures a parallelism of 4.33 for this function, indicating that the simulation could be re-
designed to be more scalable in the group match phase.
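One possible redesign along these lines is sketched below: the group phase spawns one thread per group match instead of one thread per group, so that up to 36 matches can run concurrently. The task structure and the wrapped call are purely illustrative and do not correspond to the actual EMSim interfaces:
#include <pthread.h>

// Hypothetical description of one group match; the real EMSim types differ.
struct GroupMatchTask {
    int group;  // 0..5
    int match;  // 0..5 within the group
};

// Hypothetical worker that would wrap the existing playGroupMatch logic.
void* play_group_match_worker(void* arg) {
    GroupMatchTask* task = static_cast<GroupMatchTask*>(arg);
    // playGroupMatch(task->group, task->match);   // illustrative call only
    (void) task;
    return nullptr;
}

// Flat group phase: one thread per match rather than one thread per group.
void play_group_phase_flat() {
    const int groups = 6, matches_per_group = 6;
    pthread_t threads[groups * matches_per_group];
    GroupMatchTask tasks[groups * matches_per_group];
    for (int g = 0; g < groups; ++g) {
        for (int m = 0; m < matches_per_group; ++m) {
            int i = g * matches_per_group + m;
            tasks[i] = {g, m};
            pthread_create(&threads[i], nullptr, play_group_match_worker, &tasks[i]);
        }
    }
    for (int i = 0; i < groups * matches_per_group; ++i)
        pthread_join(threads[i], nullptr);
}
Whether the matches of one group can really run concurrently depends on how match results and team statistics are updated, so such a draft would again be a candidate for a quick Parasite run rather than a finished parallelization.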
6.5. Molecular Dynamics
Appendix 8.1 contains a simple serial molecular dynamics code [25]. Table 6.7 shows the dis-
tribution of work for the sequential simulation.
Call Site    Parallelism   Percentage of Work   Count
main         1             100                  1
update       1             1.08808              10
compute      1             10.0827              10
calKinetic   1             0.923045             110
compute      1             12.6647              1
main work    1             96.8591              1
distance     1             4.01697              990
initialize   1             3.35982              1
Table 6.7.: Call site parallelism and work percentage for the serial molecular dynamics code in
Appendix 8.1, using 10 atoms.
The compute function performs the largest percentage of the work, because it has nested
for loops over all atoms in the simulation. The distance function, which is called inside two
nested for loops over the atoms, contributes about half the work of the compute function.
This suggests one way to parallelize: split the calls to distance over different
threads. Examination of the compute function shows that the potential and force calculations
after distance depend on the results of the distance calls, so they should be included in the same
worker function as the distance calculation. This parallelization is included in Appendix 8.2,
and Table 6.8 shows the call site parallelism:
Call Site       Parallelism   Percentage of Work   Count
distance        1.20118       2.34429              990
compute         1.16186       30.3021              10
distance work   1.14416       12.9843              990
main            1.05319       100                  1
main work       1.04937       98.6271              1
compute         1.0108        39.2091              1
update          1             0.631825             10
initialize      1             1.92633              1
calKinetic      1             0.694468             110
Table 6.8.: Call site parallelism and work percentage for a parallelization over distance calls
for the code in Appendix 8.2, using 10 atoms, which creates a new thread for each distance
calculation.
This parallelization has a very fine granularity, because it requires Θ(Na^2) threads, where Na is the
number of atoms, to be created for each step of the simulation. The parallelization
only increases the parallelism of main from 1 to 1.05, indicating that the time cost of creating
and joining threads cancels out most of the additional concurrency gained by creating the
threads. A way to parallelize with coarser granularity is to group the calculations for each atom
together, so that for each time step the number of threads created is equal to the number of atoms.
This parallelization is included in Appendix 8.3, and Table 6.9 shows the call site parallelism:
Call Site    Parallelism   Percentage of Work   Count
compute      44.3736       80.8772              10
main work    8.65348       99.7773              1
main         8.55733       100                  1
compute      4.41926       11.8808              1
update       1             0.106188             10
initialize   1             0.335047             1
Table 6.9.: Call site parallelism and work percentage for the coarse-grained parallelization
included in Appendix 8.3, which creates a new thread for the distance and potential calcula-
tions associated with each of the 10 atoms.
There are 10 threads spawned at each time step to perform the distance calculations for the atoms
used in the simulation, and the parallelism of main is about 8.6, which indicates some overhead
in pthread_create(...) and pthread_join(...). There may also be some load
imbalance in the work assigned to each thread, which would mean the potential and kinetic
energy calculations are more costly for some atoms.
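The coarse-grained scheme can be outlined as follows. This is only a rough sketch of the idea, not the code from Appendix 8.3; the AtomWorkerArgs structure and the compute_atom_contributions call are hypothetical:
#include <pthread.h>

// Hypothetical per-atom task: one worker handles the distance, potential, and
// force contributions of a single atom against all others.
struct AtomWorkerArgs {
    int atom_index;
    int num_atoms;
    // pointers to positions, forces, and energy accumulators would go here
};

void* atom_worker(void* arg) {
    AtomWorkerArgs* a = static_cast<AtomWorkerArgs*>(arg);
    // compute_atom_contributions(a->atom_index, a->num_atoms);  // illustrative only
    (void) a;
    return nullptr;
}

// One time step: one thread per atom, all joined before the integration update.
void compute_step_coarse(int num_atoms) {
    pthread_t* threads = new pthread_t[num_atoms];
    AtomWorkerArgs* args = new AtomWorkerArgs[num_atoms];
    for (int i = 0; i < num_atoms; ++i) {
        args[i] = {i, num_atoms};
        pthread_create(&threads[i], nullptr, atom_worker, &args[i]);
    }
    for (int i = 0; i < num_atoms; ++i)
        pthread_join(threads[i], nullptr);
    delete[] args;
    delete[] threads;
}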
The molecular dynamics simulation illustrates an important difference in the use of Parasite
as opposed to a profiler that executes the program in parallel. Synchronization issues such as
race conditions do not have to be considered when using Parasite to gain information about a
Pthread program. Consider the two alternative molecular dynamics parallelizations in Appen-
dices 8.3 and 8.2. These likely have race conditions, because the pointer to an AtomInfo object
can be accessed concurrently by two different threads. Using a normal profiler, this race con-
dition creates nondeterminism that may lead to undefined behavior, so the programmer would
have to implement a mutex for each atom to avoid the race condition. Parasite operates se-
quentially, so this race condition does not present a problem. With Parasite, the programmer
can quickly test a “rough draft” of a possible parallelization for comparison with other paral-
lelizations, without implementing synchronization. In this case the programmer may want to
compare the parallelizations presented here, or merely see if the parallelism of either of these
parallelizations merits the effort required to verify and implement a correct parallelization with
proper synchronization primitives. The effort to implement the primitives may seem trivial for
this simple molecular dynamics application, but could be significantly higher for applications
with more complex synchronization requirements.
6.6. CPP Check
CPPcheck is a static code analysis tool that checks style and correctness in C++ files [26]. It is
much larger than the other test programs in this thesis, and is frequently used by programmers
across the world. CPPcheck comes with a multi-threaded execution that does not use Pthreads,
but to test Parasite, it has been parallelized by the following code excerpt, which has been
edited to only show lines relevant to the parallelization [26]:
struct thread_arg {
CppCheck* object;
const std::string* file_name;
unsigned int return_value;
};
void* pthread_worker(void* thread_arg) {
struct thread_arg* thrd_arg = (struct thread_arg*) thread_arg;
thrd_arg->return_value =
thrd_arg->object->check(*(thrd_arg->file_name));
return NULL;
}
int CppCheckExecutor::check_internal(CppCheck& cppcheck, int /*argc*/,
const char* const argv[])
{
// ...... CODE OMITTED FOR BREVITY .... //
if (settings.jobs == 1) {
// ...... CODE OMITTED FOR BREVITY .... //
std::size_t processedsize = 0;
pthread_t* thread;
thread = new pthread_t[_files.size()];
struct thread_arg* arg;
arg = new struct thread_arg[_files.size()];
for (std::map<std::string, std::size_t>::const_iterator i =
_files.begin(); i != _files.end(); ++i) {
if (!_settings->library.markupFile(i->first)
|| !_settings->library.processMarkupAfterCode(i->first)) {
arg[j].file_name = &i->first;
arg[j].return_value = 0;
arg[j].object = &cppcheck;
pthread_create(&thread[j], NULL, &pthread_worker, &(arg[j]));
}
}
j = 0;
for (std::map<std::string, std::size_t>::const_iterator i =
_files.begin(); i != _files.end(); ++i) {
pthread_join(thread[j], NULL);
returnValue += arg[j].return_value;
j++;
}
// ... EXCLUDED CODE ... //
This code creates a worker thread for each file that CPPcheck processes, in an attempt to scale
the application through concurrent processing of files. CPPcheck's original multi-threaded ex-
ecution is more complex due to synchronization, but operates on a similar principle: each
thread analyzes a different file. However, the parallelization above requires very little effort
to produce compared to CPPcheck's multithreaded execution. A programmer attempting to
parallelize CPPcheck could therefore use the above parallelization together with Parasite's se-
quential trace execution to test scalability, before making the effort to implement the synchroniza-
tion necessary for a fully working parallelization.
Table 6.10 shows the results for running Parasite on this parallelization of CPPcheck with 2
of the sample C++ files contained in CPPcheck’s Github repository:
55
Knapp_Masterarbeit
Knapp_Masterarbeit
Knapp_Masterarbeit
Knapp_Masterarbeit
Knapp_Masterarbeit
Knapp_Masterarbeit
Knapp_Masterarbeit
Knapp_Masterarbeit
Knapp_Masterarbeit
Knapp_Masterarbeit
Knapp_Masterarbeit
Knapp_Masterarbeit
Knapp_Masterarbeit
Knapp_Masterarbeit
Knapp_Masterarbeit
Knapp_Masterarbeit
Knapp_Masterarbeit

More Related Content

What's hot

Aidan_O_Mahony_Project_Report
Aidan_O_Mahony_Project_ReportAidan_O_Mahony_Project_Report
Aidan_O_Mahony_Project_Report
Aidan O Mahony
 
Machine_Learning_Blocks___Bryan_Thesis
Machine_Learning_Blocks___Bryan_ThesisMachine_Learning_Blocks___Bryan_Thesis
Machine_Learning_Blocks___Bryan_Thesis
Bryan Collazo Santiago
 
ImplementationOFDMFPGA
ImplementationOFDMFPGAImplementationOFDMFPGA
ImplementationOFDMFPGA
Nikita Pinto
 
SzaboGeza_disszertacio
SzaboGeza_disszertacioSzaboGeza_disszertacio
SzaboGeza_disszertacio
Géza Szabó
 
Cloud enabled business process management systems
Cloud enabled business process management systemsCloud enabled business process management systems
Cloud enabled business process management systems
Ja'far Railton
 
Composition of Semantic Geo Services
Composition of Semantic Geo ServicesComposition of Semantic Geo Services
Composition of Semantic Geo Services
Felipe Diniz
 
Deployment guide series ibm tivoli composite application manager for web reso...
Deployment guide series ibm tivoli composite application manager for web reso...Deployment guide series ibm tivoli composite application manager for web reso...
Deployment guide series ibm tivoli composite application manager for web reso...
Banking at Ho Chi Minh city
 
Hub location models in public transport planning
Hub location models in public transport planningHub location models in public transport planning
Hub location models in public transport planning
sanazshn
 

What's hot (20)

Design patterns by example
Design patterns by exampleDesign patterns by example
Design patterns by example
 
Thesis: Slicing of Java Programs using the Soot Framework (2006)
Thesis:  Slicing of Java Programs using the Soot Framework (2006) Thesis:  Slicing of Java Programs using the Soot Framework (2006)
Thesis: Slicing of Java Programs using the Soot Framework (2006)
 
Master Thesis
Master ThesisMaster Thesis
Master Thesis
 
Viewcontent_jignesh
Viewcontent_jigneshViewcontent_jignesh
Viewcontent_jignesh
 
Aidan_O_Mahony_Project_Report
Aidan_O_Mahony_Project_ReportAidan_O_Mahony_Project_Report
Aidan_O_Mahony_Project_Report
 
Machine_Learning_Blocks___Bryan_Thesis
Machine_Learning_Blocks___Bryan_ThesisMachine_Learning_Blocks___Bryan_Thesis
Machine_Learning_Blocks___Bryan_Thesis
 
ImplementationOFDMFPGA
ImplementationOFDMFPGAImplementationOFDMFPGA
ImplementationOFDMFPGA
 
Uml (grasp)
Uml (grasp)Uml (grasp)
Uml (grasp)
 
SzaboGeza_disszertacio
SzaboGeza_disszertacioSzaboGeza_disszertacio
SzaboGeza_disszertacio
 
Oop c++ tutorial
Oop c++ tutorialOop c++ tutorial
Oop c++ tutorial
 
Cloud enabled business process management systems
Cloud enabled business process management systemsCloud enabled business process management systems
Cloud enabled business process management systems
 
thesis
thesisthesis
thesis
 
document
documentdocument
document
 
Composition of Semantic Geo Services
Composition of Semantic Geo ServicesComposition of Semantic Geo Services
Composition of Semantic Geo Services
 
thesis
thesisthesis
thesis
 
Deployment guide series ibm tivoli composite application manager for web reso...
Deployment guide series ibm tivoli composite application manager for web reso...Deployment guide series ibm tivoli composite application manager for web reso...
Deployment guide series ibm tivoli composite application manager for web reso...
 
Report-V1.5_with_comments
Report-V1.5_with_commentsReport-V1.5_with_comments
Report-V1.5_with_comments
 
Hub location models in public transport planning
Hub location models in public transport planningHub location models in public transport planning
Hub location models in public transport planning
 
Final Report
Final ReportFinal Report
Final Report
 
Workflow management solutions: the ESA Euclid case study
Workflow management solutions: the ESA Euclid case studyWorkflow management solutions: the ESA Euclid case study
Workflow management solutions: the ESA Euclid case study
 

Similar to Knapp_Masterarbeit

UCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_finalUCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_final
Gustavo Pabon
 
UCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_finalUCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_final
Gustavo Pabon
 
Thesis - Nora Szepes - Design and Implementation of an Educational Support Sy...
Thesis - Nora Szepes - Design and Implementation of an Educational Support Sy...Thesis - Nora Szepes - Design and Implementation of an Educational Support Sy...
Thesis - Nora Szepes - Design and Implementation of an Educational Support Sy...
Nóra Szepes
 
bonino_thesis_final
bonino_thesis_finalbonino_thesis_final
bonino_thesis_final
Dario Bonino
 
Nweke digital-forensics-masters-thesis-sapienza-university-italy
Nweke digital-forensics-masters-thesis-sapienza-university-italyNweke digital-forensics-masters-thesis-sapienza-university-italy
Nweke digital-forensics-masters-thesis-sapienza-university-italy
AimonJamali
 
Report on e-Notice App (An Android Application)
Report on e-Notice App (An Android Application)Report on e-Notice App (An Android Application)
Report on e-Notice App (An Android Application)
Priyanka Kapoor
 
Scale The Realtime Web
Scale The Realtime WebScale The Realtime Web
Scale The Realtime Web
pfleidi
 

Similar to Knapp_Masterarbeit (20)

UCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_finalUCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_final
 
UCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_finalUCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_final
 
Master's Thesis
Master's ThesisMaster's Thesis
Master's Thesis
 
Thesis - Nora Szepes - Design and Implementation of an Educational Support Sy...
Thesis - Nora Szepes - Design and Implementation of an Educational Support Sy...Thesis - Nora Szepes - Design and Implementation of an Educational Support Sy...
Thesis - Nora Szepes - Design and Implementation of an Educational Support Sy...
 
Work Measurement Application - Ghent Internship Report - Adel Belasker
Work Measurement Application - Ghent Internship Report - Adel BelaskerWork Measurement Application - Ghent Internship Report - Adel Belasker
Work Measurement Application - Ghent Internship Report - Adel Belasker
 
Investigation in deep web
Investigation in deep webInvestigation in deep web
Investigation in deep web
 
AUGUMENTED REALITY FOR SPACE.pdf
AUGUMENTED REALITY FOR SPACE.pdfAUGUMENTED REALITY FOR SPACE.pdf
AUGUMENTED REALITY FOR SPACE.pdf
 
bonino_thesis_final
bonino_thesis_finalbonino_thesis_final
bonino_thesis_final
 
Milan_thesis.pdf
Milan_thesis.pdfMilan_thesis.pdf
Milan_thesis.pdf
 
Distributed Decision Tree Learning for Mining Big Data Streams
Distributed Decision Tree Learning for Mining Big Data StreamsDistributed Decision Tree Learning for Mining Big Data Streams
Distributed Decision Tree Learning for Mining Big Data Streams
 
Nweke digital-forensics-masters-thesis-sapienza-university-italy
Nweke digital-forensics-masters-thesis-sapienza-university-italyNweke digital-forensics-masters-thesis-sapienza-university-italy
Nweke digital-forensics-masters-thesis-sapienza-university-italy
 
T401
T401T401
T401
 
My PhD Thesis
My PhD Thesis My PhD Thesis
My PhD Thesis
 
Master_Thesis
Master_ThesisMaster_Thesis
Master_Thesis
 
document
documentdocument
document
 
CS4099Report
CS4099ReportCS4099Report
CS4099Report
 
Nato1968
Nato1968Nato1968
Nato1968
 
E.M._Poot
E.M._PootE.M._Poot
E.M._Poot
 
Report on e-Notice App (An Android Application)
Report on e-Notice App (An Android Application)Report on e-Notice App (An Android Application)
Report on e-Notice App (An Android Application)
 
Scale The Realtime Web
Scale The Realtime WebScale The Realtime Web
Scale The Realtime Web
 

Knapp_Masterarbeit

  • 1. Computational Science and Engineering (International Master’s Program) Technische Universit¨at M¨unchen Master’s Thesis Parasite: Local Scalability Profiling for Parallelization Author: Nathaniel Knapp 1st examiner: Prof. Dr. Michael Gerndt 2nd examiner: Prof. Dr. Michael Bader Advisor: M. Sc. Andreas Wilhelm Thesis handed in on: October 25, 2016
  • 2. I hereby declare that this thesis is entirely the result of my own work except where otherwise indicated. I have only used the resources given in the list of references. October 24, 2016 Nathaniel Knapp ii
  • 3. Acknowledgments I would first like to thank Andreas Wilhelm for advising me over the past year I have worked on this project. His advice has been invaluable to the completion of this thesis. Second, I thank Prof. Bader and Prof. Gerndt for agreeing to be examiners. Third, I thank my teachers and mentors during CSE at TUM, especially Alexander P¨oppl for mentoring me on my previous project. Fourth, I thank Prof. Corey O’Hern, Prof. Rimas Vaisnys, Carl Schreck, and Wendell Smith for their mentoring at Yale, which inspired me to apply to study CSE at TUM. Fifth, I thank my classmates in CSE who have made studying at TUM a wonderful experience. iii
  • 4.
  • 5. Abstract In this master’s thesis, Parasite, a local scalability profiling tool, is presented. Parasite mea- sures the parallelism of function call sites in C and C++ applications parallelized using Pthreads. The parallelism, the ratio of a program’s work to its critical path, is an upper bound on speedup for an infinite number of processors, and therefore a useful measure of scalability. The use of Parasite is demonstrated on sorting algorithms, a molecular dynamics simulation, and other programs. These tests use Parasite to compare methods of parallelization, elicit the depen- dence of parallelism on input parameters, and find the factors in program design that limit parallelism. Future extensions of the tool are also discussed. v
  • 6.
  • 7. Contents Acknowledgements iii Abstract v Outline ix I. Introduction and Background 1 1. Introduction 3 2. Background 5 2.1. Shared Memory Parallel Programming . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2. Parallel Program Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2.1. Deciding Optimal Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2.2. Speedup Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2.3. Limitations on Speedup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2.4. Using Parasite for Parallel Program Design . . . . . . . . . . . . . . . . . . 8 2.3. Synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 3. Related Work 13 3.1. Cilk Profiling Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.2. Other Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 II. The Parasite Scalability Profiler 17 4. Parceive 19 4.1. Acceptable Inputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 4.2. Trace Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 4.3. Trace Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 5. Algorithm 23 5.1. The Parasite Work-Span Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 5.2. Work-Span Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 5.3. Estimation of Mutex Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 5.4. Graph Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 III. Results and Conclusion 31 6. Results 33 6.1. Fibonacci Sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 6.2. Vector-Vector Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 6.3. Sorting Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 6.3.1. Bubble Sort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 vii
  • 8. Contents 6.3.2. Quicksort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 6.3.3. Radix Sort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 6.3.4. Merge Sort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 6.3.5. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 6.4. European Championships Simulation . . . . . . . . . . . . . . . . . . . . . . . . . 48 6.5. Molecular Dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 6.6. CPP Check . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 7. Conclusion 59 IV. Appendices 61 8. Molecular Dynamics Code 63 8.1. Serial Version . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 8.2. Fine Grained Parallelization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 8.3. Coarse Grained Parallelization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 Bibliography 71 viii
  • 9. Contents Outline Part I: Introduction and Theory CHAPTER 1: INTRODUCTION This chapter presents an overview of the thesis and its purpose. CHAPTER 2: THEORY This chapter discusses parallel programming theory relevant to local scalability profiling. CHAPTER 3: RELATED WORK This chapter discusses research and commercial tools similar to the Parasite tool. Part II: The Parasite Scalability Profiler CHAPTER 3: PARCEIVE This chapter describes how programs are processed by Parceive before analysis by Parasite. CHAPTER 4: ALGORITHM This chapter describes the algorithms and data structures of the Parasite tool, which provide local scalability profiling. Part III: Results and Conclusion CHAPTER 5: RESULTS This chapter describes tests of Parasite on a diverse selection of programs. CHAPTER 6: CONCLUSION This chapter summarizes the current capabilities of the Parasite tool and discusses future ex- tensions. ix
  • 10.
  • 11. Part I. Introduction and Background 1
  • 12.
  • 13. 1. Introduction Over the last half-century rapid IT advances “have depended critically on the rapid growth of single-processor performance,” and much of this growth depended on increasing the number and speed of transistors on a processor chip by decreasing their size [1]. However, since the early 21st century, improvements in the speed of single processors have been very slow, as limits in efficiencies of single-processor architectures have been reached. The size of transistors continues to be reduced at the same rate, and the hardware industry builds chips that contain several to hundreds of processors. The so-far exponential increase in the computing potential of hardware is not equal to the actual performance of this hardware that the user sees. This is because performance depends not only on the capabilities of the hardware, but also the utilization of these capabilities. To fully utilize these chips, they must be programmed using parallel programming models, which present many software challenges. Therefore, research into methods of parallelization is essen- tial for true improvement of hardware performance, as opposed to just improvements of the hardware’s potential performance. Additionally, existing legacy code must be parallelized to fully use the scalabilty potential of multicore processors. However, parallelization is time- consuming and error-prone, so most legacy software “still operates sequentially and single- threaded” [2]. Several challenges of parallel programming explain the gap in development between hardware and software. One challenge is successful design of a parallel program that operates concurrently. The program’s computation may be split into components that run on different threads. Without proper synchronization, operations that access the same memory locations can easily lead to nondeterministic behavior. Hence, parallelization of serial programs requires identifiying de- pendencies and refactoring to use multiple threads in a way in which these dependencies do not create race conditions. Another challenge is load balancing: evenly dividing the work of the threads so that full scalability potential is realized. This requires some understanding of the amount of work associated with each task, as well as separating the program into tasks that are small enough for even balancing to be possible. The Chair of Computer Architecture at TUM has developed an interactive parallelization tool, Parceive, which helps programmers overcome these design challenges for shared memory systems [2]. Figure 1.1 illustrates the high-level components of Parceive. Runtime Analysis Binary instrumentation Event inspection Input Binary application Debug symbols Static Analysis Data-flow Control-flow Trace Data Visualization Framework Views Scalability Profiling Parasite Figure 1.1.: Steps of the Parceive tool. The Parceive tool takes an executable as input, and using Intel’s Pin tool, dynamically instru- ments predefined instructions, including function calls and returns, memory accesses, memory allocation and release, and Pthread API calls. This instrumentation inserts callbacks that are used to write trace data into a database at runtime. Then, the Parceive interpreter reads the 3
  • 14. 1. Introduction trace stored in the database sequentially in chronological order. The interpreter API allows the user to acquire information from events of interest generated from reading the database. These events include function calls and returns, thread creation and ends, thread joins with their parent, and mutex locks and unlocks. In this thesis, a scalabiliy profiling tool called Parasite is described. This tool analyzes the events generated by the Parceive interpreter to calculate the parallelism of call sites in Pthread programs. The parallelism, the ratio of an application’s total work to its critical path, is the upper bound on speedup possible on any number of processors. Parasite’s parallelism calculations are useful in two ways. First, they allow the programmer to quickly identify areas of high and low parallelism. This allows the programmer to focus par- allelization effort where this effort can result in speedup, and to avoid spending unnecessary parallelization effort on functions with inherently low parallelism. Low call site parallelism values might also indicate the need to redesign the program to increase parallelism. Second, the parallelism calculations allow the programmer to quickly see if the measured speedup of their program is far from the upper bound on speedup shown by the parallelism. A large gap between the parallelism and the measured speedup indicates design problems, synchro- nization problems or operating system problems such as scheduling overhead and memory bandwidth bottlenecks. This thesis is structured as follows. Chapter 2 will describe parallel programming theory relevant to the Parasite tool. Chapter 3 will describe other scalability profiling tools that have been developed. Chapter 4 will describe how Parceive processes input programs before their analysis by the Parasite tool. Chapter 5 will describe the algorithms Parasite uses to calculate the parallelism for function call sites, estimate lock effects on parallelism, and verify the cal- culations using directed acyclic graphs. Chapter 6 will describe tests of the Parasite tool on a diverse selection of C and C++ programs parallelized using Pthreads. Finally, chapter 7 will discuss the impact of the Parasite tool and possible future extensions. 4
  • 15. 2. Background In this chapter theory relevant to the Parasite tool will be discussed. Section 2.1 describes shared memory parallel programming, the Pthread API, types of parallelism, and the directed acyclic graph model of multithreading. Section 2.2 discusses the scalability and performance of shared memory parallel programs, and how the Parasite tool can be used to improve the performance and scalability. Section 2.3 discusses synchronization and how it relates to the Parasite tool. 2.1. Shared Memory Parallel Programming One way to classify parallel programming models is the way that they access memory. In shared memory parallel programs, multiple processors are allowed to share the same location in memory, without any restrictions [3]. In distributed memory parallel programs, processors do not share memory, and messages are used instead to transfer data between processors. Para- site analyzes Pthread programs, which use the shared-memory model of parallel programming performance. In this section the Pthread API, types of parallelism, and a way to model shared- memory programs using graphs are discussed. Pthreads. Pthreads, short for POSIX threads, is a programming API that can be implemented using C, C++, or FORTRAN. Using Pthreads does not modify the language - instead Pthread functions are inserted into the code to dynamically create and destroy parallelism and syn- chronization [1]. A pthread create(...) call takes a thread ID, a function pointer, and an optional pointer as arguments. The call creates a new thread with the thread ID that be- gins running the function whose pointer it is passed. This function can also use the arguments passed to it through the pointer in the pthread create(...) call. A pthread join(...) call, always in a parent thread, takes a child thread ID argument and an optional pointer to a return argument. The pthread join(...) statement creates an implicit barrier: execution of the parent thread will not continue until the child thread has completed its execution. The only Pthread synchronization calls that Parceive can analyze are pthread mutex lock(...) and pthread mutex unlock(...). Mutexes are described in section 2.3. A serious limitation of Pthreads is its lack of locality control. Locality control is the ability for the programmer to explicitly direct the location of memory in the operating system. Other limitations include the overhead of the thread creation and deletion, and the limited control over thread scheduling in the operating system. Types of Parallelism. There are many ways to classify parallelism. One way is to split parallelism into the two categories data parallelism and functional decomposition. Data paral- lelism is parallelism that increases with the amount of data, or the problem size [3]. Programs analyzed in this thesis that have data parallelism include vector-vector multiplication, whose available parallelism increases with the size of vectors being multiplied. Another example is CPPCheck, analyzed in section 6.6, which is a static analysis tool for correctness and style. As the work of this program grows with the number of files, this program shows data parallelism, even though the operations the program executes for each file may differ. Functional decom- position, in contrast, splits a program into tasks that perform different functions. At maximum, programs with functional decomposition can scale by the number of tasks, but this requires the tasks to have equal work - perfect load balancing [3]. 
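Returning briefly to the Pthread calls introduced above, the following minimal sketch shows how pthread create(...) and pthread join(...) are typically combined: a parent thread launches one child with a function pointer and an argument, and later blocks at the join until the child returns its result. The function name square and the style of packing values into void pointers are illustrative choices for this thesis text, not code taken from any of the programs analyzed later.

#include <pthread.h>
#include <cstdint>
#include <cstdio>

// Start routine for the child thread. Pthreads passes the argument and
// returns the result through void pointers, so the value is packed and
// unpacked with integer casts.
static void* square(void* arg) {
    std::intptr_t n = reinterpret_cast<std::intptr_t>(arg);
    return reinterpret_cast<void*>(n * n);
}

int main() {
    pthread_t child;
    void* result = nullptr;

    // pthread_create: thread handle, default attributes, start routine, argument.
    pthread_create(&child, nullptr, square, reinterpret_cast<void*>(std::intptr_t(7)));

    // The parent could perform independent work here, concurrently with the child.

    // pthread_join blocks the parent until the child finishes and retrieves its return value.
    pthread_join(child, &result);

    std::printf("7 squared is %ld\n",
                static_cast<long>(reinterpret_cast<std::intptr_t>(result)));
    return 0;
}

The same minimal create/join pattern is the one modeled as a directed acyclic graph later in this section.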
Parallelism can also be split into regular and irregular parallelism. Programs with regular 5
  • 16. 2. Background parallelism can be split into tasks that have predictable dependencies. Programs with irregular parallelism can only be split into tasks with unpredictable dependencies. Usually, programs with regular parallelism can be modeled by only one single directed acyclic graph, while programs with irregular parallelism could be modeled by several different directed acyclic graphs [4]. The Directed Acyclic Graph Model of Multithreading. To examine the structure of shared- memory parallel programs, it is useful to abstract the programs as directed acyclic graphs (DAGs). A directed acyclic graph has an ordering of all its vertices, called a topological or- dering, in which for all directed edges from u to v, u comes before v in the ordering. This requires the DAG to be acyclic; it is not possible to follow directed edges from any vertex of the DAG so that the same vertex is reached again. A shared memory parallel program can be represented as a DAG in which the vertices are strands, “sequences of serially executed instruc- tions containing no parallel control,” and where graph edges indicate parallel control, such as thread creation or thread joining [5]. Pthread applications can be modeled with the DAG model of multithreading. In this model, strands are vertices, and the ordering of strands is shown by the edges between the strands. A pthread create(...) statement creates two edges. The first edge, the continuation edge, leads to the next strand in the same parent thread. The second edge, the child edge, goes to the first strand in the spawned thread. A pthread join(...) statement creates an edge from the last strand in the spawned thread to the strand following it in the parent thread. Figure 2.1 shows a DAG of a minimal Pthread program where one child thread is created by its parent thread and then rejoins its parent thread. Figure 2.1.: A DAG representing a simple Pthread program. 2.2. Parallel Program Performance In this section the scalability and performance of parallel programs will be discussed, as well as ways that the Parasite profiling tool can be used to improve performance and scalability. The definitions introduced in this section come from [5] and [6]. 6
  • 17. 2.2. Parallel Program Performance 2.2.1. Deciding Optimal Scalability For measuring performance of sequential applications, developers are interested in the execu- tion time of the application, and the proportion of this execution time spent in different func- tions. For multithreaded applications, developers are also interested in how this execution time depends on the number of processing cores [5]. This is an important question, as developers must decide how many cores to use. Additional cores should only be used when their marginal benefit is greater than their marginal cost. The benefit comes from speedup of the application. The cost comes from two factors. The first factor is complexity of parallelization. Developers must ensure that parallel code does not suffer from indeterminism, try to split computational load as evenly as possible, and decide what size the load on each concurrently running pro- cessor should be. This size is called the granularity of the parallelization. The second factor is the added power consumption cost of additional processors or hardware needed for parallel computation, but this is usually not a concern compared to the cost of developing parallel code. One way of determining scalability is to measure it directly. This requires parallel programs that can easily be changed to use a greater or fewer number of threads by changing an input parameter. Unfortunately, these programs are limited to those which have independent tasks within for loops, where an equal fraction of the independent iterations can easily be assigned to each processor. For these programs, runtimes can be measured by using different number of threads, and seeing the corresponding speedup. However, this process does not show the scalability for separate function call sites. In this thesis, a call site is used to refer to each line of a program where a function is called. For programs that contain more complex task-level parallelism, or call sites with varying parallelism, it is not so easy to decide on the optimal number of threads, as threads may have uneven workloads. For these programs, it is useful to have Parasite because it shows individual call sites with high potential for parallelization, without having to measure the scalability directly by profiling program runs with different numbers of threads. 2.2.2. Speedup Bounds Upper Bounds. Work and span are two measurements of the computation in a parallel pro- gram. The work is the time it would take to execute all the strands in the computation sequen- tially. This is the same as the time it takes to execute the computation on one processor, so it is denoted as T1. The span is the time it takes to execute the critical path of the computation. This is the same as the time it takes to execute the computation on an infinite number of processors, so it is denoted as T∞. “P processors can execute at most P instructions in unit time” [5], which creates the first speedup constraint, the work law, where Tp is the parallel execution time: Tp ≥ work/P (2.1) The maximum speedup from parallelization increases linearly with the number of procesors at first, because it is determined by the work law. However, as the number of processors in- creases, they eventually cannot affect the speedup, because at least one of the processors must execute all instructions on the critical path. This upper bound for the speedup possible on any number of processors is called the parallelism. Parallelism “is the ratio of a computation’s work to its span” [6]. 
This is stated in equation 2.2, where Sp is the speedup on P processors: Sp = T1/Tp ≤ T1/T∞ (2.2) A Lower Bound. The work and span can also be used to calculate a lower bound on speedup for an ideal machine. An ideal machine is one in which memory bandwidth does not limit 7
  • 18. 2. Background performance, the scheduler is greedy, and there is no speculative work [3]. Speculative work is when the machine performs work that may not be needed, before it would be needed, in case this is faster than performing the work after it is needed. This lower bound on speedup is called Brent’s Lemma [3]. Tp ≤ (T1 − T∞)/P + T∞ (2.3) This formula is explained by the fact that the program always take at least T∞ time, but the p processors can split up the remaining work, T1 − T∞, evenly. Therefore the sum in Equation 2.3 describes a lower bound on speedup for an ideal machine. An ideal machine does not have the limitations on speedup described in the next section, so if this lower bound on speedup is not met, it indicates that one of these limitations is acting on the program. 2.2.3. Limitations on Speedup The goal of Parasite is to help the programmer identify why programs that are parallelized do not reach their theoretical upper bound on speedup. There are six types of limitations on speedup in a parallel program described in [3] and [6]: • Insufficient parallelism: The program contains serial sections that prevent speedup when using more processors. • Contention: A processor is slowed down by competing accesses to synchronization prim- itives, such as mutexes, or by the true or false sharing of cache lines. • Insufficient memory bandwidth: The processors access memory at a rate higher than the bandwidth of the machine's memory network can sustain. • Strangled scaling occurs when synchronization primitives serialize execution and limit scalability. This problem is often coupled with attempts to solve deadlocks or race condi- tions, as synchronization primitives implemented to deal with these can lead to strangled scaling. • Load imbalance is when some worker threads have significantly more work than others. This increases the span unnecessarily, as the threads with less work must wait idly while the other threads complete. This can be dealt with by overdecomposition: splitting tasks into many more concurrent portions than there are available threads. It is easier to spread many small blocks of serial work evenly over threads, than a few large blocks. • Overhead occurs from the cost of creating threads and destroying threads. This problem is often coupled with load imbalance, as overdecomposition leads to a greater overhead. Therefore, an appropriate granularity, size of concurrent workloads, should be chosen to limit both the overhead and load imbalance. 2.2.4. Using Parasite for Parallel Program Design This section describes how the programmer can use Parasite to diagnose limitations in their program design, and in some cases, guide possible improvements to the program’s design. Figure 2.2 provides a visualization of this process. 8
  • 19. 2.2. Parallel Program Performance Figure 2.2.: Using Parasite to guide parallel program design. First, the programmer immediately sees, from Parasite’s call site profiles, call sites in their program where there is insufficient parallelism. A call site is a specific line in the program where a function is called, and measurements for this call site include all child function calls of the function. Parasite can be used to compare the number of processors employed for each call site to the parallelism of each call site. This helps identify call sites where the parallelism does not greatly exceed the number of processors in use for the call site, and so the speedup may not be linear [6]. Equation 2.4, derived from Equation 2.3, shows this mathematically: Sp = T1/Tp ≈ P if T1/T∞ ≫ P (2.4) Parallel slack is the ratio of the parallelism to the number of processors. With enough parallel slack, the program shows linear speedup. Scheduling overhead occurs when there is not enough parallel slack for each processor to be given a task when it is free. This requires some processors to wait for available work, potentially increasing the span of the program. The amount of parallel slack needed depends on the operating system scheduler. The Intel Cilk Plus and Intel TBB task schedulers work well with high amounts of parallel slack, because they only use as much parallelism as the hardware is able to handle [3]. Pthreads, in contrast, requires the threads to run concurrently, so high parallel slack can decrease the possible speedup if the operating system has fewer threads than those created with Pthreads. In this case, to simulate concurrency, the operating system must “time-slice” between the concurrent threads, adding overhead for context switching and changing the items in the cache [3]. Second, the programmer will be able to use Parasite to investigate the impact of mutex contention on parallelism, using an interactive visualization that allows easy selection of shared 9
  • 20. 2. Background memory locations to lock using mutexes. The tool will then automatically calculate a new up- per bound on speedup with locks on these shared memory locations, without the programmer changing the source code. Finally, the parallelism of a call site can be compared to the speedup measured using a dif- ferent profiling tool. If there is a gap in the speedup, and use of the Parasite tool has ruled out insufficient parallelism, scheduling overhead, and contention as possible causes, the pro- grammer must consider alternative problems with their program design or operating system. Insufficient memory bandwidth, synchronization primitives other than locks such as barriers, and speculative work are remaining possibilities preventing the parallelization from achieving its potential speedup. 2.3. Synchronization Synchronization is coordination of events; synchronization constraints are specific orders of events required in a concurrent program. The most common types of synchronization con- traints are serialization, where one event must happen before another, and mutual exclusion, where one event must not happen at the same time as another [7]. When the two events in ques- tion are on the same thread, these constraints are easy to satisfy. The serialization constraint is met by placing events in the order intended. The mutual exclusion constraint is automatically met, because only one event can happen at the same time on the same thread. When two events that need to have a specific order or need to be mutually exclusive occur on different threads, synchronization constraints are harder to meet. From the programmer’s perspective, the or- der of events on different threads is non-deterministic, as it depends on the operating system scheduling. Problems. There are a number of problems associated with synchronization. Two common examples are race conditions, which occur “when concurrent tasks perform operations on the same memory location without proper synchronization, and one of the memory operations is a write” [3]. These can have no negative effect in some cases, but are nondeterministic, and therefore can fail, so are unacceptable in parallel code. Another example is a deadlock, which “occurs when at least two tasks wait for each other and each cannot resume until the other task proceeds” [3]. It is both an advantage and disadvantage of Parasite that it cannot detect synchronization problems, as it operates using a sequential execution of a Pthread program trace. This sequen- tial execution acts as if the Pthread program was operating using a single thread. The advan- tage is that programs can be tested for parallelism before synchronization problems are dealt with. This saves the programmer time when their only goal for using Parasite is to quickly compare different parallelizations, to assess if a parallelization provides some minimum scal- ability requirement, and to identify regions of high and low scalability. The disadvantage is that Parasite’s parallelism calculations are not necessarily accurate for programs that employ sychronization. A program that deadlocks has no parallelism, as it will never complete. The parallelism of a program where semaphores, conditional waits, or barriers are employed can- not be accurately measured using Parasite, as wait times due to these primitives will increase both the work and span. 
Even Parasite’s mutex wait time correction, described in section 5.3, only provides a rough estimate of the additional work and span that waiting for mutexes re- quires. Semaphores. A general solution to many synchronization problems is called a semaphore. A semaphore is defined in [7] by the following three conditions: • The semaphore can be initialized to any integer value, but after that it can only be incre- mented or decremented by one. 10
  • 21. 2.3. Synchronization • When a thread decrements the semaphore to a negative value, the thread blocks and cannot continue until a different thread increments the semaphore. • When a thread increments the semaphore and there are waiting threads, then one of the waiting threads is unblocked. The application of semaphores to diverse synchronization problems is described in detail in [7]. The lock wait time estimation algorithm used in Parasite only deals with the case of mutexes, which are semaphores initialized to a value of one. Mutexes are often used to protect variables that are shared in memory between different threads, to avoid race conditions. 11
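The following sketch makes this mutual-exclusion pattern concrete for the kind of Pthread program Parasite accepts as input: several threads update one shared variable, and a mutex serializes the updates so that no race condition occurs. The names (worker, shared_counter, counter_mutex) are chosen here only for illustration.

#include <pthread.h>
#include <cstdio>

static long shared_counter = 0;
static pthread_mutex_t counter_mutex = PTHREAD_MUTEX_INITIALIZER;

// Each worker computes a thread-private partial result and then adds it
// to the shared counter. The lock/unlock pair enforces mutual exclusion
// on the single shared write.
static void* worker(void*) {
    long local_result = 0;
    for (int i = 0; i < 1000; ++i)
        local_result += i;                    // private work, no lock needed

    pthread_mutex_lock(&counter_mutex);       // begin critical section
    shared_counter += local_result;           // the only shared access
    pthread_mutex_unlock(&counter_mutex);     // end critical section
    return nullptr;
}

int main() {
    pthread_t threads[4];
    for (int i = 0; i < 4; ++i)
        pthread_create(&threads[i], nullptr, worker, nullptr);
    for (int i = 0; i < 4; ++i)
        pthread_join(threads[i], nullptr);
    std::printf("counter = %ld\n", shared_counter);
    return 0;
}

Keeping the critical section this small also keeps the time spent waiting for the mutex small, which is exactly the quantity that Parasite's mutex wait time estimation in section 5.3 tries to account for.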
  • 23. 3. Related Work In this section five profiling tools similar to Parasite will be described. The first two, Cilkview and Cilkprof, are designed for programs parallelized using the Cilk and Cilk++ multithreading APIs. The third, ParaMeter, analyzes programs with irregular parallelism. The fourth, Intel Advisor, is a commercial tool that can be used for scalability profiling. The fifth, Kismet, profiles potential parallelism in serial programs. 3.1. Cilk Profiling Tools In this section two tools that profile programs using Cilk and Cilk++ will be described. Cilk and Cilk++ are programming languages designed for multithreaded computing, that extend C or C++ code with three constructs, cilk spawn(), cilk sync(), and cilk for(), that support writing task-parallel programs. The first two constructs are similar to pthread create(...) and pthread join(...), respectively, in Pthreads. Cilk is dif- ferent from Pthreads, however, in that it does not allow the developer to explicity choose if threads are created; cilk spawn() only creates a new thread if the Cilk scheduling algo- rithms decide this will help the performance. Therefore, Pthreads is better for shared-memory parallel applications in which complete control over thread creation is necessary. Cilk is better for shared-memory parallel applications that require excellent load balancing, as the backend of Cilk decides how to balance tasks between threads, unlike Pthreads, where the user is re- sponsible for load balancing. Normally a user’s attempt at load balancing will not be as good as Cilk’s backend algorithms for load balancing, as the user will not devote the time to load balancing that was required to develop the Cilk backend algorithms. Cilkview. The Cilkview scalability analyzer is a software tool for profiling multithreaded Cilk++ applications [5]. Like Parceive, Cilkview uses the Pin dynamic instrumentation frame- work to instrument threading API calls. By analyzing the instrumented binary, Cilkview mea- sures work and span during a simulation of a serial execution of parallel Cilk++ code. In this measurement, parallel control constructs such as cilk spawn() or cilk sync() statements are identified by “metadata embedded by the Cilk++ compiler in the binary executable” [5]. Unlike Parasite, Cilkview can analyze scheduling overhead by using the burdened DAG model of multithreading, which extends the DAG model described in section 2.1. In the burdened-DAG model, the work and span of some computations are weighted ac- cording to their grain size, by including a burden on each edge that continues after a thread end event, and each edge that continues on the parent thread after a new thread event. The burdens estimate the cost of migrating tasks, and assume all the tasks that can be migrated are migrated. Task migration is performed by the underlying Cilk scheduler. The main influence of using a burdened DAG instead of a DAG is that it increases the work and span values used in the parallelism calculation, and it decreases the parallelism. The decrease in parallelism is much higher for programs that have fine-grained parallelism, as these programs have more edges where burdens are added. Cilkprof. Cilkprof is a scalability profiler developed for multithreaded Cilk computations [6]. It extends Cilkview to provide work, span and parallelism profiles for individual function call sites as well as the overall program. 
It uses compiler instrumentation to create an instrumented Cilk program, that it then runs serially, to analyze each call site: every location in the code where a function is either called or spawned. The Cilkprof algorithm measures the work and 13
  • 24. 3. Related Work span of each call site, in order to get their ratio: the parallelism. It is not described here as it is used in Parasite and therefore described in detail in section 5.1. Conceptually, Parasite’s algorithm is the same, with four differences in its implementation: 1. Cilkprof’s algorithm is used for Cilk or Cilk++, while the Parasite algorithm is designed for Pthreads. 2. The Parasite algorithm includes an estimation of the effects of mutexes on parallelism. This is useful to programmers, as they may be trying to parallelize code which requires mutexes. Cilkprof and Cilkview do not consider synchronization. 3. The algorithm for this thesis is implemented in an object-oriented style in C++, unlike Cilkprof’s algorithms, which are implemented in C. This has the advantage that the code is more readable, and simpler, as it can use helpful data structures in the C++ standard library, such as unordered maps, in place of the C data structures programmed specifically for Cilkprof. 4. The implementation of Parasite is more generalizable to other threading APIs than Cilk, as it responds to thread and function events instead of Cilk function calls. These events can be more easily mapped to threading constructs in other APIs than Cilk’s threading constructs can. 3.2. Other Tools ParaMeter: Profiling Irregular Parallelism. Kulkarni et al. developed a tool, called ParaMeter, that “produces parallelism profiles for irregular programs” [4]. Irregular programs are organized with trees and graphs and many have amorphous data parallelism. This is a type of parallelism where conflicting computations can be performed in any order, where each chosen order is a DAG that may have its own parallelism. Parasite cannot easily analyze programs with amorphous data parallelism for two reasons. First, Parasite can only analyze one of the possible DAGs that models a program. Second, the structure of graphs representing programs with amorphous data parallelism may depend on the scheduling decisions of the operating system, and Parasite cannot take these scheduling decisions into account. ParaMeter deals with these challenges by making the parallelism profile it generates implementation independent, and using greedy scheduling and incremental execution. Greedy scheduling “means that at each step of execution, ParaMeter will try to execute as many elements as possible.” Incremental execution means each step of computation is “scheduled taking work generated in the previous step into account” [4]. ParaMeter not only measures parallelism, like Parasite and CilkProf, but also parallelism intensity, which is the amount of available parallelism divided by the overall size of the worklist at a given time in the computation [4]. This metric is useful for deciding on work scheduling policies for tasks: random policies perform better with high parallelism intensities because it is less likely that the policies create scheduling conflicts, which are situations where tasks must wait idly for other tasks to complete due to dependencies. The Intel Advisor: A Commercial Tool. The most similar Intel tool to Parasite is the Intel Advisor, which can profile serial programs with annotations that specify parallelism, C and C++ programs parallelized using Intel Thread Building Blocks or OpenMP, C programs parallelized using Microsoft TPL, or Fortran programs parallelized using OpenMP [8].
The Threading Advisor workflow of the Intel Advisor provides similar features to Parasite; both are designed to assist software developers and architects who are in the process of optimizing parallelization. However, the Advisor tool is proprietary, which is a disadvantage compared to Parasite, which is open-source, so Parasite’s algorithms are entirely transparent and open to inspection by developers. 14
  • 25. 3.2. Other Tools Parasite has the ability to quickly compare parallelism of different Pthread parallelizations of the same program, without correct synchronization. Intel Advisor has a similar fast prototyp- ing feature, that allows developers to look at different parallelizations of a program, conveyed to the tool using annotations, to compare them before actually implementing their paralleliza- tion [8]. The Advisor accomplishes this by keeping the code serial when comparing the par- allelizations, so there can be no bugs related to concurrent execution in any of the potential parallelizations. The Intel Advisor provides scalability estimates for the entire program in its suitability anal- ysis, shown in figure 3.1, but unlike Parasite, it only looks at the entire program for parallelism estimates and does not provide individual scalability estimates for functions. Also unlike Para- site, the tool contains features that analyze call sites and loops for their vectorization potential. Like Parasite, it can be used to examine the proportion of work spent in different functions, to help the programmer see where execution time is spent in tasks that can be parallelized [9]. Figure 3.1.: Intel Advisor suitability analysis screenshot. Kismet: Parallel Speedup Estimates for Serial Programs. The Parasite tool, as well as the tools described in the previous sections, all require the input program to already be parallelized in some way. In contrast, Jeon et al. developed a tool, Kismet, that creates parallel speedup estimates for serial programs [10]. Like Parasite, Kismet calculates an upper bound on the pro- gram’s attainable speedup. Unlike Parasite, it takes into account operating system conditions including “number of cores, synchronization overhead, cache effects, and expressible paral- lelism types” [10]. The speedup algorithm uses a parallel execution time model that depends on these operating system conditions as well as the amount of parallelism available. Kismet determines the amount of parallelism available using summarizing hierarchical critical path analysis, which measures the critical path and work, like the Cilkprof work-span algorithm, but uses a different approach to take these measurements. This involves building a hierarchi- cal region structure from source code, consisting of different regions that help separate different levels of parallelism. The advantage of Kismet’s approach over Parasite and the other tools described in this sec- tion is that it does not require additional effort by the programmer. The Intel Advisor requires annotations that show parallel control, while the other tools require a parallelization. However, unlike the other tools, Kismet cannot be used to compare different parallelizations of the same serial program. 15
  • 27. Part II. The Parasite Scalability Profiler 17
  • 29. 4. Parceive The Parasite tool depends on Parceive, which provides information on call sites, functions, and threads that Parasite uses for its work-span algorithm. In this chapter the details of Parceive’s implementation will be described. Parceive operates using the steps shown in Figure 4.1. Parceive takes an executable as input that must meet the requirements described in section 4.1. Then, it performs static analysis of the machine code and dynamically instruments predefined instructions such as function calls and returns, threading API calls, or memory accesses. The instrumentation inserts callbacks that are used to write trace data into a database at runtime. This will be described in section 4.2. Based on this data, trace analysis generates a visualization which the user can use to see the overall structure of the program. Trace analysis also generates events that Parasite uses to calculate scalability of function call sites. This will be described in section 4.3. Runtime Analysis Binary instrumentation Event inspection Input Binary application Debug symbols Static Analysis Data-flow Control-flow Trace Data Visualization Framework Views Scalability Profiling Parasite Figure 4.1.: Steps of the Parceive tool. 4.1. Acceptable Inputs Currently, Parasite can only successfully analyze programs that satisfy the following condi- tions: • The program is written in C or C++. • The program is parallelized or annotated using Pthread API calls. • The Pthread API calls only include pthread create(...), pthread join(...), pthread mutex lock(...), and pthread mutex unlock(...) • The program’s behavior does not depend on collaborative synchronization. The last condition means that Parasite cannot correctly analyze a program where the ex- ecution behavior of one thread depends on the execution behavior of any other thread. This situation can occur when mutexes are used to control the ordering of threads, because the order of threads in which the mutex is acquired and released may differ between sequential and con- current execution. For example, Parasite will deadlock if a mutex is acquired in a parent thread, which then generates a child thread that needs to acquire the mutex. In a concurrent execution, the parent thread would continue after spawning child threads, and unlock the mutex, so that the child thread could acquire the mutex. In the sequential simulation of execution by the Par- ceive interpreter, the parent thread will not continue until the child thread has completed, but the child thread will not complete, because it is waiting on the parent thread’s acquired mutex. 19
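The collaborative-synchronization restriction can be illustrated with a small sketch. The program below is valid under concurrent execution, but its trace cannot be replayed sequentially in the way the Parceive interpreter does. The code and names are illustrative only and are not taken from the thesis test programs.

#include <pthread.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

// The child needs a mutex that the parent holds at the moment of creation.
static void* child(void*) {
    pthread_mutex_lock(&m);
    pthread_mutex_unlock(&m);
    return nullptr;
}

int main() {
    pthread_t t;

    pthread_mutex_lock(&m);                      // parent acquires the mutex first
    pthread_create(&t, nullptr, child, nullptr); // child is spawned while the mutex is held

    // Concurrent execution: the parent continues past the create, releases the
    // mutex, and the child can then acquire it and finish.
    pthread_mutex_unlock(&m);

    pthread_join(t, nullptr);
    return 0;
}

In the sequential simulation of this trace, the child thread is executed to completion before the parent continues, but the child can never acquire the mutex that the parent still holds, so the analysis deadlocks. Programs whose thread ordering is controlled through mutexes in this way are therefore outside Parasite's accepted inputs.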
  • 30. 4. Parceive Even without collaborative synchronization, a successful run of Parasite that includes mu- texes may not produce accurate estimates of parallelism, because Parasite may not correctly calculate the addition to the work and span associated with the mutexes. The lock wait time algorithm described in section 5.3 attempts to estimate these additions, but does not take into account overhead associated with acquiring and releasing the mutexes. Other synchronization primitives such as conditional waits and barriers are not handled by the Parasite algorithm, so Pthread programs that use these primitives should not be used as inputs to Parasite. 4.2. Trace Generation “Parceive analyzes programs by utilizing dynamic binary instrumentation at the level of ma- chine code during runtime” [2]. It employs the Pin framework because “it is efficient and supports high-level, easy-to-use instrumentation” [2, 11]. The pintool “injects analysis calls to write trace data into an SQLite database” [2]. The following instrumentation is used for data-gathering: • Call stack: function entries and exits are tracked to maintain a shadow call stack. For each call, the call instructions, threads, and spent execution time are captured. Additionally, function signatures, file descriptions, and loops are extracted from debug information. • Memory accesses: analysis calls are injected to capture information about each memory access (e.g., memory type, memory address, access instruction). For stack variables, de- bug information is utilized to resolve variable names. • Memory management: to handle heap memory, memory allocation and release function calls are instrumented. The tracked locations are used during analysis to match data accesses using pointers. • Threading: Parceive tracks calls of threading APIs, like Pthread, to capture thread opera- tions and synchronization. 4.3. Trace Analysis Some information contained in the SQLite database is context-free and can be found by sim- ple queries to the database. This includes data dependencies between functions, which are detected by comparing the memory accesses of each function, that are in turn found by ab- stracting instances of function calls. Other information depends on the control and data flow. To extract this information, the trace stored in the database is read sequentially in chronologi- cal order. This information includes runtime of a function call and its nested functions, counts of specific function calls, and counts of specific memory accesses. An API allows the user to acquire information from events of interest generated from reading the database. The Parasite tool interfaces directly with the following events to acquire information it needs for its work- span algorithm: 1. A function calls another function. 2. A function returns to its parent function. 3. A thread creates a child thread. 4. A child thread’s execution ends. 5. A thread join: an implicit barrier where a thread must join its parent thread. 20
  • 31. 4.3. Trace Analysis 6. A function acquires a lock. 7. A function releases a lock. The actions that occur in Parasite’s algorithm with each event are described in section 5.1. Shadow locks and threads associated with the events three to seven allow the event informa- tion to be independent from whatever programming language is used in the programs. This allows future extensions of Parasite to analyze not only Pthread programs, but also programs parallelized using other threading APIs such as OpenMP. 21
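The interpreter API itself is not reproduced in this document, but conceptually Parasite can be thought of as registering one callback per event type listed above. The sketch below shows one possible shape of such an observer interface in C++; all type and method names here are invented for illustration and do not correspond to Parceive's real API.

#include <cstdint>

// Hypothetical identifiers; the real Parceive types will differ.
using ThreadId = std::uint64_t;
using CallSiteId = std::uint64_t;
using LockId = std::uint64_t;
using Time = std::uint64_t;

// One virtual method per interpreter event. A tool such as Parasite would
// implement this interface and update its work and span state inside each
// callback as the trace is replayed.
class EventObserver {
public:
    virtual ~EventObserver() = default;
    virtual void onCall(CallSiteId site, ThreadId thread, Time t) = 0;      // 1. function call
    virtual void onReturn(CallSiteId site, ThreadId thread, Time t) = 0;    // 2. function return
    virtual void onNewThread(ThreadId parent, ThreadId child, Time t) = 0;  // 3. thread creation
    virtual void onThreadEnd(ThreadId thread, Time t) = 0;                  // 4. thread end
    virtual void onJoin(ThreadId parent, ThreadId child, Time t) = 0;       // 5. thread join
    virtual void onAcquire(LockId lock, ThreadId thread, Time t) = 0;       // 6. lock acquired
    virtual void onRelease(LockId lock, ThreadId thread, Time t) = 0;       // 7. lock released
};

Because the callbacks are expressed in terms of generic thread, call site, and lock identifiers rather than Pthread-specific constructs, the same interface could in principle be fed from other threading APIs, which is the extension path mentioned above.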
  • 33. 5. Algorithm In this chapter the elements of the Parasite algorithm will be discussed. Section 5.1 describes the algorithm that Parasite employs to measure the work and span of call sites. Section 5.2 describes the data structures used in this algorithm. Section 5.3 describes the algorithm that adjusts the work and span measurements to take effects of mutexes into account. Finally, sec- tion 5.4 describes the directed acyclic graphs that Parasite constructs to verify its algorithm. 5.1. The Parasite Work-Span Algorithm Conceptually, the Parasite algorithm is the same as the Cilkprof algorithm in [6], but imple- mented to respond to the Parceive interpeter’s events, described in section 4.3, instead of Cilk constructs. This requires an explanation of how Cilk constructs translate to these events, in terms of the algorithm (the actions of the operating system are not equivalent). In the Cilkprof work-span algorithm, a cilk spawn() is equivalent a new thread event. A cilk sync() is equivalent to a join event where, after the join event, the thread has no current child threads that have not already joined their parent thread. Figure 5.1 shows, with pseudocode, the actions of the Cilkprof algorithm as it responds to the Cilk constructs. Figure 5.1.: The Cilkprof Work-Span Algorithm. (w = work, p = prefix, l = longest-child, c = continuation) The figure uses the following variable names, which are defined in [6], but here, the defini- tions are written instead in terms of Parceive interpreter events: 1. F is a thread; “Called G” in the figure is a function called from F; otherwise G in the figure is a child thread of F. 2. The time u is initially set to the beginning of F. As the execution proceeds, it is set to the time of the new thread event that created the child thread of F, which realizes the longest span of any child encountered so far since the last join event. 23
  • 34. 5. Algorithm 3. The work F.w is the serial runtime of call site F - its total computation. 4. The continuation F.c stores the span of the trace from the continuation of u through the most recently executed instruction in F. 5. The longest-child F.l stores the span of the trace from the start of F through the thread end event of the child thread that F creates at u. 6. The prefix F.p stores the span of the trace starting from the first instruction of F and ending with u. The path through the DAG representing the program trace that has the length F.p is guaranteed to be on the critical path of F. Figure 5.2 illustrates the Cilkprof algorithm as it progresses through the execution of a pro- gram trace. If the algorithm is still unclear, the reader is encouraged to read section 3 of [6], or view the documentation and source code of the Parasite tool at [12]. Figure 5.2.: Updates to the span variables as Parasite is executed on a program trace. Each ar- row represents a different thread. An arrow starting from another arrow is a new thread event; an arrow intersecting with another is a join event. Colors indicate different span variables. Before the first join, the continuation and longest child are compared. The longest child is longer, so the prefix is updated to be the longest child. Before the second join, the sum of the prefix and the continuation is com- pared to the new longest child. The longest child is longer, so the prefix is updated to become this longest child. After the second join, the prefix is now what was the longest child before the join. After the end of the main thread, the remaining con- tinuation of the main thread has been added to the prefix. Now the prefix is equal to the entire span of the program. Complexity Analysis. The Parasite algorithm has time complexity O(Ne) , where Ne is the number of Events that the Parceive interpreter sends the Parasite tool. The number of these events depends entirely on the Pthreads program that the algorithm analyzes. Inputs with large numbers of threads, function calls, or mutex locks and unlocks will take much longer for Parasite to analyze than inputs with few function calls or threads created. The space complexity of the algorithm is also highly input dependent. There is a work hashtable that includes an entry for every call site in the program being profiled. In addi- tion to this, each thread has three span hashtables that each have an entry for every call site 24
  • 35. 5.2. Work-Span Data Structures that is called on the thread. If there are Nt threads, and each thread calls a fixed fraction f of all of the Ncs call sites, then the complexity would be O(3 ∗ f ∗ Ncs ∗ Nt + Ncs) = O(Ncs ∗ Nt) . 5.2. Work-Span Data Structures In this section, the stack and hashtable structures used by the algorithm described in section 5.1 are described. Work and Span Hashtables. Unique call site IDs are generated for each line of a program where a function is called. Parasite uses hash tables that map these IDs to information about the work or span of their respective call sites. For every call site, the work hashtable contains the number of invocations, the total work (measured in time), and the function signature. A span hashtable contains the longest-child, continuation, or prefix span of each call site on a thread. It also contains an estimate of the time the thread spends waiting to acquire mutexes. Function and Thread Stacks. As the Parceive interpreter simulates the execution of a pro- gram, Parasite updates two stacks, a thread stack and a function stack. These stack data struc- tures support the traditional stack push and pop operations. For each function call, the function stack contains a frame with the function signature, the call site ID, and an object that tracks the lock time intervals. It also contains two integers: the first indicates whether the function call is the top invocation of its call site on the function stack, and the second indicates whether the function call is the top invocation of its call site on the current thread. These integers are needed to avoid the double-counting of work and span in call sites that are called recursively. For each thread, the thread stack contains a frame with the following information: 1. The unique ID of the thread. 2. A list of interval data structures that stores the times in which mutexes in the thread and the thread’s children are acquired. 3. Prefix, longest-child, and continuation spans of the thread. 4. A counter that represents the number of child threads spawned from the thread that are currently on the thread stack. 5. A set that contains call sites which were pushed to the call stack while this thread was the bottom thread. This set is used to set the integer on each function frame that indicates whether it is the top invocation of the function’s call site on that thread. Additionally, a set is used to track all the function call sites currently on the function stack, in order to correctly set the integer on each function frame that indicates whether it is the top invocation of the function’s call site in the program. 5.3. Estimation of Mutex Effects The effects of mutexes on runtime are non-deterministic, and can only be measured accurately by running the program under test with concurrent execution. However, the goal of Parasite is to estimate scalability using its mathematical work-span algorithm, instead of using direct mea- surement. Therefore, a simple heuristic is used to estimate the impacts of mutex contention on the span of call sites. This heuristic corrects the span and work of each thread if the time that a mutex in the thread or its child threads is acquired is greater than the span of the thread without considering mutexes. The approach is outlined in the following pseudocode, which calculates an addition to the span and work, called mutex wait time in the source code. 
The correction is only applied when the Parasite tool processes a sync - a join event after which the parent thread 25
  • 36. 5. Algorithm has no current children. In the pseudocode, a mutex interval is a data structure storing the start, span, and mutex ID (a unique ID for each mutex generated by the Parceive interpreter) that describe a time interval in which a mutex is acquired. The child thread mutex list is the list of mutexes that any of the child threads in the parent thread have acquired since the last sync event. This approach does not take into account any overhead associated with acquiring or releasing mutexes.

mutex_total_span_list = []
for mutex in child_thread_mutex_list:
    for mutex_interval in mutex.mutex_interval_list:
        mutex.total_span += mutex_interval.span
    mutex_total_span_list.append(mutex.total_span)
maximum_mutex_span = max(mutex_total_span_list)
correction = max(0, maximum_mutex_span - longest_child_span)
longest_child_span = longest_child_span + correction
parent_thread_work += correction

5.4. Graph Validation Parasite constructs a directed acyclic graph while it profiles a program. In order to confirm that the dynamic algorithm described in section 5.1 produces the correct result, this graph is used to calculate the span of the program being profiled. Figure 5.3 shows such a DAG, for a parallel program with a master-worker pattern and four worker threads. 26
  • 37. 5.4. Graph Validation Figure 5.3.: Directed acyclic graph of vector-vector multiplication program. In this figure, the numbers on edges represent the time spent between events. TS = new thread event, TE = thread end event, and R = return event. The numbers in the new thread and thread event labels are the IDs of the threads. The numbers in the return event labels are the call site IDs of the returning functions. 27
  • 38. 5. Algorithm For thread start and thread end events, one vertex and one edge are added to the graph. The length of the edge is the time elapsed since the last event, and the vertex represents the event just generated. For a join event, two edges are created: the first edge connects the most recent event on the parent thread to the join event. The second edge connects the thread end event of the child thread to the join event. Therefore, a thread join event always has an inward degree of 2, as it joins two threads. A new thread event always has an outward degree of 2, as it creates a new thread starting from the new thread vertex, and the parent thread continues. After Parasite has completed its analysis of the program, it calculates the span of the DAG it has constructed. The graph is stored using data structures of the BOOST graph library [13], and saved as a DOT file when Parasite completes. Then, this DOT file is loaded into a Python script, which calculates the longest path of the graph. This graph does not include estimates of mutex wait times, so the longest path of the graph should be equal to the difference between the span of the main function, when including mutex wait times, and the mutex wait time of the main function, both calculated by Parasite’s work-span algorithm. This check was useful in the initial development of Parasite to confirm that the algorithm was implemented correctly. Longest Path Algorithm. Since a shared-memory parallel program can be represented as a DAG, one way of finding the span of the program is to employ a longest-path algorithm on the DAG. To check the correctness of its work-span algorithm, Parasite uses the code in listing 5.1 to calculate the longest path in the graph it constructs to represent its input program. This algorithm is taken directly from the Python networkx library [14]. Listing 5.1: Longest path algorithm for a DAG [14].

import networkx as nx

def longest_path(G):
    dist = {}  # stores [node, distance] pair
    for node in nx.topological_sort(G):
        # pairs of dist, node for all incoming edges
        pairs = [(dist[v][0] + 1, v) for v in G.pred[node]]
        if pairs:
            dist[node] = max(pairs)
        else:
            dist[node] = (0, node)
    node, (length, _) = max(dist.items(), key=lambda x: x[1])
    path = []
    while length > 0:
        path.append(node)
        length, node = dist[node]
    return list(reversed(path))

Complexity Analysis. It is interesting to compare the complexity of the algorithm in listing 5.1 with Parasite’s algorithm. This algorithm first uses a topological sort, which orders the vertices of the graph so that for every edge from m to n, m comes before n in the ordering. This is possible for any directed acyclic graph. The complexity of the topological sort is Θ(V + E), where V is the number of vertices, and E is the number of edges [15]. After sorting, the algorithm looks at, for each node, the edges from predecessors of this node to the node itself. Therefore, its complexity is O(V + E). In the DAG generated by Parasite, a vertex is a thread start, thread end, or thread join event. Every thread except the main thread has three of these events. Therefore, the number of vertices is O(Nt), where Nt is the number of threads spawned during the program execution. The number of edges is also O(Nt), so the complexity of the algorithm is O(Nt). If the same algorithm was applied to each call site in Parasite, and each call site had the same number of threads, then the complexity would be 28
  • 39. 5.4. Graph Validation O(Nt²). The Parasite algorithm has time complexity O(Ne), where Ne is the number of all events. The number of events could be linearly proportional to the number of threads, or have a different relation; hence, the Parasite algorithm could have a complexity similar to that of the longest path algorithm, or a much greater one. The relative complexity depends completely on the program being profiled. The Parasite algorithm, however, provides more information than the longest path of the entire program. It gives information about the parallelism for each call site, as well as estimating the effect of mutexes on the parallelism. 29
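As a complement to Listing 5.1, the validation check of section 5.4 can also be sketched directly as a weighted longest-path computation, where each edge carries the elapsed time recorded between two events and vertices are processed in topological order. The sketch below is illustrative only: it assumes a plain adjacency-list representation with a precomputed topological order, rather than the BOOST/DOT pipeline actually used by Parasite.

#include <vector>
#include <algorithm>

// An edge of the event DAG: target vertex index and the time recorded on the edge.
struct Edge {
    int to;
    double length;
};

// Weighted longest path (the span) of a DAG. 'graph[v]' lists the outgoing
// edges of vertex v, and 'topo_order' is any topological ordering of the
// vertices. Processing vertices in that order guarantees that dist[v] is
// final before the outgoing edges of v are relaxed.
double dag_span(const std::vector<std::vector<Edge>>& graph,
                const std::vector<int>& topo_order) {
    std::vector<double> dist(graph.size(), 0.0);
    double span = 0.0;
    for (int v : topo_order) {
        for (const Edge& e : graph[v]) {
            dist[e.to] = std::max(dist[e.to], dist[v] + e.length);
            span = std::max(span, dist[e.to]);
        }
    }
    return span;
}

Run over a DAG such as the one in Figure 5.3, with edge lengths equal to the recorded times, the returned value should match the span reported by the work-span algorithm once the mutex wait time has been subtracted, which is the consistency check described above.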
  • 41. Part III. Results and Conclusion 31
  • 43. 6. Results In this section Parasite is applied to a diverse set of programs. For each program, the method of parallelization is discussed, and Parasite is used to estimate the resulting scalability. First, a simple program that calculates the Nth Fibonacci number is used to verify the correctness of Parasite’s algorithm. Second, for vector-vector multiplication and four parallel sorting algorithms, Parasite is applied to the programs multiple times to show the dependence of the parallelism on input parameters. Third, using a simulation of the European football championships, Parasite is shown to be able to determine the scalability of an application with irregular parallelism. Fourth, for a molecular dynamics simulation, the parallelism of call sites in different parallelizations of the same program is compared. Finally, CPPCheck is used to show that Parasite can be used to quickly test a potential parallelization of a sequential program. For some of the programs, the theoretical parallelism of certain call sites can be calculated, which is compared to the actual parallelism that Parasite calculates for these call sites. 6.1. Fibonacci Sequence The code below is an abbreviated parallelization of calculating the Nth Fibonacci number [16]:

#define N 20

void* fibonacci_thread(void* arg) {
    size_t n = (size_t) arg, fib;
    pthread_t thread_1, thread_2;
    void* pvalue;
    if ((n == 0) or (n == 1)) return (void*) 1;
    pthread_create(&thread_1, 0, fibonacci_thread, (void*)(n - 1));
    pthread_create(&thread_2, 0, fibonacci_thread, (void*)(n - 2));
    pthread_join(thread_1, &pvalue);
    fib = (size_t) pvalue;
    pthread_join(thread_2, &pvalue);
    fib += (size_t) pvalue;
    return (void*) fib;
}

size_t fibonacci(size_t n) {
    return (size_t) fibonacci_thread((void*) n);
}

int main() {
    fibonacci(N);
}

33
  • 44. 6. Results Table 6.1 shows the parallelism of the different function calls in this program, for finding the 20th Fibonacci number.

Call site          Parallelism   Percentage of Work   Count
fibonacci_thread   642.401       99.4487              6764
fibonacci_thread   241.458       99.856               1
fibonacci          221.677       99.9273              1
main               205.314       100                  1

Table 6.1.: Call site parallelism for calculating the 20th Fibonacci number. The parallelism of calculating the Nth Fibonacci number in the method of the code used has been calculated to be: Parallelism(n) = Θ(φⁿ/n) (6.1) where φ is the golden ratio [17]. Figure 6.1 confirms that Parasite’s measurements for the parallelism of the fibonacci function in this code follow the theoretical prediction. This is a useful validation that Parasite’s algorithm is implemented correctly. (Plot: φⁿ/n versus parallelism of the Fibonacci function.) Figure 6.1.: Dependence of parallelism of Fibonacci function on the number N it calculates. 6.2. Vector-Vector Multiplication The first test of Parasite was the following parallel vector-vector multiplication program [18]: /* INPUT VARIABLES */ #define NUM_THREADS 5 #define VECTOR_SIZE 1000000000 pthread_mutex_t mutex_sum = PTHREAD_MUTEX_INITIALIZER; int *VecA, *VecB, sum = 0, dist; /* Thread callback function */ void * doMyWork(int myId) { 34
  • 45. 6.2. Vector-Vector Multiplication int counter, mySum = 0; /*calculating local sum by each thread */ for (counter = ((myId - 1) * dist); counter <= ((myId * dist) - 1); counter++) mySum += VecA[counter] * VecB[counter]; /*updating global sum using mutex lock */ pthread_mutex_lock(&mutex_sum); sum += mySum; pthread_mutex_unlock(&mutex_sum); return; } /*Main function start */ int main(int argc, char *argv[]) { /*variable declaration */ int ret_count; pthread_t * threads; pthread_attr_t pta; double time_start, time_end, diff; struct timeval tv; struct timezone tz; int counter, NumThreads, VecSize; NumThreads = NUM_THREADS; VecSize = VECTOR_SIZE; /*Memory allocation for vectors */ VecA = (int *) malloc(sizeof(int) * VecSize); VecB = (int *) malloc(sizeof(int) * VecSize); pthread_attr_init(&pta); threads = (pthread_t *) malloc(sizeof(pthread_t) * NumThreads); dist = VecSize / NumThreads; /*Vector A and Vector B intialization */ for (counter = 0; counter < VecSize; counter++) { VecA[counter] = 2; VecB[counter] = 3; } /*Thread Creation */ for (counter = 0; counter < NumThreads; counter++) { pthread_create(&threads[counter], &pta, (void *(*) (void *)) doMyWork, (void *) (counter + 1)); } /*joining threads */ for (counter = 0; counter < NumThreads; counter++) { pthread_join(threads[counter], NULL); } printf("n The Sum is: %d.", sum); pthread_attr_destroy(&pta); return; } 35
  • 46. 6. Results This is the simplest style of Pthread program, with two functions: a main function and a worker function. The main function creates several threads which perform the worker function. In this case, the worker function doMyWork takes sections of the vectors, multiplies these sections, and adds the result to a global sum variable, which is protected by the mutex mutex_sum to avoid race conditions. Figure 6.2 shows the dependence of the parallelism on the vector size, using ten worker threads. The result approaches a limit of about 9.9. If the threads had equal work, the parallelism would be equal to 10. The parallelism approaches 9.9 instead because the work is not perfectly balanced between the threads, even if they multiply vectors of identical size, as factors such as memory access time and the time to lock and unlock the mutex can vary between the threads. (Plot: vector size, in units of 10⁷, versus parallelism of the worker function.) Figure 6.2.: Dependence of parallelism of worker function on vector size for parallel vector-vector multiplication, with a fixed number of 10 worker threads. Figure 6.3 shows the dependence of the parallelism on the number of threads, using a fixed input size of 10⁹. As would be expected, the parallelism increases linearly with the number of threads. (Plot: number of threads versus parallelism of the worker function.) Figure 6.3.: Dependence of parallelism of worker function on number of threads for parallel vector-vector multiplication, using a fixed input size of 10⁹. 36
6.3. Sorting Algorithms

Most computer scientists are familiar with the sorting algorithms bubble sort, quicksort, radix sort, and merge sort. In this section Parasite is applied to programs that implement parallel versions of each of these sorting algorithms, to illustrate Parasite's ability to show the overall and local parallelism in a program. Specifically, for these sorting algorithms, Parasite shows quantitatively how the parallelism depends on input size, recursion depth, and granularity.

6.3.1. Bubble Sort

The following abbreviated code is a simple parallelization of the bubble sort algorithm, which passes through an array of values and swaps neighboring values if they are out of order [19]. In the sequential version of bubble sort, passes over the array are repeated until no further swaps are needed and all values are sorted. In the parallel version, the swaps at all even indices are performed in parallel, then the swaps at all odd indices are performed in parallel, and this process is repeated until all elements are sorted.

#define DIM 200

int a[DIM], swapped = 0;
pthread_t thread[DIM];

void bubble(int i)
{
    int tmp;
    if (i != DIM-1) {
        if (a[i] > a[i+1]) {
            tmp = a[i];
            a[i] = a[i+1];
            a[i+1] = tmp;
            swapped = 1;
        }
    }
}

int main()
{
    int i;
    fill_a_with_random_integers();
    do {
        swapped = 0;
        for (i = 0; i < DIM; i += 2)
            pthread_create(&thread[i], NULL, &bubble, i);
        for (i = 0; i < DIM; i += 2)
            pthread_join(thread[i], NULL);
        swapped = 0;
        for (i = 1; i < DIM; i += 2)
            pthread_create(&thread[i], NULL, &bubble, i);
        for (i = 1; i < DIM; i += 2)
            pthread_join(thread[i], NULL);
    } while (swapped == 1);
}
Figure 6.4 shows that the parallelism of this bubble sort implementation increases quickly with input size for small inputs. This is expected, as the number of threads spawned to perform the swaps increases quadratically with the input size: there are O(N^2) swaps, and hence O(N^2) threads created, where N is the input size. However, after the input size reaches about 200, the parallelism hits a limit of about 2.69. This could be because some swaps take more time than others. Higher input sizes were not tested because they make Parasite's runtime very high; the complexity of Parasite depends on the number of function and thread events, which for this code is also O(N^2).

[Figure 6.4 plots the parallelism of the bubble sort against input size.]
Figure 6.4.: Dependence of parallelism on input size for parallel bubble sort.

Table 6.2 shows that the two call sites of bubble have approximately equal work. Generating these tables for progressively larger input sizes shows that the parallelism and the work percentage of the two bubble call sites approach each other as the input size increases, until they are eventually equal. This is expected, because both call sites are called the same number of times, plus or minus one, and they sort random integers.

              Parallelism   Percentage of Work   Count
bubble        98.5142       8.38823              9100
bubble        83.046        8.41496              9100
main          2.69487       100                  1
v initiate    1             0.99462              1

Table 6.2.: Call site parallelism for parallel bubble sort on 200 integers.
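The call counts in Table 6.2 are consistent with the structure of the code above (a rough back-of-the-envelope check): each even or odd pass creates DIM/2 = 100 threads, i.e. 100 calls per bubble call site per iteration of the do-while loop, so

\[
\frac{9100 \ \text{calls per call site}}{100 \ \text{calls per iteration}} = 91 \ \text{outer iterations}, \qquad
2 \times 9100 = 18\,200 \ \text{thread create/join pairs in total,}
\]

which is why Parasite's runtime on this program grows roughly quadratically with the input size.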
6.3.2. Quicksort

In the sequential version of quicksort, a partition function arranges the elements around a chosen pivot element so that all elements less than the pivot end up on one side and all elements greater than the pivot on the other. The process is repeated recursively on both sides until each part is sorted, at which point the whole array is in order around the pivots. The following code is a simple parallelization of the quicksort algorithm (for brevity, the partition function is omitted):

#define RECURSIVE_DEPTH 16
#define INPUT_SIZE 100000

/*
 * Structure containing the arguments to the parallel_quicksort function. Used
 * when starting it in a new thread, because pthread_create() can only pass one
 * (pointer) argument.
 */
struct qsort_starter {
    int *array;
    int left;
    int right;
    int depth;
};

void parallel_quicksort(int *array, int left, int right, int depth);

/*
 * Thread trampoline that extracts the arguments from a qsort_starter structure
 * and calls parallel_quicksort.
 */
void* quicksort_thread(void *init)
{
    struct qsort_starter *start = init;
    parallel_quicksort(start->array, start->left, start->right, start->depth);
    return NULL;
}

/*
 * Parallel version of the quicksort function. Takes an extra parameter:
 * depth. This indicates the number of recursive calls that should be run in
 * parallel. The total number of threads will be 2^depth. If this is 0, this
 * function is equivalent to the serial quicksort.
 */
void parallel_quicksort(int *array, int left, int right, int depth)
{
    if (right > left) {
        int pivotIndex = left + (right - left)/2;
        pivotIndex = partition(array, left, right, pivotIndex);
        // Either do the parallel or serial quicksort, depending on the depth
        // specified.
        if (depth-- > 0) {
            // Create the thread for the first recursive call
            struct qsort_starter arg = {array, left, pivotIndex-1, depth};
            pthread_t thread;
            int ret = pthread_create(&thread, NULL, quicksort_thread, &arg);
            assert((ret == 0) && "Thread creation failed");
            // Perform the second recursive call in this thread
            parallel_quicksort(array, pivotIndex+1, right, depth);
            // Wait for the first call to finish.
            pthread_join(thread, NULL);
        }
        else {
            quicksort(array, left, pivotIndex-1);
            quicksort(array, pivotIndex+1, right);
        }
    }
}

int main(int argc, char **argv)
{
    int depth = RECURSIVE_DEPTH;
    // Size of the array to sort. Optionally specified as the second argument
    // to the program.
    int size = INPUT_SIZE;
    // Allocate and initialise the array of values to sort.
    int *values = calloc(size, sizeof(int));
    assert(values && "Allocation failed");
    int i = 0;
    for (i = 0; i < size; i++) {
        values[i] = i * (size - 1);
    }
    // Sort the array
    parallel_quicksort(values, 0, size-1, depth);
    return 0;
}

Here, the recursive calls to quicksort on the subarrays smaller and larger than the pivot are made in parallel, up to a developer-specified depth [20]. Beyond this depth a sequential quicksort is applied to each subarray. The parallelism of this parallel quicksort can be derived in the ideal case, where the partition step splits the array evenly around the pivot at every level. The work and span then satisfy the following recurrence relations, presented in [3]:

    T_1(N) = N + 2 T_1(N/2)                                            (6.2)
    T_∞(N) = N + T_∞(N/2)                                              (6.3)

The solutions of these recurrences, derived in [3], are:

    T_1(N) = Θ(N lg N)                                                 (6.4)
    T_∞(N) = Θ(N)                                                      (6.5)

Therefore the theoretical parallelism is:

    T_1(N)/T_∞(N) = Θ(N lg N)/Θ(N) = Θ(lg N)                           (6.6)
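A quick way to see (6.5), and hence (6.6), is to unroll the span recurrence (a standard argument, sketched here for completeness):

\[
T_\infty(N) = N + \frac{N}{2} + \frac{N}{4} + \cdots \;\le\; 2N = \Theta(N),
\qquad
\frac{T_1(N)}{T_\infty(N)} = \frac{\Theta(N \lg N)}{\Theta(N)} = \Theta(\lg N).
\]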
                        Parallelism   Percentage of Work   Count
quicksort               32.8953       39.6884              32
quicksort               32.3863       73.4517              5841
partition               32.3009       12.7183              5841
quicksort               32.2813       73.6963              5841
quicksort               32.2134       39.3518              32
parallel quicksort      31.2151       78.496               31
parallel quicksort      25.8009       81.9838              31
quicksort thread        25.3564       82.0705              31
parallel quicksort      16.7058       86.7864              1
main                    5.48624       100                  1
partition               1.30958       0.761345             63

Table 6.3.: Call site parallelism for parallel quicksort with recursion depth 5 on 10,000 integers.

Table 6.3 shows the parallelism and work share of the call sites in the parallel quicksort for recursion depth 5 on an input of 10,000 integers. The four sequential quicksort call sites all have a parallelism of about 32, as does the partition call site within the quicksort function; these call sites are only reached once the recursion arrives at depth 5. The parallel quicksort algorithm can be viewed as a full binary tree in which every non-leaf node is a call to parallel_quicksort and every leaf is a call to the sequential quicksort function. A full binary tree of recursion depth D has 2^D leaves, so here, with depth 5, a parallelism of 32 in the leaf function is expected. Figure 6.5 confirms that the average parallelism of the four quicksort call sites is approximately equal to 2^D, where D is the recursion depth of the quicksort program.

[Figure 6.5 plots log2 of the quicksort parallelism against the recursion depth.]
Figure 6.5.: Dependence of parallelism of sequential quicksort calls on recursion depth.

Next, the parallelism of the top call to parallel_quicksort observed using Parasite was compared to the theoretical parallelism of parallel quicksort in Figure 6.6. To remain as close as possible to the ideal case, recursion depths of floor(log2(N)) were used. Interestingly, this parallel_quicksort function shows a roughly linear dependence of parallelism on input size, instead of the logarithmic dependence that theory predicts. This is likely because the algorithm used here applies the sequential quicksort after the recursion depth is reached.
[Figure 6.6 plots the parallelism of the parallel_quicksort function against input size (in units of 10^3).]
Figure 6.6.: Dependence of parallelism of parallel quicksort on input size.

6.3.3. Radix Sort

The following code is a parallelization of the radix sort algorithm [21]:

#define NTHREADS 5
#define INPUT_SIZE 1000000
/* Bits of value to sort on. */
#define BITS 29

/* Thread arguments for radix sort. */
struct rs_args {
    int id;            /* thread index. */
    unsigned *val;     /* array. */
    unsigned *tmp;     /* temporary array. */
    int n;             /* size of array. */
    int *nzeros;       /* array of zero counters. */
    int *nones;        /* array of one counters. */
    int t;             /* number of threads. */
};

/* Global variables and utilities. */
struct rs_args *args;

/* Individual thread part of radix sort. */
void radix_sort_thread (unsigned *val,            /* Array of values. */
                        unsigned *tmp,            /* Temp array. */
                        int start, int n,         /* Portion of array. */
                        int *nzeros, int *nones,  /* Counters. */
                        int thread_index,         /* My thread index. */
                        int t)                    /* Number of threads. */
{
    unsigned *src, *dest;
    int bit_pos;
    int index0, index1;
    int i;

    /* Initialize source and destination. */
    src = val;
    dest = tmp;

    /* For each bit... */
    for ( bit_pos = 0; bit_pos < BITS; bit_pos++ ) {

        /* Count elements with 0 in bit_pos. */
        nzeros[thread_index] = 0;
        for ( i = start; i < start + n; i++ ) {
            if ( ((src[i] >> bit_pos) & 1) == 0 ) {
                nzeros[thread_index]++;
            }
        }
        nones[thread_index] = n - nzeros[thread_index];

        /* Get starting indices. */
        index0 = 0;
        index1 = 0;
        for ( i = 0; i < thread_index; i++ ) {
            index0 += nzeros[i];
            index1 += nones[i];
        }
        index1 += index0;
        for ( ; i < t; i++ ) {
            index1 += nzeros[i];
        }

        /* Move values to correct position. */
        for ( i = start; i < start + n; i++ ) {
            if ( ((src[i] >> bit_pos) & 1) == 0 ) {
                dest[index0++] = src[i];
            } else {
                dest[index1++] = src[i];
            }
        }

        /* Swap arrays. */
        tmp = src;
        src = dest;
        dest = tmp;
    }
}

/* Thread main routine. */
void thread_work (int rank)
{
    int start, count, n;
    int index = rank;

    /* Ensure all threads have reached this point, and then let continue. */
    pthread_barrier_wait(&barrier);

    /* Get portion of array to process. */
    n = args[index].n / args[index].t;  /* Number of elements this thread is in charge of */
    start = args[index].id * n;         /* Thread is in charge of [start, start+n] elements */

    /* Perform radix sort. */
    radix_sort_thread (args[index].val, args[index].tmp, start, n,
                       args[index].nzeros, args[index].nones,
                       args[index].id, args[index].t);
}

void radix_sort (unsigned *val, int n, int t)
{
    unsigned *tmp;
    int *nzeros, *nones;
    int r, i;

    /* Thread-related variables. */
    long thread;
    pthread_t* thread_handles;

    /* Allocate temporary array. */
    tmp = (unsigned *) malloc (n * sizeof(unsigned));

    /* Allocate counter arrays. */
    nzeros = (int *) malloc (t * sizeof(int));
    nones = (int *) malloc (t * sizeof(int));

    /* Initialize thread handles and barrier. */
    thread_handles = malloc (t * sizeof(pthread_t));

    /* Initialize thread arguments. */
    for ( i = 0; i < t; i++ ) {
        args[i].id = i;
        args[i].val = val;
        args[i].tmp = tmp;
        args[i].n = n;
        args[i].nzeros = nzeros;
        args[i].nones = nones;
        args[i].t = t;
        /* Create a thread. */
        pthread_create (&thread_handles[i], NULL, thread_work, i);
    }

    /* Wait for threads to join and terminate. */
    for ( i = 0; i < t; i++ )
        pthread_join (thread_handles[i], NULL);

    /* Copy array if necessary. */
    if ( BITS % 2 == 1 ) {
        copy_array (val, tmp, n);
    }
}

void main (int argc, char *argv[])
{
    int n, t;
    unsigned *val;
    time_t start, end;

    n = INPUT_SIZE;
    t = NTHREADS;

    val = (unsigned *) malloc (n * sizeof(unsigned));
    random_array (val, n);
    args = (struct rs_args *) malloc (t * sizeof(struct rs_args));

    radix_sort (val, n, t);  /* The main algorithm. */
}

The sequential version of this algorithm first sorts the numbers by their least significant digit, then by the next most significant digit, and so on until the entire sequence is sorted. The parallelization splits the array of numbers into equal portions and determines each element's position in the overall array using prefix sums over the per-thread digit counts; for a detailed explanation, see [22]. Table 6.4 shows the call site parallelism profiles. Interestingly, the parallelism of these call sites does not change significantly when the input size or the number of threads is varied. If the developer wished to obtain higher speedup from this program, she could try modifying the implementation to make it more scalable, by decreasing the amount of time spent in non-parallelizable sections (radix sort only has about 43% of the work), or possibly by choosing a different parallelization of the radix sort algorithm.

                     Parallelism   Percentage of Work   Count
radix sort thread    1.82281       52.99                2
radix sort           1.77026       60.7733              1
thread work          1.70867       57.7119              2
main                 1.36292       100                  1
random array         1             35.2695              1

Table 6.4.: Call site parallelism for parallel radix sort on 10^6 integers with 5 threads.
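Written as formulas, the per-thread index computation in radix_sort_thread above amounts to the following prefix sums over the counter arrays (this simply restates the code, using k for the thread index and t for the number of threads):

\[
\mathrm{index0}_k = \sum_{i=0}^{k-1} \mathrm{nzeros}_i,
\qquad
\mathrm{index1}_k = \sum_{i=0}^{t-1} \mathrm{nzeros}_i \;+\; \sum_{i=0}^{k-1} \mathrm{nones}_i,
\]

so for each bit position, all elements with a 0 bit precede all elements with a 1 bit, and each thread writes into its own disjoint ranges of the destination array.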
6.3.4. Merge Sort

The following code is a simple parallelization of the merge sort algorithm [23]:

#define TYPE int
#define MIN_LENGTH 2
#define INPUT_SIZE 10000

typedef struct {
    TYPE *array;
    int left;
    int right;
    int tid;
} thread_data_t;

int number_of_threads;
pthread_mutex_t lock_number_of_threads;

// The function passed to a pthread_t variable.
void *merge_sort_threaded(void *arg)
{
    thread_data_t *data = (thread_data_t *) arg;
    int l = data->left;
    int r = data->right;
    int t = data->tid;
    if (r - l + 1 <= MIN_LENGTH) {
        // Length is too short, let us do a |qsort|.
        qsort(data->array + l, r - l + 1, sizeof(TYPE), my_comp);
    } else {
        // Try to create two threads and assign them work.
        int m = l + ((r - l) / 2);

        // Data for thread 1
        thread_data_t data_0;
        data_0.left = l;
        data_0.right = m;
        data_0.array = data->array;
        pthread_mutex_lock(&lock_number_of_threads);
        data_0.tid = number_of_threads++;
        pthread_mutex_unlock(&lock_number_of_threads);

        // Create thread 1
        pthread_t thread0;
        int rc = pthread_create(&thread0, NULL, merge_sort_threaded, &data_0);

        // Data for thread 2
        thread_data_t data_1;
        data_1.left = m + 1;
        data_1.right = r;
        data_1.array = data->array;
        pthread_mutex_lock(&lock_number_of_threads);
        data_1.tid = number_of_threads++;
        pthread_mutex_unlock(&lock_number_of_threads);

        // Create thread 2
        pthread_t thread1;
        pthread_create(&thread1, NULL, merge_sort_threaded, &data_1);
        int created_thread_1 = 1;

        // Wait for the created threads.
        pthread_join(thread0, NULL);
        pthread_join(thread1, NULL);

        // Ok, both done, now merge.
        // left - l, right - r
        merge(data->array, l, r, t);
    }
    return NULL;
}

void merge_sort(TYPE *array, int start, int finish)
{
    thread_data_t data;
    data.array = array;
    data.left = start;
    data.right = finish;

    // Initialize the shared data.
    number_of_threads = 0;
    pthread_mutex_init(&lock_number_of_threads, NULL);
    data.tid = 0;
    // Create and initialize the thread
    pthread_t thread;
    pthread_create(&thread, NULL, merge_sort_threaded, &data);
    // Wait for the thread, i.e. the full merge sort algorithm.
    pthread_join(thread, NULL);
}

int main(int argc, char **argv)
{
    int n = INPUT_SIZE;
    int *p = random_array(n);
    merge_sort(p, 0, n - 1);
    free(p);
    pthread_mutex_destroy(&lock_number_of_threads);
}

The sequential version of this algorithm first divides the unsorted list of numbers into small sublists, which are sorted using an algorithm of choice. The sublists are then merged (combined into larger sorted lists) until a single sorted list remains. The parallel implementation gives the two merge sort calls on each recursion level to two threads, which carry out the parallel merge sort on their own subarrays until the size of a subarray is less than a user-set minimum merge sort size; then a sequential sort (the C library function qsort) is applied to the subarray.

Table 6.5 shows Parasite's profile for a run of this parallel merge sort on 10,000 integers with a minimum merge sort size of 10. merge only accounts for 3.6% of the work that merge sort performs, indicating that the calls to the sequential qsort (whose internals are not measured by Parasite) are much more expensive. Therefore, decreasing the subarray size at which qsort is applied should increase the parallelism. Further tests confirmed that the parallelism of the top-level merge sort call in main continues to increase until the minimum merge sort size is one. However, Parasite cannot show the true operating-system overhead of pthread_create(...) and pthread_join(...) that would occur in concurrent execution; with this overhead included, the minimum merge sort size for peak parallelism would likely be greater than one.

                       Parallelism (P)   P including Mutex Correction   % of Work   Count
merge                  199.301           199.301                        3.58896     1023
merge sort threaded    99.1834           21.7108                        64.2943     1
merge sort             50.4269           18.1671                        65.07       1
main                   16.2865           11.9586                        100         1
random array           1                 1                              4.66903     1

Table 6.5.: Call site parallelism for parallel merge sort on 10,000 integers with a minimum merge sort size of 10.

Note that the single mutex in this parallel merge sort, which protects the global count of the number of threads, reduces the parallelism of the call to merge sort by about 60 percent. However, this global thread count is only used for debugging purposes, so it could be removed. Without Parasite, the programmer would not necessarily know that this mutex has a significant effect on the parallelism. With Parasite, the programmer sees a clear contrast between the first two columns of the table, and knows that removing the mutex can improve the parallelism significantly.
In [3] it is shown that parallel merge sort has the following work and span:

    T_1(N) = Θ(N lg N)                                                 (6.7)
    T_∞(N) = Θ(lg^3 N)                                                 (6.8)

Therefore the theoretical parallelism is:

    T_1(N)/T_∞(N) = Θ(N lg N)/Θ(lg^3 N) = Θ(N/lg^2 N)                  (6.9)

Figure 6.7 plots Parasite's measured parallelism of the parallel merge sort function against N/lg^2 N. The plot is approximately linear, so the implementation exhibits the expected theoretical parallelism.

[Figure 6.7 plots the parallelism of the top call to the parallel merge sort function against N/(lg N * lg N).]
Figure 6.7.: Dependence of parallelism of parallel merge sort on N, the input vector size.

6.3.5. Summary

The parallel sorting programs act as a more demanding validation of the Parasite tool: they show that Parasite produces reasonable parallelism values for recursive algorithms with large depth, such as the parallel quicksort and parallel merge sort, and that the measured parallelism agrees with the theoretical prediction in the case of merge sort. Furthermore, the sorting tests demonstrate that Parasite can quickly help the developer find information useful for setting program parameters, including the input size, the number of threads used, and the recursion depth reached before a sequential algorithm takes over the sort. The developer could gain further design information from Parasite by examining how different distributions of the input numbers affect the parallelism of each sorting algorithm.

6.4. European Championships Simulation

In this section EMsim, a simulation of the European Football Championships, is analyzed [24]. This program has a more complex structure than a simple master-worker pattern.
Without an understanding of football, one does not necessarily know how many matches could be played concurrently in this simulation, or how the simulation would scale with more threads. Parasite gives the programmer unambiguous numbers for the parallelism of all call sites, which are useful for showing the scalability of individual parts of the simulation as well as problems with load balancing.

Program Structure. Figure 6.8 shows Parceive's visualization of the EMsim program.
As the calling context tree view in the center of this figure shows, the program follows these steps:

1. In the initDB method, read statistics on previous EM matches into a database.
2. Get the 24 teams who will be playing.
3. Simulate a full run of the European championships, through the following steps:
   a) Initialize the simulation.
   b) Make, in parallel, six calls via parallel calls to playGroup, each of which simulates the six matches of one group in the group phase of the EM simulation. Within a group, the matches are played sequentially.
   c) Sort the team scores based on their group results.
   d) Make, in parallel, 16 calls to playMatchInPar to simulate the round of 16.
   e) Make, in parallel, 8 calls to playMatchInPar to simulate the round of 8.
   f) Make, in parallel, 4 calls to playMatchInPar to simulate the quarterfinals.
   g) Make, in parallel, 2 calls to playMatchInPar to simulate the semifinals.
   h) Simulate the final match.

Parceive's visualization shows some useful information about the program, for example that the work is not evenly balanced between the calls to playMatchGen, the function call at the second-to-bottom level of the performance view. However, the visualization only provides a qualitative view of the parallelism in the EM simulation.

Parallelism. Parasite provides quantitative views: Table 6.6 shows all the call sites in the simulation with parallelism greater than 1.
                               Parallelism   Percentage of Work   Count
team1DominatesTeam2            15.5336       0.00218563           72
getNumMatches                  6.84717       0.746724             52
getMatches                     6.4472        0.963656             51
getGoalsPerGame                6.09455       0.00455375           111
getMatchesInternal             5.60061       0.113177             52
getMatchesInternal             5.47997       0.103317             52
fillPlayer                     4.71968       9.90176              2621
fillPlayer                     4.60666       3.80308              222
parallel calls to playGroup    4.33232       69.1592              16
playGroup                      4.33212       69.1557              6
playGroupMatch                 4.33129       69.1352              36
playMatchGen                   4.33077       69.1262              36
playEM                         3.75038       97.9785              2
getPlayersOfMatch              3.73775       96.8404              111
getGoalsPerGame                3.63284       0.00139767           111
getNumPlayersOfMatch           3.62938       20.685               111
getNumPlayersOfMatch           3.6288        20.7125              111
main                           3.58549       100                  1
playFinalRound                 2.84291       28.7715              4
playMatchInPar                 2.84202       28.7286              15
playFinalMatch                 2.84182       28.7262              15
playMatchGen                   2.84162       28.7192              30
getTeam                        1.24702       0.000830991          18

Table 6.6.: Parallelism of call sites in EMsim.

Table 6.6 provides the insight that none of the call sites in EMsim has a parallelism greater than 7, other than team1DominatesTeam2, which accounts for less than 0.01% of the work. Therefore, with the current design of the simulation, there is probably no benefit to using more than seven threads at any point in the simulation. This insight is not immediately clear from Figure 6.8, from the program description, or from the DAG Parasite generates for the program.

Furthermore, the Parceive visualization suggests that parallel calls to playGroup could have a parallelism of at most 6, as it is called six times concurrently. Its parallelism is instead about 4.3, indicating that the calls are not balanced in terms of span: one of the calls must take significantly longer than the others. A similar observation can be made for playGroupMatch, which is called six times from each of the six calls to playGroup. This function simulates a match, and in the group phase all matches are independent, so it could have a parallelism of up to 36 if all matches involved the same work. However, Parasite measures a parallelism of only 4.33 for this call site, indicating that the simulation could be redesigned to be more scalable in the group match phase.
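The gap between six concurrent calls and a measured parallelism of about 4.3 is exactly what a span imbalance produces. As an illustrative (hypothetical) example, if five groups took time t to simulate and one took 2t, then

\[
T_1 = 5t + 2t = 7t, \qquad T_\infty = 2t, \qquad \frac{T_1}{T_\infty} = 3.5,
\]

so a single slow group already pulls the parallelism well below 6, even though all six calls run concurrently.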
6.5. Molecular Dynamics

Appendix 8.1 contains a simple serial molecular dynamics code [25]. Table 6.7 shows the distribution of work for the sequential simulation.

              Parallelism   Percentage of Work   Count
main          1             100                  1
update        1             1.08808              10
compute       1             10.0827              10
calKinetic    1             0.923045             110
compute       1             12.6647              1
main work     1             96.8591              1
distance      1             4.01697              990
initialize    1             3.35982              1

Table 6.7.: Call site parallelism and work percentage for the serial molecular dynamics code in Appendix 8.1, using 10 atoms.

The compute function performs the largest percentage of the work, because it contains nested for loops over all atoms in the simulation. The distance function, which is called within these two nested loops over the atoms, contributes about half of the work of the compute function. This suggests one way to parallelize: split the calls to distance over different threads. Examination of the compute function shows that the potential and force calculations after distance depend on the result of the distance calls, so they should be included in the same worker function as the distance calculation. This parallelization is included in Appendix 8.2, and Table 6.8 shows the call site parallelism:

                 Parallelism   Percentage of Work   Count
distance         1.20118       2.34429              990
compute          1.16186       30.3021              10
distance work    1.14416       12.9843              990
main             1.05319       100                  1
main work        1.04937       98.6271              1
compute          1.0108        39.2091              1
update           1             0.631825             10
initialize       1             1.92633              1
calKinetic       1             0.694468             110

Table 6.8.: Call site parallelism and work percentage for a parallelization over distance calls (the code in Appendix 8.2), using 10 atoms, which creates a new thread for each distance calculation.

This parallelization has a very fine granularity, because it requires Θ(N_a^2) threads to be created for each step of the simulation, where N_a is the number of atoms. The parallelization only increases the parallelism of main from 1 to 1.05, indicating that the time cost of creating and joining threads cancels out most of the additional concurrency gained from creating them. A way to parallelize with coarser granularity is to group the calculations for each atom together, so that for each time step the number of threads created equals the number of atoms.
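A per-atom decomposition could look roughly like the following minimal sketch (hypothetical names such as atom_worker and compute_forces_for_atom, and a toy pair force; the parallelization actually profiled here is the one listed in Appendix 8.3):

#include <math.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define NA  10   /* number of atoms, matching the 10-atom runs above */
#define DIM 3    /* spatial dimensions */

static double pos[NA][DIM];    /* atom positions (read-only during a step) */
static double force[NA][DIM];  /* per-atom forces; thread i writes only row i */

/* Force on atom i from all other atoms, using a toy 1/r^2 pair force. */
static void compute_forces_for_atom(int i)
{
    for (int d = 0; d < DIM; d++)
        force[i][d] = 0.0;
    for (int j = 0; j < NA; j++) {
        if (j == i)
            continue;
        double diff[DIM], r2 = 0.0;
        for (int d = 0; d < DIM; d++) {
            diff[d] = pos[i][d] - pos[j][d];
            r2 += diff[d] * diff[d];
        }
        double r = sqrt(r2) + 1e-12;  /* guard against division by zero */
        for (int d = 0; d < DIM; d++)
            force[i][d] += diff[d] / (r * r * r);
    }
}

/* Thread worker: one thread per atom, so NA threads per time step. */
static void *atom_worker(void *arg)
{
    compute_forces_for_atom((int)(intptr_t) arg);
    return NULL;
}

int main(void)
{
    pthread_t threads[NA];

    /* Arbitrary initial positions, just so the sketch runs. */
    for (int i = 0; i < NA; i++)
        for (int d = 0; d < DIM; d++)
            pos[i][d] = (double)(i + d);

    /* One time step: spawn a thread per atom, then join them all. */
    for (int i = 0; i < NA; i++)
        pthread_create(&threads[i], NULL, atom_worker, (void *)(intptr_t) i);
    for (int i = 0; i < NA; i++)
        pthread_join(threads[i], NULL);

    printf("force on atom 0: %f %f %f\n",
           force[0][0], force[0][1], force[0][2]);
    return 0;
}

Because each thread in this sketch writes only its own row of the force array, no mutex is needed; the versions in Appendices 8.2 and 8.3 share AtomInfo objects between threads, which is why the race-condition caveat discussed below applies to them.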
This parallelization is included in Appendix 8.3, and Table 6.9 shows the call site parallelism:

              Parallelism   Percentage of Work   Count
compute       44.3736       80.8772              10
main work     8.65348       99.7773              1
main          8.55733       100                  1
compute       4.41926       11.8808              1
update        1             0.106188             10
initialize    1             0.335047             1

Table 6.9.: Call site parallelism and work percentage for the parallelization in Appendix 8.3, which creates a new thread for the distance and potential calculations associated with each of the 10 atoms.

There are 10 threads spawned at each time step to perform the distance calculations for the atoms in the simulation, and the parallelism of main is about 8.6, which indicates some overhead in pthread_create(...) and pthread_join(...). There may also be some load imbalance in the work assigned to each thread, which would mean the potential and kinetic energy calculations are more costly for some atoms.

The molecular dynamics simulation illustrates an important difference between Parasite and a profiler that executes the program in parallel: synchronization issues such as race conditions do not have to be considered when using Parasite to gain information about a Pthread program. Consider the two alternative molecular dynamics parallelizations in Appendices 8.2 and 8.3. These likely have race conditions, because a pointer to an AtomInfo object can be accessed concurrently by two different threads. With a normal profiler, this race condition creates nondeterminism that may lead to undefined behavior, so the programmer would have to implement a mutex for each atom to avoid it. Parasite operates sequentially, so the race condition does not present a problem. With Parasite, the programmer can quickly test a "rough draft" of a possible parallelization and compare it with others, without implementing synchronization. In this case the programmer may want to compare the parallelizations presented here, or merely see whether the parallelism of either of them merits the effort required to implement and verify a correct parallelization with proper synchronization primitives. That effort may seem trivial for this simple molecular dynamics application, but it could be significantly higher for applications with more complex synchronization requirements.

6.6. CPP Check

CPPcheck is a static code analysis tool that checks style and correctness in C++ files [26]. It is much larger than the other test programs in this thesis, and it is frequently used by programmers across the world. CPPcheck ships with a multi-threaded mode that does not use Pthreads, but to test Parasite it has been parallelized with the following code excerpt, which has been edited to show only the lines relevant to the parallelization [26]:

struct thread_arg {
    CppCheck* object;
    const std::string* file_name;
    unsigned int return_value;
};

void* pthread_worker(void* thread_arg)
{
    struct thread_arg* thrd_arg = (struct thread_arg*) thread_arg;
    thrd_arg->return_value = thrd_arg->object->check(*(thrd_arg->file_name));
}

int CppCheckExecutor::check_internal(CppCheck& cppcheck, int /*argc*/,
                                     const char* const argv[])
{
    // ...... CODE OMITTED FOR BREVITY .... //
    if (settings.jobs == 1) {
        // ...... CODE OMITTED FOR BREVITY .... //
        std::size_t processedsize = 0;
        pthread_t* thread;
        thread = new pthread_t[_files.size()];
        struct thread_arg* arg;
        arg = new struct thread_arg[_files.size()];
        for (std::map<std::string, std::size_t>::const_iterator i = _files.begin();
             i != _files.end(); ++i) {
            if (!_settings->library.markupFile(i->first)
                || !_settings->library.processMarkupAfterCode(i->first)) {
                arg[j].file_name = &i->first;
                arg[j].return_value = 0;
                arg[j].object = &cppcheck;
                pthread_create(&thread[j], NULL, &pthread_worker, &(arg[j]));
            }
        }
        j = 0;
        for (std::map<std::string, std::size_t>::const_iterator i = _files.begin();
             i != _files.end(); ++i) {
            pthread_join(thread[j], NULL);
            returnValue += arg[j].return_value;
            j++;
        }
    // ... EXCLUDED CODE ... //

This code creates a worker thread for each file that CPPcheck processes, in an attempt to scale the application through concurrent processing of files. CPPcheck's original multi-threaded execution is more complex due to synchronization, but it operates on a similar principle: each thread analyzes a different file. However, the parallelization above requires very little effort to produce compared to CPPcheck's multi-threaded execution. A programmer attempting to parallelize CPPcheck could use the above parallelization together with Parasite's sequential trace execution to test its scalability, before making the effort to produce the synchronization necessary for a fully working parallelization.

Table 6.10 shows the results of running Parasite on this parallelization of CPPcheck with 2 of the sample C++ files contained in CPPcheck's Github repository: