Computational Science and Engineering
(International Master’s Program)
Technische Universität München
Master’s Thesis
Parasite: Local Scalability Profiling for Parallelization
Author: Nathaniel Knapp
1st examiner: Prof. Dr. Michael Gerndt
2nd examiner: Prof. Dr. Michael Bader
Advisor: M. Sc. Andreas Wilhelm
Thesis handed in on: October 25, 2016
I hereby declare that this thesis is entirely the result of my own work except where otherwise
indicated. I have only used the resources given in the list of references.
October 24, 2016 Nathaniel Knapp
Acknowledgments
I would first like to thank Andreas Wilhelm for advising me over the past year I have worked
on this project. His advice has been invaluable to the completion of this thesis. Second, I thank
Prof. Bader and Prof. Gerndt for agreeing to be examiners. Third, I thank my teachers and
mentors during CSE at TUM, especially Alexander Pöppl for mentoring me on my previous
project. Fourth, I thank Prof. Corey O’Hern, Prof. Rimas Vaisnys, Carl Schreck, and Wendell
Smith for their mentoring at Yale, which inspired me to apply to study CSE at TUM. Fifth, I
thank my classmates in CSE who have made studying at TUM a wonderful experience.
Abstract
In this master’s thesis, Parasite, a local scalability profiling tool, is presented. Parasite mea-
sures the parallelism of function call sites in C and C++ applications parallelized using Pthreads.
The parallelism, the ratio of a program’s work to its critical path, is an upper bound on speedup
for an infinite number of processors, and therefore a useful measure of scalability. The use of
Parasite is demonstrated on sorting algorithms, a molecular dynamics simulation, and other
programs. These tests use Parasite to compare methods of parallelization, elicit the depen-
dence of parallelism on input parameters, and find the factors in program design that limit
parallelism. Future extensions of the tool are also discussed.
Contents
Acknowledgments
Abstract
Outline
I. Introduction and Background
1. Introduction
2. Background
2.1. Shared Memory Parallel Programming
2.2. Parallel Program Performance
2.2.1. Deciding Optimal Scalability
2.2.2. Speedup Bounds
2.2.3. Limitations on Speedup
2.2.4. Using Parasite for Parallel Program Design
2.3. Synchronization
3. Related Work
3.1. Cilk Profiling Tools
3.2. Other Tools
II. The Parasite Scalability Profiler
4. Parceive
4.1. Acceptable Inputs
4.2. Trace Generation
4.3. Trace Analysis
5. Algorithm
5.1. The Parasite Work-Span Algorithm
5.2. Work-Span Data Structures
5.3. Estimation of Mutex Effects
5.4. Graph Validation
III. Results and Conclusion
6. Results
6.1. Fibonacci Sequence
6.2. Vector-Vector Multiplication
6.3. Sorting Algorithms
6.3.1. Bubble Sort
6.3.2. Quicksort
6.3.3. Radix Sort
6.3.4. Merge Sort
6.3.5. Summary
6.4. European Championships Simulation
6.5. Molecular Dynamics
6.6. CPP Check
7. Conclusion
IV. Appendices
8. Molecular Dynamics Code
8.1. Serial Version
8.2. Fine Grained Parallelization
8.3. Coarse Grained Parallelization
Bibliography
Outline
Part I: Introduction and Background
CHAPTER 1: INTRODUCTION
This chapter presents an overview of the thesis and its purpose.
CHAPTER 2: BACKGROUND
This chapter discusses parallel programming theory relevant to local scalability profiling.
CHAPTER 3: RELATED WORK
This chapter discusses research and commercial tools similar to the Parasite tool.
Part II: The Parasite Scalability Profiler
CHAPTER 4: PARCEIVE
This chapter describes how programs are processed by Parceive before analysis by Parasite.
CHAPTER 5: ALGORITHM
This chapter describes the algorithms and data structures of the Parasite tool, which provide
local scalability profiling.
Part III: Results and Conclusion
CHAPTER 6: RESULTS
This chapter describes tests of Parasite on a diverse selection of programs.
CHAPTER 7: CONCLUSION
This chapter summarizes the current capabilities of the Parasite tool and discusses future ex-
tensions.
Part I.
Introduction and Background
1. Introduction
Over the last half-century rapid IT advances “have depended critically on the rapid growth of
single-processor performance,” and much of this growth depended on increasing the number
and speed of transistors on a processor chip by decreasing their size [1]. However, since the
early 21st century, improvements in the speed of single processors have been very slow, as
limits in efficiencies of single-processor architectures have been reached. The size of transistors
continues to be reduced at the same rate, and the hardware industry builds chips that contain
several to hundreds of processors.
The exponential increase in the computing potential of hardware does not translate directly
into the performance the user actually sees, because performance depends not only on the
capabilities of the hardware, but also on how well these capabilities are utilized. To
fully utilize these chips, they must be programmed using parallel programming models, which
present many software challenges. Therefore, research into methods of parallelization is essen-
tial for true improvement of hardware performance, as opposed to just improvements of the
hardware’s potential performance. Additionally, existing legacy code must be parallelized to
fully use the scalability potential of multicore processors. However, parallelization is time-
consuming and error-prone, so most legacy software “still operates sequentially and single-
threaded” [2]. Several challenges of parallel programming explain the gap in development
between hardware and software.
One challenge is successful design of a parallel program that operates concurrently. The
program’s computation may be split into components that run on different threads. Without
proper synchronization, operations that access the same memory locations can easily lead to
nondeterministic behavior. Hence, parallelization of serial programs requires identifying de-
pendencies and refactoring to use multiple threads in a way in which these dependencies do
not create race conditions. Another challenge is load balancing: evenly dividing the work of
the threads so that full scalability potential is realized. This requires some understanding of the
amount of work associated with each task, as well as separating the program into tasks that are
small enough for even balancing to be possible.
The Chair of Computer Architecture at TUM has developed an interactive parallelization
tool, Parceive, which helps programmers overcome these design challenges for shared memory
systems [2]. Figure 1.1 illustrates the high-level components of Parceive.
Figure 1.1.: Steps of the Parceive tool: the input (a binary application with debug symbols) undergoes static analysis (data-flow, control-flow) and runtime analysis (binary instrumentation, event inspection); the resulting trace data feeds the visualization framework and views as well as Parasite's scalability profiling.
The Parceive tool takes an executable as input, and using Intel’s Pin tool, dynamically instru-
ments predefined instructions, including function calls and returns, memory accesses, memory
allocation and release, and Pthread API calls. This instrumentation inserts callbacks that are
used to write trace data into a database at runtime. Then, the Parceive interpreter reads the
trace stored in the database sequentially in chronological order. The interpreter API allows
the user to acquire information from events of interest generated from reading the database.
These events include function calls and returns, thread creation and ends, thread joins with
their parent, and mutex locks and unlocks.
In this thesis, a scalability profiling tool called Parasite is described. This tool analyzes the
events generated by the Parceive interpreter to calculate the parallelism of call sites in Pthread
programs. The parallelism, the ratio of an application’s total work to its critical path, is the
upper bound on speedup possible on any number of processors.
Parasite’s parallelism calculations are useful in two ways. First, they allow the programmer
to quickly identify areas of high and low parallelism. This allows the programmer to focus par-
allelization effort where this effort can result in speedup, and to avoid spending unnecessary
parallelization effort on functions with inherently low parallelism. Low call site parallelism
values might also indicate the need to redesign the program to increase parallelism. Second,
the parallelism calculations allow the programmer to quickly see if the measured speedup of
their program is far from the upper bound on speedup shown by the parallelism. A large
gap between the parallelism and the measured speedup indicates design problems, synchro-
nization problems or operating system problems such as scheduling overhead and memory
bandwidth bottlenecks.
This thesis is structured as follows. Chapter 2 will describe parallel programming theory
relevant to the Parasite tool. Chapter 3 will describe other scalability profiling tools that have
been developed. Chapter 4 will describe how Parceive processes input programs before their
analysis by the Parasite tool. Chapter 5 will describe the algorithms Parasite uses to calculate
the parallelism for function call sites, estimate lock effects on parallelism, and verify the cal-
culations using directed acyclic graphs. Chapter 6 will describe tests of the Parasite tool on a
diverse selection of C and C++ programs parallelized using Pthreads. Finally, chapter 7 will
discuss the impact of the Parasite tool and possible future extensions.
2. Background
In this chapter theory relevant to the Parasite tool will be discussed. Section 2.1 describes
shared memory parallel programming, the Pthread API, types of parallelism, and the directed
acyclic graph model of multithreading. Section 2.2 discusses the scalability and performance
of shared memory parallel programs, and how the Parasite tool can be used to improve the
performance and scalability. Section 2.3 discusses synchronization and how it relates to the
Parasite tool.
2.1. Shared Memory Parallel Programming
One way to classify parallel programming models is the way that they access memory. In
shared memory parallel programs, multiple processors are allowed to share the same location
in memory, without any restrictions [3]. In distributed memory parallel programs, processors
do not share memory, and messages are used instead to transfer data between processors. Para-
site analyzes Pthread programs, which use the shared-memory model of parallel programming.
In this section the Pthread API, types of parallelism, and a way to model shared-
memory programs using graphs are discussed.
Pthreads. Pthreads, short for POSIX threads, is a programming API that can be implemented
using C, C++, or FORTRAN. Using Pthreads does not modify the language - instead Pthread
functions are inserted into the code to dynamically create and destroy parallelism and syn-
chronization [1]. A pthread_create(...) call takes a thread ID, a function pointer, and
an optional pointer as arguments. The call creates a new thread with the thread ID that be-
gins running the function whose pointer it is passed. This function can also use the arguments
passed to it through the pointer in the pthread_create(...) call. A pthread_join(...)
call, always in a parent thread, takes a child thread ID argument and an optional pointer to a
return argument. The pthread_join(...) statement creates an implicit barrier: execution of
the parent thread will not continue until the child thread has completed its execution. The only
Pthread synchronization calls that Parceive can analyze are pthread_mutex_lock(...) and
pthread_mutex_unlock(...). Mutexes are described in section 2.3. A serious limitation
of Pthreads is its lack of locality control. Locality control is the ability for the programmer to
explicitly direct the location of memory in the operating system. Other limitations include the
overhead of the thread creation and deletion, and the limited control over thread scheduling in
the operating system.
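As a minimal illustration of this API (the function and variable names below are invented for the example and do not come from the thesis test programs), the following C++ fragment creates one child thread and then joins it:

#include <pthread.h>
#include <cstdio>

// Worker function executed by the child thread; the void* argument
// carries the data passed through pthread_create(...).
void* worker(void* arg) {
    int* value = static_cast<int*>(arg);
    std::printf("child thread received %d\n", *value);
    return nullptr;
}

int main() {
    pthread_t child;
    int argument = 42;
    // Create the child thread; it begins executing worker(&argument).
    pthread_create(&child, nullptr, worker, &argument);
    // Implicit barrier: the parent blocks here until the child has finished.
    pthread_join(child, nullptr);
    return 0;
}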
Types of Parallelism. There are many ways to classify parallelism. One way is to split
parallelism into the two categories data parallelism and functional decomposition. Data paral-
lelism is parallelism that increases with the amount of data, or the problem size [3]. Programs
analyzed in this thesis that have data parallelism include vector-vector multiplication, whose
available parallelism increases with the size of vectors being multiplied. Another example is
CPPCheck, analyzed in section 6.6, which is a static analysis tool for correctness and style. As
the work of this program grows with the number of files, this program shows data parallelism,
even though the operations the program executes for each file may differ. Functional decom-
position, in contrast, splits a program into tasks that perform different functions. At maximum,
programs with functional decomposition can scale by the number of tasks, but this requires the
tasks to have equal work - perfect load balancing [3].
Parallelism can also be split into regular and irregular parallelism. Programs with regular
parallelism can be split into tasks that have predictable dependencies. Programs with irregular
parallelism can only be split into tasks with unpredictable dependencies. Usually, programs
with regular parallelism can be modeled by a single directed acyclic graph, while
programs with irregular parallelism could be modeled by several different directed acyclic
graphs [4].
The Directed Acyclic Graph Model of Multithreading. To examine the structure of shared-
memory parallel programs, it is useful to abstract the programs as directed acyclic graphs
(DAGs). A directed acyclic graph has an ordering of all its vertices, called a topological or-
dering, in which for all directed edges from u to v, u comes before v in the ordering. This
requires the DAG to be acyclic; it is not possible to follow directed edges from any vertex of
the DAG so that the same vertex is reached again. A shared memory parallel program can be
represented as a DAG in which the vertices are strands, “sequences of serially executed instruc-
tions containing no parallel control,” and where graph edges indicate parallel control, such as
thread creation or thread joining [5].
Pthread applications can be modeled with the DAG model of multithreading. In this model,
strands are vertices, and the ordering of strands is shown by the edges between the strands. A
pthread_create(...) statement creates two edges. The first edge, the continuation edge,
leads to the next strand in the same parent thread. The second edge, the child edge, goes to the
first strand in the spawned thread. A pthread_join(...) statement creates an edge from
the last strand in the spawned thread to the strand following it in the parent thread. Figure 2.1
shows a DAG of a minimal Pthread program where one child thread is created by its parent
thread and then rejoins its parent thread.
Figure 2.1.: A DAG representing a simple Pthread program.
2.2. Parallel Program Performance
In this section the scalability and performance of parallel programs will be discussed, as well as
ways that the Parasite profiling tool can be used to improve performance and scalability. The
definitions introduced in this section come from [5] and [6].
2.2.1. Deciding Optimal Scalability
For measuring performance of sequential applications, developers are interested in the execu-
tion time of the application, and the proportion of this execution time spent in different func-
tions. For multithreaded applications, developers are also interested in how this execution time
depends on the number of processing cores [5]. This is an important question, as developers
must decide how many cores to use. Additional cores should only be used when their marginal
benefit is greater than their marginal cost. The benefit comes from speedup of the application.
The cost comes from two factors. The first factor is complexity of parallelization. Developers
must ensure that parallel code does not suffer from nondeterminism, try to split computational
load as evenly as possible, and decide what size the load on each concurrently running pro-
cessor should be. This size is called the granularity of the parallelization. The second factor is
the added power consumption cost of additional processors or hardware needed for parallel
computation, but this is usually not a concern compared to the cost of developing parallel code.
One way of determining scalability is to measure it directly. This requires parallel programs
that can easily be changed to use a greater or fewer number of threads by changing an input
parameter. Unfortunately, these programs are limited to those which have independent tasks
within for loops, where an equal fraction of the independent iterations can easily be assigned
to each processor. For these programs, runtimes can be measured using different numbers
of threads to observe the corresponding speedup. However, this process does not show the
scalability for separate function call sites. In this thesis, a call site is used to refer to each line
of a program where a function is called. For programs that contain more complex task-level
parallelism, or call sites with varying parallelism, it is not so easy to decide on the optimal
number of threads, as threads may have uneven workloads. For these programs, it is useful
to have Parasite because it shows individual call sites with high potential for parallelization,
without having to measure the scalability directly by profiling program runs with different
numbers of threads.
2.2.2. Speedup Bounds
Upper Bounds. Work and span are two measurements of the computation in a parallel pro-
gram. The work is the time it would take to execute all the strands in the computation sequen-
tially. This is the same as the time it takes to execute the computation on one processor, so it is
denoted as T1. The span is the time it takes to execute the critical path of the computation. This
is the same as the time it takes to execute the computation on an infinite number of processors,
so it is denoted as T∞.
“P processors can execute at most P instructions in unit time” [5], which creates the first
speedup constraint, the work law, where Tp is the parallel execution time:
Tp ≥ T1/P (2.1)
The maximum speedup from parallelization increases linearly with the number of processors
at first, because it is determined by the work law. However, as the number of processors in-
creases, they eventually cannot affect the speedup, because at least one of the processors must
execute all instructions on the critical path. This upper bound for the speedup possible on any
number of processors is called the parallelism. Parallelism “is the ratio of a computation’s work
to its span” [6]. This is stated in equation 2.2, where Sp is the speedup on P processors:
Sp = T1/Tp ≤ T1/T∞ (2.2)
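As a simple numerical example, suppose a computation has work T1 = 100 seconds and span T∞ = 10 seconds. Its parallelism is T1/T∞ = 10, so no number of processors can yield more than a tenfold speedup, and the work law limits the runtime on P = 4 processors to Tp ≥ 100/4 = 25 seconds.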
A Lower Bound. The work and span can also be used to calculate a lower bound on speedup
for an ideal machine. An ideal machine is one in which memory bandwidth does not limit
7
2. Background
performance, the scheduler is greedy, and there is no speculative work [3]. Speculative work
is when the machine performs work that may not be needed, before it would be needed, in
case this is faster than performing the work after it is needed. This lower bound on speedup is
called Brent’s Lemma [3].
Tp ≤ (T1 − T∞)/P + T∞ (2.3)
This formula is explained by the fact that the program always takes at least T∞ time, but the
P processors can split up the remaining work, T1 − T∞, evenly. Therefore the sum in Equation
2.3 describes a lower bound on speedup for an ideal machine. An ideal machine does not have
the limitations on speedup described in the next section, so if this lower bound on speedup is
not met, it indicates that one of these limitations is acting on the program.
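Continuing the numerical example above, with T1 = 100 seconds and T∞ = 10 seconds, Brent's Lemma guarantees that an ideal machine with P = 4 processors finishes in at most Tp ≤ (100 − 10)/4 + 10 = 32.5 seconds, a speedup of at least 100/32.5 ≈ 3.1.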
2.2.3. Limitations on Speedup
The goal of Parasite is to help the programmer identify why programs that are parallelized
do not reach their theoretical upper bound on speedup. There are six types of limitations on
speedup in a parallel program described in [3] and [6]:
• Insufficient parallelism: The program contains serial sections that prevent speedup when
using more processors.
• Contention: A processor is slowed down by competing accesses to synchronization prim-
itives, such as mutexes, or by the true or false sharing of cache lines.
• Insufficient memory bandwidth: The processors access memory at a rate higher than the
bandwidth of the machine's memory network can sustain.
• Strangled scaling occurs when synchronization primitives serialize execution and limit
scalability. This problem is often coupled with attempts to solve deadlocks or race condi-
tions, as synchronization primitives implemented to deal with these can lead to strangled
scaling.
• Load imbalance is when some worker threads have significantly more work than others.
This increases the span unnecessarily, as the threads with less work must wait idly while
the other threads complete. This can be dealt with by overdecomposition: splitting tasks
into many more concurrent portions than there are available threads. It is easier to spread
many small blocks of serial work evenly over threads, than a few large blocks.
• Overhead occurs from the cost of creating threads and destroying threads. This problem
is often coupled with load imbalance, as overdecomposition leads to a greater overhead.
Therefore, an appropriate granularity, size of concurrent workloads, should be chosen to
limit both the overhead and load imbalance.
2.2.4. Using Parasite for Parallel Program Design
This section describes how the programmer can use Parasite to diagnose limitations in their
program design, and in some cases, guide possible improvements to the program’s design.
Figure 2.2 provides a visualization of this process.
Figure 2.2.: Using Parasite to guide parallel program design.
First, the programmer immediately sees, from Parasite’s call site profiles, call sites in their
program where there is insufficient parallelism. A call site is a specific line in the program
where a function is called, and measurements for this call site include all child function calls
of the function. Parasite can be used to compare the number of processors employed for each
call site to the parallelism of each call site. This helps identify call sites where the parallelism
does not greatly outnumber the number of processors in use for the call site, and so the speedup may not
be linear [6]. Equation 2.4, derived from Equation 2.3, shows this mathematically:
Sp = T1/Tp ≈ P if T1/T∞ ≫ P (2.4)
Parallel slack is the ratio of the parallelism to the number of processors. With enough par-
allel slack, the program shows linear speedup. Scheduling overhead occurs when there is not
enough parallel slack for each processor to be given a task when it is free. This requires some
processors to wait for available work, potentially increasing the span of the program. The
amount of parallel slack needed depends on the operating system scheduler. Intel Cilk Plus
and Intel TBB task schedulers work well with high amounts of parallel slack, because they
only use as much parallelism as the hardware is capable of handling [3]. Pthreads, in contrast,
requires its threads to run concurrently, so high parallel slack can decrease the possible
speedup if the operating system has fewer hardware threads available than the program creates. In this case, to
simulate concurrency, the operating system must “time-slice” between the concurrent threads,
adding overhead for context switching and changing the items in the cache [3].
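For example, a call site with parallelism 96 executed on 8 processors has a parallel slack of 12 and comfortably satisfies the condition of Equation 2.4, whereas a call site with parallelism 10 on the same 8 processors has a slack of only 1.25 and cannot be expected to show linear speedup.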
Second, the programmer will be able to use Parasite to investigate the impact of mutex con-
tention on parallelism, using an interactive visualization that allows easy selection of shared
memory locations to lock using mutexes. The tool will then automatically calculate a new up-
per bound on speedup with locks on these shared memory locations, without the programmer
changing the source code.
Finally, the parallelism of a call site can be compared to the speedup measured using a dif-
ferent profiling tool. If there is a gap in the speedup, and use of the Parasite tool has ruled
out insufficient parallelism, scheduling overhead, and contention as possible causes, the pro-
grammer must consider alternative problems with their program design or operating system.
Insufficient memory bandwidth, synchronization primitives other than locks such as barriers,
and speculative work are remaining possibilities preventing the parallelization from achieving
its potential speedup.
2.3. Synchronization
Synchronization is coordination of events; synchronization constraints are specific orders of
events required in a concurrent program. The most common types of synchronization con-
straints are serialization, where one event must happen before another, and mutual exclusion,
where one event must not happen at the same time as another [7]. When the two events in ques-
tion are on the same thread, these constraints are easy to satisfy. The serialization constraint is
met by placing events in the order intended. The mutual exclusion constraint is automatically
met, because only one event can happen at the same time on the same thread. When two events
that need to have a specific order or need to be mutually exclusive occur on different threads,
synchronization constraints are harder to meet. From the programmer’s perspective, the or-
der of events on different threads is non-deterministic, as it depends on the operating system
scheduling.
Problems. There are a number of problems associated with synchronization. One common
example is race conditions, which occur "when concurrent tasks perform operations on the
same memory location without proper synchronization, and one of the memory operations is
a write” [3]. These can have no negative effect in some cases, but are nondeterministic, and
therefore can fail, so are unacceptable in parallel code. Another example is a deadlock, which
“occurs when at least two tasks wait for each other and each cannot resume until the other task
proceeds” [3].
It is both an advantage and disadvantage of Parasite that it cannot detect synchronization
problems, as it operates using a sequential execution of a Pthread program trace. This sequen-
tial execution acts as if the Pthread program was operating using a single thread. The advan-
tage is that programs can be tested for parallelism before synchronization problems are dealt
with. This saves the programmer time when their only goal for using Parasite is to quickly
compare different parallelizations, to assess if a parallelization provides some minimum scal-
ability requirement, and to identify regions of high and low scalability. The disadvantage is
that Parasite’s parallelism calculations are not necessarily accurate for programs that employ
sychronization. A program that deadlocks has no parallelism, as it will never complete. The
parallelism of a program where semaphores, conditional waits, or barriers are employed can-
not be accurately measured using Parasite, as wait times due to these primitives will increase
both the work and span. Even Parasite’s mutex wait time correction, described in section 5.3,
only provides a rough estimate of the additional work and span that waiting for mutexes re-
quires.
Semaphores. A general solution to many synchronization problems is called a semaphore.
A semaphore is defined in [7] by the following three conditions:
• The semaphore can be initialized to any integer value, but after that it can only be incre-
mented or decremented by one.
• When a thread decrements the semaphore to a negative value, the thread blocks (does
not continue) and cannot continue until a different thread increments the semaphore.
• When a thread increments the semaphore and there are waiting threads, then one of the
waiting threads is unblocked.
The application of semaphores to diverse synchronization problems is described in detail
in [7]. The lock wait time estimation algorithm used in Parasite only deals with the case of
mutexes, which are semaphores initialized to values of one. Mutexes are often used to protect
variables that are shared in memory between different threads, to avoid race conditions.
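To make these conditions concrete, the following C++ sketch builds a counting semaphore from a Pthread mutex and condition variable. It is an illustrative implementation only (it uses the common variant in which the counter never becomes negative) and is not part of Parasite; a Semaphore constructed with the value one behaves like the mutexes that Parasite's lock wait time estimation handles.

#include <pthread.h>

// Minimal counting semaphore built from Pthread primitives (illustrative only).
class Semaphore {
public:
    explicit Semaphore(int initial) : value(initial) {
        pthread_mutex_init(&mutex, nullptr);
        pthread_cond_init(&condition, nullptr);
    }
    // Decrement; block while the counter would drop below zero.
    void wait() {
        pthread_mutex_lock(&mutex);
        while (value <= 0)
            pthread_cond_wait(&condition, &mutex);
        --value;
        pthread_mutex_unlock(&mutex);
    }
    // Increment; wake one waiting thread, if any.
    void post() {
        pthread_mutex_lock(&mutex);
        ++value;
        pthread_cond_signal(&condition);
        pthread_mutex_unlock(&mutex);
    }
private:
    int value;
    pthread_mutex_t mutex;
    pthread_cond_t condition;
};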
3. Related Work
In this section five profiling tools similar to Parasite will be described. The first two, Cilkview
and Cilkprof, are designed for programs parallelized using the Cilk and Cilk++ multithreading
APIs. The third, ParaMeter, analyzes programs with irregular parallelism. The fourth, Intel
Advisor, is a commercial tool that can be used for scalability profiling. The fifth, Kismet, profiles
potential parallelism in serial programs.
3.1. Cilk Profiling Tools
In this section two tools that profile programs using Cilk and Cilk++ will be described. Cilk
and Cilk++ are programming languages designed for multithreaded computing, that extend
C or C++ code with three constructs, cilk_spawn(), cilk_sync(), and cilk_for(), that
support writing task-parallel programs. The first two constructs are similar to
pthread_create(...) and pthread_join(...), respectively, in Pthreads. Cilk is dif-
ferent from Pthreads, however, in that it does not allow the developer to explicitly choose if
threads are created; cilk_spawn() only creates a new thread if the Cilk scheduling algo-
rithms decide this will help the performance. Therefore, Pthreads is better for shared-memory
parallel applications in which complete control over thread creation is necessary. Cilk is better
for shared-memory parallel applications that require excellent load balancing, as the backend
of Cilk decides how to balance tasks between threads, unlike Pthreads, where the user is re-
sponsible for load balancing. A user's own load balancing will normally not match Cilk's
backend algorithms, which reflect far more development effort than an individual programmer
can typically devote to the problem.
Cilkview. The Cilkview scalability analyzer is a software tool for profiling multithreaded
Cilk++ applications [5]. Like Parceive, Cilkview uses the Pin dynamic instrumentation frame-
work to instrument threading API calls. By analyzing the instrumented binary, Cilkview mea-
sures work and span during a simulation of a serial execution of parallel Cilk++ code. In this
measurement, parallel control constructs such as cilk_spawn() or cilk_sync() statements
are identified by “metadata embedded by the Cilk++ compiler in the binary executable” [5].
Unlike Parasite, Cilkview can analyze scheduling overhead by using the burdened DAG model
of multithreading, which extends the DAG model described in section 2.1.
In the burdened-DAG model, the work and span of some computations are weighted ac-
cording to their grain size, by including a burden on each edge that continues after a thread
end event, and each edge that continues on the parent thread after a new thread event. The
burdens estimate the cost of migrating tasks, and assume all the tasks that can be migrated are
migrated. Task migration is performed by the underlying Cilk scheduler. The main influence
of using a burdened DAG instead of a DAG is that it increases the work and span values used
in the parallelism calculation, and it decreases the parallelism. The decrease in parallelism is
much higher for programs that have fine-grained parallelism, as these programs have more
edges where burdens are added.
Cilkprof. Cilkprof is a scalability profiler developed for multithreaded Cilk computations [6].
It extends Cilkview to provide work, span and parallelism profiles for individual function call
sites as well as the overall program. It uses compiler instrumentation to create an instrumented
Cilk program, that it then runs serially, to analyze each call site: every location in the code
where a function is either called or spawned. The Cilkprof algorithm measures the work and
span of each call site, in order to get their ratio: the parallelism. It is not described here as
it is used in Parasite and therefore described in detail in section 5.1. Conceptually, Parasite’s
algorithm is the same, with four differences in its implementation:
1. Cilkprof’s algorithm is used for Cilk or Cilk++, while the Parasite algorithm is designed
for Pthreads.
2. The Parasite algorithm includes an estimation of the effects of mutexes on parallelism.
This is useful to programmers, as they may be trying to parallelize code which requires
mutexes. Cilkprof and Cilkview do not consider synchronization.
3. The algorithm in this thesis is implemented in an object-oriented style in C++, unlike
Cilkprof’s algorithms, which are implemented in C. This has the advantage that the code
is more readable, and simpler, as it can use helpful data structures in the C++ standard li-
brary, such as unordered maps, in place of the C data structures programmed specifically
for Cilkprof.
4. The implementation of Parasite is more generalizable to other threading APIs than Cilk,
as it responds to thread and function events instead of Cilk function calls. These events
can be more easily mapped to threading constructs in other APIs than Cilk’s threading
constructs can.
3.2. Other Tools
ParaMeter: Profiling Irregular Parallelism. Kulkarni et al. developed a tool, called ParaMeter,
that “produces parallelism profiles for irregular programs” [4]. Irregular programs are orga-
nized with trees and graphs and many have amorphous data parallelism. This is a type of
parallelism where conflicting computations can be performed in any order, where each chosen
order is a DAG that may have its own parallelism. Parasite cannot easily analyze programs
with amorphous data parallelism for two reasons. First, Parasite can only analyze one of the
possible DAGs that models a program. Second, the structure of graphs representing programs
with amorphous data parallelism may depend on the scheduling decisions of the operating
system, and Parasite cannot take these scheduling decisions into account.
ParaMeter deals with these challenges by making the parallelism profile it generates imple-
mentation independent, and using greedy scheduling and incremental execution. Greedy
scheduling “means that at each step of execution, ParaMeter will try to execute as many el-
ements as possible.” Incremental execution means each step of computation is “scheduled tak-
ing work generated in the previous step in account” [4]. ParaMeter not only measures par-
allelism, like Parasite and CilkProf, but also parallelism intensity, which is the amount of
available parallelism divided by the overall size of the worklist at a given time in the compu-
tation [4]. This metric is useful for deciding on work scheduling policies for tasks: random
policies perform better with high parallelism intensities because it is less likely the policies
create scheduling conflicts, which are situations where tasks must wait idly for other tasks to
complete due to dependencies.
The Intel Advisor: A Commercial Tool. The most similar Intel tool to Parasite is the In-
tel Advisor, which can profile serial programs with annotations that specify parallelism, C
and C++ programs parallelized using Intel Thread Building Blocks or OpenMP, C programs
parallelized using Microsoft TPL, or Fortran programs parallelized using OpenMP [8]. The
Threading Advisor workflow of the Intel Advisor provides similar features to Parasite; both
are designed to assist software developers and architects who are in the process of optimizing
parallelization. However, the Advisor tool is proprietary, which is a disadvantage compared
to Parasite, which is open-source, so Parasite’s algorithms are entirely transparent and open to
inspection by developers.
Parasite has the ability to quickly compare parallelism of different Pthread parallelizations of
the same program, without correct synchronization. Intel Advisor has a similar fast prototyp-
ing feature, that allows developers to look at different parallelizations of a program, conveyed
to the tool using annotations, to compare them before actually implementing their paralleliza-
tion [8]. The Advisor accomplishes this by keeping the code serial when comparing the par-
allelizations, so there can be no bugs related to concurrent execution in any of the potential
parallelizations.
The Intel Advisor provides scalability estimates for the entire program in its suitability anal-
ysis, shown in figure 3.1, but unlike Parasite, it only looks at the entire program for parallelism
estimates and does not provide individual scalability estimates for functions. Also unlike Para-
site, the tool contains features that analyze call sites and loops for their vectorization potential.
Like Parasite, it can be used to examine the proportion of work spent in different functions, to
help the programmer see where execution time is spent in tasks that can be parallelized [9].
Figure 3.1.: Intel Advisor suitability analysis screenshot.
Kismet: Parallel Speedup Estimates for Serial Programs. The Parasite tool, as well as the
tools described in the previous sections, all require the input program to already be parallelized
in some way. In contrast, Jeon et al. developed a tool, Kismet, that creates parallel speedup
estimates for serial programs [10]. Like Parasite, Kismet calculates an upper bound on the pro-
gram’s attainable speedup. Unlike Parasite, it takes into account operating system conditions
including “number of cores, synchronization overhead, cache effects, and expressible paral-
lelism types” [10]. The speedup algorithm uses a parallel execution time model that depends
on these operating system conditions as well as the amount of parallelism available. Kismet
determines the amount of parallelism available using summarizing hierarchical critical path
analysis, which measures the critical path and work, like the Cilkprof work-span algorithm,
but uses a different approach to take these measurements. This involves building a hierarchi-
cal region structure from source code, consisting of different regions that help separate different
levels of parallelism.
The advantage of Kismet’s approach over Parasite and the other tools described in this sec-
tion is that it does not require additional effort by the programmer. The Intel Advisor requires
annotations that show parallel control, while the other tools require a parallelization. However,
unlike the other tools, Kismet cannot be used to compare different parallelizations of the same
serial program.
Part II.
The Parasite Scalability Profiler
4. Parceive
The Parasite tool depends on Parceive, which provides information on call sites, functions, and
threads that Parasite uses for its work-span algorithm. In this chapter the details of Parceive’s
implementation will be described.
Parceive operates using the steps shown in Figure 4.1. Parceive takes an executable as input
that must meet the requirements described in section 4.1. Then, it performs static analysis of
the machine code and dynamically instruments predefined instructions such as function calls
and returns, threading API calls, or memory accesses. The instrumentation inserts callbacks
that are used to write trace data into a database at runtime. This will be described in section
4.2. Based on this data, trace analysis generates a visualization which the user can use to see
the overall structure of the program. Trace analysis also generates events that Parasite uses to
calculate scalability of function call sites. This will be described in section 4.3.
Figure 4.1.: Steps of the Parceive tool: the input (a binary application with debug symbols) undergoes static analysis (data-flow, control-flow) and runtime analysis (binary instrumentation, event inspection); the resulting trace data feeds the visualization framework and views as well as Parasite's scalability profiling.
4.1. Acceptable Inputs
Currently, Parasite can only successfully analyze programs that satisfy the following condi-
tions:
• The program is written in C or C++.
• The program is parallelized or annotated using Pthread API calls.
• The Pthread API calls only include pthread_create(...), pthread_join(...),
pthread_mutex_lock(...), and pthread_mutex_unlock(...).
• The program’s behavior does not depend on collaborative synchronization.
The last condition means that Parasite cannot correctly analyze a program where the ex-
ecution behavior of one thread depends on the execution behavior of any other thread. This
situation can occur when mutexes are used to control the ordering of threads, because the order
of threads in which the mutex is acquired and released may differ between sequential and con-
current execution. For example, Parasite will deadlock if a mutex is acquired in a parent thread,
which then generates a child thread that needs to acquire the mutex. In a concurrent execution,
the parent thread would continue after spawning child threads, and unlock the mutex, so that
the child thread could acquire the mutex. In the sequential simulation of execution by the Par-
ceive interpreter, the parent thread will not continue until the child thread has completed, but
the child thread will not complete, because it is waiting on the parent thread’s acquired mutex.
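The following C++ fragment sketches this problematic pattern (the names are invented for illustration). In a concurrent execution the parent releases the mutex after creating the child, but in the sequential replay the parent still holds it while waiting for the child, so the analysis hangs:

#include <pthread.h>

pthread_mutex_t shared_mutex = PTHREAD_MUTEX_INITIALIZER;

void* child_function(void*) {
    // In the sequential replay the parent still holds shared_mutex here,
    // so this lock can never be acquired.
    pthread_mutex_lock(&shared_mutex);
    pthread_mutex_unlock(&shared_mutex);
    return nullptr;
}

int main() {
    pthread_t child;
    pthread_mutex_lock(&shared_mutex);    // parent acquires the mutex first
    pthread_create(&child, nullptr, child_function, nullptr);
    pthread_mutex_unlock(&shared_mutex);  // concurrently, this would release the child
    pthread_join(child, nullptr);
    return 0;
}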
Even without collaborative synchronization, a successful run of Parasite that includes mu-
texes may not produce accurate estimates of parallelism, because Parasite may not correctly
calculate the addition to the work and span associated with the mutexes. The lock wait time
algorithm described in section 5.3 attempts to estimate these additions, but does not take into
account overhead associated with acquiring and releasing the mutexes. Other synchronization
primitives such as conditional waits and barriers are not handled by the Parasite algorithm, so
Pthread programs that use these primitives should not be used as inputs to Parasite.
4.2. Trace Generation
“Parceive analyzes programs by utilizing dynamic binary instrumentation at the level of ma-
chine code during runtime” [2]. It employs the Pin framework because “it is efficient and
supports high-level, easy-to-use instrumentation” [2, 11]. The pintool “injects analysis calls
to write trace data into an SQLite database” [2]. The following instrumentation is used for
data-gathering:
• Call stack: function entries and exits are tracked to maintain a shadow call stack. For each
call, the call instructions, threads, and spent execution time are captured. Additionally,
function signatures, file descriptions, and loops are extracted from debug information.
• Memory accesses: analysis calls are injected to capture information about each memory
access (e.g., memory type, memory address, access instruction). For stack variables, de-
bug information is utilized to resolve variable names.
• Memory management: to handle heap memory, memory allocation and release function
calls are instrumented. The tracked locations are used during analysis to match data
accesses using pointers.
• Threading: Parceive tracks calls of threading APIs, like Pthread, to capture thread opera-
tions and synchronization.
4.3. Trace Analysis
Some information contained in the SQLite database is context-free and can be found by sim-
ple queries to the database. This includes data dependencies between functions, which are
detected by comparing the memory accesses of each function, which are in turn found by ab-
stracting instances of function calls. Other information depends on the control and data flow.
To extract this information, the trace stored in the database is read sequentially in chronologi-
cal order. This information includes runtime of a function call and its nested functions, counts
of specific function calls, and counts of specific memory accesses. An API allows the user to
acquire information from events of interest generated from reading the database. The Parasite
tool interfaces directly with the following events to acquire information it needs for its work-
span algorithm:
1. A function calls another function.
2. A function returns to its parent function.
3. A thread creates a child thread.
4. A child thread’s execution ends.
5. A thread join: an implicit barrier where a thread must join its parent thread.
6. A function acquires a lock.
7. A function releases a lock.
The actions that occur in Parasite’s algorithm with each event are described in section 5.1.
Shadow locks and threads associated with events three to seven allow the event informa-
tion to be independent from whatever programming language is used in the programs. This
allows future extensions of Parasite to analyze not only Pthread programs, but also programs
parallelized using other threading APIs such as OpenMP.
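The exact interpreter API is not reproduced here, but conceptually Parasite registers callbacks for the seven events above. The following C++ interface is a hypothetical sketch with invented type and method names, not the actual Parceive API:

#include <cstdint>

using CallSiteId = std::uint64_t;  // hypothetical identifier types
using ThreadId = std::uint64_t;
using MutexId = std::uint64_t;
using Time = std::uint64_t;

// Hypothetical listener for the seven interpreter events used by Parasite.
class EventListener {
public:
    virtual ~EventListener() = default;
    virtual void onCall(CallSiteId site, Time now) = 0;       // 1. function call
    virtual void onReturn(CallSiteId site, Time now) = 0;     // 2. function return
    virtual void onNewThread(ThreadId child, Time now) = 0;   // 3. thread creation
    virtual void onThreadEnd(ThreadId child, Time now) = 0;   // 4. child thread ends
    virtual void onJoin(ThreadId child, Time now) = 0;        // 5. join with parent
    virtual void onLockAcquire(MutexId mutex, Time now) = 0;  // 6. lock acquired
    virtual void onLockRelease(MutexId mutex, Time now) = 0;  // 7. lock released
};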
5. Algorithm
In this chapter the elements of the Parasite algorithm will be discussed. Section 5.1 describes
the algorithm that Parasite employs to measure the work and span of call sites. Section 5.2
describes the data structures used in this algorithm. Section 5.3 describes the algorithm that
adjusts the work and span measurements to take effects of mutexes into account. Finally, sec-
tion 5.4 describes the directed acyclic graphs that Parasite constructs to verify its algorithm.
5.1. The Parasite Work-Span Algorithm
Conceptually, the Parasite algorithm is the same as the Cilkprof algorithm in [6], but imple-
mented to respond to the Parceive interpreter's events, described in section 4.3, instead of Cilk
constructs. This requires an explanation of how Cilk constructs translate to these events, in
terms of the algorithm (the actions of the operating system are not equivalent). In the Cilkprof
work-span algorithm, a cilk_spawn() is equivalent to a new thread event. A cilk_sync() is
equivalent to a join event where, after the join event, the thread has no current child threads
that have not already joined their parent thread.
Figure 5.1 shows, with pseudocode, the actions of the Cilkprof algorithm as it responds to
the Cilk constructs.
Figure 5.1.: The Cilkprof Work-Span Algorithm. (w = work, p = prefix, l = longest-child, c =
continuation)
The figure uses the following variable names, which are defined in [6], but here, the defini-
tions are written instead in terms of Parceive interpreter events:
1. F is a thread; “Called G” in the figure is a function called from F; otherwise G in the figure
is a child thread of F.
2. The time u is initially set to the beginning of F. As the execution proceeds, it is set to the
time of the new thread event that created the child thread of F, which realizes the longest
span of any child encountered so far since the last join event.
3. The work F.w is the serial runtime of call site F - its total computation.
4. The continuation F.c stores the span of the trace from the continuation of u through the
most recently executed instruction in F.
5. The longest-child F.l stores the span of the trace from the start of F through the thread end
event of the child thread that F creates at u.
6. The prefix F.p stores the span of the trace starting from the first instruction of F and ending
with u. The path through the DAG representing the program trace that has the length F.p
is guaranteed to be on the critical path of F.
Figure 5.2 illustrates the Cilkprof algorithm as it progresses through the execution of a pro-
gram trace. If the algorithm is still unclear, the reader is encouraged to read section 3 of [6], or
view the documentation and source code of the Parasite tool at [12].
Figure 5.2.: Updates to the span variables as Parasite is executed on a program trace. Each ar-
row represents a different thread. An arrow starting from another arrow is a new
thread event; an arrow intersecting with another is a join event. Colors indicate
different span variables. Before the first join, the continuation and longest child are
compared. The longest child is longer, so the prefix is updated to be the longest
child. Before the second join, the sum of the prefix and the continuation is com-
pared to the new longest child. The longest child is longer, so the prefix is updated
to become this longest child. After the second join, the prefix is now what was the
longest child before the join. After the end of the main thread, the remaining con-
tinuation of the main thread has been added to the prefix. Now the prefix is equal
to the entire span of the program.
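The core update rules of the algorithm, following the description in [6], can be sketched in C++ as follows. The frame layout and function names are simplified for illustration and are not Parasite's actual source:

#include <algorithm>
#include <cstdint>

using Time = std::uint64_t;

// Simplified work-span frame (w = work, p = prefix, l = longest-child, c = continuation).
struct Frame {
    Time w = 0, p = 0, l = 0, c = 0;
};

// A spawned child G ends and joins its parent F.
void onChildJoin(Frame& F, const Frame& G) {
    F.w += G.w;               // the child's work is added to the parent's work
    if (F.c + G.p > F.l) {    // this child realizes the longest span seen so far
        F.p += F.c;           // commit the continuation to the prefix
        F.l = G.p;            // remember this child as the longest child
        F.c = 0;              // restart the continuation after the spawn point
    }                         // otherwise G lies off the critical path and is discarded
}

// A sync: a join event after which F has no outstanding children.
void onSync(Frame& F) {
    F.p += std::max(F.c, F.l);  // the longer of continuation and longest child
    F.l = 0;                    // extends the prefix; both are then reset
    F.c = 0;
}

// A called (not spawned) function G returns to F.
void onCalledReturn(Frame& F, const Frame& G) {
    F.w += G.w;   // called work and span simply extend
    F.c += G.p;   // the parent's continuation
}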
Complexity Analysis. The Parasite algorithm has time complexity O(Ne), where Ne is the
number of events that the Parceive interpreter sends the Parasite tool. The number of these
events depends entirely on the Pthreads program that the algorithm analyzes. Inputs with
large numbers of threads, function calls, or mutex locks and unlocks will take much longer for
Parasite to analyze than inputs with few function calls or threads created.
The space complexity of the algorithm is also highly input dependent. There is a work
hashtable that includes an entry for every call site in the program being profiled. In addi-
tion to this, each thread has three span hashtables that each have an entry for every call site
that is called on the thread. If there are Nt threads, and each thread calls a fixed fraction f of all
of the Ncs call sites, then the complexity would be O(3 ∗ f ∗ Ncs ∗ Nt + Ncs) = O(Ncs ∗ Nt).
5.2. Work-Span Data Structures
In this section, the stack and hashtable data structures used by the work-span algorithm of
section 5.1 are described.
Work and Span Hashtables. Unique call site IDs are generated for each line of a program
where a function is called. Parasite uses hash tables that map these IDs to information about
the work or span of their respective call sites. For every call site, the work hashtable contains
the number of invocations, the total work (measured in time), and the function signature. A
span hashtable contains the longest-child, continuation, or prefix span of each call site on a
thread. It also contains an estimate of the time the thread spends waiting to acquire mutexes.
Function and Thread Stacks. As the Parceive interpreter simulates the execution of a pro-
gram, Parasite updates two stacks, a thread stack and a function stack. These stack data struc-
tures support the traditional stack push and pop operations. For each function call, the function
stack contains a frame with the function signature, the call site ID, and an object that tracks the
lock time intervals. It also contains two integers: the first indicates whether the function call
is the top invocation of its call site on the function stack, and the second indicates whether the
function call is the top invocation of its call site on the current thread. These integers are needed
to avoid the double-counting of work and span in call sites that are called recursively. For each
thread, the thread stack contains a frame with the following information:
1. The unique ID of the thread.
2. A list of interval data structures that stores the times in which mutexes in the thread and
the thread’s children are acquired.
3. Prefix, longest-child, and continuation spans of the thread.
4. A counter that represents the number of child threads spawned from the thread that are
currently on the thread stack.
5. A set that contains call sites which were pushed to the call stack while this thread was
the bottom thread. This set is used to set the integer on each function frame that indicates
whether it is the top invocation of the function’s call site on that thread.
Additionally, a set is used to track all the function call sites currently on the function stack,
in order to correctly set the integer on each function frame that indicates whether it is the top
invocation of the function’s call site in the program.
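A simplified C++ sketch of these structures, using the standard library containers mentioned above, is given below; the type and field names are illustrative rather than Parasite's exact source:

#include <cstdint>
#include <stack>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

using CallSiteId = std::uint64_t;
using ThreadId = std::uint64_t;
using Time = std::uint64_t;

struct WorkEntry {                  // one entry per call site, program-wide
    std::string function_signature;
    std::uint64_t invocations = 0;
    Time work = 0;
};

struct MutexInterval {              // time interval in which one mutex is held
    std::uint64_t mutex_id = 0;
    Time start = 0, span = 0;
};

struct FunctionFrame {              // one frame per function call
    CallSiteId call_site = 0;
    std::string signature;
    std::vector<MutexInterval> lock_intervals;
    bool top_on_stack = false;      // top invocation of this call site on the stack
    bool top_on_thread = false;     // top invocation of this call site on the thread
};

struct ThreadFrame {                // one frame per live thread
    ThreadId id = 0;
    Time prefix = 0, longest_child = 0, continuation = 0;
    int live_children = 0;          // child threads currently on the thread stack
    std::vector<MutexInterval> mutex_intervals;          // thread and its children
    std::unordered_set<CallSiteId> call_sites_seen;      // sites first seen on this thread
    std::unordered_map<CallSiteId, Time> prefix_spans;           // the three per-thread
    std::unordered_map<CallSiteId, Time> longest_child_spans;    // span hashtables,
    std::unordered_map<CallSiteId, Time> continuation_spans;     // keyed by call site ID
};

std::unordered_map<CallSiteId, WorkEntry> work_table;  // the program-wide work hashtable
std::stack<FunctionFrame> function_stack;
std::stack<ThreadFrame> thread_stack;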
5.3. Estimation of Mutex Effects
The effects of mutexes on runtime are non-deterministic, and can only be measured accurately
by running the program under test with concurrent execution. However, the goal of Parasite is
to estimate scalability using its mathematical work-span algorithm, instead of using direct mea-
surement. Therefore, a simple heuristic is used to estimate the impacts of mutex contention on
the span of call sites. This heuristic corrects the span and work of each thread if the time that a
mutex in the thread or its child threads is acquired is greater than the span of the thread without
considering mutexes. The approach is outlined in the following pseudocode, which calculates
an addition to the span and work, called mutex wait time in the source code. The correction is
only applied when the Parasite tool processes a sync - a join event after which the parent thread
has no current children. In the pseudocode, a mutex interval is a data structure storing the
start, span, and mutex ID (a unique ID for each mutex generated by the Parceive interpreter)
that describe a time interval in which a mutex is acquired. The child thread mutex list is
the list of mutexes that any of the child threads in the parent thread have acquired since the last
sync event. This approach does not take into account any overhead associated with acquiring
or releasing mutexes.
# Sum the time each mutex is held over all intervals recorded in the
# parent thread's children since the last sync.
mutex_total_span_list = []
for mutex in child_thread_mutex_list:
    for mutex_interval in mutex.mutex_interval_list:
        mutex.total_span += mutex_interval.span
    mutex_total_span_list.append(mutex.total_span)
# If the most contended mutex is held longer than the longest child span,
# the difference is added to both the span and the work as mutex wait time.
maximum_mutex_span = max(mutex_total_span_list)
correction = max(0, maximum_mutex_span - longest_child_span)
longest_child_span = longest_child_span + correction
parent_thread_work += correction
5.4. Graph Validation
Parasite constructs a directed acyclic graph while it profiles a program. In order to confirm that
the dynamic algorithm described in section 5.1 produces the correct result, this graph is used
to calculate the span of the program being profiled. Figure 5.3 shows such a DAG, for a parallel
program with a master-worker pattern and four worker threads.
Figure 5.3.: Directed acyclic graph of vector-vector multiplication program. In this figure, the
numbers on edges represent the time spent between events. TS = new thread event,
TE = thread end event, and R = return event. The numbers in the new thread and
thread event labels are the IDs of the threads. The numbers in the return event
labels are the call site IDs of the returning functions.
For thread start and thread end events, one vertex and one edge are added to the graph. The
length of the edge is the time elapsed since the last event, and the vertex represents the event
just generated. For a join event, two edges are created: the first edge connects the most recent
event on the parent thread to the join event. The second edge connects the thread end event of
the child thread to the join event. Therefore, a thread join event always has an inward degree of
2, as it joins two threads. A new thread event always has an outward degree of 2, as it creates
a new thread starting from the new thread vertex, and the parent thread continues.
After Parasite has completed its analysis of the program, it calculates the span of the DAG
it has constructed. The graph is built using data structures of the Boost Graph Library [13]
and written to a DOT file when Parasite finishes. This DOT file is then loaded into a Python
script, which calculates the longest path of the graph. The graph does not include estimates of
mutex wait times, so its longest path should equal the span of the main function computed by
Parasite's work-span algorithm (which includes mutex wait times) minus the mutex wait time
of the main function. This check was useful in the initial development of Parasite to confirm
that the algorithm was implemented correctly.
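As an illustration of this validation step, the following C++ sketch builds a tiny event DAG with the Boost Graph Library and writes it to a DOT file. It is only a minimal example of the data structures involved, not the Parasite source; labels and edge weights are made up:
#include <boost/graph/adjacency_list.hpp>
#include <boost/graph/graphviz.hpp>
#include <fstream>
#include <string>

// Vertices carry an event label (e.g. "TS 1", "TE 1", "J 1"); edges carry the
// time elapsed between the two events.
using EventDag = boost::adjacency_list<
    boost::vecS, boost::vecS, boost::directedS,
    boost::property<boost::vertex_name_t, std::string>,
    boost::property<boost::edge_weight_t, double>>;

int main() {
    EventDag g;
    auto name = boost::get(boost::vertex_name, g);
    auto weight = boost::get(boost::edge_weight, g);

    auto ts = boost::add_vertex(g);  name[ts] = "TS 1";  // new thread event
    auto te = boost::add_vertex(g);  name[te] = "TE 1";  // thread end event
    auto j  = boost::add_vertex(g);  name[j]  = "J 1";   // join event (in-degree 2)

    weight[boost::add_edge(ts, te, g).first] = 5.0;  // work on the child thread
    weight[boost::add_edge(ts, j,  g).first] = 3.0;  // continuation of the parent
    weight[boost::add_edge(te, j,  g).first] = 0.0;  // child end joins the parent

    std::ofstream out("dag.dot");
    boost::write_graphviz(out, g, boost::make_label_writer(name));
    return 0;
}
The resulting DOT file can then be read back by the Python validation script described above.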
Longest Path Algorithm. Since a shared-memory parallel program can be represented as a
DAG, one way of finding the span of the program is to employ a longest-path algorithm on
the DAG. To check the correctness of its work-span algorithm, Parasite uses the code in listing
5.1 to calculate the longest path in the graph it constructs to represent its input program. This
algorithm is taken directly from the Python networkx library [14].
Listing 5.1: Longest path algorithm for a DAG [14].
import networkx as nx

def longest_path(G):
    dist = {}  # stores a (distance, predecessor) pair for each node
    for node in nx.topological_sort(G):
        # pairs of (dist, node) for all incoming edges
        pairs = [(dist[v][0] + 1, v) for v in G.pred[node]]
        if pairs:
            dist[node] = max(pairs)
        else:
            dist[node] = (0, node)
    node, (length, _) = max(dist.items(), key=lambda x: x[1])
    path = []
    while length > 0:
        path.append(node)
        length, node = dist[node]
    return list(reversed(path))
Complexity Analysis. It is interesting to compare the complexity of the algorithm in listing
5.1 with Parasite’s algorithm. This algorithm first uses a topological sort, which orders the
vertices of the graph so that for every edge from m to n, m comes before n in the ordering. This
is possible for any directed acyclic graph. The complexity of the topological sort is Θ(V + E),
where V is the number of vertices, and E is the number of edges [15].
After sorting, the algorithm looks at, for each node, the edges from predecessors of this node
to the node itself. Therefore, its complexity is O(V + E). In the DAG generated by Parasite, a
vertex is a thread start, thread end, or thread join event. Every thread except the main thread
has three of these events. Therefore, the number of vertices is O(Nt), where Nt is the number
of threads spawned during the program execution. The number of edges is also O(Nt), so
the complexity of the algorithm is O(Nt). If the same algorithm were applied to each call site
in Parasite, and each call site had the same number of threads, then the complexity would be
O(Nt^2).
The Parasite algorithm has time complexity O(Ne), where Ne is the total number of events. The
number of events may be linearly proportional to the number of threads or may follow a different
relation, so the complexity of the Parasite algorithm can be similar to that of the longest-path
algorithm or much greater; the relative complexity depends entirely on the program being
profiled. The Parasite algorithm, however, provides more information than the longest path
of the entire program: it gives the parallelism of each call site, as well as an estimate of the
effect of mutexes on the parallelism.
Part III.
Results and Conclusion
6. Results
In this section Parasite is applied to a diverse set of programs. For each program, the method
of parallelization is discussed, and Parasite is used to estimate the resulting scalability. First,
a simple program that calculates the Nth Fibonacci number is used to verify the correctness
of Parasite’s algorithm. Second, for vector-vector multiplication and four parallel sorting algo-
rithms, Parasite is applied to the programs multiple times to show the dependence of the paral-
lelism on input parameters. Third, using a simulation of the European football championships,
Parasite is shown to be able to determine the scalability of an application with irregular par-
allelism. Fourth, for a molecular dynamics simulation, the parallelisms of call sites in different
parallelizations of the same program are compared. Finally, CPPCheck is used to show that
Parasite can be used to quickly test a potential parallelization of a sequential program. For
some of the programs, the theoretical parallelism of certain call sites can be calculated and
compared to the parallelism that Parasite measures for these call sites.
6.1. Fibonacci Sequence
The code below is an abbreviated parallelization of calculating the Nth Fibonacci number [16]:
#include <pthread.h>
#include <stddef.h>
#define N 20
void* fibonacci_thread(void* arg) {
  size_t n = (size_t) arg, fib;
  pthread_t thread_1, thread_2;
  void* pvalue;
  if ((n == 0) || (n == 1))
    return (void*) 1;
  pthread_create(&thread_1, 0, fibonacci_thread, (void*)(n - 1));
  pthread_create(&thread_2, 0, fibonacci_thread, (void*)(n - 2));
  pthread_join(thread_1, &pvalue);
  fib = (size_t) pvalue;
  pthread_join(thread_2, &pvalue);
  fib += (size_t) pvalue;
  return (void*) fib;
}
size_t fibonacci(size_t n) {
  return (size_t) fibonacci_thread((void*) n);
}
int main() {
  fibonacci(N);
}
Table 6.1 shows the parallelism of the different function calls in this program, for finding the
20th Fibonacci number.
Call Site          Parallelism   Percentage of Work   Count
fibonacci thread   642.401       99.4487              6764
fibonacci thread   241.458       99.856               1
fibonacci          221.677       99.9273              1
main               205.314       100                  1
Table 6.1.: Call site parallelism for calculating the 20th Fibonacci number.
For this method of calculating the Nth Fibonacci number, the parallelism has been shown to be:
Parallelism(n) = Θ(φ^n / n) (6.1)
where φ is the golden ratio [17]. Figure 6.1 confirms that Parasite’s measurements for the
parallelism of the fibonacci function in this code follow the theoretical prediction. This is a
useful validation that Parasite’s algorithm is implemented correctly.
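As a rough numerical check of this bound (an estimate only, since the Θ-notation hides constant factors): for n = 20, φ^n/n ≈ 15127/20 ≈ 756, while Table 6.1 reports a parallelism of about 222 for the fibonacci call site, a ratio of roughly 0.3. Figure 6.1 shows that this proportionality is maintained as n grows.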
[Plot: parallelism of the fibonacci function versus φ^n/n.]
Figure 6.1.: Dependence of parallelism of Fibonacci function on the number N it calculates.
6.2. Vector-Vector Multiplication
The first test of Parasite was the following parallel vector-vector multiplication program [18]:
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
/* INPUT VARIABLES */
#define NUM_THREADS 5
#define VECTOR_SIZE 1000000000
pthread_mutex_t mutex_sum = PTHREAD_MUTEX_INITIALIZER;
int *VecA, *VecB, sum = 0, dist;
/* Thread callback function */
void * doMyWork(int myId)
{
int counter, mySum = 0;
/*calculating local sum by each thread */
for (counter = ((myId - 1) * dist); counter <= ((myId * dist) - 1);
counter++)
mySum += VecA[counter] * VecB[counter];
/*updating global sum using mutex lock */
pthread_mutex_lock(&mutex_sum);
sum += mySum;
pthread_mutex_unlock(&mutex_sum);
return NULL;
}
/*Main function start */
int main(int argc, char *argv[])
{
/*variable declaration */
int ret_count;
pthread_t * threads;
pthread_attr_t pta;
double time_start, time_end, diff;
struct timeval tv;
struct timezone tz;
int counter, NumThreads, VecSize;
NumThreads = NUM_THREADS;
VecSize = VECTOR_SIZE;
/*Memory allocation for vectors */
VecA = (int *) malloc(sizeof(int) * VecSize);
VecB = (int *) malloc(sizeof(int) * VecSize);
pthread_attr_init(&pta);
threads = (pthread_t *) malloc(sizeof(pthread_t) * NumThreads);
dist = VecSize / NumThreads;
/*Vector A and Vector B initialization */
for (counter = 0; counter < VecSize; counter++) {
VecA[counter] = 2;
VecB[counter] = 3;
}
/*Thread Creation */
for (counter = 0; counter < NumThreads; counter++) {
pthread_create(&threads[counter], &pta, (void *(*) (void *)) doMyWork,
(void *) (counter + 1));
}
/*joining threads */
for (counter = 0; counter < NumThreads; counter++) {
pthread_join(threads[counter], NULL);
}
printf("\n The Sum is: %d.", sum);
pthread_attr_destroy(&pta);
return 0;
}
This is the simplest style of Pthread program, with two functions: a main function and a
worker function. The main function creates several threads which perform the worker function.
In this case, the worker function doMyWork takes sections of the vectors, multiplies these sec-
tions, and adds the result to a global sum variable, which is protected by the mutex mutex_sum
to avoid race conditions.
Figure 6.2 shows the dependence of the parallelism on the vector size, using ten worker
threads. The result approaches a limit of about 9.9. If the threads had equal work, the paral-
lelism would be equal to 10. The parallelism approaches 9.9 instead because the work is not
perfectly balanced between the threads, even though each thread multiplies sections of identical
size: factors such as memory access time and the time to lock and unlock the mutex vary between
the threads.
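As a rough illustration of how little imbalance this requires: if the call-site parallelism is approximated as the sum of the worker times divided by the time of the slowest worker, then a slowest worker that is only about 1% slower than the average already gives 10/1.01 ≈ 9.9.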
Figure 6.2.: Dependence of parallelism of worker function on vector size for parallel vector-
vector multiplication, with a fixed number of 10 worker threads
Figure 6.3 shows the dependence of the parallelism on the number of threads, using a fixed input
size of 10^9. As would be expected, the parallelism increases linearly with the number of threads.
Figure 6.3.: Dependence of parallelism of worker function on number of threads for parallel
vector-vector multiplication, using a fixed input size of 10^9.
6.3. Sorting Algorithms
Most computer scientists are familiar with the sorting algorithms bubble sort, quick sort, radix
sort, and merge sort. In this section Parasite will be applied to programs that implement each of
these sorting algorithms in parallel, to illustrate Parasite’s ability to show the overall and local
parallelism in a program. Specifically, with these sorting algorithms, Parasite shows quantita-
tively how the parallelism depends on input size, recursion depth, and granularity.
6.3.1. Bubble Sort
The following abbreviated code is a simple parallelization of the bubble sort algorithm, which
passes through an array of values, and swaps neighboring values if their order is not cor-
rect [19]. In the sequential version of bubble sort, the first pass starts at index 0 of the
array, and passes continue until all the values are sorted. In the parallel
version, the swaps starting at all even indices are performed in parallel, then the swaps starting
at all odd indices are performed in parallel. This process is repeated until all the elements are
sorted.
#include <pthread.h>
#define DIM 200
int a[DIM], swapped = 0;
pthread_t thread[DIM];
void *bubble(void *arg) {
  int i = (int)(long) arg;
  int tmp;
  if (i != DIM-1) {
    if(a[i] > a[i+1]) {
      tmp = a[i];
      a[i] = a[i+1];
      a[i+1] = tmp;
      swapped = 1;
    }
  }
  return NULL;
}
int main() {
int i;
fill_a_with_random_integers();
do {
swapped = 0;
for(i = 0; i < DIM; i+=2)
pthread_create(&thread[i], NULL, &bubble, (void *)(long) i);
for(i = 0; i < DIM; i+=2)
pthread_join(thread[i], NULL);
swapped = 0;
for(i = 1; i < DIM; i+=2)
pthread_create(&thread[i], NULL, &bubble, (void *)(long) i);
for(i = 1; i < DIM; i+=2)
pthread_join(thread[i], NULL);
} while(swapped == 1);
}
Figure 6.4 shows that this bubble sort implementation is highly parallel and that the parallelism
increases quickly with the input size for small inputs. This is expected, as the number of threads
spawned to perform the swaps grows quadratically with the input size: there are O(N^2) swaps,
and hence O(N^2) threads are created, where N is the input size. However, after the input size
reaches about 200, the parallelism levels off at about 2.69, possibly because some swaps take
more time than others. Larger input sizes were not tested because they make the runtime of
Parasite very high; the complexity of Parasite depends on the number of function and thread
events, which for this code is also O(N^2).
Figure 6.4.: Dependence of parallelism on input size for parallel bubble sort.
Table 6.2 shows that the two call sites of bubble have approximately equal work. Gen-
erating these tables for progressively larger input sizes shows that the parallelism and the work
percentage of the two bubble call sites approach each other as the input size increases, so
that they are eventually equal. This should be expected, because both call sites have the same
number of calls, plus or minus one, and they sort random integers.
Call Site    Parallelism   Percentage of Work   Count
bubble       98.5142       8.38823              9100
bubble       83.046        8.41496              9100
main         2.69487       100                  1
v initiate   1             0.99462              1
Table 6.2.: Call site parallelism for parallel bubble sort on 200 integers.
6.3.2. Quicksort
In the sequential version of Quicksort, in a partition function, elements are sorted around a
randomly chosen pivot element so that all elements less than the pivot are in one array, and
elements greater than the pivot are in another array. The process is repeated recursively on
both of these arrays until each array is sorted, and then the arrays are combined together with
the pivot in order. The following code is a simple parallelization of the quicksort algorithm (for
brevity, the partition function is omitted):
#define RECURSIVE_DEPTH 16
#define INPUT_SIZE 100000
/**
 * Structure containing the arguments to the parallel_quicksort function. Used
 * when starting it in a new thread, because pthread_create() can only pass one
 * (pointer) argument.
 */
struct qsort_starter
{
int *array;
int left;
int right;
int depth;
};
void parallel_quicksort(int *array, int left, int right, int depth);
/**
 * Thread trampoline that extracts the arguments from a qsort_starter structure
 * and calls parallel_quicksort.
 */
void* quicksort_thread(void *init)
{
struct qsort_starter *start = init;
parallel_quicksort(start->array, start->left, start->right,
start->depth);
return NULL;
}
/**
 * Parallel version of the quicksort function. Takes an extra parameter:
 * depth. This indicates the number of recursive calls that should be run in
 * parallel. The total number of threads will be 2^depth. If this is 0, this
 * function is equivalent to the serial quicksort.
 */
void parallel_quicksort(int *array, int left, int right, int depth)
{
if (right > left)
{
int pivotIndex = left + (right - left)/2;
pivotIndex = partition(array, left, right, pivotIndex);
// Either do the parallel or serial quicksort, depending on the depth
// specified.
if (depth-- > 0)
{
// Create the thread for the first recursive call
struct qsort_starter arg = {array, left, pivotIndex-1, depth};
pthread_t thread;
int ret = pthread_create(&thread, NULL, quicksort_thread, &arg);
assert((ret == 0) && "Thread creation failed");
// Perform the second recursive call in this thread
parallel_quicksort(array, pivotIndex+1, right, depth);
// Wait for the first call to finish.
pthread_join(thread, NULL);
}
else
{
quicksort(array, left, pivotIndex-1);
quicksort(array, pivotIndex+1, right);
}
}
}
int main(int argc, char **argv)
{
int depth = RECURSIVE_DEPTH;
// Size of the array to sort. Optionally specified as the second argument
// to the program.
int size = INPUT_SIZE;
// Allocate the array and initialise it with pseudorandom numbers. The
// random number generator is always seeded with the same value, so this
// should give the same sequence of numbers.
int *values = calloc(size, sizeof(int));
assert(values && "Allocation failed");
int i = 0;
for (i=0 ; i<size ; i++)
{
values[i] = i * (size - 1);
}
// Sort the array
parallel_quicksort(values, 0, size-1, depth);
return 0;
}
Here, the recursive calls to quicksort on the arrays smaller and larger than the pivot are made
in parallel, up to a developer-specified depth [20]. After this depth a sequential quicksort is
applied to the array. The parallelism of this parallel Quicksort algorithm can be derived in the
ideal case, where the partition step splits the array evenly around the pivot at every level. The
work and span have the following recurrence relations, presented in [3]:
T1(N) = N + 2T1(N/2) (6.2)
T∞(N) = N + T∞(N/2) (6.3)
The solution of the recurrence relations, derived in [3], is:
T1(N) = Θ(N lg N) (6.4)
T∞(N) = Θ(N) (6.5)
Therefore the theoretical parallelism is:
T1(N)/T∞(N) = Θ(N lg N)/Θ(N) = Θ(lg N) (6.6)
Call Site            Parallelism   Percentage of Work   Count
quicksort            32.8953       39.6884              32
quicksort            32.3863       73.4517              5841
partition            32.3009       12.7183              5841
quicksort            32.2813       73.6963              5841
quicksort            32.2134       39.3518              32
parallel quicksort   31.2151       78.496               31
parallel quicksort   25.8009       81.9838              31
quicksort thread     25.3564       82.0705              31
parallel quicksort   16.7058       86.7864              1
main                 5.48624       100                  1
partition            1.30958       0.761345             63
Table 6.3.: Call site parallelism for parallel quicksort with recursion depth 5 on 10,000 integers.
Table 6.3 shows the parallelism and work percentage of call sites in the parallel quicksort for a
recursion depth of 5, on an input size of 10,000 integers. The four sequential quicksort call sites
all have a parallelism of about 32, as does the partition call site within the quicksort function.
These call sites are all called after the recursion reaches depth 5. The parallel quicksort algorithm
can be viewed as a full binary tree in which each node that is not a leaf is a call to parallel quicksort.
The leaf nodes are calls to the sequential quicksort function, and there are 2^D leaves in a full
binary tree, where D is the recursion depth; so here, where the depth is 5, a parallelism of
32 in the leaf function is expected. Figure 6.5 confirms that the average parallelism of the four
quicksort call sites is approximately equal to 2^D, where D is the recursion depth of the quicksort
program.
[Plot: log2 of the quicksort parallelism versus recursion depth.]
Figure 6.5.: Dependence of parallelism of sequential quicksort calls on recursion depth.
Next, the parallelism of the top call to parallel quicksort observed using Parasite was
compared to the theoretical parallelism of parallel quicksort, in Figure 6.6. To remain as close
as possible to the ideal case, recursion depths of floor(log2(N)) were used. Interestingly, this
parallel quicksort function seems to show a linear dependence of parallelism on input
size, instead of the logarithmic dependence that theory predicts. This is likely because the
algorithm used here applies the sequential quicksort algorithm after the recursion depth is
reached.
Figure 6.6.: Dependence of parallelism of parallel quicksort on input size.
6.3.3. Radix Sort
The following code is a parallelization of the radix sort algorithm [21]:
#define NTHREADS 5
#define INPUT_SIZE 1000000
/* Bits of value to sort on. */
#define BITS 29
/* Thread arguments for radix sort. */
struct rs_args {
int id; /* thread index. */
unsigned *val; /* array. */
unsigned *tmp; /* temporary array. */
int n; /* size of array. */
int *nzeros; /* array of zero counters. */
int *nones; /* array of one counters. */
int t; /* number of threads. */
};
/* Global variables and utilities. */
struct rs_args *args;
pthread_barrier_t barrier;
/* Individual thread part of radix sort. */
void radix_sort_thread (unsigned *val, /* Array of values. */
unsigned *tmp, /* Temp array. */
int start, int n, /* Portion of array. */
int *nzeros, int *nones, /* Counters. */
int thread_index, /* My thread index. */
int t) /* Number of threads. */
{
unsigned *src, *dest;
int bit_pos;
int index0, index1;
int i;
/* Initialize source and destination. */
src = val;
dest = tmp;
/* For each bit... */
for ( bit_pos = 0; bit_pos < BITS; bit_pos++ ) {
/* Count elements with 0 in bit_pos. */
nzeros[thread_index] = 0;
for ( i = start; i < start + n; i++ ) {
if ( ((src[i] >> bit_pos) & 1) == 0 ) {
nzeros[thread_index]++;
}
}
nones[thread_index] = n - nzeros[thread_index];
/* Get starting indices. */
index0 = 0;
index1 = 0;
for ( i = 0; i < thread_index; i++ ) {
index0 += nzeros[i];
index1 += nones[i];
}
index1 += index0;
for ( ; i < t; i++ ) {
index1 += nzeros[i];
}
/* Move values to correct position. */
for ( i = start; i < start + n; i++ ) {
if ( ((src[i] >> bit_pos) & 1) == 0 ) {
dest[index0++] = src[i];
} else {
dest[index1++] = src[i];
}
}
/* Swap arrays. */
tmp = src;
src = dest;
dest = tmp;
}
}
/* Thread main routine. */
void *thread_work (void *arg)
{
int start, n;
int index = (int)(long) arg;
/* Ensure all threads have reached this point, and then let continue. */
pthread_barrier_wait(&barrier);
/* Get portion of array to process. */
n = args[index].n / args[index].t; /* Number of elements this thread is in charge of */
start = args[index].id * n; /* Thread is in charge of [start, start+n] elements */
/* Perform radix sort. */
radix_sort_thread (args[index].val, args[index].tmp, start, n,
args[index].nzeros, args[index].nones, args[index].id,
args[index].t);
return NULL;
}
void radix_sort (unsigned *val, int n, int t)
{
unsigned *tmp;
int *nzeros, *nones;
int r, i;
/* Thread-related variables. */
long thread;
pthread_t* thread_handles;
/* Allocate temporary array. */
tmp = (unsigned *) malloc (n * sizeof(unsigned));
/* Allocate counter arrays. */
nzeros = (int *) malloc (t * sizeof(int));
nones = (int *) malloc (t * sizeof(int));
/* Initialize thread handles and barrier. */
thread_handles = malloc (t * sizeof(pthread_t));
pthread_barrier_init (&barrier, NULL, t);
/* Initialize thread arguments. */
for ( i = 0; i < t; i++ ) {
args[i].id = i;
args[i].val = val;
args[i].tmp = tmp;
args[i].n = n;
args[i].nzeros = nzeros;
args[i].nones = nones;
args[i].t = t;
/* Create a thread. */
pthread_create (&thread_handles[i], NULL, thread_work, (void *)(long) i);
}
/* Wait for threads to join and terminate. */
for ( i = 0; i < t; i++ )
pthread_join (thread_handles[i], NULL);
/* Copy array if necessary. */
if ( BITS % 2 == 1 ) {
copy_array (val, tmp, n);
}
}
int main (int argc, char *argv[])
{
int n, t;
unsigned *val;
time_t start, end;
n = INPUT_SIZE;
t = NTHREADS;
val = (unsigned *) malloc (n * sizeof(unsigned));
random_array (val, n);
args = (struct rs_args *) malloc (t * sizeof(struct rs_args));
radix_sort (val, n, t); /* The main algorithm. */
}
The sequential version of this algorithm first sorts the numbers by their least significant digit,
then by their next significant digit, and so on until the entire sequence of numbers is sorted. The
parallelization splits the array of numbers into equal portions and determines their position in
the overall array using prefix sums on each digit; for a detailed explanation, see [22]. Table 6.4
shows the call site parallelism profiles. Interestingly, the parallelism of these call sites does
not change significantly when the input size is varied or when the number of threads used is
changed. If the developer wished to acquire higher speedup from the program, she could try
modifying the implementation to make it more scalable, by decreasing the amount of time
spent in non-parallelizable sections (radix sort only has about 43% of the work) or possibly
choosing a different parallelization of the radix sort algorithm.
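A rough Amdahl-style estimate from Table 6.4 underlines this: the sequential random array call alone accounts for about 35% of the total work, so even a perfectly parallel radix sort would bound the speedup of the whole program to roughly 1/0.35 ≈ 2.8; reducing or parallelizing this initialization would therefore also be necessary for higher overall scalability.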
Call Site           Parallelism   Percentage of Work   Count
radix sort thread   1.82281       52.99                2
radix sort          1.77026       60.7733              1
thread work         1.70867       57.7119              2
main                1.36292       100                  1
random array        1             35.2695              1
Table 6.4.: Call site parallelism for parallel radix sort on 10^6 integers with 5 threads.
6.3.4. Merge Sort
The following code is a simple parallelization of the merge sort algorithm [23]:
#define TYPE int
#define MIN_LENGTH 2
#define INPUT_SIZE 10000
typedef struct {
TYPE *array;
int left;
int right;
int tid;
} thread_data_t;
int number_of_threads;
pthread_mutex_t lock_number_of_threads;
// The function passed to a pthread_t variable.
void *merge_sort_threaded(void *arg) {
thread_data_t *data = (thread_data_t *) arg;
int l = data->left;
int r = data->right;
int t = data->tid;
if (r - l + 1 <= MIN_LENGTH) {
// Length is too short, let us do a |qsort|.
qsort(data->array + l, r - l + 1, sizeof(TYPE), my_comp);
} else {
// Try to create two threads and assign them work.
int m = l + ((r - l) / 2);
// Data for thread 1
thread_data_t data_0;
data_0.left = l;
data_0.right = m;
data_0.array = data->array;
pthread_mutex_lock(&lock_number_of_threads);
data_0.tid = number_of_threads++;
pthread_mutex_unlock(&lock_number_of_threads);
// Create thread 1
pthread_t thread0;
int rc = pthread_create(&thread0,
NULL,
merge_sort_threaded,
&data_0);
// Data for thread 2
thread_data_t data_1;
data_1.left = m + 1;
data_1.right = r;
data_1.array = data->array;
pthread_mutex_lock(&lock_number_of_threads);
data_1.tid = number_of_threads++;
pthread_mutex_unlock(&lock_number_of_threads);
// Create thread 2
pthread_t thread1;
pthread_create(&thread1,
NULL,
merge_sort_threaded,
&data_1);
int created_thread_1 = 1;
// Wait for the created threads.
pthread_join(thread0, NULL);
pthread_join(thread1, NULL);
// Ok, both done, now merge.
// left - l, right - r
merge(data->array, l, r, t);
}
return NULL;
}
void merge_sort(TYPE *array, int start, int finish) {
thread_data_t data;
data.array = array;
data.left = start;
data.right = finish;
// Initialize the shared data.
number_of_threads = 0;
pthread_mutex_init(&lock_number_of_threads, NULL);
data.tid = 0;
// Create and initialize the thread
pthread_t thread;
pthread_create(&thread,
NULL,
merge_sort_threaded,
&data);
// Wait for thread, i.e. the full merge sort algo.
pthread_join(thread, NULL);
}
int main(int argc, char **argv) {
int n = INPUT_SIZE;
int *p = random_array(n);
merge_sort(p, 0, n - 1);
free(p);
pthread_mutex_destroy(&lock_number_of_threads);
}
The sequential version of this algorithm first divides the unsorted list of numbers into n small
sublists, which are sorted using an algorithm of choice. Then the sublists are merged (combined
into larger sorted lists) until there is a single sorted list remaining. The parallel implementation
gives the two calls to merge sort on each recursion level to two threads, which then carry out
the parallel merge sort on their own subarrays until the size of the subarray is less than a user-set
minimum merge sort size. Then, the sequential C library sort qsort is applied to the subarray.
Table 6.5 shows Parasite's profile for a run of this parallel merge sort with 10,000 integers
and a minimum merge sort size of 10. merge only occupies 3.6% of the work that merge sort
performs, indicating that the calls to the sequential C library function qsort (not measured
by Parasite) are much more expensive. Therefore, decreasing the input size at which this qsort
is performed should increase the parallelism. Further tests confirmed that the parallelism
of the top-level merge sort call in main continues to increase until the minimum merge sort
size is one. However, Parasite cannot show the true operating system overhead
of pthread_create(...) and pthread_join(...) that would occur in concurrent execution;
with this effect, the minimum merge sort size for peak parallelism would likely be
greater than one.
Call Site             Parallelism (P)   P including Mutex Correction   % of Work   Count
merge                 199.301           199.301                        3.58896     1023
merge sort threaded   99.1834           21.7108                        64.2943     1
merge sort            50.4269           18.1671                        65.07       1
main                  16.2865           11.9586                        100         1
random array          1                 1                              4.66903     1
Table 6.5.: Call site parallelism for parallel merge sort on 10,000 integers with a minimum merge
sort size of 10.
Note that the single mutex in this parallel merge sort, which protects the global count of the num-
ber of threads, reduces the parallelism of the call to merge sort by about 60 percent. However,
this global thread count is only used for debugging purposes, so it could be removed. Without
Parasite, the programmer would not necessarily know that this mutex has a signifi-
cant effect on the parallelism. With Parasite, the programmer sees a clear contrast between the
two parallelism columns of the table, and knows that removing the mutex can improve the parallelism
significantly.
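If the counter is kept, the mutex could also be removed by replacing it with an atomic increment. The following minimal sketch is not part of the original program; it is written here in C++ (the original listing is C, where the C11 header stdatomic.h offers the equivalent):
#include <atomic>

// Shared thread counter without a mutex: each thread reserves its ID with a
// single atomic read-modify-write instead of lock / increment / unlock.
static std::atomic<int> number_of_threads{0};

static int next_thread_id() {
    // fetch_add returns the previous value, i.e. a unique, increasing ID,
    // matching the behaviour of the original number_of_threads++.
    return number_of_threads.fetch_add(1);
}
With this change, the two lock/unlock pairs in merge_sort_threaded would collapse to data_0.tid = next_thread_id(); and data_1.tid = next_thread_id();, removing the serialization that Parasite's mutex correction accounts for.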
In [3] it is shown that parallel merge sort has the following work and span:
T1(N) = Θ(N lg N) (6.7)
T∞(N) = Θ(lg^3 N) (6.8)
Therefore the theoretical parallelism is:
T1(N)/T∞(N) = Θ(N lg N)/Θ(lg^3 N) = Θ(N/lg^2 N) (6.9)
Figure 6.7 plots Parasite's measured parallelism of the parallel mergesort function against
N/lg^2 N, where N is the input size. The plot is approximately linear, which indicates that the
implementation has the expected theoretical parallelism.
Figure 6.7.: Dependence of parallelism of parallel mergesort on N, the input vector size.
6.3.5. Summary
Parallel sorting programs act as a more demanding validation of the Parasite tool, by showing
that Parasite produces reasonable parallelism values for recursive algorithms with large depth,
such as the parallel quicksort and parallel merge sort algorithms, and that the parallelism mea-
sured agrees with the theoretical parallelism predicted in the case of mergesort. Furthermore,
the sorting tests demonstrate that Parasite can quickly help the developer find information
useful for setting parameters in their program including the input size, the number of threads
used, and the recursion depth reached before a sequential algorithm is used for the sort. The de-
veloper could gain further information useful to the design of their parallel sorting algorithms
through the Parasite tool by examining the effect of different distributions of input number sets
on the parallelism of each sorting algorithm.
6.4. European Championships Simulation
In this section EMSim, a simulation of the European Football Championships, will be ana-
lyzed [24]. This program has a more complex structure than a simple master-worker pat-
tern. Without understanding football, one does not necessarily know how many matches
can be played concurrently in this simulation, or how the simulation would scale with more
threads. Parasite gives the programmer unambiguous numbers for the parallelism of all the
call sites, which are useful for showing the scalability of individual parts of the simulation, as
well as problems with load balancing.
Program Structure. Figure 6.8 shows Parceive’s visualization of the EMsim program.
Figure 6.8.: The call tree structure of the European championships simulation. For clarity only specific subtrees are expanded.
As the calling context tree view in the center of this figure shows visually, the program fol-
lows the following steps:
1. In the initDB method, read information containing statistics on previous EM matches into
a database.
2. Get the 24 teams who will be playing.
3. Simulate a full execution of the European championships, through the following steps:
a) Initialize the simulation.
b) Make six parallel calls to playGroup (the call site labeled parallel calls to playGroup),
each of which simulates the six matches of one group in the group phase of the EM
simulation. Within a group, the matches are played sequentially.
c) Sort the team scores based on their group results.
d) Call, in parallel, 8 calls to playMatchInPar to simulate the round of 16.
e) Call, in parallel, 4 calls to playMatchInPar to simulate the quarterfinals.
f) Call, in parallel, 2 calls to playMatchInPar to simulate the semifinals.
g) Simulate the final match.
Parceive’s visualization shows some useful information about the program, including that
the work is not evenly balanced between the calls to playMatchGen, the function call at the
second to bottom level of the performance view. However, its visualization only provides a
qualitative view of the parallelism in the EM simulation.
Parallelism. Parasite provides quantitative views: Table 6.6 shows all the function calls in
the simulation with parallelism greater than 1.
Call Site                     Parallelism   Percentage of Work   Count
team1DominatesTeam2           15.5336       0.00218563           72
getNumMatches                 6.84717       0.746724             52
getMatches                    6.4472        0.963656             51
getGoalsPerGame               6.09455       0.00455375           111
getMatchesInternal            5.60061       0.113177             52
getMatchesInternal            5.47997       0.103317             52
fillPlayer                    4.71968       9.90176              2621
fillPlayer                    4.60666       3.80308              222
parallel calls to playGroup   4.33232       69.1592              16
playGroup                     4.33212       69.1557              6
playGroupMatch                4.33129       69.1352              36
playMatchGen                  4.33077       69.1262              36
playEM                        3.75038       97.9785              2
getPlayersOfMatch             3.73775       96.8404              111
getGoalsPerGame               3.63284       0.00139767           111
getNumPlayersOfMatch          3.62938       20.685               111
getNumPlayersOfMatch          3.6288        20.7125              111
main                          3.58549       100                  1
playFinalRound                2.84291       28.7715              4
playMatchInPar                2.84202       28.7286              15
playFinalMatch                2.84182       28.7262              15
playMatchGen                  2.84162       28.7192              30
getTeam                       1.24702       0.000830991          18
Table 6.6.: Parallelism of call sites in EMsim.
Table 6.6 provides the insight that none of the call sites in EMsim has a parallelism greater
than 7, other than team1DominatesTeam2, which accounts for less than 0.01% of the work. There-
fore, with the current design of the simulation, there is probably no benefit to using more than
seven threads at any point in the simulation. This insight is not immediately clear from Figure
6.8, the program description, or the DAG Parasite generates for the program. Furthermore, the
Parceive visualization suggests that parallel calls to playGroup could have a parallelism of at
most 6, as it makes six concurrent calls to playGroup. Its parallelism is instead about 4.3, indi-
cating that the calls are not balanced in terms of span: one of the calls must take significantly
longer than the others. A similar observation can be made for playGroupMatch, which is reached
through the six concurrent calls to playGroup but also only has a parallelism of about 4.3.
The function playGroupMatch is called 6 times from each of the six calls of playGroup.
This function simulates a match, and in the group phase, all matches are independent, so it
could have a parallelism of 36 if all matches involve the same work. However, Parasite only
measures a parallelism of 4.33 for this function, indicating that the simulation could be re-
designed to be more scalable in the group match phase.
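One possible redesign along these lines is sketched below: the group phase spawns one thread per group match instead of one thread per group, so that up to 36 matches can run concurrently. The task structure and the wrapped call are purely illustrative and do not correspond to the actual EMSim interfaces:
#include <pthread.h>

// Hypothetical description of one group match; the real EMSim types differ.
struct GroupMatchTask {
    int group;  // 0..5
    int match;  // 0..5 within the group
};

// Hypothetical worker that would wrap the existing playGroupMatch logic.
void* play_group_match_worker(void* arg) {
    GroupMatchTask* task = static_cast<GroupMatchTask*>(arg);
    // playGroupMatch(task->group, task->match);   // illustrative call only
    (void) task;
    return nullptr;
}

// Flat group phase: one thread per match rather than one thread per group.
void play_group_phase_flat() {
    const int groups = 6, matches_per_group = 6;
    pthread_t threads[groups * matches_per_group];
    GroupMatchTask tasks[groups * matches_per_group];
    for (int g = 0; g < groups; ++g) {
        for (int m = 0; m < matches_per_group; ++m) {
            int i = g * matches_per_group + m;
            tasks[i] = {g, m};
            pthread_create(&threads[i], nullptr, play_group_match_worker, &tasks[i]);
        }
    }
    for (int i = 0; i < groups * matches_per_group; ++i)
        pthread_join(threads[i], nullptr);
}
Whether the matches of one group can really run concurrently depends on how match results and team statistics are updated, so such a draft would again be a candidate for a quick Parasite run rather than a finished parallelization.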
6.5. Molecular Dynamics
Appendix 8.1 contains a simple serial molecular dynamics code [25]. Table 6.7 shows the dis-
tribution of work for the sequential simulation.
Call Site    Parallelism   Percentage of Work   Count
main         1             100                  1
update       1             1.08808              10
compute      1             10.0827              10
calKinetic   1             0.923045             110
compute      1             12.6647              1
main work    1             96.8591              1
distance     1             4.01697              990
initialize   1             3.35982              1
Table 6.7.: Call site parallelism and work percentage for the serial molecular dynamics code in
Appendix 8.1, using 10 atoms.
The compute function performs the largest percentage of the work, because it has nested
for loops over all atoms in the simulation. The distance function, which is called inside two
nested for loops over the atoms, contributes about half the work of the compute function.
This suggests one way to parallelize: split the calls to distance over different
threads. Examination of the compute function shows that the potential and force calculations
after distance depend on the results of the distance calls, so they should be included in the same
worker function as the distance calculation. This parallelization is included in Appendix 8.2,
and Table 6.8 shows the call site parallelism:
Call Site       Parallelism   Percentage of Work   Count
distance        1.20118       2.34429              990
compute         1.16186       30.3021              10
distance work   1.14416       12.9843              990
main            1.05319       100                  1
main work       1.04937       98.6271              1
compute         1.0108        39.2091              1
update          1             0.631825             10
initialize      1             1.92633              1
calKinetic      1             0.694468             110
Table 6.8.: Call site parallelism and work percentage for a parallelization over distance calls
for the code in Appendix 8.2, using 10 atoms, which creates a new thread for each distance
calculation.
This parallelization has a very fine granularity, because it requires Θ(Na^2) threads, where Na is the
number of atoms, to be created for each step of the simulation. The parallelization
only increases the parallelism of main from 1 to 1.05, indicating that the time cost of creating
and joining threads cancels out most of the additional concurrency gained by creating the
threads. A way to parallelize with coarser granularity is to group the calculations for each atom
together, so that for each time step the number of threads created is equal to the number of atoms.
This parallelization is included in Appendix 8.3, and Table 6.9 shows the call site parallelism:
Call Site    Parallelism   Percentage of Work   Count
compute      44.3736       80.8772              10
main work    8.65348       99.7773              1
main         8.55733       100                  1
compute      4.41926       11.8808              1
update       1             0.106188             10
initialize   1             0.335047             1
Table 6.9.: Call site parallelism and work percentage for the coarse-grained parallelization
included in Appendix 8.3, which creates a new thread for the distance and potential calcula-
tions associated with each of the 10 atoms.
There are 10 threads spawned at each time step to perform the distance calculations for the atoms
used in the simulation, and the parallelism of main is about 8.6, which indicates some overhead
in pthread_create(...) and pthread_join(...). There may also be some load
imbalance in the work assigned to each thread, which would mean the potential and kinetic
energy calculations are more costly for some atoms.
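The coarse-grained scheme can be outlined as follows. This is only a rough sketch of the idea, not the code from Appendix 8.3; the AtomWorkerArgs structure and the compute_atom_contributions call are hypothetical:
#include <pthread.h>

// Hypothetical per-atom task: one worker handles the distance, potential, and
// force contributions of a single atom against all others.
struct AtomWorkerArgs {
    int atom_index;
    int num_atoms;
    // pointers to positions, forces, and energy accumulators would go here
};

void* atom_worker(void* arg) {
    AtomWorkerArgs* a = static_cast<AtomWorkerArgs*>(arg);
    // compute_atom_contributions(a->atom_index, a->num_atoms);  // illustrative only
    (void) a;
    return nullptr;
}

// One time step: one thread per atom, all joined before the integration update.
void compute_step_coarse(int num_atoms) {
    pthread_t* threads = new pthread_t[num_atoms];
    AtomWorkerArgs* args = new AtomWorkerArgs[num_atoms];
    for (int i = 0; i < num_atoms; ++i) {
        args[i] = {i, num_atoms};
        pthread_create(&threads[i], nullptr, atom_worker, &args[i]);
    }
    for (int i = 0; i < num_atoms; ++i)
        pthread_join(threads[i], nullptr);
    delete[] args;
    delete[] threads;
}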
The molecular dynamics simulation illustrates an important difference in the use of Parasite
as opposed to a profiler that executes the program in parallel. Synchronization issues such as
race conditions do not have to be considered when using Parasite to gain information about a
Pthread program. Consider the two alternative molecular dynamics parallelizations in Appen-
dices 8.3 and 8.2. These likely have race conditions, because the pointer to an AtomInfo object
can be accessed concurrently by two different threads. Using a normal profiler, this race con-
dition creates nondeterminism that may lead to undefined behavior, so the programmer would
have to implement a mutex for each atom to avoid the race condition. Parasite operates se-
quentially, so this race condition does not present a problem. With Parasite, the programmer
can quickly test a “rough draft” of a possible parallelization for comparison with other paral-
lelizations, without implementing synchronization. In this case the programmer may want to
compare the parallelizations presented here, or merely see if the parallelism of either of these
parallelizations merits the effort required to verify and implement a correct parallelization with
proper synchronization primitives. The effort to implement the primitives may seem trivial for
this simple molecular dynamics application, but could be significantly higher for applications
with more complex synchronization requirements.
6.6. CPP Check
CPPcheck is a static code analysis tool that checks style and correctness in C++ files [26]. It is
much larger than the other test programs in this thesis, and is frequently used by programmers
across the world. CPPcheck comes with a multi-threaded execution that does not use Pthreads,
but to test Parasite, it has been parallelized by the following code excerpt, which has been
edited to only show lines relevant to the parallelization [26]:
struct thread_arg {
CppCheck* object;
const std::string* file_name;
unsigned int return_value;
};
void* pthread_worker(void* thread_arg) {
struct thread_arg* thrd_arg = (struct thread_arg*) thread_arg;
thrd_arg->return_value =
thrd_arg->object->check(*(thrd_arg->file_name));
return NULL;
}
int CppCheckExecutor::check_internal(CppCheck& cppcheck, int /*argc*/,
const char* const argv[])
{
// ...... CODE OMITTED FOR BREVITY .... //
if (settings.jobs == 1) {
// ...... CODE OMITTED FOR BREVITY .... //
std::size_t processedsize = 0;
pthread_t* thread;
thread = new pthread_t[_files.size()];
struct thread_arg* arg;
arg = new struct thread_arg[_files.size()];
for (std::map<std::string, std::size_t>::const_iterator i =
_files.begin(); i != _files.end(); ++i) {
if (!_settings->library.markupFile(i->first)
|| !_settings->library.processMarkupAfterCode(i->first)) {
arg[j].file_name = &i->first;
arg[j].return_value = 0;
arg[j].object = &cppcheck;
pthread_create(&thread[j], NULL, &pthread_worker, &(arg[j]));
}
}
j = 0;
for (std::map<std::string, std::size_t>::const_iterator i =
_files.begin(); i != _files.end(); ++i) {
pthread_join(thread[j], NULL);
returnValue += arg[j].return_value;
j++;
}
// ... EXCLUDED CODE ... //
This code creates a worker thread for each file that CPPcheck processes, in an attempt to scale
the application through concurrent processing of files. CPPcheck's original multi-threaded ex-
ecution is more complex due to synchronization, but operates on a similar principle: each
thread analyzes a different file. However, the parallelization above requires very little effort
to produce compared to CPPcheck's multithreaded execution. A programmer attempting to
parallelize CPPcheck could therefore use the above parallelization together with Parasite's se-
quential trace execution to test scalability, before making the effort to implement the synchroniza-
tion necessary for a fully working parallelization.
Table 6.10 shows the results for running Parasite on this parallelization of CPPcheck with 2
of the sample C++ files contained in CPPcheck’s Github repository:
55
Knapp_Masterarbeit
Knapp_Masterarbeit
Knapp_Masterarbeit
Knapp_Masterarbeit
Knapp_Masterarbeit
Knapp_Masterarbeit
Knapp_Masterarbeit
Knapp_Masterarbeit
Knapp_Masterarbeit
Knapp_Masterarbeit
Knapp_Masterarbeit
Knapp_Masterarbeit
Knapp_Masterarbeit
Knapp_Masterarbeit
Knapp_Masterarbeit
Knapp_Masterarbeit
Knapp_Masterarbeit

More Related Content

What's hot

Aidan_O_Mahony_Project_Report
Aidan_O_Mahony_Project_ReportAidan_O_Mahony_Project_Report
Aidan_O_Mahony_Project_Report
Aidan O Mahony
 
Machine_Learning_Blocks___Bryan_Thesis
Machine_Learning_Blocks___Bryan_ThesisMachine_Learning_Blocks___Bryan_Thesis
Machine_Learning_Blocks___Bryan_Thesis
Bryan Collazo Santiago
 
ImplementationOFDMFPGA
ImplementationOFDMFPGAImplementationOFDMFPGA
ImplementationOFDMFPGA
Nikita Pinto
 
SzaboGeza_disszertacio
SzaboGeza_disszertacioSzaboGeza_disszertacio
SzaboGeza_disszertacio
Géza Szabó
 
Cloud enabled business process management systems
Cloud enabled business process management systemsCloud enabled business process management systems
Cloud enabled business process management systems
Ja'far Railton
 
Composition of Semantic Geo Services
Composition of Semantic Geo ServicesComposition of Semantic Geo Services
Composition of Semantic Geo Services
Felipe Diniz
 
Deployment guide series ibm tivoli composite application manager for web reso...
Deployment guide series ibm tivoli composite application manager for web reso...Deployment guide series ibm tivoli composite application manager for web reso...
Deployment guide series ibm tivoli composite application manager for web reso...
Banking at Ho Chi Minh city
 
Hub location models in public transport planning
Hub location models in public transport planningHub location models in public transport planning
Hub location models in public transport planning
sanazshn
 

What's hot (20)

Design patterns by example
Design patterns by exampleDesign patterns by example
Design patterns by example
 
Thesis: Slicing of Java Programs using the Soot Framework (2006)
Thesis:  Slicing of Java Programs using the Soot Framework (2006) Thesis:  Slicing of Java Programs using the Soot Framework (2006)
Thesis: Slicing of Java Programs using the Soot Framework (2006)
 
Master Thesis
Master ThesisMaster Thesis
Master Thesis
 
Viewcontent_jignesh
Viewcontent_jigneshViewcontent_jignesh
Viewcontent_jignesh
 
Aidan_O_Mahony_Project_Report
Aidan_O_Mahony_Project_ReportAidan_O_Mahony_Project_Report
Aidan_O_Mahony_Project_Report
 
Machine_Learning_Blocks___Bryan_Thesis
Machine_Learning_Blocks___Bryan_ThesisMachine_Learning_Blocks___Bryan_Thesis
Machine_Learning_Blocks___Bryan_Thesis
 
ImplementationOFDMFPGA
ImplementationOFDMFPGAImplementationOFDMFPGA
ImplementationOFDMFPGA
 
Uml (grasp)
Uml (grasp)Uml (grasp)
Uml (grasp)
 
SzaboGeza_disszertacio
SzaboGeza_disszertacioSzaboGeza_disszertacio
SzaboGeza_disszertacio
 
Oop c++ tutorial
Oop c++ tutorialOop c++ tutorial
Oop c++ tutorial
 
Cloud enabled business process management systems
Cloud enabled business process management systemsCloud enabled business process management systems
Cloud enabled business process management systems
 
thesis
thesisthesis
thesis
 
document
documentdocument
document
 
Composition of Semantic Geo Services
Composition of Semantic Geo ServicesComposition of Semantic Geo Services
Composition of Semantic Geo Services
 
thesis
thesisthesis
thesis
 
Deployment guide series ibm tivoli composite application manager for web reso...
Deployment guide series ibm tivoli composite application manager for web reso...Deployment guide series ibm tivoli composite application manager for web reso...
Deployment guide series ibm tivoli composite application manager for web reso...
 
Report-V1.5_with_comments
Report-V1.5_with_commentsReport-V1.5_with_comments
Report-V1.5_with_comments
 
Hub location models in public transport planning
Hub location models in public transport planningHub location models in public transport planning
Hub location models in public transport planning
 
Final Report
Final ReportFinal Report
Final Report
 
Workflow management solutions: the ESA Euclid case study
Workflow management solutions: the ESA Euclid case studyWorkflow management solutions: the ESA Euclid case study
Workflow management solutions: the ESA Euclid case study
 

Similar to Knapp_Masterarbeit

UCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_finalUCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_final
Gustavo Pabon
 
UCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_finalUCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_final
Gustavo Pabon
 
Thesis - Nora Szepes - Design and Implementation of an Educational Support Sy...
Thesis - Nora Szepes - Design and Implementation of an Educational Support Sy...Thesis - Nora Szepes - Design and Implementation of an Educational Support Sy...
Thesis - Nora Szepes - Design and Implementation of an Educational Support Sy...
Nóra Szepes
 
bonino_thesis_final
bonino_thesis_finalbonino_thesis_final
bonino_thesis_final
Dario Bonino
 
Nweke digital-forensics-masters-thesis-sapienza-university-italy
Nweke digital-forensics-masters-thesis-sapienza-university-italyNweke digital-forensics-masters-thesis-sapienza-university-italy
Nweke digital-forensics-masters-thesis-sapienza-university-italy
AimonJamali
 
Report on e-Notice App (An Android Application)
Report on e-Notice App (An Android Application)Report on e-Notice App (An Android Application)
Report on e-Notice App (An Android Application)
Priyanka Kapoor
 
Scale The Realtime Web
Scale The Realtime WebScale The Realtime Web
Scale The Realtime Web
pfleidi
 

Similar to Knapp_Masterarbeit (20)

UCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_finalUCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_final
 
UCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_finalUCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_final
 
Master's Thesis
Master's ThesisMaster's Thesis
Master's Thesis
 
Thesis - Nora Szepes - Design and Implementation of an Educational Support Sy...
Thesis - Nora Szepes - Design and Implementation of an Educational Support Sy...Thesis - Nora Szepes - Design and Implementation of an Educational Support Sy...
Thesis - Nora Szepes - Design and Implementation of an Educational Support Sy...
 
Work Measurement Application - Ghent Internship Report - Adel Belasker
Work Measurement Application - Ghent Internship Report - Adel BelaskerWork Measurement Application - Ghent Internship Report - Adel Belasker
Work Measurement Application - Ghent Internship Report - Adel Belasker
 
Investigation in deep web
Investigation in deep webInvestigation in deep web
Investigation in deep web
 
AUGUMENTED REALITY FOR SPACE.pdf
AUGUMENTED REALITY FOR SPACE.pdfAUGUMENTED REALITY FOR SPACE.pdf
AUGUMENTED REALITY FOR SPACE.pdf
 
bonino_thesis_final
bonino_thesis_finalbonino_thesis_final
bonino_thesis_final
 
Milan_thesis.pdf
Milan_thesis.pdfMilan_thesis.pdf
Milan_thesis.pdf
 
Distributed Decision Tree Learning for Mining Big Data Streams
Distributed Decision Tree Learning for Mining Big Data StreamsDistributed Decision Tree Learning for Mining Big Data Streams
Distributed Decision Tree Learning for Mining Big Data Streams
 
Nweke digital-forensics-masters-thesis-sapienza-university-italy
Nweke digital-forensics-masters-thesis-sapienza-university-italyNweke digital-forensics-masters-thesis-sapienza-university-italy
Nweke digital-forensics-masters-thesis-sapienza-university-italy
 
T401
T401T401
T401
 
My PhD Thesis
My PhD Thesis My PhD Thesis
My PhD Thesis
 
Master_Thesis
Master_ThesisMaster_Thesis
Master_Thesis
 
document
documentdocument
document
 
CS4099Report
CS4099ReportCS4099Report
CS4099Report
 
Nato1968
Nato1968Nato1968
Nato1968
 
E.M._Poot
E.M._PootE.M._Poot
E.M._Poot
 
Report on e-Notice App (An Android Application)
Report on e-Notice App (An Android Application)Report on e-Notice App (An Android Application)
Report on e-Notice App (An Android Application)
 
Scale The Realtime Web
Scale The Realtime WebScale The Realtime Web
Scale The Realtime Web
 

Knapp_Masterarbeit

  • 1. Computational Science and Engineering (International Master’s Program) Technische Universit¨at M¨unchen Master’s Thesis Parasite: Local Scalability Profiling for Parallelization Author: Nathaniel Knapp 1st examiner: Prof. Dr. Michael Gerndt 2nd examiner: Prof. Dr. Michael Bader Advisor: M. Sc. Andreas Wilhelm Thesis handed in on: October 25, 2016
  • 2. I hereby declare that this thesis is entirely the result of my own work except where otherwise indicated. I have only used the resources given in the list of references. October 24, 2016 Nathaniel Knapp ii
  • 3. Acknowledgments I would first like to thank Andreas Wilhelm for advising me over the past year I have worked on this project. His advice has been invaluable to the completion of this thesis. Second, I thank Prof. Bader and Prof. Gerndt for agreeing to be examiners. Third, I thank my teachers and mentors during CSE at TUM, especially Alexander P¨oppl for mentoring me on my previous project. Fourth, I thank Prof. Corey O’Hern, Prof. Rimas Vaisnys, Carl Schreck, and Wendell Smith for their mentoring at Yale, which inspired me to apply to study CSE at TUM. Fifth, I thank my classmates in CSE who have made studying at TUM a wonderful experience. iii
  • 4.
  • 5. Abstract In this master’s thesis, Parasite, a local scalability profiling tool, is presented. Parasite mea- sures the parallelism of function call sites in C and C++ applications parallelized using Pthreads. The parallelism, the ratio of a program’s work to its critical path, is an upper bound on speedup for an infinite number of processors, and therefore a useful measure of scalability. The use of Parasite is demonstrated on sorting algorithms, a molecular dynamics simulation, and other programs. These tests use Parasite to compare methods of parallelization, elicit the depen- dence of parallelism on input parameters, and find the factors in program design that limit parallelism. Future extensions of the tool are also discussed. v
  • 6.
  • 7. Contents Acknowledgements iii Abstract v Outline ix I. Introduction and Background 1 1. Introduction 3 2. Background 5 2.1. Shared Memory Parallel Programming . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2. Parallel Program Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2.1. Deciding Optimal Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2.2. Speedup Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2.3. Limitations on Speedup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2.4. Using Parasite for Parallel Program Design . . . . . . . . . . . . . . . . . . 8 2.3. Synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 3. Related Work 13 3.1. Cilk Profiling Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.2. Other Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 II. The Parasite Scalability Profiler 17 4. Parceive 19 4.1. Acceptable Inputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 4.2. Trace Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 4.3. Trace Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 5. Algorithm 23 5.1. The Parasite Work-Span Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 5.2. Work-Span Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 5.3. Estimation of Mutex Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 5.4. Graph Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 III. Results and Conclusion 31 6. Results 33 6.1. Fibonacci Sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 6.2. Vector-Vector Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 6.3. Sorting Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 6.3.1. Bubble Sort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 vii
  • 8. Contents 6.3.2. Quicksort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 6.3.3. Radix Sort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 6.3.4. Merge Sort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 6.3.5. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 6.4. European Championships Simulation . . . . . . . . . . . . . . . . . . . . . . . . . 48 6.5. Molecular Dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 6.6. CPP Check . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 7. Conclusion 59 IV. Appendices 61 8. Molecular Dynamics Code 63 8.1. Serial Version . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 8.2. Fine Grained Parallelization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 8.3. Coarse Grained Parallelization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 Bibliography 71 viii
  • 9. Contents Outline Part I: Introduction and Theory CHAPTER 1: INTRODUCTION This chapter presents an overview of the thesis and its purpose. CHAPTER 2: THEORY This chapter discusses parallel programming theory relevant to local scalability profiling. CHAPTER 3: RELATED WORK This chapter discusses research and commercial tools similar to the Parasite tool. Part II: The Parasite Scalability Profiler CHAPTER 3: PARCEIVE This chapter describes how programs are processed by Parceive before analysis by Parasite. CHAPTER 4: ALGORITHM This chapter describes the algorithms and data structures of the Parasite tool, which provide local scalability profiling. Part III: Results and Conclusion CHAPTER 5: RESULTS This chapter describes tests of Parasite on a diverse selection of programs. CHAPTER 6: CONCLUSION This chapter summarizes the current capabilities of the Parasite tool and discusses future ex- tensions. ix
  • 10.
  • 11. Part I. Introduction and Background 1
  • 12.
  • 13. 1. Introduction Over the last half-century rapid IT advances “have depended critically on the rapid growth of single-processor performance,” and much of this growth depended on increasing the number and speed of transistors on a processor chip by decreasing their size [1]. However, since the early 21st century, improvements in the speed of single processors have been very slow, as limits in efficiencies of single-processor architectures have been reached. The size of transistors continues to be reduced at the same rate, and the hardware industry builds chips that contain several to hundreds of processors. The so-far exponential increase in the computing potential of hardware is not equal to the actual performance of this hardware that the user sees. This is because performance depends not only on the capabilities of the hardware, but also the utilization of these capabilities. To fully utilize these chips, they must be programmed using parallel programming models, which present many software challenges. Therefore, research into methods of parallelization is essen- tial for true improvement of hardware performance, as opposed to just improvements of the hardware’s potential performance. Additionally, existing legacy code must be parallelized to fully use the scalabilty potential of multicore processors. However, parallelization is time- consuming and error-prone, so most legacy software “still operates sequentially and single- threaded” [2]. Several challenges of parallel programming explain the gap in development between hardware and software. One challenge is successful design of a parallel program that operates concurrently. The program’s computation may be split into components that run on different threads. Without proper synchronization, operations that access the same memory locations can easily lead to nondeterministic behavior. Hence, parallelization of serial programs requires identifiying de- pendencies and refactoring to use multiple threads in a way in which these dependencies do not create race conditions. Another challenge is load balancing: evenly dividing the work of the threads so that full scalability potential is realized. This requires some understanding of the amount of work associated with each task, as well as separating the program into tasks that are small enough for even balancing to be possible. The Chair of Computer Architecture at TUM has developed an interactive parallelization tool, Parceive, which helps programmers overcome these design challenges for shared memory systems [2]. Figure 1.1 illustrates the high-level components of Parceive. Runtime Analysis Binary instrumentation Event inspection Input Binary application Debug symbols Static Analysis Data-flow Control-flow Trace Data Visualization Framework Views Scalability Profiling Parasite Figure 1.1.: Steps of the Parceive tool. The Parceive tool takes an executable as input, and using Intel’s Pin tool, dynamically instru- ments predefined instructions, including function calls and returns, memory accesses, memory allocation and release, and Pthread API calls. This instrumentation inserts callbacks that are used to write trace data into a database at runtime. Then, the Parceive interpreter reads the 3
  • 14. 1. Introduction trace stored in the database sequentially in chronological order. The interpreter API allows the user to acquire information from events of interest generated from reading the database. These events include function calls and returns, thread creation and ends, thread joins with their parent, and mutex locks and unlocks. In this thesis, a scalabiliy profiling tool called Parasite is described. This tool analyzes the events generated by the Parceive interpreter to calculate the parallelism of call sites in Pthread programs. The parallelism, the ratio of an application’s total work to its critical path, is the upper bound on speedup possible on any number of processors. Parasite’s parallelism calculations are useful in two ways. First, they allow the programmer to quickly identify areas of high and low parallelism. This allows the programmer to focus par- allelization effort where this effort can result in speedup, and to avoid spending unnecessary parallelization effort on functions with inherently low parallelism. Low call site parallelism values might also indicate the need to redesign the program to increase parallelism. Second, the parallelism calculations allow the programmer to quickly see if the measured speedup of their program is far from the upper bound on speedup shown by the parallelism. A large gap between the parallelism and the measured speedup indicates design problems, synchro- nization problems or operating system problems such as scheduling overhead and memory bandwidth bottlenecks. This thesis is structured as follows. Chapter 2 will describe parallel programming theory relevant to the Parasite tool. Chapter 3 will describe other scalability profiling tools that have been developed. Chapter 4 will describe how Parceive processes input programs before their analysis by the Parasite tool. Chapter 5 will describe the algorithms Parasite uses to calculate the parallelism for function call sites, estimate lock effects on parallelism, and verify the cal- culations using directed acyclic graphs. Chapter 6 will describe tests of the Parasite tool on a diverse selection of C and C++ programs parallelized using Pthreads. Finally, chapter 7 will discuss the impact of the Parasite tool and possible future extensions. 4
  • 15. 2. Background In this chapter theory relevant to the Parasite tool will be discussed. Section 2.1 describes shared memory parallel programming, the Pthread API, types of parallelism, and the directed acyclic graph model of multithreading. Section 2.2 discusses the scalability and performance of shared memory parallel programs, and how the Parasite tool can be used to improve the performance and scalability. Section 2.3 discusses synchronization and how it relates to the Parasite tool. 2.1. Shared Memory Parallel Programming One way to classify parallel programming models is the way that they access memory. In shared memory parallel programs, multiple processors are allowed to share the same location in memory, without any restrictions [3]. In distributed memory parallel programs, processors do not share memory, and messages are used instead to transfer data between processors. Para- site analyzes Pthread programs, which use the shared-memory model of parallel programming performance. In this section the Pthread API, types of parallelism, and a way to model shared- memory programs using graphs are discussed. Pthreads. Pthreads, short for POSIX threads, is a programming API that can be implemented using C, C++, or FORTRAN. Using Pthreads does not modify the language - instead Pthread functions are inserted into the code to dynamically create and destroy parallelism and syn- chronization [1]. A pthread create(...) call takes a thread ID, a function pointer, and an optional pointer as arguments. The call creates a new thread with the thread ID that be- gins running the function whose pointer it is passed. This function can also use the arguments passed to it through the pointer in the pthread create(...) call. A pthread join(...) call, always in a parent thread, takes a child thread ID argument and an optional pointer to a return argument. The pthread join(...) statement creates an implicit barrier: execution of the parent thread will not continue until the child thread has completed its execution. The only Pthread synchronization calls that Parceive can analyze are pthread mutex lock(...) and pthread mutex unlock(...). Mutexes are described in section 2.3. A serious limitation of Pthreads is its lack of locality control. Locality control is the ability for the programmer to explicitly direct the location of memory in the operating system. Other limitations include the overhead of the thread creation and deletion, and the limited control over thread scheduling in the operating system. Types of Parallelism. There are many ways to classify parallelism. One way is to split parallelism into the two categories data parallelism and functional decomposition. Data paral- lelism is parallelism that increases with the amount of data, or the problem size [3]. Programs analyzed in this thesis that have data parallelism include vector-vector multiplication, whose available parallelism increases with the size of vectors being multiplied. Another example is CPPCheck, analyzed in section 6.6, which is a static analysis tool for correctness and style. As the work of this program grows with the number of files, this program shows data parallelism, even though the operations the program executes for each file may differ. Functional decom- position, in contrast, splits a program into tasks that perform different functions. At maximum, programs with functional decomposition can scale by the number of tasks, but this requires the tasks to have equal work - perfect load balancing [3]. 
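Returning briefly to the Pthread calls introduced above, the following minimal sketch shows how pthread create(...) and pthread join(...) are typically combined: a parent thread launches one child with a function pointer and an argument, and later blocks at the join until the child returns its result. The function name square and the style of packing values into void pointers are illustrative choices for this thesis text, not code taken from any of the programs analyzed later.

#include <pthread.h>
#include <cstdint>
#include <cstdio>

// Start routine for the child thread. Pthreads passes the argument and
// returns the result through void pointers, so the value is packed and
// unpacked with integer casts.
static void* square(void* arg) {
    std::intptr_t n = reinterpret_cast<std::intptr_t>(arg);
    return reinterpret_cast<void*>(n * n);
}

int main() {
    pthread_t child;
    void* result = nullptr;

    // pthread_create: thread handle, default attributes, start routine, argument.
    pthread_create(&child, nullptr, square, reinterpret_cast<void*>(std::intptr_t(7)));

    // The parent could perform independent work here, concurrently with the child.

    // pthread_join blocks the parent until the child finishes and retrieves its return value.
    pthread_join(child, &result);

    std::printf("7 squared is %ld\n",
                static_cast<long>(reinterpret_cast<std::intptr_t>(result)));
    return 0;
}

The same minimal create/join pattern is the one modeled as a directed acyclic graph later in this section.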
Parallelism can also be split into regular and irregular parallelism. Programs with regular 5
  • 16. 2. Background parallelism can be split into tasks that have predictable dependencies. Programs with irregular parallelism can only be split into tasks with unpredictable dependencies. Usually, programs with regular parallelism can be modeled by only one single directed acyclic graph, while programs with irregular parallelism could be modeled by several different directed acyclic graphs [4]. The Directed Acyclic Graph Model of Multithreading. To examine the structure of shared- memory parallel programs, it is useful to abstract the programs as directed acyclic graphs (DAGs). A directed acyclic graph has an ordering of all its vertices, called a topological or- dering, in which for all directed edges from u to v, u comes before v in the ordering. This requires the DAG to be acyclic; it is not possible to follow directed edges from any vertex of the DAG so that the same vertex is reached again. A shared memory parallel program can be represented as a DAG in which the vertices are strands, “sequences of serially executed instruc- tions containing no parallel control,” and where graph edges indicate parallel control, such as thread creation or thread joining [5]. Pthread applications can be modeled with the DAG model of multithreading. In this model, strands are vertices, and the ordering of strands is shown by the edges between the strands. A pthread create(...) statement creates two edges. The first edge, the continuation edge, leads to the next strand in the same parent thread. The second edge, the child edge, goes to the first strand in the spawned thread. A pthread join(...) statement creates an edge from the last strand in the spawned thread to the strand following it in the parent thread. Figure 2.1 shows a DAG of a minimal Pthread program where one child thread is created by its parent thread and then rejoins its parent thread. Figure 2.1.: A DAG representing a simple Pthread program. 2.2. Parallel Program Performance In this section the scalability and performance of parallel programs will be discussed, as well as ways that the Parasite profiling tool can be used to improve performance and scalability. The definitions introduced in this section come from [5] and [6]. 6
  • 17. 2.2. Parallel Program Performance 2.2.1. Deciding Optimal Scalability For measuring performance of sequential applications, developers are interested in the execu- tion time of the application, and the proportion of this execution time spent in different func- tions. For multithreaded applications, developers are also interested in how this execution time depends on the number of processing cores [5]. This is an important question, as developers must decide how many cores to use. Additional cores should only be used when their marginal benefit is greater than their marginal cost. The benefit comes from speedup of the application. The cost comes from two factors. The first factor is complexity of parallelization. Developers must ensure that parallel code does not suffer from indeterminism, try to split computational load as evenly as possible, and decide what size the load on each concurrently running pro- cessor should be. This size is called the granularity of the parallelization. The second factor is the added power consumption cost of additional processors or hardware needed for parallel computation, but this is usually not a concern compared to the cost of developing parallel code. One way of determining scalability is to measure it directly. This requires parallel programs that can easily be changed to use a greater or fewer number of threads by changing an input parameter. Unfortunately, these programs are limited to those which have independent tasks within for loops, where an equal fraction of the independent iterations can easily be assigned to each processor. For these programs, runtimes can be measured by using different number of threads, and seeing the corresponding speedup. However, this process does not show the scalability for separate function call sites. In this thesis, a call site is used to refer to each line of a program where a function is called. For programs that contain more complex task-level parallelism, or call sites with varying parallelism, it is not so easy to decide on the optimal number of threads, as threads may have uneven workloads. For these programs, it is useful to have Parasite because it shows individual call sites with high potential for parallelization, without having to measure the scalability directly by profiling program runs with different numbers of threads. 2.2.2. Speedup Bounds Upper Bounds. Work and span are two measurements of the computation in a parallel pro- gram. The work is the time it would take to execute all the strands in the computation sequen- tially. This is the same as the time it takes to execute the computation on one processor, so it is denoted as T1. The span is the time it takes to execute the critical path of the computation. This is the same as the time it takes to execute the computation on an infinite number of processors, so it is denoted as T∞. “P processors can execute at most P instructions in unit time” [5], which creates the first speedup constraint, the work law, where Tp is the parallel execution time: Tp ≥ work/P (2.1) The maximum speedup from parallelization increases linearly with the number of procesors at first, because it is determined by the work law. However, as the number of processors in- creases, they eventually cannot affect the speedup, because at least one of the processors must execute all instructions on the critical path. This upper bound for the speedup possible on any number of processors is called the parallelism. Parallelism “is the ratio of a computation’s work to its span” [6]. 
This is stated in equation 2.2, where Sp is the speedup on P processors: Sp = T1/Tp ≤ T1/T∞ (2.2) A Lower Bound. The work and span can also be used to calculate a lower bound on speedup for an ideal machine. An ideal machine is one in which memory bandwidth does not limit 7
  • 18. 2. Background performance, the scheduler is greedy, and there is no speculative work [3]. Speculative work is when the machine performs work that may not be needed, before it would be needed, in case this is faster than performing the work after it is needed. This lower bound on speedup is called Brent’s Lemma [3]. Tp ≤ (T1 − T∞)/P + T∞ (2.3) This formula is explained by the fact that the program always take at least T∞ time, but the p processors can split up the remaining work, T1 − T∞, evenly. Therefore the sum in Equation 2.3 describes a lower bound on speedup for an ideal machine. An ideal machine does not have the limitations on speedup described in the next section, so if this lower bound on speedup is not met, it indicates that one of these limitations is acting on the program. 2.2.3. Limitations on Speedup The goal of Parasite is to help the programmer identify why programs that are parallelized do not reach their theoretical upper bound on speedup. There are six types of limitations on speedup in a parallel program described in [3] and [6]: • Insufficient parallelism: The program contains serial sections that prevent speedup when using more processors. • Contention: A processor is slowed down by competing accesses to synchronization prim- itives, such as mutexes, or by the true or false sharing of cache lines. • Insufficient memory bandwidth: The processors access memory at a rate higher than the bandwidth of the machine's memory network can sustain. • Strangled scaling occurs when synchronization primitives serialize execution and limit scalability. This problem is often coupled with attempts to solve deadlocks or race condi- tions, as synchronization primitives implemented to deal with these can lead to strangled scaling. • Load imbalance is when some worker threads have significantly more work than others. This increases the span unnecessarily, as the threads with less work must wait idly while the other threads complete. This can be dealt with by overdecomposition: splitting tasks into many more concurrent portions than there are available threads. It is easier to spread many small blocks of serial work evenly over threads, than a few large blocks. • Overhead occurs from the cost of creating threads and destroying threads. This problem is often coupled with load imbalance, as overdecomposition leads to a greater overhead. Therefore, an appropriate granularity, size of concurrent workloads, should be chosen to limit both the overhead and load imbalance. 2.2.4. Using Parasite for Parallel Program Design This section describes how the programmer can use Parasite to diagnose limitations in their program design, and in some cases, guide possible improvements to the program’s design. Figure 2.2 provides a visualization of this process. 8
  • 19. 2.2. Parallel Program Performance Figure 2.2.: Using Parasite to guide parallel program design. First, the programmer immediately sees, from Parasite’s call site profiles, call sites in their program where there is insufficient parallelism. A call site is a specific line in the program where a function is called, and measurements for this call site include all child function calls of the function. Parasite can be used to compare the number of processors employed for each call site to the parallelism of each call site. This helps identify call sites where the parallelism does not greatly exceed the number of processors in use for the call site, and so the speedup may not be linear [6]. Equation 2.4, derived from Equation 2.3, shows this mathematically: Sp = T1/Tp ≈ P if T1/T∞ ≫ P (2.4) Parallel slack is the ratio of the parallelism to the number of processors. With enough parallel slack, the program shows linear speedup. Scheduling overhead occurs when there is not enough parallel slack for each processor to be given a task when it is free. This requires some processors to wait for available work, potentially increasing the span of the program. The amount of parallel slack needed depends on the operating system scheduler. The Intel Cilk Plus and Intel TBB task schedulers work well with high amounts of parallel slack, because they only use as much parallelism as the hardware is able to handle [3]. Pthreads, in contrast, requires the threads to run concurrently, so high parallel slack can decrease the possible speedup if the operating system has fewer threads than those created with Pthreads. In this case, to simulate concurrency, the operating system must “time-slice” between the concurrent threads, adding overhead for context switching and changing the items in the cache [3]. Second, the programmer will be able to use Parasite to investigate the impact of mutex contention on parallelism, using an interactive visualization that allows easy selection of shared 9
  • 20. 2. Background memory locations to lock using mutexes. The tool will then automatically calculate a new up- per bound on speedup with locks on these shared memory locations, without the programmer changing the source code. Finally, the parallelism of a call site can be compared to the speedup measured using a dif- ferent profiling tool. If there is a gap in the speedup, and use of the Parasite tool has ruled out insufficient parallelism, scheduling overhead, and contention as possible causes, the pro- grammer must consider alternative problems with their program design or operating system. Insufficient memory bandwidth, synchronization primitives other than locks such as barriers, and speculative work are remaining possibilities preventing the parallelization from achieving its potential speedup. 2.3. Synchronization Synchronization is coordination of events; synchronization constraints are specific orders of events required in a concurrent program. The most common types of synchronization con- traints are serialization, where one event must happen before another, and mutual exclusion, where one event must not happen at the same time as another [7]. When the two events in ques- tion are on the same thread, these constraints are easy to satisfy. The serialization constraint is met by placing events in the order intended. The mutual exclusion constraint is automatically met, because only one event can happen at the same time on the same thread. When two events that need to have a specific order or need to be mutually exclusive occur on different threads, synchronization constraints are harder to meet. From the programmer’s perspective, the or- der of events on different threads is non-deterministic, as it depends on the operating system scheduling. Problems. There are a number of problems associated with synchronization. Two common examples are race conditions, which occur “when concurrent tasks perform operations on the same memory location without proper synchronization, and one of the memory operations is a write” [3]. These can have no negative effect in some cases, but are nondeterministic, and therefore can fail, so are unacceptable in parallel code. Another example is a deadlock, which “occurs when at least two tasks wait for each other and each cannot resume until the other task proceeds” [3]. It is both an advantage and disadvantage of Parasite that it cannot detect synchronization problems, as it operates using a sequential execution of a Pthread program trace. This sequen- tial execution acts as if the Pthread program was operating using a single thread. The advan- tage is that programs can be tested for parallelism before synchronization problems are dealt with. This saves the programmer time when their only goal for using Parasite is to quickly compare different parallelizations, to assess if a parallelization provides some minimum scal- ability requirement, and to identify regions of high and low scalability. The disadvantage is that Parasite’s parallelism calculations are not necessarily accurate for programs that employ sychronization. A program that deadlocks has no parallelism, as it will never complete. The parallelism of a program where semaphores, conditional waits, or barriers are employed can- not be accurately measured using Parasite, as wait times due to these primitives will increase both the work and span. 
Even Parasite’s mutex wait time correction, described in section 5.3, only provides a rough estimate of the additional work and span that waiting for mutexes re- quires. Semaphores. A general solution to many synchronization problems is called a semaphore. A semaphore is defined in [7] by the following three conditions: • The semaphore can be initialized to any integer value, but after that it can only be incre- mented or decremented by one. 10
  • 21. 2.3. Synchronization • When a thread decrements the semaphore to a negative value, the thread blocks and cannot continue until a different thread increments the semaphore. • When a thread increments the semaphore and there are waiting threads, then one of the waiting threads is unblocked. The application of semaphores to diverse synchronization problems is described in detail in [7]. The lock wait time estimation algorithm used in Parasite only deals with the case of mutexes, which are semaphores initialized to a value of one. Mutexes are often used to protect variables that are shared in memory between different threads, to avoid race conditions. 11
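The following sketch makes this mutual-exclusion pattern concrete for the kind of Pthread program Parasite accepts as input: several threads update one shared variable, and a mutex serializes the updates so that no race condition occurs. The names (worker, shared_counter, counter_mutex) are chosen here only for illustration.

#include <pthread.h>
#include <cstdio>

static long shared_counter = 0;
static pthread_mutex_t counter_mutex = PTHREAD_MUTEX_INITIALIZER;

// Each worker computes a thread-private partial result and then adds it
// to the shared counter. The lock/unlock pair enforces mutual exclusion
// on the single shared write.
static void* worker(void*) {
    long local_result = 0;
    for (int i = 0; i < 1000; ++i)
        local_result += i;                    // private work, no lock needed

    pthread_mutex_lock(&counter_mutex);       // begin critical section
    shared_counter += local_result;           // the only shared access
    pthread_mutex_unlock(&counter_mutex);     // end critical section
    return nullptr;
}

int main() {
    pthread_t threads[4];
    for (int i = 0; i < 4; ++i)
        pthread_create(&threads[i], nullptr, worker, nullptr);
    for (int i = 0; i < 4; ++i)
        pthread_join(threads[i], nullptr);
    std::printf("counter = %ld\n", shared_counter);
    return 0;
}

Keeping the critical section this small also keeps the time spent waiting for the mutex small, which is exactly the quantity that Parasite's mutex wait time estimation in section 5.3 tries to account for.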
  • 23. 3. Related Work In this section five profiling tools similar to Parasite will be described. The first two, Cilkview and Cilkprof, are designed for programs parallelized using the Cilk and Cilk++ multithreading APIs. The third, ParaMeter, analyzes programs with irregular parallelism. The fourth, Intel Advisor, is a commercial tool that can be used for scalability profiling. The fifth, Kismet, profiles potential parallelism in serial programs. 3.1. Cilk Profiling Tools In this section two tools that profile programs using Cilk and Cilk++ will be described. Cilk and Cilk++ are programming languages designed for multithreaded computing, that extend C or C++ code with three constructs, cilk spawn(), cilk sync(), and cilk for(), that support writing task-parallel programs. The first two constructs are similar to pthread create(...) and pthread join(...), respectively, in Pthreads. Cilk is dif- ferent from Pthreads, however, in that it does not allow the developer to explicity choose if threads are created; cilk spawn() only creates a new thread if the Cilk scheduling algo- rithms decide this will help the performance. Therefore, Pthreads is better for shared-memory parallel applications in which complete control over thread creation is necessary. Cilk is better for shared-memory parallel applications that require excellent load balancing, as the backend of Cilk decides how to balance tasks between threads, unlike Pthreads, where the user is re- sponsible for load balancing. Normally a user’s attempt at load balancing will not be as good as Cilk’s backend algorithms for load balancing, as the user will not devote the time to load balancing that was required to develop the Cilk backend algorithms. Cilkview. The Cilkview scalability analyzer is a software tool for profiling multithreaded Cilk++ applications [5]. Like Parceive, Cilkview uses the Pin dynamic instrumentation frame- work to instrument threading API calls. By analyzing the instrumented binary, Cilkview mea- sures work and span during a simulation of a serial execution of parallel Cilk++ code. In this measurement, parallel control constructs such as cilk spawn() or cilk sync() statements are identified by “metadata embedded by the Cilk++ compiler in the binary executable” [5]. Unlike Parasite, Cilkview can analyze scheduling overhead by using the burdened DAG model of multithreading, which extends the DAG model described in section 2.1. In the burdened-DAG model, the work and span of some computations are weighted ac- cording to their grain size, by including a burden on each edge that continues after a thread end event, and each edge that continues on the parent thread after a new thread event. The burdens estimate the cost of migrating tasks, and assume all the tasks that can be migrated are migrated. Task migration is performed by the underlying Cilk scheduler. The main influence of using a burdened DAG instead of a DAG is that it increases the work and span values used in the parallelism calculation, and it decreases the parallelism. The decrease in parallelism is much higher for programs that have fine-grained parallelism, as these programs have more edges where burdens are added. Cilkprof. Cilkprof is a scalability profiler developed for multithreaded Cilk computations [6]. It extends Cilkview to provide work, span and parallelism profiles for individual function call sites as well as the overall program. 
It uses compiler instrumentation to create an instrumented Cilk program, that it then runs serially, to analyze each call site: every location in the code where a function is either called or spawned. The Cilkprof algorithm measures the work and 13
  • 24. 3. Related Work span of each call site, in order to get their ratio: the parallelism. It is not described here as it is used in Parasite and therefore described in detail in section 5.1. Conceptually, Parasite’s algorithm is the same, with four differences in its implementation: 1. Cilkprof’s algorithm is used for Cilk or Cilk++, while the Parasite algorithm is designed for Pthreads. 2. The Parasite algorithm includes an estimation of the effects of mutexes on parallelism. This is useful to programmers, as they may be trying to parallelize code which requires mutexes. Cilkprof and Cilkview do not consider synchronization. 3. The algorithm for this thesis is implemented in an object-oriented style in C++, unlike Cilkprof’s algorithms, which are implemented in C. This has the advantage that the code is more readable, and simpler, as it can use helpful data structures in the C++ standard library, such as unordered maps, in place of the C data structures programmed specifically for Cilkprof. 4. The implementation of Parasite is more generalizable to other threading APIs than Cilk, as it responds to thread and function events instead of Cilk function calls. These events can be more easily mapped to threading constructs in other APIs than Cilk’s threading constructs can. 3.2. Other Tools ParaMeter: Profiling Irregular Parallelism. Kulkarni et al. developed a tool, called ParaMeter, that “produces parallelism profiles for irregular programs” [4]. Irregular programs are organized with trees and graphs and many have amorphous data parallelism. This is a type of parallelism where conflicting computations can be performed in any order, where each chosen order is a DAG that may have its own parallelism. Parasite cannot easily analyze programs with amorphous data parallelism for two reasons. First, Parasite can only analyze one of the possible DAGs that models a program. Second, the structure of graphs representing programs with amorphous data parallelism may depend on the scheduling decisions of the operating system, and Parasite cannot take these scheduling decisions into account. ParaMeter deals with these challenges by making the parallelism profile it generates implementation independent, and using greedy scheduling and incremental execution. Greedy scheduling “means that at each step of execution, ParaMeter will try to execute as many elements as possible.” Incremental execution means each step of computation is “scheduled taking work generated in the previous step into account” [4]. ParaMeter not only measures parallelism, like Parasite and CilkProf, but also parallelism intensity, which is the amount of available parallelism divided by the overall size of the worklist at a given time in the computation [4]. This metric is useful for deciding on work scheduling policies for tasks: random policies perform better with high parallelism intensities because it is less likely that the policies create scheduling conflicts, which are situations where tasks must wait idly for other tasks to complete due to dependencies. The Intel Advisor: A Commercial Tool. The most similar Intel tool to Parasite is the Intel Advisor, which can profile serial programs with annotations that specify parallelism, C and C++ programs parallelized using Intel Thread Building Blocks or OpenMP, C programs parallelized using Microsoft TPL, or Fortran programs parallelized using OpenMP [8].
The Threading Advisor workflow of the Intel Advisor provides similar features to Parasite; both are designed to assist software developers and architects who are in the process of optimizing parallelization. However, the Advisor tool is proprietary, which is a disadvantage compared to Parasite, which is open-source, so Parasite’s algorithms are entirely transparent and open to inspection by developers. 14
  • 25. 3.2. Other Tools Parasite has the ability to quickly compare parallelism of different Pthread parallelizations of the same program, without correct synchronization. Intel Advisor has a similar fast prototyp- ing feature, that allows developers to look at different parallelizations of a program, conveyed to the tool using annotations, to compare them before actually implementing their paralleliza- tion [8]. The Advisor accomplishes this by keeping the code serial when comparing the par- allelizations, so there can be no bugs related to concurrent execution in any of the potential parallelizations. The Intel Advisor provides scalability estimates for the entire program in its suitability anal- ysis, shown in figure 3.1, but unlike Parasite, it only looks at the entire program for parallelism estimates and does not provide individual scalability estimates for functions. Also unlike Para- site, the tool contains features that analyze call sites and loops for their vectorization potential. Like Parasite, it can be used to examine the proportion of work spent in different functions, to help the programmer see where execution time is spent in tasks that can be parallelized [9]. Figure 3.1.: Intel Advisor suitability analysis screenshot. Kismet: Parallel Speedup Estimates for Serial Programs. The Parasite tool, as well as the tools described in the previous sections, all require the input program to already be parallelized in some way. In contrast, Jeon et al. developed a tool, Kismet, that creates parallel speedup estimates for serial programs [10]. Like Parasite, Kismet calculates an upper bound on the pro- gram’s attainable speedup. Unlike Parasite, it takes into account operating system conditions including “number of cores, synchronization overhead, cache effects, and expressible paral- lelism types” [10]. The speedup algorithm uses a parallel execution time model that depends on these operating system conditions as well as the amount of parallelism available. Kismet determines the amount of parallelism available using summarizing hierarchical critical path analysis, which measures the critical path and work, like the Cilkprof work-span algorithm, but uses a different approach to take these measurements. This involves building a hierarchi- cal region structure from source code, consisting of different regions that help separate different levels of parallelism. The advantage of Kismet’s approach over Parasite and the other tools described in this sec- tion is that it does not require additional effort by the programmer. The Intel Advisor requires annotations that show parallel control, while the other tools require a parallelization. However, unlike the other tools, Kismet cannot be used to compare different parallelizations of the same serial program. 15
  • 27. Part II. The Parasite Scalability Profiler 17
  • 29. 4. Parceive The Parasite tool depends on Parceive, which provides information on call sites, functions, and threads that Parasite uses for its work-span algorithm. In this chapter the details of Parceive’s implementation will be described. Parceive operates using the steps shown in Figure 4.1. Parceive takes an executable as input that must meet the requirements described in section 4.1. Then, it performs static analysis of the machine code and dynamically instruments predefined instructions such as function calls and returns, threading API calls, or memory accesses. The instrumentation inserts callbacks that are used to write trace data into a database at runtime. This will be described in section 4.2. Based on this data, trace analysis generates a visualization which the user can use to see the overall structure of the program. Trace analysis also generates events that Parasite uses to calculate scalability of function call sites. This will be described in section 4.3. Runtime Analysis Binary instrumentation Event inspection Input Binary application Debug symbols Static Analysis Data-flow Control-flow Trace Data Visualization Framework Views Scalability Profiling Parasite Figure 4.1.: Steps of the Parceive tool. 4.1. Acceptable Inputs Currently, Parasite can only successfully analyze programs that satisfy the following condi- tions: • The program is written in C or C++. • The program is parallelized or annotated using Pthread API calls. • The Pthread API calls only include pthread create(...), pthread join(...), pthread mutex lock(...), and pthread mutex unlock(...) • The program’s behavior does not depend on collaborative synchronization. The last condition means that Parasite cannot correctly analyze a program where the ex- ecution behavior of one thread depends on the execution behavior of any other thread. This situation can occur when mutexes are used to control the ordering of threads, because the order of threads in which the mutex is acquired and released may differ between sequential and con- current execution. For example, Parasite will deadlock if a mutex is acquired in a parent thread, which then generates a child thread that needs to acquire the mutex. In a concurrent execution, the parent thread would continue after spawning child threads, and unlock the mutex, so that the child thread could acquire the mutex. In the sequential simulation of execution by the Par- ceive interpreter, the parent thread will not continue until the child thread has completed, but the child thread will not complete, because it is waiting on the parent thread’s acquired mutex. 19
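The collaborative-synchronization restriction can be illustrated with a small sketch. The program below is valid under concurrent execution, but its trace cannot be replayed sequentially in the way the Parceive interpreter does. The code and names are illustrative only and are not taken from the thesis test programs.

#include <pthread.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

// The child needs a mutex that the parent holds at the moment of creation.
static void* child(void*) {
    pthread_mutex_lock(&m);
    pthread_mutex_unlock(&m);
    return nullptr;
}

int main() {
    pthread_t t;

    pthread_mutex_lock(&m);                      // parent acquires the mutex first
    pthread_create(&t, nullptr, child, nullptr); // child is spawned while the mutex is held

    // Concurrent execution: the parent continues past the create, releases the
    // mutex, and the child can then acquire it and finish.
    pthread_mutex_unlock(&m);

    pthread_join(t, nullptr);
    return 0;
}

In the sequential simulation of this trace, the child thread is executed to completion before the parent continues, but the child can never acquire the mutex that the parent still holds, so the analysis deadlocks. Programs whose thread ordering is controlled through mutexes in this way are therefore outside Parasite's accepted inputs.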
  • 30. 4. Parceive Even without collaborative synchronization, a successful run of Parasite that includes mu- texes may not produce accurate estimates of parallelism, because Parasite may not correctly calculate the addition to the work and span associated with the mutexes. The lock wait time algorithm described in section 5.3 attempts to estimate these additions, but does not take into account overhead associated with acquiring and releasing the mutexes. Other synchronization primitives such as conditional waits and barriers are not handled by the Parasite algorithm, so Pthread programs that use these primitives should not be used as inputs to Parasite. 4.2. Trace Generation “Parceive analyzes programs by utilizing dynamic binary instrumentation at the level of ma- chine code during runtime” [2]. It employs the Pin framework because “it is efficient and supports high-level, easy-to-use instrumentation” [2, 11]. The pintool “injects analysis calls to write trace data into an SQLite database” [2]. The following instrumentation is used for data-gathering: • Call stack: function entries and exits are tracked to maintain a shadow call stack. For each call, the call instructions, threads, and spent execution time are captured. Additionally, function signatures, file descriptions, and loops are extracted from debug information. • Memory accesses: analysis calls are injected to capture information about each memory access (e.g., memory type, memory address, access instruction). For stack variables, de- bug information is utilized to resolve variable names. • Memory management: to handle heap memory, memory allocation and release function calls are instrumented. The tracked locations are used during analysis to match data accesses using pointers. • Threading: Parceive tracks calls of threading APIs, like Pthread, to capture thread opera- tions and synchronization. 4.3. Trace Analysis Some information contained in the SQLite database is context-free and can be found by sim- ple queries to the database. This includes data dependencies between functions, which are detected by comparing the memory accesses of each function, that are in turn found by ab- stracting instances of function calls. Other information depends on the control and data flow. To extract this information, the trace stored in the database is read sequentially in chronologi- cal order. This information includes runtime of a function call and its nested functions, counts of specific function calls, and counts of specific memory accesses. An API allows the user to acquire information from events of interest generated from reading the database. The Parasite tool interfaces directly with the following events to acquire information it needs for its work- span algorithm: 1. A function calls another function. 2. A function returns to its parent function. 3. A thread creates a child thread. 4. A child thread’s execution ends. 5. A thread join: an implicit barrier where a thread must join its parent thread. 20
  • 31. 4.3. Trace Analysis 6. A function acquires a lock. 7. A function releases a lock. The actions that occur in Parasite’s algorithm with each event are described in section 5.1. Shadow locks and threads associated with the events three to seven allow the event informa- tion to be independent from whatever programming language is used in the programs. This allows future extensions of Parasite to analyze not only Pthread programs, but also programs parallelized using other threading APIs such as OpenMP. 21
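The interpreter API itself is not reproduced in this document, but conceptually Parasite can be thought of as registering one callback per event type listed above. The sketch below shows one possible shape of such an observer interface in C++; all type and method names here are invented for illustration and do not correspond to Parceive's real API.

#include <cstdint>

// Hypothetical identifiers; the real Parceive types will differ.
using ThreadId = std::uint64_t;
using CallSiteId = std::uint64_t;
using LockId = std::uint64_t;
using Time = std::uint64_t;

// One virtual method per interpreter event. A tool such as Parasite would
// implement this interface and update its work and span state inside each
// callback as the trace is replayed.
class EventObserver {
public:
    virtual ~EventObserver() = default;
    virtual void onCall(CallSiteId site, ThreadId thread, Time t) = 0;      // 1. function call
    virtual void onReturn(CallSiteId site, ThreadId thread, Time t) = 0;    // 2. function return
    virtual void onNewThread(ThreadId parent, ThreadId child, Time t) = 0;  // 3. thread creation
    virtual void onThreadEnd(ThreadId thread, Time t) = 0;                  // 4. thread end
    virtual void onJoin(ThreadId parent, ThreadId child, Time t) = 0;       // 5. thread join
    virtual void onAcquire(LockId lock, ThreadId thread, Time t) = 0;       // 6. lock acquired
    virtual void onRelease(LockId lock, ThreadId thread, Time t) = 0;       // 7. lock released
};

Because the callbacks are expressed in terms of generic thread, call site, and lock identifiers rather than Pthread-specific constructs, the same interface could in principle be fed from other threading APIs, which is the extension path mentioned above.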
  • 33. 5. Algorithm In this chapter the elements of the Parasite algorithm will be discussed. Section 5.1 describes the algorithm that Parasite employs to measure the work and span of call sites. Section 5.2 describes the data structures used in this algorithm. Section 5.3 describes the algorithm that adjusts the work and span measurements to take effects of mutexes into account. Finally, sec- tion 5.4 describes the directed acyclic graphs that Parasite constructs to verify its algorithm. 5.1. The Parasite Work-Span Algorithm Conceptually, the Parasite algorithm is the same as the Cilkprof algorithm in [6], but imple- mented to respond to the Parceive interpeter’s events, described in section 4.3, instead of Cilk constructs. This requires an explanation of how Cilk constructs translate to these events, in terms of the algorithm (the actions of the operating system are not equivalent). In the Cilkprof work-span algorithm, a cilk spawn() is equivalent a new thread event. A cilk sync() is equivalent to a join event where, after the join event, the thread has no current child threads that have not already joined their parent thread. Figure 5.1 shows, with pseudocode, the actions of the Cilkprof algorithm as it responds to the Cilk constructs. Figure 5.1.: The Cilkprof Work-Span Algorithm. (w = work, p = prefix, l = longest-child, c = continuation) The figure uses the following variable names, which are defined in [6], but here, the defini- tions are written instead in terms of Parceive interpreter events: 1. F is a thread; “Called G” in the figure is a function called from F; otherwise G in the figure is a child thread of F. 2. The time u is initially set to the beginning of F. As the execution proceeds, it is set to the time of the new thread event that created the child thread of F, which realizes the longest span of any child encountered so far since the last join event. 23
  • 34. 5. Algorithm 3. The work F.w is the serial runtime of call site F - its total computation. 4. The continuation F.c stores the span of the trace from the continuation of u through the most recently executed instruction in F. 5. The longest-child F.l stores the span of the trace from the start of F through the thread end event of the child thread that F creates at u. 6. The prefix F.p stores the span of the trace starting from the first instruction of F and ending with u. The path through the DAG representing the program trace that has the length F.p is guaranteed to be on the critical path of F. Figure 5.2 illustrates the Cilkprof algorithm as it progresses through the execution of a pro- gram trace. If the algorithm is still unclear, the reader is encouraged to read section 3 of [6], or view the documentation and source code of the Parasite tool at [12]. Figure 5.2.: Updates to the span variables as Parasite is executed on a program trace. Each ar- row represents a different thread. An arrow starting from another arrow is a new thread event; an arrow intersecting with another is a join event. Colors indicate different span variables. Before the first join, the continuation and longest child are compared. The longest child is longer, so the prefix is updated to be the longest child. Before the second join, the sum of the prefix and the continuation is com- pared to the new longest child. The longest child is longer, so the prefix is updated to become this longest child. After the second join, the prefix is now what was the longest child before the join. After the end of the main thread, the remaining con- tinuation of the main thread has been added to the prefix. Now the prefix is equal to the entire span of the program. Complexity Analysis. The Parasite algorithm has time complexity O(Ne) , where Ne is the number of Events that the Parceive interpreter sends the Parasite tool. The number of these events depends entirely on the Pthreads program that the algorithm analyzes. Inputs with large numbers of threads, function calls, or mutex locks and unlocks will take much longer for Parasite to analyze than inputs with few function calls or threads created. The space complexity of the algorithm is also highly input dependent. There is a work hashtable that includes an entry for every call site in the program being profiled. In addi- tion to this, each thread has three span hashtables that each have an entry for every call site 24
  • 35. 5.2. Work-Span Data Structures that is called on the thread. If there are Nt threads, and each thread calls a fixed fraction f of all of the Ncs call sites, then the complexity would be O(3 ∗ f ∗ Ncs ∗ Nt + Ncs) = O(Ncs ∗ Nt) . 5.2. Work-Span Data Structures In this section, the stack and hashtable structures used by the algorithm described in section 5.1 are described. Work and Span Hashtables. Unique call site IDs are generated for each line of a program where a function is called. Parasite uses hash tables that map these IDs to information about the work or span of their respective call sites. For every call site, the work hashtable contains the number of invocations, the total work (measured in time), and the function signature. A span hashtable contains the longest-child, continuation, or prefix span of each call site on a thread. It also contains an estimate of the time the thread spends waiting to acquire mutexes. Function and Thread Stacks. As the Parceive interpreter simulates the execution of a pro- gram, Parasite updates two stacks, a thread stack and a function stack. These stack data struc- tures support the traditional stack push and pop operations. For each function call, the function stack contains a frame with the function signature, the call site ID, and an object that tracks the lock time intervals. It also contains two integers: the first indicates whether the function call is the top invocation of its call site on the function stack, and the second indicates whether the function call is the top invocation of its call site on the current thread. These integers are needed to avoid the double-counting of work and span in call sites that are called recursively. For each thread, the thread stack contains a frame with the following information: 1. The unique ID of the thread. 2. A list of interval data structures that stores the times in which mutexes in the thread and the thread’s children are acquired. 3. Prefix, longest-child, and continuation spans of the thread. 4. A counter that represents the number of child threads spawned from the thread that are currently on the thread stack. 5. A set that contains call sites which were pushed to the call stack while this thread was the bottom thread. This set is used to set the integer on each function frame that indicates whether it is the top invocation of the function’s call site on that thread. Additionally, a set is used to track all the function call sites currently on the function stack, in order to correctly set the integer on each function frame that indicates whether it is the top invocation of the function’s call site in the program. 5.3. Estimation of Mutex Effects The effects of mutexes on runtime are non-deterministic, and can only be measured accurately by running the program under test with concurrent execution. However, the goal of Parasite is to estimate scalability using its mathematical work-span algorithm, instead of using direct mea- surement. Therefore, a simple heuristic is used to estimate the impacts of mutex contention on the span of call sites. This heuristic corrects the span and work of each thread if the time that a mutex in the thread or its child threads is acquired is greater than the span of the thread without considering mutexes. The approach is outlined in the following pseudocode, which calculates an addition to the span and work, called mutex wait time in the source code. 
The correction is only applied when the Parasite tool processes a sync - a join event after which the parent thread 25
  • 36. 5. Algorithm has no current children. In the pseudocode, a mutex interval is a data structure storing the start, span, and mutex ID (a unique ID for each mutex generated by the Parceive interpreter) that describe a time interval in which a mutex is acquired. The child thread mutex list is the list of mutexes that any of the child threads in the parent thread have acquired since the last sync event. This approach does not take into account any overhead associated with acquiring or releasing mutexes.

mutex_total_span_list = []
for mutex in child_thread_mutex_list:
    for mutex_interval in mutex.mutex_interval_list:
        mutex.total_span += mutex_interval.span
    mutex_total_span_list.append(mutex.total_span)
maximum_mutex_span = max(mutex_total_span_list)
correction = max(0, maximum_mutex_span - longest_child_span)
longest_child_span = longest_child_span + correction
parent_thread_work += correction

5.4. Graph Validation Parasite constructs a directed acyclic graph while it profiles a program. In order to confirm that the dynamic algorithm described in section 5.1 produces the correct result, this graph is used to calculate the span of the program being profiled. Figure 5.3 shows such a DAG, for a parallel program with a master-worker pattern and four worker threads. 26
  • 37. 5.4. Graph Validation Figure 5.3.: Directed acyclic graph of vector-vector multiplication program. In this figure, the numbers on edges represent the time spent between events. TS = new thread event, TE = thread end event, and R = return event. The numbers in the new thread and thread event labels are the IDs of the threads. The numbers in the return event labels are the call site IDs of the returning functions. 27
  • 38. 5. Algorithm For thread start and thread end events, one vertex and one edge are added to the graph. The length of the edge is the time elapsed since the last event, and the vertex represents the event just generated. For a join event, two edges are created: the first edge connects the most recent event on the parent thread to the join event. The second edge connects the thread end event of the child thread to the join event. Therefore, a thread join event always has an inward degree of 2, as it joins two threads. A new thread event always has an outward degree of 2, as it creates a new thread starting from the new thread vertex, and the parent thread continues. After Parasite has completed its analysis of the program, it calculates the span of the DAG it has constructed. The graph is stored using data structures of the BOOST graph library [13], and saved as a DOT file when Parasite completes. Then, this DOT file is loaded into a Python script, which calculates the longest path of the graph. This graph does not include estimates of mutex wait times, so the longest path of the graph should be equal to the difference between the span of the main function, when including mutex wait times, and the mutex wait time of the main function, both calculated by Parasite’s work-span algorithm. This check was useful in the initial development of Parasite to confirm that the algorithm was implemented correctly. Longest Path Algorithm. Since a shared-memory parallel program can be represented as a DAG, one way of finding the span of the program is to employ a longest-path algorithm on the DAG. To check the correctness of its work-span algorithm, Parasite uses the code in listing 5.1 to calculate the longest path in the graph it constructs to represent its input program. This algorithm is taken directly from the Python networkx library [14]. Listing 5.1: Longest path algorithm for a DAG [14].

import networkx as nx

def longest_path(G):
    dist = {}  # stores [node, distance] pair
    for node in nx.topological_sort(G):
        # pairs of dist, node for all incoming edges
        pairs = [(dist[v][0] + 1, v) for v in G.pred[node]]
        if pairs:
            dist[node] = max(pairs)
        else:
            dist[node] = (0, node)
    node, (length, _) = max(dist.items(), key=lambda x: x[1])
    path = []
    while length > 0:
        path.append(node)
        length, node = dist[node]
    return list(reversed(path))

Complexity Analysis. It is interesting to compare the complexity of the algorithm in listing 5.1 with Parasite’s algorithm. This algorithm first uses a topological sort, which orders the vertices of the graph so that for every edge from m to n, m comes before n in the ordering. This is possible for any directed acyclic graph. The complexity of the topological sort is Θ(V + E), where V is the number of vertices, and E is the number of edges [15]. After sorting, the algorithm looks at, for each node, the edges from predecessors of this node to the node itself. Therefore, its complexity is O(V + E). In the DAG generated by Parasite, a vertex is a thread start, thread end, or thread join event. Every thread except the main thread has three of these events. Therefore, the number of vertices is O(Nt), where Nt is the number of threads spawned during the program execution. The number of edges is also O(Nt), so the complexity of the algorithm is O(Nt). If the same algorithm was applied to each call site in Parasite, and each call site had the same number of threads, then the complexity would be 28
  • 39. 5.4. Graph Validation O(Nt²). The Parasite algorithm has time complexity O(Ne), where Ne is the number of all events. The number of events could be linearly proportional to the number of threads, or have a different relation; hence, the Parasite algorithm could have a complexity similar to that of the longest path algorithm, or a much greater one. The relative complexity depends completely on the program being profiled. The Parasite algorithm, however, provides more information than the longest path of the entire program. It gives information about the parallelism for each call site, as well as estimating the effect of mutexes on the parallelism. 29
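As a complement to Listing 5.1, the validation check of section 5.4 can also be sketched directly as a weighted longest-path computation, where each edge carries the elapsed time recorded between two events and vertices are processed in topological order. The sketch below is illustrative only: it assumes a plain adjacency-list representation with a precomputed topological order, rather than the BOOST/DOT pipeline actually used by Parasite.

#include <vector>
#include <algorithm>

// An edge of the event DAG: target vertex index and the time recorded on the edge.
struct Edge {
    int to;
    double length;
};

// Weighted longest path (the span) of a DAG. 'graph[v]' lists the outgoing
// edges of vertex v, and 'topo_order' is any topological ordering of the
// vertices. Processing vertices in that order guarantees that dist[v] is
// final before the outgoing edges of v are relaxed.
double dag_span(const std::vector<std::vector<Edge>>& graph,
                const std::vector<int>& topo_order) {
    std::vector<double> dist(graph.size(), 0.0);
    double span = 0.0;
    for (int v : topo_order) {
        for (const Edge& e : graph[v]) {
            dist[e.to] = std::max(dist[e.to], dist[v] + e.length);
            span = std::max(span, dist[e.to]);
        }
    }
    return span;
}

Run over a DAG such as the one in Figure 5.3, with edge lengths equal to the recorded times, the returned value should match the span reported by the work-span algorithm once the mutex wait time has been subtracted, which is the consistency check described above.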
  • 41. Part III. Results and Conclusion 31
  • 43. 6. Results In this section Parasite is applied to a diverse set of programs. For each program, the method of parallelization is discussed, and Parasite is used to estimate the resulting scalability. First, a simple program that calculates the Nth Fibonacci number is used to verify the correctness of Parasite’s algorithm. Second, for vector-vector multiplication and four parallel sorting algorithms, Parasite is applied to the programs multiple times to show the dependence of the parallelism on input parameters. Third, using a simulation of the European football championships, Parasite is shown to be able to determine the scalability of an application with irregular parallelism. Fourth, for a molecular dynamics simulation, the parallelism of call sites in different parallelizations of the same program is compared. Finally, CPPCheck is used to show that Parasite can be used to quickly test a potential parallelization of a sequential program. For some of the programs, the theoretical parallelism of certain call sites can be calculated, which is compared to the actual parallelism that Parasite calculates for these call sites. 6.1. Fibonacci Sequence The code below is an abbreviated parallelization of calculating the Nth Fibonacci number [16]:

#define N 20

void* fibonacci_thread(void* arg) {
    size_t n = (size_t) arg, fib;
    pthread_t thread_1, thread_2;
    void* pvalue;
    if ((n == 0) or (n == 1)) return (void*) 1;
    pthread_create(&thread_1, 0, fibonacci_thread, (void*)(n - 1));
    pthread_create(&thread_2, 0, fibonacci_thread, (void*)(n - 2));
    pthread_join(thread_1, &pvalue);
    fib = (size_t) pvalue;
    pthread_join(thread_2, &pvalue);
    fib += (size_t) pvalue;
    return (void*) fib;
}

size_t fibonacci(size_t n) {
    return (size_t) fibonacci_thread((void*) n);
}

int main() {
    fibonacci(N);
}

33
  • 44. 6. Results Table 6.1 shows the parallelism of the different function calls in this program, for finding the 20th Fibonacci number.

Call site          Parallelism   Percentage of Work   Count
fibonacci_thread   642.401       99.4487              6764
fibonacci_thread   241.458       99.856               1
fibonacci          221.677       99.9273              1
main               205.314       100                  1

Table 6.1.: Call site parallelism for calculating the 20th Fibonacci number. The parallelism of calculating the Nth Fibonacci number in the method of the code used has been calculated to be: Parallelism(n) = Θ(φⁿ/n) (6.1) where φ is the golden ratio [17]. Figure 6.1 confirms that Parasite’s measurements for the parallelism of the fibonacci function in this code follow the theoretical prediction. This is a useful validation that Parasite’s algorithm is implemented correctly. (Plot: φⁿ/n versus parallelism of the Fibonacci function.) Figure 6.1.: Dependence of parallelism of Fibonacci function on the number N it calculates. 6.2. Vector-Vector Multiplication The first test of Parasite was the following parallel vector-vector multiplication program [18]: /* INPUT VARIABLES */ #define NUM_THREADS 5 #define VECTOR_SIZE 1000000000 pthread_mutex_t mutex_sum = PTHREAD_MUTEX_INITIALIZER; int *VecA, *VecB, sum = 0, dist; /* Thread callback function */ void * doMyWork(int myId) { 34
  • 45. 6.2. Vector-Vector Multiplication int counter, mySum = 0; /*calculating local sum by each thread */ for (counter = ((myId - 1) * dist); counter <= ((myId * dist) - 1); counter++) mySum += VecA[counter] * VecB[counter]; /*updating global sum using mutex lock */ pthread_mutex_lock(&mutex_sum); sum += mySum; pthread_mutex_unlock(&mutex_sum); return; } /*Main function start */ int main(int argc, char *argv[]) { /*variable declaration */ int ret_count; pthread_t * threads; pthread_attr_t pta; double time_start, time_end, diff; struct timeval tv; struct timezone tz; int counter, NumThreads, VecSize; NumThreads = NUM_THREADS; VecSize = VECTOR_SIZE; /*Memory allocation for vectors */ VecA = (int *) malloc(sizeof(int) * VecSize); VecB = (int *) malloc(sizeof(int) * VecSize); pthread_attr_init(&pta); threads = (pthread_t *) malloc(sizeof(pthread_t) * NumThreads); dist = VecSize / NumThreads; /*Vector A and Vector B intialization */ for (counter = 0; counter < VecSize; counter++) { VecA[counter] = 2; VecB[counter] = 3; } /*Thread Creation */ for (counter = 0; counter < NumThreads; counter++) { pthread_create(&threads[counter], &pta, (void *(*) (void *)) doMyWork, (void *) (counter + 1)); } /*joining threads */ for (counter = 0; counter < NumThreads; counter++) { pthread_join(threads[counter], NULL); } printf("n The Sum is: %d.", sum); pthread_attr_destroy(&pta); return; } 35
  • 46. 6. Results This is the simplest style of Pthread program, with two functions: a main function and a worker function. The main function creates several threads which perform the worker function. In this case, the worker function doMyWork takes sections of the vectors, multiplies these sections, and adds the result to a global sum variable, which is protected by the mutex mutex_sum to avoid race conditions. Figure 6.2 shows the dependence of the parallelism on the vector size, using ten worker threads. The result approaches a limit of about 9.9. If the threads had equal work, the parallelism would be equal to 10. The parallelism approaches 9.9 instead because the work is not perfectly balanced between the threads, even if they multiply vectors of identical size, as factors such as memory access time and the time to lock and unlock the mutex can vary between the threads. (Plot: vector size, in units of 10⁷, versus parallelism of the worker function.) Figure 6.2.: Dependence of parallelism of worker function on vector size for parallel vector-vector multiplication, with a fixed number of 10 worker threads. Figure 6.3 shows the dependence of the parallelism on the number of threads, using a fixed input size of 10⁹. As would be expected, the parallelism increases linearly with the number of threads. (Plot: number of threads versus parallelism of the worker function.) Figure 6.3.: Dependence of parallelism of worker function on number of threads for parallel vector-vector multiplication, using a fixed input size of 10⁹. 36
6.3. Sorting Algorithms

Most computer scientists are familiar with the sorting algorithms bubble sort, quicksort, radix sort, and merge sort. In this section Parasite is applied to programs that implement parallel versions of each of these sorting algorithms, to illustrate Parasite's ability to show the overall and local parallelism in a program. Specifically, for these sorting algorithms, Parasite shows quantitatively how the parallelism depends on input size, recursion depth, and granularity.

6.3.1. Bubble Sort

The following abbreviated code is a simple parallelization of the bubble sort algorithm, which passes through an array of values and swaps neighboring values if they are out of order [19]. In the sequential version of bubble sort, passes over the array are repeated until no further swaps are needed and all values are sorted. In the parallel version, the swaps at all even indices are performed in parallel, then the swaps at all odd indices are performed in parallel, and this process is repeated until all elements are sorted.

#define DIM 200

int a[DIM], swapped = 0;
pthread_t thread[DIM];

void bubble(int i)
{
    int tmp;
    if (i != DIM-1) {
        if (a[i] > a[i+1]) {
            tmp = a[i];
            a[i] = a[i+1];
            a[i+1] = tmp;
            swapped = 1;
        }
    }
}

int main()
{
    int i;
    fill_a_with_random_integers();
    do {
        swapped = 0;
        for (i = 0; i < DIM; i += 2)
            pthread_create(&thread[i], NULL, &bubble, i);
        for (i = 0; i < DIM; i += 2)
            pthread_join(thread[i], NULL);
        swapped = 0;
        for (i = 1; i < DIM; i += 2)
            pthread_create(&thread[i], NULL, &bubble, i);
        for (i = 1; i < DIM; i += 2)
            pthread_join(thread[i], NULL);
    } while (swapped == 1);
}
Figure 6.4 shows that the parallelism of this bubble sort implementation increases quickly with input size for small inputs. This is expected, as the number of threads spawned to perform the swaps increases quadratically with the input size: there are O(N^2) swaps, and hence O(N^2) threads created, where N is the input size. However, after the input size reaches about 200, the parallelism hits a limit of about 2.69. This could be because some swaps take more time than others. Higher input sizes were not tested because they make Parasite's runtime very high; the complexity of Parasite depends on the number of function and thread events, which for this code is also O(N^2).

[Figure 6.4 plots the parallelism of the bubble sort against input size.]
Figure 6.4.: Dependence of parallelism on input size for parallel bubble sort.

Table 6.2 shows that the two call sites of bubble have approximately equal work. Generating these tables for progressively larger input sizes shows that the parallelism and the work percentage of the two bubble call sites approach each other as the input size increases, until they are eventually equal. This is expected, because both call sites are called the same number of times, plus or minus one, and they sort random integers.

              Parallelism   Percentage of Work   Count
bubble        98.5142       8.38823              9100
bubble        83.046        8.41496              9100
main          2.69487       100                  1
v initiate    1             0.99462              1

Table 6.2.: Call site parallelism for parallel bubble sort on 200 integers.
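The call counts in Table 6.2 are consistent with the structure of the code above (a rough back-of-the-envelope check): each even or odd pass creates DIM/2 = 100 threads, i.e. 100 calls per bubble call site per iteration of the do-while loop, so

\[
\frac{9100 \ \text{calls per call site}}{100 \ \text{calls per iteration}} = 91 \ \text{outer iterations}, \qquad
2 \times 9100 = 18\,200 \ \text{thread create/join pairs in total,}
\]

which is why Parasite's runtime on this program grows roughly quadratically with the input size.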
6.3.2. Quicksort

In the sequential version of quicksort, a partition function arranges the elements around a chosen pivot element so that all elements less than the pivot end up on one side and all elements greater than the pivot on the other. The process is repeated recursively on both sides until each part is sorted, at which point the whole array is in order around the pivots. The following code is a simple parallelization of the quicksort algorithm (for brevity, the partition function is omitted):

#define RECURSIVE_DEPTH 16
#define INPUT_SIZE 100000

/*
 * Structure containing the arguments to the parallel_quicksort function. Used
 * when starting it in a new thread, because pthread_create() can only pass one
 * (pointer) argument.
 */
struct qsort_starter {
    int *array;
    int left;
    int right;
    int depth;
};

void parallel_quicksort(int *array, int left, int right, int depth);

/*
 * Thread trampoline that extracts the arguments from a qsort_starter structure
 * and calls parallel_quicksort.
 */
void* quicksort_thread(void *init)
{
    struct qsort_starter *start = init;
    parallel_quicksort(start->array, start->left, start->right, start->depth);
    return NULL;
}

/*
 * Parallel version of the quicksort function. Takes an extra parameter:
 * depth. This indicates the number of recursive calls that should be run in
 * parallel. The total number of threads will be 2^depth. If this is 0, this
 * function is equivalent to the serial quicksort.
 */
void parallel_quicksort(int *array, int left, int right, int depth)
{
    if (right > left) {
        int pivotIndex = left + (right - left)/2;
        pivotIndex = partition(array, left, right, pivotIndex);
        // Either do the parallel or serial quicksort, depending on the depth
        // specified.
        if (depth-- > 0) {
            // Create the thread for the first recursive call
            struct qsort_starter arg = {array, left, pivotIndex-1, depth};
            pthread_t thread;
            int ret = pthread_create(&thread, NULL, quicksort_thread, &arg);
            assert((ret == 0) && "Thread creation failed");
            // Perform the second recursive call in this thread
            parallel_quicksort(array, pivotIndex+1, right, depth);
            // Wait for the first call to finish.
            pthread_join(thread, NULL);
        }
        else {
            quicksort(array, left, pivotIndex-1);
            quicksort(array, pivotIndex+1, right);
        }
    }
}

int main(int argc, char **argv)
{
    int depth = RECURSIVE_DEPTH;
    // Size of the array to sort. Optionally specified as the second argument
    // to the program.
    int size = INPUT_SIZE;
    // Allocate and initialise the array of values to sort.
    int *values = calloc(size, sizeof(int));
    assert(values && "Allocation failed");
    int i = 0;
    for (i = 0; i < size; i++) {
        values[i] = i * (size - 1);
    }
    // Sort the array
    parallel_quicksort(values, 0, size-1, depth);
    return 0;
}

Here, the recursive calls to quicksort on the subarrays smaller and larger than the pivot are made in parallel, up to a developer-specified depth [20]. Beyond this depth a sequential quicksort is applied to each subarray. The parallelism of this parallel quicksort can be derived in the ideal case, where the partition step splits the array evenly around the pivot at every level. The work and span then satisfy the following recurrence relations, presented in [3]:

    T_1(N) = N + 2 T_1(N/2)                                            (6.2)
    T_∞(N) = N + T_∞(N/2)                                              (6.3)

The solutions of these recurrences, derived in [3], are:

    T_1(N) = Θ(N lg N)                                                 (6.4)
    T_∞(N) = Θ(N)                                                      (6.5)

Therefore the theoretical parallelism is:

    T_1(N)/T_∞(N) = Θ(N lg N)/Θ(N) = Θ(lg N)                           (6.6)
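A quick way to see (6.5), and hence (6.6), is to unroll the span recurrence (a standard argument, sketched here for completeness):

\[
T_\infty(N) = N + \frac{N}{2} + \frac{N}{4} + \cdots \;\le\; 2N = \Theta(N),
\qquad
\frac{T_1(N)}{T_\infty(N)} = \frac{\Theta(N \lg N)}{\Theta(N)} = \Theta(\lg N).
\]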
                        Parallelism   Percentage of Work   Count
quicksort               32.8953       39.6884              32
quicksort               32.3863       73.4517              5841
partition               32.3009       12.7183              5841
quicksort               32.2813       73.6963              5841
quicksort               32.2134       39.3518              32
parallel quicksort      31.2151       78.496               31
parallel quicksort      25.8009       81.9838              31
quicksort thread        25.3564       82.0705              31
parallel quicksort      16.7058       86.7864              1
main                    5.48624       100                  1
partition               1.30958       0.761345             63

Table 6.3.: Call site parallelism for parallel quicksort with recursion depth 5 on 10,000 integers.

Table 6.3 shows the parallelism and work share of the call sites in the parallel quicksort for recursion depth 5 on an input of 10,000 integers. The four sequential quicksort call sites all have a parallelism of about 32, as does the partition call site within the quicksort function; these call sites are only reached once the recursion arrives at depth 5. The parallel quicksort algorithm can be viewed as a full binary tree in which every non-leaf node is a call to parallel_quicksort and every leaf is a call to the sequential quicksort function. A full binary tree of recursion depth D has 2^D leaves, so here, with depth 5, a parallelism of 32 in the leaf function is expected. Figure 6.5 confirms that the average parallelism of the four quicksort call sites is approximately equal to 2^D, where D is the recursion depth of the quicksort program.

[Figure 6.5 plots log2 of the quicksort parallelism against the recursion depth.]
Figure 6.5.: Dependence of parallelism of sequential quicksort calls on recursion depth.

Next, the parallelism of the top call to parallel_quicksort observed using Parasite was compared to the theoretical parallelism of parallel quicksort in Figure 6.6. To remain as close as possible to the ideal case, recursion depths of floor(log2(N)) were used. Interestingly, this parallel_quicksort function shows a roughly linear dependence of parallelism on input size, instead of the logarithmic dependence that theory predicts. This is likely because the algorithm used here applies the sequential quicksort after the recursion depth is reached.
[Figure 6.6 plots the parallelism of the parallel_quicksort function against input size (in units of 10^3).]
Figure 6.6.: Dependence of parallelism of parallel quicksort on input size.

6.3.3. Radix Sort

The following code is a parallelization of the radix sort algorithm [21]:

#define NTHREADS 5
#define INPUT_SIZE 1000000
/* Bits of value to sort on. */
#define BITS 29

/* Thread arguments for radix sort. */
struct rs_args {
    int id;            /* thread index. */
    unsigned *val;     /* array. */
    unsigned *tmp;     /* temporary array. */
    int n;             /* size of array. */
    int *nzeros;       /* array of zero counters. */
    int *nones;        /* array of one counters. */
    int t;             /* number of threads. */
};

/* Global variables and utilities. */
struct rs_args *args;

/* Individual thread part of radix sort. */
void radix_sort_thread (unsigned *val,            /* Array of values. */
                        unsigned *tmp,            /* Temp array. */
                        int start, int n,         /* Portion of array. */
                        int *nzeros, int *nones,  /* Counters. */
                        int thread_index,         /* My thread index. */
                        int t)                    /* Number of threads. */
{
    unsigned *src, *dest;
    int bit_pos;
    int index0, index1;
    int i;

    /* Initialize source and destination. */
    src = val;
    dest = tmp;

    /* For each bit... */
    for ( bit_pos = 0; bit_pos < BITS; bit_pos++ ) {

        /* Count elements with 0 in bit_pos. */
        nzeros[thread_index] = 0;
        for ( i = start; i < start + n; i++ ) {
            if ( ((src[i] >> bit_pos) & 1) == 0 ) {
                nzeros[thread_index]++;
            }
        }
        nones[thread_index] = n - nzeros[thread_index];

        /* Get starting indices. */
        index0 = 0;
        index1 = 0;
        for ( i = 0; i < thread_index; i++ ) {
            index0 += nzeros[i];
            index1 += nones[i];
        }
        index1 += index0;
        for ( ; i < t; i++ ) {
            index1 += nzeros[i];
        }

        /* Move values to correct position. */
        for ( i = start; i < start + n; i++ ) {
            if ( ((src[i] >> bit_pos) & 1) == 0 ) {
                dest[index0++] = src[i];
            } else {
                dest[index1++] = src[i];
            }
        }

        /* Swap arrays. */
        tmp = src;
        src = dest;
        dest = tmp;
    }
}

/* Thread main routine. */
void thread_work (int rank)
{
    int start, count, n;
    int index = rank;

    /* Ensure all threads have reached this point, and then let continue. */
    pthread_barrier_wait(&barrier);

    /* Get portion of array to process. */
    n = args[index].n / args[index].t;  /* Number of elements this thread is in charge of */
    start = args[index].id * n;         /* Thread is in charge of [start, start+n] elements */

    /* Perform radix sort. */
    radix_sort_thread (args[index].val, args[index].tmp, start, n,
                       args[index].nzeros, args[index].nones,
                       args[index].id, args[index].t);
}

void radix_sort (unsigned *val, int n, int t)
{
    unsigned *tmp;
    int *nzeros, *nones;
    int r, i;

    /* Thread-related variables. */
    long thread;
    pthread_t* thread_handles;

    /* Allocate temporary array. */
    tmp = (unsigned *) malloc (n * sizeof(unsigned));

    /* Allocate counter arrays. */
    nzeros = (int *) malloc (t * sizeof(int));
    nones = (int *) malloc (t * sizeof(int));

    /* Initialize thread handles and barrier. */
    thread_handles = malloc (t * sizeof(pthread_t));

    /* Initialize thread arguments. */
    for ( i = 0; i < t; i++ ) {
        args[i].id = i;
        args[i].val = val;
        args[i].tmp = tmp;
        args[i].n = n;
        args[i].nzeros = nzeros;
        args[i].nones = nones;
        args[i].t = t;
        /* Create a thread. */
        pthread_create (&thread_handles[i], NULL, thread_work, i);
    }

    /* Wait for threads to join and terminate. */
    for ( i = 0; i < t; i++ )
        pthread_join (thread_handles[i], NULL);

    /* Copy array if necessary. */
    if ( BITS % 2 == 1 ) {
        copy_array (val, tmp, n);
    }
}

void main (int argc, char *argv[])
{
    int n, t;
    unsigned *val;
    time_t start, end;

    n = INPUT_SIZE;
    t = NTHREADS;

    val = (unsigned *) malloc (n * sizeof(unsigned));
    random_array (val, n);
    args = (struct rs_args *) malloc (t * sizeof(struct rs_args));

    radix_sort (val, n, t);  /* The main algorithm. */
}

The sequential version of this algorithm first sorts the numbers by their least significant digit, then by the next most significant digit, and so on until the entire sequence is sorted. The parallelization splits the array of numbers into equal portions and determines each element's position in the overall array using prefix sums over the per-thread digit counts; for a detailed explanation, see [22]. Table 6.4 shows the call site parallelism profiles. Interestingly, the parallelism of these call sites does not change significantly when the input size or the number of threads is varied. If the developer wished to obtain higher speedup from this program, she could try modifying the implementation to make it more scalable, by decreasing the amount of time spent in non-parallelizable sections (radix sort only has about 43% of the work), or possibly by choosing a different parallelization of the radix sort algorithm.

                     Parallelism   Percentage of Work   Count
radix sort thread    1.82281       52.99                2
radix sort           1.77026       60.7733              1
thread work          1.70867       57.7119              2
main                 1.36292       100                  1
random array         1             35.2695              1

Table 6.4.: Call site parallelism for parallel radix sort on 10^6 integers with 5 threads.
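Written as formulas, the per-thread index computation in radix_sort_thread above amounts to the following prefix sums over the counter arrays (this simply restates the code, using k for the thread index and t for the number of threads):

\[
\mathrm{index0}_k = \sum_{i=0}^{k-1} \mathrm{nzeros}_i,
\qquad
\mathrm{index1}_k = \sum_{i=0}^{t-1} \mathrm{nzeros}_i \;+\; \sum_{i=0}^{k-1} \mathrm{nones}_i,
\]

so for each bit position, all elements with a 0 bit precede all elements with a 1 bit, and each thread writes into its own disjoint ranges of the destination array.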
6.3.4. Merge Sort

The following code is a simple parallelization of the merge sort algorithm [23]:

#define TYPE int
#define MIN_LENGTH 2
#define INPUT_SIZE 10000

typedef struct {
    TYPE *array;
    int left;
    int right;
    int tid;
} thread_data_t;

int number_of_threads;
pthread_mutex_t lock_number_of_threads;

// The function passed to a pthread_t variable.
void *merge_sort_threaded(void *arg)
{
    thread_data_t *data = (thread_data_t *) arg;
    int l = data->left;
    int r = data->right;
    int t = data->tid;
    if (r - l + 1 <= MIN_LENGTH) {
        // Length is too short, let us do a |qsort|.
        qsort(data->array + l, r - l + 1, sizeof(TYPE), my_comp);
    } else {
        // Try to create two threads and assign them work.
        int m = l + ((r - l) / 2);

        // Data for thread 1
        thread_data_t data_0;
        data_0.left = l;
        data_0.right = m;
        data_0.array = data->array;
        pthread_mutex_lock(&lock_number_of_threads);
        data_0.tid = number_of_threads++;
        pthread_mutex_unlock(&lock_number_of_threads);

        // Create thread 1
        pthread_t thread0;
        int rc = pthread_create(&thread0, NULL, merge_sort_threaded, &data_0);

        // Data for thread 2
        thread_data_t data_1;
        data_1.left = m + 1;
        data_1.right = r;
        data_1.array = data->array;
        pthread_mutex_lock(&lock_number_of_threads);
        data_1.tid = number_of_threads++;
        pthread_mutex_unlock(&lock_number_of_threads);

        // Create thread 2
        pthread_t thread1;
        pthread_create(&thread1, NULL, merge_sort_threaded, &data_1);
        int created_thread_1 = 1;

        // Wait for the created threads.
        pthread_join(thread0, NULL);
        pthread_join(thread1, NULL);

        // Ok, both done, now merge.
        // left - l, right - r
        merge(data->array, l, r, t);
    }
    return NULL;
}

void merge_sort(TYPE *array, int start, int finish)
{
    thread_data_t data;
    data.array = array;
    data.left = start;
    data.right = finish;

    // Initialize the shared data.
    number_of_threads = 0;
    pthread_mutex_init(&lock_number_of_threads, NULL);
    data.tid = 0;
    // Create and initialize the thread
    pthread_t thread;
    pthread_create(&thread, NULL, merge_sort_threaded, &data);
    // Wait for the thread, i.e. the full merge sort algorithm.
    pthread_join(thread, NULL);
}

int main(int argc, char **argv)
{
    int n = INPUT_SIZE;
    int *p = random_array(n);
    merge_sort(p, 0, n - 1);
    free(p);
    pthread_mutex_destroy(&lock_number_of_threads);
}

The sequential version of this algorithm first divides the unsorted list of numbers into small sublists, which are sorted using an algorithm of choice. The sublists are then merged (combined into larger sorted lists) until a single sorted list remains. The parallel implementation gives the two merge sort calls on each recursion level to two threads, which carry out the parallel merge sort on their own subarrays until the size of a subarray is less than a user-set minimum merge sort size; then a sequential sort (the C library function qsort) is applied to the subarray.

Table 6.5 shows Parasite's profile for a run of this parallel merge sort on 10,000 integers with a minimum merge sort size of 10. merge only accounts for 3.6% of the work that merge sort performs, indicating that the calls to the sequential qsort (whose internals are not measured by Parasite) are much more expensive. Therefore, decreasing the subarray size at which qsort is applied should increase the parallelism. Further tests confirmed that the parallelism of the top-level merge sort call in main continues to increase until the minimum merge sort size is one. However, Parasite cannot show the true operating-system overhead of pthread_create(...) and pthread_join(...) that would occur in concurrent execution; with this overhead included, the minimum merge sort size for peak parallelism would likely be greater than one.

                       Parallelism (P)   P including Mutex Correction   % of Work   Count
merge                  199.301           199.301                        3.58896     1023
merge sort threaded    99.1834           21.7108                        64.2943     1
merge sort             50.4269           18.1671                        65.07       1
main                   16.2865           11.9586                        100         1
random array           1                 1                              4.66903     1

Table 6.5.: Call site parallelism for parallel merge sort on 10,000 integers with a minimum merge sort size of 10.

Note that the single mutex in this parallel merge sort, which protects the global count of the number of threads, reduces the parallelism of the call to merge sort by about 60 percent. However, this global thread count is only used for debugging purposes, so it could be removed. Without Parasite, the programmer would not necessarily know that this mutex has a significant effect on the parallelism. With Parasite, the programmer sees a clear contrast between the first two columns of the table, and knows that removing the mutex can improve the parallelism significantly.
In [3] it is shown that parallel merge sort has the following work and span:

    T_1(N) = Θ(N lg N)                                                 (6.7)
    T_∞(N) = Θ(lg^3 N)                                                 (6.8)

Therefore the theoretical parallelism is:

    T_1(N)/T_∞(N) = Θ(N lg N)/Θ(lg^3 N) = Θ(N/lg^2 N)                  (6.9)

Figure 6.7 plots Parasite's measured parallelism of the parallel merge sort function against N/lg^2 N. The plot is approximately linear, so the implementation exhibits the expected theoretical parallelism.

[Figure 6.7 plots the parallelism of the top call to the parallel merge sort function against N/(lg N * lg N).]
Figure 6.7.: Dependence of parallelism of parallel merge sort on N, the input vector size.

6.3.5. Summary

The parallel sorting programs act as a more demanding validation of the Parasite tool: they show that Parasite produces reasonable parallelism values for recursive algorithms with large depth, such as the parallel quicksort and parallel merge sort, and that the measured parallelism agrees with the theoretical prediction in the case of merge sort. Furthermore, the sorting tests demonstrate that Parasite can quickly help the developer find information useful for setting program parameters, including the input size, the number of threads used, and the recursion depth reached before a sequential algorithm takes over the sort. The developer could gain further design information from Parasite by examining how different distributions of the input numbers affect the parallelism of each sorting algorithm.

6.4. European Championships Simulation

In this section EMsim, a simulation of the European Football Championships, is analyzed [24]. This program has a more complex structure than a simple master-worker pattern.
Without an understanding of football, one does not necessarily know how many matches could be played concurrently in this simulation, or how the simulation would scale with more threads. Parasite gives the programmer unambiguous numbers for the parallelism of all call sites, which are useful for showing the scalability of individual parts of the simulation as well as problems with load balancing.

Program Structure. Figure 6.8 shows Parceive's visualization of the EMsim program.
As the calling context tree view in the center of this figure shows, the program follows these steps:

1. In the initDB method, read statistics on previous EM matches into a database.
2. Get the 24 teams who will be playing.
3. Simulate a full run of the European championships, through the following steps:
   a) Initialize the simulation.
   b) Make, in parallel, six calls via parallel calls to playGroup, each of which simulates the six matches of one group in the group phase of the EM simulation. Within a group, the matches are played sequentially.
   c) Sort the team scores based on their group results.
   d) Make, in parallel, 16 calls to playMatchInPar to simulate the round of 16.
   e) Make, in parallel, 8 calls to playMatchInPar to simulate the round of 8.
   f) Make, in parallel, 4 calls to playMatchInPar to simulate the quarterfinals.
   g) Make, in parallel, 2 calls to playMatchInPar to simulate the semifinals.
   h) Simulate the final match.

Parceive's visualization shows some useful information about the program, for example that the work is not evenly balanced between the calls to playMatchGen, the function call at the second-to-bottom level of the performance view. However, the visualization only provides a qualitative view of the parallelism in the EM simulation.

Parallelism. Parasite provides quantitative views: Table 6.6 shows all the call sites in the simulation with parallelism greater than 1.
                               Parallelism   Percentage of Work   Count
team1DominatesTeam2            15.5336       0.00218563           72
getNumMatches                  6.84717       0.746724             52
getMatches                     6.4472        0.963656             51
getGoalsPerGame                6.09455       0.00455375           111
getMatchesInternal             5.60061       0.113177             52
getMatchesInternal             5.47997       0.103317             52
fillPlayer                     4.71968       9.90176              2621
fillPlayer                     4.60666       3.80308              222
parallel calls to playGroup    4.33232       69.1592              16
playGroup                      4.33212       69.1557              6
playGroupMatch                 4.33129       69.1352              36
playMatchGen                   4.33077       69.1262              36
playEM                         3.75038       97.9785              2
getPlayersOfMatch              3.73775       96.8404              111
getGoalsPerGame                3.63284       0.00139767           111
getNumPlayersOfMatch           3.62938       20.685               111
getNumPlayersOfMatch           3.6288        20.7125              111
main                           3.58549       100                  1
playFinalRound                 2.84291       28.7715              4
playMatchInPar                 2.84202       28.7286              15
playFinalMatch                 2.84182       28.7262              15
playMatchGen                   2.84162       28.7192              30
getTeam                        1.24702       0.000830991          18

Table 6.6.: Parallelism of call sites in EMsim.

Table 6.6 provides the insight that none of the call sites in EMsim has a parallelism greater than 7, other than team1DominatesTeam2, which accounts for less than 0.01% of the work. Therefore, with the current design of the simulation, there is probably no benefit to using more than seven threads at any point in the simulation. This insight is not immediately clear from Figure 6.8, from the program description, or from the DAG Parasite generates for the program.

Furthermore, the Parceive visualization suggests that parallel calls to playGroup could have a parallelism of at most 6, as it is called six times concurrently. Its parallelism is instead about 4.3, indicating that the calls are not balanced in terms of span: one of the calls must take significantly longer than the others. A similar observation can be made for playGroupMatch, which is called six times from each of the six calls to playGroup. This function simulates a match, and in the group phase all matches are independent, so it could have a parallelism of up to 36 if all matches involved the same work. However, Parasite measures a parallelism of only 4.33 for this call site, indicating that the simulation could be redesigned to be more scalable in the group match phase.
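The gap between six concurrent calls and a measured parallelism of about 4.3 is exactly what a span imbalance produces. As an illustrative (hypothetical) example, if five groups took time t to simulate and one took 2t, then

\[
T_1 = 5t + 2t = 7t, \qquad T_\infty = 2t, \qquad \frac{T_1}{T_\infty} = 3.5,
\]

so a single slow group already pulls the parallelism well below 6, even though all six calls run concurrently.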
6.5. Molecular Dynamics

Appendix 8.1 contains a simple serial molecular dynamics code [25]. Table 6.7 shows the distribution of work for the sequential simulation.

              Parallelism   Percentage of Work   Count
main          1             100                  1
update        1             1.08808              10
compute       1             10.0827              10
calKinetic    1             0.923045             110
compute       1             12.6647              1
main work     1             96.8591              1
distance      1             4.01697              990
initialize    1             3.35982              1

Table 6.7.: Call site parallelism and work percentage for the serial molecular dynamics code in Appendix 8.1, using 10 atoms.

The compute function performs the largest percentage of the work, because it contains nested for loops over all atoms in the simulation. The distance function, which is called within these two nested loops over the atoms, contributes about half of the work of the compute function. This suggests one way to parallelize: split the calls to distance over different threads. Examination of the compute function shows that the potential and force calculations after distance depend on the result of the distance calls, so they should be included in the same worker function as the distance calculation. This parallelization is included in Appendix 8.2, and Table 6.8 shows the call site parallelism:

                 Parallelism   Percentage of Work   Count
distance         1.20118       2.34429              990
compute          1.16186       30.3021              10
distance work    1.14416       12.9843              990
main             1.05319       100                  1
main work        1.04937       98.6271              1
compute          1.0108        39.2091              1
update           1             0.631825             10
initialize       1             1.92633              1
calKinetic       1             0.694468             110

Table 6.8.: Call site parallelism and work percentage for a parallelization over distance calls (the code in Appendix 8.2), using 10 atoms, which creates a new thread for each distance calculation.

This parallelization has a very fine granularity, because it requires Θ(N_a^2) threads to be created for each step of the simulation, where N_a is the number of atoms. The parallelization only increases the parallelism of main from 1 to 1.05, indicating that the time cost of creating and joining threads cancels out most of the additional concurrency gained from creating them. A way to parallelize with coarser granularity is to group the calculations for each atom together, so that for each time step the number of threads created equals the number of atoms.
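A per-atom decomposition could look roughly like the following minimal sketch (hypothetical names such as atom_worker and compute_forces_for_atom, and a toy pair force; the parallelization actually profiled here is the one listed in Appendix 8.3):

#include <math.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define NA  10   /* number of atoms, matching the 10-atom runs above */
#define DIM 3    /* spatial dimensions */

static double pos[NA][DIM];    /* atom positions (read-only during a step) */
static double force[NA][DIM];  /* per-atom forces; thread i writes only row i */

/* Force on atom i from all other atoms, using a toy 1/r^2 pair force. */
static void compute_forces_for_atom(int i)
{
    for (int d = 0; d < DIM; d++)
        force[i][d] = 0.0;
    for (int j = 0; j < NA; j++) {
        if (j == i)
            continue;
        double diff[DIM], r2 = 0.0;
        for (int d = 0; d < DIM; d++) {
            diff[d] = pos[i][d] - pos[j][d];
            r2 += diff[d] * diff[d];
        }
        double r = sqrt(r2) + 1e-12;  /* guard against division by zero */
        for (int d = 0; d < DIM; d++)
            force[i][d] += diff[d] / (r * r * r);
    }
}

/* Thread worker: one thread per atom, so NA threads per time step. */
static void *atom_worker(void *arg)
{
    compute_forces_for_atom((int)(intptr_t) arg);
    return NULL;
}

int main(void)
{
    pthread_t threads[NA];

    /* Arbitrary initial positions, just so the sketch runs. */
    for (int i = 0; i < NA; i++)
        for (int d = 0; d < DIM; d++)
            pos[i][d] = (double)(i + d);

    /* One time step: spawn a thread per atom, then join them all. */
    for (int i = 0; i < NA; i++)
        pthread_create(&threads[i], NULL, atom_worker, (void *)(intptr_t) i);
    for (int i = 0; i < NA; i++)
        pthread_join(threads[i], NULL);

    printf("force on atom 0: %f %f %f\n",
           force[0][0], force[0][1], force[0][2]);
    return 0;
}

Because each thread in this sketch writes only its own row of the force array, no mutex is needed; the versions in Appendices 8.2 and 8.3 share AtomInfo objects between threads, which is why the race-condition caveat discussed below applies to them.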
This parallelization is included in Appendix 8.3, and Table 6.9 shows the call site parallelism:

              Parallelism   Percentage of Work   Count
compute       44.3736       80.8772              10
main work     8.65348       99.7773              1
main          8.55733       100                  1
compute       4.41926       11.8808              1
update        1             0.106188             10
initialize    1             0.335047             1

Table 6.9.: Call site parallelism and work percentage for the parallelization in Appendix 8.3, which creates a new thread for the distance and potential calculations associated with each of the 10 atoms.

There are 10 threads spawned at each time step to perform the distance calculations for the atoms in the simulation, and the parallelism of main is about 8.6, which indicates some overhead in pthread_create(...) and pthread_join(...). There may also be some load imbalance in the work assigned to each thread, which would mean the potential and kinetic energy calculations are more costly for some atoms.

The molecular dynamics simulation illustrates an important difference between Parasite and a profiler that executes the program in parallel: synchronization issues such as race conditions do not have to be considered when using Parasite to gain information about a Pthread program. Consider the two alternative molecular dynamics parallelizations in Appendices 8.2 and 8.3. These likely have race conditions, because a pointer to an AtomInfo object can be accessed concurrently by two different threads. With a normal profiler, this race condition creates nondeterminism that may lead to undefined behavior, so the programmer would have to implement a mutex for each atom to avoid it. Parasite operates sequentially, so the race condition does not present a problem. With Parasite, the programmer can quickly test a "rough draft" of a possible parallelization and compare it with others, without implementing synchronization. In this case the programmer may want to compare the parallelizations presented here, or merely see whether the parallelism of either of them merits the effort required to implement and verify a correct parallelization with proper synchronization primitives. That effort may seem trivial for this simple molecular dynamics application, but it could be significantly higher for applications with more complex synchronization requirements.

6.6. CPP Check

CPPcheck is a static code analysis tool that checks style and correctness in C++ files [26]. It is much larger than the other test programs in this thesis, and it is frequently used by programmers across the world. CPPcheck ships with a multi-threaded mode that does not use Pthreads, but to test Parasite it has been parallelized with the following code excerpt, which has been edited to show only the lines relevant to the parallelization [26]:

struct thread_arg {
    CppCheck* object;
    const std::string* file_name;
    unsigned int return_value;
};

void* pthread_worker(void* thread_arg)
{
    struct thread_arg* thrd_arg = (struct thread_arg*) thread_arg;
    thrd_arg->return_value = thrd_arg->object->check(*(thrd_arg->file_name));
}

int CppCheckExecutor::check_internal(CppCheck& cppcheck, int /*argc*/,
                                     const char* const argv[])
{
    // ...... CODE OMITTED FOR BREVITY .... //
    if (settings.jobs == 1) {
        // ...... CODE OMITTED FOR BREVITY .... //
        std::size_t processedsize = 0;
        pthread_t* thread;
        thread = new pthread_t[_files.size()];
        struct thread_arg* arg;
        arg = new struct thread_arg[_files.size()];
        for (std::map<std::string, std::size_t>::const_iterator i = _files.begin();
             i != _files.end(); ++i) {
            if (!_settings->library.markupFile(i->first)
                || !_settings->library.processMarkupAfterCode(i->first)) {
                arg[j].file_name = &i->first;
                arg[j].return_value = 0;
                arg[j].object = &cppcheck;
                pthread_create(&thread[j], NULL, &pthread_worker, &(arg[j]));
            }
        }
        j = 0;
        for (std::map<std::string, std::size_t>::const_iterator i = _files.begin();
             i != _files.end(); ++i) {
            pthread_join(thread[j], NULL);
            returnValue += arg[j].return_value;
            j++;
        }
    // ... EXCLUDED CODE ... //

This code creates a worker thread for each file that CPPcheck processes, in an attempt to scale the application through concurrent processing of files. CPPcheck's original multi-threaded execution is more complex due to synchronization, but it operates on a similar principle: each thread analyzes a different file. However, the parallelization above requires very little effort to produce compared to CPPcheck's multi-threaded execution. A programmer attempting to parallelize CPPcheck could use the above parallelization together with Parasite's sequential trace execution to test its scalability, before making the effort to produce the synchronization necessary for a fully working parallelization.

Table 6.10 shows the results of running Parasite on this parallelization of CPPcheck with 2 of the sample C++ files contained in CPPcheck's Github repository: