Experiences Programming for GPUs with OpenCL
Oliver Fishstein, Alejandro Villegas
Abstract
This project examines the complexities of parallel
computing on GPUs using OpenCL. Writing a parallel
program requires understanding the possible
synchronization issues along with the GPU execution
and memory models, which are discussed in this paper.
These concepts were used to write a genetic algorithm
in OpenCL that solves the knapsack problem, and to
analyze the differences between computing in local
memory and global memory.
Keywords: OpenCL, Parallel Programming, GPGPU
1 Introduction
General purpose computing on GPUs is the utiliza-
tion of the graphics processing unit to perform com-
putations and applications traditionally performed
on the CPU, instead of limiting the GPU’s uses to
the traditional graphics computations. Programming
GPUs takes advantage of the parallel nature of graph-
ics processing to increase the runtime speed [1].
The current dominant general-purpose GPU computing
language is OpenCL. It defines a C-like language in
which programs, called kernels, are written and
executed on compute devices or accelerators. In our
case, we will focus on GPUs as the accelerator. In
order to execute OpenCL kernels, it is necessary to
write a host program in either C or C++ that launches
kernels on the compute device and manages the device
memory, which is usually separate from the host
memory [2].
1.1 Synchronization Issues in Parallel
Programming
One type of synchronization issue in parallel
programming is the hazard. There are three types of
hazards: read after write, write after read, and
write after write. These occur when instructions from
different execution threads modify shared data in an
unexpected temporal order [3]. An example of code
with hazards can be seen in code listing 1.
Listing 1: Hazard Example
// Both threads share this var.
shared int a[2];
// Each thread has a private copy of this var.
private int b;
// Returns 0 for thread 0 and 1 for thread 1
private int id = get_id();
a[id] = id;   // line 1
b = a[1-id];  // line 2
a[id] = b;    // line 3
There is a read-after-write hazard between lines 1
and 2, because line 2 in thread 1 can be executed
before thread 0 executes line 1. In addition, there
is another read-after-write hazard, along with a
write-after-read hazard, between lines 2 and 3,
because line 3 in thread 1 can be executed before
thread 0 executes line 2. To remove these hazards, a
barrier can be placed between lines 1 and 2 and
between lines 2 and 3. A barrier, written as
barrier(), indicates that all threads must reach it
before proceeding to the next portion of the program.
The other type of synchronization issue is the
critical section. Critical sections are lines of code
that access a shared resource (a device or data
structure) that cannot be accessed concurrently by
more than one thread at a time [4]. This can lead to
results that are not expected by the programmer due
to threads interfering with one another. An example
of code with a critical section can be seen in code
listing 2.
Listing 2: Critical Section Example
#define EMPTY -1
void insert(int *list, int val) {
    int i = 0;
    while (list[i] != EMPTY) i++;  // line 1
    list[i] = val;                 // line 2
}
There is a critical section in lines 1 and 2. By
making these lines a critical section, the only
possible outputs are list = {0, 1} and list = {1, 0},
depending on which thread takes the lead. Critical
sections can be implemented using locks. A lock is a
variable that can have the value 1, indicating
locked, or the value 0, indicating unlocked. A thread
checks the state of the lock, and if it is locked, it
waits until it is unlocked. If it is unlocked, the
thread sets the lock to locked, executes the critical
section, and then sets the lock back to unlocked.
Thread APIs usually provide the programmer with
functions to define and use locks, which rely on
lower-level atomic operations. Code listing 3 is the
code from code listing 2 corrected with a lock.
Listing 3: Lock Example
#define EMPTY -1
void insert(int *list, int val) {
    int i = 0;
    while (getLock(lock) == false) {}  // spin until the lock is acquired
    while (list[i] != EMPTY) i++;      // line 1
    list[i] = val;                     // line 2
    releaseLock(lock);
}
1.2 GPU Execution Model
The GPU execution model in OpenCL consists of
workitems, wavefronts, and workgroups. Workitems are
the individual threads that the program uses. These
are divided into workgroups that are executed on the
individual compute units of the GPU. The number of
workitems in each workgroup is a multiple of the
wavefront size, which is the number of workitems that
can run concurrently within a compute unit. The
wavefront size is fixed and defined by the hardware.
Each workitem in a wavefront executes in lock-step
with the other workitems in the same wavefront. A
workgroup can be made up of multiple wavefronts.
1.3 GPU Memory Model
The GPU memory model consists of global memory and
local memory. Global memory is allocated by the host
program and is visible to all the workitems running
on the GPU. Local memory is a smaller and faster
portion of memory. It is declared within the kernel
and is visible only to the workitems belonging to the
same workgroup. Generally, doing computations in
local memory is faster than using global memory.
2 Genetic Algorithm and the Knapsack
Problem
A genetic algorithm is a search heuristic that mimics
the process of natural selection. The knapsack
problem is a particular optimization problem to which
genetic algorithms can be applied. Both are detailed
below.
2.1 Genetic Algorithm
A genetic algorithm uses the biological concept of
natural selection to search for the best or "most
fit" solution [5]. A population of potential
solutions to an optimization problem is evolved
toward better solutions. The evolution usually starts
from a population of randomly generated values,
although it can also use a predefined population, and
proceeds through an iterative process that evolves in
"generations". After the population of solutions is
initialized, the selection process begins. During
each iteration, values are selected to create the
next generation by comparing two existing solutions
and choosing the "fitter" one. The fitness comparison
can take place with every value or only a randomized
sample.
The next step of the genetic algorithm is to mutate
the fitter result. The mutation depends on the
problem the algorithm is being applied to solve. The
mutated value then replaces the less fit value. All
of the values are then shuffled in order to change
which values are compared. This whole process is
repeated for as many iterations or "generations" as
needed, although generally more iterations give
better results, especially when working with a more
complex mutation.
2.2 Knapsack Problem
The knapsack problem is a specific type of problem
that can be solved by a genetic algorithm. Given a
set of items, each with a corresponding mass,
determine which items to include in the "knapsack" so
that the total weight is less than or equal to a
given limit. Ideally, the total weight should equal
the limit. The complexity of the problem can be
increased by adding additional criteria like value
and dimensions [6].
3 Implementing the Knapsack Problem
in OpenCL
The implementation of the knapsack problem in OpenCL
used here consisted of an input of 128 values,
initialized to the values 0 through 127, and used 64
workitems to compute the ideal value. The entire
implementation fit in a single workgroup and
wavefront. The goal mass was 500. To compute the mass
for each input, its binary representation was
compared against a preexisting array of values (5,
10, 20, 50, 100, 300, 200, 150); when a bit was high,
the corresponding array value was added to the total
mass for that input. For example, an element with a
binary value of 11010000 would have a mass of
5+10+50.
For the comparison process, the input corresponding
to the workitem was compared to the input
corresponding to the workitem's id plus 64, so that
all 128 values were accessed. To determine which
value was most fit, several conditions had to be set.
If both values were less than the goal, the larger
value was selected. If either value was equal to the
goal, it was selected, and if both were equal to the
goal, the first value was selected. If both values
were greater than the goal, the smaller value was
selected.
The mutation for this version of the algorithm was to
replace the unfit value with the fit value. Both were
then written back into the input array, and the
values corresponding to workitems were shuffled by
adding 1 and taking the modulus of 128 before
replacing them in the input array. Barriers were used
to ensure no hazards occurred during the replacement
process.
The entire process is repeated for a set number of
iterations in order to find the ideal result. The
minimum number of iterations needed to ensure that
the ideal result was found was 100. The algorithm was
run with several different iteration counts in order
to find the ideal result and compare compute times.
The ideal result was the decimal value 96, which
corresponds to a mass equal to the goal of 500.
4 Evaluation
The initial version of the knapsack problem imple-
mentation was with global memory. This involved
constantly passing values back to the global memory,
which is generally a time consuming process. The
algorithm was tested at 100, 1000, 10000, 100000 it-
erations, and every time the result was the ideal 96.
In order to determine the amount of time it took
to compute, timing functions from C++11 were in-
cluded in the host program. It was run five times in
order to get a good grasp of the compute time, and
these values can be seen in table 1.
Tab. 1: Global Memory Results
Run   100 Iters   1000 Iters   10000 Iters   100000 Iters
1     0.000927s   0.002132s    0.014585s     0.147747s
2     0.000820s   0.002078s    0.014294s     0.140781s
3     0.000701s   0.002178s    0.014957s     0.142362s
4     0.000794s   0.002183s    0.024343s     0.138756s
5     0.000811s   0.002082s    0.016055s     0.159741s
The genetic algorithm was also implemented using
local memory. In theory, this implementation should
be significantly faster, but copying every value from
global memory to local memory required an additional
barrier and took a significant amount of time. In
addition, the computation was not complex enough to
make up for that time, so the compute times in global
memory and local memory were roughly equivalent. The
compute time values are in table 2.
Tab. 2: Local Memory Results
Run   100 Iters   1000 Iters   10000 Iters   100000 Iters
1     0.000803s   0.002324s    0.014757s     0.208451s
2     0.000789s   0.002060s    0.021008s     0.134539s
3     0.000841s   0.002004s    0.015843s     0.135663s
4     0.000816s   0.002197s    0.014649s     0.137022s
5     0.000755s   0.002107s    0.014124s     0.144065s
4.1 Testbed Characteristics
CPU: 2.7 GHz Intel Core i7
GPU: Intel HD Graphics 4000
NVIDIA GeForce GT 650M
Memory: 16 GB 1600 MHz DDR3
Operating System: OS X Yosemite
5 Conclusions
Through working on this project, a solid
understanding of programming for GPUs with OpenCL
was developed. Learning about the synchronization
issues in parallel programming allowed for the
development of the programming skills needed to
complete more complex algorithms like the genetic
algorithm discussed in this paper. Knowledge of the
GPU execution and memory models gave a better
understanding of how OpenCL interacts with the GPU,
allowing for more effective programming.
References
[1] General-purpose computing on graphics processing units. http://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units.
[2] OpenCL. http://en.wikipedia.org/wiki/OpenCL.
[3] Hazard (computer architecture). http://en.wikipedia.org/wiki/Hazard_(computer_architecture).
[4] Critical section. http://en.wikipedia.org/wiki/Critical_section.
[5] Genetic algorithm. http://en.wikipedia.org/wiki/Genetic_algorithm.
[6] Knapsack problem. http://en.wikipedia.org/wiki/Knapsack_problem.