Experiences Programming for GPUs with OpenCL
Oliver Fishstein, Alejandro Villegas
Abstract
This project looks at the complexities of parallel computing on GPUs using OpenCL. Writing a parallel program requires an understanding of the possible synchronization issues as well as the GPU execution and memory models, all of which are discussed in this paper. These concepts were applied to write a genetic algorithm that solves the knapsack problem in OpenCL and to analyze the performance difference between computing in local memory and in global memory.
Keywords: OpenCL, Parallel Programming, GPGPU
1 Introduction
General-purpose computing on GPUs is the use of the graphics processing unit to perform computations and applications traditionally performed on the CPU, rather than limiting the GPU to traditional graphics workloads. Programming GPUs takes advantage of the parallel nature of graphics processing to reduce runtimes [1].
The current dominant general-purpose GPU computing language is OpenCL. It defines a C-like language in which programs, called kernels, are written and executed on compute devices, or accelerators. In our case, we focus on GPUs as the accelerator. To execute OpenCL kernels, it is necessary to write a host program in either C or C++ that launches kernels on the compute device and manages the device memory, which is usually separate from the host memory [2].
1.1 Synchronization Issues in Parallel Programming
One type of synchronization issue in parallel programming is the hazard. There are three types of hazards: read after write, write after read, and write after write. They occur when instructions from different execution threads access shared data in an unexpected temporal order [3]. An example of code with hazards can be seen in Listing 1.
Listing 1: Hazard Example
// Both threads share this var.
shared int a[2];
// Each thread has a private copy of this var.
private int b;
// Returns 0 for thread 0 and 1 for thread 1
private int id = get_id();
a[id] = id;   // line 1
b = a[1-id];  // line 2
a[id] = b;    // line 3
There is a read after write hazard between lines 1 and 2, because thread 1 can execute line 2 before thread 0 has executed line 1. In addition, there is another read after write hazard, along with a write after read hazard, between lines 2 and 3, because thread 1 can execute line 3 before thread 0 has executed line 2. To remove these hazards, a barrier can be placed between lines 1 and 2 and between lines 2 and 3. A barrier, written barrier(), indicates that all threads must reach it before proceeding to the next portion of the program.
The other type of synchronization issue is the critical section. Critical sections are lines of code that access a shared resource (a device or a data structure) that must not be accessed by more than one thread at a time [4]. If they are not protected, threads can interfere with one another and produce results the programmer does not expect. An example of code with a critical section can be seen in Listing 2.
Listing 2: Critical Section Example
#define EMPTY -1
void insert(int *list, int val) {
    int i = 0;
    while(list[i] != EMPTY) i++;  // line 1
    list[i] = val;                // line 2
}
Lines 1 and 2 form a critical section. If two threads concurrently insert the values 0 and 1 into an empty list, making these lines a critical section guarantees that the only possible outputs are list = 0 1 and list = 1 0, depending on which thread takes the lead. Critical sections can be implemented with locks. A lock is a variable that holds the value 1, indicating locked, or 0, indicating unlocked. A thread checks the state of the lock; if it is locked, the thread waits until it is unlocked. Once it is unlocked, the thread sets the lock to locked, executes the critical section, and then sets the lock back to unlocked. Thread APIs usually provide the programmer with functions to define and use locks, which rely on lower-level atomic operations. Listing 3 shows the code from Listing 2 corrected with a lock.
Listing 3: Lock Example
#define EMPTY -1
void insert(int *list, int val) {
    int i = 0;
    while(getLock(lock) == false) {}
    while(list[i] != EMPTY) i++;  // line 1
    list[i] = val;                // line 2
    releaseLock();
}
1.2 GPU Execution Model
The GPU execution model in OpenCL consists of workitems, wavefronts, and workgroups. Workitems are the individual threads that the program uses. They are divided into workgroups, each of which executes on one of the compute units in the GPU. The number of workitems in each workgroup is a multiple of the wavefront size, which is the number of workitems that can run concurrently within a compute unit. The wavefront size is fixed by the hardware. Each workitem in a wavefront executes in lock-step with the other workitems in the same wavefront, and a workgroup can be made up of multiple wavefronts.
1.3 GPU Memory Model
The GPU memory model consists of global memory and local memory. Global memory is allocated by the host program and is visible to all the workitems running on the GPU. Local memory is a smaller, faster region of memory; it is declared within the kernel and is visible only to the workitems belonging to the same workgroup. In general, computing in local memory is faster than computing in global memory.
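The usual pattern for exploiting local memory can be sketched as a kernel like the following (a hypothetical example, not the implementation evaluated later): stage a tile of global data into local memory, synchronize the workgroup, then compute from the fast local copy.

```c
__kernel void stage_and_compute(__global const int* in,
                                __global int* out,
                                __local int* tile) {
    int gid = get_global_id(0);
    int lid = get_local_id(0);

    tile[lid] = in[gid];           // one global read per workitem
    barrier(CLK_LOCAL_MEM_FENCE);  // wait until the whole tile is staged

    /* ... compute using tile[...] instead of repeated global reads ... */
    out[gid] = tile[lid];
}
```

Note that the staging step itself costs a global read plus a barrier, which matters for the evaluation in Section 4.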
2 Genetic Algorithm and the Knapsack
Problem
A genetic algorithm is a search heuristic that mimics the process of natural selection. The knapsack problem is one problem to which genetic algorithms can be applied. Both are detailed below.
2.1 Genetic Algorithm
A genetic algorithm uses the biological concept of natural selection to search for the best, or "most fit", solution [5]. A population of candidate solutions to an optimization problem is evolved toward better solutions. The evolution usually starts from a population of randomly generated values, although it can also use a predefined population, and proceeds as an iterative process whose iterations are called "generations". After the population of solutions is initialized, the selection process begins. During each iteration, values are selected to create the next generation by comparing two existing solutions and keeping the "fitter" one. The fitness comparison can involve every value or only a randomized sample.
The next step of the genetic algorithm is to mutate the fitter result. The mutation depends on the problem the algorithm is being applied to solve. The mutated value then replaces the less fit one. All of the values are then shuffled in order to change which values are compared. This whole process is repeated for as many iterations, or "generations", as needed; generally, more iterations give better results, especially with a more complex mutation.
2.2 Knapsack Problem
The knapsack problem is one application that can be solved with a genetic algorithm. Given a set of items, each with a corresponding mass, determine which items to include in the "knapsack" so that the total weight is less than or equal to a given limit. Ideally, the total weight should equal the limit. The complexity of the problem can be increased by adding criteria such as value and dimensions [6].
3 Implementing the Knapsack Problem
in OpenCL
Our implementation of the knapsack problem in OpenCL takes an input of 128 values, initialized to 0 through 127, and uses 64 workitems to search for the ideal value. The entire implementation fits in a single workgroup and a single wavefront. The goal mass was 500. To compute the mass for each input, its binary representation was compared against a fixed array of weights (5, 10, 20, 50, 100, 300, 200, 150); whenever a bit was high, the corresponding array entry was added to the input's total mass. For example, an element with the binary value 11010000 would have a mass of 5+10+50.
For the comparison process, the input corresponding to each workitem was compared to the input corresponding to the workitem's index plus 64, so that all 128 values were covered. To determine which value was fitter, several conditions were set. If both values were less than the goal, the larger value was selected. If either value was equal to the goal, it was selected; if both were equal to the goal, the first value was selected. If both values were greater than the goal, the smaller value was selected.
The mutation in this version of the algorithm replaces the unfit value with the fit one. Both values were then written back into the input array, and the indices corresponding to the workitems were shuffled by adding 1 and taking the result modulo 128 before the values were placed back in the input array. Barriers were used to ensure that no hazards occurred during the replacement process.
The entire process is repeated for a set number of iterations to find the ideal result, which was the decimal value 96, corresponding to a mass equal to the goal of 500. The minimum number of iterations that reliably found the ideal result was 100. The algorithm was then run with four different iteration counts to confirm the ideal result and compare compute times.
4 Evaluation
The initial version of the knapsack implementation used global memory. This involved constantly writing values back to global memory, which is generally time consuming. The algorithm was tested at 100, 1000, 10000, and 100000 iterations, and every run returned the ideal value of 96. To measure compute time, timing functions from C++11 were added to the host program. Each configuration was run five times to get a good picture of the compute time, and these values can be seen in Table 1.
Tab. 1: Global Memory Results
Run   100 Iters   1000 Iters   10000 Iters   100000 Iters
1     0.000927s   0.002132s    0.014585s     0.147747s
2     0.000820s   0.002078s    0.014294s     0.140781s
3     0.000701s   0.002178s    0.014957s     0.142362s
4     0.000794s   0.002183s    0.024343s     0.138756s
5     0.000811s   0.002082s    0.016055s     0.159741s
The genetic algorithm was also implemented using local memory. In theory, this implementation should be significantly faster, but copying every value from global memory to local memory required an additional barrier and took a significant amount of time. In addition, the computation was not complex enough to amortize that cost, so the compute times in global memory and local memory were equivalent. The compute times are shown in Table 2.
Tab. 2: Local Memory Results
Run   100 Iters   1000 Iters   10000 Iters   100000 Iters
1     0.000803s   0.002324s    0.014757s     0.208451s
2     0.000789s   0.002060s    0.021008s     0.134539s
3     0.000841s   0.002004s    0.015843s     0.135663s
4     0.000816s   0.002197s    0.014649s     0.137022s
5     0.000755s   0.002107s    0.014124s     0.144065s
4.1 Testbed Characteristics
CPU: 2.7 GHz Intel Core i7
GPU: Intel HD Graphics 4000
NVIDIA GeForce GT 650M
Memory: 16 GB 1600 MHz DDR3
Operating System: OS X Yosemite
5 Conclusions
Through this project, a solid understanding of programming GPUs with OpenCL was developed. Learning about the synchronization issues in parallel programming built the programming skills needed to complete more complex algorithms, such as the genetic algorithm discussed in this paper. Knowledge of the GPU execution and memory models gave a better understanding of how OpenCL interacts with the GPU, allowing more effective programming.
References
[1] General-purpose computing on graphics processing units. http://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units.
[2] OpenCL. http://en.wikipedia.org/wiki/OpenCL.
[3] Hazard (computer architecture). http://en.wikipedia.org/wiki/Hazard_(computer_architecture).
[4] Critical section. http://en.wikipedia.org/wiki/Critical_section.
[5] Genetic algorithm. http://en.wikipedia.org/wiki/Genetic_algorithm.
[6] Knapsack problem. http://en.wikipedia.org/wiki/Knapsack_problem.

More Related Content

What's hot

Towards an Integration of the Actor Model in an FRP Language for Small-Scale ...
Towards an Integration of the Actor Model in an FRP Language for Small-Scale ...Towards an Integration of the Actor Model in an FRP Language for Small-Scale ...
Towards an Integration of the Actor Model in an FRP Language for Small-Scale ...Takuo Watanabe
 
Attention mechanisms with tensorflow
Attention mechanisms with tensorflowAttention mechanisms with tensorflow
Attention mechanisms with tensorflowKeon Kim
 
Distributed System by Pratik Tambekar
Distributed System by Pratik TambekarDistributed System by Pratik Tambekar
Distributed System by Pratik TambekarPratik Tambekar
 
Ccr - Concurrency and Coordination Runtime
Ccr - Concurrency and Coordination RuntimeCcr - Concurrency and Coordination Runtime
Ccr - Concurrency and Coordination RuntimeIgor Moochnick
 
A New Modified Version of Caser Cipher Algorithm
A New Modified Version of Caser Cipher AlgorithmA New Modified Version of Caser Cipher Algorithm
A New Modified Version of Caser Cipher AlgorithmIJERD Editor
 
Metrics ekon 14_2_kleiner
Metrics ekon 14_2_kleinerMetrics ekon 14_2_kleiner
Metrics ekon 14_2_kleinerMax Kleiner
 
Inter threadcommunication.38
Inter threadcommunication.38Inter threadcommunication.38
Inter threadcommunication.38myrajendra
 
Why we cannot ignore Functional Programming
Why we cannot ignore Functional ProgrammingWhy we cannot ignore Functional Programming
Why we cannot ignore Functional ProgrammingMario Fusco
 
Java Concurrency, Memory Model, and Trends
Java Concurrency, Memory Model, and TrendsJava Concurrency, Memory Model, and Trends
Java Concurrency, Memory Model, and TrendsCarol McDonald
 
Chaotic Variations Of Aes Algorithm
Chaotic Variations Of Aes AlgorithmChaotic Variations Of Aes Algorithm
Chaotic Variations Of Aes Algorithmijccmsjournal
 
CHAOTIC VARIATIONS OF AES ALGORITHM
CHAOTIC VARIATIONS OF AES ALGORITHMCHAOTIC VARIATIONS OF AES ALGORITHM
CHAOTIC VARIATIONS OF AES ALGORITHMijccmsjournal
 
Novel Algorithm For Encryption:Hybrid of Transposition and Substitution Method
Novel Algorithm For Encryption:Hybrid of Transposition and Substitution MethodNovel Algorithm For Encryption:Hybrid of Transposition and Substitution Method
Novel Algorithm For Encryption:Hybrid of Transposition and Substitution MethodIDES Editor
 
Sync, async and multithreading
Sync, async and multithreadingSync, async and multithreading
Sync, async and multithreadingTuan Chau
 
Csphtp1 14
Csphtp1 14Csphtp1 14
Csphtp1 14HUST
 

What's hot (20)

Towards an Integration of the Actor Model in an FRP Language for Small-Scale ...
Towards an Integration of the Actor Model in an FRP Language for Small-Scale ...Towards an Integration of the Actor Model in an FRP Language for Small-Scale ...
Towards an Integration of the Actor Model in an FRP Language for Small-Scale ...
 
Attention mechanisms with tensorflow
Attention mechanisms with tensorflowAttention mechanisms with tensorflow
Attention mechanisms with tensorflow
 
Distributed System by Pratik Tambekar
Distributed System by Pratik TambekarDistributed System by Pratik Tambekar
Distributed System by Pratik Tambekar
 
Ccr - Concurrency and Coordination Runtime
Ccr - Concurrency and Coordination RuntimeCcr - Concurrency and Coordination Runtime
Ccr - Concurrency and Coordination Runtime
 
A New Modified Version of Caser Cipher Algorithm
A New Modified Version of Caser Cipher AlgorithmA New Modified Version of Caser Cipher Algorithm
A New Modified Version of Caser Cipher Algorithm
 
Metrics ekon 14_2_kleiner
Metrics ekon 14_2_kleinerMetrics ekon 14_2_kleiner
Metrics ekon 14_2_kleiner
 
Intake 37 12
Intake 37 12Intake 37 12
Intake 37 12
 
Inter threadcommunication.38
Inter threadcommunication.38Inter threadcommunication.38
Inter threadcommunication.38
 
CS2309 JAVA LAB MANUAL
CS2309 JAVA LAB MANUALCS2309 JAVA LAB MANUAL
CS2309 JAVA LAB MANUAL
 
Why we cannot ignore Functional Programming
Why we cannot ignore Functional ProgrammingWhy we cannot ignore Functional Programming
Why we cannot ignore Functional Programming
 
Matlab file
Matlab fileMatlab file
Matlab file
 
Mutual exclusion and sync
Mutual exclusion and syncMutual exclusion and sync
Mutual exclusion and sync
 
Multithreading in Java
Multithreading in JavaMultithreading in Java
Multithreading in Java
 
Java Concurrency, Memory Model, and Trends
Java Concurrency, Memory Model, and TrendsJava Concurrency, Memory Model, and Trends
Java Concurrency, Memory Model, and Trends
 
Java lab 2
Java lab 2Java lab 2
Java lab 2
 
Chaotic Variations Of Aes Algorithm
Chaotic Variations Of Aes AlgorithmChaotic Variations Of Aes Algorithm
Chaotic Variations Of Aes Algorithm
 
CHAOTIC VARIATIONS OF AES ALGORITHM
CHAOTIC VARIATIONS OF AES ALGORITHMCHAOTIC VARIATIONS OF AES ALGORITHM
CHAOTIC VARIATIONS OF AES ALGORITHM
 
Novel Algorithm For Encryption:Hybrid of Transposition and Substitution Method
Novel Algorithm For Encryption:Hybrid of Transposition and Substitution MethodNovel Algorithm For Encryption:Hybrid of Transposition and Substitution Method
Novel Algorithm For Encryption:Hybrid of Transposition and Substitution Method
 
Sync, async and multithreading
Sync, async and multithreadingSync, async and multithreading
Sync, async and multithreading
 
Csphtp1 14
Csphtp1 14Csphtp1 14
Csphtp1 14
 

Viewers also liked

Squire Technologies: Signal Transfer Point
Squire Technologies: Signal Transfer PointSquire Technologies: Signal Transfer Point
Squire Technologies: Signal Transfer PointSquire Technologies
 
Squire Technologes: Session Border Controller
Squire Technologes: Session Border Controller Squire Technologes: Session Border Controller
Squire Technologes: Session Border Controller Squire Technologies
 
Squire Technologies: Media Gateway Controller
Squire Technologies: Media Gateway ControllerSquire Technologies: Media Gateway Controller
Squire Technologies: Media Gateway ControllerSquire Technologies
 
Squire Technologies: Signalling Gateway
Squire Technologies: Signalling GatewaySquire Technologies: Signalling Gateway
Squire Technologies: Signalling GatewaySquire Technologies
 
Squire Technologies: Media Gateway
Squire Technologies: Media GatewaySquire Technologies: Media Gateway
Squire Technologies: Media GatewaySquire Technologies
 
Squire Technologies: Short Message Server Gateway
Squire Technologies: Short Message Server GatewaySquire Technologies: Short Message Server Gateway
Squire Technologies: Short Message Server GatewaySquire Technologies
 
Squire Technologies: Media Gateway Controller Function
Squire Technologies: Media Gateway Controller FunctionSquire Technologies: Media Gateway Controller Function
Squire Technologies: Media Gateway Controller FunctionSquire Technologies
 
Squire Technologies: Rolling Out A VoLTE Network
Squire Technologies: Rolling Out A VoLTE NetworkSquire Technologies: Rolling Out A VoLTE Network
Squire Technologies: Rolling Out A VoLTE NetworkSquire Technologies
 
Squire Technologies: Short Message Service Centre
Squire Technologies: Short Message Service CentreSquire Technologies: Short Message Service Centre
Squire Technologies: Short Message Service CentreSquire Technologies
 

Viewers also liked (14)

SCN_0001
SCN_0001SCN_0001
SCN_0001
 
Business solutions
Business solutionsBusiness solutions
Business solutions
 
Tecnología y Sociedad
Tecnología y SociedadTecnología y Sociedad
Tecnología y Sociedad
 
Squire Technologies:SVI 9220
Squire Technologies:SVI 9220Squire Technologies:SVI 9220
Squire Technologies:SVI 9220
 
Squire Technologies: Signal Transfer Point
Squire Technologies: Signal Transfer PointSquire Technologies: Signal Transfer Point
Squire Technologies: Signal Transfer Point
 
Squire Technologes: Session Border Controller
Squire Technologes: Session Border Controller Squire Technologes: Session Border Controller
Squire Technologes: Session Border Controller
 
Squire Technologies: Media Gateway Controller
Squire Technologies: Media Gateway ControllerSquire Technologies: Media Gateway Controller
Squire Technologies: Media Gateway Controller
 
Squire Technologies: Signalling Gateway
Squire Technologies: Signalling GatewaySquire Technologies: Signalling Gateway
Squire Technologies: Signalling Gateway
 
Squire Technologies: Media Gateway
Squire Technologies: Media GatewaySquire Technologies: Media Gateway
Squire Technologies: Media Gateway
 
Squire Technologies: Short Message Server Gateway
Squire Technologies: Short Message Server GatewaySquire Technologies: Short Message Server Gateway
Squire Technologies: Short Message Server Gateway
 
Squire Technologies: Media Gateway Controller Function
Squire Technologies: Media Gateway Controller FunctionSquire Technologies: Media Gateway Controller Function
Squire Technologies: Media Gateway Controller Function
 
Squire Technologies: Rolling Out A VoLTE Network
Squire Technologies: Rolling Out A VoLTE NetworkSquire Technologies: Rolling Out A VoLTE Network
Squire Technologies: Rolling Out A VoLTE Network
 
SMS is alive and well
SMS is alive and wellSMS is alive and well
SMS is alive and well
 
Squire Technologies: Short Message Service Centre
Squire Technologies: Short Message Service CentreSquire Technologies: Short Message Service Centre
Squire Technologies: Short Message Service Centre
 

Similar to genalg

Hybrid Model Based Testing Tool Architecture for Exascale Computing System
Hybrid Model Based Testing Tool Architecture for Exascale Computing SystemHybrid Model Based Testing Tool Architecture for Exascale Computing System
Hybrid Model Based Testing Tool Architecture for Exascale Computing SystemCSCJournals
 
Chapter 4: Parallel Programming Languages
Chapter 4: Parallel Programming LanguagesChapter 4: Parallel Programming Languages
Chapter 4: Parallel Programming LanguagesHeman Pathak
 
session-1_Design_Analysis_Algorithm.pptx
session-1_Design_Analysis_Algorithm.pptxsession-1_Design_Analysis_Algorithm.pptx
session-1_Design_Analysis_Algorithm.pptxchandankumar364348
 
Intel Cluster Poisson Solver Library
Intel Cluster Poisson Solver LibraryIntel Cluster Poisson Solver Library
Intel Cluster Poisson Solver LibraryIlya Kryukov
 
Introduction to data structures and Algorithm
Introduction to data structures and AlgorithmIntroduction to data structures and Algorithm
Introduction to data structures and AlgorithmDhaval Kaneria
 
Cupdf.com introduction to-data-structures-and-algorithm
Cupdf.com introduction to-data-structures-and-algorithmCupdf.com introduction to-data-structures-and-algorithm
Cupdf.com introduction to-data-structures-and-algorithmTarikuDabala1
 
Introduction to data structures and Algorithm
Introduction to data structures and AlgorithmIntroduction to data structures and Algorithm
Introduction to data structures and AlgorithmDhaval Kaneria
 
Debugging and optimization of multi-thread OpenMP-programs
Debugging and optimization of multi-thread OpenMP-programsDebugging and optimization of multi-thread OpenMP-programs
Debugging and optimization of multi-thread OpenMP-programsPVS-Studio
 
IRJET- Latin Square Computation of Order-3 using Open CL
IRJET- Latin Square Computation of Order-3 using Open CLIRJET- Latin Square Computation of Order-3 using Open CL
IRJET- Latin Square Computation of Order-3 using Open CLIRJET Journal
 
Clustering_Algorithm_DR
Clustering_Algorithm_DRClustering_Algorithm_DR
Clustering_Algorithm_DRNguyen Tran
 
Android Application Development - Level 3
Android Application Development - Level 3Android Application Development - Level 3
Android Application Development - Level 3Isham Rashik
 
DataMiningReport
DataMiningReportDataMiningReport
DataMiningReport?? ?
 
ICS 2410.Parallel.Sytsems.Lecture.Week 3.week5.pptx
ICS 2410.Parallel.Sytsems.Lecture.Week 3.week5.pptxICS 2410.Parallel.Sytsems.Lecture.Week 3.week5.pptx
ICS 2410.Parallel.Sytsems.Lecture.Week 3.week5.pptxjohnsmith96441
 
User_42751212015Module1and2pagestocompetework.pdf.docx
User_42751212015Module1and2pagestocompetework.pdf.docxUser_42751212015Module1and2pagestocompetework.pdf.docx
User_42751212015Module1and2pagestocompetework.pdf.docxdickonsondorris
 
29 19 sep17 17may 6637 10140-1-ed(edit)
29 19 sep17 17may 6637 10140-1-ed(edit)29 19 sep17 17may 6637 10140-1-ed(edit)
29 19 sep17 17may 6637 10140-1-ed(edit)IAESIJEECS
 
29 19 sep17 17may 6637 10140-1-ed(edit)
29 19 sep17 17may 6637 10140-1-ed(edit)29 19 sep17 17may 6637 10140-1-ed(edit)
29 19 sep17 17may 6637 10140-1-ed(edit)IAESIJEECS
 
Intro to programing with java-lecture 3
Intro to programing with java-lecture 3Intro to programing with java-lecture 3
Intro to programing with java-lecture 3Mohamed Essam
 

Similar to genalg (20)

Hybrid Model Based Testing Tool Architecture for Exascale Computing System
Hybrid Model Based Testing Tool Architecture for Exascale Computing SystemHybrid Model Based Testing Tool Architecture for Exascale Computing System
Hybrid Model Based Testing Tool Architecture for Exascale Computing System
 
Chapter 4: Parallel Programming Languages
Chapter 4: Parallel Programming LanguagesChapter 4: Parallel Programming Languages
Chapter 4: Parallel Programming Languages
 
session-1_Design_Analysis_Algorithm.pptx
session-1_Design_Analysis_Algorithm.pptxsession-1_Design_Analysis_Algorithm.pptx
session-1_Design_Analysis_Algorithm.pptx
 
DATA STRUCTURE.pdf
DATA STRUCTURE.pdfDATA STRUCTURE.pdf
DATA STRUCTURE.pdf
 
DATA STRUCTURE
DATA STRUCTUREDATA STRUCTURE
DATA STRUCTURE
 
Intel Cluster Poisson Solver Library
Intel Cluster Poisson Solver LibraryIntel Cluster Poisson Solver Library
Intel Cluster Poisson Solver Library
 
Introduction to data structures and Algorithm
Introduction to data structures and AlgorithmIntroduction to data structures and Algorithm
Introduction to data structures and Algorithm
 
Cupdf.com introduction to-data-structures-and-algorithm
Cupdf.com introduction to-data-structures-and-algorithmCupdf.com introduction to-data-structures-and-algorithm
Cupdf.com introduction to-data-structures-and-algorithm
 
Introduction to data structures and Algorithm
Introduction to data structures and AlgorithmIntroduction to data structures and Algorithm
Introduction to data structures and Algorithm
 
Debugging and optimization of multi-thread OpenMP-programs
Debugging and optimization of multi-thread OpenMP-programsDebugging and optimization of multi-thread OpenMP-programs
Debugging and optimization of multi-thread OpenMP-programs
 
IRJET- Latin Square Computation of Order-3 using Open CL
IRJET- Latin Square Computation of Order-3 using Open CLIRJET- Latin Square Computation of Order-3 using Open CL
IRJET- Latin Square Computation of Order-3 using Open CL
 
Clustering_Algorithm_DR
Clustering_Algorithm_DRClustering_Algorithm_DR
Clustering_Algorithm_DR
 
Android Application Development - Level 3
Android Application Development - Level 3Android Application Development - Level 3
Android Application Development - Level 3
 
DataMiningReport
DataMiningReportDataMiningReport
DataMiningReport
 
Daa chapter 1
Daa chapter 1Daa chapter 1
Daa chapter 1
 
ICS 2410.Parallel.Sytsems.Lecture.Week 3.week5.pptx
ICS 2410.Parallel.Sytsems.Lecture.Week 3.week5.pptxICS 2410.Parallel.Sytsems.Lecture.Week 3.week5.pptx
ICS 2410.Parallel.Sytsems.Lecture.Week 3.week5.pptx
 
User_42751212015Module1and2pagestocompetework.pdf.docx
User_42751212015Module1and2pagestocompetework.pdf.docxUser_42751212015Module1and2pagestocompetework.pdf.docx
User_42751212015Module1and2pagestocompetework.pdf.docx
 
29 19 sep17 17may 6637 10140-1-ed(edit)
29 19 sep17 17may 6637 10140-1-ed(edit)29 19 sep17 17may 6637 10140-1-ed(edit)
29 19 sep17 17may 6637 10140-1-ed(edit)
 
29 19 sep17 17may 6637 10140-1-ed(edit)
29 19 sep17 17may 6637 10140-1-ed(edit)29 19 sep17 17may 6637 10140-1-ed(edit)
29 19 sep17 17may 6637 10140-1-ed(edit)
 
Intro to programing with java-lecture 3
Intro to programing with java-lecture 3Intro to programing with java-lecture 3
Intro to programing with java-lecture 3
 

genalg

  • 1. Experiences Programming for GPUs with OpenCL Oliver Fishstein, Alejandro Villegas Abstract This project looks at the complexities of parallel com- puting with GPUs using OpenCL. In order to write a parallel program, there is needed to understand the possi- ble synchronization issues along with the GPU execution and memory models, which are discussed in this paper. These concepts were used to write a Genetic Algorithm to solve the Knapsack Problem in OpenCL and analyze the differences of using local memory and global memory to compute. Keywords: OpenCL, Parallel Programming, GPGPU 1 Introduction General purpose computing on GPUs is the utiliza- tion of the graphics processing unit to perform com- putations and applications traditionally performed on the CPU, instead of limiting the GPU’s uses to the traditional graphics computations. Programming GPUs takes advantage of the parallel nature of graph- ics processing to increase the runtime speed [1]. The current dominant general-purpose GPU comput- ing language is OpenCL. It defines a C-like language, called kernels, that are executed on compute devices or accelerators. In our case, we will focus on GPUs as the accelerator. In order to execute OpenCL kernels, it is necessary to write a host program in either C or C++ that launches kernels on the compute device and manages the device memory, which is usually seperate from the host memory [2]. 1.1 Sychronization Issues in Parallel Programming One type of synchronization issue in parallel pro- gramming is hazards. There are three types of haz- ards: read after write, write after read, and write after write. These occur when instructions from dif- ferent execution threads modify shared data in an unexpected temporal order [3]. An example of code with hazards can be seen in code listing 1. Listing 1: Hazard Example 1// Both threads share this var. 2shared int a[2]; 3// Each thread has a private copy of this var. 
4private int b; 5// Returns 0 for thread 0 and 1 for thread 1 6private int id = get_id (); 7a[id] = id; // line 1 8b = a[1-id]; // line 2 9a[id] = b; // line 3 There is a read after write hazard in lines 1 and 2 because line 2 in the thread 1 can be executed before the thread 0 executes the line 1. In addition, there is another read after write along with a write after read hazard in lines 2 and 3 because line 3 in the thread 1 can be executed before the thread 0 executes the line 2. In order to remove these hazards, a barrier can be placed in between lines 1 and 2 along with lines 2 and 3. This indicates that all threads must reach the barrier, written as barrier(), before proceeding to the next portion of the program. The other type of synchronization issue is critcal sec- tions. Critical sections are lines in code that access a shared resource (device or data structure) that can- not be concurrently accessed by more than one thread at a time [4]. This can lead to results that are not ex- pected by the programmer due to threads interfering with one another. An example of code with a critical section can be seen in code listing 2.
  • 2. Listing 2: Critical Section Example 1# define EMPTY -1 2void insert (int * list , int val) { 3 int i = 0; 4 while(list[i] != EMPTY) i++; // line 1 5 list[i] = i; // line 2 6} There is a critical section in lines 1 and 2. By mak- ing these lines a critical section, the only possible outputs are list = 0 1 and list = 1 0 depending on which thread takes the lead. In order to implement the critical section, locks can be used. A lock is a variable that can have the value 1, indicating locked, or the value 0, indicating unlocked. A thread can check the state of the lock, and if it is locked, it will wait until it is unlocked. If it is unlocked, the thread sets the lock as locked, executes the critical section, and then sets the lock as unlocked. Thread APIs usu- ally provide the programmer with functions to define and use locks, which rely in lower level atomic oper- ations. Code listing 3 is the code from code listing 2 corrected with a lock. Listing 3: Lock Example 1# define EMPTY -1 2void insert (int * list , int val) { 3 int i = 0; 4 while(getLock (lock) == false) {} 5 while(list[i] != EMPTY) i++; // line 1 6 list[i] = 1; // line 2 7 releaseLock(); 8} 1.2 GPU Execution Model The GPU Execution model in OpenCL consists of workitems, wavefronts, and workgroups. Workitems are the total amount of threads that the program is using. These are divided up into workgroups that are going to be executed in the individual compute units located in the GPU. The amount of workitems in each workgroup is a multiply of the size of the wavefront, which is the amount of workitems that can run concurrently within a compute unit. The wavefront size is a fixed size defined by the hardware. Each workitem in the wavefront executes in lock-step with the other workitems in the same wavefront. A workgroup can be made up of multiple wavefronts. 1.3 GPU Memory Model The GPU memory model consists of global memory and local memory. The global memory is allocated by the host program. 
This memory is visible to all the workitems running on the GPU. The local memory is a smaller and faster portion of memory. It is declared within the kernel and is visible only to the workitems belonging to the same workgroup. Generally, computations in local memory are faster than computations in global memory.

2 Genetic Algorithm and the Knapsack Problem

A genetic algorithm is a search heuristic that mimics the process of natural selection. The knapsack problem is a particular application of genetic algorithms. Both are detailed below.

2.1 Genetic Algorithm

A genetic algorithm uses the biological concept of natural selection to search for the best or "most fit" solution [5]. A population of potential solutions to an optimization problem is evolved toward better solutions. The evolution usually starts from a population of randomly generated values, although it can also use a predefined population, and proceeds through an iterative process whose steps are called "generations". After the population of solutions is initialized, the selection process begins. During each iteration, values are selected for the next generation by comparing two existing solutions and choosing the "fitter" one. The fitness comparison can take place with every value or only with a randomized sample. The next step of the genetic algorithm is to mutate the fitter result. The mutation depends on the problem the algorithm is being applied to solve. The mutated value then replaces the less fit value. All of the values are then shuffled in order to change which values are compared against each other. This whole process is repeated for as many iterations or "generations" as needed, although generally more iterations
will give better results, especially when working with a more complex mutation.

2.2 Knapsack Problem

The knapsack problem is a specific type of problem that can be solved by a genetic algorithm. Given a set of items, each with a corresponding mass, determine which items to include in the "knapsack" so that the total weight is less than or equal to a given limit. Ideally, the total weight should equal the limit. The complexity of the problem can be increased by adding additional criteria such as value and dimensions [6].

3 Implementing the Knapsack Problem in OpenCL

The implementation of the knapsack problem in OpenCL used here consisted of an input of 128 values, initialized to the values 0 through 127, and used 64 workitems to compute the ideal value. The entire implementation ran in a single workgroup and a single wavefront. The goal mass was 500. To compute the mass for each input, its binary representation was compared against a preexisting array of values (5, 10, 20, 50, 100, 300, 200, 150); when a bit was high, the corresponding array value was added to the total mass for the input. For example, an element with a binary value of 11010000 (written least-significant bit first) has a mass of 5+10+50.

For the comparison process, the input corresponding to a workitem was compared to the input corresponding to the workitem's index plus 64, so that all 128 values were accessed. To determine which value was more fit, several conditions were set. If both values were less than the goal, the larger value was selected. If either value was equal to the goal, it was selected, and if both were equal to the goal, the first value was selected. If both values were greater than the goal, the smaller value was selected. The mutation for this version of the algorithm was to replace the unfit value with the fit value.
The selected values were then written back into the input array, and the values corresponding to the workitems were shuffled by adding 1 and taking the modulus of 128 before being placed back in the input array. Barriers were used to ensure that no hazards occurred during the replacement process.

The entire process is repeated for a set number of iterations in order to find the ideal result. The minimum number of iterations needed to ensure that the ideal result was found was 100. The algorithm was run with several different iteration counts in order to find the ideal result and compare compute times. The ideal result was the decimal value 96, which corresponds to a mass equal to the goal of 500.

4 Evaluation

The initial version of the knapsack problem implementation used global memory. This involved constantly passing values back to global memory, which is generally a time-consuming process. The algorithm was tested at 100, 1000, 10000, and 100000 iterations, and every time the result was the ideal value of 96. To determine the compute time, timing functions from C++11 were included in the host program. The program was run five times for each iteration count in order to get a good grasp of the compute time; these values can be seen in table 1.

Tab. 1: Global Memory Results

Run   100 Iters    1000 Iters   10000 Iters   100000 Iters
1     0.000927s    0.002132s    0.014585s     0.147747s
2     0.000820s    0.002078s    0.014294s     0.140781s
3     0.000701s    0.002178s    0.014957s     0.142362s
4     0.000794s    0.002183s    0.024343s     0.138756s
5     0.000811s    0.002082s    0.016055s     0.159741s

The genetic algorithm was also implemented using local memory. Theoretically, this implementation should be significantly faster, but copying every value from global memory to local memory required an additional barrier and took a significant amount of time. In addition, the computation was not complex enough to make up for that time, so the compute times in global memory and local memory were equivalent. The compute time values are in table 2.
Tab. 2: Local Memory Results

Run   100 Iters    1000 Iters   10000 Iters   100000 Iters
1     0.000803s    0.002324s    0.014757s     0.208451s
2     0.000789s    0.002060s    0.021008s     0.134539s
3     0.000841s    0.002004s    0.015843s     0.135663s
4     0.000816s    0.002197s    0.014649s     0.137022s
5     0.000755s    0.002107s    0.014124s     0.144065s

4.1 Testbed Characteristics

CPU: 2.7 GHz Intel Core i7
GPU: Intel HD Graphics 4000, NVIDIA GeForce GT 650M
Memory: 16 GB 1600 MHz DDR3
Operating System: OS X Yosemite

5 Conclusions

Through working on this project, a solid understanding of programming for GPUs with OpenCL was developed. Learning about the synchronization issues in parallel programming built the programming skills needed to complete more complex algorithms such as the genetic algorithm discussed in this paper. Knowledge of the GPU execution and memory models gave a better understanding of how OpenCL interacts with the GPU, allowing for more effective programming.

References

[1] General-purpose computing on graphics processing units. http://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units.
[2] OpenCL. http://en.wikipedia.org/wiki/OpenCL.
[3] Hazard (computer architecture). http://en.wikipedia.org/wiki/Hazard_(computer_architecture).
[4] Critical section. http://en.wikipedia.org/wiki/Critical_section.
[5] Genetic algorithm. http://en.wikipedia.org/wiki/Genetic_algorithm.
[6] Knapsack problem. http://en.wikipedia.org/wiki/Knapsack_problem.