Orthogonal Matching Pursuit in 2D for Java with
GPGPU Prospectives
Matt Simons
6th May 2014
Abstract
With the advent of new processing technologies which can bring about sig-
nificant performance and quality gains to image processing, now is a good time
to make contributions to that field. The Orthogonal Matching Pursuit algo-
rithm dedicated to two dimensions (OMP2D), is an effective approach for image
representation outside the traditional transformation framework. The project
has focussed on creating a Java implementation of OMP2D, because no current
implementation existed. This report outlines the algorithm, details how it has
been implemented in Java and discusses the performance gains which Graphics
Processor Unit (GPU) accelerated processing can yield in calculating it.
A fully implemented OMP2D ImageJ plugin has been produced with good
performance results, especially considering its inability to benefit from optimised
linear algebra libraries. The developed open source software and documentation
have been made publicly available through a dedicated website. Methods of fully
exploiting GPUs are discussed, proposing how further performance improve-
ments could be made through mass parallelisation techniques.
Preface
This project follows on from the research that I undertook while writing the Mathe-
matics Report (MR) entitled: Image processing with Java [1]. In that report I have
addressed issues such as: Why Java should be used for image processing, the advan-
tages of developing using the ImageJ libraries and platform and the current challenges
for achieving GPU acceleration in image processing techniques on a personal computer.
For the reader’s convenience I have transcribed in this section the statement of con-
tributions of my MR as well as my final reflections.
Statement of contributions of the MR
The preceding report:
• Provides an introductory tutorial for image processing with Java.
• Reviews current practices for storing images in lossless and lossy formats and
which is the most appropriate for a given application.
• Introduces the reader, who is assumed to have a basic knowledge of programming or
scripting in a language such as MATLAB or C, to object-oriented programming using
Java.
• Describes the advantages of object-oriented programming in terms of image pro-
cessing methods.
• Demonstrates examples of image processing methods using ImageJ plugins.
• Informs the reader of emerging technologies in the field of parallelisation.
• Makes critical arguments on how best to implement GPU acceleration in Java.
• Theorises the extent to which emerging technologies such as future Java releases
and NVIDIA Grid could impact image processing methods.
Reflections of the MR
Its final reflections were:
• The object-oriented nature of Java enables it to excel at point-wise operations
on pixels or groups of pixels.
• There were many image file formats before the standardising effects of the inter-
net. This process brought us a handful of file formats still used today; each of
which is optimised for a specific purpose by experts.
• Many image processing techniques can be accelerated by using parallelisation
methods.
• Parallel processing, when used in appropriate circumstances, can provide a sub-
stantial speed-up.
• There is significant interest in exploiting GPU architectures to further the func-
tionality that parallelisation can offer.
• Both CUDA and OpenCL are equal standards achieving similar performances.
• When exploiting GPUs, bottlenecks such as memory transfer speeds must be
weighed against the advantages of mass parallelisation. In the event of small data
sets the transfer time can outweigh the time saved by processing a method in parallel.
• The JCuda library is the most revered for producing GPGPU code with less
development time.
• Work which focuses on the inclusion of JCuda optimisations for ImageJ functions
is under continuous development.
• Java vendors are positioning themselves to utilise the newly available GPU APIs
in newer versions.
• Distributed or cluster computing could also be used to effect parallelisation meth-
ods.
• Using NVIDIA grid computations where clusters of GPUs can be accessed and
virtualised may yield faster results.
Contents
1 Introduction
2 Sparse Image Representation
2.1 Orthogonal Matching Pursuit in 2D
2.2 Quantification of sparsity and approximation quality
2.3 Computational complexity and storage
2.4 Blockwise approximation
2.5 Constructing the dictionaries
3 OMP2D in Java
3.1 Expanding Matrices
3.2 Multi-threading
3.3 ImageJ Plugin
3.4 Performance Optimisations
3.5 BLAS Library acceleration
3.6 Comparisons with other implementations
4 GPU Acceleration
4.1 GPU Architecture
4.2 The PRAM model of parallel computation
4.3 Kernel Construction
4.4 Memory Management
5 GPU Programming Patterns
5.1 Map
5.2 Reduce & Scan
5.3 Scatter
5.4 Gather
6 Parallel Programming Patterns in Image Processing
7 Applying GPU Acceleration to OMP2D
7.1 Block-wise Parallelisation
7.2 Matrix Operations Parallelisation
8 Conclusions
8.1 Reflections
8.2 Future Work
9 Appendices
A Installing the OMP2D plugin in ImageJ
1 Introduction
Image processing is still a relatively young branch of applied mathematics which is
continually evolving. An especially notable application, where research has been very
active, is image compression. The first step of a typical compression scheme involves
a mathematical transformation which maps the image onto a transformed domain,
where the representation is of a smaller dimension. When a significant reduction in the
dimensionality is achieved, the image is said to be sparse in the transformed domain.
The popular compression standard JPEG implements the transformation using the
Discrete Cosine Transform, whilst the updated version, JPEG2000, uses the Discrete
Wavelet Transform.
In the last fifteen years however, research has demonstrated that significant improve-
ments in the sparsity of an image’s representation can be obtained if traditional trans-
formations are replaced by approaches called pursuit strategies [2]. Within this frame-
work, a given image is represented as a superposition of elements which are chosen from
a redundant set which is of much larger dimension than the image to be approximated.
The seminal approach in this direction is a simple algorithm called Matching Pursuit
(MP) [3]. It has evolved into a number of refinements of higher complexity though.
One such refinement is the Orthogonal Matching Pursuit (OMP) approach [4]. It has
recently been shown that a dedicated implementation of this approach, in two dimen-
sions (OMP2D), is very effective in terms of speed and performance for sparse image
representation [5,6].
Applications of sparse representations go beyond image compression to benefit areas
such as image denoising and feature extraction [7]. Since to the best of my knowledge
there was no implementation of OMP2D in Java, I decided to produce one and make it
publicly available to the scientific and academic community. As it is a Java application,
my implementation is platform independent (write and compile once, and it will run on
any operating system), making it widely accessible. I also developed a companion plugin
to facilitate its use in the free and user friendly image processing environment ImageJ.
As previously mentioned in my Mathematics Report (MR) [1], ImageJ was created
by the US National Institutes of Health for medical applications of image processing,
and is widely used in academia for other image processing fields.
My implementation was created in anticipation of accelerating capabilities being
added later; with that in mind it was made to be easily extendible. As such, I set out
methods for optimising my program's performance by exploiting the massive parallel
computing capabilities of GPUs through the use of General-Purpose computing on
Graphics Processing Units (GPGPU) programming technologies.
Statement of Contributions of the Mathematics Project
The problems I have addressed in this project are motivated and supported by the
content of the previous MR.
The contributions of the project include:
• A platform-independent implementation of the Orthogonal Matching Pursuit
approach in Java.
• An ImageJ plugin which applies the implementation.
• An overview of two methods of applying GPGPU acceleration to OMP2D.
• A CleverPointer class capable of automated memory management compatible
with Java’s GC.
• A comparison of the performance of my Java implementation with existing imple-
mentations of the algorithm in MATLAB and C++ Mex.
• Construction of a website for the free distribution of the software under an open
source license http://simonsm1.github.io/OMP2D/.
2 Sparse Image Representation
Let us start the mathematical setting of the subject matter by introducing the adopted
notational convention:
$\mathbb{R}$ and $\mathbb{N}$ represent the sets of real and natural numbers, respectively. Boldface letters
are used to indicate Euclidean vectors or matrices, whilst standard mathematical fonts
indicate components, e.g. $\mathbf{d} \in \mathbb{R}^{N}$ is a vector of components $d(i),\, i = 1, \ldots, N$ and
$\mathbf{I} \in \mathbb{R}^{N_x \times N_y}$ is a matrix of elements $I(i,j),\, i = 1, \ldots, N_x,\, j = 1, \ldots, N_y$. The notation
$\langle \mathbf{B}, \mathbf{I} \rangle_F$ indicates the Frobenius inner product of $\mathbf{B}$ and $\mathbf{I}$ as defined by
$$\langle \mathbf{B}, \mathbf{I} \rangle_F = \sum_{i=1}^{N_x} \sum_{j=1}^{N_y} B(i,j)\, I(i,j).$$
The corresponding induced norm is indicated as $\|\cdot\|_F^2 = \langle \cdot, \cdot \rangle_F$.
Given an image, which is expressed as an array $\mathbf{I} \in \mathbb{R}^{N_x \times N_y}$ of intensity value pixels
(e.g. 0-255), we can approximate it by the linear decomposition
$$\mathbf{I}^K = \sum_{k=1}^{K} c_k\, \mathbf{d}_{\ell_k}, \qquad (1)$$
where each $c_k$ is a scalar and each $\mathbf{d}_{\ell_k}$ is an element of $\mathbb{R}^{N_x \times N_y}$ which is selected from
a set, $\mathcal{D} = \{\mathbf{d}_n\}_{n=1}^{M}$, called a 'dictionary'.
A sparse approximation of $\mathbf{I} \in \mathbb{R}^{N_x \times N_y}$ is an approximation of the form (1) such that
the number of elements $K$ in the decomposition is significantly smaller than those in
$\mathbf{I}$. The terms in the decomposition (1) are chosen from a large redundant dictionary.
The chosen elements $\mathbf{d}_{\ell_k}$ in (1) are called 'atoms' and are selected according to an
optimality criterion.
The most sparse decomposition of an image, within the redundant dictionary frame-
work for approximation, can be found by approximating the image by the ‘atomic
decomposition’ (1) such that the number K of atoms is minimum.
Minimising the number of atoms in (1), however, amounts to an exhaustive
combinatorial search and is therefore intractable. Hence, instead of
looking for the sparsest solution we look for a 'satisfactory solution', i.e. a solution such
that the number $K$ of terms in (1) is considerably smaller than the image dimension.
This can be effectively achieved by the greedy technique called Orthogonal Matching
Pursuit (OMP). This approach selects the atoms in the decomposition (1) in a stepwise
manner, as will be described in the next section.
2.1 Orthogonal Matching Pursuit in 2D
OMP was first introduced in [4]. The implementation I will be using here is dedicated
to 2D (OMP2D). This version of the algorithm uses separable dictionaries, i.e.
a 2D dictionary which corresponds to the tensor product of two 1D dictionaries. The
implementation is based on adaptive biorthogonalization and Gram-Schmidt orthog-
onalization procedures, as proposed in [8] for the one dimensional case and generalized to
separable 2D dictionaries in [5,6].
The images which I will be considering are assumed to be grey-scale images. Given
an image $\mathbf{I} \in \mathbb{R}^{N_x \times N_y}$ and two 1D dictionaries $\mathcal{D}^x = \{\mathbf{d}^x_n \in \mathbb{R}^{N_x}\}_{n=1}^{M_x}$ and
$\mathcal{D}^y = \{\mathbf{d}^y_m \in \mathbb{R}^{N_y}\}_{m=1}^{M_y}$ we approximate the array $\mathbf{I} \in \mathbb{R}^{N_x \times N_y}$ by the decomposition
$$\mathbf{I}^K(i,j) = \sum_{n=1}^{K} c_n\, d^x_{\ell^x_n}(i)\, d^y_{\ell^y_n}(j). \qquad (2)$$
For selecting the atoms $\mathbf{d}^x_{\ell^x_n}, \mathbf{d}^y_{\ell^y_n},\, n = 1, \ldots, K$ the OMP selection criterion evolves as
follows:
When setting $\mathbf{R}^0 = \mathbf{I}$, for iteration $k+1$ the algorithm selects the atoms
$\mathbf{d}^x_{\ell^x_{k+1}} \in \mathcal{D}^x$ and $\mathbf{d}^y_{\ell^y_{k+1}} \in \mathcal{D}^y$ which maximize the Frobenius inner product's absolute value
$|\langle \mathbf{d}^x_n, \mathbf{R}^k \mathbf{d}^y_m \rangle_F|,\, n = 1, \ldots, M_x,\, m = 1, \ldots, M_y$, i.e.
$$\ell^x_{k+1}, \ell^y_{k+1} = \arg\max_{\substack{n=1,\ldots,M_x \\ m=1,\ldots,M_y}} \left| \sum_{i=1}^{N_x} \sum_{j=1}^{N_y} d^x_n(i)\, R^k(i,j)\, d^y_m(j) \right|,
\quad \text{where} \quad
R^k(i,j) = I(i,j) - \sum_{n=1}^{k} c_n\, d^x_{\ell^x_n}(i)\, d^y_{\ell^y_n}(j). \qquad (3)$$
The coefficients $c_n,\, n = 1, \ldots, k$ are calculated such that $\|\mathbf{R}^k\|_F$ is minimised. This is
achieved by calculating these coefficients as
$$c_n = \langle \mathbf{B}^k_n, \mathbf{I} \rangle_F, \qquad n = 1, \ldots, k, \qquad (4)$$
where the matrices $\mathbf{B}^k_n,\, n = 1, \ldots, k$, are recursively constructed at each iteration in
such a way that $\|\mathbf{R}^k\|_F$ is minimised. This is ensured by requesting that
$\mathbf{R}^k = \mathbf{I} - \hat{P}_{\mathcal{V}_k} \mathbf{I}$, where $\hat{P}_{\mathcal{V}_k}$ is the orthogonal projection operator onto
$\mathcal{V}_k = \mathrm{span}\{\mathbf{d}^x_{\ell^x_n} \otimes \mathbf{d}^y_{\ell^y_n}\}_{n=1}^{k}$. The required representation of $\hat{P}_{\mathcal{V}_k}$ is of the form
$\hat{P}_{\mathcal{V}_k} \mathbf{I} = \sum_{n=1}^{k} \mathbf{A}_n \langle \mathbf{B}^k_n, \mathbf{I} \rangle_F$, where each $\mathbf{A}_n \in \mathbb{R}^{N_x \times N_y}$ is an array with the selected
atoms $\mathbf{A}_n = \mathbf{d}^x_{\ell^x_n} \otimes \mathbf{d}^y_{\ell^y_n}$ and $\mathbf{B}^k_n,\, n = 1, \ldots, k$ the concomitant reciprocal matrices.
These are the unique elements of $\mathbb{R}^{N_x \times N_y}$ satisfying the conditions:

i) $\langle \mathbf{A}_n, \mathbf{B}^k_m \rangle_F = \delta_{n,m} = \begin{cases} 1 & \text{if } n = m \\ 0 & \text{if } n \neq m. \end{cases}$

ii) $\mathcal{V}_k = \mathrm{span}\{\mathbf{B}^k_n\}_{n=1}^{k}$.
Such matrices can be adaptively constructed through the recursion formula [8]:
$$\mathbf{B}^{k+1}_n = \mathbf{B}^k_n - \mathbf{B}^{k+1}_{k+1} \langle \mathbf{A}_{k+1}, \mathbf{B}^k_n \rangle_F, \qquad n = 1, \ldots, k,$$
where
$$\mathbf{B}^{k+1}_{k+1} = \mathbf{W}_{k+1} / \|\mathbf{W}_{k+1}\|_F^2, \quad \text{with} \quad \mathbf{W}_1 = \mathbf{A}_1 \quad \text{and} \quad
\mathbf{W}_{k+1} = \mathbf{A}_{k+1} - \sum_{n=1}^{k} \frac{\mathbf{W}_n}{\|\mathbf{W}_n\|_F^2} \langle \mathbf{W}_n, \mathbf{A}_{k+1} \rangle_F. \qquad (5)$$
For numerical accuracy of the orthogonality of the matrices $\mathbf{W}_n,\, n = 1, \ldots, k+1$ at least
one re-orthogonalization step has to be implemented. This implies recalculating the
matrices as
$$\mathbf{W}_{k+1} = \mathbf{W}_{k+1} - \sum_{n=1}^{k} \frac{\mathbf{W}_n}{\|\mathbf{W}_n\|_F^2} \langle \mathbf{W}_n, \mathbf{W}_{k+1} \rangle_F. \qquad (6)$$
With the matrices $\mathbf{B}^k_n,\, n = 1, \ldots, k$ constructed as above, the required coefficients are
calculated as in (4).
Remark: It is appropriate to discuss at this point that the original Matching Pursuit
(MP) approach [3], from which OMP has evolved, does not calculate the coefficients
as in (4). Instead the coefficients are calculated simply as
$$c_n = \langle \mathbf{d}^x_{\ell^x_n}, \mathbf{R}^k \mathbf{d}^y_{\ell^y_n} \rangle_F, \qquad n = 1, \ldots, k. \qquad (7)$$
Although the implementation of MP is much simpler, the coefficients (7) do not minimize
the norm of the residual error at each iteration step. In this sense, MP is not a stepwise
optimal approach.
Both the MP and OMP2D algorithms iterate up until a $K$th step, at which the stopping
criterion $\|\mathbf{I} - \mathbf{I}^K\|_F^2 < \rho$, for a given $\rho$, is met.
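Purely as an illustration of decomposition (2), the Java fragment below rebuilds an approximated block from the selected 1D atom pairs and their coefficients. The method and parameter names are assumptions of this sketch and do not correspond to my OMP2D class.

// Rebuild I^K(i,j) = sum_k c_k d^x_k(i) d^y_k(j) from the selected atom pairs.
// chosenDx[k] and chosenDy[k] are the 1D atoms picked at step k.
static double[][] reconstruct(double[] c, double[][] chosenDx, double[][] chosenDy) {
    int nx = chosenDx[0].length, ny = chosenDy[0].length;
    double[][] approx = new double[nx][ny];
    for (int k = 0; k < c.length; k++) {
        for (int i = 0; i < nx; i++) {
            for (int j = 0; j < ny; j++) {
                approx[i][j] += c[k] * chosenDx[k][i] * chosenDy[k][j];
            }
        }
    }
    return approx;
}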
2.2 Quantification of sparsity and approximation quality
To quantify the quality of an image approximation the most commonly used measures
are: The classical Peak Signal-to-Noise Ratio (PSNR) and the Structural SIMilarity
(SSIM) index [9]. The PSNR involves a simple calculation, as it is defined by:
$$\mathrm{PSNR} = 10 \log_{10} \frac{(2^{l_b} - 1)^2}{\mathrm{MSE}}, \qquad (8)$$
where $l_b$ is the number of bits used to represent the intensity of the pixels and
$$\mathrm{MSE} = \frac{\|\mathbf{I} - \mathbf{I}^K\|_F^2}{N_x N_y}.$$
The SSIM index [9] is a method for measuring the similarity between two images, for
identical images SSIM=1. The software for implementing this measure is available at
https://ece.uwaterloo.ca/~z70wang/research/ssim/.
Since I restrict considerations to high quality approximations, I aim to achieve a very
high PSNR and a SSIM very close to one.
To measure the sparsity I will use the Sparsity Ratio (SR), which is defined as
$$\mathrm{SR} = \frac{\text{Number of intensity points of the image}}{\text{Number of coefficients in the approximation}}.$$
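As a small illustration of these measures, the Java sketch below computes the MSE, the PSNR (8) and the SR directly from their definitions. The method names and the assumption of an 8-bit image (lb = 8) are mine, chosen for the example only.

// Quality and sparsity measures from Section 2.2; lb = 8 is assumed for 8-bit images.
static double psnr(double[][] original, double[][] approx, int lb) {
    int nx = original.length, ny = original[0].length;
    double sse = 0.0;                  // squared Frobenius norm of (I - I^K)
    for (int i = 0; i < nx; i++) {
        for (int j = 0; j < ny; j++) {
            double diff = original[i][j] - approx[i][j];
            sse += diff * diff;
        }
    }
    double mse = sse / (nx * ny);
    double peak = Math.pow(2, lb) - 1; // 255 for 8-bit images
    return 10.0 * Math.log10(peak * peak / mse);
}

// Sparsity Ratio: intensity points of the image over coefficients kept.
static double sparsityRatio(int numPixels, int numCoefficients) {
    return (double) numPixels / numCoefficients;
}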
2.3 Computational complexity and storage
It is important to stress that the main advantage of implementing Matching Pursuit
in 2D with separable dictionaries is the significant saving in storage. Although a
2D array can be stored as a long 1D array of identical dimensionality, stored in
that way a 2D dictionary would consist of $M = M_x M_y$ arrays of dimension $N_x N_y$. For a
separable dictionary, the number of arrays reduces to $M_x + M_y$ and the dimensions
reduce to $N_x$ and $N_y$, respectively.
For simplicity let us consider $M_x = M_y = M$ and $N_x = N_y = N$. It is clear then that
the storage of separable dictionaries is linear, $O(MN)$, while in the non-separable case
it is $O(M^2 N^2)$.
The separability of the dictionary also has significant implications for the selection step
(3). The complexity of evaluating the inner products $\langle \mathbf{d}^x_n, \mathbf{R} \mathbf{d}^y_m \rangle_F,\, n = 1, \ldots, M,\, m =
1, \ldots, M$ would change from $O(M^2 N + N^2 M)$ to $O(N^2 M^2)$ for a non-separable dic-
tionary. This would also affect the computation of the MP coefficients. In OMP2D,
however, the calculation of the coefficients does not benefit from separability. The com-
plexity of calculating the matrices $\mathbf{B}^k_n,\, n = 1, \ldots, k$ is $O(kN^2)$.
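To make these counts concrete: with $N = 16$ and $M = 32$, a separable dictionary stores $2MN = 1024$ values whereas a non-separable one would need $M^2 N^2 = 262{,}144$. The Java sketch below evaluates the selection criterion (3) by exploiting separability: for each $\mathbf{d}^x_n$ the product $(\mathbf{d}^x_n)^T \mathbf{R}$ is formed once ($O(N^2)$ per atom) and then combined with every $\mathbf{d}^y_m$ ($O(N)$ per pair), giving the $O(M^2 N + N^2 M)$ count quoted above. The array layout and the method name selectAtoms are assumptions for this illustration, not the signatures of my OMP2D class.

// Illustrative selection step (3): choose the pair of 1D atoms maximising
// |<d^x_n, R d^y_m>_F|. residual is Nx x Ny, dx holds Mx atoms of length Nx,
// dy holds My atoms of length Ny.
static int[] selectAtoms(double[][] residual, double[][] dx, double[][] dy) {
    int nx = residual.length, ny = residual[0].length;
    double best = -1.0;
    int bestN = -1, bestM = -1;
    for (int n = 0; n < dx.length; n++) {
        // t(j) = sum_i d^x_n(i) R(i,j); O(Nx*Ny) work per x-atom
        double[] t = new double[ny];
        for (int i = 0; i < nx; i++) {
            for (int j = 0; j < ny; j++) {
                t[j] += dx[n][i] * residual[i][j];
            }
        }
        // inner product of t with each d^y_m; O(Ny) work per pair
        for (int m = 0; m < dy.length; m++) {
            double ip = 0.0;
            for (int j = 0; j < ny; j++) {
                ip += t[j] * dy[m][j];
            }
            if (Math.abs(ip) > best) {
                best = Math.abs(ip);
                bestN = n;
                bestM = m;
            }
        }
    }
    return new int[] { bestN, bestM };
}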
2.4 Blockwise approximation
From the discussion in the previous section it follows that the OMP2D technique
cannot be applied to a complete image at once. The approximation is made possible
by partitioning the image into small blocks, as illustrated in Figure 1.
Figure 1: Blockwise splitting of an image
For simplicity I will process an image by dividing it into, say $Q$, square blocks
$\mathbf{I}_q,\, q = 1, \ldots, Q$ of $N_q \times N_q$ intensity pixels. The block is approximated independently of the
others by an atomic decomposition:
$$\mathbf{I}_q^{K_q} = \sum_{n=1}^{K_q} c^q_n\, \mathbf{d}^{x,q}_{\ell^x_n} \otimes \mathbf{d}^{y,q}_{\ell^y_n}, \qquad q = 1, \ldots, Q. \qquad (9)$$
The approximation of the whole image $\mathbf{I}^K$ is then obtained by assembling all the
approximated blocks, i.e.
$$\mathbf{I}^K = \bigcup_{q=1}^{Q} \mathbf{I}_q^{K_q}.$$
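A minimal Java sketch of this blockwise scheme is given below, assuming the image dimensions are multiples of the block size. The method names and the placeholder approximateBlock, which stands in for the per-block OMP2D call, are assumptions of this sketch only.

// Blockwise processing sketch for Section 2.4: split, approximate each block
// independently, then reassemble.
static double[][] approximateImage(double[][] image, int blockSize) {
    int nx = image.length, ny = image[0].length;
    double[][] result = new double[nx][ny];
    for (int bi = 0; bi < nx; bi += blockSize) {
        for (int bj = 0; bj < ny; bj += blockSize) {
            double[][] block = new double[blockSize][blockSize];
            for (int i = 0; i < blockSize; i++) {
                for (int j = 0; j < blockSize; j++) {
                    block[i][j] = image[bi + i][bj + j];
                }
            }
            double[][] approx = approximateBlock(block);   // would run OMP2D here
            for (int i = 0; i < blockSize; i++) {
                for (int j = 0; j < blockSize; j++) {
                    result[bi + i][bj + j] = approx[i][j];
                }
            }
        }
    }
    return result;
}

// Placeholder: a real implementation would return the OMP2D approximation of the block.
static double[][] approximateBlock(double[][] block) {
    return block;
}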
2.5 Constructing the dictionaries
The simple mixed dictionary used for this project consists of three components for
each 1D dictionary:
• A Redundant Discrete Cosine dictionary (RDC) $\mathcal{D}^x_1$ as given by:
$$\mathcal{D}^x_1 = \left\{ w^c_i \cos \frac{\pi(2j-1)(i-1)}{2M_x},\; j = 1, \ldots, N_x \right\}_{i=1}^{M_x},$$
with $w^c_i,\, i = 1, \ldots, M_x$ normalization factors. For $M_x = N_x$ this set is a Discrete
Cosine orthonormal basis for the Euclidean space $\mathbb{R}^{N_x}$. For $M_x = 2 l N_x$, with
$l \in \mathbb{N}$, the set is an RDC dictionary with redundancy $2l$, that will be fixed equal
to 2.

• A Redundant Discrete Sine dictionary (RDS) $\mathcal{D}^x_2$ as given by:
$$\mathcal{D}^x_2 = \left\{ w^c_i \sin \frac{\pi(2j-1)(i-1)}{2M_x},\; j = 1, \ldots, N_x \right\}_{i=1}^{M_x},$$
with $w^c_i,\, i = 1, \ldots, M_x$ normalization factors. For $M_x = N_x$ this set is a Discrete
Sine orthonormal basis for the Euclidean space $\mathbb{R}^{N_x}$. For $M_x = 2 l N_x$, with $l \in \mathbb{N}$,
the set is an RDS dictionary with redundancy $2l$, that will be fixed equal to 2.

• The standard Euclidean basis, also called the Dirac basis, i.e.
$$\mathcal{D}^x_3 = \{ \mathbf{e}_i(j) = \delta_{i,j},\; j = 1, \ldots, N_x \}_{i=1}^{N_x}.$$

The whole dictionary is then constructed as
$$\mathcal{D}^x = \mathcal{D}^x_1 \cup \mathcal{D}^x_2 \cup \mathcal{D}^x_3.$$
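The mixed dictionary above can be generated with a few lines of Java. The sketch below builds the RDC, RDS and Dirac components for redundancy 2 ($M_x = 2N_x$) and normalises each atom to unit norm, which is how I interpret the normalization factors $w^c_i$; the method and variable names are illustrative assumptions rather than the Dictionary class of my implementation.

// Mixed 1D dictionary of Section 2.5: redundant cosines, redundant sines and
// the Dirac (standard) basis, each atom normalised to unit Euclidean norm.
static double[][] buildDictionary(int nx) {
    int mx = 2 * nx;                                  // redundancy fixed to 2
    double[][] atoms = new double[2 * mx + nx][nx];
    for (int i = 1; i <= mx; i++) {
        for (int j = 1; j <= nx; j++) {
            double arg = Math.PI * (2 * j - 1) * (i - 1) / (2.0 * mx);
            atoms[i - 1][j - 1] = Math.cos(arg);      // RDC component
            atoms[mx + i - 1][j - 1] = Math.sin(arg); // RDS component
        }
    }
    for (int i = 0; i < nx; i++) {
        atoms[2 * mx + i][i] = 1.0;                   // Dirac basis e_i(j) = delta_ij
    }
    for (double[] atom : atoms) {
        normalise(atom);
    }
    return atoms;
}

static void normalise(double[] atom) {
    double norm = 0.0;
    for (double v : atom) norm += v * v;
    norm = Math.sqrt(norm);
    if (norm > 0) {
        for (int i = 0; i < atom.length; i++) atom[i] /= norm;
    }
}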
3 OMP2D in Java
I decided to implement OMP2D in Java because of its platform independence and
hence its subsequent reach and ease of distribution. The algorithm itself is computationally
expensive, due to the number of matrix operations required over multiple iterations.
Foreseeing this I decided to structure my program in such a way that the matrix
class, which carries out these operations, could later be extended to enable GPU
acceleration through use of a GPGPU interfacing library. This would make it easier to
inject acceleration into the program at a later point and bring about the performance
benefits of mass parallelisation; see Figures 2 and 3 for an example of how the biorthogonal
matrix object could be replaced by a GPU accelerated version.
To aid the design process I programmed in a Test Driven Development (TDD) manner,
using black box test cases. When the GPGPU extensions were later added the black
box test cases could once again be used to ensure compliance with the program’s
original design.
Note: some of the methods and variables of the classes have been omitted; see the
online documentation of the program's classes for full details [10].
3.1 Expanding Matrices
During development it became evident that there would be a need to create a capa-
bility for some of the matrices stored to be expandable, namely the orthogonal and
biorthogonal matrices, c.f. (5).

[Figure 2 is a class diagram: the OMP2D class (findNextAtom, orthogonalize, reorthogonalize, calcBiorthogonal, updateResidual) holds its biorthogonal matrix as a standard Matrix object, backed by an ArrayList<double[]> with add, addRow, normalizeRow and scale methods, which together implement the recursion formula (5).]
Figure 2: Case of OMP2D using standard matrix operations on CPU

This functionality is not available natively for Java
arrays as they are set at a fixed size on creation. Initial attempts to expand existing
matrices by creating new, larger matrices and copying the data across from the previous
iteration proved costly.
I changed focus to creating a more advanced matrix object which could increment its size
to accommodate more values on demand.
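The idea is sketched below: a matrix backed by an ArrayList<double[]> can grow one row at a time without copying the whole structure, matching the Matrix fields and methods shown in Figure 2. The signatures here are simplified assumptions; the real class is documented online [10].

import java.util.ArrayList;

// Sketch of a row-expandable matrix backed by an ArrayList, so the orthogonal
// and biorthogonal matrices can grow by one row per iteration without a full copy.
public class ExpandableMatrix {
    private final ArrayList<double[]> rows = new ArrayList<double[]>();

    public void addRow(double[] row) {
        rows.add(row);                 // grows on demand; existing rows are untouched
    }

    public double get(int i, int j) {
        return rows.get(i)[j];
    }

    public void scale(double factor) {
        for (double[] row : rows) {
            for (int j = 0; j < row.length; j++) {
                row[j] *= factor;
            }
        }
    }

    public int rowCount() {
        return rows.size();
    }
}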
3.2 Multi-threading
As all of the blocks that are processed are independent of each other, an obvious
performance boost can be achieved by processing several blocks concurrently. I simply
created an ExecutionService which could measure how many cores the host platform
had available and utilise them all by spawning as many threads. This drove the work
load and decreased the total time taken to process an entire image.
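A minimal sketch of this set-up with the standard java.util.concurrent API is shown below; the BlockProcessor name and the idea of one Runnable task per block are assumptions of the sketch, not the exact classes of my implementation.

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch: one worker thread per available core, one task per image block.
public class BlockProcessor {
    public static void processAll(List<Runnable> blockTasks) throws InterruptedException {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        for (Runnable task : blockTasks) {
            pool.submit(task);          // blocks are independent, so order does not matter
        }
        pool.shutdown();                // stop accepting new tasks
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}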
3.3 ImageJ Plugin
Following my prior work in my MR [1], in which I recognised ImageJ as being the
most competent supporting library for image processing operations, I chose to create
a wrapper for my program which would integrate it as a plugin for ImageJ. As described
in my previous report ImageJ is very proficient in plugin support, and is widely used
by the image processing and scientific community for these purposes.
I created a simple interface with which the user could select the blocking size, the
maximum number of iterations, whether to run in multi-thread mode and the MSE
(c.f. (8)).
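A sketch of how such an options dialog can be wired into an ImageJ plugin is shown below; it uses the standard GenericDialog and PlugInFilter APIs, but the class name, field labels, defaults and the hand-off to OMP2D are placeholders rather than the actual plugin code (see [10] and Appendix A).

import ij.ImagePlus;
import ij.gui.GenericDialog;
import ij.plugin.filter.PlugInFilter;
import ij.process.ImageProcessor;

// Sketch of an ImageJ plugin dialog collecting the OMP2D options.
public class OMP2D_Sketch implements PlugInFilter {
    public int setup(String arg, ImagePlus imp) {
        return DOES_8G;                        // 8-bit grey-scale images
    }

    public void run(ImageProcessor ip) {
        GenericDialog gd = new GenericDialog("OMP2D options");
        gd.addNumericField("Block size", 16, 0);
        gd.addNumericField("Max iterations", 250, 0);
        gd.addNumericField("Target MSE", 10.0, 2);
        gd.addCheckbox("Multi-threaded", true);
        gd.showDialog();
        if (gd.wasCanceled()) {
            return;
        }
        int blockSize = (int) gd.getNextNumber();
        int maxIterations = (int) gd.getNextNumber();
        double targetMse = gd.getNextNumber();
        boolean multiThreaded = gd.getNextBoolean();
        // ... hand the parameters and ip to the OMP2D implementation here.
    }
}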
[Figure 3 shows the same class diagram as Figure 2, with the biorthogonal matrix replaced by a CudaMatrix backed by a CleverPointer<double[]> (which extends JCuda's Pointer), providing add, addRow, normalizeRow and scale operations on the GPU.]
Figure 3: Case of OMP2D using GPGPU extensions
I made public the program under the open source MIT license which makes it freely
available for anyone in the scientific community and beyond to use, modify and extend
it.
3.4 Performance Optimisations
As OMP2D is a computationally intensive algorithm it was important to minimise
any bottlenecks which could otherwise prove costly for the timely execution of the
program.
I chose to use the IBM Health Center [sic] tool to analyse the execution of my code and
find any impediments. It is a free and low-overhead diagnostic tool which IBM offers for
its clients and the wider Java community to assess the execution of Java applications.
It provides deep analysis and recommendations based on the real time or saved data
produced by the Java Virtual Machine (JVM) through hooks. Given an IBM Java
Runtime Environment this tool injects itself into the JVM to produce performance
reports across many component areas including: thread execution, locking, I/O and
Garbage Collection (GC).
In assessing the performance I used a 512 × 512 astronomical image, see Figure 4b,
as it has a mixed composition of high and low change areas, with multi threading
enabled. I ran the OMP2D algorithm for block dimensions 8 × 8, 16 × 16 and 32 × 32.
From the IBM Health Center reports it was found that the matrix methods multiply
and getCol were consuming the most CPU time, making them good candidates for
optimisations. The next closest method in all three tests was the reorthogonalize
method, which took up a low 6.79% of the time.
In producing its report, many samples were taken by Health Center whilst the program
was running. Using these samples it probes the JVM for the function that was running
at each particular moment in time; given enough samples it can generate a profile of
where, and in which methods, the most time is spent during execution.
The table below shows the analysis report for the multiply and getCol methods,
where Self is the percentage of samples during which the method itself was running
and Tree is the percentage of samples during which the method or its descendants were
running:
multiply() getCol()
Block Dimensions Time Self Tree Self Tree
8 × 8 1.613s 38.8% 73.2% 34.0% 34.0%
16 × 16 9.757s 66.6% 83.5% 16.7% 16.7%
32 × 32 72.944s 45.9% 73.8% 27.9% 27.9%
Analysis showed that the only function which referenced getCol was multiply. This
indicated to me that the method needed further investigation into why it was taking
so long to calculate.
The only place in my Java implementation where the multiplication method was used
was in the findNextAtom function which selected the next best approximating atom
(c.f. (3)). In this function the matrices which are used are non-expanding, i.e. al-
ways the same dimensions no matter the iteration. Thus there was no need to have
them constructed as complex matrix objects held in an expanding container (i.e. an
ArrayList), only to extract values each time. Therefore I chose to store them simply
as large single arrays, removing the process of getting columns and constructing new
matrix objects on each iteration; see the source code for the BasicMatrix and Dictionary
classes [10].
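The change amounts to treating each matrix as a single row-major double[] and indexing it directly, so the inner loop no longer allocates column objects. A simplified sketch of the idea is below; BasicMatrix in the actual source differs in its details [10].

// Sketch of a row-major, flat-array multiply: c = a (n x m) times b (m x p).
// Indexing a[i*m + k] directly avoids building column arrays on every call.
static double[] multiply(double[] a, double[] b, int n, int m, int p) {
    double[] c = new double[n * p];
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < m; k++) {
            double aik = a[i * m + k];
            for (int j = 0; j < p; j++) {
                c[i * p + j] += aik * b[k * p + j];
            }
        }
    }
    return c;
}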
A second evaluation revealed a noticeable reduction in the time to complete and the
proportional time spent in the methods.
multiply() reorthogonalize()
Block Dimensions Time Self Tree Self Tree
8 × 8 0.367s - - - -
16 × 16 1.577s 60.6% 60.6% 2.17% 3.8%
32 × 32 12.876s 52.4% 52.4% 15.0% 19.2%
Note: The 8×8 block Health Center profile report was unable to determine the method
names from its samples because the methods completed too quickly to be sampled
reliably.
3.5 BLAS Library acceleration
Basic Linear Algebra Subprograms (BLAS) [11] is a commonly used library which ac-
celerates the performance of linear operations. It was originally created in Fortran and
is highly optimised for this type of work. It is commonly used in many mathematical
applications as it has been proven to be one of the fastest libraries available.
As Java applications run in a virtualised environment some libraries have made avail-
able native calls to the platform so that the operations can execute outside the JVM
in native Fortran. However this type of implementation would also mean copying the
matrix arrays to and from the JVM. Due to the high volume of relatively small
matrix operations which happen at each iteration, this copying overhead would
prove too time-consuming to justify the accelerated performance gains which would otherwise have been
achieved.
In larger block dimensions though, such as 32 × 32, it may be justifiable to accelerate
the matrix multiplication method which, in Section 3.4, was shown to consume the
greatest amount of time.
There have been some recent and more fruitful developments on implementing a version
of BLAS inside the JVM. One such library is known as Java Linear Algebra (JLA)
which shows promising performance [12].
Given the small amount of time available to me, however, and given the focus of
this project on GPGPU acceleration, which itself includes a BLAS implementation, I chose not to
spend additional time further increasing the performance of the CPU version. I would
certainly have considered this optimisation route if the intent had been
to run this implementation solely in Java without GPGPU acceleration.
3.6 Comparisons with other implementations
Once I was satisfied with the efficiency of the implementation, I performed some comparative tests.
The aim was to evaluate the performance of my Java implementation against
other available implementations of the identical algorithm. These implementations
are:
• OMP2D implemented in MATLAB.
• OMP2D C++ MEX file executed from MATLAB.
• OMP2D C++ MEX file using the BLAS library executed from MATLAB.
In order to make a fair comparison with my Java program, I had to learn how to im-
plement parallel processing in MATLAB and extend the available MATLAB scripts to
use multiple threads. I was then able to do fair comparisons with the implementations
above, using multiple threads, as well as single threads.
Numerical Test
I conducted the experiments using a set of five test images, see Figures 4a, 4b, 5a,
5b and 6a. The Artificial, Chest and Nebula images were approximated up to 50 dB,
corresponding to an SSIM of 0.99. Such approximations are perceptually indistin-
guishable from the originals. For completeness I have also included the popular test
images: Lena and Peppers; both were approximated up to 43 dB, corresponding to
SSIM=0.98, which also renders perceptually indistinguishable approximations of these
images.
The platform which was used for running all the tests had the following hardware
and software specifications:
Intel i7 3770k 8 × 3.40GHz
16GB DDR3 RAM
Samsung SSD 500MB/s Read
OS: Gnome Ubuntu 12.10
The Java Runtime Environment was Oracle SR7-51
All the implementations result in the same sparsity and quality levels, see Table 1.
The comparison is based on the time taken to complete the approximation. For this,
the corresponding implementations were executed five times on each image, and the
times are the average of the five runs.
(a) Chest X-Ray (b) Astro
Figure 4: Testing Images
[Performance comparison charts, single-threaded and multi-threaded: time in seconds against block dimensions 8x8, 16x16 and 32x32 for the Matlab, C++ Mex, Java and C++ Mex with BLAS implementations, on the Chest, Astro, Artificial, Peppers and Lena images.]
(a) Peppers (b) Lena
Figure 5: Testing Images
(a) Artificial 1
Figure 6: Testing Images
Results show how significant the benefit of using the BLAS library is for the OMP2D
algorithm. It is especially evident when blocks of size 32 × 32 are run in single thread
mode, resulting in the C++ MEX implementation without BLAS being even slower
than MATLAB; the reason is that MATLAB uses BLAS by default.
With the exception of the Chest image, my Java implementation is faster than MAT-
LAB in spite of it not using BLAS or any other optimising library.
1 High resolution image shrunk to 2048x2048 pixels, taken from http://www.imagecompression.info
A notable difference between MATLAB and the other implementations is that it does
not appear to be as effective in reducing time when multiple threads are used, becoming
the slowest in all cases except the Chest image.
The fastest implementation overall, across all images and block sizes, was the C++
MEX with BLAS version.
On the whole my Java implementation without BLAS performs well. Nevertheless,
these experiments lead me to conclude that, as the dimension of the blocks increases,
the use of a library optimized for accelerating matrix multiplication becomes crucial
to performance.
Given the extendible way I have developed my Java program, it should be easy to
inject acceleration and I feel well positioned to utilise the BLAS libraries and massive
parallel capabilities of my GPU device.
Finally, in order to illustrate the effectiveness of OMP2D, for blocks of size 8 × 8, for
which OMP2D can be considered to be efficient as far as running time is concerned,
I have produced the comparison Table 1. All the results have been obtained with a
ready to use Matlab script, which was already available for comparison of sparsity
against traditional transforms and MP2D.
OMP2D MP2D DWT DCT
Chest 10.60 9.65 7.82 7.02
Astro 7.8 6.83 4.78 4.53
Peppers 5.54 5.00 2.97 2.89
Lena 6.73 6.02 4.0 3.8
Artificial 20.5 17.5 11.0 12.0
Table 1: SRs corresponding to the same approximation quality by OMP2D, MP2D,
DWT and DCT.
The first column in Table 1 indicates the image. The second column is the SR yielded
by the OMP2D approach with the dictionaries of Section 2.5. The third column
corresponds to the approximation with the same dictionary, but using the much simpler
MP2D approach. The last two columns are the results produced by the Discrete
Wavelet and Cosine Transforms (DWT and DCT) respectively, by elimination of the
smallest coefficients.
Table 2 shows the sparsity ratios produced for each image and corresponding block
size.
It may appear from this table that the gains from increasing the size of the blocks
do not justify the extra computational effort. This is only a consequence of the
dictionaries I am using here. It has been shown that for larger dictionaries, which also
include localized atoms [6], the gain from increasing the block size is significant.
Thus, I understand that dedicating the remainder of the project to optimising the
implementation of OMP2D with GPGPU acceleration is justified.
8 × 8 16 × 16 32 × 32
Chest 10.5814 13.2196 14.6523
Astro 7.7507 8.3308 8.6101
Peppers 5.5328 6.1385 6.2842
Lena 6.7275 7.5180 7.7069
Artificial 20.5354 19.7175 17.0245
Table 2: SRs corresponding to block dimensions 8 × 8, 16 × 16, 32 × 32 for each test
image
4 GPU Acceleration
It follows from the preceding discussion that the performance gains of GPU acceler-
ation would only be significant when processing large blocks, such as size 32 × 32.
Otherwise the high latency memory copying described previously in my MR [1] for
GPU processing may completely negate the performance boosts gained.
In order to effectively create GPU extensions for my Java program, I first had to learn
about the GPU architecture, parallelism and memory management.
I found that almost all standard programming patterns, which are typically trivial for
serially developed programs, no longer worked effectively due to the parallel nature of
GPGPU programming. This required me to relearn basic concepts in an entirely new
light; I also had to choose and learn a new language and parallel platform to run this
new process on.
As of writing there were two main GPGPU programming frameworks available: NVIDIA's pro-
prietary Compute Unified Device Architecture (CUDA) and the open standard OpenCL.
Given the time available to me, I chose to learn CUDA over OpenCL, as there had
been more previous research conducted using it and more libraries developed for it,
especially maths based libraries such as BLAS/LAPACK. I also found that there was
more widely available documentation and examples on the web.
As I had previously investigated in my MR [1], in order to drive this hardware specific
GPU acceleration using Java, it was necessary to exploit a Java Native Interface (JNI)
that could enable communications with the platform’s device. This was due to the
virtual nature of the JVM, which operates in a platform neutral state, i.e. not taking
into account platform specific hardware such as the GPU.
JCuda was deemed an excellent candidate as it offered runtime compilation of CUDA
code; preserving the platform independence of Java. It also mirrored the CUDA
programming language’s API, allowing easier transfer and application of knowledge.
Once I had the accelerated GPGPU for matrix operations programmed in CUDA I
then intended to use the JCuda library to make JNI calls to the GPU and drive the
execution from Java.
4.1 GPU Architecture
As is expected with new to market and leading edge technologies, the design and com-
pute capabilities of CUDA enabled graphics cards are constantly evolving with each
new generation, bringing new features and specification changes. Many of these new
features and specifications are incompatible with the hardware of previous generations.
This makes it difficult to select a platform to begin support with and becomes an issue
of weighing up the opportunity costs of not using some new performance enhancing
features over alienating part of the academic community, that would otherwise have
been able to use and contribute to it.
I chose to support CUDA compute capability 3.0 because it included a second genera-
tion implementation of the BLAS library (CUBLAS) which was more highly optimised
for parallel execution and supported all BLAS level 1, 2 and 3 routines, to double precision
and for complex numbers. As previously shown by my comparisons in Section 3.6, the BLAS library
was found to be very effective, so I wished to exploit the most optimised version of it.
In CUDA threads are organised into blocks of a specified size, up to 1024 (post compute
version 2.0). The number of blocks and their dimensionality is specified on launching a
kernel, defining what is known as the size and dimensionality of a grid. The dimension-
ality of the blocks and grids that can be instantiated are up to 3D for blocks and 2D
for grids. This allows for separable characteristics to be applied to each thread given
its thread and block indexes; see the PRAM model of computation in Section 4.2.
Figure 7: Visual representation of a 2-Dimensional grid of 2D blocks
The GPU is made up of several Streaming Multiprocessors (SMs); each SM accepts
one block of threads at a time for processing. An SM cannot accept threads
which are not from its active block and cannot be freed until all the threads within its
active block have completed. It is absolutely critical, therefore, to avoid divergence in
any individual thread, which would otherwise cause several others to wait until it has
completed.
When considering where GPU acceleration could benefit a program it is important
to assess whether the particular function is a good candidate in terms of throughput.
GPUs are best suited to doing a lot of work in many small independent parts rather
than in long sequential processes. Ideally we want a large maths-to-memory-access ratio.
In order to begin programming GPU kernels it is vital to first understand massive
parallelism and the PRAM model of computing. Without this foundation knowledge
it is impossible to effectively design a parallel system which yields any improvement;
on the contrary, it could perform worse.
4.2 The PRAM model of parallel computation
Kernels separate workload by their thread and block index numbers, and sometimes
their dimensions. This enables unique characteristics, operations and behaviour of
a thread to be defined, such as the address of a specific memory space to work on,
enabling mass parallelisation.
Due to the number of threads which are competing for memory access and writes we
must consider a different model of computation known as Parallel Random Access
Machine (PRAM). PRAM is analogous to the Random Access Machine (not to be
confused with Random Access Memory) model of sequential computation that is used
in algorithm analysis. The synchronous PRAM model is defined as follows [13]:
1. There are p processors connected to a single shared memory.
2. Each processor has a unique index 1 ≤ i ≤ p called the processor ID.
3. A single program is executed in Single-Instruction stream, Multiple-Data (SIMD)
fashion. Each instruction in the stream is carried out by all processors simultane-
ously and requires the same unit of time, regardless of the number of processors.
4. Each processor has a flag that controls whether it is active in the execution
of an instruction. Inactive processors do not participate in the execution of
instructions, except for instructions that reset the flag.
By working independently on a small slice of an algorithm the threads solve a common
problem together. The PRAM model does not allow for simultaneous writing to a
single shared memory space. This type of event is known as a collision and can cause
abnormal behaviour in the execution of the program. Take for instance the following
kernel in badwrite.cu.
badwrite.cu
1 extern "C"
2 __global__ void increment(int *var)
3 {
4 var[0] = var[0] + 1;
5 }
If this kernel were launched with a thousand instances the final value of var[0] would
not be 1000, but would in fact be closer to 30. This is due to the lag between
one thread reading the current value of var[0] and incrementing it, in which time
another thread may also have read the previous value and incremented it to the same
value. This is known as a race condition.
CUDA provides a solution to this through operators called atomics, which offer a viable
approach where concurrent writing is absolutely necessary; however, they force the threads
to effectively queue to do their operations. This queuing effect greatly slows down the
entire process as all the affected processors' unit times increase. For this reason they
should be avoided if at all possible.
For instance, the inner product function, where a summation of the products is required,
would be a poor candidate for atomics, as the entire process would take as long as the
slowest processor.
4.3 Kernel Construction
Obeying the PRAM model of computation often requires an abandonment of some
programming techniques used in serial languages. Take for instance a scaling function
which multiplies every element of an array by some factor. Typically this would be
achieved in an iterative fashion, as in Program 2 in Java, but in CUDA it is done in a
single step which is executed by as many threads as there are elements.
When constructing kernels, individual characteristics of processors, such as which ele-
ment of an array to act on, can be defined by obtaining the block and thread ID of
the processor; see line 4 of Program 1. The following kernel is launched in 1D fashion,
only taking into account the x dimension.
1 extern "C"
2 __global__ void scale(double *v1, int length, double factor)
3 {
4 int index = blockIdx.x*blockDim.x + threadIdx.x;
5 if(index < length)
6 {
7 v1[index] = v1[index] * factor;
8 }
9 }
Program 1: Scaling function in CUDA
1 public void scale(double[] v1, double factor)
2 {
3 for(int i = 0; i < v1.length; i++)
4 {
5 v1[i] = v1[i]*factor;
6 }
7 }
Program 2: Scaling function in Java
When designing a kernel, great care should be taken to avoid divergence between the threads
that execute it. As stated in Section 4.1, an SM cannot be freed until all the thread
instructions within the block have completed. If one thread were to diverge the entire
block would wait on it, wasting computation time which could otherwise be used
by other threads.
4.4 Memory Management
The architecture of motherboards makes it impossible for GPU kernels to work directly on any
memory which resides outside of the GPU, due to hardware imposed restrictions. Even
if it were possible, the latency involved would be disastrous for performance.
Figure 8: Paged memory resource structuring (coalesced, strided and scattered access patterns)
In order to work on a dataset we need to allocate memory space on the GPU’s own
memory storage. There is however a great deal of latency in transferring memory to
and from the device, ranging up to a couple of milliseconds. For this reason it is
advisable to minimise the number of transfers which take place to limit their hampering
effect.
The GPU has three memory storage areas in which values can be held: global, shared
and local. When memory is initially transferred to the GPU it is held in the global
storage area, as it is not associated with any particular block or thread. This global
memory can only be accessed via pointers; conversely, pointers cannot be used in
conjunction with local or shared variables. By default any variable that is instantiated
during the runtime of a thread is held as a local variable and is only accessible to that
thread. In order to share variables between threads of the same block the __shared__
declaration is needed. It is however very limited by its small and fixed storage size.
Global is the slowest memory storage area, yet accessible to all threads regardless
of their block origins. Shared is far faster than global but slower than local, and is
accessible to all threads of the same block. Local is the fastest storage area, but is
only accessible to the thread which created it. The memory access and read speeds of
the GPU are extremely fast, more so than the memory for a serial application.
It is usually good practice to cache parts of the global memory which are going to be
used and altered before returning them to the global storage area for transfer.
When structuring large arrays the memory accesses should also be coalesced; this is due to the
way sections of memory space are read in blocks. By using coalesced memory
access, fewer reads are needed to access related data, resulting in faster performance
times.
Unlike Java, the memory which is allocated on the GPU is not managed by JCuda
or the JVM as it exists outside of its structure. We must therefore manage it in a
C like manner where the programmer allocates memory to be used and frees it when
finished. This requires good programmer discipline yet also goes against one of Java’s
key characteristics.
4.4.1 CleverPointer Class
To bring memory allocation more in line with the characteristics of Java, I propose a
method of automated memory management: a Java pointer class with its
own self managed system, capable of allocating and deallocating memory as and when
needed by the JVM.
As Java is a pointer-less and memory managed language, programmers do not have to
discipline themselves in allocating and deallocating memory blocks. If, for instance, a
pointer object referring to some GPU memory is no longer being used (i.e. it has no active references)
then it becomes staged for deletion by the Garbage Collector (GC) component of the
JVM. This is particularly troublesome in the case of pointers because
if the pointer is destroyed without the memory being deallocated, or the JVM shuts down abruptly, then the GPU
memory is leaked and never gets cleaned up.
I first looked to extend the Pointer class of JCuda, to make it capable of deallocating
CUDA memory on disposal. I achieved this by overriding the finalize() method
of the Object superclass. From testing, however, I found that the pointers were not
being properly deallocated in the event of the application closing before a final GC
was made.
This problem was solved by creating a shutdown hook for the JVM, which iterates through a static collection of all pointer objects currently in use. This created a new problem, however: the collection clashed with the finalizing mechanism I had added previously, as every pointer now had a permanent reference from the collection and so was never staged for removal by the GC.
By utilising Java's little-known weak referencing system, I was able to hold these objects in a weak hash map instead. This prevented the shutdown hook collection from pinning the objects in memory and allowed them to be destroyed once no other references existed.
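The full CleverPointer class and its documentation are available from the project website [10]; the fragment below is only a simplified sketch of the mechanism just described, combining the finalizer, the shutdown hook and the weak map, with names and details reduced for illustration.

CleverPointerSketch.java

import static jcuda.runtime.JCuda.cudaFree;
import static jcuda.runtime.JCuda.cudaMalloc;
import static jcuda.runtime.JCuda.cudaMemcpy;
import static jcuda.runtime.cudaMemcpyKind.cudaMemcpyHostToDevice;

import java.util.ArrayList;
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;

import jcuda.Pointer;
import jcuda.Sizeof;

// Simplified sketch: device memory is freed either when the GC finalizes the object
// or when the JVM shutdown hook runs, whichever happens first. The weak map holds
// no strong references, so it never prevents the GC from collecting the pointers.
public class CleverPointerSketch extends Pointer {

    private static final Map<CleverPointerSketch, Boolean> LIVE =
            Collections.synchronizedMap(new WeakHashMap<CleverPointerSketch, Boolean>());

    static {
        Runtime.getRuntime().addShutdownHook(new Thread() {
            @Override
            public void run() {
                // Free anything the GC never got around to finalizing
                for (CleverPointerSketch p : new ArrayList<CleverPointerSketch>(LIVE.keySet())) {
                    p.free();
                }
            }
        });
    }

    private boolean freed = false;

    private CleverPointerSketch() {
        LIVE.put(this, Boolean.TRUE);
    }

    // Allocate device memory and copy a host array to it
    public static CleverPointerSketch copyDouble(double[] host) {
        CleverPointerSketch p = new CleverPointerSketch();
        long bytes = (long) host.length * Sizeof.DOUBLE;
        cudaMalloc(p, bytes);
        cudaMemcpy(p, Pointer.to(host), bytes, cudaMemcpyHostToDevice);
        return p;
    }

    // Manual release for high-throughput code; safe to call more than once
    public synchronized void free() {
        if (!freed) {
            cudaFree(this);
            freed = true;
        }
    }

    @Override
    protected void finalize() throws Throwable {
        try {
            free();     // deallocate the device memory when the GC disposes of the object
        } finally {
            super.finalize();
        }
    }
}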
Below is an example of using JCuda's plain Pointer objects to set up a matrix operation such as the inner product.
Using the Pointer class

public static double innerProductSetup(double[] vector1, double[] vector2, int length) {
    int size = length * Sizeof.DOUBLE;
    Pointer d_vector1 = new Pointer();
    Pointer d_vector2 = new Pointer();
    Pointer d_ans = new Pointer();

    // Allocate device memory and copy both input vectors to the GPU
    cudaMalloc(d_vector1, size);
    cudaMemcpy(d_vector1, Pointer.to(vector1), size, cudaMemcpyHostToDevice);

    cudaMalloc(d_vector2, size);
    cudaMemcpy(d_vector2, Pointer.to(vector2), size, cudaMemcpyHostToDevice);

    // The result is a single double value
    cudaMalloc(d_ans, Sizeof.DOUBLE);

    innerProduct(d_vector1, d_vector2, length, d_ans);

    // Copy the result back to the host and release all device memory
    double[] ans = new double[1];
    cudaMemcpy(Pointer.to(ans), d_ans, Sizeof.DOUBLE, cudaMemcpyDeviceToHost);

    cudaFree(d_ans);
    cudaFree(d_vector1);
    cudaFree(d_vector2);

    return ans[0];
}
Below is my new system of CleverPointers, which follows Java naming conventions more strictly, is easier to read and requires less work to implement.
Using the CleverPointer class

public static double innerProductSetup(double[] vector1, double[] vector2, int length) {
    // Device memory is allocated, copied and eventually freed by the CleverPointer objects
    CleverPointer<double[]> dVector1 = CleverPointer.copyDouble(vector1);
    CleverPointer<double[]> dVector2 = CleverPointer.copyDouble(vector2);
    CleverPointer<double[]> dAns = CleverPointer.create(new double[1]);

    innerProduct(dVector1, dVector2, length, dAns);

    return dAns.getArray()[0];
}
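Both listings assume an innerProduct kernel on the device side, which is not shown in this report. A kernel of that shape could be implemented as in the sketch below; this is a hypothetical, deliberately naive version (one block, shared-memory tree reduction, c.f. Section 5.2), not the optimised kernel one would use in practice.

innerProductKernel.cu

extern "C"
__global__ void innerProduct(const double* v1, const double* v2, int length, double* ans)
{
    // Naive single-block sketch: launch with one block of 256 threads.
    // Each thread accumulates a strided partial sum, then the block reduces
    // the partial sums in shared memory.
    __shared__ double partial[256];
    double sum = 0.0;
    for (int i = threadIdx.x; i < length; i += blockDim.x) {
        sum += v1[i] * v2[i];
    }
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction over the block's partial sums (blockDim.x must be a power of two)
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) {
        ans[0] = partial[0];
    }
}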
It should be noted, however, that in the case of high-throughput processes, where a method requires huge amounts of memory to be allocated, programmer discipline is still necessary to ensure there is enough memory for all subsequent calls to be able to allocate their own. I have therefore included a manual method of freeing memory, the object's free() method. Given the size of the global memory of modern GPUs, often measured in gigabytes, this should not be necessary for the majority of applications. See my website for the source and full documentation of the class [10].
5 GPU Programming Patterns
Considering the PRAM model of computing and the restrictions placed on memory usage, many patterns which were previously commonplace and trivial in serial computing must be re-examined in order to fully utilise the mass parallel computing capabilities of the GPU.
Conversely, some familiar mathematical libraries such as BLAS and LAPACK have GPU equivalents, such as cuBLAS, which can be used in their place, making some common practices portable.
Where current GPU libraries do not fully meet the requirements of a program's design, additional functionality must be created by the programmer. When creating new functions it is essential to consider the PRAM model, to guard against collisions and ensure effective GPU usage. There are several parallel programming patterns which address the issues of parallel computing; below I name those most relevant to this project.
5.1 Map
Figure 9: Map
Mapping is a simple operation whereby a large array of values is transformed from one state to another.
It is a collision-free, one-to-one operation in which each processor operates independently of all other processors on a single value; each value can either be modified in its own memory space, such as when scaling a vector, or each resulting value can be mapped to a new memory area, as in the case of the addition of two arrays. This pattern is especially useful and easy to implement in CUDA as it is inherently thread safe.
In OMP2D, the matrix operation in which entire columns are scaled by a normalising factor could be considered a map operation.
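To make the pattern concrete, a hypothetical kernel for that column-scaling step is sketched below; the column-major layout and the one-thread-per-element launch are assumptions made for the example, not a description of the actual OMP2D code.

scaleColumnsKernel.cu

extern "C"
__global__ void scaleColumns(double* matrix, const double* factors, int rows, int cols)
{
    // Map pattern: one thread per matrix element, and no thread ever touches
    // another thread's memory space, so the operation is collision free.
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < rows && col < cols) {
        int index = col * rows + row;        // column-major storage (assumed)
        matrix[index] *= factors[col];       // scale each column by its own factor
    }
}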
5.2 Reduce & Scan
Reduce looks to take an array of values and, through applying an associative binary
operator to all elements, produce a single resulting value. This pattern can be applied
to functions which, for instance, find the sum or the maximum value of all variables.
The reduce pattern follows Brent’s Theorem, which assumes a PRAM model.
Let $s_i$ be the number of operations in step $i$ of the overall computation, and assume that each processor can complete an arithmetic operation in the same unit time $t$. The time to compute step $i$ of the overall operation given $P$ processors is

$$\left\lceil \frac{s_i}{P} \right\rceil \le \frac{s_i + P - 1}{P} \qquad (10)$$

giving an overall time to complete all $T$ steps of

$$\sum_{i=1}^{T} \left\lceil \frac{s_i}{P} \right\rceil \le \sum_{i=1}^{T} \frac{s_i + P - 1}{P} = \sum_{i=1}^{T} \frac{P}{P} + \sum_{i=1}^{T} \frac{s_i - 1}{P} = T + \frac{\sum_{i=1}^{T} s_i - \sum_{i=1}^{T} 1}{P} = T + \frac{m - T}{P} \qquad (11)$$

where $m = \sum_{i=1}^{T} s_i$ is the total number of operations.
The number of steps needed to complete a reduction in parallel is decidedly smaller than in serial: $\log_2 n$ steps in parallel compared with $n - 1$ in serial, making the pattern increasingly beneficial as the number of elements grows.
This reduction algorithm can be implemented by using a Scan pattern. Scan takes a binary associative operator (e.g. addition or finding the minimum value) and applies it to a set of elements up to a kth element; that element holds the total sum up to that point if we are doing addition, or the smallest value so far if we are finding the minimum. Scan can be inclusive or exclusive of the kth element. For example:

Value Set: [8, 4, 9, 7, 6, 3, 10]
Addition:  [8, 12, 21, 28, 34, 37, 47]
Minimum:   [8, 4, 4, 4, 4, 3, 3]

Figure 10: Hillis and Steele tree diagram (the values [8, 4, 9, 7] are reduced pairwise to [12, 16] and then to 28).
We can use an inclusive scan to find the reduced value of an array by taking its kth element. An exclusive scan does not include the kth element and instead applies the operator only up to the (k − 1)th element.
There are two main implementations of scan: the Hillis and Steele method and the Blelloch method; the latter, by extension, also produces the full scanned array.
5.2.1 Hillis and Steele
The Hillis and Steele method is more step efficient but performs more work overall. The main concept is to take the current element, k, and at step i apply the operator to its neighbour $2^i$ positions away. For n elements there are $\log_2(n)$ steps, requiring $O(n \log n)$ work in total. This creates a tree-like effect, as shown in Figure 10. In the form used here, the Hillis and Steele method finds the reduced value of an array, whereas the Blelloch method also produces the full scanned array.
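A minimal single-block sketch of a Hillis and Steele inclusive scan is shown below; it assumes the array fits within one block (at most 1024 elements here) and uses two shared-memory buffers, swapped each step, as one common way of avoiding read/write collisions between steps.

scanKernel.cu

extern "C"
__global__ void inclusiveScan(const double* in, double* out, int n)
{
    // Hillis and Steele inclusive scan for a single block with n <= blockDim.x <= 1024.
    // Two shared buffers are swapped each step so reads and writes never collide.
    __shared__ double buf[2][1024];
    int tid = threadIdx.x;
    int cur = 0;

    if (tid < n) buf[cur][tid] = in[tid];
    __syncthreads();

    for (int offset = 1; offset < n; offset *= 2) {
        int next = 1 - cur;
        if (tid < n) {
            double value = buf[cur][tid];
            if (tid >= offset) value += buf[cur][tid - offset];   // add the 2^i neighbour
            buf[next][tid] = value;
        }
        __syncthreads();
        cur = next;
    }

    if (tid < n) out[tid] = buf[cur][tid];
}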
5.2.2 Blelloch
The Blelloch method is work efficient but requires more steps [14]. It stages the scan process into two parts: the reduce and down-sweep stages. It also has the added benefit of creating the full scanned array, as shown below in Figure 11.
During the reduce phase this method uses the same reducing process as Hillis and Steele, after which the final value in the array is set to the identity element of the operator; for addition this is 0. The down-sweep is then applied, which copies values left and applies the operator to the right until all elements have been processed, resulting in an exclusive scan array.
Figure 11: Blelloch scan of the values [8, 4, 9, 7], showing the reduce and down-sweep phases.
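To make the two phases concrete, the serial Java sketch below mirrors the structure a parallel Blelloch implementation would have: each pass of the outer loops corresponds to one parallel step, and the iterations of the inner loops are independent, so they could all run concurrently. It assumes the input length is a power of two.

BlellochSketch.java

public class BlellochSketch {
    // Serial sketch of the Blelloch exclusive scan (addition). In a GPU version each
    // level of the two outer loops is one parallel step.
    public static double[] exclusiveScan(double[] input) {
        int n = input.length;                      // assumed to be a power of two
        double[] a = input.clone();

        // Reduce (up-sweep) phase: build partial sums in a tree
        for (int stride = 1; stride < n; stride *= 2) {
            for (int i = 2 * stride - 1; i < n; i += 2 * stride) {
                a[i] += a[i - stride];
            }
        }

        // Set the last element to the identity of the operator (0 for addition)
        a[n - 1] = 0.0;

        // Down-sweep phase: copy left, apply the operator to the right
        for (int stride = n / 2; stride >= 1; stride /= 2) {
            for (int i = 2 * stride - 1; i < n; i += 2 * stride) {
                double left = a[i - stride];
                a[i - stride] = a[i];              // copy down
                a[i] += left;                      // combine
            }
        }
        return a;
    }

    public static void main(String[] args) {
        // The exclusive scan of [8, 4, 9, 7] is [0, 8, 12, 21]
        System.out.println(java.util.Arrays.toString(exclusiveScan(new double[]{8, 4, 9, 7})));
    }
}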
5.3 Scatter
Figure 12: Scatter
The scatter pattern involves a single processor writing to many memory spaces at once, and this may happen asynchronously with other threads. In some circumstances the memory spaces overlap, causing a collision when multiple processors attempt to write to the same memory space; this problem was previously addressed in Section 4.2.
As previously discussed, CUDA's atomic functions can be used in this type of instance to ensure properly ordered writing.
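As a small illustration of that approach, the hypothetical kernel below scatters pixel counts into a histogram; several threads may target the same bin, so atomicAdd is used to serialise the colliding writes. The bins array is assumed to hold 256 entries and to be zero-initialised before launch.

histogramKernel.cu

extern "C"
__global__ void histogram(const unsigned char* pixels, int n, unsigned int* bins)
{
    // Scatter pattern: each thread writes into a bin chosen by its data value.
    // Many threads may hit the same bin, so the increment must be atomic.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAdd(&bins[pixels[i]], 1u);
    }
}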
5.4 Gather
Figure 13: Gather
Gather is the opposite of scatter: many memory spaces are combined through some function and the result is output to a single memory space, as for instance in a moving average. It can be difficult to implement in a PRAM model if multiple threads are used, as many collisions are created by simultaneous writes to one address space at the same time.
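A collision-free way to organise a gather is to assign each output element to its own thread, so every thread reads many inputs but writes exactly one output. The moving-average kernel below, a hypothetical example rather than code from this project, takes that form.

movingAverageKernel.cu

extern "C"
__global__ void movingAverage(const double* in, double* out, int n, int window)
{
    // Gather pattern: each thread reads 'window' neighbouring inputs and writes
    // a single output, so no two threads ever write to the same address.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i + window <= n) {
        double sum = 0.0;
        for (int k = 0; k < window; k++) {
            sum += in[i + k];
        }
        out[i] = sum / window;
    }
}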
6 Parallel Programming Patterns in Image Processing
Applying GPU acceleration to a Java program requires a lot of setup, even for simple procedures. Take, for example, the program below, which uses the map pattern to perform a simple image processing technique: inverting all the colour channels of an image. It is run as an ImageJ plugin which uses JCuda to drive the parallel execution, inverting all of the image's pixels concurrently.
InvertPlugin.java

import ij.IJ;
import ij.ImagePlus;
import ij.plugin.filter.PlugInFilter;
import ij.process.ImageProcessor;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

import jcuda.utils.KernelLauncher;

public class Invert_Plugin implements PlugInFilter
{
    private ImagePlus currentImage;
    private KernelLauncher kernelLauncher;

    @Override
    public void run(ImageProcessor imageProcessor)
    {
        int[] pixels = (int[])imageProcessor.getPixels();
        int w = imageProcessor.getWidth();
        int h = imageProcessor.getHeight();
        invert(pixels, w, h);
        currentImage.updateAndDraw();
    }

    public void invert(int pixels[], int w, int h)
    {
        // Allocate memory on the device, and copy the host data to the device
        CleverPointer<int[]> pointer = CleverPointer.copyInt(pixels);

        // Set up and call the kernel
        int blockSize = 16;
        int gridSize = (int)Math.ceil((double)Math.max(w, h)/blockSize);
        kernelLauncher.setGridSize(gridSize, gridSize);
        kernelLauncher.setBlockSize(blockSize, blockSize, 1);
        kernelLauncher.call(pointer, w, h);

        // Copy the data from the device back into the image's pixel array;
        // the CleverPointer frees its device memory when garbage collected
        System.arraycopy(pointer.getArray(), 0, pixels, 0, pixels.length);
    }

    @Override
    public int setup(String arg, ImagePlus imagePlus)
    {
        currentImage = imagePlus;

        // Obtain the CUDA source code from the CUDA file
        String cuFileName = "invertKernel.cu";

        // Create the kernelLauncher that will execute the kernel
        try {
            kernelLauncher = KernelLauncher.compile(getKernel(cuFileName), "invert");
        } catch(IOException e) {
            IJ.showMessage(
                "Failed to compile CUDA Kernel!",
                "Check that " + cuFileName + " exists.");
        }
        return DOES_RGB;
    }

    private String getKernel(String k) throws IOException
    {
        InputStream kFile = getClass().getResourceAsStream(k);
        BufferedReader br = new BufferedReader(new InputStreamReader(kFile));
        StringBuilder sb = new StringBuilder();
        String line;
        while((line = br.readLine()) != null) {
            sb.append(line).append("\n");
        }
        return sb.toString();
    }
}
invertKernel.cu

extern "C"
__global__ void invert(uchar4* data, int w, int h)
{
    int x = threadIdx.x + blockIdx.x*blockDim.x;
    int y = threadIdx.y + blockIdx.y*blockDim.y;
    if (x < w && y < h)
    {
        int index = y*w + x;
        uchar4 pixel = data[index];
        pixel.x = 255 - pixel.x;
        pixel.y = 255 - pixel.y;
        pixel.z = 255 - pixel.z;
        pixel.w = 255 - pixel.w;
        data[index] = pixel;
    }
}
Using my CleverPointer class in this case reduces the number of explicit memory management operations required and needs no manual clean-up of the global memory: when the CleverPointer object is destroyed by the GC its finalize method is called, which simultaneously frees the GPU memory.
7 Applying GPU Acceleration to OMP2D
There were two options for applying GPU acceleration to OMP2D:
One is to follow the PRAM model and compute, in parallel, the matrix operations such as the inner product and matrix multiplication. Many of these operations have already been implemented for CUDA and are available as libraries, such as the CUDA-specialised BLAS (cuBLAS). This makes them implicitly good at handling parallelisation problems such as collisions and divergence. When considering the iterative process of OMP2D and the way in which the acceleration is driven, however, an issue arises concerning the latencies involved in carrying out memory transfers to and from the GPU. These transfers, which could take up to a couple of milliseconds between each iteration, are done solely to check whether the tolerance level (c.f. (8)) has been met yet. They are necessary because there is no self-stopping mechanism which the GPU can use to finish its execution once it has reached the desired tolerance level.
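Schematically, the host-driven variant amounts to a loop of the following shape. This is only an illustrative sketch: the GPU work and the small device-to-host transfer are abstracted behind the selectNextAtom and residualNorm callbacks, neither of which exists in the current code.

ApproximationDriverSketch.java

public class ApproximationDriverSketch {

    // Stand-in for the GPU-side work and the per-iteration transfer of the residual norm
    public interface Iteration {
        void selectNextAtom(int k);   // one OMP2D step executed on the GPU
        double residualNorm();        // small device-to-host transfer for the stopping check
    }

    public static int approximate(Iteration gpu, double tolerance, int maxAtoms) {
        int k = 0;
        while (k < maxAtoms) {
            gpu.selectNextAtom(k);                    // GPU computation for atom k + 1
            k++;
            if (gpu.residualNorm() < tolerance) {     // unavoidable per-iteration transfer
                break;
            }
        }
        return k;                                     // number of atoms selected
    }
}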
Another approach is simply to process each block concurrently in its own thread in a sequential manner, similar to how the multi-threaded CPU programs described in Section 3.2 are able to calculate many blocks at once on multi-core CPU processors. This does not, however, launch as many threads, meaning lower throughput, unless small block sizes are used or the image is very large. As found in the results of Section 3.6, images processed with small block sizes already complete very quickly, so the only notable application in this case would be very large images with large block sizes.
7.1 Block-wise Parallelisation
This option would be programmed in CUDA but without the use of libraries such as cuBLAS, because their routines must be initiated from the CPU side of the execution.
Advantages:
• There would still be a good speed up gained from processing the thousands of
blocks concurrently.
• Easier to program for as there would not be any parallel issues such as collisions
or divergence.
• There would be significantly less memory transfers required.
• The process is self sufficient once it is on the GPU.
Disadvantages:
• The entire OMP2D algorithm and its related matrix operations would need to
be written in CUDA.
• There would be no BLAS support as additional kernels cannot be launched from
live kernels.
• GPU cores are not as fast as CPU cores, so the time decrease would not be linear.
7.2 Matrix Operations Parallelisation
This option would look to accelerate the matrix operations only, maintaining logic
separation of the accelerated parallel processes and the iteratively driven serial process.
Advantages:
• Allows for the use of libraries such as cuBLAS.
• Easier to inject into the Java program and able to validate the functionality
through the same unit tests which were used in the Java version.
Disadvantages:
• The number of memory transfers needed to complete an approximation increases with each iteration.
• A multiple-stream solution would be required to offset the time lost in memory transfers.
7.2.1 Multiple Streams Solution
To mitigate the memory transfer bottleneck, I intended to run multiple streams capable of transferring the most recently calculated iteration's results whilst simultaneously calculating the next iteration. This would repeat until one of the returned iterations had met the stopping condition (c.f. (8)), after which any subsequently calculated iterations would be discarded. See Figures 14 and 15, where blue represents the calculations performed on the GPU and orange the time taken to transfer memory.
Figure 14: GPGPU calculation in multiple streams.

Figure 15: Single stream GPGPU calculation.
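The sketch below shows only the JCuda stream plumbing such a scheme would require; the kernel launches are omitted, and achieving genuine overlap of transfers and computation additionally requires page-locked (pinned) host memory, which is not shown here.

MultiStreamSketch.java

import static jcuda.runtime.JCuda.cudaMemcpyAsync;
import static jcuda.runtime.JCuda.cudaStreamCreate;
import static jcuda.runtime.JCuda.cudaStreamDestroy;
import static jcuda.runtime.JCuda.cudaStreamSynchronize;
import static jcuda.runtime.cudaMemcpyKind.cudaMemcpyDeviceToHost;

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.runtime.cudaStream_t;

// Sketch of overlapping result transfers with computation using multiple streams.
// Kernel launches are omitted; real overlap also needs pinned host memory.
public class MultiStreamSketch {
    public static void copyResultsAsync(Pointer[] deviceResults, double[][] hostResults) {
        int streams = deviceResults.length;
        cudaStream_t[] stream = new cudaStream_t[streams];

        for (int s = 0; s < streams; s++) {
            stream[s] = new cudaStream_t();
            cudaStreamCreate(stream[s]);

            // Each stream copies its own iteration's results back while the
            // next iteration can be computed in another stream.
            long bytes = (long) hostResults[s].length * Sizeof.DOUBLE;
            cudaMemcpyAsync(Pointer.to(hostResults[s]), deviceResults[s],
                    bytes, cudaMemcpyDeviceToHost, stream[s]);
        }

        for (int s = 0; s < streams; s++) {
            cudaStreamSynchronize(stream[s]);   // wait for this stream's transfer
            cudaStreamDestroy(stream[s]);
        }
    }
}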
8 Conclusions
A Java implementation of OMP2D is presented here and made available to the scientific and academic community as an easily packaged ImageJ plugin. The implementation shows good performance despite its lack of accelerating libraries, such as BLAS, or GPGPU support.
A pointer class, created as a supporting element for GPGPU acceleration in Java, is also made publicly available. The class follows Java memory conventions and design philosophy more strictly and enables easy transfer and management of memory residing on GPU devices.
Two approaches are further presented and outlined for investigation into how well they might improve the performance of the Java implementation, discussing the ways in which its computationally demanding parts could be extended through GPGPU usage.
Through my exploration of GPGPU usage I found that GPGPU processing is still in its infancy and undergoing continuous development, making it hard to find up-to-date resources on the subject.
It is unfortunate that, due to time constraints, I was not able to fully effect GPU acceleration for the OMP2D algorithm; however, it is certainly an area I would be keen to pursue given the time and opportunity in the future.
8.1 Reflections
• The project as a whole was very ambitious for the time given in a single term.
• OMP2D can offer much improved image representation results in a small amount
of time in an optimised program.
• In terms of performance my Java program was sufficiently competent.
• GPGPU platforms are a constantly evolving field yet to reach mainstream adop-
tion, making it difficult to find sources of information such as tutorials.
• The process of creating GPU acceleration is fraught with complications and
difficulties.
• Through this project I have learnt MATLAB and its parallel toolkit, Git version control for my code, C/C++, and CUDA with its associated memory management policies, as well as how to write parallel programs conforming to the PRAM model.
• It has been an incredibly rewarding experience, giving me the opportunity to expand my knowledge and skills in an area in which I had no previous experience.
8.2 Future Work
Areas in which further work could extend the findings of this project are:
• To fully implement the GPGPU acceleration process for Java.
• A file saving technique which accommodates the compressed image.
• Optimisations of the Java only implementation, including the JLA library.
References
[1] M. Simons, "Image processing with Java," 2013.
[2] S. G. Mallat, A Wavelet Tour of Signal Processing: The Sparse Way, Third Edition, Academic Press, 2008.
[3] S. G. Mallat and Z. Zhang, "Matching pursuits with time-frequency dictionaries," IEEE Trans. Signal Process., 1993.
[4] Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad, "Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition," Proc. of the 27th ACSSC, pp. 40–44, 1993.
[5] A. C. L. Rebollo-Neira, J. Bowley and A. Plastino, "Self contained encrypted image folding," Physica A, 2012.
[6] L. Rebollo-Neira and J. Bowley, "Sparse representation of astronomical images," Journal of the Optical Society of America A, vol. 30, 2013.
[7] M. Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing, Springer, 2010.
[8] L. Rebollo-Neira and D. Lowe, "Optimized orthogonal matching pursuit approach," IEEE Signal Process. Letters, pp. 137–140, 2002.
[9] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "From error visibility to structural similarity," IEEE Transactions on Image Processing, 2004.
[10] M. Simons, "OMP2D ImageJ Plugin," https://github.com/simonsm1/OMP2D, 2014.
[11] C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh, "Basic linear algebra subprograms for Fortran usage," ACM Trans. Math. Software, pp. 308–323.
[12] B. Oancea, I. G. Rosca, T. Andrei, and A. I. Iacob, "Evaluating Java performance for linear algebra numerical computations," Procedia Computer Science, vol. 3, no. 0, pp. 474–478, 2011. World Conference on Information Technology.
[13] S. Chatterjee and J. Prins, "PRAM algorithms," pp. 1–2, 2009.
[14] G. Blelloch, Prefix Sums and Their Applications, Morgan Kaufmann, 1990.
9 Appendices
A Installing the OMP2D plugin in ImageJ
1. First, download the appropriate ImageJ application for your platform from http://rsb.info.nih.gov/ij/download.html and the OMP2D plugin from https://github.com/simonsm1/OMP2D/raw/master/OMP2D/OMP2D_Plugin.jar
2. Extract ImageJ to a folder of your choice and copy OMP2D_Plugin.jar to the plugins folder within.
3. Launch the ImageJ executable and you should find OMP2D listed under the Plugins menu.
37

More Related Content

What's hot

Median based parallel steering kernel regression for image reconstruction
Median based parallel steering kernel regression for image reconstructionMedian based parallel steering kernel regression for image reconstruction
Median based parallel steering kernel regression for image reconstructioncsandit
 
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTION
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTIONMEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTION
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTIONcsandit
 
HARDWARE SOFTWARE CO-SIMULATION FOR TRAFFIC LOAD COMPUTATION USING MATLAB SIM...
HARDWARE SOFTWARE CO-SIMULATION FOR TRAFFIC LOAD COMPUTATION USING MATLAB SIM...HARDWARE SOFTWARE CO-SIMULATION FOR TRAFFIC LOAD COMPUTATION USING MATLAB SIM...
HARDWARE SOFTWARE CO-SIMULATION FOR TRAFFIC LOAD COMPUTATION USING MATLAB SIM...ijcsity
 
Hpcc euler
Hpcc eulerHpcc euler
Hpcc eulerrhuzefa
 
Cache Optimization Techniques for General Purpose Graphic Processing Units
Cache Optimization Techniques for General Purpose Graphic Processing UnitsCache Optimization Techniques for General Purpose Graphic Processing Units
Cache Optimization Techniques for General Purpose Graphic Processing UnitsVajira Thambawita
 
IRJET-A Study on Parallization of Genetic Algorithms on GPUS using CUDA
IRJET-A Study on Parallization of Genetic Algorithms on GPUS using CUDAIRJET-A Study on Parallization of Genetic Algorithms on GPUS using CUDA
IRJET-A Study on Parallization of Genetic Algorithms on GPUS using CUDAIRJET Journal
 
High Performance Medical Reconstruction Using Stream Programming Paradigms
High Performance Medical Reconstruction Using Stream Programming ParadigmsHigh Performance Medical Reconstruction Using Stream Programming Paradigms
High Performance Medical Reconstruction Using Stream Programming ParadigmsQuEST Global (erstwhile NeST Software)
 
Paper on experimental setup for verifying - &quot;Slow Learners are Fast&quot;
Paper  on experimental setup for verifying  - &quot;Slow Learners are Fast&quot;Paper  on experimental setup for verifying  - &quot;Slow Learners are Fast&quot;
Paper on experimental setup for verifying - &quot;Slow Learners are Fast&quot;Robin Srivastava
 
SPEED-UP IMPROVEMENT USING PARALLEL APPROACH IN IMAGE STEGANOGRAPHY
SPEED-UP IMPROVEMENT USING PARALLEL APPROACH IN IMAGE STEGANOGRAPHYSPEED-UP IMPROVEMENT USING PARALLEL APPROACH IN IMAGE STEGANOGRAPHY
SPEED-UP IMPROVEMENT USING PARALLEL APPROACH IN IMAGE STEGANOGRAPHYcsandit
 
P5 verification
P5 verificationP5 verification
P5 verificationdragonvnu
 
Gpu application in cuda memory
Gpu application in cuda memoryGpu application in cuda memory
Gpu application in cuda memoryjournalacij
 

What's hot (15)

Median based parallel steering kernel regression for image reconstruction
Median based parallel steering kernel regression for image reconstructionMedian based parallel steering kernel regression for image reconstruction
Median based parallel steering kernel regression for image reconstruction
 
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTION
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTIONMEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTION
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTION
 
HARDWARE SOFTWARE CO-SIMULATION FOR TRAFFIC LOAD COMPUTATION USING MATLAB SIM...
HARDWARE SOFTWARE CO-SIMULATION FOR TRAFFIC LOAD COMPUTATION USING MATLAB SIM...HARDWARE SOFTWARE CO-SIMULATION FOR TRAFFIC LOAD COMPUTATION USING MATLAB SIM...
HARDWARE SOFTWARE CO-SIMULATION FOR TRAFFIC LOAD COMPUTATION USING MATLAB SIM...
 
Hpcc euler
Hpcc eulerHpcc euler
Hpcc euler
 
Cache Optimization Techniques for General Purpose Graphic Processing Units
Cache Optimization Techniques for General Purpose Graphic Processing UnitsCache Optimization Techniques for General Purpose Graphic Processing Units
Cache Optimization Techniques for General Purpose Graphic Processing Units
 
IRJET-A Study on Parallization of Genetic Algorithms on GPUS using CUDA
IRJET-A Study on Parallization of Genetic Algorithms on GPUS using CUDAIRJET-A Study on Parallization of Genetic Algorithms on GPUS using CUDA
IRJET-A Study on Parallization of Genetic Algorithms on GPUS using CUDA
 
Mini Project- 3D Graphics And Visualisation
Mini Project- 3D Graphics And VisualisationMini Project- 3D Graphics And Visualisation
Mini Project- 3D Graphics And Visualisation
 
Accelerated seam carving using cuda
Accelerated seam carving using cudaAccelerated seam carving using cuda
Accelerated seam carving using cuda
 
High Performance Medical Reconstruction Using Stream Programming Paradigms
High Performance Medical Reconstruction Using Stream Programming ParadigmsHigh Performance Medical Reconstruction Using Stream Programming Paradigms
High Performance Medical Reconstruction Using Stream Programming Paradigms
 
Parallel Processor for Graphics Acceleration
Parallel Processor for Graphics AccelerationParallel Processor for Graphics Acceleration
Parallel Processor for Graphics Acceleration
 
Paper on experimental setup for verifying - &quot;Slow Learners are Fast&quot;
Paper  on experimental setup for verifying  - &quot;Slow Learners are Fast&quot;Paper  on experimental setup for verifying  - &quot;Slow Learners are Fast&quot;
Paper on experimental setup for verifying - &quot;Slow Learners are Fast&quot;
 
SPEED-UP IMPROVEMENT USING PARALLEL APPROACH IN IMAGE STEGANOGRAPHY
SPEED-UP IMPROVEMENT USING PARALLEL APPROACH IN IMAGE STEGANOGRAPHYSPEED-UP IMPROVEMENT USING PARALLEL APPROACH IN IMAGE STEGANOGRAPHY
SPEED-UP IMPROVEMENT USING PARALLEL APPROACH IN IMAGE STEGANOGRAPHY
 
P5 verification
P5 verificationP5 verification
P5 verification
 
GPGPU_report_v3
GPGPU_report_v3GPGPU_report_v3
GPGPU_report_v3
 
Gpu application in cuda memory
Gpu application in cuda memoryGpu application in cuda memory
Gpu application in cuda memory
 

Similar to Orthogonal Matching Pursuit in 2D for Java with GPGPU Prospectives

IRJET- Compress Image without Losing Image Quality using nQuant Library
IRJET-  	  Compress Image without Losing Image Quality using nQuant LibraryIRJET-  	  Compress Image without Losing Image Quality using nQuant Library
IRJET- Compress Image without Losing Image Quality using nQuant LibraryIRJET Journal
 
Graphics processing unit ppt
Graphics processing unit pptGraphics processing unit ppt
Graphics processing unit pptSandeep Singh
 
Cuda Based Performance Evaluation Of The Computational Efficiency Of The Dct ...
Cuda Based Performance Evaluation Of The Computational Efficiency Of The Dct ...Cuda Based Performance Evaluation Of The Computational Efficiency Of The Dct ...
Cuda Based Performance Evaluation Of The Computational Efficiency Of The Dct ...acijjournal
 
GPU-BASED IMAGE SEGMENTATION USING LEVEL SET METHOD WITH SCALING APPROACH
GPU-BASED IMAGE SEGMENTATION USING LEVEL SET METHOD WITH SCALING APPROACHGPU-BASED IMAGE SEGMENTATION USING LEVEL SET METHOD WITH SCALING APPROACH
GPU-BASED IMAGE SEGMENTATION USING LEVEL SET METHOD WITH SCALING APPROACHcsandit
 
FACE COUNTING USING OPEN CV & PYTHON FOR ANALYZING UNUSUAL EVENTS IN CROWDS
FACE COUNTING USING OPEN CV & PYTHON FOR ANALYZING UNUSUAL EVENTS IN CROWDSFACE COUNTING USING OPEN CV & PYTHON FOR ANALYZING UNUSUAL EVENTS IN CROWDS
FACE COUNTING USING OPEN CV & PYTHON FOR ANALYZING UNUSUAL EVENTS IN CROWDSIRJET Journal
 
GPU HistoPyramid Based Fluid Simulation and Rendering
GPU HistoPyramid Based Fluid Simulation and RenderingGPU HistoPyramid Based Fluid Simulation and Rendering
GPU HistoPyramid Based Fluid Simulation and RenderingJoão Vicente P. Reis Fo.
 
Technical Documentation_Embedded_Image_DSP_Projects
Technical Documentation_Embedded_Image_DSP_ProjectsTechnical Documentation_Embedded_Image_DSP_Projects
Technical Documentation_Embedded_Image_DSP_ProjectsEmmanuel Chidinma
 
GPU-BASED IMAGE SEGMENTATION USING LEVEL SET METHOD WITH SCALING APPROACH
GPU-BASED IMAGE SEGMENTATION USING LEVEL SET METHOD WITH SCALING APPROACHGPU-BASED IMAGE SEGMENTATION USING LEVEL SET METHOD WITH SCALING APPROACH
GPU-BASED IMAGE SEGMENTATION USING LEVEL SET METHOD WITH SCALING APPROACHcscpconf
 
A New Approach for Parallel Region Growing Algorithm in Image Segmentation u...
A New Approach for Parallel Region Growing Algorithm in Image Segmentation u...A New Approach for Parallel Region Growing Algorithm in Image Segmentation u...
A New Approach for Parallel Region Growing Algorithm in Image Segmentation u...maneesh boddu
 
Real-Time Implementation and Performance Optimization of Local Derivative Pat...
Real-Time Implementation and Performance Optimization of Local Derivative Pat...Real-Time Implementation and Performance Optimization of Local Derivative Pat...
Real-Time Implementation and Performance Optimization of Local Derivative Pat...IJECEIAES
 
Photo Editing And Sharing Web Application With AI- Assisted Features
Photo Editing And Sharing Web Application With AI- Assisted FeaturesPhoto Editing And Sharing Web Application With AI- Assisted Features
Photo Editing And Sharing Web Application With AI- Assisted FeaturesIRJET Journal
 
An Integer Programming Representation for Data Center Power-Aware Management ...
An Integer Programming Representation for Data Center Power-Aware Management ...An Integer Programming Representation for Data Center Power-Aware Management ...
An Integer Programming Representation for Data Center Power-Aware Management ...Arinto Murdopo
 
Accelerated seam carving using cuda
Accelerated seam carving using cudaAccelerated seam carving using cuda
Accelerated seam carving using cudaeSAT Journals
 
SPEED-UP IMPROVEMENT USING PARALLEL APPROACH IN IMAGE STEGANOGRAPHY
SPEED-UP IMPROVEMENT USING PARALLEL APPROACH IN IMAGE STEGANOGRAPHY SPEED-UP IMPROVEMENT USING PARALLEL APPROACH IN IMAGE STEGANOGRAPHY
SPEED-UP IMPROVEMENT USING PARALLEL APPROACH IN IMAGE STEGANOGRAPHY cscpconf
 
Feature detection & extraction
Feature detection & extractionFeature detection & extraction
Feature detection & extractionArysha Channa
 
F3-DP-2015-Milata-Tomas-java-ee-batch-editor (1)
F3-DP-2015-Milata-Tomas-java-ee-batch-editor (1)F3-DP-2015-Milata-Tomas-java-ee-batch-editor (1)
F3-DP-2015-Milata-Tomas-java-ee-batch-editor (1)Tomáš Milata
 

Similar to Orthogonal Matching Pursuit in 2D for Java with GPGPU Prospectives (20)

Report climb-gui
Report climb-guiReport climb-gui
Report climb-gui
 
IRJET- Compress Image without Losing Image Quality using nQuant Library
IRJET-  	  Compress Image without Losing Image Quality using nQuant LibraryIRJET-  	  Compress Image without Losing Image Quality using nQuant Library
IRJET- Compress Image without Losing Image Quality using nQuant Library
 
Graphics processing unit ppt
Graphics processing unit pptGraphics processing unit ppt
Graphics processing unit ppt
 
Cuda Based Performance Evaluation Of The Computational Efficiency Of The Dct ...
Cuda Based Performance Evaluation Of The Computational Efficiency Of The Dct ...Cuda Based Performance Evaluation Of The Computational Efficiency Of The Dct ...
Cuda Based Performance Evaluation Of The Computational Efficiency Of The Dct ...
 
GPU-BASED IMAGE SEGMENTATION USING LEVEL SET METHOD WITH SCALING APPROACH
GPU-BASED IMAGE SEGMENTATION USING LEVEL SET METHOD WITH SCALING APPROACHGPU-BASED IMAGE SEGMENTATION USING LEVEL SET METHOD WITH SCALING APPROACH
GPU-BASED IMAGE SEGMENTATION USING LEVEL SET METHOD WITH SCALING APPROACH
 
FACE COUNTING USING OPEN CV & PYTHON FOR ANALYZING UNUSUAL EVENTS IN CROWDS
FACE COUNTING USING OPEN CV & PYTHON FOR ANALYZING UNUSUAL EVENTS IN CROWDSFACE COUNTING USING OPEN CV & PYTHON FOR ANALYZING UNUSUAL EVENTS IN CROWDS
FACE COUNTING USING OPEN CV & PYTHON FOR ANALYZING UNUSUAL EVENTS IN CROWDS
 
GPU HistoPyramid Based Fluid Simulation and Rendering
GPU HistoPyramid Based Fluid Simulation and RenderingGPU HistoPyramid Based Fluid Simulation and Rendering
GPU HistoPyramid Based Fluid Simulation and Rendering
 
Technical Documentation_Embedded_Image_DSP_Projects
Technical Documentation_Embedded_Image_DSP_ProjectsTechnical Documentation_Embedded_Image_DSP_Projects
Technical Documentation_Embedded_Image_DSP_Projects
 
GPU-BASED IMAGE SEGMENTATION USING LEVEL SET METHOD WITH SCALING APPROACH
GPU-BASED IMAGE SEGMENTATION USING LEVEL SET METHOD WITH SCALING APPROACHGPU-BASED IMAGE SEGMENTATION USING LEVEL SET METHOD WITH SCALING APPROACH
GPU-BASED IMAGE SEGMENTATION USING LEVEL SET METHOD WITH SCALING APPROACH
 
A New Approach for Parallel Region Growing Algorithm in Image Segmentation u...
A New Approach for Parallel Region Growing Algorithm in Image Segmentation u...A New Approach for Parallel Region Growing Algorithm in Image Segmentation u...
A New Approach for Parallel Region Growing Algorithm in Image Segmentation u...
 
Real-Time Implementation and Performance Optimization of Local Derivative Pat...
Real-Time Implementation and Performance Optimization of Local Derivative Pat...Real-Time Implementation and Performance Optimization of Local Derivative Pat...
Real-Time Implementation and Performance Optimization of Local Derivative Pat...
 
Photo Editing And Sharing Web Application With AI- Assisted Features
Photo Editing And Sharing Web Application With AI- Assisted FeaturesPhoto Editing And Sharing Web Application With AI- Assisted Features
Photo Editing And Sharing Web Application With AI- Assisted Features
 
An Integer Programming Representation for Data Center Power-Aware Management ...
An Integer Programming Representation for Data Center Power-Aware Management ...An Integer Programming Representation for Data Center Power-Aware Management ...
An Integer Programming Representation for Data Center Power-Aware Management ...
 
Accelerated seam carving using cuda
Accelerated seam carving using cudaAccelerated seam carving using cuda
Accelerated seam carving using cuda
 
SPEED-UP IMPROVEMENT USING PARALLEL APPROACH IN IMAGE STEGANOGRAPHY
SPEED-UP IMPROVEMENT USING PARALLEL APPROACH IN IMAGE STEGANOGRAPHY SPEED-UP IMPROVEMENT USING PARALLEL APPROACH IN IMAGE STEGANOGRAPHY
SPEED-UP IMPROVEMENT USING PARALLEL APPROACH IN IMAGE STEGANOGRAPHY
 
Feature detection & extraction
Feature detection & extractionFeature detection & extraction
Feature detection & extraction
 
F3-DP-2015-Milata-Tomas-java-ee-batch-editor (1)
F3-DP-2015-Milata-Tomas-java-ee-batch-editor (1)F3-DP-2015-Milata-Tomas-java-ee-batch-editor (1)
F3-DP-2015-Milata-Tomas-java-ee-batch-editor (1)
 
FinalReport
FinalReportFinalReport
FinalReport
 
GPU Computing
GPU ComputingGPU Computing
GPU Computing
 
Performance_Programming
Performance_ProgrammingPerformance_Programming
Performance_Programming
 

Recently uploaded

Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfjoe51371421
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendArshad QA
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
Clustering techniques data mining book ....
Clustering techniques data mining book ....Clustering techniques data mining book ....
Clustering techniques data mining book ....ShaimaaMohamedGalal
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️anilsa9823
 
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfCionsystems
 

Recently uploaded (20)

Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and Backend
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
Exploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the ProcessExploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the Process
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Clustering techniques data mining book ....
Clustering techniques data mining book ....Clustering techniques data mining book ....
Clustering techniques data mining book ....
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdf
 

Orthogonal Matching Pursuit in 2D for Java with GPGPU Prospectives

  • 1. Orthogonal Matching Pursuit in 2D for Java with GPGPU Prospectives Matt Simons 6th May 2014 Abstract With the advent of new processing technologies which can bring about sig- nificant performance and quality gains to image processing, now is a good time to make contributions to that field. The Orthogonal Matching Pursuit algo- rithm dedicated to two dimensions (OMP2D), is an effective approach for image representation outside the traditional transformation framework. The project has focussed on creating a Java implementation of OMP2D, because no current implementation existed. This report outlines the algorithm, details how it has been implemented in Java and discusses the performance gains which Graphics Processor Unit (GPU) accelerated processing can yield in calculating it. A fully implemented OMP2D ImageJ plugin has been produced with good performance results, especially considering its inability to benefit from optimised linear algebra libraries. The developed open source software and documentation has been made publicly available through a dedicated website. Methods of fully exploiting GPUs are discussed, proposing how further performance improve- ments could be made through mass parallelisation techniques. 1
  • 2. Preface This project follows on from the research that I undertook while writing the Mathe- matics Report (MR) entitled: Image processing with Java [1]. In that report I have addressed issues such as: Why Java should be used for image processing, the advan- tages of developing using the ImageJ libraries and platform and the current challenges for achieving GPU acceleration in image processing techniques on a personal computer. For the reader’s convenience I have transcribed in this section the statement of con- tributions of my MR as well as my final reflections. Statement of contributions of the MR The preceding report: • Provides an introductory tutorial for image processing with Java. • Reviews current practices for storing images in lossless and lossy formats and which is the most appropriate for a given application. • Acquaints the reader who has an assumed basic knowledge of programming or scripting, in any language such as MATLAB or C, to object-oriented program- ming using Java. • Describes the advantages of object-oriented programming in terms of image pro- cessing methods. • Demonstrates examples of image processing methods using ImageJ plugins. • Informs the reader of emerging technologies in the field of parallelisation. • Makes critical arguments on how best to implement GPU acceleration in Java. • Theorises the extent of which emerging technologies such as future Java releases and NVIDIA Grid could impact image processing methods. 2
  • 3. Reflections of the MR Its final reflections were: • The object-oriented nature of Java enables it to excel at point-wise operations on pixels or groups of pixels. • There were many image file formats before the standardising effects of the inter- net. This process brought us a handful of file formats still used today; each of which is optimised for a specific purpose by experts. • Many image processing techniques can be accelerated by using parallelisation methods • Parallel processing when used in appropriate circumstances can provide a sub- stantial mark-up in speed. • There is significant interest in exploiting GPU architectures to further the func- tionality that parallelisation can offer. • Both CUDA and OpenCL are equal standards achieving similar performances. • When exploiting GPUs considerations on bottlenecks such as memory transfer speeds must be taken into account to contrast the advantages of mass paralleli- sation. In the event of small data sets the transfer time can outweigh the time saved by processing a method in parallel. • The JCuda library is the most revered for producing GPGPU code with less development time. • Work which focuses on the inclusion of JCuda optimisations for ImageJ functions is under continuous development. • Java vendors are positioning themselves to utilise the newly available GPU APIs in newer versions. • Distributed or cluster computing could also be used to effect parallelisation meth- ods. • Using NVIDIA grid computations where clusters of GPUs can be accessed and virtualised may yield faster results. 3
  • 4. Contents 1 Introduction 5 2 Sparse Image Representation 6 2.1 Orthogonal Matching Pursuit in 2D . . . . . . . . . . . . . . . . . . . . 7 2.2 Quantification of sparsity and approximation quality . . . . . . . . . . 9 2.3 Computational complexity and storage . . . . . . . . . . . . . . . . . . 9 2.4 Blockwise approximation . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.5 Constructing the dictionaries . . . . . . . . . . . . . . . . . . . . . . . . 10 3 OMP2D in Java 11 3.1 Expanding Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 3.2 Multi-threading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 3.3 ImageJ Plugin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 3.4 Performance Optimisations . . . . . . . . . . . . . . . . . . . . . . . . . 14 3.5 BLAS Library acceleration . . . . . . . . . . . . . . . . . . . . . . . . . 15 3.6 Comparisons with other implementations . . . . . . . . . . . . . . . . . 16 4 GPU Acceleration 21 4.1 GPU Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 4.2 The PRAM model of parallel computation . . . . . . . . . . . . . . . . 23 4.3 Kernel Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 4.4 Memory Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 5 GPU Programming Patterns 27 5.1 Map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 5.2 Reduce & Scan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 5.3 Scatter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 5.4 Gather . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 6 Parallel Programming Patterns in Image Processing 31 7 Applying GPU Acceleration to OMP2D 32 7.1 Block-wise Parallelisation . . . . . . . . . . . . . . . . . . . . . . . . . . 33 7.2 Matrix Operations Parallelisation . . . . . . . . . . . . . . . . . . . . . 33 8 Conclusions 34 8.1 Reflections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 8.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 9 Appendices 37 A Installing the OMP2D plugin in ImageJ 37 4
  • 5. 1 Introduction Image processing is still a relatively young branch of applied mathematics which is continually evolving. An especially notable application, where research has been very proactive is image compression. The first step of a typical compression scheme involves a mathematical transformation which maps the image onto a transformed domain, where the representation is of a smaller dimension. When a significant reduction in the dimensionality is achieved, the image is said to be sparse in the transformed domain. The popular compression standard JPEG implements the transformation using the Discrete Cosine Transform, whilest the updated version, JPEG2000, uses the Discrete Wavelet Transform. In the last fifteen years however, research has demonstrated that significant improve- ments in the sparsity of an image’s representation can be obtained if traditional trans- formations are replaced by approaches called pursuit strategies [2]. Within this frame- work, a given image is represented as superposition of elements which are chosen from a redundant set which is of much larger dimension than the image to be approximated. The seminal approach in this direction is a simple algorithm called Matching Pursuit (MP) [3]. It has evolved into a number of refinements of higher complexity though. One such refinement is the Orthogonal Matching Pursuit (OMP) approach [4]. It has recently been shown that a dedicated implementation of this approach, in two dimen- sions (OMP2D), is very effective in terms of speed and performance for sparse image representation [5,6]. Applications of sparse representations go beyond image compression to benefit areas such as image denoising and feature extraction [7]. Since to the best of my knowledge there was no implementation of OMP2D in Java, I decided to produce one and make it publicly available to the scientific and academic community. As it is a Java application my implementation is platform independent (write and compile once, and it will run on any operating system) making it widely accessible, I also developed a companion plugin to facilitate its use in the free and user friendly image processing environment ImageJ. As previously mentioned in my Mathematical Report (MR) [1], ImageJ was created by US National Institute of Health for medical applications of image processing, and is widely used in academia for other image processing fields. My implementation was created with the anticipation of some later accelerating ca- pabilities being added, with that in mind it was made to be easily extendible. As such I set out methods for optimising my program’s performance by exploiting the massive parallel computing capabilities of GPUs through use of General Processing on Graphics Processing Units (GPGPU) programming technologies. Statement of Contributions of the Mathematics Project The problems I have addressed in this project are motivated and supported by the content of the previous MR. The contributions of the project include: 5
  • 6. • A Platform independent implementation of the Orthogonal Matching Pursuit approach in Java. • An ImageJ plugin which applies the implementation. • An Overview of two methods of applying GPGPU acceleration for OMP2D. • A CleverPointer class capable of automated memory management compatible with Java’s GC. • A comparison of the performance of my Java implementing with existing imple- mentations of the algorithm in MATLAB and C++ Mex. • Construction of a website for the free distribution of the software under an open source license http://simonsm1.github.io/OMP2D/. 2 Sparse Image Representation Let us start the mathematical setting of the subject matter by introducing the adopted notational convention: R and N represent the sets of real and natural numbers, respectively. Boldface letters are used to indicate Euclidean vectors or matrices, whilst standard mathematical fonts indicate components, e.g. d ∈ RN is a vector of components d(i), i = 1, . . . , N and I ∈ RNx×Ny is a matrix of elements I(i, j), i = 1, . . . , Nx, j = 1, . . . , Ny. The notation B, I F indicates the Frobenius inner product of B and I as defined by B, I F = Nx,Ny i=1 j=1 B(i, j)I(i, j). The corresponding induced norm is indicated as · 2 F = ·, · F. Given an image, which is expressed as an array I ∈ RNx×Ny of intensity value pixels (e.g. 0-255), we can approximate it by the linear decomposition IK = K k=1 ckd k , (1) where each ck is a scalar and each d k is an element of RNx×Ny which is selected from a set, D = {dn}M n=1, called a ‘dictionary’. A sparse approximation of I ∈ RNx×Ny is an approximation of the form (1) such that the number of elements K in the decomposition is significantly smaller than those in I. The terms in the decomposition (1) are chosen from a large redundant dictionary. The chosen elements d k in (1), are called ‘atoms’ and are selected according to an optimality criterion. 6
  • 7. The most sparse decomposition of an image, within the redundant dictionary frame- work for approximation, can be found by approximating the image by the ‘atomic decomposition’ (1) such that the number K of atoms is minimum. The minimizing process of the elements is however restricted to (1) which creates an exhaustive search combinatorial problem and is therefore intractable. Hence, instead of looking for the sparsest solution we look for a ‘satisfactory solution’, i.e. a solution such that the number of K-terms in (1) is considerably smaller than the image dimension. This can be effectively achieved by the greedy technique called Orthogonal Matching Pursuit (OMP) This approach selects the atoms in the decomposition (1) in a stepwise manner, as will be described in the next section. 2.1 Orthogonal Matching Pursuit in 2D OMP was first introduced in [4]. The implementation here I will be using is dedi- cated to 2D (OMP2D). This version of the algorithm uses separable dictionaries, i.e, a 2D dictionary which corresponds to the tensor product of two 1D dictionaries. The implementation is based on adaptive biorthogonalization and Gram-Schmidt orthog- onalization procedures, as proposed in [8] for one dimensional case and generalized to separable 2D dictionaries in [5,6]. The images which I will be considering are assumed to be grey-scale images. Given an image I ∈ RNx×Ny and two 1D dictionaries Dx = {dx n ∈ RNx }Mx n=1 and Dy = {dy m ∈ RNy } My m=1 we approximate the array I ∈ RNx×Ny by the decomposition IK (i, j) = K n=1 cndx x n (i)dy y n (j). (2) For selecting the atoms dx x n , dy y n , i = 1, . . . , K the OMP selection criterion evolves as follows: When setting R0 = I for iteration k + 1 the algorithm selects the atoms dx x k+1 ∈ Dx and dy y k+1 ∈ Dy which maximize the Frobenius inner product’s absolute value | dx n, Rdy m F|, n = 1, . . . , Mx, m = 1, . . . , My, i.e. x k+1, y k+1 = arg max n=1,...,Mx m=1,...,My Nx,Ny i=1 j=1 |dx n(i)Rk (i, j)dy m(j)|, where Rk (i, j) = I(i, j) − k n=1 cndx x n (i)dy y n (j). (3) The coefficients cn, n = 1, . . . , k are calculated such that Rk F is minimised. This is achieved by calculating these coefficients as 7
$$c_n = \langle \mathbf{B}^k_n, \mathbf{I} \rangle_F, \quad n = 1, \dots, k, \qquad (4)$$

where the matrices $\mathbf{B}^k_n,\, n = 1, \dots, k$, are recursively constructed at each iteration in such a way that $\|\mathbf{R}^k\|_F$ is minimised. This is ensured by requesting that $\mathbf{R}^k = \mathbf{I} - \hat{P}_{\mathcal{V}_k}\mathbf{I}$, where $\hat{P}_{\mathcal{V}_k}$ is the orthogonal projection operator onto $\mathcal{V}_k = \mathrm{span}\{\mathbf{d}^x_{\ell^x_n} \otimes \mathbf{d}^y_{\ell^y_n}\}_{n=1}^{k}$. The required representation of $\hat{P}_{\mathcal{V}_k}$ is of the form

$$\hat{P}_{\mathcal{V}_k}\mathbf{I} = \sum_{n=1}^{k} \mathbf{A}_n \langle \mathbf{B}^k_n, \mathbf{I} \rangle_F,$$

where each $\mathbf{A}_n \in \mathbb{R}^{N_x \times N_y}$ is an array built from the selected atoms, $\mathbf{A}_n = \mathbf{d}^x_{\ell^x_n} \otimes \mathbf{d}^y_{\ell^y_n}$, and $\mathbf{B}^k_n,\, n = 1, \dots, k$, are the concomitant reciprocal matrices. These are the unique elements of $\mathbb{R}^{N_x \times N_y}$ satisfying the conditions:

i) $\langle \mathbf{A}_n, \mathbf{B}^k_m \rangle_F = \delta_{n,m} = \begin{cases} 1 & \text{if } n = m, \\ 0 & \text{if } n \neq m; \end{cases}$

ii) $\mathcal{V}_k = \mathrm{span}\{\mathbf{B}^k_n\}_{n=1}^{k}$.

Such matrices can be adaptively constructed through the recursion formula [8]:

$$\mathbf{B}^{k+1}_n = \mathbf{B}^k_n - \mathbf{B}^{k+1}_{k+1} \langle \mathbf{A}_{k+1}, \mathbf{B}^k_n \rangle_F, \quad n = 1, \dots, k,$$

where $\mathbf{B}^{k+1}_{k+1} = \mathbf{W}_{k+1} / \|\mathbf{W}_{k+1}\|_F^2$, with $\mathbf{W}_1 = \mathbf{A}_1$ and

$$\mathbf{W}_{k+1} = \mathbf{A}_{k+1} - \sum_{n=1}^{k} \frac{\mathbf{W}_n}{\|\mathbf{W}_n\|_F^2} \langle \mathbf{W}_n, \mathbf{A}_{k+1} \rangle_F. \qquad (5)$$

For numerical accuracy of the orthogonality of the matrices $\mathbf{W}_n,\, n = 1, \dots, k+1$, at least one re-orthogonalization step has to be implemented. This implies recalculating the matrices as

$$\mathbf{W}_{k+1} \leftarrow \mathbf{W}_{k+1} - \sum_{n=1}^{k} \frac{\mathbf{W}_n}{\|\mathbf{W}_n\|_F^2} \langle \mathbf{W}_n, \mathbf{W}_{k+1} \rangle_F. \qquad (6)$$

With the matrices $\mathbf{B}^k_n,\, n = 1, \dots, k$, constructed as above, the required coefficients are calculated as in (4).

Remark: It is appropriate to note at this point that the original Matching Pursuit (MP) approach [3], from which OMP has evolved, does not calculate the coefficients as in (4). Instead the coefficients are calculated simply as

$$c_n = \langle \mathbf{d}^x_{\ell^x_n}, \mathbf{R}^k \mathbf{d}^y_{\ell^y_n} \rangle_F, \quad n = 1, \dots, k. \qquad (7)$$

Although the implementation of MP is much simpler, the coefficients (7) do not minimize the norm of the residual error at each iteration step. In this sense, MP is not a stepwise optimal approach.
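The recursion (5) together with the re-orthogonalization (6) is in essence a Gram-Schmidt step applied to the selected atoms and repeated once for numerical accuracy. The following stand-alone Java sketch illustrates the idea on blocks stored as flat arrays, where the Frobenius inner product reduces to a plain dot product; it is my own illustration and not the plugin's orthogonalize and reorthogonalize methods.

    // Illustrative sketch of the orthogonalization in (5) and (6); flat arrays,
    // so the Frobenius inner product is the dot product of the flattened blocks.
    static double dot(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    // Given the orthogonal set W_1..W_k (rows of w) and the newly selected atom A_{k+1},
    // returns W_{k+1}: pass 0 applies (5), pass 1 applies the re-orthogonalization (6).
    static double[] orthogonalize(double[][] w, int k, double[] newAtom) {
        double[] wNew = newAtom.clone();
        for (int pass = 0; pass < 2; pass++) {
            for (int n = 0; n < k; n++) {
                double proj = dot(w[n], wNew) / dot(w[n], w[n]);
                for (int i = 0; i < wNew.length; i++) {
                    wNew[i] -= proj * w[n][i];   // subtract the projection onto W_n
                }
            }
        }
        return wNew;
    }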
Both the MP and OMP2D algorithms iterate up to a $K$th step, at which the stopping criterion $\|\mathbf{I} - \mathbf{I}^K\|_F^2 < \rho$, for a given $\rho$, is met.

2.2 Quantification of sparsity and approximation quality

To quantify the quality of an image approximation the most commonly used measures are the classical Peak Signal-to-Noise Ratio (PSNR) and the Structural SIMilarity (SSIM) index [9]. The PSNR involves a simple calculation, as it is defined by

$$\mathrm{PSNR} = 10 \log_{10} \frac{(2^{l_b} - 1)^2}{\mathrm{MSE}}, \qquad (8)$$

where $l_b$ is the number of bits used to represent the intensity of the pixels and

$$\mathrm{MSE} = \frac{\|\mathbf{I} - \mathbf{I}^K\|_F^2}{N_x N_y}.$$

The SSIM index [9] is a method for measuring the similarity between two images; for identical images SSIM = 1. The software implementing this measure is available at https://ece.uwaterloo.ca/~z70wang/research/ssim/. Since I restrict considerations to high quality approximations, I aim to achieve a very high PSNR and an SSIM very close to one. To measure sparsity I will use the Sparsity Ratio (SR), which is defined as

$$\mathrm{SR} = \frac{\text{Number of intensity points in the image}}{\text{Number of coefficients in the approximation}}.$$

2.3 Computational complexity and storage

It is important to stress that the main advantage of implementing Matching Pursuit in 2D with separable dictionaries is the significant saving in storage. Although a 2D array can be stored as a long 1D array of identical dimensionality, stored in that way a 2D dictionary would require $M = M_x M_y$ arrays of dimension $N_x N_y$. For a separable dictionary, the number of arrays reduces to $M_x + M_y$ and the dimensions reduce to $N_x$ and $N_y$, respectively. For simplicity let us consider $M_x = M_y = M$ and $N_x = N_y = N$. It is clear then that the storage of a separable dictionary is linear, $O(MN)$, while in the non-separable case it is $O(M^2 N^2)$.

The separability of the dictionary also has a significant implication for the selection step (3). The complexity of evaluating the inner products $\langle \mathbf{d}^x_n, \mathbf{R}\mathbf{d}^y_m \rangle_F,\, n = 1, \dots, M,\, m = 1, \dots, M$, would change from $O(M^2 N + N^2 M)$ to $O(N^2 M^2)$ for a non-separable dictionary. This would also affect the computation of the MP coefficients. In OMP2D, however, the calculation of the coefficients does not benefit from separability. The complexity of calculating the matrices $\mathbf{B}^k_n,\, n = 1, \dots, k$, is $O(kN^2)$.
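To make the benefit of separability concrete, the selection step (3) can be evaluated as two small matrix products, $G = D^x R (D^y)^T$, without ever forming a 2D atom. The sketch below is illustrative only (it is not the plugin's findNextAtom method) and assumes the 1D atoms are stored as the rows of dx and dy; its cost is of the order $M N^2 + M^2 N$ rather than $M^2 N^2$.

    // Illustrative sketch of the separable selection step (3):
    // G(n,m) = <d^x_n, R d^y_m>_F, computed as G = Dx * R * Dy^T in two stages.
    // dx: Mx x Nx (atoms as rows), r: Nx x Ny residual, dy: My x Ny (atoms as rows).
    static int[] findMaxInnerProduct(double[][] dx, double[][] r, double[][] dy) {
        int mx = dx.length, nx = r.length, ny = r[0].length, my = dy.length;

        // Stage 1: T = Dx * R, an Mx x Ny array (cost O(Mx * Nx * Ny)).
        double[][] t = new double[mx][ny];
        for (int n = 0; n < mx; n++)
            for (int i = 0; i < nx; i++)
                for (int j = 0; j < ny; j++)
                    t[n][j] += dx[n][i] * r[i][j];

        // Stage 2: scan G = T * Dy^T and keep the largest |G(n,m)| (cost O(Mx * My * Ny)).
        int bestN = 0, bestM = 0;
        double best = -1.0;
        for (int n = 0; n < mx; n++) {
            for (int m = 0; m < my; m++) {
                double g = 0.0;
                for (int j = 0; j < ny; j++) g += t[n][j] * dy[m][j];
                if (Math.abs(g) > best) { best = Math.abs(g); bestN = n; bestM = m; }
            }
        }
        return new int[] { bestN, bestM };   // the indices of the selected atom pair
    }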
2.4 Blockwise approximation

From the discussion of the previous section it follows that the OMP2D technique cannot be implemented on a complete image. The approximation is made possible by partitioning the image into small blocks, as illustrated in Figure 1.

Figure 1: Blockwise splitting of an image

For simplicity I will process an image by dividing it into, say, $Q$ square blocks $\mathbf{I}_q,\, q = 1, \dots, Q$, of $N_q \times N_q$ intensity pixels. Each block is approximated independently of the others by an atomic decomposition:

$$\mathbf{I}^{K_q}_q = \sum_{n=1}^{K_q} c^q_n\, \mathbf{d}^x_{\ell^{x,q}_n} \otimes \mathbf{d}^y_{\ell^{y,q}_n}, \quad q = 1, \dots, Q. \qquad (9)$$

The approximation of the whole image $\mathbf{I}^K$ is then obtained by assembling all the approximated blocks, i.e. $\mathbf{I}^K = \bigcup_{q=1}^{Q} \mathbf{I}^{K_q}_q$.

2.5 Constructing the dictionaries

The simple mixed dictionary used for this project consists of three components for each 1D dictionary:

• A Redundant Discrete Cosine (RDC) dictionary $\mathcal{D}^x_1$ as given by

$$\mathcal{D}^x_1 = \Big\{ w^c_i \cos\frac{\pi(2j-1)(i-1)}{2M_x},\; j = 1, \dots, N_x \Big\}_{i=1}^{M_x},$$
with $w^c_i,\, i = 1, \dots, M_x$, normalization factors. For $M_x = N_x$ this set is a Discrete Cosine orthonormal basis for the Euclidean space $\mathbb{R}^{N_x}$. For $M_x = 2lN_x$, with $l \in \mathbb{N}$, the set is an RDC dictionary with redundancy $2l$, which will be fixed equal to 2.

• A Redundant Discrete Sine (RDS) dictionary $\mathcal{D}^x_2$ as given by

$$\mathcal{D}^x_2 = \Big\{ w^s_i \sin\frac{\pi(2j-1)(i-1)}{2M_x},\; j = 1, \dots, N_x \Big\}_{i=1}^{M_x},$$

with $w^s_i,\, i = 1, \dots, M_x$, normalization factors. For $M_x = N_x$ this set is a Discrete Sine orthonormal basis for the Euclidean space $\mathbb{R}^{N_x}$. For $M_x = 2lN_x$, with $l \in \mathbb{N}$, the set is an RDS dictionary with redundancy $2l$, which will be fixed equal to 2.

• The standard Euclidean basis, also called the Dirac basis, i.e.

$$\mathcal{D}^x_3 = \{ e_i(j) = \delta_{i,j},\; j = 1, \dots, N_x \}_{i=1}^{N_x}.$$

The whole dictionary is then constructed as $\mathcal{D}^x = \mathcal{D}^x_1 \cup \mathcal{D}^x_2 \cup \mathcal{D}^x_3$.

3 OMP2D in Java

I decided to implement OMP2D in Java because of its platform independence and, hence, its reach and ease of distribution. The algorithm itself was very computationally expensive to implement, due to the number of matrix operations required over multiple iterations. Foreseeing this, I structured my program in such a way that the matrix class which carries out these operations could later be extended to enable GPU acceleration through a GPGPU interfacing library. This would make it easier to inject acceleration into the program at a later point and bring about the performance benefits of mass parallelisation; see Figures 2 and 3 for an example of how the biorthogonal matrix object could be replaced with a GPU-accelerated version. To aid the design process I programmed in a Test Driven Development (TDD) manner, using black box test cases. When the GPGPU extensions were later added, the black box test cases could once again be used to ensure compliance with the program's original design.

Note: some of the methods and variables of the classes have been omitted; see the online documentation of the program's classes for full details [10].

3.1 Expanding Matrices

During development it became evident that there would be a need for some of the stored matrices to be expandable, namely the orthogonal and
biorthogonal matrices, c.f. (5). This functionality is not available natively for Java arrays, as they are set to a fixed size on creation. Initial attempts to expand existing matrices by creating new, larger matrices and copying across the data from the previous iteration proved costly. I therefore changed focus to create an advanced matrix object which could increment its size to accommodate more values on demand.

Figure 2: Case of OMP2D using standard matrix operations on CPU (class diagram: the OMP2D class, with methods findNextAtom, orthogonalize, reorthogonalize, calcBiorthogonal and updateResidual, holds its biorthogonal field as a Matrix backed by an ArrayList<double[]> with methods add, addRow, normalizeRow and scale)

3.2 Multi-threading

As all of the blocks that are processed are independent of each other, an obvious performance boost can be achieved by processing several blocks concurrently. I simply created an ExecutorService which could measure how many cores the host platform had available and utilise them all by spawning as many threads. This drove the workload and decreased the total time taken to process an entire image.

3.3 ImageJ Plugin

Following my prior work in my MR [1], in which I recognised ImageJ as the most competent supporting library for image processing operations, I chose to create a wrapper for my program which integrates it as a plugin for ImageJ. As described in my previous report, ImageJ is very proficient in its plugin support and is widely used by the image processing and scientific communities for these purposes. I created a simple interface with which the user can select the block size, the maximum number of iterations, whether to run in multi-threaded mode and the target MSE (c.f. (8)).
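Returning to the multi-threading of Section 3.2, the sketch below shows the general shape of that approach. It is illustrative only; the blocks list and the approximateBlock method are placeholders rather than the plugin's actual API. Every block is submitted as an independent task to a thread pool sized to the number of available cores.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class BlockRunner {
        // Approximates every block on its own worker thread.
        public static void processAllBlocks(List<double[][]> blocks) throws InterruptedException {
            int cores = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(cores);
            try {
                List<Callable<double[][]>> tasks = new ArrayList<>();
                for (double[][] block : blocks) {
                    tasks.add(() -> approximateBlock(block));   // one independent task per block
                }
                pool.invokeAll(tasks);   // returns once every block has been approximated
            } finally {
                pool.shutdown();
            }
        }

        // Placeholder for the per-block OMP2D approximation.
        static double[][] approximateBlock(double[][] block) {
            return block;   // the real implementation would run OMP2D on this block
        }
    }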
Figure 3: Case of OMP2D using GPGPU extensions (class diagram: as in Figure 2, but the biorthogonal field is a CudaMatrix whose data is held in a CleverPointer<double[]> built on the jcuda Pointer class)
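The substitution illustrated by Figures 2 and 3 can be expressed as a small interface with interchangeable CPU and GPU implementations. The sketch below is my own rendering of that design, not the plugin's actual class definitions; the GPU-backed variant is only outlined in a comment.

    // Sketch of the substitutable matrix design implied by Figures 2 and 3.
    interface BlockMatrix {
        void addRow(double[] row);        // expandable storage, see Section 3.1
        double normalizeRow(int row);
        void scale(double factor);
    }

    // CPU version backed by an ArrayList, as in Figure 2.
    class CpuMatrix implements BlockMatrix {
        private final java.util.ArrayList<double[]> rows = new java.util.ArrayList<>();

        public void addRow(double[] row) { rows.add(row.clone()); }

        public double normalizeRow(int r) {
            double[] row = rows.get(r);
            double norm = 0.0;
            for (double v : row) norm += v * v;
            norm = Math.sqrt(norm);
            if (norm > 0) for (int i = 0; i < row.length; i++) row[i] /= norm;
            return norm;
        }

        public void scale(double factor) {
            for (double[] row : rows)
                for (int i = 0; i < row.length; i++) row[i] *= factor;
        }
    }

    // A CudaMatrix (Figure 3) would implement the same interface, holding its data in
    // device memory via a CleverPointer and delegating the arithmetic to CUDA kernels;
    // the OMP2D class itself would not need to change.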
I made the program public under the open source MIT license, which makes it freely available for anyone in the scientific community and beyond to use, modify and extend.

3.4 Performance Optimisations

As OMP2D is a computationally intensive algorithm, it was important to minimise any bottlenecks which could otherwise prove costly for the timely execution of the program. I chose to use the IBM Health Center [sic] tool to analyse the execution of my code and find any impediments. It is a free and low-overhead diagnostic tool which IBM offers to its clients and the wider Java community to assess the execution of Java applications. It provides deep analysis and recommendations based on the real-time or saved data produced by the Java Virtual Machine (JVM) through hooks. Given an IBM Java Runtime Environment, this tool injects itself into the JVM to produce performance reports across many component areas including thread execution, locking, I/O and Garbage Collection (GC).

In assessing the performance I used a 512 × 512 astronomical image, see Figure 4b, as it has a mixed composition of high and low change areas, with multi-threading enabled. I ran the OMP2D algorithm for block dimensions 8 × 8, 16 × 16 and 32 × 32. From the IBM Health Center reports it was found that the matrix methods multiply and getCol were consuming the most CPU time, making them good candidates for optimisation. The next closest method in all three tests was the reorthogonalize method, which took up a low 6.79% of the time. In producing its report, Health Center takes many samples whilst the program is running; for each sample it probes the JVM for the function running at that particular moment in time. Given enough samples it can generate a profile of where the most time is spent during execution and in which method.

The table below shows the analysis report for the multiply and getCol methods, where Self is the percentage of samples during which the method itself was running and Tree is the percentage of samples during which the method or its descendants were running.

                                multiply()          getCol()
Block Dimensions    Time        Self     Tree       Self     Tree
8 × 8               1.613s      38.8%    73.2%      34.0%    34.0%
16 × 16             9.757s      66.6%    83.5%      16.7%    16.7%
32 × 32             72.944s     45.9%    73.8%      27.9%    27.9%

Analysis showed that the only function which referenced getCol was multiply. This indicated to me that the method needed further investigation into why it was taking so long to calculate.

The only place in my Java implementation where the multiplication method was used was the findNextAtom function, which selects the next best approximating atom
(c.f. (3)). In this function the matrices used are non-expanding, i.e. they always have the same dimensions no matter the iteration. Thus there was no need to construct them as complex matrix objects held in an expanding container (i.e. an ArrayList), only to extract values each time. I therefore chose to store them simply as large single arrays, removing the process of getting columns and constructing new matrix objects on each iteration; see the source code for the BasicMatrix and Dictionary classes [10]. A second evaluation revealed a noticeable reduction in the time to complete and in the proportion of time spent in the methods.

                                multiply()          reorthogonalize()
Block Dimensions    Time        Self     Tree       Self     Tree
8 × 8               0.367s      -        -          -        -
16 × 16             1.577s      60.6%    60.6%      2.17%    3.8%
32 × 32             12.876s     52.4%    52.4%      15.0%    19.2%

Note: the 8 × 8 block Health Center profile report was unable to determine the method names from its samples because the methods completed too quickly for it to read them.

3.5 BLAS Library acceleration

Basic Linear Algebra Subprograms (BLAS) [11] is a commonly used library which accelerates the performance of linear algebra operations. It was originally created in Fortran and is highly optimised for this type of work. It is commonly used in many mathematical applications, as it has proven to be one of the fastest libraries available.

As Java applications run in a virtualised environment, some libraries make native calls available so that the operations can execute outside the JVM in native Fortran. However, this type of implementation would also mean copying the matrix arrays to and from the JVM. Due to the high volume of relatively small matrix operations which happen at each iteration, these copying prerequisites would prove too costly to justify the accelerated performance gains which would be achieved. For larger block dimensions, though, such as 32 × 32, it may be justifiable to accelerate the matrix multiplication method which, in Section 3.4, was shown to consume the greatest amount of time.

There have been some recent and more fruitful developments on implementing a version of BLAS inside the JVM. One such library, known as Java Linear Algebra (JLA), shows promising performance [12]. Given the small amount of time available to me, however, and given the focus of this project on GPGPU acceleration, which itself includes BLAS, I chose not to spend additional time further increasing the performance of the CPU version. I would certainly have considered this optimising route if the intent were to run this implementation solely in Java without GPGPU acceleration.
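The optimisation described above amounts to replacing column extraction from ArrayList-backed matrix objects with arithmetic directly on flat, row-major arrays. A minimal sketch of that idea is shown below; it is illustrative only and the plugin's BasicMatrix class differs in detail.

    // Illustrative sketch: multiply an (m x n) row-major flat matrix a by an (n x p)
    // row-major flat matrix b without materialising any column objects.
    static double[] multiplyFlat(double[] a, int m, int n, double[] b, int p) {
        double[] c = new double[m * p];
        for (int i = 0; i < m; i++) {
            for (int k = 0; k < n; k++) {
                double aik = a[i * n + k];            // read each a(i, k) once
                for (int j = 0; j < p; j++) {
                    c[i * p + j] += aik * b[k * p + j];
                }
            }
        }
        return c;
    }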
3.6 Comparisons with other implementations

Once I was satisfied that the implementation was working efficiently, I performed some comparative tests. The aim was to evaluate the performance of my Java implementation against other available implementations of the identical algorithm. These implementations are:

• OMP2D implemented in MATLAB.
• OMP2D as a C++ MEX file executed from MATLAB.
• OMP2D as a C++ MEX file using the BLAS library, executed from MATLAB.

In order to make a fair comparison with my Java program, I had to learn how to implement parallel processing in MATLAB and extend the available MATLAB scripts to use multiple threads. I was then able to make fair comparisons with the implementations above using multiple threads as well as single threads.

Numerical Test

I conducted the experiments using a set of five test images, see Figures 4a, 4b, 5a, 5b and 6a. The Artificial, Chest and Nebula images were approximated up to 50 dB, corresponding to an SSIM of 0.99. Such approximations are perceptually indistinguishable from the originals. For completeness I have also included the popular test images Lena and Peppers; both were approximated up to 43 dB, corresponding to SSIM = 0.98, which also renders perceptually indistinguishable approximations of these images.

The platform used for running all the tests had the following hardware and software specifications:

Intel i7 3770k, 8 × 3.40 GHz
16 GB DDR3 RAM
Samsung SSD, 500 MB/s read
OS: Gnome Ubuntu 12.10
Java Runtime Environment: Oracle SR7-51

All the implementations result in the same sparsity and quality levels, see Table 1. The comparison is based on the time taken to complete the approximation. For this, the corresponding implementations were executed five times on each image, and the reported times are the average of the five runs.
(a) Chest X-Ray    (b) Astro

[Bar charts comparing approximation times (seconds) against block dimensions 8 × 8, 16 × 16 and 32 × 32, single-threaded and multi-threaded, for the Matlab, C++ Mex, Java and C++ Mex with BLAS implementations.]

Figure 4: Testing Images
(a) Peppers    (b) Lena

[Bar charts comparing approximation times (seconds) against block dimensions 8 × 8, 16 × 16 and 32 × 32, single-threaded and multi-threaded, for the Matlab, C++ Mex, Java and C++ Mex with BLAS implementations.]

Figure 5: Testing Images
(a) Artificial 1

[Bar charts comparing approximation times (seconds) against block dimensions 8 × 8, 16 × 16 and 32 × 32, single-threaded and multi-threaded, for the Matlab, C++ Mex, Java and C++ Mex with BLAS implementations.]

Figure 6: Testing Images

Results show how significant the benefit of using the BLAS library is for the OMP2D algorithm. It is especially evident when blocks of size 32 × 32 are run in single-threaded mode, where the C++ MEX implementation without BLAS is even slower than MATLAB; the reason being that MATLAB uses BLAS by default. With the exception of the Chest image, my Java implementation is faster than MATLAB, in spite of it not using BLAS or any other optimising library.

1 High resolution image shrunk to 2048 × 2048 pixels, taken from http://www.imagecompression.info
A notable difference between MATLAB and the other implementations is that it does not appear to be as effective as the other methods at reducing time when multiple threads are used, becoming the slowest in all but the case of the Chest image. The fastest implementation overall, across all images and block sizes, was the C++ MEX with BLAS version. On the whole my Java implementation without BLAS performs well. Nevertheless, these experiments lead me to conclude that, as the dimension of the blocks increases, the use of a library optimised for accelerating matrix multiplication becomes crucial to performance. Given the extendible way I have developed my Java program, it should be easy to inject acceleration, and I feel well positioned to utilise the BLAS libraries and massively parallel capabilities of my GPU device.

Finally, in order to illustrate the effectiveness of OMP2D for blocks of size 8 × 8, for which OMP2D can be considered efficient as far as running time is concerned, I have produced the comparison in Table 1. All the results have been obtained with a ready-to-use MATLAB script, which was already available for comparing sparsity against traditional transforms and MP2D.

             OMP2D    MP2D    DWT    DCT
Chest        10.60    9.65    7.82   7.02
Astro        7.8      6.83    4.78   4.53
Peppers      5.54     5.00    2.97   2.89
Lena         6.73     6.02    4.0    3.8
Artificial   20.5     17.5    11.0   12.0

Table 1: SRs corresponding to the same approximation quality by OMP2D, MP2D, DWT and DCT.

The first column in Table 1 indicates the image. The second column is the SR yielded by the OMP2D approach with the dictionaries of Section 2.5. The third column corresponds to the approximation with the same dictionary but using the much simpler MP2D approach. The last two columns are the results produced by the Discrete Wavelet and Cosine Transforms (DWT and DCT) respectively, obtained by elimination of the smallest coefficients.

Table 2 shows the sparsity ratios produced for each image and corresponding block size. It may appear from this table that the gains from increasing the size of the blocks do not justify the extra computational effort. This is only a consequence of the dictionaries I am using here. It has been shown that for larger dictionaries, which also include localized atoms [6], the gain from increasing the block size is significant. Thus, I understand that dedicating the remainder of the project to optimising the implementation of OMP2D with GPGPU acceleration is justified.
             8 × 8        16 × 16      32 × 32
Chest        10.5814      13.2196      14.6523
Astro        7.7507       8.3308       8.6101
Peppers      5.5328       6.1385       6.2842
Lena         6.7275       7.5180       7.7069
Artificial   20.5354      19.7175      17.0245

Table 2: SRs corresponding to block dimensions 8 × 8, 16 × 16 and 32 × 32 for each test image

4 GPU Acceleration

It follows from the preceding discussion that the performance gains of GPU acceleration would only be significant when processing large blocks, such as size 32 × 32. Otherwise the high-latency memory copying for GPU processing, described previously in my MR [1], may completely negate any performance boost gained. In order to effectively create GPU extensions for my Java program, I first had to learn about GPU architecture, parallelism and memory management. I found that almost all standard programming patterns, which are typically trivial for serially developed programs, no longer worked effectively due to the parallel nature of GPGPU programming. This required me to relearn basic concepts in an entirely new light. I also had to choose and learn a new language and parallel platform on which to run this new process.

At the time of writing there were two GPGPU programming languages available: Nvidia's proprietary Compute Unified Device Architecture (CUDA) and the open source OpenCL. Given the time available to me, I chose to learn CUDA over OpenCL, as more previous research had been conducted using it and more libraries had been developed for it, especially maths-based libraries such as BLAS/LAPACK. I also found that there was more widely available documentation, with more examples on the web.

As I had previously investigated in my MR [1], in order to drive this hardware-specific GPU acceleration from Java it was necessary to exploit a Java Native Interface (JNI) library that could enable communication with the platform's device. This is due to the virtual nature of the JVM, which operates in a platform-neutral state, i.e. not taking platform-specific hardware, such as the GPU, into account. JCuda was deemed an excellent candidate as it offers runtime compilation of CUDA code, preserving the platform independence of Java. It also mirrors the CUDA programming language's API, allowing easier transfer and application of knowledge. Once I had the accelerated GPGPU matrix operations programmed in CUDA, I intended to use the JCuda library to make JNI calls to the GPU and drive the execution from Java.
4.1 GPU Architecture

As is expected with new-to-market, leading-edge technologies, the design and compute capabilities of CUDA-enabled graphics cards are constantly evolving with each new generation, bringing new features and specification changes. Many of these new features and specifications are incompatible with the hardware of previous generations. This makes it difficult to select a platform to begin supporting, and it becomes an issue of weighing the opportunity cost of not using some new performance-enhancing features against alienating the part of the academic community that would otherwise have been able to use and contribute to the software.

I chose to support compute capability 3.0 of CUDA because it includes a second-generation implementation of the BLAS library (CUBLAS) which is more highly optimised for parallel execution and supports all BLAS 1, 2 and 3 routines to double precision, including complex numbers. As shown in the comparisons of Section 3.6, this library was found to be very effective, so I wished to exploit the most optimised version of it.

In CUDA, threads are organised into blocks of a specified size, up to 1024 (from compute capability 2.0 onwards). The number of blocks and their dimensionality is specified on launching a kernel, defining what is known as the size and dimensionality of a grid. The blocks that can be instantiated have up to three dimensions and the grids up to two. This allows separable characteristics to be applied to each thread given its thread and block indexes; see the PRAM model of computation in Section 4.2.
Figure 7: Visual representation of a 2-dimensional grid of 2D blocks

The GPU is made up of several Streaming Multiprocessors (SMs); these SMs accept one block of threads at a time for processing. An SM cannot accept any threads which are not from its active block and cannot be freed until all the threads within its active block have completed. It is therefore absolutely critical to avoid divergence in any individual thread, which would otherwise cause several others to wait until it has completed.

When considering where GPU acceleration could benefit a program, it is important to assess whether the particular function is a good candidate in terms of throughput. GPUs are best suited to doing a lot of work in many smaller parts rather than to cyclical processes. Ideally we want a large maths-to-memory ratio.
In order to begin programming GPU kernels it is vital to first understand massive parallelism and the PRAM model of computing. Without this foundation knowledge it is impossible to effectively design a parallel system which yields improved results; on the contrary, it could perform worse.

4.2 The PRAM model of parallel computation

Kernels separate their workload by thread and block index numbers, and sometimes by their dimensions. This enables the unique characteristics, operations and behaviour of a thread to be defined, such as the address of a specific memory space to work on, enabling mass parallelisation. Due to the number of threads competing for memory reads and writes, we must consider a different model of computation known as the Parallel Random Access Machine (PRAM). PRAM is analogous to the Random Access Machine model of sequential computation (not to be confused with Random Access Memory) that is used in algorithm analysis.

The synchronous PRAM model is defined as follows [13]:

1. There are p processors connected to a single shared memory.
2. Each processor has a unique index 1 ≤ i ≤ p called the processor ID.
3. A single program is executed in Single-Instruction stream, Multiple-Data (SIMD) fashion. Each instruction in the stream is carried out by all processors simultaneously and requires the same unit of time, regardless of the number of processors.
4. Each processor has a flag that controls whether it is active in the execution of an instruction. Inactive processors do not participate in the execution of instructions, except for instructions that reset the flag.

By working independently on a small slice of an algorithm the threads solve a common problem together. The PRAM model does not allow simultaneous writes to a single shared memory space. This type of event is known as a collision and can cause abnormal behaviour in the execution of the program. Take for instance the following kernel in badwrite.cu.

badwrite.cu

extern "C"
__global__ void increment(int *var)
{
    var[0] = var[0] + 1;
}

If this kernel were launched with a thousand instances, the final value of var[0] would not be 1000, but would in fact be closer to 30. This is due to the lag between one thread reading the current value of var[0] and incrementing it, during which time another thread may also have read the previous value and incremented it to the same value. This is known as a race condition.

CUDA provides a solution to this through operators called atomics, which offer a viable option where concurrent writing is absolutely necessary; however, they force the threads to effectively queue to perform their operations. This queuing effect greatly slows down the entire process, as all the affected processors' unit times increase. For this reason atomics should be avoided if at all possible.
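The same lost-update effect can be reproduced on the CPU in plain Java, which may make it more familiar. The sketch below (my own illustration, unrelated to the plugin) has several threads perform an unsynchronised read-modify-write on a shared counter; the printed total is normally far below the expected value, and Java's java.util.concurrent.atomic classes play the same role here that CUDA's atomics play on the GPU.

    // Sketch: a race condition analogous to badwrite.cu, reproduced in plain Java.
    public class RaceDemo {
        static int counter = 0;   // shared and unsynchronised

        public static void main(String[] args) throws InterruptedException {
            Thread[] threads = new Thread[8];
            for (int t = 0; t < threads.length; t++) {
                threads[t] = new Thread(() -> {
                    for (int i = 0; i < 100000; i++) {
                        counter = counter + 1;   // read, add, write: updates can be lost
                    }
                });
                threads[t].start();
            }
            for (Thread t : threads) t.join();
            // Expected 800000; usually prints much less. An AtomicInteger (the CPU
            // analogue of CUDA's atomics) would restore correctness at the cost of contention.
            System.out.println("counter = " + counter);
        }
    }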
For instance, the inner product function, where a summation of the products is performed, would be a poor choice for atomics, as the entire process would take as long as the slowest processor.

4.3 Kernel Construction

Obeying the PRAM model of computation often requires abandoning some programming techniques used in serial languages. Take for instance a scaling function which multiplies every element of an array by some factor. Typically this would be achieved in an iterative fashion, as in Program 2 in Java, but in CUDA it is done in a single step which is executed by as many threads as there are elements.

When constructing kernels, individual characteristics of processors, such as which element of an array to act on, can be defined by obtaining the block and thread ID of the processor, see line 4 of Program 1. The following kernel is launched in 1D fashion, only taking into account the x dimension.

1 extern "C"
2 __global__ void scale(double *v1, int length, double factor)
3 {
4     int index = blockIdx.x*blockDim.x + threadIdx.x;
5     if(index < length)
6     {
7         v1[index] = v1[index] * factor;
8     }
9 }

Program 1: Scaling function in CUDA

public void scale(double[] v1, double factor)
{
    for(int i = 0; i < v1.length; i++)
    {
        v1[i] = v1[i]*factor;
    }
}

Program 2: Scaling function in Java

When designing a kernel, great care should be taken to avoid divergence among the threads that execute it. As stated in Section 4.1, an SM cannot be freed until all the thread instructions within its block have completed. If one thread were to diverge, the entire block would wait on it, wasting computation time which could otherwise be used by other threads.

4.4 Memory Management

The architecture of motherboards makes it impossible for the GPU to work on any memory which resides outside of it, due to hardware-imposed restrictions. Even if it were possible, the latency involved would be disastrous for performance.
Figure 8: Paged memory resource structuring (coalesced, strided and scattered access patterns)

In order to work on a dataset we need to allocate space in the GPU's own memory storage. There is, however, a great deal of latency in transferring memory to and from the device, ranging up to a couple of milliseconds. For this reason it is advisable to minimise the number of transfers which take place, to limit their hampering effect.

The GPU has three memory storage areas in which values can be held: global, shared and local. When memory is initially transferred to the GPU it is held in the global storage area, as it is not associated with any particular block or thread. This global memory can only be accessed via pointers; conversely, pointers cannot be used in conjunction with local or shared variables. By default, any variable instantiated during the runtime of a thread is held as a local variable and is only accessible to that thread. In order to share variables between threads of the same block, the shared declaration is needed; it is, however, limited by its small, fixed storage size.

Global is the slowest memory storage area, yet it is accessible to all threads regardless of their block origins. Shared is far faster than global but slower than local, and is accessible to all threads of the same block. Local is the fastest storage area, but is only accessible to the thread which created it.

The memory access and read speeds of the GPU are extremely fast, more so than the memory available to a serial application. It is usually good practice to cache the parts of global memory which are going to be used and altered, before returning them to the global storage area for transfer. When structuring large arrays the memory accesses should also be coalesced; this is due to the way sections of memory space are read in blocks. By using coalesced memory access, fewer reads are needed to reach related data, resulting in faster performance.

Unlike in Java, the memory which is allocated on the GPU is not managed by JCuda or the JVM, as it exists outside of their structure. We must therefore manage it in a C-like manner, where the programmer allocates memory to be used and frees it when finished. This requires good programmer discipline and goes against one of Java's key characteristics.
4.4.1 CleverPointer Class

To bring memory allocation more in line with the characteristics of Java, I propose a method of automated memory management: a Java pointer class with its own self-managed system, capable of allocating and deallocating memory as and when needed by the JVM. As Java is a pointer-less and memory-managed language, programmers do not have to discipline themselves in allocating and deallocating memory blocks. If, for instance, an object is no longer being used (i.e. it has no active references) then it becomes staged for deletion by the Garbage Collector (GC) component of the JVM. This is particularly troublesome in the case of pointers to GPU memory, because if such a pointer is destroyed without deallocation, or the JVM shuts down abruptly, then the GPU memory is lost and never gets cleaned up.

I first looked to extend the Pointer class of JCuda, to make it capable of deallocating CUDA memory on disposal. I achieved this by overriding the finalize() method of the Object super class. From testing, however, I found that the pointers were not being properly deallocated in the event of the application closing before a final GC was made. This problem was solved by creating a shutdown hook for the JVM, which would iterate through a static collection of all pointer objects currently in use. This created a new problem, however, in that the collection clashed with the finalizing behaviour I had added previously: all the pointers now had a permanent reference from this collection, so they were never staged for removal by the GC.

By utilising Java's little-known weak referencing system, I was able to create a weak hash map collection of these objects instead. This prevented the shutdown hook collection from causing a blocking effect and allowed the objects to be destroyed when no other references existed.

Below is an example of using a normal system of pointer objects to perform a matrix operation such as the inner product.

Using the Pointer class

public static double innerProductSetup(double[] vector1, double[] vector2, int length) {
    int size = length*Sizeof.DOUBLE;
    Pointer d_vector1 = new Pointer();
    Pointer d_vector2 = new Pointer();
    Pointer d_ans = new Pointer();

    cudaMalloc(d_vector1, size);
    cudaMemcpy(d_vector1, Pointer.to(vector1), size, cudaMemcpyHostToDevice);

    cudaMalloc(d_vector2, size);
    cudaMemcpy(d_vector2, Pointer.to(vector2), size, cudaMemcpyHostToDevice);

    cudaMalloc(d_ans, Sizeof.DOUBLE);

    innerProduct(d_vector1, d_vector2, length, d_ans);

    double[] ans = new double[1];
    cudaMemcpy(Pointer.to(ans), d_ans, Sizeof.DOUBLE, cudaMemcpyDeviceToHost);

    cudaFree(d_ans);
    cudaFree(d_vector1);
    cudaFree(d_vector2);
    return ans[0];
}

Below is my new system using CleverPointer objects, which follows Java naming conventions more strictly, is easier to read and requires less work to implement.

Using the CleverPointer class

public static double innerProductSetup(double[] vector1, double[] vector2, int length) {
    CleverPointer<double[]> dVector1 = CleverPointer.copyDouble(vector1);
    CleverPointer<double[]> dVector2 = CleverPointer.copyDouble(vector2);
    CleverPointer<double[]> dAns = CleverPointer.create(new double[1]);

    innerProduct(dVector1, dVector2, length, dAns);

    return dAns.getArray()[0];
}

It should however be noted that, in the case of high-throughput processes where a method requires huge amounts of memory to be allocated, programmer discipline is still necessary to ensure that there is enough memory left for all subsequent calls to allocate their own. I have therefore included a manual method of freeing memory via the object's free() method. However, given the size of the global memory reserves of modern GPUs, often in the gigabytes, this shouldn't be necessary for the majority of applications. See my website for the source and full documentation of the class [10].

5 GPU Programming Patterns

Considering the PRAM model of computing and the restrictions placed on memory usage, many patterns which were previously commonplace and trivial in serial computing must be re-examined in order to fully utilise the massively parallel computing capabilities of the GPU. Conversely, some familiar mathematical libraries, such as BLAS and LAPACK, have GPU-equivalent libraries, such as cuBLAS, which can be used in their place, making some common practices portable.

Where current GPU libraries do not fully meet the requirements of a program's design, additional functionality must be created by the programmer. When creating new functions it is essential to consider the PRAM model, to guard against collisions and ensure effective GPU usage. There are several parallel programming patterns which address some issues of parallel computing; below I name those most relevant to this project.
5.1 Map

Figure 9: Map

Mapping is a simple operation whereby a large array of values is transformed from one state to another. It is a collision-free, one-to-one operation where each processor operates independently of all other processors on a single value; each value can either be modified in its own memory space, such as when scaling a vector, or each resulting value can be mapped to a new memory area, as in the case of the addition of two arrays. This pattern is especially useful and easy to implement in CUDA as it is inherently thread safe. In OMP2D, the matrix operation where entire columns are scaled by a normalizing factor could be considered a map operation.

5.2 Reduce & Scan

Reduce looks to take an array of values and, through applying an associative binary operator to all elements, produce a single resulting value. This pattern can be applied to functions which, for instance, find the sum or the maximum value of all variables. The reduce pattern follows Brent's theorem, which assumes a PRAM model. Let $s_i$ be the number of operations in step $i$ of the overall computation and assume that each processor can complete an arithmetic operation in the same unit time $t$. The total time to compute step $i$ of the overall operation given $P$ processors is

$$\Big\lceil \frac{s_i}{P} \Big\rceil \;\leq\; \frac{s_i + P - 1}{P}, \qquad (10)$$

giving an overall time to complete all $T$ steps of

$$\sum_{i=1}^{T} \Big\lceil \frac{s_i}{P} \Big\rceil \;\leq\; \sum_{i=1}^{T} \frac{s_i + P - 1}{P} \;=\; \sum_{i=1}^{T} \frac{P}{P} + \sum_{i=1}^{T} \frac{s_i - 1}{P} \;=\; T + \frac{\sum_{i=1}^{T} s_i - \sum_{i=1}^{T} 1}{P} \;=\; T + \frac{m - T}{P}, \qquad (11)$$

where $m = \sum_{i=1}^{T} s_i$ is the total number of operations. The number of steps to complete a reduction in parallel compared to serial is decidedly smaller: $n - 1$ in serial compared to $\log_2 n$ in parallel, making it increasingly more beneficial as the number of elements grows.

This reduction algorithm can be implemented using a Scan pattern. Scan takes a binary associative operator (e.g. addition or finding the minimum value) and applies it to a set of elements up to a $k$th element (that element being the total sum up to
Figure 10: Hillis and Steele tree diagram (pairwise reduction of [8, 4, 9, 7]: partial sums 12 and 16, total 28)

that element if we were doing addition, or, if we were trying to find the minimum, the smallest value so far). Scan can be inclusive or exclusive of the $k$th element, e.g.

Value set: [8, 4, 9, 7, 6, 3, 10]
Addition:  [8, 12, 21, 28, 34, 37, 47]
Minimum:   [8, 4, 4, 4, 4, 3, 3]

We can use an inclusive scan to find the reduced value of an array by taking the $k$th element. An exclusive scan does not include the $k$th element and instead iterates up to the $(k-1)$th element.

There are two main implementations of scan: the Hillis and Steele method and the Blelloch method, which by extension creates an ordered array.

5.2.1 Hillis and Steele

The Hillis and Steele method is more step-efficient but requires more work. The main concept is to take the current element, $k$, and apply the operator to its $2^i$ neighbour at step $i$. For $n$ elements there are $\log_2(n)$ steps, each requiring $O(n)$ work. This creates a tree-like effect, as shown in Figure 10. The Hillis and Steele method only finds the reduced value of an array, unlike Blelloch, which orders the entire array as well.

5.2.2 Blelloch

The Blelloch method is work-efficient but requires more steps [14]. It stages the scan process into two parts: the reduce and down-sweep stages. It also has the added benefit of creating a new ordered array, as shown in Figure 11. During the reduce phase this method uses the same reducing process as Hillis and Steele, after which the final value in the array is set to the identity element of the operator, which in the case of addition is 0. The down-sweep is then applied, which copies left and applies the operation to the right until all elements have been processed, resulting in an ordered exclusive scan array.
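To tie the worked example above to code, the following sequential Java sketch (illustrative only; a real GPU scan would be a CUDA kernel following one of the two schemes described above) produces the inclusive and exclusive addition scans of the example array, with the reduced value falling out as the last inclusive entry.

    // Sketch: sequential inclusive and exclusive addition scans of the worked example.
    public static void main(String[] args) {
        int[] values = {8, 4, 9, 7, 6, 3, 10};
        int[] inclusive = new int[values.length];
        int[] exclusive = new int[values.length];
        int running = 0;
        for (int k = 0; k < values.length; k++) {
            exclusive[k] = running;      // sum of the elements before k (identity 0 first)
            running += values[k];
            inclusive[k] = running;      // sum of the elements up to and including k
        }
        // inclusive -> [8, 12, 21, 28, 34, 37, 47]; its last entry is the reduced value.
        // exclusive -> [0, 8, 12, 21, 28, 34, 37], the form a Blelloch down-sweep produces.
        System.out.println(java.util.Arrays.toString(inclusive));
        System.out.println(java.util.Arrays.toString(exclusive));
    }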
Figure 11: Blelloch scan example (reduce followed by down-sweep on [8, 4, 9, 7], ending with the exclusive scan [0, 8, 12, 21])

5.3 Scatter

Figure 12: Scatter

The scatter pattern involves a single processor writing to many memory spaces at once; this may happen asynchronously with other threads. In some circumstances this can cause a collision, where memory spaces overlap and multiple processors attempt to write to the same memory space; this problem was previously addressed in Section 4.2. As previously discussed, CUDA's atomic functions can be used in this type of instance to ensure properly ordered writing.

5.4 Gather

Figure 13: Gather

Gather is the opposite of scatter: many memory spaces are combined by some function and output to a single memory space, for instance a moving average. It can be difficult to implement in a PRAM model if multiple threads are used, as it will create many collisions due to the number of simultaneous writes to
one address space at the same time.

6 Parallel Programming Patterns in Image Processing

Applying GPU acceleration to a Java program requires a lot of setup, even for simple procedures. Take for example the program below, which uses the map pattern to perform a simple image processing technique: inverting all the colour channels of an image. It is run as an ImageJ plugin which uses JCuda to drive the parallel execution of inverting all of the image's pixels concurrently.

InvertPlugin.java

import ij.IJ;
import ij.ImagePlus;
import ij.plugin.filter.PlugInFilter;
import ij.process.ImageProcessor;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

import jcuda.utils.KernelLauncher;

public class Invert_Plugin implements PlugInFilter
{
    private ImagePlus currentImage;
    private KernelLauncher kernelLauncher;

    @Override
    public void run(ImageProcessor imageProcessor)
    {
        int[] pixels = (int[])imageProcessor.getPixels();
        int w = imageProcessor.getWidth();
        int h = imageProcessor.getHeight();
        invert(pixels, w, h);
        currentImage.updateAndDraw();
    }

    public void invert(int pixels[], int w, int h)
    {
        // Allocate memory on the device, and copy the host data to the device
        CleverPointer<int[]> pointer = CleverPointer.copyInt(pixels);

        // Set up and call the kernel
        int blockSize = 16;
        int gridSize = (int)Math.ceil((double)Math.max(w, h)/blockSize);
        kernelLauncher.setGridSize(gridSize, gridSize);
        kernelLauncher.setBlockSize(blockSize, blockSize, 1);
        kernelLauncher.call(pointer, w, h);

        // Copy the data from the device back into the original pixel array so the
        // ImageProcessor sees the inverted values; device clean-up is left to the GC
        System.arraycopy(pointer.getArray(), 0, pixels, 0, pixels.length);
    }

    @Override
    public int setup(String arg, ImagePlus imagePlus)
    {
        currentImage = imagePlus;

        // Obtain the CUDA source code from the CUDA file
        String cuFileName = "invertKernel.cu";

        // Create the kernelLauncher that will execute the kernel
        try {
            kernelLauncher = KernelLauncher.compile(getKernel(cuFileName), "invert");
        } catch(IOException e) {
            IJ.showMessage(
                "Failed to compile CUDA Kernel!",
                "Check that " + cuFileName + " exists.");
        }
        return DOES_RGB;
    }

    private String getKernel(String k) throws IOException
    {
        InputStream kFile = getClass().getResourceAsStream(k);
        BufferedReader br = new BufferedReader(new InputStreamReader(kFile));
        StringBuilder sb = new StringBuilder();
        String line;
        while((line = br.readLine()) != null) {
            sb.append(line).append("\n");
        }
        return sb.toString();
    }
}

invertKernel.cu

extern "C"
__global__ void invert(uchar4* data, int w, int h)
{
    int x = threadIdx.x + blockIdx.x*blockDim.x;
    int y = threadIdx.y + blockIdx.y*blockDim.y;
    if (x < w && y < h)
    {
        int index = y*w + x;
        uchar4 pixel = data[index];
        pixel.x = 255 - pixel.x;
        pixel.y = 255 - pixel.y;
        pixel.z = 255 - pixel.z;
        pixel.w = 255 - pixel.w;
        data[index] = pixel;
    }
}

Using my CleverPointer class in this case reduces the number of explicit memory management operations required and needs no manual clean-up of the global memory: when the CleverPointer object is destroyed by the GC, its finalize method is called, which frees the GPU memory at the same time.

7 Applying GPU Acceleration to OMP2D

There were two options for applying GPU acceleration to OMP2D.

One is to follow the PRAM model and compute, in parallel, the matrix operations such as the inner product and matrix multiplication. Many of these operations have already been implemented for CUDA and are available as libraries, such as the BLAS version specialised for CUDA. This makes them implicitly good at handling parallelisation problems such as collisions and divergence. When considering the iterative process of OMP2D and the
way in which acceleration is driven, an issue arises concerning the latencies involved in carrying out memory transfers to and from the GPU. These transfers, which could take up to a couple of milliseconds between each iteration, are done solely to check whether our tolerance level (c.f. (8)) has been met yet. The transfers are necessary because there is no self-stopping mechanism which the GPU can use to finish its execution once it has reached the desired tolerance level.

The other approach is to simply process each block concurrently in its own thread, in a sequential manner, similar to how the multi-threaded CPU program described in Section 3.2 is able to calculate many blocks at once on a multi-core CPU. This does not, however, launch as many threads, meaning less throughput, unless small block sizes are used or the image is very large. As found in the results of Section 3.6, images processed with small blocks already complete very quickly, so the only notable application in this case would be very large images with large block sizes.

7.1 Block-wise Parallelisation

This option would have been programmed using CUDA code but without the use of libraries such as BLAS, because they must be initiated from the CPU side of the execution.

Advantages:
• There would still be a good speed-up gained from processing the thousands of blocks concurrently.
• It is easier to program for, as there would not be any parallel issues such as collisions or divergence.
• Significantly fewer memory transfers would be required.
• The process is self-sufficient once it is on the GPU.

Disadvantages:
• The entire OMP2D algorithm and its related matrix operations would need to be written in CUDA.
• There would be no BLAS support, as additional kernels cannot be launched from live kernels.
• GPU cores are not as fast as CPU cores, so the decrease in time would not be linear.

7.2 Matrix Operations Parallelisation

This option would accelerate the matrix operations only, maintaining a logical separation between the accelerated parallel processes and the iteratively driven serial process.

Advantages:
• It allows for the use of libraries such as CUBLAS.
7.1 Block-wise Parallelisation

This approach would be programmed in CUDA, but without the use of libraries such as BLAS, because those libraries must be initiated from the CPU side of the execution.

Advantages:

• A good speed-up would still be gained from processing the thousands of blocks concurrently.
• It is easier to program for, as there would not be any parallelisation issues such as collisions or divergence.
• Significantly fewer memory transfers would be required.
• The process is self-sufficient once it is running on the GPU.

Disadvantages:

• The entire OMP2D algorithm and its related matrix operations would need to be written in CUDA.
• There would be no BLAS support, as additional kernels cannot be launched from live kernels.
• GPU cores are not as fast as CPU cores, so the reduction in processing time would not scale linearly with the number of concurrent blocks.

7.2 Matrix Operations Parallelisation

This option would accelerate the matrix operations only, maintaining a logical separation between the accelerated parallel processes and the iteratively driven serial process.

Advantages:

• It allows the use of libraries such as CUBLAS.
• It is easier to integrate into the Java program, and the functionality can be validated through the same unit tests that were used for the Java version.

Disadvantages:

• The number of memory transfers needed to complete an approximation increases with each iteration.
• A multiple-stream solution would be required to offset the time lost in memory transfers.

7.2.1 Multiple Streams Solution

To mitigate the memory-transfer bottleneck, I intended to run multiple streams capable of transferring the most recently calculated iteration's results whilst simultaneously calculating the next iteration. This would repeat until one of the returned iterations had met the stopping condition (c.f. (8)), after which any subsequently calculated iterations are discarded. The scheme is compared with a single-stream calculation in Figures 14 and 15, where blue represents the calculations performed on the GPU and orange the time spent transferring memory.

Figure 14: GPGPU calculation in multiple streams (Streams 1 to 4 shown over time).

Figure 15: Single stream GPGPU calculation (Stream 0 shown over time).
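To make the scheme of Figure 14 more concrete, the sketch below outlines one possible host-side structure for it, written against the JCuda runtime API. The class, the buffer layout and the tolerance parameter are hypothetical, the kernel launches are elided, and a real implementation would additionally need page-locked host memory and CUDA events so that each copy genuinely overlaps with, and is correctly ordered after, the kernel that produced its value.

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.runtime.JCuda;
import jcuda.runtime.cudaMemcpyKind;
import jcuda.runtime.cudaStream_t;

// Hypothetical outline of the overlap in Figure 14: while iteration k's kernels
// are queued on one stream, iteration k-1's error is copied back on another and
// tested against the stopping condition on the host.
public class OverlappedStoppingCheck {

    public static int approximate(double tolerance, int maxIterations) {
        cudaStream_t computeStream = new cudaStream_t();
        cudaStream_t copyStream = new cudaStream_t();
        JCuda.cudaStreamCreate(computeStream);
        JCuda.cudaStreamCreate(copyStream);

        // Double-buffered device slots for the per-iteration approximation error.
        Pointer[] deviceError = { new Pointer(), new Pointer() };
        JCuda.cudaMalloc(deviceError[0], Sizeof.DOUBLE);
        JCuda.cudaMalloc(deviceError[1], Sizeof.DOUBLE);

        // Direct buffer used to read the error value back on the host side.
        ByteBuffer hostError = ByteBuffer.allocateDirect(Sizeof.DOUBLE)
                                         .order(ByteOrder.nativeOrder());

        int stoppedAt = maxIterations;
        for (int k = 0; k < maxIterations; k++) {
            // The kernels for iteration k would be enqueued on computeStream here,
            // writing that iteration's approximation error into deviceError[k % 2].

            if (k > 0) {
                // While iteration k is (notionally) computing, bring back
                // iteration k-1's error and test the stopping rule (c.f. (8)).
                JCuda.cudaMemcpyAsync(Pointer.to(hostError), deviceError[(k - 1) % 2],
                        Sizeof.DOUBLE, cudaMemcpyKind.cudaMemcpyDeviceToHost, copyStream);
                JCuda.cudaStreamSynchronize(copyStream);
                if (hostError.getDouble(0) < tolerance) {
                    stoppedAt = k - 1;   // iteration k's speculative work is discarded
                    break;
                }
            }
        }

        JCuda.cudaFree(deviceError[0]);
        JCuda.cudaFree(deviceError[1]);
        JCuda.cudaStreamDestroy(computeStream);
        JCuda.cudaStreamDestroy(copyStream);
        return stoppedAt;
    }
}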
8 Conclusions

A Java implementation of OMP2D is presented here and made available to the scientific and academic community as an easily packaged ImageJ plugin. The implementation shows good performance, despite not benefiting from accelerated libraries such as BLAS or from GPGPU processing.

A pointer class, created as a supporting element for GPGPU acceleration in Java, is also made publicly available. The class follows Java memory conventions and design philosophy more strictly and enables easy transfer and management of memory residing on GPU devices.

Two frameworks are also presented and outlined for further investigation into how much they might improve the performance of the Java implementation, together with a discussion of the ways in which the computationally demanding parts could be extended through GPGPU usage.

Through my exploration of GPGPU usage I found that GPGPU processing is still in its infancy and undergoing continuous development, making it hard to find up-to-date resources on the subject. It is unfortunate that, due to time constraints, I was not able to fully implement GPU acceleration for the OMP2D algorithm; however, it is certainly an area I would be keen to pursue given the time and opportunity in the future.

8.1 Reflections

• The project as a whole was very ambitious for the time given in a single term.
• In an optimised program, OMP2D can offer much improved image representation results in a small amount of time.
• In terms of performance, my Java program was sufficiently competent.
• GPGPU platforms are a constantly evolving field yet to reach mainstream adoption, making it difficult to find sources of information such as tutorials.
• The process of implementing GPU acceleration is fraught with complications and difficulties.
• Through this project I have learnt MATLAB and its parallel toolkit, Git version control for my code, C/C++ and CUDA with its associated memory management policies, and how to write parallel programs conforming to the PRAM model.
• It has been an incredibly rewarding experience, giving me the opportunity to expand my knowledge and skills in an area in which I had no previous experience.

8.2 Future Work

Areas in which further work could extend the findings of this project are:

• To fully implement the GPGPU acceleration process for Java.
• A file saving technique which accommodates the compressed image.
• Optimisations of the Java-only implementation, including the JLA library.
References

[1] M. Simons, “Image processing with Java,” 2013.

[2] S. G. Mallat, A Wavelet Tour of Signal Processing, Third Edition: The Sparse Way. Academic Press, 2008.

[3] S. G. Mallat and Z. Zhang, “Matching pursuits with time-frequency dictionaries,” IEEE Trans. Signal Process., 1993.

[4] Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad, “Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition,” Proc. of the 27th ACSSC, pp. 40–44, 1993.

[5] A. C. L. Rebollo-Neira, J. Bowley, and A. Plastino, “Self contained encrypted image folding,” Physica A, 2012.

[6] L. Rebollo-Neira and J. Bowley, “Sparse representation of astronomical images,” Journal of The Optical Society of America A, vol. 30, 2013.

[7] M. Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing. Springer, 2010.

[8] L. Rebollo-Neira and D. Lowe, “Optimized orthogonal matching pursuit approach,” IEEE Signal Process. Letters, pp. 137–140, 2002.

[9] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, 2004.

[10] M. Simons, “OMP2D ImageJ Plugin.” https://github.com/simonsm1/OMP2D, 2014.

[11] C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh, “Basic linear algebra subprograms for Fortran usage,” ACM Trans. Math. Software, pp. 308–323, 1979.

[12] B. Oancea, I. G. Rosca, T. Andrei, and A. I. Iacob, “Evaluating Java performance for linear algebra numerical computations,” Procedia Computer Science, vol. 3, pp. 474–478, 2011. World Conference on Information Technology.

[13] S. Chatterjee and J. Prins, “PRAM algorithms,” pp. 1–2, 2009.

[14] G. Blelloch, Prefix Sums and Their Applications. Morgan Kaufmann, 1990.
9 Appendices

A Installing the OMP2D plugin in ImageJ

1. First, download the appropriate ImageJ application for your platform from http://rsb.info.nih.gov/ij/download.html and the OMP2D plugin from https://github.com/simonsm1/OMP2D/raw/master/OMP2D/OMP2D_Plugin.jar

2. Extract ImageJ to a folder of your choice and copy OMP2D_Plugin.jar into the plugins folder within it.

3. Launch the ImageJ executable and you should find OMP2D listed under the Plugins menu.