GPU Based Image Compression and Interpolation with Anisotropic
Diffusion
Vartika Sharma Umang Sehgal
Electronics and Communication Engineering Computer Science and Engineering
LNM Institute of Information Technology LNM Institute of Information Technology
Jaipur, India Jaipur, India
vartika.y12@lnmiit.ac.in umangsehgal.y12@lnmiit.ac.in
December 25, 2014
Abstract
Image compression is used to reduce irrelevance and
redundancy of the image data in order to be able
to store or transmit data in an efficient form. The
best image quality at a given bit-rate or compression
rate is the main goal of image compression. Methods
based on partial differential equations (PDEs) have
been used in the past for inpainting and reconstruction
from digital image features. We adopt a PDE method
because the optimal point set for image compression
and interpolation depends on the PDE: good PDEs can
cope with badly chosen points, while well-chosen
points allow simple (suboptimal) PDEs. A suboptimal
point set can still pay off if it is coded efficiently.
The basic idea of the encoding step is to store only
a few relevant pixel coordinates. We use an adaptive triangulation
method based on binary tree coding for removing less
significant pixels from the image. Decoding is done
by the Perona and Malik diffusion process for which
the remaining points serve as scattered interpolation
data. Our goal in this paper is to analyse the poten-
tial of differential equations for image compression
and interpolation and analyse the performance speed
of the algorithm both on CPU and GPU. Graphics
Processing Units (GPUs) are used in image process-
ing because they accelerate parallel computing, are
affordable and energy efficient. Research has also
shown that GPUs can perform well even at lower
occupancies. In this paper, we examine the gains in
productivity and maintainability obtained by
exploiting the hardware architecture. Our experiment
illustrates that the computation time of the CPU
code increases significantly as we
increase the image dimension but higher dimensional
images are processed with equal ease using GPU com-
puting.
Keywords- Image Compression, Partial Differential
Equations (PDEs), Binary Tree Triangulation,
GPU Computing
1 Introduction
Image compression is concerned with taking an
image and compressing it down to its smallest
possible size without much loss of data. The main
purpose of image compression is to reduce the file
size of an image, so that it can be transferred quickly
over a communication network. It is desirable to
have algorithms that are faster because fast coding is
useful during image interpolation and compression.
The first part of our paper deals with image com-
pression using Binary Tree Triangular Coding (BTTC).
We gather the compression points as leaves of a
binary tree that is built through triangular
coding. Further, we aim to demonstrate that
optimizing hierarchical tree traversal can result in
substantial performance gains on the GPU.
In the next part, we aim at filling in missing
information in certain corrupted image areas by
means of second-order PDEs. The basic idea is to
interpolate the data in the inpainting regions by
solving appropriate boundary value problems. We
take the example of the Perona and Malik nonlinear
diffusion method [5] and show implementation on
both CPU and GPU.
Many image processing tasks are inherently
parallel in nature. Owing to their large number of
cores, GPUs hold great potential for high performance
computing. The paper focuses on the variation in
the performance obtained on the CPU (Matlab and
C++) and the GPU (CUDA C++) architectures
with image interpolation using Perona and Malik
diffusion of greyscale images of different sizes.
The paper is organized as follows. Section 2 ex-
plains the B-tree triangulation coding scheme and
tree construction on the GPU. PDE based image
interpolation and its GPU implementation using
CUDA is discussed in Section 3. Results are shown
in Section 4 and Section 5 gives the conclusion.
2 Image Compression using B-
tree triangular coding
We will discuss an algorithm for image compression
called B-tree triangular coding (BTTC) [6]. It is
based on the recursive decomposition of the image
into isosceles right-angled triangles arranged in a
binary tree. The method is attractive because of its
fast encoding and decoding, and because it is easy
to implement and to parallelize.
The image to be encoded is regarded as a discrete
surface: we consider a non-negative discrete
function of two discrete variables F(x, y) and
establish a correspondence between the image and
the surface A = {(x, y, c) | c = F(x, y)}, so that
each point in A corresponds to a pixel in the image,
where c gives the pixel's density.
Our goal is to approximate A by a discrete surface
B = {(x, y, d) | d = G(x, y)}, defined by means of a
finite set of polyhedra. Each polyhedron has
a right-angled triangle (RAT) face on the XY plane
and a RAT upper face approximating A. The sur-
face B is made up of the upper faces of the polyhedra.
To show how the binary tree is built from the
image, first let T be a generic RAT on the XY
plane with vertices
P1 = (x1, y1), P2 = (x2, y2), P3 = (x3, y3), (1)
and let,
c1 = F(x1, y1), c2 = F(x2, y2), c3 = F(x3, y3) (2)
so that,
(x1, y1, c1), (x2, y2, c2), (x3, y3, c3) ∈ A (3)
Figure 1: Image partition process using Binary Tree
Triangulation Coding. A leaf of the binary tree
marked with a small triangle indicates that the corre-
sponding triangle satisfies the uniformity predicate.
Inside T, the approximating function G is given by
the linear interpolation
G(x, y) = c1 + α(c2 − c1) + β(c3 − c1) (4)
where α and β are defined by the two relations
α = [(x − x1)(y3 − y1) − (y − y1)(x3 − x1)] /
    [(x2 − x1)(y3 − y1) − (y2 − y1)(x3 − x1)] (5)

β = [(x2 − x1)(y − y1) − (y2 − y1)(x − x1)] /
    [(x2 − x1)(y3 − y1) − (y2 − y1)(x3 − x1)] (6)
Therefore, by the definition of the linear interpola-
tion, the values of F and G coincide on the
vertices of T:
F(P1) = G(P1), F(P2) = G(P2), F(P3) = G(P3) (7)
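As an illustration (ours, not the paper's code), the interpolation of equations (4)-(6) can be written in a few lines of Python; evaluating it at the vertices confirms property (7):

```python
def barycentric_coeffs(p, p1, p2, p3):
    """Compute (alpha, beta) of eqs. (5)-(6) for point p in triangle p1, p2, p3."""
    (x, y), (x1, y1), (x2, y2), (x3, y3) = p, p1, p2, p3
    den = (x2 - x1) * (y3 - y1) - (y2 - y1) * (x3 - x1)
    alpha = ((x - x1) * (y3 - y1) - (y - y1) * (x3 - x1)) / den
    beta = ((x2 - x1) * (y - y1) - (y2 - y1) * (x - x1)) / den
    return alpha, beta

def interpolate(p, p1, p2, p3, c1, c2, c3):
    """Linear interpolation G(x, y) of eq. (4) from the vertex grey values."""
    a, b = barycentric_coeffs(p, p1, p2, p3)
    return c1 + a * (c2 - c1) + b * (c3 - c1)
```

At the midpoint of the hypotenuse of the triangle (0,0), (2,0), (0,2) with vertex values 10, 20, 30 this yields 25, the average of the two hypotenuse values.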
We now measure the quality of the approximating
function G by defining
err(x, y) = |F(x, y) − G(x, y)| (8)
and checking whether
err(x, y) ≤ ε, (9)
Figure 2: Implementation of Binary Tree structure on the GPU. There is one thread for each node at each
level.
where ε > 0 is an adjustable quality factor.
If the condition does not hold, T is divided along
its height relative to the hypotenuse. If the subdi-
vision process is iterated long enough, we eventually
obtain minimal triangles, comprising only three pix-
els (their vertices), which satisfy the above condition
since err(x, y) = 0 on each vertex.
The topological information relative to all subdivi-
sions is stored in a hierarchical structure or B-tree.
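To make the subdivision concrete, here is a small Python sketch of the recursive BTTC split (an illustration, not the paper's implementation; it assumes the image side is 2^n + 1 so that hypotenuse midpoints fall on pixel positions, and represents the image F as a dict from pixel (x, y) to grey value):

```python
def bttc(F, p1, p2, p3, eps, leaves):
    """Recursive BTTC split: p1 is the right-angle vertex, p2-p3 the
    hypotenuse.  Triangles satisfying the uniformity predicate (9)
    become leaves of the binary tree."""
    def coeffs(p):  # eqs. (5)-(6)
        (x, y), (x1, y1), (x2, y2), (x3, y3) = p, p1, p2, p3
        den = (x2 - x1) * (y3 - y1) - (y2 - y1) * (x3 - x1)
        return (((x - x1) * (y3 - y1) - (y - y1) * (x3 - x1)) / den,
                ((x2 - x1) * (y - y1) - (y2 - y1) * (x - x1)) / den)

    def err(p):     # eq. (8), restricted to pixels inside T
        a, b = coeffs(p)
        if a < 0 or b < 0 or a + b > 1:
            return 0.0                        # outside T: ignored
        g = F[p1] + a * (F[p2] - F[p1]) + b * (F[p3] - F[p1])
        return abs(F[p] - g)

    if all(err(p) <= eps for p in F):         # uniformity predicate (9)
        leaves.append((p1, p2, p3))
        return
    m = ((p2[0] + p3[0]) // 2, (p2[1] + p3[1]) // 2)  # hypotenuse midpoint
    bttc(F, m, p1, p2, eps, leaves)           # the two children in the B-tree
    bttc(F, m, p1, p3, eps, leaves)
```

For a planar image the starting triangle is already uniform and yields a single leaf; a single outlier pixel forces one split.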
2.1 Tree Construction on the GPU
In recent years, general-purpose GPU comput-
ing has given rise to a number of methods for
constructing bounding volume hierarchies (BVHs).
In our case, what matters most is the speed of
construction.
For a parallel implementation of binary-tree trian-
gulation coding, the idea is to process the levels
of the nodes sequentially, starting from the root [1].
Every level in the binary tree hierarchy then
corresponds to a linear range of nodes. On a given
level, we launch one thread for each node that falls
into this range.
However, this process is fast only when there are
millions of objects, enough to fully employ the GPU.
The main shortcoming of existing methods that aim
to maximize construction speed is that they generate
the node hierarchy sequentially, usually one level at
a time. This limits the amount of parallelism that
can be achieved at the top levels of the tree, and
can lead to under-utilization of the available
parallel cores.
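Sequentially, the level-by-level scheme can be sketched as follows (illustrative Python; on the GPU, each node in the current level's linear range would be handled by its own thread):

```python
def levels(children, root=0):
    """Group node indices level by level, starting from the root.
    `children` maps a node index to the list of its child indices;
    each returned level is the linear range of nodes that would be
    processed by one parallel launch."""
    out, frontier = [], [root]
    while frontier:
        out.append(frontier)
        # gather the next level: on the GPU, one thread per node
        frontier = [c for n in frontier for c in children.get(n, [])]
    return out
```

The top levels contain very few nodes, which is exactly the under-utilization discussed above.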
3 PDE based image interpola-
tion
For decompression, the vertex mask is first recov-
ered from the binary tree representation, and the
stored grey values are placed at the appropriate
pixel positions to give the sparse image. To recover
the vertex mask, the tree is regenerated in the same
order as it was stored. While generating the nodes,
the vertex positions are calculated and marked in
the vertex mask.
The second step consists of the interpolation of
the image, where the vertex mask becomes the
interpolation mask. The image is recovered by
treating the final image as the steady state of a
diffusion process over the data points (the subset
of pixels selected through BTTC). Starting from this
subset of pixels, we interpolate the remaining
points and recreate the original image.
In this section we consider the following non-linear
diffusion scheme:
∂tu = ∇ · (c ∇u) (10)
where c is a conductivity function introduced by
Perona and Malik.
Now, we discretize the given diffusion filter as
∂tu = ∂x(c∂xu) + ∂y(c∂yu) (11)
We will discretize the PDE using a symmetric
scheme for the first-order derivatives (in a 3x3 sten-
cil), as given by Weickert [2] [3]. First we con-
sider the term ∂x(c∂xu), which we discretize using
forward differences:

∂x(c∂xu) ≈ [c∂xu]|_{i+1,j} − [c∂xu]|_{i,j} (12)

Next, we discretize the term c∂xu using backward
differences:

c∂xu ≈ (c_{i,j} + c_{i−1,j})(u_{i,j} − u_{i−1,j}) / 2 (13)

Substituting (13) into (12), we obtain

∂x(c∂xu) ≈ (c_{i+1,j} + c_{i,j})(u_{i+1,j} − u_{i,j}) / 2 (14)
         − (c_{i,j} + c_{i−1,j})(u_{i,j} − u_{i−1,j}) / 2 (15)
Similarly, we can write the analogous discretization
for ∂y(c∂yu). The discretization of the PDE then
becomes

u^{t+dt}_{i,j} = u^t_{i,j} + (dt/2) (∂x(c∂xu) + ∂y(c∂yu)) (16)

where dt is the step size.
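As an illustration, one explicit step of this scheme can be sketched in pure Python (the paper's implementations are in Matlab, C++ and CUDA C++; the reflecting boundary treatment is our assumption, the gradient here uses plain central differences without Gaussian presmoothing, and the conductivity c = exp(−|∇u|²/k²) matches the Matlab command quoted in the text):

```python
import math

def pm_step(u, k, dt):
    """One explicit Perona-Malik step following eqs. (11)-(16).
    u is a 2-D list of grey values; reflecting boundaries are assumed."""
    h, w = len(u), len(u[0])

    def at(i, j):  # clamp indices: reflecting (zero-flux) boundary
        return u[min(max(i, 0), h - 1)][min(max(j, 0), w - 1)]

    # conductivity c = exp(-|grad u|^2 / k^2), central differences
    c = [[math.exp(-(((at(i, j + 1) - at(i, j - 1)) / 2) ** 2 +
                     ((at(i + 1, j) - at(i - 1, j)) / 2) ** 2) / k ** 2)
          for j in range(w)] for i in range(h)]

    def cc(i, j):
        return c[min(max(i, 0), h - 1)][min(max(j, 0), w - 1)]

    def flux_x(i, j):  # eq. (13): averaged conductivity, backward difference
        return (cc(i, j) + cc(i, j - 1)) * (at(i, j) - at(i, j - 1)) / 2

    def flux_y(i, j):
        return (cc(i, j) + cc(i - 1, j)) * (at(i, j) - at(i - 1, j)) / 2

    # eqs. (12)-(16): forward difference of the fluxes, explicit update
    return [[u[i][j] + (dt / 2) * (flux_x(i, j + 1) - flux_x(i, j) +
                                   flux_y(i + 1, j) - flux_y(i, j))
             for j in range(w)] for i in range(h)]
```

For the interpolation use described above, the stored mask pixels would simply be reimposed after each step; with zero-flux boundaries the step conserves the total grey value.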
We take an image and then iterate the code n
times (n = 80). At every step, the image gradient
is computed with a Gaussian derivative kernel, and
the conductivity c is evaluated on the squared
gradient, given by the Matlab command -
c = exp(-grad2/(k^2));
where grad2 is the square of the gradient norm.
We then apply the non-linear diffusion step of
equation (16) to calculate the change in the image
with every iteration. The resulting diffused image
is our interpolated image.
3.1 GPU implementation of Perona
and Malik using CUDA
As our algorithm requires many floating-point com-
putations per pixel, it can result in slow run times
even on the fastest of CPUs. The slow speed of a
CPU is a serious hindrance to productivity. Using
CUDA, we can spawn exactly one thread per pixel.
Each thread is then responsible for calculating the
final value of exactly one pixel.
Since images are naturally two dimensional, it makes
sense to have each block be two dimensional. (32x16
is a good size because it allows each thread block to
run 512 threads). Then, we spawn as many thread
blocks in the x and y dimension as necessary to
cover the entire image. For example, for a 1024x768
image, the grid of thread blocks is 32x48, with each
thread block having 32x16 threads.
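The grid dimensions follow from a ceiling division of the image size by the block size; a small sketch (the helper name is ours):

```python
def grid_dims(width, height, block=(32, 16)):
    """Number of thread blocks needed to cover the image in x and y
    (ceiling division, so partially covered edges get a full block)."""
    bx, by = block
    return ((width + bx - 1) // bx, (height + by - 1) // by)
```

For a 1024x768 image this gives the 32x48 grid mentioned above; sizes that are not multiples of the block size round up, and the extra threads simply test that their pixel lies inside the image.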
CUDA uses the GPU (the device) to execute code.
A function that executes on the device is called a
kernel and is qualified with __global__.
A call to the kernel is written as kernel_name<<<
blocks, threads >>>(...) [4].
Each thread computes the pixel coordinates it is
responsible for as -
int i = blockIdx.y * blockDim.y + threadIdx.y;
int j = blockIdx.x * blockDim.x + threadIdx.x;
Because our function runs on a CUDA device,
the image data must be copied over to the GPU.
Therefore, the image is copied to the GPU, and then
copied to a second buffer on the GPU with a
GPU-to-GPU memory copy. The kernel (our
function) is called, and finally the resulting image
is copied back to the host.
Copy the data to the device (here d_ marks device
and h_ host pointers, and size is the image size in
bytes) -
cudaMemcpy(d_image, h_image, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_imageCopy, d_image, size, cudaMemcpyDeviceToDevice);
Function call -
pm_diffusion<<< blocks, threads >>>(d_image, 0.001f, 0.2f, 80, w);
HANDLE_ERROR(cudaDeviceSynchronize());
Copy the data back to the host -
cudaMemcpy(h_image, d_image, size, cudaMemcpyDeviceToHost);
We then run our algorithms for both CPU (Matlab
Figure 3: Test of Perona and Malik function. On the top, original images of three different sizes (128x128,
512x512,1024x1024) are shown. On the bottom, the result of Perona and Malik diffusion is shown (after 80
iterations).
and C++) and GPU (CUDA C++).
4 Results
To see the performance analysis, we took three images
of different sizes and run our Perona and Malik code
in Matlab, C + + and CUDAC + + using NVIDIA
GPUs.
We run the benchmark for 5 times and average
the time consumed. The results shown are for 80
iterations in Perona and Malik diffusion algorithm.
Image Dimensions   Matlab Time (sec)   C++ Time (sec)   CUDA C++ Time (sec)
128x128            3.766               1.009            0.069
512x512            12.745              3.126            0.081
1024x1024          40.737              6.309            0.106
Table 1: Test of the Perona and Malik function. Comparison
of processing speed between Matlab, C++ and CUDA C++.
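As an illustrative calculation over the numbers in Table 1 (a small Python sketch; the helper name is ours), the GPU speedup grows markedly with image size:

```python
# Timing data from Table 1 (seconds, 80 iterations, averaged over 5 runs)
times = {
    "128x128":   {"matlab": 3.766,  "cpp": 1.009, "cuda": 0.069},
    "512x512":   {"matlab": 12.745, "cpp": 3.126, "cuda": 0.081},
    "1024x1024": {"matlab": 40.737, "cpp": 6.309, "cuda": 0.106},
}

def speedup(size, baseline):
    """Speedup factor of the CUDA C++ version over a CPU baseline."""
    return round(times[size][baseline] / times[size]["cuda"], 1)
```

Over C++ the speedup rises from about 15x at 128x128 to about 60x at 1024x1024, and over Matlab it reaches roughly 384x at the largest size.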
5 Conclusion
In comparison, the speed benefits offered by CUDA
C++ far outweigh those of both Matlab and C++.
Processing speed is especially important when deal-
ing with high dimensional images, since many cal-
culations involve heavy optimization with complex
equations and algorithms, or calculations with a
large number of iterations. As the amount of data
increases, the computation time of both the Matlab
and the C++ code increases significantly, so CPU
code becomes infeasible for those calculations.
Higher dimensional images, however, are processed
with equal ease in CUDA using GPU computing. It
must also be borne in mind that the development
time in C++ and CUDA C++ is much higher
compared to Matlab.
References
[1] Tero Karras. “Maximizing Parallelism in the
Construction of BVHs, Octrees, and k-d Trees”.
In: (2012).
[2] Irena Galić, Joachim Weickert, and Martin Welk.
“Image Compression with Anisotropic Diffusion”.
In: (2008).
[3] Irena Galić, Joachim Weickert, and Martin Welk.
“Towards PDE-Based Image Compression”. In:
(2005).
[4] Jason Sanders and Edward Kandrot. CUDA by
Example. 2010.
[5] P. Perona and J. Malik. “Scale-Space and Edge
Detection Using Anisotropic Diffusion”. In: (1992).
[6] Riccardo Distasi, Michele Nappi, and Sergio
Vitulano. “Image Compression by B-Tree
Triangular Coding”. In: (1997).