Image Processing:
Gaussian smoothing
201301032
Darshan Parsana
Blurring/smoothing
 Mathematically, applying a Gaussian blur to an image is the same as convolving the image with a Gaussian function.
 The Gaussian blur is a type of image-blurring filter that uses a Gaussian function to calculate the transformation applied to each pixel in the image.
 A Gaussian blur takes a weighted average around each pixel, while a "normal" (box) blur simply averages all pixels within the radius with equal weight.
Gaussian function (2-D): G(x, y) = (1 / (2πσ²)) · e^(−(x² + y²) / (2σ²)), where σ is the standard deviation that controls the amount of blur.
How it works? Kernel type: Gaussian
 Complexity = O(N*r*r); r = blur radius, N = total no. of pixels.
 It is a widely used effect in graphics software, typically to reduce image noise and reduce image detail.
 Ref: https://en.wikipedia.org/wiki/Gaussian_blur
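For reference, the 5*5 integer mask used in the code later (weights summing to 273) is a commonly used fixed approximation of this kernel. The sketch below shows how such a kernel could be built by sampling the Gaussian directly; the choice σ = 1.0 is an assumption for illustration, not a value taken from these slides.

#include <math.h>
#include <stdio.h>

#define R 2                           /* blur radius -> (2R+1)x(2R+1) = 5x5 kernel */

int main(void)
{
    double sigma = 1.0;               /* assumed sigma; the slides use a fixed integer mask */
    double k[2*R + 1][2*R + 1], sum = 0.0;

    /* sample G(x, y) = exp(-(x^2 + y^2) / (2*sigma^2)) on the kernel grid */
    for (int i = -R; i <= R; i++)
        for (int j = -R; j <= R; j++) {
            k[i + R][j + R] = exp(-(i*i + j*j) / (2.0 * sigma * sigma));
            sum += k[i + R][j + R];
        }

    /* normalize so the weights sum to 1 (the integer mask divides by 273 instead) */
    for (int i = 0; i <= 2*R; i++) {
        for (int j = 0; j <= 2*R; j++)
            printf("%6.4f ", k[i][j] / sum);
        printf("\n");
    }
    return 0;
}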
Examples: input vs. output images for blur radii of 1.2, 2.5, and 5.0 pixels.
Serial code
 Complexity = O(N*r*r); N = total no. of pixels.
 So, in the parallel code we can simply launch one thread per output pixel (as in matrix multiplication).
// Serial 5x5 Gaussian convolution; Mx is the integer mask (weights sum to 273).
for (row = 0; row < height; row++) {
    for (col = 0; col < width; col++) {
        int sum = 0;
        for (i = -filterWidth / 2; i <= filterWidth / 2; i++) {
            for (j = -filterWidth / 2; j <= filterWidth / 2; j++) {
                // clamp neighbour coordinates to the image borders
                int r = min(max(0, row + i), height - 1);
                int c = min(max(0, col + j), width - 1);
                sum += input[r][c] * Mx[i + filterWidth / 2][j + filterWidth / 2];
            }
        }
        int ans = sum / 273;          // normalize by the kernel sum
        if (ans > 255) ans = 255;     // clamp to the valid pixel range
        if (ans < 0)   ans = 0;
        output[row][col] = ans;
    }
}
Serial code: time vs. image size

Image size     64*64   228*221   749*912
Load           1.01    6.75      70.8
Convolution    0.45    4.92      90.027
Strategy & Naïve Implementation
 Each thread generates a single output pixel.
 Simple implementation: load the image, launch the kernel, compute the output.
 Optimized version: a block of pixels from the image is loaded into an array in shared memory, and the filter is loaded into constant memory.
Parallel code (without shared memory)
Block size = 16*16
__global__ void image(int *in, int *out, int width, int height)
{
    // 5x5 Gaussian mask (weights sum to 273)
    int Mx[5][5] = { { 1, 4, 7, 4, 1 },
                     { 4,16,26,16, 4 },
                     { 7,26,41,26, 7 },
                     { 4,16,26,16, 4 },
                     { 1, 4, 7, 4, 1 } };
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= height || col >= width) return;    // threads outside the image

    // pixels closer to the border than the filter radius (2) are set to 0
    if (row < 2 || row >= height - 2 || col < 2 || col >= width - 2)
    {
        out[row * width + col] = 0;
        return;
    }

    int sumX = 0;
    for (int i = -2; i < 3; i++)
        for (int j = -2; j < 3; j++)
            sumX += in[(row + i) * width + (col + j)] * Mx[i + 2][j + 2];

    int ans = sumX / 273;             // normalize by the kernel sum
    // if the value exceeds the pixel range, clamp to its boundaries
    if (ans > 255) ans = 255;
    if (ans < 0)   ans = 0;
    out[row * width + col] = ans;     // save the convolved pixel to out
}
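A minimal host-side launch for this kernel might look like the sketch below; the wrapper name and the device-buffer names d_in/d_out are assumptions, and error checking is omitted.

// Hypothetical host-side wrapper: d_in/d_out are device buffers of
// width*height ints, with the input image already copied into d_in.
void runBlur(int *d_in, int *d_out, int width, int height)
{
    dim3 block(16, 16);                              // one thread per output pixel
    dim3 grid((width  + block.x - 1) / block.x,      // round up to cover the image
              (height + block.y - 1) / block.y);
    image<<<grid, block>>>(d_in, d_out, width, height);
    cudaDeviceSynchronize();
}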
Parallel code (shared memory)
 Use of constant and shared memory: each block stages its pixels in shared memory once, so neighbouring threads reuse them instead of re-reading global memory, and the filter sits in constant memory.
 Block size = 16*16; the output tile is the block minus the filter-radius halo.
//kernel (shared-memory version)
// R = filter radius; BLOCK_W/BLOCK_H = thread-block dims; TILE_W/TILE_H = output tile
#define R        2
#define BLOCK_W  16
#define BLOCK_H  16
#define TILE_W   (BLOCK_W - 2*R)
#define TILE_H   (BLOCK_H - 2*R)

// 5x5 Gaussian mask in constant memory (must be declared at file scope)
__constant__ int Mx[5][5] = { { 1, 4, 7, 4, 1 },
                              { 4,16,26,16, 4 },
                              { 7,26,41,26, 7 },
                              { 4,16,26,16, 4 },
                              { 1, 4, 7, 4, 1 } };

__global__ void image(int *in, int *out, int width, int height)
{
    __shared__ int smem[BLOCK_W * BLOCK_H];

    // global coordinates of the pixel this thread loads (shifted by the halo)
    int x = blockIdx.x * TILE_W + threadIdx.x - R;
    int y = blockIdx.y * TILE_H + threadIdx.y - R;
    x = min(max(0, x), width - 1);    // clamp loads to the image borders
    y = min(max(0, y), height - 1);

    unsigned int index  = y * width + x;
    unsigned int bindex = threadIdx.y * BLOCK_W + threadIdx.x;
    smem[bindex] = in[index];         // stage the block (tile + halo) in shared memory
    __syncthreads();

    // only threads inside the halo produce an output pixel
    if (threadIdx.x >= R && threadIdx.x < BLOCK_W - R &&
        threadIdx.y >= R && threadIdx.y < BLOCK_H - R)
    {
        int sum = 0;
        for (int dy = -R; dy <= R; dy++)
            for (int dx = -R; dx <= R; dx++)
                sum += Mx[dy + R][dx + R] * smem[bindex + dy * BLOCK_W + dx];

        int ans = sum / 273;          // normalize by the kernel sum
        if (ans > 255) ans = 255;     // clamp to the valid pixel range
        if (ans < 0)   ans = 0;
        out[index] = ans;
    }
}
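For the tiled version the launch geometry changes: the grid is sized by the output tile (TILE_W*TILE_H) while each block still has BLOCK_W*BLOCK_H threads so that the halo pixels get loaded too. A sketch, with the same assumed buffer names as before:

// Hypothetical launch for the shared-memory kernel: each BLOCK_W*BLOCK_H block
// produces only a TILE_W*TILE_H patch of output, so the grid is sized by the tile.
void runBlurShared(int *d_in, int *d_out, int width, int height)
{
    dim3 block(BLOCK_W, BLOCK_H);
    dim3 grid((width  + TILE_W - 1) / TILE_W,
              (height + TILE_H - 1) / TILE_H);
    image<<<grid, block>>>(d_in, d_out, width, height);
    cudaDeviceSynchronize();
}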
Comparison: effect of block/TILE size on time
Fixed input size: 228*221

Block size        4*4     8*8     16*16   32*32
Without shared    0.28    0.16    0.14    0.167
Shared            0.08    0.07    0.064   0.081
Without shared: time vs. image size
Image size     64*64    228*221   749*912
Load           0.03     0.176     1.89
Convolution    0.0649   0.1453    1.93

Shared: time vs. image size
Image size     64*64    228*221   749*912
Load           0.03     0.181     1.88
Convolution    0.05     0.064     0.2658
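The kernel times in these tables could be measured with CUDA events; below is a minimal sketch of one way to do it (not necessarily how the numbers above were obtained), assuming the grid/block dimensions and device buffers from the earlier slides.

// Hypothetical timing of the convolution kernel with CUDA events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
image<<<grid, block>>>(d_in, d_out, width, height);
cudaEventRecord(stop);
cudaEventSynchronize(stop);               // wait until the kernel has finished
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed kernel time in milliseconds
cudaEventDestroy(start);
cudaEventDestroy(stop);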
Speed up (relative to the serial code)

Image size       64*64   228*221   749*912
Without shared   15      33.86     46.65
Shared           15      76.875    338.7
From the graph, we can see that using shared memory improves performance.
Conclusion
 Using shared memory and constant memory gives a much larger speed-up than the naïve kernel (on the largest image, 338.7x vs. 46.65x over the serial code, i.e. roughly 7x faster than the naïve version).
Thank you
