This chapter discusses a CUDA implementation of the structural similarity index (SSIM) for image quality assessment. SSIM is computationally intensive but can be parallelized for GPU implementation. The author implemented SSIM on CUDA and evaluated performance on Nvidia GPUs, achieving a speedup of 30x on GTX275 and 80x on C2050 compared to a single-core CPU implementation. Evaluation on test images showed the CUDA implementation produced SSIM values matching the CPU version while significantly reducing computation time.
GPU Based Image Quality Assessment using Structural Similarity (SSIM) Index
$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)} \tag{1}$$
where
$\sigma_x^2$ is the variance of $x$;
$\sigma_y^2$ is the variance of $y$;
$C_1 = (K_1 L)^2$ and $C_2 = (K_2 L)^2$ are two variables that stabilize the division when the denominator is weak;
$L$ is the dynamic range of the pixel values (typically $2^{\#\text{bits per pixel}} - 1$);
$K_1 = 0.01$ and $K_2 = 0.03$ by default.
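As a minimal sketch of equation (1) for a single region, the following Python function computes the index from two equally sized lists of pixel values. The function name `ssim_region`, the `bits` parameter, and the use of population (divide-by-n) variance and covariance are assumptions made for illustration; the chapter does not state which estimator its implementation uses.

```python
def ssim_region(x, y, bits=8, k1=0.01, k2=0.03):
    """SSIM of two equally sized pixel regions, following equation (1)."""
    if len(x) != len(y) or not x:
        raise ValueError("regions must be non-empty and equally sized")
    n = len(x)
    dynamic_range = 2 ** bits - 1          # L in the text
    c1 = (k1 * dynamic_range) ** 2
    c2 = (k2 * dynamic_range) ** 2

    mu_x = sum(x) / n
    mu_y = sum(y) / n
    # Population variance and covariance (assumption: divide by n, not n - 1).
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n

    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


region = [10, 20, 30, 40]
print(ssim_region(region, region))            # identical regions
print(ssim_region(region, region[::-1]) < 1)  # a reordered region scores lower
```

For identical regions the numerator and denominator coincide term by term, so the index is 1; any difference in mean, variance, or covariance pulls it below 1.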
The NxN regions are shifted across the image pixel by pixel to cover the whole image, and the final
SSIM is obtained by summing the SSIM of all the regions and normalizing by the number of regions. The
resultant SSIM index is a decimal value between -1 and 1, and the value 1 is reachable only when the
test image and the reference image are identical. The region size is typically 8x8 or 16x16 (Singh et al., 2011).
Image quality assessment is an important step in many image restoration applications, such as image
denoising, image deblurring, and image inpainting. It is also an important step in video codecs, where a
block-based approach is followed for video compression.
The definition of SSIM shows that it is a computationally intensive method, yet many applications
require a real-time or at least a faster implementation. The definition also shows that the SSIM
computations of two different regions are independent of each other and can be done in parallel.
This kind of parallelism is well suited to GPU architectures, where each streaming multiprocessor
(SM) works independently of the other SMs (Shrivastav et al., 2011).
In this chapter, we report a performance evaluation of a CUDA implementation of an SSIM-based
image quality assessment tool. For this purpose we wrote a C implementation for a single-core Intel
processor and a CUDA implementation for the Nvidia GTX275 and C2050. We compared the two
implementations on a database of six images of various sizes, using region sizes of 8x8 and 16x16
(Wang et al., 2004).
Figure 1. Flow diagram of image quality assessment
Technical Background of CUDA
CUDA technology gives computationally intensive applications access to the tremendous processing
power of recent GPUs through a C-like programming interface. The GPU is especially well suited
to problems that can be expressed as data-parallel computations with high arithmetic intensity.
Because the same program is executed for each data element, there is a lower requirement for
sophisticated control; and because it is executed on many data elements and has high arithmetic
intensity, the memory access latency can be hidden with calculations instead of big data caches.
Data-parallel processing maps data elements to parallel processing threads. This background draws on
NVIDIA (2012) and Wang et al. (2002).
In this chapter, we use a GTX275, which has 240 CUDA cores, a 633 MHz graphics clock, a
1404 MHz processor clock, and 896 MB of GDDR3 RAM, and a Fermi-based C2050, which has 448 CUDA
cores, a CUDA core frequency of 1.15 GHz, and 3 GB of GDDR5 RAM.
Implementation Details
In order to exploit the inherent parallelism in the computation of the SSIM metric for different regions,
the images are split into NxN regions, as shown in Figure 2, and the computation for each region is done
by one CUDA block (NVIDIA, 2012). To reduce the computational complexity further, pixel-by-pixel
shifting of the regions is not considered; only the non-overlapping regions shown in Figure 2 are
considered for the final SSIM calculation.
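To give a sense of how much work is saved by dropping the pixel-by-pixel shift, the following sketch counts the regions visited under both schemes. The helper names are hypothetical, and the sketch assumes that the image dimensions are exact multiples of N, which holds for the image and region sizes used in this chapter.

```python
def sliding_region_count(width, height, n):
    """Regions visited when an n x n window is shifted pixel by pixel."""
    return (width - n + 1) * (height - n + 1)


def tiled_region_count(width, height, n):
    """Regions visited when the image is split into non-overlapping n x n tiles."""
    if width % n or height % n:
        raise ValueError("sketch assumes dimensions are multiples of n")
    return (width // n) * (height // n)


for size in (256, 512, 1024):
    for n in (8, 16):
        sliding = sliding_region_count(size, size, n)
        tiled = tiled_region_count(size, size, n)
        print(f"{size}x{size}, N={n}: sliding={sliding}, tiled={tiled}")
```

For a 512 x 512 image with N = 8, this gives 4096 tiles instead of 255025 shifted windows, and each tile is handled by one CUDA block.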
Figure 2. 8 x 8 block for image division
Since the SSIM metric requires the mean and variance of both images, to avoid data dependence among
different CUDA blocks, the computations for the image blocks at the same location in both images are
performed by one CUDA block. Within one CUDA block, warp independence is exploited by assigning
different warps to the computation of different image parameters. Coalesced access to the device
memory is ensured by using the cudaMallocPitch() API for image buffer allocation on the device. Shared
memory optimization is used by copying pixel values to the shared memory. The SSIM for each region
is computed by one CUDA block and written back to the device memory. The SSIM sum is then computed
by launching one more kernel (Zujovic et al., 2009; Aswathappa & Rao, 2010).
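The effect of a pitched allocation can be illustrated on the host. `cudaMallocPitch()` returns a row pitch in bytes that may be larger than the requested row width, and an element at (row, col) is then addressed as base + row * pitch + col * element size. The sketch below mimics that addressing in Python with a `bytearray`; the 512-byte alignment is an assumed value for illustration only, since the actual pitch is chosen by the CUDA runtime and depends on the device, and the helper names are hypothetical.

```python
ASSUMED_ALIGNMENT = 512  # hypothetical; the real pitch is chosen by the CUDA runtime


def padded_pitch(width_bytes, alignment=ASSUMED_ALIGNMENT):
    """Round the row width up to the next multiple of the alignment."""
    return ((width_bytes + alignment - 1) // alignment) * alignment


def pitched_offset(row, col, pitch, element_size=1):
    """Byte offset of element (row, col) in a pitched 2D buffer."""
    return row * pitch + col * element_size


width, height = 500, 4                # 8-bit pixels, so one byte per element
pitch = padded_pitch(width)           # 512 with the assumed alignment
buffer = bytearray(pitch * height)    # stands in for the device allocation

buffer[pitched_offset(2, 7, pitch)] = 255
print(pitch, pitched_offset(2, 7, pitch))
```

Because every row starts at a multiple of the pitch, consecutive threads reading consecutive pixels of a row start from an aligned address, which is the property that coalesced access relies on.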
Results and Discussion
For evaluation purposes, the CUDA implementation is compared with a C implementation running on
a single Intel core. The experiments were performed to evaluate the speedup achieved by the CUDA
implementation, using region sizes of 8x8 and 16x16. The test images were either noisy images
created by adding Gaussian noise to the reference image, blurred images obtained by applying a
Gaussian blur to the reference image, or an altogether different image; examples are shown in
Figure 3 and Figure 4.
Results are shown in the following tables (Zujovic et al., 2009).
Table 1 shows the SSIM results with region size 8x8 on the Intel core and on the CUDA cores, with and
without optimization methods. We observed around 29x speedup for region size 8x8.
Table 2 shows the SSIM results with region size 16x16 on the Intel core and on the CUDA cores, with and
without optimization methods. We observed around 32x speedup for region size 16x16.
Table 3 shows the SSIM results with region size 16x16 on the Intel core and on the CUDA cores of the
GTX275 and C2050, with optimization. We observed around 32x speedup on GTX275 and 80x speedup on
C2050 for region size 16x16.
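The speedup figures in the tables are the ratio of the Intel time to the CUDA time for the same image pair. As a small worked check, the sketch below recomputes the two speedups in the first row of Table 1 (7800 for Intel, 512 and 332 for CUDA without and with optimization); the function name is illustrative.

```python
def speedup(cpu_time, gpu_time):
    """Speedup of the GPU run relative to the CPU run (same time unit)."""
    if gpu_time <= 0:
        raise ValueError("gpu_time must be positive")
    return cpu_time / gpu_time


# First row of Table 1: Intel 7800, CUDA 512 (unoptimized) and 332 (optimized).
print(round(speedup(7800, 512), 2))  # 15.23
print(round(speedup(7800, 332), 2))  # 23.49
```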
Figure 3. Reference image
Figure 4. Distorted image
Table 1. SSIM result with region size 8 x 8 CUDA code

| Reference Image | Distorted Image | Size | SSIM | Intel Time (µs) | CUDA Time, Without Optimization (µs) | Speedup, Without Optimization | CUDA Time, With Optimization (Shared, Intrinsic, Pragma Loop) (µs) | Speedup, With Optimization |
|---|---|---|---|---|---|---|---|---|
| Lena.gif | Len.gif | 256 x 256 | 1 | 7800 | 512 | 15.23 | 332 | 23.49 |
| Lena.gif | Lena_gaussian.gif | 256 x 256 | 0.95 | 7708 | 611 | 12.61 | 321 | 24.01 |
| Lena.gif | Lena.gif | 512 x 512 | 1 | 7927 | 451 | 17.57 | 270 | 29.35 |
| Lena.gif | Lena_gaussian.gif | 512 x 512 | 0.97 | 8300 | 466 | 17.81 | 291 | 28.52 |
| Lena.gif | Lena.gif | 1024 x 1024 | 1 | 8200 | 478 | 17.15 | 281 | 29.18 |
| Lena.gif | Lena_gaussian.gif | 1024 x 1024 | 0.90 | 8250 | 465 | 17.74 | 283 | 29.15 |
| Lena.gif | Baboon.gif | 512 x 512 | -0.036 | 8100 | 461 | 17.57 | 292 | 27.73 |
| Barbara.gif | Lena.gif | 512 x 512 | 0.027 | 8112 | 464 | 17.48 | 277 | 29.28 |
| Pepper.gif | Lena.gif | 512 x 512 | -0.012 | 8140 | 450 | 18.08 | 274 | 29.70 |
| Pepper.gif | Pepper_blur.gif | 512 x 512 | 0.324 | 8034 | 447 | 17.97 | 273 | 29.42 |
Table 2. SSIM result with region size 16 x 16 CUDA code

| Reference Image | Distorted Image | Size | SSIM | Intel Time | CUDA Time, Without Optimization | Speedup, Without Optimization | CUDA Time, With Optimization | Speedup, With Optimization |
|---|---|---|---|---|---|---|---|---|
| Lena.gif | Len.gif | 256 x 256 | 1 | 6211 | 398 | 15.60 | 224 | 27.72 |
| Lena.gif | Lena_g.gif | 256 x 256 | 0.94 | 6314 | 401 | 15.74 | 231 | 27.33 |
| Lena.gif | Lena.gif | 512 x 512 | 1 | 6415 | 412 | 15.57 | 218 | 29.42 |
| Lena.gif | Lena_g.gif | 512 x 512 | 0.96 | 6434 | 406 | 15.84 | 207 | 31.08 |
| Lena.gif | Lena.gif | 1024 x 1024 | 1 | 6450 | 396 | 16.28 | 201 | 32.08 |
| Lena.gif | Lena_g.gif | 1024 x 1024 | 0.91 | 6543 | 394 | 16.60 | 202 | 32.39 |
| Lena.gif | Baboon.gif | 512 x 512 | -0.034 | 6411 | 391 | 16.39 | 194 | 33.04 |
| Barbara.gif | Lena.gif | 512 x 512 | 0.026 | 6387 | 399 | 16.00 | 195 | 32.75 |
| Pepper.gif | Lena.gif | 512 x 512 | -0.012 | 6410 | 389 | 16.47 | 199 | 32.21 |
| Pepper.gif | Pepperb.gif | 512 x 512 | 0.324 | 6412 | 390 | 16.44 | 200 | 32.06 |
CONCLUSION
In this chapter, we have evaluated the performance of a CUDA implementation of an SSIM-based
image quality assessment tool. We have shown that the SSIM algorithm is highly suitable for GPU
architectures. Our implementation achieved an average performance improvement of 30x on the
GTX275 and 80x on the C2050.
Future work in this direction includes using the speedup achieved to evaluate SSIM performance
when the region is shifted pixel by pixel, and performing further experiments to find the optimal
window size that reduces computational complexity without compromising much on the index value.
REFERENCES
Aswathappa, B. M. K., & Rao, K. R. (2010). Rate-Distortion Optimization using Structural Information in H.264 Strictly Intra-frame Encoder. Paper presented at Southeastern Symposium on Systems Theory, Tyler, TX, USA (pp. 367-370). doi:10.1109/SSST.2010.5442789
NVIDIA Corporation. (2012). CUDA Parallel Computing Platform. Retrieved from http://www.nvidia.com/object/cuda_home_new.html
Shrivastav, A., Tomar, G. S., & Singh, A. K. (2011). Performance Comparison of AMBA Bus-Based System-On-Chip Communication Protocol. Paper presented at IEEE International Conference on Communication Systems and Network Technologies (CSNT), Katra, Jammu (pp. 449-454). doi:10.1109/CSNT.2011.98
Table 3. SSIM result with region size 16 x 16 CUDA code, GTX275 and C2050 (With Optimization: Shared, Intrinsic, Pragma Loop)

| Reference Image | Distorted Image | Size | SSIM | Tesla (GTX275) Time | Tesla (GTX275) Speedup | Fermi (C2050) Time | Fermi (C2050) Speedup |
|---|---|---|---|---|---|---|---|
| Lena.gif | Len.gif | 256 x 256 | 1 | 224 | 27.72 | 58 | 107.08 |
| Lena.gif | Lena_g.gif | 256 x 256 | 0.94 | 231 | 27.33 | 60 | 105.23 |
| Lena.gif | Lena.gif | 512 x 512 | 1 | 218 | 29.42 | 76 | 84.40 |
| Lena.gif | Lena_g.gif | 512 x 512 | 0.96 | 207 | 31.08 | 77 | 83.55 |
| Lena.gif | Lena.gif | 1024 x 1024 | 1 | 201 | 32.08 | 120 | 53.75 |
| Lena.gif | Lena_g.gif | 1024 x 1024 | 0.91 | 202 | 32.39 | 124 | 52.76 |
| Lena.gif | Baboon.gif | 512 x 512 | -0.034 | 194 | 33.04 | 78 | 82.19 |
| Barbara.gif | Lena.gif | 512 x 512 | 0.026 | 195 | 32.75 | 78 | 81.88 |
| Pepper.gif | Lena.gif | 512 x 512 | -0.012 | 199 | 32.21 | 80 | 80.12 |
| Pepper.gif | Pepperb.gif | 512 x 512 | 0.324 | 200 | 32.06 | 80 | 80.15 |
Singh, R. R., Tiwari, A., Singh, V. K., & Tomar, G. S. (June, 2011). VHDL environment for floating point Arithmetic Logic Unit-ALU design and simulation. Paper presented at IEEE International Conference on Communication Systems and Network Technologies (CSNT), Katra, Jammu (pp. 469-472). doi:10.1109/CSNT.2011.167
Wang, Z., Bovik, A. C., Sheikh, H. R., & Simoncelli, E. P. (2004). Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4), 600–612. doi:10.1109/TIP.2003.819861 PMID:15376593
Wang, Z., Lu, L., & Bovik, A. C. (2002). Video quality assessment using structural distortion measurement. Paper presented at IEEE International Conference on Image Processing, Rochester, NY (pp. 65-68).
Zujovic, J., Pappas, T. N., & Neuhoff, D. L. (2009). Structural similarity metrics for texture analysis and retrieval. Paper presented at 16th IEEE International Conference on Image Processing (ICIP), Cairo, Egypt (pp. 2225-2228). doi:10.1109/ICIP.2009.5413897