APPLYING NVIDIA CUDA TO VIDEO PROCESSING
IN THE TASK OF ROUNDWOOD VOLUME ESTIMATION
Artem Kruglov, Andrey Chiryshev, Evgeniya Shishko
Ural Federal University named after the first President of Russia B.N. Yeltsin
Preamble
The system for roundwood geometry measurement based on
machine vision makes it possible to:
• Automate the process of roundwood scanning and sorting,
• Reduce the measurement error compared to systems based on
laser scanning and photocells,
• Perform an external assessment of the quality and type of the
wood.
Problem
• Log detection should be carried out in real time, which
imposes much tighter restrictions on the applicable image
analysis methods:
• the maximum processing time for each frame is 20 ms
(i.e., a throughput of at least 50 frames per second).
Roundwood volume estimation algorithm
Possible scenes
• Empty scene
• Appearance of the log
• Log tracking
• End of the log
Possible scenes
• Two consecutive logs
• Appearance of the parallel log
• Two parallel logs
Results of the computational experiment
Processing time per stage, ms:

Image | Image enhancement | Background model refreshing | Object detection
  1   |        0          |            2.1              |        0
  2   |       31.4        |            0                |       10.6
  3   |       32.0        |            0                |       11.5
  4   |       31.9        |            0                |       12.0
  5   |       32.1        |            0                |       13.0
  6   |       31.9        |            0                |       13.6
  7   |       32.2        |            0                |       13.4
CUDA implementation
1. Select the active GPU;
2. Allocate storage for the image in the global memory;
3. Load the input image into the global memory;
4. Load the coefficients of the filter window into the constant
memory;
5. Form the structure of the computational grid;
6. Launch the kernel functions that perform the image
filtration;
7. Copy the output image from the GPU memory back to the host
memory;
8. Deallocate the device memory.
CUDA implementation
#include <cuda_runtime.h>

#define BLOCK_DIM 16

typedef unsigned char byte;

// Filter window coefficients in the constant memory (up to 5x5)
__constant__ int c_mask[25];

// Image-filtering kernel (device code is not shown on this slide)
__global__ void kernel(byte *buf, int height, int width, int apert);

// Host function
void FilterGPU(byte *inbuf, byte *outbuf, int height, int width, int apert, int *mask)
{
    // Device initialization
    cudaSetDevice(0);
    // Memory allocation on the device
    byte *buf;
    cudaMalloc((void **)&buf, sizeof(byte) * width * height);
    // Loading the image onto the device
    cudaMemcpy(buf, inbuf, sizeof(byte) * width * height, cudaMemcpyHostToDevice);
    // Loading the coefficients of the filter window into the constant memory
    cudaMemcpyToSymbol(c_mask, mask, sizeof(int) * apert * apert);
    // Forming the computational grid
    dim3 blocks = dim3(width / BLOCK_DIM + 1, height / BLOCK_DIM + 1);
    dim3 threads = dim3(BLOCK_DIM, BLOCK_DIM);
    // Kernel launch
    kernel<<<blocks, threads>>>(buf, height, width, apert);
    cudaThreadSynchronize();
    // Copying the output image back to the host memory
    cudaMemcpy(outbuf, buf, sizeof(byte) * width * height, cudaMemcpyDeviceToHost);
    // Freeing the device memory
    cudaFree(buf);
}
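A hypothetical usage sketch (not from the slides) of calling FilterGPU for a
single 8-bit frame; the frame size, the 5x5 box mask and the dummy input are
assumptions made only for illustration, and the code reuses the byte type and
the FilterGPU definition above.

#include <stdlib.h>
#include <string.h>

int main(void)
{
    const int width = 640, height = 480, apert = 5;  // assumed frame size and window
    int mask[25];
    for (int i = 0; i < 25; ++i) mask[i] = 1;        // 5x5 box filter (assumption)

    byte *frame  = (byte *)malloc((size_t)width * height);
    byte *result = (byte *)malloc((size_t)width * height);
    memset(frame, 128, (size_t)width * height);      // dummy grey frame

    FilterGPU(frame, result, height, width, apert, mask);

    free(frame);
    free(result);
    return 0;
}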
CUDA implementation
• The fast shared memory is used to process each part of the
image: a 16x16 tile of the image is processed by one thread
block.
• Image enhancement operations in the kernel function (see the
sketch below):
  • Area opening (erosion + dilation)
  • Dilation

The CUDA architecture provides global, constant, texture, shared
and local types of memory:
• The constant memory is used to store the filter window.
• The global memory is used to store the image (large capacity,
read/write access).
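The kernel itself is not shown in the slides. Below is a minimal sketch, not
the authors' exact kernel, of a 3x3 erosion over a 16x16 tile staged in shared
memory; the name erode3x3, the 3x3 structuring element and the separate
input/output buffers are assumptions. Dilation is the same loop with max()
instead of min(), and the area opening chains an erosion with a dilation.

#define BLOCK_DIM 16
#define RADIUS 1                       // assumed 3x3 structuring element

typedef unsigned char byte;

// Each thread block handles one 16x16 tile; the tile plus a one-pixel halo
// is staged in the fast shared memory before the morphological filtering.
__global__ void erode3x3(const byte *in, byte *out, int height, int width)
{
    __shared__ byte tile[BLOCK_DIM + 2 * RADIUS][BLOCK_DIM + 2 * RADIUS];

    int x = blockIdx.x * BLOCK_DIM + threadIdx.x;
    int y = blockIdx.y * BLOCK_DIM + threadIdx.y;

    // Cooperative load of the tile and its halo with border clamping
    for (int dy = threadIdx.y; dy < BLOCK_DIM + 2 * RADIUS; dy += BLOCK_DIM)
        for (int dx = threadIdx.x; dx < BLOCK_DIM + 2 * RADIUS; dx += BLOCK_DIM) {
            int gx = (int)(blockIdx.x * BLOCK_DIM) + dx - RADIUS;
            int gy = (int)(blockIdx.y * BLOCK_DIM) + dy - RADIUS;
            gx = min(max(gx, 0), width - 1);
            gy = min(max(gy, 0), height - 1);
            tile[dy][dx] = in[gy * width + gx];
        }
    __syncthreads();

    if (x >= width || y >= height) return;

    // Erosion: minimum over the 3x3 neighbourhood (dilation would use the maximum)
    int v = 255;
    for (int dy = 0; dy <= 2 * RADIUS; ++dy)
        for (int dx = 0; dx <= 2 * RADIUS; ++dx)
            v = min(v, (int)tile[threadIdx.y + dy][threadIdx.x + dx]);
    out[y * width + x] = (byte)v;
}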
Approaches comparison
The computational experiment has been carried out on a PC with
the following characteristics:
• Intel Core i7 920 CPU (4 cores / 8 threads), 2.79 GHz;
• NVIDIA GeForce GTS 450 GPU (192 CUDA cores);
• 64-bit Windows 7 OS.
Compared approaches:
• Sequential implementation (CPU only)
• OpenMP multithreading (a sketch of this variant is given below)
• Parallel implementation (CUDA)
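For reference, a minimal sketch of how the OpenMP variant could look; this is
an assumption, not the authors' code: a 3x3 erosion with image rows distributed
across CPU threads (compile with the compiler's OpenMP flag, e.g. -fopenmp for
GCC, so that the pragma takes effect).

typedef unsigned char byte;

// Hypothetical OpenMP sketch (not the authors' code): a 3x3 erosion with the
// image rows distributed across CPU threads by the parallel for directive.
void ErodeOMP(const byte *in, byte *out, int height, int width)
{
    #pragma omp parallel for
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            byte v = 255;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx) {
                    int gx = x + dx, gy = y + dy;          // clamp to the borders
                    if (gx < 0) gx = 0; else if (gx >= width)  gx = width - 1;
                    if (gy < 0) gy = 0; else if (gy >= height) gy = height - 1;
                    byte p = in[gy * width + gx];
                    if (p < v) v = p;
                }
            out[y * width + x] = v;
        }
    }
}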
Approaches comparison
Average processing time, ms (the speedup is the ratio of the
serial time to the CUDA time):

Image | Serial | OpenMP | CUDA | Speedup
  1   |  2.1   |  2.1   |  2.1 |  1
  2   | 42.0   | 19.9   | 12.2 |  3.44
  3   | 43.5   | 21.0   | 12.8 |  3.4
  4   | 43.9   | 21.8   | 13.9 |  3.16
  5   | 45.1   | 22.9   | 14.5 |  3.11
  6   | 45.5   | 22.8   | 15.2 |  2.99
  7   | 45.6   | 22.8   | 15.4 |  2.96
Conclusion
• The average data transfer time to/from the GPU is 0.2 ms.
• The speedup of the calculations with CUDA is up to 3 times.
• The time cost of the algorithm in the parallel implementation
is less than 20 ms per frame, which makes it suitable for a
real-time system.
Thank you for your attention!
Ural-PDC 2016
