Nvidia® cuda™ 5 sample evaluationresult_2

NVIDIA® CUDA™ 5.0
Sample evaluation result
PART Ⅱ
GPU: GTX 560 Ti
CPU: i5-3450S (TDP65W)
RAM: 16GB
OS: Windows 7 x64 Ultimate
Yukio Saitoh | FXFROG.com
24/Apr/2013

INDEX
Sample binary :
19. concurrentKernels
20. conjugateGradient
21. concurrentKernels
22. conjugateGradient
23. conjugateGradientPrecond
24. convolutionFFT2D
25. convolutionSeparable
26. convolutionTexture
27. cppIntegration
28. cudaDecodeD3D9 (runaway)
29. cudaDecodeGL
30. cudaEncode (runaway)
31. dct8x8
32. deviceQuery
33. deviceQueryDrv
34. dwtHaar1D
35. dxtc

Sample target path and files
• C:¥ProgramData¥NVIDIA Corporation¥CUDA Samples¥v5.0¥bin¥win64¥Release

concurrentKernels.exe
[C:¥ProgramData¥NVIDIA Corporation¥CUDA Samples¥v5.0¥bin¥win64¥Release¥concurrentKernels.exe] - Starting...
GPU Device 0: "GeForce GTX 560 Ti" with compute capability 2.1
> Detected Compute SM 2.1 hardware with 8 multi-processors
Expected time for serial execution of 8 kernels = 0.080s
Expected time for concurrent execution of 8 kernels = 0.010s
Measured time for sample = 0.010s
Test passed
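
The expected times in this log are consistent with each of the 8 kernels taking about 0.010 s: run back to back the times add up, and run fully overlapped the total is one kernel's time. A minimal sketch of that arithmetic follows. The per-kernel time is inferred from the log rather than printed by the sample, and the function names are hypothetical.

```python
# Hypothetical helpers; the 0.010 s per-kernel time is inferred from the log.
def expected_serial_time(num_kernels: int, kernel_time_s: float) -> float:
    """Kernels run one after another, so their times add up."""
    return num_kernels * kernel_time_s


def expected_concurrent_time(kernel_time_s: float) -> float:
    """Ideal case: every kernel overlaps fully, so the total is one kernel's time."""
    return kernel_time_s


serial = expected_serial_time(8, 0.010)
concurrent = expected_concurrent_time(0.010)
print(f"serial = {serial:.3f}s, concurrent = {concurrent:.3f}s")
```

The measured 0.010 s matches the concurrent estimate, which suggests the 8 kernels did overlap on this device.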

conjugateGradient.exe
> GPU device has 8 Multi-Processors, SM 2.1 compute capabilities
iteration = 1, residual = 4.451374e+001
iteration = 2, residual = 3.248658e+000
iteration = 3, residual = 2.695777e-001
Test Summary: Error amount = 0.000000
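
To read the convergence from the three residual lines shown, one option is to parse them and look at the ratio between successive residuals. This is only a sketch for reading the log; the regular expression and variable names are assumptions, not part of the sample.

```python
import re

# The three residual lines copied from the log above.
log = """\
iteration = 1, residual = 4.451374e+001
iteration = 2, residual = 3.248658e+000
iteration = 3, residual = 2.695777e-001
"""

# Hypothetical pattern for this log format.
pattern = re.compile(r"iteration\s*=\s*(\d+),\s*residual\s*=\s*([0-9.eE+-]+)")
residuals = [float(m.group(2)) for m in pattern.finditer(log)]

# Factor by which the residual shrinks from one iteration to the next.
reduction = [prev / cur for prev, cur in zip(residuals, residuals[1:])]
for step, factor in enumerate(reduction, start=2):
    print(f"iteration {step}: residual reduced by a factor of {factor:.2f}")
```

In the lines shown, the residual falls by more than an order of magnitude per iteration.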

conjugateGradientPrecond.exe
conjugateGradientPrecond starting...
GPU selected Device ID = 0
> GPU device has 8 Multi-Processors, SM 2.1 compute capabilities
laplace dimension = 128
Convergence of conjugate gradient without preconditioning:
Convergence Test: OK
Convergence of conjugate gradient using incomplete LU preconditioning:
Convergence Test: OK
Test Summary:
Counted total of 0 errors
qaerr1 = 0.000004 qaerr2 = 0.000003

convolutionFFT2D.exe 1/2
[C:¥ProgramData¥NVIDIA Corporation¥CUDA Samples¥v5.0¥bin¥win64¥Release¥convolutionFFT2D.exe] - Starting...
Testing built-in R2C / C2R FFT-based convolution
...allocating memory
...generating random input data
...creating R2C & C2R FFT plans for 2048 x 2048
...uploading to GPU and padding convolution kernel and input data
...transforming convolution kernel
...running GPU FFT convolution: 1267.922657 MPix/s (3.154767 ms)
...reading back GPU convolution results
...running reference CPU convolution
...comparing the results: rel L2 = 7.179421E-008 (max delta = 4.808732E-007)
L2norm Error OK
...shutting down
Testing custom R2C / C2R FFT-based convolution
...creating C2C FFT plan for 2048 x 1024
...reading back GPU FFT results
L2norm Error OK
...shutting down

convolutionFFT2D.exe 2/2
Testing updated custom R2C / C2R FFT-based convolution
...creating C2C FFT plan for 2048 x 1024
...reading back GPU FFT results
L2norm Error OK
...shutting down
Test Summary: 0 errors
Test passed

convolutionSeparable.exe
[C:¥ProgramData¥NVIDIA Corporation¥CUDA Samples¥v5.0¥bin¥win64¥Release¥convolutionSeparable.exe] - Starting...
Image Width x Height = 3072 x 3072
Allocating and initializing host arrays...
Allocating and initializing CUDA arrays...
Running GPU convolution (16 identical iterations)...
convolutionSeparable, Throughput = 3179.0263 MPixels/sec, Time = 0.00297 s, Size = 9437184 Pixels, NumDevsUsed = 1, Workgroup = 0
Reading back GPU results...
Checking the results...
...running convolutionRowCPU()
...running convolutionColumnCPU()
...comparing the results
...Relative L2 norm: 0.000000E+000
Shutting down...
Test passed
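
The throughput line can be cross-checked from the other values in the log, since throughput is the pixel count divided by the time. The sketch below assumes that definition; the function name is hypothetical. Because the log rounds the time to 0.00297 s, the recomputed figure only agrees with the reported one approximately.

```python
def throughput_mpix_per_s(pixels: int, seconds: float) -> float:
    """Pixels processed per second, in millions (hypothetical helper)."""
    return pixels / seconds / 1e6


width, height = 3072, 3072          # "Image Width x Height" from the log
pixels = width * height             # matches "Size = 9437184 Pixels"
estimate = throughput_mpix_per_s(pixels, 0.00297)
reported = 3179.0263                # "Throughput" from the log

relative_gap = abs(estimate - reported) / reported
print(f"estimate = {estimate:.1f} MPixels/sec, relative gap = {relative_gap:.4%}")
```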

convolutionTexture.exe
[C:¥ProgramData¥NVIDIA Corporation¥CUDA Samples¥v5.0¥bin¥win64¥Release¥convolutionTexture.exe] - Starting...
Initializing data...
Running GPU rows convolution (10 identical iterations)...
Average convolutionRowsGPU() time: 1.427774 msecs; //3304.859282 Mpix/s
Copying convolutionRowGPU() output back to the texture...
cudaMemcpyToArray() time: 0.481161 msecs; //9806.674660 Mpix/s
Running GPU columns convolution (10 iterations)
Average convolutionColumnsGPU() time: 1.429637 msecs; //3300.552071 Mpix/s
Reading back GPU results...
Checking the results...
...running convolutionRowsCPU()
...running convolutionColumnsCPU()
Relative L2 norm: 0.000000E+000
Shutting down...
Test passed

cppIntegration.exe
Hello World.
Hello World.

cudaDecodeD3D9.exe (runaway)
Command Line Arguments:
argv[0] = C:¥ProgramData¥NVIDIA Corporation¥CUDA Samples¥v5.0¥bin¥win64¥Release¥cudaDecodeD3D9.exe

cudaDecodeGL.exe 1/2
[CUDA/OpenGL Video Decode]
argv[0] = C:¥ProgramData¥NVIDIA Corporation¥CUDA Samples¥v5.0¥bin¥win64¥Release¥cudaDecodeGL.exe
[cudaDecodeGL]: input file: <../../../3_Imaging/cudaDecodeGL/data/plush1_720p_10s.m2v>
VideoCodec : MPEG-2
Frame rate : 30000/1001fps ~ 29.97fps
Sequence format : Progressive
Coded frame size: [1280, 720]
Display area : [0, 0, 1280, 720]
Chroma format : 4:2:0
Bitrate : 14116kBit/s
Aspect ratio : 16:9
argv[0] = C:¥ProgramData¥NVIDIA Corporation¥CUDA Samples¥v5.0¥bin¥win64¥Release¥cudaDecodeGL.exe
> Device 0: <GeForce GTX 560 Ti >, Compute SM 2.1 detected
-> GPU 0: < GeForce GTX 560 Ti > driver mode is: WDDM
>> initGL() creating window [1280 x 720]
> Using CUDA/GL Device [0]: GeForce GTX 560 Ti
> Using GPU Device: GeForce GTX 560 Ti has SM 2.1 compute capability
Total amount of global memory: 1024.0000 MB
>> modInitCTX<NV12ToARGB_drvapi_x64.ptx > initialized OK
>> modGetCudaFunction< CUDA file: NV12ToARGB_drvapi_x64.ptx >
CUDA Kernel Function (0x0a4c6660) = < NV12ToARGB_drvapi >
>> modGetCudaFunction< CUDA file: NV12ToARGB_drvapi_x64.ptx >
CUDA Kernel Function (0x0a4c6210) = < Passthru_drvapi >
> VideoDecoder::cudaVideoCreateFlags = <1>Use CUDA decoder

cudaDecodeGL.exe 2/2
setTextureFilterMode(GL_NEAREST,GL_NEAREST)
ImageGL::CUcontext = 02047fd0
ImageGL::CUdevice = 00000000
reshape() glViewport(0, 0, 1280, 720)
[cudaDecodeGL] - [Frame: 0016, 00.0 fps, frame time: 98854.47 (ms) ]
[cudaDecodeGL] statistics
Video Length (hh:mm:ss.msec) = 00:00:00.440
Frames Presented (inc repeats) = 326
Average Present Rate (fps) = 739.44
Frames Decoded (hardware) = 327
Average Rate of Decoding (fps) = 741.71

cudaDecodeD3D9.exe 1/2
argv[0] = C:¥ProgramData¥NVIDIA Corporation¥CUDA Samples¥v5.0¥bin¥win64¥Release¥cudaDecodeD3D9.exe
[cudaDecodeD3D9]: input file: <../../../3_Imaging/cudaDecodeD3D9/data/plush1_720p_10s.m2v>
VideoCodec : MPEG-2
Frame rate : 30000/1001fps ~ 29.97fps
Sequence format : Progressive
Coded frame size: [1280, 720]
Display area : [0, 0, 1280, 720]
Chroma format : 4:2:0
Bitrate : 14116kBit/s
Aspect ratio : 16:9
> Using GPU Device 0: GeForce GTX 560 Ti has SM 2.1 compute capability
Total amount of global memory: 1024.0000 MB
>> modInitCTX<NV12ToARGB_drvapi_x64.ptx> initialized SUCCESS!
>> modGetCudaFunction<NV12ToARGB_drvapi_x64.ptx>
CUDA Kernel Function = <NV12ToARGB_drvapi, 0x04439d20>
>> modGetCudaFunction<NV12ToARGB_drvapi_x64.ptx>
CUDA Kernel Function = <Passthru_drvapi, 0x044398d0>
> VideoDecoder::cudaVideoCreateFlags = <1>Use CUDA decoder

cudaDecodeD3D9.exe 2/2
[cudaDecodeD3D9] - [Frame: 0016, 833.6 fps, time: 1.20 (ms) ]
[cudaDecodeD3D9] statistics
Video Length (hh:mm:ss.msec) = 00:00:00.375
Frames Presented (inc repeats) = 326
Average Present FPS = 868.73
Frames Decoded (hardware) = 327
Average Decoder FPS = 871.40
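
The average rates in both decoder statistics blocks appear to be frame counts divided by the reported video length. The sketch below checks that reading against the cudaDecodeGL and cudaDecodeD3D9 figures. It is an assumption about how the samples compute the averages, the helper names are hypothetical, and the printed video length is rounded, so the match is approximate.

```python
def video_length_seconds(stamp: str) -> float:
    """Convert an hh:mm:ss.msec stamp from the log into seconds."""
    hours, minutes, seconds = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def average_fps(frames: int, stamp: str) -> float:
    """Frames divided by elapsed time (assumed definition)."""
    return frames / video_length_seconds(stamp)


# (frames presented, video length, reported average present rate) from the logs.
runs = {
    "cudaDecodeGL": (326, "00:00:00.440", 739.44),
    "cudaDecodeD3D9": (326, "00:00:00.375", 868.73),
}

gaps = {}
for name, (frames, stamp, reported) in runs.items():
    estimate = average_fps(frames, stamp)
    gaps[name] = abs(estimate - reported) / reported
    print(f"{name}: estimate {estimate:.2f} fps, reported {reported:.2f} fps")
```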

cudaEncode.exe (runaway)
Starting cudaEncode...
[ CUDA H.264 Encoder ]
argv[0] = C:¥ProgramData¥NVIDIA Corporation¥CUDA Samples¥v5.0¥bin¥win64¥Release¥cudaEncode.exe

dct8x8.exe
dct8x8.exe Starting...
CUDA sample DCT/IDCT implementation
===================================
Loading test image: barbara.bmp... [512 x 512]... Success
Running Gold 1 (CPU) version... Success
Running Gold 2 (CPU) version... Success
Running CUDA 1 (GPU) version... Success
Running CUDA 2 (GPU) version... 10459.499992 MPix/s //0.025063 ms
Success
Running CUDA short (GPU) version... Success
Dumping result to barbara_gold1.bmp... Success
Dumping result to barbara_gold2.bmp... Success
Dumping result to barbara_cuda1.bmp... Success
Dumping result to barbara_cuda2.bmp... Success
Dumping result to barbara_cuda_short.bmp... Success
Processing time (CUDA 1) : 0.209782 ms
Processing time (CUDA 2) : 0.025063 ms
Processing time (CUDA short): 0.170617 ms
PSNR Original <---> CPU(Gold 1) : 32.777073
PSNR Original <---> CPU(Gold 2) : 32.777046
PSNR Original <---> GPU(CUDA 1) : 32.777092
PSNR Original <---> GPU(CUDA 2) : 32.777077
PSNR Original <---> GPU(CUDA short): 32.749447
PSNR CPU(Gold 1) <---> GPU(CUDA 1) : 64.019310
PSNR CPU(Gold 2) <---> GPU(CUDA 2) : 71.777740
PSNR CPU(Gold 2) <---> GPU(CUDA short): 42.258053
Test Summary...
Test passed
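
Two figures in this log can be checked by hand: the CUDA 2 throughput, which follows from the image size and processing time, and the relative speed of CUDA 2 against CUDA 1. The sketch also includes the standard PSNR definition for 8-bit images, 10·log10(255² / MSE). Whether the sample computes PSNR exactly this way is an assumption, and the function names are hypothetical.

```python
import math


def throughput_mpix_per_s(pixels: int, milliseconds: float) -> float:
    """Millions of pixels per second, given a time in milliseconds."""
    return pixels / (milliseconds / 1000.0) / 1e6


def psnr_db(mse: float, peak: float = 255.0) -> float:
    """Standard PSNR for a given mean squared error (assumed 8-bit peak)."""
    return 10.0 * math.log10(peak * peak / mse)


pixels = 512 * 512                               # barbara.bmp is 512 x 512
cuda2_throughput = throughput_mpix_per_s(pixels, 0.025063)
speedup = 0.209782 / 0.025063                    # CUDA 1 time / CUDA 2 time

print(f"CUDA 2 throughput ~ {cuda2_throughput:.1f} MPix/s")
print(f"CUDA 2 is ~ {speedup:.2f}x faster than CUDA 1")
```

The recomputed throughput is close to the reported 10459.499992 MPix/s; the small gap comes from the rounded time in the log.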

dct8x8.exe / result
barbara_cuda_short.bmp

dct8x8.exe / result
barbara_cuda1.bmp

dct8x8.exe / result
barbara_cuda2.bmp

dct8x8.exe / result
barbara_gold1.bmp

dct8x8.exe / result
barbara_gold2.bmp

deviceQuery.exe 1/2
C:¥ProgramData¥NVIDIA Corporation¥CUDA Samples¥v5.0¥bin¥win64¥Release¥deviceQuery.exe Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 560 Ti"
CUDA Driver Version / Runtime Version 5.0 / 5.0
CUDA Capability Major/Minor version number: 2.1
Total amount of global memory: 1024 MBytes (1073741824 bytes)
( 8) Multiprocessors x ( 48) CUDA Cores/MP: 384 CUDA Cores
GPU Clock rate: 1800 MHz (1.80 GHz)
Memory Clock rate: 2050 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 524288 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32

deviceQuery.exe 2/2
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.0, CUDA Runtime Version = 5.0, NumDevs = 1, Device0 = GeForce GTX 560 Ti
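
Several deviceQuery figures are simple products or quotients of other figures in the same log. The sketch below recomputes them from the printed values; the variable names are chosen here for illustration.

```python
# Values copied from the deviceQuery log above.
multiprocessors = 8
cores_per_mp = 48
global_memory_mbytes = 1024
max_threads_per_mp = 1536
warp_size = 32

total_cores = multiprocessors * cores_per_mp            # "384 CUDA Cores"
global_memory_bytes = global_memory_mbytes * 1024 ** 2  # "1073741824 bytes"
max_warps_per_mp = max_threads_per_mp // warp_size      # resident warps per multiprocessor
shared_memory_kib = 49152 // 1024                       # shared memory per block, in KiB
l2_cache_kib = 524288 // 1024                           # L2 cache, in KiB

print(total_cores, global_memory_bytes, max_warps_per_mp,
      shared_memory_kib, l2_cache_kib)
```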

deviceQueryDrv.exe 1/2
C:¥ProgramData¥NVIDIA Corporation¥CUDA Samples¥v5.0¥bin¥win64¥Release¥deviceQueryDrv.exe Starting...
CUDA Device Query (Driver API) statically linked version
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 560 Ti"
CUDA Driver Version: 5.0
CUDA Capability Major/Minor version number: 2.1
Total amount of global memory: 1024 MBytes (1073741824 bytes)
( 8) Multiprocessors x ( 48) CUDA Cores/MP: 384 CUDA Cores
GPU Clock rate: 1800 MHz (1.80 GHz)
Memory Clock rate: 2050 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 524288 bytes
Max Texture Dimension Sizes 1D=(65536) 2D=(65536,65535) 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32

deviceQueryDrv.exe 2/2
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Texture alignment: 512 bytes
Maximum memory pitch: 2147483647 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

dwtHaar1D.exe
C:¥ProgramData¥NVIDIA Corporation¥CUDA Samples¥v5.0¥bin¥win64¥Release¥dwtHaar1D.exe Starting...
source file = "../../../3_Imaging/dwtHaar1D/data/signal.dat"
reference file = "result.dat"
gold file = "../../../3_Imaging/dwtHaar1D/data/regression.gold.dat"
Reading signal from "../../../3_Imaging/dwtHaar1D/data/signal.dat"
Writing result to "result.dat"
Reading reference result from "../../../3_Imaging/dwtHaar1D/data/regression.gold.dat"
Test success!
Signal.dat
9.5012929e-001
2.3113851e-001
6.0684258e-001
4.8598247e-001
8.9129897e-001
...
Regression.gold.dat
Result.dat

dxtc.exe
C:¥ProgramData¥NVIDIA Corporation¥CUDA Samples¥v5.0¥bin¥win64¥Release¥dxtc.exe Starting...
Image Loaded '../../../3_Imaging/dxtc/data/lena_std.ppm', 512 x 512 pixels
Running DXT Compression on 512 x 512 image...
16384 Blocks, 64 Threads per Block, 1048576 Threads in Grid...
dxtc, Throughput = 17.7004 MPixels/s, Time = 0.01481 s, Size = 262144 Pixels, NumDevsUsed = 1, Workgroup = 64
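
The block, thread, and throughput figures in this log fit together as follows. DXT compression works on 4 x 4 pixel blocks; that block size is a property of the format and is not printed in the log. The sketch assumes one CUDA block per 4 x 4 pixel block, which matches the reported count, and the variable names are hypothetical.

```python
width, height = 512, 512        # image size from the log
dxt_block_edge = 4              # DXT block size (format property, not in the log)
threads_per_block = 64          # "64 Threads per Block" from the log

blocks = (width // dxt_block_edge) * (height // dxt_block_edge)
threads_in_grid = blocks * threads_per_block
pixels = width * height

# Throughput as pixels per second, in millions, using the rounded time.
throughput = pixels / 0.01481 / 1e6

print(f"{blocks} blocks, {threads_in_grid} threads, {throughput:.4f} MPixels/s")
```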

dxtc.exe 1/4
Checking accuracy...
Deviation at ( 9, 1): 0.791667 rms
Deviation at ( 100, 8): 2.416667 rms
Deviation at ( 29, 10): 0.020833 rms
Deviation at ( 79, 10): 1.833333 rms
Deviation at ( 13, 11): 1.041667 rms
Deviation at ( 28, 13): 0.562500 rms
Deviation at ( 90, 13): 0.708333 rms
Deviation at ( 25, 14): 0.520833 rms
Deviation at ( 69, 14): 0.770833 rms
Deviation at ( 87, 16): 0.708333 rms
Deviation at ( 90, 17): 1.041667 rms
Deviation at ( 24, 19): 0.916667 rms
Deviation at ( 25, 19): 0.625000 rms
Deviation at ( 26, 19): 1.041667 rms
Deviation at ( 55, 20): 4.791667 rms
Deviation at ( 20, 23): 1.541667 rms
Deviation at ( 99, 23): 3.312500 rms
Deviation at ( 45, 24): 18.104166 rms

dxtc.exe 2/4
Deviation at ( 21, 30): 1.562500 rms
Deviation at ( 115, 32): 24.104166 rms
Deviation at ( 102, 33): 2.250000 rms
Deviation at ( 50, 35): 26.958334 rms
Deviation at ( 68, 35): 11.937500 rms
Deviation at ( 115, 36): 0.458333 rms
Deviation at ( 12, 38): 2.166667 rms
Deviation at ( 40, 40): 0.270833 rms
Deviation at ( 86, 43): 0.604167 rms
Deviation at ( 116, 43): 0.125000 rms
Deviation at ( 43, 44): 2.250000 rms
Deviation at ( 54, 44): 4.791667 rms
Deviation at ( 46, 46): 2.875000 rms
Deviation at ( 116, 46): 0.604167 rms
Deviation at ( 117, 48): 0.937500 rms
Deviation at ( 23, 51): 3.520833 rms
Deviation at ( 11, 52): 0.041667 rms
Deviation at ( 67, 54): 5.687500 rms
Deviation at ( 26, 55): 0.854167 rms
Deviation at ( 21, 56): 5.000000 rms
Deviation at ( 24, 56): 0.562500 rms
Deviation at ( 30, 57): 0.937500 rms
Deviation at ( 21, 59): 2.541667 rms
Deviation at ( 120, 59): 0.104167 rms
Deviation at ( 112, 60): 1.125000 rms
Deviation at ( 77, 61): 1.083333 rms

dxtc.exe 3/4
Deviation at ( 114, 62): 4.958333 rms
Deviation at ( 78, 66): 0.541667 rms
Deviation at ( 106, 68): 0.375000 rms
Deviation at ( 16, 70): 3.104167 rms
Deviation at ( 10, 71): 0.937500 rms
Deviation at ( 108, 71): 0.354167 rms
Deviation at ( 118, 72): 5.562500 rms
Deviation at ( 11, 73): 0.541667 rms
Deviation at ( 68, 74): 1.937500 rms
Deviation at ( 70, 76): 1.791667 rms
Deviation at ( 124, 76): 3.354167 rms
Deviation at ( 103, 78): 0.375000 rms
Deviation at ( 127, 78): 0.541667 rms
Deviation at ( 108, 79): 0.083333 rms
Deviation at ( 120, 81): 0.541667 rms
Deviation at ( 43, 82): 24.979166 rms
Deviation at ( 67, 82): 3.125000 rms
Deviation at ( 78, 82): 2.437500 rms
Deviation at ( 123, 84): 0.541667 rms
Deviation at ( 127, 85): 0.187500 rms
Deviation at ( 122, 87): 0.083333 rms
Deviation at ( 124, 87): 0.541667 rms
Deviation at ( 127, 88): 0.229167 rms
Deviation at ( 93, 91): 0.666667 rms
Deviation at ( 115, 93): 0.083333 rms
Deviation at ( 69, 95): 1.875000 rms
Deviation at ( 106, 95): 1.125000 rms

dxtc.exe 4/4
Deviation at ( 107, 95): 3.708333 rms
Deviation at ( 13, 96): 1.354167 rms
Deviation at ( 115, 98): 0.187500 rms
Deviation at ( 118, 98): 0.187500 rms
Deviation at ( 116, 101): 0.187500 rms
Deviation at ( 78, 105): 0.541667 rms
Deviation at ( 67, 107): 0.708333 rms
Deviation at ( 74, 107): 0.375000 rms
Deviation at ( 65, 109): 0.770833 rms
Deviation at ( 89, 109): 0.708333 rms
Deviation at ( 118, 109): 3.854167 rms
Deviation at ( 67, 110): 1.083333 rms
Deviation at ( 88, 111): 0.208333 rms
Deviation at ( 64, 113): 0.708333 rms
Deviation at ( 84, 113): 0.333333 rms
Deviation at ( 88, 113): 0.187500 rms
Deviation at ( 84, 114): 1.666667 rms
Deviation at ( 66, 115): 0.770833 rms
Deviation at ( 19, 118): 5.270833 rms
Deviation at ( 76, 121): 0.104167 rms
Deviation at ( 70, 122): 0.708333 rms
Deviation at ( 91, 122): 0.208333 rms
Deviation at ( 71, 123): 0.854167 rms
Deviation at ( 75, 123): 0.854167 rms
Deviation at ( 61, 124): 0.937500 rms
Deviation at ( 91, 124): 0.270833 rms
RMS(reference, result) = 0.015488
Test passed

Summary
On the GTX 560 Ti (compute capability 2.1), some samples do not work properly.
→ Some MUST have a GPU that supports CUDA compute capability 3.0.
→ Some require GPU devices with compute SM 3.5 or higher.
This evaluation will be continued and is kept for future reference.
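
The two requirements above can be compared against the GTX 560 Ti's compute capability 2.1 as (major, minor) pairs. This is a minimal sketch with a hypothetical helper name; which samples carry which requirement is not listed in this part.

```python
from typing import Tuple

ComputeCapability = Tuple[int, int]  # (major, minor)


def meets_requirement(device: ComputeCapability,
                      required: ComputeCapability) -> bool:
    """Tuples compare major first, then minor, like version numbers."""
    return device >= required


gtx_560_ti: ComputeCapability = (2, 1)  # from the logs above
requirements = [(3, 0), (3, 5)]         # from the summary

results = {req: meets_requirement(gtx_560_ti, req) for req in requirements}
for (major, minor), ok in results.items():
    status = "supported" if ok else "not supported"
    print(f"requires SM {major}.{minor}: {status} on SM 2.1")
```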
