NVIDIA® CUDA™ 5.0 Sample Evaluation Result Part 1

NVIDIA® CUDA™ 5.0
Sample evaluation result
PART Ⅰ

GPU: GTX 560 Ti
CPU: i5-3450S (TDP65W)
RAM: 16GB
OS: Windows 7 x64 Ultimate

Yukio Saitoh | FXFROG.com
21st/Apr/2013

INDEX
Sample binaries:
1. alignedTypes.exe
2. asyncAPI.exe
3. bandwidthTest.exe
4. batchCUBLAS.exe
5. bicubicTexture.exe
6. bilateralFilter.exe
7. bindlessTexture.exe / Failure
8. binomialOptions.exe
9. BlackScholes.exe
10. boxFilter.exe
11. boxFilterNPP.exe
12. cdpAdvancedQuicksort.exe / Failure
13. cdpLUDecomposition.exe / Failure
14. cdpQuadTree.exe / Failure
15. cdpSimplePrint.exe / Failure
16. cdpSimpleQuicksort.exe / Failure
17. clock.exe

Sample target path and files
• C:¥ProgramData¥NVIDIA Corporation¥CUDA Samples¥v5.0¥bin¥win64¥Release

alignedTypes.exe 1/2
[C:¥ProgramData¥NVIDIA Corporation¥CUDA Samples¥v5.0¥bin¥win64¥Release¥alignedTypes.exe] - Starting...
GPU Device 0: "GeForce GTX 560 Ti" with compute capability 2.1

[GeForce GTX 560 Ti] has 8 MP(s) x 48 (Cores/MP) = 384 (Cores)
> Compute scaling value = 1.00
> Memory Size = 49999872
Allocating memory...
Generating host input data array...
Uploading input data to GPU memory...
Testing misaligned types...
uint8...
Avg. time: 2.563287 ms / Copy throughput: 18.166525 GB/s.
TEST OK
uint16...
TEST OK
RGBA8_misaligned...
TEST OK
LA32_misaligned...
TEST OK
RGB32_misaligned...
TEST OK
RGBA32_misaligned...
TEST OK

alignedTypes.exe 2/2
Testing aligned types...
RGBA8...
TEST OK
I32...
TEST OK
LA32...
TEST OK
RGB32...
TEST OK
RGBA32...
TEST OK
RGBA32_2...
TEST OK

[alignedTypes] -> Test Results: 0 Failures
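The core count and the uint8 copy throughput in the alignedTypes log can be reproduced by hand. The sketch below uses Python purely for the arithmetic. It assumes that "GB/s" in the log means 2^30 bytes per second, since that convention matches the printed figure; the function names are illustrative and are not part of the sample.

```python
# Values copied from the alignedTypes output above.
MULTIPROCESSORS = 8
CORES_PER_MP = 48
MEMORY_SIZE_BYTES = 49_999_872
AVG_TIME_MS = 2.563287


def total_cores(mps: int, cores_per_mp: int) -> int:
    """Cores = multiprocessors x cores per multiprocessor."""
    return mps * cores_per_mp


def copy_throughput_gib_s(size_bytes: int, time_ms: float) -> float:
    """Bytes copied per second, expressed in units of 2**30 bytes.

    Assumption: the log's "GB" is 2**30 bytes.
    """
    return size_bytes / (time_ms * 1e-3) / 2**30


if __name__ == "__main__":
    print(total_cores(MULTIPROCESSORS, CORES_PER_MP))
    print(round(copy_throughput_gib_s(MEMORY_SIZE_BYTES, AVG_TIME_MS), 3))
```

Under that assumption the computed throughput agrees with the logged 18.166525 GB/s to within rounding of the printed time.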

asyncAPI.exe
[C:¥ProgramData¥NVIDIA Corporation¥CUDA Samples¥v5.0¥bin¥win64¥Release¥asyncAPI.exe] - Starting...

CUDA device [GeForce GTX 560 Ti]
time spent executing by the GPU: 22.45
time spent by CPU in CUDA calls: 0.04
CPU executed 12884 iterations while waiting for GPU to finish

bandwidthTest.exe
[CUDA Bandwidth Test] - Starting...
Running on...

Device 0: GeForce GTX 560 Ti
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 6016.1

Device to Host Bandwidth, 1 Device(s)
33554432 6103.5

Device to Device Bandwidth, 1 Device(s)
33554432 108588.2
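To put the bandwidthTest figures in more familiar terms, the sketch below converts the transfer size to MiB and derives the per-copy time implied by the host/device bandwidths. It assumes the sample's "MB" is 2^20 bytes; the helper names are hypothetical, and the derived times are estimates, not values printed by the sample.

```python
# Values copied from the bandwidthTest output above.
TRANSFER_BYTES = 33_554_432
HOST_TO_DEVICE_MB_S = 6016.1
DEVICE_TO_HOST_MB_S = 6103.5


def to_mib(size_bytes: int) -> float:
    """Convert a byte count to MiB (2**20 bytes)."""
    return size_bytes / 2**20


def implied_copy_time_ms(size_bytes: int, mb_per_s: float) -> float:
    """Time for one copy at the reported rate.

    Assumption: the log's "MB" is 2**20 bytes.
    """
    return to_mib(size_bytes) / mb_per_s * 1e3


if __name__ == "__main__":
    print(to_mib(TRANSFER_BYTES))
    print(round(implied_copy_time_ms(TRANSFER_BYTES, HOST_TO_DEVICE_MB_S), 3))
    print(round(implied_copy_time_ms(TRANSFER_BYTES, DEVICE_TO_HOST_MB_S), 3))
```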

batchCUBLAS.exe 1/3
batchCUBLAS Starting...

==== Running single kernels ====

Testing sgemm
#### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0xbf800000, -1) beta= (0x40000000, 2)
#### args: lda=128 ldb=128 ldc=128
^^^^ elapsed = 0.00010011 sec GFLOPS=41.8986
@@@@ sgemm test OK
Testing dgemm
#### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0x0000000000000000, 0) beta= (0x0000000000000000, 0)
#### args: lda=128 ldb=128 ldc=128
@@@@ dgemm test OK

==== Running N=10 without streams ====

Testing sgemm
#### args: lda=128 ldb=128 ldc=128
@@@@ sgemm test OK
Testing dgemm
#### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0xbff0000000000000, -1) beta= (0x0000000000000000, 0)
#### args: lda=128 ldb=128 ldc=128
@@@@ dgemm test OK
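The GFLOPS figure for the single-kernel sgemm run can be checked against the conventional operation count for a matrix multiply, 2 x m x n x k. The sketch below assumes the sample uses that count; the function name is illustrative, and the elapsed time is the rounded value printed in the log, so the result agrees only to about three significant digits.

```python
# Values copied from the single-kernel sgemm run above.
M = N = K = 128
ELAPSED_SEC = 0.00010011


def gemm_gflops(m: int, n: int, k: int, elapsed_sec: float) -> float:
    """GFLOPS for one GEMM, assuming 2*m*n*k floating-point operations."""
    flops = 2 * m * n * k
    return flops / elapsed_sec / 1e9


if __name__ == "__main__":
    print(round(gemm_gflops(M, N, K, ELAPSED_SEC), 2))
```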

batchCUBLAS.exe 2/3
==== Running N=10 without streams ====

Testing sgemm
#### args: lda=128 ldb=128 ldc=128
@@@@ sgemm test OK
Testing dgemm
#### args: lda=128 ldb=128 ldc=128
@@@@ dgemm test OK

==== Running N=10 with streams ====

Testing sgemm
#### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0x40000000, 2) beta= (0x40000000, 2)
#### args: lda=128 ldb=128 ldc=128
@@@@ sgemm test OK
Testing dgemm
#### args: lda=128 ldb=128 ldc=128
@@@@ dgemm test OK

batchCUBLAS.exe 3/3
==== Running N=10 batched ====

Testing sgemm
#### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0x3f800000, 1) beta= (0xbf800000, -1)
#### args: lda=128 ldb=128 ldc=128
@@@@ sgemm test OK
Testing dgemm
#### args: lda=128 ldb=128 ldc=128
@@@@ dgemm test OK

Test Summary
0 error(s)

bicubicTexture.exe 1/2
Starting bicubicTexture
[CUDA BicubicTexture] (OpenGL Mode)
CUDA device [GeForce GTX 560 Ti] has 8 Multi-Processors
Loaded 'lena_bw.pgm', 512 x 512 pixels

Controls
=/- : Zoom in/out
b : Run Benchmark g_FilterMode
c : Draw Bicubic Spline Curve
[esc] - Quit

Press number keys to change filtering g_FilterMode:

1 : nearest filtering
2 : bilinear filtering
3 : bicubic filtering
4 : fast bicubic filtering
5 : Catmull-Rom filtering

bicubicTexture.exe 2/2
[CUDA BicubicTexture] (Benchmark Mode)
time: 0.098 ms, 2673.560320 Mpixels/sec
> FilterMode[1] = Nearest
> FilterMode[2] = Bilinear
> FilterMode[3] = Bicubic
> FilterMode[4] = Fast Bicubic
> FilterMode[5] = Catmull-Rom

bilateralFilter.exe 1/2
Loading ../../../3_Imaging/bilateralFilter/data/nature_monte.bmp...
BMP width: 640
BMP height: 480
BMP file loaded successfully!
Loaded '../../../3_Imaging/bilateralFilter/data/nature_monte.bmp', 640 x 480 pixels

Found 1 CUDA Capable device(s) supporting CUDA

Device 0: "GeForce GTX 560 Ti"
CUDA Runtime Version : 5.0
CUDA Compute Capability : 2.1

Found CUDA Capable Device 0: "GeForce GTX 560 Ti"
Setting active device to 0
Using device 0: GeForce GTX 560 Ti
Running Standard Demonstration with GLUT loop...

Press '+' and '-' to change filter width
Press ']' and '[' to change number of iterations
Press 'e' and 'E' to change Euclidean delta
Press 'g' and 'G' to changle Gaussian delta
Press 'a' or 'A' to change Animation mode ON/OFF

bindlessTexture.exe / Failure
CUDA bindlessTexture Starting...

No GPU device was found that can support CUDA compute capability 3.0.

binomialOptions.exe
[C:¥ProgramData¥NVIDIA Corporation¥CUDA Samples¥v5.0¥bin¥win64¥Release¥binomialOptions.exe] - Starting...

Using single precision...
Generating input data...
Running GPU binomial tree...
Options count : 512
Time steps : 2048
binomialOptionsGPU() time: 29.790300 msec
Options per second : 17186.802203
Running CPU binomial tree...
Comparing the results...
GPU binomial vs. Black-Scholes
L1 norm: 1.323721E-004
CPU binomial vs. Black-Scholes
L1 norm: 1.045245E-004
CPU binomial vs. GPU binomial
L1 norm: 3.391858E-005
Shutting down...
Test passed
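The "Options per second" line follows directly from the option count and the GPU time. A minimal check, with an illustrative function name:

```python
# Values copied from the binomialOptions output above.
OPTIONS_COUNT = 512
GPU_TIME_MSEC = 29.790300


def options_per_second(count: int, time_msec: float) -> float:
    """Options priced per second of GPU time."""
    return count / (time_msec * 1e-3)


if __name__ == "__main__":
    print(round(options_per_second(OPTIONS_COUNT, GPU_TIME_MSEC), 2))
```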

BlackScholes.exe 1/2
[C:¥ProgramData¥NVIDIA Corporation¥CUDA Samples¥v5.0¥bin¥win64¥Release¥BlackScholes.exe] - Starting...

Initializing data...
...allocating CPU memory for options.
...allocating GPU memory for options.
...generating input data in CPU mem.
...copying input data to GPU mem.
Data init done.

Executing Black-Scholes GPU kernel (512 iterations)...
Options count : 8000000
BlackScholesGPU() time : 0.806277 msec
Effective memory bandwidth: 99.221508 GB/s
Gigaoptions per second : 9.922151

BlackScholes, Throughput = 9.9222 GOptions/s, Time = 0.00081 s, Size = 8000000 options, NumDevsUsed = 1, Workgroup = 128

BlackScholes.exe 2/2
Reading back GPU results...
Checking the results...
...running CPU calculations.

Comparing the results...
L1 norm: 1.768024E-007
Max absolute error: 1.120567E-005

Shutting down...
...releasing GPU memory.
...releasing CPU memory.
Shutdown done.

[BlackScholes] - Test Summary
Test passed
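The two BlackScholes throughput figures can be reconstructed from the option count and the kernel time. The sketch below assumes 80,000,000 bytes of memory traffic per kernel launch (10 bytes per reported option, with GB = 10^9 bytes), a value chosen because it reproduces the printed bandwidth; it is not stated in the log. Function names are illustrative.

```python
# Values copied from the BlackScholes output above.
OPTIONS_COUNT = 8_000_000
GPU_TIME_MSEC = 0.806277

# Assumption (not in the log): bytes read and written per kernel launch.
BYTES_PER_LAUNCH = 80_000_000


def gigaoptions_per_second(count: int, time_msec: float) -> float:
    """Options priced per second, in units of 1e9."""
    return count / (time_msec * 1e-3) / 1e9


def effective_bandwidth_gb_s(num_bytes: int, time_msec: float) -> float:
    """Memory traffic per second, in units of 1e9 bytes."""
    return num_bytes / (time_msec * 1e-3) / 1e9


if __name__ == "__main__":
    print(round(gigaoptions_per_second(OPTIONS_COUNT, GPU_TIME_MSEC), 4))
    print(round(effective_bandwidth_gb_s(BYTES_PER_LAUNCH, GPU_TIME_MSEC), 3))
```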

boxFilter.exe
C:¥ProgramData¥NVIDIA Corporation¥CUDA Samples¥v5.0¥bin¥win64¥Release¥boxFilter.exe Starting...

Loaded '../../../3_Imaging/boxFilter/data/lenaRGB.ppm', 1024 x 1024 pixels

Found 1 CUDA Capable device(s) supporting CUDA

Device 0: "GeForce GTX 560 Ti"
CUDA Runtime Version : 5.0
CUDA Compute Capability : 2.1

Found CUDA Capable Device 0: "GeForce GTX 560 Ti"
Setting active device to 0
Running Standard Demonstration with GLUT loop...

Press '+' and '-' to change filter width
Press ']' and '[' to change number of iterations
Press 'a' or 'A' to change animation ON/OFF

boxFilterNPP.exe
C:¥ProgramData¥NVIDIA Corporation¥CUDA Samples¥v5.0¥bin¥win64¥Release¥boxFilterNPP.exe Starting...


cudaSetDevice GPU0 = GeForce GTX 560 Ti
NPP Library Version 5.0.35
C:¥ProgramData¥NVIDIA Corporation¥CUDA Samples¥v5.0¥bin¥win64¥Release¥boxFilterNPP.exe using GPU <GeForce GTX 560 Ti> with 8 SM(s) with Compute 2.1
boxFilterNPP opened: <../../../common/data/Lena.pgm> successfully!
Saved image: ../../../common/data/Lena_boxFilter.pgm

cdpAdvancedQuicksort.exe / Failure
GPU 0 (GeForce GTX 560 Ti) does not support CUDA Dynamic Parallelism
cdpAdvancedQuicksort requires GPU devices with compute SM 3.5 or higher. Exiting...

cdpLUDecomposition.exe / Failure
Starting LU Decomposition (CUDA Dynamic Parallelism)
GPU device GeForce GTX 560 Ti has compute capabilities (SM 2.1)
cdpLUDecomposition requires SM 3.5 or higher to use CUDA Dynamic Parallelism. Exiting...

cdpQuadTree.exe / Failure
cdpQuadTree requires SM 3.5 or higher to use CUDA Dynamic Parallelism. Exiting...

cdpSimplePrint.exe / Failure
starting Simple Print (CUDA Dynamic Parallelism)
cdpSimplePrint requires GPU devices with compute SM 3.5 or higher. Exiting...

cdpSimpleQuicksort.exe / Failure
cdpSimpleQuicksort requires GPU devices with compute SM 3.5 or higher. Exiting...

clock.exe
CUDA Clock sample

Total clocks = 15204

Summary
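On the GTX 560 Ti (compute capability 2.1), some samples do not run.

→ bindlessTexture requires a GPU that supports CUDA compute capability 3.0.
→ The cdp* samples (CUDA Dynamic Parallelism) require a GPU with compute capability (SM) 3.5 or higher.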
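The failures all come down to a comparison between the device's compute capability and the minimum each sample asks for. The sketch below models that check in Python, using only the requirements printed in the logs. The table and function names are hypothetical, and samples not listed are assumed to have no extra requirement.

```python
# Minimum compute capability per sample, taken from the failure messages above.
REQUIRED_CC = {
    "bindlessTexture": (3, 0),
    "cdpAdvancedQuicksort": (3, 5),
    "cdpLUDecomposition": (3, 5),
    "cdpQuadTree": (3, 5),
    "cdpSimplePrint": (3, 5),
    "cdpSimpleQuicksort": (3, 5),
}

# GeForce GTX 560 Ti, as reported by the samples.
DEVICE_CC = (2, 1)


def can_run(sample: str, device_cc: tuple) -> bool:
    """True if the device meets the sample's minimum compute capability.

    Assumption: samples missing from REQUIRED_CC have no extra requirement.
    Tuples compare lexicographically, so (2, 1) < (3, 0) < (3, 5).
    """
    return device_cc >= REQUIRED_CC.get(sample, (0, 0))


def failing_samples(device_cc: tuple) -> list:
    """Names of listed samples the device cannot run, sorted alphabetically."""
    return sorted(s for s in REQUIRED_CC if not can_run(s, device_cc))


if __name__ == "__main__":
    print(failing_samples(DEVICE_CC))
```

In the real samples this check is made against the device properties reported by the CUDA runtime; the table above only mirrors the messages in this evaluation.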

This evaluation will be continued in a later part, for future reference.
