CUDA 6.0 Performance Report
Transcript

Slide 1: CUDA 6.0 Performance Report (April 2014)
Slide 2: CUDA 6 Performance Report
• CUDART: CUDA Runtime Library
• cuFFT: Fast Fourier Transforms Library
• cuBLAS: Complete BLAS Library
• cuSPARSE: Sparse Matrix Library
• cuRAND: Random Number Generation (RNG) Library
• NPP: Performance Primitives for Image & Video Processing
• Thrust: Templated Parallel Algorithms & Data Structures
• math.h: C99 floating-point Library
Included in the CUDA Toolkit (free download): developer.nvidia.com/cuda-toolkit
For more information on CUDA libraries: developer.nvidia.com/gpu-accelerated-libraries
Slide 3: CUDA 6: 2x Faster GPU Kernel Launches
[Chart: launch overhead in usec, CUDA 5 vs. CUDA 6. Back-to-back launches (a<<<...>>>; b<<<...>>>;) are 1.5x faster; launch and synchronize (a<<<...>>>; cudaDeviceSynchronize();) is 2.0x faster. Dynamic parallel kernel launches also shown.]
• CUDA 5.0 and CUDA 6.0 on Tesla K20
• Performance may vary based on OS version and motherboard configuration
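The two launch patterns being compared can be reproduced with a small microbenchmark. A minimal sketch (the kernel, iteration count, and grid size are illustrative assumptions, not taken from the slides):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void empty_kernel() {}

int main() {
    const int N = 10000;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Back-to-back launches: queue kernels without waiting in between.
    cudaEventRecord(start);
    for (int i = 0; i < N; ++i) empty_kernel<<<1, 1>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("back-to-back launch: %.2f us per kernel\n", 1000.0f * ms / N);

    // Launch and synchronize: wait for each kernel to finish before the next.
    cudaEventRecord(start);
    for (int i = 0; i < N; ++i) {
        empty_kernel<<<1, 1>>>();
        cudaDeviceSynchronize();
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("launch + synchronize: %.2f us per kernel\n", 1000.0f * ms / N);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

The absolute numbers will depend on GPU, driver, and toolkit version, as the slide's disclaimer notes.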
Slide 4: cuFFT: Multi-dimensional FFTs
• Real and complex; single- and double-precision data types
• 1D, 2D and 3D batched transforms
• Flexible input and output data layouts
• New in CUDA 6: XT interface supports dual-GPU cards (Tesla K10, GeForce GTX 690, ...)
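A batched 1D transform, the configuration benchmarked on the next slides, follows the plan/execute pattern. A minimal sketch (sizes are illustrative; the device buffer is left uninitialized for brevity):

```cuda
#include <cuda_runtime.h>
#include <cufft.h>

int main() {
    const int nx = 1 << 20;   // transform size (illustrative)
    const int batch = 8;      // number of transforms per call

    cufftComplex* d_data;
    cudaMalloc(&d_data, sizeof(cufftComplex) * nx * batch);

    // One plan covers every transform in the batch.
    cufftHandle plan;
    cufftPlan1d(&plan, nx, CUFFT_C2C, batch);

    // In-place forward FFT on device-resident data.
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cudaDeviceSynchronize();

    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}
```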
Slide 5: cuFFT: up to 700 GFLOPS
1D complex, batched FFTs; used in audio processing and as a foundation for 2D and 3D FFTs.
[Charts: GFLOPS vs. log2(transform size) from 1 to 25. Single precision peaks near 700 GFLOPS; double precision near 300 GFLOPS.]
• cuFFT 6.0 on K40c, ECC ON, 32M elements, input and output data on device
• Performance may vary based on OS version and motherboard configuration
Slide 6: cuFFT: Consistently High Performance
1D complex, batched FFTs; used in audio processing and as a foundation for 2D and 3D FFTs.
[Charts: GFLOPS vs. transform size from 1 to 100,000,000. Single precision sustains up to ~700 GFLOPS; double precision up to ~300 GFLOPS.]
• cuFFT 6.0 on K40c, ECC ON, 28M-33M elements, input and output data on device
• Performance may vary based on OS version and motherboard configuration
Slide 7: cuFFT-XT: Boosts Performance on K10 (New in CUDA 6)
[Charts: execution time (ms), cuFFT vs. cuFFT-XT, for 256x256x256 and 512x512x512 transforms; 25% and 65% faster, respectively]
• cuFFT 6.0 on K10, ECC ON, input and output data on device
• Performance may vary based on OS version and motherboard configuration
Slide 8: cuBLAS: Dense Linear Algebra on GPUs
• Complete BLAS implementation plus useful extensions
• Supports all 152 standard routines for single, double, complex, and double complex
• Host- and device-callable interface
New in CUDA 6: XT interface for Level 3 BLAS
• Distributes computations across multiple GPUs
• Out-of-core streaming to the GPU; no upper limit on matrix size
• "Drop-in" BLAS intercepts CPU BLAS calls and streams them to the GPU
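The standard host-callable interface looks like ordinary BLAS with a handle and device pointers. A minimal SGEMM sketch (matrices left uninitialized for brevity; m = n = k = 4096 matches the benchmark configuration on slide 9):

```cuda
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 4096;
    float *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(float) * n * n);
    cudaMalloc(&dB, sizeof(float) * n * n);
    cudaMalloc(&dC, sizeof(float) * n * n);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // C = alpha * A * B + beta * C (column-major storage, no transpose)
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);
    cudaDeviceSynchronize();

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```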
Slide 9: cuBLAS: >3 TFLOPS single-precision, >1 TFLOPS double-precision
[Chart: GFLOPS for GEMM, SYMM, TRSM, and SYRK in each precision: S (single), C (single complex), D (double), Z (double complex)]
• cuBLAS 6.0 on K40m, ECC ON, input and output data on device
• m=n=k=4096, transpose=no, side=right, fill=lower
• Performance may vary based on OS version and motherboard configuration
Slide 10: cuBLAS: ZGEMM 5x Faster than MKL
[Chart: GFLOPS vs. matrix dimension (m=n=k, up to 4096), cuBLAS vs. MKL]
• cuBLAS 6.0 on K40m, ECC ON, input and output data on device
• MKL 11.0.4 on Intel IvyBridge 12-core E5-2697 v2 @ 2.70GHz
• Performance may vary based on OS version and motherboard configuration
Slide 11: cuBLAS-XT: Multi-GPU Performance Scaling (New in CUDA 6)
16K x 16K SGEMM on Tesla K10: 2.2 TFLOPS (1 x K10), 4.2 TFLOPS (2 x K10), 6.0 TFLOPS (3 x K10), 7.9 TFLOPS (4 x K10)
• cuBLAS-XT 6.0 on K10, ECC ON, input and output data on host
• Performance may vary based on OS version and motherboard configuration
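Unlike the standard interface, cuBLAS-XT takes host pointers and streams tiles to the selected GPUs, which is why the benchmark above keeps data on the host. A minimal sketch (device IDs and matrix contents are illustrative assumptions):

```cuda
#include <cstdlib>
#include <cublasXt.h>

int main() {
    const size_t n = 16384;  // 16K x 16K SGEMM, as on this slide
    // cuBLAS-XT operates on host memory; no cudaMalloc is needed.
    float* A = (float*)malloc(sizeof(float) * n * n);
    float* B = (float*)malloc(sizeof(float) * n * n);
    float* C = (float*)malloc(sizeof(float) * n * n);

    cublasXtHandle_t handle;
    cublasXtCreate(&handle);

    // Choose which GPUs participate (here: devices 0 and 1, if present).
    int devices[2] = {0, 1};
    cublasXtDeviceSelect(handle, 2, devices);

    // C = alpha * A * B + beta * C, tiled across the selected GPUs.
    const float alpha = 1.0f, beta = 0.0f;
    cublasXtSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                  n, n, n, &alpha, A, n, B, n, &beta, C, n);

    cublasXtDestroy(handle);
    free(A); free(B); free(C);
    return 0;
}
```

Because the library handles the out-of-core streaming, the same call works for matrices larger than any single GPU's memory.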
Slide 12: cuSPARSE: Sparse Linear Algebra Routines
• Optimized sparse BLAS routines: matrix-vector, matrix-matrix, triangular solve
• Support for a variety of formats (CSR, COO, block variants)
[Diagram: sparse matrix-vector multiply, y = alpha * A * x + beta * y]
• New in CUDA 6: many improvements to triangular solvers, incomplete-LU, and Cholesky preconditioners
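The y = alpha * A * x + beta * y operation in the diagram maps to the csrmv routine in the CUDA 6-era API (later toolkits replaced it with a generic SpMV interface). A minimal sketch with CSR storage (matrix contents left unfilled for brevity; a real call would copy values, row pointers, and column indices to the device first):

```cuda
#include <cuda_runtime.h>
#include <cusparse_v2.h>

int main() {
    // A 4x4 sparse matrix with 7 nonzeros in CSR form (illustrative sizes).
    const int m = 4, n = 4, nnz = 7;
    double *d_val, *d_x, *d_y;
    int *d_rowPtr, *d_colInd;
    cudaMalloc(&d_val, sizeof(double) * nnz);
    cudaMalloc(&d_rowPtr, sizeof(int) * (m + 1));
    cudaMalloc(&d_colInd, sizeof(int) * nnz);
    cudaMalloc(&d_x, sizeof(double) * n);
    cudaMalloc(&d_y, sizeof(double) * m);

    cusparseHandle_t handle;
    cusparseCreate(&handle);
    cusparseMatDescr_t descr;
    cusparseCreateMatDescr(&descr);  // defaults: general matrix, zero-based indexing

    // y = alpha * A * x + beta * y
    const double alpha = 1.0, beta = 0.0;
    cusparseDcsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                   m, n, nnz, &alpha, descr,
                   d_val, d_rowPtr, d_colInd, d_x, &beta, d_y);

    cusparseDestroyMatDescr(descr);
    cusparseDestroy(handle);
    return 0;
}
```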
Slide 13: cuSPARSE: 5x Faster than MKL
[Chart: speedup over MKL for sparse matrix x dense vector (SpMV) across test matrices, up to ~6x]
• Average of s/c/d/z routines
• cuSPARSE 6.0 on K40m, ECC ON, input and output data on device
• MKL 11.0.4 on Intel IvyBridge 12-core E5-2697 v2 @ 2.70GHz
• Matrices obtained from: http://www.cise.ufl.edu/research/sparse/matrices/
• Performance may vary based on OS version and motherboard configuration
Slide 14: cuRAND: Random Number Generation
• Generating high-quality random numbers in parallel is hard; don't do it yourself, use a library!
• Pseudo- and quasi-RNGs
• Supports several output distributions
• Statistical test results in documentation
• New in CUDA 6: Mersenne Twister 19937
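The host API follows a create/seed/generate pattern. A minimal sketch using the Mersenne Twister 19937 generator the slide highlights as new (buffer size and seed are illustrative):

```cuda
#include <cuda_runtime.h>
#include <curand.h>

int main() {
    const size_t n = 1 << 24;
    double* d_out;
    cudaMalloc(&d_out, sizeof(double) * n);

    curandGenerator_t gen;
    // MT19937 was added in CUDA 6; older pseudo-RNGs (XORWOW, MRG32k3a,
    // MTGP32, Philox) and quasi-RNGs (Sobol) are selected the same way.
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_MT19937);
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);

    // Fill the device buffer with uniform doubles in (0, 1].
    curandGenerateUniformDouble(gen, d_out, n);

    curandDestroyGenerator(gen);
    cudaFree(d_out);
    return 0;
}
```

Swapping the distribution is a one-line change, e.g. curandGenerateNormalDouble or curandGenerateLogNormalDouble, matching the distributions benchmarked on the next slides.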
Slide 15: cuRAND: Up to 75x Faster vs. Intel MKL
[Chart: Gsamples/sec, cuRAND vs. MKL, for Sobol32 and MRG32k3a under uniform, normal, and log-normal distributions]
• cuRAND 6.0 on K40c, ECC ON, double-precision input and output data on device
• MKL 11.0.1 on Intel SandyBridge 6-core E5-2620 @ 2.0 GHz
• Performance may vary based on OS version and motherboard configuration
Slide 16: cuRAND: High Performance RNGs
[Chart: Gsamples/sec for uniform, normal, and log-normal distributions across pseudo-random generators (XORWOW, Philox, MRG32k3a, MTGP32) and scrambled quasi-random generators (Sobol32, Sobol64)]
• cuRAND 6.0 on K40m, ECC ON, double-precision input and output data on device
• Performance may vary based on OS version and motherboard configuration
Slide 17: NPP: NVIDIA Performance Primitives
• Over 5000 image and signal processing routines: color transforms, geometric transforms, move operations, linear filters, image & signal statistics, image & signal arithmetic, JPEG building blocks, image segmentation
• New in CUDA 6: over 500 new routines, including median filter, BGR/YUV conversion, 3D LUT color conversion, improvements to JPEG primitives, plus many more
Slide 18: NPP Speedup vs. Intel IPP
[Chart: speedup, NPP vs. IPP: Image Set (8-bit RGB) 5.7x; Image Set Channel (8-bit RGB) 14.4x; Image Resize (8-bit RGB) 17.8x; Image Gaussian Filter (32-bit float) 6.3x; Color Conversion 8-bit YUV422 to 8-bit RGB 12.9x; JPEG 8x8 Forward DCT 28.5x]
• NPP 6.0 on K40m, input and output data on device
• IPP 7.0 on Intel IvyBridge 12-core E5-2697 v2 @ 2.70GHz
• Performance may vary based on OS version and motherboard configuration
Slide 19: Thrust: CUDA C++ Template Library
• Host and device containers that mimic the C++ STL
• Optimized algorithms for sort, reduce, scan, etc.
• OpenMP backend for portability
• Allows applications and prototypes to be built quickly
• Also available on GitHub: thrust.github.com
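The STL-like style the slide describes makes GPU code read like host code. A minimal sketch of the container plus the sort and reduce algorithms (the 32M-element size matches the benchmark on the next slide; the fill pattern is an illustrative assumption):

```cuda
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

int main() {
    // Device container: allocation and host/device transfers are implicit.
    thrust::device_vector<int> d(32 * 1024 * 1024);
    thrust::sequence(d.rbegin(), d.rend());   // fill in descending order

    thrust::sort(d.begin(), d.end());         // parallel sort on the GPU
    long long sum = thrust::reduce(d.begin(), d.end(), 0LL);  // parallel reduction

    printf("min = %d, sum = %lld\n", (int)d[0], sum);
    return 0;
}
```

Recompiling with the OpenMP backend (via THRUST_DEVICE_SYSTEM) runs the same source on the CPU, which is the portability point the slide makes.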
Slide 20: Thrust Performance vs. Intel TBB
[Charts: Thrust vs. TBB on 32M integers: reduce, transform, scan, and sort, with speedups up to ~14x. Thrust sort vs. TBB on 32M samples by type: char 43.7x, short 24.8x, int 13.3x, long 5.2x, float 15.0x, double 5.8x]
• Thrust v1.7.1 on K40m, ECC ON, input and output data on device
• TBB 4.2 on Intel IvyBridge 12-core E5-2697 v2 @ 2.70GHz
• Performance may vary based on OS version and motherboard configuration
Slide 21: math.h: C99 floating-point library + extras
CUDA math.h is industry proven, high performance, accurate.
• Basic: +, *, /, 1/x, sqrt, FMA (all IEEE-754 accurate for float and double, all rounding modes)
• Exponentials: exp, exp2, log, log2, log10, ...
• Trigonometry: sin, cos, tan, asin, acos, atan2, sinh, cosh, asinh, acosh, ...
• Special functions: lgamma, tgamma, erf, erfc
• Utility: fmod, remquo, modf, trunc, round, ceil, floor, fabs, ...
• Extras: rsqrt, rcbrt, exp10, sinpi, sincos[pi], cospi, erfinv, erfcinv, normcdf[inv], ...
New in CUDA 6:
• Over 80 new SIMD instructions, useful for video processing: __v*2, __v*4
• Cylindrical Bessel: cyl_i{0,1}
• Reciprocal hypotenuse: rhypot
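Two of the "extras" above can be seen together in a small kernel: sincospi computes sin(pi*x) and cos(pi*x) in one call, and rhypot computes 1/sqrt(a*a + b*b) without intermediate overflow. A minimal sketch (kernel name, sizes, and input values are illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void unit_circle(double* out, const double* t, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        double s, c;
        sincospi(t[i], &s, &c);  // s = sin(pi*t), c = cos(pi*t)
        out[i] = rhypot(s, c);   // 1/hypot; always 1.0 on the unit circle
    }
}

int main() {
    const int n = 256;
    double *d_t, *d_out;
    cudaMalloc(&d_t, sizeof(double) * n);
    cudaMalloc(&d_out, sizeof(double) * n);
    cudaMemset(d_t, 0, sizeof(double) * n);  // t = 0.0 for every element

    unit_circle<<<(n + 127) / 128, 128>>>(d_out, d_t, n);
    cudaDeviceSynchronize();

    double h = 0.0;
    cudaMemcpy(&h, d_out, sizeof(double), cudaMemcpyDeviceToHost);
    printf("rhypot(sinpi(0), cospi(0)) = %f\n", h);  // expect 1.0
    return 0;
}
```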