SlideShare a Scribd company logo
Rob Gillen Intro to GPGPU Programing With CUDA
CodeStock is proudly partnered with: RecruitWise and Staff with Excellence - www.recruitwise.jobs Send instant feedback on this session via Twitter: Send a direct message with the room number to @CodeStock d codestock 411 This guy is Amazing! For more information on sending feedback using Twitter while at CodeStock, please see the “CodeStock README” in your CodeStock guide.
Intro to GPGPU Programming with CUDA Rob Gillen
Welcome! Goals: Overview of GPGPU with CUDA “Vision Casting” for how you can use GPUs to improve your application Outline Why GPGPUs? Applications Tooling Hands-On: Matrix Multiplication Rating: http://spkr8.com/t/7714
CPU vs. GPU GPU devotes more transistors to data processing
NVIDIA Fermi ~1.5TFLOPS (SP)/~800GFLOPS (DP) 230 GB/s DRAM Bandwidth
Motivation FLoating-Point Operations per Second (FLOPS) and  memory bandwidth For the CPU and GPU
Example: Sparse Matrix-Vector CPU Results from “Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms",  Williams et al, Supercomputing 2007
Rayleigh-Bénard Results Double precision 384 x 384 x 192 grid (max that fits in 4GB) Vertical slice of temperature at y=0 Transition from stratified (left) to turbulent (right) Regime depends on Rayleigh number: Ra = gαΔT/κν 8.5x speedup versus Fortran code running on 8-core 2.5 GHz Xeon
G80 Characteristics 367 GFLOPS  peak performance (25-50 times of current high-end microprocessors) 265 GFLOPS sustained for apps such as VMD Massively parallel, 128 cores, 90W Massively threaded, sustains 1000s of threads per app 30-100 times speedup over high-end microprocessors on scientific and media applications: medical imaging, molecular dynamics
Supercomputer Comparison
Applications Exciting applications in future mass computing market have been traditionally considered “supercomputing applications” Molecular dynamics simulation, Video and audio codingand manipulation, 3D imaging and visualization, Consumer game physics, and virtual reality products  These “Super-apps” represent and model physical, concurrent world Various granularities of parallelism exist, but… programming model must not hinder parallel implementation data delivery needs careful management
*Not* for all applications SPMD (Single Program, Multiple Data) are best (data parallel) Operations need to be of sufficient size to overcome overhead Think Millions of operations.
Raytracing
NVIRT: CUDA Ray Tracing API
Tooling VS 2010 C++ (Express is OK… sortof.) NVIDIA CUDA-Capable GPU NVIDIA CUDA Toolkit (v4+) NVIDIA CUDA Tools (v4+) GPU Computing SDK NVIDIA Parallel Insight
Parallel Debugging
Parallel Analysis
VS Project Templates
VS Project Templates
Before we get too excited… Host vs Device Kernels  __global__   __device__  __host__ Thread/Block Control <<<x, y>>> Multi-dimensioned coordinate objects Memory Management/Movement Thread Management – think 1000’s or 1,000,000’s
Block IDs and Threads Each thread uses IDs to decide what data to work on Block ID: 1D or 2D Thread ID: 1D, 2D, or 3D  Simplifies memoryaddressing when processingmultidimensional data Image processing
CUDA Thread Block All threads in a block execute the same kernel program (SPMD) Programmer declares block: Block size 1 to 512 concurrent threads Block shape 1D, 2D, or 3D Block dimensions in threads Threads have thread id numbers within block Thread program uses thread id to select work and address shared data Threads in the same block share data and synchronize while doing their share of the work Threads in different blocks cannot cooperate Each block can execute in any order relative to other blocs! CUDA Thread Block Thread Id #:0 1 2 3 …          m    Thread program
Transparent Scalability Hardware is free to assigns blocks to any processor at any time A kernel scales across any number of parallel processors Kernel grid Device Block 0 Block 1 Block 2 Block 3 Block 4 Block 5 Block 6 Block 7 Device Block 0 Block 1 Block 2 Block 3 Block 4 Block 5 Block 6 Block 7 Block 0 Block 1 Block 2 Block 3 Block 4 Block 5 Block 6 Block 7 time Each block can execute in any order relative to other blocks.
A Simple Running ExampleMatrix Multiplication A simple matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs Leave shared memory usage until later Local, register usage Thread ID usage Memory data transfer API between host and device Assume square matrix for simplicity
Programming Model:Square Matrix Multiplication Example P = M * N of size WIDTH x WIDTH Without tiling: One thread calculates one element of P M and N are loaded WIDTH timesfrom global memory N WIDTH M P WIDTH WIDTH WIDTH 27
Memory Layout of Matrix in C M0,2 M0,1 M0,0 M0,3 M1,1 M1,0 M1,2 M1,3 M2,1 M2,0 M2,2 M2,3 M3,1 M3,0 M3,2 M3,3 M M0,2 M0,1 M0,0 M0,3 M1,1 M1,0 M1,2 M1,3 M2,1 M2,0 M2,2 M2,3 M3,1 M3,0 M3,2 M3,3
Simple Matrix Multiplication (CPU) void MatrixMulOnHost(float* M, float* N, float* P, int Width)‏ {    for (int i = 0; i < Width; ++i) {‏   for (int j = 0; j < Width; ++j) {	 float sum = 0; for (int k = 0; k < Width; ++k) { float a = M[i * width + k]; float b = N[k * width + j]; sum += a * b; } P[i * Width + j] = sum;    }  } } N k j WIDTH M P i WIDTH k 29 WIDTH WIDTH
Simple Matrix Multiplication (GPU) void MatrixMulOnDevice(float* M, float* N, float* P, int Width)‏ { intsize = Width * Width * sizeof(float);  float* Md, Nd, Pd;    …   // 1. Allocate and Load M, N to device memory  cudaMalloc(&Md, size); cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice); cudaMalloc(&Nd, size); cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice); // Allocate P on the device cudaMalloc(&Pd, size);
Simple Matrix Multiplication (GPU) // 2. Kernel invocation code – to be shown later      …  // 3. Read P from the device cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost); // Free device matrices cudaFree(Md);  cudaFree(Nd);  cudaFree(Pd); }
Kernel Function // Matrix multiplication kernel – per thread code __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)‏ {     // Pvalue is used to store the element of the matrix     // that is computed by the thread     float Pvalue = 0;
Kernel Function (contd.) for (int k = 0; k < Width; ++k)‏ { float Melement = Md[threadIdx.y*Width+k]; float Nelement = Nd[k*Width+threadIdx.x]; Pvalue+= Melement * Nelement;    } Pd[threadIdx.y*Width+threadIdx.x] = Pvalue; } Nd k WIDTH tx Md Pd ty ty WIDTH tx k 33 WIDTH WIDTH
Kernel Function (full) // Matrix multiplication kernel – per thread code __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)‏ {    // Pvalue is used to store the element of the matrix // that is computed by the thread float Pvalue = 0;  for (int k = 0; k < Width; ++k)‏ {      float Melement = Md[threadIdx.y*Width+k];      float Nelement = Nd[k*Width+threadIdx.x]; Pvalue += Melement * Nelement;    } Pd[threadIdx.y*Width+threadIdx.x] = Pvalue; }
Kernel Invocation (Host Side)  // Setup the execution configuration dim3 dimGrid(1, 1); dim3 dimBlock(Width, Width); // Launch the device computation threads! MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
Only One Thread Block Used Nd Grid 1 One Block of threads compute matrix Pd Each thread computes one element of Pd Each thread Loads a row of matrix Md Loads a column of matrix Nd Perform one multiply and addition for each pair of Md and Nd elements Compute to off-chip memory access ratio close to 1:1 (not very high)‏ Size of matrix limited by the number of threads allowed in a thread block Block 1 Thread (2, 2)‏ 48    WIDTH Pd Md
Handling Arbitrary Sized Square Matrices Have each 2D thread block to compute a (TILE_WIDTH)2 sub-matrix (tile) of the result matrix Each has (TILE_WIDTH)2 threads Generate a 2D Grid of (WIDTH/TILE_WIDTH)2 blocks Nd WIDTH Md Pd by You still need to put a loop around the kernel call for cases where WIDTH/TILE_WIDTH is greater than max grid size (64K)! TILE_WIDTH ty WIDTH bx tx 37 WIDTH WIDTH
Small Example Nd1,0 Nd0,0 Block(0,0) Block(1,0) Nd1,1 Nd0,1 P1,0 P0,0 P2,0 P3,0 Nd1,2 Nd0,2 TILE_WIDTH = 2 P0,1 P1,1 P3,1 P2,1 Nd0,3 Nd1,3 P0,2 P2,2 P3,2 P1,2 P0,3 P2,3 P3,3 P1,3 Pd1,0 Md2,0 Md1,0 Md0,0 Md3,0 Pd0,0 Pd2,0 Pd3,0 Md1,1 Md0,1 Md2,1 Md3,1 Pd0,1 Pd1,1 Pd3,1 Pd2,1 Block(1,1) Block(0,1) Pd0,2 Pd2,2 Pd3,2 Pd1,2 Pd0,3 Pd2,3 Pd3,3 Pd1,3
Cleanup Topics Memory Management Pinned Memory (Zero-Transfer) Portable Pinned Memory Multi-GPU Wrappers (Python, Java, .NET) Kernels Atomics Thread Synchronization (staged reductions) NVCC
Questions? rob@gillenfamily.net@argodev http://rob.gillenfamily.netRate: http://spkr8.com/t/7714

More Related Content

What's hot

Cuda introduction
Cuda introductionCuda introduction
Cuda introductionHanibei
 
Kato Mivule: An Overview of CUDA for High Performance Computing
Kato Mivule: An Overview of CUDA for High Performance ComputingKato Mivule: An Overview of CUDA for High Performance Computing
Kato Mivule: An Overview of CUDA for High Performance Computing
Kato Mivule
 
NVidia CUDA Tutorial - June 15, 2009
NVidia CUDA Tutorial - June 15, 2009NVidia CUDA Tutorial - June 15, 2009
NVidia CUDA Tutorial - June 15, 2009
Randall Hand
 
GPGPU programming with CUDA
GPGPU programming with CUDAGPGPU programming with CUDA
GPGPU programming with CUDA
Savith Satheesh
 
Computing using GPUs
Computing using GPUsComputing using GPUs
Computing using GPUs
Shree Kumar
 
Cuda
CudaCuda
Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08Angela Mendoza M.
 
Cuda
CudaCuda
Lrz kurs: big data analysis
Lrz kurs: big data analysisLrz kurs: big data analysis
Lrz kurs: big data analysis
Ferdinand Jamitzky
 
Introduction to CUDA C: NVIDIA : Notes
Introduction to CUDA C: NVIDIA : NotesIntroduction to CUDA C: NVIDIA : Notes
Introduction to CUDA C: NVIDIA : Notes
Subhajit Sahu
 
Gpu with cuda architecture
Gpu with cuda architectureGpu with cuda architecture
Gpu with cuda architecture
Dhaval Kaneria
 
Monte Carlo on GPUs
Monte Carlo on GPUsMonte Carlo on GPUs
Monte Carlo on GPUs
fcassier
 
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
Ural-PDC
 
Gpu perf-presentation
Gpu perf-presentationGpu perf-presentation
Gpu perf-presentation
GiannisTsagatakis
 
CuPy: A NumPy-compatible Library for GPU
CuPy: A NumPy-compatible Library for GPUCuPy: A NumPy-compatible Library for GPU
CuPy: A NumPy-compatible Library for GPU
Shohei Hido
 

What's hot (18)

Cuda tutorial
Cuda tutorialCuda tutorial
Cuda tutorial
 
GPU: Understanding CUDA
GPU: Understanding CUDAGPU: Understanding CUDA
GPU: Understanding CUDA
 
Cuda introduction
Cuda introductionCuda introduction
Cuda introduction
 
Kato Mivule: An Overview of CUDA for High Performance Computing
Kato Mivule: An Overview of CUDA for High Performance ComputingKato Mivule: An Overview of CUDA for High Performance Computing
Kato Mivule: An Overview of CUDA for High Performance Computing
 
NVidia CUDA Tutorial - June 15, 2009
NVidia CUDA Tutorial - June 15, 2009NVidia CUDA Tutorial - June 15, 2009
NVidia CUDA Tutorial - June 15, 2009
 
GPGPU programming with CUDA
GPGPU programming with CUDAGPGPU programming with CUDA
GPGPU programming with CUDA
 
Computing using GPUs
Computing using GPUsComputing using GPUs
Computing using GPUs
 
Cuda
CudaCuda
Cuda
 
Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08
 
Cuda
CudaCuda
Cuda
 
Lrz kurs: big data analysis
Lrz kurs: big data analysisLrz kurs: big data analysis
Lrz kurs: big data analysis
 
Introduction to CUDA C: NVIDIA : Notes
Introduction to CUDA C: NVIDIA : NotesIntroduction to CUDA C: NVIDIA : Notes
Introduction to CUDA C: NVIDIA : Notes
 
Cuda Architecture
Cuda ArchitectureCuda Architecture
Cuda Architecture
 
Gpu with cuda architecture
Gpu with cuda architectureGpu with cuda architecture
Gpu with cuda architecture
 
Monte Carlo on GPUs
Monte Carlo on GPUsMonte Carlo on GPUs
Monte Carlo on GPUs
 
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
 
Gpu perf-presentation
Gpu perf-presentationGpu perf-presentation
Gpu perf-presentation
 
CuPy: A NumPy-compatible Library for GPU
CuPy: A NumPy-compatible Library for GPUCuPy: A NumPy-compatible Library for GPU
CuPy: A NumPy-compatible Library for GPU
 

Similar to Intro to GPGPU Programming with Cuda

Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Newbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universeNewbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universe
Ofer Rosenberg
 
Deep Learning Edge
Deep Learning Edge Deep Learning Edge
Deep Learning Edge
Ganesan Narayanasamy
 
Data-Level Parallelism in Microprocessors
Data-Level Parallelism in MicroprocessorsData-Level Parallelism in Microprocessors
Data-Level Parallelism in Microprocessors
Dilum Bandara
 
NVIDIA CUDA
NVIDIA CUDANVIDIA CUDA
NVIDIA CUDA
Jungsoo Nam
 
IRJET- Performance Analysis of RSA Algorithm with CUDA Parallel Computing
IRJET- Performance Analysis of RSA Algorithm with CUDA Parallel ComputingIRJET- Performance Analysis of RSA Algorithm with CUDA Parallel Computing
IRJET- Performance Analysis of RSA Algorithm with CUDA Parallel Computing
IRJET Journal
 
Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.
J On The Beach
 
CUDA and Caffe for deep learning
CUDA and Caffe for deep learningCUDA and Caffe for deep learning
CUDA and Caffe for deep learning
Amgad Muhammad
 
Introduction to Accelerators
Introduction to AcceleratorsIntroduction to Accelerators
Introduction to Accelerators
Dilum Bandara
 
Gpu computing workshop
Gpu computing workshopGpu computing workshop
Gpu computing workshop
datastack
 
An Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptxAn Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptx
AnirudhGarg35
 
Efficient algorithm for rsa text encryption using cuda c
Efficient algorithm for rsa text encryption using cuda cEfficient algorithm for rsa text encryption using cuda c
Efficient algorithm for rsa text encryption using cuda c
csandit
 
Efficient algorithm for rsa text encryption using cuda c
Efficient algorithm for rsa text encryption using cuda cEfficient algorithm for rsa text encryption using cuda c
Efficient algorithm for rsa text encryption using cuda c
csandit
 
Introduction to cuda geek camp singapore 2011
Introduction to cuda   geek camp singapore 2011Introduction to cuda   geek camp singapore 2011
Introduction to cuda geek camp singapore 2011
Raymond Tay
 
gpuprogram_lecture,architecture_designsn
gpuprogram_lecture,architecture_designsngpuprogram_lecture,architecture_designsn
gpuprogram_lecture,architecture_designsn
ARUNACHALAM468781
 

Similar to Intro to GPGPU Programming with Cuda (20)

Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Newbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universeNewbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universe
 
Deep Learning Edge
Deep Learning Edge Deep Learning Edge
Deep Learning Edge
 
Data-Level Parallelism in Microprocessors
Data-Level Parallelism in MicroprocessorsData-Level Parallelism in Microprocessors
Data-Level Parallelism in Microprocessors
 
NVIDIA CUDA
NVIDIA CUDANVIDIA CUDA
NVIDIA CUDA
 
Lecture 04
Lecture 04Lecture 04
Lecture 04
 
IRJET- Performance Analysis of RSA Algorithm with CUDA Parallel Computing
IRJET- Performance Analysis of RSA Algorithm with CUDA Parallel ComputingIRJET- Performance Analysis of RSA Algorithm with CUDA Parallel Computing
IRJET- Performance Analysis of RSA Algorithm with CUDA Parallel Computing
 
GPU Programming
GPU ProgrammingGPU Programming
GPU Programming
 
Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.
 
CUDA and Caffe for deep learning
CUDA and Caffe for deep learningCUDA and Caffe for deep learning
CUDA and Caffe for deep learning
 
Introduction to Accelerators
Introduction to AcceleratorsIntroduction to Accelerators
Introduction to Accelerators
 
Cuda materials
Cuda materialsCuda materials
Cuda materials
 
Gpu computing workshop
Gpu computing workshopGpu computing workshop
Gpu computing workshop
 
An Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptxAn Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptx
 
Efficient algorithm for rsa text encryption using cuda c
Efficient algorithm for rsa text encryption using cuda cEfficient algorithm for rsa text encryption using cuda c
Efficient algorithm for rsa text encryption using cuda c
 
Efficient algorithm for rsa text encryption using cuda c
Efficient algorithm for rsa text encryption using cuda cEfficient algorithm for rsa text encryption using cuda c
Efficient algorithm for rsa text encryption using cuda c
 
Introduction to cuda geek camp singapore 2011
Introduction to cuda   geek camp singapore 2011Introduction to cuda   geek camp singapore 2011
Introduction to cuda geek camp singapore 2011
 
gpuprogram_lecture,architecture_designsn
gpuprogram_lecture,architecture_designsngpuprogram_lecture,architecture_designsn
gpuprogram_lecture,architecture_designsn
 

More from Rob Gillen

CodeStock14: Hiding in Plain Sight
CodeStock14: Hiding in Plain SightCodeStock14: Hiding in Plain Sight
CodeStock14: Hiding in Plain Sight
Rob Gillen
 
What's in a password
What's in a password What's in a password
What's in a password
Rob Gillen
 
How well do you know your runtime
How well do you know your runtimeHow well do you know your runtime
How well do you know your runtime
Rob Gillen
 
Software defined radio and the hacker
Software defined radio and the hackerSoftware defined radio and the hacker
Software defined radio and the hacker
Rob Gillen
 
So whats in a password
So whats in a passwordSo whats in a password
So whats in a password
Rob Gillen
 
Hiding in plain sight
Hiding in plain sightHiding in plain sight
Hiding in plain sight
Rob Gillen
 
ETCSS: Into the Mind of a Hacker
ETCSS: Into the Mind of a HackerETCSS: Into the Mind of a Hacker
ETCSS: Into the Mind of a HackerRob Gillen
 
DevLink - WiFu: You think your wireless is secure?
DevLink - WiFu: You think your wireless is secure?DevLink - WiFu: You think your wireless is secure?
DevLink - WiFu: You think your wireless is secure?
Rob Gillen
 
You think your WiFi is safe?
You think your WiFi is safe?You think your WiFi is safe?
You think your WiFi is safe?
Rob Gillen
 
Anatomy of a Buffer Overflow Attack
Anatomy of a Buffer Overflow AttackAnatomy of a Buffer Overflow Attack
Anatomy of a Buffer Overflow Attack
Rob Gillen
 
A Comparison of AWS and Azure - Part2
A Comparison of AWS and Azure - Part2A Comparison of AWS and Azure - Part2
A Comparison of AWS and Azure - Part2Rob Gillen
 
A Comparison of AWS and Azure - Part 1
A Comparison of AWS and Azure - Part 1A Comparison of AWS and Azure - Part 1
A Comparison of AWS and Azure - Part 1Rob Gillen
 
Scaling Document Clustering in the Cloud
Scaling Document Clustering in the CloudScaling Document Clustering in the Cloud
Scaling Document Clustering in the CloudRob Gillen
 
Hands On with Amazon Web Services (StirTrek)
Hands On with Amazon Web Services (StirTrek)Hands On with Amazon Web Services (StirTrek)
Hands On with Amazon Web Services (StirTrek)
Rob Gillen
 
Windows Azure: Lessons From The Field
Windows Azure: Lessons From The FieldWindows Azure: Lessons From The Field
Windows Azure: Lessons From The Field
Rob Gillen
 
Amazon Web Services for the .NET Developer
Amazon Web Services for the .NET DeveloperAmazon Web Services for the .NET Developer
Amazon Web Services for the .NET Developer
Rob Gillen
 
05561 Xfer Research 02
05561 Xfer Research 0205561 Xfer Research 02
05561 Xfer Research 02
Rob Gillen
 
05561 Xfer Research 01
05561 Xfer Research 0105561 Xfer Research 01
05561 Xfer Research 01
Rob Gillen
 
05561 Xfer Consumer 01
05561 Xfer Consumer 0105561 Xfer Consumer 01
05561 Xfer Consumer 01
Rob Gillen
 

More from Rob Gillen (20)

CodeStock14: Hiding in Plain Sight
CodeStock14: Hiding in Plain SightCodeStock14: Hiding in Plain Sight
CodeStock14: Hiding in Plain Sight
 
What's in a password
What's in a password What's in a password
What's in a password
 
How well do you know your runtime
How well do you know your runtimeHow well do you know your runtime
How well do you know your runtime
 
Software defined radio and the hacker
Software defined radio and the hackerSoftware defined radio and the hacker
Software defined radio and the hacker
 
So whats in a password
So whats in a passwordSo whats in a password
So whats in a password
 
Hiding in plain sight
Hiding in plain sightHiding in plain sight
Hiding in plain sight
 
ETCSS: Into the Mind of a Hacker
ETCSS: Into the Mind of a HackerETCSS: Into the Mind of a Hacker
ETCSS: Into the Mind of a Hacker
 
DevLink - WiFu: You think your wireless is secure?
DevLink - WiFu: You think your wireless is secure?DevLink - WiFu: You think your wireless is secure?
DevLink - WiFu: You think your wireless is secure?
 
You think your WiFi is safe?
You think your WiFi is safe?You think your WiFi is safe?
You think your WiFi is safe?
 
Anatomy of a Buffer Overflow Attack
Anatomy of a Buffer Overflow AttackAnatomy of a Buffer Overflow Attack
Anatomy of a Buffer Overflow Attack
 
AWS vs. Azure
AWS vs. AzureAWS vs. Azure
AWS vs. Azure
 
A Comparison of AWS and Azure - Part2
A Comparison of AWS and Azure - Part2A Comparison of AWS and Azure - Part2
A Comparison of AWS and Azure - Part2
 
A Comparison of AWS and Azure - Part 1
A Comparison of AWS and Azure - Part 1A Comparison of AWS and Azure - Part 1
A Comparison of AWS and Azure - Part 1
 
Scaling Document Clustering in the Cloud
Scaling Document Clustering in the CloudScaling Document Clustering in the Cloud
Scaling Document Clustering in the Cloud
 
Hands On with Amazon Web Services (StirTrek)
Hands On with Amazon Web Services (StirTrek)Hands On with Amazon Web Services (StirTrek)
Hands On with Amazon Web Services (StirTrek)
 
Windows Azure: Lessons From The Field
Windows Azure: Lessons From The FieldWindows Azure: Lessons From The Field
Windows Azure: Lessons From The Field
 
Amazon Web Services for the .NET Developer
Amazon Web Services for the .NET DeveloperAmazon Web Services for the .NET Developer
Amazon Web Services for the .NET Developer
 
05561 Xfer Research 02
05561 Xfer Research 0205561 Xfer Research 02
05561 Xfer Research 02
 
05561 Xfer Research 01
05561 Xfer Research 0105561 Xfer Research 01
05561 Xfer Research 01
 
05561 Xfer Consumer 01
05561 Xfer Consumer 0105561 Xfer Consumer 01
05561 Xfer Consumer 01
 

Recently uploaded

Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Vladimir Iglovikov, Ph.D.
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
Alex Pruden
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
nkrafacyberclub
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 

Recently uploaded (20)

Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 

Intro to GPGPU Programming with Cuda

  • 1. Rob Gillen Intro to GPGPU Programing With CUDA
  • 2. CodeStock is proudly partnered with: RecruitWise and Staff with Excellence - www.recruitwise.jobs Send instant feedback on this session via Twitter: Send a direct message with the room number to @CodeStock d codestock 411 This guy is Amazing! For more information on sending feedback using Twitter while at CodeStock, please see the “CodeStock README” in your CodeStock guide.
  • 3.
  • 4. Intro to GPGPU Programming with CUDA Rob Gillen
  • 5. Welcome! Goals: Overview of GPGPU with CUDA “Vision Casting” for how you can use GPUs to improve your application Outline Why GPGPUs? Applications Tooling Hands-On: Matrix Multiplication Rating: http://spkr8.com/t/7714
  • 6. CPU vs. GPU GPU devotes more transistors to data processing
  • 7. NVIDIA Fermi ~1.5TFLOPS (SP)/~800GFLOPS (DP) 230 GB/s DRAM Bandwidth
  • 8. Motivation FLoating-Point Operations per Second (FLOPS) and memory bandwidth For the CPU and GPU
  • 9. Example: Sparse Matrix-Vector CPU Results from “Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Williams et al, Supercomputing 2007
  • 10. Rayleigh-Bénard Results Double precision 384 x 384 x 192 grid (max that fits in 4GB) Vertical slice of temperature at y=0 Transition from stratified (left) to turbulent (right) Regime depends on Rayleigh number: Ra = gαΔT/κν 8.5x speedup versus Fortran code running on 8-core 2.5 GHz Xeon
  • 11. G80 Characteristics 367 GFLOPS peak performance (25-50 times of current high-end microprocessors) 265 GFLOPS sustained for apps such as VMD Massively parallel, 128 cores, 90W Massively threaded, sustains 1000s of threads per app 30-100 times speedup over high-end microprocessors on scientific and media applications: medical imaging, molecular dynamics
  • 13. Applications Exciting applications in future mass computing market have been traditionally considered “supercomputing applications” Molecular dynamics simulation, Video and audio codingand manipulation, 3D imaging and visualization, Consumer game physics, and virtual reality products These “Super-apps” represent and model physical, concurrent world Various granularities of parallelism exist, but… programming model must not hinder parallel implementation data delivery needs careful management
  • 14. *Not* for all applications SPMD (Single Program, Multiple Data) are best (data parallel) Operations need to be of sufficient size to overcome overhead Think Millions of operations.
  • 16. NVIRT: CUDA Ray Tracing API
  • 17. Tooling VS 2010 C++ (Express is OK… sortof.) NVIDIA CUDA-Capable GPU NVIDIA CUDA Toolkit (v4+) NVIDIA CUDA Tools (v4+) GPU Computing SDK NVIDIA Parallel Insight
  • 22. Before we get too excited… Host vs Device Kernels __global__ __device__ __host__ Thread/Block Control <<<x, y>>> Multi-dimensioned coordinate objects Memory Management/Movement Thread Management – think 1000’s or 1,000,000’s
  • 23. Block IDs and Threads Each thread uses IDs to decide what data to work on Block ID: 1D or 2D Thread ID: 1D, 2D, or 3D Simplifies memoryaddressing when processingmultidimensional data Image processing
  • 24. CUDA Thread Block All threads in a block execute the same kernel program (SPMD) Programmer declares block: Block size 1 to 512 concurrent threads Block shape 1D, 2D, or 3D Block dimensions in threads Threads have thread id numbers within block Thread program uses thread id to select work and address shared data Threads in the same block share data and synchronize while doing their share of the work Threads in different blocks cannot cooperate Each block can execute in any order relative to other blocs! CUDA Thread Block Thread Id #:0 1 2 3 … m Thread program
  • 25. Transparent Scalability Hardware is free to assigns blocks to any processor at any time A kernel scales across any number of parallel processors Kernel grid Device Block 0 Block 1 Block 2 Block 3 Block 4 Block 5 Block 6 Block 7 Device Block 0 Block 1 Block 2 Block 3 Block 4 Block 5 Block 6 Block 7 Block 0 Block 1 Block 2 Block 3 Block 4 Block 5 Block 6 Block 7 time Each block can execute in any order relative to other blocks.
  • 26. A Simple Running ExampleMatrix Multiplication A simple matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs Leave shared memory usage until later Local, register usage Thread ID usage Memory data transfer API between host and device Assume square matrix for simplicity
  • 27. Programming Model:Square Matrix Multiplication Example P = M * N of size WIDTH x WIDTH Without tiling: One thread calculates one element of P M and N are loaded WIDTH timesfrom global memory N WIDTH M P WIDTH WIDTH WIDTH 27
  • 28. Memory Layout of Matrix in C M0,2 M0,1 M0,0 M0,3 M1,1 M1,0 M1,2 M1,3 M2,1 M2,0 M2,2 M2,3 M3,1 M3,0 M3,2 M3,3 M M0,2 M0,1 M0,0 M0,3 M1,1 M1,0 M1,2 M1,3 M2,1 M2,0 M2,2 M2,3 M3,1 M3,0 M3,2 M3,3
  • 29. Simple Matrix Multiplication (CPU) void MatrixMulOnHost(float* M, float* N, float* P, int Width)‏ { for (int i = 0; i < Width; ++i) {‏ for (int j = 0; j < Width; ++j) { float sum = 0; for (int k = 0; k < Width; ++k) { float a = M[i * width + k]; float b = N[k * width + j]; sum += a * b; } P[i * Width + j] = sum; } } } N k j WIDTH M P i WIDTH k 29 WIDTH WIDTH
  • 30. Simple Matrix Multiplication (GPU) void MatrixMulOnDevice(float* M, float* N, float* P, int Width)‏ { intsize = Width * Width * sizeof(float); float* Md, Nd, Pd; … // 1. Allocate and Load M, N to device memory cudaMalloc(&Md, size); cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice); cudaMalloc(&Nd, size); cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice); // Allocate P on the device cudaMalloc(&Pd, size);
  • 31. Simple Matrix Multiplication (GPU) // 2. Kernel invocation code – to be shown later … // 3. Read P from the device cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost); // Free device matrices cudaFree(Md); cudaFree(Nd); cudaFree(Pd); }
  • 32. Kernel Function // Matrix multiplication kernel – per thread code __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)‏ { // Pvalue is used to store the element of the matrix // that is computed by the thread float Pvalue = 0;
  • 33. Kernel Function (contd.) for (int k = 0; k < Width; ++k)‏ { float Melement = Md[threadIdx.y*Width+k]; float Nelement = Nd[k*Width+threadIdx.x]; Pvalue+= Melement * Nelement; } Pd[threadIdx.y*Width+threadIdx.x] = Pvalue; } Nd k WIDTH tx Md Pd ty ty WIDTH tx k 33 WIDTH WIDTH
  • 34. Kernel Function (full) // Matrix multiplication kernel – per thread code __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)‏ { // Pvalue is used to store the element of the matrix // that is computed by the thread float Pvalue = 0; for (int k = 0; k < Width; ++k)‏ { float Melement = Md[threadIdx.y*Width+k]; float Nelement = Nd[k*Width+threadIdx.x]; Pvalue += Melement * Nelement; } Pd[threadIdx.y*Width+threadIdx.x] = Pvalue; }
  • 35. Kernel Invocation (Host Side) // Setup the execution configuration dim3 dimGrid(1, 1); dim3 dimBlock(Width, Width); // Launch the device computation threads! MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
  • 36. Only One Thread Block Used Nd Grid 1 One Block of threads compute matrix Pd Each thread computes one element of Pd Each thread Loads a row of matrix Md Loads a column of matrix Nd Perform one multiply and addition for each pair of Md and Nd elements Compute to off-chip memory access ratio close to 1:1 (not very high)‏ Size of matrix limited by the number of threads allowed in a thread block Block 1 Thread (2, 2)‏ 48 WIDTH Pd Md
  • 37. Handling Arbitrary Sized Square Matrices Have each 2D thread block to compute a (TILE_WIDTH)2 sub-matrix (tile) of the result matrix Each has (TILE_WIDTH)2 threads Generate a 2D Grid of (WIDTH/TILE_WIDTH)2 blocks Nd WIDTH Md Pd by You still need to put a loop around the kernel call for cases where WIDTH/TILE_WIDTH is greater than max grid size (64K)! TILE_WIDTH ty WIDTH bx tx 37 WIDTH WIDTH
  • 38. Small Example Nd1,0 Nd0,0 Block(0,0) Block(1,0) Nd1,1 Nd0,1 P1,0 P0,0 P2,0 P3,0 Nd1,2 Nd0,2 TILE_WIDTH = 2 P0,1 P1,1 P3,1 P2,1 Nd0,3 Nd1,3 P0,2 P2,2 P3,2 P1,2 P0,3 P2,3 P3,3 P1,3 Pd1,0 Md2,0 Md1,0 Md0,0 Md3,0 Pd0,0 Pd2,0 Pd3,0 Md1,1 Md0,1 Md2,1 Md3,1 Pd0,1 Pd1,1 Pd3,1 Pd2,1 Block(1,1) Block(0,1) Pd0,2 Pd2,2 Pd3,2 Pd1,2 Pd0,3 Pd2,3 Pd3,3 Pd1,3
  • 39. Cleanup Topics Memory Management Pinned Memory (Zero-Transfer) Portable Pinned Memory Multi-GPU Wrappers (Python, Java, .NET) Kernels Atomics Thread Synchronization (staged reductions) NVCC

Editor's Notes

  1. Sparse linear algebra is interesting both because many science and engineering codes rely on it, and also because it was traditionally assumed to be something that GPUs would not be good at (because of irregular data access patterns). We have shown that in fact GPUs are extremely good at sparse matrix-vector multiply (SpMV), which is the basic building block of sparse linear algebra. The code and an accompanying white paper are available on the cuda forums and also posted on research.nvidia.com.This is compared to an extremely well-studied, well-optimized SpMV implementation from a widely respected paper in Supercomputing 2007. that paper only reported double-precision results for CPUs; our single precision results are even more impressive in comparison.
  2. Compared to highly optimizedfortran code from an oceanography researcher at UCLA
  3. Current implementation uses short-stack approach. Top elements of the stack are cached in registers.
  4. RTAPI enables implementation of manydifferent raytracing flavors.left-right, top-bottom: Procedural materials, Ambient occlusion, Whittedraytracer (thin shell glass and metalic spheres) Path tracer (Cornell box), Refactions, Cook-style distribution raytracingCould also do non-rendering stuff, e.g. GIS (line of sight say), physics (collision/proximity detection)