“GPU With CUDA Architecture”
Presented By-
Dhaval Kaneria (13014061010)
Guided By-
Mr. Rajesh k Navandar
Table Of Contents
• Introduction of GPU
• Performance Factors Of GPU
• GPU Pipeline
• Block Diagram Of Pipeline Process Flow
• Introduction Of CUDA
• Thread Batching
• Simple Processing Flow
• CUDA C/C++
• Applications
• The Future Scope Of CUDA Technology
• Conclusion
• References
Introduction of GPU
• A Graphics Processing Unit (GPU) is a microprocessor that has been designed
specifically for the processing of 3D graphics.
• The processor is built with integrated transform, lighting, triangle setup/clipping,
and rendering engines, capable of handling millions of math-intensive processes
per second.
• GPUs form the heart of modern graphics cards, relieving the CPU (central
processing unit) of much of the graphics processing load. GPUs allow products
such as desktop PCs, portable computers, and game consoles to process real-time
3D graphics that, only a few years ago, were available only on high-end workstations.
• Used primarily for 3D applications, a graphics processing unit is a single-chip
processor that creates lighting effects and transforms objects every time a 3D
scene is redrawn. These are mathematically intensive tasks that would otherwise
put quite a strain on the CPU. Lifting this burden from the CPU frees up
cycles that can be used for other jobs.
Performance Factors Of GPU
• Fill Rate:
It is defined as the number of pixels or texels (textured pixels) that the GPU renders to memory
per second, and it is a common measure of raw rendering throughput. Modern GPUs have fill rates as
high as 3.2 billion pixels per second. The fill rate of a GPU can be increased by raising its
clock frequency.
• Memory Bandwidth:
It is the data transfer speed between the graphics chip and its local frame buffer. More
bandwidth usually gives better performance when the image to be rendered is of high quality
and at very high resolution.
• Memory Management:
The performance of the GPU also depends on how efficiently the memory is managed,
because memory bandwidth can become the bottleneck if it is not managed properly.
• Hidden Surface Removal:
This term describes reducing overdraw when rendering a scene by not rendering
surfaces that are not visible, which can noticeably increase GPU performance.
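As a rough worked example of the fill-rate and bandwidth factors above: peak pixel fill rate is often estimated as pixels written per clock cycle times the core clock, and frame-buffer write traffic as resolution times bytes per pixel times frame rate times overdraw. All numbers below (8 pixels per clock, 400 MHz, 1920 x 1080, 4 bytes per pixel, 60 frames per second, overdraw of 3) are illustrative assumptions, not the specifications of any particular GPU.

```latex
\begin{aligned}
\text{fill rate} &\approx N_{\text{pixels/clock}} \times f_{\text{clock}}
  = 8 \times 400 \times 10^{6}\,\text{s}^{-1}
  = 3.2 \times 10^{9}\ \text{pixels/s} \\[4pt]
\text{write traffic} &\approx W \times H \times b \times f_{\text{frame}} \times d
  = 1920 \times 1080 \times 4 \times 60 \times 3
  \approx 1.5 \times 10^{9}\ \text{bytes/s}
\end{aligned}
```

Under this estimate, doubling the clock doubles the fill rate, and removing hidden surfaces (lowering the overdraw factor d toward 1) cuts the write traffic proportionally.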
GPU Pipeline
• The GPU receives geometry information from the CPU as an input and provides a
picture as an output
• The host interface is the communication bridge between the CPU and the GPU
• It receives commands from the CPU and also pulls geometry information from
system memory.
• It outputs a stream of vertices in object space with all their associated information
(normals, texture coordinates, per-vertex color, etc.)
• The vertex processing stage receives vertices from the host interface in object
space and outputs them in screen space
• This may be a simple linear transformation, or a complex operation involving
morphing effects
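The simple linear case of vertex processing can be sketched as a CUDA kernel with one thread per vertex. This is only an illustration of the object-space to screen-space step, not how a graphics driver implements it; the `Vec4` layout, the `kMvp` matrix name, and the `toScreenSpace` helper are assumptions made for this sketch, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical vertex layout for this sketch: a homogeneous object-space position.
struct Vec4 { float x, y, z, w; };

// Row-major 4x4 transform (e.g. model-view-projection), stored in constant memory.
__constant__ float kMvp[16];

// One thread per vertex: object space -> clip space -> screen space (pixels).
__global__ void transformVertices(const Vec4* in, float2* out, int n,
                                  float width, float height)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    Vec4 v = in[i];
    float cx = kMvp[0]  * v.x + kMvp[1]  * v.y + kMvp[2]  * v.z + kMvp[3]  * v.w;
    float cy = kMvp[4]  * v.x + kMvp[5]  * v.y + kMvp[6]  * v.z + kMvp[7]  * v.w;
    float cw = kMvp[12] * v.x + kMvp[13] * v.y + kMvp[14] * v.z + kMvp[15] * v.w;

    // Perspective divide gives normalized device coordinates in [-1, 1].
    float ndcX = cx / cw;
    float ndcY = cy / cw;

    // Viewport mapping; y is flipped so that +y in NDC points up on screen.
    out[i] = make_float2((ndcX * 0.5f + 0.5f) * width,
                         (1.0f - (ndcY * 0.5f + 0.5f)) * height);
}

// Host helper: copies data to the device, launches the kernel, copies results back.
std::vector<float2> toScreenSpace(const std::vector<Vec4>& verts,
                                  const float* mvp, float width, float height)
{
    int n = static_cast<int>(verts.size());
    std::vector<float2> result(n);
    if (n == 0) return result;

    Vec4* dIn = nullptr;
    float2* dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(Vec4));
    cudaMalloc(&dOut, n * sizeof(float2));
    cudaMemcpy(dIn, verts.data(), n * sizeof(Vec4), cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(kMvp, mvp, 16 * sizeof(float));

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    transformVertices<<<blocks, threads>>>(dIn, dOut, n, width, height);

    cudaMemcpy(result.data(), dOut, n * sizeof(float2), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}

int main()
{
    const float identity[16] = {1, 0, 0, 0,  0, 1, 0, 0,  0, 0, 1, 0,  0, 0, 0, 1};
    std::vector<Vec4> verts = {{0, 0, 0, 1}, {1, 1, 0, 1}, {-1, -1, 0, 1}};
    std::vector<float2> screen = toScreenSpace(verts, identity, 640.0f, 480.0f);
    for (const float2& p : screen) std::printf("(%.1f, %.1f)\n", p.x, p.y);
    return 0;
}
```

With the identity matrix, the origin lands in the middle of the 640 x 480 viewport and the corner (1, 1) lands at the top-right pixel corner.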
[Figure: pipeline stages, in order: host interface, vertex processing, triangle setup, pixel processing, memory interface]
• A fragment is generated if and only if its center is inside the triangle
• Every fragment generated has its attributes computed as the
perspective-correct interpolation of the attributes of the three vertices that make up the
triangle
• Each fragment provided by triangle setup is fed into fragment processing
as a set of attributes (position, normal, texture coordinates, etc.), which are used to
compute the final color for this pixel
• Before the final write occurs, some fragments are rejected by the z-buffer,
stencil, and alpha tests
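The coverage rule and the z-buffer test above can be sketched in CUDA with one thread per pixel. This is a simplified software illustration, not the fixed-function hardware path: the `Triangle` struct and `render` helper are assumptions of this sketch, depth is interpolated linearly in screen space (a real pipeline would apply perspective correction to attributes), and the stencil and alpha tests are left out.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical screen-space triangle for this sketch: x, y in pixels, z = depth.
struct Triangle { float x[3], y[3], z[3]; };

// Signed area term: tells which side of edge a->b the point p lies on.
__device__ float edgeFunction(float ax, float ay, float bx, float by,
                              float px, float py)
{
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax);
}

// One thread per pixel: emit a fragment only if the pixel centre is covered,
// then keep it only if it passes the depth (z-buffer) test.
__global__ void rasterize(Triangle t, unsigned color, float* zbuf,
                          unsigned* frame, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float px = x + 0.5f, py = y + 0.5f;                 // pixel centre
    float area = edgeFunction(t.x[0], t.y[0], t.x[1], t.y[1], t.x[2], t.y[2]);
    if (area == 0.0f) return;                            // degenerate triangle

    // Barycentric weights; dividing by the signed area handles both windings.
    float w0 = edgeFunction(t.x[1], t.y[1], t.x[2], t.y[2], px, py) / area;
    float w1 = edgeFunction(t.x[2], t.y[2], t.x[0], t.y[0], px, py) / area;
    float w2 = edgeFunction(t.x[0], t.y[0], t.x[1], t.y[1], px, py) / area;
    if (w0 < 0.0f || w1 < 0.0f || w2 < 0.0f) return;     // centre outside: no fragment

    float z = w0 * t.z[0] + w1 * t.z[1] + w2 * t.z[2];   // interpolated depth
    int idx = y * width + x;
    if (z < zbuf[idx]) {                                 // closer than the stored depth
        zbuf[idx] = z;
        frame[idx] = color;
    }
}

// Host helper: draws the triangles in order into a width x height frame buffer.
std::vector<unsigned> render(const std::vector<Triangle>& tris,
                             const std::vector<unsigned>& colors,
                             int width, int height)
{
    size_t n = static_cast<size_t>(width) * height;
    std::vector<float> depth(n, 1.0f);                   // 1.0 = far plane
    std::vector<unsigned> frame(n, 0u);                  // 0 = background
    float* dZ = nullptr;
    unsigned* dFrame = nullptr;
    cudaMalloc(&dZ, n * sizeof(float));
    cudaMalloc(&dFrame, n * sizeof(unsigned));
    cudaMemcpy(dZ, depth.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dFrame, frame.data(), n * sizeof(unsigned), cudaMemcpyHostToDevice);

    dim3 block(8, 8);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    for (size_t i = 0; i < tris.size(); ++i)             // launches run one after another
        rasterize<<<grid, block>>>(tris[i], colors[i], dZ, dFrame, width, height);

    cudaMemcpy(frame.data(), dFrame, n * sizeof(unsigned), cudaMemcpyDeviceToHost);
    cudaFree(dZ);
    cudaFree(dFrame);
    return frame;
}

int main()
{
    Triangle nearTri = {{0, 8, 0}, {0, 0, 8}, {0.5f, 0.5f, 0.5f}};
    Triangle farTri  = {{0, 8, 0}, {0, 0, 8}, {0.8f, 0.8f, 0.8f}};
    std::vector<unsigned> frame = render({nearTri, farTri}, {1u, 2u}, 8, 8);
    for (int y = 0; y < 8; ++y) {
        for (int x = 0; x < 8; ++x) std::printf("%u", frame[y * 8 + x]);
        std::printf("\n");
    }
    return 0;
}
```

The second triangle covers the same pixels but lies farther away, so all of its fragments fail the depth test and the first triangle's color remains in the frame buffer.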
Block Diagram Of Pipeline Process Flow
• Vertex shading: a shader is applied to each vertex (transformation and other
per-vertex operations)
• The vertex shader can fetch texture data
• Cull/clip: per-primitive operations and data preparation for rasterization
• Rasterization: primitive-to-pixel mapping
• Z culling: quick pixel elimination based on depth
• Fragment: a candidate pixel
• Varying number of pixel pipelines
• SIMD processing hides texture fetch latency
Introduction Of CUDA
•CUDA (Compute Unified Device Architecture) is a parallel computing platform and
programming model implemented on NVIDIA graphics processing units.
CUDA Programming Model:
A Highly Multithreaded Coprocessor
• The GPU is viewed as a compute device that:
–Is a coprocessor to the CPU or host
–Has its own DRAM (device memory)
–Runs many threads in parallel
• Data-parallel portions of an application are executed on the device as kernels, which
run in parallel on many threads
• Differences between GPU and CPU threads:
–GPU threads are extremely lightweight
–Very little creation overhead
–GPU needs 1000s of threads for full efficiency
–Multi-core CPU needs only a few
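A minimal sketch of this model: the data-parallel part of a program is written as a kernel, and the host launches it over roughly a million lightweight threads, one per array element. The function names (`squareElements`, `squareOnDevice`) and the block size of 256 are choices made for this illustration, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Kernel: the data-parallel part of the program. Each thread handles one element.
__global__ void squareElements(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) out[i] = in[i] * in[i];
}

// Host helper (hypothetical name): runs the kernel on data held in device memory.
std::vector<float> squareOnDevice(const std::vector<float>& in)
{
    int n = static_cast<int>(in.size());
    std::vector<float> out(n);
    if (n == 0) return out;

    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));    // device DRAM is separate from host memory
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dIn, in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    squareElements<<<blocks, threadsPerBlock>>>(dIn, dOut, n);

    cudaMemcpy(out.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return out;
}

int main()
{
    std::vector<float> data(1 << 20, 3.0f);   // about a million elements, one thread each
    std::vector<float> result = squareOnDevice(data);
    std::printf("result[0] = %.1f for %zu elements\n", result[0], data.size());
    return 0;
}
```

Launching this many threads on a CPU would be prohibitively expensive; on the GPU, thread creation is cheap and the large thread count is what keeps the hardware busy.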
Thread Batching: Grids and Blocks
•A kernel is executed as a grid of thread blocks
–All threads share data memory space
•A thread block is a batch of threads that can cooperate with each other by:
–Synchronizing their execution (for hazard-free shared memory accesses)
–Efficiently sharing data through a low-latency shared memory
•Two threads from two different blocks cannot cooperate in this way
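Cooperation within a block can be sketched as follows: every thread writes one value into per-block shared memory, all threads wait at a barrier, and each thread then reads a value written by a different thread of the same block. The kernel name, the tiny block size of 4, and the requirement that the input length be a multiple of the block size are assumptions of this sketch; error checking is omitted.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kBlockSize = 4;   // deliberately tiny so the effect is easy to see

// Each block stages its slice in shared memory, waits at a barrier, and then every
// thread reads an element that a different thread of the same block wrote.
__global__ void reverseWithinBlock(const int* in, int* out)
{
    __shared__ int tile[kBlockSize];             // one copy per block
    int local = threadIdx.x;
    int global = blockIdx.x * blockDim.x + local;

    tile[local] = in[global];
    __syncthreads();                             // all writes to tile are now visible
    out[global] = tile[blockDim.x - 1 - local];
}

// Host helper (hypothetical). Assumes in.size() is a multiple of kBlockSize.
std::vector<int> reverseInBlocks(const std::vector<int>& in)
{
    int n = static_cast<int>(in.size());
    std::vector<int> out(n);
    if (n == 0) return out;

    int *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(int));
    cudaMalloc(&dOut, n * sizeof(int));
    cudaMemcpy(dIn, in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    reverseWithinBlock<<<n / kBlockSize, kBlockSize>>>(dIn, dOut);

    cudaMemcpy(out.data(), dOut, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return out;
}

int main()
{
    std::vector<int> out = reverseInBlocks({0, 1, 2, 3, 4, 5, 6, 7});
    for (int v : out) std::printf("%d ", v);     // each group of 4 is reversed on its own
    std::printf("\n");
    return 0;
}
```

Without the `__syncthreads()` barrier, a thread could read a shared-memory slot before its neighbour has written it, which is the hazard the slide refers to. Each block only ever sees its own `tile`, so the two blocks reverse their halves independently.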
[Figure: the host launches Kernel 1 and Kernel 2; on the device each kernel runs as a grid (Grid 1, Grid 2). Grid 1 contains Block (0, 0) through Block (2, 1); Block (1, 1) is shown expanded into Thread (0, 0) through Thread (4, 2).]
Courtesy: NVIDIA
Block and Thread IDs
•Threads and blocks have IDs
–So each thread can decide what data to work on
–Block ID: 1D or 2D (3D is also available on devices of compute capability 2.0 and later)
–Thread ID: 1D, 2D, or 3D
•Simplifies memory addressing when processing multidimensional data
–Image processing
–Solving PDEs on volumes
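For the image-processing case, 2D block and thread IDs map directly onto pixel coordinates. The sketch below inverts an 8-bit grayscale image; the row-major layout, the 16 x 16 block shape, and the function names are assumptions made for illustration, and error checking is omitted.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One thread per pixel; the 2D block and thread IDs give (x, y) directly.
__global__ void invertImage(const unsigned char* in, unsigned char* out,
                            int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;    // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;    // row
    if (x < width && y < height) {
        int idx = y * width + x;                      // row-major layout (assumed)
        out[idx] = static_cast<unsigned char>(255 - in[idx]);
    }
}

// Host helper (hypothetical): image holds width * height bytes, one per pixel.
std::vector<unsigned char> invertOnDevice(const std::vector<unsigned char>& image,
                                          int width, int height)
{
    size_t bytes = static_cast<size_t>(width) * height;
    std::vector<unsigned char> result(bytes);
    if (bytes == 0) return result;

    unsigned char *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, image.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);                               // 2D thread IDs within a block
    dim3 grid((width + block.x - 1) / block.x,        // 2D block IDs within the grid
              (height + block.y - 1) / block.y);
    invertImage<<<grid, block>>>(dIn, dOut, width, height);

    cudaMemcpy(result.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}

int main()
{
    const int width = 5, height = 3;
    std::vector<unsigned char> image(width * height);
    for (int i = 0; i < width * height; ++i) image[i] = static_cast<unsigned char>(i * 10);

    std::vector<unsigned char> inverted = invertOnDevice(image, width, height);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) std::printf("%4d", inverted[y * width + x]);
        std::printf("\n");
    }
    return 0;
}
```

The bounds check matters because the grid is rounded up to whole blocks, so some threads fall outside the image.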
[Figure: Grid 1 on the device, containing Block (0, 0) through Block (2, 1), with Block (1, 1) expanded into Thread (0, 0) through Thread (4, 2).]
Courtesy: NVIDIA
CUDA Device Memory Space Overview
•Each thread can:
–R/W per-thread registers
–R/W per-thread local memory
–R/W per-block shared memory
–R/W per-grid global memory
–Read-only per-grid constant memory
–Read-only per-grid texture memory
[Figure: device memory hierarchy. The grid holds global, constant, and texture memory; each block (Block (0, 0), Block (1, 0)) has its own shared memory; each thread (Thread (0, 0), Thread (1, 0)) has its own registers and local memory. The host connects to global, constant, and texture memory.]
•The host can R/W global, constant, and texture memories
Global, Constant, and Texture Memories
•Global memory
–Main means of communicating R/W data between host and device
–Contents visible to all threads
•Texture and Constant Memories
–Constants initialized by host
–Contents visible to all threads
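The roles of these memories can be sketched with a small polynomial evaluation: the host initializes the coefficients in constant memory with `cudaMemcpyToSymbol`, passes the inputs and outputs through global memory with `cudaMemcpy`, and each thread keeps its working value in a register. The names `kCoeffs`, `evalPolynomial`, and `evalOnDevice` are assumptions of this sketch, and error checking is omitted.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Constant memory: written by the host, read-only for kernels. Coefficients c0, c1, c2.
__constant__ float kCoeffs[3];

// Evaluates c0 + c1*x + c2*x^2 for every element of x (x and y live in global memory).
__global__ void evalPolynomial(const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];                                   // v is held in a register
        y[i] = kCoeffs[0] + v * (kCoeffs[1] + v * kCoeffs[2]);
    }
}

// Host helper (hypothetical): coeffs points to three floats.
std::vector<float> evalOnDevice(const std::vector<float>& x, const float* coeffs)
{
    int n = static_cast<int>(x.size());
    std::vector<float> y(n);
    if (n == 0) return y;

    // Host initializes constant memory; every thread of the grid sees the same values.
    cudaMemcpyToSymbol(kCoeffs, coeffs, 3 * sizeof(float));

    // Global memory carries the bulk data between host and device.
    float *dX = nullptr, *dY = nullptr;
    cudaMalloc(&dX, n * sizeof(float));
    cudaMalloc(&dY, n * sizeof(float));
    cudaMemcpy(dX, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 128;
    evalPolynomial<<<(n + threads - 1) / threads, threads>>>(dX, dY, n);

    cudaMemcpy(y.data(), dY, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dX);
    cudaFree(dY);
    return y;
}

int main()
{
    const float coeffs[3] = {1.0f, 2.0f, 3.0f};           // 1 + 2x + 3x^2
    std::vector<float> y = evalOnDevice({0.0f, 1.0f, 2.0f}, coeffs);
    for (float v : y) std::printf("%.1f ", v);
    std::printf("\n");
    return 0;
}
```

Constant memory suits values like these coefficients, which every thread reads but none modifies; texture memory is omitted from the sketch.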
[Figure: device memory hierarchy: per-grid global, constant, and texture memory; per-block shared memory; per-thread registers and local memory; the host accesses the per-grid memories.]
Courtesy: NVIDIA
Simple Processing Flow
1. Copy input data from CPU memory to GPU memory
2. The CPU instructs the GPU to start processing
3. The GPU loads the program and executes it, caching data on chip for performance
4. Copy results from GPU memory to CPU memory
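The four steps above can be sketched as a vector addition using the CUDA runtime API. The kernel and helper names and the `CUDA_CHECK` macro are assumptions of this sketch rather than part of the original slides; the runtime calls themselves (`cudaMalloc`, `cudaMemcpy`, `cudaGetLastError`, `cudaFree`) are standard.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails (hypothetical helper macro).
#define CUDA_CHECK(call)                                                        \
    do {                                                                        \
        cudaError_t err_ = (call);                                              \
        if (err_ != cudaSuccess) {                                              \
            std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_)); \
            std::exit(EXIT_FAILURE);                                            \
        }                                                                       \
    } while (0)

__global__ void addVectors(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Host helper (hypothetical). Assumes a and b have the same length.
std::vector<float> addOnDevice(const std::vector<float>& a, const std::vector<float>& b)
{
    int n = static_cast<int>(a.size());
    std::vector<float> c(n);
    if (n == 0) return c;
    size_t bytes = n * sizeof(float);

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    CUDA_CHECK(cudaMalloc(&dA, bytes));
    CUDA_CHECK(cudaMalloc(&dB, bytes));
    CUDA_CHECK(cudaMalloc(&dC, bytes));

    // Step 1: copy input data from CPU memory to GPU memory.
    CUDA_CHECK(cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice));

    // Steps 2 and 3: the CPU launches the kernel; the GPU loads and executes it.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    addVectors<<<blocks, threads>>>(dA, dB, dC, n);
    CUDA_CHECK(cudaGetLastError());               // reports launch failures

    // Step 4: copy results back; this call also waits for the kernel to finish.
    CUDA_CHECK(cudaMemcpy(c.data(), dC, bytes, cudaMemcpyDeviceToHost));

    CUDA_CHECK(cudaFree(dA));
    CUDA_CHECK(cudaFree(dB));
    CUDA_CHECK(cudaFree(dC));
    return c;
}

int main()
{
    std::vector<float> c = addOnDevice({1.0f, 2.0f, 3.0f}, {10.0f, 20.0f, 30.0f});
    for (float v : c) std::printf("%.1f ", v);
    std::printf("\n");
    return 0;
}
```

Because steps 1 and 4 cross the bus between host and device, they often cost more than the kernel itself for small workloads, which is why keeping data on the device between kernels is usually worthwhile.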
CUDA C/C++
• CUDA Language:
C with Minimal Extensions
• Philosophy: provide minimal set of extensions necessary to expose power
• Declaration specifiers to indicate where things live
__global__ void KernelFunc(...); // kernel function, runs on device
__device__ int GlobalVar; // variable in device memory
__shared__ int SharedVar; // variable in per-block shared memory
• Extend function invocation syntax for parallel kernel launch
KernelFunc<<<500, 128>>>(...); // launch 500 blocks w/ 128 threads each
• Special variables for thread identification in kernels
uint3 threadIdx; uint3 blockIdx; dim3 blockDim; dim3 gridDim;
• Intrinsics that expose specific operations in kernel code
__syncthreads(); // barrier synchronization within kernel
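To show these extensions working together, here is a minimal sketch of a per-block sum: a `__global__` kernel, a `__shared__` buffer, the built-in index variables, a `<<<...>>>` launch, and `__syncthreads()` barriers. The kernel name, the block size of 128, and the choice to add the per-block partial sums on the host are assumptions of this sketch, not a tuned reduction; error checking is omitted.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 128;   // threads per block; must be a power of two here

// Each block sums its slice of the input in shared memory and writes one partial sum.
__global__ void blockSum(const int* in, int* partial, int n)
{
    __shared__ int cache[kThreads];                   // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    cache[threadIdx.x] = (i < n) ? in[i] : 0;         // pad the last block with zeros
    __syncthreads();

    // Tree reduction: halve the number of active threads at every step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();                              // reached by every thread in the block
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = cache[0];
}

// Host helper (hypothetical): sums the per-block results on the CPU.
long long sumOnDevice(const std::vector<int>& data)
{
    int n = static_cast<int>(data.size());
    if (n == 0) return 0;
    int blocks = (n + kThreads - 1) / kThreads;

    int *dIn = nullptr, *dPartial = nullptr;
    cudaMalloc(&dIn, n * sizeof(int));
    cudaMalloc(&dPartial, blocks * sizeof(int));
    cudaMemcpy(dIn, data.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    blockSum<<<blocks, kThreads>>>(dIn, dPartial, n);

    std::vector<int> partial(blocks);
    cudaMemcpy(partial.data(), dPartial, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dPartial);

    long long total = 0;
    for (int p : partial) total += p;
    return total;
}

int main()
{
    std::vector<int> ones(1000, 1);
    std::printf("sum = %lld\n", sumOnDevice(ones));
    return 0;
}
```

The barrier sits outside the `if` so that all threads of the block reach it; placing `__syncthreads()` inside a branch that only some threads take leads to undefined behaviour.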
Applications
•Military (lots)
•Mine planning
•Molecular dynamics
•MRI reconstruction
•Network processing
•Neural network
•Protein folding
•Quantum chemistry
•Ray tracing
•Radar
•Reservoir simulation
•Robotic vision/AI
•Robotic surgery
•Satellite data analysis
•Seismic imaging
•Surgery simulation
•3D image analysis
•Adaptive radiation therapy
•Astronomy
•Automobile vision
•Bioinformatics
•Biological simulation
•Broadcast
•Computational Fluid Dynamics
•Computer Vision
•Cryptography
•CT reconstruction
•Data Mining
•Electromagnetic simulation
•Equity training
•Financial - lots of areas
•Mathematics research
Simulation Result
•If the CUDA software is installed and configured correctly, the output of deviceQuery should look similar to this:
[Figure: screenshot of deviceQuery output]
•Valid results from the bandwidthTest CUDA sample:
[Figure: screenshot of bandwidthTest output]
• Create an array of size BLOCKS, allocate space for the array on the device, and
call
generateArray<<<BLOCKS,1>>>( deviceArray );
• The kernel now runs as BLOCKS parallel blocks of one thread each, creating the entire array in one
call.
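A minimal sketch of what such a program might look like follows. The slides do not show the body of `generateArray`, so the value each block writes (its own block index), the value of `BLOCKS`, and the `fillArray` helper are assumptions made for illustration; error checking is omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCKS 16   // assumed size for this sketch

// Assumption: each block stores its own index; the source does not say what is written.
__global__ void generateArray(int* deviceArray)
{
    // One thread per block, so blockIdx.x alone selects the element.
    deviceArray[blockIdx.x] = blockIdx.x;
}

// Host helper (hypothetical): fills hostArray, which must hold BLOCKS ints.
void fillArray(int* hostArray)
{
    int* deviceArray = nullptr;
    cudaMalloc(&deviceArray, BLOCKS * sizeof(int));      // space for the array on the device

    generateArray<<<BLOCKS, 1>>>(deviceArray);            // BLOCKS blocks of 1 thread each

    cudaMemcpy(hostArray, deviceArray, BLOCKS * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(deviceArray);
}

int main()
{
    int hostArray[BLOCKS];
    fillArray(hostArray);
    for (int i = 0; i < BLOCKS; ++i) std::printf("%d ", hostArray[i]);
    std::printf("\n");
    return 0;
}
```

One thread per block keeps the example easy to follow, but it leaves most of each warp idle; practical kernels usually launch blocks of at least a few dozen threads and index with both `blockIdx` and `threadIdx`.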
The Future Scope Of CUDA Technology
• Most current research focuses on general-purpose GPU computing. Because GPUs offer highly
efficient and flexible programmable parallelism, a growing number of
researchers and business organizations have started to use GPUs for non-graphics
calculations, creating a new field of study:
GPGPU (General-Purpose computation on GPU). Its objective is to use the GPU for
more extensive scientific computing. GPGPU has been successfully used in
algebra, fluid simulation, database applications, spectrum analysis, and other
non-graphical applications.
• Region-based Software Virtual Memory (RSVM), a software virtual memory running
on both CPU and GPU in a distributed and cooperative way.
• Size reduction
• Cooling techniques
Conclusion
• CUDA is a powerful parallel programming model
–Heterogeneous: mixed serial-parallel programming
–Scalable: hierarchical thread execution model
–Accessible: minimal but expressive changes to C
• CUDA on GPUs can achieve great results on data-parallel computations with a
few simple performance optimization strategies:
–Structure your application and select execution configurations to maximize
exploitation of the GPU’s parallel capabilities.
–Minimize CPU ↔ GPU data transfers.
–Coalesce global memory accesses.
–Take advantage of shared memory.
–Minimize divergent warps.
–Minimize use of low-throughput instructions.
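Two of these strategies, coalescing global memory accesses and using shared memory, are commonly illustrated with a tiled matrix transpose. In the sketch below, both the reads from `in` and the writes to `out` walk along rows, so neighbouring threads touch neighbouring addresses; the reordering happens inside a shared-memory tile. The tile size of 32, the padding column, and the function names are choices made for this illustration, and error checking is omitted.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kTile = 32;

// Transposes a rows x cols row-major matrix into a cols x rows row-major matrix.
__global__ void transposeTiled(const float* in, float* out, int rows, int cols)
{
    // The extra column is a common trick to avoid shared-memory bank conflicts.
    __shared__ float tile[kTile][kTile + 1];

    int x = blockIdx.x * kTile + threadIdx.x;     // input column
    int y = blockIdx.y * kTile + threadIdx.y;     // input row
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];    // row-wise (coalesced) read
    __syncthreads();

    int tx = blockIdx.y * kTile + threadIdx.x;    // output column (an input row index)
    int ty = blockIdx.x * kTile + threadIdx.y;    // output row (an input column index)
    if (tx < rows && ty < cols)
        out[ty * rows + tx] = tile[threadIdx.x][threadIdx.y]; // row-wise (coalesced) write
}

// Host helper (hypothetical): in holds rows * cols floats in row-major order.
std::vector<float> transposeOnDevice(const std::vector<float>& in, int rows, int cols)
{
    size_t n = static_cast<size_t>(rows) * cols;
    std::vector<float> out(n);
    if (n == 0) return out;

    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dIn, in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(kTile, kTile);
    dim3 grid((cols + kTile - 1) / kTile, (rows + kTile - 1) / kTile);
    transposeTiled<<<grid, block>>>(dIn, dOut, rows, cols);

    cudaMemcpy(out.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return out;
}

int main()
{
    const int rows = 3, cols = 5;
    std::vector<float> in(rows * cols);
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c) in[r * cols + c] = r * 10.0f + c;

    std::vector<float> out = transposeOnDevice(in, rows, cols);
    for (int r = 0; r < cols; ++r) {
        for (int c = 0; c < rows; ++c) std::printf("%5.0f", out[r * rows + c]);
        std::printf("\n");
    }
    return 0;
}
```

A naive transpose that writes `out[x * rows + y]` directly would scatter each warp's writes across memory; staging through the tile trades a little shared memory for row-wise traffic in both directions.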
References
1. Xiao Yang, Shamik K. Valia, Michael J. Schulte, Ruby B. Lee, "Exploration and Evaluation of PLX Floating-point Instructions and Implementations for 3D Graphics", IEEE, 2004.
2. Lei Wang, Yong-zhong Huang, Xin Chen, Chun-yan Zhang, "Task Scheduling of Parallel Processing in CPU-GPU Collaborative Environment", CSIT, 2008.
3. Feng Ji, Heshan Lin, Xiaosong Ma, "RSVM: a Region-based Software Virtual Memory for GPU", IEEE, 2013.
4. Nathan Whitehead, Alex Fit-Florea, "CUDA_Architecture_Overview", NVIDIA Corporation.
5. Cyril Zeller, "CUDA C/C++ Basics", NVIDIA Corporation.
6. Mark Harris, "Optimizing Parallel Reduction in CUDA", NVIDIA Developer Technology.
Thank-You
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 

Recently uploaded (20)

Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 

GPU with CUDA Architecture

  • 1. “GPU With CUDA Architecture” Presented By- Dhaval Kaneria (13014061010) Guided By- Mr. Rajesh k Navandar
  • 2. Table Of Contents • Introduction of GPU • Performance Factors Of GPU • GPU Pipeline • Block Diagram Of Pipeline Process Flow • Introduction Of CUDA • Thread Batching • Simple Processing Flow • CUDA C/C++ • Applications • The Future Scope Of CUDA Technology • Conclusion • References
  • 3. Introduction of GPU • A Graphics Processing Unit (GPU) is a microprocessor designed specifically for processing 3D graphics. • The processor is built with integrated transform, lighting, triangle setup/clipping, and rendering engines, capable of handling millions of math-intensive operations per second. • GPUs form the heart of modern graphics cards, relieving the CPU (central processing unit) of much of the graphics processing load. GPUs allow products such as desktop PCs, portable computers, and game consoles to process real-time 3D graphics that only a few years ago were available only on high-end workstations. • Used primarily for 3D applications, a graphics processing unit is a single-chip processor that creates lighting effects and transforms objects every time a 3D scene is redrawn. These are mathematically intensive tasks that would otherwise put quite a strain on the CPU. Lifting this burden from the CPU frees up cycles that can be used for other jobs.
  • 4. Performance Factors Of GPU • Fill Rate: The number of pixels or texels (textured pixels) that the GPU can render to memory per second. It is a common measure of the raw rendering power of a GPU. Modern GPUs have fill rates as high as 3.2 billion pixels per second. The fill rate can be increased by raising the clock frequency of the GPU. • Memory Bandwidth: The data transfer speed between the graphics chip and its local frame buffer. More bandwidth usually gives better performance when the image to be rendered is of high quality and very high resolution. • Memory Management: GPU performance also depends on how efficiently memory is managed, because memory bandwidth may become the bottleneck if it is not managed properly. • Hidden Surface Removal: Reducing overdraw when rendering a scene by not rendering surfaces that are not visible. This helps considerably in increasing GPU performance.
  • 5. GPU Pipeline • The GPU receives geometry information from the CPU as input and produces a picture as output. • The host interface is the communication bridge between the CPU and the GPU. • It receives commands from the CPU and also pulls geometry information from system memory. • It outputs a stream of vertices in object space with all their associated information (normals, texture coordinates, per-vertex color, etc.). • The vertex processing stage receives vertices from the host interface in object space and outputs them in screen space. • This may be a simple linear transformation or a complex operation involving morphing effects. • Pipeline stages (from the slide diagram): host interface → vertex processing → triangle setup → pixel processing → memory interface
  • 6. Cont.. • A fragment is generated if and only if its center is inside the triangle. • Every generated fragment has its attributes computed as the perspective-correct interpolation of the attributes of the three vertices that make up the triangle. • Each fragment provided by triangle setup is fed into fragment processing as a set of attributes (position, normal, texture coordinates, etc.), which are used to compute the final color for this pixel. • Before the final write occurs, some fragments are rejected by the z-buffer, stencil, and alpha tests.
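The coverage rule in the first bullet of slide 6 can be sketched with edge functions. This is a minimal illustration under stated assumptions, not the algorithm the hardware uses: the names `edge_function`, `center_covered`, and `coverage_mask` are hypothetical, pixel (x, y) is assumed to have its center at (x + 0.5, y + 0.5), and a center lying exactly on an edge is counted as inside (real rasterizers apply a tie-breaking fill rule instead).

```cuda
#include <cuda_runtime.h>

// Twice the signed area of triangle (a, b, p); the sign tells on which side
// of the directed edge a -> b the point p lies.
__host__ __device__ inline float edge_function(float ax, float ay,
                                               float bx, float by,
                                               float px, float py) {
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax);
}

// tri holds x0, y0, x1, y1, x2, y2 in screen space.
// Returns true when the center of pixel (x, y) is inside the triangle,
// for either winding order. Simplification: edge hits count as inside.
__host__ __device__ inline bool center_covered(const float* tri, int x, int y) {
    const float px = x + 0.5f;
    const float py = y + 0.5f;
    const float w0 = edge_function(tri[2], tri[3], tri[4], tri[5], px, py);
    const float w1 = edge_function(tri[4], tri[5], tri[0], tri[1], px, py);
    const float w2 = edge_function(tri[0], tri[1], tri[2], tri[3], px, py);
    const bool all_nonneg = w0 >= 0.0f && w1 >= 0.0f && w2 >= 0.0f;
    const bool all_nonpos = w0 <= 0.0f && w1 <= 0.0f && w2 <= 0.0f;
    return all_nonneg || all_nonpos;
}

// One thread per pixel: writes 1 where a fragment would be generated.
// tri and mask must both point to device memory.
__global__ void coverage_mask(const float* tri, unsigned char* mask,
                              int width, int height) {
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        mask[y * width + x] = center_covered(tri, x, y) ? 1 : 0;
    }
}
```

Because `center_covered` is marked `__host__ __device__`, the same test can be checked on the CPU before it is used inside the kernel.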
  • 7. Block Diagram Of Pipeline Process Flow (diagram not reproduced)
  • 8. Cont.. • Allow a shader to be applied to each vertex: transformation and other per-vertex operations • Allow the vertex shader to fetch texture data • Cull/clip: per-primitive operations and data preparation for rasterization • Rasterization: primitive-to-pixel mapping • Z culling: quick pixel elimination based on depth • Fragment: a candidate pixel; the number of pixel pipelines varies • SIMD processing hides texture fetch latency
  • 9. Introduction Of CUDA • CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model implemented on NVIDIA graphics processing units.
  • 10. CUDA Programming Model: A Highly Multithreaded Coprocessor • The GPU is viewed as a compute device that: – Is a coprocessor to the CPU, or host – Has its own DRAM (device memory) – Runs many threads in parallel • Data-parallel portions of an application are executed on the device as kernels, which run in parallel on many threads • Differences between GPU and CPU threads: – GPU threads are extremely lightweight – Very little creation overhead – A GPU needs 1000s of threads for full efficiency – A multi-core CPU needs only a few
  • 11. Thread Batching: Grids and Blocks • A kernel is executed as a grid of thread blocks – All threads share the data memory space • A thread block is a batch of threads that can cooperate with each other by: – Synchronizing their execution, for hazard-free shared memory accesses – Efficiently sharing data through low-latency shared memory • Two threads from two different blocks cannot cooperate • Diagram (not reproduced): the host launches Kernel 1 and Kernel 2; on the device, Kernel 1 runs as Grid 1 of 3 × 2 blocks, Block (0, 0) to Block (2, 1), and Kernel 2 runs as Grid 2; Block (1, 1) is expanded into 5 × 3 threads, Thread (0, 0) to Thread (4, 2). Courtesy: NVIDIA
  • 12. Block and Thread IDs • Threads and blocks have IDs – So each thread can decide what data to work on – Block ID: 1D or 2D (3D as well on devices of compute capability 2.0 and later) – Thread ID: 1D, 2D, or 3D • Simplifies memory addressing when processing multidimensional data – Image processing – Solving PDEs on volumes • Diagram (not reproduced): Grid 1 of 3 × 2 blocks on the device, with Block (1, 1) expanded into 5 × 3 threads. Courtesy: NVIDIA
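The way block and thread IDs simplify addressing of multidimensional data can be sketched with a small image-processing kernel. This is an illustrative sketch only: `invert_image` and `invert_on_device` are hypothetical names, the image is assumed to be 8-bit grayscale stored row by row, and the 16 × 16 block size is an arbitrary choice.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread derives its own pixel coordinates from its block and thread IDs.
__global__ void invert_image(unsigned char* pixels, int width, int height) {
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {          // the grid may overhang the image
        const int i = y * width + x;
        pixels[i] = 255 - pixels[i];
    }
}

// Hypothetical host wrapper; returns false on bad sizes or a CUDA error.
bool invert_on_device(std::vector<unsigned char>& image, int width, int height) {
    if (width <= 0 || height <= 0 ||
        image.size() != static_cast<size_t>(width) * height) {
        return false;
    }
    const size_t bytes = image.size();
    unsigned char* d_pixels = nullptr;
    if (cudaMalloc(&d_pixels, bytes) != cudaSuccess) return false;

    bool ok = cudaMemcpy(d_pixels, image.data(), bytes,
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        const dim3 block(16, 16);           // 2D thread IDs within a block
        const dim3 grid((width + block.x - 1) / block.x,
                        (height + block.y - 1) / block.y);  // 2D block IDs
        invert_image<<<grid, block>>>(d_pixels, width, height);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(image.data(), d_pixels, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_pixels);
    return ok;
}
```

The bounds check in the kernel is what allows the grid to be rounded up to whole blocks when the image size is not a multiple of the block size.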
  • 13. CUDA Device Memory Space Overview • Each thread can: – R/W per-thread registers – R/W per-thread local memory – R/W per-block shared memory – R/W per-grid global memory – Read only per-grid constant memory – Read only per-grid texture memory • The host can R/W global, constant, and texture memories • Diagram (not reproduced): the device grid holds global, constant, and texture memory; each block has its own shared memory; each thread has its own registers and local memory.
  • 14. Global, Constant, and Texture Memories • Global memory – Main means of communicating R/W data between host and device – Contents visible to all threads • Texture and constant memories – Constants initialized by host – Contents visible to all threads • Diagram (not reproduced): same device memory layout as on slide 13. Courtesy: NVIDIA
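The split between host-initialized constant memory and read/write global memory can be sketched as follows. This is a minimal example under stated assumptions: `d_coeffs`, `eval_quadratic`, and `eval_on_device` are hypothetical names, and the quadratic polynomial is just a stand-in workload.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Per-grid constant memory: written by the host, read-only inside kernels.
__constant__ float d_coeffs[3];

// Global memory buffers x and y are visible to every thread in the grid.
__global__ void eval_quadratic(const float* x, float* y, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        const float v = x[i];
        y[i] = d_coeffs[0] + d_coeffs[1] * v + d_coeffs[2] * v * v;
    }
}

// Hypothetical host wrapper; coeffs must point to 3 floats.
bool eval_on_device(const float* coeffs, const std::vector<float>& x,
                    std::vector<float>& y) {
    const int n = static_cast<int>(x.size());
    y.resize(x.size());
    if (n == 0) return true;
    const size_t bytes = x.size() * sizeof(float);

    float *d_x = nullptr, *d_y = nullptr;
    bool ok = cudaMalloc(&d_x, bytes) == cudaSuccess &&
              cudaMalloc(&d_y, bytes) == cudaSuccess &&
              // The host initializes constant memory before the launch.
              cudaMemcpyToSymbol(d_coeffs, coeffs, 3 * sizeof(float)) == cudaSuccess &&
              cudaMemcpy(d_x, x.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        const int threads = 128;
        eval_quadratic<<<(n + threads - 1) / threads, threads>>>(d_x, d_y, n);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(y.data(), d_y, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_x);   // cudaFree(nullptr) is a no-op
    cudaFree(d_y);
    return ok;
}
```

Since the coefficients live in constant memory, a second kernel launched by the same application could read them again without another host copy.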
  • 15. Simple Processing Flow 1. Copy input data from CPU memory to GPU memory 2. The CPU instructs the GPU to start processing 3. Load the GPU program and execute it, caching data on chip for performance 4. Copy results from GPU memory to CPU memory
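The four steps of this flow map directly onto a handful of CUDA runtime calls. The following is a minimal sketch using vector addition as the workload; `add_vectors` and `add_on_device` are hypothetical names, and the block size of 256 threads is an arbitrary choice.

```cuda
#include <cuda_runtime.h>
#include <vector>

__global__ void add_vectors(const float* a, const float* b, float* c, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Hypothetical host wrapper; returns false if sizes differ or a CUDA call fails.
bool add_on_device(const std::vector<float>& a, const std::vector<float>& b,
                   std::vector<float>& c) {
    if (a.size() != b.size()) return false;
    const int n = static_cast<int>(a.size());
    c.resize(a.size());
    if (n == 0) return true;
    const size_t bytes = a.size() * sizeof(float);

    float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    bool ok = cudaMalloc(&d_a, bytes) == cudaSuccess &&
              cudaMalloc(&d_b, bytes) == cudaSuccess &&
              cudaMalloc(&d_c, bytes) == cudaSuccess;

    // Step 1: copy input data from CPU memory to GPU memory.
    ok = ok &&
         cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
         cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        // Steps 2 and 3: the CPU launches the kernel and the GPU executes it.
        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;
        add_vectors<<<blocks, threads>>>(d_a, d_b, d_c, n);
        ok = cudaGetLastError() == cudaSuccess;
    }

    // Step 4: copy results back; this cudaMemcpy waits for the kernel to finish.
    ok = ok &&
         cudaMemcpy(c.data(), d_c, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaFree(d_a);   // cudaFree(nullptr) is a no-op
    cudaFree(d_b);
    cudaFree(d_c);
    return ok;
}
```

For a workload this small the two copies dominate the run time, which is why keeping data on the device between kernels matters in practice.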
  • 16. CUDA C/C++ • CUDA language: C with minimal extensions • Philosophy: provide the minimal set of extensions necessary to expose the power of the GPU
      • Declaration specifiers to indicate where things live:
          __global__ void KernelFunc(...);  // kernel function, runs on device
          __device__ int GlobalVar;         // variable in device memory
          __shared__ int SharedVar;         // variable in per-block shared memory
      • Extended function invocation syntax for parallel kernel launch:
          KernelFunc<<<500, 128>>>(...);    // launch 500 blocks w/ 128 threads each
      • Special variables for thread identification in kernels:
          dim3 threadIdx; dim3 blockIdx; dim3 blockDim; dim3 gridDim;
      • Intrinsics that expose specific operations in kernel code:
          __syncthreads();                  // barrier synchronization within a thread block
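The extensions listed on this slide can be seen working together in one small kernel. This is a sketch under stated assumptions: `reverse_each_block` and `reverse_blocks_on_device` are hypothetical names, the block size `kThreads` of 128 mirrors the launch example on the slide, and the input length is assumed to be a non-zero multiple of `kThreads` so that no bounds checks are needed.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 128;   // threads per block, as in <<<500, 128>>>

// Reverses each consecutive chunk of kThreads elements in place.
__global__ void reverse_each_block(int* data) {
    __shared__ int tile[kThreads];              // per-block shared memory
    const int base = blockIdx.x * blockDim.x;   // first element of this block
    tile[threadIdx.x] = data[base + threadIdx.x];
    __syncthreads();   // all of tile is written before any thread reads it
    data[base + threadIdx.x] = tile[blockDim.x - 1 - threadIdx.x];
}

// Hypothetical host wrapper; requires a non-zero multiple of kThreads elements.
bool reverse_blocks_on_device(std::vector<int>& data) {
    if (data.empty() || data.size() % kThreads != 0) return false;
    const size_t bytes = data.size() * sizeof(int);
    const int blocks = static_cast<int>(data.size() / kThreads);

    int* d_data = nullptr;
    if (cudaMalloc(&d_data, bytes) != cudaSuccess) return false;
    bool ok = cudaMemcpy(d_data, data.data(), bytes,
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        reverse_each_block<<<blocks, kThreads>>>(d_data);   // kernel launch syntax
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(data.data(), d_data, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_data);
    return ok;
}
```

Without the `__syncthreads()` barrier, a thread could read a `tile` entry that its neighbor has not written yet, which is the hazard the slide on thread batching refers to.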
  • 17. Applications • 3D image analysis • Adaptive radiation therapy • Astronomy • Automobile vision • Bioinformatics • Biological simulation • Broadcast • Computational fluid dynamics • Computer vision • Cryptography • CT reconstruction • Data mining • Electromagnetic simulation • Equity trading • Financial (lots of areas) • Mathematics research • Military (lots) • Mine planning • Molecular dynamics • MRI reconstruction • Network processing • Neural networks • Protein folding • Quantum chemistry • Ray tracing • Radar • Reservoir simulation • Robotic vision/AI • Robotic surgery • Satellite data analysis • Seismic imaging • Surgery simulation
  • 18. Simulation Result • If the CUDA software is installed and configured correctly, the output of the deviceQuery sample should look similar to the screenshot on this slide (not reproduced here).
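Since the deviceQuery screenshot is not available here, the kind of information it reports can be sketched with the CUDA runtime API. This is not the source of the deviceQuery sample, only a minimal stand-in: `print_device_summary` is a hypothetical name, and it prints just a few of the fields that `cudaGetDeviceProperties` fills in.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Prints a short summary of every CUDA device.
// Returns the number of devices, or -1 if the runtime reports an error
// (for example, when no driver is installed).
int print_device_summary() {
    int count = 0;
    const cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return -1;
    }
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, d) != cudaSuccess) continue;
        std::printf("Device %d: %s\n", d, prop.name);
        std::printf("  Compute capability:    %d.%d\n", prop.major, prop.minor);
        std::printf("  Global memory:         %zu MiB\n", prop.totalGlobalMem >> 20);
        std::printf("  Multiprocessors:       %d\n", prop.multiProcessorCount);
        std::printf("  Max threads per block: %d\n", prop.maxThreadsPerBlock);
        std::printf("  Shared memory / block: %zu bytes\n", prop.sharedMemPerBlock);
    }
    return count;
}
```

A return value of 0 or -1 is a quick sign that the installation or driver needs attention before any kernel can run.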
  • 19. Valid results from the bandwidthTest CUDA sample (screenshot not reproduced)
  • 20. • Create an array of size BLOCKS, allocate space for the array on the device, and call generateArray<<<BLOCKS, 1>>>(deviceArray); • The kernel will now run as BLOCKS parallel thread blocks of one thread each, creating the entire array in one call.
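The launch described on slide 20 can be sketched as a complete host routine. The slide does not show the body of `generateArray`, so the one below (each block stores the square of its block index) is an assumption made purely for illustration, as are the value chosen for `BLOCKS` and the wrapper name `generate_on_device`.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int BLOCKS = 64;   // assumed value; the slide does not give one

// Assumed kernel body: with one thread per block, blockIdx.x alone
// identifies the element that this block is responsible for.
__global__ void generateArray(int* deviceArray) {
    deviceArray[blockIdx.x] = blockIdx.x * blockIdx.x;
}

// Hypothetical host wrapper; fills hostArray with BLOCKS generated values.
bool generate_on_device(std::vector<int>& hostArray) {
    hostArray.assign(BLOCKS, 0);
    const size_t bytes = BLOCKS * sizeof(int);

    int* deviceArray = nullptr;
    if (cudaMalloc(&deviceArray, bytes) != cudaSuccess) return false;

    generateArray<<<BLOCKS, 1>>>(deviceArray);   // BLOCKS blocks, 1 thread each
    const bool ok = cudaGetLastError() == cudaSuccess &&
                    cudaMemcpy(hostArray.data(), deviceArray, bytes,
                               cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(deviceArray);
    return ok;
}
```

One thread per block keeps the example simple but leaves most of each multiprocessor idle; launching fewer blocks with many threads each is usually the better configuration.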
  • 21. The Future Scope Of CUDA Technology • Most current research focuses on general-purpose GPU computing. Because GPUs are highly efficient, flexible, and programmable parallel processors, a growing number of researchers and businesses have started using them for non-graphics calculations, creating a new field of study: GPGPU (General-Purpose computation on GPU), whose objective is to use the GPU for a broader range of scientific computing. GPGPU has been used successfully in algebra, fluid simulation, database applications, spectrum analysis, and other non-graphical applications. • Region-based Software Virtual Memory (RSVM): a software virtual memory running on both the CPU and the GPU in a distributed and cooperative way. • Size reduction • Cooling techniques
  • 22. Conclusion • CUDA is a powerful parallel programming model: – Heterogeneous: mixed serial-parallel programming – Scalable: hierarchical thread execution model – Accessible: minimal but expressive changes to C • CUDA on GPUs can achieve great results on data-parallel computations with a few simple performance optimization strategies: – Structure your application and select execution configurations to maximize exploitation of the GPU's parallel capabilities. – Minimize CPU ↔ GPU data transfers. – Coalesce global memory accesses. – Take advantage of shared memory. – Minimize divergent warps. – Minimize use of low-throughput instructions.
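Several of these strategies can be seen in one block-level sum reduction. This is a simplified sketch, not a tuned implementation: `block_sum` and `sum_on_device` are hypothetical names, `kThreads` is assumed to be a power of two, and each per-block partial sum is assumed to fit in an `int`.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 256;   // must be a power of two for the loop below

__global__ void block_sum(const int* in, int* partial, int n) {
    __shared__ int tile[kThreads];                 // shared memory scratch space
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Consecutive threads read consecutive addresses: a coalesced access.
    tile[threadIdx.x] = (i < n) ? in[i] : 0;
    __syncthreads();

    // Tree reduction. The active threads are always 0..stride-1, so whole
    // warps drop out together until fewer than 32 threads remain.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = tile[0];
}

// Hypothetical host wrapper: one copy in, one small copy out.
bool sum_on_device(const std::vector<int>& values, long long& total) {
    total = 0;
    const int n = static_cast<int>(values.size());
    if (n == 0) return true;
    const int blocks = (n + kThreads - 1) / kThreads;

    int *d_in = nullptr, *d_partial = nullptr;
    bool ok = cudaMalloc(&d_in, n * sizeof(int)) == cudaSuccess &&
              cudaMalloc(&d_partial, blocks * sizeof(int)) == cudaSuccess &&
              cudaMemcpy(d_in, values.data(), n * sizeof(int),
                         cudaMemcpyHostToDevice) == cudaSuccess;
    std::vector<int> partial(blocks);
    if (ok) {
        block_sum<<<blocks, kThreads>>>(d_in, d_partial, n);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(partial.data(), d_partial, blocks * sizeof(int),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in);        // cudaFree(nullptr) is a no-op
    cudaFree(d_partial);
    if (ok) {
        for (int p : partial) total += p;   // finish the small remainder on the CPU
    }
    return ok;
}
```

Only one partial sum per block travels back to the host, so the device-to-host transfer shrinks by a factor of `kThreads` compared with copying the whole array.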
  • 23. References 1. Xiao Yang, Shamik K. Valia, Michael J. Schulte, Ruby B. Lee, "Exploration and Evaluation of PLX Floating-point Instructions and Implementations for 3D Graphics", IEEE, 2004. 2. Lei Wang, Yong-zhong Huang, Xin Chen, Chun-yan Zhang, "Task Scheduling of Parallel Processing in CPU-GPU Collaborative Environment", CSIT, 2008. 3. Feng Ji, Heshan Lin, Xiaosong Ma, "RSVM: a Region-based Software Virtual Memory for GPU", IEEE, 2013. 4. Nathan Whitehead, Alex Fit-Florea, "CUDA_Architecture_Overview", NVIDIA Corporation. 5. Cyril Zeller, "CUDA C/C++ Basics", NVIDIA Corporation. 6. Mark Harris, "Optimizing Parallel Reduction in CUDA", NVIDIA Developer Technology.

Editor's Notes

  2. Global, constant, and texture memory spaces are persistent across kernels called by the same application.