NVIDIA GPU Architecture
By:
Aneeza Imtiaz (047)
Fatima Qayyum (011)
Mahnoor Shaukat (020)
Syeda Ammara Batool (040)
Hafsa Zulifiqar (053)
GPU (Graphics Processing Unit)
o A Graphics Processing Unit (GPU), also known as a Video Processing Unit (VPU), is an electronic circuit that rapidly manipulates memory to accelerate the creation and processing of images for output to a display device.
o NVIDIA popularized the term GPU in 1999, when it released the GeForce 256 as “The world’s first GPU”.
GPU vs. CPU
CPU
o Performs task parallelism.
o Has fewer cores but a higher clock speed.
o Uses system RAM, which is large but slow.
o Has a large cache memory.
GPU
o Performs data parallelism.
o Has many more cores but a lower clock speed.
o Uses dedicated VRAM, which is fast but smaller.
o Has a small cache memory.
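The contrast between task parallelism and data parallelism can be sketched in Python. This is a conceptual illustration only, assuming CPython's `concurrent.futures` thread pool as a stand-in for hardware parallelism (the threads show how the work is divided, not GPU-like speedups); the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def data_parallel_scale(values: List[float], factor: float, workers: int = 4) -> List[float]:
    """Data parallelism (GPU style): every worker applies the SAME operation
    to a different slice of the data."""
    chunk = max(1, -(-len(values) // workers))  # ceiling division
    slices = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps results in slice order, so output order matches input order.
        parts = list(pool.map(lambda part: [v * factor for v in part], slices))
    return [v for part in parts for v in part]


def task_parallel_stats(values: List[float]) -> Dict[str, float]:
    """Task parallelism (CPU style): workers run DIFFERENT operations
    on the same data at the same time."""
    tasks = {"sum": sum, "min": min, "max": max}
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = {name: pool.submit(fn, values) for name, fn in tasks.items()}
        return {name: fut.result() for name, fut in futures.items()}


print(data_parallel_scale([1.0, 2.0, 3.0, 4.0, 5.0], 2.0))  # [2.0, 4.0, 6.0, 8.0, 10.0]
print(task_parallel_stats([3.0, 1.0, 2.0]))                  # {'sum': 6.0, 'min': 1.0, 'max': 3.0}
```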
Graphics Pipeline
o Vertex Shader: transforms each vertex and computes its position in 3D space.
o Primitive Generation: assembles vertices into polygons (usually triangles) in 3D space.
o Rasterization: fills the triangles with pixels (fragments) on the screen grid.
o Pixel Shader: computes each pixel's attributes, such as lighting and color.
o Testing and Mixing: pixels are tested (for example, for depth and shadow effects) and blended, and anti-aliasing (AA) is applied.
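Rasterization is commonly implemented with edge functions: a pixel is covered when its center lies on the inner side of all three triangle edges. The sketch below assumes pixel-center sampling and a tiny software framebuffer; real GPUs add fill rules for shared edges and run this test in parallel hardware, and the names here are illustrative.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def edge(a: Point, b: Point, p: Point) -> float:
    """Edge function: positive if p lies to the left of the directed edge a->b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def rasterize(v0: Point, v1: Point, v2: Point, width: int, height: int) -> List[Tuple[int, int]]:
    """Return the (x, y) pixels whose centers fall inside triangle v0-v1-v2."""
    area = edge(v0, v1, v2)
    if area == 0:
        return []  # degenerate triangle covers no pixels
    covered = []
    for y in range(height):
        for x in range(width):
            p = (x + 0.5, y + 0.5)  # sample at the pixel center
            w0, w1, w2 = edge(v1, v2, p), edge(v2, v0, p), edge(v0, v1, p)
            # Inside if every edge function has the same sign as the triangle's
            # area, which handles both clockwise and counter-clockwise winding.
            if area > 0 and w0 >= 0 and w1 >= 0 and w2 >= 0:
                covered.append((x, y))
            elif area < 0 and w0 <= 0 and w1 <= 0 and w2 <= 0:
                covered.append((x, y))
    return covered


covered = rasterize((0.0, 0.0), (4.0, 0.0), (0.0, 4.0), 4, 4)
print(len(covered))  # 10 pixels: those with x + y <= 3
```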
NVIDIA
o NVIDIA Corporation is an American global technology company founded in 1993.
o NVIDIA designs graphics processing units (GPUs), advancing the art and science of computer graphics.
o Since NVIDIA introduced the GPU, the engine of modern visual computing, the field has expanded to encompass video games, movie production, product design, medical diagnosis, and scientific research.
NVIDIA Technologies
o 3D Vision
o Battery Boost
o DSR
o G-SYNC
o SLI
o NVLink
o Optimus
o Multi-GPU
Architecture Overview
(Figure: Fermi block diagram with labels for the CUDA cores, SMs, scheduler and dispatcher, and registers and L1 cache.)
o Fermi is the codename for a GPU microarchitecture developed by NVIDIA, first released to retail in April 2010.
o It succeeded the Tesla microarchitecture.
o It was the primary microarchitecture used in the GeForce 400 and GeForce 500 series.
o It was followed by Kepler.
Fermi graphics processing units (GPUs) feature 3.0 billion transistors.
o Streaming Multiprocessor (SM): composed of 32 CUDA cores.
o GigaThread global scheduler: distributes thread blocks to the SM thread schedulers and manages context switches between threads during execution.
o Host interface: connects the GPU to the CPU via a PCI Express v2 bus.
o DRAM: supports up to 6 GB of GDDR5 DRAM memory.
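The SM figures combine into the chip's core count: a full Fermi chip has 16 SMs (taken from NVIDIA's Fermi whitepaper, an assumption relative to this slide), giving 16 × 32 = 512 CUDA cores, while the GTX 480 enables 15 SMs for 480 cores. The sketch also models the GigaThread scheduler handing thread blocks to SMs; its round-robin policy is a hypothetical simplification, not NVIDIA's documented algorithm.

```python
from typing import Dict, List

CORES_PER_SM = 32  # from the slide: each Fermi SM has 32 CUDA cores


def total_cores(num_sms: int) -> int:
    """CUDA cores on a chip with `num_sms` enabled SMs."""
    return num_sms * CORES_PER_SM


def distribute_blocks(num_blocks: int, num_sms: int) -> Dict[int, List[int]]:
    """Toy global scheduler: hand thread blocks to SMs in round-robin order.

    Hypothetical policy for illustration; the real scheduler also tracks
    per-SM resources and dispatches new blocks as SMs become free.
    """
    if num_sms <= 0:
        raise ValueError("num_sms must be positive")
    assignment: Dict[int, List[int]] = {sm: [] for sm in range(num_sms)}
    for block in range(num_blocks):
        assignment[block % num_sms].append(block)
    return assignment


print(total_cores(16), total_cores(15))  # 512 480
print(distribute_blocks(40, 15)[0])      # [0, 15, 30]
```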
Fermi Architecture
Load/Store Units (LD/ST):
o Each SM has 16 load/store units, allowing
source and destination addresses to be
calculated for sixteen threads per clock.
o Supporting units load and store the data
at each address to cache or DRAM.
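With 16 load/store units per SM, the addresses for a 32-thread warp take two clocks to compute. The sketch below works through that arithmetic and also counts how many cache lines a warp's accesses touch, assuming a 128-byte L1 line (the line size is an assumption, not stated on the slide): contiguous accesses fall into a single line, while strided ones spread out.

```python
from math import ceil
from typing import List

LDST_UNITS_PER_SM = 16   # from the slide: 16 load/store units per SM
WARP_SIZE = 32           # threads per warp
CACHE_LINE_BYTES = 128   # assumption: Fermi L1 cache line size


def warp_addresses(base: int, stride_bytes: int) -> List[int]:
    """Byte address each lane accesses for a strided load: base + lane * stride."""
    return [base + lane * stride_bytes for lane in range(WARP_SIZE)]


def address_clocks(active_threads: int = WARP_SIZE) -> int:
    """Clocks to compute addresses when 16 threads are handled per clock."""
    return ceil(active_threads / LDST_UNITS_PER_SM)


def cache_lines_touched(addresses: List[int]) -> int:
    """Number of distinct cache lines that a set of addresses falls into."""
    return len({addr // CACHE_LINE_BYTES for addr in addresses})


print(address_clocks())                               # 2
print(cache_lines_touched(warp_addresses(4096, 4)))   # 1 (32 consecutive 4-byte words)
print(cache_lines_touched(warp_addresses(4096, 8)))   # 2
```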
Special Function Units (SFU):
o SFUs execute transcendental instructions such as sine, cosine, reciprocal, and square root.
o Each SFU executes one instruction per thread, per clock.
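NVIDIA's Fermi whitepaper gives four SFUs per SM (an assumption here, since the slide does not state the count), so one SFU instruction for a 32-thread warp is spread over eight clocks. A minimal Python model of that issue pattern, using `math` functions as stand-ins for the hardware's transcendental units:

```python
import math
from typing import Callable, List, Tuple

SFUS_PER_SM = 4   # assumption: from NVIDIA's Fermi whitepaper, not this slide
WARP_SIZE = 32    # threads per warp


def run_sfu_warp(op: Callable[[float], float], inputs: List[float]) -> Tuple[List[float], int]:
    """Apply `op` to every lane of a warp, SFUS_PER_SM lanes per simulated clock.

    Returns the per-lane results and the number of clocks used.
    """
    results: List[float] = []
    clocks = 0
    for start in range(0, len(inputs), SFUS_PER_SM):
        # Each of the four SFUs handles one thread's operand this clock.
        results.extend(op(x) for x in inputs[start:start + SFUS_PER_SM])
        clocks += 1
    return results, clocks


results, clocks = run_sfu_warp(math.sqrt, [float(i * i) for i in range(WARP_SIZE)])
print(clocks, results[5])  # 8 5.0
```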
Parallel Tessellation Engines
o Traditional GPU designs use a single geometry engine to perform tessellation.
o This approach is analogous to early GPU designs, which used a single pixel pipeline to perform pixel shading.
o In the GTX 480, by contrast, the tessellation architecture is parallel.
o The result is a major gain in tessellation performance, up to two billion triangles per second.
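Tessellation multiplies geometry quickly, which is why a single geometry engine becomes a bottleneck. The sketch below uses simple uniform midpoint subdivision (a swapped-in technique, much simpler than the Direct3D 11 tessellator the hardware implements): each level splits every triangle into four, so n levels yield 4^n triangles, and each input triangle can be processed independently, which is what lets parallel engines share the work.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Triangle = Tuple[Point, Point, Point]


def midpoint(a: Point, b: Point) -> Point:
    return ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)


def subdivide(tri: Triangle) -> List[Triangle]:
    """Split one triangle into four using its edge midpoints."""
    a, b, c = tri
    ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
    return [(a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)]


def tessellate(tri: Triangle, levels: int) -> List[Triangle]:
    """Apply uniform subdivision `levels` times; each parent is independent."""
    tris = [tri]
    for _ in range(levels):
        tris = [child for parent in tris for child in subdivide(parent)]
    return tris


def triangle_area(tri: Triangle) -> float:
    (ax, ay), (bx, by), (cx, cy) = tri
    return abs((bx - ax) * (cy - ay) - (by - ay) * (cx - ax)) / 2


tris = tessellate(((0.0, 0.0), (1.0, 0.0), (0.0, 1.0)), 3)
print(len(tris))                              # 64
print(sum(triangle_area(t) for t in tris))    # 0.5, the original area
```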
Third Generation Streaming Multiprocessor
o The third-generation SM introduces several architectural innovations that improve both performance and accuracy.
o Each of Fermi’s SMs contains 32 CUDA processors. Thanks to a flexible scalar architecture, the CUDA cores achieve full performance on a variety of workloads, such as textures, shadow maps, and complex shaders.
o Each CUDA processor has a fully pipelined integer arithmetic logic unit (ALU) and floating-point unit (FPU). The FPU implements the IEEE 754-2008 floating-point standard, including the fused multiply-add (FMA) instruction, for both single and double precision.
o Fermi applies this high standard of precision to all workloads, e.g. games, video transcoding, and desktop applications. The result is consistently high performance in current as well as future games.
o Fermi’s third-generation SM also improves execution efficiency through better scheduling.
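Fused multiply-add computes a*b + c with a single rounding instead of two. The Python sketch below, assuming Python floats are IEEE 754 doubles (true on mainstream platforms), emulates FMA by evaluating exactly with `fractions.Fraction` and rounding once, and shows an input where the separate multiply-then-add loses the result entirely.

```python
from fractions import Fraction


def mul_then_add(a: float, b: float, c: float) -> float:
    """Separate multiply and add: the product is rounded before the add."""
    product = a * b      # first rounding
    return product + c   # second rounding


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulated FMA: compute a*b + c exactly with rationals, then round once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # single rounding to the nearest double


a, b, c = 1.0 + 2.0**-30, 1.0 - 2.0**-30, -1.0
# Exact result: (1 + 2^-30)(1 - 2^-30) - 1 = -2^-60
print(mul_then_add(a, b, c))        # 0.0 (the tiny term is rounded away)
print(fused_multiply_add(a, b, c))  # -8.673617379884035e-19, i.e. -2^-60
```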
Dual Warp Scheduler
o The SM schedules threads in groups of 32 parallel threads called warps.
o Each SM has two warp schedulers and two instruction dispatch units, allowing two warps to be issued and executed concurrently.
o Because warps execute independently, Fermi’s scheduler does not need to check for dependencies within the instruction stream.
o Using this dual-issue model, Fermi achieves near-peak hardware performance.
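A toy model of dual issue: threads map to warps by integer division by 32, and two schedulers each pick one ready warp per clock. The even/odd split of warps between the schedulers and the round-robin selection are assumptions for illustration; real hardware also tracks operand readiness and execution-unit availability.

```python
from collections import deque
from typing import List, Tuple

WARP_SIZE = 32  # threads per warp, from the slide


def warp_of(thread_id: int) -> Tuple[int, int]:
    """Return (warp index, lane index) for a thread's index within its block."""
    return divmod(thread_id, WARP_SIZE)


def dual_issue_schedule(num_warps: int, instrs_per_warp: int) -> List[Tuple[int, int]]:
    """Simulate two warp schedulers issuing in parallel.

    Hypothetical policy: scheduler 0 owns even-numbered warps, scheduler 1 owns
    odd-numbered warps, and each picks its warps round-robin. Each tuple is
    (warp issued by scheduler 0, warp issued by scheduler 1); -1 means idle.
    """
    if instrs_per_warp <= 0:
        return []
    remaining = [instrs_per_warp] * num_warps
    ready = [deque(w for w in range(num_warps) if w % 2 == s) for s in (0, 1)]
    timeline: List[Tuple[int, int]] = []
    while ready[0] or ready[1]:
        issued = []
        for queue in ready:
            if not queue:
                issued.append(-1)  # this scheduler has no warp left to issue
                continue
            warp = queue.popleft()
            remaining[warp] -= 1
            if remaining[warp] > 0:
                queue.append(warp)  # rotate to the back for round-robin fairness
            issued.append(warp)
        timeline.append((issued[0], issued[1]))
    return timeline


print(warp_of(33))                 # (1, 1)
print(dual_issue_schedule(4, 2))   # [(0, 1), (2, 3), (0, 1), (2, 3)]
```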
Second Generation Parallel Thread Execution ISA
o Fermi is the first architecture to support the new Parallel Thread eXecution (PTX) 2.0 instruction set.
o PTX is a low-level virtual machine and ISA designed to support the operations of a parallel thread processor.
o At program install time, the GPU driver translates PTX instructions into machine instructions.
o The primary goals of PTX are to:
   o Provide a stable ISA that spans multiple GPU generations.
   o Achieve full GPU performance in compiled applications.
   o Provide a scalable programming model that spans GPU sizes from a few cores to many parallel cores.
   o Provide a machine-independent ISA for C, C++, Fortran, and other compiler targets.
   o Facilitate hand-coding of libraries and performance kernels.
Fermi Memory Hierarchy
1. SM Register Files
o Large, unified register file (32,768 registers)
o 128 KB register file per SM
2. L1 Cache / Shared Memory
o Configurable 64 KB memory per SM
o Split between shared memory (shared by the threads of a block) and L1 cache (private to the SM)
o Very low latency (20-30 cycles)
o High bandwidth (1,000+ GB/s)
3. L2 Cache
o 768 KB unified cache
o Shared among all SMs
o ECC protected
o Fast atomic memory operations
4. Global Memory
o Accessible by both the GPU and the CPU
o Six 64-bit DRAM channels
o Up to 6 GB of GDDR5 memory
o Higher latency (400-800 cycles)
o Throughput of up to 177 GB/s
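The register-file and on-chip memory figures can be checked with a little arithmetic. The sketch assumes 32-bit registers, the compute capability 2.x limits of 1,536 resident threads per SM and 63 registers per thread, and the 48 KB/16 KB shared-memory/L1 splits of the 64 KB array; none of these limits appear on the slide, and the sketch ignores register allocation granularity.

```python
REGISTERS_PER_SM = 32768    # from the slide
BYTES_PER_REGISTER = 4      # assumption: 32-bit registers
MAX_THREADS_PER_SM = 1536   # assumption: compute capability 2.x limit
MAX_REGS_PER_THREAD = 63    # assumption: compute capability 2.x limit
ONCHIP_KB = 64              # configurable L1/shared memory per SM, from the slide
SHARED_L1_SPLITS_KB = ((48, 16), (16, 48))  # assumption: (shared, L1) options


def register_file_bytes() -> int:
    """Size of one SM's register file in bytes."""
    return REGISTERS_PER_SM * BYTES_PER_REGISTER


def regs_per_thread(resident_threads: int) -> int:
    """Upper bound on registers per thread when `resident_threads` share one SM."""
    if not 0 < resident_threads <= MAX_THREADS_PER_SM:
        raise ValueError(f"resident_threads must be in 1..{MAX_THREADS_PER_SM}")
    return min(MAX_REGS_PER_THREAD, REGISTERS_PER_SM // resident_threads)


print(register_file_bytes() // 1024)  # 128 (KB, matching the slide)
print(regs_per_thread(1536))          # 21 registers per thread at full occupancy
print(regs_per_thread(256))           # 63 (capped by the per-thread limit)
```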
Memory Architecture
Different CPU threads can work on different instructions (for example, an addition and a multiplication) concurrently, but all 32 threads in a GPU warp execute the same instruction at a time. GPUs follow the concept of ‘Single Instruction, Multiple Data’ (SIMD); NVIDIA calls its thread-level variant SIMT (Single Instruction, Multiple Threads).
In the figure, the green area is the GPU; notice that the GPU has its own on-board memory. This “GPU memory” ranges from 768 megabytes to 6 gigabytes of GDDR5 memory.
However, the memory bandwidth of a GPU is much higher than that of system memory.
The L1 caches on the GPU are not coherent: the L1 caches of different SMs are not kept consistent with each other, so two L1 caches cannot reliably work on the same memory location at the same time.
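Because a warp issues one instruction for all 32 threads, a branch whose condition differs across threads forces the warp through both paths with inactive lanes masked off. The Python sketch below models that behavior for a single if/else; it is a simplified illustration (the function name and the pass count are hypothetical), not how the hardware's reconvergence logic is actually implemented.

```python
from typing import Callable, List, Tuple


def simt_if_else(values: List[int], cond: Callable[[int], bool],
                 then_fn: Callable[[int], int],
                 else_fn: Callable[[int], int]) -> Tuple[List[int], int]:
    """Run an if/else for one warp under an active-lane mask.

    Returns the per-lane results and how many paths (passes) the warp executed.
    """
    mask = [cond(v) for v in values]
    results = list(values)
    passes = 0
    if any(mask):                       # "then" path: only lanes where cond is True
        passes += 1
        for lane, v in enumerate(values):
            if mask[lane]:
                results[lane] = then_fn(v)
    if not all(mask):                   # "else" path: the remaining lanes
        passes += 1
        for lane, v in enumerate(values):
            if not mask[lane]:
                results[lane] = else_fn(v)
    return results, passes


lanes = list(range(32))
res, passes = simt_if_else(lanes, lambda v: v % 2 == 0, lambda v: v * 10, lambda v: -v)
print(passes, res[:4])  # 2 [0, -1, 20, -3]  (divergent warp runs both paths)
```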
GPU Bandwidth
o High bandwidth between the GPU and its memory is required to keep many cores supplied with data.
o GPU memory systems are designed for data throughput, with wide memory buses.
o GPU memory bandwidth is much larger than that of typical CPUs, typically 6 to 8 times higher.
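Peak memory bandwidth is the bus width in bytes times the effective transfer rate. The sketch assumes the GTX 480's GDDR5 runs at an effective 3,696 MT/s across six 64-bit channels (384 bits), which reproduces the roughly 177 GB/s figure; the dual-channel DDR3-1333 CPU configuration used for comparison is likewise an assumption.

```python
def peak_bandwidth_gbs(bus_width_bits: int, data_rate_mtps: float) -> float:
    """Peak DRAM bandwidth in GB/s (10**9 bytes per second).

    bus_width_bits: total memory interface width (channels * bits per channel).
    data_rate_mtps: effective transfers per second per pin, in millions (MT/s).
    """
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * data_rate_mtps * 1e6 / 1e9


FERMI_BUS_BITS = 6 * 64         # six 64-bit DRAM channels
GTX480_DATA_RATE_MTPS = 3696    # assumption: GTX 480 effective GDDR5 data rate
CPU_BUS_BITS = 2 * 64           # assumption: dual-channel DDR3 desktop
CPU_DATA_RATE_MTPS = 1333       # assumption: DDR3-1333

gpu = peak_bandwidth_gbs(FERMI_BUS_BITS, GTX480_DATA_RATE_MTPS)
cpu = peak_bandwidth_gbs(CPU_BUS_BITS, CPU_DATA_RATE_MTPS)
print(round(gpu, 1), round(cpu, 1), round(gpu / cpu, 1))  # 177.4 21.3 8.3
```

With these inputs the GPU-to-CPU ratio comes out at roughly 8x, at the upper end of the 6 to 8 times range stated above.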
New Render Output Units with Improved Anti-aliasing
Fermi’s Render Output (ROP) subsystem has been redesigned for improved throughput and efficiency. One Fermi ROP partition contains eight ROP units, a twofold improvement over prior architectures.
8x antialiasing, an expensive operation on prior-generation GPUs, is now much faster thanks to improved memory compression and a larger framebuffer.
Along with the performance improvements, image quality is also improved. Fermi supports 32x coverage sampling antialiasing (CSAA), the highest-sample antialiasing mode available on any GPU at its release.
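Multisample-style antialiasing resolves each pixel by averaging several sub-pixel samples, so an edge that covers half the samples yields a half-blended color. The sketch below assumes a hypothetical four-sample pattern and a plain box-filter resolve; CSAA extends this idea by storing extra coverage-only samples alongside fewer color samples, which this toy model does not attempt.

```python
from typing import Callable, Tuple

Color = Tuple[float, float, float]

# Hypothetical 4-sample pattern: sample offsets inside a 1x1 pixel.
SAMPLE_OFFSETS = ((0.375, 0.125), (0.875, 0.375), (0.125, 0.625), (0.625, 0.875))


def resolve_pixel(px: int, py: int, inside: Callable[[float, float], bool],
                  fg: Color, bg: Color) -> Color:
    """Box-filter resolve: test each sample against the primitive, store the
    primitive's color (fg) or the background (bg), then average the samples."""
    samples = [fg if inside(px + dx, py + dy) else bg for dx, dy in SAMPLE_OFFSETS]
    n = len(samples)
    return (sum(s[0] for s in samples) / n,
            sum(s[1] for s in samples) / n,
            sum(s[2] for s in samples) / n)


WHITE: Color = (1.0, 1.0, 1.0)
BLACK: Color = (0.0, 0.0, 0.0)
# A vertical edge at x = 0.5 covers two of the four samples in pixel (0, 0).
print(resolve_pixel(0, 0, lambda x, y: x < 0.5, WHITE, BLACK))  # (0.5, 0.5, 0.5)
```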
First GPU with ECC Memory Support
o Fermi is the first GPU to support Error Correcting Code (ECC) based protection of data in
memory.
o ECC was requested by GPU computing users to enhance data integrity in high
performance computing environments.
o ECC is a highly desired feature in areas such as medical imaging and large-scale cluster computing.
o Naturally occurring radiation can cause a bit stored in memory to be altered,
resulting in a soft error. ECC technology detects and corrects single-bit soft
errors before they affect the system.
o Because the probability of such radiation-induced errors increases linearly with the number of installed systems, ECC is an essential requirement in large cluster installations.
All NVIDIA GPUs include support for the PCI Express standard for CRC check with retry at the data link layer. Fermi also supports the similar GDDR5 standard for CRC check with retry (aka “EDC”) during transmission of data across the memory bus.
Fermi supports Single-Error Correct, Double-Error Detect (SECDED) ECC.
SECDED ECC corrects single-bit errors and ensures that all double-bit errors, and many multi-bit errors, are detected and reported, so that the program can be re-run rather than being allowed to continue executing with bad data.
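SECDED is usually built from an extended Hamming code: the Hamming parity bits locate a single flipped bit, and one extra overall parity bit distinguishes single errors (correctable) from double errors (detectable only). The sketch below protects just 4 data bits with an 8-bit codeword; this toy size is an assumption for readability, since memory ECC in practice protects much wider words, and the function names are hypothetical.

```python
from typing import List, Tuple

DATA_POSITIONS = (3, 5, 6, 7)  # non-power-of-two positions hold data bits


def encode(data: List[int]) -> List[int]:
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) codeword.

    Index 0 holds the overall parity bit; indices 1..7 use the classic
    Hamming(7,4) layout with parity bits at positions 1, 2 and 4.
    """
    if len(data) != 4:
        raise ValueError("expected 4 data bits")
    code = [0] * 8
    for pos, bit in zip(DATA_POSITIONS, data):
        code[pos] = bit
    # Parity bit p covers every other position whose index has bit p set.
    for p in (1, 2, 4):
        code[p] = sum(code[i] for i in range(1, 8) if i & p and i != p) % 2
    code[0] = sum(code[1:]) % 2  # overall parity makes the total even
    return code


def decode(code: List[int]) -> Tuple[str, List[int]]:
    """Return (status, data bits); status is 'ok', 'corrected' or 'double_error'."""
    if len(code) != 8:
        raise ValueError("expected an 8-bit codeword")
    code = list(code)
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i  # XOR of set positions points at a single flipped bit
    overall = sum(code) % 2  # 1 means an odd number of bits flipped
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        code[syndrome] ^= 1  # single error (syndrome 0: the parity bit itself)
        status = "corrected"
    else:
        return "double_error", []  # even flips, nonzero syndrome: detect only
    return status, [code[i] for i in DATA_POSITIONS]


word = encode([1, 0, 1, 1])
word[5] ^= 1                 # simulate a single-bit soft error
print(decode(word))          # ('corrected', [1, 0, 1, 1])
```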
Fermi’s register files, shared
memories, L1 caches, L2 cache,
and DRAM memory are ECC
protected, making it not only the
most powerful GPU for HPC
applications, but also the most
reliable. In addition, Fermi
supports industry standards for
checking of data during
transmission from chip to chip.
Applications
Used for parallel computing in calculation-intensive tasks, such as:
o Digital Image Processing
o Statistical Physics
o Physics Simulation
o Analog Signal Processing
o Fast Fourier Transform
o Fuzzy Logic
Thank You


NVIDIA GPU Architecture (GeForce GTX 480)

  • 1. NVIDIA GPU Architecture. By: Aneeza Imtiaz (047), Fatima Qayyum (011), Mahnoor Shaukat (020), Syeda Ammara Batool (040), Hafsa Zulifiqar (053)
  • 2. GPU (Graphics Processing Unit): A graphics processing unit (GPU), also known as a video processing unit (VPU), is an electronic circuit that rapidly manipulates memory to accelerate the creation and processing of images for output to a display device. NVIDIA popularized the term "GPU" in 1999, when it released the GeForce 256 as "the world's first GPU".
  • 3. GPU vs. CPU
    CPU: performs task parallelism; has fewer cores but a higher clock speed; uses external system RAM, which is large but slower; has a large cache.
    GPU: performs data parallelism; has many more cores but a lower clock speed; uses dedicated VRAM, which is faster (higher bandwidth) but smaller; has a small cache.
  • 4. Graphics Pipeline: Vertex shader: computes the position of each vertex in 3D space. Primitive generation: assembles vertices into polygons (typically triangles) in 3D space. Rasterization: fills each triangle with dots, or pixels. Pixel shader: assigns each pixel attributes such as lighting and color. Testing and mixing: the rendered 3D objects are tested for shadow effects, and anti-aliasing (AA) is applied.
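The rasterization step can be illustrated with a minimal Python sketch. It assumes the common edge-function (half-space) test rather than any specific NVIDIA hardware algorithm: a pixel is covered when its center lies on the inner side of all three triangle edges. The names `edge` and `rasterize` are hypothetical, and the sketch ignores the tie-breaking rules real rasterizers use for pixels lying exactly on a shared edge.

```python
def edge(ax, ay, bx, by, px, py):
    """Edge function: twice the signed area of triangle (A, B, P).
    Positive when P lies to the left of the directed edge A -> B."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def rasterize(v0, v1, v2, width, height):
    """Return the set of (x, y) pixels whose centers lie inside triangle v0-v1-v2."""
    covered = set()
    area = edge(*v0, *v1, *v2)
    if area == 0:
        return covered  # degenerate (zero-area) triangle covers no pixels
    sign = 1 if area > 0 else -1  # accept either winding order
    for y in range(height):
        for x in range(width):
            px, py = x + 0.5, y + 0.5  # sample at the pixel center
            w0 = edge(*v1, *v2, px, py) * sign
            w1 = edge(*v2, *v0, px, py) * sign
            w2 = edge(*v0, *v1, px, py) * sign
            if w0 >= 0 and w1 >= 0 and w2 >= 0:  # inside or on all three edges
                covered.add((x, y))
    return covered
```

For example, `rasterize((0, 0), (4, 0), (0, 4), 4, 4)` covers the 10 pixels whose coordinates satisfy x + y <= 3. Every pixel is tested independently, which is why rasterization maps so well onto many parallel cores.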
  • 5. NVIDIA: NVIDIA Corporation is an American multinational technology company founded in 1993. NVIDIA designs graphics processing units (GPUs) and has advanced the art and science of computer graphics. With its invention of the GPU, the engine of modern visual computing, the field has expanded to encompass video games, movie production, product design, medical diagnosis, and scientific research.
  • 7. Architecture Overview: CUDA cores, streaming multiprocessors (SMs), scheduler and dispatcher, registers and L1 cache
  • 8. Fermi Architecture: Fermi is the codename for a GPU microarchitecture developed by NVIDIA and first released to retail in April 2010. It succeeded Tesla, was the primary microarchitecture of the GeForce 400 and GeForce 500 series, and was followed by Kepler. The first Fermi-based GPUs feature 3.0 billion transistors. Main components: Streaming Multiprocessor (SM): composed of 32 CUDA cores. GigaThread global scheduler: distributes thread blocks to the SM thread schedulers and manages context switches between threads during execution. Host interface: connects the GPU to the CPU via a PCI Express 2.0 bus. DRAM: supports up to 6 GB of GDDR5 memory.
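How a global scheduler might hand thread blocks to SMs can be sketched as a toy model in Python. This is a deliberate simplification, not NVIDIA's actual GigaThread policy: it assumes the full 16-SM Fermi design (the GTX 480 enables 15), a limit of 8 resident blocks per SM, round-robin placement, and blocks that all finish together in "waves". Real hardware dispatches a new block as soon as any resident block completes. `dispatch_waves` is a hypothetical name.

```python
def dispatch_waves(num_blocks, num_sms=16, max_blocks_per_sm=8):
    """Toy global block scheduler (hypothetical model, not NVIDIA's algorithm).

    Fills every SM's resident-block slots round-robin, treats that group as one
    "wave" that finishes together, then dispatches the next wave.
    Returns a list of waves; each wave maps SM index -> list of block IDs.
    """
    slots_per_wave = num_sms * max_blocks_per_sm
    waves = []
    next_block = 0
    while next_block < num_blocks:
        wave = {sm: [] for sm in range(num_sms)}
        for slot in range(slots_per_wave):
            if next_block == num_blocks:
                break
            wave[slot % num_sms].append(next_block)  # round-robin across SMs
            next_block += 1
        waves.append(wave)
    return waves
```

With 300 blocks and 16 x 8 = 128 slots per wave, the model needs three waves (128, 128, and 44 blocks), so in the last wave most SM slots sit idle.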
  • 9. Load/Store Units (LD/ST): Each SM has 16 load/store units, allowing source and destination addresses to be calculated for sixteen threads per clock. Supporting units load and store the data at each address to cache or DRAM. Special Function Units (SFUs): SFUs execute transcendental instructions such as sine, cosine, reciprocal, and square root. Each SFU executes one instruction per thread, per clock; with four SFUs per SM, one instruction for a 32-thread warp takes eight clocks.
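The per-clock figures translate directly into warp throughput. Below is a minimal arithmetic sketch in Python, assuming each unit handles one thread per clock and ignoring pipeline latency and dual issue. The count of four SFUs per SM comes from NVIDIA's Fermi whitepaper rather than this slide, and `clocks_per_warp` is a hypothetical helper.

```python
import math

WARP_SIZE = 32  # threads per warp on Fermi


def clocks_per_warp(units_per_sm, warp_size=WARP_SIZE):
    """Clocks needed to process one instruction for a whole warp when each
    unit handles one thread per clock (simplified throughput model)."""
    return math.ceil(warp_size / units_per_sm)


LD_ST_UNITS = 16  # per SM, from the slide
SFU_UNITS = 4     # per SM, from NVIDIA's Fermi whitepaper
CUDA_CORES = 32   # per SM

for name, units in (("LD/ST", LD_ST_UNITS), ("SFU", SFU_UNITS), ("CUDA core", CUDA_CORES)):
    print(f"{name}: {clocks_per_warp(units)} clock(s) per warp instruction")
```

The loop prints 2 clocks for LD/ST, 8 for the SFUs, and 1 for the CUDA cores, which is why transcendental-heavy code can bottleneck on the SFUs.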
  • 10. Parallel Tessellation Engines: Traditional GPU designs use a single geometry engine to perform tessellation. This approach is analogous to early GPU designs, which used a single pixel pipeline to perform pixel shading. In the GTX 480, by contrast, the tessellation architecture is parallel. The result is a breakthrough in tessellation performance, at up to two billion triangles per second.
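Tessellation itself can be illustrated with a minimal Python sketch, assuming uniform subdivision with a single integer tessellation factor `n` (real hardware tessellators support per-edge factors and other patch types). Each input triangle is split into n x n smaller triangles. Because every patch is processed independently, patches can be spread across several tessellation engines, which is the parallelism the slide describes. `tessellate_triangle` is a hypothetical name.

```python
def tessellate_triangle(a, b, c, n):
    """Uniformly subdivide triangle (a, b, c) into n * n sub-triangles.
    Vertices are 2D (x, y) tuples; n is a tessellation factor >= 1."""

    def point(i, j):
        # Barycentric grid point: i steps toward b, j steps toward c.
        u, v = i / n, j / n
        w = 1.0 - u - v
        return (w * a[0] + u * b[0] + v * c[0], w * a[1] + u * b[1] + v * c[1])

    tris = []
    for i in range(n):
        for j in range(n - i):
            # "Upright" sub-triangle in this grid cell.
            tris.append((point(i, j), point(i + 1, j), point(i, j + 1)))
            if j < n - i - 1:
                # "Inverted" sub-triangle filling the gap between upright ones.
                tris.append((point(i + 1, j), point(i + 1, j + 1), point(i, j + 1)))
    return tris
```

With n = 4, one triangle becomes 16, so geometry work grows quadratically with the tessellation factor; that growth is what made a single geometry engine a bottleneck.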
  • 11. Third-Generation Streaming Multiprocessor: The SM introduces several architectural innovations that improve performance and accuracy. Each of Fermi's SMs contains 32 CUDA processors. By employing a flexible scalar architecture, CUDA cores achieve full performance on a variety of workloads such as textures, shadow maps, and complex shaders. Each CUDA processor has a fully pipelined integer arithmetic logic unit (ALU) and floating-point unit (FPU); the FPU implements the IEEE 754-2008 floating-point standard, including the fused multiply-add (FMA) instruction. Fermi applies this high standard of precision to all workloads, including games, video transcoding, and desktop applications. The result is consistently high performance in current as well as future games. Fermi's third-generation SM also improves execution efficiency through improved scheduling.
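The practical effect of fused multiply-add is that a*b + c is rounded once instead of twice. Below is a minimal Python sketch, assuming double precision and emulating the fused operation with exact rational arithmetic (`fractions.Fraction`) rather than calling GPU hardware; `fma_exact` and `mul_then_add` are hypothetical names.

```python
from fractions import Fraction


def mul_then_add(a, b, c):
    """Unfused: the product is rounded to a double, then the sum is rounded again."""
    return a * b + c


def fma_exact(a, b, c):
    """Emulated fused multiply-add: compute a*b + c exactly as a rational
    number, then round once to the nearest double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


a = b = 1.0 + 2.0**-30
c = -(1.0 + 2.0**-29)
# Exact a*b = 1 + 2**-29 + 2**-60; the 2**-60 term is below double precision near 1.0
print(mul_then_add(a, b, c))  # 0.0: the 2**-60 term was lost when a*b was rounded
print(fma_exact(a, b, c))     # 2**-60 (about 8.67e-19): kept by the single rounding
```

Cancellation-heavy kernels such as dot products and polynomial evaluation benefit from this single rounding.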
  • 12. Dual Warp Scheduler: The SM schedules threads in groups of 32 parallel threads called warps. Each SM has two warp schedulers and two instruction dispatch units, allowing two warps to be issued and executed concurrently. Because warps execute independently, Fermi's scheduler does not need to check for dependencies from within the instruction stream. Using this elegant model of dual issue, Fermi achieves near-peak hardware performance.
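The benefit of two schedulers can be shown with a toy cycle-counting model in Python. It assumes every instruction issues in one cycle with no latencies or dependencies, and that warps are split between the schedulers by warp ID; both are simplifying assumptions, not a description of Fermi's exact selection logic. `simulate_dual_issue` is a hypothetical name.

```python
from collections import deque


def simulate_dual_issue(instr_per_warp, num_warps, num_schedulers=2):
    """Toy model: each cycle, every scheduler issues one instruction from one
    of its own warps. Returns the number of cycles to drain all warps."""
    queues = [deque() for _ in range(num_schedulers)]
    for w in range(num_warps):
        # Assumption: warps are assigned to schedulers by warp ID.
        queues[w % num_schedulers].append([w, instr_per_warp])  # [warp id, remaining]
    cycles = 0
    while any(queues):
        for q in queues:
            if q:
                warp = q.popleft()
                warp[1] -= 1        # issue one instruction from this warp
                if warp[1] > 0:
                    q.append(warp)  # rotate among this scheduler's warps
        cycles += 1
    return cycles
```

Four warps of 10 instructions drain in 20 cycles with two schedulers and 40 with one. Because the two warps are independent, neither scheduler has to check the other's instruction stream for dependencies.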
  • 13. Second-Generation Parallel Thread Execution ISA: Fermi is the first architecture to support the new Parallel Thread eXecution (PTX) 2.0 instruction set. PTX is a low-level virtual machine and ISA designed to support the operations of a parallel thread processor. At program install time, the GPU driver translates PTX instructions into machine instructions. The primary goals of PTX are to:
    o provide a stable ISA that spans multiple GPU generations;
    o achieve full GPU performance in compiled applications;
    o provide a scalable programming model that spans GPU sizes from a few cores to many parallel cores;
    o provide a machine-independent ISA for C, C++, Fortran, and other compiler targets;
    o facilitate hand-coding of libraries and performance kernels.
  • 14. Fermi Memory Hierarchy
    o Register file (per SM): large, unified register file of 32,768 registers (128 KB per SM).
    o L1 cache / shared memory (per SM): 64 KB of configurable memory, split between shared memory (shared by the threads of a block) and private L1 cache; very low latency (20 to 30 cycles); high bandwidth (1,000+ GB/s).
    o L2 cache: 768 KB unified cache shared among all SMs; ECC protected; fast atomic memory operations.
    o Global memory: accessed by both GPU and CPU; six 64-bit DRAM channels; up to 6 GB of GDDR5; higher latency (400 to 800 cycles); throughput of up to 177 GB/s.
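The register-file and bandwidth figures follow from simple arithmetic. Below is a Python sketch, assuming 32-bit registers and a GDDR5 data rate of 3,696 million transfers per second per pin for the GTX 480. That data rate is a commonly quoted board specification, not a figure stated on the slide, and `peak_bandwidth_gbs` is a hypothetical helper.

```python
REGISTERS_PER_SM = 32768  # from the slide
BYTES_PER_REGISTER = 4    # assumption: 32-bit registers

register_file_kib = REGISTERS_PER_SM * BYTES_PER_REGISTER / 1024
print(f"Register file per SM: {register_file_kib:.0f} KiB")  # 128 KiB


def peak_bandwidth_gbs(channels, channel_width_bits, data_rate_mtps):
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 10**9 bytes):
    total bus width in bytes times transfers per second."""
    bus_bytes = channels * channel_width_bits / 8  # 6 x 64 bits = 48 bytes
    return bus_bytes * data_rate_mtps * 1e6 / 1e9


# Assumed GTX 480 GDDR5 data rate: 3,696 MT/s per pin (not stated on the slide)
gtx480 = peak_bandwidth_gbs(channels=6, channel_width_bits=64, data_rate_mtps=3696)
print(f"GTX 480 peak bandwidth: {gtx480:.1f} GB/s")  # about 177.4 GB/s
```

The six 64-bit channels form a 384-bit bus. This width, rather than an unusually fast clock, is the main source of the GPU's bandwidth advantage.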
  • 15. Memory Architecture: Different CPU threads can execute different instructions (for example, an addition and a multiplication) concurrently, but all 32 threads in a warp execute the same instruction at the same time. GPUs therefore follow the single instruction, multiple data (SIMD) concept; NVIDIA calls its variant SIMT (single instruction, multiple threads). In the slide's diagram, the green area is the GPU, which has its own on-board memory. On Fermi boards, this GPU memory ranges from 768 MB to 6 GB of GDDR5, and its bandwidth is much higher than that of system memory. L1 caches on GPUs are not coherent: the L1 caches of two different SMs are not kept consistent with each other, so they cannot safely share writes to the same memory location.
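Because all threads of a warp issue the same instruction, a branch on which the threads disagree is typically handled by running both paths one after the other, with the inactive threads masked off. Below is a simplified Python model of that behavior. It is a sketch, not NVIDIA's exact reconvergence mechanism, and `run_branch` and its two branch bodies are hypothetical.

```python
WARP_SIZE = 32


def run_branch(values, threshold):
    """Toy SIMT model of one warp executing:
        if v < threshold: v = v * 2   else: v = v + 100
    Returns (results, passes), where passes counts how many branch paths the
    warp had to issue (1 if all lanes agree, 2 if they diverge)."""
    assert len(values) == WARP_SIZE
    taken = [v < threshold for v in values]  # per-lane branch outcome
    paths = (
        (taken, lambda v: v * 2),                     # "if" path
        ([not t for t in taken], lambda v: v + 100),  # "else" path
    )
    results = list(values)
    passes = 0
    for mask, body in paths:
        if not any(mask):
            continue  # no lane needs this path, so the warp skips it
        passes += 1
        for lane in range(WARP_SIZE):
            if mask[lane]:  # lanes outside the mask sit idle during this pass
                results[lane] = body(values[lane])
    return results, passes
```

With lane values 0 to 31, a threshold of 100 keeps the warp converged (one pass), while a threshold of 16 splits it in half and doubles the issue cost (two passes).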
  • 16. GPU Bandwidth: High bandwidth to main memory is required to keep many cores supplied with data. GPU memory systems are designed for data throughput, with wide memory buses. Their bandwidth is much larger than that of typical CPUs, typically 6 to 8 times.
  • 17. New Render Output Units with Improved Anti-aliasing: Fermi's Render Output (ROP) subsystem has been redesigned for improved throughput and efficiency. One Fermi ROP partition contains eight ROP units, a twofold improvement over prior architectures. 8x anti-aliasing, an expensive operation on prior-generation GPUs, is now much faster thanks to improved memory compression and a larger framebuffer. Image quality improves along with performance: Fermi supports 32x coverage sampling anti-aliasing (CSAA), the highest-sample anti-aliasing mode on any GPU at the time of its release.
  • 18. First GPU with ECC Memory Support: Fermi is the first GPU to support Error Correcting Code (ECC) based protection of data in memory. GPU computing users requested ECC to enhance data integrity in high-performance computing environments. ECC is a highly desired feature in areas such as medical imaging and large-scale cluster computing.
  • 19. ECC (continued): Naturally occurring radiation can alter a bit stored in memory, resulting in a soft error. ECC technology detects and corrects single-bit soft errors before they affect the system. Because the probability of such radiation-induced errors increases linearly with the number of installed systems, ECC is an essential requirement in large cluster installations.
  • 20. ECC (continued): Fermi supports Single-Error Correct Double-Error Detect (SECDED) ECC, which ensures that all double-bit errors and many multi-bit errors are detected and reported, so that the program can be re-run rather than being allowed to continue executing with bad data. Fermi's register files, shared memories, L1 caches, L2 cache, and DRAM memory are all ECC protected, which NVIDIA says makes it not only the most powerful GPU for HPC applications but also the most reliable. In addition, Fermi supports industry standards for checking data during transmission from chip to chip: all NVIDIA GPUs support the PCI Express standard's CRC check with retry at the data link layer, and Fermi also supports the similar GDDR5 standard for CRC check with retry (aka "EDC") during transmission of data across the memory bus.
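SECDED can be illustrated with the classic extended Hamming code. The Python sketch below protects only 4 data bits with 4 check bits. This textbook (8,4) code is chosen for readability and is an assumption, not Fermi's actual ECC layout, which protects much wider words. `encode` and `decode` are hypothetical names.

```python
from functools import reduce
from operator import xor


def encode(data):
    """Encode 4 data bits into an 8-bit extended Hamming (SECDED) codeword.
    Index 0 holds overall parity; indices 1, 2, 4 hold Hamming check bits."""
    assert len(data) == 4 and all(bit in (0, 1) for bit in data)
    c = [0] * 8
    c[3], c[5], c[6], c[7] = data
    c[1] = c[3] ^ c[5] ^ c[7]  # covers positions with bit 0 set
    c[2] = c[3] ^ c[6] ^ c[7]  # covers positions with bit 1 set
    c[4] = c[5] ^ c[6] ^ c[7]  # covers positions with bit 2 set
    c[0] = reduce(xor, c[1:])  # overall parity enables double-error detection
    return c


def decode(codeword):
    """Return (data, status): status is "ok", "corrected", or "double-error"."""
    c = list(codeword)
    syndrome = reduce(xor, (i for i in range(1, 8) if c[i]), 0)  # position of a single flip
    overall = reduce(xor, c)  # 1 if an odd number of bits flipped
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        c[syndrome] ^= 1  # syndrome 0 means the overall parity bit itself flipped
        status = "corrected"
    else:
        return None, "double-error"  # detected but not correctable: report it
    return [c[3], c[5], c[6], c[7]], status
```

Flipping any one of the eight bits is corrected, while flipping two is reported as `"double-error"`. This mirrors the slide's point: the fault is reported so the program can be re-run instead of continuing with bad data.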
  • 21. Applications: GPUs are used for parallel computing in calculation-intensive tasks such as digital image processing, statistical physics, physics simulation, analog signal processing, fast Fourier transforms, and fuzzy logic.
  • 22. Thank You. References: 1. https://www.geforce.com/hardware/desktop-gpus/geforce-gtx-480/architecture 2. https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
