NVIDIA GPU Architecture
By:
Aneeza Imtiaz (047)
Fatima Qayyum (011)
Mahnoor Shaukat (020)
Syeda Ammara Batool (040)
Hafsa Zulifiqar (053)
GPU (Graphics Processing Unit)
o A Graphics Processing Unit (GPU), also known as a Visual Processing Unit (VPU), is an
electronic circuit that rapidly manipulates memory to accelerate the creation and
processing of images for output to a display device.
o The term GPU was popularized by NVIDIA in 1999 with the release of the GeForce 256,
marketed as “the world’s first GPU”.
GPU vs. CPU
CPU
o Suited to task parallelism.
o Has fewer cores but a higher clock speed.
o Uses external system RAM, which is slower but larger.
o Has a large cache.
GPU
o Suited to data parallelism.
o Has more cores but a lower clock speed.
o Uses on-board VRAM, which is faster but smaller.
o Has a small cache.
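The difference between task parallelism and data parallelism can be sketched in a few lines of Python. This is only an illustration of how the work is structured, not of real CPU or GPU speed; the function names `scale`, `data_parallel` and `task_parallel` are hypothetical, and a thread pool stands in for the hardware.

```python
from concurrent.futures import ThreadPoolExecutor


def scale(x):
    """The single operation applied to every element in the data-parallel case."""
    return 2 * x


def data_parallel(values):
    # Data parallelism: the SAME operation runs on many different elements.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(scale, values))


def task_parallel(values):
    # Task parallelism: DIFFERENT operations run concurrently on the same input.
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [pool.submit(f, values) for f in (sum, min, max)]
        return [f.result() for f in futures]


if __name__ == "__main__":
    print(data_parallel([1, 2, 3]))
    print(task_parallel([3, 1, 2]))
```

A GPU favours the first shape: one small function applied across a very large number of elements.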
Graphics Pipeline
o Vertex Shader: Determines the location of vertices in 3D space.
o Generating Primitives: Builds polygons from the vertices in 3D space.
o Rasterization: Fills the triangular geometry with dots, or pixels.
o Pixel Shader: Assigns each pixel attributes such as lighting and color.
o Testing and Mixing: The 3D objects are tested for shadow effects, and anti-aliasing (AA) is applied.
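The rasterization stage can be illustrated with a small software sketch. It assumes a 2D triangle already projected to screen space and tests each pixel center against the three edges; the helper names `edge` and `rasterize` are hypothetical, and real GPUs do this in fixed-function hardware rather than in a loop like this.

```python
def edge(ax, ay, bx, by, px, py):
    """Signed area test: which side of the edge A->B the point P lies on."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def rasterize(v0, v1, v2, width, height):
    """Return the (x, y) pixels whose centers fall inside triangle v0-v1-v2."""
    covered = []
    for y in range(height):
        for x in range(width):
            px, py = x + 0.5, y + 0.5  # sample at the pixel center
            w0 = edge(*v1, *v2, px, py)
            w1 = edge(*v2, *v0, px, py)
            w2 = edge(*v0, *v1, px, py)
            # Inside when all three edge tests agree in sign (either winding order).
            all_non_negative = w0 >= 0 and w1 >= 0 and w2 >= 0
            all_non_positive = w0 <= 0 and w1 <= 0 and w2 <= 0
            if all_non_negative or all_non_positive:
                covered.append((x, y))
    return covered


if __name__ == "__main__":
    pixels = rasterize((0, 0), (4, 0), (0, 4), 4, 4)
    for y in range(4):
        print("".join("#" if (x, y) in pixels else "." for x in range(4)))
```

Each covered pixel would then be handed to the pixel shader to receive its color.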
NVIDIA
o NVIDIA Corporation is an American global technology company founded in 1993.
o NVIDIA manufactures graphics processing units (GPUs) and works in the art and science of computer graphics.
o With its invention of the GPU, the engine of modern visual computing, the field has expanded to encompass video games, movie production, product design, medical diagnosis and scientific research.
NVIDIA Technologies
o 3D Vision
o Battery Boost
o DSR
o G-SYNC
o SLI
o NVLink
o Optimus
o Multi-GPU
Architecture Overview
o CUDA cores
o SMs
o Scheduler and dispatcher
o Registers and L1 cache
o Fermi is the codename for a GPU microarchitecture developed by NVIDIA, first released to
retail in April 2010.
o It is the successor to the Tesla microarchitecture.
o It is the primary microarchitecture used in the GeForce 400 series and GeForce 500 series.
o It was followed by Kepler.
Fermi Graphics Processing Units (GPUs) feature 3.0 billion transistors.
o Streaming Multiprocessor (SM): composed of 32 CUDA cores
o GigaThread global scheduler: distributes thread blocks to SM thread schedulers and manages the context switches between threads during execution
o Host interface: connects the GPU to the CPU via a PCI Express v2 bus
o DRAM: supports up to 6 GB of GDDR5 memory
Fermi Architecture
Load/Store Units (LD/ST):
o Each SM has 16 load/store units, allowing source and destination addresses to be calculated for sixteen threads per clock.
o Supporting units load and store the data at each address to cache or DRAM.
Special Function Units (SFU):
o SFUs execute transcendental instructions such as sine, cosine, reciprocal, and square root.
o Each SFU executes one instruction per thread, per clock.
Parallel Tessellation Engines
o Traditional GPU designs use a single
geometry engine to perform tessellation.
o This approach is analogous to early GPU
designs which used a single pixel pipeline
to perform pixel shading.
o In the GTX 480, by contrast, the tessellation architecture is parallel.
o The result is a breakthrough in tessellation performance of up to two billion triangles per second.
Third Generation Streaming Multiprocessor
The third generation SM introduces several architectural innovations that improve
o performance and
o accuracy
Each of Fermi’s SMs contains 32 CUDA processors. By employing a flexible scalar architecture, CUDA cores achieve full performance on a variety of workloads such as textures, shadow maps, and complex shaders.
Each CUDA processor has a fully pipelined integer arithmetic logic unit (ALU) and floating point unit (FPU). The FPU implements the IEEE 754-2008 floating-point standard, and Fermi applies this high standard of precision to all workloads, e.g. games, video transcoding, or desktop applications.
o The result is consistently high performance in current as well as future games.
o Fermi’s third generation SM also improves execution efficiency through improved scheduling.
Dual Warp Scheduler
o The SM schedules threads in groups of 32
parallel threads called warps.
o Each SM has two warp schedulers and two
instruction dispatch units, allowing two warps to be
issued and executed concurrently.
o As warps execute independently, Fermi’s
scheduler does not need to check for
dependencies from within the instruction stream.
o Using this elegant model of dual-issue, Fermi
achieves near peak hardware performance.
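As a rough illustration of the grouping described above, the sketch below splits a block of threads into warps of 32 and pairs the warps so that two are issued per cycle. It is a deliberately simplified model: the function names are hypothetical, every warp is assumed to be ready, and stalls and instruction latency are ignored.

```python
WARP_SIZE = 32        # threads per warp, as stated on the slide
SCHEDULERS_PER_SM = 2  # two warp schedulers per SM, as stated on the slide


def split_into_warps(num_threads):
    """Group thread indices 0..num_threads-1 into warps of up to 32 threads."""
    return [
        list(range(start, min(start + WARP_SIZE, num_threads)))
        for start in range(0, num_threads, WARP_SIZE)
    ]


def dual_issue(warps):
    """Pair warps so that each issue cycle carries up to two of them."""
    return [
        warps[i:i + SCHEDULERS_PER_SM]
        for i in range(0, len(warps), SCHEDULERS_PER_SM)
    ]


if __name__ == "__main__":
    warps = split_into_warps(100)
    for cycle, issued in enumerate(dual_issue(warps)):
        print(cycle, [len(w) for w in issued])
```

Note that a block of 100 threads occupies four warps, the last of which is only partly filled.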
Second Generation Parallel Thread Execution ISA
o Fermi is the first architecture to support the new
Parallel Thread eXecution (PTX) 2.0 instruction set.
o PTX is a low level virtual machine and ISA designed
to support the operations of a parallel thread processor.
o At program install time, PTX instructions are
translated to machine instructions by the GPU driver.
o The primary goals of PTX are:
   o Provide a stable ISA that spans multiple GPU generations
   o Achieve full GPU performance in compiled applications
   o Provide a scalable programming model that spans GPU sizes from a few cores to many parallel cores
   o Provide a machine-independent ISA for C, C++, Fortran, and other compiler targets
   o Facilitate hand-coding of libraries and performance kernels
Fermi Memory Hierarchy
SM’s Register Files
o Large and unified register file (32,768 registers)
o 128 KB register file per SM
L1 Cache / Shared Memory
o Configurable 64 KB memory
o Shared memory is shared among threads; L1 cache is private to each SM
o Very low latency (20-30 cycles)
o High bandwidth (1,000+ GB/s)
L2 Cache
o 768 KB unified cache
o Shared among SMs
o ECC protected
o Fast atomic memory operations
Global Memory
o Accessed by GPU and CPU
o Six 64-bit DRAM channels
o Up to 6 GB GDDR5 memory
o Higher latency (400-800 cycles)
o Throughput: up to 177 GB/s
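The register file figures above are consistent with each other if each register is 32 bits wide, which is an assumption here rather than something the slide states. The sketch below checks that arithmetic and shows how the register budget per thread shrinks as more threads are resident on an SM. The two 64 KB splits listed are the configurations described in NVIDIA's Fermi whitepaper, not on the slide; all names are hypothetical.

```python
REGISTERS_PER_SM = 32768
BYTES_PER_REGISTER = 4  # assumption: 32-bit registers

# 32768 registers * 4 bytes = 131072 bytes = 128 KB
REGISTER_FILE_KB = REGISTERS_PER_SM * BYTES_PER_REGISTER // 1024

# (shared memory KB, L1 cache KB) splits of the configurable 64 KB block,
# taken from NVIDIA's Fermi whitepaper rather than from the slide.
SHARED_L1_SPLITS_KB = [(48, 16), (16, 48)]


def registers_per_thread(resident_threads):
    """Registers available to each thread if the file is divided evenly."""
    return REGISTERS_PER_SM // resident_threads


if __name__ == "__main__":
    print(REGISTER_FILE_KB, "KB register file per SM")
    for threads in (256, 512, 1024):
        print(threads, "threads ->", registers_per_thread(threads), "registers each")
```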
Memory Architecture:
Different CPU threads can work on different instructions (addition, multiplication) concurrently, but all 32 threads in a warp can execute only the same instruction concurrently. GPUs follow the concept of ‘Single Instruction, Multiple Data’ (SIMD).
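One consequence of every thread in a warp executing the same instruction is that a branch is handled by running each side in turn with some lanes switched off. The sketch below is a hypothetical software model of that behaviour (NVIDIA's own name for the execution model is SIMT); it is not how the hardware is implemented, and `run_warp` is an illustrative name.

```python
def run_warp(values):
    """Model `y = x * 2 if x is even else x + 1` across one warp.

    Returns the per-lane results and the number of passes the warp needed.
    """
    assert len(values) <= 32, "a warp holds at most 32 threads"
    even_mask = [v % 2 == 0 for v in values]
    out = list(values)
    passes = 0

    # Pass 1: the whole warp steps through the "even" branch; odd lanes are masked off.
    if any(even_mask):
        out = [v * 2 if active else o for v, active, o in zip(values, even_mask, out)]
        passes += 1

    # Pass 2: the whole warp steps through the "odd" branch; even lanes are masked off.
    if not all(even_mask):
        out = [v + 1 if not active else o for v, active, o in zip(values, even_mask, out)]
        passes += 1

    return out, passes


if __name__ == "__main__":
    print(run_warp([2, 4, 6]))     # all lanes agree: one pass
    print(run_warp([1, 2, 3, 4]))  # lanes diverge: two passes
```

When all lanes take the same branch the warp needs a single pass; when they diverge, both branches are executed one after the other.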
Here, the green area in the slide's diagram is the GPU. Notice that a GPU has its own memory on board. This “GPU Memory” can range from 768 megabytes to 6 gigabytes of GDDR5 memory.
The memory bandwidth of a GPU is much higher than the memory bandwidth of system memory.
L1 caches on GPUs are not coherent, meaning that two different L1 caches cannot work together on the same memory location.
GPU Bandwidth
o High bandwidth to main memory is required to support multiple cores.
o GPU memory systems are designed for data throughput, with wide memory buses.
o Bandwidth is much larger than that of typical CPUs, typically 6 to 8 times.
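The link between a wide memory bus and high throughput is simple arithmetic: peak bandwidth is the bus width in bytes multiplied by the number of transfers per second. The sketch below applies this to figures quoted elsewhere in this deck (six 64-bit DRAM channels, up to 177 GB/s) to back out the transfer rate they imply. The function names are hypothetical, and the result is a derived estimate, not a specification.

```python
def bus_width_bits(channels, bits_per_channel):
    """Total memory bus width when all channels operate side by side."""
    return channels * bits_per_channel


def bandwidth_gb_s(bus_bits, transfers_per_second):
    """Peak bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return bus_bits / 8 * transfers_per_second / 1e9


def implied_transfer_rate(bus_bits, gb_per_s):
    """Transfers per second needed to reach a given bandwidth on this bus."""
    return gb_per_s * 1e9 / (bus_bits / 8)


if __name__ == "__main__":
    width = bus_width_bits(6, 64)
    rate = implied_transfer_rate(width, 177)
    print(width, "bit bus")
    print(rate / 1e9, "billion transfers per second implied by 177 GB/s")
```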
New Render Output Units with Improved Anti-aliasing
Fermi’s Render Output
(ROP) subsystem has been
redesigned for improved
throughput and efficiency.
One Fermi ROP partition
contains eight ROP units, a
twofold improvement over
prior architectures.
8x antialiasing, an
expensive operation on
prior generation GPUs, is
now much faster thanks to
improved memory
compression and a larger
framebuffer.
Along with performance
improvements, image
quality is also improved.
Fermi supports 32x
coverage sampling
antialiasing (CSAA), the
highest sample antialiasing
mode on any GPU.
First GPU with ECC Memory Support
o Fermi is the first GPU to support Error Correcting Code (ECC) based protection of data in
memory.
o ECC was requested by GPU computing users to enhance data integrity in high
performance computing environments.
o ECC is a highly desired feature in areas such as:
   o Medical imaging
   o Large-scale cluster computing
o Naturally occurring radiation can cause a bit stored in memory to be altered, resulting in a soft error. ECC technology detects and corrects single-bit soft errors before they affect the system.
o Because the probability of such radiation-induced errors increases linearly with the number of installed systems, ECC is an essential requirement in large cluster installations.
All NVIDIA GPUs include support for the PCI Express standard for CRC check with retry at the data link layer. Fermi also supports the similar GDDR5 standard for CRC check with retry (aka “EDC”) during transmission of data across the memory bus.
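The idea of "CRC check with retry" can be sketched in software as follows. This is only a model of the principle: `zlib.crc32` is used as a stand-in checksum and is not the polynomial used by PCI Express or GDDR5, and `send_with_retry` and the `channel` callable are hypothetical names.

```python
import zlib


def send_with_retry(payload, channel, max_retries=3):
    """Send `payload` through `channel`; resend when the received CRC differs.

    `channel` is any callable that takes bytes and returns the bytes that
    arrived at the far end (possibly corrupted).
    Returns the received bytes and the number of attempts used.
    """
    expected = zlib.crc32(payload)
    for attempt in range(1, max_retries + 1):
        received = channel(payload)
        if zlib.crc32(received) == expected:
            return received, attempt
    raise IOError("CRC mismatch persisted after %d attempts" % max_retries)


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_channel(data):
        # Corrupt one bit on the first transmission only.
        state["calls"] += 1
        if state["calls"] == 1:
            return bytes([data[0] ^ 0x01]) + data[1:]
        return data

    print(send_with_retry(b"fermi", flaky_channel))
```

Unlike ECC, this scheme only detects the error; recovery comes from sending the data again.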
Fermi supports Single-Error Correct Double-Error Detect (SECDED) ECC codes, which correct any single-bit error. SECDED ECC also ensures that all double-bit errors and many multi-bit errors are detected and reported, so that the program can be re-run rather than being allowed to continue executing with bad data.
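To make the SECDED behaviour concrete, here is a toy extended Hamming code that protects 4 data bits with 4 check bits. It is a minimal sketch of the general technique, assuming nothing about the code width or layout actually used in Fermi's memories; `encode` and `decode` are hypothetical names.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit SECDED codeword (list of 0/1).

    Index 0 holds the overall parity bit; indices 1..7 form a Hamming(7,4)
    code with check bits at positions 1, 2 and 4.
    """
    data = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = data
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'double-error'."""
    code = list(code)
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position
    parity_ok = sum(code) % 2 == 0

    if syndrome == 0 and parity_ok:
        status = "ok"
    elif not parity_ok:
        # Odd number of flips: treat as a single error at `syndrome`
        # (syndrome 0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome but even parity: two bits flipped, report only.
        return "double-error", None

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)
    word[5] ^= 1                 # single-bit soft error
    print(decode(word))
    word[2] ^= 1                 # a second flipped bit
    print(decode(word))
```

A single flipped bit is repaired silently, while two flipped bits are reported so the program can be re-run.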
Fermi’s register files, shared
memories, L1 caches, L2 cache,
and DRAM memory are ECC
protected, making it not only the
most powerful GPU for HPC
applications, but also the most
reliable. In addition, Fermi
supports industry standards for
checking of data during
transmission from chip to chip.
Applications
Used for parallel computing in calculation-intensive tasks:
o Digital Image Processing
o Statistical Physics
o Physics Simulation
o Analog Signal Processing
o Fast Fourier Transform
o Fuzzy Logic
Thank You
References:
1. https://www.geforce.com/hardware/desktop-gpus/geforce-gtx-480/architecture
2. https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

GPU Architecture NVIDIA (GTX GeForce 480)

  • 1. NVIDIA GPU Architecture. By: Aneeza Imtiaz (047), Fatima Qayyum (011), Mahnoor Shaukat (020), Syeda Ammara Batool (040), Hafsa Zulifiqar (053)
  • 2. GPU (Graphics Processing Unit)
      o A Graphics Processing Unit (GPU), also known as a Video Processing Unit (VPU), is an electronic circuit that rapidly manipulates memory to accelerate the creation and processing of images for output to a display device.
      o NVIDIA popularized the term GPU in 1999, marketing the GeForce 256 as “the world’s first GPU”.
  • 3. GPU vs. CPU
      CPU
      o Performs task parallelism.
      o Has fewer cores but a higher clock speed.
      o Uses external system RAM, which is slower but larger.
      o Has a large cache.
      GPU
      o Performs data parallelism.
      o Has more cores but a lower clock speed.
      o Uses VRAM, which is faster but smaller.
      o Has a small cache.
  • 4. Graphics Pipeline
      o Vertex Shader: provides the location of each vertex in 3D space.
      o Generating Primitives: builds polygons from the vertices in 3D space.
      o Rasterization: fills the triangular geometry with dots, or pixels.
      o Pixel Shader: assigns each pixel attributes such as lighting and color.
      o Testing and Mixing: the 3D objects are tested for shadow effects, and anti-aliasing (AA) is applied.
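The rasterization stage in slide 4 can be illustrated with a small software sketch. This is not how the GPU hardware implements it; it is a minimal edge-function rasterizer, assuming pixel centers sit at half-integer coordinates. The names `edge` and `rasterize` are hypothetical helpers introduced for this example.

```python
def edge(ax, ay, bx, by, px, py):
    """Signed-area test: the sign tells which side of edge A->B point P lies on."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def rasterize(v0, v1, v2, width, height):
    """Return the set of (x, y) pixels whose centers fall inside the triangle."""
    covered = set()
    for y in range(height):
        for x in range(width):
            px, py = x + 0.5, y + 0.5  # assumed pixel-center convention
            w0 = edge(*v1, *v2, px, py)
            w1 = edge(*v2, *v0, px, py)
            w2 = edge(*v0, *v1, px, py)
            # Inside when all three tests agree in sign (works for either winding).
            all_non_negative = w0 >= 0 and w1 >= 0 and w2 >= 0
            all_non_positive = w0 <= 0 and w1 <= 0 and w2 <= 0
            if all_non_negative or all_non_positive:
                covered.add((x, y))
    return covered


if __name__ == "__main__":
    pixels = rasterize((0, 0), (4, 0), (0, 4), width=4, height=4)
    for y in range(4):
        print("".join("#" if (x, y) in pixels else "." for x in range(4)))
```

Each pixel is tested independently of every other pixel, which is the kind of data-parallel work slide 3 attributes to GPUs.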
  • 5. NVIDIA
      o NVIDIA Corporation is an American global technology company founded in 1993.
      o NVIDIA manufactures graphics processing units (GPUs) and advances the art and science of computer graphics.
      o With NVIDIA’s invention of the GPU, the engine of modern visual computing, the field has expanded to encompass video games, movie production, product design, medical diagnosis and scientific research.
  • 7. Architecture Overview: CUDA Cores, SMs, Scheduler & Dispatcher, Registers & L1 Cache
  • 8. Fermi Architecture
      o Fermi is the codename for a GPU microarchitecture developed by NVIDIA, first released to retail in April 2010.
      o It is the successor to the Tesla microarchitecture.
      o It is the primary microarchitecture used in the GeForce 400 series and GeForce 500 series.
      o It was followed by Kepler.
      Fermi graphics processing units (GPUs) feature 3.0 billion transistors.
      o Streaming Multiprocessor (SM): composed of 32 CUDA cores.
      o GigaThread global scheduler: distributes thread blocks to the SM thread schedulers and manages context switches between threads during execution.
      o Host interface: connects the GPU to the CPU via a PCI Express v2 bus.
      o DRAM: supports up to 6 GB of GDDR5 DRAM memory.
  • 9. Load/Store Units (LD/ST):
      o Each SM has 16 load/store units, allowing source and destination addresses to be calculated for sixteen threads per clock.
      o Supporting units load and store the data at each address to cache or DRAM.
      Special Function Units (SFU):
      o SFUs execute transcendental instructions such as sine, cosine, reciprocal, and square root.
      o Each SFU executes one instruction per thread, per clock.
  • 10. Parallel Tessellation Engines
      o Traditional GPU designs use a single geometry engine to perform tessellation.
      o This approach is analogous to early GPU designs, which used a single pixel pipeline to perform pixel shading.
      o In the GTX 480, by contrast, the tessellation architecture is parallel.
      o The result is a breakthrough in tessellation performance, at up to two billion triangles per second.
  • 11. Third Generation Streaming Multiprocessor
      The SM introduces several architectural innovations that improve performance and accuracy.
      o Each of Fermi’s SMs contains 32 CUDA processors. By employing a flexible scalar architecture, CUDA cores achieve full performance on a variety of workloads such as textures, shadow maps, and complex shaders.
      o Each CUDA processor has a fully pipelined integer arithmetic logic unit (ALU) and floating point unit (FPU).
      o Fermi applies the same high standard of precision to all workloads, e.g. games, video transcoding, or desktop applications. The result is consistently high performance in current as well as future games.
      o Fermi’s third generation SM also improves execution efficiency through improved scheduling.
  • 12. Dual Warp Scheduler
      o The SM schedules threads in groups of 32 parallel threads called warps.
      o Each SM has two warp schedulers and two instruction dispatch units, allowing two warps to be issued and executed concurrently.
      o Because warps execute independently, Fermi’s scheduler does not need to check for dependencies within the instruction stream.
      o Using this elegant dual-issue model, Fermi achieves near peak hardware performance.
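The warp grouping in slide 12 comes down to simple index arithmetic, sketched below. The warp size of 32 and the 16 load/store units per SM are taken from slides 12 and 9. The function names are hypothetical, and the sketch assumes threads in a block are numbered linearly from 0.

```python
WARP_SIZE = 32           # threads per warp (slide 12)
LDST_UNITS_PER_SM = 16   # load/store units per SM (slide 9)


def warps_per_block(threads_per_block):
    """Warps needed for a block; a partly filled warp still occupies a whole warp."""
    return -(-threads_per_block // WARP_SIZE)  # ceiling division


def warp_and_lane(thread_index):
    """Map a linear thread index to (warp number, lane within that warp)."""
    return divmod(thread_index, WARP_SIZE)


def clocks_for_warp(units):
    """Clocks needed to push one full warp through `units` parallel units."""
    return -(-WARP_SIZE // units)


if __name__ == "__main__":
    print(warps_per_block(256))                # 8
    print(warp_and_lane(70))                   # (2, 6)
    print(clocks_for_warp(LDST_UNITS_PER_SM))  # 2
```

With 16 load/store units serving sixteen threads per clock, address calculation for one 32-thread warp spans two clocks under this simple model.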
  • 13. Second Generation Parallel Thread Execution ISA
      o Fermi is the first architecture to support the new Parallel Thread eXecution (PTX) 2.0 instruction set.
      o PTX is a low-level virtual machine and ISA designed to support the operations of a parallel thread processor.
      o At program install time, PTX instructions are translated to machine instructions by the GPU driver.
      o The primary goals of PTX are:
          ▪ Provide a stable ISA that spans multiple GPU generations.
          ▪ Achieve full GPU performance in compiled applications.
          ▪ Provide a scalable programming model that spans GPU sizes from a few cores to many parallel cores.
          ▪ Provide a machine-independent ISA for C, C++, Fortran, and other compiler targets.
          ▪ Facilitate hand-coding of libraries and performance kernels.
  • 14. Fermi Memory Hierarchy
      o SM Register Files
          ▪ Large, unified register file (32,768 registers)
          ▪ 128 KB register file per SM
      o L1 Cache / Shared Memory
          ▪ Configurable 64 KB memory
          ▪ Shared memory is shared among multiple threads; L1 is private
          ▪ Very low latency (20-30 cycles)
          ▪ High bandwidth (1,000+ GB/s)
      o L2 Cache
          ▪ 768 KB unified cache
          ▪ Shared among SMs
          ▪ ECC protected
          ▪ Fast atomic memory operations
      o Global Memory
          ▪ Accessed by GPU and CPU
          ▪ Six 64-bit DRAM channels
          ▪ Up to 6 GB of GDDR5 memory
          ▪ Higher latency (400-800 cycles)
          ▪ Throughput: up to 177 GB/s
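Several figures in slide 14 can be cross-checked with simple arithmetic. The sketch below rests on two assumptions that the slide does not state: each register is 32 bits (4 bytes) wide, and the GDDR5 memory runs at an effective 3.696 giga-transfers per second per pin (the commonly quoted GeForce GTX 480 figure). The function names are hypothetical.

```python
def register_file_kib(registers=32768, bytes_per_register=4):
    """Register file size in KiB; the 4-byte register width is an assumption."""
    return registers * bytes_per_register // 1024


def bus_width_bits(channels=6, channel_width_bits=64):
    """Total memory bus width from the per-channel width given on the slide."""
    return channels * channel_width_bits


def peak_bandwidth_gb_s(data_rate_gtps, width_bits):
    """Peak bandwidth in GB/s = bytes moved per transfer * giga-transfers per second."""
    return width_bits / 8 * data_rate_gtps


if __name__ == "__main__":
    print(register_file_kib(), "KiB register file per SM")
    width = bus_width_bits()
    print(width, "bit memory bus")
    # 3.696 GT/s is an assumed effective data rate, not a figure from the slides.
    print(round(peak_bandwidth_gb_s(3.696, width), 1), "GB/s peak")
```

Under those assumptions the numbers line up with the slide: 32,768 registers give 128 KB, six 64-bit channels form a 384-bit bus, and the peak bandwidth comes out at roughly 177 GB/s.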
  • 15. Memory Architecture:
      o Different CPU threads can work on different instructions (addition, multiplication) concurrently, but all 32 threads in a warp can only execute the same instruction concurrently. GPUs follow the concept of “Single Instruction, Multiple Data” (SIMD).
      o In the figure, the green area is the GPU. Notice that a GPU has its own memory on board. This “GPU memory” can range from 768 megabytes to 6 gigabytes of GDDR5 memory.
      o The memory bandwidth of a GPU is much higher than the bandwidth of system memory.
      o L1 caches on GPUs are not coherent, meaning that two different L1 caches cannot work consistently on the same memory location.
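The lockstep behaviour described in slide 15 can be mimicked in software. The sketch below is a toy model, not NVIDIA's scheduler: it assumes that when threads of one warp take different sides of a branch, the warp runs each side in a separate pass with the non-participating threads masked off. That serialization is an assumption added for illustration, and `run_warp` is a hypothetical name.

```python
WARP_SIZE = 32  # threads per warp (slide 12)


def run_warp(data):
    """Apply 'double if even, else add one' to 32 values, one instruction at a time.

    Returns the results and the number of passes the warp needed.
    """
    assert len(data) == WARP_SIZE
    takes_if = [x % 2 == 0 for x in data]  # per-thread branch predicate
    out = list(data)
    passes = 0

    # Pass for the 'if' side: every thread sees the same instruction,
    # but only threads whose predicate is true keep the result.
    if any(takes_if):
        passes += 1
        out = [x * 2 if active else x for x, active in zip(out, takes_if)]

    # Pass for the 'else' side, with the mask inverted.
    if not all(takes_if):
        passes += 1
        out = [x if active else x + 1 for x, active in zip(out, takes_if)]

    return out, passes


if __name__ == "__main__":
    print(run_warp([2] * WARP_SIZE)[1])           # uniform branch
    print(run_warp(list(range(WARP_SIZE)))[1])    # mixed branch
```

A warp whose threads all agree on the branch finishes in one pass, while a mixed warp needs two, which is why uniform control flow within a warp matters in this model.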
  • 16. GPU Bandwidth
      o High bandwidth to main memory is required to support multiple cores.
      o GPU memory systems are designed for data throughput, with wide memory buses.
      o Their bandwidth is much larger than that of typical CPUs, typically 6 to 8 times.
  • 17. New Render Output Units with Improved Anti-aliasing
      o Fermi’s Render Output (ROP) subsystem has been redesigned for improved throughput and efficiency.
      o One Fermi ROP partition contains eight ROP units, a twofold improvement over prior architectures.
      o 8x antialiasing, an expensive operation on prior generation GPUs, is now much faster thanks to improved memory compression and a larger framebuffer.
      o Along with performance improvements, image quality is also improved. Fermi supports 32x coverage sampling antialiasing (CSAA), the highest sample antialiasing mode on any GPU.
  • 18. First GPU with ECC Memory Support
      o Fermi is the first GPU to support Error Correcting Code (ECC) based protection of data in memory.
      o ECC was requested by GPU computing users to enhance data integrity in high performance computing environments.
      o ECC is a highly desired feature in areas such as:
          ▪ Medical imaging
          ▪ Large-scale cluster computing
  • 19. Cont...
      o Naturally occurring radiation can cause a bit stored in memory to be altered, resulting in a soft error. ECC technology detects and corrects single-bit soft errors before they affect the system.
      o Because the probability of such radiation-induced errors increases linearly with the number of installed systems, ECC is an essential requirement in large cluster installations.
  • 20. o All NVIDIA GPUs include support for the PCI Express standard for CRC check with retry at the data link layer.
      o Fermi also supports the similar GDDR5 standard for CRC check with retry (aka “EDC”) during transmission of data across the memory bus.
      o Fermi supports Single-Error Correct Double-Error Detect (SECDED) ECC codes. Beyond correcting single-bit errors, SECDED ECC ensures that all double-bit errors and many multi-bit errors are detected and reported, so that the program can be re-run rather than being allowed to continue executing with bad data.
      o Fermi’s register files, shared memories, L1 caches, L2 cache, and DRAM memory are ECC protected, making it not only the most powerful GPU for HPC applications, but also the most reliable.
      o In addition, Fermi supports industry standards for checking of data during transmission from chip to chip.
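The SECDED idea from slide 20 can be shown on a toy scale. The sketch below uses an extended Hamming (8,4) code, which protects 4 data bits with 4 check bits. Real memory ECC protects much wider words, and the exact code Fermi uses is not given in the slides, so the word size and the names `encode` and `decode` are assumptions made purely for illustration.

```python
def encode(nibble):
    """Encode 4 data bits (0..15) as an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indices 1..7 are Hamming(7,4) positions.
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # overall parity over the other 7 bits
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'double_error'."""
    code = list(code)
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position            # XOR of the positions of all set bits
    overall = sum(code) % 2                 # 0 for any valid codeword

    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd overall parity: exactly one bit flipped. The syndrome names it;
        # a syndrome of 0 means the overall parity bit itself was hit.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return "double_error", None

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)
    word[5] ^= 1                            # inject a single-bit soft error
    print(decode(word))
    word[2] ^= 1                            # inject a second error
    print(decode(word))
```

A single flipped bit is repaired silently, while two flipped bits are reported rather than miscorrected, matching the behaviour described on the slide, where the program is re-run instead of continuing with bad data.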
  • 21. Applications
      Used for parallel computing in highly calculation-intensive tasks:
      o Digital Image Processing
      o Statistical Physics
      o Physics Simulation
      o Analog Signal Processing
      o Fast Fourier Transform
      o Fuzzy Logic
  • 22. Thank You
      References:
      1. https://www.geforce.com/hardware/desktop-gpus/geforce-gtx-480/architecture
      2. https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

Editor's Notes

  1. Image source:- https://pixabay.com/en/computer-keyboard-apple-laptop-2563737/
  2. Image source:- https://pixabay.com/en/architecture-building-business-2179108/
  3. Image source:- https://pixabay.com/en/technology-industry-big-data-cpu-3092486/
  4. Image source:- https://pixabay.com/en/coding-programming-working-macbook-924920/