1. NVIDIA GPU Architecture
By:
Aneeza Imtiaz (047)
Fatima Qayyum (011)
Mahnoor Shaukat (020)
Syeda Ammara Batool (040)
Hafsa Zulifiqar (053)
2. GPU (Graphics Processing Unit)
o A Graphics Processing Unit (GPU), also known as a Video Processing Unit (VPU), is an electronic circuit that rapidly manipulates memory to accelerate the creation and processing of images for output to a display device.
o The term GPU was coined by NVIDIA in 1999 with the release of the GeForce 256, marketed as "the world's first GPU".
3. GPU vs. CPU
CPU:
o Performs task parallelism.
o Has fewer cores but a higher clock speed.
o Uses external RAM, which is slower but larger.
o Has a large cache memory.
GPU:
o Performs data parallelism.
o Has more cores but a lower clock speed.
o Uses VRAM, which is faster but smaller.
o Has a small cache memory.
4. Graphics Pipelining
o Vertex Shader: Computes the position of each vertex in 3D space.
o Primitive Generation: Assembles vertices into polygons in 3D space.
o Rasterization: Fills the triangular geometry with dots, i.e. pixels.
o Pixel Shader: Assigns each pixel attributes such as lighting and color.
o Testing and Mixing: The 3D objects are tested for shadow effects, and anti-aliasing (AA) is applied.
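The stages above can be sketched in software. The following minimal Python illustration (the function names and the tiny 8x8 framebuffer are illustrative, not any NVIDIA API) positions a triangle's vertices, rasterizes it with an edge-function coverage test, and shades the covered pixels:

```python
# Hypothetical software sketch of a graphics pipeline: vertex stage,
# rasterization via edge functions, and a trivial pixel shader.

def vertex_shader(v, offset):
    # Vertex shader: position each vertex (here, a simple 2D translation).
    return (v[0] + offset[0], v[1] + offset[1])

def edge(a, b, p):
    # Signed-area test: which side of edge a->b does point p lie on?
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def rasterize(tri, width, height):
    # Rasterization: emit every pixel whose centre the triangle covers.
    a, b, c = tri
    for y in range(height):
        for x in range(width):
            p = (x + 0.5, y + 0.5)
            if edge(a, b, p) >= 0 and edge(b, c, p) >= 0 and edge(c, a, p) >= 0:
                yield (x, y)

def pixel_shader(x, y):
    # Pixel shader: assign each covered pixel an attribute (a grey level).
    return 255

verts = [(0, 0), (7, 0), (0, 7)]
tri = [vertex_shader(v, (0, 0)) for v in verts]                # vertex stage
pixels = {p: pixel_shader(*p) for p in rasterize(tri, 8, 8)}   # raster + shade
```

A real GPU runs these same conceptual stages, but with thousands of vertices and pixels processed in parallel by dedicated hardware.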
5. NVIDIA
o NVIDIA Corporation is an American global technology company founded in 1993.
o NVIDIA manufactures graphics processing units (GPUs), advancing the art and science of computer graphics.
o With its invention of the GPU, the engine of modern visual computing, the field has expanded to encompass video games, movie production, product design, medical diagnosis, and scientific research.
8. Fermi Architecture
o Fermi is the codename for a GPU microarchitecture developed by NVIDIA, first released to retail in April 2010.
o Successor to the Tesla microarchitecture.
o Primary microarchitecture used in the GeForce 400 series and GeForce 500 series.
o It was followed by Kepler.
Fermi GPUs feature 3.0 billion transistors and include:
o Streaming Multiprocessor (SM): composed of 32 CUDA cores.
o GigaThread global scheduler: distributes thread blocks to the SM thread schedulers and manages context switches between threads during execution.
o Host interface: connects the GPU to the CPU via a PCI-Express v2 bus.
o DRAM: supports up to 6 GB of GDDR5 memory.
9. Load/Store Units (LD/ST):
o Each SM has 16 load/store units, allowing source and destination addresses to be calculated for sixteen threads per clock.
o Supporting units load and store the data at each address to cache or DRAM.
Special Function Units (SFU):
o SFUs execute transcendental instructions such as sine, cosine, reciprocal, and square root.
o Each SFU executes one instruction per thread, per clock.
10. Parallel Tessellation Engines
o Traditional GPU designs use a single geometry engine to perform tessellation.
o This approach is analogous to early GPU designs, which used a single pixel pipeline to perform pixel shading.
o In the GTX 480, however, the tessellation architecture is parallel.
o The result is a breakthrough in tessellation performance of up to two billion triangles per second.
11. Third-Generation Streaming Multiprocessor
The third-generation SM introduces several architectural innovations that improve performance and accuracy:
o Each of Fermi's SMs contains 32 CUDA processors. By employing a flexible scalar architecture, CUDA cores achieve full performance on a variety of workloads such as textures, shadow maps, and complex shaders.
o Each CUDA processor has a fully pipelined integer arithmetic logic unit (ALU) and floating-point unit (FPU). Fermi applies this high standard of precision to all workloads, whether games, video transcoding, or desktop applications.
o The result is consistently high performance in current as well as future games.
o Fermi's third-generation SM also improves execution efficiency through improved scheduling.
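As a back-of-the-envelope illustration of what 32 CUDA cores per SM add up to, the sketch below estimates peak single-precision throughput. The specific figures (480 active cores and a 1401 MHz shader clock, as on the GeForce GTX 480) are assumptions for this example, not stated in the slides:

```python
# Rough peak-throughput estimate for Fermi's CUDA cores, assuming
# approximate GeForce GTX 480 figures (480 cores, 1401 MHz shader clock).
cuda_cores = 480            # 15 active SMs x 32 CUDA cores each
shader_clock_hz = 1401e6    # clock driving each core's ALU/FPU
flops_per_core = 2          # one fused multiply-add = 2 floating-point ops

peak_gflops = cuda_cores * shader_clock_hz * flops_per_core / 1e9
print(round(peak_gflops))   # roughly 1345 single-precision GFLOPS
```

The same arithmetic scales with core count and clock, which is why adding SMs raises peak throughput linearly.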
12. Dual Warp Scheduler
o The SM schedules threads in groups of 32 parallel threads called warps.
o Each SM has two warp schedulers and two instruction dispatch units, allowing two warps to be issued and executed concurrently.
o Because warps execute independently, Fermi's scheduler does not need to check for dependencies within the instruction stream.
o Using this elegant dual-issue model, Fermi achieves near-peak hardware performance.
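The grouping and dual-issue behavior can be modeled in a few lines. This is a toy model under simplifying assumptions (128 threads, every warp always ready, one instruction per warp per cycle), not a description of the real scheduler hardware:

```python
# Toy model of warp formation and dual issue on one Fermi SM.
WARP_SIZE = 32

def make_warps(num_threads):
    # Threads are grouped into consecutive warps of 32 parallel threads.
    return [list(range(i, min(i + WARP_SIZE, num_threads)))
            for i in range(0, num_threads, WARP_SIZE)]

def dual_issue(warps):
    # Two warp schedulers each issue one ready warp per cycle,
    # so two warps proceed concurrently.
    schedule = []
    for cycle in range(0, len(warps), 2):
        schedule.append([w[0] // WARP_SIZE for w in warps[cycle:cycle + 2]])
    return schedule

warps = make_warps(128)       # 128 threads -> 4 warps
schedule = dual_issue(warps)  # [[0, 1], [2, 3]]: two warps per cycle
print(len(warps), schedule)
```

Because the two issued warps are independent, no cross-warp dependency check is needed, which is the point the slide makes.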
13. Second-Generation Parallel Thread Execution ISA
o Fermi is the first architecture to support the new Parallel Thread eXecution (PTX) 2.0 instruction set.
o PTX is a low-level virtual machine and ISA designed to support the operations of a parallel thread processor.
o At program install time, PTX instructions are translated to machine instructions by the GPU driver.
o The primary goals of PTX are to:
  - Provide a stable ISA that spans multiple GPU generations.
  - Achieve full GPU performance in compiled applications.
  - Provide a machine-independent ISA for C, C++, Fortran, and other compiler targets.
  - Provide a scalable programming model that spans GPU sizes from a few cores to many parallel cores.
  - Facilitate hand-coding of libraries and performance kernels.
14. Fermi Memory Hierarchy
SM Register Files:
o Large and unified register file (32,768 registers).
o 128 KB of register file per SM.
L1 Cache / Shared Memory:
o Configurable 64 KB memory, split between multi-thread shared memory and private L1 cache.
o Very low latency (20–30 cycles).
o High bandwidth (1,000+ GB/s).
L2 Cache:
o 768 KB unified cache, shared among SMs.
o ECC protected.
o Fast atomic memory operations.
Global Memory:
o Accessed by GPU and CPU.
o Six 64-bit DRAM channels.
o Up to 6 GB of GDDR5 memory.
o Higher latency (400–800 cycles).
o Throughput: up to 177 GB/s.
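The 177 GB/s figure for global memory follows directly from the bus width. The sketch below derives it, assuming a GeForce GTX 480-class memory clock (924 MHz GDDR5, quad data rate) — that clock value is an assumption for this example:

```python
# Deriving Fermi's ~177 GB/s global-memory throughput from the bus layout,
# assuming a 924 MHz GDDR5 memory clock (quad data rate per pin).
channels = 6
bits_per_channel = 64
bus_width_bits = channels * bits_per_channel   # 6 x 64 = 384-bit bus
data_rate_per_pin = 924e6 * 4                  # GDDR5 moves 4 bits/pin/clock

bandwidth_gb_s = bus_width_bits * data_rate_per_pin / 8 / 1e9
print(round(bandwidth_gb_s, 1))                # ~177.4 GB/s
```

Note how the six independent 64-bit channels multiply together into the wide bus that gives GPUs their throughput advantage over CPUs.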
15. Memory Architecture
o Different CPU threads can work on different instructions (addition, multiplication) concurrently, but all 32 threads in a warp can only execute the same instruction concurrently. GPUs follow the concept of "Single Instruction, Multiple Data" (SIMD).
o A GPU has its own memory on board. This "GPU memory" can range from 768 megabytes to 6 gigabytes of GDDR5 memory.
o The memory bandwidth of a GPU is much higher than the memory bandwidth of system memory.
o L1 caches on GPUs are not coherent, meaning that two different L1 caches cannot work together on the same memory location.
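The SIMD point can be shown in miniature: one instruction applied across all 32 lanes of a warp, each lane holding its own datum. This is purely illustrative Python, not GPU code:

```python
# SIMD in miniature: every thread in a 32-wide "warp" executes the SAME
# instruction, each on its own data element.
WARP_SIZE = 32

def warp_execute(instruction, data):
    # One instruction, WARP_SIZE data elements -- the SIMD model.
    assert len(data) == WARP_SIZE
    return [instruction(x) for x in data]

lanes = list(range(WARP_SIZE))                 # each lane's private datum
result = warp_execute(lambda x: 2 * x, lanes)  # same op on all 32 lanes
print(result[:4])                              # [0, 2, 4, 6]
```

A CPU, by contrast, could run a different instruction on each of its threads in the same cycle; the warp cannot.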
16. GPU Bandwidth
o High bandwidth to main memory is required to keep the many cores fed with data.
o GPU memory systems are designed for data throughput, with wide memory buses.
o GPUs offer much larger bandwidth than typical CPUs, typically 6 to 8 times as much.
17. New Render Output Units with Improved Anti-Aliasing
o Fermi's Render Output (ROP) subsystem has been redesigned for improved throughput and efficiency.
o One Fermi ROP partition contains eight ROP units, a twofold improvement over prior architectures.
o 8x antialiasing, an expensive operation on prior-generation GPUs, is now much faster thanks to improved memory compression and a larger framebuffer.
o Along with performance improvements, image quality is also improved: Fermi supports 32x coverage sampling antialiasing (CSAA), the highest-sample antialiasing mode on any GPU.
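To see why multi-sample antialiasing costs memory and bandwidth, consider what the resolve step does: each final pixel is averaged from several coverage samples. A minimal 4x sketch (illustrative only — real CSAA stores coverage more cleverly than this):

```python
# Minimal sketch of 4x multisample antialiasing as a resolve/average step:
# each output pixel is the mean of its 2x2 block of coverage samples.
def resolve_4x(samples):
    # samples: a 2N x 2N grid of subsamples; returns an N x N image.
    n = len(samples) // 2
    return [[(samples[2*y][2*x] + samples[2*y][2*x+1] +
              samples[2*y+1][2*x] + samples[2*y+1][2*x+1]) / 4
             for x in range(n)] for y in range(n)]

# A hard black/white diagonal edge in the sample grid...
edge = [[255 if x >= y else 0 for x in range(4)] for y in range(4)]
# ...resolves to intermediate grey levels, i.e. a smoothed edge.
result = resolve_4x(edge)
print(result)
```

Storing 4 (or 8, or 32) samples per pixel is exactly the framebuffer pressure that Fermi's larger framebuffer and memory compression address.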
18. First GPU with ECC Memory Support
o Fermi is the first GPU to support Error Correcting Code (ECC) based protection of data in memory.
o ECC was requested by GPU computing users to enhance data integrity in high-performance computing environments.
o ECC is a highly desired feature in areas such as:
  - Medical imaging
  - Large-scale cluster computing
19. Cont...
o Naturally occurring radiation can cause a bit stored in memory to be altered, resulting in a soft error. ECC technology detects and corrects single-bit soft errors before they affect the system.
o Because the probability of such radiation-induced errors increases linearly with the number of installed systems, ECC is an essential requirement in large cluster installations.
20. o Fermi supports Single-Error Correct, Double-Error Detect (SECDED) ECC. SECDED ECC corrects single-bit errors transparently, and ensures that all double-bit errors and many multi-bit errors are also detected and reported, so that the program can be re-run rather than being allowed to continue executing with bad data.
o Fermi's register files, shared memories, L1 caches, L2 cache, and DRAM memory are all ECC protected, making it not only the most powerful GPU for HPC applications, but also the most reliable.
o In addition, Fermi supports industry standards for checking data during transmission from chip to chip: all NVIDIA GPUs support the PCI Express standard for CRC check with retry at the data link layer, and Fermi also supports the similar GDDR5 standard for CRC check with retry (also known as "EDC") during transmission of data across the memory bus.
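SECDED behavior can be demonstrated with a classic Hamming(8,4) code: a Hamming(7,4) codeword plus one overall parity bit. This is a textbook construction used here to illustrate the principle, not Fermi's actual ECC circuitry:

```python
# Textbook SECDED illustration: Hamming(7,4) plus an overall parity bit.
# Corrects any single-bit error; detects (but cannot correct) double-bit errors.
from functools import reduce
from operator import xor

def secded_encode(d):
    """Encode 4 data bits into an 8-bit SECDED codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4              # covers codeword positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4              # covers positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4              # covers positions 4,5,6,7
    c = [p1, p2, d1, p3, d2, d3, d4]
    p0 = reduce(xor, c)            # overall parity enables double-error detection
    return c + [p0]

def secded_decode(w):
    """Return (data_bits, status): 'ok', 'corrected', or 'double'."""
    c = w[:7]
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syn = s1 + 2 * s2 + 4 * s4     # 1-based position of a single flipped bit
    overall = reduce(xor, w)       # 0 if total parity still holds
    if syn == 0 and overall == 0:
        status = 'ok'
    elif overall == 1:             # odd parity: exactly one bit flipped
        if syn:
            c[syn - 1] ^= 1        # flip it back
        status = 'corrected'
    else:                          # even parity but nonzero syndrome
        status = 'double'          # two-bit error: report, do not correct
    return [c[2], c[4], c[5], c[6]], status

codeword = secded_encode([1, 0, 1, 1])
corrupted = list(codeword)
corrupted[2] ^= 1                  # simulate a radiation-induced soft error
data, status = secded_decode(corrupted)
print(data, status)                # original data recovered, status 'corrected'
```

Flipping two bits of the same codeword yields status `'double'`: the error is reported so the computation can be re-run, exactly the policy the slide describes.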
21. Applications
Used for parallel computing in highly calculation-intensive tasks:
o Digital image processing
o Statistical physics
o Physics simulation
o Analog signal processing
o Fast Fourier transform
o Fuzzy logic
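One of the workloads listed, the fast Fourier transform, illustrates why these tasks map well to GPUs: the two half-size subproblems at each level are independent and could run in parallel. A minimal radix-2 Cooley-Tukey sketch in plain Python (sequential here; a GPU library would parallelize the recursion):

```python
# Recursive radix-2 Cooley-Tukey FFT; input length must be a power of two.
# The even/odd sub-FFTs are independent -- the parallelism GPUs exploit.
import cmath

def fft(x):
    n = len(x)
    if n == 1:
        return x[:]
    even = fft(x[0::2])            # independent subproblem 1
    odd = fft(x[1::2])             # independent subproblem 2
    out = [0j] * n
    for k in range(n // 2):        # butterfly combine step
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

spectrum = fft([1, 1, 1, 1])       # constant signal -> all energy in bin 0
print(spectrum[0])
```

In production, such kernels are typically dispatched through GPU libraries (e.g. NVIDIA's cuFFT) rather than hand-written.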