Nvidia (History, GPU Architecture and New Pascal Architecture)

Graphics Processing Unit
A Graphics Processing Unit (GPU), also known as a Video Processing Unit (VPU), is an electronic circuit that rapidly manipulates memory to accelerate the creation and processing of images to be shown on a display device.
The term GPU was coined by NVIDIA in 1999 with the release of the GeForce 256, marketed as “The world’s first GPU”.
Modern GPUs use parallel processing, which makes them more efficient than CPUs at processing large blocks of data.
GPUs rely on rendering, a process in which each pixel of the display (e.g. 1920×1080 = 2,073,600 pixels) is given texture, lighting, and on-screen location values. Using these parameters, the GPU produces 3D images on a 2D screen.
GPUs have their own dedicated RAM, called Video RAM (VRAM), which stores information about each pixel (its color, location, and lighting) and can also hold frame buffers, i.e. complete images about to be projected on the screen.
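To make the one-value-per-pixel idea concrete, here is a minimal CUDA sketch (the kernel name and the gradient color formula are invented for illustration) in which each thread shades one pixel and writes the result into a frame buffer resident in VRAM:

```cuda
// One CUDA thread per pixel: compute a color, write it to the frame buffer.
__global__ void shade_pixels(uchar4 *framebuffer, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;  // guard partially filled blocks
    unsigned char r = (unsigned char)(255.0f * x / width);   // horizontal ramp
    unsigned char g = (unsigned char)(255.0f * y / height);  // vertical ramp
    framebuffer[y * width + x] = make_uchar4(r, g, 64, 255); // RGBA pixel
}
```

Launched over a 2D grid covering 1920×1080, roughly two million threads execute this kernel in parallel, one per pixel.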
GPU vs. CPU

Graphics Processing Unit
• Performs data parallelism.
• Has more cores, but at lower clock speeds.
• Uses VRAM, which is fast but small in size.
• Has a small cache memory.

Central Processing Unit
• Performs task parallelism.
• Has fewer cores, but at higher clock speeds.
• Uses external RAM, which is slower but larger in size.
• Has a large cache memory.
ARCHITECTURE
Streaming Multiprocessors and CUDA Cores
A Streaming Multiprocessor (SM) is the main computation unit of a GPU. It consists of smaller processing units called Stream Processors or CUDA cores.
CUDA is a term coined by NVIDIA that stands for Compute Unified Device Architecture. A CUDA core in turn consists of two units: a Floating Point Unit, which computes floating-point data, and an Integer Unit, which computes integer data.
There are 16 Load/Store Units per SM.
There are also four Special Function Units, which execute transcendental instructions such as sine, cosine, reciprocal, and square root.
Each SM uses two Warp Schedulers and two Instruction Dispatch Units, allowing two warps to be issued and executed concurrently.
One notable feature of the SM is the Fused Multiply-Add (FMA) instruction. A frequently used sequence of operations in computer graphics and in linear algebra is to multiply two numbers and add the product to a third number.
Prior to FMA, this was achieved by the Multiply-Add (MAD) instruction, which performs the multiplication with truncation, followed by the addition.
FMA maintains full precision of the intermediate product and rounds only once at the end, i.e. it computes round(a × b + c) rather than round(round(a × b) + c).
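A minimal device-code sketch of the difference (kernel and variable names are illustrative; fmaf(), __fmul_rn(), and __fadd_rn() are standard CUDA intrinsics):

```cuda
// Contrast MAD-style (two roundings) with FMA (one rounding).
__global__ void mad_vs_fma(const float *a, const float *b, const float *c,
                           float *mad_out, float *fma_out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // MAD style: the product is rounded before the addition. The explicit
        // intrinsics keep the compiler from silently fusing the two ops.
        mad_out[i] = __fadd_rn(__fmul_rn(a[i], b[i]), c[i]);
        // FMA: the full-precision product feeds the addition; one rounding.
        fma_out[i] = fmaf(a[i], b[i], c[i]);
    }
}
```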
CUDA Processing Flow
1) The main memory (RAM) copies data onto the GPU memory (VRAM), where the data waits for further instructions.
2) The CPU then instructs the GPU to process the data. At this point, the GPU dispatches the data across its SPs (CUDA cores).
3) Processing occurs in such a way that the same instructions are executed over many data elements in parallel, saving processing time.
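In host code, the whole flow might look like this minimal sketch (names are illustrative; the runtime calls are the standard CUDA API):

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *d, float k, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= k;  // same instruction, executed across many threads
}

int main() {
    const int n = 1 << 20;
    float *host = new float[n];
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    float *dev;
    cudaMalloc(&dev, n * sizeof(float));
    // 1) Copy input from main memory (RAM) to GPU memory (VRAM).
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
    // 2) The CPU instructs the GPU: launch the kernel on the CUDA cores.
    scale<<<(n + 255) / 256, 256>>>(dev, 2.0f, n);
    // 3) Copy the result back once the parallel execution has finished.
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dev);
    delete[] host;
    return 0;
}
```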
Graphics Pipelining
Vertex Shader: Provides the locations of vertices in 3D space.
Primitive Generation: Builds polygons from the vertices in 3D space.
Rasterization: Fills the triangular geometry with dots, i.e. pixels.
Pixel Shader: Assigns each pixel attributes such as light and color.
Testing and Mixing: The 3D objects are tested for visibility and shadow effects, and anti-aliasing (AA) and blending are applied to produce the final image.
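To make the rasterization stage concrete, here is a toy CUDA sketch (one thread per pixel and a single triangle passed by value; a real hardware rasterizer is far more sophisticated) that uses signed edge functions to decide which pixels the triangle covers:

```cuda
// Signed area test: which side of edge (a -> b) does point (px, py) lie on?
__device__ float edge(float2 a, float2 b, float px, float py) {
    return (px - a.x) * (b.y - a.y) - (py - a.y) * (b.x - a.x);
}

// One thread per pixel: fill the triangle (v0, v1, v2) with a flat color.
__global__ void rasterize_triangle(unsigned int *fb, int w, int h,
                                   float2 v0, float2 v1, float2 v2,
                                   unsigned int color) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    float px = x + 0.5f, py = y + 0.5f;           // sample the pixel center
    float e0 = edge(v0, v1, px, py);
    float e1 = edge(v1, v2, px, py);
    float e2 = edge(v2, v0, px, py);
    // Inside if the pixel center is on the same side of all three edges.
    if ((e0 >= 0 && e1 >= 0 && e2 >= 0) || (e0 <= 0 && e1 <= 0 && e2 <= 0))
        fb[y * w + x] = color;
}
```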
Components of a Modern GPU
Shader Processing Units (CUDA Cores): A CUDA core is essentially a parallel processor that performs graphics calculations.
Texture Mapping Unit (TMU): A TMU is used to resize, rotate, or distort a bitmap image so it can be placed on a 3D model as a texture. How fast a GPU can map textures onto a 3D model is measured by the Texture Fill Rate.
Raster Operation Pipeline (ROP): The final step of rendering is done by the ROP. The ROP takes pixel and texel information and processes it to give each pixel its final color and depth value. How quickly the ROPs can do this is measured by the Pixel Fill Rate.
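As a rough, first-order illustration of what these fill rates mean (the formulas are the commonly used approximations, and the numbers below are hypothetical, not taken from any specific card):

\[
\text{Texture Fill Rate} \approx N_{\text{TMU}} \times f_{\text{core}},
\qquad
\text{Pixel Fill Rate} \approx N_{\text{ROP}} \times f_{\text{core}}
\]

so a hypothetical GPU with 64 TMUs, 32 ROPs, and a 1.0 GHz core clock would reach roughly 64 GTexels/s and 32 GPixels/s.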
Nvidia Pascal GP100
The Most Advanced Datacenter Accelerator
Pascal Features
• Fabricated using 16nm FinFET technology by TSMC.
• New Chip-on-Wafer-on-Substrate (CoWoS) packaging with HBM2 high-bandwidth memory.
• NVLink for high-bandwidth interconnect between P100 chips.
• Unified Memory, Compute Preemption, and new AI algorithms.
NVLink
NVLink is a communication protocol developed by Nvidia. It specifies a point-to-point interconnect between a CPU and a GPU, and also between one GPU and another.
With NVLink, a GPU can access system memory at a rate of 160 GB/s, up from 120 GB/s in the previous Maxwell architecture.
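On the software side, direct GPU-to-GPU access (over NVLink or PCIe) is exposed through CUDA peer access; a minimal sketch using the standard runtime calls:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int canAccess = 0;
    // Can GPU 0 directly read/write memory that lives on GPU 1?
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    if (canAccess) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);  // second argument (flags) must be 0
        // Kernels on GPU 0 may now dereference pointers allocated on GPU 1.
    } else {
        printf("Peer access between GPU 0 and GPU 1 is unavailable.\n");
    }
    return 0;
}
```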
HBM 2
HBM stands for High Bandwidth Memory, in which one or more memory dies are stacked vertically on top of each other, whereas traditionally discrete memory chips were soldered around the GPU chip.
HBM2 provides stacks of 8 GB of DRAM with 180 GB/s data transfer rates, up from the 2 GB stacks at 125 GB/s of HBM1.
Unified Memory, Compute Preemption and New AI Algorithms
Unified Memory: Unified Memory is an essential part of the CUDA programming model. It greatly simplifies programming and porting of applications to the GPU by providing a single unified virtual address space for accessing all CPU and GPU memory in the system.
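A minimal sketch of Unified Memory in practice (cudaMallocManaged() is the standard API; kernel and variable names are illustrative): one pointer is valid on both the CPU and the GPU, with no explicit cudaMemcpy calls.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void increment(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 1024;
    int *data;
    cudaMallocManaged(&data, n * sizeof(int));  // visible to CPU and GPU
    for (int i = 0; i < n; ++i) data[i] = i;    // written by the CPU
    increment<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();                    // wait before reading on CPU
    printf("data[0] = %d\n", data[0]);          // prints 1
    cudaFree(data);
    return 0;
}
```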
Compute Preemption: This feature allows compute tasks running on the GPU to be interrupted at instruction-level granularity, solving the problem of long-running or ill-behaved applications that would otherwise make the system unresponsive while it waits for the task to complete.
New AI Algorithms: Pascal features new AI-oriented capabilities, most notably native half-precision (FP16) arithmetic that accelerates deep-learning workloads.
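As a sketch of what that half-precision path looks like in CUDA (assumes compute capability 5.3 or later; the kernel name is illustrative), __hfma2() performs a fused multiply-add on two packed FP16 values at once:

```cuda
#include <cuda_fp16.h>

// y = a*x + y on packed half-precision pairs (two FP16 lanes per register).
__global__ void fp16_axpy(const half2 *x, half2 *y, half2 a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = __hfma2(a, x[i], y[i]);
}
```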
Applications
GPUs are mainly used for parallel computing in highly calculation-intensive tasks.
They are used in many physics simulations, statistical physics, fast Fourier transforms, fuzzy logic, analog signal processing, and digital image processing.
The new Pascal architecture supports deep learning, in which the neural learning process of the human brain is modeled so that the system continuously learns, gets smarter, and delivers more accurate results over time.
FANUC, a leading robot manufacturer, demonstrated an assembly-line robot powered by a Pascal GPU.
Questions?
